Introduction
Publishing or maintaining Linked data on the Web goes beyond making datasets available through serializations of the Resource Description Framework (RDF), a cornerstone of innovations and applications in the Semantic Web and information systems (Avila-Garzon, 2020). Newly published data must also be linked to existing datasets. However, creating links between datasets requires careful analysis by an expert; although effective, this approach does not scale, given that the amount of published data is constantly increasing. Consequently, a manual publishing process is unviable. Therefore, to efficiently build the Web of data, there must be solutions capable of linking data automatically or semi-automatically.
Automatically linking data is a problem recognized by many communities. In the database community, the problem is known as record linkage (Gu et al., 2003; Karr et al., 2019), which aims to identify and link resources that are judged to represent the same real-world entity. Other terms for this problem include entity resolution (Menestrina et al., 2005; Ebraheem et al., 2017; Wu et al., 2020), deduplication (Sarawagi and Bhamidipaty, 2002; Xu et al., 2017; Yang et al., 2019), and instance matching.
Instance matching is the term that the Linked data community uses to refer to the problem. In this community, the main goal is to find matching instances in different datasets (Abubakar et al., 2018). However, instance matching has additional characteristics (Castano et al., 2011; Mountantonakis & Tzitzikas, 2019; Azmy et al., 2019), such as (i) structural heterogeneity, which refers to variation in the structure of the instances; (ii) implicit knowledge, which refers to the characteristics and constraints exhibited by the domain; and (iii) URI-oriented identification, which refers to reusing URIs to identify new information about existing instances. Thus, there is a need for specific solutions for the correct execution of the instance matching process.
To identify and link resources on the Web, the community has been developing a growing number of solutions. The Ontology Alignment Evaluation Initiative (OAEI) conducts an annual evaluation in which each solution aligns two predefined datasets and the alignment it generates is compared with a reference alignment. However, according to Homoceanu et al. (2014), despite the good results, these solutions are not ready to align data automatically. Most works are applied only to conventional OAEI datasets with small ontologies (Ferranti et al., 2021), and few approaches address real-world ontology matching applications (Otero-Cerdera et al., 2015; Ferranti et al., 2021). Also, no technique stands out from the others in all aspects (Xue & Tang, 2017).
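Comparing a generated alignment with a reference alignment is commonly summarized with precision, recall, and F-measure. The following minimal Python sketch illustrates that comparison, assuming that an alignment can be represented as a set of (source URI, target URI) pairs. The function name `evaluate_alignment` and the `example.org` URIs are hypothetical, chosen only for illustration; they are not taken from OAEI tooling or from the approach proposed in this study.

```python
from typing import NamedTuple, Set, Tuple

# Assumption: a link is a pair (URI in the source dataset, URI in the target dataset).
Link = Tuple[str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f_measure: float


def evaluate_alignment(generated: Set[Link], reference: Set[Link]) -> Scores:
    """Compare a generated alignment with a reference alignment."""
    correct = len(generated & reference)  # links present in both sets
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f_measure)


# Hypothetical data: four reference links, three generated links, two of them correct.
reference = {
    ("http://example.org/a/1", "http://example.org/b/1"),
    ("http://example.org/a/2", "http://example.org/b/2"),
    ("http://example.org/a/3", "http://example.org/b/3"),
    ("http://example.org/a/4", "http://example.org/b/4"),
}
generated = {
    ("http://example.org/a/1", "http://example.org/b/1"),
    ("http://example.org/a/2", "http://example.org/b/2"),
    ("http://example.org/a/3", "http://example.org/b/9"),  # incorrect link
}

scores = evaluate_alignment(generated, reference)
print(f"precision={scores.precision:.2f} "
      f"recall={scores.recall:.2f} "
      f"f_measure={scores.f_measure:.2f}")
# prints: precision=0.67 recall=0.50 f_measure=0.57
```

Representing alignments as sets keeps the comparison to a single intersection, but it treats every link as strictly correct or incorrect; evaluations that weight links by confidence would need a richer representation.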
This study proposes a context-independent approach for the alignment of Linked data through an alignment process that considers aspects of the ontological model’s data and characteristics. Data properties and relationships drive the alignment of resources/instances. For this purpose, a cascade alignment approach is proposed. Moreover, the proposed approach addresses the alignment between real datasets, which enables reliable alignment of datasets distributed on the Web. This work provides the following contributions: i) the development of a context-independent process for the alignment of Linked data; ii) the execution of the alignment directly in the data storage; and iii) a real-world case study dealing with heterogeneity and data quality issues.
Accordingly, this research targets the following problem: