1. Introduction
Instance matching (also known as entity reconciliation, entity resolution, or record linkage) (Winkler, 2006) is the process of detecting coreferent instances, i.e., instances that describe the same real-world object. One prominent application of instance matching is data integration. Since data are created independently in many repositories, gathering information from multiple sources can greatly improve the completeness and diversity of the descriptions of the objects of interest, and detecting coreferent instances is indispensable for achieving high integration quality. In linked data, instance matching also plays an important role in the data publication process: newly published instances should be linked to their existing coreferent instances on the web of linked data. In other words, instance matching, together with other tools, allows published data to become truly linked data rather than merely enriched data, bringing it closer to the vision of the semantic web (Jain, Hitzler, Yeh, Verma, & Sheth, 2010). Instance matching in linked data (Ferrara, Nikolov, & Scharffe, 2011) is also considered a representative form of link discovery, because the matching results can be used to generate owl:sameAs links, which are the conventional way to declare coreference.
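To make the output of matching concrete, the snippet below materializes one matching decision as an owl:sameAs statement in N-Triples form. This is a minimal sketch; the two instance URIs are hypothetical examples, not taken from any real dataset.

```python
# Minimal sketch: materializing one matching decision as an owl:sameAs link.
# The two instance URIs below are hypothetical examples for illustration only.

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triple(subject_uri: str, object_uri: str) -> str:
    """Serialize one coreference link as an N-Triples statement."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."

link = same_as_triple(
    "http://example.org/source/Tokyo",
    "http://example.org/target/Tokyo_Japan",
)
print(link)
```

Publishing such statements is what turns a set of isolated instance descriptions into linked data.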
The major challenges of instance matching are the ambiguity of instances and the inconsistency between different repositories. The first challenge stems from the natural heterogeneity of real-world objects (e.g., Tokyo, Tokyo Station, and Tokyo Imperial Palace are distinct but closely related objects). The second challenge stems from differing schemata, in which the attributes of objects are declared through arbitrary properties (e.g., ‘name’ and ‘label’ may describe the same attribute). In linked data and other kinds of web-based data, these challenges are even harder than in other forms of structured data, because most resources are contributed by a prolific Internet community. On the one hand, linked data resources are highly valuable thanks to the sheer abundance of the data; on the other hand, that abundance increases the chance of encountering many instances that refer to very similar objects. In addition, many linked data sources are constructed by numerous users or from crowdsourced data, so the inconsistencies between schemata become more complex. As a result, it is difficult to construct all the correct property mappings between given schemata. This difficulty can, however, be avoided by a schema-independent system. Therefore, schema-independent instance matching systems, which can work on repositories with any schema, offer the highest generality.
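As a toy illustration of why schema independence helps (an assumed sketch, not the algorithm of any cited system), the snippet below pools every property value of an instance into one set of tokens, so that a ‘name’ versus ‘label’ mismatch between schemata becomes irrelevant, and then scores instance pairs with Jaccard similarity.

```python
# Toy sketch of schema-independent comparison (an assumption for illustration):
# ignore property names entirely and compare only the pooled property values.

def value_tokens(instance: dict) -> set:
    """Collect lowercase tokens from all property values, ignoring property names."""
    tokens = set()
    for value in instance.values():
        tokens.update(value.lower().split())
    return tokens

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Two descriptions of the same station under different schemata.
src = {"name": "Tokyo Station", "locatedIn": "Tokyo Japan"}
tgt = {"label": "Tokyo Station", "country": "Japan"}
# An ambiguous, non-coreferent neighbor.
palace = {"name": "Tokyo Imperial Palace"}

score_match = jaccard(value_tokens(src), value_tokens(tgt))       # 1.0
score_nonmatch = jaccard(value_tokens(src), value_tokens(palace)) # 0.2
```

Despite using different property names, the coreferent pair scores higher than the ambiguous pair, without any property mapping being supplied.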
Many years of searching for an ideal solution to linked data instance matching have produced considerable achievements, but not yet an optimal solution. Numerous systems have been published, ranging from manually operated to semi-automatic and automatic ones. To use manual systems (Volz, Bizer, Gaedke, & Kobilarov, 2009; Ngomo & Auer, 2011; Li, Tang, Li, & Luo, 2009), the user must provide a matching specification (e.g., property mappings and similarity measures). Semi-automatic systems reduce user involvement by suggesting a specification (Lyko, Höffner, Speck, Ngomo, & Lehmann, 2013) or by requiring only a small amount of labeled data (Ngomo, Lehmann, Auer, & Höffner, 2011; Isele & Bizer, 2013). Recently, studies of automatic approaches have increased because of their generality. Existing automatic systems can be categorized into three families: unsupervised learning of specifications (Nikolov, d’Aquin, & Motta, 2012; Ngomo & Lyko, 2013); probabilistic matching (Niepert, Meilicke, & Stuckenschmidt, 2010; Suchanek, Abiteboul, & Senellart, 2011); and similarity-based matching with statistical estimation of property mappings (Araujo, Tran, DeVries, Hidders, & Schwabe, 2012; Nguyen, Ichise, & Le, 2012a). The first two families scale poorly, because they either repeatedly scan the data or keep all intermediate computations in memory. The third family is more scalable, owing to its simpler architecture. One drawback of previous systems in this family is their low accuracy on large datasets; nevertheless, given its advantage in scalability, it remains one of the most promising directions.
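The idea behind the third family can be sketched as follows (a hypothetical toy, not the actual method of the cited systems): property mappings are estimated statistically by counting, over candidate coreferent pairs, how often a source property and a target property carry the same value.

```python
# Hypothetical toy illustration of statistical property-mapping estimation:
# each candidate coreferent pair "votes" for the property pairs whose values agree.

from collections import Counter

def estimate_property_mappings(pairs):
    """Count, over candidate matched instance pairs, how often a source
    property and a target property carry an identical value."""
    votes = Counter()
    for src, tgt in pairs:
        for p, v in src.items():
            for q, w in tgt.items():
                if v.lower() == w.lower():
                    votes[(p, q)] += 1
    return votes

pairs = [
    ({"name": "Tokyo Station"}, {"label": "Tokyo Station"}),
    ({"name": "Kyoto Station"}, {"label": "Kyoto Station"}),
]
mappings = estimate_property_mappings(pairs)
# ('name', 'label') receives a vote from every pair
```

Because the mapping is estimated from the data rather than supplied by the user, this style of system stays schema independent while keeping a simple, scalable architecture.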