Large quantities of records need to be read and analyzed in cloud computing, and many records referring to the same entity pose challenges for data processing and analysis. Entity resolution has thus become one of the hot issues in database research. Clustering based on record similarity is one of the most commonly used methods, but existing methods of computing record similarity are often time-consuming and not suitable for cloud computing. This chapter argues that a wave-of-strings representation is well suited to computing record similarity in cloud computing and proposes an entity-resolution method based on waves of strings. Theoretical analysis and experimental results show that the proposed method is correct and effective.
Introduction
In collections of data, some records referring to the same entity have different representations, and the process of finding them is called entity resolution. Without entity resolution to group the records referring to the same entity as a unit, the resulting confusion impairs the use of the information. Hence, in processes where data plays an important role, such as information integration, data cleansing, and information exchange, entity resolution is an important step.
With the development of information techniques, people are faced with large quantities of data to query, handle, and analyze. Differing data sources and data quality increase the probability that the same entity has multiple representations in a collection, so direct analysis of the collection yields an incomplete view, which distorts the final decision and wastes time and network bandwidth as well. It is therefore necessary to perform entity resolution over large quantities of data.
To address the problem of entity resolution over large quantities of data in cloud computing, we adopt the widely used MapReduce paradigm (Dean & Ghemawat, 2008) to design our algorithms. Our algorithm measures string similarity with a wave-of-strings representation combined with the Jaccard similarity function: the wave of a string describes the string's features, and records with high similarity are clustered together. The waves of strings generated by our algorithms rearrange characters by their features, so candidates can be filtered easily without accessing the underlying files, which reduces the amount of data transported. Moreover, waves of strings can be applied to Chinese text attributes without performing Chinese word segmentation, avoiding the errors introduced by poor segmentation.
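The full construction of the wave-of-strings representation is not reproduced in this section. As a minimal sketch, assuming a string's wave can be approximated by its multiset of characters, record similarity can be computed with a generalized Jaccard function over character counts; `char_profile` and the space-stripping normalization below are illustrative assumptions, not the chapter's exact definition:

```python
from collections import Counter

def char_profile(s):
    """Character-frequency profile of a string (a simplified stand-in
    for its wave). Assumption: spaces and case are ignored."""
    return Counter(s.replace(" ", "").lower())

def jaccard(p, q):
    """Generalized Jaccard similarity of two multisets:
    |intersection| / |union|, computed on character counts."""
    inter = sum((p & q).values())
    union = sum((p | q).values())
    return inter / union if union else 1.0

# Two spellings of the same name score 1.0 here, because character
# content is compared rather than exact character order.
print(jaccard(char_profile("Jones Smith"), char_profile("Jonse Smith")))
```

Because the comparison is order-insensitive at the character level, transposition-style typos such as "Jonse" for "Jones" do not lower the score, which is one reason a character-feature representation suits dirty data.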
Our contributions are as follows:
1. We study the problem of entity resolution over large quantities of data in cloud computing.
2. We propose a method based on waves of strings generated under the MapReduce paradigm to address the problem.
3. We show the feasibility and correctness of our algorithm through experiments.
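The chapter's actual MapReduce jobs are described later; as a minimal sketch of the general pattern, assuming map emits a wave-derived key for each record and reduce groups records that share a key into candidate clusters, the flow looks like the following. The sorted-character `signature` is a hypothetical stand-in for the real key construction:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (signature, record) pairs. Assumption: the signature is
    the record's name with characters sorted, standing in for the
    wave-derived key used by the chapter's algorithm."""
    for rec_id, name in records:
        signature = "".join(sorted(name.replace(" ", "").lower()))
        yield signature, (rec_id, name)

def reduce_phase(pairs):
    """Reduce: group records sharing a signature into candidate clusters."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [recs for recs in groups.values() if len(recs) > 1]

records = [("006", "Jones Smith"), ("019", "Smith Jones"), ("021", "Ann Lee")]
print(reduce_phase(map_phase(records)))  # the two 'Smith' records cluster
```

The point of the pattern is that filtering and grouping happen on the compact keys, so the reducers never need to fetch the original files, which matches the transport-cost argument above.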
Since cloud computing is one of the feasible solutions for big data processing, cloud-based entity resolution techniques could be applied in e-commerce (Chapter 16) and healthcare information systems (Chapter 17).
Background
Entity resolution over large quantities of data faces many challenges.
Firstly, large quantities of data require so many accesses and so much computation that even expensive high-performance machines cannot handle them well.
Secondly, the same entity is represented differently in different sources, for example:
1. Different orders of description;
2. Spelling mistakes in an attribute;
3. Different choices of attributes. For example, consider two records R1(006, Jones Smith, America, 8678901 8276571) and R2(019, Jonse R. Smith, 8276571 8678901). R1 has the attributes ID, name, nationality, and phone number, while R2 has only ID, name, and phone number. In addition, the name contains a spelling mistake, and the phone numbers appear in a different order (see the sketch after this list).
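To make the example concrete, a hedged sketch: exact matching fails on R1 and R2, while a simple token-set Jaccard comparison (a stand-in for the chapter's wave-based similarity, with the field values hard-coded from the example) tolerates the reordered phone numbers and partially survives the typo:

```python
def token_jaccard(a, b):
    """Jaccard similarity of whitespace-delimited token sets; a simple
    illustrative stand-in for the chapter's wave-based similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

r1_name, r1_phones = "Jones Smith", "8678901 8276571"
r2_name, r2_phones = "Jonse R. Smith", "8276571 8678901"

print(r1_name == r2_name)                   # False: exact match fails
print(token_jaccard(r1_phones, r2_phones))  # 1.0: token order is irrelevant
print(token_jaccard(r1_name, r2_name))      # 0.25: whole-token sets are hurt by the typo
```

A character-level comparison like the earlier profile sketch would also absorb the transposed letters in "Jonse", which is exactly the kind of variation the wave-of-strings representation is meant to handle.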
Thirdly, large quantities of intermediate data are produced, and frequent data accesses incur high cost.
Current methods cannot solve these problems well. Some methods lack scalability, such as (Hassanzadeh & Miller, 2009), which requires the similarity value between every pair of strings in the candidate set to achieve high precision; the resulting computation cost is high, so such methods cannot handle large quantities of data. Some parallel methods (Xiao, Wang, Lin, Yu & Wang, 2011; Vernica, Carey & Li, 2010), based on analyzing tokens extracted from strings, need Chinese word segmentation to prepare the data, so their results depend on the quality of the segmentation.