Our research applies text mining techniques to help librarians make better-informed decisions when selecting learning resources for the library’s offer. Proper selection of these resources is one of the key factors determining the overall usefulness of the library. Our task was to match abbreviated journal titles from citations with journals in existing digital libraries. The main problem is that a single journal often appears in the citation report under several different abbreviated forms, so the matching depends on detecting duplicate records. To detect these duplicates, we used character-based and token-based metrics together with a generated thesaurus.
In the real world, data are rarely clean. The reasons include data entry errors, missing check constraints, and a lack of standardization in how different sources record data. In general, data originating from different sources can vary in value, structure, semantics, and underlying assumptions (Elmagarmid, Ipeirotis, & Verykios, 2007). This is the problem of data heterogeneity. There are two basic types of data heterogeneity: structural (differently structured data in different databases) and lexical (diverse representations of the same real-world entity) (Elmagarmid et al., 2007). The problem of lexical heterogeneity has been explored in different research areas, such as statistics, databases, data mining, digital libraries, and natural language processing. Researchers in these areas have proposed various techniques and refer to the problem by different names: record linkage (Newcomb & Kennedy, 1962), data deduplication (Sarawagi & Bhamidipaty, 2002), database hardening and name matching (Bilenko, Mooney, Cohen, Ravikumar, & Fienberg, 2003), data cleaning (McCallum & Wellner, 2003), object identification (Tejada, Knoblock, & Minton, 2002), approximate matching (Guha, Koudas, Marathe, & Srivastava, 2004), fuzzy matching (Ananthakrishna, Chaudhuri, & Ganti, 2002), and entity resolution (Benjelloun, Garcia-Molina, Su, & Widom, 2005).
Data heterogeneity can have a negative impact on many common data library services. In our work we address the problem of lexical heterogeneity in general and duplicate record detection in particular. The appropriate technique for matching fields depends on the particular problem, and no single technique is best in every case. Broadly, these techniques may be classified into the following categories:
Key Terms in this Chapter
Structural Data Heterogeneity: Refers to differently structured data in different databases.
Lexical Data Heterogeneity: Refers to diverse representations of the same real-world entity.
Data Mining: Data mining is an iterative process of searching for new, previously hidden, and usually unexpected patterns in large volumes of data.
Character-based Similarity Metrics: Treat distance as the difference between the characters of two strings; they are useful for handling typographical errors.
Phonetic Similarity Metrics: Based on the fact that strings may be phonetically similar even if they are not similar at the character or token level.
Token-based Similarity Metrics: Based on statistics of the words that two strings have in common; they are useful when word order is not important.
Duplicate Record Detection: The process of identifying record replicas that refer to the same real-world entity or object in spite of the fact that they are syntactically different.
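To make the difference between character-based and token-based metrics concrete, the following minimal Python sketch compares abbreviated journal titles with one metric of each kind. It is an illustration only, not the implementation used in this chapter: Levenshtein edit distance and Jaccard token overlap are chosen here as common representatives of the two families, and the function names and sample titles are assumptions introduced for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-based: minimum number of single-character edits
    (insertions, deletions, substitutions) turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tokens(title: str) -> set:
    """Lowercase a title, drop periods, and split it into a set of words."""
    return set(title.lower().replace(".", " ").split())


def jaccard(a: str, b: str) -> float:
    """Token-based: shared words divided by all distinct words.
    Word order does not affect the result."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Hypothetical abbreviated titles, used only for illustration.
    print(char_similarity("J. Inf. Sci.", "J. Inf. Sc."))  # one missing character
    print(jaccard("J. Comput. Sci.", "Sci. Comput. J."))   # same words, reordered
```

In this sketch the character-based score stays high when two forms differ by a typographical slip, while the token-based score treats reordered words as a perfect match. This is why the two kinds of metric complement each other when detecting duplicate records.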