Our research applies text mining techniques to help librarians make more informed decisions when selecting learning resources for the library’s offer. The proper selection of these resources is one of the key factors determining the overall usefulness of the library. Our task was to match abbreviated journal titles from citations with journals in existing digital libraries. The main problem is that a single journal often appears under a number of different abbreviated forms in the citation report, so matching depends on the detection of duplicate records. We used character-based and token-based metrics, together with a generated thesaurus, to detect duplicate records.
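The combination described above can be sketched as follows. This is a minimal illustration, not our actual implementation: the metric choices (difflib's Ratcliff/Obershelp ratio as the character-based metric, Jaccard overlap as the token-based metric) and the tiny thesaurus are assumptions made for the example.

```python
import difflib

def char_similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] (Ratcliff/Obershelp ratio)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Hypothetical generated thesaurus mapping abbreviated tokens to full words.
THESAURUS = {
    "j.": "journal",
    "amer.": "american",
    "stat.": "statistical",
    "assoc.": "association",
}

def expand(title: str) -> str:
    """Replace known abbreviations with their full forms before comparison."""
    return " ".join(THESAURUS.get(tok, tok) for tok in title.lower().split())

abbreviated = "J. Amer. Stat. Assoc."
full = "Journal of the American Statistical Association"

# Raw token overlap between abbreviated and full title is zero; after
# thesaurus expansion the token-based metric recognizes the match.
raw_score = token_similarity(abbreviated, full)            # 0.0
expanded_score = token_similarity(expand(abbreviated), full)  # ~0.67
```

In this sketch the thesaurus compensates for the weakness of token overlap on abbreviations, while the character-based ratio remains useful for near-identical variants such as misspellings.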
In the real world, data are not perfectly clean, and there are various reasons for that: data entry errors, missing check constraints, lack of standardization in recording data across different sources, and so forth. In general, data originating from different sources can vary in value, structure, semantics, and underlying assumptions (Elmagarmid, Ipeirotis, & Verykios, 2007). This is the problem of data heterogeneity. There are two basic types of data heterogeneity: structural (differently structured data in different databases) and lexical (diverse representations of the same real-world entity) (Elmagarmid et al., 2007). Lexical heterogeneity has been explored in different research areas, such as statistics, databases, data mining, digital libraries, and natural language processing. Researchers in these areas have proposed various techniques and refer to the problem differently: record linkage (Newcombe & Kennedy, 1962), data deduplication (Sarawagi & Bhamidipaty, 2002), database hardening and name matching (Bilenko, Mooney, Cohen, Ravikumar, & Fienberg, 2003), data cleaning (McCallum & Wellner, 2003), object identification (Tejada, Knoblock, & Minton, 2002), approximate matching (Guha, Koudas, Marathe, & Srivastava, 2004), fuzzy matching (Ananthakrishna, Chaudhuri, & Ganti, 2002), and entity resolution (Benjelloun, Garcia-Molina, Su, & Widom, 2005).
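Lexical heterogeneity can be made concrete with a small sketch. The two records below are hypothetical examples of the same journal as it might appear in two sources; the field names and the use of an ISSN as a stable key are assumptions for illustration only.

```python
# Two hypothetical records for the same journal, drawn from different sources.
record_a = {"title": "J. Mach. Learn. Res.", "issn": "1532-4435"}
record_b = {"title": "Journal of Machine Learning Research", "issn": "1532-4435"}

# Lexical heterogeneity: both titles denote the same entity,
# yet exact string comparison treats them as different journals.
assert record_a["title"] != record_b["title"]

# A shared stable identifier (here, the ISSN) would resolve the pair at once,
# but citation reports typically lack such keys, which is why approximate
# string matching on the titles themselves is required.
assert record_a["issn"] == record_b["issn"]
```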
Data heterogeneity can have a negative impact on many common digital library services. In our work we address the problem of lexical heterogeneity in general and duplicate record detection in particular. The technique used for matching fields depends on the particular problem, and no single solution fits all cases. Broadly, these techniques may be classified into the following categories: