Data Integration Framework: A Children and Parents Cohort Case Study

Maria Vargas-Vera (Universidad Adolfo Ibáñez, Viña del Mar, Chile)
Copyright: © 2016 | Pages: 14
DOI: 10.4018/IJKSR.2016010107


This paper presents a proposal for a data integration framework. The purpose of the framework is to automatically locate records of participants from the ALSPAC database (Avon Longitudinal Study of Parents and Children) within its counterpart, the GPRD database (General Practice Research Database). The ALSPAC database is a collection of data on children and parents from before birth to late puberty. This collection contains several variables of interest to clinical researchers, but we concentrate on asthma because a gold standard for the evaluation of asthma has been produced by a clinical researcher. The main component of the framework is a module called Mapper, which locates similar records and performs record linkage. The Mapper contains a library of similarity measures such as Jaccard, Jaro-Winkler, Monge-Elkan, MatchScore, Levenshtein and TFIDF similarity. Finally, the author evaluates the approach on the quality of the mappings.
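To make the idea of a similarity-measure library concrete, the following is a minimal Python sketch of two of the measures named above, Jaccard and Levenshtein. It is not the paper's Mapper implementation: the function names, the whitespace tokenisation and the normalisation of the edit distance to the range [0, 1] are assumptions made for illustration only.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace tokens (assumed tokenisation)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A record-linkage module could compare a field from one database against the corresponding field in the other with either function and keep the pairs whose score exceeds a chosen threshold. Token-based measures such as Jaccard tolerate reordered words, while character-based measures such as Levenshtein tolerate spelling variation.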
Article Preview


We start our discussion by clarifying the different meanings of the term data integration. The term is defined in several fields, including but not limited to Databases and Knowledge Management, and we found that the definition varies strongly from field to field. For example, in the databases community, data integration is seen as the process of combining data from different sources and providing users with a unified view of the data. In contrast, in Knowledge Management, data integration refers to enterprise information integration. In our opinion, however, data integration is the process of creating a unified data model by integrating the individual models made by different stakeholders. This integration is essentially a reconciliation of the terms and relations used by each stakeholder while building their own model. Our scenario is a real problem that a community of clinical researchers faces when performing longitudinal studies on populations. In particular, one of our databases came from ALSPAC, which is the largest birth cohort with detailed biological and behavioural data from before birth through to late adolescence. The second database was the GPRD (General Practice Research Database), which contains patients of practices in England.

The main contribution of this paper is an integration framework for the problem of linking records from different databases. The record linkage problem has been studied in several communities, such as Statistics, Databases and Artificial Intelligence. The statistics community has concentrated on record linkage based on the seminal paper of Fellegi & Sunter (Fellegi & Sunter, 1969). Fellegi & Sunter framed the entity matching problem as a classification problem, where the goal is to classify entity pairs into two classes, matching or non-matching. Their pioneering work developed a mathematical model that provides a theoretical framework for a computer-oriented solution to the problem of recognizing records in two files which represent identical persons, objects or events. Fellegi and Sunter's proposals have been adopted by other researchers, although often with enhancements of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997). The Artificial Intelligence community has focused its efforts on supervised learning, which has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 1998; Bilenko & Mooney 2002). The combination of the results of different distance functions has also been explored by a good number of researchers (Tejada et al. 2001; Cohen & Richman 2002; Bilenko & Mooney 2002). Some work of the database community on record matching has been based on knowledge approaches (Hernandez & Stolfo 1995; Galhardas et al. 2000; Raman & Hellerstein 2001; Pinheiro & Sun 1998). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 1997; 1996).
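The classification view of record linkage attributed to Fellegi & Sunter can be illustrated with a short Python sketch. It scores a record pair by summing per-field log-likelihood ratios and then thresholds the total. Everything specific here is an assumption made for illustration: the field names, the m-probabilities (chance that a field agrees for a true match) and u-probabilities (chance that it agrees for a non-match), and the threshold values are hypothetical and are not taken from the paper or from the ALSPAC and GPRD databases.

```python
import math

# Hypothetical per-field (m, u) probabilities; not estimated from real data.
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(rec_a: dict, rec_b: dict, m_u: dict = M_U) -> float:
    """Sum of log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight: float, lower: float = 0.0, upper: float = 8.0) -> str:
    """Threshold the weight; setting lower == upper gives a two-class decision."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"
```

With these assumed values, a pair that agrees on all three fields scores well above the upper threshold, and a pair that disagrees on all three scores below zero. Exact equality is used for field comparison only to keep the sketch short; a string similarity score compared against a cut-off could take its place.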

Finally, it is worth noting that a solution to the data integration problem promises to improve query engines, since the linked data can be used to retrieve a richer set of answers than standard SQL queries alone. The rest of the paper is organized as follows. First, it provides an overview of related work. Second, it gives a description of ALSPAC (Avon Longitudinal Study of Parents and Children). Third, it presents our framework for data integration. Fourth, it presents our linkage algorithms. Finally, it gives our conclusions and future work.
