Record Linkage in Data Warehousing

Record Linkage in Data Warehousing

Alfredo Cuzzocrea, Laura Puglisi
Copyright: © 2015 |Pages: 10
DOI: 10.4018/978-1-4666-5888-2.ch189
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Chapter Preview

Top

Introduction

“A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data supporting decision-making processes” (Inmon, 2002). At a more practical level, a data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site (i.e., the Data Warehousing server). Looking into inside, Data Warehouses are characterized by different processes: data cleaning, data integration, data transformation, data loading, and periodic data refreshing. All these convey in the so-called ETL (Extraction-Transformation-Loading) main process (Inmon, 2002), which, essentially, aliments the Data Warehouse.

In order to support decision-making processes, data in a Data Warehouse are organized around so-called subjects, such as Customer, Item, and Activity, and so forth. Also, data are stored in such a way as to provide information from a historical perspective (e.g., since the past 5-10 years) and are typically summarized according to a given level of granularity. Consider, for instance, the case of a Data Warehouse storing sale data. Here, rather than storing the details of each sale transaction, the Data Warehouse may rather store a summary of transactions per-item-type for each store or, at a higher level, (summarized) for each sale region. In addition to this, Data Warehousing platforms provide On-Line Analytical Processing (OLAP) (Gray et al., 1997) tools for supporting interactive data analysis according to a multidimensional and multi-resolution vision. Also, many other Data Mining functionalities (Fayyad et al., 1996; Frawley et al., 1992), such as Association Rule Discovery, Classification, Prediction and Clustering, can be integrated with the OLAP layer in order to enhance interactive (summarized) knowledge discovery and mining at multiple levels of abstraction. Figure 1 shows the reference architecture on a Data Warehousing platform (Inmon, 2002).

Figure 1.

Reference architecture of a data warehousing platform

978-1-4666-5888-2.ch189.f01

According to Inmon (2002), the major distinctive features of a Data Warehouse are the following: (i) subject-orientation, which refers to the amenity according to which a Data Warehouse focuses on subjects of analysis, and features or data that are not useful to the target decision-making process are excluded from the analysis; (ii) integration, which refers to the amenity according to which input data for a Data Warehouse come from multiple and heterogeneous sources, such as relational databases, flat files etc. – as a consequence, in order to remove possible inconsistency and duplicated information, data cleaning and data transformation processes are exploited to this end; (iii) time-variance, which refers to the amenity according to which input data in a Data Warehouse have a marked temporal perspective and multi-versioning (e.g., across the past 5-10 years); (iv) non-volatility, which refers to the amenity according to which, in a Data Warehouse, (summarized) analytical data are maintained separated from (alimenting) transactional data – due to this clear separation, a Data Warehouse server does not require transaction processing and recovery, and concurrency control mechanisms (like conventional DBMS servers) but, rather, it only requires three main operations: (initial) data loading, data refreshing, data accessing.

Key Terms in this Chapter

Similarity: The condition of resemblance between two or more objects.

Data Cleaning: Process of detection and correction of incomplete, corrupted or inaccurate data from a data source.

Relational Database: Database management system based on the relational data model, in which data is represented in terms of tuples and grouped into relations.

Data Heterogeneity: In the Data Warehousing context, it refers to data coming from disparate data sources and presented to the user with a unified interface.

ETL: The process of Extraction, Transformation and Loading of data coming from a collection of heterogeneous data sources, needed to make them suitable for Data Warehouse usage.

Data Warehousing: A central repository of current and historical data made by integrating and aggregating data from heterogeneous sources.

OLAP: On-Line Analytical Processing, or OLAP, designates a set of software techniques for interactive analysis of large amounts of multidimensional data from multiple perspectives.

Complete Chapter List

Search this Book:
Reset