Learning Classifiers from Distributed Data Sources


Doina Caragea (Kansas State University, USA) and Vasant Honavar (Iowa State University, USA)
DOI: 10.4018/978-1-60566-242-8.ch063

Abstract

The recent development of high-throughput data acquisition technologies in a number of domains (e.g., biological sciences, atmospheric sciences, space sciences, commerce), together with advances in digital storage, computing, and communications technologies, has resulted in the proliferation of physically distributed data repositories created and maintained by autonomous entities (e.g., scientists, organizations). The resulting increasingly data-rich domains offer unprecedented opportunities for computer-assisted, data-driven knowledge acquisition in a number of applications, including data-driven scientific discovery, data-driven decision-making in business and commerce, monitoring and control of complex systems, and security informatics.

Introduction


Machine learning (Duda, Hart & Stork, 2000; Mitchell, 1997) offers one of the most cost-effective approaches to analyzing, exploring, and extracting knowledge (i.e., features, correlations, and other complex relationships and hypotheses that describe potentially interesting regularities) from data. However, the applicability of current machine learning approaches in emerging data-rich applications is severely limited by a number of factors:

a. Data repositories are large, dynamic, and physically distributed. Consequently, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. Hence, there is a need for efficient algorithms for analyzing and exploring multiple distributed data sources without transmitting large amounts of data.

b. Autonomously developed and operated data sources often differ in their structure and organization (e.g., relational databases, flat files) and in the operations that can be performed on them (e.g., the types of queries allowed: relational queries, statistical queries, keyword matches). Hence, there is a need for theoretically well-founded strategies for efficiently obtaining the information needed for analysis within the operational constraints imposed by the data sources.
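Constraint (b) suggests interacting with each source only through the queries it supports, rather than through raw data access. The following is a minimal sketch of such a statistical-query interface; the class and method names are illustrative assumptions, not an interface from the chapter:

```python
# Minimal sketch (illustrative names only) of a statistical-query
# interface that lets a learner obtain statistics from a data source
# without retrieving its raw tuples.
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Abstract data source that answers statistical queries
    instead of exposing its stored records."""
    @abstractmethod
    def count(self, predicate):
        """Return the number of stored tuples satisfying `predicate`."""

class InMemorySource(DataSource):
    """A flat-file-like source backed by a list of tuples."""
    def __init__(self, tuples):
        self._tuples = list(tuples)

    def count(self, predicate):
        # Only the count crosses the interface, not the tuples themselves.
        return sum(1 for t in self._tuples if predicate(t))

# Example: count positive examples without transferring the data.
source = InMemorySource([("sunny", "yes"), ("rainy", "no"), ("sunny", "yes")])
positives = source.count(lambda t: t[1] == "yes")  # -> 2
```

A relational database or a keyword-search service could implement the same interface with its own query mechanism, which is the point of constraint (b).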

The purpose of this entry is to precisely define the problem of learning classifiers from distributed data and summarize recent advances that have led to a solution to this problem (Caragea, Silvescu & Honavar, 2004; Caragea, Zhang, Bao, Pathak & Honavar, 2005).


Background: Problem Specification

Given a dataset D, a hypothesis class H, and a performance criterion P, an algorithm L for learning (from centralized data D) outputs a hypothesis h ∈ H that optimizes P. In pattern classification applications, h is a classifier (e.g., a decision tree, a support vector machine, etc.) (see Figure 1). The data D typically consist of a set of training examples. Each training example is an ordered tuple of attribute values, where one of the attributes corresponds to a class label and the remaining attributes represent inputs to the classifier. The goal of learning is to produce a hypothesis that optimizes the performance criterion (e.g., minimizes classification error on the training data) while also controlling the complexity of the hypothesis.

Figure 1. Learning from centralized data
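The definition above can be made concrete with a toy sketch. Assuming a finite hypothesis class of one-dimensional threshold classifiers (all names and the specific class H here are illustrative, not from the chapter), a centralized learner that optimizes P over H is simply a minimization:

```python
# Toy sketch of centralized learning: pick the hypothesis h in a finite
# class H that minimizes the performance criterion P on the data D.
def learn(D, H, P):
    """D: list of (x, label); H: list of hypotheses (callables);
    P: criterion mapping (h, D) -> score to be minimized."""
    return min(H, key=lambda h: P(h, D))

def error_rate(h, D):
    """Fraction of training examples misclassified by h."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

# Hypothesis class: threshold classifiers on one numeric attribute.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0, 1.5)]
D = [(0.2, 0), (0.4, 0), (1.1, 1), (1.3, 1)]
h = learn(D, H, error_rate)  # a threshold that separates the data
```

The same `learn` interface is what the distributed setting below must reproduce without ever assembling D at one site.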

In a distributed setting, a dataset D is distributed among sites 1, ..., n containing the dataset fragments D1, ..., Dn. Two common types of data fragmentation are horizontal fragmentation and vertical fragmentation. In the case of horizontal fragmentation, each site contains a subset of the data tuples that make up D; in the case of vertical fragmentation, each site stores subtuples (projections onto subsets of the attributes) of the data tuples in D.
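The two fragmentation schemes can be illustrated on a toy dataset; the attribute layout and the index-based join key are assumptions made for the illustration:

```python
# Toy dataset of (outlook, humidity, play) tuples.
D = [("sunny", "high", "no"),
     ("rainy", "normal", "yes"),
     ("sunny", "normal", "yes"),
     ("rainy", "high", "no")]

# Horizontal fragmentation: each site holds a subset of the tuples.
D1, D2 = D[:2], D[2:]
assert D1 + D2 == D  # the fragments reassemble the full dataset

# Vertical fragmentation: each site holds subtuples (attribute subsets);
# a shared key (here, the tuple index) links the pieces across sites.
V1 = [(i, t[0]) for i, t in enumerate(D)]        # site 1: outlook
V2 = [(i, t[1], t[2]) for i, t in enumerate(D)]  # site 2: humidity + label
```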

Key Terms in this Chapter

Sufficient Statistics: A statistic is called a sufficient statistic for a parameter if the statistic captures all of the information about the parameter that is contained in the data. More generally, a statistic is called a sufficient statistic for learning a hypothesis using a particular learning algorithm applied to a given dataset if there exists an algorithm that takes the statistic as input and outputs the desired hypothesis. A query that returns a statistic is called a statistical query.
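As a concrete sketch of this definition: for estimating the parameter of a Bernoulli class distribution, the pair (number of positives, sample size) is a sufficient statistic, and the hypothesis can be generated from the statistic alone. The function names are illustrative, not from the chapter:

```python
# Illustrative example: (positives, n) is a sufficient statistic for a
# Bernoulli parameter -- the estimate needs the statistic, not raw data.
def sufficient_statistic(D):
    """Statistical query answered by a data source; D is a 0/1 list."""
    return (sum(D), len(D))

def estimate_prior(stat):
    """Hypothesis generation from the statistic alone."""
    positives, n = stat
    return positives / n

D = [1, 0, 1, 1, 0]
assert estimate_prior(sufficient_statistic(D)) == sum(D) / len(D)
```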

Learning from Data: Given a dataset, a hypothesis class (e.g., decision trees), and a performance criterion (e.g., accuracy), a learning algorithm outputs a hypothesis that optimizes the performance criterion.

Machine Learning: A key objective of machine learning is to design and analyze algorithms that are able to improve their performance at some task through experience. A machine learning system is specified by several components: (a) Learner – an algorithm or a computer program that is able to use experience to improve its performance; (b) Task – a description of the task that the learner is trying to accomplish (e.g., learn a concept, a function, etc.); (c) Experience – a specification of the information that the learner uses to perform the learning; (d) Background knowledge – the information that the learner has about the task before the learning process (e.g., "simple" answers are preferable to "complex" answers); (e) Performance criteria – measures of the quality of the learning output in terms of accuracy, simplicity, efficiency, and so forth.

Distributed Data Sources: In a distributed setting, the data are distributed across several data sources, each of which contains only a fragment of the data. Two common types of data fragmentation are horizontal fragmentation, wherein (possibly overlapping) subsets of data tuples are stored at different sites; and vertical fragmentation, wherein (possibly overlapping) subtuples of data tuples are stored at different sites. More generally, the data may be fragmented into a set of relations (tables of a relational database, distributed across multiple sites).

Learning Task Decomposition: A learning algorithm can be decomposed into two components: (a) an information extraction component that formulates and sends a statistical query to a data source; and (b) a hypothesis generation component that uses the resulting statistic to modify a partially constructed algorithm output (and further invokes the information extraction component, if needed, to generate the final algorithm output).
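A minimal sketch of this two-component decomposition, here estimating class priors by accumulating per-site class counts; the function and variable names are illustrative assumptions, not the chapter's algorithm:

```python
# Sketch of the decomposition: information extraction poses statistical
# queries, and hypothesis generation folds each answer into a partially
# constructed output until the final hypothesis can be produced.
from collections import Counter

def extract(source, query):
    """Information extraction: send a statistical query to a data source."""
    return query(source)

def learn_class_priors(sources):
    """Hypothesis generation: accumulate class counts from each source,
    then emit the final output (class priors)."""
    counts = Counter()  # partially constructed algorithm output
    for src in sources:
        counts += extract(src, lambda s: Counter(label for _, label in s))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

sites = [[("x1", "yes"), ("x2", "no")], [("x3", "yes")]]
priors = learn_class_priors(sites)
assert priors["yes"] == 2 / 3
```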

Learning from Distributed Data: Given several fragments of a dataset (distributed across several sites), a set of constraints (referring to privacy concerns, storage issues, operations allowed, etc.), a hypothesis class, and a performance criterion, the task of the distributed learner is to output a hypothesis that optimizes the performance criterion without violating the set of constraints. We say that an algorithm for learning from distributed data is exact if the hypothesis that it outputs is identical to that produced by the centralized algorithm for learning from the complete dataset, obtained by appropriately combining the distributed data fragments.
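The exactness property can be illustrated with a trivial majority-class learner: the hypothesis built from per-fragment statistics coincides with the one learned centrally from the combined data. This is a toy sketch under that assumption, not the chapter's algorithm:

```python
# Illustrative check of exactness: learning from per-fragment statistics
# yields the same hypothesis as centralized learning on the full data.
from collections import Counter

def majority_class(counts):
    """A trivial 'learner': predict the most frequent class label."""
    return counts.most_common(1)[0][0]

D1 = [("a", "yes"), ("b", "yes")]
D2 = [("c", "no"), ("d", "yes")]

# Centralized: combine the fragments first, then learn.
central = majority_class(Counter(y for _, y in D1 + D2))
# Distributed: learn from per-fragment statistics only.
distributed = majority_class(
    Counter(y for _, y in D1) + Counter(y for _, y in D2))
assert central == distributed  # exactness holds for this learner
```

Because class counts are additive across horizontal fragments, the combined statistic equals the centralized one, which is exactly what the exactness criterion requires.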
