Class Prediction in Test Sets with Shifted Distributions

Óscar Pérez; Manuel Sánchez-Montañés

doi:10.4018/978-1-59904-849-9.ch044


Source Title: Encyclopedia of Artificial Intelligence

Abstract

Machine learning has provided powerful algorithms that automatically generate predictive models from experience. One specific technique is supervised learning, where the machine is trained to predict a desired output for each input pattern x. This chapter focuses on classification, that is, supervised learning in which the output to predict is a class label. An example is predicting whether a hospital patient will develop cancer. Here the class label c is a variable with two possible values, “cancer” or “no cancer”, and the input pattern x is a vector containing patient data (e.g. age, gender, diet, smoking habits). To construct a predictive model, supervised learning methods require a set of examples x_i together with their respective labels c_i. This dataset is called the “training set”. The constructed model is then used to predict the labels of a set of new cases x_j called the “test set”. In the cancer prediction example, this is the phase in which the model is used to predict cancer in new patients.

One common assumption in supervised learning algorithms is that the statistical structure of the training and test datasets is the same (Hastie, Tibshirani & Friedman, 2001). That is, the test set is assumed to have the same attribute distribution p(x) and the same class distribution p(c|x) as the training set. However, this is usually not the case in real applications, for several reasons. For instance, in many problems the training dataset is obtained in a specific manner that differs from the way the test dataset will be generated later. Moreover, the nature of the problem may evolve in time. These phenomena cause p^Tr(x, c) ≠ p^Test(x, c), which can degrade the performance of the model constructed in training.

Here we present a new algorithm that allows a model constructed in training to be re-estimated using the unlabelled test patterns. We show the convergence properties of the algorithm and illustrate its performance on an artificial problem. Finally, we demonstrate its strengths in a heart disease diagnosis problem where the training set is taken from a different hospital than the test set.

Background

In practical problems, the statistical structure of the training and test sets can differ, that is, p^Tr(x, c) ≠ p^Test(x, c). This can happen for several reasons. One is bias in the sampling of the training set (Heckman, 1979; Salganicoff, 1997). Another is that the training and test sets can come from different contexts, for instance when a heart disease diagnosis model is used in a hospital other than the one where the training dataset was collected. If the hospitals are located in cities where people differ in habits, average age, etc., the test set will have a different statistical structure from the training set.

The special case p^Tr(x) ≠ p^Test(x) with p^Tr(c | x) = p^Test(c | x) is known in the literature as “covariate shift” (Shimodaira, 2000). Covariate shift can degrade the performance of standard machine learning algorithms. Different techniques have been proposed to deal with this problem; see, for example, Heckman (1979), Salganicoff (1997), Shimodaira (2000) and Sugiyama, Krauledat & Müller (2007). Transductive learning has also been suggested as another way to improve performance when the statistical structure of the test set is shifted with respect to the training set (Vapnik, 1998; Chen, Wang & Dong, 2003; Wu, Bennett, Cristianini & Shawe-Taylor, 1999).
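To make the covariate shift setting concrete, the sketch below illustrates importance weighting, one common way of correcting for it: each training example is weighted by p^Test(x) / p^Tr(x) so that averages over the training set approximate averages over the test distribution. This is a toy illustration, not the algorithm presented in this chapter. It assumes that both attribute densities are known one-dimensional Gaussians (in practice the ratio has to be estimated), and every name in it (`gaussian_pdf`, `p_class1_given_x`, `predict`, the chosen means and threshold) is hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def p_class1_given_x(x):
    """Assumed p(c=1 | x); identical in training and test (covariate shift)."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def predict(x):
    """A fixed, deliberately imperfect classifier: class 1 when x > 0.5."""
    return (x > 0.5).astype(int)

rng = np.random.default_rng(0)
n = 100_000
TRAIN_MEAN, TEST_MEAN, STD = 0.0, 1.0, 1.0  # assumed: only p(x) differs

# Sample attributes from different p(x), labels from the same p(c | x).
x_train = rng.normal(TRAIN_MEAN, STD, n)
c_train = (rng.random(n) < p_class1_given_x(x_train)).astype(int)
x_test = rng.normal(TEST_MEAN, STD, n)
c_test = (rng.random(n) < p_class1_given_x(x_test)).astype(int)

# Per-example 0/1 errors on the training set.
errors_train = (predict(x_train) != c_train).astype(float)

# Importance weights w(x) = p^Test(x) / p^Tr(x), using the known densities.
weights = gaussian_pdf(x_train, TEST_MEAN, STD) / gaussian_pdf(x_train, TRAIN_MEAN, STD)

naive_estimate = errors_train.mean()
weighted_estimate = np.sum(weights * errors_train) / np.sum(weights)
true_test_error = (predict(x_test) != c_test).mean()

print(f"unweighted training error : {naive_estimate:.3f}")
print(f"importance-weighted error : {weighted_estimate:.3f}")
print(f"error measured on test set: {true_test_error:.3f}")
```

In this setup the unweighted training error misestimates the error on the test distribution, while the weighted estimate should track it closely. The same weights can be passed to a learning algorithm that accepts per-example weights, so that the model is fitted where the test patterns actually lie.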

The statistics of the patterns x can also change in time, for example in a company that has a continuous flow of new and departing clients (Figure 1). If we are interested in constructing a model for prediction, the statistics of the clients when the model is exploited will differ from the statistics in training. Finally, the concept to be learned is often not static but evolves in time (for example, predicting which emails are spam), causing p^Tr(x, c) ≠ p^Test(x, c). This problem is known as “concept drift”, and different algorithms have been proposed to cope with it (Black & Hickey, 1999; Wang, Fan, Yu, & Han, 2003; Widmer & Kubat, 1996).

Figure 1. Changes across time in the statistics of clients in a car insurance company. The histograms of two different variables (a, b) related to the clients’ use of their insurance are shown. Dashed lines: data collected four months later than the data shown with solid lines.

Key Terms in this Chapter

Classifier: function that associates a class c to each input pattern x of interest. A classifier can be directly constructed from a set of pattern examples with their respective classes, or indirectly from a statistical model.

Statistical model: mathematical function that models the statistical structure of the problem. For classification problems, the statistical model is p(x, c), or equivalently {p(x), p(c | x)}, since p(x, c) = p(x) p(c | x).

EM (Expectation-Maximization algorithm): standard iterative algorithm for estimating the parameters θ of a parametric statistical model. EM finds the specific parameter values that maximize the likelihood of the observed data D given the statistical model, p(D | θ). The algorithm alternates between the Expectation step and the Maximization step, finishing when p(D | θ) meets some convergence criterion.

Missing value: special value of an attribute that denotes that it is not known or cannot be measured.

Attribute: each of the components that constitute an input pattern.

Training/Test sets: in the context of this chapter, the training set is composed of all labelled examples that are provided for constructing a classifier. The test set is composed of the new unlabelled patterns whose classes should be predicted by the classifier.

Semi-Supervised Learning: machine learning technique that uses both labelled and unlabelled data for constructing the model.

Supervised Learning: type of learning where the objective is to learn a function that associates a desired output (‘label’) to each input pattern. Supervised learning techniques require a training dataset of examples with their respective desired outputs. Supervised learning is traditionally divided into regression (the desired output is a continuous variable) and classification (the desired output is a class label).
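The EM entry above can be illustrated with a small, generic example. The sketch below fits a two-component, one-dimensional Gaussian mixture to unlabelled data by alternating the Expectation and Maximization steps and stopping when the log-likelihood stops improving. It is a textbook instance of EM under assumed settings, not the re-estimation algorithm proposed in this chapter, and all names and numeric values (`em_two_gaussians`, the initial parameters, the tolerance) are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def em_two_gaussians(x, means, stds, priors, tol=1e-6, max_iter=500):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihood history."""
    means = np.asarray(means, dtype=float).copy()
    stds = np.asarray(stds, dtype=float).copy()
    priors = np.asarray(priors, dtype=float).copy()
    history = []
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: posterior probability of each component for each pattern.
        joint = priors[None, :] * gaussian_pdf(x[:, None], means[None, :], stds[None, :])
        evidence = joint.sum(axis=1)
        resp = joint / evidence[:, None]

        # Log-likelihood of the data under the current parameters.
        ll = np.log(evidence).sum()
        history.append(ll)
        if ll - prev_ll < tol:  # convergence criterion
            break
        prev_ll = ll

        # M step: re-estimate parameters from the posterior-weighted data.
        nk = resp.sum(axis=0)
        priors = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / nk)
    return means, stds, priors, history

# Unlabelled data drawn from an assumed mixture: 30% N(-2, 1) and 70% N(3, 1).
rng = np.random.default_rng(0)
n = 4000
from_second = rng.random(n) < 0.7
x = np.where(from_second, rng.normal(3.0, 1.0, n), rng.normal(-2.0, 1.0, n))

means, stds, priors, history = em_two_gaussians(
    x, means=[-1.0, 1.0], stds=[1.0, 1.0], priors=[0.5, 0.5]
)
print("means :", means)
print("stds  :", stds)
print("priors:", priors)
print("EM iterations:", len(history))
```

The log-likelihood recorded in `history` never decreases from one iteration to the next, which is the property that makes a likelihood-based stopping rule meaningful.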
