Learning From Imbalanced Data

Lincy Mathews (M. S. Ramaiah Institute of Technology, India) and Seetha Hari (Vellore Institute of Technology, India)
Copyright: © 2018 | Pages: 10
DOI: 10.4018/978-1-5225-2255-3.ch159


A major challenge in real-world data is that in many domains, such as medicine, finance, marketing, the web, telecommunications, and management, the distribution of data among classes is inherently imbalanced. It is widely recognized that traditional classification algorithms assume a balanced distribution among the classes. Data imbalance arises when the number of instances representing the class of interest is much smaller than the number in the other classes. As a result, classifiers tend to be biased towards the well-represented class, which leads to a higher misclassification rate for the under-represented class. Efficient learners are therefore needed to classify imbalanced data. This chapter addresses the need, challenges, existing methods, and evaluation metrics identified when learning from imbalanced data sets. Future research challenges and directions are highlighted.
Chapter Preview

Characteristics Of Imbalanced Data

The imbalance ratio between majority and minority instances does not necessarily affect classifier performance when the degree of imbalance is moderate (Chen & Wasikowski, 2008). The inherent characteristics of the minority data, however, can degrade the performance of learning models. Minority instances fall into two basic categories: safe and unsafe. Safe minority instances are those that the base learners rarely misclassify; they lie far from the borderline with the majority instances. Unsafe minority instances are so called because they are misclassified frequently.
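The safe/unsafe distinction can be made concrete with a small sketch. The rule used here is an assumption for illustration, not the chapter's own procedure: a minority instance is tagged safe when at least `min_same` of its `k` nearest neighbours (Euclidean distance) also belong to the minority class, and unsafe otherwise. The function name, the parameters `k` and `min_same`, and the toy data are all hypothetical.

```python
import math

def label_minority_instances(points, labels, minority=1, k=3, min_same=2):
    """Tag each minority instance 'safe' or 'unsafe' from its k nearest neighbours.

    Assumed rule (illustrative only): 'safe' if at least `min_same` of the
    k nearest neighbours share the minority label, otherwise 'unsafe'.
    """
    tags = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Distance from p to every other instance, nearest first.
        others = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        neighbour_labels = [lab for _, lab in others[:k]]
        same_class = sum(1 for lab in neighbour_labels if lab == minority)
        tags[i] = "safe" if same_class >= min_same else "unsafe"
    return tags

# Toy data: a compact minority cluster, a majority cluster, and one
# minority instance sitting in the middle of the majority cluster.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),   # minority cluster (indices 0-3)
    (5, 5), (5, 6), (6, 5), (6, 6),   # majority cluster (indices 4-7)
    (5.5, 5.5),                       # minority instance inside the majority
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1]

tags = label_minority_instances(points, labels, minority=1, k=3, min_same=2)
print(tags)
```

In this toy set the four clustered minority instances have only minority neighbours, while the instance at (5.5, 5.5) is surrounded by majority instances, which is the kind of instance a base learner would be expected to misclassify.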

Key Terms in this Chapter

Cost Matrix: A classification cost matrix is a matrix in which each element is the misclassification cost of predicting that a case belongs to class X when it actually belongs to class Y.

Bootstrapping: A statistical method that generates artificial samples from the original training samples by resampling them with replacement.

Sampling: A statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to identify patterns in the larger data set.

Synthetic Data: Artificially generated data produced by some method.

Overfitting: A modeling error that occurs when a function is fit too closely to a limited set of data points.

Classification: The process by which a model trained on data with known class labels assigns a class label to an unseen sample.

Bagging: The process of training multiple models on different samples (data splits) and averaging their predictions.
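Two of the terms above, bootstrapping and bagging, can be illustrated together in a minimal sketch. It assumes a single numeric feature and a deliberately simple base learner (nearest class mean); the function names, the number of models, the seed, and the toy data are hypothetical choices made for illustration, not a method taken from the chapter.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))

def fit_class_means(sample):
    """Base learner (assumed for illustration): mean feature value per class."""
    by_class = {}
    for x, y in sample:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict_one(class_means, x):
    """Predict the class whose mean is closest to x."""
    return min(class_means, key=lambda y: abs(x - class_means[y]))

def bagging_predict(data, x, n_models=25, seed=0):
    """Train n_models base learners on bootstrap samples and average their 0/1 votes."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Note: with imbalanced data a bootstrap sample may contain no
        # minority instances, in which case that model can only predict
        # the majority class.
        model = fit_class_means(bootstrap_sample(data, rng))
        votes.append(predict_one(model, x))
    return sum(votes) / len(votes)  # fraction of models voting for class 1

# Toy imbalanced data: (feature, label) pairs, 10 majority (0) and 3 minority (1).
data = [(1.0 + 0.1 * i, 0) for i in range(10)] + [(4.0, 1), (4.2, 1), (4.4, 1)]

score = bagging_predict(data, 4.1)
print(score)
```

The returned score is the average of the individual models' votes, so a value above 0.5 means most bootstrap-trained models assigned the query to class 1.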
