Advances in Algorithms for Re-Sampling Class-Imbalanced Educational Data Sets

William Rivera (Institute for Simulation and Training, University of Central Florida, USA), Amit Goel (Institute for Simulation and Training, University of Central Florida, USA) and J Peter Kincaid (Institute for Simulation and Training, University of Central Florida, USA)
DOI: 10.4018/978-1-4666-9983-0.ch002

Real-world data sets often contain disproportionate sample sizes across observed groups, which makes accurate prediction difficult for predictive analytics algorithms. One of the many ways to combat the inherent bias of class-imbalanced data is re-sampling. In this chapter we discuss popular re-sampling methods proposed in the research literature, such as the Synthetic Minority Over-sampling Technique (SMOTE) and Propensity Score Matching (PSM). We provide insight into recent advances and present our own novel algorithms under the umbrella term Over-sampling Using Propensity Scores (OUPS). Using simulation, we conduct experiments that show statistical improvements in accuracy and sensitivity when these new algorithmic approaches are used.
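To make the idea of synthetic over-sampling concrete, the following is a minimal sketch of SMOTE-style interpolation: each new minority point is placed at a random position on the line segment between an existing minority point and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters and the toy data are illustrative assumptions; this is not the chapter's OUPS algorithm and not a reference SMOTE implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority points.

    Hypothetical helper for illustration only: a SMOTE-style sketch,
    not the OUPS method described in the chapter.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (base excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (assumed data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority points, the new observations stay inside the region already occupied by the minority class.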
Chapter Preview


With real-world data it is often difficult to create prediction models that are highly accurate. In classification problems the groups, or classes, are rarely equally represented: there is typically a large disparity in the number of observations collected from each. This makes it difficult to accurately predict group membership for new data. This disparity between groups is called class imbalance.

Class imbalance is a common property of real-world data sets. The issue is that a classifier tends to assign new observations to the over-represented, or majority, group because of the inherent bias. The problem intensifies at larger levels of imbalance, which are most commonly found in observational studies. Extreme cases of class imbalance are commonly found in fraud detection, mammography for detecting cancerous cells and post-term births. Reported cases of imbalance have been as extreme as 100,000 to 1 (Chawla, 2005; D’Agostino, 1998; Mendes-Moreira & Soares, 2012; Tian, Gu, & Liu, 2010).

Another inherent problem in class-imbalanced classification is that the classifier will usually report high overall prediction accuracy: the underrepresented group is so small that misclassifying its observations has no noticeable impact on the accuracy score, which effectively nullifies their misclassification cost. In most cases, however, the target of interest is the underrepresented group, so the result is poor predictive performance on exactly the class that matters.
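A small worked example may help show why accuracy alone is misleading here. The sketch below assumes a hypothetical data set of 990 majority and 10 minority observations and a classifier that always predicts the majority class; the function name and the numbers are illustrative, not taken from the chapter's experiments.

```python
def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for binary labels.

    Sensitivity is the fraction of actual positives (here the minority
    class) that the classifier correctly identifies.
    """
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    sensitivity = true_pos / actual_pos if actual_pos else float("nan")
    return accuracy, sensitivity


# Assumed toy data: 990 majority (0) and 10 minority (1) observations
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 1000

accuracy, sensitivity = accuracy_and_sensitivity(y_true, y_pred)
print(accuracy, sensitivity)  # 0.99 0.0
```

The classifier is 99% accurate yet never identifies a single minority observation, which is why sensitivity is reported alongside accuracy for imbalanced data.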

The first major study to evaluate class imbalance was conducted in 2000. Japkowicz performed experiments on 125 randomly synthesized data sets (drawn from a uniform distribution) with varying degrees of complexity, training set size and imbalance, in order to identify factors that affect learning from class-imbalanced data. Using multilayer perceptron networks, the study found, first, that domains containing linearly separable data did not suffer misclassification from imbalance; second, that the degree of complexity increases with the level of imbalance; and lastly, that the error rate depends on the proportion of imbalance.

Further studies followed suit, highlighting additional reasons why classifiers perform poorly. These include inappropriate metrics for highly class-imbalanced data, a lack of generalization of classification rules for minority examples and the treatment of minority examples as noise. Intrinsic data properties that perpetuate the class imbalance problem include the degree of class imbalance, the complexity of the target concept and the classifier involved (Fernández, García, & Herrera, 2011; X. Guo, Yin, Dong, Yang, & Zhou, 2008; Japkowicz, 2000; López, Fernández, García, Palade, & Herrera, 2013; R. Prati, Batista, & Monard, 2004; R. C. Prati, Batista, & Monard, 2004; Weiss, 2010a, 2010b). The next few sections provide a further overview of these characteristics.
