Introduction
The real-time data accumulated in society through day-to-day activities such as credit card transactions, patient health records, failures in a manufacturing unit, medical diagnosis, detection of oil spills, and text classification are typically overlapped and class imbalanced in nature (Sumana, 2016). On an imbalanced dataset, a classifier tends to misclassify minority class instances because it is biased toward the heavily represented majority class, so its performance degrades. This misclassification occurs most frequently in the overlapping region, and high-dimensional data is the main cause of class overlap. Class imbalance alone is therefore not the crucial problem; it is the combination of class imbalance, class overlap, and high dimensionality that degrades the performance of the classifier (Sumana, 2016).
A dataset is said to be imbalanced if its classes are not represented in equal proportion. The class with the larger number of instances is called the majority class, and the class with fewer instances is called the minority class. Class imbalance makes classification difficult because the classifier becomes biased toward the majority class: it does not receive enough information about the minority class to make accurate predictions, and it tends to treat minority class instances as noise. The result is a poor classification rate on the minority class and degraded overall performance. A balanced dataset is therefore desirable for building a good prediction model, as most classifiers perform well when the number of instances in each class is approximately equal (Guo, 2016).
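The definitions above can be made concrete by counting instances per class. The following is a minimal sketch, assuming the class labels are available as a plain Python list; the function name `imbalance_ratio` and the example labels are illustrative and do not come from the article.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority_label, minority_label, majority_count / minority_count)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    ordered = counts.most_common()  # sorted from most to least frequent
    majority_label, majority_count = ordered[0]
    minority_label, minority_count = ordered[-1]
    return majority_label, minority_label, majority_count / minority_count


# Hypothetical example: 95 legitimate transactions and 5 fraudulent ones.
labels = ["legit"] * 95 + ["fraud"] * 5
print(imbalance_ratio(labels))  # ('legit', 'fraud', 19.0)
```

A ratio close to 1 indicates an approximately balanced dataset; the larger the ratio, the more heavily the majority class dominates.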
When samples from different classes have similar characteristics, they do not form separate clusters and are not linearly separable; instead, some samples overlap in the data space and are known as overlapping samples. Class imbalance is not a crucial problem by itself, but the combination of class overlap with class imbalance poses a new challenge and is a cause of degraded classifier performance. Liu (2008) stated that the overlapping region contains data from more than one class and that misclassification often occurs near the class boundaries where overlap is present. Aida Ali (2015) suggested that high dimensionality with redundant or irrelevant features makes it difficult for the classifier to recognize the class boundaries and is hence one of the causes of class overlap.
Methods to Address Class Imbalance
Methods to overcome class imbalance fall into two categories: the data level approach and the algorithmic level approach. The data level approach modifies the data, balancing it with sampling methods or synthetic data generation methods so that the classifier is not biased toward the majority class. The algorithmic level approach instead modifies the classifier itself to overcome the bias toward majority class objects.
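The article does not specify how the classifier is modified in the algorithmic level approach. One common illustration, given here only as an assumption, is cost-sensitive weighting: each class receives a weight inversely proportional to its frequency, so errors on the minority class cost more. The helper names `balanced_class_weights` and `weighted_error` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Error rate in which each instance counts with the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical data: 90 majority instances, 10 minority instances,
# and a trivial classifier that always predicts the majority class.
y_true = ["maj"] * 90 + ["min"] * 10
y_pred = ["maj"] * 100

weights = balanced_class_weights(y_true)
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_error)                                        # 0.1
print(round(weighted_error(y_true, y_pred, weights), 3))  # 0.5
```

Under the plain error rate the trivial classifier looks 90% accurate; under the weighted error it is no better than chance, which is the kind of signal a cost-sensitive learner would optimize against.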
Data Level Approach
Sampling methods are further divided into over sampling, under sampling, and hybrid methods. Under sampling methods balance the class distribution by randomly eliminating samples of the majority class while retaining the minority class samples. Over sampling methods balance the class distribution by randomly replicating existing samples of the minority class while retaining the majority class samples. Hybrid methods combine both: they balance the class distribution by randomly eliminating majority class samples and replicating minority class samples.
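The three random sampling strategies described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the choice of the midpoint between the smallest and largest class sizes as the hybrid target is an assumption, since the article does not fix a target size.

```python
import random
from collections import defaultdict


def _group_by_class(samples, labels):
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def _resample_to(groups, target, rng):
    """Shrink or grow every class to exactly `target` samples."""
    out_x, out_y = [], []
    for label, items in groups.items():
        if len(items) > target:
            # Under sampling: randomly drop samples, without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over sampling: keep all samples and replicate random ones.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        out_x.extend(chosen)
        out_y.extend([label] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    groups = _group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    return _resample_to(groups, target, random.Random(seed))


def random_oversample(samples, labels, seed=0):
    groups = _group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    return _resample_to(groups, target, random.Random(seed))


def random_hybrid(samples, labels, seed=0):
    groups = _group_by_class(samples, labels)
    sizes = [len(g) for g in groups.values()]
    target = (min(sizes) + max(sizes)) // 2  # assumed midpoint target
    return _resample_to(groups, target, random.Random(seed))
```

With 90 majority and 10 minority samples, under sampling yields 10 per class, over sampling yields 90 per class, and this hybrid variant yields 50 per class. Under sampling discards information from the majority class, while over sampling by replication adds no new information and can encourage overfitting to the duplicated points.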
Synthetic data generation methods artificially generate data, using bootstrapping or k-nearest neighbours (KNN), to balance the class distribution. Examples include ROSE, ADASYN, SMOTE, MSMOTE, Borderline-SMOTE, SMOTE-TL, SMOTE-E, and selective pre-processing of imbalanced data (SPIDER).
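To illustrate the KNN-based idea behind this family of methods, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, SMOTE-style illustration under stated assumptions (numeric features stored as tuples, Euclidean distance, a hypothetical function name `smote_like`), not the reference implementation of any method listed above.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of numeric minority tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))  # 6
```

Because each synthetic point lies on a segment between two existing minority samples, the new data stays inside the region already occupied by the minority class instead of duplicating points exactly, as random over sampling does.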