There have been several strategies for handling imbalanced data sets. The first approach resizes the training set, either by over-sampling minority-class examples or by under-sampling majority-class ones. Drummond and Holte (2003) test these two methods and find that under-sampling outperforms over-sampling, because over-sampling does not add information but does lead to overfitting, which degrades classifier performance. Chawla, Hall, Bowyer, and Kegelmeyer (2002) use the SMOTE method to balance the data sets, which is applicable when the data sets are highly imbalanced or there are very few examples of the minority class. Yet this technique generates a large amount of synthetic data for both minority- and majority-class cases, which makes it unsuitable for data stream environments.

The second approach emphasizes cost-sensitive learning (Dietterich, Margineantu, Provost, & Turney, 2000; Elkan, 2001). In many real applications, such as credit fraud detection and medical diagnosis, different wrong decisions incur very different costs. Assigning different cost factors to false negatives and false positives therefore leads to better performance with respect to the positive class (Chawla, Japkowicz, & Kolcz, 2004).

The third approach is the ensemble approach, which combines the predictions of a set of individually trained classifiers to classify new instances. Hongyu and Herna (2004) use boosting and a data-generation method to improve the performance of skewed data set mining. Chen, Liaw, and Breiman (2004) use random forests to learn skewed data sets. However, these ensemble methods are suitable only for ordinary data sets.
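The under-sampling strategy discussed above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited papers; the function name and the convention that label 1 marks the minority class are assumptions for the example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Balance a binary data set by randomly discarding majority examples.

    Assumes y holds labels 0 (majority) and 1 (minority); after resampling,
    both classes have as many examples as the original minority class.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    # Keep a random majority subset the same size as the minority class.
    keep = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Note that, as Drummond and Holte observe, this discards majority-class information rather than duplicating minority examples, which is exactly why it avoids the overfitting risk of naive over-sampling.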
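The core idea of SMOTE (Chawla et al., 2002) is to synthesize new minority examples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below captures that interpolation step only; the function name, the brute-force neighbour search, and the parameter choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """SMOTE-style synthesis sketch: for each new point, pick a minority
    example, choose one of its k nearest minority neighbours, and place a
    synthetic point a random fraction of the way toward that neighbour.
    """
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Brute-force nearest neighbours among minority points only.
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # random position along the line segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority examples, the new data stays inside the minority region, which is what distinguishes SMOTE from simple duplication.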
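The cost-sensitive idea can be made concrete with the standard expected-cost decision rule (in the spirit of Elkan, 2001): predict the class that minimises expected misclassification cost given the classifier's probability estimate. The function below is a hedged sketch with illustrative names, not code from the cited work.

```python
def cost_sensitive_decision(p_pos, cost_fn, cost_fp):
    """Predict 1 (positive) or 0 (negative) to minimise expected cost.

    p_pos   : estimated probability that the example is positive.
    cost_fn : cost of a false negative (missing a positive, e.g. fraud).
    cost_fp : cost of a false positive (a false alarm).

    Predicting negative risks cost_fn with probability p_pos; predicting
    positive risks cost_fp with probability (1 - p_pos). This is equivalent
    to lowering the decision threshold to cost_fp / (cost_fp + cost_fn).
    """
    return 1 if p_pos * cost_fn > (1 - p_pos) * cost_fp else 0
```

For example, with a false-negative cost 50 times the false-positive cost, an example is flagged positive even at a 20% estimated probability, which is how cost factors shift performance toward the rare positive class.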