Mining Data Streams with Skewed Distribution based on Ensemble Method

Yi Wang (College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China)
DOI: 10.4018/japuc.2012100106

In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams and cannot handle skewed distributions (e.g., few positives but many negatives) well, even though such distributions are typical in many data stream applications. In this paper, we propose an ensemble- and cluster-based sampling method to deal with this situation. The study shows that this method is effective for mining skewed data streams.
Article Preview

Several strategies have been proposed for handling imbalanced data sets. The first approach resizes the training set by over-sampling minority-class examples and under-sampling majority-class ones. Drummond and Holte (2003) test these two methods and find that under-sampling outperforms over-sampling: over-sampling adds no new information but does lead to overfitting, which degrades classifier performance. Chawla, Hall, Bowyer, and Kegelmeyer (2002) use the SMOTE method to balance the data sets, which is applicable when the data sets are highly imbalanced or contain very few minority-class examples. Yet this technique employs a lot of synthetic data for both minority- and majority-class cases, which makes it unsuitable for a data stream environment. The second approach emphasizes cost-sensitive learning (Dietterich, Margineantu, Provost, & Turney, 2000; Elkan, 2001). In many real applications, such as credit fraud detection and medical diagnosis, different wrong decisions usually carry very different costs. Assigning different cost factors to false negatives and false positives therefore leads to better performance on the positive class (Chawla, Japkowicz, & Kolcz, 2004). The third approach, the ensemble approach, consists of a set of individually trained classifiers whose predictions are combined to classify new instances. Hongyu and Herna (2004) use boosting and data generation to improve performance when mining skewed data sets. Chen, Liaw, and Breiman (2004) use random forests to learn from skewed data sets. However, these ensemble methods are suitable only for ordinary data sets.
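
To make the resampling and cost-sensitive ideas above concrete, the following is a minimal Python sketch. It is not the method proposed in this paper. All function names are hypothetical, the SMOTE-like routine is a simplification that interpolates toward a single nearest minority neighbour (k = 1), and the cost-sensitive rule assumes that correct decisions cost nothing, in which case the minimum-expected-cost threshold on the positive-class probability is cost_fp / (cost_fp + cost_fn).

```python
import random


def undersample_majority(majority, minority, rng):
    """Randomly keep only as many majority examples as there are minority ones."""
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return kept, list(minority)


def oversample_minority(majority, minority, rng):
    """Duplicate randomly chosen minority examples until both classes are the same size.

    No new information is added: every extra example is a copy.
    """
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra


def smote_like(minority, n_new, rng):
    """Create synthetic minority points by interpolation (simplified, k = 1).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and its nearest other minority point.
    Assumes at least two minority points.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Nearest other minority point by squared Euclidean distance.
        j = min(
            (k for k in range(len(minority)) if k != i),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, minority[k])),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, minority[j])))
    return synthetic


def cost_sensitive_label(p_positive, cost_fp, cost_fn):
    """Predict 1 (positive) when doing so has the lower expected cost.

    Assumes correct decisions cost nothing, so the decision threshold
    is cost_fp / (cost_fp + cost_fn).
    """
    threshold = cost_fp / (cost_fp + cost_fn)
    return 1 if p_positive > threshold else 0
```

With a false negative costing nine times as much as a false positive, the threshold drops from 0.5 to 0.1, so an instance with an estimated positive probability of only 0.2 is still labeled positive. The over-sampling routine only copies existing examples, which illustrates why it adds no information to the training set.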
