A Survey of Class Imbalance Problem on Evolving Data Stream

D. Himaja (Vignan's Foundation for Science, Technology and Research (Deemed), Guntur, India), T. Maruthi Padmaja (Vardhaman College of Engineering, Hyderabad, India) and P. Radha Krishna (National Institute of Technology, Warangal, India)
DOI: 10.4018/978-1-7998-7371-6.ch002

Learning from data streams that exhibit both online class imbalance and concept drift (OCI-CD) is receiving growing attention. The combined problem degrades the performance of current models, whether they learn from stationary or non-stationary environments. In non-stationary environments, imbalance makes concept drift hard to spot with conventional drift detection methods, which track change through the learner's performance. Little work addresses the combined problem of imbalanced evolving streams in both stationary and non-stationary environments. The data may arrive fully labeled or with only limited labels. This chapter surveys methods for handling class imbalance in evolving streams, covering stationary and non-stationary environments under both fully supervised and limited-label settings.
Chapter Preview


Real-world classification problems such as fraud and fault detection are characterized by continuously evolving, imbalanced streams from non-stationary environments. The combined problem of class imbalance and concept drift hinders the success of online learners. Learning from imbalanced evolving streams poses a different set of problems than learning from balanced ones. Stream learning, the extraction of information from continuous, rapid data records, is already challenging in its own right because of unbounded stream size, varying speed, and concept drift. The evolving data may be completely or partially labeled.
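One way to see why the combination of imbalance and drift is troublesome for performance-based monitoring is a small piece of arithmetic. The sketch below is illustrative only: the 1% minority share and all per-class error rates are assumed values, not figures from the chapter, and `overall_error` is a hypothetical helper. It shows that a large change in the minority concept can move the overall error rate only slightly.

```python
def overall_error(minority_share, majority_error, minority_error):
    """Overall error rate as the class-prior-weighted mix of per-class errors."""
    return (1 - minority_share) * majority_error + minority_share * minority_error


# Assumed values for illustration only (not taken from the chapter).
MINORITY_SHARE = 0.01          # 1% of arriving instances are minority
MAJORITY_ERROR = 0.05          # unchanged before and after the drift
MINORITY_ERROR_BEFORE = 0.10   # minority concept learned well
MINORITY_ERROR_AFTER = 0.90    # minority concept has drifted

error_before = overall_error(MINORITY_SHARE, MAJORITY_ERROR, MINORITY_ERROR_BEFORE)
error_after = overall_error(MINORITY_SHARE, MAJORITY_ERROR, MINORITY_ERROR_AFTER)

# Shift seen by a detector that watches only the overall error rate.
error_shift = error_after - error_before

# Shift in minority-class recall (recall = 1 - minority error).
recall_shift = (1 - MINORITY_ERROR_BEFORE) - (1 - MINORITY_ERROR_AFTER)

print(f"overall error: {error_before:.4f} -> {error_after:.4f}")
print(f"overall error shift: {error_shift:.4f}, minority recall drop: {recall_shift:.2f}")
```

Under these assumptions the minority recall falls by 0.8 while the overall error rises by less than one percentage point, which is easy to mistake for noise.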

Class Imbalance Learning

On standalone training sets, the class imbalance learning (CIL) problem arises when one class vastly outnumbers the others, so that performance on the underrepresented class suffers. Instances of the overrepresented class tend to be classified correctly, whereas instances of the underrepresented class tend to be misclassified. This scenario is common in applications such as fraud and fault detection. A smart building is one example: it has sensors to detect risky conditions, and any sensor fault can cause great damage. Consider a smart-building dataset of 5,000 instances with two classes, faulty and good, in which only 1% of the instances are faulty and the remaining 99% are good. A model can be built on this kind of dataset to predict sensor faults, but predicting them is difficult because faulty conditions are far rarer than good ones. Figure 1 illustrates class imbalance with two classes, blue and yellow: the blue class makes up 70% of the dataset and the yellow class the remaining 30%.
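The smart-building numbers can be worked through directly. The following minimal sketch uses the dataset size and class proportions from the example; the "always predict good" baseline is a hypothetical classifier added here only to show why accuracy alone hides the minority class.

```python
# Smart-building example: 5,000 instances, 1% faulty, 99% good.
n_total = 5000
n_faulty = n_total // 100            # 50 faulty instances
n_good = n_total - n_faulty          # 4,950 good instances
imbalance_ratio = n_good / n_faulty  # majority-to-minority ratio

# Hypothetical baseline that ignores the input and always predicts "good".
y_true = ["faulty"] * n_faulty + ["good"] * n_good
y_pred = ["good"] * n_total

# Fraction of all instances labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Fraction of faulty instances that were actually detected.
detected_faults = sum(t == "faulty" and p == "faulty" for t, p in zip(y_true, y_pred))
faulty_recall = detected_faults / n_faulty

print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
print(f"accuracy: {accuracy:.2f}, faulty recall: {faulty_recall:.2f}")
```

The baseline reaches 99% accuracy while detecting none of the 50 faults, which is why per-class measures such as minority recall are usually reported alongside accuracy on imbalanced data.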

Figure 1. Class imbalance


Several researchers have examined how different classifiers behave under data characteristics such as class imbalance (CI), class overlap, minority-class disjuncts, and dataset size, and concluded that most classifiers favor the majority class. Solutions have been proposed at both the data level and the algorithm level (Haibo & Garcia, 2009; Sun et al., 2009). Data-level solutions balance the class distribution by oversampling or undersampling. Algorithm-level solutions shift the decision boundary in favor of the underrepresented class by applying additional costs or weights to the parameters that represent the minority class. Most of these strategies are driven by the degree of imbalance, i.e., the majority-to-minority class ratio, which can be calculated directly from the data. The literature also proposes many combinations of data-level and algorithm-level approaches, as well as ensembles of them (Galar et al., 2012).
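The two families of solutions can be sketched in a few lines. The code below is a minimal illustration, not the method of any cited work: it uses plain random oversampling and undersampling for the data level and a common inverse-frequency weighting heuristic for the algorithm level. The function names and the 90/10 toy dataset are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Data level: duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Data level: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Algorithm level: per-class weight n / (k * n_c), so rarer classes cost more to misclassify."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Toy dataset (assumed for illustration): 90 "good" and 10 "faulty" instances.
X = list(range(100))
y = ["good"] * 90 + ["faulty"] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
weights = inverse_frequency_weights(y)

print(Counter(y_over))   # both classes at 90 instances
print(Counter(y_under))  # both classes at 10 instances
print(weights)           # the faulty class receives the larger weight
```

Both the resampling targets and the weights here are derived from the class counts alone, which reflects the point that these strategies are driven by the degree of imbalance in the data.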
