1. Introduction
A key assumption in supervised learning is that the training data and the testing (or operational) data come from the same distribution. Under this assumption the training data is representative, and the classifier is expected to perform well on future unseen instances. However, if the statistical properties of the target variable that the model is trying to predict change over time while the same classifier remains in use, its predictions will no longer be accurate. In machine learning, this phenomenon of change in the data distribution over time is known as concept drift (Tsymbal, 2004). Concept drift has been listed as the tenth most challenging problem facing researchers in data mining and machine learning (Yang & Wu, 2006).
To illustrate the importance of this problem, consider a data mining application for spam filtering developed on the latest available spam dataset. Once this filter has adapted to today's types of spam emails, spammers will try to bypass it by disguising their emails to look more legitimate. New spam patterns will emerge that the current filter can only classify approximately, and over time this leads to degraded accuracy, poor performance, and outdated knowledge. The dynamic nature of spam email therefore requires that any filter that is to remain successful in identifying spam be updated over time (Delany, Cunningham, Tsymbal, & Coyle, 2005).
The main difficulty in mining non-stationary data such as spam, intrusions, stock markets, weather, and customer preferences is coping with the changing data concept. The underlying processes generating most real-time data may change over years, months, or even seconds, at times drastically. Effective learning in environments with hidden contexts and concept drift requires a learning algorithm that can detect context changes without being explicitly informed about them, recover quickly from a context change and adjust itself to the new context, and make use of previous experience when old contexts and their corresponding concepts reappear (Nishida & Yamauchi, 2009).
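The first of these requirements, detecting a context change without being told, can be made concrete with a minimal sketch. The detector below tracks a sliding window of prediction errors and signals drift when the recent error rate rises well above the long-run rate; the class name, window size, and threshold are illustrative assumptions, not the detection scheme used in this paper.

```python
from collections import deque

class ErrorRateDriftDetector:
    """Signals concept drift when the recent error rate greatly exceeds
    the long-run error rate. A simplified, generic sketch."""

    def __init__(self, window=30, threshold=2.0):
        self.window = window        # number of recent outcomes to compare
        self.threshold = threshold  # recent rate must exceed baseline by this factor
        self.recent = deque(maxlen=window)
        self.total_errors = 0
        self.total_seen = 0

    def update(self, error):
        """Feed one prediction outcome (1 = misclassified, 0 = correct).
        Returns True if a context change is signalled."""
        self.recent.append(error)
        self.total_errors += error
        self.total_seen += 1
        if self.total_seen < self.window:
            return False            # not enough history to compare yet
        baseline = self.total_errors / self.total_seen
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate > self.threshold * max(baseline, 1e-9)
```

Under a stable concept the recent and long-run error rates stay close, so no drift is signalled; a sudden jump in misclassifications pushes the recent rate past the threshold and triggers the signal.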
In this research, we aim to contribute to solving the problem of concept drift in supervised learning when true labels become known only after a certain delay. The work presented in this paper is based on a training set formation strategy, which reforms the training set whenever concept drift is detected. Training set formation methods have an advantage over other adaptivity methods: they do not require complicated parameterization, and they can be used for online learning while plugging in different types of base classifiers. Our contribution can be summarized as follows:
- We introduce the Adaptive Training Set Formation for Delayed Labeling algorithm (SFDL), which is based on selective training set formation. To our knowledge, it is the first systematic training set formation approach that takes the delayed labeling problem into account. The algorithm can be used with any base classifier without changing the classifier's implementation or settings;
- We test our implementation on synthetic and real datasets from various domains exhibiting different drift types (sudden, gradual, incremental, recurring) at different speeds of change. Experimental evaluation confirms an improvement in classification accuracy over an ordinary classifier for all drift types.
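To give intuition for the selective strategy behind this family of methods, the sketch below shows one generic way to reform a training set after drift is signalled: train a reference model on the newest labeled batch and keep only those stored batches it still classifies well. This is an illustration of the general idea, not the SFDL algorithm itself; the function name, batch format, and agreement threshold are assumptions made for the example.

```python
def select_training_batches(batches, newest_batch, train, accuracy,
                            min_agreement=0.7):
    """Generic selective training set formation sketch.

    batches      -- list of historical (X, y) batches
    newest_batch -- the most recently labeled (X, y) batch
    train        -- callable building a classifier from (X, y)
    accuracy     -- callable scoring a classifier on an (X, y) batch

    Returns the batches consistent with the current concept,
    always including the newest one.
    """
    reference = train(*newest_batch)
    # Keep old batches the current-concept model still agrees with.
    kept = [b for b in batches if accuracy(reference, b) >= min_agreement]
    kept.append(newest_batch)
    return kept
```

Batches generated under an old, contradictory concept score below the agreement threshold and are dropped, so the retrained classifier is fitted only on data that reflects the current concept.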
The rest of the paper is organized as follows: Section 2 presents related work and gives introductory background on the main topics of this research, namely the concept drift problem and the detectability of concept drift when labeling is delayed. Section 3 defines the training set formation strategy and summarizes the main contributions of our research. Section 4 describes our methodology and the proposed algorithms. Experimental results are discussed in Section 5. Finally, Section 6 concludes the paper.