This chapter presents an award-winning algorithm for the data mining competition of PAKDD 2007, in which the goal is to help a financial company predict the likelihood that its credit-card customers will take up a home loan. The available data are very limited and characterized by a very low buying rate. To tackle such an unbalanced classification problem, the authors apply a bagging algorithm based on probit model ensembles. One integral element of the algorithm is a special way of conducting the resampling when forming bootstrap samples. A brief justification is provided. This method offers a feasible and robust way to solve this difficult yet very common business problem.
Bagging With Weighted Resampling
The common approach to unbalanced classification is to modify the weights, borrowing the idea from retrospective designs (see, e.g., Agresti, 1990). This amounts to either decreasing the weight of the majority class by under-sampling or increasing the weight of the minority class by over-sampling. However, how to adjust the weights is quite an art. In the following, we shall present our procedure with justification and compare it with some alternative approaches.
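The weight-modification idea above can be sketched as a weighted bootstrap: each minority-class case is made several times as likely to be drawn as a majority-class case. This is a minimal illustration only; the function name `weighted_bootstrap`, the weight value, and the toy data are assumptions for the sketch, not the chapter's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced outcomes: Class 1 (the "buyers") at roughly a 5% rate.
y = (rng.random(1000) < 0.05).astype(int)

def weighted_bootstrap(y, minority_weight, rng):
    """Draw one bootstrap sample, re-weighting the minority class.

    Each Class-1 case is `minority_weight` times as likely to be drawn
    as a Class-0 case. A weight above 1 over-samples the minority class;
    the same mechanism with a weight below 1 under-samples it.
    """
    w = np.where(y == 1, minority_weight, 1.0)
    p = w / w.sum()
    idx = rng.choice(len(y), size=len(y), replace=True, p=p)
    return idx

idx = weighted_bootstrap(y, minority_weight=10.0, rng=rng)
print("original Class-1 rate: ", y.mean())
print("resampled Class-1 rate:", y[idx].mean())
```

With a weight of 10 and a base rate near 5%, the resampled Class-1 rate rises to roughly a third of the sample, illustrating how the weight choice governs the post-resampling class balance.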
To proceed, we first introduce some notation to set up the problem. Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote the training sample, where $y_i$ is the $i$-th binary 0-1 outcome with Class 1 severely underrepresented and $x_i$ is the associated input vector. Let $\{x_i\}_{i=n+1}^{n+m}$ denote the test sample, which contains the input information only. Let $P$ denote the distribution underlying the data. What is under modeling is the conditional probability that $y$ is equal to 1 conditioning on $x$, i.e., $p(x) = P(y = 1 \mid x)$.