The CDC Binning algorithm improves upon another discretization technique called Distribution Skew-based Binning (DS Binning), introduced in Skapura and Dong (2015). In that previous work, we proposed the DS Binning technique, which was also built on the class distribution curve. Importantly, CDC Binning provides significant improvements over DS Binning by simplifying the technical aspects of the algorithm (including the bin formation process), reducing the number of parameters, and generalizing the class distribution curve to include other measures of class purity. In Skapura and Dong (2015), we applied the DS method exclusively to EEG/EMG time series datasets; in this paper, we demonstrate that the class distribution curve concept can be applied to other types of data as well.
We now provide a high-level comparison of CDC Binning and other well-known binning methods. First, neither Equi-Width nor Equi-Density Binning uses class information in forming the bins. Moreover, while Entropy-based Binning, Fayyad and Irani (1993), uses class information, it uses only the aggregate purity of each entire candidate interval when forming intervals. In contrast, CDC Binning uses the entire class distribution curve (based on the class ratios over localized sliding windows) to find optimal bin boundaries. CDC Binning can therefore be viewed as a generalization of Entropy-based Binning, since it makes fuller use of class purity information.
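To make the class distribution curve concrete, the following is a minimal sketch of computing class ratios over localized sliding windows of a sorted numeric attribute. The function name, the `window` parameter, and the toy data are illustrative assumptions for exposition only; they are not the authors' implementation of CDC Binning.

```python
# Illustrative sketch (hypothetical helper, not the authors' code):
# a class distribution curve as the fraction of one class inside each
# sliding window over the attribute values, sorted in ascending order.

def class_distribution_curve(values, labels, target, window=5):
    """Return (position, ratio) pairs: the fraction of `target`-class
    points in each sliding window, anchored at the window's center value."""
    pairs = sorted(zip(values, labels))
    curve = []
    for i in range(len(pairs) - window + 1):
        win = pairs[i:i + window]
        ratio = sum(1 for _, c in win if c == target) / window
        center = win[window // 2][0]  # anchor ratio at the middle value
        curve.append((center, ratio))
    return curve

# Toy data: class "a" dominates low values, class "b" high values.
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labs = ["a", "a", "a", "a", "b", "a", "b", "b", "b", "b"]
curve = class_distribution_curve(vals, labs, target="a", window=5)
```

Where the resulting curve drops sharply (here, between values 5 and 7), a class-aware discretizer would place a bin boundary; purely unsupervised methods such as Equi-Width Binning ignore this signal entirely.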
As will be discussed later, CDC Binning outperforms other methods when the class distribution curve is complicated, although traditional methods such as Entropy-based Binning, Fayyad and Irani (1993), give very good performance when the class distribution curve is simple.