A Disk-Based Algorithm for Fast Outlier Detection in Large Datasets
Faxin Zhao (Northeastern University, China and Tonghua Teachers College, China), Yubin Bao (Northeastern University, China), Huanliang Sun (Northeastern University, China) and Ge Yu (Northeastern University, China)
Copyright: © 2007
Outlier detection is an important research issue in data mining. In the cell-based disk algorithm, the number of cells increases exponentially with the dimensionality of the data, and performance degrades dramatically as the number of cells and data points grows. Further analysis shows that many of these cells are empty and contribute nothing to outlier detection. This chapter therefore proposes a novel index structure, called CD-Tree, in which only non-empty cells are stored, and adopts a cluster technique that stores the data objects belonging to the same cell in linked disk pages. Experiments were conducted to test the performance of the proposed algorithms. The results show that the disk algorithm based on the CD-Tree structure and the cluster technique outperforms the cell-based disk algorithm and can process data of higher dimensionality.
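The idea of keeping only non-empty cells can be illustrated with a minimal sketch. It is not the CD-Tree itself, whose layout the abstract does not describe: a plain in-memory dictionary keyed by cell coordinates stands in for the index, and the names `assign_to_cells`, `dense_cell_count`, and `cell_width` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from math import floor


def assign_to_cells(points, cell_width):
    """Group points by the grid cell containing them.

    Only cells that receive at least one point get an entry, so empty
    cells cost nothing. (Assumption: a dict stands in for the index.)
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell_width) for x in p)
        cells[key].append(p)  # objects of the same cell are kept together
    return dict(cells)


def dense_cell_count(points, cell_width):
    """Number of cells a full grid over the bounding box would allocate."""
    count = 1
    for d in range(len(points[0])):
        lo = floor(min(p[d] for p in points) / cell_width)
        hi = floor(max(p[d] for p in points) / cell_width)
        count *= hi - lo + 1
    return count


points = [(0.1, 0.2, 0.1), (0.15, 0.25, 0.12), (3.9, 3.8, 3.7), (0.9, 0.1, 0.3)]
cells = assign_to_cells(points, cell_width=1.0)
print(len(cells), "non-empty cells out of", dense_cell_count(points, 1.0))
```

In this toy three-dimensional example the full grid would allocate far more cells than are actually occupied, and the gap widens as dimensions are added, which is the motivation the abstract gives for storing non-empty cells only.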