This paper revisits the problem of active learning and decision making when labeling incurs a cost and unlabeled data is available in abundance. In many real-world applications, large amounts of data are available, but the cost of correctly labeling them prohibits their use. In such cases, active learning can be employed. In this paper, the authors propose rough set based clustering using an active learning approach. The authors extend the basic notion of Hamming distance to propose a dissimilarity measure which helps in finding the approximations of clusters in the given data set. The underlying theoretical background for this decision is rough set theory. The authors have investigated their algorithm on benchmark data sets from the UCI machine learning repository, which have shown promising results.
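As background for the dissimilarity measure mentioned above, the following minimal Python sketch shows the basic Hamming notion on categorical attribute vectors — the fraction of attributes on which two objects disagree. The example objects are hypothetical; the paper's extended measure builds on this notion, but its exact form is not reproduced here.

```python
def hamming_dissimilarity(x, y):
    """Basic Hamming dissimilarity: the fraction of attribute
    positions on which the two objects disagree. (A sketch of the
    baseline notion only; the paper's extension is not shown.)"""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Hypothetical categorical objects: they differ only on the second attribute.
d = hamming_dissimilarity(("red", "big", "round"), ("red", "small", "round"))
print(d)  # 1/3 of the attributes disagree
```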
1. Introduction
Clustering is one of the major techniques in data mining as well as an active research topic. It involves the identification of clusters (groups) in the data that are similar to each other. A cluster usually corresponds to points that are more similar to each other than they are to points from another cluster (Bagirov, Rubinov, Soukhoroukova, & Yearwood, 2003). Clustering is an unsupervised technique: it naturally exploits subsets of points with greater mutual similarity and requires no a priori information about the structure of the clusters. It is basically data driven, and it leads to the discovery of knowledge structures (clusters) which provide information about the data in each cluster. Clustering has been extensively studied in many fields such as pattern recognition, image segmentation, data visualization and similarity search, signal processing, and trend analysis (Jain, Murthy, & Flynn, 1999; Dy & Brodley, 2004). Many clustering algorithms exist in the literature (Gibson, Kleinberg, & Raghavan, 1998; Zhang, Fu, Cai, & Heng, 2000; Huang, 1997; Wang, Xu, & Liu, 1999; Huang, 1999).
Rough sets have been introduced as a tool to deal with inexact, imprecise or vague knowledge in real-world data (Gibson, Kleinberg, & Raghavan, 1998). They use the concepts of the lower and upper approximation of a set to handle uncertainty in information, and have been applied in many fields such as machine learning, document classification and so on (Xu, Xuedong, Sen, & Bin, 2006; Widzlzak & Revett, 2004; Chengdong, Mengxin, Zhonghua, Zhang, & Yong, 2004). Chen et al. have applied rough set theory to clustering analysis and introduced a decision attribute to better define the attribute membership matrix; they use a similarity measure based on Euclidean distance (Chen, Cui, Wang, & Wang, 2006). Wang has proposed a clustering technique based on the Olary code to transform nominal attributes into integers; it provides a useful way to estimate the number of underlying clusters (Wang, 2010). Wu et al. have given a hierarchical clustering method for attribute discretization; it yields significant clusters by determining the best classes for discretization, picked from scatter plots of several statistics (Chengdong, Mengxin, Zhonghua, Zhang, & Yong, 2004).
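To make the lower and upper approximation concrete, the following minimal Python sketch computes both for a target set of objects, given the equivalence classes of the indiscernibility relation induced by a set of attributes. The information table, attribute names, and target set are hypothetical illustrations, not taken from the paper.

```python
def indiscernibility_classes(table, attributes):
    """Partition the objects into equivalence classes of the
    indiscernibility relation: two objects fall in the same class
    if they agree on all of the chosen attributes."""
    classes = {}
    for name, values in table.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(name)
    return list(classes.values())

def approximations(target, classes):
    """Lower approximation: union of classes fully contained in the target
    (objects certainly in the concept).
    Upper approximation: union of classes that intersect the target
    (objects possibly in the concept)."""
    lower, upper = set(), set()
    for c in classes:
        if c & target:          # the class overlaps the target
            upper |= c
            if c <= target:     # the class lies entirely inside the target
                lower |= c
    return lower, upper

# Hypothetical information table with two categorical attributes.
table = {
    "x1": {"colour": "red",  "size": "big"},
    "x2": {"colour": "red",  "size": "big"},
    "x3": {"colour": "blue", "size": "big"},
    "x4": {"colour": "blue", "size": "small"},
}
classes = indiscernibility_classes(table, ["colour", "size"])
# x1 and x2 are indiscernible; x3 and x4 form singleton classes.
lower, upper = approximations({"x1", "x3"}, classes)
print(lower)  # {'x3'}           — certainly in the target concept
print(upper)  # {'x1','x2','x3'} — possibly in the target concept
```

The gap between the two approximations (here `{x1, x2}`) is the boundary region, which is exactly where rough set based clustering expresses uncertainty about cluster membership.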
Mitra et al. have introduced a novel clustering architecture which integrates the advantages of both fuzzy sets and rough sets. The algorithm aims to find a common structure revealed at the global level, determined by exchanging prototypes of the subsets of data and by moving the prototypes of the corresponding clusters toward each other (Mitra, Banka, & Pedrycz, 2006). Hirano and Tsumoto have proposed a method which eliminates the influence of the initial equivalence relations by introducing perfect initial equivalence relations defined by the original class of each object and their randomly mutated variants (Hirano & Tsumoto, 2006).