Clustering Categorical Data with k-Modes

Joshua Zhexue Huang (The University of Hong Kong, Hong Kong)
Copyright: © 2009 | Pages: 5
DOI: 10.4018/978-1-60566-010-3.ch040

Much of the data in real-world databases is categorical. For example, the gender, profession, position, and hobby of customers are usually defined as categorical attributes in a CUSTOMER table. Each categorical attribute takes its values from a small set of unique categories, such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Clustering algorithms designed for numeric data therefore cannot be applied directly to the categorical data found in many real-world applications. In data mining research, much effort has been devoted to developing new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical data sets. In the past decade, this algorithm has been well studied and widely used in various applications. It has also been adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc, http://www.

Main Focus

The k-modes clustering algorithm extends the standard k-means clustering algorithm to categorical data. It modifies k-means in three respects: the distance function, the representation of cluster centers, and the iterative clustering process (Huang, 1998).
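
The three modifications can be illustrated with a minimal Python sketch. The preview text does not spell out the details, so the choices here are assumptions rather than the chapter's exact formulation: distance is taken to be the number of mismatched attributes (simple matching), each cluster center is the attribute-wise mode of its members, and the loop alternates assignment and mode updates in batch, k-means style. The function names and the caller-supplied `init_modes` argument are hypothetical and used only for illustration.

```python
from collections import Counter


def matching_distance(a, b):
    """Count the attributes on which two records disagree (simple matching)."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent value of each attribute across the records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, init_modes, max_iter=100):
    """Cluster categorical records around modes; returns (modes, labels).

    Assumption: the caller supplies the initial modes, one tuple per cluster.
    """
    modes = list(init_modes)
    k = len(modes)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each record to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # no record changed cluster, so the partition is stable
        labels = new_labels
        # Update step: recompute each mode attribute by attribute.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return modes, labels
```

Replacing means with modes keeps every cluster center a valid combination of category values, which is why no numeric encoding of attributes such as gender or hobby is needed. Like k-means, a loop of this kind may settle on a local optimum that depends on the initial modes.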
