Feature-based semantic measurements have played a dominant role in conventional data clustering algorithms for many existing applications. However, the applicability of existing data clustering approaches to a wider range of applications is limited by issues such as the complexity of semantic computation, the long pre-processing time required for feature preparation, and the poor extensibility of semantic measurement caused by non-incremental feature sources. This chapter first summarises commonly used clustering algorithms and feature-based semantic measurements, and then highlights their shortcomings to motivate the proposal of an adaptive clustering approach based on featureless semantic measurements. The chapter concludes with experiments demonstrating the performance and wide applicability of the proposed clustering approach.
Data clustering is applicable to a wide range of areas, from ontology learning (Wong et al., 2006) and market research to pattern recognition and image processing (Jain et al., 1999). Depending on the area of application, many different names such as cluster analysis, numerical taxonomy, automatic classification, botryology and typological analysis have been devised to refer to essentially the same practice of data clustering. Each application of data clustering can be characterised by two aspects, namely, the manner through which the clusters are formed, and the criteria that govern the formation of clusters. The first aspect relates to the choice of clustering algorithm, while the second aspect dictates the type of semantic measure (i.e. similarity or relatedness) to be used. The choice of clustering algorithm and semantic measure depends very much on the data elements to be clustered, the computational constraints, and the desired results. In the past, data clustering has been particularly successful with certain types of data such as documents, software systems, lexical units, webpages, Uniform Resource Identifiers (URIs) and images. This has given rise to various sub-areas of clustering such as document clustering (Steinbach et al., 2000), software botryology (Tzerpos and Holt, 1998), term clustering (Wong et al., 2007), webpage clustering (Wang and Kitsuregawa, 2002), usage-based URI clustering (Mobasher et al., 1999) and clustering-based image segmentation (Jain et al., 1999).
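The separation between the two aspects, the clustering algorithm and the semantic measure, can be illustrated with a minimal sketch in which the measure is passed to the algorithm as a parameter. Everything in it is an assumption made for illustration and is not the approach proposed in this chapter: the function names `jaccard_similarity` and `threshold_clustering` are hypothetical, the measure is a simple word-overlap (Jaccard) score standing in for a real semantic measure, and the algorithm is a basic single-link grouping controlled by a similarity threshold.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in measure: Jaccard overlap of the two word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def threshold_clustering(
    items: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Single-link grouping: merge two clusters whenever any pair of
    elements drawn from them reaches the similarity threshold."""
    clusters = [{i} for i in range(len(items))]  # start with singletons
    merged = True
    while merged:
        merged = False
        for x, y in combinations(range(len(clusters)), 2):
            linked = any(
                similarity(items[i], items[j]) >= threshold
                for i in clusters[x]
                for j in clusters[y]
            )
            if linked:
                clusters[x] |= clusters[y]
                del clusters[y]
                merged = True
                break  # cluster indices changed, so restart the scan
    return [sorted(items[i] for i in cluster) for cluster in clusters]


if __name__ == "__main__":
    names = [
        "data clustering",
        "document clustering",
        "image segmentation",
        "image processing",
    ]
    print(threshold_clustering(names, jaccard_similarity, threshold=0.3))
```

Because the measure is only an argument, a different notion of similarity or relatedness could be substituted without touching the grouping logic, which is the sense in which the two aspects can be chosen independently.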
Key Terms in this Chapter
Uniform Resource Identifier (URI): A compact string of characters used to identify or name a resource.
Semantic Measure: A computational means of determining to what extent two elements are semantically related or similar.
Clustering: The process of discovering naturally-occurring groups of data elements.
Term: A word used in a domain-specific context.