Clustering is an unsupervised technique used in various applications, including machine learning, image segmentation, social network analysis, health analytics, and financial analysis. It is the task of grouping similar objects together and placing dissimilar objects in different groups. The quality of a clustering relies on two factors: the distance metric and the data representation. Deep learning is a field of machine learning research that was introduced to move machine learning closer to artificial intelligence. Learning with deep networks provides multiple layers of representation that help in understanding images, sound, and text. This chapter discusses the need for deep networks in clustering, along with various architectures and algorithms for unsupervised learning.
Introduction
Growth in digital data and storage methodologies has resulted in the collection of huge databases. Data mining, the process of extracting relevant and hidden information from large databases, is a powerful technology with great potential for analyzing meaningful information in data warehouses. The knowledge discovery task involves several steps: selection of target data, pre-processing, transformation, extraction of meaningful patterns, and interpretation of the discovered patterns. It helps companies predict future behavior and market trends, and thereby improve the performance of the organization.
The process of clustering involves grouping similar objects into one cluster and dissimilar objects into different clusters. The quality of clusters is evaluated using two measures: intra-cluster similarity and inter-cluster similarity. Intra-cluster similarity is the similarity among objects within a cluster, whereas inter-cluster similarity is the similarity between objects in different clusters. A clustering with high intra-cluster similarity and low inter-cluster similarity is considered to be of good quality. Clustering methods are classified into several types: partition-based, hierarchical, model-based, density-based, and graph clustering.
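The two quality measures can be illustrated with a minimal sketch. The chapter does not fix a similarity function, so the `similarity` helper below (1 / (1 + Euclidean distance)) and the averaging over all point pairs are assumptions made only for illustration, as are the function names and the sample points.

```python
from itertools import combinations
from math import dist


def similarity(a, b):
    # Assumed similarity measure: 1 / (1 + Euclidean distance), in (0, 1].
    return 1.0 / (1.0 + dist(a, b))


def cluster_similarities(points, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra, inter = [], []
    # Visit every unordered pair of points once and sort it into the
    # "same cluster" or "different clusters" bucket.
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(similarity(p, q))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(intra), mean(inter)


# Hypothetical data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = cluster_similarities(points, [0, 0, 1, 1])  # pairs grouped together
bad = cluster_similarities(points, [0, 1, 0, 1])   # pairs split apart
```

For the labeling that keeps each tight pair together, the intra-cluster value exceeds the inter-cluster value; for the labeling that splits the pairs, the relation is reversed, which matches the quality criterion described above.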
The main task in partition-based clustering is to divide a set of objects into disjoint groups such that the points within each group are similar. Given a set of n data points, the algorithm partitions them into clusters such that each data point is assigned to exactly one cluster. There are two types of hierarchical clustering: agglomerative (merge) and divisive (divide). In the agglomerative method, clusters are compared on the basis of distance. Initially, each object forms its own cluster, and then the clusters whose centroids are closest to each other are merged. The divisive method starts with a single cluster that contains all data points and iteratively splits it into subgroups. This process continues until each cluster contains a single object.
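The agglomerative procedure described above might be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance between cluster centroids as the merge criterion, and it stops at a caller-supplied `num_clusters`, a hypothetical parameter that the text does not specify.

```python
from math import dist


def centroid(cluster):
    # Mean of the points in a cluster, coordinate by coordinate.
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def agglomerative(points, num_clusters):
    """Merge the two clusters with the closest centroids until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    while len(clusters) > max(num_clusters, 1):
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


# Hypothetical data: two close pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
result = agglomerative(points, 3)
```

With these sample points, the two close pairs are merged first and the distant point remains a cluster of its own. The pairwise search is recomputed on every pass, which is easy to follow but too slow for large data sets.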
Density-based clustering groups objects on the basis of density. DBSCAN is a widely used density-based algorithm that detects clusters of arbitrary shape. The main idea of the algorithm is that, for a point to lie in the dense interior of a cluster, the number of objects within its neighborhood must be greater than or equal to a threshold number of points.
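The neighborhood test can be made concrete with a minimal DBSCAN-style sketch. The parameter names `eps` (neighborhood radius) and `min_pts` (threshold number of points) are assumptions chosen for illustration, as is the use of Euclidean distance. The brute-force neighbor search is meant for small examples only.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster_id += 1  # point i is dense enough: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a dense point: border
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is dense too: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical data: a group of four, a group of three, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this sketch, the first two groups each satisfy the threshold and form separate clusters, while the isolated point fails the neighborhood test and is labeled as noise (-1).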
Clustering finds extensive application in business and financial data analysis, bioinformatics, telecommunications, and health applications. Credit card holders can be grouped on the basis of card usage, purchase patterns, money spent, frequency of use, and the locations where the card is used. This information is useful in market analysis for identifying groups to which promotional activities of likely interest can be targeted, which benefits both card holders and sellers. Such market analysis is based on customers' lifestyle, past purchase behavior, or demographic characteristics.
Clustering can also be applied by a chain store that wants to compare the profits of similarly placed outlets on the basis of variables such as social neighborhood, purchase patterns, and vicinity to other shops. Cluster analysis has also been used widely in areas of medicine such as psychiatry, disease modelling, gene modelling, and disease diagnostics. Other applications of clustering include grouping policy holders according to average claim cost, creating thematic maps by grouping feature space, document classification, and clustering weblog data to discover groups of similar patterns.