DOI: 10.4018/978-1-5225-3270-5.ch003

Chapter Preview

Clustering methods are descriptive techniques (not interpretative, let alone predictive) for finding patterns and outliers among available samples. They classify a set of observations into two or more mutually exclusive, previously unknown groups with the aim of data reduction. The purpose is to discover a way of organizing observations into groups whose members share common properties. For instance, a researcher may create clusters of customers with similar buying habits in specific supermarkets for geosegmentation marketing purposes. From the city-planning perspective, identifying groups of houses according to their type (financial value and geographical location) is another example of a clustering application.

Anything can be clustered, but a good methodology should produce high intra-class similarity and low inter-class similarity. It is important to highlight that clustering analysis also looks for individual outliers (not only patterns) among the data. Different distance metrics (Euclidean, Manhattan, Minkowski or Chebyshev) and clustering algorithms (partitional K-means or agglomerative hierarchical) often lead to different outcomes, so it is up to the researcher to decide which approach and parameters should be applied to his/her problem. A typically hard question in statistical cluster analysis concerns the number of clusters to be given as input, a choice with tremendous influence on the clustering result. As a rule of thumb (and a reasonable default starting value), this parameter is often set between 3 and 5, but no strict rule applies.
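The four distance metrics named above can be sketched in a few lines of Python; the function names here are illustrative, not drawn from any particular library.

```python
import math

# Minimal sketch of the four distance metrics mentioned above,
# for two points given as equal-length coordinate sequences.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # Generalizes Manhattan (p=1) and Euclidean (p=2).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # Limit of Minkowski as p grows: the largest single-coordinate gap.
    return max(abs(x - y) for x, y in zip(a, b))
```

For the same pair of points, the four metrics generally disagree (for example, from (0, 0) to (3, 4) they give 5, 7, about 4.5 and 4, respectively), which is one reason the choice of metric changes the clustering outcome.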

From the computational standpoint, partitional K-means clustering works as follows: (1) Set K as the desired number of clusters; (2) Randomly select K representative elements called centroids; (3) Compute the distance of each pattern (point) from all centroids; (4) Assign each data point to the centroid with the minimum distance; (5) Update each centroid as the mean of the elements belonging to its cluster and compute the new cluster memberships; (6) Check the convergence condition: if all data points are assigned to the same cluster as in the previous iteration (all the centroids remain the same), stop the process; otherwise, reapply the assignment process starting from step 3.
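The six steps above can be sketched directly in Python; this is an illustrative, self-contained implementation, not the code of any particular package.

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain K-means following the six steps above (illustrative sketch)."""
    rng = random.Random(seed)
    # Steps 1-2: K is given; pick K distinct points as initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centroid
        # (squared Euclidean distance preserves the ordering).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 6: stop when no centroid moved; otherwise back to step 3.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Running it on six points forming two well-separated groups, e.g. `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)`, converges in a few iterations to two clusters of three points each.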

Alternatively, the hierarchical clustering algorithm (the myGeoffice^{©} option) works as follows: (1) Calculate the distance between all data points; (2) Assign the data points to initial clusters; (3) Calculate the distance metrics between all clusters; (4) Repeatedly merge the most similar clusters into a higher level of the hierarchy; (5) Repeat steps 3 and 4 until the top-level clusters are reached. An extra parameter to be defined for this agglomerative clustering technique is the distance function between clusters, i.e., single linkage (distance between the closest neighbors), complete linkage (distance between the furthest neighbors), centroid linkage (distance between centroids) and average linkage (average distance between all patterns in each cluster).
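A naive version of this agglomerative procedure, with the linkage choice as a parameter, can be sketched as follows (an illustrative sketch, not myGeoffice^{©}'s actual code; centroid linkage is omitted for brevity):

```python
def hierarchical(points, target_k, linkage="single"):
    """Naive agglomerative clustering: merge the two closest clusters
    until only target_k remain. `linkage` selects how the
    cluster-to-cluster distance is computed."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(p, q) for p in c1 for q in c2]
        if linkage == "single":      # closest neighbors
            return min(pairs)
        if linkage == "complete":    # furthest neighbors
            return max(pairs)
        return sum(pairs) / len(pairs)  # average linkage

    # Step 2: every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Steps 3-4: find and merge the most similar pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

Recomputing all pairwise cluster distances at every merge makes this sketch cubic in the number of points; production implementations cache the distance matrix instead.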

From a GIS perspective, this last methodology has proven quite useful for clustering neighborhoods (zip codes) into different sets based on census information (such as population density, income and age) to support decisions about where to locate future depots. Nonetheless, the procedure is considered a-spatial since the geographical component is not used by the algorithm. Moreover, this numerical approach does not explain why a specific phenomenon behaves in a particular way; basically, it only supplies clues or evidence for the scientist to investigate possible explanations or causes of that group's behavior.

Within myGeoffice^{©}, the hierarchical cluster analysis starts with a single data column, where each cell holds the observation value for one region. The key problem is how to combine multiple measures into a single number expressing the similarity between two observations, where grouping is based on the proximity of each region's value. myGeoffice^{©} offers two alternatives for this parameter: (A) the absolute value of the difference between scores (Single Difference); (B) the squared difference between the two measures (Squared Euclidean Difference). After the comparison matrix among data values has been assessed, the next step joins the regions into clusters based on the nearest values. This single linkage computes the distance between two subgroups as the minimum distance between any two members of opposite groups. The following step reshuffles the remaining regions into clusters by re-assigning them based on the similarity among regions. The nearest distances are then recalculated, and the process continues recursively until no regions are left unassigned.
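The one-column procedure just described can be sketched as follows; this is an assumption-laden illustration of the steps as stated in the text, not myGeoffice^{©}'s internal code, and `cluster_regions` is a hypothetical name.

```python
def cluster_regions(values, k, squared=False):
    """Single-linkage clustering of one observation value per region.
    `squared` switches between the two dissimilarity options above:
    Single Difference (|a - b|) and Squared Euclidean Difference."""
    sim = (lambda a, b: (a - b) ** 2) if squared else (lambda a, b: abs(a - b))
    clusters = [[v] for v in values]          # each region starts alone
    while len(clusters) > k:
        # Single linkage: minimum pairwise dissimilarity between
        # any two members of opposite clusters.
        best_i, best_j, best_d = 0, 1, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # Merge the two nearest clusters and recompute on the next pass.
        clusters[best_i] += clusters.pop(best_j)
    return clusters
```

Since squaring preserves the ordering of non-negative differences, both dissimilarity options produce the same groupings here; they differ only in the distance values reported along the way.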
