Introduction
String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, to cleanse noisy data. Cleansing approaches that are based on clustering replace all strings in a cluster with the most frequent string. Such cleansing methods are typically applied to non-dictionary strings, for which the cleansing algorithm cannot take advantage of a reference table with the correct values. Non-dictionary data are, for example, personal data like name and address, or proper names in geography, biology, or meteorology (e.g., names of geographic regions, plants, cyclones, and hurricanes).
Recently, Mazeika and Böhlen (2006) introduced the graph proximity cleansing (GPC) method for clustering and cleansing non-dictionary strings. A distinguishing feature of GPC clustering is the automatic detection of cluster borders using a so-called proximity graph. GPC randomly selects a cluster center among the dataset strings and adds all strings within some similarity neighborhood to the cluster. In GPC, the neighborhood includes all dataset strings sharing (at least) a specific number of substrings of a fixed length q (called q-grams). After each step that adds strings to the cluster, the cluster center is recomputed. Intuitively, GPC enlarges the neighborhood until further enlarging it does not increase the size of the cluster (Figure 1(a)).
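The q-gram similarity underlying the GPC neighborhood can be illustrated with a short Python sketch. We assume q=2, boundary padding with `#` and `$`, and multiset (bag) intersection; the article does not fix these conventions, so they are our assumptions for illustration.

```python
from collections import Counter

def qgrams(s, q=2):
    # Extract overlapping substrings of length q; pad the string with
    # start/end markers so boundary characters form their own q-grams
    # (padding is one common convention, assumed here).
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def shared_qgrams(a, b, q=2):
    # Count common q-grams, respecting multiplicities (multiset intersection).
    return sum((qgrams(a, q) & qgrams(b, q)).values())

print(shared_qgrams("vivian", "vivien"))  # 5 common 2-grams: #v, vi, vi, iv, n$
print(shared_qgrams("vivian", "clive"))   # 1 common 2-gram: iv
```

Under these assumptions, "vivien" falls within the similarity-5 neighborhood of "vivian", while "clive" does not.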
Figure 1. Neighborhoods, exact and approximate proximity graphs for “vivian”
GPC finds the cluster border for a given string by computing its proximity graph. The proximity graph shows the number of strings within the neighborhood (y-axis) computed for the respective similarity threshold (x-axis). The cluster border is defined as the rightmost endpoint of the longest horizontal line in the proximity graph (or of the rightmost such line if multiple horizontal lines have the same length). The proximity graph for the string vivian in the dataset {vivian, adriana, vivien, marvin, vivyan, manuel, jeanne, clive} is shown in Figure 1(b). The longest horizontal line lies between the similarity thresholds 3 and 5; thus 5 is the cluster border, i.e., the strings having 5 or more common q-grams with the respective center form a cluster around vivian.
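The proximity graph and border detection for this example can be sketched in Python. Assuming q=2 with boundary padding and multiset q-gram counting (conventions not fixed by the article), the sketch reproduces the numbers described for Figure 1(b): the neighborhood size stays at 3 for thresholds 3 through 5, so the border is 5.

```python
from collections import Counter

def qgrams(s, q=2):
    # 2-grams with start/end padding (an assumed convention).
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def proximity_graph(center, dataset, q=2):
    # Map each similarity threshold t to the neighborhood size:
    # the number of dataset strings sharing at least t q-grams with the center.
    counts = [sum((qgrams(center, q) & qgrams(s, q)).values()) for s in dataset]
    return {t: sum(c >= t for c in counts) for t in range(1, max(counts) + 1)}

def cluster_border(graph):
    # Find the longest horizontal run of equal neighborhood sizes;
    # on ties, prefer the rightmost run. Return its rightmost endpoint.
    ts = sorted(graph)
    best_start = best_end = 0
    i = 0
    while i < len(ts):
        j = i
        while j + 1 < len(ts) and graph[ts[j + 1]] == graph[ts[i]]:
            j += 1
        if j - i >= best_end - best_start:
            best_start, best_end = i, j
        i = j + 1
    return ts[best_end]

dataset = ["vivian", "adriana", "vivien", "marvin",
           "vivyan", "manuel", "jeanne", "clive"]
pg = proximity_graph("vivian", dataset)
border = cluster_border(pg)                      # -> 5
cluster = [s for s in dataset
           if sum((qgrams("vivian") & qgrams(s)).values()) >= border]
print(border, cluster)                           # 5 ['vivian', 'vivien', 'vivyan']
```

With these assumptions, the strings vivian, vivien, and vivyan form the cluster around vivian, matching the example in the text.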
The computation of the proximity graphs is the bottleneck in GPC. The proximity graph is expensive to compute and must be computed for each potential cluster. To make GPC feasible for realistic datasets, Mazeika and Böhlen (2006) propose an algorithm that approximates the proximity graph using a sampling technique for merging the inverted q-gram lists of an inverted list index. Each inverted list in the index consists of all IDs of strings containing a particular q-gram. The approximated proximity graph is then used to decide the cluster border. However, the approximate proximity graph differs from the exact one and thus leads to errors in the cluster (Figure 1(c)).
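The inverted list index mentioned above can be sketched as follows. This is a minimal illustration, not the article's implementation; it assumes q=2 with boundary padding, and counts distinct shared q-grams by merging the lists for the center's q-grams.

```python
from collections import defaultdict

def build_inverted_index(dataset, q=2):
    # One inverted list per q-gram: the sorted IDs of all strings
    # that contain that q-gram (padding assumed, as above).
    index = defaultdict(set)
    for sid, s in enumerate(dataset):
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        for i in range(len(padded) - q + 1):
            index[padded[i:i + q]].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

def merge_lists(index, center, q=2):
    # Merge the inverted lists of the center's q-grams: for each string ID,
    # count in how many of those lists it appears, i.e., how many distinct
    # q-grams it shares with the center.
    padded = "#" * (q - 1) + center + "$" * (q - 1)
    grams = {padded[i:i + q] for i in range(len(padded) - q + 1)}
    shared = defaultdict(int)
    for g in grams:
        for sid in index.get(g, []):
            shared[sid] += 1
    return dict(shared)

index = build_inverted_index(["vivian", "clive"])
print(index["iv"])                    # [0, 1] -- both strings contain "iv"
print(merge_lists(index, "vivian"))   # shared-q-gram counts per string ID
```

Sampling, as in the approximate algorithm, would scan only a subset of the entries in each inverted list; the exact algorithms presented here merge the full lists instead.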
In this article we present two efficient GPC algorithms that compute the exact proximity graph. The first algorithm, PG-SM, is based on a sort-merge technique; the second algorithm, PG-DS, uses an inverted list index and a divide-skip strategy for the efficient merging of inverted lists to compute the proximity graph. We experimentally evaluate our exact algorithms on large real-world datasets and show that our algorithms are faster than the previously proposed sampling algorithm even for small samples.
Unfortunately, the quality of GPC clusters has been poorly investigated in the literature. In particular, GPC has never been compared to standard clustering techniques. Despite the promising features of GPC, this lack of evidence limits its usability in practice.