Clustering with Proximity Graphs: Exact and Efficient Algorithms

Michail Kazimianec (Faculty of Economics, Vilnius University, Vilnius, Lithuania) and Nikolaus Augsten (Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy)
Copyright © 2013 | Pages: 21
DOI: 10.4018/ijkbo.2013100105

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on this graph. However, the computation of the proximity graph is expensive, and state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques like k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs, and experimentally show that these solutions are faster even if the sampling-based algorithms use very small sample sizes. They provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.

Introduction

String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. Cleansing approaches that are based on clustering substitute all strings in a cluster by the most frequent string. Such cleansing methods are typically applied to non-dictionary strings since for such data the cleansing algorithm cannot take advantage of a reference table with the correct values. Non-dictionary data are, for example, personal data like name and address, or proper names in geography, biology, or meteorology (e.g., names of geographic regions, plants, cyclones, and hurricanes).

Recently, Mazeika and Böhlen (Mazeika et al., 2006) introduced the graph proximity cleansing (GPC) method for clustering and cleansing non-dictionary strings. A distinguishing feature of GPC clustering is the automatic detection of cluster borders using a so-called proximity graph. GPC randomly selects a cluster center among the dataset strings and adds all strings within some similarity neighborhood to the cluster. In GPC, the neighborhood includes all dataset strings sharing (at least) a specific number of substrings of a fixed length q (called q-grams). After each step that adds strings to the cluster, the cluster center is recomputed. Intuitively, GPC enlarges the neighborhood until further enlargement no longer increases the size of the cluster (Figure 1(a)).
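To make the similarity notion concrete, the following sketch extracts positional q-grams (q = 2) from a string and counts how many q-grams two strings share. The '#' padding at both ends and the treatment of repeated grams as a multiset are illustrative assumptions; the paper's exact q-gram convention may differ.

```python
def qgrams(s, q=2):
    """Positional q-grams of s, padded with '#' so that leading and
    trailing characters also appear in q grams. Repeated grams are
    distinguished by an occurrence index, i.e. this is a multiset."""
    s = "#" + s + "#"  # padding convention assumed for illustration
    seen, out = {}, set()
    for i in range(len(s) - q + 1):
        g = s[i:i + q]
        out.add((g, seen.get(g, 0)))  # (gram, k-th occurrence)
        seen[g] = seen.get(g, 0) + 1
    return out

def overlap(s, t, q=2):
    """Number of q-grams the two strings have in common."""
    return len(qgrams(s, q) & qgrams(t, q))

print(overlap("vivian", "vivien"))  # → 5
```

Under this convention, vivian and vivien share five 2-grams (#v, vi twice, iv, n#), while vivian and jeanne share only one (an).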

Figure 1.

Neighborhoods, exact and approximate proximity graphs for “vivian”

GPC finds the cluster border for a given string by computing its proximity graph. The proximity graph shows the number of strings within the neighborhood (y-axis) computed for the respective similarity threshold (x-axis). The cluster border is defined to be the rightmost endpoint of the longest horizontal line in the proximity graph (or of the rightmost line if there are multiple horizontal lines of the same length). The proximity graph for the string vivian in the dataset {vivian, adriana, vivien, marvin, vivyan, manuel, jeanne, clive} is shown in Figure 1(b). The longest horizontal line is between the similarity thresholds 3 and 5, thus 5 is the cluster border, i.e., the strings having 5 or more common q-grams with the respective center form a cluster around vivian.
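The construction of the proximity graph and the border detection can be sketched end-to-end. The sketch below assumes '#'-padded 2-grams counted as a multiset (an illustrative convention, not necessarily the paper's, though it reproduces the border of 5 for the example dataset):

```python
def qgrams(s, q=2):
    """'#'-padded positional q-grams of s as a multiset of
    (gram, occurrence-index) pairs; the padding is assumed for
    illustration."""
    s = "#" + s + "#"
    seen, out = {}, set()
    for i in range(len(s) - q + 1):
        g = s[i:i + q]
        out.add((g, seen.get(g, 0)))
        seen[g] = seen.get(g, 0) + 1
    return out

def proximity_graph(center, strings, q=2):
    """Neighborhood size (y-axis) for every similarity threshold (x-axis)."""
    cg = qgrams(center, q)
    overlaps = [len(cg & qgrams(s, q)) for s in strings]
    return [sum(1 for o in overlaps if o >= t) for t in range(len(cg) + 1)]

def cluster_border(pg):
    """Rightmost endpoint of the longest horizontal line in pg,
    with ties broken in favor of the rightmost line."""
    best_len, best_end, start = 0, 0, 0
    for i in range(1, len(pg) + 1):
        if i == len(pg) or pg[i] != pg[i - 1]:
            run = i - 1 - start  # length of the horizontal line
            if run >= best_len:  # >= keeps the rightmost on ties
                best_len, best_end = run, i - 1
            start = i
    return best_end

data = ["vivian", "adriana", "vivien", "marvin",
        "vivyan", "manuel", "jeanne", "clive"]
pg = proximity_graph("vivian", data)
print(pg)                  # [8, 8, 5, 3, 3, 3, 1, 1]
print(cluster_border(pg))  # 5: the plateau runs from threshold 3 to 5
```

The strings sharing at least 5 q-grams with the center — here vivian, vivien, and vivyan — then form the cluster around vivian.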

The computation of the proximity graphs is the bottleneck in GPC. The proximity graph is expensive to compute and must be computed for each potential cluster. To make GPC feasible for realistic datasets, Mazeika and Böhlen (Mazeika et al., 2006) propose an algorithm that approximates the proximity graph using a sampling technique for merging the inverted q-gram lists of an inverted list index. Each inverted list in the index consists of the IDs of all strings containing a particular q-gram. The approximated proximity graph is then used to determine the cluster border. But the approximate proximity graph differs from the exact one and thus leads to errors in the cluster (Figure 1(c)).
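A minimal sketch of such an inverted list index, and of how merging the center's lists yields a neighborhood, is shown below. It assumes '#'-padded 2-grams and, for simplicity, counts each distinct q-gram of a string once; the exact and sampling algorithms discussed in the article work on multiset overlaps.

```python
from collections import defaultdict

def distinct_qgrams(s, q=2):
    s = "#" + s + "#"  # padding convention assumed for illustration
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(strings, q=2):
    """Inverted list index: each q-gram maps to the IDs of the strings
    that contain it (lists are sorted because IDs are assigned in order)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in distinct_qgrams(s, q):
            index[g].append(sid)
    return index

def neighborhood(center, strings, index, tau, q=2):
    """IDs of strings sharing at least tau distinct q-grams with the
    center, obtained by merging the center's inverted lists."""
    counts = defaultdict(int)
    for g in distinct_qgrams(center, q):
        for sid in index.get(g, []):
            counts[sid] += 1
    return {sid for sid, c in counts.items() if c >= tau}

data = ["vivian", "adriana", "vivien", "marvin",
        "vivyan", "manuel", "jeanne", "clive"]
index = build_index(data)
print(neighborhood("vivian", data, index, tau=4))  # {0, 2, 4}
```

Counting how often each string ID occurs across the merged lists is exactly where the sampling algorithm approximates, and where the exact algorithms presented next invest their effort.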

In this article we present two efficient GPC algorithms that compute the exact proximity graph. The first algorithm, PG-SM, is based on a sort-merge technique; the second algorithm, PG-DS, uses an inverted list index and a divide-skip strategy for the efficient merging of inverted lists to compute the proximity graph. We experimentally evaluate our exact algorithms on large real-world datasets and show that our algorithms are faster than the previously proposed sampling algorithm even for small samples.

Unfortunately, the quality of GPC clusters has been poorly investigated in the literature. In particular, GPC has never been compared to standard clustering techniques. Despite the promising features of GPC, this lack of evidence limits its usability in practice.
