Introduction
String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, to cleanse noisy data. Cleansing approaches that are based on clustering replace all strings in a cluster with the most frequent string. Such cleansing methods are typically applied to non-dictionary strings, for which the cleansing algorithm cannot take advantage of a reference table with the correct values. Non-dictionary data are, for example, personal data like name and address, or proper names in geography, biology, or meteorology (e.g., names of geographic regions, plants, cyclones, and hurricanes).
Recently, Mazeika and Böhlen (2006) introduced the graph proximity cleansing (GPC) method for clustering and cleansing non-dictionary strings. A distinguishing feature of GPC clustering is the automatic detection of cluster borders using a so-called proximity graph. GPC randomly selects a cluster center among the dataset strings and adds all strings within some similarity neighborhood to the cluster. In GPC, the neighborhood includes all dataset strings sharing (at least) a specific number of substrings of a fixed length q (called q-grams). After each step that adds strings to the cluster, the cluster center is recomputed. Intuitively, GPC enlarges the neighborhood until further enlarging it does not increase the size of the cluster (Figure 1(a)).
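The q-gram similarity underlying the GPC neighborhood can be illustrated with a short Python sketch. We assume q=2, boundary padding with `#` and `$`, and multiset (bag) intersection; the article does not fix these conventions, so they are our assumptions for illustration.

```python
from collections import Counter

def qgrams(s, q=2):
    # Extract overlapping substrings of length q; pad the string with
    # start/end markers so boundary characters form their own q-grams
    # (padding is one common convention, assumed here).
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def shared_qgrams(a, b, q=2):
    # Count common q-grams, respecting multiplicities (multiset intersection).
    return sum((qgrams(a, q) & qgrams(b, q)).values())

print(shared_qgrams("vivian", "vivien"))  # 5 common 2-grams: #v, vi, vi, iv, n$
print(shared_qgrams("vivian", "clive"))   # 1 common 2-gram: iv
```

Under these assumptions, "vivien" falls within the similarity-5 neighborhood of "vivian", while "clive" does not.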
Figure 1. Neighborhoods, exact and approximate proximity graphs for “vivian”
GPC finds the cluster border for a given string by computing its proximity graph. The proximity graph shows the number of strings within the neighborhood (y-axis) computed for the respective similarity threshold (x-axis). The cluster border is defined as the rightmost endpoint of the longest horizontal line in the proximity graph (or of the rightmost such line if multiple horizontal lines have the same length). The proximity graph for the string vivian in the dataset {vivian, adriana, vivien, marvin, vivyan, manuel, jeanne, clive} is shown in Figure 1(b). The longest horizontal line lies between the similarity thresholds 3 and 5; thus 5 is the cluster border, i.e., the strings having 5 or more common q-grams with the respective center form a cluster around vivian.
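The proximity graph and border detection for this example can be sketched in Python. Assuming q=2 with boundary padding and multiset q-gram counting (conventions not fixed by the article), the sketch reproduces the numbers described for Figure 1(b): the neighborhood size stays at 3 for thresholds 3 through 5, so the border is 5.

```python
from collections import Counter

def qgrams(s, q=2):
    # 2-grams with start/end padding (an assumed convention).
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def proximity_graph(center, dataset, q=2):
    # Map each similarity threshold t to the neighborhood size:
    # the number of dataset strings sharing at least t q-grams with the center.
    counts = [sum((qgrams(center, q) & qgrams(s, q)).values()) for s in dataset]
    return {t: sum(c >= t for c in counts) for t in range(1, max(counts) + 1)}

def cluster_border(graph):
    # Find the longest horizontal run of equal neighborhood sizes;
    # on ties, prefer the rightmost run. Return its rightmost endpoint.
    ts = sorted(graph)
    best_start = best_end = 0
    i = 0
    while i < len(ts):
        j = i
        while j + 1 < len(ts) and graph[ts[j + 1]] == graph[ts[i]]:
            j += 1
        if j - i >= best_end - best_start:
            best_start, best_end = i, j
        i = j + 1
    return ts[best_end]

dataset = ["vivian", "adriana", "vivien", "marvin",
           "vivyan", "manuel", "jeanne", "clive"]
pg = proximity_graph("vivian", dataset)
border = cluster_border(pg)                      # -> 5
cluster = [s for s in dataset
           if sum((qgrams("vivian") & qgrams(s)).values()) >= border]
print(border, cluster)                           # 5 ['vivian', 'vivien', 'vivyan']
```

With these assumptions, the strings vivian, vivien, and vivyan form the cluster around vivian, matching the example in the text.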
The computation of the proximity graphs is the bottleneck in GPC. The proximity graph is expensive to compute and must be computed for each potential cluster. To make GPC feasible for realistic datasets, Mazeika and Böhlen (2006) propose an algorithm that approximates the proximity graph using a sampling technique for merging the inverted q-gram lists of an inverted list index. Each inverted list in the index consists of all IDs of strings containing a particular q-gram. The approximated proximity graph is then used to decide the cluster border. However, the approximate proximity graph differs from the exact one and thus leads to errors in the cluster (Figure 1(c)).
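The inverted list index mentioned above can be sketched as follows. This is a minimal illustration, not the article's implementation; it assumes q=2 with boundary padding, and counts distinct shared q-grams by merging the lists for the center's q-grams.

```python
from collections import defaultdict

def build_inverted_index(dataset, q=2):
    # One inverted list per q-gram: the sorted IDs of all strings
    # that contain that q-gram (padding assumed, as above).
    index = defaultdict(set)
    for sid, s in enumerate(dataset):
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        for i in range(len(padded) - q + 1):
            index[padded[i:i + q]].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

def merge_lists(index, center, q=2):
    # Merge the inverted lists of the center's q-grams: for each string ID,
    # count in how many of those lists it appears, i.e., how many distinct
    # q-grams it shares with the center.
    padded = "#" * (q - 1) + center + "$" * (q - 1)
    grams = {padded[i:i + q] for i in range(len(padded) - q + 1)}
    shared = defaultdict(int)
    for g in grams:
        for sid in index.get(g, []):
            shared[sid] += 1
    return dict(shared)

index = build_inverted_index(["vivian", "clive"])
print(index["iv"])                    # [0, 1] -- both strings contain "iv"
print(merge_lists(index, "vivian"))   # shared-q-gram counts per string ID
```

Sampling, as in the approximate algorithm, would scan only a subset of the entries in each inverted list; the exact algorithms presented here merge the full lists instead.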
In this article we present two efficient GPC algorithms that compute the exact proximity graph. The first algorithm, PG-SM, is based on a sort-merge technique; the second algorithm, PG-DS, uses an inverted list index and a divide-skip strategy for the efficient merging of inverted lists to compute the proximity graph. We experimentally evaluate our exact algorithms on large real-world datasets and show that our algorithms are faster than the previously proposed sampling algorithm even for small samples.
Unfortunately, the quality of GPC clusters has been poorly investigated in the literature. In particular, GPC has never been compared to standard clustering techniques. Despite the promising features of GPC, this lack of evidence limits its usability in practice.