Pattern Matching Techniques to Identify Syntactic Variations of Tags in Folksonomies

F. Echarte, J. J. Astrain, A. Córdoba, J. Villadangos
DOI: 10.4018/978-1-60566-272-5.ch011

Abstract

Folksonomies offer an easy way to organize information on the current Web. This ease of use, together with their collaborative nature, has led to their wide adoption in many Social Web projects. However, they have important drawbacks: their exploring and searching capabilities are limited compared with other methods such as taxonomies, thesauri and ontologies. One of these drawbacks is a consequence of their tagging flexibility, which frequently produces multiple syntactic variations of the same tag. In this chapter we study the application of two classical pattern matching techniques, Levenshtein distance for imperfect string matching and Hamming distance for perfect string matching, to identify syntactic variations of tags.

Introduction

Folksonomies (Vander Wal, 2008) are based on the assignment of text tags to different resources, such as photos, web pages and documents, in order to classify these resources in Web 2.0. Users apply these tags to annotate resources, collaboratively defining the meaning of both the annotated resources and the tags used.

Folksonomies make new search and exploration approaches possible, based on the use of tags (Millen, 2006; Golder, 2005). Users can search for tags, or use navigation systems such as word clouds, to locate resources tagged by other users and to find information.

Although folksonomies are very successful on the current Web, mainly because of their simplicity of use, they also have important disadvantages. Because users create tags and assign them freely to resources, there is no structure among these tags. As folksonomies grow larger, more problems appear regarding the use of synonyms, syntactic tag variations and different granularity levels (Gruber, 1993). All these problems make the exploration and retrieval of information increasingly difficult (Mathes, 2004; Guy, 2006), decreasing the quality of folksonomies. Thus, reducing syntactic tag variations helps to improve the quality of folksonomies.

There are different types of syntactic variations of tags: typographical misspellings made during annotation (semanticweb/semnticwev/zemantcweb); grammatical number (singular or plural) of the same word (semanticweb/semanticwebs); separators (semantic-web/semanticweb); or a combination of these (semntic-web/smanticweb, semntic-webs, etc.). These variations cause resources to be classified under different tags when they should be classified under just one, which makes word clouds, the location of information and navigation of the folksonomy more confusing. However, by identifying all of them as variations of the same label “semantic web” and grouping them under the same tag, a user who accesses this tag obtains all the information about the resources associated with it and with its syntactic variations.

This chapter focuses on the application of pattern matching techniques to identify syntactic variations of tags. We study two classical pattern matching techniques, the Levenshtein (Levenshtein, 1966) and Hamming (Hamming, 1950) distances, on a large real dataset, evaluating how well these techniques identify both variations of known tags and new (non-existing) tags.
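
As a minimal sketch of the two distances named above (not the authors' implementation), the following Python code computes the Levenshtein distance with the standard dynamic-programming recurrence and the Hamming distance as a count of differing positions. The function names are illustrative, and raising an error for strings of unequal length in `hamming` is an assumption: this excerpt does not say how tags of different lengths are compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    # Assumption: unequal lengths are rejected rather than padded.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))
```

With these definitions, the variations listed earlier fall within a small distance of the original tag: semnticwev is at Levenshtein distance 2 from semanticweb, while semantic-web and semanticwebs are both at distance 1.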

We report the percentages of correct identification achieved with each distance for different types of variations, such as typographic errors, transpositions of adjacent characters, singulars and plurals, and substitution/deletion of separators.

To our knowledge, no study has addressed the application of pattern matching techniques to the identification of syntactic variations of tags. Only in (Specia, 2007) is a pre-filtering of the tags performed before applying a tag clustering algorithm, in order to minimize the effects of syntactic variations and to increase the quality of tag clustering. The authors group similar tags using the Levenshtein similarity metric to determine morphological variations, although over a reduced experimental data set and following a process that is not described in detail. Another way to represent these variations is presented in (Gruber, 1993), where an ontology with three properties associated with tags (prefLabel, altLabel and hiddenLabel) is used.

The use of pattern matching techniques designed to automatically recognize syntactic variations of tags provides mechanisms to improve the quality of folksonomies.

Approximate string matching techniques make it possible to deal with the problem introduced by syntactic variations in folksonomies. The problem consists of comparing a candidate input string α, possibly containing errors, with a pattern string ω in order to transform α into ω (Navarro, 2001).
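
One way to picture this comparison is the sketch below, which matches a candidate α against a set of known patterns ω and accepts the closest one only if it lies within a distance threshold. The function names, the threshold value of 2 and the rule of treating unmatched candidates as new tags are assumptions made for illustration; the excerpt does not specify the procedure actually used.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions, each of cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_tag(candidate: str,
                known_tags: Iterable[str],
                max_distance: int = 2) -> Optional[str]:
    """Return the known tag (pattern) nearest to the candidate, or None.

    Hypothetical rule: a candidate farther than max_distance from every
    known tag is not a variation of any of them and is treated as new.
    """
    best_tag, best_distance = None, max_distance + 1
    for tag in known_tags:
        distance = levenshtein(candidate, tag)
        if distance < best_distance:  # ties keep the first tag found
            best_tag, best_distance = tag, distance
    return best_tag


known = ["ontology", "semanticweb"]
variation = closest_tag("semnticwev", known)  # within the threshold
new_tag = closest_tag("python", known)        # too far from every known tag
```

Here `variation` is `"semanticweb"` and `new_tag` is `None`. The threshold trades missed variations against false merges of unrelated short tags, so it would need to be tuned on real data.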
