Effective and Efficient Classification of Topically-Enriched Domain-Specific Text Snippets: The TETSC Method

Marco Spruit, Bas Vlug
Copyright: © 2015 |Pages: 17
DOI: 10.4018/IJSDS.2015070101

Abstract

Due to the explosive growth in the number of text snippets over the past few years, combined with their textual sparsity, organizations are unable to classify them effectively and efficiently, missing out on business opportunities. This paper presents TETSC: the Topically-Enriched Text Snippet Classification method. TETSC aims to solve the classification problem for text snippets in any domain. TETSC recognizes that there are different types of text snippets and therefore allows for stop word removal, named-entity recognition, and topical enrichment per snippet type. TETSC has been implemented in the production systems of a personal finance organization, where it reduced classification error by over 21%. Highlights: the authors create the TETSC method for classifying topically-enriched text snippets; they differentiate between types of text snippets; they show a successful application of named-entity recognition to text snippets; and they find that combining multiple enrichment strategies appears to reduce effectiveness.

1. Introduction: The Wicked Problem Of Classifying Text Snippets

Recent years have witnessed an unprecedented growth in the number of text snippets. The Washington Post reported that in March 2013 over 400 million tweets were sent per day, up from 200 million in 2011 (Tsukayama, 2013; Twitter Engineering, 2011). And this is only the growth from a single source: in today's society, text snippets appear in many places. Beyond Twitter, search engines and banks, for instance, also produce large numbers of text snippets daily, in the form of search result snippets and financial transactions.

Most of these text snippets are treated as belonging to no domain at all. This, however, is far from the truth. Plenty of domain-related tweets are sent on a daily basis, companies' customer service tweets being one example. Furthermore, there even exist domain-specific search engines, such as MEDLINE, which yield better results precisely because they target a specific domain.

While many text snippets are created and generated on a daily basis, even summarizing them through classification remains a problem. The classification of large documents has reached effectiveness levels comparable to those of trained professionals, but the classification of short texts, in this research denoted as text snippets, is a different matter (Sebastiani, 2002). Chen, Xiaoming & Shen (2011) identify the main reason: text snippets are of short length and therefore suffer from sparsity.
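The sparsity problem can be illustrated with a small sketch (a hypothetical example, not drawn from the paper): two snippets describing the same kind of event can share almost no terms, so their bag-of-words vectors barely overlap and a term-based classifier has little to work with.

```python
# Two snippets about the same topic share almost no terms, so their
# bag-of-words vectors barely overlap -- the sparsity problem that
# hampers snippet classification.
from collections import Counter
import math

def bow(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

snippet_a = bow("card payment at grocery store")
snippet_b = bow("supermarket purchase with debit card")

# Only "card" is shared, so the similarity is low (0.2) despite the
# snippets describing the same kind of transaction.
print(cosine(snippet_a, snippet_b))
```

With longer documents the shared vocabulary grows and such term overlap becomes a usable signal; for snippets of five to ten words it often does not, which is what motivates enriching the snippet with additional terms before classification.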

By failing to classify text snippets correctly, companies miss out on business opportunities. Correctly classifying tweets, for instance, could provide information for identifying trends, and correctly classifying financial transactions could give account owners valuable overviews of their expenses, in turn putting them more in control of their finances. Another application domain well known to suffer from valuable information being locked in unstructured text snippets is healthcare, where doctors often record a patient's diagnosis and/or prognosis only in the dossier's comment field (Spruit, Vroon & Batenburg, 2014).

This paper attempts to solve the problem of correctly classifying domain-specific text snippets into predefined categories. A vast body of literature aims to solve this problem, most of it by enriching text snippets through various means:

  1. Search query results (e.g. Sahami & Heilman, 2006; Shen et al., 2006);
  2. The categorical structure of an intermediary such as Wikipedia or Yahoo (e.g. Shen et al., 2006; Gabrilovich & Markovitch, 2005);
  3. An external corpus (e.g. Gabrilovich & Markovitch, 2006; Wang & Domeniconi, 2008);
  4. Topic models (e.g. Phan, Nguyen & Horiguchi, 2008; Ramage, Dumais & Liebling, 2010); or
  5. Lexical information (e.g. Hu et al., 2009).
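All five strategies share the same underlying idea: append related terms to the snippet so that its feature vector becomes denser before classification. The sketch below illustrates this in the spirit of strategy 4, with the strong simplification that the trained topic model is replaced by a hand-written topic-keyword table; the topic names, keywords, and function are hypothetical and not part of any cited method.

```python
# Toy illustration of topical enrichment: topics that match a snippet
# contribute their keywords, so a bag-of-words classifier downstream
# sees a denser feature vector. A real system would use a trained
# topic model here instead of this hand-written keyword table.
TOPIC_KEYWORDS = {
    "groceries": {"supermarket", "grocery", "food", "store"},
    "transport": {"train", "bus", "fuel", "ticket"},
}

def enrich(snippet):
    """Append the keywords of every topic the snippet touches."""
    tokens = set(snippet.lower().split())
    extra = []
    for topic, keywords in sorted(TOPIC_KEYWORDS.items()):
        if tokens & keywords:          # snippet shares a term with this topic
            extra.extend(sorted(keywords))
    return snippet if not extra else snippet + " " + " ".join(extra)

# "grocery" and "store" trigger the groceries topic, so the enriched
# snippet also contains "food" and "supermarket".
print(enrich("payment at grocery store"))
```

The design choice matters: enrichment adds recall (more terms to match on) at the risk of adding noise, which is consistent with the paper's finding that stacking multiple enrichment strategies can reduce effectiveness.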
