On Document Representation and Term Weights in Text Classification

Ying Liu (The Hong Kong Polytechnic University Hong Kong SAR, China)
Copyright: © 2009 |Pages: 22
DOI: 10.4018/978-1-59904-990-8.ch001

Abstract

In automated text classification, a bag-of-words representation followed by tfidf weighting is the most popular approach to converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level. The salient semantic information is found using a frequent word sequence method. In contrast to the classic tfidf weighting scheme, we propose a probability based term weighting scheme that directly reflects a term's strength in representing a specific category. An experimental study based on the semantically enriched document representation and the proposed probability based term weighting scheme shows a significant improvement in F-score over the classic approach, i.e. bag-of-words plus tfidf. This study encourages us to further investigate the possibility of applying the semantically enriched document representation to a wide range of text mining tasks.
Chapter Preview

Introduction

Text classification (TC) is the task of categorizing documents into predefined thematic categories. In particular, it aims to find the mapping ξ from a set of documents D: {d1, …, di} to a set of thematic categories C: {C1, …, Cj}, i.e. ξ: D → C. Current research is dominated by supervised learning, in which the construction of a text classifier is often conducted in two main phases (Debole & Sebastiani, 2003; Sebastiani, 2002):

  1. Document indexing: The creation of numeric representations of documents

    • Term selection: To select a subset of terms from all terms occurring in the collection to represent the documents in a sensible way, either to facilitate computing or to achieve best effectiveness in classification.

    • Term weighting: To assign a numeric value to each term in order to weight its contribution which helps a document stand out from others.

  2. Classifier induction: The building of a classifier by learning from the numeric representations of documents

In the literature, the most popular approach to document indexing is probably the bag-of-words (BoW) approach, in which each unique word in the document is treated as the smallest unit that conveys information (Manning & Schütze, 1999; Sebastiani, 2002; van Rijsbergen, 1979), followed by the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf) (Salton & Buckley, 1988; Salton & McGill, 1983). Numerous machine learning algorithms have been applied to the task of text classification. These include, but are not limited to, Naïve Bayes (Lewis, 1992a; Rennie, Shih, Teevan, & Karger, 2003a), decision trees and decision rules (Apté, Damerau, & Weiss, 1994, 1998), artificial neural networks (Ng, Goh, & Low, 1997; Ruiz & Srinivasan, 2002), k-nearest neighbor (kNN) (Baoli, Qin, & Shiwen, 2004; Han, Karypis, & Kumar, 2001; Yang & Liu, 1999), Bayesian network related approaches (Friedman, Geiger, & Goldszmidt, 1997; Heckerman, 1997), and more recently support vector machines (SVM) (Burges, 1998; Joachims, 1998; Vapnik, 1999). Researchers and professionals from relevant areas, e.g. machine learning, information retrieval and natural language processing, constantly introduce new algorithms, test data sets, benchmark results, etc. A comprehensive review of text classification is given by Sebastiani (2002).
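The bag-of-words indexing and tfidf weighting described above can be sketched as follows. This is only a minimal illustration under stated assumptions: whitespace tokenization, raw term counts for tf, and idf = log(N / df) are one common choice among several variants, and the function name `tfidf_vectors` and the toy documents are hypothetical, not taken from the chapter.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build bag-of-words vectors weighted by tf * log(N / df).

    Assumption: lowercased whitespace tokens stand in for real
    preprocessing (stemming, stop-word removal, term selection).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every term that occurs at least once in the collection.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


docs = [
    "machine learning for text",
    "text classification with machine learning",
    "supply chain management",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one vector with a weight per vocabulary term. A term that occurs in few documents receives a larger weight than one spread across many, and a term absent from a document receives zero.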

While text classification has emerged as a fast-growing area, particularly owing to the involvement of machine learning based algorithms, some key questions remain. Although it is generally understood that word sequences, e.g. “finite state machine”, “machine learning” and “supply chain management”, convey more semantic information in representing textual documents than single terms, they have seldom been explored in either text classification or clustering. In fact, a systematic approach to generating such high-quality word sequences automatically is lacking, and the rich semantic information resident in word sequences has therefore been ignored. Likewise, although the tfidf weighting scheme performs well, particularly in Web-based information search and retrieval, its weights do not directly indicate how closely a term is thematically related to a specific category.
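To make the idea of word sequences concrete, the sketch below collects contiguous word sequences that recur across sentences. It is not the chapter's frequent word sequence method, which this preview does not describe. The restriction to contiguous sequences, the sentence-level support count, and the names `frequent_word_sequences`, `max_len` and `min_support` are all assumptions made for illustration.

```python
from collections import Counter


def frequent_word_sequences(sentences, max_len=3, min_support=2):
    """Return contiguous word sequences of length 2..max_len that
    occur in at least `min_support` sentences.

    Assumption: support is counted once per sentence, and tokens are
    lowercased whitespace-separated words.
    """
    support = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        seen = set()  # each sequence counts at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        support.update(seen)
    return {seq: count for seq, count in support.items() if count >= min_support}


sentences = [
    "supply chain management reduces cost",
    "we study supply chain management",
    "machine learning improves supply chain planning",
]
frequent = frequent_word_sequences(sentences)
```

On these toy sentences, "supply chain" is found in all three sentences and "supply chain management" in two, while "machine learning" occurs only once and falls below the support threshold. Sequences that pass the threshold could then be added as extra terms alongside single words.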

Key Terms in this Chapter

Document Representation: Document representation is concerned with how textual documents should be represented in various tasks, e.g. text processing, retrieval, and knowledge discovery and mining. The prevailing approach is the vector space model, i.e. a document di is represented as a vector of term weights, with one weight for each term in the collection of terms that occur at least once in the document collection D.

Text Classification: Text classification intends to categorize documents into a series of predefined thematic categories. In particular, it aims to find the mapping ξ, from a set of documents D: {d1, …, di} to a set of thematic categories C: {C1, …, Cj}, i.e. ξ: D → C.

Term Weighting Scheme: Term weighting is the process of computing and assigning a numeric value to each term in order to weight its contribution to distinguishing a particular document from others. The most popular approach is the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf).
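As a worked illustration of "tf times idf", one common variant can be written as below. The symbols w, N and df, the use of the natural logarithm, and the numbers are assumptions chosen for illustration; the chapter may use a different normalization.

```latex
% Assumed variant: raw term frequency times log inverse document frequency.
% tf_{t,d}: occurrences of term t in document d
% df_t:     number of documents containing t
% N:        number of documents in the collection
w_{t,d} = tf_{t,d} \times \log\frac{N}{df_t}

% Hypothetical numbers: N = 1000, df_t = 10, tf_{t,d} = 3
w_{t,d} = 3 \times \ln\frac{1000}{10} = 3 \times \ln 100 \approx 13.82
```

Under this variant, a term that appears in every document has idf = ln 1 = 0 and so contributes nothing, however often it occurs.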
