Chinese Text Categorization via Bottom-Up Weighted Word Clustering

Yu-Chieh Wu (Ming-Chuan University, Taiwan)
DOI: 10.4018/978-1-5225-1759-7.ch022

Abstract

Most research on text categorization focuses on the bag-of-words representation. Some studies have proposed alternative representations for classification, such as term phrases, Latent Semantic Indexing, and term clustering. Term clustering is an effective approach to classification and has been shown to reduce the dimensionality of term vectors. The authors use hierarchical term clustering to aggregate similar terms. To enhance performance, they present a modified indexing scheme based on the terms in each cluster. Their test collection was extracted from Chinese NETNEWS, and a centroid-based classifier was used for categorization. The results show that term clustering not only reduces dimensionality but also outperforms bag of words. Thus, term clustering can be applied to text classification with any large corpus, with the aim of saving time and increasing efficiency and effectiveness. Beyond performance, these clusters can be regarded as a conceptual knowledge base that keeps related real-world terms together.

Introduction

With the rapid growth of the Internet, the demands placed on information technology keep increasing. Under these circumstances, building a large corpus has become much easier than it was through isolated efforts. To deal with news effectively, text documents must be organized. Automatic text categorization (TC) is the task of learning to assign a class label to a given test document. Usually, a machine learning-based classifier is employed to predict the class label: it learns rules from labeled training data and applies these rules to label test documents. Before a machine learning algorithm is applied, each document is first represented as a vector. In general, the vector is derived from a set of selected words, the so-called bag of words (BOW; Uma, Sankar, & Aghila, 2008; Yen, Lee, Wu, Ying, & Tseng, 2011; Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2010; Katakis, Tsoumakas, & Vlahavas, 2010; Han & Karypis, 2000). However, the biggest challenges of this approach are that unknown words are missing from the representation and that the vectors suffer from the curse of dimensionality. To address this, a number of dimension reduction-based approaches (Sebastiani, 2002; Bekkerman, El-Yaniv, Tishby, & Winter, 2003; Pereira, Tishby, & Lee, 1993) have been proposed over the past years. Examples include latent semantic indexing (Sebastiani, 2002; Baker & McCallum, 1998) and information-theoretic clustering (Bekkerman, El-Yaniv, Winter, & Tishby, 2001; Pereira, Tishby, & Lee, 1993). Dhillon, Mallela, and Kumar (2003) also report better results with a term-cluster representation.
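
To make the contrast between the two representations concrete, the following is a minimal sketch, not the chapter's actual method. The toy documents, the English stand-in tokens, the `term_cluster` mapping, and the helper names `bow_vector` and `cluster_vector` are all hypothetical; real Chinese text would first need word segmentation, and the term-to-cluster assignment would come from a clustering step rather than being written by hand.

```python
from collections import Counter

# Toy pre-segmented documents (hypothetical data for illustration only).
docs = [
    ["stock", "market", "rise"],
    ["share", "market", "fall"],
    ["team", "win", "match"],
]

# Hypothetical term-to-cluster assignment; assumed to be the output of
# some term clustering procedure that is not shown here.
term_cluster = {
    "stock": "finance", "share": "finance", "market": "finance",
    "rise": "trend", "fall": "trend",
    "team": "sport", "win": "sport", "match": "sport",
}

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cluster_vector(tokens, clusters, mapping):
    """Count occurrences per cluster, so similar terms share one dimension."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return [counts[c] for c in clusters]

vocabulary = sorted({t for d in docs for t in d})   # one dimension per word
clusters = sorted(set(term_cluster.values()))       # one dimension per cluster

bow = [bow_vector(d, vocabulary) for d in docs]
clu = [cluster_vector(d, clusters, term_cluster) for d in docs]

print(len(vocabulary), len(clusters))  # 8 3
print(clu[0], clu[1])                  # [2, 0, 1] [2, 0, 1]
```

In this toy setting the first two documents share only one word, so their bag-of-words vectors differ, yet they map to the same cluster vector, which illustrates how clustering can both shrink the vector and group related terms.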

Text categorization is a rich and broad research area. It provides a fundamental step for text mining (Janev, Dudukovic, & Vraneš, 2009; Krogstie, Veres, & Sindre, 2007; Lee, Wu, & Yang, 2009; Nour & Mouakket, 2011; Wu & Chang, Efficient Text Chunking using Linear Kernel with Mask Method, 2007; Wu, Lee, & Yang, Robust and efficient multiclass SVM models for phrase pattern recognition, 2008; Yang, Huang, Tsai, Chung, & Wu, 2009). Some well-known machine learning methods, such as support vector machines (SVM) (Joachims, 1997), have achieved great success in this field. Other algorithms that have been widely used in recent years include linear classifiers (Widrow-Hoff weighting) (Lewis, Schapire, Callan, & Papka, 1996), centroid-based learners (Han & Karypis, 2000), memory-based learning (k-nearest neighbors, kNN) (Uma, Sankar, & Aghila, 2008; Yang Y., 1999), the generalized instance set algorithm with Widrow-Hoff weighting (GIS-W) (Lam & Ho, 1998), decision trees (Apte, Damerau, & Weiss, 1994; Sebastiani, 2002), Naïve Bayes (Apte, Damerau, & Weiss, 1994; Bekkerman, El-Yaniv, Tishby, & Winter, 2003), and neural networks (Apte, Damerau, & Weiss, 1994; Uma, Sankar, & Aghila, 2008; Sebastiani, 2002; Yang Y., 1999).
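
Of the learners listed above, the centroid-based one is simple enough to sketch in a few lines. The following is an illustrative approximation under stated assumptions, not the implementation used in the chapter: each class is summarized by the normalized sum of its unit-length training vectors, and a test document is given the label of the centroid with the highest cosine similarity. The function names, the toy vectors, and the labels are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def train_centroids(vectors, labels):
    """Build one unit-length centroid per class from the training vectors."""
    sums = {}
    for vec, label in zip(vectors, labels):
        unit = normalize(vec)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
        sums[label] = [a + b for a, b in zip(sums[label], unit)]
    return {label: normalize(total) for label, total in sums.items()}

def classify(vec, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    unit = normalize(vec)
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(unit, centroids[label])),
    )

# Hypothetical 3-dimensional document vectors (for example, cluster counts).
train_x = [[2, 0, 1], [3, 0, 0], [0, 3, 0], [0, 2, 1]]
train_y = ["finance", "finance", "sport", "sport"]

centroids = train_centroids(train_x, train_y)
print(classify([1, 0, 1], centroids))  # finance
print(classify([0, 4, 1], centroids))  # sport
```

Because both the documents and the centroids are scaled to unit length, the dot product in `classify` equals the cosine similarity, so training and prediction are each a single pass over the data.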
