Classifying Two-Class Chinese Texts in Two Steps

Xinghua Fan
Copyright: © 2009 | Pages: 6
ISBN13: 9781605660103 | ISBN10: 1605660108 | EISBN13: 9781605660110
DOI: 10.4018/978-1-60566-010-3.ch034

MLA

Fan, Xinghua. "Classifying Two-Class Chinese Texts in Two Steps." Encyclopedia of Data Warehousing and Mining, Second Edition, edited by John Wang, IGI Global, 2009, pp. 208-213. https://doi.org/10.4018/978-1-60566-010-3.ch034

APA

Fan, X. (2009). Classifying Two-Class Chinese Texts in Two Steps. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining, Second Edition (pp. 208-213). IGI Global. https://doi.org/10.4018/978-1-60566-010-3.ch034

Chicago

Fan, Xinghua. "Classifying Two-Class Chinese Texts in Two Steps." In Encyclopedia of Data Warehousing and Mining, Second Edition, edited by John Wang, 208-213. Hershey, PA: IGI Global, 2009. https://doi.org/10.4018/978-1-60566-010-3.ch034

Abstract

Text categorization (TC) is the task of assigning one or more predefined category labels to natural language texts. To deal with this sophisticated task, a variety of statistical classification methods and machine learning techniques have been exploited intensively (Sebastiani, 2002), including the Naïve Bayesian (NB) classifier (Lewis, 1998), the Vector Space Model (VSM)-based classifier (Salton, 1989), the example-based classifier (Mitchell, 1996), and the Support Vector Machine (Yang & Liu, 1999).

Text filtering is a basic type of text categorization (two-class TC). It has many real-life applications (Fan, 2004); a typical one is the filtering of ill (harmful or unwanted) information, such as erotic and junk information on the web, in e-mails, and in short messages on mobile phones. This sort of information clearly needs to be carefully controlled, yet the filtering performance of existing methods is still generally unsatisfactory. The reason is that a document collection contains a number of documents that are highly ambiguous from the TC point of view; that is, there is a fuzzy area across the border between the two classes. (For ease of expression, we call the class of texts related to ill information, the negative samples, the category TARGET, and the class of texts not related to ill information, the positive samples, the category Non-TARGET.) Some documents in one category may closely resemble documents in the other. For example, if the filtering target is erotic information, many words concerning love stories and sex are likely to appear in both negative and positive samples.
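The fuzzy area described above can be illustrated with a small sketch. It is not the chapter's method: it assumes a plain multinomial Naïve Bayes scorer with add-one smoothing, pre-segmented toy documents (English tokens stand in for segmented Chinese words), and a hypothetical `band` threshold on the log-odds below which a document is marked ambiguous rather than classified.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model with add-one smoothing."""
    vocab = {w for d in docs for w in d}
    model = {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        model[c] = {
            "log_prior": math.log(len(class_docs) / len(docs)),
            "log_lik": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                        for w in vocab},
        }
    return model

def log_odds(model, doc, pos="TARGET", neg="Non-TARGET"):
    """Log-odds of TARGET versus Non-TARGET; unseen words are skipped."""
    score = model[pos]["log_prior"] - model[neg]["log_prior"]
    for w in doc:
        if w in model[pos]["log_lik"]:
            score += model[pos]["log_lik"][w] - model[neg]["log_lik"][w]
    return score

def first_step(model, doc, band=1.0):
    """Label a document, or mark it AMBIGUOUS if it falls in the fuzzy area.

    `band` is an illustrative assumption, not a value from the chapter.
    """
    s = log_odds(model, doc)
    if abs(s) < band:
        return "AMBIGUOUS"
    return "TARGET" if s > 0 else "Non-TARGET"

# Toy training data (hypothetical, for illustration only).
docs = [
    ["explicit", "sex", "video"],
    ["explicit", "adult", "video"],
    ["love", "story", "novel"],
    ["romance", "love", "poem"],
]
labels = ["TARGET", "TARGET", "Non-TARGET", "Non-TARGET"]
model = train_nb(docs, labels)

clear_target = first_step(model, ["explicit", "video"])
clear_other = first_step(model, ["love", "poem"])
borderline = first_step(model, ["love", "sex"])
```

In this toy setup, a document that mixes vocabulary from both classes, such as `["love", "sex"]`, gets a log-odds close to zero and lands in the ambiguous band, while the other two examples are labeled directly. How ambiguous documents should then be handled is left open here.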
