Smoothing Text Representation Models Based on Rough Set

Zhihua Wei, Duoqian Miao, Ruizhi Wang, Zhifei Zhang
ISBN13: 9781609608811|ISBN10: 160960881X|EISBN13: 9781609608828
DOI: 10.4018/978-1-60960-881-1.ch003

MLA

Wei, Zhihua, et al. "Smoothing Text Representation Models Based on Rough Set." Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications, IGI Global, 2012, pp. 50-68. https://doi.org/10.4018/978-1-60960-881-1.ch003

APA

Wei, Z., Miao, D., Wang, R., & Zhang, Z. (2012). Smoothing Text Representation Models Based on Rough Set. In Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications (pp. 50-68). IGI Global. https://doi.org/10.4018/978-1-60960-881-1.ch003

Chicago

Wei, Zhihua, Duoqian Miao, Ruizhi Wang, and Zhifei Zhang. "Smoothing Text Representation Models Based on Rough Set." In Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications. Hershey, PA: IGI Global, 2012. https://doi.org/10.4018/978-1-60960-881-1.ch003

Abstract

Text representation is a prerequisite for various document processing tasks, such as information retrieval, text classification, and text clustering. It has been studied intensively in recent years, and many effective models have been designed. However, the performance of these models suffers from data sparseness. Existing smoothing techniques usually draw on statistical theory or linguistic information to assign a uniform distribution to absent words; they neither reflect the real word distribution nor distinguish between words. This chapter proposes a method based on a soft computing theory, Tolerance Rough Set theory, which uses the upper and lower approximations of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, our algorithms can estimate smoothing values for absent words according to their relations to the words present in a document. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that our algorithms greatly improve the performance of text representation models, especially on unbalanced corpora.
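As a rough illustration of the idea sketched in the abstract, the following Python snippet builds tolerance classes of terms from document co-occurrence and then assigns a small smoothing weight to absent terms that fall in a document's upper approximation. The function names, the co-occurrence threshold `theta`, and the flat smoothing weight `eps` are illustrative assumptions for this sketch, not the authors' exact formulation, which differentiates weights by approximation region.

```python
# Hedged sketch: tolerance-rough-set smoothing for a term-weight vector.
# theta and eps are assumed parameters, not taken from the chapter.
from collections import defaultdict

def tolerance_classes(docs, theta=2):
    """Tolerance class of term t: t plus all terms co-occurring with t
    in at least theta documents."""
    cooc = defaultdict(int)
    for terms in docs:
        ts = sorted(set(terms))
        for i, a in enumerate(ts):
            for b in ts[i + 1:]:
                cooc[(a, b)] += 1
    vocab = {t for terms in docs for t in terms}
    classes = {t: {t} for t in vocab}
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes

def smooth(doc_terms, weights, classes, eps=0.1):
    """Give absent terms in the document's upper approximation a small
    nonzero weight instead of zero."""
    present = set(doc_terms)
    # Upper approximation: terms whose tolerance class overlaps the document.
    upper = {t for t, cls in classes.items() if cls & present}
    smoothed = dict(weights)
    for t in upper - present:
        smoothed[t] = eps  # related but absent: small smoothing value
    return smoothed
```

Terms unrelated to the document (tolerance class disjoint from its term set) keep weight zero, which is what distinguishes this scheme from uniform smoothing.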
