Smoothing Text Representation Models Based on Rough Set


Zhihua Wei, Duoqian Miao, Ruizhi Wang, Zhifei Zhang
DOI: 10.4018/978-1-60960-881-1.ch003

Abstract

Text representation is the prerequisite of various document processing tasks, such as information retrieval, text classification, and text clustering. It has been studied intensively over the past few years, and many excellent models have been designed. However, the performance of these models is affected by the problem of data sparseness. Existing smoothing techniques usually make use of statistical theory or linguistic information to assign a uniform distribution to absent words; they neither reflect the real word distribution nor distinguish between words. In this chapter, a method based on a soft computing theory, Tolerance Rough Set theory, is proposed; it makes use of the upper and lower approximations of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, our algorithms can estimate smoothing values for absent words according to their relations with existing words. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora have shown that our algorithms greatly improve the performance of text representation models, especially on unbalanced corpora.

Introduction

Representation of texts is critical in text information processing tasks such as retrieval, classification, clustering, and summarization. Text representation is the prerequisite of document processing mainly because it determines the encoding of text, which directly affects processing performance. An effective representation enables efficient processing of large collections of documents while preserving as much as possible of the semantic information that is useful to a given task.

Many excellent text representation models have been designed based on rigorous mathematical theories, including the Vector Space Model (VSM), proposed by G. Salton et al. (1975), and several statistical topic models. VSM, a predominant method at present, represents a document as a vector in which each dimension corresponds to a separate term; if a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. In recent years, statistical topic models have been successfully applied to many text processing tasks. These models capture the word correlations in a corpus with a low-dimensional set of multinomial distributions, called “topics”, and can find a relatively short description of the documents. The Latent Dirichlet Allocation (LDA) model, proposed by D. Blei et al. (2003), is a widely used statistical topic model. Its basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
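To make the VSM description concrete, the sketch below computes tf-idf term weights for a toy corpus. The tokenization (lowercased whitespace splitting) and this particular tf-idf variant are illustrative assumptions, not the exact weighting scheme used in the chapter.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build VSM vectors with tf-idf weights over the global vocabulary.

    Each document becomes a dict mapping term -> weight; terms absent
    from a document implicitly get weight zero.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # tf normalized by document length, idf = log(N / df)
        vec = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

docs = ["rough set theory models imprecision",
        "topic models describe documents as topic mixtures"]
for v in tfidf_vectors(docs):
    print(v)
```

Note that a term occurring in every document (here, “models”) receives idf 0 and thus weight 0, which is the usual behavior of this idf variant.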

However, most of these models suffer from the data sparseness problem. Taking text classification as an example, one variety of sparseness results from the differences among the vocabularies of different classes. In a corpus, the vocabulary of each class is a subset of the whole vocabulary. Consequently, when a document is represented with a model constructed on the global vocabulary, the resulting vector carries no weight information for words that do not appear in the document itself, and this phenomenon greatly affects classification performance. In fact, “absent” does not actually mean “nonexistent”: if the corpus were large enough, these words would appear. In order to better model reality and improve classification performance, it is necessary to apply some smoothing strategy. In this chapter, we focus on the smoothing problem for VSM and LDA, as illustrated below.
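The following toy example, with an invented two-class corpus, illustrates the sparseness just described: vectors built over the global vocabulary contain zeros for every term a document does not itself contain.

```python
# Illustrative only: a toy corpus showing how a global vocabulary
# produces zero entries (sparseness) in per-document vectors.
docs_class_a = ["stock market price rises"]
docs_class_b = ["football match final score"]

vocabulary = sorted({w for d in docs_class_a + docs_class_b for w in d.split()})

def to_vector(doc):
    words = doc.split()
    return [words.count(term) for term in vocabulary]

print(vocabulary)
print(to_vector(docs_class_a[0]))  # zeros at every class-B term
print(to_vector(docs_class_b[0]))  # zeros at every class-A term
```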

The name “smoothing” comes from the fact that these techniques tend to make distributions more uniform. The two precise goals of text model smoothing are discounting seen words and assigning reasonable counts to unseen words. S. Chen & J. Goodman (1998) observed that smoothing methods not only generally prevent zero probabilities but also attempt to improve the accuracy of the model as a whole. Previous research on the smoothing problem has mainly followed two lines: smoothing based on statistical theory and smoothing based on semantic information. Statistics-based smoothing methods include Laplace smoothing, Jelinek-Mercer smoothing, absolute discounting, and so on, while semantics-based smoothing methods consider semantic similarity from many points of view. Although the former are effective at preventing zero probabilities in many cases, they treat all terms the same and cannot emphasize class-specific terms. The latter address this problem and have produced some effective solutions, in which linguistic information is incorporated for smoothing purposes. This chapter proposes a new smoothing strategy from the perspective of soft computing: it regards the smoothing problem as a kind of imprecision problem, and Tolerance Rough Set theory is adopted to describe different degrees of imprecision according to word co-occurrence, a kind of statistical semantic information.
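As an illustration of the contrast drawn above, the sketch below sets uniform add-alpha (Laplace) smoothing next to a hypothetical tolerance-class-based scheme in the spirit of the chapter’s proposal: absent words related to a document through a co-occurrence tolerance relation receive more mass than unrelated ones. The threshold `theta` and the weights `seen_w`/`related_w` are assumptions for illustration, not the authors’ algorithm.

```python
from collections import Counter, defaultdict

def laplace_smooth(counts, vocabulary, alpha=1.0):
    """Classic add-alpha (Laplace) smoothing: every unseen term in the
    vocabulary receives the same probability mass, regardless of meaning."""
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}

def tolerance_classes(docs, theta=2):
    """Tolerance class of a term: all terms co-occurring with it in at
    least `theta` documents (a co-occurrence-based tolerance relation;
    `theta` is an assumed threshold)."""
    co = defaultdict(Counter)
    for tokens in docs:
        uniq = set(tokens)
        for t in uniq:
            for u in uniq:
                if u != t:
                    co[t][u] += 1
    return {t: {t} | {u for u, c in cnt.items() if c >= theta}
            for t, cnt in co.items()}

def tolerance_smooth(doc_tokens, classes, seen_w=1.0, related_w=0.5):
    """Hypothetical differentiated smoothing: terms related to the document
    through some tolerance class (its upper-approximation region) get a
    larger smoothing weight than completely unrelated terms (weight 0)."""
    seen = set(doc_tokens)
    upper = set().union(*(classes.get(t, {t}) for t in seen))
    weights = {}
    for t in classes:
        if t in seen:
            weights[t] = seen_w
        elif t in upper:
            weights[t] = related_w   # absent but related: non-uniform mass
        else:
            weights[t] = 0.0         # absent and unrelated
    return weights
```

The design point is that the upper-approximation region supplies a principled middle ground between “seen” and “completely unrelated”, which is exactly where uniform schemes such as Laplace smoothing cannot differentiate.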
