2. Text Representation
Feature selection algorithms seek to retain the characteristics that optimize classification performance by removing noise and redundancy. Feature extraction, in contrast, represents the original features in another space through some transformation; this transformation maps a high-dimensional vector space to a low-dimensional one.
The vector space model (VSM) (Salton, 1975) is still the most popular method for text representation; it reduces each document in the corpus to a vector of real numbers. Related research focuses on which terms are most appropriate for document representation and how to calculate their weights. Much of this research adopts "word" or "n-gram" units as terms and tf*idf as the weight.
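As a concrete illustration, the following is a minimal sketch of a tf*idf vector space representation over a toy corpus. The corpus, tokenization, and helper names are illustrative assumptions, not taken from the article; the idf form used here, log(N/df), is one common variant among several.

```python
# Minimal tf*idf vector space model sketch (toy corpus, illustrative names).
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Tokenize each document into words ("word" terms, as in the text).
docs = [doc.split() for doc in corpus]
vocab = sorted({w for d in docs for w in d})

# Document frequency: number of documents containing each term.
df = Counter(w for d in docs for w in set(d))
n_docs = len(docs)

def tfidf_vector(doc):
    """Represent one document as a tf*idf vector over the vocabulary."""
    tf = Counter(doc)
    return [tf[w] * math.log(n_docs / df[w]) for w in vocab]

# Each document becomes one row: a vector of real numbers.
matrix = [tfidf_vector(d) for d in docs]
```

Terms that occur in every document receive an idf of log(N/N) = 0, which is how tf*idf discounts words that are not discriminative.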
Although tf*idf weighting has attractive features, including the identification of words that are discriminative across the documents in the collection, it provides only a relatively small reduction in description length and reveals little of the inter- or intra-document statistical structure. To overcome these shortcomings, researchers have proposed several dimensionality-reduction methods, including latent semantic indexing (LSI) (Deerwester, 1990).
LSI applies a singular value decomposition to the term-document matrix X to identify a subspace of the tf*idf feature space that captures most of the variance in the collection. This approach can achieve significant compression on large collections. According to Deerwester et al., the derived LSI features, which are linear combinations of the original tf*idf features, can capture aspects of basic linguistic notions such as synonymy and polysemy.
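The SVD step can be sketched as follows. This is an assumed toy term-document matrix, not data from the article; the truncation rank k = 2 is likewise an arbitrary choice for illustration.

```python
# Minimal LSI sketch: truncated SVD of a toy term-document matrix.
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents).
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Singular value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the rank-k latent subspace
# that captures most of the variance in the collection.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Columns of this matrix are the k-dimensional document representations
# (linear combinations of the original tf*idf features).
docs_k = np.diag(s[:k]) @ Vt[:k, :]
```

Because documents are compared in the k-dimensional subspace rather than over the full vocabulary, two documents can be close even when they share no terms, which is how LSI addresses synonymy.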