Local and Global Latent Semantic Analysis for Text Categorization

Khadoudja Ghanem (Constantine 2 University, Algeria)
DOI: 10.4018/978-1-5225-5191-1.ch060

In this paper the authors propose a semantic approach to document categorization. The idea is to create, for each category, a semantic index (a representative term vector) by performing a local Latent Semantic Analysis (LSA) followed by a clustering process. LSA is then applied a second time (Global LSA) to a term-class matrix in order to retrieve the class most similar to the query (the document to classify), in the same way that LSA is used in Information Retrieval to retrieve the documents most similar to a query. The proposed system is evaluated on a popular dataset, the 20 Newsgroups corpus. The results show the effectiveness of the method compared with the classic KNN and SVM classifiers as well as with methods presented in the literature. Experimental results show that the new method achieves high precision and recall rates and that classification accuracy is significantly improved.
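The Global LSA step described in the abstract can be illustrated with a minimal sketch. It assumes the standard LSA retrieval recipe (truncated SVD of the term-class matrix, folding the query into the latent space, cosine similarity against the class vectors); the chapter's exact weighting and rank are not given here, so the toy matrix `A`, the rank `k`, and the helper names `fit_global_lsa` and `classify` are all hypothetical. The local LSA and clustering stage that builds the per-category term vectors is not covered.

```python
import numpy as np

# Hypothetical toy term-class matrix: rows are terms, columns are classes.
terms = ["goal", "team", "match", "cpu", "memory", "disk"]
classes = ["sport", "tech"]
A = np.array([
    [5.0, 0.0],
    [4.0, 1.0],
    [3.0, 0.0],
    [0.0, 6.0],
    [0.0, 4.0],
    [1.0, 3.0],
])

def fit_global_lsa(term_class, k):
    """Truncated SVD of the term-class matrix, keeping k latent dimensions."""
    U, s, Vt = np.linalg.svd(term_class, full_matrices=False)
    # Rows of V_k are the class representations in the latent space.
    return U[:, :k], s[:k], Vt[:k, :].T

def classify(query, U_k, s_k, V_k):
    """Fold a query term vector into the latent space and rank classes by cosine."""
    q_hat = (query @ U_k) / s_k  # standard LSA fold-in: q^T U_k S_k^{-1}
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat) + 1e-12
    sims = (V_k @ q_hat) / norms
    return int(np.argmax(sims)), sims

model = fit_global_lsa(A, k=2)

# A document mentioning "goal" and "team", expressed over the same term order.
query_sport = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
# A document mentioning "cpu" and "disk".
query_tech = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0])

best_sport, _ = classify(query_sport, *model)
best_tech, _ = classify(query_tech, *model)
print(classes[best_sport], classes[best_tech])
```

With only two classes the sketch keeps both latent dimensions; on a corpus with many classes, `k` would normally be chosen smaller than the number of classes.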
Chapter Preview

The two main stages in automated document categorization are term reduction and classification. Term reduction is carried out by performing feature extraction followed by feature selection. Feature selection methods select a subset of the original features (those with the highest scores) using either a global ranking metric (Chi-Squared or Information Gain, for example) or a function of the performance of a classifier that uses the selected feature set. Most authors concentrate their research on this step, and different methods have been proposed to reduce terms.
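As an illustration of ranking features with a global metric, the sketch below scores terms with the Chi-Squared statistic for one target class and keeps the top-scoring ones. The toy documents, the labels, and the function names `chi_squared_scores` and `select_top_k` are assumptions made for the example, not part of the chapter.

```python
from collections import Counter

def chi_squared_scores(docs, labels, target):
    """Chi-squared score of each term with respect to one target class.

    docs   : list of token lists
    labels : one class label per document
    target : class the terms are scored against
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not raw counts
            (df_pos if y == target else df_neg)[term] += 1
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]  # target-class documents containing the term
        b = df_neg[term]  # other documents containing the term
        c = n_pos - a     # target-class documents without the term
        d = n_neg - b     # other documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy corpus.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["cpu", "memory", "team"],
    ["cpu", "disk", "score"],
]
labels = ["sport", "sport", "tech", "tech"]

scores = chi_squared_scores(docs, labels, target="sport")
top_terms = select_top_k(scores, 2)
print(top_terms)
```

Terms that occur in only one class ("goal", "cpu") score highest, while a term spread evenly across classes ("score") scores zero.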

In (Jiang et al, 2012), the authors propose an improved KNN algorithm for term reduction, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization.

In (Roberto et al, 2012), the authors propose a filtering method for feature selection called ALOFT (At Least One FeaTure). The method focuses on specific characteristics of the text categorization domain. It also ensures that every document in the training set is represented by at least one feature, and the number of selected features is determined in a data-driven way.
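The "at least one feature" guarantee can be sketched as follows. The text above states only the guarantee, so the rule used here (for each training document, keep its single best-scoring term) is an assumption about how it might be met, and the scores, documents, and the function name `aloft_select` are hypothetical.

```python
def aloft_select(doc_term_sets, scores):
    """Select features so that every document keeps at least one of its terms.

    doc_term_sets : list of sets of terms, one per training document
    scores        : dict mapping term -> score from any ranking metric
    """
    selected, seen = [], set()
    for doc_terms in doc_term_sets:
        candidates = sorted(t for t in doc_terms if t in scores)
        if not candidates:
            continue  # document has no scored term; nothing to keep
        # Assumed rule: keep the best-scoring term of this document.
        best = max(candidates, key=lambda t: scores[t])
        if best not in seen:
            seen.add(best)
            selected.append(best)
    return selected

# Hypothetical term scores and training documents.
term_scores = {"goal": 4.0, "cpu": 4.0, "team": 1.3, "disk": 1.3, "score": 0.0}
train_docs = [
    {"goal", "team"},
    {"team", "score"},
    {"cpu", "team"},
    {"disk", "score"},
]

features = aloft_select(train_docs, term_scores)
print(features)
```

The size of the selected set is not fixed in advance; it depends on how many distinct best terms the training documents produce, which is one way to read "determined in a data-driven way".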

In (Karabulut, 2013), a two-stage term reduction strategy based on Information Gain (IG) theory and Geometric Particle Swarm Optimization (GPSO) search is proposed with a fuzzy unordered rule induction algorithm (FURIA) to categorize multi-label texts.

A projected-prototype based classifier is proposed in (Zhang et al, 2013) for text categorization, in which a document category is represented by a set of prototypes, each assembling a representative for the documents in a subclass and its corresponding term subspace.
