The two main stages in automated document categorization are term reduction and classification. Term reduction is carried out by performing feature extraction followed by feature selection. Feature selection methods select a subset of the original features, namely those with the highest scores, using either a global ranking metric (Chi-Squared or Information Gain, for example) or a function of the performance of a classifier trained on the selected feature set. Most authors concentrate their research on this step, and different methods have been proposed to reduce terms.
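As an illustration of ranking-based feature selection, the following minimal sketch scores each term with the Chi-Squared statistic of its 2x2 term/category contingency table and keeps the k highest-scoring terms. The function names (`chi2_score`, `select_top_k`), the use of the maximum score over categories as the global score, and the alphabetical tie-breaking are assumptions made for this example; they are not taken from any of the cited works.

```python
def chi2_score(docs, labels, term, category):
    """Chi-squared statistic of one (term, category) pair.

    docs   : list of sets of tokens (binary term presence)
    labels : list of category labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document outside category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document outside category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0          # degenerate table: term or category is constant
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(docs, labels, k):
    """Rank terms by their best chi-squared score over all categories
    (an assumed aggregation choice) and return the k top-ranked terms."""
    vocab = sorted({t for tokens in docs for t in tokens})
    categories = set(labels)
    scores = {
        t: max(chi2_score(docs, labels, t, cat) for cat in categories)
        for t in vocab
    }
    # Sort by descending score, then alphabetically to make ties deterministic.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    docs = [{"goal", "match"}, {"goal", "team"},
            {"vote", "match"}, {"vote", "party"}]
    labels = ["sport", "sport", "politics", "politics"]
    print(select_top_k(docs, labels, 2))
```

In this toy corpus, "goal" and "vote" each occur in exactly one category, so they receive the highest scores, while "match" occurs equally often in both categories and scores zero.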
In (Jiang et al., 2012), the authors propose an improved KNN algorithm for term reduction, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization.
In (Roberto et al., 2012), the authors propose a filtering method for feature selection called ALOFT (At Least One FeaTure). The method focuses on specific characteristics of the text categorization domain: it ensures that every document in the training set is represented by at least one feature, and the number of selected features is determined in a data-driven way.
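The "at least one feature" property can be illustrated with a short sketch. It is not the published ALOFT implementation: it assumes that a score for every term has already been computed by some ranking metric, and it simply keeps, for each training document, the best-scoring term that the document contains. The function name `at_least_one_feature`, the `scores` dictionary, and the alphabetical tie-breaking are hypothetical choices made for this example.

```python
def at_least_one_feature(docs, scores):
    """Select, for every document, its highest-scoring term.

    docs   : list of sets of tokens
    scores : dict mapping term -> precomputed relevance score (assumed given)

    The size of the returned list is not fixed in advance; it depends
    on how many distinct "best" terms the documents produce.
    """
    selected = []
    for tokens in docs:
        candidates = sorted(t for t in tokens if t in scores)
        if not candidates:
            continue  # document has no scored term; nothing to select
        # max() returns the first maximal element, so ties are broken
        # alphabetically because candidates is sorted.
        best = max(candidates, key=lambda t: scores[t])
        if best not in selected:
            selected.append(best)
    return selected


if __name__ == "__main__":
    docs = [{"goal", "match"}, {"goal", "team"},
            {"vote", "match"}, {"party"}]
    scores = {"goal": 4.0, "vote": 4.0, "match": 0.0,
              "team": 1.3, "party": 1.3}
    features = at_least_one_feature(docs, scores)
    print(features)
    # Every document shares at least one term with the selected set.
    print(all(tokens & set(features) for tokens in docs))
```

Unlike a top-k cut-off, this selection keeps "party" even though its score is low, because the last document would otherwise have no selected feature at all.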
In (Karabulut, 2013), a two-stage term reduction strategy based on Information Gain (IG) theory and Geometric Particle Swarm Optimization (GPSO) search is proposed, together with a fuzzy unordered rule induction algorithm (FURIA), to categorize multi-label texts.
A projected-prototype based classifier is proposed in (Zhang et al., 2013) for text categorization, in which a document category is represented by a set of prototypes, each assembling a representative of the documents in a subclass and its corresponding term subspace.