Text categorization (TC) is a data mining technique for automatically classifying documents into one or more predefined categories. This paper will introduce the principles of TC, discuss common TC methods and steps, give an overview of the various types of TC systems, and discuss future trends. TC systems begin with a set of known categories and a set of training documents already assigned to a category, usually by a human expert. Depending on the system, the documents may undergo a process called dimensionality reduction, which reduces the number of words or features that the classifier evaluates during the learning process. The system then analyzes the documents and “learns” which words or features of each document caused it to be classified into a particular category. This is known as supervised learning, because it is based on human knowledge of the categories and their criteria. The learning process results in a classifier that can apply the rules it learned during the training phase to additional documents.
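To make the supervised learning loop concrete, the following is a minimal sketch in Python. It is a toy illustration, not taken from any particular TC system: the word-overlap scoring, the category names, and the example documents are all assumptions made for brevity.

```python
from collections import Counter

def train(labeled_docs):
    """Learn, per category, how often each word occurs in its training documents."""
    profiles = {}
    for text, category in labeled_docs:
        profiles.setdefault(category, Counter()).update(text.lower().split())
    return profiles

def classify(profiles, text):
    """Assign the category whose learned word profile best overlaps the new document."""
    words = Counter(text.lower().split())
    def overlap(category):
        return sum(min(n, profiles[category][w]) for w, n in words.items())
    return max(profiles, key=overlap)

# A (hypothetical) training set already labeled by a human expert.
training = [
    ("stocks rose as markets rallied", "finance"),
    ("the team won the final match", "sports"),
]
profiles = train(training)
print(classify(profiles, "markets fell but stocks recovered"))  # finance
```

A real TC system would replace the overlap score with a principled model (naïve Bayes, kNN, and others are discussed below), but the shape of the process — labeled training documents in, a reusable classifier out — is the same.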
Preparing the Classifier
Before classifiers can begin categorizing new documents, there are several steps required to prepare and train the classifier. First, categories must be established and decisions must be made about how documents are categorized. Second, a training set of documents must be selected. Finally and optionally, the training set often undergoes dimensionality reduction, either through basic techniques such as stemming or, in some cases, more advanced feature selection or extraction. Decisions made at each point in these preparatory steps can significantly affect classifier performance.
TC always begins with a fixed set of categories to which documents must be assigned. This distinguishes TC from text clustering, where categories are built on the fly in response to similarities among query results. Often the categories cover a broad range of subjects or topics, like those in the large text corpora used in TC research, such as news stories, websites, and articles (for example, the Reuters and Ohsumed corpora used in Joachims). Classifiers can be designed to assign as many categories to a given document as are deemed relevant, or they may be restricted to the top k most relevant categories. In some instances, TC applications have just one category, and the classifier learns to make a yes/no decision about whether each document should or should not be assigned to that category. This is called binary TC. Examples include authorship verification (Koppel & Schler, 2004) and detecting system misuse (Zhang & Shen, 2005).
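The three assignment regimes just described — multi-label, top-k, and binary — can be sketched as simple selections over a classifier's relevance scores. The scores, category names, and the 0.5 threshold below are illustrative assumptions, not values from the paper.

```python
# Hypothetical relevance scores a trained classifier might assign to one document.
scores = {"politics": 0.82, "economy": 0.64, "sports": 0.05, "health": 0.31}

# Multi-label assignment: every category whose score clears a threshold.
assigned = sorted(c for c, s in scores.items() if s >= 0.5)

# Restriction to the top k most relevant categories.
k = 2
top_k = sorted(scores, key=scores.get, reverse=True)[:k]

# Binary TC: a single yes/no decision for one category of interest.
decision = scores["politics"] >= 0.5

print(assigned, top_k, decision)  # ['economy', 'politics'] ['politics', 'economy'] True
```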
The next step in preparing a classifier is to select a training set of documents that will be used to build the classification algorithm. In order to build a classifier, a TC system must be given a set of documents that are already classified into the desired categories. This training set needs to be robust and representative, consisting of a wide variety of documents that fully represent the categories to be learned. TC systems also often require a test set: an additional group of pre-classified documents, withheld during training, that is used to measure classifier performance against a human indexer. Training can occur all at once, in a batch process before the classifier begins categorizing new documents, or it can continue alongside categorization; the latter is known as online training.
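The split between training and test sets, and the comparison against a human indexer, can be sketched as follows. The documents, labels, 80/20 split, and baseline classifier are hypothetical placeholders chosen for illustration.

```python
import random

# Hypothetical pre-classified documents; the labels stand in for a human
# indexer's category assignments.
labeled = [("document %d" % i, "catA" if i % 2 else "catB") for i in range(100)]
random.seed(0)
random.shuffle(labeled)

# Hold out part of the pre-classified collection as a test set.
split = int(0.8 * len(labeled))
train_set, test_set = labeled[:split], labeled[split:]

def accuracy(classifier, test_set):
    """Fraction of test documents on which the classifier agrees with the human label."""
    return sum(classifier(text) == label for text, label in test_set) / len(test_set)

# A trivial baseline "classifier" to show how the test set is used.
baseline = lambda text: "catA"
print(accuracy(baseline, test_set))
```

In a batch regime, `train_set` is consumed once before classification begins; in an online regime, newly classified (and verified) documents would be appended to it and the model updated as categorization proceeds.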
One final, optional step for TC systems is to reduce the number of terms in the index. The technique for doing so, known as dimensionality reduction, comes from research in information retrieval and is applicable to many data mining tasks. Dimensionality reduction entails reducing the number of terms (i.e., dimensions in the vector space model of IR) so that classifiers perform more efficiently. It particularly benefits classifiers such as naïve Bayes and kNN, which employ other information retrieval techniques such as document similarity. It can also address the problem of “noisy” training data, where similar terms (for example, forms of the same word with different suffixes) would otherwise be interpreted by the classifier as different words.
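The effect of even basic dimensionality reduction can be shown with a deliberately crude suffix stripper. This is an assumption made for brevity — a real system would use a proper stemming algorithm such as Porter's — but it illustrates both goals: the vocabulary shrinks, and different suffixed forms of the same word collapse to one term instead of being treated as unrelated features.

```python
def crude_stem(word):
    """A deliberately simple suffix stripper (not a real stemmer like Porter's)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = ["classifiers classify classified documents",
        "a classifier classifies a document"]

raw_vocab = {w for d in docs for w in d.split()}
stemmed_vocab = {crude_stem(w) for d in docs for w in d.split()}

# Stemming shrinks the term space: 8 distinct raw terms become 5.
print(len(raw_vocab), len(stemmed_vocab))  # 8 5
```

Note that "classified" and "classifies" — different words to the raw classifier — map to the same stemmed term, which is exactly the "noisy data" problem described above.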