Massive quantities of information continue to accumulate, at roughly 1.5 billion gigabytes per year, in numerous repositories: news agencies, libraries, corporate intranets, personal computers, and the Web. A large portion of this information exists as text. Researchers, analysts, editors, venture capitalists, lawyers, help desk specialists, and even students face text analysis challenges. Text mining tools aim to discover knowledge in textual databases by isolating key pieces of information from large amounts of text and identifying relationships among documents. Text mining technology is used for plagiarism detection and authorship attribution, text summarization and retrieval, and deception detection.
Key Terms in this Chapter
Knowledge Discovery in Databases (KDD): KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, 1996).
Natural-Language Processing: Natural-language processing is a subfield of artificial intelligence and linguistics that addresses the problems of automated generation and understanding of human languages.
Text Mining (TM): Also known as intelligent text analysis, text data mining, and knowledge discovery in text, TM is the process of discovering previously unknown patterns, useful for particular purposes, in textual databases.
Feature Extraction: Feature extraction refers to the extraction of linguistic items from documents to provide a representative sample of their content. Distinctive vocabulary items found in a document are assigned to different categories by measuring their importance to the document's content.
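As an illustration of this idea (not taken from the chapter), a common way to measure the importance of a vocabulary item to a document is TF-IDF weighting. The sketch below, with hypothetical function and variable names, keeps each document's highest-scoring terms as its feature set:

```python
import math
from collections import Counter

def tfidf_features(docs, top_n=3):
    """Score each document's terms by TF-IDF and keep the top_n
    highest-scoring (most distinctive) terms as its feature set."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    features = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        features.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return features

docs = [
    "stocks fell as markets reacted to inflation data",
    "the team won the match after a late goal",
    "inflation data pushed bond markets lower",
]
print(tfidf_features(docs))
```

A practical system would also remove stop words and stem or lemmatize tokens before weighting; otherwise frequent function words can dominate the feature set.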
Data Mining (DM): DM is the essential, and arduous, step in the process of knowledge discovery in databases; its goal is to extract high-level knowledge from low-level data.
Thematic Indexing or Topic Tracking: Thematic indexing refers to the identification of the significant terms in a particular document collection. Indexing represents a given document or query text by a set of weighted or unweighted terms obtained from it.
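One concrete (and hypothetical, not from the chapter) realization of indexing by weighted terms is an inverted index: each term maps to the documents containing it, with a weight per document. A minimal sketch, assuming simple whitespace tokenization and frequency-based weights:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index mapping each term to the documents
    it occurs in, weighted by normalized within-document frequency."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        for term, count in Counter(tokens).items():
            index[term][doc_id] = count / len(tokens)  # weighted term
    return index

def search(index, query):
    """Rank documents by summing the weights of matching query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores, key=scores.get, reverse=True)

docs = ["text mining finds patterns in text",
        "databases store structured data",
        "mining patterns from data"]
index = build_index(docs)
print(search(index, "mining patterns"))  # → [2, 0]
```

Unweighted indexing is the special case where every stored weight is 1; production systems typically use TF-IDF or BM25 weights instead of raw frequencies.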
Semantic Web: The semantic Web is a web of data, like a global database. The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. The semantic Web approach aims at developing languages for expressing information in a machine-processable form.
Collocation: A collocation is a sequence of words or terms that co-occur more often than would be expected by chance; it reflects the way in which words are regularly used together.
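The phrase "more often than would be expected by chance" is commonly operationalized with pointwise mutual information (PMI), which compares a pair's observed co-occurrence probability with the probability expected if the words were independent. A small illustrative sketch (names and example text are invented for this illustration):

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # PMI is unreliable for rare pairs
        p_xy = c / (n - 1)
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scored[(x, y)] = math.log(p_xy / (p_x * p_y))
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

text = ("strong tea and strong coffee but powerful computer "
        "strong tea with a powerful computer and strong coffee").split()
print(collocations(text))
```

Here "powerful computer" outranks "strong tea": "powerful" occurs only next to "computer", while "strong" is spread across several partners, so its pairs are closer to chance.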
Text Clustering: Text clustering is a process of partitioning a given collection into a number of previously unknown groups of documents with similar content. Clustering allows for the discovery of unknown or previously unnoticed links among the documents or terms in a collection.
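To make "previously unknown groups" concrete, here is a deliberately simple greedy clustering sketch (an assumption for illustration, not a method the chapter prescribes): documents are represented as term-frequency vectors and grouped by cosine similarity, with no category labels given in advance.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy clustering: add each document to the first cluster whose
    seed document it resembles; otherwise start a new cluster."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []  # list of lists of document indices
    for i, vec in enumerate(vectors):
        for group in clusters:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found
    return clusters

docs = ["apple releases new phone",
        "stock markets rally on earnings",
        "new phone from apple announced",
        "earnings lift stock markets"]
print(cluster(docs))  # → [[0, 2], [1, 3]]
```

Real systems use k-means, hierarchical, or density-based algorithms, but the essential point is the same: the groups emerge from document similarity rather than from predefined categories.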
Text Categorization: Text categorization assigns documents to preexisting categories, called topics or themes. Automatic document categorization for knowledge-sharing purposes, document indexing in libraries, Web page classification into Internet directories, and some other tasks can be accomplished by implementing categorization algorithms.