Since the invention of the printing press, text has been the predominant mode of collecting, storing, and disseminating a vast, rich range of information. With the unprecedented growth of electronic storage and dissemination, document collections have expanded rapidly, increasing the need to manage and analyze this form of data despite its unstructured or semistructured form. Text-data analysis (Hearst, 1999) has emerged as an interdisciplinary research area at the junction of several older fields such as machine learning, natural language processing, and information retrieval (Grobelnik, Mladenic, & Milic-Frayling, 2000). It is sometimes viewed as an adapted form of a closely related field that has also emerged recently, namely data mining, which focuses primarily on structured data represented mostly in relational tables or multidimensional cubes. This article provides an overview of the main research directions in text-data analysis. After the “Introduction,” the “Background” section describes a ubiquitous text-data representation model along with the preprocessing steps employed to achieve better text-data representations and applications. The focal section, “Text-Data Analysis,” presents a detailed treatment of the various text-data analysis subprocesses: information extraction, information retrieval and information filtering, document clustering, and document categorization. The article closes with a “Future Trends” section followed by a “Conclusion” section.
Text-data analysis is defined as the computerized process of automatically extracting useful knowledge from enormous collections of natural-language text documents (also known as document collections), usually drawn from various dynamic sources. It is a broad process embedding a number of subprocesses, all of which deal with textual resources that are naturally unstructured or semistructured, as in the case of HTML (HyperText Markup Language) and XML (eXtensible Markup Language) documents. This lack of structure makes it extremely difficult to apply computational solutions to real-life text-based problems.
To alleviate the difficulty computers face when dealing with the unstructured nature of text-data resources, a process called indexing is utilized. Indexing is normally preceded by a number of preprocessing steps that attempt to optimize it, mainly through feature reduction, as explained in this section.
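The preprocessing steps referred to above can be sketched as a small pipeline. This is a hedged, minimal illustration: the stopword list, suffix rules, and sample sentence are invented for the example and are far cruder than the stemming and feature-reduction methods used in practice.

```python
# Minimal sketch of a preprocessing pipeline for feature reduction.
# The stopword list and suffix rules below are illustrative assumptions.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
SUFFIXES = ("ing", "ed", "es", "s")

def preprocess(text):
    tokens = text.lower().split()                 # case folding + tokenization
    tokens = [t.strip(".,;:!?") for t in tokens]  # strip punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # stopword removal
    stems = []
    for t in tokens:
        for suf in SUFFIXES:                      # crude suffix-stripping stemmer
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(preprocess("The cats are chasing mice in the garden."))
```

Each step shrinks the feature space: case folding merges capitalization variants, stopword removal drops uninformative terms, and stemming conflates inflected forms of the same word.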
Key Terms in this Chapter
Case Folding: The process of converting all the characters in a document into the same case, either all upper case or all lower case, in order to speed up comparisons during the indexing process.
Stemming: The process of removing prefixes and suffixes from words to reduce them to stems, thus eliminating part-of-speech and other verbal or plural inflections.
Indexing: The process of mapping a document into a structured (tabular) format that captures its content.
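A common structured format produced by indexing is the inverted index, which maps each term to the documents containing it. The sketch below is a toy illustration with an invented two-document collection:

```python
from collections import defaultdict

# Toy inverted index: term -> set of ids of documents containing the term.
# The document texts are invented for illustration.
docs = {
    1: "data mining of text data",
    2: "text retrieval and text filtering",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["text"]))   # → [1, 2]
```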
Information Retrieval: The process of discovering, from a static collection, the set of documents that satisfy a user’s temporary information need expressed as a query.
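Retrieval against a static collection can be illustrated with a simple conjunctive (AND) query over an inverted index. This is a minimal sketch; the collection and query are invented for the example, and real systems use ranked rather than purely Boolean matching:

```python
from collections import defaultdict

# Toy collection and inverted index; documents are invented for illustration.
docs = {
    1: "text data analysis methods",
    2: "image data compression",
    3: "text retrieval from data collections",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def retrieve(query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index[t] for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(retrieve("text data")))   # → [1, 3]
```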
Information Filtering: The process of eliminating documents from a document stream routed to a subscribed user based on the user’s profile.
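In contrast to retrieval, filtering matches a stream of incoming documents against a standing user profile. The sketch below is a deliberately simple illustration: the profile keywords and the stream are invented, and a document is delivered only if it mentions at least one profile term.

```python
# Toy profile-based filter; profile terms and stream are invented.
profile = {"retrieval", "clustering"}

stream = [
    "advances in document clustering",
    "stock market report",
    "new retrieval models",
]

# Deliver a document only if it shares at least one term with the profile.
delivered = [d for d in stream if profile & set(d.lower().split())]
print(delivered)
```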
Information Extraction: The process of extracting predefined information about known entities, and the relationships among those entities, from streams of documents, and storing this information in predesigned templates.
Document Clustering: The process of grouping similar documents into partitions such that documents within the same partition exhibit a higher degree of similarity to one another than to any document in any other partition.
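The similarity underlying clustering is commonly measured as the cosine between term-frequency vectors. The sketch below groups documents with a single threshold pass; the documents, the 0.5 threshold, and the first-seed assignment rule are invented simplifications of real clustering algorithms:

```python
import math
from collections import Counter

# Toy documents, invented for illustration.
docs = [
    "text mining and text analysis",
    "analysis of text mining",
    "stock prices and market trends",
]

def cosine(a, b):
    """Cosine similarity between term-frequency vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

# Single-pass grouping: join the first cluster whose seed document is
# similar enough (threshold 0.5 is an illustrative assumption).
clusters = []
for d in docs:
    for cl in clusters:
        if cosine(d, cl[0]) > 0.5:
            cl.append(d)
            break
    else:
        clusters.append([d])

print([len(cl) for cl in clusters])   # → [2, 1]
```

The two text-mining documents land in one cluster while the finance document forms its own, matching the definition above: within-partition similarity exceeds between-partition similarity.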