Text Mining

Thomas Mandl (Universität Hildesheim, Germany)
DOI: 10.4018/978-1-4666-5888-2.ch185

Basic Components Of Text Mining

Text mining typically begins with natural language processing. The texts must first be converted into numerical representations that later processing steps can work on. These natural language processing tasks are the same across many text mining applications.

Lexical Operations

Texts contain words in many different forms. The words need to be identified and separated, which is a difficult task for languages written without spaces between words (e.g. some East Asian languages). For most European languages, punctuation marks and hyphens need to be taken into account.
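The word separation step can be illustrated with a minimal sketch. It assumes a language that separates words with spaces, and the regular expression `TOKEN_PATTERN` and the function name `tokenize` are choices made for this illustration, not part of the chapter. The pattern keeps hyphens and apostrophes inside a word and drops all other punctuation.

```python
import re

# Assumed pattern: runs of word characters, optionally joined by
# internal hyphens or apostrophes (e.g. "text-mining", "don't").
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Text-mining systems don't see words; they see characters.")
print(tokens)
```

A pattern of this kind would not work for languages written without spaces, which generally require dictionary-based or statistical word segmentation.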

The next step is to group words that share a common basic form, for example the different grammatical forms of a verb. Their meaning is essentially identical and only their morphology changes. In languages with many noun cases and many verb forms (e.g. Finnish), this task can be challenging. The same stemming operations are carried out in Information Retrieval.

An example would be the word forms “run,” “runs” and “running.” They should all be mapped to the same stem “run.”
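The mapping in this example can be sketched with a deliberately naive suffix stripper. The function `naive_stem` and its suffix list are hypothetical and cover only a handful of English endings; production systems would more likely rely on an established algorithm such as the Porter stemmer.

```python
def naive_stem(word):
    """Strip a few English suffixes; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


for form in ("run", "runs", "running"):
    print(form, "->", naive_stem(form))
```

Rules this simple fail on irregular forms such as “ran,” which is one reason why stemming is harder in morphologically rich languages.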

The remaining words are counted, and their frequency in each text and in the entire collection is determined. Based on these frequencies, weights are calculated that express the importance of a word or term for a text document. These weights indicate the topicality or “aboutness” of a document. This information can be stored in a document-term matrix in which each row is a vector containing the weights of all terms for one document, and each column shows the distribution of one term over all documents in the collection (Manning et al., 2008).
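A small sketch may help to make the document-term matrix concrete. The chapter does not name a specific weighting formula, so the code assumes one common choice, TF-IDF (term count multiplied by the logarithm of the inverse document frequency). The function name `document_term_matrix` and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the weight of term j in document i."""
    counts = [Counter(doc) for doc in docs]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = {term: sum(1 for c in counts if term in c) for term in vocabulary}
    # Assumed weighting: term count * log(N / document frequency).
    matrix = [
        [c[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        for c in counts
    ]
    return vocabulary, matrix


# Toy collection of already tokenized and stemmed documents.
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:  # one row (vector) per document
    print([round(weight, 3) for weight in row])
```

Reading the matrix column by column, for example `[row[j] for row in matrix]`, gives the distribution of term `j` over the collection.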

Key Terms in this Chapter

Stemming: Stemming refers to the mapping of word forms to stems or basic word forms. Word forms may differ from stems due to morphological changes required by grammar. The plural of English nouns, for example, is mostly formed by adding an s to the basic noun.

Classification: Objects are assigned to predefined classes based on similarity, so that similar objects end up in the same class. The similarity function is not given explicitly but through training examples, i.e. objects that have already been assigned to a class. The algorithm needs to learn a function that reflects the class definition implied by these examples.

Concepts: Meaning is not tied to a single word. A concept is a semantic entity which can be expressed by several different words or by a group of words.

Opinion Mining: Opinion mining or Sentiment Analysis means finding and classifying opinionated parts of texts. These subjective parts need to be identified by Text Mining methods and separated from objective text parts. A typical technique is the search for words which express opinion.

Information Retrieval: Information retrieval is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines.

Clustering: Objects are grouped based on similarity. Each cluster contains objects which are more similar to each other than to objects in other clusters.
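The Classification and Clustering entries both rest on a similarity function between objects. As a rough sketch only, the code below assumes cosine similarity between term-weight vectors and assigns a new document to the class of its most similar labeled example (a one-nearest-neighbor rule). The vocabulary, the weights and the class labels are invented, and the key terms do not prescribe this particular method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(vector, labeled_examples):
    """Return the class of the most similar labeled example (1-nearest neighbor)."""
    best_label, _ = max(
        ((label, cosine_similarity(vector, example)) for example, label in labeled_examples),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical term weights over the vocabulary ["goal", "match", "election", "vote"].
examples = [
    ([2.0, 1.0, 0.0, 0.0], "sports"),
    ([0.0, 0.0, 1.0, 2.0], "politics"),
]
print(classify([1.0, 0.0, 0.0, 0.5], examples))
```

A clustering method could use the same similarity function but would have to form the groups without any labeled examples.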
