A large proportion of business and academic data is stored in textual format. With the exception of metadata, such as author, date, title and publisher, this data is not overtly structured like the standard, mainly numerical, data in relational databases. Parallel to data mining, which finds new patterns and trends in numerical data, text mining is the process aimed at discovering unknown patterns in free text. Owing to the importance of the competitive and scientific knowledge that can be extracted from these texts, “text mining has become an increasingly popular and essential theme in data mining” (Han & Kamber, 2001, p. 428). Text mining is an evolving field, and its relatively short history goes hand in hand with the recent explosion in the availability of electronic textual information. Chen (2001, p. vi) remarks that “text mining is an emerging technical area that is relatively unknown to IT professions”. This helps to explain why, despite the value of text mining, most research and development efforts still focus on data mining using structured data (Fan et al., 2006).
In the next section, the background and need for text mining are discussed, after which the various uses and techniques of text mining are described. The importance of visualisation and some critical issues are then discussed, followed by some suggestions for future research topics.
Definitions of text mining vary a great deal, from views that it is an advanced form of information retrieval (IR) to those that regard it as a sibling of data mining:
Text mining is the discovery of texts.
Text mining is the exploration of available texts.
Text mining is the extraction of information from text.
Text mining is the discovery of new knowledge in text.
Text mining is the discovery of new patterns, trends and relations in and among texts.
Han & Kamber (2001, pp. 428-435), for example, devote much of their rather short discussion of text mining to information retrieval. However, one should differentiate between text mining and information retrieval. Information retrieval searches through metadata and full-text databases to find information that already exists; text mining does not. The view taken here follows Nasukawa & Nagano (2001, p. 969), namely that text mining “is a text version of generalized data mining”. Text mining should “focus on finding valuable patterns and rules in text that indicate trends and significant features about specific topics” (ibid., p. 967).
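The difference between retrieving existing information and finding patterns can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the stopword list and the function names `retrieve` and `cooccurring_pairs` are hypothetical, and counting term co-occurrence is only one very simple stand-in for the pattern-finding techniques that text-mining tools actually use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus, invented purely for illustration.
DOCS = [
    "price increase reduces customer loyalty",
    "customer loyalty drops after price increase",
    "new product launch announced",
    "price increase announced for new product",
]

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"after", "for"}


def retrieve(query, docs):
    """IR-style lookup: return the documents that already contain the query term."""
    return [doc for doc in docs if query in doc.split()]


def cooccurring_pairs(docs, min_count=2):
    """Mining-style step: count term pairs that share a document.

    Only pairs that occur together in at least `min_count` documents are kept.
    No query is supplied; the pairs emerge from the collection as a whole.
    """
    counts = Counter()
    for doc in docs:
        # Sort the distinct terms so each pair has one canonical ordering.
        terms = sorted(set(doc.split()) - STOPWORDS)
        counts.update(combinations(terms, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Retrieval answers a question the user already knew how to ask.
    print(retrieve("loyalty", DOCS))
    # Co-occurrence counting surfaces associations nobody asked about.
    for pair, n in sorted(cooccurring_pairs(DOCS).items()):
        print(pair, n)
```

In this sketch `retrieve` can only return documents that match a term the user supplies, whereas `cooccurring_pairs` receives no query and reports associations (for example, that "increase" and "price" appear together in three of the four documents) that hold across the collection rather than in any single text.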
Like data mining, text mining is a proactive process that automatically searches data for new relationships and anomalies to serve as a basis for making business decisions aimed at gaining competitive advantage (cf. Rob & Coronel, 2004, p. 597). Although data mining can require some interaction between the investigator and the data-mining tool, it can be considered an automatic process because “data-mining tools automatically search the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user”, while mere data analysis “relies on the end users to define the problem, select the data, and initiate the appropriate data analyses to generate the information that helps model and solve problems those end-users uncover” (ibid.). The same distinction holds for text mining: text-mining tools should likewise “initiate analyses to create knowledge” (ibid., p. 598).
In practice, however, the boundaries between data analysis, information retrieval and text mining are not always so clear-cut. Montes-y-Gómez et al. (2004) proposed an integrated approach, called contextual exploration, which combines robust access (IR), non-sequential navigation (hypertext) and content analysis (text mining).