The amount of data that organizations generate, collect, and store is growing explosively, and organizations increasingly rely on new technologies to access, analyze, summarize, and interpret this information intelligently. Data mining has therefore become a research area of increasing importance (Amaratunga & Cabrera, 2004). Data mining is the search for valuable information in large volumes of data (Hand, Mannila, & Smyth, 2001). It can discover hidden relationships, patterns, and interdependencies and generate rules that predict correlations, helping organizations make critical decisions faster or with a greater degree of confidence (Gargano & Ragged, 1999). A wide range of data mining techniques has been used successfully in many applications.
This article provides an overview of existing data mining applications. It begins by explaining the key tasks that data mining can achieve and then discusses the application domains that data mining can support. Three common application domains are identified: bioinformatics, electronic commerce, and search engines. For each domain, the article describes how data mining can enhance its functions. The limitations of current research are then addressed, followed by a discussion of directions for future research.
Data mining can be used to accomplish many types of tasks. Based on the kind of knowledge to be discovered, these tasks can be broadly divided into supervised learning and unsupervised learning. Supervised learning requires the data to be pre-classified: each item is associated with a unique label signifying the class to which the item belongs. Unsupervised learning, in contrast, does not require pre-classification of the data and instead forms groups of items that share common characteristics (Nolan, 2002). To achieve these two main tasks, four data mining approaches are commonly used: classification, clustering, association rules, and visualization.
Classification, a form of supervised learning, is an important task in data mining. It refers to discovering predictive patterns in which the predicted attribute, called the class, is nominal or categorical. A data item is then assigned to one of a predefined set of classes by examining its attributes (Changchien & Lu, 2001). One example of a classification application is the analysis of gene functions on the basis of predefined classes set by biologists (see the section on “Classifying Gene Functions”).
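As an illustration of assigning an item to a predefined class by examining its attributes, the following is a minimal sketch of a k-nearest-neighbor classifier. The choice of k-nearest neighbors, the function name `knn_classify`, the two-value expression profiles, and the class labels are all assumptions made for this example; the article does not prescribe a particular classifier.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Assign `query` to the majority class among its k nearest training items."""
    # `training` is a list of (feature_vector, class_label) pairs labelled in advance.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, hand-made profiles with classes defined beforehand (illustration only).
training = [
    ((0.9, 0.1), "metabolism"),
    ((0.8, 0.2), "metabolism"),
    ((0.7, 0.3), "metabolism"),
    ((0.1, 0.9), "transport"),
    ((0.2, 0.8), "transport"),
    ((0.3, 0.7), "transport"),
]

predicted = knn_classify(training, (0.85, 0.15))
print(predicted)
```

The classifier needs the labelled `training` pairs before it can predict anything, which is what makes the approach supervised.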
Clustering is also known as exploratory data analysis (EDA) (Tukey, 1977). It is used in situations where a training set of pre-classified records is unavailable. Objects are divided into groups based on their similarity. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups (Roussinov & Zhao, 2003). From a data mining perspective, clustering is therefore an approach to unsupervised learning. One of its major applications is the management of customer relationships, which is described in the section “Customer Management.”
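Grouping objects by similarity without any pre-classified records can be sketched with a small k-means routine. This is only one of many clustering algorithms and is chosen here as an assumption; the function name `kmeans`, the customer figures, and the fixed starting centroids are hypothetical and serve only to keep the example deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Group unlabelled points around the nearest of the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (a centroid with no members stays where it is).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers described by (monthly visits, monthly purchases).
customers = [(1, 2), (2, 1), (1.5, 1.5), (9, 10), (10, 9), (9.5, 9.5)]
clusters, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
print(clusters)
```

No labels are supplied at any point: the two groups emerge only from the distances between the customers, in contrast to the supervised setting.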
Association rules, first proposed by Agrawal and Srikant (1994), are mainly used to find meaningful relationships between items or features that occur together in databases (Wang, Chuang, Hsu, & Keh, 2004). This approach is useful when one has an idea of the kinds of associations being sought, because a large data set can yield all kinds of correlations. The approach has been widely applied to extract knowledge from Web log data (Lee, Kim, Chung, & Kwon, 2002). In particular, it is very popular among marketing managers and retailers in electronic commerce who want to find associative patterns among products (see the section on “Market Basket Analysis”).
Key Terms in this Chapter
Electronic Commerce: Commercial activities that facilitate the buying and selling of goods and services over the Internet.
Collaborative Filtering: A technique that is used for making recommendations by computing the similarities among users.
Search Engines: Web services that help search through Internet addresses for user-defined terms or topics.
Bayesian Network: A directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.
Microarrays: A high-throughput technology that allows the simultaneous determination of mRNA abundance for many thousands of genes in a single experiment.
Author Co-Citation Analysis: The analysis of how authors are cited together.
Bioinformatics: An integration of mathematical, statistical, and computational methods to organize and analyze biological data.
Content-Based Filtering: A technique that involves a direct comparison between the content or attributes of a user’s profile and the document to make recommendations.