Discovering an Effective Measure in Data Mining

Takao Ito (Ube National College of Technology, Japan)
Copyright: © 2009 |Pages: 9
DOI: 10.4018/978-1-60566-010-3.ch102

One of the most important issues in data mining is discovering implicit relationships between words in a large corpus and labels in a large database. Such relationships are often expressed as a function of distance measures. An effective measure would not only improve the precision of data mining but also reduce its running time. Previous research has proposed many measures for calculating one-to-many relationships, such as the complementary similarity measure, mutual information, and the phi coefficient, and some studies have found the complementary similarity measure to be the most effective. This article reviews previous research on measures for one-to-many relationships and proposes a new idea, based on a heuristic approach, for obtaining an effective one.
Chapter Preview


Generally, the knowledge discovery in databases (KDD) process consists of six stages: data selection, cleaning, enrichment, coding, data mining, and reporting (Adriaans & Zantinge, 1996). Needless to say, data mining is the most important part of KDD. Data mining uses various techniques for different purposes, such as statistical techniques, association rules, and database query tools (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996; Berland & Charniak, 1999; Caraballo, 1999; Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Han & Kamber, 2001).

One of these purposes is to efficiently find pairs of words or labels in a large database that have an implicit relationship with each other. After reviewing previous research, the author identified at least six distance measures that have been used to find such relationships.

The first is mutual information, proposed by Church and Hanks (1990). The second is confidence, proposed by Agrawal and Srikant (1995). The third is the complementary similarity measure (CSM), presented by Hagita and Sawaki (1995). The fourth is the dice coefficient and the fifth is the Phi coefficient, both described by Manning and Schutze (1999). The sixth is the proposal measure (PM), one of several new measures developed by Ishiduka, Yamamoto, and Umemura (2003).

Evaluating these distance measures requires explicit formulas. Yamamoto and Umemura (2002) analyzed the measures and expressed each of them in terms of four parameters, a, b, c, and d (Table 1).

Table 1. Kinds of distance measures and their formulas

No  Kind of distance measure               Formula
1   the mutual information                 [formula image 978-1-60566-010-3.ch102.m01]
2   the confidence                         [formula image 978-1-60566-010-3.ch102.m02]
3   the complementary similarity measure   [formula image 978-1-60566-010-3.ch102.m03]
4   the dice coefficient                   [formula image 978-1-60566-010-3.ch102.m04]
5   the Phi coefficient                    [formula image 978-1-60566-010-3.ch102.m05]
6   the proposal measure                   [formula image 978-1-60566-010-3.ch102.m06]
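
To illustrate how measures of this kind can be computed from four parameters, the following is a minimal Python sketch. It is not taken from the chapter: the formulas in Table 1 are not available in this text, so the sketch uses commonly cited textbook forms and should be checked against the original table. It assumes that a counts the items in which both words (or labels) x and y occur, b those containing only x, c those containing only y, and d those containing neither. The function names are hypothetical, the direction of the confidence rule (x implies y) is an assumption, and the proposal measure is omitted because its formula cannot be recovered here.

```python
import math


def contingency(pairs):
    """Count a, b, c, d from (x_present, y_present) boolean pairs.

    Assumed meaning (not confirmed by the source):
    a = both present, b = only x, c = only y, d = neither.
    """
    a = b = c = d = 0
    for x, y in pairs:
        if x and y:
            a += 1
        elif x:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def mutual_information(a, b, c, d):
    # Pointwise mutual information in bits: log2(P(x, y) / (P(x) P(y))).
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


def confidence(a, b, c, d):
    # Confidence of the rule "x implies y": P(y | x). Direction is an assumption.
    return a / (a + b)


def dice(a, b, c, d):
    # Dice coefficient: 2 |X and Y| / (|X| + |Y|).
    return 2 * a / (2 * a + b + c)


def phi(a, b, c, d):
    # Phi coefficient of the 2x2 contingency table.
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def csm(a, b, c, d):
    # Complementary similarity measure in a commonly cited form (assumed here).
    return (a * d - b * c) / math.sqrt((a + c) * (b + d))


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    a, b, c, d = 20, 5, 10, 65
    for f in (mutual_information, confidence, dice, phi, csm):
        print(f.__name__, round(f(a, b, c, d), 4))
```

The sketch does not guard against zero counts; a real implementation would need to decide how to handle empty rows or columns of the contingency table (for example, a = 0 makes the mutual information undefined).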
