One of the most important issues in data mining is discovering implicit relationships between words in a large corpus and labels in a large database. Such relationships are often expressed as a function of a distance measure. An effective measure is useful not only for achieving high precision in data mining, but also for reducing the time the mining operation takes. Previous research has proposed many measures for calculating one-to-many relationships, such as the complementary similarity measure, the mutual information, and the phi coefficient, and some studies have shown the complementary similarity measure to be the most effective. In this article, the author reviews previous research on measures for one-to-many relationships and proposes a new idea, based on a heuristic approach, for obtaining an effective measure.
Generally, the knowledge discovery in databases (KDD) process consists of six stages: data selection, cleaning, enrichment, coding, data mining, and reporting (Adriaans & Zantinge, 1996). Data mining is the most important part of the KDD process. It draws on various techniques, such as statistical techniques, association rules, and query tools in a database, each suited to different purposes (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996; Berland & Charniak, 1999; Caraballo, 1999; Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Han & Kamber, 2001).
When two words or labels in a large database are implicitly related to each other, one of these purposes is to find the related pair effectively. In reviewing previous research, the author found at least six distance measures for finding relationships between words or labels in a large database.
The first is the mutual information proposed by Church and Hanks (1990). The second is the confidence proposed by Agrawal and Srikant (1995). The third is the complementary similarity measure (CSM) presented by Hagita and Sawaki (1995). The fourth is the dice coefficient and the fifth is the Phi coefficient, both of which are mentioned by Manning and Schutze (1999). The sixth is the proposal measure (PM) suggested by Ishiduka, Yamamoto, and Umemura (2003), one of several new measures developed in their paper.
Evaluating these distance measures requires their formulas. Yamamoto and Umemura (2002) analyzed the measures and expressed them in terms of four parameters, a, b, c, and d (Table 1).
Table 1. Kinds of distance measures and their formulas
|No||Kind of Distance Measures||Formula|
|1||the mutual information|
|2||the confidence|
|3||the complementary similarity measure|
|4||the dice coefficient|
|5||the Phi coefficient|
|6||the proposal measure|
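As a rough illustration of how measures of this kind are computed from four counts, the sketch below uses the forms in which five of them are commonly written for a 2x2 co-occurrence table. The meanings assigned to a, b, c, and d here are an assumption (they are not defined in the text above), the function name `contingency_measures` is hypothetical, and the expressions are not taken from Table 1 itself. The proposal measure is omitted because its formula is not given here.

```python
import math


def contingency_measures(a, b, c, d):
    """Compute five association measures from a 2x2 co-occurrence table.

    Assumed cell meanings (an illustration, not the article's definition):
      a: records containing both x and y
      b: records containing x but not y
      c: records containing y but not x
      d: records containing neither x nor y
    The sketch assumes a > 0 and that every marginal total is nonzero;
    it does no smoothing or zero handling.
    """
    n = a + b + c + d
    return {
        # Pointwise mutual information: log2 of P(x, y) / (P(x) * P(y)).
        "mutual_information": math.log2(a * n / ((a + b) * (a + c))),
        # Confidence of the rule x -> y: share of x records that also have y.
        "confidence": a / (a + b),
        # Complementary similarity measure in its commonly cited form;
        # it is asymmetric in x and y.
        "csm": (a * d - b * c) / math.sqrt((a + c) * (b + d)),
        # Dice coefficient.
        "dice": 2 * a / (2 * a + b + c),
        # Phi coefficient.
        "phi": (a * d - b * c)
        / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
    }


# Made-up counts, chosen only to exercise the function.
scores = contingency_measures(a=30, b=10, c=20, d=40)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that the measures live on different scales: the confidence and the dice coefficient fall between 0 and 1, the Phi coefficient between -1 and 1, while the mutual information and the CSM are unbounded, so their raw values are not directly comparable with one another.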