Tools and techniques developed over the last 40 years in the field of fuzzy set theory (FST) have been applied quite successfully in a variety of application areas. A prominent example of their practical usefulness is fuzzy control, where the idea is to represent the input-output behaviour of a controller (of a technical system) in terms of fuzzy rules. A concrete control function is then derived from such rules by means of suitable inference techniques.
While aspects of knowledge representation and reasoning dominated research in FST for a long time, problems of automated learning and knowledge acquisition have increasingly come to the fore. There are several reasons for this development, notably the following two. Firstly, there has been an internal shift within fuzzy systems research from “modelling” to “learning”, which can be attributed to the awareness that the well-known “knowledge acquisition bottleneck” seems to remain one of the key problems in the design of intelligent and knowledge-based systems. Secondly, this trend has been further amplified by the great interest that the field of knowledge discovery in databases (KDD) and its core methodical component, data mining, have attracted in recent years. It is hence hardly surprising that data mining has received a great deal of attention in the FST community (Hüllermeier, 2005).
The aim of this chapter is to give an idea of the usefulness of FST for data mining. To this end, the section after next briefly highlights some potential advantages of fuzzy approaches. In preparation, the next section recalls some basic ideas and concepts from FST. The style of presentation is purely non-technical throughout; for technical details we give pointers to the literature.
Background On Fuzzy Sets
A fuzzy subset F of a reference set X is identified by a so-called membership function (often denoted μF(•)), which is a generalization of the characteristic function of an ordinary set A ⊆ X (Zadeh, 1965). For each element x ∈ X, this function specifies the degree of membership of x in the fuzzy set. Usually, membership degrees μF(x) are taken from the unit interval [0,1], i.e., a membership function is an X → [0,1] mapping, even though more general membership scales (such as ordinal scales or complete lattices) are conceivable.
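To make the contrast between a characteristic function and a membership function concrete, the following minimal sketch may help. The triangular shape, the function names, and the example values (a fuzzy set of numbers "about 20") are illustrative assumptions and are not taken from the chapter; any X → [0,1] mapping would serve equally well.

```python
from typing import Callable


def characteristic_function(lower: float, upper: float) -> Callable[[float], float]:
    """Ordinary (crisp) set A = [lower, upper]: membership is either 0 or 1."""
    def chi(x: float) -> float:
        return 1.0 if lower <= x <= upper else 0.0
    return chi


def triangular_membership(left: float, peak: float, right: float) -> Callable[[float], float]:
    """Fuzzy set with a triangular membership function (an assumed, illustrative shape).

    Membership rises linearly from 0 at `left` to 1 at `peak`,
    then falls linearly back to 0 at `right`.
    """
    def mu(x: float) -> float:
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return mu


# Hypothetical example: the crisp interval [15, 25] versus the fuzzy set "about 20".
chi_A = characteristic_function(15.0, 25.0)
mu_F = triangular_membership(15.0, 20.0, 25.0)

for x in (14.0, 17.5, 20.0, 22.5, 26.0):
    print(f"x = {x:5.1f}  crisp: {chi_A(x):.1f}  fuzzy: {mu_F(x):.2f}")
```

The crisp function can only answer "in" or "out", whereas the fuzzy one assigns intermediate degrees to elements near the boundary.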
Fuzzy sets formalize the idea of graded membership according to which an element can belong “more or less” to a set. Consequently, a fuzzy set can have “non-sharp” boundaries. Many sets or concepts associated with natural language terms have boundaries that are non-sharp in the sense of FST. Consider the concept of “forest” as an example. For many collections of trees and plants it will be quite difficult to decide in an unequivocal way whether or not one should call them a forest.
In a data mining context, the idea of “non-sharp” boundaries is especially useful for discretizing numerical attributes, a common preprocessing step in data analysis. For example, in gene expression analysis, one typically distinguishes between normally expressed, underexpressed, and overexpressed genes. This classification is based on the expression level of the gene (a normalized numerical value), as measured by so-called DNA chips, and on corresponding thresholds. For example, a gene is often called overexpressed if its expression level is increased at least twofold. Needless to say, such thresholds (for example, 2) are more or less arbitrary. Figure 1 shows a fuzzy partition of the expression level with a “smooth” transition between under-, normal, and overexpression. For instance, according to this formalization, a gene with an expression level of at least 3 is definitely considered overexpressed, a gene with a level below 1 is definitely not considered overexpressed, and a gene with a level in between is considered overexpressed to a certain degree (Ortolani et al., 2004).
Figure 1. Fuzzy partition of the gene expression level with a “smooth” transition (grey regions) between underexpression, normal expression, and overexpression
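The difference between a sharp threshold and a smooth transition for the concept "overexpressed" can be sketched as follows. The breakpoints 1 and 3 and the crisp threshold 2 come from the text above; the linear shape of the transition and the function names are assumptions made for illustration, since the text does not specify the exact form of the membership function.

```python
def crisp_overexpressed(level: float, threshold: float = 2.0) -> bool:
    """Sharp discretization: overexpressed iff the level reaches the threshold."""
    return level >= threshold


def fuzzy_overexpressed(level: float, lower: float = 1.0, upper: float = 3.0) -> float:
    """Degree of membership in the fuzzy set "overexpressed".

    Assumes a linear transition: degree 0 up to `lower`, degree 1 from
    `upper` onwards, and linear interpolation in between.
    """
    if level <= lower:
        return 0.0
    if level >= upper:
        return 1.0
    return (level - lower) / (upper - lower)


# Two nearly identical expression levels on either side of the threshold 2.
for level in (0.5, 1.9, 2.1, 3.5):
    print(
        f"level = {level:.1f}  "
        f"crisp: {crisp_overexpressed(level)!s:5}  "
        f"fuzzy degree: {fuzzy_overexpressed(level):.2f}"
    )
```

Under the crisp rule, levels of 1.9 and 2.1 fall into different classes although they are almost identical; under the fuzzy rule, they receive similar degrees of membership, which is the practical benefit of the smooth transition.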