Representing, and consequently processing, fuzzy data in standard and binary databases is problematic. The problem is amplified in binary databases, where continuous data is represented by means of discrete ‘1’ and ‘0’ bits, and it becomes more acute still in classification. There, we may want to group objects based on some fuzzy attributes, but an appropriate fuzzy similarity measure is not always easy to find. The current paper proposes a novel model and measure for representing fuzzy data, which lends itself to both classification and data mining.
Classification algorithms and data mining attempt to set up hypotheses regarding the assignment of different objects to groups and classes on the basis of the similarity/distance between them (Estivill-Castro & Yang, 2004) (Lim, Loh & Shih, 2000) (Zhang & Srihari, 2004). They are widely used in numerous fields, including: social sciences, where observations and questionnaires are used to learn mechanisms of social behavior; marketing, for segmentation and customer profiling; finance, for fraud detection; computer science, for image processing and expert systems applications; medicine, for diagnostics; and many other fields.
Classification algorithms and data mining methodologies are based on a procedure that calculates a similarity matrix, using a similarity index between objects, and on a grouping technique. Research has shown that a similarity measure based upon binary data representation yields better results than regular similarity indexes (Erlich, Gelbard & Spiegler, 2002) (Gelbard, Goldman & Spiegler, 2007). However, binary representation is currently limited to nominal discrete attributes such as gender and marital status (Zhang & Srihari, 2003). This makes the binary approach to data representation unattractive for many widely used data types.
The current research describes a novel approach to binary representation, referred to as Fuzzy Binary Representation. This new approach is suitable for all data types: nominal, ordinal, and continuous. We propose that meaning lies not only in the actual explicit attribute value, but also in its implicit similarity to other possible attribute values. These similarities can be determined either by a problem domain expert or automatically, by analyzing fuzzy functions that represent the problem domain. The added fuzzy similarity yields improved classification and data mining results. More generally, Fuzzy Binary Representation and the related similarity measures show that refined and carefully designed handling of data, including the elicitation of domain expertise regarding similarity, may add both value and knowledge to existing databases.
Binary representation creates a storage scheme, wherein data appear in binary form rather than the common numeric and alphanumeric formats. The database is viewed as a two-dimensional matrix that relates entities according to their attribute values. Having the rows represent entities and the columns represent possible values, entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity (e.g., record, object) has or lacks a given value, respectively (Spiegler & Maayan, 1985).
In this way, we can have a binary representation for discrete and continuous attributes.
Figure 1 illustrates the binary representation of a database consisting of five entities with the following two attributes: Marital Status (nominal) and Height (continuous).
Marital Status, with four values: S (single), M (married), D (divorced), W (widowed).
Height, with four values: 1.55, 1.56, 1.60 and 1.84.
Figure 1. Standard binary representation
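The scheme described above can be sketched in a few lines of Python. The two attribute domains are taken from the description of Figure 1; the five entity rows themselves are hypothetical, since the figure's contents are not reproduced in the text, and the helper names `one_hot` and `to_binary_row` are illustrative only.

```python
# Attribute domains as listed in the description of Figure 1.
MARITAL_VALUES = ["S", "M", "D", "W"]
HEIGHT_VALUES = [1.55, 1.56, 1.60, 1.84]

# Hypothetical entities (marital status, height); the actual rows of
# Figure 1 are not available, so these are assumptions for illustration.
ENTITIES = [
    ("S", 1.55),
    ("M", 1.56),
    ("D", 1.60),
    ("M", 1.84),
    ("W", 1.60),
]


def one_hot(value, domain):
    """Return one column per possible value: 1 where it matches, else 0."""
    return [1 if value == candidate else 0 for candidate in domain]


def to_binary_row(entity):
    """Concatenate the binary columns of both attributes into one row."""
    marital, height = entity
    return one_hot(marital, MARITAL_VALUES) + one_hot(height, HEIGHT_VALUES)


# Rows are entities; columns are the eight possible attribute values.
binary_matrix = [to_binary_row(entity) for entity in ENTITIES]

if __name__ == "__main__":
    for row in binary_matrix:
        print(" ".join(str(bit) for bit in row))
```

Each row contains exactly one ‘1’ per attribute, so two entities whose heights differ by a single centimetre share no ‘1’ bit in the Height columns.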
In practice, however, binary representation is currently limited to nominal discrete attributes. In the current study, we extend the binary model to include continuous data and fuzzy representation.
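To illustrate the motivation for such an extension, the following minimal sketch replaces the crisp ‘1’/‘0’ entries of the Height columns with graded similarity degrees in [0,1]. Both the linear similarity function in `fuzzy_row` and the min/max ratio in `overlap` are assumptions chosen for brevity; they stand in for, and should not be read as, the similarity definitions and measure proposed in this chapter.

```python
HEIGHT_VALUES = [1.55, 1.56, 1.60, 1.84]


def crisp_row(value, domain):
    """Standard binary row: 1.0 in the matching column, 0.0 elsewhere."""
    return [1.0 if value == candidate else 0.0 for candidate in domain]


def fuzzy_row(value, domain):
    """Graded row: similarity of `value` to every possible value.

    Assumed similarity (not the chapter's): 1 - |a - b| / (max - min).
    """
    span = max(domain) - min(domain)
    return [1.0 - abs(value - candidate) / span for candidate in domain]


def overlap(row_a, row_b):
    """Assumed index: sum of column-wise minima over sum of maxima."""
    shared = sum(min(a, b) for a, b in zip(row_a, row_b))
    total = sum(max(a, b) for a, b in zip(row_a, row_b))
    return shared / total if total else 0.0


# Crisp rows: 1.55 vs 1.56 looks exactly as dissimilar as 1.55 vs 1.84.
crisp_near = overlap(crisp_row(1.55, HEIGHT_VALUES), crisp_row(1.56, HEIGHT_VALUES))
crisp_far = overlap(crisp_row(1.55, HEIGHT_VALUES), crisp_row(1.84, HEIGHT_VALUES))

# Fuzzy rows: the closer pair now scores higher than the distant pair.
fuzzy_near = overlap(fuzzy_row(1.55, HEIGHT_VALUES), fuzzy_row(1.56, HEIGHT_VALUES))
fuzzy_far = overlap(fuzzy_row(1.55, HEIGHT_VALUES), fuzzy_row(1.84, HEIGHT_VALUES))

if __name__ == "__main__":
    print("crisp:", crisp_near, crisp_far)
    print("fuzzy:", fuzzy_near, fuzzy_far)
```

Under these assumptions, the crisp rows cannot distinguish a one-centimetre gap from a 29-centimetre gap, whereas the graded rows preserve that ordering. In a real setting, the similarity degrees could equally be supplied by a domain expert rather than computed.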
Key Terms in this Chapter
Similarity: A numerical estimate of the difference or distance between two entities. The similarity values are in the range of [0,1], indicating similarity degree
Data Mining: The process of automatically searching large volumes of data for patterns, using tools such as classification, association rule mining, and clustering
Membership Function: The mathematical function that defines the degree of an element’s membership in a fuzzy set. Membership functions return a value in the range of [0,1], indicating membership degree
Fuzzy Logic: An extension of Boolean logic dealing with the concept of partial truth. Fuzzy logic replaces Boolean truth values (0 or 1, black or white, yes or no) with degrees of truth
Fuzzy Set: An extension of classical set theory. Fuzzy set theory, used in Fuzzy Logic, permits the gradual assessment of the membership of elements in relation to a set
Database Binary Representation: A representation where a database is viewed as a two-dimensional matrix that relates entities (rows) to attribute values (columns). Entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity has or lacks a given value
Classification: The partitioning of a data set into subsets, so that the data in each subset (ideally) share some common traits - often proximity according to some defined similarity/distance measure