Fuzzy logic and fuzzy set theory were proposed by Lotfi Zadeh in 1965 and have been extensively developed since, in various directions including reasoning, control, data representation and data mining. They now provide numerous tools to handle data in a relevant and comprehensive way, in particular offering theoretically well-founded means to deal with uncertainty and imprecision. Furthermore, they constitute an interface between numerical and linguistic representations, increasing the interpretability of the developed tools and making it possible to compute with words, to use the expression proposed by Zadeh in 1996.
Despite these advantages, fuzzy approaches often suffer from the perception that they cannot address huge amounts of data and are inappropriate because of scalability difficulties: high computational complexity or high memory requirements are feared, which might hinder their application to the very large datasets that occur more and more frequently nowadays. Yet this is not the case, as many applications, including industrial success stories, have shown that fuzziness and scalability are not antagonistic concepts. This book aims at highlighting the relevance of fuzzy methods for huge datasets, considering both theoretical and practical points of view and bringing together contributions from various fields.
This book gathers up-to-date methods and algorithms that tackle this problem, showing that fuzzy logic is a very powerful way to provide users with relevant results within reasonable time and memory bounds. The chapters cover a wide range of research areas involving very large databases, considering among other issues data representation and structuring, in particular in data warehouses, the related querying problems, and the extraction of relevant and characterizing information from large datasets, so as to summarize them in a flexible, robust and interpretable way that takes into account uncertainty and imprecision. The book also includes success stories based on fuzzy logic that address real-world challenges of handling huge amounts of data for practical tasks. The databases considered in the various chapters take different forms, including data warehouses, data cubes, and tabular or relational data, and cover different application types, among which multimedia, medical, bioinformatics, financial, semantic web and data stream contexts.
The book aims at providing researchers, master's students, engineers and practitioners with state-of-the-art tools to address the new challenges of current applications, which must now both remain scalable and provide user-friendly, actionable results. Readers will get a panorama of the existing methods, algorithms and applications devoted to scalability and fuzziness, and will find the necessary material concerning implementation issues and solutions, algorithms, evaluation, case studies and real applications. Besides, as the very first reference gathering scalable fuzzy methods from various fields, this book contributes to bridging the gap between research communities (e.g. databases and machine learning) that do not always interact as much as they could.
The book is organized in four complementary parts. After two introductory chapters that provide general overviews of fuzziness and scalability from two different points of view, the second part, entitled “Databases and queries”, is devoted to methods that consider data structuring as the core of the approach and propose either flexible representations, through the incorporation of fuzzy components in the data, or flexible queries that make the user's interactions with the database easy and intuitive thanks to linguistic formulations. The third part, called “Summarization”, tackles the complexity of huge datasets through the extraction of relevant and characteristic information that provides summaries of the whole data. In this context, fuzzy approaches offer a linguistic interface to increase the interpretability of the results, as well as flexibility and tools to handle imprecision and uncertainty. Lastly, the fourth part, entitled “Real world challenges”, presents success stories involving fuzzy approaches, considering various domains such as stream, multimedia and biological data. In the following, we detail each part in turn.
The first two chapters of the book provide general overviews, from the hardware point of view and from a machine learning perspective respectively.
The chapter “Electronic Hardware for Fuzzy Computation”, by Koldo Basterretxea and Ines Del Campo, presents a comprehensive synthesis of the state of the art and of the progress in electronic hardware design for fuzzy computation over the past two decades, in particular for the implementation of fuzzy inference systems. The authors show how fuzzy hardware has evolved from general purpose processors (GPPs) to High Performance Reconfigurable Computing (HPRC), along with the development of the hardware/software codesign methodology. They discuss the relationship of these developments with the scalability issue, and the new trends and challenges to be faced. The last part of the chapter, dedicated to architectures proposed specifically to speed up fuzzy data mining, points to a promising research direction for the development and improvement of implementations of fuzzy data mining algorithms.
Chapter 2, entitled “Scaling Fuzzy Models”, by Lawrence O. Hall, Dmitry B. Goldgof, Juana Canul-Reich, Prodip Hore, Weijian Cheng and Larry Shoemaker, considers the scalability issue from the machine learning and data mining point of view, i.e. the extraction of knowledge from huge amounts of data, studying both supervised and unsupervised learning. It focuses on ensemble-based approaches that basically consist in learning classifiers on subsets of the data, to reduce the amount of data that must fit in computer memory at any one time. This approach is also used in Chapter 15, in the case of fuzzy random forests, to handle large multimedia datasets. In the unsupervised learning case, the authors concentrate on data streams, which are more and more common nowadays and can lead to very large datasets to be handled incrementally. They offer an overview of existing algorithms to deal with such data and propose an online variant of the classic fuzzy c-means. Their experimental results, obtained on datasets containing up to 5 million magnetic resonance images, illustrate the possibility of applying fuzzy approaches to data mining on huge datasets.
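The subset-based ensemble idea can be sketched as follows: each classifier sees only one chunk of the data, so only a chunk needs to reside in memory at a time, and predictions are combined by majority vote. This is an illustrative sketch, not the authors' algorithm; the nearest-centroid base learner is a placeholder assumption standing in for an arbitrary classifier.

```python
import numpy as np

def train_nearest_centroid(X, y):
    """Toy base learner: one centroid per class (placeholder for any classifier)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda c: np.linalg.norm(x - model[c]))

def ensemble_on_chunks(X, y, n_chunks):
    """Train one model per disjoint chunk, so only one chunk is in memory at a time."""
    return [train_nearest_centroid(Xc, yc)
            for Xc, yc in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks))]

def ensemble_predict(models, x):
    """Combine the chunk-trained models by majority vote."""
    votes = [predict_nearest_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
idx = rng.permutation(len(X))          # shuffle so every chunk mixes both classes
models = ensemble_on_chunks(X[idx], y[idx], n_chunks=3)
print(ensemble_predict(models, np.array([4.0, 4.0])))   # class 1 region
```

The same scheme extends to real base learners; the essential point is that no single model ever needs the full dataset in memory.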
The chapters of the second part, Chapters 3 to 7, address the topic of databases and queries coupled with fuzzy methods: they consider the scalability issue from the point of view of data structuring and organization, as well as of the querying step. Chapters 3, 4 and 5 mainly focus on the data storage issue, respectively considering data warehouses adapted to fuzzy set representations (Chapter 3), fuzzy data cubes following the OLAP model (Chapter 4) and fuzzy description logics to both represent and exploit imprecise data in a logical reasoning framework (Chapter 5). Chapters 6 and 7 concentrate on queries, considering two different types: Chapter 6 considers linguistic data queries, and more specifically quantified linguistic queries, proposing a framework to model and answer them, while Chapter 7 focuses on the results provided by queries submitted to search engines and tackles the problem of managing them through a flexible exploratory language.
More precisely, Chapter 3, entitled “Using Fuzzy Song Sets in Music Warehouses” by François Deliège and Torben Bach Pedersen, considers data warehouses used to manage large collections of music data, with the purpose of designing music recommendation systems. The authors introduce a fuzzy representation through the concept of fuzzy songs and study several solutions for storing and managing fuzzy sets in general, considering three options, namely tables, arrays and compressed bitmaps. They construct theoretical estimates of the cost of each solution, which are also studied experimentally and compared for various data collection sizes. Furthermore, they discuss the definition of an algebra to query the built data cubes and examine the operators from both a theoretical and a practical point of view. This chapter thus provides both an insight into theoretical work on scalability issues for storing and managing fuzzy sets, and an example of a real-world challenge.
In the same framework of data warehouses and OLAP systems, the chapter “Mining Association Rules from Fuzzy Data Cubes” by Nicolás Marín, Carlos Molina, Daniel Sánchez and M. Amparo Vila investigates the particular topic of On-Line Analytical Mining (OLAM), which aims at coupling data mining and OLAP, bridging the gap between parts II and III of the book. The authors consider association rules, one of the most widely used data mining techniques to extract summarized knowledge from data, focusing on the particular framework of data cubes, for which they must be further studied. The authors propose methods to support the imprecision that results from the multiple data sources handled in such applications and constitutes a challenge when designing association rule mining algorithms. The chapter studies the influence of the use of fuzzy logic on problems of different sizes, both in terms of cube density (number of records) and topology (number of dimensions), comparing the results with a crisp approach. Experiments are performed on medical, financial and census data.
In Chapter 5, entitled “Scalable Reasoning with Tractable Fuzzy Ontology Languages”, Giorgos Stoilos, Jeff Z. Pan and Giorgos Stamou consider another data model, particularly adapted to databases in the form of ontologies, namely the fuzzy description logic format. The latter offers the possibility to both model and reason with imprecise knowledge in a formal framework that provides expressive means to represent and query information. It is of particular use to handle fuzziness in semantic web applications, whose rapid current development makes such work crucial. The authors show that the increased expressivity does not come at the expense of efficiency, and that there exist methods capable of scaling up to millions of data items. More precisely, the authors study the scalability of the two main inference services in this enriched data description language, namely query answering and classification, i.e. the computation of the implied concept hierarchy. To that aim, they consider two languages: on the one hand, they show how Fuzzy DL-Lite provides scalable algorithms for expressive queries over fuzzy ontologies; on the other hand, they show how fuzzy EL+ leads to very efficient algorithms for classification, and extend it to allow for fuzzy subsumption.
Focusing on the issue of query formulation, in particular for expressive queries, Chapter 6, entitled “A Random Set and Prototype Theory Model of Linguistic Query Evaluation” by Jonathan Lawry and Yongchuan Tang, deals with linguistic data queries, which belong to the computing-with-words domain introduced by Zadeh in 1996. More precisely, the authors consider quantified data queries, for which they propose a new interpretation based on a combination of random set theory and prototype theory: concepts are defined as random set neighborhoods of a set of prototypes, which means that a linguistic label is deemed appropriate to describe an instance if the latter is sufficiently close to the prototypes of the label. Quantifiers are then defined as random set constraints on ratios or absolute values. These notions are combined into a methodology to evaluate the quality of quantified statements about instances, so as to answer quantified linguistic queries.
The chapter “A Flexible Language for Exploring Clustered Search Results” by Gloria Bordogna, Alessandro Campi, Stefania Ronchi and Giuseppe Psaila considers a specific type of query, namely those submitted to search engines: the authors tackle the increasingly crucial problem of managing the potentially very large result sets returned by search engines, and of automatically extracting hidden relations from them. Assuming that the set of documents retrieved by a search engine is given in the form of a set of clusters, the authors propose a flexible exploratory language for manipulating the groups of clustered documents returned by several engines. To that aim, they define various operators, among which refinement, union, coalescing and reclustering, and propose several ranking criteria and functions based on fuzzy set theory. This makes it possible to preserve the interpretability of the retrieved results despite the large number of answers obtained for the query.
The chapters of the next part, Chapters 8 to 13, take a different approach to the problem of scalability and fuzziness and address the topic of exploiting fuzzy tools to summarize huge amounts of data, extracting from them relevant information that captures their main characteristics. Several approaches can be distinguished, referring to different types of data mining tools, as detailed below: Chapter 8 considers linguistic summaries and uses fuzzy logic to model the linguistic information, Chapter 9 proposes an aggregation operator relevant to summarizing statistics in particular, Chapters 10 and 11 consider association rules to summarize data, and Chapters 12 and 13 belong to the fuzzy clustering framework. It must be underlined that Chapter 4 also considers association rules, in the case where data are stored in structures such as fuzzy cubes.
More precisely, Chapter 8, entitled “Linguistic data summarization: a high scalability through the use of natural language” by Janusz Kacprzyk and Slawomir Zadrozny, studies user-friendly data summaries expressed in natural language, based on a fuzzy logic model. The focus is on the interpretability of the summaries, defining scalability as the capability of algorithms to preserve understandable and intuitive results even when dataset sizes increase, at a more perceptual or cognitive level than the usual “technical scalability”. The authors offer a general discussion of the scalability notion and show how linguistic summaries meet its perceptual definition, detailing their automatic extraction from very large databases.
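To give a flavor of such summaries, the truth degree of a statement like “most salaries are high” can be computed with Zadeh's classic calculus of linguistically quantified propositions: average the membership of each record in the summarizer, then apply the quantifier's membership function to that proportion. The sketch below is a minimal illustration under that classic calculus; the membership functions and salary figures are hypothetical, not taken from the chapter.

```python
def mu_high(salary):
    """Hypothetical fuzzy set 'high salary': ramps from 0 at 40k to 1 at 80k."""
    return min(1.0, max(0.0, (salary - 40_000) / 40_000))

def mu_most(r):
    """Hypothetical fuzzy quantifier 'most': ramps from 0 at 30% to 1 at 80%."""
    return min(1.0, max(0.0, (r - 0.3) / 0.5))

def truth_of_summary(values, mu_s, mu_q):
    """Zadeh's calculus: T('Q of the data are S') = mu_Q(mean of mu_S(x))."""
    r = sum(mu_s(v) for v in values) / len(values)
    return mu_q(r)

salaries = [35_000, 52_000, 78_000, 90_000, 64_000, 120_000]
print(round(truth_of_summary(salaries, mu_high, mu_most), 3))  # 0.683
```

The truth degree stays in [0, 1] and degrades gracefully as fewer records match, which is what makes such summaries readable regardless of dataset size.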
The summarization process is also the topic of Chapter 9, “Human Focused Summarizing Statistics Using OWA Operators” by Ronald R. Yager, which provides a description of the Ordered Weighted Averaging (OWA) operator: this operator generates summarizing statistics over large datasets. The author details its flexibility, derived from weight-generating functions, as well as methods to adapt them to data analysts' needs, based on graphical and linguistic specifications.
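The OWA operator itself has a compact standard definition due to Yager: the arguments are sorted in decreasing order and aggregated with a weight vector summing to one, so that different weight vectors recover the maximum, the minimum, the mean, and everything in between. A minimal sketch (the weight vectors and data are illustrative):

```python
def owa(weights, values):
    """Ordered Weighted Averaging: weights apply to values sorted in decreasing order."""
    assert len(weights) == len(values) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

data = [0.2, 0.9, 0.5, 0.7]
print(owa([1, 0, 0, 0], data))   # weight on the largest value -> max, 0.9
print(owa([0, 0, 0, 1], data))   # weight on the smallest value -> min, 0.2
print(owa([0.25] * 4, data))     # uniform weights -> arithmetic mean
```

This reweighting by rank rather than by position is what lets a single operator family express a whole spectrum of summarizing statistics.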
Another common way to summarize datasets consists in extracting association rules that underline frequent and regular relations in the data. Chapter 10, entitled “(Approximate) Frequent Item Set Mining Made Simple with a Split and Merge Algorithm” by Christian Borgelt and Xiaomeng Wang, considers this framework and focuses on its computationally most complex part, namely the problem of mining frequent item sets. In order to improve its scalability, the authors propose efficient data structures and processing schemes, using a split-and-merge technique that can be applied even if not all data can be loaded into main memory. Approximation is introduced by considering that missing items can be inserted into transactions with a user-specified penalty. The authors study the behavior of the proposed algorithm and compare it to some well-known item set mining algorithms, providing a comprehensive overview of methods.
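The underlying task can be illustrated with a minimal Apriori-style enumeration, which grows frequent item sets level by level and prunes those below a support threshold. This is a didactic sketch of the mining problem only, not the chapter's split-and-merge algorithm, and the transactions are invented for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Minimal Apriori-style mining: grow frequent item sets level by level."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    support = lambda s: sum(s <= t for t in sets) / n   # fraction of transactions containing s
    level = [frozenset([i]) for i in {i for t in sets for i in t}]
    result = {}
    while level:
        frequent = [s for s in level if support(s) >= min_support]
        result.update({s: support(s) for s in frequent})
        # candidate generation: unions of frequent sets that are one item larger
        level = list({a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == len(a) + 1})
    return result

tx = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = frequent_itemsets(tx, min_support=0.6)
print(sorted("".join(sorted(s)) for s in freq))  # ['a', 'ab', 'ac', 'b', 'bc', 'c']
```

The exponential candidate space visible even in this toy example is precisely why efficient data structures and out-of-core processing schemes, such as those the chapter proposes, matter at scale.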
The chapter “Fuzzy Association Rules to Summarise Multiple Taxonomies in Large Databases” by Trevor Martin and Yun Shen also considers the domain of association rule learning when huge amounts of data are to be handled, focusing on the case where the data are grouped into hierarchically organized categories: the aim is then to extract rules describing the relations between these categories, with fuzziness avoiding the difficulties raised when crisp separations must be defined. The authors propose a new definition of fuzzy confidence that is consistent with the framework addressed in the chapter.
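For orientation, the commonly used sigma-count definitions of fuzzy support and confidence for a rule A → B can be sketched as follows; the membership degrees below are hypothetical, and the chapter's own, revised definition of fuzzy confidence differs from this standard min-based version.

```python
def fuzzy_support(mu_a, mu_b):
    """Sigma-count support of rule A -> B: mean of min(mu_A(x), mu_B(x))."""
    return sum(min(a, b) for a, b in zip(mu_a, mu_b)) / len(mu_a)

def fuzzy_confidence(mu_a, mu_b):
    """Standard fuzzy confidence: sigma-count of (A and B) over sigma-count of A."""
    return sum(min(a, b) for a, b in zip(mu_a, mu_b)) / sum(mu_a)

# Hypothetical membership degrees of five records in categories A and B
mu_a = [1.0, 0.8, 0.6, 0.0, 0.3]
mu_b = [0.9, 0.8, 0.2, 1.0, 0.0]
print(round(fuzzy_support(mu_a, mu_b), 3))      # 0.38
print(round(fuzzy_confidence(mu_a, mu_b), 3))   # 0.704
```

Because records contribute graded degrees rather than 0/1 counts, items near a category boundary influence the rule smoothly instead of flipping it, which is the advantage fuzziness brings to hierarchical categories.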
Chapter 12, entitled “Fuzzy Cluster Analysis of Larger Data Sets” by Roland Winkler, Frank Klawonn, Frank Höppner and Rudolf Kruse, explores another method for data summarization, namely fuzzy clustering. The authors propose to combine two approaches to decrease the computation time and improve the scalability of the classic fuzzy c-means algorithm, based on a theoretical analysis of the reasons for its high complexity, in both time and memory, and on an efficient data structure. Indeed, the high computational cost of fuzzy c-means is basically due to the fact that all data belong to all clusters: the membership degrees can be very low, but never equal 0, which also implies that all data points have an influence on all clusters. The authors combine a modification of the fuzzifier function, to avoid this effect, with a suitable data organization exploiting a neighborhood representation of the data, so as to significantly speed up the algorithm; the efficiency of the proposed method is illustrated through experiments.
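The cost structure described above comes from the standard fuzzy c-means membership update, in which every degree is strictly positive, so every point influences every cluster. Below is a minimal numpy sketch of this standard update, i.e. the starting point the chapter modifies, not the chapter's modified fuzzifier; the data and centers are illustrative.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Standard fuzzy c-means update: u[i,k] = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    Every degree is strictly positive, so every point influences every cluster."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                        # guard against division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # rows sum to 1

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u = fcm_memberships(X, centers)
print(u.round(4))     # even the far point keeps a tiny nonzero membership
print((u > 0).all())  # True
```

Since these nonzero degrees force every point into every cluster-center update, replacing the fuzzifier so that negligible memberships become exactly zero, as the chapter does, directly removes work from each iteration.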
Chapter 13, entitled “Fuzzy Clustering with Repulsive Prototypes” by Frank Rehm, Roland Winkler and Rudolf Kruse, also considers fuzzy clustering, focusing on the selection of the appropriate number of clusters: the latter is classically determined by a procedure that consists in testing several values and choosing the optimal one according to a validation criterion. As this process can be very time-consuming, the authors propose to address the problem as an integrated part of the clustering process, by making the algorithm insensitive to excessively high values of this parameter. To that aim, they modify the update equations for the cluster centers so as to impose a repulsive effect between centers, pushing the unnecessary ones to locations where they do not disturb the result. Both the classic fuzzy c-means and its Gustafson-Kessel variant are considered.
The last part of the book, Chapters 14 to 16, is dedicated to real-world challenges that consider the scalability of fuzzy methods from a practical point of view, presenting success stories from different domains that use different techniques, for both supervised and unsupervised data mining tasks: Chapter 14 considers massive data streams of car warranty data; Chapter 15 addresses the indexing of huge amounts of multimedia data using forests of fuzzy decision trees, following the same approach as the one presented in Chapter 2; and Chapter 16 belongs to the bioinformatics domain, which is among the domains that currently give rise to the largest datasets to handle, and more precisely focuses on micro-array data. Chapter 3, which describes a data warehouse used to manage large collections of music data, also belongs to this real-world challenges part.
Chapter 14, entitled “Early Warning from Car Warranty Data using a Fuzzy Logic Technique” by Mark Last, Yael Mendelson, Sugato Chakrabarty and Karishma Batra, addresses the problem of detecting emerging car problems as early as possible by mining a warranty database of customer claims, which record information on dealer location, car model, car manufacturing and selling dates, claim date, mileage to date, complaint code, labor code, etc. Warranty databases constitute massive data streams that are updated with thousands of new claims on a daily basis. This chapter introduces an original approach to mine these data streams, proposing a fuzzy method for the automatic detection of evolving maintenance problems. For this purpose, the authors propose to study frequency histograms using a method based on a cognitive model of human perception instead of crisp statistical models. The obtained results reveal significant emerging and decreasing trends in the car warranty data.
The problem of video mining is tackled in Chapter 15, entitled “High Scale Fuzzy Video Mining” by Christophe Marsala and Marcin Detyniecki, where the authors propose to use forests of fuzzy decision trees to perform automatic indexing of huge volumes of video shots. The main purpose of the chapter is to detect high-level semantic concepts, such as “indoor”, “map” or “military staff”, that can then be used for any query and processing on videos. This data mining problem requires addressing large, unbalanced and multiclass datasets, and takes place in the highly competitive context of the TRECVid challenge organized by NIST. The authors report the success of the fuzzy ensemble learning approach they propose, which proves to be both tractable and of high quality. They also underline the robustness advantage of the fuzzy framework, which improves the results compared to other data mining tools.
Chapter 16, entitled “Fuzzy clustering of large relational bioinformatics datasets” by Mihail Popescu, considers the practical problem of fuzzy clustering of very large relational datasets, in the framework of bioinformatics, to extract information from micro-array data. It describes the whole process of addressing such problems, presenting both the theoretical machine learning methods to be used and the practical processing system. The three-step approach considered consists in subsampling the data, clustering the sample, and then extending the results to the whole dataset. On the practical side, the chapter describes the methods applied to select the appropriate parameters, including the fuzzifier and the number of clusters, the latter determined using a cluster validity index. It also describes the adjustments that prove necessary to handle the real dataset, in particular regarding the sampling step. The experiments are performed on real data containing around 37,000 gene sequences.
The book thus gathers contributions from various research domains that address the combined issue of fuzziness and scalability from different perspectives, including both theoretical and experimental points of view, considering different definitions of scalability and different topics related to the use of fuzzy logic and fuzzy set theory. The variety of these points of view is one of the key features of this book, making it a valuable guide for researchers, students and practitioners.