Intelligent Data Analysis for Real-Life Applications: Theory and Practice

Rafael Magdalena-Benedito (Intelligent Data Analysis Laboratory, University of Valencia, Spain), Marcelino Martínez-Sober (Intelligent Data Analysis Laboratory, University of Valencia, Spain), José María Martínez-Martínez (Intelligent Data Analysis Laboratory, University of Valencia, Spain), Joan Vila-Francés (Intelligent Data Analysis Laboratory, University of Valencia, Spain) and Pablo Escandell-Montero (Intelligent Data Analysis Laboratory, University of Valencia, Spain)
Indexed In: SCOPUS
Release Date: June, 2012 | Copyright: © 2012 | Pages: 444
ISBN13: 9781466618060 | ISBN10: 146661806X | EISBN13: 9781466618077 | DOI: 10.4018/978-1-4666-1806-0

Description

With the recent and enormous increase in the amount of available data sets of all kinds, applying effective and efficient techniques for analyzing and extracting information from that data has become a crucial task.

Intelligent Data Analysis for Real-Life Applications: Theory and Practice investigates the application of Intelligent Data Analysis (IDA) to these data sets through the design and development of algorithms and techniques to extract knowledge from databases. This pivotal reference explores practical applications of IDA, and it is essential for academic and research libraries as well as students, researchers, and educators in data analysis, application development, and database management.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Computational Intelligence
  • Data Mining
  • Decision and regression trees
  • Graphical models
  • Information Retrieval
  • Intelligent data analysis applications
  • Knowledge-based systems
  • Scalable algorithms
  • Swarm Intelligence
  • Systems application

Reviews and Testimonials

This handbook provides a state-of-the-art overview of Intelligent Data Analysis methods and a wide range of application domains. Each contribution in this book is clear evidence of the tremendous potential of Machine Learning techniques to make the computers of tomorrow 'smarter.'

– Adam E Gaweda, University of Louisville, USA

The 18 papers in this collection explore machine learning techniques for extracting information from large amounts of data with several variables. Six papers from Spanish universities review Bayesian network classifiers, apply computer vision to sorting fruit, describe data visualization for industrial processes, and propose agent-based systems for reading engineering sketches. Other topics include landmark sliding from 3D shape correspondence, detecting impact craters in planetary images, probabilistic graphical models for sports video mining, automatic text classification from labeled and unlabeled data, and the impact of enterprises' organizational quality on their economic results. B&w images and diagrams are provided.

– Book News Inc. Portland, OR

Table of Contents and List of Contributors

Preface

What is Intelligent Data Analysis? The question is well grounded, because the name itself is somewhat ambiguous. The main idea behind the term is to extract knowledge from data.

This is the Age of Information. Technology is ubiquitous, technology is cheap: technology is nowadays everything. Moore's Law has brought our world to the Information Technology Society, and even the most remote corner of the world is today covered by telecommunications technology. A high-end cellular phone exhibits more computing power than the computer that guided man to the Moon more than forty years ago. And we use it for playing bird-killing on-line games!

But with great power comes great responsibility, or it should. The cheap, powerful computing capabilities of nearly every appliance, the fast data highways that plough through the Earth, and the nearly unlimited storage resources available everywhere, at every moment, are flooding us with digital data. The Age of Information could also be defined as the Curse of Data, because it is quite cheap and easy to gather and store data, but people need information and chase knowledge. They have the haystack, but want the needle.

It is not easy to extract knowledge from raw data, and it is not cheap either. The curse of cheap hardware, cheap bandwidth, and cheap processors is an extraordinarily large amount of data, a very large number of variables, and very little knowledge about what is cooking inside these data.

In the recent past, scientists and technologists have relied on traditional statistics to cope with the task of extracting information from data. The edifice of statistics has been deeply rooted in the ground of mathematics since the seventeenth century but, during recent decades, this enormous amount of data and variables has overwhelmed the capabilities of classical statistics. There is no way for classical methods to deal with such amounts of data; people cannot visualize even the smallest part of the information. They are unable to extract knowledge from these vast, freshly gathered datasets.

Mathematics is also now coming to help, going beyond classical statistics and bringing tools that enable the extraction of some information from these huge datasets. These new tools are collectively called "Intelligent Data Analysis." But Mathematics is not the only discipline involved in Data Analysis. Engineering, Computer Science, Database Science, Machine Learning, and even Artificial Intelligence are bringing their power to this newly born data analysis discipline.

Intelligent Data Analysis could be defined as the set of tools that enable the extraction of information hidden in very large amounts of data, with very large numbers of variables; data that represent very complex, non-linear, in two words, real-life problems, which are intractable with the old tools. People must be able to cope with high dimensionality, sparse data, very complex and unknown relationships, biased, wrong, or incomplete data, and with algorithms or methods that lie on the foggy frontier between Mathematics, Engineering, Physics, Computer Science, Statistics, Biology, and even Philosophy.

Moreover, starting from the raw data, Intelligent Data Analysis can help us cope with prediction tasks without knowing a theoretical description of the underlying process, with the classification of new events on the basis of past ones, and with modeling the aforementioned unknown processes. Classification, prediction, and modeling are the cornerstones of what Intelligent Data Analysis can bring to us.

And in this Brave New Information World, information is the key. It is the key, the power, and the engine that moves the economy. The world moves on market data, medical epidemiology datasets, Internet browsing records, geological survey data, complex engineering models, and so on. Nearly every digital activity nowadays generates a large amount of data that can be easily gathered and stored, and the greatest value of those data is the information lying behind them.

This book approaches Intelligent Data Analysis from a very practical point of view. There are many academic books about the theory of data mining and analysis, but the approach in this book comes from a real-world view: solving common, real-life problems with data analysis tools. It is a very "engineering" point of view, in the sense that the book presents a real problem, usually defined by complex, non-linear, and unknown processes, and offers a data-analysis-based solution that gives the opportunity to solve the problem, or even to infer the process underlying the raw data. The book discusses practical experiences with intelligent data analysis.

This book is aimed at scientists and engineers carrying out research in very complex, non-linear areas, such as economics, biology, or data processing, who work with large amounts of data and need to extract knowledge from them, knowledge that can take the form of prediction, classification, or modeling. The book also brings a valuable point of view to engineers and businesspeople working in companies, trying to solve practical, economic, or technical problems in the field of their company's activities or expertise. The purely practical approach helps to transmit the authors' aim: to communicate a way to approach and cope with problems that would be intractable in any other way. Finally, final-year courses in Engineering, Mathematics, or Business degrees can use this book to give students a new point of view for approaching and solving real, practical problems when the underlying processes are not clear.

Obviously, prior knowledge of statistics, discrete mathematics, and machine learning is desirable, although the authors provide several references to help engineers and scientists apply the experience and know-how described in each chapter to their own benefit.

The book is structured as follows. The first section of the book is about machine learning methods applied to real-world problems. In Chapter 1, "A Discovery Method of Attractive Rules from the Tabular Structured Data," Prof. Sakurai introduces a method for analyzing transactions generated from tabular structured data. The method focuses on relationships between attributes and their values in the data. The chapter introduces a processing method for transactions in which missing values occur, an efficient discovery method for patterns, and their evaluation criteria. The topic of Chapter 2, "Learning Different Concept Hierarchies and the Relations between them from Classified Data" by Benites and Sapozhnikova, is closely related to two fields of intelligent data analysis, namely automatic ontology learning and multi-label classification. In the chapter, the authors investigate multi-label classification when the labels come from several taxonomies providing different insights into the data. This is a more interesting and less investigated task: finding interclass relationships may reveal new and unexpected links between different concept hierarchies. This enables the integration of multiple data sources, on the one hand, and the improvement of classification performance, on the other. Thus, the task is first to extract concept hierarchies by analyzing the multi-labels in each label set and then to find hierarchical, or so-called generalized, association rules that describe the most important connections between different label sets. To be more precise, the co-occurrences of each label pair from two label sets are examined, taking into account the extracted hierarchies. The method is validated by experiments on real-world data in the chapter. In Chapter 3, "Individual Prediction Reliability Estimates in Classification and Regression," Pevec, Bosnic, and Kononenko note that current machine learning algorithms perform well on many problem domains, but many experts are reluctant to use them because overall assessments of models do not provide enough information about individual predictions. The authors summarize the research areas that motivated the development of various approaches to estimating individual prediction reliability. Following an extensive empirical evaluation, the chapter shows the usefulness of these estimates in attempts to predict breast cancer recurrence, where the reliability of individual predictions is of crucial importance. In Chapter 4, "Landmark Sliding for 3D Shape Correspondence," Dalal and Wang discuss shape correspondence in a population of shapes, which enables landmark recognition and classification. The discussion is centered on the 3D Landmark Sliding method, with a measure based on thin-plate splines and additional consistency restrictions. Chapter 5, titled "Supervised Classification with Bayesian Networks: A Review on Models and Applications," by Flores, Gámez, and Martínez, is about Bayesian networks. Bayesian network classifiers (BNCs) are Bayesian network (BN) models specifically tailored for classification tasks. This chapter presents an overview of the main existing BNCs, especially semi-naïve Bayes, but also other network-based classifiers, dynamic models, and multi-dimensional ones. In addition, mechanisms to handle numeric variables are described and analyzed. The final section of the chapter focuses on applications and recent developments, including some of the BNCs' approaches to the multi-class problem, together with other traditionally successful and cutting-edge cases regarding real-world applications.
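As a concrete illustration of the kind of analysis Chapter 2 describes, the following is a minimal sketch of mining association rules between two label sets by counting label co-occurrences. The function name, toy data, and thresholds are illustrative assumptions, not code from the book.

```python
# Illustrative sketch: mine "label A in taxonomy 1 -> label B in taxonomy 2"
# rules from multi-labeled items, in the spirit of Chapter 2's co-occurrence
# analysis. All names, data, and thresholds are invented for demonstration.
from collections import Counter
from itertools import product

def cross_taxonomy_rules(items, min_support=2, min_confidence=0.8):
    """items: list of (labels_from_set1, labels_from_set2) pairs."""
    pair_counts = Counter()
    label1_counts = Counter()
    for labels1, labels2 in items:
        label1_counts.update(labels1)
        pair_counts.update(product(labels1, labels2))
    rules = []
    for (a, b), n in pair_counts.items():
        confidence = n / label1_counts[a]  # P(b | a), estimated from counts
        if n >= min_support and confidence >= min_confidence:
            rules.append((a, b, n, confidence))
    return rules

# Toy usage: documents labeled in a topic taxonomy and a genre taxonomy.
docs = [
    ({"sports", "football"}, {"news"}),
    ({"sports"}, {"news"}),
    ({"finance"}, {"analysis"}),
    ({"finance"}, {"analysis", "news"}),
]
for a, b, n, conf in cross_taxonomy_rules(docs):
    print(f"{a} -> {b}  (support={n}, confidence={conf:.2f})")
```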

The second section groups the chapters on machine learning applications in computer vision. Chapter 6, "Decay Detection in Citrus Fruits Using Hyperspectral Computer Vision" by Gómez, Olivas, Lorente, Martínez, Escandell, Guimerá, and Blasco, is about the early automatic detection of fungal infections in post-harvest citrus fruits, which is especially important for the citrus industry because only a few infected fruits can spread the infection to a whole batch during operations such as storage or exportation. Infections by Penicillium fungi are among the main defects that may affect the commercialization of citrus fruits, and economic losses in fruit production may become enormous if an early detection of that kind of fungi is not carried out. Nowadays, this detection is carried out manually by trained workers illuminating the fruit with dangerous ultraviolet lighting. This work presents a new approach based on hyperspectral imagery and a set of machine learning techniques to detect decay caused by Penicillium digitatum and Penicillium italicum. The proposed system constitutes a feasible and implementable solution for the citrus industry, as proven by the fact that several machinery enterprises have shown interest in implementing and patenting the system. Chapter 7, "In-line Sorting of Processed Fruit Using Computer Vision: Application to the Inspection of Satsuma Segments and Pomegranate Arils," by Blasco, Aleixos, Cubero, Albert, Lorente, and Gómez, deals with the creation of a system for the in-line inspection of processed fruit based on computer vision, applied to the particular cases of pomegranate arils and satsuma segments ready for consumption. Computer vision systems have been widely applied to the inspection of fresh fruit. However, due to the relatively small market and the difficulty of handling and analysing a very complex product such as minimally processed fruit, this technology has not yet been applied to this sector. This work shows the development of complete prototypes for the in-line, automatic inspection of this kind of fruit, including the development of the image processing algorithms. Once the images are captured, statistical methods based on the Bayes decision rule are used to reach a decision about the quality of each object in order to classify the objects and separate them into commercial categories. The prototypes have been tested in producing companies under actual commercial conditions. In Chapter 8, titled "Detecting Impact Craters in Planetary Images Using Machine Learning," Stepinski, Ding, and Vilalta remark that robotic exploration of the Solar System over the past few decades has resulted in the collection of a vast amount of imagery data, turning traditional methods of planetary data analysis into a bottleneck in the discovery process. The chapter describes an application of machine learning to a particularly common and important approach to the analysis of planetary images: the detection and cataloging of impact craters. The chapter discusses how supervised learning can help improve the efficiency and accuracy of crater detection algorithms. The first algorithm has been successfully applied to catalog relatively large craters over the entire surface of the planet Mars, and the second algorithm addresses the need for detecting very small craters in high-resolution images.
In Chapter 9, "Integration of the Image and NL-text Analysis/Synthesis Systems," Khakhalin, Kurbatov, Naidenova, and Lobzin describe an intelligent analytical system intended to combine image and natural language text processing and to model the interconnection of these processes in the framework of translating images into texts and vice versa. Plane geometry ("planimetry") has been selected as the applied domain of the system. The system includes the following subsystems: Image Analyzer, Image Synthesizer, Linguistic Analyzer of NL-text, Synthesizer of NL-text, and Applied Ontology. The ontology describes the knowledge common to these subsystems. The Analyzers use the applied ontology language for describing the results of their work, and this language is the input for the Synthesizers. The language of semantic hypergraphs, a semantic network in which n-dimensional relations are naturally represented, has been selected for ontological knowledge representation. All principal questions of implementing the enumerated subsystems are considered, including the realization of key components of linguistic and image analysis, such as the parsing of complicated and elliptic clauses, the segmentation of complicated, complex, and compound sentences, machine learning in natural language processing, and some others.
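The Bayes decision rule mentioned in the description of Chapter 7 can be illustrated in a few lines: assign each object to the class with the highest posterior, i.e., the class maximizing prior times likelihood. The single color feature, the Gaussian class-conditional models, and all numbers below are illustrative assumptions, not taken from the actual prototypes.

```python
# Minimal illustration of the Bayes decision rule for quality grading:
# choose the class c maximizing P(c) * p(x | c). The feature, the Gaussian
# class-conditional densities, and all numbers are invented for illustration.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical quality classes for a fruit segment, each with a prior
# probability and a Gaussian model of one color feature (e.g., mean hue).
classes = {
    "first_quality":  {"prior": 0.70, "mean": 30.0, "std": 4.0},
    "second_quality": {"prior": 0.25, "mean": 40.0, "std": 6.0},
    "reject":         {"prior": 0.05, "mean": 55.0, "std": 5.0},
}

def classify(feature_value):
    """Bayes decision rule: argmax over classes of prior * likelihood."""
    return max(
        classes,
        key=lambda c: classes[c]["prior"]
        * gaussian_pdf(feature_value, classes[c]["mean"], classes[c]["std"]),
    )

print(classify(32.0))  # close to the first-quality mean -> "first_quality"
print(classify(53.0))  # close to the reject mean -> "reject"
```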

The next section of the book groups the chapters under other machine learning applications. In Chapter 10, "Fault-Tolerant Control of Mechanical Systems Using Neural Networks," Sunan, Kok Kiong, and Tong Heng discuss control techniques suitable for harsh environments. They propose a fault-tolerant controller, based on artificial neural networks, which can cope successfully with errors while keeping the system running continuously. The results can be applied to mechanical systems with probable component failures. Chapter 11 is titled "Supervision of Industrial Processes using Self Organizing Maps," written by Díaz, Cuadrado, Díez, Domínguez, Fuertes, and Prada. The chapter presents a corpus of visualization techniques based on the self-organizing map (SOM) to explore the behavior of industrial processes as well as to monitor their state. The chapter includes well-established techniques, but also recent developments for novelty detection, correlation discovery, and the visual analysis of process dynamics, and it illustrates these ideas with two application cases (a toy SOM sketch appears after this section overview). In Chapter 12, "Learning and Explaining the Impact of Enterprises' Organizational Quality on their Economic Results," Pregeljc, Strumbelj, Mihelcic, and Kononenko present their study of the economic results of 72 enterprises, illustrating the usefulness of machine learning tools in economic research. Prediction models are used to predict the enterprises' performance from various indicators of their organizational quality. Furthermore, a novel post-processing method is used to aid the interpretation of the models' predictions, which provides useful economic insights even in the case of more complex and non-transparent prediction models. In Chapter 13, "Automatic Text Classification from Labeled and Unlabeled Data," Professor Jiang presents a semi-supervised text classification system that integrates a clustering-based Expectation-Maximization algorithm into radial basis function networks and can learn to classify effectively from a very small set of previously labeled samples and a large quantity of additional unlabeled data. In the last few years, there has been surging interest in developing semi-supervised learning models, and these models are particularly relevant to many text classification problems where labeled training samples are in limited supply while relevant unlabeled data are abundantly available. The proposed system can be applied in many areas of information retrieval, document filtering, business intelligence mining, and customer service automation. Chapter 14, "Agent Based Systems to Implement Natural Interfaces for CAD Applications" by Fernández, Aleixos, and Albert, is about Computer Aided Sketching (CAS). CAS applications are intended to replace traditional menus with natural interfaces that support sketching for both commands and drawing, but the recognition process is very complex and not yet solved. This chapter gives an overview of the most important advances in the CAS field. The authors propose a solution for a CAS tool based on agents in the framework of CAD (Computer Aided Design) applications, showing that agent-based systems are valid for applications that require decision-making rules guided by knowledge, and for these particular applications as well. The result is an agent-based paradigm that allows users to draw freely, regardless of what they draw, the intended action, the number of strokes, or the order in which they are introduced. Finally, some recommendations and future work are stated.
Chapter 15, "Gaussian Process-based Manifold Learning for Human Motion Modeling," by Guoliang and Xin, studies human motion modeling using Gaussian process-based manifold learning approaches. Specifically, the authors focus on walking motion, which is unique to every individual and could be used for many medical and biometric applications. The goal is to develop a general low-dimensional (LD) model from a set of high-dimensional (HD) motion capture (MoCap) data acquired from different individuals, where two main factors are involved: pose (a specific posture in a walking cycle) and gait (a specific walking style). Many Gaussian process (GP)-based manifold learning methods have been proposed to explore a compact and smooth LD manifold embedding for motion representation, where only one factor (pose) is revealed explicitly while the other (gait) is treated implicitly or independently. The authors recently proposed a new GP-based joint gait-pose manifold (JGPM) that unifies these two variables into one manifold structure to capture the coupling effect between them. As a result, the JGPM is able to capture motion variability both across different poses and among multiple gaits (i.e., individuals) simultaneously. In order to show the advantages of jointly modeling the two factors in one manifold, the authors developed a validation technique to compare the proposed JGPM with recent GP-based methods in terms of their capability for motion interpolation, extrapolation, denoising, and recognition. The experimental results demonstrate the advantages of the proposed JGPM for human motion modeling. Chapter 16, "Probabilistic Graphical Models for Sports Video Mining," by Guoliang and Yi, studies the application of probabilistic graphical models to sports video mining. The authors present a multi-level video semantic analysis framework featuring hybrid generative-discriminative probabilistic graphical models. A three-layer semantic space is introduced, by which the problem of semantic video analysis is cast into two inter-related inference problems defined at different semantic levels. In the first stage, a multi-channel segmental hidden Markov model (MCSHMM) is developed to jointly detect multi-channel mid-level keywords from low-level visual features, which can serve as building blocks for high-level semantic analysis. In the second stage, the authors propose auxiliary segmentation conditional random fields (ASCRFs) to discover the game flow from the multi-channel keywords, which provides a unified representation of both event-based and structure-based semantics. The use of hybrid generative-discriminative approaches is proven to be effective and appropriate in the two sequential and related stages of sports video mining. The experimental results on a set of American football video data demonstrate that the proposed method offers superior results compared with other traditional machine learning-based video mining approaches. The proposed framework, along with the two new probabilistic graphical models, has the potential to be used in other video mining applications. In Chapter 17, "Static and Dynamic Multi-robot Coverage with Grammatical Evolution Guided by Reinforcement and Semantic Rules," Mingo, Aler, Maravall, and De Lope propose a method to solve multi-robot coverage problems in simulation by means of the grammatical evolution of high-level controllers.
Grammars are a valuable tool for creating programs or controllers, because they allow users to specify a hierarchical structure for the behaviors instead of creating a monolithic system, as other evolutionary approaches generally do. Using grammars allows for developing solutions that are more readable and understandable than monolithic ones. Another advantage of modular decomposition is the possible reuse of modules when new behaviors are being developed. Evolutionary algorithms implement a global search, and they are usually slow and do not scale well when the size of the problem grows. In order to improve the process, the proposed method includes two features: a learning process, and a formalism that allows semantic rules in the grammatical production rules. During the learning process, the algorithm can drive a local search, and the semantic rules are a way of including common-sense reasoning, because the user can discard production rules that are semantically incorrect (although syntactically correct) while the controller is being built. This way, automatic high-level controllers can be built in fewer generations, the solutions found are more readable, and fewer individuals are needed in the population. The system proposed in the chapter is completely reactive, and it considers neither communication among robots nor a map of the environment. In Chapter 18, "Computer-controlled Graphical Avatars and Reinforcement Learning," He and Tang discuss the resources of 3D graphical environments and avatars, which are growing at explosive speed on the World Wide Web and in multimedia (e.g., MPEG-7). At the same time, the demand for 3D animations is growing much faster than the resources, and the work of making high-quality animations is difficult, because it can only be done by a small number of experts. To address this problem, the authors present a method of bestowing intelligence on 3D avatars so that they adapt to different circumstances automatically. Thus, machine learning is used in this area to achieve the goal of "intelligent 3D avatars."
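For readers unfamiliar with the self-organizing map behind the process-monitoring visualizations of Chapter 11, the following is a minimal sketch of a SOM training loop. The grid size, learning schedules, and random stand-in data are illustrative choices, not the chapter's actual setup.

```python
# Minimal self-organizing map (SOM) training loop. Grid size, schedules, and
# the random training data are illustrative assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3          # 10x10 map of 3-dimensional units
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)  # unit grid positions

data = rng.random((500, dim))            # stand-in for process measurements
n_steps = 2000
for t in range(n_steps):
    x = data[rng.integers(len(data))]
    # Best-matching unit (BMU): the unit whose weight vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Learning rate and neighborhood radius both decay over time.
    lr = 0.5 * (1 - t / n_steps)
    radius = 1 + 4 * (1 - t / n_steps)
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    neighborhood = np.exp(-grid_dist2 / (2 * radius ** 2))
    # Pull each unit toward x, weighted by its grid closeness to the BMU.
    weights += lr * neighborhood[..., None] * (x - weights)

# After training, each new measurement can be projected to its BMU, and the
# resulting 2D trajectory on the map visualized to monitor the process state.
```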

The Editors

Rafael Magdalena-Benedito

Marcelino Martínez-Sober

José María Martínez-Martínez

Pablo Escandell-Montero

Joan Vila-Francés

Author(s)/Editor(s) Biography

José M. Martínez-Martínez received the B.Eng. degree in Telecommunication Engineering in 2006, and the M.Eng. degree in Electronics Engineering in 2009, both from the University of Valencia, Spain. He is currently working towards the Ph.D. degree in the IDAL research group, University of Valencia. His research interests are machine learning methods, data mining, and data visualization.

Editorial Board

  • Darko Pevec, University of Ljubljana, Slovenia
  • Elena Sapozhnikova, University of Konstanz, Germany
  • Shigeaki Sakurai, Tokyo Institute of Technology, Japan
  • José Antonio Gámez, University of Castilla – La Mancha (UCLM), Spain
  • Ricardo Vilalta, University of Houston, USA