Data Mining for Model Identification

Data Mining for Model Identification

Diego Liberati (Italian National Research Council, Italy)
Copyright: © 2009 |Pages: 7
DOI: 10.4018/978-1-60566-010-3.ch069
OnDemand PDF Download:
$37.50

Abstract

In many fields of research, as well as in everyday life, it often turns out that one has to face a huge amount of data, without an immediate grasp of an underlying simple structure, often existing. A typical example is the growing field of bio-informatics, where new technologies, like the so-called Micro-arrays, provide thousands of gene expressions data on a single cell in a simple and fast integrated way. On the other hand, the everyday consumer is involved in a process not so different from a logical point of view, when the data associated to his fidelity badge contribute to the large data base of many customers, whose underlying consuming trends are of interest to the distribution market. After collecting so many variables (say gene expressions, or goods) for so many records (say patients, or customers), possibly with the help of wrapping or warehousing approaches, in order to mediate among different repositories, the problem arise of reconstructing a synthetic mathematical model capturing the most important relations between variables. To this purpose, two critical problems must be solved: 1 To select the most salient variables, in order to reduce the dimensionality of the problem, thus simplifying the understanding of the solution 2 To extract underlying rules implying conjunctions and/or disjunctions between such variables, in order to have a first idea of their even non linear relations, as a first step to design a representative model, whose variables will be the selected ones When the candidate variables are selected, a mathematical model of the dynamics of the underlying generating framework is still to be produced. A first hypothesis of linearity may be investigated, usually being only a very rough approximation when the values of the variables are not close to the functioning point around which the linear approximation is computed. On the other hand, to build a non linear model is far from being easy: the structure of the non linearity needs to be a priori known, which is not usually the case. A typical approach consists in exploiting a priori knowledge to define a tentative structure, and then to refine and modify it on the training subset of data, finally retaining the structure that best fits a cross-validation on the testing subset of data. The problem is even more complex when the collected data exhibit hybrid dynamics, i.e. their evolution in time is a sequence of smooth behaviors and abrupt changes.
Chapter Preview
Top

Introduction

In many fields of research, as well as in everyday life, it often turns out that one has to face a huge amount of data, without an immediate grasp of an underlying simple structure, often existing. A typical example is the growing field of bio-informatics, where new technologies, like the so-called Micro-arrays, provide thousands of gene expressions data on a single cell in a simple and fast integrated way. On the other hand, the everyday consumer is involved in a process not so different from a logical point of view, when the data associated to his fidelity badge contribute to the large data base of many customers, whose underlying consuming trends are of interest to the distribution market.

After collecting so many variables (say gene expressions, or goods) for so many records (say patients, or customers), possibly with the help of wrapping or warehousing approaches, in order to mediate among different repositories, the problem arise of reconstructing a synthetic mathematical model capturing the most important relations between variables. To this purpose, two critical problems must be solved:

  • 1 To select the most salient variables, in order to reduce the dimensionality of the problem, thus simplifying the understanding of the solution

  • 2 To extract underlying rules implying conjunctions and/or disjunctions between such variables, in order to have a first idea of their even non linear relations, as a first step to design a representative model, whose variables will be the selected ones

When the candidate variables are selected, a mathematical model of the dynamics of the underlying generating framework is still to be produced. A first hypothesis of linearity may be investigated, usually being only a very rough approximation when the values of the variables are not close to the functioning point around which the linear approximation is computed.

On the other hand, to build a non linear model is far from being easy: the structure of the non linearity needs to be a priori known, which is not usually the case. A typical approach consists in exploiting a priori knowledge to define a tentative structure, and then to refine and modify it on the training subset of data, finally retaining the structure that best fits a cross-validation on the testing subset of data. The problem is even more complex when the collected data exhibit hybrid dynamics, i.e. their evolution in time is a sequence of smooth behaviors and abrupt changes.

Top

Main Focus

More recently, a different approach has been suggested, named Hamming Clustering (Muselli & Liberati 2000). It is related to the classical theory exploited in minimizing the size of electronic circuits, with the additional care to obtain a final function able to generalize from the training data set to the most likely framework describing the actual properties of the data. Such approach enjoys the following remarkable two properties:

Complete Chapter List

Search this Book:
Reset