Minimum Database Determination and Preprocessing for Machine Learning

Angel Fernando Kuri-Morales (ITAM, Mexico)
Copyright: © 2019 | Pages: 38
DOI: 10.4018/978-1-5225-7268-8.ch005


Exploiting large databases demands expensive resources in both storage and processing time. To assess the data correctly, pre-processing steps must be taken before analysis. In particular, categorical data must be transformed by adequately encoding every instance of each categorical variable, and the encoding must preserve the actual patterns while avoiding the introduction of non-existent ones. The authors discuss CESAMO, an algorithm that statistically identifies pattern-preserving codes. The resulting database is more economical and may encompass mixed databases. Thus, they obtain an optimal transformed representation that is considerably more compact without impairing its informational content. To establish the equivalence of the original data set (FD) and the reduced data set (RD), they apply an algorithm that relies on a multivariate regression algorithm (AA). Through the combined application of CESAMO and AA, the equivalent behavior of FD and RD may be guaranteed with a high degree of statistical certainty.
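
The idea of choosing numeric codes that preserve existing patterns can be illustrated with a small sketch. This is not the chapter's CESAMO algorithm, whose details are not given in this abstract; it is a simplified stand-in that assumes a single numeric companion variable, tries random codes for each category, and keeps the assignment that a straight-line fit explains best. The names `linear_fit_rmse` and `search_codes`, the toy data, and the number of trials are all hypothetical.

```python
import random


def linear_fit_rmse(x, y):
    """Root-mean-square error of the least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        b = 0.0  # all codes identical: fall back to a constant fit
    else:
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        b = sxy / sxx
    a = mean_y - b * mean_x
    sq_err = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return (sq_err / n) ** 0.5


def search_codes(categories, target, trials=500, seed=0):
    """Random search for per-category codes in [0, 1) with the lowest fit error.

    Assumption for illustration only: a single numeric target and a linear
    model stand in for whatever approximation the real method uses.
    """
    rng = random.Random(seed)
    levels = sorted(set(categories))
    best_codes, best_err = None, float("inf")
    for _ in range(trials):
        codes = {level: rng.random() for level in levels}
        x = [codes[c] for c in categories]
        err = linear_fit_rmse(x, target)
        if err < best_err:
            best_codes, best_err = codes, err
    return best_codes, best_err


# Toy data (hypothetical): one categorical and one numeric variable.
colors = ["red", "green", "blue", "red", "green", "blue"]
values = [1.0, 5.0, 2.0, 1.2, 4.8, 2.1]

codes, err = search_codes(colors, values)

# Naive alphabetical coding imposes an order the data does not support.
naive = {"blue": 0.0, "green": 0.5, "red": 1.0}
naive_err = linear_fit_rmse([naive[c] for c in colors], values)

print("searched codes:", codes, "rmse:", err)
print("alphabetical codes rmse:", naive_err)
```

In this toy case the alphabetical coding places "green" between "blue" and "red", although its values are the largest, so the fit is poor; the searched codes avoid imposing that spurious order.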
Chapter Preview

Analysis Of Large Databases

To extract the best information from a database, it is convenient to use a set of strategies or techniques that allow us to analyze large volumes of data. These tools are generically known as data mining (DM), which targets new, valuable, and nontrivial information in large volumes of data. DM includes techniques such as clustering (which corresponds to unsupervised learning) and statistical analysis (which includes, for instance, sampling and multivariate analysis).
