Minimum Data Base Determination using Machine Learning

Angel Fernando Kuri-Morales (Departamento de Computación, Instituto Tecnológico Autónomo de México, Mexico City, Mexico)
Copyright: © 2016 | Pages: 18
DOI: 10.4018/IJWSR.2016100101


The exploitation of large data bases frequently implies the investment of large and usually expensive resources, both in terms of storage and of the processing time required. It is possible to obtain equivalent reduced data sets in which the statistical information of the original data is preserved while redundant constituents are dispensed with. The physical embodiment of the relevant features of the data base is therefore more economical. The author proposes a method to obtain an optimal transformed representation of the original data which is, in general, considerably more compact than the original without impairing its informational content. To certify the equivalence of the original data set (FD) and the reduced one (RD), the author applies an algorithm which relies on a Genetic Algorithm (GA) and a multivariate regression algorithm (AA). Through the combined application of GA and AA, the equivalent behavior of both FD and RD may be guaranteed with a high degree of statistical certainty.
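The GA and the multivariate regression algorithm (AA) themselves are not detailed in this preview. As a minimal sketch of the certification idea only, assuming a hypothetical one-variable data set and plain least squares standing in for the paper's AA, one may fit the same model to the full data set and to a reduced one, and accept the reduction only if both models agree within a tolerance:

```python
import random

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

random.seed(42)
# synthetic "full" data set FD: y = 2x + 1 plus noise
full_x = [random.uniform(0, 10) for _ in range(10_000)]
full_y = [2 * x + 1 + random.gauss(0, 0.5) for x in full_x]

# reduced data set RD: here simply a 5% random sample
idx = random.sample(range(len(full_x)), 500)
red_x = [full_x[i] for i in idx]
red_y = [full_y[i] for i in idx]

a_full, b_full = fit_line(full_x, full_y)
a_red, b_red = fit_line(red_x, red_y)

# certify: the reduced set is accepted if both fitted models agree
agree = abs(a_full - a_red) < 0.1 and abs(b_full - b_red) < 0.2
print(agree)
```

The tolerances and the sampling scheme here are invented for the illustration; the paper's method instead searches for the reduced representation with a GA and certifies it with the AA.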


Nowadays, commercial enterprises are strongly oriented toward continuously improving the customer-business relationship (CRM). With the increasing influence of CRM systems, such companies dedicate more time and effort to maintaining better customer-business relationships. The effort implied in getting to know the customer better involves the accumulation of very large data bases in which the largest possible quantity of data regarding the customer is stored.

Data warehouses offer a way to access detailed information about the customer’s history, business facts and other aspects of the customer’s behavior. These databases constitute the information backbone of any well-established company. However, every step and every new link between the company and its customers creates the need to store increasing volumes of data. Hence databases and data warehouses are constantly growing in terms of the number of registers and tables which allow the company to improve its general vision of the customer.

Data warehouses are difficult to characterize when trying to analyze the customers from the company’s standpoint. This problem is generally approached through the use of data mining techniques (Palpanas, 2000; Silva, 2002). Attempting direct clustering over a data base of several terabytes with millions of registers results in a costly and not always fruitful effort. There have been many attempts to solve this problem: for instance, parallel computation, optimization of the clustering algorithms, distributed and grid computing, and so on. Even so, the most efficient methods remain unwieldy when attacking the clustering problem for databases of the size considered above. In this work we present a methodology derived from the practical solution of an automated clustering process over a large database from a real, large-sized company (over 20 million customers). We emphasize the way we used statistical methods to reduce the search space of the problem, as well as the treatment given to the customer information stored in multiple tables of multiple databases.

Because of confidentiality issues the name of the company and the actual final results of the customer characterization are withheld.

Paper Outline

The outline of the paper is as follows. First, we give an overview of the analysis of large databases in Section 2; next we give an overview of the methodology we applied. We describe two possible methods to certify the equivalence of the original data set (the “Universe”), which we denote with UD, and the reduced (equivalent) data set, which we denote with RD. In Section 3 we briefly discuss the case study treated with the proposed methodology. Finally, we conclude in Section 4.

Analysis Of Large Databases

To extract the best information from a database it is convenient to use a set of strategies or techniques which allow us to analyze large volumes of data. These tools are generically known as data mining (DM), which targets new, valuable, and nontrivial information in large volumes of data. It includes techniques such as clustering (which corresponds to non-supervised learning) and statistical analysis (which includes, for instance, sampling and multivariate analysis).

Clustering in Large Databases

Clustering is a popular data mining task which consists of processing a large volume of data to obtain groups in which the elements of each group exhibit quantifiably small differences (under some measure) between them and, contrariwise, large dissimilarities with the elements of other groups. Given its high importance as a data mining task, clustering has been the subject of multiple research efforts and has proven to be useful for many purposes (Jain, Murty, & Flynn, 1999).
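As a concrete illustration of this task (and not of any particular algorithm surveyed below), the classic k-means procedure can be sketched in a few lines; the two synthetic blobs and the deterministic seeding of the centroids are invented for the example:

```python
import random

def kmeans(points, centroids, iters=20):
    """Plain k-means on 2-D points: alternate assignment and centroid update."""
    k = len(centroids)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the group of its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            groups[j].append(p)
        # update step: each centroid moves to the mean of its group
        centroids = [(sum(p[0] for p in g) / len(g),
                      sum(p[1] for p in g) / len(g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids, groups

# two well-separated synthetic blobs of 100 points each
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(100)]
points = blob_a + blob_b

# seed with one point from each blob so the run is deterministic
final_centroids, groups = kmeans(points, [points[0], points[-1]])
print(sorted(len(g) for g in groups))  # → [100, 100]
```

Small intra-group differences and large inter-group dissimilarities are exactly what the assignment step enforces; the difficulty discussed in this paper is that such iterations become prohibitive over terabyte-scale tables.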

Many techniques and algorithms for clustering have been developed, improved and applied (Berkhin, 2006; Kleinberg, Papadimitriou, & Raghavan, 1998; Guha, Rastogi, & Shim, 1998). Some of them try to ease the process on a large database, as in (Peter, Chiochetti, & Giardina, 2003; Ng, & Han, 1994). On the other hand, the so-called “Divide and Merge” (Cheng, Kannan, Vempala et al., 2006) and “Snakes and Sandwiches” (Jagadish, Lakshmanan, & Srivastava, 1999) methods approach clustering by attending to the physical storage of the records comprising the data warehouses. Another strategy to work with a large database is based upon the idea of statistical sampling optimization (Liu, & Motoda, 2012).
