Introduction
Nowadays, commercial enterprises are strongly oriented toward continuously improving their customer relationships. With the growing influence of customer relationship management (CRM) systems, such companies dedicate more time and effort to maintaining better customer-business relationships. The effort implied in getting to know the customer better involves the accumulation of very large databases in which the largest possible quantity of data regarding the customer is stored.
Data warehouses offer a way to access detailed information about the customer’s history, business facts, and other aspects of the customer’s behavior. These databases constitute the information backbone of any well-established company. However, every step and every new attempted link between the company and its customers creates the need to store increasing volumes of data. Hence databases and data warehouses are continuously growing in terms of the number of records and tables, which allows the company to improve its overall vision of the customer.
Data warehouses are difficult to characterize when one tries to analyze the customers from the company’s standpoint. This problem is generally approached through the use of data mining techniques (Palpanas, 2000; Silva, 2002). Attempting direct clustering over a database of several terabytes with millions of records results in a costly and not always fruitful effort. There have been many attempts to solve this problem: for instance, parallel computation, optimization of clustering algorithms, and alternative distributed and grid computing. But even the most efficient methods are unwieldy when attacking the clustering problem for databases of the size considered above. In this work we present a methodology derived from the practical solution of an automated clustering process over the large database of a real large-sized company (over 20 million customers). We emphasize the way we used statistical methods to reduce the search space of the problem, as well as the treatment given to the customer information stored in multiple tables of multiple databases.
Because of confidentiality issues, the name of the company and the actual final results of the customer characterization are withheld.
Paper Outline
The outline of the paper is as follows. In Section 2 we give an overview of the analysis of large databases; next we give an overview of the methodology we applied. We describe two possible methods to certify the equivalence of the original data set (the “Universe”), which we denote UD, and the reduced (equivalent) data set, which we denote RD. In Section 3 we briefly discuss the case study treated with the proposed methodology. Finally, we conclude in Section 4.
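The two certification methods themselves are not detailed in this preview, but the underlying idea, that RD should be statistically indistinguishable from UD on the attributes of interest, can be illustrated with a minimal sketch. The following example is an assumption for illustration only (the variable names `UD` and `RD`, the synthetic data, and the three-standard-error tolerance are not from the paper): it compares the mean of a candidate reduced set against the Universe mean on one numeric attribute.

```python
import random
import statistics

random.seed(0)
# Synthetic stand-ins: UD is the "Universe", RD a candidate reduced set.
UD = [random.gauss(100, 15) for _ in range(100_000)]
RD = random.sample(UD, 2_000)

mu_u, sd_u = statistics.mean(UD), statistics.pstdev(UD)
mu_r = statistics.mean(RD)

# Standard error of the mean for a sample of |RD| records drawn from UD.
se = sd_u / (len(RD) ** 0.5)

# Accept RD as equivalent (on this attribute) if its mean lies within
# a few standard errors of the Universe mean.
equivalent = abs(mu_r - mu_u) < 3 * se
print(equivalent)
```

In practice such a check would be repeated over every attribute used for clustering, and a distribution-level test (rather than a comparison of means alone) would give a stronger certificate of equivalence.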
Analysis of Large Databases
To extract the best information from a database it is convenient to use a set of strategies or techniques that allow us to analyze large volumes of data. These tools are generically known as data mining (DM), which targets new, valuable, and nontrivial information in large volumes of data. It includes techniques such as clustering (which corresponds to unsupervised learning) and statistical analysis (which includes, for instance, sampling and multivariate analysis).
Clustering in Large Databases
Clustering is a popular data mining task which consists of processing a large volume of data to obtain groups in which the elements of each group exhibit quantifiably (under some measure) small differences between them and, conversely, large dissimilarities with elements of other groups. Given its importance as a data mining task, clustering has been the subject of multiple research efforts and has proven useful for many purposes (Jain, Murty, & Flynn, 1999).
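The defining property above, small within-group and large between-group dissimilarity, can be made concrete with a short sketch. The example below uses Euclidean distance and two hand-picked toy groups (both are illustrative choices, not taken from the paper):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_pairwise(points):
    """Average pairwise distance within one group of points."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

# Two toy groups: a good clustering yields small within-group
# distances and a large between-group distance.
group_a = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
group_b = [(9.0, 9.0), (9.1, 8.8), (8.9, 9.2)]

within_a = mean_pairwise(group_a)
within_b = mean_pairwise(group_b)
between = (sum(euclidean(p, q) for p in group_a for q in group_b)
           / (len(group_a) * len(group_b)))

print(within_a, within_b, between)
```

Any clustering quality measure (e.g., a silhouette-style score) is ultimately a trade-off between these two quantities.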
Many techniques and algorithms for clustering have been developed, improved, and applied (Berkhin, 2006; Kleinberg, Papadimitriou, & Raghavan, 1998; Guha, Rastogi, & Shim, 1998). Some of them try to ease the process on a large database, as in (Peter, Chiochetti, & Giardina, 2003; Ng & Han, 1994). Others, such as the “Divide and Merge” (Cheng, Kannan, Vempala et al., 2006) or “Snakes and Sandwiches” (Jagadish, Lakshmanan, & Srivastava, 1999) methods, approach clustering with attention to the physical storage of the records comprising data warehouses. Another strategy for working with a large database is based on statistical sampling optimization (Liu & Motoda, 2012).
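The sampling-based strategy can be sketched in a few lines: cluster only a random sample, then assign every record in the full data set to its nearest resulting centroid. The sketch below is an assumption-laden illustration, not any cited algorithm: it uses a deliberately simple one-dimensional k-means on synthetic data with two well-separated modes.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalar values (illustrative, not optimized)."""
    centroids = random.sample(values, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Keep the old centroid if a bucket happens to be empty.
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids

random.seed(42)
# Synthetic "Universe": two well-separated customer profiles.
universe = ([random.gauss(10, 1) for _ in range(5_000)]
            + [random.gauss(50, 2) for _ in range(5_000)])

sample = random.sample(universe, 500)   # cluster only a 5% sample
centroids = kmeans_1d(sample, k=2)

# Assign every record of the full Universe to its nearest centroid.
labels = [min(range(2), key=lambda j: abs(v - centroids[j]))
          for v in universe]
print(sorted(centroids))
```

Because the expensive step (iterative clustering) touches only the sample, the cost of the full pass over the Universe is a single nearest-centroid assignment per record, which is what makes this strategy attractive for databases with millions of records.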