PAKDD-2007: A Near-Linear Model for the Cross-Selling Problem

Thierry Van de Merckt (VADIS Consulting, Belgium) and Jean-François Chevalier (VADIS Consulting, Belgium)
DOI: 10.4018/978-1-60566-717-1.ch020


This chapter presents VADIS Consulting’s solution to the cross-selling problem of the PAKDD 2007 competition. The authors used RANK, their in-house tool that automates many of the important tasks required to produce a good solution in predictive modelling projects; the competition was an opportunity to benchmark three years of development effort against other tools and techniques. RANK encodes several key steps of the CRISP-DM methodology: data quality audit, data transformation, modelling, and evaluation. The authors used RANK as they would in a normal project, although with far less access to business information, so the task was fairly elementary: they audited the data quality and corrected the problems they found, let RANK build a model by applying its standard recoding, and then applied automatic statistical evaluation for variable selection and pruning. The resulting model was not outstanding in terms of prediction, but it was extremely stable, which is what the authors were looking for.

Applied Methodology

Our methodology for building analytical solutions is based on CRISP-DM. To support our consultants in applying this methodology rigorously and consistently, we have developed a platform called RANK that automates some of the major steps of the process, as shown in Figure 1.

Figure 1.

Methodological steps & RANK contribution

Since the PAKDD07 contest provides the data sets, the target, and the data dictionary, the first three steps are not applicable. Hence, the process is as follows:

  • Audit – Evaluation of the data quality, its consistency, etc.

  • Transformation – Preparation of the data for modeling: defining types, binning, recoding, deriving new variables, linearization of the vector space, normalization, etc.

  • Modeling – Building the model itself by choosing the best technique, the set of relevant variables, etc.

  • Evaluation – Assessing the model’s stability, its statistical relevance, etc., and reviewing its business relevance (this last important step is not applicable to the contest).

The last two steps (Learning and Deployment) are not applicable to the PAKDD07 contest.
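The four applicable steps can be pictured as a simple pipeline. The sketch below is purely illustrative: RANK itself is proprietary, and the function names, the special-value handling, and the trivial stand-in model are assumptions made for the example, not RANK’s actual behaviour.

```python
# Illustrative pipeline: audit -> transform -> model -> evaluate.
# All names and the special-value set {98, 99} are assumptions for this
# sketch; they do not reflect RANK's real API or algorithms.

def audit(rows):
    """Step 1: collect per-variable value counts to help spot anomalies."""
    counts = {}
    for row in rows:
        for var, val in row.items():
            bucket = counts.setdefault(var, {})
            bucket[val] = bucket.get(val, 0) + 1
    return counts

def transform(rows, special=frozenset({98, 99})):
    """Step 2: recode special codes to None so they do not act as counts."""
    return [{v: (None if x in special else x) for v, x in r.items()}
            for r in rows]

def model(rows, target_var):
    """Step 3: a trivial stand-in model -- predict the majority target."""
    hits = sum(r[target_var] for r in rows)
    return 1 if hits * 2 >= len(rows) else 0

def evaluate(rows, target_var, prediction):
    """Step 4: accuracy of the constant prediction, as a baseline check."""
    correct = sum(1 for r in rows if r[target_var] == prediction)
    return correct / len(rows)

data = [{"enquiries": 0, "target": 0}, {"enquiries": 98, "target": 1},
        {"enquiries": 3, "target": 0}, {"enquiries": 1, "target": 0}]
clean = transform(data)
pred = model(clean, "target")
acc = evaluate(clean, "target", pred)
```

In this toy run the majority target is 0, so the stand-in model predicts 0 and scores 3 correct out of 4 rows.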

RANK provides the analyst with substantial help at each of these steps.


The audit allows the analyst to examine the distribution of each variable and to spot anomalies. An example is given in the next figure.

This variable indicates the Number of Bureau Enquiries in the last 6 months for Mortgages. The maximum actual value is 97. The special values are:

  • 98 = Went to bureau and no match found (new file created)

  • 99 = Did not go to bureau

From the data dictionary, the distribution, and the output of RANK, we immediately see that there is a problem with values 98 and 99. Figure 2 shows, for each modality (possibly grouped to form a statistically relevant sample of the data), the total number of cases, the number of clients (target), the equivalent percentages, the index comparing the target density of the modality with that of the total population, and the statistical significance of the modality with respect to the target density. We see that all modalities are significant and that the more a prospect enquires about mortgages, the higher the chances of selling one. However, when we look at the 8+9+ …+99 group of modalities, we see a decrease in the index that makes no business sense. This is simply a side effect of coding “no match found” and “did not go to the bureau” as the values 98 and 99, which groups them with high enquiry counts. This has to be corrected.

Figure 2.

Anomaly for 98 & 99 modalities
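The per-modality statistics described for Figure 2 are easy to recompute from raw data: the index is simply the modality’s target density divided by the overall target density. The function below is a hypothetical reconstruction for illustration only; it is not RANK’s implementation and omits the grouping and significance testing that RANK applies.

```python
# Hypothetical recomputation of per-modality statistics: case count,
# target count, and the index comparing the modality's target density
# with the overall population density (index > 1 means over-represented).

def modality_table(values, targets):
    overall = sum(targets) / len(targets)  # overall target density
    stats = {}
    for v, t in zip(values, targets):
        n, hit = stats.get(v, (0, 0))
        stats[v] = (n + 1, hit + t)
    return {v: {"count": n,
                "targets": hit,
                "index": (hit / n) / overall if overall else 0.0}
            for v, (n, hit) in stats.items()}

# Toy data: overall density is 4/8 = 0.5; modality 2 has density 1.0,
# so its index is 2.0 (twice the population's target density).
vals   = [0, 0, 0, 0, 1, 1, 2, 2]
target = [0, 0, 1, 0, 0, 1, 1, 1]
table = modality_table(vals, target)
```

On real data, an index that rises with enquiry counts and then drops for the 98/99 group is exactly the anomaly the audit flags.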

Another example is the presence of undocumented modalities, such as for MASTERCARD, where the modalities “2” and “1” are not described in the data dictionary. These values must be corrected as well.
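The correction itself can be as simple as mapping the special codes to missing values so that 98 and 99 no longer masquerade as high enquiry counts. The snippet below is a minimal sketch of that idea; the variable name and the choice of None as the missing marker are assumptions for the example, not the competition’s actual field names or RANK’s recoding scheme.

```python
# Sketch of the correction for the enquiry variable: codes 98 ("no match
# found") and 99 ("did not go to bureau") are not counts, so they are
# recoded to None instead of being left among the genuine high values.

SPECIAL = {98: None, 99: None}

def recode(value, special=SPECIAL):
    """Return the value unchanged unless it is a documented special code."""
    return special.get(value, value)

enquiries = [0, 1, 3, 98, 99, 97, 2]   # 97 is the maximum genuine count
cleaned = [recode(v) for v in enquiries]
```

After this recoding, the 98/99 cases can be treated as a separate “unknown” modality rather than distorting the high end of the distribution.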

The audit takes less than 10 seconds to compute, and it took only minutes to analyze the output and spot the anomalies (see Figure 3).

Figure 3.

Audit output
