This chapter introduces a new way of using soft constraints for selecting data analysis methods that match certain user requirements. It presents a software platform for automatic data analysis that uses a fuzzy knowledge base for automatically selecting and executing data analysis methods. To support business users in running data analysis projects, the analytical process must be automated as much as possible. The authors argue that previous approaches based on the formalisation of analytical processes were less successful because selecting and running analytical methods is very much an experience-led heuristic process. The authors show that a system based on a fuzzy knowledge base that stores heuristic expert knowledge about data analysis can successfully lead to automatic intelligent data analysis.
Automating Data Analysis
When we talk about data analysis in this chapter, we refer to the task of discovering a relationship between a number of attributes and representing this relationship in the form of a model. Typically, we are interested in determining the value of some attributes given some other attributes (inference) or in finding groups of attribute-value combinations (segmentation). In this context we will not consider describing parameters of attribute distributions or visualisation.
Models are typically used to support a decision-making process by inferring or predicting the (currently unknown) values of some output attributes given some input attributes, or by determining the group to which the currently observed data record most likely belongs. In this scenario we expect a model to be as accurate as possible. Models can also be used to explain a relationship between attributes. In this scenario we want a model to be interpretable.
A model is created in a (machine) learning process, where the parameters of the model are adapted based on a set of training data. The learning process can be controlled by a separate validation set to prevent overfitting to the training set. The model performance is finally tested on a different test set.
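The three roles of training, validation and test data can be illustrated with a minimal sketch. It assumes a simple linear model fitted by gradient descent on synthetic data, with early stopping driven by the validation error. The function names (`make_data`, `mse`, `train`), the data and all settings are illustrative assumptions and are not taken from the platform described in this chapter.

```python
import random

def make_data(n, seed=0):
    """Synthetic records (x, y) with y = 2x + 1 plus noise; illustrative data only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data

def mse(params, records):
    """Mean squared error of the linear model y = w*x + b on the given records."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in records) / len(records)

def train(train_set, valid_set, lr=0.1, max_epochs=500, patience=10):
    """Adapt (w, b) on the training set; stop when the validation error stalls."""
    w, b = 0.0, 0.0
    best_params, best_err, stale = (w, b), mse((w, b), valid_set), 0
    n = len(train_set)
    for _ in range(max_epochs):
        # Gradient of the training MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w, b = w - lr * grad_w, b - lr * grad_b
        # The validation set only controls when to stop; it never drives updates.
        err = mse((w, b), valid_set)
        if err < best_err:
            best_params, best_err, stale = (w, b), err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params

data = make_data(300)
train_set, valid_set, test_set = data[:180], data[180:240], data[240:]
params = train(train_set, valid_set)
# The test set is touched exactly once, after learning has finished.
test_error = mse(params, test_set)
print(f"w={params[0]:.2f}, b={params[1]:.2f}, test MSE={test_error:.4f}")
```

The learned parameters should end up close to the generating values (2 and 1), and the test error close to the noise variance. The 60/20/20 split is a common convention, not a requirement.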
In business environments, data and problem owners are typically domain experts, but not data analysis experts. That means they do not have the required knowledge to decide which type of model and learning algorithm to choose, how to set the parameters of the learning procedure, how to adequately test the learned model, and so on. In order to support this group of users, we have developed an approach for automating data analysis to some extent.
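To give a first impression of how soft constraints could drive method selection, the following sketch ranks a few methods by how well they satisfy fuzzy user requirements. It is a simplified illustration, not the knowledge base described in this chapter: the method profiles, their numeric values, the `at_least` membership function and the use of the minimum to combine requirements are all assumptions made for the example.

```python
# Hypothetical method profiles: degree in [0, 1] to which a method has a property.
# The numbers are placeholders for illustration, not measured or published values.
METHODS = {
    "decision tree": {"accuracy": 0.6, "interpretability": 0.9},
    "neural network": {"accuracy": 0.9, "interpretability": 0.2},
    "linear regression": {"accuracy": 0.5, "interpretability": 0.8},
}

def at_least(threshold, width=0.3):
    """Fuzzy set 'at least threshold'.

    Membership is 1 at or above the threshold and falls linearly to 0
    over `width` below it, so near misses are penalised, not rejected.
    """
    def membership(value):
        if value >= threshold:
            return 1.0
        return max(0.0, 1.0 - (threshold - value) / width)
    return membership

def rank_methods(requirements, methods=METHODS):
    """Score each method by its weakest requirement (minimum) and sort descending."""
    scores = {}
    for name, profile in methods.items():
        scores[name] = min(req(profile[prop]) for prop, req in requirements.items())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# A user who wants an interpretable model that is still reasonably accurate.
requirements = {
    "interpretability": at_least(0.8),
    "accuracy": at_least(0.7),
}
ranking = rank_methods(requirements)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these placeholder values the decision tree ranks first, because it fully meets the interpretability requirement and only narrowly misses the accuracy requirement. A crisp constraint would have rejected all three methods, which is the practical argument for treating user requirements as soft.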