How Big Does Big Data Need to Be?


Martin Stange (Leuphana University, Germany) and Burkhardt Funk (Leuphana University, Germany)
Copyright: © 2016 | Pages: 12
DOI: 10.4018/978-1-5225-0293-7.ch001


Collecting and storing as much data as possible is common practice in many companies these days. To reduce the cost of collecting and storing data that is not relevant, it is important to define which analytical questions are to be answered and how much data is needed to answer them. In this chapter, a process to determine an optimal sample size is proposed. Based on benefit/cost considerations, the authors show how to find the sample size that maximizes the utility of predictive analytics. Applying the proposed process to a case study shows that only a very small fraction of the available data set is needed to make accurate predictions.
Chapter Preview


Sample size determination has often been examined in medical and sociological research (e.g., Brutti et al. 2009, Sahu & Smith 2006, Santis 2007) because samples in these fields are often expensive to obtain compared with big data environments. In addition, the available methods often focus on a specific task, such as finding the required number of survey participants. These methods therefore do not seem appropriate as a generally applicable framework for predictive analytics with its variety of machine learning and data mining techniques.

In contrast to the available methods that calculate the needed sample size a priori, the proposed process is based on the evaluation of the predictive accuracy and the calculation of the economic value of the classifier.
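The idea of weighing predictive accuracy against data cost can be illustrated with a minimal sketch. It is not the authors' procedure: the linear benefit function, the function name `best_sample_size`, the parameters `value_per_auc_point` and `cost_per_record`, and all numbers are hypothetical assumptions chosen only to show the shape of the calculation.

```python
def best_sample_size(auc_by_size, value_per_auc_point, cost_per_record):
    """Return (n, utility) with the highest utility among the evaluated sizes.

    Assumption (not from the chapter): benefit grows linearly with the AUC
    gained over a random classifier (AUC = 0.5), and cost grows linearly
    with the number of records.
    """
    best = None
    for n, auc in sorted(auc_by_size.items()):
        benefit = value_per_auc_point * (auc - 0.5) * 100  # value per AUC point
        cost = cost_per_record * n
        utility = benefit - cost
        if best is None or utility > best[1]:
            best = (n, utility)
    return best


# Hypothetical learning curve: holdout AUC measured for each training size.
auc_by_size = {1_000: 0.62, 10_000: 0.70, 100_000: 0.72, 1_000_000: 0.725}

n, utility = best_sample_size(
    auc_by_size, value_per_auc_point=500.0, cost_per_record=0.01
)
print(n, round(utility, 2))
```

With these made-up inputs, the gain in AUC flattens while cost keeps rising, so the largest sample is not the one with the highest utility.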

The predictive accuracy of classifiers with dichotomous outcomes can be calculated by integrating the receiver operating characteristic (ROC) curve. The obtained value is called the area under the curve (AUC; Bradley 1997). It can be interpreted as the probability that the classifier ranks a randomly chosen positive record above a randomly chosen negative one. In the case study, the authors fit two logistic regression models with elastic net regularization (Friedman et al. 2010) and use the estimated parameters to predict the dependent variable on the holdout sample. Based on these predictions, the authors show that the AUC converges as the sample size increases. Other types of dependent variables, such as multinomial outcomes, require other measures, such as the misclassification error.
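As a minimal sketch of how the AUC is obtained from the ROC curve, the following pure-Python example sweeps the decision threshold, integrates the resulting curve with the trapezoidal rule, and cross-checks the result against the pairwise ranking probability. The function names and the toy labels and scores are illustrative assumptions, not taken from the chapter's case study.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) from sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort records by descending score; tied scores are processed together.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(labels, scores):
    """Integrate the ROC curve with the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def auc_pairwise(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy holdout sample: 1 = positive class, scores = predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_trapezoid(labels, scores), auc_pairwise(labels, scores))
```

Both functions return the same value on this sample, which shows that the area under the ROC curve and the ranking interpretation are the same quantity.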
