How Size Matters: The Role of Sampling in Data Mining

Paul D. Scott (University of Essex, UK)
Copyright: © 2002 | Pages: 20
DOI: 10.4018/978-1-930708-26-6.ch008

Abstract

This chapter addresses the question of how to decide how large a sample is necessary in order to apply a particular data mining procedure to a given data set. A brief review of the main results of basic sampling theory is followed by a detailed consideration and comparison of the impact of simple random sample size on two well-known data mining procedures: naïve Bayes classifiers and decision tree induction. It is shown that both the learning procedure and the data set have a major impact on the size of sample required but that the size of the data set itself has little effect. The next section introduces a more sophisticated form of sampling, disproportionate stratification, and shows how it may be used to make much more effective use of limited processing resources. This section also includes a discussion of dynamic and static sampling. An examination of the impact of target function complexity concludes that neither target function complexity nor size of the attribute tuple space need be considered explicitly in determining sample size. The chapter concludes with a summary of the major results, a consideration of their relevance for small data sets and some brief remarks on the role of sampling for other data mining procedures.
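The claim that the size of the data set itself has little effect on the required sample size can be illustrated with a minimal sketch based on standard simple random sampling results for an estimated proportion. The function names and parameters below are illustrative assumptions, not taken from the chapter, and the formulas are the usual textbook ones (standard error with finite population correction), which may differ from the analysis the chapter actually uses for naïve Bayes classifiers and decision trees.

```python
import math


def proportion_standard_error(p, n, population_size):
    """Standard error of a proportion estimated from a simple random
    sample of size n drawn without replacement from population_size items.

    Hypothetical helper for illustration only.
    """
    if not 0 < n <= population_size:
        raise ValueError("n must satisfy 0 < n <= population_size")
    # Finite population correction: close to 1 when n << population_size.
    fpc = (population_size - n) / (population_size - 1)
    return math.sqrt(p * (1 - p) / n * fpc)


def required_sample_size(margin, population_size, p=0.5, z=1.96):
    """Sample size needed to estimate a proportion to within +/- margin.

    p=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence.
    Hypothetical helper for illustration only.
    """
    # Size needed for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Adjust downward for a finite population.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


if __name__ == "__main__":
    for size in (10 ** 5, 10 ** 6, 10 ** 9):
        print(size,
              required_sample_size(0.01, size),
              proportion_standard_error(0.5, 1000, size))
```

Under these assumptions, growing the data set from a million to a billion records changes the required sample by only a handful of rows, because the precision of the estimate is governed almost entirely by the sample size rather than by the fraction of the data set sampled.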

Complete Chapter List

Table of Contents
Preface
Ruhul Sarker, Hussein A. Abbass, Charles S. Newton
Chapter 1
R. Sarker, H. Abbass, C. Newton
The terms Data Mining (DM) and Knowledge Discovery in Databases (KDD) have been used interchangeably in practice. Strictly speaking, KDD is the...
Introducing Data Mining and Knowledge Discovery
Chapter 2
A. M. Bagirov, A. M. Rubinov, J. Yearwood
The feature selection problem involves the selection of a subset of features that will be sufficient for the determination of structures or clusters...
A Heuristic Algorithm for Feature Selection Based on Optimization Techniques
Chapter 3
Kai Ming Ting
This chapter reports results obtained from a series of studies on cost-sensitive classification using decision trees, boosting algorithms, and...
Cost-Sensitive Classification Using Decision Trees, Boosting and MetaCost
Chapter 4
Agapito Ledezma, Ricardo Aler, Daniel Borrajo
Currently, the combination of several classifiers is one of the most active fields within inductive learning. Examples of such techniques are...
Heuristic Search-Based Stacking of Classifiers
Chapter 5
Craig M. Howard
The overall size of software packages has grown considerably over recent years. Modular programming, object-oriented design and the use of static...
Designing Component-Based Heuristic Search Engines for Knowledge Discovery
Chapter 6
Jose Ruiz-Shulcloper, Guillermo Sanchez-Diaz, Mongi A. Abidi
In this chapter, we expose the possibilities of the Logical Combinatorial Pattern Recognition (LCPR) tools for Clustering Large and Very Large Mixed...
Clustering Mixed Incomplete Data
Chapter 7
Bayesian Learning (pages 108-121)
Paula Macrossan, Kerrie Mengersen
Learning from the Bayesian perspective can be described simply as the modification of opinion based on experience. This is in contrast to the...
Chapter 8
Paul D. Scott
This chapter addresses the question of how to decide how large a sample is necessary in order to apply a particular data mining procedure to a given...
How Size Matters: The Role of Sampling in Data Mining
Chapter 9
The Gamma Test (pages 142-167)
Antonia J. Jones, Dafydd Evans, Steve Margetts, Peter J. Durrant
The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be...
Chapter 10
Denny Meyer, Andrew Balemi, Chris Wearing
Neural networks are commonly used for prediction and classification when data sets are large. They have a big advantage over conventional...
Neural Networks - Their Use and Abuse for Small Data Sets
Chapter 11
Hyeyoung Park
Feed forward neural networks or multilayer perceptrons have been successfully applied to a number of difficult and diverse applications by using the...
How to Train Multilayer Perceptrons Efficiently With Large Data Sets
Chapter 12
Kevin E. Voges, Nigel K.L. Pope, Mark R. Brown
Cluster analysis is a common market segmentation technique, usually using k-means clustering. Techniques based on developments in computational...
Cluster Analysis of Marketing Data Examining On-line Shopping Orientation: A Comparison of K-Means and Rough Clustering Approaches
Chapter 13
Susan E. George
This chapter presents a survey of medical data mining focusing upon the use of heuristic techniques. We observe that medical mining has some unique...
Heuristics in Medical Data Mining
Chapter 14
A. de Carvalho, A. P. Braga, S. O. Rezende, E. Martineli, T. Ludermir
In the last few years, a large number of companies are starting to realize the value of their databases. These databases, which usually cover...
Understanding Credit Card User's Behaviour: A Data Mining Approach
Chapter 15
Alina Lazar
The goal of this research is to investigate and develop heuristic tools in order to extract meaningful knowledge from archeological large-scale data...
Heuristic Knowledge Discovery for Archaeological Data Using Genetic Algorithms and Rough Sets
About the Authors