# Classification and Regression Trees

Johannes Gehrke (Cornell University, USA)
DOI: 10.4018/978-1-60566-010-3.ch031

## Abstract

It is the goal of classification and regression to build a data mining model that can be used for prediction. To construct such a model, we are given a set of training records, each having several attributes. These attributes can either be numerical (for example, age or salary) or categorical (for example, profession or gender). There is one distinguished attribute, the dependent attribute; the other attributes are called predictor attributes. If the dependent attribute is categorical, the problem is a classification problem. If the dependent attribute is numerical, the problem is a regression problem. It is the goal of classification and regression to construct a data mining model that predicts the (unknown) value for a record where the value of the dependent attribute is unknown. (We call such a record an unlabeled record.) Classification and regression have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Hand, 1997). Many classification and regression models have been proposed in the literature, among the more popular models are neural networks, genetic algorithms, Bayesian methods, linear and log-linear models and other statistical methods, decision tables, and tree-structured models, the focus of this chapter (Breiman, Friedman, Olshen, & Stone, 1984). Tree-structured models, socalled decision trees, are easy to understand, they are non-parametric and thus do not rely on assumptions about the data distribution, and they have fast construction methods even for large training datasets (Lim, Loh, & Shih, 2000). Most data mining suites include tools for classification and regression tree construction (Goebel & Gruenwald, 1999).
Chapter Preview
Top

## Background

Let us start by introducing decision trees. For the ease of explanation, we are going to focus on binary decision trees. In binary decision trees, each internal node has two children nodes. Each internal node is associated with a predicate, called the splitting predicate, which involves only the predictor attributes. Each leaf node is associated with a unique value for the dependent attribute. A decision encodes a data mining model as follows: For an unlabeled record, we start at the root node. If the record satisfies the predicate associated with the root node, we follow the tree to the left child of the root, and we go to the right child otherwise. We continue this pattern through a unique path from the root of the tree to a leaf node, where we predict the value of the dependent attribute associated with this leaf node. An example decision tree for a classification problem, a classification tree, is shown in Figure 1. Note that a decision tree automatically captures interactions between variables, but it only includes interactions that help in the prediction of the dependent attribute. For example, the rightmost leaf node in the example shown in Figure 1 is associated with the classification rule: “If (Age >= 40) and (Salary > 80k), then YES”, as classification rule that involves an interaction between the two predictor attributes age and salary.

Figure 1.

An example classification tree

Decision trees can be mined automatically from a training database of records where the value of the dependent attribute is known: A decision tree construction algorithm selects which attribute(s) to involve in the splitting predicates and the algorithm decides also on the shape and depth of the tree (Murthy, 1998).

Top

## Main Thrust

Let us discuss how decision trees are mined from a training database. A decision tree is usually constructed in two phases. In the first phase, the growth phase, an overly large and deep tree is constructed from the training data. In the second phase, the pruning phase, the final size of the tree is determined with the goal to minimize the expected mis-prediction error (Quinlan, 1993).

## Complete Chapter List

Search this Book:
Reset
Contents by Volume
Contents by Topic
Foreword
Alexander Tuzhilin
Acknowledgment
John Wang
Chapter 1
Action Rules Mining  (pages 1-5)
Zbigniew W. Ras, Elzbieta Wyrzykowska, Li-Shiang Tsay
\$37.50
Chapter 2
Ion Muslea
\$37.50
Chapter 3
Xueping Li
\$37.50
Chapter 4
\$37.50
Chapter 5
Dan Zhu
\$37.50
Chapter 6
Chun-Che Huang, Tzu-Liang ("Bill") Tseng
\$37.50
\$37.50
Chapter 8
J. Ben Schafer
\$37.50
Chapter 9
Gustavo Camps-Valls, Manel Martínez-Ramón, José Luis Rojo-Álvarez
\$37.50
Chapter 10
Sandra Elizabeth González Císaro
\$37.50
Chapter 11
Wenxue Huang, Milorad Krneta, Limin Lin, Jianhong Wu
\$37.50
Chapter 12
Vassilios S. Verykios
\$37.50
Chapter 13
Yew-Kwong Woon
\$37.50
Chapter 14
Luminita Dumitriu
\$37.50
Chapter 15
Anne Denton
\$37.50
Chapter 16
\$37.50
Chapter 17
Zheng-Hua Tan
\$37.50
Chapter 18
Audio Indexing  (pages 104-109)
Gaël Richard
\$37.50
\$37.50
Chapter 20
Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Pérez-Quiñones, Edward A. Fox, William Cameron, Lillian Cassel
\$37.50
Chapter 21
Xin Zhang
\$37.50
Chapter 22
Shu-Chiang Lin
\$37.50
Chapter 23
Yinghui Yang
\$37.50
Chapter 24
Les Pang
\$37.50
Chapter 25
Scott Nicholson
\$37.50
Chapter 26
Gustavo Camps-Valls, Alistair Morgan Chalk
\$37.50
Chapter 27
Jieping Ye, Ravi Janardan, Sudhir Kumar
\$37.50
Chapter 28
\$37.50
Chapter 29
Lei Tang, Huan Liu, Jiangping Zhang
\$37.50
\$37.50
Chapter 31
Johannes Gehrke
\$37.50
Chapter 32
Classification Methods  (pages 196-201)
Aijun An
\$37.50
Chapter 33
Andrzej Dominik
\$37.50
\$37.50
Chapter 35
Frank Klawonn, Frank Rehm
\$37.50
\$37.50
Chapter 37
Dingxi Qiu, Edward C. Malthouse
\$37.50
Chapter 38
Cluster Validation  (pages 231-236)
Ricardo Vilalta, Tomasz Stepinski
\$37.50
Chapter 39
Athman Bouguettaya, Qi Yu
\$37.50
Chapter 40
Joshua Zhexue Huang
\$37.50
Chapter 41
Mei Li, Wang-Chien Lee
\$37.50
Chapter 42
Anne Denton
\$37.50
Chapter 43
On Clustering Techniques  (pages 264-268)
Sheng Ma, Tao Li
\$37.50
Chapter 44
Richard S. Segall
\$37.50
Chapter 45
Eamonn Keogh, Li Keogh, John C. Handley
\$37.50
Chapter 46
Amin A. Abdulghani
\$37.50
Chapter 47
Elzbieta Malinowski, Esteban Zimányi
\$37.50
Chapter 48
Constrained Data Mining  (pages 301-306)
\$37.50
Chapter 49
Carson Kai-Sang Leung
\$37.50
Chapter 50
Francesco Bonchi
\$37.50
Chapter 51
Alexander mirnov
\$37.50
Chapter 52
Marko Robnik-Šikonja
\$37.50
Chapter 53
Yi-Cheng Tu, Gang Ding
\$37.50
Chapter 54
Cost-Sensitive Learning  (pages 339-345)
Victor S. Sheng, Charles X. Ling
\$37.50
Chapter 55
Kehan Gao, Taghi M. Khoshgoftaar
\$37.50
Chapter 56
Christine W. Chan
\$37.50
Chapter 57
Seunghyun Im, Zbigniew W. Ras
\$37.50
Chapter 58
Alfredo Cuzzocrea
\$37.50
Chapter 59
Junjie Wu, Jian Chen, Hui Xiong
\$37.50
\$37.50
Chapter 61
Data Mining and Privacy  (pages 388-393)
Esma Aïmeur
\$37.50
Chapter 62
Paola Cerchiello
\$37.50
Chapter 63
Joaquín Ordieres-Meré, Manuel Castejón-Limas, Ana González-Marcos
\$37.50
Chapter 65
Roberto Marmo
\$37.50
\$37.50
Chapter 67
Luciana Dalla Valle
\$37.50
Chapter 68
Silvia Figini
\$37.50
Chapter 69
Diego Liberati
\$37.50
Chapter 70
Mª Dolores del Castillo
\$37.50
Chapter 71
\$37.50
Chapter 72
Ng Yew Seng, Rajagopalan Srinivasan
\$37.50
\$37.50
Chapter 74
Haipeng Wang
\$37.50
Chapter 75
Aleksandar Lazarevic
\$37.50
\$37.50
\$37.50
Chapter 79
Data Mining on XML Data  (pages 506-510)
Qin Ding
\$37.50
Chapter 80
Christophe Giraud-Carrier
\$37.50
Chapter 81
Amin A. Abdulghani
\$37.50
Chapter 82
Hai Wang, Shouhong Wang
\$37.50
Chapter 83
Mohammed Alshalalfa
\$37.50
Chapter 84
Magdi Kamel
\$37.50
Chapter 85
Data Provenance  (pages 544-549)
Vikram Sorathia
\$37.50
Chapter 86
William E. Winkler
\$37.50
Chapter 87
Richard Jensen
\$37.50
Chapter 88
Data Streams  (pages 561-565)
João Gama, Pedro Pereira Rodrigues
\$37.50
Chapter 89
Amitava Mitra
\$37.50
Chapter 90
Alkis Simitsis, Dimitri Theodoratos
\$37.50
Chapter 91
Beixin ("Betsy") Lin, Yu Hong, Zu-Hsu Lee
\$37.50
Chapter 92
Richard Mathieu
\$37.50
Chapter 93
Yuefeng Li
\$37.50
Chapter 94
Lutz Hamel
\$37.50
Chapter 95
Patricia E.N. Lutu
\$37.50
Chapter 96
Edgar R. Weippl
\$37.50
Chapter 97
Martin Žnidaršic, Marko Bohanec, Blaž Zupan
\$37.50
Chapter 98
Decision Tree Induction  (pages 624-630)
Roberta Siciliano, Claudio Conversano
\$37.50
Chapter 99
Monica Maceli, Min Song
\$37.50
Chapter 100
Matteo Golfarelli
\$37.50
Chapter 101
Hanghang Tong, Yehuda Koren, Christos Faloutsos
\$37.50
\$37.50
Chapter 103
Richi Nayak
\$37.50
Chapter 104
Jan H Kroeze
\$37.50
Chapter 105
William W. Agresti
\$37.50
Chapter 106
Haiquan Li, Jinyan Li, Xuechun Zhao
\$37.50
Chapter 107
\$37.50
Chapter 108
Mafruz Zaman Ashrafi
\$37.50
Chapter 109
Yu Chen, Wei-Shinn Ku
\$37.50
Chapter 110
Distributed Data Mining  (pages 709-715)
Grigorios Tsoumakas
\$37.50
Chapter 111
José Ignacio Serrano
\$37.50
Chapter 112
Dynamic Data Mining  (pages 722-728)
Richard Weber
\$37.50
Chapter 113
Chang-Chia Liu, W. Art Chaovalitwongse, Panos M. Pardalos, Basim M. Uthman
\$37.50
Chapter 114
Efficient Graph Matching  (pages 736-743)
Diego Reforgiato Recupero
\$37.50
Chapter 115
Xunkai Wei
\$37.50
Chapter 116
Daniel Crabtree
\$37.50
Chapter 117
Ji-Rong Wen
\$37.50
\$37.50
Chapter 119
Nikunj C. Oza
\$37.50
Chapter 120
Niall Rooney
\$37.50
Chapter 121
Ethics of Data Mining  (pages 783-788)
Jack Cook
\$37.50
Chapter 122
Paolo Giudici
\$37.50
Chapter 123
Ivan Bruha
\$37.50
Chapter 124
Caitlin Kelly Maurie
\$37.50
Chapter 125
Amit Saxena, Megha Kothari, Navneet Pandey
\$37.50
Chapter 126
William H. Hsu
\$37.50
Chapter 127
Laetitia Jourdan
\$37.50
Chapter 128
Daniel Rivero
\$37.50
Chapter 129
Jorge Muruzábal
\$37.50
Chapter 130
Yiyu Yao
\$37.50
Chapter 131
Elzbieta Malinowski, Esteban Zimányi
\$37.50
Chapter 132
Facial Recognition  (pages 857-862)
Rory A. Lewis, Zbigniew W. Ras
\$37.50
Chapter 133
Seoung Bum Kim
\$37.50
Chapter 134
Shouxian Cheng, Frank Y. Shih
\$37.50
Chapter 135
Feature Selection  (pages 878-882)
Damien François
\$37.50
Chapter 136
Indranil Bose
\$37.50
Chapter 137
Hong Shen
\$37.50
Chapter 138
Jamil M. Saquer
\$37.50
Chapter 139
Xuan Hong Dang, Wee-Keong Ng, Kok-Leong Ong, Vincent Lee
\$37.50
Chapter 140
Eyke Hüllermeier
\$37.50
Chapter 141
Michel Schneider
\$37.50
Chapter 142
\$37.50
Chapter 143
Genetic Programming  (pages 926-931)
William H. Hsu
\$37.50
Chapter 144
Alex A. Freitas, Gisele L. Pappa
\$37.50
Chapter 145
Marek Kretowski, Marek Grzes
\$37.50
Chapter 146
Graph-Based Data Mining  (pages 943-949)
Lawrence B. Holder
\$37.50
Chapter 147
Graphical Data Mining  (pages 950-956)
Carol J. Romanowski
\$37.50
Chapter 148
Liang Xiong
\$37.50
Chapter 149
Guided Sequence Alignment  (pages 964-969)
Abdullah N. Arslan
\$37.50
Chapter 150
Benjamin C.M. Fung, Ke Wang, Martin Ester
\$37.50
Chapter 151
Francesco Buccafurri
\$37.50
Chapter 152
Bhavani Thuraisingham
\$37.50
Chapter 153
Janet Delve
\$37.50
Chapter 154
Sancho Salcedo-Sanz, Gustavo Camps-Valls, Carlos Bousoño-Calzón
\$37.50
Chapter 155
Marvin L. Brown, John F. Kros
\$37.50
Chapter 156
Incremental Learning  (pages 1006-1012)
Abdelhamid Bouchachia
\$37.50
Chapter 157
Seokkyung Chung
\$37.50
Chapter 158
Honghua Dai
\$37.50
\$37.50
Chapter 160
Benjamin Griffiths
\$37.50
Chapter 161
Instance Selection  (pages 1041-1045)
Huan Liu
\$37.50
Chapter 162
Stephan Meisel
\$37.50
Chapter 163
Andreas Koeller
\$37.50
\$37.50
Chapter 165
P. Punitha, D.S. Guru
\$37.50
Chapter 166
Zbigniew W. Ras, Agnieszka Dardzinska
\$37.50
Chapter 167
Zheng Zhao
\$37.50
Chapter 168
On Interactive Data Mining  (pages 1085-1090)
Yan Zhao
\$37.50
Chapter 169
Interest Pixel Mining  (pages 1091-1096)
Qi Li, Jieping Ye, Chandra Kambhamettu
\$37.50
Chapter 170
Gustavo Camps-Valls, Manel Martínez-Ramón, José Luis Rojo-Álvarez
\$37.50
Chapter 171
Malcolm J. Beynon
\$37.50
Chapter 172
Doina Caragea, Vasant Honavar
\$37.50
Chapter 173
QingXiang Wu, Martin McGinnity, Girijesh Prasad, David Bell
\$37.50
Chapter 174
Learning Bayesian Networks  (pages 1124-1128)
Marco F. Ramoni, Paola Sebastiani
\$37.50
Chapter 175
Rallou Thomopoulos
\$37.50
Chapter 176
Learning from Data Streams  (pages 1137-1141)
João Gama, Pedro Pereira Rodrigues
\$37.50
Chapter 177
Bojun Yan
\$37.50
Chapter 178
Feng Pan
\$37.50
Chapter 179
Abdelhamid Bouchachia
\$37.50
Chapter 180
Kirsten Wahlstrom, John F. Roddick, Rick Sarre, Vladimir Estivill-Castro, Denise de Vries
\$37.50
Chapter 181
\$37.50
Chapter 182
Carlotta Domeniconi, Dimitrios Gunopulos
\$37.50
Chapter 183
Xiang Zhang, Seza Orcun, Mourad Ouzzani, Cheolhwan Oh
\$37.50
Chapter 184
Dimitri Theodoratos, Wugang Xu, Alkis Simitsis
\$37.50
Chapter 185
Jun Zhang, Jie Wang, Shuting Xu
\$37.50
Chapter 186
Raymond K. Pon, Alfonso F. Cardenas, David J. Buttler
\$37.50
Chapter 187
Miguel García Torres
\$37.50
Chapter 188
Meta-Learning  (pages 1207-1215)
Christophe Giraud-Carrier, Pavel Brazdil, Carlos Soares, Ricardo Vilalta
\$37.50
Chapter 189
Xinghua Fan
\$37.50
Chapter 190
Microarray Data Mining  (pages 1224-1230)
Li-Min Fu
\$37.50
Chapter 191
Diego Liberati
\$37.50
Chapter 192
Li Shen, Fillia Makedon
\$37.50
Chapter 193
Mining Chat Discussions  (pages 1243-1247)
Stanley Loh Daniel Licthnow, Thyago Borges Tiago Primo
\$37.50
Chapter 194
Mining Data Streams  (pages 1248-1256)
Tamraparni Dasu, Gary Weiss
\$37.50
Chapter 195
Gabriele Kern-Isberner
\$37.50
Chapter 196
Mining Email Data  (pages 1262-1267)
Steffen Bickel
\$37.50
Chapter 197
Wen-Yang Lin, Ming-Cheng Tseng
\$37.50
\$37.50
Chapter 199
Mining Group Differences  (pages 1282-1286)
Shane M. Butler
\$37.50
Chapter 200
Junsong Yuan
\$37.50
\$37.50
Chapter 202
David Lo
\$37.50
Chapter 203
Ramon F. Brena, Ana Maguitman
\$37.50
Chapter 204
Lutz Hamel
\$37.50
Chapter 205
Modeling Quantiles  (pages 1324-1329)
Claudia Perlich, Saharon Rosset, Bianca Zadrozny
\$37.50
Chapter 206
Modeling Score Distributions  (pages 1330-1336)
Anca Doloc-Mihu
\$37.50
Chapter 207
Modeling the KDD Process  (pages 1337-1345)
Vasudha Bhatnagar, S. K. Gupta
\$37.50
Chapter 208
Pasquale De Meo, Giovanni Quattrone, Giorgio Terracina, Domenico Ursino
\$37.50
Chapter 209
Chia Huey Ooi
\$37.50
Chapter 210
Omar Boussaid, Doulkifli Boukraa
\$37.50
Chapter 211
\$37.50
Chapter 213
Multilingual Text Mining  (pages 1380-1385)
Peter A. Chew
\$37.50
Chapter 214
Gang Kou, Yi Peng, Yong Shi
\$37.50
Chapter 215
Sach Mukherjee
\$37.50
Chapter 216
Music Information Retrieval  (pages 1396-1402)
Alicja A. Wieczorkowska
\$37.50
Chapter 217
Ingrid Fischer
\$37.50
Chapter 218
Victor S.Y. Lo
\$37.50
Chapter 219
Dilip Kumar Pratihar
\$37.50
Chapter 220
Ioannis N. Kouris
\$37.50
Chapter 221
Indrani Chakravarty
\$37.50
Chapter 222
Alfredo Cuzzocrea, Svetlana Mansmann
\$37.50
Chapter 223
Rebecca Boon-Noi Tan
\$37.50
Chapter 224
Online Signature Recognition  (pages 1456-1462)
Indrani Chakravarty
\$37.50
Chapter 225
James Geller
\$37.50
Chapter 226
Order Preserving Data Mining  (pages 1470-1475)
Ioannis N. Kouris
\$37.50
Chapter 227
Outlier Detection  (pages 1476-1482)
Sharanjit Kaur
\$37.50
Chapter 228
Fabrizio Angiulli
\$37.50
Chapter 229
Jorge Cardoso, W.M.P. van der Aalst
\$37.50
Chapter 230
Andrew K.C. Wong, Yang Wang, Gary C.L. Li
\$37.50
Chapter 231
Hui Xiong, Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Wenjun Zhou
\$37.50
Chapter 232
P. Viswanath, Narasimha M. Murty, Bhatnagar Shalabh
\$37.50
Chapter 233
\$37.50
Chapter 234
Clifton Phua, Vincent Lee, Kate Smith-Miles
\$37.50
Chapter 235
Konstantinos Kotis
\$37.50
Chapter 236
Nilmini Wickramasinghe, Rajeev K. Bali
\$37.50
Chapter 237
\$37.50
\$37.50
Chapter 239
D. R. Mani, Andrew L. Betz, James H. Drew
\$37.50
Chapter 240
Seung-won Hwang
\$37.50
Chapter 241
Alfredo Cuzzocrea, Vincenzo Russo
\$37.50
Chapter 242
Stanley R.M. Oliveira
\$37.50
\$37.50
Chapter 244
Profit Mining  (pages 1598-1602)
Senqiang Zhou
\$37.50
Chapter 245
Ioannis N. Kouris
\$37.50
Chapter 246
Minh Ngoc Ngo
\$37.50
Chapter 247
Ping Deng, Qingkai Ma, Weili Wu
\$37.50
Chapter 248
\$37.50
Chapter 250
Wen-Chi Hou
\$37.50
Chapter 251
Andrew Hamilton-Wright, Daniel W. Stashuk
\$37.50
Chapter 252
Colin Cooper, Michele Zito
\$37.50
Chapter 253
Brian C. Lovell, Shaokang Chen, Ting Shan
\$37.50
Chapter 254
Marzena Kryszkiewicz
\$37.50
Chapter 255
Nicolas Lachiche
\$37.50
Chapter 256
Juha Kontio
\$37.50
Chapter 257
Brian C. Lovell, Shaokang Chen, Ting Shan
\$37.50
Chapter 258
Rough Sets and Data Mining  (pages 1696-1701)
Jerzy W. Grzymala-Busse, Wojciech Ziarko
\$37.50
\$37.50
Chapter 260
V. Suresh Babu, P. Viswanath, Narasimha M. Murty
\$37.50
Chapter 261
Scientific Web Intelligence  (pages 1714-1719)
Mike Thelwall
\$37.50
Chapter 262
Päivikki Parpola
\$37.50
Chapter 263
\$37.50
Chapter 264
Nils Pharo
\$37.50
Chapter 265
Shuguo Han
\$37.50
Chapter 266
Yehuda Lindell
\$37.50
Chapter 267
Parvathi Chundi, Daniel J. Rosenkrantz
\$37.50
\$37.50
Chapter 269
Semantic Data Mining  (pages 1765-1770)
Protima Banerjee
\$37.50
Chapter 270
Chrisa Tsinaraki
\$37.50
Chapter 271
Ludovic Denoyer
\$37.50
Chapter 272
Semi-Supervised Learning  (pages 1787-1793)
Tobias Scheffer
\$37.50
Chapter 273
Cane W.K. Leung
\$37.50
Chapter 274
Sequential Pattern Mining  (pages 1800-1805)
Florent Masseglia, Maguelonne Teisseire, Pascal Poncelet
\$37.50
Chapter 275
K. G. Srinivasa, K. R. Venugopal, L. M. Patnaik
\$37.50
Chapter 276
Liping Jing, Michael K. Ng, Joshua Zhexue Huang
\$37.50
Chapter 277
Seoung Bum Kim, Chivalai Temiyasathit, Sun-Kyoung Park, Victoria C.P. Chen
\$37.50
Chapter 278
Wenyuan Li
\$37.50
Chapter 279
Christophe Giraud-Carrier
\$37.50
Chapter 280
Statistical Data Editing  (pages 1835-1840)
Claudio Conversano, Roberta Siciliano
\$37.50
Chapter 281
Maria Vardaki
\$37.50
Chapter 282
Concetto Elvio Bonafede
\$37.50
Chapter 283
Jun Zhu, Zaiqing Nie, Bo Zhang
\$37.50
Chapter 284
Alexander Thomasian
\$37.50
Chapter 285
Subgraph Mining  (pages 1865-1870)
Ingrid Fischer
\$37.50
Chapter 286
Jason Chen
\$37.50
Chapter 287