Generating and Verifying Risk Prediction Models using Data Mining

Darryl N. Davis (University of Hull, UK) and Thuy T.T. Nguyen (University of Hull, UK)
DOI: 10.4018/978-1-60566-218-3.ch009

Risk prediction models are of great interest to clinicians. They offer an explicit and repeatable means to aid the selection, from a general medical population, of those patients who require referral to medical consultants and specialists. In many medical domains, including cardiovascular medicine, no gold standard exists for selecting referral patients. Where evidential selection is required using patient data, heuristics backed up by poorly adapted, more general risk prediction models are pressed into action, with less than perfect results. In this study, existing clinical risk prediction models are examined and matched to the patient data to which they may be applied, using classification and data mining techniques such as neural nets. Novel risk prediction models are derived using unsupervised cluster analysis algorithms. All existing and derived models are verified as to their usefulness in medical decision support on the basis of their effectiveness on patient data from two UK sites.
Chapter Preview

Risk Prediction Models

In this section, two forms of risk prediction model, as used in routine clinical practice, are introduced. The first, POSSUM, typifies the application of generic models to specific medical disciplines. The second set reflects the clinical heuristics regularly used in medicine. The data used throughout this case study is from two UK clinical sites. The attributes are a mixture of real number, integer, Boolean and categorical values. The data records typically contain many default and missing values. For both sites, the data value space (i.e. the space of all possible values for all attributes in the raw data) is typically too large relative to the data volume (i.e. the number of records) to perform naïve data mining, and some form of data preprocessing is required before using any classifier if meaningful results are to be obtained. Furthermore, as can be seen in the tabulated results, the data once labeled is class-imbalanced, with low risk patients heavily outnumbering high risk patients.

The main characteristics of the cardiovascular data from Clinical Site One (98 attributes and 499 patient records) are:
