A Comparison of Principal Component-Based and Multivariate Regression of Cardiac Disease

Fox Underwood (University of Calgary, Canada) and Stefania Bertazzon (University of Calgary, Canada)
DOI: 10.4018/978-1-4666-1924-1.ch003

Abstract

Selecting factors suitable to use in a regression model is often a complicated process: the researcher strives to retain all theoretically important factors while avoiding high correlations among independent variables. This chapter models cardiac disease and compares the explanatory ability of component-based multivariate regression models, created through the use of principal component analysis (PCA), with that of direct variable-based multivariate regression models. The variable-based demographic and socio-economic model contains education, sex, and three age factors; in contrast, the component-based model contains age as well as several modifiable risk factors: education, income, family, and housing factors. Moreover, the latter model has statistically higher explanatory power. Components produced by data reduction techniques are not always readily interpretable, but closer examination of the individual components makes a component-based model more interpretable. Further, all important factors can potentially be represented in the model. As such, component-based modelling can be a useful tool for research and public health planning. A key limitation of this work, to be addressed in future research, is the use of a variable (cardiac catheterisation procedures) that remains a crude proxy for cardiovascular disease. More effective analysis will be performed as data become available. The relationships among factors and their spatial patterns will also be explored.

Introduction

Many primary health concerns of contemporary societies are inherently spatial: effective accessibility to health care services, prompt and efficient response to epidemic outbreaks, detection and monitoring of environmental health hazards, and consequent urban and regional planning. Management and planning decisions are often supported by quantitative models; indeed the use of quantitative and statistical methods in health research is well established, and the application of spatial analytical methods has gained acceptance over the last several years (Elliott et al., 2000; Waller and Gotway, 2003; Elliott and Wartenberg, 2004). It has been argued that the integration of analytical and visual methods can improve the effectiveness of spatial analysis as a decision support tool for policy and management (Guo, 2007). Further, it has been argued that, though some difficulties still exist, geographic information science (GIS) has a potential role in improving public health (Rushton, 2000).

Despite this potential to assist management and planning decisions, the application of quantitative methods to spatial data remains a delicate process, and simplistic application does not necessarily yield reliable and relevant results. One well-known problem is that of estimating uncertainty, which stems from two intrinsic properties of geographical phenomena: spatial dependence (i.e. near things are more related than distant things) and spatial non-stationarity (i.e. inconstant variability over space) (Cliff and Ord, 1981). Each property violates an assumption of standard regression (independence and stationarity, respectively), inflating the variance, and hence the uncertainty, of the regression estimates and resulting in less reliable models (Anselin, 1988). An additional source of model instability is cross-correlation among the explanatory variables, known as multicollinearity (McGarigal et al., 2001). A common strategy to avoid multicollinearity or near-multicollinearity is drastic model selection, in which only explanatory variables that display low cross-correlation are retained in the model (McGarigal et al., 2001).
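
As an illustration of how cross-correlation among explanatory variables can be detected, the following minimal sketch computes variance inflation factors (VIFs) with NumPy. The VIF is a common diagnostic that the chapter preview itself does not name, so its use here is an assumption; the data are synthetic, and the variable names (income, education, age) are hypothetical stand-ins rather than the study's actual variables.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (assumption): education is built to be nearly collinear
# with income, while age is independent of both.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(size=n)
education = 0.9 * income + 0.1 * rng.normal(size=n)
age = rng.normal(size=n)

X = np.column_stack([income, education, age])
vifs = variance_inflation_factors(X)
print(dict(zip(["income", "education", "age"], np.round(vifs, 1))))
```

In this synthetic setting the two collinear columns receive large VIFs and the independent column a VIF near 1. A selection strategy of the kind described above would drop one of the collinear pair, at the cost of losing a potentially important factor.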

This paper presents the use of multivariate regression analysis to identify demographic and socio-economic variables significantly associated with cardiovascular disease prevalence in Calgary, a large Canadian city. Given the available data (i.e. spatially aggregated records), spatial regression techniques are applied where spatial dependence is present, to enhance the reliability of the model parameters. In response to multicollinearity, the use of principal component analysis (PCA) (McGarigal et al., 2001) is proposed, leading to a comparison of regression models that use orthogonal components and selected uncorrelated variables, respectively.

Selecting factors suitable to use in a regression model is often a complicated process: the researcher strives to retain all theoretically important factors while avoiding high correlations among independent variables. This tends to result in a compromise rather than a success as not all important factors can always be retained. ‘Rules of thumb’ and intuition are also often used when deciding on factors to retain; the larger the pool of factors, the more time-consuming this process is. Such effort can be reduced by employing sets of components gleaned through data reduction techniques rather than sets of variables.
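
The contrast drawn above, between regressing on a reduced set of weakly correlated variables and regressing on orthogonal components, can be sketched as follows. This is a minimal, non-spatial illustration using ordinary least squares and an SVD-based PCA in NumPy; it is not the authors' procedure, which applies spatial regression to real data. The data are synthetic, and every variable name and coefficient is an assumption made for illustration.

```python
import numpy as np

def fit_ols(A, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    A1 = np.column_stack([np.ones(len(y)), A])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

def pca_scores(X, k):
    """Standardise X, then return the first k component scores,
    their loadings, and the share of variance each explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:k].T              # orthogonal by construction
    explained = s ** 2 / np.sum(s ** 2)
    return scores, Vt[:k], explained[:k]

# Synthetic data (assumption): three correlated socio-economic indicators
# driven by one hypothetical latent factor, plus an independent age variable.
rng = np.random.default_rng(1)
n = 300
latent = rng.normal(size=n)
income = latent + 0.3 * rng.normal(size=n)
education = latent + 0.3 * rng.normal(size=n)
housing = latent + 0.3 * rng.normal(size=n)
age = rng.normal(size=n)
X = np.column_stack([income, education, housing, age])
y = 1.0 * latent + 0.5 * age + 0.5 * rng.normal(size=n)  # hypothetical outcome

# Variable-based model: keep only weakly correlated columns (education, age).
_, r2_vars = fit_ols(X[:, [1, 3]], y)

# Component-based model: regress on the first two principal components,
# which draw on all four original variables.
scores, loadings, explained = pca_scores(X, 2)
_, r2_comp = fit_ols(scores, y)

print("R^2, variable-based: ", round(r2_vars, 3))
print("R^2, component-based:", round(r2_comp, 3))
print("loadings (rows are components):")
print(np.round(loadings, 2))
```

Inspecting the loadings shows which original variables contribute to each component, which is the closer examination that makes a component-based model interpretable. Whether the component-based fit explains more than the variable-based fit depends on the data; the sketch shows only the mechanics of the comparison.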
