Outlier Detection in Logistic Regression

A. A. M. Nurunnabi (SLG, University of Rajshahi, Bangladesh), A. B. M. S. Ali (CQUniversity, Australia), A. H. M. Rahmatullah Imon (Ball State University, USA) and Mohammed Nasser (University of Rajshahi, Bangladesh)
DOI: 10.4018/978-1-4666-1830-5.ch016

Abstract

The use of logistic regression, its modelling, and decision making from the estimated model and subsequent analysis have drawn a great deal of attention since its inception. Current applications of logistic regression include epidemiology, biomedical research, criminology, ecology, engineering, pattern recognition, machine learning, wildlife biology, linguistics, business, and finance. Logistic regression diagnostics have attracted both theoreticians and practitioners in recent years. The detection and handling of outliers is an important task in the data modelling domain, because the presence of outliers often misleads modelling performance. Traditionally, logistic regression models were used to fit data obtained under experimental conditions, but in recent years it has become an important issue to measure the scale of outliers before using the data as input to a logistic model. Logistic regression requires a higher mathematical level than most other material, which holds back its study and application despite its importance. This chapter presents several diagnostic aspects and methods in logistic regression. As in linear regression, the estimates of logistic regression are sensitive to unusual observations: outliers, high-leverage points, and influential observations. Numerical examples and analysis are presented to demonstrate the most recent outlier diagnostic methods, using data sets from the medical domain.
Chapter Preview

Logistic Regression Model Formulation

Regression analysis deals with how the values of the response (dependent) variable change with changes in one or more explanatory (independent) variables. It is appealing because it provides a conceptually simple method for investigating functional relationships among variables (Chatterjee and Hadi, 2006). In any regression problem the key quantity is the mean value of the outcome (dependent or response) variable given the values of the explanatory (independent) variable(s), E(Y|X). In linear regression, we assume that this mean is expressed as an equation linear in X (or some transformations of X or Y), such as

E(Y|X) = β₀ + β₁X₁ + … + β_k X_k. (1)

Hence

Y = E(Y|X) + ε, (2)

which in matrix form is

Y = Xβ + ε, (3)

where X is an n × (k+1) matrix containing the data for each case with first column equal to 1, Y is an n × 1 vector of responses, β is the (k+1) × 1 vector of regression parameters, and ε is the error vector. The main difference between linear regression and logistic regression is that in logistic regression the outcome (response) variable is categorical (binary, ordinal, or nominal). In logistic regression, we use the quantity π(x) = E(Y|X = x) to represent the conditional mean of Y given X. The specific form of the logistic regression model is

π(x) = exp(β₀ + β₁x) / [1 + exp(β₀ + β₁x)], with 0 ≤ π(x) ≤ 1, (4)

and, with k explanatory variables,

π(x) = exp(x′β) / [1 + exp(x′β)], (5)

where x′ = (1, x₁, …, x_k). This form gives an S-shaped (sigmoid) curve. The well-known 'logit' transformation in terms of π(x) is

g(x) = ln[π(x) / (1 − π(x))] = β₀ + β₁x₁ + … + β_k x_k. (6)

Hence, in logistic regression, the model in Equation (3) stands as y = π(x) + ε.
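The formulation above can be sketched numerically. The following is a minimal NumPy sketch, not the chapter's own code: it assumes simulated data with hypothetical parameters (β₀ = −0.5, β₁ = 2), fits the model of Equations (4)–(6) by iteratively reweighted least squares (the standard Newton-Raphson scheme for the logistic log-likelihood), and then computes hat values and standardized Pearson residuals, two basic quantities used later in outlier and leverage diagnostics.

```python
import numpy as np

def logistic(eta):
    # pi(x) = exp(eta) / (1 + exp(eta)), the S-curve of Eq. (4)
    return 1.0 / (1.0 + np.exp(-eta))

def logit(p):
    # g(x) = ln[pi / (1 - pi)], the logit transformation of Eq. (6)
    return np.log(p / (1.0 - p))

# Hypothetical data: one explanatory variable, binary response
rng = np.random.default_rng(0)
x = rng.normal(size=200)
beta0, beta1 = -0.5, 2.0                  # assumed "true" parameters
y = rng.binomial(1, logistic(beta0 + beta1 * x))

# n x (k+1) design matrix with a leading column of ones, as in Eq. (3)
X = np.column_stack([np.ones_like(x), x])

# Iteratively reweighted least squares (Newton-Raphson)
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = logistic(X @ beta)
    W = p * (1.0 - p)                     # binomial variance weights
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information
    beta = beta + np.linalg.solve(hess, grad)

# Diagnostics: hat values and standardized Pearson residuals,
# the building blocks for flagging high-leverage and outlying cases
p = logistic(X @ beta)
W = p * (1.0 - p)
Xw = X * np.sqrt(W)[:, None]              # weighted design matrix
H = Xw @ np.linalg.inv(Xw.T @ Xw) @ Xw.T  # logistic hat matrix
h = np.diag(H)                            # leverage of each case
r_pearson = (y - p) / np.sqrt(W)
r_std = r_pearson / np.sqrt(1.0 - h)      # standardized Pearson residuals

print("estimates:", beta)
```

Cases with large |r_std| behave as response outliers, while cases with large h exert high leverage; both ideas are developed in the diagnostic methods later in the chapter.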
