The Acceptable R-Square in Empirical Modelling for Social Science Research

DOI: 10.4018/978-1-6684-6859-3.ch009


This chapter examines the acceptable R-square in social science empirical modelling, with a particular focus on why a low R-square model is acceptable in empirical social science research. The chapter shows that a low R-square model is not necessarily bad, because the goal of most social science modelling is not to predict human behaviour; rather, it is often to assess whether specific predictors or explanatory variables have a significant effect on the dependent variable. A low R-square of at least 0.1 (or 10 percent) is therefore acceptable, on the condition that some or most of the predictors or explanatory variables are statistically significant. If this condition is not met, the low R-square model cannot be accepted. A high R-square model is also acceptable, provided that there is no spurious causation in the model and no multicollinearity among the explanatory variables.
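One way to see the abstract's central claim is with simulated data: a model whose R-square is only about 0.1 can still contain a clearly significant predictor when the sample is reasonably large. The sketch below is purely illustrative (the data, effect size, and sample size are invented, not taken from the chapter):

```python
import numpy as np

# Simulated regression: a weak but real effect buried in noise,
# giving a low R-squared yet a highly significant slope.
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
y = 0.35 * x + rng.normal(scale=1.0, size=n)   # weak signal, much noise

# Ordinary least squares by hand (intercept + slope)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# t-statistic for the slope coefficient
sigma2 = resid @ resid / (n - 2)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope

print(f"R-squared = {r2:.3f}, slope t-statistic = {t_slope:.1f}")
```

With these invented parameters the R-square lands near 0.1 while the slope's t-statistic is far above the conventional 1.96 threshold, illustrating that a low R-square and a statistically significant predictor can coexist.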

2. Literature Review

There is a substantial literature on the R-squared and its applications. Miles (2005) showed that the R-squared and adjusted R-squared statistics are derived from analyses based on the general linear model (e.g., regression, ANOVA): the R-squared represents the proportion of variance in the outcome variable explained by the predictor variables in the sample, while the adjusted R-squared estimates that proportion in the population.

Hagle and Mitchell (1992) suggest a refinement of the R-squared called the pseudo R-squared. They argue that the corrected Aldrich-Nelson pseudo R-squared is a good estimate of the R-squared of a regression model because of its smaller standard deviations, smaller range of errors, and smaller error of regression, and they point out that the Aldrich-Nelson correction is more robust when the assumption of normality is violated. However, they caution that even a good summary measure can be misinterpreted, so the pseudo R-squared should be used in conjunction with other measures of model performance.

Chicco et al. (2021) note that the R-squared is a popular standard metric for evaluating regression analyses in any scientific domain, and argue that the coefficient of determination is more informative and truthful than other goodness-of-fit measures. Cameron and Windmeijer (1997) show that R-squared-type goodness-of-fit summary statistics have been constructed for linear models using a variety of methods. They propose an R-squared measure of goodness of fit for the class of exponential family regression models, which includes the logit, probit, Poisson, geometric, gamma, and exponential models, defining the R-squared as the proportionate reduction in uncertainty, measured by Kullback-Leibler divergence, due to the inclusion of regressors.
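The two statistics Miles (2005) describes can be sketched directly from their standard formulas; the function names and data below are illustrative, not from the chapter:

```python
import numpy as np

def r_squared(y, y_hat):
    """Sample R-squared: proportion of variance in y explained by the fit."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Population estimate: penalises R-squared for the p predictors used."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Invented illustrative data
y = np.array([1.0, 2.0, 3.0, 4.0])       # observed outcome
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # fitted values from some model
r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), p=1)
print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj:.3f}")
```

The adjusted value is always at most the sample value, and the gap widens as more predictors are added relative to the sample size.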
They also show that, under further conditions on the conditional mean function, this R-squared can be interpreted as the fraction of uncertainty explained by the fitted model. Gelman et al. (2019) argue that the usual definition of the R-squared (the variance of the predicted values divided by the variance of the data) is problematic for Bayesian fits, since the numerator can be larger than the denominator. They propose an alternative definition, similar to one that has appeared in the survival analysis literature: the variance of the predicted values divided by the variance of the predicted values plus the expected variance of the errors.
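The alternative definition proposed by Gelman et al. (2019) can be sketched numerically; the predictions and residuals below are invented for illustration, and by construction the ratio can never exceed 1:

```python
import numpy as np

def bayesian_r2(y_pred, residuals):
    """Gelman et al. (2019) alternative R-squared:
    var(predicted) / (var(predicted) + var(errors)).
    Because both terms in the denominator are non-negative,
    this ratio is bounded above by 1."""
    var_fit = np.var(y_pred)
    var_res = np.var(residuals)
    return var_fit / (var_fit + var_res)

# Invented illustrative numbers (not from the chapter)
y_pred = np.array([2.1, 2.9, 4.2, 5.0, 5.8])
residuals = np.array([-0.4, 0.3, -0.1, 0.5, -0.3])
print(f"alternative R-squared = {bayesian_r2(y_pred, residuals):.3f}")
```

In a fully Bayesian workflow this quantity would be computed per posterior draw, yielding a distribution of R-squared values rather than a single number; the sketch above uses point predictions only.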
