US Medical Expense Analysis Through Frequency and Severity Bootstrapping and Regression Model

Fangjun Li, Gao Niu
DOI: 10.4018/978-1-7998-8455-2.ch007


To help control health expenditures, a number of papers have investigated the characteristics of patients likely to incur high expenses. Fewer papers, however, start from the overall medical conditions, so this chapter seeks a relationship among the prevalence of medical conditions, the utilization of healthcare services, and the average expense per person. The authors used bootstrapping simulation for data preprocessing and then trained several models with linear regression and random forest methods. By the metrics root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), the selected linear regression model, which used medical conditions, type of services, and their interaction terms as predictors, performed slightly better than the selected random forest regression model.
Chapter Preview

Data Description And Preprocessing

The original data were collected by the Medical Expenditure Panel Survey (MEPS), a set of large-scale surveys covering the health services Americans use, how frequently they use them, the cost of those services, how they are paid for, and health insurance coverage in the US. Among the two major components of MEPS, the Household Component (HC) provides information on household-reported medical conditions. The sample of families and individuals is drawn from households that participated in the prior year's National Health Interview Survey, conducted by the National Center for Health Statistics (Agency for Healthcare Research and Quality, 2019).

The data we use come from the HC summary data tables published by the Agency for Healthcare Research and Quality (AHRQ), which report the number of people with care and the mean expenditure per person from 2016 to 2018 (Agency for Healthcare Research and Quality, n.d.). The data are grouped by 53 condition categories, collapsed from the household-reported conditions coded into ICD-10 and CCSR codes, and by six event types (emergency room visits, home health events, inpatient stays, office-based events, outpatient events, and prescription medicines).

Key Terms in this Chapter

Bootstrapping: A resampling method used to estimate statistics on a population by sampling a dataset with replacement. The process resamples a single dataset to create many simulated samples.
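The resampling step behind this definition can be sketched in a few lines of Python. This is a minimal illustration with made-up expense figures, not the chapter's MEPS data; the function name and the 95% interval indices are our own choices.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=42):
    """Draw n_resamples bootstrap samples (with replacement, each the
    same size as the original data) and return each sample's mean."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(data, k=len(data)))
            for _ in range(n_resamples)]

# Hypothetical per-person expenses, for illustration only.
expenses = [120, 450, 80, 2300, 640, 95, 310, 1500, 75, 220]
means = bootstrap_means(expenses)

# The spread of the resampled means estimates the sampling
# variability of the observed mean expense.
ordered = sorted(means)
lo, hi = ordered[25], ordered[975]  # rough 95% interval
```

Each simulated sample reuses only the observed values, so the resampled means always stay within the range of the original data.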

Multiple R-Squared: Also known as the coefficient of determination, multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variables. It provides a measure of how well observed outcomes are replicated by the model.

Mean Absolute Percentage Error (MAPE): A measure of how accurate a forecast system is, expressed as a percentage of the actual values.

Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.

Root-Mean-Square Error (RMSE): A frequently used measure of the differences between values predicted by a model or an estimator and the values observed, and it is sensitive to outliers.
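The three error metrics above can be computed side by side. A minimal Python sketch with illustrative numbers (not the chapter's results; function names are our own):

```python
import math

def rmse(actual, predicted):
    # Root-mean-square error: squaring penalizes large deviations heavily.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    # Mean absolute error: average magnitude of errors, direction ignored.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Mean absolute percentage error: scale-free, but undefined when an
    # actual value is zero.
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 250.0, 400.0]
predicted = [110.0, 240.0, 370.0]
```

Because squaring weights large errors more, RMSE is always at least as large as MAE on the same data, which is why RMSE is described as sensitive to outliers.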

Node: A node of a decision tree represents a “test” on an attribute. A root node is at the beginning of a tree, where the entire population is analyzed, and each leaf node represents a class label.

Goodness of Fit: The goodness of fit of a statistical model describes how well it fits a set of observations. Commonly used metrics include R-squared and the chi-squared test.

MTRY: A parameter in Random Forest modeling that represents the number of variables sampled at each split.

Type of Events: A home health event is defined as one month during which home health service was received. For prescription medicines, an event is defined as a purchase or refill.

Adjusted R-Squared: A version of R-squared that adjusts for the number of terms in the model, so that adding uninformative predictors cannot inflate the statistic.
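The relationship between the two R-squared statistics can be shown directly. A sketch with hypothetical fitted values, assuming a model with two predictors:

```python
def r_squared(actual, predicted):
    # Coefficient of determination: 1 - (residual SS / total SS).
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(actual, predicted, n_predictors):
    # Penalize for the number of predictors, so adding useless
    # terms cannot inflate the statistic.
    n = len(actual)
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Illustrative values only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
```

For the same fit, the adjusted statistic is always at or below the unadjusted one, and the gap widens as predictors are added.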

Predictive Modeling: A commonly used statistical technique to predict future behavior. Predictive modeling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.

Bagging: An acronym for Bootstrap Aggregating, an ensemble meta-algorithm commonly used to reduce variance within a noisy dataset. Several data samples are generated by random selection with replacement, weak models are then trained on them independently, and their predictions are aggregated to yield a more accurate estimate.
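A minimal bagging sketch, using a simple one-variable least-squares line as the weak base model and made-up data (the function names and figures are ours, not the chapter's):

```python
import random

def fit_line(xs, ys):
    # Ordinary least squares fit of y = a + b*x (the "weak" base model).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate bootstrap sample: fall back to a flat line
        return 0.0, my
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return b, my - b * mx

def bagged_predict(xs, ys, x_new, n_models=200, seed=7):
    """Bootstrap Aggregating: fit one weak model per bootstrap sample,
    then average their predictions to reduce variance."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(intercept + slope * x_new)
    return sum(preds) / len(preds)

# Illustrative data: roughly y = 2x with noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 4.0, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
pred = bagged_predict(xs, ys, 5.0)
```

Random forests extend this idea by also sampling a subset of variables at each split (the `mtry` parameter defined above).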

Split: One node can split into several branches and each branch represents the outcome of the test.

Medical Conditions: The data we used are from 2016 onward, when household-reported conditions were coded into ICD-10 codes and then collapsed into the condition categories in the tables.

NTREE: A parameter in Random Forest modeling that represents the number of trees to grow.

Data Preprocessing: An important step in machine learning: manipulating data before it is used to build a model, for example to enhance performance or to make the data more suitable for the chosen algorithms. It usually involves steps such as data cleaning, data integration, data reduction, and data transformation.

Akaike's Information Criterion (AIC): A standard for measuring the goodness of fit of a statistical model. Based on the concept of entropy, the criterion weighs the complexity of the estimated model against how well the model fits the data.
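For a least-squares model with Gaussian errors, AIC reduces (up to an additive constant) to n·ln(RSS/n) + 2k, where k is the number of parameters. A sketch with illustrative fitted values (the function name and data are our own):

```python
import math

def aic_gaussian(actual, predicted, n_params):
    # AIC for a least-squares fit with Gaussian errors, up to an
    # additive constant: n * ln(RSS / n) + 2 * k.
    n = len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return n * math.log(rss / n) + 2 * n_params

# Hypothetical observed and fitted values.
actual = [3.0, 5.1, 7.2, 8.9, 11.0]
fitted = [3.1, 5.0, 7.0, 9.0, 11.1]
```

With an identical fit, each extra parameter adds exactly 2 to the AIC, so a more complex model must improve the fit enough to pay for its added terms; the model with the lower AIC is preferred.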

Simulated Dataset: A new dataset that resembles, but is not identical to, an existing dataset, created by methods such as bootstrapping.
