Weighting Imputation for Categorical Data

Liang-Ting Tsai (National Taichung University of Education, Taiwan), Chih-Chien Yang (National Taichung University of Education, Taiwan) and Timothy Teo (University of Macau, Macau)
Copyright: © 2014 | Pages: 11
DOI: 10.4018/978-1-4666-5202-6.ch241


This article proposes the Learning Vector Quantization (LVQ) approach for imputing missing group membership and sampling weights, and evaluates how the imputation affects the accuracy of estimated population parameters in confirmatory factor analysis (CFA) models fitted to categorical questionnaire data. Survey data with missing group memberships (for example, gender, age, or ethnicity) are very common. However, the group memberships of examinees are critical for calculating the stratum sampling weights. Asparouhov (2005), Tsai and Yang (2008), and Yang and Tsai (2008) have described how appropriate imputation can further improve the precision of CFA model estimates. Imputation methods for questionnaires with categorical responses, however, are not yet well established. In this study, a Monte Carlo simulation was conducted to compare the LVQ method with three existing methods (listwise deletion, weighting-class adjustment, and non-weighted analysis). Four experimental factors (missing data rates, sample sizes, disproportionate sampling, and different populations) were used to examine the performance of these four methods. The results showed that the LVQ method outperformed the other three methods in the accuracy of the parameter estimates of CFA models with binary or 5-category responses. The conclusion and discussion sections of this article provide some practical guidelines.
Chapter Preview


LVQ (Learning Vector Quantization) has been used to impute missing group membership and stratum weights in confirmatory factor analysis (CFA) models with continuous indicators (Chen, Tsai, & Yang, 2010; Tsai & Yang, 2012). Categorical questionnaires (e.g., binary and Likert-type items) are now widely used in education, business, economics, and psychological testing, as well as in international large-scale surveys (e.g., Trends in International Mathematics and Science Study, TIMSS; Progress in International Reading Literacy Study, PIRLS; Programme for International Student Assessment, PISA; German Survey of Income and Expenditure, SIE; British Labour Force Survey, LFS). This article adapts the LVQ approach to binary and Likert-type questionnaires with missing background information and assesses, through a series of simulations, the accuracy of the resulting CFA parameter estimates.

Questionnaires with binary and other categorical items are widely used in business tests and large-scale international surveys. In addition to the item responses, the databases used to analyze questionnaire results often provide weighting factors to compensate for non-response bias. This information can be used to produce population-level estimates. However, the weighting factors in such surveys cannot account for every background variable that may affect population-level estimates. For example, in the LFS, the weight allocated to each individual to make the respondents more representative of the population was calculated from age, sex, and region of residence alone (Office for National Statistics, 2011). Although the researchers conducting the LFS were interested in the relationship between income and economic activity, the survey database did not provide a weighting factor for participant income. Without this weighting factor, a bias would have been introduced on account of the large number of subjects with missing incomes. This type of non-response bias is frequently encountered in the analysis of large-scale questionnaire data. To the best of our knowledge, however, no method has been proposed in the literature to account for it. Therefore, to better compensate for this bias and provide more accurate population-level estimates, the current study applied the LVQ method to calculate weighting factors for variables of interest.

The concept of sampling weights and the practical applications of survey data have gradually gained importance in advanced statistical models (e.g., CFA, structural equation modeling, multilevel modeling, latent class analysis, latent growth modeling) (Asparouhov, 2005, 2006; Grilli & Pratesi, 2004; Kaplan & Ferguson, 1999; Patterson, Dayton, & Graubard, 2002; Stapleton, 2002, 2006, 2008; Sonnenschein, Stapleton, & Benson, 2010; Tsai & Yang, 2008; Yang & Tsai, 2006, 2008). To obtain valid results from the analysis of survey data, the analyst needs to apply proper sampling weights when estimating the parameters of statistical models. However, missing data is a common problem for researchers (Friedman, Huang, Zhang, & Cao, 2012). The problem is especially serious when the missing data occur in background information, because the sampling weights then cannot be determined either. Researchers need to appropriately impute the missing information (i.e., background information and sampling weights) to correctly infer the population characteristics.
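To make the dependence of sampling weights on group membership concrete, the following minimal Python sketch computes a stratum weight as N_h / n_h (population size of stratum h divided by the number of sampled respondents in stratum h). The function name, the population counts, and the toy sample are hypothetical and are not taken from the chapter; the sketch only illustrates why a respondent with missing group membership has no usable weight.

```python
from collections import Counter

def stratum_weights(population_sizes, sample_groups):
    """Return one weight per respondent: N_h / n_h for the respondent's stratum h.

    Respondents whose group is None (missing) receive None, because their
    stratum, and therefore N_h / n_h, is unknown. Note that n_h is counted
    from observed memberships only (an assumption of this sketch).
    """
    sample_sizes = Counter(g for g in sample_groups if g is not None)
    return [
        None if g is None else population_sizes[g] / sample_sizes[g]
        for g in sample_groups
    ]

# Hypothetical population of 1,000 with disproportionate sampling:
# stratum B is heavily oversampled relative to stratum A.
population_sizes = {"A": 800, "B": 200}
sample_groups = ["A", "A", "B", "B", "B", "B", None]

weights = stratum_weights(population_sizes, sample_groups)
print(weights)
```

In this toy sample each respondent in stratum A stands for 400 people and each respondent in stratum B for 50, while the last respondent cannot be weighted until a group membership is imputed.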

Key Terms in this Chapter

Sampling Weights: A sampling weight is the number of individuals in the population that each respondent in the sample represents.

CFA: Confirmatory Factor Analysis. CFA is a statistical technique used to verify how well a set of observed variables measures a construct.

Listwise Deletion (LWD): Refers to complete-case analysis, in which any observation with a missing value is excluded from the analysis.

Categorical Questionnaires: Questionnaires whose items are binary or Likert-type.

LVQ: Learning Vector Quantization. The main function of LVQ is to categorize information and to make predictions about missing information.

Weighting-Class Adjustment (WCA): A method that uses weighting classes and their nonresponse rates to evaluate and adjust for nonresponse bias.

Missing Data: Refers to the situation in which no data value is stored for a variable in an observation.
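As an illustration of how LVQ can categorize respondents and predict a missing group membership, here is a minimal Python sketch of the basic LVQ1 update rule. This preview does not specify the LVQ configuration used in the chapter, so everything here is an assumption: one prototype per group, a linearly decaying learning rate, and hypothetical function names and toy binary responses.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(prototypes, x):
    """Index of the prototype whose vector is closest to x."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(prototypes[i][0], x))

def train_lvq1(data, labels, epochs=30, lr=0.1, seed=0):
    """Train one prototype per group with the basic LVQ1 rule (a sketch)."""
    rng = random.Random(seed)
    # Initialise each prototype at the mean response vector of its group.
    prototypes = []
    for g in sorted(set(labels)):
        rows = [x for x, y in zip(data, labels) if y == g]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        prototypes.append((mean, g))
    order = list(range(len(data)))
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        rng.shuffle(order)
        for i in order:
            x, y = data[i], labels[i]
            j = nearest(prototypes, x)
            vec, g = prototypes[j]
            # Pull the winner toward x if its label matches, push it away otherwise.
            sign = 1.0 if g == y else -1.0
            prototypes[j] = ([v + sign * rate * (xi - v) for v, xi in zip(vec, x)], g)
    return prototypes

def impute_group(prototypes, x):
    """Predict a missing group membership as the label of the nearest prototype."""
    return prototypes[nearest(prototypes, x)][1]

# Hypothetical binary responses to four items, with known group membership.
data = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1],
        [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
labels = ["A", "A", "A", "B", "B", "B"]

prototypes = train_lvq1(data, labels)
print(impute_group(prototypes, [1, 1, 1, 1]))
print(impute_group(prototypes, [0, 0, 0, 0]))
```

Once a group label has been imputed in this way, the respondent can be assigned the sampling weight of the corresponding stratum.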
