1. Introduction
One of the goals of data analysis is to find factors that characterize events and situations in the world. By understanding a current situation, policy makers, decision makers, and researchers can take actions directed toward changing society and individual lives for the better.
Sometimes decision makers cannot obtain all the data necessary to make a sound decision. For instance, a particular data point may be too costly to collect, or it may be unobservable. In these cases, a proxy variable or attribute may be used. A proxy attribute is an attribute that stands in for data that cannot be acquired. While not a direct measure of the desired data point, a good proxy attribute should be strongly related to the unobserved attribute of interest (Clinton, 2004). Proxies can introduce error in measuring the outcome (Kimball, Sahm, & Shapiro, 2008), but they are necessary when the desired value is needed yet unattainable. Sometimes multiple proxies exist, and using a combination of these proxies can reduce proxy-introduced error (Lubotsky & Wittenberg, 2006; Trickett, Persky, & Espino, 2009). The extent of the error is difficult or impossible to measure because the baseline attribute is, by definition, unavailable for comparison.
At other times, decision makers may want to assess the similarities between datasets. Here the decision maker hopes to make the decision using one dataset as a proxy for another dataset. Braslow and Humez (2014) and Hargittai (2005) investigated using survey data as a proxy for observation data, which are more difficult and expensive to collect. Saunders, Bex, and Woods (2013) investigated the use of crowdsourced data as a proxy for lab-collected data in the medical domain. Crowdsourcing is well established in medical research for assembling large normative datasets. Of course, using one dataset as a proxy for another can also result in errors.
The ability to compare datasets could have great utility and real-world ramifications. One might want to know, for example, whether gender or ethnic differences mattered in terms of election outcomes in one year but not in another, or whether systemic differences like the institutional structure of a regime correspond with the level of freedom in a state. Those studying the causes of war might be interested in whether the causes of civil and international war are similar, and might compare datasets on each to analyze the question. Comparing datasets on the causes of religious and secular terrorism would indicate whether the determinants of both kinds of terrorism are the same. Analyzing a question in this way and showing differences between similar domains can carry ramifications for policy makers and researchers seeking to address the root causes of particular types of problems.
When a proxy attribute is used, regression analysis and other statistical techniques can test hypotheses to evaluate data against an expected outcome. These techniques inform a researcher about relationships between the independent attributes, including the proxy attributes, and the dependent attribute or class attribute. This analysis can be used to study how an attribute influences an outcome while accounting for the other attributes that also influence the outcome. However, when attempting to substitute one dataset for another, regression and other statistical techniques may not provide sufficient information. Other techniques can be more insightful and practical than regression when examining how attributes interact to affect the dependent attribute or class attribute (Andoh-Baidoo & Osei-Bryson, 2007; Chang, 2006). Classification can be used as an analysis technique when proxy attributes are used, and the classification tree itself may act as a proxy tree for a similar domain.
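As a rough illustration of the proxy-tree idea, the sketch below fits a one-split classification tree (a decision stump) on one dataset and scores it on a second, similar dataset. It is not the methodology developed in this paper: the stump learner, the helper names (`fit_stump`, `predict`, `accuracy`), and both toy datasets are hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter

def majority(labels):
    """Return the most common label (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels):
    """Fit a depth-1 classification tree by exhaustive search.

    Tries every (feature, threshold) split and keeps the one with the
    fewest training errors. Returns (feature, threshold, left, right).
    """
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority label
        lab = majority(labels)
        return (0, float("inf"), lab, lab)
    return best[1:]

def predict(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def accuracy(stump, rows, labels):
    hits = sum(predict(stump, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

# Hypothetical toy data: two numeric attributes and a class label.
a_rows = [[1, 5], [2, 3], [3, 8], [6, 1], [7, 7], [8, 2]]
a_labels = ["low", "low", "low", "high", "high", "high"]
b_rows = [[2, 4], [3, 6], [4, 1], [5, 9], [7, 2], [9, 5]]
b_labels = ["low", "low", "low", "high", "high", "high"]

stump_a = fit_stump(a_rows, a_labels)            # model learned on dataset A
proxy_acc = accuracy(stump_a, b_rows, b_labels)  # A's model applied to B
# B's own model, scored on its own training data (an optimistic baseline).
own_acc = accuracy(fit_stump(b_rows, b_labels), b_rows, b_labels)

print("stump from A:", stump_a)
print("proxy accuracy on B:", proxy_acc)
print("B's own accuracy:", own_acc)
```

A small gap between the proxy accuracy and the dataset's own accuracy would suggest that the first model is a plausible stand-in for the second domain; a large gap would suggest that the two domains are governed by different attributes or thresholds. A real analysis would use a full tree learner and held-out data rather than training-set accuracy.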
This paper offers a methodology for evaluating the use of a dataset’s classification model as a proxy model for a similar dataset; it then presents three cases that demonstrate the methodology and the three types of results. To this end, the next sections describe classification analysis and its use as a mechanism for identifying proxy models. The paper then presents the three case studies, on executive leadership, voter turnout, and terrorism, followed by a conclusion.