Introduction
With the growing availability of large volumes of public secondary data, the empirical analysis of such data has gained relevance and importance in information systems (IS) research (Black et al., 2020). Secondary data analysis also aligns well with the positivist research paradigm, which is the dominant research approach within the IS community (Burton-Jones & Lee, 2017). Furthermore, researchers are increasingly expected to draw on data from multiple sources in order to publish, which makes the use of secondary data even more relevant. Prior research has also highlighted several benefits of using secondary data, including (a) the reduction of bias that is sometimes introduced in qualitative approaches such as case studies (Choy, 2014); (b) the avoidance of the intrusiveness associated with other methods, such as action research and interviews (Rabinovich & Cheon, 2011); (c) the absence of issues such as survey fatigue (Sinickas, 2007); and (d) the efficiency and cost-effectiveness of data procurement and use. With the emergence of reputable and highly credible secondary data sources and improved archival and management processes, the use of secondary data for empirical research is expected to grow even further (Black et al., 2020).
The use of secondary data nevertheless has limitations. A significant one is the imbalanced nature of many secondary datasets, particularly when a study explores certain demographic factors or rare events. A dataset is imbalanced when the categories used for classification are disproportionately represented (Ramyachitra & Manikandan, 2014). For example, in the dataset used in this study, if the number of instances of one class (consumers adopting peer-to-peer [P2P] services) is much smaller or much larger than the number of instances of the other class (consumers not adopting P2P services), the dataset is said to be imbalanced. Traditional data analysis approaches often fall short when applied to such skewed data, necessitating the adoption of specialized empirical techniques and informed discretion on the part of researchers. Although the IS research community has increasingly recognized the problem of imbalanced datasets since the COVID-19 pandemic (Dorn et al., 2021), there is insufficient methodological guidance on how to deal with highly skewed datasets.
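One way to see why conventional analysis can mislead on skewed data is to score a trivial rule that always predicts the most common class. The sketch below is only an illustration: the function name `majority_baseline` and the sample of 50 adopters and 950 non-adopters are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def majority_baseline(labels):
    """Score a rule that always predicts the most common class."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    # Accuracy equals the share of the majority class in the sample.
    accuracy = majority_count / len(labels)
    # Every other class is never predicted, so its recall is zero.
    minority_recall = {cls: 0.0 for cls in counts if cls != majority_class}
    return majority_class, accuracy, minority_recall


# Hypothetical sample: 50 adopters (1) and 950 non-adopters (0).
labels = [1] * 50 + [0] * 950
majority_class, accuracy, minority_recall = majority_baseline(labels)
print(majority_class, accuracy, minority_recall)
```

In this hypothetical sample the rule is right for 95% of cases while never identifying a single adopter, which is why overall accuracy alone says little about how well the rare class is captured.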
We endeavor to address this gap by presenting an example of an empirical analysis of a highly imbalanced dataset. Following prior exemplars that offer methodological guidelines (Gefen et al., 2000; Chua & Storey, 2016), we bring to the fore a series of salient decisions that researchers must make when dealing with imbalanced data, including the selection and categorization of variables, the choice of models, and the parameters to set. Furthermore, we demonstrate how different decisions made during empirical analysis lead to diverse findings. We explore the suitability and use of propensity score matching (PSM) (Rosenbaum & Rubin, 1983) and weighted logistic regression (WLR) (King & Zeng, 2001) for analyzing imbalanced data. We also compare the results of the two models and elaborate on when it is appropriate to choose one over the other.
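To give a sense of what the matching step in PSM involves, the following minimal sketch pairs each treated unit with the closest unused control unit on a propensity score. It assumes the scores have already been estimated (for instance, with a logistic regression on the covariates) and uses greedy 1:1 nearest-neighbor matching without replacement. The function name `match_nearest`, the caliper of 0.05, and all scores are hypothetical; this is one common variant, not necessarily the procedure followed in this study.

```python
def match_nearest(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    treated, control: dicts mapping a unit id to its estimated score.
    Matching is without replacement; pairs further apart than the
    caliper are discarded. Returns a list of (treated_id, control_id).
    """
    available = dict(control)
    pairs = []
    # Process treated units in score order so the result is reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control unit is used at most once
    return pairs


# Hypothetical propensity scores, for illustration only.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
control = {"c1": 0.10, "c2": 0.28, "c3": 0.33, "c4": 0.60, "c5": 0.65}
pairs = match_nearest(treated, control)
print(pairs)
```

Here `t3` remains unmatched because no control unit lies within the caliper. Dropping such units is one of the discretionary choices that can change the analyzed sample and, in turn, the findings.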
For illustrative purposes, we make use of secondary data consisting of responses to a survey on the use of P2P money transfer services, conducted by one of the top 25 banks in the northeastern United States. The data are highly skewed, with only 5.4% of customers using bank-based P2P services. We use the responses to this survey to show and explain empirically how methodological decisions impact outcomes. To do so, we examine six research questions that are of interest to banks. We focus on demographic factors (age, gender, income, education, and employment status) and on trust, all of which can be harnessed for strategic business gain.
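To make the 5.4% figure concrete, the sketch below computes inverse-frequency class weights for a binary outcome, w_c = 1 / (2 × share_c). This is the common "balanced" heuristic, assumed here purely for illustration; it is not necessarily the weighting scheme used in this study or the one proposed by King and Zeng (2001). The function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(minority_share):
    """Inverse-frequency weights for a binary outcome.

    Each class receives w_c = 1 / (2 * share_c), so both classes carry
    equal total weight and the average weight over the sample is 1.
    """
    if not 0 < minority_share < 1:
        raise ValueError("minority_share must lie strictly between 0 and 1")
    return {
        "minority": 1 / (2 * minority_share),
        "majority": 1 / (2 * (1 - minority_share)),
    }


# 5.4% is the adoption rate reported in the text.
weights = balanced_class_weights(0.054)
print({name: round(value, 3) for name, value in weights.items()})
```

Under this heuristic, each adopter counts roughly 17.5 times as much as each non-adopter in the fitted model, which shows how strongly a weighting choice can shape the results at this level of skew.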