Handling Imbalanced Data With Weighted Logistic Regression and Propensity Score Matching methods: The Case of P2P Money Transfers

Lavlin Agrawal, Pavankumar Mulgund, Raj Sharman

Source Title: Journal of Database Management (JDM) 35(1)

DOI: 10.4018/JDM.335888


Abstract

The adoption of empirical methods for secondary data analysis has surged in IS research. However, secondary data is often incomplete, skewed, or imbalanced. Consequently, there is growing recognition of the importance of the empirical techniques and methodological decisions used to navigate such issues. Yet methodological guidance remains scarce, especially in the form of a worked case study that demonstrates the challenges of imbalanced datasets and offers prescriptive guidance on how to deal with them. Using data on P2P money transfer services, this article presents a running example that analyzes the same dataset using several different methods. It then compares the outcomes of these choices and explicates the rationale behind decisions such as the inclusion and categorization of variables, parameter setting, and model selection. Finally, the article discusses techniques such as weighted logistic regression and propensity score matching, and when they should be used.

Introduction

With the growing availability of large volumes of public secondary data, the empirical analysis of such data has gained relevance and importance in information systems (IS) research (Black et al., 2020). Secondary data analysis also aligns well with the positivist research paradigm, which is the dominant research approach within the IS community (Burton-Jones & Lee, 2017). Furthermore, researchers are increasingly expected to draw on data from multiple sources in published work, making the use of secondary data even more relevant. Prior research has also highlighted several benefits of using secondary data, including (a) the reduction of bias that is sometimes introduced in qualitative approaches such as case studies (Choy, 2014); (b) the lack of intrusiveness that is associated with other methods, such as action research and interviews (Rabinovich & Cheon, 2011); (c) the absence of issues such as survey fatigue (Sinickas, 2007); and (d) the efficiency and cost-effectiveness of data procurement and use. With the emergence of reputable and highly credible secondary data sources and improved archival and management processes, the use of secondary data for empirical research is expected to grow even further (Black et al., 2020).

The use of secondary data has limitations, however. A significant one is the imbalanced nature of such data, particularly when a study explores certain demographic factors or rare events. A dataset is imbalanced when the categories for classification are disproportionately represented (Ramyachitra & Manikandan, 2014). For example, in the dataset chosen for this study, if the number of instances of one class (consumers adopting peer-to-peer [P2P] services) is much smaller or larger than the number of instances of the other class (consumers not adopting P2P services), the dataset is said to be imbalanced. Traditional data analysis approaches often fall short when applied to such skewed data, necessitating specialized empirical techniques and informed discretion on the part of researchers. Although the IS research community has increasingly recognized the problem of imbalanced datasets since the COVID-19 pandemic (Dorn et al., 2021), there is insufficient methodological guidance on dealing with highly skewed datasets.
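To make the definition of imbalance concrete, the short Python sketch below summarizes how disproportionately two classes are represented. The helper name `class_balance` and the label counts (50 adopters out of 1,000 respondents) are hypothetical and chosen only for illustration; they are not taken from the study's dataset. The sketch also reports the accuracy of a model that always predicts the majority class, one common way to see why conventional accuracy-based analysis can mislead on skewed data.

```python
from collections import Counter

def class_balance(labels):
    """Summarize how unevenly the classes in `labels` are represented."""
    counts = Counter(labels)
    n = len(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "counts": dict(counts),
        "shares": {cls: c / n for cls, c in counts.items()},
        # Ratio of majority-class to minority-class observations.
        "imbalance_ratio": majority / minority,
        # Accuracy of a trivial model that always predicts the majority class.
        "majority_baseline_accuracy": majority / n,
    }

# Hypothetical labels: 1 = adopts P2P services, 0 = does not adopt.
labels = [1] * 50 + [0] * 950
summary = class_balance(labels)

print("Class counts:", summary["counts"])
print("Imbalance ratio:", summary["imbalance_ratio"])
print("Majority-class baseline accuracy:", summary["majority_baseline_accuracy"])
```

In this hypothetical case, a model that never identifies a single adopter would still score a high accuracy, which is why class shares should be inspected before choosing a model or an evaluation metric.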

We endeavor to address this gap by presenting a worked example of an empirical analysis of a highly imbalanced dataset. Following prior exemplars that offer methodological guidelines (Gefen et al., 2000; Chua & Storey, 2016), we bring to the fore a series of salient decisions that researchers must make when dealing with imbalanced data, including the selection and categorization of variables, the choice of models, and the parameters to set. Furthermore, we demonstrate how different decisions made during empirical analysis lead to diverse findings. We explore the suitability and use of propensity score matching (PSM) (Rosenbaum & Rubin, 1983) and weighted logistic regression (WLR) (King & Zeng, 2001) for analyzing imbalanced data. We also compare the results of the two models and elaborate on when it is appropriate to choose one over the other.
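As a rough illustration of what weighting does in a logistic regression, the Python sketch below fits the same model twice on synthetic data, once with equal observation weights and once with inverse class-frequency weights. Everything here is an assumption made for illustration: the data are simulated (25 positives out of 500), the "balanced" weighting scheme n / (2 × n_class) is one common choice and not necessarily the scheme used in the article or in King and Zeng (2001), and the function names are hypothetical. The fit uses plain gradient descent to keep the example self-contained.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def balanced_class_weights(y):
    """Inverse-frequency weights n / (2 * n_class); an assumed scheme."""
    n, n_pos = len(y), sum(y)
    return {0: n / (2.0 * (n - n_pos)), 1: n / (2.0 * n_pos)}

def fit_logistic(x, y, obs_weights, lr=0.5, epochs=500):
    """Gradient descent on the weighted negative log-likelihood.
    Returns (intercept, slope) for a single predictor."""
    b0, b1 = 0.0, 0.0
    total = sum(obs_weights)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi, wi in zip(x, y, obs_weights):
            err = wi * (sigmoid(b0 + b1 * xi) - yi)
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / total
        b1 -= lr * g1 / total
    return b0, b1

def recall(x, y, b0, b1):
    """Share of true positives predicted positive at a 0.5 cutoff."""
    hits = sum(1 for xi, yi in zip(x, y)
               if yi == 1 and sigmoid(b0 + b1 * xi) >= 0.5)
    return hits / sum(y)

# Synthetic, imbalanced data: the 25 highest noisy scores are positives.
random.seed(0)
n = 500
x = [random.gauss(0.0, 1.0) for _ in range(n)]
score = [xi + random.gauss(0.0, 1.0) for xi in x]
cutoff = sorted(score, reverse=True)[24]
y = [1 if s >= cutoff else 0 for s in score]

weights = balanced_class_weights(y)
beta_unweighted = fit_logistic(x, y, [1.0] * n)
beta_weighted = fit_logistic(x, y, [weights[yi] for yi in y])

print("Class weights:", weights)
print("Unweighted (intercept, slope):", beta_unweighted)
print("Weighted (intercept, slope):", beta_weighted)
print("Recall unweighted:", recall(x, y, *beta_unweighted))
print("Recall weighted:", recall(x, y, *beta_weighted))
```

Up-weighting the rare class mainly raises the fitted intercept, so more minority-class cases cross the 0.5 cutoff. The predicted probabilities from the weighted fit no longer reflect the sample base rate, which is one reason the choice between weighting and alternatives such as matching depends on the research question.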

For illustrative purposes, we use secondary data consisting of responses to a survey on the use of P2P money transfer services, conducted by one of the top 25 banks in the northeast United States. The data are highly skewed, with only 5.4% of customers using bank-based P2P services. We use these responses to show empirically how methodological decisions affect outcomes. We structure the analysis around six research questions that are of interest to banks, focusing on demographic factors (age, gender, income, education, and employment status) and on trust, factors that can be harnessed for strategic business gain.
