The Impact of Data-Complexity and Team Characteristics on Performance in the Classification Model: Findings From a Collaborative Platform

Vitara Pungpapong, Prasert Kanawattanachai
Copyright: © 2022 |Pages: 16
DOI: 10.4018/IJBAN.288517

Abstract

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects: standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the level of data complexity, the lower the predictive ability of the machine learning model. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the model’s predictive ability: the imbalance ratio adversely affected predictive performance once it reached a certain level. However, mixed results were found for the impact of team-specific characteristics, measured by team size, team expertise, and the number of submissions, on team performance. For high-performing teams, these factors had no impact on team score.

Introduction

In the digital age, more and more businesses rely on machine learning algorithms to analyze an ocean of data. Undoubtedly, the field of data science has received wide attention from both practitioners and researchers over the past decade (Davenport & Patil, 2012; Jordan & Mitchell, 2015; Krittanawong, 2018). Organizations have been trying hard to maximize the use of data to gain competitive advantage in the big data era, and data analytics and machine learning have become essential in the business world. When planning a machine learning project, firms often focus on investing in technological infrastructure and human capital. Recognizing what determines the difficulty of a machine learning project will also help organizations evaluate both the technical and financial feasibility of such an undertaking.

Although machine learning algorithms have been rapidly enhanced and developed to improve their predictive abilities (Gilliland, 2020), little attention has been given thus far to the characteristics of the data themselves. For example, imbalanced data sets, in which one class outnumbers the other, are commonly found in practice. Several studies have shown that class imbalance leads to biased classification and thus impacts the accuracy of the model. Specifically, a classifier tends to favor the majority class, which can lead to a high number of misclassifications of the minority class (Bekkar, Djemaa, & Alitouche, 2013; Chicco & Jurman, 2020). Prior studies have shown that dataset characteristics are important factors in determining a classification algorithm’s performance (Kwon & Sim, 2013; Luengo & Herrera, 2015; Oreški & Begičević Ređep, 2018; Sánchez, Mollineda, & Sotoca, 2007). However, the measurements of data complexity used in most previous studies have been relatively simple and limited. Ho and Basu (2002) and Lorena, Garcia, Lehmann, Souto, and Ho (2019) introduced a comprehensive set of geometric and statistical measures, extracted from a training dataset, to characterize the difficulty of a classification problem. Although the work of Ho and Basu (2002) and Lorena et al. (2019) has great theoretical value, several of the proposed geometric measures rely on the results of specific classifiers or algorithms, which imposes a practical limitation when dealing with big and complex data: it adds another layer of analysis, since some of these algorithms require hyperparameter tuning. Moreover, although several measures have been proposed to capture the complexity of data, few have been empirically tested with real-world datasets.
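To make two of the measures discussed above concrete, the sketch below computes a class imbalance ratio and a single-feature Fisher discriminant ratio (one member of the family of geometric measures introduced by Ho and Basu, 2002) for a toy binary dataset. This is a minimal illustration, not the article's own measurement procedure; the function names and the toy data are our own.

```python
import numpy as np

def imbalance_ratio(y):
    """Ratio of the majority-class count to the minority-class count
    for a binary target (1.0 = perfectly balanced)."""
    counts = np.bincount(np.asarray(y, dtype=int))
    return counts.max() / counts.min()

def fisher_discriminant_ratio(x, y):
    """Fisher discriminant ratio for a single feature: squared distance
    between the class means over the sum of the class variances.
    Higher values indicate a feature that separates the classes more easily."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=int)
    x0, x1 = x[y == 0], x[y == 1]
    return (x0.mean() - x1.mean()) ** 2 / (x0.var() + x1.var())

# Toy dataset: six majority-class and two minority-class examples,
# with the minority class shifted well away from the majority.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
x = np.array([0.0, 0.5, 1.0, 1.5, 0.5, 1.0, 4.0, 5.0])

print(imbalance_ratio(y))                 # -> 3.0
print(fisher_discriminant_ratio(x, y))
```

The full f1 measure of Ho and Basu takes the maximum of this per-feature ratio across all features; computing it feature by feature, as here, avoids fitting any classifier and so sidesteps the hyperparameter-tuning issue noted above.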

With the advances in information technology and the high demand for data scientists, crowdsourced machine learning competitions have served as playgrounds for data scientists to gain hands-on experience with real-world problems, while the companies sponsoring the competitions benefit from their solutions (Bojer & Meldgaard, 2020; Stallkamp, Schlipsing, Salmen, & Igel, 2011). As today’s machine learning competitions become more complex, they encourage talent from various areas of expertise worldwide to collaborate in self-organized virtual teams, competing for rewards and building professional profiles. Given the self-organized nature of these teams, members’ backgrounds and team sizes can vary considerably. Furthermore, working in teams has been shown to increase task effectiveness and efficiency (Mysirlaki & Paraskeva, 2019; Sundstrom, De Meuse, & Futrell, 1990). Two common team characteristics, (1) team intellectual capital and (2) team size (Haeussler & Sauermann, 2020; Mao, Mason, Suri, & Watts, 2016; Rasch & Tosi, 1992; Rodríguez, Sicilia, García, & Harrison, 2012), have been conceptually identified and empirically tested as important in problem-solving tasks. However, studies on the influence of virtual teams on performance in crowdsourcing contests remain limited.
