Empirical Evaluation of Map Reduce Based Hybrid Approach for Problem of Imbalanced Classification in Big Data

Khyati Ahlawat (IGDTUW, Delhi, India), Anuradha Chug (GGSIPU, Delhi, India) and Amit Prakash Singh (GGSIPU, Delhi, India)
Copyright: © 2019 | Pages: 23
DOI: 10.4018/IJGHPC.2019070102


Imbalanced datasets are those with an uneven distribution of classes, which degrades classifier performance. In this paper, the SVM classifier is combined with the K-Means clustering approach, and a hybrid approach, Hy_SVM_KM, is introduced. The performance of the proposed method is empirically evaluated using the Accuracy and FN Rate measures and compared with existing methods such as SMOTE. The results show that the proposed hybrid technique outperforms the traditional machine learning classifier SVM on most datasets and performs better than the well-known pre-processing technique SMOTE on all datasets. The goal of this article is to extend the capabilities of popular machine learning algorithms and adapt them to meet the challenges of imbalanced big data classification. This article can serve as a baseline study for future research on imbalanced big data classification; it provides an efficient mechanism for handling imbalanced big datasets with a modified SVM classifier and improves the overall performance of the model.

1. Introduction

The present era is experiencing a data explosion everywhere and in every form, a trend termed Big Data (Hu, Wen, Chua, et al., 2014). Data mining with versatile big data requires exploring challenges at the data, model and system levels and has become a very compelling task (Tsai, Lai, Chao, et al., 2015; Wu, Zhu, Wu, et al., 2014). Big data is widely used in prediction-based systems such as short-term load forecasting (Zhang, Cheng, Liu, et al., 2014) and traumatic brain injury survival rate prediction (Rodger, 2015), as well as in settings with noisy big data (Yang & Fong, 2012). Machine learning techniques are proving highly efficient in such domains when they are modified and adapted to the Map Reduce framework (Bechini, Marcelloni & Segatori, 2016; Hochbaum & Baumann, 2014). Classification, or supervised machine learning, has also proved applicable to uncertainty reduction in big data (Wang, He, Chow, et al., 2015) and in fuzzy systems (Fernández, Carmona, Jesus, et al., 2016; He, Wang, Zhuang, et al., 2015). Whereas classification algorithms typically require all data to be in the same format and on the same machine (Hochbaum & Baumann, 2014), Petuum, a platform for machine learning, is capable of handling big data in a distributed manner (Xing, Ho, Dai, et al., 2015), though it is immature compared with Spark and Hadoop. Apache Spark, with its Machine Learning Library (MLlib), and Apache Mahout are very popular tools for addressing big data problems (Landset, Khoshgoftaar, Richter, et al., 2015). Researchers have also studied the current state of the art of machine learning for sustainable data modeling in big data (Al-Jarrah, Yoo, Muhaidat, et al., 2015).

Classifiers usually learn according to their predefined algorithmic formulation, but some external factors also affect the learning process. One such factor is class distribution, the proportion of instances of each class in a dataset (Galar, Fernández, Barrenechea, et al., 2016). When this distribution is not balanced, the dataset is termed imbalanced, and learning performed on such a dataset is known as imbalanced classification. Classifier learning from imbalanced datasets is becoming a hot research topic in the big data mining discipline.

The problem of an imbalanced dataset occurs when instances of one class, the class of main interest for the application, are under-represented compared with the other class. The Imbalance Ratio (IR) (López, Fernández, García, et al., 2013) is used to quantify the extent of imbalance in a dataset. Classifiers normally tend to ignore minority class samples, treating them as outliers or noise, and the whole classification process loses its meaning in such cases. Consider an example from medical diagnosis (Ganganwar, 2012), where the inputs are various patient parameters from which it is predicted whether a patient is suffering from cancer. Assume there are 10000 non-cancer patients and 10 cancer patients, and that two classifiers are learned for this problem. Classifier 1 classifies 7 of the 10 cancer patients as fit and 10 of the 10000 other patients as cancer patients. Classifier 2 classifies 2 of the 10 cancer patients as fit and 100 of the 10000 other patients as cancer patients. Judged by the total number of misclassifications, classifier 1 is better than classifier 2, since the first makes 17 mistakes and the second makes 102. However, focusing on the classification of cancer patients, classifier 2 performs better, missing only 2 cancer patients rather than 7. Consequently, even in applications where correct classification of cancer patients is crucial, an algorithm that minimizes the overall number of errors will pick classifier 1 over classifier 2, which is a challenging problem.
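The arithmetic of the medical diagnosis example can be checked with a short Python sketch. It is illustrative only and is not taken from the article: the class and helper names (`ConfusionCounts`, `from_mistakes`) are hypothetical, cancer is treated as the positive class, and IR is assumed to follow the common definition of majority-class count divided by minority-class count.

```python
from dataclasses import dataclass

CANCER = 10          # minority (positive) class size from the example
NON_CANCER = 10_000  # majority (negative) class size from the example


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts with cancer as the positive class."""
    tp: int  # cancer patients correctly flagged
    fn: int  # cancer patients classified as fit
    fp: int  # non-cancer patients classified as cancer
    tn: int  # non-cancer patients correctly classified

    @property
    def errors(self) -> int:
        return self.fn + self.fp

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total

    @property
    def fn_rate(self) -> float:
        # Share of actual cancer patients that the classifier missed.
        return self.fn / (self.tp + self.fn)


def from_mistakes(missed_cancer: int, false_alarms: int) -> ConfusionCounts:
    """Build the counts from the two kinds of mistakes given in the text."""
    return ConfusionCounts(
        tp=CANCER - missed_cancer,
        fn=missed_cancer,
        fp=false_alarms,
        tn=NON_CANCER - false_alarms,
    )


# Assumed definition: majority count divided by minority count.
imbalance_ratio = NON_CANCER / CANCER

clf1 = from_mistakes(missed_cancer=7, false_alarms=10)
clf2 = from_mistakes(missed_cancer=2, false_alarms=100)

for name, clf in (("classifier 1", clf1), ("classifier 2", clf2)):
    print(f"{name}: errors={clf.errors}, "
          f"accuracy={clf.accuracy:.4f}, FN rate={clf.fn_rate:.2f}")
print(f"imbalance ratio: {imbalance_ratio:.0f}")
```

Under these assumptions, classifier 1 has the higher accuracy (about 0.9983 against 0.9898) yet misses 70% of the cancer patients, while classifier 2 misses 20%. This is why accuracy alone is a misleading measure at an imbalance ratio of 1000.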

The class imbalance problem becomes more challenging when ordinary data expands exponentially into big data. To address the class imbalance problem in the context of big data, traditional machine learning algorithms and classifiers need to be adapted to new big data technologies so that an efficient mechanism for classifier learning can be obtained.
