1. Introduction
The current era is experiencing a data explosion everywhere and in every form, termed Big Data (Hu, Wen, Chua, et al., 2014). Data mining over such versatile big data requires addressing challenges at the data, model, and system levels and has become a very compelling task (Tsai, Lai, Chao, et al., 2015; Wu, Zhu, Wu, et al., 2014). Big data is widely used in prediction-based systems such as short-term load forecasting (Zhang, Cheng, Liu, et al., 2014), traumatic brain injury survival rate prediction (Rodger, 2015), and prediction over noisy big data (Yang & Fong, 2012). Machine learning techniques are proving highly efficient in such domains when modified and adapted to the MapReduce framework (Bechini, Marcelloni & Segatori, 2016; Hochbaum & Baumann, 2014). Classification, or supervised machine learning, has also proved applicable to uncertainty reduction in big data (Wang, He, Chow, et al., 2015) and in fuzzy systems (Fernández, Carmona, Jesus, et al., 2016; He, Wang, Zhuang, et al., 2015). Whereas classification algorithms typically require all data in the same format and on the same machine (Hochbaum & Baumann, 2014), Petuum, a platform for machine learning, is capable of handling big data in a distributed manner (Xing, Ho, Dai, et al., 2015), though it is immature compared to Spark and Hadoop. Apache Spark and Mahout are very popular tools that use the machine learning library MLlib to address big data problems (Landset, Khoshgoftaar, Richter, et al., 2015). Researchers have also surveyed the current state of the art of machine learning for sustainable data modeling in big data (Al-Jarrah, Yoo, Muhaidat, et al., 2015).
Usually, different classifiers learn according to their pre-defined algorithmic formulation and concept, but some external factors also affect the learning process. One such factor is class distribution, the proportion of instances of each class in a dataset (Galar, Fernández, Barrenechea, et al., 2016). When this distribution is not balanced, the dataset is termed an imbalanced dataset, and learning performed on such data is known as imbalanced classification. Classifier learning from imbalanced datasets has become a hot research topic in the big data mining discipline.
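As a minimal sketch of the class distribution notion above, the proportion of each class can be computed directly from the label vector; the labels below are hypothetical, chosen only to illustrate a 95:5 imbalance:

```python
from collections import Counter

# Hypothetical binary labels: 0 = majority class, 1 = minority class
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
total = len(labels)

# Class distribution: proportion of instances of each class in the dataset
distribution = {cls: n / total for cls, n in counts.items()}
print(distribution)  # {0: 0.95, 1: 0.05}
```

When one of these proportions is much smaller than the other, the dataset is imbalanced in the sense defined above.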
The problem of an imbalanced dataset occurs when instances of one class, which is of main interest in the application field, are under-represented compared to the other class. The Imbalance Ratio (IR) (López, Fernández, García, et al., 2013) is used to quantify the extent of imbalance in a dataset. Classifiers normally tend to ignore the minority class samples, treating them as outliers or noise, and the whole classification process loses its meaning and usefulness in such a case. Consider an example from medical diagnosis (Ganganwar, 2012), where the inputs are various patient parameters and the task is to predict whether each patient is suffering from cancer. Assume there are 10000 non-cancer patients and 10 cancer patients, and two classifiers are learned for this problem. Classifier 1 classifies 7 out of 10 cancer patients as fit and 10 out of 10000 other patients as cancer patients. Classifier 2, on the other hand, classifies 2 out of 10 cancer patients as fit and 100 out of 10000 other patients as cancer patients. Judged by the total number of misclassifications, classifier 1 is better than classifier 2: it makes 17 mistakes while classifier 2 makes 102. Focusing on correct identification of cancer patients, however, classifier 2 performs better, catching 8 of the 10 cases versus only 3. Consequently, in applications where correct classification of cancer patients is crucial, any accuracy-driven algorithm will still pick classifier 1 over classifier 2, which is a challenging problem.
The class imbalance problem becomes even more challenging when ordinary datasets expand exponentially into big data. To address class imbalance in the context of big data, traditional machine learning algorithms and classifiers need to be adapted to new big data technologies so that an efficient mechanism for classifier learning can be obtained.