Performance Assessment of Learning Algorithms on Multi-Domain Data Sets

Amit Kumar (Computer Science and Engineering, Birla Institute of Technology, Ranchi, India) and Bikash Kanti Sarkar (Computer Science and Engineering, Birla Institute of Technology, Ranchi, India)
Copyright: © 2018 | Pages: 15
DOI: 10.4018/IJKDB.2018010103

Abstract

Over the last few decades, data mining research has made significant progress across a wide spectrum of applications. Prediction on multi-domain data sets remains a challenging task because such data sets are often imbalanced, voluminous, conflicting, and complex. Learning algorithms are the principal technique for addressing these problems and are widely used for classification. However, choosing the learners that perform best on data sets from a particular domain is itself a challenging task in data mining. This article provides a comparative performance assessment of various state-of-the-art learning algorithms over multi-domain data sets in order to identify the effective classifier(s) for a particular domain, e.g., artificial, natural, or semi-natural. A total of 14 real-world data sets are selected from the University of California, Irvine (UCI) machine learning repository, and experiments are conducted using three competent individual learners and their hybrid combinations.

Introduction

Data mining (Klösgen & Żytkow, 2002) is a process for designing intelligent models that identify useful patterns. In particular, knowledge discovery from databases (Han & Kamber, 2007; Fayyad, Piatetsky-Shapiro, & Smyth, 1996) is concerned with finding useful patterns or meaning in raw data.

Recent trends in data mining research have improved everyday life, and data consumption in day-to-day life is also increasing. As a result, several issues have been identified for business desktops, data miners, and even end users: the continuous growth of data warehouses, the need for intelligent data analytic tools, and the need for flexibility in handling large volumes of data. The number of human data analysts is very small compared to the amount of stored data. Hence, (semi-)automatic methods are clearly needed for extracting knowledge from data marts, and a number of data mining methods have been proposed for this purpose. In particular, knowledge extraction from data warehouses has become one of the most widely used ways of obtaining valuable knowledge. Before a learned model is deployed on real-world data sets, it must be constructed using a training set and then applied to test sets. Several strategies may be followed to estimate the performance of learning methods. In supervised learning, the data (observations) are labeled with predefined classes, and these labels guide the learner much as a teacher would. Classification is therefore a goal of supervised learning, in which a model is derived from past examples. For better understanding, a block diagram is shown in Figure 1.

On the other hand, in unsupervised machine learning, the class labels of the data are unknown. The primary job of such learning is to label the unlabeled examples. Apart from these two kinds of learning, we may also consider reinforcement learning. In reinforcement learning, a sequence of actions is given that eventually results in an outcome that may be either good (desired output) or bad (undesired output). If the outcome is good, the entire sequence of actions leading up to it is reinforced; if it is bad, the actions are penalized.

Figure 1. A schematic of the classification model
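The train-then-test workflow described above can be illustrated with a small sketch. It is not the authors' experimental setup: the toy data, the function names (`holdout_split`, `train_naive_bayes`, `predict`, `accuracy`), the 30% test fraction, and the use of Laplace smoothing are all assumptions made for illustration. The sketch builds a simple categorical naive Bayes model on a training set and measures its accuracy on a held-out test set.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy data (not from the article): the label depends on the first feature.
ROWS = [("sunny", "hot"), ("sunny", "mild"), ("sunny", "cool"),
        ("rainy", "hot"), ("rainy", "mild"), ("rainy", "cool")]
LABELS = ["yes", "yes", "yes", "no", "no", "no"]


def holdout_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the indices and split them into a training part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


def train_naive_bayes(rows, labels):
    """Collect the counts needed for a categorical naive Bayes classifier."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest log-posterior score for one example."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace (add-one) smoothing avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, rows, labels):
    """Fraction of examples whose predicted class matches the true class."""
    correct = sum(predict(model, r) == y for r, y in zip(rows, labels))
    return correct / len(labels)


(train_rows, train_labels), (test_rows, test_labels) = holdout_split(ROWS, LABELS)
model = train_naive_bayes(train_rows, train_labels)
print("test accuracy:", accuracy(model, test_rows, test_labels))
```

A single hold-out split is only one of the estimation strategies mentioned above; on small data sets, repeated splits or cross-validation would usually give a more stable estimate.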

Ultimately, the goal of any mining strategy is to discover knowledge from a database in order to predict unseen data accurately. So far, a number of algorithms have been adopted in various disciplines for inducing rules. Some of the most widely used data mining tools, namely decision trees (Quinlan, 1992), RIPPER (Cohen, 1995), and naïve Bayes (Rish, 2001), are considered in this study. The primary strength of such tools is that they are data driven, nonparametric, and less restrictive with respect to a priori hypotheses. However, they suffer from several issues, namely domain specificity, class imbalance, operation on voluminous data, conflicting data, and missing or noisy data. These issues have attracted considerable attention from researchers, both past and present. A brief category-wise review of some competent classifiers is presented in the BACKGROUND section. The paper is organized as follows. The INTRODUCTION section discusses the significance of data mining algorithms and current trends in the context of classification as decision support systems. The BACKGROUND section introduces three individual learning classifiers, namely C4.5, RIPPER, and naïve Bayes. It also describes three hybrid models formed by combining these individual learners: (C4.5 + RIPPER), (C4.5 + naïve Bayes), and (RIPPER + naïve Bayes). The experiments and their results are analyzed in the EXPERIMENTAL RESULTS AND ANALYSIS section. Finally, the conclusions are summarized in the CONCLUSION section.

Background

A brief background on the individual learners adopted in the present study is given below.
