MapReduce Implementation of a Multinomial and Mixed Naive Bayes Classifier

Sikha Bagui (The University of West Florida, USA), Keerthi Devulapalli (University of West Florida, USA) and Sharon John (University of West Florida, USA)
Copyright: © 2020 |Pages: 23
DOI: 10.4018/IJIIT.2020040101

Abstract

This study presents an efficient way to deal with discrete as well as continuous values in Big Data in a parallel Naïve Bayes implementation on Hadoop's MapReduce environment. Two approaches were taken: (i) discretizing continuous values using a binning method; and (ii) using a multinomial distribution for probability estimation of discrete values and a Gaussian distribution for probability estimation of continuous values. The models were analyzed and compared for performance with respect to run time and classification accuracy for varying data sizes, data block sizes, and map memory sizes.

Introduction

Naïve Bayes (NB) classification is robust and effective in practice, and over the years this classifier has found a wide variety of applications in complex domains such as text classification (Korpipaa et al., 2003; Yuan et al., 2012), document classification (Viegas et al., 2015), and sentiment analysis (Dei et al., 2007; Narayan et al., 2013). This probability-based classification model is easy to use with discrete values. In today's Big Data era, however, more and more data contains continuous values, and handling them has become a major challenge. The first implementation goal of the NB classifier in a Big Data environment is, of course, to use a parallel processing platform, and fortunately NB classifiers are naturally amenable to parallelization: the probability counts for each attribute can be estimated independently of one another.
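The per-attribute independence that makes NB training parallelizable can be sketched in MapReduce terms: the map step emits one (class, attribute, value) key per attribute of each record, and the reduce step sums the counts per key. The records and attribute names below are hypothetical toy data, not from the paper; this is a minimal single-process sketch of the counting pattern, not Hadoop code.

```python
from collections import Counter

# Hypothetical toy records: (class_label, {attribute: value}).
records = [
    ("spam", {"word": "free",    "caps": "yes"}),
    ("spam", {"word": "win",     "caps": "yes"}),
    ("ham",  {"word": "meeting", "caps": "no"}),
]

def map_counts(record):
    """Map step: emit one ((class, attribute, value), 1) pair per attribute.
    Each attribute's counts depend only on its own column, so the emitted
    keys can be counted independently and in parallel."""
    label, attrs = record
    for attr, value in attrs.items():
        yield (label, attr, value), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (class, attribute, value) key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals

counts = reduce_counts(kv for r in records for kv in map_counts(r))
# e.g. counts[("spam", "caps", "yes")] == 2
```

In a real Hadoop job the shuffle phase would route each key to a reducer; here a single Counter plays that role.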

In Hadoop’s MapReduce environment, the NB classification model for discrete values can be implemented with two MapReduce jobs: one for constructing the model (learning phase) and a second for applying the model to unlabeled data (test phase). The parallel NB implementation is not as straightforward for continuous values. For continuous values, there are two options: (i) discretizing the continuous values, or (ii) using the Gaussian distribution probability density function. In this paper, the first method is referred to as the Discrete/Multinomial NB model and the second method is referred to as the Mixed NB model:

  1. Discretizing the continuous values and then building the NB model with the discrete values. This option requires an extra pre-processing step of discretizing the continuous values. This extra step could be time consuming and resource intensive for Big Data. On the MapReduce platform, chained MapReduce jobs become resource intensive since the output of every MapReduce job has to be stored on Hadoop’s Distributed File System (HDFS). Depending on the chosen discretization algorithm, the discretization process can take one or more MapReduce jobs;

  2. Using the Gaussian distribution probability density function for continuous values. This model, referred to as the Mixed NB model, handles both continuous and discrete values. For discrete values, the Multinomial distribution is used for probability estimation, and for continuous values, the Gaussian distribution is used. This method does not require a pre-processing step: the NB model can be built in one MapReduce job and applied for classification in another. Hence this model requires fewer MapReduce jobs than the Discrete NB model.
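The Mixed NB estimation described in option 2 can be sketched as follows: the learning phase collects multinomial counts for discrete attributes and a per-class mean and variance for continuous attributes, and the test phase multiplies the resulting likelihoods together with the class prior. The data, attribute names, and Laplace-smoothing choice below are illustrative assumptions, not taken from the paper's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training rows: (class, discrete attrs, continuous attrs).
data = [
    ("yes", {"outlook": "sunny"}, {"temp": 30.0}),
    ("yes", {"outlook": "sunny"}, {"temp": 28.0}),
    ("no",  {"outlook": "rain"},  {"temp": 18.0}),
    ("no",  {"outlook": "rain"},  {"temp": 20.0}),
]

def train(rows):
    """Learning phase: multinomial counts for discrete attributes,
    per-class (mean, variance) for continuous attributes."""
    priors = Counter()
    disc = defaultdict(Counter)   # (class, attr) -> Counter of values
    cont = defaultdict(list)      # (class, attr) -> list of values
    for label, d, c in rows:
        priors[label] += 1
        for a, v in d.items():
            disc[(label, a)][v] += 1
        for a, v in c.items():
            cont[(label, a)].append(v)
    stats = {}
    for key, vals in cont.items():
        mean = sum(vals) / len(vals)
        var = sum((x - mean) ** 2 for x in vals) / len(vals)
        stats[key] = (mean, var)
    return priors, disc, stats

def gaussian(x, mean, var):
    """Gaussian probability density function."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(priors, disc, stats, d, c):
    """Test phase: combine multinomial and Gaussian likelihoods per class."""
    total = sum(priors.values())
    best, best_p = None, -1.0
    for label, n in priors.items():
        p = n / total
        for a, v in d.items():
            cnt = disc[(label, a)]
            # Laplace smoothing so unseen values do not zero out the product.
            p *= (cnt[v] + 1) / (sum(cnt.values()) + len(cnt) + 1)
        for a, x in c.items():
            mean, var = stats[(label, a)]
            p *= gaussian(x, mean, var)
        if p > best_p:
            best, best_p = label, p
    return best

model = train(data)
# classify(*model, {"outlook": "sunny"}, {"temp": 29.0}) -> "yes"
```

Because every likelihood term is a per-attribute count or statistic, the `train` aggregation maps directly onto one MapReduce job and `classify` onto a second, as the paper describes.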

The main contribution of this study is an efficient way to deal with both discrete and continuous values in Big Data. The paper implements the parallel versions of both the Discrete and Mixed NB models on Hadoop’s MapReduce platform and compares their performance on a very large dataset containing a mixture of continuous and discrete values.

The organization of the paper is as follows. Background on the Naïve Bayes model is presented in section 2, followed by the theory behind the two event models in its subsections. Section 3 presents a review of the literature. Implementation details are discussed in section 4. Section 5 presents the experimental setup, results, and a discussion of the results, and section 6 presents the conclusions of the study.


With respect to parallel implementations of the NB algorithm, He et al. (2010) and Zhon et al. (2012) implemented the NB classifier in a parallelized environment to handle large categorical datasets, achieving significant efficiency gains over serial NB algorithms.
