Reliable Distributed Fuzzy Discretizer for Associative Classification of Big Data


Hepzi Jeya Pushparani, Nancy Jasmine Goldena
Copyright © 2022 | Pages: 13
DOI: 10.4018/IJIRR.289572

Abstract

Data mining is an essential task because the digital world creates huge volumes of data daily. Associative classification is a data mining task used to classify data according to the demands of knowledge users. Most associative classification algorithms cannot analyze big data, which is largely continuous in nature. This motivates both an analysis of existing discretization algorithms, which convert continuous data into discrete values, and the development of a novel discretizer, the Reliable Distributed Fuzzy Discretizer, for big data sets. Many discretizers suffer from over-splitting of partitions. The proposed method is implemented in a distributed fuzzy environment and avoids over-splitting of partitions by introducing a novel stopping criterion. The proposed discretization method is compared with an existing distributed fuzzy partitioning method and achieves good accuracy in the performance of associative classifiers.
Article Preview

Introduction

Every second, the world creates a large volume of data across different domains; an International Data Corporation (IDC) study forecasts that the global datasphere will grow from 33 zettabytes (ZB) in 2018 to 175 ZB by 2025. Volumes of data beyond the storage and processing capacities of conventional systems are known as big data (Minelli et al., 2013). Real-world data is categorical, numerical, or continuous and comes in various formats, and a key task is extracting information from it. Classification algorithms have been developed to meet this growing demand. The art of integrating frequent pattern mining and classification is known as associative classification (Abdelhamid et al., 2012; Baralis & Garza, 2012). Many studies have shown that associative classification has specific advantages over traditional classification approaches such as decision trees and rule induction (Wedyan, 2014). First, associations are extracted from the dataset using frequent pattern mining algorithms (Aggarwal et al., 2014), and then the classification rules are created. Most frequent pattern mining algorithms work only on categorical attributes, so in order to improve the speed and accuracy of an associative classifier, an efficient discretizer is required to discretize real-valued data.
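To make this two-phase idea concrete, the following minimal sketch mines class association rules from a toy categorical dataset: frequent itemsets are counted first, and those that predict a class label with sufficient confidence become classification rules. The dataset, thresholds, and helper names are illustrative assumptions, not the algorithm used in this work.

```python
# A minimal sketch of the two-phase associative classification idea:
# (1) mine frequent itemsets from categorical transactions, then
# (2) keep those that predict a class label with sufficient confidence.
# All data, thresholds, and names here are illustrative only.
from itertools import combinations
from collections import defaultdict

transactions = [
    ({"outlook=sunny", "wind=weak"}, "play"),
    ({"outlook=sunny", "wind=strong"}, "no-play"),
    ({"outlook=rain", "wind=weak"}, "play"),
    ({"outlook=rain", "wind=strong"}, "no-play"),
]

MIN_SUPPORT = 0.25    # fraction of transactions containing the itemset
MIN_CONFIDENCE = 0.9  # P(class | itemset)

def mine_class_association_rules(data, min_sup, min_conf):
    n = len(data)
    itemset_count = defaultdict(int)
    itemset_class_count = defaultdict(lambda: defaultdict(int))
    # Phase 1: count candidate itemsets (sizes 1 and 2 for brevity).
    for items, label in data:
        for size in (1, 2):
            for combo in combinations(sorted(items), size):
                itemset_count[combo] += 1
                itemset_class_count[combo][label] += 1
    # Phase 2: turn frequent itemsets into classification rules.
    rules = []
    for itemset, count in itemset_count.items():
        if count / n < min_sup:
            continue
        for label, c in itemset_class_count[itemset].items():
            confidence = c / count
            if confidence >= min_conf:
                rules.append((itemset, label, count / n, confidence))
    return rules

for antecedent, label, sup, conf in mine_class_association_rules(
        transactions, MIN_SUPPORT, MIN_CONFIDENCE):
    print(f"{set(antecedent)} -> {label} (support={sup:.2f}, confidence={conf:.2f})")
```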

Discretization is a data preprocessing task that transforms continuous features into discrete ones, helping to enhance learning performance. Most data mining algorithms work on discrete values, so discretization is carried out prior to classification. Supervised discretization methods use class information to set partition boundaries, while unsupervised methods pick cut points without using class labels. Entropy-based discretization uses class information to compute and evaluate split points; it is therefore supervised and proceeds top-down. Association rule learners prefer multivariate discretization, which can capture interdependencies between attributes, whereas univariate discretization discretizes each attribute in isolation and tends to produce unsatisfactory association rules (Ishibuchi et al., 2001).
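As a concrete illustration of entropy-based, supervised, top-down cut-point selection, the sketch below scores each candidate boundary of a single attribute by the class-entropy reduction (information gain) it yields. The data and helper names are illustrative assumptions.

```python
# A minimal sketch of entropy-based (supervised, top-down) cut-point selection:
# each candidate boundary is scored by the reduction in class entropy it produces.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return the boundary that maximizes information gain for one attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

# Illustrative data: the classes separate cleanly around 2.55.
values = [1.2, 1.5, 2.0, 3.1, 3.4, 4.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_cut_point(values, labels))  # expected cut near 2.55
```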

Discretization with fuzzy sets is known as fuzzy discretization, which resolves the soft boundary problem. Fuzzy discretization first discretizes quantitative attribute values into intervals (Ishibuchi et al., 2001). Each cut point is associated with a membership function, which determines the degree to which each attribute value belongs to each interval. In fuzzy discretization, a value can therefore be assigned to more than one interval at the same time, with varying degrees of membership. The fuzzy discretization process has the following steps: (1) cut points are identified, (2) partitions are created based on the cut points, and (3) attribute values are converted into fuzzified values using a triangular membership function.
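The sketch below illustrates these three steps for a single attribute: cut points, together with the domain bounds, define overlapping triangular fuzzy sets, and each raw value is mapped to a membership degree in every set. The cut points and the example value are illustrative assumptions, not those produced by the proposed discretizer.

```python
# A minimal sketch of fuzzification with triangular membership functions.
# The partition peaks (domain bounds plus cut points) are assumptions for
# illustration; a real discretizer would learn the cut points from data.
def fuzzify(value, peaks):
    """Membership degree of `value` in each triangular fuzzy set.

    `peaks` are the partition centres: the domain bounds plus the cut points.
    Neighbouring peaks serve as the feet of each triangle, so the degrees of
    adjacent sets always sum to 1 (a strong fuzzy partition).
    """
    degrees = []
    for i, peak in enumerate(peaks):
        left = peaks[i - 1] if i > 0 else peak               # lower shoulder
        right = peaks[i + 1] if i < len(peaks) - 1 else peak  # upper shoulder
        if value <= left:
            degrees.append(1.0 if i == 0 else 0.0)
        elif value >= right:
            degrees.append(1.0 if i == len(peaks) - 1 else 0.0)
        elif value <= peak:
            degrees.append((value - left) / (peak - left))
        else:
            degrees.append((right - value) / (right - peak))
    return degrees

# Example: cut points 3.0 and 6.0 plus domain bounds 0.0 and 9.0 give four
# triangular sets; the value 4.0 belongs partly to the set peaked at 3.0 and
# partly to the set peaked at 6.0.
print(fuzzify(4.0, [0.0, 3.0, 6.0, 9.0]))  # [0.0, 0.667, 0.333, 0.0]
```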

Classical data preprocessing techniques do not scale well when managing large volumes of data. To deal with big data, scalable distributed techniques have been developed. The first distributed programming techniques to tackle this problem were MapReduce (Dean & Ghemawat, 2004) and its open-source implementation, Apache Hadoop. Apache Spark (Karau et al., 2015) is a fast, in-memory engine for large-scale data processing. Thanks to this capability, many machine learning (ML) workloads can be sped up, which has made Spark especially popular among researchers and business experts in machine learning. Our main objective is to show that the proposed discretization algorithm, the Reliable Distributed Fuzzy Discretizer (RDFD), can be parallelized in these frameworks, providing strong discretization solutions for big data analytics. An efficient discretizer yields good classification accuracy in association rule mining. To demonstrate the effectiveness of the proposed discretizer, RDFD is compared with the distributed MDLP discretizer.
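As an illustration of how a discretizer can be parallelized on Spark, the PySpark sketch below groups the values of each attribute and derives simple equal-frequency candidate cut points per attribute in parallel. It is only a generic pattern under assumed attribute names and bin counts; it is not the proposed RDFD algorithm or the distributed MDLP discretizer.

```python
# A minimal PySpark sketch of a distributed discretization pattern:
# (attribute, value) records are grouped by attribute and candidate cut
# points are computed per attribute in parallel. Attribute names, values,
# and the bin count are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("discretizer-sketch").getOrCreate()

# Toy continuous data distributed across workers.
rows = [("age", 23.0), ("age", 31.0), ("age", 47.0), ("age", 52.0),
        ("income", 1800.0), ("income", 2500.0), ("income", 4100.0)]
rdd = spark.sparkContext.parallelize(rows)

def candidate_cuts(values, n_bins=2):
    """Equal-frequency candidate cut points for one attribute (runs per key)."""
    vals = sorted(values)
    step = max(1, len(vals) // n_bins)
    return [(vals[i - 1] + vals[i]) / 2 for i in range(step, len(vals), step)]

# Group the values of each attribute and derive its cut points in parallel.
cuts = (rdd.groupByKey()
           .mapValues(lambda vals: candidate_cuts(list(vals)))
           .collectAsMap())
print(cuts)  # e.g. {'age': [39.0], 'income': [2150.0, 3300.0]}

spark.stop()
```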
