PaaS Optimization of Apache Applications Using System Parameter Tuning of Big Data Platforms in Distributed Computing

Tanuja Pattanshetti, Vahida Attar
Copyright: © 2020 | Pages: 16
DOI: 10.4018/IJDST.2020100102

Abstract

Widely used data processing platforms rely on distributed systems to process large volumes of data efficiently. The aim of this article is to optimize the platform services by tuning only the relevant, tunable system parameters and to identify the relation between the software quality metrics. The system parameters of data platforms can be defined and customized based on the service level agreements. In the first stage, the most significant parameters are identified and shortlisted using various feature selection approaches. In the second stage, applications are run iteratively to tune these shortlisted parameters, to identify their optimal values, and to understand the impact of individual input parameters on the system output parameter. The empirical results indicate a significant improvement in performance, which suggests that the proposed work can be used to optimize the services offered by these data platforms.

1. Introduction

Data characterized by high velocity, volume and variety is termed “Big Data”. The expansion of the digital world is further accelerating the rate at which this data is generated. Gantz and Reinsel (2012) note that, as per the IDC forecast, the digital universe will within another year reach a size 300 times what it was around a decade ago. IBM Big Data and Analytics Hub has presented the fourth V, “Veracity”, which concerns the uncertainty of information (IBM Big Data & Analytics Hub, “The Four V's of Big Data,” n.d.). With the rise of such data, conventional methods fail to meet the demands of real-world applications. Cloud computing technology, which makes use of distributed architecture, has thus emerged as one of the prime innovations fulfilling the needs of execution, effectiveness and accessibility. Apache-based data processing platforms like Hadoop (Apache Hadoop, n.d.), Spark (Apache Spark, n.d.) and Storm (Apache Storm, n.d.) are now widely used for processing big data.

The Hadoop data platform performs batch processing using HDFS (Hadoop Distributed File System) and the MapReduce framework (Map and Reduce functions). In a distributed environment of large clustered systems, HDFS is used for data storage and the data is processed using the MapReduce framework (Apache Hadoop, n.d.). The Spark data platform is an open source cluster-based framework that provides faster data processing than Hadoop owing to its resilient distributed dataset architecture (Apache Spark, n.d.). Storm is a real-time distributed computation system used as a stream processing solution in large clusters (Apache Storm, n.d.).

These widely used data platforms have hundreds of configurable system parameters, which are normally left at certain default values. The purpose of the research work carried out here is to assess the role and impact of these system parameters, and of the values to which they are tuned, in completing a given job with improved efficiency. To get the most out of each system, it is essential to tune these parameters to the best possible values. At present these values are set according to the instinct and experience of the service provider, which might not always lead to the ideal setup for offering services, especially in a pay-as-you-go model.

Earlier work by different researchers suggests two methods currently adopted to find the optimal configuration for a system offering platform-as-a-service (PaaS). The first method is an exhaustive trial-and-error methodology in which several attempts are made to identify the “best value” for each parameter. This method is intrinsically infeasible, as the following example shows. Assume that a service administrator needs to try 10 values for each parameter in a set of 100 parameters. Covering every combination then requires on the order of 10^100 empirical observations, which is practically impossible. The second approach uses machine learning techniques to find tailor-made parameter values for a system setup (Wang, Xu, & He, 2016; Trotter, Liu, & Wood, 2017). This approach is robust and flexible, although it too involves vast computation.
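The arithmetic behind this example can be checked with a short Python sketch. The function names are illustrative. The one-parameter-at-a-time sweep and the five-parameter shortlist are assumptions added for contrast, not figures or methods taken from the article.

```python
def grid_runs(values_per_parameter: int, num_parameters: int) -> int:
    """Runs needed to try every combination of values (full grid)."""
    return values_per_parameter ** num_parameters


def one_at_a_time_runs(values_per_parameter: int, num_parameters: int) -> int:
    """Runs needed when each parameter is varied alone while the others
    stay fixed (an assumed alternative, used here only for comparison)."""
    return values_per_parameter * num_parameters


# The example from the text: 10 candidate values, 100 parameters.
full_grid = grid_runs(10, 100)
print("full grid exponent:", len(str(full_grid)) - 1)

# Varying one parameter at a time grows linearly instead.
print("one at a time:", one_at_a_time_runs(10, 100))

# A hypothetical shortlist of 5 parameters keeps even a full grid small.
print("grid over 5 parameters:", grid_runs(10, 5))
```

The exponential growth of the full grid is what makes shrinking the parameter set before tuning attractive.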

Having identified these research gaps, this paper proposes heuristic optimal values for the configurable parameters of Apache framework data platforms. The parameters tuned here are identified using filter and embedded feature selection approaches (Pattanshetti and Attar, in press). Applying feature selection helped to identify a reduced feature space for all three data platforms and to eliminate less relevant and redundant features. The features identified in common by the various filter and embedded algorithms made it into the final feature space, producing the optimal feature set. This optimal feature set is used for tuning, to empirically assess the impact of every input parameter on the output parameter of the respective data platform. The results show significantly improved performance when these input parameters are tuned to the heuristic optimal values compared with when they are left at their default values.
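The idea of keeping only the features that several selectors agree on can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the selector names, the parameter names p1 to p7, the rankings, and the top_k cut-off are all hypothetical placeholders.

```python
from collections import Counter


def consensus_features(rankings, top_k, min_votes=None):
    """Return the parameters that appear in the top_k of at least
    min_votes selectors (by default, of every selector)."""
    if min_votes is None:
        min_votes = len(rankings)
    votes = Counter()
    for ranked in rankings.values():
        # Each selector votes once for each of its top_k parameters.
        votes.update(set(ranked[:top_k]))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical rankings (most relevant first) from three selectors.
rankings = {
    "filter_correlation": ["p3", "p1", "p7", "p2", "p5"],
    "filter_mutual_info": ["p1", "p3", "p2", "p7", "p4"],
    "embedded_lasso": ["p1", "p7", "p3", "p6", "p2"],
}

selected = consensus_features(rankings, top_k=3)
print(selected)  # parameters chosen by all three selectors

relaxed = consensus_features(rankings, top_k=3, min_votes=2)
print(relaxed)  # parameters chosen by at least two selectors
```

Requiring agreement from every selector gives the smallest feature set. Lowering min_votes trades a larger tuning effort for a lower risk of discarding a relevant parameter.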
