A Dynamic Scaling Approach in Hadoop YARN

Warda Ismahene Nemouchi, Souheila Boudouda, Nacer Eddine Zarour
Copyright: © 2022 |Pages: 17
DOI: 10.4018/IJOCI.286176


In Cloud-based Big Data applications, Hadoop has been widely adopted for the distributed processing of large-scale data sets. However, the energy consumption of data centers remains an important research axis due to resource over-provisioning and extra overhead costs. To overcome this challenge, dynamic scaling of resources in a Hadoop YARN cluster is a practical solution. This paper proposes a dynamic scaling approach in Hadoop YARN (DSHYARN) that adds or removes nodes automatically based on workload. It is based on two algorithms (scaling up/down) that are implemented to automate the scaling process in the cluster. This article aims to ensure both the energy efficiency and the performance of Hadoop YARN clusters. To validate the effectiveness of DSHYARN, a case study of sentiment analysis on tweets about the COVID-19 vaccine is provided; the goal is to analyze tweets posted by users on Twitter. The results showed improvements in CPU utilization, RAM utilization, and job completion time. In addition, energy consumption was reduced by 16% under average workload.
Article Preview


Cloud computing (CC) has emerged as a recent paradigm that combines distributed computations with server virtualization and storage capacity (Shah & Trivedi, 2015). Its fundamental idea revolves around providing multiple services to customers over the internet through three models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The use of CC minimizes the burden on users and helps them focus on their core business. It liberates them from concerns and costs related to infrastructure (Kalagiakos & Karampelas, 2011) and allows companies to scale their computations as they grow. Deploying applications on the cloud offers multiple advantages, including scalability, resource sharing, on-demand services, and distributed computations (Balashandan & Shivika, 2017). It has been proved to be more versatile than traditional infrastructure from both service-quality and security perspectives (Armbrust et al, 2010).

The use of CC has given rise to cloud-based applications, especially when it comes to dealing with large-scale data, in other terms Big Data applications. The fundamental goal of Big Data is to derive knowledge and insights from previously collected or real-time generated data, passing through different phases of cleaning, processing, and analyzing, which improves the decision-making process. Several properties differentiate Big Data from traditional data, referred to as the V model: a large volume of varied data generated at high velocity (Khan et al, 2015). These properties constitute big problems and big challenges for both companies and researchers, not only due to the demanding requirements for handling and processing this data, but also the need to reduce response time to minutes or even seconds (near real time). Hence, most enterprises deploy their data on the cloud for its elastic, on-demand, self-service, and resource-pooling nature (Wang et al, 2015; Rajput et al, 2019).

One of the most widely used cloud-based Big Data applications is Apache Hadoop. It is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop implements MapReduce, one of the methods to run and analyze parallel processing of data (Shah & Trivedi, 2015). It is designed from the ground up to scale to thousands of machines in a shared-nothing architecture (Apache Foundation, n.d). Running Hadoop on the cloud makes adding or removing nodes smoother.
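To make the MapReduce programming model concrete, the following is a minimal sketch of a word-count job expressed as map and reduce functions, with the shuffle step simulated in plain Python. The function names are illustrative only, not part of any Hadoop API; in a real cluster, Hadoop distributes the map and reduce calls across nodes and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: aggregate all counts emitted for the same key."""
    return (word, sum(counts))

def run_job(documents):
    # Shuffle step: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)
    # Reduce step: one call per distinct key.
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["big data big cluster", "data lake"]))
# → {'big': 2, 'data': 2, 'cluster': 1, 'lake': 1}
```

Because each map call depends only on its own input split and each reduce call only on one key's values, both phases parallelize naturally across a cluster, which is what lets Hadoop scale to thousands of shared-nothing machines.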

Despite the fact that the cloud has proved beneficial for Big Data (Jannapureddy et al, 2019), running large-scale data computations has a huge influence on data centers' energy consumption. Most users tend to pre-configure the cluster's resources to be able to process the maximum workload. In addition, the scalability of the cloud can lead to uncontrolled growth of resources to meet users' demands, which leaves more servers unused and thus wastes energy (Wang et al, 2015). It has been stated that energy costs constitute a large fraction of the overall cost of ownership of data centers (Jam et al, 2013; Wang et al, 2015). According to a study of a sample of 5,000 Google servers, CPU utilization of servers in such large-scale data centers is quite low, ranging from 10% to 20%, and up to 60% of computing resources run without even being used (Barroso & Holezle, 2009). For that reason, dynamic scaling is required to use resources efficiently. Multiple research works have been proposed to achieve energy efficiency and reduce operational costs by dynamically adjusting acquired resources to the workload (Hosamani et al, 2020). In other words, nodes are added or removed automatically according to the current workload, with the remaining nodes placed in a lower-power standby mode (Manikandan & Ravi, 2014).
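The add/remove decision described above can be sketched as a simple threshold rule on cluster utilization. The function name, thresholds, and parameters below are illustrative assumptions, not the paper's actual DSHYARN scale-up/scale-down algorithms, which are workload-driven and described later in the article.

```python
def scaling_decision(avg_cpu, active_nodes, min_nodes=2,
                     scale_up_at=0.80, scale_down_at=0.30):
    """Decide whether to add a node, move one to standby, or hold.

    avg_cpu is the cluster-wide average CPU utilization in [0, 1];
    min_nodes keeps the cluster from shrinking below a working core.
    """
    if avg_cpu > scale_up_at:
        return "up"      # workload high: commission a standby node
    if avg_cpu < scale_down_at and active_nodes > min_nodes:
        return "down"    # workload low: decommission a node into standby
    return "hold"        # utilization within the target band

print(scaling_decision(0.90, 4))  # → up
print(scaling_decision(0.10, 4))  # → down
print(scaling_decision(0.50, 4))  # → hold
```

The energy saving comes from the "down" branch: nodes in standby draw far less power than idle-but-active servers, which directly targets the low-utilization figures cited above.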
