A Fine-Grained Stateful Data Analytics Method Based on Resilient State Table

Jike Ge (College of Electrical and Information Engineering, Chongqing University of Science and Technology, Chongqing, China), Wenbo He (McMaster University, Toronto, Canada), Zuqin Chen (Chongqing University of Science and Technology, Chongqing, China), Can Liu (Chongqing University of Science and Technology, Chongqing, China), Jun Peng (Chongqing University of Science and Technology, Chongqing, China) and Guorong Chen (Chongqing University of Science and Technology, Chongqing, China)
DOI: 10.4018/IJSSCI.2018040105

Abstract

This article describes how stateful data analytics frameworks have emerged to provide fresh, low-latency results for big data processing. A fine-grained data model is desirable in the Spark data processing framework. However, because Spark adopts a coarse-grained data model to facilitate parallelization, handling fine-grained data access in stateful data analytics is challenging. In this article, the authors introduce a fine-grained stateful data component, the Resilient State Table (RST), to the Spark framework. To fill the gap between the coarse-grained data model in Spark and the fine-grained data access requirements of stateful data analytics, they devise a programming model for RST that interacts seamlessly with Spark's coarse-grained memory representation and enables users to query and update state entries at fine granularity through Spark-like programming interfaces. Performance evaluation experiments in various application fields demonstrate that the proposed solution improves latency, fault tolerance, and scalability.

1. Introduction

Over the past few years, applications ranging from social networks to e-commerce and the Web have seen tremendous interest in big data analytics (Wang et al., 2015; Xu et al., 2015; Srinivasa & Bhatnagar, 2012; Raghupathi & Raghupathi, 2014; Wu & Pei, 2015; Fiorini et al., 2016; Oide et al., 2017; Kulkarni et al., 2016), as data volumes in both industry and research continue to outgrow the processing speed of individual machines. Within big data processing, stateful data analytics has recently drawn significant attention: the results of the analytics are determined not only by the algorithm logic and the inputs, but also by the “state” of the system. For example, Internet search engines keep the previous URL ranking scores as state, so that they can update their results incrementally as web pages rapidly evolve, rather than re-computing from scratch (Logothetis et al., 2010). Stateful data analytics frameworks are expected to provide fine-grained, low-latency state access and to scale with large state sizes. Recent research has focused on developing new programming models and systems for state management that partially or completely fulfill these requirements (Salloum et al., 2016; To, Soto & Markl, 2017; Bhatotia et al., 2011; Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013; Fernandez et al., 2014; Gunda, Ravindranath, Thekkath & Zhuang, 2010; Murray et al., 2013). For example, Naiad (Murray et al., 2013) and SEEP (Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013) support computation over state stored on the local disk. Nectar (Gunda, Ravindranath, Thekkath & Zhuang, 2010) saves previous results to derive new ones, but it does not provide fine-grained state access. SDG (Fernandez et al., 2014) and Piccolo (Power & Li, 2010) offer full-fledged stateful data analytics functionality with imperative programming models.
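The contrast between incremental, stateful updates and re-computation from scratch can be illustrated with a minimal sketch. It is plain Python rather than Spark, and it is not the RST interface or any search engine's ranking algorithm: the function names, the integer "score deltas," and the example URLs are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical input shape: each batch is a list of (url, score_delta) pairs.
Batch = List[Tuple[str, int]]


def recompute_from_scratch(all_batches: List[Batch]) -> Dict[str, int]:
    """Stateless style: rescan every batch seen so far to rebuild all scores."""
    totals: Counter = Counter()
    for batch in all_batches:
        for url, delta in batch:
            totals[url] += delta
    return dict(totals)


def update_state(state: Dict[str, int], batch: Batch) -> Dict[str, int]:
    """Stateful style: touch only the state entries named in the new batch."""
    for url, delta in batch:
        state[url] = state.get(url, 0) + delta
    return state


# Illustrative data (assumed, not taken from the article).
batches: List[Batch] = [
    [("a.example", 3), ("b.example", 1)],
    [("a.example", 2), ("c.example", 5)],
]

state: Dict[str, int] = {}
for batch in batches:
    update_state(state, batch)

# Both approaches agree on the result; they differ in how much data is re-read.
assert state == recompute_from_scratch(batches)
```

In this sketch, each call to `update_state` does work proportional to the size of the new batch, whereas `recompute_from_scratch` re-reads all past input. Fine-grained access to individual state entries is what makes the first approach possible.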

As described above, several research efforts have pursued stateful data analytics. These efforts fall into two categories: new frameworks and refactored stateless frameworks. Among the new frameworks, representative works such as Naiad (Murray et al., 2013) and SEEP (Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013) adopt graph-oriented and operator-oriented programming models, respectively, but they cannot scale to state sizes beyond the capacity of a single host. Other representative studies in this direction include SDG (Fernandez et al., 2014) and Piccolo (Power & Li, 2010), both of which adopt an imperative programming model and support large state in a distributed environment.
