1. Introduction
The past few years have seen tremendous interest in big data analytics across social networks, e-commerce, and Web applications (Wang et al., 2015; Xu et al., 2015; Srinivasa & Bhatnagar, 2012; Raghupathi & Raghupathi, 2014; Wu & Pei, 2015; Fiorini et al., 2016; Oide et al., 2017; Kulkarni et al., 2016), as data volumes in both industry and research continue to outgrow the processing speed of individual machines. Within big data processing, stateful data analytics has recently drawn significant attention. In stateful analytics, the results are determined not only by the algorithm logic and the inputs, but also by the “state” of the system. For example, Internet search engines keep the previous URL ranking scores as state, so that they can update the results incrementally as web pages rapidly evolve, rather than re-computing them from scratch (Logothetis et al., 2010). Stateful data analytics frameworks are expected to provide fine-grained, low-latency state access and to scale to large state sizes. Recent research has focused on developing new programming models and systems for state management that partially or completely fulfill these requirements (Salloum et al., 2016; To, Soto & Markl, 2017; Bhatotia et al., 2011; Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013; Fernandez et al., 2014; Gunda, Ravindranath, Thekkath & Zhuang, 2010; Murray et al., 2013). For example, Naiad (Murray et al., 2013) and SEEP (Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013) support computation over state stored on the local disk. Nectar (Gunda, Ravindranath, Thekkath & Zhuang, 2010) saves previous results in order to derive new ones, but it does not provide fine-grained state access. SDG (Fernandez et al., 2014) and Piccolo (Power & Li, 2010) offer full-fledged stateful data analytics functionality with imperative programming models.
As described above, several research efforts have pursued stateful data analytics. These efforts fall into two categories: new frameworks and refactored stateless frameworks. Among the new frameworks, representative works such as Naiad (Murray et al., 2013) and SEEP (Castro Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013) adopt graph-oriented and operator-oriented programming models, respectively, but they cannot scale to state sizes beyond the capacity of a single host. Other representative studies in this direction include SDG (Fernandez et al., 2014) and Piccolo (Power & Li, 2010). Both adopt an imperative programming model and support large state in a distributed environment.