1 Introduction
In this era of information, knowledge and wisdom, informed decisions are taken using processed data presented in various visual forms, which forms the backbone of the Information Society. Developments in information and communication technology, IoT devices (Cai et al., 2016; Qin et al., 2016), medical devices, Internet technologies etc. have resulted in the generation of large volumes of Big data. Big data is available in structured, semi-structured and unstructured forms and formats. It is very large in volume and is generated continuously at a rapid rate, but has low integrity (Jacobs, 2009; Zikopoulos et al., 2011; Gupta et al., 2012; Kumar & Vijay Kumar, 2015; Tsai et al., 2015; Zhang et al., 2016). Big data is heterogeneous in nature and is collected for specific applications. Some of the major applications of Big data are in e-commerce, healthcare, scientific applications, education, social welfare (Global Pulse, 2012), IoT (Ahmed et al., 2017) etc. Big data needs to be analyzed in real time for timely decisions. For example, a large university may use Big data analytics to plan its future infrastructure development from Big data consisting of student intake of past years, geographic data of student residential locations, success rates of the students, effectiveness of past publicity interventions, future demand for newer curricula etc. A large amount of data, especially semi-structured and unstructured data, is generated by such systems. Processing such data is time consuming and requires enhanced techniques for faster processing. View materialization is one such technique.
View materialization is a complex problem, as there can be a very large number of possible views for materialization, but only a few of these can be materialized due to storage space constraints. The view selection problem aims at identifying a set of views that optimizes the query response time while accommodating continuous data updates and utilizing minimal resources. It is an NP-Hard problem (Harinarayan et al., 1996; Chirkova et al., 2001). With the emergence of Big data, the selection of views for materialization must address additional issues arising from large data volume, continuous heterogeneous data and data integrity. In addition, the data processing paradigm has shifted from structured data to semi-structured and unstructured data, which has also changed the data processing framework. The newer frameworks involve distributed file systems (DFS) (Hadoop, 2008, 2012; Manyika, 2011), Apache Hadoop (Dezyre, 2015), map-reduce (Dean & Ghemawat, 2012; Hadoop, 2008, 2012), cloud map-reduce (Dahiphale et al., 2014) and a rich set of newer database and data warehouse technologies like NoSQL, Hive, BigTable and Neo4j (Kumar & Vijay Kumar, 2021a).
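To make the combinatorial nature of view selection concrete, the following minimal sketch enumerates every subset of a handful of candidate views and keeps the best one that fits a storage budget. The view names, sizes and savings are hypothetical, and the savings are assumed to be additive, which is a simplification for illustration and not the cost model used in this paper.

```python
from itertools import combinations

# Hypothetical candidate views: name -> (storage size, query-time saving).
# All numbers are invented for illustration; savings are assumed additive,
# which real view-selection cost models generally do not guarantee.
CANDIDATES = {
    "v_intake_by_year": (4, 10),
    "v_students_by_region": (3, 7),
    "v_success_by_course": (5, 12),
    "v_publicity_effect": (2, 3),
}


def best_view_set(candidates, budget):
    """Exhaustively search all subsets that fit within `budget`."""
    names = sorted(candidates)
    best_names, best_saving = (), 0
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            size = sum(candidates[n][0] for n in subset)
            saving = sum(candidates[n][1] for n in subset)
            # Keep the subset only if it fits and improves on the best so far.
            if size <= budget and saving > best_saving:
                best_names, best_saving = subset, saving
    return set(best_names), best_saving


selected, saving = best_view_set(CANDIDATES, budget=9)
```

With n candidate views the loop visits 2^n - 1 subsets, so exhaustive search is only feasible for very small n; this is the practical reason heuristic and evolutionary methods are used for the problem.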
Big data view materialization was studied in the context of Hive databases on a standard dataset using the map-reduce cost of queries and views (Goswami et al., 2017). However, that work was just an extension of view materialization for data warehouses to Big data warehouses and did not incorporate Big data characteristics when computing the fitness values of the objective functions. Kumar & Vijay Kumar (2021a) presented the Big data view materialization problem as a single-objective optimization problem in the context of Big data characteristics, and suggested using the number of DFS blocks of stored data to compute the fitness value of the objective functions. The Big data view selection problem was presented as a bi-objective optimization problem in Kumar & Vijay Kumar (2021b), where it was solved using the Vector Evaluated Genetic Algorithm (VEGA). This paper addresses this bi-objective optimization problem using the Non-dominated Sorting Genetic Algorithm (NSGA-II) (Deb et al., 2002; Deb, 2014).
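In a bi-objective setting there is usually no single best solution, only a set of trade-offs. The sketch below illustrates the Pareto dominance relation that underlies non-dominated sorting, assuming two objectives that are both minimized. The objective values are hypothetical, and the code extracts only the first non-dominated front; it is not the full NSGA-II procedure (which also ranks the remaining fronts and applies crowding distance) and not the algorithm of this paper.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (objective 1, objective 2) values for five candidate solutions.
points = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
front = first_front(points)
```

Here (8, 7) and (9, 9) are dominated by (8, 6), so the first front consists of the three remaining points, none of which is better than another in both objectives.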
Section 2 presents the view materialization problem in the context of different DBMSs and types of data. Section 3 presents view materialization in the context of Big data. Section 4 presents the process of identifying candidate views. Section 5 discusses the model for computing the costs of Big data views, which was defined in Kumar & Vijay Kumar (2021a). Section 6 presents the bi-objective Big data view materialization problem (Kumar & Vijay Kumar, 2021b). Section 7 presents the NSGA-II based algorithm for selecting Big data views for materialization. Section 8 presents an example of the algorithm. Section 9 presents the experimental results of the algorithm, followed by the conclusion in Section 10.
Next, a brief account of different research issues related to view materialization is presented.