High-Level Languages for Geospatial Analysis of Big Data: Strengths and Weaknesses

Symphorien Monsia (LTSIRS Laboratory, National Engineering School, Tunis, Tunisia) and Sami Faiz (LTSIRS Laboratory, National Engineering School, Tunis, Tunisia)
Copyright: © 2021 | Pages: 20
DOI: 10.4018/978-1-7998-1954-7.ch004

Abstract

In recent years, big data has become a major concern for many organizations. An essential component of big data is its spatio-temporal dimension, known as geospatial big data, which designates the application of big data issues to geographic data. One of the major aspects of (geospatial) big data systems is the data query language (i.e., high-level language) that allows non-technical users to interact easily with these systems. In this chapter, the researchers explore high-level languages, focusing in particular on the spatial extensions of Hadoop for geospatial big data queries. Their main objective is to examine three popular open source implementations of SQL on Hadoop intended for querying geospatial big data: (1) Pigeon of SpatialHadoop, (2) QLSP of Hadoop-GIS, and (3) ESRI Hive of GIS Tools for Hadoop. Along the same lines, the authors present their current research work on the analysis of geospatial big data.
Chapter Preview

Introduction

Over the last few years, big data (sometimes called mega-data) has become a major concern for many organizations. The term "Big Data" refers to data sets that have grown so large that they are difficult to work with using conventional database management systems. These massive data come from many sources, including the Web, sensor networks, satellites, drones, radars, cameras, connected devices (such as smartphones and tablets), geolocation practices, and online social networks (such as Twitter, Facebook, Google+, and LinkedIn) that bring together billions of users.

These phenomena add considerably to the challenges of big data for many organizations and have led to the emergence of Geospatial Big Data, which represents the application of big data issues to geographic data. Geospatial Big Data is therefore an essential component of the larger phenomenon of big data, in that geographic data makes up an important part of the data collected and processed (Lee and Kang 2015). Franklin (1992) estimates that 80% of business data is geographic. An illustrative example is the LP DAAC (Land Processes Distributed Active Archive Center), an archive of terrestrial information originating from spaceborne sensors aboard NASA (National Aeronautics and Space Administration) satellites, which contains more than 1 petabyte of data and grows every day.

This explosion of geographic data has led researchers and developers in the geospatial domain to store and process it using traditional Big Data frameworks such as Spark (Zaharia et al., 2010), Flink (Carbone et al., 2015), MapReduce (Dean and Ghemawat 2004), Dryad (Isard et al., 2007), Hyracks (Borkar et al., 2011) and Hadoop (White 2015). Although these conventional Big Data systems can handle both geographic and non-geographic data, they perform significantly worse on geospatial workloads than systems designed for Geospatial Big Data processing. In fact, the only way to process Geospatial Big Data on traditional Big Data platforms is either to treat it as non-spatial data or to write a set of methods or functions as wrappers around existing non-spatial systems. However, doing so takes no advantage of the properties of spatio-temporal data, which leads to performance degradation (Eldawy and Mokbel 2016).
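
The performance point above can be made concrete with a small sketch. It is not taken from the chapter or from any of the cited systems: it assumes an in-memory list of (x, y) points and a uniform grid, chosen here only because it is one of the simplest spatial partitioning schemes. The function names (full_scan, build_grid, grid_query) are hypothetical. The first function treats the points as non-spatial records and tests every one of them; the second uses the location of each point to skip most of the data.

```python
import random
from collections import defaultdict


def full_scan(points, box):
    """Non-spatial approach: test every record against the query rectangle."""
    xmin, ymin, xmax, ymax = box
    return [p for p in points if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]


def build_grid(points, cell):
    """Spatially aware approach: bucket each point by the grid cell it falls in."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid


def grid_query(grid, cell, box):
    """Visit only the cells that overlap the query rectangle.

    Returns the matching points and the number of points actually examined.
    """
    xmin, ymin, xmax, ymax = box
    hits, examined = [], 0
    for cx in range(int(xmin // cell), int(xmax // cell) + 1):
        for cy in range(int(ymin // cell), int(ymax // cell) + 1):
            # .get avoids creating empty buckets in the defaultdict
            for p in grid.get((cx, cy), ()):
                examined += 1
                if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax:
                    hits.append(p)
    return hits, examined


# Illustrative data only: 10,000 random points in a 100 x 100 square.
random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]
box = (10.0, 10.0, 20.0, 20.0)  # query rectangle (xmin, ymin, xmax, ymax)

grid = build_grid(points, cell=10.0)
hits, examined = grid_query(grid, 10.0, box)

# Both approaches return the same answer; the grid examines far fewer points.
print("full scan examined:", len(points))
print("grid query examined:", examined)
```

Both functions return the same set of points, but the grid version only inspects the handful of cells that overlap the query rectangle. Spatial extensions of Big Data frameworks apply the same general idea at cluster scale, with their own partitioning and indexing schemes.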

As a result, several extensions of traditional Big Data frameworks have emerged in recent years, many of which overcome this limitation by integrating geospatial functionality in a variety of ways. They include Hadoop-GIS (Aji et al., 2013a,b), SpatialHadoop (Eldawy and Mokbel 2015), ESRI GIS Tools for Hadoop (Whitman et al., 2014), STARK (Hagedorn et al., 2017), SpatialSpark (You et al., 2015), GeoTrellis (Kini and Emanuele 2014), Simba (Xie et al., 2016), MD-HBase (Nishimura et al., 2013), GeoSpark (Yu et al., 2015), and GeoMesa (Hughes et al., 2015). In addition, some Geospatial Big Data frameworks have been implemented from scratch, among them BRACE (Wang et al., 2010), SciDB (Stonebraker et al., 2013), RasDaMan (Baumann et al., 1997) and Paradise (DeWitt et al., 1994).

An essential component of these Geospatial Big Data systems, and the one the researchers are particularly interested in, is the data query language, which provides high-level access to the data and shields users from the complexity of the underlying system. Among the high-level languages proposed in the literature, this chapter examines three popular open source implementations of SQL on Hadoop intended for querying Geospatial Big Data: (1) Pigeon of SpatialHadoop, (2) QLSP of Hadoop-GIS and (3) ESRI Hive of GIS Tools for Hadoop. The chapter mainly presents an overview of the contributions and shortcomings of these query languages. In addition, it presents several possible solutions to overcome those shortcomings.

The remainder of this chapter is structured as follows. Section 2 briefly describes the MapReduce programming model, including its advantages and disadvantages. Section 3 reviews three popular open source implementations of SQL on spatial extensions of the Hadoop framework for querying Geospatial Big Data. Section 4 summarizes the contributions and limits of the presented languages and briefly describes the authors' planned research to address the challenges raised. Finally, Section 5 presents their conclusion and next steps.
