Framework for GeoSpatial Query Processing by Integrating Cassandra With Hadoop

S. Vasavi (V. R. Siddhartha Engineering College, India), Mallela Padma Priya (V. R. Siddhartha Engineering College, India) and Anu A. Gokhale (Illinois State University, USA)
Copyright: © 2018 |Pages: 41
DOI: 10.4018/978-1-5225-5088-4.ch001


We are moving towards digitization, connecting devices such as sensors and cameras to the internet and producing big data. This big data comes in a variety of forms and has paved the way for the emergence of NoSQL databases, such as Cassandra, which provide scalability and availability. The Hadoop framework was developed for storing and processing distributed data. In this chapter, the authors investigate the storage and retrieval of geospatial data by integrating Hadoop and Cassandra using two techniques: prefix-based partitioning and Cassandra's default partitioning algorithm (i.e., Murmur3Partitioner). A geohash value is generated, which acts as the partition key and also supports effective search, so the time taken to retrieve data is reduced. When users issue spatial queries, such as finding the nearest locations, the search in the Cassandra database is run using both partitioning techniques. Query response times are then compared to determine which method is more effective. Results show that the prefix-based partitioning technique is more efficient than the Murmur3 partitioning technique.
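As a rough illustration of the geohash idea described above, the following Python sketch encodes a latitude/longitude pair with the standard geohash bit-interleaving scheme and then takes a fixed-length prefix as a partition key. It is not the authors' implementation: the function names `geohash_encode` and `partition_key`, and the default prefix length of 4, are assumptions made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (omits a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def partition_key(geohash: str, prefix_len: int = 4) -> str:
    """Assumed prefix-based key: the first `prefix_len` geohash characters."""
    return geohash[:prefix_len]
```

For example, `geohash_encode(42.6, -5.6, 5)` returns `"ezs42"`, and a nearby point such as (42.61, -5.61) falls under the same 4-character prefix `"ezs4"`, which is why a prefix can group neighbouring locations under one key.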
Chapter Preview

1. Introduction

Companies that use big data to address business challenges can gain an advantage by integrating Cassandra with Hadoop. The Hadoop framework, with its distributed file system, can process voluminous data generated from various sources. Among the various NoSQL databases, Cassandra supports linear scalability and high availability, which ensures fault tolerance. When integrated, Cassandra and Hadoop together increase the processing capability needed to manage big data efficiently. Geospatial data helps in identifying the geographic location of an object, along with its features and boundaries on Earth. Such data can be analyzed to serve various purposes, such as tourism, health care, geo-marketing, and intelligent transportation systems. There are two data types for spatial data: vector and raster. Both data types store object references as latitude and longitude (as vertices and paths, or as grid cells), as shown in Figure 1. Raster data sources include remote sensing and photogrammetry, while vector data sources include the Global Positioning System (GPS) and digitizing. Integrating Cassandra with Hadoop helps in querying spatial data efficiently by reducing the query response time.

Figure 1. Longitude and latitudes of Earth (Geohash and its format, 2016)

Vector data can be represented at its original resolution and form without generalization, but the location of each vertex needs to be stored explicitly. An advantage of raster data is that the geographic location of each cell is implied by its position in the cell matrix; a disadvantage is that linear features are difficult to represent adequately, depending on the cell resolution. Figure 2 and Figure 3 present examples of vector data and raster data, respectively.

Figure 2. Geospatial vector data type (Ruslan Bobov, 2017)

Figure 3. Geospatial raster data type (Ruslan Bobov, 2017)
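The difference in how the two data types store location can be sketched in a few lines of Python. This is only an illustrative model, not a format used in the chapter: the class names `VectorLine` and `RasterGrid`, the upper-left origin convention, and the sample coordinates are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VectorLine:
    """Vector feature: every vertex is stored explicitly as (lon, lat)."""
    vertices: List[Tuple[float, float]]


@dataclass
class RasterGrid:
    """Raster layer: only the origin and cell size are stored; a cell's
    location follows from its row and column in the value matrix."""
    origin_lon: float        # longitude of the upper-left corner (assumed convention)
    origin_lat: float        # latitude of the upper-left corner
    cell_size: float         # cell width and height in degrees
    values: List[List[int]]  # one value per cell, row by row

    def cell_center(self, row: int, col: int) -> Tuple[float, float]:
        """Derive the (lon, lat) of a cell centre from its matrix position."""
        lon = self.origin_lon + (col + 0.5) * self.cell_size
        lat = self.origin_lat - (row + 0.5) * self.cell_size
        return lon, lat


# Hypothetical sample data for illustration only.
road = VectorLine(vertices=[(80.1, 16.9), (80.4, 16.7), (80.9, 16.6)])
grid = RasterGrid(origin_lon=80.0, origin_lat=17.0, cell_size=0.5,
                  values=[[1, 2, 3], [4, 5, 6]])
```

Here `grid.cell_center(1, 2)` evaluates to `(81.25, 16.25)` without any coordinate being stored for that cell, whereas `road` carries every vertex explicitly. A coarser `cell_size` would make a thin linear feature like `road` harder to represent in the grid.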


Traditional (relational) databases are suitable for storing and querying structured data and guarantee ACID properties. With the emergence of the internet, large amounts of unstructured data are being produced. NoSQL databases, which are designed around the CAP properties (consistency, availability, and partition tolerance), are suitable for storing such unstructured data. Dynamo, MongoDB, BigTable, HBase, and Cassandra are designed to handle data storage and processing with low response times. Even though MongoDB suits complex queries, as in social networking applications where latency must be optimized, HBase and Cassandra are equally good in such scenarios when integrated with Hadoop. Cassandra and its query language, CQL, support features such as indexing and search libraries, but not spatial queries. Although some work has been reported on labeling and retrieving data in Cassandra, those approaches are not efficient. This chapter aims to add spatial querying functionality to Cassandra by integrating it with Hadoop.
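To make the contrast between prefix-based and hash-based partitioning more concrete, the sketch below groups a few geohash-tagged places under both schemes. It is a simplified model under stated assumptions, not the chapter's implementation and not Cassandra's API: MD5 from the Python standard library stands in for Murmur3 (which the standard library does not provide), the partition count of 8 is arbitrary, and the place names and geohash strings are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def prefix_partition(geohash: str, prefix_len: int = 4) -> str:
    """Prefix-based key: places sharing a geohash prefix share a partition."""
    return geohash[:prefix_len]


def hashed_partition(geohash: str, num_partitions: int = 8) -> int:
    """Hash-based key. MD5 is only a stand-in for Murmur3 here; the point is
    that similar geohashes are scattered across partitions."""
    digest = hashlib.md5(geohash.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def group_by(places: Dict[str, str],
             key_fn: Callable[[str], Hashable]) -> Dict[Hashable, List[str]]:
    """Group place names by the partition key computed from their geohash."""
    groups = defaultdict(list)
    for name, gh in places.items():
        groups[key_fn(gh)].append(name)
    return dict(groups)


def nearby_candidates(query_geohash: str, places: Dict[str, str],
                      prefix_len: int = 4) -> List[str]:
    """Candidate neighbours: every place in the query's prefix partition."""
    key = prefix_partition(query_geohash, prefix_len)
    return sorted(name for name, gh in places.items()
                  if prefix_partition(gh, prefix_len) == key)


# Hypothetical places with illustrative geohash strings.
PLACES = {
    "place_a": "ezs42e",
    "place_b": "ezs42g",
    "place_c": "ezs48x",
    "place_d": "u4pruy",
}
```

With these inputs, `group_by(PLACES, prefix_partition)` puts `place_a`, `place_b`, and `place_c` together under `"ezs4"`, so `nearby_candidates("ezs42k", PLACES)` only needs to look in that one group. With `hashed_partition`, the same three places may land in unrelated partitions. One known limitation of any prefix scheme is that two points on opposite sides of a cell boundary can be close together yet have different prefixes, so a practical nearest-location search would usually also check neighbouring cells.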
