Programming and Pre-Processing Systems for Big Data Storage and Visualization

Hidayat Ur Rahman (University of Swat, Pakistan), Rehan Ullah Khan (Al Qassim University, Saudi Arabia) and Amjad Ali (University of Swat, Pakistan)
DOI: 10.4018/978-1-5225-3142-5.ch009

Abstract

This chapter provides a detailed overview of the major concepts used in Big Data. To process huge volumes of data, the first step is pre-processing, which is required to handle anomalies such as missing values by applying various transformations. The chapter gives a detailed overview of pre-processing tools used for Big Data, such as R, Yahoo! Pipes, Mechanical Turk, and Elasticsearch. Besides pre-processing tools, it covers storage tools, programming tools, data visualization tools, log processing tools, and caching tools used for Big Data analytics. In other words, this chapter is the core of the book and introduces the major technologies discussed later in the book.

Background

The term Big Data is commonly used for volumes of data so large that they cannot be handled by traditional databases; they are beyond the capability of commonly used software tools to store, manipulate, and process within a limited time (Garcia, 2015). A single dataset often holds terabytes (TB) or petabytes (PB) of information. Problems associated with Big Data include capturing streaming data, storage, indexing, sharing, and visualization. Enterprises use these high volumes of data to extract useful knowledge with various software tools. However, as the size of a dataset increases, so does the difficulty of managing it. Because traditional data analysis and management tools are unable to exploit such data, more sophisticated and specialized tools and techniques are required to store and manipulate it. Big Data analytics comprises tools and techniques that help in the decision-making process (Russom, 2011). It covers three main areas: the storage and management tools used for the data, the processing tools used to extract useful information from the data, and visualization. These three areas form different phases of the decision-making process in Big Data (Russom, 2011).

Key Terms in this Chapter

Extract, Transform, Load (ETL): ETL is a process used in data warehousing to populate one database from another: data is extracted from the source, transformed, and loaded into the target.
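The three ETL stages can be illustrated with a minimal Python sketch. It is not taken from the chapter: the in-memory SQLite databases, the table names `raw_sales` and `sales`, and the cleaning rules (trim and upper-case the region, drop rows with a missing amount) are all assumptions chosen for illustration.

```python
import sqlite3


def etl(source, target):
    """Copy cleaned rows from source.raw_sales into target.sales.

    Table and column names are hypothetical, chosen for this sketch.
    """
    # Extract: read the raw rows from the source database.
    rows = source.execute("SELECT region, amount FROM raw_sales").fetchall()

    # Transform: drop rows with a missing amount, normalise the region
    # label, and convert the amount from text to a number.
    cleaned = [
        (region.strip().upper(), float(amount))
        for region, amount in rows
        if amount is not None
    ]

    # Load: write the cleaned rows into the target database.
    target.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    with target:  # commits the inserts as one transaction
        target.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return len(cleaned)


# Example source data, including one row with a missing value.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(" north ", "10.5"), ("south", None), ("East", "4")],
)

target = sqlite3.connect(":memory:")
loaded = etl(source, target)
```

Here the row with the missing amount is discarded during the transform step, so only two of the three source rows reach the target table.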

HBase: Apache HBase is a non-relational Hadoop database used for Big Data storage. HBase is written in Java.

Content Delivery Network (CDN): A Content Delivery Network is used in a distributed environment to deliver web content to users based on their geographic location.

DRQL: DRQL is a SQL-like query language for nested data, designed for column-based processing. DRQL is compatible with BigQuery and Dremel.

ZeroMQ: ZeroMQ is an asynchronous messaging system used in distributed and concurrent applications.

Atomicity, Consistency, Isolation and Durability (ACID): Atomicity, Consistency, Isolation and Durability are the properties of a transaction management system. Every transaction should satisfy the ACID properties.
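As an illustration of the first two properties, the following Python sketch uses the standard `sqlite3` module. It is an assumption-laden example, not part of the chapter: the `accounts` table, the `transfer` and `balances` helpers, and the non-negative balance rule are hypothetical. A transfer that would violate the rule is rolled back as a whole, so the earlier credit is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts; return True if the transaction committed."""
    try:
        # Used as a context manager, the connection commits on success and
        # rolls back every statement in the block if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)        # valid transfer, commits
rejected = transfer(conn, "alice", "bob", 500)  # overdraft, rolled back
```

The credit is deliberately issued before the debit so that the failing transfer has a partial change to undo; after the rollback the balances are the same as before the attempt.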

Greenplum: Greenplum is a Big Data company that provides rapid analytic operations on petabytes of data.

Extreme Application Platform (XAP): GigaSpaces Extreme Application Platform is well suited to high-performance, low-latency transaction processing as well as analytic processing.

Hadoop Distributed File System (HDFS): The Hadoop Distributed File System is used to store huge files and to stream their data at high bandwidth to servers and user applications.

ACUNU: ACUNU is an analytics platform for high-velocity data, used mostly in production environments.
