Hengam: A MapReduce-Based Distributed Data Warehouse for Big Data


Mohammadhossein Barkhordari, Mahdi Niamanesh
Copyright: © 2018 | Pages: 20
DOI: 10.4018/IJALR.2018010102

Abstract

When the volume of information grows exponentially, we are confronted with big data. This huge amount of information makes big data retrieval and analytics important issues. There have been many attempts to solve data analytics problems using distributed platforms, but the main problem with the proposed methods is that they do not observe data locality. In this article, a MapReduce-based method called Hengam is proposed. In this method, data format unification gives nodes data independence. The unified format increases information retrieval speed and prevents data exchange between nodes. The proposed method was evaluated using data items from an ICT company, and its information retrieval time was much better than that of other open-source distributed data warehouse software.

1. Introduction

Faced with a high volume of data records generated by software systems, sensors, social networks, mobile devices, etc., we need systems to manage and utilize this huge amount of data more than ever before. Current database management systems (DBMSs) cannot manage this amount of information, so a change is needed in this area. For several years, the management of huge amounts of data has been known as big data management, where data volume is one dimension of big data. Other dimensions include data veracity, velocity, and variety, which are outside the scope of this paper.

One of the areas needing big data management is data warehousing. When working with a high volume of information generated by online transaction processing (OLTP) systems, creating a data warehouse for this information is critical. Information retrieval is one of the most important factors in data warehousing. This amount of data cannot be stored on a single server, so it must be distributed over several nodes. There are two architectures for distributed solutions: shared memory and storage, and shared nothing. In the shared memory and storage architecture, for example, Oracle Real Application Clusters (RAC) servers share memory and storage area network (SAN) storage. Such systems have complex configurations and high maintenance costs, and the limit on the number of nodes is another major problem. The other group of distributed solutions is shared nothing solutions, which usually use a distributed platform to store and retrieve information. One of the most popular groups of shared nothing solutions is Not Only SQL (NoSQL), which supports both structured and unstructured data. Usually, users' queries are converted to MapReduce tasks by the NoSQL interface. MapReduce (Dean et al., 2008) is a programming model that can solve big data problems using distributed and scalable solutions. The data entry speed in NoSQL data warehouses is very high because they do not have to observe DBMS constraints. However, NoSQL data warehouses usually lack DBMS facilities such as indexes and a variety of data types.
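To make the idea of converting a query into MapReduce tasks concrete, the following is a minimal single-process sketch in Python. It is not taken from the paper and does not represent any particular NoSQL interface; the record fields (`region`, `amount`), the sample data, and all function names are assumptions chosen for illustration. It shows how an aggregate query such as "total amount per region" could be split into a map phase, a grouping (shuffle) step, and a reduce phase.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample records; the field names are illustrative assumptions.
SALES = [
    {"region": "north", "amount": 10.0},
    {"region": "south", "amount": 5.0},
    {"region": "north", "amount": 2.5},
]


def map_phase(record: dict) -> Iterator[Tuple[str, float]]:
    """Emit one (key, value) pair per input record."""
    yield record["region"], record["amount"]


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group emitted values by key, as a framework does between the phases."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[float]) -> Tuple[str, float]:
    """Combine all values that share a key into a single result."""
    return key, sum(values)


def run_job(records: Iterable[dict]) -> Dict[str, float]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(SALES))
```

In a real distributed framework, the map and reduce functions would run on different nodes and the shuffle step would move data across the network, which is where data locality becomes important.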

However, the main problem with distributed data warehouses is not the lack of DBMS facilities. The main problem is data locality: the data needed for processing is often not present on the node that performs the processing. Many attempts have been made to overcome this problem, but to the best of our knowledge, no existing method solves it completely.

In this paper, we introduce the Hengam method, which offers a scalable and distributable data warehouse for big data; this method is based on MapReduce. In the proposed method, the data locality problem is solved completely, and traditional DBMSs can be used on distributed nodes. The proposed method was evaluated with the data items of an ICT company.
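As a rough illustration of what data locality with a traditional DBMS on each node can look like, the sketch below uses Python's built-in `sqlite3` module, with one in-memory database standing in for each node. This is not the Hengam implementation, which this preview does not describe; the `calls` table, its columns, the sample rows, and the function names are all hypothetical. Each node answers the query using only its own rows, and only the small per-node results are combined afterwards.

```python
import sqlite3
from collections import Counter

# Hypothetical rows already partitioned across two nodes.
NODE_DATA = [
    [("alice", 30), ("bob", 12)],
    [("alice", 5), ("carol", 40)],
]


def make_node(rows):
    """Create an in-memory SQLite database standing in for one node's DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (customer TEXT, duration INTEGER)")
    conn.executemany("INSERT INTO calls VALUES (?, ?)", rows)
    return conn


def local_query(conn):
    """Map step: aggregate only the rows stored on this node."""
    cursor = conn.execute(
        "SELECT customer, SUM(duration) FROM calls GROUP BY customer"
    )
    return cursor.fetchall()


def merge(partials):
    """Reduce step: combine the per-node partial sums into final totals."""
    totals = Counter()
    for partial in partials:
        for customer, total in partial:
            totals[customer] += total
    return dict(totals)


def run():
    """Query every node locally, then merge the partial results."""
    nodes = [make_node(rows) for rows in NODE_DATA]
    try:
        return merge(local_query(conn) for conn in nodes)
    finally:
        for conn in nodes:
            conn.close()


if __name__ == "__main__":
    print(run())
```

In this sketch no raw rows leave a node; only aggregated partial results are exchanged, which is the kind of behavior that observing data locality aims for.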
