The Evolution of the Massively Parallel Processing Database in Support of Visual Analytics

Ian A. Willson (The Boeing Company, USA)
Copyright: © 2011 | Pages: 26
DOI: 10.4018/irmj.2011100101

Abstract

This article explores the evolution of the Massively Parallel Processing (MPP) database, focusing on trends of particular relevance to analytics. The dramatic shift of database vendors and leading companies to utilize MPP databases and deploy an Enterprise Data Warehouse (EDW) is presented. The inherent benefits of fresher data, storage efficiency, and, most importantly, accessibility to analytics are explored. Published industry and vendor metrics are examined that demonstrate substantial and growing cost efficiencies from utilizing MPP databases. The author concludes by reviewing trends toward parallelizing decision support workload into the database, ranging from in-database transformations to new statistical and spatial analytic capabilities provided by parallelizing these algorithms to execute directly within the MPP database. These new capabilities present an opportunity for timely and powerful enterprise analytics, providing a substantial competitive advantage to those companies able to leverage this technology to turn data into actionable information, gain valuable new insights, and automate operational decision making.

The Massively Parallel Processing Database

Focused on analytics, most of us have not concerned ourselves with the precise details of the systems that host our data. To begin, let us define our domain of interest as broad, enterprise-scale relational databases that support analytics and queries of all types, rather than transaction processing. The focus of this article is on performing a wide variety of valued analytics against such databases using an architecture specifically created to process this workload. MPP technology has been successfully deployed in large database systems for more than 25 years, supporting, with varying degrees of automation, many of the types of analytics we perform today. The author has been working with very large relational databases for decision support and analytical applications since 1986. In an early example, an entire IBM® 4081 mainframe was used to host a Human Resources decision support system. Since that time, what constitutes "large" has changed, as has the cost per unit of storage or unit of power, but, surprisingly, the fundamental method of storing and retrieving structured data has changed little.

The storage and retrieval of normalized data in tables, first contemplated by Dr. Ted Codd (1970), is increasingly practical now that a suitably powerful hardware platform and database architecture are available. The various design mitigations previously required for performance are fading in prevalence. The one fundamental change since 1970 is the commercial introduction of the shared-nothing or Massively Parallel Processing (MPP) database by Teradata Corporation, which shipped its first production system in 1984. One important distinction in this application of MPP to relational databases is that the parallelization is applied to the operators of Structured Query Language (SQL), a fundamentally functional language. The better known applications of MPP, at least 20 years ago, were in the scientific computing realm, where massive parallelization was thought to offer the opportunity to speed up calculations for a wide range of high value scientific computing problems.

Thinking Machines Corporation was one of the most widely known parallel supercomputer manufacturers of the 1980s and early 1990s until its bankruptcy in 1994 (Taubes, 1995). To execute scientific calculations in parallel on these MPP systems, special language extensions such as C* and CM Fortran were required. Ultimately this approach was not cost-effective, and more economical designs for moderate parallelism, utilizing specialized clusters of SMP-based computers, succeeded in the commercial market. Fortunately, in our case, MPP has proven to be a cost-effective and suitable design for relational databases because of the relative ease and substantial benefit of parallelizing the functional operations of SQL, compared with the complex scientific calculations implemented in conventional programming languages. The difficulty of converting other types of algorithms and operators to an efficient massively parallel processing implementation will become clear as we examine statistical and spatial operations and the time it took for these to emerge in MPP databases.
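To illustrate why SQL's functional operators lend themselves to shared-nothing parallelism, the following is a minimal Python sketch of a grouped SUM, not any vendor's implementation. The function names, the hash-based distribution scheme, and the sample column names are assumptions made for illustration. The "nodes" are simulated as in-memory lists and processed one after another; in a real MPP database each partition would reside on a separate node with its own processor and storage, and the local step would run concurrently.

```python
from collections import defaultdict


def hash_partition(rows, dist_key, n_nodes):
    """Assign each row to one simulated node by hashing its distribution key."""
    partitions = [[] for _ in range(n_nodes)]
    for row in rows:
        partitions[hash(row[dist_key]) % n_nodes].append(row)
    return partitions


def local_sum(partition, group_key, value_key):
    """Step a node can perform alone: aggregate only the rows it owns."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def merge_sums(partials):
    """Final step: combine the small per-node partial results."""
    merged = defaultdict(int)
    for partial in partials:
        for group, subtotal in partial.items():
            merged[group] += subtotal
    return dict(merged)


def parallel_group_sum(rows, dist_key, group_key, value_key, n_nodes=4):
    """Roughly: SELECT group_key, SUM(value_key) FROM rows GROUP BY group_key."""
    partitions = hash_partition(rows, dist_key, n_nodes)
    # Each partition is independent of the others; here they are
    # processed sequentially only to keep the sketch simple.
    partials = [local_sum(p, group_key, value_key) for p in partitions]
    return merge_sums(partials)


if __name__ == "__main__":
    # Hypothetical sample data, invented for this sketch.
    employees = [
        {"emp_id": 1, "dept": "HR", "salary": 50},
        {"emp_id": 2, "dept": "ENG", "salary": 80},
        {"emp_id": 3, "dept": "HR", "salary": 60},
        {"emp_id": 4, "dept": "ENG", "salary": 90},
        {"emp_id": 5, "dept": "OPS", "salary": 40},
    ]
    print(parallel_group_sum(employees, "emp_id", "dept", "salary"))
```

Because SUM is associative and commutative, the result does not depend on how rows are distributed or on the number of nodes; only the small partial results need to be exchanged between nodes. Operators without this property are harder to decompose in this way, which is consistent with the difficulty noted above for other classes of algorithms.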

One early discussion of the shared-nothing architecture is a paper by Dr. Michael Stonebraker (1985) at Berkeley that contrasts the shared-nothing architecture with shared-memory and shared-disk approaches, addressing concerns that shared-nothing would not be able to scale to large transaction volumes. Dr. Stonebraker led the INGRES® project, a foundation for modern relational databases (IEEE Computer Society, 2005), and subsequently created POSTGRES®, which extended the capabilities of relational databases in many application areas and whose open source incarnation is used within an MPP product from Greenplum (Thoo & Beyer, 2008). Dr. Stonebraker also founded Vertica in 2005, designing a grid of commodity Linux servers supporting a columnar parallel database. This approach moves away from the row-based relational database and is targeted at low cost star schema data marts, rather than broad normalized data warehouses.
