Abstract
This chapter discusses in detail the storage systems for data-intensive computing using Big Data. It begins with a brief introduction to data-intensive computing and the types of parallel processing approaches, and highlights how data-intensive computing systems differ from other forms of computing. It then discusses the importance of Big Data computing, examines the current and future challenges of storage in genomics in detail, and presents storage and data management strategies. The focus then shifts to the software challenges for storage. Storage use cases such as DataDirect Networks and SDSC are provided, followed by a list of storage tools and their details. A short section discusses the sensor data storage system. A table then lists the top 10 cloud storage systems for data-intensive computing using Big Data in the world. Finally, statistics on the Top 500 Big Data storage servers are presented using images from the Top500 website.
Introduction
Data-intensive computing systems have penetrated every aspect of people’s lives. Behind them is the scientific and commercial processing of massive data, which affects decision making in companies, academia, governments, social sites, and personal lives.
There are two types of data-intensive computing systems that continue to co-exist in the modern computing environment:
1. High Performance Computing (HPC) systems, consisting of tightly coupled compute nodes and storage nodes, are used to execute task parallelism for scientific purposes such as weather forecasting and physics simulation (Rouse, 2017b). Message Passing Interface (MPI) is an example of a computing framework on HPC systems.
2. Big Data systems, comprised of more loosely coupled nodes, are used to execute data parallelism for tasks such as sorting, data mining, and machine learning. MapReduce is an example of a computing framework on Big Data systems (Barney, 2017).
Both HPC systems and Big Data systems are deployed so that multiple users and applications share the computing resources. As a result, 1) resource utilization is high, which drives down the usage cost per application or user and gives users better application responsiveness; and 2) data sets are reused without the overhead of moving them around through redundant input/output (I/O) operations, which also saves storage space.
As computing needs continue to grow in data-intensive computing systems, the shared usage model results in an environment with intense competition for resources. For example, Amazon, Apple and eBay provide HPC and Big Data as cloud services. YARN (Yet Another Resource Negotiator) is one of the key features of Hadoop 2, the second-generation version of the Apache Software Foundation's open source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for Big Data applications, and it provides a scheduler that can incorporate both MapReduce and MPI jobs (Rouse, 2017a).
As the number of concurrent data-intensive applications and the amount of data increase, application I/Os start to saturate the storage and interfere with each other, and storage systems become the bottleneck to application performance. In both HPC and Big Data systems, I/O amplification adds to the I/O contention in the storage systems. To counter failures in these distributed systems, HPC systems employ defensive I/Os such as checkpointing to restart an application from the point where it failed, and Big Data systems replicate persistent data by a factor of k, which grows with the scale of the storage system. Both mechanisms aggravate the I/O contention on the storage. The storage systems can be scaled out, but the ratio of compute nodes to storage nodes is still high, rendering the storage subsystem a highly contended component (Xu, 2016). Therefore, the lack of I/O performance isolation in data-intensive computing systems causes severe storage interference, which compromises the performance targets set by the other resource managers proposed or implemented in a large body of work. Failure to provide applications with guaranteed performance has consequences. Data-intensive applications must complete in bounded time to produce meaningful results. For example, weather forecast data is much less useful once the forecasted time has passed. Paid users in a Big Data system also require a predictable runtime even when the job is not time sensitive, and the provider may lose revenue if jobs fail to complete in a timely manner (Xu, 2016).
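The I/O amplification described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a model taken from the cited work: the function names, the replication factor of 3, and the checkpoint sizes are illustrative assumptions chosen only to show how replication and checkpointing multiply the bytes that the storage system must absorb.

```python
def replicated_bytes(payload_bytes: int, k: int) -> int:
    """Bytes written to storage when persistent data is replicated k times."""
    if k < 1:
        raise ValueError("replication factor k must be at least 1")
    return payload_bytes * k


def checkpoint_bytes(state_bytes_per_node: int, nodes: int, checkpoints: int) -> int:
    """Bytes written when every node dumps its state at each checkpoint."""
    return state_bytes_per_node * nodes * checkpoints


def amplification(useful_bytes: int, total_bytes: int) -> float:
    """Ratio of bytes actually written to bytes the application needs to keep."""
    return total_bytes / useful_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3

    # Hypothetical Big Data job: 100 GiB of output, replication factor k = 3.
    output = 100 * GIB
    written = replicated_bytes(output, k=3)
    print("Big Data amplification:", amplification(output, written))

    # Hypothetical HPC job: 100 GiB of final results, plus 10 checkpoints
    # of 2 GiB of state on each of 64 nodes.
    results = 100 * GIB
    defensive = checkpoint_bytes(2 * GIB, nodes=64, checkpoints=10)
    print("HPC amplification:", amplification(results, results + defensive))
```

Under these assumed numbers, the extra bytes come entirely from fault tolerance rather than from the application's useful output, which is why both mechanisms add to contention on shared storage.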
This chapter addresses the problems stated above for data-intensive computing systems. It provides different approaches for HPC storage systems and Big Data storage systems because their differences in principles, architecture, and usage pose distinct challenges. Before these systems are studied and their respective problems addressed separately, the differences between the two types of systems are discussed here (Xu, 2016).
Key Terms in this Chapter
Storage-as-a-Service: A storage environment that could be seen as an alternative for small and medium-sized businesses that lack the capital budget and technical personnel to implement and maintain their own storage infrastructure. (Hosken, 2016)
Data-Intensive Computing: Computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive. (Data-intensive computing, 2017)
Pregel: A Google system that facilitates the processing of large-scale graphs. Applications include the analysis of network graphs and social networking services. The output of a Pregel program is the set of values output by all the vertices, and the program's input and output are isomorphic directed graphs. (Chen et al., 2014)
Polyglot Pattern: A Big Data storage pattern that allows multiple storage mechanisms such as RDBMS, Hadoop, and other Big Data appliances to co-exist in a solution. This scenario is known as “Polyglot Persistence.” (Sawnat & Shah, 2013)
MapReduce: A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. (Chen et al., 2014)
InfiniBand: A computer-networking communications standard (abbreviated IB) used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. (InfiniBand, 2017)
All-Pairs: A system specifically designed for biometrics, bioinformatics, and data mining applications. All-Pairs focuses on comparing element pairs in two databases by a given function. (Chen et al., 2014)
Façade Pattern: A Big Data storage pattern in which the Hadoop Distributed File System (HDFS) serves as the intermittent façade (or interface that hides the complexities) for the larger traditional Data Warehouse (DW) systems. (Sawnat & Shah, 2013)
High-Performance-Computing-as-a-Service (HPCaaS): This model facilitates the execution of HPC applications on the cloud by giving users on-demand access to a scalable and reliable pool of high performance computing resources. (Radadiya & Rohokale, 2016)
Lean Pattern: A Big Data storage pattern that uses an HBase implementation with only one column family, only one column, and a unique row key. (Sawnat & Shah, 2013)
High Performance Storage System (HPSS): HPSS is software that manages petabytes of data on disk and robotic tape libraries. HPSS provides highly flexible and scalable hierarchical storage management that keeps recently used data on disk and less recently used data on tape. HPSS uses cluster, LAN and/or SAN technology to aggregate the capacity and performance of many computers, disks, and tape drives into a single virtual file system of exceptional size and versatility. This approach enables HPSS to easily meet otherwise unachievable demands of total storage capacity, file sizes, data rates, and number of objects stored. (IBM, 2017)
Column-Oriented Databases: Databases that store and process data by columns rather than by rows. Columns and rows are segmented across multiple nodes to achieve expandability. (Chen et al., 2014)
NoSQL Pattern: NoSQL databases can play a role in a Hadoop implementation because they can store data on local Network File System (NFS) disks as well as on the Hadoop Distributed File System (HDFS). (Sawnat & Shah, 2013)
Message Passing Interface (MPI): The Message Passing Interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory. (WhoIsHostingThis.com, 2017)
Platform for Nimble Universal Table Storage (PNUTS): A large-scale, parallel, geographically distributed system for Yahoo!’s web applications. It relies on a simple relational data model in which data is organized into a property record table. In the physical layer of PNUTS, the system is divided into different regions, each of which includes a complete set of system components and complete copies of the tables. (Chen et al., 2014)
Key-Value Database: Databases built on a simple data model in which data is stored against keys, and every key is unique. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. (Chen et al., 2014)
Platform-as-a-Service (PaaS): A category of cloud computing services that provides a platform allowing customers to develop, run, and manage Web applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app. (Platform-as-a-Service, 2017)
High-Performance Computing (HPC): The use of parallel processing, that is, the processing of program instructions by dividing them among multiple processors, with the objective of running a program in less time. (Rouse, 2017b)
Dryad: A general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertices represent programs and edges represent data channels. (Chen et al., 2014)
Parallel Virtual File System (PVFS): The Parallel Virtual File System (PVFS) is an open source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. (Parallel Virtual File System, 2017)
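To illustrate the MapReduce programming model defined in the key terms above, the following is a minimal single-process sketch of a word count. It is not Hadoop code and does not use any real MapReduce library: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the in-memory shuffle only stands in for the grouping that a real framework would perform across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data storage", "big data computing"]))
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` depends only on its own key, both steps could in principle run on different nodes at the same time, which is the data parallelism that Big Data systems exploit.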