Storage Infrastructure for Big Data and Cloud

Anupama C. Raman
DOI: 10.4018/978-1-4666-5864-6.ch005


Unstructured data is growing exponentially. Present-day storage infrastructures such as Storage Area Networks (SAN) and Network Attached Storage (NAS) are not well suited to storing huge volumes of unstructured data. This has led to the development of new types of storage technologies such as object-based storage. Big Data refers to huge amounts of both structured and unstructured data that need to be made available in real time for analytical insights. Because of the distinct nature of big data, the storage infrastructure used for it should possess some specific features. In this chapter, the authors examine the storage technology options that are currently available and their suitability for storing big data. The chapter also provides a bird's-eye view of cloud storage technology, which is widely used for big data storage.
Chapter Preview


In the initial stages of its evolution, the Storage Area Network (SAN) was perceived as a client-server system with the server attached to a collection of storage devices by means of a bus. In many scenarios, the client systems were directly connected to the storage devices as well. These storage architectures were referred to as Direct Attached Storage (DAS). The high-level architecture of a DAS system is given in Figure 1.

Figure 1. Architecture of DAS


The architecture given above has three main tiers:

  • 1.

    Tier one comprises the client devices, which are connected to the application server through some kind of switch.

  • 2.

    Tier two comprises the application servers where the applications are hosted. The application servers have Input/Output (I/O) controllers that control I/O operations to the attached storage devices. Each I/O controller is designed to work with the specific interface used to connect to the storage devices. If the attached storage devices support different types of interfaces, there will be one I/O controller for each type of interface.

The following are some of the key types of connectivity interfaces supported by the storage devices in a DAS system:

Small Computer System Interface (SCSI)

SCSI is a set of American National Standards Institute (ANSI) standard electronic interfaces. Parallel SCSI (often referred to simply as SCSI) is one of the most popular forms of storage interface. It is mainly used to connect disk drives and tapes to servers or client devices. It can also be used to connect other devices such as printers and scanners. Communication between the source (servers or client devices) and the attached storage devices is done using the SCSI command set. The Ultra320 version of parallel SCSI provides data transfer speeds of 320 MB/s. There is also a serial version of SCSI called Serial Attached SCSI (SAS), which offers better performance and scalability than parallel SCSI. The SAS-2 generation supports data transfer rates of 6 Gb/s.

Integrated Device Electronics/Advanced Technology Attachment (IDE/ATA)

The term IDE/ATA reflects the dual naming convention used for the various generations and variants of this interface. The IDE component in IDE/ATA provides the specification for the controllers that are connected to the computer's motherboard and communicate with the attached device. The ATA component is the interface for connecting storage devices, such as CD-ROMs, floppy disk drives, and HDDs, to the motherboard. The latest version of IDE/ATA, called Ultra DMA (UDMA), supports data transfer rates of 133 MB/s. The serial version of the IDE/ATA specification is called Serial ATA (SATA). It provides data transfer speeds of up to 6 Gb/s.
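The speeds quoted above mix units: the parallel interfaces are given in megabytes per second, the serial ones in gigabits per second. The following minimal sketch puts them on one scale. It assumes that the 6 Gb/s serial links use 8b/10b line encoding (10 bits on the wire per data byte); the function and dictionary names are illustrative only.

```python
# Illustrative comparison of the interface speeds quoted in the text.
# Assumption: the 6 Gb/s serial links (SAS, SATA) use 8b/10b encoding,
# so 10 bits on the wire carry one byte of data.

def serial_mb_per_s(line_rate_gbit: float, wire_bits_per_byte: int = 10) -> float:
    """Convert a serial line rate in Gb/s to approximate usable MB/s."""
    return line_rate_gbit * 1000 / wire_bits_per_byte

# Parallel-bus figures are already quoted in MB/s.
INTERFACES = {
    "Ultra320 SCSI": 320.0,
    "Ultra DMA/133 (IDE/ATA)": 133.0,
    "SAS at 6 Gb/s": serial_mb_per_s(6),
    "SATA at 6 Gb/s": serial_mb_per_s(6),
}

for name, mb_per_s in INTERFACES.items():
    print(f"{name:26s} ~{mb_per_s:.0f} MB/s")
```

Under this assumption, a 6 Gb/s serial link carries roughly 600 MB/s of data, which is why the serial interfaces outpace their parallel predecessors despite using far fewer wires.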

  • 3.

    Tier three comprises the storage devices. The connections to these storage devices are controlled by an I/O controller attached to the application server. These storage devices are typically disk drives or tape drives. Tape is a popular medium for storing backup data because of its relatively low cost. However, tape has the following limitations:

    • Data is stored linearly on the tape. Search and retrieval are done sequentially, invariably taking several seconds. This limits the use of tape for applications that require real-time, rapid access to data.

    • In a multi-user environment, data stored on tape cannot be accessed by multiple applications simultaneously.

    • On a tape drive, the read/write head touches the tape surface, so the tape degrades or wears out after repeated use.

    • The storage and retrieval requirements of data from tape and the overhead associated with managing tape media are significant.

Despite these limitations, tape is still a preferred option for storing backup data and other types of data that are not accessed frequently.

Disk drives are a very popular storage medium for performance-intensive and real-time applications. Disks support rapid access to data at random locations, which allows data to be read or written quickly by a large number of simultaneous users or applications. In addition, disks have a large storage capacity.
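The contrast between sequential access on tape and random access on disk can be illustrated with a toy cost model. This is a sketch under simplifying assumptions, not a performance model of real devices: the tape head is assumed to pass over every block between its current position and the requested one, each disk request is assumed to cost a single seek regardless of distance, and the request pattern is hypothetical.

```python
# Toy cost model for sequential (tape) versus random (disk) access.
# All block numbers below are hypothetical and chosen only for illustration.

def tape_blocks_traversed(requests, start=0):
    """Sequential medium: the head winds past every block between
    its current position and the requested block."""
    position, traversed = start, 0
    for block in requests:
        traversed += abs(block - position)
        position = block
    return traversed

def disk_seeks(requests):
    """Random-access medium: one seek per request, independent of
    distance (a deliberate simplification of real disk behaviour)."""
    return len(requests)

requests = [900, 10, 500, 20]  # scattered block addresses
print("tape blocks traversed:", tape_blocks_traversed(requests))
print("disk seeks:", disk_seeks(requests))
```

For scattered requests, the tape cost grows with the distance between blocks while the disk cost grows only with the number of requests, which matches the reasons given above for using tape for backups and disk for real-time workloads.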

Key Terms in this Chapter

P2P: A peer-to-peer (P2P) network is a type of decentralized and distributed network architecture in which individual nodes in the network (called "peers") act as both suppliers and consumers of resources, in contrast to the centralized client-server model, where client nodes request access to resources provided by central servers.

Encryption: The process of encoding messages (or information) in such a way that eavesdroppers or hackers cannot read them, but authorized parties can.

Cluster: A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks (“LAN”), with each node (computer used as a server) running its own instance of an operating system.

TCP/IP: The Internet protocol suite is the networking model and a set of communications protocols used for the Internet and similar networks. It is commonly known as TCP/IP. It provides end-to-end connectivity specifying how data should be formatted, addressed, transmitted, routed and received at the destination.

Grid: Grid computing is the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files.

Network File System (NFS): Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems in 1984 allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed.

Common Internet File System (CIFS): The Common Internet File System (CIFS) is a standard way for computer users to share files across corporate intranets and the Internet. It is the native file-sharing protocol of the Windows operating system.
