Distributed Streaming Big Data Analytics for Internet of Things (IoT)


Sornalakshmi Krishnan (SRM University, India) and Kayalvizhi Jayavel (SRM University, India)
DOI: 10.4018/978-1-5225-3142-5.ch012


This chapter discusses the integration of distributed streaming Big Data Analytics with the Internet of Things. It begins by introducing the two technologies along with their features and characteristics, then discusses how integrating them enables efficient processing of sensor data generated by IoT devices. Such data-centric processing of IoT data, powered by cloud, services and other enablers, will be the architecture of most real-time systems involving sensors, real-time monitoring and actuation. The Volume, Variety and Velocity of sensor-generated data make it a Big Data scenario. In addition, the data arrives in real time and requires immediate decisions or actuations. This chapter discusses how IoT data can be processed using distributed, scalable stream processing systems. The chapter concludes with future directions for such real-time Big Data Analytics in IoT.
Chapter Preview


The Internet of Things has been identified as an emerging technology that will transform our environment into a more connected and smarter world. Cisco predicts that over the next five years, global IP networks will support up to 10 billion new devices and connections, increasing from 16.3 billion in 2015 to 26.3 billion by 2020. The projection is 3.4 devices and connections per capita by 2020, up from 2.2 per capita in 2015. Companies across sectors have ventured into IoT applications relevant to their own domains. Cisco, Juniper and other networking companies have started talking about Edge, Mist and Fog Analytics as the next futuristic technologies for IoT.

MathWorks has acquired ThingSpeak, a cloud-based company, and has developed an extensive toolbox for the Internet of Things. The toolbox supports many open source hardware platforms, such as Raspberry Pi and Arduino. IBM has come up with the Bluemix cloud, and Google with Brillo, its OS for the Internet of Things. The Internet of Things allows us to envisage the evolution of the internet as a huge network of connected intelligent devices. These ubiquitous connected things not only sense but also process and analyze real physical events, ranging from simple to complex, and trigger actions as the need demands.

As the number of connected devices increases, the rate at which data is generated and must be processed also increases. This requirement has led to the adoption of technical advancements such as Cloud, Software Oriented Architectural models, Software Defined Networks, Machine Learning, Artificial Intelligence and many more to make the things around us smarter, faster and dynamically intelligent.

In such a connected environment, the enormous amount of data generated by networked devices has to be processed both in real time and in batches. The data generated by IoT devices possesses the characteristics of Big Data in terms of Volume, Velocity, Variety and Value. Heterogeneous devices, when connected together, produce huge amounts of data from which useful inferences or decisions have to be drawn.

The powerful paradigm of MapReduce, along with implementing frameworks like Hadoop, has made Big Data processing easier. Streaming Analytics is a sub-area of Big Data that analyzes huge amounts of data arriving at high velocity and expects the actuation or decision in real time, with low latency on the order of seconds. With the growing number of devices connecting to the internet and the need for real-time decision making, intelligent applications like self-driving are gaining importance. Frameworks for Streaming Analytics should possess the basic characteristics of Fault Tolerance, High Availability, Low Latency and Scalability. Depending on the requirements of the application, processing can follow either the store-process-react style or the process-react-optional-store style.
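The difference between the two processing styles named above can be illustrated with a minimal plain-Python sketch. It is not tied to any particular streaming framework; the function names, the `temp_c` field and the threshold value are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional

Reading = dict  # hypothetical record shape, e.g. {"sensor": "t1", "temp_c": 71.0}


def store_process_react(
    readings: Iterable[Reading],
    threshold: float,
    react: Callable[[Reading], None],
) -> List[Reading]:
    """Store all readings first, then process the stored batch and react."""
    store = list(readings)                                 # 1. store
    hot = [r for r in store if r["temp_c"] > threshold]    # 2. process
    for r in hot:                                          # 3. react
        react(r)
    return store


def process_react_optional_store(
    readings: Iterable[Reading],
    threshold: float,
    react: Callable[[Reading], None],
    store: Optional[List[Reading]] = None,
) -> None:
    """Process each reading as it arrives and react at once; storing is optional."""
    for r in readings:
        if r["temp_c"] > threshold:     # 1. process
            react(r)                    # 2. react immediately
        if store is not None:           # 3. optionally store
            store.append(r)


sample = [
    {"sensor": "t1", "temp_c": 65.0},
    {"sensor": "t2", "temp_c": 82.5},
    {"sensor": "t1", "temp_c": 71.0},
]
batch_alerts: List[Reading] = []
stream_alerts: List[Reading] = []
store_process_react(sample, 70.0, batch_alerts.append)
process_react_optional_store(iter(sample), 70.0, stream_alerts.append)
```

Both functions raise the same alerts on the same input. What differs is when the reaction happens: the first reacts only after the whole input has been stored, while the second reacts per reading and never needs to hold the full data set.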

The cloud, with its different flavors and characteristics such as Elasticity, Pay-per-service and Scalability, provides the best performance for centralized Storage, Analytics and Visualization in IoT. Cloud services are available and provisioned without any human intervention and follow the pay-as-you-go model. In addition, the problems of over-provisioning and under-provisioning found in static, fixed provisioning environments do not arise when the cloud is used. Scalability and Load Balancing help maintain the Quality of Service promised to customers. By offering most components as a service, the cloud environment takes away most of the complexity otherwise handled at the user level. This enables users to concentrate on business processing rather than infrastructure.

Service science is an emerging companion to Streaming Analytics in the cloud, with numerous research areas such as Service Discovery, Composition and Orchestration. Service-oriented Cloud Computing is a supporting framework for a cohesive set of cloud components. Big streaming data, as opposed to traditional services data, is not structured or similar between services. The IoT context has a huge variety of sources, such as a Wireless Sensor Network monitoring forest fires, a weather station, pollution monitoring or home automation, which vary enormously in their data formats. The data exchange needed before or after Data Analytics will be taken care of by services.

The IoT paradigm, along with Big Data, Cloud and Service Science, promises a revolutionary architecture suitable for most critical IT applications, ranging from smart grids to smart connected communities (Sun, Song, Jara & Bie, 2016).

Key Terms in this Chapter

NoSQL: A NoSQL (originally referring to “non SQL”, “non-relational” or “not only SQL”) database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. The data structures used by NoSQL databases (e.g. Key-Value, Wide Column, Graph or Document) are different from those used by default in relational databases, making some operations faster in NoSQL. (Wikipedia, 2017f)

WSN (Wireless Sensor Networks): Wireless Sensor Networks (WSN), sometimes called Wireless Sensor and Actuator Networks (WSAN), consist of spatially distributed autonomous sensors that monitor physical or environmental conditions such as Temperature, Sound, Pressure, etc. and cooperatively pass their data through the network to a main location. (Wikipedia, 2017c)

Enterprise Service Bus (ESB): An Enterprise Service Bus (ESB) implements a communication system between mutually interacting software applications in a Service-Oriented Architecture (SOA). As it implements a software architecture for Distributed Computing, it also implements a special variant of the more general Client-Server model, in which any application using the ESB can behave as Server or Client in turn. ESB promotes agility and flexibility with regard to high protocol-level communication between applications. The primary goal of high protocol-level communication is Enterprise Application Integration (EAI) of heterogeneous and complex landscapes. (Wikipedia, 2017e)

Gartner: An American research and advisory firm providing Information Technology related insight for IT and other business leaders located across the world. Its headquarters are in Stamford, Connecticut, United States. It was known as Gartner Group, Inc. until 2000, when the name was changed to Gartner. (Wikipedia, 2017b)

Platform as a Service (PaaS): The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using Programming Languages, Libraries, Services and Tools supported by the provider. The consumer does not manage or control the underlying Cloud infrastructure including Network, Servers, Operating Systems or Storage but has control over the deployed applications and possibly configuration settings for the application hosting environment. (Mell & Grance, 2011)

Infrastructure as a Service (IaaS): The capability provided to the consumer is to provision Processing, Storage, Networks and other fundamental computing resources where the consumer is able to deploy and run arbitrary software which can include Operating Systems and applications. The consumer does not manage or control the underlying Cloud infrastructure but has control over Operating Systems, Storage and Deployed Applications; and possibly limited control of selected networking components (e.g., host firewalls). (Mell & Grance, 2011c)

JSON (JavaScript Object Notation): JSON is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is the most common data format used for asynchronous browser/server communication, largely replacing XML, which is used by Ajax. JSON is a language-independent data format. It derives from JavaScript, but as of 2016 many programming languages include code to generate and parse JSON-formatted data. The official Internet media type for JSON is application/json. JSON filenames use the extension .json. (Wikipedia, 2017g)

MQTT (Message Queuing Telemetry Transport): MQTT (MQ Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) Publish-Subscribe based “lightweight” Messaging Protocol for use on top of the TCP/IP protocol. It is designed for connections with remote locations where a “small code footprint” is required or the Network Bandwidth is limited. (Wikipedia, 2017d)

Resilient Distributed Datasets (RDDs): Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform In-Memory Computations on large clusters in a Fault Tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: Iterative Algorithms and Interactive Data Mining Tools. Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in Stable Storage or (2) other RDDs. (Zaharia, 2012)

Discretized Streams (DStreams): A Discretized Stream or D-Stream groups together a series of RDDs and lets the user manipulate them through various operators. D-Streams provide both stateless operators, such as map, which act independently on each time interval, and stateful operators, such as aggregation over a Sliding Window, which operate on multiple intervals and may produce intermediate RDDs as state. (Zaharia, 2013)

Internet of Things (IoT): The internetworking of physical devices, vehicles (also referred to as “connected devices” and “smart devices”), buildings and other items embedded with Electronics, Software, Sensors, Actuators and Network Connectivity that enable these objects to collect and exchange data. (Wikipedia, 2017a)

Software as a Service (SaaS): The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface such as a web browser (e.g., web-based email) or a program interface. The consumer does not manage or control the underlying cloud infrastructure including Network, Servers, Operating Systems, Storage or even individual application capabilities with the possible exception of limited user specific application configuration settings. (Mell & Grance, 2011)
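The Discretized Streams entry above distinguishes stateless operators, which act on each time interval independently, from stateful operators such as aggregation over a Sliding Window. The following is a minimal plain-Python sketch of that distinction. It models each interval as an ordinary list of numbers rather than an RDD and does not use the Spark Streaming API; the function names and sample data are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List


def stateless_map(
    batches: Iterable[List[int]], fn: Callable[[int], int]
) -> Iterator[List[int]]:
    """Stateless operator: apply fn to each interval's records independently."""
    for batch in batches:
        yield [fn(x) for x in batch]


def sliding_window_sum(
    batches: Iterable[List[int]], window_len: int
) -> Iterator[int]:
    """Stateful operator: sum over the most recent window_len intervals."""
    # The deque is the state carried from one interval to the next;
    # it holds one partial sum per interval still inside the window.
    window: Deque[int] = deque(maxlen=window_len)
    for batch in batches:
        window.append(sum(batch))
        yield sum(window)


# Four consecutive intervals (micro-batches) of hypothetical sensor counts.
batches = [[1, 2], [3], [4, 5], []]
doubled = list(stateless_map(batches, lambda x: 2 * x))
windowed = list(sliding_window_sum(batches, window_len=2))
print(doubled)   # [[2, 4], [6], [8, 10], []]
print(windowed)  # [3, 6, 12, 9]
```

The stateless operator needs nothing beyond the current interval, whereas the sliding-window operator must retain per-interval partial results, which corresponds to the intermediate state mentioned in the definition.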
