Parallelizing Large-Scale Graph Algorithms Using the Apache Spark-Distributed Memory System

Parallelizing Large-Scale Graph Algorithms Using the Apache Spark-Distributed Memory System

Ahmad Askarian (University of Texas at Dallas, USA), Rupei Xu (University of Texas at Dallas, USA) and Andras Farago (University of Texas at Dallas, USA)
DOI: 10.4018/978-1-5225-2814-2.ch009


The rapidly emerging area of Social Network Analysis is typically based on graph models. They include directed/undirected graphs, as well as a multitude of random graph representations that reflect the inherent randomness of social networks. A large number of parameters and metrics are derived from these graphs. Overall, this gives rise to two fundamental research/development directions: (1) advancements in models and algorithms, and (2) implementing the algorithms for huge real-life systems. The model and algorithm development part deals with finding the right graph models for various applications, along with algorithms to treat the associated tasks, as well as computing the appropriate parameters and metrics. In this chapter we would like to focus on the second area: on implementing the algorithms for very large graphs. The approach is based on the Spark framework and the GraphX API which runs on top of the Hadoop distributed file system.
Chapter Preview


Spark is an open source, in-memory big data processing framework in a distributed environment. It started as a research program in 2009 and became an open source project in 2010. In 2014 it was released as an Apache incubator project (Xin et al, 2013).

Spark is evolved from Hadoop MapReduce so it can be run on Hadoop cluster and data in the Hadoop distributed File System (HDFS). It supports a wide range of workloads, such as Machine Learning, Business Intelligence, streaming and batch processing. Spark was created to complement, rather than replace Hadoop. The Spark core is accompanied by a set of powerful, higher-level libraries which can be used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, as shown in Figure 1.

Figure 1.

Spark full stack

In order to efficiently use the processing resources of a cluster, Spark needs a cluster resource manager. Yet Another Resource Negotiator (YARN) is a Hadoop processing layer that contains a resource manager and a job scheduler. Yarn allows multiple applications to run on a single Hadoop Cluster. Figure 2 illustrates how Spark uses Yarn as a distributed resource manager (Vavilapalli et al, 2013).

Figure 2.

Yet Another Resource Manager

Although Spark is designed for in-memory computation, it is capable of handling workloads larger than the cluster aggregate memory. Almost all the Spark built-in functions automatically split to local disks when the working data set does not fit in memory. In the next two section we outline the difference between Spark and MapReduce, as well as the concept of Resilient Distributed Dataset (RDD) in Spark (Meng, Bradley et al, 2016).


Apache Spark Vs. Hadoop Mapreduce

Apache Spark improvements over Hadoop MapReduce are characterized by efficiency and usability, as shown in Figure 3. In order to improve efficiency, it offers in-memory computing capability, which can provide a fast running environment for applications that need to reuse and share data across computations. Having different languages with integrated APIs, such as Java, Scala, Python and R, improve Spark’s usability, as compared to MapReduce.

Figure 3.

Apache Spark efficiency and usability

Next we explain the Hadoop Mapreduce with an example, and then discuss how Spark can improve the efficiency for implementing more complex algorithms (Xin, Deyhim et al, 2014).

MapReduce is a programming model, and an associated implementation, which allows massive scalable data processing across hundreds or thousands of servers. MapReduce refers to two separate and distinct tasks, needed for big data processing. The first one is the map task, which converts a set of data to another set of data called tuples (key/value pairs). The reduce task takes the map output and combines those data tuples into smaller sets of tuples. In the MapReduce processing model the reduce task always runs after the map task. A simple MapReduce example, described in (ibm01, 2011) is the following.

Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. Of course, the real world applications won’t be quite so simple, as it’s likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small the amount of data you need to analyze, the key principles we’re covering here remain the same. Either way, in this example, city is the key and temperature is the value as shown in Figure 4.

Complete Chapter List

Search this Book: