Special Offers
- IGI Global’s New Emerging Topic e-Book Collections
  Acquire highly focused and affordable Cutting-Edge Peer-Reviewed Research Content through a selection of 17 topic-focused e-Book Collections discounted up to 90%, compared to list prices. Collection topics include Artificial Intelligence, Data Science, Language Learning, Marketing and Customer Relations, Sustainability, and many more. Hosted on the InfoSci^® platform, these collections feature no DRM, no additional cost for multi-user licensing, no embargo of content, full-text PDF & HTML format, and more.
  Learn More
- Open Access Book (Free Access) - Encyclopedia of Information Science and Technology, Sixth Edition (ISBN: 9781668473665)
  The Encyclopedia of Information Science and Technology, Sixth Edition) continues the legacy set forth by the first five editions by providing comprehensive coverage and up-to-date definitions of the most important issues, concepts, and trends pertaining to technological advancements and information management within a variety of settings and industries. The entire book is being published under open access.
  Read Now
- Open Access Book (Free Access) - Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries (ISBN: 9781668456293)
  Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries provides information on the recent technology, mitigation, and environmental protection that must be applied for food sustainability in developing countries. This book is being published under Platinum Open Access through funding from Diponegoro University, Indonesia.
  Read Now
- Open Access Book (Free Access) - New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY (ISBN: 9781668438091)
  The Walmart Corporation and the Lumina Foundation have provided funding to make New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY fully open access, completely removing any paywall between scholars in education and the latest research on new models for the future of higher education.
  Read Now
- Open Access Book (Free Access) - Handbook of Research on the Global View of Open Access and Scholarly Communications (ISBN: 9781799898054)
  Through a collaboration between IGI Global and the University of North Texas, the Handbook of Research on the Global View of Open Access and Scholarly Communications has been published as fully open access, completely removing any paywall between researchers of any field, and the latest research on the equitable and inclusive nature of Open Access and all of its complications.
  Read Now
Books
- - Books by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Books by Field
Journals
- - Journals
  - OnDemand Journal Articles
  - Journals by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Journals by Field
e-Collections
Open Access
- View All Open Access Opportunities
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Find an Open Access Journal for Your Next Manuscript
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Submit an Open Access Book Proposal
  Learn more about open access book publishing and how it can propel your research forward in the field.
  Convert Your Work to Open Access
  Already published? You can convert your work to open access to increase its impact through IGI Global’s Restrospective Open Access Program.
  Utilize Open Access Collection Database
  Open up your research potential by utilizing our open access content or integrating the open access collection into your library
  Consider Open Access Agreements
  For Libraries: consider no-cost or investment-level open access agreements with IGI Global to support your faculty's research endeavors.
  Search Funding Resources
  Looking for additional funding resources to support your open accesss endeavors? View industry resources compiled by our open access team.
  Review Open Access Policies & Ethical Guidelines
  Considering IGI Global to publish your work under open access? Review IGI Global’s open access policies and ethical guidelines
Publish with Us
Resources
- - Instructors
  - Course Adoption
  - Teaching Cases
  - K-12 Online Learning Collection
  - Authors and Editors
  - eEditorial Discovery^® System
  - Peer Review Process
  - Ethics and Malpractice
  - COPE Membership
  - Fair Use Policy
  - Open Access Publishing
  - FAQ
Catalogs
About Us
Newsroom

Parallelizing Large-Scale Graph Algorithms Using the Apache Spark-Distributed Memory System

Ahmad Askarian, Rupei Xu, Andras Farago

Source Title: Graph Theoretic Approaches for Analyzing Large-Scale Social Networks

DOI: 10.4018/978-1-5225-2814-2.ch009

OnDemand:

(Individual Chapters)

Available

$37.50

Current Special Offers

No Current Special Offers

Abstract

The rapidly emerging area of Social Network Analysis is typically based on graph models. They include directed/undirected graphs, as well as a multitude of random graph representations that reflect the inherent randomness of social networks. A large number of parameters and metrics are derived from these graphs. Overall, this gives rise to two fundamental research/development directions: (1) advancements in models and algorithms, and (2) implementing the algorithms for huge real-life systems. The model and algorithm development part deals with finding the right graph models for various applications, along with algorithms to treat the associated tasks, as well as computing the appropriate parameters and metrics. In this chapter we would like to focus on the second area: on implementing the algorithms for very large graphs. The approach is based on the Spark framework and the GraphX API which runs on top of the Hadoop distributed file system.

Chapter Preview

Top

Introduction

Spark is an open source, in-memory big data processing framework in a distributed environment. It started as a research program in 2009 and became an open source project in 2010. In 2014 it was released as an Apache incubator project (Xin et al, 2013).

Spark is evolved from Hadoop MapReduce so it can be run on Hadoop cluster and data in the Hadoop distributed File System (HDFS). It supports a wide range of workloads, such as Machine Learning, Business Intelligence, streaming and batch processing. Spark was created to complement, rather than replace Hadoop. The Spark core is accompanied by a set of powerful, higher-level libraries which can be used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, as shown in Figure 1.

Figure 1.

Spark full stack

In order to efficiently use the processing resources of a cluster, Spark needs a cluster resource manager. Yet Another Resource Negotiator (YARN) is a Hadoop processing layer that contains a resource manager and a job scheduler. Yarn allows multiple applications to run on a single Hadoop Cluster. Figure 2 illustrates how Spark uses Yarn as a distributed resource manager (Vavilapalli et al, 2013).

Figure 2.

Yet Another Resource Manager

Although Spark is designed for in-memory computation, it is capable of handling workloads larger than the cluster aggregate memory. Almost all the Spark built-in functions automatically split to local disks when the working data set does not fit in memory. In the next two section we outline the difference between Spark and MapReduce, as well as the concept of Resilient Distributed Dataset (RDD) in Spark (Meng, Bradley et al, 2016).

Top

Apache Spark Vs. Hadoop Mapreduce

Apache Spark improvements over Hadoop MapReduce are characterized by efficiency and usability, as shown in Figure 3. In order to improve efficiency, it offers in-memory computing capability, which can provide a fast running environment for applications that need to reuse and share data across computations. Having different languages with integrated APIs, such as Java, Scala, Python and R, improve Spark’s usability, as compared to MapReduce.

Figure 3.

Apache Spark efficiency and usability

Next we explain the Hadoop Mapreduce with an example, and then discuss how Spark can improve the efficiency for implementing more complex algorithms (Xin, Deyhim et al, 2014).

MapReduce is a programming model, and an associated implementation, which allows massive scalable data processing across hundreds or thousands of servers. MapReduce refers to two separate and distinct tasks, needed for big data processing. The first one is the map task, which converts a set of data to another set of data called tuples (key/value pairs). The reduce task takes the map output and combines those data tuples into smaller sets of tuples. In the MapReduce processing model the reduce task always runs after the map task. A simple MapReduce example, described in (ibm01, 2011) is the following.

Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. Of course, the real world applications won’t be quite so simple, as it’s likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small the amount of data you need to analyze, the key principles we’re covering here remain the same. Either way, in this example, city is the key and temperature is the value as shown in Figure 4.

Complete Chapter List

Search this Book:

Reset

MLA

APA

Chicago

Export Reference

Parallelizing Large-Scale Graph Algorithms Using the Apache Spark-Distributed Memory System

Abstract

Introduction

Apache Spark Vs. Hadoop Mapreduce

Complete Chapter List