Efficient Execution of Dataflows on Parallel and Heterogeneous Environments

George Teodoro
DOI: 10.4018/978-1-4666-2533-4.ch001

Abstract

Current advances in computer architectures have transformed clusters, the traditional high-performance distributed platforms, into hierarchical environments in which each node has multiple heterogeneous processing units, including accelerators such as GPUs. Although parallel heterogeneous environments are becoming common, their efficient use is still an open problem. Current tools for developing parallel applications are mainly concerned with the exclusive use of accelerators, whereas this chapter argues that adequate coordination of heterogeneous computing cores can significantly improve performance. The approach taken in this chapter to use such environments efficiently, evaluated in the context of replicated dataflow applications, consists of scheduling processing tasks according to their characteristics and to the processors' specificities. We can thus better utilize the available hardware by trying to execute each task on the processor best suited for it. The proposed approach has been evaluated using two applications for which CPU-only and GPU-only implementations were previously available. The experimental results show that using both devices simultaneously can improve performance significantly; moreover, the proposed method doubled the performance of a demand-driven approach that utilizes both the CPU and the GPU on the two applications in several scenarios.

Introduction

With the current advances in computer architectures, traditional distributed platforms are rapidly transforming into hierarchical environments in which each computing node may have multiple processors that are themselves multi-core systems. The same advances have also brought forth new highly parallel architectures, such as GPUs, which have been used systematically as co-processors capable of significantly improving application performance under certain conditions. The end result is the availability of clusters of computers, each with several heterogeneous computing devices available for processing.

The problem discussed in this chapter is that of efficiently executing component-based applications on heterogeneous clusters of GPU-equipped multi-core computers. The proposed techniques are evaluated in the context of the filter-stream programming model, a type of dataflow system where applications are decomposed into components, called filters, which can be replicated on multiple nodes of a distributed system. These filters communicate using logical streams that are capable of delivering data from an instance of the source filter to an instance of the destination filter. Filters, in our run-time framework, are multi-threaded and may have different versions of their codes so they can execute on heterogeneous processors.
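To make the filter-stream model concrete, the sketch below shows filters with per-device code variants connected by logical streams. All names (`Filter`, `connect`, `process`) are hypothetical illustrations, not the API of the chapter's actual run-time framework, and plain queues stand in for distributed streams.

```python
from queue import Queue

class Filter:
    """A filter-stream component with per-device code variants (hypothetical sketch)."""
    def __init__(self, name, cpu_fn, gpu_fn=None):
        self.name = name
        self.variants = {"cpu": cpu_fn}   # a filter may have several versions
        if gpu_fn is not None:
            self.variants["gpu"] = gpu_fn
        self.out_streams = []

    def connect(self, downstream):
        # A logical stream delivering data to one downstream filter instance.
        stream = Queue()
        self.out_streams.append(stream)
        downstream.in_stream = stream
        return stream

    def process(self, item, device="cpu"):
        # Run the variant for the chosen device and forward the result.
        result = self.variants[device](item)
        for s in self.out_streams:
            s.put(result)

# Two-stage pipeline: a "scale" filter feeding a "total" filter.
scale = Filter("scale", cpu_fn=lambda xs: [2 * x for x in xs])
total = Filter("total", cpu_fn=sum)
scale.connect(total)

scale.process([1, 2, 3])                              # CPU variant, pushes to stream
print(total.variants["cpu"](total.in_stream.get()))   # prints 12
```

In the real framework each filter would be replicated across nodes and multi-threaded; here a single process suffices to show the decomposition.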

An important facet of our work is that the performance of accelerators such as GPUs is data-dependent: the speedup achieved over the CPU varies according to the type of computation and the input data. Moreover, this performance variation may occur at two different levels: (i) at the component (filter) level, where the processors' relative performance varies as different filters' internal computations are carried out, and (ii) at the application level, where multiple executions of the same application, using the CPU or the GPU exclusively, exhibit different relative performance according to the application's input data and parameters.

Therefore, our approach to executing dataflow computations on heterogeneous environments consists of assigning each task to the device on which it yields the best relative performance, while all devices are kept busy with different tasks. The proposed solution applies at both levels of scheduling: at the component level, a task is an event ready to be executed by one of the application's filters; at the application level, a task is an entire execution of the application.
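The assignment idea above can be sketched as a greedy policy: the GPU drains the pending tasks it accelerates most, while CPU cores take the tasks with the least GPU benefit, so every device stays busy on the work best suited to it. The task types and speedup figures below are purely illustrative, not measurements from the chapter.

```python
# Estimated GPU-over-CPU speedup per task type (illustrative values only).
GPU_SPEEDUP = {"matmul": 8.0, "parse": 0.5, "filter": 1.2}

def schedule(tasks, n_cpu=2, n_gpu=1):
    """Greedy sketch: each round, GPUs take the pending tasks with the
    highest estimated GPU speedup and CPUs take those with the lowest,
    keeping all devices busy on their best-suited work."""
    pending = sorted(tasks, key=lambda t: GPU_SPEEDUP[t], reverse=True)
    assignment = {"gpu": [], "cpu": []}
    while pending:
        # GPU slots pull from the high-speedup end of the list...
        for _ in range(min(n_gpu, len(pending))):
            assignment["gpu"].append(pending.pop(0))
        # ...and CPU slots pull from the low-speedup end.
        for _ in range(min(n_cpu, len(pending))):
            assignment["cpu"].append(pending.pop())
    return assignment

print(schedule(["matmul", "parse", "filter", "matmul", "parse"]))
# prints {'gpu': ['matmul', 'matmul'], 'cpu': ['parse', 'parse', 'filter']}
```

A demand-driven scheduler would instead hand the next task to whichever device becomes idle first; the point of sorting by relative performance is precisely to avoid wasting GPU cycles on tasks the CPU handles nearly as well.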

The proposed solution is materialized as a hierarchical scheduler that acts at each level where performance variations are observed. At the filter level, it aims to optimize each execution of the application by coordinating the assignment of its internal tasks (data buffers) to the appropriate devices. At the application level, it focuses on optimizing the execution of a workload containing multiple executions of the same application (jobs) by triggering several of them in parallel on the available hardware. Thus, at the application level, each execution of the application uses only one type of processor, chosen according to its estimated relative performance.
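The application-level half of the hierarchy can be sketched as follows: each whole job is bound to one processor type according to an estimated relative performance model. The estimator and job fields here (`input_size`, the `1e6` threshold) are hypothetical stand-ins for whatever model the real system would fit, for instance from profiling runs.

```python
def dispatch_jobs(jobs, estimate):
    """Application-level sketch: bind each entire job to one processor
    type, chosen by an estimated GPU/CPU relative-performance model."""
    plan = {"cpu": [], "gpu": []}
    for job in jobs:
        device = "gpu" if estimate(job) > 1.0 else "cpu"
        plan[device].append(job)
    return plan

# Illustrative estimator: assume the GPU pays off only for large inputs.
speedup = lambda job: job["input_size"] / 1e6

jobs = [{"id": 1, "input_size": 5e6}, {"id": 2, "input_size": 2e5}]
print(dispatch_jobs(jobs, speedup))
```

Jobs assigned to different device types can then run concurrently, which is how the workload as a whole keeps both the CPUs and the GPUs busy even though each individual execution uses a single processor type.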

The problem of assigning tasks in heterogeneous environments has long been a target of research. Recently, with the increasing ubiquity of GPUs in mainstream computers, the scientific community has examined the use of nodes with CPUs and GPUs in more detail. Mars (He, 2008), an implementation of the MapReduce programming model for CPU- and GPU-equipped nodes, evaluated the collaborative use of CPUs and GPUs to process Map and Reduce tasks, dividing the tasks among processors according to a fixed relative performance between devices. The Qilin system (Luk, 2009), on the other hand, argues that the processing rates of a system's processors depend on the data size. By building a model of each processor's processing rate in a training phase, Qilin determines how best to split the work among the processors for successive executions of the application. However, these works do not consider that tasks have different characteristics that make them suitable for different processors. Thus, the schedulers proposed for Qilin and Mars cannot fully exploit the hardware when there is inter-task performance variability.
