The problem of load balancing parallel applications is particularly challenging on computational grids, since the characteristics of both the application and the platform must be taken into account. This chapter reviews the wide range of solutions that have been proposed. It considers tightly coupled parallel applications that can be described by an undirected graph representing concurrent execution of tasks and communication of tasks, executing on computational grids with static and dynamic network and processor performance. While a rich set of solution techniques have been proposed, there has not been of yet any performance comparisons between them. Such comparisons will require parallel benchmarks and computational grid emulators and simulators.
Since heterogeneous computing is a broad research area and there are many definitions of computational grids, we define the characteristics of the applications and platforms that we are considering.
Key Terms in this Chapter
Resource-Aware Load Balancing: Assignment of tasks to a computational platform, taking into account its heterogeneous characteristics, with the objective of balancing the work assigned to processors. The tasks may be independent, or may have interdependencies described in a graph or digraph. In the latter case an additional objective is to minimize communication between processors. The assignment may be static, that is decided before tasks execute, or dynamic, that is tasks are assigned as conditions change.
Dynamic Load Balancing: Assignment of tasks to a computational platform, with the objective of balancing the work assigned to processors. When the tasks are interdependent an additional objective is to minimize communication between processors. The load balance is monitored during execution of the application, and the assignment is modified if required, while trying to minimize the communication cost of migrating tasks.
High Capacity Computational Grid: A grid platform dedicated to high performance computing (HPC) applications and generally consisting of an aggregation of multiple HPC clusters. These grids are space-shared, that is processors only execute a single user’s process(es) at a time. The clusters may be located in one organization, or they may be more spread out geographically, which can pose additional problems of high latencies.
Static Load Balancing: Assignment of tasks to a computational platform, before execution takes place, with the objective of balancing the work assigned to processors. When the tasks are interdependent an additional objective is to minimize communication between processors. This assignment is then fixed for the duration of the computation.
Graph Partitioning: In the context of parallel computing, the partitioning of a graph (representing concurrent execution of tasks and communication between tasks) into disjoint subgraphs, where the communication between subgraphs is minimized and the load is roughly balanced.
High Throughput Computational Grid: A grid platform that aims to take advantage of spare compute cycles on processing nodes. These nodes may be dedicated to multiple users at the same time and hence are time-shared. These grids can consist of local or global networks of computers, and can be either within one organization or be simply a collection of individual computers on the Internet.
Heterogeneous Computing: Design and implementation of algorithms for computing platforms with heterogeneous characteristics, such as processor and network performance.