Schedulers Based on Ant Colony Optimization for Parameter Sweep Experiments in Distributed Environments

Elina Pacini, Cristian Mateos, Carlos García Garino
DOI: 10.4018/978-1-4666-2518-1.ch016

Abstract

Scientists and engineers increasingly face the need for computational power to satisfy the ever more resource-intensive nature of their experiments. An example of such experiments is Parameter Sweep Experiments (PSEs). PSEs involve many independent jobs, since the same experiment is executed several times under multiple initial configurations (input parameter values). In recent years, technologies such as Grid Computing and Cloud Computing have been used for running such experiments. However, for PSEs to be executed efficiently, it is necessary to develop effective scheduling strategies that allocate jobs to machines and reduce the associated processing times. The job scheduling problem is known to be NP-complete, and thus many variants based on approximation techniques have been developed. In this work, the authors survey scheduling algorithms based on Swarm Intelligence (SI), and more precisely on Ant Colony Optimization (ACO), the most popular SI technique, for solving the job scheduling problem posed by PSEs on different distributed computing environments.
Chapter Preview

Introduction

Parameter Sweep Experiments, or PSEs for short, are a very popular way of conducting simulation-based experiments among scientists and engineers, through which the same application code is run several times with different input parameters, resulting in different outputs (Youn and Kaiser, 2010). Representative examples of PSEs are sensitivity studies that relate results to well-defined parameter changes, such as the effect of imperfections in the simulation of a simple tension test, or the study of buckling of imperfect columns.
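To make the idea concrete, the following minimal Python sketch runs the same simulation code once per input configuration. The function body, parameter names, and swept values are hypothetical placeholders, not taken from the chapter.

import itertools

def simulate(young_modulus, imperfection):
    # Stand-in for the real simulation code (e.g. a tension test).
    return young_modulus * (1.0 - imperfection)

# Each element of the Cartesian product is one initial configuration.
sweep = itertools.product([100.0, 200.0, 300.0],   # young_modulus values
                          [0.00, 0.05, 0.10])      # imperfection values
results = {params: simulate(*params) for params in sweep}

Each configuration yields one output, and the outputs are later analyzed together, e.g. in a sensitivity study.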

From a purely software perspective, most PSEs are cluster friendly in the sense that the individual inputs of an experiment can be handled by independent jobs. Therefore, using a software platform such as Condor (Thain et al., 2005), which is able to exploit the distributed nature of a computer cluster, allows these jobs to be run in parallel. In this way, not only do PSEs execute faster, but more computing-intensive experiments can also be performed, and hence more complex simulations can be carried out. The same idea has been systematically applied to execute PSEs on Grids (Foster and Kesselman, 2003), which are basically infrastructures that connect clusters via wide-area connections to increase computational power. To this end, software platforms designed to exploit Grids provide the illusion of a large supercomputer by virtualizing and combining the hardware capabilities of many much less powerful, geographically dispersed machines to run resource-intensive applications (Coveney et al., 2005).
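Because the jobs of a PSE are mutually independent, they parallelize trivially. As a rough local analogue of what a platform such as Condor does across a cluster, the sketch below (reusing the hypothetical simulate function from above) runs each configuration in a separate worker process.

from concurrent.futures import ProcessPoolExecutor
import itertools

def simulate(young_modulus, imperfection):
    # Stand-in for the real, long-running simulation code.
    return young_modulus * (1.0 - imperfection)

configs = list(itertools.product([100.0, 200.0, 300.0],
                                 [0.00, 0.05, 0.10]))

if __name__ == '__main__':
    # Each configuration becomes an independent job; the pool runs the
    # jobs in parallel, much as a cluster platform would across nodes.
    with ProcessPoolExecutor() as pool:
        outputs = list(pool.map(simulate, *zip(*configs)))
    print(outputs)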

On the downside, for users not proficient in distributed technologies, manually configuring PSEs is tedious, time-consuming and error-prone. As a consequence, users typically waste precious time that could instead be invested in analyzing results. The availability of elaborate GUIs -especially for Grids- that help automate the experimentation process has partly mitigated this problem. However, the highly complex nature of today’s experiments, and thus their associated computational cost, greatly surpasses the time savings that such automation can deliver.

A recent distributed computing paradigm that is rapidly gaining momentum is Cloud Computing (Buyya et al., 2009), which is based on the idea of providing computing infrastructure to end users on demand. Typically, users exploit Clouds by requesting one or more machine images, i.e. virtual machines running a desired operating system on top of several physical machines (e.g. a datacenter). Interaction with a Cloud is performed through Cloud services, which define its functional capabilities: machine image management, access to software/data, security, and so on.

Because PSEs involve processing a large number of jobs, it is necessary to address how those jobs are executed in distributed computing environments, which is a complex endeavor. For jobs to be executed properly, they must be allocated to the different resources reasonably and, ideally, optimally. This problem is known as job scheduling; it aims to minimize the overall execution time of all jobs and is NP-complete. Job scheduling is one of the bottlenecks in distributed computing.
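Since optimal scheduling is intractable, practical schedulers rely on heuristics. The following sketch is illustrative only, not an algorithm from the chapter: it applies a simple greedy rule, sorting jobs by decreasing length and assigning each to the machine that would finish it earliest. Job lengths and machine speeds are made-up values.

job_lengths = [8.0, 4.0, 6.0, 2.0, 9.0]     # abstract work units
machine_speeds = [1.0, 2.0]                 # units processed per time unit

finish_times = [0.0] * len(machine_speeds)  # when each machine becomes free
assignment = []
for length in sorted(job_lengths, reverse=True):   # longest jobs first
    # Choose the machine that would complete this job the earliest.
    best = min(range(len(machine_speeds)),
               key=lambda m: finish_times[m] + length / machine_speeds[m])
    finish_times[best] += length / machine_speeds[best]
    assignment.append((length, best))

print(assignment, max(finish_times))   # max(finish_times) is the makespan

Heuristics like this run in polynomial time but offer no optimality guarantee, which is part of what motivates metaheuristics such as ACO.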

In the last ten years or so, Swarm Intelligence (SI) has received increasing attention in the research community. SI refers to the collective behavior that emerges from a swarm of social insects (Bonabeau et al., 1999). Social insect colonies collectively solve complex problems by intelligent methods. These problems are beyond the capabilities of any individual insect, and the cooperation among insects is largely self-organized, without any supervision. By studying the behavior of social insect colonies, such as ant colonies, researchers have proposed algorithms and theories for combinatorial optimization problems. Job scheduling in Grids or Clouds is precisely such a combinatorial optimization problem.
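As a concrete illustration of how ACO, the SI technique surveyed in this chapter, can be applied to job scheduling, consider the deliberately simplified toy below. It is not the chapter's algorithm: the parameter values (number of ants, iterations, evaporation rate rho, deposit constant Q) are arbitrary choices, heuristic desirability is omitted, and the job and machine data are the made-up values used earlier.

import random

job_lengths = [8.0, 4.0, 6.0, 2.0, 9.0]
n_machines = 2
n_ants, n_iters = 10, 50
rho, Q = 0.1, 10.0                 # evaporation rate, pheromone deposit

# pheromone[j][m]: learned desirability of running job j on machine m.
pheromone = [[1.0] * n_machines for _ in job_lengths]

def makespan(assignment):
    loads = [0.0] * n_machines
    for job, machine in enumerate(assignment):
        loads[machine] += job_lengths[job]
    return max(loads)

best, best_cost = None, float('inf')
for _ in range(n_iters):
    for _ in range(n_ants):
        # Each ant builds a complete assignment, picking machines with
        # probability proportional to the pheromone on each choice.
        assignment = [random.choices(range(n_machines),
                                     weights=pheromone[j])[0]
                      for j in range(len(job_lengths))]
        cost = makespan(assignment)
        if cost < best_cost:
            best, best_cost = assignment, cost
    # Evaporate all trails, then reinforce the best-so-far solution so
    # that later ants are biased toward low-makespan assignments.
    for j in range(len(job_lengths)):
        for m in range(n_machines):
            pheromone[j][m] *= (1.0 - rho)
    for j, m in enumerate(best):
        pheromone[j][m] += Q / best_cost

print(best, best_cost)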

Key Terms in this Chapter

Load Balancing: Is a computer methodology to distribute workload across multiple computers to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Load balancing divides the amount of work that a computer has to do between two or more computers so that more work gets done in the same amount of time and, in general, all users get served faster.

Makespan: Is defined as the amount of time, from start to finish, required to complete a set of jobs, i.e. the maximum completion time over all jobs.
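In code, if the completion time of every job in a schedule is known, the makespan is simply their maximum; the values below are illustrative.

completion_times = [12.5, 9.0, 14.2, 11.1]   # per-job completion times
makespan = max(completion_times)             # 14.2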

Parameter Sweep: Is a type of experiment in which multiple data points are examined by executing an algorithm numerous times with different parameter configurations.

Cloud Computing: A Cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resource(s) based on service-level agreements established through negotiation between the service provider and consumers. Cloud Computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility over a network (typically the Internet).

Grid Computing: Is a model of distributed computing that uses geographically and administratively disparate resources. A Grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed ‘autonomous’ resources dynamically at runtime depending on their availability, capability, performance, cost, and users' quality-of-service requirements. Individual users can access computers and data transparently, without having to consider location, operating system, account administration, and other details.

Ant Colony Optimization (ACO): Is a class of optimization algorithms modeled on the actions of an ant colony. ACO methods are useful in problems that need to find paths to goals. Artificial 'ants' locate optimal solutions by moving through a parameter space representing all possible solutions. Real ants lay down pheromones directing each other to resources while exploring their environment. The simulated 'ants' similarly record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions.

Job Scheduling: Is the process of allocating a set of jobs belonging to an application into available computing resources. The main objective is achieving a high system throughput while matching application needs with the available computing resources.

Swarm Intelligence (SI): Is a discipline that deals with natural and artificial systems composed of many individuals that coordinate using decentralized control and self-organization. In particular, SI focuses on the collective behaviors that result from the local interactions of the individuals with each other and with their environment. Examples of systems studied by swarm intelligence are colonies of ants and termites, schools of fish, flocks of birds, herds of land animals.
