1. Introduction
Over the past several years, cloud computing has made substantial technical and operational advances in reliability and availability. Cloud technologies can now provide end users with considerable flexibility to self-provision resources, either explicitly or implicitly, and to obtain on-demand computational capabilities and services, as defined and summarized in a recent NIST publication describing the essential characteristics of cloud computing (Mell & Grance, 2011). From a technical and operational perspective, users now have a spectrum of choices in the design and configuration of customized software stacks for various types of applications. From a business and economics perspective, a cloud option provides a mechanism to convert the large capital expenditures for the purchase, operation, and maintenance of a data center into a "pay-as-you-go" operating expense.
Cloud computing options have been applied in situations where users face constraints on access to computational facilities, to small prototype computations requiring many short calculations with different parameters, and to large computations with minimal communication requirements between processors. All of these types of computation have been successfully implemented on either private or commercial cloud computing systems (UberCloud, 2014; Amazon Web Services, 2014). As a result, companies, academic institutions, organizations, and individuals are seriously considering and experimenting with cloud computing as a platform for computation and data analysis.
Despite all of these advances, cloud computing has had only mixed success with supercomputing applications. Users who explicitly require high performance computing favor systems that allow them to operate "close to the metal," with the ability to tune both the hardware and the storage to optimize computational performance. Early efforts to re-create these HPC capabilities took the most straightforward approach of deploying supercomputing applications onto existing cloud platforms. Although this method showed some promise for codes with minimal inter-processor communication requirements, more tightly coupled HPC applications suffered degraded performance. An alternative approach constructed small HPC clouds with more robust, uniform hardware architectures and network connections, providing "spill-over provisioning" from an HPC supercomputer to a cloud system when the HPC system became saturated. Although these implementations did provide some overall acceleration, the underlying difficulty of delivering supercomputer-level computational throughput with commodity cloud cluster hardware remained.
The basic difficulty is that general cloud computing systems lack the specialized HPC architectural infrastructure needed to deliver the required high throughput. Tightly coupled HPC codes need state-of-the-art network interconnects that provide maximum computational throughput and minimum latency; submitting such applications to standard cloud computing systems generally results in degraded performance. A lack of uniformity in the computational hardware contributes further performance degradation.
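The latency sensitivity described above can be illustrated with a deliberately simple cost model (not drawn from the article): if each iteration of a tightly coupled code performs some fixed computation plus a number of message exchanges, per-iteration time grows linearly with interconnect latency. All figures below are assumed, order-of-magnitude values chosen only to sketch the effect.

```python
# Illustrative sketch: why tightly coupled codes degrade on high-latency
# commodity interconnects. Per-iteration wall time is modeled as
#   compute time + (messages per iteration * one-way latency).
# All constants are assumptions for illustration, not measured values.

def iteration_time(compute_s: float, messages: int, latency_s: float) -> float:
    """Estimated wall time (seconds) for one tightly coupled iteration."""
    return compute_s + messages * latency_s

COMPUTE_S = 1e-3        # 1 ms of computation per iteration (assumed)
MESSAGES = 10           # e.g., halo exchanges per iteration (assumed)
HPC_LATENCY_S = 1e-6    # ~1 microsecond, typical of low-latency HPC fabrics
CLOUD_LATENCY_S = 5e-5  # ~50 microseconds, typical of commodity Ethernet

hpc_t = iteration_time(COMPUTE_S, MESSAGES, HPC_LATENCY_S)
cloud_t = iteration_time(COMPUTE_S, MESSAGES, CLOUD_LATENCY_S)
slowdown = cloud_t / hpc_t

print(f"HPC interconnect:   {hpc_t * 1e3:.3f} ms/iteration")
print(f"Cloud interconnect: {cloud_t * 1e3:.3f} ms/iteration")
print(f"Slowdown factor:    {slowdown:.2f}x")
```

Under these assumed numbers the communication term alone produces a measurable slowdown per iteration; shrinking the compute share (finer decomposition, more processors) makes the latency term dominate, which is the regime where standard cloud systems fall furthest behind dedicated HPC hardware.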