Many accelerator-based computers have demonstrated that they can be faster and more energy-efficient than traditional high-performance multi-core computers. Two types of programmable accelerators are available in high-performance computing: general-purpose accelerators such as GPUs, and customizable accelerators such as FPGAs, although general-purpose accelerators have received more attention. This chapter reviews the state of the art and current trends in high-performance customizable computers (HPCC) and their use in Computational Science and Engineering (CSE). A top-down approach is used to make the material accessible to non-specialists. The “top view” is provided by a taxonomy of customizable computers. This abstract view is accompanied by a performance comparison of common CSE applications on HPCC systems and high-performance microprocessor-based computers. The “down view” examines software development, describing how CSE applications are programmed on HPCC systems. Additionally, a cost analysis and an example illustrate the origin of the benefits. Finally, the future of high-performance customizable computing is analyzed.
Frequently, automated solutions to Computational Science and Engineering (CSE) problems require that billions to trillions of complex operations be applied to input data acquired from the real world. In many cases these solutions are time-critical and must be reported without delay; frequently they must also be of the highest precision. Both the timely availability and the precision of information are key to resolving CSE problems and thus to making life longer and more comfortable.
High-performance computing is the research and development domain that addresses this performance goal: it aids the solution of CSE problems with a combination of high-performance computers and parallel programs. For many years, the fastest computers integrated central processing units (CPUs) or microprocessors that were specialized in performing the greatest number of operations per second. Nowadays, however, the architectures of the fastest high-performance computers are dominated by large populations of multi-core programmable processors, many of which can also be integrated into desktop or server computers.
In this time of transition, new high-performance processors provide higher levels of performance than their predecessors mainly because of an increase in the number of processing cores integrated on-chip. Increasing the number of processing cores on a single chip offers increased computer performance at somewhat lower power dissipation than a complex single-core microprocessor with an equivalent number of transistors on-chip.
Nevertheless, the multi-core approach does not address three basic problems. Firstly, the computing power available on-chip is not efficiently utilized by programs. Secondly, the connection from the processor to the external memory becomes more heavily loaded as the number of cores increases. This load, together with the difference in operating frequency between the multi-core processor and the external memory, can become a bottleneck for parallel processing and stall some or all cores. Thirdly, effective programming of multi-core systems is difficult, and in many cases software is ultimately responsible for the lack of performance scalability as the number of cores increases (Mackin & Woods, 2006).
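The scalability limit mentioned in the third problem can be illustrated with Amdahl's law, a standard model that this chapter does not itself invoke. The sketch below assumes a hypothetical program in which a fixed fraction of the work can be parallelized perfectly and the rest remains serial; the function name and the 95% figure are illustrative choices, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup predicted by Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    cores: number of cores that share the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected by extra cores; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative assumption: 95% of the program is parallelizable.
    for cores in (1, 2, 4, 8, 16, 64, 1024):
        print(f"{cores:5d} cores -> speedup {amdahl_speedup(0.95, cores):6.2f}")
```

Under this simplified model the speedup can never exceed the reciprocal of the serial fraction, however many cores are added, which is one way to read the claim that software often limits multi-core scalability.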
An alternative approach has arisen: High-Performance Customizable Computing (HPCC), a different paradigm of high-performance computing. Instead of having only programmable processors, customizable computers also integrate hardware coprocessors with non-fixed architectures. These high-performance computing elements can be customized for a portion of a specific program and so accelerate the execution of key steps in the application software.
Customizable hardware devices offer the advantage of speeding up several software applications because their hardware flexibility allows the same chip to be specialized and reused. This is the main property that is applied in High-Performance Customizable Computing, and it is very useful in exploiting the inherent parallelism of many CSE problems. Customizable devices have shown great potential for use in high-performance computing, with much better power efficiency than programmable processors. New customizable devices are providing ever higher performance because their clock frequency and the number of transistors dedicated to specialized processing are both increasing. Additionally, customizable devices have other advantages that are exploited in embedded hardware engineering, such as reducing both the non-recurring engineering costs (Dehon, 2008) and the development time of a product (Guccione, 2008).
Two types of computing systems that integrate customizable devices are common nowadays: configurable and reconfigurable systems. Configurable Systems are built from baseline chip designs that are partially specialized at design time, before fabrication (Leibson, 2006). After chip fabrication, these systems can be software-programmed but cannot be specialized any further. Reconfigurable Systems, on the other hand, are based on field-programmable devices that can be completely customized after fabrication (Chang, 2008).
The main goal of this chapter is to help readers understand how customizable hardware systems can be exploited to provide high performance, i.e., how to get 10X, 100X or 1000X the performance of the equivalent number of transistors in a microprocessor-based computer with much better power efficiency. The reader will gain insight into the design, management and use of high-performance infrastructures that integrate microprocessor-based and customizable computers.
Key Terms in this Chapter
Field Programmable Gate Array (FPGA): is a reconfigurable electronic device with a fine-grained architecture that implements customized computational logic specific to the application being executed, and can be reconfigured for a wide range of tasks (Maxfield, 2004). Its reconfigurable architecture is composed of processing elements called Look-Up Tables (LUTs) that can implement any logic function of a few inputs, an interconnection network that can connect any logic cell with the rest of the circuit, memory blocks that store data to be loaded by any other element of the architecture, and special modules that are integrated on-chip to efficiently perform frequently used tasks such as multiplication, digital signal processing, and external input-output interfacing.
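How a LUT can implement any logic function of a few inputs can be sketched in software: the "configuration" is simply a truth table with one stored bit per input combination, and evaluation is a table look-up. The class and method names below are hypothetical and model only the principle, not the circuitry of any real FPGA.

```python
from itertools import product
from typing import Callable, List


class LookUpTable:
    """Software model of a k-input LUT: 2**k stored bits, indexed by the inputs."""

    def __init__(self, num_inputs: int, function: Callable[..., int]) -> None:
        self.num_inputs = num_inputs
        # Configuration step: store one output bit per input combination.
        # product() enumerates inputs in binary counting order, first input
        # as the most significant bit.
        self.table: List[int] = [
            int(bool(function(*bits)))
            for bits in product((0, 1), repeat=num_inputs)
        ]

    def evaluate(self, *inputs: int) -> int:
        """Return the stored bit addressed by the given input values."""
        if len(inputs) != self.num_inputs:
            raise ValueError(f"expected {self.num_inputs} inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return self.table[index]


if __name__ == "__main__":
    # The same 3-input LUT model configured for two different functions.
    majority = LookUpTable(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
    parity = LookUpTable(3, lambda a, b, c: a ^ b ^ c)
    print(majority.table, majority.evaluate(1, 0, 1))
    print(parity.table, parity.evaluate(1, 0, 1))
```

Changing the function changes only the stored table, not the look-up mechanism, which mirrors in a very simplified way how the same hardware resources can be reused for different logic.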
Reconfigurable Devices: are versatile configurable electronic components that are used to build distinct hardware implementations on the same set of reconfigurable resources after chip fabrication (Compton & Hauck, 2002). Reconfigurability is achieved by the use of an integrated configuration memory that stores the information about the state and functionality of each part of the device. The device is configured by loading a configuration bitstream, consisting of a series of commands and frame data. At any time after a reconfigurable device has been powered up, it is possible to suspend its operation, load a completely new hardware configuration, and restart its operation using the newly loaded configuration (Leibson, 2006).
Processor, or Central Processing Unit (CPU): is the central circuit of a computer that processes a sequence of jobs that arrive over time and actually executes the application program. Since the first CPUs, their performance has been driven by higher clock rates and improved internal organization of the circuit. In recent years, CPUs with ever higher clock rates have not been commercially viable, and the technology trend has been to integrate more than one core on a chip. Additionally, low-power CPUs are playing a central role in high-performance computers because the building of ever-larger clusters of commercial off-the-shelf hardware is being constrained by power and cooling (Donofrio et al., 2009; Henkel & Parameswaran, 2007).
High-Performance Computers (HPC): provide hardware and software infrastructures whose main goal is to accelerate the execution of customer applications or improve their fault tolerance. Usually, these machines are composed of multiple processors, large memory capacity, large disk storage, and high-bandwidth communications among all their main components (Blake et al., 2009).
Coprocessors: are specialized circuits that can be integrated into a computer and connected to a CPU to provide added performance for applications, implementing specific computational tasks (Gulati & Khatri, 2010).
Customizable Electronic Devices: can be customized to efficiently execute a task and are frequently used as CPUs and/or coprocessors. They can typically achieve execution 10 to 1000 times faster than today’s fastest CPUs, together with a reduction of about 95% in power consumption. Three types of customizable devices can be distinguished: ASICs, Configurable Processors, and Reconfigurable Devices.
Application Specific Integrated Circuit (ASIC): is an integrated electronic circuit that is customizable during the design phase to efficiently execute a specific task. After fabrication, it cannot be modified to execute other tasks. This type of customizable device can achieve the best performance and the lowest energy consumption. However, its design and fabrication costs are very high and can only be justified if the number of chips sold is very large (Rigo et al., 2010).
Configurable Processors: are special ASICs that are based on a conventional CPU and tailored at chip design time for a specific software application. This type of processor delivers much better computing efficiency and much lower power consumption than the original CPU. After fabrication, they cannot be configured again.
Instruction Set Architecture (ISA): is the set of hardware elements of the processor that can be managed by the software program. The program controls the hardware through standardized machine instructions. A program is a composition of machine instructions that are loaded sequentially by the CPU over time (Patterson & Hennessy, 2009).
Parallelization: is the software technique that partitions an application program into independent parts so that the independent resources of a computer can be kept efficiently busy, each executing one of those parts. This software partitioning can be done at instruction level, data level, thread level, procedure level or program level. Many CSE applications can be parallelized. If the respective parallel programs are executed on HPC platforms, costs and manpower requirements are reduced (Akhter & Roberts, 2006).
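As a minimal sketch of data-level partitioning, the example below splits an input array into near-equal chunks, hands each chunk to a separate worker, and combines the partial results. The function names and the sum-of-squares workload are hypothetical. Threads are used here only to keep the sketch portable; in CPython, real speedups for CPU-bound work generally require processes or native code, so this shows the partitioning pattern rather than a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_chunks(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Partition data into `parts` contiguous chunks of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(data), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk: Sequence[int]) -> int:
    """Independent unit of work applied to one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: Sequence[int], workers: int = 4) -> int:
    """Process each chunk in its own worker and combine the partial sums."""
    chunks = split_into_chunks(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    values = list(range(1_000))
    print(parallel_sum_of_squares(values, workers=4))
```

The same partition, process, and combine structure applies at the other levels listed in the definition; only the unit of work being distributed changes.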