Computer Architectures and Programming Models: How to Exploit Parallelism

DOI: 10.4018/978-1-7998-7082-1.ch002


In this chapter, basic concepts about programming models and computer architectures are introduced to provide context for the major developments in both topics. Differences between multicore processors and accelerators are also addressed to help the reader understand how the concepts relate and translate between different architectures. Moreover, the authors present an introduction to programming models, focusing on their suitability for different parallelizing strategies and their special features. In this way, the reader is guided to identify which programming models are best suited for specific problems and architectures, according to the computational requirements as well as those arising from the data layout.
Chapter Preview


In 1965, Gordon Moore, Intel's co-founder, observed that technology was advancing so fast that it was possible to enhance processors such that the number of transistors per chip would double roughly every 18 months. This trend became known as "Moore's Law", and it has held almost exactly to this day (Edwards, 2021). This growth in resources was mainly devoted to sequential processors built on a common computer model known as the Von Neumann architecture, so that it was not necessary to retrain programmers or rewrite programs for each emerging computer architecture. Many techniques, transparent from the programmer's point of view, were introduced to take advantage of the higher number of transistors. However, these techniques, combined with the limits of current VLSI (Very Large Scale Integration) technology, no longer deliver performance gains for single-threaded processors. In response, most hardware companies are designing and developing new parallel architectures (Geer, 2005). To keep performance growing at the same rate as the number of transistors per chip, processor designers evolved from the Von Neumann model to a new concept, Chip Multi-Processors (CMP), which integrate several processing elements on the same chip. This change puts greater pressure on memory and I/O buses as the number of cores increases. Furthermore, the new paradigm demands more effort from programmers, who become responsible for exposing the parallelism implicit in their applications.

Computer architectures have undergone many changes over time, driven by the search for features better suited to solving scientific problems. Proposals for massively parallel computers emerged as early as the 1950s (Unger, 1958): most of the target applications required a large number of identical calculations, so it seemed logical to replicate the processing elements. This led to the first Single Instruction Multiple Data (SIMD) architectures, such as ILLIAC IV (Barnes, et al., 1968), MPP (Batcher, 1980), and CLIP4 (Duff, 1976), and to Multiple Instruction Multiple Data (MIMD) architectures, such as Cytocomputer (Loughead & Cubbrey, 1980) and ZMOB (Rieger et al., 1981).

Numerical simulation is probably the field with the greatest impact on computer architecture. Indeed, applications based on Linear Algebra are among the most computationally demanding, given the huge computational resources they require. This influence is reflected in the incorporation of SIMD instructions into commercial processors by Intel (Peleg & Weiser, 1996), Sun (Tremblay et al., 1996), and HP (Lee, 1996), among others.

Key Terms in this Chapter

OpenACC: Open Accelerators is a high-level pragma-based (directive) programming model for accelerators (e.g., GPUs) designed to simplify parallel programming of heterogeneous CPU-GPU systems.

CUDA: Compute Unified Device Architecture is a low-level parallel programming model and application programming interface (API) created by NVIDIA to use CUDA-enabled (NVIDIA) graphics processing units (GPUs) for general-purpose processing.

OmpSs: OpenMP Super scalar is a pragma-based programming model developed by the BSC. Its syntax is similar to that of OpenMP; however, it implements task-based parallelism.

SIMD: Single Instruction, Multiple Data is a class of parallel computers that perform the same operation on multiple data points simultaneously. It is closely related to the concepts of vector architectures and vector instructions.

Streaming Multiprocessor: Also known as a multiprocessor, it is a component of current GPUs composed of a set of small cores and a relatively small memory. Current GPUs comprise several streaming multiprocessors, which are connected via a common memory known as global or GPU memory.

SMT: Simultaneous Multi-Threading is a technique for improving the overall efficiency of CPUs, which permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

GPU: Graphics Processing Unit is a specialized processor originally designed to accelerate graphics rendering. GPUs can process many pieces of data simultaneously, making them useful for high performance computing, machine learning, video editing, and gaming applications.

CMP: Chip Multi-Processors, or multi-core architecture, is a logic design architecture whereby multiple processing units (e.g., CPU cores) are integrated onto a single monolithic integrated circuit or onto multiple dies in a single package. It is also known as a shared memory architecture.

BSC: Barcelona Supercomputing Center is a public research center located in Spain. It was founded in 2005 and hosts the MareNostrum supercomputer.

VLSI: Very Large-Scale Integration is the process of creating an integrated circuit (processors) by combining millions of transistors onto a single chip.

MPI: Message Passing Interface. MPI is the standard that defines the syntax and semantics of a library of message-passing routines designed for distributed computing.

CPU: Central Processing Unit (or processor) is the electronic circuitry of a computer in charge of executing the instructions that make up a computer program.

OpenMP: Open Multi-Processing is an API for shared memory programs. It is a pragma-based programming model that provides several parallelism approaches, such as fork-join, task-based parallelism, and target offloading.

Kokkos: Like RAJA, Kokkos is a C++ software library containing a set of abstractions that enable architecture portability across several parallel computational architectures.

RAJA: A software library containing a high-level collection of abstractions based on the C++ programming language. These abstractions enable architecture portability for High Performance Computing applications.

NUMA: Non-Uniform Memory Access is a memory design in which access time depends on the location of memory relative to the processor: each processor reaches its local memory faster than memory attached to another processor. Keeping data close to the cores that use it allows them to be fed with data as fast as possible.

Scalability: Capability of the system to increase the delivered performance proportionally to the number of added resources.
