Challenges on Porting Lattice Boltzmann Method on Accelerators: NVIDIA Graphic Processing Units and Intel Xeon Phi

Claudio Schepke (Federal University of Pampa, Brazil), João V. F. Lima (Federal University of Santa Maria, Brazil), and Matheus S. Serpa (Federal University of Rio Grande do Sul, Brazil)
Copyright: © 2018 | Pages: 24
DOI: 10.4018/978-1-5225-4760-0.ch002

Abstract

Currently, NVIDIA GPUs and Intel Xeon Phi accelerators are alternative computational architectures for high performance. This chapter investigates the performance impact of these architectures on the lattice Boltzmann method, an iterative method that simulates fluid flows using discrete representations and can be applied to a large number of flow simulations through simple operation rules. The experiments consider a three-dimensional version of the method with 19 discrete propagation directions (D3Q19). The performance evaluation compares three modern GPUs, the K20M, K80, and Titan X, and two Xeon Phi architectures, Knights Corner (KNC) and Knights Landing (KNL). The Titan X provides the fastest execution time of all hardware considered, and the results show that GPUs offer better processing times for this application. Among the Xeon Phi architectures, a KNL cache-mode implementation presents the best results, and the newer Xeon Phi (KNL) is two times faster than the previous model (KNC).

Introduction

High performance computing has driven a scientific revolution. Using computers, problems that could not be solved, or that demanded too much time to solve, became tractable for the scientific community. The evolution of computer architectures improved computational power, widening the range of problems that could be addressed. Integrated circuits, pipelines, higher operating frequencies, out-of-order execution, and branch prediction are among the key technologies introduced up to the end of the 20th century. More recently, concern about energy consumption has been growing, with the goal of achieving exascale computation in a sustainable way. However, the aforementioned technologies alone do not make exascale computing achievable, due to the high energy cost of increasing frequency and pipeline depth, and because instruction-level parallelism is already exploited close to its limits.

To address these limits, multicore and accelerator architectures have been introduced in recent years. Their main feature is the presence of several processing cores operating concurrently, which requires the application to be divided into several tasks that communicate with each other. Concerning the use of accelerators in HPC architectures, their main characteristic is the presence of different environments in the same system, each with an architecture specialized for a type of task. A typical HPC system is composed of a general-purpose processor, responsible for managing the system, and several accelerators that perform the computation of certain kinds of tasks.

The use of accelerators poses several challenges for HPC. Applications must be coded considering the particularities and constraints of each environment, as well as their distinct architectural characteristics. In the memory hierarchy, for example, the presence of several cache levels, some shared and others private, and the choice between centralized and distributed memory banks introduce non-uniform access times that impact performance. In addition, the number of functional units in an accelerator may vary between hardware versions, and the instruction set itself may not be the same. All these aspects influence application performance and must be considered in the application code.

This chapter covers recent challenges of parallel programming for the Lattice Boltzmann Method (LBM) (Schepke, Maillard, & Navaux, 2009). LBM is currently a backbone method for simulating fluid flow through porous media and has been extensively applied to soil filtration and fuel cells over the last five years. It is an iterative numerical method to model and simulate fluid dynamics, in which space, time, and velocity are discrete. The method enables the computational modeling of a large variety of problems, including multi-component fluids, in one or more phases, with irregular boundary conditions and in complex geometries (Valero-Lara, 2014; Valero-Lara & Jansson, 2016). LBM has been used to simulate blood vessels, the flow of oil with water emulsions in porous rocks, and turbulent flows (Nita, Itu, Suciu, & Suciu, 2013; Obrecht, Kuznik, Tourancheau, & Roux, 2011).
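For background, the single-relaxation-time (BGK) form of the lattice Boltzmann update, commonly used with the D3Q19 stencil studied in this chapter, combines a local collision with a streaming step. The formula below is standard LBM background, not reproduced from the chapter:

```latex
% BGK lattice Boltzmann update: each distribution f_i relaxes locally toward
% equilibrium f_i^eq, then streams one lattice step along its velocity c_i.
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\; t + \Delta t)
  = f_i(\mathbf{x}, t)
  - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right],
\qquad i = 0, \dots, 18 \ \text{(D3Q19)}
```

The macroscopic density and velocity are recovered as moments of the distributions, $\rho = \sum_i f_i$ and $\rho\mathbf{u} = \sum_i \mathbf{c}_i f_i$, which is what makes each lattice site's update depend only on local and nearest-neighbor data.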

LBM is a numerical approach for simulating fluid flows that benefits from being applicable to specific flow conditions, naturally discrete, and easily parallelized. In terms of development, the fluid flow model is discrete from the start, so the domain representation does not need to be discretized afterwards. This simplifies coding, because method and algorithm coincide. Finally, because the operations of the method are local, each lattice element can be computed in parallel, so a parallel version of the algorithm should be straightforward.
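To make the locality argument concrete, the collide-and-stream structure can be sketched in plain C for a minimal one-dimensional D1Q3 lattice. This is an illustrative reduction of the chapter's D3Q19 setup, not the authors' code; `lbm_step`, the lattice size `N`, the relaxation time `tau`, and the periodic boundary are all assumptions of this sketch. The collision loop reads and writes only one site at a time, which is exactly why each lattice element can be updated in parallel on a GPU or Xeon Phi:

```c
#define N 16   /* lattice sites (illustrative size) */
#define Q 3    /* D1Q3 discrete velocities: rest, +1, -1 */

static const int    c[Q] = { 0, 1, -1 };
static const double w[Q] = { 2.0/3.0, 1.0/6.0, 1.0/6.0 };

/* One BGK collide-and-stream step on a periodic 1-D lattice.
 * Collision is purely local per site; streaming is a neighbor shift. */
void lbm_step(double f[Q][N], double tau)
{
    double tmp[Q][N];

    for (int x = 0; x < N; x++) {
        /* macroscopic moments at this site only */
        double rho = 0.0, mom = 0.0;
        for (int i = 0; i < Q; i++) { rho += f[i][x]; mom += c[i] * f[i][x]; }
        double u = mom / rho;

        /* relax each distribution toward its local equilibrium */
        for (int i = 0; i < Q; i++) {
            double cu  = c[i] * u;
            double feq = w[i] * rho * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*u*u);
            tmp[i][x]  = f[i][x] - (f[i][x] - feq) / tau;
        }
    }

    /* streaming: propagate each distribution one step along c[i] */
    for (int i = 0; i < Q; i++)
        for (int x = 0; x < N; x++)
            f[i][(x + c[i] + N) % N] = tmp[i][x];
}
```

In the D3Q19 case the same pattern applies with 19 directions over a 3-D grid; the outer loop over sites is the part mapped to CUDA threads or OpenMP threads in the architectures the chapter compares.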

Computational methods such as LBM must be continuously ported to the newest HPC hardware available to remain competitive. Parallel programming strategies are applied to the operations over each lattice element and its neighbor elements. To execute parallel simulations, state-of-the-art HPC architectures are employed, producing accurate results faster with each generation. The software must evolve to support the features of each design to keep performance scaling, and it is important to understand the impact of the software on each architecture in order to improve performance.
