Efficient Prefix Scan for the GPU-Based Implementation of Random Forest

Efficient Prefix Scan for the GPU-Based Implementation of Random Forest

Bojan Novak (UM FERI, Slovenia)
DOI: 10.4018/978-1-4666-7377-9.ch009
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

The random forest ensemble learning with the Graphics Processing Unit (GPU) version of prefix scan method is presented. The efficiency of the implementation of the random forest algorithm depends critically on the scan (prefix sum) algorithm. The prefix scan is used in the depth-first implementation of optimal split point computation. Described are different implementations of the prefix scan algorithms. The speeds of the algorithms depend on three factors: the algorithm itself, which could be improved, the programming skills, and the compiler. In parallel environments, things are even more complicated and depend on the programmer´s knowledge of the Central Processing Unit (CPU) or the GPU architecture. An efficient parallel scan algorithm that avoids bank conflicts is crucial for the prefix scan implementation. In our tests, multicore CPU and GPU implementation based on NVIDIA´s CUDA is compared.
Chapter Preview
Top

Introduction

In the first chapter the evolution of implementations of random forest algorithms that can be executed on the graphics processing unit (GPU) are described. Random forests are ensemble learning methods for classification that constructs a set of decision trees at training process (Breiman, 2001). The output class is calculated from the classes output by individual trees. Advantages of training algorithms that can produce compact random forests composed of many, small trees rather than fewer, deep trees form a solid foundation for the GPU implementation. Pros and contras of depth-first and bread-first search are presented.

For better understanding of next chapters a short description of the GPU architecture is given. The GPU Architecture allows the possibility to program GPUs using a master-slave computing model (Knuth, 1974). PC´s central processing unit (CPU) works as the master and the GPU processors act as the slaves. The host invokes kernels (C functions) that run on slave processors. Kernels can be executed as threads.

In the third chapter theory and implementations of the prefix scan algorithm on the GPU is presented. The performances of prefix scan algorithms are measured by different criteria: computational complexity, memory usage, stability, recursion etc. (Cormen, Leiserson, Rivest, & Stein, 2001) and (Knuth, 1973). The speeds of the algorithms depend on three factors: the algorithm itself, which can be improved, the programming skills, and the compiler. In parallel environments things are even more complicated and depend on the programmer´s knowledge of the CPU or GPU unit (Govindaraju, Gray, Kumar, & Manocha, 2006). Particularly when the architecture of GPU changes, this happens almost every year, as programs are no longer optimal.

Follows the chapter about random forests ensemble learning with the GPU version of the prefix scan method. The efficiency of the implementation of the random forest algorithm depends critically on the scan (prefix sum) algorithm. The prefix scan is used in the depth-first implementation of optimal split point computation. Described are different implementations of the prefix scan algorithms. Follows some performance tests for multicore CPU and GPU implementations based on NVIDIA´s CUDA platform on the medium size data set MNIST and the large Poker hand dataset. The experiments were performed on the PC with Intel I7 920, 2.66GHz CPU, and NVIDIA GTX TITAN GPU.

In the next chapter some future direction of development of the proposed algorithm for the multi GPU environment is discussed when the new NVLink, a high-speed interconnect will be available.

In the last chapter a short summary of theory and benchmark results are presented. Follows some additional explanation of achieved results. Given are some advices of the proper use of the GPUs version of the algorithms.

Key Terms in this Chapter

GPU: Graphical processing unit, suitable for the intensive computation. Mostly build by AMd and NVIDIA.

FPGA: Field-programmable gate arrays

Hawaii: The latest GPU architecture provided by AMD

OpenCL: Is a framework for developing programs that execute across heterogeneous platforms consisting of CPUs, GPUs and FPGAs.

KEPLER: The GPU architecture provided by NVIDIA.

CAYMAN: The GPU architecture provided by AMD.

OpenACC: Is a programming standard for parallel computing It is designed to simplify parallel programming of heterogeneous CPU/GPU systems.

FERMI: The GPU architecture provided by NVIDIA.

OpenMP: Is an application programming interface that supports shared memory multiprocessing programming in C++ and similar languages.

Complete Chapter List

Search this Book:
Reset