Exploring Vectorization and Prefetching Techniques on Scientific Kernels and Inferring the Cache Performance Metrics

J. Saira Banu (School of Computing Science and Engineering, VIT University, Vellore, India) and M. Rajasekhara Babu (School of Computing Science and Engineering, VIT University, Vellore, India)
Copyright: © 2015 | Pages: 19
DOI: 10.4018/IJGHPC.2015040102

Abstract

Performance improvement in modern processors is stagnating because of the power wall and memory wall problems. In general, the power wall problem is addressed by various vectorization techniques, and the memory wall problem is mitigated through prefetching. In this paper, vectorization is achieved through the Single Instruction Multiple Data (SIMD) registers of current processors. Vectorization provides an architecture-level optimization by reducing the number of instructions in the pipeline and by minimizing the use of the multi-level memory hierarchy. For compute-intensive applications, these registers provide an economical computing platform compared to a Graphics Processing Unit (GPU). This paper explores software prefetching via Streaming SIMD Extensions (SSE) instructions to mitigate the memory wall problem. The work quantifies the effect of vectorization and prefetching on the Matrix Vector Multiplication (MVM) kernel with dense and sparse structure. Both prefetching and vectorization reduce data and instruction cache pressure, thereby improving cache performance. The Intel VTune Amplifier is used to show the cache performance improvements in the kernel. Finally, experimental results demonstrate promising performance of the matrix kernel on Intel's Haswell processor. However, effective use of SIMD registers remains a programming challenge for developers.
Introduction

Uniprocessor performance improvement has flattened because of three walls: the Instruction Level Parallelism (ILP) wall, the memory wall and the power wall. The power problem is addressed through improved resource utilization. Currently, general-purpose commercial microprocessors are provided with SIMD vector extensions to minimize power. The SIMD approach uses a small amount of extra hardware in the execution units of a core, thereby reducing overall power consumption (Welch et al., 2012). It is a cost-effective method compared to GPU computing (Livesey et al., 2012). Cebrian et al. (2012) quantify the effect of parallelization, vectorization, specialization and heterogeneity on the energy efficiency of new-generation processors. They state that software developers should prefer vectorization to parallelization because it gives better energy efficiency. Liu et al. (2013) describe how SIMD computing addresses cooling challenges and provides high-performance computing at a minimum clock speed. To program SIMD units effectively, SSE instructions are now available from high-level programming languages. Mitra et al. (2013) performed a comparative study of the NEON SIMD instruction set of the ARM processor and the SSE2 instruction set provided for Intel platforms. They also compared the performance of auto-vectorization and hand-tuned vectorization for five benchmarks on ten different hardware processors, and showed that hand-tuned vectorization outperforms auto-vectorization on both ARM and Intel processors. In this paper, we use SIMD vectorization with hand-tuned SSE instructions to address the power problem, as has been done in the literature.
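
As a rough illustration of hand-tuned SSE vectorization on a dense matrix-vector product, the C sketch below processes four single-precision elements per instruction. It is not the authors' kernel: the function names `dmvm_scalar` and `dmvm_sse`, the row-major layout and the use of single precision are assumptions made for illustration, and the code assumes an x86 compiler with SSE support.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ... */

/* Scalar reference: y = A * x, with A stored row-major as rows x cols floats.
 * (Hypothetical helper, used only to check the vectorized version.) */
void dmvm_scalar(const float *a, const float *x, float *y,
                 size_t rows, size_t cols)
{
    assert(a && x && y);
    for (size_t i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}

/* Hand-vectorized version: each 128-bit SSE register holds four floats,
 * so one multiply and one add replace four scalar multiply-add pairs. */
void dmvm_sse(const float *a, const float *x, float *y,
              size_t rows, size_t cols)
{
    assert(a && x && y);
    for (size_t i = 0; i < rows; ++i) {
        const float *row = a + i * cols;
        __m128 acc = _mm_setzero_ps();   /* four partial sums */
        size_t j = 0;

        /* Main loop: unaligned loads, so no alignment is assumed. */
        for (; j + 4 <= cols; j += 4) {
            __m128 va = _mm_loadu_ps(row + j);
            __m128 vx = _mm_loadu_ps(x + j);
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vx));
        }

        /* Horizontal reduction of the four partial sums. */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

        /* Scalar tail for column counts that are not a multiple of 4. */
        for (; j < cols; ++j)
            sum += row[j] * x[j];

        y[i] = sum;
    }
}
```

Because the vectorized loop accumulates in a different order than the scalar loop, the two versions can differ in the last bits for general floating-point input; they agree exactly only when every intermediate value is exactly representable.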

The memory wall problem is addressed by techniques such as speculative execution, out-of-order execution, multithreading and data prefetching. Prefetching can be performed in either software or hardware (Liu et al., 2014), (Karakasis et al., 2009), (Byna et al., 2008). Current processors support hardware prefetchers, which are preferable for applications with regular access patterns, such as Dense Matrix Vector Multiplication (DMVM). As specified in (Intel Architectures optimization reference manual, 2013), the hardware prefetchers of Haswell processors prefetch data into the L2 cache. Several studies have examined prefetch strategies for scientific and commercial applications. Zucker et al. (2000) examined hardware and software cache prefetching techniques for MPEG benchmarks. Software prefetching is used to hide the memory latency of applications with irregular memory access patterns, such as Sparse Matrix Vector Multiplication (SpMV) and graph algorithms (Ammenouche and Cong, 2011). Prefetching by hardware and software means is preferred in the literature because it improves cache performance by reducing the cache miss rate, and thereby addresses the memory wall problem.
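
To make the idea of software prefetching for irregular accesses concrete, the following C sketch issues an SSE prefetch hint for the vector element that an SpMV loop will need a few iterations later. It is an illustration under stated assumptions rather than the method used in the paper: the Compressed Sparse Row (CSR) layout, the function name `spmv_csr_prefetch` and the `dist` prefetch-distance parameter are all hypothetical, and a suitable distance has to be found by measurement on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* y = A * x for a CSR matrix (hypothetical layout for this sketch):
 *   row_ptr[i] .. row_ptr[i+1]-1  index the nonzeros of row i,
 *   col_idx[k]                    is the column of nonzero k,
 *   val[k]                        is its value.
 * The access x[col_idx[k]] is irregular, so a hardware prefetcher that
 * tracks strides may not predict it. The loop therefore asks for the
 * element needed `dist` nonzeros ahead while working on the current one. */
void spmv_csr_prefetch(size_t rows, const size_t *row_ptr,
                       const int *col_idx, const float *val,
                       const float *x, float *y, size_t dist)
{
    assert(row_ptr && col_idx && val && x && y);
    const size_t nnz = row_ptr[rows];   /* total number of nonzeros */

    for (size_t i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            /* Guard the look-ahead so col_idx is never read out of bounds.
             * The prefetch is only a hint and does not change the result. */
            if (k + dist < nnz)
                _mm_prefetch((const char *)&x[col_idx[k + dist]],
                             _MM_HINT_T0);
            sum += val[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

The hint `_MM_HINT_T0` requests the line in all cache levels; whether prefetching helps at all depends on the matrix structure and on how well the hardware prefetchers already cover the access pattern.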

Performance tools such as GNU gprof, ATOM, the PIN tool and VTune Amplifier are used in the literature to gather cache-related metrics. Khamparia and SairaBanu (2013) made an extensive study of performance monitoring tools and used the PIN tool to measure cache metrics for the DMVM kernel. In another work, Sairabanu et al. (2013) used the PIN tool to gather the cache miss rate for the SpMV kernel. This binary instrumentation tool is not capable of gathering kernel-level statistics. Routines written for one tool are incompatible with others, and gathering the results takes significant time (Thiel, 2006). To collect data on a specific line of a function and to collect more cache-related metrics, the Intel VTune Amplifier is preferred in the literature. Prakash and Peng (2008) describe the use of the Intel VTune Performance Analyzer as a fast and practical tool for characterizing emerging workloads. Kimball et al. (2014) enumerate the effect of matrix structure on SpMV performance. They analysed the cache memory performance metrics of the SpMV kernel with R-MAT matrices and Finite Difference (FD) matrices using VTune Amplifier. Their paper does not address SIMD vectorization or prefetching.
