Introduction
Uni-processor performance improvement has flattened due to three "wall" problems: the Instruction Level Parallelism (ILP) wall, the memory wall and the power wall. The power problem is addressed via improved resource utilization. Currently, general-purpose commercial microprocessors provide SIMD vector extensions to minimize power consumption. The SIMD approach uses a small amount of extra hardware in the execution units of a core, thereby reducing overall power consumption (Welch et al., 2012). It is a cost-effective method compared to GPU computing (Livesey et al., 2012). Cebrian et al. (2012) quantify the effects of parallelization, vectorization, specialization and heterogeneity on the energy efficiency of new-generation processors. They specify that software developers should prefer vectorization over parallelization, since it gives better energy efficiency. Liu et al. (2013) describe how SIMD computing addresses cooling challenges and provides high-performance computing at a minimum clock speed. To program SIMD units effectively, SSE instructions are now available in high-level programming languages. Mitra et al. (2013) performed a comparative study of the NEON SIMD instruction set of ARM processors and the SSE2 instruction set provided for Intel platforms. They also compared auto-vectorization with hand-tuned vectorization for five different benchmarks on ten different hardware processors, showing that hand-tuned vectorization outperforms auto-vectorization on both ARM and Intel processors. In this paper, we use the SIMD vectorization technique with hand-tuned SSE instructions to address the power problem, as has been done in the literature.
The memory wall problem is addressed by techniques such as speculative execution, out-of-order execution, multithreading and data prefetching. Prefetching can be performed either in software or in hardware (Liu et al., 2014; Karakasis et al., 2009; Byna et al., 2008). Current processors have support for hardware prefetchers, which are preferable for applications such as DMVM that exhibit regular access patterns. As specified in the Intel Architectures Optimization Reference Manual (2013), the hardware prefetchers of Haswell processors are used to prefetch data into the L2 cache. Several studies have examined prefetch strategies for scientific and commercial applications. Zucker et al. (2000) examined hardware and software cache prefetching techniques for MPEG benchmarks. Software prefetching is used to hide the memory latency of applications such as SpMV and graph algorithms that exhibit irregular memory access patterns (Ammenouche and Cong, 2011). Prefetching by hardware and software means is preferred in the literature to improve cache performance by reducing the cache miss rate, thereby addressing the memory wall problem.
Performance tools such as GNU gprof, ATOM, the PIN tool and VTune Amplifier are used in the literature to gather cache-related metrics. Khamparia and SairaBanu (2013) made an extensive study of performance monitoring tools and used the PIN tool to measure cache metrics for the DMVM kernel. In another work, Sairabanu et al. (2013) used the PIN tool to gather the cache miss rate for the SpMV kernel. This binary instrumentation tool is not capable of gathering kernel-level statistics. Routines written for one tool are incompatible with others, and such tools also spend significant time gathering results (Thiel, 2006). To collect data on a specific line of a function and to gather a wider range of cache-related metrics, Intel VTune Amplifier is preferred in the literature. Prakash and Peng (2008) describe the use of the Intel VTune Performance Analyzer as a fast and practical tool for characterizing emerging workloads. Kimball et al. (2014) enumerate the effect of matrix structure on SpMV performance. They analysed cache memory performance metrics for the SpMV kernel with R-MAT matrices and Finite Difference (FD) matrices using VTune Amplifier. However, they did not consider SIMD vectorization or prefetching techniques.