OpenCL
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors (Khronos, 2008).
Kernel
OpenCL defines a kernel as a function declared in a program and executed on an OpenCL device. A kernel is identified by the __kernel qualifier.
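A minimal sketch of such a kernel follows, written in C so that it compiles and runs without an OpenCL device. The names vec_add and run_vec_add are illustrative. The empty macros for __kernel and __global and the single-threaded get_global_id() are host-side stand-ins assumed for this sketch: on a real platform the OpenCL C compiler supplies the qualifiers, and the runtime launches one work-item per index (typically through clEnqueueNDRangeKernel).

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins (assumptions for this sketch): a real OpenCL C
   compiler provides __kernel and __global, and the runtime provides
   get_global_id(). Here they are defined so the kernel compiles as C. */
#define __kernel
#define __global
static size_t emulated_global_id; /* id of the work-item being emulated */
static size_t get_global_id(unsigned int dim)
{
    (void)dim; /* only a 1-D index space is emulated */
    return emulated_global_id;
}

/* The kernel, in OpenCL C syntax: the __kernel qualifier marks vec_add
   as a kernel, and each work-item adds one pair of elements. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Sequential stand-in for launching vec_add over global_size work-items;
   on a device these work-items would run in parallel. */
static void run_vec_add(size_t global_size,
                        const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (emulated_global_id = 0; emulated_global_id < global_size;
         ++emulated_global_id)
        vec_add(a, b, c);
}
```

Names beginning with a double underscore are normally reserved for the implementation, so defining them as macros is acceptable only in a throwaway illustration like this one.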
From control complexity: a kernel is a subroutine that executes independently on a SIMD machine or an accelerator, without interworking with its host machine or with other SIMD machines.
From data complexity: a kernel is a computation that uses a single regular memory access pattern for each operand array (that is, only one addressing kernel, or template, per array).
From algorithm complexity: a kernel shall handle only one algorithm, or one part of an algorithm, that can be implemented using a single loop.
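The three views can be illustrated with one small function. The sketch below is hypothetical (addr_template and axpy_kernel are names invented for illustration, not from any library): it implements one algorithm, y = alpha*x + y, in a single loop; each operand array is addressed by exactly one affine template, base + i*stride; and it reads and writes only the arrays passed to it, so it needs no interaction with a host while it runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical addressing template: one affine pattern per operand array,
   address(i) = base + i * stride. */
typedef struct {
    size_t base;
    size_t stride;
} addr_template;

static size_t template_addr(addr_template t, size_t i)
{
    return t.base + i * t.stride;
}

/* One algorithm (y = alpha*x + y), one loop, and exactly one addressing
   template per operand array. It touches only the arrays it is given. */
static void axpy_kernel(size_t n, float alpha,
                        const float *x, addr_template tx,
                        float *y, addr_template ty)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        size_t xi = template_addr(tx, i);
        size_t yi = template_addr(ty, i);
        y[yi] = alpha * x[xi] + y[yi];
    }
}
```

For example, with x = {1, 2, 3, 4, 5, 6}, a template of base 0 and stride 2 reads x[0], x[2], and x[4], so a call with n = 3 and alpha = 2 on a zeroed y yields {2, 6, 10}.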
Cluster
A cluster here consists of one master (host) machine and several SIMD machines.
Executing a kernel in such a cluster incurs the run-time cost of (1) loading data from the main memory into the SIMD local vector memory, (2) loading data from the local vector memory into the vector register file, and (3) storing results from the local vector memory back to the main memory.
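The three costs can be added up with a simple model. In the sketch below, each transfer is assumed to cost a fixed setup latency plus its size divided by the bandwidth of the path (rounded up); this linear model, its parameter names, and the example numbers are illustrative assumptions, not figures for any particular SIMD machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical linear cost model for one data transfer path. */
typedef struct {
    uint64_t setup_cycles;    /* fixed start-up cost of a transfer */
    uint64_t bytes_per_cycle; /* sustained bandwidth of the path */
} transfer_path;

static uint64_t transfer_cycles(transfer_path p, uint64_t bytes)
{
    assert(p.bytes_per_cycle > 0);
    /* setup latency plus ceil(bytes / bandwidth) */
    return p.setup_cycles + (bytes + p.bytes_per_cycle - 1) / p.bytes_per_cycle;
}

/* Run-time data cost of one kernel: the sum of the three transfers. */
static uint64_t kernel_data_cost(transfer_path main_to_vm, uint64_t in_bytes,
                                 transfer_path vm_to_rf, uint64_t rf_bytes,
                                 transfer_path vm_to_main, uint64_t out_bytes)
{
    return transfer_cycles(main_to_vm, in_bytes)   /* (1) main -> vector memory */
         + transfer_cycles(vm_to_rf, rf_bytes)     /* (2) vector memory -> registers */
         + transfer_cycles(vm_to_main, out_bytes); /* (3) results -> main memory */
}
```

With a hypothetical DMA path of 100 setup cycles at 8 bytes per cycle, loading 1024 bytes costs 100 + 128 = 228 cycles; the fixed setup term is why merging many small transfers into one larger transaction pays off.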
In this article, data permutation means selecting each data element of a vector and storing it in a particular memory block of the vector memory. It can be performed while data is loaded from the main memory into the local vector memory. Its purpose is to distribute data across the memory blocks of the vector memory so that multiple data values can be accessed simultaneously.
Because data permutation places the data values that are needed together in different memory blocks, those values can be accessed in parallel without conflict, which facilitates parallel computing.
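The effect of a permutation can be checked directly. The sketch below uses a classic skewed (diagonal) storage scheme as one possible permutation; it is an illustration chosen for this sketch, not necessarily the scheme used in this article, and the count of 4 memory blocks is an assumption. Storing element (r, c) of a 4 x 4 matrix in block (r + c) mod 4 lets any row and any column be read in one conflict-free parallel access, whereas the unpermuted layout places a whole column in a single block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BANKS 4 /* assumed number of memory blocks in the vector memory */

/* Skewed permutation: element (r, c) goes to block (r + c) % BANKS.
   (Its offset within the block, r, is not needed for the conflict check.) */
static size_t skewed_bank(size_t r, size_t c) { return (r + c) % BANKS; }

/* Unpermuted layout for comparison: column c always lives in block c. */
static size_t plain_bank(size_t r, size_t c) { (void)r; return c; }

/* True if reading all BANKS elements of row `index` (is_row) or of column
   `index` (!is_row) touches every block exactly once, so the reads can
   proceed in parallel without conflict. */
static bool conflict_free(size_t (*bank_of)(size_t, size_t),
                          bool is_row, size_t index)
{
    bool used[BANKS] = {false};
    for (size_t k = 0; k < BANKS; ++k) {
        size_t b = is_row ? bank_of(index, k) : bank_of(k, index);
        assert(b < BANKS);
        if (used[b])
            return false; /* two elements in the same block: conflict */
        used[b] = true;
    }
    return true;
}
```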
A data access kernel is separated from the algorithm kernel it serves. It carries the location of the data in the main memory and in the local vector memory. It also specifies how the data in the main memory is collected and merged into one DMA transaction, and how that data shall be distributed to each block of the parallel vector memory.
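A data access kernel can be pictured as a small descriptor that is executed separately from the computation. In the hypothetical sketch below, the descriptor records a strided region of main memory (base, stride, count) to be gathered as one DMA transaction, plus a round-robin rule for distributing the gathered elements across the blocks of the vector memory. The field names, block count, block size, and round-robin rule are all assumptions, and the DMA transfer is emulated with an ordinary loop.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS 4       /* assumed number of blocks in the vector memory */
#define BLOCK_WORDS 16 /* assumed capacity of each block, in words */

/* Hypothetical data access kernel descriptor. Source: main-memory words
   main_base + e * main_stride for e = 0 .. count-1 (one DMA transaction).
   Destination: element e goes to block e % BLOCKS, word local_base + e / BLOCKS. */
typedef struct {
    size_t main_base;
    size_t main_stride;
    size_t count;
    size_t local_base;
} access_kernel;

/* Executes the descriptor as one emulated DMA transaction: gathers the
   strided elements and distributes them round-robin across the blocks. */
static void run_access_kernel(const access_kernel *k, const int *main_mem,
                              size_t main_len, int vmem[BLOCKS][BLOCK_WORDS])
{
    for (size_t e = 0; e < k->count; ++e) {
        size_t src = k->main_base + e * k->main_stride;
        size_t blk = e % BLOCKS;
        size_t word = k->local_base + e / BLOCKS;
        assert(src < main_len && word < BLOCK_WORDS);
        vmem[blk][word] = main_mem[src];
    }
}
```

Keeping this descriptor apart from the computation means the same algorithm kernel can be reused with a different data layout by changing only the access kernel.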
The prolog and epilog are part of a context: a prolog introduces a kernel to be executed on a SIMD machine, and an epilog terminates a kernel executed on a SIMD machine.
The prolog is the initial part of a SIMD kernel program, and the epilog is its finishing part. On a SIMD machine, the prolog prepares the regular parallel computation by aligning the data and the data accesses; the epilog handles the final, irregular part of the kernel after the regular parallel computation has been executed.
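The split between prolog, regular body, and epilog can be sketched with a vector sum. This is a minimal illustration assuming a vector length of 4 ints and vector loads that must be aligned to 16 bytes; the vector add is emulated with scalar lanes, and simd_style_sum is a hypothetical name. The prolog peels leading elements until the data address is aligned, the body processes whole vectors, and the epilog handles the leftover elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL 4 /* assumed SIMD vector length, in elements */

/* Sums n ints using a prolog / regular body / epilog structure.
   The vector add is emulated with VL scalar lanes. */
static long simd_style_sum(const int *x, size_t n)
{
    const uintptr_t align = VL * sizeof(int); /* assumed vector alignment */
    long total = 0;
    size_t i = 0;

    /* Prolog: handle leading elements one at a time until x + i is
       aligned, so the body can use aligned vector accesses. */
    while (i < n && ((uintptr_t)(x + i) % align) != 0)
        total += x[i++];

    /* Regular part: whole aligned vectors of VL elements. */
    long lanes[VL] = {0};
    for (; i + VL <= n; i += VL)
        for (size_t l = 0; l < VL; ++l)
            lanes[l] += x[i + l]; /* one emulated vector add */
    for (size_t l = 0; l < VL; ++l)
        total += lanes[l];

    /* Epilog: the final irregular remainder of fewer than VL elements. */
    for (; i < n; ++i)
        total += x[i];

    return total;
}
```

Whatever the starting address and length, the three parts together cover each element exactly once, so the result equals a plain scalar sum.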