1. Introduction
The advent of big data technology has brought a revolution in the science of numerical weather prediction (NWP). Big data in NWP refers to 'climate big data' coming from rapid, dense observations by advanced sensors and from very high-resolution model output. A ten-fold increase in model resolution requires roughly 10⁴ times more computation across the four dimensions of space and time. To achieve this massively challenging throughput, and to fully utilize this big data so as to provide more accurate and rapidly updated weather prediction, innovations must be brought to the existing data assimilation (DA) and NWP systems (Big Data Assimilation) (Miyoshi et al., 2016a; Miyoshi et al., 2016b). This can strengthen early warning systems against regional, sudden and severe calamities such as hurricanes, heavy rain, flooding, landslides and the like. Innovative research has already begun on speeding up the various phases of NWP, such as observation data processing, model runs and data transfer between the model and DA. Within the data assimilation phase itself, better ways of storing and processing large matrices and vectors can be explored. With the three spatial dimensions and one temporal dimension considered in variational and Kalman-filter-based assimilation algorithms, the atmospheric state variables such as wind, pressure and humidity at all grid points, vertical layers and time instants are represented in a vector with around 10⁸ entries; likewise, the measurement vector contains around 10⁶ observation entries. Because these vectors are so large, the resulting model error covariance and observation error covariance matrices are also large, of the order of 10⁸ × 10⁸. Hence the performance of these assimilation methods depends on the design and implementation of better algorithms for processing large matrices in general, and for matrix inversion in particular, and this was the impetus behind our proposed work.
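To make the scale of these figures concrete, the following back-of-envelope sketch (illustrative only; the variable names and the assumption of 8-byte double-precision storage are ours, not taken from the cited works) estimates the memory needed for the state vector and for a dense error covariance matrix of the order quoted above:

```python
# Rough storage estimates for the DA problem sizes quoted in the text,
# assuming 8-byte double-precision values (an assumption for illustration).

N_STATE = 10**8   # entries in the atmospheric state vector
N_OBS = 10**6     # entries in the observation (measurement) vector
BYTES = 8         # bytes per double-precision value

state_vec_gb = N_STATE * BYTES / 1e9        # state vector, in gigabytes
obs_vec_mb = N_OBS * BYTES / 1e6            # observation vector, in megabytes
dense_cov_pb = N_STATE**2 * BYTES / 1e15    # dense 10^8 x 10^8 covariance, in petabytes

print(f"state vector:            {state_vec_gb:.1f} GB")
print(f"observation vector:      {obs_vec_mb:.0f} MB")
print(f"dense covariance matrix: {dense_cov_pb:.0f} PB")
```

The dense covariance alone would occupy tens of petabytes, which is why practical assimilation systems never form it explicitly and why sparse or structured matrix algorithms, of the kind this work targets, are essential.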
The massive number-crunching capacity needed to work with large matrices can be provided by Graphics Processing Units (GPUs). CUDA is well suited to data-parallel algorithms (Garland et al., 2008) such as the shallow water model (Playne & Hawick, 2015), delivering high computational throughput provided a few design principles are followed to fully utilize the GPU's processor cores and the shared memory that is critical to the performance of many efficient algorithms. Various improvements to storage formats for efficient sparse matrix-vector (SpMV) operations on GPUs (Gao, Qi & He, 2016; Koza, Matyka, Szkoda & Mirosław, 2014; Dziekonski, Lamecki & Mrozowski, 2011) have shown this. Wu, Ke, Lin and Jhan (2014) report that dynamically adjusting the number of threads helps to fully utilize the compute power of GPUs. Modeling tools (Zouaneb, Belarbi & Chouarfia, 2016) also help in validating task scheduling on GPUs and analyzing performance. Earlier studies show that GPU implementations are several times faster than their CPU counterparts (Helfenstein & Koko, 2012) and can be efficient if the matrix is represented and processed using the two-dimensional textures for which GPUs are optimized (Galoppo, Govindaraju, Henson & Manocha, 2005). Further studies have revealed that parallel implementations on hybrid platforms consisting of CPUs and GPUs (Ezzatti, Quintana & Remón, 2011a; Benner, Ezzatti, Quintana-Ortí & Remón, 2009; Ezzatti, Quintana-Ortí & Remón, 2011b) are more efficient for both small and large matrices than pure GPU implementations.
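As a point of reference for the SpMV storage formats mentioned above, the sketch below shows a sequential compressed sparse row (CSR) multiplication in plain Python; CSR is a common baseline layout that the cited GPU kernels refine further. The function and variable names are ours, chosen for illustration, not taken from any of the cited works.

```python
# Minimal sequential CSR (compressed sparse row) sparse matrix-vector
# product. CSR stores only the nonzero values, their column indices,
# and per-row offsets into those arrays -- the baseline layout that
# GPU-oriented SpMV formats build on.

def csr_spmv(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # row_ptr[row]..row_ptr[row+1] spans this row's nonzeros
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0,   2,   1,   0,   2]
row_ptr = [0, 2, 3, 5]

print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # -> [5.0, 2.0, 8.0]
```

On a GPU, the outer loop over rows is the natural unit of data parallelism (one thread or warp per row), which is why the memory layout of `values` and `col_idx` dominates performance and motivates the format variants studied in the cited papers.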