Selection of Representative Feature Training Sets With Self-Organized Maps for Optimized Time Series Modeling and Prediction: Application to Forecasting Daily Drought Conditions With ARIMA and Neural Network Models

Elizabeth McCarthy (University of Southern Queensland, Australia), Ravinesh C. Deo (University of Southern Queensland, Australia), Yan Li (University of Southern Queensland, Australia) and Tek Maraseni (University of Southern Queensland, Australia)
DOI: 10.4018/978-1-5225-4766-2.ch020


While the simulation of stochastic time series is challenging due to their inherently complex nature, the difficulty is compounded by the arbitrary yet widely accepted feature data usage methods frequently applied during the model development phase. A pertinent context in which these practices are reflected is the forecasting of drought events. This chapter considers the optimization of feature data usage by sampling daily data sets via self-organizing maps (SOMs) to select representative training and testing subsets and, accordingly, improve the performance of effective drought index (EDI) prediction models. The effect is assessed by comparing artificial neural network (ANN) and autoregressive integrated moving average (ARIMA) models incorporating the SOM approach, based on commonly used performance indices for the city of Brisbane. The study shows that SOM-ANN ensemble models deliver predictive performance for EDI values that is competitive with that of ARIMA models.
Chapter Preview


The quality of data-driven forecasts generated for environmental variables is greatly influenced by the nature of the training data used (Nelson, Hill, Remus, & O'Connor, 1999; Zhang & Qi, 2005), particularly when operating at daily intervals, where the stochastic nature of raw environmental behaviour is more apparent. The data used for training, validating and testing data-intelligent models have a profound impact on a model’s ability to detect the characteristics of the features and on its consequent predictive performance (Bowden, Maier, & Dandy, 2002). Checks that compare the statistical characteristics of the training and testing data sets for consistency, and for representativeness of the whole set, are rarely performed or reported in the literature. Accordingly, the resulting models may leave significant room for performance optimization.
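The kind of consistency check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`summarize`, `representativeness_gaps`), the choice of statistics, and the synthetic trending series are all hypothetical and are not taken from the chapter or from real EDI data.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for one data subset."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def representativeness_gaps(subset, full):
    """Absolute gap between subset and full-set statistics,
    scaled by the full-set standard deviation (assumed metric)."""
    s, f = summarize(subset), summarize(full)
    scale = f["stdev"] or 1.0  # avoid division by zero for constant series
    return {name: abs(s[name] - f[name]) / scale for name in f}


# Hypothetical series with a slow downward trend (not real EDI data).
series = [1.0 - 0.01 * t for t in range(200)]

# Chronological 80/20 split at an arbitrarily chosen point.
train, test = series[:160], series[160:]

gaps_train = representativeness_gaps(train, series)
gaps_test = representativeness_gaps(test, series)

for name in gaps_train:
    print(f"{name:>5}: train gap = {gaps_train[name]:.2f}, "
          f"test gap = {gaps_test[name]:.2f}")
```

With a trending series such as this one, the chronologically split testing subset drifts further from the full-set mean than the training subset does, which is the sort of mismatch these checks are meant to expose.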

A literature review revealed that data sets are typically allocated by dividing the chronologically ordered time series at arbitrarily defined intervals (Dayal, Deo, & Apan, 2017; Deo, Byun, Adamowski, & Kim, 2014; Deo, Kisi, & Singh, 2017; Djerbouai & Souag-Gamane, 2016; Nury, Hasan, & Alam, 2017; Shirmohammadi, Vafakhah, Moosavi, & Moghaddamnia, 2013; Zhang, 2003). Such approaches may fail to recognise more subtle, lower-frequency trends, and hence may compromise model performance because the resulting training data sets are statistically unrepresentative.

An alternative approach, which has so far seen limited application in drought forecasting, is the optimal configuration of data-driven models in an ensemble with Kohonen’s self-organizing map (SOM) (Kohonen, 1998; 2014). SOM is a popular neural network tool offering an unsupervised method of clustering the feature data set values (Kalteh, Hjorth, & Berndtsson, 2008; Nourani, Baghanam, Adamowski, & Kisi, 2014). SOM can be applied to simplify the input series by identifying the underlying trends in the feature data sets to be modelled, thus reducing the need for an intact data series. The feature data set is then constructed by simple random sampling from each of the clusters (Wu, May, Dandy, & Maier, 2012; Wu, May, Maier, & Dandy, 2013). Using SOM to select training and testing data sets incidentally provides a means to manage time series stationarity and linearity issues, both of which reduce the efficacy of stochastic models for forecasting purposes in climate applications. Hence, the more deliberate selection of training and testing data sets through SOM offers a convenient and effective method to optimize the architecture and improve the performance of data-driven forecasting models.
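The cluster-then-sample idea can be illustrated with a minimal sketch. It assumes a one-dimensional SOM with a Gaussian neighbourhood and linearly decaying learning rate, followed by simple random sampling within each cluster. The map size, decay schedules, 80/20 proportion, function names and synthetic two-feature data are all assumptions made for illustration, not the configuration used in the chapter or in the cited works.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the SOM node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional SOM on equal-length feature vectors."""
    rng = random.Random(seed)
    # Initialise node weights from randomly chosen samples.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    sigma0 = max(n_nodes / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):    # one shuffled pass
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D node grid.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def stratified_split(data, weights, train_frac=0.8, seed=0):
    """Randomly sample each SOM cluster into training and testing indices."""
    rng = random.Random(seed)
    clusters = {}
    for i, x in enumerate(data):
        clusters.setdefault(best_matching_unit(weights, x), []).append(i)
    train_idx, test_idx = [], []
    for members in clusters.values():
        rng.shuffle(members)
        cut = round(train_frac * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical two-feature samples (not real drought index data).
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]

weights = train_som(data)
train_idx, test_idx = stratified_split(data, weights)
print(len(train_idx), "training samples,", len(test_idx), "testing samples")
```

Because every cluster contributes to both subsets in roughly the same proportion, each subset covers the same regions of feature space, which is the property the SOM-based selection is intended to deliver.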

Optimizing the features in model input data with a SOM, using the neural network clustering and random sampling approach, can assist modelers in creating robust and statistically consistent training, validation and testing data sets. This can lead to high-performing and efficient data-driven models. Such model optimization attributes are highly desirable in drought management and drought forecasting decision-support tools.

The aim of this research is to develop a data-driven model using a self-organized map to produce quality time series forecasts while (1) optimizing the model’s architecture by selecting representative data sets, using unsupervised SOMs, to train and test the data-driven models; and (2) comparing the performance of the optimized SOM-ANN ensemble model against ARIMA and equivalent ANN models formulated from indiscriminate data sets.
