Statistical Modelling and Analysis of the Computer-Simulated Datasets

M. Harshvardhan (Indian Institute of Management Indore, India) and Pritam Ranjan (Indian Institute of Management Indore, India)
DOI: 10.4018/978-1-5225-8407-0.ch011

Abstract

Over the last two decades, science has come a long way from relying only on physical experiments and observations to experimentation using computer simulators. This chapter focuses on the modelling and analysis of data arising from computer simulators. It turns out that traditional statistical metamodels are often not very useful for analyzing such datasets. For deterministic computer simulators, the realizations of Gaussian process (GP) models are commonly used for fitting a surrogate statistical metamodel of the simulator output. The chapter starts with a quick review of the standard GP-based statistical surrogate model. It also emphasizes the numerical instability caused by near-singularity of the spatial correlation structure in the GP model fitting process. The authors also present a few generalizations of the GP model, review methods and algorithms specifically developed for analyzing big data obtained from computer model runs, and review the popular analysis goals of such computer experiments. A few real-life computer simulators are also briefly outlined.

Introduction

In the early days, when computers were not readily accessible to most people, statisticians and data analysts focussed on developing innovative methodologies that were efficient for analyzing small datasets. Over the last two decades, we have come a long way from relying only on physical experiments and observations to experimentation using computer simulation models, commonly referred to as computer simulators or computer models. These simulators are software implementations of real-world processes, built on a comprehensive understanding of the underlying phenomena. The applications range from simulating socioeconomic behaviour, the impact of a car crash, the manufacturing of a compound for drug discovery, climate and weather forecasting, population growth of certain pest species, cosmological phenomena like dark energy and the expansion of the universe, and the emulation of tidal flow for harnessing renewable energy, to the simulation of nuclear reactions, and so on. Given easier access to high-performance computing resources such as cloud computing and cluster grids, computer model data is now a reality in everyday life.

In this chapter, we focus on the modelling and analysis of datasets arising from such computer simulators. As in a physical experiment, the data obtained from computer simulator runs have to be modelled and analysed to gain a deeper understanding of the underlying process. However, traditional statistical metamodels are often not very useful for analyzing such datasets. This is because these computer models are frequently deterministic in nature; that is, repeated runs of such a simulator with fixed input settings yield the same output (response). In other words, there is no replication error for deterministic computer simulators. Recall that in traditional statistical models, such as regression, the main driving force behind model fitting and inference is the distribution of replication errors.

For deterministic computer simulators, the realizations of Gaussian process (GP) models, trained on the observed simulator data, are commonly used for fitting a surrogate statistical metamodel of the simulator output. This is particularly crucial if the simulator is expensive to run, which is the case for many complex real-life phenomena. The notion of GP models gained popularity in the late 1990s and early 2000s (e.g., Santner et al. (2003); Rasmussen and Williams (2006); Fang et al. (2005)), though it was first proposed in the seminal paper of Sacks et al. (1989). Section 2 of the chapter presents a quick review of the standard GP-based statistical surrogate model. We also briefly discuss the implementation procedure using both the maximum likelihood method and the Bayesian approach.

Almost all published research articles and books focus on new methodologies and algorithms for analyzing computer simulator data, and not on the small nuances of the actual implementation, which are extremely useful from a practitioner's standpoint. This chapter emphasizes such computational issues. In particular, Section 3 of the chapter discusses the numerical instability due to near-singularity or ill-conditioning of the spatial correlation structure, which is the key building block behind the flexibility of the GP-based surrogate model. In practice, the majority of researchers simply use a numerical fix to overcome this issue, but this inadvertently compromises other aspects of the model assumptions. We present an empirical study comparing different current practices for addressing this ill-conditioning problem. We also discuss best coding practices for implementing such model fitting exercises, for instance, which matrix decomposition method (LU, QR, SVD, or Cholesky) is recommended from an accuracy and time-efficiency perspective.
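To make the role of a matrix decomposition concrete, the following is a minimal sketch, not the chapter's implementation. It assumes a Gaussian correlation matrix R on a hypothetical one-dimensional design, with an arbitrary correlation parameter of 50, and uses a Cholesky factor to obtain the two quantities that likelihood-based GP fitting typically needs: the solution of R a = y and log|R|. It makes no claim about which decomposition the chapter ultimately recommends.

```python
import numpy as np

# Hypothetical design and responses (not from the chapter).
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)

# Assumed Gaussian correlation matrix with an arbitrary parameter of 50.
R = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)

# Cholesky factorization: R = L @ L.T with L lower triangular.
L = np.linalg.cholesky(R)

# Solve R @ alpha = y in two steps: L z = y, then L.T alpha = z.
# np.linalg.solve is a general solver; a dedicated triangular solver
# would exploit the structure of L but is not needed for this sketch.
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)

# log|R| from the diagonal of L, avoiding an explicit determinant.
log_det = 2.0 * np.sum(np.log(np.diag(L)))
```

Here the design points are well separated, so the factorization succeeds. When points are close together or the correlation parameter is small, `np.linalg.cholesky` can fail with a `LinAlgError`, which is one visible symptom of the ill-conditioning discussed above.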

Given the revolution in computing power, it is now easy to collect and process datasets that are spatio-temporal and functional in nature. Dynamic computer models, i.e., simulators that return a time-series response (see Zhang et al. (2018b)), are a current hot topic of research in applied statistics and computer experiments. Section 4 of the chapter reviews several generalizations of the GP model that account for multiple sources of uncertainty in the simulation model, non-stationarity of the underlying processes, and the dynamic nature of such computer simulator outputs.

Key Terms in this Chapter

Stationarity: (here, weak stationarity) In the context of response surfaces, a process is said to be non-stationary if the surface exhibits abrupt changes in curvature and shape.

Correlation Length Parameter: It is the inverse of the correlation hyper-parameter in the power-exponential correlation function, and is used to quantify the smoothness of the fitted surrogate.
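As an illustration of how the correlation hyper-parameter controls smoothness, here is a minimal sketch of a power-exponential correlation matrix. The functional form in the docstring, the name `power_exp_corr`, and the symbols `theta` and `p` are assumptions made for this example; the preview text does not show the chapter's own notation.

```python
import numpy as np

def power_exp_corr(X, theta, p=2.0):
    """Power-exponential correlation matrix for the rows of X.

    Assumed form: R[i, j] = exp(-sum_k theta[k] * |X[i, k] - X[j, k]|**p),
    with theta[k] > 0 and 0 < p <= 2 (p = 2 gives the Gaussian correlation).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    theta = np.broadcast_to(np.asarray(theta, dtype=float), (X.shape[1],))
    # Coordinate-wise absolute differences between all pairs, shape (n, n, d).
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return np.exp(-np.sum(theta * diff**p, axis=2))

# Three hypothetical design points in one dimension.
X = np.array([[0.0], [0.5], [1.0]])
R_short = power_exp_corr(X, theta=10.0)  # large theta: short correlation length
R_long = power_exp_corr(X, theta=0.1)    # small theta: long correlation length
```

With a small `theta` (long correlation length), distant points remain highly correlated and the fitted surface is smoother; with a large `theta`, the correlation decays quickly with distance.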

Gaussian Process: A stochastic process is said to follow a Gaussian process (GP) if every finite collection of its random variables, for an arbitrary number of arbitrarily chosen input points, follows a multivariate normal distribution.

Best Linear Unbiased Predictor (BLUP): An unbiased linear predictor with minimum variance among the class of all linear unbiased predictors is called the best linear unbiased predictor. In some sense, it is the best linear unbiased estimator (BLUE) of the unobserved response.
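The following is a minimal sketch of a BLUP for a deterministic toy simulator, under stated assumptions: a constant-mean GP, a Gaussian correlation function, and a fixed correlation parameter `theta` that is not estimated. The function names, the toy `simulator`, and the value `theta=20.0` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gauss_corr(a, b, theta):
    # Gaussian correlation between 1-D input vectors a and b.
    return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)

def gp_blup(x_train, y_train, x_new, theta):
    """BLUP under a constant-mean GP with a fixed, assumed theta."""
    ones = np.ones(len(x_train))
    R = gauss_corr(x_train, x_train, theta)
    # Generalized least squares estimate of the constant mean.
    mu_hat = (ones @ np.linalg.solve(R, y_train)) / (ones @ np.linalg.solve(R, ones))
    # Predictor: mu_hat + r(x)^T R^{-1} (y - mu_hat * 1).
    r = gauss_corr(x_new, x_train, theta)
    return mu_hat + r @ np.linalg.solve(R, y_train - mu_hat * ones)

def simulator(x):
    # Hypothetical deterministic simulator: same input, same output.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 6)
y_train = simulator(x_train)

y_at_train = gp_blup(x_train, y_train, x_train, theta=20.0)
y_new = gp_blup(x_train, y_train, np.array([0.1, 0.35]), theta=20.0)
```

Because no nugget is added, the predictor reproduces the training responses up to floating-point error, which matches the absence of replication error in a deterministic simulator.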

Nugget: It is a small positive constant added to the diagonal of the correlation matrix to avoid ill-conditioning in near-singular matrices.

Near-Singular Matrix: A matrix whose determinant is close to zero, and whose inverse is therefore unreliable, is called a near-singular or ill-conditioned matrix. The extent of ill-conditioning is measured by its condition number.
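The two definitions above can be illustrated numerically. The sketch below assumes a Gaussian correlation matrix on 20 equally spaced hypothetical design points and a nugget value of 1e-6; both choices, and the name `delta`, are illustrative and not taken from the chapter.

```python
import numpy as np

def gauss_corr_matrix(x, theta):
    # Gaussian correlation matrix for a 1-D design x.
    return np.exp(-theta * (x[:, None] - x[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 20)        # closely spaced design points
R = gauss_corr_matrix(x, theta=1.0)  # strong correlation between neighbours

delta = 1e-6                          # hypothetical nugget
R_delta = R + delta * np.eye(len(x))  # nugget added to the diagonal

cond_R = np.linalg.cond(R)
cond_R_delta = np.linalg.cond(R_delta)
print(f"condition number without nugget: {cond_R:.3e}")
print(f"condition number with nugget:    {cond_R_delta:.3e}")
```

The nugget caps the condition number at roughly the largest eigenvalue of R divided by `delta`, so a larger nugget gives a better-conditioned matrix. The price is that the matrix being inverted is no longer the original correlation matrix, so the choice of nugget is a trade-off rather than a free fix.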
