Gaussian Process Regression (GPR) is a nonparametric, nonlinear Bayesian approach to regression that is used in physics, geostatistics (where it is usually called “Kriging”), financial statistics, and time series analysis, and it is currently making waves in machine learning. The aim is to find the posterior distribution of the response conditioned on the data, so the question is: which random function explains the observed values best? As a precondition, we need realizations that can be described by a multivariate normal distribution. (Tip: space-filling designs are highly suitable.) This enables GPR to capture input-output dynamics without having to “fit” a fixed functional form, making it a highly flexible tool. Moreover, GPR is relatively robust to discrepancies between its assumptions and reality, also in higher dimensions.
Once the model has been trained and its hyper-parameters (bandwidth and nugget) have been tuned, it can be used to make predictions for unseen data.
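To give a feel for what happens behind the dialogs, the following minimal sketch fits and tunes a GP directly with the ‘laGP’ package, on which the CornerstoneR implementation is based (see References). The toy data, start values, and bounds are illustrative choices, not CornerstoneR’s internals.

```r
library(laGP)

## Toy data: noisy sine over [0, 2*pi] (illustrative only)
set.seed(1)
X <- matrix(seq(0, 2 * pi, length.out = 40), ncol = 1)
y <- sin(X[, 1]) + rnorm(40, sd = 0.1)

## Fit a separable GP; d = bandwidth, g = nugget (start values)
eps <- sqrt(.Machine$double.eps)
gpi <- newGPsep(X, y, d = 0.5, g = 1e-6, dK = TRUE)

## Jointly tune bandwidth and nugget by maximum likelihood
mleGPsep(gpi, param = "both", tmin = c(eps, eps), tmax = c(10, var(y)))

## Predict at unseen inputs; mean and scale parameterize a Student-t
XX <- matrix(seq(0, 2 * pi, length.out = 200), ncol = 1)
p <- predGPsep(gpi, XX, lite = TRUE)
head(p$mean)
## deleteGPsep(gpi) frees the C-side object when you are done
```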
How do we use the method ‘gaussianProcessRegression’ from ‘CornerstoneR’ in ‘Cornerstone’?
To use GPR in ‘Cornerstone’, open a dataset, e.g. ‘sobol_sequence’ from the Cornerstone sample data, and choose the menu ‘Analyses’ -> ‘CornerstoneR’ -> ‘Gaussian Process Regression’ as shown in the following screenshot.
Gaussian Process Regression: Menu
In the appearing dialog, select all ‘x*’ variables as predictors. These columns were simulated in Cornerstone using Niederreiter sequences and represent a space-filling design. ‘y1’ and ‘y2’ are the computed response variables with a random component.
Gaussian Process Regression: Variable Selection
‘OK’ confirms your selection and the following window appears.
Gaussian Process Regression: R Script
Set the GPR options via ‘R Script’ -> ‘Script Variables’ as shown in the following screenshot.
Gaussian Process Regression: Script Vars
The available script variables are explained in more detail in section Options in the Script Variables Dialog. When you are done setting them, click ‘OK’.
Now, click the execute button (green arrow) or choose the menu ‘R Script’ -> ‘Execute’ to run all calculations in ‘R’. The calculation has finished when the status bar at the lower left reads ‘Last execute error state: OK’. Our result is available via the ‘Summaries’ menu as shown in the following screenshot.
Gaussian Process Regression: Result Menu
Via ‘Summaries’ -> ‘Statistics’, the following dataset with some essential statistics is shown. If you have selected multiple response variables, these statistics are shown row-wise for each response variable.
Gaussian Process Regression: Statistics Table
For instance, ‘Count’ lets you check how many observations the model was trained on. ‘Degrees of Freedom’ displays the degrees of freedom of the Student-t predictive distribution. To estimate the calculation time for bigger datasets, ‘Runtime’ shows the time ‘R’ needed in seconds. We get an R-squared above 0.99 for both response variables, which means that the Gaussian Process Regression performs quite well on the given data.
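As a sketch, this R-squared is the usual coefficient of determination computed from observed and predicted values; the helper name `rsq` below is ours, not part of CornerstoneR.

```r
## R-squared from observed values y and model predictions yhat
rsq <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
rsq(y = c(1, 2, 3, 4), yhat = c(1.1, 1.9, 3.2, 3.8))  # 0.98
```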
Via ‘Summaries’ -> ‘Predictions’, the following data table is shown. It gives you the used, predicted and residual values for each response.
Gaussian Process Regression: Predictions
This dataset is brushable. Observations that were not used for training get the entry ‘0’ (FALSE) in the ‘Used’ column.
We get the tuned hyper-parameters, i.e. the bandwidth (d) and, if requested, the nugget (g), via “Summaries” -> “Hyper-Parameters”. If you chose not to optimize the hyper-parameters, the defined start values are reported as the tuned hyper-parameters. If you have not defined any start values, default values are used. See section Options in the Script Variables Dialog for more information.
Gaussian Process Regression: Hyper-Parameters
If you decide to retrain the model for the same data set but with additional observations, choose these values as start values for the hyper-parameters.
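In ‘laGP’ terms, such a warm start amounts to passing the previously tuned values as the d and g arguments when creating the new GP; a minimal sketch under that assumption (the function name and data arguments are hypothetical):

```r
library(laGP)
eps <- sqrt(.Machine$double.eps)

## Hypothetical warm start: reuse tuned values d.hat and g.hat as start
## values when refitting on the enlarged data (Xnew, ynew)
refit_gp <- function(Xnew, ynew, d.hat, g.hat) {
  gpi <- newGPsep(Xnew, ynew, d = d.hat, g = g.hat, dK = TRUE)
  mleGPsep(gpi, param = "both", tmin = c(eps, eps),
           tmax = c(10, var(ynew)))
  gpi
}
```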
If you have checked the box for computing and outputting the covariance matrix in the script variables, you get the full predictive covariance matrix of the multivariate Student-t distribution for each response via “Summaries” -> “Covariance Matrix”. Note that its calculation substantially increases computation time.
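In ‘laGP’, the full covariance is obtained by predicting with lite = FALSE; continuing the toy sketch from the introduction (objects gpi and XX):

```r
## lite = FALSE returns the full predictive covariance 'Sigma' of the
## multivariate Student-t instead of the pointwise variances 's2'
p2 <- predGPsep(gpi, XX, lite = FALSE)
dim(p2$Sigma)  # 200 x 200 here; cost grows quadratically in prediction rows
p2$df          # the Student-t degrees of freedom
```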
Open “Graphs” -> “Adjusted Response Graph” to obtain the Adjusted Response Graph known from Cornerstone’s basic Linear Regression. We get the following graph.
Gaussian Process Regression: Adjusted Response Graph
To check the underlying data, open “Summaries” -> “Adjusted Response Data”. Note that the Adjusted Response table and graph are not brushable due to the extra prediction points increasing the number of rows in the dataset. The number of prediction points can be set in the script variables.
If you are interested in predicting the values for a DoE, check the Design Predictions via ‘Summaries’ -> ‘Design Predictions’.
Gaussian Process Regression: Design Predictions
The design predictor values are the sequence from the minimum to the maximum of each predictor from the original dataset. The predicted mean, standard error and response are calculated based on a full-factorial design using the number of prediction levels set via the script variables. The Design Predictions can only be calculated if the number of predictors is at least 2. If you have chosen to use the response without transformation, the mean and predictions columns will be the same.
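How such a grid comes about can be sketched in base R with expand.grid (the predictor names, ranges, and levels below are made up; CornerstoneR builds its designs with ‘DoE.base’, see References):

```r
## Full-factorial grid: a sequence from min to max per predictor,
## crossed over all predictors
levels.per.pred <- c(x1 = 25, x2 = 25, x3 = 25)
ranges <- list(x1 = c(0, 1), x2 = c(-1, 1), x3 = c(10, 20))
grid <- expand.grid(lapply(names(ranges), function(v)
  seq(ranges[[v]][1], ranges[[v]][2], length.out = levels.per.pred[[v]])))
names(grid) <- names(ranges)
nrow(grid)  # prod(levels.per.pred) = 25^3 = 15625 rows to predict on
```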
The Design Cube corresponding to the design predictions is obtained via “Summaries” -> “Design Cube”. It contains the minimum, maximum and mean values for each predictor.
Gaussian Process Regression: Design Cube
Here, the three points (x1/x2/x3) form the diagonal of a 3-dimensional cube.
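The same summary is a one-liner in R (the data frame X is illustrative):

```r
## Design cube: minimum, maximum, and mean per predictor
X <- data.frame(x1 = runif(20), x2 = runif(20, -1, 1), x3 = runif(20, 10, 20))
t(sapply(X, function(col) c(Min = min(col), Max = max(col), Mean = mean(col))))
```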
In this section we discuss predicting a response in a new dataset with the existing model from above. To this end, we open the dataset ‘sobol_sequence’ in ‘Cornerstone’ again and delete the columns ‘y1’ and ‘y2’. Starting from this dataset, we want to predict the original responses ‘y1’ and ‘y2’ via the menu ‘Analyses’ -> ‘CornerstoneR’ -> ‘Model Prediction’ as shown in the following screenshot.
Gaussian Process Prediction: Menu
In the appearing dialog, select all ‘x*’ variables as predictors. We have no response variable.
Gaussian Process Prediction: Variable Selection
‘OK’ confirms your selection and the following window appears.
Gaussian Process Prediction: R Script
At this point, we add the existing GPR model to the prediction dialog at hand. This is possible via the menu ‘R Script’ -> ‘Input R Objects’, which brings up the following dialog.
Gaussian Process Prediction: Input R Objects
We choose ‘Gaussian Process Regression Models’ as selected ‘R’ objects and click ‘OK’.
Now, click the execute button (green arrow) or choose the menu ‘R Script’ -> ‘Execute’ to run all calculations in ‘R’. The calculation has finished when the status bar at the lower left reads ‘Last execute error state: OK’. Our result is available via the ‘Summaries’ menu.
The entry “Predictions” opens a dataset with the input data and all response columns that are predictable from the chosen GPR models.
Gaussian Process Prediction: Result
Finally, the ‘Cornerstone’ workmap with all generated objects looks like the following screenshot.
Gaussian Process Regression: Final Workmap
Some options are exported from the used ‘R’ method to ‘Cornerstone’. Starting from the ‘R’ analysis object ‘gaussianProcessRegression’, you can find the ‘Script Variables’ dialog via the menu ‘R Script’ -> ‘Script Variables’. The following dialog appears.
Gaussian Process Regression: Script Variables
Alternatively, you can use only brushed or non-brushed observations to fit the Gaussian process model. Hence, after brushing a number of observations, it is not necessary to create a ‘Cornerstone’ subset to include or exclude specific rows; you can use this option to fit the model on the brushed or non-brushed set of rows. Training the model on fewer rows decreases the run time. The default is to use all rows of the input data for training.
As in Cornerstone’s Linear Regression, you are able to transform your response variables before modeling a GPR if you expect better results on a logarithmic scale, for example. You can choose from “Untransformed”, “Squared”, “Square root”, “Log”, “Reciprocal”, and “Reciprocal sqrt”. You can also set an Offset with Sign for the Box-Cox Transformation if you wish to shift your response values by a certain amount. For instance, if you would like to add -5 to your response values, type “5” into the field “Box-Cox Transformation Offset” and choose the minus in the corresponding Sign field. The default is no transformation.
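As a sketch of what these choices mean, the following illustrative helper applies the listed transformations after shifting by the signed offset; it mimics the dialog options and is not CornerstoneR’s actual code.

```r
## Apply a response transformation after a signed offset (illustrative)
transform_response <- function(y, type = "Untransformed", offset = 0) {
  y <- y + offset  # sign and magnitude from the dialog combined
  switch(type,
         "Untransformed"   = y,
         "Squared"         = y^2,
         "Square root"     = sqrt(y),
         "Log"             = log(y),
         "Reciprocal"      = 1 / y,
         "Reciprocal sqrt" = 1 / sqrt(y))
}

## Example from the text: offset "5" with a minus sign, then log
transform_response(c(6, 10, 20), type = "Log", offset = -5)
```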
Check the box “Compute and Output a Covariance Matrix” if you wish to return the full predictive covariance matrix of the multivariate Student-t distribution. Note that this substantially increases computation time and is unnecessary if only point predictions are required. Default is FALSE, i.e. no covariance matrix is output.
Check the corresponding box if you wish to optimize the hyper-parameters bandwidth and nugget. This may increase the training time. If unchecked, the default/start values are taken as the “true” hyper-parameter values to work with. Default is FALSE.
Here, you can enter your bandwidth start values for each predictor and response. Separate the entered values by comma (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu “R Script” -> “Variables”). For example, if you have 3 predictors \(x_i, i = 1,2,3\) and 2 responses \(y_j, j = 1,2\), enter your bandwidth start values \(d_{ij}\) as follows: \(d_{11}, d_{21}, d_{31}, d_{12}, d_{22}, d_{32}\). If you enter only as many values as there are predictors, these values are “recycled” if you have more than one response variable. If you do not enter any values, the default values are used, which are \((\max(x_i) - \min(x_i)) / 20\) for each predictor \(x_i, i = 1,\ldots,p\). The choice of these values has an impact on the resulting model and the calculation time.
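The ordering and recycling rules can be sketched as follows; the parsing helper is illustrative, not CornerstoneR’s code.

```r
## Expand a comma-separated string into a (predictors x responses)
## matrix of start values; column-major fill matches the documented
## order d11, d21, d31, d12, ... and recycles per-predictor input
parse_bandwidths <- function(txt, n.pred, n.resp, X = NULL) {
  if (!nzchar(txt)) {  # default: (max - min) / 20 per predictor (X needed)
    d <- apply(X, 2, function(x) (max(x) - min(x)) / 20)
    return(matrix(d, n.pred, n.resp))
  }
  vals <- as.numeric(strsplit(txt, ",")[[1]])
  matrix(vals, n.pred, n.resp)
}

## 3 predictors, 2 responses, one value per predictor (recycled)
parse_bandwidths("0.1, 0.2, 0.3", n.pred = 3, n.resp = 2)
```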
If you wish the nugget parameter (noise) of the kernel to be tuned, check the corresponding box. Default is FALSE here.
Here, you can enter your nugget start values for each response. Separate the entered values by comma (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu “R Script” -> “Variables”). For example, if you have 2 responses \(y_j, j = 1,2\), enter your nugget start values \(g_j\) as follows: \(g_1, g_2\). If you enter only one value, it is “recycled” if you have more than one response variable. If you do not enter any values, the default value of 0.000001 (1e-6) is used. The choice of these values has an impact on the resulting model and the calculation time. Note: if you get the error message that the value “g” must be below a certain value, reduce your chosen nugget start value.
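The upper bound on “g” presumably comes from the data-dependent defaults of ‘laGP’; a sketch of how to inspect it (toy data as before):

```r
library(laGP)

## laGP derives a data-dependent maximum for the nugget; start values
## above this bound trigger the error mentioned in the note
set.seed(1)
y <- sin(seq(0, 2 * pi, length.out = 40)) + rnorm(40, sd = 0.1)
g <- garg(list(mle = TRUE), y)
g$max  # keep nugget start values below this
```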
The number of prediction levels is needed to create a full-factorial design and can be set for each predictor and response. The input rules are the same as for the bandwidth start values. If not specified, the number of prediction levels is set to 25 for each predictor and response. The higher these values and the more predictors you have, the longer the calculation time. If you encounter memory issues, reduce the prediction level values or the number of predictors. For each response, the output table “Design Predictions” then has as many rows as the product of the prediction levels over all predictors. For instance, if you have 3 predictors and one response and have entered “20, 20, 20” as prediction levels, you obtain a Design Predictions table with \(20^3 = 8000\) rows.
The number of prediction points is used for the Adjusted Response Graph. If not specified, the number of prediction points is set to 25.
The implementation is based on the R packages ‘laGP’ (Gramacy, 2016) and ‘DoE.base’ (Grömping, 2018). The theory behind Gaussian Process Regression is covered in more detail in Bishop (2006) and Gramacy (2020).
Christopher M. Bishop (2006). “Pattern Recognition and Machine Learning (Information Science and Statistics).” Springer-Verlag, Berlin, Heidelberg.
Robert B. Gramacy (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, 72(1), 1-46. doi: 10.18637/jss.v072.i01.
Robert B. Gramacy (2020). “Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences.” Chapman & Hall/CRC, Boca Raton, Florida.
Ulrike Grömping (2018). “R Package DoE.base for Factorial Experiments.” Journal of Statistical Software, 85(5), 1-41. doi: 10.18637/jss.v085.i05.