Initial Situation and Goal

Gaussian Process Regression (GPR) is a nonparametric, nonlinear Bayesian approach to regression that is used in physics, geostatistics (where it is usually called “Kriging”), financial statistics and time series analysis, and it is currently making waves in the area of Machine Learning. The aim is to find the posterior distribution of the response conditioned on the data. So the question here is: which random function explains the observed values best? As a precondition, we need realizations that can be described by a Multivariate Normal Distribution. (Tip: Space-Filling Designs are highly suitable.) This enables GPR to capture input-output dynamics without having to “fit” a parametric model, making it a highly flexible tool. Moreover, GPR is relatively robust to deviations of reality from its assumptions, also in higher dimensions.

After training the model and fine-tuning its hyper-parameters (bandwidth and nugget), it can be used to make predictions for unseen data.

How do we use the method 'gaussianProcessRegression' from 'CornerstoneR' in 'Cornerstone'?

Train a Gaussian Process to Data

To illustrate GPR in 'Cornerstone', we will first create a small sample dataset in Cornerstone. For this, open a New Dataset via 'Data' \(\rightarrow\) 'New Dataset' from the Cornerstone Workmap window and add 10 rows via 'Rows' \(\rightarrow\) 'Append...'. We now add 3 computed columns via 'Columns' \(\rightarrow\) 'Add' \(\rightarrow\) 'Computed...' with the following variable names and formula expressions:

  • Name: “Helper”, Expression: \(1/9\)
  • Name: “x”, Expression: \(cumsum("Helper")-1/9\)
  • Name: “y”, Expression: \(abs(sin(2*3.141592653*x))^{0.8}\)
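For reference, the three formula expressions can be reproduced outside 'Cornerstone'; this Python sketch mirrors the resulting columns (it uses `math.pi` where the Cornerstone expression spells out 3.141592653):

```python
import math

n = 10
helper = 1 / 9                                        # "Helper": the constant 1/9
x = [(i + 1) * helper - helper for i in range(n)]     # cumsum("Helper") - 1/9 -> 0, 1/9, ..., 1
y = [abs(math.sin(2 * math.pi * xi)) ** 0.8 for xi in x]   # sinusoidal response
```

The resulting 'x' column runs from 0 to 1 in steps of 1/9, and 'y' is the rectified, power-transformed sine wave shown in the dataset picture.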

As a result, we obtain the dataset as displayed in the following picture.

Gaussian Process Regression: Data

In that window, choose the menu 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\)'Gaussian Process Regression' as shown in the following screenshot.

Gaussian Process Regression: Menu

In the appearing dialog, select the variable 'x' as predictor. This variable represents a series of 10 equally spaced values between 0 and 1. 'y' is a computed response variable with sinusoidal shape. Note that Gaussian Process Regression also supports multiple predictors and responses.

Gaussian Process Regression: Variable Selection

'OK' confirms your selection and the following window appears.

Gaussian Process Regression: R Script

Set the GPR options via 'R Script' \(\rightarrow\) 'Script Variables' as shown in the following screenshot.

Gaussian Process Regression: Script Vars

The following settings can be made:

  • Only Use Brushed / Non-brushed Rows for Modeling
  • Apply Box-Cox Transformation
  • Define Box-Cox Transformation Offset and Sign
  • Choose a Confidence Level
  • Output the Covariance Matrix
  • Optimize Parameters
  • Set start values for hyper-parameters bandwidth and nugget
  • Use nugget for generation of priors
  • Choose number of prediction levels to create full factorial design
  • Define number of prediction points for Adjusted Response Graph

The script variables are explained in more detail in section Options in the Script Variables Dialog. When you are done setting the script variables, click 'OK'. For this example, we keep the options as displayed in the screenshot.

Now, click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute' and all calculations are done via 'R'. Calculations are done if the text at the lower left status bar contains 'Last execute error state: OK'. Our result is available via the 'Summaries' menu as shown in the following screenshot.

Gaussian Process Regression: Result Menu

Statistics

Via 'Summaries' \(\rightarrow\) 'Statistics', the following data set with some essential statistics is shown. If you have selected multiple response variables, these statistics are shown row-wise for each response.

Gaussian Process Regression: Statistics Table

For instance, 'Count' shows how many observations the model was trained on. 'Degrees of Freedom' displays the Student-t degrees-of-freedom scalar. To estimate the calculation time for bigger datasets, 'Runtime' shows the corresponding time 'R' needed in seconds. We get an R-Squared of around 0.95, which means that the Gaussian Process Regression performs very well for the given data.
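Assuming the statistic follows the standard definition \(R^2 = 1 - SS_{res}/SS_{tot}\), it can be sketched as:

```python
def r_squared(y, yhat):
    """R-squared = 1 - SS_res / SS_tot for observed y and predictions yhat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

A perfect fit gives 1, while predicting the plain mean for every row gives 0.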

Predictions

Via 'Summaries' \(\rightarrow\) 'Predictions', the following data table is shown. It gives you the used, predicted and residual values for each response.

Gaussian Process Regression: Predictions

This dataset is brushable. Observations that were not used for the model fit have the entry '0' (FALSE) in the 'Used' column.

Hyper-Parameters

We get the tuned hyper-parameters, i.e. the bandwidth (d) and, if required, the nugget (g), via 'Summaries' \(\rightarrow\) 'Hyper-Parameters'. If we had chosen not to optimize the hyper-parameters, the defined start values would have been taken as the tuned hyper-parameters. If you have not defined any start values, the 'R' routine estimates them from the input data. See section Options in the Script Variables Dialog for more information.
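The roles of the two hyper-parameters can be illustrated with an isotropic Gaussian kernel in the parameterization used by 'laGP': the bandwidth d scales the squared distances, and the nugget g adds a noise term on the diagonal of the covariance matrix. A minimal one-dimensional sketch (a simplification of the actual implementation):

```python
import math

def kernel_matrix(x, d, g):
    """Isotropic Gaussian kernel K[i][j] = exp(-(x_i - x_j)^2 / d),
    with the nugget g added on the diagonal as a noise term."""
    n = len(x)
    return [[math.exp(-(x[i] - x[j]) ** 2 / d) + (g if i == j else 0.0)
             for j in range(n)] for i in range(n)]
```

A larger d makes distant points more correlated (a smoother fit); a larger g allows the fit to deviate more from the observations (more noise tolerance).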

Gaussian Process Regression: Hyper-Parameters

If you decide to refit the model for the same data set but with additional observations, choose the last output as start values for the hyper-parameters.

Covariance Matrix

If you have set the check box for computing and outputting the covariance matrix in the script variables, you will get the full predictive Covariance Matrix of a multivariate Student-t distribution for each response via 'Summaries' \(\rightarrow\) 'Covariance Matrix'. Note that its calculation substantially increases computation time, especially for big input datasets.

Gaussian Process Regression: Covariance Matrix

Adjusted Response Data and Graph

Open 'Graphs' \(\rightarrow\) 'Adjusted Response Graph' to obtain the Adjusted Response Graph for each response as already known from the basic Cornerstone Linear Regression. We get the following graph for our example.

Gaussian Process Regression: Adjusted Response Graph

To check the underlying data, open 'Summaries' \(\rightarrow\) 'Adjusted Response Data'. Note that the Adjusted Response table and graph are not brushable due to the extra prediction points increasing the number of rows in the dataset. The number of prediction points can be set in the script variables.

Design Predictions

If you are interested in predicting the values for a DoE, check the Design Predictions via 'Summaries' \(\rightarrow\) 'Design Predictions'. The Design Predictions can only be calculated if the number of predictors is at least 2. If you have chosen to use the response without transformation, the mean and predictions columns will be the same. The design predictor values are the sequence from the minimum to the maximum of each predictor from the original dataset. The predicted mean, standard error and response are calculated based on a full-factorial design using the number of prediction levels set via the script variables.
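The construction described above can be sketched as follows: each predictor contributes an equally spaced sequence from its minimum to its maximum, and the design is the full factorial (Cartesian product) of these sequences. The helper name and predictor ranges below are hypothetical illustrations:

```python
from itertools import product

def design_grid(ranges, levels):
    """Full factorial design: per predictor, an equally spaced sequence
    from its minimum to its maximum with the given number of levels."""
    axes = [
        [lo + i * (hi - lo) / (k - 1) for i in range(k)]
        for (lo, hi), k in zip(ranges, levels)
    ]
    return list(product(*axes))

# two hypothetical predictors with 5 and 4 prediction levels
grid = design_grid(ranges=[(0, 1), (10, 20)], levels=[5, 4])
```

The grid has \(5 \cdot 4 = 20\) rows, one per combination of predictor levels.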

Design Cube

The corresponding Design Cube to the design predictions is obtained via 'Summaries' \(\rightarrow\) 'Design Cube'. For instance in case of 3 predictors, the points would build the diagonal of a 3-dimensional cube. This also means that we get the minimum, maximum and mean values for each predictor.
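The per-predictor summary can be sketched as follows (hypothetical `design_cube` helper; 'Cornerstone' computes this internally):

```python
def design_cube(columns):
    """Per predictor: (minimum, mean, maximum) - the corner-to-corner
    diagonal of the predictor cube."""
    return {
        name: (min(vals), sum(vals) / len(vals), max(vals))
        for name, vals in columns.items()
    }
```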

Gaussian Process Regression: Design Cube

GPR Models

All models in the 'Cornerstone' object 'Gaussian Process Regression' can be exported to the Workmap via 'Summaries' \(\rightarrow\) 'Gaussian Process Regression Models'.

We need this export to use existing GPR models in additional datasets for predictions.

Use Fitted Gaussian Processes for Predictions

In this section we discuss the prediction of a response in a new dataset with the existing model from above. To do so, we open the dataset we constructed in Cornerstone again. Starting from this dataset, we aim to predict the original response 'y' via the menu 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'Model Prediction' as shown in the following screenshot.

Gaussian Process Prediction: Menu

In the appearing dialog we select only the predictor 'x'.

Gaussian Process Prediction: Variable Selection

'OK' confirms your selection and the following window appears.

Gaussian Process Prediction: R Script

At this point, we add the existing GPR model to the prediction dialog at hand. This is done via the menu 'R Script' \(\rightarrow\) 'Input R Objects' which pops up the following dialog.

Gaussian Process Prediction: Input R Objects

We choose 'Gaussian Process Regression Models' as selected 'R' Objects and click 'OK'.

Now, click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute' and all calculations are done via 'R'. Calculations are done if the text at the lower left status bar contains 'Last execute error state: OK'. Our result is available via the 'Summaries' menu.

The entry 'Predictions' opens a dataset with the input data and all response columns that are predictable from the chosen GPR models.

Gaussian Process Prediction: Result

Finally, the 'Cornerstone' workmap with all generated objects looks like the following screenshot.

Gaussian Process Regression: Final Workmap

Options in the Script Variables Dialog

Some options are exported from the used 'R' method to 'Cornerstone'. Starting from the 'R' analysis object 'gaussianProcessRegression', you can find the 'Script Variables' dialog via the menu 'R Script' \(\rightarrow\) 'Script Variables'. The following dialog appears.

Gaussian Process Regression: Script Variables

Only Use Brushed / Non-brushed Rows for Fitting

Alternatively, you can use only the brushed or only the non-brushed observations to fit the Gaussian Process model. After brushing a number of observations, it is therefore not necessary to create a 'Cornerstone' subset to include or exclude specific rows; you can simply use this option to fit the model on the brushed or non-brushed set of rows. Training the model on fewer rows decreases the run time. The default is to use all rows of the input data for training.
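The row selection can be sketched as a simple filter on the brushing state (hypothetical helper; the actual selection happens inside 'CornerstoneR'):

```python
def rows_for_fit(data, brushed, use="all"):
    """Select training rows by brushing state: 'all', 'brushed', or 'non-brushed'."""
    if use == "all":
        return data
    keep = brushed if use == "brushed" else [not b for b in brushed]
    return [row for row, k in zip(data, keep) if k]
```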

Box-Cox Transformation

As in Cornerstone's Linear Regression, you can transform your response variables before modeling a GPR if you expect better results on, for example, a logarithmic scale. You can choose from 'Untransformed', 'Squared', 'Square root', 'Log', 'Reciprocal', and 'Reciprocal sqrt'. You can also set an Offset with Sign for the Box-Cox Transformation if you wish to shift your response values by a certain amount. For instance, if you would like to add -5 to your response values, type '5' into the field 'Box-Cox Transformation Offset' and choose the minus in the corresponding Sign field. The default is no transformation.
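As an illustration of the offset logic, here is a sketch of the 'Log' case, assuming the signed offset is added to the response before the transformation is applied (matching the -5 example above; the helper name is hypothetical):

```python
import math

def transform_log(y, offset=0.0, sign=+1):
    """Shift the response by sign * offset, then apply the Log transformation."""
    return [math.log(yi + sign * offset) for yi in y]
```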

Covariance Matrix

Check the box “Compute and Output a Covariance Matrix” if you wish to return the full predictive covariance matrix of the multivariate Student-t distribution. Note that this substantially increases computation time, so leave it unchecked if only point predictions are required. The default is FALSE, i.e. no covariance matrix is output.

Optimization of Hyperparameters

Check the corresponding box if you wish to optimize the hyper-parameters bandwidth and nugget. This may increase the training time. If unchecked, the default/start values are taken as the “true” hyper-parameter values to work with. The default is FALSE.

Bandwidth Start Values

Here, you can enter your bandwidth start values for each predictor and response. Separate the values by commas (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu 'R Script' \(\rightarrow\) 'Variables'). For example, if you have 3 predictors \(x_i, i = 1,2,3\) and 2 responses \(y_j, j = 1,2\), enter your bandwidth start values \(d_{ij}\) as follows: \(d_{11}, d_{21}, d_{31}, d_{12}, d_{22}, d_{32}\). If you enter only as many values as there are predictors, these values will be “recycled” for each additional response variable. The choice of these values has an impact on the resulting model and the calculation time. If you do not enter any values, they will be estimated from the generated priors for each predictor \(x_i, i = 1,\ldots,p\).
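The input and recycling rules can be sketched as follows (hypothetical helper; the actual parsing happens inside 'CornerstoneR'):

```python
def parse_start_values(text, n_predictors, n_responses):
    """Parse comma-separated start values with '.' as decimal separator;
    recycle one predictor block across all responses if only
    n_predictors values were entered."""
    values = [float(v) for v in text.split(",")]
    if len(values) == n_predictors:
        values = values * n_responses          # recycle per response
    assert len(values) == n_predictors * n_responses
    return values
```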

Use of Nugget

If you wish to have a tuned nugget parameter (noise) for kernels, check the corresponding box. Default value is FALSE.

Nugget Start Values

Here, you can enter your nugget start values for each response. Separate the values by commas (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu 'R Script' \(\rightarrow\) 'Variables'). For example, if you have 2 responses \(y_j, j = 1,2\), enter your nugget start values \(g_j\) as follows: \(g_1, g_2\). If you enter only one value, it will be “recycled” for each additional response variable. If you do not enter any values, the default of 0.000001 (1e-6) is used. The choice of these values has an impact on the resulting model and the calculation time. Note: if you get the error message that the value “g” must be below a certain value, decrease your chosen nugget start value.

Prediction Levels

The number of prediction levels is needed to create a full factorial design and can be set for each predictor and response. The input rules are the same as for the bandwidth start values. If not specified, the number of prediction levels is set to 25 for each predictor and response. The higher these values and the more predictors you have, the longer the calculation time. If you encounter memory issues, reduce the prediction levels or the number of predictors. The output table “Design Predictions” then has, for each response, as many rows as the product of the prediction levels. For instance, if you have 3 predictors and one response and you have entered “20, 20, 20” as prediction levels, you obtain a Design Predictions table with \(20^3 = 8000\) rows.
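The row count is simply the product of the entered prediction levels:

```python
from math import prod

levels = [20, 20, 20]   # prediction levels for 3 predictors
rows = prod(levels)     # 20^3 = 8000 rows in the Design Predictions table
```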

Prediction Points

The number of prediction points is used for the Adjusted Response Graph. If not specified, it is set to 25.

Further reading

The implementation is based on the R packages ‘laGP’ (Gramacy, 2016) and ‘DoE.base’ (Grömping, 2018). The theory behind the Gaussian Process Regression can be read in more detail in Bishop (2006) and Gramacy (2020).

Remarks

This function will not accept data with missing values for predictors and/or responses. If your data contain rows with missing values, we recommend using the function Handling Missing Values beforehand.

References

Christopher M. Bishop (2006). “Pattern Recognition and Machine Learning (Information Science and Statistics).” Springer-Verlag, Berlin, Heidelberg.

Robert B. Gramacy (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, 72(1), 1-46. doi: 10.18637/jss.v072.i01.

Robert B. Gramacy (2020). “Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences.” Chapman Hall/CRC, Boca Raton, Florida.

Ulrike Grömping (2018). “R Package DoE.base for Factorial Experiments.” Journal of Statistical Software, 85(5), 1-41. doi: 10.18637/jss.v085.i05.