`vignettes/gaussianProcessRegression.Rmd`

Gaussian Process Regression (GPR) is a nonparametric, nonlinear Bayesian approach to regression that is used in physics, geostatistics (where it is usually called “Kriging”), financial statistics, and time series analysis. It is also currently making waves in machine learning. The aim is to find the posterior distribution of the response conditioned on the data, so the question is: which random function best explains the observed values? As a precondition, we need realizations that are described by a multivariate normal distribution. (Tip: space-filling designs are highly suitable.) This enables GPR to capture input-output dynamics without having to “fit” a parametric form, making it a highly flexible tool. Moreover, GPR is relatively robust to deviations between assumptions and reality, also in higher dimensions.

After training the model and fine-tuning its hyper-parameters (bandwidth and nugget), it can be used to make predictions for unseen data.
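To illustrate the underlying idea (this is not the `'laGP'`-based implementation that CornerstoneR actually uses), a GP prediction can be sketched in a few lines of base R with a squared-exponential kernel; the bandwidth `d` and nugget `g` values below are hypothetical examples:

```r
# Minimal base-R sketch of GP prediction (illustrative only, not the
# 'laGP' code behind CornerstoneR): squared-exponential kernel with
# bandwidth d and nugget g, both hypothetical example values.
gp_predict <- function(X, y, Xnew, d = 0.1, g = 1e-6) {
  sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  K   <- exp(-sqdist(X, X) / d) + diag(g, nrow(X))  # train covariance + nugget
  Ks  <- exp(-sqdist(Xnew, X) / d)                  # cross covariance
  Kss <- exp(-sqdist(Xnew, Xnew) / d)
  Ki  <- solve(K)
  m   <- Ks %*% Ki %*% y                            # posterior mean
  S   <- Kss - Ks %*% Ki %*% t(Ks)                  # posterior covariance
  list(mean = drop(m), var = pmax(diag(S), 0))
}

set.seed(1)
X    <- matrix(seq(0, 1, length.out = 20), ncol = 1)
y    <- sin(2 * pi * X[, 1]) + rnorm(20, sd = 0.05)
Xnew <- matrix(c(0.25, 0.75), ncol = 1)
pred <- gp_predict(X, y, Xnew)   # mean close to sin(2*pi*Xnew)
```

In CornerstoneR the kernel, hyper-parameter tuning, and Student-t predictive distribution are handled by `'laGP'`; the sketch only shows how the posterior mean and variance arise from conditioning a multivariate normal on the data.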

How do we use the method `'gaussianProcessRegression'` from `'CornerstoneR'` in `'Cornerstone'`?

To use GPR in `'Cornerstone'`, open a dataset, e.g. `'sobol_sequence'` from the Cornerstone sample data, and choose the menu `'Analyses'` \(\rightarrow\) `'CornerstoneR'` \(\rightarrow\) `'Gaussian Process Regression'` as shown in the following screenshot.

In the appearing dialog, select all `'x\*'` variables as predictors. These columns were simulated in Cornerstone using Niederreiter sequences and represent a space-filling design. `'y1'` and `'y2'` are the computed response variables with a random component.

`'OK'` confirms your selection and the following window appears.

Set the GPR options via `'R Script'` \(\rightarrow\) `'Script Variables'` as shown in the following screenshot.

The following settings can be made:

- Only Use Brushed / Non-brushed Rows for Modeling
- Apply Box-Cox Transformation
- Define Box-Cox Transformation Offset and Sign
- Choose a Confidence Level
- Output the Covariance Matrix
- Optimize Parameters
- Set start values for hyper-parameters bandwidth and nugget
- Use nugget for generation of priors
- Choose number of prediction levels to create full factorial design
- Define number of prediction points for Adjusted Response Graph

The script variables are explained in more detail in section Options in the Script Variables Dialog. When you are done setting the script variables, click `'OK'`.

Now, click the execute button (green arrow) or choose the menu `'R Script'` \(\rightarrow\) `'Execute'` and all calculations are done via `'R'`. The calculations are finished when the text in the lower left status bar contains `'Last execute error state: OK'`. Our result is available via the `'Summaries'` menu as shown in the following screenshot.

Via `'Summaries'` \(\rightarrow\) `'Statistics'`, the following data set with some essential statistics is shown. If you have selected multiple response variables, these statistics are shown row-wise for each response variable.

For instance, `'Count'` lets you check how many observations the model learns from. `'Degrees of Freedom'` displays the Student-t degrees of freedom scalar. To estimate the calculation time for bigger data, `'Runtime'` shows the corresponding time `'R'` needed in seconds. We get an R-squared above 0.99 for both response variables, which means that the Gaussian Process Regression seems to perform pretty well for the given data.
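For reference, an R-squared of this kind can be recomputed from the used and predicted values in the Predictions summary; the numbers below are invented for illustration:

```r
# R-squared from observed ('used') and predicted values, as reported in
# the Statistics summary (values invented for illustration):
used <- c(1.0, 2.1, 2.9, 4.2)
pred <- c(1.1, 2.0, 3.0, 4.1)
rsq  <- 1 - sum((used - pred)^2) / sum((used - mean(used))^2)
rsq  # about 0.993 for these numbers
```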

Via `'Summaries'` \(\rightarrow\) `'Predictions'`, the following data table is shown. It gives you the used, predicted, and residual values for each response.

This dataset is brushable. If some observations were not used for training, they would have gotten the entry `'0'` (FALSE) in the `'Used'` column.

We get the tuned hyper-parameters, i.e. the bandwidth (d) and, if required, the nugget (g), via `'Summaries'` \(\rightarrow\) `'Hyper-Parameters'`. If we have chosen not to optimize the hyper-parameters, the defined start values are shown as the tuned hyper-parameters. If you have not defined any start values, default values are used. See section Options in the Script Variables Dialog for more information.

If you decide to retrain the model for the same data set but with additional observations, choose these values as start values for the hyper-parameters.

If you have set the check box for computing and outputting the covariance matrix in the script variables, you will get the full predictive covariance matrix for a multivariate Student-t distribution for each response via `'Summaries'` \(\rightarrow\) `'Covariance Matrix'`. Note that its calculation substantially increases computation time.

Open `'Graphs'` \(\rightarrow\) `'Adjusted Response Graph'` to obtain the Adjusted Response Graph as already known from the basic Cornerstone Linear Regression. We get the following graph.

To check the underlying data, open `'Summaries'` \(\rightarrow\) `'Adjusted Response Data'`. Note that the Adjusted Response table and graph are not brushable because the extra prediction points increase the number of rows in the dataset. The number of prediction points can be set in the script variables.

If you are interested in predicting the values for a DoE, check the Design Predictions via `'Summaries'` \(\rightarrow\) `'Design Predictions'`.

The design predictor values are the sequence from the minimum to the maximum of each predictor from the original dataset. The predicted mean, standard error and response are calculated based on a full-factorial design using the number of prediction levels set via the script variables. The Design Predictions can only be calculated if the number of predictors is at least 2. If you have chosen to use the response without transformation, the mean and predictions columns will be the same.
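A plausible sketch of how such a full-factorial prediction grid can be built in R (column names, ranges, and level counts below are illustrative, not the actual CornerstoneR code):

```r
# Full-factorial design from min to max of each predictor, with the
# number of prediction levels per predictor as set in the script
# variables (illustrative names and values):
levels <- c(x1 = 3, x2 = 4)                    # prediction levels per predictor
ranges <- list(x1 = c(0, 1), x2 = c(-2, 2))    # min/max from the dataset
grid <- expand.grid(lapply(names(levels), function(v)
  seq(ranges[[v]][1], ranges[[v]][2], length.out = levels[[v]])))
names(grid) <- names(levels)
nrow(grid)  # 3 * 4 = 12 rows: the product of the prediction levels
```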

In this section we discuss predicting a response in a new dataset with the existing model from above. To do so, we open the dataset `'sobol_sequence'` in `'Cornerstone'` again and delete the columns `'y1'` and `'y2'`. Starting from this dataset, we want to predict the original responses `'y1'` and `'y2'` via the menu `'Analyses'` \(\rightarrow\) `'CornerstoneR'` \(\rightarrow\) `'Model Prediction'` as shown in the following screenshot.

In the appearing dialog, select all `'x\*'` variables as predictors. We have no response variable.

`'OK'` confirms your selection and the following window appears.

At this point, we add the existing GPR model to the prediction dialog at hand. This is possible via the menu `'R Script'` \(\rightarrow\) `'Input R Objects'`, which brings up the following dialog.

We choose `'Gaussian Process Regression Models'` as selected `'R'` objects and click `'OK'`.

Now, click the execute button (green arrow) or choose the menu `'R Script'` \(\rightarrow\) `'Execute'` and all calculations are done via `'R'`. The calculations are finished when the text in the lower left status bar contains `'Last execute error state: OK'`. Our result is available via the `'Summaries'` menu.

The entry `'Predictions'` opens a dataset with the input data and all response columns that are predictable from the chosen GPR models.

Finally, the `'Cornerstone'` workmap with all generated objects looks like the following screenshot.

Some options are exported from the used `'R'` method to `'Cornerstone'`. Starting from the `'R'` analysis object `'gaussianProcessRegression'`, you can find the `'Script Variables'` dialog via the menu `'R Script'` \(\rightarrow\) `'Script Variables'`. The following dialog appears.

Alternatively, you can use only the brushed or only the non-brushed observations to fit the Gaussian process model. After brushing a number of observations, it is not necessary to create a ‘Cornerstone’ subset to include or exclude specific rows; you can simply use this option to fit the model on the brushed or non-brushed set of rows. Training the model on fewer rows decreases run time. The default is to use all rows from the input data for training.

As in Cornerstone’s Linear Regression, you are able to transform your response variables before modeling a GPR if you expect better results with a logarithmic scale, for example. You can choose from `'Untransformed'`, `'Squared'`, `'Square root'`, `'Log'`, `'Reciprocal'`, and `'Reciprocal sqrt'`. You can also set an Offset with Sign for the Box-Cox Transformation if you wish to shift your response values by a certain value. For instance, if you would like to add -5 to your response values, type `'5'` into the field `'Box-Cox Transformation Offset'` and choose the minus in the corresponding Sign field. The default is no transformation.
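The offset-with-sign convention can be sketched as follows (the response values are invented, and `'Log'` is taken as the chosen transformation):

```r
# Offset with sign applied before the chosen transformation: entering '5'
# with the minus sign shifts the response by -5 (illustrative values):
y <- c(12, 20, 35)
offset <- 5
sgn    <- -1                    # minus chosen in the Sign field
y_shifted <- y + sgn * offset   # 7 15 30
y_trans   <- log(y_shifted)     # 'Log' transformation on shifted values
```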

Check the box “Compute and Output a Covariance Matrix” if you wish to return the full predictive covariance matrix for the multivariate Student-t distribution. Note that this substantially increases computation time, which is unnecessary if only point predictions are required. The default is FALSE, i.e. no covariance matrix is output.

Check the corresponding box if you wish to optimize the hyper-parameters bandwidth and nugget. This may increase the training time. If unchecked, the default/start values are taken as the “true” hyper-parameter values to work with. The default is FALSE.

Here, you can enter your bandwidth start values for each predictor and response. Separate the entered values by comma (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu `'R Script'` \(\rightarrow\) `'Variables'`). For example, if you have 3 predictors \(x_i, i = 1,2,3\) and 2 responses \(y_j, j = 1,2\), enter your bandwidth start values \(d_{ij}\) as follows: \(d_{11}, d_{21}, d_{31}, d_{12}, d_{22}, d_{32}\). If you enter only as many values as there are predictors, these values will be “recycled” if you have more than one response variable. If you do not enter any values, the default value \((max(x_i) - min(x_i)) / 20\) is used for each predictor \(x_i, i = 1,\ldots,p\). The choice of these values has an impact on the resulting model and the calculation time.
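The default rule and the recycling behavior can be sketched as follows (the data and dimensions are invented, not the `'sobol_sequence'` sample):

```r
# Default bandwidth start values, (max(x_i) - min(x_i)) / 20 per predictor,
# and their recycling across responses (illustrative data only):
set.seed(1)
X <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, -1, 1))
d_start <- sapply(X, function(x) (max(x) - min(x)) / 20)
n_responses <- 2
d_full <- rep(d_start, times = n_responses)  # p values recycled per response
length(d_full)  # 2 predictors * 2 responses = 4 start values
```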

If you wish to have a tuned nugget parameter (noise) for kernels, check the corresponding box. Default is FALSE here.

Here, you can enter your nugget start values for each response. Separate the entered values by comma (“,”) and use a decimal point (“.”) as decimal separator. Put the values in the same order as in the Variable Selection (check via the menu `'R Script'` \(\rightarrow\) `'Variables'`). For example, if you have 2 responses \(y_j, j = 1,2\), enter your nugget start values \(g_j\) as follows: \(g_1, g_2\). If you enter only one value, it will be “recycled” if you have more than one response variable. If you do not enter any values, the default value of 0.000001 (1e-6) is used. The choice of these values has an impact on the resulting model and the calculation time. Note: if you get the error message that the value “g” must be below a certain value, reduce your chosen nugget start value.

The number of prediction levels is needed to create a full factorial design and can be set for each predictor and response. The input rules are the same as for the bandwidth start values. If not specified, the number of prediction levels is set to 25 for each predictor and response. The higher these values and the more predictors you have, the longer the calculation time. If you encounter memory issues, reduce the prediction level values or the number of predictors. The output table “Design Predictions” then has, for each response, as many rows as the product of the prediction levels over the predictors. For instance, if you have 3 predictors and one response and you have entered “20, 20, 20” as prediction levels, you obtain a Design Predictions table with \(20^3 = 8000\) rows.

The number of prediction points is used for the Adjusted Response Graph. If not specified, it is set to 25.

The implementation is based on the R packages ‘laGP’ (Gramacy, 2016) and ‘DoE.base’ (Grömping, 2018). The theory behind the Gaussian Process Regression can be read in more detail in Bishop (2006) and Gramacy (2020).

This function will not accept data with missing values for predictors and/or responses. If your data contain rows with missing values, we recommend using the function Handling Missing Values beforehand.

Christopher M. Bishop (2006). “Pattern Recognition and Machine
Learning (Information Science and Statistics).”
*Springer-Verlag*, Berlin, Heidelberg.

Robert B. Gramacy (2016). “laGP: Large-Scale Spatial Modeling via
Local Approximate Gaussian Processes in R.” *Journal of Statistical
Software*, 72(1), 1-46. doi: 10.18637/jss.v072.i01.

Robert B. Gramacy (2020). “Surrogates: Gaussian Process Modeling,
Design and Optimization for the Applied Sciences.” *Chapman
Hall/CRC*, Boca Raton, Florida.

Ulrike Grömping (2018). “R Package DoE.base for Factorial
Experiments.” *Journal of Statistical Software*, 85(5), 1-41.
doi: 10.18637/jss.v085.i05.