Gaussian Process Regression (GPR) is a nonparametric, nonlinear Bayesian approach to regression that is used in physics, geostatistics (where it is usually called “Kriging”), financial statistics, and time series analysis. It is also currently making waves in machine learning. The aim is to find the posterior distribution of the response conditioned on the data, so the question is: Which random function explains the observed values best? As a precondition, we need realizations that can be described by a multivariate normal distribution. (Tip: space-filling designs are highly suitable.) This enables GPR to capture input-output dynamics without having to “fit” a parametric form, making it a highly flexible tool. Moreover, GPR is relatively robust to discrepancies between its assumptions and reality, even in higher dimensions.
After the model has been trained and its hyper-parameters (bandwidth and nugget) have been tuned, it can be used to make predictions for unseen data.
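Outside of Cornerstone, the same idea can be sketched in plain R with the 'laGP' package, on which this implementation is based (see the references at the end). The snippet below is a minimal illustration of fitting and predicting, not the exact routine behind 'gaussianProcessRegression'; the bandwidth start value is hypothetical, and hyper-parameter tuning is shown further below.

```r
library(laGP)

## toy data: one predictor, a sinusoidal response
X <- matrix(seq(0, 1, length.out = 10), ncol = 1)
y <- sin(2 * pi * X[, 1])

## fit a separable GP with a (hypothetical) bandwidth start value d
## and the default nugget g = 1e-6
gp <- newGPsep(X, y, d = 0.1, g = 1e-6, dK = TRUE)

## predict at unseen inputs: posterior mean, variance and Student-t df
XX   <- matrix(seq(0, 1, length.out = 100), ncol = 1)
pred <- predGPsep(gp, XX, lite = TRUE)   # pred$mean, pred$s2, pred$df
## free the C-side object when done: deleteGPsep(gp)
```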
How do we use the method 'gaussianProcessRegression' from 'CornerstoneR' in 'Cornerstone'?
To illustrate GPR in 'Cornerstone'
, we will first create
a small sample dataset in Cornerstone. For this, open a New Dataset via
'Data'
\(\rightarrow\)
'New Dataset'
from the Cornerstone Workmap window and add
10 rows via 'Rows'
\(\rightarrow\) 'Append...'
. We
now add 3 computed columns via 'Columns'
\(\rightarrow\) 'Add'
\(\rightarrow\) 'Computed...'
with the following variable names and formula expressions:
As a result, we obtain the dataset as displayed in the following picture.
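If you prefer to reproduce the example data directly in R, a dataset with the same structure could be built as follows. The exact Cornerstone formulas are assumptions here; only the shape of 'x' (10 equally spaced values between 0 and 1) and the sinusoidal 'y' are described in the text.

```r
## a comparable example dataset built directly in R (formulas assumed)
x  <- seq(0, 1, length.out = 10)   # 10 equally spaced values in [0, 1]
dt <- data.frame(
  x = x,
  y = sin(2 * pi * x)              # computed response with sinusoidal shape
)
```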
In that window, choose the menu 'Analyses'
\(\rightarrow\) 'CornerstoneR'
\(\rightarrow\)'Gaussian Process Regression'
as shown in the following screenshot.
In the appearing dialog, select the variable 'x'
as predictor. This variable represents a series of 10 equally spaced
values between 0 and 1. 'y'
is a computed response variable
with sinusoidal shape. Note that the Gaussian Process Regression also
supports multiple predictors and responses.
'OK'
confirms your selection and the following window
appears.
Set the GPR options via 'R Script'
\(\rightarrow\)
'Script Variables'
as shown in the following
screenshot.
The following settings can be made:
The script variables are explained in more detail in section Options in the Script
Variables Dialog. When you are done setting the script variables,
click 'OK'
. For this example, we keep the options as
displayed in the screenshot.
Now, click the execute button (green arrow) or choose the menu
'R Script'
\(\rightarrow\)
'Execute'
and all calculations are done via
'R'
. The calculations are finished when the status bar at the lower left
shows 'Last execute error state: OK'
. Our
result is available via the 'Summaries'
menu as shown in
the following screenshot.
Via 'Summaries'
\(\rightarrow\) 'Statistics'
,
the following data set with some essential statistics is shown. If you
have selected multiple response variables, these statistics are shown
row-wise for each response.
For instance, the 'Count'
lets you check how many
observations the model was trained on. The 'Degrees of Freedom'
displays the Student-t degrees-of-freedom scalar. To estimate the
calculation time for bigger data, 'Runtime'
shows the
corresponding time 'R'
needed in seconds. We get an
R-Squared around 0.95, which means that the Gaussian Process Regression
seems to perform very well for the given data.
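To see where such a value comes from, the R-Squared can be computed from the in-sample predictions of a fitted GP. The snippet below is a sketch (reusing 'gp', 'X' and 'y' from the introductory fit), not the exact summary code of the analysis.

```r
## in-sample predictions and the resulting R-Squared
fit <- predGPsep(gp, X, lite = TRUE)
rsq <- 1 - sum((y - fit$mean)^2) / sum((y - mean(y))^2)
rsq   # values close to 1 indicate a good fit
```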
Via 'Summaries'
\(\rightarrow\) 'Predictions'
,
the following data table is shown. It gives you the used, predicted and
residual values for each response.
This dataset is brushable. If some observations were not used for the
model fit, they would have the entry '0'
(FALSE) in
the 'Used'
column.
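For a single response, the content of this table can be reconstructed in plain R as follows; this is a sketch, and the column names are approximations of the Cornerstone output.

```r
## used, predicted and residual values for the response 'y'
## (reusing 'gp', 'X' and 'y' from the introductory fit)
fit <- predGPsep(gp, X, lite = TRUE)
predictions <- data.frame(
  Used      = 1L,            # 1 (TRUE) = observation used for the fit
  y         = y,
  Predicted = fit$mean,
  Residual  = y - fit$mean
)
head(predictions)
```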
We get the tuned hyper-parameters, i.e. the bandwidth (d) and if
required, the nugget (g), via 'Summaries'
\(\rightarrow\)
'Hyper-Parameters'
. If we had chosen not to optimize the
hyper-parameters, the defined start values would have been taken as
the tuned hyper-parameters. If you have not defined any start values,
the R routine will estimate them based on the input data. See section Options in the Script
Variables Dialog for more information.
If you decide to refit the model for the same data set but with additional observations, choose the last output as start values for the hyper-parameters.
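In 'laGP', data-based start values and search ranges for the bandwidth and nugget can be obtained with 'darg' and 'garg', and the joint maximum-likelihood tuning with 'jmleGPsep'. The following sketch illustrates this idea; it is not necessarily the exact internal call sequence of the analysis.

```r
## data-based start values and ranges for bandwidth (d) and nugget (g),
## reusing 'X' and 'y' from the introductory fit
da <- darg(list(mle = TRUE), X)   # da$start, da$min, da$max, da$ab
ga <- garg(list(mle = TRUE), y)   # ga$start, ga$min, ga$max, ga$ab

## fit with those start values and tune both hyper-parameters jointly
gp.tuned <- newGPsep(X, y, d = da$start, g = ga$start, dK = TRUE)
mle <- jmleGPsep(gp.tuned,
                 drange = c(da$min, da$max), grange = c(ga$min, ga$max),
                 dab = da$ab, gab = ga$ab)
mle   # tuned bandwidth(s) d and nugget g
```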
If you have set the check box for computing and outputting the
covariance matrix in the script variables, you will get the full
predictive Covariance Matrix for a multivariate Student-t distribution
for each response via 'Summaries'
\(\rightarrow\)
'Covariance Matrix'
. Note that its calculation
substantially increases computation time, especially with large input
datasets.
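In 'laGP' terms, this corresponds to requesting the full predictive covariance instead of only pointwise variances, as in the following sketch (reusing 'gp' from the introductory fit).

```r
## full predictive covariance of the multivariate Student-t distribution
## versus pointwise predictive variances
XX   <- matrix(seq(0, 1, length.out = 50), ncol = 1)
full <- predGPsep(gp, XX, lite = FALSE)  # full$Sigma is a 50 x 50 matrix
lite <- predGPsep(gp, XX, lite = TRUE)   # lite$s2 is a vector of length 50
```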
Open 'Graphs'
\(\rightarrow\)
'Adjusted Response Graph'
to obtain the Adjusted Response
Graph for each response as already known from the basic Cornerstone
Linear Regression. We get the following graph for our example.
To check the underlying data, open 'Summaries'
\(\rightarrow\)
'Adjusted Response Data'
. Note that the Adjusted Response
table and graph are not brushable due to the extra prediction points
increasing the number of rows in the dataset. The number of prediction
points can be set in the script variables.
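One common way to build such adjusted-response data, and an assumption about its construction here, is to vary one predictor over a grid of prediction points while holding all other predictors at their means, and to predict the response on that grid. A sketch for the single-predictor example:

```r
## grid of prediction points for the single predictor 'x'
## (reusing 'gp', 'X' and 'y' from the introductory fit)
n.points <- 25
grid <- matrix(seq(min(X[, 1]), max(X[, 1]), length.out = n.points), ncol = 1)
adj  <- predGPsep(gp, grid, lite = TRUE)
plot(grid[, 1], adj$mean, type = "l", xlab = "x", ylab = "adjusted response y")
points(X[, 1], y)   # original observations
```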
If you are interested in predicting the values for a DoE, check the
Design Predictions via 'Summaries'
\(\rightarrow\)
'Design Predictions'
. The Design Predictions can only be
calculated if the number of predictors is at least 2. If you have chosen
to use the response without transformation, the mean and predictions
columns will be the same. The design predictor values are the sequence
from the minimum to the maximum of each predictor from the original
dataset. The predicted mean, standard error and response are calculated
based on a full-factorial design using the number of prediction levels
set via the script variables.
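Conceptually, the design grid is the full cross of per-predictor sequences from the observed minimum to maximum. The sketch below uses base R's 'expand.grid' as a stand-in for the 'DoE.base'-based construction in the package; the predictors 'x1' and 'x2' are hypothetical.

```r
## full-factorial prediction grid for two hypothetical predictors
x1 <- runif(30, 0, 10); x2 <- runif(30, -5, 5)   # stand-in predictor columns
n.levels <- c(25, 25)                            # prediction levels per predictor
design <- expand.grid(
  x1 = seq(min(x1), max(x1), length.out = n.levels[1]),
  x2 = seq(min(x2), max(x2), length.out = n.levels[2])
)
nrow(design)   # prod(n.levels) = 625 design points
## predicted mean and standard error would then come from, e.g.,
## predGPsep(<two-predictor model>, as.matrix(design), lite = TRUE)
```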
The corresponding Design Cube to the design predictions is obtained
via 'Summaries'
\(\rightarrow\) 'Design Cube'
.
For instance, in the case of 3 predictors, the points would form the
diagonal of a 3-dimensional cube. This also means that we get the
minimum, mean and maximum values of each predictor.
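For three predictors, these diagonal points can be written down directly, as in the following sketch with hypothetical predictor columns.

```r
## minimum, mean and maximum of each predictor: the diagonal points of
## the (hypothetical) 3-dimensional predictor cube
x1 <- runif(30, 0, 10); x2 <- runif(30, -5, 5); x3 <- runif(30, 1, 2)
cube <- rbind(
  min  = c(min(x1),  min(x2),  min(x3)),
  mean = c(mean(x1), mean(x2), mean(x3)),
  max  = c(max(x1),  max(x2),  max(x3))
)
colnames(cube) <- c("x1", "x2", "x3")
cube
```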
In this section we discuss predicting a response in a new dataset
with the existing model from above. To do so, we open the dataset we
constructed in Cornerstone again. Starting from this dataset, we aim to
predict the original response 'y'
via the menu
'Analyses'
\(\rightarrow\)
'CornerstoneR'
\(\rightarrow\)
'Model Prediction'
as shown in the following
screenshot.
In the appearing dialog we select only the predictor
'x'
.
'OK'
confirms your selection and the following window
appears.
At this point, we add the existing GPR model to the prediction dialog
at hand. This is done via the menu 'R Script'
\(\rightarrow\)
'Input R Objects'
which pops up the following dialog.
We choose 'Gaussian Process Regression Models'
as
selected 'R'
Objects and click 'OK'
.
Now, click the execute button (green arrow) or choose the menu
'R Script'
\(\rightarrow\)
'Execute'
and all calculations are done via
'R'
. The calculations are finished when the status bar at the lower left
shows 'Last execute error state: OK'
. Our
result is available via the 'Summaries'
menu.
The entry 'Predictions'
opens a dataset with the input
data and all response columns that are predictable from the chosen GPR
models.
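In plain R, applying an already fitted GP model to new predictor values is simply another call to the prediction routine, without refitting. A sketch with the single-predictor model from the introduction:

```r
## predict the response for new, unseen values of 'x' with the stored model
x.new <- matrix(c(0.05, 0.33, 0.61, 0.92), ncol = 1)
pred  <- predGPsep(gp, x.new, lite = TRUE)
data.frame(x = x.new[, 1], y.predicted = pred$mean)
```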
Finally, the 'Cornerstone'
workmap with all generated
objects looks like the following screenshot.
Some options are exported from the used 'R'
method to
'Cornerstone'
. Starting from the 'R'
analysis
object 'gaussianProcessRegression'
, you can find the
'Script Variables'
dialog via the menu
'R Script'
\(\rightarrow\)
'Script Variables'
. The following dialog appears.
As an alternative, you can use only brushed or non-brushed observations to fit the Gaussian Process model. Hence, after brushing a number of observations, it is not necessary to create a ‘Cornerstone’ subset to include or exclude specific rows; you can just use this option to fit the model on the brushed or non-brushed set of rows. Training the model on fewer rows decreases run time. The default is to use all rows of the input data for training.
As in Cornerstone’s Linear Regression, you are able to
transform your response variables before modeling a GPR if you expect
better results with a logarithmic scale, for example. You can choose
from 'Untransformed'
, 'Squared'
,
'Square root'
, 'Log'
,
'Reciprocal'
, and 'Reciprocal sqrt'
. You can
also set an Offset with Sign for the Box-Cox Transformation if you wish to
shift your response values by a certain value. For instance, if you
would like to add -5 to your response values, type '5'
into
the field 'Box-Cox Transformation Offset'
and choose the
minus in the corresponding Sign field. The default is no
transformation.
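The idea behind the offset is sketched below in plain R with hypothetical response values; this is not the package's internal code, and it assumes the signed offset is added before the chosen transformation is applied.

```r
y.raw   <- c(2.5, 3.1, 4.8, 7.2)   # hypothetical response values
offset  <- 5                       # value '5' with the plus sign chosen
y.trans <- log(y.raw + offset)     # 'Log'; other choices: y^2, sqrt(y),
                                   # 1/y, 1/sqrt(y), or no transformation
```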
Check the box “Compute and Output a Covariance Matrix” if you wish to return the full predictive covariance matrix of the multivariate Student-t distribution. Note that this substantially increases computation time and is unnecessary if only point predictions are required. The default is FALSE, i.e. no covariance matrix will be output.
Check the corresponding box if you wish to optimize the hyper-parameters bandwidth and nugget. This may increase the training time. If unchecked, the default/start values are taken as the “true” hyper-parameter values to work with. The default is FALSE.
Here, you can enter your bandwidth start values for each predictor
and response. Separate the entered values by comma (“,”) and use a
decimal point (“.”) as decimal separator. Put the values in the same
order as of the Variable Selection (check via the menu
'R Script'
\(\rightarrow\)
'Variables'
). For example, if you have 3 predictors \(x_i, i = 1,2,3\) and 2 responses \(y_j, j = 1,2\), enter your bandwidth start
values \(d_{ij}\) in the following order: \(d_{11}, d_{21}, d_{31}, d_{12}, d_{22}, d_{32}\).
If you enter only as many values as there are predictors, these
values will be “recycled” if you have more than one response variable.
The choice of these values has an impact on the resulting model and the
calculation time. If you do not enter any values, they will be
estimated from the generated priors for each predictor \(x_i, i = 1,\ldots,p\).
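To make the required ordering concrete, the following sketch with hypothetical values shows how such a comma-separated entry maps to a bandwidth matrix with one column per response.

```r
## 3 predictors, 2 responses: values in the order d11, d21, d31, d12, d22, d32
input <- "0.1, 0.2, 0.15, 0.3, 0.25, 0.2"
d <- as.numeric(trimws(strsplit(input, ",")[[1]]))
matrix(d, nrow = 3, ncol = 2,
       dimnames = list(c("x1", "x2", "x3"), c("y1", "y2")))
```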
If you wish to have a tuned nugget parameter (noise) for the kernels, check the corresponding box. The default value is FALSE.
Here, you can enter your nugget start values for each response.
Separate the entered values by comma (“,”) and use a decimal point (“.”)
as decimal separator. Put the values in the same order as of the
Variable Selection (check via the menu 'R Script'
\(\rightarrow\) 'Variables'
).
For example, if you have 2 responses \(y_j, j
= 1,2\), enter your nugget start values \(g_j\) as the following: \(g_1, g_2\). If you enter only one value, it
will be “recycled” if you have more than one response variable. If you
do not enter any values, the default value of 0.000001 (1e-6) will be
used. The choice of these values has an impact on the resulting model
and the calculation time. Note: if you get the error message that the
value “g” must be below a certain value, decrease your chosen nugget
start value.
The number of prediction levels is needed to create a full factorial design and can be set for each predictor and response. The input rules are the same as for the bandwidth start values. If not specified, the number of prediction levels will be set to 25 for each predictor and response. The higher these values and the more predictors you have, the longer the calculation time. If you encounter memory issues, reduce the prediction level values or the number of predictors. The output table “Design Predictions” will then have as many rows as the product of the prediction levels over all predictors. For instance, if you have 3 predictors and one response and you have entered “20, 20, 20” as prediction levels, you will obtain a Design Predictions table with \(20^3 = 8000\) rows.
The number of prediction points is used for the Adjusted Response Graph. If not specified, the number of prediction points will be set to 25.
The implementation is based on the R packages ‘laGP’ (Gramacy, 2016) and ‘DoE.base’ (Grömping, 2018). The theory behind the Gaussian Process Regression can be read in more detail in Bishop (2006) and Gramacy (2020).
This function will not accept data with missing values for predictors and/or responses. If your data contain rows with missing values, we recommend using the function Handling Missing Values beforehand.
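A quick plain-R check for such rows is sketched below (within Cornerstone, use the Handling Missing Values function mentioned above); the example data are hypothetical.

```r
## identify and drop rows with missing predictor or response values
dat <- data.frame(x = c(0.1, 0.2, NA, 0.4), y = c(1.2, NA, 0.8, 0.5))
any(!complete.cases(dat))            # TRUE: the data contain incomplete rows
dat.complete <- dat[complete.cases(dat), ]
```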
Christopher M. Bishop (2006). “Pattern Recognition and Machine Learning (Information Science and Statistics).” Springer-Verlag, Berlin, Heidelberg.
Robert B. Gramacy (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, 72(1), 1-46. doi: 10.18637/jss.v072.i01.
Robert B. Gramacy (2020). “Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences.” Chapman & Hall/CRC, Boca Raton, Florida.
Ulrike Grömping (2018). “R Package DoE.base for Factorial Experiments.” Journal of Statistical Software, 85(5), 1-41. doi: 10.18637/jss.v085.i05.