In many cases the response variable is binary, e.g. True/False, Good/Bad, or Yes/No, and thus a logistic regression has to be fitted. A logistic model with the ‘logit’ link models the log-odds \[ \operatorname{logit}\bigl(P(Y=1\mid X=x_i)\bigr) = \log\left(\dfrac{P(Y=1\mid X=x_i)}{1-P(Y=1\mid X=x_i)}\right) = x_i'\beta \] or, equivalently, the odds \[ \dfrac{P(Y=1\mid X=x_i)}{P(Y=0\mid X=x_i)} = \dfrac{P(Y=1\mid X=x_i)}{1-P(Y=1\mid X=x_i)} = e^{x_i'\beta}. \] A logistic model with the ‘probit’ link takes the form \[ P(Y_i = 1 \mid X=x_i) = \Phi(x_i'\beta), \] where \(\Phi\) denotes the standard normal distribution function.
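Outside of ‘Cornerstone’, both link functions correspond to R’s `glm()` with a binomial family. The following minimal sketch uses simulated placeholder data (not the sample data set used below) purely to illustrate the two model forms.

```r
# Minimal sketch with simulated data: logit vs. probit link in R's glm().
set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
eta <- -0.5 + 1.2 * dat$x1 - 0.8 * dat$x2         # linear predictor x'beta
dat$y <- rbinom(n, size = 1, prob = plogis(eta))  # P(Y = 1 | x) via the logistic function

fit_logit  <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "probit"))

coef(fit_logit)       # coefficients beta on the log-odds scale
exp(coef(fit_logit))  # odds ratios e^beta
```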
In general, we would like to get the best model fit using the available data. How do we realize this in ‘Cornerstone’ using ‘Logistic Regression’ from ‘CornerstoneR’?
In this example, we will start with the ‘Cornerstone’ sample data set 'regdata'. The data contains the columns 'Clearance', 'Oven', 'Current', 'Thickness' and 'Yield'. To obtain a binary response variable out of 'Yield', we add a computed column to the end of the data set by inserting the formula `if(random() < Yield/200) then (1) else (0)`. We name the column 'Good/Bad'; this will be our response variable. 'Clearance', 'Current' and 'Thickness' are continuous predictors, whereas 'Oven' is a categorical predictor. The following screenshot shows the final data set.
Show Data to Be Fitted
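In plain R, the same kind of binary column could be derived as sketched below; `regdata` here is a small simulated stand-in for the Cornerstone sample data, used only to mimic the formula above.

```r
# Sketch: mimic the Cornerstone formula if(random() < Yield/200) then (1) else (0).
# 'regdata' is a simulated stand-in, not the actual Cornerstone sample data set.
set.seed(42)
regdata <- data.frame(Yield = runif(50, min = 60, max = 100))
regdata$GoodBad <- ifelse(runif(nrow(regdata)) < regdata$Yield / 200, 1, 0)
head(regdata)
```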
Choose the menu ‘Analysis’ \(\rightarrow\) ‘CornerstoneR’ \(\rightarrow\) ‘Logistic Regression’ as shown in the following screenshot.
Fit Logistic Function: Menu
In the appearing dialog, select the variables 'Clearance', 'Oven', 'Current' and 'Thickness' as predictors. 'Good/Bad' is the response variable.
Fit Logistic Function: Variable Selection
‘OK’ confirms your selection and the following window appears.
Fit Logistic Function: R Script
Before we start the script, it is necessary to set the script variables. To do this, we open the menu ‘R Script’ \(\rightarrow\) ‘Script Variables’. The screenshot shows the default options.
Fit Logistic Function: R Script Variables Menu
In the appearing dialog, we select the link function ‘logit’ and the type of effects ‘Linear+’, as shown in the following screenshot.
Fit Logistic Function: R Script Variables
Now close this dialog with ‘OK’ and click the execute button (green arrow) or choose the menu ‘R Script’ \(\rightarrow\) ‘Execute’; all calculations are then done via ‘R’. The calculations are finished when the text in the lower left status bar reads ‘Last execute error state: OK’. Our results are available via the ‘Summaries’ menu as shown in the following screenshot.
Fit Logistic Function: Result Menu
The menu ‘Goodness of Fit’ shows the number of complete observations used for fitting the response variable (column “Count”) and the evaluation measures R-Square, adjusted R-Square and RMS Error (see the ‘Regression Dataset’ section below for their definitions).
Fit Logistic Function: Goodness of Fit
The menu ‘Coefficient Table’ displays the model coefficients, standard errors, t-values[^1] and p-values (to check significance) for each term. If the link function is ‘logit’, the exp-coefficients are shown in the third column (“Exp Coefficient”). Categorical predictors are effect coded, and the last group per categorical variable is set as the contrasting group[^2].
Fit Logistic Function: Coefficient Table
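The corresponding quantities can be reproduced for a `glm()` fit in R; the sketch below uses simulated data with a categorical predictor and requests effect (sum) coding, mirroring the coding described above. Data and object names are placeholders, not the CornerstoneR implementation.

```r
# Sketch: coefficient table (estimate, std. error, z-value, p-value) and
# exp-coefficients for a logit model with an effect-coded categorical predictor.
set.seed(1)
n   <- 300
dat <- data.frame(x    = rnorm(n),
                  Oven = factor(sample(c("A", "B", "C"), n, replace = TRUE)))
eta   <- 0.3 + 0.8 * dat$x + c(0.5, -0.2, -0.3)[as.integer(dat$Oven)]
dat$y <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(y ~ x + Oven, data = dat, family = binomial(link = "logit"),
           contrasts = list(Oven = contr.sum))  # effect coding; last level is the contrasting group
summary(fit)$coefficients  # estimate, standard error, z-value, p-value per term
exp(coef(fit))             # the "Exp Coefficient" column for the logit link
```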
The menu ‘Coefficient Correlation’ shows the coefficient correlation matrix, which is computed as \[ \Upsilon_{ij} = \dfrac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}, \] where \(\mathbf{\Sigma}\) denotes the variance-covariance matrix of the coefficients.
Fit Logistic Function: Coefficient Correlation
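For a fitted `glm()` object (such as the hypothetical `fit` from the sketch above), the same matrix can be obtained from the variance-covariance matrix of the coefficients:

```r
# Sketch: coefficient correlation matrix, assuming 'fit' from the previous sketch.
Sigma <- vcov(fit)  # variance-covariance matrix of the coefficients
cov2cor(Sigma)      # Upsilon_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj)
```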
To get a better overview of the coefficient correlations, we can look at the correlation plot by selecting ‘Graphs’ \(\rightarrow\) ‘Correlation Plot (EMF)’ in the ‘R Script’ menu. The following window with the requested graph appears.
Fit Logistic Function: Correlation Plot
The menu ‘Fit Estimate’ shows the fitted values, the residuals and the leverages per model observation. The fitted values are the predicted probabilities for the sample values. The residuals are the differences between the measured responses and the fitted values. The leverages are the diagonal elements of the ‘hat’ matrix (see regression.pdf, p. 116).
Fit Logistic Function: Fit Estimate
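The analogous quantities for a `glm()` fit in R (again assuming the hypothetical `fit` from the sketches above) can be extracted as follows:

```r
# Sketch: fitted probabilities, response residuals, and leverages,
# assuming 'fit' is the glm object from the earlier sketch.
fitted_p <- fitted(fit)                        # predicted probabilities P(Y = 1 | x)
resid_y  <- residuals(fit, type = "response")  # observed response minus fitted probability
lev      <- hatvalues(fit)                     # diagonal of the 'hat' matrix
head(data.frame(fitted_p, resid_y, lev))
```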
The menu ‘Regression Dataset’ opens a table showing the model terms with their corresponding p-values (for the significance check)[^3]. In addition, it shows the evaluation measures R-Square, adjusted R-Square, RMS Error, and the residual degrees of freedom. These measures help to evaluate the model performance.
In contrast to Linear Regression, we calculate McFadden’s R-Square for Logistic Regression as \[ R^2 = 1 - \dfrac{\log(L_1)}{\log(L_0)}, \] where \(\log(L_1)\) is the (maximized) log-likelihood of the fitted model and \(\log(L_0)\) denotes the analogous value for the null model, i.e. the model with only an intercept and no covariates.
The adjusted R-Square is constructed analogously: \[ R^2_{adj} = 1 - \dfrac{\log(L_1)}{\log(L_0)} \times \dfrac{n-1}{n-p}, \] where \(n\) is the number of observations and \(p = k + 1\) with \(k\) the number of predictors in the model; \(n - p\) is then the number of residual degrees of freedom. According to McFadden, an (adjusted) pseudo R-Square between 0.2 and 0.4 indicates an excellent model fit.
The RMS Error (Root-Mean-Square Error) is computed as \[ \sqrt{\dfrac{-2 \log(L_1)}{n-p}}. \]
Fit Logistic Function: Regression Dataset
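These measures can be reproduced for a `glm()` fit from the log-likelihoods of the fitted and the null model; a sketch assuming the hypothetical `fit` from the earlier sketches:

```r
# Sketch: McFadden's R-Square, its adjusted version, and the RMS Error,
# assuming 'fit' is the glm object from the earlier sketch.
logL1 <- as.numeric(logLik(fit))                 # log-likelihood of the fitted model
logL0 <- as.numeric(logLik(update(fit, . ~ 1)))  # null model: intercept only

n <- nobs(fit)
p <- length(coef(fit))                           # k predictors + intercept

r2     <- 1 - logL1 / logL0                          # McFadden's R-Square
r2_adj <- 1 - (logL1 / logL0) * (n - 1) / (n - p)    # adjusted pseudo R-Square
rmse   <- sqrt(-2 * logL1 / (n - p))                 # RMS Error
c(R2 = r2, R2.adj = r2_adj, RMS.Error = rmse)
```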
The last menu ‘Response vs. Predictors’ shows the model predictions for our dataset. By clicking on the column name, we can see the formula behind the computed values. For our response variable 'Good/Bad' this corresponds to the model formula.
Fit Logistic Function: Computed Prediction Table
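For the logit link, the computed prediction is the inverse link applied to the linear predictor \(x_i'\beta\); a short check in R (hypothetical `fit` as above):

```r
# Sketch: predictions equal the inverse logit of the linear predictor x'beta,
# assuming 'fit' is the glm object from the earlier sketch.
eta  <- predict(fit, type = "link")   # linear predictor x'beta
prob <- plogis(eta)                   # exp(eta) / (1 + exp(eta))
all.equal(prob, predict(fit, type = "response"))
```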
The stepwise regression respects the model hierarchy and is recommended when there are covariates which are not significant (i.e. p-value > significance level) and the goodness-of-fit measures (e.g. R-Square) leave room for improvement. We use the results from our ‘Regression Dataset’ and ‘Coefficient Table’ above to decide, based on the significance of each term, whether to execute a stepwise regression (similar to the ‘auto’ button in linear regression, see documentation Regression.pdf, p. 24) and check if we can improve the performance. The Coefficient Table already reveals that, for example, the term 'Thickness^2' is not significant, as its p-value is higher than our significance level of 0.10. We thus decide to conduct a stepwise regression.
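CornerstoneR performs this selection through the script variables shown below. As a rough illustration only (not the CornerstoneR implementation), a p-value based backward elimination for a `glm()` fit could be sketched in plain R, reusing the hypothetical simulated data `dat` from the earlier sketch:

```r
# Rough sketch of backward elimination by p-value (alpha = 0.10); illustration
# only, not the CornerstoneR stepwise implementation. drop1() respects
# marginality for interaction terms; polynomial hierarchy (x vs. I(x^2))
# would need extra care in a real application.
alpha <- 0.10
fit   <- glm(y ~ x + I(x^2) + Oven, data = dat, family = binomial("logit"))
repeat {
  tests <- drop1(fit, test = "Chisq")             # likelihood-ratio test per droppable term
  pvals <- tests[["Pr(>Chi)"]][-1]                # drop the '<none>' row
  if (length(pvals) == 0 || all(pvals <= alpha, na.rm = TRUE)) break
  worst <- rownames(tests)[-1][which.max(pvals)]  # least significant term
  fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
}
formula(fit)
```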
The screenshot shows the script variable settings for a stepwise regression with alpha = 0.10. The other script variables are not changed, i.e. the link function remains ‘logit’ and the type of effects remains ‘Linear+’.
Fit Logistic Function: Script Variables for Stepwise Regression based on Significance
Now we close this dialog via ‘OK’ and re-execute the R script (green arrow button). As soon as the calculations are done, our updated results are again available via the ‘Summaries’ menu. We now open the ‘Regression Dataset’ menu.
Fit Logistic Function: Regression Dataset after Stepwise Regression
We can now see that the stepwise regression removed the quadratic terms 'Thickness^2' and 'Current^2' as well as the linear terms 'Thickness' and 'Current', as they had no significant effect on 'Good/Bad' (p-value > 0.10).
If you get an error message or strange output, pay attention to the following aspects:
[^1]: Strictly speaking, z-values are calculated in logistic regression. We call this column “t-values” to be consistent with the naming used for linear regression in Cornerstone, although z-values are calculated here.

[^2]: The constant term represents the unweighted grand mean (the mean of the group means). The other coefficients can be interpreted as deviations from the grand mean. The coefficient of the contrasting group is then calculated as the grand mean minus the sum of the coefficients of the other groups. The p-value of the contrasting group is set to NA.

[^3]: No p-value is available for the contrasting group.