`vignettes/logisticRegression.Rmd`


In many cases, the response variable is binary, e.g. True/False, Good/Bad or Yes/No, and thus you need to fit a logistic regression. A logistic model with ‘logit’ link has the form \[ \operatorname{logit}\big(P(Y=1\mid X=x_i)\big) = \log\left(\dfrac{P(Y=1\mid X=x_i)}{1-P(Y=1\mid X=x_i)}\right) = x'_i\beta, \] or, equivalently in terms of the odds, \[ \dfrac{P(Y=1\mid X=x_i)}{P(Y=0\mid X=x_i)} = \dfrac{P(Y=1\mid X=x_i)}{1-P(Y=1\mid X=x_i)}=e^{x'_i\beta}. \] A logistic model with ‘probit’ link takes the form \[ P(Y=1 \mid X=x_i) = \Phi(x_i'\beta), \] where \(\Phi\) denotes the standard normal distribution function.
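As a cross-check outside ‘Cornerstone’, both link functions can be fitted with base R’s `glm()`. This is only a minimal sketch; the data and the coefficients `0.5` and `1.2` are simulated for illustration and are not part of the vignette’s example.

```r
# Simulate a binary response under a logit model (illustrative values only)
set.seed(1)
n <- 200
x <- rnorm(n)
p <- plogis(0.5 + 1.2 * x)          # P(Y = 1 | x) under the logit link
y <- rbinom(n, size = 1, prob = p)

# Fit the same data with both link functions
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# Under the logit link, exp(beta) is the odds ratio per unit increase in x
exp(coef(fit_logit)["x"])
```

The coefficients of the two fits are not directly comparable, since logit and probit work on different latent scales.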

In general, we would like to get the best model fit using the available data. How do we realize this in ‘Cornerstone’ using ‘Logistic Regression’ from ‘CornerstoneR’?

In this example, we will start with the ‘Cornerstone’ sample data set
‘regdata’. The data contains the columns `'Clearance'`, `'Oven'`,
`'Current'`, `'Thickness'` and `'Yield'`. To obtain a binary response
variable out of `'Yield'`, we add a computed column to the end of the
data set by inserting the following formula:
`if(random() < Yield/200) then (1) else(0)`. We name the column
`'Good/Bad'`. This will be our response variable. `'Clearance'`,
`'Current'` and `'Thickness'` are continuous predictors, whereas
`'Oven'` is a categorical predictor. The following screenshot shows the
final data set.
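For readers without ‘Cornerstone’ at hand, a hedged R equivalent of the computed-column formula might look as follows; the `Yield` values here are invented and only illustrate the mechanism.

```r
# R sketch of the Cornerstone formula
# if(random() < Yield/200) then (1) else(0)
set.seed(42)
Yield <- c(95, 120, 150, 180)   # invented yields on a 0-200 scale
GoodBad <- ifelse(runif(length(Yield)) < Yield / 200, 1, 0)
GoodBad                          # binary response, one value per row
```

The higher the yield, the larger the probability that a row is coded as 1 (“Good”).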

Choose the menu ‘Analysis’ \(\rightarrow\) ‘CornerstoneR’ \(\rightarrow\) ‘Logistic Regression’ as shown in the following screenshot.

In the appearing dialog, select the variables `'Clearance'`, `'Oven'`,
`'Current'` and `'Thickness'` as predictors. `'Good/Bad'` is the
response variable.

‘OK’ confirms your selection and the following window appears.

Before we start the script, it is necessary to set the following options:

- the type of link function: ‘logit’ or ‘probit’
- the type of effects: ‘Linear’, ‘Linear+’, ‘Interactions’ or ‘Quadratic’ (see Regression.pdf, p. 35)
- the significance level (also known as “alpha error”)
- check whether you want to conduct an automatic stepwise regression based on the significance level you choose (see below for details)

To do this, we open the menu ‘R Script’ \(\rightarrow\) ‘Script Variables’. The screenshot shows the default options.

In the appearing dialog, we select

- “logit” for logit link
- “Linear+” for linear and quadratic effects without interactions
- “0.10” as significance level
- no stepwise regression \(\rightarrow\) leave the box unchecked

Now close this dialog with ‘OK’ and click the execute button (green arrow) or choose the menu ‘R Script’ \(\rightarrow\) ‘Execute’; all calculations are then done via ‘R’. The calculations are finished when the status bar at the lower left shows ‘Last execute error state: OK’. Our results are available via the ‘Summaries’ menu as shown in the following screenshot.

The menu ‘Goodness of Fit’ shows the number of complete observations used for fitting the response variable (column “Count”) together with the evaluation measures R-Square, adjusted R-Square and RMS Error, as known from linear regression.

The menu ‘Coefficient Table’ displays the model coefficients, the
standard errors, t-values^{1} and p-values (to check significance) for
each term. If the link function was chosen to be ‘logit’, the
exp-coefficients are shown in the third column (“Exp Coefficient”).
Categorical predictors are effect coded, and the last group per category
variable is set as contrasting group^{2}.
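Effect coding as described above corresponds to R’s sum-to-zero contrasts. A small sketch with an invented `Oven` factor (the levels are not from the ‘regdata’ set):

```r
# Effect (sum-to-zero) coding for a categorical predictor; the last
# level acts as the contrasting group and is coded -1 in every column.
Oven <- factor(c("A", "B", "C", "A", "B", "C"))
contrasts(Oven) <- contr.sum(nlevels(Oven))
contrasts(Oven)   # rows: levels A, B, C; level C is the contrasting group
```

With this coding, the intercept of a model using `Oven` estimates the unweighted grand mean across the groups.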

The menu ‘Coefficient Correlation’ shows the coefficient correlation matrix, which is computed as \[ \Upsilon_{ij} = \dfrac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}, \] where \(\mathbf{\Sigma}\) represents the variance-covariance matrix of the coefficients.
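In R, this matrix can be obtained from the coefficient variance-covariance matrix with `cov2cor()`, which implements exactly the formula above; the fit below uses simulated data for illustration.

```r
# Simulated fit, only to have a coefficient covariance matrix to work with
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

Sigma   <- vcov(fit)       # variance-covariance matrix of the coefficients
Upsilon <- cov2cor(Sigma)  # Upsilon_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj)
Upsilon
```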

To get a better overview of the coefficient correlations, we can look at the correlation plot by selecting ‘Graphs’ \(\rightarrow\) ‘Correlation Plot (EMF)’ in the ‘R Script’ menu. The following window with the requested graph appears.

The menu ‘Fit Estimate’ shows the fitted values, the residuals and the leverages per model observation. The fitted values are the predicted probabilities for the sample values. The residuals are the differences between the measured responses and the fitted values. The leverages are the diagonal of the ‘hat’ matrix (see Regression.pdf, p. 116).
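These three quantities can be reproduced for any `glm` fit in R; the data below are again simulated for illustration.

```r
# Simulated logistic fit
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

p_hat <- fitted(fit)       # fitted values: predicted probabilities
res   <- y - p_hat         # response residuals: measured minus fitted
lev   <- hatvalues(fit)    # leverages: diagonal of the 'hat' matrix
```

The leverages of a fit sum to the number of model parameters, which gives a quick sanity check.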

The menu ‘Regression Dataset’ opens a table showing the model terms
with their corresponding p-values (for the significance check)^{3}. In
addition, it lists the evaluation measures R-Square, adjusted R-Square,
RMS Error, and the residual degrees of freedom. These measures help to
evaluate the model performance.

In contrast to Linear Regression, we calculate McFadden’s R-Square for Logistic Regression as \[ R^2 = 1 - \dfrac{\log(L_1)}{\log(L_0)}, \] where \(\log(L_1)\) is the (maximized) log-likelihood of the current fitted model, and \(\log(L_0)\) denotes the analogous value for the null model – the model with only an intercept and no covariates.

The adjusted R-Square can be constructed in analogy: \[ R^2_{adj} = 1 - \dfrac{\log(L_1)}{\log(L_0)} \times \dfrac{n-1}{n-p}, \] where \(n\) is the number of observations and \(p = k + 1\) with \(k\) as the number of predictors in the model. \(n - p\) is then the number of residual degrees of freedom. According to McFadden, an (adjusted) pseudo R-Square between 0.2 and 0.4 indicates excellent model fit.

The RMS Error (Root-Mean-Square Error) is computed as follows: \[ \sqrt{\dfrac{-2 \log(L_1)}{n-p}}. \]
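These goodness-of-fit measures can be sketched in R from the log-likelihoods of the fitted model and the null model. The data are simulated; the exact CornerstoneR implementation may differ in detail.

```r
# Simulated fit plus null model (intercept only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

logL1 <- as.numeric(logLik(fit))     # log-likelihood of the fitted model
logL0 <- as.numeric(logLik(null))    # log-likelihood of the null model
p <- length(coef(fit))               # p = k + 1 (predictors plus intercept)

r2     <- 1 - logL1 / logL0                        # McFadden's R-Square
r2_adj <- 1 - (logL1 / logL0) * (n - 1) / (n - p)  # adjusted, per the formula above
rmse   <- sqrt(-2 * logL1 / (n - p))               # RMS Error
```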

The last menu ‘Response vs. Predictors’ shows the model predictions
for our dataset. By clicking on the column name, we can see the formula
behind the computed values. For our response variable `'Good/Bad'` this
corresponds to the model formula.

The stepwise regression respects the model hierarchy and is
recommended when there are covariates that are not significant,
i.e. their p-value exceeds the significance level, and the goodness-of-fit
measures (e.g. R-Square) leave room for improvement. We use the results
from our ‘Regression Dataset’ and ‘Coefficient Table’ above to decide
whether to execute a stepwise regression based on the significance of
each term (similar to the “auto” button in linear regression, see
documentation Regression.pdf, p. 24) and check whether we can improve
performance. The Coefficient Table already reveals that, for example, the
term `'Thickness^2'` is not significant, as its p-value is higher than
our significance level of 0.10. We thus decide to conduct a stepwise
regression.
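A backward-elimination sketch in the spirit of this option can be written with likelihood-ratio p-values from `drop1()`. The exact CornerstoneR algorithm may differ (e.g. in how it preserves model hierarchy), and the data below are simulated: `x2` has no real effect on the response.

```r
# Simulated data: x1 matters, x2 is pure noise
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1))

fit <- glm(y ~ x1 + x2, family = binomial, data = d)
alpha <- 0.10
repeat {
  tests <- drop1(fit, test = "Chisq")       # likelihood-ratio test per term
  pvals <- tests[["Pr(>Chi)"]][-1]          # drop the '<none>' row
  if (all(pvals <= alpha)) break            # all remaining terms significant
  worst <- rownames(tests)[-1][which.max(pvals)]
  fit <- update(fit, formula(paste(". ~ . -", worst)))
}
summary(fit)
```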

The screenshot shows the script variables setting for a stepwise regression with alpha = 0.10. The other script variables were not changed, i.e. the link function remains ‘logit’ and the type of effects remains ‘Linear+’.

Now we close this dialog via ‘OK’ and re-execute the R Script (green arrow button). As soon as the calculations are done, our updated results are available via the menus ‘Summaries’ again. We now open the ‘Regression Dataset’ menu.

We can now see that the stepwise regression removed the quadratic
terms `'Thickness^2'` and `'Current^2'` as well as the linear terms
`'Thickness'` and `'Current'`, as they had no significant effect on
`'Good/Bad'` (p-value > 0.10).

If you get an error message or a strange output, pay attention to these aspects:

- so far, the function supports only *one* (binary) response variable; “Group by” and “Auxiliaries” variables are redundant
- observations with missing values will be omitted; if you wish some other behavior, do the missing-value handling first
- watch out for multicollinearity: if your predictors are highly correlated, the coefficient estimates may become imprecise (wrong sign, large standard errors); in this case, check whether you can leave out or aggregate the affected variable(s) first
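A quick way to spot such problematic pairs is the correlation matrix of the continuous predictors. The column names follow the vignette, but the values are invented, with `Thickness` made nearly collinear with `Current` on purpose.

```r
# Invented predictor values with deliberate near-collinearity
set.seed(1)
d <- data.frame(Clearance = rnorm(50), Current = rnorm(50))
d$Thickness <- d$Current + rnorm(50, sd = 0.1)   # almost a copy of Current

round(cor(d), 2)   # pairwise |r| close to 1 flags problematic pairs
```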

Strictly speaking, z-values (not t-values) are calculated in logistic regression. We call this column “t-values” to be consistent with the naming used for linear regression in Cornerstone.

The constant term represents the unweighted Grand Mean (mean of the group means). The other coefficients can be interpreted as deviations from the Grand Mean. The coefficient of the contrasting group is then the negative sum of the coefficients of the other groups. The p-value of the contrasting group is set to NA.

No p-value is available for the contrasting group.