In addition to the 'Cornerstone' core methods, such as fitting data by
linear regression or performing a MANOVA, it is possible to model data
with a decision tree.
How do we use the method 'Decision Tree' in 'Cornerstone' from 'CornerstoneR'?
To use a decision tree model in 'Cornerstone', open a dataset, e.g.
'irisdata', and choose the menu 'Analyses' \(\rightarrow\)
'CornerstoneR' \(\rightarrow\) 'Decision Tree' as shown in the
following screenshot.
In the dialog that appears, select all 'sepal_\*' and 'petal_\*'
variables as predictors. 'iris_type' is the response variable. It is
also possible to select multiple responses to fit multiple decision
tree models at once.
'OK' confirms your selection and the following window appears.
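For readers who want to see roughly what such a fit looks like in plain 'R', the following is a minimal sketch using rpart::rpart(), which this document references further below. It assumes that R's built-in `iris` data stands in for 'irisdata' and that `Species` plays the role of 'iris_type'; the exact call that 'CornerstoneR' issues internally is not shown here and may differ.

```r
library(rpart)

# Assumption: the built-in 'iris' data stands in for 'irisdata',
# and 'Species' corresponds to the response 'iris_type'.
fit <- rpart(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data   = iris,
  method = "class"  # classification, because the response is a factor
)

# Text representation of the fitted tree
print(fit)
```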
Now, click the execute button (green arrow) or choose the menu
'R Script' \(\rightarrow\) 'Execute' and all calculations are done via
'R'. The calculations are finished when the status bar at the lower
left shows 'Last execute error state: OK'. Our result is available via
the 'Summaries' menu as shown in the following screenshot.
Via 'Summaries' \(\rightarrow\) 'Statistics' the following dataset with
some essential statistics is shown. If you selected multiple response
variables, these statistics are shown row-wise for each variable.
For instance, 'Type' shows whether the tree used a classification or
regression model. 'Sample Size' shows how many observations the model
was fitted on. To help estimate the calculation time for larger data,
'Runtime R Script [s]' shows the time 'R' needed.
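The statistics described above can be approximated in plain 'R' as sketched below. This is only an illustration: the column names `Type`, `SampleSize` and `RuntimeSeconds` are chosen here for readability, and the built-in `iris` data is assumed as a stand-in for 'irisdata'.

```r
library(rpart)

# Time the model fit; the assignment inside system.time() is kept
runtime <- system.time(
  fit <- rpart(Species ~ ., data = iris)
)

# Hypothetical one-row summary, similar in spirit to 'Statistics'
stats <- data.frame(
  Response       = "Species",
  Type           = if (fit$method == "class") "Classification" else "Regression",
  SampleSize     = fit$frame$n[1],  # number of observations in the root node
  RuntimeSeconds = unname(runtime["elapsed"])
)
print(stats)
```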
Via 'Summaries' \(\rightarrow\) 'Variable Importance' the following
dataset is shown. For multiple responses the variable importance is
shown row-wise for each variable. The values are scaled between 0 and
100.
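As a rough illustration of such a scaling, the sketch below takes the importance values stored in an rpart fit and rescales them so that they sum to 100. Whether 'CornerstoneR' scales to a sum of 100 or relative to the largest value is an assumption here, not something stated in this document.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Raw importance values as stored by rpart (named numeric vector)
imp <- fit$variable.importance

# Assumption: rescale so that all values sum to 100
imp_scaled <- 100 * imp / sum(imp)
print(round(imp_scaled, 1))

# Simple bar plot of the scaled values
barplot(imp_scaled, las = 2, ylab = "Importance (scaled)")
```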
The related variable importance barplot can be found in 'Graphs'
\(\rightarrow\) 'Variable Importance'.
Via 'Summaries' \(\rightarrow\) 'Predictions' the following dataset is
shown. Each additional response variable adds four columns with its
corresponding data.
The first column 'Used.iris_type' indicates whether this observation
was used (1) or not (0) to fit the decision tree model. You find the
original data in column 'iris_type'. The corresponding prediction by
the model is shown in column 'Pred.iris_type'. 'Resid.iris_type', as
the fourth column, shows the calculated residual. For classification
models it is 0 (matching prediction) or 1 (not matching prediction).
For regression models it is the difference between observation and
prediction.
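The four columns described above can be reproduced in spirit with a few lines of 'R'. The sketch assumes the built-in `iris` data with `Species` as the response, so the column names differ from the 'iris_type' names used in 'Cornerstone'.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, newdata = iris, type = "class")

predictions <- data.frame(
  Used.Species  = as.integer(!is.na(iris$Species)),  # 1 = used for fitting
  Species       = iris$Species,                      # observed response
  Pred.Species  = pred,                              # predicted class
  Resid.Species = as.integer(pred != iris$Species)   # 0 = match, 1 = mismatch
)
head(predictions)
```

For a regression tree, the residual column would instead hold the observed value minus the predicted value.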
If a response is not observed, the model automatically predicts its value. To demonstrate this case, we manually deleted the response value of the second observation. The result is shown in the following screenshot.
Now this row is not used to fit the model ('Used.iris_type' = 0), its
observation is missing as expected, the observation is predicted as
'setosa' in column 'Pred.iris_type', and it is not possible to
calculate a residual.
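The same behaviour can be sketched with rpart directly. By default, rpart drops rows whose response is missing during fitting, while predict() still returns a value for them. The example again assumes the built-in `iris` data as a stand-in for 'irisdata'.

```r
library(rpart)

dat <- iris
dat$Species[2] <- NA  # remove the response of the second observation

# Rows with a missing response are excluded from the fit
fit  <- rpart(Species ~ ., data = dat, method = "class")
pred <- predict(fit, newdata = dat, type = "class")

row2 <- data.frame(
  Used  = as.integer(!is.na(dat$Species[2])),    # 0: not used for fitting
  Obs   = dat$Species[2],                        # NA: response is missing
  Pred  = pred[2],                               # prediction is still made
  Resid = as.integer(pred[2] != dat$Species[2])  # NA: no residual possible
)
print(row2)
```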
Confusion tables are only calculated for classification models and
available via 'Summaries' \(\rightarrow\) 'Confusion Table'. For
multiple response variables, we add an additional menu for each
classification.
For each level of the response, the table shows the number of
corresponding predictions. For the 'irisdata' dataset, most predictions
match their observations.
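A confusion table of this kind is a cross-tabulation of observed against predicted classes. The following sketch builds one with base 'R' for a tree fitted on the built-in `iris` data (assumed here as a stand-in for 'irisdata').

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, type = "class")  # predictions for the training data

# Rows: observed classes, columns: predicted classes
conf <- table(Observed = iris$Species, Predicted = pred)
print(conf)

# Share of matching predictions (diagonal of the table)
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```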
In this section we discuss prediction of a response in a new dataset
with the existing model from above. To do so, we open the dataset
'irisdata' in 'Cornerstone' again and delete the column 'iris_type'.
Starting from this dataset, we want to predict the original response
'iris_type'. We do this via the menu 'Analyses' \(\rightarrow\)
'CornerstoneR' \(\rightarrow\) 'Model Prediction' as shown in the
following screenshot.
In the dialog that appears, select all 'sepal_\*' and 'petal_\*'
variables as predictors. We have no response variable.
'OK' confirms your selection and the following window appears.
At this point we add the existing Decision Tree model to the prediction
dialog at hand. This is done via the menu 'R Script' \(\rightarrow\)
'Input R Objects', which brings up the following dialog.
We choose 'Decision Tree Models' as selected 'R' objects and click 'OK'.
Now, click the execute button (green arrow) or choose the menu
'R Script' \(\rightarrow\) 'Execute' and all calculations are done via
'R'. The calculations are finished when the status bar at the lower
left shows 'Last execute error state: OK'. Our result is available via
the 'Summaries' menu as shown in the following screenshot.
This menu opens a dataset with all response columns that are predictable from the chosen decision tree models.
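In plain 'R', applying an existing model to a dataset without the response column amounts to a call to predict(). The sketch below assumes the built-in `iris` data; the names `new_data` and `Pred.Species` are illustrative only, and how 'Cornerstone' passes the model object between analyses is not covered.

```r
library(rpart)

# Existing model, fitted on the full data including the response
fit <- rpart(Species ~ ., data = iris, method = "class")

# New dataset without the response column
new_data <- iris[, setdiff(names(iris), "Species")]

# Predict the response from the predictors alone
pred   <- predict(fit, newdata = new_data, type = "class")
result <- cbind(new_data, Pred.Species = pred)
head(result)
```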
Finally, the 'Cornerstone' workmap with all generated objects looks
like the following screenshot.
Some options are exported from the used 'R' method to 'Cornerstone'.
Starting from the 'R' analysis object for the Decision Tree, you find
the 'Script Variables' dialog via the menu 'R Script' \(\rightarrow\)
'Script Variables'. The following dialog appears.
During the data exploration phase you may notice a pattern and want to check its impact on your responses. By checking 'Use Brush State as Additional Predictor', the current brush selection is used in the decision tree fitting as an additional dichotomous predictor variable. After brushing observations in a graph or dataset, execute the decision tree 'R' script and the model is updated using the brush as a predictor variable.
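Conceptually, the brush state behaves like an extra two-level column in the data. The following sketch mimics this in plain 'R'; the brush condition and the column name `Brushed` are hypothetical, and the built-in `iris` data is assumed as a stand-in for 'irisdata'.

```r
library(rpart)

dat <- iris

# Hypothetical brush: rows a user might have selected in a graph
dat$Brushed <- factor(
  dat$Sepal.Width > 3.5,
  levels = c(FALSE, TRUE),
  labels = c("no", "yes")
)

# The brush state enters the model like any other predictor
fit <- rpart(Species ~ ., data = dat, method = "class")
print(fit$variable.importance)
```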
As an alternative, you can use only brushed or non-brushed observations
to fit the decision tree model. Hence, after brushing a number of
observations, it is not necessary to create a 'Cornerstone' subset to
exclude or include specific rows. You can just use this option to fit
the decision tree model on the brushed or non-brushed set of rows.
If you use the option above ('Use Brush State as Additional Predictor'), this selection is automatically overridden and set to 'all' rows.
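Fitting on only the brushed or only the non-brushed rows corresponds to fitting on a row subset. A minimal sketch, with a hypothetical brush condition on the built-in `iris` data:

```r
library(rpart)

# Hypothetical brush selection (TRUE = brushed row)
brushed <- iris$Sepal.Width > 3.0

# One model on the brushed rows, one on the remaining rows
fit_brushed     <- rpart(Species ~ ., data = iris[brushed, ],  method = "class")
fit_non_brushed <- rpart(Species ~ ., data = iris[!brushed, ], method = "class")

# Number of observations each model was fitted on
sizes <- c(
  brushed     = fit_brushed$frame$n[1],
  non_brushed = fit_non_brushed$frame$n[1]
)
print(sizes)
```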
The option 'Splitting Criterion (for Classification)' can be changed
between 'Gini' and 'Information' (also called Entropy). The measure for
regression is always the sum of squared errors.
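In rpart, the splitting criterion for classification is set through the `parms` argument, and regression trees are requested with `method = "anova"`. The sketch below shows both variants on the built-in `iris` data; how the 'Cornerstone' option maps onto these arguments is an assumption.

```r
library(rpart)

# Classification with the Gini index (rpart's default)
fit_gini <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "gini"))

# Classification with the information (entropy) criterion
fit_info <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))

# Regression tree: splits are chosen by reduction in the sum of squares
fit_reg <- rpart(Sepal.Length ~ ., data = iris, method = "anova")

print(fit_info)
```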
Setting 'Minimal Node Size' to a different value changes the minimal
node size used to fit the decision tree model. The flag 'Prune Tree'
can be checked if automatic pruning is required. Both are used for
regularization of the model, i.e. to prevent overfitting.
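Both regularization options have natural counterparts in rpart. The sketch below assumes that 'Minimal Node Size' corresponds to `minbucket` in rpart.control() and that pruning picks the complexity parameter with the lowest cross-validated error; the rule 'CornerstoneR' actually applies may differ.

```r
library(rpart)

set.seed(1)  # cross-validation inside rpart is random

# Grow a deliberately large tree: small nodes allowed, no complexity penalty
fit <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(minbucket = 5, cp = 0)  # assumption: node size = minbucket
)

# Pick the complexity parameter with the smallest cross-validated error
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cost-complexity pruning back to that parameter
pruned <- prune(fit, cp = best_cp)
print(pruned)
```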
For details, take a look into the documentation of rpart::rpart() and
rpart.plot::rpart.plot().
The options 'Graphic Size: Width' and 'Graphic Size: Height' control
the size of the tree plot. The default size is 700x700, which
corresponds to a square graphic. For big trees with a lot of nodes, it
might be helpful to increase the width of the graphic (e.g. 1300x700)
to better distinguish the nodes and read the texts in the graphic.
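To illustrate the effect of the graphic size, the following sketch writes a tree plot to a PNG file of 1300x700 pixels. The output path is a temporary file chosen for this example, and the fallback to rpart's base plotting is included only in case the rpart.plot package is not installed.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Illustrative output file in the session's temporary directory
out_file <- file.path(tempdir(), "decision_tree.png")

# Wider than the 700x700 default to leave more room for the nodes
png(out_file, width = 1300, height = 700)
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(fit)
} else {
  plot(fit, margin = 0.1)  # base rpart plot as a fallback
  text(fit, use.n = TRUE)
}
invisible(dev.off())
```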
This function will not accept data with missing values for predictors. If your data contain rows with missing values, we recommend using the function 'Handling Missing Values' beforehand.
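One simple way to obtain data without missing predictor values in plain 'R' is to keep only complete rows, as sketched below. This is just an illustration on the built-in `iris` data with artificially inserted gaps; the 'Handling Missing Values' function may offer other strategies that are not shown here.

```r
library(rpart)

dat <- iris
dat$Petal.Width[c(3, 77)] <- NA  # artificially introduce missing predictors

predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Keep only rows where every predictor is observed
complete_rows <- complete.cases(dat[, predictors])
dat_complete  <- dat[complete_rows, ]

fit <- rpart(Species ~ ., data = dat_complete, method = "class")
print(nrow(dat_complete))
```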