vignettes/missingValuesHandling.Rmd
missingValuesHandling.Rmd
In many cases, data contains missing entries. For example, the
variable 'MPG'
from the dataset 'carstats'
has
eight missing values (\(\bullet\)),
meaning that no 'MPG'
was recorded for Model
"citroen ds-21 pallas"
from "France"
,
"chevrolet chevelle concours (sw)"
,
"ford torino (sw)"
, "plymouth satellite (sw)"
and "amc rebel sst (sw)"
from "USA"
,
"volkswagen super beetle 117"
from "Germany"
and "saab 900s"
from "Sweden"
. Further, the
variable 'Horsepower'
contains six missing values.
Carstats Data
Our goal here is to pre-process the observations (rows) of the data
for further analyses by omitting or imputing the missing entries. How to
impute the missing values meaningfully depends on the context of the
data. The'Missing Values Handling'
function eases the
handling of missing entries by proposing some pre-defined methods. It is
also possible to define the handling method for each variable (column)
individually.
First of all, open a dataset which contains missing values,
e.g. 'carstats'
, and choose the menu
'Analyses'
-> 'CornerstoneR'
->
'Missing Values Handling'
.
As shown in the following screenshot, a dialog window will pop-up
which enables to select the variables from the data you want to access.
We choose 'Model'
, 'Origin'
,
'MPG'
and 'Cylinders'
as
'Predictors'
, and 'Displacement'
,
'Horsepower'
, 'Weight'
,
'Acceleration'
and 'Model Year'
as
Responses
. In any other circumstances, you can choose any
variables either to Predictors
or Responses
,
or both. If you choose one or more variables as Group by
variables, the missing values handling will be considered by its
corresponding group. If you wish to interpolate your data, you can
choose an underlying time scale via Auxiliaries
.
Missing Values Handling: Variable Selection
Click 'OK'
to confirm your selection and the following
window appears.
Missing Values Handling: R Script
Before executing the script, we can select the method we would like
to apply to our missing values. For this purpose, we open the menu
'R Script'
-> 'Script Variables'
.
Missing Values Handling: R Script Variables Menu
In the appearing dialog, we can choose the desired handling method
out of "Omit Missing Values (omit)"
(default),
"Last Observation Carried Forward (locf)"
,
"Next Observation Carried Backward (nocb)"
,
"Mean Values (mean)"
,
"Median Values (median)"
,
"Minimum Values (min)"
,
"Maximum Values (max)"
,
"Linear Interpolation (linpol)"
,
"Cubic Interpolation (cubicpol)"
, or compose a method
manually by selecting "User Defined"
. Note that the mean,
the median, the minimum, the maximum and the interpolation functions can
only be computed for numerical variables. If you have set an auxiliary
variable, the interpolation will consider this as underlying time scale.
Otherwise the interpolation will use an equidistant spacing. If we
choose "User Defined"
, we must define an imputation method
(as string) in the field
"User-Defined Imputation (values, methods or mathematical terms)"
.
Here, we have the following possibilities:
"0"
"MPG = 0"
or
"MPG = 0, Horsepower = 1"
"MPG = omit, Horsepower = min"
(the method identifiers
here are “omit”, “locf”, “nocb”, “mean”, “median”, “min”, “max”)"MPG = 3+5, Horsepower = log(4)"
The conditions can also be mixed,
e.g. "MPG = mean, Horsepower = sqrt(2)"
. Note that the
Missing Values Handling function cannot (yet) impute the missing values
using values from other variables,
e.g. "Horsepower = Weight+5"
.
Via the field "Missing Value Representation"
, we can
define new representation(s) for the missing values, apart from the
“real” missing values (i.e. black points in Cornerstone data). Multiple
representations should be separated by comma e.g., “0, N/A, [MISSING],
.” would replace all the entries “0”, “N/A”, “[MISSING]” and “.” in the
data. These entries will also be considered as missing values within the
Missing Values Handling function.
If the data contains empty columns (columns with missing entries
only) and we want to delete these, we can remove them by checking the
box "Remove Empty Columns If Available"
.
Here, we stick to the default values and click 'OK'
.
Missing Values Handling: Script Variables
Now click the Execute button (green arrow) on the left side of the
menu bar, or choose the menu 'R Script'
->
'Execute'
and all calculations are done via ‘R’. The
computations are done if the text at the lower left status bar contains
'Last execute error state: OK'
.
Missing Values Handling: Execute Green Arrow
Our results are available via the menu 'Summaries'
as
shown in the following screenshot.
Missing Values Handling: Summaries Menu
'Row Indices of Missing Values'
opens a ‘Cornerstone’
data set, indicating the row numbers of the missing values in the input
dataset. The screenshot shows that "MPG"
and
"Horsepower"
contain missing values and in which row we can
find them.
Missing Values Handling: Row Indices of Missing Values
Via 'Summaries'
-> 'Output Dataset'
, we
obtain the complete cases of the input data set. 14 observations (rows)
have been removed, resulting in a table of 392 observations. Note that
the brush tool does not work if we have chosen to omit the missing
values because the resulting data set shows a different number of rows.
The brush tool is applicable if we have chosen to impute the missing
entries.
Missing Values Handling: Table with Omitted Missing Values
For numeric variables, we can for example impute the mean values by
its column mean values. For this, we go back to the R script and open
the menu 'R Script'
-> 'Script Variables'
.
We now select 'Mean Values (mean)'
instead of
'Omit Missing Values (omit)'
as Imputation Method.
Missing Values Handling: Script Variables - Mean Values
After executing the R script and opening the output data set, we can see that all the missing values have been replaced with the mean per column. Therefore, we now obtain a complete brushable data set with 406 observations (rows).
Missing Values Handling: Output Table with Mean Values
Now, we want to replace the missing values in 'carstats'
with its mean values grouped by the variable 'Origin'
. We
re-open the 'carstats'
data set and select
'Analyses'
-> 'CornerstoneR'
->
'Missing Values Handling'
. We choose 'Model'
,
'MPG'
and 'Cylinders'
as
'Predictors'
, and 'Displacement'
,
'Horsepower'
, 'Weight'
,
'Acceleration'
and 'Model Year'
as
Responses
. In addition, we choose 'Origin'
as
Group By
variable and click 'OK'
.
Missing Values Handling: Script Variables with Group By
In the R script window, we open 'R Script'
->
'Script Variables'
and set again
'Mean Values (mean)'
as Imputation Method. Click
'OK'
and Execute the R script via the green arrow. Open the
output data set in 'Summaries'
.
Missing Values Handling: Output with Group By
We can now see that the missing values were replaced by the mean per
value in 'Origin'
. For example, the mean Horsepower of the
cars from 'MPG'
(about 27.2) is higher than the mean
'MPG'
of the cars from the 'USA'
(about 20.1).
If we now compare this output dataset, we can see that these mean values
do not correspond the total mean values of 'MPG'
or
'Horsepower'
.