In many cases, data contains missing entries. For example, the
variable 'MPG' from the dataset 'carstats' contains
eight missing values (\(\bullet\)),
meaning that no
'MPG' was recorded for the models
"citroen ds-21 pallas" from
"chevrolet chevelle concours (sw)",
"ford torino (sw)",
"plymouth satellite (sw)",
"amc rebel sst (sw)" from
"volkswagen super beetle 117" from
"saab 900s" from
"Sweden". Furthermore, the variable
'Horsepower' contains six missing values.
Our goal here is to pre-process the observations (rows) of the data
for further analyses by omitting or imputing the missing entries. How to
impute the missing values meaningfully depends on the context of the
data. The 'Missing Values Handling' function eases the
handling of missing entries by offering some pre-defined methods. It is
also possible to define the handling method for each variable (column) separately.
First of all, open a dataset which contains missing values, e.g.
'carstats', and choose the menu
'Missing Values Handling'.
As shown in the following screenshot, a dialog window pops up
in which you can select the variables of the data that you want to process.
'Model Year' as
Responses. In any other circumstances, you can choose any
variables either to
or both. If you choose one or more variables as Group By
variables, the missing values are handled separately within each
corresponding group. If you wish to interpolate your data, you can
choose an underlying time scale via
'OK' to confirm your selection and the following
Before executing the script, we can select the method we would like
to apply to our missing values. For this purpose, we open the menu
'R Script' -> 'Script Variables'.
In the dialog that appears, we can choose the desired handling method:
"Omit Missing Values (omit)" (default),
"Last Observation Carried Forward (locf)",
"Next Observation Carried Backward (nocb)",
"Mean Values (mean)",
"Median Values (median)",
"Minimum Values (min)",
"Maximum Values (max)",
"Linear Interpolation (linpol)",
"Cubic Interpolation (cubicpol)", or compose a method
manually by selecting
"User Defined". Note that the mean,
the median, the minimum, the maximum and the interpolation functions can
only be computed for numerical variables. If you have set an auxiliary
variable, the interpolation uses it as the underlying time scale.
Otherwise, the interpolation assumes equidistant spacing. If we select
"User Defined", we must define an imputation method
(as a string) in the field
"User-Defined Imputation (values, methods or mathematical terms)".
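As a rough illustration of what the carry-forward, carry-backward and linear interpolation methods compute, here is a minimal Python sketch. It is not the R code that Cornerstone executes. The function names `locf`, `nocb` and `linpol` are only borrowed from the method identifiers, `None` stands for a missing entry, and leaving gaps at the edges of a column untouched is an assumption of this sketch.

```python
from typing import List, Optional, Sequence

Value = Optional[float]  # None stands for a missing entry


def locf(values: Sequence[Value]) -> List[Value]:
    """Last observation carried forward; leading gaps stay missing."""
    out: List[Value] = []
    last: Value = None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def nocb(values: Sequence[Value]) -> List[Value]:
    """Next observation carried backward; trailing gaps stay missing."""
    return locf(list(reversed(values)))[::-1]


def linpol(values: Sequence[Value],
           time: Optional[Sequence[float]] = None) -> List[Value]:
    """Linear interpolation between the nearest observed neighbours.

    Without a time scale, equidistant spacing (0, 1, 2, ...) is used.
    The time scale is assumed to be strictly increasing. Gaps before
    the first or after the last observation stay missing.
    """
    t = list(time) if time is not None else list(range(len(values)))
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (t[i] - t[left]) / (t[right] - t[left])
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


column = [None, 1.0, None, 3.0, None]
print(locf(column))    # [None, 1.0, 1.0, 3.0, 3.0]
print(nocb(column))    # [1.0, 1.0, 3.0, 3.0, None]
print(linpol([0.0, None, 4.0]))                  # [0.0, 2.0, 4.0]
print(linpol([0.0, None, 4.0], time=[0, 1, 4]))  # [0.0, 1.0, 4.0]
```

The last two calls show the effect of an auxiliary time scale: the same gap is filled with 2.0 under equidistant spacing, but with 1.0 when the missing observation lies closer in time to its left neighbour.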
The "User-Defined Imputation" field accepts the following kinds of entries:
fixed values, e.g. "MPG = 0" or "MPG = 0, Horsepower = 1";
methods, e.g. "MPG = omit, Horsepower = min" (the method identifiers here are “omit”, “locf”, “nocb”, “mean”, “median”, “min”, “max”);
mathematical terms, e.g. "MPG = 3+5, Horsepower = log(4)".
The conditions can also be mixed, e.g.
"MPG = mean, Horsepower = sqrt(2)". Note that the
Missing Values Handling function cannot (yet) impute the missing values
using values from other variables, e.g.
"Horsepower = Weight+5".
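To make the structure of such a user-defined string more concrete, the following Python sketch splits it into one rule per variable and applies fixed values and a few summary methods. It only illustrates the idea and is not Cornerstone's parser. The names `parse_spec`, `fill_value` and `impute` are hypothetical, mathematical terms such as "sqrt(2)" are deliberately not evaluated here, and "omit", "locf" and "nocb" are left out for brevity.

```python
import statistics
from typing import Dict, List, Optional, Sequence

# Summary methods supported by this sketch (a subset of the identifiers).
METHODS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def parse_spec(spec: str) -> Dict[str, str]:
    """Split 'MPG = mean, Horsepower = 1' into one rule per variable.

    Splitting at commas is a simplification: it would break terms
    that contain commas themselves.
    """
    rules = {}
    for part in spec.split(","):
        name, _, rule = part.partition("=")
        rules[name.strip()] = rule.strip()
    return rules


def fill_value(rule: str, observed: Sequence[float]) -> float:
    """Return the replacement value for a rule.

    A known method identifier is computed from the observed values;
    anything else must be a plain number (float() raises ValueError
    otherwise, e.g. for mathematical terms).
    """
    if rule in METHODS:
        return METHODS[rule](observed)
    return float(rule)


def impute(column: Sequence[Optional[float]], rule: str) -> List[float]:
    """Replace every missing entry (None) according to the rule."""
    observed = [v for v in column if v is not None]
    fill = fill_value(rule, observed)
    return [fill if v is None else v for v in column]


rules = parse_spec("MPG = mean, Horsepower = 1")
print(rules)                                      # {'MPG': 'mean', 'Horsepower': '1'}
print(impute([1.0, None, 3.0], rules["MPG"]))     # [1.0, 2.0, 3.0]
print(impute([5.0, None], rules["Horsepower"]))   # [5.0, 1.0]
```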
Via the field
"Missing Value Representation", we can
define additional representations of missing values, apart from the
“real” missing values (i.e. black points in Cornerstone data). Multiple
representations should be separated by commas; e.g., “0, N/A, [MISSING],
.” would replace all the entries “0”, “N/A”, “[MISSING]” and “.” in the
data. These entries are then also treated as missing values within the
Missing Values Handling function.
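The effect of such a representation list can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's implementation: the helper names `parse_representations` and `mark_missing` are made up, raw entries are assumed to be text, and `None` plays the role of a "real" missing value.

```python
from typing import List, Optional, Sequence, Set


def parse_representations(field: str) -> Set[str]:
    """Turn the comma-separated field content into a set of tokens."""
    return {token.strip() for token in field.split(",") if token.strip()}


def mark_missing(column: Sequence[Optional[str]],
                 representations: Set[str]) -> List[Optional[str]]:
    """Replace every entry that matches a representation by None."""
    return [
        None if v is None or v.strip() in representations else v
        for v in column
    ]


reps = parse_representations("0, N/A, [MISSING], .")
raw = ["18", "N/A", ".", None, "0", "25"]
print(mark_missing(raw, reps))  # ['18', None, None, None, None, '25']
```

Note that a representation such as “0” removes genuine zeros as well, so it should only be listed when zero cannot occur as a valid measurement.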
If the data contains empty columns (columns with missing entries
only), we can remove them by checking the option
"Remove Empty Columns If Available".
Here, we stick to the default values and click
Now click the Execute button (green arrow) on the left side of the
menu bar, or choose the menu
'R Script' ->
'Execute'; all calculations are then done via ‘R’. The
computations are finished when the status bar at the lower left shows
'Last execute error state: OK'.
Our results are available via the menu
shown in the following screenshot.
'Row Indices of Missing Values' opens a ‘Cornerstone’
data set, indicating the row numbers of the missing values in the input
dataset. The screenshot shows that 'MPG' and
"Horsepower" contain missing values and in which rows we can
find them. Via 'Output Dataset', we
obtain the complete cases of the input data set. 14 observations (rows)
have been removed, resulting in a table of 392 observations. Note that
the brush tool does not work if we have chosen to omit the missing
values, because the resulting data set has a different number of rows than the input.
The brush tool is applicable if we have chosen to impute the missing values.
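The two outputs described here, the row indices of the missing values and the complete cases, can be mimicked with a small Python sketch. The table below holds made-up toy values, not the 'carstats' data; the function names are hypothetical, and counting rows from 1 is an assumption.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]  # column name -> column values


def missing_row_indices(table: Table) -> Dict[str, List[int]]:
    """Map each column to the 1-based row numbers of its missing entries."""
    return {
        name: [row for row, v in enumerate(col, start=1) if v is None]
        for name, col in table.items()
    }


def omit_incomplete_rows(table: Table) -> Table:
    """Keep only the rows in which no column has a missing entry."""
    n_rows = len(next(iter(table.values())))
    keep = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in table.values())
    ]
    return {name: [col[i] for i in keep] for name, col in table.items()}


toy = {
    "MPG": [18.0, None, 16.0, 17.0],
    "Horsepower": [130.0, 165.0, None, 140.0],
}
print(missing_row_indices(toy))   # {'MPG': [2], 'Horsepower': [3]}
print(omit_incomplete_rows(toy))  # {'MPG': [18.0, 17.0], 'Horsepower': [130.0, 140.0]}
```

Because omitting shortens the table (here from four rows to two), row positions in the output no longer line up with the input, which is why a row-based link between the two data sets is lost.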
For numeric variables, we can, for example, replace the missing values of
a column by its column mean. For this, we go back to the R script and open
'R Script' -> 'Script Variables'.
We now select
'Mean Values (mean)' instead of
'Omit Missing Values (omit)' as Imputation Method.
After executing the R script and opening the output data set, we can see that all the missing values have been replaced with the mean per column. Therefore, we now obtain a complete brushable data set with 406 observations (rows).
Now, we want to replace the missing values
with the mean values grouped by the variable 'Origin'. We open the
'carstats' data set and select
'Missing Values Handling'. We choose
'Model Year' as
Responses. In addition, we choose 'Origin' as
Group By variable and click
In the R script window, we open
'R Script' ->
'Script Variables' and again set
'Mean Values (mean)' as Imputation Method. Click
'OK' and execute the R script via the green arrow. Open the
output data set in
We can now see that the missing values were replaced by the mean per
'Origin'. For example, the mean Horsepower of the
'MPG' (about 27.2) is higher than the mean
'MPG' of the cars from the
'USA' (about 20.1).
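Grouped mean imputation as described above can be sketched as follows. This is a minimal Python illustration with made-up numbers and placeholder group labels, not the R code behind the function; the name `impute_group_mean` is hypothetical, and leaving a value missing when its group has no observed values is an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Hashable, List, Optional, Sequence


def impute_group_mean(values: Sequence[Optional[float]],
                      groups: Sequence[Hashable]) -> List[Optional[float]]:
    """Replace each missing value by the mean of its own group."""
    observed = defaultdict(list)
    for v, g in zip(values, groups):
        if v is not None:
            observed[g].append(v)
    group_means = {g: mean(vs) for g, vs in observed.items()}
    # A group without any observed value keeps its gaps (None).
    return [
        group_means.get(g) if v is None else v
        for v, g in zip(values, groups)
    ]


values = [20.0, None, 30.0, None, 26.0]
groups = ["A", "A", "B", "B", "B"]
print(impute_group_mean(values, groups))  # [20.0, 20.0, 30.0, 28.0, 26.0]
```

In this toy example the gaps are filled with 20.0 and 28.0, the means of groups "A" and "B", whereas the mean over all observed values would be about 25.3.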
If we now compare this output dataset, we can see that these mean values
do not correspond to the total mean values of