Initial Situation and Goal

In many cases, data contains missing entries. For example, the variable 'MPG' from the dataset 'carstats' has eight missing values (\(\bullet\)), meaning that no 'MPG' was recorded for Model "citroen ds-21 pallas" from "France", "chevrolet chevelle concours (sw)", "ford torino (sw)", "plymouth satellite (sw)" and "amc rebel sst (sw)" from "USA", "volkswagen super beetle 117" from "Germany" and "saab 900s" from "Sweden". Further, the variable 'Horsepower' contains six missing values.

Carstats Data

Our goal here is to pre-process the observations (rows) of the data for further analyses by omitting or imputing the missing entries. How to impute the missing values meaningfully depends on the context of the data. The'Missing Values Handling' function eases the handling of missing entries by proposing some pre-defined methods. It is also possible to define the handling method for each variable (column) individually.

Cornerstone Workflow

How do we access the Missing Values Handling function from Cornerstone?

First of all, open a dataset which contains missing values, e.g. 'carstats', and choose the menu 'Analyses' -> 'CornerstoneR' -> 'Missing Values Handling'.

As shown in the following screenshot, a dialog window will pop-up which enables to select the variables from the data you want to access. We choose 'Model', 'Origin', 'MPG' and 'Cylinders' as 'Predictors', and 'Displacement', 'Horsepower', 'Weight', 'Acceleration' and 'Model Year' as Responses. In any other circumstances, you can choose any variables either to Predictors or Responses, or both. If you choose one or more variables as Group by variables, the missing values handling will be considered by its corresponding group. If you wish to interpolate your data, you can choose an underlying time scale via Auxiliaries.

Missing Values Handling: Variable Selection

Click 'OK' to confirm your selection and the following window appears.

Missing Values Handling: R Script

Before executing the script, we can select the method we would like to apply to our missing values. For this purpose, we open the menu 'R Script' -> 'Script Variables'.

Missing Values Handling: R Script Variables Menu

In the appearing dialog, we can choose the desired handling method out of "Omit Missing Values (omit)" (default), "Last Observation Carried Forward (locf)", "Next Observation Carried Backward (nocb)", "Mean Values (mean)", "Median Values (median)", "Minimum Values (min)", "Maximum Values (max)", "Linear Interpolation (linpol)", "Cubic Interpolation (cubicpol)", or compose a method manually by selecting "User Defined". Note that the mean, the median, the minimum, the maximum and the interpolation functions can only be computed for numerical variables. If you have set an auxiliary variable, the interpolation will consider this as underlying time scale. Otherwise the interpolation will use an equidistant spacing. If we choose "User Defined", we must define an imputation method (as string) in the field "User-Defined Imputation (values, methods or mathematical terms)". Here, we have the following possibilities:

  • a single value to replace all the missing values with, e.g. "0"
  • a value for one or more specific column containing missing values, e.g. "MPG = 0" or "MPG = 0, Horsepower = 1"
  • a method from the pre-defined functions, e.g. "MPG = omit, Horsepower = min" (the method identifiers here are “omit”, “locf”, “nocb”, “mean”, “median”, “min”, “max”)
  • a mathematical term which can be evaluated, e.g. "MPG = 3+5, Horsepower = log(4)"

The conditions can also be mixed, e.g. "MPG = mean, Horsepower = sqrt(2)". Note that the Missing Values Handling function cannot (yet) impute the missing values using values from other variables, e.g. "Horsepower = Weight+5".

Via the field "Missing Value Representation", we can define new representation(s) for the missing values, apart from the “real” missing values (i.e. black points in Cornerstone data). Multiple representations should be separated by comma e.g., “0, N/A, [MISSING], .” would replace all the entries “0”, “N/A”, “[MISSING]” and “.” in the data. These entries will also be considered as missing values within the Missing Values Handling function.

If the data contains empty columns (columns with missing entries only) and we want to delete these, we can remove them by checking the box "Remove Empty Columns If Available".

Here, we stick to the default values and click 'OK'.

Missing Values Handling: Script Variables

Now click the Execute button (green arrow) on the left side of the menu bar, or choose the menu 'R Script' -> 'Execute' and all calculations are done via ‘R’. The computations are done if the text at the lower left status bar contains 'Last execute error state: OK'.

Missing Values Handling: Execute Green Arrow

Our results are available via the menu 'Summaries' as shown in the following screenshot.

Missing Values Handling: Summaries Menu

Row Indices of Missing Entries

'Row Indices of Missing Values' opens a ‘Cornerstone’ data set, indicating the row numbers of the missing values in the input dataset. The screenshot shows that "MPG" and "Horsepower" contain missing values and in which row we can find them.

Missing Values Handling: Row Indices of Missing Values

Output Data

Via 'Summaries' -> 'Output Dataset', we obtain the complete cases of the input data set. 14 observations (rows) have been removed, resulting in a table of 392 observations. Note that the brush tool does not work if we have chosen to omit the missing values because the resulting data set shows a different number of rows. The brush tool is applicable if we have chosen to impute the missing entries.

Missing Values Handling: Table with Omitted Missing Values

How do we replace Missing Values using a pre-defined imputation method?

Example: (Total) Mean Values

For numeric variables, we can for example impute the mean values by its column mean values. For this, we go back to the R script and open the menu 'R Script' -> 'Script Variables'. We now select 'Mean Values (mean)' instead of 'Omit Missing Values (omit)' as Imputation Method.

Missing Values Handling: Script Variables - Mean Values

After executing the R script and opening the output data set, we can see that all the missing values have been replaced with the mean per column. Therefore, we now obtain a complete brushable data set with 406 observations (rows).

Missing Values Handling: Output Table with Mean Values

Example: Mean Values by ‘Group By’ Variable

Now, we want to replace the missing values in 'carstats' with its mean values grouped by the variable 'Origin'. We re-open the 'carstats' data set and select 'Analyses' -> 'CornerstoneR' -> 'Missing Values Handling'. We choose 'Model', 'MPG' and 'Cylinders' as 'Predictors', and 'Displacement', 'Horsepower', 'Weight', 'Acceleration' and 'Model Year' as Responses. In addition, we choose 'Origin' as Group By variable and click 'OK'.

Missing Values Handling: Script Variables with Group By

In the R script window, we open 'R Script' -> 'Script Variables' and set again 'Mean Values (mean)' as Imputation Method. Click 'OK' and Execute the R script via the green arrow. Open the output data set in 'Summaries'.

Missing Values Handling: Output with Group By

We can now see that the missing values were replaced by the mean per value in 'Origin'. For example, the mean Horsepower of the cars from 'MPG' (about 27.2) is higher than the mean 'MPG' of the cars from the 'USA' (about 20.1). If we now compare this output dataset, we can see that these mean values do not correspond the total mean values of 'MPG' or 'Horsepower'.