Function for the automatic handling of missing values.
missingValuesHandling(
  dataset = cs.in.dataset(),
  preds = cs.in.predictors(),
  resps = cs.in.responses(),
  groups = cs.in.groupvars(),
  auxs = cs.in.auxiliaries(),
  scriptvars = cs.in.scriptvars(),
  return.results = FALSE
)
dataset [data.frame]
    Dataset with named columns. The names correspond to predictors and responses.
preds [character]
    Character vector of predictor variables.
resps [character]
    Character vector of response variables.
groups [character]
    Character vector of group variables.
auxs [character]
    Character vector of auxiliary variables.
scriptvars [list]
    Named list of script variables set via the Cornerstone "Script Variables" menu.
    For details see below.
return.results [logical(1)]
    If FALSE, the function returns TRUE invisibly.
    If TRUE, it returns a list of results.
    Default is FALSE.
Logical [TRUE] invisibly, with output sent to Cornerstone, or,
if return.results = TRUE, a list of the resulting data.table objects:
rowInds
    Data table indicating which columns contain missing values in which rows.
outDataset
    Output data table with the changes applied to the missing entries.
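
As a rough illustration of the first table, the following base R sketch collects the row indices of missing entries per column and pads the shorter columns with NA, which matches the shape of the rowInds output in the example further down. The toy data and the names na_rows and row_inds are assumptions made for illustration; this is not the function's actual implementation.

```r
# Toy data with missing values in two of three columns (hypothetical values).
df <- data.frame(
  MPG        = c(18, NA, 16, NA, 17),
  Horsepower = c(130, 165, NA, 150, 140),
  Weight     = c(3504, 3693, 3436, 3433, 3449)
)

# Row indices of the missing entries, one integer vector per column.
na_rows <- lapply(df, function(col) which(is.na(col)))

# Keep only the columns that actually contain missing values.
na_rows <- na_rows[lengths(na_rows) > 0]

# Pad the shorter index vectors with NA so that all columns have equal length.
n_max <- max(lengths(na_rows))
row_inds <- as.data.frame(lapply(na_rows, function(idx) idx[seq_len(n_max)]))

print(row_inds)
```

Columns without missing values (Weight here) do not appear in the result, just as only MPG and Horsepower appear in the example output.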
The following script variables are collected in the scriptvars list:
math.fun [character(1)]
    Function selection for missing value handling in data.
    Choose one of the predefined methods
    Omit Missing Values (omit),
    Last Observation Carried Forward (locf),
    Next Observation Carried Backward (nocb),
    Mean Values (mean), Median Values (median),
    Minimum Values (min), Maximum Values (max),
    Linear Interpolation (linpol),
    Cubic Interpolation (cubicpol),
    or compose a method manually by selecting User Defined.
    If one or more group-by variables were passed, the method is applied
    within each group. Note that brushing is not possible if
    Omit Missing Values (omit) was selected, since the output dataset
    will have fewer rows than the original one. If you select an
    interpolation method, you can choose an underlying time scale via
    auxiliaries.
    Default is Omit Missing Values (omit).
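
To make the predefined methods more concrete, here is a minimal base R sketch of locf, nocb, mean, and linpol applied by group. The toy data and the helper names locf, nocb, and fill_mean are hypothetical, and the function itself may treat edge cases (such as leading or trailing NAs) differently.

```r
# Toy series with gaps, two groups, and a time scale (all values hypothetical).
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  time  = c(1, 2, 4, 1, 2, 3),
  x     = c(10, NA, 40, NA, 5, 7)
)

# Last observation carried forward.
# Leading NAs stay NA because there is nothing to carry forward.
locf <- function(v) {
  idx <- cumsum(!is.na(v))          # number of observed values seen so far
  c(NA, v[!is.na(v)])[idx + 1]
}

# Next observation carried backward: locf on the reversed vector.
nocb <- function(v) rev(locf(rev(v)))

# Replace NAs by the mean of the observed values.
fill_mean <- function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))

# Apply each method within the groups using ave().
df$x_locf <- ave(df$x, df$group, FUN = locf)
df$x_nocb <- ave(df$x, df$group, FUN = nocb)
df$x_mean <- ave(df$x, df$group, FUN = fill_mean)

# Linear interpolation along the time scale, group by group.
# approx() only interpolates between observed points, so edge NAs remain.
df$x_linpol <- NA_real_
for (g in unique(df$group)) {
  in_group <- df$group == g
  observed <- in_group & !is.na(df$x)
  df$x_linpol[in_group] <- approx(df$time[observed], df$x[observed],
                                  xout = df$time[in_group])$y
}

print(df)
```

In group A the gap at time 2 lies one third of the way between times 1 and 4, so linear interpolation on the time scale yields 20, whereas the group mean yields 25.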
input.values [character(1)]
    If User Defined is selected, one or multiple input values or
    formulas must be specified.
    This can be:
    - a single value to replace all NAs with, e.g. "0",
    - a value for one or more specific columns containing NAs, e.g.
      "MPG = 0, Horsepower = 1",
    - one of the predefined methods, e.g. "MPG = omit, Horsepower = min"
      (the identifiers here are "omit", "locf", "nocb", "mean", "median",
      "min", "max"),
    - a mathematical formula which can be evaluated, e.g. log(4), 3+5 etc.
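
The per-column form can be pictured with the following base R sketch, which parses a string such as "MPG = 0, Horsepower = log(4)" and fills the NAs of each named column with the evaluated value. How the function parses its input internally is not documented here, so the parsing steps and all variable names below are assumptions; method identifiers such as "omit" or "min" are not handled in this sketch.

```r
# Hypothetical user input in the per-column style described above.
input_values <- "MPG = 0, Horsepower = log(4)"

# Split into "column = expression" pairs at the commas.
# Note: this simple split would break formulas that contain commas.
pairs <- trimws(strsplit(input_values, ",", fixed = TRUE)[[1]])

# Split each pair at the first "=" into column name and expression text.
cols  <- trimws(sub("=.*$", "", pairs))
exprs <- trimws(sub("^[^=]*=", "", pairs))

# Evaluate each expression to a single number. eval(parse()) is used here
# only because the input of this toy example is trusted.
values <- vapply(exprs, function(e) eval(parse(text = e)), numeric(1))
names(values) <- cols

# Toy data (hypothetical values) with one NA per column.
df <- data.frame(MPG = c(18, NA, 16), Horsepower = c(NA, 165, 150))

# Replace the NAs of each named column by its value.
for (col in names(values)) {
  df[[col]][is.na(df[[col]])] <- values[[col]]
}

print(df)
```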
na.representation [character(1)]
    The NA representation(s) used in your data in addition to NA
    (represented as a black point in Cornerstone). Separate multiple
    NA representations by commas, e.g. "N/A, MISSING, .".
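
A minimal base R sketch of this step, assuming the string is split at commas and surrounding whitespace is trimmed (the function's exact parsing rules are not stated here, and the data and variable names are hypothetical):

```r
# Hypothetical script variable value, as in the example above.
na_representation <- "N/A, MISSING, ."

# Split at commas and trim surrounding whitespace.
na_strings <- trimws(strsplit(na_representation, ",", fixed = TRUE)[[1]])

# Toy data read as text, with three different NA representations.
df <- data.frame(
  MPG    = c("18", "N/A", "16", "."),
  Origin = c("USA", "MISSING", "Japan", "USA"),
  stringsAsFactors = FALSE
)

# Replace every listed representation by a real NA, column by column.
df[] <- lapply(df, function(col) {
  col[col %in% na_strings] <- NA
  col
})

# Columns that hold only numbers after cleaning can then be converted.
df$MPG <- as.numeric(df$MPG)

print(df)
```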
min.complete [numeric(1)]
    A value between 0 and 1 giving the minimal proportion of complete
    cases a variable needs in order to be kept, e.g. 0.2 keeps only
    columns where at least 20% of the entries are not missing.
    0 would keep all columns (default).
    1 (100%) keeps only columns without missing values.
    To remove all columns which contain solely missing values, choose a
    number near 0 or, more accurately, 1 divided by the number of data
    rows.
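
The threshold can be sketched in base R as a comparison of each column's share of non-missing entries against min.complete. The helper kept_columns and the toy data are hypothetical, and the sketch assumes that "at least" means a greater-than-or-equal comparison.

```r
# Toy data: the share of non-missing entries is 1.0, 0.4, and 0.0.
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, 2, NA, NA, NA),
  c = rep(NA_real_, 5)
)

# Hypothetical helper: names of the columns whose share of non-missing
# entries is at least min_complete.
kept_columns <- function(data, min_complete) {
  share <- colMeans(!is.na(data))
  names(data)[share >= min_complete]
}

kept_columns(df, 0)              # keeps every column
kept_columns(df, 0.5)            # keeps only "a"
kept_columns(df, 1 / nrow(df))   # drops only the all-NA column "c"
```

The last call shows why 1 divided by the number of rows is the accurate choice: a column with a single observed value has exactly that share and is still kept, while a column with no observed values falls below it.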
data(carstats)
summary(carstats)
#>     Model              Origin          MPG         Cylinders  Displacement
#>  Length:406         England:  1   Min.   : 9.00   3:  4     Min.   : 68.0
#>  Class :character   France : 14   1st Qu.:17.50   4:207     1st Qu.:105.0
#>  Mode  :character   Germany: 39   Median :23.00   5:  3     Median :151.0
#>                     Italy  :  8   Mean   :23.51   6: 84     Mean   :194.8
#>                     Japan  : 79   3rd Qu.:29.00   8:108     3rd Qu.:302.0
#>                     Sweden : 11   Max.   :46.60             Max.   :455.0
#>                     USA    :254   NA's   :8
#>    Horsepower         Weight      Acceleration     Model.Year
#>  Min.   : 46.00   Min.   :1613   Min.   : 8.00   Min.   :70.00
#>  1st Qu.: 75.75   1st Qu.:2226   1st Qu.:13.70   1st Qu.:73.00
#>  Median : 95.00   Median :2822   Median :15.50   Median :76.00
#>  Mean   :105.08   Mean   :2979   Mean   :15.52   Mean   :75.92
#>  3rd Qu.:130.00   3rd Qu.:3618   3rd Qu.:17.18   3rd Qu.:79.00
#>  Max.   :230.00   Max.   :5140   Max.   :24.80   Max.   :82.00
#>  NA's   :6
# The carstats data set contains missing values in two columns
# (MPG and Horsepower).
missingValuesHandling(
  carstats,
  preds = "Horsepower",
  resps = c("Model", "MPG", "Cylinders", "Displacement", "Weight",
            "Acceleration", "Model.Year"),
  groups = "Origin", auxs = character(),
  scriptvars = list(math.fun = "Mean Values (mean)", input.values = "",
                    na.representation = "", min.complete = 0.5),
  return.results = TRUE
)
#> $rowInds
#>      MPG Horsepower
#>    <int>      <int>
#> 1:    11         39
#> 2:    12        134
#> 3:    13        338
#> 4:    14        344
#> 5:    15        362
#> 6:    18        383
#> 7:    40         NA
#> 8:   368         NA
#>
#> $outDataset
#>                          Model  Origin   MPG Cylinders Displacement Horsepower
#>                         <char>  <char> <num>    <char>        <num>      <num>
#>   1: chevrolet chevelle malibu     USA    18         8          307        130
#>   2:         buick skylark 320     USA    15         8          350        165
#>   3:        plymouth satellite     USA    18         8          318        150
#>   4:             amc rebel sst     USA    16         8          304        150
#>   5:               ford torino     USA    17         8          302        140
#>  ---
#> 402:           ford mustang gl     USA    27         4          140         86
#> 403:                 vw pickup Germany    44         4           97         52
#> 404:             dodge rampage     USA    32         4          135         84
#> 405:               ford ranger     USA    28         4          120         79
#> 406:                chevy s-10     USA    31         4          119         82
#>      Weight Acceleration Model.Year
#>       <num>        <num>      <num>
#>   1:   3504         12.0         70
#>   2:   3693         11.5         70
#>   3:   3436         11.0         70
#>   4:   3433         12.0         70
#>   5:   3449         10.5         70
#>  ---
#> 402:   2790         15.6         82
#> 403:   2130         24.6         82
#> 404:   2295         11.6         82
#> 405:   2625         18.6         82
#> 406:   2720         19.4         82
#>