vignettes/reshape.Rmd
reshape.Rmd
Reshape grouped data to wide format works on tables in a long format
with at least one predictor column, one response column, and one
grouping column. An example is the 'Indometh'
dataset with
its first six rows shown in the following table.
Subject | time | conc | |
---|---|---|---|
1 | 1 | 0.25 | 1.50 |
2 | 1 | 0.50 | 0.94 |
3 | 1 | 0.75 | 0.78 |
34 | 4 | 0.25 | 1.85 |
35 | 4 | 0.50 | 1.39 |
36 | 4 | 0.75 | 1.02 |
Our target is to reshape this table into a dataset with a row for
each 'Subject'
and columns for every 'time'
entity. The data from the column 'conc'
should be available
as data. Hence, the result should look like the following table (first 5
columns).
time | conc_first_1 | conc_first_4 | conc_first_2 | conc_first_5 |
---|---|---|---|---|
0.25 | 1.50 | 1.85 | 2.03 | 2.05 |
0.50 | 0.94 | 1.39 | 1.63 | 1.04 |
0.75 | 0.78 | 1.02 | 0.71 | 0.81 |
1.00 | 0.48 | 0.89 | 0.70 | 0.39 |
1.25 | 0.37 | 0.59 | 0.64 | 0.30 |
2.00 | 0.19 | 0.40 | 0.36 | 0.23 |
3.00 | 0.12 | 0.16 | 0.32 | 0.13 |
4.00 | 0.11 | 0.11 | 0.20 | 0.11 |
5.00 | 0.08 | 0.10 | 0.25 | 0.08 |
6.00 | 0.07 | 0.07 | 0.12 | 0.10 |
8.00 | 0.05 | 0.07 | 0.08 | 0.06 |
To achieve this result in 'Cornerstone'
open a dataset
and choose menu 'Analysis'
\(\rightarrow\) 'CornerstoneR'
\(\rightarrow\)
'Reshape to Wide'
as shown in the following screenshot.
Reshape to Wide: Menu
In the appearing dialog select variable 'time'
to
predictors. The unique values are shown in a new column. The response
variable is 'conc'
. We want the data allocated in multiple
columns with respect to the grouping variable 'Subject'
because the response is available for every group.
Reshape to Wide: Variable Selection
'OK'
confirms your selection and the following window
appears.
Reshape to Wide: R Script
Now, click the execute button (green arrow) or choose the menu
'R Script'
\(\rightarrow\)
'Execute'
and all calculations are done via
'R'
. Calculations are done if the text at the lower left
status bar contains 'Last execute error state: OK'
. Our
result is available via the menu 'Summaries'
\(\rightarrow\) 'Wide Data'
as
shown in the following screenshot.
Reshape to Wide: Result Menu
After clicking this menu a 'Cornerstone'
dataset with
the reshaped data opens.
Reshape to Wide: Result Dataset
Reshape to Wide can be parametrized by the
'Script Variables'
dialog via the menu
'R Script'
\(\rightarrow\)
'Script variables'
. The following dialog appears.
Reshape to Wide: Script Variables
'Drop missing combinations'
This checkbox allows the algorithm to drops rows with solely missing values. The default is checked.
'Simplify column names'
This checkbox allows the new column names to have the values in the grouping variable. This routine can only work if there is only one Response variable and only one aggregation function. The default is not checked.
'Aggregation'
The text box can hold a comma separated list of aggregation function.
The default is empty, which is equivalent to 'first'
. If
you select variables as 'Auxiliaries'
in the variable
selection menu they are shown in the drop-down box right from the
'<<'
button.
Let’s take a look at the following dataset example:
Subject | time | conc | repet |
---|---|---|---|
1 | 0.25 | 1.50 | 1 |
1 | 0.25 | 1.67 | 2 |
1 | 0.25 | 1.58 | 3 |
1 | 0.50 | 0.94 | 1 |
1 | 0.50 | 0.89 | 2 |
1 | 0.50 | 0.72 | 3 |
The aggregation function is used if one response data is not
identifiable by the combination of predictors and groups. In case
Subject 1 for time 0.25 was measured three times. We obtain the
following result using the aggregation function 'first'
,
'mean'
, and 'sd'
.`
time | conc_first_1 | conc_mean_1 | conc_sd_1 |
---|---|---|---|
0.25 | 1.50 | 1.583333 | 0.0850490 |
0.50 | 0.94 | 0.850000 | 0.1153256 |
As you can see new columns are added for each additional aggregation
function, in this example 'first, mean, sd'
. Function
'first'
returns the first value of the three
'conc'
measurements 1.50, 1.67, and 1.58. Followed by its
mean value 'mean'
and standard deviation
'sd'
.
How is an auxiliary variable used in this case? In case of repeated measurements it can bee necessary to select the last or first repetition.
From the example above we would like to get the data for the highest
repetition value and add 'maxby(repet)'
to the aggregation
function. As opposed to this 'minby(repet)'
returns the
'conc'
value for the lowest repetition value. The following
table shows the corresponding results applying both functions.
time | conc_maxby_1 | conc_minby_1 |
---|---|---|
0.25 | 1.58 | 1.50 |
0.50 | 0.72 | 0.94 |
In general it is possible to use all available 'R'
functions. Each function should return one value like the
'mean'
or 'sd'
function. Function which return
more than one value, e.g. 'head'
, can work but it is not
defined which value is shown in Cornerstone.
Reshape grouped data to long format works on tables in wide format with at least one predictor and one response column. An example is shown in the following table.
Place | Temp.Jan | Temp.July |
---|---|---|
Europe | 0 | 20 |
USA | -10 | 15 |
Asia | 20 | 50 |
Our target is to reshape this dataset into a dataset with the
temperatures 'Temp.Jan'
and 'Temp.July'
as
responses in one column and an additional column to identify the month.
All predictors records should be multiplied like shown in the following
table to identify each data correctly.
Place | variable | value |
---|---|---|
Europe | Jan | 0 |
USA | Jan | -10 |
Asia | Jan | 20 |
Europe | July | 20 |
USA | July | 15 |
Asia | July | 50 |
The predictor 'time'
is available in one column because
it belongs to all responses. Selected responses are split to multiple
columns depending on the value in grouping column.
To achieve this result in 'Cornerstone'
open a
corresponding dataset, e.g. sample dataset 'cities'
, and
choose menu 'Analysis'
\(\rightarrow\) 'CornerstoneR'
\(\rightarrow\)
'Reshape to Long'
as shown in the following screenshot.
Reshape to Long: Menu
In the appearing dialog select 'city'
as a predictor and
both temperatures 'JanTempF'
and 'JulyTempF'
as responses. We want to stack all temperatures with respect to each
city.
Reshape to Long: Variable Selection
'OK'
confirms your selection and the following window
appears.
Reshape to Long: R Script
Before we start to reshape our data, it is necessary to check the
string by which each response variable name is split. Choose menu
'Options'
\(\rightarrow\)
'Script Variables'
as shown in the following
screenshot.
Reshape to Long: R Script Variables Menu
The appearing dialog shows by default an underscore
('_'
). We want to split the variable name at the word
'Temp'
because it is redundant and change the text box to
'Temp'
as shown in the next screenshot.
Reshape to Long: R Script Variables
If you don’t want to split the variable name at all remove all
characters from the split text box. 'OK'
confirms settings
to the script variables.
Now, click the execute button (green arrow) or choose the
corresponding menu 'R Script'
\(\rightarrow\) 'Execute'
and
all calculations are done via 'R'
. The text in the lower
left status bar turns to 'Last execute error state: OK'
to
identify a successful termination. Our result is available via menu
'Summaries'
\(\rightarrow\) 'Long Data'
as
shown in the following screenshot.
Reshape to Long: Result Menu
After clicking a 'Cornerstone'
dataset with the reshaped
data opens.
Reshape to Long: Result Dataset
Selected predictors are multiplied in their column. All response
names are split by 'Temp'
and available in two new columns
'variable1'
and 'variable2'
. The corresponding
value is stored in column 'value'
.
If data are available in rows, it is necessary to transpose the data
set before further evaluation. An example is
'Summary Statistic'
table from Cornerstone without groups
as shown in the following table.
Statistic | MPG | Displacement | Horsepower | Weight | Acceleration |
---|---|---|---|---|---|
Count | 398.00 | 406.00 | 400.00 | 406.00 | 406.00 |
Mean | 23.52 | 194.78 | 105.08 | 2979.41 | 15.52 |
St. Dev. | 7.82 | 104.92 | 38.77 | 847.00 | 2.80 |
To use these data in further analysis or graphs, it is necessary to transpose the data so that they are available as shown in the following table.
colnames | Count | Mean | StDev |
---|---|---|---|
MPG | 398 | 23.52 | 7.82 |
Displacement | 406 | 194.78 | 104.92 |
Horsepower | 400 | 105.08 | 38.77 |
Weight | 406 | 2979.41 | 847.00 |
Acceleration | 406 | 15.52 | 2.80 |
If there are missing names in the header column, these will be assigned with Variable_1, Variable_2, etc. If there are duplicated names in the header, the duplicated names will be numbered accordingly.
To achieve this result in 'Cornerstone'
, select the
'Analysis'
\(\rightarrow\)
'CornerstoneR'
\(\rightarrow\) 'Transpose Data'
menu from the corresponding record, as shown in the following
screenshot.
Transpose Data: Menu
In the appearing dialog select the variables 'MPG'
to
'Acceleration'
as predictors or responses. For this
function the division is irrelevant, because all selected variables are
transposed. The variable 'Statistics'
contains names for
the future columns and is optionally selected as
'Group by'
.
Transpose Data: Variable Selection
'OK'
confirms your selection and the following window
appears.
Transpose Data: R Script
Now click the green button (green arrow) or select the menu
'R Script'
\(\rightarrow\)
'Execute'
to have all calculations performed by
'R'
. The execution was successful, if the status bar in the
lower left corner shows the text
'Last execute error state: OK'
. Our result is available via
the menu 'Summaries'
\(\rightarrow\)
'Transposed Data'
as shown in the following screenshot.
Transpose Data: Result Menu
After clicking this menu, the dataset with the transposed data opens.
Transpose Data: Result Dataset
The 'Transpose Data'
function converts all data into
numeric values. If the conversion is not successful, missing values are
inserted. This conversion can be deactivated with the
'Script Variables'
dialog via the 'R Script'
\(\rightarrow\)
'Script Variables'
menu. The following dialog appears.
Transpose Data: Script Variables
If you uncheck 'Convert data to numeric'
, for example,
text is retained and is not replaced by a missing value.