vignettes/correlationAnalysis.Rmd
In many use cases, we want to reduce a large data set based on the correlations between its variables and also reorder the variables. Both tasks require a correlation analysis.
There are three different situations:
- numerical vs. numerical variables
- categorical vs. categorical variables
- numerical vs. categorical variables

In the following, we work through all three.
In this example, we start with the 'Cornerstone' sample dataset 'solar_panel_production'. The dataset contains 123 variables and 7333 observations in total and includes both categorical and numerical data.
- Categorical variables: "Shift", "Chamber", "Block", etc.
- Numerical variables: "Temperature", "Pressure", "Effic.%", etc.

Choose the menu 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'Correlation Analysis' as shown in the following screenshot.
(To reorder the data, put all variables in predictors so that the resulting correlation matrix is symmetrical.)
In the dialog that appears, select 'I_sc', 'V_oc', and all 'Temperature_xAve' variables as predictors, and leave responses blank. 'OK' confirms your selection and the following window appears.
You can customize the calculation through different settings. To do this, open the menu 'R Script' \(\rightarrow\) 'Script Variables'. The screenshot shows the default options.
The following options are available:
- Correlation method: 'pearson' or 'spearman'
- Order of the variables:
  - 'FPC': first principal component
  - 'AOE': angular order of eigenvectors
  - 'Alphabet': alphabetical order
  - 'Original': original order from the data set
- 'Confidence Level': '0.9' ~ '0.99'
- 'Remove Insignificant Correlations': 'yes' or 'no' (tick box)

We will now use the default settings. Close this dialog with 'OK' and click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute'; all calculations are then done via 'R'. The calculations are finished when the status bar at the lower left shows 'Last execute error state: OK'. Our results are available via the menu 'Summaries' as shown in the following screenshot.
The menu 'Correlation Matrix' displays all the correlations between the variables.
The menu 'Sorted Correlation List' displays the pairwise correlations with their p-values.
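As a rough illustration of what such a sorted list contains, the following base R sketch computes pairwise correlations and their p-values with `cor.test`. The data frame `df`, its column names, and the helper `pairwise_cor` are made up for this example; this is not CornerstoneR's internal code.

```r
# Hypothetical numerical data; the column names are illustrative only.
set.seed(1)
n <- 200
x <- rnorm(n)
df <- data.frame(a = x, b = x + rnorm(n, sd = 0.5), c = rnorm(n))

# Pairwise correlations with p-values, sorted by absolute correlation.
# 'pairwise_cor' is a hypothetical helper, not a CornerstoneR function.
pairwise_cor <- function(data, method = c("pearson", "spearman")) {
  method <- match.arg(method)
  pairs <- t(combn(names(data), 2))   # one row per pair of variables
  res <- data.frame(var1 = pairs[, 1], var2 = pairs[, 2],
                    correlation = NA_real_, p_value = NA_real_)
  for (i in seq_len(nrow(pairs))) {
    ct <- cor.test(data[[pairs[i, 1]]], data[[pairs[i, 2]]], method = method)
    res$correlation[i] <- unname(ct$estimate)
    res$p_value[i] <- ct$p.value
  }
  res[order(-abs(res$correlation)), ]
}

sorted_list <- pairwise_cor(df, "pearson")
print(sorted_list)
```

Passing `"spearman"` instead of `"pearson"` switches to rank correlation, mirroring the two methods offered in the script variables.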
To get a better overview of the correlations, we can look at the correlation heatmap by selecting 'Graphs' \(\rightarrow\) 'Correlation Heatmap' in the R Script menu. The following window with the requested graph appears. Since the matrix is symmetrical, it is sufficient to show the lower triangular matrix.
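The ordering options and the lower-triangle display can be sketched in base R. The names 'FPC' and 'AOE' also appear as ordering options in the R package corrplot; the sketch below assumes the same definitions apply here, which the text does not confirm. The data and the helper `order_vars` are hypothetical.

```r
# Hypothetical data: v1, v3 and v4 share a common signal, v2 is independent.
set.seed(2)
n <- 300
z <- rnorm(n)
df <- data.frame(v1 = z + rnorm(n), v2 = rnorm(n),
                 v3 = z + rnorm(n), v4 = -z + rnorm(n))
corr <- cor(df)

# Return a permutation of the variables for a symmetric correlation matrix.
# Assumption: FPC and AOE are defined via the first two eigenvectors.
order_vars <- function(corr, how = c("Original", "Alphabet", "FPC", "AOE")) {
  how <- match.arg(how)
  ev <- eigen(corr, symmetric = TRUE)$vectors
  e1 <- ev[, 1]
  e2 <- ev[, 2]
  switch(how,
    Original = seq_len(ncol(corr)),
    Alphabet = order(colnames(corr)),
    FPC = order(e1, decreasing = TRUE),
    AOE = {
      # Angle of each variable in the plane of the first two eigenvectors.
      alpha <- ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)
      order(alpha)
    })
}

idx <- order_vars(corr, "AOE")
reordered <- corr[idx, idx]

# A symmetric matrix is fully described by its lower triangle.
lower <- reordered
lower[upper.tri(lower)] <- NA
print(round(lower, 2))
```

Reordering rows and columns with the same permutation keeps the matrix symmetric, which is why reordering is only possible when all variables are placed in predictors.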
For categorical variables we use Cramér's V, which is the square root of the chi-squared statistic divided by the product of the sample size and the minimum table dimension minus 1: \[ V = \sqrt{\frac{\chi^2}{n \, (\min(r, c) - 1)}} \] Here \(n\) is the sample size and \(r\) and \(c\) are the numbers of rows and columns of the contingency table. The confidence interval for Cramér's V is estimated using the bootstrap technique. Please note that this method is not reliable when Cramér's V is close to 0 or 1, or when the sample is small.
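A minimal base R sketch of this computation follows. It assumes a simple percentile bootstrap; the exact bootstrap variant used by the package is not stated here. The data and the helpers `cramers_v` and `boot_ci` are hypothetical.

```r
# Hypothetical categorical data: 'chamber' follows 'shift' about 60% of the time.
set.seed(3)
n <- 300
shift <- sample(c("early", "late", "night"), n, replace = TRUE)
linked <- unname(c(early = "C1", late = "C2", night = "C3")[shift])
chamber <- ifelse(runif(n) < 0.6, linked,
                  sample(c("C1", "C2", "C3"), n, replace = TRUE))

# Cramér's V from the chi-squared statistic of the contingency table.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(unname(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}

# Percentile bootstrap interval (an assumption, see the text above).
boot_ci <- function(x, y, level = 0.95, B = 500) {
  m <- length(x)
  v <- replicate(B, {
    i <- sample.int(m, replace = TRUE)   # resample rows with replacement
    cramers_v(x[i], y[i])
  })
  quantile(v, c((1 - level) / 2, 1 - (1 - level) / 2), names = FALSE)
}

v <- cramers_v(shift, chamber)
ci <- boot_ci(shift, chamber, level = 0.95)
print(v)
print(ci)
```

Because V is bounded by 0 and 1, resampled values pile up against those bounds when the true value is near them, which is one way to see why the interval becomes unreliable there.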
Open the dataset and variable selection again; this time, choose the variables as shown in the following picture.
Now let's customize the settings via the menu 'R Script' \(\rightarrow\) 'Script Variables'.
Due to the different dimensions of predictors and responses, we are not able to reorder the data. Thus, the order from the original data set is kept. In this case, the only options we can change are 'Confidence Level' and 'Remove Insignificant Correlations'. Set the script variables as shown in the screenshot.
Now close this dialog with 'OK' and click the execute button (green arrow). Open the 'Correlation Matrix' and 'Sorted Correlation List' under the 'Summaries' menu. The result looks like the following screenshot. The missing values indicate that the correlations between some variables are not significant.
Open the 'Correlation Heatmap' in the 'Graphs' menu. For the asymmetrical matrix, the graph looks slightly different from the symmetrical case.
(Put categorical variables in predictors and numerical variables in responses.)
To compute the correlation between numerical and categorical variables, we use linear regression. In rare situations the adjusted \(R^2\) can be negative, so the correlation is computed by the following formula: \[ \text{correlation} = \operatorname{sign}(R^2_{\text{adj}}) \times \sqrt{|R^2_{\text{adj}}|} \]
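The formula can be sketched in base R by regressing the numerical variable on the categorical one with `lm` and taking the signed square root of the adjusted \(R^2\). The helper `signed_r` and the data are hypothetical; how the package sets up the regression internally is an assumption.

```r
# Signed square root of the adjusted R^2 from a one-factor linear model.
# 'signed_r' is a hypothetical helper, not a CornerstoneR function.
signed_r <- function(num, cat) {
  fit <- lm(num ~ factor(cat))
  adj <- summary(fit)$adj.r.squared
  sign(adj) * sqrt(abs(adj))
}

# Hypothetical data: group means differ clearly, so the value is large.
set.seed(4)
group <- rep(c("A", "B", "C"), each = 50)
y <- unname(c(A = 0, B = 2, C = 4)[group]) + rnorm(150)
strong <- signed_r(y, group)

# Identical group means: R^2 is 0 and the adjusted R^2 is negative,
# 1 - (1 - 0) * 5 / 4 = -0.25, so the formula yields -0.5.
negative <- signed_r(c(1, 2, 3, 1, 2, 3), rep(c("a", "b"), each = 3))

print(c(strong = strong, negative = negative))
```

The second case shows why the sign and absolute value are needed: taking a plain square root of a negative adjusted \(R^2\) would not give a real number.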
Open the dataset and variable selection again; this time, choose the variables as shown in the following picture.
Now close this dialog with 'OK' and keep the default settings. Then click the execute button (green arrow) and open the 'Correlation Matrix' and 'Sorted Correlation List' under the 'Summaries' menu. The result looks like the following screenshots.
Open the 'Correlation Heatmap' in the 'Graphs' menu to see the heatmap.
If you get an error message or a strange output, pay attention to these aspects:
- This function does not accept data with missing values in predictors and/or responses. If your data contain rows with missing values, we recommend using the function 'Handling Missing Values' beforehand.
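If the data are also available in an R session, a quick check for missing values could look like the following sketch. The data frame is made up, and dropping incomplete rows is only the simplest possible strategy, not necessarily what 'Handling Missing Values' does.

```r
# Hypothetical data with one missing value per column.
df <- data.frame(Temperature = c(20.1, NA, 19.8, 21.0),
                 Shift = c("early", "late", NA, "night"))

# Count missing values per variable.
n_missing <- colSums(is.na(df))
print(n_missing)

# Keep only rows without any missing value.
complete <- df[complete.cases(df), , drop = FALSE]
print(complete)
```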