vignettes/correlationAnalysis.Rmd
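In many use cases, we want to reduce a large data set based on the correlations between its variables and also to reorder the data set. Both require a correlation analysis.
There are three different situations: correlations between numerical variables, correlations between categorical variables, and correlations between numerical and categorical variables.
In the following, we will work through each of them.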
In this example, we will start with the ‘Cornerstone’ sample dataset ‘solar_panel_production’. The dataset contains 123 variables and 7333 observations in total and includes both categorical and numerical data.
data to be analyzed
Choose the menu ‘Analyses’ \(\rightarrow\) ‘CornerstoneR’ \(\rightarrow\) ‘Correlation Analysis’ as shown in the following screenshot.
correlation analysis menu
(If you want to reorder the data, select all variables as predictors in order to obtain a symmetrical matrix.)
In the dialog that appears, select ‘I_sc’, ‘V_oc’, and all ‘Temperature_xAve’ variables as predictors, and leave the responses blank.
correlation analysis: variable selection
‘OK’ confirms your selection and the following window appears.
correlation analysis: R Script
You can customize the calculation by different settings. To do this, we open the menu ‘R Script’ \(\rightarrow\) ‘Script Variables’. The screenshot shows the default options.
correlation analysis: script variables
The following options are available:
We will now use the default settings. Close this dialog with ‘OK’ and click the execute button (green arrow) or choose the menu ‘R Script’ \(\rightarrow\) ‘Execute’; all calculations are then done via ‘R’. The calculations are complete when the text in the lower left status bar reads ‘Last execute error state: OK’. The results are available via the menu ‘Summaries’, as shown in the following screenshot.
correlation analysis: summaries
The menu ‘Correlation Matrix’ displays all the correlations between the variables.
correlation analysis: matrix
The menu ‘Sorted Correlation List’ displays the pairwise correlations with their p-values.
correlation analysis: sorted correlation list
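The idea behind such a sorted list can be sketched in a few lines of Python. This is only an illustration, not the code CornerstoneR runs: it assumes the Pearson coefficient for numerical variables and a ranking by absolute value, it leaves out the p-values, and the names `pearson` and `sorted_correlation_list` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def sorted_correlation_list(columns):
    """Return (name_a, name_b, r) for every pair, strongest |r| first.

    `columns` maps a variable name to its list of values (assumed layout).
    """
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)


# Made-up example data, not taken from 'solar_panel_production'.
example = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, 3, 2, 5]}
for name_a, name_b, r in sorted_correlation_list(example):
    print(name_a, name_b, round(r, 3))
```

Each pair appears once because the correlation of a pair does not depend on the order of its two variables.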
To get a better overview of the correlations, we can look at the correlation heatmap by selecting ‘Graphs’ \(\rightarrow\) ‘Correlation Heatmap’ in the R Script menu. The following window with the requested graph appears. Since the matrix is symmetrical, it is sufficient to show its lower triangular part.
correlation analysis: Heatmap
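Why the lower triangle is enough can be seen in a small Python sketch. It is an illustration only: the helper `lower_triangle` is hypothetical, and keeping the diagonal is an assumption, since the heatmap itself may treat it differently.

```python
def lower_triangle(matrix):
    """Keep entries on or below the diagonal; replace the rest with None."""
    return [
        [value if col <= row else None for col, value in enumerate(values)]
        for row, values in enumerate(matrix)
    ]


# Made-up symmetrical correlation matrix for three variables.
symmetric = [
    [1.0, 0.8, -0.3],
    [0.8, 1.0, 0.1],
    [-0.3, 0.1, 1.0],
]
for row in lower_triangle(symmetric):
    print(row)
```

Every value dropped above the diagonal still appears once below it, so no information is lost.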
For categorical variables we use Cramér’s V. It is the square root of the chi-squared statistic divided by the product of the sample size and the minimum dimension minus 1, where the minimum dimension is the smaller of the number of rows and the number of columns of the contingency table.
Open the dataset and the variable selection again. This time, choose the variables as shown in the following picture.
correlation analysis: categorical variables
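Written as a formula, with \(\chi^2\) the chi-squared statistic, \(n\) the sample size, and \(r\) and \(c\) the number of rows and columns of the contingency table, this reads:

\[ V = \sqrt{\frac{\chi^2}{n \cdot (\min(r, c) - 1)}} \]

A minimal Python sketch of this textbook formula follows. It is an illustration, not the CornerstoneR implementation, which may apply corrections that are not shown here; the function name `cramers_v` is hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Uncorrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    observed = Counter(zip(x, y))  # cell counts of the contingency table
    row_totals = Counter(x)
    col_totals = Counter(y)

    # Chi-squared statistic: sum over all cells, including empty ones.
    chi2 = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed.get((row, col), 0) - expected) ** 2 / expected

    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return float("nan")  # undefined if a variable has a single level
    return sqrt(chi2 / (n * min_dim))


# Made-up labels: the first pair is perfectly associated, the second is not.
print(cramers_v(["a", "b", "a", "b"], ["u", "v", "u", "v"]))
print(cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"]))
```

The value lies between 0 (no association) and 1 (perfect association) and carries no sign, unlike a correlation between numerical variables.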
Now let’s customize the settings via the menu ‘R Script’ \(\rightarrow\) ‘Script Variables’.
Because predictors and responses have different dimensions, the data cannot be reordered, so the order of the original data set is kept. In this case, the only options we can change are ‘Confidence Level’ and ‘Remove Insignificant Correlations’. Set the script variables as shown in the screenshot.
correlation analysis: asymmetric categorical settings
Now close this dialog with ‘OK’ and click the execute button (green arrow). Open ‘Correlation Matrix’ and ‘Sorted Correlation List’ in the ‘Summaries’ menu. The results look like the following screenshots. Missing values mark pairs of variables whose correlation is not significant.
correlation analysis: asymmetric categorical matrix without NA
correlation analysis: asymmetric categorical list without NA
Open the ‘Correlation Heatmap’ in the ‘Graphs’ menu. For the asymmetrical matrix, the graph looks slightly different from the symmetrical case.
correlation analysis: asymmetric categorical Heatmap
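How the missing values arise can be sketched in Python as follows. The sketch is an assumption about the principle, not the CornerstoneR code: it takes a matrix of correlations and a matching matrix of p-values as given, treats a correlation as significant when its p-value is at most 1 minus the confidence level, and uses a hypothetical function name and a hypothetical default of 0.95.

```python
def remove_insignificant(correlations, p_values, confidence_level=0.95):
    """Replace correlations whose p-value exceeds alpha with None."""
    alpha = 1.0 - confidence_level  # assumed significance threshold
    return [
        [r if p <= alpha else None for r, p in zip(r_row, p_row)]
        for r_row, p_row in zip(correlations, p_values)
    ]


# Made-up 2 x 3 example: rows are predictors, columns are responses.
correlations = [[0.62, 0.05, 0.31], [0.08, 0.44, 0.02]]
p_values = [[0.001, 0.40, 0.03], [0.25, 0.01, 0.80]]
for row in remove_insignificant(correlations, p_values):
    print(row)
```

The matrix does not need to be square, which matches the asymmetrical case shown here.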
(Put categorical variables in predictors and numerical variables in responses.)
To compute the correlation between numerical and categorical variables, we use linear regression. In rare situations the adjusted \(R^2\) can be negative, so the correlation is computed with the following formula: \[ \text{correlation} = \operatorname{sign}(R^2_{\text{adj}}) \times \sqrt{|R^2_{\text{adj}}|} \]
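The formula above can be tried out with a short Python sketch. It assumes a regression of one numerical response on one categorical predictor, for which the fitted values are the group means, and the usual adjusted \(R^2\) with \(n - k\) residual degrees of freedom for \(k\) categories. The function name `signed_correlation` is hypothetical, and the details of the CornerstoneR implementation may differ.

```python
from collections import defaultdict
from math import copysign, sqrt


def signed_correlation(categories, values):
    """sign(adjusted R^2) * sqrt(|adjusted R^2|) for one categorical predictor."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0 or n <= k:
        raise ValueError("adjusted R^2 is undefined for this input")

    # Residuals are deviations from the mean of each category.
    ss_resid = sum(
        (v - sum(group) / len(group)) ** 2
        for group in groups.values()
        for v in group
    )
    r2 = 1.0 - ss_resid / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return copysign(sqrt(abs(adj_r2)), adj_r2)


# Made-up data: the categories explain everything, then nothing.
print(signed_correlation(["a", "a", "b", "b"], [1, 1, 3, 3]))
print(signed_correlation(["a", "a", "b", "b"], [1, 3, 1, 3]))
```

In the second call the categories explain nothing, so the adjusted \(R^2\) drops below zero and the result is negative. This is the rare case the sign term is there to handle.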
Open the dataset and the variable selection again. This time, choose the variables as shown in the following picture.
correlation analysis: categorical variables
Now close this dialog with ‘OK’ and keep the default settings. Then click the execute button (green arrow) and open ‘Correlation Matrix’ and ‘Sorted Correlation List’ in the ‘Summaries’ menu. The results look like the following screenshots.
correlation analysis: asymmetric numerical/categorical matrix
correlation analysis: asymmetric numerical/categorical list
Open the ‘Correlation Heatmap’ in the ‘Graphs’ menu to see the heatmap.
If you get an error message or a strange output, pay attention to these aspects: