Initial Situation and Goal

Clustering is a common technique in exploratory data analysis. It aims to identify clusters (subgroups) in the data that share common characteristics. The k-means clustering algorithm partitions the data into a predefined number k of clusters. It assigns each data point to the cluster with the nearest centroid, so that the sum of squared Euclidean distances between the points and their cluster centroids is minimized.
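The assignment and update steps behind this description can be sketched in a few lines of Python. This is a minimal illustration of Lloyd's algorithm, not the code that CornerstoneR runs. The function names `squared_distance` and `kmeans`, the choice of the first k points as starting centroids, and the toy data are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Minimal Lloyd's algorithm (illustrative; starts from the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result of this procedure depends on the starting centroids, which is why practical implementations usually try several random starts and keep the best solution.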

The k-means function

The k-means function in CornerstoneR performs the clustering for several numbers of clusters (k). The range is defined in the 'Script Variables' by the minimum and maximum number of clusters. The function uses every integer from the minimum to the maximum as the number of clusters and runs k-means once for each value. Each of these runs is referred to below as a clustering strategy.

The k-means function also works for data that have pre-defined groups. If a grouping column exists, the function will perform the different clustering strategies for each group in the data.
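The overall flow, one k-means run per value of k and per group, can be sketched as follows. This is a hedged illustration of the idea only. The helper names `kmeans_labels` and `cluster_strategies`, the deterministic start from the first k points, and the toy rows are assumptions. The keys follow the 'nCluster_k' naming that appears in the summaries later in this document.

```python
from collections import defaultdict


def kmeans_labels(points, k, max_iter=100):
    """Compact Lloyd's algorithm returning only the cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # assumed start: first k points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_strategies(rows, groups, k_min, k_max):
    """Run k-means for every k from k_min to k_max, separately per group."""
    by_group = defaultdict(list)
    for row, group in zip(rows, groups):
        by_group[group].append(row)
    return {
        group: {
            f"nCluster_{k}": kmeans_labels(pts, k)
            for k in range(k_min, k_max + 1)
        }
        for group, pts in by_group.items()
    }


# Toy rows with a grouping column (made up for illustration).
rows = [(0, 0), (10, 10), (0, 1), (10, 11), (5, 5), (5, 6), (20, 20), (20, 21)]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
result = cluster_strategies(rows, groups, k_min=2, k_max=3)
for group, strategies in result.items():
    for name, labels in strategies.items():
        print(group, name, labels)
```

Data without a grouping column fit the same sketch by passing one constant group label for all rows.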

The following sections present two examples of the function: the first without a grouping column and the second with one.

k-means without grouping

In this example, we will use the 'Iris' data set provided as 'irisdata' in Cornerstone. This data set contains five columns and 150 observations from three different flower species.

Iris example data

Choose the menu 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'k-Means Clustering'. In the dialog that opens, select 'sepal_length' and 'sepal_width' as Predictors and 'petal_length' and 'petal_width' as Responses, then press 'OK'.

Dialog to select the variables of the data set

We can customize the minimum and maximum number of clusters k, as well as the maximum k for the Elbow plot. To do so, open the menu 'R Script' \(\rightarrow\) 'Script Variables'.

Dialog to set the script variables

We will keep the default settings for this example. Close this dialog with 'OK', then click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute' to run all calculations in 'R'. The calculations are complete when the status bar at the lower left shows 'Last execute error state: OK'. The results are then available via the menus 'Summaries' and 'Graphs'.

The summary 'Data with Clusters' shows the original data together with the cluster to which each row was assigned under each clustering strategy. The clustering strategies are named 'nCluster_2' and 'nCluster_3'.

Summary table: Data with Clusters

The summary 'Percentage of Variance' shows how much of the total variance in the data set was retained by each clustering strategy.
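As a rough illustration of such a measure, the sketch below computes the ratio of the between-cluster sum of squares to the total sum of squares, expressed as a percentage. That this ratio is the quantity shown in 'Percentage of Variance' is an assumption, and the function name `percentage_of_variance` and the toy data are made up for the example.

```python
def percentage_of_variance(points, labels):
    """100 * between-cluster SS / total SS (assumed definition)."""

    def mean(pts):
        return [sum(col) / len(pts) for col in zip(*pts)]

    def sum_sq(pts, center):
        return sum((x - c) ** 2 for p in pts for x, c in zip(p, center))

    # Total variation around the overall mean of the data.
    total_ss = sum_sq(points, mean(points))
    # Variation left inside the clusters, around each cluster's own mean.
    within_ss = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        within_ss += sum_sq(members, mean(members))
    return 100.0 * (total_ss - within_ss) / total_ss


# Toy data (made up): two clusters separated along the first coordinate.
points = [(1, 1), (1, 3), (5, 1), (5, 3)]
labels = [0, 0, 1, 1]
print(percentage_of_variance(points, labels))  # 80.0
```

Under this definition the value rises towards 100 as k grows, so it is best read together with the number of clusters rather than on its own.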

Summary table: Percentage of Variance

The summary 'Cluster Sizes and Means' lists the size and the means of each cluster within each clustering strategy. The first column is the clustering strategy and the second column shows the cluster number within that strategy.
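A table of this kind can be derived from the cluster labels alone. The following sketch shows one way to do it. The function name `cluster_sizes_and_means`, the row layout, and the toy data are assumptions for illustration and do not reproduce the exact columns of the Cornerstone summary.

```python
from collections import defaultdict


def cluster_sizes_and_means(points, labels, strategy):
    """Return one row per cluster: (strategy, cluster, size, column means)."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)
    rows = []
    for cluster in sorted(members):
        pts = members[cluster]
        means = tuple(sum(col) / len(pts) for col in zip(*pts))
        rows.append((strategy, cluster, len(pts), means))
    return rows


# Toy data (made up): five points assigned to two clusters.
points = [(1, 1), (1, 3), (5, 1), (5, 3), (5, 5)]
labels = [0, 0, 1, 1, 1]
table = cluster_sizes_and_means(points, labels, "nCluster_2")
for row in table:
    print(row)
```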

Summary table: Cluster Sizes and Means

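The graph 'Elbow Plot' in 'Graphs' supports the elbow method for choosing the number of clusters k: a suitable k is usually the point at which the curve bends and adding further clusters yields little additional improvement.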
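The idea behind the elbow method can be illustrated with a small one-dimensional example. The sketch assumes that the curve plots the total within-cluster sum of squares against k, which is the usual form of the method. To stay short it does not run the k-means iteration. Instead it searches all splits of the sorted values exhaustively, which is only feasible for tiny one-dimensional data. The function names and the values are made up.

```python
from itertools import combinations


def within_ss(values):
    """Sum of squared deviations from the mean of one cluster."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_total_within_ss(values, k):
    """Smallest total within-cluster SS over all splits of the sorted values into k runs."""
    xs = sorted(values)
    best = float("inf")
    # Choose k - 1 cut positions; each choice defines k contiguous clusters.
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        total = sum(within_ss(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best


# Toy data (made up): three tight groups around 1, 5 and 9.
values = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
curve = {k: best_total_within_ss(values, k) for k in range(1, 6)}
for k, wss in curve.items():
    print(k, round(wss, 2))
```

For these values the curve drops steeply up to k = 3 and is almost flat afterwards, so the elbow suggests three clusters.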

Graph: Elbow Plot

We can visualize the clusters for the different clustering strategies by selecting 'Graphs' \(\rightarrow\) 'Matrix nClusters_2' and 'Matrix nClusters_3' in the R Script menu.

Graph: Matrix nClusters_2

Graph: Matrix nClusters_3

k-means with grouping

For the example with a grouping column, we again choose 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'k-Means Clustering' as before. This time we select 'iris_type' as the 'Group by' variable.

Dialog to select the variables of the data set

We again keep the default settings for 'Script Variables' in this example. Click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute' to run all calculations in 'R'. The summaries 'Percentage of Variance' and 'Cluster Sizes and Means' now contain an extra column with the groups.

Summary table: Percentage of Variance

Finally, the graphs are produced separately for each group in 'iris_type', as in the following example:

Graph: Matrix group setosa nClusters_2

Remarks

This function does not accept data with missing values in the predictors or responses. If your data contain rows with missing values, we recommend using the function 'Handling Missing Values' beforehand.
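To show why complete rows matter, the sketch below removes every row that contains a missing value before any distances are computed. This is plain listwise deletion, given as one possible approach. It does not describe what the 'Handling Missing Values' function does, and the helper names and the rows are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(rows):
    """Keep only rows in which every value is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


# Toy rows (made up): the second and third rows each have one missing value.
rows = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, None, 1.4, 0.2),
    (4.7, 3.2, float("nan"), 0.2),
]
complete = drop_incomplete_rows(rows)
print(complete)  # [(5.1, 3.5, 1.4, 0.2)]
```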