Clustering is a common technique in exploratory data analysis. It aims to identify clusters (subgroups) in the data that share common characteristics. The k-means clustering algorithm partitions the data into a predefined number k of clusters and assigns each data point to a cluster so that the sum of squared Euclidean distances between the points and their cluster centroids is minimized.
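To make the assignment and centroid-update steps concrete, here is a minimal sketch of the classic iterative (Lloyd-style) k-means procedure in plain Python. It is only an illustration and not the implementation used by CornerstoneR; the function names, the toy data, and the choice of the first k rows as starting centroids are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, n_iter=100):
    # Assumption for this sketch: start from the first k rows.
    # Real implementations usually use random or smarter starts.
    centroids = list(points[:k])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each pass never increases the total squared distance, which is why the loop can stop as soon as the assignments no longer change.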
The k-means function in CornerstoneR performs the clustering for several numbers of clusters (k). The range is defined in 'Script Variables' as the minimum and maximum number of clusters. The function uses every integer from the minimum to the maximum as the number of clusters and runs k-means once for each value; each such run is called a clustering strategy below.
The k-means function also works for data that have pre-defined groups. If a grouping column exists, the function will perform the different clustering strategies for each group in the data.
In the next steps, we will present two examples of the function, the first without a grouping column and the second with a grouping column.
In this example, we will use the 'Iris' data set provided as 'irisdata' in Cornerstone. This data set contains five columns and 150 observations from three different flower species.
Choose the menu 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'k-Means Clustering'. In the next dialog select 'sepal_length' and 'sepal_width' as Predictors, 'petal_length' and 'petal_width' as Responses and press 'OK'.
We can customize the minimum and maximum number of clusters k, as well as the maximum k for the Elbow plot. To do that, open the menu 'R Script' \(\rightarrow\) 'Script Variables'.
We will keep the default settings for this example. Close this dialog with 'OK' and click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute'; all calculations are then done via 'R'. The calculations are finished when the status bar at the lower left shows 'Last execute error state: OK'. Our results are available via the menus 'Summaries' and 'Graphs'.
The summary 'Data with Clusters' shows the original data and the cluster to which each row was assigned under each clustering strategy. The clustering strategies are named 'nCluster_2' and 'nCluster_3'.
The summary 'Percentage of Variance' shows how much of the total variance in the data set was retained by each clustering strategy.
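One common way to express such a percentage is the ratio of the between-cluster sum of squares to the total sum of squares, which is also what R's `kmeans` prints. The sketch below assumes the summary follows that definition; this document does not confirm it, and the function names and toy data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def explained_variance_percent(points, labels):
    """100 * between-cluster SS / total SS for a given cluster assignment."""
    grand_mean = mean_point(points)
    total_ss = sum(squared_distance(p, grand_mean) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centre = mean_point(members)
        within_ss += sum(squared_distance(p, centre) for p in members)
    # Between-cluster SS is what remains after removing the within-cluster part.
    return 100.0 * (total_ss - within_ss) / total_ss


# Toy data (hypothetical): two tight pairs that lie far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [1, 1, 2, 2]
pct = explained_variance_percent(points, labels)
print(round(pct, 2))
```

Under this definition, a value close to 100 means the clusters are compact relative to the spread of the whole data set, while a single cluster containing all rows gives 0.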
The summary 'Cluster Sizes and Means' summarizes the sizes and means of each cluster within each cluster strategy. The first column is the cluster strategy and the second column shows the cluster number for each strategy.
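The kind of table described above can be sketched in a few lines: group the rows by their assigned cluster, then count them and average each column. The function name, the rows, and the labels below are hypothetical and only illustrate the computation for a single strategy.

```python
from collections import defaultdict


def cluster_sizes_and_means(rows, labels):
    """Return {cluster: (size, column means)} for one clustering strategy."""
    members = defaultdict(list)
    for row, lab in zip(rows, labels):
        members[lab].append(row)
    summary = {}
    for lab in sorted(members):
        group = members[lab]
        means = tuple(sum(col) / len(group) for col in zip(*group))
        summary[lab] = (len(group), means)
    return summary


# Hypothetical rows (two numeric columns) and their cluster assignments.
rows = [(5.0, 3.0), (5.0, 4.0), (7.0, 3.0), (6.0, 3.0), (8.0, 3.0)]
labels = [1, 1, 2, 2, 2]
summary = cluster_sizes_and_means(rows, labels)
for lab, (size, means) in summary.items():
    print(lab, size, means)
```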
The graph 'Elbow Plot' in 'Graphs' shows the elbow method plot, which helps to choose the best number of clusters k.
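The elbow method is usually based on the total within-cluster sum of squares, computed for a range of k and plotted against k; the "elbow" is the point after which adding clusters no longer reduces this value much. The sketch below assumes that quantity (the document does not state which measure the plot uses) and reuses a very basic k-means with the first k rows as starting centroids. All names and the toy data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def total_within_ss(points, k, n_iter=100):
    """Run a basic k-means and return the total within-cluster sum of squares."""
    centroids = list(points[:k])  # simplification: first k rows as start
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_point(members)
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


# Toy data (hypothetical): three separated groups, rows interleaved.
points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0),
          (1.2, 0.9), (5.1, 4.8), (9.1, 1.2),
          (0.9, 1.1), (4.9, 5.2), (8.8, 0.9)]

# One value per candidate k; these are the heights an elbow plot would show.
wss = {k: total_within_ss(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 3))
```

Reading the printed values from top to bottom corresponds to reading the plot from left to right.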
We can visualize the clusters for the different clustering strategies by selecting 'Graphs' \(\rightarrow\) 'Matrix nClusters_2' and 'Matrix n_clusters_3' in the R Script menu.
For the example with a group, we again go to 'Analyses' \(\rightarrow\) 'CornerstoneR' \(\rightarrow\) 'k-Means Clustering' as before, but now we select 'iris_type' as a Group by variable.
We will again keep the default settings for 'Script Variables' in this example. Click the execute button (green arrow) or choose the menu 'R Script' \(\rightarrow\) 'Execute'; all calculations are then done via 'R'. The summaries 'Percentage of Variance' and 'Cluster Sizes and Means' now have an extra column with the groups.
Finally, we will have the graphs for each group in 'iris_type', as in the following example:
This function will not accept data with missing values for predictors and/or responses. If your data contain rows with missing values, we recommend using the function Handling Missing Values beforehand.
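To see what a complete data set means here, the sketch below keeps only rows without missing entries. Dropping incomplete rows is just one possible strategy and is not necessarily what Handling Missing Values does; the helper names and the rows are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def complete_rows(rows):
    """Keep only the rows in which every entry is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


# Hypothetical rows with four numeric columns; two rows contain gaps.
rows = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, None, 1.4, 0.2),
    (6.3, 3.3, float("nan"), 2.5),
    (5.8, 2.7, 5.1, 1.9),
]
clean = complete_rows(rows)
print(len(rows), "rows before,", len(clean), "rows after")
```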