## Accuracy for k-means clustering

2

I am looking for Python code to compute the accuracy of k-means clustering when there are no labels. Does anyone know how to do this? It does not have to be a built-in function; a manual implementation is also fine.

what is your goal? don't you trust k-means algorithm/implementation? – MaxU – 2018-11-16T15:29:52.117

I have to show the accuracy of the K-means – Bong Si-Yoon – 2018-11-16T15:52:25.163

8

Accuracy is a measure that compares the true labels to the predicted labels. K-means is an unsupervised clustering algorithm, so there are no true labels to compare against, and accuracy cannot be directly applied to evaluate k-means clustering. However, here are two metrics you could use to evaluate your clusters instead.

Within Cluster Sum of Squares

The first is Within Cluster Sum of Squares (WCSS), which measures the average squared distance from all the points within a cluster to the center of the cluster (known as the cluster centroid).

To calculate this, start by finding the squared Euclidean distance between a given point and the centroid of the cluster it is assigned to. Repeat this for every point in the cluster, sum the values, and divide by the number of points. Finally, average across all clusters. This gives you the average within cluster sum of squares.

This measurement indicates the variability of the points within a cluster in terms of their average distance to the cluster center. A large sum of squares could indicate a large, spread-out cluster; a small sum of squares could indicate a small, compact cluster with little variation in the attributes of its points. This measurement is sometimes called cohesion, because it measures the similarity between the data points in a cluster.
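The steps above can be sketched as follows (the function and variable names are my own, and the toy data is purely illustrative):

```python
import numpy as np

def wcss(points, labels, centroids):
    """Average within-cluster sum of squares, following the steps in
    the text: per cluster, mean squared distance of members to the
    centroid, then averaged across clusters."""
    total = 0.0
    for k, centroid in enumerate(centroids):
        members = points[labels == k]
        # squared Euclidean distance of each member to its centroid
        sq_dists = np.sum((members - centroid) ** 2, axis=1)
        total += sq_dists.mean()  # sum over points / number of points
    return total / len(centroids)  # average across all clusters

# toy data: two tight, well-separated clusters
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
lbl = np.array([0, 0, 1, 1])
cen = np.array([[0.0, 0.5], [10.0, 10.5]])
print(wcss(pts, lbl, cen))  # 0.25
```

A small value like this reflects compact clusters; the same data with poorly placed centroids would give a much larger number.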

Between Clusters Sum of Squares

The second metric is Between Clusters Sum of Squares (BCSS), which measures the average squared distance between all cluster centroids.

To calculate this, you can find the Euclidean distance from a given cluster centroid to all other cluster centroids. Repeat this for all of the clusters. Then, sum all of the values together. This will give you the Between Cluster Sum of Squares. You can divide by the number of clusters to calculate the Average Between Cluster Sum of Squares.

This measurement indicates the variation between clusters. A large number can indicate clusters that are spread far apart; a small number can indicate clusters that are close to each other. This measurement is sometimes called separation, because it measures how well separated the clusters are.
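A sketch of the BCSS steps above (my own naming; I use squared distances throughout, consistent with "sum of squares"):

```python
import numpy as np

def bcss(centroids):
    """Between-cluster sum of squares: squared Euclidean distance
    from each centroid to every other centroid, summed over all
    ordered pairs of distinct centroids."""
    total = 0.0
    for i, ci in enumerate(centroids):
        for j, cj in enumerate(centroids):
            if i != j:
                total += np.sum((ci - cj) ** 2)
    return total

cen = np.array([[0.0, 0.0], [3.0, 4.0]])
print(bcss(cen))             # 50.0: distance 5, squared, counted both ways
print(bcss(cen) / len(cen))  # 25.0: average per cluster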

Other Resources

You can also look into the Silhouette Coefficient, which combines both cohesion and separation. The Elbow Method can also help you determine the optimal K value.
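Both are available through scikit-learn. A minimal sketch using synthetic data (the blob dataset is illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with 3 well-separated blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

inertias = []
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1),
          round(silhouette_score(X, km.labels_), 3))
# inertia always decreases as K grows; the "elbow" where the curve
# flattens (here, around K=3) suggests a good choice of K, and the
# silhouette score peaks near the true number of clusters
```

Plotting `inertias` against K makes the elbow easier to spot visually.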

Check out the limitations of k-means clustering. And if you want an in-depth explanation of WCSS and BCSS, check out this Khan Academy video:

Depending on the language you are programming in, there may be packages available to help you evaluate your clusters. R's `kmeans` returns attributes including `withinss` and `betweenss`. For Python, scikit-learn's implementation of k-means exposes `inertia_`, which is the "sum of squared distances to the closest centroid for all observations in the training set".
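For example (toy data; in practice you would fit on your own feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative synthetic data
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.inertia_)  # sum of squared distances to the closest centroid
```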

Consider a different algorithm

If you are looking to measure the accuracy of a prediction on a given dataset, it would help to define the ground truth for that dataset. If you do have a ground truth, a supervised algorithm may be a better fit for modeling the outcome.

What are WCSS and BCSS? Could you please explain more about them? – Bong Si-Yoon – 2018-11-17T12:37:18.147

@BongSi-Yoon, I added some additional information. See above. – E. Kenney – 2018-11-20T20:01:05.023