Is this dataset with only two features suitable for clustering with k-means?


I am working with the K-means clustering algorithm for unsupervised learning.

Is the following dataset suitable for the k-means clustering task or not? Why or why not? The dataset has only two features.

enter image description here


Posted 2020-05-17T22:31:09.297

Reputation: 61



One problem with clustering algorithms is that they will typically find you a solution, i.e. they will split your data set into clusters, even if there is no real structure to find. Your data looks like it could consist of about 5 to 7 clusters, but it could equally well be just 2, or only 1.
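To illustrate the point above, here is a minimal sketch that runs a plain k-means (Lloyd's algorithm) on uniform random noise. Everything in it is an assumption made for illustration: the `kmeans` helper, the 300 synthetic points, and the choice of `k=5` are hypothetical and have nothing to do with the dataset in the question.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise with k of the data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centres[c][0]) ** 2 + (p[1] - centres[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster went empty
                centres[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

# Uniform noise in the unit square: by construction there are no groups here.
rng = random.Random(42)
noise = [(rng.random(), rng.random()) for _ in range(300)]

labels = kmeans(noise, k=5)
sizes = [labels.count(c) for c in range(5)]
print(sizes)  # how the 300 points were split across the five labels
```

The algorithm returns a partition without complaint, which is why the result has to be assessed afterwards rather than taken at face value.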

What you need to do after the clustering is to assess the quality of the result. I recommend having a look at Finding Groups in Data by Kaufman & Rousseeuw. They discuss various clustering algorithms and also a procedure that works out how cohesive your clusters are. Though it is 30 years old, it is an excellent book on the topic.

You also have the issue of choosing a value for k in your clustering: I usually start with two and increase it from there. At each step I compute the cohesion of the result using their method, and I keep the value of k that gives the best score. This is an objective way of finding a good value for k, and it usually gives a reasonable clustering result.
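The procedure described above could be sketched roughly as follows. This assumes the cohesion measure meant here is the average silhouette width covered in Kaufman & Rousseeuw, which is my reading rather than something stated in the answer. The `kmeans` and `mean_silhouette` helpers, the three synthetic blobs, and the range of k are all hypothetical choices for illustration.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster went empty
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width: near 1 is cohesive, near 0 or below is poor."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # the silhouette is undefined for a single cluster
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Hypothetical data: three Gaussian blobs of 40 points each.
rng = random.Random(1)
blob_centres = [(0, 0), (5, 5), (0, 5)]
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blob_centres for _ in range(40)]

# Start at k = 2, increase from there, and keep the k with the best score.
scores = {k: mean_silhouette(data, kmeans(data, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

On well-separated blobs like these, the score would be expected to peak at or near the true number of groups, although a single random initialisation can land in a poor local optimum, so in practice it helps to repeat k-means with several seeds for each k.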

The ultimate test, of course, is whether the result makes sense to you when you look at it. No clustering algorithm can do that for you.

Oliver Mason

Posted 2020-05-17T22:31:09.297

Reputation: 3 755

How can I answer the question of whether this dataset is suitable for the k-means clustering task? Which measures should I consider? – Debugger – 2020-05-18T15:57:46.690

There is a cohesion measure described in Kaufman & Rousseeuw. IIRC you calculate, for each element, the distance to its own cluster centre and to those of the other clusters; if it's closest to its own, then it gets a positive score. But it's been a while since I last did this. – Oliver Mason – 2020-05-18T16:57:42.063

Okay, but clustering is the next step. I just need to answer, by looking at the dataset and its statistical measures, whether it is suitable for the k-means clustering task. So I am confused: how can I answer this? – Debugger – 2020-05-18T17:02:34.730