## Is this dataset with only two features suitable for clustering with k-means?

2

I am working with the K-means clustering algorithm for unsupervised learning.

Is the following dataset suitable for the k-means clustering task or not? Why or why not? The dataset has only two features.

2

One problem with clustering algorithms is that they will typically find you a solution, i.e. they will split your data set into clusters, and so they will find structure even if there isn't any. Your data looks like it could consist of about 5 to 7 clusters, but it could equally well be just 2 or only 1.

What you need to do after the clustering is to assess the quality of the result. I recommend having a look at Finding Groups in Data by Kaufman & Rousseeuw. They discuss various clustering algorithms and also a procedure that works out how cohesive your clusters are. Though it is 30 years old, it is an excellent book on the topic.

You also have the issue of choosing a value for k in your clustering: I usually start with two, and increase it from there; at each step I compute the cohesion of the result using their method, until I get the best score. This is an objective way of finding a good value for k and usually a reasonable clustering result.
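As a rough sketch of that loop: the code below assumes the cohesion measure is the silhouette width, which Kaufman & Rousseeuw describe, and it runs on synthetic data made up for the illustration, not on your dataset. The function names (`farthest_first`, `kmeans`, `mean_silhouette`) are hypothetical helpers, and the deterministic farthest-first initialisation is a simplification chosen to keep the example reproducible.

```python
import math
import random


def farthest_first(points, k):
    """Pick k starting centres: the first point, then repeatedly the point
    farthest from all centres chosen so far (a simple deterministic choice)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    return centres


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centres = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic example data (an assumption for illustration): three separated blobs.
rng = random.Random(42)
blob_centres = [(0, 0), (20, 0), (10, 18)]
points = [(rng.gauss(cx, 1), rng.gauss(cy, 1)) for cx, cy in blob_centres for _ in range(40)]

# Start at k = 2, increase k, and keep the value with the best score.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean silhouette = {score:.3f}")
print("best k:", best_k)
```

In practice you would probably use a library rather than hand-written helpers; scikit-learn, for example, provides `KMeans` and `silhouette_score`. If even the best score stays low, that is a hint that the data may not contain much cluster structure.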

The ultimate test, of course, is whether the result makes sense to you when you look at it. No clustering algorithm can do that for you.

How can I answer the question of whether this dataset is suitable for the k-means clustering task? Which measures should I consider? – Debugger – 2020-05-18T15:57:46.690

There is a cohesion measure described in Kaufman & Rousseeuw. IIRC you calculate, for each element, the distance to its own cluster centre and to the centres of the other clusters; if it's closest to its own, then it gets a positive score. But it's been a while since I last did this. – Oliver Mason – 2020-05-18T16:57:42.063

Okay, but clustering is the next step. I just need to answer, by looking at the dataset and its statistical measures, whether it is suitable for the k-means clustering task. So I am confused about how to answer this. – Debugger – 2020-05-18T17:02:34.730