K-Means aims to minimize the within-cluster variance, and because it computes each centroid as the mean of the points in a cluster, it needs the (squared) Euclidean distance in order to converge properly. Therefore, if you absolutely want to use K-Means, you need to make sure your data works well with it.
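
To make the link between the mean and the Euclidean distance explicit, here is a short derivation. The notation is introduced here for illustration and is not part of the original answer: $C_k$ is the set of points in cluster $k$, $\mu_k$ its centroid, and $x_i$ a data point.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\frac{\partial J}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0
\;\Longrightarrow\;
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i .
$$

So the mean is exactly the point that minimizes the within-cluster sum of squared Euclidean distances. With another distance the minimizer is generally a different point (for the Manhattan distance, for example, it is the component-wise median), so updating the centroid to the mean could increase the cost instead of decreasing it.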

## Representation

K-Means, and clustering in general, tries to partition the data into meaningful groups by making sure that instances in the same cluster are similar to each other. Therefore, you need a good way to represent your data so that you can easily compute a meaningful similarity measure.

Using one-hot encoding on categorical variables is a good idea when the categories are equidistant from each other, because it places every pair of categories at the same distance. For instance, if you have the colours light blue, dark blue, and yellow, one-hot encoding might not give you the best results, since dark blue and light blue are likely "closer" to each other than they are to yellow.
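
As a minimal sketch of why one-hot encoding treats all categories as equidistant, the snippet below encodes the three colours from the example and measures the Euclidean distance between each pair. The helper `one_hot` and the variable names are hypothetical and only serve the illustration.

```python
import math

# Hypothetical colour categories taken from the example above.
categories = ["light blue", "dark blue", "yellow"]


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


encoded = {c: one_hot(c, categories) for c in categories}

# Euclidean distance between every pair of distinct categories.
pairwise = {
    (a, b): math.dist(encoded[a], encoded[b])
    for i, a in enumerate(categories)
    for b in categories[i + 1:]
}

for pair, d in pairwise.items():
    print(pair, round(d, 4))
```

Every pair ends up at the same distance (the square root of 2), so the encoding carries no information about light blue being nearer to dark blue than to yellow.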

If the categorical values are not "equidistant" and can be ordered, you could also give the categories a numerical value. For instance, kid, teenager, and adult could be represented as 0, 1, and 2. This would make sense because a teenager is "closer" to being a kid than an adult is.
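
A small sketch of this ordinal encoding, using the mapping from the example. The names `age_group_order` and `ordinal_distance` are hypothetical, and the equal spacing between consecutive categories is itself an assumption you may want to revisit for your data.

```python
# Hypothetical ordered mapping taken from the example above.
# Assumption: consecutive categories are equally far apart.
age_group_order = {"kid": 0, "teenager": 1, "adult": 2}


def ordinal_distance(a, b, order=age_group_order):
    """Distance between two ordered categories after numerical encoding."""
    return abs(order[a] - order[b])


print(ordinal_distance("kid", "teenager"))  # neighbouring categories
print(ordinal_distance("kid", "adult"))     # categories two steps apart
```

Unlike one-hot encoding, this representation keeps the order: a kid is at distance 1 from a teenager but at distance 2 from an adult.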

## K-Medoids

A more general alternative to K-Means is K-Medoids. K-Medoids works similarly to K-Means, but the main difference is that the centre of each cluster (the medoid) is defined as the data point that minimizes the within-cluster sum of distances. Because the centre is always an actual data point rather than a computed mean, you can use any distance measure you want, and you could therefore build your own custom measure that takes into account which categories should be close to each other.
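
The idea can be sketched in plain Python as follows. This is a simplified alternating variant (assign points, then re-pick each medoid), not the full swap-based PAM algorithm, and the distance table `COLOUR_DISTANCE`, its values, and all function names are assumptions made up for the illustration.

```python
import random

# Hypothetical, hand-chosen dissimilarities between colour categories.
COLOUR_DISTANCE = {
    frozenset(["light blue", "dark blue"]): 0.2,
    frozenset(["light blue", "yellow"]): 1.0,
    frozenset(["dark blue", "yellow"]): 1.0,
}


def colour_distance(a, b):
    """Custom distance: identical colours are at 0, others use the table."""
    if a == b:
        return 0.0
    return COLOUR_DISTANCE[frozenset([a, b])]


def assign(points, medoids, distance):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        if i in clusters:
            nearest = i  # a medoid always stays in its own cluster
        else:
            nearest = min(medoids, key=lambda m: distance(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, distance, initial_medoids=None, max_iter=100, seed=0):
    """Alternate between assigning points and re-picking medoids."""
    if initial_medoids is None:
        initial_medoids = random.Random(seed).sample(range(len(points)), k)
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids, distance)
        # New medoid: the member with the smallest summed distance to the
        # members of its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)


points = ["light blue", "dark blue", "dark blue", "light blue", "yellow", "yellow"]
medoids, clusters = k_medoids(points, k=2, distance=colour_distance,
                              initial_medoids=[0, 4])
for m, members in clusters.items():
    print(points[m], "->", [points[i] for i in members])
```

Starting from one blue and one yellow medoid, the two shades of blue end up in one cluster and yellow in the other. Like K-Means, this scheme only finds a local optimum, so the result depends on the initial medoids; for mixed data you would replace `colour_distance` with a measure that also combines the numerical features.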

Yes, using 1-of-n encoding is valid too. – Sean Owen – 2014-05-14T06:47:00.223

Do you have any ideas about clustering 'TIME SERIES' data that mixes categorical and numerical variables? – Leila Yousefi – 2018-04-13T09:23:23.063

Perhaps this approach would be useful: http://zeszyty-naukowe.wwsi.edu.pl/zeszyty/zeszyt12/Numerical_Coding_of_Nominal_Data.pdf

– None – 2015-12-14T19:24:05.343