## K-Means clustering for mixed numeric and categorical data


My data set contains a number of numeric attributes and one categorical.

Say, NumericAttr1, NumericAttr2, ..., NumericAttrN, CategoricalAttr,

where CategoricalAttr takes one of three possible values: CategoricalAttrValue1, CategoricalAttrValue2 or CategoricalAttrValue3.

I'm using a default k-means clustering implementation for Octave (https://blog.west.uni-koblenz.de/2012-07-14/a-working-k-means-code-for-octave/). It works with numeric data only.

So my question: is it correct to split the categorical attribute CategoricalAttr into three numeric (binary) variables, like IsCategoricalAttrValue1, IsCategoricalAttrValue2, IsCategoricalAttrValue3 ?
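For illustration, this is what that split looks like in Python (the attribute and value names follow the question; this is a sketch of the encoding itself, not the Octave code):

```python
# Expand the single categorical attribute into three binary indicator
# columns (1-of-n / one-hot encoding). Names are illustrative.
CATEGORIES = ["CategoricalAttrValue1", "CategoricalAttrValue2", "CategoricalAttrValue3"]

def one_hot(row):
    """Replace the trailing categorical value with 0/1 indicator columns."""
    *numeric, cat = row
    return numeric + [1.0 if cat == c else 0.0 for c in CATEGORIES]

rows = [
    [0.5, 1.2, "CategoricalAttrValue1"],
    [0.7, 0.9, "CategoricalAttrValue3"],
]
encoded = [one_hot(r) for r in rows]
# encoded -> [[0.5, 1.2, 1.0, 0.0, 0.0], [0.7, 0.9, 0.0, 0.0, 1.0]]
```

The encoded rows are all-numeric, so they can be fed to a numeric-only k-means implementation.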

Yes, using 1-of-n encoding is valid too.

– Sean Owen – 2014-05-14T06:47:00.223

Perhaps this approach would be useful: http://zeszyty-naukowe.wwsi.edu.pl/zeszyty/zeszyt12/Numerical_Coding_of_Nominal_Data.pdf

– None – 2015-12-14T19:24:05.343

## Answers

70

The standard k-means algorithm isn't directly applicable to categorical data, for various reasons. The sample space for categorical data is discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't really meaningful. As someone put it, "The fact a snake possesses neither wheels nor legs allows us to say nothing about the relative value of wheels and legs." (from here)

There's a variation of k-means known as k-modes, introduced in this paper by Zhexue Huang, which is suitable for categorical data. Note that the solutions you get are sensitive to initial conditions, as discussed here (PDF), for instance.

Huang's paper (linked above) also has a section on "k-prototypes" which applies to data with a mix of categorical and numeric features. It uses a distance measure which mixes the Hamming distance for categorical features and the Euclidean distance for numeric features.
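The mixed dissimilarity Huang describes can be sketched in a few lines of Python (a minimal illustration of the idea, not his implementation; the weight `gamma` trading off the two parts is a free parameter):

```python
def kprototypes_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    """Huang-style mixed dissimilarity: squared Euclidean distance on the
    numeric part plus gamma times the number of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    mismatches = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric + gamma * mismatches

# Two records with numeric parts [1.0, 2.0] vs [1.0, 4.0] and one
# categorical mismatch, weighted by gamma = 0.5:
d = kprototypes_distance([1.0, 2.0], ["a", "b"], [1.0, 4.0], ["a", "c"], gamma=0.5)
# d -> 4.0 (numeric) + 0.5 * 1 (categorical) = 4.5
```

The choice of `gamma` controls how much a categorical mismatch "costs" relative to numeric distance, and Huang discusses how to pick it.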

A Google search for "k-means mix of categorical data" turns up quite a few more recent papers on various algorithms for k-means-like clustering with a mix of categorical and numeric data. (I haven't yet read them, so I can't comment on their merits.)

Actually, what you suggest (converting categorical attributes to binary values, and then doing k-means as if these were numeric values) is another approach that has been tried before (predating k-modes). (See Ralambondrainy, H. 1995. A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16:1147–1157.) But I believe the k-modes approach is preferred for the reasons I indicated above.


Good answer. Potentially helpful: I have implemented Huang's k-modes and k-prototypes (and some variations) in Python: https://github.com/nicodv/kmodes

– Def_Os – 2014-06-12T16:08:03.140

I do not recommend converting categorical attributes to numerical values. Imagine you have two city names: NY and LA. If you assign NY the number 3 and LA the number 8, the distance is 5, but that 5 has nothing to do with the difference between NY and LA.

– adesantos – 2014-06-25T14:38:17.337

Very interesting, but I am far from convinced that the Hamming distance really provides any metric that is useful for clustering -- pat and rat being one apart, and rat and rodent somewhat more. There is interesting work being done with neural nets now to find similar sets of words, which might be useful for converting words/categories into something that can be clustered using typical k-means-type approaches.

– John Powell aka Barça – 2016-10-21T20:24:33.143

@JohnBarça This question was about clustering categorical variables, not finding the semantic similarity between terms. By "Hamming distance", I meant something like "the Hamming distance between the binary vector which contains a one-hot encoding of each categorical variable" (i.e., one basis vector for each level of each categorical variable). The paper I cited refers to the "total mismatches" of the categorical variables, but that would be the same as 1/2 the Hamming distance between the one-hot encodings.

– Tim Goodman – 2016-10-26T08:38:04.680

@adesantos Yes, that's a problem with representing multiple categories with a single numeric feature and using a Euclidean distance. Using the Hamming distance is one approach; in that case the distance is 1 for each feature that differs (rather than the difference between the numeric values assigned to the categories). Making each category its own feature is another approach (e.g., 0 or 1 for "is it NY", and 0 or 1 for "is it LA").

– Tim Goodman – 2014-07-01T14:36:16.047

Rolled back edit. The grammar correction ("allows" -> "allow") was incorrect ("allows" correctly matches the singular subject "The fact"), and you shouldn't change a direct quotation. If it had been an actual grammatical error, better to add "[sic]". Also, the added citations (to Wikipedia and a stackoverflow post on k-modes) are less authoritative than the ones I already included (e.g., original paper on k-modes) and add little value in my opinion.

– Tim Goodman – 2016-11-29T20:16:52.677

Why not just dummy-code the categorical variables? Then it should work. If the other variables are scaled to 0-1, the distances become comparable.

– wordsforthewise – 2018-02-15T03:37:09.510

If you scale your numeric features to the same range as the binarized categorical features, then cosine similarity tends to yield very similar results to the Hamming approach above. I don't have a robust way to validate that this works in all cases, so when I have mixed cat and num data I always check the clustering on a sample with the simple cosine method I mentioned and the more complicated mix with Hamming. If the difference is insignificant I prefer the simpler method.

– cwharland – 2014-05-14T17:53:54.897

That sounds like a sensible approach, @cwharland. On further consideration I also note that one of the advantages Huang gives for the k-modes approach over Ralambondrainy's -- that you don't have to introduce a separate feature for each value of your categorical variable -- really doesn't matter in the OP's case, where he only has a single categorical variable with three values. Better to go with the simplest approach that works.

– Tim Goodman – 2014-05-14T19:54:45.137

14

In my opinion, there are solutions for dealing with categorical data in clustering. R's StatMatch package provides a distance designed for categorical (and mixed) data, the Gower distance (http://www.rdocumentation.org/packages/StatMatch/versions/1.2.0/topics/gower.dist), and it works pretty well.
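For intuition, the Gower idea can be sketched in a few lines of Python (the R function linked above is the real implementation; the `ranges` argument and feature layout here are illustrative). Each feature contributes a dissimilarity in [0, 1] -- a range-normalized absolute difference for numeric features, a 0/1 mismatch for categorical ones -- and the per-feature values are averaged:

```python
def gower_distance(x, y, ranges):
    """Gower-style dissimilarity between two records.

    ranges[i] is a (min, max) pair for numeric feature i,
    or None if feature i is categorical.
    """
    parts = []
    for xi, yi, rng in zip(x, y, ranges):
        if rng is None:
            # Categorical: simple 0/1 mismatch.
            parts.append(0.0 if xi == yi else 1.0)
        else:
            # Numeric: absolute difference scaled by the feature's range.
            lo, hi = rng
            parts.append(abs(xi - yi) / (hi - lo) if hi > lo else 0.0)
    return sum(parts) / len(parts)

# One numeric feature (range 0..4) and one categorical feature:
d = gower_distance([1.0, "NY"], [3.0, "LA"], [(0.0, 4.0), None])
# d -> (0.5 + 1.0) / 2 = 0.75
```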


This is the approach I'm using for a mixed dataset - partitioning around medoids applied to the Gower distance matrix (see https://www.r-bloggers.com/clustering-mixed-data-types-in-r/). The problem is that calculating the distance matrix requires O(n^2) memory, so for datasets larger than 10,000-20,000 records I'm looking at variants of k-means clustering that need less memory and can handle mixed data.

– RobertF – 2017-03-03T15:58:11.753

10

(In addition to the excellent answer by Tim Goodman)

K-modes is definitely the way to go if you want stability in the clustering algorithm used.

1. The clustering algorithm is free to choose any distance metric / similarity score. Euclidean is the most popular, but any other metric can be used that scales according to the data distribution in each dimension/attribute, for example the Mahalanobis metric.

2. With regards to mixed (numerical and categorical) clustering a good paper that might help is: INCONCO: Interpretable Clustering of Numerical and Categorical Objects

3. Beyond k-means: Since plain vanilla k-means has already been ruled out as an appropriate approach to this problem, I'll venture beyond, to the idea of treating clustering as a model-fitting problem. Different measures, like the information-theoretic Kullback–Leibler divergence, work well when trying to converge a parametric model towards the data distribution. (Of course, parametric clustering techniques like GMM are slower than k-means, so there are drawbacks to consider.)

4. Fuzzy k-modes clustering also sounds appealing since fuzzy logic techniques were developed to deal with something like categorical data. See Fuzzy clustering of categorical data using fuzzy centroids for more information.

6

This question seems really about representation, and not so much about clustering.

Categorical data is a problem for most algorithms in machine learning. Suppose, for example, you have some categorical variable called "color" that could take on the values red, blue, or yellow. If we simply encode these numerically as 1, 2, and 3 respectively, our algorithm will think that red (1) is actually closer to blue (2) than it is to yellow (3). We need to use a representation that lets the computer understand that these things are all actually equally different.

One simple way is to use what's called a one-hot representation, and it's exactly what you thought you should do. Rather than having one variable like "color" that can take on three values, we separate it into three variables. These would be "color-red," "color-blue," and "color-yellow," which all can only take on the value 1 or 0.

This increases the dimensionality of the space, but now you can use any clustering algorithm you like. It does sometimes make sense to z-score or whiten the data after this process, but your idea is definitely reasonable.
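To make the two representations concrete, here is a small Python check (the color names come from the example above; the integer codes are the illustrative 1, 2, 3 assignment):

```python
import math

# Naive integer coding vs. one-hot coding of the "color" variable.
int_code = {"red": 1, "blue": 2, "yellow": 3}
one_hot = {"red": (1, 0, 0), "blue": (0, 1, 0), "yellow": (0, 0, 1)}

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)

# Integer coding: red spuriously looks closer to blue than to yellow.
assert abs(int_code["red"] - int_code["blue"]) < abs(int_code["red"] - int_code["yellow"])

# One-hot coding: every pair of distinct colors is equally far apart.
assert euclid(one_hot["red"], one_hot["blue"]) == euclid(one_hot["red"], one_hot["yellow"])
```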

I agree with your answer. One-hot encoding is very useful.

– Pramit – 2016-09-20T21:17:07.093

1

You can also give the Expectation Maximization clustering algorithm a try. It can work on categorical data and will give you a statistical likelihood of which categorical value (or values) a cluster is most likely to take on.

Can you be more specific? EM refers to an optimization algorithm that can be used for clustering. There are many ways to do this and it is not obvious what you mean.

– bayer – 2014-06-25T19:18:45.110

@bayer, I think the clustering mentioned here is a Gaussian mixture model. GMM usually uses EM.

– goh – 2014-10-29T07:17:00.650

I don't think that's what he means, because GMM does not assume categorical variables.

– bayer – 2014-10-29T09:06:51.230

0

It depends on the kind of categorical variable being used. For ordinal variables, say bad, average, and good, it makes sense to use a single variable with values 0, 1, 2, and the distances are meaningful here (average is close to both bad and good). However, if there is no order, you should ideally use one-hot encoding as mentioned above.

0

Consider a scenario where the categorical variable cannot be one-hot encoded, e.g. when it has 200+ categories.

In such cases you can use the R package clustMixType.

It can handle mixed data (numeric and categorical); you just feed in the data and it automatically segregates the categorical and numeric columns.

If you find any issues, such as a numeric column being treated as categorical, you can use as.factor() (or, vice versa, as.numeric()) on that field to convert it, and feed the new data to the algorithm.

Calculate lambda, so that you can feed it in as input at clustering time.
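As a rough illustration of what estimating lambda involves: lambda weights the categorical part of the mixed distance against the numeric part, and one common heuristic ties it to the spread of the numeric columns. The Python sketch below mirrors that idea only conceptually; clustMixType's own lambdaest() in R uses its own variant, so treat this as an illustration, not that package's method:

```python
from statistics import pvariance

def estimate_lambda(numeric_columns):
    """Heuristic lambda: the mean (population) variance of the numeric
    columns, so that one categorical mismatch costs about as much as a
    typical numeric deviation. Illustrative only."""
    return sum(pvariance(col) for col in numeric_columns) / len(numeric_columns)

# Two numeric columns with variances 1.0 and 4.0:
lam = estimate_lambda([[0, 2], [0, 4]])
# lam -> (1.0 + 4.0) / 2 = 2.5
```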

You can even get a WSS (within sum of squares) and an elbow plot to find the optimal number of clusters.

Hope this answer helps you in getting more meaningful results.