The standard k-means algorithm isn't directly applicable to categorical data, for various reasons. The sample space for categorical data is discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't really meaningful. As someone put it, "The fact a snake possesses neither wheels nor legs allows us to say nothing about the relative value of wheels and legs." (from here)

There's a variation of k-means known as k-modes, introduced in this paper by Zhexue Huang, which is suitable for categorical data. Note that the solutions you get are sensitive to initial conditions, as discussed here (PDF), for instance.

Huang's paper (linked above) also has a section on "k-prototypes" which applies to data with a mix of categorical and numeric features. It uses a distance measure which mixes the Hamming distance for categorical features and the Euclidean distance for numeric features.

A Google search for "k-means mix of categorical data" turns up quite a few more recent papers on various algorithms for k-means-like clustering with a mix of categorical and numeric data. (I haven't yet read them, so I can't comment on their merits.)

Actually, what you suggest (converting categorical attributes to binary values, and then doing k-means as if these were numeric values) is another approach that has been tried before (predating k-modes). (See Ralambondrainy, H. 1995. A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16:1147–1157.) But I believe the k-modes approach is preferred for the reasons I indicated above.

6Yes, using 1-of-n encoding is valid too. – Sean Owen – 2014-05-14T06:47:00.223

Perhaps this approach would be useful: http://zeszyty-naukowe.wwsi.edu.pl/zeszyty/zeszyt12/Numerical_Coding_of_Nominal_Data.pdf

– None – 2015-12-14T19:24:05.343