## Clustering analysis for observations with lists as data


So I have several samples analyzed for their chemical composition. After data analysis, for each sample, I have a list of compounds found and their corresponding relative abundance. Some compounds are unique but most are actually found in most samples.

I want to do clustering analysis based on these lists of compounds. How do I go about this? Specifically, how do I vectorize my dataset, since each observation is actually an array with both numerical (abundance) and categorical (compound label) variables?


K-means would be a fine clustering method to start with, though you will have to specify the number of clusters you want it to return up front (not sure if you know that or can figure it out). If you can't, check out DBSCAN, which doesn't require it.

As for the mix of numerical and categorical data types, all you need to do is one-hot encode your categorical variables. One-hot encoding takes every known value of a category and creates a new binary feature for it: the feature is 1 if the sample has that value and 0 if it does not. That way you can use numerical and categorical features at the same time; just make sure to normalize your numerical data!
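A minimal sketch of the encoding step in plain Python, assuming each sample is stored as a list of (compound label, abundance) pairs; the compound names and abundances below are invented for illustration:

```python
# Made-up samples: each is a list of (compound_label, relative_abundance) pairs.
samples = [
    [("glucose", 0.8), ("citrate", 0.2)],
    [("glucose", 0.5), ("ethanol", 0.5)],
]

# Build the vocabulary: every compound label seen across all samples.
labels = sorted({label for sample in samples for label, _ in sample})

def one_hot(sample):
    """Return a 0/1 vector: 1 if the compound occurs in the sample."""
    present = {label for label, _ in sample}
    return [1 if label in present else 0 for label in labels]

def minmax(values):
    """Min-max normalize a list of numerical values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

encoded = [one_hot(s) for s in samples]
# labels  == ['citrate', 'ethanol', 'glucose']
# encoded == [[1, 0, 1], [0, 1, 1]]
```

In practice a library encoder (e.g. scikit-learn's `OneHotEncoder`) does the same thing, but the idea is just this: one binary column per known compound label.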

Cool. I will try one-hot encoding. I kinda thought of that, but I didn't know whether it's appropriate (or what it's called, so I could google more about it). Thanks a lot! – quarksome – 2019-08-17T01:39:14.100

Ok wait. So for example, I have 10 different compound labels. That translates to 10 new variables (1 or 0). And for each observation, I have a list of compounds with their corresponding abundances. So each observation is an (m x 11) matrix. Can k-means do clustering with matrices as input? – quarksome – 2019-08-17T01:56:19.163

Since you have binary data as features, you will need to do the calculations a bit differently for k-means: Hamming distance should be used instead of the standard Euclidean. Otherwise you could use hierarchical clustering instead, which I believe works better with binary data. Alternatively, you could build a single feature set of all the different compound labels and their abundances, with 0 where a compound is not present; then k-means would work just fine. – stefanLopez – 2019-08-18T16:16:32.120
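The last suggestion in that comment also answers the (m x 11) question: each observation collapses to one flat, fixed-length vector indexed by compound, not a matrix. A sketch in plain Python, with invented compound names and abundances:

```python
# Made-up samples: each is a dict mapping compound label -> relative abundance.
samples = [
    {"glucose": 0.8, "citrate": 0.2},
    {"glucose": 0.5, "ethanol": 0.5},
]

# Shared axis: every compound label observed in any sample, in a fixed order.
compounds = sorted({c for s in samples for c in s})

def to_vector(sample):
    """Project one sample onto the shared compound axis (0.0 when absent)."""
    return [sample.get(c, 0.0) for c in compounds]

X = [to_vector(s) for s in samples]
# compounds == ['citrate', 'ethanol', 'glucose']
# X         == [[0.2, 0.0, 0.8], [0.0, 0.5, 0.5]]
```

`X` is now an ordinary (n_samples x n_compounds) numeric table, so it can be passed directly to any standard clustering routine (e.g. scikit-learn's `KMeans.fit`), with no separate one-hot columns needed, since a zero already encodes absence.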

K-means doesn't optimize Hamming distance. Nor Euclidean, actually; only squared Euclidean. It's not really well suited for one-hot encoded data, and tends to produce pretty poor results on it (for a reason: the "mean" in k-means is a bad idea on such variables). – Has QUIT--Anony-Mousse – 2019-09-02T21:15:42.963