Clustering Weekday Weekend Data and Multicollinearity


Hi, I have weekday and weekend step-count data from which I extracted metrics such as the weekday (wd) steps, the weekend (we) steps, the standard deviation of wd steps, the standard deviation of we steps, and so on.

  wd_count  we_count  wd_sd_count  we_sd_count  ... .... ....
1  5000      3000      300          500
2  7000      2000      400          100

If I cluster this data, the weekday and weekend variables will be highly correlated, and I will have to remove them before clustering. Is there a way around this problem for this kind of analysis?


Posted 2019-12-30

Yes, it's called correlation clustering.

Correlation can cause problems with many clustering algorithms because the correlated attributes effectively get extra weight. The usual remedy is to drop highly correlated variables or to decorrelate them, for example with PCA.

However, there are also correlation clustering algorithms that are designed for data containing multiple correlations. They cluster objects based on the correlations they exhibit, which turns exactly your problem into an advantage.
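
For the PCA route, here is a minimal sketch using only NumPy. The data is synthetic and made up purely for illustration (the column names mirror the question, while the `activity` variable and all the numbers are my assumptions), so treat it as a starting point rather than a recipe. It standardizes the metrics, projects them onto principal components, and keeps enough components to cover 95% of the variance, a threshold that is itself an arbitrary choice.

```python
import numpy as np

# Synthetic data for illustration only: 200 hypothetical people.
rng = np.random.default_rng(0)
n = 200
activity = rng.normal(0.0, 1.0, size=n)  # assumed latent activity level

X = np.column_stack([
    6000 + 1500 * activity + rng.normal(0, 300, n),  # wd_count
    4000 + 1200 * activity + rng.normal(0, 300, n),  # we_count (tracks wd_count)
    400 + rng.normal(0, 50, n),                      # wd_sd_count
    300 + rng.normal(0, 50, n),                      # we_sd_count
])

# Standardize so that no metric dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Correlation between the original (standardized) columns.
corr_before = np.corrcoef(Z, rowvar=False)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep the smallest number of components reaching 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
scores = Z @ Vt[:k].T  # decorrelated features to feed into clustering

corr_after = np.corrcoef(scores, rowvar=False)

print("wd/we correlation before PCA:", round(float(corr_before[0, 1]), 3))
print("components kept:", k)
print("max off-diagonal correlation after PCA:",
      float(np.abs(corr_after - np.eye(k)).max()))
```

The `scores` array can then go into whatever clustering algorithm you like in place of the raw columns. Note that the components are uncorrelated but not equally scaled: the first one carries the shared weekday/weekend signal and has the largest variance, so whether to rescale the components before clustering is a separate decision.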

Noah Weber

Posted 2019-12-30

