How to refine K-means clustering on a data set?


I'm working with a data set where each record is stored as a string such as AxByCyA, where A, B and C are actions and v, w, x, y, z are the times between actions (each letter represents an interval of time). It's worth noting that B cannot occur without A, and C cannot occur without B. C is the action I'm attempting to study (i.e., I'd like to be able to predict whether a user will do C based on their prior actions).

I intend to create 2 clusters: people who do C and those who don't.

From this data set I build a training array to run the scikit-learn (Python) k-means algorithm on. It contains the number of As, the number of Bs, the mean time between actions (the average of the intervals) and the standard deviation of the intervals.
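To make the feature construction concrete, here is a minimal sketch of how one such string could be turned into the four features described above. The numeric values assigned to v..z and the helper name `extract_features` are assumptions made for illustration only; the real durations behind each letter are not given here.

```python
from statistics import mean, pstdev

# Hypothetical mapping from interval letters to durations (arbitrary units).
# The actual meaning of v..z is not stated, so these numbers are placeholders.
INTERVAL_VALUES = {"v": 1.0, "w": 2.0, "x": 3.0, "y": 4.0, "z": 5.0}


def extract_features(sequence):
    """Return [number of As, number of Bs, mean interval, std of intervals]."""
    intervals = [INTERVAL_VALUES[ch] for ch in sequence if ch in INTERVAL_VALUES]
    # Fall back to 0.0 when a sequence has no intervals (or only one, for the std).
    mean_gap = mean(intervals) if intervals else 0.0
    std_gap = pstdev(intervals) if len(intervals) > 1 else 0.0
    return [sequence.count("A"), sequence.count("B"), mean_gap, std_gap]


def did_c(sequence):
    """Target label: True if the user performed action C at least once."""
    return "C" in sequence
```

For example, `extract_features("AxByCyA")` counts two As and one B and summarises the intervals x, y, y. The population standard deviation (`pstdev`) is one possible choice; a sample standard deviation would work as well.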

This gives me an overall success rate of 82% on the test set, but is there anything I can do to improve the accuracy?

Jessica Chambers

Posted 2018-05-09T09:28:56.357

Reputation: 285




The usual parameters to adjust in k-means:

  1. Number of clusters (recall that several clusters can share the same label).
  2. Distance definition (Euclidean is the most basic, Gauss is an
  3. Selection of initial cluster positions.
  4. Data preprocessing (data normalization, ...)
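As a rough sketch of items 1, 3 and 4 with scikit-learn: the snippet below standardizes the features, fits k-means with more than two clusters and k-means++ initialization, and then gives each cluster the majority label of its training members. The function names `fit_kmeans_classifier` and `predict`, and the default of 4 clusters, are illustrative assumptions rather than recommended values. Item 2 is not shown because scikit-learn's `KMeans` works with Euclidean distance only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_kmeans_classifier(X_train, y_train, n_clusters=4, seed=0):
    """Scale features, cluster them, and map each cluster to its majority label."""
    model = make_pipeline(
        StandardScaler(),  # item 4: put all features on a comparable scale
        KMeans(
            n_clusters=n_clusters,  # item 1: may be larger than the number of labels
            init="k-means++",       # item 3: initial centre selection
            n_init=10,              # restart several times, keep the best run
            random_state=seed,
        ),
    )
    cluster_ids = model.fit_predict(X_train)
    y_train = np.asarray(y_train)
    cluster_to_label = {}
    for c in range(n_clusters):
        members = y_train[cluster_ids == c]
        # Majority vote over 0/1 labels; an empty cluster defaults to label 0.
        cluster_to_label[c] = int(members.mean() >= 0.5) if members.size else 0
    return model, cluster_to_label


def predict(model, cluster_to_label, X):
    """Assign each row to its nearest cluster and return that cluster's label."""
    return np.array([cluster_to_label[c] for c in model.predict(X)])
```

Trying a few values of `n_clusters` and comparing accuracy on a held-out set is one way to see whether extra clusters help.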

pasaba por aqui

Posted 2018-05-09T09:28:56.357

Reputation: 733