I'm working with a data set in which each record is stored as a string such as `AxByCyA`, where `A`, `B` and `C` are actions and `v,w,x,y,z` are the times between actions (each letter represents an interval of time). Note that `B` cannot occur without `A`, and `C` cannot occur without `B`. `C` is the action I'm trying to study (i.e. I'd like to be able to predict whether a user will do `C` based on their prior actions).

I intend to create 2 clusters: people who do `C` and those who don't.

From this data set I build a training array to run the scikit-learn (Python) k-means algorithm on. Each row contains the number of `A`s, the number of `B`s, the mean time between actions (the average of the intervals) and the standard deviation of the intervals.
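To make the feature vector described above concrete, here is a minimal sketch of how one row could be built from a single string. The post does not say what duration each interval letter stands for, so the `INTERVAL_VALUES` mapping and the `extract_features` name are assumptions for illustration only. The sketch uses the population standard deviation and does not cut the string off before `C`; the clustering step itself is left out.

```python
from statistics import mean, pstdev

# Hypothetical mapping from interval letters to numeric durations;
# the original post does not say what each letter stands for.
INTERVAL_VALUES = {"v": 1.0, "w": 2.0, "x": 3.0, "y": 4.0, "z": 5.0}


def extract_features(sequence):
    """Return [count of A, count of B, mean interval, std of intervals]."""
    # Lowercase letters found in the mapping are treated as intervals.
    intervals = [INTERVAL_VALUES[ch] for ch in sequence if ch in INTERVAL_VALUES]
    # Fall back to 0.0 when a sequence contains no intervals at all.
    mean_interval = mean(intervals) if intervals else 0.0
    std_interval = pstdev(intervals) if intervals else 0.0
    return [sequence.count("A"), sequence.count("B"), mean_interval, std_interval]


features = extract_features("AxByCyA")
```

Rows built this way can be stacked into a 2-D array and passed to a clustering routine.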

This gives me an overall success rate of 82% on the test set. Is there anything I can do to improve the accuracy?
