I'm working with a data set where the data is stored in a string such as
A, B, and C are actions, and
v, w, x, y, z are the times between the actions (each letter represents an interval of time). It's worth noting that
B cannot occur without
C cannot occur without
C is the action I'm attempting to study (i.e., I'd like to be able to predict whether a user will do C based on their prior actions).
I intend to create two clusters: people who do C and those who don't.
From this data set I build a training array to run the scikit-learn (Python) k-means algorithm on. Each row contains the number of As, the number of Bs, the mean time between actions (the average of the intervals), and the standard deviation of the intervals.
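To make the setup concrete, here is a minimal sketch of how such a training array could be built and clustered. It assumes each user's string has already been parsed into a list of actions and a list of intervals; the `build_features` helper, the `users` sample data, and the `KMeans` settings (`n_init`, `random_state`) are illustrative assumptions, not my exact code.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_features(actions, intervals):
    """Return [count of A, count of B, mean interval, std of intervals]."""
    gaps = np.asarray(intervals, dtype=float)
    return [
        actions.count("A"),
        actions.count("B"),
        gaps.mean() if gaps.size else 0.0,  # a user with one action has no intervals
        gaps.std() if gaps.size else 0.0,   # population standard deviation
    ]


# Hypothetical users: (sequence of actions, intervals between consecutive actions)
users = [
    (["A", "B", "C"], [2.0, 4.0]),
    (["A", "A", "B"], [10.0, 30.0]),
    (["A", "B", "A", "B", "C"], [1.0, 3.0, 2.0, 2.0]),
    (["A"], []),
]

# One row per user, one column per feature
X = np.array([build_features(actions, gaps) for actions, gaps in users])

# Two clusters, intended to separate users who do C from those who don't
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```

Note that the cluster ids (0 and 1) are arbitrary, so they still have to be matched to "does C" and "doesn't do C" afterwards.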
This gives me an overall success rate of 82% on the test set, but is there anything I can do to improve the accuracy?
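For reference, a success rate for two clusters can be computed roughly as in the sketch below. Because k-means cluster ids are arbitrary, it scores both possible cluster-to-label mappings and keeps the better one. The `cluster_accuracy` helper and the sample values are hypothetical and only illustrate the idea.

```python
def cluster_accuracy(cluster_ids, did_c):
    """Share of users whose cluster agrees with their label,
    under the better of the two cluster-to-label mappings."""
    matches = sum(int(c) == int(y) for c, y in zip(cluster_ids, did_c))
    direct = matches / len(did_c)
    # Swapping the meaning of cluster 0 and cluster 1 gives 1 - direct
    return max(direct, 1.0 - direct)


# Hypothetical test-set results: cluster id per user, and whether the user did C
cluster_ids = [0, 0, 1, 1, 1]
did_c = [1, 1, 0, 0, 1]
print(cluster_accuracy(cluster_ids, did_c))
```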