Process mining with ML


I have a fairly general question. My dataset consists of N sequences of events. For example, one sequence could be [A,B,C,D,X,Y] and another [A,B,Z], where the letters represent different events. The sequences are at most 80 steps long.

The idea is to predict the next letter (the next step) from the known previous events. As a very simple example, perhaps B always follows A. The next step would be to measure the time of each event; the ultimate goal is to predict how long it takes until the process reaches a specific event.
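To make the task concrete, here is a minimal sketch of the simplest version of it: a bigram count model that predicts the most frequent follower of the last event. The function names `fit_bigram` and `predict_next` are made up for illustration, and the two sequences are just the examples from above, not real data.

```python
from collections import Counter, defaultdict


def fit_bigram(sequences):
    """Count, for every event, which events directly follow it."""
    followers = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, event):
    """Return the most frequent follower of `event`, or None if it was never seen."""
    if event not in followers:
        return None
    return followers[event].most_common(1)[0][0]


# Toy data: the two example sequences from the question.
sequences = [list("ABCDXY"), list("ABZ")]
model = fit_bigram(sequences)

print(predict_next(model, "A"))  # "B" follows "A" in both sequences
```

A model like this only looks one step back, which is why longer contexts (higher-order N-grams, an LSTM) are the natural next thing to try.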

I tried an N-gram model, an MLP neural network, and lastly an LSTM network, which reached around 80% accuracy.

That would not be bad if the events were balanced in the dataset, but they are not. To account for the imbalance I used a weighted loss function when training the LSTM, after which the overall accuracy dropped to around 66%. However, the less frequent classes now have much higher accuracy (still not perfect, but higher). How can I create a model that has the best of both, one that learns the less frequent AND the most frequent classes at the same time?
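Part of the difficulty is that overall accuracy hides what happens to the rare classes. A rough sketch of how per-class recall and balanced accuracy (the mean of the per-class recalls) can be computed is below. The labels are invented toy values, not results from my models, and the helper names are only for illustration.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced toy labels: "B" is frequent, "Z" is rare.
y_true = ["B"] * 8 + ["Z"] * 2
# A model that always predicts the majority class.
y_pred = ["B"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(per_class_recall(y_true, y_pred))  # {'B': 1.0, 'Z': 0.0}
print(balanced_accuracy(y_true, y_pred)) # 0.5
```

Comparing the unweighted and the weighted model on a metric like this, rather than on plain accuracy, makes the trade-off between frequent and rare classes visible in a single number.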

I have also read that tree-based methods perform very well on unbalanced datasets. However, all the examples I found consider one long time series, whereas my data consist of many short time series. Is it possible to train a RandomForest on such data? If so, how?
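One possible way to get many short sequences into the tabular form a tree-based model expects is to cut every sequence into (prefix, next event) pairs with a fixed-length, left-padded window. This is only a sketch under assumptions: the window size of 3, the `<pad>` token, and the helper names `make_prefix_samples` and `encode` are all choices made here for illustration.

```python
PAD = "<pad>"  # assumed placeholder for positions before the sequence starts


def make_prefix_samples(sequences, window=3):
    """Turn each sequence into rows of the last `window` events plus the next event."""
    X, y = [], []
    for seq in sequences:
        for i in range(1, len(seq)):
            prefix = list(seq[max(0, i - window):i])
            padded = [PAD] * (window - len(prefix)) + prefix
            X.append(padded)
            y.append(seq[i])
    return X, y


def encode(X):
    """Map event names to integer codes (padding is always code 0)."""
    vocab = {PAD: 0}
    for row in X:
        for event in row:
            vocab.setdefault(event, len(vocab))
    return [[vocab[event] for event in row] for row in X], vocab


# Toy data: the two example sequences from the question.
sequences = [list("ABCDXY"), list("ABZ")]
X, y = make_prefix_samples(sequences, window=3)
X_enc, vocab = encode(X)

for row, target in zip(X, y):
    print(row, "->", target)
```

The rows in `X_enc` together with `y` could then be passed to a tree-based classifier, for example scikit-learn's `RandomForestClassifier`. Note that integer codes impose an arbitrary order on the events; one-hot encoding each window position may work better, but that has not been tested here.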

If you know of a different algorithm or method that could be applied to such data, please post it :)

Thank you.

Matúš Košík

Posted 2018-08-15T20:46:59.947

Reputation: 1



I suspect that the problem has more to do with your data than with your algorithms. My recommendation is to spend some time studying your data and making sure it is a robust representation of the kinds of problems you expect to solve. If possible, come up with a way to generate extra data. Given that you already have many permutations, you could perhaps write a script that creates additional ones by modifying existing samples according to rules you know to be valid.

David Shapiro

Posted 2018-08-15T20:46:59.947

Reputation: 11