Most efficient algorithm for sequential pattern mining on a dataset with a large number of items in each transaction?


To provide some context, I am trying to do frequent pattern mining on a dataset of system error logs from servers. I organized it into transactions based on the thread ID, which results in some very long transactions (the longest one has 415 items). At first, I kept the items as the error messages themselves (so each transaction was a list of strings), but since there is a finite set of possible error messages, I created a dictionary in which each possible message is encoded as an int. Now each transaction is a list of ints, and the whole dataset is a list of lists of ints.
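
For illustration, the encoding step could look roughly like the sketch below. The function name `encode_transactions` and the sample messages are made up for this example; they are not from my actual code or data.

```python
def encode_transactions(transactions):
    """Map each distinct message string to an int ID, in first-seen order."""
    message_to_id = {}
    encoded = []
    for transaction in transactions:
        row = []
        for message in transaction:
            # Assign the next free ID the first time a message is seen
            if message not in message_to_id:
                message_to_id[message] = len(message_to_id)
            row.append(message_to_id[message])
        encoded.append(row)
    return encoded, message_to_id


# Hypothetical error messages, already grouped by thread ID and ordered by timestamp
raw = [
    ["disk full", "write failed", "disk full"],
    ["timeout", "write failed"],
]
encoded, message_to_id = encode_transactions(raw)
```

Keeping `message_to_id` around makes it possible to translate any mined pattern back into readable error messages.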

I have tried the apriori algorithm and the fp-growth algorithm. Both take a very long time to run (over a day for apriori, somewhat less for fp-growth) and then terminate with "Killed: 9" as the only error message. Nothing else is printed to the console. From investigating online, my best guess is that the process was using too much memory, so it received the SIGKILL signal. I've tried the apyori package, pyfpgrowth, and this one I found on GitHub. I suspect this happens because my transactions have a lot of items, as that's the only difference from the example data in the documentation. I have also tried running on a small subset of my data, only 6 transactions, and the same thing happens. From my understanding of how the algorithms work, they go into a recursive loop.
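
A rough back-of-the-envelope check is consistent with that guess. If the support threshold is low enough that a single long transaction makes an itemset frequent, every non-empty subset of that transaction's distinct items is frequent too, which is 2^n - 1 itemsets for n distinct items. The sketch below only does that arithmetic; the helper name is hypothetical, and the second example assumes all 415 items are distinct, which may not hold for my data.

```python
def max_frequent_itemsets(transaction):
    """Upper bound on the itemsets a single transaction can contribute:
    the number of non-empty subsets of its distinct items (2^n - 1)."""
    distinct = len(set(transaction))
    return 2 ** distinct - 1


# Small example: three distinct items give 7 non-empty subsets
small_bound = max_frequent_itemsets([0, 1, 2])

# Worst case for a 415-item transaction, ASSUMING every item is distinct
worst_case_bound = max_frequent_itemsets(range(415))
print(len(str(worst_case_bound)), "digits")
```

Even a handful of such transactions would be enough to exhaust memory if the miner tries to enumerate all of these itemsets.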

I have discussed pruning some of the error lines from my dataset with the engineers. However, the purpose of this project is to find patterns within these error logs that we haven't considered, and it's difficult to tell whether something is actually insignificant enough to leave out. Is there a more efficient algorithm for long transaction lists (if so, please point to an implementation), or is there anything else I can do to find frequent patterns successfully?

Additionally, I'm worried that these algorithms may not take into account the fact that the order of items within a transaction matters. Each line has a timestamp, so the items form a sequence, which is pretty significant. Finding errors that tend to occur together within a thread ID is helpful, but not as much as finding ones that appear in sequence. In my research, frequent pattern mining and sequential pattern mining appeared to overlap and at times were used interchangeably, so if I am going about this wrong and there is a better approach, I would appreciate hearing about it.
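
To make the distinction concrete, here is a minimal sketch of order-aware mining in the style of PrefixSpan, a commonly cited sequential pattern mining algorithm. This is not something I have run on my data and not what the packages above do; it is a simplified illustration that assumes each sequence is a flat list of int-encoded messages, counts a pattern as present when its items appear in order (gaps allowed), and uses a hypothetical `max_length` parameter to cap pattern length so the output stays bounded.

```python
from collections import defaultdict


def prefixspan(sequences, min_support, max_length=3):
    """Return {pattern_tuple: support} for ordered subsequences.

    min_support is an absolute count of sequences; max_length caps the
    pattern length (an assumption added here to keep the search bounded).
    """
    results = {}

    def grow(prefix, projected):
        # projected: list of (sequence_index, start_position) pairs,
        # i.e. the suffixes that remain after matching the prefix
        if len(prefix) >= max_length:
            return
        # For each item, remember where each suffix first contains it
        occurrences = defaultdict(dict)
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if seq_idx not in occurrences[item]:
                    occurrences[item][seq_idx] = pos + 1
        for item, firsts in occurrences.items():
            if len(firsts) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(firsts)
                # Recurse only on the suffixes that follow this item
                grow(pattern, list(firsts.items()))

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical encoded threads; order within each list is the timestamp order
sequences = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
patterns = prefixspan(sequences, min_support=2)
```

Here `(0, 2)` is found in all three sequences, while `(2, 1)` occurs in only one and is dropped, even though items 1 and 2 co-occur in every sequence. An order-blind itemset miner would not make that distinction.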


Posted 2019-08-20T17:52:26.957

Welcome to DS StackExchange. Please edit this question and add some context. Could you elaborate on your problem and what you have tried so far, and include the full error message? As it is, it's very difficult for other people to understand your question. Thank you – Leevo – 2019-08-22T07:30:43.620

Thank you @Leevo! I didn't realize my question was unclear. I have tried to provide as much relevant information as I can. – Diana – 2019-08-22T14:20:03.800

No answers