I'm trying to use machine learning for security event classification. My goal is to predict the outcome (true positive or false positive) of a specific event. An event contains a set of variables called observables. These can be URLs, IP addresses, file hashes, etc. (8 types altogether). However, one event can contain anywhere from a handful of observables to a huge number. Since I want to predict the outcome based on these observables, my feature vector varies in length, between 1 and 2500. This is an example of the data set:
['user1', '188.8.131.52', 'explorer.exe', NULL, NULL, NULL ...]
['google.com', 'msword.exe', NULL, NULL, NULL ...]
['user3', '184.108.40.206', 'explorer.exe', 'e0d123e5f316bef78bfdf5a008837577', 'http://google.com', NULL ...]
How can I handle this scenario? I'd like to try out both classical classification algorithms and neural networks for comparison.
I ended up using the bag-of-words approach, since the "observables" I mentioned can be interpreted as words in a document. From there my case becomes a fairly standard text classification problem, and I achieved good results with Naive Bayes and a hashing vectorizer.
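A minimal sketch of this approach with scikit-learn, assuming each event's observables are joined into a single space-separated string (the example events, labels, and hash size below are hypothetical, not from my real data). Note that `MultinomialNB` requires non-negative features, so the hashing vectorizer needs `alternate_sign=False`:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each event becomes a "document": its observables joined into one string.
# NULL slots simply disappear, so variable-length events are no longer a problem.
events = [
    "user1 188.8.131.52 explorer.exe",
    "google.com msword.exe",
    "user3 184.108.40.206 explorer.exe e0d123e5f316bef78bfdf5a008837577 http://google.com",
]
labels = [1, 0, 1]  # hypothetical outcomes: 1 = true positive, 0 = false positive

model = make_pipeline(
    # alternate_sign=False keeps the hashed features non-negative for MultinomialNB;
    # n_features is an assumed hash-space size, tune it to your collision tolerance.
    HashingVectorizer(alternate_sign=False, n_features=2**18),
    MultinomialNB(),
)
model.fit(events, labels)
print(model.predict(["user1 explorer.exe"]))
```

One caveat: the default tokenizer splits on punctuation, so an IP address or URL is broken into several tokens. That worked fine for me, but a custom `token_pattern` can keep each observable as a single token if needed.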