## Dealing with highly variable feature set size


I'm trying to use machine learning for security event classification. My goal is to predict the outcome (true positive or false positive) of a specific event. An event contains a set of variables called observables; these can be URLs, IP addresses, file hashes, etc. (8 types altogether). However, one event can contain anywhere from a handful of observables to a huge number, so my feature size varies in length between 1 and 2500. This is an example of the data set:

```
['user1', '1.1.1.1', 'explorer.exe', NULL, NULL, NULL ...]
['google.com', 'msword.exe', NULL, NULL, NULL ...]
['user3', '1.1.1.9', 'explorer.exe', 'e0d123e5f316bef78bfdf5a008837577', 'http://google.com', NULL ...]
```


How can I handle this scenario? I'd like to try out classical classification algorithms as well as neural networks for comparison.

Edit
I ended up using the Bag-of-Words approach, since the "observables" I mentioned can be interpreted as words in a document. From there my case becomes a fairly well-known text classification problem, and I achieved good results with Naive Bayes algorithms and hashing vectorization.
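A minimal sketch of this Bag-of-Words approach, assuming scikit-learn (the specific classes `HashingVectorizer` and `MultinomialNB` and the toy events below are my assumptions, not necessarily the exact setup used). Each event's observables are joined into one "document" string, and the hashing vectorizer maps it to a fixed-length sparse vector regardless of how many observables the event contains:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy events: observables joined by spaces; 1 = true positive, 0 = false positive
events = [
    "user1 1.1.1.1 explorer.exe",
    "google.com msword.exe",
    "user3 1.1.1.9 explorer.exe e0d123e5f316bef78bfdf5a008837577 http://google.com",
    "user2 10.0.0.5 powershell.exe",
]
labels = [1, 0, 1, 0]

# alternate_sign=False keeps feature values non-negative, which MultinomialNB requires
model = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    MultinomialNB(),
)
model.fit(events, labels)
print(model.predict(["user1 1.1.1.1 explorer.exe"]))
```

The fixed `n_features` dimension is what makes the variable event size a non-issue: 3 observables or 2500 observables both hash into the same-width vector.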


Before thinking about what type of algorithm you could use, I would think about how to properly preprocess your data. Depending on how many possible values each of your 8 types can take (if I understood correctly), you could construct a dataset of 0's and 1's, that is, one indicating the presence or absence of each possible value in each event.

This would lead to a sparse matrix, but that is something you can deal with using the right tools; a possibly nice example is in this link

Conclusion:

• I would first try to identify whether there is a fixed set of possible values for each of your types (by grouping IPs by region? is there a fixed number of possible .exe names, URLs, ...?)
• preprocess your data so that http://google.com and google.com become the same value (e.g., by stripping the http:// prefix from URLs)
• if you think the number of possible values is not ridiculously big, you could try constructing the sparse matrix
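The presence/absence encoding described above can be sketched as follows, assuming scikit-learn (`MultiLabelBinarizer` is my choice of tool here, not one named in the answer). It builds one 0/1 column per distinct observable value seen in training, and `sparse_output=True` returns a scipy sparse matrix:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy events from the question, with the NULL padding dropped
events = [
    ["user1", "1.1.1.1", "explorer.exe"],
    ["google.com", "msword.exe"],
    ["user3", "1.1.1.9", "explorer.exe", "e0d123e5f316bef78bfdf5a008837577"],
]

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(events)  # one row per event, one column per distinct observable

print(X.shape)       # (3, 8) -- 8 distinct observables across the 3 events
print(mlb.classes_)  # the observable value behind each column
```

Note the caveat in the comment below: this only works when the vocabulary of possible values is bounded, since every distinct value becomes a column.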

After this, you could think about which kind of algorithm to apply, rather than going straight for a neural network from the beginning.

Unfortunately there is no way to enumerate a set containing the possible feature values. And even though http://google.com and google.com generally mean the same thing, in this scenario they are truly different: one is a URL and the other is a domain, so they hold different meanings. The number of possible values is essentially unbounded.

– ptrsz – 2020-10-21T14:53:28.993


I agree with German C M: there is some structure in your data even though it's not fully structured. So the first task is to transform the data into features that ML can exploit. This is typical feature engineering: the idea is to organize the different types of elements in the data in a way likely to provide useful signals to the algorithm. Many learning algorithms can deal with missing values, so the absence of a particular type of information is not necessarily an issue. Of course it's difficult to give precise advice, since this stage requires expert knowledge.
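One simple form this feature engineering could take (an illustrative assumption on my part, not the answerer's exact method) is collapsing each variable-length event into a fixed-length vector of counts per observable type, so any classifier can consume it. The type-detection rules and type list below are deliberately rough:

```python
import re

# Illustrative subset of the 8 observable types from the question
OBSERVABLE_TYPES = ["user", "ip", "domain", "url", "exe", "hash"]

def classify_observable(obs: str) -> str:
    """Very rough type detection for the sketch; real rules would be stricter."""
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", obs):
        return "ip"
    if obs.startswith(("http://", "https://")):
        return "url"
    if obs.endswith(".exe"):
        return "exe"
    if re.fullmatch(r"[0-9a-f]{32}", obs):  # looks like an MD5 digest
        return "hash"
    if "." in obs:
        return "domain"
    return "user"

def event_features(observables):
    counts = {t: 0 for t in OBSERVABLE_TYPES}
    for obs in observables:
        counts[classify_observable(obs)] += 1
    # Fixed-length vector: one count per type, plus the total event size
    return [counts[t] for t in OBSERVABLE_TYPES] + [len(observables)]

print(event_features(["user3", "1.1.1.9", "explorer.exe",
                      "e0d123e5f316bef78bfdf5a008837577",
                      "http://google.com"]))
# → [1, 1, 0, 1, 1, 1, 5]
```

Whether counts, presence flags, or richer per-type aggregates work best is exactly the expert-knowledge question the answer alludes to.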

Note that there are methods which can technically take such variable-length sequences as input, but it's highly unlikely to work well if the algorithm has to guess everything by itself.

Thanks for the feedback! Indeed it is a feature engineering issue. I regret mentioning the two algorithms, as it makes it seem like my focus is already on the algorithms (when it's obviously not). – ptrsz – 2020-10-22T16:06:06.467