## Machine Learning algorithm for detecting anomalies in large sets of events


1. There is traffic: normal and anomalous. Each traffic sample contains a list of events (of variable size)
2. Events happen in order; the set of possible events has ~40,000 elements
3. The solution should run on relatively small amounts of memory and processing power

Given a traffic sample (of at most 1000 events), what is the best machine learning algorithm that fits the preconditions above for identifying whether it is an anomaly?

Given my limited knowledge in machine learning algorithms, here is what I came up with:

This system can be described very well as a Markov process, but the memory limitations in this scenario make a full Markov model infeasible.

1. Reduced Markov Chains

Store the frequent pairs of events (those that appeared more than 10 times in normal traffic) and then look up each pair from the sample in that set: if a pair does not appear, count it as an anomaly. Then use some heuristic to decide whether the traffic sample as a whole is anomalous.

I called it reduced because, practically, we use only chains of two events; any larger size would become a huge combinatorial problem and fill any memory it is given, which is infeasible.
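A minimal sketch of this pair-based idea (function names, the min-count of 10, and the 0.2 decision threshold are illustrative choices, not fixed by the problem):

```python
from collections import Counter

def event_pairs(sample):
    """Consecutive (event, next_event) pairs of one traffic sample."""
    return list(zip(sample, sample[1:]))

def train_pairs(normal_samples, min_count=10):
    """Keep event pairs seen more than min_count times across normal traffic."""
    counts = Counter()
    for sample in normal_samples:
        counts.update(event_pairs(sample))
    return {pair for pair, n in counts.items() if n > min_count}

def is_anomaly(sample, known_pairs, threshold=0.2):
    """Heuristic: flag the sample if too many of its pairs were never seen."""
    pairs = event_pairs(sample)
    if not pairs:
        return False
    unknown = sum(1 for p in pairs if p not in known_pairs)
    return unknown / len(pairs) > threshold
```

Memory stays bounded by the number of pairs actually observed in normal traffic, which in practice is far below the 40,000² worst case.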

2. Naive KNN

Take all the normal traffic samples (each sample may contain up to 1000 events) and count the number of appearances of each event in each sample. Separate the dataset into 10 parts and compute their means to get a mean frequency for each part (basically, we now have 10 mean-frequency vectors), then use them as positive data points in the KNN algorithm.

Do the same with the anomaly traffic to add 10 more data points. Having the points, we can now use KNN regression to compute a score and make a decision.

This is a bit tricky, because the frequency vectors are quite big, so having too many of them becomes a problem. A solution would be to implement sparse vectors.

Any other ideas? What am I missing?


You can also try these:

3- One-Class SVM (Support Vector Machine):

You can train this with only the normal samples that you have; it learns a boundary around them and flags anything that falls outside as an anomaly. It also scales reasonably well, although training might get quite slow on large data. You can try the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
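A minimal sketch with that scikit-learn class, assuming the traffic samples have already been turned into fixed-size feature vectors (the Poisson toy data, `nu=0.05`, and the tiny 20-dim vocabulary are all stand-ins):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in features: per-sample event-frequency vectors (toy vocabulary)
X_normal = rng.poisson(lam=5.0, size=(200, 20)).astype(np.float64)

# nu bounds the fraction of training points treated as outliers (illustrative)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_normal)  # normal traffic only

x_anomaly = rng.poisson(lam=50.0, size=(1, 20)).astype(np.float64)
label = ocsvm.predict(x_anomaly)[0]  # +1 = inlier, -1 = outlier
```

`decision_function` gives a continuous score if a threshold other than the learned boundary is needed.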

4- Isolation Forest:

Isolation Forest is an ensemble algorithm: it builds a forest of randomized trees (similar in spirit to Random Forest) and flags as outliers the points that the trees can isolate in unusually few splits. As a first step you can try the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
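A short sketch of the scikit-learn version on the same kind of stand-in feature vectors (the `contamination` value is an illustrative guess at the outlier fraction):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.poisson(lam=5.0, size=(200, 20)).astype(np.float64)

# contamination = expected fraction of outliers in the data (illustrative)
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
forest.fit(X_normal)

x_anomaly = rng.poisson(lam=50.0, size=(1, 20)).astype(np.float64)
label = forest.predict(x_anomaly)[0]  # +1 = inlier, -1 = outlier
```

`score_samples` returns a continuous anomaly score (lower = more anomalous), which is handy when a ranking rather than a hard label is wanted.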

5- Deep One-Class Classification:

The other thing that you can try is a neural-network implementation of one-class classification. You should be able to do it easily with Keras, for example. You can take a look at the Keras documentation here: https://keras.io/models/about-keras-models/
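One common variant of this idea — my assumption, not stated in the answer — is an autoencoder trained only on normal traffic, where a high reconstruction error marks a sample as anomalous. A toy NumPy sketch of that principle (a Keras model would follow the same train/score structure; the 2-D subspace data and all hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for frequency vectors: normal samples lie near a 2-D subspace of R^10
basis = rng.normal(size=(2, 10))
X_train = rng.normal(size=(200, 2)) @ basis   # normal traffic only
x_anomaly = rng.normal(size=10) * 5.0         # off-subspace point

# Single-hidden-layer linear autoencoder, trained by plain gradient descent
d, h, lr = 10, 2, 0.01
W_enc = rng.normal(scale=0.1, size=(d, h))
W_dec = rng.normal(scale=0.1, size=(h, d))
for _ in range(1000):
    Z = X_train @ W_enc                        # encode
    E = Z @ W_dec - X_train                    # reconstruction error
    g_dec = Z.T @ E / len(X_train)
    g_enc = X_train.T @ (E @ W_dec.T) / len(X_train)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

def score(x):
    """Anomaly score: squared reconstruction error (high = anomalous)."""
    return float(np.sum((x - (x @ W_enc) @ W_dec) ** 2))
```

Because the network only ever saw normal traffic, it reconstructs normal samples well and anomalous ones poorly; thresholding `score` gives the decision.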


Before using the KNN algorithm you could reduce the dimensionality by applying, e.g., a singular value decomposition (SVD) or a PCA. This will reduce the complexity of that particular algorithm.

In addition, using sparse vectors is always a good idea. Another approach to minimizing memory consumption is to cast all values to the smallest type that can hold them.
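Both suggestions combine naturally: a sketch with scikit-learn's `TruncatedSVD`, which accepts sparse input directly, followed by a cast to a small dtype (the density, 20 components, and random data are illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for sparse 40,000-dim event-frequency vectors (0.1% non-zeros)
X = sparse_random(100, 40_000, density=0.001, format="csr",
                  random_state=0, dtype=np.float32)

# TruncatedSVD works on the sparse matrix without densifying it first
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X).astype(np.float32)  # cast to a small dtype
```

The KNN step then runs on 20-dimensional dense float32 rows instead of 40,000-dimensional vectors.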