Autoencoder for anomaly detection from feature vectors



I am trying to use an autoencoder (as described here) for anomaly detection. I am using ~1700-dimensional feature vectors (rather than images, which were used in the example), with each vector describing a different protein interaction. I have a "normal" category of interactions on which I train the AE; I then feed it new vectors and use the reconstruction error to detect anomalous interactions.
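To make the setup concrete, here is a minimal sketch of the idea with made-up data in place of my real vectors (using sklearn's `MLPRegressor` trained to reconstruct its input as a stand-in for a proper autoencoder framework; the dimensions and threshold choice are illustrative only):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Stand-in for the "normal" interaction feature vectors
X_normal = rng.normal(size=(500, 20))

# An MLP trained to reproduce its own input acts as an autoencoder;
# the narrow hidden layer is the bottleneck
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
ae.fit(X_normal, X_normal)

def reconstruction_error(model, X):
    # Mean squared error per sample between input and reconstruction
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Flag anything whose error exceeds, say, the 95th percentile of training error
train_err = reconstruction_error(ae, X_normal)
threshold = np.percentile(train_err, 95)

X_new = rng.normal(loc=3.0, size=(10, 20))  # clearly shifted "anomalies"
is_anomaly = reconstruction_error(ae, X_new) > threshold
```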

Adjusting my threshold so that I get a true positive rate of 0.95, I get a false positive rate of 0.15, which is rather high. When I trained xgboost on the normal and anomalous vectors (using both types of interactions in training and testing), I was able to get a precision of 0.98 **.
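The threshold tuning I describe can be done with sklearn's `roc_curve`, picking the smallest threshold that reaches the target true positive rate and reading off the false positive rate there (synthetic scores below, just to illustrate the procedure):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# Hypothetical reconstruction errors: higher score = more anomalous
scores = np.concatenate([rng.normal(0.0, 1.0, 1000),   # normal
                         rng.normal(2.0, 1.0, 200)])   # anomalous
labels = np.concatenate([np.zeros(1000), np.ones(200)])

fpr, tpr, thresholds = roc_curve(labels, scores)
# First threshold index reaching at least a 0.95 true positive rate
idx = int(np.argmax(tpr >= 0.95))
chosen_threshold = thresholds[idx]
achieved_fpr = fpr[idx]
```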

Does that mean that my model (or indeed my approach of using an AE) is ineffective, or maybe this is the best I could hope for when training an anomaly detector rather than a 2 category classifier (that is, xgboost in my case)? How should I proceed?

** Of course, this is merely a sanity check, and cannot be used as the solution. I need the model to detect anomalies that can be very different from those I currently have - thus I need to train it on the normal interaction set, and leave anomalies for testing alone.


Posted 2018-01-25T13:36:24.913

Reputation: 552



Does that mean that my model (or indeed my approach of using an AE) is ineffective

Well, it depends. Autoencoders are a quite broad field, and there are many hyperparameters to tune: width, depth, loss function, optimizer, and number of epochs.

How should I proceed?

My gut feeling is that you don't have enough data to train the AE properly. Keep in mind that the MNIST training set contains 60,000 images, and you need enough variance in order not to overfit your training data. Tree-based approaches are, at least in my experience, easier to train. If you want to stick with the anomaly detection approach, which I recommend since you don't know what anomalies you will face, try the Isolation Forest algorithm. But for a solid recommendation I would need to know what your data looks like.
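Isolation Forest is available in sklearn and can be fitted on the normal data alone, mirroring your one-class setup (toy data here; `predict` returns +1 for inliers and -1 for outliers):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 20))          # stand-in for normal interactions
X_anom = rng.normal(loc=4.0, size=(20, 20))    # clearly shifted anomalies

# Fit on the normal category only, as in the question's setup
iso = IsolationForest(n_estimators=200, random_state=0).fit(X_normal)

pred = iso.predict(X_anom)  # -1 marks points the forest isolates quickly
```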

By the way, a good metric to use in such a case is the ROC AUC score, which basically tells you how likely it is that your model ranks a randomly chosen anomaly above a randomly chosen normal point. Check out this link for a visual explanation: ROC explained
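Computing it takes one call to sklearn's `roc_auc_score`, given the true labels and the anomaly scores (here, reconstruction errors would play the role of the scores; the tiny example below uses hand-picked numbers where every anomaly scores higher than every normal point, so the AUC is perfect):

```python
from sklearn.metrics import roc_auc_score

# 1 = anomaly; the score is e.g. the reconstruction error (higher = more anomalous)
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.8, 0.9]

auc = roc_auc_score(labels, scores)  # 1.0: the ranking is perfectly separated
```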

So the bottom line is: try less complex approaches until you are certain that they are not sufficient.



I did not have much luck with the Isolation Forest; that is why I tried an AE. Is there any material available regarding the tuning of AE hyperparameters, or should I try everything that comes to mind and see what sticks? – Lafayette – 2018-01-25T17:10:17.350

@user9084663 I didn't read anything specific about tuning AEs, but I am afraid it is one of the hard trial-and-error tasks that come with these algorithms. There is a common approach where you split your data set into three parts: cross-validation, training, and test (20/60/20 for example). You'd use the CV part for hyperparameter tuning, finding the parameters with which the algorithm works best (measured by some metric). I think that is also called grid search, basically a brute-force method. – RyanMcFlames – 2018-01-25T17:37:28.180
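The split-and-grid-search approach above can be sketched like this (toy data and an `MLPRegressor` standing in for the AE; the grid values are arbitrary examples):

```python
import itertools
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))  # stand-in for the normal vectors

# 60/20/20 split: train, validation (for hyperparameter selection), test
X_train, X_tmp = train_test_split(X, test_size=0.4, random_state=0)
X_val, X_test = train_test_split(X_tmp, test_size=0.5, random_state=0)

def val_error(hidden, lr):
    # Fit one AE configuration and score it on the held-out validation part
    ae = MLPRegressor(hidden_layer_sizes=hidden, learning_rate_init=lr,
                      max_iter=300, random_state=0).fit(X_train, X_train)
    return np.mean((ae.predict(X_val) - X_val) ** 2)

# Brute-force grid over a couple of bottleneck widths and learning rates
grid = list(itertools.product([(4,), (8,)], [1e-3, 1e-2]))
best = min(grid, key=lambda p: val_error(*p))
```

The test split is only touched once, at the very end, with the winning configuration.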

@user9084663 I've also heard of random search, which uses some statistical tools to optimize the search, but I don't know any open-source frameworks that have implemented it. And the current big thing, as far as I know, is evolutionary algorithms and Bayesian networks. But I am afraid that is far beyond my current knowledge. – RyanMcFlames – 2018-01-25T17:39:48.200


@user9084663 maybe this thread will help you: link – RyanMcFlames – 2018-01-25T20:22:17.497


Adding StandardScaler from sklearn.preprocessing improved the results somewhat, as did (in this case) making the net deeper.
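For anyone trying the same fix: the key point is to fit the scaler on the normal training set only and reuse its statistics for new vectors (toy data below in place of my real features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 20))  # stand-in vectors
X_new = rng.normal(loc=5.0, scale=3.0, size=(10, 20))

# Fit on the normal training data only, then apply the same
# per-feature mean/std to anything fed to the autoencoder later
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_new_s = scaler.transform(X_new)
```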

