Using an autoencoder for anomaly detection on categorical data



Say a dataset has ~2400 features in total, of which 0.5% are continuous and 99.5% are categorical (binary). Each observation belongs to one of two classes: Fraud (1) or Not Fraud (0). There is also a large class imbalance, with only 2.6% of examples being Fraud and the other ~97% being Not Fraud.

Say we want to predict whether a given example is Fraud or Not Fraud, and we take an anomaly detection approach using autoencoders.

Given the mixed data types in the dataset, will an autoencoder trained only on the Not Fraud examples generally perform well at detecting Fraud examples? Is there any literature suggesting which architectures work best, or whether some preprocessing (such as scaling and PCA) should be performed beforehand? I ask because I feel an autoencoder may be hard to train with binary features.


Posted 2018-07-09T15:53:37.967

Reputation: 1 316

Is there any chance that you could also train it on Fraud examples? They are quite an important part of the equation. – mapto – 2018-07-09T17:50:02.880



In general, an autoencoder should perform well at detecting fraud examples. Because it is trained only on Not Fraud examples, fraud examples should in theory have a much higher reconstruction error. When it comes to training the autoencoder on binary data, I agree with you that it can be quite challenging. I suggest taking a look at this blog:
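To make the reconstruction-error idea concrete, here is a minimal NumPy sketch rather than a definitive implementation. Everything in it is an assumption made for illustration: the synthetic data (normal rows are noisy copies of a few binary prototypes, "fraud" rows are random bits), the hypothetical `TinyAutoencoder` class, the layer sizes, and the 99th-percentile threshold. It uses a sigmoid output with binary cross-entropy as the reconstruction error, which is one common choice for binary features; continuous features would need their own loss term and are left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20  # number of binary features in this toy example (assumption)

def make_data(n, anomalous):
    """Synthetic rows: normal = noisy copy of one of 4 prototypes, anomalous = random bits."""
    if anomalous:
        return rng.integers(0, 2, size=(n, D)).astype(float)
    prototypes = np.array(
        [[(j // (k + 1)) % 2 for j in range(D)] for k in range(4)], dtype=float
    )
    rows = prototypes[rng.integers(0, 4, size=n)]
    flips = rng.random((n, D)) < 0.05  # flip about 5% of the bits as noise
    return np.where(flips, 1.0 - rows, rows)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

class TinyAutoencoder:
    """One tanh hidden layer, sigmoid output, trained with full-batch gradient descent."""

    def __init__(self, d, h, rng):
        self.W1 = rng.normal(0, 0.1, (d, h))
        self.b1 = np.zeros(h)
        self.W2 = rng.normal(0, 0.1, (h, d))
        self.b2 = np.zeros(d)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        return H, sigmoid(H @ self.W2 + self.b2)

    def fit(self, X, lr=0.2, epochs=1000):
        n = len(X)
        for _ in range(epochs):
            H, P = self.forward(X)
            dZ = (P - X) / n  # gradient of cross-entropy w.r.t. output logits
            dA = (dZ @ self.W2.T) * (1.0 - H ** 2)  # backprop through tanh
            self.W2 -= lr * (H.T @ dZ)
            self.b2 -= lr * dZ.sum(axis=0)
            self.W1 -= lr * (X.T @ dA)
            self.b1 -= lr * dA.sum(axis=0)
        return self

    def score(self, X):
        """Per-row reconstruction error: mean binary cross-entropy over features."""
        _, P = self.forward(X)
        P = np.clip(P, 1e-7, 1 - 1e-7)
        return -(X * np.log(P) + (1 - X) * np.log(1 - P)).mean(axis=1)

# Train on normal rows only, then score held-out normal and anomalous rows.
X_train = make_data(500, anomalous=False)
X_normal = make_data(200, anomalous=False)
X_fraud = make_data(20, anomalous=True)

model = TinyAutoencoder(D, 8, rng).fit(X_train)
threshold = np.quantile(model.score(X_train), 0.99)  # assumed cut-off

scores_normal = model.score(X_normal)
scores_fraud = model.score(X_fraud)
print("mean error, normal:", scores_normal.mean())
print("mean error, fraud: ", scores_fraud.mean())
print("fraud rows flagged:", int((scores_fraud > threshold).sum()), "of", len(X_fraud))
```

On real data the gap between the two score distributions is usually much less clean than in this toy setup, so the threshold is better tuned on a validation set that contains some labelled Fraud examples.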

Andreas Look

Posted 2018-07-09T15:53:37.967

Reputation: 863