Categorical data for sklearn's Isolation Forest

4

I'm trying to do anomaly detection with Isolation Forests (IF) in sklearn. Besides being a great method for anomaly detection, I also want to use it because about half of my features are categorical (font names, etc.).

I have too many categories to use one-hot encoding (1000+ for a single feature, and that is just one of many features), and in any case I'm looking for a more robust way to represent the data.

Also, I want to experiment with other clustering techniques later on, so I don't necessarily want to use label encoding, as it will misrepresent the data in Euclidean space.

I thus have a two-part question:

  1. How will label encoding (i.e. ordinal numbers) affect tree-based methods such as the Isolation Forest? Seeing as they aren't distance-based, they shouldn't make assumptions about ordinal data, right?

  2. What other feature transformations can I consider for distance-based models?

amateurjustin

Posted 2018-07-25T14:29:33.800

Reputation: 55

Answers

1

I would really try not to use ordinal numbers for categorical data. It imposes a false magnitude and ordering on the data, especially when you have 1,000 categories. For example, the difference between Brush Script and Calibri could be very small and the difference between Calibri and Times New Roman UNBELIEVABLY HUGE (assuming lexicographical assignment), when really they're all just different fonts.
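To make the lexicographical point concrete, here is a minimal sketch in plain Python. The font list is made up for illustration. It shows how the numeric gap between codes depends only on alphabetical position, and how a single threshold split (the kind a tree makes) can only separate contiguous runs of codes.

```python
# Hypothetical font names, used only for illustration.
fonts = [
    "Times New Roman", "Calibri", "Brush Script", "Arial",
    "Courier New", "Georgia", "Helvetica", "Verdana",
]

# Lexicographic label encoding: sort the names and number them 0..n-1.
codes = {name: i for i, name in enumerate(sorted(fonts))}

# The numeric gap between two fonts is an artifact of the alphabet,
# not of any real similarity between the fonts.
gap_small = abs(codes["Brush Script"] - codes["Calibri"])
gap_large = abs(codes["Calibri"] - codes["Times New Roman"])
print(gap_small, gap_large)

# A tree-style split "code <= threshold" groups fonts by alphabetical
# position; it cannot put Arial and Verdana on the same side without
# also including everything between them.
threshold = 2.5
left = [name for name, code in codes.items() if code <= threshold]
print(left)
```

So even though a tree does not compute distances, the arbitrary ordering still decides which groups of categories a single split can isolate.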

You could:

  1. Try to figure out groupings of similar features that make sense, then one-hot those groupings so you wouldn't end up with too many columns.
  2. One-hot the whole thing and then try some dimensionality reduction techniques to get the feature space back down to something sensible.
  3. Try to use an autoencoder or neural method to learn an embedding of fixed dimension.
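As a rough sketch of option 2, assuming scikit-learn and NumPy are available: the data below is synthetic, the column meanings (a 1000-level font column and a 50-level style column) are invented, and `n_components=20` is an arbitrary choice. `TruncatedSVD` is used instead of `PCA` because it accepts the sparse matrix that `OneHotEncoder` produces. Whether the reduced components are actually informative depends on your data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_rows = 5000

# Synthetic stand-in for two high-cardinality categorical columns
# (hypothetical: ~1000 font names and ~50 style names, stored as strings).
cats = np.column_stack([
    rng.integers(0, 1000, size=n_rows).astype(str),
    rng.integers(0, 50, size=n_rows).astype(str),
])

# One-hot encode; the result is a sparse matrix, so memory stays manageable.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(cats)

# Reduce the wide sparse matrix to a small dense representation.
svd = TruncatedSVD(n_components=20, random_state=0)
reduced = svd.fit_transform(onehot)

print(onehot.shape, reduced.shape)
```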

One thing you should definitely be careful of is how you combine the result of this process with whatever the other half of your features are.
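One possible way to handle the combination step, sketched under the assumption that the data lives in a pandas DataFrame: a `ColumnTransformer` encodes the categorical columns and scales the numeric ones separately, and the result feeds an `IsolationForest`. The column names and the synthetic data are hypothetical. Scaling makes little difference to the trees themselves, but it matters if the same preprocessing is later reused with a distance-based model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic mixed-type data; all column names are made up for illustration.
df = pd.DataFrame({
    "font": rng.choice(["Arial", "Calibri", "Times New Roman", "Verdana"], size=n),
    "font_size": rng.normal(12.0, 1.5, size=n),
    "line_count": rng.poisson(40, size=n).astype(float),
})

# Encode categorical and numeric columns separately, then concatenate.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["font"]),
    ("num", StandardScaler(), ["font_size", "line_count"]),
])

model = make_pipeline(
    preprocess,
    IsolationForest(n_estimators=100, random_state=0),
)

labels = model.fit_predict(df)        # +1 for inliers, -1 for outliers
scores = model.decision_function(df)  # lower scores are more anomalous
print(labels[:10], scores[:3])
```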

Matthew

Posted 2018-07-25T14:29:33.800

Reputation: 904

1Hi @Matthew, Thanks for the answer. I have never considered doing something like PCA over one-hot vectors. Will that work? Also, seeing as I'll firstly be using tree based methods, does it really matter with ordinal data not being that representative? I mean, it's not a distance based model? But that's what I think. I could be completely wrong here and I haven't gotten around to finding that answer myself. – amateurjustin – 2018-07-26T11:10:29.753

1However, I do think that embeddings should be the answer. Have you any knowledge of some well defined examples? Maybe even a library at this stage. I need to have a POC real soon and don't want to painstakingly be writing code when I could quickly implement a library that does embedding itself. – amateurjustin – 2018-07-26T11:13:12.360

I think PCA would only be valuable if you have additional details regarding the fonts, such as serif vs. non-serif, and other attributes about the font itself. Otherwise, PCA on just the one-hot encoded font field is simply reducing it down to its original essence. That's my expectation at least. – theStud54 – 2019-12-16T14:08:22.723

2

I coded an isolation forest with a dataset containing both categorical and numeric features, and it is working properly. How is that possible?

Shivanya

Posted 2018-07-25T14:29:33.800

Reputation: 51

How did you do it? Is your categorical data represented with numbers? When I use strings in sklearn.ensemble.IsolationForest I get an error. – Maverick Meerkat – 2019-09-09T08:00:00.073

No, it is strings only, and I am not using any built-in isolation forest library. – Shivanya – 2019-09-10T08:04:37.503