Hashing Trick - what actually happens



When ML algorithms, e.g. Vowpal Wabbit or some of the factorization machines winning click-through-rate competitions (Kaggle), mention that features are 'hashed', what does that actually mean for the model? Let's say there is a variable that represents the ID of an internet ad, which takes on values such as '236BG231'. I understand that this feature is then hashed to a random integer. But my question is:

  • Is the integer now used in the model as an integer (numeric), OR
  • is the hashed value still treated like a categorical variable and one-hot-encoded, so that the hashing trick is just a way to save space with large data?


Posted 2014-10-10T03:48:54.660

Reputation: 692



The second bullet is the value in feature hashing. Hashing and one-hot encoding into a sparse representation saves space. Depending on the hash algorithm and the size of the hash space, you get varying degrees of collisions, which act as a kind of dimensionality reduction.
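To make that concrete, here is a minimal sketch of the idea, not how Vowpal Wabbit or any particular library implements it. The names `hash_index` and `hash_row`, the bucket count, and the use of MD5 are all my own assumptions for illustration; real tools typically use a faster non-cryptographic hash. Each (feature, value) pair is hashed straight to a column index, so no dictionary of category levels is ever stored, and the row is kept sparse.

```python
import hashlib

def hash_index(feature_name, value, n_buckets=2**18):
    """Map a (feature, value) pair to a column index in [0, n_buckets).

    MD5 is an illustrative stand-in; it is used here only because it is
    deterministic across runs (Python's built-in hash() is salted).
    """
    key = f"{feature_name}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets

def hash_row(row, n_buckets=2**18):
    """Return a sparse row as {column_index: value}.

    Each categorical value contributes 1.0 to its hashed column, which is
    one-hot encoding without a stored vocabulary. If two values collide,
    they share a column and their contributions add up.
    """
    sparse = {}
    for name, value in row.items():
        idx = hash_index(name, value, n_buckets)
        sparse[idx] = sparse.get(idx, 0.0) + 1.0
    return sparse

# Hypothetical example row; 'site' is a made-up second feature.
row = {"ad_id": "236BG231", "site": "example.com"}
sparse_row = hash_row(row)
```

Shrinking `n_buckets` makes the model smaller at the cost of more collisions, which is the dimensionality-reduction effect mentioned above.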

Also, in the specific case of Kaggle, feature hashing and one-hot encoding help with feature expansion/engineering: you take all possible tuples of features (usually just second order, but sometimes third) and hash them, collisions included. This explicitly creates interaction features, which are often predictive even when the individual features are not.
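A rough sketch of the second-order case, under the same caveats: `bucket` and `hashed_pairs` are hypothetical names, MD5 is just a deterministic stand-in hash, and the bucket count is arbitrary. Every pair of (feature, value) tokens in a row is joined into one string and hashed to its own column.

```python
import hashlib
from itertools import combinations

def bucket(key, n_buckets):
    """Deterministically map a string key to an index in [0, n_buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets

def hashed_pairs(row, n_buckets=2**20):
    """Hash every second-order combination of features in a row.

    Tokens are sorted first so the same pair always produces the same
    key regardless of the order of the input dict.
    """
    tokens = sorted(f"{name}={value}" for name, value in row.items())
    sparse = {}
    for a, b in combinations(tokens, 2):
        idx = bucket(a + "&" + b, n_buckets)
        sparse[idx] = sparse.get(idx, 0.0) + 1.0
    return sparse

# Hypothetical row with three categorical features -> three pairs.
row = {"ad_id": "236BG231", "site": "example.com", "device": "mobile"}
pairs = hashed_pairs(row)
```

The number of pairs grows quadratically with the number of features, which is why hashing into a fixed number of buckets is convenient here: the model size stays bounded no matter how many interactions are generated.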

In most cases this technique, combined with feature selection and elastic net regularization in logistic regression (LR), behaves very similarly to a neural network with one hidden layer, so it performs quite well in competitions.


Posted 2014-10-10T03:48:54.660

Reputation: 951

So one-hot encoding is still used, just on the hashed values, which as you say saves space and can cause dimensionality reduction (given collisions). Is that correct? – B_Miner – 2014-10-12T00:08:00.330

One-hot encoding isn't a required part of hashing features, but it is often used alongside it since it helps a good bit with predictive power. One way to think of one-hot encoding is as transforming a feature from a set of N discrete values into a set of N binary questions. Perhaps it's not important for me to know whether feature J is 2 or 3, only that it's not 4. One-hot encoding makes that distinction explicit. This helps a lot with linear models, whereas ensemble approaches (like RF) will scan break points in the feature to find that distinction. – cwharland – 2014-10-12T00:15:58.410
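The "N binary questions" view from the comment above can be sketched in a couple of lines. The helper `one_hot` and the category list are made up for illustration.

```python
def one_hot(value, categories):
    """Turn one discrete value into len(categories) yes/no answers."""
    return [1 if value == c else 0 for c in categories]

categories = [2, 3, 4]          # the N possible values of feature J
encoded = one_hot(3, categories)  # answers to "is J == 2?", "== 3?", "== 4?"
```

A linear model gets a separate weight per column, so it can learn something specific about "J is 4" without assuming that 2 < 3 < 4 means anything numerically.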