How will a rotation matrix affect contestants in machine learning contests?

3

Machine learning contests like Kaggle usually lay out the machine learning task in a human-understandable way. For example, they might tell you the meaning of the input features. But what if a contest doesn't want to expose the meaning of its input data? One approach I can think of is to apply a (random) rotation matrix to the features, so that no resulting feature has an obvious meaning.
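A minimal sketch of this idea in Python with NumPy (the data here is random and just stands in for a real feature matrix; the QR trick is a standard way to draw a random orthogonal matrix):

```python
import numpy as np

rng = np.random.default_rng(42)
n_features = 5

# Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix.
A = rng.normal(size=(n_features, n_features))
Q, R = np.linalg.qr(A)
Q = Q * np.sign(np.diag(R))  # fix column signs for a well-distributed draw

# Force det(Q) = +1 so Q is a proper rotation, not a reflection.
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]

# Obfuscate: each rotated feature becomes a mix of all original features.
X = rng.normal(size=(100, n_features))  # stand-in for the real data
X_obfuscated = X @ Q
```

Each column of `X_obfuscated` is a linear combination of all original columns, so no single feature retains its original interpretation.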

A rotation of the input space shouldn't change a model's ability to separate the positives from the negatives (using binary classification as an example) -- after all, the same hyperplane, after the same rotation is applied to it, still separates the examples. What the rotation can change is the distribution of each individual feature (i.e. a single feature's values across all examples), if a contestant cares about those. However, PCA is invariant to rotations of the input: the rotation only rotates the principal axes, so the PCA-projected data is unchanged. A contestant who works on the PCA-ed version of the input therefore sees nothing different.
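To illustrate the PCA point: rotating the inputs leaves the singular values of the centered data (and hence the explained variances) unchanged, and the PCA projections match up to per-component sign flips (assuming distinct singular values, which holds generically). A quick check with NumPy, using random data as a stand-in for real features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # hypothetical raw features
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal "rotation"

def pca(X):
    """Center the data, return (singular values, projections onto components)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return s, Xc @ Vt.T

s_raw, proj_raw = pca(X)
s_rot, proj_rot = pca(X @ Q)

# Variances along the principal axes are unchanged by the rotation...
assert np.allclose(s_raw, s_rot)
# ...and the projected data matches up to per-component sign flips.
assert np.allclose(np.abs(proj_raw), np.abs(proj_rot))
```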

How much do contestants rely on statistical analysis of the (raw, i.e. non-PCA-ed) input features? Is there anything else I should be aware of that a rotation could change for a contestant during such a contest?

Roy

Posted 2017-05-08T18:56:04.020

Reputation: 281

Answers

5

Kaggle competitions with clean, anonymised and opaque numerical features are often popular. My opinion is they are popular because they are more universally accessible - all you need is to have studied at least one ML supervised learning approach, and maybe have a starter script that loads the data, and it is very easy to make a submission. The competitions become very focused on optimising parameters, picking best model implementations and ensembling techniques. The more advanced competitors will also refine and check their CV approaches very carefully, trying to squeeze the last iota of confidence out of them in order to beat the crowd climbing the public leaderboard.

Examples of historic Kaggle competitions with obfuscated data include Otto Group Product Classification and BNP Paribas Cardif Claims Management. In some of these competitions the data is adjusted to preserve the anonymity of users who might otherwise be identified from the records. In other cases it is less clear what the sponsor's motivation is.

However, there are negative consequences (you will find these complained about in the same competitions):

  • Use of insight from domain knowledge, or exploration of the underlying principles of the subject being predicted, is effectively blocked. The impact of this is hard to assess, but it is possible that the sponsors miss out on potentially better models.

  • Doing "just" the machine learning side can be a bit too mechanical and boring for some competitors, who may not try as hard.

How much do contestants rely on statistical analysis of the (raw, i.e. non-PCA-ed) input features?

There are always data explorations and views of data published in forums (and Kaggle's scripts - called kernels), and many people view, upvote and presumably use the insights from them. I recall at least one competition forum thread where there was a lot of discussion about weird patterns appearing in data, which were probably an artefact of obfuscation (sorry I cannot find the thread now).

With obfuscated data, there can be attempts to de-obfuscate, and they have sometimes been partially successful.

Neil Slater

Posted 2017-05-08T18:56:04.020

Reputation: 24 613

Thank you very much for your answer. It's very informative. – Roy – 2017-05-09T00:16:36.560