Can distribution values of a target variable be used as features in cross-validation?


I came across an SVM predictive model where the author used the probability distribution of the target variable as features in the feature set. For example:

The author built a model for each gesture of each player to guess which gesture would be played next. Calculated over 1,000 games played, the distribution might look like (20%, 10%, 70%). These numbers were then used as feature variables to predict the target variable in k-fold cross-validation.

Is that legitimate? It seems like cheating. I would think you would have to exclude the target values of your test set when calculating these features in order not to "cheat".


Posted 2015-01-26T18:20:58.203

Reputation: 400



After speaking with some experienced statisticians, this is what I got.

As for technical issues with the paper, I'd be worried about data leakage, that is, using future information in the current model. This can also happen in cross-validation. You should make sure each model trains only on past data and predicts on future data. I'm not sure exactly how they conducted CV, but it definitely matters, and it's non-trivial to prevent all sources of leakage. They do claim to evaluate on unseen examples, but the paper doesn't make explicit what code they actually ran. I'm not saying they are leaking for sure, only that it could happen.
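To make "train only on past data, predict on future data" concrete, here is a minimal sketch of forward-chaining splits over time-ordered samples. The function name `forward_chaining_splits` and the equal-block scheme are my own assumptions for illustration; nothing here is taken from the paper.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for time-ordered data.

    Every training index precedes every test index, so no split
    lets the model see observations from its own future.
    """
    if n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * block
        # The last split absorbs any leftover samples.
        test_end = n_samples if i == n_splits else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(forward_chaining_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Any feature derived from the target (such as a gesture distribution) would then be computed from the `train_idx` rows only.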


Posted 2015-01-26T18:20:58.203

Reputation: 400


There is nothing necessarily wrong with this. If you have no better information, then using past performance (i.e., prior probabilities) can work pretty well, particularly when your classes are very unevenly distributed.

Example methods using class priors are Gaussian Maximum Likelihood classification and Naïve Bayes.


Since you've added additional details to the question...

Suppose you are doing 10-fold cross-validation (holding out a different 10% of the data for validation in each of the 10 folds). If you use the entire data set to establish the priors (including the 10% of validation data), then yes, it is "cheating", since each of the 10 fold models uses information from its own validation set (i.e., it is not truly a blind test). However, if the priors are recomputed for each fold using only the 90% of the data used to train that fold, then it is a "fair" validation.
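The difference between the two approaches can be sketched in a few lines. This is only an illustration under my own assumptions: the gesture names, the tiny label list, and the unshuffled contiguous folds are all hypothetical, and the helper names are not from the original model.

```python
from collections import Counter


def class_distribution(labels, classes):
    """Relative frequency of each class in `labels`, ordered as `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds (no shuffling, for clarity)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        yield list(range(start, start + size))
        start += size


def fold_wise_priors(labels, classes, k):
    """For each fold, compute the distribution from the training rows only."""
    n = len(labels)
    priors = []
    for val_idx in kfold_indices(n, k):
        held_out = set(val_idx)
        train_labels = [labels[i] for i in range(n) if i not in held_out]
        priors.append(class_distribution(train_labels, classes))
    return priors


# Hypothetical gestures and outcomes, for illustration only.
classes = ["rock", "paper", "scissors"]
labels = ["rock", "rock", "paper", "scissors", "scissors",
          "scissors", "rock", "scissors", "paper", "scissors"]

leaky = class_distribution(labels, classes)    # also sees validation labels
fair = fold_wise_priors(labels, classes, k=5)  # one distribution per fold

print("global (leaky):", leaky)
for i, dist in enumerate(fair):
    print(f"fold {i} (train only):", dist)
```

The fold-wise distributions differ from the global one exactly because each leaves out the labels it will later be scored on.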

An example of the effect of this "cheating" is if you have a single, extreme outlier in your data. Normally, with k-fold cross-validation, there would be one fold where the outlier is in the validation data and not the training data. When applying the corresponding classifier to the outlier during validation, it would likely perform poorly. However, if the training data for that fold included global statistics (from the entire data set), then the outlier would influence the statistics (priors) for that fold, potentially resulting in artificially favorable performance.
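A small numeric sketch of the outlier effect, using made-up values and a z-score as a stand-in for how "surprising" the outlier looks to a model fitted on a given set of statistics. This is an assumption-laden toy, not a reproduction of any particular classifier.

```python
from statistics import mean, pstdev


def z_score(value, reference):
    """Distance of `value` from the mean of `reference`, in standard deviations."""
    return abs(value - mean(reference)) / pstdev(reference)


# Made-up data: nine ordinary values and one extreme outlier.
values = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.05, 0.95, 1.15, 50.0]
outlier = values[-1]
train_only = values[:-1]  # the fold in which the outlier is held out

z_fair = z_score(outlier, train_only)  # statistics from training data only
z_leaky = z_score(outlier, values)     # statistics include the outlier itself

print("z-score with training-only statistics:", z_fair)
print("z-score with global statistics:", z_leaky)
```

With global statistics the outlier inflates the mean and standard deviation, so it looks far less extreme than it would to a model that had never seen it.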


Posted 2015-01-26T18:20:58.203

Reputation: 826

That depends on how you receive/process the training data. If you receive the training examples in a batch (without a time associated with each example) and you want to build a classifier that you will apply to all future observations, then no, you do not need to update the distribution parameters. But if you are attempting to do online learning where the classifier is updated after each example, then you may want to update the parameters (e.g., using the N previous observations).
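For the online case, one possible sketch keeps the distribution over the N most recent observations and updates it only after each prediction. The class name `RollingDistribution` and the uniform fallback for an empty history are my own assumptions.

```python
from collections import Counter, deque


class RollingDistribution:
    """Class frequencies over the `window` most recent observations."""

    def __init__(self, classes, window):
        self.classes = list(classes)
        self.recent = deque(maxlen=window)  # old observations drop off automatically

    def current(self):
        """Distribution available *before* the next observation is seen."""
        if not self.recent:
            # No history yet: fall back to uniform (an assumption).
            return [1 / len(self.classes)] * len(self.classes)
        counts = Counter(self.recent)
        return [counts[c] / len(self.recent) for c in self.classes]

    def update(self, label):
        """Record the true label after it has been predicted."""
        self.recent.append(label)


dist = RollingDistribution(classes=["a", "b"], window=3)
start = dist.current()
for label in ["a", "a", "b"]:
    dist.update(label)
after_three = dist.current()
dist.update("b")  # the oldest "a" falls out of the window
after_four = dist.current()
print(start, after_three, after_four)
```

Calling `current()` before `update()` at each step keeps the feature for an example free of that example's own label.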

– bogatron – 2015-01-29T00:52:13.443

I believe my question was not clearly stated initially. – Climbs_lika_Spyder – 2015-01-30T12:45:16.903


I agree that there is nothing wrong with using this type of feature. I have used them for inter-arrival times, for example, in modeling work. I have noticed, however, that many features of this kind have "interesting" covariance relationships with each other, so you have to be really careful about using multiple distribution features in one model.


Posted 2015-01-26T18:20:58.203

Reputation: 245


As bogatron and Paul already said, there is nothing wrong with using the prediction from one classifier as a feature in another classifier. Actually, so-called "Cascading classifiers" work that way. From Wikipedia:

Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output from a given classifier as additional information for the next classifier in the cascade.

This can be helpful not only for informing later classifiers with new features but also as an optimization measure. In the Viola-Jones object detection framework, a set of weak classifiers is applied sequentially in order to reduce the amount of computation in the object recognition task. If one of the weak classifiers fails to recognize an object of interest, the other classifiers don't need to be computed.
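The early-exit idea can be sketched as follows. The stages here are hypothetical toy predicates on a number, not the Haar-feature classifiers Viola-Jones actually uses, and `cascade_predict` is a name I made up for illustration.

```python
def cascade_predict(x, stages):
    """Run stages in order and stop at the first one that rejects `x`.

    Returns (accepted, number_of_stages_evaluated).
    """
    evaluated = 0
    for stage in stages:
        evaluated += 1
        if not stage(x):
            return False, evaluated  # later stages are never computed
    return True, evaluated


# Toy stages, ordered from cheapest to most selective (illustrative only).
stages = [
    lambda x: x > 0,
    lambda x: x % 2 == 0,
    lambda x: x < 100,
]

for candidate in (-4, 3, 8, 200):
    print(candidate, cascade_predict(candidate, stages))
```

Most candidates are rejected by an early stage, which is where the computational saving comes from.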

Robert Smith

Posted 2015-01-26T18:20:58.203

Reputation: 828

This is not related to my question. – Climbs_lika_Spyder – 2015-01-30T12:43:44.360

The model you're describing is an instance of a cascading classifier. Therefore, this is a legitimate practice used fairly frequently, so this information is very related to your question. – Robert Smith – 2015-01-30T21:11:36.943

I noticed you changed your question. If the first model was computed without looking at the labels and then its output was taken as features for the second classifier, that's not cheating. You don't say whether the first model was built within cross-validation, though. – Robert Smith – 2015-01-30T21:17:57.050

There is only one model. I did reword my question, since it seems people were confused. – Climbs_lika_Spyder – 2015-02-01T12:40:21.317