Sklearn StratifiedKFold code explanation


While reading a blog post, I came across the following code snippet:

from sklearn.cross_validation import StratifiedKFold
eval_size = 0.10
kf = StratifiedKFold(y,round(1./eval_size))
train_indices, valid_indices = next(iter(kf))
X_train, y_train = X[train_indices], y[train_indices]
X_valid, y_valid = X[valid_indices], y[valid_indices]

and I am unable to understand how it works. Can anyone please help me with an explanation? Thanks


Posted 2016-08-01T14:29:32.783


If you are not comfortable with the code snippet from this Kaggle blog entry, you can use sklearn.model_selection.train_test_split instead, setting the stratify argument. – tagoma – 2017-08-26T19:29:08.077



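A KFold split divides the data into as many folds as you specify. StratifiedKFold additionally ensures that every fold, and therefore every training and validation set, contains roughly the same percentage of each class as the full dataset (see the sklearn documentation for more). In the old sklearn.cross_validation API used here, the StratifiedKFold constructor takes two arguments: the array of labels (for binary classification this would be an array of 1's and 0's) and the number of folds. They have set the number of folds to round(1./eval_size) where eval_size = 0.10, so this is a 10-fold split in which each validation fold holds about 10% of the data. Note that sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20; in sklearn.model_selection the class takes n_splits instead, and the labels are passed to its split method.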
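To see the stratification in action, here is a minimal sketch using the current sklearn.model_selection API. The toy labels (80 of class 0, 20 of class 1) and the dummy feature matrix are made up for illustration and are not from the blog post.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 samples of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
# Dummy features; only the labels determine how the folds are stratified.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=10)

fold_sizes = []
positive_fractions = []
for train_idx, valid_idx in skf.split(X, y):
    # Size of this validation fold and its share of class-1 samples.
    fold_sizes.append(len(valid_idx))
    positive_fractions.append(y[valid_idx].mean())

print(fold_sizes)
print(positive_fractions)
```

Every validation fold should come out with the same class mix as the full label array, which is the whole point of using the stratified variant.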

train_indices, valid_indices = next(iter(kf))

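This line takes only the first of the 10 splits produced by kf: train_indices holds the row indices of the training set (about 90% of the data) and valid_indices holds those of the validation set (the remaining 10%).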

Now that we have the indices, the last two lines of the snippet use them to index X and y and actually split the data (this assumes X and y are NumPy arrays).
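For reference, a rough equivalent of the whole snippet against the current sklearn.model_selection API might look like the sketch below. The X and y arrays are hypothetical toy data standing in for the blog's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 samples with 2 features, 80/20 class balance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

eval_size = 0.10
n_splits = int(round(1.0 / eval_size))  # 10 folds

# The labels now go to split() rather than to the constructor.
skf = StratifiedKFold(n_splits=n_splits)

# Take only the first (train, validation) pair of index arrays.
train_indices, valid_indices = next(skf.split(X, y))

# Use the index arrays to slice the data.
X_train, y_train = X[train_indices], y[train_indices]
X_valid, y_valid = X[valid_indices], y[valid_indices]

print(X_train.shape, X_valid.shape)
```

The other nine splits are simply never consumed, so the result is a single stratified 90/10 train/validation split.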

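It is interesting that they wrote next(iter(kf)) and then indexed the dataset by hand when they could have used sklearn.cross_validation.train_test_split (sklearn.model_selection.train_test_split in current versions). With the stratify argument set, train_test_split does essentially the same thing: internally it takes the first split of a StratifiedShuffleSplit rather than a StratifiedKFold, so the rows are shuffled before splitting, but the result is likewise a single stratified train/validation split. It is more readable and it's already a function in sklearn.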
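A minimal sketch of the train_test_split route, again on made-up toy data; the random_state value is an arbitrary choice here, used only to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features, 80/20 class balance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

print(X_train.shape, X_valid.shape)
print(y_train.mean(), y_valid.mean())
```

Unlike the StratifiedKFold version, this shuffles the rows before splitting, so the validation set is not just the first block of each class.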

Hope this helps!


Posted 2016-08-01T14:29:32.783
