Train/Test/Validation Set Splitting in Sklearn

111

45

How can I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, and y_val with sklearn? As far as I know, sklearn.cross_validation.train_test_split is only capable of splitting into two sets, not three...

Hendrik

Posted 2016-11-15T14:55:04.130

Reputation: 6 637

Answers

146

You could just use sklearn.model_selection.train_test_split twice: first to split into train and test, and then split train again into train and validation. Something like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2

hh32

Posted 2016-11-15T14:55:04.130

Reputation: 1 752

Yes, this works of course but I hoped for something more elegant ;) Never mind, I accept this answer. – Hendrik – 2016-11-17T08:10:53.463

I wanted to add that if you want to use the validation set to search for the best hyper-parameters you can do the following after the split: https://gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b

– skd – 2018-06-06T16:34:04.537
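For reference (I haven't reproduced the linked gist), one common way to use a fixed validation set in a hyper-parameter search is sklearn's PredefinedSplit; a minimal sketch, reusing X_train/X_val/y_train/y_val from the splits above with a placeholder SVC estimator and grid:

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# -1 marks rows used only for fitting; 0 marks the single validation fold
# that GridSearchCV scores each candidate on.
X_search = np.concatenate([X_train, X_val])
y_search = np.concatenate([y_train, y_val])
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=PredefinedSplit(test_fold))
search.fit(X_search, y_search)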

So what is the final train, test, validation proportion in this example? On the second train_test_split you are splitting the 80% left over from the first 80/20 split, so your val is 25% of that 80%, i.e. 20% of the total. The split proportions aren't very straightforward this way. – Monica Heddneck – 2018-06-14T19:22:37.847

I agree with @Monica Heddneck that the 60% train, 20% validation and 20% test split could be clearer. It's an annoying inference you have to make with this solution. – Perry – 2019-06-25T08:00:45.290

If test_size is an integer, this function will take exactly that many elements for the test set, so you can pre-compute the number of elements in each subset from your proportions and use those values to do a double split. – GJCode – 2019-11-10T10:39:42.520
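For example, a minimal sketch of that pre-computation (the 60/20/20 target and random_state are placeholders):

from sklearn.model_selection import train_test_split

# Pre-compute absolute counts for a 60% / 20% / 20% split, then split twice.
n_test = int(0.2 * len(X))
n_val = int(0.2 * len(X))

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=n_test, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=n_val, random_state=1)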

I found this answer useful, so I thought I would add some explanatory text regarding the numbers. The first split creates 80% training+validation and 20% test. The second split starts from that 80% and assigns 25% of it to validation: 0.25 x 0.80 = 0.20, i.e. 20% of the original data. The training split is the remaining 75% of the 80%: 0.75 x 0.80 = 0.60, i.e. 60%. Overall, this gives 60%-20%-20% for train-validation-test. – edesz – 2020-10-20T03:53:52.190
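A quick sanity check of those numbers on a hypothetical 100-row toy dataset:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20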

I don't have any labels....how do I do the split? – Charlie Parker – 2021-02-13T20:28:16.050

50

There is a great answer to this question over on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for training, validation and test sets.
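If your features and labels live in separate NumPy arrays rather than in one dataframe, the same idea can be applied to a shuffled index array (a minimal sketch, not part of the linked answer):

import numpy as np

# Shuffle an index array, split it at the 60% and 80% marks, then index X and y.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = np.split(idx, [int(0.6 * len(X)), int(0.8 * len(X))])
X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]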

0_0

Posted 2016-11-15T14:55:04.130

Reputation: 645

I can see the .6 meaning 60%... but what does the .8 mean? – Tom Hale – 2019-05-11T05:02:55.637

@TomHale np.split will split at 60% of the length of the shuffled array, then at 80% of the length (which is an additional 20% of the data), thus leaving the remaining 20%. This is due to the definition of the function. You can test/play with: x = np.arange(10.0), followed by np.split(x, [int(len(x)*0.6), int(len(x)*0.8)]) – 0_0 – 2019-05-14T13:35:08.820

This is fantastic, such a simple, straightforward method. I always tried shuffling the indexes and then selecting the first X%, and so on. Just great! – devplayer – 2020-03-11T11:24:51.343

A major benefit of train_test_split is stratification. – HashRocketSyntax – 2020-10-05T01:16:21.527
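For example, a minimal sketch of a stratified three-way split (assumes y holds class labels):

from sklearn.model_selection import train_test_split

# Stratify both calls on the labels so the class proportions are preserved
# in train, validation and test alike (60% / 20% / 20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=1)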

11

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

from sklearn.model_selection import train_test_split

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
# the _junk suffix means that we drop that variable completely
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

print(x_train, x_val, x_test)

Andrei Florea

Posted 2016-11-15T14:55:04.130

Reputation: 111

I think this is the best answer and should be accepted. What do you mean by "# the _junk suffix means that we drop that variable completely" though? – PascalIv – 2020-06-12T07:52:39.000

And I think the shuffle argument should be set to False in the second call, simply because there is no reason to shuffle again. – PascalIv – 2020-06-12T08:01:28.040

8

You can use train_test_split twice. I think this is the most straightforward way.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

This way, the train, val, and test sets will be 60%, 20%, and 20% of the dataset, respectively.

David Jung

Posted 2016-11-15T14:55:04.130

Reputation: 181

4

Most often you will not split the data just once: in a first step you split your data into a training and a test set, and subsequently you perform a parameter search that incorporates more complex splitting schemes such as k-fold cross-validation or leave-one-out (LOO).
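A minimal sketch of that workflow, with a placeholder estimator and parameter grid:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Hold out a test set once, then let k-fold cross-validation play the role of
# the validation set during the parameter search.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))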

JLT

Posted 2016-11-15T14:55:04.130

Reputation: 141

3

The best answer above does not mention that splitting two times with train_test_split without adjusting the partition sizes won't give the initially intended partitions:

import numpy as np
from sklearn.model_selection import train_test_split

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The proportions of the validation and test sets within x_remain then change, and can be computed as:

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way, all of the initially intended proportions are preserved.

A.Ametov

Posted 2016-11-15T14:55:04.130

Reputation: 131

2

Extension of @hh32's answer with preserved ratios.

from sklearn.model_selection import train_test_split

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

Since the remaining dataset is reduced after the first split, new ratios with respect to the reduced dataset must be calculated by solving the equation:

$ R_{remaining} \cdot R_{new} = R_{old}$

Jorge Barrios

Posted 2016-11-15T14:55:04.130

Reputation: 131

This is a correct implementation! Thank you! @Jorge Barrios – amc – 2020-09-10T17:26:35.343

1

Here's another approach (assumes equal three-way split):

import numpy as np

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records make up one third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third of the dataframe)
train = df[:third]

# Testing set (the middle third of the dataframe)
test = df[third:][:third]

# Validation set (the bottom third)
valid = df[-third:]

This can be made more concise but I kept it verbose for explanation purposes.

Vishal

Posted 2016-11-15T14:55:04.130

Reputation: 248

1

Given train_frac=0.8, this function creates an 80% / 10% / 10% split:

import sklearn.model_selection

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Features to be split
    param labels:     Labels to be split
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives a 80% / 10% / 10% split
    '''

    assert train_frac >= 0 and train_frac <= 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = sklearn.model_selection.train_test_split(
                                        examples, labels, train_size=train_frac, random_state=random_state)

    X_val, X_test, Y_val, Y_test   = sklearn.model_selection.train_test_split(
                                        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test,  Y_train, Y_val, Y_test

Tom Hale

Posted 2016-11-15T14:55:04.130

Reputation: 171

1

How about using numpy's random choice?

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def ttv_split(X, y = None, train_size = .6, test_size = .2, validation_size = .2, random_state = 42):
    """
    Basic approach using np random choice
    """
    np.random.seed(random_state)
    X = pd.DataFrame(X, columns = ["col_" + str(i) for i in range(X.shape[1])])
    size = sum((train_size,test_size,validation_size))
    n_samples = X.shape[0]
    if size != 1:
        raise ValueError(f"train_size, test_size and validation_size must sum to 1.0, got {size}")
    else:
        split_series = np.random.choice(a = ["train","test","validation"], p = [train_size, test_size, validation_size], size = n_samples)
        split_series = pd.Series(split_series)
        
        
        X_train, X_test, X_validation = X.iloc[split_series[split_series == "train"].index,:], X.iloc[split_series[split_series == "test"].index,:], X.iloc[split_series[split_series == "validation"].index,:]
        
        if y is not None:
            y = pd.DataFrame(y,columns=["target"])
            
            y_train, y_test, y_validation = y.iloc[split_series[split_series == "train"].index,:], y.iloc[split_series[split_series == "test"].index,:], y.iloc[split_series[split_series == "validation"].index,:]
            
            return X_train,X_test,X_validation,y_train,y_test,y_validation
        else:
            return X_train,X_test,X_validation
            

X, y = load_iris(return_X_y=True)

X_train,X_test,X_validation,y_train,y_test,y_validation = ttv_split(X, y)

Julio Jesus

Posted 2016-11-15T14:55:04.130

Reputation: 1 083

0

import numpy as np
import pandas as pd

#length of data 
N = 10
scale=2


# generate toy data
X, y = np.arange(N*scale).reshape((N, scale)), np.arange(N)

#Works for pandas dataframe too
#You can download titanic.csv from here 
#https://github.com/fuwiak/faster_ds/blob/master/sample_data/titanic.csv

#df = pd.read_csv("titanic.csv", sep="\t")
#X=df[df.columns.difference(["Survived"])]
#y=df["Survived"]



def train_test_val(X, y, train_ratio, test_ratio, val_ratio):
    # Note: splits the data in order, without shuffling.
    assert sum([train_ratio, test_ratio, val_ratio]) == 1.0, "all ratios must sum to 1.0"
    assert X.shape[0] == len(y), "X and y shape mismatch"

    ind_train = int(round(X.shape[0]*train_ratio))
    ind_test = int(round(X.shape[0]*(train_ratio+test_ratio)))

    X_train = X[:ind_train]
    X_test = X[ind_train:ind_test]
    X_val = X[ind_test:]

    y_train = y[:ind_train]
    y_test = y[ind_train:ind_test]
    y_val = y[ind_test:]

    return X_train, X_test, X_val, y_train, y_test, y_val
# put ratio as you wish
X_train, X_test, X_val, y_train, y_test, y_val=train_test_val(X, y, 0.8, 0.1, 0.1) 

fuwiak

Posted 2016-11-15T14:55:04.130

Reputation: 1 285

0

Run it twice. Here is the math for the 2nd test_size.

Let's say I want {train:0.67, validation:0.13, test:0.20}

The first test_size is 20% which leaves 80% of the original data to be split into validation and training data.

(1.0/(1.0-test_size))*validation_size = second_test_size

# (1.0/(1.0-0.20))*0.13 = 0.1625

Also, look into the stratify parameter, as that is the real reason to use train_test_split as opposed to selecting random row indices.
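Putting that together, a minimal sketch (the 0.1625 comes from the formula above; stratify assumes classification labels):

from sklearn.model_selection import train_test_split

# Target: train 0.67, validation 0.13, test 0.20.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1625, stratify=y_rest, random_state=1)  # 0.13 / 0.80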

HashRocketSyntax

Posted 2016-11-15T14:55:04.130

Reputation: 473