How to handle preprocessing (StandardScaler, LabelEncoder) when using a data generator to train?



So, I have a dataset that is too big to load into memory all at once. Therefore I want to use a generator to load batches of data to train on.

In this scenario, how do I encode and scale the features using LabelEncoder and StandardScaler from scikit-learn?

Some more context:

I have 10+ million samples with 23 features and 1 label column in a database.

My setup used to be (when I had ~3 million samples): load the data into pandas via SQL, perform some more feature extraction, apply LabelEncoder to some features, do a train/test split, and then fit StandardScaler on the training features. After that, I fit my Keras model.

However, this workflow is no longer possible on my machines because of the amount of data (MemoryErrors).

I'm looking into using keras.utils.Sequence to load batches of data instead of holding everything in memory at once. This way I would only need the complete list of indexes and one full batch in memory at a time.

However, how would I go about label encoding and, more importantly, feature scaling in this scenario? And given the context, is this a correct approach?


Posted 2019-01-19T09:10:36.127


There are surely better ways, but for scaling you can try loading just the variable you want to scale, processing it, and saving it afterwards. Maybe this will also work for label encoding, depending on the number of classes. – Let's try – 2020-08-01T10:55:20.447



Fitting the scaler on your training features only is the correct approach. That way, no information from the test set leaks into training.

For feature scaling, if you have too many samples to fit your scaler at once, you can use the partial_fit method of StandardScaler in sklearn. Load your training features sequentially and call partial_fit on each batch. Once finished, the scaler is ready to transform your training and testing batches.
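As a minimal sketch of the incremental fit, assuming the batches come from some chunked reader: here `iter_batches` is a hypothetical stand-in for your database loader, and the random `X_train` array only simulates data that would not normally be in memory.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def iter_batches(X, batch_size):
    """Hypothetical stand-in for reading successive chunks from the database."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size]


# Simulated training features (23 columns, as in the question).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(10_000, 23))

# One pass over the training data: the running mean and variance
# are updated batch by batch, so only one batch is needed at a time.
scaler = StandardScaler()
for batch in iter_batches(X_train, batch_size=1_000):
    scaler.partial_fit(batch)

# The fitted scaler can now transform any batch (training or testing).
first_batch_scaled = scaler.transform(X_train[:1_000])
```

Inside a generator or a keras.utils.Sequence, you would call `scaler.transform` on each batch just before returning it. The scaler only needs to be fitted once, so it could also be saved to disk and reused.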

For label encoding, either you already have an array containing all the labels, in which case you can fit your LabelEncoder directly, or you have to load all your data sequentially to collect the distinct labels before fitting (LabelEncoder does not have a partial_fit method).
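One way to collect the distinct labels without holding the full column in memory is to accumulate them in a set while streaming. This is only a sketch: `iter_label_batches` is a hypothetical loader, and the small `colors` array stands in for a categorical column read in chunks.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def iter_label_batches(values, batch_size):
    """Hypothetical stand-in for reading one categorical column in chunks."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


# Simulated categorical column.
colors = np.array(["red", "green", "blue", "green",
                   "red", "yellow", "blue", "red"])

# First pass: gather every distinct value seen in any batch.
seen = set()
for batch in iter_label_batches(colors, batch_size=3):
    seen.update(np.unique(batch).tolist())

# Fit the encoder once on the complete set of distinct values.
encoder = LabelEncoder()
encoder.fit(sorted(seen))

# Later, each batch can be encoded independently.
encoded = encoder.transform(colors[:3])
```

Only the set of distinct values has to fit in memory, which is usually far smaller than the data itself. If the data lives in a SQL database, asking the database for the distinct values of the column would likely achieve the same thing without a Python-side pass.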


Posted 2019-01-19T09:10:36.127
