Difference in labelling and normalizing train/test data



I am working on a dataset comprising almost 17,000 data points. Since it's a financial dataset covering many different companies, I necessarily have to split it by date. So, supposing I have 10 years of data, I train on the first 8 years and test on the remaining 2. I'm fairly sure this approach is consistent with the classification problem I need to solve.

I am using an LSTM network to predict the direction of financial returns from a set of features derived from companies' financial statements. Given that I obtain training accuracy greater than test accuracy with almost any architecture and hyperparameter configuration, I suspect there is something wrong in the way I have manipulated the dataset.

Here come my concerns. I labelled my dataset by looking at the median return and assigning 1 if the return for a single data point (a company's value at a specific date) is above that median, 0 otherwise. Am I correct in computing two different medians, so that I label the training set using its own median return and, likewise, the test set using its own? Or should I compute the median over the entire dataset, label it, and then split?

Moreover, I scaled the training data to the range (0, 1). Should I apply the same normalization to my test set? I did, but I wasn't sure about it.

It's essentially my first application of neural networks, and I need some clarification on how to treat the dataset without biasing the results.


Posted 2019-01-15T09:27:31.487

Reputation: 67



Ideally, there should be no information leakage between your training and test sets. You need to scale your test data using the bounds found for the training data, and you need to calculate the median with respect to your training data. Think of receiving the test samples one by one, not as a batch: how would you then calculate a test median, and how would you scale each sample?
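A minimal sketch of this in plain Python (the toy return values are hypothetical; in practice the two lists would come from your 8-year/2-year date split, and the min-max step is what scikit-learn's `MinMaxScaler` does with `fit` on the train set followed by `transform` on both):

```python
import statistics

# Hypothetical toy returns standing in for the date-based split
train_returns = [0.02, -0.01, 0.05, 0.00, -0.03, 0.04]
test_returns = [0.01, -0.02, 0.06]

# Label BOTH sets with the median of the training period only,
# so no information from the test period leaks into the labels
train_median = statistics.median(train_returns)
y_train = [1 if r > train_median else 0 for r in train_returns]
y_test = [1 if r > train_median else 0 for r in test_returns]

# Min-max scaling: learn the bounds on the training data,
# then reuse those same bounds for the test data
lo, hi = min(train_returns), max(train_returns)
scale = lambda xs: [(x - lo) / (hi - lo) for x in xs]
x_train = scale(train_returns)
x_test = scale(test_returns)  # values may fall outside [0, 1]; that's expected

print(train_median, y_train, y_test, x_test)
```

Note that scaled test values can land outside (0, 1) when the test period contains returns beyond the training range; that is normal and not something to "fix" by re-fitting on the test set.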



Reputation: 291

If I understand correctly, I should compute the median only on the train set and then label both the train and test sets according to that median. This is reasonable to me. The scaling I did simply bounds each feature of the train set to the range (0, 1) using MinMaxScaler from scikit-learn. I suppose, therefore, that I have to apply the same scaling to the test set. – Alexbrini – 2019-01-15T11:16:54.737

Yes. For the scaler, you use the fit method with the training data so it learns the bounds, then use the transform method to transform both your training and test data. – gunes – 2019-01-15T11:26:40.290

I have corrected my code to do the following: `scaler = MinMaxScaler(); minmax_scale = scaler.fit(df_train); x_train = minmax_scale.transform(df_train); x_test = minmax_scale.transform(df_test)`. It doesn't change my out-of-sample accuracy much, but I think that's also due to the nature of my data. Anyway, now I'm treating the test set correctly. – Alexbrini – 2019-01-15T13:41:37.177