I have an LSTM-based network that takes a sequence of n steps as input (n x 300) and outputs the next single step (1 x 300).
The "raw" data consists of a few thousand semi-processed sequences of variable length, where each step is 1 x 300, so each sequence X is (n x 300).
This is how I generate the training and test data from these semi-processed original sequences:
- drop all original sequences shorter than K=9
- apply a sliding window with stride 1 and length K=9 to each original sequence kept
- shuffle the generated data
- split the result into train / dev / test sets
Now every input X is [9 x 300] and every target Y is [1 x 300].
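For concreteness, the steps above can be sketched roughly as follows. This is only a minimal NumPy sketch, not my actual pipeline: the function name `sliding_windows` and the synthetic `raw` sequences are made up for illustration. Note that under this scheme a sequence of exactly K steps yields no sample, because a window also needs one following step as its target.

```python
import numpy as np

K = 9           # window length
FEATURES = 300  # width of each step

def sliding_windows(seq, k=K):
    """Window one sequence of shape (n, FEATURES) with stride 1.

    Returns X of shape (n - k, k, FEATURES) and y of shape (n - k, FEATURES),
    where y[i] is the step that immediately follows window X[i].
    """
    n = seq.shape[0]
    if n <= k:  # k input steps plus one target step are required
        return (np.empty((0, k, seq.shape[1]), dtype=seq.dtype),
                np.empty((0, seq.shape[1]), dtype=seq.dtype))
    X = np.stack([seq[i:i + k] for i in range(n - k)])
    y = seq[k:]
    return X, y

# Synthetic stand-in for the raw data (hypothetical lengths 5, 9, 12, 20).
rng = np.random.default_rng(0)
raw = [rng.normal(size=(n, FEATURES)).astype(np.float32) for n in (5, 9, 12, 20)]

kept = [s for s in raw if s.shape[0] >= K]    # drop sequences shorter than K
pairs = [sliding_windows(s) for s in kept]    # window each kept sequence
X = np.concatenate([p[0] for p in pairs])
y = np.concatenate([p[1] for p in pairs])

perm = rng.permutation(len(X))                # shuffle windows, then split
X, y = X[perm], y[perm]
```

With these synthetic lengths, only the sequences of 12 and 20 steps contribute windows (3 and 11 respectively).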
The resulting network starts overfitting around epoch 10, which led me to 1. This in itself is not a problem, since the results are of good enough quality.
If I feed it sequences as short as 6 steps (6 x 300, i.e. ones that were dropped in the first step), the predicted y is still informative enough for us.
Hence the questions are about good practices and improving the network.
Technically, I am using Keras, and Google Cloud ML to run the hyperparameter search.
The questions are:
- Should I stick with generating data with sliding windows, since it is working properly?
- If I keep the sliding window: K=9 is more or less a best guess. Should I treat K as a hyperparameter and search for the value that minimizes the loss? I am hesitant because hyperparameters typically affect the network itself, not the training data set, whereas K determines the window length and therefore the data set itself. Would I be introducing undesired bias by tuning it?
- If I keep the sliding window: should I generate y with shape [9 x 300], i.e. y = [x2, x3, ..., x9, x10], instead of the current single-step y? If I try both output structures and pick the one with the lower loss on the dev set, should I then check the result against a separate validation set?
- If I drop the sliding window: should I train with one sequence per original time series, and use attention or a bidirectional LSTM? 2
- Shuffling produces batches of shuffled partial sentences instead of batches drawn from the same original sequence. But since I shuffle before splitting the data into training and test sets, it seems I am not doing this the best way (or even properly). Should I first split the original long sentences into the train/dev/test sets, and only then break them up with the sliding window (if applicable)?
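To make the last alternative concrete, here is a rough sketch of splitting by original sequence first and windowing each split afterwards, so that overlapping windows from one sequence never end up on both sides of the split. It also shows the per-step target layout y = [x2, ..., x10] next to the current single-step one. Everything here is an assumption for illustration: the helper names `split_sequences` and `make_windows`, the 80/10/10 fractions, and the synthetic data are not part of my real setup.

```python
import numpy as np

FEATURES = 300  # width of each step

def split_sequences(sequences, fractions=(0.8, 0.1), seed=0):
    """Shuffle whole sequences, then split them into train/dev/test lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(sequences))
    n_train = round(fractions[0] * len(sequences))
    n_dev = round(fractions[1] * len(sequences))
    train = [sequences[i] for i in order[:n_train]]
    dev = [sequences[i] for i in order[n_train:n_train + n_dev]]
    test = [sequences[i] for i in order[n_train + n_dev:]]
    return train, dev, test

def make_windows(sequences, k, per_step_targets=False):
    """Apply a stride-1 window of length k to every sequence in one split.

    per_step_targets=False: y[i] is the single next step, shape (FEATURES,).
    per_step_targets=True:  y[i] is the window shifted by one, shape (k, FEATURES).
    """
    xs, ys = [], []
    for seq in sequences:
        for i in range(seq.shape[0] - k):  # empty range if the sequence is too short
            xs.append(seq[i:i + k])
            if per_step_targets:
                ys.append(seq[i + 1:i + k + 1])
            else:
                ys.append(seq[i + k])
    if not xs:
        y_shape = (0, k, FEATURES) if per_step_targets else (0, FEATURES)
        return np.empty((0, k, FEATURES), np.float32), np.empty(y_shape, np.float32)
    return np.stack(xs), np.stack(ys)

# Synthetic stand-in: 20 sequences of 15 steps each (hypothetical sizes).
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(15, FEATURES)).astype(np.float32) for _ in range(20)]

train, dev, test = split_sequences(sequences)  # split BEFORE windowing
X_train, y_train = make_windows(train, k=9)
X_dev, y_dev = make_windows(dev, k=9)
_, y_train_seq = make_windows(train, k=9, per_step_targets=True)
```

Because `k` is just an argument here, each candidate K would rebuild the dev windows from the same held-out sequences. The per-step targets would presumably need an LSTM with `return_sequences=True` in Keras, and the windows inside each split can still be shuffled after this step.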