## RNNs for time series prediction - what configurations would make sense


My question here is mostly about general intuition: when using an RNN (LSTM) to predict a time series, with the goal of, for example, predicting 100 steps ahead on a series of one single feature, what kinds of configurations would make sense for the layer sizes of a simple input | hidden | output RNN, and for the window size (assuming you want to look at more than one point at a time, so you pick a "window" / "interval")?

More precisely:

• use only 1 input neuron and look at one point at a time, or pick a window/interval? Obviously, by looking at only 1 point, you lean entirely on the memory aspect of the network to detect useful patterns... would this be a good thing?
• when using a window, does it make sense to use a set of overlapping windows, e.g. to slide a window of 10 points one point at a time, or non-overlapping windows?
• should the number of units in the hidden layer roughly equal the number of steps you aim to predict ahead? (similarly, is there a relationship between window size and the number of steps to predict ahead?)
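To make the overlapping vs. non-overlapping distinction concrete, here is a minimal sketch (the `make_windows` helper and all values are made up for illustration) of slicing a one-feature series into fixed-length windows with numpy:

```python
import numpy as np

def make_windows(series, window, stride):
    """Slice a 1-D series into fixed-length windows.

    stride=1 gives fully overlapping (sliding) windows;
    stride=window gives non-overlapping windows.
    """
    n = (len(series) - window) // stride + 1
    return np.stack([series[i * stride : i * stride + window] for i in range(n)])

series = np.arange(25)                                  # toy series of one feature
overlapping = make_windows(series, window=10, stride=1)
disjoint = make_windows(series, window=10, stride=10)
print(overlapping.shape)   # (16, 10): one window per possible start position
print(disjoint.shape)      # (2, 10): only floor-division-many windows
```

The sliding variant produces many more (highly correlated) training examples from the same data, which is exactly what the overfitting concern discussed in the answer below is about.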

Or, if answering this is too hard or time consuming, where could one get the intuitions for answering such questions?

(Because obviously the space of possible configurations is huge, and unless you have a ton of time or resources, you need some intuitions so you can start from a configuration that "roughly works"...)


> should the number of units in the hidden layer roughly equal the number of steps you aim to predict ahead? (also, similar thing with the relationship between window size and steps to predict ahead)

Don't forget that a single neuron can saturate over time: its activation can drift ever closer to -1 or to 1. That gives it considerable flexibility across several consecutive timesteps, so you don't need the number of neurons to match the number of timesteps.

You can verify this by plotting a single neuron of a simple RNN on graph paper, with pen and paper. For every square (timestep), draw the neuron's activation from its input and its previous activation. At some particular timestep, assume the input is no longer 0 (for example, -1) and observe how the activation oscillates into a wave from that timestep onwards. That is quite flexible for a single neuron, to be honest. Graphing 2 neurons would be even more entangled. Such a wave of only a few neurons can hold and encode a lot of information (exponentially more as neurons are added).
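The pen-and-paper exercise above can also be run as code. This is a sketch with made-up, untrained weights (`w_in`, `w_rec` are arbitrary illustrative values, not from any real model); a negative recurrent weight makes the single neuron oscillate once the input switches from 0 to -1:

```python
import numpy as np

# Hypothetical single-neuron simple RNN: h[t] = tanh(w_in * x[t] + w_rec * h[t-1]).
w_in, w_rec = 1.0, -1.5     # illustrative values; w_rec < 0 induces oscillation
h = 0.0
trace = []
for t in range(12):
    x = 0.0 if t < 4 else -1.0   # input stays 0, then switches to -1 at t = 4
    h = np.tanh(w_in * x + w_rec * h)
    trace.append(round(h, 3))
print(trace)
```

Until the input changes, the activation stays flat at 0; from t = 4 onward it swings between negative and positive values every step, which is the "wave" described above.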

> when using a window, does it make sense to use a set of overlapping-windows, eg. to slide a window of 10 points one point at a time, or non-overlapping windows?

This is similar to another question: sliding window leads to overfitting in LSTM?

Overall, a sliding window is a bad approach for an LSTM and tends to lead to overfitting. A better approach is to scan the sequence backwards and forwards before outputting an answer: this is a Bidirectional LSTM, or you could even look into Attention. If you need to output an answer every frame, relying on the neurons as-is would be reasonable; after all, LSTMs are quite good at remembering things, due to the "accumulation" and the entangling that allows for oscillations (mentioned above).
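The bidirectional idea can be sketched without any deep-learning library. This toy version (the `scan` recurrence and its weight are hypothetical stand-ins for a real LSTM cell) runs one pass forward and one backward, then stacks the two states so every position sees context from both directions:

```python
import numpy as np

def scan(seq, w_rec=0.5):
    """Toy recurrent scan standing in for an LSTM: h[t] = tanh(x[t] + w_rec * h[t-1])."""
    h, out = 0.0, []
    for x in seq:
        h = np.tanh(x + w_rec * h)
        out.append(h)
    return np.array(out)

def bidirectional(seq):
    """Run the scan forward and backward, then pair the states per timestep."""
    fwd = scan(seq)
    bwd = scan(seq[::-1])[::-1]   # backward pass, re-aligned to forward time
    return np.stack([fwd, bwd], axis=1)   # shape: (len(seq), 2)

states = bidirectional(np.array([0.0, 1.0, 0.0, -1.0]))
print(states.shape)   # (4, 2): one forward and one backward state per step
```

In a real framework the two directions would each be full LSTM layers and the per-step states would be concatenated the same way.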

> you need some intuitions so you can start form a configuration that "roughly works"...

Quoting a pretty much legendary post:

> in sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.

As a bonus, knowing how to query an LSTM might be useful for future thought - LSTM querying approach

Thanks for taking the time to write this rich answer! And for all the pointers to useful resources. It will take a lot of work to get from these to actual implementable answers to my questions, but I know I posted a wide/vague question, and now I have a few solid pointers to start from! – NeuronQ – 2018-06-30T09:37:54.093