Adding Features To Time Series Model LSTM



I have been reading up a bit on LSTMs and their use for time series, and it's been interesting but difficult at the same time. One thing I have had difficulty understanding is how to add additional features to what is already a list of time series features. Assuming you have your dataset set up like this:


Now let's say you know you have a feature that does affect the output but is not necessarily a time series feature, say the weather outside. Is this something you can just add, and will the LSTM be able to distinguish what is the time series aspect and what isn't?


Posted 2017-02-21T22:17:40.000


I like your question. However, can you elaborate on how this non-time-series feature influences the output at time t, based on subject matter knowledge? – horaceT – 2017-02-23T20:28:37.203

The title was misleading, I thought it was adding new features to a model that had already been trained and then continuing with the training. – Joke Huang – 2021-02-19T05:55:38.943



For RNNs (e.g., LSTMs and GRUs), the layer input is a list of timesteps, and each timestep is a feature tensor. That means that you could have an input tensor like this (in Pythonic notation):

# Input tensor to RNN: a list of timesteps, each a list of features
[
    # Timestep 1
    [ temperature_in_paris, value_of_nasdaq, unemployment_rate ],
    # Timestep 2
    [ temperature_in_paris, value_of_nasdaq, unemployment_rate ],
    # Timestep 3
    [ temperature_in_paris, value_of_nasdaq, unemployment_rate ],
]

So absolutely, you can have multiple features at each timestep. In my mind, weather is a time series feature: where I live, it happens to be a function of time. So it would be quite reasonable to encode weather information as one of your features in each timestep (with an appropriate encoding, like cloudy=0, sunny=1, etc.).
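As a rough illustration of encoding weather as one feature per timestep, here is a minimal NumPy sketch. The observation values, the `WEATHER_CODES` mapping, and the variable names are made up for illustration; the only point is the resulting (timesteps, features) layout, plus the leading batch axis that recurrent layers typically expect.

```python
import numpy as np

# Hypothetical encoding for the categorical weather feature.
WEATHER_CODES = {"cloudy": 0.0, "sunny": 1.0}

# Three timesteps of made-up observations: (temperature, index value, weather).
observations = [
    (12.5, 5800.0, "cloudy"),
    (14.0, 5825.5, "sunny"),
    (13.2, 5790.1, "sunny"),
]

# One row per timestep, one column per feature.
sequence = np.array(
    [[temp, index, WEATHER_CODES[weather]] for temp, index, weather in observations]
)

# Add a leading batch axis: (batch, timesteps, features).
batch = sequence[np.newaxis, ...]

print(sequence.shape, batch.shape)
```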

If you have non-time-series data, then it doesn't really make sense to pass it through the LSTM, though. Maybe the LSTM will work anyway, but even if it does, it will probably come at the cost of higher loss / lower accuracy per training time.

Alternatively, you can introduce this sort of "extra" information into your model outside of the LSTM by means of additional layers. You might have a data flow like this:

TIME_SERIES_INPUT ------> LSTM -------\
                                       *---> MERGE ---> [more processing]
AUXILIARY_INPUTS --> [do something] --/

So you would merge your auxiliary inputs into the LSTM outputs, and continue your network from there. Now your model is simply multi-input.

For example, let's say that in your particular application, you only keep the last output of the LSTM output sequence. Let's say that it is a vector of length 10. Your auxiliary input might be your encoded weather (so a scalar). Your merge layer could simply append the auxiliary weather information onto the end of the LSTM output vector to produce a single vector of length 11. But you don't need to just keep the last LSTM output timestep: if the LSTM outputted 100 timesteps, each with a 10-vector of features, you could still tack on your auxiliary weather information, resulting in 100 timesteps, each consisting of a vector of 11 datapoints.

The Keras documentation on its functional API has a good overview of this.
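To make the shapes in the merge step concrete, here is a small NumPy sketch. The "LSTM outputs" are random stand-ins rather than the output of a trained model, and all variable names are hypothetical. In Keras, a concatenation layer in the functional API would play the role of `np.concatenate` here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for LSTM outputs (random numbers, not a trained model).
last_output = rng.normal(size=(10,))       # last timestep only
all_outputs = rng.normal(size=(100, 10))   # every timestep

weather = 1.0  # encoded auxiliary input, e.g. sunny=1

# Case 1: append the scalar to the final output vector -> length 11.
merged_last = np.concatenate([last_output, [weather]])

# Case 2: repeat the scalar once per timestep, then append along the feature axis.
weather_column = np.full((all_outputs.shape[0], 1), weather)
merged_all = np.concatenate([all_outputs, weather_column], axis=1)

print(merged_last.shape, merged_all.shape)
```

Either merged array can then be fed to the downstream layers of the network.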

In other cases, as @horaceT points out, you may want to condition the LSTM on non-temporal data. For example, predict the weather tomorrow, given location. In this case, here are three suggestions, each with pros and cons:

  1. Have the first timestep contain your conditioning data, since it will effectively "set" the internal/hidden state of your RNN. Frankly, I would not do this, for a bunch of reasons: your conditioning data needs to be the same shape as the rest of your features; it makes it harder to create stateful RNNs (you have to be really careful to track how you feed data into the network); and the network may "forget" the conditioning data with enough time (e.g., long training sequences, or long prediction sequences).

  2. Include the data as part of the temporal data itself. So each feature vector at a particular timestep includes "mostly" time-series data, but then has the conditioning data appended to the end of each feature vector. Will the network learn to recognize this? Probably, but even then, you are creating a harder learning task by polluting the sequence data with non-sequential information. So I would also discourage this.

  3. Probably the best approach would be to directly affect the hidden state of the RNN at time zero. This is the approach taken by Karpathy and Fei-Fei and by Vinyals et al. This is how it works:

    1. For each training sample, take your condition variables $\vec{x}$.
    2. Transform/reshape your condition variables with an affine transformation to get them into the same shape as the internal state of the RNN: $\vec{v} = \mathbf{W} \vec{x} + \vec{b}$ (these $\mathbf{W}$ and $\vec{b}$ are trainable weights). You can obtain it with a Dense layer in Keras.
    3. For the very first timestep, add $\vec{v}$ to the hidden state of the RNN when calculating its value.

    This approach is the most "theoretically" correct, since it properly conditions the RNN on your non-temporal inputs, naturally solves the shape problem, and also avoids polluting your input timesteps with additional, non-temporal information. The downside is that this approach often requires graph-level control of your architecture, so if you are using a higher-level abstraction like Keras, you will find it hard to implement unless you add your own layer type.
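The three steps of the third suggestion can be sketched in plain NumPy. This is only an illustration under simplifying assumptions: it uses a vanilla tanh RNN rather than an LSTM, the weights are randomly initialised and never trained, and every name (`W_cond`, `b_cond`, `run_rnn`, the one-hot hemisphere condition) is hypothetical. It is not a reproduction of the cited papers' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_cond, n_steps = 3, 4, 2, 5

# Trainable parameters (randomly initialised here; no training is shown).
W_cond = rng.normal(size=(n_hidden, n_cond))    # W in v = W x + b
b_cond = np.zeros(n_hidden)                     # b in v = W x + b
W_in = rng.normal(size=(n_hidden, n_features))  # input-to-hidden weights
W_rec = rng.normal(size=(n_hidden, n_hidden))   # hidden-to-hidden weights
b_rec = np.zeros(n_hidden)


def run_rnn(sequence, condition):
    """Run a vanilla tanh RNN whose first step is conditioned on `condition`."""
    v = W_cond @ condition + b_cond  # step 2: affine transform to hidden size
    h = np.zeros(n_hidden)
    states = []
    for t, x_t in enumerate(sequence):
        pre_activation = W_in @ x_t + W_rec @ h + b_rec
        if t == 0:
            pre_activation = pre_activation + v  # step 3: first timestep only
        h = np.tanh(pre_activation)
        states.append(h)
    return np.stack(states)


# Same time series, two different conditions (e.g. a one-hot hemisphere flag).
sequence = rng.normal(size=(n_steps, n_features))
north = run_rnn(sequence, np.array([1.0, 0.0]))
south = run_rnn(sequence, np.array([0.0, 1.0]))

print(north.shape, np.array_equal(north, south))
```

The same input sequence produces different hidden states under the two conditions, which is the intended effect. Keras recurrent layers also accept an `initial_state` argument when called, which may be a simpler route in practice than writing a custom layer.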

Adam Sypniewski

Posted 2017-02-21T22:17:40.000


Good suggestion, but what if the output of the LSTM has structural dependence on a non-time series predictor? – horaceT – 2017-02-23T18:26:12.873

Could you give an example? – Adam Sypniewski – 2017-02-23T18:27:24.723

OK, here is a very artificial example. Say you're trying to predict weather at time t, based on observations from the last n time steps. Weather depends on the part of the world you're in. If it's summer in the northern hemisphere, it's winter in the southern hemisphere. So this north/south factor should be taken into account. Can you incorporate it into an LSTM? – horaceT – 2017-02-23T18:32:32.687

Great question! I've included edits to address this. – Adam Sypniewski – 2017-02-24T16:16:25.783

Thanks for the edits and the two references. Quite useful. – horaceT – 2017-02-25T21:27:24.833

@AdamSypniewski, are you sure that recommendation is what those papers are doing? My read is that Karpathy & Fei-Fei are encoding an image to h-space via one model (a CNN), and encoding a sentence into h-space via a separate model (an LSTM), and then they infer from there; I think Vinyals et al are feeding in their conditioning variable at time '-1' before the start of the sequence. Am I wrong? – StatsSorceress – 2018-01-09T16:52:17.683

For Keras users: if you want to include an external constant (like a conditioning variable) in a recurrent network, see "Note on passing external constants to RNNs" here:

– StatsSorceress – 2018-01-09T16:59:31.687

@StatsSorceress from Karpathy's paper, they say: "The RNN is trained to combine a word (x[t]), the previous context (h[t−1]) to predict the next word (y[t]). We condition the RNN’s predictions on the image information (b_v) via bias interactions on the first step." Any idea how to do this conditioning with TensorFlow? – Ciprian Tomoiagă – 2018-01-23T16:29:54.677

@CiprianTomoiagă no, sorry, I use Keras. – StatsSorceress – 2018-01-24T15:29:55.987

I'd appreciate an example of an LSTM with exogenous data, e.g. where variables correlate in two groups (like: performance_walking, performance_chewing_gum, hours_awake, amount_of_work_to_do -- the first two correlate, the second two correlate, and the second group influences the first, but the trend lines of the first group will be relatively smooth and upwards, while the trend lines of the second group will be far more noisy). In such cases a naive LSTM might do well with the larger group and poorly with the smaller. – roberto tomás – 2018-10-25T12:23:51.963

This is a very nice answer. Might be worth noting that in the RNN "estimator" (i.e. built-in model with all the bells and whistles) in TensorFlow, they appear to concatenate the static contextual features onto each element of the sequence, as in your suggestion #2. I wonder if there is a good reason this is their default approach, rather than the initialization-style approach in suggestion #3. – abeboparebop – 2019-05-20T13:22:48.820


Based on all the good answers of this thread, I wrote a library to condition on auxiliary inputs. It abstracts all the complexity and has been designed to be as user-friendly as possible: (tensorflow)

Hope it helps!

Philippe Remy

Posted 2017-02-21T22:17:40.000



Adam's answer does seem to make the most sense. However, I am not sure about the second statement, "Polluting sequential data with non-sequential information".

I recently trained a character-level LSTM model in which I just appended a non-sequential feature to the end of the sequential features. The model learned to differentiate the two pretty well.

Whether the model would have performed better had I done it Adam's way is still to be tested. But for people who don't want to go the extra mile, appending non-sequential features to sequential ones works just fine.


Posted 2017-02-21T22:17:40.000



The Keras LSTM layer has a method reset_states(states).

However, the parameter states is a list of two states: the hidden state h and the cell state c.

states = [h, c]

It would be interesting to know whether you should initialize h or c according to the approaches in the above-mentioned papers.


Posted 2017-02-21T22:17:40.000



This is probably not the most efficient way, but the static variables could be repeated to the length of the time series using tf.tile().
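A minimal sketch of that idea, using NumPy's `np.tile` as a stand-in for `tf.tile()` (which likewise takes the input and a list of per-axis repeat counts). The array values and variable names are made up for illustration.

```python
import numpy as np

n_steps = 4

# Made-up time series with two features per timestep: shape (timesteps, features).
series = np.arange(n_steps * 2, dtype=float).reshape(n_steps, 2)

# Static (non-temporal) features for this sample, e.g. a one-hot hemisphere flag.
static = np.array([1.0, 0.0])

# Repeat the static row once per timestep: shape (timesteps, n_static).
static_tiled = np.tile(static, (n_steps, 1))

# Append along the feature axis: shape (timesteps, features + n_static).
augmented = np.concatenate([series, static_tiled], axis=1)

print(augmented.shape)
```

Every timestep now carries the same static values alongside its own time series features, at the cost of storing the static data once per timestep.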


Posted 2017-02-21T22:17:40.000
