The difference between `Dense` and `TimeDistributedDense` of `Keras`


I am still confused about the difference between `Dense` and `TimeDistributedDense` in Keras, even though there are already some similar questions asked here and here. People discuss it a lot, but there is no commonly agreed conclusion.

And even though, here, @fchollet stated that:

> TimeDistributedDense applies the same Dense (fully-connected) operation to every timestep of a 3D tensor.


I still need a detailed illustration of what exactly the difference between them is.

fluency03

Posted 2016-03-22T20:04:23.467

Reputation: 483

Answers

49

Let's say you have time-series data with $N$ rows and $700$ columns that you want to feed to a SimpleRNN(200, return_sequences=True) layer in Keras. Before you feed it to the RNN, you need to reshape the data to a 3D tensor, so it becomes $N \times 700 \times 1$.
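The reshape step can be sketched in NumPy; the sample count of 32 below is just a hypothetical placeholder for $N$:

```python
import numpy as np

# Hypothetical toy data: N = 32 samples, 700 columns (the future timesteps).
N = 32
data = np.random.rand(N, 700)

# Reshape to the 3D tensor Keras RNN layers expect:
# (samples, timesteps, features) -- here, one feature per timestep.
data_3d = data.reshape(N, 700, 1)
print(data_3d.shape)  # (32, 700, 1)
```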

[Image: an unrolled RNN, taken from https://colah.github.io/posts/2015-08-Understanding-LSTMs]

In the RNN, your columns (the "700 columns") are the timesteps. Your data is processed from $t=1$ to $t=700$. After feeding the data to the RNN, it now has 700 outputs, $h_1$ to $h_{700}$ (not $h_1$ to $h_{200}$). Remember that the shape of your data is now $N \times 700 \times 200$, which is samples (the rows) × timesteps (the columns) × channels.

And then, when you apply a TimeDistributedDense, you're applying a Dense layer to each timestep, which means you're applying a Dense layer to each of $h_1, h_2, \dots, h_{700}$ respectively. In other words, you're applying the fully-connected operation to each timestep's channels (the "200" dimension), from $h_1$ to $h_{700}$: the 1st "$1 \times 1 \times 200$" up to the 700th "$1 \times 1 \times 200$".
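The key point, that one *shared* set of Dense weights is applied to every timestep, can be sketched in plain NumPy. The shapes below (4 samples, 700 timesteps, 200 channels, 10 output units) are illustrative assumptions, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical RNN output: 4 samples x 700 timesteps x 200 channels.
h = rng.standard_normal((4, 700, 200))

# One shared Dense layer: weights mapping 200 channels -> 10 units.
W = rng.standard_normal((200, 10))
b = rng.standard_normal(10)

# TimeDistributed(Dense(10)) amounts to this matmul broadcast over timesteps:
out = h @ W + b  # shape (4, 700, 10)

# Same result as looping over timesteps with the *same* W and b each time:
out_loop = np.stack([h[:, t, :] @ W + b for t in range(700)], axis=1)
assert np.allclose(out, out_loop)
```

The loop version makes the weight sharing explicit: no new parameters per timestep, and no interaction between different timesteps.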

Why are we doing this? Because you don't want to flatten the RNN output.

Why not flatten the RNN output? Because you want to keep each timestep's values separate.

Why keep each timestep's values separate? Because:

  • you only want values to interact within their own timestep;
  • you don't want random interactions between different timesteps and channels.

Rizky Luthfianto

Posted 2016-03-22T20:04:23.467

Reputation: 1 968

1 "And then, when you apply a TimeDistributedDense, you're applying a Dense layer on each timestep" --> Does this mean each timestep shares the Dense layer's weights? And with a plain Dense layer, is it only applied to the last timestep? – o0omycomputero0o – 2018-01-21T09:19:16.473

2 Why isn't TimeDistributedDense used in the Keras example at https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html ? – user1934212 – 2018-03-07T11:48:48.487

3 Because TimeDistributedDense is already deprecated. Since Keras 2.0, Dense can handle >2-dimensional tensors well. – Rizky Luthfianto – 2019-01-24T09:23:28.377