Number of parameters in an LSTM model



How many parameters does a single stacked LSTM have? The number of parameters imposes a lower bound on the number of training examples required and also influences the training time. Hence, knowing the number of parameters is useful when training models that use LSTMs.


Posted 2016-03-09T11:14:20.163

Reputation: 1 117



The LSTM has a set of two matrices, $U$ and $W$, for each of its three gates. The $\cdot$ in the diagram indicates multiplication of these matrices with the input $x$ and the output $h$.

  • $U$ has dimensions $n \times m$
  • $W$ has dimensions $n \times n$
  • there is a different set of these matrices for each of the three gates (like $U_{forget}$ for the forget gate, etc.)
  • there is another set of these matrices for updating the cell state $S$
  • on top of the matrices mentioned, you need to count the biases (not in the picture)

Hence the total number of parameters is $4(nm + n^{2} + n)$.

LSTM abstract block
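As a quick sanity check, the formula above can be wrapped in a small helper (a sketch; the function name is mine):

```python
def lstm_params(m, n):
    """Parameter count of a single LSTM layer, biases included.

    m -- size of each input vector x
    n -- size of each output vector h
    """
    # 4 sets of weights (three gates + the cell-state update), each with
    # an n x m matrix U, an n x n matrix W, and an n-dimensional bias.
    return 4 * (n * m + n * n + n)

print(lstm_params(25, 100))  # 50400
```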


Reputation: 1 117

I faced this question myself when making practical decisions on estimating hardware requirements and project planning for a deep learning project.
PS: I didn't answer my own question just to gain reputation points. I want to know from the community whether my answer is right.
– wabbit – 2016-03-09T11:17:03.137

You have ignored bias units. See Adam Oudad's answer below. – arun – 2018-06-20T00:11:33.993

Biases were not there. I have edited the answer. – Escachator – 2018-10-20T12:09:12.273

Doesn't this then need to be multiplied by the number of LSTM units in the layer? Isn't this only the number of params in a single LSTM cell? – Joe Black – 2020-05-27T18:03:02.493


Following previous answers, the number of parameters of an LSTM taking input vectors of size $m$ and giving output vectors of size $n$ is:

$$4(nm + n^{2})$$

However, in case your LSTM includes bias vectors (this is the default in Keras, for example), the number becomes:

$$4(nm + n^{2} + n)$$
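A minimal sketch (function name mine) of the two counts, with and without the bias vectors:

```python
def lstm_params(m, n, use_bias=True):
    """4 gate/cell computations, each with an n x m input matrix,
    an n x n recurrent matrix, and (if use_bias) an n-vector bias."""
    return 4 * (n * m + n * n + (n if use_bias else 0))

print(lstm_params(25, 100, use_bias=False))  # 50000
print(lstm_params(25, 100, use_bias=True))   # 50400, the Keras default
```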

Adam Oudad

Reputation: 923

This is the only complete answer. Every other answer appears content to ignore the case of bias neurons. – None – 2018-02-07T14:11:30.040

To give a concrete example: if your input has m=25 dimensions and you use an LSTM layer with n=100 units, then the number of params = 4(100·25 + 100² + 100) = 50400. – arun – 2018-06-20T00:13:59.920

Suppose I am using timestep data; is my understanding below correct? n=100 means I will have 100 timesteps in each sample (example), so I need 100 units. m=25 means at each timestep I have 25 features like [weight, height, age, ...]. – jason zhang – 2019-03-10T06:41:04.007

@jasonzhang The number of timesteps is not relevant, because the same LSTM cell is applied recursively to your input vectors (one vector per timestep). What arun called "units" is the size of each output vector, not the number of timesteps. – Adam Oudad – 2019-03-11T08:24:05.260


According to this:

LSTM cell structure

LSTM equations

Ignoring non-linearities

If the input $x_t$ is of size $n \times 1$ and there are $d$ memory cells, then the size of each $W_*$ and $U_*$ is $d \times n$ and $d \times d$, respectively. The size of $W$ will then be $4d \times (n+d)$. Note that each one of the $d$ memory cells has its own weights $W_*$ and $U_*$, and that the only time memory-cell values are shared with other LSTM units is during the product with $U_*$.
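The stacked matrix W described above can be sketched in NumPy (the values n=25, d=100 are hypothetical, chosen only to illustrate the shapes):

```python
import numpy as np

n, d = 25, 100  # input size and number of memory cells (hypothetical)

# Per-gate matrices: each W* maps the input (d x n), each U* the state (d x d).
W_star = [np.zeros((d, n)) for _ in range(4)]
U_star = [np.zeros((d, d)) for _ in range(4)]

# Stacking [W* | U*] blocks for all four gates gives one 4d x (n + d) matrix.
W = np.vstack([np.hstack([w, u]) for w, u in zip(W_star, U_star)])
print(W.shape)  # (400, 125), i.e. 4d x (n + d)
```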

Thanks to Arun Mallya for the great presentation.


Reputation: 383


To understand the answer completely and gain good insight, visit:

g, number of FFNNs (feed-forward neural networks) in a unit (RNN has 1, GRU has 3, LSTM has 4)

h, size of the hidden units

i, dimension/size of the input

Since every FFNN has h(h+i) + h parameters, we have

num_params = g × [h(h+i) + h]
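This general formula covers all three cell types; a sketch (function name mine). Note that some library implementations add extra bias terms (e.g. Keras' GRU with reset_after=True), so actual counts can differ slightly:

```python
def num_params(g, h, i):
    """g: FFNNs per unit (RNN 1, GRU 3, LSTM 4); h: hidden size; i: input size."""
    return g * (h * (h + i) + h)

print(num_params(1, 2, 3))  # 12, vanilla RNN
print(num_params(3, 2, 3))  # 36, GRU
print(num_params(4, 2, 3))  # 48, LSTM
```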

Example 2.1: LSTM with 2 hidden units and input dimension 3.


g = 4 (LSTM has 4 FFNNs)

h = 2

i = 3


num_params = g × [h(h+i) + h]

= 4 × [2(2+3) + 2]

= 48

    from tensorflow.keras.layers import Input, LSTM
    from tensorflow.keras.models import Model

    inputs = Input((None, 3))   # sequences of 3-dimensional vectors
    lstm = LSTM(2)(inputs)      # 2 hidden units
    model = Model(inputs, lstm)

Thanks to Raimi Karim.

Ali Alipoury

Reputation: 21


To make it clearer, I have annotated the diagram:

ot-1: previous output, dimension n (to be exact, the last dimension's number of units is n)

i: input, dimension m

fg: forget gate

ig: input gate

update: update gate

og: output gate

Since the dimension at each gate is n, getting ot-1 and i to each gate by matrix multiplication (dot product) needs n·n + m·n parameters, plus n biases, so the total is 4(n·n + m·n + n).
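The per-gate accounting above can be sketched as a loop (names mine):

```python
def lstm_params_by_gate(m, n):
    """Accumulate n*n + m*n + n parameters for each of the four gates."""
    total = 0
    for gate in ("forget", "input", "update", "output"):
        total += n * n + m * n + n  # recurrent weights, input weights, bias
    return total

print(lstm_params_by_gate(25, 100))  # 50400
```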


Reputation: 11