## What's an LSTM-LM formulation?

I am reading this paper "Sequence to Sequence Learning with Neural Networks" http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Under "2. The Model" it says:

> The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x1, . . . , xT) given by the last hidden state of the LSTM, and then computing the probability of y1, . . . , yT′ with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x1, . . . , xT:

I know what an LSTM is, but what's an LSTM-LM? I've tried Googling it but can't find any good leads.

But this sentence is still puzzling to me. If I put it into an equation, it makes [ with c the last hidden state of the encoder. Then the first hidden state represents the information provided by the encoder, but the next ones represent the probability distribution of the target sequence's elements: something of a radically different nature. Also, the cell state initialisation is not given, and figure 1 leads one to believe that the LSTM provid

– Charles Englebert – 2018-09-18T12:40:01.363

A Language Model (LM) is, by definition, a probability distribution over sequences of words.

The simplest illustration of an LM is predicting the next word given the previous word(s).

For example, if I have a language model and some initial word(s):

• I set my initial word to My
• My model predicts there is a high probability that name appears after My.
• By setting the initial words to My name, my model predicts there is a high probability that is appears after My name.
• So it's like: My -> My name -> My name is -> My name is Tom, and so on.
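The chain above can be sketched as a toy language model. This is a minimal sketch with made-up bigram probabilities (a real LM conditions on the whole history, and an LSTM-LM would compute these probabilities with an LSTM; a hand-written table keeps the idea visible):

```python
# Hypothetical P(next word | previous word), with made-up probabilities.
bigram_probs = {
    "My":   {"name": 0.7, "dog": 0.3},
    "name": {"is": 0.9, "was": 0.1},
    "is":   {"Tom": 0.5, "Anna": 0.5},
}

def greedy_continue(word, steps):
    """Repeatedly append the most probable next word."""
    sequence = [word]
    for _ in range(steps):
        candidates = bigram_probs.get(sequence[-1])
        if candidates is None:  # no continuation known for this word
            break
        sequence.append(max(candidates, key=candidates.get))
    return " ".join(sequence)

print(greedy_continue("My", 3))  # My name is Tom
```

Greedy decoding like this reproduces the My -> My name -> My name is -> My name is Tom chain; real systems often sample from the distribution or use beam search instead.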

You can think of the autocompletion on your smartphone keyboard: in fact, an LM is the heart of autocompletion.

So, an LSTM-LM is simply an LSTM (plus a softmax function over its output) used to predict the next word given the previous words.
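The softmax step is what turns the LSTM's raw output scores into a probability distribution over the vocabulary. A minimal sketch, assuming hypothetical scores for a tiny four-word vocabulary (in a real model they would be the LSTM's hidden state multiplied by an output weight matrix):

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-word scores at one timestep of the LSTM.
vocab = ["name", "dog", "is", "Tom"]
scores = [2.0, 0.5, -1.0, 0.1]

probs = softmax(scores)
next_word = vocab[probs.index(max(probs))]
print(next_word)  # name
```

Note that the LSTM cell itself is unchanged; the softmax is just a layer applied to its output at each timestep.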

By the way, language modeling is not limited to LSTMs: other RNNs (e.g., GRUs) or other structured models work too. In fact, you can also use a feedforward network with a context/sliding/rolling window to predict the next word given the previous words.
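To make the sliding-window idea concrete, here is a sketch of how a fixed-window feedforward LM would see its training data (the function name `context_windows` is just for illustration): each fixed-size context of preceding words is paired with the word that follows it.

```python
def context_windows(words, window):
    """Build (context, next word) training pairs for a fixed-window LM."""
    return [
        (tuple(words[i:i + window]), words[i + window])
        for i in range(len(words) - window)
    ]

tokens = "My name is Tom".split()
print(context_windows(tokens, 2))
# [(('My', 'name'), 'is'), (('name', 'is'), 'Tom')]
```

Unlike an LSTM, such a model only ever sees the last `window` words, which is exactly the limitation recurrent LMs were designed to remove.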

Does that change the formulation of the LSTM itself in any way? – Taivanbat Badamdorj – 2016-08-04T08:37:35.813

Or does it change the way that the LSTMs are linked together? – Taivanbat Badamdorj – 2016-08-04T08:37:56.220

IMHO, perhaps it means an LSTM that is tuned for LM (Language Modeling). I am reading the same paper and that is my understanding – Ali – 2016-10-24T15:33:15.117

@TaevanbatMongol no, it's not changing the LSTM formulation. You only need a softmax function (or something similar) to generate the probability of words from the LSTM output – Rizky Luthfianto – 2017-10-15T14:38:13.847

"Probability of words" means that if you sum the probabilities/scores of the output at a given timestep, they will equal 1 – Rizky Luthfianto – 2017-10-15T14:39:33.487

In this context I think it means you take the output representation and learn an additional softmax layer that corresponds to the tokens in your language model (in this case letters).