## LSTM querying approach

I've just realized my prediction approach for LSTM might not be correct.

I am trying to predict character by character, by reading over the book. The way I've approached the problem is as follows:

b                                    c                d                e
^     carry cell state forward       ^                ^                ^
LSTM_t0  ------------------------->  LSTM_t1  ----->  LSTM_t2  ----->  LSTM_t3
^                                    ^                ^                ^
a                                    b                c                d


This means I have 4 timesteps, and at each one I feed the next letter into the LSTM, expecting it to immediately predict the letter that follows.
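
Concretely, a minimal PyTorch sketch of this first setup (the framework, variable names, and sizes are just illustrative, nothing here is fixed by the problem); every timestep's output is projected to character logits and contributes to the loss:

```python
import torch
import torch.nn as nn

# Illustrative sizes -- nothing here is fixed by the problem
vocab_size, embed_dim, hidden_dim, batch, steps = 128, 32, 256, 8, 4

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_logits = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# inputs  = ids for 'a b c d'; targets = ids for 'b c d e' (inputs shifted by one)
inputs = torch.randint(0, vocab_size, (batch, steps))
targets = torch.randint(0, vocab_size, (batch, steps))

outputs, _ = lstm(embed(inputs))   # (batch, steps, hidden_dim): one output per timestep
logits = to_logits(outputs)        # (batch, steps, vocab_size)

# every timestep contributes to the loss -> 4 sources of gradient
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```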

Should I instead do this:

ignore          ignore           ignore             e
^               ^                ^                ^
LSTM_t0  ---->  LSTM_t1  ----->  LSTM_t2  ----->  LSTM_t3
^               ^                ^                ^
a               b                c                d


In the first case, I am able to get 4 loss values, but in the second example I only have 1 source of gradient, at t3.
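
To make the difference concrete, a sketch of the second setup (again with made-up sizes): the LSTM still runs over all timesteps, but only the output at t3 is scored, so there is a single loss term.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, batch, steps = 128, 32, 256, 8, 4

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_logits = nn.Linear(hidden_dim, vocab_size)

inputs = torch.randint(0, vocab_size, (batch, steps))   # ids for 'a b c d'
target_e = torch.randint(0, vocab_size, (batch,))       # id for 'e' only

outputs, _ = lstm(embed(inputs))
last_logits = to_logits(outputs[:, -1, :])              # prediction at t3 only

# single loss term; gradients still flow back through t3..t0 via backprop through time
loss = nn.CrossEntropyLoss()(last_logits, target_e)
loss.backward()
```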

My main concern is that in the first example I demand the LSTM make predictions of 'b' and 'c' without supplying it enough previous context. It's fine for 'd' and 'e', but isn't asking for an answer at timesteps 0 and 1 a bit unfair?

What would be best for this particular example?

## Answers

Your first example is basically not a sequential model. You have an input and an output and that's it. You don't need a recurrent layer for that...

What I would suggest you do is:

e               f               g
^               ^               ^
LSTM_t1  ---->  LSTM_t2  -----> LSTM_t3  ----->  ...
^               ^               ^
abcd            bcde            cdef


Obviously, you will lose the ability to predict the first few characters, but hopefully your dataset will include a lot more examples using those specific characters, so the model will learn how to use them eventually. I would say this is the most general way of producing char-to-char text generation models. A couple of examples including implementations: here and here.
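
A rough sketch of how those sliding windows could be built and fed to a many-to-one model (the corpus, window length, and layer sizes below are just illustrative):

```python
import torch
import torch.nn as nn

text = "an illustrative training corpus for char level prediction"  # stand-in for the book
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
window = 4  # 'abcd' -> 'e' style windows

# slide over the text: each window of 4 characters predicts the character that follows it
inputs, targets = [], []
for i in range(len(text) - window):
    inputs.append([char_to_id[c] for c in text[i:i + window]])
    targets.append(char_to_id[text[i + window]])
inputs = torch.tensor(inputs)    # (num_examples, window)
targets = torch.tensor(targets)  # (num_examples,)

# many-to-one model: read the whole window, predict only the next character
embed = nn.Embedding(len(chars), 32)
lstm = nn.LSTM(32, 128, batch_first=True)
to_logits = nn.Linear(128, len(chars))

outputs, _ = lstm(embed(inputs))
loss = nn.CrossEntropyLoss()(to_logits(outputs[:, -1, :]), targets)
loss.backward()
```

At generation time you would feed in a seed window, sample the predicted character, append it to the window while dropping the oldest character, and repeat.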

I would argue that my first example is a sequential model - it has a recurrent connection. It's now clear to me that it's desirable to have some history of letters at the past timesteps to make a more certain prediction. Like you've said, 'a, b, c, d' = e, and all of that happens via 5 distinct timesteps. – Kari – 2018-03-13T10:24:27.307

However, I am still inclined to go with the first example - it has a lot more sources of gradient. Yes, predicting 'b' or 'c' may not be as reliable as predicting 'e', but, as you said, if my timesteps are long, 'a' might come up somewhere in the middle next time. – Kari – 2018-03-13T10:26:04.837

Of course this is a sequential model. Any function that is represented by an RNN is a sequential model, regardless of whether you constrain the model to only measure the loss on the output of the final element.

Now there may be better models for f(abcd) that may not be sequential. But that is outside the scope of his question. – user18764 – 2018-04-18T15:22:33.163