## Sentence language translation with a neural network, using a simple layer structure (if possible Sequential)

Context: Many sentence-translation systems (e.g. French to English) built with neural networks use a seq2seq structure:

"the cat sat on the mat" -> [Seq2Seq model] -> "le chat etait assis sur le tapis"

I noticed that in all these examples the neural network is not built as a Sequential stack of consecutive layers, but rather as a more complex encoder-decoder structure like this:

Question: are there successful attempts at doing sentence language translation with classic Sequential layers?

i.e.:

• Input layer: word-tokenized sentence in English, zero padded: "the cat sat on the mat"
=> x = [2, 112, 198, 22, 2, 302, 0, 0, 0, 0, 0, 0, 0, 0, ...]

• Output layer: word-tokenized sentence in French, zero padded: "le chat était assis sur le tapis"
=> y = [2, 182, 17, 166, 21, 2, 302, 0, 0, 0, 0, 0, 0, 0, 0, ...]
(a rough sketch of this preprocessing step is shown right after this list)
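
In case it matters, I build these padded integer sequences roughly like this (just a sketch using Keras's Tokenizer and pad_sequences; max_words is my vocabulary size cap, english_sentences stands for the full English training corpus, and there is one tokenizer per language):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer_en = Tokenizer(num_words=max_words)    # one tokenizer per language
tokenizer_en.fit_on_texts(english_sentences)     # fit on the whole English corpus

x = tokenizer_en.texts_to_sequences(["the cat sat on the mat"])  # e.g. [[2, 112, 198, 22, 2, 302]]
x = pad_sequences(x, maxlen=200, padding='post')                 # zero-pad to length 200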

What would you use as layers? I think we could start with:

from keras.models import Sequential
from keras.layers import Embedding, LSTM

model = Sequential()                                   # in  shape: (None, 200)
model.add(Embedding(max_words, 32, input_length=200))  # out shape: (None, 200, 32)
model.add(LSTM(100))                                   # out shape: (None, 100)
# ... what here? ...


But then how would we have a second Embedding for the output language and reverse it, i.e. go from a 200x32 embedding (floats) back to an integer list like [2, 182, 17, 166, 21, 2, 302, 0, 0, 0, 0, 0, 0, 0, 0, ...]?

Also, how should the loss be measured in this situation: mean squared error?
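
My naive guess (quite possibly wrong) would be to replace the "reverse embedding" by a softmax over the French vocabulary at each output position, and to use categorical cross-entropy instead of MSE, something like this (max_words_en and max_words_fr being the two vocabulary sizes; the French targets would then be one-hot encoded per position, or kept as integers with sparse_categorical_crossentropy):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(max_words_en, 32, input_length=200))               # (None, 200, 32)
model.add(LSTM(100))                                                   # (None, 100): one vector per sentence
model.add(RepeatVector(200))                                           # (None, 200, 100): repeat for each output position
model.add(LSTM(100, return_sequences=True))                            # (None, 200, 100)
model.add(TimeDistributed(Dense(max_words_fr, activation='softmax')))  # (None, 200, max_words_fr)
model.compile(optimizer='adam', loss='categorical_crossentropy')

But I have no idea whether such a plain stack can actually learn to translate, hence the question.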

More generally, what is the simplest structure you can think of (even if it does not give the best results) that works for language translation? (It's OK even if it is not Sequential.)


Machine translation with traditional neural architectures (seq2seq models) had various issues: poor handling of rare words, low accuracy, and slow translation [1]. Even after adding mechanisms like attention and residual connections, the performance was only comparable to (not better than) statistical phrase-based machine translation [1].

I can only think of this paper as a successful attempt to use LSTMs in an encoder-decoder setting (8 encoder and 8 decoder layers) that got comparable results (there might be other attempts too). AWD-LSTMs [2] also perform remarkably better than other models.

In the machine translation task, the model has to learn the proper relation between the translated words and the source words, as well as their relative positioning. This can only be achieved by using knowledge representations (word embeddings/encodings) from both languages.

That's why we need to use both encoder and decoder layers.

If you ask me, I would say the following code (taken from the link) is the simplest model structure possible using a plain LSTM seq2seq model.

from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard encoder_outputs and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using encoder_states as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# encoder_input_data & decoder_input_data into decoder_target_data
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
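
Training then looks roughly like this in the same tutorial (encoder_input_data, decoder_input_data and decoder_target_data are the prepared one-hot arrays, with decoder_target_data being decoder_input_data shifted one timestep ahead, i.e. teacher forcing; batch_size and epochs are ordinary hyperparameters):

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)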



From your question, it seems you want to define the model using only Keras's Sequential API. If that is the case, then you must know that the variable encoder_states plays an important role in the above model's definition.

LSTMs are sequential models, which means they process one word at a time and compute the hidden state that is fed to the next step in each iteration. This process is repeated for all the words in the input sequence (source language). The final hidden state is then used in the decoder to provide the context for the output sequence (target language). That's why there is initial_state=encoder_states in the decoder LSTM definition. Without encoder_states, the decoder LSTM won't know the context and your model will only produce gibberish output.
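
To make the role of encoder_states more concrete, at inference time the same tutorial re-wires the trained layers defined above into two models, so that the encoder's final states can seed the decoder, which is then run one step at a time (roughly):

encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                                 initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                      [decoder_outputs] + decoder_states)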

are there successful attempts of doing sentence language translation with classic Sequential layers?

You can only try to understand how machine translation works and get comfortable with the complexity of the machine translation model's definition, as this is the simplest model possible.

For more information, you can go through these papers: 1, 2, 3.

I hope it helps.


The reason why seq2seq models are not just stacks of layers is that the decoder cannot know in advance how long the output will be (at inference time), and the next actions of the decoder depend on its previous actions. This property of the decoder is called autoregressivity. The decoder needs to keep track of two things: what is on the input (the left branch of your diagram) and what it did in the previous steps (the right branch of the diagram).
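
In pseudo-code, inference with an autoregressive decoder looks roughly like this (encode, decode_step, the special tokens and MAX_LEN are hypothetical placeholders):

def translate(source_sentence):
    state = encode(source_sentence)        # summary of the input (left branch)
    output = [START_TOKEN]                 # what was produced so far (right branch)
    while output[-1] != END_TOKEN and len(output) < MAX_LEN:
        # each step is conditioned on the previously generated word
        next_word, state = decode_step(state, output[-1])
        output.append(next_word)
    return output[1:]

The loop length is not fixed in advance, and every step depends on the previous output, which is what a plain Sequential stack does not capture.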

Formulating MT as a plain stack of layers is an active research area, mostly because it offers a significant speedup, but usually at the expense of translation quality. This approach also does not work with LSTMs, only with Transformers, because the self-attention layers in the Transformer directly allow arbitrary reordering of the input states, which is a crucial feature for MT since different languages have different word order.

Thank you for your answer. Just to be sure to understand: the decoder cannot know in advance how long the output will be (at the inference time): can't this be solved by a zero-padded input? i.e. we only work with sentences (in input and output language) of length 200: [2, 33, 4, 987, 1, 0, 0, 0, 0, 0, ..., 0]? – Basj – 2020-02-13T10:29:30.367

Again, just to be sure, the next actions of the decoder depend on its previous actions: do you mean 1) it depends on previous actions "inside a single sentence" (i.e. previous words of the sentence), or 2) does training on sentences_dataset[128] depend on previous training actions on sentences_dataset[127]? (of course the weights are updated in the SGD, but do you mean something different by previous actions?) – Basj – 2020-02-13T10:31:56.450