Neural Machine Translation

Recurrent Neural Networks (RNNs) for Language Modeling with Keras

David Cecchini

Data Scientist

Encoders and decoders

Neural Machine Translation architecture. This architecture is divided into an encoder part and a decoder part, for the input and output languages respectively. The encoder learns a language model for the input language and the decoder learns a language model for the output language. The final state of the encoder is passed to the decoder, which receives no other input.

Encoder example

# Import the model and layer classes
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector

# Instantiate the model
model = Sequential()
# Embedding layer for the input language
model.add(Embedding(input_language_size, input_wordvec_dim, input_length=input_language_len, mask_zero=True))
# Add LSTM layer
model.add(LSTM(128))
# Repeat the last vector
model.add(RepeatVector(output_language_len))

Decoder example

# Import the remaining layer classes
from tensorflow.keras.layers import TimeDistributed, Dense

# Right after the encoder, add the decoder LSTM
model.add(LSTM(128, return_sequences=True))
# Add a Time Distributed Dense layer with a softmax over the output vocabulary
model.add(TimeDistributed(Dense(eng_vocab_size, activation='softmax')))
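The sizes and lengths used above (input_language_size, input_wordvec_dim, input_language_len, output_language_len, eng_vocab_size) are not defined on these slides; in practice they come from the tokenizers and the padded sequences. For reference only, a self-contained sketch with made-up values:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

# Made-up sizes for illustration; derive them from your own data in practice
input_language_size = 5000    # input vocabulary size
input_wordvec_dim = 100       # word embedding dimension
input_language_len = 20       # padded input sentence length
output_language_len = 20      # padded output sentence length
eng_vocab_size = 6000         # output vocabulary size

model = Sequential()
model.add(Embedding(input_language_size, input_wordvec_dim, input_length=input_language_len, mask_zero=True))
model.add(LSTM(128))
model.add(RepeatVector(output_language_len))
model.add(LSTM(128, return_sequences=True))
model.add(TimeDistributed(Dense(eng_vocab_size, activation='softmax')))
model.summary()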

Data prep

Encoder and decoder text preparation. For the encoder, we need to transform the input language into sequences of numerical indexes; for the decoder, we do the same for the output language and also one-hot encode each index.

Data preparation for the input language

# Import modules
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Use the Tokenizer class
tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts_list)

# Text to sequences of numerical indexes
X = tokenizer.texts_to_sequences(input_texts_list)
# Pad sequences
X = pad_sequences(X, maxlen=length, padding='post')
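As a quick illustration with two made-up sentences (hypothetical names prefixed with toy_; the exact index values depend on the word counts seen by the fitted tokenizer):

# Hypothetical toy example, illustrative only
toy_texts = ["the cat sleeps", "the dog runs fast"]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_texts)
toy_X = toy_tokenizer.texts_to_sequences(toy_texts)
toy_X = pad_sequences(toy_X, maxlen=4, padding='post')
# toy_X is a 2D array of shape (2, 4); shorter sentences are padded
# with zeros at the end, e.g. [[1, 2, 3, 0], [1, 4, 5, 6]]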

Tokenize the output language

# Use the Tokenizer class
tokenizer = Tokenizer()
tokenizer.fit_on_texts(output_texts_list)

# Text to sequences of numerical indexes
Y = tokenizer.texts_to_sequences(output_texts_list)
# Pad sequences
Y = pad_sequences(Y, maxlen=length, padding='post')

One-hot encode the output language

# Import modules
import numpy as np
from tensorflow.keras.utils import to_categorical

# Instantiate a temporary variable
ylist = list()
# Loop over the sequences of numerical indexes
for sequence in Y:
    # One-hot encode each index in the current sentence
    encoded = to_categorical(sequence, num_classes=vocab_size)
    # Append the one-hot encoded values to the list
    ylist.append(encoded)
# Transform to np.array and reshape
Y = np.array(ylist).reshape(Y.shape[0], Y.shape[1], vocab_size)
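As a side note, since to_categorical accepts arrays of any shape, the same 3D result can be obtained in one call when Y is still the padded 2D array of indexes, making the loop optional:

# Equivalent vectorized version (use instead of the loop above,
# assuming Y is still the padded 2D array of indexes)
Y = to_categorical(Y, num_classes=vocab_size)
# Y now has shape (num_sentences, sentence_length, vocab_size)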

Note on training and evaluating

Training the model:

model.fit(X, Y, epochs=N)
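Because Y is one-hot encoded, a matching loss is categorical_crossentropy. A minimal compile-and-fit sketch with illustrative hyperparameters:

# Compile with a loss that matches one-hot encoded targets
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train; the number of epochs, batch size and validation split are illustrative
model.fit(X, Y, epochs=10, batch_size=64, validation_split=0.2)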

Evaluating:

  • Use the BLEU score (see the sketch below)
    • nltk.translate.bleu_score
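A rough BLEU evaluation sketch, assuming a trained model, hypothetical held-out inputs X_test, reference translations as lists of tokens in references, and the output-language tokenizer:

import numpy as np
from nltk.translate.bleu_score import corpus_bleu

# Predict probabilities and keep the most likely word index at each time step
preds = np.argmax(model.predict(X_test), axis=-1)
# Map indexes back to words with the output tokenizer, skipping the padding index 0
hypotheses = [[tokenizer.index_word[i] for i in sent if i != 0] for sent in preds]
# corpus_bleu expects a list of reference lists for each sentence
print(corpus_bleu([[ref] for ref in references], hypotheses))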

Let's practice!
