Introduction to machine translation

Machine Translation with Keras

Thushan Ganegedara

Data Scientist and Author

Machine translation

[Figure: "Hello" translations]

[Figure: "Hello" translation with Google icon]

Course outline

  • Chapter 1 - Introduction to machine translation
  • Chapter 2 - Implement a machine translation model (encoder-decoder architecture)
  • Chapter 3 - Training the model and generating translations
  • Chapter 4 - Improving the translation model

Dataset (English-French sentence corpus)

  • English corpus
new jersey is sometimes quiet during autumn , and it is snowy in april .
the united states is usually chilly during july , and it is usually freezing ...
california is usually quiet during march , and it is usually hot in june .
  • French corpus
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
les états-unis est généralement froid en juillet , et il gèle habituellement ...
california est généralement calme en mars , et il est généralement chaud en juin .
Source: https://github.com/udacity/deep-learning/tree/master/language-translation/data

Machine translation - Overview

[Figure: English to French]

[Figure: Source and target language terms]

[Figure: Machine translation model]

One-hot encoded vectors

  • A vector of zeros with a single one at the index of the word it represents
  • Vector length is determined by the size of the vocabulary
  • Vocabulary - the collection of unique words in the dataset

[Figure: One-hot vectors]
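
The three points above can be sketched in plain Python. This is a minimal illustration, not the course's implementation: the two sentences are shortened versions of lines from the English corpus, and the `one_hot` helper is a hypothetical name introduced here.

```python
# Toy corpus (assumption: shortened lines from the English corpus)
sentences = [
    "new jersey is sometimes quiet during autumn",
    "california is usually quiet during march",
]

# Vocabulary: the unique words in the dataset, sorted for a stable order
vocabulary = sorted({word for s in sentences for word in s.split()})
vocab_size = len(vocabulary)


def one_hot(word):
    """Hypothetical helper: zeros everywhere, a single one at the word's position."""
    vector = [0] * vocab_size
    vector[vocabulary.index(word)] = 1
    return vector


print(vocab_size)        # vector length equals the vocabulary size
print(one_hot("quiet"))
```

Each vector has as many entries as there are unique words, so a larger vocabulary means longer vectors.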

One-hot encoded vectors

A mapping containing words and their corresponding indices

word2index = {"I": 0, "like": 1, "cats": 2}

Converting words to IDs or indices

words = ["I", "like", "cats"]
word_ids = [word2index[w] for w in words]
print(word_ids)
[0, 1, 2]
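
In practice the mapping is usually built from the data rather than typed by hand. The sketch below shows one way to do that and to map IDs back to words; `index2word` and `recovered` are names assumed here for illustration.

```python
words = ["I", "like", "cats"]

# Build the word -> index mapping by numbering the words in order
word2index = {w: i for i, w in enumerate(words)}

# Invert it to get an index -> word mapping (useful for reading model output)
index2word = {i: w for w, i in word2index.items()}

word_ids = [word2index[w] for w in words]
recovered = [index2word[i] for i in word_ids]

print(word_ids)
print(recovered)
```

The inverse mapping is what turns predicted IDs back into readable words.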

One-hot encoded vectors

One-hot encoding without specifying output vector length

from tensorflow.keras.utils import to_categorical

onehot_1 = to_categorical(word_ids)
print([(w,ohe.tolist()) for w,ohe in zip(words, onehot_1)])
[('I', [1.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0]), ('cats', [0.0, 0.0, 1.0])]

One-hot encoding with the output vector length specified

onehot_2 = to_categorical(word_ids, num_classes=5)
print([(w,ohe.tolist()) for w,ohe in zip(words, onehot_2)])
[('I', [1.0, 0.0, 0.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0, 0.0, 0.0]), ('cats', [0.0, 0.0, 1.0, 0.0, 0.0])]
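
Why the length matters: when num_classes is omitted, to_categorical infers the length as the largest ID plus one, so a sentence that lacks the highest-ID word yields vectors shorter than the vocabulary. The sketch below imitates that behaviour in plain Python; `to_one_hot` is a hypothetical stand-in written for this example, not the Keras function.

```python
def to_one_hot(ids, num_classes=None):
    """Plain-Python stand-in for one-hot encoding a list of integer IDs."""
    if num_classes is None:
        # Without an explicit length, infer it from the largest ID present
        num_classes = max(ids) + 1
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in ids]


word2index = {"I": 0, "like": 1, "cats": 2}

# A sentence that happens not to contain the highest-ID word
short_ids = [word2index[w] for w in ["I", "like"]]

inferred = to_one_hot(short_ids)                             # length inferred from the IDs
fixed = to_one_hot(short_ids, num_classes=len(word2index))   # length tied to the vocabulary

print(len(inferred[0]), len(fixed[0]))  # 2 3
```

Passing the vocabulary size explicitly keeps every sentence's vectors the same length.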

Let's practice!
