Part 1: Preprocessing the Data

Machine Translation with Keras

Thushan Ganegedara

Data Scientist and Author

Introduction to data

  • Data

    • en_text: A Python list of English sentences; each sentence is a string of words separated by spaces.
    • fr_text: A Python list of the corresponding French translations, in the same format.
  • Printing some data in the dataset

for en_sent, fr_sent in zip(en_text[:3], fr_text[:3]):
  print("English: ", en_sent)
  print("\tFrench: ", fr_sent)
English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
    French:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
English:  the united states is usually chilly during july , and it is usually freezing in november .
    French:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
...
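The course provides en_text and fr_text pre-loaded. If you had to build them yourself, a minimal sketch (with hypothetical file names) could read two parallel files, one sentence per line:

# Hypothetical file names -- the course environment provides these lists pre-loaded
with open("sentences_en.txt", encoding="utf-8") as f:
    en_text = [line.strip() for line in f]
with open("sentences_fr.txt", encoding="utf-8") as f:
    fr_text = [line.strip() for line in f]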

Word tokenization

  • Tokenization

    • Process of breaking a sentence/phrase into individual words/characters
    • E.g. "I watched a movie last night, it was okay." becomes
    • [I, watched, a, movie, last, night, it, was, okay] (see the sketch after the code below)
  • Tokenization with Keras

    • Learns a mapping from words to word IDs using a given corpus.
    • Can be used to convert a given string to a sequence of IDs.
from tensorflow.keras.preprocessing.text import Tokenizer
en_tok = Tokenizer()
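To see plain tokenization in isolation, Keras also ships a standalone text_to_word_sequence helper; the output below assumes its default settings, which lowercase the text and strip punctuation.

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Default filters strip punctuation and lowercase the text
words = text_to_word_sequence("I watched a movie last night, it was okay.")
print(words)
# ['i', 'watched', 'a', 'movie', 'last', 'night', 'it', 'was', 'okay']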

Fitting the Tokenizer

  • Fitting the Tokenizer on data
    • Tokenizer needs to be fit on some data (i.e. sentences) to learn the word to word ID mapping.
en_tok = Tokenizer()
en_tok.fit_on_texts(en_text)
  • Getting the word to ID mapping
    • Use the Tokenizer's word_index attribute.
word_id = en_tok.word_index["january"] # => returns 51 (avoid naming it id, which shadows a Python built-in)
  • Getting the ID to word mapping
w = en_tok.index_word[51] # => returns 'january'
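A minimal end-to-end sketch of fitting; the toy corpus below is illustrative, and the exact IDs depend on what you fit on (more frequent words get smaller IDs).

from tensorflow.keras.preprocessing.text import Tokenizer

# Tiny illustrative corpus; real IDs depend on the corpus you fit on
toy_corpus = ["new jersey is sometimes quiet during autumn",
              "it is snowy in april"]

tok = Tokenizer()
tok.fit_on_texts(toy_corpus)
print(tok.word_index)    # e.g. {'is': 1, 'new': 2, 'jersey': 3, ...}
print(tok.index_word[1]) # 'is' (the most frequent word)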

Transforming sentences to sequences

seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
[[26, 70, 27, 73, 7, 74]]
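The IDs shown come from the tokenizer fitted on the course corpus; on other data they will differ. Assuming en_tok has been fitted as on the previous slide, the sequence can also be mapped back to words with sequences_to_texts (note that the default filters drop the punctuation):

seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print(en_tok.sequences_to_texts(seq)) # ['she likes grapefruit peaches and lemons']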

Limiting the size of the vocabulary

  • You can limit the size of the vocabulary in a Keras Tokenizer; with num_words=50, only the 49 most frequent words are used when converting texts to sequences.
tok = Tokenizer(num_words=50)
  • Out-of-vocabulary (OOV) words

    • Rare words in the training corpus (i.e. collection of text).
    • Words that are not present in the training set.
  • E.g.

    • tok.fit_on_texts(["I drank milk"])
    • tok.texts_to_sequences(["I drank water"])
    • The word water is an OOV word and will be ignored (see the sketch below).
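A minimal sketch of this dropping behaviour, combining num_words with an unseen word:

from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(num_words=50)      # keep only the 49 most frequent words
tok.fit_on_texts(["I drank milk"]) # vocabulary: i, drank, milk

# "water" was never seen during fitting, so it is silently dropped
print(tok.texts_to_sequences(["I drank water"])) # [[1, 2]]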

Treating Out-of-Vocabulary words

  • Defining an OOV token
tok = Tokenizer(num_words=50, oov_token='UNK')
  • E.g.
    • tok.fit_on_texts(["I drank milk"])
    • tok.texts_to_sequences(["I drank water"])
    • The word water is an OOV word and will be replaced with UNK.
      • i.e. Keras will see "I drank UNK"
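A sketch of the same example with an OOV token; Keras assigns the OOV token ID 1, so the other IDs shift up by one:

from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(num_words=50, oov_token='UNK')
tok.fit_on_texts(["I drank milk"])
print(tok.word_index) # {'UNK': 1, 'i': 2, 'drank': 3, 'milk': 4}

# "water" maps to the UNK ID instead of being dropped
print(tok.texts_to_sequences(["I drank water"])) # [[2, 3, 1]]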

Let's practice!
