Part 2: Preprocessing the text

Machine Translation with Keras

Thushan Ganegedara

Data Scientist and Author

Adding special starting/ending tokens

The sentence:

'les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre .'

becomes:

'sos les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre . eos'

after adding special tokens

  • sos - Start of a sentence/sequence
  • eos - End of a sentence/sequence
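
As a minimal sketch of this step, the markers can be added with plain string formatting. The helper name `add_special_tokens` and the list `fr_sentences` are hypothetical names chosen for illustration; only the `sos`/`eos` tokens and the example sentence come from the slide.

```python
def add_special_tokens(sentence, start_token="sos", end_token="eos"):
    """Wrap a sentence with start/end markers, separated by single spaces."""
    return f"{start_token} {sentence} {end_token}"

# Hypothetical list holding the French (target-side) sentences
fr_sentences = [
    "les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre .",
]

fr_with_tokens = [add_special_tokens(s) for s in fr_sentences]
print(fr_with_tokens[0])
```

The tokens are ordinary words from the tokenizer's point of view, so they should be added before the tokenizer is fitted on the target sentences.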

Padding the sentences

  • In real-world datasets, sentences rarely all have the same number of words

  • Importing pad_sequences

from tensorflow.keras.preprocessing.sequence import pad_sequences
  • Converting sentences to sequences
sentences = [
  'new jersey is sometimes quiet during autumn .',
  'california is never rainy during july , but it is sometimes beautiful in february .'
]
# en_tok: assumed to be a Keras Tokenizer already fitted on the English sentences
seqs = en_tok.texts_to_sequences(sentences)
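
To show roughly what converting sentences to sequences involves, here is a simplified pure-Python stand-in. It is not the Keras implementation: it only lowercases and splits on spaces, whereas the Keras `Tokenizer` also filters punctuation by default. The names `build_word_index`, `texts_to_ids`, and `corpus` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Assign IDs 1, 2, 3, ... to words, most frequent first; 0 is left unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its ID, skipping words missing from the index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

corpus = ["new jersey is sometimes quiet", "california is never quiet"]
word_index = build_word_index(corpus)
toy_seqs = texts_to_ids(corpus, word_index)
print(word_index)
print(toy_seqs)
```

Because the IDs start at 1, the value 0 never collides with a real word in this sketch.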

preproc_text = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)

for orig, padded in zip(seqs, preproc_text):
    print(orig, ' => ', padded)

The first sentence gets five 0s appended at the end:

#  'new jersey is sometimes quiet during autumn .',
[18, 20, 2, 10, 32, 5, 46]  =>  [18 20  2 10 32  5 46  0  0  0  0  0]

The second sentence has one word truncated at the end:

# 'california is never rainy during july , but it is sometimes beautiful in february .'
[21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38]  =>  [21  2 11 47  5 41  7  4  2 10 30  3]

  • The Keras Tokenizer never assigns 0 as a word ID, so 0 is free to serve as the padding value
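
The behaviour of `padding='post'` and `truncating='post'` can be sketched in a few lines of plain Python. This is an illustration of the semantics only, not how `pad_sequences` is implemented, and it returns nested lists rather than a NumPy array. The function name `pad_post` and the variable `example_seqs` are hypothetical; the ID sequences are the ones shown on the slide.

```python
def pad_post(seqs, maxlen, value=0):
    """Keep the first `maxlen` IDs of each sequence, then right-pad with `value`."""
    padded = []
    for seq in seqs:
        kept = list(seq[:maxlen])  # post-truncation: drop IDs from the end
        # post-padding: append the padding value until the row has maxlen entries
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

example_seqs = [
    [18, 20, 2, 10, 32, 5, 46],                      # 7 IDs: needs 5 zeros
    [21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38],  # 13 IDs: loses the last one
]
padded_rows = pad_post(example_seqs, maxlen=12)
for row in padded_rows:
    print(row)
```

Every row comes out with exactly 12 entries, which is what lets the batch be stored as a single rectangular array.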

Benefit of reversing sentences

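  • Reversing the source sentence places its first words closest to the first words the decoder has to produce, which gives a stronger initial connection between the encoder and the decoder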

Figure: Distance with and without reversing
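
The distance comparison can be made concrete with a small calculation. The sketch below rests on simplifying assumptions that are not from the slides: the source and target have the same length `n`, source word `i` corresponds to target word `i`, and the target words are fed right after the source words. The function name `distances` is hypothetical.

```python
def distances(n, reverse_source):
    """Time steps between source word i and target word i, for i = 0..n-1."""
    out = []
    for i in range(n):
        # Position of source word i in the encoder input
        src_pos = (n - 1 - i) if reverse_source else i
        # Target word i follows the n source positions
        tgt_pos = n + i
        out.append(tgt_pos - src_pos)
    return out

print(distances(6, reverse_source=False))  # [6, 6, 6, 6, 6, 6]
print(distances(6, reverse_source=True))   # [1, 3, 5, 7, 9, 11]
```

Under these assumptions the average distance is the same in both cases, but reversing brings the first few word pairs much closer together, which is where the decoder starts.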


Reversing the sentences

  • Creating padded sequences and reversing the sequences on the time dimension
    sentences = ["california is never rainy during july ."]
    seqs = en_tok.texts_to_sequences(sentences)
    pad_seq = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)
    print(pad_seq)

[[21  2  9 25  5 27  0  0  0  0  0  0]]

pad_seq
[[21  2  9 25  5 27  0  0  0  0  0  0]]

# Reverse along the time axis (axis 1); the padding zeros move to the front
pad_seq = pad_seq[:,::-1]
pad_seq
[[ 0  0  0  0  0  0 27  5 25  9  2 21]]

# The last 6 positions hold the word IDs; map them back to words
rev_sent = [en_tok.index_word[wid] for wid in pad_seq[0][-6:]]
print('Sentence: ', sentences[0])
print('\tReversed: ',' '.join(rev_sent))

Sentence:  california is never rainy during july .
    Reversed:  july during rainy never is california
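
The same reversal can be checked without a fitted tokenizer. The sketch below uses plain lists and a hand-written `index_word` dictionary built from the IDs shown above; in the slides that mapping comes from `en_tok.index_word`. Instead of slicing the last 6 positions, it skips the padding value 0, which avoids hard-coding the sentence length. The variable names are hypothetical.

```python
# Hand-written stand-in for en_tok.index_word, using the IDs from the example
index_word = {21: "california", 2: "is", 9: "never", 25: "rainy", 5: "during", 27: "july"}

padded = [[21, 2, 9, 25, 5, 27, 0, 0, 0, 0, 0, 0]]

# Reverse each row along the time dimension
reversed_rows = [row[::-1] for row in padded]

# Decode back to words, ignoring the padding value 0
decoded = [" ".join(index_word[wid] for wid in row if wid != 0)
           for row in reversed_rows]

print(reversed_rows[0])
print(decoded[0])
```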

Let's practice!
