Part 2: Preprocessing the text

Machine Translation with Keras

Thushan Ganegedara

Data Scientist and Author

Adding special starting/ending tokens

The sentence:

'les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre .'

becomes:

'sos les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre . eos'

after adding special tokens

sos - Start of a sentence/sequence
eos - End of a sentence/sequence
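
A minimal sketch of how the tokens could be added, assuming the French sentences are held in a plain Python list. The helper name `add_special_tokens` and the variable `fr_sentences` are hypothetical and not part of Keras.

```python
def add_special_tokens(sentences, start="sos", end="eos"):
    """Wrap every sentence in a start token and an end token."""
    return [f"{start} {sent} {end}" for sent in sentences]


# Hypothetical input list holding the example sentence from this slide
fr_sentences = [
    "les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre ."
]

fr_with_tokens = add_special_tokens(fr_sentences)
print(fr_with_tokens[0])
```

The tokens are ordinary words as far as the tokenizer is concerned, so they need to be added before the tokenizer is fitted on the target-language text.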

Padding the sentences

Real-world datasets rarely have the same number of words in every sentence
Importing pad_sequences

from tensorflow.keras.preprocessing.sequence import pad_sequences

Converting sentences to sequences

sentences = [
  'new jersey is sometimes quiet during autumn .',
  'california is never rainy during july , but it is sometimes beautiful in february .'
]
seqs = en_tok.texts_to_sequences(sentences)

Padding the sentences

preproc_text = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)

for orig, padded in zip(seqs, preproc_text):
    print(orig, ' => ', padded)

First sentence gets five 0s padded to the end:

#  'new jersey is sometimes quiet during autumn .',
[18, 20, 2, 10, 32, 5, 46]  =>  [18 20  2 10 32  5 46  0  0  0  0  0]

Second sentence gets one word truncated at the end:

# 'california is never rainy during july , but it is sometimes beautiful in february .'
[21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38]  =>  [21  2 11 47  5 41  7  4  2 10 30  3]

The Keras Tokenizer never assigns 0 as a word ID, so 0 can safely be used as the padding value
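
Because 0 is reserved, it cannot be confused with a real word after padding. The pure-Python sketch below mimics what `pad_sequences` does with `padding='post'` and `truncating='post'`, which may help to see the two steps separately. The function `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post(seqs, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad with `value`."""
    padded = []
    for seq in seqs:
        kept = list(seq[:maxlen])                      # truncating='post'
        kept += [value] * (maxlen - len(kept))         # padding='post'
        padded.append(kept)
    return padded


# Word IDs taken from the two example sentences above
seqs = [
    [18, 20, 2, 10, 32, 5, 46],
    [21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38],
]

for orig, padded in zip(seqs, pad_post(seqs, maxlen=12)):
    print(orig, " => ", padded)
```

The 7-word sequence gains five trailing zeros, and the 13-word sequence loses its last ID (38).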

Benefit of reversing sentences

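Reversing the source sentence places its first words closest to the first words the decoder has to produce, which helps to make a stronger initial connection between the encoder and the decoder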

Distance with and without reversing
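
The effect on distance can be worked out with a few lines of arithmetic. This is a rough sketch under simplifying assumptions that are not in the slides: source and target have the same length, word i of the source translates to word i of the target, and the decoder starts one step after the encoder reads its last word. The function `step_distance` is hypothetical.

```python
def step_distance(src_len, word_idx, reverse):
    """Time steps between reading source word `word_idx` and emitting target word `word_idx`."""
    # Position at which the encoder reads the source word
    src_pos = src_len - 1 - word_idx if reverse else word_idx
    # The decoder only starts once the whole source has been read
    tgt_pos = src_len + word_idx
    return tgt_pos - src_pos


n = 6  # number of words in the sentence
plain = [step_distance(n, i, reverse=False) for i in range(n)]
flipped = [step_distance(n, i, reverse=True) for i in range(n)]

print(plain)    # [6, 6, 6, 6, 6, 6]
print(flipped)  # [1, 3, 5, 7, 9, 11]
```

Under these assumptions the average distance is unchanged, but the first word of the sentence moves from 6 steps away to 1 step away, which is the short initial connection the slide refers to.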

Reversing the sentences

Creating padded sequences and reversing the sequences on the time dimension

sentences = ["california is never rainy during july .",]
seqs = en_tok.texts_to_sequences(sentences)
pad_seq = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)

[[21  2  9 25  5 27  0  0  0  0  0  0]]

Reversing the sentences

pad_seq

[[21  2  9 25  5 27  0  0  0  0  0  0]]

pad_seq = pad_seq[:,::-1]

[[ 0  0  0  0  0  0 27  5 25  9  2 21]]

# After reversing, the padding comes first; the last 6 positions hold the word IDs
rev_sent = [en_tok.index_word[wid] for wid in pad_seq[0][-6:]]
print('Sentence: ', sentences[0])
print('\tReversed: ',' '.join(rev_sent))

Sentence:  california is never rainy during july .
    Reversed:  july during rainy never is california
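
The slice `[-6:]` works here because this sentence has exactly six word IDs. A sketch that does not depend on the sentence length could skip the padding ID instead. The dictionary below is a hypothetical stand-in for `en_tok.index_word`, filled in from the IDs and words shown on this slide.

```python
PAD_ID = 0  # padding value; never used as a real word ID

# Hypothetical stand-in for en_tok.index_word, limited to this sentence
index_word = {21: "california", 2: "is", 9: "never",
              25: "rainy", 5: "during", 27: "july"}

# The reversed, padded sequence from the slide
reversed_ids = [0, 0, 0, 0, 0, 0, 27, 5, 25, 9, 2, 21]

# Keep only real words, whatever the sentence length
rev_sent = [index_word[wid] for wid in reversed_ids if wid != PAD_ID]
print(" ".join(rev_sent))  # july during rainy never is california
```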

Let's practice!
