Machine Translation with Keras
Thushan Ganegedara
Data Scientist and Author
The sentence:
'les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre .'
becomes:
'sos les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre . eos'
after adding special tokens
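A minimal sketch of how the special tokens could be added, assuming plain string concatenation with a space separator. The helper name `add_special_tokens` is hypothetical and not part of Keras.

```python
def add_special_tokens(sentence, start_token="sos", end_token="eos"):
    """Wrap a sentence in start/end markers (hypothetical helper, not a Keras API)."""
    return f"{start_token} {sentence.strip()} {end_token}"

fr_sentence = "les états-unis est parfois occupé en janvier , et il est parfois chaud en novembre ."
fr_with_tokens = add_special_tokens(fr_sentence)
print(fr_with_tokens)
```

The tokens are ordinary words as far as the tokenizer is concerned, so they should be added before the tokenizer is fitted on the text.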
sos - Start of a sentence/sequence
eos - End of a sentence/sequence
Real-world datasets rarely have the same number of words in every sentence, so sequences are padded or truncated to a fixed length
Importing pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
'new jersey is sometimes quiet during autumn .',
'california is never rainy during july , but it is sometimes beautiful in february .'
]
seqs = en_tok.texts_to_sequences(sentences)
preproc_text = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)
for orig, padded in zip(seqs, preproc_text):
    print(orig, ' => ', padded)
First sentence gets five 0s padded to the end:
# 'new jersey is sometimes quiet during autumn .',
[18, 20, 2, 10, 32, 5, 46] => [18 20 2 10 32 5 46 0 0 0 0 0]
Second sentence gets one word truncated at the end:
# 'california is never rainy during july , but it is sometimes beautiful in february .'
[21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38] => [21 2 11 47 5 41 7 4 2 10 30 3]
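The two outputs above can be reproduced with a small pure-Python sketch of what `padding='post'` and `truncating='post'` do. This is an illustration of the behaviour only, not the Keras implementation; `pad_post` is a hypothetical helper, and the word IDs are copied from the outputs shown above.

```python
def pad_post(seqs, maxlen, value=0):
    """Truncate and pad each sequence at the end (illustrative stand-in for pad_sequences)."""
    padded = []
    for seq in seqs:
        seq = list(seq)[:maxlen]                    # truncating='post': drop IDs past maxlen
        seq = seq + [value] * (maxlen - len(seq))   # padding='post': append zeros
        padded.append(seq)
    return padded

# Word IDs taken from the two example sentences above
seqs = [
    [18, 20, 2, 10, 32, 5, 46],
    [21, 2, 11, 47, 5, 41, 7, 4, 2, 10, 30, 3, 38],
]
padded = pad_post(seqs, maxlen=12)
for orig, pad in zip(seqs, padded):
    print(orig, ' => ', pad)
```

The first sequence has 7 IDs and gains five 0s; the second has 13 IDs and loses its last one (38).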
sentences = ["california is never rainy during july .",]
seqs = en_tok.texts_to_sequences(sentences)
pad_seq = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)
pad_seq
[[21 2 9 25 5 27 0 0 0 0 0 0]]
pad_seq = pad_seq[:,::-1]
[[ 0 0 0 0 0 0 27 5 25 9 2 21]]
rev_sent = [en_tok.index_word[wid] for wid in pad_seq[0][-6:]]
print('Sentence: ', sentences[0])
print('\tReversed: ',' '.join(rev_sent))
Sentence: california is never rainy during july .
Reversed: july during rainy never is california
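The reversal step can also be sketched without TensorFlow. The example below assumes a small dictionary standing in for `en_tok.index_word`, with IDs copied from the output above, and it skips the padding ID 0 instead of hard-coding the slice `[-6:]`. That filtering is a design choice for this sketch, not something the slides prescribe.

```python
# Stand-in for en_tok.index_word, limited to the words in this sentence (assumption)
index_word = {21: "california", 2: "is", 9: "never", 25: "rainy", 5: "during", 27: "july"}

pad_seq = [[21, 2, 9, 25, 5, 27, 0, 0, 0, 0, 0, 0]]

# Reverse every row; list equivalent of pad_seq[:, ::-1] on a NumPy array
rev_seq = [row[::-1] for row in pad_seq]
print(rev_seq)

# Map IDs back to words, ignoring the padding ID 0
rev_sent = [index_word[wid] for wid in rev_seq[0] if wid != 0]
print('Reversed: ', ' '.join(rev_sent))
```

Reversing a post-padded sequence moves the zeros to the front, so the real words end up at the end of the sequence.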