Introduction to Natural Language Processing in Python
Katharine Jarmul
Founder, kjamistan

from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id
{'!': 11,
',': 17,
'.': 7,
'a': 2,
'about': 4,
...}

corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]
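
To make the (token id, count) pairs above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea. It is not gensim's implementation: the helpers `build_token2id` and `doc_to_bow` are hypothetical names, the two short documents are made up for illustration, and ids are assigned in order of first appearance, so they will not match the ids gensim produces.

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Give each new token the next free integer id (assumed ordering: first appearance)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Illustrative, already-tokenized documents (not the slide's corpus).
docs = [
    ["i", "liked", "the", "movie", "!"],
    ["the", "movie", "was", "awful", "!", "i", "hate", "it", "!"],
]

token2id = build_token2id(docs)
bow_corpus = [doc_to_bow(doc, token2id) for doc in docs]
```

Each inner list in `bow_corpus` plays the same role as one row of `corpus` above: the first number identifies a token, the second says how often it occurs in that document. In the second document the token '!' appears twice, so its pair carries a count of 2.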
gensim models can be easily saved, updated, and reused