Introduction to Natural Language Processing in Python
Katharine Jarmul
Founder, kjamistan
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id
{'!': 11,
',': 17,
'.': 7,
'a': 2,
'about': 4,
...}
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]
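Each pair in the corpus output is (token id, count of that token in the document). To make that concrete, here is a minimal standard-library sketch of the same idea. The helpers `build_token2id` and `doc_to_bow` are hypothetical names for illustration, not gensim functions, and the ids they assign may differ from the ones gensim produces.

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Assign an integer id to every distinct token (illustrative only)."""
    token2id = {}
    for doc in tokenized_docs:
        # Sort within each document so the id assignment is deterministic.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Assumption: whitespace splitting stands in for word_tokenize here.
sample_docs = [
    'i liked the movie'.split(),
    'the movie was about space space'.split(),
]
token2id = build_token2id(sample_docs)
bow = [doc_to_bow(doc, token2id) for doc in sample_docs]
print(token2id)
print(bow)
```

Reading a pair back is a lookup in the reverse direction: find the token whose id matches the first element, and the second element says how often it occurred in that document.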
gensim models can be easily saved, updated, and reused