Introduction to gensim

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is gensim?

  • Popular open-source NLP library
  • Uses top academic models to perform complex tasks
    • Building document or word vectors
    • Performing topic identification and document comparison

What is a word vector?

word2vec chart


Gensim example

from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

# Lowercase each document, then split it into tokens
tokenized_docs = [word_tokenize(doc.lower())
                  for doc in my_documents]

dictionary = Dictionary(tokenized_docs)
dictionary.token2id
{'!': 11,
 ',': 17,
 '.': 7,
 'a': 2,
 'about': 4,
...}

Creating a gensim corpus

corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]
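
To make the (token id, count) pairs above concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind doc2bow. It is not gensim's implementation: the toy documents, the `token2id` dict, and the `doc2bow_sketch` helper are all hypothetical, and ids are assigned in order of first appearance, which may differ from the order gensim uses.

```python
from collections import Counter

# Toy documents, already tokenized (hypothetical; not the slide's corpus).
docs = [
    ["i", "liked", "the", "movie"],
    ["the", "movie", "was", "awful", "awful"],
]

# Give each unseen token the next free integer id, in order of first appearance.
# (Assumption: gensim may assign ids in a different order.)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow_sketch(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


corpus_sketch = [doc2bow_sketch(doc) for doc in docs]
print(corpus_sketch)
```

Each inner list stands for one document, and a token that appears twice (here "awful") gets a count of 2 instead of being listed twice.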
  • gensim models can be easily saved, updated, and reused
  • Our dictionary can also be updated
  • This more advanced, feature-rich bag-of-words can be used in future exercises

Let's practice!
