Introduction to gensim

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is gensim?

Popular open-source NLP library
Uses top academic models to perform complex tasks
- Building document or word vectors
- Performing topic identification and document comparison

What is a word vector?

word2vec chart
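
A word vector represents a word as a list of numbers, so that related words end up close together. The sketch below illustrates that idea with cosine similarity. It does not use gensim, and the three-dimensional vectors in `toy_vectors` are hypothetical values invented for illustration. Real word2vec vectors are learned from text and typically have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, made up for this example (not trained values)
toy_vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.7, 0.2],
    'apple': [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors['king'], toy_vectors['queen'])
sim_fruit = cosine_similarity(toy_vectors['king'], toy_vectors['apple'])

# With these toy values, 'king' is closer to 'queen' than to 'apple'
print(sim_royal > sim_fruit)
```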

Gensim example

LDA data visualization

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

Creating a gensim dictionary

from gensim.corpora.dictionary import Dictionary

from nltk.tokenize import word_tokenize

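# Six short movie reviews used as a toy document collection
my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.',
                'The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.',
                'More space films, please!']

# Lowercase each review and split it into tokens
# (word_tokenize needs NLTK's tokenizer data to be downloaded first)
tokenized_docs = [word_tokenize(doc.lower())
                  for doc in my_documents]

# Map every unique token to an integer id
dictionary = Dictionary(tokenized_docs)

# Inspect the token-to-id mapping
dictionary.token2id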

{'!': 11,
 ',': 17,
 '.': 7,
 'a': 2,
 'about': 4,
...}
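
To make the token-to-id idea concrete, here is a minimal pure-Python sketch of such a mapping. It is an illustration only, not gensim's implementation. The helper `build_token2id` is a hypothetical name, tokens are split on whitespace instead of with `word_tokenize`, and ids are handed out in first-seen order, so they will not necessarily match the ids gensim assigns.

```python
def build_token2id(tokenized_docs):
    """Assign each unique token the next free integer id (first-seen order)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Simplified documents without punctuation, split on whitespace
raw_docs = ['I really liked the movie',
            'More space films please',
            'The movie was awful']
docs = [doc.lower().split() for doc in raw_docs]

token2id = build_token2id(docs)
print(token2id)
```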

Creating a gensim corpus

# Convert each tokenized document into a list of (token id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]
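
Each document in the corpus is a list of (token id, count) pairs. The sketch below reproduces that shape in pure Python with `collections.Counter`. It is a simplified illustration, not gensim's code: `to_bow` is a hypothetical helper and the small `token2id` mapping is made up for the example. Tokens missing from the mapping are skipped.

```python
from collections import Counter


def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical mapping, invented for this example
token2id = {'the': 0, 'movie': 1, 'was': 2, 'awful': 3, 'space': 4}

doc = 'the movie was awful the movie'.split()
bow = to_bow(doc, token2id)

# 'the' and 'movie' each appear twice, 'was' and 'awful' once
print(bow)
```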

gensim models can be easily saved, updated, and reused
Our dictionary can also be updated
This more advanced, feature-rich bag-of-words model can be used in future exercises

Let's practice!
