Tf-idf with gensim

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is tf-idf?

  • Term frequency - inverse document frequency
  • Allows you to determine the most important words in each document
  • Each corpus may have shared words beyond just stopwords
  • These words should be down-weighted in importance
  • Example from astronomy: "Sky"
  • Ensures the most common words don't show up as keywords
  • Keeps document-specific frequent words weighted high

Tf-idf formula

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

$$w_{i,j} = \text{tf-idf weight for token } i \text{ in document } j$$

$$tf_{i,j} = \text{number of occurrences of token } i \text{ in document } j$$

$$df_i = \text{number of documents that contain token } i$$

$$N = \text{total number of documents}$$
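
To make the formula concrete, here is a minimal sketch in plain Python. The toy documents, the function name `tf_idf`, and the choice of the natural logarithm are assumptions for illustration; the formula itself does not fix a log base. Note how "sky" appears in every document, so $\log(N/df_i) = \log(1) = 0$ and its weight drops to zero.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tokenized astronomy documents.
docs = [
    ["sky", "star", "star", "telescope"],
    ["sky", "planet", "orbit", "planet"],
    ["sky", "galaxy", "star"],
]

def tf_idf(docs):
    """Return one {token: weight} dict per document, using w = tf * log(N / df)."""
    n_docs = len(docs)
    # df[token] = number of documents that contain the token at least once
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts within this document
        weights.append(
            {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        )
    return weights

weights = tf_idf(docs)
for j, w in enumerate(weights):
    # Print each document's tokens from highest to lowest weight
    print(j, sorted(w.items(), key=lambda kv: -kv[1]))
```

In this sketch "planet" ends up with the highest weight in the second document: it occurs twice there and in no other document.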

Tf-idf with gensim

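from gensim.models.tfidfmodel import TfidfModel

# corpus: bag-of-words documents, each a list of (token_id, count) pairs
tfidf = TfidfModel(corpus)
# Tf-idf weights for the second document, as (token_id, weight) pairs
tfidf[corpus[1]]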
[(0, 0.1746298276735174),
 (1, 0.1746298276735174),
 (9, 0.29853166221463673),
 (10, 0.7716931521027908),
...
]
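
The weights above all fall between 0 and 1, which the plain formula alone would not guarantee. A likely explanation, assuming gensim's documented defaults (base-2 logarithm and normalization of each document vector to unit length), can be sketched in plain Python. This is not gensim's implementation; the toy documents, `token2id`, and `tfidf_unit_length` are hypothetical names for illustration, and dropping zero-weight tokens is likewise an assumption about the default behavior.

```python
import math
from collections import Counter

# Hypothetical tokenized documents; the slide's `corpus` is not shown.
docs = [
    ["sky", "star", "star", "telescope"],
    ["sky", "planet", "orbit", "planet"],
    ["sky", "galaxy", "star"],
]

# Map each token to an integer id, as a corpus dictionary would.
token2id = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Bag-of-words corpus: each document becomes sorted (token_id, count) pairs.
corpus = [sorted((token2id[t], c) for t, c in Counter(d).items()) for d in docs]

def tfidf_unit_length(bow, corpus):
    """Weight a bag-of-words document with tf * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    # Document frequency per token id (each id appears once per document)
    df = Counter(tid for doc in corpus for tid, _ in doc)
    raw = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in bow]
    # Assumption: tokens whose weight is zero are left out of the result
    raw = [(tid, w) for tid, w in raw if w > 0]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm else []

vec = tfidf_unit_length(corpus[1], corpus)
print(vec)
```

The result has the same shape as the slide's output: a list of `(token_id, weight)` pairs whose squared weights sum to 1, with the token shared by every document ("sky") missing entirely.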

Let's practice!
