Word counts with bag-of-words

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Bag-of-words

  • Basic method for finding topics in a text
  • Need to first create tokens using tokenization
  • ... and then count up all the tokens
  • The more frequent a word, the more important it might be
  • Can be a great way to determine the significant words in a text
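
The two steps above (tokenize, then count) can be sketched without any library. This is a minimal illustration, not the course's implementation: the helper name `count_tokens`, the sample sentence, and the naive whitespace split are all assumptions made for the example.

```python
def count_tokens(tokens):
    """Count how often each token occurs, using a plain dict."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


# Step 1: tokenize (assumption: a naive whitespace split, for illustration only)
tokens = "to be or not to be".split()

# Step 2: count up all the tokens
counts = count_tokens(tokens)

# Rank tokens from most to least frequent
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

A real tokenizer handles punctuation and contractions, which a whitespace split does not.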

Bag-of-words example

  • Text: "The cat is in the box. The cat likes the box. The box is over the cat."

  • Bag of words (stripped punctuation):

    • "The": 3, "box": 3
    • "cat": 3, "the": 3
    • "is": 2
    • "in": 1, "likes": 1, "over": 1
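
The counts above can be reproduced with the standard library alone. This sketch assumes a simple regular expression (`\w+`) as a stand-in for a proper tokenizer; it drops punctuation because it only matches runs of letters, digits, and underscores.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: \w+ is an illustrative tokenizer that discards punctuation
tokens = re.findall(r"\w+", text)

# Count each distinct token; "The" and "the" stay separate because case is preserved
bag = Counter(tokens)
```

Because case is preserved, "The" and "the" are counted as two different words, exactly as in the list above.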

Bag-of-words in Python

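from collections import Counter

from nltk.tokenize import word_tokenize

# Tokenize the text, then count every token (punctuation is kept as a token)
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
counter
Counter({'.': 3,
         'The': 3,
         'box': 3,
         'cat': 3,
         'in': 1,
         ...
         'the': 3})
# Five tokens tie at a count of 3, so which two come first depends on how ties are ordered
counter.most_common(2)
[('The', 3), ('box', 3)]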

Let's practice!
