Simple text preprocessing

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Why preprocess?

  • Produces better input data
    • Especially for machine learning or other statistical methods
  • Examples:
    • Tokenization to create a bag of words
    • Lowercasing words
    • Lemmatization/Stemming
      • Shorten words to their root stems
    • Removing stop words, punctuation, or unwanted tokens
  • Good to experiment with different approaches
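
As a rough illustration of the first two examples, the sketch below builds a bag of words with and without lowercasing. It uses only the standard library; the regex tokenizer, the `bag_of_words` helper, and the sample sentence are assumptions made for this illustration, not part of the course material.

```python
import re
from collections import Counter

def bag_of_words(text, lowercase=True):
    """Count alphabetic tokens (hypothetical helper with a toy regex tokenizer)."""
    if lowercase:
        text = text.lower()
    # Toy tokenizer: every run of ASCII letters is one token
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens)

sample = "The dog saw the cat. The cat ran."  # made-up sample sentence
raw_counts = bag_of_words(sample, lowercase=False)
lower_counts = bag_of_words(sample, lowercase=True)

# Without lowercasing, "The" and "the" are counted as different words
print(raw_counts["The"], raw_counts["the"])
# With lowercasing, they collapse into a single entry
print(lower_counts["the"])
```

Lowercasing merges the two spellings into one count, which is usually what a frequency-based model needs.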

Preprocessing example

  • Input text: Cats, dogs and birds are common pets. So are fish.

  • Output tokens: cat, dog, bird, common, pet, fish
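
One way the input above could be turned into those output tokens is sketched here, using only the standard library. The tiny stop-word set and the `naive_singular` helper are hypothetical stand-ins for a real stop-word list and a real stemmer or lemmatizer; they are chosen so that this one sentence works and would not generalize.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer
TOY_STOP_WORDS = {"and", "are", "so"}

def naive_singular(word):
    """Crude stand-in for stemming/lemmatization: drop one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # Lowercase, then keep runs of letters (this also drops punctuation)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then reduce the remaining words to a rough root form
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return [naive_singular(t) for t in kept]

result = preprocess("Cats, dogs and birds are common pets. So are fish.")
print(result)
```

In practice a library stemmer or lemmatizer would replace `naive_singular`, since stripping a trailing "s" mangles many English words.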


Text preprocessing with Python

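from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK tokenizer and stop-word data have already been downloaded
text = """The cat is in the box. The cat likes the box.
          The box is over the cat."""

# Lowercase, tokenize, and keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Drop English stop words
no_stops = [t for t in tokens if t not in stopwords.words('english')]
# Two most frequent remaining tokens
Counter(no_stops).most_common(2)
[('cat', 3), ('box', 3)]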
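
To see why the stop-word step matters, the sketch below counts the same text with and without filtering. It avoids NLTK so it runs without any downloads; the regex tokenizer and the four-word stop list are assumptions for this example and are far smaller than NLTK's English list.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list covering only the words in this text
TOY_STOP_WORDS = {"the", "is", "in", "over"}

text = "The cat is in the box. The cat likes the box. The box is over the cat."
tokens = re.findall(r"[a-z]+", text.lower())

# Counting every token lets the most frequent function word dominate
with_stops = Counter(tokens).most_common(2)
# Filtering first leaves the content words at the top
without_stops = Counter(
    t for t in tokens if t not in TOY_STOP_WORDS
).most_common(2)

print(with_stops)
print(without_stops)
```

Without filtering, "the" is the most frequent token by a wide margin and says little about the topic of the text.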

Let's practice!
