Introduction to tokenization

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is tokenization?

  • Turning a string or document into tokens (smaller chunks)
  • One step in preparing a text for NLP
  • Many different theories and rules
  • You can create your own rules using regular expressions
  • Some examples:
    • Breaking out words or sentences
    • Separating punctuation
    • Separating all hashtags in a tweet
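Rules like these can be prototyped directly with regular expressions. As a minimal sketch (the tweet text here is made up for illustration), hashtags can be pulled out with a pattern matching `#` followed by word characters:

```python
import re

# Example tweet text (invented for illustration).
tweet = "Loving #NLP and #Python! @kjam knows best."

# A hashtag: '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#NLP', '#Python']
```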

nltk library

  • nltk: natural language toolkit
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
['Hi', 'there', '!']

Why tokenize?

  • Easier to map part of speech
  • Matching common words
  • Removing unwanted tokens
  • "I don't like Sam's shoes."
  • "I", "do", "n't", "like", "Sam", "'s", "shoes", "."
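The splitting above can be approximated with a regular expression. This is only a rough, pure-`re` sketch covering the two contractions in this sentence; nltk's `word_tokenize` handles many more cases:

```python
import re

# Rough sketch of Treebank-style splitting: peel off "n't" and "'s",
# then take runs of word characters, then single punctuation marks.
pattern = r"\w+(?=n't)|n't|'s|\w+|[^\w\s]"
tokens = re.findall(pattern, "I don't like Sam's shoes.")
print(tokens)
# ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
```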

Other nltk tokenizers

  • sent_tokenize: tokenize a document into sentences

  • regexp_tokenize: tokenize a string or document based on a regular expression pattern

  • TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
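As a minimal sketch of what `sent_tokenize` does, sentences can be split on end-of-sentence punctuation followed by whitespace (the sample text is invented; nltk's trained tokenizer is far more robust with abbreviations, quotes, and so on):

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
text = "Hi there! How are you? I'm learning NLP."
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['Hi there!', 'How are you?', "I'm learning NLP."]
```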


More regex practice

  • Difference between re.search() and re.match()
import re
re.match('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.search('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.match('cd', 'abcde')  # no output: returns None, 'cd' is not at the start
re.search('cd', 'abcde')
<_sre.SRE_Match object; span=(2, 4), match='cd'>
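In short: `match()` only succeeds at the very start of the string, while `search()` scans the whole string. Anchoring a `search()` pattern with `^` makes it behave like `match()`:

```python
import re

no_match = re.match('cd', 'abcde')    # None: 'cd' is not at position 0
found = re.search('cd', 'abcde')      # matches at span (2, 4)
anchored = re.search('^cd', 'abcde')  # None again, behaves like match()
print(no_match, found.span(), anchored)
```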

Let's practice!

