Introduction to tokenization

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is tokenization?

  • Turning a string or document into tokens (smaller chunks)
  • One step in preparing a text for NLP
  • Many different theories and rules
  • You can create your own rules using regular expressions
  • Some examples:
    • Breaking out words or sentences
    • Separating punctuation
    • Separating all hashtags in a tweet
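Rules like these can be prototyped directly with regular expressions. As a minimal sketch (the tweet text here is made up for illustration), hashtags can be pulled out with a pattern matching `#` followed by word characters:

```python
import re

# Example tweet text (invented for illustration).
tweet = "Loving #NLP and #Python! @kjam knows best."

# A hashtag: '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#NLP', '#Python']
```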

nltk library

  • nltk: natural language toolkit
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
['Hi', 'there', '!']

Why tokenize?

  • Easier to map part of speech
  • Matching common words
  • Removing unwanted tokens
  • "I don't like Sam's shoes."
  • "I", "do", "n't", "like", "Sam", "'s", "shoes", "."
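The splitting above can be approximated with a regular expression. This is only a rough, pure-`re` sketch covering the two contractions in this sentence; nltk's `word_tokenize` handles many more cases:

```python
import re

# Rough sketch of Treebank-style splitting: peel off "n't" and "'s",
# then take runs of word characters, then single punctuation marks.
pattern = r"\w+(?=n't)|n't|'s|\w+|[^\w\s]"
tokens = re.findall(pattern, "I don't like Sam's shoes.")
print(tokens)
# ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
```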

Other nltk tokenizers

  • sent_tokenize: tokenize a document into sentences

  • regexp_tokenize: tokenize a string or document based on a regular expression pattern

  • TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
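As a minimal sketch of what `sent_tokenize` does, sentences can be split on end-of-sentence punctuation followed by whitespace (the sample text is invented; nltk's trained tokenizer is far more robust with abbreviations, quotes, and so on):

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
text = "Hi there! How are you? I'm learning NLP."
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['Hi there!', 'How are you?', "I'm learning NLP."]
```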


More regex practice

  • Difference between re.search() and re.match()
import re
re.match('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.search('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.match('cd', 'abcde')  # no output: returns None, 'cd' is not at the start
re.search('cd', 'abcde')
<_sre.SRE_Match object; span=(2, 4), match='cd'>
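In short: `match()` only succeeds at the very start of the string, while `search()` scans the whole string. Anchoring a `search()` pattern with `^` makes it behave like `match()`:

```python
import re

no_match = re.match('cd', 'abcde')    # None: 'cd' is not at position 0
found = re.search('cd', 'abcde')      # matches at span (2, 4)
anchored = re.search('^cd', 'abcde')  # None again, behaves like match()
print(no_match, found.span(), anchored)
```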

Let's practice!

