Tokenization and Lemmatization

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Text sources

  • News articles
  • Tweets
  • Comments

Making text machine friendly

  • Dogs, dog
  • reduction, REDUCING, Reduce
  • don't, do not
  • won't, will not

Text preprocessing techniques

  • Converting words to lowercase
  • Removing leading and trailing whitespace
  • Removing punctuation
  • Removing stopwords
  • Expanding contractions
  • Removing special characters (numbers, emojis, etc.)
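These steps can be sketched with plain Python string methods. This sketch is not from the course; the example sentence, the tiny contraction map, and the mini stopword set are made up for illustration (real projects would use a fuller stopword list and a contraction-expansion library).

import string

text = "  Don't buy 3 DOGS!!  "

# Convert to lowercase and strip leading/trailing whitespace
text = text.lower().strip()

# Expand contractions with a tiny hand-made mapping
contractions = {"don't": "do not", "won't": "will not"}
for short, full in contractions.items():
    text = text.replace(short, full)

# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Remove special characters such as digits
text = ''.join(ch for ch in text if not ch.isdigit())

# Remove stopwords (illustrative mini list)
stopwords = {'a', 'an', 'the', 'is'}
text = ' '.join(word for word in text.split() if word not in stopwords)

print(text)  # do not buy dogs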

Tokenization

"I have a dog. His name is Hachi."

Tokens:

["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]
"Don't do this."

Tokens:

["Do", "n't", "do", "this", "."]

Tokenization using spaCy

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']
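Note that spaCy splits contractions into separate tokens ("do", "n't") instead of treating "don't" as a single word. As a possible follow-up (not shown in the slides), each token also exposes attributes such as token.is_punct, token.is_stop and token.lower_, which cover several of the preprocessing steps listed earlier; the exact output depends on spaCy's built-in stopword list.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Hello! I don't know what I'm doing here.")

# Keep lowercased tokens that are neither punctuation nor stopwords
cleaned = [token.lower_ for token in doc
           if not token.is_punct and not token.is_stop]
print(cleaned)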

Lemmatization

  • Convert word into its base form
    • reducing, reduces, reduced, reduction → reduce
    • am, are, is → be
    • n't → not
    • 've → have

Lemmatization using spaCy

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)

# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']
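The '-PRON-' placeholder is what the spaCy v2 lemmatizer returns for pronouns; newer spaCy versions return the pronoun itself. As a possible extension (not part of the slides), lemmatization can be combined with the earlier preprocessing steps by keeping only alphabetic, non-stopword lemmas; the exact result depends on your spaCy version and its stopword list.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Hello! I don't know what I'm doing here.")

# Keep lemmas of alphabetic tokens that are not stopwords
lemmas = [token.lemma_.lower() for token in doc
          if token.is_alpha and not token.is_stop]
print(lemmas)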

Let's practice!
