Stemming and lemmatization

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

What is stemming?

Stemming is the process of transforming words to their root forms, even if the stem itself is not a valid word in the language.

staying, stays, stayed ----> stay
house, houses, housing ----> hous
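The mapping above can be imitated with a few suffix rules. The sketch below is a toy illustration only, not the Porter algorithm; the rule list `SUFFIXES` and the function `toy_stem` are hypothetical names chosen for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "es", "ed", "s", "e")  # hypothetical rule list, checked in order

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["staying", "stays", "stayed", "house", "houses", "housing"]:
    print(w, "---->", toy_stem(w))
```

The rules never check whether the result is a real word, which is why "house" ends up as "hous".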

What is lemmatization?

Lemmatization is similar to stemming, but it reduces words to roots that are valid words in the language.

stay, stays, staying, stayed ----> stay
house, houses, housing ----> house
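One way to picture lemmatization is as a lookup in a vocabulary of known word forms. The sketch below assumes a tiny hand-written table, `LEMMA_TABLE`, as a hypothetical stand-in for a real lexical database; `toy_lemmatize` is likewise an illustrative name.

```python
# Toy lemmatizer: looks each word up in a small hand-written table.
# LEMMA_TABLE is a hypothetical stand-in for a real lexical database.
LEMMA_TABLE = {
    "stays": "stay",
    "staying": "stay",
    "stayed": "stay",
    "houses": "house",
    "housing": "house",
}

def toy_lemmatize(word):
    """Return the table entry for the word, or the lowercased word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["stay", "stays", "staying", "stayed", "house", "houses", "housing"]:
    print(w, "---->", toy_lemmatize(w))
```

Because every output comes from a list of real words, the result is always a valid word, at the cost of needing that list in the first place.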

Stemming vs. lemmatization

Stemming

  • Produces roots of words
  • Fast and efficient to compute

Lemmatization

  • Produces actual words
  • Slower than stemming, and the result can depend on the word's part of speech
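To see why the part of speech matters, consider a word that has a different lemma as a noun than as a verb. The sketch below is a toy under stated assumptions: the table `POS_LEMMAS`, its entries, and the function `lemmatize_with_pos` are hypothetical, and the tags 'n', 'v', and 'a' are assumed to stand for noun, verb, and adjective.

```python
# Toy part-of-speech-aware lemmatizer (illustration only).
# POS_LEMMAS is a hypothetical table keyed by (word, part-of-speech tag).
POS_LEMMAS = {
    ("meeting", "n"): "meeting",  # "a long meeting"
    ("meeting", "v"): "meet",     # "we are meeting today"
    ("better", "a"): "good",      # "a better idea"
    ("better", "v"): "better",    # "to better oneself"
}

def lemmatize_with_pos(word, pos="n"):
    """Look up (word, pos); fall back to the lowercased word if unknown."""
    word = word.lower()
    return POS_LEMMAS.get((word, pos), word)

print(lemmatize_with_pos("meeting", pos="n"))
print(lemmatize_with_pos("meeting", pos="v"))
```

A stemmer has no such input: it applies the same rules to a string regardless of how the word is used.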

Stemming of strings

from nltk.stem import PorterStemmer

porter = PorterStemmer()
porter.stem('wonderful')
'wonder'

Non-English stemmers

Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish

from nltk.stem.snowball import SnowballStemmer

dutch_stemmer = SnowballStemmer("dutch")
dutch_stemmer.stem("beginnen")
'begin'

How to stem a sentence?

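# stem() treats the whole string as a single token, so nothing is stemmed
porter.stem('Today is a wonderful day!')
'today is a wonderful day!'
# Tokenize first, then stem each token
from nltk.tokenize import word_tokenize
tokens = word_tokenize('Today is a wonderful day!')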
stemmed_tokens = [porter.stem(token) for token in tokens]
stemmed_tokens
['today', 'is', 'a', 'wonder', 'day', '!']
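The tokenize-then-stem pattern can be wrapped in a small helper. The sketch below is a minimal, library-free version: `tokenize` uses a regular expression as a rough stand-in for `word_tokenize`, and `demo_stem` is a hypothetical one-rule stemmer that exists only so the example runs on its own.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def stem_sentence(text, stem):
    """Apply a one-token stemming function to every token in the text."""
    return [stem(token) for token in tokenize(text)]

def demo_stem(token):
    # Hypothetical stand-in stemmer: lowercases and strips a trailing "ful".
    token = token.lower()
    return token[:-3] if token.endswith("ful") else token

print(stem_sentence("Today is a wonderful day!", demo_stem))
```

Any callable that maps one token to its stem can be passed as `stem`, for example `porter.stem` in place of `demo_stem`.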

Lemmatization of a string

from nltk.stem import WordNetLemmatizer

wn_lemmatizer = WordNetLemmatizer()
# pos='a' tells the lemmatizer to treat the word as an adjective
wn_lemmatizer.lemmatize('wonderful', pos='a')
'wonderful'

Let's practice!
