Text normalization techniques

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

Text normalization

  • Normalization brings words into standard format
  • Useful in text classification, search, sentiment analysis
  • Techniques to explore:
    • Lowercasing
    • Stemming
    • Lemmatization

Image showing three people mentioning how they enjoy watching movies with great direction, in three different ways: "I enjoy watching movies with great direction!", "I enjoyed watching movies that are well-directed.", "Movies with excellent direction are something I enjoy."

Natural Language Processing (NLP) in Python

Lowercasing

  • If we don't normalize, these words will be treated as separate tokens
  • This will mislead models and inflate vocabulary size
text  = """Data scientists and data 
engineers need DATA"""
lower_text = text.lower()

print(lower_text)
data scientists and data 
engineers need data

Image showing a person confused between 'data', 'Data', and 'DATA.'

Natural Language Processing (NLP) in Python

Lowercasing applicability

Useful for

Tasks relying on keyword analysis

Image showing a person searching for keywords in a text.

Natural Language Processing (NLP) in Python

Lowercasing applicability

Useful for

Tasks relying on keyword analysis

Image showing a person searching for keywords in a text.

Not useful for

Tasks requiring distinction between words that depend on case for meaning

Image showing  a group of people to represent "us" and the American flag to represent the "US."

Natural Language Processing (NLP) in Python

Stemming

  • Reduces words to their root form by chopping off suffixes
    • running → run
    • reading → read
  • Useful for grouping similar variations of word together
  • Can result in non-valid words
    • organization → organizat

Stem_logo.png

Natural Language Processing (NLP) in Python

Stemming in code

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['running', 'bats', 'organizations', 'reading']
stemmed = [stemmer.stem(word) for word in tokens]
print(stemmed)
['run', 'bat', 'organizat', 'read']
Natural Language Processing (NLP) in Python

Lemmatization

  • Reduces words to their base form
  • Ensures result is a valid word
    • organizations → organization
  • Useful when grammatical accuracy is crucial
  • Slower than stemming but more precise

lemmalogo.png

Natural Language Processing (NLP) in Python

Lemmatization in code

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
tokens = ['running', 'bats', 'organizations', 'reading']
lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized)
['running', 'bat', 'organization', 'reading']
Natural Language Processing (NLP) in Python

Stemming

  • Faster
  • Results in non-dictionary words
  • Use if needing speed

Image showing a person cutting a tree with a chainsaw.

Lemmatization

  • Slower
  • More accurate
  • Use if needing grammatical accuracy

Image showing a person carefully trimming a tree to preserve its structure.

Natural Language Processing (NLP) in Python

Let's practice!

Natural Language Processing (NLP) in Python

Preparing Video For Download...