Stop words and punctuation handling

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

Stop words

Appear frequently but contribute little to the machine's understanding of context
Don't add much value in many NLP tasks
Removing them helps models focus on important words

Image showing several stopwords such as a, an, the, in, of, that, for, by, etc.

Stop words removal

Useful for

Understanding the topic of a text

Image showing reviews for a product on a mobile application

Not useful for

Tasks that require every word in the text, such as translation

Image showing a text being translated from English (Good morning) to French (Bonjour).
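
One reason removal can hurt such tasks is that stop word lists often contain negations; NLTK's English list, for instance, includes "not". The following is a minimal sketch of that effect. `SAMPLE_STOP_WORDS` and `NEGATIONS` are small hypothetical sets standing in for a real stop word list, and `str.split` stands in for a real tokenizer.

```python
# Hypothetical mini stop word list for illustration; a real list is much longer.
SAMPLE_STOP_WORDS = {"this", "is", "the", "a", "not"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

positive = "This movie is good".split()
negative = "This movie is not good".split()

# Both reviews collapse to the same tokens once "not" is removed.
print(remove_stop_words(positive, SAMPLE_STOP_WORDS))  # ['movie', 'good']
print(remove_stop_words(negative, SAMPLE_STOP_WORDS))  # ['movie', 'good']

# One possible workaround: take negations out of the stop word set first.
NEGATIONS = {"not"}
print(remove_stop_words(negative, SAMPLE_STOP_WORDS - NEGATIONS))  # ['movie', 'not', 'good']
```

Whether to keep negations depends on the task; for topic detection they rarely matter, while for sentiment they usually do.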

Accessing stop words

NLTK provides a list of stop words for several languages

import nltk
from nltk.corpus import stopwords

# Download the stop word lists (only needed once)
nltk.download('stopwords')


stop_words = stopwords.words('english')

print(stop_words[:10])

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Removing stop words

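from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's Punkt tokenizer data: nltk.download('punkt_tab') on recent NLTK releases, 'punkt' on older ones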


text = "This is an example to demonstrate removing stop words."

tokens = word_tokenize(text)

# Lowercase each token before the check so capitalized stop words such as "This" are removed too
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)

['example', 'demonstrate', 'removing', 'stop', 'words', '.']
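
The filter above tests each token against a Python list, which is scanned from the start for every token. A minimal sketch of the same filter using a set for faster lookups follows. `SAMPLE_STOP_WORDS` is a short hypothetical stand-in for `stopwords.words('english')`, and the tokens are written out by hand rather than produced by `word_tokenize`.

```python
# Hypothetical stand-in for stopwords.words('english'), which also returns a list.
SAMPLE_STOP_WORDS = ["this", "is", "an", "to", "the", "a"]

# Convert once to a set so each membership test is a hash lookup.
stop_word_set = set(SAMPLE_STOP_WORDS)

# Tokens written by hand for the example sentence.
tokens = ["This", "is", "an", "example", "to", "demonstrate",
          "removing", "stop", "words", "."]

filtered_tokens = [word for word in tokens if word.lower() not in stop_word_set]
print(filtered_tokens)  # ['example', 'demonstrate', 'removing', 'stop', 'words', '.']
```

For a single short sentence the difference is negligible; it tends to matter when filtering many documents.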

Punctuation

Structures language for human readers
Carries little meaningful information in many NLP tasks

Image showing punctuation marks and special characters.

Punctuation removal

Useful for

Tasks that require finding common or important words in documents

Image showing multiple files and documents that need processing.

Not useful for

Tasks that require maintaining sentence structure for clarity

Image showing a pile of books and a summary document generated out of them.

Accessing and removing punctuation

import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

text = "This is an example to demonstrate removing stop words."
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]


# Keep only the tokens that are not punctuation marks
clean_tokens = [word for word in filtered_tokens if word not in string.punctuation]

print(clean_tokens)

['example', 'demonstrate', 'removing', 'stop', 'words']
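
Because `string.punctuation` is a single string, `word not in string.punctuation` is a substring test. It works for one-character tokens such as ".", but multi-character tokens such as "..." or the "``" that `word_tokenize` can emit for an opening quote are not substrings and slip through. A minimal sketch of a per-character check follows; `is_punctuation` and the hand-written token list are illustrative assumptions, not part of NLTK.

```python
import string

PUNCTUATION_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True if token is non-empty and consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in PUNCTUATION_CHARS for ch in token)

# Hand-written tokens; "``" and "..." are multi-character punctuation tokens.
tokens = ["``", "example", "...", "don't", "words", "."]

substring_check = [t for t in tokens if t not in string.punctuation]
per_char_check = [t for t in tokens if not is_punctuation(t)]

print(substring_check)  # ['``', 'example', '...', "don't", 'words']
print(per_char_check)   # ['example', "don't", 'words']
```

Note that `string.punctuation` covers ASCII characters only, so marks such as curly quotes would need separate handling.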

Let's practice!
