Stop words and punctuation handling

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

Stop words

  • Appear frequently but contribute little to the machine's understanding of context
  • Don't add much value in many NLP tasks
  • Removing them helps models focus on important words

Image showing several stopwords such as a, an, the, in, of, that, for, by, etc.

Stop words removal

Useful for

Understanding the topic of a text

Image showing reviews for a product on a mobile application

Not useful for

Tasks requiring every word in the text

Image showing a text being translated from English (Good morning) to French (Bonjour).
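
Both sides of this trade-off can be sketched in a few lines. The example below is only an illustration: `STOP_WORDS` is a small hand-picked stand-in for a real stop word list, and `content_words` and the sample reviews are hypothetical. It shows how filtering surfaces topic words, and how dropping a negation such as "not" (which NLTK's English list also contains) can change the apparent meaning.

```python
from collections import Counter

# Small hand-picked stand-in for a real stop word list (assumption for this sketch)
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "i", "not", "to", "but", "my"}

def content_words(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

# Hypothetical product reviews
reviews = [
    "The battery is great and the screen is great.",
    "I love the battery, but the camera is a letdown.",
    "This battery lasts and lasts!",
]

# Useful: the most frequent remaining word hints at the topic
counts = Counter(w for review in reviews for w in content_words(review))
print(counts.most_common(1))  # [('battery', 3)]

# Risky: dropping "not" hides the negation
negated = content_words("The battery is not great.")
print(negated)  # ['battery', 'great']
```

For tasks such as translation or sentiment analysis, a filtered sentence like the last one no longer says what the original said.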

Accessing stop words

NLTK provides a list of stop words for several languages

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word corpus

stop_words = stopwords.words('english')
print(stop_words[:10])
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Removing stop words

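# word_tokenize needs the Punkt tokenizer data: nltk.download('punkt_tab') on recent NLTK versions, 'punkt' on older ones
from nltk.tokenize import word_tokenize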

text = "This is an example to demonstrate removing stop words."
tokens = word_tokenize(text)
# The .lower() method helps with case sensitivity
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
['example', 'demonstrate', 'removing', 'stop', 'words', '.']
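
On longer texts, checking each token against a list gets slow, because list membership scans the whole list. A minimal sketch of the same filter using a set is shown below. To keep it self-contained, `stop_words` here is a short stand-in list rather than the NLTK one, and `simple_tokenize` is a rough regex substitute for `word_tokenize`; both are assumptions for illustration.

```python
import re

# Stand-in for stopwords.words('english'); a real list would come from NLTK
stop_words = ["this", "is", "an", "to", "a", "the"]
stop_set = set(stop_words)  # set membership is O(1) on average

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens, stop_set):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_set]

tokens = simple_tokenize("This is an example to demonstrate removing stop words.")
filtered_tokens = remove_stop_words(tokens, stop_set)
print(filtered_tokens)
# ['example', 'demonstrate', 'removing', 'stop', 'words', '.']
```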

Punctuation

  • Structures language for human readers
  • Carries little meaningful information in many NLP tasks

Image showing punctuation marks and special characters.

Punctuation removal

Useful for

Tasks that require finding common or important words in documents

Image showing multiple files and documents that need processing.

Not useful for

Tasks that require preserving sentence structure for clarity

Image showing a pile of books and a summary document generated out of them.
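
A small sketch of why structure-dependent tasks need punctuation: the naive sentence splitter below (a regex chosen for illustration, not a production tokenizer) relies on `.`, `!` and `?` to find sentence boundaries, and finds none once punctuation has been stripped.

```python
import re
import string

text = "It rained. We stayed in! Did you go out?"

# Split after sentence-ending punctuation followed by whitespace
boundary = r"(?<=[.!?])\s+"
sentences = re.split(boundary, text)
print(sentences)
# ['It rained.', 'We stayed in!', 'Did you go out?']

# Remove every punctuation character, then try the same split
stripped = text.translate(str.maketrans("", "", string.punctuation))
sentences_after = re.split(boundary, stripped)
print(sentences_after)
# ['It rained We stayed in Did you go out']
```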

Accessing and removing punctuation

import string
print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
text = "This is an example to demonstrate removing stop words."
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

clean_tokens = [word for word in filtered_tokens if word not in string.punctuation]
print(clean_tokens)
['example', 'demonstrate', 'removing', 'stop', 'words']
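
One caveat worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. It works for single-character tokens such as `.`, but multi-character tokens such as `...` or the `` tokens that some tokenizers emit for opening quotes slip through. A stricter per-character check could look like the sketch below; `is_punctuation` and the sample tokens are hypothetical names chosen for illustration.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

tokens = ["Wait", "...", "really", "?", "!", "state-of-the-art", "``", "ok"]

# Substring test: misses "..." and "``"
substring_check = [t for t in tokens if t not in string.punctuation]
print(substring_check)
# ['Wait', '...', 'really', 'state-of-the-art', '``', 'ok']

# Per-character test: drops every punctuation-only token
per_char_check = [t for t in tokens if not is_punctuation(t)]
print(per_char_check)
# ['Wait', 'really', 'state-of-the-art', 'ok']
```

Hyphenated words such as "state-of-the-art" survive both checks, since only tokens made up entirely of punctuation are removed.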

Let's practice!
