Text cleaning

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Text cleaning techniques

  • Unnecessary whitespace and escape sequences
  • Punctuation
  • Special characters (numbers, emojis, etc.)
  • Stopwords
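
As a rough sketch of the first item, runs of whitespace and escape sequences such as \t and \n can be collapsed with plain string methods. The helper name normalize_whitespace is hypothetical and not part of the course code.

```python
def normalize_whitespace(text):
    """Collapse any run of spaces, tabs, and newlines into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


raw = "OMG!!!!  This is like    the best thing ever \t\n. "
print(normalize_whitespace(raw))
# OMG!!!! This is like the best thing ever .
```

Punctuation, special characters, and stopwords need token-level checks, which the following slides cover.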

isalpha()

"Dog".isalpha()
True
"3dogs".isalpha()
False
"12347".isalpha()
False
"!".isalpha()
False
"?".isalpha()
False
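
The checks above can be combined into a simple token filter. This is a minimal sketch; the function name keep_alphabetic and the sample tokens are assumptions made for illustration. Note that isalpha() is Unicode-aware, so accented letters count as alphabetic, while the empty string does not.

```python
def keep_alphabetic(tokens):
    """Return only the tokens made up entirely of alphabetic characters."""
    return [tok for tok in tokens if tok.isalpha()]


tokens = ["Dog", "3dogs", "12347", "!", "café", ""]
print(keep_alphabetic(tokens))
# ['Dog', 'café']
```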

A word of caution

  • isalpha() returns False for abbreviations such as U.S.A and U.K.
  • It also returns False for proper nouns that contain digits, such as word2vec and xto10x.
  • Write your own custom function (using regex) for these more nuanced cases.
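
One possible shape for such a custom function is sketched below. The two patterns and the name is_wordlike are assumptions chosen for illustration; a real application would tune them to its own data.

```python
import re

# Hypothetical pattern: dotted abbreviations with at least two letters,
# e.g. U.S.A, U.K, or U.S.A. (trailing dot optional)
ABBREVIATION = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]\.?")
# Hypothetical pattern: starts with a letter, then any mix of letters and digits,
# e.g. word2vec or xto10x (but not 3dogs or 12347)
ALPHANUMERIC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_wordlike(token):
    """Accept plain alphabetic tokens plus the two special cases above."""
    return bool(
        token.isalpha()
        or ABBREVIATION.fullmatch(token)
        or ALPHANUMERIC_NAME.fullmatch(token)
    )


tokens = ["U.S.A", "U.K", "word2vec", "xto10x", "3dogs", "12347", "!", "Dog"]
print([tok for tok in tokens if is_wordlike(tok)])
# ['U.S.A', 'U.K', 'word2vec', 'xto10x', 'Dog']
```

Using fullmatch rather than search keeps a token such as 3dogs from passing just because part of it looks like a word.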

Removing non-alphabetic characters

string = """
OMG!!!! This is like    the best thing ever \t\n. 
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

import spacy

# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

Removing non-alphabetic characters

...
...
# Remove tokens that are not alphabetic
# ('-PRON-' is the placeholder lemma that spaCy 2.x assigns to pronouns)
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() or lemma == '-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))
'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'

Stopwords

  • Words that occur extremely frequently
  • E.g. articles, forms of the verb "be", pronouns, etc.
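
To make the idea concrete without any library, here is a minimal sketch that filters tokens against a tiny hand-picked stopword set. SAMPLE_STOPWORDS and remove_stopwords are hypothetical names, and the set is far smaller than a real stopword list.

```python
# Hypothetical, deliberately tiny stopword set: a few articles, "be" verbs, pronouns
SAMPLE_STOPWORDS = {"a", "an", "the", "is", "be", "this", "i", "it"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


print(remove_stopwords(["This", "is", "an", "amazing", "song"]))
# ['amazing', 'song']
```

Lowercasing before the lookup matters because stopword lists are usually stored in lowercase.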

Removing stopwords using spaCy

# Get list of stopwords
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

Removing stopwords using spaCy

...
...
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))
'omg like good thing wow amazing song hooked definitely'

Other text preprocessing techniques

  • Removing HTML/XML tags
  • Replacing accented characters (such as é)
  • Correcting spelling errors
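
The first two techniques can be sketched with the standard library alone. The helper names strip_tags and strip_accents are assumptions for illustration, and this is only one of several reasonable approaches; spelling correction usually needs a dedicated library and is not shown.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML/XML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def strip_accents(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits a character such as é into 'e' plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(strip_tags("<p>Café <b>résumé</b></p>")))
# Cafe resume
```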

A word of caution

Always use only those text preprocessing techniques that are relevant to your application.


Let's practice!
