Text cleaning

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Text cleaning techniques

Unnecessary whitespace and escape sequences
Punctuation
Special characters (numbers, emojis, etc.)
Stopwords
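The first item on this list can often be handled with the standard library alone. A minimal sketch, assuming the goal is simply to collapse every run of whitespace (including escape sequences such as \t and \n) into a single space; the helper name normalize_whitespace is hypothetical and not part of the course code:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(text.split())


raw = "OMG!!!! This is like    the best thing ever \t\n. "
print(normalize_whitespace(raw))
# prints: OMG!!!! This is like the best thing ever .
```

Punctuation and digits are left untouched here; they are handled by the token-level filtering shown later.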

isalpha()

"Dog".isalpha()

True

"3dogs".isalpha()

False

"12347".isalpha()

False

"!".isalpha()

False

"?".isalpha()

False
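A few further edge cases may be worth checking before relying on isalpha() as a filter. The sketch below uses only built-in string behaviour; the example strings are illustrative choices, not taken from the course data:

```python
# isalpha() is True only if the string is non-empty and every character is a letter.
examples = ["café", "don't", "well-known", "two words", ""]

results = {text: text.isalpha() for text in examples}
for text, is_alpha in results.items():
    print(repr(text), is_alpha)
# 'café' True        (accented letters count as alphabetic)
# "don't" False      (apostrophe)
# 'well-known' False (hyphen)
# 'two words' False  (space)
# '' False           (empty string)
```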

A word of caution

isalpha() returns False for some tokens you may want to keep:
Abbreviations: U.S.A, U.K, etc.
Proper nouns: word2vec and xto10x.
Write your own custom function (using regex) for these more nuanced cases.
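One possible shape for such a custom function is sketched below. The pattern and the name keep_token are assumptions made for illustration: it accepts dotted abbreviations and any letter/digit token that contains at least one letter, which is only one of many reasonable policies.

```python
import re

# Hypothetical pattern (an assumption, not the course's own rule):
#   1. dotted abbreviations such as U.S.A or U.K
#   2. tokens of letters and digits that contain at least one letter
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.)+[A-Za-z]?"            # U.S.A, U.K, U.S.A.
    r"|[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*"   # Dog, word2vec, xto10x
)


def keep_token(token):
    """Return True if the whole token matches the pattern above."""
    return TOKEN_PATTERN.fullmatch(token) is not None


for token in ["Dog", "U.S.A", "U.K", "word2vec", "xto10x", "12347", "!"]:
    print(token, keep_token(token))
```

Note the trade-off: unlike isalpha(), this pattern also keeps tokens such as "3dogs", because any mix of letters and digits with at least one letter is accepted. Tighten the pattern if that is not what your application needs.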

Removing non-alphabetic characters

string = """
OMG!!!! This is like    the best thing ever \t\n. 
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

import spacy

# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

# Remove tokens that are not alphabetic
# ('-PRON-' is the placeholder lemma spaCy v2 assigns to pronouns, so keep it explicitly)
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() or lemma == '-PRON-']


# Print string after text cleaning
print(' '.join(a_lemmas))

'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'

Stopwords

Words that occur extremely commonly
E.g., articles, forms of "to be", pronouns, etc.

Removing stopwords using spaCy

# Get list of stopwords
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

string = """
OMG!!!! This is like    the best thing ever \t\n. 
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))

'omg like good thing wow amazing song hooked definitely'
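The same filtering logic can be tried without loading a spaCy model. The sketch below is an illustration under stated assumptions: clean_tokens is a hypothetical helper, and demo_stopwords is a tiny hand-made set standing in for a full stopword list. It lowercases each token before the lookup, since stopword lists are typically stored in lowercase.

```python
def clean_tokens(tokens, stopwords):
    """Keep alphabetic tokens that are not stopwords, compared case-insensitively."""
    return [
        token.lower()
        for token in tokens
        if token.isalpha() and token.lower() not in stopwords
    ]


# Tiny illustrative stopword set (an assumption for this sketch only).
demo_stopwords = {"this", "is", "the", "ever"}
tokens = ["OMG", "!", "This", "is", "the", "best", "thing", "ever", "5"]

print(" ".join(clean_tokens(tokens, demo_stopwords)))
# prints: omg best thing
```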

Other text preprocessing techniques

Removing HTML/XML tags
Replacing accented characters (such as é)
Correcting spelling errors
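The first two techniques can be approximated with the standard library. This is a rough sketch, not a definitive implementation: strip_tags and strip_accents are hypothetical helper names, and the regex-based tag removal is a naive approach that assumes simple, well-formed markup (a real HTML parser is safer for anything more complex).

```python
import re
import unicodedata


def strip_tags(text):
    """Remove anything that looks like an HTML/XML tag (naive regex approach)."""
    return re.sub(r"<[^>]+>", "", text)


def strip_accents(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits 'é' into 'e' plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(strip_tags("<p>Café <b>crème</b></p>")))
# prints: Cafe creme
```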

A word of caution

Always use only those text preprocessing techniques that are relevant to your application.

Let's practice!
