Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
"Dog".isalpha()
True
"3dogs".isalpha()
False
"12347".isalpha()
False
"!".isalpha()
False
"?".isalpha()
False
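The checks above can be turned into a simple token filter. The sketch below is a minimal illustration; the token list is hand-written for this example and does not come from the slides.

```python
# Hypothetical token list, hand-written for illustration
tokens = ["Dog", "3dogs", "12347", "!", "?", "song"]

# Keep only tokens made up entirely of alphabetic characters
alpha_tokens = [token for token in tokens if token.isalpha()]

print(alpha_tokens)
```

Only "Dog" and "song" survive, because every other token contains a digit or a punctuation character.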
isalpha() also returns False for:
Abbreviations: U.S.A, U.K, etc.
Names containing digits: word2vec and xto10x
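Tokens such as U.S.A or word2vec fail the isalpha() check even though they are meaningful words. One possible way to handle such cases is a small regex-based check. The sketch below is an assumption for illustration, not a rule from the slides: the pattern and the helper name is_word_like are hypothetical, and the rule (start with a letter, allow digits and internal dots) would need adjusting for a real application.

```python
import re

# Assumed rule: a letter first, then letters or digits, optionally followed
# by dot-separated groups (as in U.S.A) and an optional trailing dot (as in etc.)
WORD_LIKE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:\.[A-Za-z0-9]+)*\.?")

def is_word_like(token):
    """Return True if the whole token matches the assumed word-like pattern."""
    return WORD_LIKE.fullmatch(token) is not None

for token in ["U.S.A", "U.K", "word2vec", "xto10x", "3dogs", "12347", "!"]:
    print(token, is_word_like(token))
```

Under this rule the abbreviations and the alphanumeric names are kept, while tokens that start with a digit or consist of punctuation are still rejected.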
string = """ OMG!!!! This is like the best thing ever \t\n. Wow, such an amazing song! I'm hooked. Top 5 definitely. ? """
import spacy

# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
...
...
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() or lemma == '-PRON-']
# Print string after text cleaning
print(' '.join(a_lemmas))
'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'
# Get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
string = """ OMG!!!! This is like the best thing ever \t\n. Wow, such an amazing song! I'm hooked. Top 5 definitely. ? """
...
...
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))
'omg like good thing wow amazing song hooked definitely'
Use only the text preprocessing techniques that are relevant to your application.
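One way to follow this advice is to make each cleaning step optional. The sketch below is a minimal illustration under stated assumptions: the function clean_lemmas, its flags, the stop word set DEMO_STOPWORDS, and the lemma list are all hypothetical and hand-written for this example. In practice the stop words could come from spaCy's STOP_WORDS, as shown earlier.

```python
# Small hand-written stop word set, an assumption for illustration only
DEMO_STOPWORDS = {"this", "be", "the", "such", "an", "ever", "top"}

def clean_lemmas(lemmas, keep_alpha_only=True, remove_stopwords=True,
                 stopwords=DEMO_STOPWORDS):
    """Apply only the cleaning steps that are switched on."""
    result = []
    for lemma in lemmas:
        # Optionally drop tokens containing non-alphabetic characters
        if keep_alpha_only and not lemma.isalpha():
            continue
        # Optionally drop stop words
        if remove_stopwords and lemma in stopwords:
            continue
        result.append(lemma)
    return result

# Hypothetical lemma list, hand-written for illustration
lemmas = ["omg", "!", "this", "be", "the", "good", "thing", "5"]

print(clean_lemmas(lemmas))
print(clean_lemmas(lemmas, remove_stopwords=False))
```

With both steps on, only "omg", "good" and "thing" remain. Turning off stop word removal keeps "this", "be" and "the", which may matter for tasks where function words carry signal.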