Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
For any ML algorithm,
Corpus
"The lion is the king of the jungle"
"Lions have lifespans of a decade"
"The lion is an endangered species"
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
"The lion is the king of the jungle"
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
"Lions have lifespans of a decade"
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
"The lion is an endangered species"
[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
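The three vectors above can be reproduced by hand, which makes the counting explicit. The following is only a minimal sketch: the helper name `bag_of_words` is hypothetical, tokens are split on whitespace, and the sort key is an assumption chosen so that the vocabulary comes out in the same order as the listing above (case-insensitive alphabetical, lowercase form first).

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]


def bag_of_words(docs):
    """Return (vocabulary, vectors) for a list of documents (hypothetical helper)."""
    # Naive whitespace tokenization; case is preserved, so "The" != "the".
    tokenized = [doc.split() for doc in docs]

    # Unique tokens, ordered alphabetically ignoring case,
    # with the lowercase form placed before the capitalized one.
    vocabulary = sorted(
        {token for tokens in tokenized for token in tokens},
        key=lambda w: (w.lower(), w != w.lower()),
    )

    # One count per vocabulary entry, per document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(corpus)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Because case is preserved, "the" and "The" occupy separate dimensions, as do "lion" and "Lions".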
Lions, lion → lion
The, the → the
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
       [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
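The matrix above has 13 columns, not the 15 of the hand-built vocabulary. A likely explanation is that CountVectorizer, with default settings, lowercases the text (merging "The" with "the") and keeps only tokens of two or more word characters (dropping "a"). Note that "lion" and "lions" still remain separate columns. The sketch below imitates that behaviour in plain Python as an illustration only; it is not scikit-learn's implementation, and the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Assumption: tokens are runs of two or more word characters,
# which mirrors CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document, then extract tokens of length >= 2."""
    return TOKEN_PATTERN.findall(doc.lower())


# Alphabetically sorted vocabulary over the whole corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row of counts per document, one column per vocabulary entry.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

With the real vectorizer, the column-to-word mapping can be inspected through `vectorizer.get_feature_names_out()` (available in recent scikit-learn releases) after fitting.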