Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
For any ML algorithm, the training features must be numerical, so text has to be converted into vectors first.
Corpus
"The lion is the king of the jungle"
"Lions have lifespans of a decade"
"The lion is an endangered species"
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
"The lion is the king of the jungle"
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
"Lions have lifespans of a decade"
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
"The lion is an endangered species"
[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
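The three vectors above can be reproduced by hand. The following is a minimal sketch using only the standard library; it assumes whitespace tokenization with no lowercasing, and the helper name `to_bow_vector` is made up for illustration.

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Split on whitespace only; no lowercasing, so "The" and "the" stay distinct.
tokenized = [doc.split() for doc in corpus]

# Sort case-insensitively, with the lowercase form first on ties,
# to match the vocabulary order shown above.
vocabulary = sorted(
    {word for tokens in tokenized for word in tokens},
    key=lambda w: (w.lower(), w[0].isupper()),
)

def to_bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [to_bow_vector(tokens, vocabulary) for tokens in tokenized]

for doc, vector in zip(corpus, bow_vectors):
    print(doc)
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so the vector length always equals the vocabulary size (15 here).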
Lions, lion → lion
The, the → the
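A rough sketch of how this preprocessing shrinks the vocabulary. The lookup table `TOY_LEMMAS` and the helper `preprocess` are hypothetical stand-ins; a real pipeline would use a proper lemmatizer rather than a hand-written dictionary.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Hypothetical toy lemma table covering only the example above.
TOY_LEMMAS = {"lions": "lion"}

def preprocess(doc):
    """Lowercase the text, split on whitespace, and map tokens to lemmas."""
    tokens = doc.lower().split()
    return [TOY_LEMMAS.get(token, token) for token in tokens]

raw_vocab = sorted({word for doc in corpus for word in doc.split()})
clean_vocab = sorted({word for doc in corpus for word in preprocess(doc)})

print(len(raw_vocab), len(clean_vocab))  # 15 13
```

Merging "Lions" into "lion" and "The" into "the" removes two dimensions, so every document vector becomes shorter.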
import pandas as pd
corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create CountVectorizer object
vectorizer = CountVectorizer()
# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
[0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
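The matrix has 13 columns rather than 15 because CountVectorizer, with its default settings, lowercases the text (`lowercase=True`) and keeps only tokens of two or more word characters (`token_pattern=r"(?u)\b\w\w+\b"`), which drops "a". The sketch below emulates those two defaults with the standard library to show where each column comes from. It is an approximation for illustration, not scikit-learn's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase, then keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(doc.lower())

# Columns are the sorted unique tokens across the whole corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Note that "lion" and "lions" still occupy separate columns: lowercasing is applied by default, but lemmatization is not.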