Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
For any ML algorithm:
Corpus
"The lion is the king of the jungle"
"Lions have lifespans of a decade"
"The lion is an endangered species"
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
"The lion is the king of the jungle"
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
"Lions have lifespans of a decade"
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
"The lion is an endangered species"
[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
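The vectors above can be reproduced by hand. The following is a minimal sketch using only the standard library; it assumes whitespace tokenization (enough for this punctuation-free corpus) and a case-sensitive vocabulary sorted alphabetically with lowercase forms first. The helper names `tokenize` and `to_vector` are illustrative, not part of any library.

```python
# Minimal bag-of-words sketch using only the standard library.
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]


def tokenize(sentence):
    # Assumption: splitting on whitespace is enough for this corpus.
    return sentence.split()


# Case-sensitive vocabulary. The sort key orders words alphabetically and
# places a lowercase form before its capitalized twin ("the" before "The").
vocabulary = sorted(
    {word for sentence in corpus for word in tokenize(sentence)},
    key=lambda word: (word.lower(), word[0].isupper()),
)


def to_vector(sentence):
    # One count per vocabulary word, in vocabulary order.
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocabulary]


vectors = [to_vector(sentence) for sentence in corpus]

for sentence, vector in zip(corpus, vectors):
    print(sentence)
    print(vector)
```

Because the vocabulary is case-sensitive, "lion" and "Lions" as well as "the" and "The" occupy separate dimensions, which is what motivates the normalization step that follows.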
Lions, lion → lion
The, the → the
import pandas as pd

corpus = pd.Series([
'The lion is the king of the jungle',
'Lions have lifespans of a decade',
'The lion is an endangered species'
])
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Generate the matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
[0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
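The matrix has 13 columns rather than the 15 of the hand-built vocabulary. A likely explanation lies in CountVectorizer's default settings: it lowercases the text (`lowercase=True`) and keeps only tokens of two or more word characters (`token_pattern=r"(?u)\b\w\w+\b"`), so "The" merges with "the" and the single-letter word "a" is dropped. The sketch below imitates those two defaults with the standard library to show where the columns come from; it is an approximation for illustration, not scikit-learn's actual implementation.

```python
import re

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Assumption: this mirrors CountVectorizer's documented default token pattern,
# which only matches tokens with at least two word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    # Lowercase first, then extract tokens, imitating the default settings.
    return TOKEN_PATTERN.findall(text.lower())


# Alphabetically sorted vocabulary of the normalized tokens.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary term.
matrix = [[tokenize(doc).count(term) for term in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Note that "lion" and "lions" still occupy separate columns: lowercasing alone does not reduce words to their base form, so mapping both to "lion" would require an additional step such as lemmatization.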