Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
| message | label |
|---|---|
| WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461 | spam |
| Ah, work. I vaguely remember that. What does it feel like? | ham |
CountVectorizer arguments
lowercase: False, True
strip_accents: 'unicode', 'ascii', None
stop_words: 'english', list, None
token_pattern: regex
tokenizer: function
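To make the effect of these arguments concrete, here is a minimal sketch that compares a default CountVectorizer with one configured as on the next slide. The two example sentences are invented for illustration and are not from the course dataset. One detail worth noticing: stop-word filtering matches tokens exactly, so with lowercase=False a capitalized "The" is not removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical, for illustration only)
docs = ["The café is OPEN", "the cafe is closed"]

# Default settings: lowercase=True, no accent stripping, no stop words
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# Custom settings: strip accents, drop English stop words, keep case
custom_vec = CountVectorizer(strip_accents='ascii',
                             stop_words='english',
                             lowercase=False)
custom_vec.fit(docs)
custom_vocab = sorted(custom_vec.vocabulary_)

print(default_vocab)
print(custom_vocab)
```

In the default vocabulary, "café" and "cafe" are separate features. With strip_accents='ascii' they collapse into the single feature "cafe", while "The" survives stop-word removal because case is preserved.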
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)

...
...
# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)
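The split between fit_transform on the training set and transform on the test set matters: the vocabulary is learned from the training messages only, and the test messages are mapped onto those same columns. The following sketch uses made-up messages (an assumption, not the course data) to show that both matrices share the same number of columns and that words unseen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages (hypothetical, for illustration only)
train = ["free prize call now", "see you at work"]
test = ["free lunch at work"]

vec = CountVectorizer()

# Learn the vocabulary from the training messages and vectorize them
X_train_bow = vec.fit_transform(train)

# Reuse the learned vocabulary; no refitting on the test messages
X_test_bow = vec.transform(test)

print(X_train_bow.shape, X_test_bow.shape)

# 'lunch' never appeared in training, so it has no column
print('lunch' in vec.vocabulary_)
```

Calling fit_transform on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.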
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)
0.760051
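Once trained, the same vectorizer and classifier can label a new message. The sketch below is self-contained and uses a tiny invented training set rather than the course's df, so it only illustrates the mechanics; it says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hand-made training set (hypothetical, for illustration only)
messages = ["win a free prize now",
            "free cash prize call now",
            "are you coming to work",
            "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

# Build BoW vectors from the training messages
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(messages)

# Train the naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_bow, labels)

# Vectorize new messages with the fitted vectorizer, then predict
new_messages = ["claim your free prize", "see you at work"]
predictions = clf.predict(vectorizer.transform(new_messages))
print(list(predictions))
```

The new messages must go through vectorizer.transform, not fit_transform, so that they are encoded with the vocabulary the classifier already knows.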