Building a BoW Naive Bayes classifier

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Spam filtering

message	label
WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461	spam
Ah, work. I vaguely remember that. What does it feel like?	ham

Steps

Text preprocessing
Building a bag-of-words model (or representation)
Machine learning

Text preprocessing using CountVectorizer

CountVectorizer arguments

lowercase: False, True
strip_accents: 'unicode', 'ascii', None
stop_words: 'english', list, None
token_pattern: regex
tokenizer: function
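To make the effect of these arguments concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus and the variable names are made up for illustration and are not part of the course data. It compares the default settings with the combination of strip_accents='ascii', stop_words='english' and lowercase=False.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration only
docs = ["The café is OPEN", "the cafe is closed"]

# Default settings: text is lowercased, accents and stop words are kept
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# Strip accents, remove English stop words, keep the original casing
custom_vec = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)
custom_vec.fit(docs)
custom_vocab = sorted(custom_vec.vocabulary_)

print(default_vocab)
print(custom_vocab)
```

One thing this sketch brings out: the built-in English stop-word list is lowercase, so with lowercase=False a capitalized "The" is not filtered out, while "the" is.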

Building the BoW model

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer


# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)


# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into training and test sets
# (df is the spam DataFrame with the 'message' and 'label' columns)
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)
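Because train_test_split shuffles the data randomly, every run produces a different split and therefore a slightly different score. The sketch below is one way to make the split repeatable and to keep the spam/ham ratio the same in both sets. It uses made-up placeholder messages, and the choices random_state=42 and stratify=labels are assumptions for illustration, not settings used in the slides.

```python
from sklearn.model_selection import train_test_split

# Placeholder data, invented for this illustration: 4 spam and 4 ham messages
messages = [f"message {i}" for i in range(8)]
labels = ["spam", "ham"] * 4

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```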

Building the BoW model

...
...
# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)


# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)
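The reason for calling fit_transform on the training data but only transform on the test data can be seen in a small sketch. The sentences below are invented for illustration. The vocabulary is learned from the training documents only, so both matrices end up with the same columns and words that appear only in the test set are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration only
train_docs = ["free prize call now", "see you at work"]
test_docs = ["free lunch at work"]  # "lunch" never appears in training

vectorizer = CountVectorizer()

# Learn the vocabulary from the training documents and vectorize them
X_train_bow = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary; nothing new is learned from the test set
X_test_bow = vectorizer.transform(test_docs)

print(X_train_bow.shape, X_test_bow.shape)
print(vectorizer.vocabulary_)
print(X_test_bow.toarray())
```

Calling fit_transform on the test set instead would build a second, different vocabulary, and the classifier trained on the first one could not be applied to it.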

Training the Naive Bayes classifier

# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB


# Create MultinomialNB object
clf = MultinomialNB()


# Train clf
clf.fit(X_train_bow, y_train)


# Measure the accuracy on the test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

0.760051
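Once trained, the same vectorizer and classifier can be applied to unseen messages. The following is a minimal sketch on a tiny made-up training set, not the course dataset. It shows how new text might be labeled with predict and how the class probabilities can be inspected with predict_proba.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny training set, invented for this illustration only
messages = [
    "WINNER claim your prize reward now",
    "call now to claim your free prize",
    "are you coming to work today",
    "see you at lunch after work",
]
labels = ["spam", "spam", "ham", "ham"]

# Build the BoW vectors and train the classifier
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(messages)
clf = MultinomialNB()
clf.fit(X_bow, labels)

# New messages must go through the already fitted vectorizer
new_messages = ["claim your free prize", "lunch at work today"]
new_bow = vectorizer.transform(new_messages)

predictions = clf.predict(new_bow)
probabilities = clf.predict_proba(new_bow)

print(clf.classes_)    # column order of the probabilities
print(predictions)
print(probabilities)
```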

Let's practice!
