Building a bag-of-words model

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Recap of data format for ML algorithms

For any ML algorithm,

Data must be in tabular form
Training features must be numerical

Bag-of-words model

Extract word tokens
Compute the frequency of each token
Build a word vector from these frequencies and the corpus vocabulary

Bag-of-words model example

Corpus

"The lion is the king of the jungle"

"Lions have lifespans of a decade"

"The lion is an endangered species"

Bag-of-words model example

Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The

"The lion is the king of the jungle"

[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]

"Lions have lifespans of a decade"

[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]

"The lion is an endangered species"

[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
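The three steps (extract tokens, count them, build vectors over the vocabulary) can be sketched in plain Python. This is a minimal illustration, not a library implementation: it assumes whitespace tokenization with no preprocessing, and the helper name `to_vector` and the sort key are choices made here to reproduce the vocabulary order shown above.

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (plain whitespace split, no preprocessing).
tokenized = [doc.split() for doc in corpus]

# Vocabulary: every distinct token, sorted case-insensitively,
# with the lowercase spelling placed before the capitalized one.
vocabulary = sorted(
    {tok for doc in tokenized for tok in doc},
    key=lambda tok: (tok.lower(), tok != tok.lower()),
)

def to_vector(tokens, vocabulary):
    """Steps 2 and 3: count tokens, then lay the counts out in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]

for doc, vector in zip(corpus, bow_vectors):
    print(doc)
    print(vector)
```

Because nothing is normalized, "the" and "The" (and "lion" and "Lions") occupy separate columns, which is why the vocabulary has 15 entries.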

Text preprocessing

Lions, lion → lion
The, the → the
No punctuation
No stopwords
Leads to a smaller vocabulary
Fewer dimensions help improve performance
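A rough sketch of how these preprocessing steps shrink the vocabulary. The stopword set and the lemma lookup below are hypothetical, hand-picked stand-ins for this three-sentence corpus only; a real pipeline would rely on an NLP library for both.

```python
import string

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Hypothetical, hand-picked resources for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "have"}
LEMMAS = {"lions": "lion"}

def preprocess(text):
    # Lowercase so that "The" and "the" collapse into one token.
    text = text.lower()
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Map each token to its base form, then drop stopwords.
    tokens = [LEMMAS.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]

raw_vocabulary = {tok for doc in corpus for tok in doc.split()}
clean_vocabulary = {tok for doc in corpus for tok in preprocess(doc)}

print(len(raw_vocabulary), len(clean_vocabulary))
print(sorted(clean_vocabulary))
```

Under these assumptions the vocabulary drops from 15 tokens to 7, so every document vector has fewer than half as many dimensions.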

Bag-of-words model using sklearn

# Import pandas to hold the corpus as a Series
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

Bag-of-words model using sklearn

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Generate the matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]
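The sklearn matrix has 13 columns rather than the 15 of the hand-built vocabulary, and "the" is counted 3 times in the first sentence. The sketch below emulates, in plain Python, what CountVectorizer's default settings are understood to do: lowercase the text and keep only tokens of two or more word characters (`token_pattern=r"(?u)\b\w\w+\b"`). It is an approximation for illustration, not sklearn's actual code, and the names `tokenize`, `feature_names` and `matrix` are chosen here.

```python
import re
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Assumed CountVectorizer defaults: lowercase=True and a token pattern
# that keeps only tokens with at least two word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Columns are the distinct tokens in alphabetical order.
feature_names = sorted({tok for doc in tokenized for tok in doc})

# One row of counts per document, laid out in column order.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[name] for name in feature_names])

print(feature_names)
for row in matrix:
    print(row)
```

Lowercasing merges "The" into "the", and the two-character minimum drops "a", which accounts for the two missing columns. "lion" and "lions" remain separate features because no lemmatization is applied.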

Let's practice!
