Bag-of-Words Representation

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

NLP workflow recap

Full workflow diagram noting that Chapters 3 and 4 focus on the transformers library.

Bag-of-Words (BoW)

  • A foundational technique for representing text as numbers
  • Represents text by counting the frequency of each word
  • Put the words in a “bag” and count their occurrences
  • Ignores grammar and word order

The figure shows words from a text being placed into a bag, with the occurrences of each word then counted.
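
As a minimal sketch of the idea above, Python's standard `collections.Counter` can play the role of the "bag". The sample sentences and the naive whitespace split are illustrative assumptions, not part of the course material.

```python
from collections import Counter

# Hypothetical sample text, chosen only to illustrate counting.
text = "the cat chased the dog"

# Put every word in the "bag" and count its occurrences (naive whitespace split).
bag = Counter(text.split())
print(bag["the"])  # 2

# Word order is ignored: a reordered sentence produces the same bag.
reordered = Counter("the dog chased the cat".split())
print(bag == reordered)  # True
```

Both sentences map to the same counts even though they mean different things, which is the information BoW gives up by ignoring order.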

BoW example

The figure shows two sentences: 'I love NLP' and 'I love machine learning.'

BoW example

The figure shows the vocabulary extracted from the sentences: I, love, NLP, machine, and learning.

  • Build a vocabulary from all unique words

BoW example

The figure shows the feature vector for each sentence, built by counting word occurrences against the vocabulary.

  • Build a vocabulary from all unique words
  • Count how many times each vocabulary word appears
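
The two steps above can be sketched by hand for the example sentences. This is a plain-Python illustration under simple assumptions (whitespace tokenization, vocabulary kept in order of first appearance), not how a library implements it.

```python
sentences = ["I love NLP", "I love machine learning"]

# Step 1: build the vocabulary from all unique words, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how often each vocabulary word occurs in each sentence.
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)  # ['I', 'love', 'NLP', 'machine', 'learning']
print(vectors)     # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

Each vector has one position per vocabulary word, so every sentence is represented with the same length regardless of how many words it contains.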

BoW in code

import string
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data

reviews = ["I loved the movie. It was amazing!",
           "The movie was okay.",
           "I hated the movie. It was boring."]

def preprocess(text):
    # Lowercase, tokenize, then drop punctuation tokens
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in string.punctuation]
    return " ".join(tokens)

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)
['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']
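
If the NLTK tokenizer data is not available, a rough stand-in for the preprocessing step can be written with the standard library alone. The helper name `preprocess_regex` is hypothetical, and a regular expression is a swapped-in technique: it gives the same result on these three reviews but can differ from `word_tokenize` on contractions and hyphenated words.

```python
import re

reviews = ["I loved the movie. It was amazing!",
           "The movie was okay.",
           "I hated the movie. It was boring."]

def preprocess_regex(text):
    # Hypothetical helper: lowercase, then keep runs of word characters only,
    # which discards punctuation such as "." and "!".
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)

cleaned_reviews = [preprocess_regex(review) for review in reviews]
print(cleaned_reviews)
```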

BoW in code

from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()
vectorizer.fit(cleaned_reviews)
print(vectorizer.get_feature_names_out())
['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']

BoW output

X = vectorizer.transform(cleaned_reviews)

# Or fit and transform in one step: X = vectorizer.fit_transform(cleaned_reviews)
print(X)
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 16 stored elements and shape (3, 9)>

Sparse matrix: a table that is mostly zeros
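
To see what "16 stored elements" means, the sketch below rebuilds the same 3 x 9 count matrix with SciPy and compares the number of stored entries with the total number of cells. Constructing the matrix by hand is only for illustration; in the slides it comes from the vectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same counts as the three-review example (3 documents x 9 words).
dense = np.array([[1, 0, 0, 1, 1, 1, 0, 1, 1],
                  [0, 0, 0, 0, 0, 1, 1, 1, 1],
                  [0, 1, 1, 1, 0, 1, 0, 1, 1]])
X = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries; zeros are not stored.
total_cells = X.shape[0] * X.shape[1]
print(X.nnz, total_cells)  # 16 27
```

With only three short reviews, most cells are still non-zero. The saving shows up on larger corpora, where the vocabulary has thousands of words and each document uses only a few of them.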

BoW output

print(X.toarray())
[[1 0 0 1 1 1 0 1 1]
 [0 0 0 0 0 1 1 1 1]
 [0 1 1 1 0 1 0 1 1]]
print(vectorizer.get_feature_names_out())
['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
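
Each column of the array lines up with the word at the same position in the feature names. As a small sketch, the two outputs above can be paired to read each row as word counts; the values are copied from the printed output, and the name `labelled` is illustrative.

```python
import numpy as np

words = np.array(['amazing', 'boring', 'hated', 'it', 'loved',
                  'movie', 'okay', 'the', 'was'])
counts = np.array([[1, 0, 0, 1, 1, 1, 0, 1, 1],
                   [0, 0, 0, 0, 0, 1, 1, 1, 1],
                   [0, 1, 1, 1, 0, 1, 0, 1, 1]])

# For each review, keep only the words whose count is non-zero.
labelled = [{str(word): int(n) for word, n in zip(words, row) if n > 0}
            for row in counts]
for review_counts in labelled:
    print(review_counts)
```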

Word frequencies

import matplotlib.pyplot as plt
import numpy as np

# Sum each column to get the total count of every word across all reviews
word_counts = np.sum(X.toarray(), axis=0)
words = vectorizer.get_feature_names_out()

plt.bar(words, word_counts)
plt.title("Word Frequencies in Movie Reviews")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

Bar plot showing the words and their frequencies; stopwords such as 'the' and 'was' are among the most frequent.

Let's practice!
