Bag-of-Words representation

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

NLP workflow recap

Diagram of the full workflow, indicating that chapters 3 and 4 focus on the transformers libraries.

Bag-of-Words (BoW)

A basic technique for converting text into numbers
Represents a text by counting how many times each word occurs
Throws the words into a "bag" and counts them
Ignores grammar and word order

Image showing the words of a text being thrown into a bag and counted one by one.
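
To make the "ignores word order" point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are made up for illustration, and splitting on whitespace is a deliberately naive stand-in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2
print(a == b)    # True: same counts, even though the word order differs
```

The two sentences mean different things, yet their bags are identical, which is exactly the information BoW discards.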

BoW example

Image showing two sentences: 'I love NLP' and 'I love machine learning'.

BoW example

Image showing the vocabulary derived from the sentences: I, love, NLP, machine, and learning.

Build a vocabulary of all unique words

BoW example

Image showing the feature vector for each sentence, built from word counts over the defined vocabulary.

Build a vocabulary of all unique words
Count how many times each vocabulary word occurs in each sentence
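
The two steps above can be sketched by hand in plain Python. This is an illustration rather than how any library does it: the variable names are arbitrary, tokens come from a simple whitespace split, and the vocabulary is kept in order of first appearance to match the slide.

```python
sentences = ["I love NLP", "I love machine learning"]

# Step 1: build the vocabulary of unique words, in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how often each vocabulary word occurs in each sentence
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)  # ['I', 'love', 'NLP', 'machine', 'learning']
print(vectors)     # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

Every sentence gets a vector of the same length as the vocabulary, regardless of how many words the sentence itself contains.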

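BoW with code

import string

from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be downloaded

reviews = ["I loved the movie. It was amazing!",
           "The movie was okay.",
           "I hated the movie. It was boring."]

def preprocess(text):
    # Lowercase so that "The" and "the" count as the same word
    text = text.lower()
    # Split the text into word and punctuation tokens
    tokens = word_tokenize(text)
    # Drop punctuation tokens such as "." and "!"
    tokens = [word for word in tokens if word not in string.punctuation]
    # Rejoin into a single string, the input format CountVectorizer expects
    return " ".join(tokens)

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)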

['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']

BoW with code

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Learn the vocabulary from the cleaned reviews
vectorizer.fit(cleaned_reviews)
# Vocabulary words, sorted alphabetically
print(vectorizer.get_feature_names_out())

['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']

BoW output

# Transform the reviews with the already-fitted vectorizer
X = vectorizer.transform(cleaned_reviews)

# OR fit and transform in a single step
X = vectorizer.fit_transform(cleaned_reviews)

print(X)

<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 16 stored elements and shape (3, 9)>

Sparse matrix: a table of mostly zeros, stored compactly by keeping only the nonzero entries
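
To see what "stored elements" means, the following sketch builds a sparse matrix directly with SciPy from the same counts that the three reviews produce. Constructing it by hand with `csr_matrix` and computing a `density` value are illustrative choices, not steps from the original slides.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense counts for the three reviews (3 reviews x 9 vocabulary words)
dense = np.array([[1, 0, 0, 1, 1, 1, 0, 1, 1],
                  [0, 0, 0, 0, 0, 1, 1, 1, 1],
                  [0, 1, 1, 1, 0, 1, 0, 1, 1]])

X_sparse = csr_matrix(dense)

print(X_sparse.shape)  # (3, 9)
print(X_sparse.nnz)    # 16 stored (nonzero) entries

# Fraction of cells that are nonzero
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print(round(density, 2))  # 0.59
```

In this tiny example more than half of the cells are nonzero, so the sparse format saves little. The benefit typically shows up with larger corpora, where each document uses only a small fraction of the vocabulary.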

BoW output

print(X.toarray())

[[1 0 0 1 1 1 0 1 1]
 [0 0 0 0 0 1 1 1 1]
 [0 1 1 1 0 1 0 1 1]]

print(vectorizer.get_feature_names_out())

['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
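
Each column of the array lines up with the word at the same position in the feature names. A small sketch of reading a row back as word counts follows; the helper `row_to_counts` is hypothetical, and the arrays are copied from the output above so the snippet runs on its own.

```python
import numpy as np

words = np.array(['amazing', 'boring', 'hated', 'it', 'loved',
                  'movie', 'okay', 'the', 'was'])
counts = np.array([[1, 0, 0, 1, 1, 1, 0, 1, 1],
                   [0, 0, 0, 0, 0, 1, 1, 1, 1],
                   [0, 1, 1, 1, 0, 1, 0, 1, 1]])

def row_to_counts(row, words):
    # Hypothetical helper: pair each column with its word, keep nonzero counts
    return {str(w): int(c) for w, c in zip(words, row) if c > 0}

for i, row in enumerate(counts):
    print(i, row_to_counts(row, words))
# The second review (index 1) gives {'movie': 1, 'okay': 1, 'the': 1, 'was': 1}
```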

Word frequencies

import matplotlib.pyplot as plt
import numpy as np

# Sum each column to get the total count of every word across all reviews
word_counts = np.sum(X.toarray(), axis=0)
words = vectorizer.get_feature_names_out()

plt.bar(words, word_counts)
plt.title("Word Frequencies in Movie Reviews")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

Bar chart of the words and their frequencies, with stop words being the most frequent.
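
For larger corpora, converting the whole matrix to a dense array just to sum it can be wasteful. One possible alternative, sketched here under the assumption that the counts are held in a SciPy sparse matrix, is to sum the columns directly and then rank the words. The matrix is rebuilt by hand so the snippet is self-contained, and the name `ranked` is arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix

words = ['amazing', 'boring', 'hated', 'it', 'loved',
         'movie', 'okay', 'the', 'was']
X = csr_matrix(np.array([[1, 0, 0, 1, 1, 1, 0, 1, 1],
                         [0, 0, 0, 0, 0, 1, 1, 1, 1],
                         [0, 1, 1, 1, 0, 1, 0, 1, 1]]))

# Sum each column without converting the whole matrix to a dense array
word_counts = np.asarray(X.sum(axis=0)).ravel()

# Rank words from most to least frequent (ties keep vocabulary order)
ranked = sorted(zip(words, word_counts.tolist()),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:4])  # [('movie', 3), ('the', 3), ('was', 3), ('it', 2)]
```

The top of the ranking is dominated by words such as "the", "was", and "it", matching what the bar chart shows.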

Let's practice!
