TF-IDF vectorization

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

From BoW to TF-IDF

  • BoW treats every word as equally important
  • TF-IDF improves on this by telling us:
    • how often a word appears in a document
    • how meaningful that word is across the whole collection

Figure showing the BoW representation of two sentences: 'I love this NLP course' and 'I enjoyed this project.'

TF-IDF

Figure showing that TF-IDF is the product of TF and IDF.

  • TF: Term Frequency
    • How often a word appears in a document
  • IDF: Inverse Document Frequency
    • How rare that word is across all documents

  • A word appears in one document but not in the others → high score
  • A word appears in every document → low score
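The slide does not give an exact formula, and several TF-IDF variants exist. As a minimal sketch, the code below computes the scores by hand for three short cleaned reviews, assuming the variant scikit-learn's `TfidfVectorizer` uses by default: raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The helper name `tfidf` and the whitespace tokenizer are illustrative choices.

```python
import math
from collections import Counter

docs = [
    "i loved the movie it was amazing",
    "the movie was okay",
    "i hated the movie it was boring",
]

# Split on whitespace and drop one-letter tokens, which matches what
# TfidfVectorizer's default token pattern does for text this simple
tokenized = [[t for t in d.split() if len(t) > 1] for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)  # raw term counts within this document
    # TF times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
    return {t: w / norm for t, w in weights.items()}

scores = [tfidf(tokens) for tokens in tokenized]
print(round(scores[1]["okay"], 4), round(scores[1]["movie"], 4))
```

In the second review, 'okay' occurs in no other document and scores about 0.699, while 'movie' occurs in all three and scores about 0.413.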

TF-IDF in code

reviews = [
    "I loved the movie. It was amazing!",
    "The movie was okay.",
    "I hated the movie. It was boring."
]

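# preprocess() is not defined on this slide; judging by the output,
# it lowercases the text and strips punctuation
cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)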
['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']
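The `preprocess` function is not shown here. One hypothetical version that reproduces the printed output is sketched below, assuming it only lowercases the text and removes ASCII punctuation. The actual helper may do more, such as removing stop words or lemmatizing.

```python
import string

def preprocess(text):
    """Hypothetical helper: lowercase the text and strip ASCII punctuation."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse any repeated whitespace left behind by the removal
    return " ".join(no_punct.split())

reviews = [
    "I loved the movie. It was amazing!",
    "The movie was okay.",
    "I hated the movie. It was boring.",
]

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)
```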

TF-IDF in code

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_reviews)
print(tfidf_matrix)
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 16 stored elements and shape (3, 9)>

TF-IDF output

print(tfidf_matrix.toarray())
[[0.52523431 0.         0.         0.39945423 0.52523431 0.31021184   0.         0.31021184 0.31021184]
 [0.         0.         0.         0.         0.         0.41285857   0.69903033 0.41285857 0.41285857]
 [0.         0.52523431 0.52523431 0.39945423 0.         0.31021184   0.         0.31021184 0.31021184]]
print(vectorizer.get_feature_names_out())
['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']

Visualizing scores as a heatmap

import pandas as pd

# One row per review, one column per vocabulary term
df_tfidf = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)
import seaborn as sns
import matplotlib.pyplot as plt

# annot=True writes each score inside its cell
sns.heatmap(df_tfidf, annot=True)
plt.title("TF-IDF Scores per Review")
plt.xlabel("Term")
plt.ylabel("Document")
plt.show()

 

Heatmap of the TF-IDF representation of the dataset.


Comparing with BoW

Heatmap of the BoW representation of the dataset.

Heatmap of the TF-IDF representation of the dataset.


Let's practice!
