TF-IDF vectorization

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

From BoW to TF-IDF

BoW treats all words as equally important
TF-IDF addresses this by weighing:
- how often a word appears in a document
- how informative that word is across the whole collection

Image showing the BoW representation of two sentences: 'I love this NLP course' and 'I enjoyed this project.'

TF-IDF

Image showing that TF-IDF is the product of TF and IDF.

TF: Term Frequency
- How many times a word appears in a document
IDF: Inverse Document Frequency
- How rare that word is across all documents

Word present in one document, absent from the others → high score
Word present in all documents → low score
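The two rules above can be checked with a small hand-rolled sketch. It assumes the textbook definition, TF as a raw count and IDF as log(N / df); libraries such as scikit-learn use a smoothed variant, so their numbers differ. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are illustrative, not part of the course material.

```python
import math

# Hypothetical toy corpus, already lowercased and tokenized
docs = [
    ["i", "love", "this", "nlp", "course"],
    ["i", "enjoyed", "this", "project"],
]

def tf(term, doc):
    # Term frequency: raw count of the term in one document
    return doc.count(term)

def idf(term, docs):
    # Textbook inverse document frequency: log(N / df).
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # TF-IDF is the product of the two factors
    return tf(term, doc) * idf(term, docs)

score_nlp = tf_idf("nlp", docs[0], docs)    # present in one document only
score_this = tf_idf("this", docs[0], docs)  # present in every document

print(f"nlp:  {score_nlp:.3f}")
print(f"this: {score_this:.3f}")
```

Under this definition a word that occurs in every document gets an IDF of log(1) = 0, so its score vanishes no matter how often it appears.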

TF-IDF in code

reviews = [
    "I loved the movie. It was amazing!",
    "The movie was okay.",
    "I hated the movie. It was boring."
]

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)

['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']
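The `preprocess` helper is not defined on these slides. A minimal sketch that reproduces the printed output, assuming it only lowercases the text and strips ASCII punctuation (the course's own version may do more):

```python
import string

def preprocess(text):
    # Hypothetical stand-in for the course's helper:
    # lowercase, remove ASCII punctuation, collapse repeated whitespace
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reviews = [
    "I loved the movie. It was amazing!",
    "The movie was okay.",
    "I hated the movie. It was boring."
]

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)
```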

TF-IDF in code

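from sklearn.feature_extraction.text import TfidfVectorizer

# Defaults: lowercasing, tokens of 2+ characters, smoothed IDF, L2-normalized rows
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then encode the reviews as a sparse matrix
tfidf_matrix = vectorizer.fit_transform(cleaned_reviews)

# repr() gives the summary line only; a plain print may also list the stored entries
print(repr(tfidf_matrix))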

<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 16 stored elements and shape (3, 9)>

TF-IDF output

print(tfidf_matrix.toarray())

[[0.52523431 0.         0.         0.39945423 0.52523431 0.31021184 0.         0.31021184 0.31021184]
 [0.         0.         0.         0.         0.         0.41285857 0.69903033 0.41285857 0.41285857]
 [0.         0.52523431 0.52523431 0.39945423 0.         0.31021184 0.         0.31021184 0.31021184]]

print(vectorizer.get_feature_names_out())

['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
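To see where these numbers come from, the scores can be rebuilt by hand. This sketch assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, raw counts as TF), under which IDF is ln((1 + N) / (1 + df)) + 1 and each row is scaled to unit length. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cleaned_reviews = [
    "i loved the movie it was amazing",
    "the movie was okay",
    "i hated the movie it was boring",
]

# Raw term counts (TF), one row per review
counts = CountVectorizer().fit_transform(cleaned_reviews).toarray()

# Document frequency: number of reviews containing each term
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF * IDF, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(cleaned_reviews).toarray()
print(np.allclose(manual, reference))
```

Because of the smoothing and the added 1, words that occur in every review ('movie', 'the', 'was') keep a small non-zero score instead of dropping to zero.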

Visualizing the scores as a heatmap

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per review, one column per vocabulary term
df_tfidf = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)

# annot=True writes each score inside its cell
sns.heatmap(df_tfidf, annot=True)

plt.title("TF-IDF scores across reviews")
plt.xlabel("Terms")
plt.ylabel("Documents")
plt.show()

Heatmap of the TF-IDF representation of the dataset.

Comparison with BoW

Heatmap of the BoW representation of the dataset.

Heatmap of the TF-IDF representation of the dataset.

Let's practice!
