Building tf-idf document vectors

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

n-gram modeling

The weight of a dimension depends on the frequency of the word corresponding to that dimension.
- A document contains the word human in five places.
- The dimension corresponding to human has a weight of 5.
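
The count-based weighting described above can be sketched in a few lines of standard-library Python. The sample sentence and the whitespace tokenizer are assumptions made for this illustration, not part of the course material.

```python
from collections import Counter

# Hypothetical document in which "human" occurs five times.
document = "human after human we are all human and human needs human contact"

# Naive whitespace tokenization (an assumption; real pipelines tokenize more carefully).
tokens = document.lower().split()
counts = Counter(tokens)

# One dimension per vocabulary word; the weight is the raw count.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(dict(zip(vocabulary, vector)))
print("weight for 'human':", counts["human"])
```

Every other word in this sample occurs once, so the dimension for human dominates the vector purely because of its count.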

Motivation

Some words occur very commonly across all documents
A corpus of documents on the universe
- One document has jupiter and universe occurring 20 times each.
- jupiter rarely occurs in the other documents. universe is common.
- Give more weight to jupiter on account of exclusivity.

Applications

Automatically detect stopwords
Search
Recommender systems
Better performance in predictive modeling for some cases
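
As a rough illustration of the first application, terms that appear in every document get an inverse document frequency of zero and so behave like stopwords. The corpus below is hypothetical, and the "idf equals zero" rule is an assumption of this sketch; a practical detector would more likely use a cutoff on document frequency.

```python
import math
from collections import Counter

# Hypothetical corpus (not from the course).
corpus = [
    "the planet jupiter has many moons",
    "the universe is vast",
    "the universe is expanding",
]

# Use sets so each term is counted at most once per document.
tokenized = [set(doc.lower().split()) for doc in corpus]
n_docs = len(tokenized)

# df: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in tokens)
idf = {term: math.log10(n_docs / count) for term, count in df.items()}

# Terms present in every document have idf 0, so all their tf-idf weights are 0.
stopword_candidates = sorted(term for term, value in idf.items() if value == 0)
print(stopword_candidates)
```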

Term frequency–inverse document frequency

Proportional to term frequency
Decreasing in the number of documents that contain the term (inverse document frequency)
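
Both properties can be checked on the jupiter and universe example from the motivation. This is a minimal sketch: the helper name `tfidf_weight`, the base-10 logarithm, and the corpus statistics (100 documents, jupiter in 2 of them, universe in 95) are assumptions chosen for illustration.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """tf * log10(n_docs / df); the base-10 log is an assumption of this sketch."""
    return tf * math.log10(n_docs / df)

# Hypothetical corpus statistics (not from the source).
N_DOCS = 100
w_jupiter = tfidf_weight(tf=20, n_docs=N_DOCS, df=2)    # rare term
w_universe = tfidf_weight(tf=20, n_docs=N_DOCS, df=95)  # common term

print(f"jupiter:  {w_jupiter:.2f}")
print(f"universe: {w_universe:.2f}")

# Doubling the term frequency doubles the weight.
print(tfidf_weight(40, N_DOCS, 2) / w_jupiter)
```

Both terms occur 20 times in the document, yet jupiter ends up with a far larger weight because so few documents contain it.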

Mathematical formula

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_{i}}\right)$$

$$w_{i,j} \rightarrow \text{weight of term } i \text{ in document } j$$

$$tf_{i,j} \rightarrow \text{term frequency of term } i \text{ in document } j$$

$$N \rightarrow \text{number of documents in the corpus}$$

$$df_{i} \rightarrow \text{number of documents containing term } i$$
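
The formula can be applied to a whole corpus in a few lines. The following is a from-scratch sketch, not how any particular library computes tf-idf: the function name `tfidf_vectors`, the whitespace tokenizer, the base-10 logarithm, and the three-document corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) where each weight is tf * log10(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # df: number of documents containing each term (set() counts a term once per document).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append([tf[t] * math.log10(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

# Hypothetical three-document corpus.
corpus = [
    "jupiter is a planet in the universe",
    "the universe is vast",
    "the universe is expanding",
]
vocabulary, vectors = tfidf_vectors(corpus)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, jupiter gets a positive weight while universe, which appears in every document, gets a weight of zero.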

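Example: the term library occurs 5 times in a document, and 8 of the 20 documents in the corpus contain it (base-10 logarithm):

$w_{library, document} = 5 \cdot \log\left(\frac{20}{8}\right) \approx 2$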
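
The worked example can be checked numerically. The sketch below assumes the three numbers map onto the formula as tf = 5, N = 20 and df = 8, and evaluates it with two logarithm bases; only base 10 gives a value near 2.

```python
import math

tf, n_docs, df = 5, 20, 8

w_base10 = tf * math.log10(n_docs / df)   # base-10 logarithm
w_natural = tf * math.log(n_docs / df)    # natural logarithm

print(f"base-10 log: {w_base10:.3f}")   # 1.990
print(f"natural log: {w_natural:.3f}")  # 4.581
```

The choice of base only rescales every weight by the same constant, so rankings are unaffected, but it matters when comparing numbers against another tool.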

tf-idf using scikit-learn

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

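# Generate the tf-idf matrix: one row per document, one column per vocabulary term
# (corpus is assumed to be an iterable of document strings defined earlier)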
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())

[[0.         0.         0.         0.         0.25434658 0.33443519
  0.33443519 0.         0.25434658 0.         0.25434658 0.
  0.76303975]
 [0.         0.46735098 0.         0.46735098 0.         0.
  0.         0.46735098 0.         0.46735098 0.35543247 0.
  0.        ]
...
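
The values printed above do not come from the plain formula $tf \cdot \log(N/df)$. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses a smoothed idf, $\ln\frac{1+N}{1+df}+1$, and then scales each row to unit Euclidean length. The sketch below reimplements that default in plain Python for a single row. The function name and the inputs (a three-document corpus, four terms with df = 1 and one term with df = 2, each occurring once) are assumptions: they were chosen because they should reproduce the nonzero values in the second row of the output, and are not taken from the actual course corpus.

```python
import math

def sklearn_default_tfidf_row(term_counts, doc_freqs, n_docs):
    """tf-idf for one document under TfidfVectorizer's documented defaults.

    idf = ln((1 + n_docs) / (1 + df)) + 1, followed by L2 normalization.
    """
    raw = [
        tf * (math.log((1 + n_docs) / (1 + df)) + 1)
        for tf, df in zip(term_counts, doc_freqs)
    ]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

# Assumed inputs: 3 documents; four terms unique to this document, one shared with another.
row = sklearn_default_tfidf_row(
    term_counts=[1, 1, 1, 1, 1],
    doc_freqs=[1, 1, 1, 1, 2],
    n_docs=3,
)
print([round(w, 4) for w in row])
```

Because of the normalization, weights from scikit-learn are comparable within a row but are not the raw $tf \cdot idf$ products.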

Let's practice!
