TF‑IDF

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

Pitfalls of bag-of-words

t1 <- "My name is John. My best friend is Joe. We like tacos."
t2 <- "Two common best friend names are John and Joe."
t3 <- "Tacos are my favorite food. I eat them with my buddy Joe."

clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

Shared words

clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

Compare t1 and t2

3/4 words from t1 are in t2
3/5 words from t2 are in t1

Compare t1 and t3

2/4 words from t1 are in t3
2/6 words from t3 are in t1
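
The overlap counts above can be reproduced with a few lines of base R. This is only a minimal sketch: the helper names `words` and `share_in` are made up for illustration and are not part of the course code.

```r
# Cleaned texts from the slide
clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

# Hypothetical helper: split a cleaned text into its words
words <- function(x) strsplit(x, " ", fixed = TRUE)[[1]]

# Hypothetical helper: fraction of the words in `a` that also appear in `b`
share_in <- function(a, b) mean(words(a) %in% words(b))

share_in(clean_t1, clean_t2)  # 3 of 4 words
share_in(clean_t2, clean_t1)  # 3 of 5 words
share_in(clean_t1, clean_t3)  # 2 of 4 words
share_in(clean_t3, clean_t1)  # 2 of 6 words
```

Going by these shares alone, t1 looks closer to t2 than to t3, because every shared word counts equally.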

Tacos matter

t1 <- "My name is John. My best friend is Joe. We like tacos."
t2 <- "Two common best friend names are John and Joe."
t3 <- "Tacos are my favorite food. I eat them with my buddy Joe."

Words in each text:

John: t1, t2
Joe: t1, t2, t3
Tacos: t1, t3
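
The word-to-document listing above could be derived along these lines in base R. This is a sketch that assumes the cleaned word lists from the earlier slides; `docs` and `docs_with` are illustrative names, not course code.

```r
# Cleaned words per document, as on the earlier slides
docs <- list(
  t1 = c("john", "friend", "joe", "tacos"),
  t2 = c("common", "friend", "john", "joe", "names"),
  t3 = c("tacos", "favorite", "food", "eat", "buddy", "joe")
)

# Hypothetical helper: names of the documents that contain `term`
docs_with <- function(term) {
  names(docs)[vapply(docs, function(d) term %in% d, logical(1))]
}

docs_with("john")   # "t1" "t2"
docs_with("joe")    # "t1" "t2" "t3"
docs_with("tacos")  # "t1" "t3"
```

A word that appears in every document, like joe, says little about what makes any one document distinct.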

TF‑IDF

clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

TF: Term Frequency
- The proportion of words in a text that are that term
- john is 1 of the 4 words in clean_t1, so TF = 0.25
IDF: Inverse Document Frequency
- A weight based on how many documents in the corpus contain the term; the more documents contain it, the lower the weight
- joe appears in 3/3 documents, so IDF = 0
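
The term-frequency definition can be checked by hand in base R. A minimal sketch, assuming the cleaned text is already split on single spaces; `tokens` and `tf_john` are illustrative names.

```r
clean_t1 <- "john friend joe tacos"

# Split the cleaned text into individual words
tokens <- strsplit(clean_t1, " ", fixed = TRUE)[[1]]

# Term frequency: occurrences of the term divided by the number of words
tf_john <- sum(tokens == "john") / length(tokens)
tf_john  # 1 of 4 words, so 0.25
```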

IDF equation

$ IDF = \log \frac{N}{n_{t}} $

N: total number of documents in the corpus
$n_{t}$: number of documents in which the term appears

Example (natural log):

IDF of tacos: $\log (\frac{3}{2}) = .405$
IDF of buddy: $\log (\frac{3}{1}) = 1.10$
IDF of joe: $\log (\frac{3}{3}) = 0$
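
The IDF values can be recomputed with base R's `log()`, which is the natural logarithm by default. This is a sketch under the assumption that the corpus is the three cleaned texts; `docs` and `idf` are illustrative names, not functions from a package.

```r
# Cleaned words per document
docs <- list(
  t1 = c("john", "friend", "joe", "tacos"),
  t2 = c("common", "friend", "john", "joe", "names"),
  t3 = c("tacos", "favorite", "food", "eat", "buddy", "joe")
)

# Hypothetical helper: IDF = log(N / n_t)
idf <- function(term) {
  N   <- length(docs)                                            # documents in the corpus
  n_t <- sum(vapply(docs, function(d) term %in% d, logical(1)))  # documents containing the term
  log(N / n_t)
}

idf("tacos")  # log(3/2)
idf("buddy")  # log(3/1)
idf("joe")    # log(3/3), which is 0
```

Note that this helper returns `-Inf`-free results only for terms that occur in at least one document; a term absent from the corpus would divide by zero.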

TF + IDF

clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

TF‑IDF for "tacos":

clean_t1: TF * IDF = (1/4) * (.405) = 0.101
clean_t2: TF * IDF = (0/5) * (.405) = 0
clean_t3: TF * IDF = (1/6) * (.405) = 0.068
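
The three products can be computed in one pass in base R. This is a sketch, not the tidytext implementation; `docs`, `tf`, `idf`, and `tf_idf_tacos` are names introduced here for illustration.

```r
# Cleaned words per document
docs <- list(
  t1 = c("john", "friend", "joe", "tacos"),
  t2 = c("common", "friend", "john", "joe", "names"),
  t3 = c("tacos", "favorite", "food", "eat", "buddy", "joe")
)

# Share of the words in one document that are `term`
tf <- function(term, doc) sum(doc == term) / length(doc)

# Natural log of (number of documents / documents containing `term`)
idf <- function(term) {
  n_t <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(length(docs) / n_t)
}

# TF-IDF of "tacos" in each document
tf_idf_tacos <- vapply(docs, function(d) tf("tacos", d) * idf("tacos"), numeric(1))
round(tf_idf_tacos, 3)
```

The score is highest in t1, where "tacos" makes up a larger share of a shorter text, and zero in t2, where the word does not occur.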

Calculating the TF‑IDF matrix

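# Load dplyr (%>%, anti_join, count) and tidytext (unnest_tokens, stop_words, bind_tf_idf)
library(dplyr)
library(tidytext)

# Create a data.frame with one row per document
df <- data.frame(text = c(t1, t2, t3), ID = c(1, 2, 3), stringsAsFactors = FALSE)

df %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%  # one row per word
  anti_join(stop_words, by = "word") %>%                             # drop stop words
  count(ID, word, sort = TRUE) %>%                                   # word counts per document
  bind_tf_idf(word, ID, n)                                           # add tf, idf, tf_idf columns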

word: the column containing the terms
ID: the column containing the document IDs
n: the word count produced by count()

bind_tf_idf output

# A tibble: 15 x 6
      ID word         n    tf   idf tf_idf
   <dbl> <chr>    <int> <dbl> <dbl>  <dbl>
 1     1 friend       1 0.25  0.405 0.101 
 2     1 joe          1 0.25  0     0     
 3     1 john         1 0.25  0.405 0.101 
 4     1 tacos        1 0.25  0.405 0.101 
 5     2 common       1 0.2   1.10  0.220 
 6     2 friend       1 0.2   0.405 0.0811
 ...

TF‑IDF practice
