Introduction to Natural Language Processing in R
Kasey Jones
Research Data Scientist
t1 <- "My name is John. My best friend is Joe. We like tacos."
t2 <- "Two common best friend names are John and Joe."
t3 <- "Tacos are my favorite food. I eat them with my buddy Joe."
clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"
Compare t1 and t2
Compare t1 and t3
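One rough way to make these comparisons concrete is to list the cleaned words each pair of texts shares. This is only a sketch in base R, not necessarily how the course performs the comparison; `words_of` is a hypothetical helper introduced here for illustration.

```r
# Cleaned texts from the earlier slide
clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"

# Hypothetical helper: split a cleaned string into its words
words_of <- function(x) strsplit(x, " ", fixed = TRUE)[[1]]

# Words that t1 shares with t2, and with t3
shared_12 <- intersect(words_of(clean_t1), words_of(clean_t2))
shared_13 <- intersect(words_of(clean_t1), words_of(clean_t3))

shared_12  # "john" "friend" "joe"
shared_13  # "joe" "tacos"
```

A raw overlap count treats every shared word as equally informative, even a word such as "joe" that appears in all three texts.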
t1 <- "My name is John. My best friend is Joe. We like tacos."
t2 <- "Two common best friend names are John and Joe."
t3 <- "Tacos are my favorite food. I eat them with my buddy Joe."
Words in each text:
clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"
clean_t1: each of its 4 words appears once, so tf = 1/4 = .25
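As a minimal sketch, term frequency can be computed in base R as each word's share of the words in one text. The variable names `words` and `tf` are illustrative assumptions, not part of the original slides.

```r
clean_t1 <- "john friend joe tacos"

# Split the cleaned text into individual words
words <- strsplit(clean_t1, " ", fixed = TRUE)[[1]]

# Term frequency: occurrences of the word divided by the total word count
tf <- vapply(unique(words),
             function(w) sum(words == w) / length(words),
             numeric(1))

tf[["john"]]  # 0.25
```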
$ IDF = \log \frac{N}{n_{t}} $
N: total number of documents in the corpus
n_t: number of documents in which the term t appears
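The formula can be sketched in base R for the three cleaned texts, using the natural logarithm (which matches the idf values that bind_tf_idf() reports). The helper `idf` and the list `docs` are hypothetical names used only for this illustration.

```r
# The three cleaned texts, each split into a vector of words
docs <- strsplit(c("john friend joe tacos",
                   "common friend john joe names",
                   "tacos favorite food eat buddy joe"),
                 " ", fixed = TRUE)

N <- length(docs)  # total number of documents

# Hypothetical helper: IDF of a term that appears in at least one document
idf <- function(term) {
  n_t <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(N / n_t)
}

idf("joe")     # log(3/3) = 0, since "joe" is in every text
idf("tacos")   # log(3/2), about 0.405
idf("common")  # log(3/1), about 1.10
```

A word that appears in every document gets an IDF of 0, so it contributes nothing to the final score.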
Example:
clean_t1 <- "john friend joe tacos"
clean_t2 <- "common friend john joe names"
clean_t3 <- "tacos favorite food eat buddy joe"
TFIDF for "tacos":
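The calculation for "tacos" can be worked through in base R by multiplying its term frequency in each text by its inverse document frequency. This is a sketch under the definitions above; the names `tf_tacos`, `idf_tacos`, and `tfidf_tacos` are illustrative.

```r
# The three cleaned texts, each split into a vector of words
docs <- strsplit(c("john friend joe tacos",
                   "common friend john joe names",
                   "tacos favorite food eat buddy joe"),
                 " ", fixed = TRUE)
names(docs) <- c("clean_t1", "clean_t2", "clean_t3")

# TF: share of each text's words that are "tacos" (1/4, 0, 1/6)
tf_tacos <- vapply(docs, function(d) sum(d == "tacos") / length(d), numeric(1))

# IDF: log of (number of texts / number of texts containing "tacos") = log(3/2)
idf_tacos <- log(length(docs) / sum(tf_tacos > 0))

# TF-IDF is the product of the two
tfidf_tacos <- tf_tacos * idf_tacos
round(tfidf_tacos, 3)
# clean_t1 clean_t2 clean_t3
#    0.101    0.000    0.068
```

"tacos" scores highest in clean_t1 because that text is shorter, so the single occurrence makes up a larger share of its words.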
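library(dplyr)
library(tidytext)

# Create a data.frame with one row per text
df <- data.frame(text = c(t1, t2, t3), ID = c(1, 2, 3), stringsAsFactors = FALSE)

df %>%
  unnest_tokens(output = word, input = text, token = "words") %>%  # one lowercase word per row
  anti_join(stop_words, by = "word") %>%                           # drop stop words
  count(ID, word, sort = TRUE) %>%                                 # word counts per text
  bind_tf_idf(word, ID, n)                                         # add tf, idf, and tf_idf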
count() tallies how often each word appears in each ID; the resulting n column is what bind_tf_idf() uses.
# A tibble: 15 x 6
     ID word       n    tf   idf tf_idf
  <dbl> <chr>  <int> <dbl> <dbl>  <dbl>
1     1 friend     1  0.25 0.405 0.101
2     1 joe        1  0.25 0     0
3     1 john       1  0.25 0.405 0.101
4     1 tacos      1  0.25 0.405 0.101
5     2 common     1  0.2  1.10  0.220
6     2 friend     1  0.2  0.405 0.0811
...