Introduction to Text Analysis in R
Maham Faisal Khan
Senior Data Science Content Developer
Some natural language processing (NLP) vocabulary:
library(dplyr)
library(tidytext)
tidy_review <- review_data %>%
  unnest_tokens(word, review)
tidy_review
# A tibble: 229,481 x 4
  date    product                    stars word
  <chr>   <chr>                      <dbl> <chr>
1 2/28/15 iRobot Roomba 650 for Pets     5 you
2 2/28/15 iRobot Roomba 650 for Pets     5 would
3 2/28/15 iRobot Roomba 650 for Pets     5 not
# … with 229,478 more rows
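To make the tokenizing step easier to try without the course data, here is a minimal sketch on a made-up two-row table. The name `toy_reviews` and its contents are assumptions for illustration, not part of `review_data`. With its default settings, `unnest_tokens()` lowercases the text, strips punctuation, and returns one row per word while keeping the other columns.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy data standing in for review_data (not the course data set)
toy_reviews <- tibble(
  product = c("A", "B"),
  stars   = c(5, 2),
  review  = c("You would not believe it!", "Stopped working. Not happy.")
)

# One row per token: the review column is replaced by a word column
toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review)

toy_tokens
```

Each output row still carries the `product` and `stars` values of the review it came from, which is what lets later steps count or filter words per product or rating.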
tidy_review %>%
  count(word) %>%
  arrange(desc(n))
# A tibble: 10,310 x 2
  word      n
  <chr> <int>
1 the   11785
2 it     7905
3 and    6794
# … with 10,307 more rows
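The count-then-sort pattern above can also be written in one call, since `count()` in dplyr accepts a `sort = TRUE` argument. The sketch below uses a small hand-made `toy_tokens` table (an assumption for illustration) rather than the review data.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens, one word per row, as unnest_tokens() would produce
toy_tokens <- tibble(
  word = c("the", "it", "the", "roomba", "the", "it")
)

# Two-step version, as on the slide
counts_two_step <- toy_tokens %>%
  count(word) %>%
  arrange(desc(n))

# One-step version: sort = TRUE orders by n, largest first
counts_one_step <- toy_tokens %>%
  count(word, sort = TRUE)

counts_one_step
```

Both versions produce the same table here; which one to use is a matter of taste.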
tidy_review2 <- review_data %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")
tidy_review2
# A tibble: 78,868 x 4
  date    product                    stars word
  <chr>   <chr>                      <dbl> <chr>
1 1/12/15 iRobot Roomba 650 for Pets     4 walk
2 1/12/15 iRobot Roomba 650 for Pets     4 rest
# … with 78,866 more rows
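As a rough illustration of what the `anti_join()` step does, the sketch below removes stop words from a made-up pair of reviews, then extends the list with one extra word. The names `toy_reviews` and `custom_stop_words`, and the choice of "roomba" as an extra stop word, are assumptions for this example only. `stop_words` is the data set shipped with tidytext, with columns `word` and `lexicon`.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy data (not the course data set)
toy_reviews <- tibble(
  id     = 1:2,
  review = c("The Roomba cleans the floor and it is quiet",
             "It is a vacuum and the dog hates it")
)

toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review)

# Keep only rows whose word does NOT appear in stop_words
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")

# Assumed extension: add a domain-specific word to the stop word list
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "roomba", lexicon = "custom")
)

toy_clean2 <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")

toy_clean2
```

Adding frequent product names as custom stop words is one possible way to stop them from dominating word counts.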
tidy_review2 %>%
  count(word) %>%
  arrange(desc(n))
# A tibble: 9,672 x 2
  word       n
  <chr>  <int>
1 roomba  2286
2 clean   1204
3 vacuum   989
# … with 9,669 more rows