Introduction to Natural Language Processing in R
Kasey Jones
Research Data Scientist
Common types of tokenization:
Package overview:
tidytext: "Text Mining using dplyr, ggplot2, and Other Tidy Tools"
animal_farm
# A tibble: 10 x 2
  chapter   text_column
  <chr>     <chr>
1 Chapter 1 "Mr. Jones, of the Manor Farm, had locked ...
2 Chapter 2 "Three nights later old Major died peacefully ...
3 Chapter 3 "How they toiled and sweated to get the hay ...
...
animal_farm %>%
  unnest_tokens(output = "word",
                input = text_column,
                token = "words")
Token Options
sentences
lines
regex
words
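To make the token options concrete, here is a rough base R sketch of what each unit looks like on a toy string. The text and the splitting rules are assumptions made for illustration; the tokenizers behind unnest_tokens() handle abbreviations, punctuation, and Unicode more carefully than these one-line patterns do.

```r
# Toy text, invented for illustration (not taken from the book).
text <- "The pigs gave orders. The hens laid eggs.\nThe dogs kept watch."

# "lines": split on newline characters.
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

# "sentences": crude rule, split on whitespace that follows ., ! or ?
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# "words": lowercase, then split on anything that is not a letter or apostrophe.
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[words != ""]  # drop empty pieces, if any

print(lines)
print(sentences)
print(words)
```

The same text gives fewer, longer tokens for lines and sentences, and many short ones for words, which is the trade-off to keep in mind when picking a token argument.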
animal_farm %>%
  unnest_tokens(output = "word",
                token = "words",
                input = text_column) %>%
  count(word, sort = TRUE)
# A tibble: 4,076 x 2
  word      n
  <chr> <int>
1 the    2187
2 and     966
3 of      899
4 to      814
...
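The counting step is a tally of identical tokens, sorted from most to least frequent. As a minimal base R sketch of that idea, on a made-up word vector rather than the book's text:

```r
# Toy tokens, invented for illustration.
words <- c("the", "pigs", "and", "the", "hens", "and", "the", "dogs")

# Tally each distinct word, then sort so the most frequent comes first.
counts <- sort(table(words), decreasing = TRUE)

# Arrange the result like the two-column output of count(word, sort = TRUE).
word_counts <- data.frame(word = names(counts), n = as.integer(counts))
print(word_counts)
```

As in the real output above, the top of the list is dominated by function words such as "the" and "and".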
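animal_farm %>%
  filter(chapter == "Chapter 1") %>%
  # Split the chapter at every case-insensitive match of "boxer"
  unnest_tokens(output = "Boxer", input = text_column,
                token = "regex", pattern = "(?i)boxer") %>%
  # Drop the first piece, i.e. the text before the first match
  slice(2:n())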
# A tibble: 5 x 2
  chapter   Boxer
  <chr>     <chr>
2 Chapter 1 " and clover, came in together, walking very slowly and setting down their vast hairy hoofs with great care lest there should be some…
3 Chapter 1 " was an enormous beast, nearly eighteen hands high, and as strong as any two ordinary horses put together. a white stripe down his n…
4 Chapter 1 "; the two of them usually spent their sundays together in the small paddock beyond the orchard, grazing side by side and never speak…
...
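With token = "regex", the pattern acts as a separator: the text is cut at every match, and each row holds the text between two matches. That is why every row above starts right after a mention of Boxer, and why the first piece (everything before the first mention) is dropped. A rough base R sketch of the same splitting, on an invented sentence and assuming the default lowercasing that the output above suggests:

```r
# Toy text, invented for illustration.
text <- "Clover and Boxer came in together. Boxer was an enormous beast; boxer never complained."

# Lowercase first, then split at every case-insensitive match of "boxer".
pieces <- strsplit(tolower(text), "(?i)boxer", perl = TRUE)[[1]]

# The first piece is the text before the first match.
after_boxer <- pieces[-1]

print(pieces)
print(after_boxer)
```

The matched word itself is consumed by the split, so "boxer" does not appear in any of the pieces.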