Tokenization

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

What are tokens?

Tokens are the units a text is split into for analysis.

Common types of tokenization:

  • characters
  • words
  • sentences
  • documents
  • regular expression separations
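
To make the token types above concrete, here is a minimal base R sketch that splits one made-up string into characters, words, and sentences. The example text and the splitting rules are assumptions chosen for illustration; they are much cruder than what a dedicated tokenizer does.

```r
# Toy text, invented for this illustration (not taken from the course data).
text <- "The pigs ran the farm. The horses did the work."

# Character tokens: split between every character.
chars <- strsplit(text, "")[[1]]

# Word tokens: lower-case, strip punctuation, then split on whitespace.
words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]

# Sentence tokens: split on whitespace that follows a full stop (a naive rule).
sentences <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]]

length(words)      # number of word tokens
length(sentences)  # number of sentence tokens
```

The same text yields a different number of tokens depending on the unit chosen, which is why the tokenization type has to be decided before any counting.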

tidytext package

Package overview:

  • "Text Mining using dplyr, ggplot2, and Other Tidy Tools"
  • Follows the tidy data format

Introduction to the Tidyverse

1 https://cran.r-project.org/web/packages/tidytext/index.html

The Animal Farm dataset

animal_farm
# A tibble: 10 x 2
   chapter    text_column
   <chr>      <chr>
 1 Chapter 1  "Mr. Jones, of the Manor Farm, had locked ...
 2 Chapter 2  "Three nights later old Major died peacefully ...
 3 Chapter 3  "How they toiled and sweated to get the hay ...
...

1 https://en.wikipedia.org/wiki/Animal_Farm

Tokenization practice

animal_farm %>%
  unnest_tokens(output = "word",
                input = text_column,
                token = "words")

Token Options

  • sentences
  • lines
  • regex
  • words
  • ...
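
The token options listed above are passed through the token argument of unnest_tokens(). As a small sketch, the code below applies two of them to a one-row tibble; the toy tibble and its text are assumptions made up for illustration, and only the column names mirror animal_farm.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row stand-in for animal_farm (same column names, invented text).
toy <- tibble(chapter = "Chapter 1",
              text_column = "The pigs ran the farm. The horses did the work.")

# One row per sentence.
toy_sentences <- toy %>%
  unnest_tokens(output = "sentence",
                input = text_column,
                token = "sentences")

# One row per word.
toy_words <- toy %>%
  unnest_tokens(output = "word",
                input = text_column,
                token = "words")

nrow(toy_sentences)
nrow(toy_words)
```

Only the token value and the output column name change between the two calls; the rest of the pipeline stays the same.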

Counting tokens

animal_farm %>%
  unnest_tokens(output = "word",
                token = "words",
                input = text_column) %>%
  count(word, sort = TRUE)
# A tibble: 4,076 x 2
   word      n
   <chr> <int>
 1 the    2187
 2 and     966
 3 of      899
 4 to      814
 ...
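
The same counting pipeline can be checked by hand on a tiny input. The sketch below uses a hypothetical two-row tibble (invented text, column names borrowed from animal_farm) so that the most frequent word is easy to verify.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-chapter stand-in for animal_farm.
toy <- tibble(
  chapter     = c("Chapter 1", "Chapter 2"),
  text_column = c("The pigs ran the farm.", "The horses did the work.")
)

# Tokenize into words, then count each distinct word across all chapters.
word_counts <- toy %>%
  unnest_tokens(output = "word",
                input = text_column,
                token = "words") %>%
  count(word, sort = TRUE)

word_counts
```

Because unnest_tokens() lower-cases tokens by default, "The" and "the" are counted as the same word, which is also why common function words dominate the top of the Animal Farm counts.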

Tokenization with regular expressions

animal_farm %>%
  filter(chapter == 'Chapter 1') %>%
  unnest_tokens(output = "Boxer", input = text_column,
                token = "regex", pattern = "(?i)boxer") %>%
  slice(2:n())
# A tibble: 5 x 2
  chapter   Boxer
  <chr>     <chr>
1 Chapter 1 " and clover, came in together, walking very slowly and setting down their vast hairy hoofs with great care lest there should be some…
2 Chapter 1 " was an enormous beast, nearly eighteen hands high, and as strong as any two ordinary horses put together. a white stripe down his n…
3 Chapter 1 "; the two of them usually spent their sundays together in the small paddock beyond the orchard, grazing side by side and never speak…
...
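
With token = "regex", the pattern acts as a separator: the text is cut wherever the pattern matches, and the matched text itself is dropped. The sketch below shows this on a short invented string; the toy tibble and the output column name piece are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row stand-in for a chapter of animal_farm.
toy <- tibble(
  chapter     = "Chapter 1",
  text_column = "Clover was a mare. Boxer was a horse. Everyone admired Boxer and his strength."
)

# Split the text at every case-insensitive match of "boxer".
pieces <- toy %>%
  unnest_tokens(output = "piece",
                input = text_column,
                token = "regex",
                pattern = "(?i)boxer")

# The first piece is the text before the first match, so drop it
# to keep only the text that follows each mention.
after_boxer <- pieces %>%
  slice(2:n())

after_boxer
```

Two matches produce three pieces, and slice(2:n()) keeps the two that follow a mention of the name. This mirrors the slide, where dropping the first row leaves only the text after each occurrence of "Boxer".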

Let's tokenize some text.
