Tokenizing and cleaning

Introduction to Text Analysis in R

Maham Faisal Khan

Senior Data Science Content Developer

Using tidytext

Tokenizing text

Some natural language processing (NLP) vocabulary:

Bag of words: Words in a document are treated as independent, so word order is ignored
Every separate body of text is a document
Every unique word is a term
Every occurrence of a term is a token
Splitting text into tokens to create a bag of words is called tokenizing
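
To make the vocabulary concrete, here is a minimal sketch on a made-up data frame. The `toy` data and its column names are assumptions for illustration only and are not part of the course data. Each row of `toy` is a document, each row of the tokenized result is a token, and the distinct words are the terms.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: two documents, one review per row
toy <- tibble(
  id = 1:2,
  review = c("The vacuum cleans well", "The vacuum is loud")
)

# One row per token (every occurrence of a word)
toy_tokens <- toy %>%
  unnest_tokens(word, review)

n_documents <- nrow(toy)                    # number of documents
n_tokens    <- nrow(toy_tokens)             # number of tokens
n_terms     <- n_distinct(toy_tokens$word)  # number of unique terms
```

Here "the" and "vacuum" each appear twice, so they count as two tokens each but only one term each.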

Using unnest_tokens()

library(dplyr)
library(tidytext)
tidy_review <- review_data %>%
  unnest_tokens(word, review)

tidy_review

# A tibble: 229,481 x 4
   date    product                    stars word   
   <chr>   <chr>                      <dbl> <chr>  
 1 2/28/15 iRobot Roomba 650 for Pets     5 you    
 2 2/28/15 iRobot Roomba 650 for Pets     5 would  
 3 2/28/15 iRobot Roomba 650 for Pets     5 not    
# … with 229,478 more rows
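
The output above shows lowercase words and no `review` column. A small sketch of what `unnest_tokens()` does by default, using a hypothetical one-row data frame (`toy_review` is an assumption standing in for `review_data`): it lowercases the text, strips punctuation, keeps the other columns on every row, and replaces the input column with the output column.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame standing in for review_data
toy_review <- tibble(
  product = "Toy Vacuum",
  stars = 5,
  review = "Works GREAT, would buy again!"
)

# Tokenize the review column into one word per row
toy_tidy <- toy_review %>%
  unnest_tokens(word, review)

names(toy_tidy)  # product and stars are kept; review is replaced by word
toy_tidy$word    # lowercased words with punctuation removed
```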

Counting words

tidy_review %>% 
  count(word) %>% 
  arrange(desc(n))

# A tibble: 10,310 x 2
   word      n
   <chr> <int>
 1 the   11785
 2 it     7905
 3 and    6794
# … with 10,307 more rows
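
As a side note, `count()` in dplyr also accepts `sort = TRUE`, which should give the same ordering as the two-step `count()` then `arrange(desc(n))` pipeline. A minimal sketch on hypothetical tokenized data (`toy_words` is made up for illustration):

```r
library(dplyr)

# Hypothetical tokenized data: one word per row
toy_words <- tibble(word = c("the", "it", "the", "clean", "the", "it"))

# Two-step version, as in the slide
by_arrange <- toy_words %>%
  count(word) %>%
  arrange(desc(n))

# One-call version: count and sort by descending n
by_sort <- toy_words %>%
  count(word, sort = TRUE)
```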

Using anti_join()

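Stop words are very common words (such as "the", "it", and "and") that carry little meaning on their own; we'd like to remove them from our tidied data frame
anti_join() with the stop_words data frame from tidytext keeps only the rows whose word has no match in stop_words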

tidy_review2 <- review_data %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")

tidy_review2

# A tibble: 78,868 x 4
   date     product                    stars word       
   <chr>    <chr>                      <dbl> <chr>      
 1 1/12/15  iRobot Roomba 650 for Pets     4 walk       
 2 1/12/15  iRobot Roomba 650 for Pets     4 rest       
# … with 78,866 more rows
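
The filtering join used above can be seen on a toy example. This is a sketch under stated assumptions: `toy_tokens` and `toy_stop` are hypothetical, and `toy_stop` only mimics the `word` and `lexicon` columns of tidytext's `stop_words`. `anti_join()` returns the rows of the left table that have no match in the right table, and it keeps only the left table's columns.

```r
library(dplyr)

# Hypothetical tokenized data: one word per row
toy_tokens <- tibble(word = c("the", "roomba", "and", "it", "clean"))

# Hypothetical stop word list with the same column names as stop_words
toy_stop <- tibble(word = c("the", "and", "it"), lexicon = "toy")

# Keep only the words that do NOT appear in the stop word list
kept <- toy_tokens %>%
  anti_join(toy_stop, by = "word")

kept$word
```

Passing `by = "word"` names the join column explicitly, which avoids the message dplyr prints when it has to guess the shared column.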

Counting words again

tidy_review2 %>% 
  count(word) %>% 
  arrange(desc(n))

# A tibble: 9,672 x 2
   word         n
   <chr>    <int>
 1 roomba    2286
 2 clean     1204
 3 vacuum     989
# … with 9,669 more rows

Let's practice!
