Text as data

Introduction to Text Analysis in R

Maham Faisal Khan

Senior Data Science Content Developer, DataCamp

Using the tidyverse

Loading packages

library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0     v purrr   0.2.5
v tibble  2.0.0     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.1.1     v forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Importing review data

review_data <- read_csv("Roomba Reviews.csv")

review_data
# A tibble: 1,833 x 4
   date     product               stars review
   <chr>    <chr>                 <dbl> <chr>
 1 2/28/15  iRobot Roomba 650 fo…     5 You would not believe how well...
 2 1/12/15  iRobot Roomba 650 fo…     4 You just walk away and it does... 
 3 12/26/13 iRobot Roomba 650 fo…     5 You have to Roomba proof your...
 4 8/4/13   iRobot Roomba 650 fo…     3 Yes, its a fascinating, albeit...
# … with 1,829 more rows

Using filter() and summarize()

review_data %>% 
  filter(product == "iRobot Roomba 650 for Pets") %>% 
  summarize(stars_mean = mean(stars))
# A tibble: 1 x 1
  stars_mean
       <dbl>
1       4.49

Using group_by() and summarize()

review_data %>% 
  group_by(product) %>% 
  summarize(stars_mean = mean(stars))
# A tibble: 2 x 2
  product                                  stars_mean
  <chr>                                         <dbl>
1 iRobot Roomba 650 for Pets                     4.49
2 iRobot Roomba 880 for Pets and Allergies       4.42
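
The filter() and group_by() patterns above can be tried without the CSV file. The sketch below uses a small hypothetical tibble, toy_reviews, whose rows are invented for illustration and are not taken from the Roomba data. It also adds n() to count the reviews per product, which is not shown on the slides.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for review_data; all values are made up
toy_reviews <- tibble(
  product = c("A", "A", "B", "B"),
  stars   = c(5, 4, 3, 5),
  review  = c("Works well", "Good value", "Too loud", "Love it")
)

# Mean rating for a single product: keep its rows, then collapse to one value
toy_single <- toy_reviews %>%
  filter(product == "A") %>%
  summarize(stars_mean = mean(stars))

# Mean rating and number of reviews for every product at once
toy_summary <- toy_reviews %>%
  group_by(product) %>%
  summarize(stars_mean = mean(stars), n_reviews = n())

toy_summary
```

filter() answers the question for one product at a time, while group_by() returns one row per product in a single call, so it scales better as the number of products grows.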

Unstructured data

review_data %>% 
  group_by(product) %>% 
  summarize(review_mean = mean(review))
Warning messages:
1: In mean.default(review) :
  argument is not numeric or logical: returning NA
2: In mean.default(review) :
  argument is not numeric or logical: returning NA
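
The warnings above appear because review is a character column, and mean() is only defined for numeric or logical input. The sketch below reproduces this on a hypothetical tibble (toy_reviews, with invented rows) and then derives one simple numeric feature from the text, the character count, purely to illustrate that text has to be turned into numbers before it can be summarized. The character count is an assumption chosen for brevity, not a method taken from the slides.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for review_data; all values are made up
toy_reviews <- tibble(
  product = c("A", "A", "B", "B"),
  stars   = c(5, 4, 3, 5),
  review  = c("Works well", "Good value", "Too loud", "Love it")
)

# Inspect the column classes: stars is numeric, review is character
col_classes <- sapply(toy_reviews, class)

# mean() on a character vector returns NA and raises the warning shown above;
# the warning is suppressed here only to keep the output quiet
review_mean <- suppressWarnings(mean(toy_reviews$review))

# Turn the text into a number first (characters per review), then summarize
toy_lengths <- toy_reviews %>%
  mutate(review_chars = nchar(review)) %>%
  group_by(product) %>%
  summarize(chars_mean = mean(review_chars))

toy_lengths
```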

Let's practice!
