Introduction to Natural Language Processing in R
Kasey Jones
Data Scientist
The tm package stores a collection of documents as a corpus
VCorpus is the most common corpus representation
library(tm)
data("acq")
acq[[1]]$meta
author : character(0)
datetimestamp: 1987-02-26 15:18:06
heading : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
id : 10
language : en
origin : Reuters-21578 XML
... : ...
library(tm)
data("acq")
acq[[1]]$meta$places
[1] "usa"
acq[[1]]$content
[1] "Computer Terminal Systems Inc said it has completed ...
acq[[2]]$content
[1] "Ohio Mattress Co said its first quarter, ending ...
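The `$meta` and `$content` lookups above can also be written with tm's accessor functions `meta()` and `content()`. The following is a minimal sketch, assuming the `acq` data set that ships with tm; the variable name `doc` is only illustrative.

```r
library(tm)

# Load the 50 Reuters acquisition articles bundled with tm
data("acq")

# Pull out the first document of the corpus
doc <- acq[[1]]

# meta() with a tag returns one metadata field of the document
doc_places <- meta(doc, "places")
doc_heading <- meta(doc, "heading")

# content() returns the raw text; show only the first 60 characters
substr(content(doc), 1, 60)

# Number of documents in the corpus
length(acq)
```

The accessor form reads the same fields as `acq[[1]]$meta$places` and `acq[[1]]$content`, without depending on how the document object is laid out internally.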
library(tm)
library(tidytext)
data("acq")
tidy_data <- tidy(acq)
tidy_data
# A tibble: 50 x 16
author datetimestamp description heading id language origin
<chr> <dttm> <chr> <chr> <chr> <chr> <list>
1 <NA> 1987-02-26 10:18:06 "" COMPUT… 10 en <chr …
...
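Once the corpus is tidied, each document is one row and each metadata field is one column, so ordinary data frame tools apply. The sketch below assumes the same `acq` data and only inspects the result; the name `small` is illustrative.

```r
library(tm)
library(tidytext)

data("acq")

# One row per document, one column per metadata field plus the text
tidy_data <- tidy(acq)

# Dimensions and column names of the tidy tibble
dim(tidy_data)
names(tidy_data)

# Keep only a few columns for a quick look at the documents
small <- tidy_data[, c("id", "heading", "text")]
head(small, 3)
```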
# Create the corpus
corpus <- VCorpus(VectorSource(tidy_data$text))
# Add the meta information
meta(corpus, 'Author') <- tidy_data$author
meta(corpus, 'oldid') <- tidy_data$oldid
head(meta(corpus))
Author oldid
1 <NA> 5553
2 <NA> 5555
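To check that the rebuilt corpus lines up with the tidy data, the steps above can be run end to end and compared. This is a sketch under the assumption that `tidy_data` comes from `tidy(acq)` as shown earlier; it is a sanity check, not a required part of the workflow.

```r
library(tm)
library(tidytext)

data("acq")
tidy_data <- tidy(acq)

# Rebuild a corpus from the text column: one document per element
corpus <- VCorpus(VectorSource(tidy_data$text))

# Attach document-level metadata, one value per document
meta(corpus, "Author") <- tidy_data$author
meta(corpus, "oldid") <- tidy_data$oldid

# The corpus should hold as many documents as the tibble has rows
length(corpus) == nrow(tidy_data)

# The metadata table has one row per document and the two new columns
corpus_meta <- meta(corpus)
names(corpus_meta)
nrow(corpus_meta)
```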