Understanding an R corpus

Introduction to Natural Language Processing in R

Kasey Jones

Data Scientist

Corpora

  • Collections of documents containing natural language text
  • Represented in R by the tm package as a corpus object
  • VCorpus (volatile corpus, held in memory) - most common representation
1 https://www.rdocumentation.org/packages/tm/versions/0.7-8/topics/Corpus

Contents of a VCorpus: metadata

library(tm)
data("acq")
acq[[1]]$meta
  author       : character(0)
  datetimestamp: 1987-02-26 15:18:06
  heading      : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
  id           : 10
  language     : en
  origin       : Reuters-21578 XML
  ...          : ...
1 http://www.daviddlewis.com/resources/testcollections/reuters21578/

Contents of a VCorpus: metadata

library(tm)
data("acq")
acq[[1]]$meta$places
[1] "usa"

Contents of a VCorpus: content

acq[[1]]$content
[1] "Computer Terminal Systems Inc said it has completed ...
acq[[2]]$content
[1] "Ohio Mattress Co said its first quarter, ending ...
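
The same pieces can also be read with tm's accessor functions instead of `$`. A minimal sketch, assuming the tm package and its bundled `acq` data are available; the variable names `first_doc`, `doc_places`, and `doc_text` are illustrative only:

```r
library(tm)
data("acq")

# The corpus is a list-like container with one element per document
class(acq)
length(acq)

# Pull out a single document and read it with the accessor functions
first_doc  <- acq[[1]]
doc_places <- meta(first_doc, "places")  # same value as first_doc$meta$places
doc_text   <- content(first_doc)         # same value as first_doc$content
```

Using `meta()` and `content()` avoids depending on how a document is stored internally.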

Tidying a corpus

library(tm)
library(tidytext)
data("acq")
tidy_data <- tidy(acq)
tidy_data
# A tibble: 50 x 16
   author datetimestamp       description heading id    language origin
   <chr>  <dttm>              <chr>       <chr>   <chr> <chr>    <list>
 1 <NA>   1987-02-26 10:18:06 ""          COMPUT… 10    en       <chr …
...
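
The tidied result has one row per document, with the metadata fields as columns and the document body in a `text` column. A short sketch of checking that shape and keeping only the columns needed for text analysis, assuming tm and tidytext are installed; `doc_text` is an illustrative name:

```r
library(tm)
library(tidytext)
data("acq")

tidy_data <- tidy(acq)

# One row per document, one column per metadata field plus `text`
dim(tidy_data)
names(tidy_data)

# Keep just the document id and its text
doc_text <- tidy_data[, c("id", "text")]
```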

Creating a corpus

Create the corpus

corpus <- VCorpus(VectorSource(tidy_data$text))

Add the meta information

meta(corpus, 'Author') <- tidy_data$author
meta(corpus, 'oldid') <- tidy_data$oldid
head(meta(corpus))
  Author oldid
1 <NA>  5553
2 <NA>  5555
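
The same two steps can be tried end to end on a small example. This is a sketch under stated assumptions: `docs` is a hypothetical two-row data frame standing in for `tidy_data`, and its values are invented for illustration. Metadata set this way is stored per document in corpus order, so the vector being assigned needs to line up with the text vector.

```r
library(tm)

# Hypothetical stand-in for tidy_data: one row per document
docs <- data.frame(
  text   = c("First example document.", "Second example document."),
  author = c("A. Writer", "B. Writer"),
  stringsAsFactors = FALSE
)

# Each element of the text vector becomes one document in the corpus
corpus <- VCorpus(VectorSource(docs$text))

# Attach document-level metadata; values must be in document order
meta(corpus, "Author") <- docs$author

corpus_meta <- meta(corpus)          # data frame with one row per document
first_text  <- content(corpus[[1]])  # text of the first document
```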

Let's see this in action.
