Document term matrices

Introduction to Text Analysis in R

Maham Faisal Khan

Senior Data Science Content Developer

Matrices and sparsity

sparse_review

    Terms
Docs admit ago albeit amazing angle awesome
   4     1   0      1       0     0       0
   5     0   1      0       1     1       0
   3     0   0      0       0     0       1
   2     0   0      0       0     0       0
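A quick way to put a number on sparsity is to count the zero cells. The sketch below rebuilds the small matrix above by hand in base R (the object name small_review is hypothetical, not from the course) and computes the share of zeros.

```r
# Rebuild the small matrix shown above by hand (values copied from the slide);
# `small_review` is a hypothetical name used only for this illustration
small_review <- matrix(
  c(1, 0, 1, 0, 0, 0,
    0, 1, 0, 1, 1, 0,
    0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Docs  = c("4", "5", "3", "2"),
    Terms = c("admit", "ago", "albeit", "amazing", "angle", "awesome")
  )
)

# Sparsity: the share of cells that are zero
n_zero   <- sum(small_review == 0)   # 18 zero cells
n_total  <- length(small_review)     # 24 cells in total
sparsity <- n_zero / n_total         # 0.75
sparsity
```

Even in this tiny example, three quarters of the cells are zero.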

Using cast_dtm()

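library(dplyr)
library(tidytext)
# Count each word per review, then cast the counts into a document-term matrix
tidy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n)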

<<DocumentTermMatrix (documents: 1791, terms: 9669)>>
Non-/sparse entries: 62766/17252622
Sparsity           : 100%
Maximal term length: NA
Weighting          : term frequency (tf)
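The reported sparsity of 100% does not mean every entry is zero. A minimal check, using only the two counts printed above and assuming the summary rounds the percentage to a whole number:

```r
# Counts copied from the printed summary above
non_sparse <- 62766      # non-zero entries
sparse     <- 17252622   # zero entries

# Sparsity as a fraction of all entries
sparsity <- sparse / (non_sparse + sparse)
round(sparsity, 4)       # 0.9964
round(sparsity * 100)    # 100
```

So roughly 99.6% of the entries are zero, which shows up as 100% once rounded.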

Using as.matrix()

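# Cast to a document-term matrix, then convert it to a regular (dense) matrix
dtm_review <- tidy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n) %>%
  as.matrix()

# First four documents, terms 2000 to 2004
dtm_review[1:4, 2000:2004]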

      Terms
Docs   consecutive consensus consequences considerable considerably
  223            0         0            0            0            0
  615            0         0            0            0            0
  1069           0         0            0            0            0
  425            0         0            0            0            0
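To try the same pipeline without the review data, here is a small self-contained sketch. The toy_review data frame is made up for illustration and only stands in for tidy_review; it assumes the dplyr, tidytext, and tm packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for tidy_review:
# one row per word occurrence, with the review id
toy_review <- tibble(
  id   = c(1, 1, 1, 2, 2, 3),
  word = c("great", "vacuum", "great", "vacuum", "loud", "great")
)

# Count words per review, cast to a DTM, then convert to a dense matrix
toy_matrix <- toy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n) %>%
  as.matrix()

# Index by document id and term (both are stored as character names)
toy_matrix["1", "great"]   # "great" appears twice in review 1

# Share of zero cells in the dense matrix
mean(toy_matrix == 0)
```

Indexing by name rather than by position avoids depending on the order in which documents and terms end up in the matrix.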

Let's practice!
