Analyzing Social Media Data in R
Vivek Vijayaraghavan
Data Science Coach
# Create a document term matrix
dtm <- DocumentTermMatrix(twt_corpus_refined)
# Inspect the DTM
inspect(dtm)
<<DocumentTermMatrix (documents: 1000, terms: 5079)>>
Non-/sparse entries: 12862/5066138
Sparsity           : 100%
Maximal term length: 29
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs  california child diabetes fat food health people ranks rates weight
  131          0     0        0   0    0      0      0     0     0      0
  161          0     0        0   2    0      0      0     0     0      1
  295          0     0        0   0    1      0      1     0     0      0
  418          0     0        0   0    0      0      0     0     1      0
  604          0     0        1   0    0      1      0     0     0      0
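The "Sparsity : 100%" line is a rounded figure. As a minimal sketch, the arithmetic can be reproduced in base R from the counts printed above; the variable names are illustrative only.

```r
# Counts copied from the inspect() output above
n_docs     <- 1000
n_terms    <- 5079
non_sparse <- 12862     # cells with a count greater than zero
sparse     <- 5066138   # cells equal to zero

# The two entry counts together cover every cell of the matrix
total_cells <- n_docs * n_terms
stopifnot(non_sparse + sparse == total_cells)

# Share of zero cells, as a percentage
sparsity_pct <- 100 * sparse / total_cells
round(sparsity_pct, 2)  # 99.75
round(sparsity_pct)     # 100, consistent with the figure shown by inspect()
```

So fewer than 0.3% of the document-term pairs hold a non-zero count, which is typical for short texts such as tweets.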
# Find the sum of word counts in each document
rowTotals <- apply(dtm, 1, sum)
# Keep only documents with at least one term; LDA() requires non-empty rows
tweet_dtm_new <- dtm[rowTotals > 0, ]
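The filtering step is easier to see on a small example. The sketch below applies the same row-total logic to a hypothetical 4 x 3 count matrix in base R; the matrix `toy` and its values are invented for illustration and are not the tweet data.

```r
# Hypothetical toy counts: 4 documents x 3 terms (not the tweet data)
toy <- matrix(c(0, 2, 1,
                0, 0, 0,   # empty document
                1, 0, 0,
                0, 0, 0),  # empty document
              nrow = 4, byrow = TRUE,
              dimnames = list(Docs  = c("1", "2", "3", "4"),
                              Terms = c("fat", "food", "health")))

# Sum of word counts per document (row)
row_totals <- apply(toy, 1, sum)

# Keep only documents that contain at least one term
toy_new <- toy[row_totals > 0, , drop = FALSE]
rownames(toy_new)  # "1" "3"
```

On a large DTM, `slam::row_sums(dtm)` should give the same totals while working on the sparse representation directly, which may be lighter on memory than `apply()`.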
LDA() function
# Build the topic model
library(topicmodels)
lda_5 <- LDA(tweet_dtm_new, k = 5)
# View top 10 terms in the topic model
top_10terms <- terms(lda_5, 10)
top_10terms
      Topic 1        Topic 2        Topic 3     Topic 4      Topic 5
 [1,] "disease"      "people"       "black"     "child"      "weight"
 [2,] "health"       "health"       "fat"       "rates"      "diet"
 [3,] "cancer"       "diabetes"     "trump"     "ranks"      "food"
 [4,] "meghanmccain" "overweight"   "childhood" "california" "diabetes"
 [5,] "realcandaceo" "fat"          "health"    "fat"        "health"
 [6,] "food"         "meghanmccain" "professor" "eat"        "bmi"
 [7,] "risk"         "realcandaceo" "gender"    "people"     "problem"
 [8,] "heart"        "body"         "studies"   "epidemic"   "eating"
 [9,] "weight"       "weight"       "healthy"   "health"     "disease"
[10,] "diabetes"     "obese"        "problem"   "healthy"    "family"
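Conceptually, the top terms of a topic are the terms with the highest probability within that topic. The sketch below illustrates the ranking on a hypothetical two-topic probability matrix in base R; `beta` and `top_terms()` are made-up names and the numbers are invented, not taken from `lda_5`.

```r
# Hypothetical term probabilities per topic (each row sums to 1)
beta <- matrix(c(0.50, 0.30, 0.15, 0.05,
                 0.10, 0.05, 0.25, 0.60),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Topic 1", "Topic 2"),
                               c("health", "disease", "diet", "weight")))

# Return the n most probable terms of every topic, one column per topic
top_terms <- function(beta, n) {
  apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

top_terms(beta, 2)
```

For a fitted model, a comparable topic-by-term probability matrix should be available as `posterior(lda_5)$terms` in topicmodels, so the same ranking idea applies there.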