Topic modeling of tweets

Analyzing Social Media Data in R

Vivek Vijayaraghavan

Data Science Coach

Lesson Overview

Fundamentals of topic modeling
Create a document term matrix (DTM)
Build a topic model from the DTM

Topic and Document

Topic definition and example

Document definition and example

Topic modeling

Task of automatically discovering topics
Extract core discussion topics from large datasets
Quickly summarize vast information into topics

How LDA works

Latent Dirichlet Allocation algorithm for topic modeling
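
The idea behind LDA can be sketched as a generative story: each topic is a probability distribution over words, each document is a mixture of topics, and every word in a document is drawn by first picking a topic and then picking a word from that topic. The base R simulation below is only an illustration of that story under assumed settings; the vocabulary, the number of topics, and the Dirichlet parameter 0.5 are made up for the example, and `rdirichlet1()` is a hypothetical helper, not part of any package.

```r
set.seed(42)

# Hypothetical helper: one draw from a Dirichlet distribution,
# obtained by normalizing independent gamma draws
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Assumed toy vocabulary and number of topics
vocab <- c("diet", "food", "weight", "heart", "cancer", "risk")
k <- 2

# Each topic (row) is a probability distribution over the vocabulary
topic_word <- t(replicate(k, rdirichlet1(rep(0.5, length(vocab)))))
colnames(topic_word) <- vocab

# Each document has its own mixture of topics
doc_topic <- rdirichlet1(rep(0.5, k))

# Generate a 10-word document: pick a topic for each word slot,
# then draw the word from that topic's distribution
n_words <- 10
z <- sample(k, n_words, replace = TRUE, prob = doc_topic)
words <- vapply(
  z,
  function(topic) sample(vocab, 1, prob = topic_word[topic, ]),
  character(1)
)
words
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions.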

Document term matrix (DTM)

Create a document term matrix
DTM is a matrix representation of a corpus
Documents are rows and words or terms are columns
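
To make the row and column layout concrete, here is a minimal base R sketch that builds a tiny DTM by hand. The three "tweets" are hypothetical and assumed to be already cleaned; in practice a DTM is built from a corpus with a package function rather than like this.

```r
# Three toy "tweets" (hypothetical data, already lower-cased and cleaned)
docs <- c("fat food health", "food people food", "health people")

# Split each document into its words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary becomes the columns of the matrix
vocab <- sort(unique(unlist(tokens)))

# Count how often each term occurs in each document
toy_dtm <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

toy_dtm
```

Each cell holds the number of times the column's term appears in the row's document, which is the term frequency weighting.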

Create a document term matrix

# Create a document term matrix from the cleaned tweet corpus
library(tm)
dtm <- DocumentTermMatrix(twt_corpus_refined)

# Inspect the DTM
inspect(dtm)

<<DocumentTermMatrix (documents: 1000, terms: 5079)>>
Non-/sparse entries: 12862/5066138
Sparsity           : 100%
Maximal term length: 29
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs    california child diabetes fat food health people ranks rates weight
  131          0     0        0   0    0      0      0     0     0      0
  161          0     0        0   2    0      0      0     0     0      1
  295          0     0        0   0    1      0      1     0     0      0
  418          0     0        0   0    0      0      0     0     1      0
  604          0     0        1   0    0      1      0     0     0      0

Preparing the DTM

Keep only the DTM rows whose row sum is greater than 0, because LDA() requires every document to contain at least one term

# Find the sum of word counts in each document
rowTotals <- apply(dtm, 1, sum)

# Select rows from DTM with row totals greater than zero
tweet_dtm_new <- dtm[rowTotals > 0, ]
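
The effect of this filter is easy to check on a small example. The sketch below uses a plain base R matrix with made-up counts as a stand-in for a DTM; the object names are hypothetical.

```r
# Toy term counts (hypothetical): document "2" lost all its terms during cleaning
m <- matrix(
  c(0, 2, 1,
    0, 0, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"), Terms = c("fat", "food", "weight"))
)

# Total word count per document
row_totals <- rowSums(m)

# Keep only documents that still contain at least one term
m_nonempty <- m[row_totals > 0, , drop = FALSE]

rownames(m_nonempty)
```

Documents can end up empty when cleaning steps such as stop word removal strip every word from a short tweet.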

Build the topic model

Create the topic model using the LDA() function

# Build the topic model
library(topicmodels)
lda_5 <- LDA(tweet_dtm_new, k = 5)

Extracted 5 topics from the tweet corpus

# View top 10 terms in the topic model
top_10terms <- terms(lda_5, 10)
top_10terms

View top 10 terms in the topic model

     Topic 1        Topic 2        Topic 3     Topic 4      Topic 5   
 [1,] "disease"      "people"       "black"     "child"      "weight"  
 [2,] "health"       "health"       "fat"       "rates"      "diet"    
 [3,] "cancer"       "diabetes"     "trump"     "ranks"      "food"    
 [4,] "meghanmccain" "overweight"   "childhood" "california" "diabetes"
 [5,] "realcandaceo" "fat"          "health"    "fat"        "health"  
 [6,] "food"         "meghanmccain" "professor" "eat"        "bmi"     
 [7,] "risk"         "realcandaceo" "gender"    "people"     "problem" 
 [8,] "heart"        "body"         "studies"   "epidemic"   "eating"  
 [9,] "weight"       "weight"       "healthy"   "health"     "disease" 
[10,] "diabetes"     "obese"        "problem"   "healthy"    "family"

An obesity management program can center its theme around one of these core topics, for example weight, diet and food (Topic 5)
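
To go from topics back to individual tweets, the fitted model can also report which topic each document most likely belongs to. The sketch below is a self-contained illustration on a hypothetical four-tweet corpus, not the course data; `toy_tweets`, `k = 2`, and the seed value 123 are assumptions made for the example. Setting a seed is optional but keeps the result the same from run to run.

```r
library(tm)
library(topicmodels)

# Hypothetical toy tweets, already cleaned
toy_tweets <- c(
  "diet food weight diet eating",
  "food diet weight eating food",
  "heart disease cancer risk heart",
  "cancer risk disease heart cancer"
)

# Build the DTM and fit a 2-topic model with a fixed seed
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_tweets)))
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 123))

# Most likely topic for each document
doc_topics <- topics(toy_lda)

# Full document-topic probabilities (one row per document)
doc_topic_probs <- posterior(toy_lda)$topics

doc_topics
round(doc_topic_probs, 2)
```

With a corpus this small the fitted topics are not meaningful; the point is only the shape of what `topics()` and `posterior()` return.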

Let's practice!
