Topic modeling of tweets

Analyzing Social Media Data in R

Vivek Vijayaraghavan

Data Science Coach

Lesson Overview

  • Fundamentals of topic modeling
  • Create a document term matrix (DTM)
  • Build a topic model from the DTM

Topic and Document

  • Topic: a set of words that tend to occur together and point to a common theme, for example "weight", "diet", and "food"
  • Document: a single unit of text in the corpus; here, each tweet is one document

Topic modeling

  • Task of automatically discovering topics
  • Extract core discussion topics from large datasets
  • Quickly summarize vast information into topics

How LDA works

  • Latent Dirichlet Allocation (LDA) is an algorithm for topic modeling

Document term matrix (DTM)

  • Create a document term matrix
  • DTM is a matrix representation of a corpus
  • Documents are rows, terms (words) are columns, and each cell holds the count of that term in that document
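
The row and column layout described above can be seen on a tiny example. This is a minimal sketch, assuming the `tm` package is installed; `toy_tweets`, `toy_corpus`, `toy_dtm`, and `toy_matrix` are hypothetical names used only for illustration and are not part of the course data.

```r
library(tm)

# Three hypothetical mini-tweets; each one becomes a document (row).
toy_tweets <- c("fat food health",
                "weight diet food",
                "health weight weight")

# Build a corpus, then the document term matrix.
toy_corpus <- VCorpus(VectorSource(toy_tweets))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Convert the sparse DTM to an ordinary matrix to view the counts.
toy_matrix <- as.matrix(toy_dtm)
toy_matrix
```

Each row is one mini-tweet and each column is one distinct term, so the third row records a count of 2 under `weight`.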

Create a document term matrix

# Create a document term matrix
library(tm)
dtm <- DocumentTermMatrix(twt_corpus_refined)

# Inspect the DTM
inspect(dtm)

<<DocumentTermMatrix (documents: 1000, terms: 5079)>>
Non-/sparse entries: 12862/5066138
Sparsity           : 100%
Maximal term length: 29
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs    california child diabetes fat food health people ranks rates weight
  131          0     0        0   0    0      0      0     0     0      0
  161          0     0        0   2    0      0      0     0     0      1
  295          0     0        0   0    1      0      1     0     0      0
  418          0     0        0   0    0      0      0     0     1      0
  604          0     0        1   0    0      1      0     0     0      0

Preparing the DTM

  • Filter the DTM for rows that have a row sum greater than 0
  • LDA() requires every document to contain at least one term
# Find the sum of word counts in each document
rowTotals <- apply(dtm, 1, sum)
# Select rows from the DTM with row totals greater than zero
tweet_dtm_new <- dtm[rowTotals > 0, ]
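
The same filter can be written without converting the DTM to a dense matrix. This is a sketch of an alternative, not the course's method: it assumes the `slam` package (installed as a dependency of `tm`) and uses its `row_sums()` function; the toy corpus and the names `toy_tweets`, `toy_dtm`, `row_totals`, and `toy_dtm_nonempty` are hypothetical.

```r
library(tm)
library(slam)

# Hypothetical mini-corpus; the second tweet is empty, as can happen after cleaning.
toy_tweets <- c("fat food health", "", "weight diet food")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_tweets)))

# row_sums() operates on the sparse representation directly.
row_totals <- row_sums(toy_dtm)

# Keep only documents that contain at least one term.
toy_dtm_nonempty <- toy_dtm[row_totals > 0, ]
dim(toy_dtm_nonempty)
```

On a DTM with many documents and terms this may use less memory than `apply()`, which first expands the matrix to dense form.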

Build the topic model

  • Create the topic model using the LDA() function
# Build the topic model
library(topicmodels)
lda_5 <- LDA(tweet_dtm_new, k = 5)
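
Fitting starts from a random initialization, so the topics can differ from run to run. The sketch below shows one way to make a run repeatable by passing a seed through the `control` argument of `LDA()`. The toy corpus, the seed value 1234, and the names `toy_tweets`, `toy_dtm`, `lda_a`, and `lda_b` are assumptions for illustration only.

```r
library(tm)
library(topicmodels)

# Hypothetical mini-corpus with two loose themes (diet and disease).
toy_tweets <- c("diet food weight", "food diet eating", "weight diet food",
                "cancer heart risk", "heart disease risk", "cancer disease heart")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_tweets)))

# Fit the same model twice with the same seed.
lda_a <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
lda_b <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Compare the top 3 terms per topic from both fits.
identical(terms(lda_a, 3), terms(lda_b, 3))
```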

  • Extracted 5 topics from the tweet corpus
# View top 10 terms in the topic model
top_10terms <- terms(lda_5, 10)
top_10terms

View top 10 terms in the topic model

      Topic 1        Topic 2        Topic 3     Topic 4      Topic 5   
 [1,] "disease"      "people"       "black"     "child"      "weight"  
 [2,] "health"       "health"       "fat"       "rates"      "diet"    
 [3,] "cancer"       "diabetes"     "trump"     "ranks"      "food"    
 [4,] "meghanmccain" "overweight"   "childhood" "california" "diabetes"
 [5,] "realcandaceo" "fat"          "health"    "fat"        "health"  
 [6,] "food"         "meghanmccain" "professor" "eat"        "bmi"     
 [7,] "risk"         "realcandaceo" "gender"    "people"     "problem" 
 [8,] "heart"        "body"         "studies"   "epidemic"   "eating"  
 [9,] "weight"       "weight"       "healthy"   "health"     "disease" 
[10,] "diabetes"     "obese"        "problem"   "healthy"    "family"
  • An obesity management program can center its theme on one of these core topics, for example diet and food (Topic 5)
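
Besides the top terms per topic, a fitted model also gives the topic mix of each document, since LDA treats every document as a mixture of topics. The sketch below uses `posterior()` and `topics()` from `topicmodels` on a toy corpus; the data, the seed, and the names `toy_tweets`, `toy_dtm`, `toy_lda`, and `doc_topics` are hypothetical and not taken from the course.

```r
library(tm)
library(topicmodels)

# Hypothetical mini-corpus with two loose themes (diet and disease).
toy_tweets <- c("diet food weight", "food diet eating", "weight diet food",
                "cancer heart risk", "heart disease risk", "cancer disease heart")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_tweets)))
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Per-document topic probabilities: one row per document, one column per topic.
doc_topics <- posterior(toy_lda)$topics
round(doc_topics, 2)

# Most likely topic for each document.
topics(toy_lda)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of a tweet attributed to each topic.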

Let's practice!