Analyzing Social Media Data in R
Vivek Vijayaraghavan
Data Science Coach
# Extract 1000 tweets on "Obesity" in English and exclude retweets
library(rtweet)
tweets_df <- search_tweets("Obesity", n = 1000, include_rts = FALSE, lang = "en")
# Extract the tweet texts and save them in a character vector
twt_txt <- tweets_df$text
head(twt_txt, 3)
[1] "@WeeaUwU for real, obesity should not be praised like it is in today's society"
[2] "Great work by @DosingMatters in @AJHPOfficial on \"Vancomycin Vd estimation in
adults with class III obesity\". As we continue to study/learn more about dosing in
large body weight pts, we see that it's not a simple, one size, one level estimate
that works https://t.co/KkYPqS6JzG"
[3] "The Scottish Government have an ambition to halve childhood obesity by 2030.
This means reducing obesity prevalence in 2-15yo children in Scotland to 7%.
\n\n\U0001f449 In 2018, this figure was 16%\n\nFind out more in our latest blog:
https://t.co/FWp56QWjQc https://t.co/XBK8Je7F1A"
# Remove URLs from the tweet text
library(qdapRegex)
twt_txt_url <- rm_twitter_url(twt_txt)
twt_txt_url[1:3]
[1] "@WeeaUwU for real, obesity should not be praised like it is in today's society"
[2] "Great work by @DosingMatters in @AJHPOfficial on \"Vancomycin Vd estimation in adults
with class III obesity\". As we continue to study/learn more about dosing in large body
weight pts, we see that it's not a simple, one size, one level estimate that works"
[3] "The Scottish Government have an ambition to halve childhood obesity by 2030.
This means reducing obesity prevalence in 2-15yo children in Scotland to 7%.
\U0001f449In 2018, this figure was 16% Find out more in our latest blog:"
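For readers without qdapRegex, a rough base R equivalent can be sketched with gsub(). The pattern below is an assumption made for illustration and is not the expression rm_twitter_url() uses internally; it simply drops any http or https token up to the next whitespace. The sample vector is hypothetical.

```r
# Hypothetical sample texts; not taken from the extracted tweets
sample_txt <- c(
  "Find out more in our latest blog: https://t.co/abc123 https://t.co/xyz789",
  "No link in this one"
)

# Drop each http(s) token together with any whitespace in front of it
# (assumed pattern, simpler than what qdapRegex applies)
no_url <- gsub("\\s*https?://\\S+", "", sample_txt, perl = TRUE)

# Trim any whitespace left at either end
no_url <- trimws(no_url)
print(no_url)
```

Texts without a link pass through unchanged, so the call is safe to apply to the whole vector.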
# Remove special characters, punctuation & numbers
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
twt_txt_chrs[1:3]
[1] " WeeaUwU for real obesity should not be praised like it is in today s society"
[2] "Great work by DosingMatters in AJHPOfficial on Vancomycin Vd estimation in
adults with class III obesity As we continue to study learn more about dosing in
large body weight pts we see that it s not a simple one size one level estimate
that works"
[3] "The Scottish Government have an ambition to halve childhood obesity by This
means reducing obesity prevalence in yo children in Scotland to In this
figure was Find out more in our latest blog "
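To see what the character class does, here is a small sketch on a hypothetical string. Every character outside A-Z and a-z, including digits, apostrophes, and percent signs, is replaced by exactly one space. That is why "today's" becomes "today s" and figures such as 2030 and 7% vanish from the output above.

```r
# Hypothetical input mixing letters, punctuation, and numbers
x <- "It's 16% in 2018!"

# Replace every non-letter with a single space (one-for-one substitution)
cleaned <- gsub("[^A-Za-z]", " ", x)
print(cleaned)

# The string length is unchanged because each removed character
# is swapped for one space rather than deleted
print(nchar(cleaned) == nchar(x))

# Split on runs of spaces to see which words survive
words <- strsplit(trimws(cleaned), " +")[[1]]
print(words)
```

Because the substitution is one-for-one, runs of spaces accumulate wherever numbers or punctuation stood; these are cleaned up later by the whitespace step.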
# Convert to text corpus
library(tm)
library(magrittr)  # provides the %>% pipe
twt_corpus <- twt_txt_chrs %>%
  VectorSource() %>%
  Corpus()
twt_corpus[[3]]$content
[1] "The Scottish Government have an ambition to halve childhood obesity by
This means reducing obesity prevalence in yo children in Scotland to In
this figure was Find out more in our latest blog "
# Convert text corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, content_transformer(tolower))
twt_corpus_lwr[[3]]$content
[1] "the scottish government have an ambition to halve childhood obesity by this
means reducing obesity prevalence in yo children in scotland to in this
figure was find out more in our latest blog "
# Common stop words in English
stopwords("english")
# Remove stop words from corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("english"))
twt_corpus_stpwd[[3]]$content
[1] " scottish government ambition halve childhood obesity means
reducing obesity prevalence yo children scotland figure
find latest blog "
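The gaps left in the output above come from whole words being deleted while the spaces around them stay in place. The idea can be sketched in base R on a plain string. This is an illustration under stated assumptions, not tm's implementation: stop_list is a hypothetical mini list, far shorter than stopwords("english"), and the pattern is one possible way to match whole words only.

```r
# Hypothetical mini stop list; stopwords("english") is much longer
stop_list <- c("the", "have", "an", "to", "by", "this", "in", "was")

txt <- "the scottish government have an ambition to halve childhood obesity"

# One alternation anchored on word boundaries, so only whole words match
# ("an" must not be cut out of "ambition")
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")
no_stop <- gsub(pattern, "", txt, perl = TRUE)
print(no_stop)

# Surviving words, ignoring the leftover spaces
kept <- strsplit(trimws(no_stop), " +")[[1]]
print(kept)
```

The removed words leave their surrounding spaces behind, which is why a whitespace clean-up normally follows stop-word removal.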
# Remove additional spaces
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
twt_corpus_final[[3]]$content
[1] " scottish government ambition halve childhood obesity means reducing obesity
prevalence yo children scotland figure find latest blog "
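The output above still begins and ends with a single space, because stripWhitespace() collapses runs of whitespace into one space rather than trimming the ends. If that matters downstream, one possible follow-up is base R's trimws(). The sketch below works on a plain character vector with a hypothetical value rather than on the corpus itself.

```r
# Hypothetical text with irregular spacing, similar to the corpus content
x <- " scottish government ambition  halve   childhood obesity "

# Collapse every run of whitespace into one space
collapsed <- gsub("\\s+", " ", x, perl = TRUE)

# Remove the single leading and trailing space that remains
trimmed <- trimws(collapsed)
print(trimmed)
```

For word-frequency work the stray edge spaces are usually harmless, so this step is optional.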