Cleaning and preprocessing text

Text Mining with Bag-of-Words in R

Ted Kwartler

Instructor

Common preprocessing functions

[Figure: common preprocessing functions]
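
As a rough illustration of what functions of this kind do, the sketch below applies a few of them to a single string. The sample text and the words passed to removeWords() are made up for the example; tolower() is base R and the rest come from the tm package. The figure on the slide may list a different set.

```r
library(tm)

# Hypothetical sample text, not taken from the course data
text <- "Coffee at 6 AM:   so GOOD!!"

lowered  <- tolower(text)                      # base R: lowercase everything
no_punct <- removePunctuation(text)            # tm: drop punctuation marks
no_nums  <- removeNumbers(text)                # tm: drop digits
squeezed <- stripWhitespace(text)              # tm: collapse runs of whitespace
no_words <- removeWords(text, c("at", "so"))   # tm: drop the listed words

print(c(lowered, no_punct, no_nums, squeezed, no_words))
```

Each function takes text in and returns modified text, which is why they can be chained or passed to tm_map() when working on a whole corpus.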

Preprocessing in practice

# Load tm for corpus handling and qdap for replace_abbreviation()
library(tm)
library(qdap)

# Make a vector source: coffee_source
coffee_source <- VectorSource(coffee_tweets)

# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)

# Apply various preprocessing functions
# (each tm_map() call returns a new corpus; assign the result to keep it)
tm_map(coffee_corpus, removeNumbers)
tm_map(coffee_corpus, removePunctuation)
tm_map(coffee_corpus, content_transformer(replace_abbreviation))
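
Since tm_map() returns a new corpus each time, one way to combine several steps is to wrap them in a helper that reassigns the corpus after every call. The sketch below is only an illustration: clean_corpus() is a hypothetical helper name, the two tweets are invented stand-ins for coffee_tweets, and the order of the steps is one reasonable choice rather than a fixed recipe. It uses tm only.

```r
library(tm)

# Hypothetical stand-in for coffee_tweets
tweets <- c("Coffee at 6 AM!!!", "2 cups of   coffee, please.")
corpus <- VCorpus(VectorSource(tweets))

# Hypothetical helper: apply each step and keep the result
clean_corpus <- function(corpus) {
  # tolower() is a base function, so it needs content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

cleaned <- clean_corpus(corpus)

# Pull the cleaned text back out of the corpus for inspection
cleaned_text <- unname(vapply(cleaned, content, character(1)))
print(cleaned_text)
```

stripWhitespace runs last here so that gaps left behind by removed numbers and punctuation are collapsed as well.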

Another preprocessing step: word stemming

# Stem words
stem_words <- stemDocument(c("complicatedly", "complicated", "complication"))
stem_words
[1] "complic" "complic" "complic"
# Complete words using single word dictionary
stemCompletion(stem_words, c("complicate"))
     complic      complic      complic 
"complicate" "complicate" "complicate"
# Complete words using entire corpus
stemCompletion(stem_words, my_corpus)
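
stemCompletion() matches one stem per vector element, so a sentence is usually split into words before stemming and completing. The following is a minimal sketch under that assumption; the sentence and the two-word dictionary are invented for the example, and stemDocument() needs the SnowballC package installed.

```r
library(tm)  # stemDocument() relies on SnowballC being installed

# Hypothetical sentence and completion dictionary
sentence   <- "a complicated complication"
dictionary <- c("a", "complicate")

# Split into single words so each stem can be completed on its own
words <- unlist(strsplit(sentence, split = " "))

# Reduce each word to its stem
stems <- stemDocument(words)

# Map each stem back to a dictionary word that starts with it
completed <- stemCompletion(stems, dictionary)
print(completed)
```

Every stem in this example has a match in the dictionary. How unmatched stems are reported can differ between tm versions, so it is worth checking the result when the dictionary is incomplete.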

Let's practice!
