Introduction to Natural Language Processing in R
Kasey Jones
Research Data Scientist
NLP:
Topics Covered:
words <- c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")
# Finding Digits
grep("\\d", words)
[1] 1 3 6
# Finding Apostrophes
grep("\\'", words, value = TRUE)
[1] "Mike's Oil" "Joe's Gas"
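The difference between returning positions and returning the matching strings can be sketched as follows. This is a minimal illustration that reuses the `words` vector from the slide; the variable names `digit_idx`, `digit_words`, and `has_apostrophe` are made up for the example, and `grepl()` is shown only as a related base R function, not as something the slides cover.

```r
words <- c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")

# Default (value = FALSE): positions of the elements that contain a digit
digit_idx <- grep("\\d", words)
print(digit_idx)

# value = TRUE: the matching elements themselves
digit_words <- grep("\\d", words, value = TRUE)
print(digit_words)

# grepl() gives one TRUE/FALSE per element, which is convenient for subsetting
has_apostrophe <- grepl("'", words)
print(words[has_apostrophe])
```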
Pattern | Text Matches | R Example | Text Example |
---|---|---|---|
\w | Any single alphanumeric character (or underscore) | gregexpr(pattern = '\\w', <text>) | a |
\d | Any single digit | gregexpr(pattern = '\\d', <text>) | 1 |
\w+ | One or more alphanumeric characters | gregexpr(pattern = '\\w+', <text>) | word |
\d+ | One or more digits | gregexpr(pattern = '\\d+', <text>) | 1234 |
\s | Any whitespace character | gregexpr(pattern = '\\s', <text>) | ' ' |
\S | Any single non-whitespace character | gregexpr(pattern = '\\S', <text>) | w |
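`gregexpr()` on its own returns match positions rather than the matched text, so a common way to see what each pattern in the table captures is to pair it with base R's `regmatches()`. The sketch below assumes a made-up input string, `sentence`, and is only one way to inspect the matches.

```r
# Hypothetical input string, used only for illustration
sentence <- "Order 66 shipped 1234 units"

# gregexpr() finds every match; regmatches() extracts the matched substrings.
# The result is a list with one entry per input string, hence the [[1]].
digits <- regmatches(sentence, gregexpr(pattern = "\\d+", sentence))[[1]]
tokens <- regmatches(sentence, gregexpr(pattern = "\\w+", sentence))[[1]]
spaces <- regmatches(sentence, gregexpr(pattern = "\\s", sentence))[[1]]

print(digits)          # the two runs of digits
print(tokens)          # the five alphanumeric tokens
print(length(spaces))  # number of single whitespace characters
```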
Function | Purpose | Syntax |
---|---|---|
grep | Find the elements of a vector that match the pattern | grep(pattern = '\\w', x = <vector>, value = FALSE) |
gsub | Replace every match of the pattern in a string or vector | gsub(pattern = '\\d+', replacement = "", x = <vector>) |
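As a rough illustration of the two functions in the table, the snippet below applies each one to a small vector. The vector `fuel` and the result names are invented for the example; the patterns are the ones from the Syntax column.

```r
# Hypothetical vector, used only for illustration
fuel <- c("DW-40", "5w30", "Unleaded", "Plus-89")

# grep(): every element contains at least one alphanumeric character,
# so all four positions are returned
word_idx <- grep(pattern = "\\w", x = fuel, value = FALSE)
print(word_idx)

# gsub(): delete every run of digits from each element
no_digits <- gsub(pattern = "\\d+", replacement = "", x = fuel)
print(no_digits)
```

Note that the backslash is doubled inside the R string: a single `\d` in an R string literal is rejected as an unrecognized escape, so the regex `\d+` is written `"\\d+"`.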