Regular expression basics

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

What is natural language processing?

NLP:

  • Focuses on using computers to analyze and understand text

Topics Covered:

  • Classifying Text
  • Topic Modeling
  • Named Entity Recognition
  • Sentiment Analysis

What are regular expressions?

  • A sequence of characters used to search text
  • Examples include:
    • searching files in a directory using the command line
    • finding articles that contain a specific pattern
    • replacing specific text
    • ...
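
Two of the uses listed above, finding text that contains a pattern and replacing text, can be sketched in base R. This is a minimal illustration: the `articles` vector and its titles are made up for the example and do not come from the course data.

```r
# Hypothetical article titles, invented for illustration
articles <- c("Mining text with R", "Cooking at home", "A textbook example")

# Find which titles contain the pattern "text"
has_text <- grepl("text", articles)
has_text
# [1]  TRUE FALSE  TRUE

# Replace the first occurrence of "text" in each title
shouted <- sub("text", "TEXT", articles)
shouted
# [1] "Mining TEXT with R" "Cooking at home"    "A TEXTbook example"
```

Note that a plain pattern such as "text" also matches inside longer words like "textbook".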

Examples

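words <- c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")
# Finding Digits (value = FALSE returns the positions of the matching elements)
grep("\\d", words, value = FALSE)
[1] 1 3 6
# Finding Apostrophes (value = TRUE returns the matching elements themselves)
grep("'", words, value = TRUE)
[1] "Mike's Oil" "Joe's Gas"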

Regular Expression Examples

Pattern  Text Matches                                    R Example                          Text Example
\w       Any alphanumeric character (or underscore)      gregexpr(pattern = '\\w', text)    a
\d       Any digit                                       gregexpr(pattern = '\\d', text)    1
\w+      One or more alphanumeric characters             gregexpr(pattern = '\\w+', text)   word
\d+      One or more digits                              gregexpr(pattern = '\\d+', text)   1234
\s       Any whitespace character (space, tab, newline)  gregexpr(pattern = '\\s', text)    ' '
\S       Any non-whitespace character                    gregexpr(pattern = '\\S', text)    w

In R string literals the backslash must be doubled, so the pattern \w is written '\\w'.
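
The R Example column calls gregexpr(), which returns match positions, not the matched text. A minimal sketch of how those positions can be turned into text with base R's regmatches() follows. The sample sentence is an assumption made up for illustration.

```r
# Hypothetical sample sentence, invented for illustration
text <- "Order 66 shipped 12 boxes"

# gregexpr() finds the position of every match of the pattern
digit_hits <- gregexpr(pattern = "\\d+", text)
# regmatches() extracts the matched substrings (one list element per input string)
numbers <- regmatches(text, digit_hits)[[1]]
numbers
# [1] "66" "12"

# The same two steps with \w+ split the sentence into alphanumeric tokens
tokens <- regmatches(text, gregexpr(pattern = "\\w+", text))[[1]]
tokens
# [1] "Order"   "66"      "shipped" "12"      "boxes"
```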

R Examples

Function  Purpose                                                    Syntax
grep      Finds the elements of a vector that match the pattern      grep(pattern = '\\w', x = <vector>, value = FALSE)
gsub      Replaces all matches of the pattern in a string or vector  gsub(pattern = '\\d+', replacement = "", x = <vector>)
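
As a rough sketch of how the two functions behave, the calls below reuse the `words` vector from the earlier Examples slide. The variable names `idx`, `matched`, and `no_digits` are introduced here only for illustration.

```r
words <- c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")

# grep() with the default value = FALSE returns the positions of matching elements
idx <- grep(pattern = "\\d+", x = words)
idx
# [1] 1 3 6

# With value = TRUE it returns the matching elements themselves
matched <- grep(pattern = "\\d+", x = words, value = TRUE)
matched
# [1] "DW-40"   "5w30"    "Plus-89"

# gsub() replaces every match; here each run of digits is removed
no_digits <- gsub(pattern = "\\d+", replacement = "", x = words)
no_digits
# [1] "DW-"        "Mike's Oil" "w"          "Joe's Gas"  "Unleaded"   "Plus-"
```

Elements with no match, such as "Unleaded", pass through gsub() unchanged.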

RegEx Practice

https://regexone.com/lesson/matching_characters

Time to code!
