Introduction to XPath

Web Scraping in R

Timo Grossenbacher

Instructor

XML Path Language

  • XPath formulates a path through an HTML tree, e.g. //div/p[@class = "blue"] (CSS equivalent: div > p.blue, as long as "blue" is the element's only class)
  • Select nodes based on properties of other nodes
  • More advanced and customized selections are possible
  • For example: select elements based on properties of their children, e.g. only div elements that contain a elements with a special class
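
The last bullet can be sketched as follows. This is a minimal example, assuming the rvest package is installed; the sample document, its id values, and the class name "special" are invented for illustration.

```r
library(rvest)

# Hypothetical sample document (not from the course material)
html <- minimal_html('
  <div id="first"><a class="special" href="#">special link</a></div>
  <div id="second"><a href="#">plain link</a></div>
  <div id="third"><p>no link here</p></div>
')

# Select only div elements that have an a child whose class is "special"
special_divs <- html %>%
    html_elements(xpath = '//div[a[@class = "special"]]')

# Inspect which divs were matched via their id attribute
special_divs %>% html_attr("id")
```

Note that the predicate filters the div elements; the a elements themselves are not returned.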

A simple HTML tree where all p elements are selected

html %>%
    html_elements(xpath = '//p')
# CSS selector equivalent: p
html %>%
    html_elements(xpath = '//body//p')
# CSS selector equivalent: body p
html %>%
    html_elements(xpath = '/html/body//p')
# CSS selector equivalent: html > body p
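
To check the equivalences above on a concrete page, the following sketch compares the number of matches for each XPath with its CSS counterpart. It assumes rvest is installed; the sample markup is hypothetical.

```r
library(rvest)

# Hypothetical sample page, invented for illustration
html <- minimal_html('
  <div><p>first</p><p>second</p></div>
  <p>third</p>
')

# Number of matches for each XPath expression
n_xpath <- c(
    length(html_elements(html, xpath = '//p')),
    length(html_elements(html, xpath = '//body//p')),
    length(html_elements(html, xpath = '/html/body//p'))
)

# Number of matches for the corresponding CSS selectors
n_css <- c(
    length(html_elements(html, css = 'p')),
    length(html_elements(html, css = 'body p')),
    length(html_elements(html, css = 'html > body p'))
)

# Both vectors should agree element by element
identical(n_xpath, n_css)
```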

A simple HTML tree where only p elements that are direct children of div elements are selected

html %>%
    html_elements(xpath = '//div/p')
# CSS selector equivalent: div > p
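
The difference between / and // after div may be easier to see side by side. A small sketch, assuming rvest is installed and using made-up markup:

```r
library(rvest)

# Hypothetical sample page: one p directly inside a div,
# one p nested one level deeper, and one p outside any div
html <- minimal_html('
  <div>
    <p>direct child</p>
    <blockquote><p>nested deeper</p></blockquote>
  </div>
  <p>outside any div</p>
')

# Only p elements that are direct children of a div
direct <- html %>% html_elements(xpath = '//div/p')

# All p elements anywhere below a div
nested <- html %>% html_elements(xpath = '//div//p')

direct %>% html_text()
length(nested)
```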

A simple HTML tree where only div elements that have an a element as a child are selected

html %>%
    html_elements(xpath = '//div[a]')
# CSS selector equivalent: none
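
A runnable sketch of this predicate, assuming rvest is installed; the markup and id values are hypothetical. It also contrasts //div[a], which returns the div elements, with //div/a, which returns the a elements.

```r
library(rvest)

# Hypothetical sample page, invented for illustration
html <- minimal_html('
  <div id="with-link"><a href="#">a link</a></div>
  <div id="without-link"><p>just text</p></div>
')

# The predicate [a] keeps only divs that have an a child
divs_with_a <- html %>% html_elements(xpath = '//div[a]')
divs_with_a %>% html_attr("id")

# For contrast: this path steps into the a elements themselves
links <- html %>% html_elements(xpath = '//div/a')
links %>% html_name()
```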

Syntax: axes, steps, and predicates

  • Axes: / (direct children) or // (descendants at any depth)
  • Steps: HTML element types like span and a
  • Predicates: [...]
  • Example: //span/a[@class = "external"] (CSS: span > a.external)
  • Example: //*[@id = "special"]//div (CSS: #special div or *#special div)
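
Both example expressions can be tried on a small page. This is a sketch under stated assumptions: rvest is installed, and the markup, class names, and URLs are invented for illustration.

```r
library(rvest)

# Hypothetical sample page, invented for illustration
html <- minimal_html('
  <span><a class="external" href="https://example.com">external</a></span>
  <span><a class="internal" href="/home">internal</a></span>
  <ul id="special"><li><div>inside special</div></li></ul>
  <div>outside special</div>
')

# Axis /, steps span and a, predicate on the class attribute
external <- html %>%
    html_elements(xpath = '//span/a[@class = "external"]')
external %>% html_attr("href")

# Wildcard step * with an id predicate, then any div below it
inside <- html %>%
    html_elements(xpath = '//*[@id = "special"]//div')
inside %>% html_text()
```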

Let's practice!
