Introduction to XPath

Web Scraping in R

Timo Grossenbacher

Instructor

XML Path Language

A path through the HTML tree is written as an expression, e.g. //div/p[@class = "blue"] (roughly equivalent to the CSS selector div > p.blue)
Nodes can be selected based on properties of other nodes
This allows more advanced and customized selections than CSS selectors
For example: select elements based on properties of their children, e.g. only div elements that contain an a node with a specific class
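
As a minimal sketch of the first point, the snippet below runs the XPath expression and its CSS counterpart against a small document. The HTML and the variable names are made up for illustration and are not part of the course material; `minimal_html()` is assumed to be available (rvest 1.0.0 or later).

```r
library(rvest)

# Hypothetical example document, invented for this sketch
html <- minimal_html('
  <div>
    <p class="blue">First</p>
    <p>Second</p>
  </div>
  <p class="blue">Third</p>
')

# XPath: p elements with class "blue" that are direct children of a div
via_xpath <- html %>%
    html_elements(xpath = '//div/p[@class = "blue"]') %>%
    html_text()

# CSS: the corresponding selector
via_css <- html %>%
    html_elements(css = 'div > p.blue') %>%
    html_text()
```

Both calls should pick only the first paragraph here. Note that `@class = "blue"` compares the whole attribute value, so an element with `class="blue big"` would match the CSS selector but not this XPath expression.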

A simple HTML tree where all p elements are selected

html %>%
    html_elements(xpath = '//p')
# CSS selector equivalent: p

html %>%
    html_elements(xpath = '//body//p')
# CSS selector equivalent: body p

html %>%
    html_elements(xpath = '/html/body//p')
# CSS selector equivalent: html > body p
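
To see that the three expressions above can select the same nodes, here is a small sketch on a made-up document (the HTML and variable names are assumptions for illustration only):

```r
library(rvest)

# Hypothetical example document, invented for this sketch
html <- minimal_html('
  <div><p>One</p></div>
  <p>Two</p>
')

# All p elements anywhere in the document
all_p <- html %>% html_elements(xpath = '//p') %>% html_text()

# All p elements anywhere below a body element
body_p <- html %>% html_elements(xpath = '//body//p') %>% html_text()

# All p elements below body, which must be a direct child of the root html
root_p <- html %>% html_elements(xpath = '/html/body//p') %>% html_text()
```

In a regular HTML page every p element sits inside body, so all three vectors should contain the same two texts. The paths differ only in how strictly they pin down the route from the root.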

A simple HTML tree where only p elements that are direct children of a div are selected

html %>%
    html_elements(xpath = '//div/p')
# CSS selector equivalent: div > p
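
The following sketch contrasts the single slash (direct children) with the double slash (any descendants). The document and variable names are hypothetical and only serve to illustrate the difference:

```r
library(rvest)

# Hypothetical example document, invented for this sketch
html <- minimal_html('
  <div>
    <p>Direct child</p>
    <blockquote><p>Nested deeper</p></blockquote>
  </div>
  <p>Outside any div</p>
')

# Single slash: only p elements that are direct children of a div
direct <- html %>%
    html_elements(xpath = '//div/p') %>%
    html_text()

# Double slash: p elements at any depth below a div (CSS: div p)
descendants <- html %>%
    html_elements(xpath = '//div//p') %>%
    html_text()
```

Here `direct` should hold only the first paragraph, while `descendants` should also include the one nested inside the blockquote. The paragraph outside the div is selected by neither.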

A simple HTML tree where only div elements with an a child are selected

html %>%
    html_elements(xpath = '//div[a]')
# CSS selector equivalent: none
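
A minimal sketch of this predicate on a made-up document follows. The `id` attributes and variable names are assumptions added so the selected nodes are easy to tell apart:

```r
library(rvest)

# Hypothetical example document, invented for this sketch
html <- minimal_html('
  <div id="with-link"><a href="#">A link</a></div>
  <div id="without-link"><p>No link here</p></div>
  <div id="nested-link"><p><a href="#">Nested</a></p></div>
')

# div elements that have an a element as a direct child
child_ids <- html %>%
    html_elements(xpath = '//div[a]') %>%
    html_attr("id")

# Variation: div elements that have an a element at any depth below them
descendant_ids <- html %>%
    html_elements(xpath = '//div[.//a]') %>%
    html_attr("id")
```

The predicate in square brackets filters the div elements but does not change what is returned: the result is still the div, not the a inside it. With `//div[a]` only the first div should qualify, while `//div[.//a]` should also accept the div whose link is wrapped in a paragraph.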
