XPath functions and advanced predicates

Web Scraping in R

Timo Grossenbacher

Instructor

The position() function

...
<ol>
  <li>First element.</li>
  <li>Second element.</li>
  <li>Third element.</li>
  <li>Fourth element.</li>
  <li>Fifth element.</li>
</ol>
...
html %>%
  html_elements(xpath = '//ol/li[position() = 2]')
# Equivalent CSS selector here, because every child of <ol> is an <li>:
# ol > li:nth-child(2)
{xml_nodeset (1)}
[1] <li>Second element.</li>
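
The slide does not show how the html object is created. The following is a minimal, self-contained sketch that assumes the list is parsed from a string with read_html(); the variable names second and shorthand are illustrative only. It also shows the abbreviated form li[2], which XPath treats as shorthand for li[position() = 2].

```r
library(rvest)

# Hypothetical stand-in for the `html` object used on the slide
html <- read_html("<ol>
  <li>First element.</li>
  <li>Second element.</li>
  <li>Third element.</li>
  <li>Fourth element.</li>
  <li>Fifth element.</li>
</ol>")

# Explicit position() predicate, as on the slide
second <- html %>%
  html_elements(xpath = '//ol/li[position() = 2]')

# A bare number in a predicate is shorthand for position() = number
shorthand <- html %>%
  html_elements(xpath = '//ol/li[2]')

html_text(second)     # "Second element."
html_text(shorthand)  # "Second element."
```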

More operators for the position() function

...
<ol>
  <li>First element.</li>
  <li>Second element.</li>
  <li>Third element.</li>
  <li>Fourth element.</li>
  <li>Fifth element.</li>
</ol>
...
html %>%
  html_elements(xpath = '//ol/li[position() < 3]')
{xml_nodeset (2)}
[1] <li>First element.</li>
[2] <li>Second element.</li>

More operators for the position() function

...
<ol>
  <li>First element.</li>
  <li>Second element.</li>
  <li>Third element.</li>
  <li>Fourth element.</li>
  <li>Fifth element.</li>
</ol>
...
html %>%
  html_elements(xpath = '//ol/li[position() != 3]')
{xml_nodeset (4)}
[1] <li>First element.</li>
[2] <li>Second element.</li>
[3] <li>Fourth element.</li>
[4] <li>Fifth element.</li>
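
The comparison operators can be tried side by side. This is a sketch under the assumption that the same five-item list is parsed from a string; texts_for() is a hypothetical helper defined here, not an rvest function, and the >= comparison is added for illustration alongside the < and != shown on the slides.

```r
library(rvest)

# Hypothetical stand-in for the `html` object used on the slides
html <- read_html("<ol>
  <li>First element.</li>
  <li>Second element.</li>
  <li>Third element.</li>
  <li>Fourth element.</li>
  <li>Fifth element.</li>
</ol>")

# Helper (not part of rvest): text of all nodes matching an XPath expression
texts_for <- function(xpath) {
  html %>%
    html_elements(xpath = xpath) %>%
    html_text()
}

first_two <- texts_for('//ol/li[position() < 3]')   # items 1 and 2
last_two  <- texts_for('//ol/li[position() >= 4]')  # items 4 and 5
not_third <- texts_for('//ol/li[position() != 3]')  # everything except item 3
```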

Combining predicates

...
<ol>
  <li class = 'blue'>First element.</li>
  <li>Second element.</li>
  <li class = 'blue'>Third element.</li>
  <li>Fourth element.</li>
  <li class = 'blue'>Fifth element.</li>
</ol>
...
html %>%
  html_elements(xpath = '//ol/li[position() != 3 and @class = "blue"]')
{xml_nodeset (2)}
[1] <li class="blue">First element.</li>
[2] <li class="blue">Fifth element.</li>
html %>%
  html_elements(xpath = '//ol/li[position() != 3 or @class = "blue"]')
{xml_nodeset (5)}
...
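
To compare and with or directly, the two queries can be run against the same list. This sketch assumes the list is parsed from a string; the names both and either are illustrative. With and, a node must satisfy both conditions. With or, one condition is enough, so the third item is kept because it is blue even though it fails the position test.

```r
library(rvest)

# Hypothetical stand-in for the `html` object used on the slide
html <- read_html("<ol>
  <li class='blue'>First element.</li>
  <li>Second element.</li>
  <li class='blue'>Third element.</li>
  <li>Fourth element.</li>
  <li class='blue'>Fifth element.</li>
</ol>")

# Both conditions must hold: not in third position AND class is blue
both <- html %>%
  html_elements(xpath = '//ol/li[position() != 3 and @class = "blue"]')

# At least one condition must hold: not in third position OR class is blue
either <- html %>%
  html_elements(xpath = '//ol/li[position() != 3 or @class = "blue"]')

html_text(both)  # "First element." "Fifth element."
length(either)   # 5
```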

The count() function

...
<ol>
  <li class = 'blue'>First element.</li>
  <li>Second element.</li>
  <li class = 'blue'>Third element.</li>
</ol>
<ol>
  <li class = 'red'>First element.</li>
  <li>Second element.</li>
</ol>
...
html %>% 
  html_elements(xpath = '//ol[count(li) = 2]')
{xml_nodeset (1)}
[1] <ol>\n<li class="red">...
html %>% 
  html_elements(xpath = '//ol[count(li) > 2]')
{xml_nodeset (1)}
[1] <ol>\n<li class="blue">...
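
A runnable version of the count() queries, again assuming the two lists are parsed from a string. The variable names are illustrative. Because count(li) is evaluated relative to each ol, the predicate filters parent nodes by how many li children they have; the sketch identifies each matched list by the class of its first li.

```r
library(rvest)

# Hypothetical stand-in for the `html` object used on the slide
html <- read_html("<ol>
  <li class='blue'>First element.</li>
  <li>Second element.</li>
  <li class='blue'>Third element.</li>
</ol>
<ol>
  <li class='red'>First element.</li>
  <li>Second element.</li>
</ol>")

# Lists with exactly two <li> children
two_items <- html %>%
  html_elements(xpath = '//ol[count(li) = 2]')

# Lists with more than two <li> children
more_items <- html %>%
  html_elements(xpath = '//ol[count(li) > 2]')

# Identify each matched list by the class of its first <li>
first_class_two  <- two_items %>% html_element("li") %>% html_attr("class")
first_class_more <- more_items %>% html_element("li") %>% html_attr("class")

first_class_two   # "red"
first_class_more  # "blue"
```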

Let's try out some functions!
