Introduction to HTML

Web Scraping in R

Timo Grossenbacher

Instructor

If you see something, it can be scraped

no download

Hypertext Markup Language (HTML)

<html> 
  <body> 
    <h2>A first example</h2>
    <p>A text paragraph.</p>
    <p>
      Here follows a list:
    </p>
  </body> 
</html>

HTML intro

HTML is organized hierarchically

...
    <div>
      Here follows a list:
      <ul>
        <li>Bullet 1</li>
        <li>Bullet 2</li>
        <li>Bullet 3</li>
      </ul>
    </div>
...
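
The hierarchy in the snippet above can be explored from R. The following is a minimal sketch, assuming the list markup is available as a string; the `snippet` variable and the surrounding `<html>`/`<body>` wrapper are hypothetical additions for illustration.

```r
library(xml2)

# Hypothetical input: the list snippet above, wrapped in <html>/<body>
snippet <- "<html><body><div>Here follows a list:<ul><li>Bullet 1</li><li>Bullet 2</li><li>Bullet 3</li></ul></div></body></html>"
doc <- read_html(snippet)

# Locate the <ul> element, then walk down and up the tree
ul <- xml_find_first(doc, "//ul")
items <- xml_children(ul)                 # element children of <ul>
item_names <- xml_name(items)             # tag name of each child
item_text <- xml_text(items)              # text inside each child
parent_name <- xml_name(xml_parent(ul))   # tag that encloses the <ul>
```

Here `xml_children()` returns only element nodes, so the whitespace between tags does not show up in `items`.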

HTML tags can have attributes

...
    <p>
      Here follows a 
      <a href="https://google.com">link</a>.
    </p>
...
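
Attributes such as `href` can be read from R as well. This is a sketch under stated assumptions: the `snippet` string and its wrapper tags are hypothetical, and `html_element()` assumes rvest 1.0.0 or later (older versions call it `html_node()`).

```r
library(rvest)

# Hypothetical input: the paragraph above, wrapped in <html>/<body>
snippet <- '<html><body><p>Here follows a <a href="https://google.com">link</a>.</p></body></html>'
doc <- read_html(snippet)

# Select the first <a> element and inspect it
link <- html_element(doc, "a")
link_target <- html_attr(link, "href")   # value of the href attribute
link_text <- html_text(link)             # visible text of the link
all_attrs <- html_attrs(link)            # every attribute, as a named vector
```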

Reading HTML with R

library(rvest)
html <- read_html(html_document)
html
#> {html_document}
#> <html>
#> [1] <body> \n    <h2>A first example</h2>\n    <p>A text paragraph.</p>\n   ...
class(html)
#> [1] "xml_document" "xml_node"
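
The slide does not show where `html_document` comes from. One possibility, sketched here as an assumption, is that it holds the markup as a single string; `read_html()` also accepts a file path or a URL.

```r
library(rvest)

# Assumption: html_document is a string containing the markup
html_document <- "<html>
  <body>
    <h2>A first example</h2>
    <p>A text paragraph.</p>
  </body>
</html>"

html <- read_html(html_document)

# Pull the heading text back out to confirm the document was parsed
# (html_element() assumes rvest 1.0.0 or later)
heading <- html_text(html_element(html, "h2"))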
library(xml2)
xml_structure(html)
#> <html>
#>   <body>
#>     {text}
#>     <h2>
#>       {text}
#>     {text}
#>     <p>
#>       {text}
#>     {text}
#>     <p>
#>       {text}
#>       <a [href]>
#>         {text}
#>       {text}
#>     {text}

Let's parse HTML!
