Getting Ready to Crawl

Web Scraping in Python

Thomas Laetsch

Data Scientist, NYU

Let's Respond

Selector vs Response:

  • The Response has all the tools we learned with Selectors:
    • xpath and css methods followed by extract and extract_first methods.
  • The Response also keeps track of the url where the HTML code was loaded from.
  • The Response helps us move from one page to another by following links, so that we can "crawl" the web while scraping.

What We Know!

  • xpath method works like a Selector
response.xpath( '//div/span[@class="bio"]' )
  • css method works like a Selector
response.css( 'div > span.bio' )
  • Chaining works like a Selector
response.xpath('//div').css('span.bio')
  • Data extraction works like a Selector
response.xpath('//div').css('span.bio').extract()
response.xpath('//div').css('span.bio').extract_first()

What We Don't Know

  • The response keeps track of the URL it was loaded from in its url attribute.
response.url
>>> 'http://www.DataCamp.com/courses/all'
  • The response lets us "follow" a new link with the follow() method
# next_url is the string path of the next url we want to scrape
response.follow( next_url )
  • We'll learn more about follow later.
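
Until then, it may help to see how a next_url string relates to response.url. Scrapy's follow accepts relative URLs and resolves them against the URL of the current response; the sketch below imitates only that resolution step with the standard library, and does not build or send any request. The helper name `resolve` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

# The URL from the slide above, standing in for response.url.
current_url = "http://www.DataCamp.com/courses/all"


def resolve(next_url, base=current_url):
    """Return the absolute URL that next_url points to, relative to base.

    This mirrors only the URL-resolution part of following a link.
    """
    return urljoin(base, next_url)


# Hypothetical link targets, invented for illustration.
absolute_path = resolve("/courses/intro-to-python")      # path from the site root
relative_path = resolve("page2")                          # relative to the current page
full_url = resolve("https://www.example.com/other")       # already absolute, kept as is

print(absolute_path)
print(relative_path)
print(full_url)
```

In a spider, the string passed to response.follow can take any of these three forms.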

In Response
