Getting Started with Parsing

Web Scraping in Python

Thomas Laetsch

Data Scientist, NYU

Once More

import scrapy

class DCspider( scrapy.Spider ):
    name = "dcspider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # simple example: write out the html
        html_file = 'DC_courses.html'
        with open( html_file, 'wb' ) as fout:
            fout.write( response.body )

You Already Know This!

def parse( self, response ):
    # input parsing code with response that you already know!
    # output to a file, or...
    # crawl the web!
    pass

DataCamp Course Links: Save to File

import scrapy

class DCspider( scrapy.Spider ):
    name = "dcspider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # collect the href of each course link on the page
        links = response.css('div.course-block > a::attr(href)').extract()
        # write one link per line
        filepath = 'DC_links.csv'
        with open( filepath, 'w' ) as f:
            f.writelines( [link + '\n' for link in links] )
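
The file-writing step can be checked on its own, without a crawl. The sketch below uses only the standard library. The link values, the temporary directory, and the save_links and load_links helpers are hypothetical and exist only for illustration. It writes one link per line and reads the file back to confirm that each link ends up on its own line.

```python
import os
import tempfile


def save_links(links, filepath):
    """Write each link on its own line, terminated by a newline character."""
    with open(filepath, "w") as f:
        f.writelines(link + "\n" for link in links)


def load_links(filepath):
    """Read the file back and strip the trailing newline from each line."""
    with open(filepath) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical links, standing in for what the spider would extract.
example_links = [
    "/courses/intro-to-python",
    "/courses/web-scraping",
]

# Write into a temporary directory so no file is left in the working directory.
tmp_dir = tempfile.mkdtemp()
filepath = os.path.join(tmp_dir, "DC_links.csv")

save_links(example_links, filepath)
round_trip = load_links(filepath)
print(round_trip)
```

If the round trip returns a single long string instead of one entry per link, the line terminator written to the file is the first thing to check.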

DataCamp Course Links: Parse Again

import scrapy

class DCspider( scrapy.Spider ):
    name = "dcspider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # collect the href of each course link on the page
        links = response.css('div.course-block > a::attr(href)').extract()
        # follow each link and hand the resulting page to parse2
        for link in links:
            yield response.follow( url = link, callback = self.parse2 )

    def parse2( self, response ):
        # parse the course sites here!
        pass
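
The extracted href values may be relative paths, and response.follow accepts them as they are, resolving each one against the URL of the page it came from. That resolution step can be approximated with the standard library's urljoin, as in the sketch below. The href values are hypothetical, and this illustrates the idea only, not Scrapy's internal code.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from (taken from the spider above).
base_url = "https://www.datacamp.com/courses/all"

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    "/courses/intro-to-python",                       # root-relative path
    "web-scraping",                                   # path relative to the current page
    "https://www.datacamp.com/courses/data-science",  # already absolute
]

# Resolve every href against the page URL to get a full URL to request.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for href, full in zip(hrefs, absolute_urls):
    print(href, "->", full)
```

A plain scrapy.Request, by contrast, needs an absolute URL, which is one reason response.follow is convenient inside a parse method.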

A spider that follows links on the DataCamp website.


Johnny Parsin'
