Une demande de service

Web Scraping en Python

Thomas Laetsch

Data Scientist, NYU

Rappel : spider

import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderClassName(scrapy.Spider):
    name = "spider_name"
    # the code for your spider
    ...

process = CrawlerProcess()

process.crawl(SpiderClassName)

process.start()
Web Scraping en Python

Rappel : spider

class DCspider( scrapy.Spider ):
    name = "dc_spider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # simple example: write out the html
        html_file = 'DC_courses.html'
        with open( html_file, 'wb' ) as fout:
            fout.write( response.body )
Web Scraping en Python

L’essentiel de start_requests

def start_requests( self ):

urls = ['https://www.datacamp.com/courses/all']
for url in urls: yield scrapy.Request( url = url, callback = self.parse )
def start_requests( self ):
    url = 'https://www.datacamp.com/courses/all'
    yield scrapy.Request( url = url, callback = self.parse )
  • Ici, scrapy.Request fournira une variable response
  • L’argument url indique quel site extraire
  • L’argument callback indique où envoyer response pour le traitement
Web Scraping en Python

Vue d’ensemble

class DCspider( scrapy.Spider ):
    name = "dc_spider"

    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # simple example: write out the html
        html_file = 'DC_courses.html'
        with open( html_file, 'wb' ) as fout:
            fout.write( response.body )
Web Scraping en Python

Fin de la requête

Web Scraping en Python

Preparing Video For Download...