Introduction to the scrapy Selector

Web Scraping in Python

Thomas Laetsch

Data Scientist, NYU

Setting up a Selector

from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector( text = html )

Creates a scrapy Selector object from a string containing the HTML code
The selector sel has selected the entire HTML document

Selecting Selectors

We can call the xpath method on a Selector to create new Selectors for specific pieces of the HTML
The return value is a SelectorList of Selector objects

sel.xpath("//p")

# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>, 
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

Extracting Data from a SelectorList

Use the extract() method

>>> sel.xpath("//p")

out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
      <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

>>> sel.xpath("//p").extract()

out: [ '<p>Hello World!</p>', 
       '<p>Enjoy DataCamp!</p>' ]

extract_first() returns the first element of the list

>>> sel.xpath("//p").extract_first()

out: '<p>Hello World!</p>'

Extracting Data from a Selector

ps = sel.xpath('//p')

second_p = ps[1]

second_p.extract()

out: '<p>Enjoy DataCamp!</p>'

Select This Course!

Web Scraping in Python