Web Scraping in Python
Thomas Laetsch
Data Scientist, NYU
/
replaced by > (except the first character)
/html/body/div
html > body > div
//
replaced by a blank space (except the first characters)
//div/span//p
div > span p
[N]
replaced by :nth-of-type(N)
//div/p[2]
div > p:nth-of-type(2)
XPATH
xpath = '/html/body//div/p[2]'
CSS
css = 'html > body div > p:nth-of-type(2)'
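The three translation rules above can be sketched as a small function. This is only an illustration: `xpath_to_css` is a hypothetical helper name, not part of scrapy, and it assumes a simple XPath built only from `/`, `//`, and `[N]` (no attributes, wildcards, or functions).

```python
import re

def xpath_to_css(xpath: str) -> str:
    """Hypothetical helper: translate a simple XPath into a CSS locator."""
    # The leading '/' or '//' is dropped (the "except first character" rule).
    path = re.sub(r'^/{1,2}', '', xpath)
    # '//' (any later generation) becomes a blank space.
    path = path.replace('//', ' ')
    # A remaining single '/' (direct child) becomes ' > '.
    path = path.replace('/', ' > ')
    # '[N]' becomes ':nth-of-type(N)'.
    return re.sub(r'\[(\d+)\]', r':nth-of-type(\1)', path)

print(xpath_to_css('/html/body//div/p[2]'))
```

Applied to the XPath string above, this reproduces the CSS string shown on the slide.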
.
p.class-1
selects all paragraph elements belonging to class-1
#
div#uid
selects the div element with id equal to uid
Select paragraph elements of class class1 that are children of the div with id uid:
css_locator = 'div#uid > p.class1'
Select all elements whose class attribute belongs to class1:
css_locator = '.class1'
css = '.class1'
xpath = '//*[@class="class1"]'
xpath = '//*[contains(@class,"class1")]'
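The three locators above do not match the same elements. A minimal plain-Python sketch of the test each one applies to an element's class attribute is given below. The function names are hypothetical and only mimic the matching rules; they do not call scrapy.

```python
def xpath_exact(class_attr: str, name: str) -> bool:
    # Mimics //*[@class="name"]: the whole attribute string must equal name.
    return class_attr == name

def xpath_contains(class_attr: str, name: str) -> bool:
    # Mimics //*[contains(@class,"name")]: any substring match counts.
    return name in class_attr

def css_class(class_attr: str, name: str) -> bool:
    # Mimics .name: one of the whitespace-separated class names must equal name.
    return name in class_attr.split()

for attr in ["class1", "class1 intro", "class12"]:
    print(attr, xpath_exact(attr, "class1"),
          xpath_contains(attr, "class1"), css_class(attr, "class1"))
```

The exact XPath misses an element with class="class1 intro", while the contains version also matches class="class12". The CSS locator matches the first but not the second.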
from scrapy import Selector
html = '''
<html>
<body>
<div class="hello datacamp">
<p>Hello World!</p>
</div>
<p>Enjoy DataCamp!</p>
</body>
</html>
'''
sel = Selector(text=html)
>>> sel.css("div > p")
out: [<Selector xpath='...' data='<p>Hello World!</p>'>]
>>> sel.css("div > p").extract()
out: [ '<p>Hello World!</p>' ]