Web Scraping in Python
Thomas Laetsch
Data Scientist, NYU
from scrapy import Selector
html = '''
<html>
<body>
<div class="hello datacamp">
<p>Hello World!</p>
</div>
<p>Enjoy DataCamp!</p>
</body>
</html>
'''
sel = Selector( text = html )
Created a Scrapy Selector object using a string with the HTML code
The selector sel has selected the entire HTML document
We can use the xpath call within a Selector to create new Selectors of specific pieces of the HTML code
The return is a SelectorList of Selector objects
sel.xpath("//p")
# outputs the SelectorList: [<Selector xpath='//p' data='<p>Hello World!</p>'>, <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
extract() method
>>> sel.xpath("//p")
out: [<Selector xpath='//p' data='<p>Hello World!</p>'>, <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
>>> sel.xpath("//p").extract()
out: [ '<p>Hello World!</p>', '<p>Enjoy DataCamp!</p>' ]
Use extract_first() to get the first element of the list
>>> sel.xpath("//p").extract_first()
out: '<p>Hello World!</p>'
ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()
out: '<p>Enjoy DataCamp!</p>'