Web scraping in Python

Intermediate Importing Data in Python

Hugo Bowne-Anderson

Data Scientist at DataCamp

HTML

  • Mix of unstructured and structured data

  • Structured data:

    • Has a predefined data model, or

    • Is organized in a defined manner

  • Unstructured data: has neither of these properties
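
To make the distinction concrete, here is a minimal sketch that uses only Python's standard library. The HTML fragment, the `TableCells` class, and the product names are hypothetical and not part of the course material. The table follows a defined layout, so its cells can be collected into rows, while the paragraph is free text with no such organization.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment: a free-text paragraph (unstructured)
# next to a table with a fixed layout (structured).
HTML = """
<p>Our shop opened in 2004 and sells a bit of everything.</p>
<table>
  <tr><td>apple</td><td>0.50</td></tr>
  <tr><td>pear</td><td>0.75</td></tr>
</table>
"""


class TableCells(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a table cell is kept; the paragraph is ignored.
        if self._in_cell:
            self.rows[-1].append(data.strip())


parser = TableCells()
parser.feed(HTML)
print(parser.rows)
```

The paragraph text cannot be mapped to rows and columns in the same way, which is what makes it unstructured.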

BeautifulSoup

  • Parses and extracts structured data from HTML

  • Makes “tag soup” readable and extracts information
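
As a rough illustration of what "tag soup" means, the sketch below parses a deliberately malformed fragment. The fragment and variable names are hypothetical, and the choice of Python's built-in `html.parser` is an assumption; other parsers may repair the same markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical "tag soup": the <p> and <b> tags are never closed.
messy_html = "<div><p>Prices are <b>approximate</div>"

# The built-in parser is named explicitly (an assumption of this sketch).
soup = BeautifulSoup(messy_html, "html.parser")

# The parsed tree can be navigated even though the source was malformed.
bold_text = soup.b.get_text()
print(bold_text)

# prettify() re-serializes the tree with indentation and with the
# missing closing tags written out.
print(soup.prettify())
```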

BeautifulSoup

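from bs4 import BeautifulSoup
import requests

# Download the page and take the response body as text
url = 'https://www.crummy.com/software/BeautifulSoup/'
r = requests.get(url)
html_doc = r.text

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')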
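
A slightly more defensive variant of the same steps might look like the sketch below. The helper names `parse_html` and `fetch_soup` and the 10-second default timeout are hypothetical choices for illustration, not part of the course code.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html_doc):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html_doc, "html.parser")


def fetch_soup(url, timeout=10.0):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.text)


# Usage (needs network access):
#   soup = fetch_soup("https://www.crummy.com/software/BeautifulSoup/")
```

Separating parsing from downloading also makes it possible to try the parsing step on a local string without touching the network.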

Prettified Soup

print(soup.prettify())
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Beautiful Soup: We called him Tortoise because he taught us.
  </title>
  <link href="mailto:[email protected]" rev="made"/>
  <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
  <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
  <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
  <meta content="Leonard Richardson" name="author"/>
 </head>
 <body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
  <img align="right" src="10.1.jpg" width="250"/>
  <br/>
  <p>

Exploring BeautifulSoup

  • Many methods, e.g.:
print(soup.title)
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
print(soup.get_text())
Beautiful Soup: We called him Tortoise because he taught us.
You didn't write that awful page. You're just trying to
get some data out of it. Beautiful Soup is here to 
help. Since 2004, it's been saving programmers hours or
days of work on quick-turnaround screen scraping 
projects.
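
The same attributes and methods can be tried offline on a small document. The page below is a hypothetical stand-in, kept inline so the sketch runs without network access; `separator` and `strip` are optional arguments of `get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical page, kept inline so the example runs without network access.
html_doc = """
<html>
 <head><title>Example page</title></head>
 <body>
  <h1>Heading</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
 </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns the first matching tag, including its markup.
title_tag = soup.title
# .string gives only the text inside that tag.
title_text = soup.title.string
# get_text() concatenates all text in the document; strip=True trims each
# piece and separator sets what is placed between the pieces.
page_text = soup.get_text(separator=" ", strip=True)

print(title_tag)
print(title_text)
print(page_text)
```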

Exploring BeautifulSoup

  • find_all()
for link in soup.find_all('a'):
    print(link.get('href'))
bs4/download/
#Download
bs4/doc/
#HallOfFame
https://code.launchpad.net/beautifulsoup
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
http://www.candlemarkandgleam.com/shop/constellation-games/
http://constellation.crummy.com/Constellation%20Games%20excerpt.html
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
http://lxml.de/
http://code.google.com/p/html5lib/
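
The output above mixes relative paths, same-page fragments, and absolute URLs. One possible way to normalize them is sketched below with `urllib.parse.urljoin` from the standard library. The inline HTML is a hypothetical stand-in for the downloaded page, and skipping fragment-only links is a design choice of this sketch rather than something the course prescribes.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL from the slides; the HTML is a hypothetical stand-in for the page.
base_url = "https://www.crummy.com/software/BeautifulSoup/"
html_doc = """
<a href="bs4/download/">Download</a>
<a href="#Download">Jump to download section</a>
<a href="http://lxml.de/">lxml</a>
<a name="HallOfFame">Anchor without href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

absolute_links = []
# href=True keeps only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    href = link["href"]
    # Skip links that only point to a fragment of the same page.
    if href.startswith("#"):
        continue
    # urljoin resolves relative paths against the base URL and leaves
    # absolute URLs unchanged.
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```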

Let's practice!
