Streamlined Data Ingestion with pandas
Amany Mahfouz
Instructor
# Print columns containing nested data
print(bookstores[["categories", "coordinates", "location"]].head(3))
categories \
0 [{'alias': 'bookstores', 'title': 'Bookstores'}]
1 [{'alias': 'bookstores', 'title': 'Bookstores'...
2 [{'alias': 'bookstores', 'title': 'Bookstores'}]
coordinates \
0 {'latitude': 37.7975997924805, 'longitude': -1...
1 {'latitude': 37.7885846793652, 'longitude': -1...
2 {'latitude': 37.7589836120605, 'longitude': -1...
location
0 {'address1': '261 Columbus Ave', 'address2': '...
1 {'address1': '50 2nd St', 'address2': '', 'add...
2 {'address1': '866 Valencia St', 'address2': ''...
pandas.io.json
submodule has tools for reading and writing JSONimport
statementjson_normalize()
pd.DataFrame()
does)attribute.nestedattribute
sep
argumentimport pandas as pd import requests
from pandas.io.json import json_normalize
# Set up headers, parameters, and API endpoint api_url = "https://api.yelp.com/v3/businesses/search" headers = {"Authorization": "Bearer {}".format(api_key)} params = {"term": "bookstore", "location": "San Francisco"}
# Make the API call and extract the JSON data response = requests.get(api_url, headers=headers, params=params) data = response.json()
# Flatten data and load to dataframe, with _ separators
bookstores = json_normalize(data["businesses"], sep="_")
print(list(bookstores))
['alias',
'categories',
'coordinates_latitude',
'coordinates_longitude',
...
'location_address1',
'location_address2',
'location_address3',
'location_city',
'location_country',
'location_display_address',
'location_state',
'location_zip_code',
...
'url']
print(bookstores.categories.head())
0 [{'alias': 'bookstores', 'title': 'Bookstores'}]
1 [{'alias': 'bookstores', 'title': 'Bookstores'...
2 [{'alias': 'bookstores', 'title': 'Bookstores'}]
3 [{'alias': 'bookstores', 'title': 'Bookstores'}]
4 [{'alias': 'bookstores', 'title': 'Bookstores'...
Name: categories, dtype: object
json_normalize()
record_path
: string/list of string attributes to nested datameta
: list of other attributes to load to dataframemeta_prefix
: string to prefix to meta column names# Flatten categories data, bring in business details df = json_normalize(data["businesses"], sep="_",
record_path="categories",
meta=["name", "alias", "rating", ["coordinates", "latitude"], ["coordinates", "longitude"]],
meta_prefix="biz_")
print(df.head(4))
alias title biz_name \
0 bookstores Bookstores City Lights Bookstore
1 bookstores Bookstores Alexander Book Company
2 stationery Cards & Stationery Alexander Book Company
3 bookstores Bookstores Borderlands Books
biz_alias biz_rating biz_coordinates_latitude \
0 city-lights-bookstore-san-francisco 4.5 37.797600
1 alexander-book-company-san-francisco 4.5 37.788585
2 alexander-book-company-san-francisco 4.5 37.788585
3 borderlands-books-san-francisco 5.0 37.758984
biz_coordinates_longitude
0 -122.406578
1 -122.400631
2 -122.400631
3 -122.421638
Streamlined Data Ingestion with pandas