Analyzing US Census Data in Python
Lee Hachadoorian
Asst. Professor of Instruction, Temple University
https://api.census.gov/data/2010/dec/sf1?get=NAME,P001001&for=state:*
https://api.census.gov/data/2010/dec/sf1?
https://api.census.gov/data
2010
dec/sf1
get - List of variables
for - Geography of interest
import requests
HOST = "https://api.census.gov/data"
year = "2010"
dataset = "dec/sf1"
base_url = "/".join([HOST, year, dataset])
predicates = {}
get_vars = ["NAME", "AREALAND", "P001001"]
predicates["get"] = ",".join(get_vars)
predicates["for"] = "state:*"
r = requests.get(base_url, params=predicates)
print(r.text)
[["NAME","AREALAND","P001001","state"],
["Alabama","131170787086","4779736","01"],
["Alaska","1477953211577","710231","02"],
["Arizona","294207314414","6392017","04"],
...
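The query string that requests assembles from `predicates` can be inspected without sending anything over the network. The following is a minimal sketch using only the standard library. `build_census_url` is a hypothetical helper name (not part of requests or the Census API), and the `safe` characters are chosen here so that the result reads like the URL shown at the start; requests itself may percent-encode some of these characters.

```python
from urllib.parse import urlencode

HOST = "https://api.census.gov/data"


def build_census_url(year, dataset, get_vars, geography):
    """Assemble a Census API URL from its parts (hypothetical helper)."""
    base_url = "/".join([HOST, year, dataset])
    predicates = {"get": ",".join(get_vars), "for": geography}
    # Keep ",", ":" and "*" unescaped so the URL stays human-readable
    query = urlencode(predicates, safe=",:*")
    return base_url + "?" + query


url = build_census_url("2010", "dec/sf1", ["NAME", "P001001"], "state:*")
print(url)
```

Pasting the printed URL into a browser is a quick way to check a query before wiring it into a script.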
# Requesting a variable that does not exist returns an error message, not data
print(r.text)
error: unknown variable 'nonexistentvariable'
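Because an error body like this is plain text rather than JSON, it may be worth checking the body before handing it to pandas. The sketch below is one possible guard, assuming a successful body is a JSON list of lists whose first row is the header, as in the earlier output. `parse_census_body` is a hypothetical helper, not part of requests or the Census API.

```python
import json


def parse_census_body(body):
    """Return (header, rows) from a response body, or raise ValueError.

    Hypothetical helper: assumes a successful body is a JSON list of
    lists whose first row holds the column names.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON, so treat the body as an error message from the API
        raise ValueError("Census API error: " + body.strip()) from None
    return data[0], data[1:]


good = '[["NAME","P001001","state"],["Alabama","4779736","01"]]'
header, rows = parse_census_body(good)
print(header)

bad = "error: unknown variable 'nonexistentvariable'"
try:
    parse_census_body(bad)
except ValueError as exc:
    print(exc)
```

With a live response, the same function could be called as `parse_census_body(r.text)`. Checking `r.status_code` or calling `r.raise_for_status()` is a complementary check.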
print(r.json()[0])
['NAME', 'AREALAND', 'P001001', 'state']
Create easy-to-remember column names using snake_case:
col_names = ["name", "area_m2", "total_pop", "state"]
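A positional list like this only works while its order matches the header row returned by the API. As an alternative sketch, the names can be derived from the header itself with a mapping. The `friendly` dictionary below is a hypothetical name introduced for illustration.

```python
# Header row as returned by the API (copied from the output above)
header = ["NAME", "AREALAND", "P001001", "state"]

# Hypothetical mapping from API variable names to friendly names
friendly = {"NAME": "name", "AREALAND": "area_m2", "P001001": "total_pop"}

# Rename known variables; leave anything unmapped (e.g. "state") as-is
col_names = [friendly.get(col, col) for col in header]
print(col_names)
```

With this approach, reordering the requested variables does not silently mislabel columns.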
import pandas as pd
df = pd.DataFrame(columns=col_names, data=r.json()[1:])
# Fix data types
df["area_m2"] = df["area_m2"].astype(int)
df["total_pop"] = df["total_pop"].astype(int)
print(df.head())
         name        area_m2  total_pop state
0     Alabama   131170787086    4779736    01
1      Alaska  1477953211577     710231    02
2     Arizona   294207314414    6392017    04
3    Arkansas   134771261408    2915918    05
4  California   403466310059   37253956    06
# Create new column
df["pop_per_km2"] = 1000**2 * df["total_pop"] / df["area_m2"]
# Find top 3
df.nlargest(3, "pop_per_km2")
                    name      area_m2  total_pop state  pop_per_km2
8   District of Columbia    158114680     601723    11  3805.611218
30            New Jersey  19047341691    8791894    34   461.581156
51           Puerto Rico   8867536532    3725789    72   420.160547
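The factor `1000**2` converts square meters to square kilometers: 1 km is 1000 m, so 1 km2 is 1,000,000 m2. The sketch below repeats the calculation in plain Python for a handful of rows copied from the tables above, so the ranking covers only those rows. `pop_per_km2` as a function and `M2_PER_KM2` are names introduced here for illustration.

```python
import heapq

# (name, area in square meters, population), copied from the tables above
states = [
    ("Alabama", 131170787086, 4779736),
    ("Alaska", 1477953211577, 710231),
    ("District of Columbia", 158114680, 601723),
    ("New Jersey", 19047341691, 8791894),
    ("Puerto Rico", 8867536532, 3725789),
]

M2_PER_KM2 = 1000**2  # 1 km = 1000 m, so 1 km2 = 1,000,000 m2


def pop_per_km2(area_m2, total_pop):
    """People per square kilometer, given land area in square meters."""
    return M2_PER_KM2 * total_pop / area_m2


densities = [(name, pop_per_km2(area, pop)) for name, area, pop in states]
# Rank by density and keep the three densest entries
top3 = heapq.nlargest(3, densities, key=lambda item: item[1])
for name, density in top3:
    print(f"{name}: {density:.2f}")
```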