Analyzing US Census Data in Python
Lee Hachadoorian
Asst. Professor of Instruction, Temple University
https://api.census.gov/data/2010/dec/sf1?get=NAME,P001001&for=state:*
Base URL: https://api.census.gov/data/2010/dec/sf1
Host: https://api.census.gov/data
Year: 2010
Dataset: dec/sf1
Predicates:
get - List of variables
for - Geography of interest

import requests

HOST = "https://api.census.gov/data"
year = "2010"
dataset = "dec/sf1"
base_url = "/".join([HOST, year, dataset])

predicates = {}
get_vars = ["NAME", "AREALAND", "P001001"]
predicates["get"] = ",".join(get_vars)
predicates["for"] = "state:*"

r = requests.get(base_url, params=predicates)
print(r.text)
[["NAME","AREALAND","P001001","state"],
["Alabama","131170787086","4779736","01"],
["Alaska","1477953211577","710231","02"],
["Arizona","294207314414","6392017","04"],
...
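To see how the predicates dictionary maps onto the query string of the request above, the following sketch rebuilds the URL with only the standard library. It is an illustration, not a description of what requests does internally: the `safe` argument is an assumption made here for readability, so that commas, colons, and asterisks stay unescaped. requests may percent-encode those characters, which the server should treat as equivalent.

```python
from urllib.parse import urlencode

HOST = "https://api.census.gov/data"
year = "2010"
dataset = "dec/sf1"
base_url = "/".join([HOST, year, dataset])

predicates = {
    "get": ",".join(["NAME", "AREALAND", "P001001"]),
    "for": "state:*",
}

# Keep ",", ":" and "*" literal so the result is easy to read (illustrative choice).
query = urlencode(predicates, safe=",:*")
full_url = base_url + "?" + query
print(full_url)
```

The printed URL has the same three parts as the one on the first slide: the base URL, the `get` list of variables, and the `for` geography.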
Requesting a variable that does not exist returns an error message instead of data:
print(r.text)
error: unknown variable 'nonexistentvariable'
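Because an error comes back as plain text rather than JSON, it can help to check the body before using it. The sketch below is one possible guard, not part of the original slides: `parse_census_rows` is a hypothetical helper, and it assumes that a successful body is a JSON array of arrays while anything else is an error message. The sample strings stand in for `r.text` so the example runs without a network connection.

```python
import json


def parse_census_rows(text):
    """Return (header, rows) from a response body, or raise ValueError.

    Hypothetical helper: assumes a successful body is a JSON array of
    arrays and that anything else (such as an error message) is plain text.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        raise ValueError("Census API did not return JSON: " + text.strip()) from None
    if not isinstance(payload, list) or not payload:
        raise ValueError("Unexpected response shape")
    return payload[0], payload[1:]


# Sample bodies standing in for r.text
good = '[["NAME","P001001","state"],["Alabama","4779736","01"]]'
header, rows = parse_census_rows(good)
print(header)

bad = "error: unknown variable 'nonexistentvariable'"
try:
    parse_census_rows(bad)
except ValueError as exc:
    print(exc)
```

With a live response, the call would be `parse_census_rows(r.text)`. Checking `r.status_code` is another option, assuming the API signals such errors with a non-2xx status, which the slides do not confirm.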
print(r.json()[0])
['NAME', 'AREALAND', 'P001001', 'state']
Create easy-to-remember column names using snake_case:
col_names = ["name", "area_m2", "total_pop", "state"]
import pandas as pd

df = pd.DataFrame(columns=col_names, data=r.json()[1:])

# Fix data types
df["area_m2"] = df["area_m2"].astype(int)
df["total_pop"] = df["total_pop"].astype(int)

print(df.head())
         name        area_m2  total_pop state
0     Alabama   131170787086    4779736    01
1      Alaska  1477953211577     710231    02
2     Arizona   294207314414    6392017    04
3    Arkansas   134771261408    2915918    05
4  California   403466310059   37253956    06
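The type fix is needed because the API returns every value as a string, including the numbers. As a rough illustration of the same header/rows split without pandas, the sketch below converts only the numeric columns and leaves the state code as a string so its leading zero survives. The `response_rows` literal copies a few rows from the output above, and `int_cols` is a name introduced here for illustration.

```python
# A few rows copied from the response shown earlier (first row is the header)
response_rows = [
    ["NAME", "AREALAND", "P001001", "state"],
    ["Alabama", "131170787086", "4779736", "01"],
    ["Alaska", "1477953211577", "710231", "02"],
]
col_names = ["name", "area_m2", "total_pop", "state"]
int_cols = {"area_m2", "total_pop"}  # columns to convert from str to int

records = []
for row in response_rows[1:]:  # skip the header row
    record = {}
    for col, value in zip(col_names, row):
        # Convert numeric columns; keep everything else as the original string
        record[col] = int(value) if col in int_cols else value
    records.append(record)

print(records[0])
```

Casting the state column to an integer would turn "01" into 1, which no longer matches the two-character code the API uses.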
# Create new column
df["pop_per_km2"] = 1000**2 * df["total_pop"] / df["area_m2"]

# Find top 3
df.nlargest(3, "pop_per_km2")
                    name      area_m2  total_pop state  pop_per_km2
8   District of Columbia    158114680     601723    11  3805.611218
30            New Jersey  19047341691    8791894    34   461.581156
51           Puerto Rico   8867536532    3725789    72   420.160547
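The factor of 1000**2 converts square meters to square kilometers. As a quick check of the arithmetic, this sketch recomputes the District of Columbia figure by hand using the two values from the table above; the variable names are chosen here for illustration.

```python
area_m2 = 158114680   # District of Columbia land area, square meters
total_pop = 601723    # District of Columbia total population, 2010

# 1 km2 = 1000 m * 1000 m = 1,000,000 m2
area_km2 = area_m2 / 1000**2
pop_per_km2 = total_pop / area_km2
print(round(pop_per_km2, 2))
```

Dividing the population by the area in km2 is algebraically the same as the `1000**2 * total_pop / area_m2` expression used for the new column.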