Loading multiple tables with joins

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Keys

  • Database records have unique identifiers, or keys

311 call data, with the unique_key column highlighted. Unique keys are numbers.

Streamlined Data Ingestion with pandas

Keys

  • Database records have unique identifiers, or keys

Course catalog data, with the course_code column highlighted. Codes consist of subject abbreviations and numbers.

Streamlined Data Ingestion with pandas

Keys

  • Database records have unique identifiers, or keys

Course catalog data, with instructor_id column highlighted. Values are 9-digit integers.

Streamlined Data Ingestion with pandas

Keys

Professor table, with id column highlighted. Values are 9-digit integers like in the course catalog instructor_ids.

Streamlined Data Ingestion with pandas

Keys

Course catalog data with professors' names column joined in

Streamlined Data Ingestion with pandas

Joining Tables

Weather data, with the date column highlighted

311 call data, with the created_date column highlighted

Streamlined Data Ingestion with pandas

Joining Tables

SELECT *
  FROM hpd311calls
Streamlined Data Ingestion with pandas

Joining Tables

SELECT *
  FROM hpd311calls
       JOIN weather 
       ON hpd311calls.created_date = weather.date;
  • Use dot notation (table.column) when working with multiple tables
  • Default join only returns records whose key values appear in both tables
  • Make sure join keys are the same data type or nothing will match
Streamlined Data Ingestion with pandas

Joining and Filtering

/* Get only heat/hot water calls and join in weather data */
SELECT *
  FROM hpd311calls
       JOIN weather 
       ON hpd311calls.created_date = weather.date
 WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER';
Streamlined Data Ingestion with pandas

Joining and Aggregating

/* Get call counts by borough */
SELECT hpd311calls.borough, 
         COUNT(*)
  FROM hpd311calls
 GROUP BY hpd311calls.borough;
Streamlined Data Ingestion with pandas

Joining and Aggregating

/* Get call counts by borough
   and join in population and housing counts */
SELECT hpd311calls.borough, 
       COUNT(*), 
       boro_census.total_population,
       boro_census.housing_units
  FROM hpd311calls
 GROUP BY hpd311calls.borough
Streamlined Data Ingestion with pandas

Joining and Aggregating

/* Get call counts by borough
   and join in population and housing counts */
SELECT hpd311calls.borough, 
       COUNT(*), 
       boro_census.total_population,
       boro_census.housing_units
  FROM hpd311calls
       JOIN boro_census 
       ON hpd311calls.borough = boro_census.borough
 GROUP BY hpd311calls.borough;
Streamlined Data Ingestion with pandas
query = """SELECT hpd311calls.borough, 
                    COUNT(*), 
                    boro_census.total_population,
                    boro_census.housing_units
             FROM hpd311calls
                  JOIN boro_census 
                  ON hpd311calls.borough = boro_census.borough
            GROUP BY hpd311calls.borough;"""

call_counts = pd.read_sql(query, engine)
print(call_counts)
         borough  COUNT(*)  total_population  housing_units
0          BRONX     29874           1455846         524488
1       BROOKLYN     31722           2635121        1028383
2      MANHATTAN     20196           1653877         872645
3         QUEENS     11384           2339280         850422
4  STATEN ISLAND      1322            475948         179179
Streamlined Data Ingestion with pandas

Review

  • SQL order of keywords
    • SELECT
    • FROM
    • JOIN
    • WHERE
    • GROUP BY
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...