Joining on key columns

Pandas Joins for Spreadsheet Users

John Miller

Principal Data Scientist

Framework (continued)

$$

After viewing and understanding the data:

  • Determine the relationship
  • Check for unique values in key column

squares

Pandas Joins for Spreadsheet Users

Unique key columns

$$ $$ Unique values for single column key

df.duplicated('GameKey').sum()

$$ -- -- -- A value of 0 means no duplicates -- -- --

df.duplicated(['GameKey', 'PlayId').sum()

single column index

multi-column index

Pandas Joins for Spreadsheet Users

Framework (continued)

$$

After viewing and understanding the data:

  • Determine the relationship
  • Check for unique values in key column
  • Write merge statement and execute

squares

Pandas Joins for Spreadsheet Users

Executing the merge

$$

The statement is the same!

df1.merge(df2, how='inner', on='')

  • Pay attention to parameters

$$

Full syntax: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Pandas Joins for Spreadsheet Users

Validating merges

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

$$

Values for validate:

  • “one_to_one” or “1:1”
  • “one_to_many” or “1:m”
  • “many_to_one” or “m:1”
  • “many_to_many” or “m:m” (does nothing)
Pandas Joins for Spreadsheet Users

Let's practice!

Pandas Joins for Spreadsheet Users

Preparing Video For Download...