Filtering joins

Joining Data with pandas

Aaren Stubberfield

Instructor

Mutating versus filtering joins

Mutating joins:

  • Combines data from two tables based on matching observations in both tables

Filtering joins:

  • Filter observations from table based on whether or not they match an observation in another table
Joining Data with pandas

What is a semi join?

Table diagram of semi join where results come from intersection

Semi joins

  • Returns the intersection, similar to an inner join
  • Returns only columns from the left table and not the right
  • No duplicates
Joining Data with pandas

Musical dataset

image of phone with headphones

SQLite tutorial image

1 Photo by Vlad Bagacian from Pexels
Joining Data with pandas

Example datasets

  gid  name           
0 1    Rock           
1 2    Jazz           
2 3    Metal          
3 4    Alternative ...
4 5    Rock And Roll  
  tid  name             aid  mtid  gid  composer         u_price
0 1    For Those Ab...  1    1     1    Angus Young,...  0.99   
1 2    Balls to the...  2    2     1    nan              0.99   
2 3    Fast As a Shark  3    2     1    F. Baltes, S...  0.99   
3 4    Restless and...  3    2     1    F. Baltes, R...  0.99   
4 5    Princess of ...  3    2     1    Deaffy & R.A...  0.99
Joining Data with pandas

Step 1 - semi join

genres_tracks = genres.merge(top_tracks, on='gid')
print(genres_tracks.head())
  gid  name_x  tid   name_y           aid  mtid  composer         u_price
0 1    Rock    2260  Don't Stop M...  185  1     Mercury, Fre...  0.99   
1 1    Rock    2933  Mysterious Ways  232  1     U2               0.99   
2 1    Rock    2618  Speed Of Light   212  1     Billy Duffy/...  0.99   
3 1    Rock    2998  When Love Co...  237  1     Bono/Clayton...  0.99   
4 1    Rock    685   Who'll Stop ...  54   1     J. C. Fogerty    0.99
Joining Data with pandas

Step 2 - semi join

genres['gid'].isin(genres_tracks['gid'])

Image of 2 columns of gid being matched

Joining Data with pandas

Step 2 - semi join

genres['gid'].isin(genres_tracks['gid'])
0     True
1     True
2     True
3     True
4    False
Name: gid, dtype: bool
Joining Data with pandas

Step 3 - semi join

genres_tracks = genres.merge(top_tracks, on='gid')
top_genres = genres[genres['gid'].isin(genres_tracks['gid'])]
print(top_genres.head())
  gid  name           
0 1    Rock           
1 2    Jazz           
2 3    Metal          
3 4    Alternative & Punk
4 6    Blues
Joining Data with pandas

What is an anti join?

Table diagram of semi join where results come from left table minus intersection

Anti join:

  • Returns the left table, excluding the intersection
  • Returns only columns from the left table and not the right
Joining Data with pandas

Step 1 - anti join

genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
print(genres_tracks.head())
  gid  name_x           tid     name_y           aid    mtid  composer         u_price  _merge   
0 1    Rock             2260.0  Don't Stop M...  185.0  1.0   Mercury, Fre...  0.99     both  
1 1    Rock             2933.0  Mysterious Ways  232.0  1.0   U2               0.99     both  
2 1    Rock             2618.0  Speed Of Light   212.0  1.0   Billy Duffy/...  0.99     both  
3 1    Rock             2998.0  When Love Co...  237.0  1.0   Bono/Clayton...  0.99     both 
4 5    Rock And Roll    NaN     NaN              NaN    NaN   NaN              NaN      left_only
Joining Data with pandas

Step 2 - anti join

gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only', 'gid']
print(gid_list.head())
23     5
34     9
36    11
37    12
38    13
Name: gid, dtype: int64
Joining Data with pandas

Step 3 - anti join

genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only','gid']
non_top_genres = genres[genres['gid'].isin(gid_list)]

print(non_top_genres.head())
  gid  name          
0 5    Rock And Roll 
1 9    Pop           
2 11   Bossa Nova    
3 12   Easy Listening
4 13   Heavy Metal
Joining Data with pandas

Let's practice!

Joining Data with pandas

Preparing Video For Download...