Merging on indexes

Joining Data with pandas

Aaren Stubberfield

Instructor

Table with an index

  id     title            popularity  release_date
0 257    Oliver Twist     20.415572   2005-09-23  
1 14290  Better Luck ...  3.877036    2002-01-12  
2 38365  Grown Ups        38.864027   2010-06-24  
3 9672   Infamous         3.680896    2006-11-16  
4 12819  Alpha and Omega  12.300789   2010-09-17 
       title            popularity  release_date
id                                              
257    Oliver Twist     20.415572   2005-09-23
14290  Better Luck ...  3.877036    2002-01-12
38365  Grown Ups        38.864027   2010-06-24
9672   Infamous         3.680896    2006-11-16
12819  Alpha and Omega  12.300789   2010-09-17
Joining Data with pandas

Setting an index

movies = pd.read_csv('tmdb_movies.csv', index_col=['id'])
print(movies.head())
       title            popularity  release_date
id                                              
257    Oliver Twist     20.415572   2005-09-23
14290  Better Luck ...  3.877036    2002-01-12
38365  Grown Ups        38.864027   2010-06-24
9672   Infamous         3.680896    2006-11-16
12819  Alpha and Omega  12.300789   2010-09-17
Joining Data with pandas

Merge index datasets

       title            popularity  release_date
id                                              
257    Oliver Twist     20.415572   2005-09-23
14290  Better Luck ...  3.877036    2002-01-12
38365  Grown Ups        38.864027   2010-06-24
9672   Infamous         3.680896    2006-11-16
        tagline        
id                     
19995   Enter the Wo...
285     At the end o...
206647  A Plan No On...
49026   The Legend Ends
Joining Data with pandas

Merging on index

movies_taglines = movies.merge(taglines, on='id', how='left')
print(movies_taglines.head())
       title            popularity  release_date  tagline        
id     
257    Oliver Twist     20.415572   2005-09-23    NaN            
14290  Better Luck ...  3.877036    2002-01-12    Never undere...
38365  Grown Ups        38.864027   2010-06-24    Boys will be...
9672   Infamous         3.680896    2006-11-16    There's more...
12819  Alpha and Omega  12.300789   2010-09-17    A Pawsome 3D...
Joining Data with pandas

MultiIndex datasets

samuel = pd.read_csv('samuel.csv', 
                     index_col=['movie_id',
                                'cast_id'])
print(samuel.head())
                   name
movie_id  cast_id                   
184       3        Samuel L. Jackson
319       13       Samuel L. Jackson
326       2        Samuel L. Jackson
329       138      Samuel L. Jackson
393       21       Samuel L. Jackson
casts = pd.read_csv('casts.csv', 
                    index_col=['movie_id',
                               'cast_id'])
print(casts.head())
                   character
movie_id  cast_id          
5         22       Jezebel
          23       Diana
          24       Athena
          25       Elspeth
          26       Eva
Joining Data with pandas

MultiIndex merge

samuel_casts = samuel.merge(casts, on=['movie_id','cast_id'])
print(samuel_casts.head())
print(samuel_casts.shape)
                   name               character
movie_id  cast_id                                  
184       3        Samuel L. Jackson  Ordell Robbie
319       13       Samuel L. Jackson  Big Don
326       2        Samuel L. Jackson  Neville Flynn
329       138      Samuel L. Jackson  Arnold
393       21       Samuel L. Jackson  Rufus
(67, 2)
Joining Data with pandas

Index merge with left_on and right_on

       title            popularity  release_date
id                                              
257    Oliver Twist     20.415572   2005-09-23
14290  Better Luck ...  3.877036    2002-01-12
38365  Grown Ups        38.864027   2010-06-24
9672   Infamous         3.680896    2006-11-16
          genre
movie_id                 
5         Crime
5         Comedy
11        Science Fiction
11        Action
Joining Data with pandas

Index merge with left_on and right_on

movies_genres = movies.merge(movie_to_genres, left_on='id', left_index=True, 
                             right_on='movie_id', right_index=True)
print(movies_genres.head())
    id  title       popularity  release_date  genre
5    5  Four Rooms  22.876230   1995-12-09    Crime
5    5  Four Rooms  22.876230   1995-12-09    Comedy
11  11  Star Wars   126.393695  1977-05-25    Science Fiction
11  11  Star Wars   126.393695  1977-05-25    Action
11  11  Star Wars   126.393695  1977-05-25    Adventure
Joining Data with pandas

Let's practice!

Joining Data with pandas

Preparing Video For Download...