Introduction to the Million Songs Dataset

Building Recommendation Engines with PySpark

Jamen Long

Data Scientist at Nike

Explicit vs implicit

Explicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best

Explicit vs implicit (cont.)

Explicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best

Implicit Ratings: inferred from user behavior rather than stated directly, such as the number of times a user plays a song

Implicit refresher II

Explicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best

Implicit Ratings: inferred from user behavior rather than stated directly, such as the number of times a user plays a song (reference to white paper)

Introduction to the Million Songs Dataset

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

Add zeros sample

ratings.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    38|    99|        1|
|    38|    77|        3|
|    42|    99|        1|
+------+------+---------+

Cross join intro

users = ratings.select("userId").distinct()
users.show()
+------+
|userId|
+------+
|    10|
|    38|
|    42|
+------+
songs = ratings.select("songId").distinct()
songs.show()
+------+
|songId|
+------+
|    22|
|    77|
|    99|
+------+

Cross join output

cross_join = users.crossJoin(songs)
cross_join.show()
+------+------+
|userId|songId|
+------+------+
|    10|    22|
|    10|    77|
|    10|    99|
|    38|    22|
|    38|    77|
|    38|    99|
|    42|    22|
|    42|    77|
|    42|    99|
+------+------+
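
As a rough illustration of what the cross join produces, the sketch below pairs every distinct user with every distinct song using only the Python standard library. It is a plain-Python analogue, not PySpark; the list-of-tuples layout and the sorting are assumptions made for readability (Spark does not guarantee row order).

```python
from itertools import product

# Sample (userId, songId, num_plays) rows copied from the ratings table
ratings = [(10, 22, 5), (38, 99, 1), (38, 77, 3), (42, 99, 1)]

# Distinct users and songs, sorted only to get a stable display order
users = sorted({user for user, _, _ in ratings})
songs = sorted({song for _, song, _ in ratings})

# Cross join: every user paired with every song
cross_join = list(product(users, songs))

for user_id, song_id in cross_join:
    print(user_id, song_id)
```

With 3 users and 3 songs the result has 3 x 3 = 9 pairs, which matches the number of rows in the table above.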

Joining back original ratings data

cross_join = users.crossJoin(songs) \
                  .join(ratings, ["userId", "songId"], "left")
cross_join.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|     null|
|    10|    99|     null|
|    38|    22|     null|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|     null|
|    42|    77|     null|
|    42|    99|        1|
+------+------+---------+

Filling in with zero

cross_join = users.crossJoin(songs) \
                  .join(ratings, ["userId", "songId"], "left").fillna(0)
cross_join.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|        0|
|    10|    99|        0|
|    38|    22|        0|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|        0|
|    42|    77|        0|
|    42|    99|        1|
+------+------+---------+
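
The left join and the zero fill can be sketched the same way without Spark. This is a plain-Python analogue under the assumption that the ratings are held as a list of tuples; `None` stands in for Spark's `null`, and the names `plays`, `joined`, and `filled` are illustrative only.

```python
from itertools import product

# Sample (userId, songId, num_plays) rows copied from the ratings table
ratings = [(10, 22, 5), (38, 99, 1), (38, 77, 3), (42, 99, 1)]

# Index the observed play counts by (userId, songId)
plays = {(user, song): n for user, song, n in ratings}

users = sorted({user for user, _, _ in ratings})
songs = sorted({song for _, song, _ in ratings})

# Left join: pairs with no recorded plays get None (shown as null in Spark)
joined = [(u, s, plays.get((u, s))) for u, s in product(users, songs)]

# Equivalent of fillna(0): replace missing play counts with 0
filled = [(u, s, 0 if n is None else n) for u, s, n in joined]

for row in filled:
    print(row)
```

Every user-song pair now carries a play count, so songs a user never played appear explicitly with 0 plays instead of being absent from the data.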

Add zeros function

def add_zeros(df):
    # Extracts distinct users
    users = df.select("userId").distinct() 

    # Extracts distinct songs
    songs = df.select("songId").distinct() 

    # Joins users and songs, fills blanks with 0
    cross_join = users.crossJoin(songs) \
                .join(df, ["userId", "songId"], "left").fillna(0)

    return cross_join

Let's practice!
