Introduction to the Million Songs Dataset

Building Recommendation Engines with PySpark

Jamen Long

Data Scientist at Nike

Explicit vs implicit

Explicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best

Explicit vs implicit (cont.)


Implicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best

Implicit refresher II


Implicit Ratings: thumbs up/thumbs down, 4 of 5 stars, scale of colored dots from worst to best; reference to white paper

Introduction to the Million Songs Dataset

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

Add zeros sample

ratings.show()

+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    38|    99|        1|
|    38|    77|        3|
|    42|    99|        1|
+------+------+---------+

Cross join intro

users = ratings.select("userId").distinct()
users.show()

+------+
|userId|
+------+
|    10|
|    38|
|    42|
+------+

songs = ratings.select("songId").distinct()
songs.show()

+------+
|songId|
+------+
|    22|
|    77|
|    99|
+------+

Cross join output

cross_join = users.crossJoin(songs)
cross_join.show()

+------+------+
|userId|songId|
+------+------+
|    10|    22|
|    10|    77|
|    10|    99|
|    38|    22|
|    38|    77|
|    38|    99|
|    42|    22|
|    42|    77|
|    42|    99|
+------+------+
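The same pairing can be reproduced without Spark as a quick sanity check. The sketch below is plain Python rather than PySpark, and uses `itertools.product` as a stand-in for `crossJoin` on the three users and three songs shown above; the variable names are illustrative.

```python
from itertools import product

# Distinct users and songs from the sample ratings table
users = [10, 38, 42]
songs = [22, 77, 99]

# Every (userId, songId) combination, analogous to users.crossJoin(songs)
pairs = list(product(users, songs))

for user_id, song_id in pairs:
    print(user_id, song_id)

# 3 users x 3 songs gives 9 pairs
print(len(pairs))
```

The number of pairs grows as users times songs, so on a real dataset the cross join can be far larger than the original ratings table.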

Joining back original ratings data

# The trailing backslash continues the expression onto the next line
cross_join = users.crossJoin(songs) \
                  .join(ratings, ["userId", "songId"], "left")
cross_join.show()

+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|     null|
|    10|    99|     null|
|    38|    22|     null|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|     null|
|    42|    77|     null|
|    42|    99|        1|
+------+------+---------+

Filling in with zero

# Left join keeps every user/song pair; fillna(0) replaces the nulls with 0
cross_join = users.crossJoin(songs) \
                  .join(ratings, ["userId", "songId"], "left").fillna(0)
cross_join.show()

+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|        0|
|    10|    99|        0|
|    38|    22|        0|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|        0|
|    42|    77|        0|
|    42|    99|        1|
+------+------+---------+
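The left join followed by `fillna(0)` amounts to looking up each user/song pair in the observed ratings and defaulting to zero when nothing is found. A minimal plain-Python sketch of that idea (not PySpark; the names `observed` and `filled` are illustrative), using the sample data above:

```python
from itertools import product

# Observed play counts from the sample ratings table, keyed by (userId, songId)
observed = {(10, 22): 5, (38, 99): 1, (38, 77): 3, (42, 99): 1}

# Distinct users and songs that appear in the observed ratings
users = sorted({user_id for user_id, _ in observed})
songs = sorted({song_id for _, song_id in observed})

# Look up every pair; pairs with no recorded plays default to 0
filled = [(u, s, observed.get((u, s), 0)) for u, s in product(users, songs)]

for row in filled:
    print(row)
```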

Add zeros function

def add_zeros(df):
    # Extracts distinct users
    users = df.select("userId").distinct() 

    # Extracts distinct songs
    songs = df.select("songId").distinct() 

    # Joins users and songs, fills blanks with 0
    cross_join = users.crossJoin(songs) \
                .join(df, ["userId", "songId"], "left").fillna(0)

    return cross_join
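Two properties are worth checking after a step like `add_zeros`: the result should contain one row per user/song combination, and the total number of plays should be unchanged, since only zeros are added. The sketch below checks both in plain Python against the sample input and the zero-filled table shown earlier; the helper `check_zero_fill` is hypothetical and is not part of PySpark or the course code.

```python
def check_zero_fill(original, filled):
    """Return True if `filled` looks like a valid zero-filled version of `original`.

    Both arguments are lists of (userId, songId, num_plays) tuples.
    """
    users = {u for u, _, _ in original}
    songs = {s for _, s, _ in original}

    # One row for every user/song combination
    full_grid = len(filled) == len(users) * len(songs)

    # Adding zeros must not change the total play count
    same_total = sum(n for _, _, n in original) == sum(n for _, _, n in filled)

    # Every original rating must survive unchanged
    kept = set(original) <= set(filled)

    return full_grid and same_total and kept


# Sample ratings and the zero-filled result from the tables above
original = [(10, 22, 5), (38, 99, 1), (38, 77, 3), (42, 99, 1)]
filled = [
    (10, 22, 5), (10, 77, 0), (10, 99, 0),
    (38, 22, 0), (38, 77, 3), (38, 99, 1),
    (42, 22, 0), (42, 77, 0), (42, 99, 1),
]

print(check_zero_fill(original, filled))
```

On a large dataset the same checks would more likely be done with DataFrame counts and aggregations than by collecting rows into Python lists.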

Let's practice!
