Building Recommendation Engines with PySpark
Jamen Long
Data Scientist at Nike
Explicit Ratings
Implicit Ratings
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
ratings.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    38|    99|        1|
|    38|    77|        3|
|    42|    99|        1|
+------+------+---------+
users = ratings.select("userId").distinct()
users.show()
+------+
|userId|
+------+
|    10|
|    38|
|    42|
+------+
songs = ratings.select("songId").distinct()
songs.show()
+------+
|songId|
+------+
|    22|
|    77|
|    99|
+------+
cross_join = users.crossJoin(songs)
cross_join.show()
+------+------+
|userId|songId|
+------+------+
|    10|    22|
|    10|    77|
|    10|    99|
|    38|    22|
|    38|    77|
|    38|    99|
|    42|    22|
|    42|    77|
|    42|    99|
+------+------+
cross_join = users.crossJoin(songs) \
    .join(ratings, ["userId", "songId"], "left")
cross_join.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|     null|
|    10|    99|     null|
|    38|    22|     null|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|     null|
|    42|    77|     null|
|    42|    99|        1|
+------+------+---------+
cross_join = users.crossJoin(songs) \
    .join(ratings, ["userId", "songId"], "left").fillna(0)
cross_join.show()
+------+------+---------+
|userId|songId|num_plays|
+------+------+---------+
|    10|    22|        5|
|    10|    77|        0|
|    10|    99|        0|
|    38|    22|        0|
|    38|    77|        3|
|    38|    99|        1|
|    42|    22|        0|
|    42|    77|        0|
|    42|    99|        1|
+------+------+---------+
def add_zeros(df):
    # Extracts distinct users
    users = df.select("userId").distinct()
    # Extracts distinct songs
    songs = df.select("songId").distinct()
    # Cross joins users and songs, left joins the observed plays,
    # and fills the missing play counts with 0
    cross_join = users.crossJoin(songs) \
        .join(df, ["userId", "songId"], "left").fillna(0)
    return cross_join
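To check what add_zeros should return without starting a Spark session, the same cross-join, left-join and zero-fill logic can be sketched in plain Python. This is only an illustration: add_zeros_local is a hypothetical helper, not part of PySpark. It assumes at most one row per (userId, songId) pair, and it sorts the IDs for a stable order, which Spark does not guarantee.

```python
from itertools import product

# Sample (userId, songId, num_plays) rows copied from the ratings table above.
ratings = [(10, 22, 5), (38, 99, 1), (38, 77, 3), (42, 99, 1)]


def add_zeros_local(rows):
    """Plain-Python analogue of add_zeros (hypothetical, for illustration only)."""
    # Distinct users and songs, sorted so the output order is stable
    users = sorted({user for user, _, _ in rows})
    songs = sorted({song for _, song, _ in rows})
    # Observed play counts keyed by (userId, songId);
    # assumes each pair appears at most once in rows
    plays = {(user, song): count for user, song, count in rows}
    # Every user/song combination; pairs with no recorded plays get 0
    return [(user, song, plays.get((user, song), 0))
            for user, song in product(users, songs)]


filled = add_zeros_local(ratings)
for row in filled:
    print(row)
```

With 3 distinct users and 3 distinct songs, the result has 9 rows, and the total number of plays is unchanged because only zeros were added.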