Building Recommendation Engines in Python
Rob O'Callaghan
Director of Data
Jaccard similarity:
$$J(A,B)=\frac{|A\cap B|}{|A \cup B|}$$
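As a quick check of the formula, here is a minimal sketch that computes Jaccard similarity directly on Python sets. The helper name `jaccard_similarity` is hypothetical (it is not part of any library), the genre sets use only the genres visible in the table that follows, and returning 0.0 for two empty sets is an assumption about how to handle the undefined 0/0 case.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two sets (hypothetical helper)."""
    union = a | b
    if not union:
        # Assumption: treat two empty sets as having similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


# Genre sets built from the visible columns of the genre table.
hobbit_genres = {"Adventure", "Fantasy"}
got_genres = {"Fantasy"}

# One shared genre (Fantasy) out of two distinct genres overall.
result = jaccard_similarity(hobbit_genres, got_genres)
print(result)  # 0.5
```

The two books share one genre out of two in total, which matches the 0.5 that `jaccard_score` reports for the same pair.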
genres_array_df:
Book | Adventure | Fantasy | Tragedy | Social commentary | ... |
---|---|---|---|---|---|
The Hobbit | 1 | 1 | 0 | 0 | ... |
The Great Gatsby | 0 | 0 | 1 | 1 | ... |
A Game of Thrones | 0 | 1 | 0 | 0 | ... |
Macbeth | 0 | 0 | 1 | 0 | ... |
... | ... | ... | ... | ... | ... |
from sklearn.metrics import jaccard_score
hobbit_row = book_genre_df.loc['The Hobbit']
GOT_row = book_genre_df.loc['A Game of Thrones']
print(jaccard_score(hobbit_row, GOT_row))
0.5
from scipy.spatial.distance import pdist, squareform
jaccard_distances = pdist(book_genre_df.values, metric='jaccard')
print(jaccard_distances)
[1. 0.5 1. 1. 0.5 1. ]
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)
[[0. 1. 0.5 1. ]
[1. 0. 1. 0.5]
[0.5 1. 0. 1. ]
[1. 0.5 1. 0. ]]
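To see how the six condensed distances map onto the 4x4 matrix, the following pure-Python sketch rebuilds both forms by hand. It assumes only the four visible genre columns from the table, and `jaccard_distance` is a hypothetical helper, not the SciPy implementation. The pair ordering (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) is the same row-by-row order `pdist` uses for its condensed output.

```python
from itertools import combinations

# Toy genre vectors (Adventure, Fantasy, Tragedy, Social commentary),
# copied from the visible columns of the genre table.
books = {
    "The Hobbit": [1, 1, 0, 0],
    "The Great Gatsby": [0, 0, 1, 1],
    "A Game of Thrones": [0, 1, 0, 0],
    "Macbeth": [0, 0, 1, 0],
}


def jaccard_distance(u, v):
    """Return 1 - |intersection| / |union| for two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    # Assumption: two all-zero vectors are given distance 0.0.
    return 0.0 if either == 0 else 1 - both / either


titles = list(books)
n = len(titles)
pairs = list(combinations(range(n), 2))

# Condensed form: one distance per unordered pair of books.
condensed = [jaccard_distance(books[titles[i]], books[titles[j]]) for i, j in pairs]

# Square form: symmetric matrix with zeros on the diagonal.
square = [[0.0] * n for _ in range(n)]
for (i, j), d in zip(pairs, condensed):
    square[i][j] = d
    square[j][i] = d

print(condensed)
for row in square:
    print(row)
```

Each off-diagonal entry appears twice in the square form because the distance between book i and book j is the same in both directions.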
jaccard_similarity_array = 1 - square_jaccard_distances
print(jaccard_similarity_array)
[[1. 0. 0.5 0. ]
[0. 1. 0. 0.5]
[0.5 0. 1. 0. ]
[0. 0.5 0. 1. ]]
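import pandas as pd
# Note: despite its name, distance_df holds similarities (1 - distance), so higher values mean more alike.
distance_df = pd.DataFrame(jaccard_similarity_array, index=genres_array_df['Book'], columns=genres_array_df['Book'])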
distance_df.head()
The Hobbit The Great Gatsby A Game of Thrones Macbeth ...
The Hobbit 1.00 0.15 0.75 0.01 ...
The Great Gatsby 0.15 1.00 0.01 0.43 ...
...
print(distance_df['The Hobbit']['A Game of Thrones'])
0.75
print(distance_df['The Hobbit']['The Great Gatsby'])
0.15
print(distance_df['The Hobbit'].sort_values(ascending=False))
title
The Hobbit 1.00
The Two Towers 0.91
A Game of Thrones 0.50
...
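The sorted lookup above can be wrapped in a small helper that skips the book itself and keeps only the top matches. This is a sketch under stated assumptions: `top_similar` and `similarity_df` are hypothetical names, and the toy matrix reuses the four-book similarity values printed earlier rather than the full dataset.

```python
import pandas as pd

# Toy similarity matrix for the four books shown in the genre table.
titles = ["The Hobbit", "The Great Gatsby", "A Game of Thrones", "Macbeth"]
similarity_df = pd.DataFrame(
    [
        [1.0, 0.0, 0.5, 0.0],
        [0.0, 1.0, 0.0, 0.5],
        [0.5, 0.0, 1.0, 0.0],
        [0.0, 0.5, 0.0, 1.0],
    ],
    index=titles,
    columns=titles,
)


def top_similar(similarity_df, title, n=3):
    """Return the n most similar books to `title`, excluding the book itself."""
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


recommendations = top_similar(similarity_df, "The Hobbit", n=1)
print(recommendations)
```

Dropping the queried title first avoids recommending a book to itself, since every book has similarity 1.0 with itself.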