Making content-based recommendations

Building Recommendation Engines in Python

Rob O'Callaghan

Director of Data

Introducing the Jaccard similarity

Jaccard similarity: Venn diagram showing Jaccard similarity. $$J(A,B)=\frac{A\cap B }{A \cup B}$$

Calculating Jaccard similarity between books

genres_array_df:

Book	Adventure	Fantasy	Tragedy	Social commentary	...
The Hobbit	1	1	0	0	...
The Great Gatsby	0	0	1	1	...
A Game of Thrones	0	1	0	0	...
Macbeth	0	0	1	0	...
...	...	...	...	...	...

Calculating Jaccard similarity between books

from sklearn.metrics import jaccard_score


hobbit_row = book_genre_df.loc['The Hobbit']

GOT_row = book_genre_df.loc['A Game of Thrones']


print(jaccard_score(hobbit_row, GOT_row))

0.5

Finding the distance between all items

from scipy.spatial.distance import pdist, squareform


jaccard_distances = pdist(book_genre_df.values, metric='jaccard')
print(jaccard_distances)

[1.  0.5 1.  1.  0.5 1. ]

square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)

[[0.  1.  0.5 1. ]
 [1.  0.  1.  0.5]
 [0.5 1.  0.  1. ]
 [1.  0.5 1.  0. ]]

Finding the distance between all items

print(square_jaccard_distances)

[[0.  1.  0.5 1. ]
 [1.  0.  1.  0.5]
 [0.5 1.  0.  1. ]
 [1.  0.5 1.  0. ]]

jaccard_similarity_array = 1 -  square_jaccard_distances
print(jaccard_similarity_array)

[[1.  0.  0.5 0. ]
 [0.  1.  0.  0.5]
 [0.5 0.  1.  0. ]
 [0.  0.5 0.  1. ]]

Creating a usable distance table

distance_df = pd.DataFrame(jaccard_similarity_array,
                           index=genres_array_df['Book'], 
                           columns=genres_array_df['Book'])

distance_df.head()

            The Hobbit The Great Gatsby  A Game of Thrones          Macbeth     ...
The Hobbit        1.00             0.15               0.75             0.01     ...
The Great Gatsby  0.15             1.00               0.01             0.43     ...
...

Comparing books

print(distance_df['The Hobbit']['A Game of Thrones'])

0.75

print(distance_df['The Hobbit']['The Great Gatsby'])

0.15

Finding the most similar books

print(distance_df['The Hobbit'].sort_values(ascending=False))

title
The Hobbit                                                             1.00
The Two Towers                                                         0.91
A Game of Thrones                                                      0.50
...

Let's practice!

Building Recommendation Engines in Python