Building a plot-line-based recommender

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Movie recommender

Title	Overview
Shanghai Triad	A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress.
Cry, the Beloved Country	A South-African preacher goes to search for his wayward son who has committed a crime in the big city.

Movie recommender

get_recommendations("The Godfather")

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030                          Goodfellas
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

Steps

Text preprocessing
Generate tf-idf vectors
Generate cosine similarity matrix
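
The preprocessing step is not spelled out on this slide. As a minimal sketch, assuming that lowercasing, dropping non-alphabetic characters, and removing English stop words is enough, it could look like the following. The helper name preprocess is hypothetical, and the stop word list is scikit-learn's built-in ENGLISH_STOP_WORDS rather than anything the slides specify.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words.

    Hypothetical helper: the slides do not say which preprocessing is used.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)


cleaned = preprocess(
    "A South-African preacher goes to search for his wayward son."
)
print(cleaned)
```

A lemmatization step could be added on top of this, but that would need an extra NLP library and is left out of the sketch.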

The recommender function

Take a movie title, the cosine similarity matrix and the indices series as arguments.
Extract the pairwise cosine similarity scores for the movie.
Sort the scores in descending order.
Ignore the highest similarity score (of 1), which is the movie compared with itself.
Output the titles corresponding to the next highest scores.
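
The steps above can be sketched as follows. This is one possible implementation, not necessarily the one used to produce the outputs on these slides. The toy metadata DataFrame, the hand-written similarity matrix and the n parameter are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); the real movie dataset is not reproduced here.
metadata = pd.DataFrame({"title": ["A", "B", "C", "D"]})

# Series mapping each title to its row position in the similarity matrix.
indices = pd.Series(metadata.index, index=metadata["title"])

# Hypothetical symmetric similarity matrix with ones on the diagonal.
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])


def get_recommendations(title, cosine_sim, indices, n=10):
    # Row position of the requested movie.
    idx = indices[title]
    # Pair every movie position with its similarity to the requested movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first.
    sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)
    # Skip the first entry, assumed to be the movie itself (score of 1),
    # and keep the next n.
    sim_scores = sim_scores[1:n + 1]
    movie_indices = [i for i, _ in sim_scores]
    # Relies on the module-level metadata DataFrame defined above.
    return metadata["title"].iloc[movie_indices]


recs = get_recommendations("A", cosine_sim, indices, n=2)
print(recs)
```

Skipping the first sorted entry assumes the movie itself ranks first. If another movie had an identical plot, filtering on the row position instead would be safer.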

Generating tf-idf vectors

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)
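
To make the shape of the result concrete, here is a small self-contained run. It assumes movie_plots is simply a list of overview strings, and uses the two overviews from the table earlier in the deck.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: movie_plots is a list of plain-text overviews.
movie_plots = [
    "A provincial boy related to a Shanghai crime family is recruited by "
    "his uncle into cosmopolitan Shanghai in the 1930s to be a servant to "
    "a ganglord's mistress.",
    "A South-African preacher goes to search for his wayward son who has "
    "committed a crime in the big city.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# One row per movie, one column per term in the learned vocabulary.
n_movies, n_terms = tfidf_matrix.shape
print(n_movies, n_terms)

# The result is a SciPy sparse matrix, since most entries are zero.
print(sparse.issparse(tfidf_matrix))
```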

Generating cosine similarity matrix

# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        ,
        0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        ,
        0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        ,
        1.        ]])

The linear_kernel function

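The magnitude of a tf-idf vector is 1 (TfidfVectorizer applies L2 normalization by default).
The cosine score between two tf-idf vectors is therefore just their dot product.
Skipping the division by the magnitudes can significantly improve computation time.
Use linear_kernel instead of cosine_similarity.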
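
The claim above can be checked on a small example. This sketch uses a made-up three-document corpus and TfidfVectorizer with its default settings. It confirms that every row has unit length and that linear_kernel and cosine_similarity return the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini corpus, for illustration only.
docs = [
    "a crime family in the big city",
    "a preacher searches the big city for his son",
    "a lion cub becomes king",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Each row is L2-normalized by default, so its magnitude is 1.
norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(norms)

# With unit-length rows, the plain dot product equals the cosine score.
same = np.allclose(cosine_similarity(tfidf, tfidf), linear_kernel(tfidf, tfidf))
print(same)
```

The equivalence only holds while the vectors are unit length. If the vectorizer were created with norm=None, linear_kernel would return raw dot products rather than cosine scores.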

Generating cosine similarity matrix

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Generate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        ,
        0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        ,
        0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        ,
        1.        ]])

The get_recommendations function

get_recommendations('The Lion King', cosine_sim, indices)

7782                      African Cats
5877    The Lion King 2: Simba's Pride
4524                         Born Free
2719                          The Bear
4770     Once Upon a Time in China III
7070                        Crows Zero
739                   The Wizard of Oz
8926                   The Jungle Book
1749                 Shadow of a Doubt
7993                      October Baby
Name: title, dtype: object
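
The indices argument is not defined on these slides. A plausible construction, shown here as an assumption, is a pandas Series that maps each title to its row position in the similarity matrix. The metadata DataFrame below is toy data with a deliberately repeated title.

```python
import pandas as pd

# Toy metadata (hypothetical) with one repeated title.
metadata = pd.DataFrame(
    {"title": ["The Godfather", "The Lion King", "Goodfellas", "The Lion King"]}
)

# Map title -> row position.
indices = pd.Series(metadata.index, index=metadata["title"])

# Keep only the first row for each title so a lookup returns a single integer.
indices = indices[~indices.index.duplicated(keep="first")]

lion_king_row = indices["The Lion King"]
print(lion_king_row)
```

Without the de-duplication step, looking up a repeated title would return several positions instead of one, which the recommender function would need to handle.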

Let's practice!
