Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
Title | Overview |
---|---|
Shanghai Triad | A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress. |
Cry, the Beloved Country | A South-African preacher goes to search for his wayward son who has committed a crime in the big city. |
get_recommendations("The Godfather")
1178 The Godfather: Part II
44030 The Godfather Trilogy: 1972-1990
1914 The Godfather: Part III
23126 Blood Ties
11297 Household Saints
34717 Start Liquidation
10821 Election
38030 Goodfellas
17729 Short Sharp Shock
26293 Beck 28 - Familjen
Name: title, dtype: object
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        ,
        0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        ,
        0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        ,
        1.        ]])
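Use linear_kernel instead of cosine_similarity. TfidfVectorizer L2-normalizes each vector by default, so the dot product of two tf-idf vectors already equals their cosine score.
# Import linear_kernel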
from sklearn.metrics.pairwise import linear_kernel
# Generate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
array([[1.        , 0.27435345, 0.23092036, ..., 0.        , 0.        ,
        0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.        , 0.        ,
        0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 0.        , 0.        ,
        1.        ]])
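The two matrices above are identical because, with its default settings, TfidfVectorizer scales every row to unit length. The following is a minimal sketch that checks this on a small corpus; toy_plots is a hypothetical stand-in for movie_plots, which is not shown in these slides.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical stand-in for movie_plots (not the course data)
toy_plots = [
    "A crime family fights to keep control of its empire.",
    "The son of a crime boss takes over the family business.",
    "A young lion returns home to reclaim his kingdom.",
]

# Default TfidfVectorizer uses norm='l2', so each row has length 1
tfidf_matrix = TfidfVectorizer().fit_transform(toy_plots)

# Euclidean length of every tf-idf row vector
row_norms = np.sqrt(np.asarray(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1)))

# Cosine similarity divides by the norms; linear_kernel is the plain dot product
cos = cosine_similarity(tfidf_matrix, tfidf_matrix)
lin = linear_kernel(tfidf_matrix, tfidf_matrix)

# True when both approaches give the same scores
scores_match = np.allclose(cos, lin)
print(scores_match)
```

The shortcut only holds while the vectors are unit length. If the vectorizer is created with norm=None, linear_kernel returns raw dot products rather than cosine scores.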
get_recommendations('The Lion King', cosine_sim, indices)
7782 African Cats
5877 The Lion King 2: Simba's Pride
4524 Born Free
2719 The Bear
4770 Once Upon a Time in China III
7070 Crows Zero
739 The Wizard of Oz
8926 The Jungle Book
1749 Shadow of a Doubt
7993 October Baby
Name: title, dtype: object
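The body of get_recommendations is not shown in these slides. One possible sketch, assuming indices is a pandas Series that maps each title to its row position and that titles are unique, is given below. The metadata DataFrame, its contents, and the top_n parameter are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for the course's movie metadata
metadata = pd.DataFrame({
    "title": ["The Lion King", "African Cats", "The Godfather", "Goodfellas"],
    "overview": [
        "A lion prince flees his kingdom and later returns to reclaim it.",
        "A documentary following a lion pride and a cheetah family on the savanna.",
        "The aging head of a crime family hands control to his reluctant son.",
        "A man rises through the ranks of a crime family in New York.",
    ],
})

# Build tf-idf vectors and the pairwise similarity matrix
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(metadata["overview"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Assumed structure: title -> row position (requires unique titles)
indices = pd.Series(metadata.index, index=metadata["title"])

def get_recommendations(title, cosine_sim, indices, top_n=10):
    """Return the titles of the top_n movies most similar to `title`."""
    idx = indices[title]
    # Pair every movie's row position with its similarity to the query movie
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Drop the query movie itself, then keep the top_n best matches
    sim_scores = [(i, score) for i, score in sim_scores if i != idx][:top_n]
    movie_indices = [i for i, _ in sim_scores]
    # Reads the module-level metadata frame defined above
    return metadata["title"].iloc[movie_indices]

recs = get_recommendations("The Lion King", cosine_sim, indices, top_n=2)
print(recs)
```

Filtering out the query index explicitly, rather than slicing off the first sorted entry, avoids dropping the wrong movie when another overview happens to score a perfect 1.0.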