Building recommender systems using NMF

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Finding similar articles

Engineer at a large online newspaper
Task: recommend articles similar to the article being read by a customer
Similar articles should have similar topics

Strategy

Apply NMF to the word-frequency array
NMF feature values describe the topics
... so similar documents have similar NMF feature values
Compare NMF feature values?

Apply NMF to the word-frequency array

articles is a word-frequency array

from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)
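A minimal sketch of this step on toy data. The slides do not show how articles was built, so the corpus, the use of TfidfVectorizer, n_components=2 and random_state below are all illustrative assumptions, not the course's actual setup.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real `articles` array is not shown in the slides.
documents = [
    "dog bites man in the park",
    "hound mauls cat near the park",
    "stock market falls as banks report losses",
    "banks and markets recover after losses",
]

# Assumption: tf-idf weights are used as the word-frequency array.
vectorizer = TfidfVectorizer()
articles = vectorizer.fit_transform(documents)  # sparse, one row per document

# Two components suit this tiny corpus; the slides use 6.
nmf = NMF(n_components=2, random_state=0, max_iter=500)
nmf_features = nmf.fit_transform(articles)

# One row per document, one column per NMF feature; all values are non-negative.
print(nmf_features.shape)
```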

Versions of articles

Different versions of the same document have the same topic proportions
... exact feature values may be different!

Strong version: Dog bites man! Attack by terrible canine leaves man paralyzed...

E.g. because one version uses many meaningless words

Weak version: You may have heard, unfortunately it seems that a dog has perhaps bitten a man...

But all versions lie on the same line through the origin

Scatter plot of topic pets vs topic danger, with strong and weak version lying on the same line through the origin
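The "same line through the origin" point can be checked numerically. In this sketch the feature values and the 0.25 scale factor are made up for illustration: a weak version is modelled as the strong version's features scaled down, and both normalize to the same unit vector.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical NMF feature values: (topic "pets", topic "danger").
strong_version = np.array([[0.8, 0.6]])
# Assumption: filler words dilute the weak version, keeping the same
# topic proportions but shrinking the feature values.
weak_version = 0.25 * strong_version

# normalize() rescales each row to unit length (L2 norm by default).
strong_unit = normalize(strong_version)
weak_unit = normalize(weak_version)

print(np.allclose(strong_unit, weak_unit))  # prints True
```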

Cosine similarity

Uses the angle between the lines
Higher values mean more similar
Maximum value is 1, when angle is 0 degrees

Scatter plot of document A and document B, with 2 separate lines going through them at different angles from the origin
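As a sketch of the definition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name cosine_similarity and the document vectors below are illustrative, not taken from the slides.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature values for two documents.
doc_a = [1.0, 2.0]
doc_b = [2.0, 1.0]

# Dot product is 4 and both lengths are sqrt(5), so the result is about 0.8.
print(cosine_similarity(doc_a, doc_b))
```

Two vectors pointing the same way give 1 (angle of 0 degrees), and perpendicular vectors give 0.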

Calculating the cosine similarities

from sklearn.preprocessing import normalize

norm_features = normalize(nmf_features)

# Assume the current article has index 23
current_article = norm_features[23, :]
similarities = norm_features.dot(current_article)

print(similarities)

[ 0.7150569   0.26349967 ..., 0.20323616  0.05047817]
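One way to sanity-check the normalize-then-dot approach is to compare it with scikit-learn's own cosine_similarity on the unnormalized features. The random array below is only a stand-in for nmf_features, and index 2 is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Hypothetical stand-in for nmf_features: 5 articles, 3 NMF features.
rng = np.random.default_rng(0)
nmf_features = rng.random((5, 3))

# Approach from the slides: unit-length rows, then a dot product.
norm_features = normalize(nmf_features)
current_article = norm_features[2, :]
similarities = norm_features.dot(current_article)

# Reference: cosine similarity of every row against row 2, on the raw features.
reference = cosine_similarity(nmf_features, nmf_features[[2]]).ravel()

print(np.allclose(similarities, reference))  # prints True
```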

DataFrames and labels

Label similarities with the article titles, using a DataFrame
Titles given as a list: titles

import pandas as pd

norm_features = normalize(nmf_features)

df = pd.DataFrame(norm_features, index=titles)

current_article = df.loc['Dog bites man']

similarities = df.dot(current_article)

print(similarities.nlargest())

Dog bites man                            1.000000
Hound mauls cat                          0.979946
Pets go wild!                            0.979708
Dachshunds are dangerous                 0.949641
Our streets are no longer safe           0.900474
dtype: float64
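The current article always tops the list with similarity 1, so a recommender would usually drop it. A minimal sketch, assuming unique titles; the recommend helper and the feature values are hypothetical and not part of the course code.

```python
import pandas as pd
from sklearn.preprocessing import normalize

def recommend(nmf_features, titles, current_title, n=3):
    """Return the n titles most similar to current_title, excluding itself."""
    df = pd.DataFrame(normalize(nmf_features), index=titles)
    similarities = df.dot(df.loc[current_title])
    # Drop the article being read before picking the largest values.
    return similarities.drop(current_title).nlargest(n)

# Hypothetical feature values, for illustration only.
titles = ["Dog bites man", "Hound mauls cat", "Markets fall", "Banks recover"]
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]

top = recommend(features, titles, "Dog bites man", n=2)
print(top)
```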

Let's practice!
