Building recommender systems using NMF

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Finding similar articles

  • Engineer at a large online newspaper
  • Task: recommend articles similar to the article the customer is currently reading
  • Similar articles should have similar topics

Strategy

  • Apply NMF to the word-frequency array
  • NMF feature values describe the topics
  • ... so similar documents have similar NMF feature values
  • Compare NMF feature values?

Apply NMF to the word-frequency array

  • articles is a word-frequency array (one row per article, one column per word)
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)
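
As a minimal sketch of this step, assuming a tiny hand-made word-frequency array: the real articles array is not shown on the slides, so the values below, the names toy_articles, toy_nmf and toy_features, and the init, random_state and max_iter settings are all illustrative choices rather than the course's own.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical word-frequency array: 4 documents (rows) x 5 words (columns).
# NMF requires every entry to be non-negative.
toy_articles = np.array([
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 4.0, 0.0],
])

# Two components instead of six, because the toy array is so small.
toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
toy_features = toy_nmf.fit_transform(toy_articles)

print(toy_features.shape)         # one row per document, one column per component
print(toy_nmf.components_.shape)  # one row per component, one column per word
```

Each row of toy_features plays the role of one article's NMF feature values in the strategy above.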


Versions of articles

  • Different versions of the same document have the same topic proportions
  • ... but the exact feature values may be different!
  • E.g. because one version uses many meaningless words
  • But all versions lie on the same line through the origin

Strong version: Dog bites man! Attack by terrible canine leaves man paralyzed...

Weak version: You may have heard, unfortunately it seems that a dog has perhaps bitten a man...

Scatter plot of topic pets vs topic danger, with strong and weak version lying on the same line through the origin
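
The "same line through the origin" idea can be sketched with two made-up feature vectors. The numbers and the names strong and weak are hypothetical; the sketch only assumes that the weak version is a scaled-down copy of the strong one.

```python
import numpy as np

# Hypothetical NMF features (topic "pets", topic "danger") for two versions.
strong = np.array([0.6, 0.9])
weak = 0.25 * strong  # same topic proportions, diluted by filler words

# Dividing each vector by its length keeps only the direction.
strong_unit = strong / np.linalg.norm(strong)
weak_unit = weak / np.linalg.norm(weak)

# The raw feature values differ, but the directions coincide.
print(np.allclose(strong, weak))            # False
print(np.allclose(strong_unit, weak_unit))  # True
```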


Cosine similarity

  • Uses the angle between the lines
  • Higher values mean more similar
  • Maximum value is 1, when angle is 0 degrees

Scatter plot of document A and document B, with 2 separate lines going through them at different angles from the origin


Calculating the cosine similarities

from sklearn.preprocessing import normalize

norm_features = normalize(nmf_features)
# if the current article has index 23
current_article = norm_features[23, :]
similarities = norm_features.dot(current_article)
print(similarities)
[ 0.7150569   0.26349967 ..., 0.20323616  0.05047817]
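
Why normalizing first and then taking dot products yields cosine similarities can be checked on a small example. This is a sketch with hypothetical feature values; the row-wise L2 normalization is written out in NumPy as a stand-in for what normalize does with its default settings.

```python
import numpy as np

# Hypothetical NMF features: 3 documents x 2 components.
features = np.array([
    [0.6, 0.9],
    [0.15, 0.225],  # same direction as document 0, smaller magnitude
    [0.8, 0.0],
])

# Row-wise L2 normalization: every row ends up with length 1.
row_norms = np.linalg.norm(features, axis=1, keepdims=True)
unit_features = features / row_norms

# Dot products of unit vectors with the current (unit) vector.
current = unit_features[0, :]
sims = unit_features.dot(current)

# Same quantity from the formula cos(theta) = a.b / (|a| |b|).
explicit = features.dot(features[0]) / (
    row_norms.ravel() * np.linalg.norm(features[0])
)

print(np.allclose(sims, explicit))  # True
print(sims)
```

Documents 0 and 1 point in the same direction, so both score 1 against document 0; document 2 points elsewhere and scores lower.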

DataFrames and labels

  • Label similarities with the article titles, using a DataFrame
  • Titles given as a list: titles
import pandas as pd

norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article)
print(similarities.nlargest())
Dog bites man                            1.000000
Hound mauls cat                          0.979946
Pets go wild!                            0.979708
Dachshunds are dangerous                 0.949641
Our streets are no longer safe           0.900474
dtype: float64
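
The top entry is the current article itself, with similarity 1. A possible follow-up step, not shown on the slides, is to drop it before presenting recommendations. The sketch below assumes a hypothetical Series named toy_similarities whose values are copied from the printed output.

```python
import pandas as pd

# Hypothetical similarity scores, labelled by article title.
toy_similarities = pd.Series({
    "Dog bites man": 1.000000,
    "Hound mauls cat": 0.979946,
    "Pets go wild!": 0.979708,
    "Dachshunds are dangerous": 0.949641,
    "Our streets are no longer safe": 0.900474,
})

current_title = "Dog bites man"

# Remove the article being read, then keep the three best matches.
recommendations = toy_similarities.drop(current_title).nlargest(3)
print(recommendations)
```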

Let's practice!
