Building recommender systems using NMF

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Finding similar articles

  • Engineer at a large online newspaper
  • Task: recommend articles similar to the article the customer is reading
  • Similar articles should have similar topics

Strategy

  • Apply NMF to the word-frequency array
  • NMF feature values describe the topics
  • ... so similar documents have similar NMF feature values
  • Compare NMF feature values?

Apply NMF to the word-frequency array

  • articles is a word-frequency array (rows are documents, columns are words)
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)
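
For readers who want to run the step above end to end, here is a minimal sketch. It assumes the word-frequency array is built with scikit-learn's TfidfVectorizer; the toy corpus and the choice of two components are hypothetical and only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical toy corpus standing in for the newspaper's articles
documents = [
    "dog bites man in the park",
    "hound mauls cat near the park",
    "stock markets fall as banks struggle",
    "banks report losses as markets fall",
]

# Rows are documents, columns are words, entries are tf-idf weights
vectorizer = TfidfVectorizer()
articles = vectorizer.fit_transform(documents)

# Two components, because the toy corpus has roughly two topics
nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
nmf_features = nmf.fit_transform(articles)

# One row of non-negative topic weights per document
print(nmf_features.shape)
```

Each row of nmf_features describes one document in terms of the learned topics, which is what the later slides compare.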


Versions of articles

  • Different versions of the same document have the same topic proportions
  • ... but their exact feature values may differ!

Strong version: Dog bites man! Attack by terrible canine leaves man paralyzed...


Weak version: You may have heard, unfortunately it seems that a dog has perhaps bitten a man...

  • E.g. because one version uses many meaningless words
  • But all versions lie on the same line through the origin

Scatter plot of topic pets vs topic danger, with strong and weak version lying on the same line through the origin
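
The "same line through the origin" idea can be checked numerically. The sketch below uses made-up feature values for the pets and danger topics and treats the weak version as a scaled-down copy of the strong one, which is an idealisation of the slide's example.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical NMF features: columns are (pets, danger)
strong_version = np.array([0.6, 0.3])
weak_version = 0.25 * strong_version  # same proportions, smaller magnitude

features = np.vstack([strong_version, weak_version])
unit_features = normalize(features)  # scale each row to unit L2 length

# After normalisation the two versions coincide: they point the same way
same_direction = np.allclose(unit_features[0], unit_features[1])
print(same_direction)
```

Only the direction of a feature vector survives normalisation, so documents that differ just in magnitude become indistinguishable.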


Cosine similarity

  • Uses the angle between the lines
  • Higher values mean more similar
  • Maximum value is 1, when angle is 0 degrees

Scatter plot of document A and document B, with 2 separate lines going through them at different angles from the origin
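
As a rough illustration of the bullets above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name cosine_sim and the example vectors are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([1.0, 0.0])
doc_b = np.array([1.0, 1.0])  # 45 degrees away from doc_a
doc_c = np.array([0.0, 1.0])  # 90 degrees away from doc_a

same_line = cosine_sim(doc_a, 3 * doc_a)  # angle 0, so the maximum value 1
diagonal = cosine_sim(doc_a, doc_b)       # cos(45 degrees)
orthogonal = cosine_sim(doc_a, doc_c)     # cos(90 degrees)

print(same_line, diagonal, orthogonal)
```

Because NMF features are non-negative, the angle between two documents never exceeds 90 degrees, so the similarity stays between 0 and 1.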


Calculating the cosine similarities

from sklearn.preprocessing import normalize

norm_features = normalize(nmf_features)
# if the current article has index 23
current_article = norm_features[23,:]
similarities = norm_features.dot(current_article)
print(similarities)
[ 0.7150569   0.26349967 ..., 0.20323616  0.05047817]
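
One way to sanity-check the normalise-then-dot approach is to compare it with scikit-learn's own cosine_similarity helper. The sketch below uses random non-negative numbers as a stand-in for nmf_features; the array size and the article index are arbitrary assumptions.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for nmf_features: 5 articles, 3 topics
rng = np.random.default_rng(0)
nmf_features = rng.random((5, 3))

# Approach from the slide: unit-length rows, then a dot product
norm_features = normalize(nmf_features)
current_article = norm_features[2, :]
similarities = norm_features.dot(current_article)

# Reference values from scikit-learn's pairwise helper
reference = cosine_similarity(nmf_features, nmf_features[2:3]).ravel()

print(np.allclose(similarities, reference))
```

The entry for the current article itself is 1, since every article has angle 0 with itself.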

DataFrames and labels

  • Label similarities with the article titles, using a DataFrame
  • Titles given as a list: titles
import pandas as pd

norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article)

print(similarities.nlargest())
Dog bites man                            1.000000
Hound mauls cat                          0.979946
Pets go wild!                            0.979708
Dachshunds are dangerous                 0.949641
Our streets are no longer safe           0.900474
dtype: float64
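
Note that the article being read always tops its own list with similarity 1. A small wrapper can drop it before ranking. The function name recommend, the titles, and the feature values below are hypothetical, and the sketch assumes every title in the index is unique.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

def recommend(title, df, n=3):
    """Return the n titles most similar to `title`, excluding `title` itself.

    Assumes each row of df is a unit-length feature vector.
    """
    similarities = df.dot(df.loc[title])
    return similarities.drop(title).nlargest(n)

# Hypothetical features for four articles over (pets, danger, finance)
titles = ["Dog bites man", "Hound mauls cat", "Pets go wild!", "Markets fall"]
features = np.array([
    [0.8, 0.6, 0.0],
    [0.7, 0.7, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 0.9],
])
df = pd.DataFrame(normalize(features), index=titles)

top = recommend("Dog bites man", df, n=2)
print(top)
```

With these made-up values the two pet-related articles rank first, and the finance article is left out.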

Let's practice!
