Cosine Similarity

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

TF-IDF output

# A tibble: 1,498 x 6
       X word         n     tf   idf tf_idf
   <int> <chr>    <int>  <dbl> <dbl>  <dbl>
 1    20 january      4 0.0930  2.30  0.214
 2    15 power        4 0.0690  3.00  0.207
 3    19 futures      9 0.0643  3.00  0.193
 4     8 8            6 0.0619  3.00  0.185
 5     3 canada       2 0.0526  3.00  0.158
 6     3 canadian     2 0.0526  3.00  0.158

Cosine similarity

a measure of similarity between two vectors
measured by the cosine of the angle formed by the two vectors

Cosine similarity is the cosine of the angle between two vectors. You can picture the vectors as two lines meeting at the origin: the smaller the angle between them, the closer the similarity is to 1.

¹ https://en.wikipedia.org/wiki/Cosine_similarity
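
To make the angle picture concrete, here is a minimal base R sketch. It assumes the standard definition (dot product divided by the product of the vector lengths), and `cosine_similarity()` is a hypothetical helper written for illustration, not a function from the course or from any package.

```r
# Hypothetical helper: cosine of the angle between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Two vectors pointing in the same direction (angle of 0 degrees)
same_direction <- cosine_similarity(c(1, 2), c(2, 4))

# Two perpendicular vectors (angle of 90 degrees)
perpendicular <- cosine_similarity(c(1, 0), c(0, 1))
```

Vectors pointing the same way should score about 1 and perpendicular vectors should score 0, whatever their lengths.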

Cosine similarity formula

similarity is calculated as the two vectors' dot product divided by the product of their magnitudes

Dividing the dot product of two vectors by the product of their magnitudes tells us how similar they are, regardless of their lengths.
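
Written out, the standard definition looks like the following. The symbols A, B, θ and n are assumptions introduced here for illustration; A and B stand for the two vectors (for example, the word weights of two articles) and n for their length.

```latex
\[
\text{similarity}(A, B) = \cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]
```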

Finding similarities part I

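library(dplyr)
library(tidytext)

# Tokenize each article into words, drop stop words, then weight by TF-IDF
crude_weights <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, X) %>%
  bind_tf_idf(word, X, n)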

# A tibble: 1,498 x 6
       X word          n    tf   idf tf_idf
   <int> <chr>     <int> <dbl> <dbl>  <dbl>
 1     1 1.50          1 0.25   3.25  0.812
 2     1 16.00         1 1      3.25  3.25 
 3     1 barrel        2 0.133  3.25  0.433
 ...

Pairwise similarity

pairwise_similarity(tbl, item, feature, value, ...)

tbl: a table or tibble
item: the items to compare (articles, tweets, etc.)
feature: the column describing the link between the items (e.g. words)
value: the column of values (e.g. n or tf_idf)
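
To see how the arguments line up, here is a small sketch on made-up data. The `toy_counts` tibble and its column names are hypothetical, and the sketch assumes the `dplyr` and `widyr` packages are installed.

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: one row per (document, word) pair with a count
toy_counts <- tibble(
  doc  = c("a", "a", "b", "b", "c", "c"),
  word = c("oil", "price", "oil", "price", "oil", "barrel"),
  n    = c(2, 1, 4, 2, 1, 3)
)

# item = doc, feature = word, value = n
toy_similarity <- toy_counts %>%
  pairwise_similarity(doc, word, n) %>%
  arrange(desc(similarity))
```

Documents "a" and "b" use the same words in the same proportions, so they should come out as the most similar pair even though "b" has twice the counts.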

Finding similarities part II

library(widyr)

# Compare every pair of articles (X) by the TF-IDF weights of their words
crude_weights %>%
  pairwise_similarity(X, word, tf_idf) %>%
  arrange(desc(similarity))

# A tibble: 380 x 3
   item1 item2 similarity
   <int> <int>      <dbl>
 1    17    16      0.663
 2    16    17      0.663
 3    13    10      0.311
 4    10    13      0.311
 ...
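
Each pair shows up twice in this output, once in each direction. One possible way to keep a single row per pair is to filter on the item ids, sketched below with the four rows shown above copied into a hypothetical `similarities` tibble.

```r
library(dplyr)

# The four rows shown above, typed in by hand for illustration
similarities <- tibble(
  item1      = c(17L, 16L, 13L, 10L),
  item2      = c(16L, 17L, 10L, 13L),
  similarity = c(0.663, 0.663, 0.311, 0.311)
)

# Keep only the direction where item1 is the smaller id
unique_pairs <- similarities %>%
  filter(item1 < item2) %>%
  arrange(desc(similarity))
```

This leaves one row for articles 16 and 17 and one row for articles 10 and 13.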

Cosine similarity use-cases

find duplicate/similar pieces of text
use in clustering and classification analysis
...

Let's practice!
