Cosine Similarity

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

TFIDF output

# A tibble: 1,498 x 6
       X word         n     tf   idf tf_idf
   <int> <chr>    <int>  <dbl> <dbl>  <dbl>
 1    20 january      4 0.0930  2.30  0.214
 2    15 power        4 0.0690  3.00  0.207
 3    19 futures      9 0.0643  3.00  0.193
 4     8 8            6 0.0619  3.00  0.185
 5     3 canada       2 0.0526  3.00  0.158
 6     3 canadian     2 0.0526  3.00  0.158

Cosine similarity

  • a measure of similarity between two vectors
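  • measured by the cosine of the angle formed by the two vectors

Cosine similarity is the cosine of the angle between two vectors. You can picture it as the angle formed by two lines drawn from the same origin: the smaller the angle, the closer the similarity is to 1, and at a right angle it drops to 0.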

1 https://en.wikipedia.org/wiki/Cosine_similarity

Cosine similarity formula

  • similarity is calculated as the dot product of the two vectors divided by the product of their magnitudes
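
Written out, the dot product is normalized by the lengths of both vectors, which is what makes the result depend only on the angle between them. The symbols A, B, and n are notation chosen for this note (two vectors with n components each), not taken from the slides:

```latex
\[
\text{similarity}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
\]
```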

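The dot product of two vectors, scaled by their lengths, tells us how similar they are: the result depends only on the direction of the vectors, so a long article and a short one can still score as similar.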
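
As a rough illustration of the calculation, here is a minimal base R sketch. The function name `cosine_similarity` and the three toy vectors are made up for this example; they are not part of the course code.

```r
# Cosine similarity of two numeric vectors of equal length:
# dot product divided by the product of the vector magnitudes.
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical word-weight vectors for three documents.
doc_a <- c(1, 2, 0)
doc_b <- c(2, 4, 0)   # same direction as doc_a, twice as long
doc_c <- c(0, 0, 3)   # shares no non-zero component with doc_a

sim_ab <- cosine_similarity(doc_a, doc_b)  # parallel vectors
sim_ac <- cosine_similarity(doc_a, doc_c)  # orthogonal vectors
raw_dot_ab <- sum(doc_a * doc_b)           # un-normalized dot product
```

Here `sim_ab` comes out at 1 (up to floating-point rounding) and `sim_ac` at 0, while the raw dot product of `doc_a` and `doc_b` is 10. The raw value grows with vector length, which is why the normalization matters.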


Finding similarities part I

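library(dplyr)
library(tidytext)

# Tokenize each article, drop stop words, then weight the remaining words by tf-idf
crude_weights <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, X) %>%
  bind_tf_idf(word, X, n)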
# A tibble: 1,498 x 6
       X word          n    tf   idf tf_idf
   <int> <chr>     <int> <dbl> <dbl>  <dbl>
 1     1 1.50          1 0.25   3.25  0.812
 2     1 16.00         1 1      3.25  3.25 
 3     1 barrel        2 0.133  3.25  0.433
 ...

Pairwise similarity

pairwise_similarity(tbl, item, feature, value, ...)   # from the widyr package
  • tbl: a table or tibble
  • item: the column identifying the items to compare (articles, tweets, etc.)
  • feature: the column linking the items (e.g. word)
  • value: the column of values (e.g. n or tf_idf)
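
To see how the arguments map onto a table, here is a small sketch on made-up data. The data frame `toy_weights` and its column names `doc`, `word`, and `weight` are hypothetical; the sketch assumes the widyr and dplyr packages are installed.

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: one row per (document, word) pair with a weight.
toy_weights <- data.frame(
  doc    = c("a", "a", "b", "b", "c", "c"),
  word   = c("oil", "price", "oil", "price", "oil", "wheat"),
  weight = c(1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# item = doc, feature = word, value = weight
toy_sims <- toy_weights %>%
  pairwise_similarity(doc, word, weight)

# Documents "a" and "b" use exactly the same words with the same weights.
sim_ab <- toy_sims$similarity[toy_sims$item1 == "a" & toy_sims$item2 == "b"]
# Documents "a" and "c" share one of their two words.
sim_ac <- toy_sims$similarity[toy_sims$item1 == "a" & toy_sims$item2 == "c"]
```

Because "a" and "b" have identical word vectors, their similarity should be 1, and "a" and "c", which overlap on one of two equally weighted words, should land at 0.5.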

Finding similarities part II

library(widyr)

# Compare every pair of articles by their tf-idf word weights, most similar first
crude_weights %>%
  pairwise_similarity(X, word, tf_idf) %>%
  arrange(desc(similarity))
# A tibble: 380 x 3
   item1 item2 similarity
   <int> <int>      <dbl>
 1    17    16      0.663
 2    16    17      0.663
 3    13    10      0.311
 4    10    13      0.311
 ...
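
Each pair appears twice in this output, once in each direction (17/16 and 16/17). One possible way to keep a single row per pair is to filter on the item order. The sketch below retypes the four rows shown above into a data frame named `similarities`, a name chosen here for illustration, and assumes dplyr is installed.

```r
library(dplyr)

# The four rows shown in the output, typed in by hand.
similarities <- data.frame(
  item1      = c(17L, 16L, 13L, 10L),
  item2      = c(16L, 17L, 10L, 13L),
  similarity = c(0.663, 0.663, 0.311, 0.311)
)

# Keep only one direction of each mirrored pair.
unique_pairs <- similarities %>%
  filter(item1 < item2) %>%
  arrange(desc(similarity))
```

This leaves two rows: articles 16 and 17, then articles 10 and 13.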

Cosine similarity use-cases

  • find duplicate/similar pieces of text
  • use in clustering and classification analysis
  • ...
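
As one hedged illustration of the clustering use-case, cosine similarity can be turned into a distance (1 minus similarity) and handed to hierarchical clustering. The matrix `weights` below is invented for this sketch, and average linkage with two clusters is an arbitrary choice, not something the course prescribes.

```r
# Hypothetical term-weight matrix: rows are documents, columns are words.
weights <- rbind(
  doc1 = c(oil = 2, price = 1, wheat = 0),
  doc2 = c(oil = 4, price = 2, wheat = 0),
  doc3 = c(oil = 0, price = 1, wheat = 3)
)

# Scale each row to unit length; the matrix product of the scaled rows
# then gives the cosine similarity between every pair of documents.
unit_rows <- weights / sqrt(rowSums(weights^2))
similarity <- unit_rows %*% t(unit_rows)

# Convert similarity to a distance and cluster hierarchically.
distance <- as.dist(1 - similarity)
clusters <- cutree(hclust(distance, method = "average"), k = 2)
```

With these weights, `doc1` and `doc2` point in the same direction and end up in one cluster, while `doc3` is placed in the other.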

Let's practice!
