Latent Dirichlet allocation

Introduction to Text Analysis in R

Maham Faisal Khan

Senior Data Science Content Developer

Unsupervised learning

Some more natural language processing (NLP) vocabulary:

  • Latent Dirichlet allocation (LDA) is a standard topic model
  • A collection of documents is known as a corpus
  • Bag-of-words treats every word in a document separately, ignoring word order
  • Topic models find patterns of words appearing together
  • Searching for patterns rather than predicting is known as unsupervised learning
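
The bag-of-words idea above can be sketched in base R. This is a minimal illustration, not the course's own code: the three-document corpus and the names `corpus`, `tokens`, `vocab`, and `dtm` are made up for the example, and tokenization is a plain whitespace split.

```r
# Toy corpus: three short, made-up documents (illustrative only).
corpus <- c(
  doc1 = "the cat chased the mouse",
  doc2 = "the dog chased the cat",
  doc3 = "stocks fell as markets closed"
)

# Bag-of-words: lowercase, split on whitespace, and treat every word
# separately. Word order is discarded from here on.
tokens <- strsplit(tolower(corpus), "\\s+")
names(tokens) <- names(corpus)

# Vocabulary: every distinct word that appears anywhere in the corpus.
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per word,
# each cell holding how often that word occurs in that document.
dtm <- t(vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

print(dtm)
```

A matrix of word counts like `dtm` is the kind of input a topic model works from: words that tend to appear together in the same rows are candidates for the same topic.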

Word probabilities

Clustering vs. topic modeling

Clustering

  • Clusters are uncovered based on distance, which is continuous.
  • Every object is assigned to a single cluster.

Topic Modeling

  • Topics are uncovered based on word frequency, which is discrete.
  • Every document is a mixture (i.e., partial member) of every topic.
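
The contrast above can be sketched in base R. The clustering half uses `kmeans()` from the built-in stats package on simulated points. The topic half is only a stand-in: `doc_topics` is a hand-written table of hypothetical proportions, not the output of a fitted model. In practice a fitted LDA model (for example, one from the topicmodels package) would supply these values.

```r
# Clustering: distances between continuous points, one label per object.
set.seed(1)
points <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),  # 10 points near (0, 0)
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)   # 10 points near (5, 5)
)
clusters <- kmeans(points, centers = 2, nstart = 5)$cluster
print(clusters)  # each point carries exactly one cluster id

# Topic modeling: every document is a mixture of every topic.
# Hypothetical proportions, written by hand for illustration only.
doc_topics <- rbind(
  doc1 = c(topic1 = 0.90, topic2 = 0.10),
  doc2 = c(topic1 = 0.55, topic2 = 0.45),
  doc3 = c(topic1 = 0.05, topic2 = 0.95)
)
print(doc_topics)
print(rowSums(doc_topics))  # each document's proportions sum to 1
```

The clustering result is a vector with a single label per object, while the topic result is a matrix with a nonzero share of every topic in every row.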

Let's practice!
