Classifying fake news using supervised learning with NLP

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is supervised learning?

  • Form of machine learning
    • Problem has predefined training data
    • This data has a label (or outcome) you want the model to learn
    • Classification problem
    • Goal: Make good hypotheses about the species based on geometric features
Sepal length Sepal width Petal length Petal width Species
5.1 3.5 1.4 0.2 I. setosa
7.0 3.2 4.77 1.4 I.versicolor
6.3 3.3 6.0 2.5 I.virginica
Introduction to Natural Language Processing in Python

Supervised learning with NLP

  • Need to use language instead of geometric features
  • scikit-learn: Powerful open-source library
  • How to create supervised learning data from text?
    • Use bag-of-words models or tf-idf as features
Introduction to Natural Language Processing in Python

IMDB Movie Dataset

Plot Sci-Fi Action
In a post-apocalyptic world in human decay, a ... 1 0
Mohei is a wandering swordsman. He arrives in ... 0 1
#137 is a SCI/FI thriller about a girl, Marla,... 1 0

  • Goal: Predict movie genre based on plot summary
  • Categorical features generated using preprocessing
Introduction to Natural Language Processing in Python

Supervised learning steps

  • Collect and preprocess our data
  • Determine a label (Example: Movie genre)
  • Split data into training and test sets
  • Extract features from the text to help predict the label
    • Bag-of-words vector built into scikit-learn
  • Evaluate trained model using the test set
Introduction to Natural Language Processing in Python

Let's practice!

Introduction to Natural Language Processing in Python

Preparing Video For Download...