Classifying fake news using supervised learning with NLP

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

What is supervised learning?

  • Form of machine learning
    • Problem has predefined training data
    • This data has a label (or outcome) you want the model to learn
    • Example: a classification problem (predicting iris species)
    • Goal: Make good hypotheses about the species based on geometric features
Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
7.0           3.2          4.77          1.4          I. versicolor
6.3           3.3          6.0           2.5          I. virginica
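
As a minimal sketch of what "learning from labeled data" means, the three rows in the table above can be used as training data. The 1-nearest-neighbor classifier and the new measurement are assumptions chosen for illustration; the slides do not name a model for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

# Features per row: sepal length, sepal width, petal length, petal width
# (the three rows from the table above)
X_train = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.77, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
# Labels: the species the model should learn to predict
y_train = ["I. setosa", "I. versicolor", "I. virginica"]

# 1-nearest-neighbor is an illustrative choice, not one named in the slides
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Hypothetical new flower whose species is unknown
new_flower = [[5.0, 3.4, 1.5, 0.2]]
prediction = model.predict(new_flower)[0]
print(prediction)
```

With only three training rows this is a toy; a real model would be trained on many labeled examples per species.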

Supervised learning with NLP

  • Need to use language instead of geometric features
  • scikit-learn: Powerful open-source library
  • How to create supervised learning data from text?
    • Use bag-of-words models or tf-idf as features
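
A short sketch of turning text into numeric features with scikit-learn's `CountVectorizer` (bag-of-words) and `TfidfVectorizer` (tf-idf). The two sentences in `corpus` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus
corpus = [
    "the cat sat",
    "the dog sat down",
]

# Bag-of-words: each column counts how often one vocabulary word occurs
count_vectorizer = CountVectorizer()
bow = count_vectorizer.fit_transform(corpus)
vocabulary = sorted(count_vectorizer.vocabulary_)

# tf-idf: same columns, but words shared by every document get lower weight
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)

print(vocabulary)
print(bow.toarray())
print(tfidf.toarray().round(2))
```

Both vectorizers return a sparse matrix with one row per document and one column per vocabulary word, which is the shape scikit-learn classifiers expect as input.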

IMDB Movie Dataset

Plot                                               Sci-Fi  Action
In a post-apocalyptic world in human decay, a ...  1       0
Mohei is a wandering swordsman. He arrives in ...  0       1
#137 is a SCI/FI thriller about a girl, Marla,...  1       0

  • Goal: Predict movie genre based on plot summary
  • Categorical features generated using preprocessing

Supervised learning steps

  • Collect and preprocess our data
  • Determine a label (Example: Movie genre)
  • Split data into training and test sets
  • Extract features from the text to help predict the label
    • Bag-of-words vectorizer built into scikit-learn
  • Train a model on the training set
  • Evaluate the trained model using the test set
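
The steps above can be sketched end to end as follows. The plot summaries and labels are invented placeholders, not rows from the IMDB dataset, and the naive Bayes classifier is an assumed choice, since this section does not name a model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Steps 1-2: collect data and determine a label (hypothetical examples)
plots = [
    "A crew travels through space to a distant alien planet",
    "Robots from the future invade a space station",
    "Scientists discover an alien signal from another galaxy",
    "A time machine sends a robot into the future",
    "A detective chases armed robbers through the city",
    "A soldier fights his way out of an explosive ambush",
    "Two cops race to stop a bomb in a speeding car",
    "A swordsman fights a gang to rescue a village",
]
labels = ["sci-fi"] * 4 + ["action"] * 4

# Step 3: split into training and test sets (stratified to keep both genres)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=42, stratify=labels
)

# Step 4: extract bag-of-words features; fit the vocabulary on training data only
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train a model (MultinomialNB is an assumption, not prescribed by the slides)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Step 5: evaluate the trained model on the held-out test set
predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Fitting the vectorizer on the training text only, then reusing it to transform the test text, keeps information from the test set out of the features.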

Let's practice!
