Building word count vectors with scikit-learn

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Predicting movie genre

  • Dataset consisting of movie plots and corresponding genre
  • Goal: Create bag-of-word vectors for the movie plots
    • Can we predict genre based on the words used in the plot summary?
Introduction to Natural Language Processing in Python

Count Vectorizer with Python

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

df = ... # Load data into DataFrame
y = df['Sci-Fi']
X_train, X_test, y_train, y_test = train_test_split( df['plot'], y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)
Introduction to Natural Language Processing in Python

Let's practice!

Introduction to Natural Language Processing in Python

Preparing Video For Download...