Building word count vectors with scikit-learn

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Predicting movie genre

Dataset: movie plot summaries, each with its corresponding genre
Goal: create bag-of-words vectors for the movie plots
- Can we predict genre based on the words used in the plot summary?
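
As a rough illustration of what a bag-of-words vector holds, the sketch below counts word occurrences in a made-up plot summary using only the standard library. The sample text and the `bag_of_words` helper are hypothetical; they are not part of the course dataset or of scikit-learn.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical plot summary, invented for illustration only.
plot = "A robot travels through space. The robot meets another robot."
counts = bag_of_words(plot)

# Show the two most frequent tokens and their counts.
print(counts.most_common(2))
```

Word order is discarded; only the counts remain, which is what makes the representation a "bag".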

CountVectorizer with Python

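import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

df = ... # Load data into DataFrame

# Labels: the 'Sci-Fi' column of the DataFrame
y = df['Sci-Fi']

# Hold out 33% of the plots for testing
X_train, X_test, y_train, y_test = train_test_split(
    df['plot'], y,
    test_size=0.33,
    random_state=53)

# Build count vectors, dropping common English stop words
count_vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the training plots and vectorize them
count_train = count_vectorizer.fit_transform(X_train.values)

# Vectorize the test plots using the vocabulary learned from training
count_test = count_vectorizer.transform(X_test.values)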
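
The vocabulary is learned from the training plots only (`fit_transform`), and the test plots are then counted against that same vocabulary (`transform`), so words that appear only in the test set are ignored. The following is a simplified pure-Python sketch of that idea, not scikit-learn's actual implementation. The `tokenize`, `fit_vocabulary`, and `to_counts` helpers and the sample texts are hypothetical, and stop-word removal is left out.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index, in alphabetical order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_counts(docs, vocabulary):
    """Turn each document into a row of counts, skipping out-of-vocabulary tokens."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical training and test texts, invented for illustration only.
train_docs = ["aliens invade earth", "earth fights back"]
test_docs = ["aliens love pizza on earth"]

vocabulary = fit_vocabulary(train_docs)            # learned from training data only
train_counts = to_counts(train_docs, vocabulary)
test_counts = to_counts(test_docs, vocabulary)     # "love", "pizza", "on" are dropped

print(vocabulary)
print(test_counts)
```

Fitting on the training data alone keeps the train and test matrices the same width and avoids leaking information from the test set into the features.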

Let's practice!
