Word Count Representation

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Text to columns

Feature Engineering for Machine Learning in Python

Initializing the vectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
print(cv)
CountVectorizer(analyzer=u'word', binary=False, 
        decode_error=u'strict', 
        dtype=<type 'numpy.int64'>, 
        encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, 
        min_df=1,ngram_range=(1, 1), preprocessor=None, 
        stop_words=None, strip_accents=None, 
        token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None
Feature Engineering for Machine Learning in Python

Specifying the vectorizer

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.1, max_df=0.9)

min_df: minimum fraction of documents the word must occur in max_df: maximum fraction of documents the word can occur in

Feature Engineering for Machine Learning in Python

Fit the vectorizer

cv.fit(speech_df['text_clean'])
Feature Engineering for Machine Learning in Python

Transforming your text

cv_transformed = cv.transform(speech_df['text_clean'])
print(cv_transformed)
<58x8839 sparse matrix of type '<type 'numpy.int64'>'
Feature Engineering for Machine Learning in Python

Transforming your text

cv_transformed.toarray()
Feature Engineering for Machine Learning in Python

Getting the features

feature_names = cv.get_feature_names()
print(feature_names)
[u'abandon', u'abandoned', u'abandonment', u'abate', 
u'abdicated', u'abeyance', u'abhorring', u'abide',
u'abiding', u'abilities', u'ability', u'abject'...
Feature Engineering for Machine Learning in Python

Fitting and transforming

cv_transformed = cv.fit_transform(speech_df['text_clean'])
print(cv_transformed)
<58x8839 sparse matrix of type '<type 'numpy.int64'>'
Feature Engineering for Machine Learning in Python

Putting it all together

cv_df = pd.DataFrame(cv_transformed.toarray(), 
                     columns=cv.get_feature_names())\
                               .add_prefix('Counts_')
print(cv_df.head())
     Counts_aback    Counts_abandoned    Counts_a...
0               1                   0        ...
1               0                   0        ...
2               0                   1        ...
3               0                   1        ...
4               0                   0        ...
1 ```out Counts_aback Counts_abandon Counts_abandonment 0 1 0 0 1 0 0 1 2 0 1 0 3 0 1 0 4 0 0 0 ```
Feature Engineering for Machine Learning in Python

Updating your DataFrame

speech_df = pd.concat([speech_df, cv_df], 
                      axis=1, sort=False)
print(speech_df.shape)
(58, 8845)
Feature Engineering for Machine Learning in Python

Let's practice!

Feature Engineering for Machine Learning in Python

Preparing Video For Download...