Engineering text features

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

Extraction

  • Regular expressions: code to identify patterns
import re

my_string = "temperature:75.6 F"
temp = re.search(r"\d+\.\d+", my_string)
print(float(temp.group(0)))
75.6
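  • \d+ matches one or more digits (the integer part)
  • \. matches a literal decimal point
  • \d+ matches one or more digits (the fractional part)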
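
When the pattern does not occur in the string, re.search returns None, so calling .group(0) directly raises an AttributeError. A minimal sketch of a guarded version follows; the helper name extract_temperature is hypothetical and not part of the course material.

```python
import re
from typing import Optional

# Same pattern as in the example: digits, a literal dot, digits.
DECIMAL_PATTERN = re.compile(r"\d+\.\d+")

def extract_temperature(text: str) -> Optional[float]:
    """Return the first decimal number found in text, or None if there is none."""
    match = DECIMAL_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(0))

print(extract_temperature("temperature:75.6 F"))   # 75.6
print(extract_temperature("temperature:unknown"))  # None
```

Note that this pattern requires a decimal point, so a string such as "temperature:75 F" also yields None.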

Vectorizing text

TF/IDF: Vectorizes words based upon importance

  • TF = Term Frequency: how often a word appears in a document
  • IDF = Inverse Document Frequency: down-weights words that appear in many documents
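
The two quantities can be sketched by hand on a toy corpus. This uses the plain textbook form (TF as a relative count, IDF as log(N / document count)); the corpus and function names are made up for illustration, and scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Hypothetical toy corpus: each document is already split into words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def term_frequency(term, doc):
    # Share of the document's words that are this term.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

# "the" is in every document, so its IDF (and its score) is zero.
print(tf_idf("the", docs[1], docs))  # 0.0
# "dog" is rarer across the corpus than "sat", so it scores higher.
print(tf_idf("dog", docs[1], docs) > tf_idf("sat", docs[1], docs))  # True
```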

Vectorizing text

from sklearn.feature_extraction.text import TfidfVectorizer
print(documents.head())
0    Building on successful events last summer and ...
1               Build a website for an Afghan business
2    Please join us and the students from Mott Hall...
3    The Oxfam Action Corps is a group of dedicated...
4    Stop 'N' Swap reduces NYC's waste by finding n...
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)
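
To see what fit_transform returns, here is a small self-contained sketch. The two-item list toy_documents is a hypothetical stand-in for documents, which is not defined on the slide; the rest uses the same TfidfVectorizer calls with default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for `documents`; any iterable of strings works.
toy_documents = [
    "Build a website for an Afghan business",
    "Build a park for the city",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(toy_documents)

# One row per document, one column per distinct token.
print(text_tfidf.shape)  # (2, 9)

# The learned vocabulary maps each lowercased token to its column index.
# With the default settings, single-character tokens such as "a" are dropped.
print(sorted(tfidf_vec.vocabulary_))
```

The result is a sparse matrix, which keeps memory use manageable when the vocabulary grows to thousands of columns.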

Text classification

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
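
As a worked instance of the formula, read A as "the document belongs to a given category" and B as "the document contains a given word". The sketch below computes P(B) from the law of total probability; the function name and all probabilities are made-up values for illustration only.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' theorem.

    P(B) is expanded as P(B|A)P(A) + P(B|not A)P(not A).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 40% of documents are in the category, the word
# appears in 50% of those documents and in 10% of all other documents.
p = posterior(p_b_given_a=0.5, p_a=0.4, p_b_given_not_a=0.1)
print(round(p, 3))  # 0.769
```

Seeing the word raises the probability of the category from 0.4 to about 0.77, which is the kind of update a Bayes-based text classifier relies on.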


Let's practice!
