Preprocessing for Machine Learning in Python
James Chapman
Curriculum Manager, DataCamp
import re
my_string = "temperature:75.6 F"
temp = re.search(r"\d+\.\d+", my_string)
print(float(temp.group(0)))
75.6
\d+  matches one or more digits
\.   matches a literal decimal point
\d+  matches one or more digits
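As a minimal sketch of how this pattern might be reused, the helper below wraps the search and handles the case where no decimal number is present. The function name `extract_first_float` is hypothetical and not part of the course material.

```python
import re

def extract_first_float(text):
    """Return the first decimal number found in text as a float, or None."""
    # Raw string so the backslashes reach the regex engine unchanged.
    match = re.search(r"\d+\.\d+", text)
    # re.search returns None when nothing matches, so guard before .group().
    return float(match.group(0)) if match else None

print(extract_first_float("temperature:75.6 F"))  # 75.6
print(extract_first_float("no reading"))          # None
```

Note that the pattern requires digits on both sides of the decimal point, so a whole number such as "75" is not matched.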
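TF-IDF (term frequency-inverse document frequency): vectorizes words by importance, weighting each word by how often it appears in a document relative to how many documents contain it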
from sklearn.feature_extraction.text import TfidfVectorizer
print(documents.head())
0 Building on successful events last summer and ...
1 Build a website for an Afghan business
2 Please join us and the students from Mott Hall...
3 The Oxfam Action Corps is a group of dedicated...
4 Stop 'N' Swap reduces NYC's waste by finding n...
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)
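To illustrate what "importance" means here, the sketch below computes the textbook TF-IDF weight by hand on a tiny made-up corpus. It is an assumption-laden simplification: `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / doc frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus, for illustration only.
docs = ["build a website", "build a garden", "join a cleanup"]
scores = tfidf(docs)
for term, weight in sorted(scores[0].items()):
    print(term, round(weight, 3))
```

In this corpus "a" appears in every document and gets a weight of zero, while "website" appears in only one document and receives the highest weight in the first document.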
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
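A quick numeric check of the formula above, as a sketch only. The probabilities are made-up values chosen for easy arithmetic, and the function name `posterior` is hypothetical.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values: P(B|A) = 0.5, P(A) = 0.2, P(B) = 0.25.
result = posterior(p_b_given_a=0.5, p_a=0.2, p_b=0.25)
print(result)
```

With these inputs the numerator is 0.5 × 0.2 = 0.1, and dividing by 0.25 gives a posterior of 0.4.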