Introduction to preprocessing

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

What is data preprocessing?

  • After exploratory data analysis and data cleaning
  • Preparing data for modeling
  • Example: transforming categorical features into numerical features (dummy variables)
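As a rough illustration of the dummy-variable example above, the sketch below uses pandas' get_dummies. The trails DataFrame and its difficulty values are invented for this example and are not taken from the course data.

```python
import pandas as pd

# Hypothetical data: one categorical column, invented for illustration.
trails = pd.DataFrame({"difficulty": ["easy", "hard", "easy", "moderate"]})

# Build one indicator (dummy) column per category.
# dtype=int yields 0/1 values instead of booleans.
dummies = pd.get_dummies(trails["difficulty"], dtype=int)
print(dummies)
```

Each row ends up with exactly one 1 across the new columns, marking the category it belonged to.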

Why preprocess?

  • Transform dataset so it's suitable for modeling
  • Improve model performance
  • Generate more reliable results


Recap: exploring data with pandas

import pandas as pd
hiking = pd.read_json("hiking.json")
print(hiking.head())
  Prop_ID                     Name  ... lat lon
0    B057  Salt Marsh Nature Trail  ... NaN NaN
1    B073                Lullwater  ... NaN NaN
2    B073                  Midwood  ... NaN NaN
3    B073                Peninsula  ... NaN NaN
4    B073                Waterfall  ... NaN NaN

Recap: exploring data with pandas

hiking.info()  # info() prints the summary itself and returns None, so no print() is needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
 --  ------          --------------  -----  
 0   Prop_ID         33 non-null     object 
 1   Name            33 non-null     object 
 2   Location        33 non-null     object 
 3   Park_Name       33 non-null     object 
 4   Length          29 non-null     object 
 5   Difficulty      27 non-null     object 
 6   Other_Details   31 non-null     object 
 7   Accessible      33 non-null     object 
 8   Limited_Access  33 non-null     object 
 9   lat             0 non-null      float64
 10  lon             0 non-null      float64
dtypes: float64(2), object(9)
memory usage: 3.0+ KB

Recap: exploring data with pandas

# wine: a second DataFrame, assumed to be loaded already
print(wine.describe())
             Type     Alcohol    ...  Alcalinity of ash
count  178.000000  178.000000    ...   178.000000
mean     1.938202   13.000618    ...    19.494944
std      0.775035    0.811827    ...     3.339564
min      1.000000   11.030000    ...    10.600000
25%      1.000000   12.362500    ...    17.200000
50%      2.000000   13.050000    ...    19.500000
75%      3.000000   13.677500    ...    21.500000
max      3.000000   14.830000    ...    30.000000

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
# drop every row that contains at least one missing value
print(df.dropna())
     A    B    C
1  4.0  7.0  3.0
4  5.0  9.0  7.0
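The slides print df without showing how it is built. A minimal sketch that reconstructs it from the printed output and confirms that dropna() returns a new DataFrame rather than modifying df could look like this (the name complete_rows is introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example DataFrame from its printed output.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# dropna() returns a copy with incomplete rows removed; df is left untouched.
complete_rows = df.dropna()
print(complete_rows)
print(len(df), len(complete_rows))
```

To keep the result, assign it back to a variable, for example df = df.dropna().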

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
# drop rows by index label (here 1, 2, and 3)
print(df.drop([1, 2, 3]))
     A    B    C
0  1.0  NaN  2.0
4  5.0  9.0  7.0

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
# drop column "A"; axis=1 targets columns instead of rows
print(df.drop("A", axis=1))
     B    C
0  NaN  2.0
1  7.0  3.0
2  NaN  NaN
3  7.0  NaN
4  9.0  7.0
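The two drop() calls shown so far differ only in whether they target rows or columns. One way to make that explicit, sketched here with df rebuilt from the printed output, is to use the index= and columns= keywords that drop() also accepts (the result names are introduced for illustration):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example DataFrame from its printed output.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Same result as df.drop([1, 2, 3]): remove rows by index label.
rows_removed = df.drop(index=[1, 2, 3])

# Same result as df.drop("A", axis=1): remove a column by name.
column_removed = df.drop(columns="A")

print(rows_removed)
print(column_removed)
```

The keyword forms avoid having to remember which axis number refers to rows and which to columns.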

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
# count the missing values in each column
print(df.isna().sum())
A    1
B    2
C    2
dtype: int64
# drop only the rows where column B is missing
print(df.dropna(subset=["B"]))
     A    B    C
1  4.0  7.0  3.0
3  NaN  7.0  NaN
4  5.0  9.0  7.0

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
# keep only the rows that have at least 2 non-missing values
print(df.dropna(thresh=2))
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
4  5.0  9.0  7.0
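The thresh argument is easy to misread as a count of missing values, when it is in fact the minimum number of non-missing values a row needs in order to be kept. A small sketch, again rebuilding df from the printed output, checks this by counting non-missing values per row (non_missing and kept are names introduced for illustration):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example DataFrame from its printed output.
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Number of non-missing values in each row.
non_missing = df.notna().sum(axis=1)
print(non_missing)

# thresh=2 keeps exactly the rows with two or more non-missing values.
kept = df.dropna(thresh=2)
print(kept.index.equals(df[non_missing >= 2].index))
```

Rows 2 and 3 each hold a single non-missing value, which is why they are the ones removed.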

Let's practice!
