Introduction to preprocessing

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

What is data preprocessing?

  • After exploratory data analysis and data cleaning
  • Preparing data for modeling

 

  • Example: transforming categorical features into numerical features (dummy variables)
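As a minimal sketch of the dummy-variable example above, pandas can expand a categorical column into one 0/1 indicator column per category. The `trails` DataFrame and its values are hypothetical, made up for illustration; only `pd.get_dummies` is a real pandas call.

```python
import pandas as pd

# Hypothetical data: a single categorical column (not from the course dataset)
trails = pd.DataFrame({"Difficulty": ["Easy", "Moderate", "Easy", "Hard"]})

# Create one indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(trails["Difficulty"], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column matching its original category.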

Why preprocess?

 

 

  • Transform dataset so it's suitable for modeling
  • Improve model performance
  • Generate more reliable results

[Image: a bullseye target]


Recap: exploring data with pandas

import pandas as pd
hiking = pd.read_json("hiking.json")
print(hiking.head())
  Prop_ID                     Name  ... lat lon
0    B057  Salt Marsh Nature Trail  ... NaN NaN
1    B073                Lullwater  ... NaN NaN
2    B073                  Midwood  ... NaN NaN
3    B073                Peninsula  ... NaN NaN
4    B073                Waterfall  ... NaN NaN

Recap: exploring data with pandas

print(hiking.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
 --  ------          --------------  -----  
 0   Prop_ID         33 non-null     object 
 1   Name            33 non-null     object 
 2   Location        33 non-null     object 
 3   Park_Name       33 non-null     object 
 4   Length          29 non-null     object 
 5   Difficulty      27 non-null     object 
 6   Other_Details   31 non-null     object 
 7   Accessible      33 non-null     object 
 8   Limited_Access  33 non-null     object 
 9   lat             0 non-null      float64
 10  lon             0 non-null      float64
dtypes: float64(2), object(9)
memory usage: 3.0+ KB

Recap: exploring data with pandas

# wine is a separate DataFrame (not the hiking data), assumed to be loaded already
print(wine.describe())
             Type     Alcohol    ...  Alcalinity of ash
count  178.000000  178.000000    ...   178.000000
mean     1.938202   13.000618    ...    19.494944
std      0.775035    0.811827    ...     3.339564
min      1.000000   11.030000    ...    10.600000
25%      1.000000   12.362500    ...    17.200000
50%      2.000000   13.050000    ...    19.500000
75%      3.000000   13.677500    ...    21.500000
max      3.000000   14.830000    ...    30.000000

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.dropna())
     A    B    C
1  4.0  7.0  3.0
4  5.0  9.0  7.0
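To make the `dropna()` example reproducible, here is a small sketch that rebuilds the `df` printed above. The construction with `np.nan` is an assumption about how the frame was created; the values themselves are copied from the slide.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame shown above (construction method is assumed)
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# dropna() with no arguments drops every row containing at least one missing value
complete_rows = df.dropna()
print(complete_rows)
```

Note that `dropna()` returns a new DataFrame; `df` itself still has all five rows unless the result is assigned back.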

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.drop([1, 2, 3]))
     A    B    C
0  1.0  NaN  2.0
4  5.0  9.0  7.0

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.drop("A", axis=1))
     B    C
0  NaN  2.0
1  7.0  3.0
2  NaN  NaN
3  7.0  NaN
4  9.0  7.0
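The two `drop()` calls above differ only in which axis the labels refer to. The sketch below, using the same example values, shows the keyword forms `columns=` and `index=`, which some readers find clearer than `axis=`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same example values as in the slides
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Dropping a column: axis=1 and columns= are equivalent spellings
by_axis = df.drop("A", axis=1)
by_keyword = df.drop(columns="A")

# Dropping rows by index label: the default axis=0, or index= explicitly
rows_dropped = df.drop(index=[1, 2, 3])

print(by_keyword)
print(rows_dropped)
```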

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.isna().sum())
A    1
B    2
C    2
dtype: int64
print(df.dropna(subset=["B"]))
     A    B    C
1  4.0  7.0  3.0
3  NaN  7.0  NaN
4  5.0  9.0  7.0

Removing missing data

print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.dropna(thresh=2))
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
4  5.0  9.0  7.0
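`thresh=2` keeps rows that have at least two non-missing values. One way to see why rows 2 and 3 are dropped is to count the non-missing values per row by hand, as in this sketch (same example values; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same example values as in the slides
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Count non-missing values in each row
non_missing = df.notna().sum(axis=1)
print(non_missing)

# Keeping rows with at least 2 non-missing values matches dropna(thresh=2)
kept_manual = df[non_missing >= 2]
kept_thresh = df.dropna(thresh=2)
print(kept_manual.equals(kept_thresh))
```

Rows 2 and 3 each have only one non-missing value, so they fall below the threshold.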

Let's practice!
