Preprocessing for Machine Learning in Python
James Chapman
Curriculum Manager, DataCamp
import pandas as pd
hiking = pd.read_json("hiking.json")
print(hiking.head())
  Prop_ID                     Name  ...  lat  lon
0    B057  Salt Marsh Nature Trail  ...  NaN  NaN
1    B073                Lullwater  ...  NaN  NaN
2    B073                  Midwood  ...  NaN  NaN
3    B073                Peninsula  ...  NaN  NaN
4    B073                Waterfall  ...  NaN  NaN
hiking.info()  # info() prints its summary itself and returns None, so no print() is needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Prop_ID         33 non-null     object
 1   Name            33 non-null     object
 2   Location        33 non-null     object
 3   Park_Name       33 non-null     object
 4   Length          29 non-null     object
 5   Difficulty      27 non-null     object
 6   Other_Details   31 non-null     object
 7   Accessible      33 non-null     object
 8   Limited_Access  33 non-null     object
 9   lat             0 non-null      float64
 10  lon             0 non-null      float64
dtypes: float64(2), object(9)
memory usage: 3.0+ KB
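The summary above reports 0 non-null values for lat and lon. One possible way to discard columns that are entirely empty is dropna along the column axis with how="all". This is a minimal sketch on a hypothetical two-row stand-in for hiking (the real hiking.json is not loaded here, and the name hiking_sample is an assumption made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hiking data: two populated columns
# and two columns that contain only missing values
hiking_sample = pd.DataFrame({
    "Prop_ID": ["B057", "B073"],
    "Name": ["Salt Marsh Nature Trail", "Lullwater"],
    "lat": [np.nan, np.nan],
    "lon": [np.nan, np.nan],
})

# axis=1 works on columns; how="all" drops a column only if every value is missing
trimmed = hiking_sample.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Columns with some missing values, such as Length or Difficulty in the summary, would be left in place by how="all".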
print(wine.describe())
             Type     Alcohol  ...  Alcalinity of ash
count  178.000000  178.000000  ...         178.000000
mean     1.938202   13.000618  ...          19.494944
std      0.775035    0.811827  ...           3.339564
min      1.000000   11.030000  ...          10.600000
25%      1.000000   12.362500  ...          17.200000
50%      2.000000   13.050000  ...          19.500000
75%      3.000000   13.677500  ...          21.500000
max      3.000000   14.830000  ...          30.000000
print(df)
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0
print(df.dropna())
     A    B    C
1  4.0  7.0  3.0
4  5.0  9.0  7.0
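To reproduce the result above, the example frame can be rebuilt by hand. This is a minimal sketch, assuming df holds exactly the values printed earlier. It also illustrates that dropna() returns a new frame and leaves df unchanged:

```python
import numpy as np
import pandas as pd

# Rebuild the small example frame from the printed values
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# With no arguments, dropna() removes every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)

# The original frame still has all five rows
print(len(df))
```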
print(df.drop([1, 2, 3]))
     A    B    C
0  1.0  NaN  2.0
4  5.0  9.0  7.0
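The list passed to drop() is interpreted as index labels, not positions. A short sketch, assuming the same example values for df, shows the call above next to the more explicit index= keyword:

```python
import numpy as np
import pandas as pd

# Same example values as the frame printed above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Drop the rows labelled 1, 2 and 3 (labels, which here happen to match positions)
by_default_axis = df.drop([1, 2, 3])

# Equivalent spelling that names the axis explicitly
by_index_keyword = df.drop(index=[1, 2, 3])

print(by_default_axis.equals(by_index_keyword))
```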
print(df.drop("A", axis=1))
     B    C
0  NaN  2.0
1  7.0  3.0
2  NaN  NaN
3  7.0  NaN
4  9.0  7.0
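Passing axis=1 tells drop() to look for the label among the columns instead of the index. A small sketch, again assuming the example values for df, compares it with the columns= keyword, which some readers find easier to scan:

```python
import numpy as np
import pandas as pd

# Same example values as the frame printed above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# axis=1 means "A" is treated as a column label
without_a = df.drop("A", axis=1)

# Equivalent call using the columns= keyword
without_a_keyword = df.drop(columns="A")

print(without_a.columns.tolist())
```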
print(df.isna().sum())
A    1
B    2
C    2
dtype: int64
print(df.dropna(subset=["B"]))
     A    B    C
1  4.0  7.0  3.0
3  NaN  7.0  NaN
4  5.0  9.0  7.0
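The two calls above fit together: isna().sum() counts the missing values per column, and dropna(subset=...) then removes rows based only on the listed columns. A minimal sketch, assuming the example values for df:

```python
import numpy as np
import pandas as pd

# Same example values as the frame printed above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# isna() yields a boolean frame; summing it counts the True values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Only column B is checked, so row 3 survives even though A and C are missing there
b_present = df.dropna(subset=["B"])
print(b_present)
```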
print(df.dropna(thresh=2))
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
4  5.0  9.0  7.0
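The thresh argument is a minimum count of non-missing values, not a maximum count of missing ones. A brief sketch, assuming the example values for df, makes the rule visible by counting the non-missing values in each row first:

```python
import numpy as np
import pandas as pd

# Same example values as the frame printed above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Number of non-missing values in each row
non_missing_per_row = df.notna().sum(axis=1)
print(non_missing_per_row)

# thresh=2 keeps rows that have at least two non-missing values
kept = df.dropna(thresh=2)

# The same selection written as an explicit boolean mask
kept_by_mask = df[non_missing_per_row >= 2]
print(kept.equals(kept_by_mask))
```

Rows 2 and 3 each have a single non-missing value, which is why they are the ones removed.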