Dealing with Missing Data in Python
Suraj Donthi
Deep Learning & Computer Vision Consultant
import pandas as pd
airquality = pd.read_csv('air-quality.csv', parse_dates=['Date'], index_col='Date')
airquality.head()
Ozone Solar Wind Temp
Date
1976-05-01 41.0 190.0 7.4 67
1976-05-02 36.0 118.0 8.0 72
1976-05-03 12.0 149.0 12.6 74
1976-05-04 18.0 313.0 11.5 62
1976-05-05 NaN NaN 14.3 56
airquality.isnull().sum()
Ozone 37
Solar 7
Wind 0
Temp 0
dtype: int64
airquality.isnull().mean() * 100
Ozone 24.183007
Solar 4.575163
Wind 0.000000
Temp 0.000000
dtype: float64
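The two summaries above can be reproduced on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are hypothetical stand-ins for the air-quality table, not data from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the air-quality table
toy = pd.DataFrame({
    "Ozone": [41.0, 36.0, np.nan, 18.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
})

# isnull() gives a boolean frame; summing counts the True values per column
missing_count = toy.isnull().sum()

# The mean of a boolean column is the fraction of True values
missing_percent = toy.isnull().mean() * 100

print(missing_count)
print(missing_percent)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand.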
The method argument of .fillna() can be set to
'ffill' or 'pad'
'bfill' or 'backfill'
Forward fill replaces NaNs with the last observed value.
'pad' is the same as 'ffill'.
Ozone values before forward filling:
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 NaN
1976-06-02 NaN
1976-06-03 NaN
1976-06-04 NaN
1976-06-05 NaN
1976-06-06 NaN
1976-06-07 29.0
1976-06-08 NaN
1976-06-09 71.0
airquality.fillna(method='ffill', inplace=True)
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 37.0
1976-06-02 37.0
1976-06-03 37.0
1976-06-04 37.0
1976-06-05 37.0
1976-06-06 37.0
1976-06-07 29.0
1976-06-08 29.0
1976-06-09 71.0
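Forward filling can also be tried on a small series. This is a sketch under assumptions: the `ozone` values are hypothetical, and it uses the `.ffill()` shortcut rather than `fillna(method='ffill')`, since recent pandas releases (2.1 and later) deprecate the `method` argument of `.fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series; the leading NaN has no earlier observation
ozone = pd.Series([np.nan, 37.0, np.nan, np.nan, 29.0, np.nan, 71.0])

# Forward fill: each NaN takes the last observed value before it
filled = ozone.ffill()

print(filled.tolist())
```

A NaN at the very start of the series stays missing, because there is no earlier value to carry forward.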
Backward fill replaces NaNs with the next observed value.
'backfill' is the same as 'bfill'.
Ozone values before backward filling:
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 NaN
1976-06-02 NaN
1976-06-03 NaN
1976-06-04 NaN
1976-06-05 NaN
1976-06-06 NaN
1976-06-07 29.0
1976-06-08 NaN
1976-06-09 71.0
airquality.fillna(method='bfill', inplace=True)
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 29.0
1976-06-02 29.0
1976-06-03 29.0
1976-06-04 29.0
1976-06-05 29.0
1976-06-06 29.0
1976-06-07 29.0
1976-06-08 71.0
1976-06-09 71.0
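The same idea in the other direction, again as a sketch on hypothetical values. It uses `.bfill()`, which is equivalent to `fillna(method='bfill')` in older pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series; the trailing NaN has no later observation
ozone = pd.Series([37.0, np.nan, np.nan, 29.0, np.nan, 71.0, np.nan])

# Backward fill: each NaN takes the next observed value after it
filled = ozone.bfill()

print(filled.tolist())
```

Here the NaN at the end of the series stays missing, because there is no later value to copy back.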
The .interpolate() method extends the sequence of values to the missing values.
The method argument of .interpolate() can be set to
'linear'
'quadratic'
'nearest'
Ozone values before linear interpolation:
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 NaN
1976-06-02 NaN
1976-06-03 NaN
1976-06-04 NaN
1976-06-05 NaN
1976-06-06 NaN
1976-06-07 29.0
1976-06-08 NaN
1976-06-09 71.0
airquality.interpolate(method='linear', inplace=True)
airquality['Ozone'][30:40]
Date Ozone
1976-05-31 37.0
1976-06-01 35.9
1976-06-02 34.7
1976-06-03 33.6
1976-06-04 32.4
1976-06-05 31.3
1976-06-06 30.1
1976-06-07 29.0
1976-06-08 50.0
1976-06-09 71.0
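The filled values above follow from simple arithmetic: the gap between 37.0 and 29.0 spans seven daily steps, so each step changes the value by (29 - 37) / 7, about -1.14. The sketch below rebuilds just these ten Ozone values by hand (the series is typed in from the output above, not read from the CSV) and checks that arithmetic.

```python
import numpy as np
import pandas as pd

# The ten Ozone values shown above, rebuilt by hand for illustration
dates = pd.date_range("1976-05-31", periods=10, freq="D")
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0], index=dates)

# Linear interpolation draws a straight line between neighbouring observations
linear = ozone.interpolate(method="linear")

# Size of one step across the six-NaN gap (seven intervals)
step = (29.0 - 37.0) / 7

print(linear.round(1))
```

With `method='linear'` pandas treats the rows as equally spaced, which matches this daily index.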
Ozone values before quadratic interpolation:
airquality['Ozone'][30:39]
Ozone
Date
1976-05-31 37.0
1976-06-01 NaN
1976-06-02 NaN
1976-06-03 NaN
1976-06-04 NaN
1976-06-05 NaN
1976-06-06 NaN
1976-06-07 29.0
1976-06-08 NaN
airquality.interpolate(method='quadratic', inplace=True)
airquality['Ozone'][30:39]
Ozone
Date
1976-05-31 37.0
1976-06-01 -38.4
1976-06-02 -79.4
1976-06-03 -85.9
1976-06-04 -62.4
1976-06-06 -2.8
1976-06-07 29.0
1976-06-08 62.2
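The negative Ozone values above show that a quadratic fit can leave the range of the observed data. The sketch below checks for that on a hand-built series. It assumes SciPy is installed, since pandas passes `method='quadratic'` through to SciPy. The exact interpolated numbers depend on which observations are in the series, so they will not match the output above.

```python
import numpy as np
import pandas as pd

# Hand-built series for illustration: three observations and seven gaps
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0])

# Quadratic interpolation fits a curve rather than straight segments
# (pandas delegates this method to SciPy)
quad = ozone.interpolate(method="quadratic")

# Interpolated values that fall outside the range of the observed data
outside = quad[(quad < ozone.min()) | (quad > ozone.max())]

print(quad.round(1))
print(outside.round(1))
```

A check like this is a quick way to spot physically implausible fills, such as negative concentrations.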
Ozone values before nearest-value interpolation:
airquality['Ozone'][30:39]
Date Ozone
1976-05-31 37.0
1976-06-01 NaN
1976-06-02 NaN
1976-06-03 NaN
1976-06-04 NaN
1976-06-05 NaN
1976-06-06 NaN
1976-06-07 29.0
1976-06-08 NaN
airquality.interpolate(method='nearest', inplace=True)
airquality['Ozone'][30:39]
Date Ozone
1976-05-31 37.0
1976-06-01 37.0
1976-06-02 37.0
1976-06-03 37.0
1976-06-04 29.0
1976-06-05 29.0
1976-06-06 29.0
1976-06-07 29.0
1976-06-08 29.0
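Nearest-value interpolation splits each gap at its midpoint, as the output above shows: the first three missing days take 37.0 and the last three take 29.0. This sketch reproduces that on a hand-built series. It assumes SciPy is installed, since pandas passes `method='nearest'` through to SciPy.

```python
import numpy as np
import pandas as pd

# Hand-built series for illustration: a six-NaN gap between two observations
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0])

# Each NaN takes the value of whichever observation is closest in position
nearest = ozone.interpolate(method="nearest")

print(nearest.tolist())
```

Unlike forward or backward fill, this uses observations on both sides of the gap.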