Analyze the amount of missingness

Dealing with Missing Data in Python

Suraj Donthi

Deep Learning & Computer Vision Consultant

Load Air Quality dataset

Air Quality dataset

  • contains the sensor recordings of Ozone, Solar, Temperature and Wind
import pandas as pd

airquality = pd.read_csv('air-quality.csv',
                         parse_dates=['Date'],
                         index_col='Date')

airquality.head()
              Ozone  Solar  Wind  Temp
Date                                
1976-05-01   41.0  190.0   7.4    67
1976-05-02   36.0  118.0   8.0    72
1976-05-03   12.0  149.0  12.6    74
1976-05-04   18.0  313.0  11.5    62
1976-05-05    NaN    NaN  14.3    56

Nullity DataFrame

  • Use either the .isnull() or the .isna() method on the DataFrame (they are aliases and return the same Boolean mask)
airquality_nullity = airquality.isnull()
airquality_nullity.head()
            Ozone  Solar   Wind   Temp
Date                                  
1976-05-01  False  False  False  False
1976-05-02  False  False  False  False
1976-05-03  False  False  False  False
1976-05-04  False  False  False  False
1976-05-05   True   True  False  False
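
As a minimal sketch of the nullity step, the block below rebuilds the five rows shown in the head() output by hand (the real CSV is not loaded here) and checks that .isna() and .isnull() give the same mask. The name `sample` is an illustrative stand-in, not part of the original lesson.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the five rows printed by head(); stand-in for the CSV.
sample = pd.DataFrame(
    {
        "Ozone": [41.0, 36.0, 12.0, 18.0, np.nan],
        "Solar": [190.0, 118.0, 149.0, 313.0, np.nan],
        "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
        "Temp": [67, 72, 74, 62, 56],
    },
    index=pd.date_range("1976-05-01", periods=5, name="Date"),
)

# .isna() and .isnull() are aliases, so the two masks are identical.
nullity_isna = sample.isna()
nullity_isnull = sample.isnull()
same_mask = nullity_isna.equals(nullity_isnull)

# .notna() is the complement: True where a value is present.
present = sample.notna()

print(same_mask)
print(nullity_isna.tail(1))
```

The mask has the same shape and index as the data, which is what lets the later slides call .sum() and .mean() on it column by column.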

Total missing values

airquality_nullity.sum()
Ozone    37
Solar     7
Wind      0
Temp      0
dtype: int64

Percentage of missingness

airquality_nullity.mean() * 100
Ozone    24.183007
Solar     4.575163
Wind      0.000000
Temp      0.000000
dtype: float64
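
The counts and percentages are two views of the same mask: .sum() counts the True values and .mean() gives their fraction, so 37 missing Ozone readings at 24.183007% imply 153 rows (37 / 153). The sketch below puts both in one table; `missing_summary` and the small `sample` frame are hypothetical names and data made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    nullity = df.isna()
    summary = pd.DataFrame(
        {
            "n_missing": nullity.sum(),          # True counts as 1
            "pct_missing": nullity.mean() * 100,  # fraction of rows, as a percent
        }
    )
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical four-row frame, not the real air quality data.
sample = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, 12.0, np.nan],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)

summary = missing_summary(sample)
print(summary)
```

Sorting by the count puts the columns that need the most attention at the top, which may be handy on wider datasets.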

Nullity Bar

Missingno package

  • Package for graphical analysis of missing values
import missingno as msno
msno.bar(airquality)

Missingness bar for air quality dataset


Nullity Matrix

msno.matrix(airquality)

Missingness matrix for air quality dataset

Nullity Matrix for time-series data

msno.matrix(airquality, freq='M')

Missingness matrix with monthly frequency for air quality dataset
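
A numeric companion to the monthly matrix can be sketched with pandas alone: group the nullity mask by calendar month and count the missing values. This is not what missingno does internally; it is an assumed alternative view, and the `sample` frame with its gaps is hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for May to July 1976 (92 days), not the real data.
dates = pd.date_range("1976-05-01", "1976-07-31", freq="D", name="Date")
sample = pd.DataFrame(
    {
        "Ozone": np.arange(len(dates), dtype=float),
        "Wind": np.ones(len(dates)),
    },
    index=dates,
)

# Punch two gaps into Ozone: three days in May and five days in June.
sample.loc["1976-05-01":"1976-05-03", "Ozone"] = np.nan
sample.loc["1976-06-10":"1976-06-14", "Ozone"] = np.nan

# Group the Boolean mask by calendar month and count the True values.
months = sample.index.to_period("M")
monthly_missing = sample.isna().groupby(months).sum()
print(monthly_missing)
```

The table answers the same question as the plot (which months hold the gaps) in a form that can be checked or thresholded in code.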

Fine tuning the matrix

msno.matrix(airquality.loc['May-1976': 'Jul-1976'], freq='M')

Missingness matrix with monthly frequency for May to July 1976 of the air quality dataset
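
The slice relies on partial string indexing of a sorted DatetimeIndex: a month-level label selects the whole month, and the end label is inclusive. A small sketch on hypothetical data, written here with ISO-style month strings:

```python
import numpy as np
import pandas as pd

# Hypothetical daily index from April to September 1976, not the real data.
dates = pd.date_range("1976-04-01", "1976-09-30", freq="D", name="Date")
sample = pd.DataFrame({"Ozone": np.arange(len(dates), dtype=float)}, index=dates)

# Month-level labels: the slice runs from the first day of May
# through the last day of July, both ends included.
subset = sample.loc["1976-05":"1976-07"]

print(subset.index.min(), subset.index.max(), len(subset))
```

Narrowing the date range before plotting keeps each row of the matrix readable when the full series is long.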


Summary

In this lesson we learned to analyze

  • the amount of missingness numerically
  • the amount of missingness graphically
  • the percentage of missingness
  • the nullity matrix for regular datasets
  • the nullity matrix for time-series datasets

Let's practice!
