Dealing with Missing Data in Python
Suraj Donthi
Deep Learning & Computer Vision Consultant
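Missing values are often encoded as placeholder strings such as 'NA', '-' or '.', etc.
import numpy as np
import pandas as pd

college = pd.read_csv('college.csv')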
college.head()
   gradrat            lenroll rmbrd  private  stufac    csat   act
0     59.0       5.1761497326  3.75      1.0    10.8       .  21.0
1     52.0       4.7791234931  3.74      1.0    17.7       .  21.0
2     75.0  6.122492809500001     .      1.0    11.4  1052.0  24.0
3     56.0       5.3181199938   4.1      1.0    11.6   940.0  23.0
4     71.0  5.631211781799999     .      1.0    18.3       .  17.0
college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 gradrat 200 non-null object
1 lenroll 200 non-null object
2 rmbrd 200 non-null object
3 private 200 non-null float64
4 stufac 200 non-null object
5 csat 200 non-null object
6 act 200 non-null object
dtypes: float64(1), object(6)
memory usage: 11.1+ KB
csat_unique = college.csat.unique()
np.sort(csat_unique)
array(['.', '1000.0', '1006.0', '1010.0', '1013.0', '1020.0', '1024.0',
'1026.0', '1028.0', '1036.0', '1039.0', '1040.0', '1044.0',
'1045.0', '1050.0', '1052.0', '1060.0', '1070.0', '1080.0',
'1092.0', '1096.0', '1109.0', '1111.0', '1120.0', '1139.0',
... ... ... ... ... ...
... ... ... ... ... ...
'940.0', '943.0', '947.0', '950.0', '951.0', '964.0', '970.0',
'979.0', '980.0', '989.0', '992.0', '994.0', '996.0', '997.0',
'998.0'], dtype=object)
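The listing above shows that a single '.' among the values is enough to keep csat stored as strings. As a minimal sketch (the helper name find_non_numeric and the toy values are hypothetical, not part of the course material), one way to list exactly which strings block numeric parsing is to coerce the column with pd.to_numeric and inspect what failed:

```python
import pandas as pd

def find_non_numeric(series):
    """Return the distinct non-null values in `series` that cannot be parsed as numbers."""
    # errors='coerce' turns unparseable entries into NaN instead of raising
    parsed = pd.to_numeric(series, errors='coerce')
    # Keep entries that were present originally but failed to parse
    failed = series[parsed.isna() & series.notna()]
    return sorted(failed.unique())

# Toy column imitating college.csat before cleaning (values are illustrative)
csat = pd.Series(['.', '1052.0', '940.0', '.', '1000.0'])
print(find_non_numeric(csat))
```

The strings this returns are candidates for the na_values argument of pd.read_csv.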
college = pd.read_csv('college.csv', na_values='.')
college.head()
   gradrat   lenroll  rmbrd  private  stufac    csat   act
0     59.0  5.176150   3.75      1.0    10.8     NaN  21.0
1     52.0  4.779123   3.74      1.0    17.7     NaN  21.0
2     75.0  6.122493    NaN      1.0    11.4  1052.0  24.0
3     56.0  5.318120   4.10      1.0    11.6   940.0  23.0
4     71.0  5.631212    NaN      1.0    18.3     NaN  17.0
college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 gradrat 187 non-null float64
1 lenroll 199 non-null float64
2 rmbrd 114 non-null float64
3 private 200 non-null float64
4 stufac 199 non-null float64
5 csat 105 non-null float64
6 act 104 non-null float64
dtypes: float64(7)
memory usage: 11.0 KB
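When a file mixes several placeholders, na_values also accepts a list of markers (or a dict mapping column names to markers). The following is a rough sketch using a small in-memory CSV; the data and the 'missing' marker are made up for illustration and are not taken from college.csv:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a real file (values are made up)
raw = io.StringIO(
    "gradrat,rmbrd,csat\n"
    "59.0,3.75,.\n"
    "52.0,-,1052.0\n"
    "missing,4.1,940.0\n"
)

# A list applies every marker to all columns; a dict such as
# {'csat': ['.'], 'rmbrd': ['-']} would restrict markers to named columns.
sample = pd.read_csv(raw, na_values=['.', '-', 'missing'])

print(sample.dtypes)        # dtype of each column after parsing
print(sample.isna().sum())  # number of missing values per column
```

With every placeholder declared, each column should parse as a numeric dtype rather than object.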
diabetes = pd.read_csv('pima-indians-diabetes.csv')
diabetes.head()
   Pregnant  Glucose  Diastolic_BP  Skin_Fold  Serum_Insulin   BMI  Diabetes_Pedigree  Age  Class
0       6.0    148.0          72.0       35.0            NaN  33.6              0.627   50    1.0
1       1.0     85.0          66.0       29.0            NaN  26.6              0.351   31    0.0
2       8.0    183.0          64.0        NaN            NaN  23.3              0.672   32    1.0
3       1.0     89.0          66.0       23.0           94.0  28.1              0.167   21    0.0
4       0.0    137.0          40.0       35.0          168.0  43.1              2.288   33    1.0
diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 Pregnant 768 non-null float64
1 Glucose 763 non-null float64
2 Diastolic_BP 733 non-null float64
3 Skin_Fold 541 non-null float64
4 Serum_Insulin 394 non-null float64
5 BMI 757 non-null float64
6 Diabetes_Pedigree 768 non-null float64
7 Age 768 non-null int64
8 Class 768 non-null float64
dtypes: float64(8), int64(1)
memory usage: 54.1 KB
diabetes.describe()
Pregnant Glucose Diastolic_BP Skin_Fold Serum_Insulin BMI Diabetes_Pedigree Age Class
count 768.000000 763.000000 733.000000 541.000000 394.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.686763 72.405184 29.153420 155.548223 31.992578 0.471876 33.240885 0.348958
std 3.369578 30.535641 12.382158 10.476982 118.775855 7.884160 0.331329 11.760232 0.476951
min 0.000000 44.000000 24.000000 7.000000 14.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 64.000000 22.000000 76.250000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 29.000000 125.000000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 141.000000 80.000000 36.000000 190.000000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
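The summary above reports a minimum BMI of 0.0, which is not a plausible measurement and hints that zeros stand in for missing values. A minimal sketch of how such zeros might be counted per column follows; the toy frame and the choice of suspect_cols are assumptions made for illustration, not the course's own data or method:

```python
import pandas as pd

# Toy frame imitating a few diabetes columns (values are illustrative)
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, 0.0, 89.0],
    'BMI': [33.6, 0.0, 0.0, 28.1],
    'Pregnant': [6.0, 1.0, 0.0, 1.0],  # here 0 is a legitimate value
})

# Columns where a zero cannot be a real measurement (an assumption for this sketch)
suspect_cols = ['Glucose', 'BMI']

# Comparing to 0 gives a boolean frame; summing counts the True entries per column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

Columns such as Pregnant are left out deliberately, since a zero there carries real information.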
diabetes.BMI[diabetes.BMI == 0]
9      0.0
49     0.0
60     0.0
81     0.0
145    0.0
371    0.0
426    0.0
494    0.0
522    0.0
684    0.0
706    0.0
Name: BMI, dtype: float64
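# Use .loc to assign in place; chained indexing (diabetes.BMI[mask] = ...) may act on a copy
diabetes.loc[diabetes.BMI == 0, 'BMI'] = np.nan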
diabetes.BMI[np.isnan(diabetes.BMI)]
# OR
diabetes.BMI[diabetes.BMI.isna()]
# OR
diabetes.BMI[diabetes.BMI.isnull()]
9     NaN
49    NaN
60    NaN
81    NaN
145   NaN
371   NaN
426   NaN
494   NaN
522   NaN
684   NaN
706   NaN
Name: BMI, dtype: float64
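The same zero-to-NaN conversion can be applied to several columns at once. The sketch below uses DataFrame.replace on a toy frame; the values and the list of columns are hypothetical, chosen only to show the pattern:

```python
import numpy as np
import pandas as pd

# Toy frame imitating a few diabetes columns (values are illustrative)
toy = pd.DataFrame({
    'BMI': [33.6, 0.0, 23.3, 0.0],
    'Skin_Fold': [35.0, 29.0, 0.0, 23.0],
    'Age': [50, 31, 32, 21],
})

# Columns in which 0 is treated as a missing-value marker (an assumption for this sketch)
cols = ['BMI', 'Skin_Fold']

# replace() swaps every 0 in the selected columns for NaN; Age is left untouched
toy[cols] = toy[cols].replace(0, np.nan)

# isna() and isnull() are aliases, so either one counts the new NaN entries
print(toy.isna().sum())
```

Restricting the replacement to named columns avoids erasing zeros that are genuine values elsewhere in the frame.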