Missing value imputation using transform()

Writing Efficient Code with pandas

Leonidas Souliotis

PhD Candidate

Counting missing values

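# Non-missing total_bill values per group, before and after values were removed
prior_counts = restaurant.groupby('time')['total_bill'].count()
missing_counts = restaurant_nan.groupby('time')['total_bill'].count()

# count() ignores NaN, so the difference is the number of missing values per group
print(prior_counts - missing_counts)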

time
Dinner    32
Lunch     13
Name: total_bill, dtype: int64
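
The subtraction above needs the original, complete DataFrame. When only the data with gaps is available, the same per-group counts can be obtained directly from the missing-value mask. A minimal sketch, assuming a small hypothetical DataFrame `toy` (made-up values, not the course data) standing in for `restaurant_nan`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for restaurant_nan; values are made up for illustration
toy = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "total_bill": [10.0, np.nan, 20.0, np.nan, np.nan],
})

# isna() marks missing entries; summing the booleans per group counts them
missing_per_group = toy["total_bill"].isna().groupby(toy["time"]).sum()
print(missing_per_group)
```

Here `missing_per_group` is indexed by `time`, just like the result of the subtraction.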

Missing value imputation

missing_trans = lambda x: x.fillna(x.mean())

restaurant_nan_grouped = restaurant_nan.groupby('time')['total_bill']
restaurant_nan_grouped.transform(missing_trans)

Time using .transform(): 0.00368881225586 sec

0    20.676573
1    10.340000
2    21.010000
3    23.680000
4    24.590000
5    25.290000
6    20.676573
Name: total_bill, dtype: float64
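
To see what `.transform()` does here on data small enough to check by hand, the sketch below applies the same lambda to a hypothetical six-row DataFrame `toy` (made-up values, not the course data). Each missing bill is replaced by the mean of its own `time` group, and the result keeps the original row order and index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing bill in each group
toy = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch", "Lunch"],
    "total_bill": [10.0, np.nan, 30.0, 8.0, 12.0, np.nan],
})

# Same transformation as on the slide: fill NaN with the group's mean
missing_trans = lambda x: x.fillna(x.mean())
imputed = toy.groupby("time")["total_bill"].transform(missing_trans)

# Dinner mean is (10 + 30) / 2 = 20, Lunch mean is (8 + 12) / 2 = 10
print(imputed)
```

Because the output is aligned with the input, it can be assigned straight back, for example `toy["total_bill"] = imputed`.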

Comparison with native methods

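import time
import pandas as pd

start_time = time.time()
# Per-group means; mean() skips NaN by default
mean_din = restaurant_nan.loc[restaurant_nan['time'] == 'Dinner', 'total_bill'].mean()
mean_lun = restaurant_nan.loc[restaurant_nan['time'] == 'Lunch', 'total_bill'].mean()

# Row-by-row loop (assumes the default 0..n-1 index):
# fill each missing total_bill with the mean of its group
for row in range(len(restaurant_nan)):
    if pd.isna(restaurant_nan.iloc[row]['total_bill']):
        if restaurant_nan.iloc[row]['time'] == 'Dinner':
            restaurant_nan.loc[row, 'total_bill'] = mean_din
        else:
            restaurant_nan.loc[row, 'total_bill'] = mean_lun
print("Results from the above operation calculated in %s seconds" % (time.time() - start_time))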

Time using native Python: 0.172566890717 sec

Difference in time: 4,578.115%
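
The percentage appears to be the relative difference between the two timings, measured against the `.transform()` time. A short sketch of that arithmetic, using the two timings reported above (the variable names are only illustrative):

```python
# Timings reported on the slides, in seconds
t_transform = 0.00368881225586
t_loop = 0.172566890717

# Relative difference: how much longer the loop takes, as a percentage
# of the .transform() time
pct_diff = (t_loop - t_transform) / t_transform * 100
print(f"Difference in time: {pct_diff:,.3f}%")
```

Put another way, the loop takes roughly 47 times as long as `.transform()` on this data.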

Let's do it!
