Data filtration using the filter() function

Writing Efficient Code with pandas

Leonidas Souliotis

PhD Candidate

Purpose of filter()

Keep only the groups that satisfy a condition on an aggregate feature, such as:

Number of missing values
Mean of a specific feature
Number of occurrences of the group
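
As a minimal sketch of these three criteria, the snippet below applies each one as a `groupby().filter()` predicate. The DataFrame `df` and its values are hypothetical toy data, not the course's restaurant dataset; only the column names mirror the ones used in these slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): not the course's restaurant dataset
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, np.nan, 25.0, 30.0, 18.0, 24.0],
    "tip": [1.5, 2.0, 3.0, 4.0, 2.5, 3.5],
})
grouped = df.groupby("day")

# Number of missing values: keep groups with no missing total_bill
no_missing = grouped.filter(lambda g: g["total_bill"].isna().sum() == 0)

# Mean of a specific feature: keep groups whose mean total_bill exceeds 20
high_mean = grouped.filter(lambda g: g["total_bill"].mean() > 20)

# Number of occurrences of the group: keep groups with at least 3 rows
big_groups = grouped.filter(lambda g: len(g) >= 3)
```

In each case the predicate receives one group as a DataFrame and returns a single boolean; the result contains every row of the groups that passed, in their original order.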

Filter using groupby().filter()

restaurant_grouped = restaurant.groupby('day')
filter_trans = lambda x : x['total_bill'].mean() > 20
restaurant_filtered = restaurant_grouped.filter(filter_trans)

Time using .filter(): 0.00414085388184 sec

print(restaurant_filtered['tip'].mean())

3.11527607362

print(restaurant['tip'].mean())

2.9982786885245902
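
The two means differ because `.filter()` returns the original rows of the qualifying groups rather than one aggregated row per group, so the filtered mean is taken over fewer rows. The sketch below illustrates this on hypothetical toy data (an assumption, not the restaurant dataset) and compares the result with a boolean mask built from `transform('mean')`, an alternative that is not part of the original slides.

```python
import pandas as pd

# Hypothetical toy data (assumption): not the course's restaurant dataset
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 14.0, 25.0, 30.0, 18.0, 24.0],
    "tip": [1.5, 2.0, 3.0, 4.0, 2.5, 3.5],
})

# Rows from days whose mean total_bill exceeds 20
filtered = df.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)

# Alternative: broadcast each group's mean back to its rows, then mask
mask = df.groupby("day")["total_bill"].transform("mean") > 20
masked = df[mask]

# Mean tip over the kept rows versus over all rows
filtered_tip_mean = filtered["tip"].mean()
overall_tip_mean = df["tip"].mean()
```

Here the Thur rows are dropped because their mean bill is 12, while every Fri and Sat row is kept, including the Sat row with a bill of 18.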

Comparison with native methods

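import pandas as pd

# Collect the tips of each day whose mean total_bill exceeds 20
t = [restaurant.loc[restaurant['day'] == i]['tip'] for i in restaurant['day'].unique()
     if restaurant.loc[restaurant['day'] == i]['total_bill'].mean() > 20]
# Series.append was removed in pandas 2.0, so the pieces are joined with pd.concat
restaurant_filtered = pd.concat(t, ignore_index=True)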

Time using native Python: 0.00663900375366 sec

print(restaurant_filtered.mean())

3.11527607362

Difference in time: 60.329341317157024%
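
The percentage appears to be the extra time of the native approach relative to `.filter()`. A minimal sketch of that arithmetic, using the two timings reported on these slides; the helper name `percent_slower` is hypothetical:

```python
def percent_slower(slow_sec: float, fast_sec: float) -> float:
    """Return how much longer slow_sec is than fast_sec, in percent."""
    return (slow_sec - fast_sec) / fast_sec * 100


# Timings as reported on the slides, in seconds
time_filter = 0.00414085388184
time_native = 0.00663900375366

difference = percent_slower(time_native, time_filter)
print(f"Difference in time: {difference:.2f}%")
```

Timings this small vary from run to run, so the exact percentage will differ on another machine or another execution.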

Let's do it!
