Initial EDA

Winning a Kaggle Competition in Python

Yauhen Babakhin

Kaggle Grandmaster

Goals of EDA

 

  • Size of the data
  • Properties of the target variable
  • Properties of the features
  • Generate ideas for feature engineering

Two Sigma Connect: Rental Listing Inquiries

 

Problem statement
Predict the popularity of an apartment rental listing

Target variable
interest_level

Problem type
Classification with 3 classes: 'high', 'medium' and 'low'

Metric
Multi-class logarithmic loss
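
To make the metric concrete, here is a minimal sketch of multi-class logarithmic loss: the negative mean log of the probability assigned to the true class. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy predictions are illustrative assumptions, not the competition's own scoring code.

```python
import math

CLASSES = ['high', 'medium', 'low']

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: list of class labels
    y_prob: list of dicts mapping each label to its predicted probability
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip the probability to avoid log(0)
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Toy labels and predictions (made up for illustration)
y_true = ['low', 'medium', 'high']
uniform = [{c: 1 / 3 for c in CLASSES} for _ in y_true]
confident = [{'high': 0.1, 'medium': 0.1, 'low': 0.8},
             {'high': 0.1, 'medium': 0.8, 'low': 0.1},
             {'high': 0.8, 'medium': 0.1, 'low': 0.1}]

uniform_loss = multiclass_log_loss(y_true, uniform)
confident_loss = multiclass_log_loss(y_true, confident)
print(uniform_loss, confident_loss)
```

Lower is better: putting 0.8 on the correct class scores well below the uniform guess of 1/3 per class.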


EDA. Part I

import pandas as pd

# Size of the data
twosigma_train = pd.read_csv('twosigma_train.csv')
print('Train shape:', twosigma_train.shape)

twosigma_test = pd.read_csv('twosigma_test.csv')
print('Test shape:', twosigma_test.shape)

Train shape: (49352, 11)
Test shape: (74659, 10)

EDA. Part I

# List the column names
print(twosigma_train.columns.tolist())

['id', 'bathrooms', 'bedrooms', 'building_id', 'latitude', 'longitude',
 'manager_id', 'price', 'interest_level']

# Count the listings in each target class
twosigma_train.interest_level.value_counts()

low       34284
medium    11229
high       3839
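
The counts read more easily as shares. A minimal sketch, assuming only the three counts shown in the output (the names `counts` and `shares` are illustrative); on the full DataFrame, `value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output
counts = pd.Series({'low': 34284, 'medium': 11229, 'high': 3839})

# Share of each class in the train data
shares = counts / counts.sum()
print(shares.round(3))
```

The classes are clearly imbalanced: 'low' makes up roughly 70% of the train listings, while 'high' is under 8%.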

EDA. Part I

# Describe the train data
twosigma_train.describe()

         bathrooms      bedrooms      latitude     longitude         price
count  49352.00000  49352.000000  49352.000000  49352.000000  4.935200e+04
mean       1.21218      1.541640     40.741545    -73.955716  3.830174e+03
std        0.50142      1.115018      0.638535      1.177912  2.206687e+04
min        0.00000      0.000000      0.000000   -118.271000  4.300000e+01
25%        1.00000      1.000000     40.728300    -73.991700  2.500000e+03
50%        1.00000      1.000000     40.751800    -73.977900  3.150000e+03
75%        1.00000      2.000000     40.774300    -73.954800  4.100000e+03
max       10.00000      8.000000     44.883500      0.000000  4.490000e+06
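
The summary hints at suspicious values: a latitude and a longitude of exactly 0, and a maximum price of about 4.49 million. A minimal sketch of how such rows might be flagged, using made-up toy rows rather than the real file; the 100x-median price cut-off is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up) that mimic the extremes seen in describe()
toy = pd.DataFrame({
    'latitude':  [40.7283, 40.7518, 0.0, 40.7743],
    'longitude': [-73.9917, -73.9779, 0.0, -73.9548],
    'price':     [2500, 3150, 4100, 4490000],
})

# Zero coordinates probably stand in for a missing location
zero_coords = (toy['latitude'] == 0) | (toy['longitude'] == 0)

# Hypothetical cut-off: prices more than 100x the median
price_cap = toy['price'].median() * 100
extreme_price = toy['price'] > price_cap

suspicious = toy[zero_coords | extreme_price]
print(suspicious)
```

Whether to drop, cap, or keep such rows is a modelling decision; the sketch only surfaces them for inspection.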

EDA. Part II

 

import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Find the median price by the interest level
prices = twosigma_train.groupby('interest_level', as_index=False)['price'].median()
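
By default, `groupby` sorts string keys alphabetically, so the levels come back as high, low, medium. A minimal sketch of reordering them before plotting, using a made-up toy frame (the prices are invented and say nothing about the real data); the `order` list is an assumption about the desired display order.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'interest_level': ['low', 'low', 'medium', 'medium', 'high', 'high'],
    'price': [3500, 3300, 3000, 2800, 2500, 2300],
})

# Same aggregation as on the slide; rows come back in alphabetical order
prices = toy.groupby('interest_level', as_index=False)['price'].median()

# Reorder the rows from least to most interest
order = ['low', 'medium', 'high']
prices = prices.set_index('interest_level').loc[order].reset_index()
print(prices)
```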

EDA. Part II

# Draw a barplot
fig = plt.figure(figsize=(7, 5))
plt.bar(prices.interest_level, prices.price, width=0.5, alpha=0.8)

# Set titles
plt.xlabel('Interest level')
plt.ylabel('Median price')
plt.title('Median listing price across interest level')

# Show the plot
plt.show()

Figure: bar plot of the median listing price across interest level


Let's practice!
