Generating new features

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Correlation

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing 0.54 Pearson correlation coefficient between Price and Duration
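
As a minimal sketch of what the heatmap is built from, the snippet below computes a Pearson correlation matrix on a tiny made-up frame. The column values are invented for illustration and are not taken from the planes dataset, and the sketch assumes pandas 1.5 or later, where `numeric_only=True` is available to skip non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "Airline": ["A", "B", "A", "C"],
    "Duration": [2.0, 5.0, 3.0, 8.0],
    "Price": [4000.0, 9000.0, 5000.0, 15000.0],
})

# Restrict the calculation to numeric columns; "Airline" is skipped
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix.round(2))
```

The resulting square DataFrame is the object that `sns.heatmap()` colours and annotates.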

Viewing data types

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                object
Additional_Info            object
Price                     float64
dtype: object

Total stops

print(planes["Total_Stops"].value_counts())
1 stop      4107
non-stop    2584
2 stops     1127
3 stops       29
4 stops        1
Name: Total_Stops, dtype: int64

Cleaning total stops

# Strip the plural suffix first, then the singular one
planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stops", "")
planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stop", "")
# "non-stop" contains no leading space, so it is still intact at this point
planes["Total_Stops"] = planes["Total_Stops"].str.replace("non-stop", "0")
planes["Total_Stops"] = planes["Total_Stops"].astype(int)
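
To see why the order of the replacements matters, here is a small self-contained sketch run on a made-up Series that contains each label from the value counts. The name `stops` is hypothetical, and `regex=False` is passed only to make the literal matching explicit across pandas versions.

```python
import pandas as pd

# Toy Series covering every label seen in the value counts
stops = pd.Series(["1 stop", "non-stop", "2 stops", "3 stops", "4 stops"])

# Remove the plural suffix first, then the singular one
stops = stops.str.replace(" stops", "", regex=False)
stops = stops.str.replace(" stop", "", regex=False)
# "non-stop" has no space before "stop", so it survives the two steps above
stops = stops.str.replace("non-stop", "0", regex=False)
stops = stops.astype(int)

print(stops.tolist())  # [1, 0, 2, 3, 4]
```

Replacing " stop" before " stops" would leave a stray "s" behind ("2s"), and the conversion to `int` would then fail.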

Correlation

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing 0.62 Pearson correlation coefficient between Price and Total Stops and 0.74 correlation between Duration and Total Stops

Dates

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                 int64
Additional_Info            object
Price                     float64
dtype: object

Extracting month and weekday

planes["month"] = planes["Date_of_Journey"].dt.month
# weekday is numbered from 0 (Monday) to 6 (Sunday)
planes["weekday"] = planes["Date_of_Journey"].dt.weekday
print(planes[["month", "weekday", "Date_of_Journey"]].head())
   month  weekday   Date_of_Journey
0      9        4        2019-09-06
1     12        3        2019-12-05
2      1        3        2019-01-03
3      6        0        2019-06-24
4     12        1        2019-12-03
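
The following sketch reproduces the extraction on three of the dates shown in the output, and adds `dt.day_name()` as one possible way to make the weekday numbers readable. The frame name `dates` and the `weekday_name` column are hypothetical additions for illustration.

```python
import pandas as pd

# Three of the journey dates from the output, parsed as datetimes
dates = pd.DataFrame({
    "Date_of_Journey": pd.to_datetime(["2019-09-06", "2019-12-05", "2019-06-24"])
})

dates["month"] = dates["Date_of_Journey"].dt.month
dates["weekday"] = dates["Date_of_Journey"].dt.weekday        # Monday is 0
dates["weekday_name"] = dates["Date_of_Journey"].dt.day_name()

print(dates)
```

A weekday of 4 therefore corresponds to Friday, 3 to Thursday, and 0 to Monday.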

Departure and arrival times

planes["Dep_Hour"] = planes["Dep_Time"].dt.hour
planes["Arrival_Hour"] = planes["Arrival_Time"].dt.hour
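
A minimal sketch of the same extraction on made-up timestamps is shown below. The times are invented for illustration, and the frame name `times` is hypothetical.

```python
import pandas as pd

# Hypothetical departure and arrival timestamps
times = pd.DataFrame({
    "Dep_Time": pd.to_datetime(["2019-09-06 09:25:00", "2019-12-05 18:05:00"]),
    "Arrival_Time": pd.to_datetime(["2019-09-07 04:25:00", "2019-12-05 23:30:00"]),
})

# dt.hour returns the hour of day as an integer from 0 to 23
times["Dep_Hour"] = times["Dep_Time"].dt.hour
times["Arrival_Hour"] = times["Arrival_Time"].dt.hour

print(times[["Dep_Hour", "Arrival_Hour"]])
```

Because the hours are plain integers, they can be included in the correlation matrix alongside the other numeric columns.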

Correlation

Heatmap showing no relationship between datetime attributes and price

Creating categories

print(planes["Price"].describe())
count     7848.000000
mean      9035.413609
std       4429.822081
min       1759.000000
25%       5228.000000
50%       8355.000000
75%      12373.000000
max      54826.000000
Name: Price, dtype: float64
Range              Ticket Type
<= 5228            Economy
> 5228 <= 8355     Premium Economy
> 8355 <= 12373    Business Class
> 12373            First Class

Descriptive statistics

twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()
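
As an alternative sketch, the three quartiles can also be requested in a single `quantile()` call by passing a list. The `prices` Series below is made up so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical prices chosen so the quartiles fall on round numbers
prices = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])

# Passing a list returns a Series indexed by the requested quantiles
quartiles = prices.quantile([0.25, 0.5, 0.75])
twenty_fifth, median, seventy_fifth = quartiles.tolist()
maximum = prices.max()

print(twenty_fifth, median, seventy_fifth, maximum)  # 2000.0 3000.0 4000.0 5000.0
```

The 0.5 quantile is the same value that `median()` returns.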

Labels and bins

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
# Five bin edges define four bins; 0 is the lower edge of the first bin
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

pd.cut()

Call pd.cut(), pass the data, set the labels, and provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
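
The sketch below runs the same call end to end on a small made-up Series, so the binning behaviour can be checked by hand. The `prices` values are invented, and `categories` is a hypothetical name.

```python
import pandas as pd

# Hypothetical prices; quartiles are 2000, 3000 and 4000
prices = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, prices.quantile(0.25), prices.median(),
        prices.quantile(0.75), prices.max()]

# Bins are right-inclusive by default: (0, 2000], (2000, 3000], ...
categories = pd.cut(prices, bins=bins, labels=labels)

print(categories.tolist())
# ['Economy', 'Economy', 'Premium Economy', 'Business Class', 'First Class']
```

Because each bin includes its right edge but not its left, a value equal to the lowest edge (here 0) would be left uncategorized unless `include_lowest=True` is passed.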

Price categories

print(planes[["Price","Price_Category"]].head())
     Price   Price_Category
0  13882.0      First Class
1   6218.0  Premium Economy
2  13302.0      First Class
3   3873.0          Economy
4  11087.0   Business Class

Price category by airline

sns.countplot(data=planes, x="Airline", hue="Price_Category")
plt.show()

Countplot showing the number of flights per airline in different price categories, with Jet Airways having the largest number of First Class tickets
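
The counts behind a plot like this can also be inspected as a table. The sketch below uses `pd.crosstab()` on a few invented rows; the frame `toy` and its contents are hypothetical and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "Airline": ["Jet Airways", "Jet Airways", "IndiGo", "IndiGo", "IndiGo"],
    "Price_Category": ["First Class", "Business Class",
                       "Economy", "Economy", "Premium Economy"],
})

# Rows are airlines, columns are price categories, cells are flight counts
counts = pd.crosstab(toy["Airline"], toy["Price_Category"])
print(counts)
```

Each cell of the table corresponds to the height of one bar in the countplot.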

Let's practice!
