Generating new features

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Correlation

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing 0.54 Pearson correlation coefficient between Price and Duration
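A minimal sketch of what `numeric_only=True` does before the heatmap is drawn. The tiny DataFrame below is a hypothetical stand-in for `planes`, and its values are made up for illustration, not taken from the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for planes; values are illustrative only
sample = pd.DataFrame({
    "Airline": ["A", "B", "C", "D"],          # object column, excluded from corr
    "Duration": [2.5, 5.0, 7.5, 10.0],
    "Price": [4000.0, 6000.0, 9000.0, 11000.0],
})

# numeric_only=True keeps only numeric columns when computing Pearson correlation
corr = sample.corr(numeric_only=True)
print(corr)
```

The resulting matrix is what `sns.heatmap()` receives, so non-numeric columns such as `Airline` never appear on the axes.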

Viewing data types

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                object
Additional_Info            object
Price                     float64
dtype: object

Total stops

print(planes["Total_Stops"].value_counts())
1 stop      4107
non-stop    2584
2 stops     1127
3 stops       29
4 stops        1
Name: Total_Stops, dtype: int64

Cleaning total stops

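# Remove the " stops" suffix first, then " stop", so only the digit remains
planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stops", "")
planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stop", "")
# Direct flights have zero stops
planes["Total_Stops"] = planes["Total_Stops"].str.replace("non-stop", "0")
# Convert the cleaned strings to integers
planes["Total_Stops"] = planes["Total_Stops"].astype(int)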
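The cleaning steps can be checked on a small Series. This is a sketch, assuming a hypothetical sample that contains one entry per category seen in `value_counts()`; `regex=False` is passed explicitly so the replacements are literal regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample with one entry per category from value_counts()
stops = pd.Series(["1 stop", "non-stop", "2 stops", "3 stops", "4 stops"])

# Plural suffix first, then singular, then map direct flights to "0"
stops = stops.str.replace(" stops", "", regex=False)
stops = stops.str.replace(" stop", "", regex=False)
stops = stops.str.replace("non-stop", "0", regex=False)

# Strings such as "1" and "0" can now be cast to integers
stops = stops.astype(int)
print(stops.tolist())
```

Note that "non-stop" is untouched by the first two replacements because it contains "-stop", not " stop" with a leading space.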

Correlation

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing 0.62 Pearson correlation coefficient between Price and Total Stops and 0.74 correlation between Duration and Total Stops

Dates

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                 int64
Additional_Info            object
Price                     float64
dtype: object

Extracting month and weekday

planes["month"] = planes["Date_of_Journey"].dt.month

planes["weekday"] = planes["Date_of_Journey"].dt.weekday
print(planes[["month", "weekday", "Date_of_Journey"]].head())
   month  weekday   Date_of_Journey
0      9        4        2019-09-06
1     12        3        2019-12-05
2      1        3        2019-01-03
3      6        0        2019-06-24
4     12        1        2019-12-03
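A short sketch of how the `.dt` accessor encodes these attributes, using three of the dates shown above in a hypothetical standalone DataFrame. The `day_name` column is an addition for readability and is not part of the original slides.

```python
import pandas as pd

# Three dates copied from the output above, in a hypothetical standalone frame
dates = pd.DataFrame({
    "Date_of_Journey": pd.to_datetime(["2019-09-06", "2019-12-05", "2019-06-24"])
})

dates["month"] = dates["Date_of_Journey"].dt.month      # 1 = January ... 12 = December
dates["weekday"] = dates["Date_of_Journey"].dt.weekday  # 0 = Monday ... 6 = Sunday
dates["day_name"] = dates["Date_of_Journey"].dt.day_name()  # readable label (assumption: not in the slides)
print(dates)
```

Weekday numbering starts at Monday, so a value of 4 corresponds to Friday.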

Departure and arrival times

planes["Dep_Hour"] = planes["Dep_Time"].dt.hour
planes["Arrival_Hour"] = planes["Arrival_Time"].dt.hour
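The same extraction on a hypothetical two-row frame, to show that `.dt.hour` returns the hour on a 24-hour clock and discards the minutes. The timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical departure and arrival timestamps
times = pd.DataFrame({
    "Dep_Time": pd.to_datetime(["2019-09-06 09:25", "2019-12-05 18:05"]),
    "Arrival_Time": pd.to_datetime(["2019-09-06 13:15", "2019-12-05 23:30"]),
})

# .dt.hour yields integers from 0 to 23
times["Dep_Hour"] = times["Dep_Time"].dt.hour
times["Arrival_Hour"] = times["Arrival_Time"].dt.hour
print(times[["Dep_Hour", "Arrival_Hour"]])
```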

Correlation

Heatmap showing no relationship between datetime attributes and price

Creating categories

print(planes["Price"].describe())
count     7848.000000
mean      9035.413609
std       4429.822081
min       1759.000000
25%       5228.000000
50%       8355.000000
75%      12373.000000
max      54826.000000
Name: Price, dtype: float64

Range                 Ticket Type
<= 5228               Economy
> 5228 and <= 8355    Premium Economy
> 8355 and <= 12373   Business Class
> 12373               First Class

Descriptive statistics

twenty_fifth = planes["Price"].quantile(0.25)

median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

Labels and bins

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]

bins = [0, twenty_fifth, median, seventy_fifth, maximum]
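A sketch of how labels and bin edges pair up. It assumes the quartile values printed by `describe()` above are hard-coded in place of the computed variables, so the block runs on its own.

```python
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]

# Bin edges hard-coded from the describe() output (25%, 50%, 75%, max)
bins = [0, 5228, 8355, 12373, 54826]

# N labels need N + 1 edges: each label covers one (low, high] interval
assert len(bins) == len(labels) + 1

ranges = list(zip(labels, bins[:-1], bins[1:]))
for label, low, high in ranges:
    print(f"{label}: > {low} and <= {high}")
```

Starting the edges at 0 is safe here because the minimum price (1759) lies above it.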

pd.cut()

Call pd.cut()

planes["Price_Category"] = pd.cut(


pd.cut()

Pass the data

planes["Price_Category"] = pd.cut(planes["Price"],


pd.cut()

Set the labels

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,

pd.cut()

Provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
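A self-contained sketch of the complete call, assuming a standalone Series with a handful of prices and the bin edges hard-coded from the `describe()` output. The last price, 5228, is added here to show boundary behavior and is not a row from the slides.

```python
import pandas as pd

# A few prices in a standalone Series; 5228.0 is added to test a bin edge
prices = pd.Series([13882.0, 6218.0, 13302.0, 3873.0, 11087.0, 5228.0])

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, 5228, 8355, 12373, 54826]  # hard-coded quartiles and maximum

# By default pd.cut() uses right-inclusive intervals: (low, high]
categories = pd.cut(prices, labels=labels, bins=bins)
print(categories.tolist())
```

Because intervals are closed on the right by default, a price exactly equal to the 25th percentile falls in Economy, which matches the "<= 5228" rule.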

Price categories

print(planes[["Price","Price_Category"]].head())
     Price   Price_Category
0  13882.0      First Class
1   6218.0  Premium Economy
2  13302.0      First Class
3   3873.0          Economy
4  11087.0   Business Class

Price category by airline

sns.countplot(data=planes, x="Airline", hue="Price_Category")
plt.show()

Countplot showing the number of flights per airline in different price categories, with Jet Airways having the largest number of First Class tickets
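The counts behind such a plot can also be read as a table. This is a sketch using `pd.crosstab()`, which is a swapped-in technique and not part of the slides; the sample rows and the second airline name are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; not taken from the planes dataset
sample = pd.DataFrame({
    "Airline": ["Jet Airways", "Jet Airways", "Jet Airways", "IndiGo", "IndiGo"],
    "Price_Category": ["First Class", "First Class", "Business Class",
                       "Economy", "Premium Economy"],
})

# Rows are airlines, columns are price categories, cells are flight counts
counts = pd.crosstab(sample["Airline"], sample["Price_Category"])
print(counts)
```

Each cell corresponds to the height of one bar in the countplot, which can make small categories easier to compare than reading them off the chart.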

Let's practice!
