Generating new features

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Correlation

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing a Pearson correlation of 0.54 between Price and Duration
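
A minimal sketch of what `numeric_only=True` does, using a small made-up DataFrame (the values are illustrative and not taken from the planes dataset). Columns with the object dtype are left out of the correlation matrix, so a text column such as Total_Stops cannot appear in the heatmap while it is still stored as strings.

```python
import pandas as pd

# Hypothetical toy data; not the real planes dataset.
toy = pd.DataFrame({
    "Duration": [2.0, 5.5, 9.0, 13.0],
    "Price": [4000.0, 7000.0, 9500.0, 15000.0],
    "Total_Stops": ["non-stop", "1 stop", "1 stop", "2 stops"],  # object dtype
})

# Only the numeric columns take part in the Pearson correlation.
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix.columns.tolist())
```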

Viewing data types

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                object
Additional_Info            object
Price                     float64
dtype: object

Total stops

print(planes["Total_Stops"].value_counts())
1 stop      4107
non-stop    2584
2 stops     1127
3 stops       29
4 stops        1
Name: Total_Stops, dtype: int64

Cleaning total stops

planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stops", "")

planes["Total_Stops"] = planes["Total_Stops"].str.replace(" stop", "")
planes["Total_Stops"] = planes["Total_Stops"].str.replace("non-stop", "0")
planes["Total_Stops"] = planes["Total_Stops"].astype(int)
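
The order of the replacements matters. As a small sketch on a hand-made Series (standing in for the Total_Stops column), the same three steps are chained below; `regex=False` is spelled out here so the strings are treated literally. Removing " stop" first would turn "2 stops" into "2s", and the conversion to int would then fail.

```python
import pandas as pd

# Hypothetical sample with the same categories as Total_Stops.
stops = pd.Series(["1 stop", "non-stop", "2 stops", "3 stops", "4 stops"])

# Remove " stops" before " stop"; the reverse order would leave "2s", "3s", "4s".
cleaned = (
    stops.str.replace(" stops", "", regex=False)
         .str.replace(" stop", "", regex=False)
         .str.replace("non-stop", "0", regex=False)
         .astype(int)
)
print(cleaned.tolist())  # [1, 0, 2, 3, 4]
```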

Correlation

sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()

Heatmap showing a Pearson correlation of 0.62 between Price and Total_Stops, and 0.74 between Duration and Total_Stops

Dates

print(planes.dtypes)
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                 int64
Additional_Info            object
Price                     float64
dtype: object

Extracting month and weekday

planes["month"] = planes["Date_of_Journey"].dt.month

planes["weekday"] = planes["Date_of_Journey"].dt.weekday
print(planes[["month", "weekday", "Date_of_Journey"]].head())
   month  weekday   Date_of_Journey
0      9        4        2019-09-06
1     12        3        2019-12-05
2      1        3        2019-01-03
3      6        0        2019-06-24
4     12        1        2019-12-03
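
`dt.weekday` numbers the days from 0 (Monday) to 6 (Sunday). As a small sketch, the five dates printed above can be checked against `dt.day_name()`. The Series below is rebuilt by hand from that output rather than read from the planes DataFrame, and `check` is a hypothetical name.

```python
import pandas as pd

# Dates copied by hand from the head() output above.
dates = pd.Series(pd.to_datetime(
    ["2019-09-06", "2019-12-05", "2019-01-03", "2019-06-24", "2019-12-03"]
))

check = pd.DataFrame({
    "month": dates.dt.month,
    "weekday": dates.dt.weekday,      # 0 = Monday ... 6 = Sunday
    "day_name": dates.dt.day_name(),  # human-readable weekday
})
print(check)
```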

Departure and arrival times

planes["Dep_Hour"] = planes["Dep_Time"].dt.hour
planes["Arrival_Hour"] = planes["Arrival_Time"].dt.hour

Correlation

Heatmap showing no relationship between the datetime features and Price

Creating categories

print(planes["Price"].describe())
count     7848.000000
mean      9035.413609
std       4429.822081
min       1759.000000
25%       5228.000000
50%       8355.000000
75%      12373.000000
max      54826.000000
Name: Price, dtype: float64

Range                  Ticket type
<= 5228                Economy
> 5228 and <= 8355     Premium Economy
> 8355 and <= 12373    Business Class
> 12373                First Class

Descriptive statistics

twenty_fifth = planes["Price"].quantile(0.25)

median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

Labels and bins

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]

bins = [0, twenty_fifth, median, seventy_fifth, maximum]

pd.cut()

Call pd.cut()

planes["Price_Category"] = pd.cut(


pd.cut()

Pass the data

planes["Price_Category"] = pd.cut(planes["Price"],


pd.cut()

Set the labels

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,

pd.cut()

Provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)

Price categories

print(planes[["Price","Price_Category"]].head())
     Price   Price_Category
0  13882.0      First Class
1   6218.0  Premium Economy
2  13302.0      First Class
3   3873.0          Economy
4  11087.0   Business Class
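
The categories above can be reproduced in isolation. This is a minimal sketch that hard-codes the quartile values from the `describe()` output as bin edges and applies them to the five prices in the sample output; the `prices` Series is rebuilt by hand and stands in for `planes["Price"]`. By default `pd.cut` uses right-inclusive intervals, so a price exactly on an edge falls into the lower category.

```python
import pandas as pd

# Prices copied by hand from the head() output above.
prices = pd.Series([13882.0, 6218.0, 13302.0, 3873.0, 11087.0])

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
# Bin edges hard-coded from the describe() output: 0, 25%, 50%, 75%, max.
bins = [0, 5228, 8355, 12373, 54826]

# Intervals are right-inclusive by default: (0, 5228], (5228, 8355], ...
categories = pd.cut(prices, bins=bins, labels=labels)
print(categories.tolist())

# A price exactly on the 25% edge lands in the lower category.
edge = pd.cut(pd.Series([5228.0]), bins=bins, labels=labels)
print(edge.tolist())
```

Four labels need five bin edges, since each label names the interval between two neighbouring edges.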

Price category by airline

sns.countplot(data=planes, x="Airline", hue="Price_Category")
plt.show()

Price category by airline

Count plot showing the number of flights per airline in each price category; Jet Airways has the most First Class tickets
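
The counts behind a plot like this can also be inspected as a table. A minimal sketch using `pd.crosstab` on a made-up DataFrame follows; the rows and the "Other Airline" name are illustrative and not taken from the planes dataset.

```python
import pandas as pd

# Hypothetical toy data; not the real planes dataset.
toy = pd.DataFrame({
    "Airline": ["Jet Airways", "Jet Airways", "Other Airline",
                "Other Airline", "Jet Airways"],
    "Price_Category": ["First Class", "Business Class", "Economy",
                       "Economy", "First Class"],
})

# Rows are airlines, columns are price categories, cells are flight counts.
counts = pd.crosstab(toy["Airline"], toy["Price_Category"])
print(counts)
```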

Let's practice!
