Exploratory Data Analysis

End-to-End Machine Learning

Joshua Stapleton

Machine Learning Engineer

The EDA process

A diagram showing various components of EDA as applied to the patient heart disease dataset

df.head()

# Print the first 5 rows
print(heart_disease_df.head())

The first 5 rows of our heart disease DataFrame. Results of calling the df.head() operation.

df.info()

# Print out details (info() prints its summary directly and returns None)
heart_disease_df.info()

Summary information about our heart disease DataFrame. Results of calling the df.info() operation.

df.value_counts()

# Print the class balance as proportions of each target value
print(heart_disease_df['target'].value_counts(normalize=True))

The class balance of the target column of our heart disease DataFrame. Results of calling the .value_counts() operation on the target column.

Use df.isnull()

Usage

# check whether all values in a column are null
print(heart_disease_df['oldpeak'].isnull().all())

True
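
The check above covers a single column. As a minimal sketch, the same idea can be extended to count missing values in every column at once. The `toy_df` DataFrame and its values are hypothetical stand-ins invented for illustration, not the actual heart disease data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for heart_disease_df (values invented)
toy_df = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "oldpeak": [np.nan, np.nan, np.nan, np.nan],
    "target": [1, 1, 0, 0],
})

# Number of missing values in each column
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Names of columns that are entirely null (the .all() check applied per column)
all_null_columns = toy_df.columns[toy_df.isnull().all()].tolist()
print(all_null_columns)
```

A column that is entirely null, like `oldpeak` in this toy example, carries no information and is usually a candidate for removal before modeling.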

Anomalous values
- Measurement errors
- Data entry errors
- Rare events
Can skew model performance
- Model learns based on extreme values
- Doesn't capture general data trend
Sometimes can be useful:
- Rare values
Detection:
- Use a boxplot or the interquartile range (IQR)
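
One possible sketch of IQR-based detection is shown below. The `ages` values are hypothetical and invented for illustration, and the 1.5 multiplier is an assumed convention (the same one matplotlib boxplot whiskers use by default), not something the slides specify.

```python
import pandas as pd

# Hypothetical values (invented for illustration); 250 is the planted outlier
ages = pd.Series([29, 41, 45, 52, 54, 58, 61, 67, 250], name="age")

# Interquartile range: the spread of the middle 50% of the data
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = ages[(ages < lower_bound) | (ages > upper_bound)]
print(outliers)
```

Whether a flagged value should be dropped, corrected, or kept depends on whether it is an error or a genuine rare event.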

A visualization showing an outlier.

Visualizations show:

Other types of visualizations:

import matplotlib.pyplot as plt

# Plot a histogram of patient ages
heart_disease_df['age'].plot(kind='hist')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

A visualization showing the distribution of age in our dataset.

¹ https://seaborn.pydata.org/tutorial/distributions.html, https://app.datacamp.com/learn/courses/intermediate-data-visualization-with-seaborn

Understand the data

Detect outliers

Formulate hypotheses

Check assumptions
