Initial exploration

Exploratory Data Analysis in Python

Izzy Weber

Curriculum Manager, DataCamp

Exploratory Data Analysis

The process of reviewing and cleaning data to...

  • derive insights
  • generate hypotheses

graphic of many books on a shelf

A first look with .head()

import pandas as pd

books = pd.read_csv("books.csv")

books.head()
|                          name |                   author |  rating | year |       genre |
|-------------------------------|--------------------------|---------|------|-------------|
| 10-Day Green Smoothie Cleanse |                 JJ Smith |    4.73 | 2016 | Non Fiction |
|             11/22/63: A Novel |             Stephen King |    4.62 | 2011 |     Fiction |
|             12 Rules for Life |       Jordan B. Peterson |    4.69 | 2018 | Non Fiction |
|        1984 (Signet Classics) |            George Orwell |    4.73 | 2017 |     Fiction |
|          5,000 Awesome Facts  | National Geographic Kids |    4.81 | 2019 |   Childrens |
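
As a small illustration of the first look above, the sketch below rebuilds the five rows shown in the .head() output as a stand-in for books.csv (the file itself is not included here) and passes an explicit row count to .head(). The variable name first_three is made up for this example.

```python
import pandas as pd

# Stand-in for books.csv: the five rows shown in the .head() output above
books = pd.DataFrame({
    "name": [
        "10-Day Green Smoothie Cleanse",
        "11/22/63: A Novel",
        "12 Rules for Life",
        "1984 (Signet Classics)",
        "5,000 Awesome Facts",
    ],
    "author": [
        "JJ Smith",
        "Stephen King",
        "Jordan B. Peterson",
        "George Orwell",
        "National Geographic Kids",
    ],
    "rating": [4.73, 4.62, 4.69, 4.73, 4.81],
    "year": [2016, 2011, 2018, 2017, 2019],
    "genre": ["Non Fiction", "Fiction", "Non Fiction", "Fiction", "Childrens"],
})

# .head() returns the first 5 rows by default; pass a number to change that
first_three = books.head(3)
print(first_three)

# .shape gives (number of rows, number of columns)
print(books.shape)
```

On the full dataset the same calls apply unchanged; only the construction of books would be replaced by pd.read_csv("books.csv").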

Gathering more .info()

books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
--   ------  --------------  -----  
 0   name    350 non-null    object 
 1   author  350 non-null    object 
 2   rating  350 non-null    float64
 3   year    350 non-null    int64  
 4   genre   350 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 13.8+ KB
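
The two facts that .info() reports per column, missing values and data type, can also be pulled out as objects for further checks. A minimal sketch, assuming a three-row sample taken from the .head() output stands in for books.csv; the names missing_per_column and column_types are illustrative.

```python
import pandas as pd

# Three rows from the .head() output, used here as a stand-in for books.csv
books = pd.DataFrame({
    "name": ["10-Day Green Smoothie Cleanse", "11/22/63: A Novel", "12 Rules for Life"],
    "author": ["JJ Smith", "Stephen King", "Jordan B. Peterson"],
    "rating": [4.73, 4.62, 4.69],
    "year": [2016, 2011, 2018],
    "genre": ["Non Fiction", "Fiction", "Non Fiction"],
})

# Count missing values in each column (the complement of "Non-Null Count")
missing_per_column = books.isna().sum()
print(missing_per_column)

# Data type of each column (the "Dtype" column of .info())
column_types = books.dtypes
print(column_types)
```

A column whose missing count is above zero, or whose type is not what you expect (for example, a year stored as text), is a natural candidate for cleaning.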

A closer look at categorical columns

books.value_counts("genre")
genre
Non Fiction    179
Fiction        131
Childrens       40
dtype: int64
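
Counts are often easier to compare as proportions. The sketch below rebuilds only the genre column from the counts shown above (179, 131, and 40) and uses the Series form of .value_counts() with normalize=True; the variable names counts and shares are made up for this example.

```python
import pandas as pd

# Rebuild the genre column from the counts in the output above
books = pd.DataFrame({
    "genre": ["Non Fiction"] * 179 + ["Fiction"] * 131 + ["Childrens"] * 40
})

# Absolute counts per genre, sorted from most to least frequent
counts = books["genre"].value_counts()
print(counts)

# normalize=True returns each genre's share of all rows instead of a count
shares = books["genre"].value_counts(normalize=True)
print(shares.round(3))
```

Here just over half of the books are Non Fiction, which is harder to see at a glance from the raw counts.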

.describe() numerical columns

books.describe()
           rating         year
count  350.000000   350.000000
mean     4.608571  2013.508571
std      0.226941     3.284711
min      3.300000  2009.000000
25%      4.500000  2010.000000
50%      4.600000  2013.000000
75%      4.800000  2016.000000
max      4.900000  2019.000000
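
.describe() also works on a single column, and its result can be indexed by statistic name. A minimal sketch, assuming only the five ratings from the .head() output rather than the full 350 rows, so the numbers will differ from the table above; the name summary is illustrative.

```python
import pandas as pd

# The five ratings shown in the .head() output (not the full dataset)
books = pd.DataFrame({"rating": [4.73, 4.62, 4.69, 4.73, 4.81]})

# Summary statistics for one numerical column
summary = books["rating"].describe()
print(summary)

# Individual statistics are looked up by their row label
print(summary["mean"])

# The "50%" row is the median of the column
print(summary["50%"], books["rating"].median())
```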

Visualizing numerical data

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=books, x="rating")
plt.show()

a histogram of book ratings

Adjusting bin width

sns.histplot(data=books, x="rating", binwidth=.1)
plt.show()

histogram of book ratings with a bin width of 0.1

Let's practice!
