Initial exploration

Exploratory Data Analysis in Python

Izzy Weber

Curriculum Manager, DataCamp

Exploratory Data Analysis

The process of reviewing and cleaning data to...

derive insights
generate hypotheses


A first look with .head()

import pandas as pd

books = pd.read_csv("books.csv")

books.head()

|                          name |                   author |  rating | year |       genre |
|-------------------------------|--------------------------|---------|------|-------------|
| 10-Day Green Smoothie Cleanse |                 JJ Smith |    4.73 | 2016 | Non Fiction |
|             11/22/63: A Novel |             Stephen King |    4.62 | 2011 |     Fiction |
|             12 Rules for Life |       Jordan B. Peterson |    4.69 | 2018 | Non Fiction |
|        1984 (Signet Classics) |            George Orwell |    4.73 | 2017 |     Fiction |
|          5,000 Awesome Facts  | National Geographic Kids |    4.81 | 2019 |   Childrens |

Gathering more .info()

books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
--   ------  --------------  -----  
 0   name    350 non-null    object 
 1   author  350 non-null    object 
 2   rating  350 non-null    float64
 3   year    350 non-null    int64  
 4   genre   350 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 13.8+ KB
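
The facts that .info() prints can also be pulled out as objects. A minimal sketch, assuming only the five rows shown in the .head() output (so the counts differ from the full 350-row dataset); the names `sample`, `column_types`, and `missing_counts` are hypothetical and not part of the course code.

```python
import pandas as pd

# Five rows copied from the .head() output; `sample` is an illustrative name
sample = pd.DataFrame({
    "name": [
        "10-Day Green Smoothie Cleanse",
        "11/22/63: A Novel",
        "12 Rules for Life",
        "1984 (Signet Classics)",
        "5,000 Awesome Facts",
    ],
    "author": [
        "JJ Smith",
        "Stephen King",
        "Jordan B. Peterson",
        "George Orwell",
        "National Geographic Kids",
    ],
    "rating": [4.73, 4.62, 4.69, 4.73, 4.81],
    "year": [2016, 2011, 2018, 2017, 2019],
    "genre": ["Non Fiction", "Fiction", "Non Fiction", "Fiction", "Childrens"],
})

# .info() prints a summary; these attributes return the same facts as objects
column_types = sample.dtypes          # data type of each column
missing_counts = sample.isna().sum()  # missing values per column
n_rows, n_cols = sample.shape         # number of rows and columns

sample.info()
```

Checking `missing_counts` directly is handy when a later step needs to act on the result rather than just read it.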

A closer look at categorical columns

books.value_counts("genre")

genre
Non Fiction    179
Fiction        131
Childrens       40
dtype: int64
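
Counts are often easier to compare as proportions. A minimal sketch using the `normalize` argument of `value_counts`, assuming only the five genre values from the .head() output; `genres`, `counts`, and `shares` are hypothetical names.

```python
import pandas as pd

# Genre values from the five rows in the .head() output (not the full dataset)
genres = pd.DataFrame(
    {"genre": ["Non Fiction", "Fiction", "Non Fiction", "Fiction", "Childrens"]}
)

# Absolute number of rows per genre
counts = genres.value_counts("genre")

# Fraction of rows per genre; the fractions sum to 1
shares = genres.value_counts("genre", normalize=True)

print(counts)
print(shares)
```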

.describe() numerical columns

books.describe()

           rating         year
count  350.000000   350.000000
mean     4.608571  2013.508571
std      0.226941     3.284711
min      3.300000  2009.000000
25%      4.500000  2010.000000
50%      4.600000  2013.000000
75%      4.800000  2016.000000
max      4.900000  2019.000000
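
The result of .describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch, assuming only the five ratings and years from the .head() output; the variable names are hypothetical.

```python
import pandas as pd

# Numerical columns from the five rows in the .head() output
sample = pd.DataFrame({
    "rating": [4.73, 4.62, 4.69, 4.73, 4.81],
    "year": [2016, 2011, 2018, 2017, 2019],
})

summary = sample.describe()

# Rows are statistics, columns are the original numerical columns
rating_mean = summary.loc["mean", "rating"]
rating_median = summary.loc["50%", "rating"]  # the 50% row is the median
year_span = summary.loc["max", "year"] - summary.loc["min", "year"]

print(summary)
```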

Visualizing numerical data

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=books, x="rating")
plt.show()

Figure: a histogram of book ratings

Adjusting bin width

sns.histplot(data=books, x="rating", binwidth=.1)
plt.show()

Figure: histogram of book ratings with a bin width of 0.1
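
To see what a bin width of 0.1 means, the bars can be counted by hand. A minimal sketch using NumPy rather than seaborn, assuming only the five ratings from the .head() output; it illustrates the idea of fixed-width bins and is not seaborn's internal algorithm, which may place its bin edges differently.

```python
import numpy as np

# Ratings from the five rows in the .head() output
ratings = np.array([4.73, 4.62, 4.69, 4.73, 4.81])
binwidth = 0.1

# Snap the first and last edges to multiples of the bin width
start = np.floor(ratings.min() / binwidth) * binwidth
stop = np.ceil(ratings.max() / binwidth) * binwidth

# Evenly spaced edges; the half-width padding keeps the last edge in range
edges = np.arange(start, stop + binwidth / 2, binwidth)

# Each count is the height of one histogram bar
counts, edges = np.histogram(ratings, bins=edges)

print(edges)
print(counts)
```

A smaller bin width gives more, narrower bars and shows finer detail; a larger one smooths the distribution.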

Let's practice!
