Data validation

Exploratory Data Analysis in Python

Izzy Weber

Curriculum Manager, DataCamp

Validating data types

books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
--   ------  --------------  -----  
 0   name    350 non-null    object 
 1   author  350 non-null    object 
 2   rating  350 non-null    float64
 3   year    350 non-null    float64 
 4   genre   350 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 13.8+ KB
books.dtypes
name       object
author     object
rating    float64
year      float64
genre      object
dtype: object
Exploratory Data Analysis in Python

Updating data types

books["year"] = books["year"].astype(int)

books.dtypes
name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
Exploratory Data Analysis in Python

Updating data types

Type Python Name
String str
Integer int
Float float
Dictionary dict
List list
Boolean bool
Exploratory Data Analysis in Python

Validating categorical data

books["genre"].isin(["Fiction", "Non Fiction"])
0       True
1       True
2       True
3       True
4      False
       ...  
345     True
346     True
347     True
348     True
349    False
Name: genre, Length: 350, dtype: bool
Exploratory Data Analysis in Python

Validating categorical data

~books["genre"].isin(["Fiction", "Non Fiction"])
0      False
1      False
2      False
3      False
4       True
       ...  
345    False
346    False
347    False
348    False
349     True
Name: genre, Length: 350, dtype: bool
Exploratory Data Analysis in Python

Validating categorical data

books[books["genre"].isin(["Fiction", "Non Fiction"])].head()
|   |                          name |              author | rating | year |       genre |
|---|-------------------------------|---------------------|--------|------|-------------|
| 0 | 10-Day Green Smoothie Cleanse |            JJ Smith |    4.7 | 2016 | Non Fiction |
| 1 |             11/22/63: A Novel |        Stephen King |    4.6 | 2011 |     Fiction |
| 2 |             12 Rules for Life |  Jordan B. Peterson |    4.7 | 2018 | Non Fiction |
| 3 |        1984 (Signet Classics) |       George Orwell |    4.7 | 2017 |     Fiction |
| 5 |         A Dance with Dragons  | George R. R. Martin |    4.4 | 2011 |     Fiction |
Exploratory Data Analysis in Python

Validating numerical data

books.select_dtypes("number").head()
|   | rating | year |
|---|--------|------|
| 0 |    4.7 | 2016 |
| 1 |    4.6 | 2011 |
| 2 |    4.7 | 2018 |
| 3 |    4.7 | 2017 |
| 4 |    4.8 | 2019 |
Exploratory Data Analysis in Python

Validating numerical data

books["year"].min()
2009
books["year"].max()
2019
sns.boxplot(data=books, x="year")
plt.show()

a boxplot of the publishing years for the books data

Exploratory Data Analysis in Python

Validating numerical data

sns.boxplot(data=books, x="year", y="genre")

a boxplot of the books data, broken down by genre

Exploratory Data Analysis in Python

Let's practice!

Exploratory Data Analysis in Python

Preparing Video For Download...