Data validation

Exploratory Data Analysis in Python

Izzy Weber

Curriculum Manager, DataCamp

Validating data types

books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
--   ------  --------------  -----  
 0   name    350 non-null    object 
 1   author  350 non-null    object 
 2   rating  350 non-null    float64
 3   year    350 non-null    float64 
 4   genre   350 non-null    object 
dtypes: float64(2), object(3)
memory usage: 13.8+ KB

books.dtypes

name       object
author     object
rating    float64
year      float64
genre      object
dtype: object
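The same check can be done programmatically rather than by reading the printout. The following is a minimal sketch, assuming a small stand-in DataFrame whose values are illustrative only (not the real books data); the `expected_numeric` list is a hypothetical choice.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype

# Hypothetical stand-in for the books data (values are illustrative only)
books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C"],
    "author": ["Author X", "Author Y", "Author Z"],
    "rating": [4.7, 4.6, 4.4],
    "year": [2016.0, 2011.0, 2018.0],
    "genre": ["Non Fiction", "Fiction", "Fiction"],
})

# Columns we assume should be numeric
expected_numeric = ["rating", "year"]
numeric_ok = {col: is_numeric_dtype(books[col]) for col in expected_numeric}

# Flag a column that holds whole-number years but is stored as floats
year_is_float = is_float_dtype(books["year"])

print(numeric_ok)
print("year stored as float:", year_is_float)
```

A float `year` column is numeric, so it passes the first check; the second check is what reveals that it may be a candidate for conversion to an integer type.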

Updating data types

books["year"] = books["year"].astype(int)

books.dtypes

name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
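One caveat worth keeping in mind: `.astype(int)` drops any fractional part and raises an error when the column contains missing values. A minimal sketch of a guarded conversion, assuming a small illustrative `year` column, might look like this:

```python
import pandas as pd

# Illustrative data only; the real books data has 350 rows
books = pd.DataFrame({"year": [2016.0, 2011.0, 2018.0]})

# Confirm there are no missing values and every value is a whole number
no_missing = books["year"].notna().all()
all_whole = (books["year"] % 1 == 0).all()

# Only cast when the conversion cannot lose information
if no_missing and all_whole:
    books["year"] = books["year"].astype(int)

print(books["year"].dtype)
```

If either check fails, the column is left unchanged so the problem rows can be inspected first.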

Updating data types

| Type       | Python Name |
|------------|-------------|
| String     | `str`       |
| Integer    | `int`       |
| Float      | `float`     |
| Dictionary | `dict`      |
| List       | `list`      |
| Boolean    | `bool`      |
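As a rough illustration of how some of these Python names behave when passed to `.astype()`, here is a small sketch on a made-up column. It covers only the scalar types (`int`, `bool`, `str`); the column name `value` and its contents are assumptions for the example.

```python
import pandas as pd

# Made-up column for illustration
df = pd.DataFrame({"value": [1.0, 0.0, 2.0]})

as_int = df["value"].astype(int)    # whole numbers
as_bool = df["value"].astype(bool)  # zero becomes False, non-zero becomes True
as_str = df["value"].astype(str)    # text representation of each float

print(as_int.tolist())
print(as_bool.tolist())
print(as_str.tolist())
```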

Validating categorical data

books["genre"].isin(["Fiction", "Non Fiction"])

0       True
1       True
2       True
3       True
4      False
       ...  
345     True
346     True
347     True
348     True
349    False
Name: genre, Length: 350, dtype: bool

Validating categorical data

~books["genre"].isin(["Fiction", "Non Fiction"])

0      False
1      False
2      False
3      False
4       True
       ...  
345    False
346    False
347    False
348    False
349     True
Name: genre, Length: 350, dtype: bool
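Because the inverted mask is a Boolean Series, it can be summed to count the rows that fall outside the expected categories and used to list the unexpected values. A minimal sketch, assuming an illustrative `genre` column in which "Childrens" is a placeholder for an unexpected value:

```python
import pandas as pd

# Illustrative data; "Childrens" stands in for an unexpected category
books = pd.DataFrame(
    {"genre": ["Non Fiction", "Fiction", "Fiction", "Childrens", "Fiction"]}
)

valid_genres = ["Fiction", "Non Fiction"]

# True where the genre is NOT one of the expected values
invalid_mask = ~books["genre"].isin(valid_genres)

# Summing a Boolean mask counts the True values
n_invalid = int(invalid_mask.sum())

# Distinct values that failed the check
unexpected = sorted(books.loc[invalid_mask, "genre"].unique())

print(n_invalid, unexpected)
```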

Validating categorical data

books[books["genre"].isin(["Fiction", "Non Fiction"])].head()

|   |                          name |              author | rating | year |       genre |
|---|-------------------------------|---------------------|--------|------|-------------|
| 0 | 10-Day Green Smoothie Cleanse |            JJ Smith |    4.7 | 2016 | Non Fiction |
| 1 |             11/22/63: A Novel |        Stephen King |    4.6 | 2011 |     Fiction |
| 2 |             12 Rules for Life |  Jordan B. Peterson |    4.7 | 2018 | Non Fiction |
| 3 |        1984 (Signet Classics) |       George Orwell |    4.7 | 2017 |     Fiction |
| 5 |         A Dance with Dragons  | George R. R. Martin |    4.4 | 2011 |     Fiction |
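Note that the filtered result keeps the original index labels, which is why the table above skips from 3 to 5. A small sketch of this behaviour, using made-up rows and an optional `reset_index(drop=True)` step to renumber them:

```python
import pandas as pd

# Made-up rows; the fourth one has a genre outside the expected list
books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "genre": ["Non Fiction", "Fiction", "Fiction", "Childrens", "Fiction"],
})

# Keep only rows whose genre is one of the expected values
valid = books[books["genre"].isin(["Fiction", "Non Fiction"])]
print(valid.index.tolist())       # original labels, with a gap

# Optionally renumber the rows from zero
renumbered = valid.reset_index(drop=True)
print(renumbered.index.tolist())
```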

Validating numerical data

books.select_dtypes("number").head()

|   | rating | year |
|---|--------|------|
| 0 |    4.7 | 2016 |
| 1 |    4.6 | 2011 |
| 2 |    4.7 | 2018 |
| 3 |    4.7 | 2017 |
| 4 |    4.8 | 2019 |
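`select_dtypes` also accepts an `exclude` argument, which gives the complementary set of columns. A minimal sketch on an illustrative frame (the rows are stand-ins, not the real data):

```python
import pandas as pd

# Illustrative stand-in for the books data
books = pd.DataFrame({
    "name": ["Book A", "Book B"],
    "rating": [4.7, 4.6],
    "year": [2016, 2011],
    "genre": ["Non Fiction", "Fiction"],
})

# Names of the numeric columns
numeric_cols = books.select_dtypes("number").columns.tolist()

# Names of everything that is not numeric
non_numeric_cols = books.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```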

Validating numerical data

books["year"].min()

books["year"].max()

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(data=books, x="year")
plt.show()

a boxplot of the publishing years for the books data
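Alongside the minimum, maximum, and boxplot, the range can be validated with a Boolean check. The sketch below assumes the five `year` values shown in the numeric table earlier and uses hypothetical bounds (`lower`, `upper`) chosen only for illustration.

```python
import pandas as pd

# The five year values from the first rows of the numeric table
books = pd.DataFrame({"year": [2016, 2011, 2018, 2017, 2019]})

earliest = books["year"].min()
latest = books["year"].max()

# Hypothetical bounds for what counts as a plausible year
lower, upper = 2009, 2020

# between() is inclusive on both ends by default
in_range = books["year"].between(lower, upper)

# Rows that fall outside the expected range
out_of_range = books[~in_range]

print(earliest, latest, len(out_of_range))
```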

Validating numerical data

sns.boxplot(data=books, x="year", y="genre")
plt.show()

a boxplot of the books data, broken down by genre
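A numeric counterpart to the grouped boxplot is a per-genre summary of the year range. This is a minimal sketch using `groupby` with `agg`, built from the five valid rows shown in the filtered table earlier (year and genre only):

```python
import pandas as pd

# Year and genre from the five rows of the filtered table
books = pd.DataFrame({
    "year": [2016, 2011, 2018, 2017, 2011],
    "genre": ["Non Fiction", "Fiction", "Non Fiction", "Fiction", "Fiction"],
})

# Earliest and latest year within each genre
year_range = books.groupby("genre")["year"].agg(["min", "max"])

print(year_range)
```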

Let's practice!
