Data type constraints

Cleaning Data in Python

Adel Nehme

Content Developer @ DataCamp

Course outline

[Figures: dirty_data, side effects, clean_data]

Chapter 1 - Common data problems

Why do we need to clean data?

[Figure: ds_workflow]

Garbage in, garbage out

Data type constraints

Datatype      Example                                  Python data type
Text data     First name, last name, address ...       str
Integers      # Subscribers, # products sold ...       int
Decimals      Temperature, $ exchange rates ...        float
Binary        Is married, new customer, yes/no, ...    bool
Dates         Order dates, ship dates ...              datetime
Categories    Marriage status, gender ...              category
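
As a rough illustration of the table above, the sketch below builds a tiny DataFrame with one column per data type and prints the dtypes pandas assigns. The column names and values are made up for this example and are not part of the course data.

```python
import pandas as pd

# Hypothetical table; column names and values are invented for illustration
df = pd.DataFrame({
    "first_name": ["Ana", "Ben"],                                   # text
    "subscribers": [120, 45],                                       # integers
    "temperature": [21.5, 19.0],                                    # decimals
    "is_married": [True, False],                                    # binary
    "order_date": pd.to_datetime(["2019-01-03", "2019-02-11"]),     # dates
    "marriage_status": pd.Categorical(["Married", "Divorced"]),     # categories
})

# Show the dtype pandas chose for each column
print(df.dtypes)
```

The exact dtype names printed (for example, how text columns are labelled) can differ between pandas versions, so it is safer to check the kind of dtype than to compare names.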

Strings to integers

# Import the CSV file and print the first two rows
import pandas as pd
sales = pd.read_csv('sales.csv')
sales.head(2)
   SalesOrderID    Revenue    Quantity
0         43659     23153$          12
1         43660      1457$           2
# Get data types of columns
sales.dtypes
SalesOrderID    int64
Revenue         object
Quantity        int64
dtype: object

Strings to integers

# Get DataFrame information
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31465 entries, 0 to 31464
Data columns (total 3 columns):
SalesOrderID     31465 non-null int64
Revenue          31465 non-null object
Quantity         31465 non-null int64
dtypes: int64(2), object(1)
memory usage: 737.5+ KB
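
One possible way to spot columns like Revenue programmatically, rather than by reading the dtypes or info() output, is sketched below. Because sales.csv is not available here, the DataFrame is a stand-in built from the two rows shown in the head() output; the variable name non_numeric is an assumption for this example.

```python
import pandas as pd

# Stand-in for sales.csv, using the two rows shown in the head() output
sales = pd.DataFrame({
    "SalesOrderID": [43659, 43660],
    "Revenue": ["23153$", "1457$"],
    "Quantity": [12, 2],
})

# Collect columns that pandas did not parse as numbers
non_numeric = [
    col for col in sales.columns
    if not pd.api.types.is_numeric_dtype(sales[col])
]
print(non_numeric)  # ['Revenue']
```

A column that should hold numbers but shows up in this list is a candidate for cleaning.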

Strings to integers

# Print the sum of the Revenue column
sales['Revenue'].sum()
'23153$1457$36865$32474$472$27510$16158$5694$6876$40487$807$6893$9153$6895$4216..
# Remove $ from the Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
# Convert the cleaned strings to integers
sales['Revenue'] = sales['Revenue'].astype('int')
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
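
The steps above can be tried without the CSV file. The following is a minimal sketch that assumes a small stand-in DataFrame holding the first three Revenue values from the output above; it checks the kind of dtype rather than an exact name, since the integer width can vary by platform.

```python
import pandas as pd

# Stand-in data: the first three Revenue values shown in the output above
sales = pd.DataFrame({"Revenue": ["23153$", "1457$", "36865$"]})

# Strip the currency symbol, then convert the remaining digits to integers
sales["Revenue"] = sales["Revenue"].str.strip("$").astype("int")

# Verify that Revenue now holds integers
assert pd.api.types.is_integer_dtype(sales["Revenue"])

# The sum is now arithmetic instead of string concatenation
total = sales["Revenue"].sum()
print(total)  # 61475
```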

The assert statement

# This will pass
assert 1+1 == 2
# This will not pass
assert 1+1 == 3
AssertionError                            Traceback (most recent call last)
         assert 1+1 == 3
AssertionError:
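
An assert statement can also carry an optional message, which makes a failing check easier to diagnose. The sketch below catches the error only so that the example runs to completion; the message text is an assumption chosen for illustration.

```python
value = 1 + 1

# The optional message after the comma is shown when the check fails
assert value == 2, "expected 1 + 1 to equal 2"

try:
    assert value == 3, "expected 1 + 1 to equal 3"
except AssertionError as error:
    # Keep the message so it can be inspected or logged
    message = str(error)
    print(f"Assertion failed: {message}")
```

Note that Python skips assert statements when run with the -O flag, so they suit sanity checks during cleaning rather than production validation.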

Numeric or categorical?

...   marriage_status    ...
...                 3    ...
...                 1    ...
...                 2    ...

0 = Never married       1 = Married       2 = Separated       3 = Divorced

df['marriage_status'].describe()
       marriage_status
...
mean              1.4
std               0.20
min               0.00
50%               1.8 ...
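
To see why numeric summaries of coded categories can mislead, consider the small sketch below. The six responses are hypothetical and use the codes from the legend above; the mean comes out as a number that corresponds to no actual marriage status, while a count per code is directly interpretable.

```python
import pandas as pd

# Hypothetical responses using the codes 0-3 from the legend above
codes = pd.Series([3, 1, 2, 1, 0, 1], name="marriage_status")

# The mean of the codes (8 / 6) does not correspond to any status
mean_code = codes.mean()
print(mean_code)

# Counting how often each code occurs is more meaningful
counts = codes.value_counts()
print(counts)
```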

Numeric or categorical?

# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')

df.describe()
        marriage_status
count                 241
unique                4
top                   1
freq                  120
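
A possible extension, not shown in the slides, is to replace the numeric codes with their labels before converting to a category, so that summaries show readable values. The sketch below assumes hypothetical responses and uses the legend above as the mapping; the names labels and summary are chosen for this example.

```python
import pandas as pd

# Hypothetical responses using the codes from the legend above
df = pd.DataFrame({"marriage_status": [3, 1, 2, 1, 0, 1]})

# Mapping taken from the legend: code -> label
labels = {0: "Never married", 1: "Married", 2: "Separated", 3: "Divorced"}

# Replace each code with its label, then store the result as a category
df["marriage_status"] = df["marriage_status"].map(labels).astype("category")

# describe() now reports count, unique, top, and freq
summary = df["marriage_status"].describe()
print(summary)
```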

Let's practice!
