Data type constraints

Cleaning Data in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Course outline

[Figure: dirty_data, side effects, clean_data]

Chapter 1 - Common data problems

Why do we need to clean data?

[Figure: ds_workflow]

Garbage in, garbage out

Data type constraints

Datatype     Example                                  Python data type
Text data    First name, last name, address ...       str
Integers     # Subscribers, # products sold ...       int
Decimals     Temperature, $ exchange rates ...        float
Binary       Is married, new customer, yes/no, ...    bool
Dates        Order dates, ship dates ...              datetime
Categories   Marriage status, gender ...              category
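As a rough illustration of the table above, the sketch below builds a small DataFrame with one column per data type and prints the dtype pandas assigns to each. The column names and values are made up for this example. Note that pandas reports text columns as `object` (or a string dtype in recent versions) and dates as `datetime64`.

```python
import pandas as pd

# Hypothetical sample data, one column per data type in the table
people = pd.DataFrame({
    "first_name": ["Ada", "Grace"],                              # text
    "subscribers": [120, 45],                                    # integers
    "temperature": [21.5, 19.0],                                 # decimals
    "is_married": [True, False],                                 # binary
    "order_date": pd.to_datetime(["2019-01-05", "2019-02-11"]),  # dates
    "marriage_status": pd.Categorical(["married", "divorced"]),  # categories
})

# Show the dtype pandas chose for each column
print(people.dtypes)
```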

Strings to integers

# Import pandas, read the CSV file and output the header
import pandas as pd
sales = pd.read_csv('sales.csv')
sales.head(2)
   SalesOrderID    Revenue    Quantity
0         43659     23153$          12
1         43660      1457$           2
# Get data types of columns
sales.dtypes
SalesOrderID    int64
Revenue         object
Quantity        int64
dtype: object

# Get DataFrame information
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31465 entries, 0 to 31464
Data columns (total 3 columns):
SalesOrderID     31465 non-null int64
Revenue          31465 non-null object
Quantity         31465 non-null int64
dtypes: int64(2), object(1)
memory usage: 737.5+ KB

# Print the sum of the Revenue column (strings are concatenated, not added)
sales['Revenue'].sum()
'23153$1457$36865$32474$472$27510$16158$5694$6876$40487$807$6893$9153$6895$4216..
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
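The same conversion can be tried without `sales.csv`. The sketch below is a minimal stand-in that rebuilds the two rows shown in the `sales.head(2)` output as an in-memory DataFrame, then strips the `$` and converts the column. The dtype check via `pd.api.types.is_integer_dtype` is a choice made for this example, not the check used on the slide.

```python
import pandas as pd

# Two rows copied from the sales.head(2) output; no CSV file is needed
sales = pd.DataFrame({
    "SalesOrderID": [43659, 43660],
    "Revenue": ["23153$", "1457$"],
    "Quantity": [12, 2],
})

# Strip the trailing $ and convert the column to integers
sales["Revenue"] = sales["Revenue"].str.strip("$").astype("int")

# Verify the conversion before doing arithmetic on the column
assert pd.api.types.is_integer_dtype(sales["Revenue"])

# The sum is now numeric
print(sales["Revenue"].sum())  # 24610
```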

The assert statement

# This will pass
assert 1+1 == 2
# This will not pass
assert 1+1 == 3
AssertionError                            Traceback (most recent call last)
         assert 1+1 == 3
AssertionError:
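An assert can also carry a message that is shown when the check fails, which helps when validating data. The following is a minimal sketch; the helper name `check_is_integer` is hypothetical and not part of the course material.

```python
def check_is_integer(value):
    """Raise AssertionError with a readable message if value is not an int."""
    assert isinstance(value, int), f"expected int, got {type(value).__name__}"
    return value


check_is_integer(12)  # passes silently

try:
    check_is_integer("23153$")  # fails: the value is still a string
except AssertionError as err:
    message = str(err)
    print(message)
```

Assertions are skipped when Python runs with the `-O` flag, so they suit quick sanity checks rather than production validation.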

Numeric or categorical?

...   marriage_status    ...
...                 3    ...
...                 1    ...
...                 2    ...

0 = Never married       1 = Married       2 = Separated       3 = Divorced

df['marriage_status'].describe()
       marriage_status
...
mean              1.4
std               0.20
min               0.00
50%               1.8 ...

Numeric or categorical?

# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')

df["marriage_status"].describe()
        marriage_status
count                 241
unique                4
top                   1
freq                  120
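If the integer codes should appear as readable labels, one option is to map them with the legend (0 = Never married ... 3 = Divorced) before converting. This is a sketch under assumptions: the sample rows and the `labels` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical sample using the codes from the legend
df = pd.DataFrame({"marriage_status": [3, 1, 2, 1, 0]})

# Legend: code -> label
labels = {0: "Never married", 1: "Married", 2: "Separated", 3: "Divorced"}

# Replace codes with labels, then store the column as a category
df["marriage_status"] = df["marriage_status"].map(labels).astype("category")

# describe() now reports count, unique, top and freq instead of a mean
summary = df["marriage_status"].describe()
print(summary)
```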

Let's practice!
