Data preparation

Understanding Data Science

Hadrien Lacroix

Content Developer, DataCamp

Data workflow

[Figure: data science workflow]


Why prepare data?

  • Real-life data is messy
  • Preparation is done to prevent:
    • errors
    • incorrect results
    • biasing algorithms


Let's start cleaning

Name     Sara       Lis    Hadrien  Lis
Age      "27"       "30"            "30"
Size     1.77       5.58   1.80     5.58
Country  "Belgium"  "USA"  "FR"     "USA"


Tidy data

Before

Name     Sara       Lis    Hadrien  Lis
Age      "27"       "30"            "30"
Size     1.77       5.58   1.80     5.58
Country  "Belgium"  "USA"  "FR"     "USA"


Tidy data | output

Before

Name     Sara       Lis    Hadrien  Lis
Age      "27"       "30"            "30"
Size     1.77       5.58   1.80     5.58
Country  "Belgium"  "USA"  "FR"     "USA"

After

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"
Lis      "30"  5.58  "USA"
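
The reshaping from "Before" to "After" can be sketched in plain Python. This is a minimal illustration rather than the course's own code: the names `untidy` and `tidy` are hypothetical, and representing Hadrien's missing age as `None` is an assumption.

```python
# Untidy layout: one row per variable, one column per person.
# Hadrien's missing age is represented as None (an assumption for this sketch).
names = ["Sara", "Lis", "Hadrien", "Lis"]
untidy = {
    "Age": ["27", "30", None, "30"],
    "Size": [1.77, 5.58, 1.80, 5.58],
    "Country": ["Belgium", "USA", "FR", "USA"],
}


def tidy(names, untidy):
    """Return one record per observation, with one key per variable."""
    records = []
    for i, name in enumerate(names):
        record = {"Name": name}
        for variable, values in untidy.items():
            record[variable] = values[i]
        records.append(record)
    return records


tidy_rows = tidy(names, untidy)
for row in tidy_rows:
    print(row)
```

Each record now describes one person, which is the layout the later cleaning steps work on.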

Remove duplicates

Before

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"
Lis      "30"  5.58  "USA"


Remove duplicates | output

Before

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"
Lis      "30"  5.58  "USA"

After

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"
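
One way to drop the repeated "Lis" row is to remember every row already seen and keep only first occurrences. The sketch below assumes a duplicate means a row identical in every field; `drop_duplicates` is a hypothetical helper name, and the missing age is again written as `None`.

```python
rows = [
    {"Name": "Sara", "Age": "27", "Size": 1.77, "Country": "Belgium"},
    {"Name": "Lis", "Age": "30", "Size": 5.58, "Country": "USA"},
    {"Name": "Hadrien", "Age": None, "Size": 1.80, "Country": "FR"},
    {"Name": "Lis", "Age": "30", "Size": 5.58, "Country": "USA"},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = frozenset(row.items())  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


deduped = drop_duplicates(rows)
print([row["Name"] for row in deduped])
```

Comparing whole rows is the cautious choice here: two different people could share a name, so matching on `Name` alone might delete a real observation.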

Unique ID

Before

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"


Unique ID | output

Before

Name     Age   Size  Country
Sara     "27"  1.77  "Belgium"
Lis      "30"  5.58  "USA"
Hadrien        1.80  "FR"

After

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  5.58  "USA"
2   Hadrien        1.80  "FR"
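
Assigning IDs can be as simple as numbering the rows in order, starting from 0 as in the table above. A minimal sketch, with `add_ids` as a hypothetical helper name:

```python
rows = [
    {"Name": "Sara", "Age": "27", "Size": 1.77, "Country": "Belgium"},
    {"Name": "Lis", "Age": "30", "Size": 5.58, "Country": "USA"},
    {"Name": "Hadrien", "Age": None, "Size": 1.80, "Country": "FR"},
]


def add_ids(rows, start=0):
    """Return new rows with a sequential ID placed before the other fields."""
    return [{"ID": i, **row} for i, row in enumerate(rows, start=start)]


with_ids = add_ids(rows)
for row in with_ids:
    print(row)
```

With an ID in place, a second person named "Lis" could be added later without being confused with the first.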

Homogeneity

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  5.58  "USA"
2   Hadrien        1.80  "FR"


Homogeneity | output

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  5.58  "USA"
2   Hadrien        1.80  "FR"

After

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  1.70  "USA"
2   Hadrien        1.80  "FR"
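
The change from 5.58 to 1.70 is consistent with converting feet to meters. The sketch below assumes that rows with Country "USA" were recorded in feet and all others in meters; the slide does not state this rule, and `size_in_meters` is a hypothetical helper name.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

rows = [
    {"ID": 0, "Name": "Sara", "Age": "27", "Size": 1.77, "Country": "Belgium"},
    {"ID": 1, "Name": "Lis", "Age": "30", "Size": 5.58, "Country": "USA"},
    {"ID": 2, "Name": "Hadrien", "Age": None, "Size": 1.80, "Country": "FR"},
]


def size_in_meters(row):
    """Convert Size to meters for rows assumed to be recorded in feet."""
    if row["Country"] == "USA":  # assumption: US entries use feet
        # 5.58 ft * 0.3048 = 1.700784 m, which rounds to 1.70
        return {**row, "Size": round(row["Size"] * FEET_TO_METERS, 2)}
    return row


converted = [size_in_meters(row) for row in rows]
print([row["Size"] for row in converted])
```

In real data the unit is rarely implied this cleanly by another column, so the rule for deciding which values to convert should be checked against how the data was collected.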

Homogeneity, again

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  1.70  "USA"
2   Hadrien        1.80  "FR"


Homogeneity, again | output

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "Belgium"
1   Lis      "30"  1.70  "USA"
2   Hadrien        1.80  "FR"

After

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "BE"
1   Lis      "30"  1.70  "US"
2   Hadrien        1.80  "FR"
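
Making the Country column consistent amounts to mapping every spelling to one agreed form. The two-letter codes in the table match ISO 3166-1 alpha-2 codes; the `COUNTRY_CODES` mapping and `normalize_country` helper below are hypothetical and cover only the values in this example.

```python
# Hypothetical lookup covering only the spellings seen in this example.
COUNTRY_CODES = {
    "Belgium": "BE",
    "USA": "US",
    "FR": "FR",
}


def normalize_country(value):
    """Map a known spelling to its two-letter code; leave unknown values as-is."""
    return COUNTRY_CODES.get(value, value)


countries = ["Belgium", "USA", "FR"]
codes = [normalize_country(value) for value in countries]
print(codes)
```

Leaving unknown values unchanged keeps the sketch short; a stricter version could raise an error so that unexpected spellings are noticed instead of passing through.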

Data types

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "BE"
1   Lis      "30"  1.70  "US"
2   Hadrien        1.80  "FR"


Data types | output

Before

ID  Name     Age   Size  Country
0   Sara     "27"  1.77  "BE"
1   Lis      "30"  1.70  "US"
2   Hadrien        1.80  "FR"

After

ID  Name     Age  Size  Country
0   Sara     27   1.77  "BE"
1   Lis      30   1.70  "US"
2   Hadrien       1.80  "FR"
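
The quotes around the ages indicate text rather than numbers, and text cannot be averaged or compared numerically. A minimal sketch of the conversion, assuming the missing age is stored as `None`; `parse_age` is a hypothetical helper name.

```python
def parse_age(value):
    """Convert a text age such as "27" to an int; leave a missing age as None."""
    if value is None:
        return None
    return int(value)


raw_ages = ["27", "30", None]
ages = [parse_age(value) for value in raw_ages]
print(ages)

# Text concatenates, numbers add:
print("27" + "30")  # the string "2730"
print(27 + 30)      # the integer 57
```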

Missing values

Before

ID  Name     Age  Size  Country
0   Sara     27   1.77  "BE"
1   Lis      30   1.70  "US"
2   Hadrien       1.80  "FR"

Reasons:

  • data entry
  • error
  • valid missing value

Solutions:

  • impute: fill in an estimated value
  • drop: remove the incomplete row
  • keep: leave the value missing

Missing values | output

Before

ID  Name     Age  Size  Country
0   Sara     27   1.77  "BE"
1   Lis      30   1.70  "US"
2   Hadrien       1.80  "FR"

After

ID  Name     Age  Size  Country
0   Sara     27   1.77  "BE"
1   Lis      30   1.70  "US"
2   Hadrien  28   1.80  "FR"
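
The slide does not say how the value 28 was chosen. One rule that reproduces it, used here as an assumption, is to take the mean of the known ages (28.5) and truncate it to a whole number. The sketch also shows the "drop" alternative; both helper names are hypothetical.

```python
from statistics import mean

rows = [
    {"ID": 0, "Name": "Sara", "Age": 27, "Size": 1.77, "Country": "BE"},
    {"ID": 1, "Name": "Lis", "Age": 30, "Size": 1.70, "Country": "US"},
    {"ID": 2, "Name": "Hadrien", "Age": None, "Size": 1.80, "Country": "FR"},
]


def impute_age(rows):
    """Fill missing ages with the truncated mean of the known ages."""
    known = [row["Age"] for row in rows if row["Age"] is not None]
    fill = int(mean(known))  # mean of 27 and 30 is 28.5, truncated to 28
    return [{**row, "Age": fill} if row["Age"] is None else row for row in rows]


def drop_missing_age(rows):
    """Alternative: remove every row whose age is missing."""
    return [row for row in rows if row["Age"] is not None]


imputed = impute_age(rows)
dropped = drop_missing_age(rows)
print(imputed[2]["Age"], len(dropped))
```

Imputing keeps all three observations at the cost of inventing a value, while dropping keeps only measured values at the cost of losing a third of this small dataset.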

Let's practice!
