Exploratory Data Analysis

Understanding Data Science

Hadrien Lacroix

Content Developer at DataCamp

What is EDA?

Exploratory Data Analysis:

  • Exploring the data
  • Formulating hypotheses
  • Assessing characteristics
  • Visualizing

[Image: photograph of John Tukey]

Data workflow

[Image: data science workflow]

Let's dive right in

Dataset 1        Dataset 2        Dataset 3        Dataset 4        
|x    |y    |    |x    |y    |    |x    |y    |    |x    |y    |
|-----|-----|    |-----|-----|    |-----|-----|    |-----|-----|    
|10.0 |8.04 |    |10.0 |9.14 |    |10.0 |7.46 |    |8.0  |6.58 |
|8.0  |6.95 |    |8.0  |8.14 |    |8.0  |6.77 |    |8.0  |5.76 |
|13.0 |7.58 |    |13.0 |8.74 |    |13.0 |12.74|    |8.0  |7.71 |
|9.0  |8.81 |    |9.0  |8.77 |    |9.0  |7.11 |    |8.0  |8.84 |
|11.0 |8.33 |    |11.0 |9.26 |    |11.0 |7.81 |    |8.0  |8.47 |
|14.0 |9.96 |    |14.0 |8.10 |    |14.0 |8.84 |    |8.0  |7.04 |
|6.0  |7.24 |    |6.0  |6.13 |    |6.0  |6.08 |    |8.0  |5.25 |
|4.0  |4.26 |    |4.0  |3.10 |    |4.0  |5.39 |    |19.0 |12.50|
|12.0 |10.84|    |12.0 |9.13 |    |12.0 |8.15 |    |8.0  |5.56 |
|7.0  |4.82 |    |7.0  |7.26 |    |7.0  |6.42 |    |8.0  |7.91 |
|5.0  |5.68 |    |5.0  |4.74 |    |5.0  |5.73 |    |8.0  |6.89 |

Surprise!

All four datasets display nearly the same:

  • mean and variance for x
  • mean and variance for y
  • correlation coefficient
  • linear regression equation

In short: judged by summary statistics alone, they look almost the same
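The claim can be checked directly against the table. The following is a minimal sketch using only Python's standard library; the helper name `summarize` and the rounding in the printout are choices made here for illustration, not part of the original slides. Variances are sample variances (dividing by n - 1).

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the table above.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "Dataset 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Dataset 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Dataset 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Dataset 4": ([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
                  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return sample statistics and the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The four printed rows agree closely on every statistic, which is exactly why the numbers alone cannot tell the datasets apart.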

Anscombe's quartet

[Image: Anscombe's quartet]

[Image: Anscombe's quartet, linear relationship]

[Image: Anscombe's quartet, non-linear relationship]

[Image: Anscombe's quartet, regression thrown off by an outlier]
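The effect named in this figure can be reproduced numerically. The sketch below assumes the figure refers to Dataset 3 from the table earlier in the deck, where one point sits far above the others; it fits a least-squares line with and without that point. The helper `fit_line` is a hypothetical name introduced for this illustration.

```python
from statistics import mean

# Dataset 3 from the table earlier in this deck.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Fit once on all points, then once without the point far above the rest.
outlier = (13.0, 12.74)
kept = [(a, b) for a, b in zip(x, y) if (a, b) != outlier]

intercept_all, slope_all = fit_line(x, y)
intercept_kept, slope_kept = fit_line([a for a, _ in kept], [b for _, b in kept])

print(f"all points:      y = {intercept_all:.2f} + {slope_all:.3f}x")
print(f"without outlier: y = {intercept_kept:.2f} + {slope_kept:.3f}x")
```

A single point is enough to pull the fitted slope noticeably upward, which a plot reveals at a glance and a summary table hides.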

[Image: Anscombe's quartet, correlation thrown off by an outlier]

[Image: two rockets landing at the same time]

Knowing your data

  • Flight Number (number)
  • Date (datetime)
  • Time (UTC) (datetime)
  • Booster Version (text)
  • Launch Site (text)
  • Payload (text)
  • Payload Mass (kg) (number)
  • Orbit (text)
  • Customer (text)
  • Mission Outcome (text)
  • Landing Outcome (text)
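One way to make these column types concrete is to parse a raw record into typed fields. The sketch below is an illustration, not the course's own code: the `Launch` class and `parse_launch` helper are hypothetical names, and the sample values are those of flight 3 in this deck's preview table. Missing masses are assumed to appear as an empty string or `NaN`.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass
class Launch:
    """One launch, typed according to the field list above."""
    flight_number: int
    launch_date: date
    time_utc: time
    booster_version: str
    launch_site: str
    payload: str
    payload_mass_kg: Optional[float]  # None when the mass is missing
    orbit: str
    customer: str
    mission_outcome: str
    landing_outcome: str


def parse_launch(raw):
    """Convert a dict of raw strings into a typed Launch record."""
    hours, minutes, seconds = (int(part) for part in raw["Time (UTC)"].split(":"))
    mass = raw["Payload Mass (kg)"].strip()
    return Launch(
        flight_number=int(raw["Flight Number"]),
        launch_date=date.fromisoformat(raw["Date"]),
        time_utc=time(hours, minutes, seconds),
        booster_version=raw["Booster Version"],
        launch_site=raw["Launch Site"],
        payload=raw["Payload"],
        # Strip thousands separators such as "9,600" before converting.
        payload_mass_kg=None if mass in ("", "NaN") else float(mass.replace(",", "")),
        orbit=raw["Orbit"],
        customer=raw["Customer"],
        mission_outcome=raw["Mission Outcome"],
        landing_outcome=raw["Landing Outcome"],
    )


launch = parse_launch({
    "Flight Number": "3",
    "Date": "2012-05-22",
    "Time (UTC)": "7:44:00",
    "Booster Version": "F9 v1.0 B0005",
    "Launch Site": "CCAFS LC-40",
    "Payload": "Dragon demo flight C2+",
    "Payload Mass (kg)": "525",
    "Orbit": "LEO (ISS)",
    "Customer": "NASA (COTS)",
    "Mission Outcome": "Success",
    "Landing Outcome": "No attempt",
})
print(launch)
```

Knowing the intended type of each column up front makes it easier to spot values that do not fit, such as a mass stored as text.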

Previewing your data

Flight  Date         Time (UTC)  Booster Version  Launch Site    Payload
_______________________________________________________________________________________________________
1       2010-06-04   18:45:00    F9 v1.0 B0003    CCAFS LC-40    Dragon Spacecraft Qualification Unit
2       2010-12-08   15:43:00    F9 v1.0 B0004    CCAFS LC-40    Dragon demo flight C1, two CubeSats...
3       2012-05-22   7:44:00     F9 v1.0 B0005    CCAFS LC-40    Dragon demo flight C2+
4       2012-10-08   0:35:00     F9 v1.0 B0006    CCAFS LC-40    SpaceX CRS-1
5       2013-03-01   15:10:00    F9 v1.0 B0007    CCAFS LC-40    SpaceX CRS-2    
Payload Mass (kg)    Orbit     Customer         Mission Outcome  Landing Outcome
_______________________________________________________________________________________________________
NaN                  LEO       SpaceX           Success          Failure (parachute)
NaN                  LEO (ISS) NASA (COTS) NRO  Success          Failure (parachute)
525                  LEO (ISS) NASA (COTS)      Success          No attempt
500                  LEO (ISS) NASA (CRS)       Success          No attempt
677                  LEO (ISS) NASA (CRS)       Success          No attempt
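A preview like the one above can be produced by reading only the first few rows of a file. This is a minimal standard-library sketch, not necessarily the tool used in the course: the `preview` helper is a hypothetical name, and the CSV text holds a subset of the columns from the five rows shown above.

```python
import csv
import io
from itertools import islice

# A few columns from the five rows shown above, stored as CSV text.
# With a real file, csv.DictReader would wrap the open file object instead.
raw_csv = """Flight,Date,Launch Site,Payload Mass (kg),Orbit
1,2010-06-04,CCAFS LC-40,NaN,LEO
2,2010-12-08,CCAFS LC-40,NaN,LEO (ISS)
3,2012-05-22,CCAFS LC-40,525,LEO (ISS)
4,2012-10-08,CCAFS LC-40,500,LEO (ISS)
5,2013-03-01,CCAFS LC-40,677,LEO (ISS)
"""


def preview(text, n=5):
    """Return the first n rows as dictionaries keyed by column name."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, n))


first_rows = preview(raw_csv, n=3)
for row in first_rows:
    print(row)

# A quick scan of the preview already reveals missing payload masses.
missing_mass = sum(row["Payload Mass (kg)"] == "NaN" for row in preview(raw_csv))
print("rows without a payload mass:", missing_mass)
```

Even a handful of rows is enough to notice issues worth investigating, here the `NaN` entries in the payload mass column.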

Descriptive statistics

        Flight Date          Time (UTC)  Booster Version  Launch Site     Payload
_______________________________________________________________________________________________________
count   55     55            55          55               55              55
unique  55     55            53          51               4               55
top     6      2018-03-30    4:45:00     F9 v1.1          CCAFS LC-40     SES-9
freq    1      1             2           5                26              1
        Payload Mass (kg)    Orbit     Customer         Mission Outcome  Landing Outcome
_______________________________________________________________________________________________________
count   53                   55        55               55               55
unique  47                   8         28               2                12
top     9,600                GTO       NASA (CRS)       Success          No attempt
freq    5                    22        14               54               18
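The four rows of the table can be read as: `count` is the number of non-missing values, `unique` the number of distinct values, `top` the most frequent value, and `freq` how often it occurs. The sketch below computes the same four figures with the standard library; the `describe` helper is a hypothetical name, and the sample values come from the five previewed launches, with `None` standing in for a missing entry.

```python
from collections import Counter

# Values for the five previewed launches; None marks a missing entry.
landing_outcomes = [
    "Failure (parachute)", "Failure (parachute)",
    "No attempt", "No attempt", "No attempt",
]
payload_masses = [None, None, "525", "500", "677"]


def describe(values):
    """Summarize a column: non-missing count, distinct values, mode, and its frequency."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    if not counts:
        return {"count": 0, "unique": 0, "top": None, "freq": 0}
    top, freq = counts.most_common(1)[0]
    return {"count": len(present), "unique": len(counts), "top": top, "freq": freq}


outcome_summary = describe(landing_outcomes)
mass_summary = describe(payload_masses)
print("Landing Outcome:  ", outcome_summary)
print("Payload Mass (kg):", mass_summary)
```

A `count` below the number of rows signals missing data, which matches the table above: the payload mass column has a count of 53 while the other columns have 55.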

Visualize!

[Image: SpaceX launch count]
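Behind a chart like this is a simple aggregation. Assuming the chart counts launches per year (the image itself is not reproduced here), the sketch below tallies the five dates from the preview table and prints a rough text bar chart; the variable names are illustrative.

```python
from collections import Counter
from datetime import date

# Launch dates of the five flights in the preview table.
launch_dates = ["2010-06-04", "2010-12-08", "2012-05-22", "2012-10-08", "2013-03-01"]

# Group by calendar year and count the launches in each group.
launches_per_year = Counter(date.fromisoformat(d).year for d in launch_dates)

# One '#' per launch gives a quick visual impression without a plotting library.
for year in sorted(launches_per_year):
    print(year, "#" * launches_per_year[year])
```

The same pattern, counting rows per category, applies to any grouping column such as launch site or landing outcome.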

Ask more questions!

[Image: SpaceX launches by site]

Ask more questions!

[Image: SpaceX launches by outcome]

Outliers

[Image: SpaceX payload mass histogram]
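Besides reading them off a histogram, outliers can be flagged with a simple rule. The sketch below uses the common 1.5 × IQR fence, which is one convention among several and not necessarily the one used in the course. Because the full list of payload masses is not given in this deck, it runs on the y values of Dataset 3 from Anscombe's quartet instead; `iqr_outliers` is a hypothetical helper name.

```python
from statistics import quantiles

# y values of Dataset 3 from Anscombe's quartet (stand-in data, see note above).
values = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < low or v > high]


outliers = iqr_outliers(values)
print("flagged as outliers:", outliers)
```

A flagged value is a prompt to investigate, not an automatic reason to delete it: it may be a data entry error or a genuinely unusual observation.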

Let's practice!
