Exploratory Data Analysis

Understanding Data Science

Hadrien Lacroix

Content Developer at DataCamp

What is EDA?

Exploratory Data Analysis:

  • Exploring the data
  • Formulating hypotheses
  • Assessing characteristics
  • Visualizing

[Image: photograph of John Tukey]

Data workflow

[Image: data science workflow]

Let's dive right in

Dataset 1        Dataset 2        Dataset 3        Dataset 4        
|x    |y    |    |x    |y    |    |x    |y    |    |x    |y    |
|-----|-----|    |-----|-----|    |-----|-----|    |-----|-----|    
|10.0 |8.04 |    |10.0 |9.14 |    |10.0 |7.46 |    |8.0  |6.58 |
|8.0  |6.95 |    |8.0  |8.14 |    |8.0  |6.77 |    |8.0  |5.76 |
|13.0 |7.58 |    |13.0 |8.74 |    |13.0 |12.74|    |8.0  |7.71 |
|9.0  |8.81 |    |9.0  |8.77 |    |9.0  |7.11 |    |8.0  |8.84 |
|11.0 |8.33 |    |11.0 |9.26 |    |11.0 |7.81 |    |8.0  |8.47 |
|14.0 |9.96 |    |14.0 |8.10 |    |14.0 |8.84 |    |8.0  |7.04 |
|6.0  |7.24 |    |6.0  |6.13 |    |6.0  |6.08 |    |8.0  |5.25 |
|4.0  |4.26 |    |4.0  |3.10 |    |4.0  |5.39 |    |19.0 |12.50|
|12.0 |10.84|    |12.0 |9.13 |    |12.0 |8.15 |    |8.0  |5.56 |
|7.0  |4.82 |    |7.0  |7.26 |    |7.0  |6.42 |    |8.0  |7.91 |
|5.0  |5.68 |    |5.0  |4.74 |    |5.0  |5.73 |    |8.0  |6.89 |

Surprise!

All four datasets display:

  • identical mean and variance for x
  • nearly identical mean and variance for y
  • nearly identical correlation coefficient
  • nearly identical linear regression equation

In short: judged by their summary statistics alone, they look almost the same
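
The claim can be checked directly from the table. The sketch below is a minimal illustration using only Python's standard library; the helper name `summarize` and the dictionary layout are assumptions made for this example, not part of the course material. It computes the sample mean, sample variance, Pearson correlation, and least-squares line for each dataset.

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the table above.
# Datasets 1 to 3 share the same x values.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
datasets = {
    "Dataset 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                         7.24, 4.26, 10.84, 4.82, 5.68]),
    "Dataset 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                         6.13, 3.10, 9.13, 7.26, 4.74]),
    "Dataset 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                         6.08, 5.39, 8.15, 6.42, 5.73]),
    "Dataset 4": ([8.0] * 7 + [19.0] + [8.0] * 3,
                  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                   5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return summary statistics for paired samples x and y (hypothetical helper)."""
    mean_x, mean_y = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variances (n - 1)
    # Sample covariance between x and y.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / var_x                      # least-squares slope
    return {
        "mean_x": mean_x,
        "var_x": var_x,
        "mean_y": mean_y,
        "var_y": var_y,
        "corr": cov / (var_x * var_y) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(x, y) for name, (x, y) in datasets.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows of output are hard to tell apart, which is exactly why the numbers alone are not enough.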

Anscombe's quartet

[Image: Anscombe's quartet]

[Image: Anscombe's linear graph]

[Image: Anscombe's non-linear graph]

[Image: Anscombe's graph with the regression thrown off]

[Image: Anscombe's graph with the correlation thrown off]

[Image: two rockets landing at the same time]

Knowing your data

  • Flight Number (number)
  • Date (datetime)
  • Time (UTC) (datetime)
  • Booster Version (text)
  • Launch Site (text)
  • Payload (text)
  • Payload Mass (kg) (number)
  • Orbit (text)
  • Customer (text)
  • Mission Outcome (text)
  • Landing Outcome (text)
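
Knowing the type of each column matters because raw files usually store everything as text. As a minimal sketch, the first launch record can be converted to the types listed above. The function name `parse_record`, the combined `Launched (UTC)` field, and the assumption that the raw values arrive as strings are all illustrative choices, not something the slides specify.

```python
import math
from datetime import datetime

# First launch record as raw text (values taken from the first row
# of the launch data preview).
raw = {
    "Flight Number": "1",
    "Date": "2010-06-04",
    "Time (UTC)": "18:45:00",
    "Booster Version": "F9 v1.0 B0003",
    "Launch Site": "CCAFS LC-40",
    "Payload Mass (kg)": "NaN",
    "Orbit": "LEO",
}


def parse_record(raw):
    """Convert text fields to numbers and datetimes (hypothetical helper)."""
    record = dict(raw)  # text columns are kept as they are
    record["Flight Number"] = int(raw["Flight Number"])
    # Combine the date and time columns into a single datetime value.
    record["Launched (UTC)"] = datetime.strptime(
        raw["Date"] + " " + raw["Time (UTC)"], "%Y-%m-%d %H:%M:%S"
    )
    # float("NaN") yields a not-a-number value that marks missing data.
    record["Payload Mass (kg)"] = float(raw["Payload Mass (kg)"])
    return record


record = parse_record(raw)
print(record["Flight Number"], record["Launched (UTC)"])
print("payload mass missing:", math.isnan(record["Payload Mass (kg)"]))
```

Once values are typed, questions such as "how many launches per year?" or "what is the average payload mass?" become straightforward to ask.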

Previewing your data

Flight  Date         Time (UTC)  Booster Version  Launch Site     Payload
_______________________________________________________________________________________________________
1       2010-06-04   18:45:00    F9 v1.0 B0003    CCAFS LC-40    Dragon Spacecraft Qualification Unit
2       2010-12-08   15:43:00    F9 v1.0 B0004    CCAFS LC-40    Dragon demo flight C1, two CubeSats...
3       2012-05-22   7:44:00     F9 v1.0 B0005    CCAFS LC-40    Dragon demo flight C2+
4       2012-10-08   0:35:00     F9 v1.0 B0006    CCAFS LC-40    SpaceX CRS-1
5       2013-03-01   15:10:00    F9 v1.0 B0007    CCAFS LC-40    SpaceX CRS-2    
Payload Mass (kg)    Orbit     Customer         Mission Outcome  Landing Outcome
_______________________________________________________________________________________________________
NaN                  LEO       SpaceX           Success          Failure (parachute)
NaN                  LEO (ISS) NASA (COTS) NRO  Success          Failure (parachute)
525                  LEO (ISS) NASA (COTS)      Success          No attempt
500                  LEO (ISS) NASA (CRS)       Success          No attempt
677                  LEO (ISS) NASA (CRS)       Success          No attempt

Descriptive statistics

        Flight  Date         Time (UTC)  Booster Version  Launch Site     Payload
_______________________________________________________________________________________________________
count   55     55            55          55               55              55
unique  55     55            53          51               4               55
top     6      2018-03-30    4:45:00     F9 v1.1          CCAFS LC-40     SES-9
freq    1      1             2           5                26              1
        Payload Mass (kg)    Orbit     Customer         Mission Outcome  Landing Outcome
_______________________________________________________________________________________________________
count   53                   55        55               55               55
unique  47                   8         28               2                12
top     9,600                GTO       NASA (CRS)       Success          No attempt
freq    5                    22        14               54               18
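
To make the four rows concrete, here is a small sketch of how count, unique, top, and freq could be computed for a single column. It uses only the five preview rows shown earlier, so its numbers differ from the full 55-launch table. The function name `describe_column` and the use of `None` for missing values are assumptions for this example.

```python
from collections import Counter


def describe_column(values):
    """Summarize one column: non-missing count, distinct values,
    most frequent value, and how often it occurs (hypothetical helper)."""
    present = [v for v in values if v is not None]  # skip missing entries
    counts = Counter(present)
    top, freq = counts.most_common(1)[0]
    return {"count": len(present), "unique": len(counts),
            "top": top, "freq": freq}


# Columns from the five preview rows; None stands in for NaN.
orbit = ["LEO", "LEO (ISS)", "LEO (ISS)", "LEO (ISS)", "LEO (ISS)"]
landing_outcome = ["Failure (parachute)", "Failure (parachute)",
                   "No attempt", "No attempt", "No attempt"]
payload_mass = [None, None, 525, 500, 677]

print("Orbit:", describe_column(orbit))
print("Landing Outcome:", describe_column(landing_outcome))
print("Payload Mass (kg):", describe_column(payload_mass))
```

The payload mass column reports a count of 3 rather than 5 because missing values are skipped. The same effect explains why the full table shows a count of 53 for Payload Mass (kg) and 55 everywhere else.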

Visualize!

[Image: SpaceX launch count]

Ask more questions!

[Image: SpaceX launches by site]

Ask more questions!

[Image: SpaceX launches by outcome]

Outliers

[Image: SpaceX payload mass histogram]
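
A histogram is one way to spot outliers by eye. A common numeric rule of thumb, which the slides do not prescribe and is used here only as an illustration, flags values lying more than 1.5 times the interquartile range beyond the quartiles. Because the full payload mass column is not reproduced in these slides, the sketch applies the rule to the y values of Dataset 3 from Anscombe's quartet; the function name `iqr_outliers` is an assumption.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# y values of Dataset 3 from Anscombe's quartet.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
outliers = iqr_outliers(y3)
print(outliers)
```

Flagging a value is only the first step: whether it is an error to fix or a genuine extreme to keep is a judgment that depends on the data.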

Let's practice!
