Why use survival analysis?

Survival Analysis in Python

Shae Wang

Senior Data Scientist

Average battery life example

DataFrame name: battery_df

Battery ID   Duration   Dead   Brand     Truck
1            2.5 yrs    No     Brand A   Long
2            6 yrs      Yes    Brand B   Short
3            5 yrs      No     Brand B   Long
...          ...        ...    ...       ...
1000         4.5 yrs    Yes    Brand A   Short

What's the average battery lifetime?

np.average(battery_df["Duration"])

Average battery life example

Battery life censorship cartoon.

Censorship in battery life

Battery life censorship cartoon.

  • $T_{duration} \neq T_{lifetime}$ for batteries that have not died.
  • Batteries 1, 3, 4, and other batteries whose failures haven't been observed enter the average with durations shorter than their true lifetimes, so the average underestimates battery life.
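The effect on the average can be seen in a small sketch. The lifetimes and the 5-year observation window below are made-up numbers for illustration, not values from battery_df, and plain Python lists stand in for the DataFrame to keep the example dependency-free.

```python
from statistics import mean

# Hypothetical true lifetimes in years (normally unknown to the analyst).
true_lifetimes = [2.0, 4.0, 6.0, 8.0, 10.0]

# Assumption: every battery is observed for at most 5 years.
observation_window = 5.0

# What the analyst actually records: a duration and whether death was seen.
durations = [min(t, observation_window) for t in true_lifetimes]
dead = [t <= observation_window for t in true_lifetimes]

true_mean = mean(true_lifetimes)
# Averaging every recorded duration treats censored durations as lifetimes.
naive_mean = mean(durations)
# Dropping censored batteries keeps only the short-lived ones.
dead_only_mean = mean(d for d, is_dead in zip(durations, dead) if is_dead)

print(true_mean, naive_mean, dead_only_mean)
```

Both shortcuts come out below the true mean here: the censored durations are only lower bounds on the lifetimes, and discarding them removes the longest-lived batteries.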

The censorship problem

When the survival time is only partially known.

How does censorship happen?

  • The event has not yet occurred by the end of the observation period.
    • e.g. a free trial user has not converted to a paid user at the end of an experiment.
  • The individual's data is missing because of a dropout or loss of contact.
    • e.g. a free trial user declines to share data for the experiment.
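As a rough illustration of the first case, the sketch below turns raw dates into a duration and an event flag. The function name `observe`, its parameters, and the dates are hypothetical and only meant to show the idea.

```python
from datetime import date

def observe(start, event_date, study_end):
    """Return (duration_days, observed) for one individual.

    event_date is None when the event was never seen.
    """
    if event_date is not None and event_date <= study_end:
        # The event happened inside the observation period: duration is exact.
        return (event_date - start).days, True
    # Otherwise only a lower bound on the duration is known.
    return (study_end - start).days, False

experiment_end = date(2024, 1, 31)

# A free trial user who converted after 10 days.
converted = observe(date(2024, 1, 1), date(2024, 1, 11), experiment_end)
# A free trial user who had not converted when the experiment ended.
not_converted = observe(date(2024, 1, 1), None, experiment_end)

print(converted, not_converted)
```

For the dropout case, the same function could be called with the date of last contact in place of `study_end`, assuming that date is recorded.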

Types of censorship

Censorship type cartoon.

  • Not censored: the event occurred and survival duration is known.
  • Right-censored: the survival duration is greater than the observed duration.
  • Left-censored: the survival duration is less than the observed duration.
  • Interval-censored: the survival duration is within a certain range but not exactly known.
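One way to make the four cases concrete is to describe each observation by the bounds known for its survival duration. The encoding below (a lower bound of 0 for left-censored, an upper bound of infinity for right-censored) and the function name `censorship_type` are assumptions chosen for this sketch.

```python
import math

def censorship_type(lower, upper):
    """Classify one observation from the known bounds on its survival duration.

    lower: the duration is known to be at least this value
    upper: the duration is known to be at most this value (math.inf if unknown)
    """
    if lower == upper:
        return "not censored"       # duration is known exactly
    if upper == math.inf:
        return "right-censored"     # duration exceeds the observed value
    if lower == 0:
        return "left-censored"      # duration is below the observed value
    return "interval-censored"      # duration lies somewhere in (lower, upper]

examples = [(6.0, 6.0), (2.5, math.inf), (0, 3.0), (2.0, 4.0)]
labels = [censorship_type(lo, hi) for lo, hi in examples]
print(labels)
```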

Why is censorship bad?

Aggregated statistics

  • Censored data is a type of missing data.
  • Censorship skews aggregate statistics, e.g. np.average(), max(), min().

Regression

  • Linear regression line minimizes the sum of squared errors.
  • For censored data, the true survival duration is unknown, so we don't know the error terms.

The survival function

  • Does not impute censored data.
  • Models the probability of a survival duration being larger than a certain value.

  $$S(t) = \Pr(T > t)$$
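The slide does not name an estimator for $S(t)$. As one possible illustration, the sketch below uses the Kaplan-Meier estimator, which is a common choice for right-censored data; it is a minimal hand-written version, not a library API. The data are only the four rows visible in the battery table, with "Dead = No" read as right-censored.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time, in increasing order."""
    event_times = sorted({d for d, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Individuals still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (deaths) recorded exactly at time t.
        deaths = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# The four visible rows of the battery table (years, death observed?).
durations = [2.5, 6.0, 5.0, 4.5]
observed = [False, True, False, True]

curve = kaplan_meier(durations, observed)
print(curve)
```

Censored batteries are never assigned a made-up lifetime. They count toward the at-risk group until their observation ends and then drop out of the calculation.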

Survival analysis versus censorship

Non-censored data cartoon.

Censored data cartoon.

Checking data for censorship

Is there a way to identify which data points are censored?

Step 1) Check for censorship columns (often preprocessed).

Is too much data censored?

Step 2) Check the proportion of data points that are censored (a rule of thumb is that no more than 50% should be censored).

Is the censorship non-informative and random?

Step 3) Investigate the causes of the censorship to ensure that whether a data point is censored is unrelated to its survival duration.
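Steps 1 and 2 can be sketched in a few lines. The example assumes the "Dead" column is the censorship indicator ("No" meaning censored) and uses only the four rows visible in the battery table, held in plain tuples rather than a DataFrame. The per-brand breakdown is just a rough first look toward Step 3, not a test of non-informative censorship.

```python
# The four visible rows of the battery table: (duration in years, dead, brand).
rows = [
    (2.5, "No", "Brand A"),
    (6.0, "Yes", "Brand B"),
    (5.0, "No", "Brand B"),
    (4.5, "Yes", "Brand A"),
]

def censored_share(rows):
    """Fraction of rows whose event was not observed (Dead == "No")."""
    censored = sum(1 for _, dead, _ in rows if dead == "No")
    return censored / len(rows)

# Step 2: overall proportion of censored data points.
overall = censored_share(rows)

# A first look toward Step 3: does the censored share differ by brand?
by_brand = {}
for brand in sorted({b for _, _, b in rows}):
    by_brand[brand] = censored_share([r for r in rows if r[2] == brand])

print(overall, by_brand)
```

A large gap in censored share between groups would be a reason to look more closely at why those data points were censored.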

Let's practice!
