Fitting a Kaplan-Meier estimator

Survival Analysis in Python

Shae Wang

Senior Data Scientist

What is the Kaplan-Meier estimator?

A non-parametric statistic that estimates the survival function of time-to-event data.

  • Also known as
    • the product-limit estimator
    • the K-M estimator
  • Non-parametric: constructs a survival curve directly from the collected data and does not assume an underlying distribution

The mathematical intuition

Definitions:

  • $t_i$: a duration time
  • $d_i$: number of events that happened at time $t_i$
  • $n_i$: number of individuals at risk at time $t_i$, i.e. known to have survived (no event and not censored) up to time $t_i$

 

Survival function $S(t)$ is estimated with: $$S(t)=\prod_{i:t_i\leq t}\bigg(1-\frac{d_i}{n_i}\bigg)$$
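
To make the formula concrete, here is a minimal sketch that computes the estimate by hand in plain Python. The function name `kaplan_meier` and the toy data are hypothetical and used only for illustration. The sketch assumes that subjects censored at time $t_i$ still count as at risk at $t_i$.

```python
from collections import Counter

def kaplan_meier(durations, event_observed):
    """Return (t_i, S(t_i)) pairs for each distinct time with at least one event."""
    # d_i: number of events observed at each time t_i
    events = Counter(t for t, e in zip(durations, event_observed) if e)
    # Every subject (event or censored) leaves the risk set at its own duration
    leaving = Counter(durations)

    at_risk = len(durations)  # n_i for the earliest time
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d_i = events.get(t, 0)
        if d_i:
            # Multiply in the factor (1 - d_i / n_i) for this event time
            survival *= 1 - d_i / at_risk
            curve.append((t, survival))
        # Shrink the risk set before moving on to the next time
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data: 1 = event observed, 0 = censored
durations = [1, 2, 2, 3, 4, 5]
event_observed = [1, 1, 0, 1, 0, 1]

km_curve = kaplan_meier(durations, event_observed)
for t, s in km_curve:
    print(f"S({t}) = {s:.4f}")
```

With this toy data the estimate steps through 5/6, 2/3, 4/9 and 0 at times 1, 2, 3 and 5. The censored subject at time 4 changes $n_i$ for the last step but adds no factor of its own.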


Why is it called the product-limit estimator?

Suppose we have events at 3 times: 1, 2, 3

Survival rate for $t=2$: $$S(t=2)=\bigg(1-\frac{d_1}{n_1}\bigg)\times\bigg(1-\frac{d_2}{n_2}\bigg)$$

Survival rate for $t=3$: $$S(t=3)=S(t=2)\times\bigg(1-\frac{d_3}{n_3}\bigg)$$

The survival rate at time $t$ is the product of the conditional probabilities of surviving each event time up to and including $t$.
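
As a worked illustration with hypothetical counts (not taken from any dataset in these slides), suppose $n_1=10, d_1=1$, then $n_2=8, d_2=2$, then $n_3=5, d_3=1$. The risk set shrinks by more than the number of events because one subject is assumed to be censored between each pair of times.

$$S(t=1)=1-\frac{1}{10}=0.9$$

$$S(t=2)=0.9\times\bigg(1-\frac{2}{8}\bigg)=0.9\times 0.75=0.675$$

$$S(t=3)=0.675\times\bigg(1-\frac{1}{5}\bigg)=0.675\times 0.8=0.54$$

Each new event time multiplies the running product by one more factor, which is where the name "product-limit" comes from.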


Assumptions to keep in mind

  • Unambiguous events: the event of interest happens at a clearly specified time.
  • Survival probabilities are comparable in all subjects: individuals' survival probabilities do not depend on when they entered the study.
  • Censorship is non-informative: censored observations have the same survival prospects as observations that continue to be followed.

Kaplan-Meier estimator with lifelines

from lifelines import KaplanMeierFitter

KaplanMeierFitter: a class of the lifelines library

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed)

The mortgage problem example

DataFrame name: mortgage_df

id    duration   paid_off
1     25         0
2     17         1
3     5          0
...   ...        ...
100   30         1
from lifelines import KaplanMeierFitter

# duration: years the mortgage was observed; paid_off: 1 = paid off (event), 0 = censored
mortgage_kmf = KaplanMeierFitter()
mortgage_kmf.fit(durations=mortgage_df["duration"],
                 event_observed=mortgage_df["paid_off"])
<lifelines.KaplanMeierFitter:"KM_estimate", 
fitted with 100 total observations, 
18 right-censored observations>

Using the Kaplan-Meier estimator

What is the median length of an outstanding mortgage?

print(mortgage_kmf.median_survival_time_)
4.0
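
The median survival time can be read off the curve as the first time at which the estimated survival probability is at or below 0.5. The sketch below illustrates that lookup in plain Python. The function name and both curves are hypothetical, and returning infinity when the curve never reaches 0.5 is one common convention rather than a statement about any particular library.

```python
import math

def median_survival_time(curve):
    """curve: list of (time, survival probability) pairs sorted by time."""
    for t, s in curve:
        if s <= 0.5:
            return t
    # The curve never drops to 0.5, so the median is undefined
    return math.inf

# Hypothetical curves for illustration
reaches_half = [(0, 1.0), (1, 0.9), (2, 0.7), (3, 0.45), (4, 0.2)]
stays_above = [(0, 1.0), (1, 0.9), (2, 0.8), (3, 0.75)]

print(median_survival_time(reaches_half))  # 3
print(median_survival_time(stays_above))   # inf
```

The second curve shows why heavy censoring is a problem for the median: if too few events are observed, the estimate may never fall to 0.5.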

What is the probability that a mortgage is still outstanding at each year after initiation?

print(mortgage_kmf.survival_function_)
          KM_estimate
timeline             
0.0          1.000000
1.0          0.983267
2.0          0.950933
3.0          0.892328

Using the Kaplan-Meier estimator

What is the probability that a mortgage is not paid off by year 34 after initiation?

mortgage_kmf.predict(34)
0.037998
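
Conceptually, a point prediction is a lookup on the fitted step function: the estimate at time $t$ is the value at the latest timeline point that is not after $t$. The sketch below illustrates this with the first four rows of the survival function printed earlier. The helper `survival_at` is hypothetical, and the step-function reading (no interpolation between timeline points) is an assumption about how the estimate is interpreted here.

```python
from bisect import bisect_right

def survival_at(timeline, estimates, t):
    """Step-function lookup: value at the latest timeline point <= t."""
    idx = bisect_right(timeline, t) - 1
    if idx < 0:
        # Before the first timeline point nobody has had the event yet
        return 1.0
    return estimates[idx]

# First rows of the printed survival function
timeline = [0.0, 1.0, 2.0, 3.0]
estimates = [1.000000, 0.983267, 0.950933, 0.892328]

print(survival_at(timeline, estimates, 2.5))  # 0.950933
```

Because the curve only changes at observed event times, any query between two timeline points returns the value from the earlier one.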

Benefits and limitations

Benefits
  • Intuitive interpretation of survival probabilities.
  • Flexible to use on any time-to-event data.
  • Usually the first model to attempt on time-to-event data.
Limitations
  • Survival curve is not smooth.
  • If 50% or more of the data is censored, the estimated curve may never fall to 0.5, so .median_survival_time_ cannot be calculated.
  • Not effective for analyzing the effect of covariates on the survival function.

Let's practice!
