Survival Analysis in Python
Shae Wang
Senior Data Scientist
Hazard function $h(t)$: describes how likely the event is to happen at time $t$, given survival up to that time. It is a rate, not a probability.
Hazard rate: the instantaneous rate at which the event occurs.
$$h(t)=-\frac{d}{dt}\log S(t)$$
The hazard function $h(t)$ and the survival function $S(t)$ can be derived from each other.
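The two-way relationship can be checked numerically. The sketch below assumes a Weibull survival curve with made-up parameters `lam` and `rho` (neither appears in the slides), differentiates $-\log S(t)$ to get the hazard, and then integrates the hazard to recover $S(t)$.

```python
import numpy as np

# Assumed example: Weibull survival curve S(t) = exp(-(t / lam) ** rho).
lam, rho = 10.0, 1.5
t = np.linspace(0.1, 20.0, 2000)
S = np.exp(-(t / lam) ** rho)

# From S to h: h(t) = -d/dt log S(t), using a numerical derivative.
h_numeric = -np.gradient(np.log(S), t)

# Closed-form Weibull hazard, used only to check the numerical result.
h_exact = (rho / lam) * (t / lam) ** (rho - 1)

# From h to S: S(t) = S(t0) * exp(-integral of h from t0 to t),
# with the integral approximated by the trapezoid rule.
steps = 0.5 * (h_exact[1:] + h_exact[:-1]) * np.diff(t)
cumulative_hazard = np.concatenate([[0.0], np.cumsum(steps)])
S_rebuilt = S[0] * np.exp(-cumulative_hazard)

print(np.allclose(h_numeric[1:-1], h_exact[1:-1], rtol=1e-2))
print(np.allclose(S_rebuilt, S, atol=1e-4))
```

The first and last grid points are left out of the hazard comparison because `np.gradient` uses less accurate one-sided differences there.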
The proportional hazards assumption: all individuals' hazards are proportional to one another.
In the case of individual $A$ and individual $B$, for some constant $c$ that does not depend on $t$: $$h_A(t)=c\,h_B(t)$$
Based on the proportional hazards assumption: $$h(t|x)=b_0(t)\exp\bigg(\sum^{n}_{i=1}b_i(x_i-\overline{x_i})\bigg)$$
$b_0(t)$: population-level baseline hazard function that changes with time.
$\exp\bigg(\sum^{n}_{i=1}b_i(x_i-\overline{x_i})\bigg)$: the partial hazard. The log of the hazard is linear in the covariates, and this term does NOT change with time.
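To make the structure of the model concrete, here is a minimal sketch that evaluates $h(t|x)$ directly. The baseline hazard, coefficients, covariate means, and the two individuals are all hypothetical values chosen for illustration. It shows that the ratio of two individuals' hazards is the same at every time point, which is the proportional hazards assumption.

```python
import numpy as np

def cox_hazard(t, x, x_mean, b, baseline):
    """h(t|x) = b0(t) * exp(sum_i b_i * (x_i - mean_i))."""
    partial_hazard = np.exp(np.dot(b, x - x_mean))  # does not depend on t
    return baseline(t) * partial_hazard

# Hypothetical baseline hazard b0(t); any positive function of t would do.
def baseline(t):
    return 0.02 * np.sqrt(t)

b = np.array([0.31, -0.43])        # hypothetical coefficients
x_mean = np.array([1.0, 2.0])      # hypothetical covariate means
x_A = np.array([2.0, 2.0])         # individual A
x_B = np.array([1.0, 3.0])         # individual B

t = np.array([1.0, 5.0, 10.0, 20.0])
h_A = cox_hazard(t, x_A, x_mean, b, baseline)
h_B = cox_hazard(t, x_B, x_mean, b, baseline)

# The ratio h_A(t) / h_B(t) is the constant c = exp(b . (x_A - x_B)).
ratio = h_A / h_B
print(ratio)
print(np.exp(np.dot(b, x_A - x_B)))
```

The baseline cancels in the ratio, so only the difference in covariates determines how the two hazards compare.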
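CoxPHFitter class
from lifelines import CoxPHFitter
coxph = CoxPHFitter()
Call .fit() to fit the estimator to the data:
coxph.fit(df, duration_col, event_col)
Inspect the fitted model with coxph.summary (a property, not a method) or coxph.print_summary()
Predict with coxph.predict_partial_hazard(X), coxph.predict_survival_function(X), and the related predict_* methods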
mortgage_df columns: house, principal, interest, property_tax, credit_score, duration, paid_off
from lifelines import CoxPHFitter
coxph = CoxPHFitter()
coxph.fit(df=mortgage_df, duration_col="duration", event_col="paid_off")
Filter the DataFrame:
new_df = mortgage_df.loc[:, mortgage_df.columns != "house"]
coxph.fit(df=new_df,
duration_col="duration",
event_col="paid_off")
Use the formula parameter:
coxph.fit(df=mortgage_df,
          duration_col="duration",
          event_col="paid_off",
          formula="principal + interest + property_tax + credit_score")
print(coxph.summary)
<lifelines.CoxPHFitter: fitted with 1808 observations, 340 censored>
covariate      coef  exp(coef)  se(coef)      z     p
house         -0.38       0.68      0.19  -1.98  0.05
principal     -0.06       0.94      0.02  -2.61  0.01
interest       0.31       1.37      0.31   1.02  0.31
property_tax  -0.15       0.86      0.21  -0.71  0.48
credit_score  -0.43       0.65      0.38  -1.14  0.26
A one-unit increase in interest from its mean value -> the hazard changes by a factor of $e^{0.31}\approx 1.37$, which is a 37% increase compared to the baseline hazard.
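The same reading applies to every row of the summary. As a rough sketch, the snippet below copies the rounded coef and exp(coef) values from the table and converts each hazard ratio into a percentage change. The helper `percent_change` is a name introduced here for illustration. Exponentiating the rounded coefficient 0.31 gives about 1.36, so the table's 1.37 presumably comes from the unrounded estimate.

```python
import math

# (coef, exp(coef)) pairs copied from the summary table, rounded to 2 decimals.
summary = {
    "house": (-0.38, 0.68),
    "principal": (-0.06, 0.94),
    "interest": (0.31, 1.37),
    "property_tax": (-0.15, 0.86),
    "credit_score": (-0.43, 0.65),
}

def percent_change(hazard_ratio):
    """Percentage change in hazard for a one-unit increase in the covariate."""
    return (hazard_ratio - 1.0) * 100.0

for name, (coef, hazard_ratio) in summary.items():
    print(
        f"{name}: exp({coef}) = {math.exp(coef):.3f}, "
        f"table exp(coef) = {hazard_ratio}, "
        f"hazard change = {percent_change(hazard_ratio):+.0f}%"
    )
```

A hazard ratio below 1 means the hazard decreases as the covariate increases, and a ratio above 1 means it increases.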