Introduction to Survival Analysis
Introduction
Hi there, welcome back to one of my presentations. This presentation will cover some basics of survival analysis.
In this section, we will use the following packages:
Survival analysis lets you analyze the rates of occurrence of events over time, without assuming the rates are constant. Generally, survival analysis lets you model the time until an event occurs. In the medical world, we typically think of survival analysis literally, tracking time until death. But it is more general than that: survival analysis models the time until any event occurs.
Examples
Special features of survival data
- Survival times are generally not symmetrically distributed.
- Usually the data are positively skewed, and it would be unreasonable to assume that data of this type follow a normal distribution. However, the data can be transformed to give a more symmetric distribution.
- Survival times are frequently censored. The survival time of an individual is said to be censored when the end-point of interest has not been observed for that individual.
Example of the distribution of follow-up times according to event status:
- Censoring is a type of missing data problem unique to survival analysis.
- This happens when you track the sample/subject through the end of the study and the event never occurs.
- This could also happen due to the sample/subject dropping out of the study for reasons other than death, or some other loss to followup. The sample is censored in that you only know that the individual survived up to the loss to followup, but you don’t know anything about survival after that.
A subject may be censored due to:
Loss to follow-up
Withdrawal from study
No event by end of fixed study period
Specifically, these are examples of right censoring.
- Left censoring and interval censoring are also possible, and methods exist to analyze these types of data, but this tutorial will focus on right censoring.
Illustrating the impact of censoring:
What's happening:
Subjects 6 and 7 were event-free at 10 years.
Subjects 2, 9, and 10 had the event before 10 years.
Subjects 1, 3, 4, 5, and 8 were censored before 10 years, so we don’t know whether they had the event or not at 10 years. But we do know something about them: they were each followed for a certain amount of time without the event of interest prior to being censored.
Types of censoring
- If the individual was last known to be alive at time \(t_0 + C\), the time \(C\) is called a censored survival time. This censoring occurs after the individual has been entered into a study, that is, to the right of the last known survival time, and is therefore known as right censoring. The right-censored survival time is then less than the actual, but unknown, survival time.
Definition: A survival time is said to be right-censored at time \(t\) if it is only known to be greater than \(t\).
- Left censoring is encountered when the actual survival time of an individual is less than that observed. To illustrate this form of censoring, consider a study in which interest centres on the time to recurrence of a particular cancer following surgical removal of the primary tumour. Three months after their operation, the patients are examined to determine if the cancer has recurred. At this time, some of the patients may be found to have a recurrence. For such patients, the actual time to recurrence is less than three months, and so these recurrence times are left-censored.
Defining Survival and hazard functions
- Definition: The survival function at time \(t\), denoted \(S(t)\), is the probability of being event-free at \(t\). The cumulative incidence function at time \(t\), denoted \(F(t)=1− S(t)\), is the complementary probability that the event has occurred by time \(t\).
Another useful summary is the hazard function h(t).
- Definition : The hazard function \(h(t)\) is the short-term event rate for subjects who have not yet experienced the outcome event.
The distribution function \(F(t)\) describes the probability of the survival time being less than or equal to a given value \(t\). It is obtained from the density as:
\[F(t)=P(T\le t)=\int_{0}^{t}f(u)\,du\]
- Interest also tends to focus on the survival (or survivorship) function, which is the probability that an individual survives beyond any given time \(t\):
\[ S(t) = Pr(T>t) \]
\(S(t)\) is a monotonically decreasing function with \(S(0) = 1\) and \(S(\infty) = \lim_{t \rightarrow \infty} S(t) = 0\).
Hazard function
Suppose \(R(t)\) (the risk set) contains the individuals who are alive and at risk of dying just before time \(t\). The probability that an individual in this set dies in the interval \([t, t + \delta t)\), given survival up to \(t\), leads to the hazard rate \(h(t)\), defined as:
\[h(t) = \lim_{\delta t \rightarrow 0} \frac{P(t \le T < t+\delta t | T \ge t)}{\delta t}\] The hazard function’s shape can be any strictly positive function and it changes based on the specific survival data provided.
Relationship between h(t) and S(t)
We know that \(f(t) = - \frac{d}{dt} S(t)\), hence the equation simplifies to
\[h(t)=\frac{f(t)}{S(t)} =\frac{- \frac{d}{dt} S(t)}{S(t)}= - \frac{d}{dt} [\log S(t)]\]
Integrating both sides, the above reduces to: \[H(t)= -\log S(t)\]
where \(H(t)=\int_{0}^{t} h(u)\, du\) is the cumulative hazard, the cumulative risk of an individual dying by time \(t\). Since the event is death, \(H(t)\) summarises the risk of death up to time \(t\), assuming that death has not occurred before \(t\). The survival function can also be approximated as follows: \[ S(t) = e^{-H(t)} \approx 1 -H(t) \text{ when } H(t) \text{ is small.}\]
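As a quick numerical check of the identity \(S(t) = e^{-H(t)}\), here is a minimal R sketch using an exponential distribution, where the hazard is constant so that \(H(t) = \lambda t\):

```r
# For an exponential survival time the hazard is constant, h(t) = lambda,
# so the cumulative hazard is H(t) = lambda * t and S(t) = exp(-H(t)).
lambda <- 0.2
t <- 3

H <- lambda * t                                          # cumulative hazard by time t
S_from_H <- exp(-H)                                      # S(t) via the identity
S_direct <- pexp(t, rate = lambda, lower.tail = FALSE)   # P(T > t) directly

all.equal(S_from_H, S_direct)   # the two agree
```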
Kaplan-Meier estimate
Suppose there are \(r\) death times among \(n\) individuals with observed survival times, where \(r\leq n\). Arranging these times from smallest to largest, the \(j\)-th ordered death time is denoted \(t_{(j)}\), for \(j = 1, 2, \cdots , r\), so that the \(r\) ordered death times are \[ t_{(1)}<t_{(2)}< \cdots < t_{(r)}.\]
Let \(n_j\), for \(j = 1, 2, \ldots, r\), denote the number of individuals who are still at risk just before time \(t_{(j)}\), including those who die at that time, and let \(d_j\) denote the number of individuals who die at time \(t_{(j)}\). The product-limit (Kaplan-Meier) estimate can then be expressed in the following form:
\[\hat{S}(t) = \prod_{j=1}^k \, \left(1-\frac{d_j}{n_j}\right)\] for \(t_{(k)} \le t < t_{(k+1)}\).
To survive until time \(t\), an individual must first survive until \(t_{(1)}\), and then continue to survive until each subsequent time point \(t_{(2)}, \ldots, t_{(k)}\). It is assumed that there are no deaths between two consecutive time points \(t_{(j-1)}\) and \(t_{(j)}\). The conditional probability of surviving through time \(t_{(j)}\), given that the individual has survived up to \(t_{(j-1)}\), is the complement of the proportion of the \(n_j\) individuals at risk just before \(t_{(j)}\) who die at that time, \(d_j\). Therefore, this conditional probability of surviving through \(t_{(j)}\) is \(1 - \frac{d_j}{n_j}\).
The variance of the estimator (Greenwood's formula) is given by \[\text{var}(\hat{S}(t)) = [\hat{S}(t)]^2 \sum_{j=1}^k \frac{d_j}{(n_j-d_j)\, n_j}\]
The corresponding standard error is therefore given by: \[\text{s.e}(\hat{S}(t)) = \hat{S}(t)\left[\sum_{j=1}^k \frac{d_j}{(n_j-d_j)\, n_j}\right]^{\frac{1}{2}}\]
The Kaplan-Meier curve illustrates the survival function. It’s a step function illustrating the cumulative survival probability over time. The curve is horizontal over periods where no event occurs, then drops vertically corresponding to a change in the survival function at each time an event occurs.
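The product-limit formula can be checked against `survival::survfit()` on a small hand-made dataset (a minimal sketch; the times and censoring pattern are made up for illustration):

```r
library(survival)

time   <- c(2, 4, 4, 6, 8, 10)   # follow-up times
status <- c(1, 1, 0, 1, 0, 1)    # 1 = event observed, 0 = censored
km <- survfit(Surv(time, status) ~ 1)

# By hand, at each event time multiply the running product by (1 - d_j / n_j):
#   t = 2:  n = 6, d = 1  ->  5/6
#   t = 4:  n = 5, d = 1  ->  5/6 * 4/5 = 2/3
#   t = 6:  n = 3, d = 1  ->  2/3 * 2/3 = 4/9
#   t = 10: n = 1, d = 1  ->  0
summary(km)$surv
```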
Log-rank test
Consider the problem of contrasting two or more survival functions, for instance survival times between male and female individuals in the ICU. Let \[ t_{1}<t_{2}< \cdots < t_{r}\]
be the unique times of death observed among the individuals, obtained by pooling all groups of interest. For group \(j\), let \(d_{ij}\) denote the number of deaths at time \(t_i\), and let \(n_{ij}\) denote the number at risk at time \(t_i\) in that group. Let \(d_i\) denote the total number of deaths at time \(t_i\) across all groups, and let \(n_i\) denote the total number at risk at time \(t_i\). Then
the expected number of deaths in group \(j\) at time \(t_i\) is:
\[E(d_{ij}) = \frac{n_{ij} \, d_i}{n_i} = n_{ij} \left(\frac{d_i}{n_i}\right) \]
At time \(t_i\), suppose there are \(d_i\) deaths and \(n_i\) individuals at risk in total, with \(d_{i0}\) and \(d_{i1}\) deaths and \(n_{i0}\) and \(n_{i1}\) individuals at risk in groups 0 and 1, respectively, such that \(d_{i0} + d_{i1} = d_i\) and \(n_{i0} + n_{i1} = n_i\). Such data can be summarized in a \(2 \times 2\) table at each death time \(t_i\).
Given that the null hypothesis is true, the number of deaths is expected to follow the hypergeometric distribution, and therefore: \[E(d_{i0}) =e_{i0}= \frac{n_{i0} \, d_i}{n_i} = n_{i0} \left(\frac{d_i}{n_i}\right)\]
\[\text{Var}(d_{i0}) = \frac{n_{i0} \, n_{i1} \, d_i(n_i-d_i)}{n_{i}^2 (n_{i}-1)}\]
The statistic \(U_L\) is given by \[U_L=\sum^r_{i=1}(d_{i0}-e_{i0})\] and, writing \(v_{i0}=\text{Var}(d_{i0})\), the variance of \(U_L\) is \[\text{Var}(U_L)=\sum^r_{i=1}v_{i0}=V_L\] It then follows that \[\frac{U_L}{\sqrt{V_L}} \sim N(0,1)\] and squaring gives \[\frac{U^2_L}{V_L}\sim \chi^2_1\]
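The construction above can be verified numerically: the following sketch computes \(U_L\) and \(V_L\) by hand on simulated two-group data and compares \(U_L^2/V_L\) with the chi-square statistic reported by `survival::survdiff()` (the simulated data are made up for illustration):

```r
library(survival)
set.seed(42)

n <- 80
grp <- rep(0:1, each = n / 2)
time <- rexp(n, rate = 0.1 * exp(0.8 * grp))   # group 1 has a higher hazard
status <- as.integer(time < 8)                 # administrative censoring at t = 8
time <- pmin(time, 8)

lr <- survdiff(Surv(time, status) ~ grp)

# log-rank statistic by hand: U_L = sum(d_i0 - e_i0), V_L = sum of the
# hypergeometric variances, accumulated over the distinct death times
U <- 0; V <- 0
for (ti in sort(unique(time[status == 1]))) {
  ni  <- sum(time >= ti)                            # at risk overall
  di  <- sum(time == ti & status == 1)              # deaths overall
  ni0 <- sum(time >= ti & grp == 0)                 # at risk in group 0
  di0 <- sum(time == ti & status == 1 & grp == 0)   # deaths in group 0
  U <- U + di0 - ni0 * di / ni
  if (ni > 1) V <- V + ni0 * (ni - ni0) * di * (ni - di) / (ni^2 * (ni - 1))
}

all.equal(U^2 / V, lr$chisq)   # matches survdiff's chi-square
```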
Regression models in Survival analysis
Cox’s Proportional Hazards
The model is semi-parametric in nature, since no assumption about a probability distribution is made for the survival times. The baseline hazard can take any form, and the covariates enter the model linearly (Fox, 2008). The model can thus be represented as:
\[h(t|X) = h_0(t) \exp(\beta_1 X_{1} + \cdots + \beta_p X_{p})=h_0(t)\exp(\beta^{T} X)\]
In the above equation, \(h_0(t)\) is known as the baseline hazard function, which represents the hazard for an individual when all the included variables in the model are zero. \(h(t|X)\) represents the hazard of an individual at time \(t\) with covariate vector \(X=(x_1,x_2,\ldots,x_p)\), and \(\beta^T=(\beta_1,\beta_2,\ldots,\beta_p)\) is a vector of regression coefficients.
Fitting a proportional hazards model
The partial likelihood
Let \(t_{(1)} < t_{(2)} < \ldots < t_{(r)}\) denote the \(r\) ordered death times. The partial likelihood is given by:
\[L(\beta) = \prod_{i=1}^{n} \left[\frac{\exp(\beta ^{T} X_i)}{ \sum_{\ell\in R(t_{(i)})}\exp(\beta ^{T} X_\ell)}\right]^{\delta_i}\]
Maximising the logarithm of the likelihood function is easier and more computationally efficient; the log partial likelihood is given by:
\[\log L(\beta)= \sum_{i=1}^{n} \delta_i \left[ \beta ^{T} X_i - \log \left(\sum_{\ell \in R(t_{(i)})}\exp(\beta ^{T} X_\ell)\right) \right]\]
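To see that `coxph()` is maximising exactly this quantity, the log partial likelihood can be coded directly and evaluated at the fitted coefficient. A minimal sketch on simulated, tie-free data, using Breslow ties so the hand formula applies exactly:

```r
library(survival)
set.seed(1)

n <- 60
x <- rbinom(n, 1, 0.5)
time <- rexp(n, rate = 0.1 * exp(0.7 * x))
status <- as.integer(time < 5)          # administrative censoring at t = 5
time <- pmin(time, 5)

fit <- coxph(Surv(time, status) ~ x, ties = "breslow")

# log partial likelihood: sum over events of beta * x_i minus the log of the
# sum of exp(beta * x_l) over the risk set R(t_(i))
logplik <- function(beta) {
  s <- 0
  for (i in which(status == 1)) {
    risk <- time >= time[i]             # risk set just before the i-th death
    s <- s + beta * x[i] - log(sum(exp(beta * x[risk])))
  }
  as.numeric(s)
}

all.equal(logplik(coef(fit)), fit$loglik[2])   # loglik at the fitted beta
```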
Residuals for Cox Regression Model
Six types of residuals are commonly studied: Cox-Snell, modified Cox-Snell, Schoenfeld, martingale, deviance, and score residuals.
Cox-Snell Residuals
The Cox-Snell residual, introduced by Cox and Snell (1968), is the most common residual in survival analysis. The Cox-Snell residual for the \(i\)-th individual, \(i = 1, 2, \cdots , n\), is given by
\[r_{Ci}=\text{exp}(\mathbf{\hat{\beta}^{T} x_i})\hat{H}_0(t_i)\]
where \(\hat{H}_0(t_i)\) is an estimate of the baseline cumulative hazard function at time \(t_i\). It is sometimes called the Nelson-Aalen estimator.
Martingale residuals
These are useful in determining the functional form of the covariates to be included in the model. If the test reveals that a covariate cannot be included linearly, then that covariate needs to be transformed. The martingale residuals are given by
\[r_{Mi}=\delta_i - r_{Ci}\]
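The relationship \(r_{Mi}=\delta_i - r_{Ci}\) can be used to recover the Cox-Snell residuals from the martingale residuals that R reports. A minimal sketch, using the lung dataset shipped with the survival package as a stand-in (it is not the stroke dataset used elsewhere in this tutorial):

```r
library(survival)

fit  <- coxph(Surv(time, status) ~ age, data = lung)
mart <- residuals(fit, type = "martingale")

delta <- lung$status - 1   # lung codes status as 1 = censored, 2 = dead
cs <- delta - mart         # Cox-Snell residuals: r_Ci = delta_i - r_Mi

range(cs)                  # Cox-Snell residuals are always non-negative
```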
Deviance Residuals
The skewed range of the martingale residuals makes them difficult to use for detecting outliers. The deviance residuals are much more symmetrically distributed about zero and are defined by
\[r_{Di}=\text{sgn}(r_{Mi})\left[-2\{r_{Mi}+\delta_i\log\left(\delta_i-r_{Mi}\right)\}\right]^{\frac{1}{2}}\]
Here, \(r_{Mi}\) represents the martingale residual for the \(i\)-th individual, and the function \(\text{sgn}(\cdot)\) denotes the sign function. Specifically, the function takes the value +1 if its argument is positive and -1 if negative.
Schoenfeld residuals
Schoenfeld residuals are a useful tool for examining and testing the proportional hazard assumption, detecting leverage points, and identifying outliers. They can be calculated as follows:
\[r_{Sji}=\delta_i\{x_{ji}-\hat{a}_{ji}\}\]
where \(x_{ji}\) is the value of the \(j\)-th explanatory variable, \(j = 1, 2, \cdots , p\), for the \(i\)-th individual in the study, and
\[\hat{a}_{ji}=\frac{\sum_{\ell \in R(t_i)}x_{j\ell}\text{exp}(\mathbf{\hat{\beta}^{T} x_{\ell}})}{\sum_{\ell \in R(t_i)}\text{exp}(\mathbf{\hat{\beta}^{T} x_{\ell}})}\]
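At the fitted coefficients the partial-likelihood score is zero, so the Schoenfeld residuals for each covariate sum to approximately zero. A minimal sketch, again using the lung dataset from the survival package as a stand-in:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
sch <- residuals(fit, type = "schoenfeld")   # one row per event, one column per covariate

colSums(sch)   # each column sums to roughly zero at the fitted beta
```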
Examples in R
Functions
The core functions we’ll use out of the survival package include:
- Surv(): Creates a survival object.
- survfit(): Fits a survival curve using either a formula, or from a previously fitted Cox model.
- coxph(): Fits a Cox proportional hazards regression model.

Other optional functions you might use include:

- cox.zph(): Tests the proportional hazards assumption of a Cox regression model.
- survdiff(): Tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.1
Surv() creates the response variable; typical usage takes the time to event,[^time2] and whether or not the event occurred (i.e., death vs censored). survfit() creates a survival curve that you could then display or plot. coxph() implements the regression analysis; models are specified the same way as in regular linear models, but using the coxph() function.
Data description
Setup
Take a look at the dataset
Creating a Survival Object
The survival object is created by the function survival::Surv(), which typically requires two arguments: event and time. The survival object will be used as the outcome by the survival analysis methods we explore.
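A minimal sketch with made-up follow-up times (the actual dataset used in this tutorial is not reproduced here):

```r
library(survival)

# hypothetical follow-up times (say, months) and event indicators
time   <- c(5, 8, 12, 12, 20)
status <- c(1, 0, 1, 0, 1)     # 1 = event observed, 0 = censored

y <- Surv(time, status)
y   # censored observations print with a trailing "+", e.g. 8+
```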
Some key components of this survfit object that will be used to create survival curves include:

- time: the timepoints at which the curve has a step, i.e. at least one event occurred
- surv: the estimate of survival at the corresponding time
The Kaplan-Meier (KM) Estimator is a non-parametric method that estimates the survival probabilities at each time an event occurs. We will use the function survival::survfit(), which uses the KM Estimator, to estimate survival probabilities from our data.

survfit() requires two arguments:
- A formula object where the outcome is a survival object
- A data frame
The survival probabilities for all patients:
Next, we will estimate the survival probabilities for stroke type:
- This will give us two tables, corresponding to the two groups of the factor that we have in the data.
The KM estimate provides the survival probabilities. We can plot these probabilities to look at the trend of survival over time. The plot provides
- survival probability on the \(y-axis\)
- time on the \(x-axis\)
Plotting the Overall Survival Function
Using the KM Estimator to Plot Multiple Survival Functions
- Clearly from the plot we can tell that patients with stroke-type=IS (blue) survived longer than the other group.
- There is a significant difference in survival times between the groups (p<0.05)
We can perform the Kaplan-Meier estimates for variable dm too:
And then we can plot the survival estimates for patients with and without diabetes:
- Clearly from the plot, we can tell that patients without diabetes mellitus survived longer than those with diabetes; however, the difference is not statistically significant.
- Typically we will also want to see the numbers at risk in a table below the x-axis. We can add this using add_risktable():
There are a number of available tests to compare the survival estimates between groups based on KM. The tests include:
- log-rank (default)
- peto-peto test
Log-rank test
To answer the question of whether the survival estimates differ between levels or groups, we can use statistical tests such as the log-rank and Peto-Peto tests.
For all these tests, the null hypothesis is that the survival estimates between levels or groups are not different. For example:
The survival estimates between the surv_data types (IS vs HS groups) are different at the level of \(5\%\) significance (p-value = 0.03).
And for the survival estimates based on diabetes status:
The survival estimates between patients with and without diabetes (dm status yes vs no groups) are not different (p-value = 0.1).
Cox PH General model
The Cox model is expressed by the hazard function, denoted \(h(t)\). This model can be used to fit univariable and multivariable regression models that have survival outcomes. The hazard function can be interpreted as the risk of dying at time \(t\). It can be estimated as follows:
\(h(t, X_{i}) = h_{0}(t)e^{ \sum_{j=1}^{p} \beta_{j}X_{i,j}} = h_{0}(t)\exp(\beta_{1}X_{i,1} + \cdots +\beta_{p}X_{i,p})\)
where:
- \(h(t)\) is the hazard, the instantaneous rate at which events occur.
- \(h_{0}(t)\) is called the baseline hazard (the hazard when all X’s are equal to 0); it depends on \(t\)
- \(X = (X_{1}, X_{2},..., X_{p})\) explanatory/predictor variables
- \(e^{ \sum_{i=1}^{p} \beta_{i}X_{i}}\) depends only on the X’s, not on \(t\).
Because the baseline hazard \(h_{0}(t)\) is an unspecified function, the Cox model is a semiparametric model.
Advantages of the model: it is a “robust” model, so the results from using the Cox model will closely approximate the results for the correct parametric model.
The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables \(X_{i}\), with the baseline hazard, \(h_{0}(t)\), being an ‘intercept’ term that varies with time.
\[log(h(t, X_{i})) = log(h_{0}(t)) + \sum_{j=1}^{p} \beta_{j}X_{i,j}\]
A hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.
In summary,
- HR = 1 : No effect
- HR < 1: Reduction in the hazard
- HR > 1: Increase in Hazard
As a note, in clinical studies, a covariate with hazard ratio :
- greater than 1 (i.e.: b>0) is called a bad prognostic factor.
- smaller than 1 (i.e.: b<0) is called a good prognostic factor.
As a consequence, a major assumption of this model is that the HR is constant over time because it is independent of time. Or equivalently that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.
It is possible, nevertheless, to consider X’s which do involve t. Such X’s are called time-dependent variables. If time-dependent variables are considered, the Cox model form may still be used, but such a model no longer satisfies the PH assumption, and is called the extended Cox model.
- The coxph() function uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~. Explanatory variables go on the right side.
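Since the stroke dataset itself is not included here, the call can be sketched on simulated stand-in data with a hypothetical stroke_type factor (HS vs IS); the variable names and effect size are made up for illustration:

```r
library(survival)
set.seed(7)

# simulated stand-in for the tutorial's stroke data (hypothetical variables)
n <- 120
stroke_type <- factor(sample(c("HS", "IS"), n, replace = TRUE))
rate <- 0.12 * ifelse(stroke_type == "HS", 2, 1)   # assume HS has twice the hazard
time <- rexp(n, rate)
status <- as.integer(time < 12)                    # administrative censoring at t = 12
time <- pmin(time, 12)

fit <- coxph(Surv(time, status) ~ stroke_type)
summary(fit)     # coefficient = log hazard ratio for IS relative to HS
exp(coef(fit))   # hazard ratio
```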
COX PH model with stroke type variable only
The effect of surv_data type is significantly related to survival (p-value = 0.0368), with better survival in Ischaemic surv_data in comparison to the other type (hazard ratio of dying = 0.5157).
The model is statistically significant. The 0.03344 p-value of the model is really close to the p = 0.03 p-value we saw in the Kaplan-Meier model, and the likelihood ratio test statistic (4.52) is close to the log-rank chi-square (4.5) in the Kaplan-Meier model.
\(e^{\beta_{1}}\) = \(e^{-0.6622}\) = 0.5157 is the hazard ratio - the multiplicative effect of that variable on the hazard rate (for each unit increase in that variable). Ischaemic stroke patients have 0.5157 (~ 52%) times the hazard of dying in comparison to haemorrhagic stroke patients.
This is important when performing statistical analysis:
- first build a model with all variables
- the estimate is the log hazard; if you exponentiate it, you get the hazard ratio
- the standard error
- the p-value
- the confidence intervals for the log hazard
Using the rms package
Variable importance
Now we create univariate models, as in logistic regression.
By using tbl_uvregression() we can generate simple univariable models for all covariates in one line of code. In return, we get the crude HR for all the covariates of interest.
- It is clear that gcs has the smallest p-value, hence it can be included in the model first.
The simple Cox PH model with covariate gcs shows that with each one unit increase in gcs, the crude log hazard for death changes by \(-0.175\).
Let's tidy it up:
When we exponentiate the log HR, the simple Cox PH model shows that with each one unit increase in gcs, the crude risk for death decreases by about \(16\%\), with a \(95\%\) CI for the hazard ratio of \((0.785, 0.898)\). The relationship between death and gcs is highly significant (p-value \(< 0.0001\)) when not adjusting for other covariates.
Next up, we add stroke type.
Let's test whether stroke type is worth adding:
- The variable is not significant
We try another one.
Let's test whether dm is worth adding:
- Still not worth it; we could go on and on…
Let's do a backward elimination:
- The final model retains the gcs variable
The Cox proportional hazards model makes several assumptions. We use residuals methods to:
- check the proportional hazards assumption with the Schoenfeld residuals
- detect nonlinearity in relationship between the log hazard and the covariates using Martingale residual
- examine influential observations (or outliers) with the deviance residuals (a symmetric transformation of the martingale residuals)
Testing proportional hazard
The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.
We can test with the Goodness of Fit (GOF) approach based on the residuals defined by Schoenfeld.
The idea behind the statistical test is that if the PH assumption holds for a particular covariate then the Schoenfeld residuals for that covariate will not be related to survival time.
For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.
From the output above, none of the covariate-specific tests is statistically significant, and the global test is also not statistically significant. Therefore, we can assume proportional hazards.
In the graphical diagnostic using the function ggcoxzph() [in the survminer package], the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariate sex (which is, recall, a two-level factor, accounting for the two bands in the graph).
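A minimal cox.zph() sketch, using the lung data from the survival package as a stand-in for the stroke data:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
zp <- cox.zph(fit)
zp   # one row per covariate plus a GLOBAL test; large p-values support PH
```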
Another approach
Another approach is to graphically check the PH assumption by comparing -log–log survival curves. A log–log survival curve is simply a transformation of an estimated survival curve that results from taking the natural log of an estimated survival probability twice.
\(h(t, X) = h_{0}(t)e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}\) which is equivalent to \(S(t,X) = [S_{0}(t)]^{e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}}\)
Therefore, \[-ln(-ln(S(t, X)))\]
\[= -ln(-ln([S_{0}(t)]^{e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}}))\] \[= -ln[-ln(S_{0}(t))] - ln[e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}]\]
\[= - \sum_{j=1}^{p} \beta_{j} X_{j} - ln[-ln(S_{0}(t))]\]
Therefore, the corresponding log–log curves for these individuals are given as shown here, where we have simply substituted \(X_1\) and \(X_2\) for \(X\) in the previous expression for the log–log curve for any individual \(X\).
\[ln(-ln(S(t, X_1))) - ln(-ln(S(t, X_2)))\] \[= \sum_{j=1}^{p} \beta_{j} (X_{1j} - X_{2j})\]
The baseline survival function has dropped out, so that the difference in log–log curves involves an expression that does not involve time t. The above formula says that if we use a Cox PH model and we plot the estimated -log–log survival curves for two groups of individuals on the same graph, the two plots would be approximately parallel.
The distance between the two curves is the linear expression involving the differences in predictor values, which does not involve time. Note, in general, if the vertical distance between two curves is constant, then the curves are parallel.
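This check can be sketched in R: plot.survfit() supports fun = "cloglog", which draws log(-log S(t)) against log t, so roughly parallel curves support the PH assumption (lung data from the survival package used as a stand-in):

```r
library(survival)

km <- survfit(Surv(time, status) ~ sex, data = lung)

# complementary log-log survival curves; under PH the two curves should be
# roughly parallel (a constant vertical distance apart)
plot(km, fun = "cloglog", col = c("black", "red"),
     xlab = "t (log scale)", ylab = "log(-log(S(t)))")
```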
The two curves cross; this result therefore suggests that the two groups in stroke type do not satisfy the PH assumption.
To test influential observations or outliers, we can visualize either: the dfbeta values or the deviance residuals.
- dfbeta values
This plot produces the estimated changes in the coefficients divided by their standard errors. The comparison of the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually.
- deviance residuals
The deviance residual is a normalized transformation of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
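A minimal sketch of extracting deviance residuals (lung data from the survival package as a stand-in):

```r
library(survival)

fit  <- coxph(Surv(time, status) ~ age, data = lung)
dres <- residuals(fit, type = "deviance")

summary(dres)            # roughly symmetric about zero
which(abs(dres) > 2.5)   # candidate outliers / poorly fitted observations
```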
Footnotes
Cox regression and the log-rank test from survdiff() are going to give you similar results most of the time. The log-rank test asks whether survival curves differ significantly between two groups. Cox regression asks which of many categorical or continuous variables significantly affect survival.↩︎