MATH96048/MATH97075/MATH97183
Survival Models & Actuarial Applications
Dr David Whitney
2020/21
These notes are based on those by Prof Axel Gandy and Prof Nicholas Heard.
Contents
1 Principles of Modelling Time to Event Data 1
1.1 Statistical Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Model Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.2 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Distributions of Event Times 2
2.1 Random Variable Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Cumulative Distribution and Survivor Functions . . . . . . . . . . . . . . . . 2
2.2.1 Criteria for a valid survivor function: . . . . . . . . . . . . . . . . . . 2
2.2.2 Future lifetime after age x . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.3 Actuarial notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.4 Relationship between Tx and T . . . . . . . . . . . . . . . . . . . . 3
2.3 Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Hazard Function / Force of Mortality . . . . . . . . . . . . . . . . . . . . . . 4
2.5 Cumulative Hazard Function . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.6 Complete and Curtate Expectations of Future Lifetime . . . . . . . . . . . . 6
2.6.1 Complete future lifetime . . . . . . . . . . . . . . . . . . . . . . . . 6
2.6.2 Curtate future lifetime . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Random Survivorship Group . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.8 Life Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.9 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9.1 Right-censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9.2 Left-censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.3 Interval-censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.4 Truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.5 Type I censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.6 Type II censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.7 Competing risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Parametric Distributions of Random Lifetimes T 12
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Gompertz-Makeham Distribution . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.1 Gompertz and Makeham laws of mortality . . . . . . . . . . . . . . . 14
3.4.2 Gompertz-Makeham hazard function . . . . . . . . . . . . . . . . . . 14
3.5 Choosing a Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.1 Empirical survivor function . . . . . . . . . . . . . . . . . . . . . . . 14
4 Fitting Parametric Distributions 16
4.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Asymptotic distribution of MLE . . . . . . . . . . . . . . . . . . . . 18
4.1.3 Functional invariance of MLEs . . . . . . . . . . . . . . . . . . . . . 18
4.2 Examples of ML Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.1 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.2 Weibull distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Hypothesis Testing for Nested Distributions . . . . . . . . . . . . . . . . . . 19
4.3.1 Likelihood ratio statistic . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.2 Wald statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Nonparametric Methods 22
5.1 Discrete Lifetimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Discrete hazard rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 Survivor and discrete cumulative hazard rate . . . . . . . . . . . . . . 23
5.2 Kaplan-Meier (Product-limit) Estimate . . . . . . . . . . . . . . . . . . . . . 23
5.3 Greenwood’s Formula for the Standard Error of the K-M Estimate . . . . . . 25
5.4 Nelson-Aalen Estimate of the Cumulative Hazard . . . . . . . . . . . . . . . 26
5.5 The Actuarial Estimate of S . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Regression models 27
6.1 Inhomogeneous Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Cox’s Proportional Hazards Model . . . . . . . . . . . . . . . . . . . . . . . 27
6.2.1 Partial likelihood function . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2.2 Model checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2.3 Estimating the baseline distribution . . . . . . . . . . . . . . . . . . 31
6.3 Accelerated Failure Time Model . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3.1 Relation to proportional hazards . . . . . . . . . . . . . . . . . . . . 32
7 The Markov Model 33
7.1 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.2 Homogeneous Continuous-Time Markov Jump Processes . . . . . . . . . . . 33
7.2.1 Markov Jump Processes . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.2.2 Transition probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2.3 Transition Intensities . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2.4 Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2.5 Age rate intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2.6 Assuming homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 A Two-State Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.4 Calculating Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . 37
7.4.1 Kolmogorov Forward Equations . . . . . . . . . . . . . . . . . . . . . 38
7.4.2 Kolmogorov Backward Equations . . . . . . . . . . . . . . . . . . . . 38
7.4.3 Example: The two-decrement model . . . . . . . . . . . . . . . . . . 39
7.5 Calculating Path Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.5.1 Holding times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.5.2 Jump probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.5.3 Example (continued): The two-decrement model . . . . . . . . . . . 41
7.6 Estimating Transition Intensities . . . . . . . . . . . . . . . . . . . . . . . . 41
7.6.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 41
7.6.2 Interval censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.7 Statistical Testing of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.7.1 χ2 Goodness of Fit Test . . . . . . . . . . . . . . . . . . . . . . . . 43
7.7.2 Standardised and Cumulative Deviations Tests . . . . . . . . . . . . . 43
8 The Binomial and Poisson Models of Mortality 45
8.1 Binomial model of mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.2 General Binomial Model - Non-uniform observation times . . . . . . . . . . . 46
8.2.1 Uniform distribution of deaths (UDD) . . . . . . . . . . . . . . . . . 47
8.2.2 Balducci assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.2.3 Constant force of mortality . . . . . . . . . . . . . . . . . . . . . . . 48
8.2.4 Comparison of the three assumptions . . . . . . . . . . . . . . . . . . 48
8.3 The actuarial estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.3.1 Actuarial notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.4 Statistical testing of estimates . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.5 Poisson model of mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9 Counting Processes and the Poisson Process 52
9.1 Filtrations and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.2 Counting Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.2.1 Intensity of a Counting Process . . . . . . . . . . . . . . . . . . . . . 53
9.3 Counting processes for survival data . . . . . . . . . . . . . . . . . . . . . . 54
9.3.1 Individual level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.3.2 Group level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.4 Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.4.1 Homogeneous Poisson Processes . . . . . . . . . . . . . . . . . . . . 58
9.4.2 Inference for HPP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
9.5 Counting processes for Markov Models . . . . . . . . . . . . . . . . . . . . . 59
9.5.1 Individual level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
9.5.2 Group level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
9.5.3 Transition level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
10 Comparison of Markov, Binomial and Poisson models 61
10.1 Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11 Graduation 62
11.1 Requirements for Graduated Estimates . . . . . . . . . . . . . . . . . . . . . 62
11.2 Graduation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
11.2.1 Parametric Graduation . . . . . . . . . . . . . . . . . . . . . . . . . 62
11.2.2 Graduation Using Previous Estimates . . . . . . . . . . . . . . . . . . 62
11.3 Testing Graduated Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 63
11.3.1 Testing Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
11.3.2 Testing Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . 63
1 Principles of Modelling Time to Event Data
1.1 Statistical Modelling
A stochastic or statistical model of a system is a mathematical model which represents
inherent uncertainties in the system as random variables. Any model which does not make
such allowances for uncertainty, on the other hand, is said to be deterministic. In actuarial
science, statistical models are especially useful for dealing with uncertainty in the time until
a specific event will occur, such as predicting the number of premium payments that will be
received before paying out on a life assurance contract. These statistical models for time
to event data also have much wider applications in fields such as medicine, biostatistics and
engineering.
Mathematical models are an imperfect representation of reality. Their utility comes from
being able to approximately learn the consequences of hypothetically changing certain exper-
imental inputs or actions. Statistical models extend this utility by capturing the uncertainty
surrounding unknown future outcomes, providing the possibility of searching for an optimal
decision under this uncertainty.
1.2 Model Choice
1.2.1 Complexity
The complexity of a good statistical model is often constrained for several reasons. The
analyst might be restricted by the inputs for which data are readily available, or by compu-
tational or statistical limitations to the models which can be reliably fitted. Additionally, it
is important that the chosen model satisfies the purposes for which it will be used; usually
this means that the mathematical structure of the model need be easily interpretable for
purposes of understanding and communication, and that the resulting fitted model will not
overfit and understate uncertainty and risk. For these reasons, the statistical analyst will
often seek to fit the most parsimonious model that the data will sensibly allow.
1.2.2 Sensitivity
Model choice and fitting should be viewed as an iterative procedure. Once a particular model
has been fitted to the available data, the quality of the fit should be inspected; hypothesis
tests such as goodness of fit tests can determine the suitability of the selected model. If the
structure of the data is not well captured by the model, this may suggest that the model
chosen is inadequate and the analyst should return to the model selection stage to consider
other alternatives.
Note however that performance of the chosen model will also depend heavily on the
quality of the data, with poor quality inputs likely to lead to unreliable output inference
(garbage in, garbage out). If there are questions about the accuracy of the data being used,
then it becomes particularly important to carry out a sensitivity analysis. This investigates
the magnitude of change in the model outputs when small perturbations are applied to the
model inputs.
Finally, it should be noted that the suitability of a well fitted model may not be comfort-
ably relied upon outside the range of the data used; for example, an exponential relationship
can appear fairly linear in the short term, before ballooning away from such a fit in the longer
term.
2 Distributions of Event Times
2.1 Random Variable Modelling
Consider a homogeneous population of individuals, who each have an associated event time
which is initially unknown and treated as a random variable. We will often refer to the period
of time until the event occurs as the lifetime of the individual.
Throughout the course, let T denote the future lifetime of a new-born individual (aged
0). Unless otherwise stated, we shall assume T is a continuous random variable which takes
values on the positive part of the real line R+ = [0,∞), with associated probability measure
P.
More specifically, in this chapter we might assume the existence of a limiting age ω
which T cannot exceed, so T ∈ [0,ω]. For human life calculations, typically we take
ω ≈120 years.
We first define the cumulative distribution function, F , the survivor function S , the
density f , the hazard µ and the cumulative hazard M for a lifetime, and derive relationships
between them.
This will provide us with a range of tools for specifying the distribution of a lifetime, and
rules for moving between them.
2.2 Cumulative Distribution and Survivor Functions
Definition: The cumulative distribution function (CDF) of T is
F (t) = P(T ≤ t),
the probability of death by age t.
Definition: The survivor (or reliability) function of T is
S(t) = P(T > t) = 1− F (t),
the probability of surviving beyond age t.
2.2.1 Criteria for a valid survivor function:
S(t) is a valid survivor function iff
1. S(0) = 1 and lim_{t→∞} S(t) = 0;
2. 0 ≤ S(t) ≤ 1, ∀t ∈ R+;
3. Monotonicity: ∀t1, t2 ∈ R+, t1 < t2 =⇒ S(t1) ≥ S(t2);
4. S is càdlàg: ∀t ∈ R+,
(a) S(t+) ≡ lim_{u↓t} S(u) = S(t) (S is right-continuous);
(b) S(t−) ≡ lim_{u↑t} S(u) exists (left limits exist for S).
2.2.2 Future lifetime after age x
Definition: Let Tx be the future lifetime of an individual who has survived to age x , for
0 ≤ x ≤ ω, so Tx ∈ [0,ω− x ]. So clearly
• T0 ≡ T ;
• [Tx ] ≡ [T − x |T > x ].
In an abuse of notation we shall generically refer to the probability measures associated
with the random variables {Tx} simply as P.
Definition: The cumulative distribution and survivor functions of Tx are
Fx (t) = P(Tx ≤ t),
Sx (t) = P(Tx > t) = 1− Fx (t).
2.2.3 Actuarial notation
tqx ≡ Fx (t),
tpx ≡ Sx (t) = 1− tqx .
For example 2q30 = P(T30 ≤ 2) = P(T ≤ 32|T > 30).
When working with units of one year, t = 1 is often omitted, i.e. qx ≡ 1qx and px ≡ 1px .
In actuarial science, the CDF tqx is called the rate of mortality and qx is called the initial
rate of mortality.
2.2.4 Relationship between Tx and T
For consistency with T , for the CDF we have
Fx (t) = P(Tx ≤ t) = P(T ≤ x + t | T > x)
=⇒ Fx (t) = (F (x + t) − F (x)) / S(x),
and for the survivor function,
Sx (t) = 1 − Fx (t) = S(x + t) / S(x).
In actuarial notation
tqx = (x+tq0 − xq0) / xp0   and   tpx = x+tp0 / xp0.
Hence, for any age x , and ∀s, t > 0,
s+tpx = x+s+tp0 / xp0   [Sx (s + t) = S(x + s + t) / S(x)]
      = (x+sp0 / xp0) · (x+s+tp0 / x+sp0)   [= (S(x + s) / S(x)) · (S(x + s + t) / S(x + s))]
=⇒ s+tpx = spx tpx+s .   [Sx (s + t) = Sx (s) Sx+s(t)]
That is, the probability of surviving for time s + t after age x is simply the probability of
surviving for time s multiplied by the probability of surviving for further time t after reaching
age x + s.
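As a quick numerical sanity check, the multiplicative rule s+tpx = spx · tpx+s holds for any concrete survivor function; the sketch below uses a Weibull S(t) with hypothetical parameters α = 80, η = 5.

```python
import math

# Illustrative Weibull survivor function (alpha, eta are hypothetical).
alpha, eta = 80.0, 5.0

def S(t):
    return math.exp(-((t / alpha) ** eta))

def p(t, x):
    """t p_x = S(x + t) / S(x): survive t more years from age x."""
    return S(x + t) / S(x)

x, s, t = 30.0, 10.0, 5.0
lhs = p(s + t, x)            # 15 p_30
rhs = p(s, x) * p(t, x + s)  # 10 p_30 * 5 p_40
assert abs(lhs - rhs) < 1e-12
```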
2.3 Density Function
Definition: The probability density function (PDF) of the random variable Tx is
fx (t) = (d/dt) Fx (t) = lim_{h↓0} (Fx (t + h) − Fx (t)) / h.
Since we are assuming Tx is a continuous random variable, by definition this density exists.
For individuals who have currently survived to age x , fx (t) is the rate of death t further
units of time into the future. That is, for such an individual, the probability of death within
the interval [t, t + h] for a small interval width h is approximately fx (t)h.
2.4 Hazard Function / Force of Mortality
The hazard function plays a central role in survival analysis. We denote the hazard function
(or force of mortality) at age x , 0 ≤ x ≤ ω, by µ(x).
Definition: The hazard function of T is defined as
µ(x) = lim_{h↓0} P(T ≤ x + h | T > x) / h.
We will always assume this limit exists.
Note that
µ(x) = lim_{h↓0} Fx (h) / h = fx (0).
The interpretation of µ(x) is important. It represents the instantaneous death rate for
an individual who has survived to time x .
Or, approximately, for small h
hqx ≈ hµ(x). (1)
Given an individual has reached age x , the probability of death in the next short period
of time of length h is roughly proportional to h, the constant of proportionality being µ(x).
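The approximation hqx ≈ hµ(x) in (1) is easy to see numerically. The sketch below assumes a hypothetical linear hazard µ(x) = cx, so M(x) = cx²/2, and shows the ratio hqx / (hµ(x)) tending to 1 as h shrinks.

```python
import math

# Hypothetical linear hazard mu(x) = c*x, cumulative hazard M(x) = c*x^2/2.
c = 0.002
mu = lambda x: c * x
S = lambda x: math.exp(-0.5 * c * x * x)

x = 50.0
for h in (1.0, 0.1, 0.001):
    hqx = 1.0 - S(x + h) / S(x)   # exact h q_x = 1 - S(x+h)/S(x)
    ratio = hqx / (h * mu(x))     # -> 1 as h -> 0
    print(h, ratio)
```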
Important result
µ(x + t) = fx (t) / Sx (t),
or alternatively
fx (t) = µ(x + t) Sx (t).
So in particular, for x = 0 we have
f = µS .
Proof
fx (t) = lim_{h↓0} (1/h) {P(Tx ≤ t + h) − P(Tx ≤ t)}
= lim_{h↓0} (1/h) {P(T ≤ x + t + h | T > x) − P(T ≤ x + t | T > x)}
= lim_{h↓0} (1/h) [{F (x + t + h) − F (x)} − {F (x + t) − F (x)}] / S(x)
= lim_{h↓0} (1/h) (F (x + t + h) − F (x + t)) / S(x)
= (S(x + t) / S(x)) lim_{h↓0} (1/h) (F (x + t + h) − F (x + t)) / S(x + t)
= (S(x + t) / S(x)) lim_{h↓0} (1/h) P(T ≤ x + t + h | T > x + t)
= (S(x + t) / S(x)) µ(x + t)
= Sx (t) µ(x + t).
Summary
• Tx is a continuous random variable denoting the random future lifetime of an individual
alive at age x .
• The distribution function Fx (t) ≡ tqx is defined on [0,ω− x ] with PDF fx (t) = F ′x (t).
• The survivor function Sx (t) ≡ tpx = 1− tqx . Note that
1. s+tpx = spx tpx+s = tpx spx+t and
2. fx (t) = −S ′x (t).
• The survivor function and the PDF provide two methods of specifying the distribution
of a continuous random variable.
• An additional way was provided via the hazard function
µ(x + t) = fx (t) / Sx (t) = −S ′x (t) / Sx (t).
Why additionally consider the hazard rate when we have Fx (t) and fx (t)?
(i) It may be physically enlightening to consider the immediate risk.
(ii) Comparisons of groups of individuals are sometimes most incisively made via the hazard.
(iii) Hazard-based models are often convenient when there is censoring.
(iv) When fitting parametric models the form of the hazard function can be enlightening
about the assumptions made by the model: e.g. Exponential =⇒ constant hazard.
2.5 Cumulative Hazard Function
Definition: The cumulative (or integrated) hazard rate of Tx , denoted Mx (t), is simply
Mx (t) = ∫_0^t µ(x + s) ds.
This leads to another important relationship,
Sx (t) = exp{−Mx (t)}.
Recall,
µ(x + t) = −S ′x (t) / Sx (t) = −(d/dt) log Sx (t).
Furthermore we have the condition Sx (0) = 1 from which Sx (t) = exp{−Mx (t)}
follows.
Note,
fx (t) = µ(x + t) exp{−Mx (t)}.
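These identities can be checked numerically: integrating the hazard and exponentiating should recover the survivor function. The sketch below assumes a Weibull hazard with hypothetical parameters and approximates Mx(t) with the trapezoidal rule.

```python
import math

# Check of S_x(t) = exp(-M_x(t)) for a Weibull hazard
# mu(u) = eta * alpha^(-eta) * u^(eta-1) (hypothetical parameters).
alpha, eta = 70.0, 3.0
mu = lambda u: eta * alpha ** (-eta) * u ** (eta - 1)

def M(x, t, n=100_000):
    """Trapezoidal approximation of integral_0^t mu(x + s) ds."""
    h = t / n
    total = 0.5 * (mu(x) + mu(x + t))
    for i in range(1, n):
        total += mu(x + i * h)
    return total * h

S = lambda u: math.exp(-((u / alpha) ** eta))
x, t = 40.0, 10.0
Sx = S(x + t) / S(x)                      # exact t p_x
assert abs(math.exp(-M(x, t)) - Sx) < 1e-6
```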
2.6 Complete and Curtate Expectations of Future Lifetime
We have now met several different methods for specifying the distribution of Tx - through
the hazard, density, survivor function, etc. Given this distribution, we now consider two
expectations of interest.
2.6.1 Complete future lifetime
Definition: The complete expectation of life at age x , e˚x , is the expected future lifetime
e˚x = E(Tx ) = ∫_0^{ω−x} t fx (t) dt
= ∫_0^{ω−x} −t (d/dt tpx ) dt
= −[t tpx ]_0^{ω−x} + ∫_0^{ω−x} tpx dt
= ∫_0^{ω−x} tpx dt.
In the case of human mortality e˚x , and in particular e˚0, is used as a measure of health
both for comparison between different populations and for trends within a single population.
Note that for 0 ≤ x1 < x2, e˚x2 is not necessarily less than e˚x1 . For example, there may
be a high rate of infant mortality.
We may also calculate the variance,
Var(Tx ) = ∫_0^{ω−x} t² fx (t) dt − e˚x²
= ∫_0^{ω−x} t² tpx µ(x + t) dt − e˚x².
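As a worked check of e˚x = ∫ tpx dt, consider the uniform (De Moivre) law on [0, ω], under which tpx = 1 − t/(ω − x) and e˚x = (ω − x)/2 in closed form. The sketch below recovers this by numerical integration; ω = 100 and x = 30 are illustrative choices.

```python
# Uniform (De Moivre) law on [0, omega]: t p_x = 1 - t/(omega - x),
# so the complete expectation is e_x = (omega - x)/2.
omega, x = 100.0, 30.0
w = omega - x
tp = lambda t: 1.0 - t / w

# trapezoidal rule for integral_0^{omega - x} t p_x dt
n = 10_000
h = w / n
integral = (0.5 * (tp(0.0) + tp(w)) + sum(tp(i * h) for i in range(1, n))) * h
assert abs(integral - w / 2.0) < 1e-9   # (omega - x)/2 = 35 here
```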
2.6.2 Curtate future lifetime
Suppose once more that we have an individual who has survived to age x .
As a discretised alternative to the continuous valued future lifetime Tx , we can consider
the number of whole years Kx subsequently survived, so Kx = ⌊Tx⌋; that is, the integer part of Tx (rounded down).
Kx is a discrete random variable on {0, 1, ... , ⌊ω − x⌋} with probability mass function (pmf)
P(Kx = k) = P(k ≤ Tx < k + 1) = k+1qx − kqx
or, equivalently, = kpx 1qx+k .
It is simple to check these are equivalent (Exercise).
We can use this pmf to calculate the expectation of Kx .
Definition: The curtate expectation of life at age x , ex , is thus defined by
ex = E(Kx ) = Σ_{k=0}^{⌊ω−x⌋} k · kpx 1qx+k
= Σ_{k=1}^{⌊ω−x⌋} kpx . (Exercise)
Again we might also be interested in the variance
Var(Kx ) = Σ_{k=0}^{⌊ω−x⌋} k² kpx 1qx+k − ex².
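Continuing the uniform-law example from the previous section, the curtate expectation can be computed directly as ex = Σ kpx; note that it sits roughly half a year below the complete expectation.

```python
# Curtate expectation under the uniform (De Moivre) law on [0, omega]:
# k p_x = 1 - k/(omega - x), an illustrative special case.
omega, x = 100, 30
w = omega - x
ex = sum(1.0 - k / w for k in range(1, w + 1))
print(ex)   # 34.5: about half a year below the complete expectation of 35
```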
2.7 Random Survivorship Group
Suppose we observe a group of `0 newborn lives sampled from our homogeneous population.
That is, `0 (not necessarily independent) realisations of T .
We might then be interested in the number of individuals still alive at age x , n(x), for
some x > 0.
Since each individual will survive beyond age x with probability xp0, if the lifetimes were
independent we would have
n(x) ∼ Binomial(`0, xp0).
In any case, the expected number of survivors at x , denoted `x , is certainly given by
`x = `0 xp0.
(`0 is referred to as the radix.)
Furthermore, it then follows that the expected number of deaths in the group between
ages x and x + t, written tdx , is given by
tdx = `x − `x+t
= `0{xp0 − x+tp0}.
Again when t = 1 we simply write dx .
From the definition of the hazard function and (1), for small t we also have
tdx ≈ `xµ(x)t.
A plot of `xµ(x) (= `0f (x)) against x is known as the curve of deaths (!).
2.8 Life Tables
Each year the Office for National Statistics (ONS, http://www.statistics.gov.uk/default.asp) produces a new set of estimates of the distribution of mortality in the UK, based on recorded deaths in the country over the most recent three-year period. These estimates are presented in the form of life tables (the most recent numbers are http://www.ons.gov.uk/ons/rel/lifetables/national-life-tables/2012-2014/ref-uk.xls for the UK and http://www.ons.gov.uk/ons/rel/lifetables/national-life-tables/2012-2014/ref-england.xls for England). Life tables such as these, which are based on large amounts of data and can therefore be considered reliable, are known as standard tables.
The life data used consider only the integer age at death, and so continuous concepts
such as the density and hazard functions cannot be directly estimated. Instead, the statistics
are presented as those from a hypothetical survivorship group of initial size `0=100,000.
The life tables thus contain estimates for the following discrete quantities for each integer
age x ∈ {0, 1, 2, ... ,ω}.
`x - as above, the expected number of survivors from the `0 newborns at x .
dx - as above, the expected number of individuals who die between ages x and x + 1.
qx - the initial rate of mortality, which is the probability of an individual who has reached
age x dying before age x + 1. So qx = dx/`x .
mx - the central rate of mortality, which is the death rate 'per year lived'. Those who die during the year can be assumed to live for half of the year, so mx = qx/(1 − qx/2). mx is approximately the average force of mortality over [x , x + 1) (for small qx).
ex - the period life expectancy for an individual who has survived to x , which is the curtate
expectation of life using the estimates qx as our probabilities.
Interim Life Tables
AGE x mx qx `x dx ex
0 0.005006 0.004993 100000.0 499.3 78.05
1 0.000335 0.000335 99500.7 33.4 77.44
2 0.000189 0.000189 99467.3 18.8 76.46
3 0.000144 0.000144 99448.5 14.3 75.48
4 0.000107 0.000107 99434.2 10.6 74.49
5 0.000119 0.000119 99423.6 11.9 73.50
6 0.000112 0.000112 99411.7 11.1 72.51
7 0.000087 0.000087 99400.6 8.7 71.51
8 0.000115 0.000115 99391.9 11.4 70.52
9 0.000102 0.000102 99380.5 10.2 69.53
10 0.000098 0.000098 99370.3 9.7 68.54
... . . . . . . . . . . . . . . .
Table 1: Extract from table based on UK data for males for the years 2008-2010.
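The columns of a life table are linked by dx = qx ℓx and ℓx+1 = ℓx − dx; the sketch below checks this against the first two rows of Table 1.

```python
# Consistency of Table 1's columns: d_x = q_x * l_x and l_{x+1} = l_x - d_x,
# using the radix and q_0 from the first row.
l0, q0 = 100_000.0, 0.004993
d0 = q0 * l0
l1 = l0 - d0
assert abs(d0 - 499.3) < 0.05     # matches the d_x column for x = 0
assert abs(l1 - 99500.7) < 0.05   # matches the l_x column for x = 1
```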
[Figure: dx/`0 plotted against age x , 2008/10 ONS data]
[Figure: qx plotted against age x , 2008/10 ONS data]
2.9 Censoring
A defining feature of survival data, which renders many standard statistical analysis methods
inappropriate, is that survival times are frequently censored.
A survival time is said to be censored if the exact event time (e.g. time of death) has
not been observed. Such observations provide only partial information about the value of T .
There are a number of common censoring mechanisms that prevent observation of some
event times.
2.9.1 Right-censoring
An event time is right-censored if the censoring mechanism prematurely terminates observa-
tion of the individual, before the event has actually occurred. For example, if we lose track of
an individual, or the study comes to an end. We only learn that T > t for some t ∈ (0,ω).
If it is known beforehand that right-censoring is due to happen at time c for all events
which have not yet occurred, then the time spent observing the individual Tobs = min(T , c)
is a random variable of mixed type. That is, Tobs has continuous distribution over the interval
[0, c) and then a discrete atom of mass at time c .
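The mixed continuous/discrete nature of Tobs is easy to see by simulation. The sketch below assumes an exponential lifetime (rate 0.1, an illustrative choice) censored at c = 10, so the atom at c has mass P(T > c) = exp(−λc).

```python
import math
import random

# Simulate T_obs = min(T, c) for T ~ Exp(lambda): continuous on [0, c),
# with a discrete atom at c of mass exp(-lambda * c).
random.seed(1)
lam, c, n = 0.1, 10.0, 100_000
tobs = [min(random.expovariate(lam), c) for _ in range(n)]
atom = sum(t == c for t in tobs) / n    # empirical mass of the atom at c
print(atom, math.exp(-lam * c))         # both close to exp(-1), about 0.368
```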
2.9.2 Left-censoring
An event time is left-censored if we discover the event occurred before observation of the
individual began; alternatively, it could be that we observe the event time but only have a
lower bound on when the life of the subject began. In either case, all we can say is that the
actual time to event T is less than some time t. For example, we begin a study of patients
three months after an operation and find that some have died. All we know is that T < 3
months for those patients.
2.9.3 Interval-censoring
When all that is known is that the event occurred within an interval. For example, if we
check the status of individuals every d days.
Although always undesirable, a censoring mechanism may inevitably be introduced through
the design of an experiment or data collection process.
2.9.4 Truncation
Truncation happens when we wish to estimate the distribution of T but some of our data
is sampled instead from a conditional distribution such as T |a < T ≤ b. Truncation is
somewhat similar to censoring but is not a type of censoring.
2.9.5 Type I censoring
Type I censoring occurs if we take n individuals and observe them for a pre-specified time
t∗. Any non-events are right censored with censoring time t∗. Commonly found in medical
studies, e.g. follow 15 patients for two years after operation.
[Diagram: observation timelines against time, each marked E (event) or C (censored), with all observation ending at time t∗]
• Generalised Type I censoring: As above, although now individuals join the study at
different times ⇒ even censored observations have different durations on the study.
[Diagram: as above, but with staggered entry times, so censored individuals accrue different durations by t∗]
2.9.6 Type II censoring
Where we take n individuals and observe them until the d th event occurs. The remaining
non-events are right-censored, although (unlike Type I) this censoring time is not known in
advance. Commonly found in reliability analysis, e.g. take 20 components and keep testing
until half of them fail.
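The difference between the two schemes can be sketched in simulation (exponential lifetimes are an illustrative choice): under Type I the number of observed events is random, while under Type II it is fixed at d by construction.

```python
import random

# Type I vs Type II censoring applied to the same simulated lifetimes.
random.seed(2)
times = [random.expovariate(0.5) for _ in range(20)]

tstar = 2.0                                    # Type I: fixed study length
type1 = [(min(t, tstar), t <= tstar) for t in times]

d = 10                                         # Type II: stop at the d-th event
cutoff = sorted(times)[d - 1]                  # d-th order statistic
type2 = [(min(t, cutoff), t <= cutoff) for t in times]

# Under Type II exactly d events are observed, by construction.
assert sum(e for _, e in type2) == d
print(sum(e for _, e in type1), "events observed under Type I (random)")
```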
2.9.7 Competing risks
Suppose we are interested in the marginal distribution of failure time T1 due to a particular
cause, but the occurrence of a separate event, a competing risk, at random time T2 would
cause the individual to exit the study. Then we will only be able to observe min(T1, T2),
along with the type of failure. Observations where T2 < T1 will constitute right-censored
observations of T1.
One important assumption made throughout this course is that there is independence
between the censoring mechanism and the event time. That is, if C denotes a (possibly
deterministic) censoring time, and T is the (perhaps unobserved) event time, for simplicity
of inference we assume T and C are statistically independent random variables.
3 Parametric Distributions of Random Lifetimes T
3.1 Introduction
In Chapter 2 we defined various mathematical expressions that could be used to describe the
probability distribution of a future random lifetime T . But until now, we have not considered
what forms these quantities might take.
In this chapter we shall meet some standard parametric distributions that are commonly
used in survival analysis.
Clearly, any distribution over the positive half-line R+ is a possible candidate. Moreover,
distributions on R can be models for log T . However, a number of standard distributions
have emerged as being particularly appropriate for the task of analysing survival data.
We shall consider the three most popular models, these being the exponential, the Weibull
and the Gompertz-Makeham distributions. Each will make a different assumption about the
nature of the hazard rate.
The parametric models in this chapter do not admit a limiting age ω, all three have
support over the whole of R+. If we were to truncate these distributions to lie on [0,ω], we
should be mindful of the induced changes in the nature of the hazard functions.
3.2 Exponential Distribution
The exponential distribution is a natural starting point as a lifetime distribution. For λ > 0,
if T ∼ Exp(λ) then
f (t) = λ exp(−λt),
S(t) = exp(−λt),
µ(t) = λ,
M(t) = λt.
The constant hazard function reflects a property known as lack of memory ; for the
exponential distribution,
Sx (t) = S(t).
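The lack-of-memory identity can be confirmed directly from S(t) = exp(−λt):

```python
import math

# Lack of memory: S_x(t) = S(x + t)/S(x) = S(t) for every x.
lam = 0.2   # illustrative rate
S = lambda t: math.exp(-lam * t)
Sx = lambda x, t: S(x + t) / S(x)
assert abs(Sx(37.0, 5.0) - S(5.0)) < 1e-12
```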
In practice however, the assumption of a constant hazard is often untenable and when
fitting the exponential distribution to data it can be sensitive to outliers.
The other two distributions we now consider can be seen to generalise the exponential
distribution; that is, each contains the exponential distribution within their family as a special
case.
3.3 Weibull Distribution
The Weibull distribution can be written as
f (t) = ηα^{−η} t^{η−1} exp(−(t/α)^η),
S(t) = exp(−(t/α)^η),
µ(t) = ηα^{−η} t^{η−1},
M(t) = (t/α)^η,
where α > 0 and η > 0 are known as the scale and shape parameters respectively.
The Weibull generalises the exponential distribution, which is recovered when η = 1.
Figures 1 and 2 show the hazard and density functions for different values of the shape
parameter.
[Four panels: µ(t) against t for η = 0.5, 1.5, 2.5 and 5]
Figure 1: Hazard functions for Weibull with mean 1 and shape parameter η.
The mean and variance of the Weibull density are
αΓ(η^{−1} + 1)   and   α²{Γ(2η^{−1} + 1) − [Γ(η^{−1} + 1)]²}
respectively, where Γ is the gamma function Γ(t) = ∫_0^∞ u^{t−1} e^{−u} du.
For η < 1 we have a monotonically decreasing (antitonic) hazard rate and for η > 1 we
have a monotonically increasing (isotonic) hazard rate (see Fig. 1).
In particular, for 1 < η < 2 the hazard increases more slowly than linearly in t; for η = 2 the increase is linear, and for η > 2 it is faster than linear.
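The mean formula above can be checked by Monte Carlo, using the inverse-transform sampler T = α(−log U)^{1/η} that follows from S(t) = exp(−(t/α)^η):

```python
import math
import random

# Monte Carlo check of the Weibull mean alpha * Gamma(1 + 1/eta).
# Inverse transform: U ~ Uniform(0,1) gives T = alpha * (-log U)^(1/eta).
random.seed(3)
alpha, eta, n = 1.0, 2.5, 200_000
draws = [alpha * (-math.log(1.0 - random.random())) ** (1.0 / eta)
         for _ in range(n)]
mean = sum(draws) / n
assert abs(mean - alpha * math.gamma(1.0 + 1.0 / eta)) < 0.01
```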
[Four panels: f(t) against t for η = 0.5, 1.5, 2.5 and 5]
Figure 2: Density function for Weibull with mean 1 and shape parameter η.
The Weibull distribution is probably the most widely used parametric distribution in
survival analysis. Some possible reasons:
• Simplicity of the survivor and hazard functions.
• Covers a wide variety of distributional shapes (see Fig. 2).
• Empirically it has been found to be accurate in many contexts. It is an extreme value
distribution.
3.4 Gompertz-Makeham Distribution
3.4.1 Gompertz and Makeham laws of mortality
The Gompertz law of mortality states that in a low mortality environment where external
causes of death are rare, the force of mortality increases approximately exponentially with age.
Empirical data gathered from observation of insects kept in laboratory conditions support
this hypothesis.
Outside of such an environment, the Makeham law of mortality considers the risk of
other, external causes of death to be approximately constant with age.
So together these laws suggest a hazard rate which is a sum of constant (Makeham) and
exponential (Gompertz) terms. This motivates the Gompertz-Makeham distribution.
3.4.2 Gompertz-Makeham hazard function
The Gompertz-Makeham distribution is a three parameter distribution with hazard function
µ(t) = θ + β exp(γt)
for non-negative parameters θ, β, γ.
This distribution again generalises the exponential distribution. The survivor function
and density function can be derived in the usual way (Exercise).
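As a numerical sanity check on this construction (the derivation itself is left as the exercise), integrating the hazard gives M(t) = θt + (β/γ)(e^(γt) − 1) and hence S(t) = exp{−M(t)}; the sketch below, with arbitrary parameter values, confirms that M′(t) = µ(t):

```python
import math

def gm_hazard(t, theta, beta, gamma):
    """mu(t) = theta + beta * exp(gamma * t)."""
    return theta + beta * math.exp(gamma * t)

def gm_cum_hazard(t, theta, beta, gamma):
    """M(t) = integral of mu over [0, t] = theta*t + (beta/gamma)*(exp(gamma*t) - 1)."""
    return theta * t + beta * (math.exp(gamma * t) - 1) / gamma

def gm_surv(t, theta, beta, gamma):
    """S(t) = exp(-M(t))."""
    return math.exp(-gm_cum_hazard(t, theta, beta, gamma))

p = (0.01, 0.001, 0.1)  # hypothetical (theta, beta, gamma)
t, h = 1.0, 1e-6
# Central difference approximation to M'(t) should recover the hazard.
num = (gm_cum_hazard(t + h, *p) - gm_cum_hazard(t - h, *p)) / (2 * h)
print(abs(num - gm_hazard(t, *p)) < 1e-6)  # True
```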
3.5 Choosing a Distribution
We have listed three common parametric distributions for lifetime random variables (there
are many more: gamma, log-normal, log-logistic,. . . ).
Given a particular data set of, say, n realisations of T , how do we decide which distribution
is appropriate?
Each distribution makes particular assumptions about the form of the hazard. Knowledge
of the ‘true’ hazard rate would enable us to decide which distribution was ‘closest’. In practice
we can compare to a non-parametric estimate of the hazard which converges to the truth.
3.5.1 Empirical survivor function
Suppose we have gathered some survival time data for n individuals drawn from a popu-
lation with common survivor function S(t). Assume, for now, that there are no censored
observations. Then the empirical survivor function
Sˆ(t) = (number of observations > t) / n
provides an estimate of S(t) derived purely from the data. Notice that Sˆ(t) is an antitonic
step function with jumps at the death times. (Sˆ(t) is a non-parametric estimator of S(t).
In Chapter 5 we shall see how to estimate S in the presence of censoring.)
Since S(t) = exp{−M(t)}, the simple transformation
Mˆ(t) = − log Sˆ(t)
then provides a similar, data-based estimate of the cumulative hazard.
We can use a plot of Mˆ(t) to see how the overall shape of the function compares with
those dictated by some different distributions. In particular,
• M(t) vs. t is linear for the exponential;
• log M(t) vs. log t is linear for the Weibull;
• log M(t) vs. t approaches linearity for large t for the Gompertz-Makeham.
After plotting Mˆ(t) in these ways we can decide which assumption is closest. This
provides us with a crude but useful method of selecting a parametric model for a given set
of data.
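A minimal sketch of the procedure in §3.5.1, for a small hypothetical uncensored sample:

```python
import math

def empirical_survivor(data, t):
    """S_hat(t) = (number of observations > t) / n, for uncensored data."""
    return sum(1 for x in data if x > t) / len(data)

times = [0.3, 0.9, 1.1, 1.8, 2.4, 3.0]   # hypothetical uncensored lifetimes
S_hat = empirical_survivor(times, 1.0)    # 4 of the 6 observations exceed t = 1.0
M_hat = -math.log(S_hat)                  # data-based estimate of the cumulative hazard
print(round(S_hat, 4), round(M_hat, 4))   # 0.6667 0.4055
```

In practice one would evaluate Sˆ and Mˆ over a grid of t values and plot Mˆ(t) against t (or log Mˆ against log t, etc.) as described above.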
Figure 3: Sˆ(t) (left: the empirical survivor function) and Mˆ(t) (right: the empirical
cumulative hazard function) for a sample of uncensored lifetime data.
4 Fitting Parametric Distributions
In Chapter 3 we met some standard parametric families that might be appropriate for mod-
elling T , and gave some simple criteria for choosing amongst them given a set of realised
values.
We now assume that a distribution has been selected. What values should the parameters
of the distribution take? This is a problem in statistical inference.
The most popular statistical approach to fitting parametric distributions to data is the
method of maximum likelihood (ML) estimation.
In survival analysis the inference problem is often complicated by the presence of censored
observations, and so we will need to consider ML estimation in this context.
4.1 Maximum Likelihood Estimation
Assume that the data (t1, ... , tn) are n independent, possibly censored realisations from the
distribution F (t; θ), where the form of F is known (e.g. Weibull) but the parameters θ are
unknown.
Definition: The likelihood is the joint probability of the observed data, regarded as a function
of the unknown parameters θ.
The likelihood principle states that all of the information about the parameters in a
sample of data is contained in the likelihood function.
The law of likelihood extends this principle to state that the degree to which the data
supports one parameter value over another is given by the ratio of their likelihoods.
The maximum likelihood estimate (MLE) of θ is based solely upon the likelihood of
the observed data. Maximum likelihood estimation thus provides a principled and general
method of estimating the parameters of parametric distributions from observed data.
As the name suggests, the MLE is the value of the parameters that maximises the prob-
ability of the observed data,
θˆ = arg sup_{θ∈Θ} L(θ),    L(θ) = ∏_{i=1}^n Pr(ti | F(·; θ)),
where L is the likelihood function and Pr(ti | F(·; θ)) is the probability of observing the i-th
observation given the distribution F with parameters θ.
If none of the observations are censored we find
L(θ) = ∏_{i=1}^n f(ti; θ).
For each censored observation the contribution to the likelihood depends on the type of
censoring. Recall, an observation is left (right) censored if we only have an upper (lower)
bound for T , or interval censored if we only know that T lies in a specified interval.
• Pr(ti | F(·; θ)) = S(ti; θ) for an observation right-censored at ti;
• Pr(ti | F(·; θ)) = F(ti; θ) for an observation left-censored at ti;
• Pr(ti | F(·; θ)) = F(ti^(u); θ) − F(ti^(l); θ) for an observation interval-censored within
[ti^(l), ti^(u)].
From now on we shall only consider right-censoring, this being the most common.
We can then split the data into two disjoint sets relating to
U = uncensored data,
C = censored data.
The likelihood function is then
L(θ) = ∏_{i∈U} f(ti; θ) ∏_{i∈C} S(ti; θ).
It is almost always more convenient to work with the log-likelihood,
ℓ(θ) = log{L(θ)} = ∑_{i∈U} log f(ti; θ) + ∑_{i∈C} log S(ti; θ)
= ∑_{i∈U} log µ(ti; θ) + ∑_{i=1}^n log S(ti; θ)
= ∑_{i∈U} log µ(ti; θ) − ∑_{i=1}^n M(ti; θ).
For an m-parameter distribution, the estimate θˆ may be found by solving the likelihood
equations
∂ℓ(θ)/∂θj = 0,   j = 1, ... , m.
Finding the solution often involves numerical techniques such as Newton and quasi-Newton
methods.
4.1.1 Newton’s Method
Newton's method for optimising ℓ(θ) begins by making an initial guess θ = θ0 ∈ R^m,
sufficiently close to the true optimum. Since θ0 will generally not itself be the optimum,
we seek an increment δ ∈ R^m such that ℓ(θ0 + δ) is the optimum.
The (multivariate) Taylor expansion of ℓ about a value θ yields
ℓ(θ + δ) ≈ ℓ(θ) + δ′∇ℓ(θ) + (1/2) δ′∇²ℓ(θ)δ. (2)
To solve the easier problem of maximising the Taylor expansion (2), we find that the
derivative of (2) with respect to δ is
∇ℓ(θ) + ∇²ℓ(θ)δ,
and hence obtain an approximate solution for δ of
δ = −{∇²ℓ(θ)}⁻¹ ∇ℓ(θ).
So the algorithm proceeds iteratively, setting
θn = θn−1 − {∇²ℓ(θn−1)}⁻¹ ∇ℓ(θn−1)
until sufficient convergence in the sequence θ0, θ1, θ2, ... occurs.
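A one-dimensional sketch of the iteration: for the exponential model (§4.2.1 below) with r deaths and total observed time ∑ti, the score and Hessian are available in closed form, so Newton's method can be checked against the exact answer λˆ = r/∑ti. The data values are hypothetical.

```python
def newton(grad, hess, theta0, tol=1e-10, max_iter=100):
    """One-parameter Newton iteration: theta_n = theta_{n-1} - grad/hess."""
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

# Exponential log-likelihood l(lam) = r log(lam) - lam * T, with hypothetical r and T.
r, T = 7, 35.0
lam_hat = newton(grad=lambda lam: r / lam - T,
                 hess=lambda lam: -r / lam ** 2,
                 theta0=0.1)
print(round(lam_hat, 6))  # 0.2, the closed-form answer r / T
```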
4.1.2 Asymptotic distribution of MLE
Maximum likelihood estimation not only provides a set of optimal parameter values for θ but
also allows assessment of the variance in the estimates.
Consider the m × m information matrix I(θ) = −∇²ℓ(θ), so
I(θ)ij = −∂²ℓ(θ)/∂θi∂θj,   i, j = 1, ... , m.
Then asymptotically for large samples, θˆ ∼ N(θ, I(θ)⁻¹).
Evaluating I(θ) by setting the unknown θ equal to θˆ, giving the observed information
matrix, leads to an approximate covariance matrix for θˆ. That is, if V = I(θˆ)⁻¹ then its
ij-th entry vij is an estimate of the covariance between θˆi and θˆj.
In particular, the standard error of θˆj is given by the square root of the j-th diagonal
element of V,
s.e.(θˆj) ≈ √vjj.
4.1.3 Functional invariance of MLEs
Another key advantage of ML estimation is that the MLE of a function of the parameters
g(θ) is simply
ĝ(θ) = g(θˆ).
For example, the exponential distribution has mean E(T) = 1/λ, hence the MLE of E(T)
is
Ê(T) = 1/λˆ = (1/r) ∑_{i=1}^n ti,
which is the total time on study survived by the individuals divided by the number of
deaths.
4.2 Examples of ML Estimation
We assume that we have observed lifetimes t1, ... , tn with r uncensored observations and
n− r right-censored observations.
4.2.1 Exponential distribution
The log-likelihood is
ℓ(λ) = r log λ − λ ∑_{i=1}^n ti.
Hence
dℓ/dλ = r/λ − ∑_{i=1}^n ti,
which may be set to zero and solved immediately to give
λˆ = r / ∑_{i=1}^n ti.
Also we have
−d²ℓ/dλ² = r/λ²   =⇒   s.e.(λˆ) ≈ λˆ/r^(1/2).
So we have approximately
λˆ ∼ Normal(λ, λ²/r),
and then more approximately
λˆ ∼ Normal(λ, r/{∑_{i=1}^n ti}²).
4.2.2 Weibull distribution
Recall the Weibull survivor function has the form
S(t) = exp(−λt^η)
with scale λ and shape η. Note we have changed the parameterisation slightly, writing λ
for α^(−η).
The log-likelihood in the presence of right-censoring is
ℓ(λ, η) = r log(λη) + (η − 1) ∑_{i∈U} log ti − λ ∑_{i=1}^n ti^η.
We set
∂ℓ/∂λ = 0 = r/λˆ − ∑_{i=1}^n ti^ηˆ (3)
and
∂ℓ/∂η = 0 = r/ηˆ + ∑_{i∈U} log ti − λˆ ∑_{i=1}^n ti^ηˆ log ti. (4)
Solving (3), we obtain
λˆ = r / ∑_{i=1}^n ti^ηˆ.
Substituting into (4) gives
r/ηˆ + ∑_{i∈U} log ti − (r / ∑_{i=1}^n ti^ηˆ) ∑_{i=1}^n ti^ηˆ log ti = 0.
This is a non-linear equation in ηˆ which can only be solved using an iterative numerical
procedure.
In practice it is common to maximise over {λ, η} simultaneously using Newton's method.
An important by-product is an approximation to the covariance matrix, from which standard
errors for the MLEs can be obtained.
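One such iterative procedure is sketched below: bisection (rather than Newton's method) applied to the profile equation obtained by substituting (3) into (4). The data are hypothetical, with the censored times entering only the sums over all i.

```python
import math

def weibull_profile_score(eta, deaths, all_times):
    """d l / d eta after substituting lambda_hat = r / sum_i t_i**eta (equation (4))."""
    r = len(deaths)
    s = sum(t ** eta for t in all_times)
    s_log = sum(t ** eta * math.log(t) for t in all_times)
    return r / eta + sum(math.log(t) for t in deaths) - r * s_log / s

def solve_eta(deaths, all_times, lo=0.05, hi=20.0, iters=200):
    """Bisection: the profile score is positive for small eta, negative for large eta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weibull_profile_score(mid, deaths, all_times) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: 5 deaths, 2 right-censored observations.
deaths = [0.5, 1.0, 1.5, 2.0, 3.0]
all_times = deaths + [1.2, 2.5]      # censored times enter only the sums over all i
eta_hat = solve_eta(deaths, all_times)
lam_hat = len(deaths) / sum(t ** eta_hat for t in all_times)
print(round(eta_hat, 3), round(lam_hat, 3))
```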
4.3 Hypothesis Testing for Nested Distributions
At the end of Chapter 3 we provided some heuristic methods for distinguishing between
distributions. Here we consider asymptotic properties of the likelihood function to provide a
more rigorous test that can be used for certain comparisons.
Suppose we are interested in making statements about a subset θ(A) of the parameters
θ ∈ Θ. Assume that θ is partitioned as θ = (θ(A), θ(B)) and Θ = Θ(A) × Θ(B).
• For example, in the Weibull distribution we might take θ(A) = η (shape), θ(B) = λ
(scale).
Let θˆ = (θˆ(A), θˆ(B)) denote the MLE of θ = (θ(A), θ(B)).
We will consider two tests of the null hypothesis H0 : θ(A) = θ0(A) vs. H1 : θ(A) ∈ Θ(A)
which make use of the MLE.
• Note that for the Weibull distribution a test for η = 1 is a test of exponentiality.
The tests lead to the derivation of a confidence region for θ(A) which is the collection of
parameter values in the subset Θ(A) not ‘rejected’ at a certain significance level.
4.3.1 Likelihood ratio statistic
Let
• θˆH1 = θˆ be the unconstrained MLE over Θ;
• θˆH0 = (θ0(A), θˆ(B)(θ0(A))) be the constrained MLE, where θˆ(B)(θ0(A)) is the MLE of
θ(B) over Θ(B) with θ(A) fixed at θ(A) = θ0(A).
Then the likelihood ratio test statistic is
W(θ0(A)) = −2[ℓ(θˆH0) − ℓ(θˆH1)].
Under the null hypothesis H0 : θ(A) = θ0(A), W is itself a random variable with an
approximate chi-squared distribution with mA degrees of freedom, where mA = dim(θ(A)).
Large values of W relative to χ²mA supply evidence against H0.
The corresponding 1 − α confidence region for θ(A) is
{θ(A) : W(θ(A)) ≤ χ²mA,α},
where χ²mA,α is the upper 100α percentage point of χ²mA. Within this region we cannot
reject the null hypothesis at the α significance level.
4.3.2 Wald statistic
Suppose V = V(θˆ(A), θˆ(B)) is the asymptotic covariance matrix for (θˆ(A), θˆ(B)) evaluated
at the MLE.
Let VA be the leading submatrix of V corresponding to θ(A). Then
W∗(θ0(A)) = (θˆ(A) − θ0(A))′ VA⁻¹ (θˆ(A) − θ0(A))
also has an approximate χ²mA distribution under the null hypothesis θ(A) = θ0(A).
The corresponding 1 − α confidence region for θ(A) is
{θ(A) : W∗(θ(A)) ≤ χ²mA,α}.
Both the Wald statistic and the likelihood ratio statistic are based on large-sample theory;
they are asymptotically equivalent and often give similar results in practice. However, large
discrepancies are possible.
In such cases the likelihood ratio statistic is perhaps preferable, as its results are invariant
to reparameterisation.
Against this, the Wald statistic has the advantage that only one maximisation, over the
unconstrained parameter space Θ, is required.
Note however, that the theory is for large samples and may give a poor approximation
to small-sample results.
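As an illustration, the likelihood ratio test of exponentiality (η = 1) within the Weibull family can be sketched as follows, for hypothetical uncensored data; a crude grid search over η stands in for a proper Newton maximisation, and 3.841 is the upper 5% point of χ²1:

```python
import math

def weibull_loglik(lam, eta, times):
    """Weibull log-likelihood for fully observed (uncensored) lifetimes."""
    r = len(times)
    return (r * math.log(lam * eta)
            + (eta - 1) * sum(math.log(t) for t in times)
            - lam * sum(t ** eta for t in times))

times = [0.4, 0.9, 1.3, 1.8, 2.1, 2.6, 3.0, 3.5]  # hypothetical lifetimes

# Under H0 (eta = 1, the exponential), lambda_hat = r / sum(t_i).
ll_h0 = weibull_loglik(len(times) / sum(times), 1.0, times)

# Under H1, profile out lambda and maximise over eta with a crude grid search.
def profile_ll(eta):
    lam = len(times) / sum(t ** eta for t in times)
    return weibull_loglik(lam, eta, times)

ll_h1 = max(profile_ll(0.2 + 0.01 * k) for k in range(500))

W = -2 * (ll_h0 - ll_h1)
print(round(W, 2), "vs", 3.841)  # 3.841 is the upper 5% point of chi-squared(1)
```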
5 Nonparametric Methods
In Chapter 3 we met some parametric distributions which may be appropriate for modelling
a random lifetime variable T . Once a parametric model has been chosen, the distribution of
T is then constrained to take a fixed functional form and can only vary within the scope of
the parameters. (In Chapter 4 we considered how best to choose those free parameters to
fit a set of data.)
A limitation of this approach is that when the model is inadequate some interesting
features within the data will be hidden. That is, we are constrained in how we learn about
the distribution of T by the model we have selected to represent it.
In this chapter we see how this constraint can be relaxed and the search for an estimated
distribution extended to the space of all distributions, leading to the name nonparametric or
distribution-free inference.
So far we have been assuming T to be continuous, but in what follows we will require
some results from discrete random variable distributions.
5.1 Discrete Lifetimes
Suppose that the random lifetime T could only take one of a countable set of values
{a1 < a2 < ...}. Then T would be a discrete random variable, with a probability mass
function (pmf)
pj = P(T = aj),   j = 1, 2, ... .
As usual, we would require that {pj} satisfy 0 ≤ pj ≤ 1 for all j, and ∑j pj = 1.
5.1.1 Discrete hazard rate
We define the discrete hazard rate at time aj as
µj = P(T = aj | T ≥ aj) = pj / (1 − ∑_{i<j} pi). (5)
Rearranging (5) gives
pj = µj (1 − ∑_{i<j} pi)   =⇒   pj = p_{j−1} (µj/µ_{j−1}) (1 − µ_{j−1}) (Exercise) (6)
and so by induction
pj = µ1 for j = 1,   and   pj = µj ∏_{i<j} (1 − µi) for j > 1. (7)
Thus {pj} and {µj} both characterise the distribution of T, although for the latter we
only need 0 ≤ µj ≤ 1 for all j (and, if the range of T is finite, that µj = 1 for some j).
Note that if µj = 1 for some j, then surely T ≤ aj.
5.1.2 Survivor and discrete cumulative hazard rate
Since P(T > aj) ≡ P(T ≥ aj+1), note that another rearrangement of (5) gives
P(T > aj) = pj+1/µj+1.
Substituting for pj+1 with (7), we get
P(T > aj) = ∏_{i=1}^j (1 − µi)
=⇒ S(t) = ∏_{j: aj≤t} (1 − µj). (8)
So for an individual to survive to time t they must survive through all of the support points
aj up to time t, where each represents a Bernoulli trial with death probability µj. Again,
note that if µj = 1 for some j, then S(t) = 0 for all t ≥ aj.
In an analogous definition to the continuous case (§2.5), we define the cumulative hazard
to be
M(t) = ∑_{j: aj≤t} µj. (9)
Note that this invalidates the identity M(t) = − log{S(t)} from §2.5. However, for small h,
− log(1 − h) ≈ h, and thus for low hazard rates
M(t) ≈ − ∑_{j: aj≤t} log(1 − µj) = − log{S(t)}.
5.2 Kaplan-Meier (Product-limit) Estimate
Returning to the idea of nonparametric estimation, we would like to find the maximum
likelihood distribution for some observed survival data, over the space of all distributions.
Suppose we observe n independent identically distributed lives from a population with
survivor function S , but that in the presence of right-censoring we only observe m deaths.
Let t1 < t2 < ... < tk be the distinct ordered death times, with k ≤ m so that more
than one death can occur at any one time. Let dj denote the number of deaths at time tj,
with ∑_{j=1}^k dj = m.
Observation of the remaining n − m times is censored; suppose that cj lives are censored
in the half-open interval [tj, tj+1), for j = 0, 1, 2, ... , k, where t0 = 0 and tk+1 = ω.
Let tj1, tj2, ... , tjcj be the censoring times in the j-th interval, where the {tji} may or
may not be distinct. Note that ∑_{j=0}^k cj = n − m.
[Diagram: the interval [tj, tj+1), with censoring times tj1, tj2, tj3 marked within it.]
Let
nj = n − ∑_{i≤j−1} ci − ∑_{i≤j−1} di   ( = ∑_{i=j}^k (ci + di) ) (10)
be the total number of individuals still ‘in view’ as we reach time tj . Note the convention of
including in nj any observations censored at tj ; alternative schemes may be more appropriate
in some circumstances, and these naturally might lead to differing results.
It is easy to show that regardless of the nature of the true underlying distribution, the
maximum likelihood distribution for T will be that of a discrete random variable with atoms
of probability mass {pj} at the death times {tj}. Informally, any distribution placing mass
away from the death times would admit possibilities outside of the data observed, and so
clearly give lower likelihood to the data that have been observed.
To identify the maximum likelihood discrete distribution, the problem at hand then is to
identify optimal probability masses to place at each of the death times t1, ... , tk .
In constructing a likelihood function for our unknown discrete distribution, it is most
straightforward to specify the likelihood of the observed data in terms of the discrete hazard
function. Proceeding through time from birth, individuals would only be at risk of death at
the discrete points t1, ... , tk . Viewing the data sequentially at these risk times, as we reach
time tj , there are nj individuals still at risk, each subject to a Bernoulli trial with probability
µj of death and (1− µj ) of survival. Once individuals have died or been right-censored, they
cease to play a part in the subsequent terms of the likelihood function.
Hence we can write the likelihood of the data given a discrete model as
L(µ1, ... , µk) = ∏_{j=1}^k µj^dj (1 − µj)^(nj − dj),
which is clearly maximised by
µˆj = dj/nj,   j = 1, 2, ... , k.
Hence the MLE for S is
Sˆ(t) = ∏_{j: tj≤t} (1 − dj/nj).
This is known as the Kaplan-Meier or product-limit estimate. It is the MLE over the space
of all valid distribution/survivor functions.
Product-limit: consider finer and finer partitions of the time axis and estimate S(t) as
the product of probabilities of surviving each sub-interval. We can treat this as a discrete,
sequential problem using the discrete hazard definition in §5.1, and thus obtain the Kaplan-
Meier estimate in the limit as the mesh of the partition tends to zero.
Example
Consider the following data on 10 individuals
• death times: 1.1, 3, 3, 7, 10, 12.4;
• right-censoring times: 0.2, 0.8, 4.5, 11.
Death time   Censoring time   nj   dj   Sˆ(t)
             0.2                        1
             0.8                        1
1.1                            8    1   0.875
3                              7    2   0.625
             4.5                        0.625
7                              4    1   0.469
10                             3    1   0.312
             11                         0.312
12.4                           1    1   0.0
Note: If the largest observation is a censored survival time, then Sˆ(t) is undefined beyond
this maximum time tmax.
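The table above can be reproduced directly from the definition; the sketch below uses exact rational arithmetic so the product incurs no rounding error:

```python
from fractions import Fraction

def kaplan_meier(deaths, censored):
    """Product-limit estimate; returns (t_j, S_hat(t_j)) at each distinct death time."""
    all_times = sorted(deaths + censored)
    s, out = Fraction(1), []
    for t in sorted(set(deaths)):
        d = sum(1 for u in deaths if u == t)
        n_j = sum(1 for u in all_times if u >= t)  # censored at t_j stay in the risk set
        s *= 1 - Fraction(d, n_j)
        out.append((t, float(s)))
    return out

km = kaplan_meier(deaths=[1.1, 3, 3, 7, 10, 12.4], censored=[0.2, 0.8, 4.5, 11])
print([(t, round(s, 3)) for t, s in km])
# [(1.1, 0.875), (3, 0.625), (7, 0.469), (10, 0.312), (12.4, 0.0)]
```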
5.3 Greenwood’s Formula for the Standard Error of the K-M Estimate
The Kaplan-Meier estimate is the most important and widely used estimate of the survivor
function, and so it is of interest to estimate its variability.
We have
log Sˆ(t) = ∑_{j: tj≤t} log(1 − µˆj)
and then by approximate independence the variance of log Sˆ(t) is
Var{log Sˆ(t)} ≈ ∑_{j: tj≤t} Var{log(1 − µˆj)}.
The variance of (1 − µˆj) follows from the variance of a binomial random variable,
Var(1 − µˆj) ≈ µˆj (1 − µˆj)/nj.
Now we make use of a general result, known as the delta method, for approximating the
variance of a function of an asymptotically normal statistical estimator (like the MLE). If X
is a random variable with mean µX and g(X) is a function of X with first derivative g′(X),
then
Var{g(X)} ≈ g′(µX)² Var(X). (11)
This leads to
Var{log(1 − µˆj)} ≈ Var(1 − µˆj)/(1 − µˆj)² ≈ {µˆj (1 − µˆj)/nj}/(1 − µˆj)²
= µˆj/{nj (1 − µˆj)} = dj/{nj (nj − dj)}.
Hence
Var{log Sˆ(t)} ≈ ∑_{j: tj≤t} dj/{nj (nj − dj)}
and further application of (11) leads to
Var{log Sˆ(t)} ≈ {1/Sˆ(t)²} Var{Sˆ(t)},
so that
Var{Sˆ(t)} ≈ {Sˆ(t)}² ∑_{j: tj≤t} dj/{nj (nj − dj)}.
Finally the standard error of the Kaplan-Meier estimate is the square root of the estimated
variance; that is,
s.e.{Sˆ(t)} = Sˆ(t) √( ∑_{j: tj≤t} dj/{nj (nj − dj)} ),
which is known as Greenwood’s formula.
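Applying Greenwood's formula at t = 5 to the example of §5.2, where the death times so far are 1.1 (with n = 8, d = 1) and 3 (with n = 7, d = 2), and Sˆ(5) = 0.625:

```python
import math

def greenwood_se(s_hat, summaries):
    """s.e.{S_hat(t)} = S_hat(t) * sqrt(sum of d_j / (n_j (n_j - d_j)))
    over death times t_j <= t; summaries is a list of (n_j, d_j)."""
    total = sum(d / (n * (n - d)) for n, d in summaries)
    return s_hat * math.sqrt(total)

# Death times up to t = 5 in the example: (n, d) = (8, 1) and (7, 2); S_hat(5) = 0.625.
print(round(greenwood_se(0.625, [(8, 1), (7, 2)]), 4))  # 0.1712
```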
5.4 Nelson-Aalen Estimate of the Cumulative Hazard
We have found the MLE for S , Sˆ , to be the survivor function of a discrete random variable
with atoms of probability mass at the death times {tj}. This has a discrete hazard rate
µˆ = {µˆj}.
If our interests lay in estimating the cumulative hazard function M, as in §3.5, then
following definition (9) the corresponding estimate would be
Mˆ(t) = ∑_{j: tj≤t} µˆj = ∑_{j: tj≤t} dj/nj.
This is the Nelson-Aalen estimate. By the property of invariance of MLEs under transfor-
mation, this is the MLE for the (discrete) cumulative hazard function. The MLE for the
continuous cumulative hazard function is instead − log(Sˆ(t)).
Similarly to Greenwood's formula for the Kaplan-Meier estimate, we can obtain an
estimate of the standard error of the Nelson-Aalen estimate,
Var{Mˆ(t)} ≈ ∑_{j: tj≤t} dj (nj − dj)/nj³
=⇒ s.e.{Mˆ(t)} ≈ √( ∑_{j: tj≤t} dj (nj − dj)/nj³ ).
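Both the Nelson-Aalen estimate and its standard error are simple sums over the death times; here they are evaluated at t = 5 using the (tj, nj, dj) values from the example of §5.2:

```python
import math

# (t_j, n_j, d_j) at each death time, taken from the example in Section 5.2.
summaries = [(1.1, 8, 1), (3, 7, 2), (7, 4, 1), (10, 3, 1), (12.4, 1, 1)]

def nelson_aalen(t, summaries):
    """M_hat(t) = sum over death times t_j <= t of d_j / n_j."""
    return sum(d / n for tj, n, d in summaries if tj <= t)

def nelson_aalen_se(t, summaries):
    """Approximate standard error: sqrt(sum of d_j (n_j - d_j) / n_j**3)."""
    return math.sqrt(sum(d * (n - d) / n ** 3 for tj, n, d in summaries if tj <= t))

print(round(nelson_aalen(5.0, summaries), 4))     # 1/8 + 2/7 = 0.4107
print(round(nelson_aalen_se(5.0, summaries), 4))
```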
5.5 The Actuarial Estimate of S
The “adjusted-observed” life-table estimate, or actuarial estimate, of the survivor function
deals with aggregated, interval-censored and right-censored observations of a continuous
time-to-event variable, which are commonly found in actuarial applications. Its construction
is very similar to that of the product-limit estimate.
We suppose that the time axis is divided up into intervals (usually of equal length,
typically one year) [a0, a1), [a1, a2), ... , [ak−1, ak) where 0 = a0 < a1 < ... < ak .
The data are aggregated within these intervals, so the number of deaths, dj , and losses
to censoring, cj , are known for each interval [aj−1, aj ) but the actual lifetimes/censoring
times are unknown. As in the product-limit model, we estimate the probabilities of surviving
each subsequent interval.
If all of the deaths were known to precede all of the losses to censoring, then we could
proceed with estimation as before with µˆj = dj/nj . But in general we have no reason to
assume this. The actuarial assumption is that within each interval the losses to censoring
occur uniformly across the interval. Hence, by a crude approximation, the adjusted, expected
number of individuals at risk at any time during the interval is
n′j = n − ∑_{i≤j−1} ci − ∑_{i≤j−1} di − cj/2 = nj − cj/2,
where nj is the number still ‘in view’ at the start of the j-th interval.
So using the arguments developed in §5.1 and §5.2 we treat survival of the intervals as
(approximate) binomial experiments, leading to the life-table estimate of S
S∗(t) = ∏_{j: aj≤t} (1 − dj/n′j),
with corresponding standard error
s.e.{S∗(t)} ≈ S∗(t) √( ∑_{j: aj≤t} dj/{n′j (n′j − dj)} ).
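A sketch of the life-table calculation, with hypothetical yearly counts of deaths and losses to censoring and 100 lives at the start:

```python
def actuarial_estimate(n0, intervals):
    """Life-table estimate S*(a_j): n0 lives at a_0; intervals is a list of
    (d_j, c_j) = (deaths, losses to censoring) per interval."""
    n, s, out = n0, 1.0, []
    for d, c in intervals:
        n_adj = n - c / 2          # actuarial assumption: censoring uniform in interval
        s *= 1 - d / n_adj
        out.append(round(s, 4))
        n -= d + c                 # in view at the start of the next interval
    return out

# Hypothetical data: three one-year intervals, 100 lives at the start.
print(actuarial_estimate(100, [(5, 10), (8, 6), (4, 12)]))
# [0.9474, 0.8549, 0.8023]
```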
6 Regression models
6.1 Inhomogeneous Populations
We have so far considered survival analysis under the assumption that the lifetimes of in-
dividuals in our population are independent, identically distributed realisations of a random
variable T following a common distribution F .
This assumption of a homogeneous population is restrictive. We will often have additional
information about the individuals, the knowledge of which affects our beliefs about their
lifetime distribution.
In this chapter we shall consider regression methods for accommodating explanatory
variables which we believe affect the distribution of T for each individual.
These variables could be:
• demographic - e.g. age, sex.
• behavioural - e.g. smoking history, exercise taken.
• physiological - e.g. blood pressure, cholesterol level.
• external - e.g. treatment group/centre, dose level.
If the information on individuals is categorical (e.g. smokes/does not smoke) and we only
have a few explanatory variables, we may be able to partition the data set into smaller groups
of homogeneous individuals and use the methods in Chapters 3-5 to estimate distributions
separately for each group. For example, we could split the population into smokers and
non-smokers, derive Kaplan-Meier estimates Sˆsmokers and Sˆnon-smokers, and compare them.
However, often we might have many factors to take into account and if some of them
are continuous then this simple approach breaks down.
Methods that can deal with these general regression situations include accelerated life
models and the popular proportional hazards model.
Suppose we have decided upon p explanatory variables (or covariates) which we believe
may affect the time until event. Let zi be the p-vector of values of these covariates observed
for individual i .
Where possible we attempt to standardise these covariates so that z = 0 relates to some
standard set of conditions, such as the control group in a clinical trial.
Under both the proportional hazards and accelerated lifetime approaches, a general model
for lifetime distributions is constructed from two parts:
1. A baseline model for the (hypothetical) standard individuals, corresponding to z = 0;
2. A scheme for modifying the baseline model as the covariates z deviate from zero.
6.2 Cox’s Proportional Hazards Model
The Cox model1 proposes that the hazard rates of individuals are related via the relationship
µ(t; z) = µ0(t) exp(β · z), (12)
where β ∈ Rp is a vector of regression parameters. Equation (12) shows a multiplicative
influence on the hazard of any deviation away from zero in each of the p covariates in z . Here
µ0(t) is known as the baseline hazard and represents the hazard of a (possibly hypothetical)
individual with z = 0.
The survivor function and density follow:
S(t; z) = S0(t)^exp(β·z),
f(t; z) = exp(β · z) S0(t)^(exp(β·z)−1) f0(t),
where S0(t) and f0(t) are the baseline survivor and density functions corresponding to µ0(t).
In (12), only µ0, not z or β, depends on time but the model can also be formulated with
time-dependent covariates.
Under the Cox model, the hazard functions of two individuals with covariates z1, z2 are
in constant proportion at all times; that is,
µ(t; z1)/µ(t; z2) = exp(β · z1)/exp(β · z2) = exp{β · (z1 − z2)},
giving rise to the name proportional hazards.
Note that in general we can have any positive function ψ of the covariates leading to
µ(t; z) = µ0(t)ψ(z ; β) and a ‘proportional hazards’ structure.
However, Cox’s model easily ensures that the hazard is always positive, and gives a linear
model for the log-hazard which is very convenient in theory and practice.
The utility of the model arises from the idea that the general shape of the hazard functions
could be the same for all individuals in a population, with any covariate-specific differences
between individuals having a constant, multiplicative effect.
So, if we are not primarily concerned with the precise form of the hazard, but with the
effects of the covariates, we may wish to ignore µ0 and estimate the regression coefficient
vector β from the data without reference to the shape of µ0.
This is termed a semi-parametric approach: a parametric model for covariate effects,
exp(β · z), and a distribution-free reference to µ0.
The Cox model (12) dominates the literature on regression of lifetimes, due to its sim-
plicity and versatility.
1Cox, D. R. (1972) Regression Models and Life Tables, Journal of the Royal Statistical Society Series B,
34, 187-220.
6.2.1 Partial likelihood function
Suppose we have observation times t1, ... , tn, some of which are right-censoring times. Let
each of these individuals have associated covariate vectors z1, ... , zn.
We consider inference about β when the baseline hazard function is completely unknown.
This raises the question: “What is the likelihood of observing the data as a function of
β when µ0 is unspecified?”
That is, we would like the marginal likelihood of β. Strictly, this is not analytically
available (this would require Bayesian methods). However, Cox developed the following
(subjective) argument:
In the absence of knowledge about µ0, the time intervals between deaths can
provide little or no information about β, as their distribution will depend heavily
on µ0.
Hence it is only the event time ordering of the individuals that holds any in-
formation about β. The contribution to the likelihood function for β for an
individual, with covariates zi , that dies at time ti is
P(individual i with covariates zi dies at ti | one death at ti)
= P(individual i with covariates zi dies at ti) / P(one death at ti)
= µ(ti; zi) / ∑_{j∈Rti} µ(ti; zj)
= exp(β · zi) / ∑_{j∈Rti} exp(β · zj),
where Rti is the risk set; that is, the set of individuals “at risk” (or “in view”, so
not dead or censored) at time ti .
This leads to
L(β) = ∏_{i∈U} exp(β · zi) / ∑_{j∈Rti} exp(β · zj), (13)
where the product is over the set of uncensored observations (death times).
The likelihood (13) is not a full likelihood function for the observed data as it makes
no use of the actual event/censoring times. For this reason it is referred to as a partial
likelihood.
Note that the baseline hazard cancels out in the partial likelihood and so L(β) is inde-
pendent of µ0 when we only consider the order in which the deaths and losses to censoring
occurred.
Note that the censored observations only appear in the denominator of (13), through
their presence/absence in the risk sets Rti at each of the death times.
34
If the event times are ordered, t1 < t2 < · · · < tn, then j ∈ Rti ⇐⇒ tj ≥ ti ⇐⇒ j ≥ i.
Example
Consider the following data on 5 individuals: z1, death at 2.8; z2, censored at 3.1; z3,
death at 2.2; z4, death at 4.7; z5, censored at 2.6. So there are 3 deaths.
Writing ψi = exp(β · zi ), the partial likelihood for β is constructed sequentially:
L(β) = ψ3/(ψ1 + ψ2 + ψ3 + ψ4 + ψ5) × ψ1/(ψ1 + ψ2 + ψ4) × ψ4/ψ4.
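The log partial likelihood for this example can be evaluated directly. The covariate values below are hypothetical (the example leaves them unspecified); note that at β = 0 each death contributes −log|Rti|, so here ℓ(0) = −log 5 − log 3 − log 1:

```python
import math

def cox_partial_loglik(beta, data):
    """data: list of (time, event, z), event 1 = death, 0 = censored; scalar covariate z."""
    ll = 0.0
    for t_i, e_i, z_i in data:
        if e_i == 1:
            # Risk set at t_i: everyone with observation time >= t_i.
            denom = sum(math.exp(beta * z_j) for t_j, _, z_j in data if t_j >= t_i)
            ll += beta * z_i - math.log(denom)
    return ll

# The five lives from the example, with hypothetical scalar covariates z_i.
data = [(2.8, 1, 0.5), (3.1, 0, 1.0), (2.2, 1, 0.0), (4.7, 1, 1.5), (2.6, 0, 0.2)]
print(round(cox_partial_loglik(0.0, data), 4))  # -log(5) - log(3) - log(1) = -2.7081
```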
In general, analytically maximising L(β) (or ℓ(β)) with respect to β is not possible. Most
statistical packages contain numerical procedures for fitting a Cox model, such as Newton
methods, which make use of ℓ′(β) and ℓ′′(β).
Handling ties:
In practice complications in evaluating the partial likelihood can arise if
1. some dj > 1; more than one death at any time.
2. some censored observations coincide with a death time tj .
It is usual to deal with 2. by assuming that the losses to censoring occurred just after
time tj . So individuals censored at tj are included in the risk set Rtj .
Full analytic treatment of multiple simultaneous deaths, case 1, leads to complicated
expressions for the likelihood, since all possible orderings of the dj deaths from Rtj should be
considered. The following approximation due to Breslow (1974) 2 is commonly used. For
death times t1, ... , tk ,
L(β) = ∏_{j=1}^k exp(β · sj) / [∑_{i∈Rtj} exp(β · zi)]^dj,
where sj is the sum of the covariate vectors z of the dj lives observed to die at time tj.
The partial likelihood is a true likelihood for the order of the events, and so has all
the usual properties; through maximisation it yields an estimator βˆ which is asymptotically
(multivariate) normally distributed and unbiased, and whose asymptotic covariance matrix
can be estimated by the inverse of the observed information matrix, found from
I(β)ij = −∂²ℓ(β)/∂βi∂βj,   i, j = 1, ... , p,
2Breslow, N. (1974) Covariance Analysis of Censored Survival Data. Biometrics, 30, 89-99.
evaluated at βˆ, where ℓ(β) = log{L(β)} is the log partial likelihood.
I(βˆ) is usually provided as a by-product by most computer packages when fitting the
Cox model using Newton's method.
This allows standard errors for βˆ to be obtained, which are helpful for model checking.
6.2.2 Model checking
In a practical survival problem, several explanatory variables might present themselves
as possibly affecting the survival time.
Part of the modelling procedure is to assess which (if any) are relevant to the model.
That is, which ones have a significant effect on the distribution of T .
The fact that the partial likelihood behaves like a full likelihood allows us to make use
of the test statistics from §4.3.
Suppose we currently have a model with p covariates and we wish to consider the inclusion
of an extra q covariates.
We can fit both models, one with p and the other with the p+q covariates, by maximising the partial likelihoods. Let ℓp and ℓp+q be the maximised log-likelihoods of the two nested models. Note that ℓp+q ≥ ℓp.
The likelihood ratio statistic is

2(ℓp+q − ℓp)

and has an asymptotic χ²_q distribution under the null hypothesis that the extra q covariates have no additional effect in the presence of the original p explanatory variables:

H0 : βp+1 = ... = βp+q = 0  vs.  H1 : (βp+1, ..., βp+q) ∈ ℝ^q.
Example
Suppose we are seeking a model for learning the effect of hypertension on survival.
The first model we might consider has two covariates zi = (zi1, zi2) relating to the sex
of an individual, zi1, and their blood pressure, zi2.
Suppose then we wanted to test the hypothesis that cigarette smoking has no effect,
allowing for sex and blood pressure.
We can construct two nested models with alternative covariate vectors z⁽¹⁾ = {(zi1, zi2) : i = 1, ..., n} and z⁽²⁾ = {(zi1, zi2, zi3) : i = 1, ..., n}, where zi3 is a 0-1 indicator of whether the ith individual smokes.
We fit both models and note the values of the maximised log-likelihoods, ℓ_{z⁽¹⁾} and ℓ_{z⁽²⁾}. Under the assumption that smoking has no effect, the test statistic W = 2(ℓ_{z⁽²⁾} − ℓ_{z⁽¹⁾}) has an asymptotic χ²_1 distribution.
If W is small we cannot reject the null hypothesis, that smoking has no effect.
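As a quick numeric sketch (with made-up log-likelihood values, not taken from any real data set), the test can be carried out in a few lines; for q = 1 the χ²_1 tail probability is available through the standard-library erfc, since P(χ²_1 > w) = erfc(√(w/2)).

```python
import math

def lr_test_1df(loglik_null, loglik_alt):
    """Likelihood ratio test for one extra covariate (q = 1).

    Returns W = 2(l_alt - l_null) and its asymptotic chi-squared(1)
    p-value, using P(chi2_1 > w) = erfc(sqrt(w/2)).
    """
    W = 2.0 * (loglik_alt - loglik_null)
    p_value = math.erfc(math.sqrt(W / 2.0)) if W > 0 else 1.0
    return W, p_value

# Hypothetical maximised log partial likelihoods for the two nested models:
W, p = lr_test_1df(loglik_null=-201.3, loglik_alt=-198.1)
# W = 6.4 and p is roughly 0.011, so with these invented numbers smoking
# would be judged significant at the 5% level.
```

The same function applies to any pair of nested Cox fits differing by a single covariate; for q > 1 a general χ²_q tail probability routine would be needed instead.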
6.2.3 Estimating the baseline distribution
Once the regression coefficients have been estimated by MLE, it is possible to obtain a
nonparametric estimate of the baseline survivor function.
The following formula from Kalbfleisch and Prentice (1980)³ is a generalisation of the product-limit (PL) estimator of §5.2 when there are no ties:

Ŝ0(t) = ∏_{ti≤t} ( 1 − exp(β̂ · zi) / ∑_{j∈Rti} exp(β̂ · zj) )^{exp(−β̂·zi)}.
As with the PL estimator, if the last observation is a censored value, then Sˆ0(t) is
undefined past this point. Note that a plot of Sˆ0(t) may often suggest a suitable parametric
form for S .
Of course, there is no requirement that µ0 be unspecified to perform regression modelling in survival analysis. Where µ0 can be assumed to come from a known parametric family, full likelihood-based inference for {µ0, β} can proceed. In most practical applications the form of µ0 will not be well understood beforehand; in that case, learning the effects of the covariates and from there estimating the shape of the baseline model, as above, might lead to consideration of a suitable parametric form.
6.3 Accelerated Failure Time Model
The accelerated failure time model is an alternative, general model for survival data in which
the covariates are assumed to act multiplicatively on the time-scale.
The model has intuitive appeal in that the covariates are seen to ‘speed up’ or ‘slow
down’ the passage of time for one individual relative to another.
Formally, the survivor function for an individual is of the form

S(t; z) = S0(t ψ(z; β)),

where S0 is a baseline survivor function and ψ(z; β) is a positive function of the covariates parameterised by β.
Assuming ψ(0; β) = 1, the corresponding random variables are related by

Tz = T0 / ψ(z; β),

where Tz is the future lifetime random variable for an individual with covariates z and hence T0 is the baseline future lifetime random variable.
It follows that

f(t; z) = f0[t ψ(z; β)] ψ(z; β),
µ(t; z) = µ0[t ψ(z; β)] ψ(z; β).
As in the proportional hazards model it is convenient to take ψ(z ; β) = exp(β · z).
3Kalbfleisch, J. D. and Prentice, R. L. (1980) The statistical analysis of failure time data. Wiley.
6.3.1 Relation to proportional hazards
For illustration we consider regression modelling of two groups of individuals with
• zi ∈ {0, 1} (e.g. smokers/non-smokers) and
• a baseline hazard which is piecewise constant, say

µ0(t) = 0.5 for 0 ≤ t ≤ 1, and µ0(t) = 1 for t > 1.

Note that this is in effect a piecewise exponential model.
Consider a proportional hazards model
µ(t; zi ) = µ0(t) exp(βzi )
and accelerated failure time model,
µ(t; zi ) = µ0[t exp(βzi )] exp(βzi ),
both with β = 1.
Under the accelerated failure time model the increase in hazard for the group {i : zi = 1}
occurs sooner than under the proportional hazards model (see Fig 4, left).
The ‘kink’ in the survivor function (Fig 4, right) also occurs earlier.
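This timing difference can be checked numerically. The sketch below encodes the piecewise-constant baseline above and evaluates both hazards for the z = 1 group; under the AFT model the jump occurs when t·e^β = 1, i.e. at t = e⁻¹ ≈ 0.37, versus t = 1 under proportional hazards.

```python
import math

def mu0(t):
    """Piecewise-constant baseline hazard: 0.5 on [0, 1], 1 thereafter."""
    return 0.5 if t <= 1.0 else 1.0

beta, z = 1.0, 1          # hazards for the group with z = 1
psi = math.exp(beta * z)  # psi(z; beta) = e

def mu_ph(t):   # proportional hazards: mu0(t) * exp(beta z)
    return mu0(t) * psi

def mu_aft(t):  # accelerated failure time: mu0(t exp(beta z)) exp(beta z)
    return mu0(t * psi) * psi

# The AFT hazard jumps when t * e = 1, i.e. at t = 1/e ~ 0.368,
# well before the PH jump at t = 1:
print(mu_aft(0.3), mu_aft(0.5))  # 0.5e ~ 1.359, then e ~ 2.718
print(mu_ph(0.5), mu_ph(1.5))    # 0.5e ~ 1.359, then e ~ 2.718
```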
Figure 4: Hazard (left) and survivor (right) functions: Baseline (–); Proportional Hazards
(-); Accelerated failure (· · · ). With ψ(zi ; β) = exp(zi ), zi = 1.
7 The Markov Model
7.1 Stochastic Processes
Previously we have modelled the future lifetime of an individual currently aged x as a single
random variable T with an unknown distribution F . In this chapter we take a different
approach and treat each lifetime as a realisation of a stochastic process; that is, an indexed
family of random variables on a probability space.
Definition: A stochastic process on a set Ω is a collection {X (t)|t ∈ S} of random variables
X (t) ∈ Ω indexed by a set S . Let P be generic notation for the probability distribution for
the random variables X (t).
Usually the index set S ⊆ R. If S is countable (e.g. S = N), then we say {X (t)}
is a discrete-time process. If S is an interval of R (e.g. S = [0,∞)), then {X (t)} is a
continuous-time process.
In the Markov model, the sample path of the process {X (t)} (the values that it takes
over time) will be used to define our (possibly censored) time to event T .
Example: 2-state model
The simplest example of a continuous-time stochastic process for survival analysis is the
two-state Markov model illustrated by the diagram below:
Alive (0) --- µx ---> Dead (1)
Figure 5: Alive-Dead model.
There is just an Alive state and a Dead state, with transitions in one direction only.
The process takes value X (t) = 0 if the individual is in state Alive at time t, or X (t) = 1
if the individual is in state Dead, so Ω = {0, 1}.
The state Dead is said to be an absorbing state; that is, ∀s > 0
P(X (t + s) = 1|X (t) = 1) = 1.
Here the event T = t in Chapters 2-5 corresponds simply to a change of state from
Alive to Dead at time t.
7.2 Homogeneous Continuous-Time Markov Jump Processes
7.2.1 Markov Jump Processes
If Ω is a finite set of N states (e.g. Ω = {1, 2, ... , N}) and X (t) is a stochastic process on
the state space Ω, we say X (t) is a jump process.
If the path of a continuous-time process X = {X (t) : t ≥ 0} makes only a finite number
of jumps in any finite time interval, we say it is a pure jump process.
Furthermore, a pure jump process will be said to be a Markov jump process if it satisfies
the Markov property
P{X (t) = j |X (t1) = j1, ... , X (ts) = js} = P{X (t) = j |X (ts) = js},
∀j1, ... , js , j ∈ Ω and 0 ≤ t1 < ... < ts < t.
We shall use a Markov jump process to represent the future path of an individual currently
aged x . [Explicit reference to x is suppressed in this section to simplify notation.]
7.2.2 Transition probabilities
Definition: The transition probability pij(s, t) is defined to be the probability of being in state j at time t, given that at time s the process was in state i:

pij(s, t) = P(X(t) = j | X(s) = i),  t ≥ s ∈ ℝ+,  i, j ∈ Ω.

In actuarial notation,

_{t−s}p_s^{ij} ≡ pij(s, t).
7.2.3 Transition Intensities
Specifying coherent transition probabilities in continuous-time Markov models would be
highly complex, and so instead these models are more typically characterised through a
set of corresponding transition intensities.
Definition: For t ≥ 0 and i, j ∈ Ω, i ≠ j, we define the force of transition or transition intensity

µij(t) = lim_{h↓0} pij(t, t + h)/h  (14)

and assume these limits exist. Transition intensities generalise the concept of the hazard function met in §2.4.
When explanatory variables are available for individuals, the transition intensities for each
individual can be constructed to depend on a covariate vector z in an analogous way to the
regression methods of Chapter 6.
Under the assumption of existence of (14) we can write

pij(t, t + dt) = µij(t)dt + o(dt),  dt ≥ 0,

where o(dt)/dt → 0 as dt ↓ 0.
So for small dt, the transition probability pij(t, t + dt) between two distinct states i and j is approximately linear in dt with constant of proportionality µij(t),

pij(t, t + dt) ≈ µij(t)dt.
Note that at time t, if µij (t) = 0 then transitions from i to j surely cannot occur.
For completeness, ∀i ∈ Ω we can define

µii(t) = lim_{h↓0} [pii(t, t + h) − 1]/h,  (15)

although note that we do not have transitions i → i. This gives

pii(t, t + dt) = 1 + µii(t)dt + o(dt),  dt ≥ 0
⟹ pii(t, t + dt) ≈ 1 − (−µii(t))dt.

This can be most helpfully interpreted as

P{X(t + dt) ≠ i | X(t) = i} ≈ −µii(t)dt.
Defining µii(t) this way leads to a useful relationship. Since ∀i ∈ Ω and t, h > 0 we know ∑_{j=1}^N pij(t, t + h) = 1, we therefore have

µii(t) = lim_{h↓0} {[1 − ∑_{j≠i} pij(t, t + h)] − 1}/h = −∑_{j≠i} lim_{h↓0} pij(t, t + h)/h

⟹ µii(t) = −∑_{j≠i} µij(t).
Notice from (14) and (15) that ∀i ≠ j ∈ Ω and ∀t ≥ 0, µij(t) ≥ 0 and µii(t) ≤ 0.
Definition: The N ×N matrix Gt with ij th entry (Gt)ij = µij (t) is called the generator of
the process.
The generator can be thought of as the continuous-time analogue of the one-step tran-
sition probability matrix of discrete-time Markov chains.
We have seen that µii(t) = −∑_{j≠i} µij(t), and hence

∑_{j=1}^N µij(t) = 0,  ∀i ∈ Ω
⟹ Gt 1′ = 0′,

where 1′, 0′ are vectors of ones and zeros respectively.
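As a small illustration, the generator of a three-state model can be assembled from its off-diagonal intensities (the numerical values below are invented for the example) and the row-sum property G1′ = 0′ checked directly:

```python
# States: 0 = Able, 1 = Dead, 2 = Ill; intensities are illustrative values.
mu = {(0, 1): 0.01, (0, 2): 0.10, (2, 0): 0.30, (2, 1): 0.05}

N = 3
G = [[0.0] * N for _ in range(N)]
for (i, j), rate in mu.items():
    G[i][j] = rate
for i in range(N):                  # diagonal: mu_ii = -sum_{j != i} mu_ij
    G[i][i] = -sum(G[i][j] for j in range(N) if j != i)

for row in G:
    assert abs(sum(row)) < 1e-12    # G 1' = 0': every row sums to zero
print(G)
```

The Dead row is identically zero, reflecting that state 1 is absorbing.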
7.2.4 Homogeneity
Definition: The process is said to be homogeneous if
pij (t, t + h) = pij (0, h), ∀i , j ∈ Ω, t, h > 0,
in which case we can simplify notation and write pij (h) for pij (t, t + h).
Clearly the condition of homogeneity is equivalent to the transition intensities being
constant,
µij (t) ≡ µij ,
Gt ≡ G.
7.2.5 Age rate intervals
Homogeneity is clearly a very strong condition, as it implies individuals in the process expe-
rience no effects from ageing. To make this assumption tenable, it is common to divide the
time domain of interest (perhaps R+) into small interval domains S1, S2, ... and fit separate
Markov models for each interval Si , within which homogeneity is more reasonable.
Smoothing methods, known as graduation (see Chapter 11), can be deployed later to
combine the resulting piecewise constant intensity estimates to obtain a smooth curve µij (t)
over the full time domain.
In studies of human mortality, these homogeneous intervals will typically be the integer age ranges [x, x + 1) for x ∈ ℕ = {0, 1, 2, ...}. Hence in these applications our implicit, hidden notation is actually µij_{x+1/2}, as the estimator can be thought of as corresponding most closely to the midpoint of this integer age x interval. Alternative rate interval definitions for age would change this; for example, if we defined age group x as [x − 1/2, x + 1/2], i.e. nearest birthday being x years, then we would be approximately estimating µij_x.
7.2.6 Assuming homogeneity
From now on we shall assume that X (t) is a homogeneous process. We write Pt for the
N ×N matrix with entries pij (t), and the transition intensities µij (t) simply as µij . However,
it should be noted that many of the models and results that follow can be extended into the
inhomogeneous, duration-dependent setting.
Theorem 7.1. The family {Pt : t ≥ 0}, known as the transition semigroup of the process,
is a stochastic monoid under the matrix multiplication operation; that is, it satisfies the
following:
1. {Pt : t ≥ 0} are right-stochastic: ∀t ≥ 0, the entries of each row of Pt are non-
negative and sum to 1.
2. Identity element: P0 = I, the N ×N identity matrix.
3. Semigroup property: Ps+t = PsPt if s, t ≥ 0. These are known as the Chapman-Kolmogorov
equations.
Proof
1. For any i ∈ Ω,

∑_{j=1}^N (Pt)ij = ∑_{j=1}^N pij(t) = P(∪_{j=1}^N {X(t) = j} | X(0) = i) = 1.

2. Clearly (P0)ij = pij(0) = δij.
3. Chapman-Kolmogorov:

pij(s + t) = P(X(s + t) = j | X(0) = i)
           = ∑_{k=1}^N P(X(s) = k | X(0) = i) P(X(s + t) = j | X(s) = k)
           = ∑_{k=1}^N pik(s) pkj(t).
The distribution of {X (t) : t ≥ 0} is completely determined by the stochastic semigroup
{Pt : t ≥ 0} and the initial distribution of X (0). Furthermore, if we condition on X (0)
(i.e. assuming we can observe the initial states of our individuals) then the evolution of X (t)
depends only on Pt .
7.3 A Two-State Markov Model
The two-state homogeneous Markov model in Figure 5 provides a very simple and familiar
special case of these general Markov jump processes.
From above, the single transition intensity µ = µ01 has equation

µ = lim_{h↓0} p01(x, x + h)/h,  ∀x > 0.  (16)
Noticing that in the right hand side of (16), p01(x , x + h) is equal to Fx (h) in random
variable terminology, it is then apparent that here µ must be the hazard function. Since we
have that µ is constant under the assumption of homogeneity, it immediately follows that
the corresponding time to event random variable has an Exponential(µ) distribution.
For making inference about this two-state Markov model in the presence of possibly
censored data, the results for maximum likelihood estimation of µ immediately follow from
§4.2.1.
Let v be the sum of the waiting times (i.e. the time until death or censoring) of all individuals in a study. In actuarial science, this is often called the central exposed to risk and is denoted E^c_x for age group x.
Then if d is the number of deaths in the sample, recall the MLE for µ is given by

µ̂ = d/v.

We had approximately

µ̂ ∼ Normal(µ, µ²/d),

and then, more approximately,

µ̂ ∼ Normal(µ, d/v²).
Further, we can obtain MLEs for survival probabilities S(t) and hence the transition
probabilities p01(t) via Sˆ(t) = exp{−µˆt} and 1− Sˆ(t) respectively.
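A minimal sketch of these calculations, using invented values for the deaths d and central exposed to risk v:

```python
import math

# Hypothetical data: total central exposed to risk v and observed deaths d.
v, d = 482.3, 12

mu_hat = d / v                   # MLE of the constant force of mortality
se_mu = math.sqrt(d) / v         # from mu_hat ~ Normal(mu, d / v^2)

def S_hat(t):
    """Estimated survival probability over t years."""
    return math.exp(-mu_hat * t)

p01_hat = 1.0 - S_hat(1.0)       # estimated one-year death probability
print(mu_hat, se_mu, p01_hat)
```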
7.4 Calculating Transition Probabilities
The real utility of the Markov model for survival data is that the two-state model of §7.3 can
be extended to any number of states, with arbitrary transitions between them. For example,
the three-state Sickness-Death model (see Fig 6).
Able (0) <--- µ20_x / µ02_x ---> Ill (2)
Able (0) --- µ01_x ---> Dead (1) <--- µ21_x --- Ill (2)
Figure 6: Sickness-Death model.
Unfortunately, analytic expressions for the transition probabilities in terms of the tran-
sition intensities are not available in the general N-state Markov model. However, we can
derive differential equations for these relationships, known as the Kolmogorov forward and
backward equations.
The forward equations are preferable if there is a single initial state of particular impor-
tance and we want to know the probabilities of being in the various other states at time
t.
Conversely, the backward equations are preferred if there is a single final state of great
interest and we want the probability of reaching this state at time t from various initial
states.
7.4.1 Kolmogorov Forward Equations
Equations (14) and (15) can be combined and written as

lim_{dt↓0} (1/dt)(Pdt − I) = G.
Given that P0 = I, this matrix equation amounts to saying that Pt is differentiable at t = 0.
Then, since Pt+dt = Pt Pdt, we have

lim_{dt↓0} (1/dt)(Pt+dt − Pt) = Pt lim_{dt↓0} (1/dt)(Pdt − I) = Pt G,

and hence Pt is differentiable everywhere. So

P′t = Pt G,

where P′t denotes the matrix with entries (d/dt) pij(t).
Considering the individual entries of these matrices, we find

(d/dt) pij(t) = ∑_{k=1}^N pik(t) µkj.  (17)

These are known as the Kolmogorov forward equations.
Sometimes it is more helpful to rewrite the forward equations (17) in terms of the meaningful transition intensities; recalling that µjj = −∑_{k≠j} µjk, (17) becomes

(d/dt) pij(t) = ∑_{k≠j} ( pik(t) µkj − pij(t) µjk ).
7.4.2 Kolmogorov Backward Equations
Alternatively, in the above we could have conditioned on the state at dt first, i.e. Pt+dt = Pdt Pt. This yields

P′t = G Pt,

or

(d/dt) pij(t) = ∑_{k=1}^N µik pkj(t),  (18)

or

(d/dt) pij(t) = ∑_{k≠i} ( µik pkj(t) − µik pij(t) ).

These are the Kolmogorov backward equations.
The Kolmogorov equations provide expressions for the derivatives of the transition probabilities. The power series solution is

Pt = exp(Gt) = I + Gt + G²t²/2! + ...
Coupled with the knowledge P0 = I, we can use numerical methods to find Pt . Note
PtG = P′t = GPt, and so the matrices Pt and G commute.
7.4.3 Example: The two-decrement model
In general numerical techniques are required to solve the Kolmogorov equations relating to
the estimates pij (t).
However in certain elementary cases the solutions can be written down in closed form.
For example, consider the two-decrement model:
Able (0) --- µ02_x ---> Retired (2)
Able (0) --- µ01_x ---> Dead (1)
Then we can use, say, the Kolmogorov forward equations (17) to obtain first order linear differential equations for the transition probabilities, which can be solved analytically using the general result

dy/dt = ay + b  ⟹  y = Ce^{at} − b/a.
We find

p00(t) = e^{−(µ01+µ02)t},
p01(t) = [µ01/(µ01 + µ02)] [1 − e^{−(µ01+µ02)t}],
p02(t) = [µ02/(µ01 + µ02)] [1 − e^{−(µ01+µ02)t}].  (Exercise)
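The closed-form solution can be checked against the power series Pt = exp(Gt) from §7.4.2. The sketch below (with illustrative intensity values) evaluates a truncated series in plain Python and compares it with the three formulas above:

```python
import math

mu01, mu02 = 0.02, 0.05                 # illustrative decrement intensities
G = [[-(mu01 + mu02), mu01, mu02],      # states: 0 Able, 1 Dead, 2 Retired
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(A, terms=30):
    """exp(A) via the truncated power series I + A + A^2/2! + ..."""
    n = len(A)
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, A)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

t = 2.0
Pt = expm_series([[g * t for g in row] for row in G])

# Agreement with the closed-form two-decrement solution:
total = mu01 + mu02
assert abs(Pt[0][0] - math.exp(-total * t)) < 1e-12
assert abs(Pt[0][1] - mu01 / total * (1 - math.exp(-total * t))) < 1e-12
assert abs(Pt[0][2] - mu02 / total * (1 - math.exp(-total * t))) < 1e-12
print(Pt[0])
```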
7.5 Calculating Path Probabilities
7.5.1 Holding times
Clearly from our assumptions for the Markov model, ∀t the path {X (s) : 0 ≤ s ≤ t} is a
step function in the state space Ω, making a finite number of jumps with probability 1.
It is useful to consider the waiting times, also known as holding times, within states for
this step function.
Definition: For an individual entering state i at time s (so X(s) = i), we define the holding time to be

inf{t : t ≥ 0, X(s + t) ≠ i}.
Definition: Define p̄ii(t) to be the occupancy probability,

p̄ii(t) = Pr(remain in state i for at least time t).

Note that this has a different interpretation from pii(t): p̄ii(t) is the survivor function for the holding time in state i.
In fact, clearly

p̄ii(t) ≤ pii(t).
By the Markov property we have

p̄ii(t + dt) = p̄ii(t) p̄ii(dt),

and since the probability of more than one transition in a small time window of length dt is o(dt), we can write

p̄ii(t + dt) = p̄ii(t)(1 − ∑_{k≠i} µik dt + o(dt)).

Hence

(d/dt) p̄ii(t) = −p̄ii(t) ∑_{k≠i} µik.

From this we find

p̄ii(t) = exp(−t ∑_{j≠i} µij).

Alternatively we could write

p̄ii(t) = exp(t µii),

and recall that µii ≤ 0.
So the holding time for the ith state in a homogeneous Markov jump process is exponentially distributed with rate −µii = ∑_{j≠i} µij.
Notice the time spent in state i before jumping out therefore depends on the sum of the (non-negative) intensities for transition to the other possible states.
The exponential distribution has the ‘lack of memory’ property
P(T > t + s |T > s) = P(T > t).
That the holding time in any state should exhibit this lack of memory follows intuitively from
• the homogeneity of the process (how old you are is irrelevant)
• and the Markov property (in predicting the future, all that matters is which state you
are in now).
7.5.2 Jump probabilities
We have found the distribution for the length of time spent in a state, so it remains to find
the distribution for the next state the process jumps to.
Let pij = Pr(X jumps from state i to state j | X jumps from state i).
Since the holding time is exponentially distributed with hazard rate −µii, from the definition of the hazard function we have

P{X(t + dt) ≠ i | X(t) = i} = −µii dt + o(dt).

Hence for j ≠ i, we have

P{X(t + dt) = j | X(t) = i} = (−µii dt + o(dt)) pij.

But also we have

P{X(t + dt) = j | X(t) = i} = pij(dt) = µij dt + o(dt).

Setting these two equations equal, dividing by dt and taking limits as dt ↓ 0 thus yields

pij = µij/(−µii) = µij/∑_{k≠i} µik.
So a matrix P with ijth entry pij,

(P)ij = µij/∑_{k≠i} µik for i ≠ j, and (P)ij = 0 for i = j,

represents the transition probability matrix for the embedded Markov chain of the Markov jump process, looking just at the jumps.
That is, if we were to consider only the transitions of our stochastic process {X (t) : t ≥
0}, these would form a discrete time Markov chain with transition matrix P.
In summary then, if X is currently in state i :
• The overall magnitude of the transition intensities, ∑_{j≠i} µij, controls how long our process stays in state i.
• The relative magnitude of a particular µij gives the probability that the next state visited by the process will be j.
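These two observations give a direct way to simulate a path of the process: draw an exponential holding time with rate −µii, then pick the next state from the embedded chain probabilities. A sketch, using an invented generator for the Sickness-Death model:

```python
import random

def simulate_path(G, start, t_max, rng):
    """Simulate a homogeneous Markov jump process with generator G.

    Holding time in state i ~ Exponential(-G[i][i]); the next state is j
    with probability G[i][j] / (-G[i][i]) (the embedded chain).
    """
    t, state, path = 0.0, start, [(0.0, start)]
    while True:
        rate = -G[state][state]
        if rate == 0.0:                  # absorbing state: path is complete
            return path
        t += rng.expovariate(rate)       # exponential holding time
        if t >= t_max:
            return path
        others = [j for j in range(len(G)) if j != state]
        weights = [G[state][j] for j in others]
        state = rng.choices(others, weights=weights)[0]
        path.append((t, state))

G = [[-0.12, 0.02, 0.10],    # Able:  mu01 = 0.02, mu02 = 0.10 (invented)
     [0.0, 0.0, 0.0],        # Dead:  absorbing
     [0.30, 0.05, -0.35]]    # Ill:   mu20 = 0.30, mu21 = 0.05 (invented)
print(simulate_path(G, start=0, t_max=50.0, rng=random.Random(1)))
```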
7.5.3 Example (continued): The two-decrement model
The transition probabilities derived in §7.4.3 are now easily interpreted:
• p00(t) is the survivor function for holding time in state 0, which we know to be
exponentially distributed with rate parameter µ01 + µ02.
• For the other two equations
– the term in brackets is the probability of jumping out of state 0 by time t;
– the fraction gives the conditional probability of each decrement having occurred,
given that one of them has occurred.
7.6 Estimating Transition Intensities
7.6.1 Maximum Likelihood Estimation
We now consider the problem of making inference about the transition intensities for the
time interval [x , x + 1) in the presence of some data. Just as in previous chapters, we shall
use a maximum likelihood approach.
Suppose we observe n individuals from this Markov model, each for a finite period of time.
This will provide n independent partial realisations of the stochastic process {X (t) : t ≥ 0}.
Since our processes are assumed to be Markov and homogeneous, we are able to sum-
marise the information in those observed paths by:
1. The number of transitions i → j, ∀i ≠ j.
2. The holding times between transitions for each visit to each state i.
What is the contribution to the likelihood of each?
1. The jump probabilities for i → j: µij/(−µii).
2. (a) An uncensored period of length t (holding time) in state i: −µii exp(µii t).
(b) A right-censored period of length t (holding time) in state i: exp(µii t).
Since these terms are combined multiplicatively, sufficient statistics are the number of transitions of each type (1) and simply the total holding time in each state (2). Let

D^ij_k = number of transitions by the kth life from state i to state j;
V^i_k = holding time of the kth life in state i.
Defining the totals across individuals,

D^ij = ∑_{k=1}^n D^ij_k,  V^i = ∑_{k=1}^n V^i_k,

it follows that the likelihood for the transition intensities given observed data {v^i, d^ij} is

L(G) = ∏_{i=1}^N ∏_{j≠i} exp{−µij v^i} (µij)^{d^ij},  (19)

where v^i denotes the observed total holding time for state i and d^ij denotes the number of observed transitions i → j. Maximising this function yields the maximum likelihood estimators:

µ̂ij = D^ij / V^i.
The asymptotic properties of the estimators are the same as those from the 2-state model (see §7.3). Asymptotically,

µ̂ij ∼ Normal(µij, (µij)²/d^ij),  (20)

and then, more approximately,

µ̂ij ∼ Normal(µij, d^ij/(v^i)²).  (21)
The estimators µ̂ij are not independent of one another; for example, in the Sickness-Death model (Figure 6) D^01_i and D^21_i are both 0 or 1, but D^01_i D^21_i ≠ 1, while (assuming that the ith life starts in the able state) we have D^02_i = D^20_i or D^02_i = D^20_i + 1.
The estimators are, however, asymptotically independent.
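Given aggregated counts and holding times (the values below are invented), the MLEs and their approximate standard errors from (21) are immediate:

```python
# Hypothetical aggregated data for the Sickness-Death model, one age group:
# d[(i, j)] = observed transitions i -> j, v[i] = total holding time in state i.
d = {(0, 1): 8, (0, 2): 40, (2, 0): 35, (2, 1): 5}
v = {0: 950.0, 2: 120.0}

mu_hat = {(i, j): d_ij / v[i] for (i, j), d_ij in d.items()}

# Approximate standard errors from (21): Var(mu_hat_ij) ~ d_ij / (v_i)^2.
se = {(i, j): d_ij ** 0.5 / v[i] for (i, j), d_ij in d.items()}
print(mu_hat)
print(se)
```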
7.6.2 Interval censoring
The calculation of the estimates µˆij requires the total holding time in the i th state to be
calculated. When the data are interval censored and aggregated this can cause problems.
However, estimates of the total waiting time can be approximated using numerical tech-
niques. Most commonly, we assume transitions occur half way through the interval and
calculate approximate holding times accordingly.
7.7 Statistical Testing of Estimates
Consider a particular transition type i → j. Perhaps from standard tables (see §2.8) or elsewhere, we might have had prior beliefs from a wider population about the values µij, or strictly speaking µij_{x+1/2}, should take on each interval [x, x + 1). Notationally, we shall refer to these expected values as µij_{0,x+1/2}. We may then wish to test whether our observed estimates cohere with these prior beliefs, or else whether our data on transitions i → j represent a statistically significant departure from that population.
Under a null hypothesis that our data are from the same population as the {µij_{0,x+1/2} | x = 0, 1, ...}, we would have from (20) that approximately

µ̂ij_{x+1/2} = d^ij_x / v^i_x ∼ Normal(µij_{0,x+1/2}, (µij_{0,x+1/2})²/d^ij_x)

⟹ d^ij_x ∼ Normal(v^i_x µij_{0,x+1/2}, (v^i_x)² (µij_{0,x+1/2})²/d^ij_x)  (22)

⁣   ≈ Normal(v^i_x µij_{0,x+1/2}, v^i_x µij_{0,x+1/2}).  (23)

From (22), the expected count of i → j transitions for individuals aged x is E(D^ij_x) = v^i_x µij_{0,x+1/2}.
7.7.1 χ2 Goodness of Fit Test
To obtain the usual χ² goodness of fit test statistic X², we then sum across all age groups the squared differences of observed and expected counts divided by the expected counts,

z_x = (d^ij_x − v^i_x µij_{0,x+1/2}) / √(v^i_x µij_{0,x+1/2}),  (24)

X² = ∑_x z_x².  (25)

Under the null hypothesis that our data are from the same population as the {µij_{0,x+1/2} | x = 0, 1, ...}, (25) would follow an approximate χ²_ν distribution with degrees of freedom parameter ν given by the number of age groups [x, x + 1) in our data. The p-value of the data under this null hypothesis is thus given by the χ²_ν right tail probability of the realised value of (25).
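A numeric sketch of the test with three invented age groups; rather than computing a p-value, the statistic is compared with the well-known upper 5% point of χ²_3, namely 7.815:

```python
import math

# Hypothetical data for three age groups: observed deaths d_x, central
# exposures v_x, and standard-table intensities mu0_x (at age x + 1/2).
d = [18, 25, 31]
v = [1500.0, 1400.0, 1300.0]
mu0 = [0.013, 0.017, 0.022]

z = [(d[x] - v[x] * mu0[x]) / math.sqrt(v[x] * mu0[x]) for x in range(3)]
X2 = sum(zx ** 2 for zx in z)

# Compare with the chi-squared upper 5% point on nu = 3 degrees of freedom:
print(X2, X2 > 7.815)   # 7.815 is the 95th percentile of chi2_3
```

With these invented numbers X² is well below the critical value, so there would be no evidence of a departure from the standard table.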
7.7.2 Standardised and Cumulative Deviations Tests
A disadvantage of the goodness of fit test is that by aggregating across all age groups to get
an overall statistic, the test can mask possibly poor performance in a small handful of age
groups, perhaps occurring in a run or clump of neighbouring age groups, or perhaps some
small overall bias in the estimates. The following tests offer some scope for examining how
balanced the contributions from (24) are in the sum of (25).
From (23) it follows that the quantities z_x in (24) each come from the standard normal distribution under the null hypothesis. Thus we can perform a goodness of fit test on the observed z-values against the expected frequencies from i.i.d. samples from N(0, 1).
To make this possible, the range of possible values ℝ must first be divided into bins B1, ..., Bk, typically taken to be (−∞, −3), [−3, −2), [−2, −1), [−1, 0), [0, 1), [1, 2), [2, 3), [3, ∞).
Let Oi be the observed count of the z_x lying in Bi. Then under the null hypothesis, if Ei is the expected value of this count, then approximately

∑_{i=1}^k (Oi − Ei)²/Ei ∼ χ²_ν,

where here ν = k − 1. This is known as the standardised deviations test.
A variation on this idea is the cumulative deviations test. There, we note that (23) also implies

∑_x d^ij_x ∼ Normal(∑_x v^i_x µij_{0,x+1/2}, ∑_x v^i_x µij_{0,x+1/2})

and hence

∑_x {d^ij_x − v^i_x µij_{0,x+1/2}} / √(∑_x v^i_x µij_{0,x+1/2}) ∼ Normal(0, 1).
A z-test can then compare this observed statistic with the standard normal tails.
8 The Binomial and Poisson Models of Mortality
The main uses of Binomial and Poisson models lie in actuarial science. Like our use of
Markov processes in Chapter 7, these models are concerned with survival within the unit time
intervals [x , x + 1). (This is a characteristic particular to actuarial modelling, as opposed to,
say, engineering or medical survival analyses.)
The Binomial model is also valuable as it underpins the Kaplan-Meier nonparametric estimator in Chapter 5.
8.1 Binomial model of mortality
In the analysis of mortality data we are typically presented with the following binary data:
A sample of n homogeneous individuals are monitored over the age range [x , x + 1). The
observed outcome for individual i, Di, is either death (Di = 1) or survival (Di = 0). Define D = ∑_{i=1}^n Di as the total number of deaths.
Using such data, we would like to make inference about the probability of death for a
new individual during the age range [x , x + 1).
Back in Chapter 2 we defined the cdf _tq_x as the probability of an individual currently aged x dying within time t. For notational brevity we also took qx ≡ _1q_x. Then qx is precisely the quantity on which we wish to make inference.
If there is no censoring, so that the death of any individual in our sample can be observed
anywhere during [x , x + 1), then each life indicator Di is the outcome of a Bernoulli trial
with probability of death qx and survival probability px = 1− qx .
In which case, assuming the lifetimes are independent, D ∼ Binomial(n, qx). That is,

P(D = d) = (n choose d) qx^d (1 − qx)^{n−d}.

This is the Binomial model of mortality in its simplest form. Note that E(D) = n qx and Var(D) = n qx(1 − qx) = n qx px.
The maximum likelihood estimator of qx is

q̂x = D/n,

which has mean and variance

E(q̂x) = E(D)/n = n qx/n = qx,
Var(q̂x) = Var(D)/n² = qx(1 − qx)/n = qx px/n.
Asymptotically, by the Central Limit Theorem we have

q̂x ≈ Normal(qx, qx(1 − qx)/n).
[This can also be derived in the usual MLE/information matrix way (see §4.1.2) by noting
that for large n, qx ≈ D/n. (Exercise)]
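A minimal sketch with invented data, including the asymptotic 95% confidence interval that follows from the normal approximation:

```python
import math

# Hypothetical mortality data: n lives observed over [x, x+1), D deaths.
n, D = 10000, 113

q_hat = D / n
se = math.sqrt(q_hat * (1 - q_hat) / n)

# Asymptotic 95% confidence interval for q_x:
ci = (q_hat - 1.96 * se, q_hat + 1.96 * se)
print(q_hat, ci)
```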
Given a suitable set of data the probability of death in any interval can be trivially
estimated in this way. However, the Binomial model can be problematic under more realistic
circumstances, where we may observe individuals over different (incomplete) sub-intervals
of [x , x + 1); there will usually be multiple decrements (individuals leaving the process) and
sometimes increments (individuals joining the process) during [x , x + 1).
8.2 General Binomial Model - Non-uniform observation times
Again consider observing n independent lives over the interval [x , x + 1).
Now we assume that individual i comes into view at time x + ai and is unobserved beyond time x + bi, when it is then deemed to be right-censored if still alive, for 0 ≤ ai < bi ≤ 1.
Then for each individual i, Di ∼ Bernoulli(_{bi−ai}q_{x+ai}).
Defining a vector of the individual death probabilities

q = (_{b1−a1}q_{x+a1}, _{b2−a2}q_{x+a2}, ..., _{bn−an}q_{x+an}),

the full likelihood function is

L(q) = ∏_{i=1}^n (_{bi−ai}q_{x+ai})^{di} (1 − _{bi−ai}q_{x+ai})^{1−di}.  (26)

Clearly, estimating the n values of q via an unconstrained maximisation of (26) is inappropriate, since if the (ai, bi) pairs are unique then the MLEs are degenerate: each estimated _{bi−ai}q_{x+ai} equals di ∈ {0, 1}.
Furthermore, for a new individual with unique entry and exit times (x + an+1, x + bn+1)
we are no further forward in being able to make predictions.
The solution is to assume that the probabilities {_tq_x} vary smoothly as a function of t, 0 ≤ t ≤ 1, thus creating a dependence structure between individuals in the estimation process. To achieve this dependence, the following sections provide three popular alternative methods for fractional age adjustment, each reducing the inference problem to that of estimating the single parameter qx.
But first, as a preliminary step to verify that it is sufficient to construct models for {_tq_x : 0 ≤ t ≤ 1} to fully specify the more general probabilities _{bi−ai}q_{x+ai}, consider the probability _{bi}q_x. Clearly,

_{bi}q_x = _{ai}q_x + (1 − _{ai}q_x) _{bi−ai}q_{x+ai}.

Rearranging then gives

_{bi−ai}q_{x+ai} = (_{bi}q_x − _{ai}q_x) / (1 − _{ai}q_x),

an equation for the required probabilities _{bi−ai}q_{x+ai} in terms of probabilities of the type _tq_x.
8.2.1 Uniform distribution of deaths (UDD)
The UDD model states that any deaths are uniformly distributed over the range [x, x + 1). This implies a conditional cdf P(Tx ≤ t | Tx < 1) = t for 0 ≤ t < 1. Then since P(Tx < 1) = qx,

_tq_x = t qx,  0 ≤ t ≤ 1.

Assuming UDD implies there is an increasing force of mortality (hazard) between integer ages. To see this, first consider the survivor function

_tp_x = 1 − _tq_x = 1 − t qx.

This survivor function implies a hazard rate of

µx(t) = −(d/dt) log(_tp_x) = −(d/dt) log(1 − t qx) = qx/(1 − t qx),

and clearly µx(t) ↑ as t ↑.
8.2.2 Balducci assumption
For qx < 1, the Balducci (or hyperbolic) assumption is

_{1−t}q_{x+t} = (1 − t) qx,  0 ≤ t ≤ 1.

This model assumes a decreasing force of mortality. To see this, consider partitioning the interval [x, x + 1) in two at time t. Then clearly

qx = _tq_x + (1 − _tq_x) _{1−t}q_{x+t}.

Rearranging, we get

_tq_x = (qx − _{1−t}q_{x+t}) / (1 − _{1−t}q_{x+t}).

So under the Balducci assumption,

_tq_x = (qx − (1 − t)qx) / (1 − (1 − t)qx) = t qx / (1 − (1 − t)qx) = 1 − (1 − qx)/(1 − (1 − t)qx).
(Note that, in contrast with UDD, here ${}_tq_x > t\,q_x$ for $t < 1$.)
Hence the survivor function is
$${}_tp_x = \frac{1 - q_x}{1 - (1-t)q_x},$$
and so the hazard function is
$$\mu_x(t) = -\frac{d}{dt}\log({}_tp_x) = -\frac{d}{dt}\log(1 - q_x) + \frac{d}{dt}\log(1 - q_x + t\,q_x) = \frac{q_x}{1 - q_x + t\,q_x},$$
which $\downarrow$ as $t \uparrow$.
8.2.3 Constant force of mortality
In the previous two models we have seen an increasing and then a decreasing force of mortality
over the unit interval.
To achieve a constant hazard we would need
$$-\frac{d}{dt}\log({}_tp_x) = \mu_x$$
for some $\mu_x > 0$ constant w.r.t. $t$.
Integrating this equation, we get
$$\log({}_tp_x) = -\mu_x t,$$
and hence
$${}_tq_x = 1 - \exp(-\mu_x t), \quad 0 \le t \le 1. \tag{27}$$
Through the special case of (27) with $t = 1$, we see $\mu_x$ would be related to $q_x$ by the equation $q_x = 1 - \exp(-\mu_x)$. Substituting this back into (27), we can equivalently write the constant force of mortality assumption in terms of our quantity of interest $q_x$ as
$${}_tq_x = 1 - (1 - q_x)^t, \quad 0 \le t \le 1.$$
This is the required fractional age adjustment model for tqx for a constant force of mortality.
8.2.4 Comparison of the three assumptions
Under any of the three assumptions, the likelihood function (26) reduces to a (complicated)
function of the single parameter qx , L(qx ). Numerical procedures are generally required to
maximise L(qx ) for a given data set.
Figure 7 compares the three modelling assumptions for varying values of $q_x$. For the small values of $q_x$ typically found in actuarial applications, at least at younger ages $x$, the difference between these models diminishes.
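These three fractional-age adjustments are simple enough to compare directly. The sketch below (plain Python; the function names are mine) evaluates ${}_tq_x$ under each assumption and checks the ordering visible in Figure 7, with the Balducci curve lying above the constant-force curve, which in turn lies above UDD, and the gaps shrinking for small $q_x$:

```python
def tq_uniform(t, qx):
    return t * qx                              # UDD

def tq_balducci(t, qx):
    return t * qx / (1.0 - (1.0 - t) * qx)     # hyperbolic

def tq_constant(t, qx):
    return 1.0 - (1.0 - qx) ** t               # constant force

for qx in (0.9, 0.5, 0.1, 0.05):
    for t in (0.25, 0.5, 0.75):
        u, c, b = tq_uniform(t, qx), tq_constant(t, qx), tq_balducci(t, qx)
        assert b >= c >= u   # Balducci >= constant force >= UDD on (0, 1)

# For small qx the three assumptions nearly coincide:
gap = tq_balducci(0.5, 0.05) - tq_uniform(0.5, 0.05)
print(gap < 1e-3)  # prints True
```

All three functions agree at the endpoints $t = 0$ and $t = 1$, as they must, since each is an interpolation of the same annual probability $q_x$.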
8.3 The actuarial estimate
The Balducci assumption can be used to provide a theoretical justification of the actuarial
estimate of the survivor function we derived in §5.5.
For simplicity we assume that the {(ai , bi )} are known. First we notice
$${}_{1-a_i}q_{x+a_i} = {}_{b_i-a_i}q_{x+a_i} + (1 - {}_{b_i-a_i}q_{x+a_i})\,{}_{1-b_i}q_{x+b_i}$$
$$\implies {}_{b_i-a_i}q_{x+a_i} = {}_{1-a_i}q_{x+a_i} - (1 - {}_{b_i-a_i}q_{x+a_i})\,{}_{1-b_i}q_{x+b_i}$$
$$\implies E[D_i] = {}_{1-a_i}q_{x+a_i} - (1 - E[D_i])\,{}_{1-b_i}q_{x+b_i}.$$
So under the Balducci assumption,
$$E[D] = \sum_{i=1}^n E[D_i] = \sum_{i=1}^n (1 - a_i)q_x - \sum_{i=1}^n (1 - E[D_i])(1 - b_i)q_x.$$
Substituting the observed number of deaths $d$ on the left-hand side would provide an orthodox moment estimate of $q_x$. But $E[D_i]$ depends on $q_x$ and is also unknown. So we also crudely estimate $E[D_i]$ with the observed value $d_i$, leading to
$$d \approx \sum_{i=1}^n (1 - a_i)q_x - \sum_{i=1}^n (1 - d_i)(1 - b_i)q_x.$$
[Figure 7 appears here: four panels plotting ${}_tq_x$ against $t \in [0, 1]$ under the Uniform, Balducci and Constant assumptions, one panel for each of $q_x = 0.9, 0.5, 0.1, 0.05$.]
Figure 7: Comparison of uniform, Balducci and constant hazard assumptions for qx =
0.9, 0.5, 0.1, 0.05.
Rearrangement yields the actuarial estimate (cf. §5.5) for qx , now in a more general
form,
$$\hat q_x = \frac{d}{E_x}, \tag{28}$$
where
$$E_x = \sum_{i=1}^n (1 - a_i) - \sum_{i: d_i=0} (1 - b_i) = \sum_{i: d_i=1} (1 - a_i) + \sum_{i: d_i=0} (b_i - a_i). \tag{29}$$
So for the binomial model of mortality, the actuarial estimate (28) can be interpreted as
the MLE under an approximation that
$$D \sim \text{Binomial}(E_x, q_x). \tag{30}$$
8.3.1 Actuarial notation
• Ex in equation (29) is known as the initial exposed to risk.
– Ex counts deaths as having been ‘exposed’ for a duration of (1− ai ).
• Recall the total time under observation, $v$, is known as the central exposed to risk, $E^c_x$ (see §7.3).
– If the death times $\{x + t_i\}$ had been observed, then notice that
$$E_x = E^c_x + \sum_{i: d_i=1} (1 - t_i).$$
Finally, suppose that the individual pairs $\{(a_i, b_i)\}$ are unobserved but that we have a good estimate of the overall central exposed to risk $E^c_x$. Under the crude assumption that deaths occur, on average, at age $x + \frac12$, and ignoring the awkward possibility that $a_i > \frac12$, we obtain the formula
$$E_x \approx E^c_x + \frac{d}{2},$$
and hence from (28),
$$\hat q_x = \frac{d}{v + \frac{d}{2}}.$$
The actuarial estimate is only an approximation to a method-of-moments estimator.
However, it does avoid numerical solutions of equations and for reasonably small qx < 0.3
it is often in agreement with other more statistically sound procedures.
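A small worked sketch of the actuarial estimate (28)-(29) in Python (the study records below are invented for illustration):

```python
# Each tuple: (a_i, b_i, d_i) = entry age within the year, exit age, death flag.
records = [(0.0, 1.0, 0), (0.0, 0.4, 1), (0.25, 1.0, 0), (0.1, 0.8, 0)]

# Initial exposed to risk, equation (29): deaths contribute (1 - a_i),
# survivors/censored individuals contribute (b_i - a_i).
Ex = sum((1.0 - a) if d == 1 else (b - a) for a, b, d in records)
deaths = sum(d for _, _, d in records)
q_hat = deaths / Ex                      # actuarial estimate (28)

# Approximate version via the central exposed to risk v: q = d / (v + d/2).
v = sum(b - a for a, b, _ in records)
q_approx = deaths / (v + 0.5 * deaths)
print(round(Ex, 2), round(q_hat, 4))  # 3.45 0.2899
```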
8.4 Statistical testing of estimates
From (30) we have an expected number of deaths in [x , x + 1) of Exqx . Furthermore, by
the Central Limit Theorem we have approximately
$$D \sim \text{Normal}\big(E_x q_x,\ E_x q_x(1 - q_x)\big).$$
Using these relations for the death counts in each age group observed, we can follow
identical procedures to those introduced in §7.7 to test the goodness of fit of any hypothesised
values for {qx} of interest.
We have already seen earlier how the Binomial model can be simply extended to obtain
the non-parametric product-limit estimate (§5.2) which is widely used in survival analysis.
However, perhaps the crucial weakness is that the Binomial model is difficult to generalise
to settings with more than one outcome. Even the simplest three state model with two
decrements (see §7.4.3) is hard to model using multinomial ideas and the general N-state
models are harder still.
8.5 Poisson model of mortality
The Poisson distribution is a discrete distribution on the natural numbers N = {0, 1, 2, ...}
used to model the number of rare events occurring during a unit interval of time when the
risk of occurrence of these events is constant, e.g. particles emitted by a radioactive source.
For actuarial applications the Poisson distribution can be used as a model for the number
of deaths among a group of n individuals.
We once again consider fitting a separate model for each unit interval age group [x , x + 1),
with individual i being observed within a specific sub-interval [x + ai , x + bi ).
Let $v$ denote the realisation of the total time spent observing the $n$ individuals. (Recall $v \equiv E^c_x$, the central exposed to risk.)
As in §3.2, §7.3 and §8.2.3, we assume that there is a force of mortality (hazard rate)
µx which is constant over [x , x + 1).
The Poisson model for mortality then states that, conditional on $v$, the total number of deaths $D$ follows a Poisson distribution with parameter $\mu_x v$. That is,
$$P(D = d) = e^{-\mu_x v}\frac{(\mu_x v)^d}{d!}, \quad d = 0, 1, 2, \ldots$$
An obvious flaw in this model is that P(D > n) > 0. However, for large n and small
µxv it can be a good approximation.
The Poisson model can be seen to approximate the Binomial model; setting $\lambda = nq_x$, under the Binomial$(n, q_x)$ model
$$P(D = d) = \frac{n!}{(n-d)!\,d!}\,q_x^d (1 - q_x)^{n-d} = \frac{1}{d!}\frac{n!}{(n-d)!\,n^d}(nq_x)^d\left(1 - \frac{nq_x}{n}\right)^{n-d} = \frac{\lambda^d}{d!}\frac{n!}{(n-d)!\,n^d}\left(1 - \frac{\lambda}{n}\right)^{n-d}$$
$$= \frac{\lambda^d}{d!}\prod_{i=1}^d\left(\frac{n - i + 1}{n}\right)\left(1 - \frac{\lambda}{n}\right)^{n-d} \to \frac{\lambda^d}{d!}e^{-\lambda} \quad \text{as } n \to \infty,$$
where the last step makes use of the fact that
$$e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n.$$
Now here in the Poisson model of mortality we have taken λ to be µxv . To see that
nqx ≈ µxv , firstly note that the LHS is the expected number of deaths from n individuals
in [x , x + 1]. Then secondly recall from §4.2.1 or §7.3 that the MLE for a constant hazard
is d/v .
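This Binomial-to-Poisson convergence is easy to check numerically; the sketch below uses only the standard library (the values of $\lambda$ and $d$ are arbitrary):

```python
from math import comb, exp, factorial

def binom_pmf(d, n, q):
    return comb(n, d) * q ** d * (1.0 - q) ** (n - d)

def poisson_pmf(d, lam):
    return exp(-lam) * lam ** d / factorial(d)

lam, d = 2.0, 3
errors = [abs(binom_pmf(d, n, lam / n) - poisson_pmf(d, lam))
          for n in (10, 100, 10000)]
# The approximation error shrinks as n grows with n*qx = lam held fixed.
print(errors[0] > errors[1] > errors[2])  # prints True
```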
9 Counting Processes and the Poisson Process
The Poisson model in §8.5 treats the number of deaths D as a random variable following a
Poisson distribution with parameter µv . To motivate and better understand this model, we
will now see a more rigorous derivation based on stochastic processes.
We shall meet the concept of counting processes, which provide a rich framework for the
consideration of survival problems; all of the main ideas we have met in this course can be
seen as being embedded within this unifying framework.
9.1 Filtrations and Martingales
Let {X (t)} be a stochastic process (see §7.1).
For a continuous-time process $\{X(t)\}$, define
$$X(t-) = \lim_{h\downarrow 0} X(t - h),$$
and for the small time interval $[t, t+dt)$, let
$$dX(t) = X(\{t+dt\}-) - X(t-).$$
If $\{X(t)\}$ were discrete-time indexed, we would define
$$dX(t) = X(t) - X(t-1).$$
Definition: The history (or filtration) Ht of {X (t)} is all that is known in relation to the
process at time t. For example, Ht could simply be {X (s) : 0 ≤ s ≤ t}, but in general
it may include other information as well. Ht− is the history of the process up to, but not
including, time t.
Definition: We will say that the process {X (t)} has the Markov property if ∀s, t > 0,
[X (t + s)|Ht ] ≡ [X (t + s)|X (t)].
Definition: We will say {X (t)} is a martingale with respect to the filtration Ht if, ∀t,
1. E[X (t)] < ∞;
2. E[dX (t)|Ht−] = 0.
So a martingale has zero drift.
In the definition above, if instead we have $\forall t$ $E[dX(t)|H_{t-}] \ge 0$, and so a positive drift (or $\forall t$ $E[dX(t)|H_{t-}] \le 0$, and so a negative drift), we say $\{X(t)\}$ is a submartingale (or supermartingale).
9.2 Counting Processes
Definition: A stochastic process {N(t) : t ≥ 0} is called a counting process if
1. N(0) = 0;
2. N(t) ∈N = {0, 1, 2, ...};
3. N(t) is non-decreasing: N(s) ≤ N(t) for s < t;
4. N(t) is right continuous: $\lim_{h\downarrow 0} N(t+h) = N(t)$;
5. N(t) has left limits: N(t)−N(t−) ∈ {0, 1};
6. E[N(t)] < ∞ ∀t ≥ 0.
[Diagram: a sample path of the counting process $N(t)$, a right-continuous step function with unit jumps.]
Note that N(t) is non-decreasing and hence a submartingale.
From this definition of a counting process, N(t) can be thought of as recording the
number of occurrences of an event from some underlying process up to time t. N(.) is
therefore a random counting measure on R+.
Let $T_0 = 0$ and $T_1 < T_2 < \ldots$ be the ordered observed event times of $N(t)$:
$$T_n = \inf\{t : N(t) = n\}, \quad n = 0, 1, \ldots; \qquad N(t) = \max\{n : T_n \le t\}.$$
Definition: The inter-arrival times are random variables $X_1, X_2, \ldots$ given by
$$X_n = T_n - T_{n-1}, \qquad T_n = \sum_{i=1}^n X_i.$$
9.2.1 Intensity of a Counting Process
Definition: For a counting process $\{N(t)\}$, define the intensity
$$\lambda(t) = \frac{E[dN(t)|H_{t-}]}{dt}.$$
The cumulative intensity, $\Lambda(t)$, is then defined to be
$$\Lambda(t) = \int_{s=0}^t \lambda(s)\,ds.$$
{λ(t)} and {Λ(t)} are again stochastic processes, but these are predictable (known)
given the history process Ht−.
Since dΛ(t) = E[dN(t)|Ht−], we have
E[d(N(t)− Λ(t))|Ht−] = 0,
and hence the process D(t) = N(t)− Λ(t) is a martingale, known as the counting process
martingale. Writing
N(t) = Λ(t) +D(t),
the first part, Λ(t), is called the compensator of the counting process N(t). In contrast
with the random step function N(t), and hence also with the martingale D(t), Λ(t) is
Ht−-predictable and varies smoothly over time.
This decomposition of the right continuous submartingale N(t), as the sum of a right
continuous, Ht−-predictable compensator process and a martingale, is unique by the Doob-
Meyer Decomposition Theorem4.
There is an alternative interpretation of the intensity. From the definition above, we have
$$\lim_{dt\downarrow 0}\frac{E[N(\{t+dt\}-) - N(t-)|H_{t-}]}{dt} = \lambda(t)$$
$$\implies E[N(\{t+dt\}-) - N(t-)|H_{t-}] = \lambda(t)\,dt + o(dt)$$
$$\implies P(N(\{t+dt\}-) - N(t-) = 1|H_{t-}) = \lambda(t)\,dt + o(dt).$$
So for a small time increment $dt$, we have
$$P(N(\{t+dt\}-) - N(t-) = j|H_{t-}) = \begin{cases} 1 - \lambda(t)\,dt + o(dt), & j = 0; \\ \lambda(t)\,dt + o(dt), & j = 1; \\ o(dt), & j > 1. \end{cases}$$
4See, for example, Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1995) Statistical Models
Based on Counting Processes. Springer, New York.
9.3 Counting processes for survival data
Suppose we are to observe the lifetimes of n homogeneous individuals, subject to right-
censoring. We first consider the counting process of an individual from the group and then
combine these individual processes to consider the overall counting process of the group.
9.3.1 Individual level
Let Ti be the event time of individual i .
Define the indicator Yi (t) ∈ {0, 1} s.t. Yi (t) = 1 ⇐⇒ observation of individual i has
not been censored before time t and Ti ≥ t.
[Diagram: a sample path of the at-risk indicator process $Y_i(t)$.]
Then Yi (t) indicates whether individual i is at risk at time t.
The death status of individual i (whether or not he is known to be dead) over time can
be treated as a one-jump counting process Ni (t), where
Ni (t) = I(Ti ≤ t ∩ Yi (Ti ) = 1).
The history, Ht , is the combination of {Ni (s) : s ≤ t} and the censorship status of the
individuals. Clearly Yi (t) will be contained in Ht−. (If the individuals were heterogeneous,
Ht might also include measured covariates, which might be static or time-varying.)
Recall in §9.2 we saw the general result
P(Ni ({t + dt}−)−Ni (t−) = 1|Ht−) = λi (t)dt + o(dt),
where λi (t) is the intensity function for Ni (t).
Well here the LHS is simply P(Observe t ≤ Ti < t + dt|Ht−), which is zero if Yi (t) = 0,
and hence
$$P(N_i(\{t+dt\}-) - N_i(t-) = 1|H_{t-}) = \begin{cases} 0 & \text{if } Y_i(t) = 0, \\ \mu(t)\,dt + o(dt) & \text{if } Y_i(t) = 1, \end{cases}$$
where µ(t) is the hazard function for the homogeneous population.
Setting the RHS of these two equations equal yields the important relationship
λi (t) = Yi (t)µ(t).
Using the compensator representation of a counting process from §9.2,
$$N_i(t) = \int_{s=0}^t Y_i(s)\,\mu(s)\,ds + D_i(t) = \int_{s=0}^t Y_i(s)\,dM(s) + D_i(t), \tag{31}$$
where $D_i(t)$ is a martingale and $M(t) = \int_0^t \mu(s)\,ds$ is the cumulative hazard function.
9.3.2 Group level
Let
• $N(t) = \sum_{i=1}^n N_i(t)$ be the number of events observed up to time $t$;
• $Y(t) = \sum_{i=1}^n Y_i(t)$ be the number of individuals at risk at time $t$.
Example - Four individuals. [Diagram: sample paths of $N(t)$ and $Y(t)$ for four individuals; $N(t)$ steps up at each observed death, while $Y(t)$ starts at 4 and steps down at each death or censoring time.]
Then summing (31) over all individuals gives
$$N(t) = \int_{s=0}^t Y(s)\,\mu(s)\,ds + D(t) \tag{32}$$
$$= \int_{s=0}^t Y(s)\,dM(s) + D(t), \tag{33}$$
where $D(t) = \sum_{i=1}^n D_i(t)$ is also a martingale.
So from (32) the intensity of the process N(t) is
λ(t) = Y (t)µ(t).
Notice that as the number of individuals who have left the study increases, the intensity of
the process decreases.
Finally, consider the differential of equation (33):
dN(t) = Y (t)dM(t) + dD(t).
Then since the conditional expectation of dD(t) given Ht− is zero, for Y (t) > 0 we can
obtain a moment-based estimator
$$d\hat M(t) = \frac{dN(t)}{Y(t)} \implies \hat M(t) = \int_{s=0}^t \frac{dN(s)}{Y(s)}.$$
Noting that $N(t)$ is a step function with unit steps at the death times $\{t_1, t_2, \ldots, t_r\}$ say, we recover the Nelson-Aalen estimator
$$\hat M(t) = \sum_{t_j \le t} \frac{1}{Y(t_j)}.$$
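The Nelson-Aalen estimator admits a very short implementation. A sketch (the function name is mine; distinct event times are assumed so tie-breaking can be ignored):

```python
def nelson_aalen(times, events):
    """Cumulative hazard estimate M^(t), reported at each observed death time.
    times: exit times; events: 1 = death observed, 0 = right-censored."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)              # Y(t) just before the first exit
    M, steps = 0.0, []
    for t, d in pairs:
        if d == 1:
            M += 1.0 / at_risk        # jump of 1/Y(t_j) at each death time
            steps.append((t, M))
        at_risk -= 1                  # this individual leaves the risk set
    return steps

est = nelson_aalen([2.0, 3.0, 3.5, 5.0, 8.0], [1, 0, 1, 1, 0])
print(est)  # jumps of 1/5 at t=2, then 1/3 at t=3.5, then 1/2 at t=5
```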
9.4 Poisson processes
Definition: A Poisson process is a counting process {N(t) : t ≥ 0} with independent
increments. That is, ∀h > 0 the number of events in the interval (t, t + h],
N(t + h)−N(t),
is independent of the history Ht .
We can immediately note that applying a process with independent increments to survival
data, where necessarily N(t) ∈ {0, 1, ... , n}, provides a serious limitation.
Compare with the required λ(t) = Y (t)µ(t) form we saw earlier. Now the intensity will
not be allowed to depend on Y (t). Instead we simply take
λ(t) = µ(t).
This suggests the Poisson process operates as if there is always exactly one individual at
risk at any stage in the study; it is as if we observe that individual until exit from the
study (through death or censoring), and then immediately commence observation of the
next subject, who is of identical age to the previous individual who just exited. The
Poisson process is one of the simplest examples of a continuous time Markov process. As a
consequence of this simplicity, we are able to derive the distribution of N(t).
Recall the cumulative intensity $\Lambda(t) = \int_{s=0}^t \lambda(s)\,ds$.
Theorem 9.1. Let {N(t) : t ≥ 0} be a Poisson process. ∀t > 0, N(t) has a Poisson
distribution with parameter Λ(t). That is,
$$P(N(t) = j) = e^{-\Lambda(t)}\frac{\Lambda(t)^j}{j!}, \quad j = 0, 1, 2, \ldots$$
Proof
$$P(N(t+dt) = j) = \sum_{i=0}^{j} P(N(t) = i)\,P(N(t+dt) = j \mid N(t) = i) = \sum_{i=0}^{j} P(N(t) = i)\,P(N(t+dt) - N(t) = j - i)$$
by the assumption of the Poisson process.
Up to an additive term of o(dt), there will be either zero or one events in the interval
[t, t + dt) with respective probabilities 1− λ(t)dt, λ(t)dt.
Writing $p_j(t) = P(N(t) = j)$, we have
$$p_j(t+dt) = \begin{cases} (1 - \lambda(t)\,dt)\,p_j(t) + o(dt), & j = 0; \\ \lambda(t)\,dt\,p_{j-1}(t) + (1 - \lambda(t)\,dt)\,p_j(t) + o(dt), & j > 0. \end{cases}$$
Now, subtracting $p_j(t)$ from both sides of the equations, dividing by $dt$ and letting $dt \downarrow 0$, we obtain
$$p_j'(t) = \begin{cases} -\lambda(t)\,p_j(t), & j = 0; \\ \lambda(t)\{p_{j-1}(t) - p_j(t)\}, & j > 0, \end{cases} \tag{34}$$
with the boundary condition
p0(0) = 1.
These form a collection of differential-difference equations for pj (t).
The base case $j = 0$ can be solved immediately, to obtain
$$p_0(t) = e^{-\int_{s=0}^t \lambda(s)\,ds} = e^{-\Lambda(t)}.$$
Then for $j > 0$, assuming the inductive hypothesis that the claim is true for $p_{j-1}(t)$, we have
$$p_j'(t) = \lambda(t)e^{-\Lambda(t)}\frac{\Lambda(t)^{j-1}}{(j-1)!} - \lambda(t)\,p_j(t).$$
This is a first-order ordinary differential equation of the form
$$\frac{dy}{dt} = q(t) - p(t)\,y,$$
for which the general solution is
$$y = \frac{\int m(t)q(t)\,dt + k}{m(t)},$$
where $m(t) = \exp\{\int p(t)\,dt\}$.
Here, $p(t) = \lambda(t)$, giving $m(t) = e^{\Lambda(t)}$, and $q(t) = \lambda(t)e^{-\Lambda(t)}\Lambda(t)^{j-1}/(j-1)!$. Then
$$p_j(t) = e^{-\Lambda(t)}\left\{\int \lambda(t)\frac{\Lambda(t)^{j-1}}{(j-1)!}\,dt + k\right\} = e^{-\Lambda(t)}\left\{\frac{\Lambda(t)^j}{j!} + k\right\}.$$
Since $j > 0$, we require $p_j(0) = 0 \implies k = 0$, giving
$$p_j(t) = e^{-\Lambda(t)}\frac{\Lambda(t)^j}{j!}$$
as required.
9.4.1 Homogeneous Poisson Processes
Definition: A Poisson process is said to be homogeneous if the intensity function is constant.
So when applied to survival data, we are assuming λ(t) = µ(t) = µ and hence a constant
hazard. Homogeneity implies the process is independent of the age of the individual in the
study.
Note that the cumulative intensity of a homogeneous Poisson process (HPP) is $\Lambda(t) = \mu t$.
Hence ∀t,
N(t) ∼ Poisson(µt).
It is easy to check that for a HPP, ∀s, t > 0 we also have
N(s + t)−N(s) ∼ Poisson(µt).
There is an important equivalent formulation of the HPP that underlies the relationship
with the homogeneous Markov models of Chapter 7.
Theorem 9.2. The inter-arrival times X1, X2, ... of a homogeneous Poisson process with
intensity µ are independent exponential random variables with rate parameter µ.
Proof
First consider the survivor function for $X_1$:
$$P(X_1 > t) = P(N(t) = 0) = e^{-\mu t},$$
so $X_1 \sim$ Exponential$(\mu)$.
Now for $j > 1$, conditioning on $\{X_1, \ldots, X_{j-1}\}$ we have survivor function
$$P\{X_j > t \mid X_1 = t_1, \ldots, X_{j-1} = t_{j-1} - t_{j-2}\} = P\{N(t_{j-1} + t) = N(t_{j-1}) \mid X_1 = t_1, \ldots, X_{j-1} = t_{j-1} - t_{j-2}\}.$$
The composite event $\{X_1 = t_1, \ldots, X_{j-1} = t_{j-1} - t_{j-2}\}$ relates to the interval $[0, t_{j-1}]$, whereas the event $\{N(t_{j-1} + t) - N(t_{j-1}) = 0\}$ relates to the period $(t_{j-1}, t_{j-1} + t]$. By the assumption of the Poisson process, the two events must therefore be independent. And
$$P(N(t_{j-1} + t) - N(t_{j-1}) = 0) = P(N(t) = 0) = e^{-\mu t},$$
hence $X_j \sim$ Exponential$(\mu)$.
9.4.2 Inference for HPP
We can now see the Poisson model of §8.5 was effectively assuming the death times in our
study to be the event times of a HPP observed for a period of time v .
For likelihood inference, we simply have
$$P(N(t) = j) = e^{-\mu t}\frac{(\mu t)^j}{j!}, \quad j = 0, 1, 2, \ldots$$
For a total time on test observation period $v$, observing $d$ deaths leads to the MLE
$$\hat\mu = \frac{d}{v}.$$
Under this assumed Poisson model, with $v$ treated as fixed, the estimator $\hat\mu = D/v$ has
$$E[\hat\mu] = \mu, \qquad \text{Var}[\hat\mu] = \frac{\mu}{v}.$$
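Putting Theorem 9.2 and this MLE together, the sketch below simulates an HPP by accumulating exponential inter-arrival times and recovers $\mu$ from the resulting death count (the parameter values and seed are arbitrary):

```python
import random

random.seed(1)
mu, v = 0.05, 10000.0            # true hazard and total time on test

# Simulate event times of an HPP on [0, v] using Exponential(mu)
# inter-arrival times (Theorem 9.2), and count the events.
t, d = 0.0, 0
while True:
    t += random.expovariate(mu)  # X_n ~ Exponential(mu)
    if t > v:
        break
    d += 1

mu_hat = d / v                   # Poisson-model MLE
se_hat = (mu_hat / v) ** 0.5     # plug-in standard error from Var = mu / v
print(round(mu_hat, 4), round(se_hat, 5))
```

With $\mu v = 500$ expected events, the estimate typically lands within a couple of standard errors ($\approx 0.0022$) of the true $\mu = 0.05$.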
Note the similarity to estimating the transition intensity in the two-state Markov model
of §7.3. We can think of the Poisson model as an approximation to the two state Markov
model where the total waiting time (central exposed to risk) V (E cx ) is considered fixed.
9.5 Counting processes for Markov Models
As in Chapter 7, we now suppose that we have a sample of n independent individuals such
that each individual i is a realisation of a homogeneous continuous time Markov jump process
on the state space Ω = {1, ... , N}.
9.5.1 Individual level
Suppose that as we reach time $t$ individual $i$ currently happens to be in state $j$; we will now write this as an indicator $Y_i(t, j) = 1$, with $Y_i(t, k) = 0$ $\forall k \ne j \in \Omega$.
If state j is an absorbing state, then we know µjj = 0 and individual i will make no
further transitions and remain in state j . Otherwise, if state j is not absorbing, then we
know from §7.5.1 that the unknown time until individual i will leave state j is governed by
the Exponential(−µjj) distribution with constant hazard rate −µjj .
So by identical reasoning to §9.3.1, if we consider the counting process Ni (t) of the
number of transitions of any type made by individual i by time t, this has intensity λi (t) at
time t given by −µjj . More formally,
$$\lambda_i(t) = -\sum_{j=1}^N Y_i(t, j)\,\mu^{jj}.$$
9.5.2 Group level
By identical reasoning to §9.3.2, if we consider the counting process N(t) = ∑ni=1 Ni (t) of
the number of transitions of any type made by any of the individuals by time t, this has
intensity λ(t) at time t given by ∑ni=1 λi (t). More formally,
$$\lambda(t) = -\sum_{i=1}^n \sum_{j=1}^N Y_i(t, j)\,\mu^{jj} = -\sum_{j=1}^N Y(t, j)\,\mu^{jj}, \tag{35}$$
where $Y(t, j) = \sum_{i=1}^n Y_i(t, j)$ is the number of individuals in state $j$ as we approach time $t$.
Note that this is a stochastic intensity function as in §9.3.2, but here, even under the assumption of constant hazards, the intensity is no longer necessarily non-increasing; individuals move in and out of different states, altering the state counts $\{Y(t, j)\}$ and allowing the intensity to go up and down until individuals start to enter any absorbing states.
9.5.3 Transition level
Alternatively, rather than considering the Markov model on the individual or group level, we can think about counting transitions between particular states. For $j \ne k$, let $N^{(jk)}(t)$ be the number of transitions that have been made from state $j$ to state $k$ by time $t$ by the $n$ individuals.
As argued earlier in §7.6, clearly the collection of processes $\{N^{(jk)}(t) : j \ne k \in \Omega\}$ will be dependent on one another, and so this is an example of a multivariate counting process.
For an individual in state $j$, from (14) the hazard of leaving to state $k$ is $\mu^{jk}$. Alternatively, to obtain this result we have a hazard rate of leaving state $j$, $-\mu^{jj}$, multiplied by the probability of this jump being to state $k$, given by the jump probability $-\mu^{jk}/\mu^{jj}$ (see §7.5.2). Hence at time $t$ the process $N^{(jk)}(t)$ has intensity
$$\lambda^{jk}(t) = Y(t, j)\,\mu^{jk}.$$
Finally, summing over all state pairs (j , k) we recover the intensity (35) for the process
N(t) counting the total number of all transitions,
$$\lambda(t) = \sum_{j=1}^N \sum_{k \ne j} \lambda^{jk}(t) = \sum_{j=1}^N \sum_{k \ne j} Y(t, j)\,\mu^{jk} = \sum_{j=1}^N Y(t, j) \sum_{k \ne j} \mu^{jk}.$$
10 Comparison of Markov, Binomial and Poisson models
10.1 Characteristics
The Markov, Binomial and Poisson models are all used in actuarial science for estimating
survival probabilities for each integer age group [x , x + 1).
The Binomial model assumes a restricted view where only the year of death of individuals
is recorded, rather than the exact event times.
In contrast the Markov and Poisson models both require knowledge of the exact death
times; the Poisson model can be seen as an approximation to the Markov model where the
total observation time (central exposed to risk) is considered fixed.
10.2 Inference
If the exact death times, entry times and exit times on the study are known for all individuals,
then the MLE for the Markov and Poisson models are analytically available and structurally
identical; otherwise the central exposed to risk must be separately estimated.
If the total observation time were fixed, then the Poisson/Markov MLE would be unbi-
ased; however this condition is unlikely to be satisfied in actuarial studies, in which case we
have seen that this MLE is only asymptotically unbiased.
For the Binomial model we have seen the need for additional modelling techniques to
make use of such rich data, such as the Balducci assumption. For low hazard rates, we
have seen the differences between these models to be small, but only approximate results are
available for assessing their estimates.
10.3 Generality
The Markov model extends naturally to an arbitrary number of states. The Poisson model,
as an approximation of the Markov model, can also be adapted to more general problems.
The binomial model does not extend easily.
Multivariate counting processes generalise the concepts of Chapter 9, with a vector of
counting processes N(t) = (N1(t), ... , Nh(t)) counting different event types but sharing a
common filtration Ht .
10.4 Conclusions
When the force of mortality/hazard of the population is low (such as in human mortality)
the estimates we obtain are fairly robust to the different choice of models.
However, when we are to model more complicated processes or higher transition inten-
sities, which is increasingly the case as more complex insurance products are developed,
the stochastic process models appear to offer significant advantages; albeit with a higher
computational burden.
11 Graduation
11.1 Requirements for Graduated Estimates
In Chapters 7-10 we have considered actuarial models for mortality over single time units
(years) for individuals ageing from x to x + 1. For each age interval [x , x + 1) we have
separately calculated crude estimates $\{\hat q_x\}$ (Binomial model) or $\{\hat\mu^{ij}_{x+\frac12}\}$ (Markov or Poisson models). Since these crude estimates are formed independently of one another, a plot of them is likely to be quite rough.
Intuitively, we would typically expect $q_x$ or $\mu^{ij}_{x+\frac12}$ to be smooth, possibly monotonic functions of age $x$. Reporting non-smooth estimates of such a function could be hard to
functions of age x . Reporting non-smooth estimates of such a function could be hard to
justify. Furthermore, estimates of qx−1 or qx+1, for example, should carry some information
about qx as well, and so we should be able to use this assumed smoothness to improve our
estimators.
The process of smoothing crude actuarial estimates is called graduation. The resulting graduated estimates, denoted $\{\mathring q_x\}$ or $\{\mathring\mu^{ij}_{x+\frac12}\}$, should be relatively smooth as a function of $x$ but still sufficiently close to the crude estimates $\{\hat q_x\}$, $\{\hat\mu^{ij}_{x+\frac12}\}$ so that they continue to
adhere to the observed data. Clearly this represents a trade-off of conflicting requirements
and a good balance needs to be struck.
Additionally there might be prior knowledge about the populations which the parameters
should adhere to; for example, that mortality is higher in males than females, or that overall
mortality is lower than it used to be. Or there might be practical considerations we wish
to judge the smoothed estimates against. In actuarial science, there are two important
examples:
• In life insurance, underestimating mortality leads to losses;
• In pensions, overestimating mortality leads to losses.
11.2 Graduation Methods
11.2.1 Parametric Graduation
The most common graduation methods fit a parametric model to the crude estimates or the
raw data. The Gompertz-Makeham distribution (§3.4) provides a useful smoothing structure
for transition intensities, which in its most general form is used as
$$\mathring\mu^{ij}_x = r^{ij}(x) + \exp\{s^{ij}(x)\},$$
where $r^{ij}$ and $s^{ij}$ are polynomials. Joint likelihood analysis using (19) across age groups then becomes a question of estimating the coefficients of the polynomials $r^{ij}$ and $s^{ij}$.
Often it will be difficult to find a sufficiently simple parametric model which will model
the data well in all of the age groups. ML estimation of a parametric graduation model will
typically result in a good fit where there is most data, which in practice usually means at
middle ages, so at extreme ages the fit will be least reliable and may require adjustment.
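As a minimal sketch of parametric graduation, the code below fits only the Gompertz term $\exp\{s_0 + s_1 x\}$ by least squares on the log scale rather than the full Gompertz-Makeham ML fit, and the crude rates are fabricated for illustration:

```python
from math import exp, log

ages = list(range(40, 70))
# Fabricated crude Markov/Poisson estimates: a Gompertz-like trend with
# alternating +/-10% roughness standing in for sampling noise.
crude = [0.0004 * 1.09 ** (x - 40) * (1.1 if i % 2 else 0.9)
         for i, x in enumerate(ages)]

# Ordinary least squares of log(crude) on age gives s0, s1.
n = len(ages)
xbar = sum(ages) / n
ybar = sum(log(m) for m in crude) / n
s1 = (sum((x - xbar) * (log(m) - ybar) for x, m in zip(ages, crude))
      / sum((x - xbar) ** 2 for x in ages))
s0 = ybar - s1 * xbar
graduated = [exp(s0 + s1 * x) for x in ages]

# The graduated rates are smooth and monotone, unlike the crude ones.
print(all(g2 > g1 for g1, g2 in zip(graduated, graduated[1:])))  # prints True
```

Replacing the log-scale least squares with a Poisson or Binomial likelihood, weighted by exposure, would recover the ML graduation described above.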
11.2.2 Graduation Using Previous Estimates
Given a set of smooth, robust prior estimates $\{q_{0,x}\}$ or $\{\mu^{ij}_{0,x+\frac12}\}$ (cf. §7.7), perhaps obtained from standard tables (see §2.8), we might wish to bias our new estimates towards these previous values so that the new estimates are similar or related to the old ones. For example, we might stipulate that $\forall x$
$$\mathring q_x = a + b\,q_{0,x}$$
or
$$\mathring\mu^{ij}_x = \mu^{ij}_{0,x+k}$$
for constants $a$, $b$, $k$.
The precise parametric form for this relationship might be chosen from some exploratory
plots of the crude estimates against the prior estimates. Once chosen, parameter estimates
can then be found via ML estimation as in §11.2.1.
This form of graduation would not be favoured when large amounts of current data are
available, but has the advantage of performing well in extreme age groups where we may
have very little data.
11.3 Testing Graduated Estimates
11.3.1 Testing Smoothness
To see that the graduated estimates are not moving around too much across integer ages,
differencing is most commonly used. In particular, the third differences of the smoothed estimates $\{\mathring q_x\}$ or $\{\mathring\mu^{ij}_{x+\frac12}\}$ should be small relative to the estimates and should progress regularly.
11.3.2 Testing Goodness of Fit
For testing how well the graduated estimates $\{\mathring q_x\}$ or $\{\mathring\mu^{ij}_{x+\frac12}\}$ still adhere to the original data-based crude estimates $\{\hat q_x\}$, $\{\hat\mu^{ij}_{x+\frac12}\}$ (recall that we would not want these to be too
different), we can use almost identical test procedures to those used in §7.7 and §8.4, where
we compared these crude estimates with prior estimates such as those obtained from standard
tables.
For example, using the graduated transition intensities $\{\mathring\mu^{ij}_{x+\frac12}\}$ as the null hypothesis values $\{\mu^{ij}_{0,x+\frac12}\}$, we could perform any of the tests in §7.7.1 or §7.7.2 with one alteration: the number of degrees of freedom $m$ in $\chi^2$ testing is reduced from the number of age groups by the number of parameters that were estimated in the graduation procedure using the current data.