MATH96048/MATH97075/MATH97183
Survival Models & Actuarial Applications
Dr David Whitney, 2020/21

These notes are based on those by Prof Axel Gandy and Prof Nicholas Heard.

Contents

1 Principles of Modelling Time to Event Data
  1.1 Statistical Modelling
  1.2 Model Choice
    1.2.1 Complexity
    1.2.2 Sensitivity
2 Distributions of Event Times
  2.1 Random Variable Modelling
  2.2 Cumulative Distribution and Survivor Functions
    2.2.1 Criteria for a valid survivor function
    2.2.2 Future lifetime after age x
    2.2.3 Actuarial notation
    2.2.4 Relationship between Tx and T
  2.3 Density Function
  2.4 Hazard Function / Force of Mortality
  2.5 Cumulative Hazard Function
  2.6 Complete and Curtate Expectations of Future Lifetime
    2.6.1 Complete future lifetime
    2.6.2 Curtate future lifetime
  2.7 Random Survivorship Group
  2.8 Life Tables
  2.9 Censoring
    2.9.1 Right-censoring
    2.9.2 Left-censoring
    2.9.3 Interval-censoring
    2.9.4 Truncation
    2.9.5 Type I censoring
    2.9.6 Type II censoring
    2.9.7 Competing risks
3 Parametric Distributions of Random Lifetimes T
  3.1 Introduction
  3.2 Exponential Distribution
  3.3 Weibull Distribution
  3.4 Gompertz-Makeham Distribution
    3.4.1 Gompertz and Makeham laws of mortality
    3.4.2 Gompertz-Makeham hazard function
  3.5 Choosing a Distribution
    3.5.1 Empirical survivor function
4 Fitting Parametric Distributions
  4.1 Maximum Likelihood Estimation
    4.1.1 Newton's Method
    4.1.2 Asymptotic distribution of MLE
    4.1.3 Functional invariance of MLEs
  4.2 Examples of ML Estimation
    4.2.1 Exponential distribution
    4.2.2 Weibull distribution
  4.3 Hypothesis Testing for Nested Distributions
    4.3.1 Likelihood ratio statistic
    4.3.2 Wald statistic
5 Nonparametric Methods
  5.1 Discrete Lifetimes
    5.1.1 Discrete hazard rate
    5.1.2 Survivor and discrete cumulative hazard rate
  5.2 Kaplan-Meier (Product-limit) Estimate
  5.3 Greenwood's Formula for the Standard Error of the K-M Estimate
  5.4 Nelson-Aalen Estimate of the Cumulative Hazard
  5.5 The Actuarial Estimate of S
6 Regression models
  6.1 Inhomogeneous Populations
  6.2 Cox's Proportional Hazards Model
    6.2.1 Partial likelihood function
    6.2.2 Model checking
    6.2.3 Estimating the baseline distribution
  6.3 Accelerated Failure Time Model
    6.3.1 Relation to proportional hazards
7 The Markov Model
  7.1 Stochastic Processes
  7.2 Homogeneous Continuous-Time Markov Jump Processes
    7.2.1 Markov Jump Processes
    7.2.2 Transition probabilities
    7.2.3 Transition Intensities
    7.2.4 Homogeneity
    7.2.5 Age rate intervals
    7.2.6 Assuming homogeneity
  7.3 A Two-State Markov Model
  7.4 Calculating Transition Probabilities
    7.4.1 Kolmogorov Forward Equations
    7.4.2 Kolmogorov Backward Equations
    7.4.3 Example: The two-decrement model
  7.5 Calculating Path Probabilities
    7.5.1 Holding times
    7.5.2 Jump probabilities
    7.5.3 Example (continued): The two-decrement model
  7.6 Estimating Transition Intensities
    7.6.1 Maximum Likelihood Estimation
    7.6.2 Interval censoring
  7.7 Statistical Testing of Estimates
    7.7.1 χ² Goodness of Fit Test
    7.7.2 Standardised and Cumulative Deviations Tests
8 The Binomial and Poisson Models of Mortality
  8.1 Binomial model of mortality
  8.2 General Binomial Model - Non-uniform observation times
    8.2.1 Uniform distribution of deaths (UDD)
    8.2.2 Balducci assumption
    8.2.3 Constant force of mortality
    8.2.4 Comparison of the three assumptions
  8.3 The actuarial estimate
    8.3.1 Actuarial notation
  8.4 Statistical testing of estimates
  8.5 Poisson model of mortality
9 Counting Processes and the Poisson Process
  9.1 Filtrations and Martingales
  9.2 Counting Processes
    9.2.1 Intensity of a Counting Process
  9.3 Counting processes for survival data
    9.3.1 Individual level
    9.3.2 Group level
  9.4 Poisson processes
    9.4.1 Homogeneous Poisson Processes
    9.4.2 Inference for HPP
  9.5 Counting processes for Markov Models
    9.5.1 Individual level
    9.5.2 Group level
    9.5.3 Transition level
10 Comparison of Markov, Binomial and Poisson models
  10.1 Characteristics
  10.2 Inference
  10.3 Generality
  10.4 Conclusions
11 Graduation
  11.1 Requirements for Graduated Estimates
  11.2 Graduation Methods
    11.2.1 Parametric Graduation
    11.2.2 Graduation Using Previous Estimates
  11.3 Testing Graduated Estimates
    11.3.1 Testing Smoothness
    11.3.2 Testing Goodness of Fit

1 Principles of Modelling Time to Event Data

1.1 Statistical Modelling

A stochastic or statistical model of a system is a mathematical model which represents inherent uncertainties in the system as random variables. Any model which does not make such allowances for uncertainty, on the other hand, is said to be deterministic.

In actuarial science, statistical models are especially useful for dealing with uncertainty in the time until a specific event will occur, such as predicting the number of premium payments that will be received before paying out on a life assurance contract. These statistical models for time to event data also have much wider applications in fields such as medicine, biostatistics and engineering.

Mathematical models are an imperfect representation of reality. Their utility comes from being able to approximately learn the consequences of hypothetically changing certain experimental inputs or actions. Statistical models extend this utility by capturing the uncertainty surrounding unknown future outcomes, providing the possibility of searching for an optimal decision under this uncertainty.

1.2 Model Choice

1.2.1 Complexity

The complexity of a good statistical model is often constrained for several reasons. The analyst might be restricted by the inputs for which data are readily available, or by computational or statistical limitations to the models which can be reliably fitted.
Additionally, it is important that the chosen model satisfies the purposes for which it will be used; usually this means that the mathematical structure of the model needs to be easily interpretable for purposes of understanding and communication, and that the resulting fitted model will not overfit and understate uncertainty and risk. For these reasons, the statistical analyst will often seek to fit the most parsimonious model that the data will sensibly allow.

1.2.2 Sensitivity

Model choice and fitting should be viewed as an iterative procedure. Once a particular model has been fitted to the available data, the quality of the fit should be inspected; hypothesis tests such as goodness of fit tests can determine the suitability of the selected model. If the structure of the data is not well captured by the model, this may suggest that the model chosen is inadequate and the analyst should return to the model selection stage to consider other alternatives.

Note however that the performance of the chosen model will also depend heavily on the quality of the data, with poor quality inputs likely to lead to unreliable output inference (garbage in, garbage out). If there are questions about the accuracy of the data being used, then it becomes particularly important to carry out a sensitivity analysis. This investigates the magnitude of change in the model outputs when small perturbations are applied to the model inputs.

Finally, it should be noted that the suitability of a well fitted model may not be comfortably relied upon outside the range of the data used; for example, an exponential relationship can appear fairly linear in the short term, before ballooning away from such a fit in the longer term.

2 Distributions of Event Times

2.1 Random Variable Modelling

Consider a homogeneous population of individuals, who each have an associated event time which is initially unknown and treated as a random variable.
We will often refer to the period of time until the event occurs as the lifetime of the individual. Throughout the course, let T denote the future lifetime of a new-born individual (aged 0). Unless otherwise stated, we shall assume T is a continuous random variable which takes values on the positive part of the real line R+ = [0, ∞), with associated probability measure P. More specifically, in this chapter we might assume the existence of a limiting age ω which T cannot exceed, so T ∈ [0, ω]. For human life calculations, typically we take ω ≈ 120 years.

We first define the cumulative distribution function F, the survivor function S, the density f, the hazard µ and the cumulative hazard M for a lifetime, and derive relationships between them. This will provide us with a range of tools for specifying the distribution of a lifetime, and rules for moving between them.

2.2 Cumulative Distribution and Survivor Functions

Definition: The cumulative distribution function (CDF) of T is

    F(t) = P(T ≤ t),

the probability of death by age t.

Definition: The survivor (or reliability) function of T is

    S(t) = P(T > t) = 1 − F(t),

the probability of surviving beyond age t.

2.2.1 Criteria for a valid survivor function

S(t) is a valid survivor function iff

1. S(0) = 1 and lim_{t→∞} S(t) = 0;
2. 0 ≤ S(t) ≤ 1 for all t ∈ R+;
3. Monotonicity: for all t1, t2 ∈ R+, t1 < t2 implies S(t1) ≥ S(t2);
4. S is càdlàg: for all t ∈ R+,
   (a) S(t+) ≡ lim_{u↓t} S(u) = S(t) (S is right-continuous);
   (b) S(t−) ≡ lim_{u↑t} S(u) exists (left limits exist for S).

2.2.2 Future lifetime after age x

Definition: Let Tx be the future lifetime of an individual who has survived to age x, for 0 ≤ x ≤ ω, so Tx ∈ [0, ω − x]. So clearly

• T0 ≡ T;
• [Tx] ≡ [T − x | T > x].

In an abuse of notation we shall generically refer to the probability measures associated with the random variables {Tx} simply as P.

Definition: The cumulative distribution and survivor functions of Tx are

    F_x(t) = P(Tx ≤ t),    S_x(t) = P(Tx > t) = 1 − F_x(t).
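The validity criteria of Section 2.2.1 can be probed numerically. The following is a minimal illustrative sketch (not part of the notes; the function name, grid size and tolerances are our own choices). A finite grid can only falsify a candidate, not fully verify the limit conditions.

```python
import math

def looks_like_survivor(S, t_max=200.0, n=2000, tol=1e-9):
    """Grid-based check of criteria 1-3 for a valid survivor function:
    S(0) = 1, S(t) -> 0 as t grows, values in [0, 1], monotone non-increasing."""
    ts = [i * t_max / n for i in range(n + 1)]
    vals = [S(t) for t in ts]
    if abs(vals[0] - 1.0) > tol:                        # S(0) = 1
        return False
    if vals[-1] > 1e-3:                                 # lim S(t) = 0, approximately
        return False
    if any(v < -tol or v > 1 + tol for v in vals):      # 0 <= S(t) <= 1
        return False
    return all(a >= b - tol for a, b in zip(vals, vals[1:]))  # monotonicity

# An exponential survivor function passes; an increasing function fails.
S_exp = lambda t: math.exp(-0.1 * t)
not_a_survivor = lambda t: min(1.0, 0.01 * t)
```

Right-continuity (criterion 4) is automatic for the smooth candidates above, so it is not checked here.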
2.2.3 Actuarial notation

    _t q_x ≡ F_x(t),    _t p_x ≡ S_x(t) = 1 − _t q_x.

For example, _2 q_30 = P(T_30 ≤ 2) = P(T ≤ 32 | T > 30). When working with units of one year, t = 1 is often omitted, i.e. q_x ≡ _1 q_x and p_x ≡ _1 p_x. In actuarial science, the CDF _t q_x is called the rate of mortality and q_x is called the initial rate of mortality.

2.2.4 Relationship between T_x and T

For consistency with T, for the CDF we have

    F_x(t) = P(T_x ≤ t) = P(T ≤ x + t | T > x)
           = (F(x + t) − F(x)) / S(x),

and for the survivor function,

    S_x(t) = 1 − F_x(t) = S(x + t) / S(x).

In actuarial notation,

    _t q_x = (_{x+t} q_0 − _x q_0) / _x p_0    and    _t p_x = _{x+t} p_0 / _x p_0.

Hence, for any age x, and for all s, t > 0,

    _{s+t} p_x = _{x+s+t} p_0 / _x p_0                              [S_x(s + t) = S(x + s + t) / S(x)]
               = (_{x+s} p_0 / _x p_0) · (_{x+s+t} p_0 / _{x+s} p_0)  [= (S(x + s)/S(x)) · (S(x + s + t)/S(x + s))]

    =⇒ _{s+t} p_x = _s p_x · _t p_{x+s}.                            [S_x(s + t) = S_x(s) S_{x+s}(t)]

That is, the probability of surviving for time s + t after age x is simply the probability of surviving for time s multiplied by the probability of surviving for a further time t after reaching age x + s.

2.3 Density Function

Definition: The probability density function (PDF) of the random variable T_x is

    f_x(t) = (d/dt) F_x(t) = lim_{h↓0} (F_x(t + h) − F_x(t)) / h.

Since we are assuming T_x is a continuous random variable, by definition this density exists. For individuals who have currently survived to age x, f_x(t) is the rate of death t further units of time into the future. That is, for such an individual, the probability of death within the interval [t, t + h] for a small interval width h is approximately f_x(t) h.

2.4 Hazard Function / Force of Mortality

The hazard function plays a central role in survival analysis. We denote the hazard function (or force of mortality) at age x, 0 ≤ x ≤ ω, by µ(x).

Definition: The hazard function of T is defined as

    µ(x) = lim_{h↓0} P(T ≤ x + h | T > x) / h.

We will always assume this limit exists. Note

    µ(x) = lim_{h↓0} F_x(h) / h = f_x(0).

The interpretation of µ(x) is important.
It represents the instantaneous death rate for an individual who has survived to age x. Or, approximately, for small h,

    _h q_x ≈ h µ(x).    (1)

Given an individual has reached age x, the probability of death in the next short period of time of length h is roughly proportional to h, the constant of proportionality being µ(x).

Important result

    µ(x + t) = f_x(t) / S_x(t),    or alternatively    f_x(t) = µ(x + t) S_x(t).

So in particular, for x = 0 we have f = µS.

Proof

    f_x(t) = lim_{h↓0} (1/h) {P(T_x ≤ t + h) − P(T_x ≤ t)}
           = lim_{h↓0} (1/h) {P(T ≤ x + t + h | T > x) − P(T ≤ x + t | T > x)}
           = lim_{h↓0} (1/h) [{F(x + t + h) − F(x)} − {F(x + t) − F(x)}] / S(x)
           = lim_{h↓0} (1/h) (F(x + t + h) − F(x + t)) / S(x)
           = (S(x + t)/S(x)) lim_{h↓0} (1/h) (F(x + t + h) − F(x + t)) / S(x + t)
           = (S(x + t)/S(x)) lim_{h↓0} (1/h) P(T ≤ x + t + h | T > x + t)
           = (S(x + t)/S(x)) µ(x + t)
           = S_x(t) µ(x + t).

Summary

• T_x is a continuous random variable denoting the random future lifetime of an individual alive at age x.
• The distribution function F_x(t) ≡ _t q_x is defined on [0, ω − x] with PDF f_x(t) = F′_x(t).
• The survivor function S_x(t) ≡ _t p_x = 1 − _t q_x. Note that
  1. _{s+t} p_x = _s p_x · _t p_{x+s} = _t p_x · _s p_{x+t}, and
  2. f_x(t) = −S′_x(t).
• The survivor function and the PDF provide two methods of specifying the distribution of a continuous random variable.
• An additional way was provided via the hazard function

      µ(x + t) = f_x(t) / S_x(t) = −S′_x(t) / S_x(t).

Why additionally consider the hazard rate when we have F_x(t) and f_x(t)?

(i) It may be physically enlightening to consider the immediate risk.
(ii) Comparisons of groups of individuals are sometimes most incisively made via the hazard.
(iii) Hazard-based models are often convenient when there is censoring.
(iv) When fitting parametric models the form of the hazard function can be enlightening about the assumptions made by the model: e.g. exponential ⇒ constant hazard.
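The identities above can be confirmed numerically. This is an illustrative sketch (not from the notes), using the closed-form survivor and hazard functions of a Weibull lifetime with arbitrarily chosen parameters; the density of T_x is approximated by a central finite difference.

```python
import math

alpha, eta = 2.0, 1.5                                   # arbitrary Weibull scale and shape

S  = lambda t: math.exp(-(t / alpha) ** eta)            # survivor function of T
mu = lambda t: eta * alpha ** -eta * t ** (eta - 1)     # hazard function of T

def p(x, t):
    """_t p_x = S_x(t) = S(x + t) / S(x)."""
    return S(x + t) / S(x)

def f_x(x, t, h=1e-6):
    """Density of T_x via a central difference: f_x(t) = -d/dt S_x(t)."""
    return -(p(x, t + h) - p(x, t - h)) / (2 * h)
```

Both the multiplicative chain rule for survival probabilities and the important result f_x(t) = µ(x + t) S_x(t) then hold to numerical precision.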
2.5 Cumulative Hazard Function

Definition: The cumulative (or integrated) hazard rate of T_x, denoted M_x(t), is simply

    M_x(t) = ∫_0^t µ(x + s) ds.

This leads to another important relationship,

    S_x(t) = exp{−M_x(t)}.

Recall,

    µ(x + t) = −S′_x(t) / S_x(t) = −(d/dt) log S_x(t).

Furthermore we have the condition S_x(0) = 1, from which S_x(t) = exp{−M_x(t)} follows. Note,

    f_x(t) = µ(x + t) exp{−M_x(t)}.

2.6 Complete and Curtate Expectations of Future Lifetime

We have now met several different methods for specifying the distribution of T_x: through the hazard, density, survivor function, etc. Given this distribution, we now consider two expectations of interest.

2.6.1 Complete future lifetime

Definition: The complete expectation of life at age x, e̊_x, is the expected future lifetime

    e̊_x = E(T_x) = ∫_0^{ω−x} t f_x(t) dt
         = ∫_0^{ω−x} −t (d/dt)(_t p_x) dt
         = −[t · _t p_x]_0^{ω−x} + ∫_0^{ω−x} _t p_x dt
         = ∫_0^{ω−x} _t p_x dt,

using integration by parts. In the case of human mortality e̊_x, and in particular e̊_0, is used as a measure of health, both for comparison between different populations and for trends within a single population. Note that for 0 ≤ x1 < x2, e̊_{x2} is not necessarily less than e̊_{x1}. For example, there may be a high rate of infant mortality.

We may also calculate the variance,

    Var(T_x) = ∫_0^{ω−x} t² f_x(t) dt − e̊_x²
             = ∫_0^{ω−x} t² · _t p_x · µ(x + t) dt − e̊_x².

2.6.2 Curtate future lifetime

Suppose once more that we have an individual who has survived to age x. As a discretised alternative to the continuous-valued future lifetime T_x, we can consider the number of whole years K_x subsequently survived, so K_x = ⌊T_x⌋; that is, the integer part of T_x (rounded down). K_x is a discrete random variable on {0, 1, ..., ⌊ω − x⌋} with probability mass function (pmf)

    P(K_x = k) = P(k ≤ T_x < k + 1) = _{k+1} q_x − _k q_x,  or equivalently  = _k p_x · _1 q_{x+k}.

It is simple to check these are equivalent (Exercise). We can use this pmf to calculate the expectation of K_x.
Definition: The curtate expectation of life at age x, e_x, is thus defined by

    e_x = E(K_x) = Σ_{k=0}^{⌊ω−x⌋} k · _k p_x · _1 q_{x+k} = Σ_{k=1}^{⌊ω−x⌋} _k p_x.    (Exercise)

Again we might also be interested in the variance,

    Var(K_x) = Σ_{k=0}^{⌊ω−x⌋} k² · _k p_x · _1 q_{x+k} − e_x².

2.7 Random Survivorship Group

Suppose we observe a group of ℓ_0 newborn lives sampled from our homogeneous population; that is, ℓ_0 (not necessarily independent) realisations of T. We might then be interested in the number of individuals still alive at age x, n(x), for some x > 0.

Since each individual will survive beyond age x with probability _x p_0, if the lifetimes were independent we would have n(x) ∼ Binomial(ℓ_0, _x p_0). In any case, the expected number of survivors at x, denoted ℓ_x, is certainly given by

    ℓ_x = ℓ_0 · _x p_0.

(ℓ_0 is referred to as the radix.) Furthermore, it then follows that the expected number of deaths in the group between ages x and x + t, written _t d_x, is given by

    _t d_x = ℓ_x − ℓ_{x+t} = ℓ_0 {_x p_0 − _{x+t} p_0}.

Again, when t = 1 we simply write d_x. From the definition of the hazard function and (1), for small t we also have

    _t d_x ≈ ℓ_x µ(x) t.

A plot of ℓ_x µ(x) (= ℓ_0 f(x)) against x is known as the curve of deaths (!).

2.8 Life Tables

Each year the Office for National Statistics (ONS, http://www.statistics.gov.uk/default.asp) produces a new set of estimates of the distribution of mortality in the UK, based on recorded deaths in the country over the most recent three-year period. These estimates are presented in the form of life tables. (The most recent numbers are at http://www.ons.gov.uk/ons/rel/lifetables/national-life-tables/2012-2014/ref-uk.xls for the UK and http://www.ons.gov.uk/ons/rel/lifetables/national-life-tables/2012-2014/ref-england.xls for England.) Life tables such as these, which are based on large amounts of data and can therefore be considered reliable, are known as standard tables.
The life data used consider only the integer age at death, and so continuous concepts such as the density and hazard functions cannot be directly estimated. Instead, the statistics are presented as those from a hypothetical survivorship group of initial size ℓ_0 = 100,000. The life tables thus contain estimates for the following discrete quantities for each integer age x ∈ {0, 1, 2, ..., ω}.

ℓ_x - as above, the expected number of survivors from the ℓ_0 newborns at age x.
d_x - as above, the expected number of individuals who die between ages x and x + 1.
q_x - the initial rate of mortality, which is the probability of an individual who has reached age x dying before age x + 1. So q_x = d_x/ℓ_x.
m_x - the central rate of mortality, the death rate 'per year lived'. Those who die during the year can be assumed to live for half of the year, so m_x = q_x/(1 − q_x/2). m_x is approximately the average force of mortality over [x, x + 1) (for small q_x).
e_x - the period life expectancy for an individual who has survived to x, which is the curtate expectation of life using the estimates q_x as our probabilities.

  Age x     m_x       q_x        ℓ_x       d_x     e_x
    0     0.005006  0.004993  100000.0   499.3   78.05
    1     0.000335  0.000335   99500.7    33.4   77.44
    2     0.000189  0.000189   99467.3    18.8   76.46
    3     0.000144  0.000144   99448.5    14.3   75.48
    4     0.000107  0.000107   99434.2    10.6   74.49
    5     0.000119  0.000119   99423.6    11.9   73.50
    6     0.000112  0.000112   99411.7    11.1   72.51
    7     0.000087  0.000087   99400.6     8.7   71.51
    8     0.000115  0.000115   99391.9    11.4   70.52
    9     0.000102  0.000102   99380.5    10.2   69.53
   10     0.000098  0.000098   99370.3     9.7   68.54
   ...

Table 1: Extract from an interim life table based on UK data for males for the years 2008-2010.

[Figures: plots of d_x/ℓ_0 against age, and of q_x against age, from the 2008/10 ONS data.]
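The columns of Table 1 are linked by q_x = d_x/ℓ_x, m_x = q_x/(1 − q_x/2) and ℓ_{x+1} = ℓ_x − d_x. A short script (values transcribed from the first rows of Table 1; tolerances reflect the table's rounding) can check this internal consistency:

```python
# First three rows of Table 1 (UK males, 2008-2010), transcribed from the notes.
rows = [
    # x,  m_x,      q_x,      l_x,      d_x
    (0, 0.005006, 0.004993, 100000.0, 499.3),
    (1, 0.000335, 0.000335,  99500.7,  33.4),
    (2, 0.000189, 0.000189,  99467.3,  18.8),
]

def check_row(m, q, l, d, tol=5e-5):
    """q_x = d_x / l_x and m_x = q_x / (1 - q_x/2), to rounding accuracy."""
    ok_q = abs(d / l - q) < tol
    ok_m = abs(q / (1 - q / 2) - m) < tol
    return ok_q and ok_m

def check_recursion(rows, tol=0.1):
    """l_{x+1} = l_x - d_x, to the table's one-decimal rounding."""
    return all(abs((l - d) - l_next) < tol
               for (_, _, _, l, d), (_, _, _, l_next, _) in zip(rows, rows[1:]))
```

Running the checks on the transcribed rows confirms that the published columns are mutually consistent.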
2.9 Censoring

A defining feature of survival data, which renders many standard statistical analysis methods inappropriate, is that survival times are frequently censored. A survival time is said to be censored if the exact event time (e.g. time of death) has not been observed. Such observations provide only partial information about the value of T. There are a number of common censoring mechanisms that prevent observation of some event times.

2.9.1 Right-censoring

An event time is right-censored if the censoring mechanism prematurely terminates observation of the individual, before the event has actually occurred; for example, if we lose track of an individual, or the study comes to an end. We only learn that T > t for some t ∈ (0, ω).

If it is known beforehand that right-censoring is due to happen at time c for all events which have not yet occurred, then the time spent observing the individual, T_obs = min(T, c), is a random variable of mixed type. That is, T_obs has a continuous distribution over the interval [0, c) and then a discrete atom of mass at time c.

2.9.2 Left-censoring

An event time is left-censored if we discover the event occurred before observation of the individual began; alternatively, it could be that we observe the event time but only have a lower bound on when the life of the subject began. In either case, all we can say is that the actual time to event T is less than some time t. For example, we begin a study of patients three months after an operation and find that some have died. All we know is that T < 3 months for those patients.

2.9.3 Interval-censoring

An event time is interval-censored when all that is known is that the event occurred within an interval; for example, if we check the status of individuals every d days.

Although always undesirable, a censoring mechanism may inevitably be introduced through the design of an experiment or data collection process.
2.9.4 Truncation

Truncation happens when we wish to estimate the distribution of T but some of our data are sampled instead from a conditional distribution such as T | a < T ≤ b. Truncation is somewhat similar to censoring but is not a type of censoring.

2.9.5 Type I censoring

Type I censoring occurs if we take n individuals and observe them for a pre-specified time t*. Any non-events are right-censored with censoring time t*. This is commonly found in medical studies, e.g. follow 15 patients for two years after an operation.

[Figure: timelines for several individuals observed from time 0, each ending in an event (E) or censored (C) at the common time t*.]

• Generalised Type I censoring: as above, although now individuals join the study at different times, so even censored observations have different durations on the study.

[Figure: as above, but with staggered entry times and censoring at the common calendar time t*.]

2.9.6 Type II censoring

Type II censoring occurs where we take n individuals and observe them until the d-th event occurs. The remaining non-events are right-censored, although (unlike Type I) this censoring time is not known in advance. This is commonly found in reliability analysis, e.g. take 20 components and keep testing until half of them fail.

2.9.7 Competing risks

Suppose we are interested in the marginal distribution of the failure time T1 due to a particular cause, but the occurrence of a separate event, a competing risk, at a random time T2 would cause the individual to exit the study. Then we will only be able to observe min(T1, T2), along with the type of failure. Observations where T2 < T1 will constitute right-censored observations of T1.

One important assumption made throughout this course is that there is independence between the censoring mechanism and the event time. That is, if C denotes a (possibly deterministic) censoring time, and T is the (perhaps unobserved) event time, for simplicity of inference we assume T and C are statistically independent random variables.
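A small simulation makes the Type I mechanism concrete. This is an illustrative sketch (not part of the notes), with an arbitrarily chosen hazard rate and censoring time: each individual contributes the pair (T_obs, δ), where T_obs = min(T, c) and δ indicates whether the event was observed, and the proportion censored estimates S(c).

```python
import math
import random

random.seed(1)
lam, c, n = 0.5, 2.0, 100_000      # arbitrary hazard rate, censoring time, sample size

# Each individual contributes (T_obs, delta): T_obs = min(T, c), delta = 1{T <= c}.
data = [(min(t, c), t <= c) for t in (random.expovariate(lam) for _ in range(n))]

prop_censored = sum(1 for _, event in data if not event) / n
# Under independent Type I censoring, P(censored) = P(T > c) = S(c) = exp(-lam * c).
```

The mixed-type distribution of T_obs is visible in the output: every censored observation sits exactly at the atom t = c.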
3 Parametric Distributions of Random Lifetimes T

3.1 Introduction

In Chapter 2 we defined various mathematical expressions that could be used to describe the probability distribution of a future random lifetime T. But until now, we have not considered what forms these quantities might take. In this chapter we shall meet some standard parametric distributions that are commonly used in survival analysis.

Clearly, any distribution over the positive half-line R+ is a possible candidate. Moreover, distributions on R can be models for log T. However, a number of standard distributions have emerged as being particularly appropriate for the task of analysing survival data. We shall consider the three most popular models, these being the exponential, the Weibull and the Gompertz-Makeham distributions. Each makes a different assumption about the nature of the hazard rate.

The parametric models in this chapter do not admit a limiting age ω; all three have support over the whole of R+. If we were to truncate these distributions to lie on [0, ω], we should be mindful of the induced changes in the nature of the hazard functions.

3.2 Exponential Distribution

The exponential distribution is a natural starting point as a lifetime distribution. For λ > 0, if T ∼ Exp(λ) then

    f(t) = λ exp(−λt),    S(t) = exp(−λt),    µ(t) = λ,    M(t) = λt.

The constant hazard function reflects a property known as lack of memory: for the exponential distribution, S_x(t) = S(t). In practice, however, the assumption of a constant hazard is often untenable, and when fitting the exponential distribution to data it can be sensitive to outliers.

The other two distributions we now consider can be seen to generalise the exponential distribution; that is, each contains the exponential distribution within its family as a special case.
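The exponential case also gives closed forms for the Chapter 2 expectations, which makes it a convenient test bed. The following sketch (arbitrary rate λ, not from the notes; ω is treated as effectively infinite) checks lack of memory, M(t) = λt, and the facts that e̊_x = ∫ _t p_x dt = 1/λ and e_x = Σ_{k≥1} _k p_x = 1/(e^λ − 1), independent of x:

```python
import math

lam = 0.2                                  # arbitrary constant hazard rate
S = lambda t: math.exp(-lam * t)           # survivor function of Exp(lam)
S_x = lambda x, t: S(x + t) / S(x)         # conditional survivor S_x(t) = _t p_x
M = lambda t: -math.log(S(t))              # cumulative hazard

def complete_expectation(t_max=400.0, n=400_000):
    """e̊_x = integral of _t p_x dt, via the trapezoidal rule (same for every x)."""
    h = t_max / n
    total = 0.5 * (S(0.0) + S(t_max)) + sum(S(i * h) for i in range(1, n))
    return h * total

def curtate_expectation(k_max=2000):
    """e_x = sum over k >= 1 of _k p_x (same for every x)."""
    return sum(S(k) for k in range(1, k_max + 1))
```

Note that e_x < e̊_x < e_x + 1 here, since K_x = ⌊T_x⌋ discards at most one year of lifetime.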
3.3 Weibull Distribution

The Weibull distribution can be written as

    f(t) = η α^{−η} t^{η−1} exp(−(t/α)^η),    S(t) = exp(−(t/α)^η),
    µ(t) = η α^{−η} t^{η−1},                  M(t) = (t/α)^η,

where α > 0 and η > 0 are known as the scale and shape parameters respectively. The Weibull generalises the exponential distribution, which is recovered when η = 1. Figures 1 and 2 show the hazard and density functions for different values of the shape parameter.

[Figure 1: Hazard functions for the Weibull distribution with mean 1 and shape parameter η ∈ {0.5, 1.5, 2.5, 5}.]

The mean and variance of the Weibull density are

    α Γ(η^{−1} + 1)    and    α² {Γ(2η^{−1} + 1) − [Γ(η^{−1} + 1)]²}

respectively, where Γ is the gamma function Γ(t) = ∫_0^∞ u^{t−1} e^{−u} du.

For η < 1 we have a monotonically decreasing (antitonic) hazard rate, and for η > 1 we have a monotonically increasing (isotonic) hazard rate (see Fig. 1). In particular, for 1 < η < 2 the hazard increases more slowly than linearly in t; for η = 2 the increase is linear; and for η > 2, faster than linear.

[Figure 2: Density functions for the Weibull distribution with mean 1 and shape parameter η ∈ {0.5, 1.5, 2.5, 5}.]

The Weibull distribution is probably the most widely used parametric distribution in survival analysis. Some possible reasons:

• Simplicity of the survivor and hazard functions.
• It covers a wide variety of distributional shapes (see Fig. 2).
• Empirically it has been found to be accurate in many contexts.
• It is an extreme value distribution.
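The mean formula and the hazard-shape claims can be verified numerically. This sketch (arbitrary parameter values, names our own) uses the general fact that E(T) = ∫_0^∞ S(t) dt for a positive random variable, and checks monotonicity of µ on a grid for η on either side of 1:

```python
import math

def weibull_S(t, alpha, eta):
    """Weibull survivor function S(t) = exp(-(t/alpha)^eta)."""
    return math.exp(-(t / alpha) ** eta)

def weibull_mu(t, alpha, eta):
    """Weibull hazard mu(t) = eta * alpha^-eta * t^(eta-1)."""
    return eta * alpha ** -eta * t ** (eta - 1)

def mean_via_survivor(alpha, eta, t_max=50.0, n=100_000):
    """E(T) = integral of S(t) dt over [0, infinity), via the trapezoidal rule."""
    h = t_max / n
    total = 0.5 * (weibull_S(0.0, alpha, eta) + weibull_S(t_max, alpha, eta))
    total += sum(weibull_S(i * h, alpha, eta) for i in range(1, n))
    return h * total
```

The numeric integral matches α Γ(η^{−1} + 1) for shapes η > 1 (for η < 1 the heavier tail would require a much larger t_max), and the grid confirms a decreasing hazard for η = 0.5 and an increasing one for η = 1.5.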
3.4 Gompertz-Makeham Distribution

3.4.1 Gompertz and Makeham laws of mortality

The Gompertz law of mortality states that in a low-mortality environment where external causes of death are rare, the force of mortality increases approximately exponentially with age. Empirical data gathered from observation of insects kept in laboratory conditions support this hypothesis. Outside of such an environment, the Makeham law of mortality considers the risk of other, external causes of death to be approximately constant with age.

So together these laws suggest a hazard rate which is a sum of constant (Makeham) and exponential (Gompertz) terms. This motivates the Gompertz-Makeham distribution.

3.4.2 Gompertz-Makeham hazard function

The Gompertz-Makeham distribution is a three-parameter distribution with hazard function

µ(t) = θ + β exp(γt)

for non-negative parameters θ, β, γ. This distribution again generalises the exponential distribution. The survivor function and density function can be derived in the usual way (Exercise).

3.5 Choosing a Distribution

We have listed three common parametric distributions for lifetime random variables (there are many more: gamma, log-normal, log-logistic, ...). Given a particular data set of, say, n realisations of T, how do we decide which distribution is appropriate? Each distribution makes particular assumptions about the form of the hazard. Knowledge of the 'true' hazard rate would enable us to decide which distribution was 'closest'. In practice we can compare to a non-parametric estimate of the hazard which converges to the truth.

3.5.1 Empirical survivor function

Suppose we have gathered some survival time data for n individuals drawn from a population with common survivor function S(t). Assume, for now, that there are no censored observations. Then the empirical survivor function

Sˆ(t) = (number of observations > t) / n

provides an estimate of S(t) derived purely from the data.
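A sketch of the empirical survivor function on a small made-up uncensored sample, together with the data-based cumulative hazard Mˆ(t) = −log Sˆ(t) used in the next subsection:

```python
import math

# Empirical survivor function S_hat(t) = #{observations > t} / n, and the
# derived cumulative-hazard estimate M_hat(t) = -log S_hat(t).
# (The uncensored sample below is made up for illustration.)
sample = [0.4, 0.9, 1.3, 1.3, 2.1, 2.8, 3.5, 4.0]
n = len(sample)

def S_hat(t):
    return sum(1 for x in sample if x > t) / n

def M_hat(t):
    return -math.log(S_hat(t))   # only valid where S_hat(t) > 0

assert S_hat(0.0) == 1.0                         # everyone survives past time 0
assert S_hat(2.0) == 4 / 8                       # 2.1, 2.8, 3.5, 4.0 exceed t = 2
assert abs(M_hat(2.0) - math.log(2.0)) < 1e-12   # -log(1/2)
```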
Notice that Sˆ(t) is an antitonic step function with jumps at the death times. (Sˆ(t) is a non-parametric estimator of S(t). In Chapter 5 we shall see how to estimate S in the presence of censoring.)

Since S(t) = exp{−M(t)}, the simple transformation Mˆ(t) = −log Sˆ(t) then provides a similar, data-based estimate of the cumulative hazard. We can use a plot of Mˆ(t) to see how the overall shape of the function compares with those dictated by some different distributions. In particular:

• M(t) vs. t is linear for the exponential;
• log M(t) vs. log t is linear for the Weibull;
• log M(t) vs. t approaches linearity for large t for the Gompertz-Makeham.

After plotting Mˆ(t) in these ways we can decide which assumption is closest. This provides us with a crude but useful method of selecting a parametric model for a given set of data.

Figure 3: Sˆ(t) and Mˆ(t) for a sample of uncensored lifetime data.

4 Fitting Parametric Distributions

In Chapter 3 we met some standard parametric families that might be appropriate for modelling T, and gave some simple criteria for choosing amongst them given a set of realised values. We now assume that a distribution has been selected. What values should the parameters of the distribution take? This is a problem in statistical inference.

The most popular statistical approach to fitting parametric distributions to data is the method of maximum likelihood (ML) estimation. In survival analysis the inference problem is often complicated by the presence of censored observations, and so we will need to consider ML estimation in this context.

4.1 Maximum Likelihood Estimation

Assume that the data (t1, ..., tn) are n independent, possibly censored realisations from the distribution F(t; θ), where the form of F is known (e.g.
Weibull) but the parameters θ are unknown.

Definition: The likelihood is the joint probability of the observed data, regarded as a function of the unknown parameters θ. The likelihood principle states that all of the information about the parameters in a sample of data is contained in the likelihood function. The law of likelihood extends this principle to state that the degree to which the data support one parameter value over another is given by the ratio of their likelihoods. The maximum likelihood estimate (MLE) of θ is based solely upon the likelihood of the observed data. Maximum likelihood estimation thus provides a principled and general method of estimating parameters in parametric distributions using observed data.

As the name suggests, the MLE is the value of the parameters that maximises the probability of the observed data,

θˆ = arg sup_{θ∈Θ} L(θ), L(θ) = ∏_{i=1}^{n} Pr(t_i | F(·; θ)),

where L is the likelihood function and Pr(t_i | F(·; θ)) is the probability of observing the i-th observation given the distribution F with parameters θ. If none of the observations are censored we find

L(θ) = ∏_{i=1}^{n} f(t_i; θ).

For each censored observation the contribution to the likelihood depends on the type of censoring. Recall, an observation is left (right) censored if we only have an upper (lower) bound for T, or interval censored if we only know that T lies in a specified interval.

• Pr(t_i | F(·; θ)) = S(t_i; θ) for an observation right-censored at t_i;
• Pr(t_i | F(·; θ)) = F(t_i; θ) for an observation left-censored at t_i;
• Pr(t_i | F(·; θ)) = F(t_i^(u); θ) − F(t_i^(l); θ) for an observation interval-censored within [t_i^(l), t_i^(u)].

From now on we shall only consider right-censoring, this being the most common. We can then split the data into two disjoint sets relating to U = uncensored data and C = censored data. The likelihood function is then

L(θ) = ∏_{i∈U} f(t_i; θ) ∏_{i∈C} S(t_i; θ).
It is almost always more convenient to work with the log-likelihood,

ℓ(θ) = log{L(θ)} = Σ_{i∈U} log f(t_i; θ) + Σ_{i∈C} log S(t_i; θ)
     = Σ_{i∈U} log µ(t_i; θ) + Σ_{i=1}^{n} log S(t_i; θ)
     = Σ_{i∈U} log µ(t_i; θ) − Σ_{i=1}^{n} M(t_i; θ).

For an m-parameter distribution, the estimate θˆ may be found by solving the likelihood equations

∂ℓ(θ)/∂θ_j = 0 (j = 1, ..., m).

Finding the solution often involves numerical techniques such as Newton and quasi-Newton methods.

4.1.1 Newton's Method

Newton's method for optimising ℓ(θ) begins by making an initial guess θ = θ0 ∈ R^m, sufficiently close to the true optimum. Unless this initial guess θ0 is already the optimum, we wish to add an increment δ ∈ R^m to θ0 such that ℓ(θ0 + δ) is the optimum. The (multivariate) Taylor expansion of ℓ about a value θ yields

ℓ(θ + δ) ≈ ℓ(θ) + δ′∇ℓ(θ) + (1/2) δ′∇²ℓ(θ)δ. (2)

To solve the easier problem of maximising the Taylor expansion (2), we find that the derivative of (2) with respect to δ is equal to ∇ℓ(θ) + ∇²ℓ(θ)δ, and hence get an approximate solution for δ of

δ = −{∇²ℓ(θ)}^{−1} ∇ℓ(θ).

So the algorithm proceeds iteratively, setting

θ_n = θ_{n−1} − {∇²ℓ(θ_{n−1})}^{−1} ∇ℓ(θ_{n−1})

until sufficient convergence in the sequence θ0, θ1, θ2, ... occurs.

4.1.2 Asymptotic distribution of MLE

Maximum likelihood estimation not only provides a set of optimal parameter values for θ but also allows assessment of the variance in the estimates. Consider the m×m information matrix

I(θ) = −∇²ℓ(θ), so I(θ)_{ij} = −∂²ℓ(θ)/∂θ_i∂θ_j, i, j = 1, ..., m.

Then asymptotically, for large samples, θˆ ∼ N(θ, I(θ)^{−1}). Evaluating I(θ) by setting the unknown θ equal to θˆ, giving the observed information matrix, leads to an approximate covariance matrix for θˆ. That is, if V = I(θˆ)^{−1} then its ij-th entry v_{ij} is an estimate of the covariance between θˆ_i and θˆ_j. In particular, the standard error of θˆ_j is given by the square root of the j-th diagonal element of V,

s.e.(θˆ_j) ≈ √v_{jj}.
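The Newton iteration of §4.1.1 is easy to sketch in one dimension, e.g. for the right-censored exponential log-likelihood ℓ(λ) = r log λ − λ Σ t_i met in the next section (the data below are made up; note that the iteration can diverge if θ0 is not close enough to the optimum):

```python
# Newton's method on the exponential log-likelihood l(lam) = r*log(lam) - lam*sum(t_i),
# with r uncensored observations; the data below are made up for illustration.
deaths = [1.1, 3.0, 3.0, 7.0, 10.0, 12.4]
censored = [0.2, 0.8, 4.5, 11.0]

r = len(deaths)
total_time = sum(deaths) + sum(censored)

def score(lam):      # l'(lam) = r/lam - sum(t_i)
    return r / lam - total_time

def hessian(lam):    # l''(lam) = -r/lam**2
    return -r / lam ** 2

lam = 0.1            # initial guess theta_0, close enough to the optimum
for _ in range(50):
    lam = lam - score(lam) / hessian(lam)    # theta_n = theta_{n-1} - (l'')^{-1} l'

assert abs(lam - r / total_time) < 1e-10     # converges to the closed-form MLE
```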
4.1.3 Functional invariance of MLEs

Another key advantage of ML estimation is that the MLE of a function g(θ) of the parameters is simply g(θˆ). For example, the exponential distribution has mean E(T) = 1/λ, hence the MLE of E(T) is

Ê(T) = 1/λˆ = (1/r) Σ_{i=1}^{n} t_i,

which is the total time survived by the individuals in the study divided by the number of deaths.

4.2 Examples of ML Estimation

We assume that we have observed lifetimes t1, ..., tn with r uncensored observations and n − r right-censored observations.

4.2.1 Exponential distribution

The log-likelihood is ℓ(λ) = r log λ − λ Σ_{i=1}^{n} t_i. Hence

dℓ/dλ = r/λ − Σ_{i=1}^{n} t_i,

which may be set to zero and solved immediately to give λˆ = r / Σ_{i=1}^{n} t_i. Also we have

−d²ℓ/dλ² = r/λ² ⟹ s.e.(λˆ) ≈ λˆ/r^{1/2}.

So we have approximately λˆ ∼ Normal(λ, λ²/r), and then more approximately λˆ ∼ Normal(λ, r/{Σ_{i=1}^{n} t_i}²).

4.2.2 Weibull distribution

Recall the Weibull survivor function has form S(t) = exp(−λt^η), with scale λ and shape η. Note, we have changed the parameterisation slightly, writing λ for α^{−η}. The log-likelihood in the presence of right-censoring is

ℓ(λ, η) = r log(λη) + (η − 1) Σ_{i∈U} log t_i − λ Σ_{i=1}^{n} t_i^η.

We set

∂ℓ/∂λ = 0 = r/λˆ − Σ_{i=1}^{n} t_i^ηˆ (3)

and

∂ℓ/∂η = 0 = r/ηˆ + Σ_{i∈U} log t_i − λˆ Σ_{i=1}^{n} t_i^ηˆ log t_i. (4)

Solving (3), we obtain λˆ = r / Σ_{i=1}^{n} t_i^ηˆ. Substituting into (4) gives

r/ηˆ + Σ_{i∈U} log t_i − (r / Σ_{i=1}^{n} t_i^ηˆ) Σ_{i=1}^{n} t_i^ηˆ log t_i = 0.

This is a non-linear equation in ηˆ which can only be solved using an iterative numerical procedure. In practice it is common to maximise over {λ, η} simultaneously using Newton's method. An important by-product is an approximation of the covariance matrix, from which standard errors for the MLEs can be obtained.

4.3 Hypothesis Testing for Nested Distributions

At the end of Chapter 3 we provided some heuristic methods for distinguishing between distributions.
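The non-linear profile equation for ηˆ in the Weibull example above can be solved by simple bisection; a sketch on made-up right-censored data:

```python
import math

# Solving the Weibull profile equation for eta-hat by bisection.
# U = uncensored times, C = right-censored times (made up for illustration).
U = [1.1, 3.0, 3.0, 7.0, 10.0, 12.4]
C = [0.2, 0.8, 4.5, 11.0]
all_t = U + C
r = len(U)

def profile_score(eta):
    """Left-hand side of the profile equation in eta (zero at the MLE)."""
    s_eta = sum(t ** eta for t in all_t)
    s_eta_log = sum(t ** eta * math.log(t) for t in all_t)
    return r / eta + sum(math.log(t) for t in U) - r * s_eta_log / s_eta

# Bisection on a bracket where the score changes sign.
lo, hi = 0.1, 10.0
assert profile_score(lo) > 0 > profile_score(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if profile_score(mid) > 0:
        lo = mid
    else:
        hi = mid

eta_hat = 0.5 * (lo + hi)
lam_hat = r / sum(t ** eta_hat for t in all_t)   # from equation (3)
assert abs(profile_score(eta_hat)) < 1e-8
```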
Here we consider asymptotic properties of the likelihood function to provide a more rigorous test that can be used for certain comparisons. Suppose we are interested in making statements about a parameter subset θ^(A). Assume that θ is partitioned as θ = (θ^(A), θ^(B)) and Θ = Θ^(A) × Θ^(B).

• For example, in the Weibull distribution we might take θ^(A) = η (shape), θ^(B) = λ (scale).

Let θˆ = (θˆ^(A), θˆ^(B)) denote the MLE of θ = (θ^(A), θ^(B)). We will consider two tests for the null hypothesis

H0: θ^(A) = θ0^(A) vs. H1: θ^(A) ∈ Θ^(A)

which make use of the MLE.

• Note that for the Weibull distribution a test for η = 1 is a test of exponentiality.

The tests lead to the derivation of a confidence region for θ^(A), which is the collection of parameter values in the subset Θ^(A) not 'rejected' at a certain significance level.

4.3.1 Likelihood ratio statistic

Let

• θˆ_{H1} = θˆ be the unconstrained MLE over Θ;
• θˆ_{H0} = (θ0^(A), θˆ^(B)_{θ0^(A)}), where θˆ^(B)_{θ0^(A)} is the MLE of θ^(B) over Θ^(B) with θ^(A) fixed at θ^(A) = θ0^(A).

Then the likelihood ratio test statistic is

W(θ0^(A)) = −2 [ℓ(θˆ_{H0}) − ℓ(θˆ_{H1})].

Under the null hypothesis H0: θ^(A) = θ0^(A), W is itself a random variable with an approximate chi-squared distribution with m_A degrees of freedom, where m_A = dim(θ^(A)). Large values of W relative to χ²_{m_A} supply evidence against H0. The corresponding 1 − α confidence region for θ^(A) is

{θ^(A): W(θ^(A)) ≤ χ²_{m_A,α}},

where χ²_{m_A,α} is the upper 100α percentage point of χ²_{m_A}. Within this region we cannot reject the null hypothesis at the α significance level.

4.3.2 Wald statistic

Suppose V = V(θˆ^(A), θˆ^(B)) is the asymptotic covariance matrix for (θˆ^(A), θˆ^(B)) evaluated at the MLE. Let V_A be the leading submatrix of V corresponding to θ^(A). Then

W*(θ0^(A)) = (θˆ^(A) − θ0^(A))′ V_A^{−1} (θˆ^(A) − θ0^(A))

also has an approximate χ²_{m_A} distribution under the null hypothesis θ^(A) = θ0^(A).
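For the exponential distribution the likelihood ratio statistic has a closed form, which makes the confidence region easy to sketch (data made up; 3.841 is the upper 5% point of χ²_1):

```python
import math

# Likelihood-ratio confidence region for the exponential rate lambda,
# using made-up right-censored data and the chi-squared 1 df, 95% point.
deaths = [1.1, 3.0, 3.0, 7.0, 10.0, 12.4]
censored = [0.2, 0.8, 4.5, 11.0]
r, T = len(deaths), sum(deaths) + sum(censored)

def loglik(lam):
    return r * math.log(lam) - lam * T

lam_hat = r / T
CHI2_1_05 = 3.841     # upper 5% point of chi-squared with 1 df

def W(lam0):
    """Likelihood ratio statistic for H0: lambda = lam0."""
    return -2.0 * (loglik(lam0) - loglik(lam_hat))

assert W(lam_hat) == 0.0             # the MLE is always inside the region
assert W(lam_hat * 3.0) > CHI2_1_05  # a rate 3x the MLE is rejected here
assert W(lam_hat * 1.1) < CHI2_1_05  # a rate close to the MLE is not
```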
The corresponding 1 − α confidence region for θ^(A) is

{θ^(A): W*(θ^(A)) ≤ χ²_{m_A,α}}.

Both the Wald statistic and the likelihood ratio statistic are based on large-sample theory and are asymptotically equivalent (and often give similar results in practice). However, large discrepancies are possible; in such cases the likelihood ratio statistic is perhaps preferable, as its results are invariant to reparameterisation. Against this, the Wald statistic has the advantage that only one maximisation, over the unconstrained parameter space Θ, is required. Note, however, that the theory is for large samples and may give a poor approximation to small-sample results.

5 Nonparametric Methods

In Chapter 3 we met some parametric distributions which may be appropriate for modelling a random lifetime variable T. Once a parametric model has been chosen, the distribution of T is constrained to take a fixed functional form and can only vary within the scope of the parameters. (In Chapter 4 we considered how best to choose those free parameters to fit a set of data.) A limitation of this approach is that when the model is inadequate, some interesting features within the data will be hidden. That is, we are constrained in how we learn about the distribution of T by the model we have selected to represent it.

In this chapter we see how this constraint can be relaxed and the search for an estimated distribution extended to the space of all distributions, leading to the name nonparametric or distribution-free inference. So far we have been assuming T to be continuous, but in what follows we will require some results for discrete random variable distributions.

5.1 Discrete Lifetimes

Suppose that the random lifetime T could only take one of a countable set of values {a1 < a2 < ...}. Then T would be a discrete random variable, with a probability mass function (pmf)

π_j = P(T = a_j), j = 1, 2, ....
As usual, we require that {π_j} satisfy 0 ≤ π_j ≤ 1 for all j and Σ_j π_j = 1.

5.1.1 Discrete hazard rate

We define the discrete hazard rate at time a_j as

µ_j = P(T = a_j | T ≥ a_j) = π_j / (1 − Σ_{i<j} π_i). (5)

The hazard rates determine the pmf:

π_j = µ_1 for j = 1; π_j = µ_j ∏_{i<j} (1 − µ_i) for j > 1. (6)

To see this, rearranging (5) gives

π_j = µ_j (1 − Σ_{i<j} π_i) ⟹ π_j = π_{j−1} (µ_j/µ_{j−1}) (1 − µ_{j−1}) (Exercise),

and so by induction

π_j = µ_1 for j = 1; π_j = µ_j ∏_{i<j} (1 − µ_i) for j > 1. (7)

Thus {π_j} and {µ_j} both characterise the distribution of T, although for the latter we only need 0 ≤ µ_j ≤ 1 for all j (and, if the range of T is finite, that µ_j = 1 for some j). Note that if µ_j = 1 for some j, then surely T ≤ a_j.

5.1.2 Survivor and discrete cumulative hazard rate

Since P(T > a_j) ≡ P(T ≥ a_{j+1}), note that another rearrangement of (5) gives P(T > a_j) = π_{j+1}/µ_{j+1}. Substituting for π_{j+1} with (7), we get

P(T > a_j) = ∏_{i=1}^{j} (1 − µ_i) ⟹ S(t) = ∏_{j: a_j ≤ t} (1 − µ_j). (8)

So for an individual to survive to time t they must survive through all of the support points a_j up to time t, where each represents a Bernoulli trial with death probability µ_j. Again, note that if µ_j = 1 for some j, then S(t) = 0 for all t ≥ a_j.

In an analogous definition to the continuous case (§2.5), we define the cumulative hazard to be

M(t) = Σ_{j: a_j ≤ t} µ_j. (9)

Note that this invalidates the identity M(t) = −log{S(t)} from §2.5. However, we can note that for small h, −log(1 − h) ≈ h, and thus for low hazard rates

M(t) ≈ −Σ_{j: a_j ≤ t} log(1 − µ_j) = −log{S(t)}.

5.2 Kaplan-Meier (Product-limit) Estimate

Returning to the idea of nonparametric estimation, we would like to find the maximum likelihood distribution for some observed survival data, over the space of all distributions. Suppose we observe n independent, identically distributed lives from a population with survivor function S, but that in the presence of right-censoring we only observe m deaths. Let t1 < t2 < ... < tk be the ordered death times, with k ≤ m so that more than one death can occur at any one time. Let d_j denote the number of deaths at time t_j, with Σ_{j=1}^{k} d_j = m.
Observation of the remaining n − m times is censored. Suppose that c_j lives are censored in the half-open interval [t_j, t_{j+1}), for j = 0, 1, 2, ..., k, where t_0 = 0 and t_{k+1} = ω. Let t_{j1}, t_{j2}, ..., t_{jc_j} be the censoring times in the j-th interval, where the {t_{ji}} may or may not be distinct. Note that Σ_{j=0}^{k} c_j = n − m.

Let

n_j = n − Σ_{i≤j−1} c_i − Σ_{i≤j−1} d_i ( = Σ_{i=j}^{k} (c_i + d_i) ) (10)

be the total number of individuals still 'in view' as we reach time t_j. Note the convention of including in n_j any observations censored at t_j; alternative schemes may be more appropriate in some circumstances, and these naturally might lead to differing results.

It is easy to show that, regardless of the nature of the true underlying distribution, the maximum likelihood distribution for T will be that of a discrete random variable with atoms of probability mass {π_j} at the death times {t_j}. Informally, any distribution placing mass away from the death times would admit possibilities outside of the data observed, and so clearly give lower likelihood to the data that have been observed. To identify the maximum likelihood discrete distribution, the problem at hand is then to identify optimal probability masses to place at each of the death times t1, ..., tk.

In constructing a likelihood function for our unknown discrete distribution, it is most straightforward to specify the likelihood of the observed data in terms of the discrete hazard function. Proceeding through time from birth, individuals would only be at risk of death at the discrete points t1, ..., tk. Viewing the data sequentially at these risk times, as we reach time t_j there are n_j individuals still at risk, each subject to a Bernoulli trial with probability µ_j of death and (1 − µ_j) of survival. Once individuals have died or been right-censored, they cease to play a part in the subsequent terms of the likelihood function.
Hence we can write the likelihood of the data given a discrete model as

L(µ1, ..., µk) = ∏_{j=1}^{k} µ_j^{d_j} (1 − µ_j)^{n_j − d_j},

which is clearly maximised by µˆ_j = d_j/n_j, j = 1, 2, ..., k. Hence the MLE for S is

Sˆ(t) = ∏_{j: t_j ≤ t} (1 − d_j/n_j).

This is known as the Kaplan-Meier or product-limit estimate. It is the MLE over the space of all valid distribution/survivor functions.

Product-limit: consider finer and finer partitions of the time axis and estimate S(t) as the product of probabilities of surviving each sub-interval. We can treat this as a discrete, sequential problem using the discrete hazard definition in §5.1, and thus obtain the Kaplan-Meier estimate in the limit as the meshes of the partition tend to zero.

Example Consider the following data on 10 individuals:

• death times: 1.1, 3, 3, 7, 10, 12.4;
• right-censoring times: 0.2, 0.8, 4.5, 11.

Death time | Censoring time | n_j | d_j | Sˆ(t)
           |      0.2       |     |     |  1
           |      0.8       |     |     |  1
   1.1     |                |  8  |  1  |  0.875
   3       |                |  7  |  2  |  0.625
           |      4.5       |     |     |  0.625
   7       |                |  4  |  1  |  0.469
  10       |                |  3  |  1  |  0.312
           |      11        |     |     |  0.312
  12.4     |                |  1  |  1  |  0.0

Note: If the largest observation is a censored survival time, then Sˆ(t) is undefined beyond this maximum time t_max.

5.3 Greenwood's Formula for the Standard Error of the K-M Estimate

The Kaplan-Meier estimate is the most important and widely used estimate of the survivor function, and so it is of interest to estimate its variability. We have

log Sˆ(t) = Σ_{j: t_j ≤ t} log(1 − µˆ_j),

and then by approximate independence the variance of log Sˆ(t) is

Var{log Sˆ(t)} ≈ Σ_{j: t_j ≤ t} Var{log(1 − µˆ_j)}.

The variance of (1 − µˆ_j) follows from the variance of a binomial random variable:

Var(1 − µˆ_j) ≈ µˆ_j (1 − µˆ_j)/n_j.

Now we make use of a general result, known as the delta method, for approximating the variance of a function of an asymptotically normal statistical estimator (like the MLE).
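The product-limit computation for the worked example above can be sketched directly:

```python
# Kaplan-Meier estimate for the worked example (6 deaths, 4 right-censorings).
deaths = [1.1, 3.0, 3.0, 7.0, 10.0, 12.4]
censored = [0.2, 0.8, 4.5, 11.0]

death_times = sorted(set(deaths))            # t_1 < ... < t_k
S, km = 1.0, {}
for t in death_times:
    d = deaths.count(t)                                  # d_j
    n = sum(1 for x in deaths + censored if x >= t)      # n_j, still 'in view'
    S *= 1.0 - d / n
    km[t] = S

# Values agree with the table above.
assert abs(km[1.1] - 0.875) < 1e-3
assert abs(km[3.0] - 0.625) < 1e-3
assert abs(km[7.0] - 0.469) < 1e-3
assert abs(km[10.0] - 0.312) < 1e-3
assert km[12.4] == 0.0
```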
If X is a random variable with mean µ_X and g(X) is a function of X with first derivative g′(X), then

Var{g(X)} ≈ g′(µ_X)² Var(X). (11)

This leads to

Var{log(1 − µˆ_j)} ≈ Var(1 − µˆ_j)/(1 − µˆ_j)² ≈ [µˆ_j (1 − µˆ_j)/n_j]/(1 − µˆ_j)² = µˆ_j/[n_j (1 − µˆ_j)] = d_j/[n_j (n_j − d_j)].

Hence

Var{log Sˆ(t)} ≈ Σ_{j: t_j ≤ t} d_j/[n_j (n_j − d_j)],

and further application of (11) leads to

Var{log Sˆ(t)} ≈ Var{Sˆ(t)}/{Sˆ(t)}²,

so that

Var{Sˆ(t)} ≈ {Sˆ(t)}² Σ_{j: t_j ≤ t} d_j/[n_j (n_j − d_j)].

Finally, the standard error of the Kaplan-Meier estimate is the square root of the estimated variance; that is,

s.e.{Sˆ(t)} ≈ Sˆ(t) √( Σ_{j: t_j ≤ t} d_j/[n_j (n_j − d_j)] ),

which is known as Greenwood's formula.

5.4 Nelson-Aalen Estimate of the Cumulative Hazard

We have found the MLE for S, Sˆ, to be the survivor function of a discrete random variable with atoms of probability mass at the death times {t_j}. This has a discrete hazard rate µˆ = {µˆ_j}. If our interest lay in estimating the cumulative hazard function M, as in §3.5, then following definition (9) the corresponding estimate would be

Mˆ(t) = Σ_{j: t_j ≤ t} µˆ_j = Σ_{j: t_j ≤ t} d_j/n_j.

This is the Nelson-Aalen estimate. By the property of invariance of MLEs under transformation, this is the MLE for the (discrete) cumulative hazard function. The MLE for the continuous cumulative hazard function is instead −log(Sˆ(t)).

Similarly to Greenwood's formula for the Kaplan-Meier estimate, we can obtain an estimate for the standard error of the Nelson-Aalen estimate:

Var{Mˆ(t)} ≈ Σ_{j: t_j ≤ t} d_j (n_j − d_j)/n_j³ ⟹ s.e.{Mˆ(t)} ≈ √( Σ_{j: t_j ≤ t} d_j (n_j − d_j)/n_j³ ).

5.5 The Actuarial Estimate of S

The "adjusted-observed" life-table estimate, or actuarial estimate, of the survivor function deals with aggregated, interval-censored and right-censored observations of a continuous time-to-event variable, which are commonly found in actuarial applications. Its construction is very similar to that of the product-limit estimate.
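Greenwood's formula and the Nelson-Aalen estimate can be computed alongside the Kaplan-Meier estimate for the same worked example; a sketch:

```python
import math

# Greenwood's formula and the Nelson-Aalen estimate on the worked example.
deaths = [1.1, 3.0, 3.0, 7.0, 10.0, 12.4]
censored = [0.2, 0.8, 4.5, 11.0]
death_times = sorted(set(deaths))

S, M, var_term = 1.0, 0.0, 0.0
results = {}
for t in death_times:
    d = deaths.count(t)
    n = sum(1 for x in deaths + censored if x >= t)
    S *= 1.0 - d / n
    M += d / n                      # Nelson-Aalen increment mu_hat_j = d_j / n_j
    if n > d:                       # Greenwood term is undefined when n_j == d_j
        var_term += d / (n * (n - d))
    results[t] = (S, M, S * math.sqrt(var_term))   # (KM, NA, Greenwood s.e.)

S_3, M_3, se_3 = results[3.0]
assert abs(S_3 - 0.625) < 1e-12
assert abs(M_3 - (1/8 + 2/7)) < 1e-12
# s.e.{S_hat(3)} = 0.625 * sqrt(1/(8*7) + 2/(7*5))
assert abs(se_3 - 0.625 * math.sqrt(1/56 + 2/35)) < 1e-12
```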
We suppose that the time axis is divided into intervals (usually of equal length, typically one year) [a0, a1), [a1, a2), ..., [a_{k−1}, a_k), where 0 = a0 < a1 < ... < a_k. The data are aggregated within these intervals, so the number of deaths, d_j, and losses to censoring, c_j, are known for each interval [a_{j−1}, a_j), but the actual lifetimes/censoring times are unknown.

As in the product-limit model, we estimate the probabilities of surviving each subsequent interval. If all of the deaths were known to precede all of the losses to censoring, then we could proceed with estimation as before with µˆ_j = d_j/n_j. But in general we have no reason to assume this. The actuarial assumption is that within each interval the losses to censoring occur uniformly across the interval. Hence, by a crude approximation, the adjusted, expected number of individuals at risk at any time during the interval is

n′_j = n − Σ_{i≤j−1} c_i − Σ_{i≤j−1} d_i − c_j/2 = n_j − c_j/2,

where n_j is the number still 'in view' at the start of the j-th interval. So, using the arguments developed in §5.1 and §5.2, we treat survival of the intervals as (approximate) binomial experiments, leading to the life-table estimate of S,

S*(t) = ∏_{j: a_j ≤ t} (1 − d_j/n′_j),

with corresponding standard error

s.e.{S*(t)} ≈ S*(t) √( Σ_{j: a_j ≤ t} d_j/[n′_j (n′_j − d_j)] ).

6 Regression models

6.1 Inhomogeneous Populations

We have so far considered survival analysis under the assumption that the lifetimes of individuals in our population are independent, identically distributed realisations of a random variable T following a common distribution F. This assumption of a homogeneous population is restrictive. We will often have additional information about the individuals, knowledge of which affects our beliefs about their lifetime distribution. In this chapter we shall consider regression methods for accommodating explanatory variables which we believe affect the distribution of T for each individual.
These variables could be:

• demographic, e.g. age, sex;
• behavioural, e.g. smoking history, exercise taken;
• physiological, e.g. blood pressure, cholesterol level;
• external, e.g. treatment group/centre, dose level.

If the information on individuals is categorical (e.g. smokes/does not smoke) and we only have a few explanatory variables, we may be able to partition the data set into smaller groups of homogeneous individuals and use the methods in Chapters 3-5 to estimate distributions separately for each group. For example, we could split the population into smokers and non-smokers, derive Kaplan-Meier estimates Sˆ_smokers and Sˆ_non-smokers, and compare. However, often we might have many factors to take into account, and if some of them are continuous then this simple approach breaks down. Methods that can deal with these general regression situations include accelerated life models and the popular proportional hazards model.

Suppose we have decided upon p explanatory variables (or covariates) which we believe may affect the time to event. Let z_i be the p-vector of values of these covariates observed for individual i. Where possible we attempt to standardise these covariates so that z = 0 relates to some standard set of conditions, such as the control group in a clinical trial. Under both the proportional hazards and accelerated lifetime approaches, a general model for lifetime distributions is constructed from two parts:

1. A baseline model for the (hypothetical) standard individuals, corresponding to z = 0;
2. A scheme for modifying the baseline model as the covariates z deviate from zero.

6.2 Cox's Proportional Hazards Model

The Cox model (Cox, 1972) proposes that the hazard rates of individuals are related via the relationship

µ(t; z) = µ0(t) exp(β · z), (12)

where β ∈ R^p is a vector of regression parameters. Equation (12) shows a multiplicative influence on the hazard of any deviation away from zero in each of the p covariates in z.
Here µ0(t) is known as the baseline hazard and represents the hazard of a (possibly hypothetical) individual with z = 0. The survivor function and density follow:

S(t; z) = S0(t)^{exp(β·z)}, f(t; z) = exp(β · z) S0(t)^{exp(β·z)−1} f0(t),

where S0(t) and f0(t) are the baseline survivor and density functions corresponding to µ0(t). In (12), only µ0, not z or β, depends on time, but the model can also be formulated with time-dependent covariates.

Under the Cox model, the hazard functions of two individuals with covariates z1, z2 are in constant proportion at all times; that is,

µ(t; z1)/µ(t; z2) = exp(β · z1)/exp(β · z2) = exp{β · (z1 − z2)},

giving rise to the name proportional hazards. Note that in general we can have any positive function ψ of the covariates leading to µ(t; z) = µ0(t) ψ(z; β) and a 'proportional hazards' structure. However, Cox's model easily ensures that the hazard is always positive, and gives a linear model for the log-hazard which is very convenient in theory and practice.

The utility of the model arises from the idea that the general shape of the hazard functions could be the same for all individuals in a population, with any covariate-specific differences between individuals having a constant, multiplicative effect. So, if we are not primarily concerned with the precise form of the hazard, but with the effects of the covariates, we may wish to ignore µ0 and estimate the regression coefficient vector β from the data without reference to the shape of µ0. This is termed a semi-parametric approach: a parametric model for covariate effects, exp(β · z), and a distribution-free treatment of µ0.

The Cox model (12) dominates the literature on regression of lifetimes, due to its simplicity and versatility.

Cox, D. R. (1972) Regression Models and Life Tables. Journal of the Royal Statistical Society, Series B, 34, 187-220.
6.2.1 Partial likelihood function

Suppose we have observation times t1, ..., tn, some of which are right-censoring times. Let each of these individuals have associated covariate vectors z1, ..., zn. We consider inference about β when the baseline hazard function is completely unknown. This raises the question: "What is the likelihood of observing the data as a function of β when µ0 is unspecified?" That is, we would like the marginal likelihood of β. Strictly, this is not analytically available (this would require Bayesian methods). However, Cox developed the following (subjective) argument:

In the absence of knowledge about µ0, the time intervals between deaths can provide little or no information about β, as their distribution will depend heavily on µ0. Hence it is only the event time ordering of the individuals that holds any information about β.

The contribution to the likelihood function for β for an individual with covariates z_i that dies at time t_i is

P(individual i with covariates z_i dies at t_i | one death at t_i)
= P(individual i with covariates z_i dies at t_i) / P(one death at t_i)
= µ(t_i; z_i) / Σ_{j∈R_{t_i}} µ(t_i; z_j)
= exp(β · z_i) / Σ_{j∈R_{t_i}} exp(β · z_j),

where R_{t_i} is the risk set; that is, the set of individuals "at risk" (or "in view", so not dead or censored) at time t_i. This leads to

L(β) = ∏_{i∈U} exp(β · z_i) / Σ_{j∈R_{t_i}} exp(β · z_j), (13)

where the product is over the set of uncensored observations (death times). The likelihood (13) is not a full likelihood function for the observed data, as it makes no use of the actual event/censoring times. For this reason it is referred to as a partial likelihood.

Note that the baseline hazard cancels out in the partial likelihood, and so L(β) is independent of µ0 when we only consider the order in which the deaths and losses to censoring occurred. Note also that the censored observations only appear in the denominator of (13), through their presence/absence in the risk sets R_{t_i} at each of the death times.
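A sketch evaluating the partial likelihood (13) numerically, using the five observation times of the example below with made-up scalar covariates (the z_i are left symbolic in the text):

```python
import math

# Cox partial likelihood for five individuals; covariate values are made up.
# Each record is (time, event, z): event = 1 for a death, 0 for censoring.
data = [(2.8, 1, 0.0), (3.1, 0, 1.0), (2.2, 1, 0.5), (4.7, 1, 1.0), (2.6, 0, 0.0)]

def partial_likelihood(beta):
    L = 1.0
    for t_i, event, z_i in sorted(data):     # deaths processed in time order
        if not event:
            continue
        risk = [z for (t, _, z) in data if t >= t_i]   # risk set R_{t_i}
        L *= math.exp(beta * z_i) / sum(math.exp(beta * z) for z in risk)
    return L

# With beta = 0 every psi_i = 1, so L = (1/5) * (1/3) * (1/1) = 1/15.
assert abs(partial_likelihood(0.0) - 1.0 / 15.0) < 1e-12
```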
If the event times are ordered, t1 < t2 < ... < tn, we can write j ∈ R_{t_i} ≡ {t_j ≥ t_i} ⟺ j ≥ i.

Example Consider the following data on 5 individuals: z1, death at 2.8; z2, censored at 3.1; z3, death at 2.2; z4, death at 4.7; z5, censored at 2.6. So there are 3 deaths. Writing ψ_i = exp(β · z_i), the partial likelihood for β is constructed sequentially:

L(β) = ψ3/(ψ1 + ψ2 + ψ3 + ψ4 + ψ5) × ψ1/(ψ1 + ψ2 + ψ4) × ψ4/ψ4.

In general, analytically maximising L(β) (or ℓ(β)) with respect to β is not possible. Most statistical packages contain numerical procedures for fitting a Cox model, such as Newton methods, which make use of ℓ′(β) and ℓ″(β).

Handling ties: In practice, complications in evaluating the partial likelihood can arise if

1. some d_j > 1: more than one death at any one time;
2. some censored observations coincide with a death time t_j.

It is usual to deal with 2 by assuming that the losses to censoring occurred just after time t_j, so individuals censored at t_j are included in the risk set R_{t_j}. Full analytic treatment of multiple simultaneous deaths, case 1, leads to complicated expressions for the likelihood, since all possible orderings of the d_j deaths from R_{t_j} should be considered. The following approximation due to Breslow (1974) is commonly used. For death times t1, ..., tk,

L(β) = ∏_{j=1}^{k} exp(β · s_j) / [Σ_{i∈R_{t_j}} exp(β · z_i)]^{d_j},

where s_j is the sum of the covariate vectors z of the d_j lives observed to die at time t_j.

The partial likelihood is a true likelihood for the order of the events, and so has all the usual properties; through maximisation it yields an estimator βˆ which is asymptotically (multivariate) normally distributed and unbiased, and whose asymptotic covariance matrix can be estimated by the inverse of the observed information matrix, found from

I(β)_{ij} = −∂²ℓ(β)/∂β_i∂β_j, i, j = 1, ..., p,

Breslow, N. (1974) Covariance Analysis of Censored Survival Data.
Biometrics, 30, 89-99.

evaluated at βˆ, where ℓ(β) = log{L(β)} is the log partial likelihood. I(βˆ) is usually provided as a by-product by most computer packages when fitting the Cox model using Newton's method. This allows standard errors for βˆ to be obtained, which are helpful for model checking.

6.2.2 Model checking

In a practical survival problem, several possible explanatory variables might present themselves as affecting the survival time. Part of the modelling procedure is to assess which (if any) are relevant to the model; that is, which ones have a significant effect on the distribution of T. The fact that the partial likelihood behaves like a full likelihood allows us to make use of the test statistics from §4.3.

Suppose we currently have a model with p covariates and we wish to consider the inclusion of an extra q covariates. We can fit both models, one with p and the other with the p + q covariates, by maximising the partial likelihoods. Let ℓ_p and ℓ_{p+q} be the maximised log-likelihoods of the two nested models; note that ℓ_{p+q} ≥ ℓ_p. The likelihood ratio statistic 2(ℓ_{p+q} − ℓ_p) has an asymptotic χ²_q distribution under the null hypothesis that the extra q covariates have no additional effect in the presence of the original p explanatory variables:

H0: β_{p+1} = ... = β_{p+q} = 0 vs. H1: (β_{p+1}, ..., β_{p+q}) ∈ R^q.

Example Suppose we are seeking a model for learning the effect of hypertension on survival. The first model we might consider has two covariates z_i = (z_{i1}, z_{i2}) relating to the sex of an individual, z_{i1}, and their blood pressure, z_{i2}. Suppose then we wanted to test the hypothesis that cigarette smoking has no effect, allowing for sex and blood pressure. We can construct two nested models with alternative covariate vectors z^(1) = {(z_{i1}, z_{i2}): i = 1, ..., n} and z^(2) = {(z_{i1}, z_{i2}, z_{i3}): i = 1, ..., n}, where z_{i3} is a 0-1 indicator of whether the i-th individual smokes.
We fit both models and note the values of the maximised log-likelihoods for the two models, ℓ_{z^(1)} and ℓ_{z^(2)}. Under the assumption that smoking has no effect, the test statistic W = 2(ℓ_{z^(2)} − ℓ_{z^(1)}) has an asymptotic χ²_1 distribution. If W is small we cannot reject the null hypothesis that smoking has no effect.

6.2.3 Estimating the baseline distribution

Once the regression coefficients have been estimated by MLE, it is possible to obtain a nonparametric estimate of the baseline survivor function. The following formula from Kalbfleisch and Prentice (1980)³ is a generalisation of the product-limit (PL) estimator of §5.2 when there are no ties:

Ŝ_0(t) = ∏_{t_i ≤ t} ( 1 − exp(β̂ · z_i) / ∑_{j ∈ R_{t_i}} exp(β̂ · z_j) )^{exp(−β̂ · z_i)}.

As with the PL estimator, if the last observation is a censored value, then Ŝ_0(t) is undefined past this point. Note that a plot of Ŝ_0(t) may often suggest a suitable parametric form for S.

Of course there is no requirement that µ_0 should be unspecified to perform regression modelling in survival analysis. In the case where we can assume µ_0 to be from a known parametric family, full likelihood-based inference for {µ_0, β} can proceed. In most practical applications the form of µ_0 will not be well understood beforehand, in which case learning the effects of the covariates and from there estimating the shape of the baseline model, as above, might lead to consideration of a suitable parametric form.

6.3 Accelerated Failure Time Model

The accelerated failure time model is an alternative, general model for survival data in which the covariates are assumed to act multiplicatively on the time-scale. The model has intuitive appeal in that the covariates are seen to 'speed up' or 'slow down' the passage of time for one individual relative to another. Formally, the survivor function for an individual is of the form

S(t; z) = S_0(t ψ(z; β)),

where S_0 is a baseline survivor function and ψ(z; β) is a positive function of the covariates parameterised by β.
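The time-acceleration can be made concrete with a minimal sketch. Here we assume an exponential baseline survivor function with a made-up rate λ_0, and take ψ(z; β) = exp(β · z) with a scalar covariate; none of these numbers come from the notes:

```python
import math

lam0, beta = 0.1, 1.0   # made-up baseline rate and regression coefficient

def S0(t):
    """Baseline survivor function: Exponential(lam0)."""
    return math.exp(-lam0 * t)

def S(t, z):
    """AFT survivor function: time is rescaled by psi(z; beta)."""
    return S0(t * math.exp(beta * z))

# For z > 0 time is 'sped up': survival at time t equals baseline
# survival at the later time t * e^{beta z}, so S(t, z) < S0(t).
```

Incidentally, with an exponential baseline the AFT and proportional hazards models coincide, since rescaling time by e^{βz} multiplies the constant hazard λ_0 by the same factor.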
Assuming ψ(0; β) = 1, the corresponding random variables are related by

T_z = T_0 / ψ(z; β),

where T_z is the future lifetime random variable for an individual with covariates z and hence T_0 is the baseline future lifetime random variable. It follows that

f(t; z) = f_0[t ψ(z; β)] ψ(z; β),
µ(t; z) = µ_0[t ψ(z; β)] ψ(z; β).

As in the proportional hazards model, it is convenient to take ψ(z; β) = exp(β · z).

³Kalbfleisch, J. D. and Prentice, R. L. (1980) The statistical analysis of failure time data. Wiley.

6.3.1 Relation to proportional hazards

For illustration we consider regression modelling of two groups of individuals with

• z_i ∈ {0, 1} (e.g. smokers/non-smokers), and
• a baseline hazard which is piecewise constant, say

µ_0(t) = 0.5 for 0 ≤ t ≤ 1,  and  µ_0(t) = 1 for t > 1.

Note that this is in effect a piecewise exponential model. Consider a proportional hazards model µ(t; z_i) = µ_0(t) exp(β z_i) and an accelerated failure time model µ(t; z_i) = µ_0[t exp(β z_i)] exp(β z_i), both with β = 1. Under the accelerated failure time model the increase in hazard for the group {i : z_i = 1} occurs sooner than under the proportional hazards model (see Fig 4, left). The 'kink' in the survivor function (Fig 4, right) also occurs earlier.

[Figure 4: Hazard (left) and survivor (right) functions against time: Baseline (–); Proportional Hazards (- -); Accelerated Failure (· · ·). With ψ(z_i; β) = exp(z_i), z_i = 1.]

7 The Markov Model

7.1 Stochastic Processes

Previously we have modelled the future lifetime of an individual currently aged x as a single random variable T with an unknown distribution F. In this chapter we take a different approach and treat each lifetime as a realisation of a stochastic process; that is, an indexed family of random variables on a probability space.
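As a preview of how a lifetime arises as a realisation of a process, a sketch of the simplest case: under the two-state Alive-Dead model of §7.3 with a constant transition intensity µ (a made-up value here), the jump time is Exponential(µ) and can be simulated by inverting the survivor function:

```python
import math
import random

mu = 0.05   # made-up constant force of mortality

def simulate_lifetime(rng):
    """Jump time of the Alive->Dead process: solve S(t) = e^{-mu t} = u
    for u ~ Uniform(0, 1), giving t = -log(u) / mu."""
    u = rng.random()
    return -math.log(u) / mu

rng = random.Random(1)   # fixed seed for reproducibility
sample = [simulate_lifetime(rng) for _ in range(10_000)]
mean_T = sum(sample) / len(sample)   # should be close to E[T] = 1/mu = 20
```

Each simulated value is one sample path summarised by its single jump time; censoring could be overlaid by cutting each path off at an observation limit.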
Definition: A stochastic process on a set Ω is a collection {X(t) | t ∈ S} of random variables X(t) ∈ Ω indexed by a set S. Let P be generic notation for the probability distribution of the random variables X(t).

Usually the index set S ⊆ R. If S is countable (e.g. S = N), then we say {X(t)} is a discrete-time process. If S is an interval of R (e.g. S = [0, ∞)), then {X(t)} is a continuous-time process.

In the Markov model, the sample path of the process {X(t)} (the values that it takes over time) will be used to define our (possibly censored) time to event T.

Example: 2-state model
The simplest example of a continuous-time stochastic process for survival analysis is the two-state Markov model illustrated by the diagram below:

[Figure 5: Alive-Dead model: state Alive (0), with a single transition at rate µ_x to state Dead (1).]

There is just an Alive state and a Dead state, with transitions in one direction only. The process takes value X(t) = 0 if the individual is in state Alive at time t, or X(t) = 1 if the individual is in state Dead, so Ω = {0, 1}. The state Dead is said to be an absorbing state; that is,

∀s > 0,  P(X(t + s) = 1 | X(t) = 1) = 1.

Here the event T = t of Chapters 2-5 corresponds simply to a change of state from Alive to Dead at time t.

7.2 Homogeneous Continuous-Time Markov Jump Processes

7.2.1 Markov Jump Processes

If Ω is a finite set of N states (e.g. Ω = {1, 2, ..., N}) and X(t) is a stochastic process on the state space Ω, we say X(t) is a jump process. If the path of a continuous-time process X = {X(t) : t ≥ 0} makes only a finite number of jumps in any finite time interval, we say it is a pure jump process. Furthermore, a pure jump process will be said to be a Markov jump process if it satisfies the Markov property

P{X(t) = j | X(t_1) = j_1, ..., X(t_s) = j_s} = P{X(t) = j | X(t_s) = j_s},

∀ j_1, ..., j_s, j ∈ Ω and 0 ≤ t_1 < · · · < t_s < t. We shall use a Markov jump process to represent the future path of an individual currently aged x.
[Explicit reference to x is suppressed in this section to simplify notation.]

7.2.2 Transition probabilities

Definition: The transition probability p_ij(s, t) is defined to be the probability of being in state j at time t, given that at time s the process was in state i:

p_ij(s, t) = P(X(t) = j | X(s) = i),   t ≥ s ∈ R+,  i, j ∈ Ω.

In actuarial notation, _{t−s}p^{ij}_s ≡ p_ij(s, t).

7.2.3 Transition Intensities

Specifying coherent transition probabilities in continuous-time Markov models would be highly complex, and so instead these models are more typically characterised through a set of corresponding transition intensities.

Definition: For t ≥ 0 and i, j ∈ Ω, i ≠ j, we define the force of transition or transition intensity

µ_ij(t) = lim_{h↓0} p_ij(t, t + h) / h   (14)

and assume these limits exist. Transition intensities generalise the concept of the hazard function met in §2.4. When explanatory variables are available for individuals, the transition intensities for each individual can be constructed to depend on a covariate vector z in an analogous way to the regression methods of Chapter 6.

Under the assumption of existence of (14) we can write

p_ij(t, t + dt) = µ_ij(t) dt + o(dt),   dt ≥ 0,

where o(dt)/dt → 0 as dt ↓ 0. So for small dt, the transition probability p_ij(t, t + dt) between two distinct states i and j is approximately linear in dt with constant of proportionality µ_ij(t):

p_ij(t, t + dt) ≈ µ_ij(t) dt.

Note that at time t, if µ_ij(t) = 0 then transitions from i to j surely cannot occur. For completeness, ∀i ∈ Ω we can define

µ_ii(t) = lim_{h↓0} [p_ii(t, t + h) − 1] / h,   (15)

although note that we do not have transitions i → i. This gives

p_ii(t, t + dt) = 1 + µ_ii(t) dt + o(dt),   dt ≥ 0
⟹ p_ii(t, t + dt) ≈ 1 − (−µ_ii(t)) dt.

This can be most helpfully interpreted as

P{X(t + dt) ≠ i | X(t) = i} ≈ −µ_ii(t) dt.

Defining µ_ii(t) this way leads to a useful relationship.
Since ∀i ∈ Ω and t, h > 0 we know ∑_{j=1}^{N} p_ij(t, t + h) = 1, we therefore have

µ_ii(t) = lim_{h↓0} [1 − ∑_{j≠i} p_ij(t, t + h) − 1] / h = −∑_{j≠i} lim_{h↓0} p_ij(t, t + h) / h
⟹ µ_ii(t) = −∑_{j≠i} µ_ij(t).

Notice from (14) and (15) that ∀ i ≠ j ∈ Ω, ∀t ≥ 0, µ_ij(t) ≥ 0 and µ_ii(t) ≤ 0.

Definition: The N × N matrix G_t with ij-th entry (G_t)_ij = µ_ij(t) is called the generator of the process.

The generator can be thought of as the continuous-time analogue of the one-step transition probability matrix of discrete-time Markov chains. We have seen that µ_ii(t) = −∑_{j≠i} µ_ij(t) and hence

∑_{j=1}^{N} µ_ij(t) = 0, ∀i ∈ Ω  ⟹  G_t 1′ = 0′,

where 1′, 0′ are vectors of ones and zeros respectively.

7.2.4 Homogeneity

Definition: The process is said to be homogeneous if p_ij(t, t + h) = p_ij(0, h), ∀i, j ∈ Ω, t, h > 0, in which case we can simplify notation and write p_ij(h) for p_ij(t, t + h).

Clearly the condition of homogeneity is equivalent to the transition intensities being constant: µ_ij(t) ≡ µ_ij, G_t ≡ G.

7.2.5 Age rate intervals

Homogeneity is clearly a very strong condition, as it implies individuals in the process experience no effects from ageing. To make this assumption tenable, it is common to divide the time domain of interest (perhaps R+) into small interval domains S_1, S_2, ... and fit separate Markov models for each interval S_i, within which homogeneity is more reasonable. Smoothing methods, known as graduation (see Chapter 11), can be deployed later to combine the resulting piecewise constant intensity estimates to obtain a smooth curve µ_ij(t) over the full time domain.

In studies of human mortality, these homogeneous intervals will typically be the integer age ranges [x, x + 1) for x ∈ N = {0, 1, 2, ...}. Hence in these applications our implicit, hidden notation is actually µ^{ij}_{x+1/2}, as the estimator can be thought of as corresponding most closely to the midpoint of this integer age x interval.
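Assembling a generator is mechanical: the off-diagonal entries are the non-negative intensities µ_ij, and each diagonal entry is filled in as µ_ii = −∑_{j≠i} µ_ij so that every row sums to zero. A small sketch for a hypothetical homogeneous 3-state model (the intensity values are made up):

```python
# Hypothetical off-diagonal intensities mu_ij for states {0, 1, 2};
# pairs not listed (including the absorbing state 1's exits) are zero.
N = 3
rates = {(0, 1): 0.05, (0, 2): 0.30, (2, 0): 0.20, (2, 1): 0.10}

# Build G row by row, then set each diagonal entry to minus the sum
# of the rest of its row, so that G 1' = 0'.
G = [[rates.get((i, j), 0.0) for j in range(N)] for i in range(N)]
for i in range(N):
    G[i][i] = -sum(G[i][j] for j in range(N) if j != i)

row_sums = [sum(row) for row in G]   # each is (numerically) zero
```

Note that a state with no listed exits, here state 1, automatically gets an all-zero row, which is exactly the generator row of an absorbing state.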
Alternative rate interval definitions for age would change this; for example, if we defined age group x as [x − 1/2, x + 1/2], i.e. age x years on the nearest birthday, then we would be approximately estimating µ^{ij}_x.

7.2.6 Assuming homogeneity

From now on we shall assume that X(t) is a homogeneous process. We write P_t for the N × N matrix with entries p_ij(t), and write the transition intensities µ_ij(t) simply as µ_ij. However, it should be noted that many of the models and results that follow can be extended to the inhomogeneous, duration-dependent setting.

Theorem 7.1. The family {P_t : t ≥ 0}, known as the transition semigroup of the process, is a stochastic monoid under the matrix multiplication operation; that is, it satisfies the following:

1. {P_t : t ≥ 0} are right-stochastic: ∀t ≥ 0, the entries of each row of P_t are non-negative and sum to 1.
2. Identity element: P_0 = I, the N × N identity matrix.
3. Semigroup property: P_{s+t} = P_s P_t for s, t ≥ 0. These are known as the Chapman-Kolmogorov equations.

Proof
1. For any i ∈ Ω,

∑_{j=1}^{N} (P_t)_{ij} = ∑_{j=1}^{N} p_ij(t) = P( ∪_{j=1}^{N} {X(t) = j} | X(0) = i ) = 1.

2. Clearly (P_0)_{ij} = p_ij(0) = δ_ij.

3. Chapman-Kolmogorov:

p_ij(s + t) = P(X(s + t) = j | X(0) = i)
           = ∑_{k=1}^{N} P(X(s) = k | X(0) = i) P(X(s + t) = j | X(s) = k)
           = ∑_{k=1}^{N} p_ik(s) p_kj(t),

where the second equality uses the law of total probability together with the Markov property.

The distribution of {X(t) : t ≥ 0} is completely determined by the stochastic semigroup {P_t : t ≥ 0} and the initial distribution of X(0). Furthermore, if we condition on X(0) (i.e. assuming we can observe the initial states of our individuals), then the evolution of X(t) depends only on P_t.

7.3 A Two-State Markov Model

The two-state homogeneous Markov model in Figure 5 provides a very simple and familiar special case of these general Markov jump processes. From above, the single transition intensity µ = µ_01 satisfies

µ = lim_{h↓0} p_01(x, x + h) / h,   ∀x > 0.
(16)

Noticing that in the right-hand side of (16), p_01(x, x + h) is equal to F_x(h) in random-variable terminology, it is then apparent that here µ must be the hazard function. Since µ is constant under the assumption of homogeneity, it immediately follows that the corresponding time-to-event random variable has an Exponential(µ) distribution.

For making inference about this two-state Markov model in the presence of possibly censored data, the results for maximum likelihood estimation of µ immediately follow from §4.2.1. Let v be the sum of the waiting times (i.e. the time until death or censoring) of all individuals in a study. In actuarial science, this is often called the central exposed to risk and is denoted E^c_x for age group x. Then if d is the number of deaths in the sample, recall that the MLE for µ is given by

µ̂ = d / v.

We had approximately µ̂ ∼ Normal(µ, µ²/d), and then, more approximately, µ̂ ∼ Normal(µ, d/v²). Further, we can obtain MLEs for the survival probabilities S(t), and hence the transition probabilities p_01(t), via Ŝ(t) = exp{−µ̂t} and 1 − Ŝ(t) respectively.

7.4 Calculating Transition Probabilities

The real utility of the Markov model for survival data is that the two-state model of §7.3 can be extended to any number of states, with arbitrary transitions between them. For example, the three-state Sickness-Death model (see Fig 6).

[Figure 6: Sickness-Death model: states Able (0), Ill (2) and Dead (1), with transition intensities µ^{02}_x (Able → Ill), µ^{20}_x (Ill → Able) and µ^{01}_x (Able → Dead).]
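Anticipating how such multi-state transition probabilities are computed: for a homogeneous process, P_t = exp(tG), and the Chapman-Kolmogorov property P_{s+t} = P_s P_t can be checked numerically. A minimal sketch for a hypothetical Sickness-Death generator (the intensities are made-up values), using a truncated power series for the matrix exponential, which is adequate for a small, well-scaled generator:

```python
# States: 0 = Able, 1 = Dead (absorbing), 2 = Ill; intensities are made up.
G = [[-0.35, 0.05, 0.30],
     [ 0.00, 0.00, 0.00],
     [ 0.20, 0.10, -0.30]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(G, t, terms=40):
    """Truncated series exp(tG) = sum_{m>=0} (tG)^m / m!."""
    n = len(G)
    P = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in P]                                # (tG)^m / m!
    tG = [[t * G[i][j] for j in range(n)] for i in range(n)]
    for m in range(1, terms):
        term = [[v / m for v in row] for row in matmul(term, tG)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

P1 = expm(G, 1.0)   # one-year transition probabilities; rows sum to 1
```

In practice a library routine (e.g. a standard matrix-exponential function) would be used instead of this hand-rolled series, but the structure is the same: the generator alone determines every P_t.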