Recommended Literature
All the material used in this course can be found in various textbooks.
• McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. Chapman &
Hall,
• Lee, Y., Nelder, J. A. & Pawitan, Y. (2006). Generalized linear models with random
effects: unified analysis via Hlikelihood. CRC Press
• Hastie, T. & Tibshirani, R. (1990). Generalized additive models. Wiley Online
Library
• Venables, W. N. & Ripley, B. D. (2013). Modern applied statistics with SPLUS.
Springer Science & Business Media
• Dobson, A. J. & Barnett, A. (2008). An introduction to generalized linear models.
CRC press
• Wood, S. (2006). Generalized Additive Models: An Introduction with R. CRC Press
• Ravishanker, N. & Dey, D. K. (2001). A first course in linear model theory. CRC
Press
• Faraway, J. J. (2016). Extending the linear model with R: generalized linear, mixed
effects and nonparametric regression models, vol. 124. CRC press
• Fox, J. & Weisberg, S. (2011). An R companion to applied regression. Sage
Publications
These books can be found (some electronically) in the college library.
1
Chapter 1. Introduction to Statistical Models
Consider situations where we observe some data y from experiments, events etc. We
interpret the observation y as a realisation of some random variable Y.
A statistical model is the specification of the distribution of Y up to an unknown
parameter θ. Often, the observations y = (y1, . . . , yn) ∈ Rn is a vector and
Y = (Y1, . . . ,Yn) is a random vector. In this case a statistical model is the specification
of the joint distribution of Y1, . . . ,Yn up to an unknown parameter θ.
Example Mobile Phones
Observations: yi = student i looks at their mobile in first 10 mins of the lecture.
Model: Y1, . . . ,Yn iid, P(Yi = true) = 1− P(Yi = false) = θ, θ ∈ [0, 1].
In many experiments and situations, the observations Y1, . . . ,Yn do not have the same
distribution. The distribution of Y1, . . . ,Yn may depend on nonrandom quantities
x1, . . . , xn called covariates.
Example Do students who attend more lectures get higher marks in Maths exams?
yi = average mark, xi = number of lectures attended
Model: Yi = a+ bxi + ei, i = 1, . . . , n, ei
iid∼ N(0, σ2), θ = (a, b, σ2).
Model Fitting
We shall describe the process of the model fitting as follows:
1. Model specification — specify the distribution of the observations Y1, . . . ,Yn up
to unknown parameters.
2. Estimation of the unknown parameters of the model.
3. Inference — this involves constructing confidence intervals and testing
hypotheses about the parameters.
4. Diagnostics — to check how well the model fits the data.
An “ideal” model should
• agree with the observed data reasonably well.
• not contain more parameters than are necessary.
• be easy to interpret.
2
Chapter 2. Normal Linear Models
We start with a revision of linear models. We shall see how this simple model leads
to some closed form results. However, we shall also point out that linear models have
their limitations.
The intuition behind the fit
prestige.R prestigedata.RData
Consider the following dataset relating to the prestige of different occupations.
Each row, relates to a different occupation, contains a prestige score and the log
income. Can the (log) income be used to measure the prestige of occupations?
Occupation Prestige Log Income
gov.administrators 68.8 9.421
general.managers 69.1 10.16
accountants 63.4 9.135
purchasing.officers 56.8 9.09
chemists 73.5 9.036
physicists 77.6 9.308
biologists 72.6 9.019
architects 78.1 9.558
civil.engineers 73.1 9.339
mining.engineers 68.8 9.308
...
...
...
l l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l l
l
l
l
l
l
l ll
l
l
ll
l l
l
l
l
l
l
ll
l
lll
l
l
ll
l l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
7 8 9 10
20
40
60
80
log income
Pr
es
tig
e
A possible linear model:
Yi = β1 + xiβ2 + ei, i = 1, . . . , n,
where
• Yi is the outcome or response of interest (random variable) — observed
realisations yi.
• xi is a covariate — observed constant
• β1, β2 are unknown parameters.
• e1, . . . , en are iid error (random variables) with E(ei) = 0 and var(ei) = σ2. Note
that σ2 is another unknown parameter — not observed.
3
×
(x1, y1)
×
(x2, y2)
×
(x3, y3)
×
(x4, y4)
×
(x5, y5)
β1 + β2x
x
y
y1 − β1 − x1β2
The leastsquares estimators β̂1 and β̂2 are defined as the minimisers of
n
∑
i=1
(yi − β1 − xiβ2)2.
2.1 Specification of Normal Linear Models
Definition. A linear model is defined as
Y = Xβ+ ,
where
• Y is the ndimensional random vector of observations, called the response.
• X ∈ Rn×p is the design matrix (known) which contains the covariates
(predictors). Let r := rank(X).
• β ∈ Rp is the unknown parameter vector.
• is the nvariate unobserved error vector such that E() = 0
All vectors are column vectors and are presented in bold.
Further if, ∼ N(0, σ2 In) for some σ2 > 0, then the linear model is normal. The
normal linear model can be written as
Y ∼ N(Xβ, σ2 In),
where In is the n× n identity matrix.
The assumption that ∼ N(0, σ2 In) is called the Normal Theory Assumption
(NTA) and has two parts:
• the observations error are uncorrelated; and
• the variance of each Yi is the same, namely, σ2.
4
Examples of normal linear models
Example Consider the model where Y1, . . . ,Yn are independent normally distributed
random variables with variance σ2 and E(Yi) = β1 + xiβ2 for i = 1, . . . , n. This is a
normal linear model as it can be written in the form
Example Consider the model where Yi ∼ N(µi, σ2) independently with µi =
xi1β1 + xi2β2 + xi3xi4β3 for i = 1, . . . , n. This is a normal linear model as it can be
written as
Example Consider two school classes each consisting of 10 children. Denote the two
classes as class A and class B. Suppose we model the observations of all children
as normally distributed random variables with variance σ2. Further suppose for
child i from class A, denoted as YiA, that E(YiA) = µA for all i and analogously
the observation for child i from class B, denoted by YiB, that E(YiB) = µB. Then this is
a normal linear model as it can be written as
5
Therefore, the linear model accommodates a large variety of different models, i.e.
those with and without a constant intercept term and those structured according to
classes/groups. However, . . .
Example Suppose the independent observations Y1, . . . ,Yn are normal random
variables with E(Yi) = β1 + x
β2
i1 and var(Yi) = σ
2 for i = 1, . . . , n.
. . . is not a linear model, as it cannot be written in the form Y = Xβ+
2.2 Estimation
So far we have just introduced the normal linear model, where Y ∼ N(Xβ, σ2 In). The
joint probability density function of Y is
We now estimate the unknown parameter β using the maximum likelihood approach;
that is, we want to find (keeping σ2 fixed)
β̂ = argmax
β
L(β, σ2;y),
where L(β, σ2;y) is the likelihood function. Typically, it is easier to obtain the MLE
by maximising `(β, σ2;y) = log L(β, σ2;y).∗
First, the loglikelihood of the data is
To find the maximum likelihood estimator (MLE), β̂, we differentiate with respect to
β (keeping σ2 fixed and treating it as a constant) to get
−2XTY + 2XTXβ.
Solving this equation equal to zero give the socalled normal equations
XTXβ̂ = XTY .
If XTX is invertible (iff X has full rank), the normal equations can be solved directly
giving the MLE as
β̂ =
(
XTX
)−1
XTY . (2.1)
The MLE for σ2 can be obtained using the same method.
∗The probability density function (pdf) f (y; θ) is considered a function of y for fixed or given θ
whereas the likelihood L(θ; y), and therefore the loglikelihood `(θ; y) = log L(θ; y), is considered a
function of θ for a particular data y observed.
6
Properties of the Maximum Likelihood Estimator
The MLE of β has the following properties.
• β̂ is linear in Y .
• β̂ is unbiased for β
• cov(β̂) = σ2(XTX)−1.
In fact there is good reason to use the MLE β̂; it can be used to construct a best linear
unbiased estimator (BLUE). An estimator γ̂ is called linear if there exists L ∈ Rn such
that γ̂ = LTY .
Theorem 1. (The Gauss Markov Theorem for full rank linear models) Assume a
linear model with unknown parameter β, a design matrix of full rank and cov(e) =
σ2 In. Fix any c ∈ Rp and let β̂ be the least squares estimator of the linear model.
Then the following holds: The estimator cTβ̂ has the smallest variance among all
linear unbiased estimators for cTβ.
It so happens that the least squares estimator i.e. the β that minimises:
S(β) := (Y − Xβ)T(Y − Xβ)
and the MLE, β̂, are the same (see problem sheet).
Before we present an estimator of the variance σ2, it is necessary to introduce the
notion of projection matrices.
Projection Matrices
Let L be a linear subspace of Rn, where dim L = p ≤ n.
Definition. P ∈ Rn×n is projection matrix onto L, if
1. Px = x for all x ∈ L,
2. Px = 0 for all x ∈ L⊥ = {z ∈ Rn : zTy = 0 ∀y ∈ L}︸ ︷︷ ︸
orthogonal complement
It follows that
Lemma 2. P is a projection matrix ⇐⇒
symmetric︷ ︸︸ ︷
PT = P and
idempotent︷ ︸︸ ︷
P2 = P .
Definition. Ŷ = Xβ̂, where β̂ is the MLE, is called the vector of fitted values. In the
case where X has full rank, Ŷ = X(XTX)−1XTY .
7
It turns out that
P := X(XTX)−1XT
is a projection matrix; since (by Lemma 2)
where the first equation holds because (A−1)T = (AT)−1 for a nonsingular matrix A.
We now define the residuals and the residual sum of squares.
Definition. e = Y − Ŷ is called the vector of residuals.
Notice that the residuals can be written as
e = Y − Ŷ = Y − PY = (In − P)Y .
It can be shown that In − P is also a projection matrix (see problem sheet). The
projection matrix P is sometimes referred to as the hat matrix, as it puts the hat on Y.
Definition. RSS = eTe is called the residual sum of squares.
The RSS quantifies the departure of the data from the model  ideally we want a small
RSS value. It is also the minimum of S(β).
Theorem 3. σ̂2 := RSSn−r is an unbiased estimator of σ
2.
Proof. See problem sheet
If we derived the MLE for σ2 in the usual way — maximising the loglikelihood
and solving equal to zero, we would obtain a biased estimator. To see this:
`(β, σ2;Y ) =
∂`
∂(σ2)
=
Solving ∂`
∂(σ2)
= 0 gives:
σ˜2 =
Notice that if we plug in β = β̂, we have σ˜2 = RSS/n. Since E(RSS) = σ2(n− r), it
follows that E(σ˜2) = σ2(n− r)/n. This is why we use the estimator in Theorem 3.
8
Example In the following, we consider a normal linear model where the MLE of
β can be derived explicitly. Let Y1, . . . ,Yn be independent random variables with
Yi ∼ N(µi, σ2) with
µi = β1 + β2ai,
where a1, . . . , an are known deterministic constants. Take n ≥ 2. Then
Assume that not all ais are equal to ensure X has full rank. Then
Now we can write the MLE β̂ explicitly as βˆ = (XTX)−1XTY :
Identifiability
In this section we have discussed how to estimate the unknown parameter, β, in the
normal linear model. It can be the case that two parameter values lead to the same
distribution for the observed data. This occurs when r < p and we say that β is not
identifiable.
9
Example in R
gas.R gasdata.RData
Suppose we have data consisting of household gas consumption and average external
temperature (see table below). Can the external temperature be used to measure the
amount of gas consumed in a household?
# Temp Gas
1 0.8 7.2
2 0.7 6.9
3 0.4 6.4
4 2.5 6.0
5 2.9 5.8
6 3.2 5.8
7 3.6 5.6
8 3.9 4.7
9 4.2 5.8
10 4.3 5.2
11 5.4 4.9
12 6.0 4.9
13 6.0 4.3
# Temp Gas
14 6.0 4.4
15 6.2 4.5
16 6.3 4.6
17 6.9 3.7
18 7.0 3.9
19 7.4 4.2
20 7.5 4.0
21 7.5 3.9
22 7.6 3.5
23 8.0 4.0
24 8.5 3.6
25 9.1 3.1
26 10.2 2.6
Table 2.1: Gas consumption data
We take Gas as the response, Yi, and Temp as the covariate xi. Suppose that we use
a linear normal model to explain the data; where Yi are independent N(µi, σ2) with
µi = β1 + β2xi for i = 1, . . . , 26. For this model, we have
Then to compute the MLE for β̂ with the following commands. First, we assemble
the vectors of observations Y and the design matrix X:
> Y < dat$Gas
> X < cbind(1,dat$Temp)
As XTX has full rank, we can compute its inverse:
10
> solve(t(X)%*%X)
[,1] [,2]
[1,] 0.17718685 0.025929965
[2,] 0.02592996 0.004846722
The MLE β̂ = (XTX)−1XTY can be computed as follows:
> betahat < solve(t(X)%*%X)%*%t(X)%*%Y
> betahat
[,1]
[1,] 6.8538277
[2,] 0.3932388
Then to plot the model fit we can use
> plot(dat,pch=16) #pch=16 selected for better presentation
> xs < seq(2,12,by=0.5) #dummy x values
> lines(x=xs,y=betahat[1]+xs*betahat[2])
l
l
l
l
l l
l
l
l
l
l l
ll
ll
l
l
l
ll
l
l
l
l
l
0 2 4 6 8 10
3
4
5
6
7
Temp
G
as
Further, the residual sum of squares (RSS) is
> ehat < YX%*%betahat
> RSS < t(ehat)%*%ehat
> RSS
11
[,1]
[1,] 1.899568
Lastly, we can compute σ̂2:
> sig2hat < RSS/(262)
> sig2hat
[,1]
[1,] 0.07914867
12
2.3 Inference
In section 2.1 we specified a normal linear model. Then in section 2.2 we discussed
how to estimate the parameters of the linear model.
In this section we shall consider two main tools of statistical inference
1. Confidence Intervals The width of these intervals indicate uncertainty of
parameter estimates and inferences.
2. Hypothesis Testing These are used to compare how two related models fit the
data. To compare models we require a measure of their goodness of fit. We
shall present goodness of fit statistics based on the loglikelihood function.
In order to construct confidence intervals and measure the goodness of fit of normal
linear models, we require sampling distributions. Note that the derivations of these
sampling distributions are omitted — see Statistical Modelling I for details.
Before we begin, here is an example of constructing confidence interval and
performing a hypothesis test as a reminder:
Example
Confidence Interval
Let X1, . . . ,Xn be iid N(µ, σ2) random variables with σ2 > 0 known. We know that
where X = 1n ∑
n
i=1 Xi. This is our pivotal quantity for µ. We can construct a 1− α
confidence interval for µ as follows:
13
Notes
•
(
X− zα/2σ/
√
n,X+ zα/2σ/
√
n
)
is a random interval.
•
(
x− zα/2σ/
√
n, x+ zα/2σ/
√
n
)
is the observed interval.
• We could use asymmetric values, but symmetrical values i.e. using ±zα/2 gives
the shortest intervals in this example.
Definition. A (1− α) confidence interval for θ is a random interval I that contains
the true parameter with probability ≥ 1− α, i.e.
Pθ(θ ∈ I) ≥ 1− α ∀θ ∈ Θ
The interval, I, can be any type of interval. For example, if L and U are random
variables with L ≤ U then I could be the open interval (L,U), the closed interval
[L,U], the unbounded interval [L,∞), etc.
Example
Hypothesis Test (Continuing the previous example)
If we wish to test the null hypothesis H0 : µ = µ0 versus H1 : µ 6= µ0 a test rejection
region is {x : x − µ0 > zα/2σ/
√
n}. So, the null hypothesis, H0, is not rejected for
samples with
x− zα/2σ/
√
n < µ0 < x+ zα/2σ/
√
n.
This test is constructed to have significance level α i.e. P(H0 is not rejectedµ = µ0) =
1− α.
Example
Pipes
In a manufacturing process that produces pipes, historical data suggests that the
diameter of the pipes are N(30, 1.52). The process is modified but the variance remains
the same (at 1.52). After the modification n = 13 pipes are produced and the average
diameter of the pipes is 32.05; so that the new diameter ∼ N(µ, 1.52). Let’s construct
a 95% confidence interval for µ:
n = , σ = , x = , α = , zα/2 =
which gives the 95% confidence interval: . If we were to test the
hypothesis H0 : µ = 30 versus H1 : µ 6= 30; that is the null is that the process
has not changed, we would reject the null hypothesis as 30 is not contained in the
interval.
14
Confidence Intervals and Regions
We now discuss how to construct confidence regions for the model parameter β. First
we require a pivotal quantity for σ2. It turns out that
RSS
σ2
∼ χ2n−r, (2.2)
where r = rank(X) and χ2k denotes the chisquared distribution with k degrees of
freedom. Then it follows that, for any c ∈ Rp, and when X has full rank i.e.
rank(X) = p
cTβ̂− cTβ√
cT(XTX)−1c RSSn−p
∼ tn−p, (2.3)
where tk denotes the studentt distribution with k degrees of freedom. This follows
from
cTβ̂− cTβ√
cT(XTX)−1cσ2
∼ N(0, 1) (2.4)
and recalling that:
By selecting c appropriately, we can use (2.3) to construct confidence intervals for
components of β e.g. β1 — using c = (1, 0, . . . , 0).
In fact, we can go further and construct confidence regions for β by using
(β̂− β)TXTX(β̂− β)
RSS
n− p
p
∼ Fp,n−p, (2.5)
where Fa,b denotes the Fdistributions with degrees of freedom a and b.
Example Suppose we use a normal linear model with known σ2 = 1 and design
matrix
X =
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
.
A hypothesis test H0 : β2 = 0 versus H1 : β2 6= 0 is constructed as follows:
15
Note
• If σ is known, use the normal distribution. If σ is unknown, use the
tdistribution.
• We can now able to construct confidence intervals (and regions) and conduct
hypothesis test for β — see examples at the beginning of this section.
Warning
Following from (Wasserman, 2013, p. 111):
“There is sometimes much confusion over the interpretation of a confidence
interval. A confidence interval is not a probability statement about µ (or some other
parameter of interest) since µ is a fixed quantity, not a random variable. Some texts
interpret confidence intevals as follows”:
if I repeat the experiment over and over again, the interval will contain the true parameter
value 95% of the time.
“This is correct but useless since we rarely repeat the same experiment over and
over. A better interpretation is this”:
On day 1, you collect data and construct a 95% confidence interval for a parameter θ1.
On day 2, you collect new data and construct a 95% confidence interval for an unrelated
parameter θ2. On day 3, you collect new data and construct a 95% confidence interval for an
unrelated parameter θ3. You continue this way constructing confidence interval for a sequence
of unrelated parameters θ1,, θ2, . . . . Then the 95% of the time your intervals will contain the
true parameter. There is no need to introduce the idea of repeating the same experiment over
and over.
16
pvalues
Sometimes one does not want to specify the significance level α in advance. In this
case, the socalled pvalue is reported.
p = sup
θ∈Θ0
Pθ(observing something at least as extreme as the observation).
Then we reject H0 iff the pvalue p ≤ α.
Hypothesis Testing: The Ftest
Suppose we wish to test whether a submodel of a linear model E(Y ) = Xβ is better
i.e. we wish to test
H0 : E(Y ) ∈ span(X0) vs H1 : E(Y ) /∈ span(X0),
for some matrix X0 with span(X0) ⊂ span(X). The span is the linear space spanned
by the column vectors of the matrix.
Example Consider comparing the model H0 : Yi = β1 + β2x1,i + ei against H1 : Yi =
β1 + β2x1,i + β3x2,i + ei.
Theorem 4. Under H0 : E(Y ) ∈ span(X0),
F =
RSS0 − RSS
RSS
n− r
r− s ∼ Fr−s,n−r,
where r = rank(X) and s = rank(X0).
Think: what is the difference between the Ftest and the ttest?
A possible use of this Ftest is the ANOVA test, which we now discuss.
OneWay Analysis of Variance (ANOVA)
Suppose we have m groups of observations, each group consisting of k observation.
Label Yij as the jth observation from ith group. Assume the model is
E(Yij) = µ+ βi i = 1, . . . ,m; j = 1, . . . , k,
and that var(Yij) = σ2 with all observations independent. We want to test the null
hypothesis that all the observations Yij come from the same population; that is
H0 : β1 = β2 = · · · = βm,
or equivalently
H0 : E(Yij) = µ i = 1, . . . ,m; j = 1, . . . , k.
We can test this hypothesis using the Ftest (Theorem 4).
17
2.4 Prediction
Once we have fitted a model, we may use it to predict outcomes. Prediction, as
interpreted here, is concerned with answers to ‘whatif’ questions rather than actual
prediction into the future.
Suppose we have fitted a linear model with design matrix X to obtain the estimates
β̂ and σ̂2. For given covariates, x?, the predicted (expected) response is ŷ? = xT? β̂.
To ascertain the uncertainty in this prediction, we need to clear about the type of
prediction we are making.
Suppose we have fitted a linear regression model that predicts the selling price of
homes in a given area that is based on predictors such as the number of bedrooms
and closeness to a major highway. There are two kinds of predictions that can be
made for a given x?:
• Suppose a specific house comes on the market with characteristics x?. Its selling
price will be xT?β+ e. Since E(e) = 0, the predicted price is xT? β̂, but in assessing
the variance of this prediction, we must include the variance of e.
• Suppose we ask the question: “What would a house with characteristics x? sell
for on average?” This selling price is xT?β and is again predicted by xT? β̂ but
now only the variance in β̂ needs to be taken into account.
We have that var(xT? β̂) = xT? (XTX)−1x?σ2. So a 100(1− α)% confidence interval
for a single future response is
ŷ? ± t(α/2)n−p σ̂
√
1+ xT? (XTX)−1x?,
where P(T ≤ t(α/2)n−p ) = 1 − α/2, T ∼ tn−p. The confidence interval for the mean
response for given x? is
ŷ? ± t(α/2)n−p σ̂
√
xT? (XTX)−1x?.
These confidence intervals are sometimes referred to as prediction intervals.
2.4.1 Worked Example
lmpreds.R
We now show how predictions and their confidence intervals can be computed in R.
Consider the Galapagos dataset (which is in the faraway library).
18
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58.00 23.00 25.09 346.00 0.60 0.60 1.84
Bartolome 31.00 21.00 1.24 109.00 0.60 26.30 572.33
Caldwell 3.00 3.00 0.21 114.00 2.80 58.70 0.78
Champion 25.00 9.00 0.10 46.00 1.90 47.40 0.18
Coamano
...
...
...
...
...
...
...
This dataset consists of 30 Islands and seven variables. We shall be interested in
Species which is the number of species of tortoise found on the island. Let’s start by
fitting a linear model:
> library("faraway")
> X < cbind(1,gala$Area,gala$Elevation,gala$Nearest,gala$Scruz,
gala$Adjacent)
> y < gala$Species
> XTX < solve(t(X)%*%X)
> betahat < XTX%*%t(X)%*%y
> ehat < yX%*%betahat
> RSS < sum(ehat^2)
> sig2hat < RSS/(306)
Now suppose we want to predict the number of species of tortoise on an island
with predictors 0.08, 93, 3, 12, 0.34. The prediction (expected) response from the model
is:
> xstar < c(1,0.08,93,3,12,0.34) # include 1 for an intercept term.
> ystar < sum(xstar*betahat)
> ystar
[1] 33.89224
Now consider the two types of confidence intervals: we must decide whether we
are predicting the number of species on one new island or the mean response for all
islands with the same predictors x?.
For a mean response, we would use the second type of interval. First, we require
the critical value from the tdistribution
> qcrit < qt(0.975,24)
> qcrit
[1] 2.063899
19
The width of the confidence interval is:
> ciw < qcrit*sqrt(sig2hat)*sqrt(xstar%*%XTX%*%xstar)
> ciw
[,1]
[1,] 32.41583
and so the interval is
> c(ystarciw,ystar+ciw)
[1] 1.476401 66.308070
The prediction interval for a single future response is:
> ciw < qcrit*sqrt(sig2hat)*sqrt(1+xstar%*%XTX%*%xstar)
> c(ystarciw,ystar+ciw)
[1] 96.06219 163.84667
20
2.5 Diagnostics
Coefficient of Determination
A way of measuring the goodness of fit of a linear model is by inspecting the
coefficient of determination, which we now explain. In the simplest model with only
an intercept term
Y =
1
...
1
β1 + , E(e) = 0
we have the RSS = ∑ni=1(Yi − Y)2. Larger models with more parameters and large
design matrices will have a smaller RSS.
For models that include an intercept term, a measure of the quality of a linear
model is
R2 = 1− RSS
∑ni=1(Yi −Y)2
,
which is called the coefficient of determination or R2 statistic. Notice that 0 ≤ R2 ≤ 1
and R2 = 1 corresponds to the “perfect” model.
Intuition
For the intercept only models, the 1n ∑
n
i=1(Yi −Y)2 is an estimator of σ2 — call this the
total variance. Then
RSS
1
n ∑
n
i=1(Yi −Y)2
≈ var. of model
variance
.
Therefore, R2 ≈ proportion of total variance of the data that is explained by the model.
There is also an alternate version of the R2 statistic called the adjusted coefficient
of determination defined as
R2 = 1− RSS/(n− p)
∑ni=1(Yi −Y)2/(n− 1)
.
Notice that the RSS terms are divided by the models’ respective degrees of freedom 
thus R2 attempts to account for the degrees of freedom for each model.
21
2.5.1 The Danger of Using R2 : Anscombe Quartet
Anscombe.R
Using the R2 and the adjusted R2 summary statistics alone can be dangerous, as
illustrated by Anscombe’s quartet. Anscombe’s quartet is 4 datasets — see table
below:
x1 x2 x3 x4 y1 y2 y3 y4
1 10.00 10.00 10.00 8.00 8.04 9.14 7.46 6.58
2 8.00 8.00 8.00 8.00 6.95 8.14 6.77 5.76
3 13.00 13.00 13.00 8.00 7.58 8.74 12.74 7.71
4 9.00 9.00 9.00 8.00 8.81 8.77 7.11 8.84
5 11.00 11.00 11.00 8.00 8.33 9.26 7.81 8.47
6 14.00 14.00 14.00 8.00 9.96 8.10 8.84 7.04
7 6.00 6.00 6.00 8.00 7.24 6.13 6.08 5.25
8 4.00 4.00 4.00 19.00 4.26 3.10 5.39 12.50
9 12.00 12.00 12.00 8.00 10.84 9.13 8.15 5.56
10 7.00 7.00 7.00 8.00 4.82 7.26 6.42 7.91
11 5.00 5.00 5.00 8.00 5.68 4.74 5.73 6.89
If we proceed to fit a linear model for the 4 datasets, we obtain the following
parameter estimates and R2 values:
β0 β1 R2 Adj. R2
Model 1 3.000 0.500 0.667 0.629
Model 2 3.001 0.500 0.666 0.629
Model 3 3.002 0.500 0.666 0.629
Model 4 3.002 0.500 0.667 0.630
Table 2.2: Summary statistics
The parameter estimates and the R2 statistic values are approximately the same for the
4 datasets. So these models are (more or less) equal, in terms of the fit and summary
statistics. Let’s do what we should have done at the start; look at the data:
22
l
l
l
l l
l
l
l
l
l
l
5 10 15
4
6
8
10
12
x1
y1
l
l
ll
l
l
l
l
l
l
l
5 10 15
4
6
8
10
12
x2
y2
l
l
l
l
l
l
l
l
l
l
l
5 10 15
4
6
8
10
12
x3
y3
l
l
l
ll
l
l
l
l
l
5 10 15
4
6
8
10
12
x4
y4
Figure 2.1: Anscombe’s Quartet
Clearly, the data is different for each dataset and the fitted linear model may be
inappropriate in some cases.
Note
• Looking at plots of the data before fitting is important.
• Using a summary statistic, such as R2, on its own can be dangerous.
23
2.5.2 Outliers
An outlier is an observation that does not conform to the general pattern of the
rest of the data. Outliers can occur due to, among other causes, error in the data
recording, the data is a mixture of two or more populations and when the model
requires improvement. These outlying cases may involve large residuals and often
have dramatic effects on the fitted function. Therefore, it is important to study these
outlying cases carefully.
A case may be outlying or extreme with respect to its Y value, its X value(s) or
both.
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
ll
l ll
lll
l
l
l l
x
y
1
2
3 4
Figure 2.2: Illustration of different types of outliers.
In the scatter plot (Fig. 2.2) case 1 is outlying with respect to its Y value, given X.
Cases 2, 3 and 4 are outlying with respect to their X values since they have much
larger X values; cases 3 and 4 are also outlying with respect to their Y values given X.
Not all outlying cases have a strong influence on the fit. Case 1 may not be too
influential because a number of other cases have similar X values. This will keep
the fitted function from being displaced too far by the outlying case. Likewise, case 2
may not be too influential because its Y value is consistent with the regression relation
displayed by the nonextreme cases. Cases 3 and 4, on the other hand, are likely to be
very influential in affecting the fit of the regression function.
24
Notes
For models with one or two predictor variables, it is relatively easy to identify
outlying cases, with respect to their X or Y values by means of boxplots, scatter
plots, residuals etc and to study whether they are influential in affecting the fitted
regression function. However, when more than two predictor variables are included
in the model, identification of outliers by simple plots becomes difficult.
Residuals
A practical method for detecting outliers is to look at observations with residuals that
are “large”. We have
e =
E(e) =
cov(e) =
We now discuss some refined residuals for identifying observations with outlying
Y values. In the following, we are interested in identifying cases that are multivariable
outliers with respect to their X values. We shall assume a design matrix of full rank.
Standardized Residuals
To use residuals for detecting outlying X observations, it is important to consider
their variance — note that the variance of each residual may be substantially different.
Therefore it is appropriate to consider the magnitude of each ei relative to its estimated
standard deviation. An estimator of the standard deviation of ei is√
σ̂2(1− hii),
where hii ≡ Pii. Recall that in practice, σ2, is typically unknown. Then the
standardized residual is
ri :=
ei√
σ̂2(1− hii)
.
The studentized residuals ri have constant variance when the model is appropriate.
Studentized residuals often are called internally studentized residuals.
Using the plugin estimate for σ2 means that we lose the normality, however the
standardised residuals should be approximately normal. A plot of ri against some
other variable should not reveal any trends or patterns.
25
Residual Plots
l
l
l
l
l
l
l l
ll
l
l
l
l
l
l
l
lll
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
ll
l
l
ll
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
−20 0 20 40
−
3
−
2
−
1
0
1
2
3
x
St
an
da
rd
ise
d
Re
sid
ua
ls
l lllllll
l
ll
ll
ll
l
l
l
ll
ll
l
l
l
ll
l
l
ll
ll
ll
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
lll
l
l
l
l
l
ll
l
l
l
l
ll
l
l
l
l
l
l
−20 0 20 40
−
4
−
2
0
2
4
x
St
an
da
rd
ise
d
Re
sid
ua
ls
l
ll
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
50 100 150 200−
6
−
4
−
2
0
2
4
6
x
St
an
da
rd
ise
d
Re
sid
ua
ls
Fine Heteroscedastic error
(nonconstant variance)
Nonlinear
Leverages
We may be interested in how much each observation influences the model fit. For
instance, consider the residuals e where
var(ei) = σ2(1− hii).
An observation’s leverage is related to the variance of its residual.
Definition. The ith observation in a linear model has leverage equal to hii.
If an observation has leverage close to 1, its residual has a small variance. Notice
that the leverage only depends on the design matrix X.
In practice, the leverages are compared to the “average” r/n and looking for
hii > 2r/n, as ∑ni=1 hii = trace(P) = rank(X) = r
26
Deleted Residuals
Intuition:
l l
ll l
l
l
ll
l
l
ll
l
ll
l
l
ll
ll
l
l
l
l
l
l
−2 0 2 4 6 8 10
0
5
10
x
y
Figure 2.3: Illustration of observations with high leverage. Dotted line is a linear
model fit including N; Solid line is a linear model without N
Fit the model without the ith observation. Estimate (or predict) the ith observation
using xi; denote the expected value by Ŷ(i). The difference between the actual
observed value Yi and the estimated expected value Ŷ(i) is the deleted residual
denoted by di. Specifically,
di = Yi − Ŷ(i)
An algebraically equivalent expression for di that does not require recomputation of
the fit omitting the ith case is
di =
ei
1− hii .
The estimated variance of the deleted residual for the ith case, di, is
σ̂2(i)(1+ x
T
i (X
T
(i)X(i))
−1xi)
where xi is the predictor vector for the ith case, σ̂2(i) is the mean square error when
the ith case is omitted, and X(i) is the design matrix with the ith row deleted. An
equivalent expression for this variance is
σ̂2(i)
1− hii
Note: The estimated variance of the deleted residual can be obtained by “predicting”
the ith observation and using section 2.4.
27
Studentized Deleted Residuals
As before, we can “studentized” the deleted residuals, by dividing it by its standard
deviation. The studentized deleted residual, denoted by ti, is
ti =
di√
σ̂2
(i)
1−hii
=
ei√
σ̂2
(i)(1− hii)
The studentized deleted residual ti is also called an externally studentized residual.
2.5.3 Cook’s Distance
Another way to measure the influence of the observation is to consider the change or
influence it has on the estimator β. One such measure is Cook’s distance
Ci =
(β̂(i) − β̂)TXTX(β̂(i) − β̂)
pRSS/(n− p) , (leaveoneout estimator)
where β̂(i) is the estimator calculated without using the ith observation. The rule of
thumb is to look at observations with Ci close to 1.
2.5.4 Distributional Checks
If Y1, . . . ,Yn are independent N(µ, σ2) distributed random variables, then
P(Yi ≤ y) = P
(
Yi − µ
σ
≤ y− µ
σ
)
= Φ
(
y− µ
σ
)
.
Then the empirical cdf
Fn(x) :=
1
n
n
∑
i=1
I(Yi ≤ x)→ Φ
(
x− µ
σ
)
n→ ∞.
Therefore,
Φ−1(Fn(yi)) ≈ yi − µσ .
A plot of Φ−1(Fn(yi)) against yi should look like a straight line. This is called a QQ
plot.
28
2.5.5 Interpretation of Cook’s Distance
cookdistancedata.RData
We now investigate residuals, leverage and Cook’s distance in more detail. Consider 4
artificial data sets which are presented below each with a normal linear model fitted.
Further, a plot of the residuals against the leverages is presented.
Fine
l
l
l
l
l
l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
ll
−5 0 5 10 15 20
−
20
0
20
40
Data Plot and Model Fit
x
y
0.00 0.05 0.10 0.15
−
1
0
1
2
Residuals vs Leverage
Leverage
St
an
da
rd
ize
d
re
sid
ua
ls
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Cook's distance
0.5
3
17
2
High Leverage
Low Residual l
l
l
l
l
l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
l
l
−10 0 10 20 30
−
20
20
60
x
y
0.00 0.10 0.20 0.30
−
1
0
1
2
Leverage
St
an
da
rd
ize
d
re
sid
ua
ls
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Cook's distance 0.5
0.5
1
25
17
5
Low Leverage
High Residual l
l
l
l
l
l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
ll
l
−5 0 5 10 15 20
−
20
0
20
40
x
y
0.00 0.05 0.10 0.15
−
1
0
1
2
3
Leverage
St
an
da
rd
ize
d
re
sid
ua
ls
l
l
l
l
l
l ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Cook's distance
0.5
1
25
5
13
High Leverage
High Residual l
l
l
l
l
l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
ll
l
−20 −10 0 10 20
−
20
0
20
40
x
y
0.0 0.1 0.2 0.3
−
2
0
1
2
3
4
5
Leverage
St
an
da
rd
ize
d
re
sid
ua
ls
l
l
ll
l
l
l
l
l l
l
ll
ll
l
l
l
l
l
l
l
l
l
Cook's distance
1
0.5
0.5
1
25
518
The red data point is a suspicious data point i.e. either high leverage, high residual
(in absolute value) or both. The red line is the fit including the red data point, and
the black line is the fit excluding the red data point.
29
Note
A data point with a high leverage has the potential to change the model fit
significantly. We also need to inspect the residual to see how much the point actually
affects the fit. This is why we look at Cook’s distance — it combines the leverage and
standardised residual as it can be written equivalently as
Ci = r2i
hii
(1− hii)r
where ri is the standardised residual and r = rank(X).
The Cook’s distance is presented in the plots on the previous page, in the form of
contours — for instance those points lying between the 0.5 red dotted lines have a
Cook’s distance less than or equal to 0.5. The red dashed lines are contour lines of
Cook’s distance.
30
2.6 Worked Example
carbo.R carbodata.RData
The data in Table 2.3 shows the percentage of total calories obtained from complex
carbohydrates, for twenty male diabetics who have been on a highcarbohydrate diet,
along with their age, weight and percentage of calories as protein.
# Carbohydrate Age Weight Protein
1 33 33 100 14
2 40 47 92 15
3 37 49 135 18
4 27 35 144 12
5 30 46 140 15
6 43 52 101 15
7 34 62 95 14
8 48 23 101 17
9 30 32 98 15
10 38 42 105 14
11 50 31 108 17
12 51 61 85 19
13 30 63 130 19
14 36 40 127 20
15 41 50 109 15
16 42 64 107 16
17 46 56 117 18
18 24 61 100 13
19 35 48 118 18
20 37 28 102 14
Table 2.3: Example Dataset
We take the Carbohydrate value as our responses Yi with age, weight and protein
as the covariates. Then we fit normal linear model with Yi ∼ N(µi, σ2) where
µi = E(Yi) = β1 + β2agei + β3weighti + β4proteini (i = 1, . . . , 20).
> y < dat$Carb #response Y is the Carbs.
> X < cbind(1,dat$Age,dat$Weight,dat$Protein) #design matrix
31
Then, to find the maximum likelihood estimator of β we need to solve XTXβ̂ = XTy:
> beta.hat < solve(t(X)%*%X)%*%t(X)%*%y
> t(beta.hat)
[,1] [,2] [,3] [,4]
[1,] 36.96006 0.1136764 0.2280174 1.957713
The unbiased estimator of the variance, RSS/(n− p), can be computed and used to
compute the standard deviation for each component of β̂.
> sig.sq.hat < sum((yX%*%beta.hat)^2)/(length(y)length(beta.hat))
> sqrt(diag(sig.sq.hat*solve(t(X)%*%X)))
[1] 13.07128293 0.10932548 0.08328895 0.63489286
The residuals are
> ehat < yX%*%beta.hat
> summary(ehat)
V1
Min. :10.3424
1st Qu.: 4.8203
Median : 0.9897
Mean : 0.0000
3rd Qu.: 3.8553
Max. : 7.9087
The R2 and the adjusted coefficient R2 are
> RSS < t(ehat)%*%ehat
> RSS0 < sum((ymean(y))^2)
> R2 < 1(RSS/RSS0)
> R2
32
[,1]
[1,] 0.4805428
> 1(RSS/(204))/(RSS0/(201))
[,1]
[1,] 0.3831445
We can check these results using the lm function in R as follows:
> mylm1 < lm(Carbohydrate~Age+Weight+Protein,data=dat)
> summary(mylm1)
Call:
lm(formula = Carbohydrate ~ Age + Weight + Protein, data = dat)
Residuals:
Min 1Q Median 3Q Max
10.3424 4.8203 0.9897 3.8553 7.9087
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 36.96006 13.07128 2.828 0.01213 *
Age 0.11368 0.10933 1.040 0.31389
Weight 0.22802 0.08329 2.738 0.01460 *
Protein 1.95771 0.63489 3.084 0.00712 **

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 5.956 on 16 degrees of freedom
Multiple Rsquared: 0.4805, Adjusted Rsquared: 0.3831
Fstatistic: 4.934 on 3 and 16 DF, pvalue: 0.01297
33
To compare two models using the Ftest (Theorem 4), we can use the anova command
(not to be confused on the ANOVA test)
> mylm2 < lm(Carbohydrate~Age,data=dat)
> anova(mylm2,mylm1)
Analysis of Variance Table
Model 1: Carbohydrate ~ Age
Model 2: Carbohydrate ~ Age + Weight + Protein
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 1088.98
2 16 567.66 2 521.32 7.3469 0.005452 **

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Doublecheck the Fstatistic agrees
> RSS0 < sum(mylm2$residuals^2)
> F < ((RSS0RSS)*16)/(RSS*2)
> F
[,1]
[1,] 7.346886
and also check the corresponding pvalues
> pf(F,df1=2,df2=16,lower.tail=FALSE)
[,1]
[1,] 0.005452024
Conclusion: Since the corresponding pvalues is less than 0.01 we make the following
conclusion: There is sufficient evidence to reject the null hypothesis at the 1% level
(corresponding to the smaller model), and therefore accept the alternative. This means
we should proceed to use the larger model.
34
Lastly, R can also produce some residual plots for us as follows:
> plot(mylm2)
37.0 37.4 37.8 38.2
−
15
−
5
0
5
10
Fitted values
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Residuals vs Fitted
12
18
11
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
−2 −1 0 1 2
−
1
0
1
2
Theoretical Quantiles
St
an
da
rd
ize
d
re
sid
ua
ls
Normal Q−Q
12
18
11
37.0 37.4 37.8 38.2
0.
0
0.
4
0.
8
1.
2
Fitted values
St
a
n
da
rd
iz
e
d
re
si
du
a
ls
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Scale−Location
1218
11
0.00 0.10 0.20
−
2
−
1
0
1
2
Leverage
St
an
da
rd
ize
d
re
sid
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Cook's distance 0.5
0.5
Residuals vs Leverage
8
12
18
35
2.7 Transformations
The way that the data are presented may not necessarily reflect the way we should
use the data in our linear models. Recall the prestige dataset — I used the logarithm
of income, rather than income directly. Learning to use transformations effectively is
part of data analysis. Fortunately, computational tools exists to help us transform the
data.
Transformations of the response and predictors can improve the fit and correct
violations of model assumptions such as nonconstant error variance.
2.7.1 Transforming the Response
Suppose that you are contemplating a logged response in a simple regression
situation:
log y = β0 + β1x+ e.
On the original scale of the response, this model becomes:
y = exp(β0 + β1x) exp(e)
In this model, the errors enter multiplicatively and not additively. Therefore, for
linear models the logged response model requires that we believe the errors enter
multiplicatively on the original scale.
After transforming the response, any interpretations will need to be expressed on
the original scale. For example, in the logged model above, your prediction would
be exp(ŷ?). If your prediction confidence interval in the logged scale was [l, u], then
you would use [exp l, exp u]. When you use a log transformation on the response, the
regression coefficients have a particular interpretation:
log ŷ = β̂0 + β̂1x1 + · · ·+ β̂pxp
ŷ = eβ̂0eβ̂1x1 . . . eβ̂pxp
An increase of one unit in x1 would multiply the predicted response (in the original
scale) by exp(β̂1).
36
BoxCox Transformation
boxcox.R
The BoxCox method is a popular way to determine a transformation on the response
(and also the predictors). The idea is to work with the transformation:
y(λ) =
yλ−1
λ λ 6= 0
log(y) λ = 0
.
For fixed y > 0, y(λ) is continuous in λ. For y ≤ 0 the transformation is undefined.
The idea is to choose λ that maximises the likelihood.
By a change of variables, we obtain the distribution of y; the resulting
loglikelihood being:
`(λ,β, σ2) = −n
2
log(2piσ2)− 1
2σ2
(y(λ) − Xβ)T(y(λ) − Xβ) + (λ− 1)
n
∑
i=1
log(yi).
Substituting the MLEs of β and σ2, for a fixed λ, we obtain the (profile) loglikelihood:
`(λ) = c− n
2
log(RSSλ) + (λ− 1)
n
∑
i=1
log(yi) (2.6)
where c is a constant not involving λ and RSSλ is the residual sum of squares using
y(λ) as the response.
The procedure is to evaluate the profile loglikelihood (2.6) for a range of possible
values of λ. Rather than selecting the maximum, one often rounds to a value such as
−1, 0, 1/2, 1 or 2, particularly if the profile loglikelihood is relatively flat around the
maximum.
Formally, let λ̂ denote the value that maximises the profile loglikelihood. We can
test the hypothesis H0 : λ = λ0 for any fixed value λ0 by calculating the likelihood
ratio
2(`(λ̂)− `(λ0))
which is approximating χ21. This can also be used to construct confidence intervals.
Fortunately, we can maximise the loglikelihood and construct confidence intervals
numerically in R. We will need the boxcox function from the MASS package:
> library("MASS")
We shall demonstrate the boxcox function on the savings dataset
37
> library("faraway")
> data(savings)
> mylm < lm(sr~pop15+pop75+dpi+ddpi,data=savings)
> boxcox(mylm,plotit=TRUE)
> boxcox(mylm,plotit=TRUE,lambda=seq(0.5,1.5,by=0.1))
−2 −1 0 1 2
−
20
0
−
15
0
−
10
0
−
50
λ
lo
g−
Li
ke
lih
oo
d
95%
0.6 1.0 1.4−
60
.0
−
58
.5
−
57
.0
λ
lo
g−
Li
ke
lih
oo
d
95%
Figure 2.4: Output plots from boxcox command: Loglikelihood for BoxCox
transformations.
Not much can be seen from the first plot, so the range of λ is narrowed (second
plot). These plots also include the confidence interval for λ. The confidence interval
for λ is roughly (0.6, 1.4). We can see that there is no good reason to transform.
Some general considerations concerning the BoxCox method are
1. The BoxCox method is affected by outliers.
2. If maxi yi/mini yi is small, then the BoxCox will not have much effect.
3. There is some doubt whether the estimation of λ, counts as an extra parameter
to be considered in the degrees of freedom.
Notes
• The BoxCox method is not the only way of transforming — e.g. YeoJohnson
transformations.
• You can take a BoxCox style approach for each of the predictors, choosing the
transformation to minimize the RSS.
38
2.8 Contrasts for Categorical Variables
We have encountered different types of variables so far — continuous and
categorical. The way we handle categorical variables within a model needs some
careful attention which we now discuss. As a running example in this section, let’s
return to the Prestige data set. Suppose we want to fit the model:
[drop subscript i for now] prestige = β1 + β2education+ β3type+ e e ∼ N(0, σ2).
The variable education is assumed to be continuous, whereas the occupation type is
categorical taking the the levels: blue collar (bc), professional (prof), and white collar
(wc).
Let’s first discuss the wrong way to include the occupation type into the linear model.
[START OF THE WRONG WAY]
Suppose we enter the occupation type into the linear model by encoding occupation
type numerically as shown below.
Label (used in dataset) Encoding
blue collar 1
professional 2
white collar 3
In this model we assume that all three occupations exhibit the same relationship
between education and prestige but that they start from different baselines. Using
the encoding scheme described in the table above yields the following equations for
the occupations.
blue collar: prestige = β1 + β2education+ β3
professional: prestige = β1 + β2education+ 2β3
white collar: prestige = β1 + β2education+ 3β3
39
This translates into the following illustrative diagram:
prestige
education
The numeric coding scheme we have used for occupation has imposed an unwanted
assumption/constraint on the location of the intercepts:
• While the location of the intercepts of blue collar and professional workers are
arbitrary, the location of white collar workers is not. The numeric coding scheme
has constrained, for a fixed level of education, the difference in prestige between
blue collar and professional workers to be exactly the same as the difference
between white collar and professional workers, β3.
• Furthermore the prestige difference between white collar and blue collar
workers, for fixed education level, is twice the difference between either one
and professional workers.
These assumptions may or may not be correct — in general we would prefer to test
this instead of assuming them through the postulated model. In order to correct this
problem, we require a more general encoding of the categories.
Note
Just changing the number assignments, i.e. instead of using 1, 2 and 3 use other
numbers, will lead to the same problem – just with a different set of constraints on
the intercepts. Therefore, we require a way of encoding that does not enforce any
constraints on the estimates of the model.
[END OF THE WRONG WAY]
The standard approach when handling a categorical variable with n distinct levels is
to create n− 1 different regressors. These are often called dummy regressors.
If the Prestige data set is read into R using the read.table function, the character
variable occupational type is automatically converted to a variable of class type
"factor". When a factor is used in a linear model it automatically employs the correct
40
number of dummy regressors. The coding used is termed a contrast and in R the
default contrast type for a factor is "treatment" and denoted contr.treatment. This
produces the two dummy regressors X1 and X2 shown below.
X1 X2
bc 0 0
prof 1 0
wc 0 1
Note
By default R orders the levels of the character variable and defines the dummy coding
levels in alphabetical order. As a result the first level alphabetically becomes the
baseline level.
If we fit a normal linear model in R with prestige as the response and education and
occupation type as predictors by lm(prestige∼education+type,data=Prestige) the
following model is fitted:
prestige = β1 + β2education+ β3X1 + β4X2 + e.
• In the case X1 = 0 and X2 = 0 for blue collar workers, the model is
prestige = β1 + β2education+ e.
• In the case X1 = 1 and X2 = 0 for professional workers, the model is
prestige = (β1 + β3) + β2education+ e.
• In the case X1 = 0 and X2 = 1 for white collar workers, the model is
prestige = (β1 + β4) + β2education+ e.
These fitted models are illustrated in the following diagram:
prestige
education
41
Since β1, β3, and β4 are estimated separately it is clear from the diagram above there
are no constraints placed on the location of the intercepts.
It is also clear why we only need n − 1 regressors to independently describe n
levels of a categorical variable. In the additive model, the categorical variable levels
just serve to change the intercept of the regression line. Since there is already an
intercept in the model, it is used for one of levels of the categorical variable.
2.8.1 Other Codings
There are many possible coding schemes for categorical variables other than the
treatment contrast. To simplify matters we shall consider a model in which the
only predictor is the variable occupation. Thus in R notation we fit the model:
lm(prestige∼type,data=Prestige). For the following coding schemes, we represent
the occupation type using two regressors.
Dummy Coding
As explained above with dummy coding the occupation variable is converted into
dummy variables X1 and X2 whose values for the various levels of occupation were
listed above. By default the level that is first alphabetically becomes the baseline level.
This is the contr.treatment coding scheme of R. If we specify the regression model
lm(prestige∼type,data=Prestige), R fits the model:
prestige = β1 + β2X1 + β3X2 + e
The model equation can be used to predict the mean as a function of X1 and X2.
When we choose values for X1 and X2 that correspond to the various occupations,
we obtain the means of those professions. More precisely, with the dummy coding
scheme we obtain the following equations for the means:
µbc = β1
µprof = β1 + β2
µwc = β1 + β3
By subtracting the mean for blue collar workers from each of the other equations we
obtain expressions for β2 and β3.
µbc = β1
µprof − µbc = β2
µwc − µbc = β3
Thus in the dummy coding scheme each coefficient measures a difference in
(conditional) mean between one classification level and the classification level that was
42
chosen as baseline. The intercept corresponds to the mean of the baseline classification
level.
Deviation Coding
In R deviation coding is denoted contr.sum. To assign this contrast to the variable
occupation type we would use the following statement:
contrasts(Prestige$type) < "contr.sum"
The deviation coding scheme used in R yields the two dummy regressors shown
below:
X1 X2
bc 1 0
prof 0 1
wc 1 1
If we run lm(prestige∼type), R fits the model:
prestige = β1 + β2X1 + β3X2
As before, the regression model estimates the conditional mean of y, conditional on
the values that are specified for the predictors. Thus when we choose values for X1
and X2 that correspond to the various professions, we obtain the means of those
professions. The deviation coding scheme yields the following equations for the
means.
µbc = β1 + β2
µprof = β1 + β3
µwc = β1 − β2 − β3
Further, summing these equations give
µbc + µprof + µwc = 3β1 ∴ β1 =
µbc + µprof + µwc
3
Thus the intercept in deviation coding corresponds to the mean of all three levels.
From the equations for blue collar and professional and using the formula for β1
above we immediately obtain interpretations for β2 and β3.
β2 = the difference between the mean for blue collar and the overall mean
β3 = the difference between the mean for professional and the overall mean
Thus in deviation coding the coefficients measure the distance between individual
levels and the mean of all the levels.
43
Helmert Coding
In RHelmert coding is denoted contr.helmert. Assigning this contrast to the variable
occupation type is done with the following statement.
contrasts(Prestige$type) < "contr.helmert"
The Helmert coding scheme yields the two dummy regressors shown below
X1 X2
bc 1 1
prof 1 1
wc 0 2
As before when we specify the regression model lm(prestige∼type), R actually fits
the model: prestige = β1 + β2X1 + β3X2. The Helmert coding scheme yields the
following equations for the means.
µbc = β1 − β2 − β3
µprof = β1 + β2 − β3
µwc = β1 + 2β3
As with deviation coding, if we add all three equations together we isolate β1 and
find that the intercept represents the mean of all the levels.
β1 =
µbc + µprof + µwc
3
We can isolate β2 by subtracting the blue collar mean from the professional mean.
µprof − µbc = 2β2 ∴ β2 =
µprof − µbc
2
The best way to interpret this coefficient is to observe that if β2 is not significantly
different from zero we would conclude that the mean prestige for professionals and
the mean prestige for blue collar workers are not significantly different from each
other.
We can isolate β3 by taking the average of the blue collar and professional means
and then subtracting the result from the white collar mean.
µwc −
µbc + µprof
2
∴ β3 =
1
3
(
µwc −
µbc + µprof
2
)
The best way to interpret this coefficient is to observe that if β3 is not significantly
different from zero we would conclude that the mean prestige of white collar workers
is not significantly different from the mean prestige of blue collar and professional
workers together.
Thus Helmert coding compares the current level with the average of the all the
levels that preceded it. Thus Helmert contrast coding is especially appropriate if
there is a natural order to the categories because then sequential comparisons of this
sort make sense.
44
Chapter 3. Generalized Linear Models
Motivating Example
beetle.R beetledata.RData
Let us consider the following dataset regarding numbers of beetles that died after
five hours of exposure to gaseous carbon disulphide at various dose levels. Can the
dosage be used to measure the proportion of deaths?
Dose xi Number ni Number Killed yi
1.6907 59 6
1.7242 60 13
1.7552 62 18
1.7842 56 28
1.8113 63 52
1.8369 59 53
1.861 62 61
1.8839 60 60
l
ll
l
ll
ll
1.0 1.5 2.0 2.5
0.
0
0.
5
1.
0
1.
5
Dose (xi)
Pr
op
or
tio
n
Ki
lle
d
(y i
/ n
i)
Here the response of interest is the proportion of deaths, which must lie between 0
and 1. Thus, a linear model will not explain the data adequately. Therefore, we need
a different class of model  in this case, one where the fitted values remain between 0
and 1. In this chapter, we will discuss the class of generalized linear models (GLMs),
that is able to give a suitable fit. For this example, a model fitted using a generalized
linear model is presented below.
l
ll
l
ll
ll
1.0 1.5 2.0 2.5
0.
0
0.
5
1.
0
1.
5
Dose (xi)
Pr
op
or
tio
n
Ki
lle
d
(y i
/ n
i)
45
3.1 Specification of Generalized Linear Models
3.1.1 The components of GLMs
NORMAL LINEAR MODELS
1. Random Component
The components of Y have
independent normal distributions
with E(Y ) = µ and individual
variance σ2.
2. Systematic Component
Using the covariates x1, . . . ,xp
form the linear predictor
η = Xβ =
p
∑
j=1
xjβ j
3. Link
The link between the random and
systematic components is
η = µ
GENERALIZED LINEAR MODELS
1. Random Component
The components of Y are
independent and have the same
distribution. This distribution is a
member of an exponential family with
E(Y ) = µ.
2. Systematic Component
Using the covariates x1, . . . ,xp form
the linear predictor
η = Xβ =
p
∑
j=1
xjβ j
3. Link
The link between the random and
systematic components is
ηi = g(µi) for i = 1, . . . , n.
In GLMs
• The random component specifies the probability distribution of the response
variables. Specifically, the components of y have pdf or pmf from an exponential
family of distributions (see section 3.1.2), with E(Y ) = µ.
• The systematic component specifies a linear predictor η = Xβ as a function of
the covariates and the unknown parameters.
• The link function g may be any monotonic differentiable function. The link
function provides a functional relationship between the systematic component
and the expectation of the response in the random component; namely η = g(µ)
46
3.1.2 Exponential Families
Definition. Consider a random variable Y whose probability density/mass function
depends on the two parameters θ and φ. The distribution of Y is said to be a member
of an exponential family if its probability density function can be written in the form
exp
{
yθ − b(θ)
a(φ)
+ c(y, φ)
}
, (3.1)
for some specific functions a(·), b(·) and c(·). If φ is known, this is an exponential
family model with canonical parameter θ.
The function a(φ) commonly takes the form
a(φ) = φ/w,
where φ is called the dispersion parameter that is constant over all observations and
w is a known prior weight which can vary across the observations. However, the
practical cases of interest dealt with in this course will use the form a(φ) = φ/w,
where in most cases w = 1.
Example Consider Y ∼ N(µ, σ2). The density function for Y can be written in
exponential family form
Example Consider Y ∼ Poisson(λ). The mass function for Y can be written in
exponential family form
47
Example Consider Y ∼ Inverse Gaussian(µ,λ). Its probability density function can
be written in exponential family form as follows√
λ
2piy3
exp
{−λ(y− µ)2
2µ2y
}
= exp
{−λ(y− µ)2
2µ2y
+
1
2
log
(
λ
2piy3
)}
= exp
{
y(−1/2µ2) + (1/µ)
1/λ
− λ
2y
+
1
2
log
(
λ
2piy3
)}
where we identify θ = (−1/2µ2), φ = 1/λ and
a(φ) = φ, b(θ) = −(−2θ)1/2, c(y, φ) = −1
2
(
log(2piφy3) +
1
φy
)
.
Note that if φ is unknown, it is called a nuisance parameter of the distribution
as it will interfere with inference of µ or θ. We need to be wary about this when
performing hypothesis tests or constructing confidence intervals (see later in section
3.3).
Properties of exponential family distributions
The mean and variance of an exponential family distribution are easily derived as
follows. From the definition of any probability density function∫
f (y; θ, φ)dy = 1.
Under regularity conditions, we can differentiate with respect to θ to obtain
∂
∂θ
∫
f (y; θ, φ)dy =
∫
∂
∂θ
f (y; θ, φ)dy =
∂
∂θ
1 = 0. (3.2)
If we differentiate again with respect to θ, we get∫
∂2
∂θ2
f (y; θ, φ)dy = 0. (3.3)
These general results can be applied to distributions from an exponential family. From
(3.1) recall that general form of the pdf from an exponential family is:
so that
∂
∂θ
f (y; θ, φ) =
Then by (3.2) we find
48
by definition of the expected value. Thus
µ ≡ E(Y) = b′(θ). (3.4)
Next, we have
∂2
∂θ2
f (y; θ, φ) = −b
′′(θ)
a(φ)
f (y; θ, φ) +
(
y− b′(θ)
a(φ)
)2
f (y; θ, φ). (3.5)
Plugging this into (3.3) gives
Rearranging yields
var(Y) = b′′(θ)a(φ).
The function b′′(θ) is referred to as the variance function. The variance function
considered as a function of µ will be denoted as V(µ). To summarise
E(Y) ≡ µ = b′(θ)
var(Y) = b′′(θ)a(φ) ≡ V(µ)a(φ)
Warning! In general, the variance function is not the same as the variance of Y.
Example Suppose Y ∼ N(µ, σ2). Recall that the exponential family form for the pdf
of Y is defined as θ = µ, φ = σ2 with a(φ) = φ and b(θ) = θ2/2. Then E(Y) and
var(Y) are easily obtained:
49
Example Suppose Y ∼ Poisson(λ). Then for any λ > 0 and y ∈N
f (y;λ) = exp(y logλ− λ− log(y!)),
where we identify θ = logλ, a(φ) = 1 and
b(θ) = exp(θ) and c(y, φ) = − log(y!).
We can check the mean and variance:
Therefore, we are able to compute the mean and variance of any exponential
family distribution just by looking at the form of its probability density function and
differentiating b.
50
3.1.3 Link Functions
Recall that in the specification of GLMs, the function g describes the link between the
expected response µ ≡ E(Y ) and the linear predictor η = Xβ. We introduced the
link function g as
ηi = g(µi) for i = 1, . . . , n
as any monotonic differentiable function. Recall a function, g, is monotonically
increasing (or decreasing), if for all z1 and z2 with z1 ≤ z2, we have g(z1) ≤ g(z2)
(or g(z1) ≥ g(z2))
Why link functions are necessary
Example Suppose that Y1, . . . ,Yn are independent random variables such that
Yi ∼ Poisson(λi) for some λi > 0. Then µi ≡ E(Yi) = λi > 0. It is plausible that
the linear predictor ηi = ∑
p
j=1 xijβ j can take any value on R. Thus using the identity
link function, i.e. g(u) = u, is not recommended as ηi may be negative while λi is
always positive. Therefore, using another link function g is required.
Canonical Links
For exponential family distributions, there is a natural or canonical choice of link
function. These canonical links occur when
θ = η
where θ is the canonical parameter (see Definition 3.1.2). Recall that in the
GLM specification, the link function can be any monotonic differentiable function.
Therefore, using the canonical link is not necessary. However, as we shall see later,
using the canonical link leads to some desirable results (see later in section 3.2).
51
Example of GLMs
Example Suppose Y1, . . . ,Yn are independent random variables each with a Poisson
distribution and
E(Yi) = exp(β1 + β2xi)
Then this is a GLM as the Poisson distribution can be written in exponential family
form and the link function g(z) = log(z) is monotonic and differentiable with linear
predictor η = Xβ.
Example Suppose Y1, . . . ,Yn are independent N(µi, σ2) random variables where
µi = β1 + β2xi
Then this is a GLM as the Normal distribution can be written in exponential family
form and the link function g(z) = z (the identity) is monotonic and differentiable.
The generalization: from normal linear models to GLMs
Notice the two generalizations:
• The distribution of the response can be any member of the exponential family
— not just a normal distribution.
• The link function, connecting the mean of the response and the linear predictor
η = Xβ, may be any monotonic differentiable function — not only the identity
function.
These generalizations lead to more complicated estimation and inference procedures,
as we shall see.
52
3.2 Estimation
So far we have just introduced the class of models called the GLMs. Now we discuss
how to estimate the unknown parameter β in a GLM. Recall the estimation procedure
for normal linear models; we want to find β̂ by maximising the loglikelihood. The
loglikelihood for GLMs is
`(β; φ,y) =
n
∑
i=1
{
yiθi − b(θi)
a(φ)
+ c(yi, φ)
}
and depends on β through:
µi = b′(θi), g(µi) = ηi, ηi =
p
∑
j=1
xijβ j, for i = 1, . . . , n.
The approach used to find the estimator of β for GLMs is the same we used before;
namely to maximise the loglikelihood by solving
∂`
∂β
!
= 0.
However, for GLMs, this (generally) involves solving a nonlinear set of equations.
Therefore, we use an iterative method to obtain a numerical solution.
Before we present the maximum likelihood estimation procedure for GLMs, we
require a brief detour to introduce the numerical method and some maximum
likelihood theory.
3.2.1 Maximum Likelihood Theory
Suppose that Y = (Y1, . . . ,Yn)T has a pdf or pmf f (y;ψ) for y ∈ Rn and parameter
vector ψ ∈ Rp. Then likelihood function is a function of ψ for each fixed y given by
L(ψ;y) = f (y;ψ).
Similarly, the loglikelihood function is given by
`(ψ;y) = log L(ψ;y) = log f (y;ψ).
53
We then have the following:
Score Vector: U (ψ;y) = ∇`(ψ;y) where ∇ =
(
∂
∂ψ1
, . . . ,
∂
∂ψp
)
so that Uj(ψ;y) =
∂
∂ψj
`(ψ;y), j = 1, . . . , p
Observed Information Matrix: I = −∇∇T`(ψ;y)
so that Ijk(ψ;y) = − ∂
2
∂ψj∂ψk
`(ψ;y)
Fisher’s Information Matrix: J (ψ) = E(I(ψ;Y ))
If Y1, . . . ,Yn are iid with pdf or pmf f (yi;ψ), y ∈ R then
L(ψ; y1, . . . , yn) =
n
∏
i=1
L(ψ, yi) and `(ψ; y1, . . . , yn) =
n
∑
i=1
`(ψ; yi)
Example Suppose Y1, . . . ,Yn
iid∼ Exponential(1/µ) for some µ > 0. We have
f (yi; µ) =
1
µ
exp
(
−yi
µ
)
yi ≥ 0
`(µ; y1, . . . yn) =
U =
∂
∂µ
`(µ; y) =
I = − ∂
2
∂µ2
`(µ; y1, . . . , yn) =
J =
54
Example Suppose Y1, . . . ,Yn ∼ N(µi, σ2) independently with σ2 > 0 known and
some for µi ∈ R. We have
f (yi; µi) =
1√
2piσ2
exp
(
− 1
2σ2
(yi − µi)2
)
f (y1, . . . , y1;µ) =
n
∏
i=1
1√
2piσ2
exp
(
− 1
2σ2
(yi − µi)2
)
`(µ; y1, . . . , yn) =
Uj =
∂
∂µj
`(µ;y) =
Ijk =
Jjk =
Theorem 5. Let Y = (Y1, . . . ,Yn)T have a pdf or pmf f (y;ψ) for y ∈ Rn and
parameter vector ψ ∈ Rp. Then, under certain regularity conditions,
E(U ) = 0 and cov(U ) = E(UUT) = J .
Proof. For j = 1, . . . , p we have
E(Uj) =
It then follows that cov(U ) = E(UUT) as E(U ) = 0. Thus we only need to show that
E(UUT) = J = E(I). See problem sheet for remainder of proof
We now return to exponential families. We shall write xij = (X)ij; also recall that the
parameter of interest is β ∈ Rp.
55
Theorem 6. Suppose we have Y1, . . . ,Yn independent random variables from an
exponential family with means µ1, . . . , µn and variance functions V(µ1), . . . ,V(µn)
with common dispersion parameter φ. Then for j = 1, . . . , p and k = 1, . . . , p,
Uj =
n
∑
i=1
[
(yi − µi)
var(Yi)
xij
∂µi
∂ηi
]
and Jjk =
n
∑
i=1
[
xijxik
var(Yi)
(
∂µi
∂ηi
)2]
.
Proof. The likelihood function for Y1, . . . ,Yn is
L(β; y1, . . . , yn) =
where θ(µi) = θ(g−1(ηi)), ηi = ∑
p
j=1 xijβ j. Therefore, the loglikelihood is
`(β; y) =
For score function
Uj =
∂`(β; y)
∂β j
=
n
∑
i=1
∂`i
∂β j
for j = 1, . . . , p,
we decompose the partial derivative as
∂`i
∂β j
=
∂`i
∂θi
∂θi
∂µi
∂µi
∂β j
where θi ≡ θ(µi).
Now consider each partial derivative in turn. First,
∂`i
∂θi
=
Second,
∂θi
∂µi
=
Third,
∂µi
∂β j
=
Inserting these into the score Uj yields
Uj =
n
∑
i=1
[
(yi − µi)
var(Yi)
xij
(
∂µi
∂ηi
)]
, (3.6)
56
as required.
For the Fisher information matrix, J = E(UUT), we use (3.6) to get
Jjk =
where we used that the observations are independent.
These results can be succintly be written as
U = XTWc, (3.7)
where the components of c are given by
ci = (yi − µi) ∂ηi∂µi ,
and the Fisher information matrix can be represented as
J = XTWX, (3.8)
with W ∈ Rn×n being a diagonal matrix with entries
wii =
1
var(Yi)
(
∂µi
∂ηi
)2
.
Recall that to find the MLE, β̂, the procedure is to differentiate the loglikelihood
with respect to β and solve equal to zero:
∂`
∂β
!
= 0.
This is equivalent to solving Uj = 0 for j = 1, . . . , p. Therefore, the maximum
likelihood equations are
n
∑
i=1
[
(yi − µi)
var(Yi)
xij
(
∂µi
∂ηi
)]
= 0, for j = 1, . . . , p. (3.9)
In general, this is a nonlinear system of equations in β and a numerical solution is
required. We proceed to solve (3.9) using a variant of the NewtonRaphson algorithm.
57
Algorithm. NewtonRaphson: For f (x) : X ⊆ Rp → R;x 7→ f (x), the algorithm
below:
x(m+1) = x(m) − H−1g, where H = ∂
2 f (x)
∂xi∂xj
∣∣∣∣
x=x(m)
, g =
∂
∂x
f (x)x=x(m)
will converge to x? = argmax
x
f (x) (under regularity conditions).
Using the NewtonRapshon algorithm for `(β;y) yields:
β̂(m+1) = β̂(m) +
(
I (m)
)−1
U (m),
where U (m) is the score vector evaluated at β = β̂(m), and similarly for the observed
information matrix I (m). In maximum likelihood estimation it is common to estimate
I by E(I) = J i.e. Fishers information matrix∗. The result is the Fisher scoring
algorithm:
β̂(m+1) = β̂(m) +
(
J (m)
)−1
U (m). (3.10)
Continuing to find simpler representations for the iterations of the scoring algorithm:
Plugging in (3.7) and (3.8) into (3.10) gives
β̂(m+1) =
(3.11)
These iterated weighted least squares (IWLS) equations (3.11) may be viewed as a
weighted least squares equations using the linear model
z = Xβ+ c, where c = (y −µ)(∂η/∂µ) ∼ N(0,W−1). (3.12)
To see this, start by rewriting (3.12) as
z = Xβ+W−1/2.
Then multiplying on the left by W1/2
W1/2z︸ ︷︷ ︸
y˜
= W1/2X︸ ︷︷ ︸
X˜
β+ , ∼ N(0, In),
∗see problem sheet
58
which is of the form of a normal linear model. We can write the estimate down
directly
β̂ = (X˜TX˜)−1X˜T y˜
= (XTWX)−1XTWz,
which is similar to (3.11). This motivates the iterative weighted least squares (IWLS)
algorithm.
Algorithm 3.1 Iterative weighted least squares (IWLS) algorithm
1: Given a current estimate β̂, form the linear predictor η̂ and the fitted values µ̂.
2: Form the adjusted dependent variable
zi = η̂i + (yi − µ̂i)∂η∂µ
∣∣∣∣
µ=µ̂i
3: Form the estimated weights w˜ii by
w˜−1ii =
(
∂η
∂µ
)2
V(µ)
∣∣∣∣
µ=µ̂i
4: Regress zi on xi with weights w˜ii and obtain the new estimate β̂ .
5: Repeat steps 1 to 4 until convergence.
Note the relationship between the two weights: w˜ii = φwii ¶(see below)
Lastly, we have that
cov(β̂) = cov
{(
XTWX
)−1
XTWz
}
=
[(
XTWX
)−1
XTW
]
cov(z)︸ ︷︷ ︸
W−1
[(
XTWX
)−1
XTW
]T
=
(
XTWX
)−1
= φ(XTW˜X)−1
= J −1. (3.13)
Notice, that J involves the dispersion parameter φ— this needs to be known in order
to compute J .
We estimate cov(β̂) by “plugging” in the last W(m) from the IWLS algorithm to
get
cov(β̂) ≈ (XTW(m)X)−1
59
To be clear¶ why we are using two weights: Recall that we want to solve the maximum
likelihood equations :
Uj =
n
∑
i=1
[
(yi − µi)
var(Yi)
xij
(
∂µi
∂ηi
)]
= 0, for j = 1, . . . , p. (3.9 revisited)
Since we have that var(Yi) = a(φ)V(µi) = φV(µi), solving (3.9) is equivalent to
solving
n
∑
i=1
[
(yi − µi)
V(µi)
xij
(
∂µi
∂ηi
)]
= 0, for j = 1, . . . , p.
Therefore, within the IWLS algorithm (Algorithm 3.1) we use
w˜ii =
1
V(µ)
(
∂µ
∂η
)2∣∣∣∣
µ=µ̂i
rather than
wii =
1
var(Yi)
(
∂µ
∂η
)2∣∣∣∣
µ=µ̂i
=
1
φV(µ)
(
∂µ
∂η
)2∣∣∣∣
µ=µ̂i
60
Example calculations
Example For independent Y1, . . . ,Yn with Yi ∼ Binomial(ni,pii), for known ni,
f (yi;pii) =
(
ni
yi
)
pi
yi
i (1− pii)ni−yi ,
Then, working with the canonical link:
µi ≡ E(Yi) =
θi =
b(θi) =
b′(θi) =
b′′(θi) =
V(µi) =
ηi =
∂ηi
∂µi
=
zi = η̂i +
w−1ii =
61
Example For independent Y1, . . . ,Yn with Yi ∼ Normal(ζi, σ2) for known σ2 > 0,
f (yi;λi) =
1√
2piσ2
exp
{
− 1
2σ2
(yi − ζi)2
}
.
Then, working with the canonical link:
µi ≡ E(Yi) = ζi
θi =
a(φ) =
b(θi) =
b′(θi) =
b′′(θi) =
V(µi) =
ηi =
∂η
∂µ =
zi = η̂i +
w−1ii =
62
Example For independent Y1, . . . ,Yn where Yi ∼ Poisson(λi)
f (yi;λi) =
e−λiλyii
yi!
.
Then, working with the identity link:
µi ≡ E(Yi) =
θi =
a(φ) =
b(θi) =
b′(θi) =
b′′(θi) =
V(µi) =
ηi
identity link
=
∂ηi
∂µi
=
zi = η̂i +
w−1ii =
63
Example in R
poisson.R poissondata.RData
Consider the following artificial dataset
1 2 3 4 5 6 7 8 9
y 2 4 6 7 8 10 11 12 14
x 1 1 0 0 0 0 1 1 1
Table 3.1: Example Dataset
l
l
l
l
l
l
l
l
l
−1.0 −0.5 0.0 0.5 1.0
2
4
6
8
10
14
x
y
A possible model may be one using a Poisson distribution as the response
distribution. Lets model the relationship between Yi and xi as
µi ≡ E(Yi) = β1 + β2xi, for i = 1, . . . , 9,
and Yi ∼ Poisson(µi). Thus we have taken the link function as the identity; then
∂µi
∂ηi
= 1 and wii =
1
var(Yi)
=
1
β1 + β2xi
.
The design matrix is
X =
1 x1
1 x2
...
...
1 x9
=
1 −1
1 −1
...
...
1 1
64
Start by forming the response vector
> y < dat$y
We can then fit the model using the iterative weighted least squares algorithm
(Algorithm 3.1).
> beta < c(20,4) #initial guess
> for (i in 1:25){
+ eta < cbind(1,x)%*%beta #estimated linear predictor
+ mu < eta #estimated mean response
+ z < y #form the adjusted variate
+ w < 1/mu #weights
+ lmod < lm(z~x, weights=w) #regress z on x with weights w
+ beta < as.numeric(lmod$coeff) #new beta
+ print(beta) #print out the beta estimate every iteration
+ }
[1] 7.703704 4.666667
[1] 7.701896 4.682932
[1] 7.701886 4.683026
[1] 7.701886 4.683027
...
The algorithm seems to converge at β =(7.702, 4.683).
l
l
l
l
l
l
l
l
l
−1.0 −0.5 0.0 0.5 1.0
2
4
6
8
10
14
x
y
Using (3.13) we can work out the estimated standard error of these parameters:
65
> X < cbind(1,x)
> J < t(X)%*%diag(1/as.vector(mu))%*%X
> invJ < solve(J)
> sqrt(as.vector(diag(invJ)))
[1] 0.9020884 1.1317672
In R there is a function that automatically fits the model for us:
> myglm < glm(y~x,family=poisson(link="identity"))
> summary(myglm)
Call:
glm(formula = y ~ x, family = poisson(link = "identity"))
Deviance Residuals:
Min 1Q Median 3Q Max
0.6382 0.4012 0.1100 0.4495 0.7913
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 7.7019 0.9021 8.538 < 2e16 ***
x 4.6830 1.1318 4.138 3.51e05 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 16.4022 on 8 degrees of freedom
Residual deviance: 2.1658 on 7 degrees of freedom
AIC: 40.682
Number of Fisher Scoring iterations: 4
The estimates of the parameter and its standard deviation seem to agree, which
should be no surprise as the estimation procedures are exactly the same.
66
Notes
• The estimation procedure may fail (e.g. see later in Section 3.8.3) – in R the
maximum number of iterations is 25 by default.
• The convergence criterion used by the glm command in R will be discussed later
in Section 3.6.1.
• A difference between the results may be due to a different initial starting point.
• Notice that the dispersion parameter, φ, is recognised by R as 1.
67
3.3 Inference
As in the inference section for linear models, section 2.3, we discuss how to construct
confidence intervals and perform hypothesis tests. In particular, for hypothesis testing
we shall discuss how to compare two related models. For GLMs, we say that two
models are related if
1. the distribution of the response Y is the same; and
2. the same link functions is used.
The models differ in the number of parameters i.e. the dimensionality of β. To
compare models we require a measure of their goodness of fit. We shall present
goodness of fit statistics based on the loglikelihood function.
3.3.1 Sampling Distributions for GLMs
In this section, we discuss the sampling distributions† relevant to GLMs. We shall use
the following notion: under appropriate regularity conditions, which are satisfied for
generalized linear models, if S is a statistic of interest, then
S− E(S)√
var(S)
·∼· N(0, 1)
or equivalently
(S− E(S))2
var(S)
·∼· χ21,
where we shall use ·∼· to denote “approximately distributed as”.
If there is a vector of statistics of interest
S =
S1
...
Sp
with asymptotic expectation E(S) and asymptotic variancecovariance matrix Q, then
asymptotically
S − E(S) ·∼· N(0,Q) (3.14)
and further, asymptotically
(S − E(S))T Q−1 (S − E(S)) ·∼· χ2p (3.15)
provided Q is nonsingular so that Q−1 exists and is unique.
†The sampling distribution of a statistic is the distribution of that statistic, considered as a random
variable, when derived from a random sample
68
To be clear, these approximations follow from the central limit theorems:
Central Limit Theorem
Suppose A1, . . . , An are iid random variables with E(Ai) = µ and var(Ai) = σ2 (both
finite). Then
√
n
(
1
n ∑
n
i=1 Ai − µ√
σ2
)
d−→ N(0, 1) as n→ ∞
where d−→ means convergence in distribution.
Multivariate Central Limit Theorem
IfA1, . . . ,An are iid random vectors with E(Ai) = µ ∈ Rp and finite, positive definite,
symmetric covariance matrix Q then
√
n
(
1
n
n
∑
i=1
Ai −µ
)
d−→ N(0,Q) as n→ ∞
The asymptotic chisquared result (3.15) follows from the fact: If Z ∼ N(µ,Q) where
µ ∈ Rp and Q ∈ Rp×p is a positive definite, symmetric matrix, then
(Z −µ)TQ−1(Z −µ) ∼ χ2p.
Sampling distribution for the score
We can apply the ideas above to the score vector U . Recall from (3.6) that
Uj =
∂`
∂β j
=
n
∑
i=1
[
(yi − µi)
var(Yi)
xij
(
∂µi
∂ηi
)]
, j = 1, . . . , p.
Since E(U ) = 0 and cov(U ) = J we have, from (3.14) asymptotically
U ·∼· N(0,J ) (3.16)
and, from (3.15), asymptotically,
UTJ −1U ·∼· χ2p (3.17)
Example Suppose Y1, . . . ,Yn are independent and Yi ∼ N(µ, σ2) where µ is the
parameter of interest and σ2 > 0 is a known constant. The loglikelihood function
is
`(β;y) =
69
Then the score is
U =
and the Fisher’s information matrix is
J =
Then by rearranging (3.16), we get (asymptotically)
Y ∼ N(µ, σ2/n).
We can use this result to construct a confidence interval for µ. For example, a
95% confidence interval for µ is y ± 1.96σ/√n approximately due to asymptotic
normality‡.
Sampling distribution for MLEs
Let β̂ be the MLE for β in a GLM i.e.
β̂ := argmax
β
`(β)
and so U (β̂) = ∂`(β̂)∂β = 0.
Now consider the 1st order Taylor expansion of U (β) about the MLE β̂:
U (β) ≈
where we approximated U ′ by E(U ′) = −J . Consequently, we find
(β̂− β) =
{
J (β̂)
}−1
U (β),
assuming that J is invertible. If J is regarded as a constant then β̂ is unbiased for β.
It turns out that this is at least asymptotically true.
The variancecovariance matrix for β̂ is
E
[
(β̂− β)(β̂− β)T
]
=
‡Actually, this is exact and not approximate as the observations are Normal in this example. This
coincides with the results presented in Chapter 2
70
where we have used that J = J T and treated J as a constant.
Thus, by (3.14), we have
β̂ ·∼· N(β,J −1). (3.18)
and also, using (3.15), we have
(β̂− β)TJ (β̂)(β̂− β) ·∼· χ2p. (3.19)
This is called the Wald statistic.
We can use (3.18) to construct confidence intervals and conduct hypothesis tests
involving the components of β. For instance, from (3.18), we have
β̂1 − β1√
var(β̂1)
·∼· N(0, 1), where var(β̂1) = (XTWX)−111︸ ︷︷ ︸
the 1,1 entry of (XTWX)−1
which we can use to construct confidence intervals for β1. Further, under the null
hypothesis H0 : β1 = 0 we have
β̂1√
var(β̂1)
·∼· N(0, 1).
3.4 Prediction
For given covariates x?, we have§ η̂? = xT? β̂ with variance xT? (XTWX)−1x?.
Therefore, an approximate confidence interval can be constructed using the normal
distribution. To obtain an approximate confidence interval in terms of µ we can use
the inverse of the link function to transform the end points. As with the linear models,
we can use the predict function in R for GLMs. More precisely, an approximate 95%
confidence interval for the mean response of a prediction with covariates x? is(
g−1
(
η̂? − 1.96
√
xT? (XTWX)−1x?
)
, g−1
(
η̂? + 1.96
√
xT? (XTWX)−1x?
))
,
where g−1 is the inverse of the link function g.
§x? is a column vector containing covariates and typically an intercept. Also, note that η̂? is just a
number.
71
3.4.1 Measuring the Goodness of Fit
One measure of the goodness of fit of a GLM is to compare it with a more general
model with the maximum number of parameters that can be estimated. This model
is called the saturated model. It is a GLM with exactly the same response distribution
and link function as the model of interest.
For n observations, y1, . . . , yn, all with potentially different values for the linear
predictors ηi = ∑
p
j=1 xijβ j, a saturated model can be specified with n parameters. It
turns out that the fitted saturated model forces µi ≡ E(Yi) = yi for i = 1, . . . , n which
achieves the maximum attainable loglikelihood.
For n observations from a GLM with n parameters, the loglikelihood is
`(θ, φ;y) =
n
∑
i=1
(
yiθi − b(θi)
a(φ)
+ c(yi, φ)
)
.
Then, for i = 1, . . . , n,
∂`
∂θi
=
yi − b′(θi)
a(φ)
=
yi − µi
a(φ)
For this it is clear that loglikelihood is maximised if we force µi = yi for i = 1, . . . , n.
Now reconsider the loglikelihood in terms of the mean response µ:
`(µ, φ;y) =
n
∑
i=1
(
yiθ(µi)− b(θ(µi))
a(φ)
+ c(yi, φ)
)
.
Thus, the maximum achievable loglikelihood is `(y, φ;y).
For the model of interest, a GLM with typically less parameters than observations
(p < n), we can maximise the loglikelihood to get β̂, then η̂ = Xβ̂ and so
the estimated mean response µ̂ = g−1(η̂) — leading the maximised loglikelihood
`(µ̂, φ;y).
72
Deviance
A measure of the goodness of fit of a GLM is the deviance.
Definition. The deviance for a model with estimated mean response µ̂ is defined as
D = 2φ {`(y, φ;y)− `(µ̂, φ;y)} ,
and the scaled deviance is D∗ = D/φ.
To make things concrete: Suppose we have a GLM of interest and consider the
following extreme models with the same response distribution and link function:
Saturated Model: The GLM with number of parameters, p, equal to the number of
(distinct) observations. In this case, µi ≡ E(Yi) is equal to the observed response
yi, so there is no variation in the random component.
Null Model: The GLM with only one parameter (p = 1) representing a common
mean response µ for all ys.
In practice, the null model will be too simple and the saturated model is
uninformative as it just repeats the observed response. We aim for a model with a
likelihood close to the likelihood of the saturated model, but with fewer parameters.
73
Deviance — constrained, unconstrained likelihood?
Consider a saturated model where β ∈ Rn and (as usual) η = Xβ, with X ∈ Rn×n.
Assuming that X is invertible, the model imposes no constraints on the linear
predictor η — it can take any value on Rn. In turn, this means µ and also θ are
unconstrained. Thus, for the saturated model ; the maximised loglikelihood is
max
µ∈Rn s.t. µ=g−1(η) η∈Rn
`(µ;y) = `(y;y).
For models with p < n parameters, the maximised loglikelihood is
max
µ∈Rn s.t. µ=g−1(Xβ)
for some β∈Rp
`(µ;y) = `(µ̂;y)
For instance, suppose we have a GLM with n = 3, p = 1 with X = (1 1 1)T and
identity link function g(z) = z. Then
ηi = β ∈ R for i = 1, . . . , n,
and
µ =
g−1(β)g−1(β)
g−1(β)
= β
11
1
Thus µ is constrained to lie on a line in R3.
74
Why we do not use the saturated model.
We can make the fit as close as possible, in terms of minimising the deviance, by
including sufficiently many parameters. Below are some fits for a GLM based on the
7 data points represented as triangles. The first row corresponds to a saturated model
— note how the fit goes through every data point. The second row is another fit using
just 2 parameters. The second column is a repeat of the first, but with additional
data points — note how the saturated model would give poor predictions for the
additional data points. Thus simplicity, represented by parsimony of parameters, is a
desirable feature for models; we do not include parameters that are unnecessary.
−4 −2 0 2 4
0.
0
0.
4
0.
8
x
y
−4 −2 0 2 4
0.
0
0.
4
0.
8
x
y
−4 −2 0 2 4
0.
0
0.
4
0.
8
x
y
−4 −2 0 2 4
0.
0
0.
4
0.
8
x
y
Let us now switch back, considering the loglikelihood in terms of β — write the
scaled deviance as
D∗ = 2
{
`(β̂sat;y)− `(β̂;y)
}
,
where β̂sat is the MLE of βsat for the saturated model and β̂ is still the MLE of β in
the model of interest.
75
Example Consider the with Y1, . . . ,Yn independent where Yi ∼ N(µi, σ2) and
E(Yi) = µi =
p
∑
j=1
xijβ j.
The loglikelihood function is
`(µ;y) = −n
2
log(2piσ2)− 1
2σ2
(y −µ)T(y −µ).
Setting µ = y yields the maximum achievable loglikelihood:
`(y;y) =
Then for any other model with p < n (not a saturated model) we find the maximum
likelihood estimate β̂ = (XTX)−1XTy — which in turn gives the estimated mean
response µ̂ = g−1(η̂) = Xβ̂. So
`(β̂;y) =
Therefore the deviance is
D = 2φ
{
`(β̂sat;y)− `(β̂;y)
}
3.4.2 Pearson’s X2 statistic
Another important measure of the discrepancy of a GLM is Pearson’s X2 statistics.
X2 =
n
∑
i=1
(yi − µ̂i)2
V(µ̂i)
.
Pearson’s X2 statistic shares similar asymptotic distribution properties with the
deviance. However, in this course we shall focus on the deviance as a measure of
goodnessoffit as
• the deviance has a general advantage as a discrepancy measure in that it is
additive for nested sets of models (see next section).
• the deviance leads to better normalised residuals.
76
3.4.3 Hypothesis Testing
We now discuss how to compare GLMs by hypothesis testing. These tests will involve
the deviance and scaled deviance, therefore we first need their sampling distributions.
Model Comparison with known φ
First, consider the 2nd order Taylor expansion of the loglikelihood around the MLE
β̂:
`(β) ≈
where we approximated U ′ by E(U ′) = −J . Consequently, we find
2
{
`(β̂)− `(β)
}
= (β− β̂)TJ (β̂)(β− β̂) (3.20)
which is approximately χ2p distributed by (3.19). Rewriting the deviance and using
(3.20) gives
D∗ =
If the model with p parameters is correct, then
D∗ ·∼· χ2n−p.
Now consider comparing two nested models as follows: Consider the null
hypothesis
H0 : β = β0 =
β1
...
βq
,
corresponding to model M0 and a more general hypothesis, the alternative,
H1 : β = β1 =
β1
...
βp
,
77
corresponding to model M1 with q < p < n. We can test H0 against H1 using the
difference of the scaled deviances
D∗0 − D∗1 = 2
{
`(β̂sat;y)− `(β̂0;y)
}
− 2
{
`(β̂sat;y)− `(β̂1;y)
}
= 2
{
`(β̂1;y)− `(β̂0;y)
}
.
The statistic D∗0 − D∗1 is approximately χ2p−q distributed under the null hypothesis.
Model Comparison with unknown φ
Under the null hypothesis we have
D∗1 ·∼· χ2n−p and D∗0 − D∗1 ·∼· χ2p−q.
If we consider D∗1 and D
∗
0 − D∗1 as asymptotically independent, then
(D∗0 − D∗1)/(p− q)
D∗1/(n− p)
·∼· F(p−q),(n−p).
The advantage of this test statistic is that we can multiply top and bottom by φ to get
a test statistic based on the deviance:
(D0 − D1)/(p− q)
D1/(n− p)
·∼· F(p−q),(n−p).
The advantage of this hypothesis test for model comparison is that it does not depend
on φ.
Warning
There are certain assumptions made when proving that the deviance, D, is
approximately or asymptotically χ2n−p distributed which we should be wary of
• The observations are independent and distributed according to some member
of an exponential family.
• The approximation relies on the number of parameters in the model staying
fixed, while the sample size tends to infinity. But the saturated model has
as many parameters as number of data. Also, in the Binomial case the χ2n−p
approximation (or asymptotics) for the deviance are based upon the following
assumptions (McCullagh & Nelder, 1989, p. 118)
– the observations are truly distributed independently according to the
binomial distribution;
– The approximation is based on a limiting operation in which n is fixed and
ni → ∞ for each i (and in fact nipii(1− pii)→ ∞).
However, the χ2 approximation is sound when comparing two nested models, as the
deviance for the saturated model cancels out.
78
3.4.4 Estimating the Dispersion
Recall that the MLE for β does not depend on the dispersion φ. However, in cases
where the dispersion is unknown, it must be estimated. There are two commonly
used estimators:
First is based on the deviance:
φ̂D =
D
n− p ,
where D is the deviance of the model. This follows from the expected value of
D/φ = D∗ ·∼· χ2n−p.
Second is
φ̂P =
X2
n− p
where X2 is Pearson’s statistic — this is based on the approximation X2/φ ·∼· χ2n−p.
Note
If Z ∼ χ2d then E(Z) = d; that is the expected value of a χ2 distribution is equal to its
degrees of freedom.
We can then plug in an estimator for the dispersion parameter φ to estimate
cov(β̂):
cov(β̂) ≈ φ̂(XTW˜X)−1 (see page 59).
3.4.5 Akaike’s Information Criteria (AIC)
An alternative statistic to compare two models was suggested by Akaike: pick
whichever model minimises
AIC = −2`(β̂) + 2p,
for a given set of data. Notice, that it is similar to using the loglikelihood of
the data, but with a penalisation term for the number of parameters i.e. the more
parameters included, the higher the AIC. Therefore the AIC involves a tradeoff
between goodnessoffit of the model and its complexity (in terms of its number of
parameters).
79
3.5 Diagnostics
As with the normal linear models, we can use residuals to explore the fit of GLMs.
For GLMs we require extenstions of residual definitions to accommodate for all
nonNormal distributions.
3.5.1 Residuals
Pearson’s Residuals
Definition. For a single observation y, Pearson’s residual is defined as
rp =
y− µ√
V(µ)
.
Deviance Residuals
Suppose the deviance, D, is used as a measure of discrepancy of a GLM. Then each
observation contributes a quantity di to D so that “D = ∑ni=1 di”. Thus, it makes sense
to define a deviance based residual.
Definition. For a single observation, yi, the deviance residual is defined as
rD = sign(yi − µi)
√
di,
thus the deviance is D = ∑i r2D.
¶
Example For the Poisson distribution, we have
Pearson’s residual: rp =
Deviance residual: rD =
Similar with the residuals for linear models, we can standardise to account for
each observation’s leverage. To this end, we require the corresponding hat matrix for
GLMs:
P = W1/2X(XTWX)−1XTW1/2.
The standardised Pearson’s and deviance residuals are obtained by dividing by√
(1− Pii)φ̂.
¶sign(x) =
+1 if x ≥ 0−1 if x < 0
80
Note
These residuals are approximately normal in general. However, the deviance residuals
are the closest, for n large and ni (for Binomial) large. In practice, one should not
expect them to lie on a straight line in a QQ plot, but rather on a smooth curve —
departure from this curve may indicate an outlier.
3.5.2 Cook’s Distance
Similar to the normal linear model case, we can examine the Cook’s distance for the
observations.
Ci =
(
β̂(i) − β̂
)T
(XTWX)
(
β̂(i) − β̂
)
pφ̂
where β̂(i) is the estimator calculated without using the ith observation. Again we
look for large Ci close to 1.
81
3.6 Worked Examples
3.6.1 Worked Example 1
seedsdata.RData seeds.R
The data presented in Table 3.2 shows the number of Orobanche seeds germinated
for two genotypes and two treatments.
# Germinated y Total tested n Genotype x1 Treatment x2
1 10 39 0 0
2 23 62 0 0
3 23 81 0 0
4 26 51 0 0
5 17 39 0 0
6 5 6 0 1
7 53 74 0 1
8 55 72 0 1
9 32 51 0 1
10 46 79 0 1
11 10 13 0 1
12 8 16 1 0
13 10 30 1 0
14 8 28 1 0
15 23 45 1 0
16 0 4 1 0
17 3 12 1 1
18 22 41 1 1
19 15 30 1 1
20 32 51 1 1
21 3 7 1 1
Table 3.2: Orobanche seeds Dataset
We are interested in, yi/ni, the proportion, so we want the fit to remain between 0 and
1. Suppose Y1, . . . ,Yn are independent Binomial(ni,pii) random variables and take the
logit link function. Take the design matrix
X =
1 x11 x12 x11x12
...
...
...
...
1 x21,1 x21,2 x21,1x21,2
Using the binomial distribution we would have µi ≡ E(Yi) = nipii with pdf
n
∏
i=1
(
ni
yi
)
pi
yi
i (1− pii)ni−yi .
82
Now we fit the model using the IWLS algorithm (Algorithm 3.1)
> beta < c(0.5,0.5,0,0) #initial guess
> #inverse logit function
> inv.link < function(u)
+ n*(1/(1+exp(u)))
> #deviance function
> D < function(p){
+ a < y*log(y/p)
+ b < (ny)*log((ny)/(np))
+ a[y==0] < 0
+ 2*sum(a+b)
+ }
> oldD < D(inv.link(as.numeric(X%*%beta)))
> jj < 0
> while(jj==0){
+ eta < X%*%beta #estimated linear predictor
+ mu < inv.link(eta) #estimated mean response
+ detadmu < n/(mu*(nmu))
+ z < eta+ (ymu)*detadmu #adjusted dependent variable
+ w < mu*(nmu)/n #weights
+ lmod < lm(z~x, weights=w) #regress z on x with weights w
+ beta < as.vector(lmod$coeff) #new beta
+ newD < D(inv.link(X%*%beta))
+ control < abs(newDoldD)/(abs(newD)+0.1)
+ if(control<1e8)
+ jj < 1
+ oldD < newD
+ }
> beta #final estimate
[1] 0.5581717 0.1459269 1.3181819 0.7781037
> newD #last deviance calculated
[1] 33.27779
Notice that this time we have used the change in the deviance as the convergence
criterion; this is what R uses: If ∣∣Dnew − Dold∣∣
Dnew+ 0.1
83
is less than 1× 10−8 then the algorithm is deemed to have converged.
By (3.13), the standard errors for these estimates are
> J < t(X)%*%diag(as.vector(w))%*%X
> invJ < solve(J)
> beta.sd < sqrt(as.vector(diag(invJ)))
> beta.sd
[1] 0.1260213 0.2231657 0.1774677 0.3064330
The deviance residuals (3.5.1) are
> p < as.vector(inv.link(X%*%beta))
> a < y*log(y/p)
> b < (ny)*log((ny)/(np))
> a[y==0] < 0
> d < sign(ymu)*sqrt(2*(a+b))
> summary(d)
V1
Min. :2.01617
1st Qu.:1.24398
Median : 0.05995
Mean :0.08655
3rd Qu.: 0.84695
Max. : 2.12122
We can test individual parameters using (3.18). The corresponding pvalues are:
> z < beta/beta.sd
> z
[1] 4.429187 0.653895 7.427729 2.539229
> 2*(1pnorm(abs(z),lower.tail=TRUE))
[1] 9.458885e06 5.131795e01 1.105782e13 1.110970e02
84
The AIC (section 3.4.5) is
> 2*sum(dbinom(y,n,as.vector(mu/n),log=TRUE)) + 2*length(beta)
[1] 117.874
Of course, R can do this for us:
> dat2 < seeds
> y < cbind(dat2$r, dat2$n  dat2$r)
> my.bin.glm < glm(y ~ seed + extract + seed * extract,
+ family = binomial(link = "logit"), data = dat2)
> summary(my.bin.glm)
Call:
glm(formula = y ~ seed + extract + seed * extract, family = binomial(link = "logit"),
data = dat2)
Deviance Residuals:
Min 1Q Median 3Q Max
2.01617 1.24398 0.05995 0.84695 2.12123
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 0.5582 0.1260 4.429 9.46e06 ***
seed 0.1459 0.2232 0.654 0.5132
extract 1.3182 0.1775 7.428 1.10e13 ***
seed:extract 0.7781 0.3064 2.539 0.0111 *

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 98.719 on 20 degrees of freedom
Residual deviance: 33.278 on 17 degrees of freedom
AIC: 117.87
Number of Fisher Scoring iterations: 4
85
3.6.2 Worked Example 2
The data used in this example can be directly loaded in R using infert
Consider the following data relating to the study of abortions. Table 3.3 presents the
first 6 data only — in total there are 248 cases.
education age parity induced case spontaneous stratum pooled.stratum
1 05yrs 26 6 1 1 2 1 3
2 05yrs 42 1 1 1 0 2 1
3 05yrs 39 6 2 1 0 3 4
4 05yrs 34 4 2 1 0 4 2
5 611yrs 35 3 1 1 1 5 32
6 611yrs 36 4 2 1 1 6 36
Table 3.3: Infertility Dataset
Let us fit a binomial model to see if case, a binary observation, can be predicted using
the other measures as covariates. Note that .˜ means that we include all other data
columns singly (useful for large datasets).
> myglm0 < glm(case ~.,data=infert,family=binomial)
> summary(myglm0)
Call:
glm(formula = case ~ ., family = binomial, data = infert)
Deviance Residuals:
Min 1Q Median 3Q Max
1.7975 0.7836 0.4599 0.8556 2.8998
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 4.039297 2.135797 1.891 0.0586 .
education611yrs 1.320471 1.565614 0.843 0.3990
education12+ yrs 3.489701 2.965837 1.177 0.2393
age 0.078590 0.038060 2.065 0.0389 *
parity 0.451423 0.276877 1.630 0.1030
induced 1.435629 0.320870 4.474 7.67e06 ***
spontaneous 2.191282 0.329069 6.659 2.76e11 ***
stratum 0.002842 0.014621 0.194 0.8459
pooled.stratum 0.078768 0.043800 1.798 0.0721 .

86
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 316.17 on 247 degrees of freedom
Residual deviance: 254.53 on 239 degrees of freedom
AIC: 272.53
Number of Fisher Scoring iterations: 4
At first glance, the residual deviance looks reasonable for the degrees of freedom.
We can look at sequentially adding terms using (somewhat confusingly the anova
function again):
> anova(myglm0,test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: case
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 247 316.17
education 2 0.002 245 316.17 0.99886
age 1 0.006 244 316.16 0.94012
parity 1 0.026 243 316.14 0.87088
induced 1 0.056 242 316.08 0.81372
spontaneous 1 58.284 241 257.80 2.269e14 ***
stratum 1 0.003 240 257.79 0.95346
pooled.stratum 1 3.263 239 254.53 0.07085 .

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The anova function recognises that a GLM was fitted and produces an analysis of
deviance table. The test="Chisq" option reports the pvalues on the right.
87
Model comparisons and selection can be done automatically in R. First, consider
similar models where just one parameter is dropped. This is achieved using the drop1
function:
> drop1(myglm0,test="Chisq")
Single term deletions
Model:
case ~ education + age + parity + induced + spontaneous + stratum +
pooled.stratum
Df Deviance AIC LRT Pr(>Chi)
education 2 256.76 270.76 2.230 0.32791
age 1 258.85 274.85 4.315 0.03778 *
parity 1 257.26 273.26 2.728 0.09861 .
induced 1 277.42 293.42 22.890 1.716e06 ***
spontaneous 1 316.03 332.03 61.504 4.419e15 ***
stratum 1 254.57 270.57 0.038 0.84591
pooled.stratum 1 257.79 273.79 3.263 0.07085 .

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
This suggests dropping the education and stratum parameters (and perhaps more)
from the model.
The drop1 function has a brother function — the add1 function, which adds individual
terms to the given model:
> add1(myglm0,~.^2,test="Chisq")
Single term additions
Model:
case ~ education + age + parity + induced + spontaneous + stratum +
pooled.stratum
Df Deviance AIC LRT Pr(>Chi)
education:age 2 254.51 276.51 0.0198 0.99013
88
education:parity 2 254.33 276.33 0.1977 0.90589
education:induced 2 247.58 269.58 6.9513 0.03094 *
education:spontaneous 2 252.28 274.28 2.2478 0.32500
education:stratum 2 254.41 276.41 0.1219 0.94089
education:pooled.stratum 2 254.01 276.01 0.5227 0.77003
age:parity 1 254.19 274.19 0.3444 0.55728
age:induced 1 254.20 274.20 0.3282 0.56670
age:spontaneous 1 251.39 271.39 3.1409 0.07635 .
age:stratum 1 254.36 274.36 0.1723 0.67805
age:pooled.stratum 1 254.51 274.51 0.0205 0.88626
parity:induced 1 252.90 272.90 1.6332 0.20126
parity:spontaneous 1 252.57 272.57 1.9616 0.16134
parity:stratum 1 254.48 274.48 0.0521 0.81942
parity:pooled.stratum 1 254.42 274.42 0.1089 0.74145
induced:spontaneous 1 254.50 274.50 0.0323 0.85736
induced:stratum 1 248.77 268.77 5.7651 0.01635 *
induced:pooled.stratum 1 250.36 270.36 4.1678 0.04120 *
spontaneous:stratum 1 251.27 271.27 3.2615 0.07093 .
spontaneous:pooled.stratum 1 253.63 273.63 0.9052 0.34140
stratum:pooled.stratum 1 254.42 274.42 0.1120 0.73790

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The ˜. ˆ2 informs R to consider all possible twofactor interactions. The
result above suggests including an education:induced, induced:stratum and
induced:pooled.stratum terms.
The drop1 and add1 perform many individual χ2 tests between models — it does
not select any particular model.
The step function will automatically search through the models for us.
> stepsearch < step(myglm0,~.^2,test="Chisq")
... this is produce a lot of output! To summarise the step search, we can use the anova
component:
> stepsearch$anova
89
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 239 254.5310 272.5310
2 + induced:stratum 1 5.765135 238 248.7658 268.7658
3  education 2 1.749457 240 250.5153 266.5153
4 + age:spontaneous 1 4.072592 239 246.4427 264.4427
5  pooled.stratum 1 1.888343 240 248.3310 264.3310
Reading from top to bottom: a induced:stratum parameter was added, then the
education parameter removed, etc. The criteria to add or remove parameters is based
on the AIC. The step search continues until the AIC cannot be reduced any further.
Finally, a summary oxf the final chosen model can be called (no need to fit again):
> summary(stepsearch)
Call:
glm(formula = case ~ age + parity + induced + spontaneous + stratum +
induced:stratum + age:spontaneous, family = binomial, data = infert)
Deviance Residuals:
Min 1Q Median 3Q Max
1.8253 0.7699 0.5257 0.8536 2.6835
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 1.086896 1.463758 0.743 0.45776
age 0.000298 0.041023 0.007 0.99420
parity 0.880102 0.193553 4.547 5.44e06 ***
induced 2.278841 0.487873 4.671 3.00e06 ***
spontaneous 0.503841 1.395070 0.361 0.71798
stratum 0.003730 0.008352 0.447 0.65519
induced:stratum 0.025723 0.009161 2.808 0.00498 **
age:spontaneous 0.080038 0.044768 1.788 0.07380 .

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 316.17 on 247 degrees of freedom
Residual deviance: 248.33 on 240 degrees of freedom
AIC: 264.33
90
Number of Fisher Scoring iterations: 4
Finally, let us inspect the diagnostic plots.
−3 −1 0 1 2
−
2
−
1
0
1
2
3
Predicted values
R
es
id
ua
ls l
l
l
l
l
l
l
l
l
ll
ll
l
l
lll
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
lll
l
l
l
ll
l
l
ll
l
l
l
l
l
l
ll
l
l
l
ll
l
l
l
ll l
l
l
l
Residuals vs Fitted
39
810
l
l
l
l
l
l
l
l
l
l
l
lll
l
ll
ll
l
l
l
ll
l
l
l
l
l
l
l lll
l
l
l
l l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
ll
l lll
l
l
l
−3 −1 0 1 2 3−
2
−
1
0
1
2
3
Theoretical Quantiles
St
d.
d
ev
ia
nc
e
re
si
d.
Normal Q−Q
39
810
−3 −1 0 1 2
0.
0
0.
5
1.
0
1.
5
Predicted values
St
d.
de
vi
a
n
ce
re
si
d.
l
l
l
l
l
l
l
l
l
ll
ll
l
l
ll
l
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
lll
l
l
l
l
Scale−Location
39
810
0.00 0.04 0.08 0.12
−
2
0
2
4
6
Leverage
St
d.
P
e
a
rs
o
n
r
e
si
d.
l
l
l
l
l
l
l
l
l
l
l
l l
l l
l
l lll
ll
l
l
l
l
l
l
ll
l
l
l
l l
l
l
l
l
l
ll
l
l
l ll
l
l
l
l
ll
ll
l
l
l
l
l
l l
ll ll
l
l
l
l
ll
l lll l ll
l ll
l
lll l
l
l
l
l
l l
ll
l
l llll
l l
lll
l l l llll
l l
lll
l
l
l l
l l
l
ll
l
l
Cook's distance
0.5
1
Residuals vs Leverage
813
74
The residual plots are less informative than they were for linear models. The response
contains less information than a continuous one. Nevertheless, the issue of outliers
and influential observations are just as prevalent in logistic regression as for linear
models — so look at the Cook’s distance plot and investigate any highly influential
observations.‖
Finally, for convenience, the residual values as follows:
> residuals(stepsearch,type="pearson")
> residuals(stepsearch,type="deviance")
and the Cook’s distances:
‖but this is not the end of the investigation...
91
> cooks.distance(stepsearch)
and the standardised residuals:
> rstandard(stepsearch,type="pearson")
> rstandard(stepsearch,type="deviance")
The commands above can be executed upon any glm model fitted in R.
We can proceed to check for large (in magnitude) residuals and observations with
high leverages:
> plot(abs(rstandard(stepsearch,type="pearson")))
> plot(influence(stepsearch)$hat) #extract hat matrix values
> l.threshold < 2*8/248 #rule of thumb
> l.threshold
[1] 0.06451613
> abline(h=l.threshold) #adds a horizontal line
l
l
l
l
l
l
l
l
l
l
l
l
l
l
lll
l
l
ll
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
ll
l
l
l
ll
l
l
l
lllll
l
ll
ll
l
ll
l
l
l
ll
l
ll
l
l
ll
l
lllll
l
ll
l
ll
lll
l
l
l
l
l
l
l
ll
l
l
l
l
lll
ll
ll
ll
l
lll
l
ll
ll
l
ll
l
lll
l
ll
llll
l
lll
l
lll
l
l
l
l
ll
l
l
0 50 100 200
0
1
2
3
4
5
6
Index
St
d.
Pe
a
rs
o
n
's
r
e
si
du
al

l
l
l
l
l
l
l
l
l
l
ll
l
l
l
ll
l
l
l
l
l
ll
l
ll
l
l
l
l
l
l
l
l
l
l
lll
l
l
l
l
l
l
ll
l
l
l
l
ll
l
l
l
l
l
l
ll
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
llll
l
l
ll
l
l
ll
l
lll
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
ll
l
l
l
l
l
ll
l
l
l
l
lll
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
lll
l
l
ll
l
l
l
ll
l
ll
ll
l
ll
l
l
l
ll
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
0 50 100 200
0.
02
0.
06
0.
10
0.
14
Index
Le
ve
ra
ge
We can mark any suspicious points and investigate if removing them significantly
influences the fit.
92
A handy way of looking for these: First find the largest, say 5, standardised
residuals:
> order(abs(rstandard(stepsearch,type="pearson")),decreasing=TRUE)[1:5]
[1] 39 8 10 48 81
These are the corresponding index numbers.
Next, find the observations 5 with the largest leverages:
> order(influence(stepsearch)$hat,decreasing=TRUE)[1:5]
[1] 74 239 1 174 51
If we see any reoccuring indices, we may want to investigate further, as they will have
both high leverage and high magnitude residual.
93
3.7 Binomial Regression
Suppose the independent response variable Y1, . . . ,Yn where Yi ∼ Binomial(ni,pii), so
that
P(Yi = yi) =
(
ni
yi
)
pi
yi
i (1− pii)ni−yi .
Suppose that for the ith response we also observe covariates xi,1, xi,2 . . . . Following
the linear model approach, we construct a linear predictor:
ηi =
p
∑
j=1
xijβ j.
We shall discuss the three common choices of link functions are used for binomial
regression:
1) logit: η = log
(
pi
1− pi
)
2) probit: η = Φ−1(pi), where Φ−1 is the inverse cdf of a standard normal
distribution.
3) complementary log log: η = log(− log(1− pi))
Note
• A logistic binomial regression models is a GLM with binomial response and
logit link function and a binary logistic binomial regression model is one with
ni = 1.
• The link above is defined through pi and not µ ≡ npi; this is annoying, but
sometimes used in practice. You can use either pi or µ to get the same results.
94
Worked Example
challenger.R challengerdata.RData
In January 1989, the space shuttle Challenger exploded shortly after launch. An
investigation was conducted into the cause of the crash, paying particular attention
to the Oring seals in the rocket boosters. Could the failure of the Orings have
been predicted? Table 3.4 contains information about the damage of Orings from
22 previous shuttle missions.
Temperature No. Failure Orings Total No. Orings
66 0 6
70 1 6
69 0 6
68 0 6
67 0 6
72 0 6
73 0 6
70 0 6
57 1 6
63 1 6
70 1 6
78 0 6
67 0 6
53 2 6
67 0 6
75 0 6
70 0 6
81 0 6
76 0 6
79 0 6
76 0 6
58 1 6
Table 3.4: Challenger Dataset
Let us proceed to fit a logistic regression model in R:
> mat < cbind(dat$Fail,6dat$Fail)
> logitmod < glm(mat~Temp,family=binomial(link=logit),data=dat)
> probitmod < glm(mat~Temp,family=binomial(link=probit),data=dat)
> cloglogmod < glm(mat~Temp,family=binomial(link=cloglog),data=dat)
The estimated parameter values using the different link functions look very different:
95
logit probit cloglog
(Intercept) 8.6615667 4.1452794 7.9384946
Temp 0.1768048 0.0875188 0.1665137
However, the actual fit given by each model are similar, especially around the
observations:
> plot(x=dat$Temp,y=dat$Fail/6)
> x < seq(25,85,by=0.5) # dummy x values
> logiteta < 8.66156670.1768048*x
> probiteta < 4.14527940.0875188*x
> cloglogeta < 7.93849460.1665137*x
> lines(x,1/(1+exp(logiteta)),lwd=2)
> lines(x,pnorm(probiteta),lwd=2,lty=2)
> lines(x,inv.clog(cloglogeta),lwd=2,lty=3)
l
l
lll lll
l l
l
l
l ll l
l
30 40 50 60 70 80
0.
0
0.
4
0.
8
Temperature
Pr
op
.
o
f D
am
ag
e
We can predict the response given by each model at 31F:
> xstar < 31
> 1/(1+exp((8.66156670.1768048*xstar)))
[1] 0.9600983
> pnorm(4.14527940.0875188*xstar)
[1] 0.9239562
96
> inv.clog(7.93849460.1665137*xstar)
[1] 0.9999999
The models suggest that there is a very high probability of damage at this
temperature.
Let us check the deviance for the logistic model using the χ2 test:
> pchisq(deviance(logitmod),df.residual(logitmod),lower=FALSE)
[1] 0.9776587
This pvalue based on the χ2 test of the deviance in well in excess of e.g. the 5% level.
Thus we may conclude by saying model fits the data well. We may not say that the
model is correct.
To construct a confidence interval for the prediction for the model using the logit
link at 31F. Then
> ustar < c(1,31)
> etastar < ustar%*%logitmod$coefficients
> etastar
[,1]
[1,] 3.180617
> 1/(1+exp(etastar))
[,1]
[1,] 0.9600983
> J < vcov(logitmod)
> se < sqrt(t(ustar)%*%J%*%ustar)
Then for an approximate 95% confidence interval (see section 3.4) on the probability
scale:
97
> 1/(1+exp(c(etastar  1.96*se,etastar + 1.96*se)))
[1] 0.3957676 0.9988700
We can obtain the predictions directly using the predict command:
> predict(logitmod,newdata=list(Temp=31))
1
3.180617
> predict(logitmod,newdata=list(Temp=31),type="response")
1
0.9600983
The predict function can also be used to extract the fitted mean response values, µ̂:
> predict(logitmod,type="response")
The same procedure above can be applied for the other link functions.
98
Odds
Odds are sometimes a better scale than probability to represent chance.
• A 41 against bet would pay £4 for every £1.
• A 41 on bet would pay £1 for every £4.
Let p be the probability and o be the odds, where we represent e.g. 41 against as 1/4
and 4  1 on as 4. In general, we have the relationship
o =
p
1− p and p =
o
1+ o
.
Odds Ratio
Suppose we have two groups where the probability of an event being in the first group
is p1 and the probability for the second group is p2. Then the odds ratio is:
p1/(1− p1)
p2/(1− p2) .
An odds ratio
• equal to 1 indicates the event is equally likely to occur in both groups.
• greater than 1 indicates the event is more likely to occur in the first group.
• less than 1 indicates the event is less likely to occur in the first group.
Odds Interpretation
Suppose we have a logistic Binomial regression model using two covariates x1 and x2.
The linear predictor is given by
η = log
(
p
1− p
)
= log(o) = β0 + β1x1 + β2x2.
We can interpret β1 as follows: a unit increase in x1 keeping x2 held fixed increases
the logodds of success by β1, or equivalently increases the odds of success by a factor
of exp(β1).
Note
This interpretation follows from using the logit link — no simple interpretation exists
for other links.
99
3.7.1 Worked Example
breastfeeding.R breastfeedingdata.RData
Consider the data in Table 3.5 containing information from a study on infant
respiratory disease by type of breast feeding and gender. The ratios presented are
of the form “# with disease / total #”.
Bottle only Some breast with supplement Breast only
Boys 77/458 19/147 47/494
Girls 48/384 16/127 31/464
Table 3.5: Breast Feeding data
Can gender and feeding type be used to measure whether or not infants contract a
respiratory disease?
Let us proceed to fit a logistic regression model in R:
> mat < cbind(babyfood$disease, babyfood$nondisease)
> lrmod < glm(mat~ sex + food, family=binomial, data=babyfood)
> summary(lrmod)
Call:
glm(formula = mat ~ sex + food, family = binomial, data = babyfood)
Deviance Residuals:
1 2 3 4 5 6
0.1096 0.5052 0.1922 0.1342 0.5896 0.2284
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 1.6127 0.1124 14.347 < 2e16 ***
sexGirl 0.3126 0.1410 2.216 0.0267 *
foodBreast 0.6693 0.1530 4.374 1.22e05 ***
foodSuppl 0.1725 0.2056 0.839 0.4013

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.37529 on 5 degrees of freedom
Residual deviance: 0.72192 on 2 degrees of freedom
AIC: 40.24
Number of Fisher Scoring iterations: 4
100
We expect the χ2 approximation for the deviance to be accurate for this data (why?).
Consider the interpretation of the coefficient for breast feeding. We have
> exp(0.6693)
[1] 0.5120669
Following the logodds interpretation: breast feeding reduces the odd of respiratory
disease by 51% of that for bottle feeding. We could compute the confidence interval
on the odds scale. However, to get better coverage properties we compute the interval
on the logodds scale and then transforming the endpoints as follows:
> z < qnorm(0.025,lower.tail=FALSE) # critical value
> z
[1] 1.959964
> exp(c(0.669z*0.153,0.669+z*0.153))
[1] 0.3795099 0.6913386
101
3.7.2 Prospective and Retrospective Sampling
Prospective Sampling : In prospective sampling, the covariates are fixed and then
the outcome is observed.
Retrospective Sampling : In retrospective sampling, the outcome is fixed and then
the covariates are observed.
Consider a study in which the contraction of a disease is of interest. Let
• v0 be the probability that an individual is included in the study if they do not
have the disease,
• v1 be the probability that an individual is included in the study if they do have
the disease.
For a prospective study, v0 = v1 as we have no knowledge of the outcome. For a
retrospective study, typically v1 is much greater than v0.
For a given covariate x, let
• p∗(x) denote the conditional probability that an individual has the disease given
inclusion in the study.
• p(x) denote the unconditional probability that an individual has the disease, as
we would obtain from a prospective study.
Then by Bayes Theorem:
p∗(x) = v1p(x)
v1p(x) +v0(1− p(x)) .
Rearranging yields
logit(p∗(x)) = log
(
v1
v0
)
+ logit(p(x)).
So the only difference between the retrospective and prospective study is the intercept
term log(v1/v0).
• Generally, v1/v0 is unknown — meaning we cannot estimate β0.
• However, knowledge of the other β can be used to assess the relative error of
the covariates.
102
Now return to the respiratory disease example:
Prospective Sampling : In the infant respiratory example, we would select a sample
of newborns whose parents had chosen a particular method of feeding and then
monitor them for their first year.
Retrospective Sampling : In the infant respiratory example, typically, we would find
infants visiting a doctor with a respiratory disease in the first year and then
record their gender and method of feeding. We would also obtain a sample of
respiratory diseasefree infants. How these samples are collected is important
— we require that the probability of inclusion in the study is independent of the
predictor.
Suppose that the respiratory disease example had been a prospective study. Then,
focussing on the boys only,
• Given the infant is breast fed, the logodds of having a respiratory disease are
log(47/447) = −2.25.
• Given the infant is bottle fed, the logodds of having a respiratory disease are
log(77/381) = −1.60.
The difference between these two logodds, ∆ = −1.60 − (−2.25) = 0.65,
represents the increased risk of respiratory disease incurred by bottle feeding relative
to breast feeding. This is the logodds ratio.
Now suppose that the respiratory disease example had been a retrospective study —
we could compute the logodds of feeding type given respiratory disease status and
then find the difference. The logodds ratio would be exactly the same:
∆ = log(77/47)− log(381/447) = 0.65
This shows that a retrospective design is as effective as a prospective design for
estimating ∆.∗∗
Notes
• Retrospective designs are cheaper, faster and more efficient, so it is convenient
that the same results may be obtained from a prospective study.
• Retrospective studies are typically less reliable than prospective studies  relies
on historical records that may be incomplete or inaccurate.
∗∗See Supplementary handout for further details.
103
Summary for using the logit link
Canonical choice for binomial response GLMs is the logit link. Using the logit link
• leads to simplier mathematics,
• easy interpretation using odds;
• allows for easy analysis of retrospective data.
104
3.8 Poisson Regression
So far we have examined the following cases for the response:
• Response is real→ normal linear model.
• Response is a probability→ Binomial regression.
• Response is a bounded integer→ Binomial regression.
What if the response is an unbounded integer?
One possible approach is to use a Poisson distribution as the response. Let the
response Y follow a Poisson distribution with mean λ > 0:
P(Y = y) =
e−λλy
y!
.
Examples where the response can be modelled as Poisson are:
• counting the occurrence of a rare event (i.e. a small probability of success).
• counting the number of events in a time interval.
Recall that the Poisson distribution is a member of the exponential family as its pmf
can be written as
exp {y log(λ)− λ− log(y!)} ,
where θ = log(λ), b(θ) = λ = exp(θ), a(φ) = φ = 1 and c(y, φ) = − log(yi!).
Therefore the canonical link is
η = θ = log(λ)
Using the canonical link, the likelihood, for independent Y1, . . . ,Yn where Yi ∼
Poisson(λi), is:
L(β;y) =
and the loglikelihood is:
`(β;y) =
105
The discrepancy of the model can be measured using the deviance, which is
D = 2
n
∑
i=1
(
yi log(yi/λ̂i)− (yi − λ̂i)
)
We can use the deviance to
• judge the goodness of fit of the model using the χ2n−p approximation.
• compare two nested models.
Alternatively, we can use Pearson’s X2 statistic as a goodness of fit measure, which
for a Poisson response distribution, takes the form:
X2 =
n
∑
i=1
(
yi − λ̂i
)2
λ̂i
Note
A key property of the Poisson distribution is that its mean is equal to its variance.
3.8.1 Worked Example
In R fit the linear model using the following command:
> mylm1 < lm(interlocks~assets+sector+nation,data=Ornstein)
The normal linear model fitted has response vector and design matrix:
Y = , X =
The Ornstein dataset contains 248 rows of data. The observations are the 248 largest
Canadian firms specifying their assets in millions of dollars, the sector the firm
belongs to, the nation that controls the firm and the number of interlocking director
and executive positions shared with other firms.
Can we use a company’s assets, sector and nation values to predict the number
of interlocking positions?
106
Let us take a look at the summary of linear model fitted in R:
> summary(mylm1)
Call:
lm(formula = interlocks ~ assets + sector + nation, data = Ornstein)
Residuals:
Min 1Q Median 3Q Max
25.001 6.602 1.629 4.780 40.728
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 1.027e+01 1.561e+00 6.575 3.14e10 ***
assets 8.096e04 6.119e05 13.231 < 2e16 ***
sectorBNK 1.781e+01 5.906e+00 3.016 0.00284 **
sectorCON 4.709e+00 4.728e+00 0.996 0.32034
sectorFIN 5.153e+00 2.646e+00 1.948 0.05266 .
sectorHLD 8.777e01 4.004e+00 0.219 0.82669
sectorMAN 1.149e+00 2.065e+00 0.556 0.57849
sectorMER 1.491e+00 2.636e+00 0.566 0.57206
sectorMIN 4.880e+00 2.067e+00 2.361 0.01905 *
sectorTRN 6.171e+00 2.760e+00 2.236 0.02629 *
sectorWOD 8.228e+00 2.679e+00 3.072 0.00238 **
nationOTH 1.241e+00 2.695e+00 0.461 0.64555
nationUK 5.775e+00 2.674e+00 2.159 0.03184 *
nationUS 8.618e+00 1.496e+00 5.760 2.64e08 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.827 on 234 degrees of freedom
Multiple Rsquared: 0.6463, Adjusted Rsquared: 0.6267
Fstatistic: 32.89 on 13 and 234 DF, pvalue: < 2.2e16
What can we say about the model fit from this summary?
107
Lets take a look at a residual plot:
0 20 40 60 80 100
−
20
0
20
40
Fitted values
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
ll
l
l
l
l
l l
l
l
l
l
l l
ll
l l
l
ll l
l
ll
l
l
l l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
lll
ll
ll
ll
Residuals vs Fitted
9
6386
This plot shows signs of heteroscedasity. Perhaps using a GLM with a Poisson
response distribution could explain the data better. We now proceed to fit the model
with independent Y1, . . . ,Yn with Yi ∼ Poisson(λi) and
E(Yi) = exp(β1 + β2assetsi + β3sectori + β4nationi), i = 1, . . . , n.
Therefore, the link function is log link (canonical link).
108
> myglm < glm(interlocks~assets+sector+nation,data=Ornstein,family=poisson)
> summary(myglm)
Call:
glm(formula = interlocks ~ assets + sector + nation, family = poisson,
data = Ornstein)
Deviance Residuals:
Min 1Q Median 3Q Max
5.9908 2.4767 0.8582 1.3472 7.3610
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 2.325e+00 5.193e02 44.762 < 2e16 ***
assets 2.085e05 1.202e06 17.340 < 2e16 ***
sectorBNK 4.092e01 1.560e01 2.623 0.00872 **
sectorCON 6.196e01 2.120e01 2.923 0.00347 **
sectorFIN 6.770e01 6.879e02 9.841 < 2e16 ***
sectorHLD 2.085e01 1.189e01 1.754 0.07948 .
sectorMAN 5.260e02 7.553e02 0.696 0.48621
sectorMER 1.777e01 8.654e02 2.053 0.04006 *
sectorMIN 6.211e01 6.690e02 9.283 < 2e16 ***
sectorTRN 6.778e01 7.483e02 9.059 < 2e16 ***
sectorWOD 7.116e01 7.532e02 9.447 < 2e16 ***
nationOTH 1.632e01 7.362e02 2.217 0.02663 *
nationUK 5.771e01 8.903e02 6.482 9.05e11 ***
nationUS 8.259e01 4.897e02 16.867 < 2e16 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3737.0 on 247 degrees of freedom
Residual deviance: 1887.4 on 234 degrees of freedom
AIC: 2813.4
Number of Fisher Scoring iterations: 5
Based on the (residual) deviance, we can perform a χ2n−p significance test:
> pchisq(deviance(myglm), df.residual(myglm), lower=FALSE)
109
[1] 5.793237e256
This is an extremely small pvalue indicating an illfitting model if the Poisson
response distribution is correct. Why is this a poor fit? Let us look at the diagnostic
plots.
1 2 3 4 5
−
5
0
5
Predicted values
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
ll
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l l
l
l l
l
l
l
l
l
l
l
l l
l
l
l
l
l ll
l l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
ll
l l
l
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l l
l
ll
l
ll
Residuals vs Fitted
86
9
167
This residual plot looks better than that for the linear model. However, the residuals
are quite large — larger than the Poisson distribution suggest.
The Poisson distribution is restrictive in the sense that it has only one parameter,
forcing the mean to equal the variance of the observations, which is not very flexible
for fitting purposes. This problem can be alleviated by estimating φ, which then leads
to better representative standard errors. To this end, we compute Pearson’s dispersion
estimate φ̂p:
> phihat < sum(residuals(myglm,type="pearson")^2)/df.residual(myglm)
> phihat
[1] 7.943697
We can then “plugin” this estimate into the model:
> summary(myglm, dispersion=phihat)
Call:
glm(formula = interlocks ~ assets + sector + nation, family = poisson,
data = Ornstein)
110
Deviance Residuals:
Min 1Q Median 3Q Max
5.9908 2.4767 0.8582 1.3472 7.3610
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 2.325e+00 1.464e01 15.882 < 2e16 ***
assets 2.085e05 3.389e06 6.152 7.64e10 ***
sectorBNK 4.092e01 4.397e01 0.931 0.352038
sectorCON 6.196e01 5.974e01 1.037 0.299703
sectorFIN 6.770e01 1.939e01 3.492 0.000480 ***
sectorHLD 2.085e01 3.350e01 0.622 0.533800
sectorMAN 5.260e02 2.129e01 0.247 0.804857
sectorMER 1.777e01 2.439e01 0.728 0.466323
sectorMIN 6.211e01 1.886e01 3.294 0.000989 ***
sectorTRN 6.778e01 2.109e01 3.214 0.001309 **
sectorWOD 7.116e01 2.123e01 3.352 0.000803 ***
nationOTH 1.632e01 2.075e01 0.787 0.431534
nationUK 5.771e01 2.509e01 2.300 0.021456 *
nationUS 8.259e01 1.380e01 5.984 2.17e09 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 7.943697)
Null deviance: 3737.0 on 247 degrees of freedom
Residual deviance: 1887.4 on 234 degrees of freedom
AIC: 2813.4
Number of Fisher Scoring iterations: 5
Note that the estimation of β is independent of the dispersion parameter, φ, therefore
modifying φ does not change the estimated coefficients. Further, notice, in this
example, that now some of the coefficients have become nonsignificant.
3.8.2 Overdispersion
The term overdispersion means that the observed variance of the response is larger
than the variation implied by the distribution used to fit the model. Overdispersion
can be caused by several different problems — we state some below:
111
• Observations for different individuals with the same covariates do not have
exactly the same distribution; that is, there are unaccounted for individual
differences not included in the model.
• Observations may be correlated or clustered, while the specified variance
function wrongly assumes uncorrelated data
One approach to mitigate the problem of overdispersion is to estimate the dispersion
parameter, φ rather than assume φ = 1 for the binomial and Poisson distributions.
The procedure is to “plugin” the estimated dispersion parameter φ̂ into the analysis,
as done in the worked example above. As mentioned above, estimating the dispersion
parameter has no effect on the estimate of β but it inflates all their standard errors.
112
3.8.3 Estimation Problems
It can be the case that the glm function in R fails to converge. Problems may arise
due to problems with the Fisher scoring method or a “bad” initial starting point,
however sometimes it is a problem with the data themselves, which is exhibited in
the following example.
The following data set contains the values of androgen and estrogen (types of
hormones) from 26 healthy males with their sexual orientation.
androgen estrogen orientation
1 3.9 1.8 s
2 4.0 2.3 s
3 3.8 2.3 s
4 3.9 2.5 s
5 2.9 1.3 s
6 3.2 1.7 s
7 4.6 3.4 s
8 4.3 3.1 s
9 3.1 1.8 s
10 2.7 1.5 s
11 2.3 1.4 s
12 2.5 2.1 g
13 1.6 1.1 g
androgen estrogen orientation
14 3.9 3.9 g
15 3.4 3.6 g
16 2.3 2.5 g
17 1.6 1.7 g
18 2.5 2.9 g
19 3.4 4.0 g
20 1.6 1.9 g
21 4.3 5.3 g
22 2.0 2.7 g
23 1.8 3.6 g
24 2.2 4.1 g
25 3.1 5.2 g
26 1.3 4.0 g
Table 3.6: Hormone Dataset
Suppose we fit a binomial model to see if the orientation can be predicted from the
hormone values.
> myglm < glm(orientation~estrogen+androgen,data=hormone,family=binomial)
Executing the above command gives the following warnings
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
Lets take a look at the summary
113
> summary(myglm)
Call:
glm(formula = orientation ~ estrogen + androgen, family = binomial,
data = hormone)
Deviance Residuals:
Min 1Q Median 3Q Max
2.759e05 2.100e08 2.100e08 2.100e08 3.380e05
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 84.49 136095.03 0.001 1.000
estrogen 90.22 75910.98 0.001 0.999
androgen 100.91 92755.62 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.5426e+01 on 25 degrees of freedom
Residual deviance: 2.3229e09 on 23 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 25
Notice
• the residual deviance is extraordinarily small indicating a good fit, yet none of
the parameters are significant as they have very high standard errors.
• the default maximum number of iterations (25) has been reached.
The reason for these results come from the data.
114
s
ss s
s
s
s
s
s
ss
g
g
g
g
g
g
g
g
g
g
g
g
g
g
g
1.5 2.0 2.5 3.0 3.5 4.0 4.5
1
2
3
4
5
androgen
e
st
ro
ge
n
The plot of the data reveals that the two classes of orientation are linearly separable:
it is possible to draw a straight line that divides the two groups perfectly. In this
case, the data do not contain enough information to make stable estimates of the
coefficients. The likelihood function does not have a finite maximum. This behaviour
reflects the small amount of data used to fit the model.
Possible approaches to deal with these types of problems:
• Exact logistic regression.
• Bias reduction method — R package available brlr.
However, these methods are outside the scope of the course.
115
3.9 QuasiLikelihood
Recall that a GLM is determined by the
•
•
We now discuss an approach that only requires specification of the link and variance
functions of the model, but not the distribution of the response.
Motivation
Suppose we have the independent random variables Y1, · · · ,Yn. Let Yi have mean µi
and variance φV(µi). Define the score as
Ui =
yi − µi
φV(µi)
.
It follows that
E(Ui) = 0
var(Ui) =
1
φV(µi)
and − E
(
∂Ui
∂µi
)
=
1
φV(µi)
.
These properties are shared by the score function of members of the exponential
family. This suggests that we may use Ui as a score. Hence
Qi =
∫ µi
yi
yi − t
φV(t)
dt,
should behave like a loglikelihood function for µi (if the integral exists). We shall
refer to Qi as the log quasilikelihood (or confusingly as just the quasilikelihood) for
µi. As we assume the observations are independent, the log quasilikelihood for the
complete data is just the sum of the components: Q = ∑ni=1Qi.
116
Example Take V(µ) = 1 and φ = σ2. Then
U =
and Q =
which is the same as the loglikelihood of a normal distribution up to constants.
In general, using variance functions associated with members of the exponential
family recovers the loglikelihood. Further, other choices of V(µ) may not correspond
to any known distribution or may even lead to something that is not a distribution.
Estimation
The estimation of β in the model is obtained by maximising the log quasilikelihood,
Q. We can again use the IWLS algorithm (Algorithm 3.1) . The only exception is now
the dispersion parameter φ is estimated by
φ̂P =
1
n− p
n
∑
i=1
(yi − µi)2
V(µi)
,
which is based on Pearson’s X2 statistic. We do not use the deviance estimator for φ
as it is based on the likelihood — not reliable here.
Inference
By analogy, the quasideviance function for a single observation is ††
Di = −2φQi = 2
∫ yi
µi
yi − t
V(t)
dt,
and the total quasideviance is the sum over the Di. This total quasideviance can be
used like the ordinary deviance to perform inference on the model.
††see problem sheet
117
Example in R
quasiseeds.R seedsdata.RData
We continue with the seeds example presented in section 3.6.1. We can fit a
corresponding quasibinomial model as follows:
> my.quasi.bin < glm(prop ~ seed + extract + seed * extract,
+ family = quasibinomial(link = "logit"), data = dat)
> summary(my.quasi.bin)
Call:
glm(formula = prop ~ seed + extract + seed * extract, family = quasibinomial(link = "logit"),
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
0.88834 0.18458 0.01555 0.13622 0.38167
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 0.5262 0.2764 1.904 0.07398 .
seed 0.2000 0.3969 0.504 0.62079
extract 1.4479 0.3865 3.747 0.00161 **
seed:extract 0.8478 0.5496 1.543 0.14135

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for quasibinomial family taken to be 0.08915264)
Null deviance: 3.9112 on 20 degrees of freedom
Residual deviance: 1.8151 on 17 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
Notice that the estimated dispersion parameter is not 1 (as in the previous
analysis), but φ = 0.0892, far less than 1. Further, the models deviance has been
significantly reduced.
Comparison of multiple quasi models can be done using the Ftest e.g.
> my.quasi.bin.2 < glm(prop ~ seed + extract, family = quasibinomial,
+ data = dat)
> anova(my.quasi.bin.2, my.quasi.bin, test = "F")
118
Analysis of Deviance Table
Model 1: prop ~ seed + extract
Model 2: prop ~ seed + extract + seed * extract
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 18 2.0280
2 17 1.8151 1 0.21282 2.3871 0.1407
This Fvalue is not significant, therefore there is insufficient evidence to reject the null
hypothesis, i.e. use the smaller model.
The other options for use of quasilikelihoods in R are
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
Note
• The dispersion, φ, can be modelled as a free parameter which is useful in
modelling overdispersion or underdispersion.
• Although using the quasilikelihood approach is attractive as it uses fewer
assumptions — the quasi based estimators are generally less efficient than
corresponding regular likelihoodbased estimator — so if information about the
distribution is available, use it.
119
Chapter 4. Normal Linear Mixed Models
Models with mixed effects consist of both a fixed effect and a random effect. In
the previous chapters we have only considered fixed effects; where the only source
of randomness arises from considering the samples as independent and random
samples. Random effects are used to model more complex correlation structures,
such when there is more than one observation in a group. In this setting, we expect
each group to vary independently, but measurements from the same group may be
correlated.
We shall only consider random effects for normal linear models. However, random
effects can be included in other, more complicated models e.g. nonlinear, nonNormal
(outside the scope of this course).
Motivating Example
Consider the dataset containing the math exam scores of students from 3 different
schools. There are 10 scores from each school in this dataset. The data is presented in
the boxplots below:
5
10
15
20
25
School
Ex
am
S
co
re
First, ignore the grouping structure of the data and assume the simple model:
Yij = µ+ eij, j = 1, . . .m i = 1, . . . ,Kj,
where Yij is the observed score for student i from school j, µ is the mean score across
the population of students being sampled, and the e are independent N(0, σ2) error
terms.
> mylm < lm(score~1,data=dat)
120
Let’s look at the summary of this model and some plots of the residuals:
Call:
lm(formula = score ~ 1, data = dat)
Residuals:
Min 1Q Median 3Q Max
9.966 5.624 1.810 5.941 14.031
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 10.96 1.26 8.701 1.4e09 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.9 on 29 degrees of freedom
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
5 10 15 20 25
−
10
0
5
10
Exam Score y
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
ll
llllll
l
l
l
l
l
l
l
l
l
l
0 5 10 15 20 25 30
−
10
0
5
10
Index
R
es
id
ua
ls
Notice that the group school effects are incoporated into the residuals — we see
clear division in the residuals given by the different schools. This leads to an inflated
estimate of variability i.e. the RSS.
121
One approach would be to incorporate the group effects, by introducing a seperate
mean for each school:
> mylm2 < lm(score~as.factor(school)+0,data=dat)
Call:
lm(formula = score ~ as.factor(school) + 0, data = dat)
Residuals:
Min 1Q Median 3Q Max
13.5675 2.1747 0.0172 2.9415 10.4295
Coefficients:
Estimate Std. Error t value Pr(>t)
as.factor(school)140 14.564 1.751 8.318 6.30e09 ***
as.factor(school)142 4.927 1.751 2.814 0.00902 **
as.factor(school)155 13.394 1.751 7.650 3.15e08 ***

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 5.537 on 27 degrees of freedom
Multiple Rsquared: 0.834, Adjusted Rsquared: 0.8155
Fstatistic: 45.21 on 3 and 27 DF, pvalue: 1.162e10
Notice that the residual standard error has been slightly reduced in comparison to the
previous model. Let us inspect the residual plots again.
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
5 10 15 20 25
−
10
0
5
10
Score y
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
ll
llllll
l
l
l
l
l
l
l
l
l
l
0 5 10 15 20 25 30
−
10
0
5
10
Index
R
es
id
ua
ls
These residual plots still show a trend.
122
This model has several disadvantages:
• it only models the specific sample of students used in the experiment. We
cannot say anything about the population of students from which the sample
was drawn.
• it does not provide an estimate of the betweenschool variability.
We clearly need a class of model that incorporates the group structure of the data.
Then we can pose interesting questions such as: what was the average maths exam
score across the population, and how much variation were there between schools.
123
4.1 Specification of Normal Linear Mixed Models
Definition. The normal linear mixed model is defined as
Y = Xβ+ Zν + ,
where
• Y ∈ Rn is the response vector,
• X ∈ Rn×p is the design matrix for the fixed effect,
• Z ∈ Rn×m model matrix for the random effects,
• β ∈ Rp is the unknown parameter,
• ν is an mvariate vector with ν ∼ N(0, σ2ν Im),
• is an nvariate vector with ∼ N(0, σ2e In).
and ν and are independent.
The joint distribution of Y can be easily derived as follows: First,
E(Y )
Next,
cov(Y )
Since Y is a sum of normally distributed variables, it follows that
Y ∼ N(Xβ, σ2e (In + ZΨZT)), where Ψ =
σ2ν
σ2e
Im
For compactness, we sometimes group the variances into one parameter τ = (σ2e , σ2ν )T,
called the variancecomponents of the model.
124
Example A simple normal linear model is
Yij = µ+ νj + eij, j = 1, . . .m, i = 1, . . . ,Kj,
where, as before, µ is the fixed effect, while νj ∼ N(0, σ2ν ) and eij ∼ N(0, σ2e ). Casting
this model into the definition above:
Y =
Y11
...
YK1,1
Y1,2
...
YK2,2
...
, X =
1
1
...
1
, Z =
1 0 . . . 0
...
...
...
...
1 0 . . . 0
0 1 . . . 0
...
...
...
...
0 1 . . . 0
...
...
...
...
,
β = µ, ν =
ν1
ν2
...
νm
and =
e11
...
eK1,1
e12
...
eK2,2
...
,
where n = ∑mj=1 Kj, Y ∈ Rn, X ∈ Rn×1, Z ∈ Rn×m, β ∈ R. This model is called the
oneway random effects model.
125
4.2 Estimation
Estimation for models with random effects is not as straightforward as it was for fixed
effects models. The standard methods of estimation for normal linear mixed models
are maximum likelihood (ML) and restricted maximum likelihood (REML). We shall
discuss both methods in this section.
4.2.1 Estimation of β
Under a normal linear mixed model, the distribution of Y is
1
(2pi)n/2σ2eV1/2
exp
{
− 1
2σ2e
(y − Xβ)TV−1(y − Xβ)
}
,
where V := In + ZΨZT. Thus, the loglikelihood function is given by
`(β, τ ;y) = −n
2
log(2pi)− 1
2
log
∣∣∣σ2eV∣∣∣− 12σ2e (y − Xβ)TV−1(y − Xβ),
where the variancecovariance term τ appears in V. By differentiating the
loglikelihood with respect to β, we obtain
∂`
∂β
= (4.1)
The standard approach to find the MLEs is to solve ∂`∂β = 0 (and
∂`
∂τ = 0).
For simplicity, assume that X has full rank; so that rank(X) = p. Let (β̂, τ̂ ) be the
MLE. From (4.1) we obtain
β̂ = (4.2)
where Vτ̂ = V(τ̂ ), that is, V with the MLE τ̂ plugged in. Thus once the MLE of τ is
found, the MLE of β can be calculated by expression (4.2). Thus one procedure is to
first solve for τ̂ , then compute β̂ using (4.2). However, finding τ̂ is not so simple.
126
4.2.2 Estimation of Variance Components
There are several approaches to estimating the variance components, σ2e and σ2ν , of
normal linear mixed models. The first approach we discuss is to use the analysis of
variances:
4.2.2.1 ANOVA estimators
Let’s start with the simplest possible random effects model to gain intuition:
Yij = µ+ νj + eij, j = 1, . . . ,m i = 1, . . . ,Kj.
where νj is a random effect with E(νj) = 0, var(νj) = σ2ν , and E(eij) = 0, var(eij) = σ2e .
This induces a correlation between the observations in the same group
ρ =
σ2ν
σ2ν + σ
2
e
.
Proof.
The correlation, ρ, is known as the intraclass correlation coefficient (ICC). Note
• if σ2ν = 0 then ρ = 0.
• if σ2ν σ2e then ρ ≈ 1.
127
For simplicity, suppose there are an equal number of observations for each of the
m groups i.e. set Kj = K — referred to as a balanced design. We can decompose the
variation as follows:
K
∑
i=1
m
∑
j=1
(Yij −Y)2 =
K
∑
i=1
m
∑
j=1
(Yij −Y•j)2 +
k
∑
i=1
m
∑
j=1
(Y•j −Y)2
where
Y•j =
1
K
K
∑
i=1
Yij and Y =
1
mK
K
∑
i=1
m
∑
j=1
Yij.
This decomposition can be written as: SST = SSE+ SSA where SSE is the residual
sum of squares, SST is the total sum of squares (corrected for the mean) and SSA is
the sum of square due to ν. It turns out that (see Lemmas below):
E(SSE) = m(K− 1)σ2e , E(SSA) = (m− 1)(Kσ2ν + σ2e )
which suggests using the estimators:
σ̂2e =
SSE
m(K− 1) =: MSE, σ̂
2
ν =
SSA/(m− 1)− σ̂2e
K
=:
MSA−MSE
K
,
where MSA = SSA/(m− 1). These estimators, σ̂2e and σ̂2ν , are known as the ANOVA
estimators.
128
Lemma 7.
SST = SSE+ SSA
Proof.
Lemma 8. The expectations of these sums of squares are:
i) E(SSE) = m(K− 1)σ2e
ii) E(SSA) = (m− 1)(Kσ2ν + σ2e )
iii) E(SST) = (mK− 1)σ2e + K(m− 1)σ2ν
Proof. i) For SSE, the expectation of the summand is
E
{(
Yij −Y•j
)2}
129
ii) For E(SSA), again start by looking at the expectation of the summand
E
{(
Y•j −Y
)2}
130
iii) Plugging in the results for i) and ii) yields:
E(SST) =m(K− 1)σ2e + (m− 1)(Kσ2ν + σ2e )
=(mK− 1)σ2e + K(m− 1)σ2ν
131
These were the first estimators developed for the variance components. They have an
explicit form suitable for hand calculations, which was important in precomputing
days. However, they have a number of disadvantages:
1. The estimates can take negative values. For instance, if MSA < MSE then
σ2ν < 0u˚
2. A balanced design is assumed — where the number of observations in each
group is equal. In such settings, the ANOVA decomposition into sum of squares
is unique. For unbalanced data, this is not true and therefore we must choose
which ANOVA decomposition and therefore the ANOVA estimator to use.
3. The need for complicated algebraic calculations. Formulae for the simpler
models are easy to derive and code — but more complex models will require
difficult constructions.
4.2.2.2 Maximum Likelihood Approach
Recall that the normal linear mixed model can be written as
Y ∼ N(Xβ, σ2e (In + ZΨZT)︸ ︷︷ ︸
=:V
), where Ψ =
σ2ν
σ2e
Im.
If Ψ is known (i.e. the variance components), we can estimate β using generalised least
squares. However, the estimation of the variance components is often the purpose of
the analysis. Standard maximum likelihood is one method of estimation that we can
use. If we let V = In + ZΨZT, then the distribution of Y is
1
(2pi)(n/2) σ2eV1/2
exp
{
− 1
2σ2e
(y − Xβ)TV−1(y − Xβ)
}
.
Thus the loglikelihood takes the form
`(β, τ ;y) = −n
2
log(2pi)− 1
2
log
∣∣∣σ2eV∣∣∣− 12σ2e (y − Xβ)TV−1(y − Xβ), (4.3)
where the variancecovariance term τ appears in V.
We can then proceed to optimise to find the maximum likelihood estimates of β
and τ . This is straightforward in principle, but difficult in practice. More complex
models involving larger numbers of random effects parameters can be difficult to
estimate. One particular problem is that the variance cannot be negative so the MLE
for the variance might be zero. Nevertheless, we rely on numerical techniques to find
the MLEs.∗ A problem with MLEs is that they can be biased — as illustrated in the
following example.
∗the optimisation techniques are not examinable.
132
Example (NeymanScott)
Consider the model
Yij = µj + eij where eij
iid∼ N(0, σ2),
for i = 1, 2 and j = 1, . . . ,m. We are interested in estimating σ2.
The joint pdf of Y is
m
∏
j=1
[(
1√
2piσ2
)2
exp
{
− 1
2σ2
(
Y1j − µj
)2} exp{− 1
2σ2
(
Y2j − µj
)2}] .
Hence the loglikelihood is
The derivative with respect to µj is
for j = 1, . . . ,m. Consequently, their MLEs are
µ̂j =
Similarly we find
∂`
∂σ2
=
Thus the MLE of σ2 is
σ̂2 =
133
We then proceed to plug in the MLEs for µj into the formula for σ̂2 to get
σ̂2 =
Now consider the bias of this estimator for σ2.
E(σ̂2) = .
Since 4s2j =
{
(Y2j − µj)− (Y1j − µj)
}2, it follows from independence that
4E(s2j ) =
and so
E(s2j ) =
Finally,
E(σ̂2) =
which shows that σ̂2 is biased and does not even become unbiased as n = 2m → ∞.
Here, the maximum likelhood estimator is not consistent.
134
4.2.2.3 Restricted Maximum Likelihood
The NeymanScott example demonstrates that the MLE approach can lead to a biased
estimator of the variance. Notice that the fixed effects, the µis, are considered to
be nuisance parameters and the main interest lies in the variance term σ2. We now
discuss a method that can be used to estimate the parameters of interest without
dealing with nuisance parameters.
Example (NeymanScott continued)
Consider the transformation Bj = Y1j − Y2j. Then B1, . . . , Bm are independent and
N(0, 2σ2) distributed. Notice that this transformation leads to a distribution that does
not involves the nuisance parameters µ1, . . . , µm. Then the joint pdf of B1, . . . , Bm is
f (b1, . . . , bm) =
m
∏
j=1
1√
2pi(2σ2)
exp
{
− 1
2(2σ2)
b2j
}
.
Hence the loglikelihood is
and it follows that
∂`
∂σ2
=
Therefore, the MLE of σ2 is
σ̂2 =
Finally, we can compute the expectation of this MLE and find that is it indeed an
unbiased estimator of σ2:
E(σ̂2) =
The trick shown in this example is to apply a transformation to the data to
eliminate the fixed effects; then use to transformed data to estimate the variance
component. This idea can be applied in general.
Assume (w.l.o.g.) that X has full rank i.e. rank(X) = p. Let L ∈ Rn×(n−p) such
that rank(L) = n− p and
LTX = 0.
Then multiplying the mixed normal linear model equation by LT on the left we get
LTY︸︷︷︸
=:B
=
135
After the transformation we have
B ∼ N(0, σ2e LTVL).
It follows that the joint pdf of B is
1
(2pi)(n−p)/2 σ2e LTVL1/2
exp
{
− 1
2σ2e
bT(LTVL)−1b
}
so that the loglikelihood is
`R(τ ; b) = − (n− p)2 log(2pi)−
1
2
log
∣∣∣σ2e LTVL∣∣∣− 12σ2e bT(LTVL)−1b.
To emphasis this is the restricted loglikelihood we use the subscript R. The REML
(restricted maximum likelihood) estimator of τ is defined as the maximiser of `R. As
with ordinary MLE; such a maximiser satifies the REML equation
∂`R
∂τ
= 0.
Again, we shall rely on numerical methods to find the maximiser of the REML
equation. After the random effects are estimated; the fixed effects are subsequently
estimated.
Notice that we have already seen a candidate for L. Let LT = In − X(XTX)−1XT,
then†
LTY = LTXβ+ LTZν + LT
†see problem sheet
136
4.2.3 Worked Example
REMpulp.R
We now illustrate the fitting methods used in R using some data from an experiment
to test the paper brightness depending on a shift operator:
> library("faraway")
> data("pulp")
bright operator
1 59.8 a
2 60.0 a
3 60.8 a
4 60.8 a
5 59.8 a
6 59.8 b
7 60.2 b
8 60.4 b
9 59.9 b
10 60.0 b
bright operator
11 60.7 c
12 60.7 c
13 60.5 c
14 60.9 c
15 60.3 c
16 61.0 d
17 60.8 d
18 60.6 d
19 60.5 d
20 60.5 d
Table 4.1: Pulp Dataset
We start by calculating the ANOVA estimators:
> myaov < aov(bright~operator,data=pulp)
> summary(myaov)
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.4467 4.204 0.0226 *
Residuals 16 1.70 0.1062

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> coef(myaov)
(Intercept) operatorb operatorc operatord
60.24 0.18 0.38 0.44
137
The estimate of σ2e = 0.1062 and of σ2ν is
> (0.44670.1062)/5
[1] 0.0681
We now demonstrate the maximum likelihood estimators.
> library("lme4")
> mylme < lmer(bright~1+(1operator),data=pulp)
Take note of how the model is specified in the lmer function: the first part corresponds
to the fixed effects; the second part corresponds to the random effect, where
(1operator) indicates that the data are grouped by operator and the 1 indicates
that the random effect is constant within each group.
> summary(mylme)
Linear mixed model fit by REML ['lmerMod']
Formula: bright ~ 1 + (1  operator)
Data: pulp
REML criterion at convergence: 18.6
Scaled residuals:
Min 1Q Median 3Q Max
1.4666 0.7595 0.1244 0.6281 1.6012
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.06808 0.2609
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1494 404.2
138
Also note that the default fitting method is REML. The reported standard
deviations are just the square root of the reported variances — not estimates of the
uncertainty in the variances.
The maximum likelihood estimates can also be computed.
> mylme2 < lmer(bright~1+(1operator),data=pulp,REML=FALSE)
> summary(mylme2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bright ~ 1 + (1  operator)
Data: pulp
AIC BIC logLik deviance df.resid
22.5 25.5 8.3 16.5 17
Scaled residuals:
Min 1Q Median 3Q Max
1.50554 0.78116 0.06353 0.65850 1.56232
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.04575 0.2139
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1294 466.7
139
4.3 Inference
To compare two nested models M0 (smaller) and M1, we compute the difference of
the loglikelihoods (or difference of deviances)
2
{
`(β̂1, τ̂1;y)− `(β̂0, τ̂0;y)
}
,
where β̂0, τ̂0 are the MLEs of the parameters for M0 and β̂1, τ̂1 are the MLEs of the
parameters for M1.
We have two choices: either (wrongly) assume the variance parameters are equal
to their estimated values, then perform a χ2test, or try to take into account their
uncertainty, then use an Ftest.
Testing whether to include the random effects or not involves the hypothesis:
H0 : σ2ν = 0.
The standard derivation of the asymptotic χ2 distribution depends on the null
hypothesis lying on the interior of the parameter space. This assumption does
not hold for this test. Consequently, one may resort to sampling procedures – see
parametric bootstrap below (Section 4.3.1)
If the χ2 approximation is used, then the test will tend to be conservative in the sense
that the pvalues will tend to be larger than they should be.
inference in the mixed model is intricate, and often controversial
Warning
We cannot use the REML estimation method if the deviance test is to be used. The
reason is that; the REML estimates the random effects by eliminating the fixed effects
(we transformed using L). If these fixed effects are changed, the likelihoods of the
two models are not directly comparable. Therefore, we should use the maximum
likelihood estimation method in this situation.
140
4.3.1 Worked Example (Continued) — Parametric Bootstrap
REMpulpcont.R
Returning to the pulp example. We fitted the mixed effects model using the maximum
likelihood (ML) approach
> mylme2 < lmer(bright~1+(1operator),data=pulp,REML=FALSE)
We now consider a test concerning the random effects — therefore we must use ML.
Say we are interested in comparing the mylme2 and
> mylm < lm(bright~1,data=pulp)
that is, comparing if the random effects should be included or not. Our null model
is the smaller linear model mylm. Note that as there are no random effects in this
model, we must use the lm function. For once, we cannot use anova and therefore
must compute things directly:
> d < as.numeric(2*(logLik(mylme2)logLik(mylm)))
> d
[1] 2.568371
> pchisq(d,1,lower=FALSE)
[1] 0.1090199
This pvalue is now well above the 5% significance level. We cannot say that this result
is necessarily wrong, but the use of the χ2 approximation does cause us to doubt the
result.
We can use the parametric bootstrap approach to obtain a more accuate pvalue.
We need to estimate the probability, given that the null hypothesis is true, of observing
a d value of 2.568371 or greater. Under the null hypothesis, Y ∼ N(µ, σ2). A
simulation approach would proceed to generate data under this model, fit the null and
alternative models and then compute the deviance. The process would be repeated a
large number of times and the proportion of deviance values exceeding the observed
value of 2.568371 would be used to estimate the pvalue. In practice, we do not know
the true values of µ and σ2 but we can use the estimated values — this distinguishes
the parametric approach from the simulation approach. We can simulate responses
under the null as follows:
141
> y < simulate(mylm)
Now taking the generated data, fit both the null and alternative model compute the
deviance and repeat, say 1000 times:
> ds < numeric(1000)
> for(i in 1:1000){
+ y < unlist(simulate(mylm))
+ nullmod < lm(y~1)
+ altmod < lmer(y~1+(1operator),data=pulp,REM=FALSE)
+ ds[i] < as.numeric(2*(logLik(altmod)logLik(nullmod)))
+ }
We may examine the distribution of the bootstrapped deviance values i.e. look at the
proportion of values close to zero
> mean(ds<0.00001)
[1] 0.698
This clearly indicates that the deviance in this case does not have the χ2 distribution
under the null hypothesis.
Our estimated pvalue is:
> phat < mean(ds>2.568371)
> phat
[1] 0.02
We should compute the standard error for this estimate (where does this come from?)
> sqrt(0.02*0.98/1000)
[1] 0.004427189
So we can be fairly sure it is under 5%. If in doubt, do more replications.
142
4.4 Prediction
4.4.1 Best prediction when all the parameters are known
When the fixed effects, β, and variance components, τ , are known, the best predictor
for ν, in the sense of minimum mean squared error, is its conditional expectation
given the data; that is,
ν˜ := E(νy) = ΨZTV−1(y − Xβ).
To show E(νy) = ΨZTV−1(y − Xβ), first consider the covariance between Y and ν:
.
The proof then follows by considering the joint over Y and ν which is an (n +
m)dimensional multivariate Gaussian distribution:(
Y
ν
)
∼ N
((
Xβ
0
)
, σ2e
(
In + ZΨZT ZΨ
ΨZT Ψ
))
where we also used that ΨT = Ψ. Now recall the result:(
Wa
Wb
)
∼N
((
µa
µb
)
,
(
Σaa Σab
Σba Σbb
))
⇒ Wb Wa = wa ∼ N
(
µb + ΣbaΣ
−1
aa (wa −µa),Σbb − ΣbaΣ−1aa Σab
)
.
This allows us to obtain the conditional νy (i.e. the posterior) without performing an
intergration:
ν˜ := E(ν  Y = y) =
Once the best predictors ν˜ are obtained, the best predictor of η = xT?β+ zT?ν is
η̂B := xT?β+ z
T
? ν˜ = x
T
?β+ z
T
?ΨZ
TV−1(y − Xβ).
The subscript B refers to the best predictor.
143
4.4.2 Best linear unbiased prediction (BLUP)
If the fixed effects, β, are unknown but the variance components, τ , are known, η̂B is
not a predictor. In this case, it is customary to replace β by is MLE β̂, which is
β̂ = (XTV−1X)−1XTV−1Y .
After replacing β by its MLE, the predictor for η := xT?β+ zT?ν becomes
η̂BLUP = xT? β̂+ z
T
? ν˜β̂ = x
T
? β̂+ z
T
?ΨZ
TV−1(y − Xβ̂).
This is the best linear unbiased predictor (BLUP) of η, in the sense that (i) is it linear in
y, (ii) its expected value is equal to η and (iii) it minimises the MSE among all linear
unbiased predictors. The vector
ν˜β̂ := ΨZ
TV−1(y − Xβ̂)
is also called the BLUP of ν.
4.4.3 Empirical BLUP
In practice, the fixed effects and variance components are typically unknown.
Therefore, in most cases neither the best predictor nor the BLUP is computable. In
such cases, one replaces the vector of variance components, τ , which is involved in
the expression of BLUP, by, τ̂ . That is, instead of predicting the random effects by ν˜
we use
ν̂ := E(νY = y, τ = τ̂ ) = ΨZTV−1τ̂ (y − Xβ̂)
where Vτ̂ is V with an estimator of τ plugged in. The resulting predictor, often called
empirical BLUP or eBLUP, is
η̂eBLUP = xT? β̂+ z
T
? ν̂ = x
T
? β̂+ z
T
?ΨZ
TV−1τ̂ (y − Xβ̂).
Note
Recall that the random effects, ν, are not parameters: they are random variables.
So rather than estimating their values, we predict them. Therefore, the conditional
expectations are referred to as predictors.
144
4.5 Worked Example
REMsim.R
We begin by simulating some data as follows:
> n < 60
> X < rnorm(n,0,1)
> Z < c(rep(1,20),rep(2,20),rep(3,20))
> nu < rnorm(3,0,1)
> beta < 1.5
> y < nu[Z]+beta*X+rnorm(60,0,0.1)
> dat < data.frame(y=y,Z=Z,X=X)
We have simulated from 3 groups with an associated random effect with standard
deviation 0.1 and additional covariate X, representing a fixed effect with regression
coefficient 1.5 with standard deviation 1.
Next , we fit the normal linear mixed model, Y = Xβ+ Zν + , as follows
> library("lme4")
> mylme < lmer(y~X+(1Z)+0,data=dat,REML=FALSE); summary(mylme)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ X + (1  Z) + 0
Data: dat
AIC BIC logLik deviance df.resid
71.4 65.1 38.7 77.4 57
Scaled residuals:
Min 1Q Median 3Q Max
2.58247 0.59496 0.00208 0.49970 2.02786
Random effects:
Groups Name Variance Std.Dev.
Z (Intercept) 0.77931 0.8828
Residual 0.01123 0.1060
Number of obs: 60, groups: Z, 3
Fixed effects:
Estimate Std. Error t value
X 1.51026 0.01572 96.09
145
We see that the output includes the standard effects summary and an additional
summary for the random effects; reporting the variance of the random effect (named
as Intercept) and the residual variance. Notice that the estimated residual variance
is accurate (true value is 0.01), but the estimated random effect variance is not as
accurate (true value is 1). Why? The estimated coefficient, β, is also accurate (true
value 1.5).
To get the posterior estimate for the random effects i.e. ν̂ := E(νy, τ̂ ):
> ranef(mylme)
$Z
(Intercept)
1 0.8867433
2 1.1064969
3 0.5706097
If we wish to predict the response of an observation given the covariate x? = 1, but
do not know which group it belongs to, then the best predictor is xT? β̂ = β̂; for this
example this is 1.51026.
If we want to predict the response of an observation from, say, group j = 1, then we
use the estimate of the random effect to get the estimate
> fixef(mylme)+ranef(mylme)$Z
(Intercept)
1 2.3970000
2 2.6167536
3 0.9396471
146
4.6 Inference Continued
4.6.1 Inference about the fixed effects, β
Let us return to the unrestricted loglikelihood
`(β, τ ;y) = −n
2
log(2pi)− 1
2
log
∣∣∣σ2eVτ ∣∣∣− 12σ2e (y − Xβ)TV−1τ (y − Xβ). (4.4)
For fixed τ , taking derivative of this loglikelihood with repsect to β gives
∂`
∂β
= XTV−1τ (y − Xβ)
so that the MLE of β is the solution of
XTV−1τ Xβ̂ = XTV−1τ y,
which we recognise. Recall that Vτ := (In + ZΨZT) where Ψ =
σ2ν
σ2e
Im — therefore Vτ
depends on the variance components τ = (σ2e , σ2ν )T. The (profile) loglikelihood of the
variance parameter τ is
`(τ ;y) = −n
2
log(2pi)− 1
2
log
∣∣∣σ2eVτ ∣∣∣− 12σ2e (y − Xβ̂)TV−1τ (y − Xβ̂).
and the Fisher information of β is
J (β) = XTV−1τ X.
In practice, the estimated value of τ̂ is plugged into the Fisher information, from
which we can find the standard error for the MLE β̂ in the form
β̂ = β̂τ̂
J
(
β̂
)
= XTV−1τ̂ X.
Problem
The standard errors computed from this plugin formula do not take into account the
uncertainty in the estimation of τ ; but this approach is nevertheless commonly used.
Note
We can plug in any estimator of the variance components τ in the above. This is how
we get estimate of the fixed effect, β̂, using the REML estimators.
147
4.7 Diagnostics
For normal mixed linear models, the two main distributional assumptions pertain to
the normality of the random effects ν and the errors .
Normality of Random Effects
For normal mixed linear models, we assumed
ν ∼ N(0, σ2ν Im).
To check this assumption, some “estimates” of the random variables ν are required.
Typically, the conditional expectations of the random effects, given the observations,
are used:
ν̂ = E(ν  y, τ = τ̂ ) = ΨZTV−1τ̂ (y − Xβ̂).
In turns out that using QQ plots of the predicted random effects for the purpose of
checking their normality are not useful. This is because the observed distribution
of ν̂ does not necessarily reflect the true distribution of ν. In practice, checking the
normality assumption for ν should be based on the results for a linear mixed model
with and without assuming this normality (outside scope of course).
Residuals
As for linear models, the main tools for checking the assumption of the normality of
the errors, , are based on residuals. First, recall the normal linear mixed model:
Y = Xβ+ Zν + or Y ∼ N(Xβ, σ2e (In + ZΨZT)).
One set of residuals is the conditional residuals defined as
(c) = y − Xβ̂− Zν̂,
where ν̂ = E(νy, τ = τ̂ ).
Another set is the marginal residuals defined as
(m) = y − Xβ̂.
These raw residuals are useful to check heterogeneity of the conditional or marginal
variance. However, they are not useful for checking normality assumptions and/or
detecting outlying observations. This is because the raw residuals are typically
correlated and their variances will differ. Moreover, standardised and Pearson
residuals are not appropriate for checking normality. This is because the linear
mixed model allows for correlation between the errors. A workaround (approximate)
148
solution is to transform the raw residual such that the resulting transformed residual
is approximately Normal. Then a QQ plot of the transformed residuals should
show an approximate straight line. This transformation can be based upon the
Cholesky decomposition of the variancecovariance matrix. Some recommend the
following diagnostic plots: plot of the marginal residuals against covariates to check
the linearity assumptions for the covariates and; plots of the conditional residuals
against the estimated conditional means can be used to detect outlying observations
or heteroscedasticity of the errors.
149
Question: Does the REML estimator depend on the choice of transformation, L?
150
4.7.1 Summary: Fixed or Random Effects
In certain cases, random effects are almost obligatory:
• when we wish to generalise for unseen levels of a factor (i.e., we view the
observed levels as samples from a population).
• when we have reason to believe that the grouping effects are actually “tightly”
distributed around an overall population mean.
• when the design forces correlation within groups (e.g,. block designs where one
treatment only is tested per plot/block).
151
欢迎咨询51作业君