MTHM506 Statistical Data Modelling
Outline Solutions Problem Sheet 1
Topic 1
1. In the lecture notes, there is an example relating to a study of the number of fish of differing sizes caught in a
net of a particular mesh size. We developed a binomial GLM for the probability of escape from the net with a single
predictor x, the fish length in cm. The model for the data y1, . . . , yn (numbers of fish escaping) is
Yi ∼ Bin(Ni, pi), Yi independent
log( pi/(1 − pi) ) = β0 + β1xi
where Ni (the total number of fish) has also been observed and xi is fish length. The notes contain details of the
relevant log-likelihood, and the R script topic1.R directly maximises the log-likelihood of this binomial
model for the data set fish using the R optimiser function nlm(), reporting the maximum likelihood
estimates for β0 and β1 along with their standard errors and other model diagnostics.
(a) Extend the R code in order to fit the modified model:
Yi ∼ Bin(Ni, pi), Yi independent
log( pi/(1 − pi) ) = β0 + β1xi + β2xi²
Report the maximum likelihood estimates for β0, β1 and β2 along with their standard errors and the AIC
for the modified model.
(b) Calculate the log-likelihood ratio statistic relevant to a formal comparison of the fit of the original and the
modified models. Perform a formal test of the hypothesis that there is no significant difference in fit between
the original and the modified model.
Solution
(a) From the lecture notes, the log-likelihood of the model is:
ℓ(y; β0, β1, β2) = const + ∑_{i=1}^{n} [ yi(β0 + β1xi + β2xi²) − Ni log(1 + exp{β0 + β1xi + β2xi²}) ].
Create a function in R to evaluate −ℓ(y; β0, β1, β2) (because nlm() will minimise, we work with minus
the log-likelihood):
> fn <- function (beta){
+   sum(-fish$escape*(beta[1]+beta[2]*fish$length+beta[3]*fish$length^2) +
+     fish$total*log(1+exp(beta[1]+beta[2]*fish$length+beta[3]*fish$length^2)))}
Minimise the negative log-likelihood with starting values β0 = 1, β1 = 0 and β2 = 0 (chosen arbitrarily):
> out <- nlm(fn,p=c(1,0,0),hessian=T,iterlim=10000,steptol=1e-10)
> out$estimate
[1] 31.20001576 -1.30410232 0.01018133
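The standard errors come from the inverse Hessian of the minimised negative log-likelihood, and the AIC from the minimised value itself (up to the additive constant dropped from the log-likelihood, which does not affect comparisons between models computed the same way). A quick sketch:
> sqrt(diag(solve(out$hessian)))  # standard errors for beta0, beta1, beta2
> 2*out$minimum + 2*3             # AIC = -2*loglik + 2p, with p = 3 parameters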
Compare with a binomial GLM:
> bmodel <- glm(cbind(fish$escape,fish$total-fish$escape)~length+I(length^2),family=binomial,
data=fish)
> summary(bmodel)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  31.35398    5.94186   5.277 1.31e-07 ***
length       -1.31181    0.28843  -4.548 5.41e-06 ***
I(length^2)   0.01027    0.00349   2.944  0.00324 **
Parameter estimates and standard errors for both models are comparable.
The model suggests that fish length significantly decreases the odds of a fish escaping. A plot of observed
vs fitted proportions of fish escaped shows that the model fits rather well (Figure 1):
> plot(fish$length,fish$escape/fish$total)
> lines(fish$length,bmodel$fitted.values)
[Figure 1: Proportion of fish escaped vs their length (observed and fitted proportions of fish escaped).]
(b) Create a function to evaluate minus the log-likelihood of the original model and minimise it:
> fn2 <- function (beta){
+   sum(-fish$escape*(beta[1]+beta[2]*fish$length) +
+     fish$total*log(1+exp(beta[1]+beta[2]*fish$length)))}
> out2 <- nlm(fn2,p=c(1,0),hessian=T,iterlim=10000,steptol=1e-10)
The likelihood ratio statistic is minus twice the log of the ratio of the two likelihoods evaluated at the maximum
likelihood estimates, so
> LR <- -2*( (-out2$minimum)-(-out$minimum) )
We now compare this to a χ²1 distribution with 40 − 39 = 1 degree of freedom (the difference in residual degrees of freedom between the two models):
> 1-pchisq(LR,1) ## p-value
[1] 0.3130268
The p-value is larger than 0.05, so the two models do not differ significantly in fit and the extra quadratic
term in the larger model is redundant.
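As a cross-check, the same test can be carried out via an analysis-of-deviance table in R; a sketch, assuming the single-predictor model is refitted with glm() under the (hypothetical) name bmodel1:
> bmodel1 <- glm(cbind(escape,total-escape)~length,family=binomial,data=fish)
> anova(bmodel1,bmodel,test="Chisq")  # likelihood ratio test; p-value should match the one above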
2. The module’s R workspace contains a data frame nlmodel which involves data on a response variable y and
a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship. Suppose
for these data we wish to consider the model:
Yi ∼ N( θ1xi/(θ2 + xi), σ² ), Yi independent.
(a) Why can’t this model be fitted as a linear (regression) model? (1)
(b) Write down the likelihood L(θ1, θ2, σ²; y, x) and the log-likelihood ℓ(θ1, θ2, σ²; y, x). (2)
(c) Write an R function mylike() which evaluates −ℓ(θ1, θ2, σ²; y, x) for any values of the three parameters. (1)
(d) Use the R function nlm() in association with your function mylike() to numerically minimise minus the
log-likelihood. Provide some evidence of how you chose sensible starting values. (3)
(e) Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated
mean relationship on a scatter plot of y versus x. (2)
(f) Report the standard errors for θ1 and θ2, and use those to construct 95% confidence intervals. (5)
(g) Test the hypothesis that θ2 = 0.08 at the 5% significance level (not using the confidence interval) and
compute the associated p-value of the test. (3)
(h) Use plug-in prediction to construct and plot 95% prediction intervals. (4)
(21)
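As a starting point for part (c), a minimal sketch only, assuming nlmodel has columns x and y and writing the third parameter as the standard deviation σ rather than σ²:
> mylike <- function(theta){
+   mu <- theta[1]*nlmodel$x/(theta[2]+nlmodel$x)         # mean theta1*x/(theta2 + x)
+   -sum(dnorm(nlmodel$y,mean=mu,sd=theta[3],log=TRUE))   # minus the log-likelihood
+ }
> # then minimise with nlm(mylike, p=c(...), hessian=TRUE), choosing starting
> # values by eye from the scatter plot of y against x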
3. The dataframe sexr contains a variable ratio which is the ratio of male births to female births across various
countries. Interest lies in seeing how this relates to a deprivation measure, namely infant mortality tinmort, so
it was suggested to use a Gamma distribution (since the ratio is strictly positive) with a mean that is a function
of tinmort. Specifically:
Yi ∼ Gamma(µi, α), Yi indep.
log(µi) = β0 + β1xi
where xi is the infant mortality tinmort and where the pdf of the Gamma is
p(yi) = yi^{α−1} e^{−yi/λi} / ( Γ(α) λi^α ), yi > 0, (1)
which has mean E[Yi] = µi = αλi and variance var[Yi] = σi² = αλi². Parameter α is called the shape
parameter, while λi is the scale parameter.
(a) Write down the likelihood and the log-likelihood of this model.
(b) Use R to estimate the unknown parameters and report the standard errors.
(c) Plot the estimated relationship.
(d) Test the hypothesis that β1 = 0.
(e) Predict the ratio for the infant mortality value of 0.4 and report a 95% prediction interval.
Solution
(a) The Yi are independent, so the likelihood is a product of the individual pdfs. It is better to express the pdf in
terms of the mean µi = exp{β0 + β1xi}, using λi = µi/α:
p(yi) = yi^{α−1} e^{−yiα/µi} / ( Γ(α) (µi/α)^α )
so the likelihood is:
L(β0, β1, α; y) = ∏_{i=1}^{n} yi^{α−1} e^{−yiα/µi} / ( Γ(α) (µi/α)^α ) = (1/Γ(α)^n) ∏_{i=1}^{n} yi^{α−1} e^{−yiα/µi} / (µi/α)^α
and so the log-likelihood is (noting that the log of a product is the sum of the logs):
ℓ(β0, β1, α; y) = −n log(Γ(α)) + ∑_{i=1}^{n} log( yi^{α−1} e^{−yiα/µi} / (µi/α)^α )
= −n log(Γ(α)) + ∑_{i=1}^{n} [ (α − 1) log(yi) − yiα/µi − α log(µi/α) ]
= −n log(Γ(α)) + ∑_{i=1}^{n} [ (α − 1) log(yi) − yiα/µi − α log(µi) + α log(α) ]
= −n log(Γ(α)) + ∑_{i=1}^{n} [ (α − 1) log(yi) − yiα/exp{β0 + β1xi} − α log(exp{β0 + β1xi}) + α log(α) ]
= −n log(Γ(α)) + ∑_{i=1}^{n} [ (α − 1) log(yi) − yiα/exp{β0 + β1xi} − α(β0 + β1xi) + α log(α) ]
(b) The R function to compute minus the log-likelihood is:
llhood <- function(theta){
n <- nrow(sexr)
result <- -n*lgamma(theta[3]) + sum( (theta[3]-1)*log(sexr$ratio) -
(sexr$ratio*theta[3])/(exp(theta[1]+theta[2]*sexr$tinmort)) -
theta[3]*(theta[1]+theta[2]*sexr$tinmort) + theta[3]*log(theta[3]) )
-result
}
where the following initial values seem sensible as per Figure 2:
> inits <- c(0.05,0,10)
> plot(sexr$tinmort,sexr$ratio,pch=20)
> xx <- seq(0,0.3,len=200)
> lines(xx,exp(inits[1]+inits[2]*xx),col="red")
[Figure 2: Ratio vs infant mortality (sexr$ratio against sexr$tinmort), with the initial-values guess for β0 and β1 and the estimated mean relationship superimposed.]
The estimates and standard errors are:
> out <- nlm(llhood,p=inits,hessian=T,iterlim=10000,steptol=1e-10)
> round(out$estimate,4)
[1] 0.0560 -0.0253 11417.7943
> OIM <- solve(out$hessian)
> VarianceBeta <- diag(OIM)
> stand_error <- sqrt(VarianceBeta)
> round(stand_error,4)
[1] 0.0003 0.0028 370.9105
(c) The following code adds the fitted line to Figure 2:
lines(xx,exp(out$estimate[1]+out$estimate[2]*xx),col="red",lwd=3)
(d) The estimate of β1 is -0.0253 and the standard error is 0.0028, so we test H0 : β1 = 0 against
Ha : β1 ≠ 0 with test statistic:
Z = (β̂1 − 0)/s.e.(β̂1) = −0.0253/0.0028 ≈ −9.0
with p-value:
> z_test <- out$estimate[2]/stand_error[2]
> p_value <- 2*(pnorm(z_test,0,1))
> p_value
[1] 3.990903e-19
which is much smaller than 0.05 so H0 is rejected.
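Note that 2*pnorm(z_test) gives the correct two-sided p-value here only because the test statistic is negative; a form that works for either sign is:
> 2*pnorm(-abs(z_test))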
(e) Our prediction is of course the mean at x = 0.4, i.e. µ̂ = exp{β̂0 + β̂1 × 0.4}:
> exp(out$estimate[1] + out$estimate[2]*0.4)
[1] 1.046889
To construct the prediction interval we use the 2.5% and 97.5% quantiles of the fitted Gamma distribution:
> mu <- exp(out$estimate[1] + out$estimate[2]*0.4)
> mu
[1] 1.046889
> shape <- out$estimate[3]
> scale <- mu/shape
> qgamma(0.025,shape=shape,scale=scale)
[1] 1.027773
> qgamma(0.975,shape=shape,scale=scale)
[1] 1.066178
so the prediction is 1.047 with a 95% prediction interval [1.028,1.066].
Topic 2
4. In the lecture slides for Topic 2, we argued that the Normal, Binomial and Poisson distributions all belong to the
exponential family
p(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) }.
Show that this is the case by using the fact that a mathematical function f(x) can be written as exp{log(f(x))}.
Solution
Normal:
The pdf of the Normal is:
p(y; µ, σ²) = (1/√(2πσ²)) exp{ −(y − µ)²/(2σ²) }
which we can rewrite as
p(y; µ, σ²) = exp{ log(1/√(2πσ²)) } exp{ −(y − µ)²/(2σ²) }
= exp{ log(1/√(2πσ²)) − (y − µ)²/(2σ²) }
= exp{ −(y² − 2yµ + µ²)/(2σ²) + log(1/√(2πσ²)) }
= exp{ (yµ − µ²/2)/σ² − y²/(2σ²) − log(√(2πσ²)) }
= exp{ (yµ − µ²/2)/σ² − (1/2)[ y²/σ² + log(2πσ²) ] }
So we have that the Normal distribution is in the exponential family with: θ = µ, a(φ) = φ = σ², b(θ) = µ²/2 = θ²/2
and c(y, φ) = −(1/2)[ y²/σ² + log(2πσ²) ].
Binomial:
The pmf of the Binomial is:
p(y; N, π) = (N choose y) π^y (1 − π)^{N−y}
which we can rewrite as
p(y; N, π) = exp{ log( (N choose y) π^y (1 − π)^{N−y} ) }
= exp{ log(π^y) + log((1 − π)^{N−y}) + log(N choose y) }
= exp{ y log(π) + (N − y) log(1 − π) + log(N choose y) }
= exp{ y log(π) − y log(1 − π) + N log(1 − π) + log(N choose y) }
= exp{ y log( π/(1 − π) ) + N log(1 − π) + log(N choose y) }
So the Binomial fits into the exponential family, with: θ = log( π/(1 − π) ), a(φ) = 1, b(θ) = −N log(1 − π) =
N log(1 + e^θ) and c(y, φ) = log(N choose y).
Poisson:
The pmf of the Poisson is:
p(y; λ) = λ^y e^{−λ} / y!
which we can rewrite as
p(y; λ) = exp{ log( λ^y e^{−λ} / y! ) }
= exp{ log(λ^y) − λ + log(1/y!) }
= exp{ y log(λ) − λ − log(y!) }
So the Poisson fits into the exponential family, with: θ = log(λ), a(φ) = 1, b(θ) = λ = e^θ and c(y, φ) = −log(y!).
5. Consider the Gamma distribution
p(y) = y^{α−1} e^{−y/λ} / ( Γ(α) λ^α ), y > 0,
which has mean E[Y] = µ = αλ and variance var[Y] = σ² = αλ².
(a) Rewrite the probability distribution p(y) in terms of the mean µ and parameter α. (1)
(b) Thus show that it belongs to the exponential family, identifying parameter θ and functions a(·), b(·) and
c(·) in terms of α and µ. (3)
(c) Verify the expressions for the mean and the variance of the distribution using the associated general
results for the exponential family. (3)
(7)
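No worked solution is given for this question; as a hedged sketch of the key steps, using the same reparameterisation λ = µ/α as in question 3:
p(y; µ, α) = y^{α−1} e^{−yα/µ} / ( Γ(α) (µ/α)^α ) = exp{ [ y(−1/µ) − log(µ) ] / (1/α) + (α − 1) log(y) + α log(α) − log Γ(α) },
so θ = −1/µ, a(φ) = 1/α, b(θ) = log(µ) = −log(−θ) and c(y, φ) = (α − 1) log(y) + α log(α) − log Γ(α). The general exponential-family results then give E[Y] = b′(θ) = −1/θ = µ and var[Y] = b′′(θ)a(φ) = (1/θ²)(1/α) = µ²/α = αλ², as required.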
6. The module workspace contains a data frame dicentric which presents data from an experiment conducted
to determine the effect of gamma radiation on the numbers of chromosomal abnormalities ca observed. The
number of cells exposed (in hundreds) in each run differs and is contained in the variable cells. The dose
amount doseamt and the rate doserate at which the dose is applied are the predictors of interest.
(a) Directly model the rate, i.e. number of abnormalities per cell by using a Poisson GLM with response ca,
with a log link, with doseamt and doserate as predictors and with a suitable offset involving the numbers
of cells (cells). Comment on the model summary - parameter estimates, standard errors, model fit and
residual plots.
(b) If your final model from (a) is substantially over-dispersed, then fit a quasi-Poisson model instead and
comment on the differences between that and the results from (a).
(c) Alternatively, fit a Negative Binomial model and comment on the differences between that and the results
from (a).
(d) It was suggested that there might be an interaction between dose rate and dose amount, on the basis
that the effect of dose rate is not independent of dose amount (and vice versa). Fit a Poisson GLM with
the interaction, and also non-linear effects if deemed necessary.
(e) Use the AIC to choose between the Negative Binomial model and the Poisson GLM with the interactions,
and discuss what this model tells us about the relationship of the covariates to the response.
Solution
(a) We want to model the rate ρ (number of abnormalities per 100 cells) using a Poisson model with log-link
so we need to include an offset involving cells. We specify the model
Yi ∼ Pois(λi) Yi independent
log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i
where Yi is the number of abnormalities, “NoCells” is the total number of cells, x1 is the dose amount and
x2 is the dose rate. Fit this in R:
> dicentric$log.cells <- log(dicentric$cells)
> pois.model <- glm(ca~offset(log.cells)+doseamt+doserate,family=poisson,data=dicentric)
> summary(pois.model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.461150 0.043257 -80.01 <2e-16 ***
doseamt 0.662302 0.009689 68.35 <2e-16 ***
doserate 0.155012 0.013685 11.33 <2e-16 ***
Residual deviance: 282.95 on 24 degrees of freedom
Note that cells is the number of cells in 100s so we are modelling here the rate per 100 cells. If we
wanted to model the rate per individual cell then we would be fitting the model:
> dicentric$log.cells <- log(dicentric$cells*100)
> pois.model2 <- glm(ca~offset(log.cells)+doseamt+doserate,family=poisson,data=dicentric)
> summary(pois.model2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.066321 0.043257 -186.47 <2e-16 ***
doseamt 0.662302 0.009689 68.35 <2e-16 ***
doserate 0.155012 0.013685 11.33 <2e-16 ***
Residual deviance: 282.95 on 24 degrees of freedom
and note that the two models are exactly the same except that the intercept β0 is different.
Clearly both variables doseamt and doserate are significant, both having a positive effect on the rate.
Let’s look at the residual plots in Figure 3: The residuals vs fitted values plot on the left shows some
structure but keep in mind the low number of observations. Normal Q-Q plot also is acceptable. The third
plot indicates some outliers which have both high leverage and a large Cook’s distance. No real reason
to remove these points though, as we can’t confirm that they are erroneous.
Check whether model fits:
> 1-pchisq(pois.model$deviance,pois.model$df.residual)
[1] 0
implying that the model does not fit (it is over-dispersed).
[Figure 3: Residual plots for Poisson GLM (residuals vs fitted, Normal Q-Q, residuals vs leverage with Cook's distance).]
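As a quick additional check, the over-dispersion can be quantified by the Pearson-based estimate of the dispersion parameter, which should be close to the value reported by the quasi-Poisson fit in (b):
> sum(residuals(pois.model,type="pearson")^2)/pois.model$df.residual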
(b) Let’s fit a quasi-Poisson model:
> qpois.model <- glm(ca~offset(log.cells)+doseamt+doserate,family=quasipoisson,data=dicentric)
> summary(qpois.model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.06632 0.15430 -52.277 < 2e-16 ***
doseamt 0.66230 0.03456 19.163 4.72e-16 ***
doserate 0.15501 0.04881 3.176 0.00407 **
(Dispersion parameter for quasipoisson family taken to be 12.72344)
Residual deviance: 282.95 on 24 degrees of freedom
As expected, the parameter estimates for β0, β1 and β2 are the same as for the Poisson fit (pois.model2).
What has changed are the standard errors, which are now larger. This is because the quasi-Poisson model
estimates the dispersion parameter φ rather than fixing it at 1, thereby accounting for the over-dispersion.
The two predictors are still significant, but we now have more confidence in the inference as the standard
errors are more realistic. The residual plots in Figure 4 look much the same as for the Poisson GLM, except
that no outliers are flagged as influential in the last plot; this is because the extra variance is now accounted for.
[Figure 4: Residual plots for quasi-Poisson GLM.]
(c) Let’s fit a Negative-Binomial model:
Yi ∼ NegBin(λi, φ) Yi independent
log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i
where φ is the dispersion parameter.
> library(MASS) # contains the function glm.nb
> nb.model <- glm.nb(ca~offset(log.cells)+doseamt+doserate, data=dicentric)
> summary(nb.model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.97734 0.15274 -52.230 < 2e-16 ***
doseamt 0.66339 0.03821 17.362 < 2e-16 ***
doserate 0.15387 0.05057 3.043 0.00235 **
Residual deviance: 27.738 on 24 degrees of freedom
Given that this model estimates its own dispersion parameter, we don't have to worry about over-dispersion
in the same way. The standard errors and estimates are basically the same as for the quasi-Poisson model.
The residual plots in Figure 5 look much the same as for the Poisson GLM, except that the Q-Q plot is
slightly worse (but still acceptable) and, again, no outliers are flagged in the last plot.
[Figure 5: Residual plots for Negative Binomial GLM.]
(d) Let’s fit a Poisson model with an interaction:
Yi ∼ Pois(λi), Yi independent
log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i
We fit the model thus:
> pois.model3 <- glm(ca~offset(log.cells)+doseamt*doserate, data=dicentric, family=poisson(link="log"))
> summary(pois.model3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.90511 0.06160 -128.322 < 2e-16 ***
doseamt 0.61224 0.01707 35.862 < 2e-16 ***
doserate 0.06401 0.02922 2.191 0.028476 *
doseamt:doserate 0.02715 0.00765 3.549 0.000387 ***
Residual deviance: 270.26 on 23 degrees of freedom
> 1-pchisq(pois.model3$deviance,pois.model3$df.residual)
[1] 0
which indicates that the interaction is significant, but the model still does not fit. The residual plots in
Figure 6 are not much different to those of the original Poisson model, but the first plot still indicates some
structure, so it is perhaps worth trying non-linear terms of the covariates. So let’s fit a model with squared
terms and their interaction:
Yi ∼ Pois(λi), Yi independent
log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x1,i² + β5x2,i² + β6x1,i²x2,i²
[Figure 6: Residual plots for Poisson GLM with an interaction.]
fitted as follows:
> pois.model4 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+
I(doseamt^2)+I(doserate^2):I(doseamt^2),data=dicentric, family=poisson(link="log"))
> summary(pois.model4)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.866152 0.130777 -67.796 < 2e-16 ***
doseamt 1.502357 0.083908 17.905 < 2e-16 ***
doserate 0.092618 0.103556 0.894 0.371120
I(doserate^2) -0.029279 0.018999 -1.541 0.123306
I(doseamt^2) -0.151065 0.012153 -12.430 < 2e-16 ***
doseamt:doserate 0.092403 0.026724 3.458 0.000545 ***
I(doserate^2):I(doseamt^2) -0.002374 0.001025 -2.316 0.020552 *
Residual deviance: 44.299 on 20 degrees of freedom
> 1 - pchisq(pois.model4$deviance,pois.model4$df.residual)
[1] 0.001372009
All terms are significant, but the model still does not fit (although the p-value is larger than before). Let’s
look at the residual plots in Figure 7.
[Figure 7: Residual plots for Poisson GLM with an interaction and quadratic terms.]
The Q-Q plot is acceptable, but the residuals vs fitted values plot
indicates some signs of non-constant variance. Given that adding quadratic terms has helped, let’s try a
model with cubic terms and their interactions:
Yi ∼ Pois(λi), Yi independent
log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x1,i² + β5x2,i² + β6x1,i²x2,i² + β7x1,i³ + β8x2,i³ + β9x1,i³x2,i³
fitted as:
> pois.model5 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+
+ I(doseamt^2)+I(doserate^3)+I(doseamt^3)+I(doserate^2):I(doseamt^2)+
+ I(doserate^3):I(doseamt^3),data=dicentric, family=poisson(link="log"))
> summary(pois.model5)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.9262017 0.1837638 -48.574 <2e-16 ***
doseamt 1.4363203 0.1085311 13.234 <2e-16 ***
doserate 0.4193853 0.2739157 1.531 0.1258
I(doserate^2) -0.2805326 0.1222924 -2.294 0.0218 *
I(doseamt^2) -0.1428256 0.0144010 -9.918 <2e-16 ***
I(doserate^3) 0.0419961 0.0180033 2.333 0.0197 *
I(doseamt^3) NA NA NA NA
doseamt:doserate 0.1627393 0.0713913 2.280 0.0226 *
I(doserate^2):I(doseamt^2) -0.0087011 0.0064667 -1.346 0.1785
I(doserate^3):I(doseamt^3) 0.0001771 0.0001894 0.935 0.3499
Residual deviance: 26.041 on 18 degrees of freedom
> 1 - pchisq(pois.model5$deviance,pois.model5$df.residual)
[1] 0.09882702
There are not enough unique values of dose amount to estimate a cubic relationship, so that term is
dropped (reported as NA). Most of the remaining terms are significant and the model now fits at the 5%
level. Let’s look at the residual plots in Figure 8.
[Figure 8: Residual plots for Poisson GLM with an interaction and cubic terms.]
The residuals vs fitted values plot still shows some signs of non-constant variance, although note there are
not many data points. The Q-Q plot shows noticeable deviation at lower values, which is not ideal. No
outliers are flagged.
(e) Let’s look at the AIC values:
> pois.model5$aic
[1] 223.4579
> nb.model$aic
[1] 277.7973
indicating that the Poisson GLM with the cubic terms and the interaction is preferable. Given the interaction
and higher-order terms, the output is hard to interpret from the summary table. It is easier to plot the estimated
relationship of the mean abnormality rate with dose rate for each of the three values of dose amount in the
data, as shown in Figure 9:
pois.model6 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+
I(doseamt^2)+I(doserate^3)+I(doserate^2):I(doseamt^2)+
I(doserate^3):I(doseamt^3),data=dicentric, family=poisson(link="log"))
Drate <- seq(0.1,4,len=200)
Damt <- rep(1,200)
Offset <- rep(0,200) # so we are predicting the rate abnormal cells per (exp(0)=1) cell
preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset),
type="link",se.fit=T)
plot(Drate,exp(preds$fit),type="l",lwd=2,ylim=c(0,0.017))
lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2)
lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2)
Damt <- rep(2.5,200)
preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset),
type="link",se.fit=T)
lines(Drate,exp(preds$fit),lwd=2,col="red")
lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2,col="red")
lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2,col="red")
Damt <- rep(5,200)
preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset),
type="link",se.fit=T)
lines(Drate,exp(preds$fit),lwd=2,col="blue",type="l")
lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2,col="blue")
lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2,col="blue")
legend("topleft",c("dosemt=1","dosemt=2.5","dosemt=5"),lty=c(1,1,1),col=c("black","red","blue"))
[Figure 9: Estimated relationship for the Poisson model with interaction and cubic terms (curves for doseamt = 1, 2.5 and 5).]
Figure 9 indicates that for dose amount 1, the relationship is fairly flat. The mean response increases as
we go to dose amount 2.5 and even more when we go to 5, albeit in a non-linear way.
7. In lectures, we have seen data on the number of quarterly AIDS cases in the UK, yi, from January 1983 to March
1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as
xi. In this question we consider two competing models to describe the trend in the number of cases. Model 1
is
Yi ∼ Pois(λi), Yi independent
log(λi) = β0 + β1xi
and Model 2 is
Yi ∼ N(µi, σ²), Yi independent
log(µi) = γ0 + γ1xi
(a) Plot yi against xi and comment on whether the two proposed models are sensible. (3)
(b) Fit the two models in R, add the estimated trends from each model (λˆi and µˆi) on the plot from (7a) along
with approximate 95% confidence intervals on the mean and comment on the validity of each model
(based on the plot). For the confidence intervals you can assume that the sampling distribution of λˆi and
µˆi is approximately Gaussian with standard errors obtained from the R function
predict(...,type="response",se.fit=T). Also use the estimates and the formulation of the log-
likelihoods from (7a) to calculate the AIC for each model and thus comment on which model is preferable
according to this criterion. (5)
(c) Produce the deviance residuals vs fitted values (λˆi and µˆi) plot for each model, comment appropriately
and thus propose a way that the two models might be extended to improve the fit. (2)
(d) Implement the proposed extensions to each model, to arrive at a final version for each of them (justified
by appropriate hypothesis tests). (3)
(e) On the basis of the analogous plots as in (7b) and (7c) but also on arguments of model fit based on the
deviance and the AIC, comment on which (if any) of the two final models in (d) you would choose as the
best. Mention at least one reason why either model is not ideal. (9)
(f) Further extend the final Poisson model to a Negative Binomial model and comment on whether this model
is preferable to the other two, on the basis of all the criteria used for comparison so far. (3)
(25)
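No worked solution is provided for this question; as a hedged sketch of how the two models in (b) might be fitted, assuming the dataframe aids has the columns cases and date described above (the Gaussian log-link fit may need starting values, e.g. via mustart, if it struggles to converge):
> m1 <- glm(cases~date,family=poisson(link="log"),data=aids)
> m2 <- glm(cases~date,family=gaussian(link="log"),data=aids,mustart=aids$cases)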
8. The dataframe titanic2 contains an aggregated version of the titanic survival data. The age covariate has
been aggregated to “child” and “adult”.
(a) Fit a Binomial model with just the “main effects” of age_group, pclass and gender, and interpret the
estimates.
(b) To see whether the effect of age is different for males and females, and also whether the effect of class is
different for males and females, fit the model with an interaction between age_group and gender as well
as an interaction between pclass and gender, and interpret the estimates.
Solution
(a) The model we are fitting is:
Yi ∼ Bin(Ni, πi), Yi independent
logit(πi) = β0 + β1x1,i + β2x2,i + β3x3,i + β4x4,i
where Yi is the number survived; Ni is the total number of passengers; x1 = 1 if child and x1 = 0 if adult;
x2 = 1 if class 2 and x2 = 0 otherwise; x3 = 1 if class 3 and x3 = 0 otherwise; and x4 = 1 if male and
x4 = 0 otherwise. We fit this by:
> model1 <- glm(cbind(survived,total-survived)~age_group+pclass+gender,data=titanic2,
family=binomial(link="logit"))
> summary(model1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.2330 0.1810 12.340 < 2e-16 ***
age_groupchild 1.3834 0.2733 5.061 4.17e-07 ***
pclass2 -0.8656 0.2003 -4.322 1.54e-05 ***
pclass3 -1.8654 0.1819 -10.253 < 2e-16 ***
gendermale -2.7155 0.1542 -17.605 < 2e-16 ***
Residual deviance: 64.718 on 6 degrees of freedom
> 1 - pchisq(model1$deviance,model1$df.residual)
[1] 4.925949e-12
The model does not fit; however, let’s interpret the estimates anyway. β1 is the difference in the log-odds
of survival between children and adults. This is significant at the 5% level, so children have a higher chance
of survival than adults (log-odds difference 1.383). Class 2 passengers have a significantly smaller chance
of survival than class 1 passengers (-0.866), and class 3 passengers smaller still (-1.865). Lastly,
males have a significantly smaller chance of survival than females (-2.7155).
(b) The model with the two interactions is:
Yi ∼ Bin(Ni, πi), Yi independent
logit(πi) = β0 + β1x1,i + β2x2,i + β3x3,i + β4x4,i + β5x5,i + β6x6,i + β7x7,i
where x1, . . . , x4 are as in part (a), x5 = x1x4 (child and male), x6 = x2x4 (class 2 and male) and
x7 = x3x4 (class 3 and male). (Note it’s easier to fit the model in R before interpreting what the
covariates mean.) We fit the model using:
> model2 <- glm(cbind(survived,total-survived)~age_group+pclass+gender+age_group:gender+
+ pclass:gender,data=titanic2,family=binomial(link="logit"))
> summary(model2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.2195 0.7123 5.924 3.14e-09 ***
age_groupchild 0.6285 0.3491 1.800 0.071797 .
pclass2 -2.0915 0.7874 -2.656 0.007898 **
pclass3 -4.1222 0.7270 -5.671 1.42e-08 ***
gendermale -4.9856 0.7301 -6.829 8.58e-12 ***
age_groupchild:gendermale 1.2755 0.4787 2.664 0.007714 **
pclass2:gendermale 1.1960 0.8267 1.447 0.147961
pclass3:gendermale 2.8668 0.7584 3.780 0.000157 ***
Residual deviance: 15.38 on 3 degrees of freedom
> 1 - pchisq(model2$deviance,model2$df.residual)
[1] 0.001519179
We note that the model does not fit, but we interpret the results regardless. At this point, using the summary
table is rather confusing, so interpretation is best done using plots:
> library(effects)
> plot(predictorEffects(model2),lines=list(multiline=TRUE),
axes=list(grid=TRUE),confint=list(style="bars"))
which produces the plot shown in Figure 10. We assess significance using confidence intervals. Note that
confidence intervals that don’t overlap imply a statistically significant difference. Overlapping intervals
don’t necessarily mean the difference is not significant, although they do suggest weaker evidence; e.g.
overlapping intervals can still arise when the p-value is just below 0.05 (say 0.04).
Interpretation: Top left plot indicates that children are more likely to survive but the effect is only sig-
nificant in females. Also, females have significantly bigger chance of survival in both age groups. The
top right plot indicates that females have significantly bigger chance of survival across all 3 classes, but
the difference gets smaller as we move from class 1 to 2 to 3. The bottom left plot shows that there is
not much difference in female survival for the two age groups across the 3 classes. For males however,
children have significantly bigger chance of survival across all 3 classes. Also, male adults in class 1 have
significantly bigger chance of survival than male adults in classes 2 and 3, while for male children the
differences are not as pronounced. Female children in classes 1 and 2 have significantly bigger chance
of survival than ones in class 3, and the same can be said for adult women.
[Figure 10: Estimated relationships for the Binomial titanic model with interaction terms (predictor effect plots for age_group, pclass and gender).]
9. The module workspace, contains a data frame titanic which relates to 1309 passengers on the last voyage
of the ocean liner ‘Titanic’. The response variable survived is a binary variable where the value 1 means the
passenger survived the sinking. The data frame also contains predictors relating to passenger class (1st, 2nd,
3rd), gender, age and the fare amount each passenger paid. Passenger names are also available (for interest,
rather than for modelling).
(a) Fit a Bernoulli GLM with logistic link of survived with age, pclass and gender as predictors, as well
as all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in
conjunction with the R function drop1(), and interpret the final model in terms of parameter estimates
and their significance. Perform relevant model checking analysis (model fit and residuals).
(13)
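No worked solution is provided here; a hedged sketch of a possible starting point, assuming titanic has the columns survived, age, pclass and gender described above:
> m <- glm(survived~(age+pclass+gender)^2,family=binomial(link="logit"),data=titanic)
> drop1(m)   # AIC from dropping each two-way interaction in turn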
10. In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years
later, a follow-up study recorded outcomes for these residents. Only women who are current smokers or who have
never smoked are included. The resulting data set comprises 28 observations on the following 4 variables: y is the
observed count for a given combination, smoker is a factor with levels yes/no, dead is a factor with levels yes/no,
and age is a factor with age-group levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+. Interest here lies in
the effects of age and smoking on the probability of death.
(a) Investigate the association of age and smoking on the chance of dying, using Poisson log-linear models
for this contingency table. Clearly state any assumptions you are making about the sampling design.
Solution
(a) Assuming all margins are random, no particular terms need to be included by design. For reasons
of brevity, we avoid writing the models down mathematically, noting that the models are Poisson GLMs
with a log link. We start with the biggest model possible and reduce if necessary; this is the model with all
main effects, 2-way interactions and the 3-way interaction.
> model1 <- glm(y~smoker*dead*age,data=femsmoke,family=poisson)
> deviance (model1)
[1] 3.033713e-10
This is of course the saturated model, which is useless, so we remove the 3-way interaction (biggest
interaction) and check if the model fits:
> model2 <- glm(y~(smoker+dead+age)^2,data=femsmoke,family=poisson)
> 1-pchisq(deviance(model2),model2$df.residual)
[1] 0.8815485
The model fits, but we must still consider whether we can reduce further. Use function drop1() which
conducts a likelihood ratio test for all possibilities of removing a single 2-way interaction:
> drop1(model2,test="Chi")
Single term deletions
            Df Deviance    AIC    LRT Pr(>Chi)
<none>            2.38  180.58
smoker:dead  1    8.33  184.52   5.95  0.01475 *
smoker:age   6   92.63  258.83  90.25  < 2e-16 ***
dead:age     6  632.30  798.49 629.92  < 2e-16 ***
For all possibilities, p-values are less than 0.05 (and all AIC values larger) so we keep the current model
and simply interpret:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.54284 0.58736 0.924 0.355384
smokerno -0.29666 0.25324 -1.171 0.241401
deadno 3.43271 0.59014 5.817 6.00e-09 ***
age25-34 0.92902 0.68381 1.359 0.174273
age35-44 1.94048 0.62486 3.105 0.001900 **
age45-54 2.76845 0.60657 4.564 5.02e-06 ***
age55-64 3.37507 0.59550 5.668 1.45e-08 ***
age65-74 2.86586 0.60894 4.706 2.52e-06 ***
age75+ 2.02211 0.64955 3.113 0.001851 **
smokerno:deadno 0.42741 0.17703 2.414 0.015762 *
smokerno:age25-34 0.11752 0.22091 0.532 0.594749
smokerno:age35-44 0.01268 0.22800 0.056 0.955654
smokerno:age45-54 -0.56538 0.23585 -2.397 0.016522 *
smokerno:age55-64 0.08512 0.23573 0.361 0.718030
smokerno:age65-74 1.49088 0.30039 4.963 6.93e-07 ***
smokerno:age75+ 1.89060 0.39582 4.776 1.78e-06 ***
deadno:age25-34 -0.12006 0.68655 -0.175 0.861178
deadno:age35-44 -1.34112 0.62857 -2.134 0.032874 *
deadno:age45-54 -2.11336 0.61210 -3.453 0.000555 ***
deadno:age55-64 -3.18077 0.60057 -5.296 1.18e-07 ***
deadno:age65-74 -5.08798 0.61951 -8.213 < 2e-16 ***
deadno:age75+ -27.31727 8839.01146 -0.003 0.997534
We only discuss interactions with deadno, which are the effects of the two covariates on the probability
of not dying. Begin with smokerno:deadno. The effect is positive and significant, so non-smokers have
higher chances of not dying than smokers. Then deadno:age. The effect of deadno:age25-34 is not
significant, so there is no evidence of a difference in survival between age groups 18-24 and 25-34. The
effect of deadno:age35-44 is negative and significant, so age group 35-44 has a bigger chance of dying
than the baseline group. In fact the effect becomes more strongly negative (and remains significant) as
age group increases, as expected, so the chance of dying increases with age. The age group 75+ has a
very large standard error, implying there is very little information in the data to estimate this effect.
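On the odds scale the smoking effect is easier to read: since the smoker-by-dead interaction in this log-linear model is the log of the age-conditional odds ratio, the estimated odds ratio of not dying for non-smokers relative to smokers is exp(0.427) ≈ 1.53, i.e. roughly 53% higher odds of surviving at a given age:
> exp(0.42741)   # approximately 1.53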
11. In an archaeological study of animal husbandry, there is a particular interest in proportions of fragments of
animal bones of different types discovered in excavations in the south of England over four time eras. A data
set is collated from the excavation reports which consists of a three-way contingency table where cells contain
total numbers of bone fragments from all excavations cross-classified by animal type (‘sheep’ or ‘cattle’ or
‘pig’), by the surface geology of the excavation sites (‘valley terrace’ or ’other’) and by the era (‘early saxon’
or ‘middle saxon’ or ‘late saxon’ or ‘high medieval’). These data are contained in the dataframe bones in the
module workspace. Interest lies in the association of era and geology, with the number of different bone types.
Explore significant associations in these data using Poisson log-linear models. In particular, identify an appro-
priate final preferred model which suitably fits the data and then report what this model suggests concerning
significant associations between the numbers of bones of different types and surface geology and ditto with
era. What assumptions are you making about the sampling design for the study in carrying out your analysis
and what are the modelling implications of those? (15)
12. The dataframe ToothGrowth, included in the base R installation, see ?ToothGrowth, contains information on
the growth of teeth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2
mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
(a) Fit an Exponential GLM to length of teeth with two covariates, with the default link function and check
whether the model fits.
(b) Alternatively, fit this as a Gamma distribution and interpret the effect of the covariates.
(c) Check the residual plots of the Gamma model, and suggest a way that these can be improved.
(d) Implement your suggestion and check whether this has remedied the problem.
(e) Interpret the effects of the covariates from your final model.
Solution
(a) The model we are fitting is
Yi ∼ Exp(λi), Yi indep.
λi = 1/µi = β0 + β1x1,i + β2x2,i
where Yi is tooth length, x1,i = 1 if the supplement type is VC and 0 otherwise, and x2,i is the dose of
supplement. We fit this model in R using
> model <- glm(len~supp+dose,family=Gamma(link="inverse"),data=ToothGrowth)
> summary(model,dispersion=1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08456 0.01926 4.391 1.13e-05 ***
suppVC 0.00956 0.01329 0.720 0.4718
dose -0.02610 0.01090 -2.396 0.0166 *
(Dispersion parameter for Gamma family taken to be 1)
Residual deviance: 6.1123 on 57 degrees of freedom
As this is an Exponential model (the dispersion is fixed at 1), we must first check that it fits. If it does, then
the scaled deviance should come from a χ²_{n−p−1} distribution, where here n − p − 1 = 57. From
summary(model) we see that the scaled deviance 6.11 is very small compared to 57. The p-value of the
upper tail of the χ²_{57} is
> 1 - pchisq(model$deviance,model$df.residual)
[1] 1
The p-value is very large, implying that the model is a good fit. (Note that the deviance is extreme but low,
which is fine: the scaled deviance Dm/φ is minus twice the log of the likelihood ratio of the fitted model to
the saturated model, and values of that ratio close to 1 imply a good fit, which translates to small values of
the deviance.)
(b) Let’s also fit this model as a Gamma GLM, which makes more sense since the Exponential is just a
special case of the Gamma. The model is:
Yi ∼ Gamma(µi, φ), Yi indep.
1/µi = β0 + β1x1,i + β2x2,i
where φ is the dispersion parameter. In R, the model is still the same, it is only the summary() function
that changes:
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.084560 0.005931 14.257 < 2e-16 ***
suppVC 0.009560 0.004092 2.336 0.023 *
dose -0.026099 0.003355 -7.778 1.61e-10 ***
(Dispersion parameter for Gamma family taken to be 0.09485256)
Residual deviance: 6.1123 on 57 degrees of freedom
We do not need to judge this model's fit against a fixed-dispersion scaled deviance, because it has its own
dispersion parameter (estimated to be 0.0949). Note however that the standard errors differ from those of
the Exponential GLM, so we would prefer this model on the basis that it can reduce to the Exponential GLM
if necessary. Predictor suppVC has a significant positive effect on the inverse mean, implying that taking
the vitamin C supplement using method “VC” reduces the mean tooth length compared to taking it using
method “OJ”. Predictor dose has a significant negative effect on the inverse mean, meaning that the higher
the dose, the longer the tooth on average (a negative effect on the inverse mean corresponds to a larger mean).
(c) Figure 11 shows, from left to right, 1) the standardised deviance residuals vs linear predictor plot, 2) the
Q-Q plot of the standardised deviance residuals and 3) the standardised Pearson residuals vs leverage
plot along with Cook’s distance contours. These were obtained by:
> plot(model,which=1,pch=20)
> plot(model,which=2,pch=20)
> plot(model,which=5,pch=20)
[Figure 11: Residual plots (residuals vs fitted, Normal Q-Q, standardised Pearson residuals vs leverage).]
The first plot on the left indicates a quadratic relationship rather than random scatter about the horizontal
zero line; perhaps we need to consider squared terms of dose in the model. The Q-Q plot shows systematic
deviation from the diagonal at both ends, indicating that the Gamma distribution may not be entirely
appropriate for these data; however, the deviations are small, so we may accept the fit while noting that
there is room for improvement. The last plot does not show any particular outliers of concern, as all the
points have low Cook's distance (no Cook's distance contours are even visible). The first plot may be
improved by including non-linear effects; in particular we could try a quadratic term of the dose covariate
(the other covariate is a factor) and the 2-way interactions.
(d) So let’s try to fit a model with a squared term of dose and an interaction with supp (for both dose and
dose squared). This is a model with a different quadratic relationship in dose for each type of supplement.
Mathematically:
Yi ∼ Gamma(µi, φ), Yi indep.
1/µi = β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x2,i² + β5x1,ix2,i²
We fit as follows:
> ToothGrowth$doseSq <- ToothGrowth$dose^2
> model2 <- glm(len~supp+dose+doseSq+supp:dose+supp:doseSq,family=Gamma(link="inverse"),
data=ToothGrowth)
> summary(model2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12625 0.01630 7.745 2.55e-10 ***
suppVC 0.10141 0.03091 3.281 0.001817 **
dose -0.12045 0.02782 -4.330 6.50e-05 ***
doseSq 0.03826 0.01008 3.795 0.000376 ***
suppVC:dose -0.12091 0.05141 -2.352 0.022367 *
suppVC:doseSq 0.03507 0.01833 1.914 0.060956 .
(Dispersion parameter for Gamma family taken to be 0.05472392)
Residual deviance: 3.0094 on 54 degrees of freedom
The squared term and the interaction terms are all significant (at least at the 10% level) so let’s check
residuals in Figure 12. The standardised deviance residuals vs linear predictor plot is much better now,
and so is the Q-Q plot, so we can be content with model fit.
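A formal comparison of the two nested Gamma GLMs can also be made with an F test (appropriate when the dispersion parameter is estimated), as a quick cross-check:
> anova(model,model2,test="F")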
[Figure 12: Residual plots (residuals vs fitted, Normal Q-Q, standardised Pearson residuals vs factor levels of supp).]
(e) Interpreting the model: when x1 = 0 (supplement OJ), the linear predictor (inverse mean) as a function of
dose is 0.126 − 0.120x2 + 0.038x2², while when x1 = 1 (supplement VC) it is (0.126 + 0.101) − (0.120 +
0.121)x2 + (0.038 − 0.035)x2². The significance of the interaction terms implies that these two curves are
significantly different. This is hard to interpret directly due to the quadratic form, so it is better to use
predictions of µ and visualise them, as shown in Figure 13. It is interesting how the two lines indicate
that the differences in the dose effect between the two values of supp are more pronounced for doses
below 1, while above 1 they are not significantly different, as indicated by the overlap in the confidence
intervals. (The CIs of the mean are calculated by transforming Normal 95% confidence intervals for the
linear predictor through the inverse link.) The plots were obtained using the following code:
doseSeq <- seq(0.5,2,length=200)
predsOJ <- predict(model2,newdata=data.frame(dose=doseSeq,doseSq=doseSeq^2,supp="OJ"),type="link",se.fit=T)
predsVC <- predict(model2,newdata=data.frame(dose=doseSeq,doseSq=doseSeq^2,supp="VC"),type="link",se.fit=T)
muOJ <- 1/predsOJ$fit
## note: the inverse link is decreasing, so "_upper"/"_lower" below refer to the bound on the
## linear predictor (the "+1.96 s.e." bound corresponds to the lower limit of the mean)
muOJ_upper <- 1/(predsOJ$fit+1.96*predsOJ$se.fit)
muOJ_lower <- 1/(predsOJ$fit-1.96*predsOJ$se.fit)
muVC <- 1/predsVC$fit
muVC_upper <- 1/(predsVC$fit+1.96*predsVC$se.fit)
muVC_lower <- 1/(predsVC$fit-1.96*predsVC$se.fit)
x11(width=8,height=6) ## might have to use win.graph() in Windows
par(mar = c(4, 4, 1, 1),cex=1.3,lwd=2) ## sets the plot margins for "nicer" look
plot(ToothGrowth$dose,ToothGrowth$len,pch=20,ylim=c(0,45))
lines(doseSeq,muOJ)
lines(doseSeq,muOJ_upper,lty=2)
lines(doseSeq,muOJ_lower,lty=2)
lines(doseSeq,muVC,col="red")
lines(doseSeq,muVC_upper,lty=2,col="red")
lines(doseSeq,muVC_lower,lty=2,col="red")
[Figure 13: Estimated mean growth as a function of dose (OJ and VC curves with confidence bands).]
13. Consider the data frame gehan in the module R workspace. This involves a trial of 42 leukaemia patients
where some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed
as matched pairs, both withdrawn from the trial when either came out of remission (more info can be found by
typing ?gehan having first loaded the R package MASS).
Variable cens indicates that some observations are censored (cens=0). This means that the patient was
removed or dropped out from the trial before their remission time could be recorded. The only data available for
such patients is the time they were removed, i.e. that they didn’t come out of remission before this removal
time. When cens=0, variable time records the drop-out time rather than the remission time.
(a) Create a new dataframe by removing the censored observations and use the command
gehan$treat <- relevel(gehan$treat, "control")
to ensure that it is the “control” that gets buried in the intercept. Write down and fit an Exponential
GLM with an inverse link, with time as the response and treat as the covariate. Discuss the effects of
treatment, and check whether the model fits. (5)
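No worked solution is given for this part; a hedged sketch of the fit described, assuming gehan has the columns time, cens and treat (as in MASS) and dropping the censored rows first:
> library(MASS)                                      # for the gehan data
> gehan2 <- gehan[gehan$cens==1,]                    # keep only uncensored remission times
> gehan2$treat <- relevel(gehan2$treat,"control")
> exp.model <- glm(time~treat,family=Gamma(link="inverse"),data=gehan2)
> summary(exp.model,dispersion=1)                    # Exponential = Gamma with dispersion fixed at 1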
(b) Discarding censored observations simplifies the situation, however this means throwing away potentially
useful information (even if partial). To fit a model with all the data we can instead alter the likelihood
contributions of the censored and non-censored data points:
L(λi; ti) = p(ti; λi) if no censoring
L(λi; ti) = Pr(t > ti; λi) = 1 − Pr(t < ti; λi) if censoring
where p(ti; λi) indicates the Exponential probability density function for survival times ti, and Pr(t < ti; λi)
is the Exponential cumulative distribution function. The likelihood can then be written as:
L(λi; t1, . . . , tn) = ∏_{i=1}^{n} p(ti; λi)^{ci} [1 − Pr(t < ti; λi)]^{1−ci}
where ci = 1 for no censoring and ci = 0 for censoring. Show that for the exponential distribution
p(ti; λi) = λie^{−λiti}, the likelihood can be written as:
L(λi; t1, . . . , tn) = ∏_{i=1}^{n} (λiti)^{ci} e^{−λiti} / ti^{ci}
and note that this likelihood is equivalent to the Poisson likelihood up to a constant. One can therefore
fit this model using a Poisson model with mean λiti where the data are the ci. Write down this Poisson
model mathematically, such that λi = β0+β1xi where xi is the treatment variable. Fit this Poisson model
in R, check that it fits, and compare the treatment effect with the Exponential model on the ‘thinned’ data
from part (a). (Hint: to fit a model without an intercept, add a “−1” in the model formula of the glm() call.
Also, you will need to use the command gehan$treat <- relevel(gehan$treat, "6-MP") before
fitting the model, to revert the changes from part (a).) (14)
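As a hedged sketch of the algebra behind the displayed result (each observation contributes a density term if uncensored and a survival term e^{−λiti} = 1 − Pr(t < ti; λi) if censored):
L(λi; t1, . . . , tn) = ∏_{i=1}^{n} (λie^{−λiti})^{ci} (e^{−λiti})^{1−ci} = ∏_{i=1}^{n} λi^{ci} e^{−λiti} = ∏_{i=1}^{n} (λiti)^{ci} e^{−λiti} / ti^{ci},
which equals the Poisson(λiti) likelihood for the "data" ci, namely ∏ (λiti)^{ci} e^{−λiti}/ci! with ci! = 1 since ci ∈ {0, 1}, apart from the constant ∏ ti^{−ci}.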
(19)
(100)