MTHM506 Statistical Data Modelling Outline Solutions Problem Sheet 1 Topic 1 1. In the lecture notes, there is an example relating to a study of the number of fish caught of differing sizes in a net of a particular mesh size. We developed a binomial GLM for probability of escape from net with a single predictor x, the fish length in cm. The model for the data y1, . . . , yn (fish escaping) is Yi ∼ Bin(Ni, pi) Yi independent log ( pi 1− pi ) = β0 + β1xi where Ni (total fish number) has also been observed and xi is fish length. The notes contain details of the relevant log-likelihood and code the R script topic1.R directly maximises the log-likelihood of this binomial model using the data set fish using the R (optimiser) function nlm() and reports the maximum likelihood estimates for β0 and β1 along with their standard errors and other model diagnostics. (a) Extend the R code in order to fit the modified model: yi ∼ Bin(ni, pi) log ( pi 1− pi ) = β0 + β1xi + β2x 2 i Report the maximum likelihood estimates for β0 , β1 and β2 along with their standard errors and the AIC for the modified model. (b) Calculate the log likelihood ratio statistic relevant to a formal comparison of difference in fit between the original and the modified model. Perform a formal test of the hypothesis that there is no significant difference in fit between the original and the modified model. Solution (a) From the lecture notes, the log-likelihood of the model is: `(y;β0, β1) = const + n∑ i=1 [ yi(β0 + β1xi + β2x 2 i )− ni log(1 + eβ0+β1xi+β2x 2 i ) ] . Create a function in R to evaluate −`(y;β0, β1, β2) (because nlm() will minimise, we work with minus the log-likelihood): > fn <- function (beta){ > sum(-fish$escape*(beta[1]+beta[2]*fish$length+beta[3]*fish$length^2) + > fish$total*log(1+exp(beta[1]+beta[2]*fish$length+beta[3]*fish$length^2)))} Maximise the likelihood with starting values β0 = 1, β1 = 0 and β2 = 0 (random): > out <- nlm(fn,p=c(1,0,0),hessian=T,iterlim=10000,steptol=1e-10) > out$estimate [1] 31.20001576 -1.30410232 0.01018133 Compare with a binomial GLM: > bmodel <- glm(cbind(fish$escape,fish$total-fish$escape)~length+I(length^2),family=binomial, data=fish) > summary(bmodel) Coefficients: (Intercept) 31.35398 5.94186 5.277 1.31e-07 *** length -1.31181 0.28843 -4.548 5.41e-06 *** I(length^2) 0.01027 0.00349 2.944 0.00324 ** 1 Parameter estimates and standard errors for both models are comparable. The model suggests that fish length significantly decreases the odds for a fish escaping. A plot of ob- served vs fitted proportions of fish escaped shows that the model fits rather well (Figure 1): > plot(fish$length,fish$escape/fish$total) > lines(fish$length,bmodel$fitted.values) l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 20 30 40 50 60 70 80 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 Observed and fitted proportions of fish escaped length Figure 1: Proportion of fish escaped vs their length (b) Create function to calculate the likelihood of original model and minimise: > fn2 <- function (beta){ > sum(-fish$escape*(beta[1]+beta[2]*fish$length) + > fish$total*log(1+exp(beta[1]+beta[2]*fish$length)))} > out2 <- nlm(fn2,p=c(1,0),hessian=T,iterlim=10000,steptol=1e-10) likelihood ratio test is minus twice the log of the ratio between the two likelihoods evaluated at the max. likelihood estimates so > LR <- -2*( (-out2$minimum)-(-out$minimum) ) We now compare this to a χ21 distr. with 40− 39 = 1 degrees of freedom: > 1-pchisq(LR,1) ## p-value [1] 0.3130268 The p-value is larger than 0.05 so the two models are not significantly different so the larger model is redundant. 2. The module’s R workspace, contains a data frame nlmodel which involves data on a response variable y and a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship. Suppose for these data we wish to consider the model: Yi ∼ N ( θ1xi θ2 + xi , σ2 ) Yi independent. (a) Why can’t this model be fitted as a linear (regression) model? (1) (b) Write down the likelihood L(θ1, θ2, σ2;y,x) and the log-likelihood `(θ1, θ2, σ2;y,x). (2) (c) Write an R function mylike() which evaluates −`(θ1, θ2, σ2;y,x) for any values of the three parameters. (1) 2 (d) Use the R function nlm() in association with your function mylike() to numerically minimise the log- likelihood. Provide some evidence of how you chose sensible starting values. (3) (e) Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated mean relationship on a scatter plot of y versus x. (2) (f) Report the standard errors for θ1 and θ2, and use those to construct 95% confidence intervals. (5) (g) Test the hypothesis that θ2 = 0.08 at the 5% significance level (not using the confidence interval) and compute the associated p-value of the test. (3) (h) Use plug-in prediction to construct and plot 95% prediction intervals. (4) (21) 3. The dataframe sexr contains a variable ratio which is the ration of male births to female births across various countries. Interest lies in seeing how this relates to a deprivation measure, namely infant mortality tinmort, so it was suggested to use a Gamma distribution (since the ratio is strictly positive) with a mean that is a function of tinmort. Specifically: Yi ∼ Gamma(µi, λ) Yi indep. log(µi) = β0 + β1xi where xi is the year and where the pdf of the Gamma is p (yi) = yα−1i e −yi/λi Γ (α)λαi , yi > 0, (1) which has mean E [Yi] = µi = αλi and variance var [Yi] = σ2i = αλ 2 i . Parameter α is called the shape parameter, while λi is the scale parameter. (a) Write down the likelihood and the log-likelihood of this model. (b) Use R to estimate the unknown parameters and report the standard errors. (c) Plot the estimated relationship. (d) Test the hypothesis that β1 = 0. (e) Predict the ratio for the infant mortality value of 0.4 and report a 95% prediction interval. Solution (a) The Yi are independent so the likelihood is a product of the individual pdfs. Better to express the pdf in terms of the mean µi = exp{β0 + β1xi}, where λi = µi/α p (yi) = yα−1i e −yiα/µi Γ (α) (µi/α)α so the likelihood is: L(β0, β1, α;y) = n∏ i=1 yα−1i e −yα/µi Γ (α) (µi/α)α = 1 Γ(α)n ∏ yα−1i e−yiα/µi (µi/α)α and so the log-likelihood is (noting that the log of a product in the sum of the logs): `(β0, β1, α;y) = −n log(Γ(α)) + n∑ i=1 log ( yα−1i e −yiα/µi (µi/α)α ) = −n log(Γ(α)) + n∑ i=1 (α− 1) log(yi)− (yiα)/µi − α log(µi/α) = −n log(Γ(α)) + n∑ i=1 (α− 1) log(yi)− (yiα)/µi − α log(µi) + α log(α) = −n log(Γ(α)) + n∑ i=1 (α− 1) log(yi)− yiα exp{β0 + β1xi} − α log(exp{β0 + β1xi}) + α log(α) = −n log(Γ(α)) + n∑ i=1 (α− 1) log(yi)− yiα exp{β0 + β1xi} − α(β0 + β1xi) + α log(α) 3 (b) The R function to compute minus the log-likelihood is: llhood <- function(theta){ n <- nrow(sexr) result <- -n*lgamma(theta[3]) + sum( (theta[3]-1)*log(sexr$ratio) - (sexr$ratio*theta[3])/(exp(theta[1]+theta[2]*sexr$tinmort)) - theta[3]*(theta[1]+theta[2]*sexr$tinmort) + theta[3]*log(theta[3]) ) -result } where the following initial values seem sensible as per Figure 2: > inits <- c(0.05,0,10) > plot(sexr$tinmort,sexr$ratio,pch=20) > xx <- seq(0,0.3,len=200) > lines(xx,exp(inits[1]+inits[2]*xx),col="red") 0.00 0.05 0.10 0.15 0.20 0.25 0.30 1. 02 1. 04 1. 06 1. 08 1. 10 sexr$tinmort se xr $ra tio Figure 2: Ratio vs infant mortality along with initial values guess for β0 and β1 in blue and the estimated mean relationship in red. The estimates and standard errors are: > out <- nlm(llhood,p=inits,hessian=T,iterlim=10000,steptol=1e-10) > round(out$estimate,4) [1] 0.0560 -0.0253 11417.7943 > OIM <- solve(out$hessian) > VarianceBeta <- diag(OIM) > stand_error <- sqrt(VarianceBeta) > round(stand_error,4) [1] 0.0003 0.0028 370.9105 (c) The following code the fitted line to Figure 2: lines(xx,exp(out$estimate[1]+out$estimate[2]*xx),col="red",lwd=3) (d) The estimate of β1 is -0.0253 and the standard error is 0.0028 so we construct a hypothesis test H0 : β1 = 0 vs Ha : β1 6= 0 with test statistic: Z = βˆ1 − 0 s.e.(βˆ1) = −0.0253 0.0028 with p-value: > z_test <- out$estimate[2]/stand_error[2] > p_value <- 2*(pnorm(z_test,0,1)) > p_value [1] 3.990903e-19 which is much smaller than 0.05 so H0 is rejected. 4 (e) Our prediction is of course the mean at xi = 0.4, i.e. µi = exp{βˆ0 + βˆ1 × 0.4}: > exp(out$estimate[1] + out$estimate[2]*0.4) [1] 1.046889 To construct the pred. interval we use the 2.5% and 97.5% quantile of the Gamma distribution: > mu <- exp(out$estimate[1] + out$estimate[2]*0.4) > mu [1] 1.046889 > shape <- out$estimate[3] > scale <- mu/shape > qgamma(0.025,shape=shape,scale=scale) [1] 1.027773 > qgamma(0.975,shape=shape,scale=scale) [1] 1.066178 so the prediction is 1.047 with a 95% prediction interval [1.028,1.066]. Topic 2 4. In the lecture slides for Topic 2, we argued that the Normal, Binomial and Poisson distribution all belong to the exponential family p (y; θ, φ) = exp { (yθ − b (θ)) a(φ) + c (y, φ) } . Show that this is the case by using the fact that a mathematical function f(x) can be written as exp{log(f(x))}. Solution Normal: The pdf of the Normal is: p(y;µ, σ2) = 1√ 2piσ2 exp { − 1 2σ2 (y − µ)2 } which we can rewrite as p(y;µ, σ2) = exp { log ( 1√ 2piσ2 )} exp { − 1 2σ2 (y − µ)2 } = exp { log ( 1√ 2piσ2 ) − 1 2σ2 (y − µ)2 } = exp { − 1 2σ2 (y2 − 2yµ+ µ2) + log ( 1√ 2piσ2 )} = exp { yµ− µ2/2 σ2 − y 2 2 − log( √ 2piσ2) } = exp { yµ− µ2/2 σ2 − 1 2 [ y2 σ2 + log(2piσ2) ]} So we have that the Normal distribution is in the exponential family with: θ = µ, a(φ) = φ = σ2, b(θ) = µ 2 2 = θ2 2 and c(y, φ) = − 12 [ y2 σ2 + log(2piσ 2) ] . Binomial: The pmf of the Binomial is: p(y;N, pi) = ( N y ) piy(1− pi)N−y 5 which we can rewrite as p(y;N, pi) = exp { log (( N y ) piy(1− pi)N−y )} = exp { log (piy) + log ( (1− pi)N−y)+ log(N y )} = exp { y log (pi) + (N − y) log (1− pi) + log ( N y )} = exp { y log(pi)− y log(1− pi) +N log(1− pi) + log ( N y )} = exp { y log ( pi 1− pi ) +N log(1− pi) + log ( N y )} So the Binomial fits into the exponential family, with: θ = log ( pi 1−pi ) , a(φ) = 1, b(θ) = N log(1 − pi) = N log(1 + eθ) and c(y, φ) = log ( N y ) . Poisson: The pmf of the Poisson is: p(y;λ) = λye−λ y! which we can rewrite as p(y;λ) = exp { log ( λye−λ y! )} = exp { log(λy)− λ+ log ( 1 y! )} = exp {y log(λ)− λ− log(y!)} So the Poisson fits into the exponential family, with: θ = log(λ), a(φ) = 1, b(θ) = λ = eθ and c(y, φ) = − log(y!). 5. Consider the Gamma distribution p (y) = yα−1e−y/λ Γ (α)λα , y > 0, which has mean E [Y ] = µ = αλ and variance var [Y ] = σ2 = αλ2. (a) Rewrite the probability distribution p(y) in terms of the mean µ and parameter α. (1) (b) Thus show that it belongs to the exponential family, identifying parameters θ and functions a(·), b (·) and c (·) in terms of α and µ. (3) (c) Verify the expressions for the mean and the variance of the distribution using the associated general results for the exponential family. (3) (7) 6. The module workspace contains a data frame dicentric which presents data from an experiment conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities ca observed. The number of cells exposed (in hundreds) in each run, differs and is contained in the variable cells. The dose amount doseamt and the rate doserate at which the dose is applied are the predictors of interest. (a) Directly model the rate, i.e. number of abnormalities per cell by using a Poisson GLM with response ca, with a log link, with doseamt and doserate as predictors and with a suitable offset involving the numbers of cells (cells). Comment on the model summary - parameter estimates, standard errors, model fit and residual plots. 6 (b) If your final model from (a) is substantially over-dispersed, then fit a quasi-Poisson model instead and comment on the differences between that and the results from (a). (c) Alternatively, fit a Negative Binomial model and comment on the differences between that and the results from (a). (d) It was suggested that there might be an interaction between dose rate and dose amount, on the basis that the effect of dose rate is not independent of dose amount (and vice versa). Fit a Poisson GLM with the interaction, and also non-linear effects if deemed necessary. (e) Use the AIC to choose between the Negative Binomial model and the Poisson GLM with the interactions, and discuss what this model tells us about the relationship of the covariates to the response. Solution (a) We want to model the rate ρ (number of abnormalities per 100 cells) using a Poisson model with log-link so we need to include an offset involving cells. We specify the model Yi ∼ Pois(λi) Yi independent log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i where Yi is the number of abnormalities, “NoCells” is the total number of cells, x1 is the dose amount and x2 is the dose rate. Fit this in R: > dicentric$log.cells <- log(dicentric$cells) > pois.model <- glm(ca~offset(log.cells)+doseamt+doserate,family=poisson,data=dicentric) > summary(pois.model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.461150 0.043257 -80.01 <2e-16 *** doseamt 0.662302 0.009689 68.35 <2e-16 *** doserate 0.155012 0.013685 11.33 <2e-16 *** Residual deviance: 282.95 on 24 degrees of freedom Note that cells is the number of cells in 100s so we are modelling here the rate per 100 cells. If we wanted to model the rate per individual cell then we would be fitting the model: > dicentric$log.cells <- log(dicentric$cells*100) > pois.model2 <- glm(ca~offset(log.cells)+doseamt+doserate,family=poisson,data=dicentric) > summary(pois.model2) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -8.066321 0.043257 -186.47 <2e-16 *** doseamt 0.662302 0.009689 68.35 <2e-16 *** doserate 0.155012 0.013685 11.33 <2e-16 *** Residual deviance: 282.95 on 24 degrees of freedom and note that the two models are exactly the same except that the intercept β0 is different. Clearly both variables doseamt and doserate are significant, both having a positive effect on the rate. Let’s look at the residual plots in Figure 3: The residuals vs fitted values plot on the left shows some structure but keep in mind the low number of observations. Normal Q-Q plot also is acceptable. The third plot indicates some outliers which have both high leverage and a large Cook’s distance. No real reason to remove these points though, as we can’t confirm that they are erroneous. Check whether model fits: > 1-pchisq(pois.model$deviance,pois.model$df.residual) [1] 0 7 3.5 4.0 4.5 5.0 5.5 6.0 − 5 0 5 Predicted values R es id ua ls Residuals vs Fitted 145 19 −2 −1 0 1 2 − 5 0 5 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 19 1415 0.00 0.05 0.10 0.15 0.20 0.25 0.30 − 5 0 5 Leverage St d. P e a rs o n r e si d. Cook's distance 1 0.5 0.5 1 Residuals vs Leverage 19 9 27 Figure 3: Residual plots for Poisson GLM. implying that the model does not fit (it is over-dispersed). (b) Let’s fit a quasi-Poisson model: > qpois.model <- glm(ca~offset(log.cells)+doseamt+doserate,family=quasipoisson,data=dicentric) > summary(qpois.model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -8.06632 0.15430 -52.277 < 2e-16 *** doseamt 0.66230 0.03456 19.163 4.72e-16 *** doserate 0.15501 0.04881 3.176 0.00407 ** (Dispersion parameter for quasipoisson family taken to be 12.72344) Residual deviance: 282.95 on 24 degrees of freedom As expected the parameter estimates for β0, β1 and β2 are the same as with pois.model. What has changed though are the standard errors which are now larger. This is because the quasi model delib- erately inflates the dispersion parameter φ to account for over-dispersion. The two parameters are still significant but now we have more confidence in the inference as the standard errors are more realistic. The residual plots in Figure 4 look exactly the same as with the Poisson GLM, except that there not longer any outliers flagged as influential on the last plot. This is because there is now more variance to capture those. 3.5 4.0 4.5 5.0 5.5 6.0 − 5 0 5 Predicted values R es id ua ls Residuals vs Fitted 145 19 −2 −1 0 1 2 − 2 − 1 0 1 2 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 19 1415 0.00 0.05 0.10 0.15 0.20 0.25 0.30 − 2 − 1 0 1 2 Leverage St d. P e a rs o n r e si d. Cook's distance 0.5 0.5 Residuals vs Leverage 19 9 27 Figure 4: Residual plots for quasi-Poisson GLM. (c) Let’s fit a Negative-Binomial model: Yi ∼ NegBin(λi, φ) Yi independent log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i where φ is the dispersion parameter 8 > library(MASS) # contains the function glm.nb > nb.model <- glm.nb(ca~offset(log.cells)+doseamt*doserate, data=dicentric) > summary(nb.model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -7.97734 0.15274 -52.230 < 2e-16 *** doseamt 0.66339 0.03821 17.362 < 2e-16 *** doserate 0.15387 0.05057 3.043 0.00235 ** Residual deviance: 27.738 on 24 degrees of freedom Given this model has a dispersion parameter, we don’t have to worry about whether the model fits well (is overdispersed). The standard errors and estimates are basically the same as the quasiPoisson model. The residual plots in Figure 5 look exactly the same as with the Poisson GLM, except that the Q-Q plot is slightly worse (but still acceptable) and again, no outliers flagged in the last plot. 3.5 4.0 4.5 5.0 5.5 6.0 − 2 − 1 0 1 2 Predicted values R es id ua ls Residuals vs Fitted 1415 16 −2 −1 0 1 2 − 1 0 1 2 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 1415 19 0.00 0.05 0.10 0.15 0.20 − 2 − 1 0 1 2 Leverage St d. P e a rs o n r e si d. Cook's distance 0.5 0.5 Residuals vs Leverage 19 9 18 Figure 5: Residual plots for Negative Binomial GLM. (d) Let’s fit a Poisson model with an interaction: cai ∼ Pois(λi) Yi independent log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i We fit the model thus: > pois.model3 <- glm(ca~offset(log.cells)+doseamt*doserate, data=dicentric, family=poisson(link="log")) > summary(pois.model3) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -7.90511 0.06160 -128.322 < 2e-16 *** doseamt 0.61224 0.01707 35.862 < 2e-16 *** doserate 0.06401 0.02922 2.191 0.028476 * doseamt:doserate 0.02715 0.00765 3.549 0.000387 *** Residual deviance: 270.26 on 23 degrees of freedom > 1-pchisq(pois.model3$deviance,pois.model3$df.residual) [1] 0 which indicates that the interaction is significant, but the model does not fit. Let’s look at residual plots in Figure 6 Well not much different to the original Poisson model, but still the first plot indicates some structure, so perhaps worth trying non-linear terms of the covariates. So let’s fit a model with squared terms and their interaction cai ∼ Pois(λi) Yi independent log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x 2 1,i + β5x 2 2,i + β6x 2 1,ix 2 2,i 9 3.5 4.0 4.5 5.0 5.5 6.0 − 6 − 4 − 2 0 2 4 6 8 Predicted values R es id ua ls Residuals vs Fitted 154 16 −2 −1 0 1 2 − 6 − 4 − 2 0 2 4 6 8 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 1514 19 0.0 0.1 0.2 0.3 0.4 0.5 − 5 0 5 Leverage St d. P e a rs o n r e si d. Cook's distance 1 0.5 0.5 1 Residuals vs Leverage 19 279 Figure 6: Residual plots for Poisson GLM with an interaction. fitted as follows: > pois.model4 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+ I(doseamt^2)+I(doserate^2):I(doseamt^2),data=dicentric, family=poisson(link="log")) > summary(pois.model4) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -8.866152 0.130777 -67.796 < 2e-16 *** doseamt 1.502357 0.083908 17.905 < 2e-16 *** doserate 0.092618 0.103556 0.894 0.371120 I(doserate^2) -0.029279 0.018999 -1.541 0.123306 I(doseamt^2) -0.151065 0.012153 -12.430 < 2e-16 *** doseamt:doserate 0.092403 0.026724 3.458 0.000545 *** I(doserate^2):I(doseamt^2) -0.002374 0.001025 -2.316 0.020552 * Residual deviance: 44.299 on 20 degrees of freedom > 1 - pchisq(pois.model4$deviance,pois.model4$df.residual) [1] 0.001372009 All terms are significant but the model still does not fit (although the p-value is bigger than before). Let’s look at residual plots in Figure 7 The QQ plot is acceptable but the residuals vs predicted values plot 3.5 4.0 4.5 5.0 5.5 6.0 − 3 − 2 − 1 0 1 2 Predicted values R es id ua ls Residuals vs Fitted 10 19 20 −2 −1 0 1 2 − 2 0 2 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 19 10 27 0.0 0.2 0.4 0.6 0.8 − 4 − 2 0 2 4 Leverage St d. P e a rs o n r e si d. Cook's distance 1 0.5 0.5 1 Residuals vs Leverage 27 1910 Figure 7: Residual plots for Poisson GLM with an interaction and quadratic terms. indicates some signs of non-constant variance. Given that adding quadratic terms has helped, let’s try a model with cubic terms and their interactions: cai ∼ Pois(λi) Yi independent log(λi) = log(NoCellsi × ρi) = log(cellsi) + β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x 2 1,i + β5x 2 2,i + β5x 2 2,i + β6x 2 1,ix 2 2,i + + β7x 3 1,i + β8x 3 2,i + β9x 3 1,ix 3 2,i 10 fitted as: > pois.model5 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+ + I(doseamt^2)+I(doserate^3)+I(doseamt^3)+I(doserate^2):I(doseamt^2)+ + I(doserate^3):I(doseamt^3),data=dicentric, family=poisson(link="log")) > summary(pois.model5) Coefficients: (1 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) -8.9262017 0.1837638 -48.574 <2e-16 *** doseamt 1.4363203 0.1085311 13.234 <2e-16 *** doserate 0.4193853 0.2739157 1.531 0.1258 I(doserate^2) -0.2805326 0.1222924 -2.294 0.0218 * I(doseamt^2) -0.1428256 0.0144010 -9.918 <2e-16 *** I(doserate^3) 0.0419961 0.0180033 2.333 0.0197 * I(doseamt^3) NA NA NA NA doseamt:doserate 0.1627393 0.0713913 2.280 0.0226 * I(doserate^2):I(doseamt^2) -0.0087011 0.0064667 -1.346 0.1785 I(doserate^3):I(doseamt^3) 0.0001771 0.0001894 0.935 0.3499 Residual deviance: 26.041 on 18 degrees of freedom > 1 - pchisq(pois.model5$deviance,pois.model5$df.residual) [1] 0.09882702 There are not enough unique values of dose amount to estimate a cubic relationship, so the term is ignored. However all the other terms are significant and the model now fits at the 5% level. Let’s look at residual plots in Figure 8. The residual vs predicted values still show signs of non-constant variance, 3.5 4.0 4.5 5.0 5.5 6.0 − 2 − 1 0 1 2 3 Predicted values R es id ua ls Residuals vs Fitted 20 2210 −2 −1 0 1 2 − 2 − 1 0 1 2 3 Theoretical Quantiles St d. P e a rs o n r e si d. Normal Q−Q 20 22 10 0.0 0.2 0.4 0.6 0.8 1.0 − 3 − 2 − 1 0 1 2 3 Leverage St d. P e a rs o n r e si d. Cook's distance 1 0.5 Residuals vs Leverage 19 22 20 Figure 8: Residual plots for Poisson GLM with an interaction and cubic terms. although note there not many data points. QQ plot show significant deviation at lower values which is not ideal. No outliers. (e) Let’s look at the AIC values: > pois.model5$aic [1] 223.4579 > nb.model$aic [1] 277.7973 indicating that the Poisson GLM with the cubic terms and the interaction is better. Given the interaction and higher order terms, the output is hard to interpret from the summary table. Easier to plot the estimated relationship of mean abnormality rate with dose rate for each of the three values of dose amount in the data, shown in Figure 9: pois.model6 <- glm(ca~offset(log.cells)+doseamt*doserate+I(doserate^2)+ I(doseamt^2)+I(doserate^3)+I(doserate^2):I(doseamt^2)+ I(doserate^3):I(doseamt^3),data=dicentric, family=poisson(link="log")) Drate <- seq(0.1,4,len=200) Damt <- rep(1,200) 11 Offset <- rep(0,200) # so we are predicting the rate abnormal cells per (exp(0)=1) cell preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset), type="link",se.fit=T) plot(Drate,exp(preds$fit),type="l",lwd=2,ylim=c(0,0.017)) lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2) lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2) Damt <- rep(2.5,200) preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset), type="link",se.fit=T) lines(Drate,exp(preds$fit),lwd=2,col="red") lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2,col="red") lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2,col="red") Damt <- rep(5,200) preds <- predict(pois.model6,newdata=data.frame(doserate=Drate,doseamt=Damt,log.cells=Offset), type="link",se.fit=T) lines(Drate,exp(preds$fit),lwd=2,col="blue",type="l") lines(Drate,exp(preds$fit+preds$se.fit),lwd=2,lty=2,col="blue") lines(Drate,exp(preds$fit-preds$se.fit),lwd=2,lty=2,col="blue") legend("topleft",c("dosemt=1","dosemt=2.5","dosemt=5"),lty=c(1,1,1),col=c("black","red","blue")) 0 1 2 3 4 0. 00 0 0. 00 5 0. 01 0 0. 01 5 Drate ex p(p red s$ fit) dosemt=1 dosemt=2.5 dosemt=5 Figure 9: Estimated relationship for the Poisson model with interaction and cubic terms. Figure 9 indicates that for dose amount 1, the relationship is fairly flat. The mean response increases as we go to dose amount 2.5 and even more when we go to 5, albeit in a non-linear way. 7. In lectures, we have seen data on number of quarterly aids cases in the UK, yi, from January 1983 to March 1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as xi. In this question we consider two competing models to describe the trend in the number of cases. Model 1 is Yi ∼ Pois(λi) Yi independent log(λi) = β0 + β1xi and Model 2 is Yi ∼ N(µi, σ2) Yi independent log(µi) = γ0 + γ1xi 12 (a) Plot yi against xi and comment on whether the two proposed models are sensible. (3) (b) Fit the two models in R, add the estimated trends from each model (λˆi and µˆi) on the plot from (7a) along with approximate 95% confidence intervals on the mean and comment on the validity of each model (based on the plot). For the confidence intervals you can assume that the sampling distribution of λˆi and µˆi is approximately Gaussian with standard errors obtained from the R function predict(...,type=‘‘response’’,se.fit=T). Also use the estimates and the formulation of the log- likelihoods from (7a) to calculate the AIC for each model and thus comment on which model is preferable according to this criterion. (5) (c) Produce the deviance residuals vs fitted values (λˆi and µˆi) plot for each model, comment appropriately and thus propose a way that the two models might be extended to improve the fit. (2) (d) Implement the proposed extensions to each model, to arrive at a final version for each of them (justified by appropriate hypothesis tests). (3) (e) On the basis of the analogous plots as in (7b) and (7c) but also on arguments of model fit based on the deviance and the AIC, comment on which (if any) of the two final models in (d) you would choose as the best. Mention at least one reason why either model is not ideal. (9) (f) Further extend the final Poisson model to a Negative Binomial model and comment on whether this model is preferable to the other two, on the basis of all the criteria used for comparison so far. (3) (25) 8. The dataframe titanic2 contains an aggregated version of the titanic survival data. The age covariate has been aggregated to “child” and “adult”. (a) Fit a Binomial model with just the “main effects” of age_group, pclass and gender, and interpret the estimates. (b) To see whether the effect of age is different for males and females, and also whether the effect of class is different for males and females, fit the model with an interaction between age_group and gender as well pclass and gender, and interpret the estimates. Solution (a) The model we are fitting is: Yi ∼ Bin(pii, Ni) Yi independent logit(pii) = β0 + β1x1,i + β2x2,i + β3x3,i + β4x4,i where Yi is the number survived; Ni is the total number of passengers; x1 = 1 if child and x1 = 0 if adult; x2 = 1 is class 2 and x2 = 0 otherwise; x3 = 1 is class 3 and x3 = 0 otherwise; and x4 = 1 if male and x4 = 0 otherwise. We fit this by: > model1 <- glm(cbind(survived,total-survived)~age_group+pclass+gender,data=titanic2, family=binomial(link="logit")) > summary(model1) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.2330 0.1810 12.340 < 2e-16 *** age_groupchild 1.3834 0.2733 5.061 4.17e-07 *** pclass2 -0.8656 0.2003 -4.322 1.54e-05 *** pclass3 -1.8654 0.1819 -10.253 < 2e-16 *** gendermale -2.7155 0.1542 -17.605 < 2e-16 *** Residual deviance: 64.718 on 6 degrees of freedom > 1 - pchisq(model1$deviance,model1$df.residual) [1] 4.925949e-12 13 The model does not fit, however let’s interpret the estimates anyway. β1 is the difference in mean prob- ability of survival (at logit scale) between adults and children. This is significant at 5% level so children have higher chance (1.383) of survival than adults. Class 2 passengers have significantly smaller chance of survival (-0.866) than class 1 passengers, and class 3 passengers have even smaller (-1.865). Lastly, males have significantly smaller chance of survival than females (-2.7155). (b) The model with the two interactions is: Yi ∼ Bin(pii, Ni) Yi independent logit(pii) = β0 + β1x1,i + β2x2,i + β3x3,i + β4x4,i + β5x5,i + β6x6,i + β7x7,i where x1 = 1 if female child in class 1 and x1 = 0 if female adult in class 1; x2 = 1 is female adult in class 2 and x2 = 0 otherwise; x3 = 1 is female adult class 3 and x3 = 0 otherwise; x4 = 1 if male adult in class 1 and x4 = 0 otherwise; x5 = 1 if male child in class 1 and x5 = 0 otherwise; x5 = 1 if adult male in class 2 and x5 = 0 otherwise; and x6 = 1 if adult male in class 3 and x6 = 0 otherwise. (Note it’s easier to fit the model in R before interpreting what the covariates mean.) We fit the model using: > model2 <- glm(cbind(survived,total-survived)~age_group+pclass+gender+age_group:gender+ + pclass:gender,data=titanic2,family=binomial(link="logit")) > summary(model2) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.2195 0.7123 5.924 3.14e-09 *** age_groupchild 0.6285 0.3491 1.800 0.071797 . pclass2 -2.0915 0.7874 -2.656 0.007898 ** pclass3 -4.1222 0.7270 -5.671 1.42e-08 *** gendermale -4.9856 0.7301 -6.829 8.58e-12 *** age_groupchild:gendermale 1.2755 0.4787 2.664 0.007714 ** pclass2:gendermale 1.1960 0.8267 1.447 0.147961 pclass3:gendermale 2.8668 0.7584 3.780 0.000157 *** Residual deviance: 15.38 on 3 degrees of freedom > 1 - pchisq(model2$deviance,model2$df.residual) [1] 0.001519179 We note that the model does not fit, but we interpret the results regardless. At this point, using the table is extremely confusing, so best do it using plots: > library(effects) > plot(predictorEffects(model2),lines=list(multiline=TRUE), axes=list(grid=TRUE),confint=list(style="bars")) which produces the plot shown in Figure 10. We assess significance using confidence intervals. Note that confidence intervals that don’t overlap imply statistically significant difference. But overlapping intervals don’t necessarily mean the difference is not significant, although they do imply not-so-strong evidence. E.g. overlapping intervals could arise from a p-value less than 0.05, but only just less than, e.g. 0.04. Interpretation: Top left plot indicates that children are more likely to survive but the effect is only sig- nificant in females. Also, females have significantly bigger chance of survival in both age groups. The top right plot indicates that females have significantly bigger chance of survival across all 3 classes, but the difference gets smaller as we move from class 1 to 2 to 3. The bottom left plot shows that there is not much difference in female survival for the two age groups across the 3 classes. For males however, children have significantly bigger chance of survival across all 3 classes. Also, male adults in class 1 have significantly bigger chance of survival than male adults in classes 2 and 3, while for male children the differences are not as pronounced. Female children in classes 1 and 2 have significantly bigger chance of survival than ones in class 3, and the same can be said for adult women. 14 age_group predictor effect plot age_group cb in d(s urv ive d, to ta l − s ur vi ve d) 0.2 0.4 0.6 0.8 adult child gender female male pclass predictor effect plot pclass cb in d(s urv ive d, to ta l − s ur vi ve d) 0.2 0.4 0.6 0.8 1 2 3 gender female male gender predictor effect plot gender cb in d(s urv ive d, to ta l − s ur vi ve d) 0.2 0.4 0.6 0.8 female male = pclass 1 = pclass 2 0.2 0.4 0.6 0.8 = pclass 3 age_group adult child Figure 10: Estimated relationship for the Binomial titanic model with interaction terms. 9. The module workspace, contains a data frame titanic which relates to 1309 passengers on the last voyage of the ocean liner ‘Titanic’. The response variable survived is a binary variable where the value 1 means the passenger survived the sinking. The data frame also contains predictors relating to passenger class (1st, 2nd, 3rd), gender, age and the fare amount each passenger paid. Passenger names are also available (for interest, rather than for modelling). (a) Fit a Bernoulli GLM with logistic link of survived with age, pclass and gender are predictors, as well as all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in conjunction with the R function drop1(), and interpret the final model in terms of parameter estimates and their significance. Perform relevant model checking analysis (model fit and residuals). (13) 10. In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years later, this data recorded in a follow-up study. Only women who are current smokers or who have never smoked are included. Resulting data set comprises 28 obs on the following 4 variables: y is the observed count for given combination, smoker is a factor with levels yes no, dead is a factor with levels yes/no, age is a factor with age-group levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+. Interest here lies on the effects of age and smoking on the probability of death. (a) Investigate the association of age and smoking on the chance of dying, using Poisson log-linear models for this contingency table. Clearly state any assumptions you are making about the sampling design. Solution 15 (a) Assuming all margins are random, then no particular terms need to be included by design. For reasons of brevity, we avoid writing the models down mathematically, noting that the models are Poisson GLMs with a log link. We start with biggest model possible and reduce if necessary, and this is model with all main affects, 2-way interactions and the 3-way interaction. > model1 <- glm(y~smoker*dead*age,data=femsmoke,family=poisson) > deviance (model1) [1] 3.033713e-10 This is of course the saturated model, which is useless, so we remove the 3-way interaction (biggest interaction) and check if the model fits: > model2 <- glm(y~(smoker+dead+age)^2,data=femsmoke,family=poisson) > 1-pchisq(deviance(model2),model2$df.residual) [1] 0.8815485 The model fits, but we must still consider whether we can reduce further. Use function drop1() which conducts a likelihood ratio test for all possibilities of removing a single 2-way interaction: > drop1(model2,test="Chi") Single term deletions Df Deviance AIC LRT Pr(>Chi)
2.38 180.58 smoker:dead 1 8.33 184.52 5.95 0.01475 * smoker:age 6 92.63 258.83 90.25 < 2e-16 *** dead:age 6 632.30 798.49 629.92 < 2e-16 *** For all possibilities, p-values are less than 0.05 (and all AIC values larger) so we keep the current model and simply interpret: Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.54284 0.58736 0.924 0.355384 smokerno -0.29666 0.25324 -1.171 0.241401 deadno 3.43271 0.59014 5.817 6.00e-09 *** age25-34 0.92902 0.68381 1.359 0.174273 age35-44 1.94048 0.62486 3.105 0.001900 ** age45-54 2.76845 0.60657 4.564 5.02e-06 *** age55-64 3.37507 0.59550 5.668 1.45e-08 *** age65-74 2.86586 0.60894 4.706 2.52e-06 *** age75+ 2.02211 0.64955 3.113 0.001851 ** smokerno:deadno 0.42741 0.17703 2.414 0.015762 * smokerno:age25-34 0.11752 0.22091 0.532 0.594749 smokerno:age35-44 0.01268 0.22800 0.056 0.955654 smokerno:age45-54 -0.56538 0.23585 -2.397 0.016522 * smokerno:age55-64 0.08512 0.23573 0.361 0.718030 smokerno:age65-74 1.49088 0.30039 4.963 6.93e-07 *** smokerno:age75+ 1.89060 0.39582 4.776 1.78e-06 *** deadno:age25-34 -0.12006 0.68655 -0.175 0.861178 deadno:age35-44 -1.34112 0.62857 -2.134 0.032874 * deadno:age45-54 -2.11336 0.61210 -3.453 0.000555 *** deadno:age55-64 -3.18077 0.60057 -5.296 1.18e-07 *** deadno:age65-74 -5.08798 0.61951 -8.213 < 2e-16 *** deadno:age75+ -27.31727 8839.01146 -0.003 0.997534 We only discuss interactions with deadno, which are the effects of the two covariates on the prob- ability of not dying. Begin with smokerno:deadno. The effect is positive and significant, so non- smokers have higher chances of not dying compared to smokers. Then deadno:age. The effect of deadno:age25-34 is not significant so no difference in survival for age groups 18-24 and 25-34. The effect of deadno:age35-44 is negative and significant so age group 35-44 has bigger chance of dying. In fact the effect increases (and is significant) with age group (as expected). The age group 75+ has a very large standard error, implying there is very little to estimate this effect. 16 11. In an archaeological study of animal husbandry, there is a particular interest in proportions of fragments of animal bones of different types discovered in excavations in the south of England over four time eras. A data set is collated from the excavation reports which consists of a three-way contingency table where cells contain total numbers of bone fragments from all excavations cross-classfied by animal type (‘sheep’ or ‘cattle’ or ‘pig’), by the surface geology of the excavation sites (‘valley terrace’ or ’other’) and by the era (‘early saxon’ or ‘middle saxon’ or ‘late saxon’ or ‘high medieval’). These data are contained in the dataframe bones in the module workspace. Interest lies in the association of era and geology, with the number of different bone types. Explore significant associations in these data using Poisson log-linear models. In particular, identify an appro- priate final preferred model which suitably fits the data and then report what this model suggests concerning significant associations between the numbers of bones of different types and surface geology and ditto with era. What assumptions are you making about the sampling design for the study in carrying out your analysis and what are the modelling implications of those? (15) 12. The dataframe ToothGrowth, included in the base R installation, see ?ToothGrowth, contains information on the growth of teeth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC). (a) Fit an Exponential GLM to length of teeth with two covariates, with the default link function and check whether the model fits. (b) Alternatively, fit this as a Gamma distribution and interpret the effect of the covariates. (c) Check the residual plots of the Gamma model, and suggest a way that these can be improved. (d) Implement your suggestion and check whether this has remedied the problem. (e) Interpret the effects of the covariates from your final model. Solution (a) The model we are fitting is Yi ∼ Exp(λi), Yi indep. λi = 1 µi = β0 + β1x1,i + β2x2,i where Yi is tooth length, x1,i = 1 if supplement type is VC and 0 otherwise, and x2,i is dose of supple- ment. We fit this model in R using > model <- glm(len~supp+dose,family=Gamma(link="inverse"),data=ToothGrowth) > summary(model,dispersion=1) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.08456 0.01926 4.391 1.13e-05 *** suppVC 0.00956 0.01329 0.720 0.4718 dose -0.02610 0.01090 -2.396 0.0166 * (Dispersion parameter for Gamma family taken to be 1) Residual deviance: 6.1123 on 57 degrees of freedom As it is an Exponential model, we must first check that this model fits. If it does, then the scaled deviance should come from a χ2n−p−1 distribution where here n− p− 1 = 57. From summary(model) we see that the scaled deviance 6.11 is very small compared to 57. The p-value of the upper tail of the χ257 is > 1 - pchisq(model$deviance,model$df.residual) [1] 1 The p-value is very big implying that the model is a good fit. (Note that the deviance is extreme but low, which is fine given that the scaled deviance Dm/φ is the logarithm of the likelihood ratio of the fitted model to the saturated model. Values of the ratio close to 1 imply good model fit, which on the log-scale translates to small values.) 17 (b) Let’s also fit this model as a Gamma GLM, which makes more sense since the Exponential is just a special case of the Gamma. The model is: Yi ∼ Gamma(µi, φ), Yi indep. 1 µi = β0 + β1x1,i + β2x2,i where φ is the dispersion parameter. In R, the model is still the same, it is only the summary() function that changes: > summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.084560 0.005931 14.257 < 2e-16 *** suppVC 0.009560 0.004092 2.336 0.023 * dose -0.026099 0.003355 -7.778 1.61e-10 *** (Dispersion parameter for Gamma family taken to be 0.09485256) Residual deviance: 6.1123 on 57 degrees of freedom We know this model fits w.r.t. the scaled deviance, because it has a non-fixed dispersion parameter (estimated to be 0.0949). Note however that the standard errors are different from the Exponential GLM, so we would prefer this model on the basis that it can reduce to the Exponential GLM if necessary. Predictor suppVC has a positive significant effect on the inverse mean implying that taking vitamin C supplement using method “VC” reduces the mean tooth length compared to taking it using method “OJ”. Predictor dose has a negative significant effect meaning that the more the dose, the longer the tooth (on average). (c) Figure 11 shows, from left to right, 1) the standardised deviance residuals vs linear predictor plot, 2) the Q-Q plot of the standardised deviance residuals and 3) the standardised Pearson residuals vs leverage plot along with Cook’s distance contours. These were obtained by: > plot(model,which=1,pch=20) > plot(model,which=2,pch=20) > plot(model,which=5,pch=20) 0.04 0.05 0.06 0.07 0.08 − 1. 0 − 0. 5 0. 0 0. 5 Predicted values R es id ua ls l l l l l l l l l l l l l l l l l l l l ll l l l l l l l l l l l ll ll l l ll l l l ll l l l l l l Residuals vs Fitted 1 9 4 l l l l l l ll l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l ll l ll l ll l l −2 −1 0 1 2 − 3 − 2 − 1 0 1 2 Theoretical Quantiles St d. d ev ia nc e re si d. Normal Q−Q 1 9 4 0.00 0.02 0.04 0.06 0.08 − 2 − 1 0 1 2 Leverage St d. P e a rs o n r e si d. l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll ll l l l l l l l ll l l l l l l Cook's distance Residuals vs Leverage 1 23 32 Figure 11: Residual plots The first plot on the left indicates a quadratic relationship, rather than random scatter about the horizontal zero line. Perhaps we need to consider squared terms of dose in the model. The Q-Q plot shows systematic deviation from the diagonal at both ends indicating that the Gamma distribution may not be appropriate for this data, however deviations are small so we may accept the fit but note that there is room for improvement. The last plot does not show any particular outliers of concern as all the points have low Cook’s distance (no Cook’s distance contours are even visible). The first plot may be improved by including non-linear effects. In particular we could try a quadratic term of the dose covariate (the other is a factor) and all the 2-way interactions. 18 (d) So let’s try to fit a model with a squared term of dose and an interaction with supp (for both dose and dose squared). This is a model with a different quadratic relationship of dose for each type of supplement. Mathematically: Yi ∼ Gamma(µi, φ), Yi indep. 1 µi = β0 + β1x1,i + β2x2,i + β3x1,ix2,i + β4x 2 1,i + β5x 2 1,ix2,i We fit as follows: > ToothGrowth$doseSq <- ToothGrowth$dose^2 > model2 <- glm(len~supp+dose+doseSq+supp:dose+supp:doseSq,family=Gamma(link="inverse"), data=ToothGrowth) > summary(model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.12625 0.01630 7.745 2.55e-10 *** suppVC 0.10141 0.03091 3.281 0.001817 ** dose -0.12045 0.02782 -4.330 6.50e-05 *** doseSq 0.03826 0.01008 3.795 0.000376 *** suppVC:dose -0.12091 0.05141 -2.352 0.022367 * suppVC:doseSq 0.03507 0.01833 1.914 0.060956 . (Dispersion parameter for Gamma family taken to be 0.05472392) Residual deviance: 3.0094 on 54 degrees of freedom The squared term and the interaction terms are all significant (at least at the 10% level) so let’s check residuals in Figure 12. The standardised deviance residuals vs linear predictor plot is much better now, and so is the Q-Q plot, so we can be content with model fit. 0.04 0.06 0.08 0.10 0.12 − 0. 6 − 0. 2 0. 2 0. 6 Predicted values R es id ua ls l l l l l l l l l l l l l l l l ll l l l l l l l l l l l l l l l l l l l ll l l ll l l l l ll l l Residuals vs Fitted 1 32 37 l l l l l l l l l l ll l l l l l l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l −2 −1 0 1 2 − 2 − 1 0 1 2 Theoretical Quantiles St d. d ev ia nc e re si d. Normal Q−Q 1 32 37 − 2 − 1 0 1 2 3 Factor Level Combinations St d. P e a rs o n r e si d. OJ VC supp : l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l ll l l l l l l l Residuals vs Factor Levels 32 1 2 Figure 12: Residual plots (e) Interpreting the model: When x2 = 0, the dose effect is 0.126−0.120x2 +0.038x22 while when x2 = 1 the dose effect is (0.126+0.101)− (0.120+0.121)x2 +(0.038−0.035)x22. The significance of the interaction terms implies that these two are significantly different. Hard to interpret due to the quadratic nature, so better to use predictions of µ and visualise as shown in Figure 13. Interesting how the two lines indicate that the differences in dose effect for the two values of supp are more pronounced for values of dose below 1, while above 1 they are not significantly different as indicated from the overlap in the confidence intervals. (The CIs of the mean are calculated as transformed intervals of the linear predictor using the Normal 95% confidence interval formula.) The plots were obtained using the following code: doseSeq <- seq(0.5,2,length=200) predsOJ <- predict(model2,newdata=data.frame(dose=doseSeq,doseSq=doseSeq^2,supp="OJ"),type="link",se.fit=T) predsVC <- predict(model2,newdata=data.frame(dose=doseSeq,doseSq=doseSeq^2,supp="VC"),type="link",se.fit=T) muOJ <- 1/predsOJ$fit muOJ_upper <- 1/(predsOJ$fit+1.96*predsOJ$se.fit) muOJ_lower <- 1/(predsOJ$fit-1.96*predsOJ$se.fit) muVC <- 1/predsVC$fit muVC_upper <- 1/(predsVC$fit+1.96*predsVC$se.fit) muVC_lower <- 1/(predsVC$fit-1.96*predsVC$se.fit) x11(width=8,height=6) ## might have to use win.graph() in Windows par(mar = c(4, 4, 1, 1),cex=1.3,lwd=2) ## sets the plot margins for "nicer" look 19 plot(ToothGrowth$dose,ToothGrowth$len,pch=20,ylim=c(0,45)) lines(doseSeq,muOJ) lines(doseSeq,muOJ_upper,lty=2) lines(doseSeq,muOJ_lower,lty=2) lines(doseSeq,muVC,col="red") lines(doseSeq,muVC_upper,lty=2,col="red") lines(doseSeq,muVC_lower,lty=2,col="red") l l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l ll l l l ll l l l 0.5 1.0 1.5 2.0 0 10 20 30 40 ToothGrowth$dose To o th G ro w th $le n Figure 13: Estimate mean growth as a function of dose. 13. Consider the data frame gehan in the module R workspace. This involves a trial of 42 leukaemia patients where some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed as matched pairs, both withdrawn from the trial when either came out of remission (more info can be found by typing ?gehan having first loaded the R package MASS). Variable cens indicates that some observations are censored (cens=0). This means that the patient was removed or dropped out from the trial before their remission time can be recorded. The only data available for such patients is the time they were removed, i.e. that they didn’t come out of remission before this removal time. When cens=0, variable time records the drop-out time rather than the remission time. (a) Create a new dataframe by removing the censored observations and use the command gehan$treat <- relevel(gehan$treat, "control") to ensure that it is the “control” that gets buried in the intercept. Write down and fit an Exponential GLM with an inverse link, with time as the response and treat as the covariate. Discuss the effects of treatment, and check whether the model fits. (5) (b) Discarding censored observations simplifies the situation, however this means throwing away potentially useful information (even if partial). To fit a model with all the data we can instead alter the likelihood contributions of the censored and non-censored data points: L(λi; ti) = p(ti;λi) if no censoring L(λi; ti) = Pr(t > ti;λi) = 1− Pr(t < ti;λi) if censoring where p(ti;λi) indicates the Exponential probability density function for survival times ti, and Pr(t < ti;λi) is the Exponential cumulative distribution function. The likelihood can then be written as: L(λi; t1, . . . , tn) = n∏ i=1 p(ti;λi) ci [1− Pr(t < ti;λi)]1−ci 20 where ci = 1 for no censoring and ci = 0 for censoring. Show that for the exponential distribution p(ti;λi) = λie −λiti , the likelihood can be written as: L(λi; t1, . . . , tn) = n∏ i=1 (λiti) cie−λiti tcii and note that this likelihood is equivalent to the Poisson likelihood up to a constant. One can therefore fit this model using a Poisson model with mean λiti where the data are the ci. Write down this Poisson model mathematically, such that λi = β0+β1xi where xi is the treatment variable. Fit this Poisson model in R, check that it fits, and compare the treatment effect with the Exponential model on the ‘thinned’ data from part (a). (Hint: to fit a model without an intercept, add a “−1” in the model formula of the glm() call. Also, you will need to use the command gehan$treat <- relevel(gehan$treat, "6-MP") before fitting the model, to revert the changes from part (a).) (14) (19) (100) 21 欢迎咨询51作业君