STAT6030 GENERALISED LINEAR MODELLING

The Australian National University

Assignment 2

2023 Summer Session

Instructions

• This assignment is worth 55 marks in total and 25% of your overall marks for this

course. The assignment is compulsory and must be submitted by 5pm on Monday

6 March 2023.

• You must write your answers to this assignment individually and by yourself. If you

copy someone else’s work or allow your work to be copied, you will receive a mark of

zero for the assignment and risk severe academic consequences.

• Your answers should be individually submitted through Turnitin on Wattle as a

single pdf/Word document (less than 50MB) including the following:

1. The assignment Cover Sheet (available on Wattle).

2. Your answers (no more than 10 pages including graphs, summaries, tables, etc.,

but excluding the Appendix and Cover Sheet, and respecting the page limits given for

each part).

3. An Appendix including all the R commands you used (no page limit).

• Assignments should be typed and not handwritten. Your assignment may include some

carefully edited R output (e.g., graphs, summaries, tables, etc.) and appropriate

discussion of these results, as well as some selected R commands. Please be selective about

what you present and only include as many pages and as much R output as necessary

to justify your solution. Clearly label each part and question of your assignment and

appendix with the corresponding numbers.

• Unless otherwise advised, use a significance level of 5%.

• Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).

• Marks will be deducted if these instructions are not strictly respected, especially when

the total report is of an unreasonable length, i.e., more than the above page limit. The

Appendix will generally not be marked, but it will be checked if what you have written or

done needs clarification.

• Name your submission “CourseCode Uid”, e.g., “STAT6030 u1234567”.

• Try to submit your assignment at least 30 minutes before the deadline in case

something unexpected happens, for instance an internet connection problem.

• Late submissions will NOT be accepted. Extensions will usually be granted on

medical or compassionate grounds on production of appropriate evidence, but must

receive the lecturer’s approval at least 24 hours before the deadline.

Part 1 [16 Marks]

Please provide your answers to the following questions and include short working out if there

is any. There is a limit of 3 pages on your answers for Part 1.

(a) [1 mark] What is the definition of the canonical link function in the context of generalised

linear models?

(b) [1 mark] Explain in words and/or by drawing a plot when a link function of a generalised

linear model is valid.

(c) [1 mark] In the context of generalised linear models, does the value of the maximised

log-likelihood for the saturated model depend on the choice of link function and why?

(d) [1 mark] The mean of a generalised linear model is known to lie between 1 and 2

whatever the value of the linear predictor $\eta_i = x_i^\top \beta$, i.e. $1 < \mu_i < 2$. Let $\Phi$ denote

the cumulative distribution function of the standard normal distribution $N(0, 1)$ and

$\Phi^{-1}$ denote the inverse function of $\Phi$. Which function below is an appropriate link

function in this setting? Notes: (i) precisely one answer below is correct and the other

ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores

full marks for the question.

A. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i - 1)$.

B. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i/2)$.

C. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i - 1)$.

D. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i/2)$.

(e) [1 mark] The gamma distribution has probability density function

$f(y; \alpha, \beta) = \{\beta^{\alpha}/\Gamma(\alpha)\}\, y^{\alpha - 1} \exp(-\beta y),$

where y > 0, α > 0 is a shape parameter, β > 0 is a rate parameter and Γ(·) is the

gamma function. You may assume that

(i) the mean µ of the gamma distribution is given by µ = α/β;

(ii) the gamma distribution is a generalised linear model with dispersion parameter

ϕ = 1/α, in the notation of equation (4.1) of Topic 4.

What is the canonical link function when the generalised linear model is gamma?
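
The density and assumption (i) above can be checked numerically. The following is a minimal R sketch, with hypothetical values of α and β chosen only for illustration. It compares the formula as written with R's built-in `dgamma()` and verifies the mean by numerical integration. It does not address the canonical link itself.

```r
# Hypothetical shape and rate values, chosen only for illustration.
alpha <- 2.5
beta <- 1.5
y <- c(0.2, 1.0, 3.7)

# Density written exactly as in the question.
f_manual <- beta^alpha / gamma(alpha) * y^(alpha - 1) * exp(-beta * y)

# R's built-in gamma density accepts the same shape/rate parametrisation.
f_builtin <- dgamma(y, shape = alpha, rate = beta)
densities_agree <- isTRUE(all.equal(f_manual, f_builtin))

# Numerical check of assumption (i): the mean equals alpha / beta.
mean_numeric <- integrate(
  function(u) u * dgamma(u, shape = alpha, rate = beta),
  lower = 0, upper = Inf
)$value
mean_agrees <- isTRUE(all.equal(mean_numeric, alpha / beta, tolerance = 1e-3))

print(c(densities_agree = densities_agree, mean_agrees = mean_agrees))
```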

(f) [3 marks] The geometric distribution has probability mass function $f(y; p) = (1 - p)p^y$,

for y = 0, 1, . . ., where 0 < p < 1. What are the canonical link function and variance

function of the geometric distribution?

The deviance residual for observation i is given by $\operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$, where

$$d_i^2 = \frac{2}{\phi}\left[ y_i\{h(y_i) - h(\hat{\mu}_i)\} - \{b(h(y_i)) - b(h(\hat{\mu}_i))\} \right]$$

is the deviance associated with observation i, which is written as a function of the

response variable $y_i$ and of the fitted value $\hat{\mu}_i$, while sign(·) is the sign function defined

in the lecture notes. Also recall that $(b')^{-1}(\mu) \triangleq h(\mu)$. What is the expression for $d_i^2$, as

a function of $y_i$ and $\hat{\mu}_i$, when the generalised linear model is geometric? Please simplify

your expression as much as you can.
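
As an illustration of how the general deviance formula above is used, the sketch below evaluates it for a different family, the Poisson, for which b(θ) = exp(θ), h(µ) = log(µ) and ϕ = 1, and compares the result with the deviance residuals reported by R. The data are simulated and purely hypothetical. The geometric case asked about in the question is not worked here.

```r
# Simulated data, purely hypothetical.
set.seed(42)
x <- runif(30)
y <- rpois(30, lambda = exp(0.5 + x))
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Poisson ingredients: b(theta) = exp(theta), h(mu) = log(mu), phi = 1.
b <- function(theta) exp(theta)
h <- function(mu) log(mu)
phi <- 1

# y * h(y) is taken as 0 when y = 0 (its limit as y -> 0).
y_h_y <- ifelse(y == 0, 0, y * h(y))

# General formula: (2 / phi) * [ y {h(y) - h(mu)} - {b(h(y)) - b(h(mu))} ].
d2 <- (2 / phi) * ((y_h_y - y * h(mu_hat)) - (b(h(y)) - b(h(mu_hat))))

# pmax() guards against tiny negative values caused by rounding.
r_dev <- sign(y - mu_hat) * sqrt(pmax(d2, 0))

# Compare with the deviance residuals computed by R.
residuals_agree <- isTRUE(all.equal(
  unname(r_dev),
  unname(residuals(fit, type = "deviance"))
))
print(residuals_agree)
```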

(g) [1 mark] Consider a generalised linear model with linear predictor $\eta_i = \upsilon_i + x_i^\top \beta$, where

υi is an offset, xi is a vector of covariates of length p and β is a parameter vector of

length p to be estimated. Assuming that the model’s dispersion parameter ϕ = 1 is

known, how many free parameters (i.e., parameters to estimate) are there in this model?

(h) [1 mark] A logistic regression model was fitted to a dataset consisting of a binary

outcome variable, yi, taking values 0 and 1, and a single numerical covariate xi. The

estimated intercept and slope on the linear predictor scale were found to be −0.47 and

1.3, respectively, so that the linear predictor as a function of $x_i$ is given by

$\hat{\eta}(x_i) = -0.47 + 1.3 x_i.$

Recall the estimated probability $\mathrm{Prob}[y_i = 1 \mid x_i]$ is given by

$\mathrm{Prob}[y_i = 1 \mid x_i] = \exp\{\hat{\eta}(x_i)\}/[1 + \exp\{\hat{\eta}(x_i)\}]$

and so the estimated probability $\mathrm{Prob}[y_i = 0 \mid x_i]$ is given by $1 - \mathrm{Prob}[y_i = 1 \mid x_i]$. What

is the value of $x_i$ such that the odds of the event $y_i = 1$ are 0.75? Recall that the odds

of an event that occurs with probability $\pi$ are given by $\pi/(1 - \pi)$.
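
The two formulas recalled above can be tried out in R. This is a minimal sketch using hypothetical linear predictor values, not the fitted model from the question. The helper `odds()` is a name introduced here for illustration only.

```r
# Hypothetical linear predictor values, not those of the question.
eta <- c(-2, 0, 1.5)

# Estimated probability of y = 1: exp(eta) / (1 + exp(eta)).
prob1 <- exp(eta) / (1 + exp(eta))

# plogis() is R's built-in logistic distribution function.
same_as_plogis <- isTRUE(all.equal(prob1, plogis(eta)))

# Odds of an event with probability p: p / (1 - p).
odds <- function(p) p / (1 - p)
odds1 <- odds(prob1)

print(data.frame(eta = eta, prob1 = prob1, odds1 = odds1))
```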

(i) [2 marks] Consider a distribution with the probability density function

$f(y; \mu) = \{1/(2\pi y^3)\}^{1/2} \exp[-(y - \mu)^2/(2\mu^2 y)],$

where µ is the mean of the distribution and y > 0. What is the variance function, V (µ),

of this distribution?

(j) [1 mark] The following output from a linear regression model fit in R was obtained.

Calculate the value for ++++ that the R program would give if the sample size is 10.

Call:
lm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08888    0.66793  -0.133    0.897
x            1.06903    0.10765    ????     ++++

(k) [1 mark] Suppose we fit a Poisson regression model A with log link to a dataset whose

response variable is a count. No offset is included. In the fitted model we have included

a covariate x and the estimated coefficient of x is $\hat{\beta}_A$. Suppose that we then decide to

fit a second model B which is the same as model A but with x included as an offset as

well as included in the linear predictor as before. Suppose the estimated coefficient of

x is $\hat{\beta}_B$ in model B. Which of the following statements about the second fitted model is

correct?

Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an

incorrect answer scores zero while the correct answer scores full marks for the question.

A. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will (usually) change compared

to that of model A.

B. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will not change compared to that

of model A.

C. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will (usually) change compared

to that of model A.

D. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will not change compared to that

of model A.

(l) [2 marks] Suppose we have fitted a Poisson log-linear regression with extra-Poisson

variation and the estimate of the dispersion parameter ϕ is greater than 1. If the

standard Poisson model were used in this situation, would this be likely to be a case of

underdispersion or overdispersion, and which assumption about the relationship between the

mean and variance of the Poisson distribution would fail? What would happen to the estimates

of the β parameters for the standard Poisson model?

Part 2 [12 Marks]

Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce

cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives

the dose of chemical B. In the R code below, the first column of c gives the number of

cockroaches killed and the second column of c gives the number of cockroaches that survived.

The following R outputs were obtained:

Please provide your answers to the following questions and include short working out if there

is any. There is a limit of 2 pages on your answers for Part 2.

(a) [1 mark] What type of generalised linear model is being fitted here and what link

function is being used?

(b) [5 marks] Determine the missing information indicated by the letters A, B, C, D, E, F,

G, H, J and K. Note that for E you are required to specify the link function.

(c) [2 marks] Write down the relevant model in mathematical form, focusing on the contri-

bution of observation i to the model.

(d) [2 marks] Briefly indicate your impressions of the results of the statistical analysis

provided above.

(e) [2 marks] What are the next questions you would investigate in the statistical analysis?

State what your next two steps would be.

Part 3 [12 Marks]

The presence of sprouted or diseased kernels in wheat can reduce the value of a wheat

producer’s entire crop. It is important to identify these kernels after harvest but prior

to sale. To facilitate this identification process, automated systems have been developed to

separate healthy kernels from the rest. Improving these systems requires a better

understanding of the measurable ways in which healthy kernels differ from kernels that have sprouted

prematurely or are infected with a fungus. To this end, Martin et al. (1998) conducted a

study examining numerous physical properties of kernels (density, hardness, size, weight,

and moisture) measured on a sample of wheat kernels from two different classes of wheat,

hard red winter (hrw) and soft red winter (srw), represented by the categorical variable class

in the wheat.csv dataset on Wattle. Each kernel’s condition was also classified as “Healthy”,

“Partly Diseased” or “Diseased” by human visual inspection (represented by the categorical

variable type2).

Please provide your answers to the following questions and include short working out if there

is any. There is a limit of 3 pages on your answers for Part 3.

Throughout the following questions, treat type2 as the response variable.

Suppose that we have conducted the following R analysis and obtained the R output below:

(a) [2 marks] Describe the interpretations of coefficient estimates -10.95451 and -0.6480912

in the summary() output.

(b) [2 marks] What are the null and alternative hypotheses corresponding to the p-value

0.0291 in the Anova() output? What conclusion can you obtain based on the p-value?

(c) [2 marks] Suppose we have a new observation of the following form:

> xnew = data.frame(class = 'srw', density = 1, hardness = 25, size = 2,
+                   weight = 25, moisture = 12)
> xnew
  class density hardness size weight moisture
1   srw       1       25    2     25       12

If we use predict(), what are the predicted probabilities for the different categories

of the response type2 and what is the prediction of the response type2 for this new

observation?

Suppose that we conducted further R analysis and obtained the R output below:

(d) [2 marks] Describe the interpretations of coefficient estimates -0.17370 and 13.50540

in the summary() output, respectively.

(e) [2 marks] What are the null and alternative hypotheses corresponding to the p-value

0.65749 in the Anova() output? What conclusion can you obtain based on the p-value?

(f) [2 marks] Fit a nominal logistic regression model and an ordinal logistic regression model,

respectively, with covariates class, density, hardness, size, weight, moisture,

class:density, class:hardness, class:size, class:weight and class:moisture.

Based on the model fitting results, which model is better? Please explain why this

model is better.

Part 4 [15 Marks]

An analysis of some ship damage data is presented below. The data consists of a factor typ,

corresponding to ship type, with 3 levels, A, B and C; a factor cons, corresponding to the

period of construction of the ship, with 3 levels, 1960-1964, 1965-1969 or 1975-1979; a factor

opr, corresponding to years of operation of the ship, with 2 levels, either 1960-1975 or 1975-

1979; a numerical variable mnths, corresponding to the total number of months at risk; and

dmge, corresponding to the number of damage incidents reported for the ship. The following

R output was obtained.

Please provide your answers to the following questions and include short working out if there

is any. There is a limit of 2 pages on your answers for Part 4.

(a) [1 mark] What type of generalised linear model is being fitted here to obtain the output

out1 and what link function is being used?

(b) [7 marks] Determine the missing information indicated by the letters A, B, C, D, E, F,

G, H, J, K, M, N, P and Q. Note that F should consist of either a blank, a dot, one

star, two stars or three stars; and for J you should specify the link function that was

used. All the other letters apart from A represent a number.

(c) [2 marks] Explain what is meant by an offset and the motivation for offsetting L=log(mnths)

rather than mnths itself.

(d) [2 marks] Using the R printout for out1, give the value of the linear predictor for a ship

of Type A that was constructed in the period 1965-1969 and operated in the period

1975-1979, assuming that mnths=1095.

(e) [3 marks] Write down brief notes on what you would conclude about the wave damage

data from the R output. Can we draw any conclusions as to whether overdispersion is

present in this dataset? What action would you consider taking if overdispersion were

suspected to be present?
