MAST90084: Statistical Modelling Assignment 1
Due time: 11PM, Wednesday May 6.
DO NOT FORGET TO COMPLETE THE PLAGIARISM DECLARATION ON THE SUBJECT’S LMS
BEFORE SUBMITTING YOUR FIRST ASSIGNMENT.
1. Data in the following 2 × 2 × 3 table were used to study the effect of passive smoking on lung cancer. The
table summarizes the results of case-control studies from 3 countries for nonsmoking women married to
smokers. (Source: Blot and Fraumeni, J. Nat. Cancer Inst., 77:993-1000 (1986) and Agresti (1996).) [15]

Country  Spouse Smoked  Cases  Controls
Japan    No                21        82
         Yes               73       188
UK       No                 5        16
         Yes               19        38
USA      No                71       249
         Yes              137       363

(a) A log-linear model mod1 can be fitted to the data, with the results given in the following R
output. Give the mathematical formula (of the form ln(λ) = ...) for model mod1. Explain why this
model is called a homogeneous association model.
> pasSmoking.dat=data.frame(freq=c(21,73,5,19,71,137,82,188,16,38,249,363))
> pasSmoking.dat$Cnt=factor(rep(c("Japan","UK", "USA"), times=2, each=2))
> pasSmoking.dat$Smo=factor(rep(c("No","Yes"), times=6))
> pasSmoking.dat$Can=factor(rep(c("Case","Control"), each=6))
> pasSmoking.dat
   freq   Cnt Smo     Can
1    21 Japan  No    Case
2    73 Japan Yes    Case
3     5    UK  No    Case
4    19    UK Yes    Case
5    71   USA  No    Case
6   137   USA Yes    Case
7    82 Japan  No Control
8   188 Japan Yes Control
9    16    UK  No Control
10   38    UK Yes Control
11  249   USA  No Control
12  363   USA Yes Control
> mod1=glm(freq~Cnt+Smo+Can+Cnt:Smo+Cnt:Can+Smo:Can, family=poisson, data=pasSmoking.dat)
> anova(mod1, test="Chisq")
Analysis of Deviance Table; Model: poisson; Link: log; Response: freq
Terms added sequentially (first to last)
        Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                       11    1168.85
Cnt      2   726.43         9     442.42  < 2.2e-16
Smo      1   112.52         8     329.90  < 2.2e-16
Can      1   307.56         7      22.34  < 2.2e-16
Cnt:Smo  2    15.50         5       6.84  0.0004316
Cnt:Can  2     1.05         3       5.80  0.5919109
Smo:Can  1     5.56         2       0.24  0.0184215
> 1-pchisq(0.24,2)
[1] 0.8869204
> 1-pchisq(5.80,3)
[1] 0.1217566
MAST90084 Statistical Modelling Assignment 1 Semester 1, 2020
(b) Based on mod1, test the significance of the interaction effect Smo:Can, eliminating the effects of all
other terms in the model. Comment on the implication of your result.
(c) Test the adequacy of model Cnt+Smo+Can+Cnt:Smo+Cnt:Can, at significance level 0.05, using the R
output in (a). Comment on the implication of your result and how it is related to the result of (b).
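
The mechanics behind the tests in (b) and (c) can be sketched in R as follows. This is only an illustration of how the quantities are obtained, not a model answer. It rebuilds the data frame from (a), refits mod1, and fits the reduced model without Smo:Can. The names mod0, dev_diff, df_diff, p_b and p_c are introduced here for illustration and do not appear in the assignment.

```r
# Rebuild the data exactly as in part (a)
pasSmoking.dat <- data.frame(freq = c(21, 73, 5, 19, 71, 137, 82, 188, 16, 38, 249, 363))
pasSmoking.dat$Cnt <- factor(rep(c("Japan", "UK", "USA"), times = 2, each = 2))
pasSmoking.dat$Smo <- factor(rep(c("No", "Yes"), times = 6))
pasSmoking.dat$Can <- factor(rep(c("Case", "Control"), each = 6))

mod1 <- glm(freq ~ Cnt + Smo + Can + Cnt:Smo + Cnt:Can + Smo:Can,
            family = poisson, data = pasSmoking.dat)

# mod0 (hypothetical name): mod1 without the Smo:Can interaction
mod0 <- update(mod1, . ~ . - Smo:Can)

# (b) Deviance (likelihood-ratio) test of Smo:Can given all other terms
dev_diff <- deviance(mod0) - deviance(mod1)
df_diff  <- df.residual(mod0) - df.residual(mod1)
p_b <- pchisq(dev_diff, df_diff, lower.tail = FALSE)

# (c) Goodness-of-fit test of mod0 against the saturated model
p_c <- pchisq(deviance(mod0), df.residual(mod0), lower.tail = FALSE)

c(dev_diff = dev_diff, df_diff = df_diff, p_b = p_b, p_c = p_c)
```

Calling drop1(mod1, test = "Chisq") is an alternative that reports this kind of deviance test for each term that can be dropped from mod1.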
2. This question refers to the quasi-likelihood method for GLM given in the lecture notes. Show that the
following results hold. [15]
(a) Based on the definition of quasi-likelihood, the quasi-score function is given by
\[ s(\beta) = \sum_{i=1}^{n} z_i D_i(\beta) [\sigma_i^2(\beta)]^{-1} [y_i - \mu_i(\beta)] = Z^T D(\beta) \Sigma^{-1}(\beta) [y - \mu(\beta)]. \]
(b) The expected quasi-information is
\[ F(\beta) = \sum_{i=1}^{n} z_i z_i^T w_i(\beta) = Z^T W(\beta) Z. \]
(c) The variance matrix of $s(\beta)$ is
\[ V(\beta) = \mathrm{Cov}(s(\beta)) = \sum_{i=1}^{n} z_i z_i^T D_i^2(\beta) \cdot \frac{\sigma_{0i}^2}{\sigma_i^4(\beta)}. \]
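
A small numerical check can make the sum and matrix forms in (a) and (b) concrete. The sketch below is not a proof and rests on assumptions that the assignment does not state: a log link with mean function h = exp, D_i taken as the derivative of h at the linear predictor, a working variance equal to 1.5 times the mean, and weights w_i = D_i^2 / sigma_i^2. The lecture notes may define these quantities differently, and all numbers are arbitrary.

```r
set.seed(1)
n <- 6; p <- 2
Z    <- cbind(1, rnorm(n))    # n x p design matrix, rows are z_i^T
beta <- c(0.3, -0.5)          # arbitrary illustrative parameter value
y    <- rpois(n, 2)           # arbitrary illustrative responses

eta  <- drop(Z %*% beta)
mu   <- exp(eta)              # assumed mean function h = exp (log link)
D    <- mu                    # assumed D_i = dh/d(eta) = exp(eta_i)
sig2 <- 1.5 * mu              # assumed working variance sigma_i^2(beta)
w    <- D^2 / sig2            # assumed weights w_i = D_i^2 / sigma_i^2

# Quasi-score: sum form versus matrix form
s_sum <- rep(0, p)
for (i in 1:n) s_sum <- s_sum + Z[i, ] * D[i] / sig2[i] * (y[i] - mu[i])
s_mat <- drop(t(Z) %*% diag(D) %*% solve(diag(sig2)) %*% (y - mu))

# Expected quasi-information: sum form versus matrix form
F_sum <- matrix(0, p, p)
for (i in 1:n) F_sum <- F_sum + outer(Z[i, ], Z[i, ]) * w[i]
F_mat <- t(Z) %*% diag(w) %*% Z

all.equal(s_sum, s_mat)
all.equal(F_sum, F_mat)
```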
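3. Let $y_i = (y_{i1}, \cdots, y_{iq})^T$ be a $q \times 1$ random vector following a probability distribution from the multi-parameter
exponential family. Namely, the pdf of $y_i$ is
\[ f(y_i \mid \theta_i, \phi, w_i) = \exp\left\{ \frac{y_i^T \theta_i - b(\theta_i)}{\phi} w_i + c(y_i, \phi, w_i) \right\}, \]
where $\theta_i = (\theta_{i1}, \cdots, \theta_{iq})^T$ is a $q \times 1$ natural parameter vector, $\phi$ is a dispersion parameter and $w_i$ is a weight. It is
known that
\[ E\left[ \frac{\partial \ln f}{\partial \theta_{ij}} \right] = 0, \quad j = 1, \cdots, q, \]
and
\[ E\left[ \frac{\partial^2 \ln f}{\partial \theta_{ij} \partial \theta_{ij'}} \right] + E\left[ \left( \frac{\partial \ln f}{\partial \theta_{ij}} \right) \cdot \left( \frac{\partial \ln f}{\partial \theta_{ij'}} \right) \right] = 0, \quad j, j' = 1, \cdots, q. \]
Using these properties show that
\[ E(y_{ij}) = \frac{\partial b(\theta_i)}{\partial \theta_{ij}}, \quad \mathrm{Var}(y_{ij}) = \frac{\phi}{w_i} \cdot \frac{\partial^2 b(\theta_i)}{\partial \theta_{ij}^2} \quad \text{and} \quad \mathrm{Cov}(y_{ij}, y_{ij'}) = \frac{\phi}{w_i} \cdot \frac{\partial^2 b(\theta_i)}{\partial \theta_{ij} \partial \theta_{ij'}}, \]
$j, j' = 1, \cdots, q$. [15]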
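
The identities to be shown can be checked numerically on a concrete member of the family. The sketch below is an illustration only, not part of the required proof. It assumes a single multinomial trial with q = 2 free categories, dispersion 1 and weight 1, for which b(theta) = log(1 + sum(exp(theta))). The parameter values and all variable names are hypothetical.

```r
# Hypothetical example: one multinomial trial with q = 2 free categories,
# phi = 1 and w_i = 1, so that b(theta) = log(1 + sum(exp(theta))).
b <- function(theta) log(1 + sum(exp(theta)))
theta <- c(0.4, -0.7)
q <- length(theta)

# Exact moments of y = (y_1, y_2), where y_j = 1 if category j is observed
pi_j       <- exp(theta) / (1 + sum(exp(theta)))
mean_exact <- pi_j
cov_exact  <- diag(pi_j) - outer(pi_j, pi_j)

# Central finite differences for the gradient and Hessian of b
h <- 1e-4
e <- diag(q)
grad_b <- sapply(1:q, function(j)
  (b(theta + h * e[j, ]) - b(theta - h * e[j, ])) / (2 * h))
hess_b <- matrix(0, q, q)
for (j in 1:q) for (k in 1:q) {
  hess_b[j, k] <- (b(theta + h * e[j, ] + h * e[k, ]) -
                   b(theta + h * e[j, ] - h * e[k, ]) -
                   b(theta - h * e[j, ] + h * e[k, ]) +
                   b(theta - h * e[j, ] - h * e[k, ])) / (4 * h^2)
}

max(abs(grad_b - mean_exact))   # should be close to 0
max(abs(hess_b - cov_exact))    # should be close to 0
```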
4. Let $Y$ be a response variable having $k$ nominal categories. Let $U_1, \cdots, U_k$ be $k$ independent latent utility
variables satisfying $U_r = u_r + \varepsilon_r$, $r = 1, \cdots, k$, with $u_r$’s being fixed and $\varepsilon_r$’s being i.i.d. having cdf $F$ and
pdf $f = F'$. Following the principle of maximum random utility it has been shown that $Y = r$ if and only
if $U_r = \max\{U_1, \cdots, U_k\}$, $r = 1, \cdots, k$. Moreover it has been shown that
\[ P(Y = r) = \int_{-\infty}^{\infty} \prod_{s \neq r} F(u_r - u_s + \varepsilon) f(\varepsilon) \, d\varepsilon, \quad r = 1, \cdots, k. \]
Using these results, find a closed-form result for $P(Y = r)$ if an extreme maximal-value cdf $F(x) =
\exp(-\exp(-x))$ is chosen. [15]
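
A candidate closed form can be sanity-checked against a direct numerical evaluation of the integral. The sketch below only evaluates the integral numerically for the extreme maximal-value cdf; it does not derive anything. The utilities in u and the names F_ev, f_ev and p_num are hypothetical choices made for illustration.

```r
# Extreme maximal-value (Gumbel) cdf and its derivative, the pdf
F_ev <- function(x) exp(-exp(-x))
f_ev <- function(x) exp(-x - exp(-x))

u <- c(0.5, 0, -1)   # hypothetical fixed utilities u_1, ..., u_k (k = 3)

# P(Y = r) obtained by numerical integration, one r at a time
p_num <- sapply(seq_along(u), function(r) {
  integrand <- function(eps) {
    # product over s != r of F(u_r - u_s + eps), times f(eps)
    sapply(eps, function(e) prod(F_ev(u[r] - u[-r] + e)) * f_ev(e))
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
})

p_num
sum(p_num)   # should be numerically close to 1
```

Evaluating a proposed closed-form expression at the same u and comparing it with p_num gives a quick check on the algebra.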
5. For a response variable $Y$ having $k$ ordered categories, the cumulative model, based on a given explanatory
variable $x$, thresholds $\theta_1, \cdots, \theta_q$ and a cdf $F$, is specified as
\[ P(Y \le r \mid x) = F(\theta_r + x^T \gamma). \]
Find the link function for this model when $F$ is chosen as the extreme maximal-value cdf. [15]
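
To see what the cumulative model specifies in practice, the sketch below turns cumulative probabilities into category probabilities for the extreme maximal-value cdf. It illustrates the model only and does not give the link function. The thresholds, coefficients and covariate values are hypothetical, and the code assumes k = q + 1 categories so that the last cumulative probability is 1.

```r
F_ev <- function(x) exp(-exp(-x))   # extreme maximal-value cdf

theta <- c(-1, 0.2, 1.5)   # hypothetical increasing thresholds theta_1 < ... < theta_q
gamma <- c(0.8, -0.4)      # hypothetical coefficient vector
x     <- c(1.0, 0.5)       # hypothetical covariate vector

# Cumulative probabilities P(Y <= r | x) for r = 1, ..., q,
# followed by P(Y <= q + 1 | x) = 1 (assumption: k = q + 1 categories)
cum_p <- c(F_ev(theta + sum(x * gamma)), 1)

# Category probabilities are successive differences of the cumulative ones
cat_p <- diff(c(0, cum_p))
cat_p
```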
6. You need to install the R package faraway to do this question. The hsb data was collected from the High
School and Beyond Study. Type help(hsb) to see the description of the dataset. We want to see how the
relevant variables in the data are related to the choice of the type of program — academic, vocational, or
general — that the students pursue in high school. The response is multinomial with three levels. [25]
(a) Fit a trinomial response model with the other variables as predictors (untransformed).
(b) For the student with id 99, compute the predicted probabilities of the three possible choices.
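
One possible way to set this up in R is sketched below. It uses multinom() from the nnet package, which is a choice made here and not one named in the assignment. It also assumes that id is only a student identifier and should be left out of the predictors. The names mod_hsb and p99 are hypothetical.

```r
library(faraway)   # provides the hsb data
library(nnet)      # provides multinom()

data(hsb)

# Assumption: id is only an identifier, so it is excluded from the predictors
mod_hsb <- multinom(prog ~ . - id, data = hsb, trace = FALSE)
summary(mod_hsb)

# Predicted probabilities of the three programs for the student with id 99
p99 <- predict(mod_hsb, newdata = hsb[hsb$id == 99, ], type = "probs")
p99
```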
Total marks = 100