STAT2004/2904/7004 2024 – Assignment 4
Due date: 25 October 2024 at 16:00
STAT2004/7004: Complete Exercises 1–4 for a maximum of 40 marks and a total of 10%.
STAT2904: Complete Exercises 1–5 for a maximum of 50 (+5 bonus) marks and a total of
10% (+1% bonus).
All students: Exercise 6 is a bonus question. A complete and correct solution of this question
earns you an extra 1% for this assignment.
Note that some questions involve interpretation and communication of results in the
form of an audio recording which you upload onto Blackboard as an audio file.
Reminder: while discussion of the Assignment questions (amongst yourselves, with lecturers
and/or tutors) is encouraged, the final write-up must be your own. If you cannot express a
solution in your own words, then you must cite your source(s).
Question 1 (Testing exponential rates) (8 marks)
Let $X_1, X_2, \ldots, X_7$ be a random sample from an exponential distribution with pdf
$$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0,$$
and $Y_1, Y_2, \ldots, Y_8$ be another independent sample from an exponential distribution with pdf
$$f_Y(y) = \theta e^{-\theta y}, \quad y \ge 0.$$
Here, λ > 0 and θ > 0 are both unknown parameters. We want to test the null hypothesis
$H_0: \lambda = \theta$ versus the alternative hypothesis $H_1: \lambda \neq \theta$.
(a) (2 marks) Show that under the null hypothesis, the maximum likelihood estimator of
λ = θ is given by
$$\hat{\lambda} = \hat{\theta} = \frac{15}{\sum_i X_i + \sum_j Y_j}.$$
(b) (1 mark) Show that under the alternative hypothesis, the maximum likelihood estimators
of λ and θ are given respectively by
$$\hat{\lambda} = \frac{7}{\sum_i X_i} \quad \text{and} \quad \hat{\theta} = \frac{8}{\sum_j Y_j}.$$
(c) (3 marks) Construct a generalised likelihood ratio test for testing $H_0: \lambda = \theta$ versus
$H_1: \lambda \neq \theta$, and show that it reduces to a test based on large or small values of the test
statistic
$$T(X, Y) = \frac{\sum_i X_i}{\sum_i X_i + \sum_j Y_j}.$$
It is given to you that $T(X, Y) \sim \text{Beta}(7, 8)$ under the null hypothesis $H_0: \lambda = \theta$.
(d) (1 mark) Explain how you would set critical value(s) for your test from part (c) to
control the Type I error at α = 5%, and write down your decision rule explicitly using
these critical value(s).
(e) (1 mark) [Audio question]: Is your test from parts (c)–(d) uniformly most powerful
for testing $H_0: \lambda = \theta$ versus $H_1: \lambda \neq \theta$ at the 5% significance level? Briefly explain
why, or why not.
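Illustrative sketch for part (d): the Python snippet below computes equal-tailed 2.5% and 97.5% quantiles of the Beta(7, 8) null distribution. Splitting α equally between the two tails is an assumed convention, not something the question prescribes, and the helper names `beta_cdf_int` and `beta_quantile_int` are hypothetical. It relies on the standard identity that, for positive integers a and b, the Beta(a, b) cdf at x equals P(Bin(a + b − 1, x) ≥ a).

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def beta_quantile_int(p, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Equal-tailed critical values for alpha = 0.05 (an assumed convention).
c_lower = beta_quantile_int(0.025, 7, 8)
c_upper = beta_quantile_int(0.975, 7, 8)
print(f"reject H0 if T < {c_lower:.4f} or T > {c_upper:.4f}")
```

The bisection works because the cdf is increasing in x; other ways of dividing the 5% between the two tails would give different, equally valid cut-offs.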
Question 2 (Comparing ratings across groups) (8 marks)
A recent poll asked social media users to provide their opinions on a decision by a popular
photo-sharing app to remove the number of “likes” from their posts. Each respondent was
asked to express their opinion on the following five-point scale:
1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly Agree
Of the n = 198 respondents, 98 were “influencers” (with over 10,000 followers each) while
the other 100 were regular users. The full dataset can be downloaded as a .csv file from
Blackboard > Assessment > Assignment 4 > likes.csv.
(a) (1 mark) Visualise the data using an appropriate graph(s).
(b) (3 marks) Do the two types of users exhibit differing opinions regarding the recent
changes to the photo-sharing app? Answer this question by carrying out an appropri-
ate hypothesis test. Clearly state the null and alternative hypotheses, propose a test
statistic, compute and interpret a p-value, and write your conclusions in a way that is
understandable to a social scientist.
(c) (2 marks) State and critically assess any assumption(s) you made in answering (b).
(d) (2 marks) [Audio question]: A social scientist suggests comparing the two groups
using a two-sample t-test applied directly to the five-point responses. Explain to her
why this is inappropriate here.
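Illustrative sketch for part (b): one option among several for ordinal ratings is a permutation test on the difference in mean midranks between the two groups. The Python snippet below shows that idea on made-up toy ratings (not the likes.csv data); the function names `midranks` and `permutation_pvalue` are hypothetical, and the choice of test statistic is an assumption rather than the required approach.

```python
import random

def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks

def permutation_pvalue(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in mean midranks."""
    rng = random.Random(seed)
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)

    def stat(r):
        return abs(sum(r[:n_a]) / n_a - sum(r[n_a:]) / (len(r) - n_a))

    observed = stat(ranks)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ranks)  # random relabelling of the pooled ranks
        hits += stat(ranks) >= observed
    return (hits + 1) / (n_perm + 1)

# Made-up toy ratings on the 1-5 scale, NOT the likes.csv data.
toy_influencers = [1, 2, 2, 1, 3, 2, 1, 4]
toy_regular = [4, 5, 3, 4, 2, 5, 4, 3]
print(permutation_pvalue(toy_influencers, toy_regular))
```

Working with ranks uses only the ordering of the five categories, not the numerical spacing between them.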
Question 3 (Tuberculosis and blood type) (14 marks)
Overfield and Klauber (1980) published the following data on the incidence of tuberculosis
in relation to ABO blood groups in a sample of Eskimos:
                              blood type
tuberculosis severity      O     A    AB     B
moderate or advanced       7     5     3    13
minimal                   27    32     8    18
not present               55    50     7    24
We want to investigate whether tuberculosis incidence is related to blood type.
Let $p_{ij}$ denote the underlying proportion of the population with tuberculosis severity
i ∈ {moderate/advanced, minimal, not present} and blood type j ∈ {O, A, AB, B}. For
convenience, write $p = (p_{ij})$ for the 3×4 vector of proportions.
(a) (1 mark) Write down the null and alternative hypotheses in words.
(b) (1 mark) Write down the likelihood function for p given the observed counts x.
Under the null hypothesis, $p_{ij} = p_{i\bullet} \times p_{\bullet j}$ for each i and j, where $p_{i\bullet}$ is the overall proportion
with tuberculosis severity level i and $p_{\bullet j}$ is the overall proportion with blood type j.
(c) (3 marks) Show that under the null model the ML estimates of $p_{i\bullet}$ and $p_{\bullet j}$ are
given, respectively, by
$$\hat{p}_{i\bullet} = x_{i\bullet}/n \quad \text{and} \quad \hat{p}_{\bullet j} = x_{\bullet j}/n,$$
where $x_{i\bullet}$ is the observed number of cases of tuberculosis severity i, $x_{\bullet j}$ is the observed
number of cases of blood type j, and n is the total sample size.
(d) (1 mark) Using the results from part (c), or otherwise, what counts would we expect
to see in each cell of the table if the null hypothesis is indeed true?
Under the alternative hypothesis, there are no restrictions on the cell proportions $p_{ij}$ (except
that they must all sum to 1).
(e) (1 mark) State the ML estimates $\hat{p}_{ij}$ of each cell proportion $p_{ij}$ under the alternative.
(You do not have to prove that these are the MLEs).
(f) (2 marks) Using your results from parts (b), (c) and (e), or otherwise, numerically
evaluate the generalized likelihood ratio test statistic,
$$\Lambda = \frac{\sup_{H_0} L(p \mid x)}{\sup_{H_1} L(p \mid x)},$$
for testing the association between tuberculosis and blood type based on the observed
counts in the table above. Also, numerically compute the transformation $-2\log\Lambda$.
(g) (1 mark) Using your results from part (d), or otherwise, compute Pearson’s $\chi^2$ statistic,
$$\sum_{\text{cells } i,j} \frac{\left(\text{Observed}_{ij} - \text{Expected}_{ij}\right)^2}{\text{Expected}_{ij}}.$$
Is Pearson’s $\chi^2$ statistic numerically close to the $-2\log\Lambda$ statistic from part (f)?
(h) (2 marks) Carry out the hypothesis test by computing and interpreting a p-value, and
state your conclusion in a way that is understandable to a population health scientist.
Notice that one of the cells in the table contains only 3 counts. This may render the
asymptotic $\chi^2$ distribution inaccurate for part (h). Instead, we can consider Fisher’s exact test.
(i) (2 marks) Using an alternative approach, or otherwise, re-do the analysis to account
for the low counts in some of the cells. Does your conclusion from part (h) change?
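Illustrative sketch for parts (d), (f), (g) and (h): the Python snippet below computes the expected counts under independence, Pearson’s statistic and $-2\log\Lambda = 2\sum O_{ij}\log(O_{ij}/E_{ij})$ from the table of observed counts, then an asymptotic p-value on (3 − 1)(4 − 1) = 6 degrees of freedom. The helper `chi2_sf_even_df` is a hypothetical name; it uses the closed-form upper tail of a chi-square distribution with even degrees of freedom, so it is not a general-purpose routine. Fisher’s exact test from part (i) is not covered here.

```python
from math import exp, log

# Observed counts from the table (rows: severity; columns: O, A, AB, B).
observed = [
    [7, 5, 3, 13],    # moderate or advanced
    [27, 32, 8, 18],  # minimal
    [55, 50, 7, 24],  # not present
]

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected counts under independence: n * (x_i. / n) * (x_.j / n).
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [(o, e) for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]
pearson = sum((o - e) ** 2 / e for o, e in cells)
minus_2_log_lambda = 2 * sum(o * log(o / e) for o, e in cells)

def chi2_sf_even_df(x, df):
    """Upper tail P(chi2_df > x) for even df, via the finite Poisson-type sum."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"Pearson = {pearson:.3f}, -2 log Lambda = {minus_2_log_lambda:.3f}")
print(f"approx. p-values: {chi2_sf_even_df(pearson, df):.4f}, "
      f"{chi2_sf_even_df(minus_2_log_lambda, df):.4f}")
```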
Question 4 (Weight gain in pigs) (10 marks)
A trial was conducted in Iowa, USA, examining the effects of vitamin B12 dietary supplements
and antibiotics on weight gain in pigs. Twelve adult pigs were randomly divided into four
groups (one using standard pig chow, one using pig chow with added vitamin B12, one using
pig chow with added antibiotics, and one using pig chow with both added vitamin B12 and
antibiotics). After one week of feeding, the pigs were weighed and their weight gain (in
grams) was recorded. The data are plotted below:
[Figure: Weight gain in pigs, by Vitamin B12 level and [P]resence or [A]bsence of Antibiotics.
Horizontal axis: Vitamin B12 (No, Yes); vertical axis: Weight gain (kg), from 0 to 500;
each point is labelled P or A.]
We can model the weight gains $\{Y_{jki}\}$ using a two-way ANOVA with interactions:
$$Y_{jki} = \mu + \alpha_j + \beta_k + \delta_{jk} + \epsilon_{jki},$$
where j = 1,2 denotes the level of factor A (antibiotics), k = 1,2 denotes the level of factor
B (vitamin B12), and i = 1,2,3 indexes the observations in each group. Assume that the
errors $\epsilon_{jki} \overset{\text{iid}}{\sim} N(0, \sigma^2)$ across all j, k and i. The common variance $\sigma^2$ is taken to be unknown.
If we parametrize this model using the contrast constraints,
$$\alpha_1 = 0, \quad \beta_1 = 0 \quad \text{and} \quad \delta_{1k} = \delta_{j1} = 0 \quad \text{for } j, k = 1, 2,$$
then $\mu$ can be interpreted as the mean of the baseline group with no antibiotics and no
vitamin B12, $\alpha_2$ is the mean change from adding antibiotics only, $\beta_2$ is the mean change from
adding vitamin B12 only, and the interaction $\delta_{22}$ is the additional mean change from adding both
antibiotics and vitamin B12 simultaneously.
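(a) (3 marks) Show that under these contrast constraints the MLE of each parameter is given by
$$\hat{\mu} = \bar{Y}_{11\bullet},$$
$$\hat{\alpha}_2 = \bar{Y}_{21\bullet} - \bar{Y}_{11\bullet},$$
$$\hat{\beta}_2 = \bar{Y}_{12\bullet} - \bar{Y}_{11\bullet},$$
$$\hat{\delta}_{22} = \bar{Y}_{22\bullet} - \bar{Y}_{21\bullet} - \bar{Y}_{12\bullet} + \bar{Y}_{11\bullet}.$$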
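(b) (2 marks) Show that the following sum-of-squares decomposition holds:
$$SS_{\text{Total}} = SS_A + SS_B + SS_{AB} + SS_{\text{residual}},$$
where
$SS_{\text{Total}} = \sum_{jki} (Y_{jki} - \bar{Y}_{\bullet\bullet\bullet})^2$ is the overall sum-of-squares ignoring groups,
$SS_A = \sum_{jki} (\bar{Y}_{j\bullet\bullet} - \bar{Y}_{\bullet\bullet\bullet})^2$ is the sum-of-squares between levels of factor A,
$SS_B = \sum_{jki} (\bar{Y}_{\bullet k\bullet} - \bar{Y}_{\bullet\bullet\bullet})^2$ is the sum-of-squares between levels of factor B,
$SS_{AB} = \sum_{jki} (\bar{Y}_{jk\bullet} - \bar{Y}_{j\bullet\bullet} - \bar{Y}_{\bullet k\bullet} + \bar{Y}_{\bullet\bullet\bullet})^2$ is the interaction sum-of-squares,
$SS_{\text{residual}} = \sum_{jki} (Y_{jki} - \bar{Y}_{jk\bullet})^2$ is the residual sum-of-squares within groups.
Hint: Start with the following identity:
$$Y_{jki} - \bar{Y}_{\bullet\bullet\bullet} = (Y_{jki} - \bar{Y}_{jk\bullet}) + (\bar{Y}_{j\bullet\bullet} - \bar{Y}_{\bullet\bullet\bullet}) + (\bar{Y}_{\bullet k\bullet} - \bar{Y}_{\bullet\bullet\bullet}) + (\bar{Y}_{jk\bullet} - \bar{Y}_{j\bullet\bullet} - \bar{Y}_{\bullet k\bullet} + \bar{Y}_{\bullet\bullet\bullet})$$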
(c) (1 mark) Briefly explain why the residual sum-of-squares has distribution given by
$$\frac{SS_{\text{residual}}}{\sigma^2} \sim \chi^2_{df_{\text{residual}}},$$
where $df_{\text{residual}} = JK(r-1) = 8$. [Here, J = 2 is the number of levels of factor A, K = 2
is the number of levels of factor B, and r = 3 is the number of replications in each group.]
(d) (1 mark) Briefly argue why the residual sum-of-squares $SS_{\text{residual}}$ is independent of the
interaction sum-of-squares $SS_{AB}$.
Using similar calculations to part (c), it can also be shown that under the null hypothesis
$H_0$: all interactions $\delta_{jk} = 0$, the interaction sum-of-squares has distribution given by
$$\frac{SS_{AB}}{\sigma^2} \sim \chi^2_{df_{AB}},$$
where $df_{AB} = (J-1)(K-1) = 1$.
(e) (1 mark) Using parts (c), (d) and the above result, or otherwise, argue why the null
distribution of the so-called F-ratio,
$$F = \frac{MS_{AB}}{MS_{\text{residual}}} := \frac{SS_{AB}/df_{AB}}{SS_{\text{residual}}/df_{\text{residual}}},$$
is an F distribution with numerator degrees-of-freedom $df_{AB}$ and denominator
degrees-of-freedom $df_{\text{residual}}$.
A partially-complete two-way ANOVA table for the pigs weight dataset is given below:
Source                    df      SS      MS      F      P
VitaminB12                 1  218700  218700  60.33  < 0.0001
Antibiotics                1   19200   19200   5.30  ≈ 0.05
VitaminB12:Antibiotics     1  172800  172800  47.67  < 0.0005
Residuals                  8   29000    3625
Total                     11  439700
(f) (2 marks) Using your results from parts (b) and (e), or otherwise, complete the above
ANOVA table. Hence, summarise the main finding(s) of this experiment and write a
short conclusion.
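Illustrative sketch of the table arithmetic: the Python snippet below takes the SS and df columns as printed in the ANOVA table, recovers the Total row from the decomposition in part (b), and forms each mean square and F-ratio against the residual mean square. The dictionary names are hypothetical, and p-values are not computed here because the standard library has no F distribution.

```python
# Sums of squares and degrees of freedom as printed in the ANOVA table.
ss = {"VitaminB12": 218700, "Antibiotics": 19200,
      "VitaminB12:Antibiotics": 172800, "Residuals": 29000}
df = {"VitaminB12": 1, "Antibiotics": 1,
      "VitaminB12:Antibiotics": 1, "Residuals": 8}

ss_total = sum(ss.values())  # SS_Total = SS_A + SS_B + SS_AB + SS_residual
df_total = sum(df.values())

# Mean squares, then F-ratios relative to the residual mean square.
ms = {src: ss[src] / df[src] for src in ss}
f_ratio = {src: ms[src] / ms["Residuals"] for src in ss if src != "Residuals"}

print(f"SS_Total = {ss_total}, df_Total = {df_total}")
for src, f in f_ratio.items():
    print(f"{src:<24} MS = {ms[src]:>9.0f}  F = {f:6.2f}")
```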
Question 5 (STAT2904 only) (10 marks)
Let $X_1, X_2, \ldots, X_n$ be iid random variables from a Pareto distribution with pdf
$$f(x \mid \theta, \nu) = \begin{cases} \dfrac{\theta \nu^{\theta}}{x^{\theta+1}}, & x \ge \nu, \\ 0, & \text{otherwise}, \end{cases}$$
where θ, ν > 0 are two unknown parameters.
(a) (4 marks) Find the MLEs for θ and ν.
(b) (1 mark) If it is given to you that θ = 1, does that change the MLE for ν?
(c) (5 marks) Using parts (a) and (b), or otherwise, construct a generalized likelihood ratio
test (GLRT) for testing
$$H_0: \theta = 1, \ \nu \text{ unknown} \quad \text{versus} \quad H_1: \theta \neq 1, \ \nu \text{ unknown},$$
and show that it reduces to a test based on either small or large values of the statistic
T(X) given by
$$T(X) = \log\left[ \frac{\prod_{i=1}^{n} X_i}{(\min_i X_i)^n} \right].$$
To finish specifying this test, we need to set the critical values for T(X) that determine what
is “too small” or “too large”. However, the distribution of T(X) is too difficult to derive
analytically. Instead, we can use simulations to help us find these critical values.
STAT2904 Bonus Questions (5 marks):
(d) (2 marks) For a sample size of n = 22, say, simulate one set of observations $x_1, x_2, \ldots, x_{22}$
from the Pareto distribution with θ = 1 and ν = 2.1. From this realisation, compute
the value of the observed test statistic
$$T(x) = \log\left[ \frac{\prod_{i=1}^{22} x_i}{(\min_i x_i)^{22}} \right].$$
(e) (1 mark) Repeat the simulation setting from part (d) 10,000 times, each time computing
and saving the observed test statistic T(x).
(f) (1 mark) Estimate the upper and lower 2.5%-tiles of the distribution of T(X) using the
simulated values from part (e).
(g) (1 mark) Investigate numerically how the cutoff values from part (f) change if you set
the nuisance parameter ν to another value (e.g., try ν = 1.3, 2.7, 3.4, etc.).
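Illustrative sketch for parts (d)–(g): the Python snippet below draws Pareto samples by inverting the cdf $F(x) = 1 - (\nu/x)^{\theta}$ for x ≥ ν, evaluates T(x) on the log scale, and reads off empirical 2.5% and 97.5% cut-offs from 10,000 replications. The function names, the seed and the simple order-statistic rule for the empirical quantiles are assumptions made for illustration only.

```python
import math
import random

def simulate_pareto(n, theta, nu, rng):
    """Inverse-CDF sampling: F(x) = 1 - (nu / x)**theta for x >= nu."""
    # 1 - rng.random() lies in (0, 1], which avoids division by zero.
    return [nu * (1.0 - rng.random()) ** (-1.0 / theta) for _ in range(n)]

def t_statistic(xs):
    """T(x) = log( prod(x_i) / min(x_i)**n ), computed on the log scale."""
    return sum(math.log(x) for x in xs) - len(xs) * math.log(min(xs))

def simulated_cutoffs(n=22, theta=1.0, nu=2.1, reps=10_000, seed=2024):
    """Empirical lower and upper 2.5% cut-offs of T under the given (theta, nu)."""
    rng = random.Random(seed)
    stats = sorted(t_statistic(simulate_pareto(n, theta, nu, rng))
                   for _ in range(reps))
    lower = stats[int(0.025 * reps)]
    upper = stats[int(0.975 * reps) - 1]
    return lower, upper

# Same seed for every nu, so only the nuisance parameter changes between runs.
for nu in (2.1, 1.3, 2.7, 3.4):
    lo, hi = simulated_cutoffs(nu=nu)
    print(f"nu = {nu}: lower 2.5% = {lo:.2f}, upper 2.5% = {hi:.2f}")
```

Working with sums of logarithms rather than the raw product avoids overflow when the sample contains very large values.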
Question 6 (Bonus question for all students) (4 marks)
Let $Y_1$ and $Y_2$ be two random samples from a Uniform(λ, λ+1) distribution. To test the
hypothesis $H_0: \lambda = 0$ versus $H_1: \lambda > 0$, two competing tests are proposed:
• Geoff’s Test: reject $H_0$ in favour of $H_1$ if $Y_2 \ge 0.95$.
• Alan’s Test: reject $H_0$ in favour of $H_1$ if $Y_1 + Y_2 \ge c$ for some critical value c.
(a) Find the value of c such that Alan’s Test has the same significance level as Geoff’s Test.
(b) Prove or disprove: Alan’s Test is more powerful than Geoff’s Test.
(c) Construct a test that has the same significance level but is more powerful than both Alan’s
and Geoff’s Test.
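Illustrative sketch: a Monte Carlo check can complement (but not replace) the analytical arguments asked for above. The Python snippet below estimates the size of Geoff’s Test, picks a simulated critical value for Alan’s Test that matches it, and compares estimated rejection rates at a few values of λ. The function names, seeds, simulation size and the grid of λ values are all assumptions made for illustration.

```python
import random

def rejection_rate(reject, lam, n_sim=200_000, seed=1):
    """Monte Carlo estimate of P(reject) when Y1, Y2 ~ Uniform(lam, lam + 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        y1 = lam + rng.random()
        y2 = lam + rng.random()
        hits += reject(y1, y2)
    return hits / n_sim

def geoff(y1, y2):
    """Geoff's Test: reject when Y2 >= 0.95."""
    return y2 >= 0.95

def empirical_c(level, n_sim=200_000, seed=2):
    """Simulated (1 - level) quantile of Y1 + Y2 under lambda = 0."""
    rng = random.Random(seed)
    sums = sorted(rng.random() + rng.random() for _ in range(n_sim))
    return sums[int((1 - level) * n_sim)]

size_geoff = rejection_rate(geoff, lam=0.0)
c_hat = empirical_c(size_geoff)

def alan(y1, y2):
    """Alan's Test with the simulated critical value c_hat."""
    return y1 + y2 >= c_hat

for lam in (0.0, 0.2, 0.5, 0.8):
    print(lam, rejection_rate(geoff, lam), rejection_rate(alan, lam))
```

The estimates carry simulation error, so small differences between the two rejection rates should not be over-interpreted.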