MTHM506/COMM511: Statistical Data Modelling

Assessment - Individual Exercises

Marks achieved in this assignment will contribute towards 50% of the final module mark. You should attempt all questions on this sheet. Note that the questions are organised in the order we covered the topics, and not in order of difficulty. Therefore it is advised that you read through the questions first, and start working on those that you feel more comfortable with.

Deadline: Noon (12pm), on 3rd March 2023

You should submit one PDF via eBART containing your solutions. It should be written up using word processing software (e.g. LaTeX, R Markdown, or Word). Solutions are expected to be concise, well structured and well presented. Commented R code (e.g. ‘model <- glm(...)’) and the outcomes/plots should form part of your solutions. Do not display too much raw R output (e.g. don’t display the full output of ‘summary(model)’); edit it down to the essentials. Justify each step of your analyses, comment your R code to explain what you are doing, and give your plots appropriate titles and labelled axes. Handwritten solutions will be accepted where mathematical descriptions are required, but a professional word-processed submission is preferred.

You are expected to work independently; strict disciplinary action will be taken for any plagiarism. Late submissions will also be penalised according to the University’s late submission policy.

The data required for this assignment, datasets_exercises.RData, can be downloaded from the ELE page and loaded into R using the load() function.
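As a minimal sketch of how load() behaves, the snippet below saves a small made-up data frame to a temporary .RData file and loads it back. The object name demo_df and the temporary file are hypothetical stand-ins so that the example runs anywhere; for the assignment itself you would call load("datasets_exercises.RData") with the file in your working directory.

```r
# Hypothetical stand-in for datasets_exercises.RData (not the assignment data)
tmp <- file.path(tempdir(), "demo.RData")
demo_df <- data.frame(x = c(0.1, 0.5, 0.9), y = c(110, 160, 185))
save(demo_df, file = tmp)
rm(demo_df)  # remove the object so we can see load() restore it

# load() restores the saved objects into the workspace and
# (invisibly) returns a character vector of their names
loaded <- load(tmp)
print(loaded)
str(demo_df)
```

Capturing the return value of load() is a convenient way to see which objects a .RData file contains.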

Question 1

The data frame nlmodel contains data on a response variable y and a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship:

[Figure: scatter plot of y versus x, with x ranging from 0 to 1 and y from roughly 100 to 200.]

Suppose for these data we wish to consider the model

$$Y_i \sim N\left(\frac{\theta_1 x_i}{\theta_2 + x_i},\ \sigma^2\right), \qquad i = 1, 2, \ldots, 100, \quad Y_i \text{ independent}$$

(a) [1 mark] Why can’t this model be fit using a linear (regression) model?

(b) [2 marks] Write down the likelihood L(θ1, θ2, σ2; y, x) and the log-likelihood l(θ1, θ2, σ2; y, x).

(c) [1 mark] Write an R function mylike() which evaluates the negative log-likelihood (i.e. −l(θ1, θ2, σ2; y, x)) for any values of the three parameters.

(d) [3 marks] Use the R function nlm() in association with your function mylike() to numerically minimise the negative log-likelihood and report the maximum likelihood estimates for the model parameters. Provide some evidence of how you chose sensible starting values.

(e) [2 marks] Estimate the standard errors and construct 99% confidence intervals for θ1 and θ2.

(f) [2 marks] Test the hypothesis that θ2 = 0.08 at the 10% significance level (not using the confidence interval).

(g) [4 marks] Produce a plot of the associated mean relationship and the associated 95% prediction intervals on a scatter plot of y versus x. Comment on the appropriateness of the model.
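The general workflow of numerical maximum likelihood with nlm() can be sketched on a deliberately simpler problem than the one in this question: estimating the mean and standard deviation of a simulated normal sample. Everything here (the simulated vector z, the function name negll, and the log-scale parametrisation of the standard deviation) is an illustrative assumption and not the model or the solution for Question 1.

```r
set.seed(1)
z <- rnorm(200, mean = 5, sd = 2)  # simulated data, not the assignment data

# Negative log-likelihood of N(mu, sigma^2).
# sigma is parametrised on the log scale so the optimiser cannot make it negative.
negll <- function(par, z) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(z, mean = mu, sd = sigma, log = TRUE))
}

# Starting values taken from simple summaries of the data
fit <- nlm(negll, p = c(mean(z), log(sd(z))), z = z, hessian = TRUE)
est <- fit$estimate

# The inverse Hessian of the negative log-likelihood at the minimum
# approximates the covariance matrix of the estimates
se <- sqrt(diag(solve(fit$hessian)))

# Approximate 99% Wald interval for mu
ci_mu <- est[1] + c(-1, 1) * qnorm(0.995) * se[1]
print(est)
print(ci_mu)
```

Requesting hessian = TRUE is what makes the standard errors available; it is also worth checking fit$code to confirm that the optimiser reports convergence.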

Question 2

The data frame aids relates to the number of quarterly AIDS cases in the UK, yi, from January 1983 to March 1994. The variable cases is yi and the variable date is time, symbolised here as xi. A scatter plot of yi versus xi shows an increasing trend in cases:

[Figure: scatter plot of Number of cases (roughly 100 to 500) versus Date (roughly 82.5 to 92.5).]

In this question we consider two competing models to describe the trend in the number of cases. Model 1 is

$$Y_i \sim \text{Pois}(\lambda_i), \qquad \log(\lambda_i) = \beta_0 + \beta_1 x_i$$

and Model 2 is

$$Y_i \sim N(\mu_i, \sigma^2), \qquad \log(\mu_i) = \gamma_0 + \gamma_1 x_i$$

(a) [2 marks] Comment on whether the proposed models are sensible in terms of the distribution and the relationship of x with the mean.

(b) [3 marks] Fit the two models in R and plot the estimated trends from each model (λ̂i and μ̂i) on top of the data with approximate 95% confidence intervals around the mean. Comment on the validity of each model (based on the plot).

(d) [2 marks] Produce the deviance residuals vs fitted values (λ̂i and μ̂i) plot for each model, comment appropriately and thus propose a way that the two models might be extended to improve the fit.

(e) [4 marks] Implement the proposed extensions to each model, to arrive at a final version for each of them (justified by appropriate hypothesis tests).

(f) [8 marks] On the basis of your answers to (a)-(d), but also on arguments of model fit based on the deviance, comment on which (if any) of the two final models in (e) you would choose as the best. Mention at least one reason why either model is not ideal.

(g) [4 marks] Further extend your final Poisson model to a Negative Binomial model and comment on whether this model is preferable to the other two, on the basis of all the criteria used for comparison so far.
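The R functions involved in this kind of analysis can be sketched on simulated count data. The data frame sim, its columns qtr and counts, and the simulation settings below are all hypothetical and unrelated to the aids data; the sketch only shows one way to fit a log-link Poisson GLM, build approximate 95% confidence intervals for the mean, extract deviance residuals, and fit a Negative Binomial model with glm.nb() from the MASS package.

```r
library(MASS)  # provides glm.nb(); MASS is distributed with standard R installations

set.seed(42)
qtr <- seq(0, 10, length.out = 60)
# Simulated overdispersed counts with a log-linear mean (illustrative only)
counts <- rnbinom(60, mu = exp(1 + 0.3 * qtr), size = 5)
sim <- data.frame(qtr = qtr, counts = counts)

# Poisson GLM with log link
pois_fit <- glm(counts ~ qtr, family = poisson(link = "log"), data = sim)

# Approximate 95% confidence interval for the mean:
# build it on the link (log) scale, then back-transform with exp()
pr <- predict(pois_fit, type = "link", se.fit = TRUE)
lower <- exp(pr$fit - qnorm(0.975) * pr$se.fit)
upper <- exp(pr$fit + qnorm(0.975) * pr$se.fit)

# Deviance residuals, e.g. for a residuals vs fitted values plot
dres <- residuals(pois_fit, type = "deviance")
plot(fitted(pois_fit), dres,
     xlab = "Fitted values", ylab = "Deviance residuals",
     main = "Poisson GLM on simulated data")

# Negative Binomial extension with the same linear predictor
nb_fit <- glm.nb(counts ~ qtr, data = sim)
print(AIC(pois_fit, nb_fit))
```

Building the interval on the link scale and back-transforming keeps the limits positive, which an interval computed directly on the response scale does not guarantee.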