代写辅导接单-Assignment 2 of MAST90125: Bayesian Statistical Learning

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Assignment 2 of MAST90125: Bayesian Statistical Learning

Due: 11:59pm Sunday, 01 October 2023

There are places in this assignment where R code will be required. Therefore set the random seed so assignment is reproducible.

set.seed(123456) #Please change random seed to your student id number.

# Loading datasets

data= read.table("./heights.txt",header=TRUE) # change it to the absolute path if needed data_f<-data[data$sex ==0,] #Subsetting data to female only

data_m<-data[data$sex ==1,] #Subsetting data to male only

CovidAustralia= read.csv("./CovidAustralia.csv",header=TRUE)

# change it to the absolute path if needed

Question One (6 marks)

Suppose the data y follows a pdf p(y|θ) where θ = (θ1, θ2) is a two-dimensional vector parameter of interest. Our objective is to use Gibbs sampler to simulate samples of θ from the posterior pdf p(θ|y).

a) Specify the transition pdf J(θ(t−1), θ(t)) used in generating a Markov chain {θ(0), θ(1), · · · , θ(t−1), θ(t), · · · } by the Gibbs sampler. Prove that the detailed balance condition is satisfied for the specified transition pdf J(θ(t−1),θ(t)) so that the posterior pdf p(θ|y) is the unique stationary pdf assocoated with the generated Markov chain.

b) Show that R p(θ(t−1)|y)J(θ(t−1),θ(t))dθ(t−1) = p(θ(t)|y) holds. Also interpret this result regarding the ergodicity of the generated Markov chain.

Question Two (23 marks)

Consider the following data extracted from a study on human height:

600 individuals were recruited to have their heights and other variables (features) measured. The measured variables are described below:

• height: height of participant, measured in cm.

• mother: Difference from mean female height for biological mother of participant, corrected for maternal

age, measured in cm.

• father: Difference from mean male height for biological father of participant, corrected for paternal age, measured in cm.

• age: Age in years of participant (range 21-50).

• sex: Gender of participant (f emale = 0, male = 1).

The data required to answer this question is heights.txt, which can be downloaded from Canvas. a) Consider the following linear regression model,

y=Xβ+ε, ε∼N(0,σ2In).

Assume that the priors are p(β) ∝ 1, an improper flat prior; and p(τ) = Ga(0,0), an improper Gamma

prior.

Task: Determine the marginal posterior for the precision parameter τ = (σ2)−1, p(τ|y), and the conditional

posterior for β, p(β|τ,y). Show details of your work.

b) Use the lm function in R to fit the following model to female participants only (sex = 0):

E(height) = βfemale + β1mother + β2father + β3age, Var(height) = σ2.

Task: Based on the results obtained and the posterior distribution obtained in a), write down the specific form for the posterior distribution of τ and the conditional posterior distribution of β (keeping τ in the conditional posterior distribution of β).

c) Use the lm function in R to fit the following linear regression model to all participants:

E(height) = βmale + βfemale + β1mother + β2father + β3age, Var(height) = σ2. To do this, assume the dataset was imported as data and then run the following command. mod2<-lm(height~0+sex+I(1-sex)+mother+father+age,data=data)

Write β = (βmale, βfemale, β1, β2, β3)′.

Task I: Report βˆ, and the associated 95 % confidence intervals based on the above lm output.

Note: One way to build the design matrix X here is: 1). set the column associated with βmale to 1 for male participants and to 0 for female participants; 2). set the column associated with βfemale to 1 for female participants and to 0 for male participants; 3). The remaining columns of X are filled with the covariate values for mother, father and age, respectively.

Suppose we want to use a Gibbs sampler to generate samples of β = (βmale,βfemale,β1,β2,β3)′ and τ = (σ2)−1 such that the stationary distribution of the associated Markov chain equals the posterior pdf of (βmale,βfemale,β1,β2,β3,τ). Then we use the generated posterior samples of β and τ to perform a Bayesian inferential analysis on the above regression model. The process is described as following:

Bayesian computing and analysis

1. The prior of τ is set to be the marginal posterior of τ found in b). The prior of βmale is set to be normal with mean 173 and variance 20. The prior of (βfemale,β1,β2,β3) is set to be the conditional posterior of (βfemale, β1, β2, β3) given τ found in b). Note the above three priors are set independently of each other.

Find the conditional posterior pdf of (βmale,βfemale,β1,β2,β3) and τ, respectively. Note the block Gibbs sampler requires these two conditional posterior pdf’s. For help in determining the necessary conditional posteriors, refer to lecture 14.

2. Construct a block Gibbs sampler to generate posterior samples of β = (βmale, βfemale, β1, β2, β3)′ and τ. Write down both the algorithm and the R function for this Gibbs sampler.

3. Use the constructed Gibbs sampler and the lm results of the fitted regression model to generate two chains of (β′,τ), each of length 10,000, with the initial value of τ being 0.1 and 1, respectively, for each chain. Discard the first 1000 iterations of each chain as the warm-up period. When storing results, convert τ back to σ2.

4. Combine the two generated chains and report the posterior means, standard deviations and 95 % central credible intervals for each component of β and σ2. Also report the 95 % central credible for β1 − β2, and give an interpretation regarding the effect of each parent on child height.

5. Use the test statistics T(y,β) = cor(y−Xβ,Xβ) and T(yrep,β,τ) = cor(yrep −Xβ,Xβ) to perform posteriorpredictivecheckingofthemodel,basedonthecombinedgeneratedchainof(β′,τ). Reportthe results in terms of the numeric and graphic summaries of T(y,β) and T(yrep,β,τ) values. Also report the Bayes p-values by comparing T(y,β) and T(yrep,β,τ) with T(y,βˆ) and T(yrep,βˆ,τˆ), respectively, where βˆ and τˆ are the maximum likelihood estimates obtained from the lm results.

Task II: Complete the above 5 steps and answer the questions therein.

d) Perform convergence checks for the chains obtained in c). Report both graphical summaries and Gelman-Rubin diagnostic results. For the calculation of Gelman-Rubin diagnostics, you will need to install the R package coda. An example of processing chains for calculating Gelman-Rubin diagnostics is given below.

Processing chains for calculation of Gelman-Rubin diagnostics. Imagine you have 4 chains of

a multi-parameter problem, and thinning already completed, called par1,par2,par3,par4

Step one: Converting the chains into mcmc lists.

library(coda)

par1<-as.mcmc.list(as.mcmc((par1)))

par2<-as.mcmc.list(as.mcmc((par2)))

par3<-as.mcmc.list(as.mcmc((par3)))

par4<-as.mcmc.list(as.mcmc((par4)))

Step two: Calculating diagnostics

par.all<-c(par1,par2,par3,par4)

gelman.diag(par.all)

e) One question of interest is whether there exists significant interaction between gender and parental height. The analysis done earlier ignored any interaction effect. Based on the results of posterior predictive checking mentioned in c), do you think there exist such interaction effects?

Question Three (13 marks)

In generalised linear models, rather than estimating effects from the response data directly, we model through a link function, η(θ), and assume η(θ)i = xi′ β. The link function can be determined by re-arranging the likelihood of interest into an exponential family form,

p(y|θ) = f(y)g(θ)eη(θ)′u(y). (1) a) Re-arrange the Binomial probability mass function into the exponential family format to determine

the link function. The Binomial pmf is

Pr(y|n,p)= y p (1−p) .

?n? y n−y 3

To explore some properties of Metropolis-Hasting sampling, consider the dataset CovidAustralia.csv, which can be downloaded from Canvas. This dataset contains the total number of cases and deaths by age group (30-39,40-49,50-59,60-69,70-79,80-89,90+) and gender (male/female) as of 29 August 2020.

b) Fit a logistic regression to the Covid data, with age group and gender treated as factors, without interactions, using the function glm. Report co-efficient estimates and the variance-covariance matrix. If the co-efficient estimates can be regarded as the maximum a posteriori (MAP) estimates, what prior have we assumed for β?

c) Perform Bayesian logistic regression analysis using Metropolis-Hasting sampling. Note the generalized linear regression with logit function can be expressed as

yi ∼ Binomial(ni, pi)

logit(pi) = xi′ β, where logit(pi) = ln? pi ?

1−pi

Use the prior of β determined in b) in the analysis. Extract the design matrix X from the glm output. Use a multivariate normal distribution with mean β(t−1) and variance-covariance matrix c2Σˆ as the proposal distribution, where Σˆ is the variance-covariance matrix from the glm fit. Consider three candidates for c: 1.6/√p, 2.4/√p, 3.2/√p, where p is the number of parameters estimated. Run the Metropolis-Hasting algorithm for 10000 iterations, and discard the first 5000. Report the following:

• Check each generated chain converges to the same distribution. To do this, you may find installing the R package coda helpful.

• The proportion of candidate draws that are accepted.

• The effective sample size for each chain.

• Based on your results to the above bullet points, what do you think is the best choice for c?

Hint: You can refer to "Group-level: Proportion data" in http://www.simonqueenborough.info/R/statistics/glm- binomial to see how to get the initial estimation using glm function in R.