This homework is due on Tuesday November 16 at 11:59pm. There is a total of 38 points. Submit your solutions in a pdf document on Canvas. Include your R code (which must be commented and properly indented) in the pdf file. Copying code from websites is not permitted. Cite all sources (including lecture notes). Show all of the steps that you took to solve each problem. Please name the pdf file


1. You will analyze a reduced version of a dataset from Karagas et al. (1996). There are n = 21 subjects. The response is arsenic.toenail, which is the level of arsenic in the subject’s toenail. There are three explanatory variables:

• arsenic.water, the level of arsenic in the subject’s household water supply;
• gender, the gender of the subject;
• age, the age of the subject in years.

The dataset is in the dataframe object arsenic in the R binary file “arsenic.rdata” posted on Canvas. If this file is in R’s current working directory, then the command load("arsenic.rdata") puts the dataframe object arsenic in R’s workspace. Calling the functions lm() or glm() is not allowed in this problem.
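Because lm() and glm() are off limits in this problem, a least-squares fit has to be computed from the design matrix directly. The sketch below shows one possible way to do that on made-up data. The data frame toy and its columns x, grp, z, and y are hypothetical stand-ins, not the arsenic data, and the sketch assumes that using model.matrix() to build the design matrix is acceptable.

```r
set.seed(1)
n <- 21

# Hypothetical stand-in data (NOT the arsenic dataset).
toy <- data.frame(
  x   = runif(n),
  grp = factor(rep(c("A", "B"), length.out = n)),
  z   = round(runif(n, 20, 80))
)
toy$y <- 0.5 + 1.2 * toy$x + 0.01 * toy$z + rnorm(n, sd = 0.3)

# Design matrix: intercept, main effects, and an x-by-grp interaction.
X <- model.matrix(~ x * grp + z, data = toy)
y <- toy$y

# Least-squares coefficients solve the normal equations X'X b = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Unbiased estimate of the error variance: RSS / (n - p).
resid <- y - X %*% beta_hat
sigma2_hat <- sum(resid^2) / (nrow(X) - ncol(X))

beta_hat
sigma2_hat
```

Solving the normal equations is the most direct transcription of the textbook formula. A QR-based solve such as qr.solve(X, y) is a numerically more stable alternative.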

(a) (4 points) Fit a linear regression model to these data, where the response is the natural logarithm of arsenic.toenail, and the explanatory variables are those listed above with the addition of an interaction between gender and arsenic.water. Report estimates of the regression coefficients and the error variance.

(b) (5 points) What does the model used in part 1a assume about these data? We are looking for a full specification of the data-generating model here, where all symbols are defined and it is clear what is unknown. Phrases like “realization of” should be used.

(c) (5 points) Let the model with the three explanatory variables listed (without interactions) be our full model. Determine the submodel of this full model (which has a subset of the explanatory variables) that is selected by AIC. Ensure that all possible submodels that respect the hierarchy of terms are evaluated.
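An all-subsets AIC comparison of the kind described in part 1c can be organized as a loop over candidate design matrices. The sketch below is only an illustration on made-up data: toy, x1, x2, x3, and the helper aic_ls() are hypothetical names. It assumes the AIC convention that R's AIC() uses for Gaussian linear models, in which the error variance counts as one parameter. If the course uses a version that differs by a constant shared by all models, the ranking is unchanged.

```r
set.seed(2)
n <- 30

# Hypothetical stand-in data (NOT the arsenic dataset).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

# AIC of a Gaussian linear model fit by least squares:
# -2 * (maximized loglikelihood) + 2 * (number of coefficients + 1).
aic_ls <- function(X, y) {
  n <- length(y)
  beta_hat <- qr.solve(X, y)            # least-squares fit without lm()
  rss <- sum((y - X %*% beta_hat)^2)
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * (ncol(X) + 1)
}

vars <- c("x1", "x2", "x3")

# Each row of `keep` flags which variables enter one candidate submodel.
keep <- expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))

aic_table <- data.frame(model = character(nrow(keep)),
                        aic = numeric(nrow(keep)),
                        stringsAsFactors = FALSE)
for (i in seq_len(nrow(keep))) {
  chosen <- vars[unlist(keep[i, ])]
  rhs <- if (length(chosen) == 0) "1" else paste(chosen, collapse = " + ")
  X <- model.matrix(as.formula(paste("~", rhs)), data = toy)
  aic_table$model[i] <- rhs
  aic_table$aic[i] <- aic_ls(X, toy$y)
}

# Candidate submodels sorted from smallest to largest AIC.
aic_table[order(aic_table$aic), ]
```

With three main effects and no interactions, every subset respects the hierarchy of terms, so all 2^3 = 8 candidates appear in the table.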

2. Suppose that the yet-to-be observed measurements of a response $X_1, \ldots, X_n$ are iid $N(\mu_*, \mu_*)$, where $\mu_* \in (0, \infty)$ is unknown. We will study three competing estimators of $\mu_*$: $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$, $S^2 = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$, and $\hat{\mu}$, defined as the maximum likelihood estimator of $\mu_*$. The negative loglikelihood function $f : (0, \infty) \to \mathbb{R}$ is defined by

$$f(\mu) = \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\mu) + \frac{1}{2\mu} \sum_{i=1}^{n} (X_i - \mu)^2.$$
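The definition of f can be transcribed into R and checked against R's built-in normal density. The sketch below is one possible transcription. The function name negloglik, the sample x, the evaluation point 0.7, and the search interval passed to optimize() are all illustrative assumptions. The numerical minimizer is offered only as a way to sanity-check a closed-form answer, not as a replacement for a derivation.

```r
# Negative loglikelihood of an iid N(mu, mu) sample x, evaluated at a scalar mu > 0.
negloglik <- function(mu, x) {
  n <- length(x)
  (n / 2) * log(2 * pi) + (n / 2) * log(mu) + sum((x - mu)^2) / (2 * mu)
}

set.seed(3)
# Illustrative sample: mean 0.5 and variance 0.5 (so sd = sqrt(0.5)).
x <- rnorm(10, mean = 0.5, sd = sqrt(0.5))

# Check against the built-in density at an arbitrary point, mu = 0.7.
check <- -sum(dnorm(x, mean = 0.7, sd = sqrt(0.7), log = TRUE))
all.equal(negloglik(0.7, x), check)

# Numerical minimizer over an assumed search interval.
opt <- optimize(negloglik, interval = c(1e-3, 10), x = x)
opt$minimum
```

Note that negloglik() is not vectorized in mu, so graphing it over a grid of values requires something like sapply(grid, negloglik, x = x).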

(a) (3 points) A statistician claims that $\mathrm{cov}(\bar{X}, S^2) = 0$. Perform a simulation study to see if there is simulation-based statistical evidence that $\mathrm{cov}(\bar{X}, S^2) \neq 0$. Set $\mu_* = 0.5$ and $n = 10$. It is recommended that you make a 95% approximate simulation-based confidence interval for $\mathrm{cov}(\bar{X}, S^2) = E((\bar{X} - \mu_*)(S^2 - \mu_*))$ based on 10,000 independent replications.


(b) (2 points) Show that every convex combination of $\bar{X}$ and $S^2$ is unbiased for $\mu_*$.

(c) (5 points) Consider the competing unbiased estimator of $\mu_*$ defined by $\hat{\lambda} \bar{X} + (1 - \hat{\lambda}) S^2$, where

$$\hat{\lambda} = \operatorname*{arg\,min}_{\lambda \in [0,1]} E\left[ \left( \lambda \bar{X} + (1 - \lambda) S^2 - \mu_* \right)^2 \right].$$

Using the fact that $\mathrm{cov}(\bar{X}, S^2) = 0$, derive a simple formula for $\hat{\lambda}$. This formula should involve $n$ and $\mu_*$. Since $\mu_*$ is unknown in practice, this estimator would need to be modified for practical use, e.g. by replacing $\mu_*$ with its maximum likelihood estimator in the formula for $\hat{\lambda}$.

(d) (4 points) Find the convex subset of $(0, \infty)$ over which the negative loglikelihood is a convex function. At least one endpoint for this interval should involve $n$ and $X_1, \ldots, X_n$.

(e) (2 points) Set $n = 10$ and $\mu_* = 0.5$. Generate a realization of $X_1, \ldots, X_{10}$ and graph the realization of $f$ over the interval derived in part 2d. Since the left boundary of this interval is zero, which is not in the domain of $f$, I recommend choosing the left endpoint close to 0.05 or 0.1 (instead of values very close to zero like $10^{-7}$) to improve the illustration.

(f) (2 points) Let $\hat{\mu}$ be the maximum likelihood estimator of $\mu_*$. Derive a simplified expression for $\hat{\mu}$.

(g) (6 points) Set $n = 10$. For each $\mu_* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, perform a simulation study that computes 99% approximate simulation-based confidence intervals, based on 10,000 replications, for the following five expected values: $E(|\bar{X} - \mu_*|)$, $E(|S^2 - \mu_*|)$, $E(|\hat{\mu} - \mu_*|)$, $E(|\bar{X} - \mu_*| - |\hat{\mu} - \mu_*|)$, and $E(|S^2 - \mu_*| - |\hat{\mu} - \mu_*|)$. In addition, for each value of $\mu_*$ used, report the value of $\hat{\lambda}$ derived in part 2c. Based on the results of this simulation study, which of the three estimators of $\mu_*$ is the best? Explain.
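Parts 2a and 2g both ask for approximate simulation-based confidence intervals for an expected value. One common construction, assumed here and possibly different from the one given in lecture, averages independent replications and uses a normal approximation for the sample mean. The helper name sim_ci and the toy target below are illustrative only. The target is deliberately a different quantity from those in the problem, chosen because its true value is known.

```r
# Approximate CI for E(W) from iid replications w_1, ..., w_R:
# mean(w) +/- z * sd(w) / sqrt(R), justified by the central limit theorem.
sim_ci <- function(w, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  half <- z * sd(w) / sqrt(length(w))
  c(estimate = mean(w), lower = mean(w) - half, upper = mean(w) + half)
}

set.seed(4)
reps <- 10000
n <- 10

# Toy target: E(|Xbar|) for iid N(0, 1) data, whose true value is sqrt(2 / (pi * n)).
w <- replicate(reps, abs(mean(rnorm(n))))

ci <- sim_ci(w, level = 0.99)
ci
sqrt(2 / (pi * n))  # true value, for comparison with the interval
```

For a difference of two expected values, such as those in part 2g, computing both quantities on the same simulated sample and passing the per-replication differences to the helper keeps the replications independent of one another while accounting for the dependence within each pair.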
