STAT 5701 Homework 4 – Fall 2021

This homework is due on Tuesday, November 16 at 11:59pm. There is a total of 38 points. Submit your solutions in a pdf document on Canvas. Include your R code (which must be commented and properly indented) in the pdf file. Copying code from websites is not permitted. Cite all sources (including lecture notes). Show all of the steps that you took to solve each problem. Please name the pdf file
-HW4.pdf. Please also submit one text file with your R code, which must be commented and properly indented.

1. You will analyze a reduced version of a dataset from Karagas et al. (1996). There are $n = 21$ subjects. The response is arsenic.toenail, which is the level of arsenic in the subject’s toenail. There are three explanatory variables:

• arsenic.water, the level of arsenic in the subject’s household water supply;
• gender, the gender of the subject;
• age, the age of the subject in years.

The dataset is in the dataframe object arsenic in the R binary file “arsenic.rdata” posted on Canvas. If this file is in R’s current working directory, then the command load("arsenic.rdata") puts the dataframe object arsenic in R’s workspace. Calling the functions lm() or glm() is not allowed in this problem.

(a) (4 points) Fit a linear regression model to these data, where the response is the natural logarithm of arsenic.toenail, and the explanatory variables are those listed above with the addition of an interaction between gender and arsenic.water. Report estimates of the regression coefficients and the error variance.

(b) (5 points) What does the model used in part 1a assume about these data? We are looking for a full specification of the data-generating model here, where all symbols are defined and it is clear what is unknown. Phrases like “realization of” should be used.

(c) (5 points) Let the model with the three explanatory variables listed (without interactions) be our full model. Determine the submodel of this full model (which has a subset of the explanatory variables) that is selected by AIC. Ensure that all possible submodels that respect the hierarchy of terms are evaluated.

2. Suppose that the yet-to-be observed measurements of a response $X_1, \ldots, X_n$ are iid $N(\mu_*, \mu_*)$, where $\mu_* \in (0, \infty)$ is unknown. We will study three competing estimators of $\mu_*$: $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$, $S^2 = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$, and $\hat{\mu}$, defined as the maximum likelihood estimator of $\mu_*$.
The negative loglikelihood function $f : (0, \infty) \to \mathbb{R}$ is defined by
\[
f(\mu) = \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\mu) + \frac{1}{2\mu} \sum_{i=1}^{n} (X_i - \mu)^2.
\]

(a) (3 points) A statistician claims that $\mathrm{cov}(\bar{X}, S^2) = 0$. Perform a simulation study to see if there is simulation-based statistical evidence that $\mathrm{cov}(\bar{X}, S^2) \neq 0$. Set $\mu_* = 0.5$ and $n = 10$. It is recommended that you make a 95% approximate simulation-based confidence interval for $\mathrm{cov}(\bar{X}, S^2) = E((\bar{X} - \mu_*)(S^2 - \mu_*))$ based on 10,000 independent replications.

(b) (2 points) Show that every convex combination of $\bar{X}$ and $S^2$ is unbiased for $\mu_*$.

(c) (5 points) Consider the competing unbiased estimator of $\mu_*$ defined by $\hat{\lambda} \bar{X} + (1 - \hat{\lambda}) S^2$, where
\[
\hat{\lambda} = \arg\min_{\lambda \in [0,1]} E\left[ \left( \lambda \bar{X} + (1 - \lambda) S^2 - \mu_* \right)^2 \right].
\]
Using the fact that $\mathrm{cov}(\bar{X}, S^2) = 0$, derive a simple formula for $\hat{\lambda}$. This formula should involve $n$ and $\mu_*$. Since $\mu_*$ is unknown in practice, this estimator would need to be modified for practical use, e.g. by replacing $\mu_*$ with its maximum likelihood estimator in the formula for $\hat{\lambda}$.

(d) (4 points) Find the convex subset of $(0, \infty)$ over which the negative loglikelihood is a convex function. At least one endpoint for this interval should involve $n$ and $X_1, \ldots, X_n$.

(e) (2 points) Set $n = 10$ and $\mu_* = 0.5$. Generate a realization of $X_1, \ldots, X_{10}$ and graph the realization of $f$ over the interval derived in part 2d. Since the left boundary of this interval is zero, which is not in the domain of $f$, I recommend choosing the left endpoint close to 0.05 or 0.1 (instead of values very close to zero like $10^{-7}$) to improve the illustration.

(f) (2 points) Let $\hat{\mu}$ be the maximum likelihood estimator of $\mu_*$. Derive a simplified expression for $\hat{\mu}$.

(g) (6 points) Set $n = 10$. For each $\mu_* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, perform a simulation study that computes 99% approximate simulation-based confidence intervals, based on 10,000 replications, for the following five expected values: $E(|\bar{X} - \mu_*|)$, $E(|S^2 - \mu_*|)$, $E(|\hat{\mu} - \mu_*|)$, $E(|\bar{X} - \mu_*| - |\hat{\mu} - \mu_*|)$, and $E(|S^2 - \mu_*| - |\hat{\mu} - \mu_*|)$.
In addition, for each value of $\mu_*$ used, report the value of $\hat{\lambda}$ derived in part 2c. Based on the results of this simulation study, which of the three estimators of $\mu_*$ is the best? Explain.