STAT 4001 COVID Project
Instructions: In the following questions, provide at least 3 significant digits for every
answer, e.g. $\hat{\beta}_0 = 0.423$, $\hat{\beta}_1 = 1.83$. More than 3 is also acceptable. It does not
matter which software, e.g. Excel, R, Python, you use to get your answers. We do not need your
code. No partial credit is given for any answer except those that require explanations.
The adjusted $R^2$, denoted $\bar{R}^2$, is a modification of the traditional $R^2$ defined by
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$
where $n$ is the number of data points and $p$ is the number of predictor variables. Note that the
adjusted $R^2$ penalizes you for having more predictor variables.
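As a quick illustration of the penalty, the formula above can be sketched in Python. The function name `adjusted_r2` and the input values are illustrative assumptions, not part of the assignment data.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more data points than p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Illustrative values only: the same R^2 with more predictors
# gives a smaller adjusted R^2.
one_predictor = adjusted_r2(0.80, n=37, p=1)
four_predictors = adjusted_r2(0.80, n=37, p=4)
print(one_predictor, four_predictors)
```

With the same $R^2$ and $n$, the value for $p = 4$ comes out below the value for $p = 1$, which is the penalty described above.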
Question 1.
From the NY COVID Data, for the daily NY data from 3/25/2020 through 4/30/2020, let $X_i$
be the reported confirmed cases for the $i$th day, and let $Y_i$ be the reported deaths for the $i$th
day, where $i = 1$ starts from 3/25/2020.
(a) (2 points) Fit a least squares line (simple linear regression) to the data $(X_i, Y_i)$ and state
the regression line.
(b) (5 points) Fit a least squares line instead to $(X_i^{(k)}, Y_i)$, where $X_i^{(k)} = X_{i-k}$. In other
words, we are pairing the number of reported cases from $k$ days ago with the number of
reported deaths today. Let $k \in \{2, 3, 4, 5, 6\}$. Let all the $X_i^{(k)}$ be such that they all
start from the same day, 3/25/2020, i.e. we are technically shifting $Y_i$ instead of $X_i$, but
from a modeling perspective it makes more sense to use data from the past than from the
future. In other words, the $X_i^{(k)}$ are exactly the same for all $k$ for every $i$.
Fit a least squares line for $(X_i^{(k)}, Y_i)$ and summarize the following: the regression
coefficients $\hat{\beta}_0$, $\hat{\beta}_1$ and the adjusted $R^2$ value for all 5 regressions.
(c) From your answers in parts (a) and (b), let the criterion be that we want to maximize the
adjusted $R^2$. Out of all 6 regressions (1 from (a) and 5 from (b)),
(i) (1 point) Which one has the highest adjusted $R^2$?
(ii) (2 points) According to the adjusted $R^2$, is this a good fit? Give an interpretation of why
this regression fits the best.
(d) (2 points) Note that $X_i^{(0)} = X_{i-0} = X_i$. Let $j \in \{0, 2, 3, 4, 5, 6\}$ be the answer to part
(c). Let $\tilde{X}_i = X_i^{(j)}$. Fit a simple linear regression with the following transformation:
$(\tilde{X}_i, \log(Y_i))$. What is the adjusted $R^2$? If the adjusted $R^2$ value for the transformation
is better than the adjusted $R^2$ value from part (c) subpart (i), provide a short explanation
why this may be the case.
(e) In the previous part, we obtain the estimates $\hat{\beta}_0 = 5.22$ and $\hat{\beta}_1 = 1.24 \times 10^{-4}$ for the
intercept and the slope, respectively. The t-values for the respective estimates are
32.14 and 5.998.
(i) (1 point) What is the number of degrees of freedom?
(ii) (4 points) Using the t-statistic, test whether a linear association exists between
$\log Y_i$, the log of the reported deaths, and $\tilde{X}_i$. State the null hypothesis, alternative
hypothesis, decision rule, and conclusion. Test at the significance level $\alpha = 0.001$.
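The workflow in Question 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the series `CASES` and `DEATHS` are made-up toy numbers (not the NY COVID Data), the names `fit_line`, `results`, and `best_k` are hypothetical, and the lag is implemented by keeping the case series fixed and reading the death series $k$ days later, which is one reading of part (b). The critical value for the t-test is not computed here; it would come from a t table or a statistics package.

```python
import math


def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, adjusted R^2, t-statistic of b1, degrees of freedom).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    df = n - 2  # n observations minus two estimated coefficients
    adj_r2 = 1 - (sse / sst) * (n - 1) / df  # p = 1 predictor
    t_b1 = b1 / math.sqrt(sse / df / sxx)  # estimate / standard error
    return b0, b1, adj_r2, t_b1, df


# Toy data (hypothetical): 8 days of cases, 12 days of deaths so that
# the death series can be read up to 4 days later than the case series.
CASES = [1000, 1400, 1900, 2600, 3000, 3300, 3400, 3200]
DEATHS = [20, 30, 42, 57, 73, 102, 133, 157, 168, 177, 163, 152]
N = len(CASES)

# One regression per lag k: cases on day i paired with deaths on day i + k.
results = {k: fit_line(CASES, DEATHS[k:k + N]) for k in range(5)}
best_k = max(results, key=lambda k: results[k][2])

# Same pairing as the best lag, but with log-transformed deaths.
log_fit = fit_line(CASES, [math.log(d) for d in DEATHS[best_k:best_k + N]])

for k, (b0, b1, adj_r2, t_b1, df) in results.items():
    print(k, round(b0, 3), round(b1, 5), round(adj_r2, 4))
print("best lag:", best_k, "log-fit adjusted R^2:", round(log_fit[2], 4))
```

Because every lag uses the same number of points and one predictor, ranking by adjusted $R^2$ here is the same as ranking by $R^2$; the adjustment matters once models with different numbers of predictors are compared, as in Question 2.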
Question 2.
Let us go back to $\tilde{X}_i$ from your answer to part (d). We will now incorporate the other
covariates/predictor variables. From the NY COVID Data, for the daily NY data from 6/6/2020
through 11/29/2020, let $H_i$ := “Currently Hospitalized”, $I_i$ := “In ICU Currently”, $V_i$ := “On
Ventilator Currently” on the $i$th day. So $H_i$, $I_i$, and $V_i$ all start on 6/6/2020, but
$\tilde{X}_i = X_i^{(j)}$ starts $j$ days before 6/6/2020. In particular, fit a multiple linear regression with
(i) (1 point) All predictor variables $\tilde{X}_i$, $H_i$, $I_i$, $V_i$ against $Y_i$. Calculate the adjusted $R^2$. Use
4 significant digits.
(ii) (1 point) Only with the predictor variables $H_i$, $I_i$, $V_i$ against $Y_i$. Calculate the adjusted
$R^2$. Use 4 significant digits.
(iii) (1 point) Only with the predictor variables $H_i$, $I_i$ against $Y_i$. Calculate the adjusted $R^2$.
Use 4 significant digits.
(iv) (2 points) You should get that the combination in part (iii) has the highest adjusted
$R^2$. In fact, using different criteria (which you don’t have to do), you can see that the
combination in part (iii) is the best. This implies that the number of reported cases is
in fact not as important as the other predictor variables. Further, note that the number of
patients on ventilators $V_i$ has also been eliminated from part (ii) to part (iii). Why would
this be the case? What kind of relationship is there between $V_i$ and $I_i$, the number of
patients in the ICU? Hence, conclude that the information obtained from $V_i$ does not
help much from a modeling perspective.
(v) (1 point) Is the model in part (iii) a good model? There is no single correct answer.
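The model comparison in Question 2 can be sketched in plain Python as below. This is only a sketch under stated assumptions: the series `H`, `I`, `V`, and `Y` are made-up toy numbers (not the NY COVID Data), the helper names `solve`, `adjusted_r2_ols`, and `pearson` are hypothetical, and the fit uses the normal equations, which is fine for a small example but is not what a statistics package would typically do numerically.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def adjusted_r2_ols(predictors, y):
    """Adjusted R^2 of a least squares fit of y on an intercept plus the given predictors."""
    n, p = len(y), len(predictors)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p + 1)] for a in range(p + 1)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p + 1)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - (sse / sst) * (n - 1) / (n - p - 1)


def pearson(u, v):
    """Sample correlation coefficient between two equal-length series."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Toy data (hypothetical): hospitalized, in ICU, on ventilator, deaths.
H = [900, 950, 870, 1100, 1000, 920, 1150, 1050]
I = [100, 120, 90, 150, 130, 110, 160, 140]
V = [52, 59, 46, 76, 64, 56, 79, 71]
Y = [30, 36, 27, 48, 40, 33, 52, 44]

print("H, I, V:", adjusted_r2_ols([H, I, V], Y))
print("H, I   :", adjusted_r2_ols([H, I], Y))
print("corr(V, I):", pearson(V, I))
```

When two predictors are almost perfectly correlated, as `V` and `I` are in this toy data, the second one adds little new information while still costing a degree of freedom in the adjusted $R^2$.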