STAT331: Assignment 3
Due: Friday, July 23, 2021 at 5 pm ET on Crowdmark
General instructions:
• Your work may be written up using R Markdown, LaTeX, or Word. If you hand-write your
solutions, make sure they are legible. No points will be given if the grader cannot read your
handwriting.
• You may discuss problems with your peers, but you must write up your own answers and
include the names of anyone you worked with on your assignment.
• For data analysis problems: You must clearly present your final answers in addition to the
steps or commands for obtaining your answers. You must include well-commented R code
(and only necessary code) to reproduce your work.
1. [Coding: 10 points] Write an R function from scratch to conduct forward selection as
described in class. Your function should let the user choose the selection criterion:
adjusted $R^2$, AIC, or BIC. You may use built-in R functions such as lm() or
summary(), but you may not use any function dedicated to model selection, such as step(). You
only need to accommodate continuous covariates, but you can earn bonus points for handling
categorical covariates and hierarchically well-formulated interactions.
Make sure to thoroughly comment your code.
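One possible way to organise such a function is sketched below. It is only a sketch under stated assumptions: continuous covariates with syntactic column names, a stopping rule of "stop when no single addition improves the criterion", and hypothetical names (`score_model`, `forward_select`, the `criterion` labels) that are not prescribed by the assignment. The version described in class may differ in its details.

```r
# Hypothetical helper: score a fitted lm so that larger is always better.
score_model <- function(fit, criterion) {
  switch(criterion,
         adjr2 = summary(fit)$adj.r.squared,
         aic   = -AIC(fit),
         bic   = -BIC(fit),
         stop("unknown criterion: ", criterion))
}

# Forward selection over all columns of `data` other than `response`.
forward_select <- function(data, response, criterion = "aic") {
  candidates <- setdiff(names(data), response)
  selected   <- character(0)
  # Start from the intercept-only model.
  best_fit   <- lm(reformulate("1", response = response), data = data)
  best_score <- score_model(best_fit, criterion)

  while (length(candidates) > 0) {
    # Fit one model per remaining candidate added to the current set.
    fits <- lapply(candidates, function(v) {
      lm(reformulate(c(selected, v), response = response), data = data)
    })
    scores <- vapply(fits, score_model, numeric(1), criterion = criterion)
    j <- which.max(scores)
    # Stop when no single addition improves the criterion.
    if (scores[j] <= best_score) break
    selected   <- c(selected, candidates[j])
    candidates <- candidates[-j]
    best_fit   <- fits[[j]]
    best_score <- scores[j]
  }
  list(selected = selected, fit = best_fit)
}

# Toy usage on simulated data (not the assignment dataset).
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + rnorm(n)
res <- forward_select(toy, "y", criterion = "bic")
res$selected
```

Negating AIC and BIC lets one comparison rule ("larger is better") serve all three criteria, which keeps the selection loop free of criterion-specific branches.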
2. [Theory: 5 points] Consider the jackknife studentized residuals from class (Lecture 19),
$$r_{i(i)} = \frac{e_i}{\sqrt{\hat{\sigma}^2_{(i)} (1 - h_i)}},$$
where $\hat{\sigma}^2_{(i)}$ is the MSE (i.e. our usual unbiased estimator of $\sigma^2$) for the
model fit to all but the $i$th observation. Show that
$$r_{i(i)} = r_i \left[ \frac{n - p - 2}{n - p - 1 - r_i^2} \right]^{1/2},$$
where $r_i$ is the $i$th studentized residual. Note: you may want to wait until Lecture 19 to start this problem.
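Before attempting the proof, it can help to confirm the identity numerically. The sketch below does this on simulated data and is not a substitute for the derivation. It assumes that $p$ counts the covariates excluding the intercept, that R's rstandard() returns the studentized residuals $r_i$, and that rstudent() returns the jackknife residuals $r_{i(i)}$. The toy data and variable names are made up for illustration.

```r
set.seed(1)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
p   <- 2                    # assumed: covariates only, intercept not counted

r      <- rstandard(fit)    # studentized residuals r_i
r_jack <- rstudent(fit)     # jackknife (leave-one-out) studentized residuals

# Right-hand side of the identity to be shown.
r_formula <- r * sqrt((n - p - 2) / (n - p - 1 - r^2))

max_diff <- max(abs(r_jack - r_formula))
max_diff                    # should be zero up to floating-point error
```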
3. [Data analysis: 10 points] The dataset energyapp.csv for this problem is posted on
Learn, and comes from a study (Energy and Buildings, Volume 140, pp. 81-97) on the energy
use of appliances in homes. The dataset contains n = 979 observations, recorded in order
over time. The response variable of interest is appEuse, the energy use of appliances (in Wh).
The following explanatory variables are available:
• lights, energy use of light fixtures in the house in Wh
• T1, Temperature in kitchen area, in Celsius
• RH 1, Humidity in kitchen area, in %
• T2, Temperature in living room area, in Celsius
• RH 2, Humidity in living room area, in %
• T3, Temperature in laundry room area, in Celsius
• RH 3, Humidity in laundry room area, in %
• T4, Temperature in office room, in Celsius
• RH 4, Humidity in office room, in %
• T5, Temperature in bathroom, in Celsius
• RH 5, Humidity in bathroom, in %
• T6, Temperature outside the building (north side), in Celsius
• RH 6, Humidity outside the building (north side), in %
• T7, Temperature in ironing room, in Celsius
• RH 7, Humidity in ironing room, in %
• T8, Temperature in teenager room 2, in Celsius
• RH 8, Humidity in teenager room 2, in %
• T9, Temperature in parents room, in Celsius
• RH 9, Humidity in parents room, in %
• To, Temperature outside, in Celsius
• Pressure, in mm Hg
• RH out, Humidity outside, in %
• Windspeed, in m/s
• Visibility, in km
• Tdewpoint, dew point, in Celsius
First split the data into a training set of the first 500 observations, for use in parts (a)–(c),
and a test set of the remaining observations, for use in part (d).
(a) Intuitively (in plain English), would you expect strong multicollinearity among the
explanatory variables in this dataset? What could happen if all the explanatory variables are
used in a multiple linear regression model?
(b) As a rule of thumb, consider a VIF > 10 to indicate high multicollinearity. Starting from
a model regressing appEuse on all the explanatory variables, remove explanatory variables
with VIF > 10 one at a time, excluding the covariate with the largest VIF at each step and
recomputing the VIFs, until no variable with 'high' multicollinearity remains. How many
explanatory variables are left? Why might this screening procedure be preferable to excluding
all explanatory variables with VIF > 10 simultaneously?
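The one-at-a-time screening in (b) can be sketched as follows. This is a minimal illustration on simulated data, not the assignment dataset. The helper names `vif_values` and `vif_screen` are hypothetical, and the VIFs are computed from the identity that $\mathrm{VIF}_j$ equals the $j$th diagonal entry of the inverse correlation matrix of the covariates, which may differ from the method used in class.

```r
# Hypothetical helper: VIFs for the numeric columns of X, using
# VIF_j = j-th diagonal entry of the inverse correlation matrix.
vif_values <- function(X) {
  v <- diag(solve(cor(X)))
  names(v) <- colnames(X)
  v
}

# Drop the covariate with the largest VIF, recompute, and repeat
# until every remaining VIF is at or below the threshold.
vif_screen <- function(X, threshold = 10) {
  while (ncol(X) > 1) {
    v <- vif_values(X)
    if (max(v) <= threshold) break
    X <- X[, names(v)[-which.max(v)], drop = FALSE]
  }
  X
}

# Toy data: a and b are nearly copies of each other, c is unrelated.
set.seed(1)
n <- 300
z <- rnorm(n)
toy <- data.frame(a = z + rnorm(n, sd = 0.1),
                  b = z + rnorm(n, sd = 0.1),
                  c = rnorm(n))
vif_values(toy)
kept <- names(vif_screen(toy))
kept
```

In this toy example both a and b start with a large VIF, yet only one of them needs to be removed.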
(c) Using this reduced set of explanatory variables, conduct forward selection using your
function from question 1, based on (i) adjusted $R^2$, (ii) AIC, and (iii) BIC. Report the 3 selected
models, and report the adjusted $R^2$, AIC, and BIC for each. [If you cannot complete question
1, you can use a built-in function for forward selection.]
(d) Report and comment on the prediction accuracy of each of the 3 models based on the
test dataset.
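A minimal sketch of a time-ordered train/test split and an out-of-sample accuracy measure is given below, again on simulated data. Root mean squared prediction error is used here as an assumption; the assignment does not name a specific accuracy measure. All object names are hypothetical.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

train <- toy[1:100, ]          # first block of rows, kept in recorded order
test  <- toy[101:n, ]          # remaining rows, held out for evaluation

fit  <- lm(y ~ x1, data = train)        # model chosen using training data only
pred <- predict(fit, newdata = test)    # predictions for unseen rows

rmse <- sqrt(mean((test$y - pred)^2))   # root mean squared prediction error
rmse
```

Computing the same quantity for each selected model puts their test-set performance on a common scale, in the units of the response.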
4. [Simulation: 15 points] In this problem we will conduct a simulation study to explore the
effects of variable selection on inference.
The ‘true’ model we will assume for this problem is
$$y_i = \beta_0 + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad \text{for } i = 1, \ldots, n.$$
Important: Since this question involves simulation, your first line of R code for this problem
must set the random seed with this command: set.seed(123) where you replace 123 with
your student number.
(A) Conduct a simulation study as follows:
(i) Randomly generate n = 100 observations of p = 50 independent, standard normally
distributed covariates. That is, generate $x_1, \ldots, x_{50}$, where each $x_j$ is of length 100.
(ii) Randomly generate n = 100 outcomes, y, according to the true model above, with
$\beta_0 = 1$ and $\sigma = 2$.
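(iii) Regress the simulated outcomes y on the first 5 covariates, i.e. fit the following model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad \text{for } i = 1, \ldots, n.$$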
(iv) Now test for any association between these covariates and the outcome, i.e. test
whether all coefficients (except the intercept) are equal to 0 at the 5% level.
(v) Repeat steps (ii) to (iv) 1000 times. Plot a histogram of the p-values from (iv). In what
proportion of datasets do we reject the null hypothesis?
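Steps (i) to (v) could be laid out roughly as below. This is a sketch, not a model answer: the seed 123 is the placeholder from the instructions and should be replaced, the helper `overall_p` is a hypothetical name, and the overall F-test reported by summary() is assumed to be the intended test in step (iv).

```r
set.seed(123)                 # placeholder; use your own student number
n <- 100
p <- 50
n_sims <- 1000

# Step (i): covariates are generated once and reused in every replicate.
X  <- matrix(rnorm(n * p), nrow = n, ncol = p)
X5 <- X[, 1:5]

# Hypothetical helper: p-value of the overall F-test from a fitted lm.
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

p_values <- replicate(n_sims, {
  y <- 1 + rnorm(n, mean = 0, sd = 2)   # step (ii): beta_0 = 1, sigma = 2
  overall_p(lm(y ~ X5))                 # steps (iii) and (iv)
})

# Step (v): distribution of p-values and rejection rate at the 5% level.
hist(p_values, breaks = 20, main = "Overall F-test p-values", xlab = "p-value")
reject_rate <- mean(p_values < 0.05)
reject_rate
```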
(B) Repeat simulation (A), replacing step (iii) with the following. Fit five different models:
regress y on the first 5 covariates and call this model 1; regress y on the 6th–10th
covariates and call this model 2; and so on, up to regressing y on the 21st–25th covariates,
which is model 5. Compute the adjusted $R^2$ for each of these models, and choose the model
that fits best according to this criterion. Proceed to step (iv) using the model you select.
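The modified step (iii) in (B) might be sketched as follows, reusing the same layout as a plain simulation of (A). As before, the seed is a placeholder, and `overall_p` and `blocks` are hypothetical names introduced only for this illustration.

```r
set.seed(123)                 # placeholder; use your own student number
n <- 100
n_sims <- 1000
X <- matrix(rnorm(n * 50), nrow = n, ncol = 50)

# Column indices for models 1 to 5: 1-5, 6-10, ..., 21-25.
blocks <- split(1:25, rep(1:5, each = 5))

# Hypothetical helper: p-value of the overall F-test from a fitted lm.
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

p_values <- replicate(n_sims, {
  y <- 1 + rnorm(n, mean = 0, sd = 2)
  # Fit the five candidate models and keep the one with the largest adjusted R^2.
  fits <- lapply(blocks, function(cols) lm(y ~ X[, cols]))
  adj  <- vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1))
  overall_p(fits[[which.max(adj)]])
})

reject_rate <- mean(p_values < 0.05)
reject_rate
```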
(C) Summarize your results. Explain what we can learn from this.
(D) What if instead of manually comparing models as in (B), we used automatic selection?
Repeat simulation (A), this time replacing step (iii) with: Use forward selection to pick a
model (based on AIC), considering all 50 covariates. Proceed to step (iv) using the model
you select.
(E) Explain your findings. Why are the results so extreme?
(F) What if we are interested in the association between $x_{i1}$ and $y_i$, but we want to use
automatic model selection to choose which other covariates to include in the model? Repeat
part (D), but ensure that $x_{i1}$ is always in the model. Instead of testing for any association,
test for association between $x_{i1}$ and $y_i$ (adjusted for the other covariates selected into the
model). What can you conclude about the effects of selection on inference?
Remember to include your R code in your submissions for Questions 1, 3 and 4!