STAT331: Assignment 3

Due: Friday, July 23, 2021 at 5pm ET on Crowdmark

General instructions:

• Your work may be written up using R Markdown, LaTeX, or Word. If you hand-write your solutions, make sure they are legible. No points will be given if the grader cannot read your handwriting.

• You may discuss problems with your peers, but you must write up your own answers, and include the names of anyone you worked with on your assignment.

• For data analysis problems: you must clearly present your final answers in addition to the steps or commands for obtaining your answers. You must include well-commented R code (and only necessary code) to reproduce your work.

1. [Coding: 10 points] Write an R function from scratch to conduct forward selection as described in class. Your function should allow the following user-specified options: selection based on adjusted R², AIC, or BIC. You may use built-in functions in R such as lm() or summary(), but you may not use any functions dedicated to model selection, such as step(). You only need to accommodate continuous covariates, but you can earn bonus points for handling categorical covariates and hierarchically well-formulated interactions.

Make sure to thoroughly comment your code.
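One possible skeleton for such a function is sketched below; the function name forward_select, its arguments, and the internal helper are all illustrative, not required by the assignment. AIC and BIC are negated so that every criterion is maximized.

```r
# Sketch of a forward-selection function (names are illustrative).
# criterion is one of "adjr2", "AIC", "BIC".
forward_select <- function(data, response, criterion = c("adjr2", "AIC", "BIC")) {
  criterion <- match.arg(criterion)
  # Score a fitted model; larger is better for adjusted R^2,
  # so negate AIC/BIC to keep a single "maximize" convention.
  score <- function(fit) {
    switch(criterion,
           adjr2 = summary(fit)$adj.r.squared,
           AIC   = -AIC(fit),
           BIC   = -BIC(fit))
  }
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  # Start from the intercept-only model.
  best <- score(lm(reformulate("1", response), data = data))
  repeat {
    # Score the model obtained by adding each remaining candidate.
    scores <- sapply(candidates, function(v) {
      score(lm(reformulate(c(selected, v), response), data = data))
    })
    if (length(scores) == 0 || max(scores) <= best) break  # no improvement: stop
    best <- max(scores)
    pick <- candidates[which.max(scores)]
    selected <- c(selected, pick)
    candidates <- setdiff(candidates, pick)
  }
  lm(reformulate(if (length(selected)) selected else "1", response), data = data)
}
```

A real submission would add input checking and heavier commenting; this only illustrates the greedy add-one-at-a-time structure.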


2. [Theory: 5 points] Consider the jackknife studentized residuals from class (Lecture 19),
$$r_{i(i)} = \frac{e_i}{\sqrt{\hat{\sigma}^2_{(i)}(1 - h_i)}},$$
where $\hat{\sigma}^2_{(i)}$ is the MSE (i.e. our usual unbiased estimator of $\sigma^2$) for the model fit to all but the ith observation. Show that
$$r_{i(i)} = r_i \left[ \frac{n - p - 2}{n - p - 1 - r_i^2} \right]^{1/2},$$
where $r_i$ is the ith studentized residual. Note: you may want to wait until Lecture 19 to start this problem.


3. [Data analysis: 10 points] The dataset energyapp.csv for this problem is posted on Learn, and comes from a study (Energy and Buildings, Volume 140, pp. 81-97) on the energy use of appliances in homes. The dataset contains n = 979 observations, recorded in order over time. The response variable of interest is appEuse, the energy use of appliances (in Wh). The following explanatory variables are available:

• lights, energy use of light fixtures in the house, in Wh
• T1, Temperature in kitchen area, in Celsius
• RH 1, Humidity in kitchen area, in %
• T2, Temperature in living room area, in Celsius
• RH 2, Humidity in living room area, in %
• T3, Temperature in laundry room area, in Celsius
• RH 3, Humidity in laundry room area, in %
• T4, Temperature in office room, in Celsius
• RH 4, Humidity in office room, in %
• T5, Temperature in bathroom, in Celsius
• RH 5, Humidity in bathroom, in %
• T6, Temperature outside the building (north side), in Celsius
• RH 6, Humidity outside the building (north side), in %
• T7, Temperature in ironing room, in Celsius
• RH 7, Humidity in ironing room, in %
• T8, Temperature in teenager room 2, in Celsius
• RH 8, Humidity in teenager room 2, in %
• T9, Temperature in parents room, in Celsius
• RH 9, Humidity in parents room, in %
• To, Temperature outside, in Celsius
• Pressure, in mm Hg
• RH out, Humidity outside, in %
• Windspeed, in m/s
• Visibility, in km
• Tdewpoint, dew point, in Celsius

First split the data into a training set of the first 500 observations, for use in parts (a)–(c), and a test set of the remaining observations, for use in part (d).

(a) Intuitively (in plain English), would you expect strong multicollinearity among the explanatory variables in this dataset? What could happen if all the explanatory variables are used in a multiple linear regression model?

(b) As a rule of thumb, consider a VIF > 10 to indicate high multicollinearity. Starting from a model regressing appEuse on all the explanatory variables, remove explanatory variables with VIF > 10 one by one, excluding the covariate with the largest VIF each time, until none with 'high' multicollinearity remain. How many explanatory variables are left? Why might this screening procedure be preferable to excluding all explanatory variables with VIF > 10 simultaneously?
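The iterative screening described here can be sketched as follows. This is only an outline: the function name vif_screen is illustrative, and the VIF is computed directly from its definition, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing covariate j on the remaining covariates, rather than via an add-on package.

```r
# Iteratively drop the covariate with the largest VIF until all VIFs <= 10.
# (Sketch; function and argument names are illustrative.)
vif_screen <- function(data, response, cutoff = 10) {
  covars <- setdiff(names(data), response)
  repeat {
    if (length(covars) < 2) break  # VIF undefined with a single covariate
    vifs <- sapply(covars, function(v) {
      # R^2 from regressing covariate v on all the other covariates
      r2 <- summary(lm(reformulate(setdiff(covars, v), v), data = data))$r.squared
      1 / (1 - r2)
    })
    if (max(vifs) <= cutoff) break
    covars <- setdiff(covars, names(which.max(vifs)))  # drop worst offender
  }
  covars  # names of the covariates that survive the screen
}
```

Applied to the training set, the returned vector gives the reduced covariate set for parts (c) and (d).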


(c) Using this reduced set of explanatory variables, conduct forward selection using your function from question 1, based on (i) adjusted R², (ii) AIC, and (iii) BIC. Report the 3 selected models, and report the adjusted R², AIC, and BIC for each. [If you cannot complete question 1, you may use a built-in function for forward selection.]

(d) Report and comment on the prediction accuracy of each of the 3 models based on the test dataset.


4. [Simulation: 15 points] In this problem we will conduct a simulation study to explore the effects of variable selection on inference.

The 'true' model we will assume for this problem is
$$y_i = \beta_0 + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n.$$

Important: Since this question involves simulation, your first line of R code for this problem must set the random seed with this command: set.seed(123), where you replace 123 with your student number.

(A) Conduct a simulation study as follows:

(i) Randomly generate n = 100 observations of p = 50 independent standard normally-distributed covariates. That is, generate x1, . . . , x50, where each xj is of length 100.

(ii) Randomly generate n = 100 outcomes, y, according to the true model above, with $\beta_0 = 1$ and $\sigma = 2$.

(iii) Regress the simulated outcomes y on the first 5 covariates: i.e. fit the following model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n.$$

(iv) Now test for any association between these covariates and the outcome, i.e. test whether all coefficients (except the intercept) are equal to 0 at the 5% level.

(v) Repeat steps (ii)-(iv) 1000 times. Plot a histogram of the p-values from (iv). In what proportion of datasets do we reject the null hypothesis?
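Steps (i)-(v) above might be organized along the following lines. This is a sketch, not the required solution; set.seed(123) is a placeholder for your student number, and the overall F-test from summary() is used for step (iv).

```r
set.seed(123)                  # placeholder; replace 123 with your student number
n <- 100; p <- 50
# (i) n observations of p independent standard normal covariates
X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("x", 1:p)

pvals <- replicate(1000, {
  y <- 1 + rnorm(n, sd = 2)    # (ii) true model: beta0 = 1, sigma = 2
  fit <- lm(y ~ X[, 1:5])      # (iii) regress on the first 5 covariates
  fs <- summary(fit)$fstatistic  # (iv) overall F-test: all slopes = 0
  pf(fs[1], fs[2], fs[3], lower.tail = FALSE)
})

hist(pvals)                    # (v) histogram of the 1000 p-values
mean(pvals < 0.05)             # proportion of datasets rejecting H0
```

Because the covariates are fixed in step (i) and only y is regenerated, the histogram should look roughly uniform and the rejection proportion should sit near the nominal 5% level.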

(B) Repeat simulation (A), replacing step (iii) with the following: Fit five different models as follows: Regress y on the first 5 covariates; call this model 1. Regress y on the 6th-10th covariates; call this model 2. ... Regress y on the 21st-25th covariates; call this model 5. Compute the adjusted R² for each of these models, and choose the model which fits best according to this criterion. Proceed to step (iv) using the model you select.
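The replacement for step (iii) in (B) can be sketched as a small helper (the name best_of_five is illustrative): fit the five candidate models on disjoint blocks of 5 covariates and keep the one with the largest adjusted R².

```r
# (B)'s step (iii): choose among five disjoint 5-covariate models
# by adjusted R^2. (Sketch; the helper name is illustrative.)
best_of_five <- function(y, X) {
  blocks <- split(1:25, rep(1:5, each = 5))   # covariates 1-5, 6-10, ..., 21-25
  fits <- lapply(blocks, function(j) lm(y ~ X[, j]))
  adjr2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
  fits[[which.max(adjr2)]]                    # winning model goes to step (iv)
}
```

Note that the F-test in step (iv) is then applied to a model chosen for fitting well, which is exactly the selection effect this question is probing.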

(C) Summarize your results. Explain what we can learn from this.

(D) What if instead of manually comparing models as in (B), we used automatic selection? Repeat simulation (A), this time replacing step (iii) with: Use forward selection to pick a model (based on AIC), considering all 50 covariates. Proceed to step (iv) using the model you select.
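One way to carry out (D)'s step (iii) is with R's built-in step() (the from-scratch restriction is stated only for Question 1; whether a built-in is acceptable here is an assumption worth confirming). A minimal sketch, with set.seed(123) again a placeholder:

```r
set.seed(123)                  # placeholder; replace 123 with your student number
n <- 100
X <- matrix(rnorm(n * 50), nrow = n,
            dimnames = list(NULL, paste0("x", 1:50)))
d <- data.frame(y = 1 + rnorm(n, sd = 2), X)   # true model has no real signal

null_fit <- lm(y ~ 1, data = d)
full_fit <- lm(y ~ ., data = d)
# Forward selection by AIC over all 50 (truly irrelevant) covariates
sel <- step(null_fit, scope = formula(full_fit),
            direction = "forward", trace = 0)
length(coef(sel)) - 1          # covariates selected despite no true signal
```

Step (iv) would then apply the overall F-test to sel on each of the 1000 simulated y vectors, which is where the extreme behavior asked about in (E) appears.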

(E) Explain your findings. Why are the results so extreme?

(F) What if we are interested in the association between $x_{i1}$ and $y_i$, but we want to use automatic model selection to choose which other covariates to include in the model? Repeat part (D), but ensure that $x_{i1}$ is always in the model. Instead of testing for any association, test for association between $x_{i1}$ and $y_i$ (adjusted for the other covariates selected into the model). What can you conclude about the effects of selection on inference?

Remember to include your R code in your submissions for Questions 1, 3 and 4!
