MATH2831 Linear Models Assignment

Note: This assignment is due by 11:59pm Monday 16 November (week 10).

Please follow the instructions below for completing the assignment. It is worth 20% of your final mark.

• This assignment must be completed individually.
• Your assignment must be submitted as a pdf file. It may be typed or handwritten, then converted into one pdf file. You must include the completed coversheet in your assignment.
• You must sign and date your submitted assignment, and include your name and zID below.

I declare that this assessment item is my own work, except where acknowledged, and has not been submitted for academic credit elsewhere, and acknowledge that the assessor of this item may, for the purpose of assessing this item:

• Reproduce this assessment item and provide a copy to another member of the University; and/or,
• Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking).

I certify that I have read and understood the University Rules in respect of Student Academic Misconduct.

Student’s full name and zID:
Signed:
Date:

1. An experiment was conducted in order to study the size of squid eaten by sharks and tuna. The predictor variables are characteristics of the beak or mouth of the squid. The predictors and response considered for the study are:

x1 : Rostral length in inches
x2 : Wing length in inches
x3 : Rostral to notch length
x4 : Notch to wing length
x5 : Width in inches
y : Weight in pounds

The study involved measurements and weights taken on 22 specimens, and the data are available in the squid.txt data set.

(I) Best subset selection. Carry out a best subset linear regression analysis using the regsubsets() function on the squid data set.

(a) Copy the summary output in your report and briefly comment on the models identified in this output.
From the summary output, identify the best model obtained with four predictors.

(b) What is the best model based on adjusted $R^2$, PRESS and $C_p$ from among the models chosen by the regsubsets() function? To provide evidence for your answer, include in your report a table showing the values of adjusted $R^2$, PRESS and $C_p$ for the best subsets of each size. Include also two plots, one of adjusted $R^2$ and another of $C_p$, for the best subsets of each size against the number of predictors.

(II) Sequential variable selection on the squid data set.

(a) Carry out forward model selection with the stepAIC() function, using all the available predictors and starting from the model with just an intercept.

(i) Copy the R output in your report and describe the selection procedure from the output. At each step state the ‘current model’, which predictor was added to the current model and why.

(ii) Clearly state the final model obtained, including the coefficient estimates of the fitted model. How does your answer compare to the results in (I)?

(b) Repeat (a) using backward model selection, starting from the model with all the available predictors.

(i) Copy the R output in your report and describe the selection procedure from the output. At each step state the ‘current model’, which predictor was removed from the current model and why. What is the AIC for the model with just x4?

(ii) Clearly state the final model obtained, including the coefficient estimates of the fitted model. Do you obtain the same model as in the forward selection above?

(c) Carry out a stepwise selection procedure with the stepAIC() function, using all the available predictors and starting from the model with just x1. Do NOT include the R output; just state the final model in your answer, and compare it to your findings in parts (a) and (b) above.

(III) Model criticism. Fit a linear model with y as the response and all the available predictors to the squid data.
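The full-model fit described above might be set up along the following lines. This is only a sketch: the layout of squid.txt (whitespace-separated, with a header row naming the columns x1 to x5 and y) is an assumption, and the simulated fallback data frame is a hypothetical stand-in so that the code runs without the file. It is not the squid data and its results mean nothing.

```r
# Sketch only. Assumed layout of squid.txt: whitespace-separated values with
# a header row naming the columns x1, ..., x5 and y.
if (file.exists("squid.txt")) {
  squid <- read.table("squid.txt", header = TRUE)
} else {
  # Simulated stand-in with the same shape (22 rows, 5 predictors), used only
  # so the sketch runs without the real file. NOT the squid measurements.
  set.seed(1)
  squid <- as.data.frame(matrix(runif(22 * 5, 0.5, 2), nrow = 22,
                                dimnames = list(NULL, paste0("x", 1:5))))
  squid$y <- 1 + 2 * squid$x1 + rnorm(22, sd = 0.3)
}

# Full model: y regressed on every other column of the data frame.
model <- lm(y ~ ., data = squid)
summary(model)
```

The object name model is chosen to match the name used in the plotting commands of part (a).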
(a) Include in your report diagnostic plots of residuals for the fitted model and comment on the appropriateness of the general linear model assumptions. In particular, comment on whether or not there appears to be any violation of model assumptions, such as an incorrectly specified mean, failure of the constancy of error variance, departure from normality, outliers, and observations that have a large influence on the model analysis. Note that you can plot all four residual plots using:

    par(mfrow=c(2,2))
    plot(model)
    par(mfrow=c(1,1))

(b) Recommend the final optimal model based on your findings in (I) and (II). Give reasons to support your choice. Fit the chosen optimal model to the squid data and repeat part (a). Produce the summary output of the fitted model and include it in your report. Are all the predictor variables “significant” in the final model according to the partial t-tests?

2. In this question we consider the derivation of Mallows’ $C_p$ statistic for model selection discussed in lectures. Suppose the experimenter proposes a model

$$y = X_1\beta_1 + \varepsilon^* \qquad (p \text{ parameters})$$

where $X_1$ is an $n \times p$ matrix and the vector $\beta_1$ contains $p$ parameters. The “true” model, however, contains an additional $m - p$ parameters described by the vector $\beta_2$. So the “true” model is given by

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon \qquad (m \text{ parameters},\ m > p)$$

where $X_2$ is an $n \times (m - p)$ matrix. Assume that the errors $\varepsilon$ are uncorrelated with mean zero and common variance $\sigma^2$.

Consider fitting the proposed general linear model to data, and write $\hat{y}_i$ for the fitted value at $x_i$ and $\mathrm{MSE}(\hat{y}_i)$ for its mean squared error. Recall that if the error variance $\sigma^2$ is known, then an estimate of

$$\frac{\sum_{i=1}^{n} \mathrm{MSE}(\hat{y}_i)}{\sigma^2} = \frac{\sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i)}{\sigma^2} + \frac{\sum_{i=1}^{n} \mathrm{Bias}^2(\hat{y}_i)}{\sigma^2}$$

is

$$p + \frac{(n - p)(\hat{\sigma}^2 - \sigma^2)}{\sigma^2} \tag{1}$$

where $\hat{\sigma}^2$ is the estimate of the error variance for the proposed model and $p$ is the number of parameters. You now have to provide a justification for (1).
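(a) Writing $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)^\top$ for the vector of fitted values for the proposed model and observing that

$$\hat{y} = X_1 (X_1^\top X_1)^{-1} X_1^\top y = H_1 y,$$

show that

$$\sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) = \sigma^2 \, \mathrm{tr}(H_1),$$

where $\mathrm{tr}(A)$ denotes the trace of $A$ and $H_1 = X_1 (X_1^\top X_1)^{-1} X_1^\top$ denotes the hat matrix corresponding to the proposed model. By using the rules given in lectures about matrix traces, deduce that

$$\frac{\sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i)}{\sigma^2} = p.$$

(b) Consider the estimate of $\sigma^2$ obtained in lectures for the proposed model,

$$\hat{\sigma}^2 = \frac{y^\top \left(I - X_1 (X_1^\top X_1)^{-1} X_1^\top\right) y}{n - p}.$$

By using the result stated in lectures about the expected value of a quadratic form $y^\top A y$, and noting that $E(y) = X_1\beta_1 + X_2\beta_2$, show that

$$E(\hat{\sigma}^2) = \sigma^2 + \frac{1}{n - p}\, \beta_2^\top X_2^\top (I - H_1) X_2 \beta_2.$$

(c) Show that

$$\sum_{i=1}^{n} \mathrm{Bias}^2(\hat{y}_i) = \left(E(y) - E(\hat{y})\right)^\top \left(E(y) - E(\hat{y})\right) = \beta_2^\top X_2^\top (I - H_1) X_2 \beta_2.$$

(d) From (b) and (c), deduce that an unbiased estimator of

$$\frac{\sum_{i=1}^{n} \mathrm{Bias}^2(\hat{y}_i)}{\sigma^2}$$

is

$$\frac{(n - p)(\hat{\sigma}^2 - \sigma^2)}{\sigma^2},$$

from which it follows that (1) is a sensible estimator of

$$\frac{\sum_{i=1}^{n} \mathrm{MSE}(\hat{y}_i)}{\sigma^2}.$$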
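For orientation only, and not as part of the required justification: in practice $\sigma^2$ is unknown. A common choice, assumed here rather than stated in the question, is to replace it with the error-variance estimate from the largest model under consideration. That estimate is written $\hat{\sigma}^2_{\mathrm{full}}$ below, and $\mathrm{SSE}_p = (n - p)\hat{\sigma}^2$ denotes the residual sum of squares of the proposed model; both symbols are introduced for this note. Substituting into (1) shows how it relates to the usual computational form of $C_p$:

```latex
C_p
  = p + \frac{(n - p)\left(\hat{\sigma}^2 - \hat{\sigma}^2_{\mathrm{full}}\right)}{\hat{\sigma}^2_{\mathrm{full}}}
  = p + \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\mathrm{full}}} - (n - p)
  = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\mathrm{full}}} - n + 2p .
```

Under this substitution, a proposed model whose $\hat{\sigma}^2$ is close to $\hat{\sigma}^2_{\mathrm{full}}$ has $C_p$ close to $p$.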