STAT 3022/3922/4022 Applied Linear Models – Semester 1, 2022
Practice Final Examination

Reading and writing time allowed: 2 hours and 10 minutes
Document upload time: 15 minutes (after the examination with no further working)

Notes and Instructions:

• Examination Conditions: Open book. You may refer to any notes on the Canvas website.
• This examination must be taken on a computer or laptop with satisfactory internet connectivity. It should NOT be taken on a mobile device.
• Compatible web browsers include updated versions of Mozilla Firefox or Google Chrome. Other browsers may not display questions correctly.
• Please be mindful that we may access logs of your Canvas activity in the event of any discrepancy or concerns regarding breaches of integrity.
• The content of this examination is not to be shared or distributed in any form.
• The work that you submit for this examination must be your sole effort (i.e. not copied from, or discussed with, anyone else).
• The paper contains 4 questions. Attempt all questions. Marks are shown in each question. Total marks are 100.
• If you are asked to calculate a certain quantity, you must, at a minimum, show the values substituted into a suitable formula in order to obtain full marks. You may then use a calculator or any package to evaluate the values.
• If you are asked to state or report certain values from the R output provided, no calculation is required.
• If you are asked to explain certain concepts, write at most two sentences.
• Unless otherwise stated, you can assume the normal assumptions hold for all inference questions (i.e. confidence intervals, hypothesis testing).
• Unless otherwise stated, take the significance level α to be 5% and round your answers to 3 or 4 decimal places in general.
• You can expect the real final exam to have the same structure as the practice exam, but the content of the questions can be very different. Hence, only use this practice exam as a complement to all the lecture notes/tutorials/computer labs of the course.
• You are encouraged to do this exam on your own within the time limit, and then to scan your work, convert it to PDF, and upload it to familiarize yourself with the process. In the real exam, failing to convert your work to PDF and submit it within the time limit is not grounds for special consideration.

Question 1 (15 marks)

(i) (3 marks) Consider the dataset $(x_i, y_i)$ for $i = 1, \ldots, 20$ with $\bar{x} = 2$, $\bar{y} = 4$, and the fitted simple linear regression model $\hat{y} = 2 - 0.15x$. At $x = 1$, the 95% confidence interval for the mean response is (1.65, 2.05). What is the 95% confidence interval for the mean response at $x = 3$? Please show your calculation (you can use any formula in the lecture notes or tutorials).

(ii) (3 marks) Consider a linear regression model with an outcome $y$ and two potential continuous covariates $X_1$ and $X_2$. The sample correlation between $y$ and $X_1$ is 0.577, and the sample correlation between $y$ and $X_2$ is −0.23. If $X_1$ and $X_2$ are uncorrelated with each other, what should be the multiple $R^2$ of the multiple linear regression of $y$ on both $X_1$ and $X_2$? Note that "there is not enough information" may be the correct answer.

(iii) (3 marks) After fitting a multiple linear regression model, a practitioner realizes that the outcome of the first observation, $y_1$, was wrongly reported. The wrong value used to fit the model is 13, but the correct value is actually 1.3. If the practitioner refits the model with the correct value, how will the leverage of the first observation change?

(iv) (3 marks) Consider the one-way ANOVA model $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$ for $i = A, B, C$ and $j = 1, 2, 3$, i.e. we have $t = 3$ treatment groups (labeled A, B, C), and each group has $r = 3$ observations. The overall sample mean of the outcome data is $\bar{y}_{\bullet\bullet} = 10$, while the sample means of the outcome data for each group are $\bar{y}_{A\bullet} = 9$, $\bar{y}_{B\bullet} = 14$, and $\bar{y}_{C\bullet} = 7$. If we impose the constraint $\hat{\alpha}_B = 0$, what would be the estimates $\hat{\mu}$, $\hat{\alpha}_A$ and $\hat{\alpha}_C$?
(v) (3 marks) Consider the linear mixed-effect model $y_{ijk} = \mu + \alpha_i + b_j + \varepsilon_{ijk}$, with $b_j \overset{iid}{\sim} N(0, \sigma_b^2)$, $\varepsilon_{ijk} \sim N(0, \sigma^2)$ for $i = 1, 2$, $j = 1, 2$ and $k = 1, 2$. Write the model in the matrix form $y = X\beta + Zu + \varepsilon$. Please make sure you define all the notation clearly and state all the assumptions in matrix form.

Question 2 (40 marks)

A study investigates the relationship between the weight of babies at birth ($Y$ = Wgt, in kg) and two potential predictors: the number of weeks of gestation ($X_1$ = Gest) and whether the mother is a smoker or not (Smoker). The dataset contains 32 observations. Selected R outputs are included at the end of this document. Based only on these outputs, answer the following questions.

(i) (8 marks) First, the researcher fits a simple linear regression of the outcome on the number of weeks of gestation. Complete the following summary table of the regression fit.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ???    0.49811     ???      ???
Gest             ???    0.01286     ???      ???

Residual standard error: 0.1673 on ??? degrees of freedom
Multiple R-squared: ???, Adjusted R-squared: 0.7677
F-statistic: ??? on ??? and ?? DF, p-value: ???

(ii) (5 marks) The researcher is curious whether the quadratic term of $X_1$ should be included in the model and fits the model m2. Write out the models under the null and alternative hypotheses for (a) the t-test associated with $X_1$ in the summary table and (b) the F-test for $X_1$ in the ANOVA table (you can write the models either in R syntax or in mathematical form). Should the quadratic term be included in a model that already includes the linear term of $X_1$?

Now the researcher wants to see if the above relationship depends upon whether the mother is a smoker or not. Instead of using indicator variables to represent this categorical variable, the researcher chooses to represent it as follows.
Let

$$
X_2 = \begin{cases} 1, & \text{if the mother is a smoker} \\ -1, & \text{if the mother is not a smoker} \end{cases}, \qquad
X_3 = \begin{cases} 1, & \text{if the mother is not a smoker} \\ -1, & \text{if the mother is a smoker} \end{cases}
$$

(iii) (3 marks) Consider the model (written in R syntax) y ~ X1 + X2 + X3. Explain why this model has perfect multicollinearity.

(iv) (8 marks) Assuming that Smoker only has an additive effect, compute the 95% confidence interval for the difference in the intercepts of the regression equations corresponding to smokers and non-smokers.

(v) (8 marks) Assuming that Smoker only has an additive effect, obtain the point estimates and the corresponding 95% confidence intervals for the mean baby weight if (a) the length of gestation is 36 weeks and the mother is a smoker, and (b) the length of gestation is 36 weeks and the mother is not a smoker.

(vi) (8 marks) The researcher wants to see whether Smoker has an interaction effect with the length of gestation. Complete the ANOVA table for the model m4 and state whether the interaction effect is significant.

Analysis of Variance Table

Response: y
           Df Sum Sq Mean Sq F value Pr(>F)
X1        ???    ???     ???     ???  0.000
X2        ???    ???     ???     ???  0.000
X1:X2     ???    ???     ???     ???    ???
Residuals ???    ???     ???

Question 3 (30 marks)

A Latin-square design was used in the road testing of four tire brands (labelled A, B, C, D) for a tire wholesaler. The purpose of the study was to assess whether tires of different brands would result in different fuel efficiencies of commercial trucks. To obtain reliable measures of fuel efficiency ($y$), a single test run would have to be several hundred kilometers long, so it was decided to use several test trucks and conduct the test so that each tire was tested on each truck. It was also necessary to conduct the experiment over several days, and each tire was tested on each day of the test program. In the end, four trucks were selected, and a four-day test period was identified.
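To illustrate the design idea described above, the following R sketch builds a generic cyclic Latin square, in which each label appears exactly once in every row and every column. It is not the specific plan used in this study: the arrangement, the row and column names, and the helper functions `cyclic_latin_square` and `is_latin_square` are hypothetical and introduced here only for illustration.

```r
# Hypothetical helper: build a t x t cyclic Latin square from treatment labels.
cyclic_latin_square <- function(labels) {
  t <- length(labels)
  # Entry (i, j) takes the label at position ((i + j - 2) mod t) + 1.
  idx <- outer(seq_len(t), seq_len(t), function(i, j) (i + j - 2) %% t + 1)
  sq <- matrix(labels[idx], nrow = t, ncol = t)
  # Illustrative row/column names only (assumed, not taken from the study plan).
  dimnames(sq) <- list(paste0("Truck", seq_len(t)), paste0("Day", seq_len(t)))
  sq
}

# Hypothetical helper: TRUE if every label occurs exactly once per row and column.
is_latin_square <- function(sq) {
  labels <- unique(as.vector(sq))
  ok_line <- function(v) length(v) == length(labels) && setequal(v, labels)
  all(apply(sq, 1, ok_line)) && all(apply(sq, 2, ok_line))
}

design <- cyclic_latin_square(c("A", "B", "C", "D"))
print(design)
is_latin_square(design)
```

Any valid Latin square for four treatments passes the same check, so `is_latin_square` can also be applied to a hand-completed plan entered as a character matrix.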
(i) (4 marks) Complete the following Latin square design plan by filling in the remaining empty cells in the following table.

Day1 Day2 Day3 Day4 Truck3 D B A Truck4BA D Truck2 A D Truck 1 A

After the experiment, the data were collected. Selected R outputs are given at the end of this document. Based on these outputs, answer the following questions.

First, the analyst treats Truck and Day as having fixed effects.

(ii) (3 marks) Write out the statistical model to analyze the data from this design. Define all the notation and state all the assumptions you have made in the model.

(iii) (5 marks) Complete the ANOVA table for the Latin-square design, and draw a conclusion about the overall effect of tire on fuel efficiency. Make sure you state the hypotheses in terms of the notation you have defined.

Analysis of Variance Table

Response: y
           Df  Sum Sq Mean Sq F value Pr(>F)
Day       ??? 0.26300     ???     ???    ???
Truck     ??? 0.06765     ???     ???    ???
Tire      ???     ???     ???     ???    ???
Residuals ??? 0.09280     ???

(iv) (6 marks) Compute the confidence intervals for all the pairwise differences in the mean efficiency corresponding to different tires. Please use the Bonferroni correction to adjust for multiple comparisons.

Now, it can be argued that the trucks and days selected in the experiment are only representative samples of all possible trucks and days; hence, it would be more appropriate to treat the two blocking factors as having random effects. Let $\sigma_a^2$, $\sigma_b^2$ and $\sigma^2$ be the variances corresponding to day, truck, and random noise, and let $\gamma_k$ denote the effect of the $k$th tire on the efficiency for $k = A, \ldots, D$.

(v) (4 marks) Write out the statistical model to analyze the data from this design. Define all the new notation and state all the assumptions you have made in the model.

(vi) (4 marks) Using restricted maximum likelihood, what are the estimates of the variance components of the model, and under the first treatment constraint, what are the estimates of the grand mean and tire effects?
(vii) (4 marks) It can be proven that the expected mean squares are as given below:

$$E(MS_{\mathrm{Day}}) = \sigma^2 + t\sigma_a^2, \qquad E(MS_{\mathrm{Truck}}) = \sigma^2 + t\sigma_b^2,$$

$$E(MS_{\mathrm{Tire}}) = \sigma^2 + \frac{t}{t-1}\sum_{k=1}^{t}(\gamma_k - \bar{\gamma})^2, \quad \text{with } \bar{\gamma} = \frac{1}{t}\sum_{i=1}^{t}\gamma_i,$$

$$E(MSE) = \sigma^2.$$

Based on these expected mean squares, conduct a valid F-test for the overall effect of tire on fuel efficiency when both Trucks and Days have random effects.

Question 4a (15 marks - for STAT3022 only)

Consider the multiple linear regression model $y = X\beta + \varepsilon$, where $X$ is an $n \times p$ full-rank design matrix with the first column being a column of ones, and the other notation is as in the lecture notes. Suppose the model error $\varepsilon$ still has mean vector $0$, but now has covariance matrix $\sigma^2 V$, where $V$ is a positive-definite matrix. Let $\hat{\beta} = (X^\top X)^{-1} X^\top y$ denote the least squares estimator.

(a) (8 marks) Show that in this case, $\hat{\beta}$ is still an unbiased estimator of $\beta$.

(b) (7 marks) Find the variance-covariance matrix of $\hat{\beta}$ in this case.

Question 4b (15 marks - for STAT3922/4022 only)

Consider the linear model $y = X\beta + \varepsilon$, where $y = (y_1, \ldots, y_n)^\top$ is the vector of outcomes, $X$ is an $n \times p$ design matrix that may or may not be full rank, with the first column being a column of ones, and $\beta$ and $\varepsilon$ are the corresponding coefficient vector and model errors. Let $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)^\top$ and $\hat{e} = y - \hat{y}$ be the (unique) vectors of fitted values and residuals from ordinary least squares, respectively. Let $r$ be the sample correlation between $\hat{y}$ and $y$, and let

$$R^2 = 1 - \frac{SSE}{SST}, \quad \text{with } SSE = \hat{e}^\top \hat{e} \text{ and } SST = \sum_{i=1}^{n}(y_i - \bar{y})^2,$$

be the multiple $R^2$ of the model. Show that $r^2 = R^2$.

Output for Question 2

c(mean(dat$y), mean(dat$X1))
[1]  3.019875 38.656250
round(c(var(dat$y), var(dat$X1)), 3)
[1] 0.121 5.459
cor(dat$y, dat$X1)
[1] 0.8804323
table(dat$Smoke)

 no yes
 16  16

m2 <- lm(y ~ X1 + I(X1^2), data = dat)
summary(m2)

Call:
lm(formula = y ~ X1 + I(X1^2), data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-0.36004 -0.11648  0.01215  0.10588  0.26004

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.746971   8.073435  -0.464    0.646
X1           0.220163   0.421213   0.523    0.605
I(X1^2)     -0.001163   0.005480  -0.212    0.833

Residual standard error: 0.1701 on 29 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.76
F-statistic: 50.09 on 2 and 29 DF, p-value: 3.912e-10

anova(m2)

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
X1         1 2.89584 2.89584 100.136 6.499e-11 ***
I(X1^2)    1 0.00130 0.00130   0.045    0.8334
Residuals 29 0.83865 0.02892
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dat$X2 <- ifelse(dat$Smoke == "yes", 1, -1)
dat$X3 <- ifelse(dat$Smoke == "no", -1, 1)
m3 <- lm(y ~ X1 + X2, data = dat)
summary(m3)

Call:
lm(formula = y ~ X1 + X2, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
-0.223693 -0.092063 -0.009365  0.079663  0.197507

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.511845   0.353449  -7.107 8.07e-08 ***
X1           0.143100   0.009128  15.677 1.07e-15 ***
X2          -0.122272   0.020991  -5.825 2.58e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1155 on 29 degrees of freedom
Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15

anova(m3)

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
X1         1 2.89584 2.89584 216.962 5.365e-15 ***
X2         1 0.45288 0.45288  33.931 2.577e-06 ***
Residuals 29 0.38707 0.01335
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

round(vcov(m3), 4)
            (Intercept)      X1     X2
(Intercept)      0.1249 -0.0032 0.0017
X1              -0.0032  0.0001 0.0000
X2               0.0017  0.0000 0.0004

m4 <- lm(y ~ X1*X2, data = dat)
summary(m4)

Call:
lm(formula = y ~ X1 * X2, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
-0.228528 -0.089560  0.000273  0.083629  0.184529

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.510351   0.358475  -7.003 1.29e-07 ***
X1           0.143118   0.009258  15.460 3.06e-15 ***
X2           0.035787   0.358475   0.100    0.921
X1:X2       -0.004089   0.009258  -0.442    0.662
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1172 on 28 degrees of freedom
Multiple R-squared: 0.8971, Adjusted R-squared: 0.8861
F-statistic: 81.37 on 3 and 28 DF, p-value: 6.144e-14

Output for Question 3

mean(dat$y)
[1] 6.7825
tapply(dat$y, dat$Day, mean)
     1      2      3      4
6.6375 6.6775 6.8725 6.9425
tapply(dat$y, dat$Truck, mean)
     1      2      3      4
6.6925 6.7500 6.8550 6.8325
tapply(dat$y, dat$Tire, mean)
     A      B      C      D
6.4550 6.8325 6.6225 7.2200
fit.lm <- lm(y ~ Day + Truck + Tire, data = dat)

library(lme4)
fit.lme <- lmer(y ~ Tire + (1 | Truck) + (1|Day), data=dat, REML=TRUE)
summary(fit.lme)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ Tire + (1 | Truck) + (1 | Day)
   Data: dat

REML criterion at convergence: -4.1

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.25438 -0.40179 -0.09672  0.40818  1.37661

Random effects:
 Groups   Name        Variance Std.Dev.
 Truck    (Intercept) 0.001771 0.04208
 Day      (Intercept) 0.018050 0.13435
 Residual             0.015467 0.12437
Number of obs: 16, groups: Truck, 4; Day, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept)  6.45500    0.09392  68.725
TireB        0.37750    0.08794   4.293
TireC        0.16750    0.08794   1.905
TireD        0.76500    0.08794   8.699

Correlation of Fixed Effects:
      (Intr) TireB  TireC
TireB -0.468
TireC -0.468  0.500
TireD -0.468  0.500  0.500

THIS IS THE LAST PAGE