McGill University
Faculty of Science
Department of Mathematics and Statistics

Final Examination - Take-Home
MATH423: Applied Regression

Date: 5-Dec-20 at 6:30 PM to 8-Dec-20 at 6:30 PM
Time: 72 hours

Instructions

• This paper contains three questions. Each question carries 20 marks. Credit will be given for all questions attempted. The total mark available is 60 but rescaling of the final mark may occur.

This exam comprises the cover page and nine pages of questions.

© 2020 McGill University

Notation: The following notation will be used: for i = 1, ..., n, y_i is the observed response; Y_i is the random variable version of the response; y and Y are the n × 1 vector versions of the responses; x_i is the row vector of predictor values, X is the matrix of predictor values; ŷ_i, Ŷ_i, ŷ and Ŷ are the fitted or predicted response values or vectors arising from a given model; β is the vector of regression coefficients; β̂ is the vector of estimates or estimators. Furthermore, 0_n is the n-dimensional vector of zeros, and I_n is the n-dimensional identity matrix.

Q1. The Minnesota Twins professional baseball team plays its games in the Metrodome, an indoor stadium with a fabric roof. In addition to the large air fans required to keep the roof from collapsing, the baseball field is surrounded by ventilation fans that blow heated or cooled air into the stadium. Air is normally blown into the center of the field equally from all directions. According to a retired supervisor in the Metrodome, in the late innings of some games the fans would be modified so that the ventilation air would blow out from home plate toward the outfield. The idea is that the air flow might increase the length of a fly ball.
For example, if this were done in the middle of the eighth inning, then the air-flow advantage would be in favor of the home team for six outs, three in each of the eighth and ninth innings, and in favor of the visitor for three outs in the ninth inning, resulting in a slight advantage for the home team.

To see if manipulating the fans could possibly make any difference, a group of students at the University of Minnesota and their professor built a “cannon” that used compressed air to shoot baseballs. They then did the following experiment in the Metrodome in March, 2003:

  1. A fixed angle of 50 degrees and velocity of 150 feet per second was selected. In the actual experiment, neither the velocity nor the angle could be controlled exactly, so the actual angle and velocity varied from shot to shot.
  2. The ventilation fans were set so that, to the extent possible, all the air was blowing in from the outfield towards home plate, providing a headwind. After waiting about 20 minutes for the air flows to stabilize, 20 balls were shot into the outfield, and their distances were recorded. Additional variables recorded on each shot include the weight (in grams) and diameter (in cm) of the ball used on that shot, and the actual velocity and angle.
  3. The ventilation fans were then reversed, so that as much air as possible was blowing out toward the outfield, giving a tailwind. After waiting 20 minutes for air currents to stabilize, 15 balls were shot into the outfield, again measuring the ball weight and diameter, and the actual velocity and angle on each shot.

In this data, the variable names are:

– Cond, the condition, head or tail wind;
– Velocity, the actual velocity in feet per second;
– Angle, the actual angle;
– BallWt, the weight of the ball in grams used on that particular test;
– BallDia, the diameter in inches of the ball used on that test;
– Dist, distance in feet of the flight of the ball.

1. The following plot shows a boxplot of the response Dist for each value of Cond:

    > boxplot(Dist ~ Cond, data=domedata)

   [Figure: boxplot of Dist by Cond]

   Summarize the plot. Based on the boxplot, can we conclude that there is enough evidence that manipulating the fans can change the distance that a baseball travels? [2 MARKS]

2. We next examine the scatterplot matrix of the response and the continuous predictors, using Cond to color and mark the points. Summarize the key features of the following graph. [3 MARKS]

   [Figure: scatterplot matrix of Dist and the continuous predictors, points colored and marked by Cond]

3. Interpret the ANOVA output for m1, m2 and m3. In particular, define appropriate hypotheses for these tests. Summarize the conclusions to be made from this output. [5 MARKS]

    > m1 <- lm(Dist ~ Velocity + Angle)
    > m2 <- lm(Dist ~ Velocity + Angle + BallWt + BallDia)
    > m3 <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Cond)
    > anova(m1,m2,m3)
    Analysis of Variance Table

    Model 1: Dist ~ Velocity + Angle
    Model 2: Dist ~ Velocity + Angle + BallWt + BallDia
    Model 3: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
      Res.Df    RSS Df Sum of Sq      F   Pr(>F)
    1     31 2042.2
    2     29 1747.0  2    295.15 3.1869 0.056627 .
    3     28 1296.6  1    450.46 9.7279 0.004177 **

4. Test whether the effect of Velocity on Dist is the same under each Cond. Clearly state the null and alternative hypotheses of the test. Comment on your findings. [5 MARKS]

    > ma <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Cond)
    > mb <- update(ma, ~ .+Velocity:Cond)
    > anova(ma,mb)
    Analysis of Variance Table

    Model 1: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
    Model 2: Dist ~ Velocity + Angle + BallWt + BallDia + Cond + Velocity:Cond
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     28 1296.6
    2     27 1296.5  1  0.078273 0.0016 0.9681

Q2. Consider the hospital infection risk data.
The variables we will analyze are the following:

– Y = infection risk in hospital
– X1 = average length of patient's stay (in days)
– X2 = a measure of frequency of giving X-rays
– X3 = indicator of which of 4 U.S. regions the hospital is located in (north-east, north-central, south, west).

The focus of the analysis will be on regional differences. Region is a categorical variable, so we must use indicator variables to incorporate region information into the model. There are four regions. The full set of indicator variables for the four regions is as follows:

– I2 = 1 if the hospital is in region 2 (north-central) and 0 if not.
– I1 = 1 if the hospital is in region 1 (north-east) and 0 if not.
– I3 = 1 if the hospital is in region 3 (south) and 0 if not.
– I4 = 1 if the hospital is in region 4 (west) and 0 if not.

To avoid a linear dependency in the X matrix, we will leave out one of these indicators when forming the model. Using all but the first indicator to describe regional differences (so that "north-east" is the reference region), a possible multiple regression model for E(Y), the mean infection risk, is:

    E(Y | X) = β_0 + β_1 X_1 + β_2 X_2 + β_3 I_2 + β_4 I_3 + β_5 I_4    (1)

1. Based on Equation (1), write down the regression function E(Y | X) for each of the following cases: [2 MARKS]
   (a) For a hospital in region 1 (north-east)
   (b) For a hospital in region 2 (north-central)
   (c) For a hospital in region 3 (south)
   (d) For a hospital in region 4 (west)

2. Interpret the coefficients β_3, β_4, β_5 in model (1). [2 MARKS]

3. The output for model (1) is reported below.

                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.134259   0.877347  -2.433  0.01668 *
    Stay         0.505394   0.081455   6.205 1.11e-08 ***
    Xray         0.017587   0.005649   3.113  0.00238 **
    i2           0.171284   0.281475   0.609  0.54416
    i3           0.095461   0.288852   0.330  0.74169
    i4           1.057835   0.378077   2.798  0.00612 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.036 on 105 degrees of freedom
    Multiple R-squared: 0.4198, Adjusted R-squared: 0.3922
    F-statistic: 15.19 on 5 and 105 DF, p-value: 3.243e-11

   (a) What is the sample size n of this dataset? [2 MARKS]
   (b) Interpret the p-value for I4. [2 MARKS]
   (c) Is there strong evidence of a difference between the mean infection risks in the north-east and the west? [3 MARKS]
   (d) Consider an overall test of regional differences with the null hypothesis H0: β_3 = β_4 = β_5 = 0 against the alternative Ha: at least one of β_3, β_4, β_5 is not zero. What is the reduced model under the null hypothesis? If the reduced model has SSRes(reduced) = 123.56 with df = 108, and the full model (1) has SSRes(full) = 112.71 with df = 105, what is the value of the F-statistic? [2 MARKS]
   (e) From the p-values of I2, I3 and I4, what conclusions can you draw? [2 MARKS]

4. Suppose we compare the following two models:

    H0: E(Y | X) = β_0 + β_1 X_1 + β_2 X_2 + β_5 I_4
    Ha: E(Y | X) = β_0 + β_1 X_1 + β_2 X_2 + β_3 I_2 + β_4 I_3 + β_5 I_4

   and we know that the reduced model has SSE(reduced) = 113.11. What are the corresponding degrees of freedom df of this model? What is the value of the F-statistic for this test? [5 MARKS]

Q3. Depression Treatments data. Some researchers (Daniel, 1999) were interested in comparing the effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the three treatments A, B, and C.
The researchers collected the following data on a random sample of n = 36 severely depressed individuals:

– y_i = measure of the effectiveness of the treatment for individual i
– x_{i1} = age (in years) of individual i
– x_{i2} = 1 if individual i received treatment A and 0 if not
– x_{i3} = 1 if individual i received treatment B and 0 if not

A scatter plot of the data with treatment effectiveness on the y-axis and age on the x-axis looks like:

[Figure: scatter plot of treatment effectiveness against age, by treatment]

The blue circles represent the data for individuals receiving treatment A, the red squares represent the data for individuals receiving treatment B, and the green diamonds represent the data for individuals receiving treatment C.

We consider a (second-order) multiple regression model with interaction terms:

    y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + β_{12} x_{i1} x_{i2} + β_{13} x_{i1} x_{i3} + ε_i.    (2)

1. Write down the regression function E(Y | X) [4 MARKS]
   (a) if the patient receives A.
   (b) if the patient receives B.
   (c) if the patient receives C.

2. Suppose the estimated regression functions are:
   (a) If the patient receives A: ŷ = 47.5 + 0.33 x_1.
   (b) If the patient receives B: ŷ = 28.9 + 0.52 x_1.
   (c) If the patient receives C: ŷ = 6.21 + 1.03 x_1.

   When we plot these three "best fitting" lines, we obtain:

   [Figure: the three fitted lines of effectiveness against age, one per treatment]

   Question: What do the estimated slopes tell us? What does the "nonparallelness" of the lines imply? Does age appear to interact with treatment in its impact on treatment effectiveness? Why? [4 MARKS]

3. The residual plot and the QQ plot of the residuals for model (2) are provided below:

   [Figures: residual plot and QQ plot of the residuals for model (2)]

   Question: What conclusions can you draw from these plots? [4 MARKS]

4. For every age, is there a difference in the mean effectiveness for the three treatments? Write down the null and alternative hypotheses of the corresponding test. [4 MARKS]

5. Does the effect of age on the treatment's effectiveness depend on treatment? Write down the null and alternative hypotheses of the corresponding test. [4 MARKS]
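
The nested-model comparisons that appear throughout this paper (the anova() calls in Q1, the reduced-versus-full comparisons in Q2, and the hypotheses about model (2) in Q3) all rest on the partial F statistic, F = [(RSS_reduced − RSS_full) / (df_reduced − df_full)] / (RSS_full / df_full). The sketch below illustrates the mechanics only. It assumes simulated data: the variable names, coefficients, and seed are hypothetical and are not the exam's depression data, so the printed numbers answer none of the questions above.

```r
# Hypothetical simulated data with the same layout as model (2);
# the coefficients below are arbitrary and chosen only for illustration.
set.seed(423)
n   <- 36
age <- runif(n, 20, 70)
trt <- factor(rep(c("A", "B", "C"), each = n / 3))
x2  <- as.numeric(trt == "A")   # indicator for treatment A
x3  <- as.numeric(trt == "B")   # indicator for treatment B
y   <- 10 + 1.0 * age + 30 * x2 + 20 * x3 -
       0.6 * age * x2 - 0.4 * age * x3 + rnorm(n, sd = 4)

# Full model has the form of model (2); reduced model drops both interactions.
full    <- lm(y ~ age + x2 + x3 + age:x2 + age:x3)
reduced <- lm(y ~ age + x2 + x3)

# Partial F statistic computed directly from residual sums of squares.
rss_full <- sum(resid(full)^2)
rss_red  <- sum(resid(reduced)^2)
df_full  <- df.residual(full)      # n minus 6 estimated coefficients
df_red   <- df.residual(reduced)   # n minus 4 estimated coefficients
f_stat   <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_val    <- pf(f_stat, df_red - df_full, df_full, lower.tail = FALSE)

# anova() on the nested pair reports the same F statistic and p-value.
tab <- anova(reduced, full)
print(tab)
```

The same arithmetic applies when only the residual sums of squares and their degrees of freedom are reported, since the fitted model objects are not needed to evaluate the formula.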