McGill University
Faculty of Science
Department of Mathematics and Statistics
Final Examination - Take-Home
MATH423: Applied Regression
Date: 5-Dec-20 at 6:30 PM to 8-Dec-20 at 6:30 PM Time: 72 hours
Instructions
• This paper contains three questions. Each question carries 20 marks. Credit will be given for all
questions attempted. The total mark available is 60 but rescaling of the final mark may occur.
Questions Marks
Q1
Q2
Q3
This exam comprises the cover page and nine pages of questions.
© 2020 McGill University Final Examination - Take-Home
Final Exam
Notation: The following notation will be used: for i = 1, . . . , n, yi is the observed response; Yi is the
random variable version of the response; y and Y are the n × 1 vector versions of the responses; xi is
the row vector of predictor values, X is the matrix of predictor values; ŷi, Ŷi, ŷ and Ŷ are the fitted or
predicted response values or vectors arising from a given model; β is the vector of regression coefficients;
β̂ is the vector of estimates or estimators. Furthermore, 0n is the n-dimensional vector of zeros, and In is
the n-dimensional identity matrix.
Q1. The Minnesota Twins professional baseball team plays its games in the Metrodome, an indoor
stadium with a fabric roof. In addition to the large air fans required to keep the roof from collapsing,
the baseball field is surrounded by ventilation fans that blow heated or cooled air into the stadium.
Air is normally blown into the center of the field equally from all directions.
According to a retired supervisor in the Metrodome, in the late innings of some games the fans
would be modified so that the ventilation air would blow out from home plate toward the outfield.
The idea is that the air flow might increase the length of a fly ball. For example, if this were done
in the middle of the eighth inning, then the air-flow advantage would be in favor of the home team
for six outs, three in each of the eighth and ninth innings, and in favor of the visitor for three outs in
the ninth inning, resulting in a slight advantage for the home team.
To see if manipulating the fans could possibly make any difference, a group of students at the
University of Minnesota and their professor built a “cannon” that used compressed air to shoot
baseballs. They then did the following experiment in the Metrodome in March, 2003:
1. A fixed angle of 50 degrees and a velocity of 150 feet per second were selected. In the actual
experiment, neither the velocity nor the angle could be controlled exactly, so the actual angle
and velocity varied from shot to shot.
2. The ventilation fans were set so that to the extent possible all the air was blowing in from the
outfield towards home plate, providing a headwind. After waiting about 20 minutes for the
air flows to stabilize, 20 balls were shot into the outfield, and their distances were recorded.
Additional variables recorded on each shot include the weight (in grams) and diameter (in cm)
of the ball used on that shot, and the actual velocity and angle.
3. The ventilation fans were then reversed, so as much as possible air was blowing out toward the
outfield, giving a tailwind. After waiting 20 minutes for air currents to stabilize, 15 balls were
shot into the outfield, again measuring the ball weight and diameter, and the actual velocity
and angle on each shot.
In this data, the variable names are Cond, the condition, head or tail wind; Velocity, the actual velocity
in feet per second; Angle, the actual angle; BallWt, the weight of the ball in grams used on that
particular test; BallDia, the diameter in inches of the ball used on that test; Dist, distance in feet of
the flight of the ball.
1. The following plot shows a boxplot of the response Dist for each value of Cond:
> boxplot(Dist ~ Cond, data=domedata)
Summarize the plot. Based on the boxplot, can we conclude that there is enough evidence that
manipulating the fans can change the distance that a baseball travels? 2 MARKS
2. We next examine the scatterplot matrix of the response and the continuous predictors, using
Cond to color and mark the points. Summarize the key features of the following graph. 3 MARKS
3. Interpret the ANOVA output for m1, m2 and m3. In particular, define appropriate hypotheses
for these tests. Summarize the conclusions to be made from this output. 5 MARKS
> m1 <- lm(Dist ~ Velocity + Angle)
> m2 <- lm(Dist ~ Velocity + Angle + BallWt + BallDia)
> m3 <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Cond)
> anova(m1, m2, m3)
Analysis of Variance Table

Model 1: Dist ~ Velocity + Angle
Model 2: Dist ~ Velocity + Angle + BallWt + BallDia
Model 3: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     31 2042.2
2     29 1747.0  2    295.15 3.1869 0.056627 .
3     28 1296.6  1    450.46 9.7279 0.004177 **
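The F and Pr(>F) columns can be reproduced by hand from the Res.Df and RSS columns. The sketch below assumes, as anova() does by default for a sequence of lm fits, that every row is scaled by the residual mean square of the largest model (m3). The object names rss, res_df and sigma2_hat are illustrative only.

```r
# Values copied from the anova(m1, m2, m3) table above.
rss    <- c(m1 = 2042.2, m2 = 1747.0, m3 = 1296.6)
res_df <- c(m1 = 31, m2 = 29, m3 = 28)

# Denominator: residual mean square of the largest model (m3).
sigma2_hat <- rss[["m3"]] / res_df[["m3"]]

# Numerators: drop in RSS per added parameter between consecutive models.
num_df <- -diff(res_df)                      # 2 and 1
f_stat <- (-diff(rss) / num_df) / sigma2_hat

# Upper-tail probabilities of the F distribution.
p_val <- pf(f_stat, num_df, res_df[["m3"]], lower.tail = FALSE)

print(round(cbind(Df = num_df, F = f_stat, p = p_val), 4))
```

Small differences in the last digits relative to the table are expected, because the RSS values printed there are rounded.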
4. Test whether the effect of Velocity on Dist (the Velocity slope) is the same under each Cond. Clearly
state the null and alternative hypotheses of the test. Comment on your findings. 5 MARKS
> ma <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Cond)
> mb <- update(ma, ~ . + Velocity:Cond)
> anova(ma, mb)
Analysis of Variance Table

Model 1: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
Model 2: Dist ~ Velocity + Angle + BallWt + BallDia + Cond + Velocity:Cond
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     28 1296.6
2     27 1296.5  1  0.078273 0.0016 0.9681
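To see what the Velocity:Cond term adds to the model, it can help to look at the design matrix. The sketch below uses a made-up four-row data frame, not the domedata. The level names "Head" and "Tail" and the velocities are assumptions for illustration.

```r
# Toy data, NOT the domedata: four shots with made-up velocities.
toy <- data.frame(
  Velocity = c(150, 152, 149, 151),
  Cond     = factor(c("Head", "Head", "Tail", "Tail"))
)

# Design matrix for the terms that involve Velocity and Cond.
X <- model.matrix(~ Velocity + Cond + Velocity:Cond, data = toy)
print(X)

# The interaction column is Velocity times the Tail indicator, so its
# coefficient is the difference in Velocity slopes between the conditions.
interaction_col <- X[, "Velocity:CondTail"]
```

Under this coding the interaction column is zero for headwind shots and equal to Velocity for tailwind shots.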
Q2. Consider the hospital infection risk data. The variables we will analyze are the following:
– Y = infection risk in hospital
– X1 = average length of patient’s stay (in days)
– X2 = a measure of frequency of giving X-rays
– X3 = indication in which of 4 U.S. regions the hospital is located (north-east, north-central,
south, west).
The focus of the analysis will be on regional differences. Region is a categorical variable so we must
use indicator variables to incorporate region information into the model. There are four regions.
The full set of indicator variables for the four regions is as follows:
– I1 = 1 if the hospital is in region 1 (north-east) and 0 if not.
– I2 = 1 if the hospital is in region 2 (north-central) and 0 if not.
– I3 = 1 if the hospital is in region 3 (south) and 0 if not.
– I4 = 1 if the hospital is in region 4 (west), 0 otherwise.
To avoid a linear dependency in the X matrix, we will leave out one of these indicators when
forming the model. Using all but the first indicator to describe regional differences (so that
"north-east" is the reference region), a possible multiple regression model for E(Y ), the mean
infection risk, is:
E(Y |X) = β0 + β1X1 + β2X2 + β3I2 + β4I3 + β5I4 (1)
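The linear dependency that motivates dropping one indicator can be checked numerically. The sketch below builds one hypothetical hospital per region and is not the hospital data. It assumes R's default treatment coding, which drops the indicator of the first factor level.

```r
# One hypothetical hospital per region; the first level is the reference.
regions <- c("north-east", "north-central", "south", "west")
region  <- factor(regions, levels = regions)

# Intercept plus all four indicators: 5 columns but only rank 4,
# because I1 + I2 + I3 + I4 equals the intercept column.
X_over    <- cbind(1, diag(4))
rank_over <- qr(X_over)$rank

# Dropping I1 (treatment coding, R's default) gives a full-rank matrix.
X_ref    <- model.matrix(~ region)
rank_ref <- qr(X_ref)$rank
print(X_ref)
```

In X_ref the north-east row has zeros in every indicator column, so the intercept alone describes the reference region.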
1. Based on Equation (1), write down the regression function E(Y |X) for each of the following cases: 2 MARKS
(a) For the hospital in region 1 (north-east)
(b) For the hospital in region 2 (north-central)
(c) For the hospital in region 3 (south)
(d) For the hospital in region 4 (west)
2. Interpret the coefficients β3, β4, β5 in model (1) 2 MARKS
3. The output for model (1) is reported below
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.134259   0.877347  -2.433  0.01668 *
Stay         0.505394   0.081455   6.205 1.11e-08 ***
Xray         0.017587   0.005649   3.113  0.00238 **
i2           0.171284   0.281475   0.609  0.54416
i3           0.095461   0.288852   0.330  0.74169
i4           1.057835   0.378077   2.798  0.00612 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.036 on 105 degrees of freedom
Multiple R-squared: 0.4198, Adjusted R-squared: 0.3922
F-statistic: 15.19 on 5 and 105 DF, p-value: 3.243e-11
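As a reading aid for this output, the t value and Pr(>|t|) columns follow directly from the Estimate and Std. Error columns together with the residual degrees of freedom. The sketch below checks the i4 row by hand. The object names are illustrative, and the numbers are copied from the output.

```r
# Values copied from the i4 row and the footer of the output above.
estimate <- 1.057835
std_err  <- 0.378077
res_df   <- 105

# t statistic and its two-sided p-value under a t distribution.
t_value <- estimate / std_err
p_value <- 2 * pt(abs(t_value), df = res_df, lower.tail = FALSE)

# Residual sum of squares implied by the residual standard error.
rss_full <- 1.036^2 * res_df

print(c(t = t_value, p = p_value, RSS = rss_full))
```

The implied residual sum of squares agrees, up to rounding of the residual standard error, with the SSRes(full) = 112.71 quoted in part (d).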
(a) What is the sample size n of this dataset? 2 MARKS
(b) Interpret the p-value for I4. 2 MARKS
(c) Is the difference between the mean infection risks in the north-east and the west strong? 3 MARKS
(d) Consider an overall test of regional differences with the null H0 : β3 = β4 = β5 = 0
against the alternative Ha : at least one of β3, β4, β5 is not zero. What is the reduced model
under the null hypothesis? If the reduced model has SSRes(reduced) = 123.56 with
df = 108, and the full model (1) has SSRes(full) = 112.71 with df = 105, what is the
value of the F-statistic? 2 MARKS
(e) From the p-values of I2, I3 and I4, what conclusions can you draw? 2 MARKS
4. Suppose we compare the following two models:
H0 : E(Y |X) = β0 + β1X1 + β2X2 + β5I4
Ha : E(Y |X) = β0 + β1X1 + β2X2 + β3I2 + β4I3 + β5I4
If the reduced model has residual sum of squares SSE(reduced) = 113.11, what are the corresponding
residual degrees of freedom df of this model? What is the value of the F-statistic for this test? 5 MARKS
Q3. Depression Treatments data. Some researchers (Daniel, 1999) were interested in comparing the
effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the
three treatments A, B, and C. The researchers collected the following data on a random sample of
n = 36 severely depressed individuals:
– yi = measure of the effectiveness of the treatment for individual i
– xi1 = age (in years) of individual i
– xi2 = 1 if individual i received treatment A and 0, if not
– xi3 = 1 if individual i received treatment B and 0, if not
A scatter plot of the data with treatment effectiveness on the y-axis and age on the x-axis looks like:
The blue circles represent the data for individuals receiving treatment A, the red squares represent
the data for individuals receiving treatment B, and the green diamonds represent the data for
individuals receiving treatment C.
We consider a (second-order) multiple regression model with interaction terms:
yi = β0 + β1xi1 + β2xi2 + β3xi3 + β12xi1xi2 + β13xi1xi3 + εi. (2)
1. Write down the regression function E(Y |X) 4 MARKS
(a) if patient receives A.
(b) if patient receives B.
(c) if patient receives C.
2. Suppose the estimated regression functions are:
(a) If patient receives A: ŷ = 47.5 + 0.33x1.
(b) If patient receives B: ŷ = 28.9 + 0.52x1.
(c) If patient receives C: ŷ = 6.21 + 1.03x1.
When we plot these three "best fitting" lines, we obtain:
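The plot itself is not reproduced in this copy. A minimal sketch for redrawing the three lines from the coefficients in (a) to (c) is given below. The age range of 20 to 70 years is an assumption, since the exam text does not state the observed ages. The colours follow the scatter plot description (blue for A, red for B, green for C).

```r
# Estimated intercepts and slopes copied from (a)-(c) above.
fits <- data.frame(
  treatment = c("A", "B", "C"),
  intercept = c(47.5, 28.9, 6.21),
  slope     = c(0.33, 0.52, 1.03)
)

age <- c(20, 70)  # ASSUMED plotting range; not stated in the exam

# Rows = ages, columns = treatments: intercept + slope * age.
pred <- outer(age, fits$slope) + rep(fits$intercept, each = length(age))
colnames(pred) <- fits$treatment

cols <- c("blue", "red", "darkgreen")
matplot(age, pred, type = "l", lty = 1, col = cols,
        xlab = "Age (years)", ylab = "Fitted treatment effectiveness")
legend("topleft", legend = fits$treatment, col = cols, lty = 1)
```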
Question: What do the estimated slopes tell us? What does the "nonparallelness" of the lines
imply? Does age appear to interact with treatment in its impact on treatment effectiveness?
Why? 4 MARKS
3. The residual plot and the QQ plot of the residuals for the model (2) are provided below:
Question: What conclusions can you draw from these plots? 4 MARKS
4. For every age, is there a difference in the mean effectiveness for the three treatments? Write
down the null and alternative hypotheses of the corresponding test. 4 MARKS
5. Does the effect of age on the treatment's effectiveness depend on treatment? Write down the
null and alternative hypotheses of the corresponding test. 4 MARKS