# STAC67: Assignment 3

Deadline to hand in: Nov. 15, 2021
Total Points: 100
Q. 1 (10 pts) Provide a proof for each of the following two statements.
(a) (4 pts)
Suppose $A$ is an $m \times n$ constant matrix and $\tilde{Y}$ is an $n \times 1$ random vector.
Then
$$\mathrm{Var}(A\tilde{Y}) = A\,\mathrm{Var}(\tilde{Y})\,A'$$
(b) (6 pts) SSR (the regression sum of squares) in matrix notation is:
$$\mathrm{SSR} = \hat{\tilde{\beta}}' X' \tilde{Y} - \frac{1}{n}\tilde{Y}' J \tilde{Y}$$
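One possible outline of the two arguments is sketched below. It assumes that $J$ denotes the $n \times n$ matrix of ones, that the model contains an intercept, and that $\hat{\tilde{\beta}}$ satisfies the normal equations $X'X\hat{\tilde{\beta}} = X'\tilde{Y}$; none of these are stated explicitly in the question, so treat this as a sketch under those assumptions rather than the intended solution.

```latex
% (a) Use the definition of the variance-covariance matrix and linearity of expectation.
\[
\begin{aligned}
\mathrm{Var}(A\tilde{Y})
  &= E\!\left[(A\tilde{Y} - A\,E\tilde{Y})(A\tilde{Y} - A\,E\tilde{Y})'\right] \\
  &= E\!\left[A(\tilde{Y} - E\tilde{Y})(\tilde{Y} - E\tilde{Y})'A'\right] \\
  &= A\,E\!\left[(\tilde{Y} - E\tilde{Y})(\tilde{Y} - E\tilde{Y})'\right]A'
   = A\,\mathrm{Var}(\tilde{Y})\,A'.
\end{aligned}
\]

% (b) Write SSR = SSTO - SSE and expand each term.
\[
\begin{aligned}
\mathrm{SSTO} &= \sum_{i=1}^{n}(Y_i - \bar{Y})^2
              = \tilde{Y}'\tilde{Y} - \frac{1}{n}\tilde{Y}'J\tilde{Y}, \\
\mathrm{SSE}  &= (\tilde{Y} - X\hat{\tilde{\beta}})'(\tilde{Y} - X\hat{\tilde{\beta}})
              = \tilde{Y}'\tilde{Y} - \hat{\tilde{\beta}}'X'\tilde{Y}
              \quad\text{(using } X'X\hat{\tilde{\beta}} = X'\tilde{Y}\text{)}, \\
\mathrm{SSR}  &= \mathrm{SSTO} - \mathrm{SSE}
              = \hat{\tilde{\beta}}'X'\tilde{Y} - \frac{1}{n}\tilde{Y}'J\tilde{Y}.
\end{aligned}
\]
```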
Q. 2 (14 points) A Regression Analysis (STAC67) class has assignments, two term
tests and a final exam. The instructor wanted to know how well the final
exam mark was related to the assignment and term test marks so she used
least squares to fit a linear model to the marks for 20 students. Here is the
resulting parameter estimates table.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | −11.87286 | 8.17407 | −1.45 | 0.1657 |
| test1 | 0.50015 | 0.08800 | 5.68 | < .0001 |
| test2 | 0.55284 | 0.13449 | 4.11 | 0.0008 |
| assn | 0.20775 | 0.12309 | 1.69 | 0.1109 |

(a) (5 pts) If we are told that the total sum of squares is 2501 and the model sum
of squares is 2314, construct the Analysis of Variance table for the fitted
model.
(b) (4 pts) Estimate the standard deviation of the error terms.
(c) (5 pts) What final exam mark would you predict for a student who scored 70
on both term tests and 90 on the assignments? Give the appropriate
formula to produce a prediction interval for this student and explain
how such an interval should be interpreted.
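The arithmetic behind parts (a) to (c) can be sketched as follows. This is a minimal sketch that assumes the fitted model has an intercept and three predictors for n = 20 students, so the error degrees of freedom are 20 − 4 = 16; all variable names are illustrative and not part of the assignment.

```python
import math

# Quantities given in the question
n = 20            # number of students
k = 3             # number of predictors: test1, test2, assn
ssto = 2501.0     # total sum of squares
ssr = 2314.0      # model (regression) sum of squares

# ANOVA decomposition: SSTO = SSR + SSE
sse = ssto - ssr
df_model = k
df_error = n - k - 1   # assumes an intercept is in the model
df_total = n - 1

msr = ssr / df_model
mse = sse / df_error
f_stat = msr / mse

# Estimated standard deviation of the error terms
sigma_hat = math.sqrt(mse)

# Point prediction using the coefficient estimates from the table
coef = {"(Intercept)": -11.87286, "test1": 0.50015, "test2": 0.55284, "assn": 0.20775}
new_student = {"test1": 70.0, "test2": 70.0, "assn": 90.0}
y_hat = coef["(Intercept)"] + sum(coef[name] * value for name, value in new_student.items())

print(f"{'Source':<12}{'df':>4}{'SS':>10}{'MS':>12}{'F':>10}")
print(f"{'Regression':<12}{df_model:>4}{ssr:>10.1f}{msr:>12.4f}{f_stat:>10.4f}")
print(f"{'Error':<12}{df_error:>4}{sse:>10.1f}{mse:>12.4f}")
print(f"{'Total':<12}{df_total:>4}{ssto:>10.1f}")
print(f"sigma_hat = {sigma_hat:.4f}, predicted final exam mark = {y_hat:.4f}")
```

The prediction interval itself also needs the standard error of prediction, which depends on $(X'X)^{-1}$ and therefore cannot be computed from the summary numbers given here.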
Q. 3 (20 points) The public health department wished to study the relation between
the average estimated probability of acquiring an infection in the hospital
(Infections, in percent; higher is worse) and the average length of stay of all
patients in the hospital (StayLength, in days, X1), the average age of patients
(Age, in years, X2), and the average number of beds in the hospital during the study
period (Beds, X3). The data file, "Infections.csv", can be found on Quercus.
Please ignore the other three variables (MedSchool, Region, and Nurses) for
this question.
(a) (4 pts) Obtain the scatter plot matrix and the correlation matrix. Interpret
these and state your principal findings. Is there any concern about
multicollinearity?
(b) (4 pts) Fit a regression model with the three predictor variables to the data and state
the estimated regression function. How is $\hat{\beta}_2$ interpreted here?
(c) (4 pts) Test whether there is a regression relation; use α = 0.05. State the
alternatives, decision rule, and conclusion. What does your test imply
about β1, β2, and β3? What is the P-value of the test?
(d) (4 pts) Calculate the coefficient of determination and the adjusted coefficient
of determination. What do they indicate here?
(e) (4 pts) Obtain a 90% prediction interval for a new hospital infection rate when
StayLength = 10, Age = 45, and Beds = 150. Interpret your prediction
interval.
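The two quantities in part (d) can be computed from the sums of squares of any fitted model. The helper names below (`r_squared`, `adjusted_r_squared`) are hypothetical, and the numbers used to exercise them are the sums of squares quoted in Q. 2, not results from the infections data; this is only a sketch of the formulas.

```python
def r_squared(sse: float, ssto: float) -> float:
    """Coefficient of determination: share of total variation explained."""
    return 1.0 - sse / ssto


def adjusted_r_squared(sse: float, ssto: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with an intercept and k predictors.

    Each sum of squares is divided by its degrees of freedom, so adding a
    predictor that barely reduces SSE can lower the adjusted value.
    """
    return 1.0 - (sse / (n - k - 1)) / (ssto / (n - 1))


# Illustration with the Q. 2 sums of squares (n = 20 students, k = 3 predictors)
ssto_q2 = 2501.0
sse_q2 = 2501.0 - 2314.0
r2 = r_squared(sse_q2, ssto_q2)
r2_adj = adjusted_r_squared(sse_q2, ssto_q2, n=20, k=3)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```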
Q. 4 (10 pts) Cobb and Douglas (1928) proposed a multiplicative production function
in which the response is Quantity Produced (Y) and the independent variables are Capital
(X1) and Labor (X2). The data are US production data from 1899-1922. The
data file, "cobbdoug.txt", is posted on Quercus; name the columns
as follows: Year, Y, X1 and X2.
Transform all variables to the log scale first:
$$Y^* = \log(Y), \quad X_1^* = \log(X_1), \quad X_2^* = \log(X_2)$$
Test the hypothesis:
H0 : β1 + β2 = 1
(a) (5 pts) Carry out the generalized hypothesis test using the F distribution.
(b) (5 pts) Carry out the test using the t distribution, and compare the result with (a).
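The two approaches can be set up as sketched below. The sketch assumes the full model is the log-linear regression $Y^* = \beta_0 + \beta_1 X_1^* + \beta_2 X_2^* + \varepsilon$ with normal errors, and it uses $b_1, b_2$ for the least squares estimates and $s^2\{\cdot\}$, $s\{\cdot,\cdot\}$ for their estimated variances and covariance; that notation is an assumption, not taken from the question.

```latex
% Reduced model: substitute beta_2 = 1 - beta_1 under H_0.
\[
Y^* - X_2^* = \beta_0 + \beta_1\,(X_1^* - X_2^*) + \varepsilon
\]

% (a) General linear test: the full model has n - 3 error df, the reduced model n - 2.
\[
F^* = \frac{\bigl(\mathrm{SSE}(R) - \mathrm{SSE}(F)\bigr)/1}{\mathrm{SSE}(F)/(n-3)},
\qquad \text{reject } H_0 \text{ if } F^* > F(1-\alpha;\, 1,\, n-3).
\]

% (b) t test for the linear combination beta_1 + beta_2.
\[
t^* = \frac{b_1 + b_2 - 1}{s\{b_1 + b_2\}},
\qquad
s^2\{b_1 + b_2\} = s^2\{b_1\} + s^2\{b_2\} + 2\,s\{b_1, b_2\},
\]
\[
\text{reject } H_0 \text{ if } |t^*| > t(1-\alpha/2;\, n-3).
\]

% The two tests agree because the hypothesis imposes a single linear restriction.
\[
(t^*)^2 = F^*
\]
```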
Q. 5 (20 points) Suppose that X is a categorical variable with 3 levels (A, B, C)
and we define the indicator variables $I_1$ and $I_2$ as:
$$I_1 = \begin{cases} 1, & X = A \\ 0, & \text{otherwise} \end{cases} \qquad I_2 = \begin{cases} 1, & X = B \\ 0, & \text{otherwise} \end{cases}$$
For a continuous response variable Y consider fitting the linear model
$$Y = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \varepsilon.$$
We take a total sample of n individuals. Let $n_A, n_B, n_C$ be the number of
individuals in each category of X and let $\bar{y}_A, \bar{y}_B, \bar{y}_C$ be the sample means of
Y for individuals in each category of X.
(a) (5 pts) Find $X'X$ and $X'\tilde{Y}$.
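(b) (10 pts) Show that the least squares estimates for this model are
$$\hat{\beta}_0 = \bar{y}_C, \quad \hat{\beta}_1 = \bar{y}_A - \bar{y}_C, \quad \hat{\beta}_2 = \bar{y}_B - \bar{y}_C$$
using both of the following options (5 points each):
(option 1) Use the matrix formula $\hat{\beta} = (X'X)^{-1}X'\tilde{Y}$.
(option 2) Minimize the sum of squared errors directly over the parameter values $\beta_0, \beta_1, \beta_2$:
$$S(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 I_{1i} - \beta_2 I_{2i})^2.$$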
(c) (5 pts) Let $s_A^2, s_B^2, s_C^2$ be the usual sample variances of Y for
individuals in each category of X. Show that the error sum of squares can
be written as
$$\mathrm{SSE} = (n_A - 1)s_A^2 + (n_B - 1)s_B^2 + (n_C - 1)s_C^2$$
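A quick numerical check of the claims in parts (a) to (c) can be sketched as follows. The data are made up purely for illustration and do not come from the assignment. The code builds the design matrix with columns (1, I1, I2), confirms that the stated estimates satisfy the normal equations X'X b = X'Y, and compares the residual sum of squares with the pooled within-group form.

```python
from statistics import mean, variance

# Made-up illustrative data: (level of X, response Y)
data = [("A", 4.0), ("A", 6.0), ("A", 5.0),
        ("B", 9.0), ("B", 7.0),
        ("C", 1.0), ("C", 2.0), ("C", 3.0), ("C", 2.0)]

groups = {g: [y for label, y in data if label == g] for g in "ABC"}
ybar = {g: mean(values) for g, values in groups.items()}

# Design matrix rows (1, I1, I2) with I1 = 1 for level A and I2 = 1 for level B
X = [(1.0, float(label == "A"), float(label == "B")) for label, _ in data]
y = [value for _, value in data]

# X'X and X'Y computed entry by entry
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]

# Claimed least squares estimates
b = [ybar["C"], ybar["A"] - ybar["C"], ybar["B"] - ybar["C"]]

# The estimates solve the normal equations if X'X b equals X'Y
lhs = [sum(XtX[i][j] * b[j] for j in range(3)) for i in range(3)]
normal_eq_ok = all(abs(left - right) < 1e-9 for left, right in zip(lhs, Xty))

# SSE from the fitted values versus the pooled within-group form
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sse_pooled = sum((len(values) - 1) * variance(values) for values in groups.values())

print("X'X =", XtX)
print("X'Y =", Xty)
print("normal equations satisfied:", normal_eq_ok)
print("SSE =", sse, " pooled form =", sse_pooled)
```

In this example X'X contains only the group counts and X'Y only the group totals, which is the structure part (a) asks you to derive in general.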
Q. 6 (26 pts) We will use the same dataset, “Infections.csv”, from Question 3 for this question.
The variables that will be used are described below:
• Infections (Y): the average estimated probability of acquiring an infection
in the hospital, in percent; higher is worse
• Beds: the average number of beds in hospital during study period
• Region: geographic region (NE = Northeast, NC = North Central, S =
South, W = West)
(a) (8 pts) Write down the full model with the interaction terms. Fit the full model
in R. Compute the estimated regression function for each geographic region
and plot them.
(b) (4 pts) Test whether the slopes relating the average number of beds to infections
are the same for each geographic region at the α = 0.05 significance level.
(c) (2 pts) What model would you choose for this data? Justify your answer.
(d) (6 pts) For the model you chose in (c), check and comment on the standard
assumptions of the regression model.
(e) (6 pts) Look for a suitable transformation of Y and/or X (= Beds). Fit the regression
with the transformed variable(s) without interaction and comment
on whether this model fits better.
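One way to write the full model for parts (a) and (b) is sketched below. It assumes NE is taken as the baseline region and introduces indicator variables $I_{NC}, I_S, I_W$ for the other three regions; the choice of baseline and the indicator names are assumptions, and software may pick a different reference level.

```latex
% Full model: separate intercept and slope for each region (NE is the baseline).
\[
Y = \beta_0 + \beta_1\,\mathrm{Beds}
  + \beta_2 I_{NC} + \beta_3 I_{S} + \beta_4 I_{W}
  + \beta_5\,\mathrm{Beds}\cdot I_{NC}
  + \beta_6\,\mathrm{Beds}\cdot I_{S}
  + \beta_7\,\mathrm{Beds}\cdot I_{W} + \varepsilon
\]

% Region-specific regression functions implied by the full model.
\[
\begin{aligned}
\text{NE:}\quad & E(Y) = \beta_0 + \beta_1\,\mathrm{Beds} \\
\text{NC:}\quad & E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_5)\,\mathrm{Beds} \\
\text{S:}\quad  & E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_6)\,\mathrm{Beds} \\
\text{W:}\quad  & E(Y) = (\beta_0 + \beta_4) + (\beta_1 + \beta_7)\,\mathrm{Beds}
\end{aligned}
\]

% Equal slopes across regions correspond to dropping the three interaction terms.
\[
H_0:\ \beta_5 = \beta_6 = \beta_7 = 0
\qquad\text{vs.}\qquad
H_a:\ \text{at least one of } \beta_5, \beta_6, \beta_7 \text{ is nonzero}
\]
\[
F^* = \frac{\bigl(\mathrm{SSE}(R) - \mathrm{SSE}(F)\bigr)/3}{\mathrm{SSE}(F)/(n-8)},
\qquad \text{reject } H_0 \text{ if } F^* > F(0.95;\, 3,\, n-8).
\]
```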