STAC67: Assignment 3

Deadline to hand in: Nov. 15, 2021

Total Points: 100

Q. 1 (10 pts) Prove the following two statements.

(a) (4 pts) Suppose $A$ is an $m \times n$ constant matrix and $\underset{\sim}{Y}$ is an $n \times 1$ random vector. Then

$\mathrm{Var}(A\underset{\sim}{Y}) = A\,\mathrm{Var}(\underset{\sim}{Y})\,A'$

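(b) (6 pts) SSR (the regression sum of squares) in matrix notation is

$SSR = \underset{\sim}{\hat\beta}' X' \underset{\sim}{Y} - \frac{1}{n}\,\underset{\sim}{Y}' J \underset{\sim}{Y}$

where $J$ is the $n \times n$ matrix of ones.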
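Before writing the proof of part (a), it can help to see the identity hold on numbers. The sketch below is only a sanity check, not a proof: the matrix `A`, the simulated draws, and the helper names are all made up for illustration. It relies on the fact that the same identity holds for sample covariance matrices, so the two sides should agree up to floating-point rounding.

```python
import random

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def sample_cov(rows):
    """Sample covariance matrix of observations stored one per row."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]

random.seed(0)

# Hypothetical setup: Y has n = 3 components, A has m = 2 rows, 500 draws.
A = [[1.0, 2.0, -1.0],
     [0.5, 0.0, 3.0]]
draws = []
for _ in range(500):
    z = random.gauss(0, 1)  # shared term so the components of Y are correlated
    draws.append([z + random.gauss(0, 1),
                  2 * z + random.gauss(0, 1),
                  random.gauss(0, 2)])

# Apply A to every draw of Y.
transformed = [[sum(a * y for a, y in zip(row, obs)) for row in A] for obs in draws]

lhs = sample_cov(transformed)                             # Var(AY)
rhs = matmul(matmul(A, sample_cov(draws)), transpose(A))  # A Var(Y) A'
max_gap = max(abs(l - r) for lr, rr in zip(lhs, rhs) for l, r in zip(lr, rr))
print(max_gap)  # expected to be at rounding-error level
```

The proof itself should still start from the definition of the variance-covariance matrix and the linearity of expectation.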

Q. 2 (14 points) A Regression Analysis (STAC67) class has assignments, two term

tests and a final exam. The instructor wanted to know how well the final

exam mark was related to the assignment and term test marks so she used

least squares to fit a linear model to the marks for 20 students. Here is the

resulting parameter estimates table.

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -11.87286      8.17407     -1.45     0.1657
test1          0.50015      0.08800      5.68    <.0001
test2          0.55284      0.13449      4.11     0.0008
assn           0.20775      0.12309      1.69     0.1109

(a) (5 pts) If we are told that the total sum of squares is 2501 and the model sum

of squares is 2314, construct the Analysis of Variance table for the fitted

model.

(b) (4 pts) Estimate the standard deviation of the error terms.

(c) (5 pts) What final exam mark would you predict for a student who scored 70

on both term tests and 90 on the assignments? Give the appropriate

formula to produce a prediction interval for this student and explain

how such an interval should be interpreted.
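The arithmetic behind parts (a) to (c) can be sketched in a few lines. This is a hedged illustration rather than a model answer: the function name `anova_from_sums` and the dictionary layout are invented here, and the only inputs are the figures quoted in the question (20 students, 3 predictors, the two sums of squares, and the coefficient table). The prediction interval in (c) also needs $(X'X)^{-1}$, which the table does not supply, so the code stops at the point prediction.

```python
import math

def anova_from_sums(n, p, ssto, ssr):
    """ANOVA quantities for a model with p predictors plus an intercept."""
    sse = ssto - ssr
    df_model, df_error = p, n - p - 1
    msr, mse = ssr / df_model, sse / df_error
    return {
        "SSR": ssr, "SSE": sse, "SSTO": ssto,
        "df_model": df_model, "df_error": df_error, "df_total": n - 1,
        "MSR": msr, "MSE": mse, "F": msr / mse,
        "sigma_hat": math.sqrt(mse),  # estimated SD of the error terms
    }

table = anova_from_sums(n=20, p=3, ssto=2501, ssr=2314)
for name, value in table.items():
    print(f"{name:>9}: {value}")

# Point prediction from the fitted coefficients in the estimates table.
coef = {"(Intercept)": -11.87286, "test1": 0.50015, "test2": 0.55284, "assn": 0.20775}
student = {"test1": 70, "test2": 70, "assn": 90}
y_hat = coef["(Intercept)"] + sum(coef[k] * v for k, v in student.items())
print("predicted final exam mark:", round(y_hat, 2))
```

Keeping the degrees of freedom in the same structure as the sums of squares makes it harder to divide by the wrong one when filling in the table.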

Q. 3 (20 points) The public health department wished to study the relation between the average estimated probability of acquiring an infection in the hospital (infections, in percent; higher is worse) and three predictors: the average length of stay of all patients in the hospital (StayLength, in days, $X_1$), the average age of patients (Age, in years, $X_2$), and the average number of beds in the hospital during the study period (Beds, $X_3$). The data file, “Infections.csv”, can be found on Quercus. Please ignore the other three variables (MedSchool, Region, and Nurses) for this question.

(a) (4 pts) Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your principal findings. Is there any concern about multicollinearity?

(b) (4 pts) Fit a regression model with the three predictor variables to the data and state the estimated regression function. How is $\hat\beta_2$ interpreted here?

(c) (4 pts) Test whether there is a regression relation; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion. What does your test imply about $\beta_1$, $\beta_2$, and $\beta_3$? What is the P-value of the test?

(d) (4 pts) Calculate the coefficient of determination and the adjusted coefficient of determination. What do they indicate here?

(e) (4 pts) Obtain a 90% prediction interval for the infection rate of a new hospital with StayLength = 10, Age = 45, and Beds = 150. Interpret your prediction interval.

Q. 4 (10 pts) Cobb and Douglas (1928) proposed a multiplicative production function in which the response is Quantity Produced ($Y$) and the independent variables are Capital ($X_1$) and Labor ($X_2$). The data are US production figures from 1899 to 1922. The data file, “cobbdoug.txt”, is posted on Quercus; name the columns as follows: Year, Y, X1, and X2.

Transform all variables to the log scale first:

$Y^* = \log(Y)$, $X_1^* = \log(X_1)$, and $X_2^* = \log(X_2)$

Test the hypothesis:

$H_0: \beta_1 + \beta_2 = 1$

(a) (5 pts) Use the general linear hypothesis test based on the F distribution.

(b) (5 pts) Use the t distribution for the test, and compare the result with (a).
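The two approaches in (a) and (b) can be compared side by side on synthetic numbers. The sketch below does not use cobbdoug.txt: the data are simulated, and `solve` and `ols` are small hypothetical helpers written for this illustration. It fits the full model, fits the reduced model implied by the null hypothesis (substituting $\beta_2 = 1 - \beta_1$), and forms both statistics. With a single linear restriction the squared t statistic should match the F statistic.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Return (coefficients, SSE, X'X) for a design matrix X with intercept column."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    return beta, sse, xtx

# Simulated stand-ins for the log-transformed variables (NOT the real data).
random.seed(1)
n = 24
x1 = [random.uniform(4.5, 6.0) for _ in range(n)]
x2 = [random.uniform(4.5, 5.5) for _ in range(n)]
y = [0.1 + 0.3 * a + 0.8 * b + random.gauss(0, 0.05) for a, b in zip(x1, x2)]

# Full model: Y* = b0 + b1 X1* + b2 X2* + error
beta, sse_full, xtx = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)
df_full = n - 3

# Reduced model under H0 (b2 = 1 - b1): Y* - X2* = b0 + b1 (X1* - X2*) + error
_, sse_red, _ = ols([[1.0, a - b] for a, b in zip(x1, x2)],
                    [yi - b for yi, b in zip(y, x2)])
df_red = n - 2

F = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)

# t statistic for c'beta = 1 with c = (0, 1, 1); Var(c'b) = MSE * c'(X'X)^{-1} c
c = [0.0, 1.0, 1.0]
w = solve(xtx, c)  # (X'X)^{-1} c
se = math.sqrt(sse_full / df_full * sum(ci * wi for ci, wi in zip(c, w)))
t = (beta[1] + beta[2] - 1.0) / se
print(F, t ** 2)
```

The standard library has no F or t distribution functions, so the critical values or P-values would still come from tables or statistical software, using 1 and n − 3 degrees of freedom for F and n − 3 for t.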

Q. 5 (20 points) Suppose that X is a categorical variable with 3 levels (A, B, C) and we define the indicator variables $I_1$ and $I_2$ as:

$I_1 = \begin{cases} 1, & X = A \\ 0, & \text{otherwise} \end{cases} \qquad I_2 = \begin{cases} 1, & X = B \\ 0, & \text{otherwise} \end{cases}$

For a continuous response variable Y consider fitting the linear model

$Y = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \varepsilon$.

We take a total sample of $n$ individuals. Let $n_A, n_B, n_C$ be the number of individuals in each category of X and let $\bar{y}_A, \bar{y}_B, \bar{y}_C$ be the sample means of Y for individuals in each category of X.

(a) (5 pts) Find $X'X$ and $X'\underset{\sim}{Y}$.

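(b) (10 pts) Show that the least squares estimates for this model are

$\hat\beta_0 = \bar{y}_C, \quad \hat\beta_1 = \bar{y}_A - \bar{y}_C, \quad \hat\beta_2 = \bar{y}_B - \bar{y}_C$

using both of the following options (5 points each).

(option 1) $\underset{\sim}{\hat\beta} = (X'X)^{-1}X'\underset{\sim}{Y}$.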

(option 2) Minimize, over the parameter values $\beta_0, \beta_1, \beta_2$, the sum of squared errors

$S(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 I_{1i} - \beta_2 I_{2i})^2$.

(c) (5 pts) Let $s_A^2, s_B^2, s_C^2$ be the usual sample variances of Y for individuals in each category of X. Show that the error sum of squares can be written as

$SSE = (n_A - 1)s_A^2 + (n_B - 1)s_B^2 + (n_C - 1)s_C^2$
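The claims in parts (b) and (c) can be checked on a toy data set before attempting the algebra. The sketch below is an illustration under assumptions, not a proof: the response values in `groups` are invented. It builds the design matrix with the indicator coding defined above, plugs the claimed estimates into the normal equations $X'X\hat\beta = X'Y$, and compares the resulting SSE with the pooled within-group formula.

```python
from statistics import mean, variance

# Made-up responses for the three levels of X (illustration only).
groups = {
    "A": [4.1, 5.0, 6.3, 5.2],
    "B": [7.4, 8.1, 6.9],
    "C": [2.0, 3.5, 2.9, 3.1, 2.5],
}

# Design matrix rows are [1, I1, I2], with C as the baseline level.
rows, y = [], []
for level, values in groups.items():
    for v in values:
        rows.append([1.0, float(level == "A"), float(level == "B")])
        y.append(v)

xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]

# Claimed least squares estimates, written in terms of the group means.
ybar = {g: mean(v) for g, v in groups.items()}
beta = [ybar["C"], ybar["A"] - ybar["C"], ybar["B"] - ybar["C"]]

# Normal equations: (X'X) beta should reproduce X'Y.
lhs = [sum(xtx[i][j] * beta[j] for j in range(3)) for i in range(3)]
normal_eq_gap = max(abs(a - b) for a, b in zip(lhs, xty))

# SSE from the residuals versus the pooled within-group formula.
sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
sse_from_variances = sum((len(v) - 1) * variance(v) for v in groups.values())

print(xtx)
print(normal_eq_gap, sse, sse_from_variances)
```

Printing `xtx` shows how the group counts populate the matrix, which is a useful hint for part (a); the written solution should still derive each entry from the definitions of $I_1$ and $I_2$.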

Q. 6 (26 pts) We will use the same dataset as in Question 3, “Infections.csv”, for this question.

The variables that will be used are described below:

• Infections (Y): the average estimated probability of acquiring an infection

in the hospital, in percent; higher is worse

• Beds: the average number of beds in the hospital during the study period

• Region: geographic region (NE = Northeast, NC = North Central, S =

South, W = West)

(a) (8 pts) Write down the full model with the interaction terms. Fit the full model in R. Compute the estimated regression function for each geographic region and plot them.

(b) (4 pts) Test whether the slopes relating the average number of beds to infections are the same for each geographic region at the $\alpha = 0.05$ significance level.

(c) (2 pts) What model would you choose for this data? Justify your answer.

(d) (6 pts) For the model you chose in (c), check and comment on the standard assumptions of the regression model.

(e) (6 pts) Look for a suitable transformation of Y and/or X (= Beds). Fit the regression with the transformed variable(s), without interaction, and comment on whether this model fits better.
