STAC67: Assignment 3

Deadline to hand in: Nov. 15, 2021
Total Points: 100

Q. 1 (10 pts) Provide proofs for the following two statements.

(a) (4 pts) Suppose $A$ is an $m \times n$ constant matrix and $\mathbf{Y}$ is an $n \times 1$ random vector. Then
$$\mathrm{Var}(A\mathbf{Y}) = A\,\mathrm{Var}(\mathbf{Y})\,A'$$

(b) (6 pts) SSR (the regression sum of squares) in matrix notation is
$$\mathrm{SSR} = \hat{\boldsymbol{\beta}}' X' \mathbf{Y} - \frac{1}{n}\,\mathbf{Y}' J \mathbf{Y}$$

Q. 2 (14 points) A Regression Analysis (STAC67) class has assignments, two term tests and a final exam. The instructor wanted to know how well the final exam mark was related to the assignment and term test marks, so she used least squares to fit a linear model to the marks for 20 students. Here is the resulting table of parameter estimates.

              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)  -11.87286    8.17407      -1.45     0.1657
test1          0.50015    0.08800       5.68     <.0001
test2          0.55284    0.13449       4.11     0.0008
assn           0.20775    0.12309       1.69     0.1109

(a) (5 pts) If we are told that the total sum of squares is 2501 and the model sum of squares is 2314, construct the Analysis of Variance table for the fitted model.

(b) (4 pts) Estimate the standard deviation of the error terms.

(c) (5 pts) What final exam mark would you predict for a student who scored 70 on both term tests and 90 on the assignments? Give the appropriate formula to produce a prediction interval for this student and explain how such an interval should be interpreted.

Q. 3 (20 points) The public health department wished to study the relation between the average estimated probability of acquiring an infection in the hospital (Infections, in percent; higher is worse) and the average length of stay of all patients in the hospital (StayLength, in days, $X_1$), the average age of patients (Age, in years, $X_2$), and the average number of beds in the hospital during the study period (Beds, $X_3$). The data file, "Infections.csv", can be found on Quercus. Please ignore the other three variables (MedSchool, Region, and Nurses) for this question.

(a) (4 pts) Obtain the scatter plot matrix and the correlation matrix.
Interpret these and state your principal findings. Is there any concern about multicollinearity?

(b) (4 pts) Fit the regression model with the three predictor variables to the data and state the estimated regression function. How is $\hat{\beta}_2$ interpreted here?

(c) (4 pts) Test whether there is a regression relation; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion. What does your test imply about $\beta_1$, $\beta_2$, and $\beta_3$? What is the P-value of the test?

(d) (4 pts) Calculate the coefficient of determination and the adjusted coefficient of determination. What do they indicate here?

(e) (4 pts) Obtain a 90% prediction interval for a new hospital's infection rate when StayLength = 10, Age = 45, and Beds = 150. Interpret your prediction interval.

Q. 4 (10 pts) Cobb and Douglas (1928) proposed a multiplicative production function in which the response is Quantity Produced ($Y$) and the independent variables are Capital ($X_1$) and Labor ($X_2$). The data are US production data from 1899 to 1922. The data file, "cobbdoug.txt", is posted on Quercus; name the columns as follows: Year, Y, X1 and X2. First transform all variables to logs:
$$Y^* = \log(Y), \quad X_1^* = \log(X_1), \quad X_2^* = \log(X_2)$$
Test the hypothesis $H_0: \beta_1 + \beta_2 = 1$.

(a) (5 pts) Use the generalized hypothesis test based on the F distribution.

(b) (5 pts) Use the t distribution for the test, and compare with (a).

Q. 5 (20 points) Suppose that $X$ is a categorical variable with 3 levels (A, B, C) and we define the indicator variables $I_1$ and $I_2$ as
$$I_1 = \begin{cases} 1, & X = A \\ 0, & \text{otherwise} \end{cases} \qquad I_2 = \begin{cases} 1, & X = B \\ 0, & \text{otherwise} \end{cases}$$
For a continuous response variable $Y$, consider fitting the linear model
$$Y = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \varepsilon.$$
We take a total sample of $n$ individuals. Let $n_A, n_B, n_C$ be the number of individuals in each category of $X$ and let $\bar{y}_A, \bar{y}_B, \bar{y}_C$ be the sample means of $Y$ for individuals in each category of $X$.

(a) (5 pts) Find $X'X$ and $X'\mathbf{Y}$.

(b) (10 pts) Show that the least squares estimates for this model are
$$\hat{\beta}_0 = \bar{y}_C, \quad \hat{\beta}_1 = \bar{y}_A - \bar{y}_C, \quad \hat{\beta}_2 = \bar{y}_B - \bar{y}_C$$
using both of the following options (5 points each):

(option 1) Use $\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{Y}$.

(option 2) Minimize, over the parameter values $\beta_0, \beta_1, \beta_2$, the sum of squared errors
$$S(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 I_{1i} - \beta_2 I_{2i})^2.$$

(c) (5 pts) Let $s_A^2, s_B^2, s_C^2$ be the usual sample variances of $Y$ for individuals in each category of $X$. Show that the error sum of squares can be written as
$$\mathrm{SSE} = (n_A - 1)s_A^2 + (n_B - 1)s_B^2 + (n_C - 1)s_C^2$$

Q. 6 (26 pts) We will use the same dataset as in Question 3, "Infections.csv", for this question. The variables used are described below:

• Infections (Y): the average estimated probability of acquiring an infection in the hospital, in percent; higher is worse
• Beds: the average number of beds in the hospital during the study period
• Region: geographic region (NE = Northeast, NC = North Central, S = South, W = West)

(a) (8 pts) Write down the full model with the interaction terms. Fit the full model in R. Compute the estimated regression function for each geographic region and plot them.

(b) (4 pts) Test whether the slopes relating the average number of beds to infections are the same for each geographic region at the $\alpha = 0.05$ significance level.

(c) (2 pts) What model would you choose for these data? Justify your answer.

(d) (6 pts) For the model you chose in (c), check and comment on the standard assumptions of the regression model.

(e) (6 pts) Look for a transformation of Y and/or X (= Beds). Fit the regression with the transformed variable(s) without interaction and comment on whether this model fits better.
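
As a numerical sanity check on the identities stated in Q. 5 (not a substitute for the requested proofs), the short Python sketch below builds the indicator design matrix for a small made-up data set, solves the normal equations exactly, and compares the result with the group means and the pooled within-group sum of squares. The toy values in `groups` and the helper names `solve`, `mean` and `sample_var` are assumptions introduced only for this illustration; they are not part of the assignment or its data files.

```python
from fractions import Fraction

# Hypothetical toy responses for each level of X (NOT from the assignment).
groups = {
    "A": [4, 6, 8],
    "B": [1, 3],
    "C": [10, 12, 14, 16],
}

# Design rows (1, I1, I2) and the response vector, in exact rational arithmetic.
rows, y = [], []
for level, values in groups.items():
    for v in values:
        rows.append([Fraction(1),
                     Fraction(int(level == "A")),
                     Fraction(int(level == "B"))])
        y.append(Fraction(v))

p = 3
# X'X and X'Y computed entry by entry.
xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row, then eliminate the column from the other rows.
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def mean(vals):
    return Fraction(sum(vals), len(vals))


def sample_var(vals):
    """Usual sample variance with divisor (n - 1)."""
    centre = mean(vals)
    return sum((v - centre) ** 2 for v in vals) / (len(vals) - 1)


# Least squares estimates from the normal equations (X'X) beta = X'Y.
beta = solve(xtx, xty)

# Error sum of squares from the fitted values, and the pooled form from Q. 5(c).
sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
          for r, yi in zip(rows, y))
pooled = sum((len(v) - 1) * sample_var(v) for v in groups.values())

# The estimates match (mean of C, mean of A - mean of C, mean of B - mean of C).
assert beta == [mean(groups["C"]),
                mean(groups["A"]) - mean(groups["C"]),
                mean(groups["B"]) - mean(groups["C"])]
assert sse == pooled

print("beta_hat:", [str(b) for b in beta])
print("SSE:", sse, "pooled:", pooled)
```

Exact fractions are used instead of floating point so that the equalities can be checked with `==` rather than a tolerance. A check like this only confirms the identities for one data set; the algebraic argument asked for in Q. 5 is still needed to establish them in general.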