STAT331: Assignment 1
Due: Friday, June 4, 2021 at 5pm EST on Crowdmark

General instructions:
• Your work may be written up using R Markdown, LaTeX, or Word. If you hand-write your solutions, make sure they are legible. No points will be given if the grader cannot read your handwriting.
• You may discuss problems with your peers, but you must write up your own answers, and include names of anyone you worked with on your assignment.
• For data analysis problems: You must clearly present your final answers in addition to the steps or commands for obtaining your answers. You must include well-commented R code (and only necessary code) to reproduce your work.

1. [Theory] Suppose we observe a sample of $n$ outcomes $y_i$ and covariates $x_i$, and assume the usual simple linear regression model:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$ for $i = 1, 2, \ldots, n$

and we want to compute the usual least squares (LS) estimators $(\hat{\beta}_0, \hat{\beta}_1)$ along with corresponding 95% confidence intervals as we did in class.
(a) If the equal variance (i.e. homoskedasticity) assumption does not hold: are our LS estimators still unbiased? Explain.
(b) If the equal variance (i.e. homoskedasticity) assumption does not hold: are our confidence intervals still valid? Explain.
(c) If the independence assumption does not hold: are our LS estimators still unbiased? Explain.
(d) If the independence assumption does not hold: are our confidence intervals still valid? Explain.
(e) If the normality assumption does not hold: are our LS estimators still unbiased? Explain.

2. [Theory] Consider fitting Model A: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with the usual assumptions.
(a) Suppose we conduct a hypothesis test of $H_0: \beta_1 = 0$ against the two-sided alternative $H_1: \beta_1 \neq 0$ as we did in class. If we reject the null hypothesis at the 0.05 level (meaning the p-value is less than 0.05), what can be said about the corresponding 95% confidence interval for $\beta_1$? Explain.
Suppose now we fit a second Model B: $y_i = \beta_0^B + \beta_1^B x_i^* + \epsilon_i^B$ using a new (standardized) variable $x_i^* = (x_i - \bar{x})/s_x$, where $s_x$ is the sample standard deviation of $x_i$. (Note the $B$ is a label, not an exponent.)
(b) How do we interpret $\beta_1^B$?
(c) How are estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ related to $\hat{\beta}_0^B$ and $\hat{\beta}_1^B$?
(d) Consider a test of the null hypothesis $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. How is the corresponding p-value related to the p-value for a test of the null hypothesis $H_0: \beta_1^B = 0$ (against $H_1: \beta_1^B \neq 0$)?

3. [Data analysis] The dataset berries.csv for this problem is posted on Learn, and comes from a study (Journal of Texture Studies, 44, 95-103) on the properties of fruits. The variables are:
• sugar: Sugar content (g/L)
• chewiness: Chewiness (mJ)
and there are 90 berries in the dataset. Suppose we are interested in predicting a berry's chewiness ($Y$) from its sugar content ($X$).
(a) Show a scatterplot of the data. Include labels for the axes and the plot (e.g., use the xlab, ylab, main arguments in R's plot function).
(b) Assume we use simple linear regression to model the relationship between $X$ and $Y$. Compute the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ (e.g., using R's lm function), and give interpretations of those estimates (as appropriate). Add the fitted line to your scatterplot (e.g., using R's abline function, which can take your fitted regression model as input).
(c) Formally test the hypothesis $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$, e.g., by using the output of R's lm function. Include the t-statistic and also the p-value. Write a sentence summarizing your conclusion at the $\alpha = 0.01$ level; i.e., do chewiness and sugar have a statistically significant linear relationship?
(d) Predict the chewiness of a berry with 110 g/L sugar. Provide an estimate and a 95% prediction interval.
(e) Compute a 95% confidence interval for the mean chewiness for berries with 110 g/L sugar.

4. [Simulation] In this problem you will get some practice coding with R, and numerically investigate the probability distributions we derived for $\hat{\beta}_0$ and $\hat{\beta}_1$ in simple linear regression. The 'true' model we will assume for this problem is

$Y_i \overset{indep}{\sim} N(1 + b x_i, \sigma^2),$ for $i = 1, \ldots, 10$

where the 10 known values of $x_i$ are 1, 1, 1, 1, 3, 4, 5, 5, 6, 7.
Important: Since this question involves simulation, your first line of R code for this problem must set the random seed with this command:

set.seed(123)

where you replace 123 with your student number.
(a) Write an R function that takes two arguments (corresponding to $b$ and $\sigma$) and conducts the following simulation:
(i) Simulate a set of data values $y_1, \ldots, y_{10}$ according to the model. [To generate a random value for $Y_1$, use R's rnorm function.]
(ii) Fit the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ (i.e. regress $y$ on $x$) using the simulated outcomes from (i).
(iii) Compute a p-value for a hypothesis test of $H_0: \beta_1 = 0$ against the two-sided alternative $H_A: \beta_1 \neq 0$.
(iv) Repeat (i)–(iii) 2000 times, saving your results (p-value) each time.
(b) Using your R function, conduct a simulation under $b = 0.25$ and $\sigma = 1$ as described in part (a). Plot a histogram of the p-values across the 2000 datasets. In what proportion of samples would you reject the null hypothesis at the 5% level?
(c) Repeat part (b) with $b = 0$.
(d) Repeat (b)–(c) with $\sigma = 0.8$. Explain your findings.
(e) What would happen if we increased the sample size?

Remember to include your R code in your submissions for Questions 3 and 4!
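
For orientation, steps (i)–(iv) of Question 4(a) could be organized along the following lines. This is only a minimal sketch under stated assumptions: the names simulate_pvalues, one_replicate, n_sim and p_values are illustrative and do not come from the handout, and other structures (for example an explicit for loop) would serve equally well.

```r
# Sketch only: function and variable names here are illustrative assumptions.
set.seed(123)  # replace 123 with your student number, as the handout requires

# The 10 known covariate values from the problem statement
x <- c(1, 1, 1, 1, 3, 4, 5, 5, 6, 7)

simulate_pvalues <- function(b, sigma, n_sim = 2000) {
  # One replicate: simulate outcomes, fit the regression, return the slope p-value
  one_replicate <- function() {
    y <- rnorm(length(x), mean = 1 + b * x, sd = sigma)  # step (i)
    fit <- lm(y ~ x)                                     # step (ii)
    summary(fit)$coefficients["x", "Pr(>|t|)"]           # step (iii)
  }
  replicate(n_sim, one_replicate())                      # step (iv)
}

p_values <- simulate_pvalues(b = 0.25, sigma = 1)
mean(p_values < 0.05)  # proportion of datasets rejecting H0 at the 5% level
```

Here the p-value is read from the coefficient table returned by summary() on the lm fit, which is the two-sided t-test of a zero slope. Passing the returned vector to hist(), with xlab and main set, would give the histogram asked for in part (b).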