School of Mathematics and Statistics

FIRST SEMESTER EXAMINATIONS

STAT2401 ANALYSIS OF EXPERIMENTS

FAMILY NAME:
STUDENT ID:
GIVEN NAMES:
SIGNATURE:

This Paper contains: 6 pages (including title page)
Time allowed: 2 hours and 45 minutes

INSTRUCTIONS:
• This is version 0. This is an open book exam.
• The marks for each question are indicated in the questions, for a total of 75 marks available.
• This examination requires you to use the statistics package R or RStudio.
• You should answer the questions in the Electronic Answer Sheet (Available Online). You will not gain any marks if the answers are written anywhere else. The submitted Answer Sheet should be in PDF format. Photos or any other formats will NOT be accepted. Make sure you are NOT submitting a blank Answer Sheet. Use SAVE or PRINT to create a new PDF file that contains your answers.
• The submission of your answer sheet should be done via LMS over the Final Exam Upload Point under the exam folder. Please name your PDF file “your student number [your name].pdf”.
• Late submissions will not be marked.
• There are 10 versions numbered 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Please take the version that is identical to the last digit of your student number.
• The data is available in LMS; please download the corresponding version.
• All non-integer numerical answers should be given to 4 decimal places. Failure to follow this will result in a mark of zero.
• When using R or RStudio it is recommended that you write down answers as soon as you have obtained the necessary output. In this way you should lose little of importance in the unlikely event of a computer failure.
• You must show your working in order to obtain full marks.

Semester 1 Examinations June 2020    STAT2401

1. Does pollution kill people? Data in one early study designed to explore this issue came from five Standard Metropolitan Statistical Areas (SMSA) in the United States, obtained for the years 1959–1961.
Total age-adjusted mortality (Mortality) from all causes, in deaths per 100,000 population, is the response variable. The 15 explanatory variables for each of 60 cities are
  (1) Precip: mean annual precipitation (in inches);
  (2) Humidity: percent relative humidity (annual average at 1 P.M.);
  (3) JanTemp: mean January temperature (in degrees Fahrenheit);
  (4) JulyTemp: mean July temperature (in degrees Fahrenheit);
  (5) Over65: percentage of the population aged 65 years or over;
  (6) House: population per household;
  (7) Educ: median number of school years completed by persons of age 25 years or more;
  (8) Sound: percentage of the housing that is sound with all facilities;
  (9) Density: population density (in persons per square mile of urbanized area);
  (10) NonWhite: percentage of 1960 population that is nonwhite;
  (11) WhiteCol: percentage of employment in white-collar occupations;
  (12) Poor: percentage of households with annual income under $3,000 in 1960;
  (13) HC: relative pollution potential of hydrocarbons (HC);
  (14) NOX: relative pollution potential of oxides of nitrogen (NOX); and
  (15) SO2: relative pollution potential of sulphur dioxide (SO2).
It is desired to determine whether the pollution variables (13, 14, and 15) are associated with mortality, after the other climate and socioeconomic variables are accounted for.

Save the data in “your working directory”, and read in the data by

  setwd("your working directory")
  Pollution = read.csv(file="Pollution-Version-0.csv",header=TRUE)

(a) Describe the process of backward variable selection, implemented using F-test and p-value approach, for a multiple linear regression model. [5 marks]

(b) Fit a linear model with response Mortality, including only the first 12 explanatory variables, to the data. Report your R code (NOT the R-output) and the fitted model. [4 marks]

(c) Starting with the fitted model in part (b), perform backward variable selection using F-test/p-value approach to select a model.
Report your R code (NOT the R-output) and the fitted model that is finally selected. [4 marks]

(d) Starting with the final model in part (c), perform forward variable selection using F-test/p-value approach to select among the last 3 explanatory “pollution” variables (13, 14, and 15). Report your R code (NOT the R-output) and the fitted model that is finally selected. [4 marks]

(e) Starting with the NULL model, perform forward variable selection using F-test/p-value approach to select a model including only the first 12 explanatory variables. Report your R code (NOT the R-output) and the fitted model that is finally selected. [4 marks]

(f) Starting with the final model in part (e), perform forward variable selection using F-test/p-value approach to select among the last 3 explanatory “pollution” variables (13, 14, and 15). Report your R code (NOT the R-output) and the fitted model that is finally selected. [4 marks]

(g) State the common explanatory variables of the final models found in parts (d) and (f). [4 marks]

2. This question concerns data from an observational study on the selective mechanisms of evolution. An interesting variable in this respect is brain size. One might expect that bigger brains are better, but certain penalties seem to be associated with large brains, such as the need for longer pregnancies and fewer offspring. Although the individual members of the large-brained species may have more chance of surviving, the benefits for the species must be good enough to compensate for these penalties. To shed some light on this issue, it is helpful to determine exactly which characteristics are associated with large brains, after getting the effect of body size out of the way.
The data Brain contains the variables: natural logarithm of the average values of brain weight (logBrain, response), body weight (logBody), and 4 different levels of natural logarithm of gestation lengths (loggestation) for 96 species of mammals.

Save the data in “your working directory”, and read in the data by

  setwd("your working directory")
  load(file="Brain-Version-0.RData")

(a) Fit the following models in order to explain the response variable logBrain (natural logarithm of Brain weight) based on the information of logBody (natural logarithm of body size):
  • M1, a simple linear regression for all observations (i.e. intercept and slope not dependent on different levels of natural logarithm of gestation lengths (logGestation)).
  • M2, parallel regressions for observations from each level of natural logarithm of gestation lengths (i.e. regressions have the same slope but the intercept varies for the different levels of natural logarithm of gestation lengths (logGestation)).
  • M3, separate regressions for observations from each level of natural logarithm of gestation lengths (i.e. regressions have intercepts and slopes that vary for the different levels of natural logarithm of gestation lengths (logGestation)).
Report the R code (NOT the R-output) that you used to fit these models. [3 marks]

(b) Use F tests to select the most appropriate model from M1, M2, and M3, working at a 5% significance level. Explain your reasoning clearly, and include the p-values that you obtain for your tests. Also report your R code (NOT the R-output). [6 marks]

(c) For your preferred model, report the fitted models for all levels of natural logarithm of gestation lengths. [4 marks]

3. This question comes from Ramsey and Schafer, The Statistical Sleuth, Second Edition, Chapter 7. Immediately after slaughter the pH in postmortem muscle of a steer carcass is around 7.0–7.2.
For a certain kind of meat processing to take place it is necessary for pH to decrease to 6.0, so an estimate is needed of the time after slaughter at which the pH reaches 6.0. To do so, a number of steer carcasses had their postmortem pH level measured immediately after slaughter and then again at one of 5 times after slaughter. Time is measured in hours.

Save the data in “your working directory”, and read in the data by

  setwd("your working directory")
  meat = read.csv(file="meat-Version-0.csv",header=TRUE)

(a) Run a simple linear regression model of log(pH) on log(hour). Report your R code (NOT the R-output) and the fitted model. [3 marks]

(b) Test whether the time log(hour) after slaughter is a statistically significant predictor of postmortem log(pH) levels. Report your answer and the p-value for this test. [5 marks]

(c) Using the model fitted in (a), find the estimated mean pH at 4 hours and its confidence interval. Report also your R code (NOT the R-output). [4 marks]

(d) Using the model fitted in (a), find the predicted pH at 4 hours and its prediction interval. Report also your R code (NOT the R-output). [4 marks]

(e) Using the model fitted in (a), determine how long after slaughter you would expect the mean pH level to be 6.0. Report also your R code (NOT the R-output). [3 marks]

4. The human brain is protected from bacteria and toxins, which course through the bloodstream, by a single layer of cells called the blood–brain barrier. This barrier normally allows only a few substances, including some medications, to reach the brain. Because chemicals used to treat brain cancer have such large molecular size, they cannot pass through the barrier to attack tumor cells. At the Oregon Health Sciences University, Dr. Neuwelt developed a method of disrupting the barrier by infusing a solution of concentrated sugars.
As a test of the disruption mechanism, researchers conducted a study on rats, which possess a similar barrier. The rats were inoculated with human lung cancer cells to induce brain tumors. After 9 to 11 days they were infused with either the barrier disruption (BD) solution or, as a control, a normal saline (NS) solution. Fifteen minutes later, the rats received a standard dose of the therapeutic antibody L6-F(ab’)2. After a set time they were sacrificed, and the amounts of antibody in the brain tumor and in normal tissue were measured. The time line for the experiment is as follows.

[Figure: time line for the experiment]

Measurements for the 34 rats are:
  Brain: Brain tumor count (per gm);
  Liver: Liver tumor count (per gm);
  Time: Sacrifice time (hours);
  Treatment: Treatment;
  Days: Days post inoculation;
  Sex: Sex;
  Weight: Initial weight (grams);
  Loss: Weight loss (grams);
  Tumor: Tumor weight (10^-4 grams).

The response of interest is taken to be the natural logarithm of antibody concentration ratio (Brain tumor-to-Liver tumor), that is

  logBLRatio = log(Brain/Liver).

Save the data in “your working directory”, and read in the data by

  setwd("your working directory")
  BBB = read.csv(file="BBB-Version-0.csv",header=TRUE)

(a) Fit a multiple linear regression model for logBLRatio based on all other variables, Time, Treatment, Days, Sex, Weight, Loss, and Tumor. Do you see evidence for significance of regression? Report also your R code. [4 marks]

(b) Are there outliers and high leverage points? Use the answer to determine whether there are influential points (known as ‘bad’ leverage points). If yes, state them. Report also your R code. [6 marks]

(c) Calculate the Cook’s distances and determine the influential points (Cook’s distance greater than 1). Are these influential points different from those in part (b)? Report also your R code. [4 marks]
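
As a hedged illustration of the diagnostics referred to in Question 4(b) and (c), the sketch below flags outliers, high leverage points, and large Cook's distances for a fitted linear model. It uses R's built-in mtcars data, not the exam data, and the model formula is arbitrary. The cut-offs for outliers (absolute standardized residual above 2) and high leverage (above 2p/n) are common rules of thumb assumed here, not values prescribed by this paper; only the Cook's distance threshold of 1 comes from part (c).

```r
# Illustrative only: built-in mtcars data, NOT the exam's BBB data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nrow(mtcars)        # number of observations
p <- length(coef(fit))   # number of estimated coefficients (incl. intercept)

std_res  <- rstandard(fit)        # internally standardized residuals
leverage <- hatvalues(fit)        # diagonal entries of the hat matrix
cooks_d  <- cooks.distance(fit)   # Cook's distance for each observation

# Rule-of-thumb cut-offs for outliers and leverage are assumptions.
outlier       <- abs(std_res) > 2
high_leverage <- leverage > 2 * p / n
bad_leverage  <- outlier & high_leverage   # outlying AND high leverage
influential   <- cooks_d > 1               # threshold stated in part (c)

diagnostics <- data.frame(std_res, leverage, cooks_d,
                          outlier, high_leverage, bad_leverage, influential)

# Show only the observations flagged by at least one criterion.
print(diagnostics[outlier | high_leverage | influential, ])
```

Comparing the bad_leverage and influential columns shows whether the two criteria single out the same observations, which is the comparison part (c) asks about.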