STATS 330 Revision 1. The selling prices (in British pounds) at auction of 32 antique grandfather clocks were recorded. The age of each clock (in years) and the number of people who participated in the bidding were also recorded. Age Bidders Price Age Bidders Price Age Bidders Price 127 13 1235 115 12 1080 127 7 845 150 9 1522 156 6 1047 182 11 1979 156 12 1822 132 10 1253 137 9 1297 113 9 946 137 15 1713 117 11 1024 137 8 1147 153 6 1092 117 13 1152 126 10 1336 170 14 2131 182 8 1550 162 11 1884 184 10 2041 143 6 854 159 9 1483 108 14 1055 175 8 1545 108 6 729 179 9 1792 111 15 1175 187 8 1593 111 7 785 115 7 744 194 5 1356 168 7 1262 (a) A conditional plot of Price versus Age given the level of Bidders was created using coplot. l l l l l l l l l 80 0 12 00 16 00 20 00 l l ll l l l l 120 140 160 180 l l l l l l l l l l l l l l l l l l l l 120 140 160 180 l l l l l l l l l l l l l 120 140 160 180 80 0 12 00 16 00 20 00 Age Pr ic e 6 8 10 12 14 Given : Bidders This plot suggests that Age and Bidders interact. i. What is meant by the statement “Age and Bidders interact”. ii. Explain what feature of the conditional plot indicates that there is an interaction between Age and Bidders (your explanation should clearly identify how the plot would be different if the two variables did not interact). [5 marks] (b) The output for the regression model containing the Age:Bidders interaction is: Estimate Std. Error t value Pr(>|t|) (Intercept) 320.4580 295.1413 1.086 0.28684 Age 0.8781 2.0322 0.432 0.66896 Bidders -93.2648 29.8916 -3.120 0.00416 ** Age:Bidders 1.2978 0.2123 6.112 1.35e-06 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 88.91 on 28 degrees of freedom Multiple R-squared: 0.9539,Adjusted R-squared: 0.9489 F-statistic: 193 on 3 and 28 DF, p-value: < 2.2e-16 i. Consider the following line of this output. F-statistic: 193 on 3 and 28 DF, p-value: < 2.2e-16 What does indicate about the fitted model? ii. Consider the following statement: As the fitted coefficient for Bidders is negative, this model indicates that E(Price) decreases as the number of bidders increases. Do you agree with this statement? Explain your answer. [5 marks] (c) Suppose that a seller has a clock that is 150 years old. i. Based on our fitted model, write down the relationship between E(Price) and Bidders for this clock. ii. How does this relationship change as the value of Age is increased? [5 marks] 2. The Stat2Data package in R contains a data set that records the prices of horses advertised for sale on the internet. The data was collected by a group of students from California Polytechnic State University. The information they collected for each horse included: Price – Price in US dollars. Age – Age of the horse in years. Height – Height of the horse in hands. Sex – f=female and m=male. This data was used to create a data frame horse.df in R. > head(horse.df) Price Age Height Sex 1 38000 3 16.75 m 2 40000 5 17.00 m 3 12000 8 16.00 f 4 25000 4 16.25 m 5 35000 8 16.25 f 6 35000 5 16.50 m > summary(horse.df) Price Age Height Sex Min. : 1100 Min. : 0.500 Min. :14.25 f:18 1st Qu.:15750 1st Qu.: 5.000 1st Qu.:16.00 m:29 Median :25000 Median : 7.000 Median :16.50 Mean :27957 Mean : 7.489 Mean :16.33 3rd Qu.:40000 3rd Qu.: 8.500 3rd Qu.:16.75 Max. :60000 Max. :20.000 Max. :17.25 The following regression model was fitted using this data: > horse.lm=lm(log(Price)~Age*Sex+Height,data=horse.df) > summary(horse.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.19038 2.14289 1.955 0.057201 . Age -0.14882 0.02116 -7.035 1.3e-08 *** Sexm -0.53174 0.29584 -1.797 0.079464 . Height 0.41017 0.13818 2.968 0.004926 ** Age:Sexm 0.13585 0.03253 4.176 0.000146 *** Residual standard error: 0.4666 on 42 degrees of freedom Multiple R-squared: 0.6884, Adjusted R-squared: 0.6587 F-statistic: 23.2 on 4 and 42 DF, p-value: 3.582e-10 (a) The fitted model uses log(Price) as the response. The assumptions for the ordinary regression model are linearity, independence, constant variance and Normality. Which of these assumptions are affected by using log(Price) rather than Price as the response? 2 marks (b) For the fitted model, Sex is a factor and its indicator variable is set up using f as the baseline level. An explanatory variable Age:Sexm which represents the interaction be- tween Age and Sex has also been set up. For the first six observations of horse.df (given on the previous page) write down the values of Sexm and Age:Sexm. 2 marks (c) The output for the anova command is: > anova(horse.lm) Response: log(Price) Df Sum Sq Mean Sq F value Pr(>F) Age 1 7.1303 7.1303 32.745 9.955e-07 *** Sex 1 6.9854 6.9854 32.080 1.207e-06 *** Height 1 2.2938 2.2938 10.534 0.0023039 ** Age:Sex 1 3.7970 3.7970 17.437 0.0001465 *** Residuals 42 9.1456 0.2178 The p-value for Sex in this table is much smaller than the p-value for Sexm in the summary output on the previous page. However the p-value for Age:Sex in this table and that for Age:Sexm in the summary output are the same. Explain why this occurs. 3 marks (d) Use the fitted model to compare the impact Age, Sex and Height have on Price. 5 marks 3. An advertisement for diamonds was included in the February 18, 2000 edition of The Business Times. This advertisement gave the following characteristics for 308 diamonds: price Price in Singapore dollars. carats Size measured in carats. colour Colour rating on the GIA colour scale which goes from D (colourless) through Z (light colour). The diamonds listed all had ratings of D, E, F, G, H or I. clarity Clarity rating on the GIA clarity scale. The diamonds listed all had ratings of IF (internally flawless), VVS1, VVS2, VS1 or VS2. VS1 and VS2 indicate very slightly flawed diamonds whereas VVS1 and VVS2 indicate very very slightly flawed diamonds. For VS1 and VVS1 the flaw is only visible from the pavilion (bottom) and for VS2 and VVS2 the flaw is only visible from the crown (top). cert The certification agency: Gemmological Institute of America (GIA), International Gem- mological Institute (IGI) or Hoge Raad Voor Diamant (HRD). This data was read into a data frame named diamond.df in R and the following output obtained: > summary(diamond.df) carats colour clarity cert price Min. :0.1800 D:16 IF :44 GIA:151 Min. : 638 1st Qu.:0.3500 E:44 VS1 :81 HRD: 79 1st Qu.: 1625 Median :0.6200 F:82 VS2 :53 IGI: 78 Median : 4215 Mean :0.6309 G:65 VVS1:52 Mean : 5019 3rd Qu.:0.8500 H:61 VVS2:78 3rd Qu.: 7446 Max. :1.1000 I:40 Max. :16008 (a) The diamond data was used to construct a regression model that relates the price of a diamond to the remaining characteristics. The following set of diagnostic plots was obtained for the linear model: > diamond1.lm<-lm(price~carats+colour+clarity+cert,data=diamond.df) > plot(diamond1.lm) 0 2000 6000 10000 − 20 00 0 20 00 40 00 Fitted values R es id ua ls l l l l l l l l l ll l l l l l l ll ll lll l l ll l lll l l l ll ll ll llll l l ll ll l l ll l lll l ll l ll l ll l l l l l l l l l llll l ll l l l ll ll ll l l l l l l l l l l l l l l l l l l l l l l l ll l ll l l l lll l ll ll l ll l l ll l ll ll ll l l l lll ll ll l l l l ll l l l l l l l Residuals vs Fitted 131 116279 l l l l l ll l l l l l l l l l l l l l l l l l l l l lllllll l lll ll l l l l l l l l ll l l l l l l l l l l −3 −2 −1 0 1 2 3 − 2 0 2 4 6 Theoretical Quantiles St an da rd ize d re sid ua ls Normal Q−Q 131 116279 0 2000 6000 10000 0. 0 0. 5 1. 0 1. 5 2. 0 Fitted values St a n da rd iz e d re si du a ls l ll l l l ll l l l l l l l l l l l l l l l l ll l l l l l l l l l l l ll lll ll l l l l l l l l ll l l ll l l ll l l l l l ll l ll l l l l l ll l l l l ll l ll l ll l l l l l l l l l l ll ll l l l l l l l ll l l l l ll l l l l l l l l l l l l l l l l l l l l ll ll l l l l l l ll l l l l l l l l l l ll l l l l ll l ll l l l ll lll lll ll l l l l l ll l l l l l l l l l ll l l l l l l l l l l ll l l l l l l l l Scale−Location 131 116279 0.00 0.02 0.04 0.06 0.08 0.10 − 2 0 2 4 6 Leverage St an da rd ize d re sid ua ls l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l ll lll l l ll l l l ll l l l ll l ll l lll l ll l l l l l l ll l l l l l l l l l ll ll l l l l ll l l l l l l l l l l l l l l l l ll l l l l lll ll l llll l l l l lll ll l l ll l ll ll l l l l lll lll l ll l l l l l lll ll Cook's distance Residuals vs Leverage 131 116 120 Based on these plots discuss how well this model fits the data. Give reasons for any conclusions you reach. (5 marks) (b) It was decided to modify the regression model by using √ price as the response. Consider the following output: > diamond2.lm<-lm(price^.5~carats+colour+clarity+cert,data=diamond.df) > new.df=data.frame(carats=1, colour = "F", clarity="VS2", cert="HRD") > predict(diamond2.lm,new.df,interval = "confidence") fit lwr upr 1 98.00119 97.06094 98.94144 > predict(diamond2.lm,new.df,interval = "prediction") fit lwr upr 1 98.00119 93.27767 102.7247 i. Use this output to find a 95% confidence interval and a 95% prediction interval for the price of a 1 carat diamond certified by HRD that has colour “F” and clarity “VS2”. ii. In the context of this example, clearly explain why the prediction interval is wider than the confidence interval. (5 marks) (c) The factor cert was included in the model since it was thought that the reputation of the certification agency (GIA, IGI or HRD) may have an impact on price. Consider using the anova function to evaluate the relevance of cert in the model using an F-test: > diamond2.lm<-lm(price^.5~carats+colour+clarity+cert, data=diamond.df) > anova(diamond2.lm) Analysis of Variance Table Df Sum Sq Mean Sq F value Pr(>F) carats 1 177237 177237 32036.7488 < 2.2e-16 *** colour 5 5227 1045 188.9561 < 2.2e-16 *** clarity 4 2950 737 133.3014 < 2.2e-16 *** cert 2 107 54 9.6783 8.49e-05 *** Residuals 295 1632 6 If the regressors are ordered differently in the model, the value of the F-statistic for cert changes from 9.6783 to 5617.27 in the ANOVA table. > diamond2A.lm<-lm(price^.5~cert+carats+colour+clarity, data=diamond.df) > anova(diamond2A.lm) Analysis of Variance Table Df Sum Sq Mean Sq F value Pr(>F) cert 2 62153 31076 5617.27 < 2.2e-16 *** carats 1 115217 115217 20826.30 < 2.2e-16 *** colour 5 5356 1071 193.62 < 2.2e-16 *** clarity 4 2795 699 126.28 < 2.2e-16 *** Residuals 295 1632 6 Both tests result in very small p-values but they are testing different hypotheses. i. Carefully explain the difference between the tests for cert in these two tables. ii. Which test is more relevant in investigating whether the reputation of the certifica- tion agency has an impact on price? Explain your answer. (5 marks) (d) The output from summary for the √ price model used in parts (b) and (c) is: > summary(diamond2A.lm) Call: lm(formula = price^0.5 ~ cert + carats + colour + clarity, data = diamond.df) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 25.62462 0.84435 30.348 < 2e-16 *** certHRD 0.03969 0.35509 0.112 0.911 certIGI -1.80848 0.42466 -4.259 2.77e-05 *** carats 91.47532 0.62917 145.391 < 2e-16 *** colourE -5.52416 0.68862 -8.022 2.47e-14 *** colourF -8.17357 0.64641 -12.645 < 2e-16 *** colourG -10.71430 0.66350 -16.148 < 2e-16 *** colourH -14.12727 0.67184 -21.028 < 2e-16 *** colourI -18.23773 0.70429 -25.895 < 2e-16 *** clarityVS1 -8.45895 0.52868 -16.000 < 2e-16 *** clarityVS2 -10.96488 0.56679 -19.346 < 2e-16 *** clarityVVS1 -2.69994 0.52951 -5.099 6.12e-07 *** clarityVVS2 -5.95314 0.49254 -12.087 < 2e-16 *** Consider the coefficients for the four indicator (dummy) variables used for the factor clarity. i. What does the estimated coefficient for clarityVS1 (-8.45895) represent in terms of the levels of clarity? ii. What is the estimated difference in the expected √ price for very very slightly flawed diamonds where the flaw is only visible from the pavilion (VVS1) and where the flaw is only visible from the crown (VVS2)? iii. Suppose we want to formally test (i.e. find a p-value) the hypothesis that there is no difference in the average price of VVS1 and VVS2 diamonds (given that all other characteristics are the same). Suggest a simple method of adjusting the regression model so that the required p-value is calculated for us. (5 marks) 4. Sahoo and H.S. Pandalai (Natural Resources Research, vol. 8, 1999) used logistic regression to predict the probability of a gold deposit being present based the concentrations of Arsenic (As) and Antimony (Sb) in ground water samples. Ground water samples were taken from a total of 64 sites – of these, 28 had a gold deposit present. The concentrations of As and Sb in the samples were measured as parts per million (ppm). (a) The fitted model logistic for this data is: > gold.glm<-glm(gold~As+Sb,family=binomial,data=gold.df) > summary(gold.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.9664 1.3675 -3.632 0.000281 *** As 1.2490 0.3777 3.307 0.000943 *** Sb 0.9235 0.4486 2.059 0.039518 * --- Null deviance: 87.720 on 63 degrees of freedom Residual deviance: 18.306 on 61 degrees of freedom AIC: 24.306 i. Let pi represent the probability that a gold deposit is present. Write down the logistic form of the fitted model (i.e. pi = . . .). ii. Find a 95% confidence interval for the coefficient for As. iii. Interpret the coefficient for As in terms of the impact of the concentration of As on the odds of a gold deposit being present. (5 marks) (b) The following output was obtained using the anova function. > anova(gold.glm,test="Chisq") Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 63 87.720 As 1 65.117 62 22.603 7.057e-16 *** Sb 1 4.297 61 18.306 0.03818 * i. What hypothesis is being tested by the line As 1 65.117 62 22.603 7.057e-16 *** in this table? What do you conclude from this test? ii. The output for summary(gold.glm) also provides a test for As. Estimate Std. Error z value Pr(>|z|) As 1.2490 0.3777 3.307 0.000943 *** How does this test differ from the test in (i)? (5 marks) (c) The following diagnostic plots were obtained for the fitted model. 0 20 40 60 0. 0 0. 1 0. 2 0. 3 0. 4 Hat Matrix Diagonals Index H M D 12 345 6 78 9 10 11 12131415 16 171819202122 23 4 25 2627282930 313233345 36 3738 39 4041 42 43 44 45 46 47 4849 50 51 52 535455 56575859606162 63 64 0 20 40 60 0. 0 0. 1 0. 2 0. 3 0. 4 Cook's Distances Index Co ok 's di st an ce 12345678 9 101112341516171819202122234252627282930313233 34 563738 39 40414243444546 47 48495051 52 5354555678596061626364 0 20 40 60 0 2 4 6 8 Deviance Changes Index D ev ia nc e ch an ge 12345678 9 101112341516171819202122234 25 2627282930313233 34 5363738 39 40 4142 43 44 4546 47 48495051 52 535455567859606162 63 64 i. Which points are identified as being usual by these plots? For each of these points describe what makes it unusual. Note that using HMD and deviance changes as diagnostics were not covered this semester so you can ignore those two plots and just look at the Cook’s distance plot. ii. Briefly describe how you would further investigate the unusual points you identified. (5 marks) (d) Suppose the fitted model is to be used to predict the presence of gold deposits for other sites (it is not known whether gold is present at these sites). A gold deposit is predicted to be present if the estimated probability is more than one half and is otherwise predicted to be absent. The following table summarizes the performance of this prediction procedure when it is applied to the 64 observations in the data set. Prediction Actual gold present gold absent gold present 26 2 gold absent 1 35 i. Based on this table, estimate the sensitivity and the specificity for this prediction model. Explain why these estimates are too optimistic. ii. Suggest a different method for estimating the sensitivity and specificity that still uses the original data but should give more realistic estimates. (5 marks) 5. The data for this question come from an ecological study conducted by Dr Rick Linhurst (1979, PhD thesis, North Carolina State). The study focussed on the relationship between the biomass (weight per unit area) of spartina (a type of grass found in coastal salt marshes) and the properties of the soil at different sites in the Cape Fear estuary of North Carolina. Data was collected from three different locations: Oak Island (OI), Smith Island (SI) and Snow’s Marsh (SM). At each location, there were areas where the spartina vegetation was short (SHRT), other areas where it was tall (TALL) and still other areas where it had died and re-vegetated (DVEG). For each combination of location and vegetation type, five sites were selected and the biomass of spartina was measured. Consider the following conditional plot for this data. l l l l l l l ll l l l l l l 50 0 10 00 15 00 20 00 l ll l l lll l l l l l l DVEG SHRT TALL l l l l l l l l l l l l l DVEG SHRT TALL 50 0 10 00 15 00 20 00 Type of Vegetation Bi om as s of s pa rti na OI SI SM Given: Location (a) Consider creating a regression model that relates biomass (BIO) to location (Loc) and vegetation type (Type). The plot on the previous page indicates that it may be appro- priate to i. use log(BIO) as the response and ii. include the Loc:Type interaction in the model. Explain how the plot indicates that each of these actions may be appropriate. [4 marks] (b) Consider a regression model that uses the logged biomass (BIO) as the response and vegetation type (Type) and location (Loc) as regressors. > model1.lm<-lm(log(BIO) ~ Loc*Type,data=spartina.df) > summary(model1.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.6805 0.1466 45.575 < 2e-16 *** LocSI -0.8659 0.2073 -4.177 0.000179 *** LocSM 0.6189 0.2073 2.985 0.005066 ** TypeSHRT -0.3688 0.2073 -1.779 0.083668 . TypeTALL 0.4512 0.2073 2.176 0.036171 * LocSI:TypeSHRT 0.2311 0.2932 0.788 0.435751 LocSM:TypeSHRT -0.7450 0.2932 -2.541 0.015496 * LocSI:TypeTALL 1.0486 0.2932 3.577 0.001015 ** LocSM:TypeTALL -0.4358 0.2932 -1.486 0.145882 --- Residual standard error: 0.3278 on 36 degrees of freedom Multiple R-squared: 0.8195,Adjusted R-squared: 0.7794 F-statistic: 20.43 on 8 and 36 DF, p-value: 3.142e-11 Use this model to estimate each of the following: i. the biomass of spartina for short vegetation at Smith Island. ii. the difference in the biomass of spartina between re-vegetated sites (DVEG) at Snow’s Marsh and at Oak Island. iii. the difference in the biomass of spartina short vegetation and tall vegetation at Smith Island. [6 marks] (c) Recall that for each combination of location and vegetation type, five sites were selected and the biomass of spartina was measured. Characteristics of the soil were also measured at each site. One of these characteristics was the pH which indicates the acidity or alkalinity of the soil. The following search procedure was used to see whether it would be useful to include pH (or interactions involving pH) in the model. model2.lm<-lm(log(BIO) ~ Loc+Type+pH,data=spartina.df) model3.lm<-lm(log(BIO) ~ Loc*Type*pH,data=spartina.df) spartstepln.lm<-step(object=model2.lm,scope=formula(model3.lm), direction="both") Briefly, describe the following aspects of this search procedure. i. The starting model. ii. The set of regressors being considered. iii. The procedure used at each step to select the next model. [4 marks] (d) Consider the model selected by this search procedure. Call: lm(formula = log(BIO) ~ Loc + Type + pH + Loc:Type + Loc:pH, data = spartina.df) Estimate Std. Error t value Pr(>|t|) (Intercept) 6.92216 0.92315 7.498 1.27e-08 *** LocSI -12.65295 2.75000 -4.601 5.95e-05 *** LocSM -0.62059 1.74134 -0.356 0.723821 TypeSHRT -0.37689 0.17288 -2.180 0.036491 * TypeTALL 0.39757 0.26485 1.501 0.142840 pH -0.05055 0.19148 -0.264 0.793424 LocSI:TypeSHRT 0.37952 0.24460 1.552 0.130294 LocSM:TypeSHRT -0.55440 0.36236 -1.530 0.135559 LocSI:TypeTALL -13.14536 3.20872 -4.097 0.000255 *** LocSM:TypeTALL -0.49372 0.35519 -1.390 0.173828 LocSI:pH 3.55980 0.80948 4.398 0.000107 *** LocSM:pH 0.25336 0.35515 0.713 0.480610 --- Residual standard error: 0.269 on 33 degrees of freedom Multiple R-squared: 0.8885,Adjusted R-squared: 0.8514 F-statistic: 23.91 on 11 and 33 DF, p-value: 1.315e-12 Use this model to describe the impact that pH has on the biomass of spartina. [6 marks] 6. The number of different mussel species was recorded in 41 rivers along the east coast of the USA. Nine possible explanatory variables were recorded and described as follows: area - the area of drainage basin. sAC, sAP, sSL, sSV - the number of intermediate rivers to 4 major species-source river systems: Alabama-Coosa (sAC), Apalachicola (sAP) St. Lawrence (sSL), and Savannah (sSV). nitrate - nitrate concentration. hy - hydronium ion concentration. solres the amount of solid residue present in the water. logarea - it was thought that taking the log of the area might improve the model fit so this was also included as a potential regressor. Reference: J.J. Sepkoski, Jr., M.A. Rex (1974). “Distribution of Freshwater Mussels: Coastal Rivers as Biogeographic Islands,” Systematic Zoology, Vol. 23(2), pp. 165-188. The first six rows of the data frame for the mussels data: > head(mussels.df) species area sAC sAP sSV sSL nitrate solres hy logarea Penobscot 9 8440 33 28 21 4 0.8 57 4.0 9.0407 Kennebec 8 5960 32 27 20 5 0.4 31 3.2 8.6928 Androscoggin 7 3510 31 26 19 6 0.6 65 2.5 8.1634 Saco 6 1730 30 25 18 7 0.8 33 2.5 7.4559 Merrimac 11 5020 29 24 17 8 2.6 78 6.3 8.5212 Blackstone 8 425 26 21 14 11 8.4 120 20.0 6.0521 The step function was used to conduct two searches for a suitable model using the following R code: > null.glm<- glm(species~1,family = poisson, data=mussels.df) > full.glm<-glm(species~.,family = poisson, data=mussels.df) > formula(full.glm) species ~ area + sAC + sAP + sSV + sSL + nitrate + solres + hy + logarea > step1.glm=step(null.glm,scope=formula(full.glm),trace=0,direction="both") > summary(step1.glm) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.9483756 0.4701266 2.017 0.043667 * logarea 0.2495191 0.0513711 4.857 1.19e-06 *** sAC -0.0258976 0.0056666 -4.570 4.87e-06 *** solres -0.0022974 0.0006491 -3.540 0.000401 *** hy -0.0329310 0.0112529 -2.926 0.003429 ** nitrate 0.0514423 0.0285902 1.799 0.071971 . Null deviance: 127.527 on 43 degrees of freedom Residual deviance: 46.426 on 38 degrees of freedom AIC: 240.45 > step2.glm=step(full.glm,scope=formula(full.glm),trace=0,direction="both") > summary(step2.glm) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.765412 0.487918 1.569 0.116711 sAP -0.034419 0.007187 -4.789 1.67e-06 *** sSV 0.016718 0.011265 1.484 0.137800 nitrate 0.049525 0.028270 1.752 0.079799 . solres -0.002328 0.000655 -3.554 0.000379 *** hy -0.031616 0.011363 -2.782 0.005394 ** logarea 0.257352 0.051853 4.963 6.94e-07 *** Null deviance: 127.527 on 43 degrees of freedom Residual deviance: 44.303 on 37 degrees of freedom AIC: 240.33 (a) Consider the two searches that were performed. i. Briefly describe the procedure used for the first search (step1.glm). ii. How does the second search differ from the first search? iii. Consider the models step1.glm and step2.glm found by the two searches. Is either model clearly better than the other one? Explain your answer. (5 marks) (b) Consider the impact of hydronium concentration (hy) on the mean number of mussel species found in a river. i. Based on the output for model step1.glm, explain the impact on the mean number of mussel species if the hydronium concentration is increased by 1 unit (and all remaining explanatory variables are fixed). ii. Use the following output to produce a 95% confidence interval for your estimate of this impact. > confint(step1.glm) 2.5 % 97.5 % (Intercept) 0.020585439 1.864031858 logarea 0.149254036 0.350694327 sAC -0.037083119 -0.014862806 solres -0.003625416 -0.001075839 hy -0.056618776 -0.012315184 nitrate -0.006685696 0.105584270 iii. Estimate the impact of increasing hydronium concentration by 3 units on the mean number of mussel species found in a river. (6 marks) (c) If a goodness of fit test is performed for model step1.glm, the p-value is equal to 0.19. Explain how the p-value is calculated. What do you conclude about model step1.glm from this test? (4 marks) (d) Consider the following set of diagnostic plots for model step1.glm 1.5 2.0 2.5 3.0 − 3 − 2 − 1 0 1 2 Predicted values R es id ua ls l lll l l l l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Residuals vs Fitted Altamaha Blackstone Kissimmee l lll l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l lll l l l l l l l −2 −1 0 1 2 − 3 − 2 − 1 0 1 2 Theoretical Quantiles St d. d ev ia nc e re si d. Normal Q−Q Altamaha Satilla Blackstone 1.5 2.0 2.5 3.0 0. 0 0. 5 1. 0 1. 5 Predicted values St d. de v ia n c e re s id . l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l l l l l l l l l l l Scale−Location Altamaha SatillaBlackstone 0.0 0.1 0.2 0.3 0.4 0.5 0.6 − 3 − 2 − 1 0 1 2 3 Leverage St d. P e a rs o n r e s id . l lll l l ll l l l l l l l l l l l l l l l l l l l l l l l l l l lll l l l l l l l Cook's distance 1 0.5 0.5 1 Residuals vs Leverage Satilla Blackstone Delaware Also consider the GAM plot produced for this data. 0 2 4 6 8 − 2. 5 − 1. 5 − 0. 5 0. 5 nitrate s (n itra te , 1) l lll l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll ll lll 100 200 300 400 500 − 2. 5 − 1. 5 − 0. 5 0. 5 solres s (so lre s,1 ) ll l l l ll l l l l ll l l l l l l l l l l l l l l l ll ll l l l l l l l l 0 5 10 15 20 25 30 − 2. 5 − 1. 5 − 0. 5 0. 5 hy s (h y, 2. 04 ) l lll l l l l ll l l l l l l l l l l l l l l l l l l l l l l l ll l l l l 6 7 8 9 10 − 2. 5 − 1. 5 − 0. 5 0. 5 logarea s (lo ga re a,1 ) l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l ll l l l l l l l l l i. Based on these plots describe any concerns or issues you have with the fitted model. ii. Briefly explain how you would address each of the concerns/issues you have identi- fied. (5 marks)
欢迎咨询51作业君