STATS 330 THE UNIVERSITY OF AUCKLAND SEMESTER TWO 2020 Campus: City STATISTICS Advanced Statistical Modelling (Time allowed: TWO hours) INSTRUCTIONS • Attempt ALL 3 questions. • There are a total of 80 marks for this examination. • There is a list of useful formulae on the last page of this exam. Page 1 of 22 STATS 330 1. [35 marks] The data described below show the free-throw results obtained by the Los Angeles Lakers player Shaq O’Neal in 23 NBA play-off games in the year 2000. (For those unfamiliar with basketball, a free throw is when a player is allowed to take an unopposed shot at the basket from the free-throw line. Thus in game 1, O’Neal attempted 5 free throws, 4 of which were successful.) People are interested in whether the number of attempted free throws and the proportion that were successful changes as the play-offs progressed (i.e. with increasing game number). The data in the file Shaq.df consists of: game the game number n the number of free throws s the number of successful free throws. > head(Shaq.df) game s n 1 1 4 5 2 2 5 11 3 3 5 14 4 4 5 12 5 5 2 7 6 6 7 10 Page 2 of 22 STATS 330 Consider the plot and fitted model below: > plot(n~game, data=Shaq.df) > n.mod1=glm(n~game, family=poisson, data=Shaq.df) 5 10 15 20 5 10 20 30 40 game n (a) For n.mod1, what is (i) the assumed relationship between game and the number of free throws n, and (ii) the assumed distribution of its response value? [5 marks] log(µi) = β0 + gamei Yi ∼ Poisson(µi), where, for the ith observation, Yi is the number of free throws gamei = i is the game number. Consider the following code: > n.mod1=glm(n~game, family=poisson, data=Shaq.df) > 1 - pchisq(deviance(n.mod1), df.residual(n.mod1)) [1] 4.7952e-08 > n.mod2=glm.nb(n~game, data=Shaq.df) > 1 - pchisq(deviance(n.mod2), df.residual(n.mod2)) [1] 0.36648 Page 3 of 22 STATS 330 (b) Using this output explain why we prefer n.mod2 over n.mod1 in terms of model adequacy. [5 marks] The GOF test: 1 - pchisq(deviance(n.mod1), df.residual(n.mod1))= 4.7952 × 10−8 indicating we have extremely strong evidence against the assumption that the data comes from a Poisson distribution. For the negative binomial model we have 1 - pchisq(deviance(n.mod2), df.residual(n.mod2))= 0.36648 indicating a good fit for these data. Page 4 of 22 STATS 330 Consider this negative binomial analysis of these data: > summary(n.mod2) Call: glm.nb(formula = n ~ game, data = Shaq.df, init.theta = 6.075494072, link = log) Deviance Residuals: Min 1Q Median 3Q Max -1.769 -0.669 -0.236 0.318 2.536 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.3260 0.2162 10.76 <2e-16 *** game 0.0185 0.0155 1.19 0.23 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for Negative Binomial(6.0755) family taken to be 1) Null deviance: 23.930 on 22 degrees of freedom Residual deviance: 22.585 on 21 degrees of freedom AIC: 152.3 Number of Fisher Scoring iterations: 1 Theta: 6.08 Std. Err.: 2.54 2 x log-likelihood: -146.28 > n.mod3=glm.nb(n~1, data=Shaq.df) > 1 - pchisq(deviance(n.mod3), df.residual(n.mod3)) [1] 0.41698 > confint(n.mod3) Waiting for profiling to be done... 2.5 % 97.5 % 2.3510 2.7643 (c) The estimate and its standard error associated with the variable game (from the model n.mod2) are 0.0185 and 0.0155, respectively. Explain how we obtain the z value of 1.19. [5 marks] Page 5 of 22 STATS 330 The null hypothesis assumes H0 : β1 = 0. and the Z value= (βˆ1 − 0) se(βˆ1) and so Z value= (0.0185− 0) 0.0155 = 1.19. Page 6 of 22 STATS 330 Consider another negative binomial analysis of these data: > n.mod3=glm.nb(n~1, data=Shaq.df) > summary(n.mod3) Call: glm.nb(formula = n ~ 1, data = Shaq.df, init.theta = 5.636962983, link = log) Deviance Residuals: Min 1Q Median 3Q Max -1.797 -0.564 -0.136 0.170 2.807 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.555 0.105 24.3 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for Negative Binomial(5.637) family taken to be 1) Null deviance: 22.732 on 22 degrees of freedom Residual deviance: 22.732 on 22 degrees of freedom AIC: 151.6 Number of Fisher Scoring iterations: 1 Theta: 5.64 Std. Err.: 2.31 2 x log-likelihood: -147.59 > 1 - pchisq(deviance(n.mod3), df.residual(n.mod3)) [1] 0.41698 > confint(n.mod3) Waiting for profiling to be done... 2.5 % 97.5 % 2.3510 2.7643 (d) For model n.mod3 calculate the value of the Pearson residual associated with the first play-off game. [5 marks] n=5, and yˆ1 = µˆ1 = exp(β̂0) = exp(2.55487) = 12.86957. Page 7 of 22 STATS 330 V̂ar(yˆ1) = µˆ1 + µˆ1 2 θˆ = 12.86957 + 12.869572 5.63696 = 42.25165. ri = yi − yˆ1√ Var(yˆ1) = 5− 12.86957√ 42.25165 = −1.2107. > residuals(n.mod3, "pearson")[1] 1 -1.2107 Page 8 of 22 STATS 330 Consider the following analysis of the proportion of successful free throws by O’Neil over this play-off season. > plot(I(s/n)~game, data=Shaq.df) 5 10 15 20 0. 2 0. 4 0. 6 0. 8 game I(s /n) > bin.mod1=glm(cbind(s, n-s)~game, family=binomial, data=Shaq.df) > summary(bin.mod1) Call: glm(formula = cbind(s, n - s) ~ game, family = binomial, data = Shaq.df) Deviance Residuals: Min 1Q Median 3Q Max -1.463 -1.039 -0.352 0.984 2.787 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.0128 0.2621 0.05 0.96 game -0.0159 0.0185 -0.86 0.39 (Dispersion parameter for binomial family taken to be 1) Null deviance: 33.376 on 22 degrees of freedom Residual deviance: 32.633 on 21 degrees of freedom AIC: 99.8 Number of Fisher Scoring iterations: 3 Page 9 of 22 STATS 330 > bin.mod2=glm(cbind(s, n-s)~1, family=binomial, data=Shaq.df) > summary(bin.mod2) Call: glm(formula = cbind(s, n - s) ~ 1, family = binomial, data = Shaq.df) Deviance Residuals: Min 1Q Median 3Q Max -1.663 -0.947 -0.182 1.013 2.758 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.190 0.117 -1.63 0.1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 33.376 on 22 degrees of freedom Residual deviance: 33.376 on 22 degrees of freedom AIC: 98.55 Number of Fisher Scoring iterations: 3 > 1-pchisq(deviance(bin.mod2),df.residual(bin.mod2)) [1] 0.056781 > # est > exp(coef(bin.mod2))/(1+exp(coef(bin.mod2))) (Intercept) 0.4527 > #Confidence interval > ci=confint(bin.mod2) Waiting for profiling to be done... > exp(ci)/(1+exp(ci)) 2.5 % 97.5 % 0.39659 0.50963 (e) It appears there is some evidence that these data are not consistent with a Binomial distribution. This observation is based on using a chi-squared distribution to approximate the sampling distribution of the residual de- viance. Explain why the use of a chi-squared distribution may not be valid in this case. [5 marks] The chi-squared approximation works reasonably well provided n ≥ 5 when Page 10 of 22 STATS 330 p ≈ .5. Here, p = 0.4527 and we see that we that we are very close to this limit. With the above comment in mind, the following simulation was performed: > set.seed(123) > Nsim=1e4 > pdevs=rep(0,Nsim) > probs=predict(bin.mod2,type="response") > > for ( i in 1:Nsim){ ysim=rbinom( n=nrow(Shaq.df),size=Shaq.df$n, prob=probs) mod=glm(cbind(ysim,Shaq.df$n-ysim)~1,family=binomial) pdevs[i]=deviance(mod)} > > sum(pdevs>=deviance(bin.mod2))/Nsim [1] 0.0932 (f) What do we conclude from the above analysis? Explain briefly. [5 marks] We see that the p-value= 0.0932 associated with GOF is larger. We have, at best, weak evidence to believe that the binomial assumption may be invalid—or not! Assume the model bin.mod2 can be used for interpretation. (g) Write a brief executive summary about the number of free throws and their success rate for Shaq O’Neal in these play-offs. [5 marks] We are interested in the number of free throws and subsequent success of Shaq ONeal during the 2000 play-off season of basketball. Over this period of play-offs we found no evidence of change for both of these measures. We estimate that Shaq ONeal, on average, between 10.5 and 15.9 free throws with the proportion of successful throws being somewhere between 40% and 51% of the time for each game. Page 11 of 22 STATS 330 2. [20 marks] Roberts and Foppa (Vector-Borne and Zoonotic Diseases, 2006, 6, 1–6) used Poisson regression to model the occurrence of West Nile virus (WNV) in horses. WNV is a disease that attacks the central nervous system and its effects can be fatal. Birds are the most commonly affected animal but WNV also affects mammals including horses and humans. WNV is spread mainly by mosquitos, with birds being the primary hosts for the virus. Typically, bird mortality due to WNV precedes human and equine infection and thus dead bird surveillance is used to monitor the risk to human and horse populations. Es- sentially, people are asked to report cases of dead birds which are then collected and tested. As population increases it is expected that more dead birds will be reported and thus the number of positive tests will be higher. To compensate, the “positive bird rate” (the number of dead birds that tested positive for WNV divided by the population of the county) is used as the explanatory variable of main interest. Data collected for 25 counties in South Carolina, USA in 2003 were used to fit a model that involved the following variables: equine: The number of cases of WNV in horses (response variable). farms: The number of farms in the county (offset variable). PBR: The positive bird rate per 10,000 population (explanatory variable of main interest). density: The population density recorded as people per square mile (explanatory variable). The following output shows the first 6 observations in the data set and a summary of the variables: > head(WNV.df) county equine farms density PBR 1 Abbeville 0 471 51.50 0.000 2 Aiken 6 729 132.86 0.982 3 Allendale 0 114 27.51 1.782 4 Anderson 1 1271 230.81 0.543 5 Bamberg 2 254 42.43 0.600 6 Barnwell 1 325 42.83 3.834 > summary(WNV.df) county equine farms density PBR Abbeville: 1 Min. : 0.00 Min. : 92 Min. : 27.5 Min. :0.000 Aiken : 1 1st Qu.: 0.00 1st Qu.: 254 1st Qu.: 49.0 1st Qu.:0.223 Allendale: 1 Median : 0.00 Median : 348 Median : 74.2 Median :0.561 Anderson : 1 Mean : 1.17 Mean : 438 Mean :123.1 Mean :0.809 Bamberg : 1 3rd Qu.: 1.00 3rd Qu.: 590 3rd Qu.:157.3 3rd Qu.:0.979 Barnwell : 1 Max. :10.00 Max. :1271 Max. :480.6 Max. :3.923 (Other) :40 Page 12 of 22 STATS 330 Consider the following negative binomial regression model: > NB.glm=glm.nb(equine~density*PBR+offset(log(farms)),link = log, control=glm.control(maxit=50),data=WNV.df) > summary(NB.glm) Call: glm.nb(formula = equine ~ density * PBR + offset(log(farms)), data = WNV.df, control = glm.control(maxit = 50), link = log, init.theta = 2.39092649) Deviance Residuals: Min 1Q Median 3Q Max -1.277 -0.981 -0.752 0.313 1.956 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.915243 0.434071 -15.93 <2e-16 *** density -0.000689 0.002299 -0.30 0.764 PBR 0.451653 0.240296 1.88 0.060 . density:PBR 0.004464 0.002292 1.95 0.051 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for Negative Binomial(2.3909) family taken to be 1) Null deviance: 72.300 on 45 degrees of freedom Residual deviance: 45.246 on 42 degrees of freedom AIC: 121.8 Number of Fisher Scoring iterations: 1 Theta: 2.39 Std. Err.: 1.91 2 x log-likelihood: -111.82 (a) The main purpose of this model was to relate the number of cases of WNV in horses equine to the positive bird rate PBR. i. The model includes log(farms) as an offset. How does this affect the interpretation of this model? Including log(farms) as an offset means that the model is exploring the number of cases of WNR in horses per farm. ii. The model includes density as an explanatory variable that interacts with PBR. As a result the relationship between equine and PBR depends on the level of density. Compare the relationship between equine and PBR for a county with a density of 100 people per square mile to a county with a density of 200 people per square mile. Page 13 of 22 STATS 330 For 100 people per square mile the fitted model is: log µˆ = −6.9152− .000689× 100 + (.4516 + .004464× 100)× PBR + log(farms) = −6.984 + .898× PBR + log(farms) For 200 people per square mile the fitted model is: log µˆ = −6.9152− .000689× 200 + (.4516 + .004464× 200)× PBR + log(farms) = −7.053 + 1.344× PBR + log(farms) For both densities the expected of cases of WNV increases with increas- ing PBR. However, the rate of increase is higher for a density of 200 than a density of 100. iii. Based on this model, what is the fitted value of equine for Abbeville county? Note that the values of the explanatory variables for Abbeville are given on the previous page. Plugging in the values of the explanatory variables for Abbeville gives: log µˆ = −6.9152−.000689×51.5+(.4516+.004464×51.5)×0+log(471) = −.7958 . Thus the fitted value of equine for Abbeville is exp(−.7958) = .45. (b) Consider the following output > anova(NB.glm,test="Chisq") Analysis of Deviance Table Model: Negative Binomial(2.3909), link: log Response: equine Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 45 72.3 Page 14 of 22 STATS 330 density 1 0.00 44 72.3 0.971 PBR 1 23.89 43 48.4 1e-06 *** density:PBR 1 3.16 42 45.2 0.075 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 i. The output from anova() gives a different p-value for density:PBR than does the output from summary() even though they are both testing the hypothesis that the interaction is not needed in the model. Explain why this occurs. The test given in output from summary() is based on the assumption that the distribution of the fitted coefficient is aprproximately Normal. The test given in output from anova() is based on the assumption that the distribution of the difference in residual deviance (between the model that does not contain the interaction term and the model that does) is approximately χ2 with 1 df. As the tests are based on different assumptions they give different p-values. ii. Suppose that you wish to test the hypothesis that the interaction is not needed in the model without assuming that the change in deviance will have a chi-squared distribution. Explain how you would do this. For this test, the null hypothesis is that the interaction is not needed in the model. Thus we need to generate a reference distribution for the deviance change when the interaction is not needed in the model. The only technique we have to do this is to use a parametric bootstrap. Fit the model that does not include the density:PBR interaction. Use this model to generate a large number of new sets of data Note that the same sets of values are used for the explanatory variables but new values of the response are generated. For each of these data sets fit both model that contains the intera- tion and the model that doesn’t. Record the difference in residual deviances for these two models to create the reference distribution. Estimate the p-value as the proportion of values in the reference distribution that are greater than the deviance change (3.16) that was observed for the actual data. (c) Suppose that this model is to be used to predict cases of WNV in horses in other counties. Explain how a realistic estimate of the MSPE (mean square prediction error) could be obtained. Page 15 of 22 STATS 330 Given the relatively small number of observations, we would need to use cross validation. This technique randomly divides the data into 10 parts: 9 parts (training set) are used to fit the model which is then used to predict the values of equine for the remaining part (test set) and the MSPE is estimated based on these predictions. Each of the 10 parts is used once as the training set and the average of the 10 estimates of MSPE is taken. This procedure is then repeated for a large number of different random splits of the data. The average of all the estimates of MSPE is taken and used as our final estimate. Page 16 of 22 STATS 330 3. [25 marks] In Lecture Handout 16, data concerning academics at the Uni- versity of Washington were used as an example for causal analysis. For this question, we are going to consider the following subset of the variables from that example: gender: F (female) or M (male). deg: highest degree is a PhD (PhD), Prof (a professional degree) or other (any other type of degree). field: is Arts (visual and performing arts), Prof (medicine, law, nursing . . . ) or other (any other field). startyr: year of first employment. rank: in 1995 is Assist (assistant professor), Assoc (associate professor) or Full (full professor). admin: is 0 (no extra pay for admin duties) or 1 (extra pay for admin duties). Consider using these data to investigate the causal impact on admin of the other variables. Assume that the following causal diagram as being appropriate for this situation. field deg startyr gender admin rank Based on this diagram it can be deduced that: rank, gender, deg and startyr have direct effects on admin and to esti- mate these direct effects we should use a model that contains rank, gender, deg and startyr as explanatory variables. to estimate the total effect that gender has on admin, we should use a model that just uses gender as an explanatory variable. Page 17 of 22 STATS 330 (a) What model would you use to estimate the total effect of deg on admin? Explain your choice of model. [5 marks] Use a logistic regression model with admin as the response. Include deg (variable of interest), gender (confounding variable) and startyr (has a direct effect on the response) as explanatory variables. We don’t include rank (on an indirect causal pathway) or field (does not have a direct effect on response). (b) The following logistic regression model can be used to investigate the direct effects of rank, gender, deg and startyr on the chances of an academic receiving extra pay for administrative duties (admin). > admin2.glm= glm(admin~rank+gender+deg+startyr+I(startyr^2), family=binomial, data=salary.df) > summary(admin2.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -16.833494 5.994717 -2.808 0.004984 ** rankAssoc 2.935786 1.034217 2.839 0.004530 ** rankFull 3.676929 1.035122 3.552 0.000382 *** genderM -0.030082 0.218661 -0.138 0.890579 degPhD -0.189948 0.297217 -0.639 0.522766 degProf 0.472800 0.387233 1.221 0.222098 startyr 0.310542 0.158666 1.957 0.050323 . I(startyr^2) -0.002046 0.001032 -1.983 0.047379 * Null deviance: 1077.92 on 1593 degrees of freedom Residual deviance: 981.87 on 1586 degrees of freedom i. Use this model to estimate the difference in the log(odds) of receiving extra pay for administrative duties between an academic whose highest degree is a PhD and an academic who has a professional degree (given the levels of the remaining explanatory variables are fixed). The difference in the log odds will be .473− (−.190) = .663. ii. Given that gender and deg are set to their baseline levels, what is the estimated probability that a full professor with startyr= 70 receives extra pay for administrative duties? The log odds will be −16.8333 + 3.677 + .31054× 70− .002046× 702 = −1.444. Page 18 of 22 STATS 330 To convert log odds to a probability: > exp(-1.444)/(1+exp(-1.444)) [1] 0.19093 iii. Does this model suggest that male academics are more apt to receive extra pay for administrative duties than female academics (given the levels of the remaining explanatory variables are fixed)? Explain your answer. The coefficient for genderM is not significant so this indicates that there is no evidence that male academics are more likely to receive extra pay for administrative work given that all of the other variables are fixed. Page 19 of 22 STATS 330 (c) The logistic model that just has gender as an explanatory variable can be used to investigate the total effect that gender has on admin. > admin5.glm= glm(admin~gender,family=binomial, data=salary.df) > summary(admin5.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.4639 0.1841 -13.380 <2e-16 *** genderM 0.4282 0.2053 2.086 0.037 * Null deviance: 1077.9 on 1593 degrees of freedom Residual deviance: 1073.3 on 1592 degrees of freedom AIC: 1077.3 > anova(admin5.glm,test="Chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 1593 1077.9 gender 1 4.6659 1592 1073.2 0.03077 * Based on this output, describe the total effect that gender has on admin. [5 marks] The total effect for gender indicates that women academics are less likely to get extra pay for administrative work than male academics. Using this model, the probably that a female academic gets extra pay for administra- tive work is > exp(-2.4639)/(1+exp(-2.4639)) [1] 0.078428 Whereas, the probability for a male academic is > exp(-2.4639+.4282)/(1+exp(-2.4639+.4282)) [1] 0.11551 (d) Using the results from (b) and (c), describe the nature of the relationship between gender and the probability that an academic receives extra pay for administrative duties. Your explanation should be understandable for someone who is not familiar with the terms “direct effect” and “total effect.” [6 marks] The results from (b)ii indicate that if we have a male academic and a fe- male academic that have identical values for all of the other regressors, then Page 20 of 22 STATS 330 their probabilities of getting extra pay for administrative work are approx- imately the same. The result in (c) indicates that a smaller percentage of women get extra pay for administrative work than men. The explanation for this apparent discrepancy is that gender may affect the values of some of the other regressors which impact the probability of getting extra pay for administrative work. For example, we see that a professor is more apt to get extra pay for administrative work than either an associate or assistant professor. The result from (b)ii indicates a female professor is as likely to get extra pay for administrative work as a male professor. However, if a female academic is less likely to be a professor this would mean that female academics are also less likely to get extra pay for administrative work. Page 21 of 22 STATS 330 Some Useful equations If Yi ∼ Normal(µi, σ2), then E(Yi) = µi, Var(Yi) = σ 2. If Yi ∼ Binomial(ni, pi), then E(Yi) = nipi, Var(Yi) = nipi(1− pi). If Yi ∼ Beta− Binomial(ni, pi, ρi), then E(Yi) = nipi, Var(Yi) = nipi(1− pi)(1 + ρ(ni − 1)). Here, ρ will appear as rho in any R-output. (Also, sometimes, called θ (or theta in R )). If Yi ∼ Poisson(µi), then E(Yi) = µi, (1) Var(Yi) = µi. (2) If Yi ∼ negativebinomial(µi, θ), then E(Yi) = µi, Var(Yi) = µi + µ2i θ Here, θ will appear as theta in any R-output. For model fitting a model mod: AIC = 2(k − LLmod) where k is the number of parameters used in describing the model and LLmod is the log likelihood. Page 22 of 22
欢迎咨询51作业君