程序代写案例-STATS 330

STATS 330
THE UNIVERSITY OF AUCKLAND
SEMESTER TWO 2020
Campus: City
STATISTICS
(Time allowed: TWO hour
s)
INSTRUCTIONS
• Attempt ALL 3 questions.
• There are a total of 80 marks for this examination.
• There is a list of useful formulae on the last page of this exam.
Page 1 of 22
STATS 330
1. [35 marks] The data described below show the free-throw results obtained
by the Los Angeles Lakers player Shaq O’Neal in 23 NBA play-off games in the
year 2000. (For those unfamiliar with basketball, a free throw is when a player
is allowed to take an unopposed shot at the basket from the free-throw line.
Thus in game 1, O’Neal attempted 5 free throws, 4 of which were successful.)
People are interested in whether the number of attempted free throws and the
proportion that were successful changes as the play-offs progressed (i.e. with
increasing game number).
The data in the file Shaq.df consists of:
game the game number
n the number of free throws
s the number of successful free throws.
game s n
1 1 4 5
2 2 5 11
3 3 5 14
4 4 5 12
5 5 2 7
6 6 7 10
Page 2 of 22
STATS 330
Consider the plot and fitted model below:
> plot(n~game, data=Shaq.df)
> n.mod1=glm(n~game, family=poisson, data=Shaq.df)
5 10 15 20
5
10
20
30
40
game
n
(a) For n.mod1, what is (i) the assumed relationship between game and the
number of free throws n, and (ii) the assumed distribution of its response
value? [5 marks]
log(µi) = β0 + gamei
Yi ∼ Poisson(µi),
where, for the ith observation, Yi is the number of free throws gamei = i
is the game number.
Consider the following code:
> n.mod1=glm(n~game, family=poisson, data=Shaq.df)
> 1 - pchisq(deviance(n.mod1), df.residual(n.mod1))
[1] 4.7952e-08
> n.mod2=glm.nb(n~game, data=Shaq.df)
> 1 - pchisq(deviance(n.mod2), df.residual(n.mod2))
[1] 0.36648
Page 3 of 22
STATS 330
(b) Using this output explain why we prefer n.mod2 over n.mod1 in terms of
The GOF test:
1 - pchisq(deviance(n.mod1), df.residual(n.mod1))= 4.7952 × 10−8
indicating we have extremely strong evidence against the assumption that the
data comes from a Poisson distribution.
For the negative binomial model we have 1 - pchisq(deviance(n.mod2),
df.residual(n.mod2))= 0.36648 indicating a good fit for these data.
Page 4 of 22
STATS 330
Consider this negative binomial analysis of these data:
> summary(n.mod2)
Call:
glm.nb(formula = n ~ game, data = Shaq.df, init.theta = 6.075494072,
Deviance Residuals:
Min 1Q Median 3Q Max
-1.769 -0.669 -0.236 0.318 2.536
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3260 0.2162 10.76 <2e-16 ***
game 0.0185 0.0155 1.19 0.23
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(6.0755) family taken to be 1)
Null deviance: 23.930 on 22 degrees of freedom
Residual deviance: 22.585 on 21 degrees of freedom
AIC: 152.3
Number of Fisher Scoring iterations: 1
Theta: 6.08
Std. Err.: 2.54
2 x log-likelihood: -146.28
> n.mod3=glm.nb(n~1, data=Shaq.df)
> 1 - pchisq(deviance(n.mod3), df.residual(n.mod3))
[1] 0.41698
> confint(n.mod3)
Waiting for profiling to be done...
2.5 % 97.5 %
2.3510 2.7643
(c) The estimate and its standard error associated with the variable game (from
the model n.mod2) are 0.0185 and 0.0155, respectively. Explain how we
obtain the z value of 1.19.
[5 marks]
Page 5 of 22
STATS 330
The null hypothesis assumes H0 : β1 = 0. and the Z value=
(βˆ1 − 0)
se(βˆ1)
and
so Z value=
(0.0185− 0)
0.0155
= 1.19.
Page 6 of 22
STATS 330
Consider another negative binomial analysis of these data:
> n.mod3=glm.nb(n~1, data=Shaq.df)
> summary(n.mod3)
Call:
glm.nb(formula = n ~ 1, data = Shaq.df, init.theta = 5.636962983,
Deviance Residuals:
Min 1Q Median 3Q Max
-1.797 -0.564 -0.136 0.170 2.807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.555 0.105 24.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(5.637) family taken to be 1)
Null deviance: 22.732 on 22 degrees of freedom
Residual deviance: 22.732 on 22 degrees of freedom
AIC: 151.6
Number of Fisher Scoring iterations: 1
Theta: 5.64
Std. Err.: 2.31
2 x log-likelihood: -147.59
> 1 - pchisq(deviance(n.mod3), df.residual(n.mod3))
[1] 0.41698
> confint(n.mod3)
Waiting for profiling to be done...
2.5 % 97.5 %
2.3510 2.7643
(d) For model n.mod3 calculate the value of the Pearson residual associated
with the first play-off game. [5 marks]
n=5, and yˆ1 = µˆ1 = exp(β̂0) = exp(2.55487) = 12.86957.
Page 7 of 22
STATS 330
V̂ar(yˆ1) = µˆ1 +
µˆ1
2
θˆ
= 12.86957 +
12.869572
5.63696
= 42.25165.
ri =
yi − yˆ1√
Var(yˆ1)
=
5− 12.86957√
42.25165
= −1.2107.
> residuals(n.mod3, "pearson")[1]
1
-1.2107
Page 8 of 22
STATS 330
Consider the following analysis of the proportion of successful free throws by
O’Neil over this play-off season.
> plot(I(s/n)~game, data=Shaq.df)
5 10 15 20
0.
2
0.
4
0.
6
0.
8
game
I(s
/n)
> bin.mod1=glm(cbind(s, n-s)~game, family=binomial, data=Shaq.df)
> summary(bin.mod1)
Call:
glm(formula = cbind(s, n - s) ~ game, family = binomial, data = Shaq.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.463 -1.039 -0.352 0.984 2.787
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0128 0.2621 0.05 0.96
game -0.0159 0.0185 -0.86 0.39
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33.376 on 22 degrees of freedom
Residual deviance: 32.633 on 21 degrees of freedom
AIC: 99.8
Number of Fisher Scoring iterations: 3
Page 9 of 22
STATS 330
> bin.mod2=glm(cbind(s, n-s)~1, family=binomial, data=Shaq.df)
> summary(bin.mod2)
Call:
glm(formula = cbind(s, n - s) ~ 1, family = binomial, data = Shaq.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.663 -0.947 -0.182 1.013 2.758
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.190 0.117 -1.63 0.1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33.376 on 22 degrees of freedom
Residual deviance: 33.376 on 22 degrees of freedom
AIC: 98.55
Number of Fisher Scoring iterations: 3
> 1-pchisq(deviance(bin.mod2),df.residual(bin.mod2))
[1] 0.056781
> # est
> exp(coef(bin.mod2))/(1+exp(coef(bin.mod2)))
(Intercept)
0.4527
> #Confidence interval
> ci=confint(bin.mod2)
Waiting for profiling to be done...
> exp(ci)/(1+exp(ci))
2.5 % 97.5 %
0.39659 0.50963
(e) It appears there is some evidence that these data are not consistent with
a Binomial distribution. This observation is based on using a chi-squared
distribution to approximate the sampling distribution of the residual de-
viance. Explain why the use of a chi-squared distribution may not be valid
in this case. [5 marks]
The chi-squared approximation works reasonably well provided n ≥ 5 when
Page 10 of 22
STATS 330
p ≈ .5. Here, p = 0.4527 and we see that we that we are very close to this
limit.
With the above comment in mind, the following simulation was performed:
> set.seed(123)
> Nsim=1e4
> pdevs=rep(0,Nsim)
> probs=predict(bin.mod2,type="response")
>
> for ( i in 1:Nsim){
ysim=rbinom(
n=nrow(Shaq.df),size=Shaq.df\$n, prob=probs)
mod=glm(cbind(ysim,Shaq.df\$n-ysim)~1,family=binomial)
pdevs[i]=deviance(mod)}
>
> sum(pdevs>=deviance(bin.mod2))/Nsim
[1] 0.0932
(f) What do we conclude from the above analysis? Explain briefly. [5 marks]
We see that the p-value= 0.0932 associated with GOF is larger. We have, at
best, weak evidence to believe that the binomial assumption may be invalid—or
not!
Assume the model bin.mod2 can be used for interpretation.
(g) Write a brief executive summary about the number of free throws and their
success rate for Shaq O’Neal in these play-offs. [5 marks]
We are interested in the number of free throws and subsequent success of Shaq
ONeal during the 2000 play-off season of basketball.
Over this period of play-offs we found no evidence of change for both of these
measures.
We estimate that Shaq ONeal, on average, between 10.5 and 15.9 free throws
with the proportion of successful throws being somewhere between 40% and
51% of the time for each game.
Page 11 of 22
STATS 330
2. [20 marks] Roberts and Foppa (Vector-Borne and Zoonotic Diseases, 2006,
6, 1–6) used Poisson regression to model the occurrence of West Nile virus
(WNV) in horses. WNV is a disease that attacks the central nervous system and
its effects can be fatal. Birds are the most commonly affected animal but WNV
also affects mammals including horses and humans. WNV is spread mainly by
mosquitos, with birds being the primary hosts for the virus. Typically, bird
mortality due to WNV precedes human and equine infection and thus dead bird
surveillance is used to monitor the risk to human and horse populations. Es-
sentially, people are asked to report cases of dead birds which are then collected
and tested. As population increases it is expected that more dead birds will be
reported and thus the number of positive tests will be higher. To compensate,
the “positive bird rate” (the number of dead birds that tested positive for WNV
divided by the population of the county) is used as the explanatory variable of
main interest.
Data collected for 25 counties in South Carolina, USA in 2003 were used to fit
a model that involved the following variables:
equine: The number of cases of WNV in horses (response variable).
farms: The number of farms in the county (offset variable).
PBR: The positive bird rate per 10,000 population (explanatory variable of main
interest).
density: The population density recorded as people per square mile (explanatory
variable).
The following output shows the first 6 observations in the data set and a
summary of the variables:
county equine farms density PBR
1 Abbeville 0 471 51.50 0.000
2 Aiken 6 729 132.86 0.982
3 Allendale 0 114 27.51 1.782
4 Anderson 1 1271 230.81 0.543
5 Bamberg 2 254 42.43 0.600
6 Barnwell 1 325 42.83 3.834
> summary(WNV.df)
county equine farms density PBR
Abbeville: 1 Min. : 0.00 Min. : 92 Min. : 27.5 Min. :0.000
Aiken : 1 1st Qu.: 0.00 1st Qu.: 254 1st Qu.: 49.0 1st Qu.:0.223
Allendale: 1 Median : 0.00 Median : 348 Median : 74.2 Median :0.561
Anderson : 1 Mean : 1.17 Mean : 438 Mean :123.1 Mean :0.809
Bamberg : 1 3rd Qu.: 1.00 3rd Qu.: 590 3rd Qu.:157.3 3rd Qu.:0.979
Barnwell : 1 Max. :10.00 Max. :1271 Max. :480.6 Max. :3.923
(Other) :40
Page 12 of 22
STATS 330
Consider the following negative binomial regression model:
control=glm.control(maxit=50),data=WNV.df)
> summary(NB.glm)
Call:
glm.nb(formula = equine ~ density * PBR + offset(log(farms)),
data = WNV.df, control = glm.control(maxit = 50), link = log,
init.theta = 2.39092649)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.277 -0.981 -0.752 0.313 1.956
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.915243 0.434071 -15.93 <2e-16 ***
density -0.000689 0.002299 -0.30 0.764
PBR 0.451653 0.240296 1.88 0.060 .
density:PBR 0.004464 0.002292 1.95 0.051 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(2.3909) family taken to be 1)
Null deviance: 72.300 on 45 degrees of freedom
Residual deviance: 45.246 on 42 degrees of freedom
AIC: 121.8
Number of Fisher Scoring iterations: 1
Theta: 2.39
Std. Err.: 1.91
2 x log-likelihood: -111.82
(a) The main purpose of this model was to relate the number of cases of WNV
in horses equine to the positive bird rate PBR.
i. The model includes log(farms) as an offset. How does this affect the
interpretation of this model?
Including log(farms) as an offset means that the model is exploring the
number of cases of WNR in horses per farm.
ii. The model includes density as an explanatory variable that interacts
with PBR. As a result the relationship between equine and PBR depends
on the level of density. Compare the relationship between equine and
PBR for a county with a density of 100 people per square mile to a
county with a density of 200 people per square mile.
Page 13 of 22
STATS 330
For 100 people per square mile the fitted model is:
log µˆ = −6.9152− .000689× 100 + (.4516 + .004464× 100)× PBR
+ log(farms)
= −6.984 + .898× PBR + log(farms)
For 200 people per square mile the fitted model is:
log µˆ = −6.9152− .000689× 200 + (.4516 + .004464× 200)× PBR
+ log(farms)
= −7.053 + 1.344× PBR + log(farms)
For both densities the expected of cases of WNV increases with increas-
ing PBR. However, the rate of increase is higher for a density of 200
than a density of 100.
iii. Based on this model, what is the fitted value of equine for Abbeville
county? Note that the values of the explanatory variables for Abbeville
are given on the previous page.
Plugging in the values of the explanatory variables for Abbeville gives:
log µˆ = −6.9152−.000689×51.5+(.4516+.004464×51.5)×0+log(471) = −.7958
.
Thus the fitted value of equine for Abbeville is exp(−.7958) = .45.
(b) Consider the following output
> anova(NB.glm,test="Chisq")
Analysis of Deviance Table
Response: equine
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45 72.3
Page 14 of 22
STATS 330
density 1 0.00 44 72.3 0.971
PBR 1 23.89 43 48.4 1e-06 ***
density:PBR 1 3.16 42 45.2 0.075 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i. The output from anova() gives a different p-value for density:PBR
than does the output from summary() even though they are both testing
the hypothesis that the interaction is not needed in the model. Explain
why this occurs.
The test given in output from summary() is based on the assumption
that the distribution of the fitted coefficient is aprproximately Normal.
The test given in output from anova() is based on the assumption
that the distribution of the difference in residual deviance (between the
model that does not contain the interaction term and the model that
does) is approximately χ2 with 1 df. As the tests are based on different
assumptions they give different p-values.
ii. Suppose that you wish to test the hypothesis that the interaction is not
needed in the model without assuming that the change in deviance will
have a chi-squared distribution. Explain how you would do this.
For this test, the null hypothesis is that the interaction is not needed
in the model. Thus we need to generate a reference distribution for the
deviance change when the interaction is not needed in the model. The
only technique we have to do this is to use a parametric bootstrap.
 Fit the model that does not include the density:PBR interaction.
 Use this model to generate a large number of new sets of data Note
that the same sets of values are used for the explanatory variables
but new values of the response are generated.
 For each of these data sets fit both model that contains the intera-
tion and the model that doesn’t. Record the difference in residual
deviances for these two models to create the reference distribution.
 Estimate the p-value as the proportion of values in the reference
distribution that are greater than the deviance change (3.16) that
was observed for the actual data.
(c) Suppose that this model is to be used to predict cases of WNV in horses in
other counties. Explain how a realistic estimate of the MSPE (mean square
prediction error) could be obtained.
Page 15 of 22
STATS 330
Given the relatively small number of observations, we would need to use
cross validation. This technique randomly divides the data into 10 parts: 9
parts (training set) are used to fit the model which is then used to predict
the values of equine for the remaining part (test set) and the MSPE is
estimated based on these predictions. Each of the 10 parts is used once as
the training set and the average of the 10 estimates of MSPE is taken. This
procedure is then repeated for a large number of different random splits of
the data. The average of all the estimates of MSPE is taken and used as
our final estimate.
Page 16 of 22
STATS 330
3. [25 marks] In Lecture Handout 16, data concerning academics at the Uni-
versity of Washington were used as an example for causal analysis. For this
question, we are going to consider the following subset of the variables from
that example:
gender: F (female) or M (male).
deg: highest degree is a PhD (PhD), Prof (a professional degree) or other (any
other type of degree).
field: is Arts (visual and performing arts), Prof (medicine, law, nursing . . . ) or
other (any other field).
startyr: year of first employment.
rank: in 1995 is Assist (assistant professor), Assoc (associate professor) or Full
(full professor).
Consider using these data to investigate the causal impact on admin of the other
variables. Assume that the following causal diagram as being appropriate for
this situation.
field deg startyr
gender
rank
Based on this diagram it can be deduced that:
 rank, gender, deg and startyr have direct effects on admin and to esti-
mate these direct effects we should use a model that contains rank, gender,
deg and startyr as explanatory variables.
 to estimate the total effect that gender has on admin, we should use a
model that just uses gender as an explanatory variable.
Page 17 of 22
STATS 330
(a) What model would you use to estimate the total effect of deg on admin?
Explain your choice of model. [5 marks]
Use a logistic regression model with admin as the response. Include deg
(variable of interest), gender (confounding variable) and startyr (has a
direct effect on the response) as explanatory variables. We don’t include
rank (on an indirect causal pathway) or field (does not have a direct
effect on response).
(b) The following logistic regression model can be used to investigate the direct
effects of rank, gender, deg and startyr on the chances of an academic
family=binomial, data=salary.df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.833494 5.994717 -2.808 0.004984 **
rankAssoc 2.935786 1.034217 2.839 0.004530 **
rankFull 3.676929 1.035122 3.552 0.000382 ***
genderM -0.030082 0.218661 -0.138 0.890579
degPhD -0.189948 0.297217 -0.639 0.522766
degProf 0.472800 0.387233 1.221 0.222098
startyr 0.310542 0.158666 1.957 0.050323 .
I(startyr^2) -0.002046 0.001032 -1.983 0.047379 *
Null deviance: 1077.92 on 1593 degrees of freedom
Residual deviance: 981.87 on 1586 degrees of freedom
i. Use this model to estimate the difference in the log(odds) of receiving
degree is a PhD and an academic who has a professional degree (given
the levels of the remaining explanatory variables are fixed).
The difference in the log odds will be .473− (−.190) = .663.
ii. Given that gender and deg are set to their baseline levels, what is the
estimated probability that a full professor with startyr= 70 receives
The log odds will be −16.8333 + 3.677 + .31054× 70− .002046× 702 =
−1.444.
Page 18 of 22
STATS 330
To convert log odds to a probability:
> exp(-1.444)/(1+exp(-1.444))
[1] 0.19093
iii. Does this model suggest that male academics are more apt to receive
levels of the remaining explanatory variables are fixed)? Explain your
The coefficient for genderM is not significant so this indicates that there
is no evidence that male academics are more likely to receive extra pay
for administrative work given that all of the other variables are fixed.
Page 19 of 22
STATS 330
(c) The logistic model that just has gender as an explanatory variable can be
used to investigate the total effect that gender has on admin.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4639 0.1841 -13.380 <2e-16 ***
genderM 0.4282 0.2053 2.086 0.037 *
Null deviance: 1077.9 on 1593 degrees of freedom
Residual deviance: 1073.3 on 1592 degrees of freedom
AIC: 1077.3
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1593 1077.9
gender 1 4.6659 1592 1073.2 0.03077 *
Based on this output, describe the total effect that gender has on admin.
[5 marks]
The total effect for gender indicates that women academics are less likely
to get extra pay for administrative work than male academics. Using this
model, the probably that a female academic gets extra pay for administra-
tive work is
> exp(-2.4639)/(1+exp(-2.4639))
[1] 0.078428
Whereas, the probability for a male academic is
> exp(-2.4639+.4282)/(1+exp(-2.4639+.4282))
[1] 0.11551
(d) Using the results from (b) and (c), describe the nature of the relationship
someone who is not familiar with the terms “direct effect” and “total effect.”
[6 marks]
The results from (b)ii indicate that if we have a male academic and a fe-
male academic that have identical values for all of the other regressors, then
Page 20 of 22
STATS 330
their probabilities of getting extra pay for administrative work are approx-
imately the same. The result in (c) indicates that a smaller percentage of
women get extra pay for administrative work than men. The explanation
for this apparent discrepancy is that gender may affect the values of some
of the other regressors which impact the probability of getting extra pay for
administrative work. For example, we see that a professor is more apt to
get extra pay for administrative work than either an associate or assistant
professor. The result from (b)ii indicates a female professor is as likely to
get extra pay for administrative work as a male professor. However, if a
female academic is less likely to be a professor this would mean that female
academics are also less likely to get extra pay for administrative work.
Page 21 of 22
STATS 330
Some Useful equations
If Yi ∼ Normal(µi, σ2), then
E(Yi) = µi,
Var(Yi) = σ
2.
If Yi ∼ Binomial(ni, pi), then
E(Yi) = nipi,
Var(Yi) = nipi(1− pi).
If Yi ∼ Beta− Binomial(ni, pi, ρi), then
E(Yi) = nipi,
Var(Yi) = nipi(1− pi)(1 + ρ(ni − 1)).
Here, ρ will appear as rho in any R-output. (Also, sometimes, called θ (or theta in R
)).
If Yi ∼ Poisson(µi), then
E(Yi) = µi, (1)
Var(Yi) = µi. (2)
If Yi ∼ negativebinomial(µi, θ), then
E(Yi) = µi,
Var(Yi) = µi +
µ2i
θ
Here, θ will appear as theta in any R-output.
For model fitting a model mod: AIC = 2(k − LLmod) where k is the number of
parameters used in describing the model and LLmod is the log likelihood.
Page 22 of 22

Email:51zuoyejun

@gmail.com