STATS 330

THE UNIVERSITY OF AUCKLAND

SEMESTER TWO 2020

Campus: City

STATISTICS

Advanced Statistical Modelling

(Time allowed: TWO hours)

INSTRUCTIONS

• Attempt ALL 3 questions.

• There are a total of 80 marks for this examination.

• There is a list of useful formulae on the last page of this exam.

Page 1 of 22

STATS 330

1. [35 marks] The data described below show the free-throw results obtained

by the Los Angeles Lakers player Shaq O’Neal in 23 NBA play-off games in the

year 2000. (For those unfamiliar with basketball, a free throw is when a player

is allowed to take an unopposed shot at the basket from the free-throw line.

Thus in game 1, O’Neal attempted 5 free throws, 4 of which were successful.)

People are interested in whether the number of attempted free throws and the

proportion that were successful changes as the play-offs progressed (i.e. with

increasing game number).

The data in the file Shaq.df consists of:

game the game number

n the number of free throws

s the number of successful free throws.

> head(Shaq.df)

game s n

1 1 4 5

2 2 5 11

3 3 5 14

4 4 5 12

5 5 2 7

6 6 7 10

Page 2 of 22

STATS 330

Consider the plot and fitted model below:

> plot(n~game, data=Shaq.df)

> n.mod1=glm(n~game, family=poisson, data=Shaq.df)

5 10 15 20

5

10

20

30

40

game

n

(a) For n.mod1, what is (i) the assumed relationship between game and the

number of free throws n, and (ii) the assumed distribution of its response

value? [5 marks]

log(µi) = β0 + gamei

Yi ∼ Poisson(µi),

where, for the ith observation, Yi is the number of free throws gamei = i

is the game number.

Consider the following code:

> n.mod1=glm(n~game, family=poisson, data=Shaq.df)

> 1 - pchisq(deviance(n.mod1), df.residual(n.mod1))

[1] 4.7952e-08

> n.mod2=glm.nb(n~game, data=Shaq.df)

> 1 - pchisq(deviance(n.mod2), df.residual(n.mod2))

[1] 0.36648

Page 3 of 22

STATS 330

(b) Using this output explain why we prefer n.mod2 over n.mod1 in terms of

model adequacy. [5 marks]

The GOF test:

1 - pchisq(deviance(n.mod1), df.residual(n.mod1))= 4.7952 × 10−8

indicating we have extremely strong evidence against the assumption that the

data comes from a Poisson distribution.

For the negative binomial model we have 1 - pchisq(deviance(n.mod2),

df.residual(n.mod2))= 0.36648 indicating a good fit for these data.

Page 4 of 22

STATS 330

Consider this negative binomial analysis of these data:

> summary(n.mod2)

Call:

glm.nb(formula = n ~ game, data = Shaq.df, init.theta = 6.075494072,

link = log)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.769 -0.669 -0.236 0.318 2.536

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.3260 0.2162 10.76 <2e-16 ***

game 0.0185 0.0155 1.19 0.23

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(6.0755) family taken to be 1)

Null deviance: 23.930 on 22 degrees of freedom

Residual deviance: 22.585 on 21 degrees of freedom

AIC: 152.3

Number of Fisher Scoring iterations: 1

Theta: 6.08

Std. Err.: 2.54

2 x log-likelihood: -146.28

> n.mod3=glm.nb(n~1, data=Shaq.df)

> 1 - pchisq(deviance(n.mod3), df.residual(n.mod3))

[1] 0.41698

> confint(n.mod3)

Waiting for profiling to be done...

2.5 % 97.5 %

2.3510 2.7643

(c) The estimate and its standard error associated with the variable game (from

the model n.mod2) are 0.0185 and 0.0155, respectively. Explain how we

obtain the z value of 1.19.

[5 marks]

Page 5 of 22

STATS 330

The null hypothesis assumes H0 : β1 = 0. and the Z value=

(βˆ1 − 0)

se(βˆ1)

and

so Z value=

(0.0185− 0)

0.0155

= 1.19.

Page 6 of 22

STATS 330

Consider another negative binomial analysis of these data:

> n.mod3=glm.nb(n~1, data=Shaq.df)

> summary(n.mod3)

Call:

glm.nb(formula = n ~ 1, data = Shaq.df, init.theta = 5.636962983,

link = log)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.797 -0.564 -0.136 0.170 2.807

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.555 0.105 24.3 <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(5.637) family taken to be 1)

Null deviance: 22.732 on 22 degrees of freedom

Residual deviance: 22.732 on 22 degrees of freedom

AIC: 151.6

Number of Fisher Scoring iterations: 1

Theta: 5.64

Std. Err.: 2.31

2 x log-likelihood: -147.59

> 1 - pchisq(deviance(n.mod3), df.residual(n.mod3))

[1] 0.41698

> confint(n.mod3)

Waiting for profiling to be done...

2.5 % 97.5 %

2.3510 2.7643

(d) For model n.mod3 calculate the value of the Pearson residual associated

with the first play-off game. [5 marks]

n=5, and yˆ1 = µˆ1 = exp(β̂0) = exp(2.55487) = 12.86957.

Page 7 of 22

STATS 330

V̂ar(yˆ1) = µˆ1 +

µˆ1

2

θˆ

= 12.86957 +

12.869572

5.63696

= 42.25165.

ri =

yi − yˆ1√

Var(yˆ1)

=

5− 12.86957√

42.25165

= −1.2107.

> residuals(n.mod3, "pearson")[1]

1

-1.2107

Page 8 of 22

STATS 330

Consider the following analysis of the proportion of successful free throws by

O’Neil over this play-off season.

> plot(I(s/n)~game, data=Shaq.df)

5 10 15 20

0.

2

0.

4

0.

6

0.

8

game

I(s

/n)

> bin.mod1=glm(cbind(s, n-s)~game, family=binomial, data=Shaq.df)

> summary(bin.mod1)

Call:

glm(formula = cbind(s, n - s) ~ game, family = binomial, data = Shaq.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.463 -1.039 -0.352 0.984 2.787

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.0128 0.2621 0.05 0.96

game -0.0159 0.0185 -0.86 0.39

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 33.376 on 22 degrees of freedom

Residual deviance: 32.633 on 21 degrees of freedom

AIC: 99.8

Number of Fisher Scoring iterations: 3

Page 9 of 22

STATS 330

> bin.mod2=glm(cbind(s, n-s)~1, family=binomial, data=Shaq.df)

> summary(bin.mod2)

Call:

glm(formula = cbind(s, n - s) ~ 1, family = binomial, data = Shaq.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.663 -0.947 -0.182 1.013 2.758

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.190 0.117 -1.63 0.1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 33.376 on 22 degrees of freedom

Residual deviance: 33.376 on 22 degrees of freedom

AIC: 98.55

Number of Fisher Scoring iterations: 3

> 1-pchisq(deviance(bin.mod2),df.residual(bin.mod2))

[1] 0.056781

> # est

> exp(coef(bin.mod2))/(1+exp(coef(bin.mod2)))

(Intercept)

0.4527

> #Confidence interval

> ci=confint(bin.mod2)

Waiting for profiling to be done...

> exp(ci)/(1+exp(ci))

2.5 % 97.5 %

0.39659 0.50963

(e) It appears there is some evidence that these data are not consistent with

a Binomial distribution. This observation is based on using a chi-squared

distribution to approximate the sampling distribution of the residual de-

viance. Explain why the use of a chi-squared distribution may not be valid

in this case. [5 marks]

The chi-squared approximation works reasonably well provided n ≥ 5 when

Page 10 of 22

STATS 330

p ≈ .5. Here, p = 0.4527 and we see that we that we are very close to this

limit.

With the above comment in mind, the following simulation was performed:

> set.seed(123)

> Nsim=1e4

> pdevs=rep(0,Nsim)

> probs=predict(bin.mod2,type="response")

>

> for ( i in 1:Nsim){

ysim=rbinom(

n=nrow(Shaq.df),size=Shaq.df$n, prob=probs)

mod=glm(cbind(ysim,Shaq.df$n-ysim)~1,family=binomial)

pdevs[i]=deviance(mod)}

>

> sum(pdevs>=deviance(bin.mod2))/Nsim

[1] 0.0932

(f) What do we conclude from the above analysis? Explain briefly. [5 marks]

We see that the p-value= 0.0932 associated with GOF is larger. We have, at

best, weak evidence to believe that the binomial assumption may be invalid—or

not!

Assume the model bin.mod2 can be used for interpretation.

(g) Write a brief executive summary about the number of free throws and their

success rate for Shaq O’Neal in these play-offs. [5 marks]

We are interested in the number of free throws and subsequent success of Shaq

ONeal during the 2000 play-off season of basketball.

Over this period of play-offs we found no evidence of change for both of these

measures.

We estimate that Shaq ONeal, on average, between 10.5 and 15.9 free throws

with the proportion of successful throws being somewhere between 40% and

51% of the time for each game.

Page 11 of 22

STATS 330

2. [20 marks] Roberts and Foppa (Vector-Borne and Zoonotic Diseases, 2006,

6, 1–6) used Poisson regression to model the occurrence of West Nile virus

(WNV) in horses. WNV is a disease that attacks the central nervous system and

its effects can be fatal. Birds are the most commonly affected animal but WNV

also affects mammals including horses and humans. WNV is spread mainly by

mosquitos, with birds being the primary hosts for the virus. Typically, bird

mortality due to WNV precedes human and equine infection and thus dead bird

surveillance is used to monitor the risk to human and horse populations. Es-

sentially, people are asked to report cases of dead birds which are then collected

and tested. As population increases it is expected that more dead birds will be

reported and thus the number of positive tests will be higher. To compensate,

the “positive bird rate” (the number of dead birds that tested positive for WNV

divided by the population of the county) is used as the explanatory variable of

main interest.

Data collected for 25 counties in South Carolina, USA in 2003 were used to fit

a model that involved the following variables:

equine: The number of cases of WNV in horses (response variable).

farms: The number of farms in the county (offset variable).

PBR: The positive bird rate per 10,000 population (explanatory variable of main

interest).

density: The population density recorded as people per square mile (explanatory

variable).

The following output shows the first 6 observations in the data set and a

summary of the variables:

> head(WNV.df)

county equine farms density PBR

1 Abbeville 0 471 51.50 0.000

2 Aiken 6 729 132.86 0.982

3 Allendale 0 114 27.51 1.782

4 Anderson 1 1271 230.81 0.543

5 Bamberg 2 254 42.43 0.600

6 Barnwell 1 325 42.83 3.834

> summary(WNV.df)

county equine farms density PBR

Abbeville: 1 Min. : 0.00 Min. : 92 Min. : 27.5 Min. :0.000

Aiken : 1 1st Qu.: 0.00 1st Qu.: 254 1st Qu.: 49.0 1st Qu.:0.223

Allendale: 1 Median : 0.00 Median : 348 Median : 74.2 Median :0.561

Anderson : 1 Mean : 1.17 Mean : 438 Mean :123.1 Mean :0.809

Bamberg : 1 3rd Qu.: 1.00 3rd Qu.: 590 3rd Qu.:157.3 3rd Qu.:0.979

Barnwell : 1 Max. :10.00 Max. :1271 Max. :480.6 Max. :3.923

(Other) :40

Page 12 of 22

STATS 330

Consider the following negative binomial regression model:

> NB.glm=glm.nb(equine~density*PBR+offset(log(farms)),link = log,

control=glm.control(maxit=50),data=WNV.df)

> summary(NB.glm)

Call:

glm.nb(formula = equine ~ density * PBR + offset(log(farms)),

data = WNV.df, control = glm.control(maxit = 50), link = log,

init.theta = 2.39092649)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.277 -0.981 -0.752 0.313 1.956

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -6.915243 0.434071 -15.93 <2e-16 ***

density -0.000689 0.002299 -0.30 0.764

PBR 0.451653 0.240296 1.88 0.060 .

density:PBR 0.004464 0.002292 1.95 0.051 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.3909) family taken to be 1)

Null deviance: 72.300 on 45 degrees of freedom

Residual deviance: 45.246 on 42 degrees of freedom

AIC: 121.8

Number of Fisher Scoring iterations: 1

Theta: 2.39

Std. Err.: 1.91

2 x log-likelihood: -111.82

(a) The main purpose of this model was to relate the number of cases of WNV

in horses equine to the positive bird rate PBR.

i. The model includes log(farms) as an offset. How does this affect the

interpretation of this model?

Including log(farms) as an offset means that the model is exploring the

number of cases of WNR in horses per farm.

ii. The model includes density as an explanatory variable that interacts

with PBR. As a result the relationship between equine and PBR depends

on the level of density. Compare the relationship between equine and

PBR for a county with a density of 100 people per square mile to a

county with a density of 200 people per square mile.

Page 13 of 22

STATS 330

For 100 people per square mile the fitted model is:

log µˆ = −6.9152− .000689× 100 + (.4516 + .004464× 100)× PBR

+ log(farms)

= −6.984 + .898× PBR + log(farms)

For 200 people per square mile the fitted model is:

log µˆ = −6.9152− .000689× 200 + (.4516 + .004464× 200)× PBR

+ log(farms)

= −7.053 + 1.344× PBR + log(farms)

For both densities the expected of cases of WNV increases with increas-

ing PBR. However, the rate of increase is higher for a density of 200

than a density of 100.

iii. Based on this model, what is the fitted value of equine for Abbeville

county? Note that the values of the explanatory variables for Abbeville

are given on the previous page.

Plugging in the values of the explanatory variables for Abbeville gives:

log µˆ = −6.9152−.000689×51.5+(.4516+.004464×51.5)×0+log(471) = −.7958

.

Thus the fitted value of equine for Abbeville is exp(−.7958) = .45.

(b) Consider the following output

> anova(NB.glm,test="Chisq")

Analysis of Deviance Table

Model: Negative Binomial(2.3909), link: log

Response: equine

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev Pr(>Chi)

NULL 45 72.3

Page 14 of 22

STATS 330

density 1 0.00 44 72.3 0.971

PBR 1 23.89 43 48.4 1e-06 ***

density:PBR 1 3.16 42 45.2 0.075 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

i. The output from anova() gives a different p-value for density:PBR

than does the output from summary() even though they are both testing

the hypothesis that the interaction is not needed in the model. Explain

why this occurs.

The test given in output from summary() is based on the assumption

that the distribution of the fitted coefficient is aprproximately Normal.

The test given in output from anova() is based on the assumption

that the distribution of the difference in residual deviance (between the

model that does not contain the interaction term and the model that

does) is approximately χ2 with 1 df. As the tests are based on different

assumptions they give different p-values.

ii. Suppose that you wish to test the hypothesis that the interaction is not

needed in the model without assuming that the change in deviance will

have a chi-squared distribution. Explain how you would do this.

For this test, the null hypothesis is that the interaction is not needed

in the model. Thus we need to generate a reference distribution for the

deviance change when the interaction is not needed in the model. The

only technique we have to do this is to use a parametric bootstrap.

Fit the model that does not include the density:PBR interaction.

Use this model to generate a large number of new sets of data Note

that the same sets of values are used for the explanatory variables

but new values of the response are generated.

For each of these data sets fit both model that contains the intera-

tion and the model that doesn’t. Record the difference in residual

deviances for these two models to create the reference distribution.

Estimate the p-value as the proportion of values in the reference

distribution that are greater than the deviance change (3.16) that

was observed for the actual data.

(c) Suppose that this model is to be used to predict cases of WNV in horses in

other counties. Explain how a realistic estimate of the MSPE (mean square

prediction error) could be obtained.

Page 15 of 22

STATS 330

Given the relatively small number of observations, we would need to use

cross validation. This technique randomly divides the data into 10 parts: 9

parts (training set) are used to fit the model which is then used to predict

the values of equine for the remaining part (test set) and the MSPE is

estimated based on these predictions. Each of the 10 parts is used once as

the training set and the average of the 10 estimates of MSPE is taken. This

procedure is then repeated for a large number of different random splits of

the data. The average of all the estimates of MSPE is taken and used as

our final estimate.

Page 16 of 22

STATS 330

3. [25 marks] In Lecture Handout 16, data concerning academics at the Uni-

versity of Washington were used as an example for causal analysis. For this

question, we are going to consider the following subset of the variables from

that example:

gender: F (female) or M (male).

deg: highest degree is a PhD (PhD), Prof (a professional degree) or other (any

other type of degree).

field: is Arts (visual and performing arts), Prof (medicine, law, nursing . . . ) or

other (any other field).

startyr: year of first employment.

rank: in 1995 is Assist (assistant professor), Assoc (associate professor) or Full

(full professor).

admin: is 0 (no extra pay for admin duties) or 1 (extra pay for admin duties).

Consider using these data to investigate the causal impact on admin of the other

variables. Assume that the following causal diagram as being appropriate for

this situation.

field deg startyr

gender

admin

rank

Based on this diagram it can be deduced that:

rank, gender, deg and startyr have direct effects on admin and to esti-

mate these direct effects we should use a model that contains rank, gender,

deg and startyr as explanatory variables.

to estimate the total effect that gender has on admin, we should use a

model that just uses gender as an explanatory variable.

Page 17 of 22

STATS 330

(a) What model would you use to estimate the total effect of deg on admin?

Explain your choice of model. [5 marks]

Use a logistic regression model with admin as the response. Include deg

(variable of interest), gender (confounding variable) and startyr (has a

direct effect on the response) as explanatory variables. We don’t include

rank (on an indirect causal pathway) or field (does not have a direct

effect on response).

(b) The following logistic regression model can be used to investigate the direct

effects of rank, gender, deg and startyr on the chances of an academic

receiving extra pay for administrative duties (admin).

> admin2.glm= glm(admin~rank+gender+deg+startyr+I(startyr^2),

family=binomial, data=salary.df)

> summary(admin2.glm)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -16.833494 5.994717 -2.808 0.004984 **

rankAssoc 2.935786 1.034217 2.839 0.004530 **

rankFull 3.676929 1.035122 3.552 0.000382 ***

genderM -0.030082 0.218661 -0.138 0.890579

degPhD -0.189948 0.297217 -0.639 0.522766

degProf 0.472800 0.387233 1.221 0.222098

startyr 0.310542 0.158666 1.957 0.050323 .

I(startyr^2) -0.002046 0.001032 -1.983 0.047379 *

Null deviance: 1077.92 on 1593 degrees of freedom

Residual deviance: 981.87 on 1586 degrees of freedom

i. Use this model to estimate the difference in the log(odds) of receiving

extra pay for administrative duties between an academic whose highest

degree is a PhD and an academic who has a professional degree (given

the levels of the remaining explanatory variables are fixed).

The difference in the log odds will be .473− (−.190) = .663.

ii. Given that gender and deg are set to their baseline levels, what is the

estimated probability that a full professor with startyr= 70 receives

extra pay for administrative duties?

The log odds will be −16.8333 + 3.677 + .31054× 70− .002046× 702 =

−1.444.

Page 18 of 22

STATS 330

To convert log odds to a probability:

> exp(-1.444)/(1+exp(-1.444))

[1] 0.19093

iii. Does this model suggest that male academics are more apt to receive

extra pay for administrative duties than female academics (given the

levels of the remaining explanatory variables are fixed)? Explain your

answer.

The coefficient for genderM is not significant so this indicates that there

is no evidence that male academics are more likely to receive extra pay

for administrative work given that all of the other variables are fixed.

Page 19 of 22

STATS 330

(c) The logistic model that just has gender as an explanatory variable can be

used to investigate the total effect that gender has on admin.

> admin5.glm= glm(admin~gender,family=binomial, data=salary.df)

> summary(admin5.glm)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.4639 0.1841 -13.380 <2e-16 ***

genderM 0.4282 0.2053 2.086 0.037 *

Null deviance: 1077.9 on 1593 degrees of freedom

Residual deviance: 1073.3 on 1592 degrees of freedom

AIC: 1077.3

> anova(admin5.glm,test="Chisq")

Df Deviance Resid. Df Resid. Dev Pr(>Chi)

NULL 1593 1077.9

gender 1 4.6659 1592 1073.2 0.03077 *

Based on this output, describe the total effect that gender has on admin.

[5 marks]

The total effect for gender indicates that women academics are less likely

to get extra pay for administrative work than male academics. Using this

model, the probably that a female academic gets extra pay for administra-

tive work is

> exp(-2.4639)/(1+exp(-2.4639))

[1] 0.078428

Whereas, the probability for a male academic is

> exp(-2.4639+.4282)/(1+exp(-2.4639+.4282))

[1] 0.11551

(d) Using the results from (b) and (c), describe the nature of the relationship

between gender and the probability that an academic receives extra pay

for administrative duties. Your explanation should be understandable for

someone who is not familiar with the terms “direct effect” and “total effect.”

[6 marks]

The results from (b)ii indicate that if we have a male academic and a fe-

male academic that have identical values for all of the other regressors, then

Page 20 of 22

STATS 330

their probabilities of getting extra pay for administrative work are approx-

imately the same. The result in (c) indicates that a smaller percentage of

women get extra pay for administrative work than men. The explanation

for this apparent discrepancy is that gender may affect the values of some

of the other regressors which impact the probability of getting extra pay for

administrative work. For example, we see that a professor is more apt to

get extra pay for administrative work than either an associate or assistant

professor. The result from (b)ii indicates a female professor is as likely to

get extra pay for administrative work as a male professor. However, if a

female academic is less likely to be a professor this would mean that female

academics are also less likely to get extra pay for administrative work.

Page 21 of 22

STATS 330

Some Useful equations

If Yi ∼ Normal(µi, σ2), then

E(Yi) = µi,

Var(Yi) = σ

2.

If Yi ∼ Binomial(ni, pi), then

E(Yi) = nipi,

Var(Yi) = nipi(1− pi).

If Yi ∼ Beta− Binomial(ni, pi, ρi), then

E(Yi) = nipi,

Var(Yi) = nipi(1− pi)(1 + ρ(ni − 1)).

Here, ρ will appear as rho in any R-output. (Also, sometimes, called θ (or theta in R

)).

If Yi ∼ Poisson(µi), then

E(Yi) = µi, (1)

Var(Yi) = µi. (2)

If Yi ∼ negativebinomial(µi, θ), then

E(Yi) = µi,

Var(Yi) = µi +

µ2i

θ

Here, θ will appear as theta in any R-output.

For model fitting a model mod: AIC = 2(k − LLmod) where k is the number of

parameters used in describing the model and LLmod is the log likelihood.

Page 22 of 22

欢迎咨询51作业君