UNSW SYDNEY

SCHOOL OF MATHEMATICS AND STATISTICS

Midterm test 2020

MATH3821

Statistical Modelling and Computing

(1) TIME ALLOWED – 2 HOURS

(2) TOTAL NUMBER OF QUESTIONS – 1

(3) ANSWER ALL QUESTIONS

(4) THE QUESTIONS ARE NOT OF EQUAL VALUE

(5) THIS PAPER MAY BE RETAINED BY THE CANDIDATE

Instructions:

• Download and Open (click on) the file mid-2020.Rmd.

• Fill in (between the " ") your familyname, othername and studentnumber (top of the file).

• Click on Knit. This should create and open the resulting PDF file.

• Save your work (the Rmd file) very regularly.

• Click on Knit each time you have completed a chunk, and check the output in the PDF file.

• Submit your PDF and Rmd files via the submission link on Moodle prior to the deadline.

1. [33 marks]

The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15

and 64 from three rural areas in South Africa (Rousseauw et al., 1983). The outcome

Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are nine

covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein

cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior),

obesity, alcohol (current alcohol consumption), and age. We will use data which is available

in the file coris.txt.

a) [1 mark]

Read the data file coris.txt into a dataframe called coris.df using the read.table()

function. You will need to use the argument sep = ",". Then use the str() function to

gain some understanding about the data set.
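
As a minimal sketch of this reading step, the code below writes a tiny stand-in file to a temporary location and reads it back. The column names and values are made up for illustration and are not those of coris.txt, and header = TRUE is an assumption about the file layout.

```r
# Write a small stand-in file (made-up values, NOT the real coris.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("sbp,tobacco,famhist,chd",
             "160,12.00,1,1",
             "144,0.01,0,1",
             "118,0.08,1,0"), tmp)

# sep = "," because the fields are comma-separated;
# header = TRUE is an assumption about the file layout
toy.df <- read.table(tmp, sep = ",", header = TRUE)

# Compact overview: number of rows, column names and column types
str(toy.df)
```

Checking the output of str() is a quick way to spot columns that were read as numbers but should be treated as factors.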

b) [2 marks] Find the proportion of males (prop.chd) in the study that have coronary

heart disease. Find the odds (odds) of coronary heart disease. Find the log odds

(logodds) of coronary heart disease.
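
The relationship between a proportion, the odds and the log odds can be sketched on a made-up 0/1 vector. The vector y and the variable names below are hypothetical and are not taken from the CORIS data.

```r
# Made-up 0/1 outcome vector (hypothetical, not the CORIS data)
y <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

toy.prop    <- mean(y)                     # proportion of ones
toy.odds    <- toy.prop / (1 - toy.prop)   # odds = p / (1 - p)
toy.logodds <- log(toy.odds)               # log odds, i.e. logit(p)

c(prop = toy.prop, odds = toy.odds, logodds = toy.logodds)
```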

c) [4 marks] Men with a family history of coronary heart disease are more likely to have

coronary heart disease than those without such a history. Estimate the proportions with coronary

heart disease among those with a family history (prop.chd.famhist) and the others

without a family history (prop.chd.oth). Estimate the odds ratio directly from the

variables prop.chd.famhist and prop.chd.oth. Find this same value using the glm()

function. Test its significance (what is the p-value?).
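
The link between the two routes to an odds ratio can be sketched on made-up counts. The data frame toy, the factor fh and all counts below are hypothetical. With a single binary predictor the logistic model is saturated, so the exponentiated slope should agree with the ratio of the two group odds.

```r
# Made-up data: 20 subjects per group (hypothetical, not the CORIS data)
toy <- data.frame(
  y  = c(rep(1, 12), rep(0, 8), rep(1, 6), rep(0, 14)),
  fh = factor(c(rep("yes", 20), rep("no", 20)), levels = c("no", "yes"))
)

p.yes <- mean(toy$y[toy$fh == "yes"])   # 12 / 20
p.no  <- mean(toy$y[toy$fh == "no"])    #  6 / 20

# Odds ratio computed directly from the two proportions
or.direct <- (p.yes / (1 - p.yes)) / (p.no / (1 - p.no))

# Same quantity from a logistic regression with one binary predictor
fit.or <- glm(y ~ fh, family = binomial, data = toy)
or.glm <- exp(coef(fit.or)[["fhyes"]])

# Wald z-test p-value for the log odds ratio
p.wald <- summary(fit.or)$coefficients["fhyes", "Pr(>|z|)"]

c(or.direct = or.direct, or.glm = or.glm, p.wald = p.wald)
```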

d) [3 marks]

Fit an appropriate regression model including an intercept term, with the presence of

coronary heart disease (Y) as the response. We will use the predictors in the following

order: systolic blood pressure (sbp), tobacco (tobacco), age (age), obesity (obesity),

alcohol (alcohol) and family history of coronary heart disease (famhist). Do not forget

to encode categorical or binary predictors as factors. Produce output that shows which

explanatory variables have a significant effect at the five percent level and comment on

the results.

e) [2 marks]

Are you surprised by the fact that systolic blood pressure is not significant or by the

minus sign for the obesity and alcohol coefficients? Explain why or why not.

f) [2 marks]

Compute and interpret carefully the odds ratio for family history of coronary heart

disease (famhist) based on the regression model in part (d).
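
An adjusted odds ratio can be read off a fitted logistic regression as the exponential of the corresponding coefficient. The sketch below uses simulated data; the variables x, g and y, the true coefficients and the Wald interval from confint.default() are illustrative choices, not part of the question.

```r
set.seed(1)
n <- 300
x <- rnorm(n)                                    # hypothetical continuous covariate
g <- factor(rbinom(n, 1, 0.4), levels = 0:1,
            labels = c("no", "yes"))             # hypothetical binary factor
y <- rbinom(n, 1, plogis(-1 + 0.5 * x + 0.8 * (g == "yes")))

fit.adj <- glm(y ~ x + g, family = binomial)

# Odds ratio for g = "yes" versus g = "no", holding x fixed
or.g <- exp(coef(fit.adj)[["gyes"]])

# Wald 95% confidence interval, transformed to the odds-ratio scale
ci.g <- exp(confint.default(fit.adj)["gyes", ])

or.g
ci.g
```

The interpretation is multiplicative: for two individuals with the same value of x, the odds of y = 1 in the "yes" group are estimated to be or.g times the odds in the "no" group.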

g) [4 marks]

Test the significance of famhist using a deviance approach based on the regression model

in part (d). You will need to provide the increase in deviance (famhist.deviance)

when the variable famhist is removed from the model. What is the associated p-value

as output by the R function you used? You will also use the pchisq() function on the

famhist.deviance variable to confirm this finding. What is your conclusion based on

the reduction in deviance?
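
A deviance comparison of two nested logistic models can be sketched on simulated data. The variables x, g and y are hypothetical, and anova() with test = "Chisq" is only one of several R functions that report this test.

```r
set.seed(2)
n <- 300
x <- rnorm(n)
g <- factor(rbinom(n, 1, 0.4), levels = 0:1, labels = c("no", "yes"))
y <- rbinom(n, 1, plogis(-1 + 0.5 * x + 0.8 * (g == "yes")))

full    <- glm(y ~ x + g, family = binomial)
reduced <- update(full, . ~ . - g)       # same model without g

# Difference in deviance between the two fits, and its degrees of freedom
dev.diff <- deviance(reduced) - deviance(full)
df.diff  <- df.residual(reduced) - df.residual(full)

# Chi-squared p-value computed by hand
p.manual <- pchisq(dev.diff, df = df.diff, lower.tail = FALSE)

# The same test as reported by anova()
p.anova <- anova(reduced, full, test = "Chisq")[2, "Pr(>Chi)"]

c(dev.diff = dev.diff, p.manual = p.manual, p.anova = p.anova)
```

The two p-values should agree, because anova() refers the same deviance difference to a chi-squared distribution with df.diff degrees of freedom.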

h) [2 marks]

For each individual predict the probability that they will NOT have coronary heart

disease and compute the average of these values. Compare this average value with the

observed proportion.
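
The comparison between averaged fitted probabilities and an observed proportion can be sketched on simulated data (x and y below are hypothetical). For a logistic regression with an intercept, the fitted probabilities sum to the number of observed events, so the two quantities are expected to agree up to numerical tolerance.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit.h <- glm(y ~ x, family = binomial)

p.event    <- predict(fit.h, type = "response")  # P(y = 1) for each individual
p.no.event <- 1 - p.event                        # P(y = 0) for each individual

avg.no.event <- mean(p.no.event)   # average predicted probability of no event
obs.no.event <- mean(y == 0)       # observed proportion without the event

c(average = avg.no.event, observed = obs.no.event)
```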

i) [2 marks]

Suppose we are interested in predicting a male's systolic blood pressure (sbp) based on

the individual's obesity (obesity) level. Estimate the regression function r(·) by cubic smoothing

spline regression. Let’s call this estimate r̂. You will use the value 0.1 for the lambda

argument. You will store the results of your estimation in a variable called res.smooth.

Display the content of res.smooth.

j) [1 mark]

Produce a scatterplot of sbp against obesity and then add the smoother to the scatterplot.
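
Fitting a cubic smoothing spline with a fixed penalty and overlaying it on a scatterplot can be sketched as follows. The data are simulated, and the names x, y and toy.smooth are hypothetical stand-ins rather than the variables of the question.

```r
set.seed(4)
x <- runif(150, 15, 45)                      # made-up predictor
y <- 120 + 0.8 * x + rnorm(150, sd = 15)     # made-up response

# Cubic smoothing spline with the penalty fixed through lambda
toy.smooth <- smooth.spline(x, y, lambda = 0.1)
toy.smooth    # prints lambda, equivalent degrees of freedom and GCV

# Scatterplot with the fitted curve added on top
plot(x, y, xlab = "x", ylab = "y", main = "Smoothing spline, lambda = 0.1")
lines(toy.smooth, col = "red", lwd = 2)
```

lines() can take the fitted object directly because it stores the fitted curve in its x and y components.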

k) [5 marks]

Create and then plot the Generalised Cross-Validation score GCV versus lambda, for

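values λ = 0.008 + i × 0.000001, i = 0, ..., 1000. Note that the formula for GCV is given

by

\[ \mathrm{GCV} = \frac{\mathrm{RSS}}{n\,\bigl(1 - \mathrm{tr}(S_\lambda)/n\bigr)^2}, \]

where $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $S_\lambda$ is the smoothing matrix with $\mathrm{tr}(S_\lambda) = \mathrm{df}$.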
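
The GCV formula above can be evaluated over a grid of penalties as sketched below. The data are simulated, the helper gcv.score() is hypothetical, and the grid of lambda values is deliberately coarse and different from the one in the question. The equivalent degrees of freedom reported by smooth.spline() are used for tr(S_lambda).

```r
set.seed(5)
n <- 150
x <- runif(n, 15, 45)
y <- 120 + 0.8 * x + rnorm(n, sd = 15)

# GCV for a single lambda: RSS / (n * (1 - df / n)^2)
gcv.score <- function(lambda, x, y) {
  fit <- smooth.spline(x, y, lambda = lambda)
  rss <- sum((y - predict(fit, x)$y)^2)   # residual sum of squares
  n   <- length(y)
  rss / (n * (1 - fit$df / n)^2)          # fit$df plays the role of tr(S_lambda)
}

lambdas <- 10^seq(-4, 0, length.out = 50)   # hypothetical grid
gcv     <- sapply(lambdas, gcv.score, x = x, y = y)

plot(lambdas, gcv, log = "x", type = "l", xlab = "lambda", ylab = "GCV")

# Grid value with the smallest GCV score
lambda.best <- lambdas[which.min(gcv)]
lambda.best
```

A minimum on the boundary of the grid would suggest widening the grid before settling on a value.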

l) [1 mark]

What value of lambda do you recommend to choose now instead of the one used in (i)?

m) [4 marks] Compute the density estimate for the variable sbp. Produce a variability

plot with a 94% confidence interval and add it to the plot. For the variability plot,

generate 1000 bootstrap resamples and evaluate the density function at 100 equally

spaced points over the range of the variable sbp. Label the plot appropriately.
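
One possible construction of a bootstrap variability band for a density estimate is sketched below on simulated data. Several choices here are assumptions rather than requirements of the question: the default Gaussian kernel, keeping the bandwidth fixed at its original value across resamples, and using pointwise percentile limits.

```r
set.seed(6)
x <- rnorm(200, mean = 138, sd = 20)   # made-up sample

n.grid <- 100
lo <- min(x)
hi <- max(x)

# Density estimate evaluated at 100 equally spaced points over the range of x
dens <- density(x, from = lo, to = hi, n = n.grid)

# Bootstrap: resample, re-estimate on the same grid with the same bandwidth
B <- 1000
boot <- replicate(B, {
  xb <- sample(x, replace = TRUE)
  density(xb, bw = dens$bw, from = lo, to = hi, n = n.grid)$y
})                                     # n.grid x B matrix

# Pointwise 94% percentile limits (3% in each tail)
alpha <- 0.06
band <- apply(boot, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2))

plot(dens, main = "Density estimate with 94% bootstrap band",
     xlab = "x", ylim = c(0, max(band)))
lines(dens$x, band[1, ], lty = 2)
lines(dens$x, band[2, ], lty = 2)
```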