# Tutoring case: MATH3821

UNSW SYDNEY
SCHOOL OF MATHEMATICS AND STATISTICS
Midterm test 2020
MATH3821
Statistical Modelling and Computing
(1) TIME ALLOWED – 2 HOURS
(2) TOTAL NUMBER OF QUESTIONS – 1
(4) THE QUESTIONS ARE NOT OF EQUAL VALUE
(5) THIS PAPER MAY BE RETAINED BY THE CANDIDATE
Instructions:
• Fill in (between the " ") your familyname, othername and studentnumber (top of the file).
• Click on Knit. This should create and open the resulting PDF file.
• Save your work (the Rmd file) very regularly.
• Click on Knit each time you have completed a chunk, and check the output in the PDF file.
• Submit your PDF and Rmd files via the submission link on Moodle prior to the deadline.
1. [33 marks]
The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15
and 64 from three rural areas in South Africa (Rousseauw et al., 1983). The outcome
Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are nine
covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein
cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior),
obesity, alcohol (current alcohol consumption), and age. We will use the data available
in the file coris.txt.
a) [1 mark]
Read the data file coris.txt into a dataframe called coris.df using the read.table()
function. You will need to use the argument sep = ",". Then use the str function to
gain some understanding about the data set.
b) [2 marks] Find the proportion of males (prop.chd) in the study that have coronary
heart disease. Find the odds (odds) of coronary heart disease. Find the log odds
(logodds) of coronary heart disease.
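For reference, a sketch of the standard definitions, writing $p$ for the proportion prop.chd (the symbol $p$ is introduced here for illustration and does not appear in the question):

```latex
\[
\text{odds} = \frac{p}{1-p},
\qquad
\text{logodds} = \log\!\left(\frac{p}{1-p}\right)
\]
```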
c) [4 marks] Men with a family history of coronary heart disease are more likely to have
coronary heart disease than those without one. Estimate the proportions with coronary
heart disease among those with a family history (prop.chd.famhist) and the others
without a family history (prop.chd.oth). Estimate the odds ratio directly from the
variables prop.chd.famhist and prop.chd.oth. Find this same value using the glm()
function. Test its significance (what is the p-value?).
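As a sketch of the quantity being asked for, write $p_1$ for prop.chd.famhist and $p_0$ for prop.chd.oth (both symbols are shorthand introduced here). Assuming the glm() fit is a logistic regression with famhist as its only predictor, and with the group without a family history as the reference level, the exponentiated slope estimate reproduces the same odds ratio:

```latex
\[
\widehat{\mathrm{OR}}
  = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
  = \exp\bigl(\hat{\beta}_1\bigr)
\]
```

Here $\hat{\beta}_1$ denotes the estimated famhist coefficient in that one-predictor model.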
d) [3 marks]
Fit an appropriate regression model including an intercept term, with the presence of
coronary heart disease (Y) as the response. We will use the predictors in the following
order: systolic blood pressure (sbp), tobacco (tobacco), age (age), obesity (obesity),
alcohol (alcohol) and family history of coronary heart disease (famhist). Do not forget
to encode categorical or binary predictors as factors. Produce output that shows which
explanatory variables have a significant effect at the five percent level and comment on
the results.
e) [2 marks]
Are you surprised by the fact that systolic blood pressure is not significant or by the
minus sign for the obesity and alcohol coefficients? Explain why or why not.
f) [2 marks]
Compute and interpret carefully the odds ratio for family history of coronary heart
disease (famhist) based on the regression model in part (d).
g) [4 marks]
Test the significance of famhist using a deviance approach based on the regression model
in part (d). You will need to provide the decrease in deviance (famhist.deviance)
when the variable famhist is removed from the model. What is the associated p-value
as output by the R function you used? You will also use the pchisq() function on the
famhist.deviance variable to confirm this finding. What is your conclusion based on
the reduction in deviance?
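One way to write the test, assuming famhist enters the model as a single binary indicator so that removing it drops exactly one parameter. The symbols $D_0$ and $D_1$ are notation introduced here for the residual deviances of the models without and with famhist:

```latex
\[
\Delta D = D_0 - D_1,
\qquad
p\text{-value} = P\!\left(\chi^2_1 > \Delta D\right)
\]
```

In this notation $\Delta D$ plays the role of famhist.deviance, and the upper-tail chi-squared probability is the kind of quantity pchisq() can return.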
h) [2 marks]
For each individual predict the probability that they will NOT have coronary heart
disease and compute the average of these values. Compare this average value with the
observed proportion.
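A short reasoning step that may help with the comparison, assuming the model in part (d) is a logistic regression with an intercept fitted by maximum likelihood ($\hat{p}_i$ denotes the fitted probability of disease for individual $i$, notation introduced here). The score equation for the intercept forces the fitted probabilities to sum to the number of observed cases, up to numerical convergence tolerance:

```latex
\[
\sum_{i=1}^{n} \bigl(y_i - \hat{p}_i\bigr) = 0
\quad\Longrightarrow\quad
\frac{1}{n}\sum_{i=1}^{n} \bigl(1 - \hat{p}_i\bigr)
  = 1 - \frac{1}{n}\sum_{i=1}^{n} y_i
\]
```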
i) [2 marks]
Suppose we are interested in predicting a male's systolic blood pressure (sbp) based on
the individual's obesity (obesity) level. Estimate the regression function r(·) by cubic smoothing
spline regression. Let’s call this estimate $\hat{r}$. You will use the value 0.1 for the lambda
argument. You will store the results of your estimation in a variable called res.smooth.
Display the content of res.smooth.
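For orientation, a cubic smoothing spline is commonly defined as the minimiser of a penalised residual sum of squares. The sketch below uses $x_i$ for obesity and $y_i$ for sbp (notation introduced here). The exact scaling of $\lambda$ differs between textbooks and software, so the value 0.1 should be read in the convention of the function being used:

```latex
\[
\hat{r} = \arg\min_{r}\;
  \sum_{i=1}^{n} \bigl(y_i - r(x_i)\bigr)^2
  + \lambda \int r''(x)^2 \, dx
\]
```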
j) [1 mark]
Produce a scatterplot of sbp against obesity and then add the smoother to the scatterplot.
k) [5 marks]
Create and then plot the Generalised Cross-Validation score GCV versus lambda, for
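values λ = 0.008 + i × 0.000001, i = 0, . . . , 1000. Note that the formula for GCV is given
by
$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n\,\bigl(1 - \operatorname{tr}(S_\lambda)/n\bigr)^2},$$
where $\hat{y}_i$ is the fitted value for observation $i$ and $S_\lambda$ is the smoothing matrix with $\operatorname{tr}(S_\lambda) = \mathrm{df}$.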
l) [1 mark]
What value of lambda do you recommend to choose now instead of the one used in (i)?
m) [4 marks] Compute the density estimate for the variable sbp. Produce a variability
plot with a 94% confidence interval and add it to the plot. For the variability plot,
generate 1000 bootstrap resamples and evaluate the density function at 100 equally
spaced points over the range of the variable sbp. Label the plot appropriately.
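One common way to build such a band is a pointwise percentile interval. This is a sketch under that assumption, and the course may intend a different construction. Here $\hat{f}^{*}_{b}(x_j)$ denotes the density estimate from bootstrap resample $b$ evaluated at grid point $x_j$, and $q_{\alpha}$ the empirical $\alpha$-quantile taken over the resamples (all notation introduced here). For a 94% level, the lower and upper limits at each grid point are the 3% and 97% quantiles:

```latex
\[
\Bigl[\,
  q_{0.03}\bigl\{\hat{f}^{*}_{b}(x_j)\bigr\}_{b=1}^{1000},\;
  q_{0.97}\bigl\{\hat{f}^{*}_{b}(x_j)\bigr\}_{b=1}^{1000}
\,\Bigr],
\qquad j = 1, \dots, 100
\]
```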