STAT 4540: PROJECT – PART I
November 25, 2020
Deadline: December 2, 2020
Instructions:
1. There are two questions in this project. The first question focuses on regression and the second
is about classification.
2. Unless asked otherwise, always use 10-fold cross-validation to estimate the test error rate or
mean square error.
3. Show and explain all your work (even for conclusions). Partial credit cannot be given otherwise.
4. Points will be deducted for incorrect work even if the final answer is correct.
5. Attach your R code at the end of the project report as an appendix. R code cannot be a
replacement for explanations or comments.
6. Be concise. Explanations should not exceed 2 sentences unless mentioned otherwise.
Question   Points   Your Score
1          60
2          90
Total      150
Question 1
This question is about developing a statistical model for accurately predicting a score quantifying
Parkinson’s disease progression. The Unified Parkinson’s Disease Rating Scale (UPDRS) is scored
by a trained medical professional in a clinic to track the progression of Parkinson’s disease. The
motor UPDRS score is derived from the UPDRS and measures the motor damage caused by
Parkinson’s disease. Because computing the score in a clinic is time-consuming and expensive,
there have been various efforts to develop instruments that can be used for computing the motor
UPDRS score at a patient’s home. This increases patients’ convenience and reduces the expense of
keeping track of the disease. One such approach was developed by Tsanas et al. (2010). We
use Tsanas et al.’s data for predicting the motor UPDRS scores using linear and K-NN (nearest
neighbor) regression models.
The motor UPDRS score of a patient is computed by a medical professional after an interview.
A low motor UPDRS score indicates a healthy state, whereas a high score denotes severe motor
impairment. A more convenient alternative to the interviews is provided by the smart systems
tuned for monitoring patients at their homes. Tsanas et al. (2010) developed an approach to
replicate UPDRS assessment with clinically useful accuracy using noninvasive speech tests
administered at home. They use covariates from a previous study, which is available under the name
Parkinsons Telemonitoring at the UCI Machine Learning repository
(https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring). We have modified the original
data and the UPDRS scores and stored them in the file named park.csv. Table 1 provides a detailed description
of the variables in the modified data set. We treat motor UPDRS score as the response and use
polynomial linear and K-NN regression models to predict the motor UPDRS score of patients using
six biomedical voice measures.
Variable      Description
motor updrs   modified clinician’s motor UPDRS score
Abs, PPQ5     Measures of variation in fundamental frequency
dB, APQ11     Measures of variation in amplitude
NHR           A measure of the ratio of noise to tonal components in the voice
RPDE          A nonlinear dynamical complexity measure

Table 1: Description of the response and predictors in the park.csv file. The response is in the first
row and the remaining rows describe the biomedical voice measures.
1. (40 pts) Answer the following questions based on polynomial linear regression. Hint: see
Problem 9 in Chapter 3.
(i) (2 pts) Produce a scatterplot matrix including all the variables in the data. Identify at
least one pair of variables with a strong linear dependence.
(ii) (2 pts) Compute the matrix of correlations between all the variables using the cor()
function. Compare the correlations with your answer to the previous question.
(iii) (6 pts) Fit a linear regression model with Abs, PPQ5, dB, APQ11, NHR, and RPDE as the
predictors using the lm() function. Draw the six residual plots corresponding to the six
predictors. Comment on your findings.
(iv) (4 pts) Which predictor appears to have a significant non-linear relationship with the
response? Guess the correct regression model based on the residual plots.
(v) (10 pts) Write down the polynomial regression model of degree 2 that has Abs, PPQ5, dB,
APQ11, NHR^2, and RPDE^2 as the predictors and that obeys the hierarchical principle. Fit
this model using the lm() function.
(vi) (4 pts) Is there a linear relationship between the predictors and the response? Interpret
the regression coefficient of APQ11 on the response and its 95% confidence interval.
(vii) (12 pts) Consider fitting the following six models with predictors:
(a) Abs, PPQ5, dB, APQ11, NHR, and RPDE;
(b) Abs, PPQ5, dB, APQ11, NHR, RPDE, and NHR×RPDE;
(c) Abs, PPQ5, dB, APQ11, NHR, NHR^2, and RPDE;
(d) Abs, PPQ5, dB, APQ11, NHR, RPDE, and RPDE^2;
(e) Abs, PPQ5, dB, APQ11, NHR, RPDE, NHR^2, and RPDE^2; and
(f) Abs, PPQ5, dB, APQ11, NHR, RPDE, NHR^2, RPDE^2, and NHR×RPDE.
Using 10-fold and leave-one-out cross-validation methods, select the best among the six
models in (a)–(f). Comment on the findings of both methods. Note that you can use
Eq. (5.2) in Chapter 5 of the book for estimating the test mean square error (MSE) using
leave-one-out cross-validation.
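Hint (sketch only): for a least-squares fit, the leave-one-out estimate can be computed from a single fit as the mean of ((y_i − ŷ_i) / (1 − h_i))^2, where h_i is the leverage of observation i. The R sketch below checks this shortcut against brute-force refitting and adds a manual 10-fold estimate. It runs on simulated data rather than park.csv; the variables x1 and x2, the two candidate formulas, and the helper names loocv_shortcut, loocv_brute, and kfold_cv are illustrative assumptions, not part of the project.

```r
# Simulated stand-in data; park.csv and its columns are NOT used here.
set.seed(1)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 1 + 2 * x1 - x2 + 0.5 * x2^2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Two illustrative candidate models; I() keeps ^ as arithmetic inside a formula.
forms <- list(
  linear    = y ~ x1 + x2,
  quadratic = y ~ x1 + x2 + I(x2^2)
)

# LOOCV from one fit, using residuals and leverages (hat values).
loocv_shortcut <- function(form, data) {
  fit <- lm(form, data = data)
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Brute-force LOOCV for comparison: refit n times, leaving one row out each time.
loocv_brute <- function(form, data) {
  resp <- all.vars(form)[1]
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(form, data = data[-i, ])
    data[[resp]][i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  mean(errs^2)
}

# k-fold CV with a random fold assignment; returns the average of the fold MSEs.
kfold_cv <- function(form, data, k = 10) {
  resp  <- all.vars(form)[1]
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(j) {
    fit  <- lm(form, data = data[folds != j, ])
    pred <- predict(fit, newdata = data[folds == j, ])
    mean((data[[resp]][folds == j] - pred)^2)
  })
  mean(fold_mse)
}

cv_table <- data.frame(
  loocv = sapply(forms, loocv_shortcut, data = dat),
  brute = sapply(forms, loocv_brute, data = dat),
  kfold = sapply(forms, kfold_cv, data = dat)
)
print(cv_table)
```

The loocv and brute columns should agree up to floating-point error, while the 10-fold column depends on the random fold assignment and will change with the seed.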
2. (20 pts) Answer the following questions based on K-NN regression. Hint: see Lecture 11 slides.
(i) (10 pts) Using a validation set approach to cross-validation with an 80-20 split of training
and test data, construct the test and training MSE curves as functions of K. Comment on
the differences between training and test MSE.
(ii) (4 pts) Find the Ks with the minimum test and training MSEs, respectively.
(iii) (2 pts) Compare the minimum test and training MSEs in the previous questions with the
test and training MSEs when K = 1.
(iv) (4 pts) Arguing via the bias-variance tradeoff, comment on the values of MSEs depending
on K in the previous two questions. For which value of K is the K-NN regression
model least flexible?
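Hint (sketch only): the validation-set computation above can be organized as in the following R sketch, which uses simulated data and a small hand-written K-NN regression helper so that no extra package is needed. The helper knn_reg, the two simulated predictors, and the range K = 1, ..., 30 are assumptions made for illustration; the lecture slides may use a packaged K-NN routine instead, and real predictors on different scales would normally be standardized first.

```r
# Simulated stand-in data; park.csv is NOT used here.
set.seed(2)
n <- 300
x <- matrix(rnorm(n * 2), ncol = 2)
y <- sin(x[, 1]) + 0.5 * x[, 2] + rnorm(n, sd = 0.3)

# 80-20 split into training and test sets.
train_id <- sample(n, size = round(0.8 * n))
x_tr <- x[train_id, , drop = FALSE]
y_tr <- y[train_id]
x_te <- x[-train_id, , drop = FALSE]
y_te <- y[-train_id]

# Hypothetical helper: predict each new point by averaging the responses
# of its k nearest training points (Euclidean distance).
knn_reg <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(pt) {
    d <- sqrt(colSums((t(x_train) - pt)^2))
    mean(y_train[order(d)[seq_len(k)]])
  })
}

ks <- 1:30
train_mse <- sapply(ks, function(k) mean((y_tr - knn_reg(x_tr, y_tr, x_tr, k))^2))
test_mse  <- sapply(ks, function(k) mean((y_te - knn_reg(x_tr, y_tr, x_te, k))^2))

# Training and test MSE curves as functions of K.
plot(ks, test_mse, type = "b", col = "red",
     ylim = range(c(train_mse, test_mse)), xlab = "K", ylab = "MSE")
lines(ks, train_mse, type = "b", col = "blue")
legend("bottomright", legend = c("test", "training"),
       col = c("red", "blue"), lty = 1)

# K values attaining the minimum test and training MSE.
best_k_test  <- ks[which.min(test_mse)]
best_k_train <- ks[which.min(train_mse)]
print(c(test = best_k_test, train = best_k_train))
```

With K = 1 each training point is its own nearest neighbor, so the training MSE in this sketch is exactly zero; that is a property of the method and not evidence of a good fit.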
Question 2
This question is about developing a statistical model for accurately classifying a genuine banknote
from its forged versions. We are using the Banknote Authentication data set from the UCI machine
learning repository collected by Volker Lohweg at the University of Applied Sciences, Ostwestfalen-
Lippe (https://archive.ics.uci.edu/ml/datasets/banknote+authentication). This data set
has 1372 observations that are obtained from the images of genuine and forged banknote-like
specimens and is provided in the banknote.csv file. Every observation in the data set has five
variables, one response and four imaging features that are constructed from the banknote images
using a wavelet transform. The response equals 1 if the imaging features are for a genuine
banknote and 0 if they are for a forged banknote. The four imaging features are continuous variables and
represent the variance, skewness, and kurtosis of the wavelet-transformed image and the entropy of the
image. Table 2 provides a detailed description of the variables in the banknote data. Using logistic,
LDA, QDA, and K-NN classification approaches, we train four different models for predicting if a
banknote is genuine or forged using the following four predictors: variance, skewness, and kurtosis
of a wavelet-transformed banknote image and entropy of the banknote image.
1. (30 pts) Answer the following questions based on logistic regression. Hint: see Problem 10 in
Chapter 4.
(i) (4 pts) Produce 4 graphical summaries for variance, skewness, curtosis, and entropy
conditioned on the class. Do there appear to be any patterns?
Variable   Description
class      Response taking two values: 0 for a forged banknote and 1 for a genuine banknote
variance   Variance of wavelet-transformed banknote image
skewness   Skewness of wavelet-transformed banknote image
curtosis   Kurtosis of wavelet-transformed banknote image
entropy    Entropy of the banknote image

Table 2: Description of the response and predictors in the banknote.csv file. The response is in the
first row and the remaining rows describe the four predictors, which are continuous and can take
any real values.
(ii) (6 pts) Write down the logistic regression model with variance, skewness, curtosis,
and entropy as the predictors and fit this model using the glm() function. Interpret the
coefficient of entropy and its 95% confidence interval.
(iii) (4 pts) Assume that the glm() function issued a “fitted probabilities numerically 0 or 1
occurred” warning message (this will happen if you have used the full data). State the
reason for this warning message. Which statistical learning method can bypass this
problem and under what assumptions?
(iv) (16 pts) In this question, use 0.5 and 0.9 as the probability cutoffs for predictions and
a validation set approach to cross-validation with a 80-20 split of training and test data.
Construct the confusion matrices (using 0.5 and 0.9 as probability cutoffs) for the following
five models with predictors:
(a) variance, skewness, curtosis, and entropy;
(b) variance, skewness, curtosis, entropy, and entropy^2;
(c) variance, skewness, curtosis, entropy, and curtosis^2;
(d) variance, skewness, curtosis, entropy, and entropy×curtosis; and
(e) variance, skewness, curtosis, entropy, entropy^2, curtosis^2, and entropy×curtosis.
Select the models (and their cutoffs) with the minimum error rate, sensitivity, and
specificity, respectively.
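Hint (sketch only): the confusion matrix, error rate, sensitivity, and specificity at a given probability cutoff can be computed as in the R sketch below. It uses simulated data with two made-up predictors (x1, x2) rather than banknote.csv, and the helper classify_at is a hypothetical name. Class 1 is treated as the positive class, matching the coding of genuine banknotes in Table 2.

```r
# Simulated stand-in data; banknote.csv is NOT used here.
set.seed(3)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1 - x2))
dat <- data.frame(y, x1, x2)

# 80-20 split into training and test sets.
train_id <- sample(n, size = round(0.8 * n))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

# Logistic regression fit on the training set, predicted probabilities on the test set.
fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Hypothetical helper: classify as 1 when the predicted probability exceeds
# the cutoff, then summarize against the true labels.
classify_at <- function(prob, truth, cutoff) {
  pred  <- factor(as.integer(prob > cutoff), levels = c(0, 1))
  truth <- factor(truth, levels = c(0, 1))
  cm    <- table(predicted = pred, actual = truth)
  list(
    confusion   = cm,
    error_rate  = 1 - sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),  # true positives / actual positives
    specificity = cm["0", "0"] / sum(cm[, "0"])   # true negatives / actual negatives
  )
}

res_50 <- classify_at(prob, test$y, cutoff = 0.5)
res_90 <- classify_at(prob, test$y, cutoff = 0.9)
print(res_50$confusion)
print(res_90$confusion)
```

Raising the cutoff from 0.5 to 0.9 can only move predictions from class 1 to class 0, so sensitivity cannot increase and specificity cannot decrease; the error rate may move in either direction.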
2. (40 pts) Answer the following questions based on LDA and QDA. Use the probability thresholds
and cross-validation setup of question 1(iv). Hint: see Lecture 14 slides.
(i) (6 pts) Write down the discriminant functions used in LDA for classification in the ban-
knote data using variance, skewness, curtosis, and entropy as predictors. What is the
definition of the decision boundary?
(ii) (2 pts) Fit the model in the previous question using the lda() function. Identify the prior
probability and mean parameter estimates from the R output.
(iii) (4 pts) Use the R output to define the decision rule for classifying a new test predictor
vector.
(iv) (6 pts) Repeat part (i) for QDA.
(v) (2 pts) Repeat part (ii) using the qda() function.
(vi) (4 pts) How many parameters are estimated in LDA and QDA? Do you think we have
enough training data to estimate them reliably? Explain.
(vii) (12 pts) Construct two confusion matrices for LDA and QDA, respectively. Select the
LDA and/or QDA models (and their cutoffs) with the minimum error rate, sensitivity, and
specificity, respectively.
(viii) (4 pts) Which models in question 1(iv) are closest to the LDA and QDA models with minimum
test error rates, respectively? Explain.
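For reference, the general textbook forms of the discriminant functions for class k are sketched below, with prior probability \pi_k, class mean \mu_k, a covariance matrix \Sigma shared across classes for LDA, and a class-specific covariance matrix \Sigma_k for QDA. The notation is an assumption and may differ from the Lecture 14 slides; specializing these forms to the four banknote predictors is left to the questions above.

```latex
% LDA: covariance matrix \Sigma shared by all classes (linear in x)
\delta_k(x) = x^\top \Sigma^{-1} \mu_k
              - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k
              + \log \pi_k

% QDA: class-specific covariance matrix \Sigma_k (quadratic in x)
\delta_k(x) = -\tfrac{1}{2}\, (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)
              - \tfrac{1}{2}\, \log \lvert \Sigma_k \rvert
              + \log \pi_k

% Decision boundary between classes k and l
\{\, x : \delta_k(x) = \delta_l(x) \,\}
```

An observation is assigned to the class with the largest \delta_k(x), which corresponds to a posterior probability cutoff of 0.5 in the two-class case; other cutoffs shift the boundary.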
3. (20 pts) Answer the following questions based on K-NN classification. Use the cross-validation
setup of question 1(iv) and the default probability threshold in the knn() function for predicting
the response. Hint: see Lecture 11 and Lecture 17 slides.
(i) (5 pts) Construct the test error rate curve as a function of K.
(ii) (3 pts) Find the Ks with the minimum test error rates. If there are many Ks with the
same test error rate, then choose the best one among them. Justify your choices.
(iii) (2 pts) Compare the minimum test error rate in the previous question with the test error
rate when K is chosen to be 1. Comment.
(iv) Using bootstrap,
(a) (4 pts) construct the sampling distribution of the chosen K; and
(b) (6 pts) construct a 95% confidence interval for the error rate, sensitivity, and specificity.
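Hint (sketch only): one simple way to obtain percentile bootstrap intervals for the error rate, sensitivity, and specificity is to resample test observations with replacement and recompute the three metrics each time, as in the R sketch below. This version holds the fitted classifier fixed and resamples only simulated (truth, prediction) pairs; a fuller analysis, including the sampling distribution of the chosen K, would resample the data and repeat the K selection inside each bootstrap replicate. The test-set size, the 5% disagreement rate, the number of replicates B, and the helper name metrics are all illustrative assumptions.

```r
# Simulated labels and predictions; banknote.csv is NOT used here.
set.seed(4)
n_test <- 275                          # hypothetical test-set size
truth  <- rbinom(n_test, 1, 0.45)
flip   <- rbinom(n_test, 1, 0.05)      # predictions disagree with truth about 5% of the time
pred   <- ifelse(flip == 1, 1 - truth, truth)

# Error rate, sensitivity, and specificity with class 1 as the positive class.
metrics <- function(truth, pred) {
  c(
    error_rate  = mean(pred != truth),
    sensitivity = mean(pred[truth == 1] == 1),
    specificity = mean(pred[truth == 0] == 0)
  )
}

# Bootstrap: resample test observations with replacement B times.
B <- 2000
boot <- replicate(B, {
  idx <- sample(n_test, replace = TRUE)
  metrics(truth[idx], pred[idx])
})

# boot is a 3 x B matrix; take the 2.5% and 97.5% quantiles of each row.
ci <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
print(metrics(truth, pred))
print(ci)
```

The percentile interval is only one of several bootstrap interval constructions; the lecture slides may prescribe a different one.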
References
1. A. Tsanas, M. A. Little, P. E. McSharry, L. O. Ramig (2010), ‘Accurate telemonitoring of
Parkinson’s disease progression by non-invasive speech tests’, IEEE Transactions on Biomedical
Engineering.