STAT 4540: PROJECT – PART I
November 25, 2020
Deadline: December 2, 2020

Instructions:
1. There are two questions in this project. The first question focuses on regression and the second is about classification.
2. Unless asked otherwise, always use 10-fold cross-validation to estimate the test error rate or mean square error.
3. Show and explain all your work (even for conclusions). Partial credit cannot be given otherwise.
4. Points will be deducted for incorrect work even if the final answer is correct.
5. Attach your R code at the end of the project report as an appendix. R code cannot be a replacement for explanations or comments.
6. Be concise. Explanations should not exceed 2 sentences unless mentioned otherwise.

Question   Points   Your Score
1          60
2          90
Total      150

Question 1

This question is about developing a statistical model for accurately predicting a score quantifying Parkinson's disease progression. The Unified Parkinson's Disease Rating Scale (UPDRS) is assessed by a trained medical professional in a clinic to track the progression of Parkinson's disease. The motor UPDRS score is the part of the UPDRS that measures the motor impairment caused by Parkinson's disease. Because computing the score in a clinic is time consuming and expensive, there have been various efforts to develop instruments that can be used for computing the motor UPDRS score at a patient's home. This increases patients' convenience and reduces the expense of keeping track of the disease. One such approach has been developed by Tsanas et al. (2010). We use Tsanas et al.'s data for predicting the motor UPDRS scores using linear and K-NN (nearest neighbor) regression models.

The motor UPDRS score of a patient is computed by a medical professional after an interview. A low motor UPDRS score indicates a healthy state, whereas a high score denotes severe motor impairment.
A more convenient alternative to the interviews is provided by smart systems tuned for monitoring patients at their homes. Tsanas et al. (2010) developed an approach to replicate UPDRS assessment with clinically useful accuracy using noninvasive speech tests administered at home. They use covariates from a previous study; the data are available under the name Parkinsons Telemonitoring at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring). We have modified the original data and the UPDRS scores and stored them in the file named park.csv. Table 1 provides a detailed description of the variables in the modified data set. We treat the motor UPDRS score as the response and use polynomial linear and K-NN regression models to predict the motor UPDRS score of patients using six biomedical voice measures.

Variable       Description
motor updrs    Modified clinician's motor UPDRS score
Abs, PPQ5      Measures of variation in fundamental frequency
dB, APQ11      Measures of variation in amplitude
NHR            A measure of the ratio of noise to tonal components in the voice
RPDE           A nonlinear dynamical complexity measure

Table 1: Description of the response and predictors in the park.csv file. The response is in the first row and the remaining rows describe the biomedical voice measures.

1. (40 pts) Answer the following questions based on polynomial linear regression. Hint: see Problem 9 in Chapter 3.
   (i) (2 pts) Produce a scatterplot matrix including all the variables in the data. Identify at least one pair of variables with a strong linear dependence.
   (ii) (2 pts) Compute the matrix of correlations between all the variables using the cor() function. Compare the correlations with your answer to the previous question.
   (iii) (6 pts) Fit a linear regression model with Abs, PPQ5, dB, APQ11, NHR, and RPDE as the predictors using the lm() function. Draw the six residual plots corresponding to the six predictors. Comment on your findings.
   (iv) (4 pts) Which predictor appears to have a significant non-linear relationship with the response? Guess the correct regression model based on the residual plots.
   (v) (10 pts) Write down the polynomial regression model of degree 2 that has Abs, PPQ5, dB, APQ11, NHR^2, and RPDE^2 as the predictors and that obeys the hierarchical principle. Fit this model using the lm() function.
   (vi) (4 pts) Is there a linear relationship between the predictors and the response? Interpret the regression coefficient of APQ11 and its 95% confidence interval.
   (vii) (12 pts) Consider fitting the following six models with predictors:
         (a) Abs, PPQ5, dB, APQ11, NHR, and RPDE;
         (b) Abs, PPQ5, dB, APQ11, NHR, RPDE, and NHR×RPDE;
         (c) Abs, PPQ5, dB, APQ11, NHR, NHR^2, and RPDE;
         (d) Abs, PPQ5, dB, APQ11, NHR, RPDE, and RPDE^2;
         (e) Abs, PPQ5, dB, APQ11, NHR, RPDE, NHR^2, and RPDE^2; and
         (f) Abs, PPQ5, dB, APQ11, NHR, RPDE, NHR^2, RPDE^2, and NHR×RPDE.
         Using 10-fold and leave-one-out cross-validation, select the best among the six models in (a)–(f). Comment on the findings of both methods. Note that you can use Eq. (5.2) in Chapter 5 of the book for estimating the test mean square error (MSE) using leave-one-out cross-validation.

2. (20 pts) Answer the following questions based on K-NN regression. Hint: see Lecture 11 slides.
   (i) (10 pts) Using a validation set approach to cross-validation with an 80-20 split of training and test data, construct the test and training MSE curves as a function of K. Comment on the differences between training and test MSE.
   (ii) (4 pts) Find the Ks with the minimum test and training MSEs, respectively.
   (iii) (2 pts) Compare the minimum test and training MSEs in the previous question with the test and training MSEs when K = 1.
   (iv) (4 pts) Arguing via the bias-variance tradeoff, comment on the values of the MSEs depending on K in the previous two questions. For which value of K is the K-NN regression model least flexible?
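The validation set approach for K-NN regression described in part 2 above can be sketched roughly as follows. This is only an illustrative sketch, not the expected solution: it uses simulated data in place of park.csv, the names x1, x2, knn_reg and mse are hypothetical, and K-NN is written out in base R rather than taken from a particular package.

```r
# Minimal sketch: training and test MSE curves for K-NN regression
# using a validation set approach with an 80-20 split.
# ASSUMPTION: simulated data stand in for park.csv; x1, x2 and y are hypothetical.
set.seed(4540)
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- sin(2 * x[, "x1"]) + 0.5 * x[, "x2"]^2 + rnorm(n, sd = 0.3)

# 80-20 split into training and test sets
train_idx <- sample(n, size = round(0.8 * n))
x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
x_test  <- x[-train_idx, , drop = FALSE]
y_test  <- y[-train_idx]

# K-NN regression: for every row of x_new, average the responses of the
# k nearest training points (Euclidean distance).
knn_reg <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(pt) {
    d <- sqrt(colSums((t(x_train) - pt)^2))
    mean(y_train[order(d)[seq_len(k)]])
  })
}

mse <- function(obs, pred) mean((obs - pred)^2)

# Training and test MSE over a grid of K values
k_grid <- 1:50
train_mse <- sapply(k_grid, function(k) {
  mse(y_train, knn_reg(x_train, y_train, x_train, k))
})
test_mse <- sapply(k_grid, function(k) {
  mse(y_test, knn_reg(x_train, y_train, x_test, k))
})

# Ks with the minimum training and test MSE
best_k_train <- k_grid[which.min(train_mse)]
best_k_test  <- k_grid[which.min(test_mse)]

# MSE curves as a function of K
plot(k_grid, test_mse, type = "b", col = "red",
     ylim = range(c(train_mse, test_mse)),
     xlab = "K", ylab = "MSE")
lines(k_grid, train_mse, type = "b", col = "blue")
legend("bottomright", legend = c("Test MSE", "Training MSE"),
       col = c("red", "blue"), lty = 1, pch = 1)
```

In this sketch the training MSE at K = 1 is zero because each training point is its own nearest neighbor, which is why the training curve alone cannot be used to choose K. With real predictors measured on different scales, it would probably be sensible to standardize them (for example with scale()) before computing distances.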
Question 2

This question is about developing a statistical model for accurately distinguishing a genuine banknote from its forged versions. We are using the Banknote Authentication data set from the UCI Machine Learning Repository collected by Volker Lohweg at the University of Applied Sciences, Ostwestfalen-Lippe (https://archive.ics.uci.edu/ml/datasets/banknote+authentication). This data set has 1372 observations that are obtained from the images of genuine and forged banknote-like specimens and is provided in the banknote.csv file. Every observation in the data set has five variables: one response and four imaging features that are constructed from the banknote images using a wavelet transform. The response equals 1 if the imaging features are for a genuine banknote and 0 if they are for a forged banknote. The four imaging features are continuous variables and represent the variance, skewness, and curtosis of the wavelet transformed image and the entropy of the image. Table 2 provides a detailed description of the variables in the banknote data.

Using logistic, LDA, QDA, and K-NN classification approaches, we train four different models for predicting if a banknote is genuine or forged using the following four predictors: variance, skewness, and curtosis of a wavelet transformed banknote image and entropy of the banknote image.

1. (30 pts) Answer the following questions based on logistic regression. Hint: see Problem 10 in Chapter 4.
   (i) (4 pts) Produce 4 graphical summaries for variance, skewness, curtosis, and entropy conditioned on the class. Do there appear to be any patterns?

Variable    Description
class       Response taking two values: 0 for a forged banknote and 1 for a genuine banknote
variance    Variance of wavelet transformed banknote image
skewness    Skewness of wavelet transformed banknote image
curtosis    Curtosis of wavelet transformed banknote image
entropy     Entropy of the banknote image

Table 2: Description of the response and predictors in the banknote.csv file.
The response is in the first row and the remaining rows describe the four predictors, which are continuous and can take any real values.

   (ii) (6 pts) Write down the logistic regression model with variance, skewness, curtosis, and entropy as the predictors and fit this model using the glm() function. Interpret the coefficient of entropy and its 95% confidence interval.
   (iii) (4 pts) Assume that the glm() function issued a "fitted probabilities numerically 0 or 1 occurred" warning message (this will happen if you have used the full data). State the reason for this warning message. Which statistical learning method can bypass this problem and under what assumptions?
   (iv) (16 pts) In this question, use 0.5 and 0.9 as the probability cutoffs for predictions and a validation set approach to cross-validation with an 80-20 split of training and test data. Construct the confusion matrices (using 0.5 and 0.9 as probability cutoffs) for the following five models with predictors:
        (a) variance, skewness, curtosis, and entropy;
        (b) variance, skewness, curtosis, entropy, and entropy^2;
        (c) variance, skewness, curtosis, entropy, and curtosis^2;
        (d) variance, skewness, curtosis, entropy, and entropy×curtosis; and
        (e) variance, skewness, curtosis, entropy, entropy^2, curtosis^2, and entropy×curtosis.
        Select the models (and their cutoffs) with the minimum error rate, sensitivity, and specificity, respectively.

2. (40 pts) Answer the following questions based on LDA and QDA. Use the probability thresholds and cross-validation setup of question 1(iv). Hint: see Lecture 14 slides.
   (i) (6 pts) Write down the discriminant functions used in LDA for classification in the banknote data using variance, skewness, curtosis, and entropy as predictors. What is the definition of the decision boundary?
   (ii) (2 pts) Fit the model in the previous question using the lda() function. Identify the prior probability and mean parameter estimates from the R output.
   (iii) (4 pts) Use the R output to define the decision rule for classifying a new test predictor vector.
   (iv) (6 pts) Repeat part (i) for QDA.
   (v) (2 pts) Repeat part (ii) using the qda() function.
   (vi) (4 pts) How many parameters are estimated in LDA and QDA? Do you think we have enough training data to estimate them reliably? Explain.
   (vii) (12 pts) Construct two confusion matrices for LDA and QDA, respectively. Select the LDA and/or QDA models (and their cutoffs) with the minimum error rate, sensitivity, and specificity, respectively.
   (viii) (4 pts) Which models in question 1(iv) are closest to the LDA and QDA models with minimum test error rates, respectively? Explain.

3. (20 pts) Answer the following questions based on K-NN classification. Use the cross-validation setup of question 1(iv) and the default probability threshold in the knn function for predicting the response. Hint: see Lecture 11 and Lecture 17 slides.
   (i) (5 pts) Construct the test error rate curve as a function of K.
   (ii) (3 pts) Find the Ks with the minimum test error rate. If there are many Ks with the same test error rate, then choose the best one among them. Justify your choices.
   (iii) (2 pts) Compare the minimum test error rate in the previous question with the test error rate when K is chosen to be 1. Comment.
   (iv) Using the bootstrap,
        (a) (4 pts) construct the sampling distribution of the chosen K; and
        (b) (6 pts) construct a 95% confidence interval for the error rate, sensitivity, and specificity.

References
1. A. Tsanas, M. A. Little, P. E. McSharry, L. O. Ramig (2010), 'Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests', IEEE Transactions on Biomedical Engineering.