

MATH569 Statistical Learning Final Exam (Take Home)
Due at 11:59 pm on December 9, 2020

Instructions:

i. This exam consists of FIVE problems. Answer all of them.

ii. This is a take-home exam. You may use any books or notes to help you answer the questions, but you MUST complete the exam INDEPENDENTLY, without discussing it with anybody else.

iii. Show all your work to justify your answers. Answers without adequate justification will not receive credit.

iv. There are TWO data analysis problems. Reformat the computer output when answering the questions; do not paste the computer output in directly.

Mathematical problems

Problem 1. (20 points: 10-10) Let $X_1 \in \mathbb{R}$ and $X_2 \in \mathbb{R}$ be random variables and $Y = m(X_1, X_2) + \epsilon$, where $E(\epsilon) = 0$ and $E(\epsilon^2) = \sigma^2$.

(a) Consider the class of multiplicative predictors of the form $m(x_1, x_2) = \beta x_1 x_2$. Let $\beta^*$ be the best predictor, that is, $\beta^*$ minimizes $E_{Y, X_1, X_2}(Y - \beta X_1 X_2)^2$. Find an expression for $\beta^*$.

(b) Suppose the true regression function is $Y = X_1 + X_2 + \epsilon$. Also assume that $E(X_1) = E(X_2) = 0$, $E(X_1^2) = E(X_2^2) = 1$, and that $X_1$ and $X_2$ are independent. Find the predictive risk $R = E(Y - \beta^* X_1 X_2)^2$, where $\beta^*$ was defined in (a).

Problem 2. (10 points) Show that the leave-one-out cross-validation identity holds for the linear regression model $Y = X^T \beta + \epsilon$:

$$\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_{(i)} \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i}{1 - H_{ii}} \right)^2,$$

where $H = X(X^T X)^{-1} X^T$ is the hat matrix, $H_{ii}$ is its $i$th diagonal entry, $\hat{Y}_i$ is the prediction at the training point $x_i$, and $\hat{Y}_{(i)}$ is the leave-one-out prediction at $x_i$.

Problem 3. (20 points: 5-5-5-5) Let $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ be generated as follows: $Z_i \sim \mathrm{Bernoulli}(p)$, and

$$Y_i \sim \begin{cases} N(5, 1) & \text{if } Z_i = 1 \\ N(0, 1) & \text{if } Z_i = 0. \end{cases}$$

(a) Assume we do not observe the $Z_i$'s. Write the pdf $f(y)$ of $Y$ as a mixture of two normal pdfs. (Use the notation $\phi(\cdot)$ for the standard normal pdf.)

(b) Write down the likelihood function for $p$ (without the $Z_i$'s).

(c) Write down the complete likelihood function for $p$ (assuming the $Z_i$'s are observed).

(d) Find the maximum likelihood estimate of $p$ using the likelihood from (c).

Computational problems

Problem 4.
There are two options; you only need to work out one of the two problems. If you provide solutions for both, I will grade only one of the two (of my choosing), so you might as well provide the one solution you are most confident in.

Option 1: (20 points: 6-8-6) Ridge and lasso regression. Use the LA ozone dataset. Divide the dataset into two groups at random (you can use the sample function): one group, which we call the training data, containing 2/3 of the observations, and one group, which we call the test data, with the remaining 1/3. In the following you are asked to regress the cube root of the ozone concentration on the other variables. Use only the training data for estimation.

(a) Best subset model: find the best subset model for each model size, i.e., the number of variables included, $p = 1, 2, \ldots, 9$, according to the $C_p$ criterion. Return two plots: (1) the $C_p$ value with respect to the degrees of freedom $p + 1$, $p = 1, 2, \ldots, 9$; (2) the training error $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ with respect to $p + 1$, and the test error $\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$ on the test data set with respect to $p + 1$. Here $p = 0, 1, \ldots, 9$.

(b) Lasso method: use the lasso method and return the following two plots: (1) the path of the coefficients (use the function plot); (2) the path of the training error and test error with respect to each step. (Check the function predict.lars to directly make all predictions for every step.)

(c) Ridge regression: use ridge regression and plot the training error and test error with respect to lambda = seq(0, 3, by = 0.1).

Option 2: (20 points: 6-6-8) Gaussian process model. Consider the following model for simulating the flow rate through a borehole:

$$y = \frac{2 \pi T_u (H_u - H_l)}{\ln(r/r_w) \left( 1 + \dfrac{2 L T_u}{\ln(r/r_w) \, r_w^2 K_w} + \dfrac{T_u}{T_l} \right)},$$

where the ranges of interest for the eight input variables are $r_w \in (0.05, 0.15)$, $r \in (100, 50000)$, $T_u \in (63070, 115600)$, $H_u \in (990, 1110)$, $T_l \in (63.1, 116)$, $H_l \in (700, 820)$, $L \in (1120, 1680)$, and $K_w \in (9855, 12045)$. Generate a random symmetric Latin hypercube design in 50 runs.
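For Option 1 above, a minimal R sketch of the three parts might look like the following. It assumes the leaps, lars, and MASS packages are installed and that the LA ozone data is loaded as a data frame named ozone with the response in column 1 and the nine predictors in the remaining columns (the object name and column layout are assumptions, not part of the assignment):

```r
library(leaps)   # best subset selection
library(lars)    # lasso coefficient paths
library(MASS)    # lm.ridge

set.seed(569)                                  # any seed; for reproducibility
n     <- nrow(ozone)
train <- sample(n, round(2 * n / 3))           # 2/3 training, 1/3 test
y     <- ozone[, 1]^(1/3)                      # cube root of ozone concentration
x     <- as.matrix(ozone[, -1])

## (a) best subset of each size, ranked by Cp
sub <- leaps(x[train, ], y[train], method = "Cp", nbest = 1)
plot(sub$size, sub$Cp, xlab = "p + 1", ylab = "Cp")
# training/test errors per size: refit lm() using the columns in sub$which[k, ]

## (b) lasso coefficient path and per-step errors
fit  <- lars(x[train, ], y[train], type = "lasso")
plot(fit)                                      # coefficient path
pred <- predict(fit, x[-train, ])$fit          # one column of predictions per step
test_err <- colMeans((y[-train] - pred)^2)

## (c) ridge regression over a grid of lambda
ridge <- lm.ridge(y[train] ~ x[train, ], lambda = seq(0, 3, by = 0.1))
matplot(ridge$lambda, coef(ridge), type = "l", xlab = "lambda")
```

Training and test errors in (b) and (c) are computed the same way as in (a): predict on each data group and average the squared residuals.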
You can use the following R code to do it and generate the training data:

library(lhs)
Original_D <- randomLHS(n = 50, k = 8)
D <- Original_D
lower <- c(0.05, 100, 63070, 990, 63.1, 700, 1120, 9855)
upper <- c(0.15, 50000, 115600, 1110, 116, 820, 1680, 12045)
for (i in 1:50) D[i, ] <- lower + D[i, ] * (upper - lower)
y <- rep(0, 50)
for (i in 1:50)
  y[i] <- 2 * pi * D[i, 3] * (D[i, 4] - D[i, 6]) /
    (log(D[i, 2] / D[i, 1]) * (1 + 2 * D[i, 7] * D[i, 3] /
      (log(D[i, 2] / D[i, 1]) * D[i, 1]^2 * D[i, 8]) + D[i, 3] / D[i, 5]))

Fit different versions of Gaussian process interpolation with the Gaussian product correlation function

$$C(x, x') = \exp\left( -\sum_{k=1}^{8} \theta_k (x_k - x'_k)^2 \right)$$

and use maximum likelihood estimation to estimate $\theta = (\theta_1, \ldots, \theta_8)$. You can use your own code or existing R packages to do the following tasks.

(a) Fit a GP interpolation model with a constant but unknown mean and return the estimated parameter values, including $\hat{\mu}$, $\hat{\sigma}^2$, and $\hat{\theta}$.

(b) Fit a GP with a linear mean function (Problem 2 of Homework 6). Return the estimated parameter values for $\beta = (\beta_0, \beta_1, \ldots, \beta_8)$, $\theta$, and $\sigma^2$.

(c) Compute the RMSPE for the two fitted GP interpolation models by randomly generating 10,000 points in the experimental region. (Hint: you can do so in the same way the training data were generated.) Which model is better in terms of prediction?

$$\mathrm{RMSPE} = \left( \frac{1}{10000} \sum_{i=1}^{10000} \left( y(x_i) - \hat{y}(x_i) \right)^2 \right)^{1/2}$$

Problem 5. (20 points: 6-7-7) You have two data sets, geno train.txt and geno test.txt. Each contains 16 columns of data from different individuals, with the first 15 being the genetic fingerprint (the count of the number of repeats for certain so-called tandem repeats in the genome) and the last being the population variable. The purpose is to predict the population from the genetic fingerprint. We refer below to the repeat counts as the count data (the x variables) and to the population as the group (the y variable).

(a) Using the training data set, for each input variable, estimate the density for each population class, and then plot the 3 estimated density pdfs for each variable in the same plot.
Use a different color for each pdf. You will therefore have 15 plots. You can use par(mfrow = c(3, 5)) to arrange all 15 plots in a 3 × 5 matrix.

(b) Compare two classification methods, LDA and SVM, according to the misclassification rate on the test data set.

(c) Use principal component analysis on the original 15 input variables of the training data set. (You will use the two R functions princomp and predict.) Then compare LDA and SVM again as in (b), using only the top 3 transformed input variables with the largest variances.
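A minimal R sketch for Problem 5, assuming the MASS and e1071 packages and that the two files read into data frames whose 16th column is named pop (the object and column names are assumptions; adapt them to the actual files):

```r
library(MASS)    # lda
library(e1071)   # svm

train <- read.table("geno_train.txt", header = TRUE)
test  <- read.table("geno_test.txt",  header = TRUE)
train$pop <- factor(train$pop); test$pop <- factor(test$pop)

## (a) per-variable density estimates, one panel per variable, one curve per group
par(mfrow = c(3, 5))
for (j in 1:15) {
  dens <- tapply(train[, j], train$pop, density)
  plot(dens[[1]], col = 1, main = names(train)[j])
  for (k in 2:length(dens)) lines(dens[[k]], col = k)
}

## (b) misclassification rates on the test set
lda_fit <- lda(pop ~ ., data = train)
lda_err <- mean(predict(lda_fit, test)$class != test$pop)
svm_fit <- svm(pop ~ ., data = train)
svm_err <- mean(predict(svm_fit, test) != test$pop)

## (c) repeat (b) using only the top 3 principal components
pc  <- princomp(train[, 1:15])
tr3 <- data.frame(pc$scores[, 1:3], pop = train$pop)
te3 <- data.frame(predict(pc, test[, 1:15])[, 1:3], pop = test$pop)
lda3_err <- mean(predict(lda(pop ~ ., data = tr3), te3)$class != te3$pop)
svm3_err <- mean(predict(svm(pop ~ ., data = tr3), te3) != te3$pop)
```

Note that princomp is fit on the training data only, and predict projects the test data onto the same components, as the problem statement suggests.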
