Tutoring Case: DATA7202 Assignment 3
Statistical Methods for Data Science
DATA7202
Semester 1, 2020
Assignment 3 (Weight: 20%)

Assignment 3 is due on 2 June, 2020, 2:00pm.

There are five questions below. For questions 1 and 3, you should present your analysis of data using Python, Matlab, or R, as a short report, clearly answering the objectives and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing your conclusions. Do not include excessive amounts of output in your reports, though you can append additional output (with explanation) to your report as an appendix.

1. (10%) Consider the function
$$f(x) = 3x + x^2 - 200\cos(x), \qquad 1 \le x \le 8.$$
Write a Crude Monte Carlo algorithm for the estimation of $\ell = \int_1^8 f(x)\,dx$, using a sample size of $N = 10000$. Deliver the 95% confidence interval. Compare the obtained estimate with the true value $\ell$.

2. (10%) Consider the following variant of the cross-validation procedure.
(i) Using the available data, find a subset of "good" predictors that show correlation with the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.
Is this a good method? Do you expect to obtain the true prediction error? Explain your answer.

3. Consider the Hitters data-set (given in Hitters.csv). Our objective is to predict a hitter's salary via linear models.
(a) (5%) Load the data-set and replace all categorical values with numbers. (You can use the LabelEncoder object in Python.)
(b) (5%) Fit linear regression and report the 10-Fold Cross-Validation mean squared error.
(c) (10%) Apply Principal Component Regression (PCR) with every possible number of principal components. Using 10-Fold Cross-Validation, plot the mean squared error as a function of the number of components and determine the optimal number of components.
(d) (10%) Apply the Lasso method and plot the 10-Fold Cross-Validation mean squared error as a function of $\lambda$.
Determine the best $\lambda$ and the corresponding mean squared error.

4. (10%) Specify a method to generate a random variable from the discrete pdf
$$f(x) = \begin{cases} \dfrac{1}{n+1}, & x = 0, 1, 2, \ldots, n, \\ 0, & \text{otherwise.} \end{cases}$$
Discuss the time complexity of your method in terms of $n$, e.g. is it $O(n)$, $O(\ln(n))$, etc. Give a short explanation (at most 2 sentences) for your answer.

5. Answer the following questions.
(a) (10%) Let $X$ be a random variable and consider the estimation of the probability $\ell_\gamma = \mathbb{P}(X > \gamma)$ for some large $\gamma \in \mathbb{R}$. The Crude Monte Carlo (CMC) estimator of $\ell_\gamma$ is
$$\hat{\ell}_\gamma = \frac{1}{N} \sum_{i=1}^{N} Z_i, \tag{1}$$
where $Z_i = \mathbb{1}_{\{X_i > \gamma\}}$ is the indicator random variable for $i = 1, \ldots, N$, and $X_1, \ldots, X_N$ are iid copies of $X$. Find the squared coefficient of variation $\mathrm{CV}^2$ of $Z$. (Recall that $\mathrm{CV}^2 = \mathrm{Var}(Z) / (\mathbb{E}[Z])^2$.)
(b) (10%) Find the relative error of the estimator $\hat{\ell}_\gamma$ in terms of $N$ and $\ell_\gamma$.
(c) (20%) The estimator (1) of $\ell_\gamma = \mathbb{E}(Z)$ is said to be logarithmically efficient if
$$\lim_{\gamma \to \infty} \frac{\ln \mathbb{E}(Z^2)}{\ln (\mathbb{E}[Z])^2} = 1.$$
Prove that the CMC estimator is not logarithmically efficient.
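
For intuition about estimator (1) in Question 5, the following is a minimal Python sketch and not a solution to the derivations asked for. It assumes, purely for illustration, that $X$ is a standard exponential random variable, so that the true value $\ell_\gamma = e^{-\gamma}$ is known. The function name `cmc_tail_probability`, the seed, the sample size, and the values of $\gamma$ are all hypothetical choices that do not come from the assignment.

```python
import math
import random
import statistics


def cmc_tail_probability(gamma, n_samples, rng):
    """Crude Monte Carlo estimate of P(X > gamma).

    Assumption (illustration only): X ~ Exp(1), so the true value is exp(-gamma).
    Returns (estimate, estimated relative error), where the relative error is
    the estimated standard error of the estimator divided by the estimate.
    """
    # Z_i = 1{X_i > gamma} for iid draws X_1, ..., X_N
    z = [1.0 if rng.expovariate(1.0) > gamma else 0.0 for _ in range(n_samples)]
    estimate = sum(z) / n_samples
    if estimate == 0.0:
        # No exceedances observed: the relative error cannot be estimated.
        return estimate, float("inf")
    # Standard error of the sample mean, from the sample standard deviation of Z
    std_err = statistics.stdev(z) / math.sqrt(n_samples)
    return estimate, std_err / estimate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for gamma in (1.0, 3.0, 5.0):
        est, rel_err = cmc_tail_probability(gamma, 100_000, rng)
        print(
            f"gamma={gamma}: estimate={est:.5f}, "
            f"true={math.exp(-gamma):.5f}, estimated relative error={rel_err:.3f}"
        )
```

With $N$ held fixed, one would expect the estimated relative error to grow as $\gamma$ increases and the event $\{X > \gamma\}$ becomes rarer, which is the behaviour that parts (b) and (c) ask you to quantify.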