Statistical Methods for Data Science
DATA7202
Semester 1, 2020
Assignment 3 (Weight: 20%)
Assignment 3 is due on 2 June, 2020, 2:00pm.
There are five questions below. For questions 1 and 3, you should present your
analysis of data using Python, Matlab, or R, as a short report, clearly answering the
objectives and justifying the modeling (and hence statistical analysis) choices you make,
as well as discussing your conclusions. Do not include excessive amounts of output in
your reports, though you can append additional output (with explanation) to your report
as an appendix.
1. (10%) Consider the function
$$f(x) = 3x + x^2 - 200\cos(x), \qquad 1 \le x \le 8.$$
Write a Crude Monte Carlo algorithm for the estimation of
$$\ell = \int_1^8 f(x)\,dx,$$
using a sample size of $N = 10000$. Report the 95% confidence interval. Compare the
obtained estimate with the true value $\ell$.
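As a starting point for question 1, here is a minimal sketch of a Crude Monte Carlo estimator, assuming uniform sampling on [1, 8] and a normal-approximation confidence interval (quantile 1.96). The function names `crude_monte_carlo` and `exact_integral` and the fixed seed are illustrative choices, not part of the assignment.

```python
import math
import random


def f(x):
    # Integrand from the question: f(x) = 3x + x^2 - 200 cos(x)
    return 3.0 * x + x**2 - 200.0 * math.cos(x)


def crude_monte_carlo(n_samples=10_000, a=1.0, b=8.0, seed=0):
    """Estimate the integral of f over [a, b]; return (estimate, 95% CI)."""
    rng = random.Random(seed)
    # (b - a) * f(U) with U ~ Uniform(a, b) has expectation equal to the integral.
    values = [(b - a) * f(rng.uniform(a, b)) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    # Unbiased sample variance of the individual terms.
    var = sum((v - mean) ** 2 for v in values) / (n_samples - 1)
    half_width = 1.96 * math.sqrt(var / n_samples)
    return mean, (mean - half_width, mean + half_width)


def exact_integral(a=1.0, b=8.0):
    # Antiderivative of f: 1.5 x^2 + x^3 / 3 - 200 sin(x)
    def F(x):
        return 1.5 * x**2 + x**3 / 3.0 - 200.0 * math.sin(x)

    return F(b) - F(a)


if __name__ == "__main__":
    estimate, (lower, upper) = crude_monte_carlo()
    print("estimate:", estimate)
    print("95% CI:  ", (lower, upper))
    print("exact:   ", exact_integral())
```

The comparison with the true value then amounts to checking whether `exact_integral()` falls inside the reported interval, which should happen in roughly 95% of independent runs.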
2. (10%) Consider the following variant of the cross-validation procedure.
(i) Using the available data, find a subset of “good” predictors that show
correlation with the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.
Is this a good method? Do you expect to obtain the true prediction error? Explain
your answer.
3. Consider the Hitters data-set (given in Hitters.csv). Our objective is to predict a
hitter’s salary via linear models.
(a) (5%) Load the data-set and replace all categorical values with numbers. (You
can use the LabelEncoder object in Python).
(b) (5%) Fit linear regression and report 10-Fold Cross-Validation mean squared
error.
(c) (10%) Apply Principal Component Regression (PCR) with every possible number
of principal components. Using the 10-Fold Cross-Validation, plot the mean
squared error as a function of the number of components and determine the
optimal number of components.
(d) (10%) Apply the Lasso method and plot the 10-Fold Cross-Validation mean
squared error as a function of λ. Determine the best λ and the corresponding
mean squared error.
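The steps of question 3 could be organised along the following lines. This is only a sketch using scikit-learn and pandas. It assumes the response column is named `Salary`, that rows with a missing salary are dropped, and that predictors are standardised before PCA and the Lasso. The helper names (`load_hitters`, `encode_categoricals`, `cv_mse`, `pcr_curve`, `lasso_curve`) and the default penalty grid are hypothetical. If the CSV contains a column of player names, it should probably be dropped rather than label-encoded.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


def load_hitters(path="Hitters.csv"):
    # Assumption: the response column is called "Salary" and may have missing values.
    return pd.read_csv(path).dropna(subset=["Salary"])


def encode_categoricals(df):
    """Return a copy of df with every non-numeric column label-encoded."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = LabelEncoder().fit_transform(out[col].astype(str))
    return out


def cv_mse(model, X, y, seed=0):
    """10-fold cross-validated mean squared error of a model."""
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    return -scores.mean()


def pcr_curve(X, y):
    """CV MSE of PCR for k = 1, ..., p principal components."""
    curve = {}
    for k in range(1, X.shape[1] + 1):
        model = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
        curve[k] = cv_mse(model, X, y)
    return curve


def lasso_curve(X, y, lambdas=None):
    """CV MSE of the Lasso for each penalty value (scikit-learn calls it alpha)."""
    if lambdas is None:
        lambdas = np.logspace(-2, 3, 30)  # illustrative grid, not prescribed
    curve = {}
    for lam in lambdas:
        model = make_pipeline(StandardScaler(), Lasso(alpha=lam, max_iter=10_000))
        curve[lam] = cv_mse(model, X, y)
    return curve
```

With `X` and `y` built from the encoded data frame, `cv_mse(LinearRegression(), X, y)` gives the figure for part (b), and `min(curve, key=curve.get)` picks the number of components or the penalty with the smallest cross-validated error. Because the scaler and PCA sit inside the pipeline, they are refitted on each training fold.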
4. (10%) Specify a method to generate a random variable from the discrete pdf
$$f(x) = \begin{cases} \dfrac{1}{n+1}, & x = 0, 1, 2, \dots, n, \\ 0, & \text{otherwise.} \end{cases}$$
Discuss the time complexity of your method in terms of n, e.g. is it O(n), O(ln(n)),
etc. Give a short explanation (at most 2 sentences) for your answer.
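One candidate method for question 4 is inverse-transform sampling: for U uniform on [0, 1), the value floor((n + 1) U) is uniform on {0, 1, ..., n}. The sketch below assumes that drawing U and doing the arithmetic take constant time, in which case the cost per draw does not grow with n. The name `sample_discrete_uniform` is illustrative.

```python
import math
import random


def sample_discrete_uniform(n, rng=random):
    """Draw X uniform on {0, 1, ..., n} via X = floor((n + 1) * U)."""
    u = rng.random()  # U ~ Uniform[0, 1)
    # min() is a defensive guard against floating-point rounding at the top end.
    return min(n, math.floor((n + 1) * u))


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_discrete_uniform(5, rng) for _ in range(10)])
```

An alternative that scans the cumulative probabilities one by one would also be correct, but it would take a number of steps proportional to n in the worst case.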
5. Answer the following questions.
(a) (10%) Let $X$ be a random variable and consider the estimation of the
probability $\ell_\gamma = P(X > \gamma)$ for some large $\gamma \in \mathbb{R}$. The Crude Monte Carlo (CMC)
estimator of $\ell_\gamma$ is
$$\hat{\ell}_\gamma = \frac{1}{N} \sum_{i=1}^{N} Z_i, \tag{1}$$
where $Z_i = 1\{X_i > \gamma\}$ is the indicator random variable, and $X_1, \dots, X_N$ are iid
copies of $X$ for $i = 1, \dots, N$. Find the squared coefficient of variation $CV^2$ of
$Z$. (Recall that $CV^2 = \mathrm{Var}(Z) / (E[Z])^2$.)
(b) (10%) Find the relative error of the estimator $\hat{\ell}_\gamma$ in terms of $N$ and $\ell_\gamma$.
(c) (20%) The estimator (1) of $\ell_\gamma = E[Z]$ is said to be logarithmically efficient if
$$\lim_{\gamma \to \infty} \frac{\ln E[Z^2]}{\ln (E[Z])^2} = 1.$$
Prove that the CMC estimator is not logarithmically efficient.
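The quantities in question 5 can be explored numerically before attempting the proofs. The sketch below assumes X ~ Exp(1) purely as an example, so that the tail probability is exp(-γ) exactly. It uses the fact that an indicator Z is Bernoulli, hence Var(Z) = ℓ(1 − ℓ) and Z² = Z. All function names are hypothetical, and the output illustrates the behaviour without replacing the derivations.

```python
import math
import random


def cmc_tail_probability(gamma, n_samples, rng):
    """Crude Monte Carlo estimate of P(X > gamma) for X ~ Exp(1) (assumed example)."""
    hits = sum(1 for _ in range(n_samples) if rng.expovariate(1.0) > gamma)
    return hits / n_samples


def theoretical_relative_error(ell, n_samples):
    # Z ~ Bernoulli(ell): Var(Z) = ell (1 - ell), so the relative error of the
    # sample mean is sqrt(Var(Z) / N) / ell = sqrt((1 - ell) / (N ell)).
    return math.sqrt((1.0 - ell) / (n_samples * ell))


def log_efficiency_ratio(ell):
    # For an indicator, Z^2 = Z, hence E[Z^2] = E[Z] = ell.
    return math.log(ell) / math.log(ell**2)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    for gamma in (1.0, 3.0, 6.0):
        ell = math.exp(-gamma)  # exact tail probability for Exp(1)
        est = cmc_tail_probability(gamma, n, rng)
        print(gamma, ell, est,
              theoretical_relative_error(ell, n),
              log_efficiency_ratio(ell))
```

For fixed N, the relative error grows as γ increases and ℓ shrinks, which is the practical reason rare-event probabilities are hard for Crude Monte Carlo.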