STAT GR5241 Spring 2020 Final
Name:
UNI:
The final exam is open notes, open book and open internet. Students are not allowed to communicate with
anyone about the final with the exception of their respective TA or the instructor. Please solve the exam
on your own paper and upload the completed final as a single pdf document on Canvas by Saturday, May
9th at 1:00pm (NYC time). See the final exam submission guidelines for more information. Late exams
will not be accepted and students who don’t follow the submission guidelines will be significantly penalized.
Cheating will result in a score of zero and lead to the letter grade F.
Every student must write the following phrase on their cover page. Please fill in your printed
name, signature and date.
“I will not engage in academically dishonest activities for the STAT GR5241 final exam.
Signature: Date: ”
Problem 1: AdaBoost [20 points]
In this problem we apply the AdaBoost classification algorithm described in the lecture notes and Homework 5. The dataset of interest consists of n = 100 training cases, p = 2 continuous features x_1, x_2, and a dichotomous response Y with labels Y = −1 and Y = 1. The training data are displayed in Figure 1, which plots x_2 versus x_1 split by Y. Figure 1 also marks the locations of a single test case x_test and the training case x_17.
The weak learner is a decision stump, i.e., for the b-th boosting iteration:
\[
c_b(x \mid j, \theta) =
\begin{cases}
+1 & \text{if } x_j > \theta \\
-1 & \text{if } x_j \le \theta
\end{cases}
\]
The decision stump is trained by minimizing the weighted 0-1 loss:
\[
(j^{(b)}, \theta^{(b)}) := \operatorname*{arg\,min}_{j,\,\theta}\;
\frac{\sum_{i=1}^{n} w_i \, I\{y_i \neq c_b(x_i \mid j, \theta)\}}{\sum_{i=1}^{n} w_i}
\]
Figure 1: Problem 1 training data
Our AdaBoost classifier is constructed from B = 8 weak learners. For boosting iterations b = 1, 2, ..., 8, the trained decision stumps c_1, c_2, ..., c_8 are:
\[
\begin{aligned}
&(j^{(1)} = 2,\ \theta^{(1)} = 0.5016) \quad (j^{(2)} = 1,\ \theta^{(2)} = 0.6803) \quad (j^{(3)} = 2,\ \theta^{(3)} = 0.9128) \\
&(j^{(4)} = 2,\ \theta^{(4)} = 0.5016) \quad (j^{(5)} = 1,\ \theta^{(5)} = 0.4986) \quad (j^{(6)} = 1,\ \theta^{(6)} = 0.9278) \\
&(j^{(7)} = 2,\ \theta^{(7)} = 0.5016) \quad (j^{(8)} = 1,\ \theta^{(8)} = 0.4986)
\end{aligned}
\]
The weighted errors \epsilon_b are:
\[
\begin{aligned}
&\epsilon_1 = 0.1400 \quad \epsilon_2 = 0.1221 \quad \epsilon_3 = 0.2018 \\
&\epsilon_4 = 0.1784 \quad \epsilon_5 = 0.1475 \quad \epsilon_6 = 0.1670 \\
&\epsilon_7 = 0.1760 \quad \epsilon_8 = 0.1821
\end{aligned}
\]
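For reference, a minimal Python sketch of how the trained ensemble produces a prediction. It assumes the discrete-AdaBoost weights α_b = log((1 − ε_b)/ε_b); if the lecture notes use a different convention (e.g., an extra factor of 1/2), the weights below should be adjusted accordingly.

```python
import numpy as np

# Trained stumps (feature index j, threshold theta) and weighted errors eps_b, copied from above.
stumps = [(2, 0.5016), (1, 0.6803), (2, 0.9128), (2, 0.5016),
          (1, 0.4986), (1, 0.9278), (2, 0.5016), (1, 0.4986)]
eps = np.array([0.1400, 0.1221, 0.2018, 0.1784, 0.1475, 0.1670, 0.1760, 0.1821])

# Assumed discrete-AdaBoost weights alpha_b = log((1 - eps_b) / eps_b); verify against the notes.
alpha = np.log((1 - eps) / eps)

def stump(x, j, theta):
    """Decision stump c_b(x | j, theta): +1 if x_j > theta, -1 otherwise (j is 1-indexed)."""
    return 1.0 if x[j - 1] > theta else -1.0

def adaboost_predict(x):
    """Sign of the alpha-weighted vote of the B = 8 stumps."""
    score = sum(a * stump(x, j, th) for a, (j, th) in zip(alpha, stumps))
    return int(np.sign(score))
```

Under that assumed weighting, adaboost_predict(np.array([0.600, 0.395])) mirrors the hand calculation requested in problem 1.i below.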
Solve problems (1.i)-(1.ii):
1.i [10 points] Classify the test case x_test = (0.600, 0.395) based on the trained AdaBoost model. Use all B = 8 weak learners to estimate Ŷ_test. Show the details of your calculation for full credit.

1.ii [10 points] Does the structure of the training data (Figure 1) provide an advantage or disadvantage when using AdaBoost with decision stumps? Describe your answer in a few sentences.
Problem 2: Neural Networks [30 points]
In this problem we fit a neural network to perform classification on the same n = 100 training cases from Problem 1. The dichotomous response Y is now labeled Y = 0 and Y = 1, instead of −1 and 1. Our neural network consists of a single hidden layer with d = 2 derived features, a sigmoid activation function, and a softmax output function. The schematic of this model is displayed below:
[Schematic: inputs x_1 and x_2 feed hidden units h_1 and h_2 through weights W^[1] (entries w^[1]_11, w^[1]_12, w^[1]_21, w^[1]_22) and biases b^[1]_1, b^[1]_2; the hidden units feed outputs y_a and y_b through weights W^[2] (entries w^[2]_11, w^[2]_12, w^[2]_21, w^[2]_22) and biases b^[2]_1, b^[2]_2.]
The objective or total cost Q(θ) is cross-entropy:
\[
Q(\theta) = -\sum_{i=1}^{n} \Big( y_i \log f_1(x_i) + (1 - y_i) \log f_2(x_i) \Big) = \sum_{i=1}^{n} Q_i(\theta)
\]
Please read: regarding the above notation, f_1(x_i) represents the entry in the first row and i-th column of the matrix P from the lecture notes (NN Backpropagation.pdf). Similarly, f_2(x_i) represents the entry in the second row and i-th column of P. I should have been more careful when defining cross-entropy in that set of notes; however, the forward pass and backpropagation algorithms provided there are still correct.
The neural network is trained by minimizing Q(θ) with respect to θ, where the parameter θ is the collection of weights and biases W^[1], b^[1], W^[2], and b^[2]. The quantity Q_i(θ) represents the i-th training case's contribution to the total cross-entropy Q(θ). The neural network is trained via gradient descent, yielding the estimated weights:
\[
\hat{b}^{[1]} =
\begin{pmatrix} \hat{b}^{[1]}_1 \\ \hat{b}^{[1]}_2 \end{pmatrix}_{(2 \times 1)}
=
\begin{pmatrix} -5.7840 \\ 3.4478 \end{pmatrix},
\qquad
\hat{W}^{[1]} =
\begin{pmatrix} \hat{w}^{[1]}_{11} & \hat{w}^{[1]}_{12} \\ \hat{w}^{[1]}_{21} & \hat{w}^{[1]}_{22} \end{pmatrix}_{(2 \times 2)}
=
\begin{pmatrix} -17.7838 & 43.4752 \\ -7.5506 & -0.5264 \end{pmatrix}
\]
\[
\hat{b}^{[2]} =
\begin{pmatrix} \hat{b}^{[2]}_1 \\ \hat{b}^{[2]}_2 \end{pmatrix}_{(2 \times 1)}
=
\begin{pmatrix} -2.0010 \\ 2.0030 \end{pmatrix},
\qquad
\hat{W}^{[2]} =
\begin{pmatrix} \hat{w}^{[2]}_{11} & \hat{w}^{[2]}_{12} \\ \hat{w}^{[2]}_{21} & \hat{w}^{[2]}_{22} \end{pmatrix}_{(2 \times 2)}
=
\begin{pmatrix} 25.6367 & -66.6665 \\ -26.3790 & 68.6225 \end{pmatrix}
\]
Solve problems (2.i)-(2.iv):
The following problems can be simplified using matrix multiplication. If you solve these questions in R or
Python, please include all relevant steps in your handwritten solutions but do not include the code.
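Although your submitted solution should not include code, here, for orientation, is a minimal Python sketch of the forward pass. It assumes the column-vector convention z^[1] = Ŵ^[1] x + b̂^[1], h = σ(z^[1]), z^[2] = Ŵ^[2] h + b̂^[2], p = softmax(z^[2]); verify against NN Backpropagation.pdf, since a transposed convention would change the arithmetic.

```python
import numpy as np

# Estimated weights and biases, copied from above.
W1 = np.array([[-17.7838, 43.4752],
               [-7.5506,  -0.5264]])
b1 = np.array([-5.7840, 3.4478])
W2 = np.array([[ 25.6367, -66.6665],
               [-26.3790,  68.6225]])
b2 = np.array([-2.0010, 2.0030])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Forward pass: returns the softmax outputs (f1(x), f2(x))."""
    h = sigmoid(W1 @ x + b1)      # d = 2 derived features
    z = W2 @ h + b2               # output-layer scores
    p = np.exp(z - z.max())       # softmax, shifted for numerical stability
    return p / p.sum()

def cost(x, y):
    """Per-case cross-entropy Q_i(theta) = -(y log f1(x) + (1 - y) log f2(x))."""
    f1, f2 = forward(x)
    return -(y * np.log(f1) + (1 - y) * np.log(f2))
```

Under this assumed convention, forward(np.array([0.600, 0.395])) and cost(np.array([0.4474, 0.8764]), 0) correspond to the quantities asked for in problems 2.i and 2.ii.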
2.i [10 points] Classify the test case x_test = (0.600, 0.395) based on the trained neural network. Show the details of your calculation for full credit.

2.ii [5 points] The 17th case of the training data is x_17 = (0.4474, 0.8764) with label y_17 = 0. Compute the cost of case 17, i.e., compute Q_17(θ̂).

2.iii [10 points] Note that the gradient of Q(θ) can be expressed as
\[
\nabla Q(\theta) = \sum_{i=1}^{n} \nabla Q_i(\theta).
\]
Use the backpropagation algorithm to compute ∇Q_17(θ̂). This quantity represents the 17th case's contribution to the full gradient of Q(θ), evaluated at the point θ = θ̂. Show the details of your calculation for full credit. (A generic backpropagation sketch follows problem 2.iv.)
2.iv [5 points] What should ∇Q(θˆ) approximately equal? Justify your answer in one or two sentences.
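Continuing the forward-pass sketch above (and reusing W1, b1, W2, b2, and sigmoid defined there), a generic backpropagation for one training case. It assumes the one-hot target t = (y, 1 − y)^T aligned with (f_1, f_2), so that softmax combined with cross-entropy gives the usual output error p − t; again, check the exact conventions against NN Backpropagation.pdf.

```python
def backprop_case(x, y):
    """Gradient of Q_i(theta) with respect to (W1, b1, W2, b2) for a single case (x, y)."""
    h = sigmoid(W1 @ x + b1)            # forward pass repeated so intermediates are available
    z = W2 @ h + b2
    p = np.exp(z - z.max())
    p /= p.sum()
    t = np.array([y, 1.0 - y])          # one-hot target aligned with (f1, f2)
    dz2 = p - t                         # softmax + cross-entropy output error
    dW2, db2 = np.outer(dz2, h), dz2
    dz1 = (W2.T @ dz2) * h * (1.0 - h)  # backprop through the sigmoid: h(1 - h)
    dW1, db1 = np.outer(dz1, x), dz1
    return dW1, db1, dW2, db2
```

Summing backprop_case over all n = 100 training cases would yield the full gradient ∇Q(θ̂) referenced in problems 2.iii and 2.iv.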
Problem 3: Smoothing Splines [10 points]
Consider two curves, ĝ_1 and ĝ_2, defined by
\[
\hat{g}_1 = \operatorname*{arg\,min}_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \, dx \right),
\]
\[
\hat{g}_2 = \operatorname*{arg\,min}_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \, dx \right),
\]
where g^(m) represents the m-th derivative of g and λ > 0 is the tuning parameter. Similar to linear regression, the training and testing RSS are respectively defined as
\[
\mathrm{RSS}_{\text{train}} = \sum_{i=1}^{n} (y_i - \hat{g}(x_i))^2,
\qquad
\mathrm{RSS}_{\text{test}} = \sum_{i=1}^{n^*} (y_i^* - \hat{g}(x_i^*))^2,
\]
where ĝ is fit on the training data.
Solve problems (3.i)-(3.ii):
3.i [5 points] As λ → ∞, will ĝ_1 or ĝ_2 have the smaller training RSS, or is there no definite answer? Describe your answer in a few sentences.

3.ii [5 points] As λ → ∞, will ĝ_1 or ĝ_2 have the smaller test RSS, or is there no definite answer? Describe your answer in a few sentences.
Problem 4: Optimization [10 points]
Consider the L2-penalized logistic regression model having p features. For simplicity, we exclude the intercept and do not standardize the features. The objective or cost function is
\[
Q(\beta) = Q(\beta_1, \beta_2, \cdots, \beta_p) = \frac{1}{n} L(\beta_1, \beta_2, \cdots, \beta_p) + \lambda \sum_{j=1}^{p} \beta_j^2,
\]
where L(β_1, β_2, ..., β_p) is the negative log-likelihood and λ > 0 is the tuning parameter. For fixed λ > 0, our goal is to estimate β = (β_1, β_2, ..., β_p)^T.
Solve the following problem:
Derive the update step for Newton's method:
\[
\beta^{(t+1)} := \beta^{(t)} - \left[ H_Q(\beta^{(t)}) \right]^{-1} \nabla Q(\beta^{(t)})
\]
Note: students are welcome to use relevant results and simplifications directly from the lecture notes.
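The closed-form gradient and Hessian of the penalized objective are what this problem asks you to derive, and they are not given here. Purely as a way to sanity-check such a derivation numerically, below is a minimal, generic Newton iteration that accepts user-supplied grad and hess callables (both are hypothetical placeholders for whatever you derive).

```python
import numpy as np

def newton(beta0, grad, hess, n_iter=50, tol=1e-10):
    """Generic Newton's method: beta <- beta - [H_Q(beta)]^{-1} grad Q(beta).

    grad(beta) should return the p-vector gradient of Q and hess(beta) the
    p x p Hessian; deriving these for the L2-penalized logistic objective is
    the content of Problem 4.
    """
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        step = np.linalg.solve(hess(beta), grad(beta))  # solve H step = grad
        beta -= step
        if np.linalg.norm(step) < tol:                  # stop once updates are negligible
            break
    return beta
```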
Problem 5: EM Algorithm and Clustering [10 points]
For fixed K > 0, let x_1, x_2, ..., x_n be iid cases from the mixture distribution
\[
\pi(x) = \sum_{k=1}^{K} c_k \, p(x \mid \mu_k),
\]
where \(\sum_k c_k = 1\), c_k ≥ 0, and p(x | μ_k) is the exponential density
\[
p(x \mid \mu_k) = \frac{1}{\mu_k} \exp\!\left( -\frac{x}{\mu_k} \right), \qquad x > 0.
\]
Here we are clustering the cases based on an exponential mixture model.
Solve the following problem:
Write down the EM algorithm for estimating the parameters c_1, c_2, ..., c_K and μ_1, μ_2, ..., μ_K. Note: this problem becomes easy if you exploit properties of exponential families.
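For orientation only, a Python sketch of one standard form of EM for this exponential mixture, assuming the mean parametrization p(x | μ_k) = exp(−x/μ_k)/μ_k given above. Deriving and justifying the E-step responsibilities and the M-step updates is exactly what the problem asks, so treat this as a sketch to check your derivation against rather than as the solution itself.

```python
import numpy as np

def em_exponential_mixture(x, K, n_iter=200, seed=0):
    """Sketch of EM for a K-component exponential mixture (mean parametrization)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)
    c = np.full(K, 1.0 / K)                          # mixture weights c_k
    mu = rng.choice(x, size=K, replace=False)        # crude initialization of mu_k
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] proportional to c_k * p(x_i | mu_k)
        dens = np.exp(-np.outer(x, 1.0 / mu)) / mu   # n x K matrix of component densities
        gamma = c * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: exponential-family structure gives weighted-average updates
        Nk = gamma.sum(axis=0)
        c = Nk / n
        mu = gamma.T @ x / Nk                        # responsibility-weighted mean per component
    return c, mu
```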
Problem 6: True or False [20 points]
For this section, please clearly circle either TRUE or FALSE for each question. In your case, since you are solving the exam on your own paper, please write TRUE or FALSE on your submitted exam. Ambiguous choices will be marked incorrect. No additional work is required to justify each answer.
6.i In logistic regression, negative log-likelihood Q(β) is a convex function of β:
TRUE FALSE
6.ii Consider a neural network with one hidden layer, sigmoid activation and softmax output function.
The total cost Q(θ) (cross-entropy) is a convex function of θ:
TRUE FALSE
6.iii Consider a Gaussian mixture model with negative log-likelihood
\[
Q(\theta) = -\sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} c_k \, p(x_i \mid \mu_k, \Sigma_k) \right),
\]
where θ is the collection of means μ_1, μ_2, ..., μ_K, covariances Σ_1, Σ_2, ..., Σ_K, and mixture parameters c_1, c_2, ..., c_K (see the EM/clustering lecture notes for details). The negative log-likelihood is a convex function of θ:
TRUE FALSE
6.iv Let x, x′ ∈ R^d and assume that k_1(x, x′) and k_2(x, x′) are valid kernels. The function
\[
k(x, x') = k_1(x, x') - k_2(x, x')
\]
is a valid kernel:
TRUE FALSE

6.v Let x, x′ ∈ R^d. The function k(x, x′) = 1 + x^T x′ is a valid kernel:
TRUE FALSE
6.vi Consider L1-penalized linear or logistic regression (LASSO). As λ > 0 increases, the L1 penalty forces
some coefficients βj to be exactly 0 and the remaining non-zero coefficients are the same estimates
produced from unpenalized regression:
TRUE FALSE
6.vii Consider L2-penalized linear or logistic regression (Ridge). Larger values of λ > 0 will typically
decrease the bias:
TRUE FALSE
6.viii Consider linear regression (OLS) with p > 0 features. Larger values of p will typically decrease the
bias:
TRUE FALSE
6.ix Consider k-NN classification or regression. Larger values of k > 0 will typically decrease the bias:
TRUE FALSE
6.x Consider a model with tuning parameter α. The K-fold cross-validation error
\[
CV(\hat{f}, \alpha) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{(x, y) \in B_k} L\big(y, \hat{f}^{-k}(x, \alpha)\big)
\]
is an unbiased estimator of the true generalization error of f̂:
TRUE FALSE