The Australian National University
Research School of Computer Science
Semester 2, 2020

COMP3670/6670: Introduction to Machine Learning
Theory Assignment 3 of 5
Release Date: Aug 31st, 2020
Due Date: 23:59, Sep 27th, 2020
Total marks: 100

Exercise 1 (15+15+5+5 credits)

Let $A \in \mathbb{R}^{N \times N}$ and $\Lambda \in \mathbb{R}^{D \times D}$ be symmetric and positive definite. Define $\langle x, y \rangle_A := x^T A y$, and define the corresponding norm in the usual fashion:

$$\|x\|_A := \sqrt{\langle x, x \rangle_A}.$$

We define $\langle \cdot, \cdot \rangle_\Lambda$ and $\|\cdot\|_\Lambda$ in the same way.

Suppose we are performing linear regression with a training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$ for each $i$. We define the matrix $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times D}$ and the vector $y = [y_1, \ldots, y_N]^T \in \mathbb{R}^N$.

We would like to find $\theta \in \mathbb{R}^D$ such that $y \approx X\theta$, where the error is measured using $\|\cdot\|_A$. We avoid overfitting by adding a weighted regularization term, measured using $\|\cdot\|_\Lambda$. The loss function with regularizer is

$$L_{A,\Lambda}(\theta) = \|y - X\theta\|_A^2 + \|\theta\|_\Lambda^2.$$

1. Find the gradient $\nabla_\theta L_{A,\Lambda}(\theta)$.
2. Set $\nabla_\theta L_{A,\Lambda}(\theta) = 0$ and solve for $\theta$.
3. Show that if we choose $A = I$ and $\Lambda = \lambda I$, where $\lambda \in \mathbb{R}$, your answer to part 2 agrees with the analytic solution of the standard least squares regression problem with regularization, given by $\theta = (X^T X + \lambda I)^{-1} X^T y$.
4. What advantages are there in choosing $\|\theta\|_\Lambda^2$ over the more standard regularization term $\lambda \|\theta\|_2^2$?

HINT:
- You may use the fact that a symmetric positive definite matrix is invertible.
- Since $A$ is symmetric and positive definite, $\langle \cdot, \cdot \rangle_A$ is a valid inner product.

Exercise 2 (15+15+15+15 credits)

During linear regression, the goal is to fit a model to given data. Usually we have a family of models to choose from, and the particular model selected is characterized by a parameter $\theta$. We wish to choose the $\theta$ that gives us the "best" fit, where "best" is usually defined as the $\theta$ that minimises some measure of error.
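Once you have derived the closed-form solution in Exercise 1, part 2, you can sanity-check it numerically. The sketch below assumes the stationary point has the form $\theta^* = (X^T A X + \Lambda)^{-1} X^T A y$ (derive this yourself first; the code only verifies that the gradient vanishes there and that the $A = I$, $\Lambda = \lambda I$ case reduces to ridge regression):

```python
import numpy as np

# Sanity check for Exercise 1, parts 2 and 3.
# Assumed closed form (to be derived in part 2):
#   theta* = (X^T A X + Lambda)^{-1} X^T A y
rng = np.random.default_rng(0)
N, D, lam = 8, 3, 0.5
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# A random symmetric positive definite A, and Lambda = lam * I.
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
Lam = lam * np.eye(D)

theta_star = np.linalg.solve(X.T @ A @ X + Lam, X.T @ A @ y)

# The gradient 2 X^T A (X theta - y) + 2 Lambda theta should vanish at theta*.
grad = 2 * X.T @ A @ (X @ theta_star - y) + 2 * Lam @ theta_star
print(np.allclose(grad, 0))

# Part 3: with A = I the same formula gives the ridge regression solution.
A_I = np.eye(N)
theta_general = np.linalg.solve(X.T @ A_I @ X + Lam, X.T @ A_I @ y)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(theta_general, theta_ridge))
```

Such a numerical check is not a proof, but it catches sign errors in the gradient quickly.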
How the error is measured can vastly change the behaviour of the optimal model, along with how easy it is to find an optimal value.

Consider the following linear regression problem in one dimension. We have a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$ of $N$ points in $\mathbb{R}$, and we want to choose the best representative $\mu \in \mathbb{R}$ for that dataset $\mathcal{D}$. Assume that the points are listed in order,

$$x_1 \leq x_2 \leq \ldots \leq x_N.$$

The best choice of $\mu$ is one that minimises some loss function $L(\mu, \mathcal{D})$.

1. For the $L_2$ loss function, $L_2(\mu, \mathcal{D}) = \sum_{i=1}^N (\mu - x_i)^2$, show that the optimal choice for $\mu$ is the mean, $\mu = \frac{1}{N} \sum_{i=1}^N x_i$.
2. For the $L_1$ loss function, $L_1(\mu, \mathcal{D}) = \sum_{i=1}^N |\mu - x_i|$, show that the optimal choice for $\mu$ is the median value of the set. (Hint: Try pairing off opposite terms in $L_1$, and argue why the sum of each pair must be constant.)
3. For the $L_\infty$ loss function, $L_\infty(\mu, \mathcal{D}) = \max_{1 \leq i \leq N} |\mu - x_i|$, show that the optimal choice for $\mu$ is $\frac{x_1 + x_N}{2}$, the average of only the smallest and largest data points in the set. (Hint: Argue why the internal points $x_2, \ldots, x_{N-1}$ are irrelevant.)
4. For the following data sets, compute the optimal representative $\mu$ with respect to the $L_2$, $L_1$ and $L_\infty$ loss functions. Summarise your results in a table, with one row per data set and one column per loss function. (You don't need to show the calculations for this.) Comment on how sensitive each loss function is to outliers.

$D_1 = \{0, 1, 2, 3, 5\}$
$D_2 = \{0, 1, 2, 3, 100\}$
$D_3 = \{0, 10, 30, 35, 100\}$

HINT: The median is the middle value in the data set. For an odd number of points, it is the middle point; for an even number of points, it is the average of the two middle values.
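After you have proved parts 1 to 3, the closed forms (mean, median, and midpoint of the extremes) make part 4 easy to check numerically. The sketch below assumes those results, so derive them before trusting it:

```python
import numpy as np

def optimal_mu(data):
    """Optimal representative mu under each loss, assuming the
    closed forms proved in parts 1-3 of Exercise 2."""
    x = np.sort(np.asarray(data, dtype=float))
    return {
        "L2":   x.mean(),            # minimiser of sum_i (mu - x_i)^2
        "L1":   float(np.median(x)), # minimiser of sum_i |mu - x_i|
        "Linf": (x[0] + x[-1]) / 2,  # minimiser of max_i |mu - x_i|
    }

# The three data sets from part 4.
for name, D in [("D1", [0, 1, 2, 3, 5]),
                ("D2", [0, 1, 2, 3, 100]),
                ("D3", [0, 10, 30, 35, 100])]:
    print(name, optimal_mu(D))
```

Comparing the rows for $D_1$ and $D_2$ (which differ only in the last point) shows directly how each loss reacts to a single outlier.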