The Australian National University
Research School of Computer Science
Semester 2, 2020

COMP3670/6670: Introduction to Machine Learning
Theory Assignment 3 of 5
Release Date: Aug 31st, 2020
Due Date: 23:59, Sep 27th, 2020
Total marks: 100

Exercise 1 (15+15+5+5 credits)

Let $A \in \mathbb{R}^{N \times N}$ and $\Lambda \in \mathbb{R}^{D \times D}$ be symmetric and positive definite. Define $\langle x, y \rangle_A := x^T A y$, and define the corresponding norm in the usual fashion:

$$\|x\|_A := \sqrt{\langle x, x \rangle_A}.$$

We define $\langle \cdot, \cdot \rangle_\Lambda$ and $\|\cdot\|_\Lambda$ in the same way.

Suppose we are performing linear regression with a training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$ for each $i$. We define the matrix $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times D}$ and the vector $y = [y_1, \ldots, y_N]^T \in \mathbb{R}^N$.

We would like to find $\theta \in \mathbb{R}^D$ such that $y \approx X\theta$, where the error is measured using $\|\cdot\|_A$. We avoid overfitting by adding a weighted regularization term, measured using $\|\cdot\|_\Lambda$. The loss function with regularizer is

$$L_{A,\Lambda}(\theta) = \|y - X\theta\|_A^2 + \|\theta\|_\Lambda^2.$$

1. Find the gradient $\nabla_\theta L_{A,\Lambda}(\theta)$.
2. Set $\nabla_\theta L_{A,\Lambda}(\theta) = 0$ and solve for $\theta$.
3. Show that if we choose $A = I$ and $\Lambda = \lambda I$, where $\lambda \in \mathbb{R}$, your answer to part 2 agrees with the analytic solution of the standard least squares regression problem with regularization, given by $\theta = (X^T X + \lambda I)^{-1} X^T y$.
4. What advantages are there in choosing $\|\theta\|_\Lambda^2$ over the more standard regularization term $\lambda \|\theta\|_2^2$?

HINT:
- You may use the fact that a symmetric positive definite matrix is invertible.
- Since $A$ is symmetric and positive definite, $\langle \cdot, \cdot \rangle_A$ is a valid inner product.

Exercise 2 (15+15+15+15 credits)

During linear regression, the goal is to fit a model to given data. Usually we have a family of models to choose from, and the particular model selected is characterized by a parameter $\theta$. We wish to choose the $\theta$ that gives us the "best" fit, where "best" is usually defined as the $\theta$ that minimises some measure of error.
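Once you have derived the closed-form solution in Exercise 1, part 2, you can sanity-check it numerically. The sketch below assumes the stationary point has the form $\theta^* = (X^T A X + \Lambda)^{-1} X^T A y$ (derive this yourself first; the code only verifies that the gradient vanishes there and that the $A = I$, $\Lambda = \lambda I$ case reduces to ridge regression):

```python
import numpy as np

# Sanity check for Exercise 1, parts 2 and 3.
# Assumed closed form (to be derived in part 2):
#   theta* = (X^T A X + Lambda)^{-1} X^T A y
rng = np.random.default_rng(0)
N, D, lam = 8, 3, 0.5
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# A random symmetric positive definite A, and Lambda = lam * I.
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
Lam = lam * np.eye(D)

theta_star = np.linalg.solve(X.T @ A @ X + Lam, X.T @ A @ y)

# The gradient 2 X^T A (X theta - y) + 2 Lambda theta should vanish at theta*.
grad = 2 * X.T @ A @ (X @ theta_star - y) + 2 * Lam @ theta_star
print(np.allclose(grad, 0))

# Part 3: with A = I the same formula gives the ridge regression solution.
A_I = np.eye(N)
theta_general = np.linalg.solve(X.T @ A_I @ X + Lam, X.T @ A_I @ y)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(theta_general, theta_ridge))
```

Such a numerical check is not a proof, but it catches sign errors in the gradient quickly.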
How the error is measured can vastly change the behaviour of the optimal model, along with how easy it is to find an optimal value.

Consider the following linear regression problem in one dimension. We have a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$ of $N$ points in $\mathbb{R}$, and we want to choose the best representative $\mu \in \mathbb{R}$ for that dataset $\mathcal{D}$. Assume that the points are listed in order,

$$x_1 \leq x_2 \leq \ldots \leq x_N.$$

The best choice of $\mu$ is one that minimises some loss function $L(\mu, \mathcal{D})$.

1. For the $L_2$ loss function, $L_2(\mu, \mathcal{D}) = \sum_{i=1}^N (\mu - x_i)^2$, show that the optimal choice for $\mu$ is the mean, $\mu = \frac{1}{N} \sum_{i=1}^N x_i$.
2. For the $L_1$ loss function, $L_1(\mu, \mathcal{D}) = \sum_{i=1}^N |\mu - x_i|$, show that the optimal choice for $\mu$ is the median value of the set. (Hint: Try pairing off opposite terms in $L_1$, and argue why the sum of each pair must be constant.)
3. For the $L_\infty$ loss function, $L_\infty(\mu, \mathcal{D}) = \max_{1 \leq i \leq N} |\mu - x_i|$, show that the optimal choice for $\mu$ is $\frac{x_1 + x_N}{2}$, the average of only the smallest and largest data points in the set. (Hint: Argue why the internal points $x_2, \ldots, x_{N-1}$ are irrelevant.)
4. For the following data sets, compute the optimal representative $\mu$ with respect to the $L_2$, $L_1$ and $L_\infty$ loss functions. Summarise your results in a table, with one row per data set and one column per loss function. (You don't need to show the calculations for this.) Comment on how sensitive each loss function is to outliers.

$D_1 = \{0, 1, 2, 3, 5\}$
$D_2 = \{0, 1, 2, 3, 100\}$
$D_3 = \{0, 10, 30, 35, 100\}$

HINT: The median is the middle value in the data set. For an odd number of points, it is the middle point; for an even number of points, it is the average of the two middle values.
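After you have proved parts 1 to 3, the closed forms (mean, median, and midpoint of the extremes) make part 4 easy to check numerically. The sketch below assumes those results, so derive them before trusting it:

```python
import numpy as np

def optimal_mu(data):
    """Optimal representative mu under each loss, assuming the
    closed forms proved in parts 1-3 of Exercise 2."""
    x = np.sort(np.asarray(data, dtype=float))
    return {
        "L2":   x.mean(),            # minimiser of sum_i (mu - x_i)^2
        "L1":   float(np.median(x)), # minimiser of sum_i |mu - x_i|
        "Linf": (x[0] + x[-1]) / 2,  # minimiser of max_i |mu - x_i|
    }

# The three data sets from part 4.
for name, D in [("D1", [0, 1, 2, 3, 5]),
                ("D2", [0, 1, 2, 3, 100]),
                ("D3", [0, 10, 30, 35, 100])]:
    print(name, optimal_mu(D))
```

Comparing the rows for $D_1$ and $D_2$ (which differ only in the last point) shows directly how each loss reacts to a single outlier.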