University of Waterloo, CO673/CS794, 2020 Fall
Yao-Liang Yu ([email protected]), ©2020 University of Waterloo

CO673/CS794: Optimization for Data Science
Homework 1
Due: 11:59pm AOE, September 29, 2020; submit on LEARN.

Include your name and student number! Submit your writeup in pdf and all source code in a zip file (with proper documentation). Write a script for each programming exercise so that the TA can easily run and verify your results. Make sure your code runs! [Programming language: Python or Julia]

Exercise 1: Gradient Descent for Logistic Regression (10 pts)

The following pseudo-code is buggy! Nevertheless, it might give you a good sense of how to start.

Algorithm 1: [w, f] = gd(func; w = 0, $\eta$ = 1, ls = true, tol = 1e-3, maxiter = 1e3)
Input: func: function, returns function value and gradient; w: initializer; $\eta$: constant step size; ls: boolean, indicating line search; tol: tolerance; maxiter: maximum iteration number

    for t = 0, 1, ..., maxiter do
        [f_t, g] <- func(w)                      // compute function value and gradient
        if ||g||_2 <= tol then                   // check convergence
            break
        w <- w - η · g                           // update
        if ls then                               // Armijo's rule
            h <- func(w)
            while h > f_t - η · ||g||_2^2 / 2 do
                η <- η / 2
                w <- w - η · g                   // update
                h <- func(w)
        ******                                   // extract some info required for reporting

In this exercise we solve logistic regression using gradient descent. Given a dataset consisting of $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ and $y = [y_1, \ldots, y_n]^\top \in \{\pm 1\}^n$, logistic regression minimizes the following objective:
\[
\min_w\ \underbrace{\frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 + \exp(-\hat{y}_i y_i)\bigr)}_{f(w)},
\quad \text{where } \hat{y}_i = \langle x_i, w \rangle,\quad
x_i := \begin{pmatrix} x_i \\ 1 \end{pmatrix}
\tag{1}
\]

1. (2 pts) Compute the gradient and Hessian of $f$ above. Write your solution in matrix form and avoid unnecessary summation.

2. (1 pt) Is the objective $L$-smooth w.r.t. the Euclidean norm? If yes, provide an estimate of $L$.

3. (2 pts) Implement the gradient descent Algorithm 2.4 with two choices of step size: (a) constant, and (b) Armijo's rule (Remark 2.17). A buggy pseudo-code is given at the top to (mis)lead you; a reference sketch follows after this exercise.

4. (2 pts) Apply your gradient descent implementation on the OR problem (n = 4, d = 2):

        x1    x2     y
         1     1     1
         1    -1     1
        -1     1     1
        -1    -1    -1

   Plot 4 figures:

   • Use the constant $L$ you estimated in Ex 1.2 and plot the resulting objective function value $f(w_t)$ w.r.t. iteration $t$;
   • Use Armijo's rule and plot the resulting objective function value $f(w_t)$ w.r.t. iteration $t$;
   • Use Armijo's rule and plot the magnitude of $w_t$, i.e. $\|w_t\|_2$, w.r.t. iteration $t$;
   • Use Armijo's rule and plot the classification accuracy of $w_t$, i.e. $\frac{1}{n}\sum_{i=1}^{n} [\![\hat{y}_i y_i > 0]\!]$, w.r.t. iteration $t$. The predicate $[\![\cdot]\!]$ is 1 if the argument is true and 0 otherwise.

5. (2 pts) Use Armijo's rule, compute $\bar{w}_t = w_t / \|w_t\|_2$, and plot the hyperplane represented by $\bar{w}_t$ after t = 1, 10, 100, 1000, 10000 iterations, along with the OR dataset. (Similar to the figures we drew in Lecture 00 for Perceptron.)

6. (1 pt) Repeat Ex 1.4 on the XOR dataset:

        x1    x2     y
         1     1    -1
         1    -1     1
        -1     1     1
        -1    -1    -1
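For Ex 1.3, here is a minimal reference sketch in Python (assuming NumPy). It is one possible repair of Algorithm 1, not the required interface: the Armijo search selects the step by halving a trial step before the iterate is updated, and the trial step is reset to $\eta$ at every iteration. The oracle logistic_oracle and its assumption that $X$ already carries the constant-1 column from (1) are illustrative choices.

    import numpy as np

    def logistic_oracle(w, X, y):
        # Logistic loss (1) and its gradient; X is assumed to be
        # pre-augmented with a constant-1 column (so w contains the bias).
        m = -y * (X @ w)                       # margins -yhat_i * y_i
        f = np.mean(np.logaddexp(0.0, m))      # stable log(1 + exp(m))
        sigma = 1.0 / (1.0 + np.exp(-m))       # sigmoid of the margins
        return f, -(X.T @ (y * sigma)) / len(y)

    def gd(func, w, eta=1.0, ls=True, tol=1e-3, maxiter=1000):
        # Gradient descent with either a constant step size (ls=False)
        # or Armijo's halving rule (ls=True).
        history = []                           # objective values, for plotting
        for t in range(int(maxiter)):
            ft, g = func(w)
            history.append(ft)
            if np.linalg.norm(g) <= tol:       # convergence check
                break
            step = eta
            if ls:                             # Armijo: sufficient decrease
                while func(w - step * g)[0] > ft - step * (g @ g) / 2:
                    step /= 2
            w = w - step * g                   # single update per iteration
        return w, history

    # Example on the OR dataset, constant-1 column appended:
    X = np.array([[1, 1, 1], [1, -1, 1], [-1, 1, 1], [-1, -1, 1]], dtype=float)
    y = np.array([1, 1, 1, -1], dtype=float)
    w, history = gd(lambda v: logistic_oracle(v, X, y), w=np.zeros(3))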
Exercise 2: Proximal Gradient for Sparse Robust Regression (10 pts)

In this exercise we solve a sparse robust regression problem. Given a dataset consisting of $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ and $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$, we minimize the regularized Huber's loss:
\[
\min_w\ \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(\hat{y}_i - y_i)}_{f(w)} + r(w),
\quad \text{where } \hat{y}_i = \langle x_i, w \rangle,
\tag{2}
\]
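A minimal sketch of the proximal gradient iteration in Python (assuming NumPy). Since $\ell$ and $r$ are only named above, the sketch assumes the standard Huber loss with threshold delta and the common sparsity-inducing choice $r(w) = \lambda \|w\|_1$, whose proximal map is soft-thresholding; substitute the definitions given in the assignment if they differ.

    import numpy as np

    def huber_oracle(w, X, y, delta=1.0):
        # f(w) = (1/n) * sum_i huber(yhat_i - y_i) and its gradient, where
        # huber(z) = z^2/2 if |z| <= delta, else delta * (|z| - delta/2)
        # is the assumed standard Huber loss.
        r = X @ w - y                                  # residuals
        small = np.abs(r) <= delta
        f = np.mean(np.where(small, r ** 2 / 2, delta * (np.abs(r) - delta / 2)))
        dr = np.where(small, r, delta * np.sign(r))    # huber'(residuals)
        return f, X.T @ dr / len(y)

    def soft_threshold(v, t):
        # Proximal map of t * ||.||_1 (the assumed r).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def pgd(func, w, eta, lam, maxiter=1000, tol=1e-6):
        # Proximal gradient: w <- prox_{eta*lam*||.||_1}(w - eta * grad f(w)).
        for t in range(int(maxiter)):
            _, g = func(w)
            w_new = soft_threshold(w - eta * g, eta * lam)
            if np.linalg.norm(w_new - w) <= tol:       # fixed-point check
                return w_new
            w = w_new
        return w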