University of Waterloo CO673/CS794 2020 Fall
CO673/CS794: Optimization for Data Science
Homework 1
Due: 11:59pm AOE, September 29, 2020, submit on LEARN.
Include your name and student number!
Submit your write-up as a PDF and all source code in a zip file (with proper documentation). Write a script for each
programming exercise so that the TA can easily run and verify your results. Make sure your code runs!
[Programming language: Python or Julia]
Exercise 1: Gradient Descent for Logistic Regression (10 pts)
The following pseudo-code is buggy! Nevertheless, it might give you a good sense of how to start.
Algorithm 1: [w, f] = gd(func; w = 0, η = 1, ls = true, tol = 1e-3, maxiter = 1e3)
Input: func: function, returns function value and gradient; w: initializer; η: constant step size; ls: boolean, indicating line search; tol: tolerance; maxiter: maximum iteration number

    for t = 0, 1, ..., maxiter do
        [f_t, g] ← func(w)                      // compute function value and gradient
        if ‖g‖_2 ≤ tol then                     // check convergence
            break
        w ← w − η · g                           // update
        if ls then                              // Armijo's rule
            h ← func(w)
            while h > f_t − η · ‖g‖_2^2 / 2 do
                η ← η / 2
                w ← w − η · g                   // update
                h ← func(w)
        ******                                  // extract some info required for reporting
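For comparison, below is a minimal, corrected Python sketch of gradient descent with Armijo backtracking. It assumes func returns a (value, gradient) pair; the function name gd, the returned history of function values, and the choice to reset the trial step size at every iteration are my own illustrative choices, and the sketch is not necessarily identical to Algorithm 2.4 / Remark 2.17 from the lecture notes.

    import numpy as np

    def gd(func, w0, eta=1.0, ls=True, tol=1e-3, maxiter=1000):
        """Gradient descent with an optional Armijo backtracking line search.

        func(w) is assumed to return the pair (f(w), grad f(w)).
        Returns the final iterate and the recorded function values.
        """
        w = np.asarray(w0, dtype=float)
        history = []
        for t in range(int(maxiter)):
            f, g = func(w)
            history.append(f)
            if np.linalg.norm(g) <= tol:      # convergence check on the gradient norm
                break
            step = eta
            if ls:                            # Armijo's rule: halve until sufficient decrease
                while func(w - step * g)[0] > f - step * np.dot(g, g) / 2:
                    step /= 2
            w = w - step * g                  # take the (possibly backtracked) step
        return w, history

Comparing this sketch against Algorithm 1 line by line is one way to spot the intended bugs.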
In this exercise we solve logistic regression using gradient descent. Given a dataset consisting of $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ and $y = [y_1, \dots, y_n]^\top \in \{\pm 1\}^n$, logistic regression minimizes the following objective:

$$\min_{w} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + \exp(-\hat{y}_i y_i)\bigr)}_{f(w)}, \quad \text{where } \hat{y}_i = \langle x_i, w \rangle, \quad x_i := \begin{bmatrix} x_i \\ 1 \end{bmatrix}. \tag{1}$$
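As a sanity check while implementing func, the objective value in (1) can be evaluated as below; the helper name logistic_objective and the use of numpy.logaddexp for numerical stability are my choices, not part of the assignment.

    import numpy as np

    def logistic_objective(w, X, y):
        """Value of f(w) in (1); X is n-by-d, y has entries in {-1, +1}, w is in R^(d+1)."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i := [x_i; 1] (append bias feature)
        yhat = Xb @ w                                    # yhat_i = <x_i, w>
        return np.mean(np.logaddexp(0.0, -yhat * y))    # mean of log(1 + exp(-yhat_i * y_i))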
1. (2 pts) Compute the gradient and Hessian of f above. Write your solution in matrix form and avoid unnecessary summations.
2. (1 pt) Is the objective L-smooth w.r.t. the Euclidean norm? If yes, provide an estimate of L.
3. (2 pts) Implement the gradient descent Algorithm 2.4 with two choices of step size: (a) constant, and
(b) Armijo’s rule (Remark 2.17). A buggy pseudo-code is given at the top to (mis)lead you.
4. (2 pts) Apply your gradient descent implementation to the OR problem (n = 4, d = 2):

    x1   x2    y
     1    1    1
     1   -1    1
    -1    1    1
    -1   -1   -1
Plot 4 figures:
• Use the constant L you estimated in Ex 1.2 and plot the resulting objective function value f(w_t) w.r.t. iteration t;
• Use Armijo's rule and plot the resulting objective function value f(w_t) w.r.t. iteration t;
• Use Armijo's rule and plot the magnitude of w_t, i.e. $\|w_t\|_2$, w.r.t. iteration t;
• Use Armijo's rule and plot the classification accuracy of w_t, i.e. $\frac{1}{n}\sum_{i=1}^{n} [\![\hat{y}_i y_i > 0]\!]$, w.r.t. iteration t. The predicate $[\![\cdot]\!]$ is 1 if the argument is true and 0 otherwise. (A possible setup is sketched below.)
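One possible way to wire up the first two plots, assuming the gd and logistic_objective sketches above and a func wrapping your gradient from Ex 1.1; all names and plotting choices here are illustrative only.

    import numpy as np
    import matplotlib.pyplot as plt

    # OR dataset from the table above (features and labels in {-1, +1})
    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    y = np.array([1.0, 1.0, 1.0, -1.0])

    def make_value_plots(func, L):
        """func(w) must return (f(w), grad f(w)); L is your estimate from Ex 1.2."""
        w0 = np.zeros(3)                                     # d + 1 = 3 after appending the bias
        _, hist_const = gd(func, w0, eta=1.0 / L, ls=False)  # constant step (1/L is one common choice)
        _, hist_armijo = gd(func, w0, ls=True)               # Armijo's rule
        plt.plot(hist_const, label="constant step")
        plt.plot(hist_armijo, label="Armijo")
        plt.xlabel("iteration t")
        plt.ylabel("f(w_t)")
        plt.legend()
        plt.show()

The remaining two plots follow the same pattern once gd is made to record the iterates w_t as well as the function values.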
5. (2 pts) Using Armijo's rule, compute $\bar{w}_t = \frac{w_t}{\|w_t\|_2}$ and plot the hyperplane represented by $\bar{w}_t$ after $t = 1, 10, 100, 1000, 10000$ iterations, along with the OR dataset. (Similar to the figures we drew in Lecture 00 for Perceptron; see the plotting sketch below.)
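A small helper for these hyperplane plots, assuming the bias sits in the last coordinate of $\bar{w}_t$ as in (1); the name plot_hyperplane and the plotting style are mine.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_hyperplane(w_bar, X, y):
        """Plot the line w1*x1 + w2*x2 + w3 = 0 together with the labelled points."""
        xs = np.linspace(-2, 2, 100)
        if abs(w_bar[1]) > 1e-12:                 # skip the degenerate (vertical) case for brevity
            plt.plot(xs, -(w_bar[0] * xs + w_bar[2]) / w_bar[1])
        plt.scatter(X[y > 0, 0], X[y > 0, 1], marker="+", label="y = +1")
        plt.scatter(X[y < 0, 0], X[y < 0, 1], marker="o", label="y = -1")
        plt.xlabel("x1")
        plt.ylabel("x2")
        plt.legend()
        plt.show()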
6. (1 pt) Repeat Ex 1.4 on the XOR dataset:

    x1   x2    y
     1    1   -1
     1   -1    1
    -1    1    1
    -1   -1   -1
Exercise 2: Proximal Gradient for Sparse Robust Regression (10 pts)
In this exercise we solve a sparse robust regression problem. Given a dataset consisting of $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ and $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$, we minimize the regularized Huber loss:

$$\min_{w} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i - y_i)}_{f(w)} + \lambda r(w), \quad \text{where } \hat{y}_i = \langle x_i, w \rangle, \tag{2}$$

$\lambda \geq 0$ is a constant that balances the two terms, and we choose $\ell$ to be Huber's loss:

$$\ell(t) = \min_{s} \; \tfrac{1}{2}(s - t)^2 + |s|. \tag{3}$$

(We omit the bias b throughout to make things a little easier and cleaner.)
1. (1 pt) Use case analysis to derive the explicit formula for $\ell$:

$$\ell(t) = \begin{cases} \tfrac{1}{2} t^2, & |t| \leq 1 \\ |t| - \tfrac{1}{2}, & |t| \geq 1 \end{cases}. \tag{4}$$
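For later use in func, the closed form (4) is easy to vectorize, as in the sketch below; this is just the stated formula, not the derivation asked for, and the helper name is mine.

    import numpy as np

    def huber(t):
        """Elementwise Huber loss from (4): quadratic for |t| <= 1, linear with slope 1 beyond."""
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= 1, 0.5 * t ** 2, np.abs(t) - 0.5)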
2. (1 pt) Is Huber's function $\ell$ L-smooth? If yes, provide an estimate of L.
3. (1 pt) Is the function f L-smooth? If yes, provide an estimate of L. [Hint: Suppose $\ell(z)$ is L-smooth. Then so is $\ell(Aw)$, with a different L, according to the chain rule. How do summation and scalar multiplication affect L-smoothness?]
4. (1 pt) Let $r(w) = \|w\|_1$ be the $\ell_1$ norm. Derive its proximal map:

$$\mathrm{P}_{\eta r}(w) = \operatorname*{argmin}_{z} \; \frac{1}{2\eta} \|w - z\|_2^2 + r(z). \tag{5}$$

[Hint: What did you derive in Ex 2.1?]
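Whatever closed form you derive here, it can be sanity-checked on small examples by minimizing (5) numerically; the helper below is my own suggestion, uses scipy's derivative-free Nelder-Mead solver, and is only reliable for finite, continuous regularizers such as the $\ell_1$ norm (not for the indicator or $\ell_0$ cases in Ex 2.5-2.6).

    import numpy as np
    from scipy.optimize import minimize

    def prox_numeric(r, w, eta):
        """Approximate P_{eta r}(w) by minimizing (5) directly."""
        w = np.asarray(w, dtype=float)
        obj = lambda z: np.sum((w - z) ** 2) / (2 * eta) + r(z)
        return minimize(obj, x0=w, method="Nelder-Mead").x

    # Example: compare against your closed form for r(z) = ||z||_1
    # prox_numeric(lambda z: np.sum(np.abs(z)), np.array([3.0, -0.2, 0.5]), eta=1.0)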
5. (1 pt) Let $r(w) = \iota_{B_\infty}(w) = \begin{cases} 0, & \text{if } w \in B_\infty \\ \infty, & \text{otherwise} \end{cases}$ be the indicator function of the (scaled) cube $B_\infty := \{w : \|w\|_\infty \leq \lambda\}$. Compute the proximal map $\mathrm{P}_{\eta r}(w)$.
6. (1 pt) Let $r(w) = \|w\|_0$ be the $\ell_0$ "norm", where $\|w\|_0 := \sum_j [\![w_j \neq 0]\!]$. Derive its proximal map $\mathrm{P}_{\eta r}(w)$.
7. (2 pts) Implement the proximal gradient algorithm by modifying your code from Ex 1 appropriately. Name your function [w, f] = proxgrad(func; prox; w = 0, η = 1, ls = true, tol = 1e-3, maxiter = 1e3), where prox(w, η) returns the proximal map of the regularizer r. [You may also choose to supply λ to proxgrad explicitly.]
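A minimal sketch of such an interface is given below, assuming func returns (f(w), grad f(w)) for the smooth part only and prox(v, step) implements (5) for the chosen (scaled) regularizer; the stopping test and the sufficient-decrease condition used for backtracking are my choices and need not match the course's exact line-search rule.

    import numpy as np

    def proxgrad(func, prox, w0, eta=1.0, ls=True, tol=1e-3, maxiter=1000):
        """Proximal gradient: w <- prox(w - step * grad f(w), step), with optional backtracking."""
        w = np.asarray(w0, dtype=float)
        history = []
        for t in range(int(maxiter)):
            f, g = func(w)
            history.append(f)
            step = eta
            w_new = prox(w - step * g, step)
            if ls:  # backtrack until the quadratic upper bound on the smooth part holds
                while func(w_new)[0] > f + g @ (w_new - w) + np.sum((w_new - w) ** 2) / (2 * step):
                    step /= 2
                    w_new = prox(w - step * g, step)
            if np.linalg.norm(w_new - w) <= tol * step:   # proximal-gradient step has become tiny
                w = w_new
                break
            w = w_new
        return w, history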
8. (1 pt) Write a script [X, y] = datagen(n=100, d=10, sigma=1, w=[1, -1, 0, ..., 0]) to generate the following dataset: $x_i \sim \mathcal{N}_d(0, I)$, $i = 1, \dots, n$ (the standard d-dimensional normal distribution with zero mean and identity covariance). Let $w = [1, -1, 0, \dots, 0] \in \mathbb{R}^d$, and define

$$y_i = \langle x_i, w \rangle + \epsilon_i, \quad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n. \tag{6}$$

Use your implementation proxgrad to solve (2) with the generated dataset (X, y). Use the $\ell_1$-norm regularizer from Ex 2.4 and try $\lambda \in \{0.1, 1, 10\}$. Report your solutions w.
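A possible datagen along the requested lines is sketched below; the optional seed argument for reproducibility is my addition.

    import numpy as np

    def datagen(n=100, d=10, sigma=1.0, w=None, seed=None):
        """Generate (X, y) as in (6)."""
        rng = np.random.default_rng(seed)
        if w is None:                          # default ground truth w = [1, -1, 0, ..., 0]
            w = np.zeros(d)
            w[0], w[1] = 1.0, -1.0
        X = rng.standard_normal((n, d))        # x_i ~ N_d(0, I)
        y = X @ np.asarray(w, dtype=float) + sigma * rng.standard_normal(n)   # eq. (6)
        return X, y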
9. (1 pt) Repeat Ex 2.8 with the $\ell_0$-norm regularizer. This time also plot the function value $f + \lambda r$ w.r.t. iteration t. Does proxgrad still converge? Report your solutions w.
Yao-Liang Yu ([email protected]) ©2020
