CSE 5523: Machine Learning Fall 2019

Final Exam

Instructor: Raef Bassily Due on: Saturday, December 7, at 5 pm

Instructions

• Read each problem statement carefully. Make sure that you answer exactly what is required and that your solution is expressed clearly.

• You may use any result covered in class, but please cite the result that you use.

• Some notation introduced in class, e.g., the notation we used for true risk, empirical risk, the output of a learning algorithm given a training set, parameter vectors, etc., will be used here directly without being defined again. Please refer to the notes if you have any confusion about the notation. Also, here || · || will be used to denote the Euclidean (i.e., L2) norm.

Problem 1: Solving a Kernel-Based SVM with SGD (20 points)

This problem illustrates a simple example of solving a kernel-based SVM with a Gaussian kernel via SGD.

Consider a training set of 3 data points: (x^(i), y^(i)) ∈ R^2 × {-1, +1}, i = 1, 2, 3, where

x^(1) = (0, 0),   y^(1) = +1
x^(2) = (1, 1),   y^(2) = -1
x^(3) = (-1, -1), y^(3) = -1

Let K : R^2 × R^2 → R be a Gaussian kernel defined as K(x, x') = exp(-(1/8) ||x - x'||^2) for any x, x' ∈ R^2.
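As a quick numerical sketch of Part 1 (assuming the minus signs restored above, i.e., x^(2) = (1, 1) and x^(3) = (-1, -1)), the Gram (kernel) matrix for this kernel can be computed as follows:

```python
import numpy as np

# Training points and labels as reconstructed from the problem statement.
X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])

def gaussian_kernel(a, b):
    """K(x, x') = exp(-||x - x'||^2 / 8), as defined in the problem."""
    return np.exp(-np.sum((a - b) ** 2) / 8.0)

# Gram matrix G[i, j] = K(x_i, x_j); diagonal entries are always 1.
G = np.array([[gaussian_kernel(X[i], X[j]) for j in range(3)]
              for i in range(3)])
print(G)
```

The off-diagonal entries are exp(-2/8) = exp(-1/4) for the pairs involving x^(1), and exp(-8/8) = exp(-1) for the pair (x^(2), x^(3)).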

Suppose we want to solve the kernel-based SVM as described in class with respect to this training set:

1. Obtain the kernel matrix with respect to the given training set. Hence, write down the optimization problem corresponding to solving the soft-margin kernel-based SVM, i.e., write down an expression for the objective function (the regularized loss) that needs to be minimized over the variable coefficients α = (α_1, α_2, α_3) as discussed in class.

2. Set the regularization parameter λ in the objective function of Part 1 as λ = 1. Now, suppose we want to solve this optimization problem via SGD as discussed in class. For simplicity, consider SGD with total number of updates T = 6, i.e., 5 update steps not including initialization, which is assumed to be at α^(1) = 0. Showing all intermediate steps, perform those updates and obtain the final coefficient-vector α̂ (i.e., show all updates α^(t), t = 1, ..., 6, as well as the final output α̂). Note that in the actual execution, in each step, SGD samples one data point uniformly at random from the training set. For the sake of this problem, assume that over the course of the 5 update steps after initialization, SGD samples the following data points in the following order:

at t = 1, (x^(1), y^(1)) is sampled
at t = 2, (x^(2), y^(2)) is sampled
at t = 3, (x^(3), y^(3)) is sampled
at t = 4, (x^(2), y^(2)) is sampled
at t = 5, (x^(3), y^(3)) is sampled
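The exact update rule depends on the formulation covered in class; the sketch below uses the kernelized Pegasos-style subgradient update with step size 1/(λt), which is one common instantiation, applied with the prescribed sampling order (treat the rule itself as an assumption, not the course's exact variant):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])
lam = 1.0  # regularization parameter lambda = 1, as in Part 2

def K(a, b):
    return np.exp(-np.sum((a - b) ** 2) / 8.0)

G = np.array([[K(X[i], X[j]) for j in range(3)] for i in range(3)])

alpha = np.zeros(3)          # alpha^(1) = 0 (initialization)
order = [0, 1, 2, 1, 2]      # prescribed samples x1, x2, x3, x2, x3 (0-indexed)
for t, i in enumerate(order, start=1):
    # decision value of the current iterate at the sampled point
    f = (1.0 / (lam * t)) * np.dot(alpha * y, G[:, i])
    if y[i] * f < 1.0:       # hinge loss active -> take a subgradient step
        alpha[i] += 1.0
    print(f"alpha^({t + 1}) = {alpha}")

# Final classifier: h(x) = sign( (1/(lam*T)) * sum_j alpha_j y_j K(x_j, x) )
T = len(order) + 1
def predict(x):
    val = (1.0 / (lam * T)) * sum(alpha[j] * y[j] * K(X[j], x)
                                  for j in range(3))
    return 1 if val >= 0 else -1

print(predict(np.array([0.0, 1.0])))  # label for the new point in Part 3(b)
```

Under this particular rule, every sampled point violates the margin, so each step increments the corresponding coefficient.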

3. Given the coefficient-vector α̂ obtained in Part 2,

(a) give an expression for the resulting kernel-based classifier;

(b) suppose you are given a new feature-vector x = (0, 1) ∈ R^2. What label will your kernel classifier generate for x? Show your derivation.

Problem 2: Linear Hard-Margin SVMs (10 points)

In this problem, refer to the figure attached on the last page. In the attached figure, consider the constellation of feature vectors x^(i) = (x_1^(i), x_2^(i)) ∈ R^2 and their assigned labels y^(i), i = 1, ..., 7 (see the color/label code in the top-right corner of the figure):

1. Find the margin-maximizing linear classifier; that is, find (ω, b) = ((w_1, w_2), b) describing the equation of the plane w_1 x_1 + w_2 x_2 + b = 0 of the largest margin with respect to the given constellation. Express (ω, b) after performing the proper normalization such that min_{i ∈ {1,...,7}} y^(i) (⟨ω, x^(i)⟩ + b) = 1.

[Note: You do not need to solve the optimization problem for the SVM or prove formally that your plane is margin-maximizing. It is only required to find the parameters of the margin-maximizing classifier (ω, b) as instructed above. You may want to provide an intuitive explanation of why you think your solution is optimal, for which you may get partial credit in case your solution is incorrect.]

2. Given your solution in Part 1:

(a) Identify the set Q ⊆ {x^(1), ..., x^(7)} of points that are closest to your plane.

(b) Find a set of scalars {α_i : i ∈ Q} that satisfy all of the following (showing all details of your derivation):

∀ i ∈ Q:  α_i ≥ 0,
Σ_{i ∈ Q} y^(i) α_i = 0,
ω = Σ_{i ∈ Q} α_i y^(i) x^(i).

(c) What are the support vectors of your classifier?
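The conditions in Part 2 can be checked mechanically. The sketch below (hypothetical helper names; the two-point data set is illustrative only, not the figure's constellation) verifies the normalization min_i y^(i)(⟨ω, x^(i)⟩ + b) = 1 and solves the linear system for the α_i by least squares:

```python
import numpy as np

def min_margin_and_closest(w, b, X, y):
    """Return min_i y_i (<w, x_i> + b) and the indices of the points
    achieving it (the support-vector candidates)."""
    margins = y * (X @ w + b)
    m = margins.min()
    return m, np.where(np.isclose(margins, m))[0]

def solve_alphas(w, X, y, Q):
    """Least-squares solve for alphas satisfying
    w = sum_{i in Q} alpha_i y_i x_i  and  sum_{i in Q} alpha_i y_i = 0."""
    A = np.vstack([y[Q] * X[Q].T, y[Q]])   # stationarity rows + balance row
    rhs = np.concatenate([w, [0.0]])
    alphas, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return alphas

# Illustrative data (NOT the figure's points): one point on each side.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0     # candidate max-margin line x_1 = 0

m, Q = min_margin_and_closest(w, b, X, y)   # m == 1 confirms normalization
alphas = solve_alphas(w, X, y, Q)
```

A nonnegative solution of this system certifies the stationarity and balance conditions; for this toy data both α_i come out to 0.5.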

Problem 3: Regularized Least-Squares Linear Regression (10 points)

Let X = {x ∈ R^d : ||x|| ≤ 1} denote the feature space, i.e., the feature space is the unit-radius Euclidean ball in R^d. For each x ∈ X, let x̃ = (x, 1) ∈ R^(d+1) denote the corresponding extended feature-vector with an extra '1' appended at the end. Let Y = [-1, 1] denote the target space (note here Y is the continuous interval [-1, 1] ⊂ R). Let C = {w ∈ R^(d+1) : ||w|| ≤ 1} denote the parameter space, namely, C is the unit-radius Euclidean ball in R^(d+1). For any parameter-vector w ∈ C, feature-vector x ∈ X, and target value y ∈ Y, define

ℓ_sq(w, (x, y)) ≜ (⟨w, x̃⟩ - y)^2.

1. Show that for any x ∈ X, y ∈ Y, the loss ℓ_sq(·, (x, y)) is a convex function in its first input (i.e., a convex function over C).

2. Is there ρ > 0 such that ℓ_sq is a ρ-Lipschitz loss? I.e., is there ρ > 0 s.t. for any w ∈ C, x ∈ X, y ∈ Y, the gradient of ℓ_sq w.r.t. w ∈ C satisfies ||∇ℓ_sq(w, (x, y))|| ≤ ρ? If so, what is ρ?
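Since ∇_w ℓ_sq = 2(⟨w, x̃⟩ - y) x̃ and ||x̃||^2 = ||x||^2 + 1 ≤ 2 over the unit ball, one natural candidate constant is ρ = 2(√2 + 1)·√2 = 4 + 2√2. The sketch below spot-checks this bound numerically over random points (a random check under the stated domain constraints, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
rho = 4 + 2 * np.sqrt(2)   # candidate Lipschitz constant derived above

def rand_ball(k):
    """A random point inside the unit ball of R^k (not uniform;
    sufficient for a spot check)."""
    v = rng.standard_normal(k)
    return v / max(1.0, np.linalg.norm(v))

max_grad = 0.0
for _ in range(10000):
    w = rand_ball(d + 1)                    # w in C
    x = rand_ball(d)                        # x in X
    y = rng.uniform(-1, 1)                  # y in Y = [-1, 1]
    x_ext = np.append(x, 1.0)               # extended feature vector (x, 1)
    grad = 2 * (np.dot(w, x_ext) - y) * x_ext
    max_grad = max(max_grad, np.linalg.norm(grad))

assert max_grad <= rho
```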

Let D be some distribution over X × Y. Let w* be a minimizer of the true expected risk under ℓ_sq, i.e.,

w* ∈ argmin_{w ∈ C} L_sq(w; D),

where L_sq(w; D) ≜ E_{(x,y)~D}[ℓ_sq(w, (x, y))]. Given a training set S = ((x^(1), y^(1)), ..., (x^(n), y^(n))) of n i.i.d. examples from D, consider the problem of regularized loss minimization studied in class, where the convex loss is instantiated here with ℓ_sq.

3. For any value λ > 0 of the regularization parameter, let w_λ ∈ C be the minimizer of the regularized empirical loss. Obtain an upper bound (in terms of λ) on the expected generalization error E_{S~D^n}[L_sq(w_λ; D) - L̂_sq(w_λ; S)], where L̂_sq(w_λ; S) ≜ (1/n) Σ_{i=1}^n ℓ_sq(w_λ, (x^(i), y^(i))) is the empirical risk.

4. Obtain an upper bound (in terms of λ) on the expected excess risk associated with w_λ, i.e., obtain an upper bound on E_{S~D^n}[L_sq(w_λ; D) - L_sq(w*; D)]. Hence, find the best setting for λ that results in the tightest upper bound. (Your setting of λ should be a function of n only.)

Problem 4: Agnostic PAC Learning of Concentric Circles (10 points)

In this problem, we consider the standard agnostic PAC model with the binary loss function we studied

in the first half of this course. Let X = R2, Y = {0, 1}, and let Hcircles be the class of concentric circles

in the plane, that is, Hcircles = {hr : r 0}, where, for any r 0, the hypothesis hr is defined as follows:

8 x 2 R2, hr(x) , 1 (kxk r) (where, as defined in class, the function 1 (‘condition’) outputs 1 when

‘condition’ is true, and outputs 0 otherwise.)

1. What is VC(H_circles)? Show your reasoning.

2. For any value of the accuracy parameter ε ∈ (0, 1) and the confidence parameter δ ∈ (0, 1), give an expression for the sample complexity n_{H_circles}(ε, δ) for agnostic PAC learning of H_circles.

3. Describe a computationally efficient agnostic PAC learner for this class. Namely, the running time (i.e., the total number of computations) of your algorithm should be (at most) polynomial in n, where n is the size of the input training set. When reasoning about the running time, you may use O(·) notation. (Here, you can assume that a fixed-point precision (that is independent of n) is used to represent real numbers, and hence any single operation over the reals takes O(1) time.)

Note that in this problem we consider the agnostic setting. Realizability should not be assumed.
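One standard approach to Part 3 is ERM over the candidate radii induced by the training points: sort the points by norm and sweep over the n + 1 threshold positions. The sketch below (illustrative function name; assumes distinct norms for simplicity, since ties would need a consistent tie-break) runs in O(n log n):

```python
import numpy as np

def erm_circles(X, y):
    """ERM over H_circles: return a radius r minimizing the empirical
    binary error of h_r(x) = 1(||x|| <= r), plus that error."""
    norms = np.linalg.norm(np.asarray(X, dtype=float), axis=1)
    order = np.argsort(norms)                       # O(n log n) sort by norm
    norms, labels = norms[order], np.asarray(y)[order]
    n = len(labels)
    # ones_prefix[k] = number of label-1 points among the k smallest norms
    ones_prefix = np.concatenate([[0], np.cumsum(labels == 1)])
    total_ones = ones_prefix[-1]
    best_err, best_k = n + 1, 0
    for k in range(n + 1):  # predict 1 on the k smallest-norm points
        err = (k - ones_prefix[k]) + (total_ones - ones_prefix[k])
        if err < best_err:
            best_err, best_k = err, k
    r = norms[best_k - 1] if best_k > 0 else 0.0    # any separating radius works
    return r, best_err / n

# Tiny example: two inner points labeled 1, one outer point labeled 0.
r, err = erm_circles([[0, 0], [0, 2], [3, 0]], [1, 1, 0])
```

Each threshold's error is computed in O(1) from the prefix counts, so the sweep adds only O(n) on top of the sort.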

[Figure (Problem 2): a constellation of seven labeled points x^(1), ..., x^(7) in the (x_1, x_2) plane, with a +1/-1 color/label code in the top-right corner. The coordinates appearing in the figure are (1, 5), (4, -2), (-0.5, 0.5), (-1, 4), (-1, -5), (0.5, -0.5), and (-6, 3); the assignment of coordinates to the names x^(1), ..., x^(7) and to the labels is given in the figure itself.]