CSE 5523: Machine Learning Fall 2019
Final Exam
Instructor: Raef Bassily Due on: Saturday, December 7, at 5 pm
Instructions
• Read each problem statement carefully. Make sure that you answer exactly what is required and
that your solution is expressed clearly.
• You may use any result covered in class, but please cite the result that you use.
• Some notation introduced in class, e.g., the notation we used for true risk, empirical risk, the output of
a learning algorithm given a training set, parameter vectors, etc., will be used here directly without
being defined again. Please refer to the notes if you have any confusion about the notation. Also, here
$\|\cdot\|$ will be used to denote the Euclidean (i.e., $\ell_2$) norm.
Problem 1: Solving a Kernel-Based SVM with SGD (20 points)
This problem illustrates a simple example of solving a kernel-based SVM with a Gaussian kernel via SGD.
Consider a training set of 3 data points $(x^{(i)}, y^{(i)}) \in \mathbb{R}^2 \times \{-1, +1\}$, $i = 1, 2, 3$, where
$$x^{(1)} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ y^{(1)} = +1; \qquad x^{(2)} = \begin{pmatrix} 1 \\ 1 \end{pmatrix},\ y^{(2)} = -1; \qquad x^{(3)} = \begin{pmatrix} -1 \\ -1 \end{pmatrix},\ y^{(3)} = -1.$$
Let $K : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ be a Gaussian kernel defined as $K(x, x') = \exp\!\left(-\tfrac{1}{8}\|x - x'\|^2\right)$ for any $x, x' \in \mathbb{R}^2$.
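As an optional numerical sanity check (not a substitute for the required hand derivation), the Gaussian kernel matrix for these three points can be computed with a short NumPy sketch; it assumes the data as reconstructed above, in particular $x^{(3)} = (-1, -1)$:

```python
import numpy as np

# Training points from the problem statement (x^(3) = (-1, -1) assumed).
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])

def gaussian_kernel(x, xp):
    """K(x, x') = exp(-||x - x'||^2 / 8)."""
    return np.exp(-np.sum((x - xp) ** 2) / 8.0)

# Gram (kernel) matrix G[i, j] = K(x^(i), x^(j)); symmetric, unit diagonal.
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(np.round(G, 4))
```

The off-diagonal entries should match the hand computation: $\|x^{(1)} - x^{(2)}\|^2 = 2$ gives $e^{-1/4}$, and $\|x^{(2)} - x^{(3)}\|^2 = 8$ gives $e^{-1}$.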
Suppose we want to solve the kernel-based SVM as described in class with respect to this training set:
1. Obtain the kernel matrix with respect to the given training set. Hence, write down the
optimization problem corresponding to solving the soft-margin kernel-based SVM, i.e.,
write down an expression for the objective function (the regularized loss) that needs to be minimized
over the variable coefficients $\alpha = (\alpha_1, \alpha_2, \alpha_3)$, as discussed in class.
2. Set the regularization parameter $\lambda$ in the objective function of Part 1 to $\lambda = 1$. Now, suppose we want
to solve this optimization problem via SGD as discussed in class. For simplicity, consider SGD with a
total number of updates $T = 6$, i.e., 5 update steps not including the initialization, which is assumed to
be at $\alpha^{(1)} = 0$. Showing all intermediate steps, perform those updates and obtain the final
coefficient vector $\hat{\alpha}$ (i.e., show all updates $\alpha^{(t)}$, $t = 1, \dots, 6$, as well as the final output $\hat{\alpha}$). Note
that in an actual execution, in each step, SGD samples one data point uniformly at random from the
training set. For the sake of this problem, assume that over the course of the 5 update steps after
initialization, SGD samples the following data points in the following order:
at $t = 1$, $(x^{(1)}, y^{(1)})$ is sampled
at $t = 2$, $(x^{(2)}, y^{(2)})$ is sampled
at $t = 3$, $(x^{(3)}, y^{(3)})$ is sampled
at $t = 4$, $(x^{(2)}, y^{(2)})$ is sampled
at $t = 5$, $(x^{(3)}, y^{(3)})$ is sampled
3. Given the coefficient vector $\hat{\alpha}$ obtained in Part 2,
(a) give an expression for the resulting kernel-based classifier;
(b) suppose you are given a new feature vector $x = (0, 1) \in \mathbb{R}^2$. What label will your kernel classifier
generate for $x$? Show your derivation.
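For intuition only, the update schedule above can be traced with a Pegasos-style kernelized SGD sketch. This is one common formulation (coefficients $\beta$ accumulate label-signed counts and $\alpha^{(t)} = \beta / (\lambda t)$); the exact step-size and output conventions used in class may differ, and the sign reconstruction $y^{(2)} = y^{(3)} = -1$, $x^{(3)} = (-1, -1)$ is assumed:

```python
import numpy as np

# Reconstructed training set for Problem 1 (signs assumed as in the statement).
X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])
# Gram matrix of the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / 8).
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 8.0)

lam = 1.0                # regularization parameter lambda = 1
order = [0, 1, 2, 1, 2]  # 0-based indices sampled at steps t = 1..5

# Pegasos-style updates: this is one standard variant, not necessarily
# the exact update rule given in class.
beta = np.zeros(3)
for t, i in enumerate(order, start=1):
    alpha = beta / (lam * t)            # alpha^(t)
    if y[i] * (G[i] @ alpha) < 1.0:     # hinge loss active at sampled point
        beta[i] += y[i]                 # subgradient step on coefficient i
alpha_hat = beta / (lam * len(order))   # one convention for the final output
```

Under this variant every sampled step triggers an update (each margin stays below 1), but the exam answer should follow whichever update rule was stated in the lecture notes.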
Problem 2: Linear Hard-Margin SVMs (10 points)
In this problem, refer to the figure on the last page. Consider the constellation of feature vectors
$x^{(i)} = (x^{(i)}_1, x^{(i)}_2) \in \mathbb{R}^2$ and their assigned labels $y^{(i)}$, $i = 1, \dots, 7$ (see the
color/label code in the top-right corner of the figure):
1. Find the margin-maximizing linear classifier; that is, find $(w, b) = ((w_1, w_2), b)$ describing the equation
of the plane $w_1 x_1 + w_2 x_2 + b = 0$ of the largest margin with respect to the given constellation. Express
$(w, b)$ after performing the proper normalization such that
$$\min_{i \in \{1, \dots, 7\}} y^{(i)}\left(\langle w, x^{(i)} \rangle + b\right) = 1.$$
[Note: You do not need to solve the optimization problem for the SVM or prove formally that your
plane is margin-maximizing. It is only required to find the parameters of the margin-maximizing
classifier (!, b) as instructed above. You may want to provide an intuitive explanation of why you think
your solution is optimal, for which you may get partial credit in case your solution is incorrect.]
2. Given your solution in Part 1:
(a) Identify the set $Q \subseteq \{1, \dots, 7\}$ of indices of the points that are closest to your plane.
(b) Find a set of scalars $\{\alpha_i : i \in Q\}$ that satisfy all of the following (showing all details of your
derivation):
$$\forall\, i \in Q,\ \alpha_i \ge 0; \qquad \sum_{i \in Q} y^{(i)} \alpha_i = 0; \qquad w = \sum_{i \in Q} \alpha_i\, y^{(i)} x^{(i)}.$$
(c) What are the support vectors of your classifier?
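A candidate answer to Part 1 can be checked numerically before writing it up: the sketch below evaluates the functional margins $y^{(i)}(\langle w, x^{(i)} \rangle + b)$, whose minimum must equal exactly 1 after proper normalization. The two points used here are a made-up toy example, not the constellation from the attached figure:

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Return y_i * (<w, x_i> + b) for every point; for a properly
    normalized hard-margin solution the minimum equals exactly 1,
    and the geometric margin is then 1 / ||w||."""
    return y * (X @ w + b)

# Toy data (hypothetical, NOT the figure's points): separator x1 = 0,
# with both points at distance 0.5 from it, so w = (2, 0), b = 0.
X = np.array([[0.5, 0.0], [-0.5, 1.0]])
y = np.array([1.0, -1.0])
m = functional_margins(np.array([2.0, 0.0]), 0.0, X, y)
```

The points attaining the minimum margin are exactly the candidates for the set $Q$ in Part 2.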
Problem 3: Regularized Least-Squares Linear Regression (10 points)
Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ denote the feature space, i.e., the feature space is the unit-radius Euclidean
ball in $\mathbb{R}^d$. For each $x \in \mathcal{X}$, let $\tilde{x} = (x, 1) \in \mathbb{R}^{d+1}$ denote the corresponding extended feature vector with an
extra `1' appended at the end. Let $\mathcal{Y} = [-1, 1]$ denote the target space (note that here $\mathcal{Y}$ is the continuous interval
$[-1, 1] \subset \mathbb{R}$). Let $\mathcal{C} = \{w \in \mathbb{R}^{d+1} : \|w\| \le 1\}$ denote the parameter space, namely, $\mathcal{C}$ is the unit-radius
Euclidean ball in $\mathbb{R}^{d+1}$. For any parameter vector $w \in \mathcal{C}$, feature vector $x \in \mathcal{X}$, and target value $y \in \mathcal{Y}$,
define
$$\ell_{\mathrm{sq}}(w, (x, y)) \triangleq \left(\langle w, \tilde{x} \rangle - y\right)^2.$$
1. Show that for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the loss $\ell_{\mathrm{sq}}(\cdot, (x, y))$ is a convex function in its first input (i.e., a
convex function over $\mathcal{C}$).
2. Is there $\rho > 0$ such that $\ell_{\mathrm{sq}}$ is a $\rho$-Lipschitz loss? I.e., is there $\rho > 0$ s.t. for any $w \in \mathcal{C}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$,
the gradient of $\ell_{\mathrm{sq}}$ w.r.t. $w \in \mathcal{C}$ satisfies $\|\nabla \ell_{\mathrm{sq}}(w, (x, y))\| \le \rho$? If so, what is $\rho$?
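Part 2 can be probed empirically before proving it: the gradient of the squared loss is $\nabla \ell_{\mathrm{sq}}(w, (x, y)) = 2(\langle w, \tilde{x}\rangle - y)\,\tilde{x}$, and the sketch below samples random $(w, x, y)$ from the stated domains and compares gradient norms against the candidate constant $\rho = 2(\sqrt{2} + 1)\sqrt{2}$. That value is an assumption to be verified analytically from $\|w\| \le 1$, $\|\tilde{x}\| \le \sqrt{2}$, $|y| \le 1$, not something given in the problem:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # arbitrary dimension for the experiment

def grad_sq(w, x_ext, y):
    """Gradient of (<w, x~> - y)^2 with respect to w."""
    return 2.0 * (w @ x_ext - y) * x_ext

# Candidate Lipschitz constant (an assumed bound, to be checked by hand):
# ||grad|| <= 2 (|<w, x~>| + |y|) ||x~|| <= 2 (sqrt(2) + 1) sqrt(2).
rho = 2.0 * (np.sqrt(2.0) + 1.0) * np.sqrt(2.0)

max_norm = 0.0
for _ in range(2000):
    w = rng.normal(size=d + 1)
    w /= max(1.0, np.linalg.norm(w))   # keep w inside the unit ball C
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))   # keep x inside the unit ball X
    x_ext = np.append(x, 1.0)          # extended feature vector (x, 1)
    y = rng.uniform(-1.0, 1.0)
    max_norm = max(max_norm, np.linalg.norm(grad_sq(w, x_ext, y)))
```

A random search like this can only fail to refute the bound; the exam answer still requires the analytic argument.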
Let $D$ be some distribution over $\mathcal{X} \times \mathcal{Y}$. Let $w^*$ be a minimizer of the true expected risk under $\ell_{\mathrm{sq}}$, i.e.,
$$w^* \in \operatorname*{argmin}_{w \in \mathcal{C}} L_{\mathrm{sq}}(w; D), \quad \text{where } L_{\mathrm{sq}}(w; D) \triangleq \operatorname*{\mathbb{E}}_{(x, y) \sim D}\left[\ell_{\mathrm{sq}}(w, (x, y))\right].$$
Given a training set $S = \left((x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\right)$ of
$n$ i.i.d. examples from $D$, consider the problem of regularized loss minimization studied in class,
where the convex loss is instantiated here with $\ell_{\mathrm{sq}}$.
3. For any value $\lambda > 0$ of the regularization parameter, let $w_\lambda \in \mathcal{C}$ be the minimizer of the regularized
empirical loss. Obtain an upper bound (in terms of $\lambda$) on the expected generalization error
$$\operatorname*{\mathbb{E}}_{S \sim D^n}\left[L_{\mathrm{sq}}(w_\lambda; D) - \hat{L}_{\mathrm{sq}}(w_\lambda; S)\right], \quad \text{where } \hat{L}_{\mathrm{sq}}(w_\lambda; S) \triangleq \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{sq}}\left(w_\lambda, (x^{(i)}, y^{(i)})\right)$$
is the empirical risk.
4. Obtain an upper bound (in terms of $\lambda$) on the expected excess risk associated with $w_\lambda$, i.e., obtain an
upper bound on $\operatorname*{\mathbb{E}}_{S \sim D^n}\left[L_{\mathrm{sq}}(w_\lambda; D) - L_{\mathrm{sq}}(w^*; D)\right]$. Hence, find the best setting of $\lambda$ that results in
the tightest upper bound. (Your setting of $\lambda$ should be a function of $n$ only.)
Problem 4: Agnostic PAC Learning of Concentric Circles (10 points)
In this problem, we consider the standard agnostic PAC model with the binary loss function we studied
in the first half of this course. Let $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{0, 1\}$, and let $\mathcal{H}_{\mathrm{circles}}$ be the class of concentric circles
in the plane; that is, $\mathcal{H}_{\mathrm{circles}} = \{h_r : r \ge 0\}$, where, for any $r \ge 0$, the hypothesis $h_r$ is defined as follows:
$\forall\, x \in \mathbb{R}^2$, $h_r(x) \triangleq \mathbb{1}(\|x\| \le r)$ (where, as defined in class, the function $\mathbb{1}(\text{`condition'})$ outputs 1 when
`condition' is true, and outputs 0 otherwise).
1. What is $\mathrm{VC}(\mathcal{H}_{\mathrm{circles}})$? Show your reasoning.
2. For any value of the accuracy parameter $\epsilon \in (0, 1)$ and the confidence parameter $\delta \in (0, 1)$, give an
expression for the sample complexity $n_{\mathcal{H}_{\mathrm{circles}}}(\epsilon, \delta)$ for agnostic PAC learning of $\mathcal{H}_{\mathrm{circles}}$.
3. Describe a computationally efficient agnostic PAC learner for this class. Namely, the running time (i.e.,
the total number of computations) of your algorithm should be (at most) polynomial in $n$, where $n$ is
the size of the input training set. When reasoning about the running time, you may use $O(\cdot)$ notation.
(Here, you can assume that a fixed-point precision (that is independent of $n$) is used to represent real
numbers, and hence any single operation over the reals takes $O(1)$ time.)
Note that in this problem we consider the agnostic setting. Realizability should not be assumed.
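For Part 3, one natural candidate learner (a sketch of one possible approach, with helper names of my own choosing, not the required answer) is an ERM rule implemented by sorting the points by their norms and sweeping candidate radii:

```python
import numpy as np

def erm_circles(X, y):
    """Empirical risk minimizer over H_circles = {h_r : r >= 0} in
    O(n log n): on the sample, h_r depends only on where r falls among
    the sorted norms ||x^(i)||, so a single sweep over those thresholds
    finds a radius with the fewest empirical mistakes."""
    norms = np.linalg.norm(np.asarray(X, dtype=float), axis=1)
    order = np.argsort(norms)
    norms, labels = norms[order], np.asarray(y)[order]

    candidates = []              # (candidate radius, empirical error count)
    err = int(labels.sum())      # r below every norm: all points predicted 0
    if norms[0] > 0:
        candidates.append((norms[0] / 2.0, err))
    i, n = 0, len(labels)
    while i < n:
        j = i
        while j < n and norms[j] == norms[i]:   # handle tied norms together
            err += 1 if labels[j] == 0 else -1  # point j now predicted 1
            j += 1
        candidates.append((norms[i], err))      # r = this norm value
        i = j
    return min(candidates, key=lambda c: c[1])

# Toy usage: two inner positives and one outer negative.
r_hat, err_hat = erm_circles([[0, 1], [0, 2], [3, 0]], [1, 1, 0])
```

Since $\mathcal{H}_{\mathrm{circles}}$ has finite VC dimension, an ERM rule of this kind is an agnostic PAC learner; the sweep's incremental error count is what keeps the running time at $O(n \log n)$ rather than $O(n^2)$.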
[Figure (for Problem 2): seven points $x^{(1)}, \dots, x^{(7)}$ plotted in the $(x_1, x_2)$-plane, each marked $+1$ or $-1$ according to the color/label code in the top-right corner. The coordinates shown are $(1, 5)$, $(4, -2)$, $(-0.5, 0.5)$, $(-1, 4)$, $(-1, -5)$, $(0.5, -0.5)$, and $(-6, 3)$.]