Problem Set 2: Scaling with Sampling

CS4787/5777 — Principles of Large-Scale ML Systems

This problem set is assigned on September 7, 2022 and is due two weeks later on September 21, 2022 at 11:59PM. You may work in groups of up to three students. Submit your solutions on Gradescope.

Problem 1: Empirical Risk Minimization. Consider a machine learning task where we want to train a model h that maps examples x ∈ R^d to labels y ∈ R. We do so by selecting from a hypothesis class h_w parameterized by weights w ∈ R^d, defined by

h_w(x) = x^T w.

(Recall here that x^T w denotes the dot product of the vectors x and w.) To train this model given a dataset of n labeled examples (x_1, y_1), (x_2, y_2), …, (x_n, y_n), we solve the empirical risk minimization problem

minimize: (1/n) ∑_{i=1}^{n} (y_i − x_i^T w)^2 + λ ∥w∥^2
over: w ∈ R^d,

where λ > 0 is a hyperparameter.
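For concreteness, the objective above can be evaluated in a few lines of NumPy. This is just a sketch to make the notation unambiguous; the names `erm_objective`, `X`, `y`, and `lam` are ours, not part of the problem statement.

```python
import numpy as np

def erm_objective(w, X, y, lam):
    """Empirical risk: (1/n) * sum_i (y_i - x_i^T w)^2 + lam * ||w||^2.

    X is an (n, d) matrix whose rows are the examples x_i;
    y is the length-n vector of labels.
    """
    residuals = y - X @ w          # shape (n,): y_i - x_i^T w for each i
    return np.mean(residuals ** 2) + lam * np.dot(w, w)

# Tiny example: n = 3 examples in d = 2 dimensions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
print(erm_objective(w, X, y, lam=0.1))  # with w = 0 this is mean(y_i^2) = 14/3
```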

1(a). The given empirical risk minimization problem is an example of one of the following types of machine learning models:

(A) Soft-margin support vector machine
(B) Hard-margin support vector machine
(C) k-Nearest neighbors classifier
(D) Perceptron
(E) Linear regression
(F) Logistic regression

Which one is it?

1(b). One way to solve this empirical risk minimization problem is to use gradient descent on the loss. Define the loss function ℓ : R^d → R to be the function

ℓ(w) = (1/n) ∑_{i=1}^{n} (y_i − x_i^T w)^2 + λ ∥w∥^2.

Let’s try to understand some things about this empirical risk minimization problem. Answer each of the following questions, and provide a brief justification.

(i) Is the objective ℓ convex? (Recall that a function f is convex if it is “bowl-shaped,” i.e. if any line segment drawn between two points on its graph lies on or above the graph.)


(ii) Is ℓ strongly convex? If so, can you find a simple data-independent value for its constant of strong convexity µ? (Recall that a twice-differentiable function f is µ-strongly convex if its second directional derivative along any direction is always no less than µ, i.e. ∂²/∂α² f(w + αu) ≥ µ for any u, w ∈ R^d with ∥u∥ = 1.)

(iii) Is the solution of this empirical risk minimization problem necessarily unique?

(iv) Is the gradient ∇ℓ Lipschitz continuous? If so, can you find a simple data-independent value for its constant of Lipschitz continuity L? (Recall that a function F is L-Lipschitz continuous if ∥F(u) − F(v)∥ ≤ L ∥u − v∥ for any u, v ∈ R^d.)
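When answering (i), it can be useful to test the chord condition from the hint numerically on small examples before arguing in general. The sketch below is a generic checker (the name `segment_above_graph` and the test functions are ours); it checks the condition at sample points only, so it can refute convexity but not prove it.

```python
import numpy as np

def segment_above_graph(f, u, v, num=11):
    """Check the chord condition at sample points: for t in [0, 1], is
    f(t*u + (1-t)*v) <= t*f(u) + (1-t)*f(v)?  (Up to a small tolerance.)"""
    ts = np.linspace(0.0, 1.0, num)
    return all(f(t * u + (1 - t) * v) <= t * f(u) + (1 - t) * f(v) + 1e-12
               for t in ts)

# f(w) = ||w||^2 is convex and passes; f(w) = -||w||^2 is concave and fails.
u, v = np.array([1.0, 0.0]), np.array([-1.0, 2.0])
print(segment_above_graph(lambda w: np.dot(w, w), u, v))
print(segment_above_graph(lambda w: -np.dot(w, w), u, v))
```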

1(c). Write out the expression for the update step of gradient descent with learning rate α on this problem. You will need to derive the gradient to do this.
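Once you have derived the gradient, you can check it numerically before trusting it. Below is a sketch of the generic update w ← w − α∇ℓ(w) using a finite-difference gradient in place of your closed-form answer (the function names are ours; this deliberately does not reveal the gradient you are asked to derive).

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def gradient_descent(f, w0, alpha, steps):
    """Run the update w <- w - alpha * grad f(w) for a fixed number of steps."""
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * numerical_grad(f, w)
    return w

# Sanity check on f(w) = ||w||^2, whose unique minimizer is w = 0.
w = gradient_descent(lambda w: np.dot(w, w), np.array([1.0, -2.0]),
                     alpha=0.1, steps=100)
```

Comparing `numerical_grad` against your derived gradient at a few random points is a cheap way to catch sign and scaling errors.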

1(d). Now, suppose you wanted to use stochastic gradient descent (SGD) for this task to make learning more scalable. (For this problem set, just assume the baseline SGD algorithm with no minibatching, i.e. a batch size of 1.) Write out the expression for the update step of SGD on this problem. You will need to derive the gradient to do this.
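Structurally, batch-size-1 SGD samples one example index uniformly at random each step and applies the per-example gradient. A generic skeleton (the names `sgd` and `grad_i` are ours; the per-example gradient is left for you to supply from your answer, and the toy loss below is unrelated to the ERM problem):

```python
import numpy as np

def sgd(grad_i, w0, alpha, n, steps, rng):
    """Batch-size-1 SGD: at each step sample i uniformly from {0, ..., n-1}
    and update w <- w - alpha * grad_i(w, i)."""
    w = w0.copy()
    for _ in range(steps):
        i = rng.integers(n)
        w = w - alpha * grad_i(w, i)
    return w

# Toy check with per-example loss (w - c_i)^2, whose gradient is 2 * (w - c_i);
# the average loss is minimized at the mean of the c_i, so SGD with a small
# constant step size should end up hovering near that mean.
c = np.array([1.0, 2.0, 3.0])
rng = np.random.default_rng(0)
w = sgd(lambda w, i: 2 * (w - c[i]), np.array([0.0]),
        alpha=0.01, steps=5000, rng=rng, n=len(c))
```

Note that `w` only hovers near the minimizer rather than converging to it, a point worth keeping in mind for 1(f).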

1(e). Your classmate, Iago, claims that gradient descent for this task is guaranteed to converge asymptotically to some local minimum of the loss ℓ, as long as the step size is α = 1/(4λ). Is Iago correct? If so, prove it. If not, find a counterexample.

1(f). Iago also claims that if gradient descent for this task is guaranteed to converge asymptotically to a local minimum of the loss ℓ for some fixed step size α > 0, then stochastic gradient descent with that same step size α (and batch size 1) is also guaranteed to converge asymptotically to a local minimum of the loss ℓ with probability 1. Is Iago’s second claim correct? If so, prove it. If not, find a counterexample.


Problem 2: Computational Costs of Learning Algorithms.

In this problem, we will explore the computational cost of solving the ERM task we saw in Problem 1 using gradient descent and SGD. Each part of this problem asks you to evaluate how many floating-point operations are needed for a particular task. Note that you may need to use any of the following types of operations:

• Additions (a + b)
• Subtractions (a − b)
• Multiplications (a × b)
• Divisions (a / b)

Be sure to show your work, including a breakdown of the floating-point operations used by type. Also note that your answer may depend on the problem size constants n and d. Note that there are multiple possible correct answers here, which differ depending on how accumulators are initialized in the implementations of the various operations: any answer with a good justification will be accepted.

2(a). Suppose that we have already trained a model h_w from the ERM task of Problem 1. We are going to produce and output a prediction for an example x using h_w. Suppose that we are going to do this on a CPU using ordinary floating-point arithmetic. How many total floating-point operations are required to compute this prediction, h_w(x)?
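To illustrate the accumulator-initialization point from the preamble, here is how one counting convention differs from another for a plain dot product (a sketch with our own function name; your written answer should justify whichever convention you pick):

```python
def dot_product_flops(d, zero_init_accumulator=True):
    """FLOPs to compute a dot product of two length-d vectors.

    With an accumulator initialized to 0.0 there are d multiplications and
    d additions; initializing the accumulator with the first product instead
    saves one addition (d multiplications, d - 1 additions).
    """
    mults = d
    adds = d if zero_init_accumulator else d - 1
    return mults + adds

print(dot_product_flops(784))         # accumulator starts at 0.0
print(dot_product_flops(784, False))  # accumulator starts at the first product
```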

2(b). Now suppose that we want to know the empirical risk associated with a given weight vector. That is, we want to compute ℓ(w). How many total floating-point operations are required to compute this quantity, ℓ(w)?

2(c). Now imagine that we are going to run gradient descent for the Problem 1 task on a CPU using ordinary floating-point arithmetic. Using your answer to 1(c), how many total floating-point operations are required to compute one update step of gradient descent?

2(d). Now imagine that we are going to run SGD (with batch size 1) for the Problem 1 task on a CPU using ordinary floating-point arithmetic. Using your answer to 1(d), how many total floating-point operations are required to compute one update step of SGD?

2(e). Now, let’s look at a concrete task. Suppose that you are training a model where the training examples have dimension d = 784, and there are n = 60000 total training examples. For single-precision arithmetic, the H100 GPU accelerator has 60 teraFLOPS (60 · 10^12 floating-point operations per second) maximum performance. Assume that you are using single-precision arithmetic to run gradient descent on this GPU. Using your answer from 2(c), what is the maximum number of iterations of gradient descent you could run in an hour on this device, assuming that you are bound only by FLOPS limitations?
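The arithmetic here is just a budget calculation: total FLOPs available divided by FLOPs per iteration, rounded down. A sketch (the per-iteration cost `4 * n * d` below is a made-up placeholder, NOT the answer to 2(c); substitute your own count):

```python
def max_iterations(flops_per_iter, device_flops=60e12, seconds=3600.0):
    """Upper bound on iteration count when bound only by peak FLOPS:
    floor((peak FLOPs per second * seconds) / (FLOPs per iteration))."""
    return (device_flops * seconds) // flops_per_iter

# Hypothetical example with a placeholder per-iteration cost of 4 * n * d FLOPs.
n, d = 60000, 784
print(max_iterations(4 * n * d))
```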
