CS 189 Introduction to Machine Learning

Fall 2019 Jennifer Listgarten & Stella Yu HW 07

Due: Tuesday, November 19, 2019

0 Getting Started

Read through this page carefully. You may typeset your homework in LaTeX or submit neatly

handwritten/scanned solutions. Please start each question on a new page. Deliverables:

1. Submit a PDF of your writeup, with an appendix for your code, to the Gradescope assignment “HW7 Write-Up”. If there are graphs, include those graphs in the correct sections.

Do not simply reference your appendix.

2. If there is code, submit all code needed to reproduce your results, “HW7 Code”.

3. If there is a test set, submit your test set evaluation results, “HW7 Test Set”.

1 Backpropagation Algorithm for Neural Networks

Learning goal: Understand backpropagation and associated hyperparameters of training

neural networks

In this problem, we will be implementing the backpropagation algorithm to train a neural network

to distinguish between two handwritten digits (specifically the digits 3 and 9). The dataset for this problem consists of over 10k black-and-white images of size 28 × 28 pixels, each

with a label corresponding to the digit 3 or 9.

Before we start, we will install the library mnist so we can load our data properly.

Run pip install mnist in your terminal so the library is properly installed.

To establish notation for this problem, we define:

$$a_{i+1} = \sigma(z_i) = \sigma(W_i a_i + b_i).$$

In this equation, $W_i$ is an $n_{i+1} \times n_i$ matrix that maps the input $a_i$ of dimension $n_i$ to a vector of dimension $n_{i+1}$, where $n_{i+1}$ is the size of layer $i + 1$. The vector $b_i$ is the bias vector added after the matrix multiplication, and $\sigma$ is the nonlinear function applied element-wise to the result of the matrix multiplication and addition. $z_i = W_i a_i + b_i$ is shorthand for the intermediate result within layer $i$ before applying the nonlinear activation function $\sigma$. Each layer is computed sequentially, where the output of one layer is used as the input to the next. To compute the derivatives with respect to the weights $W_i$ and the biases $b_i$ of each layer, we use the chain rule starting with


the output of the network and work our way backwards through the layers, which is where the

backprop algorithm gets its name.

You are given starter code with incomplete function implementations. For this problem, you will

fill in the missing code so that we can train a neural network to learn your nonlinear classifier. The

code currently trains a network with one hidden layer with 4 nodes.
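For concreteness, here is a minimal sketch of the forward recurrence defined above, using numpy and the provided sigmoid as the activation. This is not the starter code itself; the weight and bias lists are placeholders for whatever Model stores internally.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        # Propagate the input through each layer in turn:
        # z_i = W_i a_i + b_i, then a_{i+1} = sigma(z_i).
        a = x
        for W, b in zip(weights, biases):
            z = W @ a + b
            a = sigmoid(z)
        return a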

(a) Start by drawing a small example network with three computational layers, where the

last layer has a single scalar output. The first layer has a single external input x. The

computational layers should have widths of 5, 3, and 1 respectively. The nonlinearities for the

hidden and output layers are ReLU and linear respectively. Label all the $n_i$ as well as all the $a_i$, $W_i$, and $b_i$ weights. Consider the bias terms to be weights connected to a dummy unit

whose output is always 1 for the purpose of labeling. Also draw and label the loss function

that will be important during training — use a squared-error loss.

Here, the goal is for you to develop your own clear way to illustrate neural nets. You can follow conventions seen online or in class, or develop your own. What matters is that your illustration be unambiguous, so you can use it to help understand the forward

flow of information during evaluation and the backward flow during gradient computations.

(b) Let’s start by implementing the least squares loss function of the network. We’ll refer to the

class that’s encoded as 1 as the ’positive class’ and the class that’s encoded as -1 as the ’negative

class.’ This convention is arbitrary but convenient for discussions, especially since it matches

the usual conventions in binary classification. For this question, let’s treat the digit 3 as the

positive class and the digit 9 as the negative class. The sign of the predicted value will be the

predicted class label. This function is used to assign an error for each prediction made by the

network during training.

The error we actually care about is the misclassification error (MCE) which will be:

$$\mathrm{MCE}(\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\operatorname{sign}(y_i) \neq \operatorname{sign}(\hat{y}_i)\}$$

where $y_i$ is the observation that we want the neural network to output and $\hat{y}_i$ is the prediction from the network. However, this function is hard to optimize, so the implementation will be optimizing the mean squared error (MSE) cost function, which is given by

$$\mathrm{MSE}(\hat{y}) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

Write the derivative of the mean squared error cost function with respect to the predicted outputs $\hat{y}$. In backprop.py, implement the functions QuadraticCost.fx and

QuadraticCost.dx. Remember to attach your function implementations in the writeup.
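As a sanity check, a sketch of what these two functions might look like is below. The argument order and normalization (here, averaging over a batch of n predictions) are assumptions; match whatever convention the starter code uses.

    import numpy as np

    class QuadraticCost:
        @staticmethod
        def fx(y_hat, y):
            # MSE(y_hat) = 1/(2n) * sum_i (y_i - y_hat_i)^2
            n = y.shape[0]
            return np.sum((y - y_hat) ** 2) / (2.0 * n)

        @staticmethod
        def dx(y_hat, y):
            # d MSE / d y_hat_i = (y_hat_i - y_i) / n
            n = y.shape[0]
            return (y_hat - y) / n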

(c) Now, let’s take the derivatives of the nonlinear activation functions used in the network. Write

the derivatives, and implement the following nonlinear functions and their derivatives in the code:

$$\sigma_{\mathrm{linear}}(z) = z$$

$$\sigma_{\mathrm{ReLU}}(z) = \begin{cases} 0 & z < 0 \\ z & \text{otherwise} \end{cases}$$

$$\sigma_{\tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

For the tanh function, feel free to use the tanh function in numpy. We have provided the

sigmoid function as an example activation function. Remember to attach your function

implementations in the writeup.
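A sketch of the three activations and their elementwise derivatives is below. Returning a derivative of 0 at $z = 0$ for ReLU is an assumption; any subgradient in [0, 1] works there.

    import numpy as np

    def linear(z):
        return z

    def d_linear(z):
        return np.ones_like(z)

    def relu(z):
        return np.maximum(0.0, z)

    def d_relu(z):
        return (z > 0).astype(float)

    def tanh(z):
        return np.tanh(z)

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2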

(d) We have implemented the forward propagation part of the network for you (see Model.evaluate

in the code). We now need to compute the derivative of the cost function with respect to the

weights W and the biases b of each layer in the network. We will be using all of the code

we previously implemented to help us compute these gradients. Assume that $\frac{\partial \mathrm{MSE}}{\partial a_{i+1}}$ is given, where $a_{i+1}$ is the input to layer $i + 1$. Write the expression for $\frac{\partial \mathrm{MSE}}{\partial a_i}$ in terms of $\frac{\partial \mathrm{MSE}}{\partial a_{i+1}}$. Then implement these derivative calculations in the function Model.compute_grad. Recall, $a_{i+1}$ is given by

$$a_{i+1} = \sigma(z_i) = \sigma(W_i a_i + b_i).$$

Remember to attach your function implementations in the writeup.
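To make the chain rule concrete, here is a sketch of pushing the gradient back through a single layer. Here d_sigma is the derivative of the layer's activation, and the stored pre-activation $z_i$ is assumed to be available from the forward pass.

    import numpy as np

    def backprop_layer(dA_next, z, W, d_sigma):
        # dMSE/dz_i = sigma'(z_i) * dMSE/da_{i+1}  (elementwise product)
        dZ = d_sigma(z) * dA_next
        # dMSE/da_i = W_i^T dMSE/dz_i  (chain rule through z_i = W_i a_i + b_i)
        dA = W.T @ dZ
        return dA, dZ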

(e) We use gradients to update the model parameters using batched stochastic gradient descent.

Implement the function GDOptimizer.update to update the parameters in each layer of the network. You will need to use the derivatives $\frac{\partial \mathrm{MSE}}{\partial z_i}$ and the outputs of each layer $a_i$ to compute the derivatives $\frac{\partial \mathrm{MSE}}{\partial W_i}$ and $\frac{\partial \mathrm{MSE}}{\partial b_i}$. Use the learning rate $\eta$, given by self.eta in the function, to scale the gradients when using them to update the model parameters. Train with batch sizes [10, 50, 100, 200] and numbers of training epochs [10, 20, 40]. Report the final error on the training set for the various batch sizes and training epochs. Does the result match your expectation, and why? What can you conclude from it? (Just for reference, the staff's solution is able to achieve an error below 0.1 in general.) Remember to attach your function implementations in the writeup.
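A sketch of the per-layer update is below; delta stands for $\frac{\partial \mathrm{MSE}}{\partial z_i}$ (assumed already averaged over the batch) and a is the layer input $a_i$.

    def sgd_update(W, b, delta, a, eta):
        # dMSE/dW_i = dMSE/dz_i . a_i^T  and  dMSE/db_i = dMSE/dz_i,
        # since z_i = W_i a_i + b_i.
        dW = delta @ a.T
        db = delta
        W -= eta * dW
        b -= eta * db
        return W, b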

(f) Let’s now explore how the number of hidden nodes per layer affects the approximation. Train models using the tanh and the ReLU activation functions with 2, 4, 8, 16, and 32 hidden nodes per layer (width). Use the same training iterations and learning rate from the starter code. Report the resulting error on the training set after training for each combination of parameters. Does the result match your expectation, and why? What can you conclude from it?

2 Regularized and Kernel k-Means

Learning Goal: Combine previous ideas such as regularization and kernels to produce commonly-

used variants of K-means


Recall that in k-means clustering we attempt to minimize the objective

$$\min_{C_1, C_2, \ldots, C_k} \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2, \quad \text{where} \quad \mu_i = \operatorname*{argmin}_{\mu_i \in \mathbb{R}^d} \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2 = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, k.$$

The samples are $\{x_1, \ldots, x_n\}$, where $x_j \in \mathbb{R}^d$. $C_i$ is the set of sample points assigned to cluster $i$ and $|C_i|$ is its cardinality. Each sample point is assigned to exactly one cluster.

(a) What is the minimum value of the objective when k = n (the number of clusters equals the

number of sample points)?

(b) (Regularized k-means) Suppose we add a regularization term to the above objective. The objective is now

$$\sum_{i=1}^{k} \left( \lambda \|\mu_i\|_2^2 + \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2 \right).$$

Show that the optimum of

$$\min_{\mu_i \in \mathbb{R}^d} \; \lambda \|\mu_i\|_2^2 + \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2$$

is obtained at $\mu_i = \frac{1}{|C_i| + \lambda} \sum_{x_j \in C_i} x_j$.
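Once the claim is shown, the resulting update is a one-line change to the ordinary centroid computation; a minimal numpy sketch, with points holding the samples currently assigned to cluster $i$:

    import numpy as np

    def regularized_mean(points, lam):
        # mu_i = (1 / (|C_i| + lambda)) * sum_{x_j in C_i} x_j
        return points.sum(axis=0) / (len(points) + lam)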

(c) Here is an example where we would want to regularize clusters. Suppose there are n students

who live in an $\mathbb{R}^2$ Euclidean world and who wish to share rides efficiently to Berkeley for their CS189 final exam. The university permits $k$ vehicles for shuttling students to the exam location. The students need to figure out $k$ good locations to meet up. The students will then walk to the closest meet-up point, and the shuttles will deliver them to the exam location. Let $x_j$ be the location of student $j$, and let the exam location be at $(0, 0)$. Assume that we can drive as

the crow flies, i.e., by taking the shortest path between two points. Write down an appropriate

objective function to minimize the total distance that the students and vehicles need to travel.

Hint: your result should be similar to the regularized k-means objective.

(d) (Kernel k-means) Suppose we have a dataset $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^\ell$, that we want to split into $K$ clusters, i.e., we want to find the best $K$-means clustering without regularization. Furthermore, suppose we know a priori that this data is best clustered in an impractically high-dimensional feature space $\mathbb{R}^m$, $\ell \ll m$, with an appropriate metric. Fortunately, instead of having to deal with the (implicit) feature map $\Phi : \mathbb{R}^\ell \to \mathbb{R}^m$ and (implicit) distance metric$^1$, we have a kernel function $\kappa(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2)$ that we can compute easily on the raw samples. How should we perform the kernelized counterpart of k-means clustering?

$^1$Just as the interpretation of kernels in kernelized ridge regression involves an implicit prior/regularizer as well as an implicit feature space, we can think of kernels as generally inducing an implicit distance metric as well. Think of how you would represent the squared distance between two points in terms of pairwise inner products and operations on them.


Derive the underlined portion of this algorithm, and show your work in deriving it. The

main issue is that although we define the means $\mu_i$ in the usual way, we can’t ever compute $\Phi$

explicitly because it’s way too big. Therefore, in the step where we determine which cluster

each sample point is assigned to, we must use the kernel function κ to obtain the right result.

(Review the lecture on kernels if you don’t remember how that’s done.)

Algorithm 1: Kernel k-means

Require: Data matrix $X \in \mathbb{R}^{n \times \ell}$; number of clusters $K$; kernel function $\kappa(x_1, x_2)$
Ensure: Cluster class class($j$) for each sample $x_j$.
function Kernel-k-means($X$, $K$)
    Randomly initialize class($j$) to be an integer in $1, 2, \ldots, K$ for each $x_j$.
    while not converged do
        for $i \leftarrow 1$ to $K$ do
            Set $S_i = \{ j \in \{1, 2, \ldots, n\} : \text{class}(j) = i \}$.
        for $j \leftarrow 1$ to $n$ do
            Set class($j$) $= \operatorname*{argmin}_{k}$ ________
    Return $S_i$ for $i = 1, 2, \ldots, K$.
end function

(e) The expression you derived may have unnecessary terms or redundant kernel computations.

Explain how to eliminate them; that is, how to perform the computation quickly without doing

irrelevant computations or redoing computations already done.
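As a concrete illustration of what an efficient assignment pass might look like once you have the derivation, here is a hedged numpy sketch. It assumes the full kernel matrix K, with K[j, l] = $\kappa(x_j, x_l)$, is precomputed and that every cluster is nonempty, and it exploits the two savings asked about above: the $\kappa(x_j, x_j)$ term is identical for every candidate cluster and can be dropped, and the per-cluster self-similarity term is computed once per pass rather than once per sample.

    import numpy as np

    def kernel_kmeans_assign(K, classes, num_clusters):
        n = K.shape[0]
        scores = np.empty((n, num_clusters))
        for i in range(num_clusters):
            S = np.flatnonzero(classes == i)          # indices in cluster i
            # (1/|S_i|) sum_{l in S_i} K(x_j, x_l), for every j at once
            cross = K[:, S].sum(axis=1) / len(S)
            # (1/|S_i|^2) sum_{l,m in S_i} K(x_l, x_m), shared by all j
            self_term = K[np.ix_(S, S)].sum() / len(S) ** 2
            scores[:, i] = -2.0 * cross + self_term
        return scores.argmin(axis=1)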

3 Expectation Maximization (EM) Algorithm: In Action!

Learning Goal: Compare and contrast the behavior of various clustering algorithms on a

mixture of Gaussians.

Suppose we have the following general mixture of Gaussians. We describe the model by a pair of random variables $(X, Z)$, where $X$ takes values in $\mathbb{R}^d$ and $Z$ takes values in the set $[K] = \{1, \ldots, K\}$. The joint distribution of the pair $(X, Z)$ is given to us as follows:

$$Z \sim \mathrm{Multinomial}(\pi),$$
$$(X \mid Z = k) \sim \mathcal{N}(\mu_k, \Sigma_k), \quad k \in [K],$$

where $\pi = (\pi_1, \ldots, \pi_K)^\top$ and $\sum_{k=1}^{K} \pi_k = 1$. Note that we can also write

$$X \sim \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_k, \Sigma_k).$$

Suppose we are given a dataset $\{x_i\}_{i=1}^n$ without their labels. Our goal is to identify the $K$ clusters of the data. To do this, we are going to estimate the parameters $\mu_k, \Sigma_k$ using this dataset. We are going to use the following three algorithms for this clustering task.

K-Means: For each data point at iteration $t$ we find its cluster by computing:

$$y_i^{(t)} = \operatorname*{argmin}_{j \in [K]} \|x_i - \mu_j^{(t-1)}\|_2,$$
$$C_j^{(t)} = \{x_i : y_i^{(t)} = j\}_{i=1}^n,$$

where $\mu_j^{(t-1)}$ denotes the mean of $C_j^{(t-1)}$, the $j$-th cluster in iteration $t - 1$. The cluster means are then recomputed as:

$$\mu_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} x_i.$$

We can run K-means till convergence (that is, we stop when the cluster memberships do not change anymore). Let us denote the final iteration as $T$; then the estimate of the covariances $\Sigma_k$ from the final clusters can be computed as:

$$\Sigma_j = \frac{1}{|C_j^{(T)}|} \sum_{x_i \in C_j^{(T)}} (x_i - \mu_j^{(T)})(x_i - \mu_j^{(T)})^\top.$$

Notice that this method can be viewed as a “hard” version of EM.
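A minimal numpy sketch of one such iteration (assuming no cluster goes empty) might look like:

    import numpy as np

    def kmeans_step(X, mus):
        # Assignment step: nearest current mean in Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        y = dists.argmin(axis=1)
        # Update step: recompute each cluster mean.
        new_mus = np.stack([X[y == j].mean(axis=0) for j in range(mus.shape[0])])
        return y, new_mus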

K-Means-with-Covariance-Updates: Given that we also estimate the covariance, we may consider a version of K-means where the covariances keep getting updated at every iteration and also play a role in determining cluster membership. The objective at the assignment step would be given by

$$y_i^{(t)} = \operatorname*{argmin}_{j \in [K]} (x_i - \mu_j^{(t-1)})^\top (\Sigma_j^{(t-1)})^{-1} (x_i - \mu_j^{(t-1)}),$$
$$C_j^{(t)} = \{x_i : y_i^{(t)} = j\}_{i=1}^n.$$

We could then use $C_j^{(t)}$ to recompute the parameters as follows:

$$\mu_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} x_i, \quad \text{and} \quad \Sigma_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} (x_i - \mu_j^{(t)})(x_i - \mu_j^{(t)})^\top.$$

We again run K-means-with-covariance-updates until convergence (that is, we stop when the cluster memberships do not change anymore). Notice that, again, this method can be viewed as another variant of the “hard” EM method.
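The only change from plain K-means is the assignment rule; a sketch of the Mahalanobis-distance scoring (assuming each $\Sigma_j$ is invertible):

    import numpy as np

    def assign_with_covariance(X, mus, Sigmas):
        n, K = X.shape[0], mus.shape[0]
        scores = np.empty((n, K))
        for j in range(K):
            diff = X - mus[j]                  # (n, d) residuals
            inv = np.linalg.inv(Sigmas[j])
            # squared Mahalanobis distance for every point at once
            scores[:, j] = np.einsum('nd,de,ne->n', diff, inv, diff)
        return scores.argmin(axis=1)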

EM: The EM updates are given by

– E-step: For $k = 1, \ldots, K$ and $i = 1, \ldots, n$, we have

$$q_i^{(t)}(Z_i = k) = p(Z = k \mid X = x_i; \theta^{(t-1)}).$$

– M-step: For $k = 1, \ldots, K$, we have

$$\pi_k^{(t)} = \frac{1}{n} \sum_{i=1}^{n} q_i^{(t)}(Z_i = k) = \frac{1}{n} \sum_{i=1}^{n} p(Z = k \mid X = x_i; \theta^{(t-1)}),$$

$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k) \, x_i}{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)}, \quad \text{and} \quad \Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)(x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})^\top}{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)}.$$

Notice that, unlike the previous two methods, in the EM updates each data point contributes to determining the mean and covariance of each cluster.
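To make the contrast concrete, here is a hedged sketch of one full EM iteration following the updates above, using scipy.stats.multivariate_normal for the Gaussian density (the provided gmm_em_kmean.py may organize this differently):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, pis, mus, Sigmas):
        n, K = X.shape[0], len(pis)
        # E-step: responsibilities q_i(Z_i = k) via Bayes' rule.
        q = np.empty((n, K))
        for k in range(K):
            q[:, k] = pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
        q /= q.sum(axis=1, keepdims=True)
        # M-step: every point contributes to every cluster, weighted by q.
        Nk = q.sum(axis=0)
        pis = Nk / n
        mus = (q.T @ X) / Nk[:, None]
        Sigmas = np.stack([
            (q[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
            for k in range(K)
        ])
        return pis, mus, Sigmas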

We now see the three methods in action. You are provided with code for all three of the above algorithms (gmm_em_kmean.py). You can run it by calling the following function from main:

experiments(seed, factor, num_samples, num_clusters)

We assume that $x \in \mathbb{R}^2$, and the default settings are 500 samples ($n = 500$) and 3 clusters ($K = 3$). Notice that seed will determine the randomness and factor will determine how far apart the clusters are.

(a) Run the following setting:

experiments(seed=11, factor=1, num_samples=500, num_clusters=3)

Observe the initial guesses for the means and the plots for the 3 algorithms on convergence.

Comment on your observations. Attach the two plots for this case.

Note that the colors are used to indicate the points that belong to different clusters, to help you visualize the data and understand the results.

(b) Comment on the results obtained for the following setting:

experiments(seed=63, factor=10, num_samples=500, num_clusters=3)

and attach the two plots for this case as well.

4 One Dimensional Mixture of Two Gaussians

Learning Goal: Derive the EM algorithm for a simple mixture of Gaussians.

Suppose we have a mixture of two Gaussians in $\mathbb{R}$ that can be described by a pair of random variables $(X, Z)$, where $X$ takes values in $\mathbb{R}$ and $Z$ takes values in the set $\{1, 2\}$. The joint distribution of the pair $(X, Z)$ is given to us as follows:

$$Z \sim \mathrm{Bernoulli}(\beta),$$
$$(X \mid Z = k) \sim \mathcal{N}(\mu_k, \sigma_k^2), \quad k \in \{1, 2\}.$$

We use $\theta$ to denote the set of all parameters $\beta, \mu_1, \sigma_1, \mu_2, \sigma_2$.


(a) Write down the expressions for the joint likelihoods $p_\theta(X = x_i, Z_i = 1)$ and $p_\theta(X = x_i, Z_i = 2)$. What is the marginal likelihood $p_\theta(X = x_i)$?

(b) What is the log-likelihood $\ell_\theta(x)$? Why can’t we optimize this by taking the derivative of the log-likelihood and setting it to 0?

(c) You’d like to optimize the log-likelihood $\ell_\theta(x)$. However, we just saw this can be hard to solve for an MLE closed-form solution. Show that a lower bound for the log-likelihood is

$$\ell_\theta(x_i) \geq \mathbb{E}_q \left[ \log \left( \frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)} \right) \right].$$

(d) The EM algorithm initially starts with two randomly placed Gaussians $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$, which are both particular realizations of $\theta$.

• E-step: $q_{i,k}^{t+1} = p_{\theta^t}(Z_i = k \mid X = x_i)$. For each data point, determine which Gaussian generated it, being either $(\mu_1, \sigma_1)$ or $(\mu_2, \sigma_2)$.

• M-step: $\theta^{t+1} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \mathbb{E}_{q^{t+1}} \left[ \log \left( p_\theta(X = x_i, Z_i = k) \right) \right]$. After labeling all datapoints in the E-step, adjust $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$.

Prove that alternating between the E-step and the M-step maximizes the lower bound. Hint: Express the lower bound in two ways, such that for the E-step we have the lower bound written in terms of $\theta^t$ and $q$, and for the M-step we have the lower bound written in terms of $q^{t+1}$ and $\theta$. Then, isolate the variables being maximized ($q$ for the E-step and $\theta$ for the M-step).

(e) E-step: What are the expressions for probabilistically imputing (replacing missing data $Z$ with an estimated value based on observed data $X$) the classes for all the datapoints, i.e., $q_{i,1}^{t+1}$ and $q_{i,2}^{t+1}$?

(f) What is the expression for $\mu_1^{t+1}$ for the M-step?

(g) Compare and contrast k-means, soft k-means, and mixtures of Gaussians fit with EM. Some questions to think about (you don’t need to answer each exactly) include: What assumptions do we make about our solutions? What types of data might these models have trouble with? How are they related? When might you choose one over another?
