CS 189 Introduction to Machine Learning
Fall 2019 Jennifer Listgarten & Stella Yu HW 07
Due: Tuesday, November 19, 2019
0 Getting Started
handwritten/scanned solutions. Please start each question on a new page. Deliverables:
1. Submit a PDF of your writeup, with an appendix for your code, to assignment on Grade-
scope, “HW7 Write-Up”. If there are graphs, include those graphs in the correct sections.
Do not simply reference your appendix.
2. If there is code, submit all code needed to reproduce your results, “HW7 Code”.
3. If there is a test set, submit your test set evaluation results, “HW7 Test Set”.
1 Backpropagation Algorithm for Neural Networks
Learning goal: Understand backpropagation and associated hyperparameters of training
neural networks
In this problem, we will be implementing the backpropagation algorithm to train a neural network
to classify the difference between two handwritten digits (specifically the digits 3 and 9). The
dataset for this problem consists of over 10k black and white images of size 28 x 28 pixels, each
with a label corresponding to the digit 3 or 9.
Before we start we will install the library mnist so we can load our data properly.
Run pip install mnist in your terminal so the library is properly installed.
To establish notation for this problem, we define:
ai+1 = σ(zi) = σ(Wiai + bi).
In this equation, Wi is a ni+1 × ni matrix that maps the input ai of dimension ni to a vector of
dimension ni+1, where ni+1 is the size of layer i + 1. The vector bi is the bias vector added after
the matrix multiplication, and σ is the nonlinear function applied element-wise to the result of the
matrix multiplication and addition. zi = Wiai + bi is a shorthand for the intermediate result within
layer i before applying the nonlinear activation function σ. Each layer is computed sequentially
where the output of one layer is used as the input to the next. To compute the derivatives with
respect to the weights Wi and the biases bi of each layer, we use the chain rule starting with
the output of the network and work our way backwards through the layers, which is where the
backprop algorithm gets its name.
You are given starter code with incomplete function implementations. For this problem, you will
fill in the missing code so that we can train a neural network to learn your nonlinear classifier. The
code currently trains a network with one hidden layer with 4 nodes.
(a) Start by drawing a small example network with three computational layers, where the
last layer has a single scalar output. The first layer has a single external input x. The
computational layers should have widths of 5, 3, and 1 respectively. The nonlinearities for the
hidden and output layers are relu and linear respectively. Label all the ni as well as all the
ai and Wi and bi weights. Consider the bias terms to be weights connected to a dummy unit
whose output is always 1 for the purpose of labeling. Also draw and label the loss function
that will be important during training — use a squared-error loss.
Here, the important thing is for you to understand your own clear way to illustrate neural nets.
You can follow conventions seen online, from class, or develop your own. The important thing
is to have your illustration be unambiguous so you can use it to help understand the forward
flow of information during evaluation and the backward flow during gradient computations.
(b) Let’s start by implementing the least squares loss function of the network. We’ll refer to the
class that’s encoded as 1 as the ’positive class’ and the class that’s encoded as -1 as the ’negative
class.’ This convention is arbitrary but convenient for discussions, especially since it matches
the usual conventions in binary classification. For this question, let’s treat the digit 3 as the
positive class and the digit 9 as the negative class. The sign of the predicted value will be the
predicted class label. This function is used to assign an error for each prediction made by the
network during training.
The error we actually care about is the misclassification error (MCE) which will be:
MCE(yˆ) =
1
n
n∑
i=1
I{sign(yi) , sign(yˆi)}
where yi is the observation that we want the neural network to output and yˆi is the prediction
from the network. However this function is hard to to optimize so the implementation will be
optimizing the mean squared error cost (MSE) function, which is given by
MSE(yˆ) =
1
2n
n∑
i=1
(yi − yˆi)2.
Write the derivative of the mean squared error cost function with respect to the pre-
dicted outputs yˆ. In backprop.py implement the functions QuadraticCost.fx and
(c) Now, let’s take the derivatives of the nonlinear activation functions used in the network. Write
the derivatives and implement the following nonlinear functions in the code and their
derivatives:
σlinear(z) = z
σReLU(z) =
0 z < 0z otherwise
σtanh(z) =
ez − e−z
ez + e−z
For the tanh function, feel free to use the tanh function in numpy. We have provided the
sigmoid function as an example activation function. Remember to attach your function
implementations in the writeup.
(d) We have implemented the forward propagation part of the network for you (see Model.evaluate
in the code). We now need to compute the derivative of the cost function with respect to the
weights W and the biases b of each layer in the network. We will be using all of the code
we previously implemented to help us compute these gradients. Assume that ∂MSE
∂ai+1
is given,
where ai+1 is the input to layer i+ 1. Write the expression for ∂MSE∂ai in terms of
∂MSE
∂ai+1
. Then
implement these derivative calculations in the function Model.compute grad. Recall, ai+1
is given by
ai+1 = σ(zi) = σ(Wiai + bi) .
Remember to attach your function implementations in the writeup.
(e) We use gradients to update the model parameters using batched stochastic gradient descent.
Implement the function GDOptimizer.update to update the parameters in each layer
of the network. You will need to use the derivatives ∂MSE
∂zi
and the outputs of each layer ai
to compute the derivates ∂MSE
∂Wi
and ∂MSE
∂bi
. Use the learning rate η, given by self.eta in the
function, to scale the gradients when using them to update the model parameters. Train with
batch sizes [10, 50, 100, 200] and number of training epochs [10, 20, 40]. Report the final
error on the training set given the various batch sizes and training epochs. Does the result
match your expectation, and why? What can you conclude from it. (Just for reference, staff’s
solution is able to achieve a error below 0.1 in general.) Remember to attach your function
implementations in the writeup.
(f) Let’s now explore how the number of hidden nodes per layer affects the approximation. Train
a models using the tanh and the ReLU activation functions with 2, 4, 8, 16, and 32 hidden
nodes per layer (width). Use the same training iterations and learning rate from the starter
code. Report the resulting error on the training set after training for each combination of
parameters. Does the result match your expectation, and why? What can you conclude from
it.
2 Regularized and Kernel k-Means
Learning Goal: Combine previous ideas such as regularization and kernels to produce commonly-
used variants of K-means
Recall that in k-means clustering we attempt to minimize the objective
min
C1,C2,...,Ck
k∑
i=1

x j∈Ci
‖x j − µi‖22, where
µi = argminµi∈Rd

x j∈Ci
‖x j − µi‖22 =
1
|Ci|

x j∈Ci
x j, i = 1, 2, . . . , k.
The samples are {x1, . . . , xn}, where x j ∈ Rd. Ci is the set of sample points assigned to cluster i and
|Ci| is its cardinality. Each sample point is assigned to exactly one cluster.
(a) What is the minimum value of the objective when k = n (the number of clusters equals the
number of sample points)?
(b) (Regularized k-means) Suppose we add a regularization term to the above objective. The
objective is now
k∑
i=1
λ‖µi‖22 + ∑
x j∈Ci
‖x j − µi‖22
 .
Show that the optimum of
min
µi∈Rd
λ‖µi‖22 +

x j∈Ci
‖x j − µi‖22
is obtained at µi = 1|Ci |+λ

x j∈Ci x j.
(c) Here is an example where we would want to regularize clusters. Suppose there are n students
who live in a R2 Euclidean world and who wish to share rides efficiently to Berkeley for their
CS189 final exam. The university permits k vehicles for shuttling students to the exam location.
The students need to figure out k good locations to meet up. The students will then walk to the
closest meet up point and then the shuttles will deliver them to the exam location. Let x j be
the location of student j, and let the exam location be at (0, 0). Assume that we can drive as
the crow flies, i.e., by taking the shortest path between two points. Write down an appropriate
objective function to minimize the total distance that the students and vehicles need to travel.
Hint: your result should be similar to the regularized k-means objective.
(d) (Kernel k-means) Suppose we have a dataset {xi}ni=1, xi ∈ R` that we want to split into K clusters,
i.e., finding the best K-means clustering without regularization. Furthermore, suppose we
know a priori that this data is best clustered in an impractically high-dimensional feature space
Rm, l m with an appropriate metric. Fortunately, instead of having to deal with the (implicit)
feature map Φ : R` → Rm and (implicit) distance metric1, we have a kernel function κ(x1, x2) =
Φ(x1) · Φ(x2) that we can compute easily on the raw samples. How should we perform the
kernelized counterpart of k-means clustering?
1Just as how the interpretation of kernels in kernelized ridge regression involves an implicit prior/regularizer as well as an
implicit feature space, we can think of kernels as generally inducing an implicit distance metric as well. Think of how you would
represent the squared distance between two points in terms of pairwise inner products and operations on them.
Derive the underlined portion of this algorithm, and show your work in deriving it. The
main issue is that although we define the means µi in the usual way, we can’t ever compute Φ
explicitly because it’s way too big. Therefore, in the step where we determine which cluster
each sample point is assigned to, we must use the kernel function κ to obtain the right result.
(Review the lecture on kernels if you don’t remember how that’s done.)
Algorithm 1: Kernel k-means
Require: Data matrix X ∈ Rn×l; Number of clusters K; kernel function κ(x1, x2)
Ensure: Cluster class class( j) for each sample x j.
function Kernel-k-means(X,K)
Randomly initialize class( j) to be an integer in 1, 2, . . . ,K for each x j.
while not converged do
for i← 1 to K do
Set S i = { j ∈ {1, 2, . . . , n} : class( j) = i}.
for j← 1 to n do
Set class( j) = argmink
Return S i for i = 1, 2, . . . ,K.
end function
(e) The expression you derived may have unnecessary terms or redundant kernel computations.
Explain how to eliminate them; that is, how to perform the computation quickly without doing
irrelevant computations or redoing computations already done.
3 Expectation Maximization (EM) Algorithm: In Action!
Learning Goal: Compare and contrast the behavior of various clustering algorithms on a
mixture of Gaussians.
Suppose we have the following general mixture of Gaussians. We describe the model by a pair of
random variables (X,Z) where X takes values in Rd and Z takes value in the set [K] = {1, . . . ,K}.
The joint-distribution of the pair (X,Z) is given to us as follows:
Z ∼ Multinomial(pi),
(X|Z = k) ∼ N(µk,Σk), k ∈ [K],
where pi = (pi1, . . . , piK)> and
∑K
k=1 pik = 1. Note that we can also write
X ∼
K∑
k=1
pikN(µk,Σk).
Suppose we are given a dataset {xi}ni=1 without their labels. Our goal is to identify the K-clusters
of the data. To do this, we are going to estimate the parameters µk,Σk using this dataset. We are
going to use the following three algorithms for this clustering task.
K-Means: For each data-point for iteration t we find its cluster by computing:
y(t)i = arg minj∈[K]
‖xi − µ(t−1)j ‖2
C(t)j = {xi : y(t)i == j}ni=1
where µ(t−1)j denotes the mean of C
(t−1)
j , the j-th cluster in iteration t − 1. The cluster means
are then recomputed as:
µ(t)j =
1
|C(t)j |

xi∈C(t)j
xi.
We can run K-means till convergence (that is we stop when the cluster memberships do not
change anymore). Let us denote the final iteration as T , then the estimate of the covariances
Σk from the final clusters can be computed as:
Σ j =
1
|C(T )j |

xi∈C(T )j
(xi − µ(T )j )(xi − µ(T )j )>.
Notice that this method can be viewed as a “hard” version of EM.
K-Means-with-Covariance-Updates: Given that we also estimate the covariance, we may con-
sider a version of K-means where the covariances keep getting updated at every iteration and
also play a role in determining cluster membership. The objective at the assignment-step
would be given by
y(t)i = arg minj∈[K]
(xi − µ(t−1)j )>(Σ(t−1)j )−1(xi − µ(t−1)j ).
C(t)j = {xi : y(t)i == j}ni=1
We could then use C(t)j to recompute the parameters as follows:
µ(t)j =
1
|C(t)j |

xi∈C(t)j
xi, and
Σ(t)j =
1
|C(t)j |

xi∈C(t)j
(xi − µ(t)j )(xi − µ(t)j )>.
We again run K-means-with-covar until convergence (that is we stop when the cluster mem-
berships do not change anymore). Notice that, again, this method can be viewed as another
variant for the “hard” EM method.
EM: The EM updates are given by
– E-step: For k = 1, . . . ,K and i = 1 . . . , n, we have
q(t)i (Zi = k) = p(Z = k|X = xi; θ(t−1)).
– M-step: For k = 1, . . . ,K, we have
pi(t)k =
1
n
n∑
i=1
q(t)i (Zi = k) =
1
n
n∑
i=1
p(Z = k|X = xi; θ(t−1)),
µ(t)k =
∑n
i=1 q
(t)
i (Zi = k)xi∑n
i=1 q
(t)
i (Zi = k)
, and
Σk
(t) =
∑n
i=1 q
(t)
i (Zi = k)(xi − µ(t)k )(xi − µ(t)k )>∑n
i=1 q
(t)
i (Zi = k)
.
Notice that unlike previous two methods, in the EM updates, each data point contributes in
determining the mean and covariance for each cluster.
We now see the three methods in action. You are provided with a code for all the above 3 algorithms
(gmm em kmean.py). You can run it by calling the following function from main:
experiments(seed, factor, num_samples, num_clusters)
We assume that x ∈ R2, and the default settings are number of samples is 500 (n = 500), and the
number of clusters is 3 (K = 3). Notice that seed will determine the randomness and factor will
determine how far apart are the clusters.
(a) Run the following setting:
experiments(seed=11, factor=1, num_samples=500, num_clusters=3)
Observe the initial guesses for the means and the plots for the 3 algorithms on convergence.
Comment on your observations. Attach the two plots for this case.
Note that the colors are used to indicate that the points that belong to different clusters, to help
you visualize the data and understand the results.
(b) Comment on the results obtained for the following setting:
experiments(seed=63, factor=10, num_samples=500, num_clusters=3)
and attach the two plots for this case as well.
4 One Dimensional Mixture of Two Gaussians
Learning Goal: Derive the EM algorithm for a simple mixture of Gaussians.
Suppose we have a mixtures of two Gaussians in R that can be described by a pair of random
variables (X,Z) where X takes values in R and Z takes value in the set 1, 2. The joint-distribution
of the pair (X,Z) is given to us as follows:
Z ∼ Bernoulli(β),
(X|Z = k) ∼ N(µk, σ2k), k ∈ 1, 2,
We use θ to denote the set of all parameters β, µ1, σ1, µ2, σ2.
(a) Write down the expression for the joint likelihood pθ(X = xi,Zi = 1) and pθ(X = xi,Zi = 2).
What is the marginal likelihood pθ(X = xi)?
(b) What is the log-likelihood `θ(x)? Why can’t we optimize this by taking the derivative of the
log-likelihood and setting it to 0?
(c) You’d like to optimize the log likelihood: `θ(x). However, we just saw this can be hard to
solve for an MLE closed form solution. Show that a lower bound for the log likelihood is
`θ(xi) ≥ Eq
[
log
(
pθ(X=xi,Zi=k)
qθ(Zi=k|X=xi)
)]
.
(d) The EM algorithm initially starts with two randomly placed Gaussians (µ1, σ1) and (µ2, σ2),
which are both particular realizations of θ.
• E-step: qt+1i,k = pθt(Zi = k|X = xi). For each data point, determine which Gaussian
generated it, being either (µ1, σ1) or (µ2, σ2).
• M-step: : θt+1 = argmaxθ
∑n
i=1 Eqt+1
[
log
(
pθ(X = xi,Zi = k)
)]
. After labeling all datapoints
in the E-step, adjust (µ1, σ1) and (µ2, σ2).
Prove that alternating between the E-step and the M-step maximizes the lower bound. Hint:
Express the lower bound in two ways such that for the E-step we have the lower bound written
in terms of θt and q and for the M-step we have the lower bound written in terms of qt+1 and θ.
Then, isolate the variables being maximized (q for the E-step and θ for the M-step).
(e) E-step: What are expressions for probabilistically imputing (replacing missing data Z with an
estimated value based on observed data X) the classes for all the datapoints, i.e. qt+1i,1 and q
t+1
i,2 ?
(f) What is the expression for µt+11 for the M-step?
(g) Compare and contrast k-means, soft k-means, and mixture of Gaussians fit with EM. Some
questions to think about (you don’t need to answer each exactly) include: What are assump-
tions do we make about our solutions? What types of data might these models have trouble
with? How are they related? When might you choose one over another?  