CS229 Midterm
1. [12 points] Multiple Choice
For each question except (a), choose the single best answer. For question (a), choose
all of the correct answers (there may be multiple correct answers).
(a) [2 points] Which of the following are hyperparameters? Select all of the correct
answers (there may be multiple correct answers for this problem).
(1) The number of hidden units in a two-layer neural network
(2) The cluster centroids in the k-means algorithm
(3) $\lambda$ in the regularized cost function $\frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda\|\theta\|_2^2$
(4) The probability φj in the Naive Bayes classifier
(b) [2 points] Which of the following statements is false about neural networks?
(1) Without nonlinear activation functions, the hypothesis function hθ(x) of a
multi-layer feed-forward neural network is a linear function of the input x
just like in a linear model.
(2) When we use multilayer feed-forward neural networks to solve a binary clas-
sification problem, we can apply mini-batch stochastic gradient descent to-
gether with a 0-1 loss, i.e. the per-example loss $\ell(\hat{y}, y) = 1\{\hat{y} \neq y\}$, where
$\hat{y} \in \{0, 1\}$ is the prediction of the neural network and $y \in \{0, 1\}$ is the ground
truth label.
(3) Nonlinear activation functions are usually applied element-wise.
(4) ReLU(x) = max {x, 0}.
(c) [2 points] Which of the following machine learning algorithms can be kernelized?
(1) Linear regression with L2 regularization on the weights but not perceptron
(2) Perceptron but not linear regression with L2 regularization on the weights
(3) Both linear regression with L2 regularization on the weights and perceptron
(4) Neither linear regression with L2 regularization on the weights nor perceptron
(d) [2 points] Suppose we have a training set of n examples. Let X be the n × d
design matrix, where each row consists of the input features of an example, and
let $\vec{y}$ be the $n$-dimensional vector consisting of the corresponding target values.
Let $\theta$ be the $d$-dimensional parameter vector. The least squares loss that we have
seen in class is $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y})$. Now consider the weighted least
squares loss, which can be formulated as $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T W (X\theta - \vec{y})$, where
$W$ is an $n \times n$ diagonal matrix with the weights for each example on the diagonal.
What is the normal equation for weighted least squares?
(1) $X^T W X \theta = X^T \vec{y}$
(2) $X^T X \theta = X^T W \vec{y}$
(3) $X^T W X \theta = X^T W \vec{y}$
(4) A normal equation does not exist for weighted least squares
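As an aside to the question above: the weighted objective $J(\theta)$ can also be minimized numerically, without writing down its normal equation, by rescaling each example by the square root of its weight and reusing an ordinary least-squares solver. A minimal sketch (assuming NumPy, nonnegative weights, and made-up toy data):

```python
import numpy as np

# Hypothetical toy data: n = 5 examples, d = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w = np.array([1.0, 2.0, 0.5, 1.5, 3.0])   # per-example weights (assumed nonnegative)

# Minimize J(theta) = 1/2 (X theta - y)^T W (X theta - y), W = diag(w),
# by rescaling rows with sqrt(w) and solving an ordinary least-squares problem.
sqrt_w = np.sqrt(w)
theta_hat, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * y, rcond=None)
print(theta_hat)
```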
(e) [2 points] Let’s consider Logistic Regression and Gaussian Discriminant Analysis
for binary classification. Which of the following statements is true?
(1) Logistic Regression makes no modeling assumptions
(2) Gaussian Discriminant Analysis makes no modeling assumptions
(3) Logistic Regression makes stronger modeling assumptions than Gaussian Dis-
criminant Analysis
(4) Gaussian Discriminant Analysis makes stronger modeling assumptions than
Logistic Regression
(f) [2 points] Note: this question is somewhat harder than the others.
Suppose we have two non-overlapping datasets $a$ and $b$, where all the examples in $a$ and $b$ are distinct. Concretely, we have $n_a$ examples in dataset $a$ with input $x_a^{(i)} \in \mathbb{R}$ and target $y_a^{(i)} \in \mathbb{R}$ for $i \in \{1, 2, \ldots, n_a\}$, and we have $n_b$ examples in dataset $b$ with input $x_b^{(i)} \in \mathbb{R}$ and target $y_b^{(i)} \in \mathbb{R}$ for $i \in \{1, 2, \ldots, n_b\}$. When we perform linear regression on dataset $a$ to learn the parameters $\beta_a \in \mathbb{R}$ and $\gamma_a \in \mathbb{R}$ by minimizing the cost function $J_a(\beta_a, \gamma_a) = \frac{1}{2}\sum_{i=1}^{n_a}\left(y_a^{(i)} - \beta_a x_a^{(i)} - \gamma_a\right)^2$, we get $\hat{\beta}_a > 0$. When we perform linear regression on dataset $b$ by minimizing $J_b(\beta_b, \gamma_b) = \frac{1}{2}\sum_{i=1}^{n_b}\left(y_b^{(i)} - \beta_b x_b^{(i)} - \gamma_b\right)^2$, we get $\hat{\beta}_b > 0$. If we combine the two datasets $a$ and $b$ to get dataset $c$ with inputs $\left[x_a^{(1)}, \ldots, x_a^{(n_a)}, x_b^{(1)}, \ldots, x_b^{(n_b)}\right]$ and targets $\left[y_a^{(1)}, \ldots, y_a^{(n_a)}, y_b^{(1)}, \ldots, y_b^{(n_b)}\right]$, and we perform linear regression on the combined dataset $c$ by minimizing $J_c(\beta_c, \gamma_c) = \frac{1}{2}\sum_{i=1}^{n_a+n_b}\left(y_c^{(i)} - \beta_c x_c^{(i)} - \gamma_c\right)^2$, what is the sign of $\hat{\beta}_c$?
(1) Positive
(2) Negative
(3) It depends on the actual dataset
2. [22 points] MLE for the Laplace Distribution
We start by defining and considering the Laplace distribution. The Laplace distri-
bution parametrized by b > 0, which we denote by Laplace(b), has the probability
density function
$$p(y; b) = \frac{1}{2b}\exp\left(-\frac{|y|}{b}\right) \quad \text{for } y \in \mathbb{R}.$$
Both the Gaussian distribution and the Laplace distribution are continuous distri-
butions which can produce values in (−∞,∞), and the forms of their probability
density functions are similar with the parameter b playing a role similar to σ for the
Gaussian distribution. However, the Laplace distribution has “fatter” tails than the
Gaussian: the tails decay more slowly for the Laplace distribution. Here is a plot of
the Laplace distribution’s probability density function:

[Figure: probability density function of the Laplace distribution]
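For intuition about these tails, here is a minimal numerical sketch (assuming NumPy; the evaluation points and the choices $b = 1$, $\sigma = 1$ are illustrative, not part of the exam) comparing the two densities far from the origin:

```python
import numpy as np

# Laplace(b) density from the definition above: p(y; b) = 1/(2b) * exp(-|y|/b).
def laplace_pdf(y, b=1.0):
    return np.exp(-np.abs(y) / b) / (2.0 * b)

# Standard Gaussian density for comparison (sigma = 1 is an illustrative choice).
def gaussian_pdf(y, sigma=1.0):
    return np.exp(-y**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

# The Laplace tail decays like exp(-|y|), the Gaussian tail like exp(-y^2/2),
# so far from the origin the Laplace density is much larger ("fatter" tails).
for y in [0.0, 2.0, 4.0, 6.0]:
    print(f"y={y}: Laplace={laplace_pdf(y):.2e}  Gaussian={gaussian_pdf(y):.2e}")
```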
(a) [4 points] As a warmup, show that the distribution Laplace(b) is a member of the
exponential family. Recall that distributions from the exponential family have
probability density functions of the form:
$$p(y; \eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right)$$
Clearly specify what b(y), η, T (y) and a(η) should be.
(b) [12 points] Suppose we observe n training examples (x(i), y(i)), with x(i) ∈ R and
y(i) ∈ {0, 1} for i ∈ {1, ..., n}, that are assumed to be drawn i.i.d. according to
the following model¹:
y(i) ∼ Bernoulli(φ)
x(i) | y(i) = 0 ∼ Laplace(b0)
x(i) | y(i) = 1 ∼ Laplace(b1)
where b0, b1 > 0 and φ ∈ [0, 1] are unknown parameters. Recall that the Bernoulli
distribution with mean φ, denoted by Bernoulli(φ), specifies a distribution over
y ∈ {0, 1}, so that p(y = 1;φ) = φ and p(y = 0;φ) = 1− φ.
Find the Maximum Likelihood Estimates of $\phi$, $b_0$ and $b_1$ by maximizing the joint
likelihood $L(\phi, b_0, b_1) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, b_0, b_1)$.
You can use the indicator functions 1{y(i) = 0} and 1{y(i) = 1} in your answer.
¹Note that the Laplace probability density was defined above as a function of $y$. In this
problem, $x^{(i)} \mid y^{(i)}$ is drawn from the Laplace distribution.
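As a side illustration of the generative model in part (b), the following sketch (assuming NumPy; the parameter values and sample size are made up for illustration) draws data from it: a Bernoulli label followed by a class-conditional Laplace draw. It does not compute the MLE asked for above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameter values.
phi, b0, b1 = 0.3, 1.0, 3.0
n = 1000

# y^(i) ~ Bernoulli(phi)
y = rng.binomial(1, phi, size=n)

# x^(i) | y^(i) = j ~ Laplace(b_j); NumPy's laplace() uses the same scale parameter b.
b = np.where(y == 1, b1, b0)
x = rng.laplace(loc=0.0, scale=b, size=n)

print("fraction of y = 1:", y.mean())
print("mean |x| given y = 0:", np.abs(x[y == 0]).mean())  # should be close to b0
print("mean |x| given y = 1:", np.abs(x[y == 1]).mean())  # should be close to b1
```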
(c) [6 points] Now that we’ve learned the parameters φ, b0, and b1, we will use them
to predict y given a new x. Show that
$$p(y = 1 \mid x; \phi, b_0, b_1) = \frac{1}{1 + \exp(-f(|x|))}$$
where $f(|x|)$ is some function of $|x|$. Assume $b_0 \neq b_1$. Explicitly specify what
$f(|x|)$ is. Note that your answer for $f(|x|)$ should depend on $|x|$, $\phi$, $b_0$, and $b_1$.
You do NOT need to plug in the estimates for φ, b0, and b1 that you found in
part (b).
3. [26 points] Moment Matching in Exponential Families
We learned in class that an exponential family distribution is one whose probability
density can be represented as
$$p(y; \eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right),$$
where η is the natural parameter of the distribution, T (y) is the sufficient statistic,
a(η) is the log partition function, and b(y) is the base measure.
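For concreteness, here is a standard worked example of this definition, using the Bernoulli distribution (which is not one of this exam's sub-questions); a minimal LaTeX derivation:

```latex
% Bernoulli(\phi) written in exponential-family form.
\begin{align*}
p(y;\phi) &= \phi^{y}(1-\phi)^{1-y} \\
          &= \exp\!\Big( y \log\tfrac{\phi}{1-\phi} + \log(1-\phi) \Big)
\end{align*}
% so b(y) = 1, \quad T(y) = y, \quad \eta = \log\tfrac{\phi}{1-\phi}, \quad
% a(\eta) = -\log(1-\phi) = \log(1 + e^{\eta}).
```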
(a) [8 points] Suppose we are given $N$ data points $\{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\}$ that are drawn
i.i.d. from the exponential family distribution $p(y; \eta)$. The log-likelihood of the
data is
$$\ell(\eta) = \log\left(\prod_{i=1}^{N} p(y^{(i)}; \eta)\right). \tag{1}$$
Show that the log-likelihood of the data, $\ell(\eta)$, is maximized with respect to the
natural parameter $\eta$ when
$$\mathbb{E}[T(Y); \eta] = \frac{1}{N}\sum_{i=1}^{N} T(y^{(i)}) \tag{2}$$
In other words, you are asked to show that the maximizer of equation (1) satisfies
equation (2).
Note that this equation can be interpreted as a moment-matching condition:
the expected sufficient statistic equals the empirical average of the observed sufficient statistics. The
expected sufficient statistic $\mathbb{E}[T(Y); \eta] \triangleq \int T(y)\, p(y; \eta)\, dy$ is the expectation of
$T(Y)$ when $Y$ is distributed according to $p(y; \eta)$.
Hint: The solution to one of the questions in Problem Set 1 may be helpful for
this problem.
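As a concrete (non-required) illustration of the moment-matching condition above, the sketch below (assuming NumPy; the Poisson family and the grid search are illustrative choices, not part of the exam) fits a one-parameter exponential family, the Poisson, by maximizing the log-likelihood numerically and checks that the expected sufficient statistic $\mathbb{E}[T(Y)] = \lambda$ at the maximizer matches the sample average of $T(y^{(i)}) = y^{(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.7, size=500)   # data; lambda = 3.7 is an illustrative choice

# Poisson log-likelihood up to an additive constant (the -log(y!) term does not depend on lambda).
def log_likelihood(lam):
    return np.sum(y * np.log(lam) - lam)

# Maximize over a grid of candidate lambdas (crude, dependency-free numerical MLE).
grid = np.linspace(0.1, 10.0, 2000)
lam_hat = grid[np.argmax([log_likelihood(l) for l in grid])]

# Moment matching: for the Poisson, T(y) = y and E[T(Y)] = lambda, so the numerical
# maximizer should (up to grid resolution) equal the sample average of the sufficient statistic.
print("lambda_hat (numerical MLE):", lam_hat)
print("sample mean of T(y) = y:   ", y.mean())
```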
(b) [8 points] Consider the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $d$ dimensions
with mean vector $\mu$ and covariance matrix $\Sigma$. The probability density function
is given by
$$p(y; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1}(y - \mu)\right).$$
Show that the multivariate Gaussian distribution is in the exponential family,
and clearly state the values for η and T (y). You do not have to state the values
for b(y) and a(η), and you may assume the existence of b(y) and a(η).
Hint: The following hint is useful in one way of solving the question. Let $A$ and
$B$ be two $n \times n$ matrices, and let $\mathrm{vec}(A) \in \mathbb{R}^{n^2}$ denote the column vector which is
the matrix $A$ “flattened”, i.e. the elements of $\mathrm{vec}(A)$ are $A_{i,j}$ for $1 \leq i \leq n$, $1 \leq j \leq n$.
You may find one of the following identities helpful: $\sum_{i=1}^{n}\sum_{j=1}^{n} A_{i,j}B_{i,j} = \mathrm{vec}(A)^T \mathrm{vec}(B)$ and $\mathrm{trace}(AB) = \mathrm{vec}(A)^T \mathrm{vec}(B^T)$.
(c) [10 points] Suppose we are given $N$ data points $\{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\}$ that are
drawn i.i.d. from the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $d$ dimensions
with mean vector $\mu$ and covariance matrix $\Sigma$. We learned in class that for $Y \sim \mathcal{N}(\mu, \Sigma)$, $\mathbb{E}[Y] = \mu$ and $\mathbb{E}[(Y - \mu)(Y - \mu)^T] = \Sigma$. Using these two properties of
Gaussians and the results in parts (a) and (b), show that the maximum likelihood
estimates of the parameters $\mu$ and $\Sigma$ are given by
$$\mu = \frac{1}{N}\sum_{i=1}^{N} y^{(i)} \qquad\qquad \Sigma = \frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - \mu\right)\left(y^{(i)} - \mu\right)^T.$$
Note that this problem shows another way to derive the MLE without explicitly
computing derivatives of the log-likelihood with respect to µ and Σ.
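As a quick numerical companion to the formulas stated above (this is not the requested derivation), the following sketch (assuming NumPy; the dimension, sample size, and true parameters are made up) computes the two estimates directly from data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5000
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(d, d))
Sigma_true = A @ A.T + np.eye(d)                            # a random positive-definite covariance
Y = rng.multivariate_normal(mu_true, Sigma_true, size=N)    # rows are y^(i)

# MLE formulas stated above.
mu_hat = Y.mean(axis=0)
centered = Y - mu_hat
Sigma_hat = (centered.T @ centered) / N                     # note the 1/N (not 1/(N-1)) normalization

print("mu_hat:", mu_hat)
print("Sigma_hat:\n", Sigma_hat)
```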
4. [18 points] EM for Mixture of Poissons
Suppose we have training data consisting of $n$ independent, unlabeled examples
$\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$. Each observed sample $x^{(i)} \in \mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}$ follows a Poisson
distribution with parameter $\lambda_{z^{(i)}}$, where $z^{(i)} \in \{1, 2, \ldots, k\}$ is the corresponding
latent (unobserved) variable indicating which Poisson distribution $x^{(i)}$ belongs to.
Specifically, we have $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ (where $\sum_{j=1}^{k}\phi_j = 1$, $\phi_j \geq 0$ for all $j$, and
the parameter $\phi_j$ gives $p(z^{(i)} = j)$), and $x^{(i)} \mid z^{(i)} = j \sim \mathrm{Poisson}(\lambda_j)$. In brief, the
training data follows a distribution consisting of a mixture of $k$ Poissons.
Recall that at iteration t, the Expectation-Maximization (EM) Algorithm discussed
in lecture can be split into the E-step and M-step as follows:
• (E-step): For each $i, j$, set
$$w_j^{(i)} := p\left(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}\right)$$
• (M-step): Set
$$\theta^{(t+1)} := \arg\max_\theta \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log\frac{p\left(x^{(i)}, z^{(i)} = j; \theta\right)}{w_j^{(i)}}$$
Here θ = (φ, λ) represents all parameters that we need to estimate.
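For intuition, the sketch below (assuming NumPy; $k$, the mixing proportions, and the rates are made-up illustrative values) samples data from the mixture model described above; it only illustrates the generative process and does not implement the E-step or M-step asked for in the parts that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) mixture parameters: k = 3 components.
phi = np.array([0.5, 0.3, 0.2])     # mixing proportions, sum to 1
lam = np.array([2.0, 8.0, 20.0])    # Poisson rates lambda_j
n = 1000

# z^(i) ~ Multinomial(phi), then x^(i) | z^(i) = j ~ Poisson(lambda_j).
z = rng.choice(len(phi), size=n, p=phi)
x = rng.poisson(lam[z])

print("empirical component frequencies:", np.bincount(z, minlength=len(phi)) / n)
print("overall mean of x:", x.mean(), " vs  phi . lambda =", phi @ lam)
```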
(a) [2 points] Show that the M-step is equivalent to the following:
$$\theta^{(t+1)} := \arg\max_\theta \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log p\left(x^{(i)}, z^{(i)} = j; \theta\right)$$
(b) [4 points] Complete the E-step by finding an expression for $w_j^{(i)}$ in terms of $\phi^{(t)}$,
$\lambda^{(t)}$ and $x^{(i)}$. Further, show that we can get $\left[w_1^{(i)} \;\; \ldots \;\; w_k^{(i)}\right]^T \in \mathbb{R}^k$ by applying
the softmax function to some $k$-dimensional vector, and specify what this vector
is in terms of $\phi^{(t)}$, $\lambda^{(t)}$ and $x^{(i)}$.
Recall that if $X$ follows a Poisson distribution with parameter $\lambda$, then
$$P(X = x; \lambda) = e^{-\lambda}\frac{\lambda^x}{x!}$$
(c) [12 points] Complete the M-step by finding the expressions for $\phi_j^{(t+1)}$ and $\lambda_j^{(t+1)}$
through solving
$$\theta^{(t+1)} := \arg\max_\theta \sum_{i=1}^{n}\sum_{j=1}^{k} w_j^{(i)} \log p\left(x^{(i)}, z^{(i)} = j; \theta\right)$$
5. [22 points] Batch Normalization
Machine learning algorithms tend to give better results when input features are
normalized and uncorrelated. For this reason, data normalization is a common pre-
processing step in deep learning. However, the distribution of each layer’s inputs
may also vary over time and across layers, sometimes making training deep networks
more difficult. Batch Normalization is a method that draws its strength from making
normalization a part of the model architecture and performing the normalization for
each training mini-batch.
Given a batch $B = (x^{(1)}, \ldots, x^{(m)})$ of vectors in $\mathbb{R}^d$, we first normalize the inputs into
$(\hat{x}^{(1)}, \ldots, \hat{x}^{(m)})$ and then linearly transform them into $(y^{(1)}, \ldots, y^{(m)})$. A Batch Normalization
layer also has parameters $\gamma \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$. The layer works as follows:
Figure 1: Batch Normalization
First, compute the batch mean vector
$$\mu^{(B)} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \in \mathbb{R}^d \tag{3}$$
Then, compute the batch variance vector $v^{(B)} \in \mathbb{R}^d$, which is defined element-wise for
any $j \in \{1, \ldots, d\}$ by:
$$v_j^{(B)} = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j^{(B)}\right)^2 \in \mathbb{R} \tag{4}$$
Then, normalize the vectors in $B$ by computing vectors $(\hat{x}^{(1)}, \ldots, \hat{x}^{(m)})$ such that $\hat{x}^{(i)} \in \mathbb{R}^d$ is defined element-wise for any $j \in \{1, \ldots, d\}$ by:
$$\hat{x}_j^{(i)} = \frac{x_j^{(i)} - \mu_j^{(B)}}{\sqrt{v_j^{(B)}}} \tag{5}$$
Finally, output vectors $(y^{(1)}, \ldots, y^{(m)})$ in $\mathbb{R}^d$ defined by
$$y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta \tag{6}$$
where $\odot$ is the element-wise vector multiplication.
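The forward pass in equations (3)-(6) can be written directly in NumPy. The following is a minimal sketch with a made-up batch size and dimension (variable names are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta):
    """X has shape (m, d): one row per example x^(i). gamma and beta have shape (d,)."""
    mu = X.mean(axis=0)                     # batch mean, eq. (3)
    v = ((X - mu) ** 2).mean(axis=0)        # batch variance, eq. (4)
    X_hat = (X - mu) / np.sqrt(v)           # normalization, eq. (5)
    Y = gamma * X_hat + beta                # element-wise scale and shift, eq. (6)
    return Y, X_hat, mu, v

# Illustrative usage with a random batch (m = 4 examples, d = 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
gamma, beta = np.ones(3), np.zeros(3)
Y, X_hat, mu, v = batchnorm_forward(X, gamma, beta)
print("per-feature mean of x_hat:", X_hat.mean(axis=0))   # approximately 0
print("per-feature var of x_hat: ", X_hat.var(axis=0))    # approximately 1
```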
In this question, you will derive the back-propagation rules for batch-normalization.
Let L = L(y(1), . . . , y(m)) be some scalar function of the batch-normalization output
vectors. We will calculate the gradients of L with respect to the parameters and some
intermediate variables.
(a) [6 points] Calculate the gradient of $L$ w.r.t. $\beta$, $\gamma$, and $\hat{x}^{(i)}$ for $i \in \{1, \ldots, m\}$; i.e.
calculate $\frac{\partial L}{\partial \beta}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \hat{x}^{(i)}}$ for $i \in \{1, \ldots, m\}$. Your answer can and should depend
on $\frac{\partial L}{\partial y^{(i)}}$ and the parameters and variables in the forward pass (such as the $\hat{x}^{(i)}$'s). Your
answer may be vectorized or un-vectorized (e.g. $\frac{\partial L}{\partial \beta}$ or $\frac{\partial L}{\partial \beta_j}$ for $j \in \{1, \ldots, d\}$).
(b) [4 points] Calculate the gradient of $L$ w.r.t. $v^{(B)}$; i.e. calculate $\frac{\partial L}{\partial v^{(B)}}$. Your answer
can depend on $\frac{\partial L}{\partial y^{(i)}}$ and the parameters and variables in the forward pass, as well
as the gradients that you calculated in part (a). Your answer may be vectorized
or un-vectorized ($\frac{\partial L}{\partial v^{(B)}}$ or $\frac{\partial L}{\partial v_j^{(B)}}$ for $j \in \{1, \ldots, d\}$).
(c) [6 points] Show that the gradient of $L$ w.r.t. $\mu^{(B)}$ is
$$\frac{\partial L}{\partial \mu^{(B)}} = \frac{-1}{\sqrt{v^{(B)}}} \odot \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}^{(i)}}$$
Please show all of your work.
Hint: Consider applying the chain rule using $\hat{x}_j^{(i)}$ for $i \in \{1, \ldots, m\}$ as the intermediate variables, and note that $\hat{x}_j^{(i)}$ depends on $\mu_j^{(B)}$ directly through $\mu_j^{(B)}$ and
indirectly through $v_j^{(B)}$.
(d) [6 points] Show that the gradient of $L$ w.r.t. $x^{(i)}$ for $i \in \{1, \ldots, m\}$ is
$$\frac{\partial L}{\partial x^{(i)}} = \frac{1}{\sqrt{v^{(B)}}} \odot \frac{\partial L}{\partial \hat{x}^{(i)}} + \frac{2(x^{(i)} - \mu^{(B)})}{m} \odot \frac{\partial L}{\partial v^{(B)}} + \frac{1}{m}\frac{\partial L}{\partial \mu^{(B)}}$$
Please show all of your work.
Hint: Consider applying the chain rule using $\hat{x}_j^{(k)}$ for $k \in \{1, \ldots, m\}$ as the
intermediate variables, and note that $\hat{x}_j^{(k)}$ for $k \in \{1, \ldots, m\}$ depends on $x_j^{(i)}$
directly through $x_j^{(i)}$ for $k = i$ and indirectly through $\mu_j^{(B)}$ and $v_j^{(B)}$ for $k \in \{1, \ldots, m\}$.
Remarks on the broader context: After obtaining $\frac{\partial L}{\partial x^{(i)}}$ (as a function of
$\frac{\partial L}{\partial y^{(i)}}$ and other quantities known in the forward pass), one can propagate the
gradient backwards to other layers that generated the $x^{(i)}$'s (you are not asked
to show this). Empirically, it turns out to be important to consider $\mu^{(B)}$ and
$v^{(B)}$ as variables (instead of constants), so that in the chain rule, we consider the
gradient through $\mu^{(B)}$ and $v^{(B)}$.
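Once the analytic gradients of parts (a)-(d) have been derived, they can be sanity-checked numerically. The sketch below (assuming NumPy; the scalar loss $L = \sum_i \sum_j y_j^{(i)}$ is an arbitrary illustrative choice) estimates $\frac{\partial L}{\partial x^{(i)}}$ by central finite differences on the forward pass defined in equations (3)-(6), which any derived closed-form expression can be compared against:

```python
import numpy as np

def batchnorm_loss(X, gamma, beta):
    """Forward pass (eqs. (3)-(6)) followed by an arbitrary scalar loss L = sum of all outputs."""
    mu = X.mean(axis=0)
    v = ((X - mu) ** 2).mean(axis=0)
    X_hat = (X - mu) / np.sqrt(v)
    Y = gamma * X_hat + beta
    return Y.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)

# Central finite differences: dL/dx^(i)_j ~ (L(x + eps) - L(x - eps)) / (2 eps).
eps = 1e-5
num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        num_grad[i, j] = (batchnorm_loss(Xp, gamma, beta) - batchnorm_loss(Xm, gamma, beta)) / (2 * eps)

print(num_grad)   # compare against the analytic dL/dx^(i) derived in part (d)
```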