CSC413/2516 Winter 2021 with Professor Jimmy Ba & Professor Bo Wang

Homework 4

Version: 1.1
Version Release Date: 2021-03-29
Deadline: Thursday, April 8th, at 11:59pm.

Submission: You must submit your solutions as a PDF file through MarkUs [1]. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable. See the syllabus on the course website [2] for detailed policies. You may ask questions about the assignment on Piazza [3]. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

The teaching assistants for this assignment are Keiran Paster and Irene Zhang. Contact: mailto:
[email protected]

Clarifications

All modifications to the original assignment are marked in red in the main body of the assignment. A complete list of the modifications can be found below:

• (v1.1) In question 3.2.1, we corrected how the variance term should be calculated, and clarified that the minimization problem should be done element-wise.

[1] https://markus.teach.cs.toronto.edu/csc413-2021-01/main
[2] https://csc413-uoft.github.io/2021/assets/misc/syllabus.pdf
[3] https://piazza.com/class/kjt32fc0f7y3kb

1 RNNs and Self Attention

For any successful deep learning system, choosing the right network architecture is as important as choosing a good learning algorithm. In this question, we will explore how various architectural choices can have a significant impact on learning. We will analyze the learning performance from the perspective of vanishing/exploding gradients as they are backpropagated from the final layer to the first.

1.1 Warmup: A Single Neuron RNN

Consider an $n$-layer fully connected network that has scalar inputs and outputs. For now, assume that all the hidden layers have a single unit, and that all the weight matrices are set to 1 (because each hidden layer has a single unit, the weight matrices have dimensionality $\mathbb{R}^{1 \times 1}$).

1.1.1 Effect of Activation - Sigmoid [0.25pt]

Let's say we're using the sigmoid activation. Let $x$ be the input to the network and let $f : \mathbb{R} \to \mathbb{R}$ be the function the network computes. Do the gradients necessarily vanish or explode as they are backpropagated? Answer this by showing that

$$0 \le \left|\frac{\partial f(x)}{\partial x}\right| \le \left(\frac{1}{4}\right)^n,$$

where $n$ is the number of layers in the network.

1.1.2 Effect of Activation - Tanh [0.25pt]

Instead of sigmoid, let's now say we're using the tanh activation (otherwise the same setup as in 1.1.1). Do the gradients necessarily vanish or explode this time?
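The bound from 1.1.1 is easy to check numerically. Below is a quick sketch (my own illustration, not part of the assignment; the helper name `chain_grad` is mine) that composes a chain of sigmoids with unit weights and verifies $|\partial f/\partial x| \le (1/4)^n$ via the chain rule:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_grad(x, n):
    """Derivative of an n-layer chain f(x) = sigma(sigma(...sigma(x)...)), all weights = 1."""
    g, h = 1.0, x
    for _ in range(n):
        h = sigmoid(h)          # h is now sigma(z) for this layer's pre-activation z
        g *= h * (1 - h)        # sigma'(z) = sigma(z)(1 - sigma(z)) <= 1/4
    return g

n = 8
for x in (-3.0, 0.0, 2.5):
    assert 0.0 <= abs(chain_grad(x, n)) <= 0.25 ** n
```

Each layer multiplies the gradient by a factor of at most 1/4, which is the source of the $(1/4)^n$ bound.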
Answer this by deriving a bound similar to the one in Sec 1.1.1 for the magnitude of the gradient.

1.2 Matrices and RNN

We will now analyze the recurrent weight matrices using the Singular Value Decomposition (SVD). The SVD is one of the most important results in all of linear algebra. It says that any real matrix $M \in \mathbb{R}^{m \times n}$ can be written as $M = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are square orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix with nonnegative entries on the diagonal (i.e. $\Sigma_{ii} \ge 0$ for $i \in \{1, \dots, \min(m, n)\}$, and all other entries are 0). Geometrically, this means any linear transformation can be decomposed into a rotation/flip, followed by scaling along orthogonal directions, followed by another rotation/flip.

1.2.1 Gradient through RNN [0.5pt]

Let's say we have a very simple RNN-like architecture that computes $x_{t+1} = \tanh(W x_t)$. You can view this architecture as a deep fully connected network that uses the same weight matrix at each layer. Suppose the largest singular value of the weight matrix is $\sigma_{\max}(W) = \frac{1}{2}$. Show that the largest singular value of the input-output Jacobian satisfies the following bound:

$$0 \le \sigma_{\max}\left(\frac{\partial x_n}{\partial x_1}\right) \le \left(\frac{1}{2}\right)^n.$$

(Hint: if $C = AB$, then $\sigma_{\max}(C) \le \sigma_{\max}(A)\,\sigma_{\max}(B)$. Also, the input-output Jacobian is the product of the layerwise Jacobians.)

1.3 Self-Attention

In a self-attention layer (using scaled dot-product attention), the matrix of outputs is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value matrices, $n$ is the sequence length, and $d$ is the embedding dimension (here $d_k = d$).

1.3.1 Complexity of Self-Attention [0.25pt]

Show that the complexity of a self-attention layer at test time scales quadratically with the sequence length $n$.

1.3.2 Linear Attention with SVD [1pt]

It has been empirically shown in Transformer models that the context mapping matrix $P = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ often has low rank.
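For intuition, here is a minimal numpy sketch of scaled dot-product attention (toy shapes of my own choosing, not part of the assignment). The $n \times n$ matrix $P$ is the quadratic bottleneck from 1.3.1, and a rank-$k$ truncation of $P$'s SVD lets $PV$ be evaluated with products of size roughly $n \times k$ and $k \times d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 4                       # sequence length, embed dim, rank (toy sizes)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

P = softmax(Q @ K.T / np.sqrt(d))          # n x n: the O(n^2) bottleneck
out = P @ V                                # O(n^2 d) overall
assert P.shape == (n, n) and out.shape == (n, d)

# If P were (approximately) rank k, its truncated SVD factors let us reorder the
# computation so no n x n product is ever formed:
U, S, Vt = np.linalg.svd(P)
Pk = (U[:, :k] * S[:k]) @ Vt[:k]           # rank-k approximation of P
out_k = (U[:, :k] * S[:k]) @ (Vt[:k] @ V)  # two small products, O(nkd)-sized work
assert np.allclose(out_k, Pk @ V)
```

Note the real softmax matrix here is not exactly rank $k$; the sketch only illustrates why access to a low-rank factorization makes the reordered computation cheap.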
Show that if the rank of $P$ is $k$ and we already have access to the SVD of $P$, then it is possible to compute self-attention in $O(nkd)$ time.

1.3.3 Linear Attention by Projecting [1pt]

Suppose we ignore the softmax and scaling and let $P = QK^\top \in \mathbb{R}^{n \times n}$. Assume $P$ has rank $k$. Show that there exist two linear projection matrices $C, D \in \mathbb{R}^{k \times n}$ such that $PV = Q(CK)^\top DV$, and that the right-hand side can be computed in $O(nkd)$ time. Hint: consider using the SVD in your proof.

2 Reversible Models and Variational AutoEncoders

In lecture, you saw how the change-of-variables formula can be used to get the likelihood of a datapoint in a reversible model. You also saw how Variational AutoEncoders can be trained by optimizing a lower bound on this likelihood. In this problem, you will show one relationship between these two approaches. Specifically, you will show how, for a class of encoder and decoder distributions of a VAE that correspond to deterministic random variables, you can derive the change-of-variables formula. This connection makes it clear how to calculate a bound on the likelihood of a composition of both stochastic and bijective transformations. This question is inspired by SurVAE Flows [Nielsen et al., 2020], which uses this connection to derive how to optimize models with surjective transformations.

2.1 Variational Lower Bound [1pt]

You saw in class that the log probability of data $x$ can be lower bounded by the variational lower bound, using the latent prior $p(z)$, decoder $p(x|z)$, and encoder $q(z|x)$. In this problem, you will derive a different way of writing this bound. Show that $\log p(x)$ can be rewritten as:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z|x)}[\log p(z)] + \mathbb{E}_{q(z|x)}\left[\log \frac{p(x|z)}{q(z|x)}\right]}_{\text{variational lower bound}} + \underbrace{\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right]}_{\text{gap in lower bound}} \tag{2.1}$$

Figure 1: Figure from http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture11.
pdf

2.2 Connecting Reversible Models with VAEs [1pt]

Suppose we have an invertible, differentiable function $f : \mathcal{Z} \to \mathcal{X}$ which maps samples from some prior distribution $p(z)$ to data space. Let $z = f^{-1}(x)$. To compute the log probability of a datapoint $x$, one can use the change-of-variables formula:

$$\log p(x) = \log p(z) + \log \left|\det \frac{\partial f(z)}{\partial z}\right|^{-1} \tag{2.2}$$

In this problem, you will show that the change-of-variables formula can be derived as a special case of Equation 2.1. In order to write $f$ with probability distributions, we will use Dirac $\delta$-functions. A Dirac delta is defined as follows:

$$\delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \neq 0 \end{cases}$$

with the additional property that it integrates to 1:

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1$$

Suppose we define deterministic conditionals for our function $f$:

$$p(x|z) = \delta(x - f(z)) \qquad p(z|x) = \delta(z - f^{-1}(x))$$

Show that when using the above distributions and when $q(z|x) = p(z|x)$, Equation 2.1 can be rewritten as Equation 2.2.

Hint: use the following property to write one conditional in terms of the other. Let $z_0$ be a root of the function $g : \mathcal{Z} \to \mathcal{X}$. The composition of a Dirac $\delta$-function with a smooth function $g$ is defined as:

$$\delta(g(z)) = \left|\det \frac{\partial g(z)}{\partial z}\right|^{-1}_{z = z_0} \delta(z - z_0) \tag{2.3}$$

2.3 Computing the Likelihood of a Composition of Transformations [1pt]

Consider the problem of computing a lower bound on the likelihood of a datapoint $x$ when using some transformation $f$ that maps latent samples $z \sim p(z)$ to some $x \in \mathcal{X}$. When $f$ is smooth and bijective, we can use Equation 2.2, the change-of-variables formula, to get the exact log-likelihood of any $x$. When $f$ is a stochastic transformation with $p(x|z)$ and $q(z|x)$, we can use Equation 2.1, the variational lower bound, to get a bound on the likelihood. Suppose the function $f := f_T \circ f_{T-1} \circ \dots \circ f_1$ is a composition of several transformations, each of which is either stochastic (VAE) or bijective (a reversible model).
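As a sanity check on the change-of-variables formula (a toy example of mine, not required by the question): for a 1-D affine bijection $f(z) = az + b$ with a standard normal prior, $\log p(x) = \log p(z) - \log|a|$ matches the exact density of $x \sim \mathcal{N}(b, a^2)$:

```python
import math, random

def gauss_logpdf(x, mu=0.0, sigma=1.0):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

a, b = 2.0, 1.0                      # bijection f(z) = a*z + b, so |det df/dz| = |a|
random.seed(0)
for _ in range(5):
    z = random.gauss(0, 1)
    x = a * z + b
    lhs = gauss_logpdf(x, mu=b, sigma=abs(a))       # exact density of x
    rhs = gauss_logpdf(z) - math.log(abs(a))        # change-of-variables formula
    assert abs(lhs - rhs) < 1e-9
```

The $-\log|a|$ term is exactly the $\log|\det \partial f/\partial z|^{-1}$ correction from Equation 2.2.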
We can write a bound on the likelihood as:

$$\log p(x) \ge \log p(z) + \sum_{t=1}^{T} V_t$$

How would we find each of the $V_t$? Hint: these terms may differ depending on the type of transformation.

3 Reinforcement Learning

3.1 Bellman Equation

The Bellman equation can be seen as a fixed-point equation for what is called the Bellman operator. Given a policy $\pi$, the Bellman operators $T^\pi$ for $V : B(\mathcal{S}) \to B(\mathcal{S})$ and $T^\pi$ for $Q : B(\mathcal{S} \times \mathcal{A}) \to B(\mathcal{S} \times \mathcal{A})$ are defined as follows:

$$(T^\pi V)(s) \triangleq r^\pi(s) + \gamma \int \mathcal{P}(s' \mid s, a)\,\pi(a \mid s)\,V(s')$$

$$(T^\pi Q)(s, a) \triangleq r(s, a) + \gamma \int \mathcal{P}(s' \mid s, a)\,\pi(a' \mid s')\,Q(s', a')$$

for all $s \in \mathcal{S}$ (for $V$) or all $(s, a) \in \mathcal{S} \times \mathcal{A}$ (for $Q$).

The Bellman operators have two important properties: 1) monotonicity and 2) $\gamma$-contraction. These properties give us many guarantees, e.g. that applying the operator repeatedly converges to a unique and optimal solution, which is what allows us to show that RL algorithms such as Q-learning converge (under certain additional assumptions, which we will not go over here). In this section, we will show that the Bellman operator indeed has these two properties.

a) [0.5pt] Show that the Bellman operator (on $V$) has the monotonicity property, i.e., show that for a fixed policy $\pi$, if $V_1, V_2 \in B(\mathcal{S})$ and $V_1 \le V_2$, then we have

$$T^\pi V_1 \le T^\pi V_2$$

b) [1pt] Show that the Bellman operator is a $\gamma$-contraction mapping with respect to the supremum norm (on $Q$), i.e., show that for a discount factor $\gamma$ and $Q_1, Q_2 \in B(\mathcal{S} \times \mathcal{A})$, we have

$$\|T^\pi(Q_1) - T^\pi(Q_2)\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$$

Recall from your math classes that the supremum norm (on $Q$) is defined as:

$$\|Q\|_\infty = \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} |Q(s, a)| \tag{3.1}$$

Hint: for some function $f$, we have the following. For this question, think about what $P$ and $f$ are in our case.
$$\left|\int P(x) f(x)\right| \le \int |P(x) f(x)| = \int |P(x)| \cdot |f(x)| \le \int P(x) \cdot \sup_{x \in \mathcal{X}} |f(x)| = \sup_{x \in \mathcal{X}} |f(x)| \int P(x) = \|f\|_\infty \tag{3.2}$$

where in the last line we used the fact that $\int P(x) = 1$.

c) [0.25pt] For this question, you may assume knowledge of the reward function $r(s, a)$ and the transition probability function $p(s' \mid s, a)$, where $s'$ is the next state.

1. Give a definition of $v_*(s)$ in terms of $q_*(s, a)$.
2. Give a definition of $q_*(s, a)$ in terms of $v_*(s)$.
3. Give a definition of $a_*$ in terms of $q_*(s, a)$.
4. Give a definition of $a_*$ in terms of $v_*(s)$.

3.2 The REINFORCE Gradient Estimator

3.2.1 Choice of Baseline [1pt]

In class/tutorial we saw that the REINFORCE gradient can be written as follows using the log-derivative trick:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\substack{s_t \sim \rho_0(s) \\ a_t \sim \pi_\theta(a_t \mid s_t) \\ s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left[\sum_{t'=1}^{T} R(s_{t'}, a_{t'})\right]\right] \tag{3.3}$$

In tutorial, we also saw that subtracting a baseline $b \in \mathbb{R}$ as follows is a variance reduction technique:

$$\nabla_\theta \mathcal{J}_b(\theta) = \mathbb{E}_{\substack{s_t \sim \rho_0(s) \\ a_t \sim \pi_\theta(a_t \mid s_t) \\ s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left[\left(\sum_{t'=1}^{T} R(s_{t'}, a_{t'})\right) - b\right]\right] \tag{3.4}$$

In practice, to estimate the gradient, we can execute the policy in the environment $N$ times; the averages over the $N$ runs are unbiased estimators of both (3.3) and (3.4). For this question, derive an optimal baseline $b \in \mathbb{R}^M$ in terms of $\pi_\theta$ and $R$ that minimizes the variance of this estimator, where $M$ is the size of the gradient.

Hint: follow these three steps for each $k$-th element of the gradient:

1) Write down the variance term $\mathrm{Var}\left[\sum_{t=1}^{T} \nabla_{\theta_k} \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) \left[\left(\sum_{t'=t}^{T} R\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right) - b_k\right]\right]$. Recall from your probability class that for a random variable $x$, we have $\mathrm{Var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.

2) To minimize the variance, compute $\frac{d}{db_k} \mathrm{Var}$, set it to zero, and solve for $b_k$.

3) Finally, to show that what we found is a minimum, show that the second derivative is positive.
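The variance-reduction effect of a baseline can be seen on a tiny one-step example (my own toy, not part of the assignment): a two-armed bandit with a one-parameter softmax policy and deterministic rewards of my choosing. Subtracting a constant near the mean reward leaves the gradient estimate unbiased but shrinks its variance:

```python
import math, random

random.seed(0)
theta = 0.3
rewards = {0: 1.0, 1: 0.0}               # deterministic toy rewards

def pi0():
    """Probability of action 0 under a softmax over logits (theta, -theta)."""
    return math.exp(theta) / (math.exp(theta) + math.exp(-theta))

def grad_log_pi(a):
    """d/dtheta log pi(a) for the two-action softmax policy."""
    t = math.tanh(theta)
    return (1 - t) if a == 0 else (-1 - t)

def estimate(b, m=20000):
    """Mean and variance of the one-step REINFORCE sample g * (R - b)."""
    samples = []
    for _ in range(m):
        a = 0 if random.random() < pi0() else 1
        samples.append(grad_log_pi(a) * (rewards[a] - b))
    mean = sum(samples) / m
    var = sum((s - mean) ** 2 for s in samples) / m
    return mean, var

m0, v0 = estimate(b=0.0)                 # no baseline
mb, vb = estimate(b=0.5)                 # crude baseline near the mean reward
assert vb < v0                           # baseline reduces variance
assert abs(m0 - mb) < 0.05               # both estimates remain (nearly) unbiased
```

The hint's derivation goes further and solves for the baseline that minimizes this variance exactly; here $b = 0.5$ is just a rough stand-in.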
3.2.2 Importance Sampling [1pt]

For a general function $f$ and a probability measure $P$, computing the following is hard if we do not have access to samples from $P$, or if $P$ has very low probability where $f$ takes its largest values:

$$\mathbb{E}_{P(X)}[f(X)] = \int_x P(x) f(x)\,dx \approx \frac{1}{m} \sum_{i=1}^{m} f\left(x^{(i)}\right) \quad \text{with } x^{(i)} \sim P$$

Importance sampling provides an alternative solution: instead of sampling from $P$, we sample from some distribution $Q$ of our choice. Given samples from $Q$, we can then estimate the expectation w.r.t. $P$ as follows:

$$\mathbb{E}_{P(X)}[f(X)] = \mathbb{E}_{Q(X)}\left[\frac{P(X)}{Q(X)} f(X)\right] \approx \frac{1}{m} \sum_{i=1}^{m} \frac{P\left(x^{(i)}\right)}{Q\left(x^{(i)}\right)} f\left(x^{(i)}\right) \quad \text{with } x^{(i)} \sim Q$$

When we derived the REINFORCE estimator in lecture/tutorial, we assumed access to samples from $\pi_\theta$. Suppose we no longer have this assumption, and we only have samples from another distribution, $\pi_{\theta'}$, instead. Derive the REINFORCE estimator for policy $\pi_\theta$ under this new setting with importance sampling, i.e. derive $\nabla_\theta J(\theta)$ with $a_t \sim \pi_{\theta'}(a \mid s_t)$. Your solution should be only in terms of $\pi_\theta$, $\pi_{\theta'}$, and $r$. Show your steps explicitly. What happens if the two policies $\pi_\theta$ and $\pi_{\theta'}$ are close to each other?

Hint 1: You can use the following property: for some function $f$, $\nabla_\theta \mathbb{E}[f(\theta)] = \mathbb{E}[\nabla_\theta f(\theta)]$.

Hint 2: Use the importance sampling equation given above, where $P$ should be the probability of a trajectory whose actions are sampled from $\pi_\theta$, and $Q$ should be the probability of a trajectory whose actions are sampled from $\pi_{\theta'}$.

References

Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, and Max Welling. SurVAE flows: Surjections to bridge the gap between VAEs and flows. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
URL https://proceedings.neurips.cc/paper/2020/hash/9578a63fbe545bd82cc5bbe749636af1-Abstract.html.