CSC413/2516 Winter 2021 with Professor Jimmy Ba & Professor Bo Wang

Homework 4

Version: 1.1
Version Release Date: 2021-03-29
Deadline: Thursday, April 8th, at 11:59pm.

Submission: You must submit your solutions as a PDF file through MarkUs [1]. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable. See the syllabus on the course website [2] for detailed policies. You may ask questions about the assignment on Piazza [3]. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

The teaching assistants for this assignment are Keiran Paster and Irene Zhang. Contact: mailto:
[email protected]

Clarifications

All modifications to the original assignment are marked in red in the main body of the assignment. A complete list of the modifications can be found below:

• (v1.1) In question 3.2.1, we corrected how the variance term should be calculated, and clarified that the minimization problem should be done element-wise.

[1] https://markus.teach.cs.toronto.edu/csc413-2021-01/main
[2] https://csc413-uoft.github.io/2021/assets/misc/syllabus.pdf
[3] https://piazza.com/class/kjt32fc0f7y3kb

1 RNNs and Self Attention

For any successful deep learning system, choosing the right network architecture is as important as choosing a good learning algorithm. In this question, we will explore how various architectural choices can have a significant impact on learning. We will analyze the learning performance from the perspective of vanishing/exploding gradients as they are backpropagated from the final layer to the first.

1.1 Warmup: A Single Neuron RNN

Consider an $n$-layer fully connected network that has scalar inputs and outputs. For now, assume that all the hidden layers have a single unit, and that all the weight matrices are set to 1 (because each hidden layer has a single unit, the weight matrices have dimensionality $\mathbb{R}^{1 \times 1}$).

1.1.1 Effect of Activation - Sigmoid [0.25pt]

Let's say we're using the sigmoid activation. Let $x$ be the input to the network and let $f : \mathbb{R} \to \mathbb{R}$ be the function the network computes. Do the gradients necessarily vanish or explode as they are backpropagated? Answer this by showing that

$$0 \le \left|\frac{\partial f(x)}{\partial x}\right| \le \left(\frac{1}{4}\right)^n,$$

where $n$ is the number of layers in the network.

1.1.2 Effect of Activation - Tanh [0.25pt]

Instead of sigmoid, let's now say we're using the tanh activation (otherwise the same setup as in 1.1.1). Do the gradients necessarily vanish or explode this time?
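The bound from 1.1.1 is easy to check numerically. Below is a quick sketch (my own illustration, not part of the assignment; the helper name `chain_grad` is mine) that composes a chain of sigmoids with unit weights and verifies $|\partial f/\partial x| \le (1/4)^n$ via the chain rule:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_grad(x, n):
    """Derivative of an n-layer chain f(x) = sigma(sigma(...sigma(x)...)), all weights = 1."""
    g, h = 1.0, x
    for _ in range(n):
        h = sigmoid(h)          # h is now sigma(z) for this layer's pre-activation z
        g *= h * (1 - h)        # sigma'(z) = sigma(z)(1 - sigma(z)) <= 1/4
    return g

n = 8
for x in (-3.0, 0.0, 2.5):
    assert 0.0 <= abs(chain_grad(x, n)) <= 0.25 ** n
```

Each layer multiplies the gradient by a factor of at most 1/4, which is the source of the $(1/4)^n$ bound.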
Answer this by deriving a bound similar to the one in Sec 1.1.1 for the magnitude of the gradient.

1.2 Matrices and RNN

We will now analyze the recurrent weight matrices using the Singular Value Decomposition (SVD). The SVD is one of the most important results in all of linear algebra. It says that any real matrix $M \in \mathbb{R}^{m \times n}$ can be written as $M = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are square orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix with nonnegative entries on the diagonal (i.e. $\Sigma_{ii} \ge 0$ for $i \in \{1, \dots, \min(m, n)\}$, and all other entries are 0). Geometrically, this means any linear transformation can be decomposed into a rotation/flip, followed by scaling along orthogonal directions, followed by another rotation/flip.

1.2.1 Gradient through RNN [0.5pt]

Let's say we have a very simple RNN-like architecture that computes $x_{t+1} = \tanh(W x_t)$. You can view this architecture as a deep fully connected network that uses the same weight matrix at each layer. Suppose the largest singular value of the weight matrix is $\sigma_{\max}(W) = \frac{1}{2}$. Show that the largest singular value of the input-output Jacobian satisfies the following bound:

$$0 \le \sigma_{\max}\left(\frac{\partial x_n}{\partial x_1}\right) \le \left(\frac{1}{2}\right)^n.$$

(Hint: if $C = AB$, then $\sigma_{\max}(C) \le \sigma_{\max}(A)\,\sigma_{\max}(B)$. Also, the input-output Jacobian is the product of the layerwise Jacobians.)

1.3 Self-Attention

In a self-attention layer (using scaled dot-product attention), the matrix of outputs is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value matrices, $n$ is the sequence length, and $d$ is the embedding dimension (here $d_k = d$).

1.3.1 Complexity of Self-Attention [0.25pt]

Show that the complexity of a self-attention layer at test time scales quadratically with the sequence length $n$.

1.3.2 Linear Attention with SVD [1pt]

It has been empirically shown in Transformer models that the context mapping matrix $P = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ often has low rank.
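For intuition, here is a minimal numpy sketch of scaled dot-product attention (toy shapes of my own choosing, not part of the assignment). The $n \times n$ matrix $P$ is the quadratic bottleneck from 1.3.1, and a rank-$k$ truncation of $P$'s SVD lets $PV$ be evaluated with products of size roughly $n \times k$ and $k \times d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 4                       # sequence length, embed dim, rank (toy sizes)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

P = softmax(Q @ K.T / np.sqrt(d))          # n x n: the O(n^2) bottleneck
out = P @ V                                # O(n^2 d) overall
assert P.shape == (n, n) and out.shape == (n, d)

# If P were (approximately) rank k, its truncated SVD factors let us reorder the
# computation so no n x n product is ever formed:
U, S, Vt = np.linalg.svd(P)
Pk = (U[:, :k] * S[:k]) @ Vt[:k]           # rank-k approximation of P
out_k = (U[:, :k] * S[:k]) @ (Vt[:k] @ V)  # two small products, O(nkd)-sized work
assert np.allclose(out_k, Pk @ V)
```

Note the real softmax matrix here is not exactly rank $k$; the sketch only illustrates why access to a low-rank factorization makes the reordered computation cheap.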
Show that if the rank of $P$ is $k$ and we already have access to the SVD of $P$, then it is possible to compute self-attention in $O(nkd)$ time.

1.3.3 Linear Attention by Projecting [1pt]

Suppose we ignore the softmax and scaling and let $P = QK^\top \in \mathbb{R}^{n \times n}$. Assume $P$ has rank $k$. Show that there exist two linear projection matrices $C, D \in \mathbb{R}^{k \times n}$ such that $PV = Q(CK)^\top DV$, and that the right-hand side can be computed in $O(nkd)$ time. Hint: consider using the SVD in your proof.

2 Reversible Models and Variational AutoEncoders

In lecture, you saw how the change-of-variables formula can be used to get the likelihood of a datapoint in a reversible model. You also saw how Variational AutoEncoders can be trained by optimizing a lower bound on this likelihood. In this problem, you will show one relationship between these two approaches. Specifically, you will show how, for a class of encoder and decoder distributions of a VAE that correspond to deterministic random variables, you can derive the change-of-variables formula. This connection makes it clear how to calculate a bound on the likelihood of a composition of both stochastic and bijective transformations. This question is inspired by SurVAE Flows [Nielsen et al., 2020], which uses this connection to derive how to optimize models with surjective transformations.

2.1 Variational Lower Bound [1pt]

You saw in class that the log probability of data $x$ can be lower bounded by the variational lower bound, using the latent prior $p(z)$, decoder $p(x|z)$, and encoder $q(z|x)$. In this problem, you will derive a different way of writing this bound. Show that $\log p(x)$ can be rewritten as:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z|x)}[\log p(z)] + \mathbb{E}_{q(z|x)}\left[\log \frac{p(x|z)}{q(z|x)}\right]}_{\text{variational lower bound}} + \underbrace{\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right]}_{\text{gap in lower bound}} \tag{2.1}$$

Figure 1: Figure from http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture11.
pdf

2.2 Connecting Reversible Models with VAEs [1pt]

Suppose we have an invertible, differentiable function $f : \mathcal{Z} \to \mathcal{X}$ which maps samples from some prior distribution $p(z)$ to data space. Let $z = f^{-1}(x)$. To compute the log probability of a datapoint $x$, one can use the change-of-variables formula:

$$\log p(x) = \log p(z) + \log \left|\det \frac{\partial f(z)}{\partial z}\right|^{-1} \tag{2.2}$$

In this problem, you will show that the change-of-variables formula can be derived as a special case of Equation 2.1. In order to write $f$ with probability distributions, we will use Dirac $\delta$-functions. A Dirac delta is defined as follows:

$$\delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \neq 0 \end{cases}$$

with the additional property that it integrates to 1:

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1$$

Suppose we define deterministic conditionals for our function $f$:

$$p(x|z) = \delta(x - f(z)) \qquad p(z|x) = \delta(z - f^{-1}(x))$$

Show that when using the above distributions and when $q(z|x) = p(z|x)$, Equation 2.1 can be rewritten as Equation 2.2.

Hint: use the following property to write one conditional in terms of the other. Let $z_0$ be a root of the function $g : \mathcal{Z} \to \mathcal{X}$. The composition of a Dirac $\delta$-function with a smooth function $g$ is defined as:

$$\delta(g(z)) = \left|\det \frac{\partial g(z)}{\partial z}\right|^{-1}_{z = z_0} \delta(z - z_0) \tag{2.3}$$

2.3 Computing the Likelihood of a Composition of Transformations [1pt]

Consider the problem of computing a lower bound on the likelihood of a datapoint $x$ when using some transformation $f$ that maps latent samples $z \sim p(z)$ to some $x \in \mathcal{X}$. When $f$ is smooth and bijective, we can use Equation 2.2, the change-of-variables formula, to get the exact log-likelihood of any $x$. When $f$ is a stochastic transformation with $p(x|z)$ and $q(z|x)$, we can use Equation 2.1, the variational lower bound, to get a bound on the likelihood. Suppose the function $f := f_T \circ f_{T-1} \circ \dots \circ f_1$ is a composition of several transformations, each of which is either stochastic (VAE) or bijective (a reversible model).
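As a sanity check on the change-of-variables formula (a toy example of mine, not required by the question): for a 1-D affine bijection $f(z) = az + b$ with a standard normal prior, $\log p(x) = \log p(z) - \log|a|$ matches the exact density of $x \sim \mathcal{N}(b, a^2)$:

```python
import math, random

def gauss_logpdf(x, mu=0.0, sigma=1.0):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

a, b = 2.0, 1.0                      # bijection f(z) = a*z + b, so |det df/dz| = |a|
random.seed(0)
for _ in range(5):
    z = random.gauss(0, 1)
    x = a * z + b
    lhs = gauss_logpdf(x, mu=b, sigma=abs(a))       # exact density of x
    rhs = gauss_logpdf(z) - math.log(abs(a))        # change-of-variables formula
    assert abs(lhs - rhs) < 1e-9
```

The $-\log|a|$ term is exactly the $\log|\det \partial f/\partial z|^{-1}$ correction from Equation 2.2.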
We can write a bound on the likelihood as:

$$\log p(x) \ge \log p(z) + \sum_{t=1}^{T} V_t$$

How would we find each of the $V_t$? Hint: these terms may differ depending on the type of transformation.

3 Reinforcement Learning

3.1 Bellman Equation

The Bellman equation can be seen as a fixed-point equation for what is called the Bellman operator. Given a policy $\pi$, the Bellman operators $T^\pi$ for $V : B(\mathcal{S}) \to B(\mathcal{S})$ and $T^\pi$ for $Q : B(\mathcal{S} \times \mathcal{A}) \to B(\mathcal{S} \times \mathcal{A})$ are defined as follows:

$$(T^\pi V)(s) \triangleq r^\pi(s) + \gamma \int \mathcal{P}(s' \mid s, a)\,\pi(a \mid s)\,V(s')$$

$$(T^\pi Q)(s, a) \triangleq r(s, a) + \gamma \int \mathcal{P}(s' \mid s, a)\,\pi(a' \mid s')\,Q(s', a')$$

for all $s \in \mathcal{S}$ (for $V$) or all $(s, a) \in \mathcal{S} \times \mathcal{A}$ (for $Q$).

The Bellman operators have two important properties: 1) monotonicity and 2) $\gamma$-contraction. These properties give us many guarantees, e.g. that applying the operator repeatedly converges to a unique and optimal solution, which is what allows us to show that RL algorithms such as Q-learning converge (under certain additional assumptions, which we will not go over here). In this section, we will show that the Bellman operator indeed has these two properties.

a) [0.5pt] Show that the Bellman operator (on $V$) has the monotonicity property, i.e., show that for a fixed policy $\pi$, if $V_1, V_2 \in B(\mathcal{S})$ and $V_1 \le V_2$, then we have

$$T^\pi V_1 \le T^\pi V_2$$

b) [1pt] Show that the Bellman operator is a $\gamma$-contraction mapping with respect to the supremum norm (on $Q$), i.e., show that for a discount factor $\gamma$ and $Q_1, Q_2 \in B(\mathcal{S} \times \mathcal{A})$, we have

$$\|T^\pi(Q_1) - T^\pi(Q_2)\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$$

Recall from your math classes that the supremum norm (on $Q$) is defined as:

$$\|Q\|_\infty = \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} |Q(s, a)| \tag{3.1}$$

Hint: for some function $f$, we have the following. For this question, think about what $P$ and $f$ are in our case.
$$\left|\int P(x) f(x)\right| \le \int |P(x) f(x)| = \int |P(x)| \cdot |f(x)| \le \int P(x) \cdot \sup_{x \in \mathcal{X}} |f(x)| = \sup_{x \in \mathcal{X}} |f(x)| \int P(x) = \|f\|_\infty \tag{3.2}$$

where in the last line we used the fact that $\int P(x) = 1$.

c) [0.25pt] For this question, you may assume knowledge of the reward function $r(s, a)$ and the transition probability function $p(s' \mid s, a)$, where $s'$ is the next state.

1. Give a definition of $v_*(s)$ in terms of $q_*(s, a)$.
2. Give a definition of $q_*(s, a)$ in terms of $v_*(s)$.
3. Give a definition of $a_*$ in terms of $q_*(s, a)$.
4. Give a definition of $a_*$ in terms of $v_*(s)$.

3.2 The REINFORCE Gradient Estimator

3.2.1 Choice of Baseline [1pt]

In class/tutorial we saw that the REINFORCE gradient can be written as follows using the log-derivative trick:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\substack{s_t \sim \rho_0(s) \\ a_t \sim \pi_\theta(a_t \mid s_t) \\ s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left[\sum_{t'=1}^{T} R(s_{t'}, a_{t'})\right]\right] \tag{3.3}$$

In tutorial, we also saw that subtracting a baseline $b \in \mathbb{R}$ as follows is a variance reduction technique:

$$\nabla_\theta \mathcal{J}_b(\theta) = \mathbb{E}_{\substack{s_t \sim \rho_0(s) \\ a_t \sim \pi_\theta(a_t \mid s_t) \\ s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left[\left(\sum_{t'=1}^{T} R(s_{t'}, a_{t'})\right) - b\right]\right] \tag{3.4}$$

In practice, to estimate the gradient, we can execute the policy in the environment $N$ times; the averages over the $N$ runs are unbiased estimators of both (3.3) and (3.4). For this question, derive an optimal baseline $b \in \mathbb{R}^M$ in terms of $\pi_\theta$ and $R$ that minimizes the variance of this estimator, where $M$ is the size of the gradient.

Hint: follow these three steps for each $k$-th element of the gradient:

1) Write down the variance term $\mathrm{Var}\left[\sum_{t=1}^{T} \nabla_{\theta_k} \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) \left[\left(\sum_{t'=t}^{T} R\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right)\right) - b_k\right]\right]$. Recall from your probability class that for a random variable $x$, we have $\mathrm{Var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.

2) To minimize the variance, compute $\frac{d}{db_k} \mathrm{Var}$, set it to zero, and solve for $b_k$.

3) Finally, to show that what we found is a minimum, show that the second derivative is positive.
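The variance-reduction effect of a baseline can be seen on a tiny one-step example (my own toy, not part of the assignment): a two-armed bandit with a one-parameter softmax policy and deterministic rewards of my choosing. Subtracting a constant near the mean reward leaves the gradient estimate unbiased but shrinks its variance:

```python
import math, random

random.seed(0)
theta = 0.3
rewards = {0: 1.0, 1: 0.0}               # deterministic toy rewards

def pi0():
    """Probability of action 0 under a softmax over logits (theta, -theta)."""
    return math.exp(theta) / (math.exp(theta) + math.exp(-theta))

def grad_log_pi(a):
    """d/dtheta log pi(a) for the two-action softmax policy."""
    t = math.tanh(theta)
    return (1 - t) if a == 0 else (-1 - t)

def estimate(b, m=20000):
    """Mean and variance of the one-step REINFORCE sample g * (R - b)."""
    samples = []
    for _ in range(m):
        a = 0 if random.random() < pi0() else 1
        samples.append(grad_log_pi(a) * (rewards[a] - b))
    mean = sum(samples) / m
    var = sum((s - mean) ** 2 for s in samples) / m
    return mean, var

m0, v0 = estimate(b=0.0)                 # no baseline
mb, vb = estimate(b=0.5)                 # crude baseline near the mean reward
assert vb < v0                           # baseline reduces variance
assert abs(m0 - mb) < 0.05               # both estimates remain (nearly) unbiased
```

The hint's derivation goes further and solves for the baseline that minimizes this variance exactly; here $b = 0.5$ is just a rough stand-in.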
3.2.2 Importance Sampling [1pt]

For a general function $f$ and a probability measure $P$, computing the following is hard if we do not have access to samples from $P$, or if $P$ has very low probability where $f$ takes its largest values:

$$\mathbb{E}_{P(X)}[f(X)] = \int_x P(x) f(x)\,dx \approx \frac{1}{m} \sum_{i=1}^{m} f\left(x^{(i)}\right) \quad \text{with } x^{(i)} \sim P$$

Importance sampling provides an alternative solution: instead of sampling from $P$, we sample from some distribution $Q$ of our choice. Given samples from $Q$, we can then estimate the expectation w.r.t. $P$ as follows:

$$\mathbb{E}_{P(X)}[f(X)] = \mathbb{E}_{Q(X)}\left[\frac{P(X)}{Q(X)} f(X)\right] \approx \frac{1}{m} \sum_{i=1}^{m} \frac{P\left(x^{(i)}\right)}{Q\left(x^{(i)}\right)} f\left(x^{(i)}\right) \quad \text{with } x^{(i)} \sim Q$$

When we derived the REINFORCE estimator in lecture/tutorial, we assumed access to samples from $\pi_\theta$. Suppose we no longer have this assumption, and we only have samples from another distribution, $\pi_{\theta'}$, instead. Derive the REINFORCE estimator for policy $\pi_\theta$ under this new setting with importance sampling, i.e. derive $\nabla_\theta J(\theta)$ with $a_t \sim \pi_{\theta'}(a \mid s_t)$. Your solution should be only in terms of $\pi_\theta$, $\pi_{\theta'}$, and $r$. Show your steps explicitly. What happens if the two policies $\pi_\theta$ and $\pi_{\theta'}$ are close to each other?

Hint 1: You can use the following property: for some function $f$, $\nabla_\theta \mathbb{E}[f(\theta)] = \mathbb{E}[\nabla_\theta f(\theta)]$.

Hint 2: Use the importance sampling equation given above, where $P$ should be the probability of a trajectory whose actions are sampled from $\pi_\theta$, and $Q$ should be the probability of a trajectory whose actions are sampled from $\pi_{\theta'}$.

References

Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, and Max Welling. SurVAE flows: Surjections to bridge the gap between VAEs and flows. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
URL https://proceedings.neurips.cc/paper/2020/hash/9578a63fbe545bd82cc5bbe749636af1-Abstract.html.