DS-GA-1005/CSCI-GA.2569, Inference and Representation, Fall 2020
Problem Set 3: Variational Inference, MCMC Sampling, and Optimal Transport
Due: Friday, Dec. 11, 2020 (as a PDF and .zip files uploaded to NYU Classes)

1. Variational Inference and Kullback-Leibler divergences. In this question we focus on the properties of the Kullback-Leibler divergence and its role in variational inference. Recall that for two probability measures P and Q over a set X such that P is absolutely continuous with respect to Q¹, it is defined in general as

    D_KL(P‖Q) = ∫_X log( (dP/dQ)(x) ) dP(x),

where dP/dQ is the Radon-Nikodym derivative of P with respect to Q. In particular, when both P and Q admit densities, say p and q respectively, with respect to a base measure µ, we simply write

    D_KL(P‖Q) = ∫_X log( p(x)/q(x) ) p(x) µ(dx).

(a) Show that D_KL(P‖Q) ≥ 0, with D_KL(P‖Q) = 0 if and only if P = Q.

(b) Let P(x, y), Q(x, y) be two joint probability distributions with marginals P_x(x), P_y(y) (resp. Q_x(x), Q_y(y)). Show that if the distributions are independent, i.e. P(x, y) = P_x(x) P_y(y) and Q(x, y) = Q_x(x) Q_y(y), then

    D_KL(P‖Q) = D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y).

Give one example where, without independence, D_KL(P‖Q) > D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y), and another where D_KL(P‖Q) < D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y). [Note: a numerical example is OK, and property (a) can serve as a hint.]

(c) Recall from Jensen's inequality that in a mixture model p(x, z) with observed variables x and latent variables z, the data likelihood admits the variational lower bound

    log p(x) ≥ sup_q L_q = E_{z∼q}[log p(x, z)] − E_{z∼q}[log q(z)],    (1)

where the sup is over an arbitrary set of variational distributions absolutely continuous w.r.t. µ. Verify that

    D_KL(q ‖ p(z|x)) = log p(x) − L_q.

Maximizing the lower bound (1) with respect to q is thus equivalent to minimizing the Kullback-Leibler divergence between q and the posterior p(z|x).

(d) Assume a latent representation Z = (z_1, …, z_m).
Mean-field variational inference seeks a variational bound in which q(z) is separable: q(z) = ∏_{j=1}^m q_j(z_j). Show that the optimal separable q in (1) satisfies

    for all j ∈ [m],  q_j(z_j) ∝ exp( E_{Z_{−j}}[log p(z_j, Z_{−j}, x)] ),

where Z_{−j} denotes all the variables except z_j and the expectation is with respect to ∏_{k≠j} q_k(z_k).

¹ P is absolutely continuous with respect to Q if P(A) = 0 for every measurable set A for which Q(A) = 0.

(e) Compare the variational solution obtained above with the related optimization problem

    min_{q(z) = ∏_j q_j(z_j)} D_KL(p(z|x) ‖ q).    (2)

Can you give a closed-form solution of (2)? Explain the benefits of formulation (1) over (2) in statistical and computational terms.

2. MCMC methods for coloring. In computer science, there is wide interest in sampling uniformly from the set of proper q-colorings of an undirected graph. Given an undirected graph G on n vertices and q ∈ N, we call a mapping σ : [n] → [q], from the vertices of G to the q colors, a q-coloring of G. We say that a q-coloring of G is a proper q-coloring if for every edge of G, say between the vertices indexed by x and y, where x, y ∈ [n], it holds that σ(x) ≠ σ(y). Informally, the coloring is proper if no edge connects two vertices of the same color. Motivation for the notion of proper coloring comes from various applications such as scheduling, radio frequency assignment, and register allocation. Consider for example the case where we want to make an exam schedule for a university. We have a list of subjects and the students enrolled in each subject; many subjects will have students in common. How do we schedule the exams so that no two exams with a common student are scheduled at the same time? What is the minimum number of time slots needed to schedule all exams? This problem can be represented as a graph where every vertex is a subject and an edge between two vertices means there is a common student.
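As an aside, the proper-coloring condition just defined is easy to express in code. A minimal sketch (the function name `is_proper` and the edge-list representation are illustrative choices, not part of the problem statement):

```python
def is_proper(edges, sigma):
    """Return True iff no edge joins two vertices of the same color.

    edges: iterable of (x, y) vertex pairs;
    sigma: mapping from each vertex to a color in {0, ..., q-1}.
    """
    return all(sigma[x] != sigma[y] for x, y in edges)

# Toy example: on a triangle, three distinct colors are proper,
# but any 2-coloring leaves some edge monochromatic.
triangle = [(0, 1), (1, 2), (2, 0)]
print(is_proper(triangle, [0, 1, 2]))  # True
print(is_proper(triangle, [0, 1, 0]))  # False: edge (2, 0) clashes
```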
So this is a coloring problem, where being able to use only q time slots means being able to find a proper q-coloring of the underlying graph.

(a) Suppose that ∆ is the maximum degree of G and that q ≥ ∆ + 1. Prove that a proper q-coloring of G always exists. (Hint: prove that a greedy algorithm always works.)

(b) Suppose q ≥ ∆ + 1. Consider the state space of all q-colorings of G. Define a Metropolis chain on this state space whose stationary distribution is the uniform distribution over all proper q-colorings of G.

(c) Suppose q ≥ ∆ + 1. Consider the state space of all q-colorings of G. Note that the Glauber dynamics for sampling a proper q-coloring uniformly starts from an arbitrary q-coloring and, at every step, chooses one vertex of G uniformly at random and then updates its color uniformly among the colors that are not present among its immediate neighbors. Do the Metropolis chain from part (b) and the Glauber dynamics correspond to the same Markov chain? Justify your answer.

As a note, Jerrum [1] and, independently, Salas and Sokal [2] proved that when q > 2∆, the mixing time of the above Metropolis chain (part (b)) is O(n log n), which is extremely fast, since the space of q-colorings has cardinality q^n. As a challenge (and independently from the homework) you can try to prove this result when q > 4∆ via the coupling method we learned in class.

3. Sampling for Optimization. In many inference settings one wishes to solve optimization problems of the form max f(x), x ∈ X. We assume that X is discrete, satisfies |X| = N, and is the vertex set of an undirected graph G. In many cases N is very large and an exhaustive search over X is impossible. Yet, the structure of G and f(x) can allow for better approaches.
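One such approach, a Metropolis chain targeting a Gibbs distribution proportional to e^{βf(x)} on the graph, can be sketched in a few lines. The sketch below is purely illustrative: the toy graph, the function f, and the name `metropolis_step` are hypothetical choices, and it assumes a regular graph so that the uniform-neighbor proposal is symmetric (on an irregular graph a Metropolis-Hastings correction involving vertex degrees would be needed).

```python
import math
import random

def metropolis_step(x, neighbors, f, beta, rng=random):
    """One Metropolis step for the target distribution ∝ exp(beta * f(x)).

    Propose a uniformly random neighbor y of x and accept with probability
    min(1, exp(beta * (f(y) - f(x)))). Note that downhill moves (f(y) < f(x))
    are accepted with positive probability, which lets the chain escape
    local maxima.
    """
    y = rng.choice(neighbors[x])
    if math.log(rng.random()) < beta * (f(y) - f(x)):
        return y
    return x

# Toy example: X = {0, ..., 9} arranged in a cycle (2-regular), f peaks at 7.
neighbors = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
f = lambda x: -abs(x - 7)
x = 0
for _ in range(2000):
    x = metropolis_step(x, neighbors, f, beta=2.0)
# For large beta the chain spends most of its time near the maximizer x = 7.
```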
As an example to keep in mind, one can consider the high-dimensional sparse regression setting we studied in Lecture 11, with the maximum likelihood estimator as the optimization problem of interest. Notice that in this setting X is the set of binary k-sparse vectors. The graph G can be naturally defined by placing an edge between two binary k-sparse vectors if they differ in exactly two entries.

A standard approach to this optimization problem is to use MCMC methods to sample from the associated Gibbs distribution π_β(x) ∝ e^{βf(x)}, x ∈ X, for sufficiently large β > 0. Notice that running an MCMC method defined on G for one iteration is usually not computationally expensive: in most cases the graphs of interest are "sparse" and the neighborhood of each vertex is "small" compared to N. Hence, a step of the random walk with the appropriate rejection step is easy to implement.

(a) Show that as β → +∞, π_β converges in distribution to the uniform measure over the global maximizers of the function f(x), x ∈ X.

(b) Consider the random walk on G where, at each step, the walk moves from a vertex to a uniformly random neighboring vertex (i.e., one connected to it by an edge). Use this random walk to construct a Metropolis chain with stationary distribution π_β. Can the chain move from a vertex x to a vertex y with f(x) > f(y)? If not, provide a proof. If yes, explain why this could be beneficial for the optimization of f in case it has local but non-global maxima.

(c) Suppose that v ∈ X is a cut vertex of G, that is, X \ {v} can be written as the union of two disjoint sets S, T such that no edge of G exists between an element of S and an element of T. Suppose furthermore that for some vertices v1 ∈ S, v2 ∈ T it holds that f(v1) ≥ f(v) + L and f(v2) ≥ f(v) + L. Show that for all β > 0, the mixing time of the Metropolis chain is lower bounded by e^{βL}.
(Hint: use the bottleneck ratio lower bound. Consider as candidate sets S ∪ {v} and T ∪ {v}.)

4. Optimal coupling on the real line. Recall the definition of the Kantorovich optimal transport problem over two real-valued distributions:

    W(µ, ν) = min_π ∫_{R²} c(x, y) π(x, y) dx dy,

where π is a coupling of µ, ν, that is, a distribution on R² whose marginal over x is µ and whose marginal over y is ν. Suppose that µ, ν have CDFs F(x), x ∈ R, and G(y), y ∈ R, respectively. We assume that c(x, y) = d(x − y), where d : R → R is strictly convex and continuous (e.g. the quadratic function d(u) = u²). Let π̃ be the distribution on R² with

    π̃((−∞, x] × (−∞, y]) = min{F(x), G(y)}, for all x, y ∈ R.

We will show that π̃ is the unique optimal solution of the Kantorovich optimal transport problem.

(a) Let π* be an optimal solution of the Kantorovich optimal transport problem (you may assume an optimal solution exists). Show that if (x1, y1), (x2, y2) are in the support of π* and x1 < x2, then y1 ≤ y2. Explain why this makes sense from an "optimal transport" point of view. (Hint: you may assume without proof that the support of π* is monotone, that is, for every (x1, y1), (x2, y2) in the support of π* it holds that d(x1 − y1) + d(x2 − y2) ≤ d(x1 − y2) + d(x2 − y1). Recall also that d is strictly convex.)

(b) Conclude that π* = π̃. In particular, using the conclusion of part (a), prove that it necessarily holds that π*((−∞, x] × (−∞, y]) = min{F(x), G(y)} for all x, y ∈ R. (Hint: consider the two sets A = (−∞, x] × (y, +∞) and B = (x, +∞) × (−∞, y], and prove that necessarily π*(A) = 0 or π*(B) = 0.)

(c) Provide some high-level informal intuition for why π̃ essentially moves the mass assigned to x by µ to the point y where F(x) = G(y), and why this makes sense.

References

[1] Mark Jerrum.
A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.

[2] Jesús Salas and Alan D. Sokal. Logarithmic corrections and finite-size scaling in the two-dimensional 4-state Potts model. Journal of Statistical Physics, 88(3):567–615, 1997.