DS-GA-1005/CSCI-GA.2569, Inference and Representation, Fall 2020
Problem Set 3: Variational Inference, MCMC Sampling, and Optimal Transport
Due: Friday, Dec. 11, 2020 (as a PDF and .zip files uploaded to NYU Classes)

1. Variational Inference and Kullback-Leibler divergences. In this question we focus on the properties of the Kullback-Leibler divergence and its role in variational inference. Recall that for two probability measures P and Q over a set X such that P is absolutely continuous with respect to Q¹, it is defined in general as

    D_KL(P‖Q) = ∫_X log( (dP/dQ)(x) ) dP(x),

where dP/dQ is the Radon-Nikodym derivative of P with respect to Q. In particular, when both P and Q admit densities, say p and q respectively, with respect to a base measure µ, we simply write

    D_KL(P‖Q) = ∫_X log( p(x)/q(x) ) p(x) µ(dx).

(a) Show that D_KL(P‖Q) ≥ 0, with D_KL(P‖Q) = 0 if and only if P = Q.

(b) Let P(x, y), Q(x, y) be two joint probability distributions with marginals P_x(x), P_y(y) (resp. Q_x(x), Q_y(y)). Show that if the distributions are independent, i.e. P(x, y) = P_x(x) P_y(y) and Q(x, y) = Q_x(x) Q_y(y), then

    D_KL(P‖Q) = D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y).

Give one example where, without independence, D_KL(P‖Q) > D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y), and another where D_KL(P‖Q) < D_KL(P_x‖Q_x) + D_KL(P_y‖Q_y). [Note: a numerical example is OK, and property (a) can serve as a hint.]

(c) Recall from Jensen's inequality that in a mixture model p(x, z) with observed variables x and latent variables z, the data likelihood admits the variational lower bound

    log p(x) ≥ sup_q L_q = E_{z∼q}[log p(x, z)] − E_{z∼q}[log q(z)],    (1)

where the sup is over an arbitrary set of variational distributions absolutely continuous w.r.t. µ. Verify that

    D_KL(q ‖ p(z|x)) = log p(x) − L_q.

Maximizing the lower bound (1) with respect to q is thus equivalent to minimizing the Kullback-Leibler divergence between q and the posterior p(z|x).

(d) Assume a latent representation Z = (z_1, …, z_m).
Mean-field variational inference seeks a variational bound in which q(z) is separable: q(z) = ∏_{j=1}^m q_j(z_j). Show that the optimal separable q in (1) satisfies

    for all j ∈ [m],  q_j(z_j) ∝ exp( E_{Z_{−j}}[log p(z_j, Z_{−j}, x)] ),

where Z_{−j} denotes all the variables except z_j and the expectation is with respect to ∏_{k≠j} q_k(z_k).

¹ P is absolutely continuous with respect to Q if P(A) = 0 for every measurable set A for which Q(A) = 0.

(e) Compare the variational solution obtained above with the related optimization problem

    min_{q(z) = ∏_j q_j(z_j)} D_KL(p(z|x) ‖ q).    (2)

Can you give a closed-form solution of (2)? Explain the benefits of formulation (1) over (2) in statistical and computational terms.

2. MCMC methods for coloring. In computer science, there is wide interest in sampling uniformly from the set of proper q-colorings of an undirected graph. Given an undirected graph G on n vertices and q ∈ N, we call a mapping σ : [n] → [q], from the vertices of G to the q colors, a q-coloring of G. We say that a q-coloring of G is a proper q-coloring if for every edge of G, say between the vertices indexed by x and y, where x, y ∈ [n], it holds that σ(x) ≠ σ(y). Informally, the coloring is proper if no edge connects two vertices of the same color. Motivation for the notion of proper coloring comes from various applications such as scheduling, radio frequency assignment, and register allocation. Consider for example the case where we want to make an exam schedule for a university. We have a list of subjects and the students enrolled in each subject; many subjects will have students in common. How do we schedule the exams so that no two exams with a common student are scheduled at the same time? What is the minimum number of time slots needed to schedule all exams? This problem can be represented as a graph where every vertex is a subject and an edge between two vertices means there is a common student.
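As an aside, the proper-coloring condition just defined is easy to express in code. A minimal sketch (the function name `is_proper` and the edge-list representation are illustrative choices, not part of the problem statement):

```python
def is_proper(edges, sigma):
    """Return True iff no edge joins two vertices of the same color.

    edges: iterable of (x, y) vertex pairs;
    sigma: mapping from each vertex to a color in {0, ..., q-1}.
    """
    return all(sigma[x] != sigma[y] for x, y in edges)

# Toy example: on a triangle, three distinct colors are proper,
# but any 2-coloring leaves some edge monochromatic.
triangle = [(0, 1), (1, 2), (2, 0)]
print(is_proper(triangle, [0, 1, 2]))  # True
print(is_proper(triangle, [0, 1, 0]))  # False: edge (2, 0) clashes
```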
So this is a coloring problem, where being able to use only q time slots means being able to find a proper q-coloring of the underlying graph.

(a) Suppose that ∆ is the maximum degree of G and that q ≥ ∆ + 1. Prove that a proper q-coloring of G always exists. (Hint: prove that a greedy algorithm always works.)

(b) Suppose q ≥ ∆ + 1. Consider the state space of all q-colorings of G. Define a Metropolis chain on this state space whose stationary distribution is the uniform distribution over all proper q-colorings of G.

(c) Suppose q ≥ ∆ + 1. Consider the state space of all q-colorings of G. Note that the Glauber dynamics for sampling a proper q-coloring uniformly starts from an arbitrary q-coloring and, at every step, chooses one vertex of G uniformly at random and then updates its color uniformly among the colors that are not present among its immediate neighbors. Do the Metropolis chain from part (b) and the Glauber dynamics correspond to the same Markov chain? Justify your answer.

As a note, Jerrum [1] and, independently, Salas and Sokal [2] proved that when q > 2∆, the mixing time of the above Metropolis chain (part (b)) is O(n log n), which is extremely fast, since the space of q-colorings has cardinality q^n. As a challenge (and independently from the homework) you can try to prove this result when q > 4∆ via the coupling method we learned in class.

3. Sampling for Optimization. In many inference settings one wishes to solve optimization problems of the form max f(x), x ∈ X. We assume that X is discrete, satisfies |X| = N, and is the vertex set of an undirected graph G. In many cases N is very large and an exhaustive search over X is impossible. Yet, the structure of G and f(x) can allow for better approaches.
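One such approach, a Metropolis chain targeting a Gibbs distribution proportional to e^{βf(x)} on the graph, can be sketched in a few lines. The sketch below is purely illustrative: the toy graph, the function f, and the name `metropolis_step` are hypothetical choices, and it assumes a regular graph so that the uniform-neighbor proposal is symmetric (on an irregular graph a Metropolis-Hastings correction involving vertex degrees would be needed).

```python
import math
import random

def metropolis_step(x, neighbors, f, beta, rng=random):
    """One Metropolis step for the target distribution ∝ exp(beta * f(x)).

    Propose a uniformly random neighbor y of x and accept with probability
    min(1, exp(beta * (f(y) - f(x)))). Note that downhill moves (f(y) < f(x))
    are accepted with positive probability, which lets the chain escape
    local maxima.
    """
    y = rng.choice(neighbors[x])
    if math.log(rng.random()) < beta * (f(y) - f(x)):
        return y
    return x

# Toy example: X = {0, ..., 9} arranged in a cycle (2-regular), f peaks at 7.
neighbors = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
f = lambda x: -abs(x - 7)
x = 0
for _ in range(2000):
    x = metropolis_step(x, neighbors, f, beta=2.0)
# For large beta the chain spends most of its time near the maximizer x = 7.
```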
As an example to keep in mind, one can consider the high-dimensional sparse regression setting we studied in Lecture 11, with the maximum likelihood estimator as the optimization problem of interest. Notice that in this setting X is the set of binary k-sparse vectors. The graph G can be naturally defined by placing an edge between two binary k-sparse vectors if they differ in exactly two entries.

A standard approach to this optimization problem is to use MCMC methods to sample from the associated Gibbs distribution π_β(x) ∝ e^{βf(x)}, x ∈ X, for sufficiently large β > 0. Notice that running an MCMC method defined on G for one iteration is usually not computationally expensive: in most cases the graphs of interest are "sparse" and the neighborhood of each vertex is "small" compared to N. Hence, a step of the random walk with the appropriate rejection step is easy to implement.

(a) Show that as β → +∞, π_β converges in distribution to the uniform measure over the global maximizers of the function f(x), x ∈ X.

(b) Consider the random walk on G where, at each step, the walk moves from a vertex to a uniformly random neighboring vertex (i.e., one connected to it by an edge). Use this random walk to construct a Metropolis chain with stationary distribution π_β. Can the chain move from a vertex x to a vertex y with f(x) > f(y)? If not, provide a proof. If yes, explain why this could be beneficial for the optimization of f in case it has local but non-global maxima.

(c) Suppose that v ∈ X is a cut vertex of G, that is, X \ {v} can be written as the union of two disjoint sets S, T such that no edge of G exists between an element of S and an element of T. Suppose furthermore that for some vertices v1 ∈ S, v2 ∈ T it holds that f(v1) ≥ f(v) + L and f(v2) ≥ f(v) + L. Show that for all β > 0, the mixing time of the Metropolis chain is lower bounded by e^{βL}.
(Hint: use the bottleneck ratio lower bound. Consider as candidate sets S ∪ {v} and T ∪ {v}.)

4. Optimal coupling on the real line. Recall the definition of the Kantorovich optimal transport problem over two real-valued distributions:

    W(µ, ν) = min_π ∫_{R²} c(x, y) π(x, y) dx dy,

where π is a coupling of µ, ν, that is, a distribution on R² whose marginal over x is µ and whose marginal over y is ν. Suppose that µ, ν have CDFs F(x), x ∈ R, and G(y), y ∈ R, respectively. We assume that c(x, y) = d(x − y), where d : R → R is strictly convex and continuous (e.g. the quadratic function d(u) = u²). Let π̃ be the distribution on R² with

    π̃((−∞, x] × (−∞, y]) = min{F(x), G(y)}, for all x, y ∈ R.

We will show that π̃ is the unique optimal solution of the Kantorovich optimal transport problem.

(a) Let π* be an optimal solution of the Kantorovich optimal transport problem (you may assume an optimal solution exists). Show that if (x1, y1), (x2, y2) are in the support of π* and x1 < x2, then y1 ≤ y2. Explain why this makes sense from an "optimal transport" point of view. (Hint: you may assume without proof that the support of π* is monotone, that is, for every (x1, y1), (x2, y2) in the support of π* it holds that d(x1 − y1) + d(x2 − y2) ≤ d(x1 − y2) + d(x2 − y1). Recall also that d is strictly convex.)

(b) Conclude that π* = π̃. In particular, using the conclusion of part (a), prove that it necessarily holds that π*((−∞, x] × (−∞, y]) = min{F(x), G(y)} for all x, y ∈ R. (Hint: consider the two sets A = (−∞, x] × (y, +∞) and B = (x, +∞) × (−∞, y], and prove that necessarily π*(A) = 0 or π*(B) = 0.)

(c) Provide some high-level informal intuition for why π̃ essentially moves the mass assigned to x by µ to the point y where F(x) = G(y), and why this makes sense.

References

[1] Mark Jerrum.
A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.

[2] Jesús Salas and Alan D. Sokal. Logarithmic corrections and finite-size scaling in the two-dimensional 4-state Potts model. Journal of Statistical Physics, 88(3):567–615, 1997.