COMP2610/COMP6261 — Information Theory
Assignment 1
Robert C. Williamson
August 18, 2019
Due 10:00AM 16 September
Follow the submission and format requirements spelled out on the Wattle page.
COMP2610 students, do questions 1 and 2 (worth 50 and 30 marks respectively); you will
be marked out of 80. COMP6261 students, do all questions.
Most of the questions require you to show a result which I state, and most build upon the
claims in earlier parts. If you are unable to show a particular part, you may still use the result
(even if you have not shown it) in showing subsequent parts. Thus you can start anywhere!
The assignment has a philosophical theme, which is somewhat contrary to the flavour
of our text, namely that the standard Shannon information is not sufficient for theoretically
understanding data analysis (machine learning) problems. The good news is that there are other
notions of information (which you will meet in the assignment) which are the “right” notions for
such a task. The justification for this claim is in the deceptively simple-looking boxed expression
in question 2(g); almost all preceding questions are steps in its derivation.
Question 1 (50 marks total) Suppose g : R+ → R is convex and g(1) = 0 (such a g is called a
generator henceforth). The generalised divergence between two distributions p and q on
some set X is defined by
Dg(p ‖ q) ≔ E_{X∼q}[ g(p(X)/q(X)) ] = ∑_{x∈X} q(x) g(p(x)/q(x)).
Here it is assumed that g(0) = lim_{x↓0} g(x), 0·g(0/0) = 0, and 0·g(a/0) = lim_{x↓0} x·g(a/x) for
a > 0.
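For concreteness, the following small Python sketch (purely illustrative, and not part of the assignment) evaluates Dg numerically under the conventions above; the helper name D_g, the generator g(t) = (t − 1)² and the distributions are arbitrary choices made only for illustration.

    import numpy as np

    def D_g(p, q, g, g_slope_at_inf=np.inf):
        """Generalised divergence D_g(p||q) = sum_x q(x) * g(p(x)/q(x)),
        with the conventions 0*g(0/0) = 0 and
        0*g(a/0) = lim_{x->0} x*g(a/x) = a * lim_{t->inf} g(t)/t for a > 0.
        `g_slope_at_inf` is that last limit (assumed known for the chosen g)."""
        total = 0.0
        for px, qx in zip(p, q):
            if qx > 0:
                total += qx * g(px / qx)
            elif px > 0:                     # q(x) = 0 < p(x)
                total += px * g_slope_at_inf
            # p(x) = q(x) = 0 contributes nothing, by the 0*g(0/0) = 0 convention
        return total

    # Illustrative generator (an arbitrary choice): g(t) = (t - 1)^2,
    # convex on R+ with g(1) = 0, hence a valid generator.
    g = lambda t: (t - 1.0) ** 2

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(D_g(p, q, g))   # strictly positive here
    print(D_g(p, p, g))   # 0.0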
(a) Prove that for all generators g, and for all p, q, Dg(p ‖ q) ≥ 0. (2 marks)
(b) For any convex function g : R+ → R, the perspective of g is a function g˘ : R+ × R+ → R given by
g˘(x, y) = y g(x/y).
It is known that the convexity of g guarantees the (joint)-convexity of g˘(x, y) (meaning it is convex in the pair (x, y) and consequently convex separately in each of x
and y). Show that the map (p, q) ↦ Dg(p ‖ q) is convex. (2 marks)
(c) Show that g˘(x, y) is one-homogeneous: that is, for all x, y, for any α > 0, g˘(αx, αy) = αg˘(x, y). Hence show that
lim_{α→∞} (1/α) g˘(αx, αy) = g˘(x, y).
(2 marks)
(d) Using the previous result, show that for any generator g, Dg(p ‖ q) satisfies
Dg(p ‖ q) = ∑_{x∈X} g˘(p(x), q(x)).
(5 marks)
(e) Suppose pXY and qXY are two joint distributions with marginals pY and qY respectively. Show that for any generator g,
Dg(pXY ‖ qXY) ≥ Dg(pY ‖ qY).
(5 marks)
(f) Given two distributions pX, qX and a channel pY|X, define
pY = ∑_{x∈X} pY|X(· | x) pX(x),
qY = ∑_{x∈X} pY|X(· | x) qX(x).
Show the data processing theorem for generalised divergences: for any generator g,
Dg(pY ‖ qY ) ≤ Dg(pX ‖ qX).
Hint: use the expression in part (d) for Dg in terms of g˘, and exploit the positive
homogeneity and convexity of g˘. (15 marks)
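As a numerical sanity check (not a proof) of the inequality in part (f), one can draw distributions and a channel at random and compare the two divergences; the sketch below reuses the D_g helper and the illustrative generator g from the sketch following the definition of Dg.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_dist(n):
        v = rng.random(n)
        return v / v.sum()

    nX, nY = 4, 3
    pX, qX = random_dist(nX), random_dist(nX)
    # Channel pY|X: row x is the distribution of Y given X = x.
    channel = np.array([random_dist(nY) for _ in range(nX)])

    pY = channel.T @ pX   # pY(y) = sum_x pY|X(y|x) pX(x)
    qY = channel.T @ qX

    # Data processing: passing through the channel cannot increase D_g.
    assert D_g(pY, qY, g) <= D_g(pX, qX, g) + 1e-12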
(g) Given a generator g, let ĝ(x) ≔ x g(1/x) = g˘(1, x), which is thus guaranteed convex
(by the property of the perspective), and satisfies ĝ(1) = 0; thus ĝ is also a
generator. Show that for any generator g, for all p, q,
Dg(p ‖ q) = Dĝ(q ‖ p),
and show that applying the same construction to ĝ recovers g. (5 marks)
(h) Consequently, determine the form of the generator gKL such that
DgKL(p ‖ q) = DKL(p ‖ q).
(5 marks)
(i) Let gTV(x) = |x − 1|. Show that for all p, q,
DgTV(p ‖ q) = ∑_{x∈X} |p(x) − q(x)|.
(5 marks)
(j) Show that for any p, q, DgTV(p ‖ q) ≤ 2. Thus prove that there cannot exist a function
Φ : R → R such that for all p, q:
DKL(p ‖ q) ≤ Φ(DgTV(p ‖ q)).
That is, one cannot upper bound DKL in terms of DgTV, or equivalently, one
cannot lower bound DgTV in terms of DKL (in a manner that holds uniformly for
all distributions p, q). (4 marks)
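The identity in part (i) and the bound in part (j) can likewise be spot-checked numerically (again reusing the D_g helper from the earlier sketch); such checks are consistency tests only, not substitutes for the proofs.

    import numpy as np

    g_TV = lambda t: abs(t - 1.0)

    rng = np.random.default_rng(1)
    for _ in range(1000):
        v, w = rng.random(5), rng.random(5)
        p, q = v / v.sum(), w / w.sum()
        d = D_g(p, q, g_TV)
        assert abs(d - np.abs(p - q).sum()) < 1e-9   # part (i)
        assert d <= 2.0                              # part (j)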
Question 2 (30 marks total) Let X be a random variable on X and let Y be a binary random
variable. Let η(x) ≔ P(Y = 1 | X = x) = E(Y | X = x). Let
f∗(x) ≔ I{η(x)>1/2}(x),
where IA(x) is the indicator function of A: IA(x) = 1 if x ∈ A, and IA(x) = 0 if x ∉ A,
and when we write I{Φ(x)} for some proposition Φ, we mean I{x∈X : Φ(x)}. Let us assume
henceforth, for simplicity, that η(x) ≠ 1/2 for all x ∈ X. (This assumption can easily be
removed by defining f∗ as above, except that when η(x) = 1/2, f∗(x) is defined to be a Bern(1/2)
random variable. All of the results below actually hold for that more general case.)
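As a concrete illustration of these definitions (the numbers below are hypothetical and chosen only for illustration), a finite X, a regression function η, and the induced rule f∗ might look as follows in Python.

    import numpy as np

    # Hypothetical finite X = {0, 1, 2, 3}.
    pX  = np.array([0.1, 0.4, 0.3, 0.2])   # P(X = x)
    eta = np.array([0.9, 0.2, 0.6, 0.4])   # eta(x) = P(Y = 1 | X = x), all != 1/2

    f_star = (eta > 0.5).astype(int)        # f*(x) = I{eta(x) > 1/2}
    print(f_star)                           # [1 0 1 0]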
(a) Suppose f : X → {0, 1} is an arbitrary decision function. Show that
P( f(X) ≠ Y | X = x) = 1 − (I{f(x)=1}(x) η(x) + I{f(x)=0}(x) (1 − η(x))).
(4 marks)
(b) Show that for any f : X → {0, 1}, and all x ∈ X,
P( f(X) ≠ Y | X = x) − P( f∗(X) ≠ Y | X = x) ≥ 0.
(4 marks)
(c) Thus show that for any f,
P( f∗(X) ≠ Y ) ≤ P( f(X) ≠ Y ),
and hence that f∗ is the optimal decision function. (3 marks)
(d) Show that for all x ∈ X,
P( f∗(X) ≠ Y | X = x) = min(η(x), 1 − η(x)),
and thus that
P( f∗(X) ≠ Y ) = EX min(η(X), 1 − η(X)).
(3 marks)
(e) The quantity P( f∗(X) ≠ Y ) denotes the error of the optimal decision function; this
is the smallest error one can make in predicting Y from X (regardless of method or
computational resources). Show that
P( f∗(X) ≠ Y ) = 1/2 − (1/2) EX[|2η(X) − 1|].
Hint: use min(a, b) = (1/2)(a + b) − (1/2)|a − b|. (3 marks)
(f) Let p(x) ≔ P(X = x | Y = 1) and q(x) ≔ P(X = x | Y = 0), for x ∈ X. Suppose
P(Y = 1) = 1/2. Show that
DgTV(p ‖ q) = 2 EX[|2η(X) − 1|].
(8 marks)
(g) With f∗, p, q, X and Y as in part (f), thus show that
P( f∗(X) ≠ Y ) = 1/2 − (1/4) DgTV(p ‖ q),
and thus that the optimal performance of a statistical classifier is governed by the
DgTV divergence between the “class-conditional distributions” p and q. (2 marks)
(h) Using question 1(j), argue that no such control of the error performance of a statistical
classifier is possible in terms of DKL, and thus that for the purposes of statistical
classification (machine learning) the standard notion of information (i.e. Shannon
information) does not suffice. (3 marks)
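As a numerical sanity check of the chain of identities in parts (d)–(g) (again purely illustrative; the class-conditional distributions below are made up, with P(Y = 1) = 1/2 as in part (f)):

    import numpy as np

    # Hypothetical class-conditional distributions on X = {0, 1, 2, 3}.
    p = np.array([0.4, 0.1, 0.35, 0.15])   # p(x) = P(X = x | Y = 1)
    q = np.array([0.1, 0.5, 0.20, 0.20])   # q(x) = P(X = x | Y = 0)

    pX  = 0.5 * (p + q)                    # P(X = x), since P(Y = 1) = 1/2
    eta = p / (p + q)                      # eta(x) = P(Y = 1 | X = x), by Bayes' rule

    bayes_error = float(np.sum(pX * np.minimum(eta, 1 - eta)))   # part (d)
    d_tv = float(np.abs(p - q).sum())                            # part (i)

    print(bayes_error)           # 0.275
    print(0.5 - 0.25 * d_tv)     # 0.275 as well, matching part (g)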
Question 3 (20 marks total) Suppose X is a random variable on a set X with |X| = M. Let
pX denote its distribution. Let pmax ≔ max_{x∈X} pX(x).
(a) Let p = (pmax, p2, . . . , pM), where pmax ≥ pi for i = 2, . . . , M. Let
q = (pmax, (1 − pmax)/(M − 1), . . . , (1 − pmax)/(M − 1)).
Show that
DKL(p ‖ q) = (1 − pmax) log(M − 1) + h2(pmax) − H(X),
and thus that for any random variable X on X,
H(X) ≤ (1 − pmax) log(M − 1) + h2(pmax) =: FM(pmax).
(10 marks)
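A quick numerical check of the stated bound H(X) ≤ FM(pmax) over random distributions (illustrative only; any base for the logarithm works, as long as it is used consistently):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())    # natural log throughout

    def h2(t):
        return entropy([t, 1.0 - t])

    def F_M(t, M):
        return (1.0 - t) * np.log(M - 1) + h2(t)

    rng = np.random.default_rng(2)
    M = 6
    for _ in range(1000):
        v = rng.random(M)
        pX = v / v.sum()
        assert entropy(pX) <= F_M(pX.max(), M) + 1e-12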
(b) Show that the function FM defined above is concave. (5 marks)
(c) Suppose |X| = M and that X → Y → X̂ forms a Markov chain. The idea is that
one wants to predict X by using X̂. Let X̂ = argmax_x pX|Y(x | y) (i.e. choose the
most likely value of x as one's estimate X̂). Let pe ≔ P(X ≠ X̂) (the probability of
error in predicting X by using X̂). Show that
H(X | Y) ≤ FM(P(X = X̂)) = pe log(M − 1) + h2(pe).
(5 marks)
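Similarly, the Fano-type bound in part (c) can be spot-checked on a randomly drawn joint distribution, reusing the entropy, h2 and F_M helpers from the sketch after part (a); again this is an illustration, not a proof.

    import numpy as np

    rng = np.random.default_rng(3)
    M, K = 5, 4                                   # |X| = M, Y takes K values
    pXY = rng.random((M, K)); pXY /= pXY.sum()    # a random joint distribution

    pY    = pXY.sum(axis=0)                       # marginal of Y
    pX_gY = pXY / pY                              # pX|Y(x|y); each column sums to 1

    # MAP estimate Xhat(y) = argmax_x pX|Y(x|y), and its error probability pe.
    xhat = pX_gY.argmax(axis=0)
    pe = 1.0 - sum(pXY[xhat[y], y] for y in range(K))

    # H(X|Y) = sum_y pY(y) H(pX|Y(.|y)).
    H_X_given_Y = sum(pY[y] * entropy(pX_gY[:, y]) for y in range(K))

    # F_M(P(X = Xhat)) = pe*log(M-1) + h2(pe).
    assert H_X_given_Y <= F_M(1.0 - pe, M) + 1e-12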