Mathematics of Random Events Nikolaos Constantinou Leonardo T. Rolla 8th December 2020 University of Warwick, Autumn 2020, Module ST-342 The following may differ from the chapters posted weekly on Moodle: • Slight changes in the proof of Theorem 3.12. • Added more explanations in the proof of Theorem 2.43, and fixed a typo. • Amended the proof of Theorem 5.26 and added examples before it. • Definition 2.25, added definition of finite and changed the wording. • Expanded Definition 4.1. • Expanded Remark 6.21. • Fixed mistakes in §5.4.1. • Added Example 3.6. • Added §5.2.5. • Expanded Remark 5.38. • Typo in Theorem 5.22. • Expanded solution to Exercise 5.1. • Fixed Exercise 4.15. • Mention that sets form a partition in Example 4.14. • Expanded the proof of Lemma 4.8. • Added definition and properties of symmetric difference to §3.2. • In the proof of Proposition 2.34, fixed equation that was giving +∞−∞. • Definition 2.46, fixed a typo. • Example 1.8, typo fixed. 1 ©2019-2020 Nikolaos Constantinou & Leonardo T. Rolla This work is licensed under Reusing this material This license allows users to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified version under identical terms. Attribution Most of this material is based on the lecture notes typeset by Nikolaos Constantinou when the ST342 module was taught by Dr. Wei Wu in the Autumn 2019 term. The notes are available at https://github.com/leorolla/teaching/raw/master/Probability/ST342-2019.pdf 2 Contents Notation, terminology and preliminaries 4 1 Infinity 6 1.1 Events at infinity . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 The measure problem . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Infinite numbers and infinite sums . . . . . . . . . . . . . . . . . 9 2 Measure Spaces 13 2.1 Classes of sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Uniqueness and Dynkin’s pi-λ Theorem . . . . . . . . . . . . . . . 19 3 Existence and approximation 22 3.1 Carathéodory Extension Theorem . . . . . . . . . . . . . . . . . 22 3.2 Approximation by events in a generating algebra . . . . . . . . . 25 3.3 Lebesgue measure on Rd . . . . . . . . . . . . . . . . . . . . . . . 28 4 Measurable functions and random variables 30 4.1 Definition and properties . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Extended Borel functions and random variables . . . . . . . . . . 32 4.3 Operations on extended Borel functions . . . . . . . . . . . . . . 35 5 Integration and expectation 38 5.1 Axiomatic definition . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Main properties of the integral . . . . . . . . . . . . . . . . . . . 41 5.3 Convergence of integral and applications . . . . . . . . . . . . . . 49 5.4 Construction of the integral . . . . . . . . . . . . . . . . . . . . . 52 6 Density and Radon-Nikodým Theorem 59 6.1 Density of measures and continuous variables . . . . . . . . . . . 59 6.2 The Radon-Nikodým Theorem . . . . . . . . . . . . . . . . . . . 63 7 Product measures and independence 67 7.1 Product σ-algebras and their properties . . . . . . . . . . . . . . 67 7.2 Product measure . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 7.3 Fubini and Tonelli Theorems . . . . . . . . . . . . . . . . . . . . 71 7.4 Independence . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 74 8 Convergence of measurable functions 78 8.1 Modes of convergence . . . . . . . . . . . . . . . . . . . . . . . . 78 8.2 Almost sure convergence and Borel-Cantelli . . . . . . . . . . . . 84 8.3 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 88 8.4 Relation between modes of convergence . . . . . . . . . . . . . . 90 References 95 3 Notation, terminology and preliminaries Notation N {1, 2, 3, 4, . . . } N0 {0, 1, 2, 3, . . . } Ac Ω \A (note that Ω is implicit) An ↑ A An ⊆ An+1 for all n and ∪nAn = A An ↓ A An ⊇ An+1 for all n and ∩nAn = A xn ↑ x xn is non-decreasing and xn → x xn ↓ x xn is non-increasing and xn → x {f < g} {ω ∈ Ω : f(ω) < g(ω)} supn fn the function h defined by h(ω) = supn fn(ω) for each ω max{f1, f2} supn fn over n ∈ {1, 2} [x]+ max{0, x}, the positive part of x [x]− max{0,−x}, the negative part of x, so x = [x]+ − [x]− bxc the largest n ∈ Z such that n 6 x ∈ R, or x if x = ±∞ µ-a.e. except on a set A with µ(A) = 0 a.e. µ-a.e. when µ is obvious from the context µ-a.e. ω every ω ∈ Ω \A, for some set A ∈ F with µ(A) = 0 a.s. with probability 1 a.s. on ω for P-a.e. ω Terminology σ-additivity Countable additivity Countable sets See [Coh13, §A.6]. Extended real numbers See §1.3.1 and [Coh13, §B.4]. Supremum and infimum See [Coh13, §B.5]. liminf, limsup, limit See [Coh13, §B.6]. One criterion of convergence is the following. 4 Lemma 0.1. Let (an)n be a sequence of real numbers, and a ∈ R. Then limn an = a if and only if, for every subsequence (ank)k, there is a further subsequence (ankj )j such that limj ankj = a. Sketch of proof. Suppose an → a and let ank be a subsequence. For each ε > 0, {n : |an − a| > ε} is finite, so {k : |ank − a| > ε} is also finite, hence ank → a. Now suppose an 6→ a, so there is ε > 0 such that {n : |an − a| > ε} is infinite. Taking (nk)k from this set, no sub-subsequence (ankj ) may converge to a. Metric spaces Given a metric space (Ω, d) and a subset B ⊆ Ω, the interior of B is the set of points x ∈ Ω for which there is a ball centred at x and contained in B. A set B ⊆ R is called open if the interior of B equals B. A set B ⊆ R is called closed if Bc is open. The sets Ω and ∅ are both opened and closed. When Ω = R, the interior of B ⊆ R is the set {x ∈ B : ∃ε > 0, (x−ε, x+ε) ⊆ B}. Also, open intervals are open and closed intervals are closed. The only sets that are open and closed at the same time are R and ∅. Every open set is given by the union of countably many disjoint open intervals. Given two metric spaces (Ω, d) and (Ω′, d′) and a function f : Ω → Ω′, we say that f is continuous if the preimage of every open sets is open. Riemann integral - Darboux’s definition See [Coh13, §p.xv]. 5 1 Infinity Can we toss a coin infinitely many times? Why is countable additivity important? We know that E[X + Y ] = EX + EY, and X > Y implies EX > EY , but why? It doesn’t seem to follow from the usual definition that treats discrete and continuous random variables as hermetically separated entities. So, what is expectation, really? Do continuous random variables even exist? How small can a class A of events be so that their probabilities determine P? Why can we differentiate moment generating functions to compute moments? Our aim is to study concepts of Measure Theory useful to Probability Theory, providing a solid ground for the latter. A measure generalises the notion of area for arbitrary sets in Euclidean spaces Rd, d > 1. 
We introduce the theory of measurable spaces, measurable functions, integral with respect to a measure, density of measures, product measures, and convergence of functions and random variables. Below we give examples that motivate the need for such a theory, discuss in which sense modern Measure Theory is the best we can hope for, and introduce the concept of infinite numbers and infinite sums used throughout the remaining chapters. 1.1 Events at infinity We know that An ↑ A implies P(An) ↑ P(A) which, assuming that P is non- negative and finitely additive, is equivalent to P being σ-additive (a shorthand for countably additive). Below we see examples where interesting models and events require some sort of limiting process in their study. Example 1.1. Coin flips {0, 1}. (i) Coin flip N times, so the sample space is Ω = {(x1, . . . , xN ) : xi ∈ {0, 1}}. For N = 7, define the event A = {(x1, . . . , xN ) : x1 = 1, x5 = 0}. Rigorously, the probability of this event is P(A) = 1/4. (ii) Coin flip infinitely many times (if we can!), so the sample space in this case is Ω = {(x1, x2, . . . ) : xi ∈ {0, 1}}. Let B be the event of 106 consecutive zeros, i.e. B = ⋃∞ n=1{ω ∈ Ω : xn = xn+1 = · · · = xn+106−1 = 0}. The probability P(B) is computed by first an estimate, and then taking limit. 4 Example 1.2 (Ruin probability). Gambling with initial wealth X0 ∈ N0. For any t > 1, we bet an integer amount and reach a wealth denoted by Xt. If at any point in time, wealth amounts to 0, it remains 0 forever. The sample 6 space that indicates the wealth process is Ω = {(x0, x1, . . . ) : xi ∈ N0}. We define the function of wealth after the n-th gamble by, Xn : Ω→ N∪{0}, where Xn(ω) = xn. In fact, (Xn)n>0 is a Markov process. Then {stay in state 0 eventually} = ∞⋃ n=0 ∞⋂ m=n {ω ∈ Ω : Xm(ω) = 0}. (1.3) By monotonicity and continuity, its probability is P({stay in state 0 eventually}) = lim n→∞P({ω ∈ Ω : Xn(ω) = 0}). 4 Example 1.4 (Brownian motion). It is possible to construct a sequence of continuous piece-wise linear functions, which in the limit give a continuous nowhere differentiable random path that is at the core of Stochastic Analysis. 4 Example 1.5 (Uniform variable). If X ∼ U [0, 1], then P(X 6= x) = 1 for every x ∈ R. Nevertheless, P(X 6= x for every x ∈ R) = 0, so there will be some unlucky x that will happen to be hit by X. Now for a countable set A, by σ-additivity we have P(X ∈ A) = ∑x∈A P(X = x) = 0. Since Q is countable, we see that a uniform random variable is always irrational. Well, this is unless there is an uncountable number of such variables in the same probability space, in which case some unlucky variables may happen to take rational values. 4 Example 1.6 (Strong law of large numbers). Let X1, X2, X3, . . . be independent and take value ±1 with probability 12 each, and take Sn = X1 + · · ·+Xn. The strong law of large numbers says that, almost surely (a shorthand for “P(· · · ) = 1”), Snn → 0 as n→∞. In order to show this, we need to express this event as a limit and compute its probability. The first part is simply: { lim n→∞ Sn n = 0} = ∞⋂ k=1 ∞⋃ n0=1 ∞⋂ n=n0 {|Snn | < 1k}. 4 Example 1.7 (Recurrence of a random walk). For the sequence (Sn)n of the previous example, we know that Snn → 0 with probability one. We want to consider whether Snn converges to zero from above, from below, or oscillating, which is the same as asking whether Sn = 0 infinitely often. 4 1.2 The measure problem It is clear (it should be!) 
from previous examples that we want to work with measures that have nice continuity properties, so we can take limits. However, when the mass is spread over uncountably many sample points ω ∈ Ω, it is not possible to assign a measure to all subsets of Ω in a reasonable way. 7 We would like to define a random variable uniformly distributed on [0, 1], by means of a function that assigns a weight to subsets of this interval. For instance, what is the probability that this number is irrational? What is the probability that its decimal expansion does not contain a 3? This is the same problem as assigning a ‘length’ to subsets of R. We are also interested in defining a measure of ‘area’ on R2, ‘volume’ on R3, and so on. Of course, a good measure of length/area/volume/etc. on Rd should: (i) give the correct value on obvious sets, such as intervals and balls; (ii) give the same value if we rotate or translate a set; (iii) be σ-additive. We stress again that σ-additivity is equivalent to a measure being continuous, and we are not willing to resign that. On the other hand, we do not want more than that: each of the uncountably many points in [0, 1] alone has length zero, but all together they have length one; likewise, each sequence of coin tosses in {0, 1}N has probability zero, but all together they have probability one. The measure problem is the following. There is no measure m defined on all subsets of Rd which satisfy all the reasonable properties listed above. What modern Measure Theory does is to work with measures that are defined on a class of sets which is large enough to be useful and small enough for these properties to hold. The next example shows that there is no measure m defined on all subsets of R3 which satisfies these three properties. Example 1.8 (Banach-Tarski paradox). Consider the ball B = {x ∈ R3 : ‖x‖ 6 1}. There exist1 k ∈ N, disjoint sets A1, . . . , A2k, and isometries (maps that preserve distances and angles) S1, . . . , S2k : R3 → R3 such that B = (A1 ∪ · · · ∪Ak) ∪ (Ak+1 ∪ · · · ∪A2k), B = S1A1 ∪ · · · ∪ SkAk, B = Sk+1Ak+1 ∪ · · · ∪ S2kA2k. So B was decomposed into finitely many pieces, which were later on moved around rigidly and recombined to produce two copies of B! Why is it a paradox? Finitely many pieces is not the issue in itself, since N can be decomposed into even and odd numbers, and they can be compressed (or stretched, in some sense) to produce two copies of N. Rigidity alone is not the issue either, since we can move each of the uncountably many points of the segment [0, 1] to form the 1See https://youtu.be/s86-Z-CbaHA for a nice overview of the proof. 8 segment [0, 2]. The paradox is that this magic was done with rigid movements on finitely many pieces. And here we can see the measure problem: if all these disjoint sets A1, . . . , A2k were to have a volume V1, . . . , V2k > 0, what would be the volume of the ball B? 4 The next example is not nearly as effective in impressing friends at a party, and would certainly not make a youtube video with 31 million views, but it has two advantages. First, it shows directly that the measure problem already occurs on d = 1. Second, we can actually explain its proof in a third of a page rather than a dozen. Example 1.9 (Vitali set). Consider the unit circle S1 with points indexed by turns instead of degrees or radians. This is the same as the interval S1 = [0, 1) with the angle addition operation x⊕ y = x+ y mod 1. 
There exists a set E ⊆ [0, 1) such that S1 is decomposed into disjoint {En}n∈N which are translations of E. And here we see again the measure problem: by σ-additivity, if the length of E is zero, then the length of the circle is zero; and if the length of E is non-zero, then the length of the circle is infinite. So E is not measurable. 4 Sketch of proof. Write Q∩[0, 1) = {rn}n=1,2,3,.... For E ⊆ S1, let En = {x⊕rn : x ∈ E} be the translation of E by rn. We want to find a set E such that (i) The sets E1, E2, E3, . . . are disjoint, (ii) The union satisfies ∪nEn = S1. Start with a small set that satisfies the first property, such as E = {0}. Enlarge the set E by adding a point x ∈ S1 \ (∪nEn). Adding such point does not break the first property (proof omitted), and may help the second one. Keep adding points this way, until it is no longer possible. When it is no longer possible, it can only be so because the second property is also satisfied.2 Remark 1.10. Some people really like to emphasise that the Banach-Tarski paradox and the Vitali set depend crucially on the Axiom of Choice (for the above sketch of proof, it is concealed in the expression “keep adding until”). We may wonder what happens if we do not accept this axiom. In this case, we cannot prove the Banach-Tarski paradox, nor the existence of a Vitali set. But neither can we prove that they do not exist, so the measure problem persists. 4 1.3 Infinite numbers and infinite sums We now define the set of extended real numbers and briefly discuss some its useful properties, then discuss the meaning of infinite sums, and move on to other perhaps philosophical questions about this theory. 2See [Coh13, 1.4.9] for a complete proof. 9 1.3.1 Extended real numbers We are about to start working with measures, and because measures can be infinite, and integrals can be negative infinite, we work with the set of extended real numbers R := [−∞,+∞] that extends R by adding two symbols ±∞. The novelty is of course to conveniently allow operations and comparisons involving these symbols. Basically, we can safely operate as one would reasonably guess: −∞ < −1 < 0 < 5 < +∞, −7 +∞ = +∞, (−2)× (−∞) = +∞, | −∞| = +∞, (+∞)× (−∞) = −∞, a 6 b =⇒ a+ x 6 b+ x lim n→∞(2 + n 2) = lim n→∞ 2 + limn→∞n 2 = 2 +∞ = +∞, etc. Since we will never need to divide by infinity, let us leave x∞ = 0 undefined (otherwise we would need to check that x is finite). The non-obvious definition is 0 · ∞ = 0. In Calculus, it would have been considered an indeterminate form, but in Measure Theory it is convenient to define it this way because the integral of a function that takes value 0 on an interval of infinite length and +∞ at a few points should still be 0. That is, the area of a rectangle having zero width and infinite length is zero. Now some caveats. First, limn(anbn) = (limn an)(limn bn) may fail in case it gives 0 · ∞. Also, note that now a + b = a + c does not imply b = c. This can be false when a = ±∞. Likewise, a < b no longer implies that a + x < b + x. So we should be careful with cancellations. The one thing that is definitely not allowed, and that Measure Theory does not handle well, is “ +∞−∞ ” ! This is simply forbidden, and if we will ever write this, it will be in quotation marks and just in order to say that this case is excluded. The reader should consult [Coh13, §§B.4–B.6] and [Tao11, p. xi] for a more complete description of operations on R. 1.3.2 Infinite sums Infinite sums of numbers on [0,+∞] are always well-defined through a rather simple formula. 
If Λ is an index set and xα ∈ [0,+∞] for all α ∈ Λ, we define:∑ α∈Λ xα = sup A⊆Λ A finite ∑ α∈A xα. 10 The set Λ can be uncountable, but the sum can be finite only if Λ+ = {α : xα > 0} is countable (proof omitted). If xα ∈ [−∞,+∞], we define Λ− = {α : xα < 0} and ∑ α∈Λ xα = sup A⊆Λ+ A finite ∑ α∈A xα − sup A⊆Λ− A finite ∑ α∈A −xα, (1.11) as long as this difference does not give “+∞−∞” ! The theory of conditionally convergent sums as∑ j∈N xj = lim n n∑ j=1 xj (1.12) is hardly meaningful to us. In case the expression in (1.11) is well-defined, we can write (Λ− ∪ Λ+) = {αj}j∈N by ordering these indices in any way we want (assuming for simplicity that these sets are countable), and formula (1.12) will give the same result as (1.11). Pretty robust. However, in case (1.12) converges but (1.11) is not well-defined (so it gives “+∞−∞”), we can re-order the index set N so that (1.12) will give any number we want. This is definitely not the type of delicacy we want to handle here. For this reason, we will only use (1.12) when either xj ∈ [0,+∞] for all j, or when ∑ j |xj | <∞. So there are two overlapping cases where we can work comfortably: non-negative extended numbers, or series which are absolutely summable. 1.3.3 Two different and overlapping theories The above tradeoff is already a good prelude to something rather deep that will appear constantly in upcoming chapters. Borrowing from [Tao11]: Because of this tradeoff, we will see two overlapping types of measure and integration theory: the non-negative theory, which involves quantities taking values in [0,+∞], and the absolutely integrable theory, which involves quantities taking values in R or C. However, at the risk of leaving it for the reader to figure out some corner cases, we can (and will) extend these theories to a theory on [−∞,+∞] by doing what we just did above. Namely, whereas the absolutely integrable theory requires that both terms in (1.11) be finite, and the non-negative theory requires that one of them be zero, we only require that one of them be finite. 11 We end this chapter with the simplest example of these overlapping theories. Theorem 1.13 (Tonelli Theorem for series). Let xm,n ∈ [0,+∞] be a doubly- indexed sequence. Then ∞∑ m=1 ∞∑ n=1 xm,n = ∑ (m,n)∈N2 xm,n = ∞∑ n=1 ∞∑ m=1 xm,n. Proof. See [Tao11, Theorem 0.0.2] for a direct proof. A proof is also given in §7.3 as an application of Tonelli Theorem. Theorem 1.14 (Fubini Theorem for series). Let xm,n ∈ [−∞,+∞] be a doubly- indexed sequence. If ∞∑ m=1 ∞∑ n=1 |xm,n| <∞, then ∞∑ m=1 ∞∑ n=1 xm,n = ∑ (m,n)∈N2 xm,n = ∞∑ n=1 ∞∑ m=1 xm,n. Proof. Given in §7.3 as a application of Fubini Theorem. Example 1.15 (Failure). Consider the doubly-indexed sequence xm,n = 1 −1 0 0 0 . . . 0 1 −1 0 0 . . . 0 0 1 −1 0 . . . 0 0 0 1 −1 . . . ... ... ... ... . . . . . . which is not absolutely summable. Note that summing columns and then rows we get 1, whereas summing rows and then columns we get 0. 4 12 2 Measure Spaces The goal in this chapter is to define the notion of a measure space (Ω,F , µ) and work on examples to get familiar with the relevant concepts. Later on, we will focus our attention on two types of measures: the Lebesgue measure m and probability measures P. 2.1 Classes of sets This may be the least pleasant topic, but we need to go through it. 2.1.1 σ-algebras We start with the notion of σ-algebra on a sample space Ω. Definition 2.1 (σ-algebra). Let Ω be a sample space. 
A collection F of subsets of Ω is called a σ-algebra (or σ-field) on Ω, if (i) Ω ∈ F , (ii) A ∈ F =⇒ Ac ∈ F , (iii) A1, A2, · · · ∈ F =⇒ ⋃∞ i=1Ai ∈ F . The pair (Ω,F) is called a measurable space. The sets A ∈ F are called F-measurable sets, or measurable sets when F is clear from the context. Remark 2.2. In this case, it also satisfies (iv) ∅ ∈ F , (v) A1, A2, · · · ∈ F =⇒ ⋂∞ i=1Ai ∈ F . 4 Let us review the following examples to build our intuition. Example 2.3. For a given sample space Ω, let P(Ω) := {A : A ⊆ Ω} be the power set of Ω. Then P(Ω) is a σ-algebra on Ω. 4 Example 2.4. For a given sample space Ω, let A = {∅,Ω}. Then A defines a σ-algebra on Ω, and it is called the trivial σ-algebra. 4 Exercise 2.5. Let Ω be a set and A the class of all subsets which are either countable or whose complement is countable. Prove that A is a σ-algebra. 4 Exercise 2.6. Let A1 and A2 be σ-algebras on a given sample space Ω, and let A = A1 ∩ A2 be the class made of the sets that belong to both A1 and A2. Prove that A := A1 ∩ A2 is also a σ-algebra on Ω. 4 Exercise 2.7 (Pull-back). If f : Ω1 → Ω2 and F2 is a σ-algebra on Ω2, prove that f−1(F2) := {f−1(A) : A ∈ F2} is a σ-algebra on Ω1. 4 13 So when a function goes from a sample space Ω1 to another space Ω2, it pulls back any σ-algebra F on Ω2 to a σ-algebra G on Ω1, as in the following diagram: Ω1 f G Ω2 F f−1 OO (2.8) 2.1.2 The σ-algebra generated by a class It will often be very useful to consider the σ-algebra “spanned” by a class. Definition 2.9 (σ-algebra generated by a class of subsets). Let E be a class of subsets of Ω. We define the σ-algebra generated by E , denoted σ(E), as the unique class of subsets of Ω with the following properties: (i) σ(E) is a σ-algebra, (ii) σ(E) ⊇ E , (iii) if F ⊇ E and F is a σ-algebra, then F ⊇ σ(E). So σ(E) is the smallest σ-algebra that contains E . We now prove that σ(E) is indeed well-defined. Proof by exercise. Let {Aα}α∈Λ be a family of σ-algebras on Ω, where Λ is an arbitrary non-empty index set. Prove that ⋂ α∈ΛAα is also a σ-algebra. Let E be a class of subsets of Ω. Prove that there exists at least one σ-algebra on Ω that contains E as a subclass. Prove that σ(E) exists by considering the family of σ-algebras that contain E . Finally, prove that σ(E) is unique. Some people like to describe σ(E) as the collection of sets that can be obtained by countably many operations of taking union, intersection and complement of sets in E . Whereas this description may indeed give some comfort to those facing such bitter abstraction for the first time, we should not rely on it because it is hard to make it precise. It is better to work directly with the definition. The examples below illustrate this. Example 2.10. Let Ω be a sample space, A ⊆ Ω and E = {A}. We claim that the σ-algebra generated by E is A := {∅,Ω, A,Ac}. Indeed, A is a σ-algebra, it contains E , and any σ-algebra that contains E must contain A. 4 Example 2.11. Let E = {A1, . . . , An}, be a partition of a sample space Ω. Then the σ-algebra generated by E is A = {Ai1 ∪· · ·∪Aim : i1, . . . , im ∈ {1, . . . , n}}∪ {∅}. Indeed, A is a σ-algebra, it contains E , and any σ-algebra containing E must contain A. 4 14 Exercise 2.12. Let Ω = {1, 2, 3} and E = {{1}, {2}, {3}}. What is σ(E)? 4 Exercise 2.13. Let E = {{x} : x ∈ Ω} be the class of singletons. What is σ(E)? 4 Proposition 2.14. Suppose E1 and E2 are classes of subsets of Ω and E1 ⊆ σ(E2). Prove that σ(E1) ⊆ σ(E2). Exercise 2.15. Prove the above proposition. 
4 2.1.3 The Borel sets on R and R See page 5 for a very brief recall of basic facts from Metric Spaces. When working on R or R, the most important σ-algebra is the following. Definition 2.16 (Borel sets on R). For the space Ω = R, the Borel σ-algebra B(R) is the σ-algebra generated by open sets. It is also the smallest σ-algebra that contains all the open intervals, or all the closed intervals. For the extended line, we define B(R) := {A ⊆ R : A ∩ R ∈ B(R)}, that is, we allow adding the points ±∞ to sets in B(R). When it is clear that Ω = R or R, we may write B instead of B(R) or B(R). Below we list more classes of sets, all of which generate B(R). Any “reasonable” set is a Borel set, and it is actually hard to construct one which is not, such as the Vitali set in Example 1.9. Nevertheless, we should keep in mind that all the theory works on σ-algebras, and B is the one we are most interested in. Denote by E1, . . . , E6 the following classes of subsets of R, respectively: 1. Closed subsets of R. 2. Open subsets of R. 3. Intervals of the form (a, b), a, b ∈ [−∞,+∞]. 4. Finite intervals of the form (a, b], a, b ∈ R. 5. Semi-infinite intervals of the form (−∞, b], b ∈ R. 6. Semi-infinite intervals of the form (−∞, b), b ∈ R. One can show that σ(E1) = · · · = σ(E6) as follows: σ(E2) ⊆ σ(E1) ⊆ σ(E2) ⊆ σ(E3) ⊆ σ(E4) ⊆ σ(E5) ⊆ σ(E6) ⊆ σ(E2). We prove two inclusions and leave five as exercise. Let A ∈ E2. Then A is the countable union of disjoint open intervals. So there exist {In}n∈N ⊆ E3 ⊆ σ(E3) such that A = ∪nIn. Since σ(E3) is a σ-algebra, it is closed under countable unions, and A ∈ σ(E3). Thus, E2 ⊆ σ(E3). By Proposition 2.14, σ(E2) ⊆ σ(E3). 15 Let A ∈ E5. Then A = (−∞, b] for some b ∈ R. Consider the sequence of sets An = (−∞, b + 1n ) in E6. Since σ(E6) ⊇ E6 is a σ-algebra, it is closed under countable intersections, and A = ∩nAn ∈ σ(E6). Thus, E5 ⊆ σ(E6). By Proposition 2.14, σ(E5) ⊆ σ(E6). Exercise 2.17. Show that each of the classes below generate B(R): (a) {[−∞, a)}a∈R, (b) {[−∞, a]}a∈R, (c) {[a,+∞]}a∈R, (d) {(a,+∞]}a∈R. 4 2.1.4 Algebras On a few occasions, we will use a class with less structure than σ-algebras. Definition 2.18 (Algebra). Let Ω be a sample space. A class A of subsets of Ω is called an algebra on Ω, if (i) Ω ∈ A, (ii) A ∈ A =⇒ Ac ∈ A, (iii) A,B ∈ A =⇒ A ∪B ∈ A. Remark 2.19. In this case, it also satisfies: (iv) ∅ ∈ A, (v) A,B ∈ A =⇒ A ∩B ∈ A. In other words, an algebra on Ω is a class of subsets of Ω, which contains Ω and is stable under finitely many set operations. 4 Example 2.20. Every σ-algebra is an algebra. 4 Remark 2.21. If Ω is a sample space and F1 ⊆ F2 ⊆ F3 ⊆ · · · are σ-algebras, then A := ∪nFn may not be a σ-algebra, but it is still an algebra. 4 Exercise 2.22. Prove that A in the previous remark is indeed an algebra. 4 Example 2.23. Let Ω = R and E = {(a, b] : −∞ 6 a < b < +∞} ∪ {(a,+∞) : a ∈ R}. Consider the class A = {I1 ∪ · · · ∪ In : Ik ∈ E for k = 1, . . . , n, Ik ∩ Ij = ∅ for j 6= k} ∪ {R, ∅}. Then A is an algebra. 4 Exercise 2.24. Prove that A in the previous example is an algebra. Suggestion: show that E ∪{∅} is closed under intersections, use this to show that A is closed under intersections, also show that A is closed under complement, and conclude that A is closed under unions. 4 16 2.2 Measures Definition 2.25 (Measure and σ-finite measure). Let (Ω,F) be a given measurable space. A function µ : F → [0,+∞] is called a measure if µ(∅) = 0 and µ ( ∞⋃ n=1 An ) = ∞∑ n=1 µ(An) for every sequence {An}∞n=1 ⊆ F of disjoint sets. 
In this case, the triple (Ω,F , µ) is called a measure space. We say that µ is finite if µ(Ω) < ∞ and σ-finite if there exist measurable sets (An)n∈N such that µ(An) <∞ for all n and ∪n∈NAn = Ω. Let us start with some simple but important examples. Example 2.26 (Dirac mass measure). Let Ω be a set, F = P(Ω) and x ∈ Ω. The measure δx is defined as δx(A) = 1A(x) for all A ⊆ Ω. This is a finite measure. It is also a probability measure. 4 Remark 2.27. Even though an indicator function and a Dirac mass function perform a similar test, they are different objects. The indicator function of a given A ⊆ Ω is a function 1A : Ω → {0, 1}, whereas the Dirac measure on a given x ∈ Ω is a map δx : P(Ω)→ {0, 1}. In other words, 1A checks whether a point lies in A, whereas δx checks whether a set contains x. 4 Example 2.28 (Counting measure). Let Ω be a set, F = P(Ω), and µ = ∑x∈Ω δx. So µ(A) = ∑ x∈Ω δx(A) counts how many elements A has. If Ω is a finite set, this is a finite measure, and if Ω is countable then this is a σ-finite measure. 4 Example 2.29. Let Ω = N, F = P(Ω), and µ(A) = ∑k∈A k−1. Then µ is a σ-finite measure (why?). 4 Exercise 2.30. Prove that the functions described in previous examples are indeed measures. 4 Exercise 2.31. Let µ1, µ2, . . . be a sequence of measures on a measurable space (Ω,F), and let x1, x2, · · · ∈ [0,+∞]. Show that µ = ∑ n xnµn is a measure on (Ω,F). 4 Example 2.32 (Coarsening). If (Ω,F , µ) is a measure space, and G ⊆ F is also a σ-algebra, then (Ω,G, µ) is also a measure space (why?). 4 Example 2.33 (Restriction). If (Ω,F , µ) is a measure space and A ∈ F , then the restriction of µ to A, denoted µ|A and given by µ|A(B) = µ(A ∩ B) is also a measure (why?). 4 17 Proposition 2.34 (Properties of a measure). Let (Ω,F , µ) be a measure space. Then, we have the following properties: (i) Countable sub-additivity: If {Ai}∞i=1 ⊆ F , then µ( ⋃∞ i=1Ai) 6 ∑∞ i=1 µ(Ai). (ii) Continuity from below: If {Ai}∞i=1 ⊆ F , Ak ⊆ Ak+1,∀k ∈ N, then µ( ⋃∞ i=1Ai) := µ( limn→∞An) = limn→∞µ(An). (iii) Continuity from above: If {Ai}∞i=1 ⊆ F , Ak ⊇ Ak+1,∀k ∈ N, and µ(Aj) < ∞ for some j ∈ N, then µ(⋂∞i=1Ai) := µ( limn→∞An) = limn→∞µ(An). Proof. For sub-additivity, set Bk := Ak \ ( ⋃k−1 i=1 Ai),∀k ∈ N, so that ⋃∞ i=1Ai =⋃∞ i=1Bi, Bk are disjoint and Bk ⊆ Ak,∀k ∈ N. It follows that, µ ( ∞⋃ i=1 Ai ) = µ ( ∞⋃ i=1 Bi ) = ∞∑ i=1 µ(Bi) 6 ∞∑ i=1 µ(Ai). For continuity from below, let A0 = ∅ and Bk := Ak \ Ak−1,∀k ∈ N, so that⋃∞ i=1Ai = ⋃∞ i=1Bi and Bk are disjoint. It follows that, µ ( ∞⋃ i=1 Ai ) = µ ( ∞⋃ i=1 Bi ) = ∞∑ i=1 µ(Bi) = ∞∑ i=1 µ(Ai \Ai−1) = lim n→∞ ( n∑ i=1 µ(Ai \Ai−1) ) = lim n→∞µ(An). For continuity from above, let Bk := Aj \ Aj+k,∀k ∈ N, use continuity from below, and subtract both sides from µ(Aj). This concludes the proof. Remark 2.35. Continuity from above may be false without the assumption that µ(Aj) is finite for some j. For instance, if µ is the counting measure on N and An = {n, n+ 1, n+ 2, . . . }. Then An ↓ ∅ =: A. Since µ(An) =∞ and µ(A) = 0, we have µ(An) 6→ µ(A) even though An ↓ A. 4 We now state the most important measure in the theory. Theorem 2.36 (Lebesgue Measure). There exists a unique measure m on (R,B) such that m([a, b]) = b− a for all a < b ∈ R. This measure m is called the Lebesgue Measure on R. Proof. Existence will be proved in §3.1 and uniqueness in §2.3. 18 Example 2.37 (Countable sets). Any countable set has Lebesgue measure 0. Although Q is present everywhere in the sense that it intersects every tiny open interval, its length is actually zero. 
4 Example 2.38 (Restriction). Given real numbers a < b, the function µ given by µ(A) = m(A ∩ [a, b]) is a measure on (R,B). 4 Definition 2.39 (Probability). A measure µ on a measurable space (Ω,F) is called a probability measure if µ(Ω) = 1. The triple (Ω,F , µ) is called a probability space. Probability measures are often denoted by P , P or P instead of µ. Example 2.40 (Equiprobable outcomes). Let n ∈ N and let Ω be a set with n elements. Take F = P(Ω) and P = 1n ∑ ω∈Ω δω. Then (Ω,F ,P) is the model of equiprobable outcomes. 4 Example 2.41. Let n ∈ N, Ω = {ω1, . . . , ωn}, and E = {A1, . . . , An}, where Ai = {ωi}, i = 1, . . . , n. Set P(Ai) = P({ωi}) := pi > 0, i = 1, . . . , n, such that∑n i=1 pi = 1. We can extend the definition of P to P(Ω) by finite additivity. Then, we can check that P defines a probability measure on (Ω,P(Ω)). 4 Exercise 2.42. Let P1,P2, . . . be a sequence of probability measures on a measurable space (Ω,F), and let p1, p2, · · · ∈ [0, 1] be such that ∑ n pn = 1 Show that P = ∑ n pnPn is a probability measure on (Ω,F). Give an interpretation to P and (pn)n. 4 2.3 Uniqueness and Dynkin’s pi-λ Theorem In this section we will study Dynkin’s pi-λ Theorem and use it to prove the following. Theorem 2.43 (Uniqueness of measures). Let µ and ν be measures on a measurable space (Ω,F), and let C ⊆ F be a class of subsets of Ω closed under intersection. Suppose that µ(An) = ν(An) < ∞ for some sequence An ↑ Ω of sets in C. If µ = ν on C, then µ = ν on σ(C). Of course, the uniqueness part of Theorem 2.36 is a particular case of the above theorem. For that, we take C = {[a, b] : a, b ∈ R}, since σ(C) = B(R) and note that [−n, n] ↑ R are in C and have finite measure. Before introducing the concepts used in its proof, let us discuss where they arise from. Suppose we want to check that two finite measures µ and ν on the 19 measurable space (Ω,F) coincide on a large class of sets, such as F . Suppose also that µ(Ω) = ν(Ω). On how many sets A do we need to test that µ(A) = ν(A)? Testing all A ∈ F is a bit overkilling. Having tested for a disjoint sequence (An)n, it will hold for Ac1 as well as ∪nAn by elementary properties of finite measures. However, having tested for given sets A,B ∈ F , there is no way to infer from the basic properties of measures alone that these two measures coincide on A ∩B. So the idea is to consider the class D = {A ∈ F : µ(A) = ν(A)} (2.44) and show that it is large. This class already has some structure inherited from the way finite measures treat disjoint unions, but that structure is still missing a small bit, namely being closed under intersections. Definition 2.45 (pi-system). A class C of sets is called a pi-system if it is closed under intersections, that is, A ∩B ∈ C for all A,B ∈ C. Good examples of pi-systems are the classes listed in §2.1.3, if we add ∅ to them. Definition 2.46 (λ-system). Let Ω be a sample space. A class D of subsets of Ω is a λ-system on Ω, if Ω ∈ D and it is closed under complement and countable disjoint unions, that is, (i) Ω ∈ D, (ii) A ∈ D =⇒ Ac ∈ D, (iii) {Ai}∞i=1 ⊆ D, Ai ∩Aj = ∅ ∀i 6= j =⇒ ⋃∞ i=1Ai ∈ D. These concepts are useful when the (large and complicated) class of sets satisfying a given property will naturally form a λ-system, and at the same time we can find a (smaller and simpler) subclass that forms a pi-system and generates a large σ-algebra. This is the case for Theorem 2.43. Proposition 2.47. 
If µ and ν are measures on a measurable space (Ω,F) such that µ(Ω) = ν(Ω) <∞, then the class D defined in (2.44) is a λ-system. Proof. Suppose A ∈ D. Then µ(Ac) = µ(Ω) − µ(A) = ν(Ω) − ν(A) = ν(Ac), so Ac ∈ D. Now consider a countable family {An}n ⊆ D of disjoint sets. Then µ(∪nAn) = ∑ n µ(An) = ∑ n ν(An) = ν(∪nAn), so ∪nAn ∈ D. We now have enough motivation for the following celebrated tool. Theorem 2.48 (Dynkin’s pi-λ Theorem). Let Ω be a sample space. Suppose C is a pi-system of subsets of Ω and D is a λ-system on Ω. If D contains C, then D contains σ(C). 20 Proof. See [Coh13, 1.6.2]. Proof of Theorem 2.43. Assume for a moment that µ(Ω) = ν(Ω) < ∞. Define the class D as in (2.44). By Proposition 2.47, D is a λ-system. By the pi-λ Theorem, σ(C) ⊆ D, which means exactly that µ = ν on σ(C) as claimed. Now drop the previous assumption, and instead assume that µ(An) = ν(An) < ∞ for some sequence An ↑ Ω of sets in C. For each n ∈ N, define the restriction measures µn and νn by µn(A) = µ(A ∩ An) and νn(A) = ν(A ∩ An). With this definition, µn(Ω) = µ(An) = ν(An) = ν(Ω) < ∞. Note also that, for each C ∈ C, we have C ∩ An ∈ C because C is a pi-system. Hence, µn(C) = µ(C ∩An) = ν(C ∩An) = νn(C) for every C ∈ C. So, for each n, the measures µn and νn are in the previous case, and, for every A ∈ σ(C), we have µn(A) = νn(A). Using continuity from below, µ(A) = lim n µ(A ∩An) = lim n µn(A) = lim n νn(A) = lim n µ(A ∩An) = νn(A), which concludes the proof.3 In §3.2, we will use the pi-λ Theorem to prove that the probability of any event can be approximated by an event in a generating algebra. Another useful application of the pi-λ Theorem is to study independent random variables. Suppose we know that the vector (X,Y ) satisfies P ( (X,Y ) ∈ (−∞, r]× (−∞, t]) = P(X ∈ (−∞, r])P(Y ∈ (−∞, t]) for all r, t ∈ R. Does it imply that P ( (X,Y ) ∈ A×B) = P(X ∈ A)P(Y ∈ B) for every A,B ∈ B(R)? The answer is yes, and the proof is a little more ingenious than the one we just showed. It uses twice the fact that {(−∞, b] : b ∈ R} is a pi-system that generates B(R). We leave it as a teaser for now. 3Expanded from [Coh13, 1.6.4]. 21 3 Existence and approximation In this chapter we introduce a fundamental tool to prove existence of measure spaces starting from the specification on an algebra. We give an example of a finitely additive measure that cannot be extended to a true measure, and state a sufficient condition for such extension to exist. We then discuss how this tool is used to construct the Lebesgue measure on R as well as Lebesgue-Stieltjes measures which include all probability measures. We then give two different proofs of the approximation theorem, which says that events on probability spaces can be approximated by events on a generating algebra. We conclude by discussing the Lebesgue measure on Rd. 3.1 Carathéodory Extension Theorem In this section, we introduce the key theorem that allows us to extend a “measure” µ from an algebra to a larger σ-algebra. Given a sample space Ω and an algebra A, we say that a function µ : A → [0,∞] is a finitely additive measure if it is non-negative and finitely additive, that is, (i) µ(∅) = 0 (ii) µ(A ∪B) = µ(A) + µ(B) for all disjoint A,B ∈ A. Note that finitely additive measures are not true measures. The theory of finitely additive measures is rather limited, as one cannot take limits there. Example 3.1 (Uniform “measure” on N). Suppose we pick a large natural number at random. The chances that this number is odd should be 12 . 
The chances that a year is a leap year should be 97400 . This notion of chance is finitely additive. For instance, the chances that a year is either odd or a leap year would be 297400 , and the chance that it is not a leap year would be 303 400 . We could even talk about independent events. For instance, this random number being a multiple of 5 and a multiple of 6 would be independent, whereas a year being multiple of 5 and leap would not. This “frequentist” notion would be µ(A) = lim n→∞ #(A ∩ [1, n]) n , and that this µ can be extended4 to a finitely additive measure defined on all P(N), including sets A for which the limit does not exist. However, taking An = {n, n + 1, n + 2, . . . }, we have µ(An) = 1 for every n in spite of An ↓ ∅ and µ(∅) = 0. Not really the type of measure that we want to study here. 4 Fortunately, asking µ to be “σ-additive whenever possible” is enough to fix this. Definition 3.2 (Pre-measure). Given a sample space Ω and an algebra A, we say that a finitely additive measure µ : A → [0,∞] is a pre-measure if it is 4This requires the Hahn-Banach Theorem which is a topic of Functional Analysis. 22 σ-additive, that is, µ( ⋃ nAn) = ∑ n µ(An) for every sequence of disjoint sets An ∈ A whose union A happens to be in A. We now state a fundamental theorem of Measure Theory and Probability. Theorem 3.3 (Carathéodory Extension Theorem). Let A be an algebra on Ω. If µ is a pre-measure on (Ω,A), then µ extends to a measure on (Ω, σ(A)). Moreover, if the extension µ is σ-finite, then it is unique. Before outlining its proof, let us consider its main application. 3.1.1 Lebesgue measure on R Here we prove the existence part in Theorem 2.36. Fix Ω = R and E = {(a, b] ∩ R : −∞ 6 a < b 6 +∞} as in Example 2.23. We already know that the class A of sets A given by R, ∅ and finite unions of disjoint left open intervals is an algebra. Set m(I) = +∞ for infinite intervals I, and m((a, b]) := b− a for finite intervals (a, b]. We can extend the definition of m to A by m(A) = n∑ j=1 m(Ij), (3.4) where A = I1 ∪ · · · ∪ In and Ik ∈ E are disjoint. Note that the above formula is well-defined even if I can be written in many different ways as such a finite disjoint union (for instance, if I1 = (0, 1], I2 = (1, 2], it gives m((0, 2]) = m((0, 1]) +m((1, 2]) = 2). We omit the straightforward proof of this fact. Lemma 3.5. The set function m : A → [0,∞] is finitely additive. Proof. Immediate from the fact that (3.4) is well-defined. What is not so easy to prove is that m is σ-additive. For instance, take Ak = ( 1k+1 , 1 k ] ∈ E . Note that the Ak’s are disjoint and ⋃ k Ak = (0, 1], which happens to be in A. In this case, ∞∑ j=1 m(Aj) = ∞∑ j=1 ( 1 j − 1 j + 1 ) = 1 = m((0, 1]). To conclude the proof of Theorem 2.36 from Theorem 3.3, one needs to show that this property holds in general. 23 Lemma 3.6. The set function m : A → [0,∞] is a pre-measure. Proof. See [Dem20, Lemma 1.1.31]. Example 3.6. If we take A ⊆ [0, 1) as a Vitali set, then A 6∈ B(R). We sketch the proof. This is just expanding part of the sketch in Example 1.9. If A were in B(R), we could define m(A) > 0. Then by translation invariance, for each rn ∈ Q ∩ [0, 1), defining An = {x + rn mod 1 : x ∈ A}, we would have m(An) = m(A). Moreover, these sets (An)n are mutually disjoint and ∪nAn = [0, 1). So m([0, 1)) = ∑ n∈Nm(An) = ∑ n∈Nm(A) which is either 0 or ∞ depending on whether m(A) > 0 or not. This contradicts m([0, 1)) = 1. 4 3.1.2 Lebesgue-Stieltjes measures Let F : R → R be a right-continuous non-decreasing function. 
Consider the sample space Ω = R and the class E = {(a, b] : −∞ < a < b < +∞}. Define the set function mF : E → [0,∞) by mF ((a, b]) = F (b)− F (a). We can define mF on semi-infinite intervals by taking limits as a → −∞ or b → +∞. Consider the algebra A defined in Example 2.23 and revisited in §3.1.1. We can extend mF to A in a unique (and obvious) way so that it is finitely additive. It can be shown that mF is, in fact, a pre-measure on A. Again, by Carathéodory Extension Theorem, there exists a measure mF on (R,B) that gives the right values on E . Moreover, since E is a pi-system which contains a sequence (−n, n] ↑ R of sets with finite measure, this measure is unique. Such measure is called the Lebesgue-Stieltjes measure induced by the function F on the space R. Remark 3.7. If F (x) = x, then mF reduces to the Lebesgue measure on R. 4 Remark 3.8. We say that F is a distribution function if it is right-continuous, non-decreasing, limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. In this case, mF is a probability measure on R. Conversely, if P is a probability measure on R, then F (x) = P({ω : ω 6 b}) is a distribution function. So these two types of objects are equivalent. 4 3.1.3 Outer measures and measurable sets Here we outline the proof of Theorem 3.3. Given a pre-measure µ on (Ω,A), we define the outer measure µ∗ : P(Ω) → [0,∞] as follows. For each subset E of Ω, consider the collection DE of all sequences (An)n of sets in A that cover E: DE = { (An)n : An ∈ A for all n and E ⊆ ⋃ n An } . The outer measure µ∗ is defined by µ∗(E) = inf (An)∈DE ∑ n µ(An). (3.9) 24 One might think of the outer measure as a conservative upper bound for what should be the measure of a set seen from the outside. Loosely speaking, problematic sets such as those of Examples 1.9 and 1.8 look bigger from the outside than form the inside. It turns out, the way to exclude sets like this is to consider sets E ⊆ Ω such that, for every F ⊆ Ω, µ∗(F ) = µ∗(F ∩ E) + µ∗(F ∩ Ec). The sets E satisfying this condition are called µ∗-measurable, and the collection of such sets is denoted by F∗. Lemma 3.10 (Properties of the outer measure). The outer measure µ∗ satisfies: (i) A ∈ A =⇒ µ∗(A) = µ(A), (ii) A ⊆ B =⇒ µ∗(A) 6 µ∗(B), (iii) Let {Ai}∞i=1 ⊆ A, then µ∗( ⋃∞ i=1Ai) 6 ∑∞ i=1 µ ∗(Ai). Proof. See [Kub15, Proposition 8.3]. The last property above implies that µ∗(F ) 6 µ∗(F ∩ E) + µ∗(F ∩ Ec), so problematic sets are those where inequality is strict. For the Vitali set shown in Example 1.9, we have µ∗(E) + µ∗([0, 1) \ E) > 1, which one might interpret as E being larger seen from the outside than from the inside. Lemma 3.11 (Carathéodory). The class F∗ of µ∗-measurable sets is a σ- algebra on Ω which contains A, and the restriction of µ∗ to F∗ is a measure. Proof. See [Kub15, Theorem 8.4]. From the two lemmas above, we have that A ⊆ F∗, hence σ(A) ⊆ σ(F∗) = F∗. Moreover, µ∗, restricted to σ(A), is σ-additive, hence it is a measure. Furthermore, µ∗, restricted to A, coincides with µ. This proves the existence part of Theorem 3.3. For uniqueness, suppose µ∗ is σ-finite, so there exist a countable collection {An}n in σ(A) such that µ∗(An) < ∞ for all n, and ∪nAn = Ω. By (3.9), for each n there is a countable family {An,k}k such that ∪kAn,k ⊇ An and µ∗(An,k) < ∞. Re-indexing the doubly-indexed countable family {An,k}n,k as {Bj}j∈N and defining C` = B1 ∪ · · · ∪ B`, we have C` ∈ A, µ(C`) < ∞ and C` ↑ Ω. 
Noting that A is a pi-system, it follows from Theorem 2.43 that µ∗ is the unique measure on σ(A) that agrees with µ on A, concluding the proof of Theorem 3.3. 3.2 Approximation by events in a generating algebra The symmetric difference A4B is a way to measure how much two events A and B are different. It is defined by A4B := (A \B) ∪ (B \A). 25 The symmetric satisfies an analogue of the triangle inequality: A4C ⊆ (A4B)∪ (B4C). It also satisfies Ac4Bc = A4B. Moreover, (A ∪ B)4(C ∪ D) ⊆ (A4C) ∪ (B4D). (Why?) Theorem 3.12. Let (Ω,F ,P) be a probability space, and A ⊆ F be an algebra. For every B ∈ σ(A) and ε > 0, there exists A ∈ A such that P(A4B) < ε. Remark 3.13. We can generalise Theorem 3.12 to any measure space (Ω,F , µ), as long as µ is σ-finite and µ(B) <∞. 4 We give two different proofs. Proof using outer measure. Fix B ∈ σ(A). By construction of the outer measure P∗ from (3.9) and using Lemma 3.11, we have P(B) = P∗(B) = inf (Dn)∈DB ∞∑ n=1 P(Dn). Hence, for a given ε > 0, we can choose a sequence of sets {Dn}∞n=1 in A with B ⊆ ∞⋃ n=1 Dn and ∞∑ n=1 P(Dn) < P∗(B) + ε 2 , where we also used the fact that P(B) is finite. Now take k ∈ N such that P( ∞⋃ n=k+1 Dn) 6 ∞∑ n=k+1 P(Dn) < ε 2 , and set A := k⋃ n=1 Dn ∈ A. Then, we get B \A ⊆ ∞⋃ n=k+1 Dn and A \B ⊆ ( ∞⋃ n=1 Dn ) \B, which consequently imply that P(B \A) < ε2 and P(A \B) < ε 2 , respectively. Thus, we can deduce that P(A4B) < ε, which is precisely what we wanted to show.5 5Based on [Tay73, Theorem 4.4]. 26 Alternatively, Theorem 3.12 can be proved using the pi-λ Theorem. Proof using the pi-λ Theorem. Consider the collection G given by G = {B ∈ σ(A) : ∀ε > 0, ∃A ∈ A such that P(A4B) < ε}. Note that G contains A. Note also that, for being an algebra, A is also a pi- system. Thus, it suffices to show that G is a λ-system on Ω, since we can then apply the pi-λ Theorem to deduce that G = σ(A). Indeed, we have (i) Ω ∈ G, since P(Ω4Ω) = 0. (ii) Let B ∈ G. Then, for each ε > 0, there is A ∈ A such that P(A4B) < ε. By noticing that Ac4Bc = A4B, where Ac ∈ A, we can deduce that Bc ∈ G. (iii) Let B1, B2, · · · ∈ G, where Bj are disjoint. Then, for each j ∈ N and ε > 0, there is Aj ∈ A such that P(Aj4Bj) < ε2j+1 . By observing that ( k⋃ i=1 Ai)4( k⋃ i=1 Bi) ⊆ k⋃ i=1 (Ai4Bi). For each k ∈ N, we can apply finite sub-additivity to get P(( k⋃ i=1 Ai)4( k⋃ i=1 Bi)) < ε 2 . Now, since P is a finite measure, take K ∈ N such that P( ∞⋃ i=K+1 Bi) = ∞∑ i=K+1 P(Bi) < ε 2 , and set A := ⋃K i=1Ai ∈ A. By noticing that A4( ∞⋃ i=1 Bi) ⊆ K⋃ i=1 (Ai4Bi) ∪ ( ∞⋃ i=K+1 Bi), we can apply finite sub-additivity to conclude that ⋃∞ i=1Bi ∈ G. and so we are done.6 6Based on [Edg98, 1.1.9] 27 b2 a2 a1 b1 b3 a3b2 a2 a1 b1 Figure 3.1: Elements of E for d = 2 and d = 3. 3.3 Lebesgue measure on Rd Consider the measurable space (Rd,B(Rd)), where B(Rd) denotes the Borel σ- algebra on Rd, d > 1. More precisely, we define B(Rd) := σ(E), where E = {(a, b] : −∞ < ai < bi < +∞, i = 1, . . . , d} and (a, b] = (a1, b1]× . . . (ad, bd] ⊆ Rd. Figure 3.1 illustrates the structure of elements in E for higher dimensions, which take the shape of rectangles and cuboids for d = 2 and d = 3. It can be checked that the family A := {I1 ∪ · · · ∪ In : n ∈ N0, Ik ∈ E for k = 1, . . . , n} is an algebra of sets on Rd. Now, for any (a, b] ∈ E , we define the set function m ( (a, b] ) := (b1 − a1) · · · (bd − ad) as the d-dimensional volume of the cuboid (a, b]. As in §3.1.1, it is possible to prove that m defines a pre-measure on A. 
Hence, by Carathéodory Extension Theorem, m extends to a measure on (Rd,B(Rd)). The measure m is called the Lebesgue measure on Rd, which is the analogue to the 1-dimensional case. Also, E is a pi-system on Rd that generates B(Rd), from which one can also show uniqueness of m. Example 3.14. Take a point set {x} = {(x1, . . . , xd)} ∈ B(Rd). Indeed, as one might expect, the Lebesgue measure m on (Rd,B(Rd)) assigns value of 0 for {x}. To show this rigorously, consider Ak = {y ∈ Rd : xi − 1k 6 yi 6 xi, i = 1, . . . , d} ∈ B(Rd), where Ak ⊇ Ak+1,∀k ∈ N. Then, one see that {x} = ⋂∞i=1Ai. Now, by continuity from above, it follows that, m({x}) = m ( ∞⋂ i=1 Ai ) = lim n→∞m(An) = limn→∞ ( 1 n )d = 0. 28 x1,0 x1,1 x2,0 x2,2x2,1 xn,0 xn,k xn,n Figure 3.2: Covering L by smaller boxes as Ln. 4 Remark 3.15. As in the above example, monotone-convergence properties of a measure are quite helpful when it comes to computing measures of sets. 4 Remark 3.16. If a set in B(Rd) is countable, then it has Lebesgue measure 0. Equivalently, if a set in B(Rd) has strictly positive Lebesgue measure, then it is uncountable. 4 Example 3.17. For d > 2, we look at a line segment L = {x ∈ Rd : x = rα+β, r ∈ [0, 1]}, where α, β ∈ Rd are fixed. Set points on L by, xn,k = knα+β, so as to consider Ln = ⋃n k=1(xn,k−1, xn,k] ∈ B(Rd), where Ln ⊇ Ln+1,∀n ∈ N. The idea is to iteratively split the line segment L into smaller segments and cover each such segment by via small squares Ln, which is captured in Figure 3.2. Now, we can see that, L = ⋂∞ i=1 Li, so applying continuity from above yields, m(L) = m ( ∞⋂ i=1 Li ) = lim n→∞m(Ln) = 0. 4 Remark 3.18. Notice that, L lies in a subspace of Rd. In general, if A ∈ B(Rd) lies in a subspace of Rd, then m(A) = 0. 4 29 4 Measurable functions and random variables This chapter intends to introduce and construct the concept of measurable functions, through a number of important properties and theorems. 4.1 Definition and properties As its name suggests, a measurable function between two measurable spaces preserves the measurability properties of sets. Definition 4.1 (Measurable function). Let (Ω1,F1) and (Ω2,F2) be two measurable spaces. A function f : Ω1 → Ω2 is called F1/F2-measurable if the pre-image of any F2-measurable set is F1-measurable. That is, f−1(A) ∈ F1, ∀A ∈ F2, where f−1(A) := {ω1 ∈ Ω1 : f(ω1) ∈ A} is the pre-image of f . When the σ- algebras F1 and F2 are clear from context, we often drop ‘F1/F2-measurable’ to just ‘measurable’ or ‘F1-measurable’. Example 4.2. Let (Ω,F) be a measurable space and A ∈ F . We show that the indicator function 1A : Ω→ {0, 1} given by 1A(ω) = { 1, ω ∈ A 0, ω /∈ A is F/P({0, 1})-measurable. Indeed, we have that • 1−1A (∅) = ∅ ∈ F , • 1−1A ({0}) = Ac ∈ F , • 1−1A ({1}) = A ∈ F , • 1−1A ({0, 1}) = 1−1A ({0}) ∪ 1−1A ({1}) = Ac ∪A = Ω ∈ F , so 1A is indeed measurable. 4 Naturally, one can expect from Definition 4.1 that a composition of measurable functions is measurable in any general measurable spaces. Lemma 4.3 (Composition of measurable functions). Let (Ω1,F1), (Ω2,F2) and (Ω3,F3) be measurable spaces and f : Ω1 → Ω2, g : Ω2 → Ω3 be measurable functions. Then, the composition g ◦ f is (F1/F3-)measurable. Proof. Since g is measurable, then for any A ∈ F3, we have g−1(A) ∈ F2. Now, since f is measurable, then f−1(g−1(A)) ∈ F1. But, this implies that (g ◦ f)−1(A) = f−1(g−1(A)) ∈ F1, ∀A ∈ F3, 30 so that the composition g ◦ f is indeed measurable. 
Given a target measurable space (Ω2,F2) and a function f : Ω1 → Ω2, we can construct a σ-algebra on Ω1 induced by f . Definition 4.4 (σ-algebra generated by a function f). Let (Ω2,F2) be a measurable space and f : Ω1 → Ω2 a function, where Ω1 is a sample space. The σ-algebra generated by f on Ω1 is defined as σ(f) := {f−1(A) : A ∈ F2}, From Exercise 2.7, σ(f) is indeed a σ-algebra. See the diagram below: Ω1 f σ(f) Ω2 F2 f−1 OO Note that the notation σ(f) does not make explicit its dependence on F2. Remark 4.5. It follows from the definition of measurability that σ(f) is the smallest σ-algebra on Ω1 for which f is measurable. That is, for any σ-algebra G on Ω1, the function f is G/F2-measurable if and only if G ⊇ σ(f). The latter condition is interpreted as G being finer than σ(f). 4 Definition 4.6. If f is a measurable function from a measure space (Ω1,F1, µ) to (Ω2,F2), we can define the push-forward measure f∗µ on (Ω2,F2) by (f∗µ)(A) = µ({ω ∈ Ω1 : f(ω) ∈ A}) = µ(f−1(A)). This gives an enlarged diagram as follows: Ω1 f σ(f) µ ◦f−1 Ω2 F2 f−1 OO f∗µ (4.7) Quite often, it is unhandy to check measurability using the whole target space, in which case the following lemma comes in our aid. Lemma 4.8 (Measurability criterion). Let (Ω1,F1) and (Ω2,F2) be measurable spaces. Suppose F2 = σ(E) for some class E of subsets of Ω2. Then, f : Ω1 → Ω2 is a measurable function, if and only if f−1(A) ∈ F1 for all A ∈ E. Proof. The first direction follows from Definition 4.1. Since any A ∈ E implies that A ∈ F2, then measurability of f gives f−1(A) ∈ F1. Conversely, suppose 31 that f−1(A) ∈ F1, for each A ∈ E . To prove that f is measurable, it suffices to show that the collection G given by G = {A ⊆ Ω2 : f−1(A) ∈ F1} is a σ-algebra on Ω2, where E ⊂ G, so that F2 = σ(E) ⊆ G. Indeed, we observe that (i) First, Ω2 ∈ G, since f−1(Ω2) = Ω1 ∈ F1. (ii) Let A ∈ G, so f−1(A) ∈ F1. Since f has domain Ω1 and codomain Ω2, we have f−1(Ω2 \ A) = Ω1 \ f−1(A). Since F1 is closed under complements, we have Ω1 \ f−1(A) ∈ F1, hence Ac ∈ G. (iii) Let A1, A2, · · · ∈ G. Then, f−1(Ak) ∈ F1, for each k ∈ N. This in turn implies that f−1( ⋃∞ i=1Ai) = ⋃∞ i=1 f −1(Ai) ∈ F1. Thus, ⋃∞ i=1Ai ∈ G. and so G defines a σ-algebra on Ω2. Lemma 4.9 (Continuous functions). Let Ω1 and Ω2 be metric spaces. Take B1 and B2, respectively, as the Borel σ-algebras, that is, those generated by the families of open sets. If f : Ω1 → Ω2 is continuous, then it is measurable. Proof. Denote by O the class of open sets in Ω2. Let B ∈ O. Since f is continuous, f−1(B) is an open set in Ω1, hence f−1(B) ∈ B1. Since B2 = σ(O), by Lemma 4.8, f is measurable. 4.2 Extended Borel functions and random variables We now discuss extended Borel functions and extended random variables. 4.2.1 Extended Borel functions Definition 4.10 (Extended Borel function). Let (Ω,F) be a measurable space. If f : (Ω,F) → (R,B) is a measurable function, then it is called a Borel function. Likewise, a measurable function taking values on (R,B) is called an extended Borel function. Remark 4.11. One can think of a Borel function as simply an extended Borel function that never takes infinite values. So properties of extended Borel functions also hold for Borel functions. 4 Recall from §2.1.3 the several classes of subsets of R that generate B(R) and hence B(R). We can then apply Lemma 4.8 above to conclude that measurability suffices to hold for each such class on R. Lemma 4.12 (Equivalent criteria). Let (Ω,F) be a measurable space and f : Ω→ R. 
The following are equivalent: 32 (a) f is an extended Borel function, (b) {f < a} ∈ F , ∀a ∈ R, (c) {f 6 a} ∈ F , ∀a ∈ R, (d) {f > a} ∈ F , ∀a ∈ R, (e) {f > a} ∈ F , ∀a ∈ R. Proof. By Exercise 2.17, each one of the classes {f < a}a∈R, {f 6 a}a∈R, {f > a}a∈R and {f > a}a∈R generates B(R). Hence, the equivalence follows from Lemma 4.8. Lemma 4.13 (Pull-back of Borel sets). Let Ω be a sample space and f a function from Ω to (R,B). Then, σ(f) = σ({{f 6 a} : a ∈ R}). Proof. Since [−∞, a] ∈ B, we have {f 6 a} ∈ σ(f), for each a ∈ R. Thus, {f 6 a}a∈R ⊆ σ(f), and so G := σ({f 6 a}a∈R) ⊆ σ(σ(f)) = σ(f). On the other hand, by Lemma 4.12 f is G/B(R)-measurable. Therefore, by Remark 4.5, G ⊇ σ(f). Example 4.14. Let (Ω,F) be a measurable space. Given a partitionA1, . . . , An ∈ F of Ω and b1, . . . , bn ∈ R, define the function f : Ω→ R by f(ω) := b11A1(ω) + · · ·+ bn1An(ω), ω ∈ Ω. Suppose that Ai ∩ Aj = ∅ for i 6= j, and bi 6= bj for i 6= j. We will show that σ(f) = σ({A1, . . . , An}). To prove that σ(f) ⊆ σ({A1, . . . , An}), it suffices to show that {f 6 a} ∈ σ({A1, . . . , An}), ∀a ∈ R, and by Lemma 4.13, we can conclude that σ(f) ⊆ σ({A1, . . . , An}). Now note that {f 6 a} = {b11A1 + · · ·+ bn1An 6 a} = { Aij : bij 6 a, ij ∈ {1, . . . , n}, j = 1, . . . ,m } = m⋃ j=1 Aij ∈ σ({A1, . . . , An}), which follows by construction of f . Conversely, to show that {A1, . . . , An} ⊂ σ(f), observe that {f = bi} = Ai ∈ σ(f), i = 1, . . . , n, and so σ({A1, . . . , An}) ⊆ σ(f). 4 Exercise 4.15. Consider the measurable space ([0, 1],B([0, 1])), define the functions f, g : [0, 1]→ {0, 1, 2, 3} by 33 f(ω) = 1, ω ∈ [0, 14 ] 2, ω ∈ ( 14 , 12 ] 3, ω ∈ ( 12 , 1] , g(ω) = { 0, ω ∈ [0, 12 ] 1, ω ∈ ( 12 , 1] . Find σ(f), and check that σ(g) ⊂ σ(f). 4 4.2.2 Extended random variables Definition 4.16 (Extended random variable). An extended Borel function X on a probability space (Ω,F ,P) is called an extended random variable. The distribution of X is the push-forward measure PX on (R,B) given by PX := X∗P = P ◦X−1, that is, PX(A) = P(X ∈ A) = P({ω : X(ω) ∈ A}), A ∈ B(R). The function FX : R→ [0, 1] given by FX(x) := PX ( [−∞, x]) is called the cumulative distribution function of X. Remark 4.17. If X never take values ±∞, we say that X is a random variable. 4 Remark 4.18. For a random variable X, the measure PX on (R,B) is the Lebesgue-Stieltjes measure induced by the function FX , as defined in §3.1.2. 4 Example 4.19. Let (Ω,F ,P) be a probability space and A some event in F . Consider the random variable X given by X(ω) = 1A(ω), ω ∈ Ω, with distribution P(X = 1) = P(A) =: p ∈ [0, 1]. Then, X is called a Bernoulli random variable with parameter p. 4 Example 4.20. Let us revisit the construction of Example 4.14. Let (Ω,F ,P) be a probability space, and for n ∈ N, let A1, . . . , An be events in F that partition Ω and b1, . . . , bn distinct real numbers. Consider the random variable X defined by X(ω) = n∑ i=1 bi1Ai(ω), ω ∈ Ω, 34 with distribution P(X = bk) = P(Ak) =: pk ∈ [0, 1], k = 1, . . . , n. Then, X is called a simple random variable that takes value bk with probability pk, and has cumulative distribution function FX(x) = n∑ i=1 pi1{(−∞,bi]}(x). 4 (4.21) Example 4.22. Take Ω = R, F = B and let P = m|[0,1] be the restriction of the Lebesgue measure to [0, 1]. Let X(ω) = − log |ω|. Then X is measurable (why?), hence an extended random variable. Moreover, FX(x) = m({ω : − log |ω| 6 x} ∩ [0, 1])) = { 1− e−x, x > 0, 0, x 6 0, and X is said to follow the exponential distribution with unit parameter. 
4 4.3 Operations on extended Borel functions Given a measurable space (Ω,F), let f, g : Ω→ R be extended Borel functions. A property that we constantly use is that measurability is preserved under various operations. The properties studied in this section are so basic that in subsequent chapters we use them without making explicit reference. Lemma 4.23 (Comparison of functions). Let (Ω,F) be a measurable space and f, g : Ω → R be extended Borel functions. Then, the subsets of Ω given by {f < g}, {f 6 g}, {f = g}, and {f 6= g} are in F . Proof. Since Q is dense in R, for each ω ∈ Ω, f(ω) < g(ω) if and only if there is r ∈ Q such that f(ω) < r < g(ω). Thus, we can express {ω : f(ω) < g(ω)} as the union of {ω : f(ω) < r < g(ω)} over all r ∈ Q, and Q is countable. Further, {f < r}, {r < g} ∈ F , by Lemma 4.12 (b) and (d). Putting everything together yields {f < g} = ⋃ r∈Q ({f < r < g}) = ⋃ r∈Q ({f < r} ∩ {r < g}) ∈ F . (4.24) Now, we can see that {f 6= g} = {f < g} ∪ {f > g} ∈ F , {f = g} = {f 6= g}c ∈ F , {f 6 g} = {f < g} ∪ {f = g} ∈ F . This proves the lemma. 35 As one would expect, sums and products of Borel functions f and g on some measurable space (Ω,F) are also Borel. Importantly, we can extend this to f and g being extended Borel functions, as we only need to impose restrictions on infinite values. In particular, f + g is extended Borel if it is well-defined, that is, if there is no ω ∈ Ω for which f(ω) + g(ω) =∞−∞. Lemma 4.25 (Sums and products). Let (Ω,F) be a measurable space and f, g : Ω→ R be extended Borel functions. Then fg is an extended Borel function. If f + g is well-defined for all ω, then it is an extended Borel function. Proof. First, we show that af+b is an extended Borel function for every a, b ∈ R. For a = 0, the proof is trivial, so we assume that a ∈ R \ {0}. We have that {af + b < c} = {af < c− b} = { {f < c−ba }, a > 0 {f > c−ba }, a < 0 where {f < c−ba }, {f > c−ba } ∈ F , by Lemma 4.12 (b) and (d). Now suppose f + g is well-defined for all ω. For every b ∈ R, it follows that {f + g < b} = {f < b− g} ∈ F , since b− g is extended Borel and by applying Lemma 4.23. Next, we show that fg is extended Borel, where we treat the case of infinite values separately. First we show that f2 is also extended Borel. We have that {f2 > a} = { Ω, a < 0 {f < −√a} ∪ {f > √a}, a > 0 where Ω ∈ F and {f < −√a} ∪ {f > √a} ∈ F , by Lemma 4.12 (b) and (d). Now suppose f and g are both finite Borel functions. Notice that we can write fg = 14[(f + g) 2 − (f − g)2], where (f + g)2 and (f − g)2 are both Borel, since f + g and f − g are, so we can conclude that fg is Borel. Finally, suppose f and g are both extended Borel functions. Let A = {ω : f(ω) ∈ R, g(ω) ∈ R}, and B = Ac = {ω : f(ω) = ±∞ or g(ω) = ±∞}. Note that A,B ∈ F . Define f1 = f1A, f2 = f1B , g1 = g1A, g2 = g1B . Then f2g2 is measurable (why?), and by the previous case f1g1 is measurable. Since fg = f1g1 + f2g2, we conclude that fg is measurable. Remark 4.26. If f is an extended Borel function on some measurable space (Ω,F), then any scalar multiple of f is also extended Borel, that is for any a ∈ R, af is an extended Borel function (why?). 4 36 Lemma 4.27 (Maximum and minimum). Let (Ω,F) be a measurable space and f, g : Ω → R be extended Borel functions. Then, we have that max{f, g} and min{f, g} are extended Borel functions. Proof. By taking the suitable inequalities for minimum and maximum, For every a ∈ R, we have {max{f, g} < a} = {f < a} ∩ {g < a} ∈ F , since {f < a}, {g < a} ∈ F by Lemma 4.12 (b). 
Finally, use min{f, g} = −max{−f,−g} and Remark 4.26 for the minimum. Limiting operations also maintain measurability. Lemma 4.28 (Suprema and limits of functions). Let (Ω,F) be a measurable space and (fn)n>1 be a sequence of extended Borel functions. Then, the functions supn>1 fn, infn>1 fn, lim supn→∞ fn, lim infn→∞ fn are extended Borel functions. Moreover, if limn→∞ fn exists, then lim n→∞ fn = lim supn→∞ fn = lim inf n→∞ fn, is an extended Borel function. Proof. To prove that supn>1 fn is extended Borel, the idea is similar to the proof of Lemma 4.27. For any a ∈ R, we can write{ sup n>1 fn 6 a } = ∞⋂ n=1 {fn 6 a} ∈ F . where we used the fact that F is a σ-algebra, so it closed under countable intersections. The claim for infn>1 fn follows by noticing that inf n>1 fn = − sup n>1 (−fn), and using Remark 4.26. Now, using the definition of limit superior and limit inferior, lim sup n→∞ fn = inf n>1 sup m>n fm, lim inf n→∞ fn = supn>1 inf m>n fm, which both are extended Borel, by applying the above properties twice for each limit. Finally, assuming that the limit of fn exists, we have limn→∞ fn = lim supn→∞ fn, which concludes the proof. 37 5 Integration and expectation In Real Analysis, we first introduce the Riemann Integral, which approximates the area under a curve on a given interval via vertical rectangles and considering a sequence of step functions that converges to the associated curve. However this procedure is not the most suitable to analyse sequences of functions when taking limits. In contrast, the Lebesgue Integral considers horizontal shapes, which makes it more flexible. It is also more descriptive as to when it is possible to take limits. As an example, suppose we want to show that7 lim n→∞ ∫ ∞ 0 e−nx 2 √ x dx = 0 or d dt ∫ 1 0 etxf(x) dx = ∫ 1 0 xetxf(x) dx for some bounded continuous f . Sure, we could do these by brute force, making estimate over estimate. However, these will become much easier if we use the Dominated Convergence Theorem, to be introduced shortly. The reason to abandon the Riemann integral and adopt the Lebesgue integral does not stem from the possibility of integrating exotic functions, but rather from the robust and streamlined way in which the Lebesgue integral commutes with limits, derivatives, Fourier series and other integrals. Another reason to introduce the Lebesgue integral is that some measure spaces (such as probability spaces!) are not naturally partitioned into little intervals or little cubes, but we can still make measurements there. In order to define an integral in this case, instead of partitioning the domain and measuring the height of the graph of a function, we partition the co-domain and measure chunks of the domain, as shown in Figure 5.1. 5.1 Axiomatic definition In this section we define the integral of extended Borel functions by assuming existence of a linear unitary monotone σ-additive operator on non-negative functions. 7Example from [Bar95]. 38 g(x) x X(ω) ω Figure 5.1: Comparison between the Riemann integral on R and the Lebesgue integral on a measure space. 5.1.1 Non-negative functions Let (Ω,F , µ) denote a measure space. The integral of non-negative extended Borel functions against a measure µ is denoted by∫ Ω f(ω)µ(dω) or ∫ Ω f dµ and defined as follows. 
The Lebesgue integral is the unique operator with following properties: • Linear: ∫ Ω(αf + g) dµ = α ∫ Ω f dµ+ ∫ Ω g dµ; • Unitary: ∫ Ω 1A dµ = µ(A); • Monotone: f > g =⇒ ∫Ω f dµ > ∫Ω g dµ; • σ-additive: ∫ Ω( ∑ n fn) dµ = ∑ n ∫ Ω fn dµ; for α ∈ [0,+∞], A ∈ F and non-negative extended Borel functions f, g, fn. We prove existence and uniqueness of this operator in §5.4. Before that, let us discuss properties that follow from the above four. Exercise 5.1. Let (Ω,F , µ) be a measure space and f > 0 an extended Borel function. Show that ∫ Ω fdµ = 0 if and only if µ({f > 0}) = 0. 4 Solution. Suppose µ({f > 0}) > 0. Then by continuity from below of measures, there exists ε > 0 and n ∈ N such that µ({f > 1n}) > ε (otherwise we would have µ({f > 0}) 6 ε for every ε). Let g(ω) = ε if f(ω) > ε and 0 otherwise. Then 0 6 g 6 f , hence ∫ Ω f dµ > ∫ Ω g dµ > ε n > 0. Now suppose µ({f > 0}) = 0. Define g(ω) = +∞ if f(ω) > 0 and 0 otherwise. Then 0 6 f 6 g, hence∫ Ω f dµ 6 ∫ Ω g dµ =∞ · 0 = 0. 39 x x f(x) f+(x) f−(x) Figure 5.2: Separation of extended Borel function f : R → R by f+ and f− (coloured) . 5.1.2 Extended Borel functions and random variables We now allow f to take values on [−∞,+∞], similarly to (1.11). Denote the positive and negative parts of numbers by decomposing x = [x]+ − [x]−, [x]+ = { x, x > 0, 0, x 6 0, [x]− = { 0, x > 0, −x, x 6 0. For a function f , denote its positive part f+ and negative part f− by f±(ω) = [f(ω)]±, see Figure 5.2 for an illustration of this decomposition. We then define∫ Ω f dµ = ∫ Ω f+ dµ− ∫ Ω f− dµ. (5.2) unless it gives “ +∞ −∞ ”, in which case ∫Ω f dµ is undefined. When both integrals are finite, we say that f is integrable. Exercise 5.3. Let f be an extended Borel function on (Ω,F , µ). Prove that f is integrable if and only if ∫ Ω |f |dµ <∞. 4 This definition also results in a linear and monotone operator. Theorem 5.4. The Lebesgue integral has the following properties: • Homogeneous: ∫ Ω(αf) dµ = α ∫ Ω f dµ • Additive: ∫ Ω(f + g) dµ = ( ∫ Ω f dµ) + ( ∫ Ω g dµ) • Monotone: f > g =⇒ ∫Ω f dµ > ∫Ω g dµ. for measurable f, g and α ∈ [−∞,+∞] as long as the integrals are defined. These properties apply as follows. Homogeneity holds if both integrals are defined. Linearity is true if both integrals are defined and their sum does not result in “+∞ − ∞.” Monotonicity means that, if ∫ f+ dµ < ∞ or∫ g− dµ <∞, then both integrals are defined and satisfy the inequality. 40 We prove these properties from (5.2) in §5.4. Exercise 5.5. Let f be a non-negative extended Borel function on (Ω,F , µ). Show that ∫ Ω f dµ is defined and non-negative. 4 Exercise 5.6. Let f be an extended Borel function on (Ω,F , µ). Suppose that∫ Ω f dµ is defined. Show that | ∫ Ω f dµ| 6 ∫ Ω |f |dµ. 4 Exercise 5.7. Let f, g be extended Borel functions on (Ω,F , µ). Suppose that |f | 6 g and g is integrable. Show that f is integrable. 4 Exercise 5.8. Let f be an extended Borel function on (Ω,F , µ) and A ∈ F be a measurable set. Suppose that ∫ Ω f dµ is defined. Show that ∫ Ω f1A dµ is defined. 4 Exercise 5.9. Let f be an extended Borel function on (Ω,F , µ) and A ∈ F be a measurable set. Suppose that f is integrable. Show that f1A is integrable. 4 We now give a precise and unified definition of expectation. Definition 5.10 (Expectation). Let (Ω,F ,P) be a probability space. For an extended random variable X, the expectation of X is defined as EX = ∫ Ω X dP. Exercise 5.11. Let X be an extended random variable such that a 6 X 6 b for all ω ∈ Ω. Prove that a 6 EX 6 b. 4 Exercise 5.12. 
Supposed that X is an integrable extended random variable. Prove that E[X − EX] = 0. 4 Remark 5.13. Notation EX is used when the measure P is arbitrary or clear form the context. If more than one measure are being considered, one may use E and E˜ for the integrals against P and P˜, etc. If we need even more measures than fonts and tildes allow, it may be better to write ∫ ΩX dPα instead. 4 5.2 Main properties of the integral In this section we present some general properties of the integral, some inequalities, conditions for an extended Borel function to be integrable, and the relation with the Riemann integral. 5.2.1 General properties Given a measurable set B ∈ F , the integral of f on B is given by the integral of the function f1B which coincides with f on B and is zero outside B:∫ B f dµ := ∫ Ω (f1B) dµ = ∫ Ω f d(µ|B ) (5.14) 41 The first equality above is the definition of the integral on a subset whenever the second integral is defined. The second equality says that the integral of f on B coincides with the integral of f against the measure µ restricted to B, as defined in Example 2.33, and is valid whenever one of the two is defined. Preview of proof. We assume that f only takes values on {0, 1} (in §5.4 we will see a standard procedure to bootstrap arguments for this type of function to general functions). So f = 1A for some A ∈ F , and∫ Ω (f1B) dµ = ∫ Ω 1A∩B dµ = µ(A ∩B) = µ|B (A) = ∫ Ω f d(µ|B ) All this sounds great but at some point we will want to actually compute an integral. There are many methods, but the most important one is the Fundamental Theorem of Calculus. Theorem 5.15 (Fundamental Theorem of Calculus). If F is continuous on [a, b] and F ′ = f on (a, b), then∫ [a,b] f dx = F (b)− F (a), where dx denotes the Lebesgue measure. Proof. It is is the same as the proof for the Riemann integral. The latter uses the fact that the integral is monotone, finitely additive, and that ∫ [a,b] 1dx = b− a for all a < b. All these three properties are satisfied by the Lebesgue integral against the Lebesgue measure, hence the Fundamental Theorem of Calculus also holds in this context. Example 5.16. ∫ pi 0 sin x dx = − cospi + cos 0 = 2 4 Now, going one step further in the diagram of (4.7), if we have an extended Borel function defined on Ω2, we can define its pull-back on Ω1 by composition: Ω1 f→ Ω2 g→ R ω 7→ ω′ 7→ x . Defining the function f∗g = g ◦ f on Ω1 and get Ω1 f σ(f) µ ◦f−1 f∗g Ω2 F2 f−1 OO f∗µ g ◦f OO 42 In summary, from a function that takes points ω ∈ Ω1 to points ω′ ∈ Ω2, we can pull a σ-algebra on Ω2 back to a σ-algebra on Ω1, then push a measure on Ω1 forward to a measure on Ω2, and pull an observable defined on Ω2 back to an observable defined on Ω1. Theorem 5.17 (Change of variable). For measurable functions f : Ω1 → Ω2 and g : Ω2 → R, we have∫ Ω2 g d(f∗µ) = ∫ Ω1 (f∗g) dµ, provided one of the two is defined. Remark 5.18 (u substitution). A familiar instance of this principle is when f maps “x” to “u” and it becomes∫ A g(f(x))f ′(x) dx = ∫ B g(u) du, where A = f−1(B) and du = f∗dx = f ′(x)dx in a sense that will be made precise in §6. 4 Preview of proof. We assume that g only takes values on {0, 1} and leave the complete statement for §5.4. So g = 1C for some C ∈ F2, and f∗g = 1D where D = f−1(C) ∈ F1. Substituting these identities, we get∫ Ω2 g d(f∗µ) = (f∗µ)(C) = µ(f−1(C)) = µ(D) = ∫ Ω1 (f∗g) dµ. 
We say that a property holds µ-almost everywhere, or for µ-almost every ω, abbreviated by µ-a.e., if there is a set A ∈ F such that µ(Ac) = 0 and such that this property holds for all ω ∈ A. In case µ is clear from the context, we may simply say “a.e.” omitting µ. In case of a probability measure, it is customary to say a.s. or almost surely instead. Proposition 5.19 (a.e. equal functions). Let (Ω,F , µ) be a measure space and f, g : Ω→ R extended Borel functions. If f = g µ-a.e., then∫ Ω f dµ = ∫ Ω g dµ, provided one of the two is defined. Proof. Set E := {ω : f(ω) = g(ω)} ∈ F . Define f1 = f1E and f2 = f1+∞·1Ec . Then 0 6 f+1 6 f+ 6 f+2 and 0 6 f+1 6 g+ 6 f+2 . 43 Since ∫ Ω f2 dµ = ∫ Ω f1 dµ+∞ · µ(Ec) = ∫ Ω f1 dµ, by monotonicity we get∫ Ω f+ dµ = ∫ Ω g+ dµ. By a similar argument, ∫ Ω f − dµ = ∫ Ω g − dµ. Assuming one of these two is finite, we get ∫ Ω f dµ = ∫ Ω g dµ. Remark 5.20. Most of measure theory is insensitive to whether two functions differ on a set of zero measure. In upcoming sections and chapters, the reader will see that two a.e. functions are basically the same function for (almost!) all practical purposes. 4 Exercise 5.21. Let f be an extended Borel function on a measure space (Ω,F , µ) such that µ(f 6= 0) = 0. Show that ∫Ω f dµ = 0. 4 5.2.2 Inequalities The following inequality is fundamental for many arguments in Measure Theory and Probability. Theorem 5.22 (Chebyshev’s inequality). Let (Ω,F , µ) be a measure space and f : Ω→ R an extended Borel function. Then, for all ε > 0 and α > 0, µ({|f | > ε}) 6 1 εα ∫ Ω |f |αdµ. Proof. Let ε > 0 and α > 0. Then we have that,∫ Ω |f |αdµ > ∫ {|f |>ε} |f |αdµ > ∫ {|f |>ε} εαdµ = = εα ∫ Ω 1{|f |>ε}dµ = εαµ({|f | > ε}), and by rearranging, the inequality follows. Remark 5.23. In the setting of (Ω,F ,P) being a probability space and X : Ω→ R an extended random variable, ∀ε > 0 and α > 0, P(|X| > ε) 6 1 εα E|X|α, which is sometimes referred to as Markov’s inequality. We can, in turn, bound deviation from the mean using the variance of X by, P(|X − EX| > ε) 6 1 ε2 VX. (5.24) This is sometimes also referred to as Chebyshev’s inequality. 4 44 Exercise 5.25. Let X be a random variable such that, almost surely, a 6 X 6 b. SupposeX is not almost-surely equal to a constant. Prove that a < EX < b. 4 For the next inequality, we recall the notion of a convex function. For an open interval I ⊂ R, a function f : I → R is called convex if for every x, y ∈ I and a, b ∈ [0, 1] with a+ b = 1, we have f(ax+ by) 6 af(x) + bf(y). This inequality means that any line segment joining two points in the graph of f stay above the graph. In case f is twice differentiable, it is convex if and only if f ′′(x) > 0 for every x ∈ I. Examples of convex functions are |x|, ex, ax + b, x2 and [x]+ on R, as well as x−2, − log x, − arctan x and x−1 on (0,+∞). Theorem 5.26 (Jensen’s inequality). Let X be an integrable random variable taking values on an open interval I, and f : I → R a convex function. Then, f(EX) 6 Ef(X). Proof. Since f is convex, by [Kle14, Corollary 7.8(ii)], for each x0 ∈ I, there is c ∈ R such that f(x) > f(x0) + c(x− x0), ∀x ∈ I. Choosing x = X and x0 = EX, we get f(X) > f(EX) + c · (X − EX), a.s., and taking expectations on both sides we get the stated inequality.8 Theorem 5.27 (Cauchy-Schwarz Inequality). If EX2 < ∞ and EY 2 < ∞, then XY is integrable and E[XY ] 6 √ EX2 √ EY 2. Proof. We can assume EX2 = 1 and EY 2 = 1 (why?). The inequality follows immediately from the trick 0 6 E(X − Y )2 = 2− 2E[XY ]. (5.28) Example 5.29. 
Let (Ω,F ,P) be a probability space and X = 1A, Y = 1B be Bernoulli random variables with parameter p, where A,B are events in F . Further, assume that A and B are independent, i.e. P(A ∩B) = P(A)P(B) = p2. 8Based on [Shi16, §2.6.7]. 45 Then, we can see that E[XY ] = E[1A1B ] = E[1A∩B ] = P(A ∩B) = P(A)P(B) = p2. On the other hand, we have √ EX2 √ EY 2 = √ EX √ EY = p, and since p ∈ (0, 1) by construction, Cauchy-Schwarz Inequality is verified. 4 5.2.3 Integrability Proposition 5.30. Suppose there is a countable set A ⊆ [0,+∞] such that X only takes values in A. Then EX = ∑ x∈A xP(X = x). Proof. Using σ-additivity of the expectation, EX = E [∑ x∈A x · 1[X=x] ] = ∑ x xE1[X=x] = ∑ x xP(X = x). Example 5.31. If P(X = n) = λne−λn! for n ∈ N0, then EX = ∞∑ n=0 n λne−λ n! = ∞∑ n=1 λne−λ (n− 1)! = = λe−λ ∞∑ n=1 λn−1 (n− 1)! = λe −λ ∞∑ n=0 λn n! = λe −λeλ = λ, giving the familiar formula for the expectation of a Poisson variable. 4 Proposition 5.32. If X only takes values on {0, 1, 2, 3, . . . ,+∞}, then EX = ∑ n∈N P(X > n). Proof. Let A = {0, 1, 2, 3, . . . ,+∞}. Replacing n by a dull sum over k, EX = ∑ n∈A nP(X = n) = ∑ n∈A ∑ k∈N 1k6nP(X = n) = = ∑ k∈N ∑ n∈A 1n>kP(X = n) = ∑ k∈N ∑ n>k P(X = n) = ∑ k∈N P(X > k), where Theorem 1.13 was used to interchange sums. 46 Example 5.33. If P(X ∈ N0) = 1 and P(X > n) = (1 − p)n−1 for all n ∈ N, where 0 < p < 1, then EX = ∞∑ n=1 P (X > n) = ∞∑ n=1 (1− p)n−1 = ∞∑ j=0 (1− p)j = 11− (1− p) = 1 p , giving the familiar formula for the expectation of a Geometric variable. 4 Remark 5.34. The above propositions and their proof also work for measure spaces with X replaced by f and E by ∫ Ω · · · dµ. 4 Proposition 5.35. A random variable X is integrable if and only if∑ n P(|X| > n) <∞. Above we do not specify whether n = 0 or n = ∞ are included in the sum, because all that matters is whether the sum is finite. Proof. Let Z = |X| and Y = bZc. Observe that EY = ∑ n P(Y > n) = ∑ n P(Z > n). On the other hand, Y 6 Z 6 Y +1. By additivity, we have E[Y +1] = (EY )+1. By monotonicity, EY 6 EZ 6 (EY )+1. If one of these numbers is finite, all the others are finite, and if one is infinite, all the others are infinite (why?). This concludes the proof. Exercise 5.36. Can you find which parts of the proof break down if we work on a measure space instead of probability space? Can you find counter-examples? 4 5.2.4 Relation to the Riemann integral We conclude this section with a comparison with the Riemann integral. Theorem 5.37. Let f : [a, b] → R be a Borel function. If f is Riemann integrable then it is Lebesgue integrable and the value of both integrals coincide. Proof. Using the definition from [Coh13, §2.5], the Riemann integral of f on [a, b] equals L ∈ R if, for every ε > 0, there exist step functions g and h such that g 6 f 6 h and L − ε < ∫[a,b] g dx 6 ∫[a,b] hdx < L + ε. But this implies that g and h (and hence f) are bounded, therefore ∫ [a,b] f dx is defined and | ∫[a,b] f dx−L| < ε. Since this holds for every ε > 0, we get ∫[a,b] f dx = L. 47 Remark 5.38. The converse is false. A function can be Lebesgue integrable and not Riemann integrable. In case f is unbounded, it is not Riemann integrable even if it is continuous at almost every point. Another popular example is f : [0, 1] → R given by f = 1Q. Since m(Q) = 0, we have ∫ [0,1] fdx = 0 but this function is not Riemann integrable. 9 4 Remark 5.39. The reader may wonder what happens if f is Riemann integrable but not Borel measurable. We will never bump into such a function unless we are looking for it. 
In any case, if f : [a, b] → R is Riemann integrable then it is measurable with respect to a σ-algebra larger than B, denoted F∗ in Lemma 3.11, it is Lebesgue integrable and again the value of both integrals coincide. See [Coh13, 2.5.4] for a proof. 4 5.2.5 The fallacy of improper integrals Some people like to say that there are functions which are Riemann integrable but not Lebesgue integrable. Their infamous example is f(x) = sin x x · 1[pi,+∞)(x), usually followed by the pompous claim that it is Riemann integrable. The reason why this function causes some commotion is that f is not Lebesgue integrable (proof omitted) but nevertheless the limit lim z→+∞ ∫ [pi,z] f(x)dx is finite (proof omitted). Some would infer form this that f is Riemann integrable but not Lebesgue integrable, without realising that the above limit is the same regardless of whether the integral on [pi, z] is Lebesgue or Riemann. Anyway, there are some serious problems with the interpretation of this limit. It is finite, so what? As far as Measure Theory and Probability Theory are concerned, this limit has no meaning. It takes one particular sequence of sets An satisfying An ↑ [pi,+∞). By picking another sequence with the same property, one can make this limit give any number in R (proof omitted). There is no physical interpretation of this limit in terms of signed areas, centre of mass, etc. If this limit were to be understood as the value of the integral, Theorem 5.17 would not hold. If it were to correspond to the expectation of a random variable, the law of large numbers would not hold either. 9Let us prove that it is not Riemann integrable. Consider any partition of [0, 1] into finitely many non-degenerate intervals, and notice that all intervals will contain both rational and irrational numbers. Thus, every step function g 6 f must satisfy g 6 0 a.e., and every step function h > f must satisfy h > 1 a.e. Hence, ∫ gdx 6 0 and ∫ hdx > 1 and taking ε = 13 we see that there is no number L such that L− ε < ∫ g dx 6 ∫ hdx < L+ ε. 48 5.3 Convergence of integral and applications We can think of the region under the graph of a non-negative extended real function as having an area, volume or mass. If the function is f(x) = n·1(0, 1n ](x), or g(x) = 1(n,n+1](x), this mass is always equals to 1 and, nevertheless, it disappears when we let n → ∞. In both cases, we can say that the mass “escaped to infinity.” In the former case it disappeared vertically and in the latter case, horizontally. The three properties studied in this section explain what can happen to this mass when we take limits. 5.3.1 The main theorems The Monotone Convergence Theorem says that nothing strange can happen with the mass of an increasing sequence. More precisely, it says that extra mass cannot appear in the limit. Theorem 5.40 (Monotone Convergence Theorem). Let (fn)n be a non- decreasing sequence of non-negative extended Borel functions on (Ω,F , µ). Then lim n→∞ ∫ Ω fn(ω)µ(dω) = ∫ Ω ( lim n→∞ fn(ω))µ(dω). An easier way to remember is: 0 6 fn ↑ f =⇒ ∫ fn dµ ↑ ∫ f dµ. Proof. We included σ-additivity as an axiom instead of monotone convergence for aesthetic reasons, so the proof is a bit dull. Write fn = g1 + · · ·+ gn. Then∑ n gn = f , and by σ-additivity we have∫ Ω f dµ = ∫ Ω ( ∑ ngn) dµ = ∑ n ∫ Ω gn dµ = lim n ∫ Ω fn dµ. In §5.4 we will re-prove this theorem in order to prove σ-additivity. From this theorem, we derive another fundamental property relating integrals and limits. 
Fatou’s lemma says that, even though the previous examples show that even though we can lose mass in a limit, we can never gain mass. Theorem 5.41 (Fatou’s lemma). Let (Ω,F , µ) be a measure space and (fn)n>1 a sequence of non-negative extended Borel functions. Then,∫ Ω (lim inf n→∞ fn(ω))µ(dω) 6 lim infn→∞ ∫ Ω fn(ω)µ(dω). To remember the inequality, we consider functions fn = 1[n,∞). 49 Proof. Taking gn(ω) = inf k>n fk(ω) and defining g(ω) = lim inf n→∞ fn(ω), we have that 0 6 gn ↑ g. By the Monotone Convergence Theorem, lim n→∞ ∫ Ω gn dµ = ∫ Ω g dµ. Since gn 6 fn, we get lim inf n→∞ ∫ Ω fn dµ > lim inf n→∞ ∫ Ω gn dµ = ∫ Ω g dµ = ∫ Ω (lim inf n→∞ fn) dµ, concluding the proof. The Dominated Convergence Theorem says that, if the graphs of a sequence fn of functions are confined in a region of finite mass, then we cannot lose mass in the limit either. The reason is that the graph of fn divides this region of finite mass in two parts, and the fact that each of the two parts cannot gain mass in the limit implies that the other one cannot lose it. Theorem 5.42 (Dominated Convergence Theorem). Let (Ω,F , µ) be a measure space and (fn)n>1 a sequence of extended Borel functions such that fn(ω) → f(ω) as n → ∞ for a.e. ω. If there exists a non-negative extended Borel function g that is µ-integrable, such that |fn| 6 g for all n > 1, then f is µ-integrable, and lim n→∞ ∫ Ω fn(ω)µ(dω) = ∫ Ω ( lim n→∞ fn(ω) ) µ(dω). Proof. By changing fn, f, g on a set of measure zero, we can assume that convergence holds for every ω. Notice that fn + g > 0 for all n > 1, so by Fatou’s lemma, we get that∫ Ω (f + g) dµ 6 lim inf n→∞ ∫ Ω (fn + g) dµ and hence ∫ Ω f dµ 6 lim inf n→∞ ∫ Ω fn dµ. Similarly, we have that −fn + g > 0 for all n > 1, which gives∫ Ω (−f + g) dµ 6 lim inf n→∞ ∫ Ω (−fn + g) dµ 50 and hence ∫ Ω f dµ > lim sup n→∞ ∫ Ω fn dµ. From these two inequalities, we get lim n→∞ ∫ Ω fn dµ = ∫ Ω f dµ. 5.3.2 Some corollaries and applications Corollary 5.43. If fn ↓ f > 0 a.e. and f1 is integrable, then ∫ Ω fn dµ →∫ Ω f dµ. Proof. Exercise. Corollary 5.44. Let (Xn)n be extended random variables such that ∑ n E|Xn| converges. Then ∑ nXn converges a.s., and E[ ∑ nXn] = ∑ n EXn. Proof. Exercise. Corollary 5.45 (Bounded Convergence Theorem). If Xn→X a.s. and there is M ∈ R such that |Xn| ≤M a.s. for all n, then EXn → EX. Proof. Apply the Dominated Convergence Theorem with constant g = M . Corollary 5.46. Suppose 0 6 fn ↑ f , and there is M ∈ R such that ∫ Ω fn dµ < M for all n. Then f(ω) <∞ for a.e. ω. Proof. Exercise. Exercise 5.47. Suppose 0 6 fn 6 g, and g is integrable. Show that lim sup n→∞ ∫ Ω fn(ω)µ(dω) 6 ∫ Ω (lim sup n→∞ fn(ω))µ(dω). 4 (5.48) Proposition 5.49 (Moment generating function). Let M(t) = EetX and suppose there exists ε > 0 such that M(t) <∞ for −ε < t < ε. Then M (k)(0) = EXk for k = 0, 1, 2, 3, . . . . Proof. We will show a stronger statement, namely that dk dtkMX(t) = E[X ketX ] for every t ∈ (−ε,+ε), by induction on k. 51 For k = 0 we have the identity MX(t) = EetX which holds trivially. Suppose the identity holds for some k ∈ N. Write gx(t) = xketx and g′x(t) = xk+1etx. Then d dtM (k) X (t) = d dtEgX(t) = limh→0E gX(t+ h)− gX(t) h . If we can commute the expectation and the limit, we get M (k+1) X (t) = E lim h→0 gX(t+ h)− gX(t) h = Eg′X(t) = E[Xk+1etX ]. Fix a sequence hn → 0. In order to apply the Dominated Convergence Theorem, it is enough to bound the random term | gX(t+h)−gX(t)h | by an integrable extended random variable. 
By the Mean Value Theorem, gx(t+ h)− gx(t) h = g′x(θ), where θ ∈ [t, t+ h] depends on x, t and h. Taking δ = ε−|t|3 , for |h| < δ we have |g′x(θ)| ≤ |x|k+1e(ε−2δ)|x| 6 Ce(ε−δ)|x| for every x ∈ R, where C depends on ε and t. It follows from the assumption that Ee(ε−δ)|X| < ∞, which concludes the proof. 5.4 Construction of the integral In this last section we prove the existence and uniqueness of the operator f 7→∫ Ω f dµ defined on non-negative extended Borel functions f . In this process, we will show that non-negative functions can be approximated by simple functions, which allows us to bootstrap properties proved for binary functions (such as Theorem 5.17) to general measurable functions. 5.4.1 Simple functions We say that f is a simple function if it is measurable and only takes finitely many values. In case f is also non-negative, we define its integral as∫ Ω f dµ := ∑ x x · µ(f = x), where “µ(f = x)” means µ({ω ∈ Ω : f(ω) = x}). Let a1, . . . , an denote the distinct values attained by f , and A1, . . . , An ∈ F the sets where f takes these values. Then the sets A1, . . . , An form a partition of Ω and f = n∑ j=1 aj1Aj . 52 This representation is unique except for permutation of the indices {1, . . . , n}. In this representation, we have∫ Ω f dµ = n∑ j=1 ajµ(Aj) (5.50) for every f non-negative and simple. Lemma 5.51. This definition is unitary, monotone, and linear. It is unitary by construction. Monotonicity follows directly from linearity. Indeed, if f > g > 0 are simple, then h := f −g is non-negative and simple, and∫ Ω f dµ = ∫ Ω g dµ+ ∫ Ω hdµ > ∫ Ω g dµ. So it remains to prove linearity. Lemma 5.52. If A1, . . . , An ∈ F form a partition of Ω and f = ∑n j=1 aj1Aj , then ∫ Ω f dµ = ∑n j=1 ajµ(Aj), even if a1, . . . , an ∈ [0,+∞] are not distinct. Idea of proof. Group the sets A corresponding to same value a. Proof of linearity. Write f = ∑n j=1 aj1Aj and g = ∑k i=1 bi1Bi , where both A1, . . . , An and B1, . . . , Bk form partitions of Ω. After expanding and grouping, af + bg = · · · = n∑ j=1 k∑ i=1 (aaj + bbi)1Aj∩Bi . But the family {Aj ∩ Bi}j=1,...,n;i=1,...,k forms a partition of Ω, so using the previous lemma we get∫ Ω (af + bg) dµ = n∑ j=1 k∑ i=1 (aaj + bbi)µ(Aj ∩Bi). Splitting the terms in the sum and using that B1, . . . , Bk as well as A1, . . . , An are partitions, one can eventually gets∫ Ω (af + bg) dµ = n∑ j=1 aajµ(Aj) + k∑ i=1 bbiµ(Bi) = a ∫ Ω f dµ+ b ∫ Ω g dµ. Remark 5.53. We really had no choice for how to define the integral of simple functions, as (5.50) follows from the integral being unitary and linear. 4 5.4.2 Non-negative extended Borel functions We define the integral of a non-negative extended Borel function f by∫ Ω f dµ = sup {∫ Ω g dµ : 0 6 g 6 f and g is simple } . (5.54) As promised in §5.1, we will now prove the following. 53 Theorem 5.55. This definition is linear, unitary, monotone and σ-additive. It is to be noted that for simple non-negative functions, this definition coincides with the previous one, so it really is only extending it. In particular, this integral is unitary by construction. Monotonicity follows from inclusion: the larger f , the more simple functions are allowed in the above supremum. To prove linearity and σ-additivity, we will use monotone convergence: 0 6 fn ↑ f =⇒ ∫ Ω fn dµ→ ∫ Ω f dµ (5.56) for every non-decreasing sequence of non-negative extended Borel functions. This was already “proved” using σ-additivity, but in order to avoid a circular argument we now prove it directly from the above definition. Proof of (5.56). 
By monotonicity, the sequence ∫ Ω fn dµ is non-decreasing and bounded by ∫ Ω f dµ. Hence, it converges, and lim n→∞ ∫ Ω fn dµ 6 ∫ Ω f dµ. It remains to show the opposite inequality. Let 0 6 g 6 f be simple, taking positive values a1, . . . , ak. Let 0 < α < 1 and An = {ω : fn(ω) > αg(ω)}. Since f > g and 0 6 fn ↑ f , for each ω there is n such that fn(ω) > αg(ω), so An ↑ Ω and∫ Ω fn dµ > ∫ Ω (fn1An) dµ > ∫ Ω (αg1An) dµ = = α k∑ j=1 ajµ ({ω ∈ An : g(ω) = aj})→ α k∑ j=1 ajµ ( g = aj ) = α ∫ Ω g dµ. Hence, lim n→∞ ∫ Ω fn dµ > α ∫ Ω g dµ. Since this is true for every 0 < α < 1, we get lim n→∞ ∫ Ω fn dµ > ∫ Ω g dµ. Since this is true for every simple g such that 0 6 g 6 f , lim n→∞ ∫ Ω fn dµ > ∫ Ω f dµ. 54 g1(x) g2(x) g3(x) g2(y) x x y Figure 5.3: Graph of g2(y) and approximation gk(x) ↑ x for some fixed x. 5.4.3 Approximation by simple functions In order to bootstrap properties already established for simple functions to the case of non-negative functions, we start by showing that the latter can be approximated by the former in a monotone way, and finally apply (5.56). We define a sequence of functions that approximate every non-negative extended number in a monotone way. For n ∈ N, let gn : [0,+∞]→ R+ be defined by gn(x) = 2−n ·max { j ∈ {0, 1, . . . , 2nn} : 2−nj 6 x} , illustrated in Figure 5.3. Note that, each gn assumes finitely many values and, for every x ∈ [0,+∞], gn(x) ↑ x as n→∞. For a non-negative extended Borel function f defined on Ω, the Borel functions fn := gn ◦ f are simple and satisfy fn ↑ f , see Figure 5.4. Remark 5.57. The above construction shows that, for every non-negative extended Borel function f , there exists a sequence of simple functions (fn)n such that 0 6 fn ↑ f . 4 Remark 5.58. Combining the previous remark with the Monotone Convergence Theorem and Remark 5.53, we see that any other definition of integral of non-negative extended Borel functions that is unitary, monotone, linear and σ-additive, would be equivalent to this one. In other words, the integral is unique. 4 5.4.4 Proofs of the axioms and other general properties Proof of linearity. Suppose f > 0 and h > 0 are extended Borel functions. For homogeneity, given α > 0 and we get ∫ Ω(αf) dµ = α ∫ Ω f dµ directly from (5.54) since we can multiply each g in the supremum by α. For additivity, consider simple functions fn ↑ f and hn ↑ h. Using (5.56) three times and linearity for the integral of simple functions, 55 X(ω) g1(X(ω)) g2(X(ω)) ω Figure 5.4: Approximation of an extended Borel function X by g1(X) e g2(X). Notice the similarity with Figure 5.1. ∫ Ω (f + g) dµ = lim n→∞ ∫ Ω (fn + gn) dµ = lim n→∞ (∫ Ω fn dµ+ ∫ Ω gn dµ ) = = lim n→∞ ∫ Ω fn dµ+ lim n→∞ ∫ Ω gn dµ = ∫ Ω f dµ+ ∫ Ω g dµ. This concludes the proof. Proof of σ-additivity. Let (hn)n be a sequence of non-negative extended Borel functions. Define fn = h1 + · · ·+ hn. Then fn ↑ f := ∑ j hj , and by (5.56) we have ∞∑ j=1 ∫ Ω hj dµ = lim n→∞ n∑ j=1 ∫ Ω hj dµ = lim n→∞ ∫ Ω n∑ j=1 hj dµ = = lim n→∞ ∫ Ω fn dµ = ∫ Ω f dµ = ∫ Ω ∞∑ j=1 hj dµ, concluding the proof. Proof of (5.14). We have already proved this identity when f is an indicator function. By linearity, it also holds when f is a non-negative simple function. By (5.56), it holds for any non-negative extended Borel function f . Finally, if f is an extended Borel function,∫ Ω (f1B) dµ = ∫ Ω (f+1B) dµ− ∫ Ω (f−1B) dµ = = ∫ Ω f+ d(µ|B )− ∫ Ω f− d(µ|B ) = ∫ Ω f d(µ|B ), and when one of the two is defined, these expressions do not entail a “∞−∞.” 56 Proof of Theorem 5.17. 
We have already proved this identity when g is an indicator function. Just like in the previous proof, by linearity and (5.56), the identity holds for any non-negative extended Borel function g. Assuming g is an extended Borel function,∫ Ω2 g d(f∗µ) = ∫ Ω2 g+ d(f∗µ)− ∫ Ω2 g− d(f∗µ) = = ∫ Ω1 (f∗g)+ dµ− ∫ Ω1 (f∗g)− dµ = ∫ Ω1 (f∗g) dµ, and when one of the two is defined, these expressions do not entail a “∞−∞.” We finally give the belated proof of Theorem 5.4 using Theorem 5.55. Proof. We begin with monotonicity. Suppose f > g and ∫ Ω f + dµ < ∞. Then g+ 6 f+ and f− 6 g−, hence ∫ Ω g + dµ 6 ∫ Ω f + dµ < ∞ and − ∫Ω g− dµ 6− ∫Ω f− dµ. Adding the two last inequalities we get ∫Ω g dµ 6 ∫Ω f dµ without incurring any “∞−∞” operation. The case ∫ g− dµ <∞ is identical. We now move to homogeneity. Suppose both integrals are defined. Consider the case −∞ 6 α 6 0 first. Then∫ Ω (αf) dµ = ∫ Ω (−αf−) dµ− ∫ Ω (−αf+) dµ = = (−α) ∫ Ω f− dµ− (−α) ∫ Ω f+ dµ = = α (∫ Ω f+ dµ− ∫ Ω f− dµ ) = α ∫ Ω f dµ. These expressions cannot contain “∞−∞” because both integrals are defined. The case α > 0 is analogous. We conclude with additivity. Note that (f + g)+ − (f + g)− = f + g = f+ − f− + g+ − g−, whence (f + g)+ + f− + g− = (f + g)− + f+ + g+. These are all non-negative extended Borel functions. By additivity in this case,∫ Ω (f + g)+ dµ+ ∫ Ω f− dµ+ ∫ Ω g− dµ = ∫ Ω (f + g)− dµ+ ∫ Ω f+ dµ+ ∫ Ω g+ dµ. We are supposing that ∫ Ω f dµ+ ∫ Ω g dµ is defined, which implies that ∫ Ω f − dµ+∫ Ω g − dµ <∞ or ∫Ω f+ dµ+ ∫Ω g+ dµ <∞. Assume without loss of generality the former case. Since (f + g)− 6 f− + g−, we have ∫ Ω(f + g) − dµ < ∞, too, so we can subtract these three terms from both sides, getting∫ Ω (f + g)+ dµ− ∫ Ω (f + g)− dµ = ∫ Ω f+ dµ− ∫ Ω f− dµ+ ∫ Ω g+ dµ− ∫ Ω g− dµ. 57 This concludes the proof. 58 6 Density and Radon-Nikodým Theorem There are many situations where we want to define or study a measure in terms of another. Some of the most common uses is to tilt a probability measure and obtain a new one, to get probability distributions on R by specifying some density, and to define conditional expectations. We explore these three cases, while keeping in mind that these are rather general concepts and tools. In particular, Proposition 6.18 we prove the good old formula for the expectation of an absolutely continuous random variable. 6.1 Density of measures and continuous variables In this section we explore the concept of density and how it applies to absolutely continuous random variables. 6.1.1 Measures defined by a density Given a measure space and a non-negative extended Borel function, it is possible to construct a new measure via the Lebesgue integral. Proposition 6.1. Let (Ω,F , µ) be a measure space and f a non-negative extended Borel function. Then, ν(A) := ∫ A f dµ = ∫ Ω f1A dµ, ∀A ∈ F , defines a measure on (Ω,F). Exercise 6.2. Prove the above proposition. 4 Example 6.3. Consider the measure space (R,B,m) and the non-negative Borel function g(x) = λe−λx1[0,∞)(x) for x ∈ R, where λ > 0. Then, Pλ(B) := ∫ B g dm = ∫ R g1B dm, ∀B ∈ B, defines a measure on (R,B), by Proposition 6.1. In fact, Pλ is a probability measure, since Pλ(R) = ∫ R λe−λx1[0,∞)(x)m(dx) = ∫ ∞ 0 λe−λx dx = 1. 4 (6.4) A probability mass function assigns mass to elements of a discrete set, which can be seen as changing the weights assigned by the counting measure. In the above example we changed the weights assigned by the Lebesgue measure instead. 59 Definition 6.5 (Radon-Nikodým derivative). 
Let (Ω,F , µ) be a measure space and f : Ω→ [0,+∞] a non-negative extended Borel function. If ν is a measure on (Ω,F) such that ν(A) := ∫ A f dµ, ∀A ∈ F , (6.6) we say that f is the Radon-Nikodým derivative (or density) of ν with respect to µ, and we write f = dνdµ . Remark 6.7. Given two measures µ, ν on (Ω,F), we say that dνdµ exists if there is a non-negative extended Borel function f such that (6.6) holds. 4 Remark 6.8. We say the Radon-Nikodým derivative because it is unique, in the following sense. Given two measures µ and ν, and two non-negative extended Borel functions f and g such that (6.6) holds, then f = g for µ-a.e. ω. 4 Remark 6.9. Given a measure µ and a non-negative extended Borel function f , by Proposition 6.1 we can define a unique ν such that dνdµ = f for µ-a.e. ω. 4 Example 6.10. Let Ω = {1, 2, 3, 4, 5, 6}, F = P(Ω) and P(A) = #A6 . Take the function g(n) = 27n. Then by Proposition 6.1 the function P given by P(A) = ∫ Ω(g1A)dP defines another measure on (Ω,F). 4 Exercise 6.11. Show that the above P is a probability measure. 4 Exercise 6.12. Prove the statement in Remark 6.8. 4 Proposition 6.13 (Chain rule of the Radon-Nikodým derivative). Let ν, µ be measures on a given measurable space (Ω,F) such that dνdµ exists, and let g : Ω→ [0,+∞] be a non-negative extended Borel function. Then,∫ Ω g dν = ∫ Ω g dνdµ dµ. Proof. First, suppose that g = 1A for some A ∈ F . In this case, we have∫ Ω 1A dν = ν(A) = ∫ A dν dµ dµ = ∫ Ω 1A dν dµ dµ as claimed. By linearity, it also holds when g is a non-negative simple function. By (5.56), it holds for any non-negative extended Borel function g. 6.1.2 Absolutely continuous random variables We now give the proper definition of a very familiar object. 60 Definition 6.14 (Absolutely continuous variables). Given a random variable X on a probability space (Ω,F ,P), we say that X is an absolutely continuous random variable if dPXdx exists, where dx denotes the Lebesgue measure on (R,B). The Radon-Nikodým derivative dPXdx is called the probability density function of X and usually denoted fX : R→ R. Note that fX is determined by X and P but the latter is omitted in the notation. The probability measure of Example 6.3 is known the Exponential distribution with parameter λ, and g is called the density of this distribution. Example 6.15. The unique probability measure P on (R,B) such that dPX dx (x) = e−x 2/2 √ 2pi is called the standard normal distribution on R. 4 Example 6.16. The unique probability measure P on (R,B) such that dPX dx (x) = 1[0,1](x) is called the uniform distribution on [0, 1]. 4 Remark 6.17. The density of an absolutely continuous random variable is unique up to almost-everywhere equality. For instance, 1(0,1)(x) is also the density of the uniform distribution, as it differs from 1[0,1](x) on a set of zero Lebesgue measure. 4 Proposition 6.18 (Expectation). If X is an absolutely continuous random variable, then, for every non-negative extended Borel function g, we have Eg(X) = ∫ R g(x)fX(x) dx. Proof. First, notice that Theorem 5.17 gives Eg(X) = ∫ Ω g(X(ω))P(dω) = ∫ R g(x)PX(dx) = ∫ R g dPX Now, since X is absolutely continuous, we can apply Proposition 6.13 to get∫ R g dPX = ∫ R g dPXdx dx = ∫ R g(x)fX(x) dx, which concludes the proof. Example 6.19. If X ∼ Exp(λ), then EX = 1λ . Indeed, the density of PX with respect to the Lebesgue measure is λe−λx1[0,∞)(x). By the above proposition, we have EX = ∫ R xλe −λx1[0,∞)(x) dx = · · · = 1λ . 
4 61 6.1.3 Tilting a distribution Below we see how to change the distribution of a random variable without changing the variable itself. Example 6.20 (Tilting a probability measure). Let (Ω,F ,P) be a probability space and X ∼ N (0, 1), i.e. X is a standard Normal random variable with probability distribution, that is, P[X ∈ A] = ∫ A 1√ 2pi e− x2 2 dx, ∀A ∈ B(R). Fix some θ ∈ R. We want to define another measure P on (Ω,F), such that X ∼ N (θ, 1), under P. Take dPdP (ω) = eθX(ω) e θ2 2 and by Proposition 6.13, we get E[et(X−θ)] = E[et(X−θ)eθX− θ 2 2 ] = e−θt− θ 2 2 E[e(t+θ)X ] = e−θt− θ 2 2 e (t+θ)2 2 = e t 2 2 , t ∈ R. First, taking t = 0 we see that P(Ω) = E[1] = E[e0] = 1, so P is a indeed a probability measure. Also, by uniqueness of moment generating functions, X − θ ∼ N (0, 1) under P. 4 Remark 6.21. The fact that P is a probability measure comes from our careful choice of Radon-Nikodým derivative where the denominator equals the P- expectation of the numerator. In general, if g is a non-negative extended Borel function and Eg(X) <∞, then we can define dPdP = g(X)Eg(X) , and the distribution of X under P equals the distribution of X under P biased by the function g. The resulting measure is a probability measure because, taking A = Ω and f = g(X) in (6.6) will give P(Ω) = ∫ Ω fdP = 1 Eg(X) ∫ Ω g(X)dP = 1. 4 Example 6.22 (Size-biasing). Let (Ω,F ,P) be a probability space. Suppose X ∼ Poisson(µ). Defining dPdP = Xµ , the measure P is a probability measure and the distribution of X − 1 under P is Poisson with the same parameter. 4 Example 6.23 (Size-biasing). Let (Ω,F ,P) be a probability space. Suppose Y ∼ Exp(λ). Defining dPdP = λY , the measure P is a probability measure and the distribution of Y under P is Γ(2, λ). 4 Example 6.24 (Size-biasing). In Example 6.10, we have the experiment of rolling a die, where P corresponds to sampling one of the six faces with equal probability. The measure P corresponds to sampling one of the 21 dots with equal probability, and observing the number of dots in its face. 4 62 6.2 The Radon-Nikodým Theorem In this section we study when a measure ν can be expressed in terms of another measure µ, deformed by a variable factor dνdµ that we call Radon-Nikodým derivative. 6.2.1 Absolute continuity and the Radon-Nikodým Theorem The following is a useful property for comparing measures. Definition 6.25 (Absolute continuity between measures). Let ν, µ be measures on a given measurable space (Ω,F). We say that ν is absolutely continuous with respect to µ if, for all A ∈ F µ(A) = 0 =⇒ ν(A) = 0. We denote this relation by ν µ. Example 6.26. Consider the measurable space (N,P(N)), and recall the Dirac measure δk at a point k ∈ N, defined in Example 2.26. Denote by µC the counting measure on N, as defined in Example 2.28. If a set A ⊆ Z has µC measure 0, then it has to have δk measure 0. Hence, δk is absolutely continuous with respect to µC , i.e. δk µC . 4 Exercise 6.27. In the above example, can you think of a function f : N→ [0,∞] that would be the Radon-Nikodým derivative dδkdµC of δk with respect to µC? 4 Example 6.28. Let µC be the counting measure on (R,B), so that: • If µC(A) =∞, then A is an infinite Borel set. • If µC(A) = N <∞, then A = {a1, ..., aN} for some a1, ..., aN ∈ R. • If µC(A) = 0, then A = ∅. From the above cases, if µC(A) = 0, then m(A) = 0, where m denotes the Lebesgue measure. Therefore, m µC . 4 Exercise 6.29. In the above example, show that the converse is not true, that is, µC is not absolutely continuous with respect to m. 
4 Let µ and ν be measures on (Ω,F) and suppose the Radon-Nikodým derivative f of ν with respect to µ exists, that is, ν(A) = ∫ A f dµ for ever A ∈ F . By Exercise 5.21 we have ν µ. We wonder whether the converse implication holds, that is, if ν µ, does dνdµ exist? Theorem 6.30 (The Radon-Nikodým Theorem). Let ν and µ be σ-finite measures on a given measurable space (Ω,F). Then, ν µ if and only if 63 there exists a non-negative extended Borel function f , such that ν(A) = ∫ A f dµ, ∀A ∈ F . Proof. See [Coh13, 4.2.2]. We say that a random variable X is absolutely continuous if its distribution PX assigns measure zero to Borel sets of Lebesgue measure zero, that is PX m. From the above theorem, we see that any absolutely continuous random variable X must have a density! 6.2.2 Continuous random variables and densities We say that a random variableX is continuous if its distribution assigns measure zero to every fixed number x ∈ R. Since finite sets have Lebesgue measure zero, every absolutely continuous variable is also continuous (showing that the adverb “absolutely” has been used properly, unlike outer measures and finitely additive measures which are not measures). One may wonder about what could be a random variable which is continuous but not absolutely continuous: in fact it is possible to construct such variables, and their distributions assign positive probability to a set which is uncountable but has zero Lebesgue measure. If we are given a non-negative function f and we can check that PX ( (−∞, x]) = ∫ x −∞ f(s) ds for all x ∈ R, then we can conclude that X is absolutely continuous and has density f . Indeed, the measure P(B) = ∫ B f dx coincides with PX on the pi-system {(−∞, x]}x∈R which generates B, and the claim follows from Theorem 2.43. The converse is more delicate, and there is a theory that decomposes PX into absolutely continuous and singularly continuous parts. We will bypass the theory by taking a more modest approach that is very useful in practice. Suppose we are given a distribution function F and want to test if it is absolutely continuous. We take f to be its almost-everywhere derivative, and check whether F (x) = ∫ x −∞ f(s) ds for every x ∈ R. If this is the case, we can conclude that f is the density of F with respect to the Lebesgue measure. Exercise 6.31. Suppose U has the uniform distribution on [0, 1] and we take X = eU . Find the density of X. (Answer: x−11[1,e](x).) 4 64 6.2.3 Conditional expectation The Radon-Nikodým Theorem is a fundamental theorem in Measure Theory and Probability Theory. We now show one of the its many non-trivial applications. Definition 6.32 (Conditional expectation). Suppose (Ω,F ,P) is a probability space, X a an extended random variable, and G ⊆ F another σ-algebra. We say that an extended random variable Z is the conditional expectation of X given G if Z : Ω→ R is G-measurable (6.33) and ∫ A Z dP = ∫ A X dP for all A ∈ G. (6.34) The above identity means whenever one of the integrals is defined, the other one is also defined and they coincide. Exercise 6.35. Suppose that two extended random variables Z and W satisfy (6.33) and (6.34). Prove that P(Z = W ) = 1. 4 Remark 6.36 (Uniqueness). The above definition says the conditional expectation because, if there is such a function, it is unique in the a.s. sense. Conversely, if Z satisfies (6.34) andW = Z a.s., thenW also satisfies (6.34). 4 Theorem 6.37. In the setup of Definition 6.32, if X is a.s. finite and EX is defined, then the conditional expectation of X given G is defined. 
Proof (non-negative integrable case). Suppose that X is integrable, and define the measure ν on (Ω,G) by ν(A) = ∫ A X dP for A ∈ G. Then ν(Ω) = EX < ∞, so ν is σ-finite. By the Radon-Nikodým Theorem, dνdP exists as a non-negative extended Borel function on (Ω,G). Define Z := dνdP and note that it satisfies both (6.33) and (6.34). Proof (non-negative case). Now suppose that X is a non-negative finite random variable, not necessarily integrable. Then X can be written as X = ∑ nXn where each Xn is an integrable random variable. From the previous case, each Xn has a conditional expectation, which we denote Zn. We take Z := ∑ n Zn. Note that Z is also non-negative and G-measurable. By the previous case and σ-additivity of the integral, for every A ∈ G,∫ A Z dP = ∑ n ∫ A Zn dP = ∑ n ∫ A Xn dP = ∫ A X dP, 65 which concludes the proof. Proof (general case). We can assume that EX− <∞. From the previous case, there exist two G-measurable functions Z± such that ∫ A Z± dP = ∫ A X± dP for all A ∈ G. In particular, ∫Ω Z− dP = ∫ΩX− dP < ∞, so by Z− is a.s. finite, and we can assume that Z−(ω) < ∞ for every ω ∈ Ω. Define Z = Z+ − Z−, which is well-defined because Z− is always finite. Then for all A ∈ G, we have∫ A Z dP = ∫ A Z+ dP− ∫ A Z− dP = ∫ A X+ dP− ∫ A X− dP = ∫ A X dP which are well-defined because Z− and X− are integrable. 66 7 Product measures and independence Product measures are a fundamental tool in Measure Theory. Among its many applications, this topic is useful for computing high-dimensional integrals one variable at at time, computing the expectation of an integral as the integral of the expected value, and to construct new probability spaces from old ones by running two experiments (or even two copies of the same experiment) independently. The proofs in the first three sections are based on [Coh13, §5]. 7.1 Product σ-algebras and their properties Let (Ω1,F1) and (Ω2,F2) be measurable spaces, and let Ω1×Ω2 be the Cartesian product of Ω1 and Ω2. We say that a subset of Ω1 × Ω2 is a rectangle with measurable sides if it is of the form A×B, for some A ∈ F1 and B ∈ F2. Definition 7.1 (Product of σ-algebras). We define the product of the σ- algebras F1 and F2, denoted as F1 ⊗ F2, as the σ-algebra generated by the class F1 ×F2 = {A×B ⊆ Ω1 × Ω2 : A ∈ F1, B ∈ F2} on the sample space Ω1 × Ω2. In other words, F1 ⊗ F2 is the σ-algebra generated by the collection of all rectangles with measurable sides. It is important to note that F1 ⊗F2 is much larger than the Cartesian product F1 ×F2 of F1 and F2. Let us discuss the case of R2. Example 7.2. Consider the sample space R2 = R × R. Then the product of Borel σ-algebras B(R)⊗ B(R) equals the Borel σ-algebra B(R2). 4 Proof. [Not examinable] First, we show B(R2) ⊆ B(R)⊗ B(R). Let E := {(a, b] ∩ R : −∞ 6 a 6 b 6 +∞}, E2 := E × E , and recall that B(R2) = σ(E2). Since E2 ⊆ B(R)× B(R), we have B(R2) ⊆ σ(B(R)× B(R)) = B(R)⊗ B(R). We now show the other direction. Note that {A×B : A ∈ σ(E), B = R} = σ({A×B : A ∈ E , B = R}) as the set B is being pinned to B = R plays no role in any of the theoretic operations involved. Likewise, {A×B : A = R, B ∈ σ(E)} = σ({A×B : A = R, B ∈ E}). 67 In particular, for A,B ∈ B(R) = σ(E), we have A×B = (A× R) ∩ (R×B) ∈ σ(E × E). Thus, σ(E)× σ(E) ⊆ σ(E × E) and hence B(R)× B(R) = σ(E)× σ(E) ⊆ σ(E × E) = σ(E2) = B(R2), Therefore, B(R)⊗ B(R) = σ(B(R)× B(R)) ⊆ B(R2). Remark 7.3. It is not true that B(R2) is equal to B(R) × B(R). To see this, notice that the set {(x, y) : 0 6 x2 + y2 < 1} lies in B(R2), but not in B(R)× B(R). 
4 Given a subset or a function on Ω1 × Ω2, we need some notation involving the separate parts on Ω1 and Ω2. Definition 7.4 (Sections of subsets and functions on Ω1×Ω2). Let Ω1×Ω2 be the Cartesian product of given sample spaces Ω1 and Ω2, E a subset of Ω1×Ω2 and f a function on Ω1 × Ω2. Fix x ∈ Ω1 and y ∈ Ω2. (i) The sections Ex and Ey of E are the subsets of Ω2 and Ω1, respectively, given by Ex = {y˜ ∈ Ω2 : (x, y˜) ∈ E} and Ey = {x˜ ∈ Ω1 : (x˜, y) ∈ E}. (ii) The sections fx and fy of f are the functions on Ω2 and Ω1, respectively, given by fx(y˜) = f(x, y˜) and fy(x˜) = f(x˜, y). Naturally, we want to study subsets and functions that are measurable with respect to a product of σ-algebras. Lemma 7.5. Let (Ω1,F1) and (Ω2,F2) be measurable spaces. (i) If E is F1⊗F2-measurable, then for each x ∈ Ω1 and y ∈ Ω2, the sections Ex and Ey are F2-measurable and F1-measurable, respectively. (ii) If f : Ω1 × Ω2 → R is F1 ⊗ F2-measurable, then for each x ∈ Ω1 and y ∈ Ω2, the sections fx and fy are F2-measurable and F1-measurable, respectively. Proof. [Not examinable] For the first statement, we show that each section Ex is F2-measurable. A symmetric reasoning can be used to show that each section Ey is F1-measurable. Fix x ∈ Ω1, and define the collection G by G = {E ⊆ Ω1 × Ω2 : Ex ∈ F2}. By construction, G contains F1 × F2. It remains to show that G is a σ-algebra on Ω1 × Ω2, so that it contains σ(F1 ×F2) = F1 ⊗F2. Indeed, we have 68 (i) Ω1 × Ω2 ∈ G, since (Ω1 × Ω2)x = Ω2 ∈ F2. (ii) Let E ∈ G. Since Ex ∈ F2, then (Ec)x = (Ex)c ∈ F2, so that Ec ∈ G. (iii) Let E1, E2, · · · ∈ G. Then, (Ej)x ∈ F2 for each j ∈ N. This in turn gives( ∞⋃ i=1 Ei ) x = ∞⋃ i=1 (Ei)x ∈ F2, so that ⋃∞ i=1Ei ∈ G. where we use the identities (Ec)x = (Ex)c and ( ⋃∞ i=1Ei)x = ⋃∞ i=1(Ei)x. We now prove the second statement. Again, we only prove that each fx is F2- measurable. Note that (fx)−1(D) = (f−1(D))x for any D ∈ B(R). Combined with the first statement, this implies that (fx)−1(D) ∈ F2 for every D ∈ B, concluding the proof. 7.2 Product measure We now establish existence and uniqueness of the product measure defined by σ-finite measure spaces. We start with a basic lemma. Lemma 7.6. Let (Ω1,F1, µ) and (Ω2,F2, ν) be two σ-finite measure spaces. If E is F1 ⊗ F2-measurable, then the functions x 7→ ν(Ex) and y 7→ µ(Ey) are F1-measurable and F2-measurable, respectively. Proof. [Not examinable] We show that x 7→ ν(Ex) is F1-measurable, the other statement is analogous. First, suppose that the measure ν is finite. Define the collection G by G = {E ∈ F1 ⊗F2 : x 7→ ν(Ex) is F1-measurable}. We will show that G = F1 ⊗F2. Note from Lemma 7.5 (i) that, given E ∈ F1 ⊗ F2, each section Ex lies in F2, so that the function x 7→ ν(Ex) is well-defined. If A ∈ F1 and B ∈ F2, we have ν((A×B)x) = ν(B)1A(x), (7.7) which implies that G contains F1 × F2. Note that F1 × F2 defines a pi-system on Ω1 × Ω2. We now show that G is a λ-system. Indeed, we have (i) Ω1 × Ω2 ∈ G. (ii) For E ∈ G, we can write ν((Ec)x) = ν((Ex)c) = ν(Ω2)− ν(Ex), and so Ec ∈ G. 69 (iii) Let E1, E2, · · · ∈ G, where Ei ∩ Ej = ∅ for i 6= j. Then, by countable additivity of ν, we get ν (( ∞⋃ i=1 Ei ) x ) = ν ( ∞⋃ i=1 (Ei)x ) = ∞∑ i=1 ν((Ei)x), so that ⋃∞ i=1Ei ∈ G. Therefore, the pi-λ Theorem gives F1 ⊗ F2 = σ(F1 × F2) ⊆ G ⊆ F1 ⊗ F2, concluding the proof. Now, suppose that ν is σ-finite, and let {Di}∞i=1 be a sequence of disjoint events in F2 with ν(Dj) <∞, ∀j ∈ N , and ⋃∞ i=1Di = Ω2. Define the finite measures ν1, ν2, . . . by νk(B) = ν(B ∩Dk), k ∈ N. 
Given E ∈ F1 ⊗F2, each x 7→ νk(Ex) is F1-measurable, hence ν(Ex) = ∞∑ i=1 ν(B ∩Di) = ∞∑ i=1 νi(Ex), is also F1-measurable. Theorem 7.8 (Product measure). Let (Ω1,F1, µ) and (Ω2,F2, ν) be two σ- finite measure spaces. Then, there is a unique measure µ⊗ν on the σ-algebra F1 ⊗F2 such that (µ⊗ ν)(A×B) = µ(A)ν(B), ∀A ∈ F1, B ∈ F2. It is given by (µ⊗ ν)(E) = ∫ Ω1 ν(Ex)µ(dx) = ∫ Ω2 µ(Ey)ν(dy), ∀E ∈ F1 ⊗F2. The measure µ⊗ ν is called the product of µ and ν. Proof. [Not examinable] We start with uniqueness of the measure µ⊗ ν. Since F1 × F2 is a pi-system that generates F1 ⊗ F2, we can apply Theorem 2.43 directly to get uniqueness. Moreover, to get existence as well as the second formula, it is enough to show that each of those two integrals yield a measure which satisfies the first one. We know from Lemma 7.6 that, given E ∈ F1 ⊗ F2, the functions x 7→ ν(Ex) and y 7→ µ(Ey) are F1-measurable and F2-measurable, respectively. So we can define the set functions (µ ⊗ ν)1 : F1 ⊗ F2 → [0,∞] and (µ ⊗ ν)2 : F1 ⊗F2 → [0,∞] by (µ⊗ ν)1(E) = ∫ Ω1 ν(Ex)µ(dx) and (µ⊗ ν)2(E) = ∫ Ω2 µ(Ey)ν(dy). 70 Let us check that (µ⊗ ν)1 is a measure: (i) (µ⊗ ν)1(∅) = ∫ Ω1 ν(∅)µ(dx) = 0. (ii) Let E1, E2, · · · ∈ F1 ⊗ F2, where Ei ∩ Ej = ∅ for i 6= j. By countable additivity of ν, we have (µ⊗ ν)1( ∞⋃ i=1 Ei) = ∫ Ω1 ν(( ∞⋃ i=1 Ei)x)µ(dx) = ∫ Ω1 ν( ∞⋃ i=1 (Ei)x)µ(dx) = = ∞∑ i=1 ∫ Ω1 ν((Ei)x)µ(dx) = ∞∑ i=1 (µ⊗ ν)1(Ei). By analogous argument, (µ⊗ ν)2 is also a measure. Moreover, for every A×B ∈ F1 ×F2, by (7.7) we have (µ⊗ ν)1(A×B) = µ(A)ν(B), and analogously we can prove (µ ⊗ ν)2(A × B) = µ(A)ν(B). As noted in the first paragraph, this concludes the proof. Example 7.9. We return to the case of R2 = R×R. We know from Example 7.2 that B(R2) = B(R)⊗B(R). Let m1 and m2 denote the Lebesgue measure on R and R2, respectively. For each rectangle of the form (a1, b1]× (a2, b2], −∞ < ai 6 bi < +∞, i = 1, 2, its measure is given by m2((a1, b1]× (a2, b2]) = (b1 − a1)(b2 − a2) = = m1((a1, b1])m1((a2, b2]) = (m1 ⊗m1)((a1, b1]× (a2, b2]). (7.10) Therefore, we can deduce that m2 = m1 ⊗m1 on B(R2). This is precisely the motivation for the construction of the Lebesgue measure m2 on R2. 4 7.3 Fubini and Tonelli Theorems 7.3.1 Fubini and Tonelli Theorems Having constructed product measures, we move on to the fundamental theorems that allow us to integrate a measurable function via iterated integrals, which we can compute. 71 Theorem 7.11 (Tonelli Theorem). Let (Ω1,F1, µ) and (Ω2,F2, ν) be two σ- finite measure spaces, and let f : Ω1×Ω2 → [0,∞] be a non-negative F1⊗F2- measurable function. Then, the function x 7→ ∫Ω2 fx dν is F1-measurable, the function y 7→ ∫Ω1 fy dµ is F2-measurable, and∫ Ω1×Ω2 f d(µ⊗ ν) = ∫ Ω1 (∫ Ω2 fx dν ) µ(dx) = ∫ Ω2 (∫ Ω1 fy dµ ) ν(dy). Proof. We show this for the case of f = 1E , for some E ∈ F1⊗F2, as linearity of the integral extends this to any non-negative simple function, and applying the Monotone Convergence Theorem generalises the theorem to any non-negative measurable function. Fix x ∈ Ω1 and y ∈ Ω2, and notice that fx = (1E)x = 1Ex and fy = (1E)y = 1Ey . Thus, we get ∫ Ω2 fx dν = ν(Ex) and ∫ Ω1 fy dµ = µ(Ey). Note that measurability of the functions x 7→ ∫Ω2 fx dν and y 7→ ∫Ω1 fy dµ follows from Lemma 7.6. Equality of the three integrals in the statement follows from Theorem 7.8. Theorem 7.12 (Fubini Theorem). Let (Ω1,F1, µ) and (Ω2,F2, ν) be two σ-finite measure spaces, and let f : Ω1 × Ω2 → R be an F1 ⊗ F2-measurable and µ⊗ν-integrable function. Then, the section fx is ν-integrable µ-a.e. 
Theorem 7.12 (Fubini Theorem). Let (Ω_1, F_1, µ) and (Ω_2, F_2, ν) be two σ-finite measure spaces, and let f : Ω_1 × Ω_2 → R be an F_1 ⊗ F_2-measurable and µ ⊗ ν-integrable function. Then, the section f_x is ν-integrable for µ-a.e. x, the section f^y is µ-integrable for ν-a.e. y, and

∫_{Ω_1×Ω_2} f d(µ ⊗ ν) = ∫_{Ω_1} (∫_{Ω_2} f_x dν) µ(dx) = ∫_{Ω_2} (∫_{Ω_1} f^y dµ) ν(dy).

In the above iterated integrals, we can take the integrand to be zero on the (zero-measure) set where the inner integral is undefined.

Proof. Let f⁺ and f⁻ denote the positive and negative parts of f, respectively. From Lemma 7.5 (ii), we know that each f_x is F_2-measurable, and so are each (f⁺)_x and (f⁻)_x. Theorem 7.11 and integrability of f (i.e. integrability of f⁺ and f⁻) imply that the functions x ↦ ∫_{Ω_2} (f⁺)_x dν and x ↦ ∫_{Ω_2} (f⁻)_x dν are F_1-measurable and µ-integrable, hence finite µ-a.e. Thus, x ↦ ∫_{Ω_2} f_x dν is finite µ-a.e., i.e. f_x is ν-integrable for µ-a.e. x.

Now, consider the set N given by

N = {x ∈ Ω_1 : ∫_{Ω_2} |f_x| dν = ∞}.

As argued above, N ∈ F_1 and µ(N) = 0. By Theorem 7.11 and linearity of the integral, we get

∫_{Ω_1×Ω_2} f d(µ ⊗ ν) = ∫_{Ω_1×Ω_2} f⁺ d(µ ⊗ ν) − ∫_{Ω_1×Ω_2} f⁻ d(µ ⊗ ν)
= ∫_{Ω_1} (∫_{Ω_2} (f⁺)_x dν) µ(dx) − ∫_{Ω_1} (∫_{Ω_2} (f⁻)_x dν) µ(dx)
= ∫_{N^c} (∫_{Ω_2} (f⁺)_x dν) µ(dx) − ∫_{N^c} (∫_{Ω_2} (f⁻)_x dν) µ(dx)
= ∫_{N^c} (∫_{Ω_2} f_x dν) µ(dx) = ∫_{Ω_1} (∫_{Ω_2} f_x dν) µ(dx),

where we use that the integral over a set of measure zero is zero. The proof that ∫_{Ω_1×Ω_2} f d(µ ⊗ ν) = ∫_{Ω_2} (∫_{Ω_1} f^y dµ) ν(dy) is analogous.

If either the positive part f⁺ or the negative part f⁻ of f fails to be integrable, the conclusion of the theorem may fail. However, if f is non-negative, as in Theorem 7.11, then the integrals are allowed to be infinite.

7.3.2 Applications of Fubini and Tonelli Theorems

Example 7.13. Let (Ω, F, µ) be a σ-finite measure space, and let m be the Lebesgue measure on (R, B(R)). Consider a non-negative F-measurable function f : Ω → [0, ∞], and let E be the set given by

E = {(ω, y) ∈ Ω × R : 0 ≤ y < f(ω)}.

Observe that E encapsulates the region under the graph of f. Moreover, E lies in the product σ-algebra F ⊗ B(R), since f is F-measurable. Then, we can compute

(µ ⊗ m)(E) = ∫_Ω m(E_ω) µ(dω) = ∫_Ω m({y ∈ R : 0 ≤ y < f(ω)}) µ(dω) = ∫_Ω f(ω) µ(dω).

On the other hand, we have

(µ ⊗ m)(E) = ∫_R µ(E^y) m(dy) = ∫_0^∞ µ({ω ∈ Ω : 0 ≤ y < f(ω)}) dy,

which gives the equality

∫_Ω f(ω) µ(dω) = ∫_0^∞ µ({ω ∈ Ω : 0 ≤ y < f(ω)}) dy.

This relation tells us that we can integrate a non-negative measurable function f with respect to a σ-finite measure µ by instead integrating the function y ↦ µ(E^y) (with E as defined above), which can sometimes be easier to compute. △

To gain more intuition on Theorems 7.11 and 7.12, we revisit two important properties of doubly-indexed sequences.

Proof of Theorem 1.13. Let µ denote the counting measure on (N, P(N)); note that µ is σ-finite. Consider the function f : N × N → [0, ∞] given by f(m, n) = x_{m,n} for m, n ∈ N. Since P(N) ⊗ P(N) = P(N × N), the function f is P(N) ⊗ P(N)-measurable. Moreover,

∫_{N×N} f d(µ ⊗ µ) = ∑_{(m,n)∈N²} x_{m,n},
∫_N (∫_N f_m dµ) µ(dm) = ∑_{m=1}^∞ ∑_{n=1}^∞ x_{m,n},  and
∫_N (∫_N f^n dµ) µ(dn) = ∑_{n=1}^∞ ∑_{m=1}^∞ x_{m,n},

which are all equal by Theorem 7.11, completing the proof.

Proof of Theorem 1.14. Let µ denote the counting measure on (N, P(N)), and consider the function f : N × N → R given by f(m, n) = x_{m,n} for m, n ∈ N. Since P(N) ⊗ P(N) = P(N × N), the function f is P(N) ⊗ P(N)-measurable. Moreover, the sum ∑_{(m,n)∈N²} x_{m,n} converges absolutely if and only if f is µ ⊗ µ-integrable. Thus, by Theorem 7.12, we can conclude as for Theorem 1.13 that

∑_{m=1}^∞ ∑_{n=1}^∞ x_{m,n} = ∑_{(m,n)∈N²} x_{m,n} = ∑_{n=1}^∞ ∑_{m=1}^∞ x_{m,n}.

In other words, for absolutely convergent (not necessarily non-negative) doubly-indexed series, the order of summation can be reversed.
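A quick numerical sketch of this last fact: for an absolutely convergent double array, the two iterated sums agree. The array x_{m,n} = (−1)^{m+n}/(m²n²) and the truncation level below are illustrative choices standing in for the infinite sums.

```python
import numpy as np

# Illustration of Theorem 1.14 (Fubini for series): for an absolutely
# convergent double array, the two iterated sums agree. Here
# x_{m,n} = (-1)^(m+n) / (m^2 * n^2), truncated to a large finite block.
M = N = 500
m = np.arange(1, M + 1)
n = np.arange(1, N + 1)
x = ((-1.0) ** (m[:, None] + n[None, :])) / (m[:, None] ** 2 * n[None, :] ** 2)

print(x.sum(axis=1).sum())   # sum over n first, then over m
print(x.sum(axis=0).sum())   # sum over m first, then over n
# Both approach (sum_k (-1)^k / k^2)^2 = (pi^2/12)^2 ~ 0.6765
```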
7.4 Independence

In this section, we introduce various notions of independence and how they relate to each other.

7.4.1 Independence of σ-algebras

Let (Ω, F, P) be a probability space. Typically, the σ-algebra F on Ω is called the family of events, so that an event is an F-measurable subset of Ω. We know that two events A and B are said to be independent if their joint probability equals the product of their probabilities, that is,

P(A ∩ B) = P(A) P(B),   (7.14)

and two random variables X and Y are said to be independent if

P(X ∈ C, Y ∈ D) = P(X ∈ C) P(Y ∈ D),  ∀C, D ∈ B(R).   (7.15)

We also know it is possible to have A_1 independent of A_2, A_2 independent of A_3, and A_3 independent of A_1, without P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2) P(A_3). This notion of pairwise independence can be useful at times (for instance, the variance of the sum is the sum of the variances, etc.), but here we are interested in a stronger notion of independence. For A_1, A_2, A_3 to be independent, we require that

P(B_1 ∩ B_2 ∩ B_3) = P(B_1) P(B_2) P(B_3)

for all choices of B_j ∈ {∅, A_j, A_j^c, Ω}. This motivates the following definition.

Definition 7.16 (Independent σ-algebras). Let (Ω, F, P) be a probability space and G_1, . . . , G_n a collection of sub-σ-algebras of F. We say that G_1, . . . , G_n are independent if

P(A_1 ∩ · · · ∩ A_n) = ∏_{i=1}^n P(A_i),  ∀A_i ∈ G_i, i = 1, . . . , n.

The abstract formulation in terms of σ-algebras turns out to be very powerful and versatile, as we will now see.

Definition 7.17 (Independent random variables). Let (Ω, F, P) be a probability space and X_1, . . . , X_n a collection of random variables. We say that X_1, . . . , X_n are independent if the sub-σ-algebras of F given by σ(X_1), . . . , σ(X_n) are independent.

Remark 7.18. The random variables X_1, . . . , X_n on a given probability space (Ω, F, P) are independent if and only if

P(X_1 ∈ B_1, . . . , X_n ∈ B_n) = ∏_{i=1}^n P(X_i ∈ B_i),  ∀B_i ∈ B(R), i = 1, . . . , n.

Indeed, the events in σ(X_j) are exactly those of the form {X_j ∈ B_j} for B_j ∈ B(R). △

Exercise 7.19. Suppose X_1 and X_2 are independent random variables. Suppose also that g_1, g_2 : R → R are Borel-measurable functions. Show that g_1(X_1) and g_2(X_2) are independent random variables. △

Finally, the definition of independent events is conveniently formulated in terms of independent random variables.

Definition 7.20 (Independent events). Let (Ω, F, P) be a probability space and A_1, . . . , A_n a collection of events. We say that A_1, . . . , A_n are independent if the random variables 1_{A_1}, . . . , 1_{A_n} are independent.

Remark 7.21. Events A_1, . . . , A_n on a given probability space (Ω, F, P) are independent if and only if the sub-σ-algebras of F given by G_1, . . . , G_n are independent, where G_i = σ({A_i}) = {∅, Ω, A_i, A_i^c}, i = 1, . . . , n. △

7.4.2 Independence and product measures

We now turn to product measures as discussed in the previous sections.

Remark 7.22 (Longer products). Given σ-finite measure spaces (Ω_j, F_j, µ_j) for j = 1, . . . , n, we can define F_1 ⊗ · · · ⊗ F_n := σ(F_1 × · · · × F_n). We can also define µ_1 ⊗ · · · ⊗ µ_n as the unique measure ν such that ν(A_1 × · · · × A_n) = µ_1(A_1) · · · µ_n(A_n). The theory studied in the previous sections is the particular case n = 2. Uniqueness follows from the π-λ Theorem as before, and existence follows by taking (µ_1 ⊗ · · · ⊗ µ_{n−1}) ⊗ µ_n. △

We define the distribution of a collection of random variables X_1, . . . , X_n defined on (Ω, F, P) by

P_{X_1,...,X_n}(A) := P[(X_1, . . . , X_n) ∈ A],  A ∈ B(R^n).

From Remarks 7.18 and 7.22, X_1, . . . , X_n are independent if and only if their joint distribution satisfies

P_{X_1,...,X_n} = P_{X_1} ⊗ · · · ⊗ P_{X_2} ⊗ · · · ⊗ P_{X_n}.   (7.22a)
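As a sanity check on the product form (7.22a), and a preview of the criterion established next, the following Monte Carlo sketch compares a joint probability with the product of the marginal probabilities for two simulated independent variables; the particular distributions, sample size, and thresholds are illustrative assumptions.

```python
import numpy as np

# A quick Monte Carlo sanity check of the product rule (7.22a) through
# distribution functions: for independent X ~ N(0,1) and Y ~ Exp(1),
# P(X <= a, Y <= b) should match P(X <= a) * P(Y <= b).
rng = np.random.default_rng(0)
n = 10**6
X = rng.standard_normal(n)
Y = rng.exponential(1.0, n)

a, b = 0.5, 1.0
joint = np.mean((X <= a) & (Y <= b))
product = np.mean(X <= a) * np.mean(Y <= b)
print(joint, product)   # the two estimates agree up to Monte Carlo error
```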
We now establish a useful and well-known criterion for checking independence of random variables via their cumulative distribution functions.

Proposition 7.23 (Criterion for independent random variables). Let X_1, . . . , X_n be random variables on a probability space (Ω, F, P). Then, the variables X_1, . . . , X_n are independent if and only if

P(X_1 ≤ a_1, . . . , X_n ≤ a_n) = ∏_{i=1}^n P(X_i ≤ a_i),  ∀a_1, . . . , a_n ∈ R.

Proof. The direct implication is immediate. Now suppose the above equality holds for all a_1, . . . , a_n ∈ R. Define A_{a_1,...,a_n} := (−∞, a_1] × · · · × (−∞, a_n] and E := {A_{a_1,...,a_n}}_{a_1,...,a_n ∈ R}. The above equality says that P_{X_1,...,X_n} and P_{X_1} ⊗ · · · ⊗ P_{X_n} coincide on E. Now note that E is a π-system and σ(E) = B(R^n). By Theorem 2.43, P_{X_1,...,X_n} = P_{X_1} ⊗ · · · ⊗ P_{X_n}, completing the proof.

Exercise 7.24. Suppose X and Y are independent integrable random variables. Show that XY is an integrable random variable and E[XY] = (EX)(EY). △

Hints. Let h(ω) = (X(ω), Y(ω)) ∈ R², and let f(x, y) = x and g(x, y) = y. You will need to combine Theorem 5.17 and (7.22a) with Tonelli to show that fg is integrable against h∗P, and with Fubini to justify the iterated integrals.¹⁰

¹⁰ A less abstract but more laborious approach would be to start from the non-negative case, take independent simple approximations, expand the expression for the integral, use the MCT, and finally bootstrap to the integrable case.

8 Convergence of measurable functions

Convergence is related to approximation, and as such it is at the core of Measure Theory and Probability Theory. To say that a number x_n is close to a fixed number y has a unique meaning: their difference is small. More precisely, x_n → y if |x_n − y| can be made as small as we wish by requiring n to be large. The situation becomes considerably more involved when we replace numbers by functions. If we are considering Borel functions f_n and g on a measure space (Ω, F, µ), or random variables X_n and Y that can be observed in a given probabilistic model, there are several different notions of convergence, each with its own meaning and implications.

8.1 Modes of convergence

8.1.1 Convergence of Borel functions

Definition 8.1 (a.e. convergence). Let (Ω, F, µ) be a measure space and (f_n)_{n≥1}, f be Borel functions. We say that (f_n)_{n≥1} converges to f µ-a.e. if

µ{ω : f_n(ω) ↛ f(ω)} = 0.

In this case, we write f_n →^{a.e.} f. The measure µ is implicit in the notation.

Let us consider another closely-related definition and then move on to some examples to see how the two compare.

Definition 8.2 (Convergence in measure). Let (Ω, F, µ) be a measure space and (f_n)_{n≥1}, f be Borel functions. We say that (f_n)_{n≥1} converges to f in measure if, for all ε > 0,

lim_{n→∞} µ{|f_n − f| > ε} = 0.

In this case, we write f_n →^µ f.

Example 8.3. Consider the measure space ([0, 1], B([0, 1]), m|_{[0,1]}) and the sequence (f_n)_{n≥1} of Borel functions given by

f_n = 1_{[0, 1/n)},  n ≥ 1.

Then, we can see that

m|_{[0,1]}(f_n ↛ 0) = m|_{[0,1]}({0}) = 0,

and so f_n →^{a.e.} 0. Moreover, for any 0 < ε < 1,

m{|f_n| > ε} = 1/n → 0,

and for ε ≥ 1 this measure is zero for all n, so f_n →^m 0. △
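A numerical companion to Example 8.3 (the grid approximation below is an illustrative device, not part of the notes): pointwise, f_n(x) → 0 for every x > 0, while the measure of the exceedance set {|f_n| > ε} is 1/n → 0.

```python
import numpy as np

# Numerical companion to Example 8.3: f_n = 1_[0, 1/n) on [0, 1].
# We approximate m{|f_n| > eps} by the fraction of grid points where
# f_n exceeds eps; the fraction is ~ 1/n, matching the computation above.
xs = np.linspace(0.0, 1.0, 10**6, endpoint=False)
eps = 0.5
for n in (1, 10, 100, 1000):
    fn = (xs < 1.0 / n).astype(float)   # f_n evaluated on the grid
    print(n, np.mean(fn > eps))          # ~ 1/n
```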
Example 8.4 (Travelling wave). Consider the measure space (R, B(R), m), and consider the sequence of Borel functions (f_n)_{n≥1}, where f_n = 1_{[n,n+1)}, which resembles a travelling wave. Notice that f_n(x) → 0 for every x; in particular, f_n(x) → 0 for a.e. x. However, we have

m{|f_n| > ε} = 1 ↛ 0  for ε = 1/2,

which implies that f_n does not converge to 0 in measure. △

Example 8.5 (Dancing wave). Consider the probability space (R, B(R), m|_{[0,1]}), and consider the sequence of Borel functions (f_{n,k}), where

f_{n,k} = 1_{(k/2^n, (k+1)/2^n]},  n ≥ 0, k = 0, . . . , 2^n − 1,

which resembles a dancing wave. Figure 8.1 illustrates the construction of f_{n,k}.

[Figure 8.1: First elements of the ‘dancing wave’: f_{0,0}, f_{1,0}, f_{1,1}, f_{2,0}, f_{2,1}, f_{2,2}, f_{2,3}.]

As n gets larger, each sub-interval gets shorter, so that for any 0 < ε < 1 and any p ≥ 1 we have

m{|f_{n,k}| > ε} = m((k/2^n, (k+1)/2^n]) = 2^{−n} → 0.

On the other hand, given n ≥ 0, letting k range from 0 to 2^n − 1 ensures that the sequence still oscillates between 0 and 1 at every x ∈ (0, 1]. We now order the pairs (n, k) sequentially as follows. For each j ≥ 1, there is a unique pair n ≥ 0 and 0 ≤ k < 2^n such that j = 2^n + k. Write g_j for the corresponding f_{n,k}. From the above observations, g_j → 0 in measure as j → ∞, but the set of x at which g_j(x) converges contains no point of (0, 1]. △

Definition 8.6 (Convergence in the vector space L^p). Let (Ω, F, µ) be a measure space. For p ≥ 1, we define L^p(µ) as the space of Borel functions f such that ∫_Ω |f|^p dµ < ∞. Given f and (f_n)_{n≥1} in L^p(µ), we say that (f_n)_{n≥1} converges to f in L^p if

lim_{n→∞} ∫_Ω |f_n − f|^p dµ = 0.

In this case, we write f_n →^{L^p} f.

Remark 8.7. To see that L^p is a vector space, use |f + g|^p ≤ |2f|^p + |2g|^p. △

Example 8.8 (Dancing wave revisited). Consider the ‘growing dancing wave’ given by f̃_{n,k} = 2^{n/2} f_{n,k}, and define g̃_j as the corresponding re-ordered sequence. Then for p = 1 we have

∫_{[0,1]} |f̃_{n,k}| dm = 2^{−n/2} → 0,

so g̃_j → 0 in L¹ as j → ∞. On the other hand,

∫_{[0,1]} |f̃_{n,k}|² dm = 1 ↛ 0,

so g̃_j ↛ 0 in L² as j → ∞. △

Example 8.9 (Integrating the tail). Let f(x) = e^{−|x|} on (R, B(R), m), and take the sequence f_n(x) = (1/n) f(x/n). We can think of f_n as ‘spreading’ f by a factor of n. Then f_n(x) → 0 for every x ∈ R, and

∫_R |f_n(x)|² m(dx) = (1/n) ∫_R |f(x)|² m(dx) → 0,

so f_n → 0 in L² as n → ∞. On the other hand,

∫_R |f_n(x)| m(dx) = ∫_R |f(x)| m(dx) = 2 ↛ 0,

so f_n ↛ 0 in L¹ as n → ∞. Note that 0 ≤ f_n(t) ≤ 1/n for every t ∈ R, so f_n →^{a.e.} 0. △

8.1.2 Convergence of random variables

When the measure spaces are probability spaces, we use terminology more suitable to random variables. To this end, let (Ω, F, P) be a probability space and (X_n)_{n≥1}, X be random variables.

(i) We say that (X_n)_{n≥1} converges to X P-a.s. if

P(lim_{n→∞} X_n = X) = 1.

In this case, we write X_n →^{a.s.} X (the measure P is implicit).

(ii) We say that (X_n)_{n≥1} converges to X in probability if, for every ε > 0,

lim_{n→∞} P(|X_n − X| > ε) = 0.

In this case, we write X_n →^P X.

(iii) Finally, we say that (X_n)_{n≥1} converges to X in L^p (or in p-th moment), p ∈ [1, ∞), if X ∈ L^p(P), i.e. E|X|^p < ∞, and

lim_{n→∞} E|X_n − X|^p = 0.

In this case, we write X_n →^{L^p} X.

Remark 8.10. For a probability space, if X_n →^{L^p} X, then EX_n → EX. △

Proof. Using Exercise 5.6 and Jensen’s inequality:

|EX_n − EX|^p ≤ (E|X_n − X|)^p ≤ E|X_n − X|^p → 0.
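Before turning to convergence in distribution, here is a small numerical companion to Example 8.8; the grid approximation of the integrals is an illustrative device. The L¹ norm of the growing dancing wave decays like 2^{−n/2} while the L² norm stays at 1.

```python
import numpy as np

# Numerical companion to Example 8.8: the growing dancing wave
# f~_{n,k} = 2**(n/2) * 1_((k/2^n, (k+1)/2^n]) has L1 norm 2**(-n/2) -> 0
# but L2 norm 1 for every (n, k). We approximate the integrals on a grid.
xs = np.linspace(0.0, 1.0, 2**20, endpoint=False) + 2.0**-21  # midpoints
for n in (2, 6, 10):
    k = 0                                    # any k in {0, ..., 2^n - 1}
    ind = (xs > k / 2**n) & (xs <= (k + 1) / 2**n)
    f = 2**(n / 2) * ind.astype(float)
    print(n, f.mean(), (f**2).mean())        # L1 ~ 2**(-n/2), L2 ~ 1
```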
Recall that a random variable X on a probability space (Ω, F, P) is equipped with a cumulative distribution function F_X(x) = P(X ≤ x), x ∈ R. In particular, we can discuss convergence of a sequence of random variables via their distribution functions.

Definition 8.11 (Convergence in distribution). Let (Ω, F, P) be a probability space and (X_n)_{n≥1}, X be random variables with (F_{X_n})_{n≥1} and F_X their corresponding distribution functions. We say that (X_n)_{n≥1} converges to X in distribution if

lim_{n→∞} F_{X_n}(x) = F_X(x)

for every point x ∈ R at which F_X is continuous. In this case, we write X_n →^d X.

Remark 8.12. Convergence in distribution is only concerned with the associated distribution functions. In particular, a sequence of random variables (X_n)_{n≥1} converging in distribution to some random variable X might even be defined on a different probability space from that of X. △

Example 8.13. Let (Ω, F, P) be a probability space and (X_n)_{n≥1} a sequence of random variables, each Geometric with success probability p_n, that is,

F_{X_n}(x) = 1 − (1 − p_n)^x,  x ∈ N, n ≥ 1.

Further, assume that n p_n → λ > 0 as n → ∞. We would like to show that (X_n/n)_{n≥1} converges in distribution to the exponential distribution with parameter λ. First, for x ∈ [0, ∞) and n ≥ 1, we can write

F_{X_n/n}(x) = P(X_n ≤ nx) = P(X_n ≤ ⌊nx⌋) = 1 − (1 − p_n)^{⌊nx⌋} = 1 − (1 − (np_n)/n)^{nx} (1 − (np_n)/n)^{⌊nx⌋−nx}.

Now, by observing that

(1 − (np_n)/n)^{nx} → e^{−λx}  and  (1 − (np_n)/n)^{⌊nx⌋−nx} → 1  as n → ∞,

we get

F_{X_n/n}(x) → 1 − e^{−λx}  as n → ∞,

which is precisely the distribution function of the exponential distribution with parameter λ, and so we are done. △

Example 8.14. Let (Ω, F, P) be a probability space and (X_n)_{n≥1} a sequence of random variables, each with 1/n and 1 as equiprobable outcomes, that is,

P(X_n = 1/n) = P(X_n = 1) = 1/2,  n ≥ 1.

Under this construction, we expect (X_n)_{n≥1} to converge in distribution to a Bernoulli random variable X with parameter 1/2. Indeed, for each x ∈ R, we have

F_{X_n}(x) = { 0 for x < 1/n;  1/2 for 1/n ≤ x < 1;  1 for x ≥ 1 },

which converges, as n → ∞, to

{ 0 for x ≤ 0;  1/2 for 0 < x < 1;  1 for x ≥ 1 }.

The right-hand side is precisely the distribution function of X, except at x = 0. But this is a discontinuity point of F_X, so we can still deduce that X_n →^d X. △

Example 8.15. Let (Ω, F, P) be a probability space and (X_n)_{n≥1} a sequence of random variables, each Cauchy distributed with location parameter 0 and scale parameter 1/n, that is,

f_{X_n}(x) = 1/((1/n) π (1 + n²x²)) = n/(π(1 + n²x²)),  x ∈ R, n ≥ 1.

From Figure 8.2, we can see that as n gets larger, f_{X_n} concentrates around 0, so we claim that (X_n)_{n≥1} converges in distribution to 0. For x ∈ R and n ≥ 1, we have

F_{X_n}(x) = P(X_n ≤ x) = ∫_{−∞}^x n/(π(1 + n²y²)) dy = [arctan(ny)/π]_{−∞}^x = 1/2 + arctan(nx)/π.

[Figure 8.2: Probability density functions of the Cauchy distribution with location parameter 0 and scale parameters 1, 1/2, 1/3.]

Now, taking the limit as n → ∞ gives

lim_{n→∞} F_{X_n}(x) = { 0 for x < 0;  1/2 for x = 0;  1 for x > 0 },

which is the distribution function of the constant random variable X = 0, except at x = 0, which is a point of discontinuity of F_X; hence we can still deduce that X_n →^d 0. In addition, we show convergence in probability. For each ε > 0, we have

P(|X_n| > ε) = 2 ∫_ε^∞ n/(π(1 + n²x²)) dx = 2 [arctan(nx)/π]_ε^∞ = 1 − 2 arctan(nε)/π.

Taking the limit as n → ∞ yields lim_{n→∞} P(|X_n| > ε) = 0; thus X_n →^P 0. △

Remark 8.16. In the above example, convergence in distribution also follows from Proposition 8.42, which we will see later on. △

Theorem 8.17 (Convergence in distribution in terms of test functions). For random variables X and (X_n)_n, we have X_n →^d X if and only if Ef(X_n) → Ef(X) for every bounded continuous function f : R → R.

Proof. See [Kle14, Theorem 13.23].

We shall later see that convergence in distribution is the weakest among these modes of convergence. Nevertheless, it is quite often useful in practice.
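A simulation companion to Example 8.13 (the parameter values, sample size, and seed below are illustrative assumptions): the empirical distribution function of X_n/n is compared against 1 − e^{−λx}.

```python
import numpy as np

# Simulation of Example 8.13: if X_n ~ Geometric(p_n) with n * p_n -> lambda,
# then X_n / n converges in distribution to Exponential(lambda). We compare
# the empirical CDF of X_n / n with 1 - exp(-lambda * x) for one large n.
rng = np.random.default_rng(1)
lam, n, samples = 2.0, 10**4, 10**5
p_n = lam / n
Xn = rng.geometric(p_n, size=samples)       # values in {1, 2, 3, ...}

for x in (0.25, 0.5, 1.0, 2.0):
    empirical = np.mean(Xn / n <= x)
    limit = 1.0 - np.exp(-lam * x)
    print(x, round(empirical, 4), round(limit, 4))
```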
8.2 Almost sure convergence and Borel-Cantelli

To study convergence of real-valued random variables at “most” or “almost all” points ω ∈ Ω, we need a systematic way of expressing the convergence X_n → X in terms of events that are amenable to study.

8.2.1 Infinitely often and eventually

Definition 8.18 (Infinitely often and eventually). Suppose (Ω, F) is a measurable space and (A_n)_{n≥1} is a sequence of events (i.e. measurable sets). We define the event “A_n infinitely often,” abbreviated “A_n i.o.” and also denoted lim sup_{n→∞} A_n, as

lim sup_{n→∞} A_n := ⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m.

This event is the set of all ω such that ω ∈ A_n for infinitely many A_n’s, that is, the set of ω such that, for all n ≥ 1, there exists m ≥ n with ω ∈ A_m. Occurrence of the event lim sup_n A_n means that infinitely many of the events {A_n}_n occur.

We define the event “A_n eventually,” also denoted lim inf_{n→∞} A_n, as

lim inf_{n→∞} A_n := ⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m.

This event is the set of all ω such that ω ∈ A_n for all but finitely many A_n’s, that is, the set of ω such that, for some n ≥ 1 and all m ≥ n, one has ω ∈ A_m. Occurrence of the event lim inf_n A_n means that all but finitely many of the events {A_n}_n occur.

Remark 8.19. By De Morgan’s laws, we have {A_n i.o.}^c = {A_n^c eventually} and {A_n^c i.o.} = {A_n eventually}^c. △

Exercise 8.20. Prove the above identities. △

Let us point out already how the notions of eventually and infinitely often relate to convergence. A sequence of numbers (x_n)_n converges to a number x if, for every ε > 0, we have |x_n − x| < ε for all n sufficiently large, which means |x_n − x| < ε for all but finitely many n’s, which means |x_n − x| < ε eventually. Likewise, x_n ↛ x if for some ε > 0 we have |x_n − x| ≥ ε infinitely often.

Proposition 8.21 (Criterion for a.e. convergence). Let (Ω, F, µ) be a measure space and (f_n)_{n≥1}, f be Borel functions. Then, (f_n)_{n≥1} converges to f µ-a.e. if and only if

µ({|f_n − f| > ε i.o.}) = 0 for all ε > 0.

Proof. We have the following chain of equivalences:

µ{f_n ↛ f} = 0
⇕ (definition of limit)
µ{∃ε > 0 : |f_n − f| > ε i.o.} = 0
⇕ (take 1/(k+1) < ε ≤ 1/k)
µ{∃k ∈ N : |f_n − f| > 1/k i.o.} = 0
⇕ (continuity from below of µ)
lim_{k→∞} µ{|f_n − f| > 1/k i.o.} = 0
⇕ (non-decreasing numbers)
∀k ∈ N, µ{|f_n − f| > 1/k i.o.} = 0
⇕ (take 1/(k+1) < ε ≤ 1/k)
∀ε > 0, µ{|f_n − f| > ε i.o.} = 0.

This completes the proof.

8.2.2 Borel-Cantelli lemmas

The above proposition gives us enough motivation to estimate the measure of the event that a sequence of events occurs infinitely often.

Lemma 8.22 (Borel-Cantelli Lemma I). Let (Ω, F, µ) be a measure space and (A_n)_{n≥1} a sequence of measurable sets. If

∑_{n=1}^∞ µ(A_n) < ∞,

then µ(lim sup_n A_n) = 0.

Proof. The idea is to find appropriate upper bounds that converge to 0. Using continuity from above and countable subadditivity of µ, we have

µ({A_n i.o.}) = µ(⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m) = lim_{n→∞} µ(⋃_{m=n}^∞ A_m) ≤ lim_{n→∞} ∑_{m=n}^∞ µ(A_m) = 0.

The last limit is zero because it is the tail of a convergent series. (Continuity from above applies here since µ(⋃_{m=1}^∞ A_m) ≤ ∑_m µ(A_m) < ∞.)
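The first Borel-Cantelli Lemma can be seen in simulation (the choice P(A_n) = 1/n², the sample sizes, and the seed are illustrative assumptions): with summable probabilities, each sample path sees only finitely many of the events, and occurrences at large indices are rare.

```python
import numpy as np

# Simulation companion to Borel-Cantelli Lemma I: independent events A_n
# with P(A_n) = 1/n**2, so that sum P(A_n) < infinity. Almost surely only
# finitely many A_n occur along a sample path.
rng = np.random.default_rng(2)
paths, N = 10**4, 10**3
occur = rng.random((paths, N)) < 1.0 / np.arange(1, N + 1)**2

print(occur.sum(axis=1).mean())             # mean count ~ pi^2/6 ~ 1.645
print(np.mean(occur[:, 100:].any(axis=1)))  # P(some A_n, n > 100, occurs)
                                            # <= sum_{n>100} 1/n^2 ~ 0.01
```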
Lemma 8.23 (Borel-Cantelli Lemma II). Let (Ω, F, P) be a probability space and (A_n)_{n≥1} a sequence of independent events. If

∑_{n=1}^∞ P(A_n) = ∞,

then P(lim sup_n A_n) = 1.

Proof. We look instead at {A_n i.o.}^c. By Remark 8.19, continuity from below and from above, and independence of (A_n^c)_{n≥1}, we get

P({A_n i.o.}^c) = P(⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m^c)
= lim_{n→∞} P(⋂_{m=n}^∞ A_m^c)
= lim_{n→∞} lim_{k→∞} P(⋂_{m=n}^k A_m^c)
= lim_{n→∞} lim_{k→∞} ∏_{m=n}^k (1 − P(A_m))
≤ lim_{n→∞} lim_{k→∞} ∏_{m=n}^k e^{−P(A_m)}
= lim_{n→∞} lim_{k→∞} e^{−∑_{m=n}^k P(A_m)} = e^{−∞} = 0,

where the upper bound follows from the inequality 1 − x ≤ e^{−x} for x ≥ 0.

Remark 8.24. In practice, we can apply Borel-Cantelli Lemma I to deduce almost everywhere convergence: if we can show that, for every ε > 0,

∑_{n=1}^∞ µ(|f_n − f| > ε) < ∞,

then by Proposition 8.21 we can conclude that f_n →^{a.e.} f. △

Example 8.25. Let (Ω, F, P) be a probability space and (X_n)_{n≥1} a sequence of independent random variables, each uniformly distributed on [0, 1]. Consider the sequence of random variables (Y_n)_{n≥1} given by

Y_n = min{X_1, . . . , X_n},  n ≥ 1.

We show that Y_n →^{a.s.} 0. Given n ∈ N, independence and the distribution of X_1, . . . , X_n yield, for each ε ∈ (0, 1),

P(|Y_n − 0| > ε) = P(|Y_n| > ε) = P(X_1 > ε, . . . , X_n > ε) = P(X_1 > ε)^n = (1 − ε)^n,

and consequently

∑_{n=1}^∞ P(|Y_n| > ε) = ∑_{n=1}^∞ (1 − ε)^n = (1 − ε)/ε < ∞.

Hence, for each ε ∈ (0, 1), Borel-Cantelli Lemma I gives P({|Y_n| > ε i.o.}) = 0, so by Proposition 8.21 we have Y_n →^{a.s.} 0. △

Exercise 8.26. Let (Ω, F, P) be a probability space and (X_n)_{n≥1} a sequence of random variables with distribution given by

P(X_n = a_n) = p_n  and  P(X_n = 0) = 1 − p_n,  n ≥ 1,

for sequences (a_n)_n and (p_n)_n with ∑_{n=1}^∞ p_n < ∞. Show that X_n →^{a.s.} 0. △

Example 8.27. Suppose a_n ↑ ∞ and p_n ↓ 0 with ∑_n p_n = ∞, and that (X_n)_n is a sequence of independent random variables distributed as

P(X_n = a_n) = p_n  and  P(X_n = 0) = 1 − p_n,  n ≥ 1.

Then P(X_n → 0) = 0 but X_n →^P 0 as n → ∞. △

Remark 8.28. When applying Borel-Cantelli Lemma II to argue that almost sure convergence does not hold, we must be careful to check independence of the random variables concerned. △

Example 8.29. Let (Ω, F, P) = ([0, 1], B, m|_{[0,1]}), X_n(ω) = 1_{[0,1/n]}(ω) and X(ω) = 0 for all ω. Then P(X_n → 0) = 1, since X_n(ω) → 0 for all ω ≠ 0. However, P(X_n = 1) = 1/n, which is non-summable. If we used the same reasoning as in the previous example without noticing that the variables (X_n)_n are dependent, we would conclude that P(X_n → 0) = 0, which is false. △

Proposition 8.30. If P(A_n i.o.) = 0, then P(A_n) → 0 as n → ∞.

Proof. Suppose P(A_n i.o.) = 0, and take B_k = ⋃_{n≥k} A_n. Since B_k ↓ {A_n i.o.} as k → ∞, we have P(B_k) → P(A_n i.o.) = 0. On the other hand, B_k ⊇ A_k, thus P(A_k) ≤ P(B_k), and therefore P(A_n) → 0 as n → ∞.

8.3 Laws of Large Numbers

The Law of Large Numbers needs no introduction. Let X_1, X_2, X_3, . . . be a sequence of random variables, and define S_n := X_1 + · · · + X_n. We say that the sequence (X_n)_n satisfies the law of large numbers if

(S_n − ES_n)/n → 0 as n → ∞.

We say that (X_n)_n satisfies the weak LLN if the above convergence holds in probability, and the strong LLN in case the convergence is almost sure.

8.3.1 Laws of Large Numbers with second and fourth moments

Theorem 8.31 (Chebyshev’s Weak Law of Large Numbers). Let (X_n)_n be a sequence of random variables such that Cov(X_i, X_j) = 0 for every i ≠ j and sup_n V X_n < ∞. Then (X_n)_n satisfies the weak law of large numbers.

Proof. By Chebyshev’s inequality (5.24),

P(|(S_n − ES_n)/n| > ε) ≤ V(S_n/n)/ε² = V S_n/(ε²n²) = (∑_{i=1}^n V X_i)/(ε²n²) ≤ (n · sup_k V X_k)/(ε²n²) → 0,

concluding the proof.
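The weak law and the Chebyshev bound used in its proof can both be observed in simulation; the uniform distribution, the value of ε, and the sample sizes below are illustrative assumptions.

```python
import numpy as np

# Simulation companion to Theorem 8.31: for i.i.d. (hence uncorrelated)
# centred variables with finite variance, S_n / n concentrates near 0,
# and Chebyshev gives  P(|S_n/n| > eps) <= sup_k V(X_k) / (eps^2 * n).
rng = np.random.default_rng(3)
paths, eps = 10**4, 0.1
X = rng.uniform(-1.0, 1.0, size=(paths, 10**4))   # E(X) = 0, V(X) = 1/3

for n in (10**2, 10**3, 10**4):
    Sn_over_n = X[:, :n].mean(axis=1)
    empirical = np.mean(np.abs(Sn_over_n) > eps)
    chebyshev = (1.0 / 3.0) / (eps**2 * n)
    print(n, empirical, min(chebyshev, 1.0))       # empirical <= bound
```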
Theorem 8.32 (Cantelli’s Strong Law of Large Numbers). Let X_1, X_2, . . . be independent random variables such that sup_n E X_n^4 < ∞. Then (X_n)_n satisfies the strong law of large numbers.

Proof. We can suppose that EX_k = 0 for all k (otherwise, replace X_k by X_k − EX_k, which is again independent with uniformly bounded fourth moments). Expand S_n^4 as

S_n^4 = (X_1 + · · · + X_n)^4 = ∑_{i,j,k,l} X_i X_j X_k X_l
= ∑_i X_i^4 + (4!/(2!2!)) ∑_{i<j} X_i² X_j² + (4!/3!) ∑_{i≠k} X_i³ X_k + (4!/2!) ∑_{j<k, i∉{j,k}} X_i² X_j X_k + 4! ∑_{i<j<k<l} X_i X_j X_k X_l.

By independence, the expectation of each term containing some X_k to the first power factorises through EX_k = 0, so the last three sums have zero expectation and

E S_n^4 = ∑_i E X_i^4 + 6 ∑_{i<j} E[X_i² X_j²].

Write M := sup_k E X_k^4 < ∞. By the Cauchy-Schwarz inequality, E[X_i² X_j²] ≤ √(E X_i^4) √(E X_j^4) ≤ M. Hence

E S_n^4 ≤ nM + 6 (n(n−1)/2) M = (3n² − 2n) M ≤ 3n² M.

By Markov’s inequality, we have

P(|S_n/n| > ε) ≤ E S_n^4/(ε⁴n⁴) ≤ 3M/(ε⁴n²),

which is summable in n, so by Borel-Cantelli Lemma I and Proposition 8.21 we get S_n/n →^{a.s.} 0.

8.3.2 Laws of Large Numbers with first moment

Theorem 8.33 (Kolmogorov’s Strong Law of Large Numbers). Let X_1, X_2, . . . be pairwise independent and identically distributed integrable random variables. Then (X_n)_n satisfies the strong law of large numbers.

Proof. See [Kle14, Theorem 5.17] for the proof found by Etemadi in 1981.

It should be emphasised that, when (X_n)_n are i.i.d., the condition that X_1 be integrable is the weakest possible. It is in fact equivalent to (X_1 + · · · + X_n)/n →^{a.s.} µ for some µ ∈ R, as shown by the following proposition.

Proposition 8.34. Suppose (X_n)_n are i.i.d. and not integrable. Then, almost surely, the sequence S_n/n does not converge to any real number.

Proof. For every fixed k ∈ N, the variable X_1/k is not integrable. Since (X_n)_n are identically distributed, by Proposition 5.35 we have

∑_n P(|X_n|/k > n) = ∑_n P(|X_1|/k > n) = ∞.

By Borel-Cantelli Lemma II, almost surely the event |X_n| > kn occurs for infinitely many values of n. Since a countable intersection of almost sure events is almost sure, taking the intersection over k gives

P(for every k, |X_n| > kn for infinitely many values of n) = 1.

We now show that, for any sequence (x_n)_n satisfying the above condition, we have lim sup_n |s_n/n| = +∞, where s_n := x_1 + · · · + x_n. This implies the proposition. So we have to show that, for every r ∈ N, there exists an arbitrarily large m such that |s_m/m| > r. Suppose (x_n)_n satisfies the above condition and let r ∈ N be given. Take k = 2r and n (arbitrarily large) such that |x_n| > kn. If |s_{n−1}/(n−1)| > r, we take m = n − 1 and we are done. So suppose |s_{n−1}/(n−1)| ≤ r. Then

|x_1 + · · · + x_n|/n ≥ |x_n|/n − |x_1 + · · · + x_{n−1}|/n ≥ |x_n|/n − |x_1 + · · · + x_{n−1}|/(n−1) ≥ |x_n|/n − r > k − r = r,

and we can take m = n. This concludes the proof.

8.4 Relation between modes of convergence

8.4.1 Almost-sure convergence and subsequences

Proposition 8.35 (Convergence in measure implies a.e. convergence along a subsequence). Let (Ω, F, µ) be a measure space and (f_n)_{n≥1}, f Borel functions with f_n →^µ f. Then, there exists a subsequence such that f_{n_k} →^{a.e.} f as k → ∞.

Proof. First, for each n ≥ 1 and k ∈ N, set A_{n,k} := {|f_n − f| > 1/k}. Since f_n →^µ f, for each k ∈ N we have

lim_{n→∞} µ(A_{n,k}) = lim_{n→∞} µ(|f_n − f| > 1/k) = 0.

In other words, for each k ∈ N, there exists n_k ∈ N such that, for any n ≥ n_k,

µ(A_{n,k}) = µ(|f_n − f| > 1/k) < 1/k².

We can also assume that n_k > n_{k−1} for k ≥ 2. Set B_k := A_{n_k,k}, and notice that

∑_{k=1}^∞ µ(B_k) ≤ ∑_{k=1}^∞ 1/k² < ∞,

so µ(B_k i.o.) = 0 by Borel-Cantelli Lemma I. For every ω outside the null set {B_k i.o.}, we have |f_{n_k}(ω) − f(ω)| ≤ 1/k for all large enough k, thus f_{n_k} →^{a.e.} f as k → ∞.

Convergence a.e. does not imply convergence in measure, as shown in Example 8.4. However, the implication does hold in finite measure spaces.

Proposition 8.36 (a.s. convergence implies convergence in probability). Let (Ω, F, P) be a probability space and (X_n)_{n≥1}, X be random variables. If X_n → X a.s. as n → ∞, then X_n → X in probability as n → ∞.

Proof. Since X_n →^{a.s.} X, by Proposition 8.21 we have

P(|X_n − X| > ε i.o.) = 0,  ∀ε > 0.

Then, we can apply Proposition 8.30 to get

lim_{n→∞} P(|X_n − X| > ε) = 0,  ∀ε > 0,

so that X_n →^P X, as required.
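The subsequence extraction in Proposition 8.35 can be seen concretely on the dancing wave of Example 8.5. The following sketch (the evaluation point x = 0.3 is an illustrative choice) shows the full sequence g_j returning to 1 again and again, while along the subsequence j_n = 2^n (that is, k = 0) the values are eventually 0.

```python
import numpy as np

# Companion to Proposition 8.35 and Example 8.5: the dancing wave g_j
# (with j = 2**n + k) converges to 0 in measure but at no point of (0, 1];
# yet along the subsequence j_n = 2**n it converges at every x > 0, since
# g_{2**n}(x) = 1 only when x <= 2**(-n).
def g(j, x):
    n = int(np.floor(np.log2(j)))
    k = j - 2**n
    return float(k / 2**n < x <= (k + 1) / 2**n)

x = 0.3
print([g(j, x) for j in range(1, 33)])     # keeps returning to 1: no limit at x
print([g(2**n, x) for n in range(0, 12)])  # subsequence: eventually all 0
```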
8.4.2 Convergence in L^p and in expectation

Proposition 8.37 (Convergence in L^p implies convergence in measure). Let (Ω, F, µ) be a measure space and (f_n)_{n≥1}, f be Borel functions. If f_n → f in L^p as n → ∞, then f_n → f in measure as n → ∞.

Proof. Since f_n →^{L^p} f for some p ≥ 1, we have

lim_{n→∞} ∫_Ω |f_n − f|^p dµ = 0.

By Chebyshev’s inequality (Theorem 5.22), for all ε > 0 it follows that

µ(|f_n − f| > ε) ≤ (1/ε^p) ∫_Ω |f_n − f|^p dµ → 0 as n → ∞,

thus f_n converges to f in measure, i.e. f_n →^µ f.

Proposition 8.38 (Comparison of L^p and L^q). For a probability space, if q ≥ p ≥ 1 and X_n →^{L^q} X, then X_n →^{L^p} X.

Proof. The proof goes as follows:

(E|X_n − X|^p)^{1/p} ≤ (E|X_n − X|^q)^{1/q} → 0,

so we only need to justify the first inequality. Let Y be a non-negative random variable. Take integrable random variables (Z_n)_n such that 0 ≤ Z_n ↑ Y^p (for instance, Z_n = min{Y, n}^p). The function f : [0, ∞) → [0, ∞) given by f(x) = x^{q/p} is convex. Hence, by Jensen’s inequality, we have

(EZ_n)^{q/p} ≤ E Z_n^{q/p} ≤ E Y^q.

By the Monotone Convergence Theorem, EZ_n ↑ EY^p, so we can conclude that (EY^p)^{q/p} ≤ EY^q. Applying this to Y = |X_n − X| completes the proof.

Example 8.39. Let (Ω, F, P) = ([0, 1], B, m), and take X_n = n 1_{[0,1/n)}. Then X_n → 0 a.s., but ∫_Ω |X_n| dP = 1, so X_n ↛ 0 in L¹, and hence in no L^p with p ≥ 1. △

Examples 8.8 and 8.39 show that a.s. convergence does not imply convergence in L^p, nor does convergence in L^p imply a.s. convergence. Combined with Propositions 8.36 and 8.37, convergence in probability implies neither of them.

Theorem 8.40 (Dominated Convergence Theorem revisited). Let (Ω, F, µ) be a measure space, and (f_n)_{n≥1} a sequence of Borel functions such that f_n →^µ f. If there exists a non-negative Borel function g ∈ L^p(µ), p ∈ [1, ∞), such that |f_n| ≤ g for all n ≥ 1, then f_n →^{L^p} f. If moreover p = 1 or µ is a probability measure, then

lim_{n→∞} ∫_Ω f_n dµ = ∫_Ω f dµ.

Proof. Convergence of the integral follows from Remark 8.10 if µ(Ω) = 1, and from Exercise 5.6 if p = 1, so we only need to prove that f_n → f in L^p. Let us first suppose that f_n →^{a.e.} f. By noticing that

|f_n(ω) − f(ω)|^p ≤ 2^p |g(ω)|^p,  ω ∈ Ω,

we can use Theorem 5.42 to deduce that

lim_{n→∞} ∫_Ω |f_n − f|^p dµ = 0.   (8.41)

Now, let us assume instead that f_n →^µ f. Consider a subsequence (f_{n_k})_k. Then f_{n_k} →^µ f as k → ∞. By Proposition 8.35, there is a sub-subsequence (f_{n_{k_j}})_j such that f_{n_{k_j}} →^{a.e.} f as j → ∞. By the previous case, we have ∫_Ω |f_{n_{k_j}} − f|^p dµ → 0 as j → ∞. Using Lemma 0.1, we get (8.41), concluding the proof.

8.4.3 Convergence in distribution

Convergence in distribution is actually the weakest of these concepts, and it is implied by all the other modes of convergence.

Proposition 8.42 (Convergence in probability implies convergence in distribution). Let (Ω, F, P) be a probability space and (X_n)_{n≥1}, X be random variables. If X_n → X in probability, then X_n → X in distribution as n → ∞.

Proof. Let f : R → R be a bounded continuous function. By Theorem 8.17, it is enough to show that

Ef(X_n) → Ef(X)   (8.43)

as n → ∞. First, suppose that X_n →^{a.s.} X. Since f is continuous, we have

{ω : X_n(ω) → X(ω)} ⊆ {ω : f(X_n(ω)) → f(X(ω))},

hence f(X_n) →^{a.s.} f(X). Since f is bounded, |f(X_n)| ≤ sup_{s∈R} |f(s)| < ∞ for all n ≥ 1. Applying Theorem 8.40 with a constant dominating function g, we get (8.43). If instead we suppose X_n →^P X, we can get (8.43) by taking sub-subsequences and invoking Lemma 0.1, as we did in the proof of Theorem 8.40.

Example 8.44. Let Ω = {0, 1}, P = ½δ_0 + ½δ_1, X_n(ω) = ω and Y(ω) = 1 − ω. Then, as n → ∞, X_n → Y in distribution but not in probability, since |X_n − Y| = 1 for every ω. △
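Example 8.44 is easy to reproduce in simulation (the sampling of ω and the seed below are illustrative devices): X_n and Y have identical distributions, yet they are never close to each other.

```python
import numpy as np

# Companion to Example 8.44: with Omega = {0, 1} and P uniform, X_n(w) = w
# and Y(w) = 1 - w share the Bernoulli(1/2) distribution, so X_n -> Y in
# distribution trivially; but |X_n - Y| = 1 always, so there is no
# convergence in probability.
rng = np.random.default_rng(4)
omega = rng.integers(0, 2, size=10**6)    # samples of omega under P
X, Y = omega, 1 - omega
print(X.mean(), Y.mean())                 # both ~ 0.5: same distribution
print(np.mean(np.abs(X - Y) > 0.5))       # = 1.0: P(|X_n - Y| > 1/2) = 1
```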
So convergence in distribution does not imply convergence in probability in general, but there is one situation in which it does.

Proposition 8.45 (Convergence in distribution to a constant limit). Suppose X_n →^d X and X = c a.s. for some c ∈ R. Then X_n →^P c.

Proof. First, notice that by assumption F_{X_n}(x) → 0 if x < c and F_{X_n}(x) → 1 if x > c. Let ε > 0. Then

P(|X_n − c| > ε) ≤ P(X_n ≤ c − ε) + P(X_n > c + ε) = F_{X_n}(c − ε) + 1 − F_{X_n}(c + ε) → 0 + 1 − 1 = 0,

so X_n →^P c, as claimed.

In summary, for probability spaces, the relationship between the four notions of convergence is represented in the following diagram:

[Diagram: solid arrows L^q ⇒ L^p (q ≥ p ≥ 1), L^p ⇒ P, a.s. ⇒ P, and P ⇒ d; dashed arrows P ⇒ a.s. along a subsequence, P ⇒ L^p in the dominated case, and d ⇒ P for a constant limit.]

The general implications shown in the diagram are Propositions 8.36, 8.37, 8.38 and 8.42. No other implication holds in general, as shown by Examples 8.8, 8.39 and 8.44. The dashed arrows refer to implications that hold in particular cases, namely Propositions 8.35 and 8.45 and Theorem 8.40.

References

[Bar95] R. G. Bartle. The Elements of Integration and Lebesgue Measure. Wiley Classics Library. John Wiley & Sons, Inc., New York, 1995.

[Coh13] D. L. Cohn. Measure Theory. Birkhäuser Advanced Texts: Basel Textbooks. Birkhäuser/Springer, New York, second edn., 2013.

[Dem20] A. Dembo. Probability Theory: STAT310/MATH230, 2020. Lecture notes, Stanford. Available online as of June 2020.

[Edg98] G. A. Edgar. Integral, Probability, and Fractal Measures. Springer-Verlag, New York, 1998.

[Kle14] A. Klenke. Probability Theory. Universitext. Springer, London, second edn., 2014.

[Kub15] C. S. Kubrusly. Essentials of Measure Theory. Springer, Cham, 2015.

[Shi16] A. N. Shiryaev. Probability. 1, vol. 95 of Graduate Texts in Mathematics. Springer, New York, third edn., 2016.

[Tao11] T. Tao. An Introduction to Measure Theory, vol. 126 of Graduate Studies in Mathematics. American Mathematical Society, 2011. Available online as of June 2020.

[Tay73] S. J. Taylor. Introduction to Measure and Integration. Cambridge University Press, 1973.