STAT 5611: Statistical Methodology — Multivariate Analysis

1. Multivariate Normal.

A random variable (r.v.) X has a standard normal distribution N(0, 1) if X has density

  φ(x) = (2π)^{−1/2} e^{−x²/2},  −∞ < x < ∞.

X has moment generating function

  E(e^{tX}) = ∫ e^{tx} (2π)^{−1/2} e^{−x²/2} dx = ∫ (2π)^{−1/2} e^{−(x−t)²/2} e^{t²/2} dx = e^{t²/2},

so EX = 0 and Var X = 1.

If Y = µ + σX, σ > 0, then Y ∼ N(µ, σ²), with mean µ and variance σ². Y has density (2πσ²)^{−1/2} e^{−(y−µ)²/(2σ²)} and mgf

  E(e^{tY}) = e^{tµ} E(e^{tσX}) = exp(tµ + σ²t²/2).

Let U = (U₁, …, U_p)ᵀ be a p-vector of NID(0, 1) r.v.'s. U has joint density (2π)^{−p/2} exp(−uᵀu/2), EU = 0 and Cov U = EUUᵀ = I. Write U ∼ N_p(0, I).

X has a non-singular multivariate normal distribution with mean vector µ and non-singular covariance matrix Σ if X = µ + AU, where Σ = AAᵀ (e.g. A = Σ^{1/2}).

The density of X is obtained by transforming the density of U. The Jacobian of the transformation is

  |J| = |∂(x₁, …, x_p)/∂(u₁, …, u_p)| = |Σ^{1/2}| = |Σ|^{1/2}, as |A|² = |Σ|.

Thus X has density

  (2π)^{−p/2} |Σ|^{−1/2} exp[−(x − µ)ᵀΣ^{−1}(x − µ)/2],

since uᵀu = [A^{−1}(x − µ)]ᵀ[A^{−1}(x − µ)] = (x − µ)ᵀΣ^{−1}(x − µ), as (A^{−1})ᵀA^{−1} = (AAᵀ)^{−1} = Σ^{−1}.

X has mgf

  M(t) = E(e^{tᵀX}) = E(exp(tᵀµ + tᵀAU)) = exp(tᵀµ) E(e^{aᵀU}), where aᵀ = tᵀA,
     = exp(tᵀµ) exp(aᵀa/2), as aᵀU ∼ N(0, aᵀa),
     = exp(tᵀµ + tᵀΣt/2).

This expression for the mgf is well defined even when Σ is singular, which leads to the following definition of a multivariate normal random vector.

Definition. A p-vector X is said to have a multivariate normal distribution if and only if every linear combination aᵀX is univariate normal.

Consider the mgf of aᵀX:

  ψ(t) = E(e^{t aᵀX}) = M(ta) = exp(t aᵀµ + t² aᵀΣa/2),

so aᵀX is univariate normal by the uniqueness theorem for mgf's. Conversely, if tᵀX is univariate normal for all t then E(e^{tᵀX}) = exp(tᵀµ + tᵀ Cov(X) t/2).

If Σ is singular and has rank r < p then Σ has p − r eigenvalues equal to 0.
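As a brief aside, the non-singular construction X = µ + AU can be sketched numerically. This is a minimal sketch assuming NumPy is available; the values of µ and Σ are illustrative, and any square root A with AAᵀ = Σ works — the Cholesky factor is one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Any A with A @ A.T == Sigma works; the Cholesky factor is one choice.
A = np.linalg.cholesky(Sigma)
U = rng.standard_normal((100_000, 2))   # rows are N(0, I)
X = mu + U @ A.T                        # rows are N(mu, Sigma)

print(X.mean(axis=0))                   # close to mu
print(np.cov(X, rowvar=False))          # close to Sigma
```

The symmetric square root Σ^{1/2} from an eigendecomposition would serve equally well; the distribution of X depends on A only through AAᵀ.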
There is an orthogonal matrix P such that

  PᵀΣP = ( D 0 ; 0 0 ), where D = diag(λ₁, …, λ_r), λ₁ ≥ λ₂ ≥ … ≥ λ_r > 0.

Let γ = Pᵀµ and let γ₁ be the vector formed by the first r elements of γ. X has mgf exp(tᵀµ + tᵀΣt/2). Let Y = PᵀX. Then Y has mgf

  E(e^{tᵀY}) = E(e^{tᵀPᵀX}) = exp(tᵀγ + tᵀPᵀΣPt/2) = exp(tᵀγ + t₁ᵀDt₁/2), where t₁ = (t₁, …, t_r)ᵀ.

Thus (Y₁, …, Y_r)ᵀ ∼ N_r(γ₁, D) and P(Y_j = γ_j) = 1 for j > r. Hence if X is multivariate normal with covariance matrix of rank r, there exists a linear transformation Y = G(X − µ) such that Y_{r+1} = … = Y_p = 0 and (Y₁, …, Y_r)ᵀ ∼ N_r(0, I).

Marginal and Conditional Distributions

If X ∼ N_p(µ, Σ) then aᵀX ∼ N(aᵀµ, aᵀΣa), so by choosing e_i = (0, …, 0, 1, 0, …, 0)ᵀ with 1 in the ith position we have X_i ∼ N(µ_i, σ_ii), where Σ = (σ_ij). Thus all marginal distributions are normal. Marginal normality is not enough to ensure joint normality.

Write

  X = ( X₁ ; X₂ ), µ = ( µ₁ ; µ₂ ), Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ),

where X₁ is an r-vector and X₂ is an s = p − r vector; Σ₁₁ is r × r, Σ₁₂ is r × s, etc.

Theorem 1. X₁ and X₂ are independent if and only if Σ₁₂ = 0.

Proof. Independence implies X₁ and X₂ are elementwise uncorrelated, so Σ₁₂ = 0. If Σ₁₂ = 0 then X has mgf

  M(t) = exp(tᵀµ + tᵀΣt/2) = exp(t₁ᵀµ₁ + t₁ᵀΣ₁₁t₁/2) · exp(t₂ᵀµ₂ + t₂ᵀΣ₂₂t₂/2),

and so X₁ and X₂ are independent, with X₁ ∼ N_r(µ₁, Σ₁₁) and X₂ ∼ N_s(µ₂, Σ₂₂).

Theorem 2. If X ∼ N_p(µ, Σ) then the conditional distribution of X₂ given X₁ = x is

  N( µ₂ + Σ₂₁Σ₁₁^{−1}(x − µ₁), Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂ ).

Proof. Let Y₂ = MX₁ + X₂ be a transformation such that Y₂ and X₁ are uncorrelated. Cov(Y₂, X₁) = MΣ₁₁ + Σ₂₁, so M = −Σ₂₁Σ₁₁^{−1} and

  ( X₁ ; Y₂ ) = ( I 0 ; −Σ₂₁Σ₁₁^{−1} I ) ( X₁ ; X₂ ).

Thus X₁ and Y₂ are jointly multivariate normal and uncorrelated, and so are independent. Y₂ ∼ N(µ₂ − Σ₂₁Σ₁₁^{−1}µ₁, Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂), as Cov(X₂ + MX₁) = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂. Thus X₂ = Y₂ + Σ₂₁Σ₁₁^{−1}X₁, where Y₂ and X₁ are independent.
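The decomposition in this proof can be checked by simulation. This is a sketch assuming NumPy, with an illustrative Σ: the residual Y₂ = X₂ − Σ₂₁Σ₁₁^{−1}X₁ should be uncorrelated with X₁ and have covariance Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂.

```python
import numpy as np

rng = np.random.default_rng(8)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])
A = np.linalg.cholesky(Sigma)
X = rng.standard_normal((200_000, 3)) @ A.T    # rows ~ N_3(0, Sigma)

r = 1                                          # X1 = first coordinate, X2 = the rest
Sig11, Sig12 = Sigma[:r, :r], Sigma[:r, r:]
Sig21, Sig22 = Sigma[r:, :r], Sigma[r:, r:]

# Y2 = X2 - Sig21 Sig11^{-1} X1 (rowwise) should be uncorrelated with X1,
# with covariance Sigma_{2.1} = Sig22 - Sig21 Sig11^{-1} Sig12.
Y2 = X[:, r:] - X[:, :r] @ np.linalg.solve(Sig11, Sig12)
C = np.cov(np.hstack([X[:, :r], Y2]), rowvar=False)
print(C[0, 1:])        # near 0
print(C[1:, 1:])       # near Sigma_{2.1}
```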
Hence the conditional distribution of X₂ given X₁ = x is

  N( µ₂ + Σ₂₁Σ₁₁^{−1}(x − µ₁), Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂ ).

Write Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂. Further, if

  Σ^{−1} = ( Σ¹¹ Σ¹² ; Σ²¹ Σ²² )

then Σ₂·₁^{−1} = Σ²².

Partial Correlation. If X = (X₁, X₂, X₃ᵀ)ᵀ then the best linear estimate of X₁ in terms of X₃ is bᵀX₃, where b is such that E(X₁ − bᵀX₃)² is a minimum. Assume µ = 0. Then

  E(X₁ − bᵀX₃)² = σ₁₁ − 2bᵀΣ₃₁ + bᵀΣ₃₃b
       = σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁ + (b − Σ₃₃^{−1}Σ₃₁)ᵀΣ₃₃(b − Σ₃₃^{−1}Σ₃₁)
       = σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁, when b = Σ₃₃^{−1}Σ₃₁.

Thus the best linear estimator of X₁ is Σ₁₃Σ₃₃^{−1}X₃; similarly the best linear estimator of X₂ is Σ₂₃Σ₃₃^{−1}X₃.

The partial correlation between X₁ and X₂ removing the effect of X₃ is ρ₁₂·₃₄…p, defined to be the simple correlation between X₁ − Σ₁₃Σ₃₃^{−1}X₃ and X₂ − Σ₂₃Σ₃₃^{−1}X₃. Hence

  ρ₁₂·₃₄…p = (σ₁₂ − Σ₁₃Σ₃₃^{−1}Σ₃₂) / [ (σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁)^{1/2} (σ₂₂ − Σ₂₃Σ₃₃^{−1}Σ₃₂)^{1/2} ] = −σ¹² / (σ¹¹σ²²)^{1/2},

where ( σ¹¹ σ¹² ; σ²¹ σ²² ) is the leading 2 × 2 block of Σ^{−1}, since

  ( σ¹¹ σ¹² ; σ²¹ σ²² ) = [ ( σ₁₁ σ₁₂ ; σ₂₁ σ₂₂ ) − ( Σ₁₃ ; Σ₂₃ ) Σ₃₃^{−1} ( Σ₁₃ ; Σ₂₃ )ᵀ ]^{−1}.

The above definition does not require X to be multivariate normal; however, if X is multivariate normal then ρ₁₂·₃…p is just the simple correlation coefficient between X₁ and X₂ in their conditional distribution given X₃.

Recall that if only one variable is adjusted for, the simple calculation formula for the sample partial correlation, given correlation matrix R = (r_ij), is

  r₁₂·₃ = (r₁₂ − r₁₃r₂₃) / √[(1 − r₁₃²)(1 − r₂₃²)].

Maximum Likelihood Estimates for µ and Σ

Suppose X₁, …, X_n are independent N_p(µ, Σ) random vectors. The likelihood is

  L(µ, Σ) = ∏_{i=1}^n (2π)^{−p/2}|Σ|^{−1/2} exp[−(X_i − µ)ᵀΣ^{−1}(X_i − µ)/2]
     = (2π)^{−np/2}|Σ|^{−n/2} exp[−½ Σ_{i=1}^n tr(Σ^{−1}(X_i − µ)(X_i − µ)ᵀ)]
     = (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr Σ^{−1}{ Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ + n(X̄ − µ)(X̄ − µ)ᵀ }],

where X̄ = n^{−1} Σ_{i=1}^n X_i. Let S = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ. X̄ and S are clearly sufficient for µ and Σ. The maximum over µ is

  (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr(Σ^{−1}S)), attained at µ̂ = X̄.

Let λ₁, …, λ_p be the eigenvalues of ΣS^{−1}; λ_i > 0, as S and Σ are positive definite.
Then tr(Σ^{−1}S) = Σ_{i=1}^p λ_i^{−1} and |Σ| = |ΣS^{−1}||S| = (∏_{i=1}^p λ_i)|S|, so

  L(X̄, Σ) = (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p ( λ_i^{−n/2} e^{−λ_i^{−1}/2} ).

Let f(x) = log(xⁿ e^{1/x}) = n log x + 1/x. We can show that f(x) ≥ n − n log n, with equality at x = 1/n. Hence

  L(X̄, Σ) = (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p [ λ_iⁿ e^{1/λ_i} ]^{−1/2}
     ≤ (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p [ n^{−n} eⁿ ]^{−1/2}
     = (2π)^{−np/2}|n^{−1}S|^{−n/2} e^{−np/2},

and this maximum is achieved if ΣS^{−1} = n^{−1}I, i.e. Σ̂ = n^{−1}S. Hence the MLE for µ is µ̂ = X̄ and the MLE for Σ is n^{−1}S.

2. Hypothesis Testing

Given samples from multivariate normal populations we can develop tests using the likelihood ratio approach. Let L(x, θ) = ∏_{i=1}^n f(x_i, θ) denote the likelihood, where θ ∈ Θ and Θ denotes the parameter space. To test H₀: θ ∈ Θ₀ against the general alternative, the LRT statistic is

  Λ = sup_{θ∈Θ₀} L(x, θ) / sup_{θ∈Θ} L(x, θ).

Under regularity conditions that are always satisfied by the multivariate normal distribution, −2 log Λ →_D χ²_r, where r = dim Θ − dim Θ₀, the number of linearly independent parameters constrained by H₀.

Let X_i ∼ NID_p(µ, Σ), i = 1, …, n.

1. Test H₀: µ = µ₀ vs H₁: µ ≠ µ₀, Σ known.

The MLEs for µ and Σ are X̄ and n^{−1}S, so

  sup_µ L(x, µ, Σ) = (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr Σ^{−1}S).

If H₀ is true,

  L(x, µ₀, Σ) = (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr Σ^{−1}S − (n/2)(x̄ − µ₀)ᵀΣ^{−1}(x̄ − µ₀)),

so −2 log Λ = n(X̄ − µ₀)ᵀΣ^{−1}(X̄ − µ₀) ∼ χ²_p, exactly.

2. Test H₀: µ = µ₀ vs H₁: µ ≠ µ₀, Σ unknown.

  sup L(x, µ, Σ) = (2π)^{−np/2}|S/n|^{−n/2} exp(−np/2),

and if µ = µ₀,

  sup_Σ L(x, µ₀, Σ) = sup_Σ (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr Σ^{−1}S − ½ tr(Σ^{−1} n(x̄ − µ₀)(x̄ − µ₀)ᵀ)]
        = sup_Σ (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr nΣ^{−1}(S/n + ddᵀ)],

where d = x̄ − µ₀. By analogy with the MLE derivation,

  sup_Σ L(x, µ₀, Σ) = (2π)^{−np/2}|S/n + ddᵀ|^{−n/2} exp(−np/2),

so

  −2 log Λ = n log( |S + nddᵀ| / |S| ) = n log(1 + ndᵀS^{−1}d).

To establish this last equality we need the following result.

Lemma. For any square matrix M = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), assume M₁₁^{−1} exists and let M₂·₁ = M₂₂ − M₂₁M₁₁^{−1}M₁₂. Then |M| = |M₁₁||M₂·₁|.

Proof. Let B = ( I 0 ; −M₂₁M₁₁^{−1} I ).
Note |B| = 1 and BM = ( M₁₁ M₁₂ ; 0 M₂·₁ ), which is block triangular, so |M| = |BM| = |M₁₁||M₂·₁|.

Applying the Lemma to

  ( S d ; −dᵀ 1 ) ( I 0 ; dᵀ 1 ) = ( S + ddᵀ d ; 0ᵀ 1 )

gives |S|(1 + dᵀS^{−1}d) = |S + ddᵀ|.

Thus the statistic for testing H₀: µ = µ₀ is Hotelling's T²:

  T² = n(X̄ − µ₀)ᵀ( S/(n−1) )^{−1}(X̄ − µ₀).

We will prove that the exact distribution of T² is ((n−1)p/(n−p)) F_{p,n−p}.

3. Test H₀: Σ = Σ₀, µ unknown.

The MLE for µ is X̄, so if H₀ is true

  sup_µ L(x, µ, Σ₀) = (2π)^{−np/2}|Σ₀|^{−n/2} exp(−½ tr(Σ₀^{−1}S)),

whereas without restrictions

  sup L(x, µ, Σ) = (2π)^{−np/2} |S/n|^{−n/2} exp(−np/2).

The LRT statistic is

  −2 log Λ = −n log|Σ₀^{−1}S| − np + tr(Σ₀^{−1}S) + np log n
      = −n log|Σ₀^{−1}S/n| − np + n tr(Σ₀^{−1}S/n).

If Σ₀^{−1}S/n has eigenvalues λ₁ ≥ … ≥ λ_p, a = Σ_{i=1}^p λ_i and g = ∏_{i=1}^p λ_i, then the test statistic is

  n(a − p − log g) ∼ χ²_{p(p+1)/2}, for large n.

4. Test for Independence. H₀: Σ₁₂ = 0, µ unknown, where Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ), Σ₁₁ is p₁ × p₁, Σ₂₂ is p₂ × p₂ and p₁ + p₂ = p. If H₀ is true then we can treat the data as two independent samples:

  Σ̂ = (1/n) ( S₁₁ 0 ; 0 S₂₂ ), µ̂ = X̄.

The LRT statistic is −2 log Λ = −n log|Σ̂^{−1}S/n| − np + tr(Σ̂^{−1}S), and

  Σ̂^{−1}S = n ( S₁₁ 0 ; 0 S₂₂ )^{−1} ( S₁₁ S₁₂ ; S₂₁ S₂₂ ) = n ( I S₁₁^{−1}S₁₂ ; S₂₂^{−1}S₂₁ I ),

so tr(Σ̂^{−1}S) = np. Thus

  −2 log Λ = −n log|Σ̂^{−1}S/n| = −n log( |S| / (|S₁₁||S₂₂|) ) = −n log( |S₂·₁| / |S₂₂| ) = −n log|I − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂|.

We will show Λ^{2/n} = |S₂·₁| / |(S₂₂ − S₂·₁) + S₂·₁| ∼ Λ(p₂, n − p₁ − 1, p₁), Wilks' Lambda, and

  −[n − (p+3)/2] log Λ^{2/n} ≈ χ²_{p₁p₂}, for large n.

If p₂ = 1 and p₁ = p − 1 then the above statistic simplifies to −2 log Λ = −n log(1 − R²), where R is the multiple correlation coefficient between X_p and (X₁, …, X_{p−1}), with R² = S₂₁S₁₁^{−1}S₁₂/s_pp. Then Λ^{2/n} = 1 − R² ∼ Λ(1, n − p, p − 1), and we can show that

  R²/(1 − R²) ∼ ((p−1)/(n−p)) F_{p−1,n−p}, exactly.

3. Results on Quadratic Forms.

Let X ∼ N_p(0, I) and let C be a square, symmetric matrix.

Theorem 3. XᵀCX ∼ χ²_r if and only if C is idempotent of rank r.

Proof.
If C is idempotent of rank r then C = UUᵀ, where U is p × r with r orthonormal columns. Let Y = UᵀX, so Y ∼ N_r(0, I) and YᵀY = XᵀCX ∼ χ²_r.

Conversely, if XᵀCX ∼ χ²_r then XᵀCX has mgf (1 − 2t)^{−r/2}. There exists an orthogonal matrix P such that PᵀCP = diag(α₁, …, α_p). Let W = PᵀX, so

  XᵀCX = WᵀPᵀCPW = Σ_{j=1}^p α_j W_j².

But W is a vector of NID(0, 1) r.v.'s, so W_j² ∼ χ²₁ and Σ_j α_j W_j² has mgf

  E exp(t Σ_{j=1}^p α_j W_j²) = ∏_{j=1}^p E(e^{tα_j W_j²}) = ∏_{j=1}^p (1 − 2tα_j)^{−1/2}.

But mgf's are unique, so

  ∏_{j=1}^p (1 − 2tα_j)^{−1/2} = (1 − 2t)^{−r/2}, i.e. (1 − 2t)^r = ∏_{j=1}^p (1 − 2tα_j).

By comparing coefficients of powers of t we must have r of the α_j equal to 1 and p − r equal to 0. Thus all eigenvalues of C are either 0 or 1, and so C is idempotent of rank r.

Theorem 4 (Craig's Theorem). If A and B are symmetric, non-negative definite matrices and X ∼ N_p(0, I), then XᵀAX and XᵀBX are independent if and only if AB = 0.

Proof. XᵀAX has mgf

  (2π)^{−p/2} ∫⋯∫ e^{txᵀAx} e^{−xᵀx/2} dx = (2π)^{−p/2} ∫⋯∫ exp(−½ xᵀ(I − 2At)x) dx = |I − 2At|^{−1/2}.

XᵀAX and XᵀBX are independent iff E(e^{t₁XᵀAX + t₂XᵀBX}) = E(e^{t₁XᵀAX}) E(e^{t₂XᵀBX}), i.e.

  |I − 2t₁A − 2t₂B|^{−1/2} = |I − 2t₁A|^{−1/2} |I − 2t₂B|^{−1/2}.  (1)

The RHS is |I − 2t₁A − 2t₂B + 4t₁t₂AB|^{−1/2}, so if AB = 0 then (1) holds and the two quadratic forms are independent.

Now suppose the quadratic forms are independent. Let P be an orthogonal matrix such that

  A* = PᵀAP = diag(α₁, …, α_a, 0_{(p−a)}), B* = PᵀBP.

Let Y = PᵀX, so XᵀAX = Σ_{i=1}^a α_i Y_i² and XᵀBX = Σ_i Σ_j b*_ij Y_i Y_j. Independence gives E(XᵀAX) E(XᵀBX) = E(XᵀAX · XᵀBX), i.e.

  E(Σ_{i=1}^a α_i Y_i²) · E(Σ_i Σ_j b*_ij Y_i Y_j) = E[ (Σ_{i=1}^a α_i Y_i²)(Σ_j Σ_k b*_kj Y_k Y_j) ].

But Y₁, …, Y_p are independent with EY_i³ = 0 and EY_i⁴ = 3, so

  (Σ_{i=1}^a α_i)(Σ_j b*_jj) = 3 Σ_{i=1}^a α_i b*_ii + Σ_{i≠j} α_i b*_jj.

Thus Σ_{i=1}^a α_i b*_ii = 0. But α_i > 0 for i = 1, …, a, so b*_ii = 0, i = 1, …, a. (We must have b*_ii ≥ 0, as B* is non-negative definite.)

Since B* is non-negative definite, aᵀB*a ≥ 0 for any vector a. For i ≤ a and any j with b*_jj > 0, choose a_i = √(b*_jj), a_j = −b*_ij/√(b*_jj) and all other a_k = 0.
Then

  aᵀB*a = b*_jj b*_ii + (b*_ij)² − 2(b*_ij)² ≥ 0, i.e. b*_ii b*_jj ≥ (b*_ij)²,

so b*_ij = 0 if i or j is among 1, 2, …, a. Hence

  B* = ( 0 0 ; 0 B*₂₂ )

and so A*B* = 0. But A*B* = PᵀAPPᵀBP = PᵀABP, so A*B* = 0 implies AB = 0.

Lemma 1. If A and B are non-negative definite, symmetric matrices, then AB = 0 if and only if tr(AB) = 0.

Proof: AB = 0 implies tr(AB) = 0. Using the above notation, A* = PᵀAP = diag(α₁, …, α_a, 0, …, 0) and B* = PᵀBP, so

  tr AB = tr(PA*PᵀPB*Pᵀ) = tr(PA*B*Pᵀ) = tr A*B*, as PᵀP = I,
     = Σ_{i=1}^a α_i b*_ii = 0 by assumption.

But α_i > 0, so b*_ii = 0, i = 1, …, a, and so again b*_ij = 0 if i or j is among 1, …, a. Thus A*B* = 0, which in turn implies AB = 0.

4. The Wishart Distribution.

If X is an n × p matrix with rows NID_p(0, Σ), then the p × p matrix M = XᵀX is said to have a Wishart distribution: M ∼ W_p(Σ, n), where Σ is the scale matrix and n the degrees of freedom. If p = 1 then W₁(Σ, n) = W₁(σ², n) = σ²χ²_n. If X = (X₁, …, X_n)ᵀ then M = Σ_{i=1}^n X_i X_iᵀ, so EM = nΣ. Assume Σ is positive definite.

Theorem 5. If M ∼ W_p(Σ, n) and B is p × q, then BᵀMB ∼ W_q(BᵀΣB, n).

Proof. BᵀMB = BᵀXᵀXB = YᵀY, where Y = XB. The rows of Y are NID_q(0, BᵀΣB) and the result follows.

Corollary. The diagonal submatrices of M have Wishart distributions.

Theorem 6. If M ∼ W_p(Σ, n) and a is any fixed p-vector such that aᵀΣa > 0, then aᵀMa/(aᵀΣa) ∼ χ²_n.

Proof. aᵀMa ∼ W₁(aᵀΣa, n) from Theorem 5.

Corollary. M_ii ∼ σ_ii χ²_n, where Σ = (σ_ij).

The converse to Theorem 6 is not true.

Theorem 7. If M₁ ∼ W_p(Σ, n₁) and M₂ ∼ W_p(Σ, n₂) independently of M₁, then M₁ + M₂ ∼ W_p(Σ, n₁ + n₂).

Proof. Write M_i = X_iᵀX_i, where X_i has n_i independent rows drawn from N_p(0, Σ) populations, so

  M₁ + M₂ = X₁ᵀX₁ + X₂ᵀX₂ = XᵀX, where X = ( X₁ ; X₂ ).

The Wishart arises naturally in connection with the sample covariance matrix.

Theorem 8. If X is an n × p matrix with rows independent N_p(µ, Σ), then S = XᵀX − nX̄X̄ᵀ ∼ W_p(Σ, n − 1), independently of X̄ ∼ N_p(µ, n^{−1}Σ).

Proof.
  S = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ = Σ_{i=1}^n X_i X_iᵀ − nX̄X̄ᵀ, X_i ∼ N_p(µ, Σ).

Let P be an orthogonal matrix with first row (1/√n, …, 1/√n). Let Y = PX, so Y₁ = √n X̄, where Y has rows Y₁ᵀ, …, Y_nᵀ. Thus

  S = XᵀX − nX̄X̄ᵀ = YᵀPPᵀY − Y₁Y₁ᵀ = YᵀY − Y₁Y₁ᵀ = Σ_{i=2}^n Y_i Y_iᵀ ∼ W_p(Σ, n − 1),

since Y_i ∼ N_p(0, Σ), i = 2, …, n, independently of Y₁ = √n X̄.

Theorem 9. If X has rows which are NID_p(0, Σ), then XᵀAX ∼ W_p(Σ, r) if and only if A is idempotent of rank r.

Proof. If XᵀAX ∼ W_p(Σ, r) then aᵀXᵀAXa/(aᵀΣa) ∼ χ²_r for any a with aᵀΣa > 0, from Theorem 6. From Theorem 3, A is idempotent of rank r, as Xa/(aᵀΣa)^{1/2} ∼ N(0, I).

Conversely, if A is idempotent then A = UUᵀ, where U is n × r with orthonormal columns. Let Y = UᵀX, so Y has r rows NID_p(0, Σ) and XᵀAX = YᵀY ∼ W_p(Σ, r).

Theorem 10. If X has n rows which are NID_p(0, Σ), then for any symmetric, non-negative definite matrices A and B, XᵀAX and XᵀBX are independent if and only if AB = 0.

Proof. Assume XᵀAX and XᵀBX are independent. Choose a such that aᵀΣa > 0 and let Y = Xa/√(aᵀΣa). Then Y ∼ N_n(0, I), and YᵀAY and YᵀBY are independent, so by Craig's Theorem AB = 0.

If AB = 0 then there exists an orthogonal matrix P such that PᵀAP = diag(α₁, …, α_a, 0, …, 0) and PᵀBP = diag(0, …, 0, β₁, …, β_{n−a}). Set Y = PᵀX and the result follows.

Theorem 11. If X has n rows which are NID_p(0, Σ) and

  M = XᵀX = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), where M₁₁ is a × a,

then M₂·₁ = M₂₂ − M₂₁M₁₁^{−1}M₁₂ ∼ W_{p−a}(Σ₂·₁, n − a) and is independent of (M₁₁, M₁₂). (Recall Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂.)

Proof. Write X = (X₁ | X₂), X₁ n × a, X₂ n × (p − a), so

  M = XᵀX = ( X₁ᵀ ; X₂ᵀ )(X₁ X₂), M₂·₁ = X₂ᵀX₂ − X₂ᵀX₁M₁₁^{−1}X₁ᵀX₂ = X₂ᵀPX₂,

where P = I − X₁M₁₁^{−1}X₁ᵀ. P is idempotent of rank n − a, as tr P = n − tr X₁M₁₁^{−1}X₁ᵀ = n − a. Also X₁ᵀP = 0, so X₂ᵀPX₂ = X₂·₁ᵀPX₂·₁, where X₂·₁ = X₂ − X₁Σ₁₁^{−1}Σ₁₂. X₂·₁ is an n × (p − a) matrix whose rows are independent N_{p−a}(0, Σ₂·₁), conditional on X₁. Thus, given X₁, M₂·₁ ∼ W_{p−a}(Σ₂·₁, n − a), using Theorem 9.
But this conditional distribution does not depend on X₁, and so it is also the marginal distribution of M₂·₁. Hence M₂·₁ is independent of X₁, and therefore of M₁₁ = X₁ᵀX₁. Further, P(I − P) = P − P = 0, so, given X₁, M₂·₁ = X₂·₁ᵀPX₂·₁ and (I − P)X₂·₁ = X₁M₁₁^{−1}M₁₂ − X₁Σ₁₁^{−1}Σ₁₂ are independent, and hence are unconditionally independent. Thus M₂·₁ is independent of (M₁₁, M₁₂), as M₁₂ = X₁ᵀ(I − P)X₂·₁ + M₁₁Σ₁₁^{−1}Σ₁₂.

Theorem 12. If M ∼ W_p(Σ, n), n > p, then

a) aᵀΣ^{−1}a / aᵀM^{−1}a ∼ χ²_{n−p+1} for any fixed a ≠ 0; in particular σ^{ii}/M^{ii} ∼ χ²_{n−p+1}, where σ^{ii} and M^{ii} denote diagonal elements of Σ^{−1} and M^{−1};

b) M^{ii} is independent of all elements of M except M_ii.

Proof. From Theorem 11, taking a = p − 1 as the size of the leading block,

  (M^{pp})^{−1} ∼ (σ^{pp})^{−1} χ²_{n−p+1},

and (M^{pp})^{−1} = M₂·₁ is independent of (M₁₁, M₁₂), i.e. M^{pp} is independent of all elements except M_pp. The general result follows by permuting rows and columns.

Given a fixed vector a, let A be a non-singular matrix with last column a. Set N = A^{−1}M(A^{−1})ᵀ, so N ∼ W_p(A^{−1}Σ(A^{−1})ᵀ, n) from Theorem 5. N^{−1} = AᵀM^{−1}A, so N^{pp} = aᵀM^{−1}a and

  (N^{pp})^{−1} = (aᵀM^{−1}a)^{−1} ∼ [(AᵀΣ^{−1}A)^{pp}]^{−1} χ²_{n−p+1} = (aᵀΣ^{−1}a)^{−1} χ²_{n−p+1}.

Theorem 13. If M ∼ W_p(Σ, n), n ≥ p, then |M| = |Σ| · ∏_{i=1}^p U_i, where the U_i are independent random variables, U_i ∼ χ²_{n−i+1}.

Proof. Use induction on p. If p = 1, M ∼ σ²χ²_n. Write M = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), with M₁₁ of order p − 1 and M₂₂ a scalar, and assume |M₁₁| = |Σ₁₁| ∏_{i=1}^{p−1} U_i, U_i ∼ χ²_{n−i+1}, as M₁₁ ∼ W_{p−1}(Σ₁₁, n). From Theorem 11, M₁₁ is independent of M₂·₁ ∼ Σ₂·₁ χ²_{n−p+1} (here M₂·₁ is a scalar). Since |M| = |M₁₁||M₂·₁|, the result follows.

Corollary 1. If M ∼ W_p(Σ, n) and Σ is positive definite, then M is positive definite with probability 1.

Proof. |Σ| > 0, so P(|M| > 0) = 1. Thus if |Σ| > 0 then the sample covariance matrix is positive definite (i.e. S^{−1} exists with probability 1).

Corollary 2. If X has n rows which are NID_p(0, Σ) and S = XᵀX − nX̄X̄ᵀ, then

a) σ^{kk}/S^{kk} ∼ χ²_{n−p};
b) aᵀΣ^{−1}a / aᵀS^{−1}a ∼ χ²_{n−p} for fixed a ≠ 0;
c) if S₁₁ is a q × q submatrix of S, then S₁₁ ∼ W_q(Σ₁₁, n − 1), independently of S₂·₁ = S₂₂ − S₂₁S₁₁^{−1}S₁₂ ∼ W_{p−q}(Σ₂·₁, n − q − 1).

5. Hotelling's T² and Wilks' Λ.
If d ∼ N_p(0, I) independently of M ∼ W_p(I, n), then ndᵀM^{−1}d ∼ T²(p, n).

Theorem 14. If X ∼ N_p(µ, Σ) and M ∼ W_p(Σ, n) independently of X, then n(X − µ)ᵀM^{−1}(X − µ) ∼ T²(p, n).

Proof. Let d* = Σ^{−1/2}(X − µ) and M* = Σ^{−1/2}MΣ^{−1/2}; then n(X − µ)ᵀM^{−1}(X − µ) = nd*ᵀM*^{−1}d*, with d* ∼ N_p(0, I) independent of M* ∼ W_p(I, n).

Theorem 15. T²(p, n) ∼ (np/(n − p + 1)) F_{p,n−p+1}.

Proof. Write T²(p, n) = n[dᵀM^{−1}d / dᵀd] · (dᵀd), where d ∼ N_p(0, I) independently of M ∼ W_p(I, n). From Theorem 12, given d, dᵀd / dᵀM^{−1}d ∼ χ²_{n−p+1}; since this distribution does not depend on d, it is also the marginal distribution, and moreover dᵀd/dᵀM^{−1}d is independent of d. Also dᵀd ∼ χ²_p, so

  T²(p, n) =_D n χ²_p / χ²_{n−p+1} =_D (np/(n − p + 1)) F_{p,n−p+1}.

Corollary 1. If X̄ is the mean of a sample of size n drawn from an N_p(µ, Σ) population and (n − 1)^{−1}S is the sample covariance matrix, then

  n(X̄ − µ)ᵀ( S/(n−1) )^{−1}(X̄ − µ) ∼ ((n−1)p/(n−p)) F_{p,n−p}.

Corollary 2. If M ∼ W_p(Σ, n) independently of d ∼ N_p(0, Σ), then |M|/|M + ddᵀ| ∼ Beta((n − p + 1)/2, p/2).

Proof. Note

  ( M d ; −dᵀ 1 ) ( I 0 ; dᵀ 1 ) = ( M + ddᵀ d ; 0 1 ),

so |M + ddᵀ| = |( M d ; −dᵀ 1 )| = |M|(1 + dᵀM^{−1}d), and thus |M|/|M + ddᵀ| = (1 + dᵀM^{−1}d)^{−1} =_D (1 + T²/n)^{−1}. Now T² ∼ (np/(n − p + 1)) F_{p,n−p+1} = nU/V, where U ∼ χ²_p independently of V ∼ χ²_{n−p+1}, so

  n/(n + T²) =_D V/(U + V) ∼ Beta((n − p + 1)/2, p/2).

Wilks' Λ. Let A ∼ W_p(Σ, m) independently of B ∼ W_p(Σ, n). Define Λ(p, m, n) = |A|/|A + B|. The distribution of Λ does not depend on Σ.

Theorem 16. Λ(p, m, n) ∼ ∏_{i=1}^n U_i, where U₁, …, U_n are independent r.v.'s, U_i ∼ Beta((m + i − p)/2, p/2), i = 1, …, n.

Proof outline. Write B = XᵀX = Σ_{i=1}^n X_i X_iᵀ, where the X_i ∼ NID(0, I) are the rows of X (we may take Σ = I). Let M_i = M_{i−1} + X_i X_iᵀ, M₀ = A, so

  Λ(p, m, n) = |M₀|/|M_n| = (|M₀|/|M₁|)(|M₁|/|M₂|) ⋯ (|M_{n−1}|/|M_n|),

and from Corollary 2 to Theorem 15, U_i = |M_{i−1}|/|M_i| ∼ Beta((m + i − p)/2, p/2). It remains to prove that the U_i are independent. Note U_i = (1 + X_iᵀM_{i−1}^{−1}X_i)^{−1}, so we need to show that X_iᵀM_{i−1}^{−1}X_i is independent of M_i (see Mardia, Kent and Bibby, p. 76, for details).

Bartlett's Approximation.

  −(m − ½(p − n + 1)) log Λ(p, m, n) ≈ χ²_{np} for large n.

Proof outline.
Let ρ = (m − ½(p − n + 1))/2 and

  U = −2ρ log Λ(p, m, n) = −2ρ Σ_{i=1}^n log U_i, where U_i ∼ Beta((m+i−p)/2, p/2).

Then

  E(e^{−tU}) = ∏_{i=1}^n E(exp(2tρ log U_i)) = ∏_{i=1}^n E(U_i^{2tρ}),

and

  E(U_i^γ) = [1/B((m+i−p)/2, p/2)] ∫₀¹ u^γ u^{(m+i−p)/2−1}(1−u)^{p/2−1} du
      = Γ((m+i)/2) Γ((m+i−p)/2 + γ) / [ Γ((m+i−p)/2) Γ((m+i)/2 + γ) ].

Using Stirling's approximation Γ(x) ∼ √(2π(x−1)) [(x−1)/e]^{x−1}, we obtain

  log[Γ(x + ℓ)/Γ(x)] = ℓ log x + ℓ(ℓ−1)/(2x) + O(x^{−2}).

Hence

  log E(e^{−tU}) = Σ_{i=1}^n [ log Γ((m+i−p)/2 + 2tρ)/Γ((m+i−p)/2) − log Γ((m+i)/2 + 2tρ)/Γ((m+i)/2) ].

Writing each Gamma ratio in the form Γ(x + ℓ)/Γ(x), with x = (1 + 2t)ρ in the shifted terms and x = ρ in the unshifted terms, and applying the expansion above, the leading terms contribute −(np/2) log(1 + 2t), while the O(ρ^{−1}) terms cancel exactly by the choice of ρ:

  log E(e^{−tU}) = −(np/2) log(1 + 2t) + O(ρ^{−2}).

Thus E(e^{−tU}) = (1 + 2t)^{−np/2}(1 + O(ρ^{−2})), and ρ → ∞ as n, m → ∞. Therefore U has a distribution that is closely approximated by a χ²_{np} density.

6. Factor Analysis.

In factor analysis we model p observed variables in terms of m underlying but unobservable factors. This modelling approach is widely used in psychology and education, where highly correlated responses are linked to common factors like intelligence or ability. The interpretation of the factors is often subjective.

Model: X = LF + µ + ε, where X is a p-vector, F is an m-vector of independent random variables with means 0 and variances equal to 1, L is a p × m loading matrix and ε is the error or specific factor vector. F and ε are independent, E(ε) = 0, and Cov(ε) = Ψ = diag(ψ₁, …, ψ_p). Let L = (l_ij), where l_ij is the loading of the ith variable, X_i, on the jth factor, F_j. Then

  E(X) = µ and Cov(X) = Σ = LLᵀ + Ψ.
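The covariance structure Σ = LLᵀ + Ψ can be sketched directly. This is a minimal sketch assuming NumPy; the loadings and specific variances below are hypothetical. The diagonal of Σ splits into a common-factor part Σ_j l_ij² plus the specific variance ψ_i, and off-diagonal covariances come entirely from the loadings.

```python
import numpy as np

# Hypothetical loadings (p = 4 variables, m = 2 factors) and specific variances.
L = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.6]])
psi = np.array([0.3, 0.4, 0.2, 0.5])

Sigma = L @ L.T + np.diag(psi)            # Cov(X) = L L' + Psi

h2 = (L**2).sum(axis=1)                   # common-factor part of each variance
print(np.diag(Sigma))                     # sigma_ii = sum_j l_ij^2 + psi_i
print(Sigma[0, 1], (L[0] * L[1]).sum())   # Cov(X_i, X_k) = sum_j l_ij l_kj
```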
  Var(X_i) = σ_ii = Σ_{j=1}^m l_ij² + ψ_i,

and h_i² = Σ_{j=1}^m l_ij² is called the communality; it is the component of the variance of X_i due to the common factors. Further,

  Cov(X_i, X_k) = Σ_{j=1}^m l_ij l_kj and Cov(X_i, F_j) = l_ij.

X does not determine F and ε uniquely. If G is an orthogonal matrix then X = LGᵀGF + µ + ε, with E(GF) = 0 and Cov(GF) = I. If F_i ∼ NID(0, 1) then GF is another vector of independent standard normal variables, so the factor loadings are only determined up to an orthogonal transformation. Hence factors can be rotated by G and the model will still hold.

To specify a factor loading structure we need to impose further constraints. Generally we ask for LᵀΨ^{−1}L to be diagonal with elements in decreasing order.

The factor model cannot be applied to any covariance matrix (unless we let m = p), and it is only useful if it leads to a simplification of the structure, i.e. if m is small relative to p. The original covariance matrix consists of p(p+1)/2 parameters, whereas the factor model with constraints has pm + p − m(m−1)/2 freely determined parameters; the difference is ½(p − m)² − ½(p + m).

If Σ (or the sample covariance matrix S) is close to diagonal then the specific factors will dominate, and trying to fit a factor model is not worthwhile. Factor analysis is most useful when there is a reasonable number of highly correlated variables.

We will now look at methods for estimating L and Ψ given n observations.

Principal Component Method. Recall the spectral decomposition

  Σ = λ₁e₁e₁ᵀ + … + λ_p e_p e_pᵀ = LLᵀ, where L = [√λ₁ e₁, …, √λ_p e_p]

is a p × p matrix; thus the factor model holds with m = p. If the smallest eigenvalues are close to 0, then we can use L₁ = [√λ₁ e₁, …, √λ_m e_m] as the loading matrix and set ψ_i to be the ith diagonal element of Σ − L₁L₁ᵀ.

The factor model structure is unaffected if we work with standardized variables, so in practice we fit L₁ via the sample correlation matrix R:

  L̂₁ = [√λ̂₁ ê₁, …, √λ̂_m ê_m], where λ̂₁ ≥ ⋯
≥ λ̂_p are the eigenvalues of R, with corresponding orthonormal eigenvectors ê₁, …, ê_p. The communality is ĥ_i² = Σ_{j=1}^m l̂_ij², and the contribution to the total variance from the first factor is

  Σ_{i=1}^p l̂_i1² = (√λ̂₁ ê₁)ᵀ(√λ̂₁ ê₁) = λ̂₁.

Note that there are problems in reproducing L₁ across independent samples if the underlying correlation matrix has repeated eigenvalues: the estimated eigenvectors will not necessarily be close to one another, so the corresponding loadings can be quite different across replicates.

Maximum Likelihood Factor Analysis. If X_i ∼ NID_p(µ, Σ) then we can estimate L and Ψ via maximum likelihood. Recall µ̂ = X̄. The maximum likelihood equations for L and Ψ, where Σ = LLᵀ + Ψ and LᵀΨ^{−1}L is diagonal with elements in decreasing order, do not have an explicit solution unless Ψ = kI for some k; the log likelihood is maximized numerically via an iterative procedure. MLE and PC methods do not necessarily give the same answers. If the normal assumption is reasonable then MLE is preferred.

Factor Rotation. The constraint that LᵀΨ^{−1}L be diagonal is for mathematical convenience. To make interpretation of the factors easier we rotate the factors by an orthogonal matrix T so that LT has a few large values and the remaining values are near 0. To quantify this we need a measure of 'large vs small'. Kaiser (1958) suggested the varimax criterion. Let l̂′_ij = l̂_ij/ĥ_i, where ĥ_i² = Σ_{j=1}^m l̂_ij²; the l̂′_ij are the scaled factor loadings. The varimax procedure selects the orthogonal matrix T that makes

  V = (1/p) Σ_{j=1}^m [ Σ_{i=1}^p l̂′_ij⁴ − ( Σ_{i=1}^p l̂′_ij² )² / p ]

as large as possible. V is the sum over factors of the variances of the squared loadings; maximizing V corresponds to spreading out the loadings on each factor as far as possible.

Factor Scores. In IQ tests, various responses to questions (the X_i's) are recorded with a view to estimating the value of f, the underlying intelligence factor.
Once a factor analysis model has been established, we can obtain the factor scores f₁, f₂, …, f_m corresponding to the values of the unobserved random variables. If µ, L and Ψ are known then X − µ = LF + ε, so estimating F is like estimating the parameters in a linear model where the errors have unequal variances, Cov(ε) = Ψ. The weighted least squares estimate for F is

  F̂ = (LᵀΨ^{−1}L)^{−1}LᵀΨ^{−1}(X − µ).  (2)

If we substitute the estimates for L, Ψ and µ then we get Bartlett's factor scores. The estimates from (2) are reasonable whether or not X and ε are multivariate normal. Other methods exist that rely on the normality assumption (see reference text Johnson and Wichern).

A major concern in practice is whether the factor model is sensible and, if so, what the value of m is. Likelihood ratio tests can be used to test whether Σ = LLᵀ + Ψ is a suitable model. Given X₁, …, X_n ∼ NID(µ, Σ), test H₀: Σ = LLᵀ + Ψ, where L is p × m. Using a LRT approach and the MLEs for L and Ψ we get

  −2 log Λ = n log( |Σ̂| / |S/n| ) + n[ tr(Σ̂^{−1}S/n) − p ].

If the MLEs are used it can be shown that tr(Σ̂^{−1}S/n) = p, so the test statistic is

  n log( |L̂L̂ᵀ + Ψ̂| / |S/n| ).

Bartlett suggested replacing the leading n by n − 1 − (2p + 4m + 5)/6. The asymptotic distribution of the statistic is χ² with [(p − m)² − (p + m)]/2 degrees of freedom.

7. Canonical Correlation Analysis

Given two random vectors X and Y, what are the linear combinations η = aᵀX and φ = bᵀY which have maximum correlation? Canonical correlation analysis enables us to examine the relationships between two groups of variables, whereas principal components analysis looks at the relationships within a single set of variables.

Let EX = µ, Cov(X) = Σ₁₁, a p × p matrix, Cov(X, Y) = Σ₁₂, a p × q matrix, and EY = ν, Cov(Y) = Σ₂₂. The correlation between η and φ is

  ρ(a, b) = aᵀΣ₁₂b / √(aᵀΣ₁₁a · bᵀΣ₂₂b).

Assume Σ₁₁ and Σ₂₂ are non-singular and let c = Σ₁₁^{1/2}a, d = Σ₂₂^{1/2}b. Then we want to maximise |cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}d| subject to cᵀc = dᵀd = 1.
This is equivalent to maximising

  dᵀΣ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}d.

Using the Maximization Lemma, holding c fixed and maximizing over d, the maximum is the largest eigenvalue of

  Σ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}.

This matrix has rank 1, so the largest eigenvalue is the trace, i.e. cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2}c, attained at d = Σ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c (suitably normalised). We now want to maximise this expression with respect to c subject to cᵀc = 1. The maximum is the largest eigenvalue of

  Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2},  (3)

which is the same as the maximum eigenvalue of Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁. Call this value ρ².

Note:

1. Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ and Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2} have the same eigenvalues.

2. Let ρ₁² ≥ ρ₂² ≥ … ≥ ρ_p² be the eigenvalues of Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ with associated eigenvectors a₁, …, a_p. Then Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁a₁ = ρ₁²a₁ and

  Σ₁₁^{−1/2}(Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2})(Σ₁₁^{1/2}a₁) = ρ₁²a₁,

so Σ₁₁^{1/2}a₁ is an eigenvector of (3).

3. Let b₁ = Σ₂₂^{−1}Σ₂₁a₁. Note b₁ is the eigenvector of Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1}Σ₁₂ with eigenvalue ρ₁².

Definition: Given a_i defined as above and b_i = Σ₂₂^{−1}Σ₂₁a_i, η_i = a_iᵀX and φ_i = b_iᵀY are the ith canonical correlation variables and ρ_i is the ith canonical correlation coefficient. Further (normalizing so that a_iᵀΣ₁₁a_i = b_iᵀΣ₂₂b_i = 1),

  Corr(η_i, η_j) = a_iᵀΣ₁₁a_j = δ_ij, Corr(φ_i, φ_j) = b_iᵀΣ₂₂b_j = δ_ij, Cov(η_i, φ_j) = 0, i ≠ j,

since

  Cov(η_i, φ_j) = a_iᵀΣ₁₂b_j = a_iᵀΣ₁₂Σ₂₂^{−1}Σ₂₁a_j = a_iᵀΣ₁₁Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁a_j = ρ_j² a_iᵀΣ₁₁a_j = 0, i ≠ j,

as a_iᵀΣ₁₁a_j = 0. To see this, note the Σ₁₁^{1/2}a_i are eigenvectors of Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2} with eigenvalues ρ_i², and eigenvectors corresponding to distinct eigenvalues are orthogonal.

If Σ₁₂ has rank 1 then Σ₁₂ = abᵀ for some a and b, and

  Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ = Σ₁₁^{−1}abᵀΣ₂₂^{−1}baᵀ = (bᵀΣ₂₂^{−1}b)(Σ₁₁^{−1}aaᵀ),

which has non-zero eigenvalue (aᵀΣ₁₁^{−1}a)(bᵀΣ₂₂^{−1}b). Thus the canonical correlation is √[(aᵀΣ₁₁^{−1}a)(bᵀΣ₂₂^{−1}b)].
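The eigenvalue computation above can be sketched on simulated data, replacing the population matrices by their sample analogues S₁₁, S₁₂, S₂₂. This is a sketch assuming NumPy; the data-generating scheme is illustrative, constructed so the two blocks are genuinely correlated.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 500, 3, 2
Z = rng.standard_normal((n, p + q))
Z[:, p:] += 0.8 * Z[:, :q]                 # correlate the Y block with the X block
X, Y = Z[:, :p], Z[:, p:]

S = np.cov(np.hstack([X, Y]), rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

# Squared sample canonical correlations: eigenvalues of S11^{-1} S12 S22^{-1} S21.
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
rho2 = np.sort(np.linalg.eigvals(M).real)[::-1]
print(np.sqrt(rho2[:min(p, q)]))           # the k = min(p, q) canonical correlations
```

Since S₁₂ has rank at most min(p, q), only k = min(p, q) of the p eigenvalues are non-zero, matching the count of non-zero canonical correlations.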
In general the number of non-zero canonical correlations is k = min(p, q). Sample canonical correlations are obtained via the eigenvalues and eigenvectors of S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁.

Testing H₀: only s canonical correlation coefficients are non-zero. Using a LRT approach, when testing H₀′: Σ₁₂ = 0 (i.e. all canonical correlations are 0) we used

  Λ^{2/n} = |I − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂| = ∏_{i=1}^k (1 − r_i²),

where r₁, …, r_k are the sample canonical correlation coefficients. Thus, to test H₀ with s ≤ k, Bartlett proposed

  −[n − ½(p + q + 3)] log ∏_{i=s+1}^k (1 − r_i²) ≈ χ²_{(p−s)(q−s)}, for large n.

This has the same form as Bartlett's approximation to Λ^{2/n} when testing H₀′: Σ₁₂ = 0:

  −[n − ½(p + q + 3)] log ∏_{i=1}^k (1 − r_i²) ≈ χ²_{pq}.

Canonical correlation analysis is used for data reduction and interpretation, in particular for determining which combinations of the X_i's can be predicted effectively using the Y_j's.

Reference Texts

Afifi, A., May, S. and Clark, V. (2012). Practical Multivariate Analysis. CRC Press, Boca Raton, Florida.

Johnson, R. and Wichern, D. (1988). Applied Multivariate Statistical Analysis. Prentice Hall, Englewood Cliffs, New Jersey.

Mardia, K., Kent, J. and Bibby, J. (1979). Multivariate Analysis. Academic Press, London.

Muirhead, R. (1982). Aspects of Multivariate Statistical Theory. John Wiley and Sons, New York.

Srivastava, M. and Carter, E. (1983). An Introduction to Applied Multivariate Statistics. North Holland, New York.

Appendix: Multivariate Linear Models

Consider the model defined by Y = XB + U, where Y is an n × p matrix of observed values, B is a q × p matrix of unknown parameters, X is an n × q matrix of known values, and U is an unobserved random matrix whose rows are, given X, uncorrelated with mean 0 and covariance Σ. If Y = (Y_ij) then EY_ij = X_iᵀB_j, where X has rows X₁ᵀ, …, X_nᵀ and B = (B₁, …, B_p). If the rows of U are N_p(0, Σ) then the least squares estimates for B are the MLEs.
The log likelihood is

  ℓ(B, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr[(Y − XB)Σ^{−1}(Y − XB)ᵀ].

Theorem. The MLEs for B and Σ are B̂ = (XᵀX)^{−1}XᵀY and Σ̂ = YᵀPY/n, where P = I − X(XᵀX)^{−1}Xᵀ.

Proof. Let Û = PY = Y − XB̂, where B̂ is defined above. Then Y − XB = Û + X(B̂ − B) and XᵀP = 0, so

  (Y − XB)ᵀ(Y − XB) = ÛᵀÛ + (B̂ − B)ᵀXᵀX(B̂ − B).

Thus

  ℓ(B, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr[Σ^{−1}(ÛᵀÛ + (B̂ − B)ᵀXᵀX(B̂ − B))].

This is maximized with respect to B when B = B̂, giving

  ℓ(B̂, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr Σ^{−1}ÛᵀÛ,

and from Section 1 we have Σ̂ = ÛᵀÛ/n = YᵀPY/n. Next,

  ℓ(B̂, Σ̂) = −(np/2) log(2π) − (n/2) log|Σ̂| − np/2,

and YᵀY = YᵀPY + Yᵀ(I − P)Y = ÛᵀÛ + ŶᵀŶ, as Ŷ = XB̂ = (I − P)Y. Also P(I − P) = 0, so ŶᵀŶ is independent of Σ̂. Clearly EB̂ = B and EÛ = 0. Note Û = PY = PU, as PX = 0, so ÛᵀÛ = UᵀPU, as P² = P. The rows of U are N_p(0, Σ), so we have

  nΣ̂ ∼ W_p(Σ, n − q),

as rank P = tr P = tr(I − X(XᵀX)^{−1}Xᵀ) = n − q, by Theorem 9. Further,

  B̂ = (XᵀX)^{−1}XᵀY = B + (XᵀX)^{−1}XᵀU,  (A)

so B̂ − B = (XᵀX)^{−1}XᵀU and

  Cov(B̂_ij, B̂_kℓ) = E(a_iᵀU_j · U_ℓᵀa_k),

where (XᵀX)^{−1}Xᵀ has rows a₁ᵀ, …, a_qᵀ and U = (U₁, …, U_p). As the elements of U_j are independent, E(U_j U_ℓᵀ) = σ_jℓ I, so

  Cov(B̂_ij, B̂_kℓ) = σ_jℓ a_iᵀa_k = σ_jℓ g_ik, where (g_ik) = (XᵀX)^{−1}.

Thus the rows of B̂ will be independent if (XᵀX)^{−1} is diagonal. The covariance between the jth column and the ℓth column is Cov(B̂_j, B̂_ℓ) = σ_jℓ(XᵀX)^{−1}.

General Linear Hypotheses. Consider a test of the hypothesis H₀: C₁B = D, where C₁ is g × q of rank g and D is g × p of rank ≤ p (generally D = 0). For example, the hypothesis that the rows of B are all equal has

  C₁ = ( 1 −1 0 ⋯ 0 ; 1 0 −1 0 ⋯ 0 ; ⋯ ; 1 0 ⋯ 0 −1 ) and D = 0,

with C₁ of order (q − 1) × q. To construct the likelihood ratio test, define C₂ such that C₂ is (q − g) × q and C = ( C₁ ; C₂ ) is q × q of full rank. Let B₀ be any matrix such that C₁B₀ = D. Write the model as

  Y₊ = Z∆ + U, where Y₊ = Y − XB₀, Z = XC^{−1}, ∆ = C(B − B₀) = ( ∆₁ ; ∆₂ ),

and ∆₁ is g × p.
With this parametrization H_0 becomes H_0: Δ_1 = 0. Let C^{-1} = (C^(1) C^(2)). Maximizing the likelihood with no restrictions gives

    ℓ(B̂, Σ̂) = −(np/2) log(2π) − (n/2) log|Σ̂| − np/2,

where Σ̂ = Y^T P Y / n and P = I − X(X^T X)^{-1} X^T. Under H_0 the rows of Y_+ have expected value

    X (C^(1) C^(2)) (0; Δ_2) = X C^(2) Δ_2.

Let P_1 = I − X C^(2) (C^(2)T X^T X C^(2))^{-1} C^(2)T X^T, so the maximum likelihood under H_0 is

    −(np/2) log(2π) − (n/2) log|(1/n) Y_+^T P_1 Y_+| − np/2.

Hence

    −2 log Λ = n log(|Y_+^T P_1 Y_+| / |Y^T P Y|),  i.e.  Λ^{2/n} = |Y^T P Y| / |Y_+^T P_1 Y_+|.

Thus the statistic is simply the ratio of the determinants of the residual SSP matrices calculated for the full model and for the model that holds when the null hypothesis is true. To obtain the distribution of the statistic note that P Y_+ = P Y as PX = 0, so

    Λ^{2/n} = |Y_+^T P Y_+| / |Y_+^T P_1 Y_+|.

We will show that Λ^{2/n} ∼ Λ(p, n − q, g), Wilks' Lambda distribution, and so, using Bartlett's approximation,

    −[n − q − (1/2)(p − g + 1)] log Λ^{2/n}  approx∼  χ²_{pg}  for large n.

Let P_2 = P_1 − P. Note P P_1 = P, so P P_2 = 0 as P² = P. Further P_2 is idempotent as (P_1 − P)² = P_1 − P. Thus

    Y_+^T P_1 Y_+ = Y_+^T P Y_+ + Y_+^T P_2 Y_+,  with P P_2 = 0,

so

    Λ^{2/n} = |U^T P U| / |U^T P U + U^T P_2 U|,

and U^T P U ∼ W_p(Σ, n − q) independently of U^T P_2 U ∼ W_p(Σ, g), since tr P = n − q and tr P_2 = tr P_1 − tr P = n − (q − g) − (n − q) = g. From the definition of Wilks' lambda, Λ^{2/n} ∼ Λ(p, n − q, g).

Theorem. The LRT of the hypothesis H_0: C_1 B M_1 = D, where C_1 is g × q of rank g and M_1 is p × r of rank r, has test statistic Λ^{2/n} ∼ Λ(r, n − q, g). [This is the most general linear hypothesis on B and allows the specification of specific values for elements of B.]

Proof. Let Y = XB + U. Transform to Z = Y M_1 = X B M_1 + U M_1, where U M_1 has rows which are independent N_r(0, M_1^T Σ M_1); Z is n × r, U_1 = U M_1 is n × r, and B_1 = B M_1 is q × r. The null hypothesis is now H_0: C_1 B_1 = D, so using the above result the LRT statistic is

    |E| / |E + H| ∼ Λ(r, n − q, g),

where E = Z^T P Z = M_1^T Y^T P Y M_1 and E + H = M_1^T Y_+^T P_1 Y_+ M_1.
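The "ratio of residual SSP determinants" form of Λ^{2/n} is easy to compute for a simple special case. In the sketch below (synthetic data, not from the notes) the hypothesis is that the last row of B is zero, i.e. C_1 = (0, …, 0, 1) with g = 1 and D = 0, so the restricted model just drops the last column of X; numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, q, p = 80, 3, 2
X = rng.standard_normal((n, q))
B = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [0.0,  0.0]])        # last row zero, so H0 is true here
Y = X @ B + rng.standard_normal((n, p))

def resid_proj(X):
    """P = I - X (X'X)^{-1} X', projection onto the residual space of X."""
    return np.eye(len(X)) - X @ np.linalg.solve(X.T @ X, X.T)

# H0: C1 B = 0 with C1 = (0, 0, 1); restricted design omits the last column
P, P1 = resid_proj(X), resid_proj(X[:, :-1])
Lam = np.linalg.det(Y.T @ P @ Y) / np.linalg.det(Y.T @ P1 @ Y)

g = 1
stat = -(n - q - 0.5 * (p - g + 1)) * np.log(Lam)   # Bartlett's approximation
pval = stats.chi2.sf(stat, p * g)
```

Since Y^T P_1 Y exceeds Y^T P Y by a positive semi-definite matrix, Λ^{2/n} always lies in (0, 1].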
If p = 1 the model reduces to the usual multiple regression model and H_0: C_1 β = d, where d is g × 1, yields

    E = Y^T (I − X(X^T X)^{-1} X^T) Y = residual SS ∼ σ² χ²_{n−q},
    H + E = Y^T P_1 Y = residual SS under H_0,
    H = (S_1 − S_0) ∼ σ² χ²_g  if C_1 has rank g.

The usual multiple regression F-test is based not on E/(H + E) but on H/E, with (H/E) ∼ (g/(n − q)) F_{g, n−q}, and H_0 is rejected if H/E is large. If C_1 is 1 × q then, since

    [1 − Λ(p, n − q, 1)] / Λ(p, n − q, 1) ∼ (p/(n − q − p + 1)) F_{p, n−q−p+1},

we can base our test on an exact F-distribution, not an approximate χ² value.

MANOVA

The LRT statistics for multivariate analysis of variance can be derived as follows. We will consider the 1-way case in detail. Assume we have g independent samples from N_p(µ_i, Σ) populations, i = 1, ⋯, g. The ith sample is Y_i1, ⋯, Y_i n_i, and n_1 + n_2 + ⋯ + n_g = N. Let Y be the N × p matrix with rows Y_11^T, Y_12^T, ⋯, Y_g n_g^T, let B be the g × p matrix with rows µ_1^T, ⋯, µ_g^T, and let X be the N × g design matrix whose ith column is the indicator of membership of the ith sample (a block of n_i ones). Write Y = XB + E, where X^T X = diag(n_1, n_2, ⋯, n_g) and (X^T X)^{-1} X^T Y has rows Ȳ_1·^T, ⋯, Ȳ_g·^T, i.e. µ̂_i = Ȳ_i·.

To test H_0: µ_1 = µ_2 = ⋯ = µ_g write the constraints as C_1 B = 0, where

    C_1 = [ 1 0 ⋯ ⋯ 0 −1
            0 1 0 ⋯ 0 −1
            ⋮
            0 ⋯ ⋯ 0 1 −1 ],

a (g − 1) × g matrix. Here P = I − X(X^T X)^{-1} X^T = I − X diag(n_i^{-1}) X^T, so

    Y^T P Y = Y^T Y − Y^T X diag(n_i^{-1}) X^T Y = Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)(Y_ij − Ȳ_i·)^T = W ∼ W_p(Σ, N − g).

W is the within-groups SSP matrix. Write

    C = (C_1; 1^T) = [ I   −1
                       1^T  1 ],

so C^{-1} = [ C^11 C^12; C^21 1/g ] with C^12 = (1/g) 1, and hence C^(2) = (1/g) 1. Next

    P_1 = I_N − X 1 (1^T X^T X 1)^{-1} 1^T X^T = I − (1/N) X 1 1^T X^T = I − (1/N) 1 1^T.

In this case Y_+ = Y, as we may take B_0 = 0 (so that C_1 B_0 = 0). Then

    Y^T P_1 Y = Y^T Y − (1/N) Y^T 1 1^T Y = Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_··)(Y_ij − Ȳ_··)^T = T,

the total SSP matrix. Thus the MANOVA LRT statistic is

    Λ = |W| / |T| ∼ Λ(p, N − g, g − 1),

and T = W + B, where B = Σ_{i=1}^{g} n_i (Ȳ_i· − Ȳ_··)(Ȳ_i· − Ȳ_··)^T is the between-groups SSP matrix. We can write

    |W| / |T| = |W| / |W + B| = |I + W^{-1} B|^{-1}.

If λ_1, ⋯, λ_p are the eigenvalues of W^{-1} B then (1 + λ_i) is an eigenvalue of I + W^{-1} B. Hence Λ = ∏_{i=1}^{p} (1 + λ_i)^{-1}.
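The one-way MANOVA decomposition T = W + B and the two equivalent forms of Wilks' Λ can be verified on synthetic data. This sketch (not from the notes) uses three hypothetical groups with shifted means; numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 2
# three hypothetical samples with shifted mean vectors
sizes_means = [(20, 0.0), (25, 0.3), (30, 0.6)]
groups = [rng.standard_normal((ni, p)) + mu for ni, mu in sizes_means]
N, g = sum(len(Yi) for Yi in groups), len(groups)

grand = np.vstack(groups).mean(axis=0)
W = sum((Yi - Yi.mean(0)).T @ (Yi - Yi.mean(0)) for Yi in groups)   # within SSP
Bm = sum(len(Yi) * np.outer(Yi.mean(0) - grand, Yi.mean(0) - grand)
         for Yi in groups)                                          # between SSP
T = sum((Yi - grand).T @ (Yi - grand) for Yi in groups)             # total SSP

Lam = np.linalg.det(W) / np.linalg.det(T)        # Wilks' Lambda = |W| / |T|

# the same statistic via the eigenvalues of W^{-1} B
lam = np.linalg.eigvals(np.linalg.solve(W, Bm)).real
Lam_eig = 1.0 / np.prod(1.0 + lam)
```

The identity T = W + B holds exactly for any data, so the two computations of Λ must agree.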
But λ_i = 0 if i > min(p, g − 1) = t, so Λ = ∏_{i=1}^{t} (1 + λ_i)^{-1}. It can be shown that

    V = (N − g) Σ_{i=1}^{p} λ_i  approx∼  χ²_{p(g−1)}.

Testing Contrasts. H_0: a_1 µ_1 + ⋯ + a_g µ_g = 0, with Σ_{i=1}^{g} a_i = 0. Write H_0 as H_0: C_1^T B = 0^T, where C_1 = (a_1, ⋯, a_g)^T. Use Lagrange multipliers to obtain the MLE's for µ_i under the constraints imposed by H_0. The log likelihood (with the multiplier term) is

    ℓ = −(N/2) log((2π)^p |Σ|) − (1/2) Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − µ_i)^T Σ^{-1} (Y_ij − µ_i) + λ^T (a_1 µ_1 + ⋯ + a_g µ_g)
      = −(N/2) log((2π)^p |Σ|) − (1/2) Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)^T Σ^{-1} (Y_ij − Ȳ_i·)
        − (1/2) [ Σ_{i=1}^{g} n_i (Ȳ_i· − µ_i)^T Σ^{-1} (Ȳ_i· − µ_i) − 2 λ^T Σ_{i=1}^{g} a_i µ_i ].

Maximizing wrt µ_i gives µ̂_i = Ȳ_i· + (a_i/n_i) Σ λ̂. Further a_1 µ̂_1 + ⋯ + a_g µ̂_g = 0, so

    λ̂ = −Σ^{-1} (Σ_{i=1}^{g} a_i Ȳ_i·) / Σ_{i=1}^{g} (a_i²/n_i),

and hence

    µ̂_i = Ȳ_i· − (a_i/n_i) (Σ_{i=1}^{g} a_i Ȳ_i·) / (Σ_{i=1}^{g} a_i²/n_i),

which does not depend on Σ. Thus the MLE for Σ is (by analogy with the full case)

    Σ̂ = (1/N) [ Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)(Y_ij − Ȳ_i·)^T + (Σ a_i Ȳ_i·)(Σ a_i Ȳ_i·)^T / (Σ a_i²/n_i) ] = (1/N) [W + C],

and the LRT statistic is

    |W| / |W + C| = |I + W^{-1} C|^{-1} ∼ Λ(p, N − g, 1).

The matrix C has rank 1, so the non-zero eigenvalue of W^{-1} C is

    tr W^{-1} C = (Σ_{i=1}^{g} a_i Ȳ_i·)^T W^{-1} (Σ_{i=1}^{g} a_i Ȳ_i·) / (Σ_{i=1}^{g} a_i²/n_i)

(cf. the Mahalanobis distance when g = 2, a_1 = −a_2 = 1).

Test for Common Covariance Structure. If Y_ij ∼ N_p(µ_i, Σ_i), i = 1, ⋯, g, test H_0: Σ_1 = Σ_2 = ⋯ = Σ_g. The LRT statistic is

    −2 log Λ = N log|W/N| − Σ_{i=1}^{g} n_i log|S_i/n_i| = Σ_{i=1}^{g} n_i log|(n_i/N) S_i^{-1} W|,

where S_i is the SSP matrix of the ith sample and W = Σ_{i=1}^{g} S_i; this is approx∼ χ²_{(g−1)p(p+1)/2} if H_0 is true. A modified version of the statistic has a distribution closer to the χ². Box's M-test:

    M = γ Σ_{i=1}^{g} (n_i − 1) log|(n_i/N) S_i^{-1} W|,

where

    γ = 1 − [(2p² + 3p − 1) / (6(p + 1)(g − 1))] (Σ_{i=1}^{g} (n_i − 1)^{-1} − 1/(N − g)).
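Box's M-test, in the form given above, can be sketched as follows on synthetic data where H_0 (common covariance) holds by construction; numpy and scipy are assumed available, and the group sizes are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p, g = 2, 3
ns = np.array([25, 30, 35])
N = ns.sum()
# all groups share the identity covariance, so H0 is true by construction
samples = [rng.standard_normal((ni, p)) for ni in ns]

# group SSP matrices S_i and their sum W
S = [(Yi - Yi.mean(0)).T @ (Yi - Yi.mean(0)) for Yi in samples]
W = sum(S)

# Box's correction factor gamma, as in the notes
gamma = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1)) * (
    np.sum(1.0 / (ns - 1)) - 1.0 / (N - g))

# M = gamma * sum_i (n_i - 1) log |(n_i/N) S_i^{-1} W|
M = gamma * sum((ni - 1) * np.log(np.linalg.det(ni / N * np.linalg.inv(Si) @ W))
                for ni, Si in zip(ns, S))
df = (g - 1) * p * (p + 1) // 2
pval = stats.chi2.sf(M, df)
```

Here df = (g − 1)p(p + 1)/2 = 6, and under H_0 the p-value should typically be non-extreme.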