STAT 5611: Statistical Methodology — Multivariate Analysis

1. Multivariate Normal.

A random variable (r.v.) X has a standard normal distribution N(0, 1) if X has density

  φ(x) = (2π)^{−1/2} e^{−x²/2},  −∞ < x < ∞.

X has moment generating function

  E(e^{tX}) = ∫ e^{tx} (2π)^{−1/2} e^{−x²/2} dx = ∫ (2π)^{−1/2} e^{−(x−t)²/2} e^{t²/2} dx = e^{t²/2},

so EX = 0 and Var X = 1.

If Y = µ + σX, σ > 0, then Y ∼ N(µ, σ²), with mean µ and variance σ². Y has density (2πσ²)^{−1/2} e^{−(y−µ)²/(2σ²)} and mgf

  E(e^{tY}) = e^{tµ} E(e^{tσX}) = exp(tµ + σ²t²/2).

Let U = (U₁, …, U_p)ᵀ be a p-vector of NID(0, 1) r.v.'s. U has joint density (2π)^{−p/2} exp(−uᵀu/2), EU = 0 and Cov U = EUUᵀ = I. Write U ∼ N_p(0, I).

X has a non-singular multivariate normal distribution with mean vector µ and non-singular covariance matrix Σ if X = µ + AU, where Σ = AAᵀ (e.g. A = Σ^{1/2}).

The density of X is obtained by transforming the density of U. The Jacobian of the transformation is

  |J| = |∂(x₁, …, x_p)/∂(u₁, …, u_p)| = |Σ^{1/2}| = |Σ|^{1/2}, as |A|² = |Σ|.

Thus X has density

  (2π)^{−p/2} |Σ|^{−1/2} exp[−(x − µ)ᵀΣ^{−1}(x − µ)/2],

since uᵀu = [A^{−1}(x − µ)]ᵀ[A^{−1}(x − µ)] = (x − µ)ᵀΣ^{−1}(x − µ), as (A^{−1})ᵀA^{−1} = (AAᵀ)^{−1} = Σ^{−1}.

X has mgf

  M(t) = E(e^{tᵀX}) = E(exp(tᵀµ + tᵀAU)) = exp(tᵀµ) E(e^{aᵀU}), where aᵀ = tᵀA,
     = exp(tᵀµ) exp(aᵀa/2), as aᵀU ∼ N(0, aᵀa),
     = exp(tᵀµ + tᵀΣt/2).

This expression for the mgf is well defined even when Σ is singular, which leads to the following definition of a multivariate normal random vector.

Definition. A p-vector X is said to have a multivariate normal distribution if and only if every linear combination aᵀX is univariate normal.

Consider the mgf of aᵀX:

  ψ(t) = E(e^{t aᵀX}) = M(ta) = exp(t aᵀµ + t² aᵀΣa/2),

so aᵀX is univariate normal by the uniqueness theorem for mgf's. Conversely, if tᵀX is univariate normal for all t then E(e^{tᵀX}) = exp(tᵀµ + tᵀ Cov(X) t/2).

If Σ is singular and has rank r < p then Σ has p − r eigenvalues equal to 0.
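As a brief aside, the non-singular construction X = µ + AU can be sketched numerically. This is a minimal sketch assuming NumPy is available; the values of µ and Σ are illustrative, and any square root A with AAᵀ = Σ works — the Cholesky factor is one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Any A with A @ A.T == Sigma works; the Cholesky factor is one choice.
A = np.linalg.cholesky(Sigma)
U = rng.standard_normal((100_000, 2))   # rows are N(0, I)
X = mu + U @ A.T                        # rows are N(mu, Sigma)

print(X.mean(axis=0))                   # close to mu
print(np.cov(X, rowvar=False))          # close to Sigma
```

The symmetric square root Σ^{1/2} from an eigendecomposition would serve equally well; the distribution of X depends on A only through AAᵀ.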
There is an orthogonal matrix P such that

  PᵀΣP = ( D 0 ; 0 0 ), where D = diag(λ₁, …, λ_r), λ₁ ≥ λ₂ ≥ … ≥ λ_r > 0.

Let γ = Pᵀµ and let γ₁ be the vector formed by the first r elements of γ. X has mgf exp(tᵀµ + tᵀΣt/2). Let Y = PᵀX. Then Y has mgf

  E(e^{tᵀY}) = E(e^{tᵀPᵀX}) = exp(tᵀγ + tᵀPᵀΣPt/2) = exp(tᵀγ + t₁ᵀDt₁/2), where t₁ = (t₁, …, t_r)ᵀ.

Thus (Y₁, …, Y_r)ᵀ ∼ N_r(γ₁, D) and P(Y_j = γ_j) = 1 for j > r. Hence if X is multivariate normal with covariance matrix of rank r, there exists a linear transformation Y = G(X − µ) such that Y_{r+1} = … = Y_p = 0 and (Y₁, …, Y_r)ᵀ ∼ N_r(0, I).

Marginal and Conditional Distributions

If X ∼ N_p(µ, Σ) then aᵀX ∼ N(aᵀµ, aᵀΣa), so by choosing e_i = (0, …, 0, 1, 0, …, 0)ᵀ with 1 in the ith position we have X_i ∼ N(µ_i, σ_ii), where Σ = (σ_ij). Thus all marginal distributions are normal. Marginal normality is not enough to ensure joint normality.

Write

  X = ( X₁ ; X₂ ), µ = ( µ₁ ; µ₂ ), Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ),

where X₁ is an r-vector and X₂ is an s = p − r vector; Σ₁₁ is r × r, Σ₁₂ is r × s, etc.

Theorem 1. X₁ and X₂ are independent if and only if Σ₁₂ = 0.

Proof. Independence implies X₁ and X₂ are elementwise uncorrelated, so Σ₁₂ = 0. If Σ₁₂ = 0 then X has mgf

  M(t) = exp(tᵀµ + tᵀΣt/2) = exp(t₁ᵀµ₁ + t₁ᵀΣ₁₁t₁/2) · exp(t₂ᵀµ₂ + t₂ᵀΣ₂₂t₂/2),

and so X₁ and X₂ are independent, with X₁ ∼ N_r(µ₁, Σ₁₁) and X₂ ∼ N_s(µ₂, Σ₂₂).

Theorem 2. If X ∼ N_p(µ, Σ) then the conditional distribution of X₂ given X₁ = x is

  N( µ₂ + Σ₂₁Σ₁₁^{−1}(x − µ₁), Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂ ).

Proof. Let Y₂ = MX₁ + X₂ be a transformation such that Y₂ and X₁ are uncorrelated. Cov(Y₂, X₁) = MΣ₁₁ + Σ₂₁, so M = −Σ₂₁Σ₁₁^{−1} and

  ( X₁ ; Y₂ ) = ( I 0 ; −Σ₂₁Σ₁₁^{−1} I ) ( X₁ ; X₂ ).

Thus X₁ and Y₂ are jointly multivariate normal and uncorrelated, and so are independent. Y₂ ∼ N(µ₂ − Σ₂₁Σ₁₁^{−1}µ₁, Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂), as Cov(X₂ + MX₁) = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂. Thus X₂ = Y₂ + Σ₂₁Σ₁₁^{−1}X₁, where Y₂ and X₁ are independent.
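The decomposition in this proof can be checked by simulation. This is a sketch assuming NumPy, with an illustrative Σ: the residual Y₂ = X₂ − Σ₂₁Σ₁₁^{−1}X₁ should be uncorrelated with X₁ and have covariance Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂.

```python
import numpy as np

rng = np.random.default_rng(8)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])
A = np.linalg.cholesky(Sigma)
X = rng.standard_normal((200_000, 3)) @ A.T    # rows ~ N_3(0, Sigma)

r = 1                                          # X1 = first coordinate, X2 = the rest
Sig11, Sig12 = Sigma[:r, :r], Sigma[:r, r:]
Sig21, Sig22 = Sigma[r:, :r], Sigma[r:, r:]

# Y2 = X2 - Sig21 Sig11^{-1} X1 (rowwise) should be uncorrelated with X1,
# with covariance Sigma_{2.1} = Sig22 - Sig21 Sig11^{-1} Sig12.
Y2 = X[:, r:] - X[:, :r] @ np.linalg.solve(Sig11, Sig12)
C = np.cov(np.hstack([X[:, :r], Y2]), rowvar=False)
print(C[0, 1:])        # near 0
print(C[1:, 1:])       # near Sigma_{2.1}
```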
Hence the conditional distribution of X₂ given X₁ = x is

  N( µ₂ + Σ₂₁Σ₁₁^{−1}(x − µ₁), Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂ ).

Write Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂. Further, if

  Σ^{−1} = ( Σ¹¹ Σ¹² ; Σ²¹ Σ²² )

then Σ₂·₁^{−1} = Σ²².

Partial Correlation. If X = (X₁, X₂, X₃ᵀ)ᵀ then the best linear estimate of X₁ in terms of X₃ is bᵀX₃, where b is such that E(X₁ − bᵀX₃)² is a minimum. Assume µ = 0. Then

  E(X₁ − bᵀX₃)² = σ₁₁ − 2bᵀΣ₃₁ + bᵀΣ₃₃b
       = σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁ + (b − Σ₃₃^{−1}Σ₃₁)ᵀΣ₃₃(b − Σ₃₃^{−1}Σ₃₁)
       = σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁, when b = Σ₃₃^{−1}Σ₃₁.

Thus the best linear estimator of X₁ is Σ₁₃Σ₃₃^{−1}X₃; similarly the best linear estimator of X₂ is Σ₂₃Σ₃₃^{−1}X₃.

The partial correlation between X₁ and X₂ removing the effect of X₃ is ρ₁₂·₃₄…p, defined to be the simple correlation between X₁ − Σ₁₃Σ₃₃^{−1}X₃ and X₂ − Σ₂₃Σ₃₃^{−1}X₃. Hence

  ρ₁₂·₃₄…p = (σ₁₂ − Σ₁₃Σ₃₃^{−1}Σ₃₂) / [ (σ₁₁ − Σ₁₃Σ₃₃^{−1}Σ₃₁)^{1/2} (σ₂₂ − Σ₂₃Σ₃₃^{−1}Σ₃₂)^{1/2} ] = −σ¹² / (σ¹¹σ²²)^{1/2},

where ( σ¹¹ σ¹² ; σ²¹ σ²² ) is the leading 2 × 2 block of Σ^{−1}, since

  ( σ¹¹ σ¹² ; σ²¹ σ²² ) = [ ( σ₁₁ σ₁₂ ; σ₂₁ σ₂₂ ) − ( Σ₁₃ ; Σ₂₃ ) Σ₃₃^{−1} ( Σ₁₃ ; Σ₂₃ )ᵀ ]^{−1}.

The above definition does not require X to be multivariate normal; however, if X is multivariate normal then ρ₁₂·₃…p is just the simple correlation coefficient between X₁ and X₂ in their conditional distribution given X₃.

Recall that if only one variable is adjusted for, the simple calculation formula for the sample partial correlation, given correlation matrix R = (r_ij), is

  r₁₂·₃ = (r₁₂ − r₁₃r₂₃) / √[(1 − r₁₃²)(1 − r₂₃²)].

Maximum Likelihood Estimates for µ and Σ

Suppose X₁, …, X_n are independent N_p(µ, Σ) random vectors. The likelihood is

  L(µ, Σ) = ∏_{i=1}^n (2π)^{−p/2}|Σ|^{−1/2} exp[−(X_i − µ)ᵀΣ^{−1}(X_i − µ)/2]
     = (2π)^{−np/2}|Σ|^{−n/2} exp[−½ Σ_{i=1}^n tr(Σ^{−1}(X_i − µ)(X_i − µ)ᵀ)]
     = (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr Σ^{−1}{ Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ + n(X̄ − µ)(X̄ − µ)ᵀ }],

where X̄ = n^{−1} Σ_{i=1}^n X_i. Let S = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ. X̄ and S are clearly sufficient for µ and Σ. The maximum over µ is

  (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr(Σ^{−1}S)), attained at µ̂ = X̄.

Let λ₁, …, λ_p be the eigenvalues of ΣS^{−1}; λ_i > 0, as S and Σ are positive definite.
Then tr(Σ^{−1}S) = Σ_{i=1}^p λ_i^{−1} and |Σ| = |ΣS^{−1}||S| = (∏_{i=1}^p λ_i)|S|, so

  L(X̄, Σ) = (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p ( λ_i^{−n/2} e^{−λ_i^{−1}/2} ).

Let f(x) = log(xⁿ e^{1/x}) = n log x + 1/x. We can show that f(x) ≥ n − n log n, with equality at x = 1/n. Hence

  L(X̄, Σ) = (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p [ λ_iⁿ e^{1/λ_i} ]^{−1/2}
     ≤ (2π)^{−np/2}|S|^{−n/2} ∏_{i=1}^p [ n^{−n} eⁿ ]^{−1/2}
     = (2π)^{−np/2}|n^{−1}S|^{−n/2} e^{−np/2},

and this maximum is achieved if ΣS^{−1} = n^{−1}I, i.e. Σ̂ = n^{−1}S. Hence the MLE for µ is µ̂ = X̄ and the MLE for Σ is n^{−1}S.

2. Hypothesis Testing

Given samples from multivariate normal populations we can develop tests using the likelihood ratio approach. Let L(x, θ) = ∏_{i=1}^n f(x_i, θ) denote the likelihood, where θ ∈ Θ and Θ denotes the parameter space. To test H₀: θ ∈ Θ₀ against the general alternative, the LRT statistic is

  Λ = sup_{θ∈Θ₀} L(x, θ) / sup_{θ∈Θ} L(x, θ).

Under regularity conditions that are always satisfied by the multivariate normal distribution, −2 log Λ →_D χ²_r, where r = dim Θ − dim Θ₀, the number of linearly independent parameters constrained by H₀.

Let X_i ∼ NID_p(µ, Σ), i = 1, …, n.

1. Test H₀: µ = µ₀ vs H₁: µ ≠ µ₀, Σ known.

The MLEs for µ and Σ are X̄ and n^{−1}S, so

  sup_µ L(x, µ, Σ) = (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr Σ^{−1}S).

If H₀ is true,

  L(x, µ₀, Σ) = (2π)^{−np/2}|Σ|^{−n/2} exp(−½ tr Σ^{−1}S − (n/2)(x̄ − µ₀)ᵀΣ^{−1}(x̄ − µ₀)),

so −2 log Λ = n(X̄ − µ₀)ᵀΣ^{−1}(X̄ − µ₀) ∼ χ²_p, exactly.

2. Test H₀: µ = µ₀ vs H₁: µ ≠ µ₀, Σ unknown.

  sup L(x, µ, Σ) = (2π)^{−np/2}|S/n|^{−n/2} exp(−np/2),

and if µ = µ₀,

  sup_Σ L(x, µ₀, Σ) = sup_Σ (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr Σ^{−1}S − ½ tr(Σ^{−1} n(x̄ − µ₀)(x̄ − µ₀)ᵀ)]
        = sup_Σ (2π)^{−np/2}|Σ|^{−n/2} exp[−½ tr nΣ^{−1}(S/n + ddᵀ)],

where d = x̄ − µ₀. By analogy with the MLE derivation,

  sup_Σ L(x, µ₀, Σ) = (2π)^{−np/2}|S/n + ddᵀ|^{−n/2} exp(−np/2),

so

  −2 log Λ = n log( |S + nddᵀ| / |S| ) = n log(1 + ndᵀS^{−1}d).

To establish this last equality we need the following result.

Lemma. For any square matrix M = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), assume M₁₁^{−1} exists and let M₂·₁ = M₂₂ − M₂₁M₁₁^{−1}M₁₂. Then |M| = |M₁₁||M₂·₁|.

Proof. Let B = ( I 0 ; −M₂₁M₁₁^{−1} I ).
Note |B| = 1 and BM = ( M₁₁ M₁₂ ; 0 M₂·₁ ), which is block triangular, so |M| = |BM| = |M₁₁||M₂·₁|.

Applying the Lemma to

  ( S d ; −dᵀ 1 ) ( I 0 ; dᵀ 1 ) = ( S + ddᵀ d ; 0ᵀ 1 )

gives |S|(1 + dᵀS^{−1}d) = |S + ddᵀ|.

Thus the statistic for testing H₀: µ = µ₀ is Hotelling's T²:

  T² = n(X̄ − µ₀)ᵀ( S/(n−1) )^{−1}(X̄ − µ₀).

We will prove that the exact distribution of T² is ((n−1)p/(n−p)) F_{p,n−p}.

3. Test H₀: Σ = Σ₀, µ unknown.

The MLE for µ is X̄, so if H₀ is true

  sup_µ L(x, µ, Σ₀) = (2π)^{−np/2}|Σ₀|^{−n/2} exp(−½ tr(Σ₀^{−1}S)),

whereas without restrictions

  sup L(x, µ, Σ) = (2π)^{−np/2} |S/n|^{−n/2} exp(−np/2).

The LRT statistic is

  −2 log Λ = −n log|Σ₀^{−1}S| − np + tr(Σ₀^{−1}S) + np log n
      = −n log|Σ₀^{−1}S/n| − np + n tr(Σ₀^{−1}S/n).

If Σ₀^{−1}S/n has eigenvalues λ₁ ≥ … ≥ λ_p, a = Σ_{i=1}^p λ_i and g = ∏_{i=1}^p λ_i, then the test statistic is

  n(a − p − log g) ∼ χ²_{p(p+1)/2}, for large n.

4. Test for Independence. H₀: Σ₁₂ = 0, µ unknown, where Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ), Σ₁₁ is p₁ × p₁, Σ₂₂ is p₂ × p₂ and p₁ + p₂ = p. If H₀ is true then we can treat the data as two independent samples:

  Σ̂ = (1/n) ( S₁₁ 0 ; 0 S₂₂ ), µ̂ = X̄.

The LRT statistic is −2 log Λ = −n log|Σ̂^{−1}S/n| − np + tr(Σ̂^{−1}S), and

  Σ̂^{−1}S = n ( S₁₁ 0 ; 0 S₂₂ )^{−1} ( S₁₁ S₁₂ ; S₂₁ S₂₂ ) = n ( I S₁₁^{−1}S₁₂ ; S₂₂^{−1}S₂₁ I ),

so tr(Σ̂^{−1}S) = np. Thus

  −2 log Λ = −n log|Σ̂^{−1}S/n| = −n log( |S| / (|S₁₁||S₂₂|) ) = −n log( |S₂·₁| / |S₂₂| ) = −n log|I − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂|.

We will show Λ^{2/n} = |S₂·₁| / |(S₂₂ − S₂·₁) + S₂·₁| ∼ Λ(p₂, n − p₁ − 1, p₁), Wilks' Lambda, and

  −[n − (p+3)/2] log Λ^{2/n} ≈ χ²_{p₁p₂}, for large n.

If p₂ = 1 and p₁ = p − 1 then the above statistic simplifies to −2 log Λ = −n log(1 − R²), where R is the multiple correlation coefficient between X_p and (X₁, …, X_{p−1}), with R² = S₂₁S₁₁^{−1}S₁₂/s_pp. Then Λ^{2/n} = 1 − R² ∼ Λ(1, n − p, p − 1), and we can show that

  R²/(1 − R²) ∼ ((p−1)/(n−p)) F_{p−1,n−p}, exactly.

3. Results on Quadratic Forms.

Let X ∼ N_p(0, I) and let C be a square, symmetric matrix.

Theorem 3. XᵀCX ∼ χ²_r if and only if C is idempotent of rank r.

Proof.
If C is idempotent of rank r then C = UUᵀ, where U is p × r with r orthonormal columns. Let Y = UᵀX, so Y ∼ N_r(0, I) and YᵀY = XᵀCX ∼ χ²_r.

Conversely, if XᵀCX ∼ χ²_r then XᵀCX has mgf (1 − 2t)^{−r/2}. There exists an orthogonal matrix P such that PᵀCP = diag(α₁, …, α_p). Let W = PᵀX, so

  XᵀCX = WᵀPᵀCPW = Σ_{j=1}^p α_j W_j².

But W is a vector of NID(0, 1) r.v.'s, so W_j² ∼ χ²₁ and Σ_j α_j W_j² has mgf

  E exp(t Σ_{j=1}^p α_j W_j²) = ∏_{j=1}^p E(e^{tα_j W_j²}) = ∏_{j=1}^p (1 − 2tα_j)^{−1/2}.

But mgf's are unique, so

  ∏_{j=1}^p (1 − 2tα_j)^{−1/2} = (1 − 2t)^{−r/2}, i.e. (1 − 2t)^r = ∏_{j=1}^p (1 − 2tα_j).

By comparing coefficients of powers of t we must have r of the α_j equal to 1 and p − r equal to 0. Thus all eigenvalues of C are either 0 or 1, and so C is idempotent of rank r.

Theorem 4 (Craig's Theorem). If A and B are symmetric, non-negative definite matrices and X ∼ N_p(0, I), then XᵀAX and XᵀBX are independent if and only if AB = 0.

Proof. XᵀAX has mgf

  (2π)^{−p/2} ∫⋯∫ e^{txᵀAx} e^{−xᵀx/2} dx = (2π)^{−p/2} ∫⋯∫ exp(−½ xᵀ(I − 2At)x) dx = |I − 2At|^{−1/2}.

XᵀAX and XᵀBX are independent iff E(e^{t₁XᵀAX + t₂XᵀBX}) = E(e^{t₁XᵀAX}) E(e^{t₂XᵀBX}), i.e.

  |I − 2t₁A − 2t₂B|^{−1/2} = |I − 2t₁A|^{−1/2} |I − 2t₂B|^{−1/2}.  (1)

The RHS is |I − 2t₁A − 2t₂B + 4t₁t₂AB|^{−1/2}, so if AB = 0 then (1) holds and the two quadratic forms are independent.

Now suppose the quadratic forms are independent. Let P be an orthogonal matrix such that

  A* = PᵀAP = diag(α₁, …, α_a, 0_{(p−a)}), B* = PᵀBP.

Let Y = PᵀX, so XᵀAX = Σ_{i=1}^a α_i Y_i² and XᵀBX = Σ_i Σ_j b*_ij Y_i Y_j. Independence gives E(XᵀAX) E(XᵀBX) = E(XᵀAX · XᵀBX), i.e.

  E(Σ_{i=1}^a α_i Y_i²) · E(Σ_i Σ_j b*_ij Y_i Y_j) = E[ (Σ_{i=1}^a α_i Y_i²)(Σ_j Σ_k b*_kj Y_k Y_j) ].

But Y₁, …, Y_p are independent with EY_i³ = 0 and EY_i⁴ = 3, so

  (Σ_{i=1}^a α_i)(Σ_j b*_jj) = 3 Σ_{i=1}^a α_i b*_ii + Σ_{i≠j} α_i b*_jj.

Thus Σ_{i=1}^a α_i b*_ii = 0. But α_i > 0 for i = 1, …, a, so b*_ii = 0, i = 1, …, a. (We must have b*_ii ≥ 0, as B* is non-negative definite.)

Since B* is non-negative definite, aᵀB*a ≥ 0 for any vector a. For i ≤ a and any j with b*_jj > 0, choose a_i = √(b*_jj), a_j = −b*_ij/√(b*_jj) and all other a_k = 0.
Then

  aᵀB*a = b*_jj b*_ii + (b*_ij)² − 2(b*_ij)² ≥ 0, i.e. b*_ii b*_jj ≥ (b*_ij)²,

so b*_ij = 0 if i or j is among 1, 2, …, a. Hence

  B* = ( 0 0 ; 0 B*₂₂ )

and so A*B* = 0. But A*B* = PᵀAPPᵀBP = PᵀABP, so A*B* = 0 implies AB = 0.

Lemma 1. If A and B are non-negative definite, symmetric matrices, then AB = 0 if and only if tr(AB) = 0.

Proof: AB = 0 implies tr(AB) = 0. Using the above notation, A* = PᵀAP = diag(α₁, …, α_a, 0, …, 0) and B* = PᵀBP, so

  tr AB = tr(PA*PᵀPB*Pᵀ) = tr(PA*B*Pᵀ) = tr A*B*, as PᵀP = I,
     = Σ_{i=1}^a α_i b*_ii = 0 by assumption.

But α_i > 0, so b*_ii = 0, i = 1, …, a, and so again b*_ij = 0 if i or j is among 1, …, a. Thus A*B* = 0, which in turn implies AB = 0.

4. The Wishart Distribution.

If X is an n × p matrix with rows NID_p(0, Σ), then the p × p matrix M = XᵀX is said to have a Wishart distribution: M ∼ W_p(Σ, n), where Σ is the scale matrix and n the degrees of freedom. If p = 1 then W₁(Σ, n) = W₁(σ², n) = σ²χ²_n. If X = (X₁, …, X_n)ᵀ then M = Σ_{i=1}^n X_i X_iᵀ, so EM = nΣ. Assume Σ is positive definite.

Theorem 5. If M ∼ W_p(Σ, n) and B is p × q, then BᵀMB ∼ W_q(BᵀΣB, n).

Proof. BᵀMB = BᵀXᵀXB = YᵀY, where Y = XB. The rows of Y are NID_q(0, BᵀΣB) and the result follows.

Corollary. The diagonal submatrices of M have Wishart distributions.

Theorem 6. If M ∼ W_p(Σ, n) and a is any fixed p-vector such that aᵀΣa > 0, then aᵀMa/(aᵀΣa) ∼ χ²_n.

Proof. aᵀMa ∼ W₁(aᵀΣa, n) from Theorem 5.

Corollary. M_ii ∼ σ_ii χ²_n, where Σ = (σ_ij).

The converse to Theorem 6 is not true.

Theorem 7. If M₁ ∼ W_p(Σ, n₁) and M₂ ∼ W_p(Σ, n₂) independently of M₁, then M₁ + M₂ ∼ W_p(Σ, n₁ + n₂).

Proof. Write M_i = X_iᵀX_i, where X_i has n_i independent rows drawn from N_p(0, Σ) populations, so

  M₁ + M₂ = X₁ᵀX₁ + X₂ᵀX₂ = XᵀX, where X = ( X₁ ; X₂ ).

The Wishart arises naturally in connection with the sample covariance matrix.

Theorem 8. If X is an n × p matrix with rows independent N_p(µ, Σ), then S = XᵀX − nX̄X̄ᵀ ∼ W_p(Σ, n − 1), independently of X̄ ∼ N_p(µ, n^{−1}Σ).

Proof.
  S = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ = Σ_{i=1}^n X_i X_iᵀ − nX̄X̄ᵀ, X_i ∼ N_p(µ, Σ).

Let P be an orthogonal matrix with first row (1/√n, …, 1/√n). Let Y = PX, so Y₁ = √n X̄, where Y has rows Y₁ᵀ, …, Y_nᵀ. Thus

  S = XᵀX − nX̄X̄ᵀ = YᵀPPᵀY − Y₁Y₁ᵀ = YᵀY − Y₁Y₁ᵀ = Σ_{i=2}^n Y_i Y_iᵀ ∼ W_p(Σ, n − 1),

since Y_i ∼ N_p(0, Σ), i = 2, …, n, independently of Y₁ = √n X̄.

Theorem 9. If X has rows which are NID_p(0, Σ), then XᵀAX ∼ W_p(Σ, r) if and only if A is idempotent of rank r.

Proof. If XᵀAX ∼ W_p(Σ, r) then aᵀXᵀAXa/(aᵀΣa) ∼ χ²_r for any a with aᵀΣa > 0, from Theorem 6. From Theorem 3, A is idempotent of rank r, as Xa/(aᵀΣa)^{1/2} ∼ N(0, I).

Conversely, if A is idempotent then A = UUᵀ, where U is n × r with orthonormal columns. Let Y = UᵀX, so Y has r rows NID_p(0, Σ) and XᵀAX = YᵀY ∼ W_p(Σ, r).

Theorem 10. If X has n rows which are NID_p(0, Σ), then for any symmetric, non-negative definite matrices A and B, XᵀAX and XᵀBX are independent if and only if AB = 0.

Proof. Assume XᵀAX and XᵀBX are independent. Choose a such that aᵀΣa > 0 and let Y = Xa/√(aᵀΣa). Then Y ∼ N_n(0, I), and YᵀAY and YᵀBY are independent, so by Craig's Theorem AB = 0.

If AB = 0 then there exists an orthogonal matrix P such that PᵀAP = diag(α₁, …, α_a, 0, …, 0) and PᵀBP = diag(0, …, 0, β₁, …, β_{n−a}). Set Y = PᵀX and the result follows.

Theorem 11. If X has n rows which are NID_p(0, Σ) and

  M = XᵀX = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), where M₁₁ is a × a,

then M₂·₁ = M₂₂ − M₂₁M₁₁^{−1}M₁₂ ∼ W_{p−a}(Σ₂·₁, n − a) and is independent of (M₁₁, M₁₂). (Recall Σ₂·₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂.)

Proof. Write X = (X₁ | X₂), X₁ n × a, X₂ n × (p − a), so

  M = XᵀX = ( X₁ᵀ ; X₂ᵀ )(X₁ X₂), M₂·₁ = X₂ᵀX₂ − X₂ᵀX₁M₁₁^{−1}X₁ᵀX₂ = X₂ᵀPX₂,

where P = I − X₁M₁₁^{−1}X₁ᵀ. P is idempotent of rank n − a, as tr P = n − tr X₁M₁₁^{−1}X₁ᵀ = n − a. Also X₁ᵀP = 0, so X₂ᵀPX₂ = X₂·₁ᵀPX₂·₁, where X₂·₁ = X₂ − X₁Σ₁₁^{−1}Σ₁₂. X₂·₁ is an n × (p − a) matrix whose rows are independent N_{p−a}(0, Σ₂·₁), conditional on X₁. Thus, given X₁, M₂·₁ ∼ W_{p−a}(Σ₂·₁, n − a), using Theorem 9.
But this conditional distribution does not depend on X₁, and so it is also the marginal distribution of M₂·₁. Hence M₂·₁ is independent of X₁, and therefore of M₁₁ = X₁ᵀX₁. Further, P(I − P) = P − P = 0, so, given X₁, M₂·₁ = X₂·₁ᵀPX₂·₁ and (I − P)X₂·₁ = X₁M₁₁^{−1}M₁₂ − X₁Σ₁₁^{−1}Σ₁₂ are independent, and hence are unconditionally independent. Thus M₂·₁ is independent of (M₁₁, M₁₂), as M₁₂ = X₁ᵀ(I − P)X₂·₁ + M₁₁Σ₁₁^{−1}Σ₁₂.

Theorem 12. If M ∼ W_p(Σ, n), n > p, then

a) aᵀΣ^{−1}a / aᵀM^{−1}a ∼ χ²_{n−p+1} for any fixed a ≠ 0; in particular σ^{ii}/M^{ii} ∼ χ²_{n−p+1}, where σ^{ii} and M^{ii} denote diagonal elements of Σ^{−1} and M^{−1};

b) M^{ii} is independent of all elements of M except M_ii.

Proof. From Theorem 11, taking a = p − 1 as the size of the leading block,

  (M^{pp})^{−1} ∼ (σ^{pp})^{−1} χ²_{n−p+1},

and (M^{pp})^{−1} = M₂·₁ is independent of (M₁₁, M₁₂), i.e. M^{pp} is independent of all elements except M_pp. The general result follows by permuting rows and columns.

Given a fixed vector a, let A be a non-singular matrix with last column a. Set N = A^{−1}M(A^{−1})ᵀ, so N ∼ W_p(A^{−1}Σ(A^{−1})ᵀ, n) from Theorem 5. N^{−1} = AᵀM^{−1}A, so N^{pp} = aᵀM^{−1}a and

  (N^{pp})^{−1} = (aᵀM^{−1}a)^{−1} ∼ [(AᵀΣ^{−1}A)^{pp}]^{−1} χ²_{n−p+1} = (aᵀΣ^{−1}a)^{−1} χ²_{n−p+1}.

Theorem 13. If M ∼ W_p(Σ, n), n ≥ p, then |M| = |Σ| · ∏_{i=1}^p U_i, where the U_i are independent random variables, U_i ∼ χ²_{n−i+1}.

Proof. Use induction on p. If p = 1, M ∼ σ²χ²_n. Write M = ( M₁₁ M₁₂ ; M₂₁ M₂₂ ), with M₁₁ of order p − 1 and M₂₂ a scalar, and assume |M₁₁| = |Σ₁₁| ∏_{i=1}^{p−1} U_i, U_i ∼ χ²_{n−i+1}, as M₁₁ ∼ W_{p−1}(Σ₁₁, n). From Theorem 11, M₁₁ is independent of M₂·₁ ∼ Σ₂·₁ χ²_{n−p+1} (here M₂·₁ is a scalar). Since |M| = |M₁₁||M₂·₁|, the result follows.

Corollary 1. If M ∼ W_p(Σ, n) and Σ is positive definite, then M is positive definite with probability 1.

Proof. |Σ| > 0, so P(|M| > 0) = 1. Thus if |Σ| > 0 then the sample covariance matrix is positive definite (i.e. S^{−1} exists with probability 1).

Corollary 2. If X has n rows which are NID_p(0, Σ) and S = XᵀX − nX̄X̄ᵀ, then

a) σ^{kk}/S^{kk} ∼ χ²_{n−p};
b) aᵀΣ^{−1}a / aᵀS^{−1}a ∼ χ²_{n−p} for fixed a ≠ 0;
c) if S₁₁ is a q × q submatrix of S, then S₁₁ ∼ W_q(Σ₁₁, n − 1), independently of S₂·₁ = S₂₂ − S₂₁S₁₁^{−1}S₁₂ ∼ W_{p−q}(Σ₂·₁, n − q − 1).

5. Hotelling's T² and Wilks' Λ.
If d ∼ N_p(0, I) independently of M ∼ W_p(I, n), then ndᵀM^{−1}d ∼ T²(p, n).

Theorem 14. If X ∼ N_p(µ, Σ) and M ∼ W_p(Σ, n) independently of X, then n(X − µ)ᵀM^{−1}(X − µ) ∼ T²(p, n).

Proof. Let d* = Σ^{−1/2}(X − µ) and M* = Σ^{−1/2}MΣ^{−1/2}; then n(X − µ)ᵀM^{−1}(X − µ) = nd*ᵀM*^{−1}d*, with d* ∼ N_p(0, I) independent of M* ∼ W_p(I, n).

Theorem 15. T²(p, n) ∼ (np/(n − p + 1)) F_{p,n−p+1}.

Proof. Write T²(p, n) = n[dᵀM^{−1}d / dᵀd] · (dᵀd), where d ∼ N_p(0, I) independently of M ∼ W_p(I, n). From Theorem 12, given d, dᵀd / dᵀM^{−1}d ∼ χ²_{n−p+1}; since this distribution does not depend on d, it is also the marginal distribution, and moreover dᵀd/dᵀM^{−1}d is independent of d. Also dᵀd ∼ χ²_p, so

  T²(p, n) =_D n χ²_p / χ²_{n−p+1} =_D (np/(n − p + 1)) F_{p,n−p+1}.

Corollary 1. If X̄ is the mean of a sample of size n drawn from an N_p(µ, Σ) population and (n − 1)^{−1}S is the sample covariance matrix, then

  n(X̄ − µ)ᵀ( S/(n−1) )^{−1}(X̄ − µ) ∼ ((n−1)p/(n−p)) F_{p,n−p}.

Corollary 2. If M ∼ W_p(Σ, n) independently of d ∼ N_p(0, Σ), then |M|/|M + ddᵀ| ∼ Beta((n − p + 1)/2, p/2).

Proof. Note

  ( M d ; −dᵀ 1 ) ( I 0 ; dᵀ 1 ) = ( M + ddᵀ d ; 0 1 ),

so |M + ddᵀ| = |( M d ; −dᵀ 1 )| = |M|(1 + dᵀM^{−1}d), and thus |M|/|M + ddᵀ| = (1 + dᵀM^{−1}d)^{−1} =_D (1 + T²/n)^{−1}. Now T² ∼ (np/(n − p + 1)) F_{p,n−p+1} = nU/V, where U ∼ χ²_p independently of V ∼ χ²_{n−p+1}, so

  n/(n + T²) =_D V/(U + V) ∼ Beta((n − p + 1)/2, p/2).

Wilks' Λ. Let A ∼ W_p(Σ, m) independently of B ∼ W_p(Σ, n). Define Λ(p, m, n) = |A|/|A + B|. The distribution of Λ does not depend on Σ.

Theorem 16. Λ(p, m, n) ∼ ∏_{i=1}^n U_i, where U₁, …, U_n are independent r.v.'s, U_i ∼ Beta((m + i − p)/2, p/2), i = 1, …, n.

Proof outline. Write B = XᵀX = Σ_{i=1}^n X_i X_iᵀ, where the X_i ∼ NID(0, I) are the rows of X (we may take Σ = I). Let M_i = M_{i−1} + X_i X_iᵀ, M₀ = A, so

  Λ(p, m, n) = |M₀|/|M_n| = (|M₀|/|M₁|)(|M₁|/|M₂|) ⋯ (|M_{n−1}|/|M_n|),

and from Corollary 2 to Theorem 15, U_i = |M_{i−1}|/|M_i| ∼ Beta((m + i − p)/2, p/2). It remains to prove that the U_i are independent. Note U_i = (1 + X_iᵀM_{i−1}^{−1}X_i)^{−1}, so we need to show that X_iᵀM_{i−1}^{−1}X_i is independent of M_i (see Mardia, Kent and Bibby, p. 76, for details).

Bartlett's Approximation.

  −(m − ½(p − n + 1)) log Λ(p, m, n) ≈ χ²_{np} for large n.

Proof outline.
Let ρ = (m − ½(p − n + 1))/2 and

  U = −2ρ log Λ(p, m, n) = −2ρ Σ_{i=1}^n log U_i, where U_i ∼ Beta((m+i−p)/2, p/2).

Then

  E(e^{−tU}) = ∏_{i=1}^n E(exp(2tρ log U_i)) = ∏_{i=1}^n E(U_i^{2tρ}),

and

  E(U_i^γ) = [1/B((m+i−p)/2, p/2)] ∫₀¹ u^γ u^{(m+i−p)/2−1}(1−u)^{p/2−1} du
      = Γ((m+i)/2) Γ((m+i−p)/2 + γ) / [ Γ((m+i−p)/2) Γ((m+i)/2 + γ) ].

Using Stirling's approximation Γ(x) ∼ √(2π(x−1)) [(x−1)/e]^{x−1}, we obtain

  log[Γ(x + ℓ)/Γ(x)] = ℓ log x + ℓ(ℓ−1)/(2x) + O(x^{−2}).

Hence

  log E(e^{−tU}) = Σ_{i=1}^n [ log Γ((m+i−p)/2 + 2tρ)/Γ((m+i−p)/2) − log Γ((m+i)/2 + 2tρ)/Γ((m+i)/2) ].

Writing each Gamma ratio in the form Γ(x + ℓ)/Γ(x), with x = (1 + 2t)ρ in the shifted terms and x = ρ in the unshifted terms, and applying the expansion above, the leading terms contribute −(np/2) log(1 + 2t), while the O(ρ^{−1}) terms cancel exactly by the choice of ρ:

  log E(e^{−tU}) = −(np/2) log(1 + 2t) + O(ρ^{−2}).

Thus E(e^{−tU}) = (1 + 2t)^{−np/2}(1 + O(ρ^{−2})), and ρ → ∞ as n, m → ∞. Therefore U has a distribution that is closely approximated by a χ²_{np} density.

6. Factor Analysis.

In factor analysis we model p observed variables in terms of m underlying but unobservable factors. This modelling approach is widely used in psychology and education, where highly correlated responses are linked to common factors like intelligence or ability. The interpretation of the factors is often subjective.

Model: X = LF + µ + ε, where X is a p-vector, F is an m-vector of independent random variables with means 0 and variances equal to 1, L is a p × m loading matrix and ε is the error or specific factor vector. F and ε are independent, E(ε) = 0, and Cov(ε) = Ψ = diag(ψ₁, …, ψ_p). Let L = (l_ij), where l_ij is the loading of the ith variable, X_i, on the jth factor, F_j. Then

  E(X) = µ and Cov(X) = Σ = LLᵀ + Ψ.
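The covariance structure Σ = LLᵀ + Ψ can be sketched directly. This is a minimal sketch assuming NumPy; the loadings and specific variances below are hypothetical. The diagonal of Σ splits into a common-factor part Σ_j l_ij² plus the specific variance ψ_i, and off-diagonal covariances come entirely from the loadings.

```python
import numpy as np

# Hypothetical loadings (p = 4 variables, m = 2 factors) and specific variances.
L = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.6]])
psi = np.array([0.3, 0.4, 0.2, 0.5])

Sigma = L @ L.T + np.diag(psi)            # Cov(X) = L L' + Psi

h2 = (L**2).sum(axis=1)                   # common-factor part of each variance
print(np.diag(Sigma))                     # sigma_ii = sum_j l_ij^2 + psi_i
print(Sigma[0, 1], (L[0] * L[1]).sum())   # Cov(X_i, X_k) = sum_j l_ij l_kj
```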
  Var(X_i) = σ_ii = Σ_{j=1}^m l_ij² + ψ_i,

and h_i² = Σ_{j=1}^m l_ij² is called the communality; it is the component of the variance of X_i due to the common factors. Further,

  Cov(X_i, X_k) = Σ_{j=1}^m l_ij l_kj and Cov(X_i, F_j) = l_ij.

X does not determine F and ε uniquely. If G is an orthogonal matrix then X = LGᵀGF + µ + ε, with E(GF) = 0 and Cov(GF) = I. If F_i ∼ NID(0, 1) then GF is another vector of independent standard normal variables, so the factor loadings are only determined up to an orthogonal transformation. Hence factors can be rotated by G and the model will still hold.

To specify a factor loading structure we need to impose further constraints. Generally we ask for LᵀΨ^{−1}L to be diagonal with elements in decreasing order.

The factor model cannot be applied to any covariance matrix (unless we let m = p), and it is only useful if it leads to a simplification of the structure, i.e. if m is small relative to p. The original covariance matrix consists of p(p+1)/2 parameters, whereas the factor model with constraints has pm + p − m(m−1)/2 freely determined parameters; the difference is ½(p − m)² − ½(p + m).

If Σ (or the sample covariance matrix S) is close to diagonal then the specific factors will dominate, and trying to fit a factor model is not worthwhile. Factor analysis is most useful when there is a reasonable number of highly correlated variables.

We will now look at methods for estimating L and Ψ given n observations.

Principal Component Method. Recall the spectral decomposition

  Σ = λ₁e₁e₁ᵀ + … + λ_p e_p e_pᵀ = LLᵀ, where L = [√λ₁ e₁, …, √λ_p e_p]

is a p × p matrix; thus the factor model holds with m = p. If the smallest eigenvalues are close to 0, then we can use L₁ = [√λ₁ e₁, …, √λ_m e_m] as the loading matrix and set ψ_i to be the ith diagonal element of Σ − L₁L₁ᵀ.

The factor model structure is unaffected if we work with standardized variables, so in practice we fit L₁ via the sample correlation matrix R:

  L̂₁ = [√λ̂₁ ê₁, …, √λ̂_m ê_m], where λ̂₁ ≥ ⋯
≥ λ̂_p are the eigenvalues of R, with corresponding orthonormal eigenvectors ê₁, …, ê_p. The communality is ĥ_i² = Σ_{j=1}^m l̂_ij², and the contribution to the total variance from the first factor is

  Σ_{i=1}^p l̂_i1² = (√λ̂₁ ê₁)ᵀ(√λ̂₁ ê₁) = λ̂₁.

Note that there are problems in reproducing L₁ across independent samples if the underlying correlation matrix has repeated eigenvalues: the estimated eigenvectors will not necessarily be close to one another, so the corresponding loadings can be quite different across replicates.

Maximum Likelihood Factor Analysis. If X_i ∼ NID_p(µ, Σ) then we can estimate L and Ψ via maximum likelihood. Recall µ̂ = X̄. The maximum likelihood equations for L and Ψ, where Σ = LLᵀ + Ψ and LᵀΨ^{−1}L is diagonal with elements in decreasing order, do not have an explicit solution unless Ψ = kI for some k; the log likelihood is maximized numerically via an iterative procedure. MLE and PC methods do not necessarily give the same answers. If the normal assumption is reasonable then MLE is preferred.

Factor Rotation. The constraint that LᵀΨ^{−1}L be diagonal is for mathematical convenience. To make interpretation of the factors easier we rotate the factors by an orthogonal matrix T so that LT has a few large values and the remaining values are near 0. To quantify this we need a measure of 'large vs small'. Kaiser (1958) suggested the varimax criterion. Let l̂′_ij = l̂_ij/ĥ_i, where ĥ_i² = Σ_{j=1}^m l̂_ij²; the l̂′_ij are the scaled factor loadings. The varimax procedure selects the orthogonal matrix T that makes

  V = (1/p) Σ_{j=1}^m [ Σ_{i=1}^p l̂′_ij⁴ − ( Σ_{i=1}^p l̂′_ij² )² / p ]

as large as possible. V is the sum over factors of the variances of the squared loadings; maximizing V corresponds to spreading out the loadings on each factor as far as possible.

Factor Scores. In IQ tests, various responses to questions (the X_i's) are recorded with a view to estimating the value of f, the underlying intelligence factor.
Once a factor analysis model has been established, we can obtain the factor scores f₁, f₂, …, f_m corresponding to the values of the unobserved random variables. If µ, L and Ψ are known then X − µ = LF + ε, so estimating F is like estimating the parameters in a linear model where the errors have unequal variances, Cov(ε) = Ψ. The weighted least squares estimate for F is

  F̂ = (LᵀΨ^{−1}L)^{−1}LᵀΨ^{−1}(X − µ).  (2)

If we substitute the estimates for L, Ψ and µ then we get Bartlett's factor scores. The estimates from (2) are reasonable whether or not X and ε are multivariate normal. Other methods exist that rely on the normality assumption (see reference text Johnson and Wichern).

A major concern in practice is whether the factor model is sensible and, if so, what the value of m is. Likelihood ratio tests can be used to test whether Σ = LLᵀ + Ψ is a suitable model. Given X₁, …, X_n ∼ NID(µ, Σ), test H₀: Σ = LLᵀ + Ψ, where L is p × m. Using a LRT approach and the MLEs for L and Ψ we get

  −2 log Λ = n log( |Σ̂| / |S/n| ) + n[ tr(Σ̂^{−1}S/n) − p ].

If the MLEs are used it can be shown that tr(Σ̂^{−1}S/n) = p, so the test statistic is

  n log( |L̂L̂ᵀ + Ψ̂| / |S/n| ).

Bartlett suggested replacing the leading n by n − 1 − (2p + 4m + 5)/6. The asymptotic distribution of the statistic is χ² with [(p − m)² − (p + m)]/2 degrees of freedom.

7. Canonical Correlation Analysis

Given two random vectors X and Y, what are the linear combinations η = aᵀX and φ = bᵀY which have maximum correlation? Canonical correlation analysis enables us to examine the relationships between two groups of variables, whereas principal components analysis looks at the relationships within a single set of variables.

Let EX = µ, Cov(X) = Σ₁₁, a p × p matrix, Cov(X, Y) = Σ₁₂, a p × q matrix, and EY = ν, Cov(Y) = Σ₂₂. The correlation between η and φ is

  ρ(a, b) = aᵀΣ₁₂b / √(aᵀΣ₁₁a · bᵀΣ₂₂b).

Assume Σ₁₁ and Σ₂₂ are non-singular and let c = Σ₁₁^{1/2}a, d = Σ₂₂^{1/2}b. Then we want to maximise |cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}d| subject to cᵀc = dᵀd = 1.
This is equivalent to maximising

  dᵀΣ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}d.

Using the Maximization Lemma, holding c fixed and maximizing over d, the maximum is the largest eigenvalue of

  Σ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1/2}.

This matrix has rank 1, so the largest eigenvalue is the trace, i.e. cᵀΣ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2}c, attained at d = Σ₂₂^{−1/2}Σ₂₁Σ₁₁^{−1/2}c (suitably normalised). We now want to maximise this expression with respect to c subject to cᵀc = 1. The maximum is the largest eigenvalue of

  Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2},  (3)

which is the same as the maximum eigenvalue of Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁. Call this value ρ².

Note:

1. Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ and Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2} have the same eigenvalues.

2. Let ρ₁² ≥ ρ₂² ≥ … ≥ ρ_p² be the eigenvalues of Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ with associated eigenvectors a₁, …, a_p. Then Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁a₁ = ρ₁²a₁ and

  Σ₁₁^{−1/2}(Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2})(Σ₁₁^{1/2}a₁) = ρ₁²a₁,

so Σ₁₁^{1/2}a₁ is an eigenvector of (3).

3. Let b₁ = Σ₂₂^{−1}Σ₂₁a₁. Note b₁ is the eigenvector of Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1}Σ₁₂ with eigenvalue ρ₁².

Definition: Given a_i defined as above and b_i = Σ₂₂^{−1}Σ₂₁a_i, η_i = a_iᵀX and φ_i = b_iᵀY are the ith canonical correlation variables and ρ_i is the ith canonical correlation coefficient. Further (normalizing so that a_iᵀΣ₁₁a_i = b_iᵀΣ₂₂b_i = 1),

  Corr(η_i, η_j) = a_iᵀΣ₁₁a_j = δ_ij, Corr(φ_i, φ_j) = b_iᵀΣ₂₂b_j = δ_ij, Cov(η_i, φ_j) = 0, i ≠ j,

since

  Cov(η_i, φ_j) = a_iᵀΣ₁₂b_j = a_iᵀΣ₁₂Σ₂₂^{−1}Σ₂₁a_j = a_iᵀΣ₁₁Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁a_j = ρ_j² a_iᵀΣ₁₁a_j = 0, i ≠ j,

as a_iᵀΣ₁₁a_j = 0. To see this, note the Σ₁₁^{1/2}a_i are eigenvectors of Σ₁₁^{−1/2}Σ₁₂Σ₂₂^{−1}Σ₂₁Σ₁₁^{−1/2} with eigenvalues ρ_i², and eigenvectors corresponding to distinct eigenvalues are orthogonal.

If Σ₁₂ has rank 1 then Σ₁₂ = abᵀ for some a and b, and

  Σ₁₁^{−1}Σ₁₂Σ₂₂^{−1}Σ₂₁ = Σ₁₁^{−1}abᵀΣ₂₂^{−1}baᵀ = (bᵀΣ₂₂^{−1}b)(Σ₁₁^{−1}aaᵀ),

which has non-zero eigenvalue (aᵀΣ₁₁^{−1}a)(bᵀΣ₂₂^{−1}b). Thus the canonical correlation is √[(aᵀΣ₁₁^{−1}a)(bᵀΣ₂₂^{−1}b)].
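The eigenvalue computation above can be sketched on simulated data, replacing the population matrices by their sample analogues S₁₁, S₁₂, S₂₂. This is a sketch assuming NumPy; the data-generating scheme is illustrative, constructed so the two blocks are genuinely correlated.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 500, 3, 2
Z = rng.standard_normal((n, p + q))
Z[:, p:] += 0.8 * Z[:, :q]                 # correlate the Y block with the X block
X, Y = Z[:, :p], Z[:, p:]

S = np.cov(np.hstack([X, Y]), rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

# Squared sample canonical correlations: eigenvalues of S11^{-1} S12 S22^{-1} S21.
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
rho2 = np.sort(np.linalg.eigvals(M).real)[::-1]
print(np.sqrt(rho2[:min(p, q)]))           # the k = min(p, q) canonical correlations
```

Since S₁₂ has rank at most min(p, q), only k = min(p, q) of the p eigenvalues are non-zero, matching the count of non-zero canonical correlations.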
In general the number of non-zero canonical correlations is k = min(p, q). Sample canonical correlations are obtained via the eigenvalues and eigenvectors of S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁.

Testing H₀: only s canonical correlation coefficients are non-zero. Using a LRT approach, when testing H₀′: Σ₁₂ = 0 (i.e. all canonical correlations are 0) we used

  Λ^{2/n} = |I − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂| = ∏_{i=1}^k (1 − r_i²),

where r₁, …, r_k are the sample canonical correlation coefficients. Thus, to test H₀ with s ≤ k, Bartlett proposed

  −[n − ½(p + q + 3)] log ∏_{i=s+1}^k (1 − r_i²) ≈ χ²_{(p−s)(q−s)}, for large n.

This has the same form as Bartlett's approximation to Λ^{2/n} when testing H₀′: Σ₁₂ = 0:

  −[n − ½(p + q + 3)] log ∏_{i=1}^k (1 − r_i²) ≈ χ²_{pq}.

Canonical correlation analysis is used for data reduction and interpretation, in particular for determining which combinations of the X_i's can be predicted effectively using the Y_j's.

Reference Texts

Afifi, A., May, S. and Clark, V. (2012). Practical Multivariate Analysis. CRC Press, Boca Raton, Florida.

Johnson, R. and Wichern, D. (1988). Applied Multivariate Statistical Analysis. Prentice Hall, Englewood Cliffs, New Jersey.

Mardia, K., Kent, J. and Bibby, J. (1979). Multivariate Analysis. Academic Press, London.

Muirhead, R. (1982). Aspects of Multivariate Statistical Theory. John Wiley and Sons, New York.

Srivastava, M. and Carter, E. (1983). An Introduction to Applied Multivariate Statistics. North Holland, New York.

Appendix: Multivariate Linear Models

Consider the model defined by Y = XB + U, where Y is an n × p matrix of observed values, B is a q × p matrix of unknown parameters, X is an n × q matrix of known values, and U is an unobserved random matrix whose rows are, given X, uncorrelated with mean 0 and covariance Σ. If Y = (Y_ij) then EY_ij = X_iᵀB_j, where X has rows X₁ᵀ, …, X_nᵀ and B = (B₁, …, B_p). If the rows of U are N_p(0, Σ) then the least squares estimates for B are the MLEs.
The log likelihood is

  ℓ(B, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr[(Y − XB)Σ^{−1}(Y − XB)ᵀ].

Theorem. The MLEs for B and Σ are B̂ = (XᵀX)^{−1}XᵀY and Σ̂ = YᵀPY/n, where P = I − X(XᵀX)^{−1}Xᵀ.

Proof. Let Û = PY = Y − XB̂, where B̂ is defined above. Then Y − XB = Û + X(B̂ − B) and XᵀP = 0, so

  (Y − XB)ᵀ(Y − XB) = ÛᵀÛ + (B̂ − B)ᵀXᵀX(B̂ − B).

Thus

  ℓ(B, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr[Σ^{−1}(ÛᵀÛ + (B̂ − B)ᵀXᵀX(B̂ − B))].

This is maximized with respect to B when B = B̂, giving

  ℓ(B̂, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − ½ tr Σ^{−1}ÛᵀÛ,

and from Section 1 we have Σ̂ = ÛᵀÛ/n = YᵀPY/n. Next,

  ℓ(B̂, Σ̂) = −(np/2) log(2π) − (n/2) log|Σ̂| − np/2,

and YᵀY = YᵀPY + Yᵀ(I − P)Y = ÛᵀÛ + ŶᵀŶ, as Ŷ = XB̂ = (I − P)Y. Also P(I − P) = 0, so ŶᵀŶ is independent of Σ̂. Clearly EB̂ = B and EÛ = 0. Note Û = PY = PU, as PX = 0, so ÛᵀÛ = UᵀPU, as P² = P. The rows of U are N_p(0, Σ), so we have

  nΣ̂ ∼ W_p(Σ, n − q),

as rank P = tr P = tr(I − X(XᵀX)^{−1}Xᵀ) = n − q, by Theorem 9. Further,

  B̂ = (XᵀX)^{−1}XᵀY = B + (XᵀX)^{−1}XᵀU,  (A)

so B̂ − B = (XᵀX)^{−1}XᵀU and

  Cov(B̂_ij, B̂_kℓ) = E(a_iᵀU_j · U_ℓᵀa_k),

where (XᵀX)^{−1}Xᵀ has rows a₁ᵀ, …, a_qᵀ and U = (U₁, …, U_p). As the elements of U_j are independent, E(U_j U_ℓᵀ) = σ_jℓ I, so

  Cov(B̂_ij, B̂_kℓ) = σ_jℓ a_iᵀa_k = σ_jℓ g_ik, where (g_ik) = (XᵀX)^{−1}.

Thus the rows of B̂ will be independent if (XᵀX)^{−1} is diagonal. The covariance between the jth column and the ℓth column is Cov(B̂_j, B̂_ℓ) = σ_jℓ(XᵀX)^{−1}.

General Linear Hypotheses. Consider a test of the hypothesis H₀: C₁B = D, where C₁ is g × q of rank g and D is g × p of rank ≤ p (generally D = 0). For example, the hypothesis that the rows of B are all equal has

  C₁ = ( 1 −1 0 ⋯ 0 ; 1 0 −1 0 ⋯ 0 ; ⋯ ; 1 0 ⋯ 0 −1 ) and D = 0,

with C₁ of order (q − 1) × q. To construct the likelihood ratio test, define C₂ such that C₂ is (q − g) × q and C = ( C₁ ; C₂ ) is q × q of full rank. Let B₀ be any matrix such that C₁B₀ = D. Write the model as

  Y₊ = Z∆ + U, where Y₊ = Y − XB₀, Z = XC^{−1}, ∆ = C(B − B₀) = ( ∆₁ ; ∆₂ ),

and ∆₁ is g × p.
With this parametrization H_0 becomes H_0: Δ_1 = 0. Let C^{-1} = (C^(1) C^(2)). Maximizing the likelihood with no restrictions gives

    ℓ(B̂, Σ̂) = −(np/2) log(2π) − (n/2) log|Σ̂| − np/2,

where Σ̂ = Y^T P Y / n and P = I − X(X^T X)^{-1} X^T. Under H_0 the rows of Y_+ have expected value

    X (C^(1) C^(2)) (0; Δ_2) = X C^(2) Δ_2.

Let P_1 = I − X C^(2) (C^(2)T X^T X C^(2))^{-1} C^(2)T X^T, so the maximum likelihood under H_0 is

    −(np/2) log(2π) − (n/2) log|(1/n) Y_+^T P_1 Y_+| − np/2.

Hence

    −2 log Λ = n log(|Y_+^T P_1 Y_+| / |Y^T P Y|),  i.e.  Λ^{2/n} = |Y^T P Y| / |Y_+^T P_1 Y_+|.

Thus the statistic is simply the ratio of the determinants of the residual SSP matrices calculated for the full model and for the model that holds when the null hypothesis is true. To obtain the distribution of the statistic note that P Y_+ = P Y as PX = 0, so

    Λ^{2/n} = |Y_+^T P Y_+| / |Y_+^T P_1 Y_+|.

We will show that Λ^{2/n} ∼ Λ(p, n − q, g), Wilks' Lambda distribution, and so, using Bartlett's approximation,

    −[n − q − (1/2)(p − g + 1)] log Λ^{2/n}  approx∼  χ²_{pg}  for large n.

Let P_2 = P_1 − P. Note P P_1 = P, so P P_2 = 0 as P² = P. Further P_2 is idempotent as (P_1 − P)² = P_1 − P. Thus

    Y_+^T P_1 Y_+ = Y_+^T P Y_+ + Y_+^T P_2 Y_+,  with P P_2 = 0,

so

    Λ^{2/n} = |U^T P U| / |U^T P U + U^T P_2 U|,

and U^T P U ∼ W_p(Σ, n − q) independently of U^T P_2 U ∼ W_p(Σ, g), since tr P = n − q and tr P_2 = tr P_1 − tr P = n − (q − g) − (n − q) = g. From the definition of Wilks' lambda, Λ^{2/n} ∼ Λ(p, n − q, g).

Theorem. The LRT of the hypothesis H_0: C_1 B M_1 = D, where C_1 is g × q of rank g and M_1 is p × r of rank r, has test statistic Λ^{2/n} ∼ Λ(r, n − q, g). [This is the most general linear hypothesis on B and allows the specification of specific values for elements of B.]

Proof. Let Y = XB + U. Transform to Z = Y M_1 = X B M_1 + U M_1, where U M_1 has rows which are independent N_r(0, M_1^T Σ M_1); Z is n × r, U_1 = U M_1 is n × r, and B_1 = B M_1 is q × r. The null hypothesis is now H_0: C_1 B_1 = D, so using the above result the LRT statistic is

    |E| / |E + H| ∼ Λ(r, n − q, g),

where E = Z^T P Z = M_1^T Y^T P Y M_1 and E + H = M_1^T Y_+^T P_1 Y_+ M_1.
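The "ratio of residual SSP determinants" form of Λ^{2/n} is easy to compute for a simple special case. In the sketch below (synthetic data, not from the notes) the hypothesis is that the last row of B is zero, i.e. C_1 = (0, …, 0, 1) with g = 1 and D = 0, so the restricted model just drops the last column of X; numpy and scipy are assumed available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, q, p = 80, 3, 2
X = rng.standard_normal((n, q))
B = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [0.0,  0.0]])        # last row zero, so H0 is true here
Y = X @ B + rng.standard_normal((n, p))

def resid_proj(X):
    """P = I - X (X'X)^{-1} X', projection onto the residual space of X."""
    return np.eye(len(X)) - X @ np.linalg.solve(X.T @ X, X.T)

# H0: C1 B = 0 with C1 = (0, 0, 1); restricted design omits the last column
P, P1 = resid_proj(X), resid_proj(X[:, :-1])
Lam = np.linalg.det(Y.T @ P @ Y) / np.linalg.det(Y.T @ P1 @ Y)

g = 1
stat = -(n - q - 0.5 * (p - g + 1)) * np.log(Lam)   # Bartlett's approximation
pval = stats.chi2.sf(stat, p * g)
```

Since Y^T P_1 Y exceeds Y^T P Y by a positive semi-definite matrix, Λ^{2/n} always lies in (0, 1].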
If p = 1 the model reduces to the usual multiple regression model and H_0: C_1 β = d, where d is g × 1, yields

    E = Y^T (I − X(X^T X)^{-1} X^T) Y = residual SS ∼ σ² χ²_{n−q},
    H + E = Y^T P_1 Y = residual SS under H_0,
    H = (S_1 − S_0) ∼ σ² χ²_g  if C_1 has rank g.

The usual multiple regression F-test is based not on E/(H + E) but on H/E, with (H/E) ∼ (g/(n − q)) F_{g, n−q}, and H_0 is rejected if H/E is large. If C_1 is 1 × q then, since

    [1 − Λ(p, n − q, 1)] / Λ(p, n − q, 1) ∼ (p/(n − q − p + 1)) F_{p, n−q−p+1},

we can base our test on an exact F-distribution, not an approximate χ² value.

MANOVA

The LRT statistics for multivariate analysis of variance can be derived as follows. We will consider the 1-way case in detail. Assume we have g independent samples from N_p(µ_i, Σ) populations, i = 1, ⋯, g. The ith sample is Y_i1, ⋯, Y_i n_i, and n_1 + n_2 + ⋯ + n_g = N. Let Y be the N × p matrix with rows Y_11^T, Y_12^T, ⋯, Y_g n_g^T, let B be the g × p matrix with rows µ_1^T, ⋯, µ_g^T, and let X be the N × g design matrix whose ith column is the indicator of membership of the ith sample (a block of n_i ones). Write Y = XB + E, where X^T X = diag(n_1, n_2, ⋯, n_g) and (X^T X)^{-1} X^T Y has rows Ȳ_1·^T, ⋯, Ȳ_g·^T, i.e. µ̂_i = Ȳ_i·.

To test H_0: µ_1 = µ_2 = ⋯ = µ_g write the constraints as C_1 B = 0, where

    C_1 = [ 1 0 ⋯ ⋯ 0 −1
            0 1 0 ⋯ 0 −1
            ⋮
            0 ⋯ ⋯ 0 1 −1 ],

a (g − 1) × g matrix. Here P = I − X(X^T X)^{-1} X^T = I − X diag(n_i^{-1}) X^T, so

    Y^T P Y = Y^T Y − Y^T X diag(n_i^{-1}) X^T Y = Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)(Y_ij − Ȳ_i·)^T = W ∼ W_p(Σ, N − g).

W is the within-groups SSP matrix. Write

    C = (C_1; 1^T) = [ I   −1
                       1^T  1 ],

so C^{-1} = [ C^11 C^12; C^21 1/g ] with C^12 = (1/g) 1, and hence C^(2) = (1/g) 1. Next

    P_1 = I_N − X 1 (1^T X^T X 1)^{-1} 1^T X^T = I − (1/N) X 1 1^T X^T = I − (1/N) 1 1^T.

In this case Y_+ = Y, as we may take B_0 = 0 (so that C_1 B_0 = 0). Then

    Y^T P_1 Y = Y^T Y − (1/N) Y^T 1 1^T Y = Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_··)(Y_ij − Ȳ_··)^T = T,

the total SSP matrix. Thus the MANOVA LRT statistic is

    Λ = |W| / |T| ∼ Λ(p, N − g, g − 1),

and T = W + B, where B = Σ_{i=1}^{g} n_i (Ȳ_i· − Ȳ_··)(Ȳ_i· − Ȳ_··)^T is the between-groups SSP matrix. We can write

    |W| / |T| = |W| / |W + B| = |I + W^{-1} B|^{-1}.

If λ_1, ⋯, λ_p are the eigenvalues of W^{-1} B then (1 + λ_i) is an eigenvalue of I + W^{-1} B. Hence Λ = ∏_{i=1}^{p} (1 + λ_i)^{-1}.
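The one-way MANOVA decomposition T = W + B and the two equivalent forms of Wilks' Λ can be verified on synthetic data. This sketch (not from the notes) uses three hypothetical groups with shifted means; numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 2
# three hypothetical samples with shifted mean vectors
sizes_means = [(20, 0.0), (25, 0.3), (30, 0.6)]
groups = [rng.standard_normal((ni, p)) + mu for ni, mu in sizes_means]
N, g = sum(len(Yi) for Yi in groups), len(groups)

grand = np.vstack(groups).mean(axis=0)
W = sum((Yi - Yi.mean(0)).T @ (Yi - Yi.mean(0)) for Yi in groups)   # within SSP
Bm = sum(len(Yi) * np.outer(Yi.mean(0) - grand, Yi.mean(0) - grand)
         for Yi in groups)                                          # between SSP
T = sum((Yi - grand).T @ (Yi - grand) for Yi in groups)             # total SSP

Lam = np.linalg.det(W) / np.linalg.det(T)        # Wilks' Lambda = |W| / |T|

# the same statistic via the eigenvalues of W^{-1} B
lam = np.linalg.eigvals(np.linalg.solve(W, Bm)).real
Lam_eig = 1.0 / np.prod(1.0 + lam)
```

The identity T = W + B holds exactly for any data, so the two computations of Λ must agree.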
But λ_i = 0 if i > min(p, g − 1) = t, so Λ = ∏_{i=1}^{t} (1 + λ_i)^{-1}. It can be shown that

    V = (N − g) Σ_{i=1}^{p} λ_i  approx∼  χ²_{p(g−1)}.

Testing Contrasts. H_0: a_1 µ_1 + ⋯ + a_g µ_g = 0, with Σ_{i=1}^{g} a_i = 0. Write H_0 as H_0: C_1^T B = 0^T, where C_1 = (a_1, ⋯, a_g)^T. Use Lagrange multipliers to obtain the MLE's for µ_i under the constraints imposed by H_0. The log likelihood (with the multiplier term) is

    ℓ = −(N/2) log((2π)^p |Σ|) − (1/2) Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − µ_i)^T Σ^{-1} (Y_ij − µ_i) + λ^T (a_1 µ_1 + ⋯ + a_g µ_g)
      = −(N/2) log((2π)^p |Σ|) − (1/2) Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)^T Σ^{-1} (Y_ij − Ȳ_i·)
        − (1/2) [ Σ_{i=1}^{g} n_i (Ȳ_i· − µ_i)^T Σ^{-1} (Ȳ_i· − µ_i) − 2 λ^T Σ_{i=1}^{g} a_i µ_i ].

Maximizing wrt µ_i gives µ̂_i = Ȳ_i· + (a_i/n_i) Σ λ̂. Further a_1 µ̂_1 + ⋯ + a_g µ̂_g = 0, so

    λ̂ = −Σ^{-1} (Σ_{i=1}^{g} a_i Ȳ_i·) / Σ_{i=1}^{g} (a_i²/n_i),

and hence

    µ̂_i = Ȳ_i· − (a_i/n_i) (Σ_{i=1}^{g} a_i Ȳ_i·) / (Σ_{i=1}^{g} a_i²/n_i),

which does not depend on Σ. Thus the MLE for Σ is (by analogy with the full case)

    Σ̂ = (1/N) [ Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i·)(Y_ij − Ȳ_i·)^T + (Σ a_i Ȳ_i·)(Σ a_i Ȳ_i·)^T / (Σ a_i²/n_i) ] = (1/N) [W + C],

and the LRT statistic is

    |W| / |W + C| = |I + W^{-1} C|^{-1} ∼ Λ(p, N − g, 1).

The matrix C has rank 1, so the non-zero eigenvalue of W^{-1} C is

    tr W^{-1} C = (Σ_{i=1}^{g} a_i Ȳ_i·)^T W^{-1} (Σ_{i=1}^{g} a_i Ȳ_i·) / (Σ_{i=1}^{g} a_i²/n_i)

(cf. the Mahalanobis distance when g = 2, a_1 = −a_2 = 1).

Test for Common Covariance Structure. If Y_ij ∼ N_p(µ_i, Σ_i), i = 1, ⋯, g, test H_0: Σ_1 = Σ_2 = ⋯ = Σ_g. The LRT statistic is

    −2 log Λ = N log|W/N| − Σ_{i=1}^{g} n_i log|S_i/n_i| = Σ_{i=1}^{g} n_i log|(n_i/N) S_i^{-1} W|,

where S_i is the SSP matrix of the ith sample and W = Σ_{i=1}^{g} S_i; this is approx∼ χ²_{(g−1)p(p+1)/2} if H_0 is true. A modified version of the statistic has a distribution closer to the χ². Box's M-test:

    M = γ Σ_{i=1}^{g} (n_i − 1) log|(n_i/N) S_i^{-1} W|,

where

    γ = 1 − [(2p² + 3p − 1) / (6(p + 1)(g − 1))] (Σ_{i=1}^{g} (n_i − 1)^{-1} − 1/(N − g)).
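Box's M-test, in the form given above, can be sketched as follows on synthetic data where H_0 (common covariance) holds by construction; numpy and scipy are assumed available, and the group sizes are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p, g = 2, 3
ns = np.array([25, 30, 35])
N = ns.sum()
# all groups share the identity covariance, so H0 is true by construction
samples = [rng.standard_normal((ni, p)) for ni in ns]

# group SSP matrices S_i and their sum W
S = [(Yi - Yi.mean(0)).T @ (Yi - Yi.mean(0)) for Yi in samples]
W = sum(S)

# Box's correction factor gamma, as in the notes
gamma = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1)) * (
    np.sum(1.0 / (ns - 1)) - 1.0 / (N - g))

# M = gamma * sum_i (n_i - 1) log |(n_i/N) S_i^{-1} W|
M = gamma * sum((ni - 1) * np.log(np.linalg.det(ni / N * np.linalg.inv(Si) @ W))
                for ni, Si in zip(ns, S))
df = (g - 1) * p * (p + 1) // 2
pval = stats.chi2.sf(M, df)
```

Here df = (g − 1)p(p + 1)/2 = 6, and under H_0 the p-value should typically be non-extreme.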