MATH5855: Multivariate Analysis
Dr Pavel Krivitsky
based on notes by A/Prof Spiridon Penev
University of New South Wales
School of Mathematics
Department of Statistics
2021 Term 3
This volume of notes is for individual students’ use only. It is therefore not to
be distributed beyond the University of New South Wales.
The notes will be uploaded in parts.
0 Preliminaries
1 Exploratory Data Analysis
2 The Multivariate Normal Distribution
3 Multivariate Normal Estimation
4 Intervals and Tests for the Mean
5 Correlations
6 Principal Components Analysis
7 Canonical Correlation Analysis
8 MLM and MANOVA
9 Tests of a Covariance Matrix
10 Factor Analysis
11 Structural Equation Modelling
12 Discrimination and Classification
13 Support Vector Machines
14 Cluster Analysis
15 Copulae
A Exercise Solutions
UNSW MATH5855 2021T3 Foreword
Foreword
These notes
These notes are not a substitute for the lectures in Multivariate Analysis for Masters students. You
are strongly encouraged to attend every lecture and laboratory hour, because the conceptual
bases of the modelling methods, as well as some additional derivations and explanations, will be
the focus there, as will important portions of the relevant computer output.
This volume is therefore not meant to be a substitute for a textbook, computer package manual,
or lecture attendance.
We rely on the widely used and powerful statistical suites R and SAS to perform the actual
calculations during the course. These notes are a compilation from several sources and other
notes. Some of the sources are listed in your handout. The closest reference book is:
Johnson, R. & Wichern, D. (2007) Applied Multivariate Statistical Analysis. Sixth
Edition, Prentice Hall.
By no means can this book be a substitute for the whole set of notes, though.
It is assumed that you are familiar with some basic concepts of linear algebra. These will
be summarised at the beginning and will be used extensively in the rest of the course. These
concepts include matrix and vector operations, determinants, traces, ranks, projectors, linear
equations, inverses, eigenvectors and eigenvalues, etc.
I would appreciate it if you would let me know about any ways these notes could be further
improved.
Overview
First we shall discuss some general aspects of Multivariate Analysis. Usually, when studying
complex phenomena, many variables are required. Besides, the process of studying is usually an
iterative one with many variables often added or deleted from the study. Multivariate analysis
deals with developing methods for better understanding the relationships between the many
variables included in the analysis of such complex phenomena.
What makes Multivariate Analysis different?
In your other classes, you have learned about a variety of methods for analysing many variables.
For example, you have probably learned about the multiple linear regression model:
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, \dots, n, \]
where $Y_i$ is the $i$th observation of the response variable, $x_{ik}$ is the $i$th observation of the $k$th
predictor variable, and $\epsilon_i$ is the $i$th error. However, in this regression, we designate the $p$
predictors as fixed (conditioned on) and only one variable per observation is random. Typically, we
assume that the $\epsilon_i$ and therefore the $Y_i$ are independent (conditional on the $x$s) or at least uncorrelated.
Contrast this with a multivariate linear model,
\[ Y_{i1} = \beta_{01} + \beta_{11} x_{i1} + \beta_{21} x_{i2} + \cdots + \beta_{p1} x_{ip} + \epsilon_{i1}, \]
\[ Y_{i2} = \beta_{02} + \beta_{12} x_{i1} + \beta_{22} x_{i2} + \cdots + \beta_{p2} x_{ip} + \epsilon_{i2}, \]
where $Y_{i1}$ and $Y_{i2}$ are the $i$th observations of two distinct response variables, and $\epsilon_{i1}$ and
$\epsilon_{i2}$ may be correlated. The multivariate linear model can be used when multiple measurements are
taken on each individual in the sample, and it allows us to model the relationships among these
measurements.
Difficulties in such a process:
• More data to analyse
• More involved mathematics necessary
• Computer intensive methods involved in the process
Objectives of multivariate methods:
Data reduction: presenting the phenomenon as simply as possible but without sacrificing
valuable information. Typical representative method: Principal Components Analysis.
Sometimes, this reduction is achieved by introducing a small number of unobservable (latent)
variables when trying to explain a large number of observable output variables.
Representative methods: Factor Analysis and Covariance Structure Analysis.
Sorting or grouping: creating groups of “similar” objects or variables that in a sense are
closer to each other than to objects outside the group; and finding a reasonable explanation
for the existing grouping. Representative methods: Factor Analysis, Cluster Analysis,
Discriminant Analysis.
Investigation of dependence among variables: finding which sets of variables can be
considered as independent and which are “more dependent”; and “measuring” the
dependence. Representative methods: Correlation Analysis, Partial Correlations, Canonical
Correlations.
Prediction: predicting values of one or more variables on the basis of observations of other
variables that have been found to influence the former variables: a basic but important
goal. Representative method: Multivariate Regression.
Hypothesis testing: either validating assumptions (e.g., normality) on the basis of which a
certain analysis is being done, or reinforcing some prior modelling convictions (e.g., equality of
parameters). Hypothesis testing is relevant to the applications of all multivariate methods
we will be dealing with.
The multivariate normal distribution will be used as the basic mathematical model for our
analyses in this course. The reasons are our limited time and the complexity of other approaches.
Although other distributions are also relevant in practice, modelling based on the multivariate
normal distribution can still be a very good approximation.
UNSW MATH5855 2021T3 Lecture 0 Preliminaries
0 Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions
0.3 Additional resources
0.4 Exercises
0.1 Matrix algebra
0.1.1 Vectors and matrices
As a shorthand notation, we shall use $X \in M_{p,n}$ to indicate that $X$ is a matrix with $p$ rows
and $n$ columns. The notation $x \in \mathbb{R}^n$ will indicate that $x$ is an $n$-dimensional column
vector. Of course, if $x \in \mathbb{R}^n$, then also $x \in M_{n,1}$. Transposition will be denoted by $\top$.
Transposing a matrix $X \in M_{p,n}$ gives a new matrix $X^\top \in M_{n,p}$. In particular,
transposing a column vector $x \in \mathbb{R}^n$ gives a row vector $x^\top \in M_{1,n}$.
Multiplying a matrix (vector) by a scalar means multiplying each of its elements by that scalar.
Also, two matrices (vectors) of the same dimension can be added (subtracted), and the result is a
new matrix (vector) of the same dimension whose elements are the element-wise sums (differences)
of the elements of the matrices (vectors) being added (subtracted). The Euclidean norm of a vector
$x = (x_1, x_2, \dots, x_p)^\top \in \mathbb{R}^p$ is denoted by $\|x\|$ and is defined as
$\|x\| = \sqrt{\sum_{i=1}^p x_i^2}$.
The inner product or, equivalently, the scalar product of two $p$-dimensional vectors $x$ and $y$
is denoted and defined in the following way:
\[ \langle x, y \rangle = x^\top y = \sum_{i=1}^p x_i y_i. \tag{0.1} \]
Obviously, the relation $\|x\|^2 = \langle x, x \rangle$ holds. It is well known that if $\theta$ is the angle
between two $p$-dimensional vectors $x$ and $y$ then
\[ \langle x, y \rangle = \|x\| \|y\| \cos(\theta). \tag{0.2} \]
Since $|\cos(\theta)| \le 1$, we have the inequality
\[ |\langle x, y \rangle| \le \|x\| \|y\|, \]
which is one variant of the Cauchy–Bunyakovsky–Schwarz inequality. Further, if we want to
orthogonally project the vector $x \in \mathbb{R}^p$ on the nonzero vector $y \in \mathbb{R}^p$ then (having in mind
the geometric interpretation of orthogonal projection) the result will be
\[ \frac{x^\top y}{y^\top y} \, y. \]
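The definitions above can be checked numerically. The following is a minimal Python sketch, not part of the course software (which is R and SAS); the helper names `inner`, `norm` and `project` and the example vectors are assumptions made for illustration.

```python
import math

def inner(x, y):
    """Inner product <x, y> = sum_i x_i * y_i, as in (0.1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Euclidean norm, using ||x||^2 = <x, x>."""
    return math.sqrt(inner(x, x))

def project(x, y):
    """Orthogonal projection of x on the nonzero vector y: (x'y / y'y) y."""
    coef = inner(x, y) / inner(y, y)
    return [coef * yi for yi in y]

# Illustrative vectors (assumed values).
x = [3.0, 4.0]
y = [1.0, 0.0]
proj = project(x, y)
residual = [xi - pi for xi, pi in zip(x, proj)]

print(norm(x))             # 5.0
print(proj)                # [3.0, 0.0]
print(inner(residual, y))  # 0.0: the residual is orthogonal to y
```

The last line illustrates the geometric meaning of the projection: what is left of $x$ after removing its projection is perpendicular to $y$.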
Finally, we recall the rules for matrix multiplication: if $X \in M_{p,k}$ and $Y \in M_{k,n}$ (i.e. the
number of columns in $X$ is equal to the number of rows in $Y$) then the multiplication $XY$ is
possible and the result is a matrix $Z = XY \in M_{p,n}$ with elements
\[ z_{i,j} = \sum_{m=1}^{k} x_{i,m} y_{m,j}, \quad i = 1, 2, \dots, p, \quad j = 1, 2, \dots, n, \tag{0.3} \]
i.e. the element in the $i$th row and $j$th column of $Z$ is the scalar product of the $i$th row of $X$
and the $j$th column of $Y$. Note that the multiplication of matrices is not commutative and, in
general, $YX$ need not even exist when $XY$ exists. When the matrices are both
square of the same dimension $p$ (i.e. both $X \in M_{p,p}$ and $Y \in M_{p,p}$) then both $XY$
and $YX$ are defined but in general do not give the same result.
The following transposition rule is important to be mentioned (and easy to check): if X ∈
Mp,k and Y ∈Mk,n then the product XY exists and it holds:
(XY )⊤ = Y ⊤X⊤ (0.4)
One should be very careful with transposition though in order to avoid silly mistakes. If
x ∈ Rp, for example, both x⊤x and xx⊤ exist. While the former is a scalar, the latter belongs
to Mp,p!
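A small Python sketch of the multiplication rule (0.3) and the transposition rule (0.4) follows. It is only an illustration under assumed example matrices; `matmul` and `transpose` are hypothetical helper names, not functions from R or SAS.

```python
def matmul(X, Y):
    """Product of X (p x k) and Y (k x n) computed element by element as in (0.3)."""
    k = len(Y)
    if any(len(row) != k for row in X):
        raise ValueError("number of columns of X must equal number of rows of Y")
    return [[sum(row[m] * Y[m][j] for m in range(k)) for j in range(len(Y[0]))]
            for row in X]

def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]

# Illustrative square matrices (assumed values).
X = [[1, 2], [3, 4]]
Y = [[0, 1], [1, 0]]
XY = matmul(X, Y)
YX = matmul(Y, X)
print(XY)  # [[2, 1], [4, 3]]
print(YX)  # [[3, 4], [1, 2]]  (different from XY)

# Transposition rule (0.4): (XY)' = Y'X'
print(transpose(XY) == matmul(transpose(Y), transpose(X)))  # True

# For a column vector x, x'x is 1 x 1 while xx' is p x p.
x = [[1], [2], [3]]
print(matmul(transpose(x), x))       # [[14]]
print(len(matmul(x, transpose(x))))  # 3
```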
A square matrix X ∈ Mp,p is called symmetric if xi,j = xj,i for i = 1, 2, . . . , p and j =
1, 2, . . . , p holds. For such a matrix, we have X⊤ = X.
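The square matrix $I \in M_{p,p}$ with elements $\delta_{ij}$, $i = 1, 2, \dots, p$ and $j = 1, 2, \dots, p$,
where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise (i.e., ones on the
diagonal and zeros outside the diagonal), is called the identity matrix (of dimension $p$). Obviously,
whenever the multiplication is possible, $XI = X$ and $IX = X$ hold.
The trace of a square matrix $X \in M_{p,p}$ is defined as $\operatorname{tr}(X) = \sum_{i=1}^p x_{ii}$. The following
properties of traces are easy to obtain:
i) $\operatorname{tr}(X + Y) = \operatorname{tr}(X) + \operatorname{tr}(Y)$
ii) $\operatorname{tr}(XY) = \operatorname{tr}(YX)$
iii) $\operatorname{tr}(X^{-1} Y X) = \operatorname{tr}(Y)$
iv) If $a \in \mathbb{R}^p$ and $X \in M_{p,p}$ then $a^\top X a = \operatorname{tr}(X a a^\top)$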
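Properties iii) and iv) both follow from property ii). A short sketch of the argument, assuming $X$ is nonsingular in iii) and using the fact that the scalar $a^\top X a$ equals its own trace in iv):

```latex
\begin{align*}
\operatorname{tr}(X^{-1} Y X)
  &= \operatorname{tr}\bigl((X^{-1} Y)\, X\bigr)
   = \operatorname{tr}\bigl(X\, (X^{-1} Y)\bigr)
   = \operatorname{tr}(Y), \\
a^\top X a
  &= \operatorname{tr}\bigl(a^\top (X a)\bigr)
   = \operatorname{tr}\bigl((X a)\, a^\top\bigr)
   = \operatorname{tr}(X a a^\top).
\end{align*}
```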
0.1.2 Inverse matrices
To any square matrix $X \in M_{p,p}$ one can attach a number $|X| \equiv \det(X)$ called the determinant
of the matrix. It is defined as
\[ |X| = \sum \pm x_{1i} x_{2j} \cdots x_{pm}, \]
where the summation is over all permutations $(i, j, \dots, m)$ of the numbers $(1, 2, \dots, p)$, taking
into account the sign rule: summands with an even permutation get a $(+)$ sign whereas the ones
with an odd permutation get a $(-)$ sign.
It can be seen that this is equivalent to another, recursive definition, namely:
• when $p = 1$ (scalar case), $X = a$ is just a number and $|X| = a$;
• when $p = 2$,
\[ \begin{vmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{vmatrix} = x_{11} x_{22} - x_{12} x_{21}; \]
• when $p = 3$, the following rule applies:
\[ \begin{vmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{vmatrix} = x_{11}x_{22}x_{33} + x_{12}x_{23}x_{31} + x_{21}x_{32}x_{13} - x_{31}x_{22}x_{13} - x_{11}x_{23}x_{32} - x_{12}x_{21}x_{33}; \tag{0.5} \]
• recursively, for $X \in M_{p,p}$,
\[ |X| = \sum_{i=1}^p (-1)^{i+j} x_{ij} |X_{ij}| = \sum_{j=1}^p (-1)^{i+j} x_{ij} |X_{ij}|, \]
where the first sum expands along any fixed column $j$ and the second along any fixed row $i$,
$X_{ij}$ denotes the matrix we get by deleting the $i$th row and $j$th column of $X$, and
$|X_{ij}|$ is therefore the $(i, j)$th minor of $X$.
Here we list some elementary properties of determinants that follow directly from the
definition:
i) If one row or one column of the matrix contains zeros only, then the value of the determinant
is zero.
ii) $|X^\top| = |X|$
iii) If one row (or one column) of the matrix is multiplied by a scalar $c$ then the value of the
determinant is multiplied by $c$ as well.
iv) $|cX| = c^p |X|$
v) If $X, Y \in M_{p,p}$ then $|XY| = |X||Y|$
vi) If the matrix $X$ is diagonal (i.e. all nondiagonal elements are zero) then $|X| = \prod_{i=1}^p x_{ii}$.
In particular, the determinant of the identity matrix is always equal to one.
If $|X| \ne 0$ (or equivalently, if the matrix $X \in M_{p,p}$ is nonsingular) then an inverse
matrix $X^{-1} \in M_{p,p}$ can be defined, which has to satisfy $XX^{-1} = I_{p,p}$. It is easy to check that the
inverse $X^{-1}$ has as its $(j, i)$th entry $(-1)^{i+j} |X_{ij}| / |X|$, where $|X_{ij}|$ is, as before, the $(i, j)$th minor of
$X$.
Some elementary properties of inverses follow:
i) $XX^{-1} = X^{-1}X = I$
ii) $(X^{-1})^\top = (X^\top)^{-1}$
iii) $(XY)^{-1} = Y^{-1}X^{-1}$ when both $X$ and $Y$ are nonsingular square matrices of the same
dimension.
iv) $|X^{-1}| = |X|^{-1}$
v) If $X$ is diagonal and nonsingular then all its diagonal elements are nonzero and $X^{-1}$ is again
diagonal with diagonal elements equal to $1/x_{ii}$, $i = 1, 2, \dots, p$.
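The recursive definition of the determinant and the cofactor formula for the inverse translate directly into code. The sketch below uses only the Python standard library and exact rational arithmetic; the function names and the example matrices are assumptions for illustration. Cofactor expansion is shown for clarity and is not how R or SAS invert matrices in practice, since its cost grows very quickly with $p$.

```python
from fractions import Fraction

def minor(X, i, j):
    """X with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(X) if r != i]

def det(X):
    """Determinant by cofactor expansion along the first row."""
    if len(X) == 0:
        return 1  # convention: the empty determinant is 1
    return sum((-1) ** j * X[0][j] * det(minor(X, 0, j)) for j in range(len(X)))

def inverse(X):
    """Inverse via cofactors: entry (j, i) is (-1)^(i+j) |X_ij| / |X|."""
    p, d = len(X), det(X)
    if d == 0:
        raise ValueError("matrix is singular")
    inv = [[None] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            inv[j][i] = Fraction((-1) ** (i + j) * det(minor(X, i, j)), d)
    return inv

# Illustrative matrices (assumed values).
A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
B = [[2, 1], [1, 3]]
print(det(A))                                 # -3
print(det([[2 * a for a in row] for row in A]))  # -24, i.e. 2^3 * |A|
print(inverse(B))  # entries 3/5, -1/5, -1/5, 2/5
```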
0.1.3 Rank
A set of vectors x1,x2, . . . ,xk ∈ Rn is linearly dependent if there exist k numbers a1, a2, . . . , ak
not all zero such that
a1x1 + a2x2 + · · ·+ akxk = 0 (0.6)
holds. Otherwise the vectors are linearly independent. In particular, for k linearly independent
vectors the equality (0.6) would only be possible if all numbers a1, a2, . . . , ak were zero.
The row rank of a matrix is the maximum number of linearly independent row vectors. The
column rank is the maximum number of linearly independent column vectors. It turns out that the
row rank and the column rank of a matrix are always equal. Thus the rank of a matrix $X$ (denoted
$\operatorname{rk}(X)$) is either the row or the column rank. If $X \in M_{p,n}$ and $\operatorname{rk}(X) = \min(p, n)$ we say that
the matrix is of full rank. In particular, a square matrix $A \in M_{p,p}$ is of full rank if $\operatorname{rk}(A) = p$.
As is well known from a basic theorem of linear algebra (the Kronecker–Capelli or Rouché–Capelli
theorem), this also means that $|A| \ne 0$ when $A$ is of full rank. Then the inverse of $A$ also exists.
Let $b \in \mathbb{R}^p$ be a given vector. Then the linear equation system $Ax = b$ has the unique solution
$x = A^{-1}b \in \mathbb{R}^p$.
0.1.4 Orthogonal matrices
A square matrix X ∈Mp,p is orthogonal if XX⊤ = X⊤X = Ip,p holds. The following properties
of orthogonal matrices are obvious:
i) X is of full rank (rk(X) = p) and X−1 = X⊤
ii) The name orthogonal of the matrix originates from the fact that the scalar product of each
two different column vectors equals zero. The same holds for the scalar product of each two
different row vectors of the matrix. The norm of each column vector (or each row vector) is
equal to one. These properties are equivalent to the definition.
iii) $|X| = \pm 1$
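Property iii) can be seen in one line from the determinant properties $|X^\top| = |X|$ and $|XY| = |X||Y|$ listed in Section 0.1.2. A sketch of the argument:

```latex
\[
  |X|^2 = |X|\,|X^\top| = |X X^\top| = |I_{p,p}| = 1
  \quad\Longrightarrow\quad |X| = \pm 1 .
\]
```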
0.1.5 Eigenvalues and eigenvectors
For any square matrix $X \in M_{p,p}$ we can define the characteristic polynomial equation of degree
$p$,
\[ f(\lambda) = |X - \lambda I| = 0. \tag{0.7} \]
Equation (0.7) is a polynomial equation of degree $p$, so it has exactly $p$ roots (counted with
multiplicity). These roots are called the eigenvalues of $X$. In general, some
of them may be complex and some may coincide. Since the coefficients are real, if there is a
complex root of (0.7) then its complex conjugate must also be a root of the same equation. Denote
any such eigenvalue by $\lambda^*$. In addition, $\operatorname{tr}(X) = \sum_{i=1}^p \lambda_i$ and $|X| = \prod_{i=1}^p \lambda_i$.
Obviously, the matrix $X - \lambda^* I$ is singular (its determinant is zero). Then, according to the
Kronecker theorem, there exists a nonzero vector $y \in \mathbb{R}^p$ such that $(X - \lambda^* I)y = 0$, $0 \in \mathbb{R}^p$.
We call $y$ an eigenvector of $X$ that corresponds to the eigenvalue $\lambda^*$. Note that the eigenvector
is not uniquely defined: $\mu y$ for any real nonzero $\mu$ would also be an eigenvector corresponding
to the same eigenvalue.
Sparing some details of the derivation, we shall formulate the following basic result:
Theorem 0.1. When the matrix X is real symmetric then all of its p eigenvalues are real. If the
eigenvalues are all different then all the p eigenvectors that correspond to them, are orthogonal
(and hence form a basis in Rp). These eigenvectors are also unique (up to the norming constant
µ above). If some of the eigenvalues coincide then the eigenvectors corresponding to them are
not necessarily unique but even in this case they can be chosen to be mutually orthogonal.
For each of the $p$ eigenvalues $\lambda_i$, $i = 1, 2, \dots, p$, of $X$, denote the corresponding mutually
orthogonal eigenvectors of unit length by $e_i$, $i = 1, 2, \dots, p$, i.e.
\[ X e_i = \lambda_i e_i, \quad i = 1, 2, \dots, p, \quad \|e_i\| = 1, \quad e_i^\top e_j = 0, \ i \ne j, \]
holds. Then it can be shown that the following decomposition (spectral decomposition) of any
symmetric matrix $X \in M_{p,p}$ holds:
\[ X = \lambda_1 e_1 e_1^\top + \lambda_2 e_2 e_2^\top + \cdots + \lambda_p e_p e_p^\top. \tag{0.8} \]
Equivalently, $X = P \Lambda P^\top$ where
\[ \Lambda = \begin{pmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_p \end{pmatrix} \]
is diagonal and $P \in M_{p,p}$ is an orthogonal
matrix whose columns are the $p$ orthogonal eigenvectors $e_1, e_2, \dots, e_p$.
The above decomposition is a very important analytical tool. One of its most widely used
applications is for defining a square root of a symmetric positive definite matrix.
A symmetric matrix $X \in M_{p,p}$ is positive definite if all of its eigenvalues are positive. (It is
called nonnegative definite if all eigenvalues are $\ge 0$.) For a symmetric positive definite matrix
we have all $\lambda_i$, $i = 1, 2, \dots, p$, positive in the spectral decomposition (0.8).
But then
\[ X^{-1} = (P^\top)^{-1} \Lambda^{-1} P^{-1} = P \Lambda^{-1} P^\top = \sum_{i=1}^p \frac{1}{\lambda_i} e_i e_i^\top \tag{0.9} \]
(i.e. inverting $X$ is very easy if the spectral decomposition of $X$ is known).
Moreover, we can define the square root of the symmetric nonnegative definite matrix $X$ in
a natural way:
\[ X^{1/2} = \sum_{i=1}^p \sqrt{\lambda_i} \, e_i e_i^\top. \tag{0.10} \]
The definition (0.10) makes sense since $X^{1/2} X^{1/2} = X$ holds. Note that $X^{1/2}$ is also symmetric and
nonnegative definite. Also, when $X$ is positive definite,
$X^{-1/2} = \sum_{i=1}^p \lambda_i^{-1/2} e_i e_i^\top = P \Lambda^{-1/2} P^\top$ can be defined, where $\Lambda^{-1/2}$ is a
diagonal matrix with $\lambda_i^{-1/2}$, $i = 1, 2, \dots, p$, as its diagonal elements. These facts will be used
extensively in the subsequent sections.
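To make (0.8), (0.9) and (0.10) concrete, the following Python sketch builds a $2 \times 2$ matrix from an assumed set of eigenvalues and unit eigenvectors and then forms its inverse and square root from the same ingredients. The matrix, the helper names `outer`, `combine` and `matmul`, and the choice of example are assumptions for illustration only.

```python
import math

def outer(u, v):
    """Rank-one matrix u v'."""
    return [[ui * vj for vj in v] for ui in u]

def combine(weights, mats):
    """Weighted sum of matrices: sum_i weights[i] * mats[i]."""
    p = len(mats[0])
    return [[sum(w * M[i][j] for w, M in zip(weights, mats)) for j in range(p)]
            for i in range(p)]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

# Assumed example: X = [[2, 1], [1, 2]] has eigenvalues 3 and 1 with
# unit eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
s = 1 / math.sqrt(2)
eigenvalues = [3.0, 1.0]
eigenvectors = [[s, s], [s, -s]]
rank_one = [outer(e, e) for e in eigenvectors]

X = combine(eigenvalues, rank_one)                                  # (0.8)
X_inv = combine([1 / lam for lam in eigenvalues], rank_one)         # (0.9)
X_half = combine([math.sqrt(lam) for lam in eigenvalues], rank_one) # (0.10)

print(X)                       # approximately [[2, 1], [1, 2]]
print(matmul(X_half, X_half))  # approximately X
print(matmul(X, X_inv))        # approximately the identity matrix
```

The outputs agree with the stated matrices only up to floating-point rounding, which is why they are described as approximate.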
As an illustration of the usefulness of the spectral decomposition approach we shall show the
following statement:
Example 0.2. Let $X \in M_{p,p}$ be a symmetric positive definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge
\cdots \ge \lambda_p > 0$ and associated eigenvectors of unit length $e_1, e_2, \dots, e_p$. Show that
• $\max_{y \ne 0} \dfrac{y^\top X y}{y^\top y} = \lambda_1$, attained when $y = e_1$;
• $\min_{y \ne 0} \dfrac{y^\top X y}{y^\top y} = \lambda_p$, attained when $y = e_p$.
Let $X = P \Lambda P^\top$ be the decomposition (0.8) for $X$. Denote $z = P^\top y$. Note that $y \ne 0$
implies $z \ne 0$, and that $z^\top z = y^\top P P^\top y = y^\top y$ since $P$ is orthogonal. Thus
\[ \frac{y^\top X y}{y^\top y} = \frac{y^\top P \Lambda P^\top y}{y^\top y} = \frac{z^\top \Lambda z}{z^\top z} = \frac{\sum_{i=1}^p \lambda_i z_i^2}{\sum_{i=1}^p z_i^2} \le \lambda_1 \frac{\sum_{i=1}^p z_i^2}{\sum_{i=1}^p z_i^2} = \lambda_1. \]
If we take $y = e_1$ then, having in mind the structure of the matrix $P$, we have $z = P^\top e_1 =
(1 \ 0 \ \cdots \ 0)^\top$ and for this choice of $y$ also $\dfrac{z^\top \Lambda z}{z^\top z} = \dfrac{\lambda_1}{1} = \lambda_1$.
This shows the first part. Similar arguments (just changing the direction of the inequality) apply to
show the second part.
In addition, you can try to show that $\max_{y \ne 0,\, y \perp e_1} \dfrac{y^\top X y}{y^\top y} = \lambda_2$ holds. How?
0.1.6 Cholesky Decomposition
Computers perform arithmetic to a finite precision, typically around 16 significant decimal
figures. Furthermore, the numbers are expressed internally in scientific notation, and so the absolute
magnitude of a number typically has little effect on precision, but certain operations on
numbers with very different magnitudes can sometimes produce severe rounding errors. For example,
to a computer $1 \times 10^{18} + 1 \times 10^{0}$ = 1,000,000,000,000,000,000 + 1 = 1,000,000,000,000,000,000:
the 1 gets lost to a rounding error.
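The rounding effect described above can be reproduced in any language that uses double-precision floating point. A minimal Python sketch (Python is an assumption here; the same behaviour is expected of R numerics):

```python
big = 1e18  # stored as a double-precision floating-point number

# Adding 1 does not change the stored value: the 1 is lost to rounding.
print(big + 1 == big)    # True
print((big + 1) - big)   # 0.0

# Exact integer arithmetic, by contrast, keeps the 1.
print(10**18 + 1 == 10**18)  # False
```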
When it comes to matrix inversion in particular, the key number is the condition number,
$\lambda_1/\lambda_p$ of a positive definite matrix $X$, where $\lambda_1$ is the largest eigenvalue of $X$ and $\lambda_p$ is the
smallest. (The definition for non-positive-definite matrices can be different.) The higher this
number is, the less numerically stable the inversion is likely to be. (Notice that if the matrix is
singular, this number is infinite.)
We generally try to avoid asking the computer to invert matrices in ways that lose precision.
An alternative, more numerically stable definition of a “matrix square root” is the Cholesky
decomposition. For a symmetric positive definite matrix $X \in M_{p,p}$, there exists a unique upper
triangular matrix $U \in M_{p,p}$ with positive diagonal elements such that $U^\top U = X$ holds. Note
that many sources use a lower triangular matrix $L$ such that $LL^\top = X$ instead. It is easy to see
that $L \equiv U^\top$, and which definition is used is arbitrary, provided it is used consistently, since in
general $UU^\top \ne X$ and $L^\top L \ne X$. For example, the Wikipedia article uses $L$, whereas R's
built-in function chol() and SAS/IML's root(x) both return $U$. This decomposition is particularly
useful for generating correlated variables.
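For illustration, a textbook version of the algorithm that produces the upper triangular factor can be written in a few lines. This Python sketch is not the implementation behind R's chol() or SAS/IML's root(); the function name `cholesky_upper` and the example matrix are assumptions.

```python
import math

def cholesky_upper(X):
    """Upper triangular U with U'U = X, for a symmetric positive definite X."""
    p = len(X)
    U = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            # Remove the contribution of the rows of U already computed.
            s = X[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                U[i][i] = math.sqrt(s)
            else:
                U[i][j] = s / U[i][i]
    return U

# Illustrative covariance-like matrix (assumed values).
X = [[4.0, 2.0], [2.0, 3.0]]
U = cholesky_upper(X)
print(U)  # [[2.0, 1.0], [0.0, 1.4142135623730951]]
```

The link to generating correlated variables is that if $z$ has uncorrelated components with unit variance, then $U^\top z$ has variance matrix $U^\top I U = U^\top U = X$.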
0.1.7 Orthogonal Projection
Orthogonal projection of any vector $y \in \mathbb{R}^n$ on the space $\mathcal{L}(X)$ spanned by the columns of
the matrix $X \in M_{n,p}$ is a linear operation. Hence the result is a vector $z \in \mathbb{R}^n$ that has the
representation $z = Py$, where the matrix $P \in M_{n,n}$ is called the (orthogonal) projector. Since
$z \in \mathcal{L}(X)$ (being a projection on this space), the projection of $z$ on $\mathcal{L}(X)$ is $z$ itself. Hence
$Py = z = Pz = PPy = P^2 y$, or $(P - P^2)y = 0$, which implies $P^2 = P$ (since $y \in \mathbb{R}^n$ is arbitrary).
Therefore, $P$ should be idempotent. Further, the residual $y - z$ must be orthogonal to every vector
of $\mathcal{L}(X)$, in particular to $Pw$ for any $w \in \mathbb{R}^n$: $(y - Py)^\top P w = 0$, i.e. $y^\top (I - P^\top) P w = 0$
for all $y$ and $w$. Hence $(P^\top - I)P = 0$, or $P^\top P = P$. Taking transposes, $P^\top P = P^\top$, so
$P = P^\top$; that is, $P$ is symmetric. So, the orthogonal projector is a symmetric and idempotent matrix.
Conversely, consider a symmetric and idempotent matrix $P$. For any $y \in \mathbb{R}^n$, let $z = Py$;
then $Pz = P^2 y = Py$, so $P(y - z) = 0$ (and also $P^\top (y - z) = 0$ since $P = P^\top$).
Consider $\mathcal{L}(P)$ (the space generated by the rows/columns of $P$). Now $z = Py$ implies $z \in \mathcal{L}(P)$, and
$P^\top (y - z) = 0$ means that $y - z$ is perpendicular to $\mathcal{L}(P)$. Hence $Py$ is the projection of $y$ on
$\mathcal{L}(P)$.
Hence, we have seen that $P \in M_{n,n}$ is an orthogonal projection matrix if and only if it is a
symmetric and idempotent matrix.
Also, if $P$ is an orthogonal projection on a given linear space $\mathcal{M}$ of dimension $\dim(\mathcal{M})$ then
$I - P$ is an orthogonal projection on the orthocomplement of $\mathcal{M}$. It holds that $\operatorname{rk}(P) = \dim(\mathcal{M})$.
Further, it can be seen that the rank of an orthogonal projector is equal to the sum of its
diagonal elements.
Finally, it can be shown that if the matrix $X$ above has full rank then the projector is
$P_{\mathcal{L}(X)} = X(X^\top X)^{-1}X^\top$. If the matrix $X$ is not of full rank then a generalised inverse
$(X^\top X)^{-}$ of $X^\top X$ can be used instead. Note that the generalised inverse may not be uniquely
defined but, no matter which version of it has been chosen, the matrix $X(X^\top X)^{-}X^\top$ is uniquely
defined and is the orthogonal projector on the space $\mathcal{L}(X)$ spanned by the columns of $X$, also in
cases when the rank of $X$ is not full.
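The properties of the projector can be verified exactly on a small example. The Python sketch below forms $P = X(X^\top X)^{-1}X^\top$ for an assumed $3 \times 2$ full-rank matrix using rational arithmetic; the matrix and the helper names (`transpose`, `matmul`, `inverse_2x2`) are assumptions for illustration.

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inverse_2x2(A):
    """Exact inverse of a 2 x 2 matrix with integer entries."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

# Assumed example: a column of ones and a column (0, 1, 2)'.
X = [[1, 0], [1, 1], [1, 2]]
Xt = transpose(X)
P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)

print(P == transpose(P))                 # True: P is symmetric
print(matmul(P, P) == P)                 # True: P is idempotent
print(sum(P[i][i] for i in range(3)))    # 2, the rank of X
```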
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
In order to study the sampling variability of statistics, with the ultimate goal of making inferences,
one needs to make some assumptions about the random variables whose values constitute the
data set $X \in M_{p,n}$ in (1.1). Suppose the data has not been observed yet but we intend to collect
$n$ sets of measurements on $p$ variables. Since the actual observations cannot be predicted before
the measurements are made, we treat them as random variables. Each set of $p$ measurements
can be considered as a realisation of a $p$-dimensional random vector, and we have $n$ independent
realisations of such random vectors $X_i$, $i = 1, 2, \dots, n$, so we have the random matrix $X \in M_{p,n}$:
\[ X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
X_{p1} & X_{p2} & \cdots & X_{pj} & \cdots & X_{pn}
\end{pmatrix} = [X_1, X_2, \dots, X_n]. \tag{0.11} \]
The vectors $X_i$, $i = 1, 2, \dots, n$, are considered as independent observations of a $p$-dimensional
random vector. We start by discussing the distribution of such a vector.
0.2.2 Joint, marginal, conditional distributions
A random vector $X = (X_1 \ X_2 \ \cdots \ X_p)^\top \in \mathbb{R}^p$, $p \ge 2$, has a joint cdf
\[ F_X(x) = P(X_1 \le x_1, X_2 \le x_2, \dots, X_p \le x_p) = F_X(x_1, x_2, \dots, x_p). \]
In the case of a discrete vector of observations $X$, the probability mass function is defined as
\[ P_X(x) = P(X_1 = x_1, X_2 = x_2, \dots, X_p = x_p). \]
If a density $f_X(x) = f_X(x_1, x_2, \dots, x_p)$ exists such that
\[ F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f_X(t) \, dt_1 \dots dt_p \tag{0.12} \]
then $X$ is a continuous random vector with a joint density function of $p$ arguments $f_X(x)$. From
(0.12) we see that in this case $f_X(x) = \dfrac{\partial^p F_X(x)}{\partial x_1 \partial x_2 \cdots \partial x_p}$ holds.
The marginal cdf of the first $k < p$ components of the vector $X$ is defined in a natural way as
follows:
\[ P(X_1 \le x_1, \dots, X_k \le x_k) = P(X_1 \le x_1, \dots, X_k \le x_k, X_{k+1} \le \infty, \dots, X_p \le \infty) = F_X(x_1, x_2, \dots, x_k, \infty, \infty, \dots, \infty). \tag{0.13} \]
The marginal density of the first $k$ components can be obtained by partial differentiation in
(0.13), and we arrive at
\[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, x_2, \dots, x_p) \, dx_{k+1} \dots dx_p. \]
For any other subset of $k < p$ components of the vector $X$, their marginal cdf and density can
be obtained along the same lines.
In particular, each component $X_i$ has marginal cdf $F_{X_i}(x_i)$, $i = 1, 2, \dots, p$.
The conditional density of $X_1, \dots, X_r$ given $X_{r+1} = x_{r+1}, \dots, X_p = x_p$ is defined by
\[ f_{(X_1, \dots, X_r \mid X_{r+1}, \dots, X_p)}(x_1, \dots, x_r \mid x_{r+1}, \dots, x_p) = \frac{f_X(x)}{f_{X_{r+1}, \dots, X_p}(x_{r+1}, \dots, x_p)}. \tag{0.14} \]
The above conditional density is interpreted as the joint density of $X_1, \dots, X_r$ when $X_{r+1} =
x_{r+1}, \dots, X_p = x_p$ and is only defined when $f_{X_{r+1}, \dots, X_p}(x_{r+1}, \dots, x_p) \ne 0$.
If $X$ has $p$ independent components then
\[ F_X(x) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_p}(x_p) \tag{0.15} \]
holds and, equivalently, also
\[ P_X(x) = P_{X_1}(x_1) P_{X_2}(x_2) \cdots P_{X_p}(x_p), \quad f_X(x) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_p}(x_p) \tag{0.16} \]
holds. We note that in the case of mutual independence of the $p$ components, all conditional
distributions do not depend on the conditions and the factorisations
\[ F_X(x) = \prod_{i=1}^p F_{X_i}(x_i), \quad f_X(x) = \prod_{i=1}^p f_{X_i}(x_i) \]
hold.
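In the discrete case the same definitions can be carried out by hand or with a few lines of code. The Python sketch below uses a hypothetical joint probability mass function of two binary variables; the probabilities and the function names are assumptions chosen only to illustrate marginals, conditionals and the factorisation criterion (0.16).

```python
from fractions import Fraction

# Hypothetical joint pmf P(X1 = x1, X2 = x2); the numbers are illustrative only.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal(joint, axis):
    """Marginal pmf of component `axis`, summing the joint pmf over the other one."""
    out = {}
    for x, prob in joint.items():
        out[x[axis]] = out.get(x[axis], 0) + prob
    return out

def conditional_x1_given_x2(joint, x2):
    """Conditional pmf of X1 given X2 = x2: joint pmf divided by the marginal of X2."""
    p_x2 = marginal(joint, 1)[x2]
    if p_x2 == 0:
        raise ValueError("conditioning event has probability zero")
    return {x1: prob / p_x2 for (x1, v), prob in joint.items() if v == x2}

def is_independent(joint):
    """Check whether the joint pmf factorises into the product of its marginals."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((a, b), 0) == m1[a] * m2[b] for a in m1 for b in m2)

print(marginal(joint, 0))                 # X1: 1/2 and 1/2
print(marginal(joint, 1))                 # X2: 3/8 and 5/8
print(conditional_x1_given_x2(joint, 1))  # 2/5 and 3/5
print(is_independent(joint))              # False
```

Here the conditional distribution of $X_1$ given $X_2 = 1$ differs from the marginal distribution of $X_1$, which is another way to see that the components are not independent.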
0.2.3 Moments
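Given the density $f_X(x)$ of the random vector $X$, the joint moments of order $s_1, s_2, \dots, s_p$ are
defined, in analogy to the univariate case, as
\[ E(X_1^{s_1} \cdots X_p^{s_p}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{s_1} \cdots x_p^{s_p} f_X(x_1, \dots, x_p) \, dx_1 \dots dx_p. \tag{0.17} \]
Note that if some of the $s_i$ in (0.17) are equal to zero then in effect we are calculating the
joint moment of a subset of the $p$ random variables.
Now, let $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ with densities as above. The following moments are commonly
used:
Expectation:
\[ \mu_X = E(X) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x f_X(x_1, \dots, x_p) \, dx_1 \dots dx_p \in \mathbb{R}^p. \]
Variance–covariance matrix: (a.k.a. variance or covariance matrix)
\[ \Sigma_X = \operatorname{Var}(X) = \operatorname{Cov}(X) = E(X - \mu_X)(X - \mu_X)^\top = E XX^\top - \mu_X \mu_X^\top = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix} \in M_{p,p}. \]
Covariance matrix:
\[ \Sigma_{X,Y} = \operatorname{Cov}(X, Y) = E(X - \mu_X)(Y - \mu_Y)^\top = E XY^\top - \mu_X \mu_Y^\top = \begin{pmatrix}
\sigma_{X_1 Y_1} & \sigma_{X_1 Y_2} & \cdots & \sigma_{X_1 Y_q} \\
\sigma_{X_2 Y_1} & \sigma_{X_2 Y_2} & \cdots & \sigma_{X_2 Y_q} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{X_p Y_1} & \sigma_{X_p Y_2} & \cdots & \sigma_{X_p Y_q}
\end{pmatrix} \in M_{p,q}. \]
Let $A \in M_{p',p}$ and $B \in M_{q',q}$ be fixed and known. Then,
• $\mu_{AX} = A \mu_X \in \mathbb{R}^{p'}$
• $\Sigma_{AX} = A \Sigma_X A^\top \in M_{p',p'}$
• $\Sigma_{AX,BY} = A \Sigma_{X,Y} B^\top \in M_{p',q'}$
As a corollary, if $X'$, $Y'$, $A'$ and $B'$ are variables and matrices with the same dimensions as the
originals (but possibly different distributions and values),
• $E(AX + A'X') = A \mu_X + A' \mu_{X'}$
• $\operatorname{Var}(AX + A'X') = A \Sigma_X A^\top + A \Sigma_{X,X'} (A')^\top + A' \Sigma_{X',X} A^\top + A' \Sigma_{X'} (A')^\top$
• $\operatorname{Cov}(AX + A'X', BY + B'Y') = A \Sigma_{X,Y} B^\top + A \Sigma_{X,Y'} (B')^\top + A' \Sigma_{X',Y} B^\top + A' \Sigma_{X',Y'} (B')^\top$
These identities are also useful when $p = p' = q = q' = 1$ (i.e., scalars).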
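The rules $\mu_{AX} = A\mu_X$ and $\Sigma_{AX} = A\Sigma_X A^\top$ can be checked exactly on a small discrete distribution, where expectations are finite sums. The Python sketch below does this with rational arithmetic; the support points, probabilities, the matrix `A` and all helper names are assumptions for illustration.

```python
from fractions import Fraction

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def mean(points, probs):
    """Expectation of a discrete random vector with given support and probabilities."""
    p = len(points[0])
    return [sum(w * x[i] for x, w in zip(points, probs)) for i in range(p)]

def cov(points, probs):
    """Variance matrix E(X - mu)(X - mu)' of a discrete random vector."""
    mu = mean(points, probs)
    p = len(mu)
    return [[sum(w * (x[i] - mu[i]) * (x[j] - mu[j]) for x, w in zip(points, probs))
             for j in range(p)] for i in range(p)]

# Hypothetical distribution: four equally likely points in R^2.
points = [[0, 0], [1, 0], [1, 2], [3, 1]]
probs = [Fraction(1, 4)] * 4
A = [[1, -1], [2, 1]]  # assumed fixed matrix

mu, Sigma = mean(points, probs), cov(points, probs)
transformed = [matvec(A, x) for x in points]  # support of AX

print(mean(transformed, probs) == matvec(A, mu))                             # True
print(cov(transformed, probs) == matmul(matmul(A, Sigma), transpose(A)))     # True
```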
0.2.4 Density transformation formula
Assume the $p$ existing random variables $X_1, X_2, \dots, X_p$ with given density $f_X(x)$ have been
transformed by a smooth (i.e. differentiable) one-to-one transformation into $p$ new random
variables $Y_1, Y_2, \dots, Y_p$, i.e. a new random vector $Y \in \mathbb{R}^p$ has been created by calculating
\[ Y_i = y_i(X_1, X_2, \dots, X_p), \quad i = 1, 2, \dots, p. \tag{0.18} \]
The question is how to calculate the density $f_Y(y)$ of $Y$ knowing the transformation functions
$y_i(X_1, X_2, \dots, X_p)$, $i = 1, 2, \dots, p$, and the density $f_X(x)$ of the original random vector. Naturally,
since the transformation (0.18) is assumed to be one-to-one, its inverse transformation $X_i =
x_i(Y_1, Y_2, \dots, Y_p)$, $i = 1, 2, \dots, p$, also exists and then the following density transformation formula
applies:
\[ f_Y(y_1, \dots, y_p) = f_X[x_1(y_1, \dots, y_p), \dots, x_p(y_1, \dots, y_p)] \, |J(y_1, \dots, y_p)|, \tag{0.19} \]
where $J(y_1, \dots, y_p)$ is the Jacobian of the transformation:
\[ J(y_1, \dots, y_p) = \left| \frac{\partial x}{\partial y} \right| \equiv \begin{vmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_p} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_p} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_p}{\partial y_1} & \frac{\partial x_p}{\partial y_2} & \cdots & \frac{\partial x_p}{\partial y_p}
\end{vmatrix} \equiv \left| \frac{\partial y}{\partial x} \right|^{-1}. \tag{0.20} \]
Note that in (0.19) the absolute value of the Jacobian is substituted.
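As a worked illustration of (0.19) and (0.20), consider the linear transformation $Y = AX + b$ with a nonsingular $A \in M_{p,p}$ and $b \in \mathbb{R}^p$. This particular example is an assumption chosen for simplicity; it is a sketch rather than a result stated in these notes.

```latex
\begin{align*}
x &= A^{-1}(y - b), \qquad
\frac{\partial x}{\partial y} = A^{-1}, \qquad
J(y_1, \dots, y_p) = |A^{-1}| = |A|^{-1}, \\
f_Y(y) &= f_X\bigl(A^{-1}(y - b)\bigr)\, \bigl|\, |A| \,\bigr|^{-1},
\end{align*}
```

where the outer bars in the last factor denote the absolute value of the determinant $|A|$.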
0.2.5 Characteristic and moment generating functions
The characteristic function (cf) $\varphi(t)$ of the random vector $X \in \mathbb{R}^p$ is a function of a
$p$-dimensional argument. For any real vector $t = (t_1 \ t_2 \ \cdots \ t_p)^\top \in \mathbb{R}^p$ it is defined as
$\varphi_X(t) = E(e^{i t^\top X})$ where $i = \sqrt{-1}$. Note that the cf always exists since
$|\varphi_X(t)| \le E(|e^{i t^\top X}|) = 1 < \infty$.
Perhaps simpler (since it does not involve complex numbers) is the notion of the moment
generating function (mgf). It is defined as $M_X(t) = E(e^{t^\top X})$. Note, however, that in some cases the
mgf may not exist for values of $t$ further away from the zero vector.
Characteristic functions are in one-to-one correspondence with distributions, and this is the
reason to use them as machinery in cases where working directly with the
distribution is not very convenient. In fact, when the density exists, under mild conditions
the following inversion formulas hold for one-dimensional random variables and random vectors,
respectively:
\[ f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-itx} \varphi_X(t) \, dt, \]
\[ f_X(x) = (2\pi)^{-p} \int_{\mathbb{R}^p} e^{-i t^\top x} \varphi_X(t) \, dt. \]
One important property of the cf is the following:
Theorem 0.3. If the cf $\varphi_X(t)$ of the random vector $X \in \mathbb{R}^p$ is given and $Y = AX + b$, $b \in
\mathbb{R}^q$, $A \in M_{q,p}$, is a linear transformation of $X \in \mathbb{R}^p$ into a new random vector $Y \in \mathbb{R}^q$ then it
holds for all $s \in \mathbb{R}^q$ that
\[ \varphi_Y(s) = e^{i s^\top b} \varphi_X(A^\top s). \]
Proof. At lecture.
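For readers who want to check the statement before the lecture, one possible line of argument (a sketch only, not necessarily the proof presented in class) uses nothing beyond the definition of the cf and the identity $s^\top A X = (A^\top s)^\top X$:

```latex
\begin{align*}
\varphi_Y(s)
  &= E\bigl(e^{i s^\top Y}\bigr)
   = E\bigl(e^{i s^\top (AX + b)}\bigr)
   = e^{i s^\top b}\, E\bigl(e^{i (A^\top s)^\top X}\bigr)
   = e^{i s^\top b}\, \varphi_X(A^\top s).
\end{align*}
```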
0.3 Additional resources
An alternative presentation of these concepts can be found in JW Ch. 2–3.
0.4 Exercises
Exercise 0.1
In an ecological experiment, colonies of 2 different species of insect are confined to the same
habitat. The survival times of the two species (in days) are random variables $X_1$ and $X_2$
respectively. It is thought that $X_1$ and $X_2$ have a joint density of the form
\[ f_X(x_1, x_2) = \theta x_1 e^{-x_1(\theta + x_2)}, \quad 0 < x_1, x_2, \]
for some constant $\theta > 0$.
(a) Show that $f_X(x_1, x_2)$ is a valid density.
(b) Find the probability that both species die out within $t$ days of the start of the experiment.
(c) Derive the marginal density of $X_1$. Identify this distribution and write down $E(X_1)$ and
$\operatorname{Var}(X_1)$.
(d) Derive the marginal density of $X_2$, and the conditional density of $X_2$ given $X_1 = x_1$.
(e) What evidence do you now have that $X_1$ and $X_2$ are not independent?
Exercise 0.2
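Let $X = [X_1, X_2]^\top$ be a random vector with $E(X) = \mu$ and
\[ \operatorname{Var}(X) = \Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. \]
(a) Find $\operatorname{Cov}(X_1 - X_2, X_1 + X_2)$.
(b) Find $\operatorname{Cov}(X_1, X_2 - \rho X_1)$.
(c) Choose $b$ to minimise $\operatorname{Var}(X_2 - bX_1)$.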
Exercise 0.3
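Suppose $X$ is a $p$-dimensional random vector with cf $\varphi_X(t)$. If $X$ is partitioned as
$\begin{bmatrix} X_{(1)} \\ X_{(2)} \end{bmatrix}$,
where $X_{(1)}$ is a $p_1$-dimensional subvector, then show that
(a) $X_{(1)}$ has cf $\varphi^{(1)}_X(t_{(1)}) = \varphi_X\left\{ \begin{bmatrix} t_{(1)} \\ 0 \end{bmatrix} \right\}$, $t_{(1)} \in \mathbb{R}^{p_1}$.
(b) $X_{(1)}$ and $X_{(2)}$ are independent if and only if
\[ \varphi_X(t) = \varphi_X\left\{ \begin{bmatrix} t_{(1)} \\ 0 \end{bmatrix} \right\} \varphi_X\left\{ \begin{bmatrix} 0 \\ t_{(2)} \end{bmatrix} \right\}, \quad \forall t_{(1)} \in \mathbb{R}^{p_1}, \ \forall t_{(2)} \in \mathbb{R}^{p - p_1}. \]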
Exercise 0.4
Let $X \in M_{p,p}$ be a symmetric positive definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$
and associated eigenvectors of unit length $e_i$, $i = 1, 2, \dots, p$, that give rise to the following spectral
decomposition:
\[ X = \lambda_1 e_1 e_1^\top + \lambda_2 e_2 e_2^\top + \cdots + \lambda_p e_p e_p^\top. \]
It is known that $\max_{y \ne 0} \dfrac{y^\top X y}{y^\top y} = \lambda_1$. Now show that
$\max_{y \ne 0,\, \langle y, e_1 \rangle = 0} \dfrac{y^\top X y}{y^\top y} = \lambda_2$. Can you
find further generalisations of this claim?
Exercise 0.5
We know that an orthogonal projection matrix has only 0 or 1 as possible eigenvalues. Using
this property or otherwise, show that the rank of an orthogonal projector is equal to the sum of
its diagonal elements.
UNSW MATH5855 2021T3 Lecture 1 Exploratory Data Analysis
1 Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software
1.1 Data organisation
Assume we are dealing with p ≥ 1 variables. The values of these variables are all recorded for each distinct item, individual, or experimental trial. Each of these three words will sometimes be replaced by the word “case”. We will use the notation $x_{ij}$ to indicate a particular value of the ith variable that is observed on the jth case. Consequently, n measurements on p variables can be represented in the form of a matrix
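$$\underset{p \times n}{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{in} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn}
\end{pmatrix} \tag{1.1}$$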
The matrix X above contains the data consisting of all the observations on all the variables.
This way of representing the data allows easy manipulations to be performed in order to obtain
some easy descriptive statistics for each of the variables.
1.2 Basic summaries
For example, the sample mean of the second variable is just $\bar{x}_2 = \frac{1}{n}\sum_{j=1}^{n} x_{2j}$, and the sample variance of the second variable is just $s_{22} = \frac{1}{n}\sum_{j=1}^{n} (x_{2j} - \bar{x}_2)^2$. (Note that for the sample variance we shall sometimes use the divisor $n - 1$ rather than $n$; each time, the distinction will be made clear by displaying the appropriate expression.)
The sample covariance (the simple measure of linear association between variables 1 and 2) is given by $s_{12} = \frac{1}{n}\sum_{j=1}^{n} (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2)$, and $s_{ik}$, $i = 1, 2, \dots, p$, $k = 1, 2, \dots, p$, is defined analogously. Finally, the sample correlation coefficient (the measure of linear association between two variables that does not depend on the units of measurement) can be defined. The sample correlation coefficient of the ith and kth variables is defined by
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}.$$
Because of the well-known Cauchy–Bunyakovsky–Schwarz inequality, $|r_{ik}| \le 1$ holds. Note also that $r_{ik} = r_{ki}$ for all $i = 1, 2, \dots, p$ and $k = 1, 2, \dots, p$.
It should be emphasised that the sample correlations and covariances are useful only when trying to measure the linear association between two variables. Their values are less informative, and can be misleading, in cases of nonlinear association. In such cases one may turn to alternatives such as the quotient correlation:
Zhang, Zhengjun. Quotient correlation: A sample based alternative to Pearson’s correlation. Annals of Statistics 36 (2008), no. 2, 1007–1030. doi:10.1214/009053607000000866
Nevertheless, because covariance and correlation coefficients are routinely calculated and analysed, they are very widely used, and they provide useful numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association.
The descriptive statistics that we discussed until now are usually organised into arrays,
namely:
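Vector of sample means
$$\bar{x} = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{pmatrix}^\top$$
Matrix of sample variances and covariances
$$\underset{p \times p}{S} = \begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{pmatrix} \tag{1.2}$$
Matrix of sample correlations
$$\underset{p \times p}{R} = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix} \tag{1.3}$$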
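The three arrays above can be computed directly from a p × n data matrix. The following is a minimal sketch in plain Python rather than in R or SAS; the function name `summarise` and the small data set are made up for illustration, and the divisor n is used for the variances and covariances.

```python
import math

def summarise(X):
    """X is a p x n list of lists: row i holds the n observations of variable i."""
    p, n = len(X), len(X[0])
    means = [sum(row) / n for row in X]
    # Sample covariance matrix S with divisor n.
    S = [[sum((X[i][j] - means[i]) * (X[k][j] - means[k]) for j in range(n)) / n
          for k in range(p)] for i in range(p)]
    # Sample correlation matrix: r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk)).
    R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)]
         for i in range(p)]
    return means, S, R

# Hypothetical data: p = 2 variables observed on n = 4 cases.
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 9.0]]
means, S, R = summarise(X)
print(means)
print(S)
print(R)
```

Replacing the divisor `n` by `n - 1` in the definition of `S` gives the other common version of the sample covariance matrix; the correlation matrix `R` is unaffected by this choice.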
1.3 Visualisation
Some simple characteristics of the data are worth studying before the actual multivariate analysis begins:
• drawing scatterplots of the data;
• calculating simple univariate descriptive statistics for each variable;
• calculating sample correlation and covariance coefficients; and
• linking multiple two-dimensional scatterplots.
1.4 Software
SAS In SAS, the procedures that are used for this purpose are called proc means, proc plot
and proc corr. Please study their short description in the included SAS handout.
R In R, these are implemented in base::rowMeans, base::colMeans, stats::cor, graphics::plot, graphics::pairs, GGally::ggpairs. Here, the format is PACKAGE::FUNCTION, and you can learn more by running
library(PACKAGE)
?FUNCTION
UNSW MATH5855 2021T3 Lecture 2 The Multivariate Normal Distribution
2 The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
2.1 Definition
The multivariate normal (MVN) density is a generalisation of the univariate normal for p ≥ 2 dimensions. Looking at the term $\left(\frac{x - \mu}{\sigma}\right)^2 = (x - \mu)(\sigma^2)^{-1}(x - \mu)$ in the exponent of the well-known formula
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x - \mu)/\sigma]^2/2}, \qquad -\infty < x < \infty \tag{2.1}$$
for the univariate density function, a natural way to generalise this term in higher dimensions is to replace it by $(x - \mu)^\top \Sigma^{-1} (x - \mu)$. Here $\mu = E X \in \mathbb{R}^p$ is the expected value of the random vector $X \in \mathbb{R}^p$ and the matrix
$$\Sigma = E(X - \mu)(X - \mu)^\top = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix} \in M_{p,p}$$
is the covariance matrix. Note that on the diagonal of $\Sigma$ we get the variances of each of the p random variables whereas $\sigma_{ij} = E[(X_i - E(X_i))(X_j - E(X_j))]$, $i \ne j$, are the covariances between the ith and jth random variable. Sometimes, we will also denote $\sigma_{ii}$ by $\sigma_i^2$.
Of course, the above replacement would only make sense if Σ was positive definite. In general, however, we can only claim that Σ is (as any covariance matrix) non-negative definite (try to prove this claim e.g. using Example 0.2 from Section 0.1.5 or some other argument).
If Σ was positive definite then the density of the random vector X can be written as
$$f_X(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(x - \mu)^\top \Sigma^{-1} (x - \mu)/2}, \qquad -\infty < x_i < \infty,\ i = 1, 2, \dots, p. \tag{2.2}$$
It can be directly checked that the random vector X ∈ Rp has EX = µ and
E[(X − µ)(X − µ)⊤] = Σ.
Since the density is uniquely defined by the mean vector and the covariance matrix we will denote
it by Np(µ,Σ).
In these notes, however, we will introduce the multivariate normal distribution not through its density formula but through more general reasoning that also allows us to cover the case of singular Σ. We will utilise the famous Cramér–Wold argument, according to which the distribution of a p-dimensional random vector $X$ is completely characterised by the one-dimensional distributions of all linear transformations $t^\top X$, $t \in \mathbb{R}^p$. Indeed, if we consider $E[e^{i\tau\, t^\top X}]$ (which is assumed to be known for every scalar $\tau \in \mathbb{R}^1$ and every vector $t \in \mathbb{R}^p$) then we see that by substituting $\tau = 1$ we get $E[e^{i t^\top X}]$, which is the cf of the vector $X$ (and the latter uniquely specifies the distribution of $X$). Hence the following definition will be adopted here:
Definition 2.1. The random vector X ∈ Rp has a multivariate normal distribution if and only
if (iff) any linear transformation t⊤X, t ∈ Rp has a univariate normal distribution.
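Lemma 2.2. The characteristic function of the (univariate) standard normal random variable $X \sim N(0, 1)$ is
$$\psi_X(t) = \exp(-t^2/2).$$
Proof. (optional, not examinable)
$$\begin{aligned}
\psi_X(t) = E \exp(itX) &= \int_{-\infty}^{+\infty} \exp(itx)\, \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(itx - x^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2itx)/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2itx + (it)^2)/2 + (it)^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2 + (it)^2/2)\, dx \\
&= \exp(-t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2)\, dx \\
&= \exp(-t^2/2) \lim_{h \to \infty} \int_{-h}^{+h} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2)\, dx.
\end{aligned}$$
The change of variable $z = x - it$, $x = z + it$, $dx = dz$ results in
$$\psi_X(t) = \exp(-t^2/2) \lim_{h \to \infty} \int_{-h+it}^{+h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz.$$
The remaining integral is over a complex domain, so we must use Cauchy’s Theorem: contour integration over the contour $+h+it \to +h \to -h \to -h+it \to +h+it$ should result in 0, so
$$\int_{+h+it}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz + \int_{+h}^{-h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz + \int_{-h}^{-h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz + \int_{-h+it}^{+h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz = 0$$
for any real $h$ and $t$. Solving for the integral of interest and taking the limit,
$$\begin{aligned}
\lim_{h \to \infty} \int_{-h+it}^{+h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz
&= -\lim_{h \to \infty} \int_{+h+it}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz - \lim_{h \to \infty} \int_{+h}^{-h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz - \lim_{h \to \infty} \int_{-h}^{-h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz \\
&= -\lim_{h \to \infty} \int_{+h+it}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz + \underbrace{\lim_{h \to \infty} \int_{-h}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz}_{=\,1} - \lim_{h \to \infty} \int_{-h}^{-h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz,
\end{aligned}$$
since the standard normal density integrates to 1.
Lastly, consider $\lim_{h \to \infty} \int_{+h+it}^{+h} \exp(-z^2/2)\, dz$. With the change of variable $y = (z - h)/i$, $z = h + iy$, $dz = i\, dy$, the limits $z = h + it$ and $z = h$ become $y = t$ and $y = 0$, so
$$\begin{aligned}
\lim_{h \to \infty} \int_{+h+it}^{+h} \exp(-z^2/2)\, dz &= \lim_{h \to \infty} \int_{t}^{0} \exp(-(h + iy)^2/2)\, i\, dy \\
&= \int_{t}^{0} \lim_{h \to \infty} \exp(-(h^2 + 2ihy - y^2)/2)\, i\, dy \\
&= \int_{t}^{0} \lim_{h \to \infty} \exp(-h^2/2) \exp(-ihy) \exp(y^2/2)\, i\, dy \\
&= \int_{t}^{0} 0\, dy = 0,
\end{aligned}$$
and, analogously, $\lim_{h \to \infty} \int_{-h}^{-h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz = 0$, leaving
$$\lim_{h \to \infty} \int_{-h+it}^{+h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz = 1$$
and
$$\psi_X(t) = \exp(-t^2/2).$$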
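The lemma can also be checked numerically. The sketch below approximates $E\exp(itX)$ by the trapezoidal rule on a truncated range and compares it with $\exp(-t^2/2)$; the function name `normal_cf_numeric`, the truncation at ±12 and the number of steps are arbitrary choices made for illustration.

```python
import cmath
import math

def normal_cf_numeric(t, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of E[exp(itX)] for X ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0j
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # end points get half weight
        density = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += weight * cmath.exp(1j * t * x) * density
    return total * h

for t in (0.0, 0.5, 1.0, 2.0):
    approx = normal_cf_numeric(t)
    exact = math.exp(-t * t / 2)
    # The imaginary part should vanish by symmetry of the density.
    print(t, approx.real, approx.imag, exact)
```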
Aside: The mgf $M_X(t) = E \exp(tX)$ can also be derived and used in the argument below; however, cfs are more general so are preferred when possible. We show the (optional, not examinable) derivation here.
We begin by completing the square:
$$\begin{aligned}
M_X(t) = E \exp(tX) &= \int_{-\infty}^{+\infty} \exp(tx)\, \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(tx - x^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2tx)/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2tx + t^2)/2 + t^2/2)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - t)^2/2 + t^2/2)\, dx \\
&= \exp(t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - t)^2/2)\, dx.
\end{aligned}$$
The change of variable $z = x - t$, $x = z + t$, $dx = dz$ results in
$$M_X(t) = \exp(t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz = \exp(t^2/2),$$
since the integrand is just a standard normal density.
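Theorem 2.3.
Suppose that for a random vector $X \in \mathbb{R}^p$ with a normal distribution according to Definition 2.1 we have $E(X) = \mu$ and $D(X) = E[(X - \mu)(X - \mu)^\top] = \Sigma$. Then:
i) For any fixed $t \in \mathbb{R}^p$, $t^\top X \sim N(t^\top \mu, t^\top \Sigma t)$, i.e. $t^\top X$ has a one-dimensional normal distribution with expected value $t^\top \mu$ and variance $t^\top \Sigma t$.
ii) The cf of $X \in \mathbb{R}^p$ is
$$\varphi_X(t) = e^{\,i t^\top \mu - \frac{1}{2} t^\top \Sigma t}. \tag{2.3}$$
Proof. Part i) is obvious. For part ii) we recall from Lemma 2.2 that the cf of the standard univariate normal random variable $Z$ is $e^{-\tau^2/2}$, where $\tau$ denotes a scalar argument. Since any $U \sim N_1(\mu_1, \sigma_1^2)$ has a distribution that coincides with the distribution of $\mu_1 + \sigma_1 Z$ we have:
$$\varphi_U(\tau) = e^{i\tau\mu_1} \varphi_{\sigma_1 Z}(\tau) = e^{i\tau\mu_1} E(e^{i\tau\sigma_1 Z}) = e^{i\tau\mu_1} \varphi_Z(\tau\sigma_1) = e^{\,i\tau\mu_1 - \frac{1}{2}\tau^2\sigma_1^2}.$$
But then, for the univariate random variable $t^\top X \sim N_1(t^\top \mu, t^\top \Sigma t)$ we would have as a characteristic function $\varphi_{t^\top X}(\tau) = e^{\,i\tau\, t^\top \mu - \frac{1}{2}\tau^2\, t^\top \Sigma t}$. Substituting $\tau = 1$ in the latter formula we find that
$$\varphi_X(t) = e^{\,i t^\top \mu - \frac{1}{2} t^\top \Sigma t}.$$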
As an upshot, we see that given the expected value vector µ and the covariance matrix Σ we can use the cf formula (2.3) rather than the density formula (2.2) to define the p-dimensional multivariate normal distribution. The advantage of the former in comparison to the latter is that in (2.3) only Σ is used and not its inverse, i.e. this definition also makes sense in cases of singular (i.e. non-invertible) Σ. We still want to know that in the case of nonsingular Σ the more general definition gives rise to the density (2.2). This is the content of the next theorem.
Theorem 2.4. Assume the matrix Σ in (2.3) is nonsingular. Then the density of the random
vector X ∈ Rp with cf as in (2.3) is given by (2.2).
Proof. Consider the vector $Y \in \mathbb{R}^p$ such that $Y = \Sigma^{-1/2}(X - \mu)$ (compare (0.10) in Section 0.1.5). Since obviously $E(Y) = 0$ and
$$D(Y) = E(Y Y^\top) = \Sigma^{-1/2} E[(X - \mu)(X - \mu)^\top] \Sigma^{-1/2} = I_p$$
holds, we can substitute to get the cf of $Y \in \mathbb{R}^p$: $\varphi_Y(t) = e^{-\frac{1}{2}\sum_{i=1}^{p} t_i^2}$. But the latter can be seen directly to be the characteristic function of a vector of p independent standard normal variables. Hence, from the relation $Y = \Sigma^{-1/2}(X - \mu)$ we can also conclude that $X = \mu + \Sigma^{1/2} Y$ where the density of $Y$ is
$$f_Y(y) = \frac{1}{(2\pi)^{p/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{p} y_i^2}.$$
In other words, $X$ is a linear transformation of $Y$ where the density of $Y$ is known. We can therefore apply the density transformation approach (Section 0.2.4) to obtain
$$f_X(x) = f_Y(\Sigma^{-1/2}(x - \mu))\, J(x_1, \dots, x_p).$$
It is easy to see (because of the linearity of the transformation) that $J(x_1, \dots, x_p) = |\Sigma^{-1/2}| = |\Sigma^{1/2}|^{-1} = |\Sigma|^{-1/2}$. Taking into account that
$$\sum_{i=1}^{p} y_i^2 = y^\top y = (x - \mu)^\top \Sigma^{-1/2} \Sigma^{-1/2} (x - \mu) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$$
we finally arrive at the density formula (2.2) for $f_X(x)$.
2.2 Properties of multivariate normal
The following properties of multivariate normal can be easily derived using the machinery developed so far:
Property 1
If $\Sigma = D(X) = \Lambda$ is a diagonal matrix then the p components of $X$ are independent.
(Indeed, in this case $\varphi_X(t) = \exp\left\{\sum_{j=1}^{p} \left(i t_j \mu_j - \frac{1}{2} t_j^2 \sigma_j^2\right)\right\}$, which can be seen to be the cf of a vector of p independent components each distributed according to $N(\mu_j, \sigma_j^2)$, $j = 1, \dots, p$.)
The above property can be paraphrased as “for a multivariate normal, if its components are uncorrelated they are also independent”. On the other hand, it is well known that always (i.e. not only for the normal distribution) from the fact that certain components are independent we can conclude that they are also uncorrelated. Therefore, for the multivariate normal distribution we can conclude that its components are independent if and only if they are uncorrelated!
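Example 2.5 (Random variables that are marginally normal and uncorrelated but not independent). Consider two variables $Z_1 = (2W - 1)Y$ and $Z_2 = Y$, where $Y \sim N_1(0, 1)$ and, independently, $W \sim \text{Binomial}(1, 1/2)$ (so $2W - 1$ takes $-1$ and $+1$ with equal probability). By symmetry $Z_1 \sim N_1(0, 1)$, and $\operatorname{Cov}(Z_1, Z_2) = E(2W - 1)\, E(Y^2) = 0$, yet $|Z_1| = |Z_2|$ always, so $Z_1$ and $Z_2$ are not independent.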
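A small simulation illustrates Example 2.5. This is only a sketch: the seed and the sample size are arbitrary, and the sample correlation is computed by hand so that nothing beyond the standard library is needed.

```python
import random

random.seed(5855)  # arbitrary seed, used only to make the run reproducible
n = 20000
z1, z2 = [], []
for _ in range(n):
    y = random.gauss(0.0, 1.0)                     # Y ~ N(0, 1)
    sign = 1.0 if random.random() < 0.5 else -1.0  # 2W - 1 with W ~ Binomial(1, 1/2)
    z1.append(sign * y)
    z2.append(y)

mean1 = sum(z1) / n
mean2 = sum(z2) / n
cov12 = sum((a - mean1) * (b - mean2) for a, b in zip(z1, z2)) / n
var1 = sum((a - mean1) ** 2 for a in z1) / n
var2 = sum((b - mean2) ** 2 for b in z2) / n
corr12 = cov12 / (var1 * var2) ** 0.5

# Dependence: |Z1| and |Z2| coincide on every draw.
same_magnitude = all(abs(abs(a) - abs(b)) < 1e-12 for a, b in zip(z1, z2))
print(corr12, same_magnitude)
```

The sample correlation should be close to zero even though knowing $Z_2$ determines $Z_1$ up to its sign.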
Property 2
If X ∼ Np(µ,Σ) and C ∈Mq,p is an arbitrary matrix of real numbers then
Y = CX ∼ Nq(Cµ, CΣC⊤).
To prove this property note that (see Section 0.2.5) for any $s \in \mathbb{R}^q$ we have:
$$\varphi_Y(s) = \varphi_X(C^\top s) = e^{\,i s^\top C\mu - \frac{1}{2} s^\top C \Sigma C^\top s}$$
which means that $Y = CX \sim N_q(C\mu, C\Sigma C^\top)$.
Note also that if C has full row rank (rk(C) = q ≤ p) and if rk(Σ) = p then the rank of CΣC⊤ is also full, i.e. the distribution of Y would not be degenerate in this case.
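As a concrete illustration of Property 2, the parameters of $Y = CX$ can be computed with a few lines of matrix arithmetic. The helper names and the numerical values of µ, Σ and C below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def linear_transform(mu, Sigma, C):
    """Mean vector C mu and covariance matrix C Sigma C^T of Y = C X."""
    mean = [sum(c * m for c, m in zip(row, mu)) for row in C]
    cov = matmul(matmul(C, Sigma), transpose(C))
    return mean, cov

# Hypothetical bivariate example: Y1 = X1 + X2, Y2 = X1 - X2.
mu = [1.0, 2.0]
Sigma = [[2.0, 1.0],
         [1.0, 3.0]]
C = [[1.0, 1.0],
     [1.0, -1.0]]
mean_y, cov_y = linear_transform(mu, Sigma, C)
print(mean_y)  # mean vector of Y
print(cov_y)   # covariance matrix of Y
```

Here Var(X1 + X2) = 2 + 3 + 2·1 = 7 and Var(X1 − X2) = 2 + 3 − 2·1 = 3, which agree with the diagonal of the computed matrix.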
Property 3
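(This is a finer version of Property 1.) Assume the vector $X \in \mathbb{R}^p$ is divided into subvectors $X = \begin{pmatrix} X_{(1)} \\ X_{(2)} \end{pmatrix}$ and according to this subdivision the vector of means is $\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}$ and the covariance matrix Σ has been subdivided into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. Then the vectors $X_{(1)}$ and $X_{(2)}$ are independent iff $\Sigma_{12} = 0$.
Proof. Exercise (see lecture).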
Property 4
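Let the vector $X \in \mathbb{R}^p$ be divided into subvectors $X = \begin{pmatrix} X_{(1)} \\ X_{(2)} \end{pmatrix}$, $X_{(1)} \in \mathbb{R}^r$, $r < p$, $X_{(2)} \in \mathbb{R}^{p-r}$, and according to this subdivision the vector of means is $\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}$ and the covariance matrix Σ has been subdivided into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. Assume for simplicity that the rank of $\Sigma_{22}$ is full. Then the conditional density of $X_{(1)}$ given that $X_{(2)} = x_{(2)}$ is
$$N_r\left(\mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x_{(2)} - \mu_{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \tag{2.4}$$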
Proof. Perhaps the easiest way to proceed is the following. Note that the expression $\mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x_{(2)} - \mu_{(2)})$ (which we want to show equals the conditional mean) is a function of $x_{(2)}$. Denote it by $g(x_{(2)})$ for short. Let us construct the random vectors $Z = X_{(1)} - g(X_{(2)})$ and $Y = X_{(2)} - \mu_{(2)}$. Obviously $E Z = 0$ and $E Y = 0$ holds. The vector $\begin{pmatrix} Z \\ Y \end{pmatrix}$ is a linear transformation of a normal vector,
$$\begin{pmatrix} Z \\ Y \end{pmatrix} = A(X - \mu), \qquad A = \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_{p-r} \end{pmatrix},$$
and hence its distribution is normal (Property 2). Calculating the covariance matrix of the vector $\begin{pmatrix} Z \\ Y \end{pmatrix}$ we find that
$$\operatorname{Var}\begin{pmatrix} Z \\ Y \end{pmatrix} = A \Sigma A^\top = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}$$
after a simple exercise in block multiplication of matrices.
Hence the two vectors $Z$ and $Y$ are uncorrelated normal vectors and therefore are independent (Property 3). But $Y$ is a linear transformation of $X_{(2)}$ and this means that $Z$ and $X_{(2)}$ are independent. Hence the conditional density of $Z$ given $X_{(2)} = x_{(2)}$ does not depend on $x_{(2)}$ and coincides with the unconditional density of $Z$. This means that it is normal with zero mean vector and covariance matrix
$$\operatorname{Cov}(Z) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{1|2}.$$
Hence we can state that, given $X_{(2)} = x_{(2)}$, $X_{(1)} - g(x_{(2)}) \sim N(0, \Sigma_{1|2})$ and, correspondingly, the conditional distribution of $X_{(1)}$ given that $X_{(2)} = x_{(2)}$ is (2.4).
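Example 2.6. As an immediate consequence of Property 4 we see that if p = 2, r = 1 then for a two-dimensional normal vector
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left\{ \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right\}$$
the conditional density $f(x_1 \mid x_2)$ is
$$N\left(\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(x_2 - \mu_2),\ \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}\right).$$
As an exercise, try to derive the above result by direct calculations starting from the joint density $f(x_1, x_2)$, going over to the marginal $f(x_2)$ by integration and finally getting $f(x_1 \mid x_2) = \dfrac{f(x_1, x_2)}{f(x_2)}$.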
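The formula of Example 2.6 translates directly into code. The function name `conditional_normal` and the parameter values are hypothetical and serve only as an illustration.

```python
def conditional_normal(mu1, mu2, var1, var2, cov12, x2):
    """Mean and variance of X1 given X2 = x2 for a bivariate normal vector."""
    cond_mean = mu1 + cov12 / var2 * (x2 - mu2)
    cond_var = var1 - cov12 ** 2 / var2
    return cond_mean, cond_var

# Hypothetical parameters: mu = (1, 2), sigma_1^2 = 4, sigma_2^2 = 1, sigma_12 = 1.5.
m, v = conditional_normal(mu1=1.0, mu2=2.0, var1=4.0, var2=1.0, cov12=1.5, x2=3.0)
print(m, v)  # conditional mean 1 + 1.5 * (3 - 2) and variance 4 - 1.5**2 / 1
```

Note that the conditional variance does not depend on the observed value $x_2$ and is never larger than the unconditional variance $\sigma_1^2$.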
Property 5
If $X \sim N_p(\mu, \Sigma)$ and Σ is nonsingular then $(X - \mu)^\top \Sigma^{-1} (X - \mu) \sim \chi^2_p$ where $\chi^2_p$ denotes the chi-square distribution with p degrees of freedom.
Proof. It suffices to use the fact that (see also Theorem 2.4) the vector $Y \in \mathbb{R}^p$, $Y = \Sigma^{-1/2}(X - \mu) \sim N(0, I_p)$, i.e. it has p independent standard normal components. Then
$$(X - \mu)^\top \Sigma^{-1} (X - \mu) = Y^\top Y = \sum_{i=1}^{p} Y_i^2 \sim \chi^2_p$$
according to the definition of $\chi^2_p$ as the distribution of the sum of squares of p independent standard normals.
Finally, one more interpretation of the result in Property 4 will be given. Assume we want, as is a typical situation in statistics, to predict a random variable $Y$ that is correlated with some p random variables (predictors) $X = (X_1\ X_2\ \cdots\ X_p)$. Trying to find the best predictor of $Y$ we would like to minimise the expected value $E_Y[\{Y - g(X)\}^2 \mid X = x]$ over all possible choices of the function $g$ such that $E\, g(X)^2 < \infty$. A little careful work and use of basic properties of conditional expectations leads us (see lecture) to the conclusion that the optimal solution to the above minimisation problem is $g^*(x) = E(Y \mid X = x)$. This optimal solution is also called the regression function. Thus given a particular realisation $x$ of the random vector $X$ the regression function is just the conditional expected value of $Y$ given $X = x$.
In general, the conditional expected value may be a complicated nonlinear function of the predictors. However, if we assume in addition that the joint (p + 1)-dimensional distribution of $Y$ and $X$ is normal then by applying Property 4 we see that given the realisation $x$ of $X$, the best prediction of the $Y$ value is given by $b + \sigma_0^\top C^{-1} x$ where $b = E(Y) - \sigma_0^\top C^{-1} E(X)$, $C$ is the covariance matrix of the vector $X$, and $\sigma_0$ is the vector of covariances of $Y$ with $X_i$, $i = 1, \dots, p$.
Indeed, we know that when the joint (p + 1)-dimensional distribution of $Y$ and $X$ is normal the regression function is given by
$$E(Y) + \sigma_0^\top C^{-1}(x - E(X)).$$
By introducing the notation $b = E(Y) - \sigma_0^\top C^{-1} E(X)$ we can write this as $b + \sigma_0^\top C^{-1} x$.
That is, in case of normality, the optimal predictor of Y in the least squares sense turns out to be a very simple linear function of the predictors. The vector $C^{-1}\sigma_0 \in \mathbb{R}^p$ is the vector of the regression coefficients. Substituting the optimal values we get the minimal value of the expected squared prediction error, which is equal to $\operatorname{Var}(Y) - \sigma_0^\top C^{-1} \sigma_0$.
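For p = 2 predictors the regression coefficients $C^{-1}\sigma_0$, the intercept $b$ and the minimal expected squared error can be computed by hand with the explicit inverse of a 2 × 2 matrix. The sketch below uses made-up population moments; the function name `best_linear_predictor` is likewise only illustrative.

```python
def best_linear_predictor(mean_y, mean_x, C, sigma0, var_y):
    """Intercept b, coefficients C^{-1} sigma0 and minimal error for p = 2 predictors."""
    (c11, c12), (c21, c22) = C
    det = c11 * c22 - c12 * c21
    # beta = C^{-1} sigma0, using the explicit inverse of a 2 x 2 matrix.
    beta = [(c22 * sigma0[0] - c12 * sigma0[1]) / det,
            (-c21 * sigma0[0] + c11 * sigma0[1]) / det]
    b = mean_y - sum(bi * mi for bi, mi in zip(beta, mean_x))
    # Var(Y) - sigma0^T C^{-1} sigma0
    min_error = var_y - sum(si * bi for si, bi in zip(sigma0, beta))
    return b, beta, min_error

# Hypothetical moments of (Y, X1, X2).
b, beta, min_error = best_linear_predictor(
    mean_y=1.0, mean_x=[0.0, 2.0],
    C=[[2.0, 1.0], [1.0, 2.0]], sigma0=[1.0, 1.0], var_y=3.0)
print(b, beta, min_error)
```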
2.3 Tests for Multivariate Normality
We have seen that the assumption of multivariate normality may bring essential simplifications in
analysing data. But applying inference methods based on the multivariate normality assumption
in cases where it is grossly violated may introduce serious defects in the quality of the analysis. It
is therefore important to be able to check the multivariate normality assumption. Based on the
properties of normal distributions discussed in this lecture, we know that all linear combinations
of normal variables are normal and the contours of the multivariate normal density are ellipsoids.
Therefore we can (to some extent) check the multivariate normality hypothesis by:
1. checking if the marginal distributions of each component appear to be normal (by using Q–Q plots and the Shapiro–Wilk test, for example);
2. checking if the scatterplots of pairs of observations give the elliptical appearance expected from normal populations;
3. checking whether there are any outlying observations that should be checked for accuracy.
All this can be done by applying univariate techniques and by drawing scatterplots which are well
developed in SAS and R. To some extent, however, there is a price to be paid for concentrating
on univariate and bivariate examinations of normality.
There is a need to construct a “good” overall test of multivariate normality. One of the simple
and tractable ways to verify the multivariate normality assumption is by using tests based on
Mardia’s multivariate skewness and kurtosis measures. For any general multivariate
distribution we define these respectively as
$$\beta_{1,p} = E[(Y - \mu)^\top \Sigma^{-1} (X - \mu)]^3 \tag{2.5}$$
provided that $X$ is independent of $Y$ but has the same distribution and
$$\beta_{2,p} = E[(X - \mu)^\top \Sigma^{-1} (X - \mu)]^2 \tag{2.6}$$
(if the expectations in (2.5) and (2.6) exist). For the $N_p(\mu, \Sigma)$ distribution: $\beta_{1,p} = 0$ and $\beta_{2,p} = p(p + 2)$.
(Note that when p = 1, the quantity $\beta_{1,1}$ is the square of the skewness coefficient $\dfrac{E(X - \mu)^3}{\sigma^3}$ whereas $\beta_{2,1}$ coincides with the kurtosis coefficient $\dfrac{E(X - \mu)^4}{\sigma^4}$.)
For a sample of size n, consistent estimates of $\beta_{1,p}$ and $\beta_{2,p}$ can be obtained as
$$\hat\beta_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}^3, \qquad \hat\beta_{2,p} = \frac{1}{n} \sum_{i=1}^{n} g_{ii}^2,$$
where $g_{ij} = (x_i - \bar{x})^\top S_n^{-1} (x_j - \bar{x})$. Notice that for $\hat\beta_{1,p}$, we take advantage of our sample being independent and use observations $x_j$ for $j \ne i$ as the “Y” values for $x_i$.
Both quantities $\hat\beta_{1,p}$ and $\hat\beta_{2,p}$ are non-negative and, for multivariate normal data, one would expect them to be around zero and $p(p + 2)$, respectively. Both quantities can be utilised to detect departures from multivariate normality.
Mardia has shown that asymptotically, $k_1 = n\hat\beta_{1,p}/6 \sim \chi^2_{p(p+1)(p+2)/6}$, and $k_2 = [\hat\beta_{2,p} - p(p + 2)]/[8p(p + 2)/n]^{1/2}$ is standard normal. Thus we can use $k_1$ and $k_2$ to test the null hypothesis of multivariate normality. If neither hypothesis is rejected, the multivariate normality assumption is in reasonable agreement with the data. It also has been observed that Mardia’s multivariate kurtosis can be used as a measure to detect outliers from the data that are supposedly distributed as multivariate normal.
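To make the definitions concrete, here is a minimal sketch of Mardia's statistics for bivariate data (p = 2) in plain Python. It assumes that $S_n$ denotes the sample covariance matrix with divisor n; the function name `mardia_bivariate` and the simulated sample are illustrative only. In practice one would use the SAS or R routines listed in Section 2.4.

```python
import math
import random

def mardia_bivariate(data):
    """data: list of (x1, x2) pairs. Returns (b1, b2, k1, k2) for p = 2."""
    n, p = len(data), 2
    m1 = sum(x for x, _ in data) / n
    m2 = sum(y for _, y in data) / n
    centred = [(x - m1, y - m2) for x, y in data]
    # S_n: sample covariance matrix with divisor n (an assumption, see above).
    s11 = sum(a * a for a, _ in centred) / n
    s22 = sum(b * b for _, b in centred) / n
    s12 = sum(a * b for a, b in centred) / n
    det = s11 * s22 - s12 * s12
    # Entries of the inverse of S_n.
    i11, i22, i12 = s22 / det, s11 / det, -s12 / det

    def g(u, v):
        # g_ij = (x_i - xbar)^T S_n^{-1} (x_j - xbar)
        return u[0] * (i11 * v[0] + i12 * v[1]) + u[1] * (i12 * v[0] + i22 * v[1])

    b1 = sum(g(u, v) ** 3 for u in centred for v in centred) / n ** 2
    b2 = sum(g(u, u) ** 2 for u in centred) / n
    k1 = n * b1 / 6                                       # compare with chi-square, 4 df
    k2 = (b2 - p * (p + 2)) / math.sqrt(8 * p * (p + 2) / n)  # compare with N(0, 1)
    return b1, b2, k1, k2

random.seed(2021)  # arbitrary seed
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]
b1, b2, k1, k2 = mardia_bivariate(sample)
print(b1, b2, k1, k2)
```

For p = 2 the reference distribution of $k_1$ has $p(p+1)(p+2)/6 = 4$ degrees of freedom, and $\hat\beta_{2,p}$ should be near $p(p+2) = 8$ for normal data.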
Shapiro–Wilk, Mardia, and other distribution tests have, as their null hypothesis, that the true population distribution is (multivariate) normal. This means that if the population distribution deviates from normality even a little, then as the sample size increases, the power of the test (the probability of rejecting the null hypothesis of normality) approaches 1.
At the same time, as the sample size increases, the Central Limit Theorem tells us that many statistics, including sample means and (much more slowly) sample variances and covariances, approach normality, and multivariate statistics generally approach multivariate normality. This means that regardless of the underlying distribution, the statistical procedures depending on the normality assumption become valid, even as the chance that a statistical hypothesis test will detect what non-normality there is approaches 1.
This means that we must not rely on hypothesis testing blindly but consider the situation on a case-by-case basis, particularly when dealing with large datasets. For a decent sample size, the “symmetric, bell-shaped” heuristic may indicate an adequate distribution, even if a hypothesis test reports a small p-value.
2.4 Software
SAS Use the CALIS procedure. The quantity $k_2$ is called Normalised Multivariate Kurtosis there, whereas $\hat\beta_{2,p} - p(p + 2)$ bears the name Mardia’s Multivariate Kurtosis.
R MVN::mvn, psych::mardia
2.5 Examples
Example 2.7. Testing multivariate normality of microwave oven radioactivity measurements
(JW).
2.6 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 4.1–4.2, 4.6.
2.7 Exercises
Exercise 2.1
Let $X_1$ and $X_2$ denote i.i.d. N(0, 1) r.v.’s.
(a) Show that the r.v.’s $Y_1 = X_1 - X_2$ and $Y_2 = X_1 + X_2$ are independent, and find their marginal densities.
(b) Find $P(X_1^2 + X_2^2 < 2.41)$.
Exercise 2.2
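Let $X \sim N_3(\mu, \Sigma)$ where
$$\mu = \begin{pmatrix} 3 \\ -1 \\ 2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$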
(a) For $A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 1 \end{pmatrix}$ find the distribution of $Z = AX$ and find the correlation between the two components of $Z$.
(b) Find the conditional distribution of $[X_1, X_3]^\top$ given $X_2 = 0$.
Exercise 2.3
Suppose that $X_1, \dots, X_n$ are independent random vectors, with each $X_i \sim N_p(\mu_i, \Sigma_i)$. Let $a_1, \dots, a_n$ be real constants. Using characteristic functions, show that
$$a_1 X_1 + \cdots + a_n X_n \sim N_p(a_1\mu_1 + \cdots + a_n\mu_n,\ a_1^2\Sigma_1 + \cdots + a_n^2\Sigma_n).$$
Therefore, deduce that, if $X_1, \dots, X_n$ form a random sample from the $N_p(\mu, \Sigma)$ distribution, then the sample mean vector, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, has distribution
$$\bar{X} \sim N_p\left(\mu, \frac{1}{n}\Sigma\right).$$
Exercise 2.4
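Prove that if $X_1 \sim N_r(\mu_1, \Sigma_{11})$ and $(X_2 \mid X_1 = x_1) \sim N_{p-r}(A x_1 + b, \Omega)$ where $\Omega$ does not depend on $x_1$ then $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$ where
$$\mu = \begin{pmatrix} \mu_1 \\ A\mu_1 + b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{11} A^\top \\ A\Sigma_{11} & \Omega + A\Sigma_{11}A^\top \end{pmatrix}.$$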
Exercise 2.5
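Knowing that
i) $Z \sim N_1(0, 1)$,
ii) $Y \mid Z = z \sim N_1(1 + z, 1)$,
iii) $X \mid (Y, Z) = (y, z) \sim N_1(1 - y, 1)$,
(a) Find the distribution of $\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$ and of $Y \mid (X, Z)$.
(b) Find the distribution of $\begin{pmatrix} U \\ V \end{pmatrix} = \begin{pmatrix} 1 + Z \\ 1 - Y \end{pmatrix}$.
(c) Compute $E(Y \mid U = 2)$.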
UNSW MATH5855 2021T3 Lecture 3 Multivariate Normal Estimation
3 Estimation of the Mean Vector and Covariance Matrix of Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of $\hat\mu$ and $\hat\Sigma$
3.2 Distributions of MLE of mean vector and covariance matrix of multivariate normal distribution
3.2.1 Sampling distribution of $\bar{X}$
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
3.3 Additional resources
3.4 Exercises
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
Suppose we have observed n independent realisations of p-dimensional random vectors from $N_p(\mu, \Sigma)$. Suppose for simplicity that Σ is nonsingular. The data matrix has the form
$$X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
X_{p1} & X_{p2} & \cdots & X_{pj} & \cdots & X_{pn}
\end{pmatrix} = [X_1, X_2, \dots, X_n] \tag{3.1}$$
The goal is to estimate the unknown mean vector and the covariance matrix of the multivariate normal distribution by the Maximum Likelihood Estimation (MLE) method.
Based on our knowledge from Lecture 2 we can write down the likelihood function
$$L(x; \mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} e^{-\frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)} \tag{3.2}$$
(Note that we have substituted the observations in (3.2) and consider L as a function of the unknown parameters µ, Σ only.) Correspondingly, we get the log-likelihood function in the form
$$\log L(x; \mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \tag{3.3}$$
It is well known that maximising either (3.2) or (3.3) will give the same solution for the MLE.
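We start deriving the MLE by trying to maximise (3.3). To this end, first note that by utilising properties of traces from Section 0.1.1, we can transform:
$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) &= \sum_{i=1}^{n} \operatorname{tr}[\Sigma^{-1}(x_i - \mu)(x_i - \mu)^\top] \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top\Big)\Big] \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu)(\bar{x} - \mu)^\top\Big)\Big] \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top\Big)\Big] + n(\bar{x} - \mu)^\top \Sigma^{-1} (\bar{x} - \mu),
\end{aligned}$$
where the third equality follows by adding $\pm\bar{x}$, with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, to each term $(x_i - \mu)$ in $\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$.
Thus
$$\log L(x; \mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top\Big)\Big] - \frac{1}{2} n (\bar{x} - \mu)^\top \Sigma^{-1} (\bar{x} - \mu) \tag{3.4}$$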
3.1.2 Maximum Likelihood Estimators
The MLE are the values that maximise (3.4). Looking at (3.4) we realise that (since Σ, and hence $\Sigma^{-1}$, is positive definite) the minimal value of $\frac{1}{2} n (\bar{x} - \hat\mu)^\top \Sigma^{-1} (\bar{x} - \hat\mu)$ is zero and is attained when $\hat\mu = \bar{x}$. It remains to find the optimal value for Σ. We will use the following result.
Theorem 3.1 (Anderson’s lemma). If $A \in M_{p,p}$ is symmetric positive definite, then the maximum of the function $h(G) = -n\log(|G|) - \operatorname{tr}(G^{-1}A)$ (defined over the set of symmetric positive definite matrices $G \in M_{p,p}$) exists, occurs at $G = \frac{1}{n}A$ and has the maximal value of $np\log(n) - n\log(|A|) - np$.
Proof. (sketch, details at lecture): Indeed (see properties of traces),
$$\operatorname{tr}(G^{-1}A) = \operatorname{tr}((G^{-1}A^{1/2})A^{1/2}) = \operatorname{tr}(A^{1/2}G^{-1}A^{1/2}).$$
Let $\eta_i$, $i = 1, \dots, p$, be the eigenvalues of $A^{1/2}G^{-1}A^{1/2}$. Then (since the matrix $A^{1/2}G^{-1}A^{1/2}$ is positive definite) $\eta_i > 0$, $i = 1, \dots, p$. Also, $\operatorname{tr}(A^{1/2}G^{-1}A^{1/2}) = \sum_{i=1}^{p} \eta_i$ and $|A^{1/2}G^{-1}A^{1/2}| = \prod_{i=1}^{p} \eta_i$ holds. Hence
$$-n\log|G| - \operatorname{tr}(G^{-1}A) = n\sum_{i=1}^{p} \log\eta_i - n\log|A| - \sum_{i=1}^{p} \eta_i \tag{3.5}$$
Considering the expression $n\sum_{i=1}^{p} \log\eta_i - n\log|A| - \sum_{i=1}^{p} \eta_i$ as a function of the eigenvalues $\eta_i$, $i = 1, \dots, p$, we realise that it has a maximum which is attained when all $\eta_i = n$, $i = 1, \dots, p$. Indeed, the first partial derivatives with respect to $\eta_i$, $i = 1, \dots, p$, are equal to $\frac{n}{\eta_i} - 1$ and hence the stationary point is $\eta_i^* = n$, $i = 1, \dots, p$. The matrix of second derivatives calculated at $\eta_i^* = n$, $i = 1, \dots, p$, is equal to $-\frac{1}{n} I_p$, which is negative definite, and hence the stationary point gives rise to a maximum of the function. Now, we can check directly by substituting the $\eta^*$ values that the maximal value of the function is $np\log(n) - n\log(|A|) - np$. But a direct substitution in the formula $h(G) = -n\log(|G|) - \operatorname{tr}(G^{-1}A)$ with $G = \frac{1}{n}A$ also gives rise to $np\log(n) - n\log(|A|) - np$, i.e. the maximum is attained at $G = \frac{1}{n}A$.
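Anderson's lemma can be sanity-checked numerically for p = 2. The sketch below evaluates $h(G)$ at $G = \frac{1}{n}A$ and at a grid of nearby symmetric positive definite matrices; the matrix A, the value of n and the perturbation size are hypothetical choices, and symmetric 2 × 2 matrices are stored as triples (a, b, c) standing for [[a, b], [b, c]].

```python
import math

def h(G, A, n):
    """h(G) = -n log|G| - tr(G^{-1} A) for symmetric 2 x 2 matrices given as (a, b, c)."""
    ga, gb, gc = G
    aa, ab, ac = A
    det_g = ga * gc - gb * gb
    # tr(G^{-1} A), using the explicit inverse of a 2 x 2 matrix.
    trace = (gc * aa - 2 * gb * ab + ga * ac) / det_g
    return -n * math.log(det_g) - trace

n, p = 10, 2
A = (4.0, 1.0, 3.0)                 # hypothetical positive definite A
G_star = tuple(a / n for a in A)    # claimed maximiser G = A / n
det_a = A[0] * A[2] - A[1] ** 2
claimed_max = n * p * math.log(n) - n * math.log(det_a) - n * p

# Nearby symmetric matrices; all remain positive definite for these perturbations.
perturbed = [
    (G_star[0] + d1, G_star[1] + d2, G_star[2] + d3)
    for d1 in (-0.05, 0.0, 0.05)
    for d2 in (-0.05, 0.0, 0.05)
    for d3 in (-0.05, 0.0, 0.05)
]
print(h(G_star, A, n), claimed_max)
print(max(h(G, A, n) for G in perturbed))
```

The value at `G_star` agrees with the closed-form maximum, and no perturbed matrix gives a larger value of h.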
Using the structure of the log-likelihood function in (3.4) and Theorem 3.1 (applied for the case $A = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$ (!)) it is now easy to formulate the following:
Theorem 3.2. Suppose $X_1, X_2, \dots, X_n$ is a random sample from $N_p(\mu, \Sigma)$, $p < n$. Then $\hat\mu = \bar{X}$ and $\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^\top$ are the maximum likelihood estimators of µ and Σ, respectively.
3.1.3 Alternative proofs
Alternative proofs of Theorem 3.2 are also available that utilise some formal rules for vector and matrix differentiation, which have been developed as standard machinery in multivariate analysis. (Recall that, according to the folklore, in order to find the maximum of the log-likelihood we need to differentiate it with respect to its arguments, i.e. with respect to the vector µ and to the matrix Σ, set the derivatives equal to zero, and solve the corresponding system of equations.) If time permits, these matrix differentiation rules will also be discussed later in this course.
3.1.4 Application in correlation matrix estimation
The correlation matrix can be defined in terms of the elements of the covariance matrix $\Sigma$. The correlation coefficients $\rho_{ij}$, $i = 1, \dots, p$, $j = 1, \dots, p$, are defined as
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}},$$
where $\Sigma = (\sigma_{ij},\ i = 1, \dots, p;\ j = 1, \dots, p)$ is the covariance matrix. Note that $\rho_{ii} = 1$, $i = 1, \dots, p$. To derive the MLE of $\rho_{ij}$, $i = 1, \dots, p$, $j = 1, \dots, p$, we note that these are continuous transformations of the covariances, whose maximum likelihood estimators have already been derived. Then we can claim (according to the transformation invariance property of the MLE) that
$$\hat\rho_{ij} = \frac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}}\sqrt{\hat\sigma_{jj}}}, \quad i = 1, \dots, p,\ j = 1, \dots, p. \tag{3.6}$$
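As a small illustration of (3.6), the following Python sketch converts a covariance estimate into the corresponding correlation estimate. The function name and the numerical matrix are hypothetical and chosen only for illustration.

```python
import math

def correlation_from_covariance(sigma):
    """Apply rho_ij = sigma_ij / (sqrt(sigma_ii) * sqrt(sigma_jj)) entrywise."""
    p = len(sigma)
    return [[sigma[i][j] / math.sqrt(sigma[i][i] * sigma[j][j])
             for j in range(p)] for i in range(p)]

# Hypothetical covariance estimate for p = 2 variables.
sigma_hat = [[4.0, 2.0],
             [2.0, 9.0]]
rho_hat = correlation_from_covariance(sigma_hat)
print(rho_hat)
```

Because the map from covariances to correlations is applied entrywise, the same function works for any $p$.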
3.1.5 Sufficiency of µˆ and Σˆ
Returning to (3.4), we can write the likelihood function as
$$L(x; \mu, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu)(\bar{x} - \mu)^\top\right)\right]\right\},$$
which means that $L(x; \mu, \Sigma)$ can be factorised into $L(x; \mu, \Sigma) = g_1(x)\,g_2(\mu, \Sigma; \hat\mu, \hat\Sigma)$, i.e. the likelihood function depends on the observations only through the values of $\hat\mu = \bar{X}$ and $\hat\Sigma$. Hence the pair $\hat\mu$ and $\hat\Sigma$ are sufficient statistics for $\mu$ and $\Sigma$ in the case of a sample from $N_p(\mu, \Sigma)$. Note that the structure of the multivariate normal density was essentially used here, which underlines the importance of checking the adequacy of the multivariate normality assumption in practice. If testing indicates significant departures from multivariate normality, then inferences that are based solely on $\hat\mu$ and $\hat\Sigma$ may not be very reliable.
3.2 Distributions of MLE of mean vector and covariance matrix of multivariate normal distribution
Inference is not restricted to finding point estimators; we also want to construct confidence regions, test hypotheses, etc. To this end we need the distribution of the estimators (or of suitably chosen functions of them).
3.2.1 Sampling distribution of X¯
In the univariate case ($p = 1$) it is well known that for a sample of $n$ observations from the normal distribution $N(\mu, \sigma^2)$ the sample mean is normally distributed: $N(\mu, \frac{\sigma^2}{n})$. Moreover, the sample mean and the sample variance are independent in the case of sampling from a univariate normal population (Basu's Lemma). This fact was very useful in developing t-statistics for testing the mean. Do we have similar statements about the sample mean and sample variance in the multivariate ($p > 1$) case?
Let the random vector $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \in R^p$. For any $l \in R^p$, $l^\top\bar{X}$ is a linear combination of normals and hence is normal (see Definition 2.1). Since taking the expected value is a linear operation, we have $E\bar{X} = \frac{1}{n}n\mu = \mu$. In analogy with the univariate case we could formally write $\operatorname{Cov}\bar{X} = \frac{1}{n^2}n\operatorname{Cov}X_1 = \frac{1}{n}\Sigma$ and hence $\bar{X} \sim N_p(\mu, \frac{1}{n}\Sigma)$. But we would like to develop more appropriate machinery for the multivariate case that would help us prove statements like the last one more rigorously. It is based on operations with Kronecker products.
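The Kronecker product of two matrices $A \in M_{m,n}$ and $B \in M_{p,q}$ is denoted by $A \otimes B$ and is defined (in block matrix notation) as
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix} \tag{3.7}$$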
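The following basic properties of Kronecker products will be used:
$$(A \otimes B) \otimes C = A \otimes (B \otimes C)$$
$$(A + B) \otimes C = A \otimes C + B \otimes C$$
$$(A \otimes B)^\top = A^\top \otimes B^\top$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD$$
(whenever the corresponding matrix products and inverses exist)
$$\operatorname{tr}(A \otimes B) = \operatorname{tr}(A)\operatorname{tr}(B)$$
$$|A \otimes B| = |A|^p|B|^m$$
(in case $A \in M_{m,m}$, $B \in M_{p,p}$).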
In addition, the vectorisation operation $\vec{\,\cdot\,}$ on a matrix $A \in M_{m,n}$ will be defined. This operation creates a vector $\vec{A} \in R^{mn}$ which is composed by stacking the $n$ columns of the matrix $A \in M_{m,n}$ under each other (the second below the first, etc.). For matrices $A$, $B$ and $C$ (of suitable dimensions) it holds that
$$\overrightarrow{ABC} = (C^\top \otimes A)\vec{B}.$$
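The Kronecker product and the stacking identity above can be checked numerically. The following is a minimal Python sketch using plain lists of rows; the helper names (`kron`, `vec`, and so on) and the three small matrices are illustrative assumptions, not part of the notes.

```python
def kron(A, B):
    """Kronecker product: block (i, j) of the result is a_ij * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def vec(A):
    """Stack the columns of A under each other."""
    return [x for col in zip(*A) for x in col]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Hypothetical 2 x 2 integer matrices, so all comparisons are exact.
A = [[1, 2], [0, 1]]
B = [[2, 1], [1, 3]]
C = [[1, 0], [4, 1]]

lhs = vec(matmul(matmul(A, B), C))                  # vec(ABC)
K = kron(transpose(C), A)                           # C^T (x) A
rhs = [sum(k * b for k, b in zip(row, vec(B))) for row in K]

print(lhs == rhs)                                   # stacking identity
print(trace(kron(A, B)) == trace(A) * trace(B))     # trace property
```

Integer entries are used so that the two sides can be compared with `==` without worrying about rounding.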
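Let us see how we could utilise the above to derive the distribution of $\bar{X}$. Denote by $1_n$ the vector of $n$ ones. Note that if $X$ is the random data matrix (see (0.11) in Lecture 0.2) then $\vec{X} \sim N(1_n \otimes \mu, I_n \otimes \Sigma)$ and $\bar{X} = \frac{1}{n}(1_n^\top \otimes I_p)\vec{X}$. Hence $\bar{X}$ is multivariate normal with
$$E\bar{X} = \frac{1}{n}(1_n^\top \otimes I_p)(1_n \otimes \mu) = \frac{1}{n}(1_n^\top 1_n \otimes \mu) = \frac{1}{n}n\mu = \mu,$$
$$\operatorname{Cov}\bar{X} = n^{-2}(1_n^\top \otimes I_p)(I_n \otimes \Sigma)(1_n \otimes I_p) = n^{-2}(1_n^\top 1_n \otimes \Sigma) = n^{-1}\Sigma.$$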
Independence of X¯ and Σˆ
How can we show that $\bar{X}$ and $\hat\Sigma$ are independent? Recall the likelihood function
$$L(x; \mu, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu)(\bar{x} - \mu)^\top\right)\right]\right\}.$$
We have two summands in the exponent, one of which is a function of the observations through $n\hat\Sigma = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ only, while the other depends on the observations through $\bar{x}$ only. The idea is now to transform the original data matrix $X \in M_{p,n}$ into a new matrix $Z \in M_{p,n}$ whose columns are independent normal, in such a way that $\bar{X}$ is a function of the first column $Z_1$ only, whereas $\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ is a function of $Z_2, \dots, Z_n$ only. If we succeed, then clearly $\bar{X}$ and $\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top = n\hat\Sigma$ would be independent.
Now the claim is that the sought-after transformation is given by $Z = XA$ with $A \in M_{n,n}$ being an orthogonal matrix with first column equal to $\frac{1}{\sqrt{n}}1_n$. Note that the first column of $Z$ would then be $\sqrt{n}\bar{X}$. (An explicit form of the matrix $A$ can be obtained using the Gram–Schmidt process discussed later.) Since $\vec{Z} = \overrightarrow{I_pXA} = (A^\top \otimes I_p)\vec{X}$, the Jacobian of the transformation ($\vec{X}$ into $\vec{Z}$) is $|A^\top \otimes I_p| = |A|^p = \pm 1$ (note that $A$ is orthogonal). Therefore, the absolute value of the Jacobian is equal to one. For $\vec{Z}$ we have:
$$E(\vec{Z}) = (A^\top \otimes I_p)(1_n \otimes \mu) = A^\top 1_n \otimes \mu = \begin{pmatrix} \sqrt{n} \\ 0 \\ \vdots \\ 0 \end{pmatrix} \otimes \mu.$$
Further,
$$\operatorname{Cov}(\vec{Z}) = (A^\top \otimes I_p)(I_n \otimes \Sigma)(A \otimes I_p) = A^\top A \otimes I_p\Sigma I_p = I_n \otimes \Sigma,$$
which means that the $Z_i$, $i = 1, \dots, n$, are independent (they are jointly normal and uncorrelated). Note that $Z_1 = \sqrt{n}\bar{X}$ holds (because of the choice of the first column of the orthogonal matrix $A$). Further,
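$$\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top = \sum_{i=1}^n X_iX_i^\top - \frac{1}{n}\left(\sum_{i=1}^n X_i\right)\left(\sum_{i=1}^n X_i^\top\right) = ZA^\top AZ^\top - Z_1Z_1^\top = \sum_{i=1}^n Z_iZ_i^\top - Z_1Z_1^\top = \sum_{i=2}^n Z_iZ_i^\top.$$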
Hence we have proved the following:
Theorem 3.3. For a sample of size $n$ from $N_p(\mu, \Sigma)$, $p < n$, the sample average $\bar{X} \sim N_p(\mu, \frac{1}{n}\Sigma)$. Moreover, the MLEs $\hat\mu = \bar{X}$ and $\hat\Sigma$ are independent.
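The two algebraic facts used in the proof, $Z_1 = \sqrt{n}\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top = \sum_{i=2}^n Z_iZ_i^\top$, can be checked numerically. The sketch below uses a Helmert-type matrix as one convenient orthogonal $A$ with first column $\frac{1}{\sqrt{n}}1_n$; this particular choice of $A$, the helper names and the data are assumptions made for illustration (the notes obtain $A$ via Gram–Schmidt instead).

```python
import math

def helmert(n):
    """Orthogonal n x n matrix whose first column is (1/sqrt(n)) * 1_n."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][0] = 1.0 / math.sqrt(n)
    for k in range(1, n):                 # remaining columns (0-based index k)
        scale = math.sqrt(k * (k + 1))
        for i in range(k):
            A[i][k] = 1.0 / scale
        A[k][k] = -k / scale
    return A

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical p x n data matrix: p = 2 variables, n = 4 observations in columns.
X = [[1.0, 4.0, 2.0, 5.0],
     [3.0, 1.0, 0.0, 4.0]]
p, n = len(X), len(X[0])

A = helmert(n)
Z = matmul(X, A)
xbar = [sum(row) / n for row in X]

# Centred sum of squares and cross-products, i.e. n * Sigma_hat.
scatter = [[sum((X[r][i] - xbar[r]) * (X[s][i] - xbar[s]) for i in range(n))
            for s in range(p)] for r in range(p)]
# Sum of Z_i Z_i^T over columns 2, ..., n (0-based columns 1, ..., n-1).
zz_tail = [[sum(Z[r][i] * Z[s][i] for i in range(1, n))
            for s in range(p)] for r in range(p)]

first_col_gap = max(abs(Z[r][0] - math.sqrt(n) * xbar[r]) for r in range(p))
scatter_gap = max(abs(scatter[r][s] - zz_tail[r][s])
                  for r in range(p) for s in range(p))
print(first_col_gap, scatter_gap)  # both should be zero up to rounding error
```

The check is purely algebraic: it verifies the decomposition for one data matrix and says nothing by itself about distributions.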
3.2.2 Sampling distribution of the MLE of Σ
Definition 3.4. A random matrix $U \in M_{p,p}$ has a Wishart distribution with parameters $\Sigma, p, n$ (denoted by $U \sim W_p(\Sigma, n)$) if there exist $n$ independent random vectors $Y_1, \dots, Y_n$, each with $N_p(0, \Sigma)$ distribution, such that the distribution of $\sum_{i=1}^n Y_iY_i^\top$ coincides with the distribution of $U$.
Note that we require that p < n and that U be nonnegative definite.
Having in mind the proof of Theorem 3.3, we can claim that the distribution of the matrix $n\hat\Sigma = \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$ is the same as that of $\sum_{i=2}^n Z_iZ_i^\top$ and is therefore Wishart with parameters $\Sigma, p, n - 1$. That is, we can write:
$$n\hat\Sigma \sim W_p(\Sigma, n - 1).$$
The density formula for the Wishart distribution is given in several sources but we will not
deal with it in this course. Some properties of Wishart distribution will be mentioned though
since we will make use of them later in the course:
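1. If $p = 1$ and if we denote the "matrix" $\Sigma$ by $\sigma^2$ (as usual) then $W_1(\Sigma, n)/\sigma^2 = \chi^2_n$. In particular, when $\sigma^2 = 1$ we see that $W_1(1, n)$ is exactly the $\chi^2_n$ random variable. In that sense we can state that the Wishart distribution is a generalisation (with respect to the dimension $p$) of the chi-squared distribution.
2. For an arbitrary fixed matrix $H \in M_{k,p}$, $k \le p$, one has:
$$nH\hat\Sigma H^\top \sim W_k(H\Sigma H^\top, n - 1).$$
(Why? Show it!)
3. Refer to the previous case for the particular value of $k = 1$. The matrix $H \in M_{1,p}$ is just a $p$-dimensional row vector that we could denote by $c^\top$. Then:
i) $n\dfrac{c^\top\hat\Sigma c}{c^\top\Sigma c} \sim \chi^2_{n-1}$
ii) $n\dfrac{c^\top\Sigma^{-1}c}{c^\top\hat\Sigma^{-1}c} \sim \chi^2_{n-p}$
4. Let us partition $S = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top \in M_{p,p}$ into
$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \quad S_{11} \in M_{r,r},\ r < p,$$
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Sigma_{11} \in M_{r,r},\ r < p.$$
Further, denote
$$S_{1|2} = S_{11} - S_{12}S_{22}^{-1}S_{21}, \quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
Then it holds that
$$(n - 1)S_{11} \sim W_r(\Sigma_{11}, n - 1),$$
$$(n - 1)S_{1|2} \sim W_r(\Sigma_{1|2}, n - p + r - 1).$$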
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
Let $A = [a_1, \dots, a_n] \in M_{n,n}$ be an arbitrary full-rank matrix whose first column must be preserved (up to a constant multiple) but which must otherwise be made into an orthogonal matrix. The idea of the Gram–Schmidt Orthogonalisation (and Orthonormalisation) is to first make $a_2$ orthogonal to $a_1$, then $a_3$ orthogonal to $a_1$ and $a_2$, all the way to making $a_n$ orthogonal to all the previous vectors. This is accomplished by the following procedure:
1. For each $i = 2, \dots, n$,
2.    For each $j = 1, \dots, i - 1$,
3.       Update $a_i = a_i - \dfrac{\langle a_i, a_j\rangle}{\langle a_j, a_j\rangle}a_j$.
4. For each $k = 1, \dots, n$,
5.    Update $a_k = \dfrac{a_k}{\|a_k\|}$.
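Then, after Step 3 for a given $i$ and $j$, the updated $a_i$ satisfies
$$\left\langle a_i - \frac{\langle a_i, a_j\rangle}{\langle a_j, a_j\rangle}a_j,\ a_j\right\rangle = \langle a_i, a_j\rangle - \frac{\langle a_i, a_j\rangle}{\langle a_j, a_j\rangle}\langle a_j, a_j\rangle = 0.$$
We can use induction to show that by the time we reach Step 4, $\langle a_i, a_j\rangle = 0$ for any $i \ne j$.
Observe that after Step 3 completes with $i = 2$ (and therefore $j = 1$ only), $\langle a_1, a_2\rangle = 0$.
Now, suppose that $a_1, \dots, a_{i-1}$ are orthogonal. Then, after Step 3 for some $j$, for an arbitrary $l < i$, $l \ne j$,
$$\left\langle a_i - \frac{\langle a_i, a_j\rangle}{\langle a_j, a_j\rangle}a_j,\ a_l\right\rangle = \langle a_i, a_l\rangle - \frac{\langle a_i, a_j\rangle}{\langle a_j, a_j\rangle}\langle a_j, a_l\rangle = \langle a_i, a_l\rangle,$$
because $\langle a_j, a_l\rangle = 0$: both $l, j \le i - 1$, so $a_j$ and $a_l$ are orthogonal by the induction hypothesis. This means that Step 3 only affects $\langle a_i, a_l\rangle$ for $l = j$: Step 3 cannot make $a_i$ no longer orthogonal to any of the vectors $a_1, \dots, a_{i-1}$ to which it was previously orthogonal, so by the time the loop increments $i$, $a_1, \dots, a_i$ will be orthogonal, completing the proof by induction.
Lastly, Steps 4 and 5 simply ensure that $a_1, \dots, a_n$ have unit norm. At no point is $a_1$ changed except for being normalised.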
Example 3.5. Gram–Schmidt process implemented in R.
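For readers who prefer to see the five steps of Section 3.2.3 as code, here is a minimal sketch in plain Python. It is an illustration written for these steps and is not the R implementation referred to in Example 3.5; the function names and the starting matrix are hypothetical.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(columns):
    """Orthonormalise a list of linearly independent column vectors."""
    a = [[float(x) for x in c] for c in columns]
    n = len(a)
    for i in range(1, n):                                    # Step 1
        for j in range(i):                                   # Step 2
            coef = inner(a[i], a[j]) / inner(a[j], a[j])
            a[i] = [x - coef * y for x, y in zip(a[i], a[j])]  # Step 3
    for k in range(n):                                       # Step 4
        norm = math.sqrt(inner(a[k], a[k]))
        a[k] = [x / norm for x in a[k]]                      # Step 5
    return a

# Hypothetical full-rank starting matrix whose first column is the vector of ones.
Q = gram_schmidt([[1, 1, 1], [0, 1, 0], [0, 0, 1]])
for column in Q:
    print(column)
```

Starting from a matrix whose first column is $1_n$ yields an orthogonal matrix with first column $\frac{1}{\sqrt{n}}1_n$, which is the kind of matrix needed in the proof of Theorem 3.3.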
3.3 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 4.3–4.5.
3.4 Exercises
Exercise 3.1
Find the product $A \otimes B$ if $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $B = \begin{pmatrix} 5 & 0 \\ 2 & 1 \end{pmatrix}$.
UNSW MATH5855 2021T3 Lecture 4 Intervals and Tests for the Mean
4 Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling's $T^2$
4.1.2 Sampling distribution of $T^2$
4.1.3 Noncentral Wishart
4.1.4 $T^2$ as a likelihood ratio statistic
4.1.5 Wilks' lambda and $T^2$
4.1.6 Numerical calculation of $T^2$
4.1.7 Asymptotic distribution of $T^2$
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample $T^2$-test
4.4 Software
4.5 Additional resources
4.6 Exercises
4.1 Hypothesis tests for the multivariate normal mean
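4.1.1 Hotelling's $T^2$
Suppose again that, as in Lecture 3, we have observed $n$ independent realisations of $p$-dimensional random vectors from $N_p(\mu, \Sigma)$. Suppose for simplicity that $\Sigma$ is non-singular. The data matrix has the form
$$x = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{in} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn} \end{pmatrix} = [x_1, x_2, \dots, x_n].$$
Based on our knowledge from Section 3.2 we can claim that $\bar{X} \sim N_p(\mu, \frac{1}{n}\Sigma)$ and $n\hat\Sigma \sim W_p(\Sigma, n - 1)$.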
Consequently, any linear combination $c^\top\bar{X}$, $c \ne 0 \in R^p$, follows $N(c^\top\mu, \frac{1}{n}c^\top\Sigma c)$ and the quadratic form $nc^\top\hat\Sigma c/c^\top\Sigma c \sim \chi^2_{n-1}$. Further, we have shown that $\bar{X}$ and $\hat\Sigma$ are independently distributed and hence
$$T = \sqrt{n}\,c^\top(\bar{X} - \mu)\Big/\sqrt{c^\top\tfrac{n}{n-1}\hat\Sigma c} \sim t_{n-1},$$
i.e. $T$ follows the t distribution with $n - 1$ degrees of freedom. This result has useful applications in testing for contrasts.
Indeed, if we would like to test $H_0 : c^\top\mu = \sum_{i=1}^p c_i\mu_i = 0$, we note that under $H_0$, $T$ becomes simply
$$T = \sqrt{n}\,c^\top\bar{X}\Big/\sqrt{c^\top Sc},$$
that is, it does not involve the unknown $\mu$ and can be used as a test statistic whose distribution under $H_0$ is known. If $|T| > t_{1-\alpha/2,n-1}$ we should reject $H_0$ in favour of $H_1 : c^\top\mu = \sum_{i=1}^p c_i\mu_i \ne 0$.
The formulation of the test for other (one-sided) alternatives is left for you as an exercise.
More often we are interested in testing the mean vector of a multivariate normal. First consider the case of known covariance matrix $\Sigma$ (variance $\sigma^2$ in the univariate case). The standard univariate ($p = 1$) test for this purpose is the following: to test $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$ at level of significance $\alpha$, we look at $U = \sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma}$ and reject $H_0$ if $|U|$ exceeds the upper $\frac{\alpha}{2} \cdot 100\%$ point of the standard normal distribution. Checking if $|U|$ is large enough is equivalent to checking if $U^2 = n(\bar{X} - \mu_0)(\sigma^2)^{-1}(\bar{X} - \mu_0)$ is large enough. We can now easily generalise the above test statistic in a natural way for the multivariate ($p > 1$) case: calculate $U^2 = n(\bar{X} - \mu_0)^\top\Sigma^{-1}(\bar{X} - \mu_0)$ and reject the null hypothesis of $\mu = \mu_0$ when $U^2$ is large enough. Similarly to the proof of Property 5 of the multivariate normal distribution (Section 2.2) and by using Theorem 3.3 of Section 3.2 you can convince yourself (do it (!)) that $U^2 \sim \chi^2_p$ under the null hypothesis. Hence, tables of the $\chi^2$-distribution will suffice to perform the above test in the multivariate case.
Now let us turn to the (practically more relevant) case of unknown covariance matrix $\Sigma$. The standard univariate ($p = 1$) test for this purpose is the t-test. Let us recall it: to test $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$ at level of significance $\alpha$, we look at
$$T = \sqrt{n}\,\frac{\bar{X} - \mu_0}{S}, \quad S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar{X})^2,$$
and reject $H_0$ if $|T|$ exceeds the upper $\frac{\alpha}{2} \cdot 100\%$ point of the t-distribution with $n - 1$ degrees of freedom. We note that checking if $|T|$ is large enough is equivalent to checking if $T^2 = n(\bar{X} - \mu_0)(S^2)^{-1}(\bar{X} - \mu_0)$ is large enough. Of course, under $H_0$, the statistic $T^2$ is F-distributed: $T^2 \sim F_{1,n-1}$, which means that $H_0$ would be rejected at level $\alpha$ when $T^2 > F_{1-\alpha;1,n-1}$. We can now easily generalise the above test statistic in a natural way for the multivariate ($p > 1$) case:
Definition 4.1 (Hotelling's $T^2$). The statistic
$$T^2 = n(\bar{X} - \mu_0)^\top S^{-1}(\bar{X} - \mu_0) \tag{4.1}$$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, $S = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$, $\mu_0 \in R^p$, $X_i \in R^p$, $i = 1, \dots, n$, is named after Harold Hotelling.
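To make (4.1) concrete, the following Python sketch evaluates $T^2$ for a bivariate sample ($p = 2$), inverting the $2 \times 2$ matrix $S$ explicitly. The function name, the data and the hypothesised mean are hypothetical, and the restriction to $p = 2$ is only to keep the sketch free of external libraries.

```python
def hotelling_t2(obs, mu0):
    """One-sample Hotelling T^2 for bivariate observations (p = 2)."""
    n = len(obs)
    xbar = [sum(x[k] for x in obs) / n for k in range(2)]
    # Sample covariance matrix S with divisor n - 1.
    s = [[sum((x[r] - xbar[r]) * (x[c] - xbar[c]) for x in obs) / (n - 1)
          for c in range(2)] for r in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    s_inv = [[s[1][1] / det, -s[0][1] / det],
             [-s[1][0] / det, s[0][0] / det]]
    d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
    return n * sum(d[r] * s_inv[r][c] * d[c] for r in range(2) for c in range(2))

# Hypothetical data: n = 5 observations of a 2-dimensional vector.
obs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
mu0 = (2.0, 2.0)
n, p = len(obs), 2

t2 = hotelling_t2(obs, mu0)
# Rescaled statistic, to be compared with an F quantile with (p, n - p) df.
f_stat = (n - p) / (p * (n - 1)) * t2
print(t2, f_stat)
```

The rescaling in the last step anticipates the null distribution of $T^2$ derived in Section 4.1.2; obtaining the critical value itself requires F tables or statistical software.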
4.1.2 Sampling distribution of $T^2$
Obviously, the test procedure based on Hotelling's statistic will reject the null hypothesis $H_0 : \mu = \mu_0$ if the value of $T^2$ is sufficiently high. It turns out we do not need special tables for the distribution of $T^2$ under the null hypothesis because of the following basic result (which represents a true generalisation of the univariate ($p = 1$) case):
Theorem 4.2. Under the null hypothesis $H_0 : \mu = \mu_0$, Hotelling's $T^2$ is distributed as $\frac{(n-1)p}{n-p}F_{p,n-p}$, where $F_{p,n-p}$ denotes the F-distribution with $p$ and $n - p$ degrees of freedom.
Proof. Indeed, we can write the $T^2$ statistic in the form:
$$T^2 = \frac{n(\bar{X} - \mu_0)^\top S^{-1}(\bar{X} - \mu_0)}{n(\bar{X} - \mu_0)^\top\Sigma^{-1}(\bar{X} - \mu_0)} \cdot n(\bar{X} - \mu_0)^\top\Sigma^{-1}(\bar{X} - \mu_0).$$
Denote $C = \sqrt{n}(\bar{X} - \mu_0)$. Conditionally on $C = c$, the first factor
$$\frac{n(\bar{X} - \mu_0)^\top S^{-1}(\bar{X} - \mu_0)}{n(\bar{X} - \mu_0)^\top\Sigma^{-1}(\bar{X} - \mu_0)} = \frac{c^\top S^{-1}c}{c^\top\Sigma^{-1}c}$$
has a distribution that depends on the data only through $S^{-1}$. Noting that $n\hat\Sigma = (n - 1)S$, so that $n\,c^\top\Sigma^{-1}c/c^\top\hat\Sigma^{-1}c = (n - 1)\,c^\top\Sigma^{-1}c/c^\top S^{-1}c$, and having in mind the third property of Wishart distributions from Section 3.2.2, we can claim that this distribution is the same as that of $(n - 1)/\chi^2_{n-p}$. Note also that the distribution does not depend on the particular $c$. The second factor $n(\bar{X} - \mu_0)^\top\Sigma^{-1}(\bar{X} - \mu_0) \sim \chi^2_p$ and its distribution depends on the data through $\bar{X}$ only. Because of the independence of the mean and covariance estimators, we have that the distribution of $T^2$ is the same as the distribution of $\frac{\chi^2_p(n-1)}{\chi^2_{n-p}}$, where the two chi-squares are independent. But this means that $\frac{T^2(n-p)}{p(n-1)} \sim F_{p,n-p}$ and hence $T^2 \sim \frac{p(n-1)}{n-p}F_{p,n-p}$.
4.1.3 Noncentral Wishart
It is possible to extend the definition of the Wishart distribution in Section 3.2.2 by allowing the random vectors $Y_i$, $i = 1, \dots, n$, there to be independent with $N_p(\mu_i, \Sigma)$ distributions (instead of having all $\mu_i = 0$). In that way one arrives at the noncentral Wishart distribution with parameters $\Sigma, p, n, \Gamma$ (denoted also by $W_p(\Sigma, n, \Gamma)$). Here $\Gamma = MM^\top \in M_{p,p}$, with $M = [\mu_1, \mu_2, \dots, \mu_n]$, is called the noncentrality parameter. When all columns of $M \in M_{p,n}$ are zero, this is the usual (central) Wishart distribution. Theorem 4.2 can be extended to derive the distribution of the $T^2$ statistic under alternatives, i.e. the distribution of $T^2 = n(\bar{X} - \mu_0)^\top S^{-1}(\bar{X} - \mu_0)$ when the true mean is $\mu \ne \mu_0$. This distribution turns out to be related to the noncentral F-distribution. It is helpful in studying the power of the test of $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$. We shall spare the details here.
4.1.4 $T^2$ as a likelihood ratio statistic
It is worth mentioning that Hotelling's $T^2$, which we introduced by analogy with the univariate squared t statistic, can in fact also be derived as the likelihood ratio test statistic for testing $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$. This safeguards the asymptotic optimality of the test suggested in Sections 4.1.1–4.1.2. To see this, first recall the likelihood function (3.2). Its unconstrained maximisation gives the maximum value
$$L(x; \hat\mu, \hat\Sigma) = \frac{1}{(2\pi)^{np/2}|\hat\Sigma|^{n/2}}e^{-np/2}.$$
On the other hand, under $H_0$:
$$\max_\Sigma L(x; \mu_0, \Sigma) = \max_\Sigma \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu_0)^\top\Sigma^{-1}(x_i - \mu_0)\right\}.$$
Since $\log L(x; \mu_0, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)^\top\right)\right]$, on applying Anderson's lemma (see Theorem 3.1 in Section 3.1.2) we find that the maximum of $\log L(x; \mu_0, \Sigma)$ (and hence also of $L(x; \mu_0, \Sigma)$) is attained at $\hat\Sigma_0 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)^\top$, and the maximal value is
$$\frac{1}{(2\pi)^{np/2}|\hat\Sigma_0|^{n/2}}e^{-np/2}.$$
Hence the likelihood ratio is:
$$\Lambda = \frac{\max_\Sigma L(x; \mu_0, \Sigma)}{\max_{\mu,\Sigma} L(x; \mu, \Sigma)} = \left(\frac{|\hat\Sigma|}{|\hat\Sigma_0|}\right)^{n/2} \tag{4.2}$$
The equivalent statistic $\Lambda^{2/n} = \frac{|\hat\Sigma|}{|\hat\Sigma_0|}$ is called Wilks' lambda. Small values of Wilks' lambda lead to rejecting $H_0 : \mu = \mu_0$.
4.1.5 Wilks' lambda and $T^2$
The following theorem shows the relation between Wilks' lambda and $T^2$:
Theorem 4.3. The likelihood ratio test is equivalent to the test based on $T^2$ since $\Lambda^{2/n} = \left(1 + \frac{T^2}{n-1}\right)^{-1}$ holds.
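Proof. Consider the matrix $A \in M_{p+1,p+1}$:
$$A = \begin{pmatrix} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top & \sqrt{n}(\bar{x} - \mu_0) \\ \sqrt{n}(\bar{x} - \mu_0)^\top & -1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$
It is easy to check that
$$|A| = |A_{22}||A_{11} - A_{12}A_{22}^{-1}A_{21}| = |A_{11}||A_{22} - A_{21}A_{11}^{-1}A_{12}| \tag{4.3}$$
holds, from which we get:
$$(-1)\left|\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)^\top\right| = \left|\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\right| \cdot \left|-1 - n(\bar{x} - \mu_0)^\top\left(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\right)^{-1}(\bar{x} - \mu_0)\right|.$$
Hence $(-1)\left|\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)^\top\right| = \left|\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\right|(-1)\left(1 + \frac{T^2}{n-1}\right)$. Thus $|\hat\Sigma_0| = |\hat\Sigma|\left(1 + \frac{T^2}{n-1}\right)$, i.e.
$$\Lambda^{2/n} = \left(1 + \frac{T^2}{n - 1}\right)^{-1} \tag{4.4}$$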
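Relation (4.4) is easy to verify numerically. The Python sketch below computes Wilks' lambda from two determinants for a bivariate sample and recovers $T^2$ from it, then compares the result with the direct quadratic-form calculation. The data, the hypothesised mean and the helper names are hypothetical, and $p = 2$ is assumed so that determinants can be written out by hand.

```python
def scatter(obs, centre):
    """Sum of (x - centre)(x - centre)^T over the sample, for p = 2."""
    return [[sum((x[r] - centre[r]) * (x[c] - centre[c]) for x in obs)
             for c in range(2)] for r in range(2)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical bivariate sample and null value.
obs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
mu0 = (2.0, 2.0)
n = len(obs)
xbar = [sum(x[k] for x in obs) / n for k in range(2)]

a = scatter(obs, xbar)    # n * Sigma_hat
a0 = scatter(obs, mu0)    # n * Sigma_hat_0

# Wilks' lambda: the factors n^p cancel in the ratio of determinants.
wilks = det2(a) / det2(a0)
t2_from_wilks = (n - 1) * (1.0 / wilks - 1.0)

# Direct calculation: T^2 = n (n - 1) d^T a^{-1} d, since S = a / (n - 1).
d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
quad = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det2(a)
t2_direct = n * (n - 1) * quad

print(wilks, t2_from_wilks, t2_direct)
```

The determinant route needs no matrix inverse, which is the computational point made in Section 4.1.6.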
4.1.6 Numerical calculation of $T^2$
Hence $H_0$ is rejected for small values of $\Lambda^{2/n}$ or, equivalently, for large values of $T^2$. The critical values for $T^2$ are determined from Theorem 4.2. Relation (4.4) can be used to calculate $T^2$ from $\Lambda^{2/n} = \frac{|\hat\Sigma|}{|\hat\Sigma_0|}$, thus avoiding the need to invert the matrix $S$ when calculating $T^2$!
4.1.7 Asymptotic distribution of $T^2$
Since $S^{-1}$ is a consistent estimator of $\Sigma^{-1}$, the limiting distribution of $T^2$ will coincide with that of $n(\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu)$ which, as we know already, is $\chi^2_p$. This agrees with a general claim of asymptotic theory which states that $-2\log\Lambda$ is asymptotically distributed as $\chi^2_p$. Indeed:
$$-2\log\Lambda = n\log\left(1 + \frac{T^2}{n - 1}\right) \approx \frac{n}{n - 1}T^2 \approx T^2$$
(by using the fact that $\log(1 + x) \approx x$ for small $x$).
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
For a given confidence level $(1 - \alpha)$ it can be constructed in the form
$$\left\{\mu \;\middle|\; n(\bar{x} - \mu)^\top S^{-1}(\bar{x} - \mu) \le F_{1-\alpha,p,n-p}\,\frac{p}{n - p}(n - 1)\right\}$$
where $F_{1-\alpha,p,n-p}$ is the upper $\alpha \cdot 100\%$ percentage point of the F-distribution with $(p, n - p)$ df. This confidence region has the form of an ellipsoid in $R^p$ centred at $\bar{x}$. The axes of this confidence ellipsoid are directed along the eigenvectors $e_i$ of the matrix $S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$. The half-lengths of the axes are given by the expression $\sqrt{\lambda_i}\sqrt{\frac{p(n-1)F_{1-\alpha,p,n-p}}{n(n-p)}}$, with $\lambda_i$, $i = 1, \dots, p$, being the corresponding eigenvalues, i.e.
$$Se_i = \lambda_ie_i, \quad i = 1, \dots, p.$$
Example 4.4. Microwave ovens (Example 5.3, pages 221–223, JW).
4.2.2 Simultaneous confidence statements
For a given confidence level $(1 - \alpha)$ the confidence ellipsoids in Section 4.2.1 correctly reflect the joint (multivariate) knowledge about plausible values of $\mu \in R^p$, but nevertheless one is often interested in confidence intervals for the means of each individual component. We would like to formulate these statements in such a form that all of the separate confidence statements hold simultaneously with a prespecified probability. This is why we speak of simultaneous confidence intervals.
First, note that if the vector $X \sim N_p(\mu, \Sigma)$ then for any $l \in R^p$, $l^\top X \sim N_1(l^\top\mu, l^\top\Sigma l)$ and, hence, for any fixed $l$ we can construct a $(1 - \alpha) \cdot 100\%$ confidence interval for $l^\top\mu$ in the following simple way:
$$\left(l^\top\bar{x} - t_{1-\alpha/2,n-1}\frac{\sqrt{l^\top Sl}}{\sqrt{n}},\ l^\top\bar{x} + t_{1-\alpha/2,n-1}\frac{\sqrt{l^\top Sl}}{\sqrt{n}}\right) \tag{4.5}$$
By taking $l^\top = [1, 0, \dots, 0]$ or $l^\top = [0, 1, 0, \dots, 0]$ etc. we obtain from (4.5) the usual confidence interval for each separate component of the mean. Note however that the confidence level for all these statements taken together is not $(1 - \alpha)$. To make it $(1 - \alpha)$ for all possible choices simultaneously we need to take a larger constant than $t_{1-\alpha/2,n-1}$ on the right-hand side of the inequality $\left|\frac{\sqrt{n}(l^\top\bar{x} - l^\top\mu)}{\sqrt{l^\top Sl}}\right| \le t_{1-\alpha/2,n-1}$ (or equivalently $\frac{n(l^\top\bar{x} - l^\top\mu)^2}{l^\top Sl} \le t^2_{1-\alpha/2,n-1}$).
4.2.3 Simultaneous confidence ellipsoid
Theorem 4.5. Simultaneously for all $l \in R^p$, the interval
$$\left(l^\top\bar{x} - \sqrt{\frac{p(n - 1)}{n(n - p)}F_{1-\alpha,p,n-p}\,l^\top Sl},\ \ l^\top\bar{x} + \sqrt{\frac{p(n - 1)}{n(n - p)}F_{1-\alpha,p,n-p}\,l^\top Sl}\right)$$
will contain $l^\top\mu$ with a probability at least $(1 - \alpha)$.
Example 4.6. Microwave Ovens (Example 5.4, p. 226 in JW).
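Proof. Note that, according to the Cauchy–Bunyakovsky–Schwarz inequality:
$$[l^\top(\bar{x} - \mu)]^2 = [(S^{1/2}l)^\top S^{-1/2}(\bar{x} - \mu)]^2 \le \|S^{1/2}l\|^2\|S^{-1/2}(\bar{x} - \mu)\|^2 = (l^\top Sl)(\bar{x} - \mu)^\top S^{-1}(\bar{x} - \mu).$$
Therefore,
$$\max_l \frac{n(l^\top(\bar{x} - \mu))^2}{l^\top Sl} \le n(\bar{x} - \mu)^\top S^{-1}(\bar{x} - \mu) = T^2 \tag{4.6}$$
Inequality (4.6) lets us claim that whenever a constant $c$ is such that $T^2 \le c^2$, then also $\frac{n(l^\top\bar{x} - l^\top\mu)^2}{l^\top Sl} \le c^2$ holds for any $l \in R^p$, $l \ne 0$. Equivalently,
$$l^\top\bar{x} - c\sqrt{\frac{l^\top Sl}{n}} \le l^\top\mu \le l^\top\bar{x} + c\sqrt{\frac{l^\top Sl}{n}} \tag{4.7}$$
for every $l$. Now it remains to choose $c^2 = p(n - 1)F_{1-\alpha,p,n-p}/(n - p)$ to make sure that $1 - \alpha = P(T^2 \le c^2)$ holds, and this automatically ensures that the intervals (4.7) contain $l^\top\mu$ for all $l$ simultaneously with probability at least $1 - \alpha$.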
Bonferroni Method
The simultaneous confidence intervals, when applied to the vectors $l^\top = [1, 0, \dots, 0]$, $l^\top = [0, 1, 0, \dots, 0]$ etc., are much more reliable at a given confidence level than the one-at-a-time intervals. Note that the former also utilise the covariance structure of all $p$ variables in their construction. However, sometimes we can do better in cases where one is interested in a small number of individual confidence statements.
In this latter case, the simultaneous confidence intervals may give too large a region and the Bonferroni method may prove more efficient instead. The idea of the Bonferroni approach is based on a simple probabilistic inequality. Assume that simultaneous confidence statements about $m$ linear combinations $l_1^\top\mu, l_2^\top\mu, \dots, l_m^\top\mu$ are required. If $C_i$, $i = 1, 2, \dots, m$, denotes the $i$th confidence statement and $P(C_i \text{ true}) = 1 - \alpha_i$, then
$$P(\text{all } C_i \text{ true}) = 1 - P(\text{at least one } C_i \text{ false}) \ge 1 - \sum_{i=1}^m P(C_i \text{ false}) = 1 - \sum_{i=1}^m (1 - P(C_i \text{ true})) = 1 - (\alpha_1 + \alpha_2 + \dots + \alpha_m).$$
Hence, if we choose $\alpha_i = \frac{\alpha}{m}$, $i = 1, 2, \dots, m$ (that is, if we calculate each statement at confidence level $(1 - \frac{\alpha}{m}) \cdot 100\%$ instead of $(1 - \alpha) \cdot 100\%$), then the probability of any statement being false will not exceed $\alpha$.
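The arithmetic of the Bonferroni allocation can be sketched in a few lines of Python. The function names and the values $\alpha = 0.05$, $m = 3$ are illustrative assumptions; the sketch only computes the per-statement levels and the guaranteed lower bound on the joint coverage, not the intervals themselves.

```python
def bonferroni_levels(alpha, m):
    """Split an overall error budget alpha equally over m statements."""
    return [alpha / m] * m

def joint_coverage_lower_bound(alphas):
    """Bonferroni bound: P(all C_i true) >= 1 - (alpha_1 + ... + alpha_m)."""
    return 1.0 - sum(alphas)

# Hypothetical setting: m = 3 statements at overall level alpha = 0.05.
alphas = bonferroni_levels(0.05, 3)
bound = joint_coverage_lower_bound(alphas)
print(alphas[0], bound)
```

Each individual interval of the form (4.5) would then be computed with $\alpha$ replaced by $\alpha/m$, i.e. with the quantile $t_{1-\alpha/(2m),n-1}$. An unequal split of $\alpha$ across the statements is also allowed by the inequality, as long as the $\alpha_i$ sum to $\alpha$.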
Example 4.7. Microwave Ovens (based on JW Example 5.4, p. 226).
4.3 Comparison of two or more mean vectors
Finally, let us note that comparing the mean vectors of two or more different multivariate populations, when there are independent observations from each of the populations, is an important and practically relevant problem. For the purposes of this section, suppose that we observe two samples, $X_1, X_2, \dots, X_{n_X} \in R^p$ and $Y_1, Y_2, \dots, Y_{n_Y} \in R^p$, with means $\mu_X \in R^p$ and $\mu_Y \in R^p$, respectively, and covariance matrices $\Sigma_X \in M_{p,p}$ and $\Sigma_Y \in M_{p,p}$, respectively. Typically, we wish to test $H_0 : \mu_X - \mu_Y = \delta_0$.
Multivariate ANOVA for comparing more than two populations is discussed in Lecture 8.
4.3.1 Reducing to a single population
As with the univariate t-test, under some scenarios the test of a difference between two populations in fact reduces to a one-sample test. For example, if the samples are paired and $n_X = n_Y = n$, we may proceed analogously to the paired t-test: we take $D_i = X_i - Y_i$ for $i = 1, \dots, n$ and proceed as if with a one-sample $T^2$ test:
$$T^2 = n(\bar{D} - \delta_0)^\top S_D^{-1}(\bar{D} - \delta_0) \sim \frac{(n - 1)p}{n - p}F_{p,n-p}, \tag{4.8}$$
where $\bar{D} \in R^p$ and $S_D \in M_{p,p}$ are the sample mean and covariance matrix of $D_1, \dots, D_n$, respectively, assuming the $D_i$ are normally distributed. (It is important to note that any diagnostics for this test should be performed on the differences, not on the original values.)
We can also formulate this in a "multivariate" form: let the contrast matrix $C \in M_{p,p+p}$ be
$$C = \begin{pmatrix} +1 & & & -1 & & \\ & \ddots & & & \ddots & \\ & & +1 & & & -1 \end{pmatrix}.$$
Then, we can express $D_i = C\begin{pmatrix} X_i \\ Y_i \end{pmatrix}$ and the test as $H_0 : C\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} = \delta_0$. It is easy to show that the test statistic reduces to (4.8).
$C$ can have more complex forms. For example, in a repeated measures design, we may measure the results of a series of $p$ treatment outcomes on each sampling unit. If we then collect each individual $i$'s measurements into a vector $X_i$, we may test whether all outcomes are the same in expectation by forming
$$C = \begin{pmatrix} 1 & -1 & & \\ \vdots & & \ddots & \\ 1 & & & -1 \end{pmatrix} \in M_{p-1,p}$$
and testing $H_0 : C\mu_X = 0_{p-1}$. It is easy to show that $C\mu_X = 0_{p-1}$ holds if and only if all elements of $\mu_X$ are equal.
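The repeated-measures contrast can be illustrated with a short Python sketch. It builds a $(p - 1) \times p$ matrix that compares the first outcome with each of the others, which is one reading of the matrix above, and applies it to two mean vectors; the function names and the numerical vectors are hypothetical.

```python
def first_vs_rest_contrast(p):
    """(p-1) x p matrix with a column of ones followed by minus the identity."""
    return [[1 if j == 0 else (-1 if j == i + 1 else 0) for j in range(p)]
            for i in range(p - 1)]

def apply(C, mu):
    return [sum(c * m for c, m in zip(row, mu)) for row in C]

C = first_vs_rest_contrast(4)
equal_means = apply(C, [3, 3, 3, 3])      # all components equal
unequal_means = apply(C, [1, 2, 4, 4])    # components differ
print(equal_means, unequal_means)
```

Any other full-rank $(p - 1) \times p$ matrix whose rows sum to zero (for instance, successive differences) defines the same null hypothesis, so the choice of $C$ is a matter of convenience.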
4.3.2 The two-sample $T^2$-test
We now turn to the scenario where $X$ and $Y$ are, in fact, independent samples. As with the univariate test, we must decide whether we are prepared to assume that $\Sigma_X = \Sigma_Y = \Sigma$ in the population and therefore use the pooled test. If so (and necessarily if the sample sizes are small), we evaluate
$$S_{pooled} = \frac{(n_X - 1)S_X + (n_Y - 1)S_Y}{n_X + n_Y - 2}.$$
Since $S_{pooled}$ estimates $\Sigma$,
$$\operatorname{Var}(\bar{X} - \bar{Y}) = \frac{\Sigma}{n_X} + \frac{\Sigma}{n_Y} \approx \frac{S_{pooled}}{n_X} + \frac{S_{pooled}}{n_Y} = S_{pooled}\left(\frac{1}{n_X} + \frac{1}{n_Y}\right).$$
And, since $\bar{X} - \bar{Y} \sim N_p(\mu_X - \mu_Y, \Sigma(n_X^{-1} + n_Y^{-1}))$, we write
$$T^2 = (\bar{X} - \bar{Y} - \delta_0)^\top\left\{S_{pooled}\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)\right\}^{-1}(\bar{X} - \bar{Y} - \delta_0) \sim \frac{(n_X + n_Y - 2)p}{n_X + n_Y - p - 1}F_{p,n_X+n_Y-p-1}. \tag{4.9}$$
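The pooled statistic (4.9) can be sketched in plain Python for two bivariate samples. As before, $p = 2$ is assumed so that the $2 \times 2$ inverse can be written out explicitly; the function names and both samples are hypothetical.

```python
def mean2(obs):
    n = len(obs)
    return [sum(x[k] for x in obs) / n for k in range(2)]

def cov2(obs):
    """Sample covariance matrix (divisor n - 1) for bivariate data."""
    n, m = len(obs), mean2(obs)
    return [[sum((x[r] - m[r]) * (x[c] - m[c]) for x in obs) / (n - 1)
             for c in range(2)] for r in range(2)]

def pooled_t2(xs, ys, delta0=(0.0, 0.0)):
    """Two-sample Hotelling T^2 with pooled covariance, for p = 2."""
    nx, ny = len(xs), len(ys)
    sx, sy = cov2(xs), cov2(ys)
    k = 1.0 / nx + 1.0 / ny
    # v = S_pooled * (1/n_X + 1/n_Y)
    v = [[k * ((nx - 1) * sx[r][c] + (ny - 1) * sy[r][c]) / (nx + ny - 2)
          for c in range(2)] for r in range(2)]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    mx, my = mean2(xs), mean2(ys)
    d = [mx[0] - my[0] - delta0[0], mx[1] - my[1] - delta0[1]]
    # Quadratic form d^T v^{-1} d written out for a 2 x 2 matrix.
    return (v[1][1] * d[0] ** 2 - 2 * v[0][1] * d[0] * d[1]
            + v[0][0] * d[1] ** 2) / det

# Hypothetical samples with n_X = n_Y = 5.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [(2.0, 4.0), (3.0, 3.0), (4.0, 6.0), (5.0, 5.0), (6.0, 7.0)]
nx, ny, p = len(xs), len(ys), 2

t2 = pooled_t2(xs, ys)
# Rescaled statistic, to be compared with an F quantile with (p, n_X + n_Y - p - 1) df.
f_stat = (nx + ny - p - 1) / ((nx + ny - 2) * p) * t2
print(t2, f_stat)
```

The critical value itself again has to come from F tables or statistical software.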
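We would thus reject $H_0$ if $T^2$ exceeds the critical value $\frac{(n_X + n_Y - 2)p}{n_X + n_Y - p - 1}F_{1-\alpha,p,n_X+n_Y-p-1}$ implied by (4.9), construct a confidence region based on
$$\left\{\delta \;\middle|\; (\bar{x} - \bar{y} - \delta)^\top\left\{S_{pooled}\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)\right\}^{-1}(\bar{x} - \bar{y} - \delta) \le \frac{(n_X + n_Y - 2)p}{n_X + n_Y - p - 1}F_{1-\alpha,p,n_X+n_Y-p-1}\right\}$$
and simultaneous contrast confidence intervals
$$l^\top(\bar{x} - \bar{y}) \pm \sqrt{\frac{(n_X + n_Y - 2)p}{n_X + n_Y - p - 1}F_{1-\alpha,p,n_X+n_Y-p-1}\ l^\top S_{pooled}\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)l}.$$
If we are not prepared to make the pooling assumption, our test statistic is instead
$$T^2 = (\bar{X} - \bar{Y} - \delta_0)^\top\left(\frac{S_X}{n_X} + \frac{S_Y}{n_Y}\right)^{-1}(\bar{X} - \bar{Y} - \delta_0).$$
Even for modest sample sizes, under multivariate normality, the distribution of this $T^2$ is reasonably well approximated by $\frac{\nu p}{\nu - p + 1}F_{p,\nu-p+1}$, where (writing $n_1 = n_X$, $n_2 = n_Y$, $S_1 = S_X$, $S_2 = S_Y$)
$$\nu = \frac{p + p^2}{\sum_{i=1}^2 \frac{1}{n_i}\left(\operatorname{tr}\left[\left\{\frac{1}{n_i}S_i\left(\frac{1}{n_1}S_1 + \frac{1}{n_2}S_2\right)^{-1}\right\}^2\right] + \left[\operatorname{tr}\left\{\frac{1}{n_i}S_i\left(\frac{1}{n_1}S_1 + \frac{1}{n_2}S_2\right)^{-1}\right\}\right]^2\right)}.$$
The confidence regions are then given by
$$\left\{\delta \;\middle|\; (\bar{x} - \bar{y} - \delta)^\top\left(\frac{S_X}{n_X} + \frac{S_Y}{n_Y}\right)^{-1}(\bar{x} - \bar{y} - \delta) \le \frac{\nu p}{\nu - p + 1}F_{1-\alpha,p,\nu-p+1}\right\}$$
and the simultaneous contrast confidence intervals by
$$l^\top(\bar{x} - \bar{y}) \pm \sqrt{\frac{\nu p}{\nu - p + 1}F_{1-\alpha,p,\nu-p+1}\ l^\top\left(\frac{S_X}{n_X} + \frac{S_Y}{n_Y}\right)l}.$$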
4.4 Software
R: car::confidenceEllipse, package Hotelling, rrcov::T2.test, ergm::approx.hotelling.diff.test,
MVTests::TwoSamplesHT2
SAS: See IML implementations.
4.5 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 5.1–5.5 and 6.
4.6 Exercises
Exercise 4.1
Suppose X1,X2, . . . ,Xn are independent Np(µ,Σ) random vectors with sample mean vector
X¯ and sample covariance matrix S. We wish to test the hypothesis
H0 : µ2 − µ1 = µ3 − µ2 = · · · = µp − µp−1 = 1
where µ1, µ2, . . . , µp are the elements of µ.
(a) Determine a (p − 1) × p matrix C so that H0 may be written equivalently as H0 : Cµ = 1
where 1 is a (p− 1)× 1 vector of ones.
(b) Make an appropriate transformation of the vectors Xi, i = 1, 2, . . . , n and hence find the
rejection region of a size α test of H0 in terms of X¯, S, and C.
Exercise 4.2
A sample of 50 vector observations, each containing three components, is drawn from a normal
distribution having covariance matrix
$$\Sigma = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 4 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$
The components of the sample mean are 0.8, 1.1 and 0.6. Can you reject the null hypothesis of
zero distribution mean against a general alternative?
Exercise 4.3
Evaluate Hotelling's statistic $T^2$ for testing $H_0 : \mu = \begin{pmatrix} 7 \\ 11 \end{pmatrix}$ using the data matrix $X = \begin{pmatrix} 2 & 8 & 6 & 8 \\ 12 & 9 & 9 & 10 \end{pmatrix}$. Test the hypothesis $H_0$ at level $\alpha = 0.05$. What conclusion is reached?
Exercise 4.4
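Let $X_1, \dots, X_{n_1}$ be i.i.d. $N_p(\mu_1, \Sigma)$, independently of $Y_1, \dots, Y_{n_2}$ i.i.d. $N_p(\mu_2, \Sigma)$, with $\Sigma$ known. Prove that $\bar{X} \sim N_p(\mu_1, \frac{1}{n_1}\Sigma)$ and $\bar{Y} \sim N_p(\mu_2, \frac{1}{n_2}\Sigma)$. Hence $W = \bar{X} - \bar{Y} \sim N_p\left(\mu_1 - \mu_2, \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma\right)$, so that $\bar{X} - \bar{Y} - (\mu_1 - \mu_2) \sim N_p\left(0, \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma\right)$. Construct a test of $H_0 : \mu_1 = \mu_2$.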
Exercise 4.5
Let $\bar{X}$ and $S$ be based on $n$ observations from $N_p(\mu, \Sigma)$ and let $X$ be an additional observation from $N_p(\mu, \Sigma)$. Show that $X - \bar{X} \sim N_p(0, (1 + \frac{1}{n})\Sigma)$. Find the distribution of $\frac{n}{n+1}(X - \bar{X})^\top S^{-1}(X - \bar{X})$ and suggest how to use this result to give a $(1 - \alpha)$ prediction region for $X$ based on $\bar{X}$ and $S$ (i.e., a region in $R^p$ such that one has a given confidence $(1 - \alpha)$ that the next observation will fall into it).
UNSW MATH5855 2021T3 Lecture 5 Correlations
5 Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of transformed data
5.2.2 Interpretation of $R$
5.2.3 Remark about the calculation of $R^2$
5.2.4 Examples
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
5.4 Additional resources
5.5 Exercises
First of all, we would like to make some general comments on similarities and differences
between correlations and dependencies.
Very often we are interested in correlations (dependencies) between a number of random
variables and are trying to describe the “strength” of the (mutual) dependencies. For example,
we would like to know if there is a correlation (mutual non-directed dependence) between the
length of the arm and of the leg. But if we would like to obtain information about (or to predict)
the length of the arm by measuring the length of the leg, we are dealing with the dependence of the
arm’s length on the leg’s length. Both problems described in this example make sense.
On the other hand, there are situations in which only one of the problems
is interesting or makes sense. Studying the dependence of crops on rain makes
perfect sense, but there is no sense at all in studying the (directed) influence of crops on rain.
In a nutshell, we can say that when studying mutual (linear) dependence, we are dealing
with correlation theory, whereas when studying the directed influence of one (input) variable on
another (output) variable, we are dealing with regression theory. It should be clearly pointed
out, though, that correlation alone, no matter how strong, cannot help us identify the direction of
influence and so cannot by itself guide regression modelling. Our reasoning about the direction of
influence has to come from outside statistical theory, from another theory.
Another important point to always bear in mind is that, as already discussed in Lecture 2,
uncorrelated does not necessarily mean independent when the data are not
multivariate normal. For multivariate normal data, however, the notions of “uncorrelated”
and “independent” coincide.
In general, there are three types of correlation coefficients:
• The usual correlation coefficient between two variables.
• The partial correlation coefficient between two variables after adjusting for the effect (regression,
association) of a set of other variables.
• The multiple correlation coefficient between a single random variable and a set of p other variables.
5.1 Partial correlation
For $X \sim N_p(\mu,\Sigma)$ we defined the correlation coefficient $\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}$, $i, j = 1, 2, \ldots, p$, and
discussed the MLE $\hat\rho_{ij}$ in (3.6). It turned out that they coincide with the sample correlations $r_{ij}$
we introduced in the first lecture (formula (1.3)).
To define partial correlation coefficients, recall Property 4 of the multivariate normal
distribution from Section 2.2:
If the vector $X \in \mathbb{R}^p$ is divided into $X = \begin{pmatrix} X_{(1)} \\ X_{(2)} \end{pmatrix}$, $X_{(1)} \in \mathbb{R}^r$, $r < p$, $X_{(2)} \in \mathbb{R}^{p-r}$, and
according to this subdivision the mean vector is $\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}$ and the covariance
matrix $\Sigma$ has been subdivided into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, and $\Sigma_{22}$ has full rank, then
the conditional density of $X_{(1)}$ given that $X_{(2)} = x_{(2)}$ is
$$N_r\left(\mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x_{(2)} - \mu_{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
We define the partial correlations of $X_{(1)}$ given $X_{(2)} = x_{(2)}$ as the usual correlation
coefficients calculated from the elements $\sigma_{ij.(r+1),(r+2),\ldots,p}$ of the matrix $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$,
i.e.
$$\rho_{ij.(r+1),(r+2),\ldots,p} = \frac{\sigma_{ij.(r+1),(r+2),\ldots,p}}{\sqrt{\sigma_{ii.(r+1),(r+2),\ldots,p}}\,\sqrt{\sigma_{jj.(r+1),(r+2),\ldots,p}}}. \tag{5.1}$$
We call $\rho_{ij.(r+1),(r+2),\ldots,p}$ the correlation of the $i$th and $j$th component when the components
$(r+1)$, $(r+2)$, etc. up to the $p$th (i.e. the last $p-r$ components) have been held fixed. The
interpretation is that we are looking for the association (correlation) between the $i$th and $j$th
component after eliminating the effect that the last $p-r$ components might have had on this
association.
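To find ML estimates for these, we use the transformation invariance property of the MLE
to claim that if $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & \hat\Sigma_{22} \end{pmatrix}$ is the usual MLE of the covariance matrix then
$\hat\Sigma_{1|2} = \hat\Sigma_{11} - \hat\Sigma_{12}\hat\Sigma_{22}^{-1}\hat\Sigma_{21}$ with elements $\hat\sigma_{ij.(r+1),(r+2),\ldots,p}$, $i, j = 1, 2, \ldots, r$, is the MLE of $\Sigma_{1|2}$ and
correspondingly,
$$\hat\rho_{ij.(r+1),(r+2),\ldots,p} = \frac{\hat\sigma_{ij.(r+1),(r+2),\ldots,p}}{\sqrt{\hat\sigma_{ii.(r+1),(r+2),\ldots,p}}\,\sqrt{\hat\sigma_{jj.(r+1),(r+2),\ldots,p}}}, \quad i, j = 1, 2, \ldots, r,$$
will be the ML estimators of $\rho_{ij.(r+1),(r+2),\ldots,p}$, $i, j = 1, 2, \ldots, r$.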
5.1.1 Simple formulae
For situations when p is not large, as a special case of the above general result, simple plug-in
formulae can be derived that express the partial correlation coefficients in terms of the usual
correlation coefficients. The formulae are given below:
i) partial correlation between the first and second variables, adjusting for the effect of the third:
$$\rho_{12.3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1-\rho_{13}^2)(1-\rho_{23}^2)}}.$$
ii) partial correlation between the first and second variables, adjusting for the effects of the third and
fourth variables:
$$\rho_{12.3,4} = \frac{\rho_{12.4} - \rho_{13.4}\rho_{23.4}}{\sqrt{(1-\rho_{13.4}^2)(1-\rho_{23.4}^2)}}.$$
For higher-dimensional cases, computers need to be utilised.
5.1.2 Software
SAS: PROC CORR
R: ggm::pcor, ggm::parcor
5.1.3 Examples
Example 5.1. Three variables have been measured for a set of schoolchildren:
i) X1: Intelligence
ii) X2: Weight
iii) X3: Age
The number of observations was large enough so that one can assume the empirical correlation
matrix $\hat\rho \in M_{3,3}$ to be the true correlation matrix:
$$\hat\rho = \begin{pmatrix} 1 & 0.6162 & 0.8267 \\ 0.6162 & 1 & 0.7321 \\ 0.8267 & 0.7321 & 1 \end{pmatrix}.$$
This suggests there is a high degree of positive dependence between weight and intelligence. But (do
the calculation (!)) $\hat\rho_{12.3} = 0.0286$ so that, after the effect of age is adjusted for, there is
virtually no correlation between weight and intelligence, i.e. weight obviously plays little part in
explaining intelligence.
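The calculation requested in Example 5.1 can be checked numerically. The following is a minimal sketch in plain Python (the notes themselves point to SAS and R for this); the function name `partial_corr` is a hypothetical helper that applies formula i) recursively, in the spirit of formula ii), by removing one conditioning variable at a time.

```python
import math

def partial_corr(R, i, j, given):
    """Partial correlation of variables i and j given the variables in `given`.

    R is a correlation matrix (list of lists, 0-based indices). The recursion
    peels off the last conditioning variable k, as in formulae i) and ii).
    """
    if not given:
        return R[i][j]
    k, rest = given[-1], given[:-1]
    r_ij = partial_corr(R, i, j, rest)
    r_ik = partial_corr(R, i, k, rest)
    r_jk = partial_corr(R, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Correlation matrix of Example 5.1: intelligence, weight, age.
R = [[1.0, 0.6162, 0.8267],
     [0.6162, 1.0, 0.7321],
     [0.8267, 0.7321, 1.0]]

# Intelligence vs weight, adjusting for age.
rho_12_3 = partial_corr(R, 0, 1, [2])
print(round(rho_12_3, 4))
```

The recursion is convenient for a handful of conditioning variables; for larger sets the matrix route via $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ is the more natural choice.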
5.2 Multiple correlation
Recall our discussion at the end of Section 2.2 of the best prediction in the mean-square sense in
the case of multivariate normality: if we want to predict a random variable $Y$ that is correlated
with $p$ random variables (predictors) $X = (X_1\ X_2\ \cdots\ X_p)^\top$ by trying to minimise the
expected value $E[\{Y - g(X)\}^2 \mid X = x]$, the optimal solution (i.e. the regression function) was
$g^*(X) = E(Y \mid X)$. When the joint $(p+1)$-dimensional distribution of $Y$ and $X$ is normal this
function was linear in $X$. Given a specific realisation $x$ of $X$ it was given by $b + \sigma_0^\top C^{-1}x$
where $b = E(Y) - \sigma_0^\top C^{-1}E(X)$, $C$ is the covariance matrix of the vector $X$, and $\sigma_0$ is the vector of
covariances of $Y$ with $X_i$, $i = 1, \ldots, p$. The vector $C^{-1}\sigma_0 \in \mathbb{R}^p$ was the vector of the regression
coefficients.
Now, let us define the multiple correlation coefficient between the random variable Y and the
random vector X ∈ Rp to be the maximum correlation between Y and any linear combination
α⊤X, α ∈ Rp. This makes sense: to look at the maximal correlation that we can get by trying
to predict Y as a linear function of the predictors. The solution to this which also gives us an
algorithm to calculate (and estimate) the multiple correlation coefficient is given in the next
lemma.
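5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of transformed data
Lemma 5.2. The multiple correlation coefficient is the ordinary correlation coefficient between
$Y$ and $\sigma_0^\top C^{-1}X \equiv \beta^{*\top}X$. (I.e., $\beta^* \equiv C^{-1}\sigma_0$.)
Proof. Note that for any $\alpha \in \mathbb{R}^p$: $\operatorname{Cov}(Y, \alpha^\top X) = \alpha^\top\sigma_0 = \alpha^\top C\beta^*$ and, in particular,
$\operatorname{Cov}(Y, \beta^{*\top}X) = \beta^{*\top}C\beta^*$ holds.
Using the Cauchy–Bunyakovsky–Schwarz inequality we have:
$$[\operatorname{Cov}(\alpha^\top X, \beta^{*\top}X)]^2 \le \operatorname{Var}(\alpha^\top X)\operatorname{Var}(\beta^{*\top}X), \quad\text{i.e.}\quad (\alpha^\top C\beta^*)^2 \le (\alpha^\top C\alpha)(\beta^{*\top}C\beta^*),$$
and therefore:
$$\sigma_Y^2\,\rho^2(Y, \alpha^\top X) = \frac{(\alpha^\top\sigma_0)^2}{\alpha^\top C\alpha} = \frac{(\alpha^\top C\beta^*)^2}{\alpha^\top C\alpha} \le \beta^{*\top}C\beta^*$$
holds, $\sigma_Y^2$ denoting the variance of $Y$. In this last inequality we can get the equality sign by choosing
$\alpha = \beta^*$, i.e. the squared correlation coefficient $\rho^2(Y, \alpha^\top X)$ of $Y$ and $\alpha^\top X$ is maximised over
$\alpha$ when $\alpha = \beta^*$.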
Coefficient of Determination
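From Lemma 5.2 we see that the maximum correlation between $Y$ and any linear combination
$\alpha^\top X$, $\alpha \in \mathbb{R}^p$, is $R = \sqrt{\frac{\beta^{*\top}C\beta^*}{\sigma_Y^2}}$. This is the multiple correlation coefficient. Its square
$R^2$ is called the coefficient of determination. Having in mind that $\beta^* = C^{-1}\sigma_0$ we see that
$R = \sqrt{\frac{\sigma_0^\top C^{-1}\sigma_0}{\sigma_Y^2}}$. If
$$\Sigma = \begin{pmatrix} \sigma_Y^2 & \sigma_0^\top \\ \sigma_0 & C \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
is the partitioned covariance matrix of the $(p+1)$-dimensional vector $(Y, X)^\top$ then we know how to
calculate the MLE of $\Sigma$ by $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & \hat\Sigma_{22} \end{pmatrix}$,
so the MLE of $R$ would be $\hat R = \sqrt{\frac{\hat\Sigma_{12}\hat\Sigma_{22}^{-1}\hat\Sigma_{21}}{\hat\Sigma_{11}}}$.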
5.2.2 Interpretation of R
At the end of Section 2.2 we derived the minimal value of the mean squared error when trying to
predict $Y$ by a linear function of the vector $X$. It is achieved when using the regression function
and the value itself was $\sigma_Y^2 - \sigma_0^\top C^{-1}\sigma_0$. The latter value can also be expressed by using the
value of $R$. It is equal to $\sigma_Y^2(1 - R^2)$. Thus, our conclusion is that when $R^2 = 0$ there is no
predictive power at all. In the opposite extreme case, if $R^2 = 1$, it turns out that $Y$ can be
predicted without any error at all (it is a true linear function of $X$).
5.2.3 Remark about the calculation of R2
Sometimes only the correlation matrix may be available. It can be shown that in that case the
relation
$$1 - R^2 = \frac{1}{\rho^{YY}} \tag{5.2}$$
holds. In (5.2), $\rho^{YY} \equiv (\rho^{-1})_{11}$ is the upper left-hand corner of the inverse of the correlation
matrix $\rho \in M_{p+1,p+1}$ determined from $\Sigma$. We note that the relation $\rho = V^{-\frac{1}{2}}\Sigma V^{-\frac{1}{2}}$ holds with
$$V = \begin{pmatrix} \sigma_Y^2 & 0 & \cdots & 0 \\ 0 & c_{11} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_{pp} \end{pmatrix}.$$
One can use (5.2) to calculate $R^2$ by first calculating the right-hand side in (5.2). To show
Equality (5.2) we note that
$$1 - R^2 = \frac{\sigma_Y^2 - \sigma_0^\top C^{-1}\sigma_0}{\sigma_Y^2} = \frac{|C|\,(\sigma_Y^2 - \sigma_0^\top C^{-1}\sigma_0)}{|C|\,\sigma_Y^2} = \frac{|\Sigma|}{|C|\,\sigma_Y^2},$$
with the last equality in the numerator holding because of (4.3). But $\frac{|C|}{|\Sigma|} = \sigma^{YY} \equiv (\Sigma^{-1})_{11}$, the
entry in the first row and column of $\Sigma^{-1}$. (Recall from Section 0.1.2: $(X^{-1})_{ji} = \frac{|X_{ij}|}{|X|}(-1)^{i+j}$.)
Since $\rho^{-1} = V^{\frac{1}{2}}\Sigma^{-1}V^{\frac{1}{2}}$, we see that $\rho^{YY} = \sigma^{YY}\sigma_Y^2$ holds. Therefore $1 - R^2 = \frac{1}{\rho^{YY}}$.
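Relation (5.2) is easy to sanity-check on a small case. The sketch below, in plain Python, uses an illustrative 3 × 3 covariance matrix of (Y, X1, X2) (the same numbers as in Example 5.3 of Section 5.2.4) and compares the route through the inverse correlation matrix with the determinant ratio from the derivation; the helper `det3` is a hypothetical name introduced only for this illustration.

```python
import math

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Illustrative covariance matrix of (Y, X1, X2).
Sigma = [[10.0, 1.0, -1.0],
         [1.0, 7.0, 3.0],
         [-1.0, 3.0, 2.0]]

# Correlation matrix rho = V^{-1/2} Sigma V^{-1/2}.
sd = [math.sqrt(Sigma[i][i]) for i in range(3)]
rho = [[Sigma[i][j] / (sd[i] * sd[j]) for j in range(3)] for i in range(3)]

# rho^{YY} = (rho^{-1})_{11} = (minor of entry (1,1)) / det(rho).
minor_11 = rho[1][1] * rho[2][2] - rho[1][2] * rho[2][1]
rho_YY = minor_11 / det3(rho)
one_minus_R2 = 1.0 / rho_YY

# Determinant route from the derivation: |Sigma| / (|C| * sigma_Y^2).
det_C = Sigma[1][1] * Sigma[2][2] - Sigma[1][2] * Sigma[2][1]
direct = det3(Sigma) / (det_C * Sigma[0][0])

print(one_minus_R2, direct)
```

Both numbers should agree up to rounding, which is all the check is meant to show; it is not a substitute for the proof.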
5.2.4 Examples
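Example 5.3. Let $\mu = \begin{pmatrix} \mu_Y \\ \mu_{X_1} \\ \mu_{X_2} \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \\ 0 \end{pmatrix}$ and
$$\Sigma = \begin{pmatrix} 10 & 1 & -1 \\ 1 & 7 & 3 \\ -1 & 3 & 2 \end{pmatrix} = \begin{pmatrix} \sigma_{YY} & \sigma_0^\top \\ \sigma_0 & \Sigma_{XX} \end{pmatrix}.$$
Calculate: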
(a) The best linear prediction of Y using X1 and X2.
(b) The multiple correlation coefficient R2Y.(X1,X2).
(c) The mean squared error of the best linear predictor.
Solution
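$$\beta^* = \Sigma_{XX}^{-1}\sigma_0 = \begin{pmatrix} 7 & 3 \\ 3 & 2 \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} .4 & -.6 \\ -.6 & 1.4 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$$
and
$$b = \mu_Y - \beta^{*\top}\mu_X = 5 - (1, -2)\begin{pmatrix} 2 \\ 0 \end{pmatrix} = 3.$$
Hence the best linear predictor is given by $3 + X_1 - 2X_2$. The value of the multiple correlation coefficient is:
$$R_{Y.(X_1,X_2)} = \sqrt{\frac{(1, -1)\begin{pmatrix} .4 & -.6 \\ -.6 & 1.4 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}}{10}} = \sqrt{\frac{3}{10}} = .548$$
The mean squared error of prediction is: $\sigma_Y^2(1 - R^2_{Y.(X_1,X_2)}) = 10\left(1 - \frac{3}{10}\right) = 7$.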
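The arithmetic of Example 5.3 can be retraced in a few lines. This is only a sketch in plain Python with a hand-written 2 × 2 inverse (`inv2` is a hypothetical helper); in practice one would use R or SAS as elsewhere in these notes.

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Quantities from Example 5.3.
mu_Y, mu_X = 5.0, [2.0, 0.0]
sigma_YY = 10.0
sigma_0 = [1.0, -1.0]
Sigma_XX = [[7.0, 3.0], [3.0, 2.0]]

# Regression coefficients beta* = Sigma_XX^{-1} sigma_0 and intercept b.
Sinv = inv2(Sigma_XX)
beta = [sum(Sinv[i][j] * sigma_0[j] for j in range(2)) for i in range(2)]
b = mu_Y - sum(beta[i] * mu_X[i] for i in range(2))

# Squared multiple correlation and mean squared error of prediction.
R2 = sum(sigma_0[i] * beta[i] for i in range(2)) / sigma_YY
R = math.sqrt(R2)
mse = sigma_YY * (1 - R2)

print(beta, b, round(R, 3), mse)
```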
Example 5.4. Relationship between multiple correlation and regression, and equivalent ways of
computing it.
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
When considering the distribution of a particular correlation coefficient $\hat\rho_{ij} = r_{ij}$ the problem
becomes bivariate because only the variables $X_i$ and $X_j$ are involved. Direct transformations with
the bivariate normal can be utilised to derive the exact distribution of $r_{ij}$ under the hypothesis
$H_0: \rho_{ij} = 0$. It turns out that in this case the statistic $T = r_{ij}\sqrt{\frac{n-2}{1-r_{ij}^2}} \sim t_{n-2}$ and tests can
be performed by using the t-distribution. For other hypothetical values the derivations are more
painful. There is one most frequently used approximation that holds no matter what the true
value of $\rho_{ij}$ is. We shall discuss it here. Consider Fisher’s Z transformation $Z = \frac{1}{2}\log\left[\frac{1+r_{ij}}{1-r_{ij}}\right]$.
Under the hypothesis $H_0: \rho_{ij} = \rho_0$ it holds:
$$Z \approx N\left(\frac{1}{2}\log\left[\frac{1+\rho_0}{1-\rho_0}\right],\ \frac{1}{n-3}\right)$$
In particular, in the most common situation, when one would like to test $H_0: \rho_{ij} = 0$ versus
$H_1: \rho_{ij} \ne 0$, one would reject $H_0$ at the 5% significance level if $|Z|\sqrt{n-3} \ge 1.96$.
Based on the above, suggest how to test the hypothesis of equality of two correlation
coefficients from two different populations (!).
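As an illustration of the two one-sample procedures in this subsection, here is a small sketch in plain Python. The values r = 0.5 and n = 28 are hypothetical, and the function names are introduced only for this example; the p-value helper uses the standard normal distribution via `math.erfc`, so it applies to the Fisher Z statistic, not to the exact t statistic.

```python
import math

def t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)); distributed as t_{n-2} under H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def fisher_z_statistic(r, n, rho0=0.0):
    """sqrt(n - 3) * (Z - z0), approximately N(0, 1) under H0: rho = rho0.

    math.atanh(r) equals 0.5 * log((1 + r) / (1 - r)), i.e. Fisher's Z.
    """
    return math.sqrt(n - 3) * (math.atanh(r) - math.atanh(rho0))

def two_sided_normal_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

r, n = 0.5, 28                      # hypothetical sample correlation and sample size
t = t_statistic(r, n)               # compare with t_{n-2} quantiles
z = fisher_z_statistic(r, n)        # compare with 1.96 at the 5% level
p_value = two_sided_normal_p(z)
reject_at_5_percent = abs(z) >= 1.96

print(t, z, p_value, reject_at_5_percent)
```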
5.3.2 Partial correlation coefficients
Coming over to testing partial correlations, not much has to be changed. Fisher’s Z approximation
can be used again in the following way: to test $H_0: \rho_{ij.r+1,r+2,\ldots,r+k} = \rho_0$ versus
$H_1: \rho_{ij.r+1,r+2,\ldots,r+k} \ne \rho_0$ (i.e., conditioning on $k$ variables) we construct
$Z = \frac{1}{2}\log\left[\frac{1+r_{ij.r+1,r+2,\ldots,r+k}}{1-r_{ij.r+1,r+2,\ldots,r+k}}\right]$
and $a = \frac{1}{2}\log\left[\frac{1+\rho_0}{1-\rho_0}\right]$. Asymptotically $Z \sim N\left(a, \frac{1}{n-k-3}\right)$ holds. Hence, the test statistic to be
compared with significance points of the standard normal is now $\sqrt{n-k-3}\,|Z - a|$. If $\rho_0 = 0$, the
t-test can be used, with “$n-2$” replaced by “$n-k-2$” in both the statistic and the degrees of
freedom.
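The adjustment for k conditioning variables changes only the scaling and the degrees of freedom. A minimal plain-Python sketch follows; the partial correlation 0.0286 is the value from Example 5.1, while the sample size n = 100 is a purely hypothetical choice, and both function names are illustrative.

```python
import math

def partial_fisher_z(r_partial, n, k, rho0=0.0):
    """sqrt(n - k - 3) * (Z - a) for a partial correlation given k variables."""
    return math.sqrt(n - k - 3) * (math.atanh(r_partial) - math.atanh(rho0))

def partial_t(r_partial, n, k):
    """t statistic and degrees of freedom for H0: partial correlation = 0.

    This is the ordinary correlation t statistic with n - 2 replaced by n - k - 2.
    """
    df = n - k - 2
    return r_partial * math.sqrt(df / (1 - r_partial ** 2)), df

z = partial_fisher_z(0.0286, n=100, k=1)
t, df = partial_t(0.0286, n=100, k=1)
reject_at_5_percent = abs(z) >= 1.96

print(z, t, df, reject_at_5_percent)
```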
5.3.3 Multiple correlation coefficients
It turns out that under the hypothesis $H_0: R = 0$ the statistic $F = \frac{\hat R^2}{1-\hat R^2} \times \frac{n-p}{p-1} \sim F_{p-1,n-p}$.
Hence, when testing significance of the multiple correlation, the rejection region would be
$\left\{\frac{\hat R^2}{1-\hat R^2} \times \frac{n-p}{p-1} > F_{1-\alpha,p-1,n-p}\right\}$ for a given significance level $\alpha$.
It should be stressed that the value of p in Section 5.3.3 refers to the total number of all
variables (the output Y and all of the input variables in the input vector X). This is different
from the value of p that was used in Section 5.2. In other words, the p in Section 5.3.3 is the
p+ 1 in Section 5.2.
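Because the meaning of p is easy to get wrong here, the following plain-Python sketch makes it explicit by naming the argument `p_total` (a hypothetical name: it counts Y together with all predictors). The inputs are invented for illustration. The Python standard library has no F quantile function, so the comparison with $F_{1-\alpha,p-1,n-p}$ is left to tables or statistical software.

```python
def multiple_corr_F(R2_hat, n, p_total):
    """F statistic for H0: R = 0, with its numerator and denominator df.

    p_total is the total number of variables (the response plus all predictors),
    matching the convention of Section 5.3.3.
    """
    df1, df2 = p_total - 1, n - p_total
    F = (R2_hat / (1 - R2_hat)) * (df2 / df1)
    return F, df1, df2

# Hypothetical inputs: R^2 = 0.3 estimated from n = 30 observations
# on one response and two predictors (so p_total = 3).
F, df1, df2 = multiple_corr_F(0.3, n=30, p_total=3)
print(F, df1, df2)
```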
5.3.4 Software
SAS: PROC CORR
R: ggm::pcor.test
5.3.5 Examples
Example 5.5. Testing ordinary correlations: age, height, and intelligence.
Example 5.6. Testing partial correlations: age, height, and intelligence.
5.4 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 7.8.
5.5 Exercises
Exercise 5.1
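Suppose $X \sim N_4(\mu,\Sigma)$ where $\mu = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 3 & 1 & 0 & 1 \\ 1 & 4 & 0 & 0 \\ 0 & 0 & 1 & 4 \\ 1 & 0 & 4 & 20 \end{pmatrix}$. Determine:
(a) the distribution of $\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_1 + X_2 + X_4 \end{pmatrix}$;
(b) the conditional mean and variance of $X_1$ given $x_2$, $x_3$, and $x_4$;
(c) the partial correlation coefficients $\rho_{12.3}$, $\rho_{12.4}$;
(d) the multiple correlation between $X_1$ and $(X_2, X_3, X_4)$. Compare it to $\rho_{12}$ and comment.
(e) Justify that $\begin{pmatrix} X_2 \\ X_3 \\ X_4 \end{pmatrix}$ is independent of $X_1 - (1\ 0\ 1)\begin{pmatrix} 4 & 0 & 0 \\ 0 & 1 & 4 \\ 0 & 4 & 20 \end{pmatrix}^{-1}\begin{pmatrix} X_2 \\ X_3 \\ X_4 \end{pmatrix}$.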
Exercise 5.2
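A random vector $X \sim N_3(\mu,\Sigma)$ with $\mu = \begin{pmatrix} 2 \\ -3 \\ 1 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{pmatrix}$.
(a) Find the distribution of $3X_1 - 2X_2 + X_3$.
(b) Find a vector $a \in \mathbb{R}^2$ such that $X_2$ and $X_2 - a^\top\begin{pmatrix} X_1 \\ X_3 \end{pmatrix}$ are independent.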
UNSW MATH5855 2021T3 Lecture 6 Principal Components Analysis
6 Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
6.1 Introduction
Principal components analysis is applied mainly as a variable reduction procedure. It is
usually applied in cases when data is obtained from a possibly large number of variables which
are possibly highly correlated. The goal is to try to “condense” the information. This is done
by summarising the data in a (small) number of transformations of the original variables. Our
motivation to do that is that we believe there is some redundancy in the presentation of the
information by the original set of variables since e.g. many of these variables are measuring the
same construct. In that case we try to reduce the observed variables into a smaller number of
principal components (artificial variables) that would account for most of the variability in
the observed variables. For simplicity, these artificial new variables are presented as linear
combinations of the (optimally weighted) observed variables. If one linear combination is
not enough, we can choose to construct two, three, etc. such combinations. Note also that
principal components analysis may be just an intermediate step in much larger investigations.
The principal components obtained can be used for example as inputs in a regression analysis
or in a cluster analysis procedure. They are also a basic method in extracting factors in factor
analysis.
6.2 Precise mathematical formulation
Let $X \sim N_p(\mu,\Sigma)$ where $p$ is assumed to be relatively large. To perform a reduction, we are
looking for a linear combination $\alpha_1^\top X$ with $\alpha_1 \in \mathbb{R}^p$ suitably chosen such that it maximises the
variance of $\alpha_1^\top X$ subject to the reasonable normalising constraint $\|\alpha_1\|^2 = \alpha_1^\top\alpha_1 = 1$. Since
$\operatorname{Var}(\alpha_1^\top X) = \alpha_1^\top\Sigma\alpha_1$ we need to choose $\alpha_1$ to maximise $\alpha_1^\top\Sigma\alpha_1$ subject to $\alpha_1^\top\alpha_1 = 1$.
This requires us to apply Lagrange’s procedure for optimisation under constraints:
1. construct the Lagrangian function
$$\operatorname{Lag}(\alpha_1, \lambda) = \alpha_1^\top\Sigma\alpha_1 + \lambda(1 - \alpha_1^\top\alpha_1)$$
where $\lambda \in \mathbb{R}^1$ is the Lagrange multiplier;
2. take the partial derivative with respect to $\alpha_1$ and equate it to zero:
$$2\Sigma\alpha_1 - 2\lambda\alpha_1 = 0 \implies (\Sigma - \lambda I_p)\alpha_1 = 0. \tag{6.1}$$
From (6.1), we see that $\alpha_1$ must be an eigenvector of $\Sigma$ and since we know from Example 0.2
what the maximal value of $\frac{\alpha^\top\Sigma\alpha}{\alpha^\top\alpha}$ is, we conclude that $\alpha_1$ should be the eigenvector that
corresponds to the largest eigenvalue $\bar\lambda_1$ of $\Sigma$. The random variable $\alpha_1^\top X$ is called the
first principal component.
For the second principal component $\alpha_2^\top X$ we want it to be normalised according to $\alpha_2^\top\alpha_2 = 1$,
uncorrelated with the first component and to give maximal variance of a linear combination
of the components of $X$ under these constraints. To find it, we construct the Lagrange function:
$$\operatorname{Lag}_1(\alpha_2, \lambda_1, \lambda_2) = \alpha_2^\top\Sigma\alpha_2 + \lambda_1(1 - \alpha_2^\top\alpha_2) + \lambda_2\alpha_1^\top\Sigma\alpha_2$$
Its partial derivative w.r.t. $\alpha_2$ gives
$$2\Sigma\alpha_2 - 2\lambda_1\alpha_2 + \lambda_2\Sigma\alpha_1 = 0 \tag{6.2}$$
Multiplying (6.2) by $\alpha_1^\top$ from the left and using the two constraints $\alpha_2^\top\alpha_2 = 1$ and $\alpha_2^\top\Sigma\alpha_1 = 0$
gives:
$$-2\lambda_1\alpha_1^\top\alpha_2 + \lambda_2\alpha_1^\top\Sigma\alpha_1 = 0 \implies \lambda_2 = 0$$
(WHY? Have in mind that $\alpha_1$ was an eigenvector of $\Sigma$.) But then (6.2) also implies that
$\alpha_2 \in \mathbb{R}^p$ must be an eigenvector of $\Sigma$ (has to satisfy $(\Sigma - \lambda_1 I_p)\alpha_2 = 0$). Since it has to be
different from $\alpha_1$, having in mind that we aim at variance maximisation, we see that $\alpha_2$ has to
be the normalised eigenvector that corresponds to the second largest eigenvalue $\bar\lambda_2$ of $\Sigma$. The
process can be continued further. The third principal component should be uncorrelated with
the first two, should be normalised and should give maximal variance of a linear combination
of the components of $X$ under these constraints. One can easily realise then that the vector
$\alpha_3 \in \mathbb{R}^p$ in the formula $\alpha_3^\top X$ should be the normalised eigenvector that corresponds to the third
largest eigenvalue $\bar\lambda_3$ of the matrix $\Sigma$, etc.
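Note that if we extract all possible $p$ principal components then $\sum_{i=1}^p \operatorname{Var}(\alpha_i^\top X)$ will just
equal the sum of all eigenvalues of $\Sigma$ and hence
$$\sum_{i=1}^p \operatorname{Var}(\alpha_i^\top X) = \operatorname{tr}(\Sigma) = \Sigma_{11} + \cdots + \Sigma_{pp}.$$
Therefore, if we only take a small number $k$ of principal components instead of the total possible
number $p$ we can interpret their inclusion as one that explains
$$\frac{\operatorname{Var}(\alpha_1^\top X) + \cdots + \operatorname{Var}(\alpha_k^\top X)}{\Sigma_{11} + \cdots + \Sigma_{pp}} \times 100\% = \frac{\bar\lambda_1 + \cdots + \bar\lambda_k}{\Sigma_{11} + \cdots + \Sigma_{pp}} \times 100\%$$
of the total population variance $\Sigma_{11} + \cdots + \Sigma_{pp}$.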
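To make the first principal component concrete, the sketch below finds the leading eigenvalue and unit eigenvector of a small covariance matrix by power iteration, written in plain Python. Power iteration is a swapped-in numerical technique chosen for brevity, not the method used by the R or SAS routines; it assumes the largest eigenvalue is strictly larger than the others and that the starting vector is not orthogonal to the leading eigenvector. The matrix `Sigma` is hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def first_pc(Sigma, iters=500):
    """Power iteration: returns (largest eigenvalue, unit-norm eigenvector)."""
    p = len(Sigma)
    alpha = [1.0 / math.sqrt(p)] * p            # unit-norm starting vector
    for _ in range(iters):
        w = matvec(Sigma, alpha)
        norm = math.sqrt(sum(x * x for x in w))
        alpha = [x / norm for x in w]           # re-impose alpha' alpha = 1
    lam = sum(a * s for a, s in zip(alpha, matvec(Sigma, alpha)))  # alpha' Sigma alpha
    return lam, alpha

# Hypothetical covariance matrix.
Sigma = [[4.0, 2.0, 0.0],
         [2.0, 3.0, 0.0],
         [0.0, 0.0, 1.0]]

lam1, alpha1 = first_pc(Sigma)
total_var = sum(Sigma[i][i] for i in range(3))  # tr(Sigma)
share = lam1 / total_var                        # proportion explained by the first PC

print(lam1, alpha1, share)
```

For this matrix the leading eigenvalue is known in closed form, $(7 + \sqrt{17})/2$, which gives a simple way to check the iteration.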
6.3 Estimation of the Principal Components
In practice, Σ is unknown and has to be estimated. The principal components are derived from
the normalised eigenvectors of the estimated covariance matrix.
Note also that extracting principal components from the (estimated) covariance matrix has
the drawback that it is influenced by the scale of measurement of each variable Xi, i = 1, . . . , p. A
variable with large variance will necessarily be a large component in the first principal component
(note the goal of explaining the bulk of variability by using the first principal component). Yet
the large variance of the variable may be just an artefact of the measurement scale used for this
variable. Therefore, an alternative practice is adopted sometimes to extract principal components
from the correlation matrix ρ instead of the covariance matrix Σ.
Example 6.1 (Eigenvalues obtained from Covariance and Correlation Matrices: see JW p. 437).
It demonstrates the great effect standardisation may have on the principal components. The
relative magnitudes of the weights after standardisation (i.e. from ρ) may become in direct
opposition to the weights attached to the same variables in the principal component obtained
from Σ.
For the reasons mentioned above, variables are often standardised before sample principal
components are extracted. Standardisation is accomplished by calculating the vectors
$$Z_i = \left(\frac{X_{1i} - \bar X_1}{\sqrt{s_{11}}}\ \ \frac{X_{2i} - \bar X_2}{\sqrt{s_{22}}}\ \ \cdots\ \ \frac{X_{pi} - \bar X_p}{\sqrt{s_{pp}}}\right)^\top, \quad i = 1, \ldots, n.$$
The standardised observations matrix $Z = [Z_1, Z_2, \ldots, Z_n] \in M_{p,n}$ gives the sample mean vector
$\bar Z = \frac{1}{n}Z1_n = 0$ and a sample covariance matrix $S_Z = \frac{1}{n-1}ZZ^\top = R$ (the correlation matrix
of the original observations). The principal
components are extracted in the usual way from R now.
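The claim that the covariance matrix of the standardised data equals the correlation matrix R of the original data can be verified on a toy data set. The sketch below is plain Python with made-up numbers (two variables, five observations); the helper names are hypothetical.

```python
import math

# Hypothetical data: p = 2 variables (rows), n = 5 observations (columns).
X = [[2.0, 4.0, 6.0, 8.0, 10.0],
     [1.0, 3.0, 2.0, 5.0, 4.0]]
n = len(X[0])

def mean(v):
    return sum(v) / len(v)

def sample_cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Standardise each variable by its sample mean and sample standard deviation.
Z = []
for row in X:
    m, s = mean(row), math.sqrt(sample_cov(row, row))
    Z.append([(x - m) / s for x in row])

# S_Z = Z Z^T / (n - 1); no centring is needed because each row of Z has mean zero.
S_Z = [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in Z] for zi in Z]

# Sample correlation of the original variables, computed directly.
r12 = sample_cov(X[0], X[1]) / math.sqrt(sample_cov(X[0], X[0]) * sample_cov(X[1], X[1]))

print(S_Z, r12)
```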
6.4 Deciding how many principal components to include
To reduce the dimensionality (which is the motivating goal), we should restrict attention to the
first $k$ principal components and, ideally, $k$ should be kept much smaller than $p$. There is a trade-off
to be made here, since we would also like the proportion $\psi_k = \frac{\bar\lambda_1 + \cdots + \bar\lambda_k}{\bar\lambda_1 + \cdots + \bar\lambda_p}$ to be close to one. How
could a reasonable trade-off be made? The following methods are most widely used:
• The “scree plot”: basically, it is a graphical method of plotting the ordered λ¯k against k
and deciding visually when the plot has flattened out. Typically, the initial part of the
plot is like the side of the mountain, while the flat portion where each λ¯k is just slightly
smaller than λ¯k−1, is like the rough scree at the bottom. This motivates the name of the
plot. The task here is to find where “the scree begins”.
• Choose an arbitrary constant c ∈ (0, 1) and choose k to be the smallest one with the
property ψk ≥ c. Usually, c = 0.9 is used, but please, note the arbitrariness of the choice
here.
• Kaiser’s rule: it suggests that from all $p$ principal components only the ones should be
retained whose variances (after standardisation) are greater than unity, or, equivalently,
only those components which, individually, explain at least $\frac{1}{p} \times 100\%$ of the total variance.
(This is the same as excluding all principal components with eigenvalues less than the
overall average.) This criterion has a number of positive features that have contributed to
its popularity but cannot be defended on safe theoretical grounds.
• Formal tests of significance. Note that it actually does not make sense to test whether
$\bar\lambda_{k+1} = \cdots = \bar\lambda_p = 0$ since if such a hypothesis were true then the population distribution
would be contained entirely within a $k$-dimensional subspace and the same would be
true for any sample from this distribution, hence we would have the estimated $\bar\lambda$ values
for indices $k+1, \ldots, p$ being also equal to zero with probability one! What seems to be
reasonable to do instead is to test $H_0: \bar\lambda_{k+1} = \cdots = \bar\lambda_p$ (without asking the common
value to be zero). This is a more quantitative variant of the scree test. A test for this
hypothesis is to form the arithmetic mean $a_0$ and the geometric mean $g_0$ of the last
$p-k$ estimated eigenvalues,
and then construct $-2\log\lambda = n(p-k)\log\frac{a_0}{g_0}$. The asymptotic distribution of this statistic
under the null hypothesis is $\chi^2_\nu$ where $\nu = \frac{(p-k+2)(p-k-1)}{2}$. The interested student can
find more details about this test in the monograph of Mardia, Kent and Bibby. We should
note, however, that the last result holds under the multivariate normality assumption and is
only valid as stated for the covariance-based (not the correlation-based) version of
principal component analysis. In practice, many data analysts are reluctant to make a
multivariate normality assumption at the early stage of the descriptive data analysis and
hence distrust the above quantitative test but prefer the simple Kaiser criterion.
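The numerical rules in the list above are simple enough to sketch in plain Python. The eigenvalues below are hypothetical and assumed sorted in decreasing order, and all function names are illustrative. Kaiser's rule is coded in its "greater than the average eigenvalue" form, and the last function only computes the statistic and its degrees of freedom for the covariance-based test; comparing it with a chi-square quantile is left to tables or software.

```python
import math

def cumulative_proportion(eigs):
    """psi_k for k = 1, ..., p."""
    total, running, out = sum(eigs), 0.0, []
    for lam in eigs:
        running += lam
        out.append(running / total)
    return out

def smallest_k(eigs, c=0.9):
    """Smallest k with psi_k >= c (the threshold c is an arbitrary choice)."""
    for k, psi in enumerate(cumulative_proportion(eigs), start=1):
        if psi >= c:
            return k
    return len(eigs)

def kaiser_k(eigs):
    """Number of components whose eigenvalue exceeds the average eigenvalue."""
    avg = sum(eigs) / len(eigs)
    return sum(1 for lam in eigs if lam > avg)

def equal_tail_test(eigs, k, n):
    """Statistic n (p - k) log(a0 / g0) and its df for H0: last p - k eigenvalues equal."""
    tail = eigs[k:]
    m = len(tail)                                            # m = p - k
    a0 = sum(tail) / m                                       # arithmetic mean
    g0 = math.exp(sum(math.log(lam) for lam in tail) / m)    # geometric mean
    stat = n * m * math.log(a0 / g0)
    nu = (m + 2) * (m - 1) // 2
    return stat, nu

eigs = [2.9, 1.2, 0.5, 0.3, 0.1]     # hypothetical eigenvalues, p = 5
k_prop = smallest_k(eigs, c=0.9)
k_kaiser = kaiser_k(eigs)
stat, nu = equal_tail_test(eigs, k=2, n=50)   # n = 50 is a hypothetical sample size

print(k_prop, k_kaiser, stat, nu)
```

The rules need not agree on the number of components, which is one reason the choice is usually made by weighing several of them together.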
6.5 Software
Principal components analysis can be performed in SAS by using either the PRINCOMP or the
FACTOR procedures and in R using stats::prcomp, stats::princomp, or about half a dozen other
implementations.
6.6 Examples
Example 6.2. The Crime Rates example will be discussed at the lecture. The data gives crime
rates per 100,000 people in seven categories for each of the 50 states in the USA in 1997. Principal
components are used to summarise the 7-dimensional data in 2 or 3 dimensions only and help
to visualise and interpret the data.
6.7 PCA and Factor Analysis
Principal components can serve as a method for initial factor extraction in exploratory factor
analysis. But one should mention here that principal component analysis is not factor analysis.
The main difference is that in factor analysis (to be studied later in this course) one assumes that
the covariation in the observed variables is due to the presence of one or more latent variables
(factors) that exert causal influence on the observed variables. Factor analysis is used when
it is believed that certain latent factors exist and it is hoped to explore the nature and number
of these factors. In contrast, in principal component analysis there is no prior assumption about
an underlying causal model. The goal here is just variable reduction.
6.8 Application to finance: Portfolio optimisation
Many other problems in Multivariate Statistics lead to formulating optimisation problems that
are similar in spirit to the Principal Component Analysis problem. Hereby, we shall illustrate
the Efficient portfolio choice problem.
Assume that a $p$-dimensional vector $X$ of returns of the $p$ assets is given. Then the return of a
portfolio that has these assets with weights $(c_1, c_2, \ldots, c_p)$ (with $\sum_{i=1}^p c_i = 1$) is $Q = c^\top X$ and
the mean return is $c^\top\mu$. (Here we assume that $E X = \mu$, $\operatorname{Var}(X) = \Sigma$.) The risk of the portfolio
is $c^\top\Sigma c$. Further, assume that a pre-specified mean return $\bar\mu$ is to be achieved. The question is
how to choose the weights $c$ so that the risk of a portfolio that achieves the pre-specified mean
return is as small as possible.
Mathematically, this is equivalent to the requirement to find the solution of an optimisation
problem under two constraints. The Lagrangian function is:
$$\operatorname{Lag}(\lambda_1, \lambda_2) = c^\top\Sigma c + \lambda_1(\bar\mu - c^\top\mu) + \lambda_2(1 - c^\top 1_p) \tag{6.3}$$
where $1_p$ is a $p$-dimensional vector of ones. Differentiating (6.3) with respect to $c$ we get the
first-order conditions for a minimum:
$$2\Sigma c - \lambda_1\mu - \lambda_2 1_p = 0. \tag{6.4}$$
To simplify derivations, we shall consider the so-called case of non-existence of a riskless asset
with a fixed (non-random) return. Then it makes sense to assume that $\Sigma$ is positive definite and
hence $\Sigma^{-1}$ exists. We get from (6.4) then:
$$c = \frac{1}{2}\Sigma^{-1}(\lambda_1\mu + \lambda_2 1_p). \tag{6.5}$$
After multiplying both sides of the equality by $1_p^\top$ from the left and using the restriction $1_p^\top c = 1$, we get:
$$1 = \frac{1}{2}1_p^\top\Sigma^{-1}(\lambda_1\mu + \lambda_2 1_p) \tag{6.6}$$
We can get $\lambda_2$ from (6.6) as $\lambda_2 = \frac{2 - \lambda_1 1_p^\top\Sigma^{-1}\mu}{1_p^\top\Sigma^{-1}1_p}$ and then substitute it in the formula for $c$ to end
up with:
$$c = \frac{1}{2}\lambda_1\left(\Sigma^{-1}\mu - \frac{1_p^\top\Sigma^{-1}\mu}{1_p^\top\Sigma^{-1}1_p}\Sigma^{-1}1_p\right) + \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}. \tag{6.7}$$
In a similar way, if we multiply both sides of (6.5) by $\mu^\top$ from the left and use the restriction
$\mu^\top c = \bar\mu$ we can get one more relationship between $\lambda_1$ and $\lambda_2$: $\lambda_1 = \frac{2\bar\mu - \lambda_2\mu^\top\Sigma^{-1}1_p}{\mu^\top\Sigma^{-1}\mu}$. The linear
system of 2 equations with respect to $\lambda_1$ and $\lambda_2$ can then be solved and the values substituted
in (6.7) to get the final expression for $c$ using $\mu$, $\bar\mu$ and $\Sigma$. (Do it (!))
One special case is of particular interest. This is the so-called variance-efficient portfolio (as
opposed to the mean–variance-efficient portfolio considered above). For the variance-efficient
portfolio, there is no pre-specified mean return, that is, there is no restriction on the mean. It is
only required to minimise the variance. Obviously, we have $\lambda_1 = 0$ then and from (6.7) we get
the optimal weights for the variance-efficient portfolio: $c_{\mathrm{opt}} = \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}$.
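A numerical illustration may help to see what these formulae produce. The sketch below, in plain Python, computes the variance-efficient weights and, by solving the two linear equations for $\lambda_1$ and $\lambda_2$ numerically, the mean–variance-efficient weights from (6.5). All inputs (`mu`, `Sigma`, `mu_bar`) are hypothetical, and `solve` is a small illustrative Gaussian-elimination helper, not a library routine. The closed-form expression requested in the text is deliberately not derived here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]     # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):                         # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical mean returns, covariance matrix and target mean return.
mu = [0.05, 0.08, 0.12]
Sigma = [[0.04, 0.01, 0.00],
         [0.01, 0.09, 0.02],
         [0.00, 0.02, 0.16]]
mu_bar = 0.09
ones = [1.0] * 3

Si_one = solve(Sigma, ones)     # Sigma^{-1} 1_p
Si_mu = solve(Sigma, mu)        # Sigma^{-1} mu

# Variance-efficient portfolio: Sigma^{-1} 1_p / (1_p' Sigma^{-1} 1_p).
A = dot(ones, Si_one)
c_var = [x / A for x in Si_one]

# Mean-variance-efficient portfolio: the two constraints give the linear system
#   lam1 * B + lam2 * A = 2   and   lam1 * D + lam2 * B = 2 * mu_bar.
B, D = dot(ones, Si_mu), dot(mu, Si_mu)
lam1, lam2 = solve([[B, A], [D, B]], [2.0, 2.0 * mu_bar])
c_mv = [0.5 * (lam1 * m + lam2 * o) for m, o in zip(Si_mu, Si_one)]   # equation (6.5)

print(c_var, c_mv)
```

Both weight vectors should sum to one, and `c_mv` should attain the target mean return `mu_bar`, which gives a quick consistency check on the algebra.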
6.9 Additional resources
An alternative presentation of these concepts can be found in JW Ch. 8.
6.10 Exercises
Exercise 6.1
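A random vector $Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$ is normally distributed with zero mean vector and
$\Sigma = \begin{pmatrix} 1 & \rho/2 & 0 \\ \rho/2 & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix}$ where $\rho$ is positive.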
(a) Find the coefficients of the first principal component and the variance of that component.
What percentage of the overall variability does it explain?
(b) Find the joint distribution of Y1, Y2 and Y1 + Y2 + Y3.
(c) Find the conditional distribution of Y1, Y2 given Y3 = y3.
(d) Find the multiple correlation of Y3 with Y1, Y2.
UNSW MATH5855 2021T3 Lecture 7 Canonical Correlation Analysis
7 Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
7.1 Introduction
Assume we are interested in the association between two sets of random variables. Typical
examples include: the relation between a set of governmental policy variables and a set of economic
goal variables; the relation between college “performance” variables (like grades in courses in five
different subject matter areas) and pre-college “achievement” variables (like high-school grade
point averages for junior and senior years, number of high-school extracurricular activities) etc.
The way the above problem of measuring association is solved in Canonical Correlation
Analysis is to consider the largest possible correlation between a linear combination of the variables
in the first set and a linear combination of the variables in the second set. The pair of linear
combinations obtained through this maximisation process is called first canonical variables
and their correlation is called first canonical correlation. The process can be continued
(similarly to the principal components procedure) to find a second pair of linear combinations
having the largest correlation among all pairs that are uncorrelated with the initially selected pair.
This would give us the second set of canonical variables with their second canonical correlation
etc. The maximisation process that we are performing at each step reflects our wish (again like in
principal components analysis) to concentrate the initially high dimensional relationship between
the 2 sets of variables into a few pairs of canonical variables only. Often, even only one pair is
considered. The rationale in canonical correlation analysis is that when the number of variables is
large, interpreting the whole set of correlation coefficients between pairs of variables from each
set is hopeless and in that case one should concentrate on a few carefully chosen representative
correlations. Finally, we should note that the traditional (simple) correlation coefficient and the
multiple correlation coefficient (Lecture 5) are special cases of canonical correlation in which one
or both sets contain a single variable.
7.2 Application in testing for independence of sets of variables
Besides being interesting in its own right (see Section 7.1), calculating canonical correlations turns
out to be important for the sake of testing independence of sets of random variables. Let
us remember that testing for independence and for uncorrelatedness in the case of multivariate
normal are equivalent problems. Assume now that $X \sim N_p(\mu, \Sigma)$. Furthermore, let $X$ be
partitioned into $r$ and $q$ components ($r + q = p$) with $X^{(1)} \in \mathbb{R}^r$, $X^{(2)} \in \mathbb{R}^q$ and, correspondingly, let the
covariance matrix
$$\Sigma = E(X - \mu)(X - \mu)^\top = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \in M_{p,p}$$
be partitioned into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ accordingly. We shall assume for simplicity
that the matrices $\Sigma$, $\Sigma_{11}$, and $\Sigma_{22}$ are nonsingular. To test $H_0 : \Sigma_{12} = 0$ against a general
alternative, a sensible way to go would be the following: for fixed vectors $a \in \mathbb{R}^r$, $b \in \mathbb{R}^q$ let
$Z_1 = a^\top X^{(1)}$ and $Z_2 = b^\top X^{(2)}$, giving
$$\rho_{a,b} = \mathrm{Cor}(Z_1, Z_2) = \frac{a^\top \Sigma_{12} b}{\sqrt{a^\top \Sigma_{11} a \; b^\top \Sigma_{22} b}}.$$
$H_0$ is equivalent to $H_0 : \rho_{a,b} = 0$ for all $a \in \mathbb{R}^r$, $b \in \mathbb{R}^q$. For a particular pair $a, b$, $H_0$ would be accepted if
$$|r_{a,b}| = \frac{|a^\top S_{12} b|}{\sqrt{a^\top S_{11} a \; b^\top S_{22} b}} \le k$$
for a certain positive constant $k$. (Here $S_{ij}$ are the corresponding
data-based estimators of $\Sigma_{ij}$.) Hence an appropriate acceptance region for $H_0$ would be given
in the form $\{X \in M_{p,n} : \max_{a,b} r_{a,b}^2 \le k^2\}$. But maximising $r_{a,b}^2$ means finding the maximum
of $(a^\top S_{12} b)^2$ under the constraints $a^\top S_{11} a = 1$ and $b^\top S_{22} b = 1$, and this is exactly the data-based
version of the optimisation problem to be solved in Section 7.1. For the goals in Sections 7.1
and 7.2 to be achieved, we need to solve problems of the following type.
7.3 Precise mathematical formulation and solution to the problem
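Canonical variables are the variables $Z_1 = a^\top X^{(1)}$ and $Z_2 = b^\top X^{(2)}$ where $a \in \mathbb{R}^r$, $b \in \mathbb{R}^q$ are
obtained by maximising $(a^\top \Sigma_{12} b)^2$ under the constraints $a^\top \Sigma_{11} a = b^\top \Sigma_{22} b = 1$. To solve the
above maximisation problem, we construct
$$\mathrm{Lag}(a, b, \lambda_1, \lambda_2) = (a^\top \Sigma_{12} b)^2 + \lambda_1 (a^\top \Sigma_{11} a - 1) + \lambda_2 (b^\top \Sigma_{22} b - 1).$$
Partial differentiation with respect to the vectors $a$ and $b$ gives:
$$2 (a^\top \Sigma_{12} b) \Sigma_{12} b + 2 \lambda_1 \Sigma_{11} a = 0 \in \mathbb{R}^r, \tag{7.1}$$
$$2 (a^\top \Sigma_{12} b) \Sigma_{21} a + 2 \lambda_2 \Sigma_{22} b = 0 \in \mathbb{R}^q. \tag{7.2}$$
We multiply (7.1) by the vector $a^\top$ from the left and equation (7.2) by $b^\top$ from the left and, after
subtracting the two equations obtained, we get $\lambda_1 = \lambda_2 = -(a^\top \Sigma_{12} b)^2 = -\mu^2$. Hence:
$$\Sigma_{12} b = \mu \Sigma_{11} a \tag{7.3}$$
and
$$\Sigma_{21} a = \mu \Sigma_{22} b \tag{7.4}$$
hold.
Now we first multiply (7.3) by $\Sigma_{21} \Sigma_{11}^{-1}$ from the left, then both sides of (7.4) by the scalar $\mu$ and,
after finally adding the two equations, we get:
$$(\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \mu^2 \Sigma_{22}) b = 0. \tag{7.5}$$
The homogeneous equation system (7.5) having a nontrivial solution w.r.t. $b$ means that
$$|\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \mu^2 \Sigma_{22}| = 0 \tag{7.6}$$
must hold. Then, of course,
$$|\Sigma_{22}^{-1/2}| \, |\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \mu^2 \Sigma_{22}| \, |\Sigma_{22}^{-1/2}| = |\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2} - \mu^2 I_q| = 0$$
must hold. This means that $\mu^2$ has to be an eigenvalue of the matrix $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$. Also,
$b = \Sigma_{22}^{-1/2} \hat{b}$ where $\hat{b}$ is the eigenvector of $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$ corresponding to this eigenvalue
(WHY?!).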
(Note, however, that this representation is good mainly for theoretical purposes, the main
advantage being that one is dealing with eigenvalues of a symmetric matrix. If doing calculations
by hand, it is usually easier to calculate $b$ directly as the solution of the linear equation (7.5),
i.e., find the largest eigenvalue of the (nonsymmetric) matrix $\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$ and then find
the eigenvector $b$ that corresponds to it. Besides, we also see from the definition of $\mu$ that
$\mu^2 = (a^\top \Sigma_{12} b)^2$ holds.)
Since we wanted to maximise the right hand side, it is obvious that $\mu^2$ must be chosen to
be the largest eigenvalue of the matrix $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$ (or, which is the same thing, the
largest eigenvalue of the matrix $\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1}$). Finally, we can obtain the vector $a$ from (7.3):
$a = \frac{1}{\mu} \Sigma_{11}^{-1} \Sigma_{12} b$. That way, the first canonical variables $Z_1 = a^\top X^{(1)}$ and $Z_2 = b^\top X^{(2)}$ are
determined and the value of the first canonical correlation is just $\mu$. The orientation of the vector
$b$ is chosen such that the sign of $\mu$ is positive.
Now, it is easy to see that if we want to extract a second pair of canonical variables we need
to repeat the same process by starting with the second largest eigenvalue $\mu_2^2$ of the matrix
$\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$ (or of the matrix $\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$). This will automatically ensure that the
second pair of canonical variables is uncorrelated with the first pair. The process can theoretically
be continued until the number of pairs of canonical variables equals the number of variables in the
smaller group. But in practice, much fewer canonical variables will be needed. Each canonical
variable is uncorrelated with all the other canonical variables of either set except for the one
corresponding canonical variable in the opposite set.
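The recipe of this section can be sketched numerically. The following is a minimal illustration, not a replacement for the software mentioned in Section 7.5: the function name `canonical_correlations` and the 4 × 4 matrix `sigma` are our own illustrative choices, and the inverse square root is computed by the spectral decomposition (see Section 7.6 for a numerically safer alternative).

```python
import numpy as np

def canonical_correlations(sigma, r):
    """Canonical correlations and coefficient vectors from a covariance matrix.

    sigma : (p, p) covariance matrix; the first r variables form X^(1).
    Returns (mu, a, b): correlations in decreasing order and matrices whose
    columns are the coefficient vectors a_i (r rows) and b_i (q rows).
    """
    sigma = np.asarray(sigma, dtype=float)
    q = sigma.shape[0] - r
    s11, s12 = sigma[:r, :r], sigma[:r, r:]
    s21, s22 = sigma[r:, :r], sigma[r:, r:]

    # Symmetric inverse square root of Sigma22 via its spectral decomposition.
    w, v = np.linalg.eigh(s22)
    s22_inv_half = v @ np.diag(w ** -0.5) @ v.T

    # Symmetric matrix whose eigenvalues are the squared canonical correlations.
    m = s22_inv_half @ s21 @ np.linalg.solve(s11, s12) @ s22_inv_half
    eigval, eigvec = np.linalg.eigh((m + m.T) / 2)
    order = np.argsort(eigval)[::-1][: min(r, q)]
    mu = np.sqrt(np.clip(eigval[order], 0.0, None))

    b = s22_inv_half @ eigvec[:, order]            # b = Sigma22^{-1/2} b_hat
    # a = (1/mu) Sigma11^{-1} Sigma12 b, from (7.3); when mu is (numerically)
    # zero, (7.3) does not determine a, so we skip the division there.
    safe_mu = np.where(mu > 1e-12, mu, 1.0)
    a = np.linalg.solve(s11, s12 @ b) / safe_mu
    return mu, a, b

# Illustrative covariance matrix for p = 4 variables split as r = 2, q = 2.
sigma = np.array([
    [1.0, 0.4, 0.5, 0.6],
    [0.4, 1.0, 0.3, 0.4],
    [0.5, 0.3, 1.0, 0.2],
    [0.6, 0.4, 0.2, 1.0],
])
mu, a, b = canonical_correlations(sigma, r=2)
print("canonical correlations:", mu)
```

The columns of `a` and `b` satisfy the constraints $a^\top \Sigma_{11} a = b^\top \Sigma_{22} b = 1$, and $a_i^\top \Sigma_{12} b_i$ reproduces the $i$th canonical correlation.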
Note. It is important to point out that already by definition the canonical correlation is at
least as large as the multiple correlation between any variable and the opposite set of variables.
It is in fact possible for the first canonical correlation to be very large while all the multiple
correlations of each separate variable with the opposite set of canonical variables are small. This
once again underlines the importance of Canonical Correlation analysis.
7.4 Estimating and testing canonical correlations
The way to estimate the canonical variables and canonical correlation coefficients is based on the
plugin technique: one follows the steps outlined in Section 7.3, by each time substituting Sij in
place of Σij .
Let us now discuss the independence testing issue outlined in Section 7.2. The acceptance
region of the independence test of $H_0$ in Section 7.2 would be $\{X \in M_{p,n}$ : largest eigenvalue
of $S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2} \le k_\alpha\}$, where $k_\alpha$ has been worked out and is given in the so-called
Heck’s charts. This distribution depends on three parameters: $s = \min(r, q)$, $m = \frac{|r - q| - 1}{2}$,
and $N = \frac{n - r - q - 2}{2}$, $n$ being the sample size. Besides using the charts, one can also use good
F-distribution-based approximations for (transformations of) this distribution, like Wilks’s lambda,
Pillai’s trace, the Hotelling trace, and Roy’s greatest root.
7.5 Software
Here we shall only mention that all these statistics and their p-values (using suitable F-distribution-based
approximations) are readily available as output from the SAS procedure CANCORR, so that
performing the test is really easy: one can read the p-value directly from the SAS output. In
R, see stats::cancor and package CCA for computing and visualisation, and package CCP for
testing canonical correlations.
7.6 Some important computational issues
Note that calculating $X^{-1/2}$ and $X^{1/2}$ for a symmetric positive definite matrix $X$ according to the
theoretically attractive spectral decomposition method may be numerically unstable. This is
especially the case when some of the eigenvalues are close to zero (or, more precisely, when the
ratio of the greatest eigenvalue to the least eigenvalue, that is, the condition number, is high).
We can use the Cholesky decomposition described in Section 0.1.6 instead. Looking back at
(7.5), we see that if $U^\top U = \Sigma_{22}^{-1}$ gives the Cholesky decomposition of the matrix $\Sigma_{22}^{-1}$ then $\mu^2$
is an eigenvalue of the matrix $A = U \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} U^\top$. Indeed, by multiplying the matrix inside the
determinant in (7.6) from the left by $U$ and from the right by $U^\top$ we get:
$$|A - \mu^2 U \Sigma_{22} U^\top| = 0.$$
But $U \Sigma_{22} U^\top = U (U^\top U)^{-1} U^\top = U U^{-1} (U^\top)^{-1} U^\top = I$ holds.
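A small numerical sketch of this argument follows. It is an illustration under our own assumptions: the function name and the partitioned matrix are made up, and we obtain a matrix $U$ with $U^\top U = \Sigma_{22}^{-1}$ as the inverse of the lower-triangular Cholesky factor of $\Sigma_{22}$ (the argument above only uses the property $U^\top U = \Sigma_{22}^{-1}$, not the triangular shape of $U$).

```python
import numpy as np

def squared_cancor_via_cholesky(s11, s12, s22):
    """Squared canonical correlations as eigenvalues of A = U S21 S11^{-1} S12 U^T."""
    s21 = s12.T
    # numpy returns lower-triangular L with L @ L.T = s22, so U = L^{-1}
    # satisfies U.T @ U = s22^{-1}.
    L = np.linalg.cholesky(s22)
    U = np.linalg.inv(L)
    A = U @ s21 @ np.linalg.solve(s11, s12) @ U.T
    A = (A + A.T) / 2          # remove round-off asymmetry before eigvalsh
    return np.sort(np.linalg.eigvalsh(A))[::-1], U

# Illustrative blocks of a 4 x 4 covariance matrix (r = q = 2).
s11 = np.array([[1.0, 0.4], [0.4, 1.0]])
s12 = np.array([[0.5, 0.6], [0.3, 0.4]])
s22 = np.array([[1.0, 0.2], [0.2, 1.0]])

mu_sq, U = squared_cancor_via_cholesky(s11, s12, s22)
print("U Sigma22 U^T =\n", U @ s22 @ U.T)   # numerically the identity
print("squared canonical correlations:", mu_sq)
```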
7.7 Examples
Example 7.1. Canonical Correlation Analysis of the Fitness Club Data. Three physiological
and three exercise variables were measured on twenty middle-aged men in a fitness club.
Canonical correlation is used to determine if the physiological variables are related in any way
to the exercise variables.
Example 7.2. JW Example 10.4, p. 552. Studying canonical correlations between leg and
head bone measurements: X1, X2 are skull length and skull breadth, respectively; X3, X4 are
leg bone measurements: femur and tibia length, respectively. Observations have been taken
on n = 276 White Leghorn chickens. The example is chosen to also illustrate how a canonical
correlation analysis can be performed when the original data is not given but the empirical
correlation matrix (or empirical covariance matrix) is available.
7.8 Additional resources
An alternative presentation of these concepts can be found in JW Ch. 10.
7.9 Exercises
Exercise 7.1
Let the components of X correspond to scores on tests in arithmetic speed (X1), arithmetic
power (X2), memory for words (X3), memory for meaningful symbols (X4), and memory for
meaningless symbols (X5). The observed correlations in a sample of 140 are
$$\begin{pmatrix} 1.0000 & 0.4248 & 0.0420 & 0.0215 & 0.0573 \\ & 1.0000 & 0.1487 & 0.2489 & 0.2843 \\ & & 1.0000 & 0.6693 & 0.4662 \\ & & & 1.0000 & 0.6915 \\ & & & & 1.0000 \end{pmatrix}.$$
Find the canonical correlations and canonical variates between the first two variates and the last
three variates. Comment. Write SAS/IML or R code to implement the required calculations.
Exercise 7.2
Students sit 5 different papers, two of which are closed book and the rest open book. For the
88 students who sat these exams the sample covariance matrix is
$$S = \begin{pmatrix} 302.3 & 125.8 & 100.4 & 105.1 & 116.1 \\ & 170.9 & 84.2 & 93.6 & 97.9 \\ & & 111.6 & 110.8 & 120.5 \\ & & & 217.9 & 153.8 \\ & & & & 294.4 \end{pmatrix}.$$
Find the canonical correlations and canonical variates between the first two variates (closed book
exams) and the last three variates (open book exams). Comment.
Exercise 7.3
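A random vector $X \sim N_4(\mu, \Sigma)$ with $\mu = (0, 0, 0, 0)^\top$ and
$$\Sigma = \begin{pmatrix} 1 & 2\rho & \rho & \rho \\ 2\rho & 1 & \rho & \rho \\ \rho & \rho & 1 & 2\rho \\ \rho & \rho & 2\rho & 1 \end{pmatrix},$$
where $\rho$ is a small enough positive constant.
(a) Find the two canonical correlations between $(X_1, X_2)^\top$ and $(X_3, X_4)^\top$. Comment.
(b) Find the first pair of canonical variables.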
Exercise 7.4
Consider the following covariance matrix $\Sigma$ of a four-dimensional normal vector:
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} 100 & 0 & 0 & 0 \\ 0 & 1 & 0.95 & 0 \\ 0 & 0.95 & 1 & 0 \\ 0 & 0 & 0 & 100 \end{pmatrix}.$$
Verify that the first pair of canonical variates
are just the second and the third component of the vector and the canonical correlation equals
0.95.
UNSW MATH5855 2021T3 Lecture 8 MLM and MANOVA
8 Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons
8.4 Software
8.5 Examples
8.6 Additional resources
8.1 Univariate linear models and ANOVA
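Recall the univariate linear model: for observations $i = 1, 2, \dots, n$, let the response variable
$Y_i = x_i \beta + \epsilon_i$, for predictor row vector $x_i$ (with $x_i^\top \in \mathbb{R}^k$) assumed fixed and known, coefficient vector
$\beta \in \mathbb{R}^k$ fixed and unknown, and $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. In matrix form, $Y = (Y_1 \; Y_2 \; \cdots \; Y_n)^\top$
and $X = (x_1^\top \; x_2^\top \; \cdots \; x_n^\top)^\top \in M_{n,k}$. We will assume that $X$ contains an intercept. Then,
$$Y = X\beta + \epsilon,$$
where $\epsilon \sim N_n(0, I_n \sigma^2)$. The MLE for $\beta$ requires us to minimise
$$\sum_{i=1}^n (Y_i - x_i \beta)^2 = \|Y - X\beta\|^2 = (Y - X\beta)^\top (Y - X\beta),$$
and, after some vector calculus, we get
$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
with
$$\mathrm{Var}(\hat\beta) = (X^\top X)^{-1} X^\top \mathrm{Var}(Y) X (X^\top X)^{-1} = (X^\top X)^{-1} \sigma^2.$$
Furthermore, we can consider projection matrices $A = I_n - X(X^\top X)^{-1} X^\top$ and
$B = X(X^\top X)^{-1} X^\top - 1_n (1_n^\top 1_n)^{-1} 1_n^\top$, with
$$AY = Y - X\{(X^\top X)^{-1} X^\top Y\} = Y - \hat{Y},$$
the residual vector, and
$$BY = X\{(X^\top X)^{-1} X^\top\} Y - 1_n (1_n^\top 1_n)^{-1} 1_n^\top Y = \hat{Y} - 1_n \bar{Y},$$
the vector of fitted values over and above the mean, and observe that
$$\begin{aligned} \mathrm{Cov}(AY, BY) &= A \, \mathrm{Var}(Y) B^\top = \sigma^2 A B^\top \\ &= \sigma^2 \{ X(X^\top X)^{-1} X^\top - X(X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top \\ &\qquad - 1_n (1_n^\top 1_n)^{-1} 1_n^\top + X(X^\top X)^{-1} X^\top 1_n (1_n^\top 1_n)^{-1} 1_n^\top \} \\ &= \frac{\sigma^2}{n} (X(X^\top X)^{-1} X^\top 1_n - 1_n) 1_n^\top = 0 \end{aligned}$$
if $X$ contains an intercept effect. Then, $\mathrm{SSE} = Y^\top A Y \sim \sigma^2 \chi^2_{n-k}$ and $\mathrm{SSA} = Y^\top B Y \sim \sigma^2 \chi^2_{k-1}$,
independent, letting us set up $F = \frac{\mathrm{SSA}/(k-1)}{\mathrm{SSE}/(n-k)} \sim F_{k-1, n-k}$, etc.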
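The algebra above is easy to check numerically. The sketch below uses simulated data; the sample size, coefficients, error standard deviation and seed are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
# Design matrix with an intercept column and k - 1 simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(scale=0.7, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)      # X (X'X)^{-1} X'
J = np.ones((n, n)) / n                    # 1_n (1_n' 1_n)^{-1} 1_n'
A = np.eye(n) - H                          # projects onto the residuals
B = H - J                                  # fitted values minus the mean

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
SSE = Y @ A @ Y
SSA = Y @ B @ Y
SST = Y @ (np.eye(n) - J) @ Y
F = (SSA / (k - 1)) / (SSE / (n - k))

print("max |A B'| =", np.abs(A @ B.T).max())   # zero up to round-off
print("SSA + SSE - SST =", SSA + SSE - SST)    # zero up to round-off
print("F statistic =", F)
```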
8.2 Multivariate Linear Model and MANOVA
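How do we generalise it to a multivariate response? That is, suppose that we observe the following
response matrix:
$$Y = \begin{pmatrix} Y_1^\top \\ Y_2^\top \\ \vdots \\ Y_n^\top \end{pmatrix} = \begin{pmatrix} Y_{11} & Y_{12} & \cdots & Y_{1p} \\ Y_{21} & Y_{22} & \cdots & Y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{np} \end{pmatrix} \in M_{n,p}$$
with $x_i$ and $X$ as before, and
$$Y_i^\top = x_i \beta + \epsilon_i^\top$$
where $\beta \in M_{k,p}$, and $\epsilon_i \sim N_p(0, \Sigma)$, $\Sigma \in M_{p,p}$ symmetric positive definite. In matrix form,
$$Y = X\beta + E,$$
where
$$E = (\epsilon_1 \; \epsilon_2 \; \cdots \; \epsilon_n)^\top \in M_{n,p}.$$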
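Then, we can write $\vec{E} \sim N_{np}(0, \Sigma \otimes I_n)$ or $\overrightarrow{E^\top} \sim N_{np}(0, I_n \otimes \Sigma)$, and
$$\vec{Y} \sim N_{np}(\{\beta^\top \otimes I_n\} \vec{X}, \Sigma \otimes I_n)$$
or
$$\overrightarrow{Y^\top} \sim N_{np}(\{I_n \otimes \beta^\top\} \overrightarrow{X^\top}, I_n \otimes \Sigma).$$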
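The two expressions for the mean rest on the identities $\overrightarrow{X\beta} = (\beta^\top \otimes I_n)\vec{X}$ and $\overrightarrow{(X\beta)^\top} = (I_n \otimes \beta^\top)\overrightarrow{X^\top}$, where the arrow stacks the columns of a matrix. A quick numerical check is sketched below; the dimensions and the random matrices are arbitrary, and the helper `vec` is our own.

```python
import numpy as np

def vec(m):
    """Stack the columns of a matrix into one long vector."""
    return np.asarray(m).flatten(order="F")

rng = np.random.default_rng(0)
n, k, p = 5, 3, 2
X = rng.normal(size=(n, k))
beta = rng.normal(size=(k, p))
mean = X @ beta                                   # E(Y), an n x p matrix

# vec(X beta) = (beta' kron I_n) vec(X)
lhs_cols = vec(mean)
rhs_cols = np.kron(beta.T, np.eye(n)) @ vec(X)

# vec((X beta)') = (I_n kron beta') vec(X')
lhs_rows = vec(mean.T)
rhs_rows = np.kron(np.eye(n), beta.T) @ vec(X.T)

print(np.allclose(lhs_cols, rhs_cols), np.allclose(lhs_rows, rhs_rows))
```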
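MLE is equivalent to the OLS problem minimising
$\sum_{i=1}^n \mathrm{tr}\{(Y_i^\top - x_i \beta)^\top (Y_i^\top - x_i \beta)\} = \mathrm{tr}\{(Y - X\beta)^\top (Y - X\beta)\}$, leading to
$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
again, with
$$\begin{aligned} \mathrm{Var}(\overrightarrow{\hat\beta^\top}) &= \mathrm{Var}(\overrightarrow{Y^\top X (X^\top X)^{-1}}) = \mathrm{Var}\{((X^\top X)^{-1} X^\top \otimes I_p) \overrightarrow{Y^\top}\} \\ &= ((X^\top X)^{-1} X^\top \otimes I_p)(I_n \otimes \Sigma)((X^\top X)^{-1} X^\top \otimes I_p)^\top \\ &= ((X^\top X)^{-1} X^\top \otimes I_p)((X^\top X)^{-1} X^\top \otimes \Sigma)^\top \\ &= (X^\top X)^{-1} \otimes \Sigma, \end{aligned}$$
or
$$\mathrm{Var}(\vec{\hat\beta}) = \Sigma \otimes (X^\top X)^{-1}.$$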
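Projection matrices $A$ and $B$ still work (check it!), and we can write $\mathrm{SSE} = Y^\top A Y \sim W_p(\Sigma, n - k)$
and $\mathrm{SSA} = Y^\top B Y \sim W_p(\Sigma, k - 1)$, with degrees of freedom matching the univariate case in Section 8.1. Notice that they are now matrices.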
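A minimal simulated example of the multivariate fit follows. All numbers (dimensions, `beta`, `Sigma`, seed) are illustrative assumptions, and dividing SSE by $n - k$ to estimate $\Sigma$ is the usual unbiased choice rather than something derived above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p = 40, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = rng.normal(size=(k, p))
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = X @ beta + E

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y               # same formula as in the univariate case
H = X @ XtX_inv @ X.T
J = np.ones((n, n)) / n

SSE = Y.T @ (np.eye(n) - H) @ Y            # p x p "error" matrix
SSA = Y.T @ (H - J) @ Y                    # p x p "hypothesis" matrix
SST = Y.T @ (np.eye(n) - J) @ Y

Sigma_hat = SSE / (n - k)                  # assumed estimator of Sigma
# Plug-in estimate of Var(vec(beta_hat)) = Sigma kron (X'X)^{-1}.
cov_vec_beta_hat = np.kron(Sigma_hat, XtX_inv)

print("SST = SSA + SSE:", np.allclose(SST, SSA + SSE))
print("estimated Sigma:\n", Sigma_hat)
```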
8.3 Computations used in the MANOVA tests
In standard (univariate) Analysis of Variance, with usual normality assumptions on the errors,
testing about effects of the factors involved in the model description is based on the F test. The
F tests are derived from the ANOVA decomposition SST = SSA + SSE. The argument goes as
follows:
i) SSE and SSA are independent, (up to constant factors involving the variance σ2 of the
errors) χ2 distributed;
ii) By proper norming to account for degrees of freedom, from SSE and SSA one gets statistics
that have the following behaviour: the normed SSE always delivers an unbiased estimator
of σ2 no matter if the null hypothesis or alternative is true; the normed SSA delivers an
unbiased estimator of σ2 under the null hypothesis but delivers an unbiased estimator of a
“larger” quantity under the alternative.
The above observation is crucial and motivates the F testing: F statistics are (suitably normed
to account for degrees of freedom) ratios of SSA/SSE. When taking the ratio, the factors
involving σ2 cancel out and σ2 does not play any role in the distribution of the ratio. Under
H0 their distribution is F . When the null hypothesis is violated, then the same statistics will
tend to have “larger” values as compared to the case when H0 is true. Hence significant (w.r.t.
the corresponding F distribution) values of the statistic lead to rejection of H0.
Aiming at generalising these ideas to the Multivariate ANOVA (MANOVA) case, we should
note that instead of χ2 distributions we now have to deal with Wishart distributions and we
need to properly define (a proper functional of) the SSA/SSE ratio which would be a “ratio”
of matrices now. Obviously, there are more ways to define suitable statistics in this context! It
turns out that such functionals are related to the eigenvalues of the (properly normed) Wishart
distributed matrices that enter the decomposition SST = SSA + SSE in the multivariate case.
8.3.1 Roots distributions
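Let $Y_i$, $i = 1, 2, \dots, n$, be independent $N_p(\mu_i, \Sigma)$. Then the following data matrix:
$$Y = \begin{pmatrix} Y_1^\top \\ Y_2^\top \\ \vdots \\ Y_n^\top \end{pmatrix} = \begin{pmatrix} Y_{11} & Y_{12} & \cdots & Y_{1p} \\ Y_{21} & Y_{22} & \cdots & Y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{np} \end{pmatrix} \in M_{n,p}$$
is an $n \times p$ matrix containing $n$ $p$-dimensional (transposed) vectors. Denote: $E(Y) = M$,
$\mathrm{Var}(\vec{Y}) = \Sigma \otimes I_n$. Let $A$ and $B$ be projectors such that $Q_1 = Y^\top A Y$ and $Q_2 = Y^\top B Y$ are two
independent $W_p(\Sigma, v)$ and $W_p(\Sigma, q)$ matrices, respectively. Although the theory is general, to
keep you on track, you could always think about a multivariate linear model example:
$$Y = X\beta + E, \quad \hat{Y} = X\hat\beta,$$
$$A = I_n - X(X^\top X)^{-} X^\top, \quad B = X(X^\top X)^{-} X^\top - 1_n (1_n^\top 1_n)^{-1} 1_n^\top$$
(here $(X^\top X)^{-}$ denotes a generalised inverse) and the corresponding decomposition
$$Y^\top [I_n - 1_n (1_n^\top 1_n)^{-1} 1_n^\top] Y = Y^\top B Y + Y^\top A Y = Q_2 + Q_1$$
of SST = SSA + SSE = $Q_2 + Q_1$ where $Q_2$ is the “hypothesis matrix” and $Q_1$ is the “error
matrix”.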
Lemma 8.1. Let $Q_1, Q_2 \in M_{p,p}$ be two positive definite symmetric matrices. Then the roots
of the determinant equation $|Q_2 - \theta(Q_1 + Q_2)| = 0$ are related to the roots of the equation
$|Q_2 - \lambda Q_1| = 0$ by: $\lambda_i = \frac{\theta_i}{1 - \theta_i}$ (or $\theta_i = \frac{\lambda_i}{1 + \lambda_i}$).
Lemma 8.2. Let $Q_1, Q_2 \in M_{p,p}$ be two positive definite symmetric matrices. Then the roots
of the determinant equation $|Q_1 - v(Q_1 + Q_2)| = 0$ are related to the roots of the equation
$|Q_2 - \lambda Q_1| = 0$ by: $\lambda_i = \frac{1 - v_i}{v_i}$ (or $v_i = \frac{1}{1 + \lambda_i}$).
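Both lemmas follow from factoring a scalar out of the determinant. A short sketch of the step (our own filling-in, using only $|cM| = c^p |M|$ for a $p \times p$ matrix $M$ and the nonsingularity of $Q_1$):

```latex
\begin{align*}
|Q_2 - \theta(Q_1 + Q_2)| &= |(1-\theta)Q_2 - \theta Q_1|
  = (1-\theta)^p \left| Q_2 - \tfrac{\theta}{1-\theta}\, Q_1 \right|,
  && \theta \neq 1, \\
|Q_1 - v(Q_1 + Q_2)| &= |(1-v)Q_1 - v Q_2|
  = (-v)^p \left| Q_2 - \tfrac{1-v}{v}\, Q_1 \right|,
  && v \neq 0.
\end{align*}
```

Neither $\theta = 1$ nor $v = 0$ can be a root, since the corresponding determinants reduce to $|-Q_1|$ and $|Q_1|$, which are nonzero. Hence $\theta$ is a root of the first equation exactly when $\lambda = \theta/(1-\theta)$ is a root of $|Q_2 - \lambda Q_1| = 0$, and $v$ is a root of the second exactly when $\lambda = (1-v)/v$ is.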
We can employ the above two lemmas to see that if $\lambda_i$, $v_i$, $\theta_i$ are the roots of
$$|Q_2 - \lambda Q_1| = 0, \quad |Q_1 - v(Q_1 + Q_2)| = 0, \quad |Q_2 - \theta(Q_1 + Q_2)| = 0$$
then:
$$\Lambda = |Q_1 (Q_1 + Q_2)^{-1}| = \prod_{i=1}^p (1 + \lambda_i)^{-1}$$
(Wilks’ Criterion statistic) or
$$|Q_2 Q_1^{-1}| = \prod_{i=1}^p \lambda_i = \prod_{i=1}^p \frac{1 - v_i}{v_i} = \prod_{i=1}^p \frac{\theta_i}{1 - \theta_i}$$
or
$$|Q_2 (Q_1 + Q_2)^{-1}| = \prod_{i=1}^p \theta_i = \prod_{i=1}^p \frac{\lambda_i}{1 + \lambda_i} = \prod_{i=1}^p (1 - v_i)$$
and other functional transformations of these products of (random) roots would have a distribution
that would only depend on $p$ (the dimension of $Y_i$), $v$ (the Wishart degrees of freedom for
$Q_1$), and $q$ (same for $Q_2$).
There are various ways to choose such functional transformations (statistics) and many have
been suggested like:
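• $\Lambda$ (Wilks’s Lambda)
• $\mathrm{tr}(Q_2 Q_1^{-1}) = \mathrm{tr}(Q_1^{-1} Q_2) = \sum_{i=1}^p \lambda_i$ (Lawley–Hotelling trace)
• $\max_i \lambda_i$ (Roy’s criterion)
• $V = \mathrm{tr}[Q_2 (Q_1 + Q_2)^{-1}] = \sum_{i=1}^p \frac{\lambda_i}{1 + \lambda_i}$ (Pillai statistic / Pillai’s trace)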
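The four statistics can be sketched directly from the roots $\lambda_i$ of $|Q_2 - \lambda Q_1| = 0$, that is, the eigenvalues of $Q_1^{-1} Q_2$. The function name and the simulated matrices below are illustrative assumptions only.

```python
import numpy as np

def manova_statistics(Q1, Q2):
    """Wilks, Lawley-Hotelling, Pillai and Roy statistics from the error
    matrix Q1 and the hypothesis matrix Q2."""
    # Eigenvalues of Q1^{-1} Q2 are real and non-negative in theory; .real
    # discards imaginary round-off from the nonsymmetric eigen-solver.
    lam = np.sort(np.linalg.eigvals(np.linalg.solve(Q1, Q2)).real)[::-1]
    return {
        "roots": lam,
        "wilks": np.prod(1.0 / (1.0 + lam)),
        "lawley_hotelling": lam.sum(),
        "pillai": (lam / (1.0 + lam)).sum(),
        "roy": lam[0],
    }

# Simulated "error" and "hypothesis" matrices for p = 3 responses.
rng = np.random.default_rng(7)
Z1 = rng.normal(size=(20, 3))
Z2 = rng.normal(size=(4, 3))
Q1, Q2 = Z1.T @ Z1, Z2.T @ Z2

stats = manova_statistics(Q1, Q2)
print(stats)
```

The determinant and trace forms give the same values: for example, `stats["wilks"]` agrees with $|Q_1| / |Q_1 + Q_2|$.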
Tables and charts for their exact or approximate distributions are available. Also, P values for
these statistics are readily calculated in statistical packages. In these applications, the meaning
of Q1 is of the “error matrix” (also denoted by E sometimes) and the meaning of Q2 is that of
a “hypothesis matrix” (also denoted by H sometimes).
The distribution of the statistics defined above depends on the following three parameters:
• p = the number of responses
• q = νh = degrees of freedom for the hypothesis
• v = νe = degrees of freedom for the error
Based on these, the following quantities are calculated: $s = \min(p, q)$, $m = 0.5(|p - q| - 1)$,
$n = 0.5(v - p - 1)$, $r = v - 0.5(p - q + 1)$, $u = 0.25(pq - 2)$. Moreover, we define $t = \sqrt{\frac{p^2 q^2 - 4}{p^2 + q^2 - 5}}$ if
$p^2 + q^2 - 5 > 0$ and $t = 1$ otherwise. Let us order the eigenvalues of $E^{-1} H = Q_1^{-1} Q_2$ according
to: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.
Then the following distribution results are exact if $s = 1$ or $2$, otherwise approximate:
• Wilks’s test. The test statistic, Wilks’s lambda, is $\Lambda = \frac{|E|}{|E + H|} = \prod_{i=1}^p \frac{1}{1 + \lambda_i}$. Then it holds:
$F = \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}} \cdot \frac{rt - 2u}{pq} \sim F_{pq, \, rt - 2u}$ (Rao’s F).
• Lawley–Hotelling trace test. The Lawley–Hotelling statistic is $U = \mathrm{tr}(E^{-1} H) = \lambda_1 + \cdots + \lambda_p$,
and $F = \frac{2(sn + 1) U}{s^2 (2m + s + 1)} \sim F_{s(2m + s + 1), \, 2(sn + 1)}$.
• Pillai’s test. The test statistic, Pillai’s trace, is $V = \mathrm{tr}(H (H + E)^{-1}) = \frac{\lambda_1}{1 + \lambda_1} + \cdots + \frac{\lambda_p}{1 + \lambda_p}$,
and $F = \frac{2n + s + 1}{2m + s + 1} \times \frac{V}{s - V} \sim F_{s(2m + s + 1), \, s(2n + s + 1)}$.
• Roy’s maximum root criterion. The test statistic is just the largest eigenvalue $\lambda_1$.
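The F transformations in the list above are simple arithmetic once the roots are known. The following sketch transcribes them; the function name and return layout are our own, and p-values (which would need an F distribution function from a statistics library) are deliberately left out.

```python
import math

def manova_f_approximations(lam, p, q, v):
    """F statistics and their degrees of freedom for Wilks, Lawley-Hotelling
    and Pillai, given the roots lam, p responses, hypothesis df q, error df v."""
    s = min(p, q)
    m = 0.5 * (abs(p - q) - 1)
    n = 0.5 * (v - p - 1)
    r = v - 0.5 * (p - q + 1)
    u = 0.25 * (p * q - 2)
    d = p**2 + q**2 - 5
    t = math.sqrt((p**2 * q**2 - 4) / d) if d > 0 else 1.0

    wilks = math.prod(1.0 / (1.0 + l) for l in lam)
    U = sum(lam)                              # Lawley-Hotelling trace
    V = sum(l / (1.0 + l) for l in lam)       # Pillai's trace

    root = wilks ** (1.0 / t)
    return {
        "wilks": ((1 - root) / root * (r * t - 2 * u) / (p * q),
                  (p * q, r * t - 2 * u)),
        "lawley_hotelling": (2 * (s * n + 1) * U / (s**2 * (2 * m + s + 1)),
                             (s * (2 * m + s + 1), 2 * (s * n + 1))),
        "pillai": ((2 * n + s + 1) / (2 * m + s + 1) * V / (s - V),
                   (s * (2 * m + s + 1), s * (2 * n + s + 1))),
    }

# Sanity check in the univariate case p = q = 1: all three reduce to
# lambda * v, i.e. the ordinary F = (SSA / 1) / (SSE / v).
result = manova_f_approximations([0.3], p=1, q=1, v=20)
print(result)
```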
Finally, we shall mention one historically older and very universal approximation to the
distribution of the $\Lambda$ statistic due to Bartlett (1927):
It holds: level of $-[\nu_e - \frac{p - \nu_h + 1}{2}] \log \Lambda = c(p, \nu_h, M) \times$ level of $\chi^2_{p \nu_h}$, where the constant
$c(p, \nu_h, M = \nu_e - p + 1)$ is given in tables. Such tables are prepared for levels $\alpha = 0.10, 0.05, 0.025$,
etc.
In the context of testing the hypothesis about significance of the first canonical correlation,
we have:
$$E = S_{22} - S_{21} S_{11}^{-1} S_{12}, \quad H = S_{21} S_{11}^{-1} S_{12}.$$
The Wilks’s statistic becomes $\frac{|S|}{|S_{11}| |S_{22}|}$. (Recall (4.3)!) We also see that in this case, if $\mu_i^2$ were
the squared canonical correlations then $\mu_1^2$ was defined as the maximal eigenvalue of $S_{22}^{-1} H$, that
is, it is a solution to $|(E + H)^{-1} H - \mu_1^2 I| = 0$. However, setting $\lambda_1 = \frac{\mu_1^2}{1 - \mu_1^2}$ we see that:
$$|(E + H)^{-1} H - \mu_1^2 I| = 0 \implies |H - \mu_1^2 (E + H)| = 0 \implies \left|H - \frac{\mu_1^2}{1 - \mu_1^2} E\right| = 0 \implies |E^{-1} H - \lambda_1 I| = 0$$
holds and $\lambda_1$ is an eigenvalue of $E^{-1} H$. Similarly you can argue for the remaining $\lambda_i = \frac{\mu_i^2}{1 - \mu_i^2}$
values. What are the degrees of freedom of E and H?
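These relations can be checked on simulated data. In the sketch below the data, the split into $r = 2$ and $q = 3$ variables, and the seed are arbitrary; the point is only that $|S| / (|S_{11}| |S_{22}|) = \prod_i (1 - \mu_i^2)$ and $\lambda_i = \mu_i^2 / (1 - \mu_i^2)$ hold numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, q = 200, 2, 3
# Correlated simulated data: independent normals mixed by a random matrix.
data = rng.normal(size=(n, r + q)) @ rng.normal(size=(r + q, r + q))
S = np.cov(data, rowvar=False)
S11, S12 = S[:r, :r], S[:r, r:]
S21, S22 = S[r:, :r], S[r:, r:]

H = S21 @ np.linalg.solve(S11, S12)        # "hypothesis" matrix
E = S22 - H                                # "error" matrix

# Squared canonical correlations: eigenvalues of S22^{-1} H = (E + H)^{-1} H.
mu2 = np.sort(np.linalg.eigvals(np.linalg.solve(S22, H)).real)[::-1]
# Roots used by the MANOVA statistics: eigenvalues of E^{-1} H.
lam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1]

wilks_det = np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22))
wilks_roots = np.prod(1.0 - mu2)

print("Wilks (determinants):", wilks_det)
print("Wilks (roots):       ", wilks_roots)
print("lambda_i:", lam, " mu_i^2 / (1 - mu_i^2):", mu2 / (1.0 - mu2))
```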
8.3.2 Comparisons
Of all the statistics discussed, Wilks’s lambda has been the most widely applied. One important reason
for this is that this statistic has the virtue of being convenient to use and, more importantly,
being related to the Likelihood Ratio Test! Despite the above, the fact that so many different
statistics exist for the same hypothesis testing problem indicates that there is no universally
best test. Power comparisons of the above tests are almost lacking since the distribution of the
statistic under alternatives is hardly known.
8.4 Software
In SAS, both PROC GLM and PROC REG can conduct analysis and perform hypothesis tests. In R,
use stats::lm.
8.5 Examples
Example 8.3. Multivariate linear modelling of the Fitness dataset.
8.6 Additional resources
An alternative presentation of these concepts can be found in JW Ch. 7.
UNSW MATH5855 2021T3 Lecture 9 Tests of a Covariance Matrix
9 Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
Previously, we developed a number of techniques for decomposing and analysing covariance
matrices and their properties. Here, we develop a general family of tests for their structure,
which will let you specify almost arbitrary tests for the covariance structure of a multivariate
normal population.
9.1 Test of Σ = Σ0
We start with this simpler case since ideas are more transparent. The practically more relevant
cases are about comparing covariance matrices of two or more multivariate normal populations
but the derivations of the latter tests are more subtle. For these we will only formulate the final
results.
Assume now that we have the sample $X_1, X_2, \dots, X_n$ from a $N_p(\mu, \Sigma)$ distribution and we
would like to test $H_0 : \Sigma = \Sigma_0$ against the alternative $H_1 : \Sigma \ne \Sigma_0$. Obviously the problem
can be easily transformed into testing $\bar{H}_0 : \Sigma = I_p$ since otherwise we can consider the modified
observations $Y_i = \Sigma_0^{-1/2} X_i$ which under $H_0$ will be multivariate normal with a covariance matrix
being equal to $I_p$. Therefore we can assume that $X_1, X_2, \dots, X_n$ is a sample from a $N_p(\mu, \Sigma)$
and we want to test $H_0 : \Sigma = I_p$ versus $H_1 : \Sigma \ne I_p$.
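We will derive the likelihood ratio test for this problem. The likelihood function is
$$\begin{aligned} L(x; \mu, \Sigma) &= (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} e^{-\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)} \\ &= (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} e^{-\frac{1}{2} \mathrm{tr}[\Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top]}. \end{aligned}$$
Under the hypothesis $H_0$, the maximum of the likelihood function is obtained when $\hat\mu = \bar{x}$.
Under the alternative we have to maximise with respect to both $\mu$ and $\Sigma$ and we know from
Section 3.1.2 that the maximum of the likelihood function is obtained for $\hat\mu = \bar{x}$ and
$\hat\Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$. Then we obtain easily the likelihood ratio
$$\Lambda = \frac{\max_\mu L(x; \mu, I_p)}{\max_{\mu, \Sigma} L(x; \mu, \Sigma)} = \frac{e^{-\frac{1}{2} \mathrm{tr} V}}{|V|^{-\frac{n}{2}} n^{\frac{np}{2}} e^{-\frac{np}{2}}}$$
where $V = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$. Therefore
$$-2 \log \Lambda = np \log n - n \log |V| + \mathrm{tr} V - np, \tag{9.1}$$
and according to the asymptotic theory the quantity in (9.1) is asymptotically distributed as
$\chi^2_{p(p+1)/2}$ (the degrees of freedom being the difference of the number of free parameters under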
the alternative and under the hypothesis). This test would reject H0 if the value of the −2 log Λ
statistic is significantly large.
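Formula (9.1), together with the reduction from $\Sigma_0$ to $I_p$, can be sketched as follows. The function name and the optional `sigma0` argument are our own; the statistic would then be compared with a $\chi^2_{p(p+1)/2}$ quantile obtained from tables or a statistics library.

```python
import numpy as np

def lr_test_identity_covariance(x, sigma0=None):
    """-2 log Lambda of (9.1) and its asymptotic chi-squared degrees of freedom.

    x : (n, p) data matrix with observations in rows.
    sigma0 : optional hypothesised covariance matrix (default: identity).
    """
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    if sigma0 is not None:
        # Reduce H0: Sigma = Sigma0 to H0: Sigma = I_p via Y_i = Sigma0^{-1/2} X_i.
        w, v = np.linalg.eigh(np.asarray(sigma0, dtype=float))
        x = x @ (v @ np.diag(w ** -0.5) @ v.T)
    centred = x - x.mean(axis=0)
    V = centred.T @ centred
    _, logdet_V = np.linalg.slogdet(V)
    stat = n * p * np.log(n) - n * logdet_V + np.trace(V) - n * p
    df = p * (p + 1) // 2
    return stat, df

# Simulated example: data generated with Sigma = I_3, so H0 is true here.
rng = np.random.default_rng(11)
sample = rng.normal(size=(100, 3))
stat, df = lr_test_identity_covariance(sample)
print("-2 log Lambda =", stat, "on", df, "degrees of freedom")
```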
9.2 Sphericity test
Further, it is more realistic to assume that the structure of the covariance matrix is only known
up to some constant. Having in mind the discussion in the beginning of Section 9.1, we can
assume without loss of generality that we test $H_0 : \Sigma = \sigma^2 I_p$ against a general alternative. This test
has the name “sphericity test”. The likelihood ratio test can be developed in a manner similar
to the previous case (do it (!)) and the final result is that
$$-2 \log \Lambda = np \log(n \hat\sigma^2) - n \log |V|.$$
Here, $\hat\sigma^2 = \frac{1}{np} \sum_{i=1}^n (x_i - \bar{x})^\top (x_i - \bar{x})$. The asymptotic distribution of $np \log(n \hat\sigma^2) - n \log |V|$
under the null hypothesis will again be $\chi^2$ but the degrees of freedom are this time $\frac{p(p+1)}{2} - 1 = \frac{(p-1)(p+2)}{2}$
(WHY (?!)). Again, the hypothesis will be rejected for large values.
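The sphericity statistic is equally short to compute. This is a sketch with a made-up function name and simulated data; as before, the reference quantile would come from tables or a statistics library.

```python
import numpy as np

def sphericity_lr_statistic(x):
    """-2 log Lambda for H0: Sigma = sigma^2 I_p and its asymptotic df."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    centred = x - x.mean(axis=0)
    V = centred.T @ centred
    # sigma2_hat = (1 / (n p)) * sum_i (x_i - xbar)'(x_i - xbar) = tr(V) / (n p)
    sigma2_hat = np.trace(V) / (n * p)
    _, logdet_V = np.linalg.slogdet(V)
    stat = n * p * np.log(n * sigma2_hat) - n * logdet_V
    df = (p - 1) * (p + 2) // 2
    return stat, df

# Simulated example: spherical data with sigma = 2, so H0 is true here.
rng = np.random.default_rng(12)
sample = 2.0 * rng.normal(size=(80, 4))
stat, df = sphericity_lr_statistic(sample)
print("-2 log Lambda =", stat, "on", df, "degrees of freedom")
```

Note that the statistic is unchanged if all observations are multiplied by the same constant, as it should be for a hypothesis that leaves $\sigma^2$ unspecified.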
9.3 General situation
Testing equality of covariance matrices from $k$ different multivariate normal populations $N_p(\mu_i, \Sigma_i)$,
$i = 1, 2, \dots, k$, is a very important problem, especially in discriminant analysis and multivariate
analysis of variance. Let
k be the number of populations;
p the dimension of the vector;
n the total sample size, $n = n_1 + n_2 + \cdots + n_k$,
$n_i$ being the sample size for each population.
The analysis of deviance test statistic that results is
$$-2 \log \frac{\prod_{i=1}^k |\hat\Sigma_i|^{n_i / 2}}{|\hat\Sigma_{\text{pooled}}|^{n / 2}},$$
with $\hat\Sigma_i$ the MLE sample variance (with denominator $n_i$ as opposed to $n_i - 1$) of population $i$,
and $\hat\Sigma_{\text{pooled}} = \frac{1}{n} \sum_{i=1}^k n_i \hat\Sigma_i$, asymptotically distributed $\chi^2_{(k-1)p(p+1)/2}$.
It has been noticed that this test has the defect that it is (asymptotically) biased: that is, the
probability of rejecting H0 when H0 is false can be smaller than the probability of rejecting H0
when H0 is true (i.e., it may happen that in some points of the parameter space the probability
of a correct decision is smaller than the probability for a wrong decision). Hence it is desirable
to modify it to make it asymptotically unbiased.
Further let $N = n - k$ and $N_i = n_i - 1$. Under the null hypothesis of equality of all $k$
covariance matrices, the statistic
$$-2 \rho \log \frac{\prod_{i=1}^k |S_i|^{N_i / 2}}{|S_{\text{pooled}}|^{N / 2}}, \tag{9.2}$$
for $\rho = 1 - \left[\left(\sum_{i=1}^k \frac{1}{N_i}\right) - \frac{1}{N}\right] \frac{2p^2 + 3p - 1}{6(p+1)(k-1)}$, $S_i$ the sample variance (with $n_i - 1$ denominator) of
population $i$, and $S_{\text{pooled}} = \frac{1}{N} \sum_{i=1}^k N_i S_i$, is asymptotically distributed as $\chi^2_{(k-1)p(p+1)/2}$. Large
values of the statistic are significant and lead to the rejection of the hypothesis about equality
of the k covariance matrices.
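A direct transcription of (9.2) is sketched below, using $-2 \log \left( \prod_i |S_i|^{N_i/2} / |S_{\text{pooled}}|^{N/2} \right) = N \log |S_{\text{pooled}}| - \sum_i N_i \log |S_i|$. The function name `box_m` and the simulated samples are our own illustrative choices; packaged implementations are listed in Section 9.4.

```python
import numpy as np

def box_m(samples):
    """Bartlett-corrected modified LR statistic (9.2).

    samples : list of (n_i, p) data matrices, one per population.
    Returns (statistic, degrees of freedom, rho).
    """
    k = len(samples)
    p = samples[0].shape[1]
    Ni = np.array([x.shape[0] - 1 for x in samples], dtype=float)
    N = Ni.sum()                                   # N = n - k
    Si = [np.cov(x, rowvar=False) for x in samples]  # n_i - 1 denominators
    S_pooled = sum(w * S for w, S in zip(Ni, Si)) / N

    def logdet(m):
        return np.linalg.slogdet(m)[1]

    m_stat = N * logdet(S_pooled) - sum(w * logdet(S) for w, S in zip(Ni, Si))
    rho = 1.0 - (np.sum(1.0 / Ni) - 1.0 / N) * (2 * p**2 + 3 * p - 1) / (
        6.0 * (p + 1) * (k - 1)
    )
    df = (k - 1) * p * (p + 1) // 2
    return rho * m_stat, df, rho

# Simulated example: two populations with different spreads.
rng = np.random.default_rng(13)
x1 = rng.normal(size=(30, 3))
x2 = 1.5 * rng.normal(size=(40, 3))
stat, df, rho = box_m([x1, x2])
print("statistic =", stat, "df =", df, "rho =", rho)
```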
In the following, we will avoid the subtle details and refer to Chapter 8 of the monograph
Muirhead, R. (1982) Aspects of Multivariate Statistical Theory. Wiley, New York.
The modified LR is achieved by replacing $n_i$ and $n$ by $N_i$ and $N$ (that is, by the correct degrees
of freedom). We note that indeed $\rho = 1 - \left[\left(\sum_{i=1}^k \frac{1}{N_i}\right) - \frac{1}{N}\right] \frac{2p^2 + 3p - 1}{6(p+1)(k-1)}$ is close to 1 anyway if all
sample sizes $n_i$ are very large. Finally, the scaling of the test statistic by $\rho$
that is made in (9.2) serves to improve the quality of the asymptotic approximation
of the statistic by the limiting $\chi^2_{\frac{1}{2}(k-1)p(p+1)}$ distribution. Such (asymptotically negligible) scalar
transformations of the LR statistic that yield an improved test statistic with a chi-squared null
distribution of order O(1/n) instead of the ordinary O(1) for the standard LR, are known in the
literature under the common name Bartlett corrections. Thus (9.2) is a Bartlett-corrected
version of the modified LR statistic.
9.4 Software
SAS: PROC CALIS, PROC DISCRIM (option)
R: heplots::boxM, MVTests::BoxM
The statistic (9.2) is the one that is implemented in software packages.
9.5 Exercises
Exercise 9.1
Follow the discussion about the sphericity test. Argue that if $\hat\lambda_i$, $i = 1, 2, \dots, p$, denote the
eigenvalues of the empirical covariance matrix $S$ then
$$-2 \log \Lambda = np \log \frac{\text{arithm. mean } \hat\lambda_i}{\text{geom. mean } \hat\lambda_i}.$$
Of course, the above statistic is asymptotically $\chi^2_{(p+2)(p-1)/2}$ distributed under $H_0$ since it only
represents the sphericity test in a different form.
Exercise 9.2
Show that the likelihood ratio test of
H0 : Σ is a diagonal matrix
rejects $H_0$ when $-n \log |R|$ is larger than $\chi^2_{1-\alpha, \, p(p-1)/2}$. (Here $R$ is the empirical correlation
matrix, $p$ is the dimension of the multivariate normal and $n$ is the sample size.)
UNSW MATH5855 2021T3 Lecture 10 Factor Analysis
10 Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.4.1 The principal component solution of the factor model
10.4.2 The Principal Factor Solution
10.5 Software
10.6 Examples
10.7 Additional resources
Let $Y_i$, $i = 1, 2, \dots, n$, be independent $N_p(\mu, \Sigma)$ variables (think of the $Y_i$s as the results of a
battery of $p$ tests applied to the $i$th individual). Fundamental assumption in factor analysis:
$$Y_i = \Lambda f_i + e_i \tag{10.1}$$
$\Lambda \in M_{p,k}$ factor loading matrix (full rank);
$f_i \in \mathbb{R}^k$ ($k < p$) factor variable. The components of $f_i$ are thought to be the (latent) factors.
Usually $f_i$ are taken to be independent $N(\alpha, I_k)$ (i.e., “orthogonal”) but “oblique”
factors, with a covariance matrix $\ne I_k$, are also considered sometimes.
$e_i$ independent $N_p(\theta, \Sigma_e)$ with $\Sigma_e$ diagonal, i.e., $\Sigma_e = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2)$.
Also, the $e$s are independent of the $f$s.
Then,
$$\mu = \Lambda \alpha + \theta; \quad \Sigma = \Lambda \Lambda^\top + \Sigma_e,$$
or, componentwise:
$$\mathrm{Var}(Y_{ir}) = \sum_{j=1}^k \lambda_{rj}^2 + \sigma_r^2 = \text{communality} + \text{uniqueness},$$
$$\mathrm{Cov}(Y_{ir}, Y_{is}) = \sum_{j=1}^k \lambda_{rj} \lambda_{sj} \quad (r \ne s).$$
The fundamental idea of factor analysis is to describe the covariance relationships among
many variables (p “large”) in terms of few (k “small”) underlying, not observable (latent)
random quantities (the factors). The model is motivated by the following argument: suppose
variables can be grouped by their correlations. That is, all variables in a particular group are
highly correlated among themselves but have relatively small correlations with variables in a
different group. It is then quite reasonable to assume that each group of variables represents a
single underlying construct (factor) that is “responsible” for the observed correlations.
Important notes
• The model (10.1) is similar to a linear regression model but the key differences are that $f_i$
are random and are not observable.
• If we knew $\Lambda$ (or had found an estimate of it), then using properties of orthogonal
projections on the linear space spanned by the columns of $\Lambda$, we would get:
$$\hat\alpha = (\Lambda^\top \Lambda)^{-1} \Lambda^\top \bar{Y}; \quad \hat\theta = \bar{Y} - \Lambda \hat\alpha.$$
Because of the above observation, we can consider only $\mu$, $\Lambda$, and $\sigma_i^2$, $i = 1, 2, \dots, p$,
as unknown parameters when parameterising the factor analysis model. Note also that
primary interest in factor analysis is focused on estimating $\Lambda$.
• There is a fundamental indeterminacy in this model even when we require that
$\mathrm{Var}(f) = I_k$ since, if $P \in M_{k,k}$ is any orthogonal matrix then obviously
$$\Lambda \Lambda^\top = \Lambda P (\Lambda P)^\top; \quad \Lambda f_i = (\Lambda P)(P^\top f_i).$$
Hence replacing $\Lambda$ by $\Lambda P$ and $f_i$ by $P^\top f_i$ leads to the same equations.
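The indeterminacy is easy to see numerically. In the sketch below the loading matrix, the uniquenesses and the rotation angle are made-up numbers; any orthogonal $P$ would do.

```python
import numpy as np

# Hypothetical loadings for p = 4 variables on k = 2 factors.
Lambda = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.7],
    [0.1, 0.8],
])
Sigma_e = np.diag([0.18, 0.32, 0.47, 0.35])      # hypothetical uniquenesses
Sigma = Lambda @ Lambda.T + Sigma_e              # implied covariance matrix

# An orthogonal matrix P: rotation of the factor axes by 30 degrees.
angle = np.deg2rad(30.0)
P = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
Lambda_rot = Lambda @ P

Sigma_rot = Lambda_rot @ Lambda_rot.T + Sigma_e
communalities = (Lambda ** 2).sum(axis=1)
communalities_rot = (Lambda_rot ** 2).sum(axis=1)

print("same covariance:", np.allclose(Sigma, Sigma_rot))
print("same communalities:", np.allclose(communalities, communalities_rot))
print("loadings differ:", not np.allclose(Lambda, Lambda_rot))
```

The individual loadings change under rotation while the implied covariance matrix and the communalities do not, which is what rotation methods such as varimax exploit.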
10.1 ML Estimation
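The likelihood function for the $n$ observations $Y_1, Y_2, \dots, Y_n \in \mathbb{R}^p$ is
$$\begin{aligned} L(Y; \mu, \Lambda, \sigma_1^2, \sigma_2^2, \dots, \sigma_p^2) &= (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\left[-\frac{1}{2} \sum_{i=1}^n (Y_i - \mu)^\top \Sigma^{-1} (Y_i - \mu)\right] \\ &= (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\left[-\frac{n}{2} \left(\mathrm{tr}(\Sigma^{-1} S) + (\bar{Y} - \mu)^\top \Sigma^{-1} (\bar{Y} - \mu)\right)\right] \end{aligned}$$
with $S = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^\top$. Taking $\log L$, we get:
$$\log L(Y; \mu, \Lambda, \sigma_1^2, \sigma_2^2, \dots, \sigma_p^2) = -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{n}{2} \left[\mathrm{tr}(\Sigma^{-1} S) + (\bar{Y} - \mu)^\top \Sigma^{-1} (\bar{Y} - \mu)\right].$$
After differentiating w.r.t. $\mu$,
$$\frac{\partial \log L}{\partial \mu} = n \Sigma^{-1} (\bar{Y} - \mu) = 0 \implies \hat\mu = \bar{Y}.$$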
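It remains to estimate $\Lambda$ and $\Sigma_e$ by minimising:
$$Q = \frac{1}{2} \log |\Lambda \Lambda^\top + \Sigma_e| + \frac{1}{2} \mathrm{tr}[(\Lambda \Lambda^\top + \Sigma_e)^{-1} S].$$
To implement the minimisation of $Q$ we use the following rules for matrix differentiation:
$$\frac{\partial}{\partial \Lambda} \log |\Lambda \Lambda^\top + \Sigma_e| = 2 (\Lambda \Lambda^\top + \Sigma_e)^{-1} \Lambda \tag{10.2}$$
$$\frac{\partial}{\partial A} \mathrm{tr}(A^{-1} B) = -(A^{-1} B A^{-1})^\top. \tag{10.3}$$
Applying (10.3) and the chain rule we get:
$$\frac{\partial}{\partial \Lambda} \mathrm{tr}[(\Lambda \Lambda^\top + \Sigma_e)^{-1} S] = -2 (\Lambda \Lambda^\top + \Sigma_e)^{-1} S (\Lambda \Lambda^\top + \Sigma_e)^{-1} \Lambda.$$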
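Hence after substitution:
$$\begin{aligned}
\frac{\partial}{\partial \Lambda}Q &= (\Lambda\Lambda^\top + \Sigma_e)^{-1}\Lambda - (\Lambda\Lambda^\top + \Sigma_e)^{-1}S(\Lambda\Lambda^\top + \Sigma_e)^{-1}\Lambda \\
&= (\Lambda\Lambda^\top + \Sigma_e)^{-1}[\Lambda\Lambda^\top + \Sigma_e - S](\Lambda\Lambda^\top + \Sigma_e)^{-1}\Lambda = 0.
\end{aligned} \tag{10.4}$$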
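The Woodbury matrix identity gives
$$(\Lambda\Lambda^\top + \Sigma_e)^{-1} = \Sigma_e^{-1} - \Sigma_e^{-1}\Lambda(I + \Lambda^\top\Sigma_e^{-1}\Lambda)^{-1}\Lambda^\top\Sigma_e^{-1}. \tag{10.5}$$
Hence from (10.4) and (10.5) we get
$$[\Lambda\Lambda^\top + \Sigma_e - S]\Sigma_e^{-1}\Lambda\{I - (I + \Lambda^\top\Sigma_e^{-1}\Lambda)^{-1}\Lambda^\top\Sigma_e^{-1}\Lambda\} = 0. \tag{10.6}$$
Since the matrix in the curly brackets in (10.6) has full rank, we get
$$[\Lambda\Lambda^\top + \Sigma_e - S]\Sigma_e^{-1}\Lambda = 0,$$
or, equivalently,
$$S\Sigma_e^{-1}\Lambda = \Lambda(I + \Lambda^\top\Sigma_e^{-1}\Lambda).$$
The latter can also be written as
$$(\Sigma_e^{-1/2}S\Sigma_e^{-1/2})\Sigma_e^{-1/2}\Lambda = \Sigma_e^{-1/2}\Lambda(I + \Lambda^\top\Sigma_e^{-1}\Lambda). \tag{10.7}$$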
To find a particular solution, we require $\Lambda^\top\Sigma_e^{-1}\Lambda$ to be diagonal. Then (10.7) implies that
the matrix $\Sigma_e^{-1/2}\Lambda$ has as its columns k eigenvectors of $\Sigma_e^{-1/2}S\Sigma_e^{-1/2}$, corresponding to k of its
eigenvalues. A more subtle analysis shows that to obtain the minimum value of Q these have to
be the eigenvectors that correspond to the largest eigenvalues of $\Sigma_e^{-1/2}S\Sigma_e^{-1/2}$.
Based on this fact, the following iterative solution (due to Lawley) has been proposed. It
can be described algorithmically as follows:
1. With an initial guess $\tilde{\Sigma}_e$, calculate $\tilde{\Sigma}_e^{-1/2}\tilde{\Lambda}$ by using the eigenvectors of the k largest
eigenvalues of $\tilde{\Sigma}_e^{-1/2}S\tilde{\Sigma}_e^{-1/2}$.
2. Then from $\tilde{\Sigma}_e^{-1/2}\tilde{\Lambda}$, get a (first iteration) value for $\tilde{\Lambda}$.
3. With this value of $\tilde{\Lambda}$ we can calculate the value of
$\tilde{Q}(\tilde{\Sigma}_e) = \frac{1}{2}\log|\tilde{\Lambda}\tilde{\Lambda}^\top + \tilde{\Sigma}_e| + \frac{1}{2}\operatorname{tr}[(\tilde{\Lambda}\tilde{\Lambda}^\top + \tilde{\Sigma}_e)^{-1}S]$
(which is the value of the functional). This functional only depends on the p
nonzero values of $\tilde{\Sigma}_e$ and there are several powerful numerical procedures to find its minimum.
4. If it is achieved at $\Sigma_e^*$, then update $\tilde{\Sigma}_e$ with the new guess $\Sigma_e^*$ and repeat from Step 1 to
convergence.
10.2 Hypothesis testing under multivariate normality assumption
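The most interesting hypothesis is $H_0$: k factors against $H_1$: $\ne k$ factors. The maximised log-likelihoods are
$$\log L_1 = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|S| - \frac{np}{2}$$
$$\log L_0 = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\hat{\Sigma}| - \frac{n}{2}\operatorname{tr}(\hat{\Sigma}^{-1}S)$$
(where $\hat{\Sigma} = \hat{\Lambda}\hat{\Lambda}^\top + \hat{\Sigma}_e$). Hence $-2\log\frac{L_0}{L_1} = n[\log|\hat{\Sigma}| - \log|S| + \operatorname{tr}(\hat{\Sigma}^{-1}S) - p]$. The asymptotic
distribution of this statistic is $\chi^2$ with $df = \frac{p(p+1)}{2} - [pk + p - \frac{k(k-1)}{2}] = \frac{1}{2}[(p-k)^2 - p - k]$. Why?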
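The degrees-of-freedom formula is easy to tabulate. The sketch below evaluates the first form of the expression above for a few values of k; the function name `fa_lrt_df` is a hypothetical helper, not part of any package.

```python
def fa_lrt_df(p, k):
    """Degrees of freedom p(p+1)/2 - [pk + p - k(k-1)/2] for a k-factor model."""
    n_cov = p * (p + 1) // 2              # distinct elements of a p x p covariance matrix
    n_fa = p * k + p - k * (k - 1) // 2   # the bracketed term in the formula
    return n_cov - n_fa

# With p = 5 variables, the degrees of freedom become negative at k = 3.
for k in (1, 2, 3):
    print(k, fa_lrt_df(5, k))
```

A negative value signals that the model has more free parameters than there are distinct elements in the covariance matrix.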
10.3 Varimax method of rotating the factors
If $\hat{\Lambda}_0$ is the estimated factor loading matrix obtained by the ML method, we know that $\hat{\Lambda} = \hat{\Lambda}_0 P$
with any orthogonal $P \in M_{k,k}$ can be used instead. How do we choose a particular P such that $\hat{\Lambda}$
has some desirable properties?
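Let $d_r = \sum_{i=1}^p \lambda_{ir}^2$; then the varimax method of rotating the factors consists in choosing
P to maximise
$$S_d = \sum_{r=1}^k\Big\{\sum_{i=1}^p\Big(\lambda_{ir}^2 - \frac{d_r}{p}\Big)^2\Big\} = \sum_{r=1}^k\Big\{\sum_{i=1}^p \lambda_{ir}^4 - \frac{\big(\sum_{i=1}^p \lambda_{ir}^2\big)^2}{p}\Big\}.$$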
This corresponds to the wish that, in each column of factor loadings, some of the entries
be “very large” and the rest “very small” (in absolute value). An iterative solution to the
above rotation problem exists.
Note: Rotation of factor loadings is particularly recommended for loadings obtained by
the ML method since the initial values of $\hat{\Lambda}_0$ are constrained to satisfy the condition that
$\hat{\Lambda}_0^\top\Sigma_e^{-1}\hat{\Lambda}_0$ be diagonal. This is convenient for computational purposes but may not lead to
easily interpretable factors.
10.4 Relationship to Principal Component Analysis
There are different ways in which you can relate Factor analysis to Principal Component analysis.
We will discuss two of them here.
10.4.1 The principal component solution of the factor model
Starting with the matrix
$$S = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^\top$$
we can write down its spectral decomposition by using all of its p eigenvalues and eigenvectors.
In this way we would obtain a perfect reconstruction of S, but since it uses p factors, it
delivers no dimension reduction and is useless for our purposes. We would prefer to
employ a smaller number k of eigenvalues and eigenvectors of S and to get only an approximate
reconstruction of S:
$$S \approx \sum_{i=1}^k \tau_i a_i a_i^\top = \Lambda\Lambda^\top,$$
where $\tau_i$ are the characteristic roots of S, taking the k largest ones (w.l.o.g. $\tau_1, \tau_2, \ldots, \tau_k$),
and $a_i$ are their corresponding eigenvectors, so that the columns of $\Lambda$ are $\sqrt{\tau_i}\,a_i$. Since the
understanding is that (if k is the right number of factors) all communalities have been taken into
account, $s_{ii} - \sum_{j=1}^k \lambda_{ij}^2$ would be the estimators of the uniquenesses. This approach shows
that the k factors have been extracted from S in the same way as principal components are
calculated. The method is called the
principal component solution of the factor model.
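As a small illustration of the principal component solution, the sketch below extracts k = 1 factor from a hypothetical 3 x 3 correlation matrix. Power iteration is used here only to keep the example free of external libraries; it is an assumption of this sketch, not the method used by any particular package.

```python
import math

def top_eigenpair(S, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix (power iteration)."""
    p = len(S)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    tau = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
    return tau, v

# Hypothetical sample correlation matrix (illustrative values only).
S = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]

tau, a = top_eigenpair(S)
loadings = [math.sqrt(tau) * ai for ai in a]                   # Lambda = sqrt(tau_1) a_1
uniquenesses = [S[i][i] - loadings[i] ** 2 for i in range(3)]  # s_ii - lambda_i1^2
print(loadings)
print(uniquenesses)
```

By construction the diagonal of S is reproduced exactly, while the off-diagonal elements are only approximated by $\Lambda\Lambda^\top$.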
10.4.2 The Principal Factor Solution
This is yet another method that uses similar ideas from principal components analysis. It is
similar to the principal component solution, but the factor extraction is not performed directly
on S. To describe it, let us assume for a moment that the uniquenesses are known (or can be
estimated reasonably well) and we can decompose
$$S = S_r + \Sigma_e,$$
where the number k of factors is known and $\Sigma_e$ is the diagonal matrix containing the
uniquenesses. Then the factor analysis model states that (an estimate of) $\Lambda$ should satisfy
$$S_r = S - \Sigma_e = \Lambda\Lambda^\top.$$
Hence an estimate of $\Lambda$ can be found by performing principal component analysis on $S_r$:
if $S_r = \sum_{i=1}^p t_i b_i b_i^\top$, with $t_i$ being the characteristic roots of $S_r$, take the k largest ones (w.l.o.g.
$t_1, t_2, \ldots, t_k$). Denote
$$B = \begin{pmatrix} b_1 & b_2 & \cdots & b_k \end{pmatrix}; \qquad \Delta = \operatorname{diag}(t_1, t_2, \ldots, t_k).$$
Then $\hat{\Lambda} = B\Delta^{1/2}$. This can also be done iteratively!
This approach has some problems:
i) There is no reliable estimate of $\Sigma_e$ available. (The most commonly used one in the case
where S is the correlation matrix R is $\sigma_{ei}^2 = 1/r^{ii}$, where $r^{ii}$ is the ith diagonal element of
$R^{-1}$.)
ii) How to select k?
Note: The methods in Section 10.4.2 are not efficient compared to the ML method and, in
general, the ML method is the preferred one. However, for the ML method one has to
assume normality, and the alternative approaches described here are used in cases where
multivariate normality is in serious doubt. Most often in practice, k is chosen
by combining subject matter knowledge, “reasonableness” of results, and the
proportion of variance explained.
10.5 Software
SAS
As you might expect, factor analysis is implemented in PROC FACTOR. Some remarks:
• if you want to extract different numbers of factors (the example below shows how to extract
n = 2 factors), you should run the procedure once for each number of factors;
• the communalities need a preliminary estimate. If one considers the correlation matrix
instead of Σ, then the communalities can be estimated by the squared multiple correlations
of each of the variables with the rest (these communality estimates are used to get
preliminary estimates of the uniquenesses to start the iteration process). If in the iteration
process a communality estimate exceeds 1, the case is referred to as an
ultra-Heywood case, and the heywood option sets such a communality to one, thus allowing
iterations to continue;
• the scree option can be used to produce a plot of the eigenvalues of Σ that is helpful in
deciding how many factors to use;
• besides method=ml you can use method=principal;
• with the ML method option, Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian
Criterion are included. These can be used to choose the “best” number of parameters
to include in a model (in case more than one model is acceptable). The number of factors
that yields the smallest AIC is considered “best”.
R
Function stats::factanal() is the built-in implementation. Package psych contains additional
functions and utilities, as well as its own implementation, psych::fa(), with a number of
model selection tools. Package nFactors contains utilities for determining the number of factors
(e.g., scree plots).
10.6 Examples
Example 10.1. Data on five socioeconomic variables for 12 census tracts in the Los Angeles
area. The five variables represent total population, median school years, total unemployment,
miscellaneous professional services, and median house value. Use the ML method and varimax
rotation.
• Try to run the above model with n = 3 factors. The message “WARNING: Too many
factors for a unique solution” appears. This is not surprising as the number of parameters
in the model will exceed the number of elements in Σ ($\frac{1}{2}[(p-k)^2 - p - k] = -2$ for p = 5, k = 3). In this
example you can run the procedure for n = 1 and for n = 2 only (do it!) and you will see
that n = 2 gives an adequate representation.
• Try using psych::fa.parallel() to search for the optimal number of factors.
10.7 Additional resources
An alternative presentation of these concepts can be found in JW Ch. 9.
UNSW MATH5855 2021T3 Lecture 11 Structural Equation Modelling
11 Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
Factor analysis (FA) is only one example of a new approach to data analysis which is not
based on the individual observations. We were not able to use the regression approach
since the input factors were latent (not observable), so there were too many unknowns. Instead, we
analysed the covariance matrix Σ (and its estimator S), which involves the actual parameters
of interest, $\sigma_i^2$ and Λ. That is, we switched from the level of individual observations to
analysing covariance matrices instead. There is a series of methods which are based on analysis
of covariances rather than individual cases. Instead of minimising functions of observed and
predicted individual values, we minimise the differences between sample covariances and
covariances predicted by the model.
The fundamental hypothesis in these analyses is
H0 : Σ = Σ(θ) against H1 : Σ ̸= Σ(θ).
Here Σ has p(p + 1)/2 unknown elements (estimated by S) but these are assumed to be
reproducible by just k = dim(θ) < p(p + 1)/2 parameters. Note that more generally we
could consider fitting means and covariances, or means and covariances and higher
moments to a given structure. Regression analysis with random inputs, simultaneous
equations systems, confirmatory factor analysis, canonical correlations, (M)ANOVA
can be considered special cases.
Structural equation modelling is an important statistical tool in economics and behavioural
sciences. Structural equations express relationships among several variables that can be either
directly observed variables (manifest variables) or unobserved hypothetical variables (latent
variables). In structural models, as opposed to functional models, all variables are taken to be
random rather than having fixed levels. In addition, for maximum likelihood estimation and
generalised least squares estimation (see below), the random variables are assumed to have an
approximately multivariate normal distribution. Hence you are advised to remove outliers and
consider transformations to normality before fitting.
11.1 General form of the model
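$$\eta = B\eta + \Gamma\xi + \zeta. \tag{11.1}$$
Here,
$\eta \in \mathbb{R}^m$ is the vector of output latent variables;
$\xi \in \mathbb{R}^{n'}$ is the vector of input latent variables;
$B \in M_{m,m}$, $\Gamma \in M_{m,n'}$ are coefficient matrices;
Note: $(I - B)$ is assumed to be nonsingular.
$\zeta \in \mathbb{R}^m$ is the disturbance vector with $E\,\zeta = 0$.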
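To this modelling equation (11.1) we attach 2 measurement equations:
$$Y = \Lambda_Y\eta + \epsilon; \tag{11.2}$$
$$X = \Lambda_X\xi + \delta; \tag{11.3}$$
$Y \in \mathbb{R}^p$, $X \in \mathbb{R}^q$; $\Lambda_Y \in M_{p,m}$, $\Lambda_X \in M_{q,n'}$,
with $\epsilon \in \mathbb{R}^p$, $\delta \in \mathbb{R}^q$ zero-mean measurement errors. These errors are assumed to be uncorrelated
with ξ and ζ and with each other.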
[Figure: path diagram of the generative model for X and Y, showing the variables Y, X, ϵ, δ, η, ξ, ζ and the coefficient matrices B, Γ, ΛX, ΛY.]
The above quite general model (11.1)–(11.2)–(11.3) is called the Keesling–Wiley–Jöreskog model.
Its interpretation is that the input and output latent variables ξ and η are connected by a system
of linear equations (the structural model (11.1)) with coefficient matrices B and Γ and an error
vector ζ. The random vectors Y and X represent the observable vectors (measurements).
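The implied covariance matrix for this model can be obtained. Let
$$\operatorname{Var}(\xi) = \Phi; \quad \operatorname{Var}(\zeta) = \Psi; \quad \operatorname{Var}(\epsilon) = \theta_\epsilon; \quad \operatorname{Var}(\delta) = \theta_\delta.$$
Then,
$$\Sigma = \Sigma(\theta) = \begin{pmatrix} \Sigma_{YY}(\theta) & \Sigma_{YX}(\theta) \\ \Sigma_{XY}(\theta) & \Sigma_{XX}(\theta) \end{pmatrix}
= \begin{pmatrix} \Lambda_Y(I-B)^{-1}(\Gamma\Phi\Gamma^\top + \Psi)[(I-B)^{-1}]^\top\Lambda_Y^\top + \theta_\epsilon & \Lambda_Y(I-B)^{-1}\Gamma\Phi\Lambda_X^\top \\ \Lambda_X\Phi\Gamma^\top[(I-B)^{-1}]^\top\Lambda_Y^\top & \Lambda_X\Phi\Lambda_X^\top + \theta_\delta \end{pmatrix}. \tag{11.4}$$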
11.2 Estimation
Under the normality assumption, we can use the MLE. Since the “data” is the estimated
covariance matrix
$$S = \frac{1}{n-1}\sum_{i=1}^n \begin{pmatrix} Y_i - \bar{Y} \\ X_i - \bar{X} \end{pmatrix}\begin{pmatrix} Y_i - \bar{Y} \\ X_i - \bar{X} \end{pmatrix}^\top,$$
and since it is known that $(n-1)S \sim W_{p+q}(n-1, \Sigma)$, we can utilise the form of the Wishart density to derive that
$$\log L(S, \Sigma(\theta)) = \text{constant} - \frac{n-1}{2}\{\log|\Sigma(\theta)| + \operatorname{tr}[S\Sigma^{-1}(\theta)]\}.$$
This is the function that has to be maximised. Hence, to find the MLE, we minimise
$$F_{ML}(\theta) = \log|\Sigma(\theta)| + \operatorname{tr}[S\Sigma^{-1}(\theta)] - \log|S| - (p+q). \tag{11.5}$$
The function (11.5) has the advantage that $F_{ML}$ would be zero for the “saturated model” (with
$\hat{\Sigma} = S$). That is, a perfect fit is indicated by zero (and any non-perfect fit gives rise to a value of
$F_{ML}$ greater than 0).
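The behaviour of the discrepancy function (11.5) can be illustrated with 2 x 2 matrices, for which the determinant and inverse have closed forms. This is a minimal sketch with hypothetical numbers, taking p + q = 2; the helper names are illustrative and do not correspond to any SEM package.

```python
import math

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def f_ml(S, Sigma):
    """F_ML = log|Sigma| + tr(S Sigma^{-1}) - log|S| - 2 for 2 x 2 matrices."""
    Si = inv2(Sigma)
    trace = sum(S[i][j] * Si[j][i] for i in range(2) for j in range(2))
    return math.log(det2(Sigma)) + trace - math.log(det2(S)) - 2

# Hypothetical sample covariance matrix (illustrative values only).
S = [[2.0, 0.8], [0.8, 1.0]]
# A restricted model-implied matrix that forces the covariance to be zero.
Sigma_diag = [[2.0, 0.0], [0.0, 1.0]]

print(f_ml(S, S))            # saturated model: zero up to rounding
print(f_ml(S, Sigma_diag))   # restricted model: strictly positive
```

For this restricted model the value equals $-\log(1 - r^2)$, where r is the sample correlation, so the misfit grows with the correlation that the model ignores.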
11.3 Model evaluation
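Under normality, model adequacy is mostly tested by an asymptotic $\chi^2$ test. Under $H_0 : \Sigma = \Sigma(\theta)$
versus $H_1 : \Sigma \ne \Sigma(\theta)$, the statistic to be used is $T = (n-1)F_{ML}(\hat{\theta}_{ML})$ and under $H_0$, its
asymptotic distribution is $\chi^2$ with $df = \frac{(p+q)(p+q+1)}{2} - \dim(\theta)$.
Reason:
$$\begin{aligned}
\log L_0 &= \log L(S, \hat{\Sigma}_{MLE}) = \log L(S, \Sigma(\hat{\theta}_{ML})) \\
&= -\frac{n-1}{2}\{\log|\hat{\Sigma}_{MLE}| + \operatorname{tr}[S\hat{\Sigma}_{MLE}^{-1}]\} + \text{constant}; \\
\log L_1 &= \log L(S, S) = -\frac{n-1}{2}\{\log|S| + (p+q)\} + \text{constant}.
\end{aligned}$$
Then,
$$\begin{aligned}
-2\log\frac{L_0}{L_1} &= (n-1)\{\log|\hat{\Sigma}_{MLE}| + \operatorname{tr}(S\hat{\Sigma}_{MLE}^{-1}) - \log|S| - (p+q)\} \\
&= (n-1)F_{ML}(\hat{\theta}_{ML}).
\end{aligned}$$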
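A worked instance of the test may help. The sketch below assumes a hypothetical setting with p + q = 2 observed variables, n = 50 cases, and a model with two free variances and the covariance fixed at zero, for which $F_{ML}(\hat{\theta}_{ML}) = -\log(1 - r^2)$ with r the sample correlation. Because the resulting test has 1 degree of freedom, the upper tail of the $\chi^2$ distribution can be written with `math.erfc`.

```python
import math

def sem_df(n_observed, n_params):
    """df = (p+q)(p+q+1)/2 - dim(theta), with n_observed = p + q."""
    return n_observed * (n_observed + 1) // 2 - n_params

def chi2_sf_df1(t):
    """Upper tail of the chi-square distribution with 1 df: erfc(sqrt(t/2))."""
    return math.erfc(math.sqrt(t / 2.0))

n = 50                                # hypothetical sample size
r = 0.4                               # hypothetical sample correlation
f_ml_hat = -math.log(1.0 - r ** 2)    # F_ML at the MLE for this particular model
T = (n - 1) * f_ml_hat                # test statistic
df = sem_df(2, 2)                     # 3 distinct covariance elements, 2 parameters
p_value = chi2_sf_df1(T)
print(T, df, p_value)
```

With these numbers the independence model would be rejected at the usual levels, which simply reflects that a sample correlation of 0.4 is unlikely under zero covariance with n = 50.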
11.4 Some particular SEM
From the general model (11.1)–(11.2)–(11.3), we can obtain the following particular models:
A) $\Lambda_Y = I_m$, $\Lambda_X = I_{n'}$; $p = m$; $q = n'$; $\theta_\epsilon = 0$; $\theta_\delta = 0$ $\implies$ $Y = BY + \Gamma X + \zeta$ (the
classical econometric model).
B) $\Lambda_Y = I_p$, $\Lambda_X = I_q$ $\implies$ the measurement error model:
• $\eta = B\eta + \Gamma\xi + \zeta$
• $Y = \eta + \epsilon$
• $X = \xi + \delta$
C) Factor analysis models: just take the measurement part $X = \Lambda_X\xi + \delta$.
11.5 Relationship between exploratory and confirmatory FA
In EFA the number of latent variables is not determined in advance; further, the measurement
errors are assumed to be uncorrelated. In CFA a model is constructed to a great extent in
advance: the number of latent variables ξ is set by the analyst, whether a latent variable
influences an observed variable is specified, some direct effects of latent on observed values are fixed
to zero or some other constant (e.g., one), measurement errors δ may correlate, and the covariance
of latent variables can be either estimated or set to any value. In practice, the distinction between
EFA and CFA is more blurred. For instance, researchers using traditional EFA procedures may
restrict their analysis to a group of indicators that they believe are influenced by one factor. Or,
researchers with poorly fitting models in CFA often modify their model in an exploratory way
with the goal of improving fit.
11.6 Software
SAS
In SAS, the standard PROC CALIS is used for fitting Structural Equation Models, and it
has been significantly upgraded in SAS 9.3. In particular, now you can analyse means and
covariance (or even higher order) structures (instead of just covariance structures like in
the classical SEM).
R
There are two packages for SEM in R: lavaan and sem. sem is an older package, whereas
lavaan aims to provide an extensible framework for SEMs and their extensions:
• can mimic commercial packages (including those below)
• provides convenience functions for specifying simple special cases (such as CFA) but also
a more flexible interface for advanced users
• mean structures and multiple groups
• different estimators and standard errors (including robust)
• handling of missing data
• linear and nonlinear equality and inequality constraints
• categorical data support
• multilevel SEMs
• package blavaan for Bayesian estimation
• etc.
Others
Note that the general form of the SEM given here is only one possible description, due
to Karl Jöreskog. His paradigm was first implemented in the software called LISREL (Linear
Structural Relationships).
There are other equivalent descriptions due to Bentler and Weeks, to McDonald, and to some
other prominent researchers in the field. Some of them have also proposed their own software for
fitting SEMs according to their model specification. The EQS program for PC, which deals
with the Bentler/Weeks model, was very popular for a while. The latest “hit” in the area is the
program MPLUS (M is for Bengt Muthén). Muthén is a former PhD student of Jöreskog and has
been the developer of LISREL. During the last 15 years or so, however, he has developed his own
program, MPLUS. Its latest version 6 represents a fully integrated framework and is the premier
software in the area of general latent variable modelling, specifically in the behavioural sciences.
MPLUS capabilities include:
• Exploratory factor analysis
• Structural equation modelling
• Item response theory analysis
• Growth curve modelling
• Mixture modelling (latent class analysis)
• Longitudinal mixture modelling (hidden Markov, latent transition analysis, latent class
growth analysis, growth mixture analysis)
• Survival analysis (continuous- and discrete-time)
• Multilevel analysis
• Bayesian analysis
• etc.
11.7 Examples
Example 11.1. Wheaton, Muthén, Alwin, and Summers (1977) Anomie example.
UNSW MATH5855 2021T3 Lecture 12 Discrimination and Classification
12 Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
12.4.2 Rules that minimise the total probability of misclassification (TPM)
12.4.3 Bayesian approach
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ≠ Σ2)
12.5.3 Optimum error rate and Mahalanobis distance
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
12.1 Separation and Classification for two populations
Discriminant analysis and classification are widely used multivariate techniques. The goal is
either separating sets of objects (in discriminant analysis terminology) or allocating new objects
to given groups (in classification theory terminology).
Basically, discriminant analysis is more exploratory in nature than classification. However,
the difference is not significant, especially because a function that separates may
often serve as an allocator and, conversely, a rule of allocation may suggest a discriminatory
procedure. In practice, the goals in the two procedures often overlap. We will consider the case
of two populations (classes of objects) first.
Typical examples include: an anthropologist wants to classify a skull as male or female; a
patient needs to be classified as needing surgery or not needing surgery; etc.
Denote the two classes by π1 and π2. The separation is to be performed on the basis of
measurements of p associated random variables that form a vector X ∈ Rp. The observed
values of X belong to different distributions when taken from π1 and π2 and we shall denote the
densities of these two distributions by f1(x) and f2(x), respectively.
Allocation or classification is possible due to the fact that one has a learning sample at hand,
i.e., there are some measurement vectors that are known to have been generated from each of
the two populations. These measurements have been generated in earlier similar experiments.
The goal is to partition the sample space into 2 mutually exclusive regions, say R1 and R2, such
that if a new observation falls in R1, it is allocated to π1 and if it falls in R2, it is allocated to
π2.
12.2 Classification errors
There is always a chance of an erroneous classification (misclassification). Our goal will be to
develop such classification methods that in a suitably defined sense minimise the chances of
misclassification.
It should be noted that one of the two classes may have a greater likelihood of occurrence
because one of the two populations might be much larger than the other. For example, there tend
79
UNSW MATH5855 2021T3 Lecture 12 Discrimination and Classification
to be a lot more financially sound companies than bankrupt companies. These prior probabilities
of occurrence should also be taken into account when constructing an optimal classification rule
if we want to perform optimally.
In a more detailed study of optimal classification rules, cost is also important. If classifying a
π1 object to the class π2 represents a much more serious error than classifying a π2 object to the
class π1 then these cost differences should also be taken into account when designing the optimal
rule.
The conditional probabilities for misclassification are defined naturally as:
$$\Pr(2|1) = \Pr(X \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx \tag{12.1}$$
$$\Pr(1|2) = \Pr(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx \tag{12.2}$$
12.3 Summarising
We turn briefly to the question of how to summarise a classifier’s performance. Each object has
a true class membership and the one predicted by the classifier, and for a given dataset for which
true memberships are known, we may summarise the counts of the four resulting possibilities in
a contingency table called a confusion matrix, i.e.,
                  Predicted class 1                    Predicted class 2
Actual class 1    Members of 1 correctly classified    Members of 1 misclassified as 2
Actual class 2    Members of 2 misclassified as 1      Members of 2 correctly classified
A confusion matrix can be produced when there are more than two classes as well.
In the special case where there are two classes that can be meaningfully labelled as
Negative/Positive, False/True, No/Yes, Null/Alternative, or similar, it is common to use the following
terminology for them:
                   Predicted Negative     Predicted Positive
Actual Negative    True Negative (TN)     False Positive (FP)
Actual Positive    False Negative (FN)    True Positive (TP)
One can then define various performance metrics such as
sensitivity (a.k.a. recall, true positive rate (TPR)): Pr(Pred. pos. | Act. pos.) = TP/(TP + FN)
specificity (a.k.a. selectivity, true negative rate (TNR)): Pr(Pred. neg. | Act. neg.) = TN/(TN + FP)
false positive rate (a.k.a. FPR, fall-out): Pr(Pred. pos. | Act. neg.) = FP/(TN + FP) = 1 − TNR
accuracy: (TP + TN)/(TP + FP + TN + FN)
total probability of misclassification (a.k.a. TPM): 1 − accuracy
precision (a.k.a. positive predictive value): Pr(Act. pos. | Pred. pos.) = TP/(TP + FP)
negative predictive value: Pr(Act. neg. | Pred. neg.) = TN/(TN + FN)
F1 score: 2TP/(2TP + FP + FN)
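These definitions translate directly into code. The following is a minimal sketch with hypothetical counts; the function name and dictionary keys are illustrative choices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summaries of a 2 x 2 confusion matrix, following the definitions above."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts (illustrative values only).
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
m["tpm"] = 1 - m["accuracy"]   # total probability of misclassification
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that the F1 score is the harmonic mean of precision and sensitivity, which is why it involves neither TN nor the total count.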
Many classifiers return a continuous score that needs to be thresholded to produce a binary
decision (e.g., predict “Yes” if the score exceeds some constant k and “No” otherwise). It is
common practice to plot a receiver operating characteristic (ROC) curve by varying the threshold
and plotting the resulting TPR (on the vertical axis) against the FPR (on the horizontal axis);
both decrease as k increases. A perfect classifier would have a threshold at which the
curve reaches the point (0, 1), whereas a classifier whose curve is close to the line y = x is no better than chance.
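The construction of the ROC curve can be sketched as follows: for each candidate threshold k, classify as positive when the score exceeds k and record the resulting (FPR, TPR) pair. The scores and labels are hypothetical, and the function is an illustration rather than a replacement for a library routine.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from predicting positive when score > k, for increasing k."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for k in [float("-inf")] + sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > k and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > k and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores and true labels (1 = positive), illustrative only.
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

pts = roc_points(scores, labels)
for fpr, tpr in pts:
    print(fpr, tpr)
```

The list starts at (1, 1), where everything is predicted positive, and ends at (0, 0), where nothing is, with both coordinates non-increasing as the threshold grows.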
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
Lemma 12.1. Denote by $p_i$ the prior probability of $\pi_i$, $i = 1, 2$, $p_1 + p_2 = 1$. Then the overall
probabilities of incorrectly classifying objects will be: $\Pr(\text{misclassified as } \pi_1) = \Pr(1|2)p_2$ and
$\Pr(\text{misclassified as } \pi_2) = \Pr(2|1)p_1$. Further, let $c(i|j)$, $i \ne j$, $i, j = 1, 2$ be the misclassification
costs. Then the expected cost of misclassification is
$$ECM = c(2|1)\Pr(2|1)p_1 + c(1|2)\Pr(1|2)p_2. \tag{12.3}$$
The regions $R_1$ and $R_2$ that minimise ECM are given by
$$R_1 = \Big\{x : \frac{f_1(x)}{f_2(x)} \ge \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big\} \tag{12.4}$$
and
$$R_2 = \Big\{x : \frac{f_1(x)}{f_2(x)} < \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big\}. \tag{12.5}$$
Proof. It is easy to see that $ECM = \int_{R_1}[c(1|2)p_2 f_2(x) - c(2|1)p_1 f_1(x)]\,dx + c(2|1)p_1$. Hence, the
ECM will be minimised if $R_1$ includes those values of x for which the integrand
$[c(1|2)p_2 f_2(x) - c(2|1)p_1 f_1(x)] \le 0$ and excludes all the complementary values.
Note the significance of the fact that in Lemma 12.1 only ratios are involved. Often in
practice, one would have a much clearer idea about the cost ratio than about the actual costs
themselves.
For your own exercise, consider the special cases of Lemma 12.1 when $p_2 = p_1$, when $c(1|2) = c(2|1)$,
and when both these equalities hold. Comment on the soundness of the classification regions in
these cases.
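The rule (12.4)–(12.5) can be sketched for two hypothetical univariate normal populations, N(0, 1) and N(2, 1). The argument names `c12` and `c21` stand for c(1|2) and c(2|1); all numerical values are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def allocate(x, f1, f2, p1, p2, c12, c21):
    """Return 1 if x falls in R1 of (12.4), otherwise 2."""
    threshold = (c12 / c21) * (p2 / p1)
    return 1 if f1(x) / f2(x) >= threshold else 2

# Hypothetical populations: pi_1 ~ N(0, 1), pi_2 ~ N(2, 1).
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

# Equal priors and equal costs: the boundary is the midpoint x = 1.
print(allocate(0.9, f1, f2, 0.5, 0.5, 1.0, 1.0))
print(allocate(1.1, f1, f2, 0.5, 0.5, 1.0, 1.0))
# Making c(2|1) five times c(1|2) enlarges R1, so x = 1.1 now goes to pi_1.
print(allocate(1.1, f1, f2, 0.5, 0.5, 1.0, 5.0))
```

Only the ratios c(1|2)/c(2|1) and p2/p1 enter the function, in line with the remark above.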
12.4.2 Rules that minimise the total probability of misclassification (TPM)
If we ignore the cost of misclassification, we can define the total probability of misclassification
as
$$TPM = p_1\int_{R_2} f_1(x)\,dx + p_2\int_{R_1} f_2(x)\,dx.$$
Mathematically, this is a particular case of Lemma 12.1 when the costs of misclassification are
equal—so nothing new here.
12.4.3 Bayesian approach
Here, we try to allocate a new observation $x_0$ to the population with the larger posterior
probability $\Pr(\pi_i | x_0)$, $i = 1, 2$. According to Bayes’s formula we have
$$\Pr(\pi_1 | x_0) = \frac{p_1 f_1(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)}, \qquad \Pr(\pi_2 | x_0) = \frac{p_2 f_2(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)}.$$
Mathematically, the strategy of classifying an observation $x_0$ as $\pi_1$ if $\Pr(\pi_1 | x_0) > \Pr(\pi_2 | x_0)$ is
again a particular case of Lemma 12.1 when the costs of misclassification are equal. (Why?)
But note that the calculation of the posterior probabilities themselves is in itself a useful and
informative operation.
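A short sketch of the posterior calculation, again with two hypothetical univariate normal populations and illustrative priors:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x0, densities, priors):
    """Bayes's formula: Pr(pi_i | x0) is proportional to p_i f_i(x0)."""
    weighted = [p * f(x0) for f, p in zip(densities, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical populations N(0, 1) and N(2, 1) with priors 0.8 and 0.2.
densities = [lambda x: normal_pdf(x, 0.0, 1.0),
             lambda x: normal_pdf(x, 2.0, 1.0)]
post = posteriors(1.0, densities, [0.8, 0.2])
print(post)
```

At the midpoint x0 = 1 the two densities are equal, so the posterior probabilities coincide with the priors; away from the midpoint the data shift them.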
12.5 Classification with two multivariate normal populations
Until now we did not specify any particular form of the densities f1(x) and f2(x). An essential
simplification occurs under the normality assumption, and we now turn to a more detailed
discussion of this particular case. Two cases will be considered: equal and
unequal covariance matrices.
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
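Now we assume that the two populations $\pi_1$ and $\pi_2$ are $N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$, respectively.
Then, (12.4) becomes
$$R_1 = \Big\{x : \exp\Big[-\frac{1}{2}(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_2)^\top\Sigma^{-1}(x-\mu_2)\Big] \ge \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big\}.$$
Similarly, from (12.5) we get
$$R_2 = \Big\{x : \exp\Big[-\frac{1}{2}(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_2)^\top\Sigma^{-1}(x-\mu_2)\Big] < \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big\},$$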
and we arrive at the following result:
Theorem 12.2. Under the above assumptions, the allocation rule that minimises the ECM is
given by:
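1. Allocate $x_0$ to $\pi_1$ if
$$(\mu_1 - \mu_2)^\top\Sigma^{-1}x_0 - \frac{1}{2}(\mu_1 - \mu_2)^\top\Sigma^{-1}(\mu_1 + \mu_2) \ge \log\Big[\frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big].$$
2. Otherwise, allocate $x_0$ to $\pi_2$.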
Proof. Simple exercise (to be discussed at lectures).
Note also that in most situations it is unrealistic to assume that the parameters $\mu_1$, $\mu_2$,
and $\Sigma$ are known. They will need to be estimated from the data instead. Assume that $n_1$ and $n_2$
observations are available from the first and from the second population, respectively. If $\bar{x}_1$ and
$\bar{x}_2$ are the sample mean vectors and $S_1$ and $S_2$ the corresponding sample covariance matrices,
then under the assumption $\Sigma_1 = \Sigma_2 = \Sigma$ we can derive the pooled covariance matrix estimator
$$S_{pooled} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n_1 + n_2 - 2}$$
(this is an unbiased estimator of $\Sigma$ (!)).
Hence the sample classification rule becomes:
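1. Allocate $x_0$ to $\pi_1$ if
$$(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}x_0 - \frac{1}{2}(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2) \ge \log\Big[\frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big]. \tag{12.6}$$
2. Otherwise, allocate $x_0$ to $\pi_2$.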
This empirical classification rule is called an allocation rule based on Fisher’s
discriminant function. The function
$$(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}x_0 - \frac{1}{2}(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2)$$
itself (which is linear in the vector observation $x_0$) is called Fisher’s linear discriminant
function.
Of course, the latter rule is only an estimate of the optimal rule since the parameters in the
latter have been replaced by estimated quantities. But we are expecting this rule to perform well
when n1 and n2 are large. It is to be pointed out that the allocation rule in (12.6) is linear in
the new observation x0. The simplicity of its form is a consequence of the multivariate normality
assumption.
In summary:
• The sample allocation rule (12.6) is based on Fisher’s linear discriminant function
$(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}x_0 - \frac{1}{2}(\bar{x}_1 - \bar{x}_2)^\top S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2)$.
• It is only an estimate of the optimal rule.
• It is linear in the new observation $x_0$.
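The sample rule (12.6) can be sketched for p = 2, where the inverse of the pooled covariance matrix has a closed form. The learning samples `X1` and `X2` are hypothetical, and the function names are illustrative; the default arguments correspond to equal priors and equal costs.

```python
import math

def mean_vec(X):
    n = len(X)
    return [sum(x[j] for x in X) / n for j in range(2)]

def cov_mat(X):
    """Sample covariance matrix (divisor n - 1) of 2-d observations."""
    m, n = mean_vec(X), len(X)
    return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / (n - 1)
             for j in range(2)] for i in range(2)]

def inv2(A):
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def fisher_rule(X1, X2, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Return a function x0 -> 1 or 2 implementing the sample rule (12.6)."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = mean_vec(X1), mean_vec(X2)
    S1, S2 = cov_mat(X1), cov_mat(X2)
    Sp = [[((n1 - 1) * S1[i][j] + (n2 - 1) * S2[i][j]) / (n1 + n2 - 2)
           for j in range(2)] for i in range(2)]
    Si = inv2(Sp)
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    # a^T = (xbar1 - xbar2)^T Sp^{-1}
    a = [d[0] * Si[0][0] + d[1] * Si[1][0], d[0] * Si[0][1] + d[1] * Si[1][1]]
    mid = 0.5 * (a[0] * (m1[0] + m2[0]) + a[1] * (m1[1] + m2[1]))
    cutoff = math.log((c12 / c21) * (p2 / p1))
    return lambda x0: 1 if a[0] * x0[0] + a[1] * x0[1] - mid >= cutoff else 2

# Hypothetical learning samples (illustrative values only).
X1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (2.0, 1.5)]
X2 = [(6.0, 7.0), (7.0, 8.5), (8.0, 8.0), (7.0, 6.5)]
rule = fisher_rule(X1, X2)
print(rule((2.0, 2.0)), rule((7.0, 8.0)))
```

With well-separated samples such as these, every learning observation is allocated to its own population; with overlapping samples some resubstitution errors would be expected.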
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
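Theorem 12.3. Now we assume that the two populations $\pi_1$ and $\pi_2$ are $N_p(\mu_1, \Sigma_1)$ and $N_p(\mu_2, \Sigma_2)$,
respectively. Repeating the same steps as in Theorem 12.2 we get
$$R_1 = \Big\{x : -\frac{1}{2}x^\top(\Sigma_1^{-1} - \Sigma_2^{-1})x + (\mu_1^\top\Sigma_1^{-1} - \mu_2^\top\Sigma_2^{-1})x - k \ge \log\Big[\frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big]\Big\}$$
$$R_2 = \Big\{x : -\frac{1}{2}x^\top(\Sigma_1^{-1} - \Sigma_2^{-1})x + (\mu_1^\top\Sigma_1^{-1} - \mu_2^\top\Sigma_2^{-1})x - k < \log\Big[\frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big]\Big\}$$
where $k = \frac{1}{2}\log\Big(\frac{|\Sigma_1|}{|\Sigma_2|}\Big) + \frac{1}{2}(\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2)$, and we see that the classification regions are
defined by quadratic functions of the new observation in this case.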
One obtains the following rule:
1. Allocate $x_0$ to $\pi_1$ if
$$-\frac{1}{2}x_0^\top(S_1^{-1} - S_2^{-1})x_0 + (\bar{x}_1^\top S_1^{-1} - \bar{x}_2^\top S_2^{-1})x_0 - \hat{k} \ge \log\Big[\frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}\Big]$$
where $\hat{k}$ is the empirical analog of k.
2. Allocate $x_0$ to $\pi_2$ otherwise.
When Σ1 = Σ2, the quadratic term disappears and we can easily see that the classification
regions from Theorem 12.2 are obtained. Of course, the case considered in Theorem 12.3 is more
general but we should be cautious when applying it in practice. It turns out that in more than
two dimensions, classification rules based on quadratic functions do not always perform well
and can lead to strange results. This is especially true when the data are not quite normal and
when the differences in the covariance matrices are significant. The rule is very sensitive
(non-robust) to departures from normality. Therefore, it is advisable to first try to transform
the data to be more nearly normal by using some classical normality transformations. A detailed
discussion of these effects will be provided during the lecture. Also, the tests discussed in Lecture 9
can be used to check whether the equal covariance assumption is valid.
12.5.3 Optimum error rate and Mahalanobis distance
We defined the TPM quantity in general terms for any classification rule (12.3). When the
regions R1 and R2 are selected in an optimal way, one obtains the minimal value of TPM which
is called optimum error rate (OER) and is being used to characterise the difficulty of the
classification problem at hand. Hereby we shall illustrate the calculation of the OER for the
simple case of two normal populations with Σ1 = Σ2 = Σ and prior probabilities p1 = p2 =
1
2 .
In this case
TPM =
1
2
∫
R2
f1(x)dx+
1
2
∫
R1
f2(x)dx,
and OER is obtained by choosing
R1 = {x : (µ1 − µ2)⊤Σ−1x− 1
2
(µ1 − µ2)⊤Σ−1(µ1 + µ2) ≥ 0}
and
R2 = {x : (µ1 − µ2)⊤Σ−1x− 1
2
(µ1 − µ2)⊤Σ−1(µ1 + µ2) < 0}.
If we introduce the random variable Y = (µ1 − µ2)⊤Σ⁻¹X = l⊤X, then Y_i ∼ N1(µ_iY, ∆²), i = 1, 2,
for the two populations π1 and π2, where µ_iY = (µ1 − µ2)⊤Σ⁻¹µi, i = 1, 2. The quantity
$$\Delta=\sqrt{(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2)}$$
is the Mahalanobis distance between the two normal populations and it has an important role
in many applications of Multivariate Analysis. Now
$$\Pr(2|1)=\Pr\left(Y<\frac{1}{2}(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2)\right)=\Pr\left(\frac{Y-\mu_{1Y}}{\Delta}<-\frac{\Delta}{2}\right)=\Phi\left(-\frac{\Delta}{2}\right),$$
Φ(·) denoting the cumulative distribution function of the standard normal. Along the same lines
we can get (do it (!)) Pr(1|2) = Φ(−∆/2), so that finally OER = minimum TPM = Φ(−∆/2).
In practice, ∆ is replaced by its estimated value
$$\hat{\Delta}=\sqrt{(\bar{x}_1-\bar{x}_2)^\top S_{pooled}^{-1}(\bar{x}_1-\bar{x}_2)}.$$
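As a small numerical illustration of the OER formula, the following sketch computes ∆ and Φ(−∆/2) for two bivariate normal populations. The means and the common covariance matrix are hypothetical values chosen only for illustration, and the helper names (`inv2`, `mahalanobis`, `std_normal_cdf`) are not from the notes.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(mu1, mu2, sigma):
    """Delta = sqrt((mu1 - mu2)' Sigma^{-1} (mu1 - mu2)) for p = 2."""
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    si = inv2(sigma)
    q = sum(diff[r] * si[r][c] * diff[c] for r in range(2) for c in range(2))
    return math.sqrt(q)

def std_normal_cdf(z):
    """Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical parameters (assumptions for illustration only).
mu1, mu2 = [1.0, 1.0], [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]

delta = mahalanobis(mu1, mu2, sigma)   # here Delta^2 = 4/3
oer = std_normal_cdf(-delta / 2.0)     # optimum error rate for p1 = p2 = 1/2
print(delta, oer)
```

The further apart the two means are in Mahalanobis distance, the smaller the OER becomes, which is why ∆ characterises how hard the classification problem is.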
12.6 Classification with more than 2 normal populations
Formal generalisation of the theory for the case of g > 2 groups π1, π2, . . . , πg is straightforward
but optimal error rate analysis is difficult when g > 2. It is easy to see that the ECM classification
rule with equal misclassification costs becomes (compare to (12.4) and (12.5)) now:
1. Allocate x0 to πk if pk fk > pi fi for all i ≠ k.
Equivalently, one can check if log pk fk > log pi fi for all i ≠ k.
When applying this classification rule to g normal populations with densities fi(x) of Np(µi, Σi),
i = 1, 2, . . . , g, it becomes:
1. Allocate x0 to πk if
$$\log p_kf_k(x_0)=\log p_k-\frac{p}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x_0-\mu_k)^\top\Sigma_k^{-1}(x_0-\mu_k)=\max_i\log p_if_i(x_0).$$
Ignoring the constant (p/2) log(2π), we get the quadratic discriminant score for the ith population:
$$d_i^Q(x)=-\frac{1}{2}\log|\Sigma_i|-\frac{1}{2}(x-\mu_i)^\top\Sigma_i^{-1}(x-\mu_i)+\log p_i\qquad(12.7)$$
and the rule allocates x to the population with the largest quadratic discriminant
score. It is obvious how one would estimate from the data the unknown quantities involved in
(12.7) in order to obtain the estimated minimum total probability of misclassification rule. (You
formulate the precise statement (!)).
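The quadratic discriminant score (12.7) can be evaluated directly. The sketch below does so for p = 2 with two hypothetical populations; the means, covariance matrices and priors are assumptions made up for the example, and `quad_score` and `classify` are illustrative names only.

```python
import math

def quad_score(x, mu, sigma, prior):
    """Score (12.7): -1/2 log|Sigma| - 1/2 (x - mu)' Sigma^{-1} (x - mu) + log prior, p = 2."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    z = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(z[r] * inv[r][s] * z[s] for r in range(2) for s in range(2))
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior)

# Hypothetical populations: (mean, covariance, prior).
groups = [
    ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    ([3.0, 3.0], [[4.0, 0.0], [0.0, 4.0]], 0.5),
]

def classify(x):
    """Return the (1-based) index of the population with the largest score."""
    scores = [quad_score(x, m, s, p) for m, s, p in groups]
    return scores.index(max(scores)) + 1

print(classify([0.5, 0.5]), classify([3.0, 3.0]))
```

In practice the population parameters would be replaced by the sample means, sample covariance matrices and (estimated or assumed) priors.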
In the case where we are justified in assuming that all covariance matrices for the g populations
are equal, a simplification is possible (as in the case g = 2). Looking only at the terms that vary
with i = 1, 2, . . . , g in (12.7), we can define the linear discriminant score:
$$d_i(x)=\mu_i^\top\Sigma^{-1}x-\frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i+\log p_i.$$
Correspondingly, a sample version of the linear discriminant score is obtained by substituting
the arithmetic means x̄i for µi and
$$S_{pooled}=\frac{n_1-1}{n_1+n_2+\dots+n_g-g}S_1+\dots+\frac{n_g-1}{n_1+n_2+\dots+n_g-g}S_g$$
for Σ, thus arriving at
$$\hat{d}_i(x)=\bar{x}_i^\top S_{pooled}^{-1}x-\frac{1}{2}\bar{x}_i^\top S_{pooled}^{-1}\bar{x}_i+\log p_i.$$
Therefore the Estimated Minimum TPM Rule for Equal Covariance Normal Populations
is the following:
1. Allocate x to πk if d̂k(x) is the largest of the g values d̂i(x), i = 1, 2, . . . , g.
In this form, the classification rule has been implemented in many computer packages.
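To make the sample rule concrete, here is a minimal sketch for p = 2 and g = 3 that forms the pooled covariance matrix and evaluates the estimated linear discriminant scores. The group means, covariance matrices, sample sizes and priors are hypothetical summary statistics, not data from the course, and it is not how MASS::lda or PROC DISCRIM are implemented.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def pooled_cov(covs, sizes):
    """S_pooled = sum_i (n_i - 1) S_i / (n_1 + ... + n_g - g), for 2x2 matrices."""
    dof = sum(sizes) - len(sizes)
    return [[sum((n - 1) * s[r][c] for s, n in zip(covs, sizes)) / dof
             for c in range(2)] for r in range(2)]

def linear_score(x, xbar, s_inv, prior):
    """xbar' S^{-1} x - 1/2 xbar' S^{-1} xbar + log prior."""
    a = [sum(xbar[r] * s_inv[r][c] for r in range(2)) for c in range(2)]  # xbar' S^{-1}
    return a[0] * x[0] + a[1] * x[1] - 0.5 * (a[0] * xbar[0] + a[1] * xbar[1]) + math.log(prior)

# Hypothetical summary statistics for g = 3 groups.
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]], [[1.5, 0.0], [0.0, 1.5]]]
sizes = [11, 11, 11]
priors = [1 / 3, 1 / 3, 1 / 3]

s_pooled = pooled_cov(covs, sizes)
s_inv = inv2(s_pooled)

def classify(x):
    """Allocate x to the group with the largest estimated linear score."""
    scores = [linear_score(x, m, s_inv, p) for m, p in zip(means, priors)]
    return scores.index(max(scores)) + 1

print(s_pooled, classify([1.8, 0.1]))
```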
12.7 Software
SAS: PROC DISCRIM
R: MASS::lda, MASS::qda
12.8 Examples
Example 12.4. Linear and quadratic discriminant analysis for Edgar Anderson’s Iris data,
using cross-validation to assess the classifiers.
12.9 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 11.1–11.6.
12.10 Exercises
Exercise 12.1
Three bivariate normal populations, labelled i = 1, 2, 3, have the same covariance matrix, given by
$$\Sigma=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix},$$
and means
$$\mu_1=\begin{pmatrix}1\\1\end{pmatrix},\quad\mu_2=\begin{pmatrix}1\\0\end{pmatrix},\quad\mu_3=\begin{pmatrix}0\\1\end{pmatrix},$$
respectively.
(a) Suggest a classification rule for an observation x = (x1, x2)⊤ that corresponds to one of the
three populations. You may assume equal priors for the three populations and equal
misclassification costs.
(b) Classify the following observations to one of the three distributions:
(0.2, 0.6)⊤, (2, 0.8)⊤, (0.75, 1)⊤.
(c) Show that in R², the 3 classification regions are bounded by straight lines and draw a graph
of these three regions.
UNSW MATH5855 2021T3 Lecture 13 Support Vector Machines
13 Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.4.1 Linear SVM: Separable Case
13.4.2 Linear SVM: Nonseparable Case
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
13.1 Introduction and motivation
As seen in Lecture 12, when classifying into one of two p-dimensional multivariate normal
populations, the scores are either linear (when the covariance matrices are equal) or quadratic
(when the covariance matrices are different). Under the multivariate normality assumption,
even optimality of such simple classifiers can be shown. However, when the two populations
are not multivariate normal, the situation is more difficult: the boundaries between the populations
may be blurrier, and considerably more nonlinear classification techniques may be necessary
to achieve a good classification. Support vector machines (SVM) are an example of such
nonlinear statistical classification techniques. They usually achieve superior results in comparison
to more traditional nonlinear parametric classification techniques such as logit analysis or
nonparametric techniques such as neural networks. Mathematically, when using SVM, we try to
formulate the classification as an empirical risk minimisation problem and to solve the problem
under additional restrictions on the allowed (nonlinear) classifier functions.
13.2 Expected versus Empirical Risk minimisation
Let Y be an “indicator” with values +1 and −1 that indicates to which of two groups of interest
a given p-dimensional observation belongs. We want to find a “best” classifier in a class F of
functions f. Each classifier function f(x) is meant to deliver a value of +1 or −1 for a given
observation vector x. To this end, we consider the expected risk
$$R(f)=\int\frac{1}{2}\,|f(x)-y|\,dP(x,y).$$
Since the joint distribution P(x, y) is unknown in practice, we consider the empirical risk over
a training set (xi, yi), i = 1, 2, . . . , n of observations instead:
$$\hat{R}(f)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\,|f(x_i)-y_i|.$$
The loss in the risk’s definition is the “zero-one loss” given by
$$L(x,y)=\frac{1}{2}\,|f(x)-y|$$
and, thanks to the chosen labels ±1 for Y, it obviously takes the values 0 (if the classification is correct)
and 1 (if the classification is wrong).
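The empirical risk is simply the training misclassification rate. The following sketch evaluates it for a toy classifier; the classifier `f` and the four training points are hypothetical and serve only to show the ½|f(x) − y| loss at work.

```python
def empirical_risk(f, xs, ys):
    """Average zero-one loss 1/2 |f(x_i) - y_i| over the training set."""
    return sum(0.5 * abs(f(x) - y) for x, y in zip(xs, ys)) / len(xs)

def f(x):
    """Hypothetical classifier: the sign of the first coordinate."""
    return 1 if x[0] > 0 else -1

# Hypothetical training set with labels +1 / -1.
xs = [(1.0, 2.0), (-0.5, 1.0), (2.0, -1.0), (-1.5, 0.3)]
ys = [1, -1, -1, -1]

risk = empirical_risk(f, xs, ys)
print(risk)  # the third point is misclassified, so the risk is 1/4
```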
Minimising the empirical (instead of the unknown expected) risk means finding fn = argmin_{f∈F} R̂(f)
as an approximation to fopt = argmin_{f∈F} R(f). Generally speaking, the two solutions fn and
fopt do not coincide and, without further assumptions, may be quite different. However, thanks to
some groundbreaking work by V. Vapnik, there are theoretical results which, loosely speaking,
state that if F is not too large and n → ∞, there is an upper bound on their difference that holds with
probability (1 − η):
$$R(f)\le\hat{R}(f)+\phi\left(\frac{h}{n},\frac{\log\eta}{n}\right).$$
The above inequality can be interpreted as stating that the test error is bounded from above by
the sum of the training error and the complexity of the set of models under consideration. We
can then try to minimise this upper bound and hope that in doing so we also keep the (unknown)
test error under control.
The function ϕ above is monotone increasing in h (at least for large enough sample sizes n).
Here h denotes the Vapnik–Chervonenkis (VC) dimension (i.e., a measure of the complexity
of the class F).
For a linear classification rule f(x) = sign(w⊤x + b) with a p-dimensional predictor x, it is
known that
$$\phi\left(\frac{h}{n},\frac{\log\eta}{n}\right)=\sqrt{\frac{h\left(\log\frac{2n}{h}+1\right)-\log\frac{\eta}{4}}{n}}$$
and that the VC dimension is h = p + 1. You can now directly check that
$$\frac{\partial}{\partial h}\left[\frac{h\left(\log\frac{2n}{h}+1\right)-\log\frac{\eta}{4}}{n}\right]=\frac{1}{n}\log\frac{2n}{h}>0$$
as long as h < 2n, which confirms the monotone increasing property stated above.
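The behaviour of the complexity term can be checked numerically. The sketch below evaluates ϕ for the linear classifier case; the values of n, η and h are arbitrary choices for illustration, and `vc_penalty` is a hypothetical helper name.

```python
import math

def vc_penalty(h, n, eta):
    """phi = sqrt((h (log(2n/h) + 1) - log(eta/4)) / n) for VC dimension h."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

n, eta = 1000, 0.05          # hypothetical sample size and confidence level
p = 2                        # two predictors, so h = p + 1 = 3
print(vc_penalty(p + 1, n, eta))

# The penalty grows with h (for h < 2n) and shrinks as n grows.
penalties = [vc_penalty(h, n, eta) for h in range(1, 50)]
increasing = all(a < b for a, b in zip(penalties, penalties[1:]))
print(increasing)
```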
In general, the VC dimension of a given set of functions is equal to the maximal number of
points that can be separated in all possible ways by that set of functions.
At first glance, the richer the function class F, the better the classification rule would be.
Indeed, you can construct a classifier that has zero classification error on the training set. However,
this classifier will be too specialised to the given training set, with no ability to generalise to other
sets. Hence such a classifier would be undesirable. A richer class means a higher
complexity of F or, equivalently, a higher value of h (and therefore of ϕ). The term ϕ(h/n, log(η)/n) can
be considered a penalty for excessive complexity of the classifier function. You can see directly
that the derivative ∂ϕ(h/n, log(η)/n)/∂h ≥ 0 if and only if 2n ≥ h. For large enough n, this means that the
function ϕ is increasing with the complexity of the model. Hence the sum of the two terms, R̂(f)
(precision) and ϕ(h/n, log(η)/n) (complexity), represents the compromise between precision in the risk
estimation and the complexity of the classifier. Therefore minimising this sum is the sensible
thing to do in order to perform “optimally”.
The rest of the lecture focuses on ways to solve (or solve approximately) this minimisation
problem for some classes F .
For additional information, see Section 19.4 of
Härdle, W. and Simar, L., Applied Multivariate Statistical Analysis, Third Edition,
Springer, 2012.
A treatment along similar lines can also be found in the e-book (available from the library)
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, Second Edition, Springer, 2009.
13.3 Basic idea of SVMs
A linear classifier is one that, given a feature vector xnew and weights w, predicts ynew based on
the value of w⊤xnew; for example,
$$\hat{y}_{new}=\begin{cases}+1&\text{if }w^\top x_{new}+b>0\\-1&\text{if }w^\top x_{new}+b<0\end{cases}$$
for a threshold −b. Here, we see that every element xi of x gets a weight wi:
Sign of wi determines whether increasing xi pushes the prediction toward −1 or +1.
Magnitude of wi determines how strongly.
The regions of x for which the model predicts +1 as opposed to −1 are separated by the boundary w⊤x + b = 0.
Points x that satisfy that equation exactly form a line (if d = 2), a plane (if d = 3), or a
hyperplane (if d > 3), where d is the dimension of x. We call the data linearly separable if a
hyperplane that separates the two classes exists. Let us focus on this linearly separable case
(and consider the nonseparable case later).
The following diagram illustrates one such line:
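[Figure: observations of the two classes (y = −1 and y = +1) in the (x1, x2) plane, with a separating line, the vector w, and the offset −b/∥w∥ marked.]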
Now, usually, there are infinitely many different hyperplanes which could be used to separate
a linearly separable dataset. We therefore have to define the “best” one. The “best” choice can
be regarded as the middle of the widest empty strip (or higher dimensional analogue) between
the two classes, one that maximises the margin (b+ − b−)/∥w∥ in the following illustration:
[Figure: the two classes (y = −1 and y = +1) in the (x1, x2) plane, with the separating line w⊤x + b = 0 and the two parallel lines w⊤x + b+ = 0 and w⊤x + b− = 0 bounding the empty strip of width (b+ − b−)/∥w∥.]
We therefore want to make the margin (b+ − b−)/∥w∥ as big as possible.
The scale of w and b is arbitrary: for arbitrary α ≠ 0, any x that satisfies w⊤x + b = 0 also
satisfies (αw)⊤x + (αb) = α(w⊤x + b) = 0, so (w, b) and (αw, αb) define the same plane. We
fix b+ − b = b − b− = 1, and only vary w: our “outer” hyperplanes become
$$w^\top x+(b-1)=0,$$
$$w^\top x+(b+1)=0.$$
Then, the margin (b+ − b−)/∥w∥ = 2/∥w∥ is maximised by minimising ∥w∥. Therefore, a Linear
Support Vector Machine minimises ∥w∥² subject to separating −1s and +1s.
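To see why a smaller ∥w∥ gives a wider margin, the following sketch compares two candidate pairs (w, b) that both separate a small dataset with every point on or outside the outer hyperplanes. The four data points and both candidates are hypothetical, chosen by hand for illustration.

```python
import math

def margin_width(w):
    """Width 2 / ||w|| of the empty strip once the outer hyperplanes are at +1 and -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def separates(w, b, xs, ys):
    """Check y_i (w'x_i + b) >= 1 for every observation."""
    return all(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1
               for x, y in zip(xs, ys))

# Hypothetical linearly separable data.
xs = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
ys = [1, 1, -1, -1]

candidates = [((1.0, 1.0), -2.0), ((0.5, 0.5), -1.0)]
for w, b in candidates:
    print(w, b, separates(w, b, xs, ys), margin_width(w))
```

Both candidates define the same separating line x1 + x2 = 2, but only the second is scaled so that the closest points sit exactly on the outer hyperplanes; its strip has width 2/∥w∥ = 2√2.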
13.4 Estimation
13.4.1 Linear SVM: Separable Case
We write the boundaries of the empty region as
$$w^\top x+(b-1)=0\implies w^\top x+b=+1,$$
$$w^\top x+(b+1)=0\implies w^\top x+b=-1,$$
and observe that
$$\hat{y}_i=\begin{cases}+1&\text{if }w^\top x_i+b>0\\-1&\text{if }w^\top x_i+b<0\end{cases}=\operatorname{sign}(w^\top x_i+b).$$
This means that if w⊤x + b = 0 separates −1s and +1s (i.e., yi = ŷi for all i = 1, . . . , n) and no
observation lies strictly inside the empty region, then
$$y_i(w^\top x_i+b)\ge 1,\quad i=1,\dots,n.$$
Therefore, a linear SVM learning task can be expressed as a constrained optimisation problem:
$$\operatorname*{argmin}_{w,b}\ \frac{1}{2}\|w\|^2\quad\text{subject to }y_i(w^\top x_i+b)\ge 1,\ i=1,\dots,n.$$
(Here and elsewhere, argmin_a h(a) is that a which minimises the value of h(a).)
The objective is quadratic (convex) and the constraints are linear. This problem can be solved
by Lagrange multipliers. The following outlines the steps and the key results.
1. Rewrite the objective function as the Lagrangian (note the use of αi's instead of λi's):
$$\mathrm{Lag}(w,b;\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^{n}\alpha_i\left[y_i(w^\top x_i+b)-1\right].$$
2. As the constraints are inequalities rather than equalities, apply the so-called KKT
(Karush–Kuhn–Tucker) conditions: the saddle point (w, b, α) with Lag′(w, b; α) = 0 will be the
constrained optimum if αi ≥ 0, i = 1, . . . , n. Thus, our goal becomes to solve
Lag′(w, b; α) = 0 subject to αi ≥ 0.
3. Set derivatives of Lag with respect to w and b equal to zero:
$$\frac{\partial\mathrm{Lag}}{\partial w}=w-\sum_{i=1}^{n}\alpha_iy_ix_i=0\implies w=\sum_{i=1}^{n}\alpha_iy_ix_i,$$
$$\frac{\partial\mathrm{Lag}}{\partial b}=-\sum_{i=1}^{n}\alpha_iy_i=0\implies\sum_{i=1}^{n}\alpha_iy_i=0.$$
4. Note, also, that
$$y_i(w^\top x_i+b)-1\ge 0,\quad i=1,\dots,n,$$
$$\alpha_i\left(y_i(w^\top x_i+b)-1\right)=0,\quad i=1,\dots,n,$$
for some αi ≥ 0, i = 1, . . . , n. Notice that the second equation implies that either αi = 0
or yi(w⊤xi + b) = 1 (or both). This means that if αi ≠ 0, the observation lies on one of the
two boundary hyperplanes and is known as a support vector.
Dual Optimisation Problem
Substituting the expression of w in terms of α and expanding ∥w∥², we get the dual problem:
$$\mathrm{Lag}_D(\alpha)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_jy_iy_jx_i^\top x_j,$$
to be maximised subject to
$$\alpha_i\ge 0,\ i=1,\dots,n,\qquad\sum_{i=1}^{n}\alpha_iy_i=0.$$
This is a quadratic programming problem, for which many software tools are available.
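For a dataset with only two observations the dual can be solved by hand, which makes the relationship between α, w and b easy to check. The sketch below uses a crude grid search in place of a proper quadratic programming solver; the two data points are hypothetical, and the grid search is an illustrative shortcut, not how SVM software solves the problem.

```python
def dual_objective(alpha, xs, ys):
    """Lag_D(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i'x_j."""
    n = len(xs)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Hypothetical two-point dataset, one observation per class.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so a one-dimensional grid search over a >= 0 is enough.
grid = [k / 1000 for k in range(0, 1001)]
a_best = max(grid, key=lambda a: dual_objective([a, a], xs, ys))
alpha = [a_best, a_best]

# Recover w = sum_i alpha_i y_i x_i, and b from y_1 (w'x_1 + b) = 1 with y_1 = +1.
w = [sum(alpha[i] * ys[i] * xs[i][j] for i in range(2)) for j in range(2)]
b = ys[0] - sum(wj * xj for wj, xj in zip(w, xs[0]))
print(a_best, w, b)
```

Here the dual objective reduces to 2a − 4a², which is maximised at a = 1/4, giving w = (0.5, 0.5) and b = 0; both observations are support vectors.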
13.4.2 Linear SVM: Nonseparable Case
Of course, in real-world problems, it is often not possible to find a hyperplane which perfectly separates
the target classes. The soft margin approach considers a trade-off between margin width and
number of training misclassifications. Slack variables ξi ≥ 0 are included in the constraints: we
insist that
$$y_i(w^\top x_i+b)\ge 1-\xi_i.\qquad(13.1)$$
The optimisation then becomes
$$\operatorname*{argmin}_{w,b,\xi}\left(\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i\right)\quad\text{subject to }y_i(w^\top x_i+b)\ge 1-\xi_i,\ \xi_i\ge 0,\ i=1,\dots,n,$$
for a tuning constant C. Small C means a lot of slack, whereas a large C means little slack. In
particular, if we set C = ∞, we require separation to be perfect, a hard margin.
Now, taking (13.1) and solving for ξi gives ξi ≥ 1 − yi(w⊤xi + b). We want to make ξi as
small as possible while keeping ξi ≥ 0, so we can set ξi = max{0, 1 − yi(w⊤xi + b)}.
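The role of the slack variables and of C can be seen by evaluating the soft-margin objective for a fixed (w, b). In the sketch below the slack of each observation is taken as the smallest non-negative value satisfying (13.1); the data, w, b and the values of C are hypothetical.

```python
def slacks(w, b, xs, ys):
    """Smallest non-negative slack for each observation: max(0, 1 - y_i (w'x_i + b))."""
    return [max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
            for x, y in zip(xs, ys)]

def soft_margin_objective(w, b, xs, ys, C):
    """1/2 ||w||^2 + C * sum_i xi_i."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks(w, b, xs, ys))

# Hypothetical data that cannot be separated perfectly along x1.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.25, 0.0)]
ys = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

xi = slacks(w, b, xs, ys)
print(xi)                                        # [0.0, 0.5, 0.0, 1.25]
print(soft_margin_objective(w, b, xs, ys, 1.0))
print(soft_margin_objective(w, b, xs, ys, 10.0))
```

A slack between 0 and 1 marks a point inside the margin but on the correct side, while a slack above 1 marks a misclassified point; a larger C penalises both more heavily.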
Dual Optimisation Problem
The Lagrangian is now (with additional multipliers µ)
$$\mathrm{Lag}(w,b,\xi;\alpha,\mu)=\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i-\sum_{i=1}^{n}\alpha_i\left[y_i(w^\top x_i+b)-1+\xi_i\right]-\sum_{i=1}^{n}\mu_i\xi_i.$$
Now,
$$\frac{\partial\mathrm{Lag}}{\partial w}=w-\sum_{i=1}^{n}\alpha_iy_ix_i=0\implies w=\sum_{i=1}^{n}\alpha_iy_ix_i,$$
$$\frac{\partial\mathrm{Lag}}{\partial b}=-\sum_{i=1}^{n}\alpha_iy_i=0\implies\sum_{i=1}^{n}\alpha_iy_i=0,$$
$$\frac{\partial\mathrm{Lag}}{\partial\xi}=C1_n-\alpha-\mu=0\implies C-\alpha_i-\mu_i=0,\quad i=1,\dots,n,$$
with additional KKT conditions for i = 1, . . . , n:
$$\alpha_i\ge 0,\qquad\mu_i\ge 0,\qquad\alpha_i\left(y_i(w^\top x_i+b)-1+\xi_i\right)=0.$$
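Substituting into the Lagrangian leads to
$$\mathrm{Lag}_D(\alpha,\mu)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_ky_jy_k(x_j^\top x_k)+\sum_{i=1}^{n}\xi_i(C-\alpha_i-\mu_i).$$
But C − αi − µi = 0, so the last term vanishes. Moreover, as long as αi ≤ C, µi = C − αi ≥ 0 is
completely determined by αi, and we get a dual problem
$$\operatorname*{argmax}_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_ky_jy_k(x_j^\top x_k)$$
subject to
$$\sum_{i=1}^{n}\alpha_iy_i=0\quad\text{and}\quad 0\le\alpha_i\le C,\ i=1,\dots,n.$$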
We can also express the prediction in two ways:
$$\text{Primal: }\hat{y}(x)=\operatorname{sign}(w^\top x+b),\qquad(13.2)$$
$$\text{Dual: }\hat{y}(x)=\operatorname{sign}\Big\{\sum_{j=1}^{n}\alpha_jy_j(x_j^\top x)+b\Big\}.\qquad(13.3)$$
The primal (w) form requires d parameters, while the dual (α) form requires n parameters. This
means that for high-dimensional problems (those with d ≫ n, a huge number of predictors) the
dual representation can be more efficient.
But it gets better! Notice that only the xi's closest to the separating hyperplane (those with
αj > 0) matter in determining ŷ(x), so most of them will have no effect. Thus, computationally,
the effective “n” will actually be much smaller than the sample size, so the above condition can be met
far more often than one might expect. Again, those xi's that “support” the hyperplane are called
support vectors.
In addition, notice that the dual form only depends on the dot products x⊤j xk. This opens the door to
nonlinear SVMs.
13.5 Nonlinear SVMs
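Consider:
[Figure: scatter plot of the two classes of points in the (x1, x2) plane.]
The true classification for these points is
$$y=\begin{cases}+1&\text{if }x_1^2+x_2^2>0.75^2\\-1&\text{if }x_1^2+x_2^2<0.75^2\end{cases},$$
but one can hardly draw a line separating them.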
What we can do is transform x so that a linear decision boundary can separate them. In this
case, suppose we augmented our x with squared terms:
$$(x_1,x_2)\to(x_1,x_2,x_1^2,x_2^2):$$
[Figure: scatterplot matrix of x1, x2, x1^2 and x2^2.]
Now, a linear separator exists! Better yet, recall that the dual form (13.3) depends on the data only
through dot products x⊤i xj. We can therefore replace the dot product with another kernel k(xi, xj).
For example, a “kernel” function of the form k(u, v) = (u⊤v + 1)² can be regarded as a dot product
$$u_1^2v_1^2+u_2^2v_2^2+2u_1u_2v_1v_2+2u_1v_1+2u_2v_2+1$$
$$=(u_1^2,\ u_2^2,\ \sqrt{2}u_1u_2,\ \sqrt{2}u_1,\ \sqrt{2}u_2,\ 1)^\top(v_1^2,\ v_2^2,\ \sqrt{2}v_1v_2,\ \sqrt{2}v_1,\ \sqrt{2}v_2,\ 1),$$
which contains the above augmentation (up to scaling, together with a cross-product term and a
constant). In general, kernel functions can be expressed in terms of high-dimensional dot products.
Computing dot products via kernel functions is computationally “cheaper” than using transformed
attributes directly.
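The identity between the polynomial kernel and a dot product of transformed attributes can be verified numerically. The sketch below compares k(u, v) = (u⊤v + 1)² with the dot product of the explicit six-dimensional feature map for d = 2, which includes the cross-product component √2·u1·u2; the vectors u and v are arbitrary illustrative values.

```python
import math

def poly_kernel(u, v):
    """k(u, v) = (u'v + 1)^2 for two-dimensional u and v."""
    return (u[0] * v[0] + u[1] * v[1] + 1.0) ** 2

def feature_map(u):
    """Explicit features whose dot product reproduces poly_kernel."""
    r2 = math.sqrt(2.0)
    return [u[0] ** 2, u[1] ** 2, r2 * u[0] * u[1], r2 * u[0], r2 * u[1], 1.0]

# Arbitrary illustrative vectors.
u, v = (0.3, -1.2), (2.0, 0.5)

lhs = poly_kernel(u, v)
rhs = sum(a * b for a, b in zip(feature_map(u), feature_map(v)))
print(lhs, rhs)
```

The kernel needs one dot product in two dimensions and a squaring, whereas the explicit route builds two six-dimensional vectors first; the gap widens quickly as d and the polynomial degree grow.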
A common type of kernel is a radial basis function: a function of distance from the origin, or
from another fixed point v. Usually, the distance is Euclidean, i.e.,
$$\|u-v\|=\sqrt{(u_1-v_1)^2+\dots+(u_d-v_d)^2}.$$
A common radial basis function is Gaussian:
$$\phi(u,v)=\exp\left(-\gamma\|u-v\|^2\right).$$
We can use ϕ(·, ·) as our SVM kernel.
13.6 Multiple classes
Finally, we briefly consider the problem when there are more than two classes. Suppose that
there are K > 2 categories.
Recall that w⊤xi gives us a “score” that we normally compare to the threshold −b. However, we do not
have to do so. Instead, for each k = 1, . . . , K, we can fit a separate SVM (i.e., wk and bk) for
whether an observation is in class k or not. We can then predict ŷnew by evaluating w⊤k xnew + bk for
each k and taking the largest one. This is called the one-against-rest approach.
A computationally more expensive approach that tends to perform better is one-against-one:
for every distinct pair k1 < k2 from 1, . . . , K, fit an SVM for k1 vs. k2, and predict
the “winner” of all the rounds (if any). This requires fitting K(K − 1)/2 binary classifiers, but
to smaller datasets.
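The one-against-rest prediction step amounts to taking the class with the largest score. The sketch below assumes the K binary SVMs have already been fitted; the three (w_k, b_k) pairs are made-up values, not the output of any fitting routine.

```python
from itertools import combinations

def one_vs_rest_predict(x, models):
    """models: list of (w_k, b_k); returns the 1-based k with the largest w_k'x + b_k."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in models]
    return scores.index(max(scores)) + 1

# Hypothetical fitted one-against-rest models for K = 3 classes.
models = [((1.0, 0.0), 0.0), ((-1.0, 0.0), 0.0), ((0.0, 1.0), -0.5)]

print(one_vs_rest_predict((0.2, 2.0), models))   # class 3 has the largest score
print(one_vs_rest_predict((3.0, 1.0), models))   # class 1 has the largest score

# One-against-one would instead need one classifier per unordered pair of classes.
K = 4
pairs = list(combinations(range(1, K + 1), 2))
print(len(pairs))   # K (K - 1) / 2 = 6
```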
13.7 SVM specification and tuning
Categorical data can be handled by introducing binary dummy variables to indicate each possible
value.
When fitting an SVM, the user must specify some control parameters. These include the cost
constant C for slack variables, the type of kernel function, and its parameters. Unlike the more
probabilistic forms of classification, it is difficult to predict the out-of-sample classification error
for SVMs, so cross-validation is used.
The following kernel functions are available via the R e1071 package:
linear: u⊤v
polynomial: (γu⊤v + c0)^p
radial basis: exp(−γ∥u − v∥²)
sigmoid: tanh(γu⊤v + c0)
for constants γ, p, and c0.
13.8 Examples
Example 13.1. SVM classification for the Edgar Anderson’s Iris data, and using ROC curves.
13.9 Conclusion
We conclude with a brief discussion of the advantages and disadvantages of SVMs. SVM training
can be formulated as a convex optimisation problem, with efficient algorithms for finding the
global minimum, and the final result involves support vectors rather than the whole training set.
This is both a computational benefit and a benefit to robustness: outliers have less effect than
for other methods.
On the other hand, SVMs are much more difficult to interpret than model-based classification
techniques like linear discriminant analysis. Furthermore, SVMs do not actually provide
class probability estimates. These can be estimated by cross-validation, however.
UNSW MATH5855 2021T3 Lecture 14 Cluster Analysis
14 Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K-means
14.1.3 Extension: K-medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
14.3 Additional resources
The goal of cluster analysis is to identify groups in data. In contrast to SVMs and discriminant
analysis, no pre-existing group labels are provided. This makes it an example of unsupervised
learning.
The input of cluster analysis is therefore an unlabelled sample x1, . . . , xn ∈ Rp, and the
output is a grouping of observations such that more similar (in some sense) observations are
placed in the same group. That is, cluster analysis assigns to each xi a group index Gi ∈ {1, . . . , K}
such that if Gi = Gj, xi and xj are “on average” more similar in some sense than if Gi ≠ Gj.
Throughout this lecture, we will use the following additional notation.
G = (G1, . . . , Gn)⊤: a column vector of cluster memberships.
S1, . . . , SK: a partitioning of the observations {1, . . . , n} into K non-overlapping sets such that
for every i ∈ Sk, Gi = k.
S = (S1, . . . , SK): a shorthand for the clustering expressed in terms of sets.
We will consider a taxonomy of approaches to clustering. The “classical” approach is to
specify an algorithm that assigns observations to clusters. (Often, but not always, an objective
function may be defined that is optimised by the algorithm.) Classical approaches can be further
subdivided into hierarchical clustering, which produces a hierarchy of nested clusterings in a
tree which has observations as leaves; and non-hierarchical, which merely assigns a label to each
point.
The model-based approach to clustering is to postulate a mixture model: a model consisting
of a mixture of probability distributions with different location parameters. The parameters
of this model embody information about the clusters (e.g., their means and frequencies), and
estimating them enables probabilistic, or soft, clusterings.
We discuss these approaches in turn.
14.1 “Classical”
14.1.1 Components
In order to cluster data, particularly multivariate data, we must first define a proximity
measure: some function d(x1, x2) that quantifies the difference between two observations.
(Equivalently, we can define a similarity score and negate or invert it.) Here are some common measures:
Euclidean: $\|x_1-x_2\|=\sqrt{\sum_{j=1}^{p}(x_{1j}-x_{2j})^2}$, the “ordinary” straight-line distance.
taxicab/Manhattan: $\|x_1-x_2\|_1=\sum_{j=1}^{p}|x_{1j}-x_{2j}|$, the distance if one is only allowed to travel
parallel to the axes (like a taxicab on the Manhattan city grid).
Gower: $p^{-1}\sum_{j=1}^{p}I(x_{1j}\ne x_{2j})$, for binary measures.
A measure should be substantively meaningful and appropriate for the data. It is also common to
scale all of the dimensions (say, to have variance of 1 or to be between 0 and 1) before clustering.
Given these distances, we specify an algorithm that minimises within-cluster and maximises
between-cluster distances in some sense, with that sense often operationalised in an objective function.
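The three proximity measures listed above are straightforward to compute. The following sketch implements them for observations stored as tuples; the function names are illustrative only, and `mismatch_proportion` corresponds to the binary-data measure labelled Gower above.

```python
import math

def euclidean(x1, x2):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def manhattan(x1, x2):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))

def mismatch_proportion(x1, x2):
    """Proportion of coordinates on which two binary vectors differ."""
    return sum(a != b for a, b in zip(x1, x2)) / len(x1)

print(euclidean((0, 0), (3, 4)))                        # 5.0
print(manhattan((0, 0), (3, 4)))                        # 7
print(mismatch_proportion((1, 0, 1, 1), (1, 1, 0, 1)))  # 0.5
```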
14.1.2 Example: K-means
Perhaps the best known clustering algorithm is K-means. It has the advantage of being
simple and intuitive. The objective function that it ultimately minimises (over the partitioning
S = (S1, . . . , SK)) is
$$\sum_{k=1}^{K}\frac{1}{2|S_k|}\sum_{i,j\in S_k}\|x_i-x_j\|^2,$$
the sum of squared Euclidean distances between every distinct pair of observations within each
cluster (appropriately scaled). It can be shown (using a decomposition similar to that of ANOVA)
that this is equivalent to minimising
$$\sum_{k=1}^{K}\sum_{i\in S_k}\|x_i-\bar{x}_{S_k}\|^2,\qquad\bar{x}_{S_k}=\frac{1}{|S_k|}\sum_{i\in S_k}x_i,$$
which is simply the sum of the squared Euclidean distances between each data point and the
mean of its cluster.
The following algorithm often does a good job finding such a clustering:
1. Randomly assign a cluster index to each element of G^(0).
2. Calculate cluster means (centroids):
$$\bar{x}_{S_k^{(t-1)}}=\frac{1}{|S_k^{(t-1)}|}\sum_{i\in S_k^{(t-1)}}x_i,\quad k=1,\dots,K.$$
3. Calculate distances of each data point from each mean:
$$d_{ik}=\|x_i-\bar{x}_{S_k^{(t-1)}}\|,\quad i=1,\dots,n,\ k=1,\dots,K.$$
4. Reassign each point to its nearest mean:
$$G_i^{(t)}=\operatorname*{argmin}_{k}\ d_{ik}.$$
(Here and elsewhere, argmin_a h(a) is that a which minimises the value of h(a).)
5. Repeat from Step 2 until G^(t) = G^(t−1).
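The steps above can be sketched in a few lines. To keep the example reproducible, the initial assignment is passed in explicitly rather than drawn at random, cluster labels are 0-based, and an empty cluster is reseeded at the first observation; all three are assumptions of this sketch rather than part of the algorithm as stated. The six data points are hypothetical.

```python
def kmeans(xs, init_labels, K, max_iter=100):
    """Iterate centroid calculation and reassignment until the labels stop changing."""
    labels = list(init_labels)
    centroids = []
    for _ in range(max_iter):
        # Step 2: cluster means (centroids).
        centroids = []
        for k in range(K):
            members = [x for x, g in zip(xs, labels) if g == k]
            if not members:          # assumption: reseed an empty cluster at the first point
                members = [xs[0]]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Steps 3-4: squared distances give the same nearest centroid as distances.
        new_labels = []
        for x in xs:
            d2 = [sum((a - c) ** 2 for a, c in zip(x, cen)) for cen in centroids]
            new_labels.append(d2.index(min(d2)))
        # Step 5: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
xs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(xs, init_labels=[0, 1, 0, 1, 0, 1], K=2)
print(labels)      # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Because the result can depend on the starting assignment, it is common to run the algorithm from several starting points and keep the clustering with the smallest objective value.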
14.1.3 Extension: K-medoids
A generalisation of K-means is the K-medoids technique. We define a medoid x̃_Sk of cluster k
to be a specific observation that has the smallest summed distance (however defined) to all other
observations in Sk:
$$\tilde{x}_{S_k}=\operatorname*{argmin}_{x_j:\,j\in S_k}\sum_{i\in S_k}d(x_j,x_i).$$
The method of K-medoids, or partitioning around medoids (PAM), minimises the sum of these
distances:
$$\operatorname*{argmin}_{S}\sum_{k=1}^{K}\sum_{i\in S_k}d(x_i,\tilde{x}_{S_k}).$$
This method is much more expensive computationally than K-means, but it is also more robust
to outliers.
It is typically fit as follows:
1. Randomly assign a cluster index to each element of G^(0).
2. Calculate cluster medoids:
$$\tilde{x}_{S_k^{(t-1)}}=\operatorname*{argmin}_{x_j:\,j\in S_k^{(t-1)}}\sum_{i\in S_k^{(t-1)}}d(x_j,x_i),\quad k=1,\dots,K.$$
3. Calculate distances of each data point from each medoid:
$$d_{ik}=d(x_i,\tilde{x}_{S_k^{(t-1)}}),\quad i=1,\dots,n,\ k=1,\dots,K.$$
4. Reassign each point to its nearest medoid:
$$G_i^{(t)}=\operatorname*{argmin}_{k}\ d_{ik}.$$
5. Repeat from Step 2 until G^(t) = G^(t−1).
14.1.4 Hierarchical clustering
Hierarchical clustering, instead of partitioning the data into K groups, produces a hierarchy of
clusterings whose sizes range from 1 (no splits) to as high as n (every observation its own cluster).
This clustering is typically visualised in a dendrogram, a tree diagram whose branching represents
subdivisions of the data into clusters and whose height represents the distances between points
or clusters.
The algorithms for producing these clusterings are either agglomerative, in that they start
with each observation in its own cluster, then combine nearest observations into clusters, nearest
clusters into bigger clusters, etc.; or divisive, starting with the whole dataset, then splitting it
into a small number of clusters, those clusters into smaller clusters, etc.
The former require defining a notion of a distance between clusters. The latter require
defining a criterion based on which a cluster is split.
Some common examples of distances are provided in the following table:
Single linkage                 d(S1, S2) = min{d(xi, xj) : i ∈ S1, j ∈ S2}
Complete linkage               d(S1, S2) = max{d(xi, xj) : i ∈ S1, j ∈ S2}
Average linkage (unweighted)   $d(S_1,S_2)=\frac{1}{|S_1||S_2|}\sum_{i\in S_1}\sum_{j\in S_2}d(x_i,x_j)$
Average linkage (weighted)     $d(S_1\cup S_2,S_3)=\frac{d(S_1,S_3)+d(S_2,S_3)}{2}$
Centroid                       $d(S_1,S_2)=\|\bar{x}_{S_1}-\bar{x}_{S_2}\|$
Ward                           $d(S_1,S_2)=\sum_{i\in S_1\cup S_2}\|x_i-\bar{x}_{S_1\cup S_2}\|^2-\sum_{i\in S_1}\|x_i-\bar{x}_{S_1}\|^2-\sum_{i\in S_2}\|x_i-\bar{x}_{S_2}\|^2=\frac{|S_1||S_2|}{|S_1|+|S_2|}\|\bar{x}_{S_1}-\bar{x}_{S_2}\|^2$
A framework that is useful for expressing different between-cluster distances is the Lance–Williams
framework. Suppose that we are given three clusters, S1, S2, and S3, and that we have some metric
for evaluating pairwise distances between them, i.e., d(S1, S2), d(S1, S3), and d(S2, S3). Then,
we define the distance resulting from combining S1 and S2 in terms of these pairwise distances
and coefficients α1, α2, β, and γ:
$$d(S_1\cup S_2,S_3)=\alpha_1d(S_1,S_3)+\alpha_2d(S_2,S_3)+\beta d(S_1,S_2)+\gamma\,|d(S_1,S_3)-d(S_2,S_3)|.$$
This, plus the distance metric between individual points (which applies when the clusters have
only one observation in them), allows us to define and efficiently calculate distances between
clusters.
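For example, the unweighted average linkage can be expressed in this framework as follows:
$$\begin{aligned}d(S_1\cup S_2,S_3)&=\frac{1}{|S_1\cup S_2||S_3|}\sum_{i\in S_1\cup S_2}\sum_{j\in S_3}d(x_i,x_j)\\&=\frac{1}{(|S_1|+|S_2|)|S_3|}\left[\sum_{i\in S_1}\sum_{j\in S_3}d(x_i,x_j)+\sum_{i\in S_2}\sum_{j\in S_3}d(x_i,x_j)\right]\\&=\frac{|S_1||S_3|\,d(S_1,S_3)+|S_2||S_3|\,d(S_2,S_3)}{(|S_1|+|S_2|)|S_3|}\end{aligned}$$
$$\implies\alpha_1=\frac{|S_1|}{|S_1|+|S_2|},\quad\alpha_2=\frac{|S_2|}{|S_1|+|S_2|},\quad\beta=\gamma=0.$$
Ward’s method, the most popular hierarchical clustering criterion, similarly uses the squared
Euclidean distances d(xi, xj) = ∥xi − xj∥² between points and then
$$\alpha_1=\frac{|S_1|+|S_3|}{|S_1|+|S_2|+|S_3|},\quad\alpha_2=\frac{|S_2|+|S_3|}{|S_1|+|S_2|+|S_3|},$$
$$\beta=\frac{-|S_3|}{|S_1|+|S_2|+|S_3|},\quad\gamma=0.$$
Ward’s method joins the groups that will increase the within-group variance least.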
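The Lance–Williams update can be checked against a direct calculation. The sketch below does this for unweighted average linkage on three small hypothetical clusters of one-dimensional points, and also illustrates the role of γ using the standard single-linkage coefficients (α1 = α2 = 1/2, β = 0, γ = −1/2), which are not derived in these notes.

```python
def avg_linkage(A, B, d):
    """Unweighted average linkage: mean of all pairwise distances between A and B."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def lance_williams(d13, d23, d12, a1, a2, beta, gamma):
    """Distance from the merged cluster S1 u S2 to S3 from the pairwise distances."""
    return a1 * d13 + a2 * d23 + beta * d12 + gamma * abs(d13 - d23)

d = lambda u, v: abs(u - v)                     # distance between 1-D points
S1, S2, S3 = [0.0, 1.0], [4.0], [10.0, 12.0]    # hypothetical clusters
n1, n2 = len(S1), len(S2)

direct = avg_linkage(S1 + S2, S3, d)
updated = lance_williams(avg_linkage(S1, S3, d), avg_linkage(S2, S3, d),
                         avg_linkage(S1, S2, d),
                         n1 / (n1 + n2), n2 / (n1 + n2), 0.0, 0.0)
print(direct, updated)

# Single linkage: the update returns the smaller of d(S1, S3) and d(S2, S3).
single = lance_williams(9.0, 6.0, 3.0, 0.5, 0.5, 0.0, -0.5)
print(single)   # 6.0
```

The practical benefit is that, after a merge, distances to the new cluster are obtained from distances already stored, without revisiting the individual observations.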
14.1.5 Software
SAS:
Hierarchical: PROC CLUSTER (PROC TREE to visualise, PROC DISTANCE to preprocess),
PROC VARCLUS
Non-hierarchical: PROC FASTCLUS, PROC MODECLUS, PROC FASTKNN
R:
Hierarchical: stats::hclust, cluster::agnes
Non-hierarchical: stats::kmeans, cluster::pam
Many others
14.1.6 Assessing
Lastly, we briefly discuss how a clustering G may be assessed: given a clustering G, how good is it?
Ideally, this measurement should be “fair” to (i.e., invariant to) the number of clusters K. For
example, in K-means clustering, splitting a cluster will never increase the within-cluster variances,
and so those cannot be used as a criterion.
A popular method, inspired by Kmedioid clustering, is the silhouettes. For each i = 1, . . . , n,
let
a(i) =
1
SGi  − 1
∑
j∈SGi
d(xi,xj)
b(i) = min
k ̸=Gi
1
Sk
∑
j∈Sk
d(xi,xj).
Observe that a(i) is the distance between i and other observations its own cluster and b(i) is the
distance between i and observations in the cluster nearest to i to which i does not belong. In a
good clustering each observation will be much closer to its own cluster than to its neighbouring
cluster, so b(i)≫ a(i).
Then, the silhouette of i is a value between −1 and +1 calculated as follows:
\[
s(i) =
\begin{cases}
\dfrac{b(i) - a(i)}{\max(a(i), b(i))} & \text{if } |S_{G_i}| > 1, \\
0 & \text{otherwise.}
\end{cases}
\]
That is, s(i) evaluates how much closer i is to the rest of its cluster than to its nearest neighbouring cluster, and a higher silhouette indicates a better clustering for point i. The mean silhouette $n^{-1} \sum_{i=1}^{n} s(i)$ then measures the overall quality of the clustering.
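The definitions of a(i), b(i) and s(i) above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the course software: the function names `silhouettes` and `euclid` and the four-point toy data set are assumptions made for the example, and at least two clusters are assumed to exist.

```python
from statistics import mean

def silhouettes(points, labels, dist):
    """Return the silhouette s(i) of every observation.

    points: list of observations; labels: cluster label of each point;
    dist: function giving the distance d(x_i, x_j). Assumes K >= 2 clusters.
    """
    clusters = {}
    for idx, g in enumerate(labels):
        clusters.setdefault(g, []).append(idx)

    result = []
    for i, g in enumerate(labels):
        own = clusters[g]
        if len(own) == 1:
            result.append(0.0)          # singleton cluster: s(i) = 0 by definition
            continue
        # a(i): average distance to the other members of i's own cluster
        # (the term j = i contributes d = 0, hence the divisor |S| - 1)
        a = sum(dist(points[i], points[j]) for j in own) / (len(own) - 1)
        # b(i): smallest average distance to the members of any other cluster
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for k, members in clusters.items() if k != g
        )
        result.append((b - a) / max(a, b))
    return result

def euclid(x, y):
    return sum((u - v) ** 2 for u, v in zip(x, y)) ** 0.5

# Toy data (hypothetical): two tight pairs of points far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouettes(pts, [1, 1, 2, 2], euclid)   # the natural grouping
bad = silhouettes(pts, [1, 2, 1, 2], euclid)    # a deliberately poor grouping
print(mean(good), mean(bad))
```

The natural grouping gives a mean silhouette close to +1, while the poor grouping gives a negative one, in line with the interpretation above.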
14.1.7 Examples
Example 14.1. Hierarchical and non-hierarchical clustering and assessment illustrated on Edgar
Anderson’s Iris data.
14.2 Model-based clustering
14.2.1 Mixture Models
Lastly, we turn to model-based clustering. We will discuss the theoretical underpinnings of this approach (mixture models) and an important special case, Gaussian clustering, together with its parametrisation. The Expectation–Maximisation algorithm, often used to estimate these models, will also be described, as it is useful in a wide variety of circumstances, but it is not examinable.
A finite mixture model is a probability model under which each observation comes from one
of several distributions, but we do not observe from which one. (Infinite mixture models exist as
well, but they are outside of the scope of this class.)
A mixture model is specified as follows. We set K to be the number of distributions (clusters), and a collection of K density functions on the support of $\boldsymbol{x}_i$, $f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k)$ (for k = 1, . . . , K), each having a parameter vector $\boldsymbol{\theta}_k$ (e.g., its expectation), which we do not know and must estimate. We also postulate K (unknown) probabilities $\pi_k$ that an observation (any observation) comes from cluster k. (Standard restrictions apply: $0 \le \pi_k \le 1$, $\sum_{k=1}^{K} \pi_k = 1$.)
For brevity, we define $\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)^\top$, a vector of these probabilities; and $\Psi = \{\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K, \boldsymbol{\pi}\}$, the collection of all model parameters. Then, we assume the following data-generating process: for each i = 1, . . . , n,
1. Sample $G_i \in \{1, \dots, K\}$ with $\Pr(G_i = k; \boldsymbol{\pi}) = \pi_k$.
2. Sample $\boldsymbol{X}_i \mid G_i \sim f_{G_i}(\cdot; \boldsymbol{\theta}_{G_i})$.
3. Observe $\boldsymbol{X}_i$, and “forget” $G_i$.
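The pdf of this mixture density is
\[
f_{\boldsymbol{X}_i}(\boldsymbol{x}_i; \Psi) = \sum_{k=1}^{K} \pi_k f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k). \tag{14.1}
\]
We wish to estimate the parameters $\Psi$ from the sample $\boldsymbol{x} = [\boldsymbol{x}_1, \dots, \boldsymbol{x}_n]$. This leads to the likelihood
\[
L_{\boldsymbol{x}}(\Psi) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k). \tag{14.2}
\]
This formulation is convenient for a number of reasons. It is a probability model for the $\boldsymbol{X}_i$s, and therefore we can use it to obtain a soft clustering: rather than a hard clustering that assigns a point to a single cluster, we can apportion an observation’s membership by how likely it is to have come from each cluster. An application of Bayes’s rule and (14.1) gives
\[
\Pr(G_i = k \mid \boldsymbol{x}_i; \Psi) = \frac{\pi_k f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k)}{\sum_{k'=1}^{K} \pi_{k'} f_{k'}(\boldsymbol{x}_i; \boldsymbol{\theta}_{k'})}.
\]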
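To make the soft-clustering formula concrete, here is a small sketch for a univariate two-component Gaussian mixture. The weights, means, and the helper names `mixture_pdf` and `memberships` are hypothetical choices for illustration only.

```python
from statistics import NormalDist

def mixture_pdf(x, weights, components):
    """Mixture density (14.1): sum_k pi_k f_k(x)."""
    return sum(w * c.pdf(x) for w, c in zip(weights, components))

def memberships(x, weights, components):
    """Posterior probabilities Pr(G = k | x) obtained from Bayes's rule."""
    total = mixture_pdf(x, weights, components)
    return [w * c.pdf(x) / total for w, c in zip(weights, components)]

# Hypothetical mixture: 30% N(0, 1) and 70% N(4, 1).
weights = [0.3, 0.7]
components = [NormalDist(0.0, 1.0), NormalDist(4.0, 1.0)]
for x in (0.0, 2.0, 4.0):
    print(x, [round(p, 3) for p in memberships(x, weights, components)])
```

At the midpoint x = 2 the two component densities are equal, so the membership probabilities fall back to the prior weights; near either mean, one cluster dominates.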
We can also embed it into a hierarchical model (a meaning distinct from the hierarchical clustering above), in which either the $\boldsymbol{x}_i$s are parameters of some model for the data or for the observation process, or the $\boldsymbol{\theta}$s are functions of some hyperparameters. Lastly, the fact that we have a well-defined likelihood facilitates model selection.
14.2.2 Multivariate normal clusters
As with other analysis scenarios discussed in this course, the multivariate normal distribution
provides a useful formulation for the clusters. Consider the following parametrisation:
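\[
f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k) = \frac{1}{(2\pi)^{p/2} |\Sigma(\boldsymbol{\theta}_k)|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\boldsymbol{x}_i - \boldsymbol{\mu}(\boldsymbol{\theta}_k))^\top \Sigma(\boldsymbol{\theta}_k)^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}(\boldsymbol{\theta}_k)) \right\}.
\]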
Here, µ(θk) is the mean vector of cluster k (e.g., first p elements of θk), and Σ(θk) is the model
for the variances. We may also have different clusters “share” elements of θ, and a more general
case is
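\[
f_k(\boldsymbol{x}_i; \boldsymbol{\theta}) = \frac{1}{(2\pi)^{p/2} |\Sigma_k(\boldsymbol{\theta})|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\boldsymbol{x}_i - \boldsymbol{\mu}_k(\boldsymbol{\theta}))^\top \Sigma_k(\boldsymbol{\theta})^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k(\boldsymbol{\theta})) \right\}, \tag{14.3}
\]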
where µk(θ) and Σk(θ) “extract” the appropriate elements from θ.
One advantage of multivariate normal clusters is their flexibility in specifying cluster size and shape. (Recall your exercises from Week 1.) Recall the eigendecomposition of the covariance matrix $\Sigma = P \Lambda P^\top$, with P orthogonal and $\Lambda$ diagonal and nonnegative. Let us further parametrise it as
\[
\Sigma = \lambda P A P^\top,
\]
with $P \in M_{p,p}$ orthogonal, $A \in M_{p,p}$ diagonal and nonnegative with $|A| = 1$ (unimodular), and scalar $\lambda > 0$. This allows us to interpret the structure of the matrix in simple, substantive terms.
Starting with $\lambda$, recall that the determinant of a matrix can be viewed as its volume. Then,
\[
|\Sigma| = \lambda^p |P| |A| |P^\top| = \lambda^p,
\]
which makes $\lambda$ the “spread”, “size”, or “volume” of the cluster.
To interpret the diagonal, unimodular matrix A, observe that if $A = I_p$, then
\[
\Sigma = \lambda P A P^\top = \lambda P P^\top = \lambda I_p,
\]
making the cluster spherical, with equal variances on all dimensions. Similarly, if some diagonal elements of A are much larger than others, then the cluster will be an ellipsoid more stretched in one direction than in others.
Lastly, observe that if $P = I_p$, then
\[
\Sigma = \lambda P A P^\top = \lambda A,
\]
an ellipsoid whose axes are parallel to the coordinate axes, implying that the elements of $\boldsymbol{X}_i$ within each cluster are uncorrelated with unequal variances. More generally, P controls the rotation of the ellipsoid: the correlation between the dimensions and the orientation of the cluster.
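For p = 2 the decomposition is easy to explore numerically. The sketch below is only an illustration: the function name `covariance_2d`, the parametrisation of A as diag(a, 1/a), and the use of a rotation angle to define P are assumptions made for the example.

```python
import math

def covariance_2d(lam, a, angle):
    """Build Sigma = lam * P A P^T for p = 2.

    lam: volume parameter (> 0); a: first diagonal element of A, so that
    A = diag(a, 1/a) has determinant 1; angle: rotation angle defining P.
    """
    c, s = math.cos(angle), math.sin(angle)
    d1, d2 = lam * a, lam / a             # diagonal of lam * A
    # P = [[c, -s], [s, c]]; multiply out P diag(d1, d2) P^T
    s11 = c * c * d1 + s * s * d2
    s22 = s * s * d1 + c * c * d2
    s12 = c * s * (d1 - d2)
    return [[s11, s12], [s12, s22]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

sigma = covariance_2d(lam=3.0, a=4.0, angle=math.pi / 6)
print(sigma, det2(sigma))    # the determinant equals lam**2 = 9 up to rounding
```

Setting a = 1 gives a spherical cluster whatever the angle, and setting angle = 0 gives an axis-aligned ellipse, matching the two special cases discussed above.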
When it comes to estimating K clusters, we can permit the $\lambda$s, the As, and the Ps to vary between the clusters, be constant between the clusters, or, for A and P, be fixed at the identity. Each combination embodies a different assumption about the shape of the clusters and the relationship between them; and, in general, the more we permit to vary, the more parameters we must estimate and the more data we therefore require. Generally,
1. For a mixture of K clusters, we must, invariably, estimate the cluster membership probabilities $\pi_1, \dots, \pi_K$ (K − 1 parameters) and cluster means $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K$ (Kp parameters).
2. Then, $\lambda$ can be constrained $\lambda_1 = \lambda_2 = \dots = \lambda_K$ (1 parameter) or allowed to vary (K parameters).
3. Then, A can be fixed $A_1 = A_2 = \dots = A_K = I_p$ (0 parameters), constrained $A_1 = A_2 = \dots = A_K$ (p − 1 parameters), or allowed to vary (K(p − 1) parameters).
4. Lastly, if A is not fixed at the identity matrix, P can either be fixed $P_1 = P_2 = \dots = P_K = I_p$ (0 parameters), constrained $P_1 = P_2 = \dots = P_K$ ($\binom{p}{2}$ parameters), or allowed to vary ($K \binom{p}{2}$ parameters).
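The counting rules in the list above can be collected into one small function. This is a sketch under the assumption that a model is written as a three-letter string for (λ, A, P) using the letters I, E and V; the function name `n_parameters` is hypothetical.

```python
from math import comb

def n_parameters(K, p, model):
    """Count free parameters of a K-cluster, p-dimensional Gaussian mixture.

    model is a three-letter string for (lambda, A, P), each letter being
    'I' (fixed at identity), 'E' (equal across clusters) or 'V' (varying).
    """
    vol, shape, orient = model
    count = (K - 1) + K * p                                   # mixing probabilities and means
    count += {"E": 1, "V": K}[vol]                            # lambda
    count += {"I": 0, "E": p - 1, "V": K * (p - 1)}[shape]    # A
    if shape != "I":                                          # P only matters if A is not I
        count += {"I": 0, "E": comb(p, 2), "V": K * comb(p, 2)}[orient]
    return count

for model in ("EII", "VII", "EEE", "VVV"):
    print(model, n_parameters(K=3, p=4, model=model))
```

As a sanity check, with K = 3 and p = 4 the fully unconstrained model has 2 + 12 parameters for the weights and means plus 3 × 10 for three unrestricted covariance matrices, and the function returns 44.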
The different cluster shapes identified by their constraint triple (λ,A, P ) encoding being fixed
at identity as I, being constrained to equality between clusters as E, and being allowed to vary
freely as V are given in the following figure:
Incorporated under the terms of Creative Commons Attribution 3.0 Unported license from Figure 2 of:
Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery (2016). mclust 5: Clustering, Classification and
Density Estimation Using Gaussian Finite Mixture Models. The R Journal 8:1, pages 289–317.
14.2.3 Model selection
As mentioned before, model-based clustering requires one to specify both the number of clusters K and the within-cluster models $f_k(\boldsymbol{x}_i; \Psi)$. In the case of multivariate normal clustering, we have a large number of possible specifications for the $\Sigma_k$s, and the number of parameters can grow quickly for “XXV” models in particular.
At the same time, because it is likelihood-based, a variety of standard model-selection techniques can be used. For example, BIC is recommended:
\[
\mathrm{BIC}_\nu = -2 \log L_{\boldsymbol{x}}(\hat{\Psi}) + \nu \log n,
\]
where $\nu$ is the number of parameters estimated. (Here, lower BIC is better, but some authors and software packages use $2 \log L_{\boldsymbol{x}}(\hat{\Psi}) - \nu \log n$, with higher BIC being better.)
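A sketch of how the criterion is applied in the "lower is better" convention follows. The log-likelihood values and parameter counts are made-up numbers used purely to show the mechanics; they are not results from any data set in these notes.

```python
from math import log

def bic(loglik, n_params, n_obs):
    """BIC in the 'lower is better' convention: -2 log L + nu log n."""
    return -2.0 * loglik + n_params * log(n_obs)

# Hypothetical fitted models: (label, maximised log-likelihood, nu)
candidates = [("K=2", -310.0, 9), ("K=3", -290.0, 14), ("K=4", -287.0, 19)]
n = 150
scores = {label: bic(ll, nu, n) for label, ll, nu in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

Going from K = 3 to K = 4 improves the log-likelihood only slightly, so the penalty term dominates and the three-cluster model is preferred in this illustration.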
Substantive considerations also matter. For example, how many clusters does our research
hypothesis predict? Do we expect correlations between dimensions to vary between clusters?
14.2.4 Software
SAS: PROC MBC
R: package mclust and others
14.2.5 Examples
Example 14.2. Model-based clustering and model selection illustrated on Edgar Anderson’s
Iris data.
14.2.6 Expectation–Maximisation Algorithm
Lastly, we discuss the typical computational approach for estimating these mixture models. The $\log L_{\boldsymbol{x}}(\Psi)$ from (14.2) is computationally tractable, but it does not simplify or decompose much, because while the logarithm of a product is a sum of the logarithms, the logarithm of a sum does not, in general, simplify further. Thus, we introduce the Expectation–Maximisation (EM) algorithm:
1. Introduce an unobserved (latent) variable $G_i$, i = 1, . . . , n, giving the cluster membership of i.
2. Suppose that $G_1, \dots, G_n$ are observed; then the complete-data likelihood is
\[
L_{\boldsymbol{x}, G_1, \dots, G_n}(\Psi) = \prod_{i=1}^{n} \pi_{G_i} f_{G_i}(\boldsymbol{x}_i; \boldsymbol{\theta}_{G_i}):
\]
we “know” the exact cluster from which each observation came, so we no longer have to sum over the possible clusters. Then, the log-likelihood decomposes into two summations:
\[
\log L_{\boldsymbol{x}, G_1, \dots, G_n}(\Psi) = \sum_{i=1}^{n} \log \pi_{G_i} + \sum_{i=1}^{n} \log f_{G_i}(\boldsymbol{x}_i; \boldsymbol{\theta}_{G_i}), \tag{14.4}
\]
one that depends only on the $\pi_k$s and the other only on the $\boldsymbol{\theta}_k$s.
3. Start with an initial guess $\Psi^{(0)}$.
4. Iterate the E-step and M-step described below to convergence.
E-step
The Expectation step consists of starting with a parameter guess $\Psi^{(t-1)}$ and evaluating
\[
Q(\Psi \mid \Psi^{(t-1)}) = E_{G_1, \dots, G_n \mid \boldsymbol{x}; \Psi^{(t-1)}}\left( \log L_{\boldsymbol{x}, G_1, \dots, G_n}(\Psi) \right):
\]
the expected value of the complete-data log-likelihood. We can evaluate it by calculating (using Bayes’s rule)
\[
q_{ik}^{(t-1)} = \Pr(G_i = k \mid \boldsymbol{x}; \Psi^{(t-1)}) = \frac{\pi_k^{(t-1)} f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k^{(t-1)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t-1)} f_{k'}(\boldsymbol{x}_i; \boldsymbol{\theta}_{k'}^{(t-1)})}, \quad i = 1, \dots, n, \quad k = 1, \dots, K,
\]
then substituting them in as
\[
Q(\Psi \mid \Psi^{(t-1)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k + \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k). \tag{14.5}
\]
Observe that, like (14.4), (14.5) decomposes into a summation that depends only on the $\pi_k$s and a summation that depends only on the $\boldsymbol{\theta}_k$s.
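M-step
The Maximisation step then consists of maximising $Q(\Psi \mid \Psi^{(t-1)})$ with respect to $\Psi$ to obtain the next parameter guess:
\[
\Psi^{(t)} = \arg\max_{\Psi} Q(\Psi \mid \Psi^{(t-1)}), \quad \text{s.t. } \sum_{k=1}^{K} \pi_k = 1.
\]
Conveniently, the form (14.5) separates the $\pi_k$s from the $\boldsymbol{\theta}_k$s, and so we can maximise them separately (i.e., if we differentiate with respect to one, the summation involving the other will vanish).
Maximising (14.5) with respect to the $\boldsymbol{\theta}_k$s, we take the derivative
\[
\frac{\partial Q(\Psi \mid \Psi^{(t-1)})}{\partial \boldsymbol{\theta}_k} = \sum_{i=1}^{n} q_{ik}^{(t-1)} \frac{\partial \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k)}{\partial \boldsymbol{\theta}_k},
\]
and set it to 0. This is a weighted maximum likelihood estimator.
Maximising (14.5) with respect to the $\pi_k$s is also straightforward. We will use Lagrange multipliers to do so:
\[
\mathrm{Lag}(\boldsymbol{\pi}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k - \alpha \left( \sum_{k=1}^{K} \pi_k - 1 \right).
\]
Differentiating,
\[
\mathrm{Lag}'_k(\boldsymbol{\pi}) = \sum_{i=1}^{n} q_{ik}^{(t-1)} \pi_k^{-1} - \alpha.
\]
Setting to 0,
\[
\pi_k = \sum_{i=1}^{n} q_{ik}^{(t-1)} / \alpha.
\]
Summing and solving for $\alpha$,
\[
\sum_{k=1}^{K} \pi_k = \frac{1}{\alpha} \sum_{k=1}^{K} \sum_{i=1}^{n} q_{ik}^{(t-1)} = 1,
\qquad
\alpha = \sum_{k=1}^{K} \sum_{i=1}^{n} q_{ik}^{(t-1)}.
\]
Therefore,
\[
\pi_k^{(t)} = \frac{\sum_{i=1}^{n} q_{ik}^{(t-1)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} q_{ik'}^{(t-1)}} = \frac{1}{n} \sum_{i=1}^{n} q_{ik}^{(t-1)},
\]
where the last equality holds because, for each i, the $q_{ik}^{(t-1)}$ sum to 1 over k.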
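The E-step and M-step above can be sketched for the simplest case, a univariate Gaussian mixture in which every cluster has its own mean and variance, so the weighted MLEs have closed forms. This is an illustrative sketch rather than the algorithm as implemented in any package: the function names, the simulated data, the starting values, and the use of a fixed number of iterations in place of a convergence check are all assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def em_gaussian_mixture(x, mu, sigma2, pi, n_iter=200):
    """EM for a univariate K-component Gaussian mixture.

    mu, sigma2, pi are lists holding the initial guess Psi^(0); they are
    updated in place and returned after n_iter iterations.
    """
    K, n = len(mu), len(x)
    for _ in range(n_iter):
        # E-step: q[i][k] = Pr(G_i = k | x_i; current parameters)
        q = []
        for xi in x:
            w = [pi[k] * normal_pdf(xi, mu[k], sigma2[k]) for k in range(K)]
            total = sum(w)
            q.append([wk / total for wk in w])
        # M-step: weighted MLEs for each cluster, then pi_k = (1/n) sum_i q_ik
        for k in range(K):
            nk = sum(q[i][k] for i in range(n))
            mu[k] = sum(q[i][k] * x[i] for i in range(n)) / nk
            sigma2[k] = sum(q[i][k] * (x[i] - mu[k]) ** 2 for i in range(n)) / nk
            pi[k] = nk / n
    return mu, sigma2, pi

# Simulated data (hypothetical): 300 draws from N(0, 1) and 200 from N(5, 1).
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(300)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]
mu, sigma2, pi = em_gaussian_mixture(
    data, mu=[-1.0, 6.0], sigma2=[1.0, 1.0], pi=[0.5, 0.5])
print(mu, sigma2, pi)
```

With well-separated clusters the estimates should land close to the simulation values (means near 0 and 5, weights near 0.6 and 0.4); with overlapping clusters or poor starting values EM may converge to a different local maximum.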
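“Sharing” θs
Lastly, recall that when we select one of the “E” models and (14.3) in Section 14.2.2, we no longer have a separate $\boldsymbol{\theta}_k$ for every $f_k$. We may then need to redefine $\boldsymbol{\theta} \in \mathbb{R}^{Kp+1}$ (or more) to contain parameters for all groups (separate means, distinct variance parameters, etc.), and $f_k(\boldsymbol{x}_i; \boldsymbol{\theta})$ to “extract” those elements of $\boldsymbol{\theta}$ that it needs, with $\Psi = (\boldsymbol{\theta}, \boldsymbol{\pi})$.
Inferentially, $\boldsymbol{\theta}$ replaces $\boldsymbol{\theta}_k$ in all derivations above. In particular,
\[
Q(\Psi \mid \Psi^{(t-1)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k + \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}),
\]
so
\[
\frac{\partial Q(\Psi \mid \Psi^{(t-1)})}{\partial \boldsymbol{\theta}} = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \frac{\partial \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}},
\]
which is still a weighted MLE, but now it is joint for all groups, and without simplification.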
14.3 Additional resources
An alternative presentation of these concepts can be found in JW Sec. 12.1–12.5. Additional
software demonstration of model-based clustering can be found in
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: clustering,
classification and density estimation using Gaussian finite mixture models. The R
Journal, 8(1), 289–317.
UNSW MATH5855 2021T3 Lecture 15 Copulae
15 Copulae
15.1 Formulation
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
15.1 Formulation
For the multivariate normal, independence is equivalent to absence of correlation between any
two components. In this case the joint cdf is a product of the marginals. When the independence
is violated, the relation between the joint multivariate distribution and the marginals is more
involved. An interesting concept that can be used to describe this more involved relation is the
concept of a copula. We focus on the two-dimensional case for simplicity. Then the copula is a
function $C : [0, 1]^2 \to [0, 1]$ with the properties:
i) C(0, u) = C(u, 0) = 0 for all u ∈ [0, 1].
ii) C(u, 1) = C(1, u) = u for all u ∈ [0, 1].
iii) For all pairs (u1, u2), (v1, v2) ∈ [0, 1]× [0, 1] with u1 ≤ v1, u2 ≤ v2 :
C(v1, v2)− C(v1, u2)− C(u1, v2) + C(u1, u2) ≥ 0.
The name is due to the implication that the copula links the multivariate distribution to its
marginals. This is explicated in the following theorem:
Theorem 15.1 (Sklar’s Theorem). Let $F(\cdot, \cdot)$ be a joint cdf with marginal cdfs $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$. Then there exists a copula $C(\cdot, \cdot)$ with the property
\[
F(x_1, x_2) = C(F_{X_1}(x_1), F_{X_2}(x_2))
\]
for every pair $(x_1, x_2) \in \mathbb{R}^2$. When $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$ are continuous, the above copula is unique. Vice versa, if $C(\cdot, \cdot)$ is a copula and $F_{X_1}(\cdot)$, $F_{X_2}(\cdot)$ are cdfs, then the function $F(x_1, x_2) = C(F_{X_1}(x_1), F_{X_2}(x_2))$ is a joint cdf with marginals $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$.
Taking derivatives we also get:
\[
f(x_1, x_2) = c(F_{X_1}(x_1), F_{X_2}(x_2)) f_{X_1}(x_1) f_{X_2}(x_2), \tag{15.1}
\]
where
\[
c(u, v) = \frac{\partial^2}{\partial u \, \partial v} C(u, v)
\]
is the density of the copula. This relation clearly shows that the contribution to the joint density of $X_1, X_2$ comes from two parts: one that comes from the copula and is “responsible” for the dependence ($c(u, v) = \frac{\partial^2}{\partial u \, \partial v} C(u, v)$) and another that takes into account marginal information only ($f_{X_1}(x_1) f_{X_2}(x_2)$).
It is also clear that independence implies that the corresponding copula is $\Pi(u, v) = uv$ (this is called the independence copula).
These concepts also generalise to p dimensions with p > 2.
The following figure illustrates an independence copula:
[Figure: Independence copula, dim. d = 2. Panels show C(u,v) and c(u,v) over the unit square, each as a surface plot and as a contour plot.]
15.2 Common copula types
15.2.1 Elliptical copulae
An interesting example is the Gaussian copula. For p = 2 it is equal to:
\[
C_\rho(u, v) = \Phi_\rho(\Phi^{-1}(u), \Phi^{-1}(v)) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} f_\rho(x_1, x_2) \, dx_2 \, dx_1.
\]
Here $f_\rho(\cdot, \cdot)$ is the joint bivariate normal density with zero mean, unit variances and a correlation $\rho$, $\Phi_\rho(\cdot, \cdot)$ is its cdf, and $\Phi^{-1}(\cdot)$ is the inverse of the cdf of the standard normal. (This is “the formula that killed Wall Street”.) When $\rho = 0$ we see that we get $C_0(u, v) = uv$ (as is to be expected).
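The ρ = 0 case can be checked in one line from the definition; the only fact used is that the bivariate normal cdf with zero correlation factorises into the product of two standard normal cdfs:

```latex
\[
C_0(u, v) = \Phi_0\left(\Phi^{-1}(u), \Phi^{-1}(v)\right)
          = \Phi\left(\Phi^{-1}(u)\right) \Phi\left(\Phi^{-1}(v)\right)
          = uv .
\]
```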
Non-Gaussian copulae are much more important in practice, and inference methods for copulae are a hot topic in Statistics. The reason for the importance of non-Gaussian copulae is that Gaussian copulae do not allow us to model tail dependence reasonably well; that is, joint extreme events have virtually zero probability. Especially in financial applications, it is very important to be able to model dependence in the tails.
The t-copula, based on the multivariate t-distribution, does a slightly better job in tail behaviour. The multivariate t-distribution with variance parameter $\Sigma$ and $\nu$ degrees of freedom is defined as $T = Z / \sqrt{X / \nu}$, where $Z \sim N(0, \Sigma)$ and, independently, $X \sim \chi^2_\nu$. Note that $\mathrm{Var}(T) \ne \Sigma$.
The following figure illustrates a Gaussian copula:
[Figure: Normal copula, dim. d = 2, param.: (rho.1 = 0.9). Panels show C(u,v) and c(u,v) over the unit square, each as a surface plot and as a contour plot.]
The following figure illustrates a t-copula:
[Figure: t-copula, dim. d = 2, param.: (rho.1 = 0.9, df = 4.0). Panels show C(u,v) and c(u,v) over the unit square, each as a surface plot and as a contour plot.]
15.2.2 Archimedean copulae
The Gumbel–Hougaard copula is much more flexible in modelling dependence in the upper tails. For an arbitrary dimension p it is defined as
\[
C^{GH}_\theta(u_1, u_2, \dots, u_p) = \exp\left\{ -\left[ \sum_{j=1}^{p} (-\log u_j)^\theta \right]^{1/\theta} \right\},
\]
where $\theta \in [1, \infty)$ is a parameter that governs the strength of the dependence. You can easily see that the Gumbel–Hougaard copula reduces to the independence copula when $\theta = 1$ and to the Fréchet–Hoeffding upper bound copula $\min(u_1, \dots, u_p)$ when $\theta \to \infty$.
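The two limiting cases can be checked numerically. The snippet below is a small sketch; the function name `gumbel_hougaard` and the evaluation point are arbitrary choices for illustration, and θ = 50 stands in for "large θ".

```python
import math

def gumbel_hougaard(u, theta):
    """Gumbel-Hougaard copula C(u_1, ..., u_p) for theta >= 1, each u_j in (0, 1]."""
    total = sum((-math.log(uj)) ** theta for uj in u)
    return math.exp(-total ** (1.0 / theta))

u = (0.3, 0.6, 0.8)
print(gumbel_hougaard(u, 1.0), math.prod(u))      # theta = 1: independence copula
print(gumbel_hougaard(u, 50.0), min(u))           # large theta: close to min(u)
```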
The following figure illustrates a Gumbel–Hougaard copula:
[Figure: Gumbel copula, dim. d = 2, param.: 2. Panels show C(u,v) and c(u,v) over the unit square, each as a surface plot and as a contour plot.]
The Gumbel–Hougaard copula is also an example of the so-called Archimedean copulae. The latter are characterised by their generator $\phi(\cdot)$: a continuous, strictly decreasing, convex function from [0, 1] to $[0, \infty)$ such that $\phi(1) = 0$. Then the Archimedean copula is defined via the generator as follows:
\[
C(u_1, u_2, \dots, u_p) = \phi^{-1}(\phi(u_1) + \dots + \phi(u_p)).
\]
Here, $\phi^{-1}(t)$ is defined to be 0 if t is not in the image of $\phi(\cdot)$.
Example 15.2. Show that the Gumbel–Hougaard copula is an Archimedean copula with a generator $\phi(t) = (-\log t)^\theta$.
The benefit of using the Archimedean copulae is that they allow for a simple description of the p-dimensional dependence by using a function of one argument only (the generator). However, it is seen immediately that the Archimedean copula is symmetric in its arguments, and this limits its applicability for modelling dependencies that are not symmetric in their arguments. The so-called Liouville copulae are an extension of the Archimedean copulae and can also be used to model dependencies that are not symmetric in their arguments.
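The generator construction translates directly into code. The following sketch evaluates an Archimedean copula from a generator and its inverse and compares the result with the closed-form Gumbel–Hougaard expression at one point; it is a numerical check rather than a proof, and the helper name `archimedean` and the chosen θ and (u, v) are assumptions for the example.

```python
import math

def archimedean(u, phi, phi_inv):
    """Archimedean copula C(u) = phi^{-1}(phi(u_1) + ... + phi(u_p))."""
    return phi_inv(sum(phi(uj) for uj in u))

theta = 2.0
phi = lambda t: (-math.log(t)) ** theta            # Gumbel-Hougaard generator
phi_inv = lambda s: math.exp(-s ** (1.0 / theta))  # its inverse on [0, inf)

u, v = 0.3, 0.7
direct = math.exp(-((-math.log(u)) ** theta + (-math.log(v)) ** theta) ** (1 / theta))
print(archimedean((u, v), phi, phi_inv), direct)
# Symmetry in the arguments: swapping u and v leaves the value unchanged.
print(archimedean((u, v), phi, phi_inv), archimedean((v, u), phi, phi_inv))
```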
15.3 Margins, estimation, and simulation
So far, we have discussed the copula functions $C(\cdot, \cdot)$ and copula density $c(\cdot, \cdot)$, but using copulae also requires marginal cdfs $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$ and pdfs $f_{X_1}(\cdot)$ and $f_{X_2}(\cdot)$ (and so on, for more than two variables). We can, in fact, specify arbitrary univariate continuous distributions (e.g., normal, gamma, beta, Laplace, etc.) for them. This choice is driven by substantive considerations. (E.g., is the distribution positive?)
Then, the density (15.1), appropriately parametrised, provides the likelihood, e.g.,
\[
L(\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = f_{\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2}(x_1, x_2) = c_\rho\left(F_{X_1 \mid \boldsymbol{\theta}_1}(x_1), F_{X_2 \mid \boldsymbol{\theta}_2}(x_2)\right) f_{X_1 \mid \boldsymbol{\theta}_1}(x_1) f_{X_2 \mid \boldsymbol{\theta}_2}(x_2),
\]
which we can maximise in terms of the parameters of the copula and of the marginal distributions to obtain their estimates. A closed form for these estimators is rarely available, and so the maximisation is typically done numerically.
However, we might not want to specify margins in the first place. What can we do then? The empirical distribution function (edf) $\hat{F}(\cdot)$ is an unbiased estimator for the true cdf $F(\cdot)$. Given observations $X_{ij}$, i = 1, 2, j = 1, . . . , n, we can obtain one for each of the 2 variables:
\[
\hat{F}_{X_i}(x) = n^{-1} \sum_{j=1}^{n} I(X_{ij} \le x).
\]
We can then use it in the copula cdf, i.e.,
\[
F(x_1, x_2) = C(\hat{F}_{X_1}(x_1), \hat{F}_{X_2}(x_2)).
\]
How do we estimate the parameters of the copula? Although $\hat{F}(\cdot)$ is straightforward, $\hat{f}(\cdot)$ is not and requires further assumptions and tuning parameters (e.g., kernel bandwidth). This means that the likelihood $L(\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$ is no longer available to maximise. However, other methods are possible. Typically, we convert the data into empirical quantiles $P_{ij} = \frac{n}{n+1} \hat{F}_{X_i}(X_{ij})$, with denominator n + 1 used to ensure that the $P_{ij}$ run from $\frac{1}{n+1}$ to $\frac{n}{n+1}$. The resulting empirical quantiles will be uniform but maintain their correlations (approximately). Then, we can tune our copula function’s parameters until the correlations it induces among the empirical quantiles match their observed correlations.
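Computing the empirical quantiles amounts to ranking each variable and dividing by n + 1. A minimal sketch for one variable follows; the function name `pseudo_observations` and the four-value sample are made up for illustration, and the quadratic-time rank computation is chosen for clarity rather than speed.

```python
def pseudo_observations(x):
    """Return P_j = n/(n+1) * F_hat(x_j) for each observation in x."""
    n = len(x)
    # n * F_hat(x_j) is the number of observations <= x_j, i.e. the rank of x_j
    ranks = [sum(1 for other in x if other <= xj) for xj in x]
    return [r / (n + 1) for r in ranks]

sample = [2.7, 0.4, 5.1, 1.9]
print(pseudo_observations(sample))   # ranks 3, 1, 4, 2 divided by n + 1 = 5
```

Applying this to each variable separately gives the $P_{ij}$, which keep the rank ordering (and hence the rank correlations) of the original data.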
Lastly, simulating copulae with parametric margins is straightforward, and simulating copulae with empirical margins is possible as well. $C(\cdot, \cdot)$ and $c(\cdot, \cdot)$ themselves represent a valid distribution with uniform margins and can therefore be used to make random dependent draws of marginally uniform quantiles $\boldsymbol{P}_\star = [P_{1\star}, P_{2\star}]^\top$. The variables on the original scale can be obtained using inverse-transform sampling as $X_{i\star} = F^{-1}_{X_i}(P_{i\star})$, i = 1, 2, for parametric margins and $X_{i\star} = \hat{F}^{-1}_{X_i}(P_{i\star})$, i = 1, 2, for empirical margins. Here, $\hat{F}^{-1}_{X_i}(\cdot)$ is the inverse of $\hat{F}_{X_i}(\cdot)$, typically smoothed in some way, since $\hat{F}_{X_i}(\cdot)$ represents a discrete distribution.
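As an illustration of the parametric case, the sketch below draws dependent uniform quantiles from a bivariate Gaussian copula and then applies inverse-transform sampling with exponential margins. The choice of copula, the exponential margins, the rates, and the function name `simulate` are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist

def simulate(rho, rates, n, rng):
    """Simulate n pairs with a Gaussian copula and Exponential(rate) margins."""
    phi = NormalDist().cdf
    out = []
    for _ in range(n):
        # Correlated standard normal pair with correlation rho
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        p = (phi(z1), phi(z2))                      # dependent uniform quantiles P*
        # Inverse-transform step: F^{-1}(p) = -log(1 - p) / rate for Exponential(rate)
        out.append(tuple(-math.log(1.0 - pi) / r for pi, r in zip(p, rates)))
    return out

rng = random.Random(7)
xs = simulate(rho=0.7, rates=(1.0, 0.5), n=5000, rng=rng)
print(sum(x1 for x1, _ in xs) / len(xs), sum(x2 for _, x2 in xs) / len(xs))
```

The sample means should be close to the exponential means 1/rate (here 1 and 2), while the dependence between the two coordinates comes entirely from the copula.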
15.4 Software
SAS: PROC COPULA
R: Packages copula, VineCopula, and others.
15.5 Examples
Example 15.3. Microwave Ovens example (with empirical and gamma margins).
Example 15.4. Stock and portfolio modelling.
15.6 Exercises
Exercise 15.1
The (p-dimensional) Clayton copula is defined for a given parameter $\theta > 0$ as
\[
C_\theta(u_1, u_2, \dots, u_p) = \left[ \sum_{i=1}^{p} u_i^{-\theta} - p + 1 \right]^{-1/\theta}.
\]
Show that it is an Archimedean copula and that its generator is $\phi(x) = \theta^{-1}(x^{-\theta} - 1)$.
UNSW MATH5855 2021T3 Lecture A Exercise Solutions
A Exercise Solutions
Note that these solutions omit the steps of differentiation and integration, as well as arithmetic,
as those can be performed by a computer.
0.1
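(a)
1. $\theta x_1 e^{-x_1(\theta + x_2)} \ge 0$ as long as $\theta$, $x_1$, and $x_2 > 0$.
2. $\int_0^\infty \int_0^\infty \theta x_1 e^{-x_1(\theta + x_2)} \, dx_2 \, dx_1 = 1$.
(b)
\[
\Pr(X_1 < t, X_2 < t) = F(t, t) = \int_0^t \int_0^t \theta x_1 e^{-x_1(\theta + x_2)} \, dx_2 \, dx_1
= \frac{t}{\theta + t} + e^{-t\theta} \left( \frac{\theta e^{-t^2}}{\theta + t} - 1 \right).
\]
(c)
\[
f_{X_1}(x_1) = \int_0^\infty \theta x_1 e^{-x_1(\theta + x_2)} \, dx_2 = \theta e^{-x_1 \theta} 1_{x_1 > 0} \sim \mathrm{Exponential}(\theta).
\]
Then $E(X_1) = \theta^{-1}$ and $\mathrm{Var}(X_1) = \theta^{-2}$.
(d)
\[
f_{X_2}(x_2) = \int_0^\infty \theta x_1 e^{-x_1(\theta + x_2)} \, dx_1 = \frac{\theta}{(\theta + x_2)^2} 1_{x_2 > 0},
\]
so
\[
f_{X_2 \mid X_1}(x_2 \mid x_1) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)} = \frac{\theta x_1 e^{-x_1(\theta + x_2)}}{\theta e^{-x_1 \theta}} = x_1 e^{-x_1 x_2} 1_{x_2 > 0} \sim \mathrm{Exponential}(x_1).
\]
(e)
$f_{X_2 \mid X_1}(x_2 \mid x_1) = x_1 e^{-x_1 x_2} \ne \frac{\theta}{(\theta + x_2)^2} = f_{X_2}(x_2)$. More simply, the conditional distribution of $X_2 \mid X_1$ depends on $X_1$.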
0.2
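(a)
Let
\[
Y = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} X = \begin{pmatrix} X_1 - X_2 \\ X_1 + X_2 \end{pmatrix}.
\]
Then
\[
\mathrm{Cov}(Y) = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}^\top = \sigma^2 \begin{pmatrix} 2 - 2\rho & 0 \\ 0 & 2\rho + 2 \end{pmatrix},
\]
so $\mathrm{Cov}(X_1 - X_2, X_1 + X_2) = 0$. Note that we only actually require
\[
\mathrm{Cov}(X_1 - X_2, X_1 + X_2) = \begin{pmatrix} 1 & -1 \end{pmatrix} \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0.
\]
(b)
\[
\mathrm{Cov}(X_1, X_2 - \rho X_1) = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} -\rho \\ 1 \end{pmatrix} = 0.
\]
(c)
\[
\mathrm{Var}(X_2 - bX_1) = \begin{pmatrix} -b & 1 \end{pmatrix} \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} -b \\ 1 \end{pmatrix} = \sigma^2 (b^2 - 2b\rho + 1).
\]
\[
\frac{\partial (b^2 - 2b\rho + 1)}{\partial b} = 2b - 2\rho \overset{\text{set}}{=} 0 \implies b = \rho,
\]
and $\frac{\partial^2 (b^2 - 2b\rho + 1)}{\partial b^2} = 2 > 0 \implies b = \rho$ is a minimum.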
0.3
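(a)
This is trivial, but for additional rigour, we can use Theorem 0.3, letting $A = \begin{pmatrix} I_{p_1} & 0_{p_1, p_2} \end{pmatrix} \in M_{p_1, p}$ and $b = 0$. Then $X_{(1)} = AX + b$, and
\[
\varphi_{X_{(1)}}(s) = \varphi_X\left\{ A^\top s \right\} = \varphi_X\left\{ \begin{pmatrix} I_{p_1} \\ 0_{p_2, p_1} \end{pmatrix} s \right\} = \varphi_X\left\{ \begin{bmatrix} s \\ 0 \end{bmatrix} \right\}.
\]
(b)
If $X_{(1)}$ and $X_{(2)}$ are independent, then $f_X(x) = f_{X_{(1)}}(x_{(1)}) f_{X_{(2)}}(x_{(2)})$. Then
\[
\begin{aligned}
\varphi_X(t) = E(e^{i t^\top X}) &= \int_{\mathbb{R}^p} e^{i t^\top x} f_X(x) \, dx \\
&= \int_{\mathbb{R}^{p_2}} \int_{\mathbb{R}^{p_1}} e^{i t_{(1)}^\top x_{(1)}} e^{i t_{(2)}^\top x_{(2)}} f_{X_{(1)}}(x_{(1)}) f_{X_{(2)}}(x_{(2)}) \, dx_{(1)} \, dx_{(2)} \\
&= \int_{\mathbb{R}^{p_1}} e^{i t_{(1)}^\top x_{(1)}} f_{X_{(1)}}(x_{(1)}) \, dx_{(1)} \int_{\mathbb{R}^{p_2}} e^{i t_{(2)}^\top x_{(2)}} f_{X_{(2)}}(x_{(2)}) \, dx_{(2)} \\
&= \varphi_X\left\{ \begin{bmatrix} t_{(1)} \\ 0 \end{bmatrix} \right\} \varphi_X\left\{ \begin{bmatrix} 0 \\ t_{(2)} \end{bmatrix} \right\}.
\end{aligned}
\]
Conversely, if
\[
\varphi_X(t) = \varphi_X\left\{ \begin{bmatrix} t_{(1)} \\ 0 \end{bmatrix} \right\} \varphi_X\left\{ \begin{bmatrix} 0 \\ t_{(2)} \end{bmatrix} \right\} = \varphi_{X_{(1)}}(t_{(1)}) \varphi_{X_{(2)}}(t_{(2)}),
\]
since always
\[
e^{-i t^\top x} = e^{-i t_{(1)}^\top x_{(1)}} e^{-i t_{(2)}^\top x_{(2)}},
\]
we can take the inverse of the Fourier transform (which the cf is),
\[
\begin{aligned}
f_X(x) &= (2\pi)^{-p} \int_{\mathbb{R}^p} \varphi_X(t) e^{-i t^\top x} \, dt \\
&= (2\pi)^{-p_1} (2\pi)^{-p_2} \int_{\mathbb{R}^{p_2}} \int_{\mathbb{R}^{p_1}} \varphi_{X_{(1)}}(t_{(1)}) \varphi_{X_{(2)}}(t_{(2)}) e^{-i t_{(1)}^\top x_{(1)}} e^{-i t_{(2)}^\top x_{(2)}} \, dt_{(1)} \, dt_{(2)} \\
&= (2\pi)^{-p_1} \int_{\mathbb{R}^{p_1}} \varphi_{X_{(1)}}(t_{(1)}) e^{-i t_{(1)}^\top x_{(1)}} \, dt_{(1)} \; (2\pi)^{-p_2} \int_{\mathbb{R}^{p_2}} \varphi_{X_{(2)}}(t_{(2)}) e^{-i t_{(2)}^\top x_{(2)}} \, dt_{(2)} \\
&= f_{X_{(1)}}(x_{(1)}) f_{X_{(2)}}(x_{(2)}).
\end{aligned}
\]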
0.4
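Using the notation from Example 0.2, write $X = P \Lambda P^\top$, and denote $z = P^\top y$. Now, since we constrain $\langle y, e_1 \rangle = 0$, we have $z_1 = \langle y, e_1 \rangle = 0$, so
\[
\frac{y^\top X y}{y^\top y} = \frac{y^\top P \Lambda P^\top y}{y^\top y} = \frac{z^\top \Lambda z}{z^\top z} = \frac{\sum_{i=2}^{p} \lambda_i z_i^2}{\sum_{i=2}^{p} z_i^2},
\]
which we maximise by setting $z = (0, 1, 0, \dots, 0)^\top$, resulting in $\frac{z^\top \Lambda z}{z^\top z} = \lambda_2$.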
0.5
First, let us show that an orthogonal projection matrix P has only 0 or 1 as possible eigenvalues. This stems directly from its idempotency: let $\lambda$ be an eigenvalue of P and y the corresponding eigenvector. Then,
\[
P^2 y = PPy = \lambda P y = \lambda^2 y,
\]
but idempotency implies that
\[
P^2 y = Py = \lambda y,
\]
and so $\lambda = \lambda^2$, forcing it to be either 0 or 1.
Now, spectral decomposition implies that $P = \sum_{i=1}^{n} \lambda_i e_i e_i^\top$, and so rk(P) is the number of its nonzero eigenvalues. Meanwhile,
\[
\mathrm{tr}(P) = \mathrm{tr}\left( \sum_{i=1}^{n} \lambda_i e_i e_i^\top \right) = \sum_{i=1}^{n} \lambda_i \, \mathrm{tr}(e_i^\top e_i) = \sum_{i=1}^{n} \lambda_i \cdot 1 = \mathrm{rk}(P).
\]
2.1
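(a)
Write the joint distribution of $Y_1$ and $Y_2$ as
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix};
\]
then,
\[
\mathrm{Var}\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} I_2 \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},
\]
and $Y_1$ and $Y_2$ are independent (being multivariate normal and uncorrelated) and identically distributed N(0, 2).
(b)
$P(\chi^2_2 < 2.41) = 0.7$ (i.e., pchisq(2.41,2)).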
2.2
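(a)
\[
Z \sim N\left( \begin{pmatrix} 4 \\ 7 \end{pmatrix}, \begin{pmatrix} 16 & -2 \\ -2 & 7 \end{pmatrix} \right).
\]
Hence $\mathrm{Cor}(Z_1, Z_2) = -\frac{2}{\sqrt{16 \times 7}}$.
(b)
Take $\begin{pmatrix} \tilde{X}_{(1)} \\ \tilde{X}_{(2)} \end{pmatrix} = \begin{pmatrix} X_1 \\ X_3 \\ X_2 \end{pmatrix}$, and rearrange to get that its distribution is
\[
N_3\left( \begin{pmatrix} 3 \\ 2 \\ -1 \end{pmatrix}, \begin{pmatrix} 3 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix} \right).
\]
Call its mean and variance $\tilde{\mu}$ and $\tilde{\Sigma}$. Then, $X_1, X_3 \mid X_2 \sim N\left( \tilde{\mu}_{(1)|(2)}, \tilde{\Sigma}_{(1)|(2)} \right)$, where
\[
\tilde{\mu}_{(1)|(2)} = \tilde{\mu}_{(1)} + \tilde{\Sigma}_{(1)(2)} \tilde{\Sigma}_{(2)(2)}^{-1} \left( \tilde{X}_{(2)} - \tilde{\mu}_{(2)} \right)
= \begin{pmatrix} 3 \\ 2 \end{pmatrix} + \begin{pmatrix} 2 \\ 1 \end{pmatrix} \frac{1}{3} (X_2 + 1)
= \begin{pmatrix} 3 \\ 2 \end{pmatrix} + \begin{pmatrix} 2/3 \\ 1/3 \end{pmatrix} (X_2 + 1),
\]
\[
\tilde{\Sigma}_{(1)|(2)} = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} - \begin{pmatrix} 2 \\ 1 \end{pmatrix} \frac{1}{3} \begin{pmatrix} 2 & 1 \end{pmatrix}
= \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} - \begin{pmatrix} 4/3 & 2/3 \\ 2/3 & 1/3 \end{pmatrix}
= \frac{1}{3} \begin{pmatrix} 5 & 1 \\ 1 & 5 \end{pmatrix}.
\]
In particular, for $x_2 = 0$ we get
\[
X_1, X_3 \mid X_2 = 0 \sim N\left( \begin{pmatrix} 3\tfrac{2}{3} \\ 2\tfrac{1}{3} \end{pmatrix}, \frac{1}{3} \begin{pmatrix} 5 & 1 \\ 1 & 5 \end{pmatrix} \right).
\]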
2.3
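Take $t \in \mathbb{R}^p$. Observe that $a_1 X_1 + \dots + a_n X_n = XA$ for $A = [a_1, \dots, a_n]^\top$ and $X = [X_1, \dots, X_n]$. Then, along the lines of Theorem 0.3,
\[
\begin{aligned}
\varphi_{a_1 X_1 + \dots + a_n X_n}(t) &= \prod_{j=1}^{n} \varphi_{a_j X_j}(t) = \prod_{j=1}^{n} e^{i a_j t^\top \mu_j - \frac{a_j^2}{2} t^\top \Sigma_j t} \\
&= e^{i t^\top \left( \sum_{j=1}^{n} a_j \mu_j \right) - \frac{1}{2} t^\top \left( \sum_{j=1}^{n} a_j^2 \Sigma_j \right) t} = \varphi_{N\left( \sum_{j=1}^{n} a_j \mu_j, \; \sum_{j=1}^{n} a_j^2 \Sigma_j \right)}(t)
\end{aligned}
\]
(by definition).
Then, substitute $\mu_i = \mu$, $\Sigma_i = \Sigma$, and $a_i = \frac{1}{n}$ for all i = 1, . . . , n to obtain the distribution of $\bar{X}$.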
2.4
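By Property 4, the conditional distribution $X_2 \mid X_1 = x_1$ must have the form $A x_1 + b + X_3$ (i.e., a linear combination of $x_1$, a constant, and some noise $X_3 \sim N(0, \Omega)$ independent of $X_1$). Hence, the marginal distribution of $X_2$ is the same as the distribution of $A X_1 + b + X_3$. But then,
\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} I_r & 0 \\ A & I_{p-r} \end{pmatrix} \begin{pmatrix} X_1 \\ X_3 \end{pmatrix} + \begin{pmatrix} 0 \\ b \end{pmatrix}
\]
and will be multivariate normal. We only need the mean and the covariance matrix.
Now, $E(X_1) = \mu_1$ and $E(X_2) = E_{X_1}[E_{X_2}(X_2 \mid X_1)] = E_{X_1}[E_{X_3}(A X_1 + b + X_3)] = A \mu_1 + b$, and
\[
\mathrm{Var}(X_2) = \mathrm{Var}(A X_1 + X_3) = A \Sigma_{11} A^\top + \Omega,
\]
with
\[
\begin{aligned}
\mathrm{Cov}(X_1, X_2) &= E[(X_1 - \mu_1)(A X_1 + b - A \mu_1 - b)^\top] \\
&= E[(X_1 - \mu_1)(X_1 - \mu_1)^\top A^\top] \\
&= E[(X_1 - \mu_1)(X_1 - \mu_1)^\top] A^\top = \Sigma_{11} A^\top,
\end{aligned}
\]
hence
\[
X \sim N_p\left( \begin{pmatrix} \mu_1 \\ A \mu_1 + b \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{11} A^\top \\ A \Sigma_{11} & \Omega + A \Sigma_{11} A^\top \end{pmatrix} \right).
\]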
2.5
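(a)
Using Exercise 2.4, we can get the joint distribution
$\begin{pmatrix} Z \\ Y \end{pmatrix} \sim N_2\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \right)$
(or, equivalently, $\begin{pmatrix} Y \\ Z \end{pmatrix} \sim N_2\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right)$). Applying the same procedure again, we can get (with $\Omega = 1$, $b = 1$, and $A = (-1, 0)$)
\[
\begin{pmatrix} Y \\ Z \\ X \end{pmatrix} \sim N_3\left( \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 & -2 \\ 1 & 1 & -1 \\ -2 & -1 & 3 \end{pmatrix} \right)
\]
or, equivalently,
\[
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 3 & -2 & -1 \\ -2 & 2 & 1 \\ -1 & 1 & 1 \end{pmatrix} \right).
\]
Then, $Y \mid (X, Z)$ is normal with
\[
\mu_{Y \mid (X, Z)} = 1 + \begin{pmatrix} -2 & 1 \end{pmatrix} \begin{pmatrix} 3 & -1 \\ -1 & 1 \end{pmatrix}^{-1} \left( \begin{pmatrix} X \\ Z \end{pmatrix} - \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right)
= 1 + \begin{pmatrix} -2 & 1 \end{pmatrix} \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} X \\ Z \end{pmatrix}
= 1 + \frac{1}{2}(Z - X)
\]
and
\[
\sigma^2_{Y \mid (X, Z)} = 2 - \begin{pmatrix} -2 & 1 \end{pmatrix} \begin{pmatrix} 3 & -1 \\ -1 & 1 \end{pmatrix}^{-1} \begin{pmatrix} -2 \\ 1 \end{pmatrix} = \frac{1}{2}:
\]
\[
Y \mid (X, Z) \sim N\left( 1 + \tfrac{1}{2}(Z - X), \tfrac{1}{2} \right).
\]
(b)
$\begin{pmatrix} U \\ V \end{pmatrix} = \begin{pmatrix} 1 + Z \\ 1 - Y \end{pmatrix}$ is obviously normal. Moreover, $\mu_U = 1 + E(Z) = 1$, $\mu_V = E(1 - Y) = 0$, $\sigma^2_U = \sigma^2_Z = 1$, $\sigma^2_V = \sigma^2_Y = 2$, $\sigma_{U,V} = -\sigma_{Z,Y} = -1$. Hence,
\[
\begin{pmatrix} U \\ V \end{pmatrix} \sim N_2\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix} \right).
\]
(c)
$Y \mid (U = 2)$ has the same distribution as $Y \mid Z + 1 = 2$, that is, $Y \mid Z = 1$. Using (b), we get $Y \mid U = 2 \sim N_1(2, 1)$.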
3.1
\[
\begin{pmatrix}
5 & 0 & 10 & 0 \\
2 & 1 & 4 & 2 \\
15 & 0 & 20 & 0 \\
6 & 3 & 8 & 4
\end{pmatrix}
\]
4.1
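(a)
\[
C = \begin{pmatrix}
-1 & 1 & 0 & 0 & \cdots & 0 \\
0 & -1 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & 0 & -1 & 1
\end{pmatrix} \in M_{p-1, p}
\]
is the required matrix.
(b)
$Y_j = C X_j$, so the $Y_j$ are i.i.d. $N(C\mu, C \Sigma C^\top)$, with $S_Y = C S C^\top$, $\bar{Y} = C \bar{X}$, $\mu_Y = C \mu$, and
\[
n \left( \bar{Y} - \mu_Y \right)^\top S_Y^{-1} \left( \bar{Y} - \mu_Y \right) = n \left( \bar{X} - \mu \right)^\top C^\top (C S C^\top)^{-1} C (\bar{X} - \mu) \sim \frac{(n-1)(p-1)}{n-p+1} F_{p-1, n-p+1}.
\]
Hence, the rejection region would be
\[
\left\{ X : n \left( C \bar{X} - 1_{p-1} \right)^\top (C S C^\top)^{-1} (C \bar{X} - 1_{p-1}) > \frac{(n-1)(p-1)}{n-p+1} F_{1-\alpha, p-1, n-p+1} \right\},
\]
where $1_{p-1} \in \mathbb{R}^{p-1}$ is a (p − 1)-vector of ones.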
4.2
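Use the fact that $Y = n \left( \bar{X} - \mu_0 \right)^\top \Sigma^{-1} \left( \bar{X} - \mu_0 \right) \sim \chi^2_3$: plug in n = 50,
\[
\bar{X} = \begin{pmatrix} 0.8 \\ 1.1 \\ 0.6 \end{pmatrix}, \quad
\mu_0 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 4 & 1 \\ 1 & 1 & 2 \end{pmatrix},
\]
find $\Sigma^{-1}$, and hence reject if $Y > \chi^2_{1-\alpha, 3}$.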
4.3
From the data, $\bar{X} = [6, 10]^\top$ and $S = \begin{pmatrix} 24 & -10 \\ -10 & 6 \end{pmatrix} / 3$ and $S^{-1} = \begin{pmatrix} 18 & 30 \\ 30 & 72 \end{pmatrix} / 44$. Then,
\[
T^2 = n \left( \bar{X} - \mu_0 \right)^\top S^{-1} \left( \bar{X} - \mu_0 \right) = 13.636.
\]
To compute the P-value, evaluate $F = \frac{n-p}{(n-1)p} T^2 = 4.545$ and P-value $= \Pr(F_{p, n-p} \ge 4.545) = 0.180 > 0.05$ (i.e., pf(4.545455, 2, 2, lower.tail=FALSE)). Do not reject $H_0$: there is not sufficient evidence to believe that the population mean differs from $[7, 11]^\top$.
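The arithmetic above can be reproduced without a statistics package. The sketch below assumes n = 4 and p = 2, which is what the degrees of freedom in the pf call imply, and uses the closed form of the F tail probability that is available when the numerator degrees of freedom equal 2; the function name `hotelling_t2` is hypothetical.

```python
def hotelling_t2(n, xbar, mu0, s_inv):
    """T^2 = n (xbar - mu0)^T S^{-1} (xbar - mu0) for p = 2."""
    d = [a - b for a, b in zip(xbar, mu0)]
    quad = sum(d[i] * s_inv[i][j] * d[j] for i in range(2) for j in range(2))
    return n * quad

n, p = 4, 2   # assumed: n - p = 2 matches the second df in the pf call
s_inv = [[18 / 44, 30 / 44], [30 / 44, 72 / 44]]
t2 = hotelling_t2(n, xbar=[6, 10], mu0=[7, 11], s_inv=s_inv)
f_stat = (n - p) / ((n - 1) * p) * t2
# For numerator df 2, Pr(F_{2,m} >= f) = (1 + 2 f / m) ** (-m / 2) in closed form.
m = n - p
p_value = (1 + 2 * f_stat / m) ** (-m / 2)
print(round(t2, 3), round(f_stat, 3), round(p_value, 3))
```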
4.4
Use Exercise 2.3 on the two samples, then the property of the difference of means. Observe that
the variance does not depend on the means, and so we can use the pooled $T^2$ test (4.9).
4.5
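For a difference of independent variables, means subtract and variances add, so $X - \bar{X} \sim N_p\left( 0, \left( 1 + \frac{1}{n} \right) \Sigma \right)$ and $(n-1)S \sim W_p(\Sigma, n-1)$ by definition, and they are independent. Call $C = X - \bar{X}$. Then,
\[
\frac{n}{n+1} (X - \bar{X})^\top S^{-1} (X - \bar{X}) = \frac{C^\top S^{-1} C}{C^\top \Sigma^{-1} C} \left( \frac{n}{n+1} \right) (C^\top \Sigma^{-1} C).
\]
Now,
\[
\frac{C^\top S^{-1} C}{C^\top \Sigma^{-1} C} \sim \frac{n-1}{\chi^2_{n-p}},
\]
independent of X or $\bar{X}$, and
\[
\frac{n}{n+1} C^\top \Sigma^{-1} C = (X - \bar{X})^\top \left( \left( 1 + \frac{1}{n} \right) \Sigma \right)^{-1} (X - \bar{X}) \sim \chi^2_p
\]
and independent of S. Hence
\[
\frac{n}{n+1} (X - \bar{X})^\top S^{-1} (X - \bar{X}) \sim \frac{(n-1) \chi^2_p}{\chi^2_{n-p}},
\]
i.e., the distribution asked for is $\frac{p(n-1)}{n-p} F_{p, n-p}$ (same as the distribution of $T^2$). Then, the $(1 - \alpha)100\%$ prediction region would be:
\[
\left\{ X : \frac{n}{n+1} (X - \bar{X})^\top S^{-1} (X - \bar{X}) < \frac{p(n-1)}{n-p} F_{1-\alpha, p, n-p} \right\}.
\]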
5.1
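(a)
Let $\tilde{X}_4 = X_1 + X_2 + X_4$. We can obtain what we are looking for as a linear combination:
\[
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ \tilde{X}_4 \end{pmatrix}
= \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_1 + X_2 + X_4 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}.
\]
Then,
\[
E \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ \tilde{X}_4 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 7 \end{pmatrix}
\]
and
\[
\mathrm{Var} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ \tilde{X}_4 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 3 & 1 & 0 & 1 \\ 1 & 4 & 0 & 0 \\ 0 & 0 & 1 & 4 \\ 1 & 0 & 4 & 20 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 3 & 1 & 0 & 5 \\ 1 & 4 & 0 & 5 \\ 0 & 0 & 1 & 4 \\ 5 & 5 & 4 & 31 \end{pmatrix},
\]
so
\[
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ \tilde{X}_4 \end{pmatrix} \sim N\left( \begin{pmatrix} 1 \\ 2 \\ 3 \\ 7 \end{pmatrix}, \begin{pmatrix} 3 & 1 & 0 & 5 \\ 1 & 4 & 0 & 5 \\ 0 & 0 & 1 & 4 \\ 5 & 5 & 4 & 31 \end{pmatrix} \right).
\]
(b)
Using the expression for the conditional distribution of a normal distribution,
\[
\begin{aligned}
E\left( X_1 \,\middle|\, \begin{matrix} X_2 \\ X_3 \\ X_4 \end{matrix} \right)
&= 1 + \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 4 & 0 & 0 \\ 0 & 1 & 4 \\ 0 & 4 & 20 \end{pmatrix}^{-1} \begin{pmatrix} x_2 - 2 \\ x_3 - 3 \\ x_4 - 4 \end{pmatrix} \\
&= 1 + \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1/4 & 0 & 0 \\ 0 & 5 & -1 \\ 0 & -1 & 1/4 \end{pmatrix} \begin{pmatrix} x_2 - 2 \\ x_3 - 3 \\ x_4 - 4 \end{pmatrix} \\
&= 1 + \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} -0.5 + \frac{x_2}{4} \\ -11 + 5 x_3 - x_4 \\ 2 - x_3 + \frac{x_4}{4} \end{pmatrix} \\
&= 1 - 0.5 + \frac{x_2}{4} + 2 - x_3 + \frac{x_4}{4} = 2.5 + \frac{x_2}{4} - x_3 + \frac{x_4}{4}.
\end{aligned}
\]
And,
\[
\mathrm{Var}\left( X_1 \,\middle|\, \begin{matrix} X_2 \\ X_3 \\ X_4 \end{matrix} \right)
= 3 - \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1/4 & 0 & 0 \\ 0 & 5 & -1 \\ 0 & -1 & 1/4 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}
= 3 - \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1/4 \\ -1 \\ 1/4 \end{pmatrix} = 2.5.
\]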
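(c)
Looking at the upper part $\begin{pmatrix} 3 & 1 & 0 \\ 1 & 4 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ of the covariance matrix, we see that $X_3$ is independent of $(X_1, X_2)$. Hence $x_3$ does not influence the correlation of $X_1$ and $X_2$ $\implies \rho_{12.3} = \rho_{12} = \frac{\sqrt{3}}{6} = 0.2887$.
For $\rho_{12.4}$,
\[
\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \begin{pmatrix} 3 & 1 \\ 1 & 4 \end{pmatrix} - \frac{1}{20} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} \frac{59}{20} & 1 \\ 1 & 4 \end{pmatrix}.
\]
Hence, $\rho_{12.4} = \sqrt{\frac{5}{59}} = 0.291$.
(d)
\[
R_{1.234} = \sqrt{ \frac{ \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 4 & 0 & 0 \\ 0 & 1 & 4 \\ 0 & 4 & 20 \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} }{3} }
= \sqrt{ \frac{1}{3} \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1/4 & 0 & 0 \\ 0 & 5 & -1 \\ 0 & -1 & 1/4 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} }
= \sqrt{\frac{1}{6}} = 0.408 > \rho_{12}.
\]
Of course $R_{1.234}$ should be larger than $\rho_{12}$ (or at least no smaller), and this is supported numerically (0.408 > 0.2887).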
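The matrix arithmetic in parts (b) and (d) is easy to double-check by machine. The following sketch uses plain Python lists; the variable names are illustrative, and the inverse is taken as quoted in the solution and then verified by multiplication rather than recomputed.

```python
sigma_22 = [[4, 0, 0], [0, 1, 4], [0, 4, 20]]             # Var of (X2, X3, X4)
sigma_22_inv = [[0.25, 0, 0], [0, 5, -1], [0, -1, 0.25]]  # inverse quoted in (b)
sigma_12 = [1, 0, 1]                                      # Cov of X1 with (X2, X3, X4)
var_x1 = 3

# Confirm that the quoted inverse really is the inverse.
product = [[sum(sigma_22[i][k] * sigma_22_inv[k][j] for k in range(3))
            for j in range(3)] for i in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# Regression coefficients Sigma_12 Sigma_22^{-1} and the explained variance.
coef = [sum(sigma_12[k] * sigma_22_inv[k][j] for k in range(3)) for j in range(3)]
explained = sum(c * s for c, s in zip(coef, sigma_12))

cond_var = var_x1 - explained                  # Var(X1 | X2, X3, X4)
multiple_r = (explained / var_x1) ** 0.5       # R_{1.234}
print(product == identity, coef, cond_var, round(multiple_r, 3))
```

The coefficients (1/4, −1, 1/4) are the multipliers of $x_2$, $x_3$ and $x_4$ in the conditional mean from (b), and the last two printed values correspond to the conditional variance 2.5 and $R_{1.234} \approx 0.408$.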
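(e)
Consider $\begin{pmatrix} X_2 \\ X_3 \\ X_4 \end{pmatrix}$ and
\[
X_1 - \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} \frac{X_2}{4} \\ 5 X_3 - X_4 \\ \frac{X_4}{4} - X_3 \end{pmatrix} = X_1 - \frac{X_2}{4} - \frac{X_4}{4} + X_3.
\]
Then you can check directly:
\[
\begin{aligned}
\mathrm{Cov}\left( X_2, X_1 - \tfrac{X_2}{4} - \tfrac{X_4}{4} + X_3 \right) &= 1 - 1 = 0, \\
\mathrm{Cov}\left( X_3, X_1 - \tfrac{X_2}{4} - \tfrac{X_4}{4} + X_3 \right) &= -1 + 1 = 0, \\
\mathrm{Cov}\left( X_4, X_1 - \tfrac{X_2}{4} - \tfrac{X_4}{4} + X_3 \right) &= 1 - 5 + 4 = 0.
\end{aligned}
\]
But a more clever argument is to say: $X_1 - E\left( X_1 \,\middle|\, \begin{matrix} x_2 \\ x_3 \\ x_4 \end{matrix} \right)$ and $\begin{pmatrix} x_2 \\ x_3 \\ x_4 \end{pmatrix}$ are uncorrelated. This general argument was put forward and proved as a part of the proof of Property 4 of the Multivariate Normal Distribution in Section 2.2.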
5.2
(a)
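$$\begin{pmatrix} 3 \\ -2 \\ 1 \end{pmatrix}^\top \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}
\sim N\left( \begin{pmatrix} 3 & -2 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ -3 \\ 1 \end{pmatrix},\;
\begin{pmatrix} 3 & -2 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{pmatrix}
\begin{pmatrix} 3 \\ -2 \\ 1 \end{pmatrix} \right)
= N(13, 9).$$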
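The mean and variance of the linear combination can be confirmed with a few lines of R. This is a sketch with our own variable names, using the mean vector and covariance matrix that appear in the display above.

```r
mu <- c(2, -3, 1)
Sigma <- matrix(c(1, 1, 1,
                  1, 3, 2,
                  1, 2, 2), nrow = 3, byrow = TRUE)
a <- c(3, -2, 1)
lin_mean <- sum(a * mu)                  # a' mu
lin_var  <- drop(t(a) %*% Sigma %*% a)   # a' Sigma a
print(c(lin_mean, lin_var))
```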
(b)
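Let the vector $a = \begin{pmatrix} U \\ V \end{pmatrix}$. Then
$$\operatorname{Cov}(X_2,\, X_2 - U X_1 - V X_3) = \operatorname{Var}(X_2) - U \operatorname{Cov}(X_2, X_1) - V \operatorname{Cov}(X_2, X_3) = 3 - U - 2V = 0.$$
If, say, $U = 1$, then $V = 1$, so $a = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.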
6.1
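First, let us note that not all $\rho > 0$ are allowed, since $\Sigma$ must be nonnegative definite. It must hold that
$$\begin{vmatrix} 1 & \rho/2 \\ \rho/2 & 1 \end{vmatrix} = 1 - \rho^2/4 \ge 0$$
(since otherwise, for some $a \in \mathbb{R}$ and $b \in \mathbb{R}$,
$$\begin{pmatrix} a \\ b \\ 0 \end{pmatrix}^\top
\begin{pmatrix} 1 & \rho/2 & 0 \\ \rho/2 & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix}
\begin{pmatrix} a \\ b \\ 0 \end{pmatrix} = a^2 + ab\rho + b^2 < 0,$$
making the whole matrix no longer nonnegative definite) and
$$\begin{vmatrix} 1 & \rho/2 & 0 \\ \rho/2 & 1 & \rho \\ 0 & \rho & 1 \end{vmatrix} = 1 - \frac{5}{4}\rho^2 \ge 0.$$
This means that $0 < \rho \le \frac{2}{\sqrt{5}}$.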
(a)
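First, let us find the 3 eigenvalues of $\Sigma$:
$$\begin{vmatrix} 1 - \lambda & \rho/2 & 0 \\ \rho/2 & 1 - \lambda & \rho \\ 0 & \rho & 1 - \lambda \end{vmatrix}
= 1 - 3\lambda + 3\lambda^2 - \lambda^3 - \frac{5}{4}\rho^2(1 - \lambda) = 0$$
and
$$(1 - \lambda)\left[ \lambda^2 - 2\lambda + 1 - \frac{5}{4}\rho^2 \right] = 0.$$
Solving this equation, we obtain three roots: $\lambda_1 = 1$, $\lambda_2 = 1 - \frac{\sqrt{5}}{2}\rho$, and $\lambda_3 = 1 + \frac{\sqrt{5}}{2}\rho$. The largest eigenvalue is $\lambda_3 = 1 + \frac{\sqrt{5}}{2}\rho$.
By definition, its corresponding eigenvector $\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}$ satisfies
$$\begin{aligned}
a_1 + \frac{\rho}{2} a_2 &= a_1 + \frac{\sqrt{5}}{2}\rho a_1, \\
\frac{\rho}{2} a_1 + a_2 + \rho a_3 &= a_2 + \frac{\sqrt{5}}{2}\rho a_2, \\
\rho a_2 + a_3 &= a_3 + \frac{\sqrt{5}}{2}\rho a_3.
\end{aligned}$$
Solving (up to a constant), $a_2 = \sqrt{5}\, a_1$ and $a_3 = \frac{2}{\sqrt{5}} a_2 = 2a_1$.
So $a_1 \begin{pmatrix} 1 \\ \sqrt{5} \\ 2 \end{pmatrix}$ is an eigenvector. To normalise it, choose $a_1 = \frac{1}{\sqrt{10}}$. Thus, the first principal component is
$$\frac{1}{\sqrt{10}} Y_1 + \frac{1}{\sqrt{2}} Y_2 + \frac{2}{\sqrt{10}} Y_3.$$
It explains $\frac{1 + \frac{\sqrt{5}}{2}\rho}{3} \cdot 100\%$ of the overall variability.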
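For a concrete value of $\rho$, the eigenvalue and eigenvector found above can be compared with the output of `eigen()`. In this R sketch the value `rho <- 0.5` is an arbitrary illustrative choice within the admissible range, not part of the exercise.

```r
rho <- 0.5  # illustrative value; any rho in (0, 2/sqrt(5)] would do
Sigma <- matrix(c(1,       rho / 2, 0,
                  rho / 2, 1,       rho,
                  0,       rho,     1), nrow = 3, byrow = TRUE)
e <- eigen(Sigma, symmetric = TRUE)   # eigenvalues in decreasing order
v <- e$vectors[, 1]
v <- v * sign(v[1])                   # eigenvectors are defined up to sign
print(all.equal(e$values[1], 1 + sqrt(5) / 2 * rho))
print(all.equal(v, c(1, sqrt(5), 2) / sqrt(10)))
prop_explained <- e$values[1] / sum(e$values)  # share of total variance
print(prop_explained)
```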
(b)
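$$\begin{aligned}
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_1 + Y_2 + Y_3 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}
&\sim N\left( \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & \frac{\rho}{2} & 0 \\ \frac{\rho}{2} & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \right) \\
&= N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} 1 & \frac{\rho}{2} & 1 + \frac{\rho}{2} \\ \frac{\rho}{2} & 1 & 1 + \frac{3}{2}\rho \\ 1 + \frac{\rho}{2} & 1 + \frac{3}{2}\rho & 3(1 + \rho) \end{pmatrix} \right).
\end{aligned}$$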
(c)
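$$N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ \rho \end{pmatrix} y_3,\;
\begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 \end{pmatrix} - \begin{pmatrix} 0 \\ \rho \end{pmatrix}\begin{pmatrix} 0 & \rho \end{pmatrix} \right)
= N\left( \begin{pmatrix} 0 \\ \rho y_3 \end{pmatrix},\;
\begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 - \rho^2 \end{pmatrix} \right).$$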
(d)
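$$\operatorname{Cov}\begin{pmatrix} Y_3 \\ Y_2 \\ Y_1 \end{pmatrix}
= \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & \frac{\rho}{2} \\ 0 & \frac{\rho}{2} & 1 \end{pmatrix},$$
so
$$R = \sqrt{\frac{\begin{pmatrix} \rho & 0 \end{pmatrix}
\begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 \end{pmatrix}^{-1}
\begin{pmatrix} \rho \\ 0 \end{pmatrix}}{1}}
= \frac{1}{\sqrt{1 - \frac{\rho^2}{4}}}
\sqrt{\begin{pmatrix} \rho & 0 \end{pmatrix}
\begin{pmatrix} 1 & -\frac{\rho}{2} \\ -\frac{\rho}{2} & 1 \end{pmatrix}
\begin{pmatrix} \rho \\ 0 \end{pmatrix}}
= \frac{\rho}{\sqrt{1 - \frac{\rho^2}{4}}}.$$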
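Parts (c) and (d) can be verified numerically for a particular $\rho$. This is an R sketch; `rho <- 0.5` is again an arbitrary illustrative value and the variable names are ours.

```r
rho <- 0.5  # illustrative value only
Sigma <- matrix(c(1,       rho / 2, 0,
                  rho / 2, 1,       rho,
                  0,       rho,     1), nrow = 3, byrow = TRUE)
s <- Sigma[1:2, 3]   # Cov((Y1, Y2), Y3) = (0, rho)
# (c): conditional covariance of (Y1, Y2) given Y3.
cond_cov <- Sigma[1:2, 1:2] - s %*% t(s) / Sigma[3, 3]
# (d): multiple correlation of Y3 with (Y1, Y2).
R_mult <- sqrt(drop(t(s) %*% solve(Sigma[1:2, 1:2]) %*% s) / Sigma[3, 3])
print(cond_cov)
print(c(R_mult, rho / sqrt(1 - rho^2 / 4)))
```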
7.1
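Split up the matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ into
$$\Sigma_{11} = \begin{pmatrix} 1 & 0.4248 \\ 0.4248 & 1 \end{pmatrix},$$
$$\Sigma_{12} = \begin{pmatrix} 0.0420 & 0.0215 & 0.0573 \\ 0.1487 & 0.2489 & 0.2843 \end{pmatrix},$$
$$\Sigma_{22} = \begin{pmatrix} 1 & 0.6693 & 0.4662 \\ 0.6693 & 1 & 0.6915 \\ 0.4662 & 0.6915 & 1 \end{pmatrix},$$
$$\Sigma_{21} = \Sigma_{12}^\top.$$
We need to find the eigenvalues of $\Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1}$ if calculating by hand (this would be easier than finding the eigenvalues of $\Sigma_{22}^{-1/2} \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$). If using SAS, we would use the following statements: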
proc iml;
S_11 = {1 0.4248, 0.4248 1};
S_12 = {0.0420 0.0215 0.0573, 0.1487 0.2489 0.2843};
S_22 = {1 0.6693 0.4662, 0.6693 1 0.6915, 0.4662 0.6915 1};
S_22inv = inv(S_22);
S_r = root(S_22inv);  /* Cholesky root of the inverse of S_22 */
a = S_r*S_12`*inv(S_11)*S_12*S_r`;
call eigen(c, d, a);  /* c: eigenvalues, d: eigenvectors */
print c; print d;
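The result is
$$C = \begin{pmatrix} 0.0946455 \\ 0.0035185 \\ 2.252 \times 10^{-18} \end{pmatrix}, \quad
D = \begin{pmatrix} -0.1281 & 0.7192 & 0.6829 \\ 0.2840 & -0.6331 & 0.7201 \\ 0.9502 & 0.2862 & -0.1232 \end{pmatrix}.$$
Further,
b = S_r`*d[,1];
a = 1/sqrt(0.09464557)*inv(S_11)*(S_12)*b;
gives $a = \begin{pmatrix} 0.3262 \\ -1.0940 \end{pmatrix}$ and $b = \begin{pmatrix} 0.1724 \\ -0.5079 \\ -0.6794 \end{pmatrix}$, with $a^\top \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ and $b^\top \begin{pmatrix} X_3 \\ X_4 \\ X_5 \end{pmatrix}$ the canonical variates, and relevant eigenvalues $\lambda = 0.0946, 0.0035$.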
In R,
s <- c(1, 0.4248, 0.0420, 0.0215, 0.0573,
       1, 0.1487, 0.2489, 0.2843,
       1, 0.6693, 0.4662,
       1, 0.6915,
       1)
S <- matrix(NA, 5, 5)
S[lower.tri(S,TRUE)] <- s  # Fill the lower triangle column by column.
S[upper.tri(S)] <- t(S)[upper.tri(S)]  # Mirror it to make S symmetric.
S_11 <- S[1:2,1:2]
S_12 <- S[1:2,3:5]
S_22 <- S[3:5,3:5]
S_22inv <- solve(S_22)
S_r <- chol(S_22inv) # Can use Cholesky instead of square root.
A <- S_r%*%t(S_12)%*%solve(S_11)%*%S_12%*%t(S_r)
(e <- eigen(A))
(b <- t(S_r)%*%e$vectors[,1])
(a <- 1/sqrt(e$values[1]) * solve(S_11)%*%S_12%*%b)
This suggests that the first canonical correlation is sufficient. The first canonical correlation
represents primarily a positive association between arithmetic power and memory for symbols
(both kinds).
7.2
The R and SAS implementations are as in the previous exercise but, with the modified matrices, give $a = \begin{pmatrix} -0.0260 \\ -0.0518 \end{pmatrix}$, $b = \begin{pmatrix} -0.0823 \\ -0.0081 \\ -0.0035 \end{pmatrix}$, with the relevant eigenvalues $\lambda = 0.4396, 0.0016$.
The eigenvalues suggest that there is little left for the second canonical correlation to explain (the first eigenvalue is larger by a factor of over 200).
The first canonical correlation appears to indicate a positive relationship (i.e., negative ×
negative) between the first open book exam and the two closed book exams, whereas the other
two open book exams are weakly associated with the closed book exams. (Rerunning after
converting to correlation matrix does not change this.)
7.3
(a)
The following is an outline of the solution:
1. Since this makes the calculations easier to perform, work with the matrix $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}$.
2. Using the $2 \times 2$ matrix inversion formula, evaluate it.
3. Using the $2 \times 2$ matrix determinant formula, find the expression for the characteristic polynomial in terms of $\rho$ and $\lambda$.
You should get $\lambda_1 = \frac{4\rho^2}{1 + 4\rho + 4\rho^2}$ and $\lambda_2 = 0$, which means that one pair of canonical variables is enough.
(b)
Similarly, solve for the eigenvectors of $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}$ and transform them.
7.4
(a)
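Splitting this up, $\Sigma_{11} = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}$, $\Sigma_{22} = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}$, $\Sigma_{12} = \begin{pmatrix} 0 & 0 \\ 0.95 & 0 \end{pmatrix}$, $\Sigma_{22}^{-1} = \frac{1}{100}\begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}$, and $\Sigma_{22}^{-1/2} = \frac{1}{10}\begin{pmatrix} 10 & 0 \\ 0 & 1 \end{pmatrix}$, so
$$\Sigma_{22}^{-1/2} \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2} = \begin{pmatrix} (0.95)^2 & 0 \\ 0 & 0 \end{pmatrix}.$$
The largest eigenvalue is $\mu^2 = (0.95)^2$ with eigenvector $b = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, hence $Z_2 = 1 \times X_3 + 0 \times X_4 = X_3$ and
$$a = \frac{1}{0.95} \cdot \frac{1}{100}\begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}\begin{pmatrix} 0 & 0 \\ 0.95 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
so $Z_1 = 0 \times X_1 + 1 \times X_2 = X_2$. Since $\mu^2 = (0.95)^2$, $\mu = 0.95$ is the first canonical correlation.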
Can you give another argument for the canonical variables and canonical correlation in this
problem that will help you to avoid all the calculations above?
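For comparison with the hand calculation, here is a minimal R sketch of the same steps. The variable names are ours, and the inverse square root is computed elementwise, which is valid only because $\Sigma_{22}$ is diagonal.

```r
S11 <- diag(c(100, 1))
S22 <- diag(c(1, 100))
S12 <- matrix(c(0,    0,
                0.95, 0), nrow = 2, byrow = TRUE)
S22_inv_half <- diag(1 / sqrt(diag(S22)))  # valid because S22 is diagonal
M <- S22_inv_half %*% t(S12) %*% solve(S11) %*% S12 %*% S22_inv_half
e <- eigen(M, symmetric = TRUE)
mu1 <- sqrt(e$values[1])                # first canonical correlation
b <- S22_inv_half %*% e$vectors[, 1]    # weights for (X3, X4), up to sign
a <- solve(S11) %*% S12 %*% b / mu1     # weights for (X1, X2), up to sign
print(mu1); print(drop(a)); print(drop(b))
```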
9.1
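Since $S = \frac{1}{n}V = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ (using $n$ instead of $n - 1$ here to simplify notation; the factor cancels), observe that
$$\begin{aligned}
\text{arithm. mean } \hat\lambda_i = \frac{1}{p}\sum_{i=1}^p \hat\lambda_i &= \frac{1}{p}\operatorname{tr}(S) \\
&= \frac{1}{pn}\operatorname{tr}\left\{ \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top \right\} \\
&= \frac{1}{pn}\operatorname{tr}\left\{ \sum_{i=1}^n (x_i - \bar{x})^\top (x_i - \bar{x}) \right\} \\
&= \hat\sigma^2
\end{aligned}$$
and
$$\text{geom. mean } \hat\lambda_i = \left( \prod_{i=1}^p \hat\lambda_i \right)^{1/p} = |S|^{1/p} = \left( \frac{1}{n^p}|V| \right)^{1/p} = \frac{1}{n}|V|^{1/p}.$$
Substituting,
$$\begin{aligned}
-2\log\Lambda = np \log\frac{\text{arithm. mean } \hat\lambda_i}{\text{geom. mean } \hat\lambda_i}
&= np \log\frac{\hat\sigma^2}{\frac{1}{n}|V|^{1/p}} \\
&= np \log n\hat\sigma^2 - np \log|V|^{1/p} \\
&= np \log n\hat\sigma^2 - n \log|V|,
\end{aligned}$$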
the test statistic from Section 9.2.
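The identity between the two forms of $-2\log\Lambda$ can be checked on simulated data. This is a minimal R sketch; the sample size, dimension and seed are arbitrary choices of ours.

```r
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # simulated data, illustration only
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
V <- crossprod(Xc)                  # sum of (x_i - xbar)(x_i - xbar)'
lambda <- eigen(V / n, symmetric = TRUE)$values  # eigenvalues of S = V / n
sigma2_hat <- sum(diag(V)) / (n * p)
lhs <- n * p * log(mean(lambda) / prod(lambda)^(1 / p))
rhs <- n * p * log(n * sigma2_hat) - n * log(det(V))
print(all.equal(lhs, rhs))
```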
9.2
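Observe that we can write the sample correlation matrix as
$$R = \operatorname{diag}(\hat\Sigma)^{-1/2}\, \hat\Sigma\, \operatorname{diag}(\hat\Sigma)^{-1/2},$$
where
$$\operatorname{diag}(A)_{ij} = \begin{cases} A_{ii} & i = j \\ 0 & \text{otherwise} \end{cases}$$
is a diagonal matrix whose diagonal elements are the diagonal elements of $A$. Recall that for a diagonal matrix, the matrix inverse, the matrix square root, etc. become simply elementwise operations on the diagonal, and its determinant is the product of its diagonal values.
Then, let $V = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top = n\hat\Sigma$ as before. If $\Sigma$ is diagonal, then the elements of $X$ are independent, so if $\sigma_j^2 = \operatorname{Var} X_j$, then $\hat\sigma_j^2 = n^{-1}\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2 = n^{-1}V_{jj} = \hat\Sigma_{jj}$. Then,
$$\begin{aligned}
\Lambda &= \frac{\prod_{j=1}^p (\hat\Sigma_{jj})^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2\hat\Sigma_{jj}}\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2 \right\}}{|V|^{-\frac{n}{2}}\, n^{\frac{np}{2}}\, e^{-\frac{np}{2}}} \\
&= \frac{\prod_{j=1}^p \left\{ (\hat\Sigma_{jj})^{-\frac{n}{2}} e^{-\frac{n}{2}} \right\}}{|\hat\Sigma|^{-\frac{n}{2}}\, e^{-\frac{np}{2}}} \\
&= \frac{\prod_{j=1}^p (\hat\Sigma_{jj})^{-\frac{n}{2}} \left( |\operatorname{diag}(\hat\Sigma)^{-1/2}|^{-\frac{n}{2}} \right)^2}{|\operatorname{diag}(\hat\Sigma)^{-1/2}|^{-\frac{n}{2}}\, |\hat\Sigma|^{-\frac{n}{2}}\, |\operatorname{diag}(\hat\Sigma)^{-1/2}|^{-\frac{n}{2}}} \\
&= \frac{\left\{ \prod_{j=1}^p \hat\Sigma_{jj} \right\}^{-\frac{n}{2}} \left\{ \left( \prod_{j=1}^p (\hat\Sigma_{jj})^{-1/2} \right)^2 \right\}^{-\frac{n}{2}}}{|\operatorname{diag}(\hat\Sigma)^{-1/2}\, \hat\Sigma\, \operatorname{diag}(\hat\Sigma)^{-1/2}|^{-\frac{n}{2}}} \qquad \text{(the two factors in the numerator cancel)} \\
&= |\operatorname{diag}(\hat\Sigma)^{-1/2}\, \hat\Sigma\, \operatorname{diag}(\hat\Sigma)^{-1/2}|^{\frac{n}{2}} = |R|^{\frac{n}{2}},
\end{aligned}$$
so
$$-2\log\Lambda = -n\log|R|.$$
Lastly, the degrees of freedom for the $\chi^2$ distribution are
$$\underbrace{\frac{p(p+1)}{2}}_{\text{\# param. SPD matrix}} - \underbrace{p}_{\text{\# param. diag. matrix}} = \frac{p(p-1)}{2}.$$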
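The result can be illustrated on simulated data: the likelihood ratio computed from $\hat\Sigma$ agrees with $-n\log|R|$. This is an R sketch with our own variable names, an arbitrary seed and arbitrary dimensions; the $\chi^2$ p-value relies on the usual large-sample approximation.

```r
set.seed(2)
n <- 40; p <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # simulated data, illustration only
Sigma_hat <- cov(X) * (n - 1) / n   # maximum likelihood estimate (divisor n)
R <- cov2cor(Sigma_hat)             # sample correlation matrix
# log Lambda = (n/2) * (log|Sigma_hat| - sum_j log Sigma_hat_jj)
log_Lambda <- (n / 2) * (log(det(Sigma_hat)) - sum(log(diag(Sigma_hat))))
stat <- -2 * log_Lambda
print(all.equal(stat, -n * log(det(R))))
df <- p * (p - 1) / 2
p_value <- pchisq(stat, df, lower.tail = FALSE)  # asymptotic p-value
print(p_value)
```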
12.1
(a)
Normal populations with equal variances imply LDA, so use the expression from Section 12.6, with $\mu_i$ replacing $\bar{x}_i$ and $\Sigma$ replacing $S_{\text{pooled}}$, since those are given to us rather than estimated from the sample. This leads to the following rule:
1. Evaluate
$$d_i(x) = \mu_i^\top \Sigma^{-1} x - \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log\frac{1}{3}$$
for $i = 1, 2, 3$.
2. Classify $x$ into the category with the highest $d_i(x)$.
(b)
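We shall illustrate the first case in detail, and give only the results for the remainder. Let $x = \begin{pmatrix} 0.2 \\ 0.6 \end{pmatrix} = \begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}$, and evaluate $\Sigma^{-1} = \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}$. Then,
$$\begin{aligned}
d_1(x) &= \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \log\frac{1}{3}
= -\frac{2}{15} + \log\frac{1}{3}, \\
d_2(x) &= \begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \log\frac{1}{3}
= -\frac{4}{5} + \log\frac{1}{3}, \\
d_3(x) &= \begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \log\frac{1}{3}
= 0 + \log\frac{1}{3}.
\end{aligned}$$
Thus, we classify into Category 3.
For $x = \begin{pmatrix} 2 \\ 0.8 \end{pmatrix}$, $d_1(x) = \frac{6}{5} + \log\frac{1}{3}$, $d_2(x) = \frac{22}{15} + \log\frac{1}{3}$, $d_3(x) = -\frac{14}{15} + \log\frac{1}{3}$. Classify into Category 2.
For $x = \begin{pmatrix} 0.75 \\ 1 \end{pmatrix}$, $d_1(x) = \frac{1}{2} + \log\frac{1}{3}$, $d_2(x) = -\frac{1}{3} + \log\frac{1}{3}$, $d_3(x) = \frac{1}{6} + \log\frac{1}{3}$. Classify into Category 1.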
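The three classifications can be reproduced with a short R sketch of the rule from part (a). The names `lda_scores` and `classify` are ours; the means and $\Sigma^{-1}$ are those used in the calculation above.

```r
Sigma_inv <- matrix(c(4, -2, -2, 4) / 3, nrow = 2, ncol = 2)
mu <- list(c(1, 1), c(1, 0), c(0, 1))  # population means for categories 1, 2, 3
# Linear discriminant scores d_1(x), d_2(x), d_3(x) with equal priors 1/3.
lda_scores <- function(x) {
  sapply(mu, function(m) {
    drop(t(m) %*% Sigma_inv %*% x - 0.5 * t(m) %*% Sigma_inv %*% m) + log(1 / 3)
  })
}
classify <- function(x) which.max(lda_scores(x))
print(classify(c(0.2, 0.6)))
print(classify(c(2, 0.8)))
print(classify(c(0.75, 1)))
```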
(c)
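To be at the boundary between two regions, say, $i$ and $j$, the point $x$ must have $d_i(x) = d_j(x)$. Then,
$$\begin{aligned}
\mu_i^\top \Sigma^{-1} x - \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log\pi_i &= \mu_j^\top \Sigma^{-1} x - \frac{1}{2}\mu_j^\top \Sigma^{-1} \mu_j + \log\pi_j \\
\mu_i^\top \Sigma^{-1} x - \mu_j^\top \Sigma^{-1} x &= -\frac{1}{2}\mu_j^\top \Sigma^{-1} \mu_j + \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log\pi_j - \log\pi_i \\
(\mu_i^\top \Sigma^{-1} - \mu_j^\top \Sigma^{-1}) x &= -\frac{1}{2}\mu_j^\top \Sigma^{-1} \mu_j + \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log\pi_j - \log\pi_i.
\end{aligned}$$
If we call $a = (\mu_i^\top \Sigma^{-1} - \mu_j^\top \Sigma^{-1})^\top \in \mathbb{R}^2$ and $c = -\frac{1}{2}\mu_j^\top \Sigma^{-1} \mu_j + \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log\pi_j - \log\pi_i \in \mathbb{R}$, neither of them depending on $x$, we can write
$$a^\top x = c \implies a_1 x_1 + a_2 x_2 = c \implies x_2 = \frac{c}{a_2} - \frac{a_1}{a_2} x_1,$$
an equation for a line with slope $-a_1/a_2$ and $y$-intercept $c/a_2$. Here is a sketch of the region boundaries: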
[Figure: sketch of the region boundaries in the $(x_1, x_2)$ plane, with both axes running from $-1.0$ to $2.0$.]
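The slope and intercept of each pairwise boundary follow directly from the expressions for $a$ and $c$. This is a minimal R sketch, assuming equal priors so that the log-prior terms cancel; `boundary` and `quad` are hypothetical helper names of ours.

```r
Sigma_inv <- matrix(c(4, -2, -2, 4) / 3, nrow = 2, ncol = 2)
mu <- list(c(1, 1), c(1, 0), c(0, 1))  # population means for categories 1, 2, 3
quad <- function(m) drop(t(m) %*% Sigma_inv %*% m)  # m' Sigma^{-1} m
# Slope and intercept of the line d_i(x) = d_j(x), assuming equal priors.
boundary <- function(i, j) {
  a <- drop(Sigma_inv %*% (mu[[i]] - mu[[j]]))
  cc <- 0.5 * (quad(mu[[i]]) - quad(mu[[j]]))  # log-prior terms cancel
  c(slope = -a[1] / a[2], intercept = cc / a[2])
}
print(boundary(1, 2))
print(boundary(1, 3))
print(boundary(2, 3))
```

With these means all three boundary lines pass through the origin, since $\mu_i^\top \Sigma^{-1} \mu_i$ takes the same value for every $i$.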
15.1
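First, we solve for the inverse of the generator:
$$\phi^{-1}(x) = (\theta x + 1)^{-1/\theta}.$$
Then, substitute into the Archimedean form:
$$\begin{aligned}
C_\theta(u_1, u_2, \dots, u_p) &= \phi^{-1}\left\{ \sum_{i=1}^p \phi(u_i) \right\} \\
&= \left[ \theta\left\{ \sum_{i=1}^p \theta^{-1}(u_i^{-\theta} - 1) \right\} + 1 \right]^{-1/\theta} \\
&= \left[ \theta\theta^{-1}\left\{ \left( \sum_{i=1}^p u_i^{-\theta} \right) - p \right\} + 1 \right]^{-1/\theta} \\
&= \left[ \sum_{i=1}^p u_i^{-\theta} - p + 1 \right]^{-1/\theta}.
\end{aligned}$$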
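The closed form can be compared numerically with the Archimedean construction. In this R sketch the function names are ours, the generator is $\phi(u) = \theta^{-1}(u^{-\theta} - 1)$ as in the derivation above, and the values of `u` and `theta` are arbitrary.

```r
# Generator, its inverse, and the closed-form copula derived above.
generator     <- function(u, theta) (u^(-theta) - 1) / theta
generator_inv <- function(x, theta) (theta * x + 1)^(-1 / theta)
copula_closed <- function(u, theta) {
  (sum(u^(-theta)) - length(u) + 1)^(-1 / theta)
}
u <- c(0.3, 0.7, 0.9)  # arbitrary point in the unit cube
theta <- 2             # arbitrary positive parameter
via_generator <- generator_inv(sum(generator(u, theta)), theta)
print(all.equal(via_generator, copula_closed(u, theta)))
```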