STAT3006: Statistical Learning
Lecture Notes
Ian A. Wood
School of Mathematics and Physics
University of Queensland, St. Lucia 4072, Australia
October 25, 2020
Contents
1 Multivariate Statistics
1.1 Introduction
1.2 Marginal and Conditional Distributions
1.3 Moments
1.4 Multivariate Normal Distribution
1.5 Marginal and Conditional Densities of the Multivariate Normal
1.6 Samples from the multivariate normal
1.6.1 James–Stein estimators
1.6.2 Wishart Distribution
1.7 Hotelling's $T^2$ test
1.7.1 Confidence region for mean vector
1.7.2 Two sample test
2 Clustering
2.1 K-means clustering
2.2 Mixture Models
2.3 Fitting a normal mixture model
2.4 Estimating uncertainty about the parameter values
2.5 Choosing the number of components
2.5.1 Cross-validation of the likelihood
2.6 Decompositions
2.6.1 Spectral Decomposition
2.7 Mahalanobis Distances and Transformation
3 Discriminant Analysis
3.1 Introduction
3.2 Statistical Classifiers
3.3 Optimal Classifiers
3.4 Linear Discriminant Analysis
3.5 Quadratic Discriminant Analysis
3.6 Fisher's Linear Discriminant Analysis
3.7 Estimation of Error Rates
3.8 Mixture Discriminant Analysis
3.9 Kernel Density Discriminant Analysis
3.10 Nearest Neighbour Classifiers
3.11 Classification Trees
4 High-dimensional analysis
4.1 Introduction
4.2 Principal Component Analysis
4.3 Single-variable analysis
4.4 Variable selection and the lasso
Appendix A
Appendix B: Code for knn example
Chapter 1
Multivariate Statistics
Note: this chapter draws heavily on Morrison (2005).
1.1 Introduction
Assume we will observe data as a result of a random experiment, i.e. we will observe n
random variables {Xi, i = 1, . . . , n}, with each Xi being p-dimensional. The values we
record for these random variables are written $\{x_i, i = 1, \ldots, n\}$. Each element can
be continuous or discrete, but we will assume they are all continuous in most cases.
Note that when necessary, we will use the index i = 1, . . . , n to index observations in
the sample, and the index j = 1, . . . , p to index the elements of an observation.
We typically assume that all n random variables are drawn independently from the
same distribution, i.e. that of X . The distribution of X can be described by its joint
distribution function
$$F(x) = P(X_1 \le x_1, \ldots, X_p \le x_p).$$
When F is absolutely continuous, one can write
$$F(x) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\, du_1 \cdots du_p,$$
where f(x) is the joint density function of the elements of X .
If the elements of X happen to be independent, then
$$f(x) = \prod_{j=1}^{p} f(x_j) \quad \text{and} \quad F(x) = \prod_{j=1}^{p} F(x_j).$$
Also, if such a factorisation is possible, it implies independence of the variables. However, we will usually assume that there is some dependency among the p elements of $X$.
It can be the case that while $f(x)$ is straightforward to write out analytically, $F(x)$ is not. This is true for the multivariate normal distribution in p dimensions. The distribution function is analytically intractable, but the density can be written as
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right], \quad x \in \mathbb{R}^p,$$
where µ is the p-dimensional mean vector and Σ is the p× p covariance matrix.
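As a concrete check, the density can be evaluated numerically. Below is a minimal Python sketch (the mean vector, covariance matrix and evaluation point are arbitrary example values), verified against scipy's implementation:

import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary example parameters
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Direct evaluation of the multivariate normal density formula
p = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
dens = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

# Agrees with scipy's multivariate_normal
assert np.isclose(dens, multivariate_normal(mean=mu, cov=Sigma).pdf(x))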
1.2 Marginal and Conditional Distributions
If we wish to determine the (marginal) joint density of a subset of the random variables,
e.g. the first q < p variables, this can be obtained by integrating out the variables which
are not in the subset. Hence the joint density
$$g(x_1, \ldots, x_q) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_p)\, dx_{q+1} \cdots dx_p. \quad (1.1)$$
Correspondingly, the joint distribution function is given by
$$G(x_1, \ldots, x_q) = P(X_1 \le x_1, \ldots, X_q \le x_q) = F(x_1, \ldots, x_q, \infty, \ldots, \infty). \quad (1.2)$$
Sometimes we are instead interested in a conditional distribution, that is, the joint distribution of a subset of the random variables when the remainder are held fixed at
certain values. E.g. assume that we want to know the joint density of X1, . . . , Xq, q < p,
with the remaining variables fixed at values Xq+1 = xq+1, . . . , Xp = xp.
The definition of conditional probability states that, when $P(B) > 0$, $P(A|B) = P(A \cap B)/P(B)$. When extended to densities and considered for this conditional density, we let $A = \{X_1 = x_1, \ldots, X_q = x_q\}$ and $B = \{X_{q+1} = x_{q+1}, \ldots, X_p = x_p\}$. So
$$h(x_1, \ldots, x_q \mid x_{q+1}, \ldots, x_p) = \frac{f(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_q} = \frac{f(x_1, \ldots, x_p)}{g(x_{q+1}, \ldots, x_p)}. \quad (1.3)$$
Note that g(xq+1, . . . , xp) is used here to mean the relevant marginal density evaluated
at these x values. It is not the same marginal density as in 1.1.
1.3 Moments
The expected value of a p-dimensional random vector $X$ is just the vector of expectations of its elements, i.e.
$$E(X) = [E(X_1), \ldots, E(X_p)]^T \quad (1.4)$$
The covariance of two elements, j and k, of the random vector is
$$\sigma_{jk} = \mathrm{cov}(X_j, X_k) = E[(X_j - E(X_j))(X_k - E(X_k))] = E(X_j X_k) - E(X_j)E(X_k) \quad (1.5)$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{jk}(x_j, x_k)\, x_j x_k\, dx_j\, dx_k - \int_{-\infty}^{\infty} f_j(x_j)\, x_j\, dx_j \int_{-\infty}^{\infty} f_k(x_k)\, x_k\, dx_k, \quad (1.6)$$
where fjk(xj, xk) is the (marginal) joint density of Xj and Xk, and fj is the (marginal)
density of Xj .
The variance of the jth element of the random vector $X$ can be written as $\sigma_{jj}$ or $\sigma_j^2$, where $\sigma_j$ is the standard deviation of the jth element of the random vector. The covariance matrix of $X$ is defined to be
$$\mathrm{var}(X) = \Sigma = \mathrm{cov}(X, X^T) = E[(X - E(X))(X - E(X))^T] = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{1p} & \cdots & \sigma_{pp} \end{pmatrix} \quad (1.7)$$
The matrix $\Sigma$ must be real, symmetric and positive semi-definite. The latter is required to ensure that each covariance term satisfies $|\sigma_{jk}| \le \sqrt{\sigma_{jj}\sigma_{kk}}$. For $\Sigma$ to be positive semi-definite, it must be the case that $z^T \Sigma z \ge 0$ for all $z \ne 0$, $z \in \mathbb{R}^p$. Equivalently, if $\Sigma$ is a symmetric real matrix, it is positive semi-definite if and only if all its eigenvalues are greater than or equal to zero.
From 1.5, we can show that translations of either or both variables do not affect
their covariance:
cov(Xj + a,Xk + b) = cov(Xj, Xk). (1.8)
We can also show that changes in the scale of either variable are passed on to the
covariance:
cov(cXj, dXk) = cd cov(Xj, Xk). (1.9)
Hence, if we define a p-length vector $a = (a_1, \ldots, a_p)^T$, the variance of $a^T X$ is
$$\mathrm{var}(a^T X) = \sum_{j=1}^{p}\sum_{k=1}^{p} a_j a_k \sigma_{jk} = a^T \Sigma a. \quad (1.10)$$
It can also be shown that if b is another p-length vector,
$$\mathrm{cov}(a^T X, b^T X) = \sum_{j=1}^{p}\sum_{k=1}^{p} a_j b_k \sigma_{jk} = a^T \Sigma b. \quad (1.11)$$
More generally, if $A$ and $B$ are two real matrices, of dimensions $r \times p$ and $s \times p$, respectively, and we let $Y = AX$ and $Z = BX$, then it can be shown that
$$\mathrm{cov}(Y, Y^T) = A\Sigma A^T, \quad (1.12)$$
$$\mathrm{cov}(Z, Z^T) = B\Sigma B^T, \quad \text{and} \quad (1.13)$$
$$\mathrm{cov}(Y, Z^T) = A\Sigma B^T. \quad (1.14)$$
It is sometimes desirable to standardise covariance terms into the range [-1,1], i.e.
consider the (Pearson) correlation between variables. The correlation coefficient of Xj
and Xk is defined to be:
$$\mathrm{corr}(X_j, X_k) = \rho_{jk} = \frac{\mathrm{cov}(X_j, X_k)}{\sqrt{\mathrm{var}(X_j)\,\mathrm{var}(X_k)}} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\sigma_{kk}}}. \quad (1.15)$$
If $X_j$ and $X_k$ are independent, then $\mathrm{corr}(X_j, X_k) = 0$, but the converse is not necessarily true. If $X_k = cX_j + a$, for two scalars $a$ and $c > 0$, then $\mathrm{corr}(X_j, X_k) = 1$, and if $c < 0$, then $\mathrm{corr}(X_j, X_k) = -1$.
The matrix of correlations among the elements of X is given by
$$P = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{12} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1p} & \rho_{2p} & \cdots & 1 \end{pmatrix} \quad (1.16)$$
Let D be a p× p diagonal matrix containing the standard deviations of the elements
of X , σ1, . . . , σp. Its inverse, D−1, is also a diagonal matrix, with its entries consisting of
the reciprocals of the standard deviations of the elements of X . It can then be shown
that the correlation matrix for X is related to the covariance matrix as follows:
P = D−1ΣD−1 (1.17)
Σ = DPD (1.18)
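These identities are easy to verify numerically; a minimal Python sketch (the covariance matrix is an arbitrary example):

import numpy as np

# Arbitrary example covariance matrix (symmetric positive definite)
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 1.0, 0.3],
                  [0.8, 0.3, 2.0]])

# D holds the standard deviations on its diagonal
D = np.diag(np.sqrt(np.diag(Sigma)))
D_inv = np.linalg.inv(D)

P = D_inv @ Sigma @ D_inv             # correlation matrix, equation 1.17
assert np.allclose(Sigma, D @ P @ D)  # recovers Sigma, equation 1.18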
1.4 Multivariate Normal Distribution
The pdf of the univariate normal distribution is given by
$$\phi(x_j; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_j - \mu)^2}{2\sigma^2}\right], \quad x_j \in \mathbb{R}. \quad (1.19)$$
If the elements of X = (X1, . . . , Xp) are each normal, but independent of each other,
then X has mean vector µ = (µ1, . . . , µp), where µj is the mean of Xj , and covariance
matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_p^2 \end{pmatrix}.$$
Hence the joint density of $X$ is
$$\phi(x; \mu, \Sigma) = \prod_{j=1}^{p}\phi(x_j; \mu_j, \sigma_j^2) = \frac{1}{(2\pi)^{p/2}\prod_{j=1}^{p}\sigma_j}\exp\left[-\frac{1}{2}\sum_{j=1}^{p}\left(\frac{x_j - \mu_j}{\sigma_j}\right)^2\right], \quad x \in \mathbb{R}^p. \quad (1.20)$$
If we note that for this case $|\Sigma|^{1/2} = \prod_{j=1}^{p}\sigma_j$, then we can rewrite this as
$$\phi(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right], \quad (1.21)$$
which is the general form for the normal density, even if Σ contains off-diagonal terms.
It can be shown that, when $\Sigma$ is positive definite, $\phi(x) > 0$ for all finite $x$ and
$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\phi(x)\, dx_1 \cdots dx_p = 1 \quad (1.22)$$
for all $\mu \in \mathbb{R}^p$, so $\phi(x)$ is a valid density function. This distribution has mean $\mu$ and covariance matrix $\Sigma$.
The bivariate normal distribution is the simplest multivariate case, but has many
applications. When learning about multivariate statistics, it is generally worth looking
at the bivariate case before considering higher-dimensional cases.
The bivariate normal distribution has 5 parameters: the two elements of the mean vector, the variances (or equivalently, standard deviations) of each element, and the correlation $\rho$. Its mean vector and covariance matrix can be written in terms of these as follows:
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \quad (1.23)$$
The joint pdf is
$$\phi(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]\right\}.$$
If we choose a fixed density value $c$ (e.g. $c = 0.1$), there must exist a level set of points in the support of the multivariate normal for which the multivariate normal density has this value. In the bivariate case, the level set will be an elliptical contour centered on the mean, with major and minor axes. These axes follow the longest and shortest lines through the mean which touch the contour twice. In higher dimensions, the level set is an ellipsoid. This can be seen mathematically by setting the multivariate normal density equal to $c$. Since
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right] = c,$$
we can divide both sides by the leading constant and then take logs of both sides, which still leaves us with a constant on the right hand side, say $C$, i.e.
$$(x - \mu)^T\Sigma^{-1}(x - \mu) = C.$$
This equation describes a p-dimensional ellipsoid. The axes of a multivariate normal ellipsoid estimated from data are known as the principal components. These are widely used for the purposes of dimensionality reduction and as a method of describing relationships between variables, and will be considered later in the course.
1.5 Marginal and Conditional Densities of the Multivariate Normal
Property: suppose $X \sim N(\mu, \Sigma)$, with $\Sigma$ of rank $p$ (i.e. $p \times p$ without linear redundancy), $A$ is an $m \times p$ matrix of rank $m \le p$, and we define $Y = AX$. Then $Y$ is a vector of length $m$ and $Y \sim N(A\mu, A\Sigma A^T)$.
The covariance result is from 1.12 and the mean result is also a general property of
multivariate distributions. We will not prove this result in this course. Proofs can be
found in both Anderson (2003) and Graybill (2001).
This property has a number of implications. One of the most important is that the joint distribution of any set of elements of $X$ is a multivariate normal distribution, with mean vector and covariance given by the appropriate elements of $\mu$ and $\Sigma$, i.e. any marginal distribution is multivariate normal. Can you see how the property can be used to show this? A certain type of matrix for $A$ could produce this result. (Hint: binary). (Discussed in detail in class).
However, note that having (multivariate) normal marginal distributions does not
necessarily imply that the original joint distribution was multivariate normal.
To consider conditional densities further, we begin by splitting $X$ into two parts: $X_A$ containing the first $q$ elements, $0 < q < p$, and $X_B$ containing the last $p - q$ elements. We let
$$X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{AB}^T & \Sigma_{BB} \end{pmatrix}. \quad (1.24)$$
Note that the marginal distributions are XA ∼ N(µA,ΣAA) and XB ∼ N(µB,ΣBB).
The dimensions of the matrices are: ΣAA, q × q; ΣAB, q × (p− q); ΣBB, (p− q)× (p− q).
Also, XA and XB are independent iff (if and only if) ΣAB = 0.
We would like to determine the conditional density $h(x_A | x_B)$ in terms of the quantities mentioned so far. From 1.3 we have
$$h(x_A | x_B) = \frac{f(x_A, x_B)}{g(x_B)}. \quad (1.25)$$
Since $g(\cdot)$ is a multivariate normal density,
$$g(x_B) = \frac{1}{(2\pi)^{(p-q)/2}|\Sigma_{BB}|^{1/2}}\exp\left[-\frac{1}{2}(x_B - \mu_B)^T\Sigma_{BB}^{-1}(x_B - \mu_B)\right]. \quad (1.26)$$
We next need to express $f(x_A, x_B)$ in terms of the submatrices of $\Sigma$. We will need the matrix inverse and the determinant.
The linear algebra involved with the Schur complement (Boyd and Vandenberghe, 2004; Zhang, 2005), which can be used in solving systems of linear equations, gives a number of identities, including the inverse and determinant of block matrices. The Schur complement of $\Sigma_{BB}$ in $\Sigma$ is $\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{AB}^T$, which we will abbreviate as $\Sigma_{A|B}$. The inverse identity gives
$$\Sigma^{-1} = \begin{pmatrix} \Sigma_{A|B}^{-1} & -\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1} \\ -\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1} & \Sigma_{BB}^{-1} + \Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1} \end{pmatrix}, \quad (1.27)$$
and the determinant is given by
$$|\Sigma| = |\Sigma_{BB}||\Sigma_{A|B}|. \quad (1.28)$$
We substitute these into 1.25. The ratio of the parts involving $\pi$ and the determinants resolves to
$$\frac{(2\pi)^{(p-q)/2}|\Sigma_{BB}|^{1/2}}{(2\pi)^{p/2}|\Sigma_{BB}|^{1/2}|\Sigma_{A|B}|^{1/2}} = \frac{1}{(2\pi)^{q/2}|\Sigma_{A|B}|^{1/2}}. \quad (1.29)$$
The argument of the exponential in the ratio becomes:
$$-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) + \frac{1}{2}(x_B-\mu_B)^T\Sigma_{BB}^{-1}(x_B-\mu_B)$$
$$= -\frac{1}{2}\begin{pmatrix} x_A-\mu_A \\ x_B-\mu_B \end{pmatrix}^T \begin{pmatrix} \Sigma_{A|B}^{-1} & -\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1} \\ -\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1} & \Sigma_{BB}^{-1}+\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1} \end{pmatrix}\begin{pmatrix} x_A-\mu_A \\ x_B-\mu_B \end{pmatrix} + \frac{1}{2}(x_B-\mu_B)^T\Sigma_{BB}^{-1}(x_B-\mu_B)$$
$$= -\frac{1}{2}\begin{pmatrix} x_A-\mu_A \\ x_B-\mu_B \end{pmatrix}^T \begin{pmatrix} \Sigma_{A|B}^{-1}(x_A-\mu_A) - \Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B) \\ -\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}(x_A-\mu_A) + (\Sigma_{BB}^{-1}+\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1})(x_B-\mu_B) \end{pmatrix} + \frac{1}{2}(x_B-\mu_B)^T\Sigma_{BB}^{-1}(x_B-\mu_B)$$
$$= -\frac{1}{2}\big[(x_A-\mu_A)^T\Sigma_{A|B}^{-1}(x_A-\mu_A) - (x_A-\mu_A)^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B) - (x_B-\mu_B)^T\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}(x_A-\mu_A) + (x_B-\mu_B)^T(\Sigma_{BB}^{-1}-\Sigma_{BB}^{-1}+\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1})(x_B-\mu_B)\big]$$
$$= -\frac{1}{2}\big[(x_A-\mu_A)^T\Sigma_{A|B}^{-1}(x_A-\mu_A) - 2(x_A-\mu_A)^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B) + (x_B-\mu_B)^T\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B)\big].$$
Note that $(x_A-\mu_A)^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B) = (x_B-\mu_B)^T\Sigma_{BB}^{-1}\Sigma_{AB}^T\Sigma_{A|B}^{-1}(x_A-\mu_A)$ because the latter is the transpose of the former, and since both are scalars, the transpose has no effect.
In considering the above, we must remember that we are trying to get a density for the random variable $X_A$ at position $x_A$, and everything else is constant. The exponential argument includes a quadratic term $-\frac{1}{2}x_A^T\Sigma_{A|B}^{-1}x_A$, a linear term and a constant. This essentially ensures that the resulting density is normal, and we also recall that, being a random variable, its density must integrate to 1. So what we should try to do is the multivariate equivalent of completing the square for a quadratic equation. This means we should focus on the linear term, with the understanding that the constant term must fall into line for the result to be a density.
The linear term is as follows:
$$-\frac{1}{2}\big[-2x_A^T\Sigma_{A|B}^{-1}\mu_A - 2x_A^T\Sigma_{A|B}^{-1}\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B)\big] = x_A^T\Sigma_{A|B}^{-1}\big(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B)\big).$$
Let
$$\mu_{A|B}(x_B) = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B), \quad (1.30)$$
so the linear term can be written $x_A^T\Sigma_{A|B}^{-1}\mu_{A|B}(x_B)$.
The only quadratic form which could give rise to these quadratic and linear terms is
$$-\frac{1}{2}\big[x_A - \mu_{A|B}(x_B)\big]^T\Sigma_{A|B}^{-1}\big[x_A - \mu_{A|B}(x_B)\big]. \quad (1.31)$$
Hence the density of the conditional distribution must be
$$g(x_A|x_B) = \frac{\exp\left\{-\frac{1}{2}\big[x_A - \mu_{A|B}(x_B)\big]^T\Sigma_{A|B}^{-1}\big[x_A - \mu_{A|B}(x_B)\big]\right\}}{(2\pi)^{q/2}|\Sigma_{A|B}|^{1/2}} \quad (1.32)$$
This is the density for a multivariate normal distribution with mean $\mu_{A|B}(x_B)$ and covariance matrix $\Sigma_{A|B}$. In summary:
$$X_A | (X_B = x_B) \sim N(\mu_{A|B}(x_B), \Sigma_{A|B}). \quad (1.33)$$
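These conditioning formulas are straightforward to compute. A minimal Python sketch (the partition and all parameter values are arbitrary illustrations):

import numpy as np

# Example with p = 3: X_A is the first q = 1 element, X_B the last two
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
q = 1
S_AA, S_AB, S_BB = Sigma[:q, :q], Sigma[:q, q:], Sigma[q:, q:]
mu_A, mu_B = mu[:q], mu[q:]

x_B = np.array([0.5, 1.0])   # values we condition on

S_BB_inv = np.linalg.inv(S_BB)
mu_cond = mu_A + S_AB @ S_BB_inv @ (x_B - mu_B)        # equation 1.30
Sigma_cond = S_AA - S_AB @ S_BB_inv @ S_AB.T           # Schur complement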
Based on this, and remembering the assumption of p-dimensional normality, we can create a vector of residuals
$$X_{A.B} = X_A - \mu_{A|B}(x_B). \quad (1.34)$$
These are the discrepancies between elements of $X_A$ and the values that would be predicted after conditioning on a known set of values for $X_B$, using the linear mean and covariance equations above. The main difference versus linear regression is that linear regression does not typically assume that the explanatory variables being conditioned upon are jointly normally distributed with the response variables.
The covariance of $X_A$ with $X_{A.B}$ can be shown to be $\Sigma_{A|B}$. The covariance of $X_B$ with $X_{A.B}$ can be shown to be 0, implying that the residuals and the fixed variables are independently distributed.
When holding constant $x_{q+1}, \ldots, x_p$, the elements of the covariance matrix $\Sigma_{A|B}$ for the conditional distribution are called the partial variances and covariances, with the $jk$th ($1 \le j, k \le q$) element written as $\sigma_{jk|q+1,\ldots,p}$. The $jk$th partial correlation is then
$$\rho_{jk|q+1,\ldots,p} = \frac{\sigma_{jk|q+1,\ldots,p}}{\sqrt{\sigma_{jj|q+1,\ldots,p}\,\sigma_{kk|q+1,\ldots,p}}} \quad (1.35)$$
If we let $D_{A|B} = \mathrm{diag}(\Sigma_{A|B})$, then the matrix of partial correlations is
$$P_{A|B} = D_{A|B}^{-1/2}\,\Sigma_{A|B}\,D_{A|B}^{-1/2}, \quad (1.36)$$
where, since $D_{A|B}$ is diagonal, the power $-1/2$ means replacing each diagonal element by the reciprocal of its square root.
1.6 Samples from the multivariate normal
If we take n observations from a p-dimensional multivariate normal distribution $N(\mu, \Sigma)$, we can store these in a data matrix
$$X_{All} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \quad (1.37)$$
The likelihood of these observations is
$$L(\mu, \Sigma; X_{All}) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\right]. \quad (1.38)$$
The maximum likelihood estimators of µ and Σ are then functions of the data which
maximise the likelihood.
Let the sample mean vector be
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad (1.39)$$
and let $A$ be the matrix of sums of squares and cross products
$$A = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \left(\sum_{i=1}^{n} x_i x_i^T\right) - n\bar{x}\bar{x}^T. \quad (1.40)$$
The location of the maximum of a function is unchanged under strictly increasing transformations, and so for convenience we take the natural logarithm of the likelihood.
The log likelihood is given by
$$l(\mu, \Sigma; X_{All}) = -\frac{1}{2}np\ln(2\pi) - \frac{1}{2}n\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T\Sigma^{-1}(x_i - \mu) \quad (1.41)$$
Since $\Sigma^{-1}$ is a symmetric positive definite matrix, the quadratic form term is minimised only when $\mu = \bar{x}$, so the maximum likelihood estimator is $\hat{\mu} = \bar{x}$.
We next need to find the maximum likelihood estimator for Σ. For this we use
a lemma due to Zehna (1966). The lemma was extended by Mood et al. (1974) (see
p285 - Theorem 2 and proof). This expresses the invariance of maximum likelihood
estimators under transformations:
Lemma 1 Let the distribution function of a multivariate random variable depend on the parameters $\theta = (\theta_1, \ldots, \theta_k)$, with maximum likelihood estimators $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$. Consider a transformation of the original parameter space $\Theta$: $\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$, $1 \le r \le k$. Then the maximum likelihood estimators of $\tau_l(\theta)$, $l = 1, \ldots, r$, are $\hat{\tau}_l = \tau_l(\hat{\theta}_1, \ldots, \hat{\theta}_k)$.
Proof: Let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ be a maximum likelihood estimate of $\theta = (\theta_1, \ldots, \theta_k)$. We let $M(\tau; x_1, \ldots, x_n) = \sup_{\theta: \tau(\theta) = \tau} L(\theta; x_1, \ldots, x_n)$. It is sufficient to show that $M(\tau(\hat{\theta}); x_1, \ldots, x_n) \ge M(\tau; x_1, \ldots, x_n)$ for any $\tau \in T$, where $T$ is the range of the transformation.
$$\sup_{\theta: \tau(\theta) = \tau} L(\theta; x_1, \ldots, x_n) \le \sup_{\theta \in \Theta} L(\theta; x_1, \ldots, x_n) = L(\hat{\theta}; x_1, \ldots, x_n) = \sup_{\theta: \tau(\theta) = \tau(\hat{\theta})} L(\theta; x_1, \ldots, x_n) = M(\tau(\hat{\theta}); x_1, \ldots, x_n).$$
To use this in deriving the maximum likelihood estimator for Σ, we let θ = Σ−1, so
the θl are the k = p(p+ 1)/2 unique elements of Σ−1 and the τl are the unique elements
of Σ.
The log likelihood for the multivariate normal distribution is given by
$$l(\mu, \Sigma; X_{All}) = -\frac{1}{2}np\ln(2\pi) - \frac{1}{2}n\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)$$
$$= -\frac{1}{2}np\ln(2\pi) + \frac{1}{2}n\ln|\theta| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T\theta(x_i - \mu) = l(\mu, \theta; X_{All})$$
We will take the (matrix) derivative and set it to zero to determine the maximum likelihood estimator, as follows.
$$\frac{\partial l(\mu, \theta; X_{All})}{\partial \theta} = \frac{n}{2}\theta^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$$
(see identities (57) and (70) from The Matrix Cookbook, Petersen, 2012).
We know that the MLE for $\mu$ is $\hat{\mu} = \bar{x}$, so we include this when setting the above to zero to find $\hat{\theta}$:
$$\frac{n}{2}\hat{\theta}^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = 0$$
$$n\hat{\theta}^{-1} - A = 0$$
$$\text{so } \hat{\theta}^{-1} = \frac{1}{n}A.$$
Hence the maximum likelihood estimator is $\hat{\theta} = \left(\frac{1}{n}A\right)^{-1}$. By Lemma 1, the maximum likelihood estimator of $\Sigma$ is
$$\hat{\Sigma} = \frac{1}{n}A. \quad (1.42)$$
Using expectations it can be shown that this $\hat{\Sigma}$ is biased. As a result, the unbiased alternative is more commonly used:
$$S = \frac{1}{n-1}A, \quad (1.43)$$
which is known as the sample covariance matrix.
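A small Python sketch contrasting the two estimators on simulated data (the simulation settings are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.7],
                       [0.7, 1.0]])
n = 50
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

xbar = X.mean(axis=0)       # MLE of mu (1.39)
centred = X - xbar
A = centred.T @ centred     # sums of squares and cross products (1.40)

Sigma_mle = A / n           # maximum likelihood estimator (1.42), biased
S = A / (n - 1)             # unbiased sample covariance matrix (1.43)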
It is quite common for the sample to come from G independent groups of experimental units, each with mean $\mu_g$, $g = 1, \ldots, G$ and a common covariance matrix $\Sigma$. In this case the data matrix for the gth group can be written as
$$X_{All\,g} = \begin{pmatrix} x_{11g} & \cdots & x_{1pg} \\ \vdots & \ddots & \vdots \\ x_{n_g 1 g} & \cdots & x_{n_g p g} \end{pmatrix} \quad (1.44)$$
with ith row being $x_{ig}^T$. The maximum likelihood estimator of $\mu_g$ is the sample mean $\bar{x}_g$ for the gth group and it can be shown that the unbiased estimator of $\Sigma$ is
$$S = \frac{1}{n - G}\sum_{g=1}^{G}\sum_{i=1}^{n_g}(x_{ig} - \bar{x}_g)(x_{ig} - \bar{x}_g)^T = \frac{1}{n - G}\sum_{g=1}^{G} A_g. \quad (1.45)$$
In 1956, Stein (1956) showed that when data is generated from a multivariate normal distribution with $\Sigma = I$, with $p \ge 3$, then the sample mean vector is inadmissible, i.e. it does not have the smallest expected mean squared error; that is, there exists an estimator $t$ such that
$$E\left[(t - \mu)^T(t - \mu)\right] < E\left[(\bar{x} - \mu)^T(\bar{x} - \mu)\right] \quad (1.46)$$
1.6.1 James–Stein estimators
This was extended by James and Stein in 1961 (James and Stein, 1961) to cover $\Sigma = \sigma^2 I$, with $\sigma$ known, and what are now known as the James–Stein estimators were introduced. For convenience, we assume that $\sigma$ is such that $Y = X$ has distribution $N(\mu, I)$. (You should be able to work out what $\sigma$ would need to be to achieve this.) Then for a single sample of $Y$, or equivalently an iid sample of size n of $X$, the James–Stein estimator is
$$t(y) = \left(1 - \frac{p - 2}{\|y - \nu\|^2}\right)(y - \nu) + \nu, \quad (1.47)$$
where $\nu$ is an arbitrary fixed vector, and $p \ge 3$. This estimator shrinks the observed $y$ towards the specified $\nu$. The estimator always improves on $y$ as an estimator with respect to this loss function, and improves more the closer $\nu$ is to $\mu$. These results have been extended by e.g. Berger (1975) to a general known covariance matrix and an arbitrary quadratic loss function $L(\mu, t) = (t - \mu)^T Q (t - \mu)$, where $Q$ is an arbitrary positive definite matrix which reflects the relative importance of errors in different directions. See Anderson (2003) section 3.5.3 for more details.
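A minimal Python sketch of the shrinkage formula 1.47, estimating the risk of the raw and shrunken estimators by simulation (the true mean, the target $\nu$ and the replication count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
p = 10
mu = np.linspace(-1.0, 1.0, p)   # example true mean
nu = np.zeros(p)                 # shrinkage target

def james_stein(y, nu):
    # t(y) = (1 - (p - 2)/||y - nu||^2)(y - nu) + nu, equation 1.47
    d = y - nu
    return (1.0 - (len(y) - 2) / np.sum(d ** 2)) * d + nu

raw_err = js_err = 0.0
B = 10000
for _ in range(B):
    y = rng.normal(mu, 1.0)      # one draw of Y ~ N(mu, I)
    raw_err += np.sum((y - mu) ** 2)
    js_err += np.sum((james_stein(y, nu) - mu) ** 2)

print(raw_err / B, js_err / B)   # the James-Stein risk should be smaller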
1.6.2 Wishart Distribution
The matrix $A$ (see 1.40) can be written as the sum of products of $n - 1$ independent p-dimensional random vectors with common distribution $N(0, \Sigma)$. Such a matrix is said to have a Wishart distribution (Wishart, 1928), with density
$$w(A; \Sigma, n-1) = \begin{cases} \dfrac{|A|^{(n-p-2)/2}\exp\left(-\frac{1}{2}\mathrm{tr}(A\Sigma^{-1})\right)}{2^{(n-1)p/2}\,\pi^{p(p-1)/4}\,|\Sigma|^{(n-1)/2}\prod_{j=1}^{p}\Gamma\left(\frac{n-j}{2}\right)}, & \text{for } A \text{ positive definite} \\ 0, & \text{otherwise.} \end{cases} \quad (1.48)$$
When $p = 1$ and $\Sigma = 1$, the data is drawn from a $N(\mu, 1)$ distribution, and the Wishart distribution simplifies to a $\chi^2_{n-1}$ distribution.
1.7 Hotelling's $T^2$ test
The single sample univariate hypothesis test for iid data from a $N(\mu, \sigma^2)$ distribution tests $H_0: \mu = \mu_0$ against $H_a: \mu \ne \mu_0$, for some $\mu_0 \in \mathbb{R}$ that we are free to choose. Inspired by z-scores as a way to normalise when forming a test statistic, we choose as a test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$. It can be shown that $t \sim t_{n-1}$, where the latter is a t-distribution with $n - 1$ degrees of freedom. When using a significance level of $\alpha$ (e.g. 0.05), the decision rule is to accept $H_a$ if $|t| > t_{\alpha/2; n-1}$ and to retain $H_0$ otherwise. In these notes, we will use the first subscript of a distribution (e.g. $\alpha/2$) to mean the upper quantile at that level, i.e. so that the total probability to the right of this point is given by the subscript. Any remaining subscripts will generally be degrees of freedom.
We now wish to extend such testing to data drawn from a multivariate normal distribution. We thus assume we will collect n independent observations from a $N(\mu, \Sigma)$ distribution, i.e. a multivariate normal distribution with p elements.
The null and alternative hypotheses are then
$$H_0: \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = \begin{pmatrix} \mu_{01} \\ \vdots \\ \mu_{0p} \end{pmatrix} \quad (1.49)$$
and
$$H_1: \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} \ne \begin{pmatrix} \mu_{01} \\ \vdots \\ \mu_{0p} \end{pmatrix}. \quad (1.50)$$
We assume that the standard estimates of mean and covariance, namely $\bar{x}$ and $S$, have been calculated. Our principal tasks are then to choose an appropriate test statistic and to determine its distribution under the null hypothesis.
There are a number of ways to show the construction. Here we follow that of Anderson (2003), which considers this as an example of a likelihood ratio test.
The likelihood ratio criterion $\lambda$ is the ratio of the maximum likelihood under the null hypothesis to the maximum likelihood under the alternative hypothesis. So here we have
$$\lambda = \frac{\max_\Sigma L(\mu_0, \Sigma)}{\max_{\mu,\Sigma} L(\mu, \Sigma)} \in [0, 1]. \quad (1.51)$$
For $\mu = \mu_0$, the likelihood function is maximised with
$$\hat{\Sigma}_\omega = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_0)(x_i - \mu_0)^T \quad (1.52)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu_0)\right]\left[(x_i - \bar{x}) + (\bar{x} - \mu_0)\right]^T$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - \bar{x})(x_i - \bar{x})^T + (x_i - \bar{x})(\bar{x} - \mu_0)^T + (\bar{x} - \mu_0)(x_i - \bar{x})^T + (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T\right]$$
$$= \frac{1}{n}\left[A + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T + \left(\sum_{i=1}^{n}(x_i - \bar{x})\right)(\bar{x} - \mu_0)^T + (\bar{x} - \mu_0)\left(\sum_{i=1}^{n}(x_i - \bar{x})\right)^T\right]$$
$$= \frac{1}{n}\left[A + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T\right]. \quad (1.53)$$
Without restrictions, the likelihood function is maximised with $\hat{\mu}_\Omega = \bar{x}$ and
$$\hat{\Sigma}_\Omega = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n}A. \quad (1.54)$$
Hence
$$\lambda = \frac{(2\pi)^{-np/2}|\hat{\Sigma}_\omega|^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_0)^T\hat{\Sigma}_\omega^{-1}(x_i - \mu_0)\right)}{(2\pi)^{-np/2}|\hat{\Sigma}_\Omega|^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^T\hat{\Sigma}_\Omega^{-1}(x_i - \bar{x})\right)}. \quad (1.55)$$
The arguments of the exponential functions can be simplified as follows using properties of the trace.
$$\sum_{i=1}^{n}(x_i - \mu_0)^T\hat{\Sigma}_\omega^{-1}(x_i - \mu_0) \quad (1.56)$$
$$= \sum_{i=1}^{n}(x_i - \mu_0)^T\left[\frac{1}{n}\sum_{l=1}^{n}(x_l - \mu_0)(x_l - \mu_0)^T\right]^{-1}(x_i - \mu_0) \quad (1.57)$$
$$= \mathrm{tr}\left(\sum_{i=1}^{n}(x_i - \mu_0)^T\left[\frac{1}{n}\sum_{l=1}^{n}(x_l - \mu_0)(x_l - \mu_0)^T\right]^{-1}(x_i - \mu_0)\right) \quad (1.58)$$
$$= \mathrm{tr}\left(\sum_{i=1}^{n}(x_i - \mu_0)(x_i - \mu_0)^T\left[\frac{1}{n}\sum_{l=1}^{n}(x_l - \mu_0)(x_l - \mu_0)^T\right]^{-1}\right) \quad (1.59)$$
$$= n\,\mathrm{tr}(I_{p \times p}) = np. \quad (1.60)$$
The exponential function involving $\hat{\Sigma}_\Omega^{-1}$ produces the same result. Hence,
$$\lambda = \frac{\exp(-np/2)\,/\,\big[(2\pi)^{np/2}|\hat{\Sigma}_\omega|^{n/2}\big]}{\exp(-np/2)\,/\,\big[(2\pi)^{np/2}|\hat{\Sigma}_\Omega|^{n/2}\big]} \quad (1.61)$$
$$= \frac{|\hat{\Sigma}_\Omega|^{n/2}}{|\hat{\Sigma}_\omega|^{n/2}} \quad (1.62)$$
$$= \frac{|A|^{n/2}}{|A + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T|^{n/2}}. \quad (1.63)$$
The following matrix identity is useful to simplify the expression for $\lambda$:
$$\begin{vmatrix} B & C \\ D & E \end{vmatrix} = \begin{vmatrix} B & 0 \\ D & E - DB^{-1}C \end{vmatrix} = |B|\,|E - DB^{-1}C|. \quad (1.64)$$
We can write $|A + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T|$ as
$$\begin{vmatrix} A & -\sqrt{n}(\bar{x} - \mu_0) \\ \sqrt{n}(\bar{x} - \mu_0)^T & 1 \end{vmatrix} = |A|\,\big|1 + n(\bar{x} - \mu_0)^T A^{-1}(\bar{x} - \mu_0)\big|. \quad (1.65)$$
Hence
$$\lambda^{2/n} = \frac{1}{1 + n(\bar{x} - \mu_0)^T A^{-1}(\bar{x} - \mu_0)} = \frac{1}{1 + T^2/(n-1)}, \quad (1.66)$$
where
$$T^2 = n(\bar{x} - \mu_0)^T S^{-1}(\bar{x} - \mu_0) = (n-1)\,n\,(\bar{x} - \mu_0)^T A^{-1}(\bar{x} - \mu_0). \quad (1.67)$$
The likelihood ratio test is defined by the critical region (set of possible outcomes which will cause the null hypothesis to be rejected) $\lambda < \lambda_0$, or equivalently $T^2 > T_0^2$, where $T_0^2 = (n-1)\left(\lambda_0^{-2/n} - 1\right)$.
So then the null hypothesis is rejected whenever $T^2 \ge T_0^2$, where $T_0$ is chosen so that $P(T^2 > T_0^2)$ is equal to the chosen significance level under the null hypothesis. When $H_0$ is true, $F = \frac{n-p}{p(n-1)}T^2 \sim F_{p, n-p}$. So then we reject $H_0$ if $F > F_{\alpha; p, n-p}$.
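A minimal Python sketch of the one-sample test using the F transformation above (the data and $\mu_0$ are arbitrary illustrations):

import numpy as np
from scipy.stats import f

def hotelling_one_sample(X, mu0):
    # T^2 = n (xbar - mu0)^T S^{-1} (xbar - mu0), equation 1.67
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    T2 = n * d @ np.linalg.inv(S) @ d
    F = (n - p) / (p * (n - 1)) * T2   # F ~ F_{p, n-p} under H0
    return T2, F, f.sf(F, p, n - p)    # p-value from the upper F tail

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.2, 0.0, -0.1], np.eye(3), size=40)
print(hotelling_one_sample(X, np.zeros(3)))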
1.7.1 Confidence region for mean vector
A $100(1-\alpha)\%$ confidence region consists of all vectors $\mu$ for which
$$n(\bar{x} - \mu)^T S^{-1}(\bar{x} - \mu) \le \frac{(n-1)p}{n-p}F_{\alpha; p, n-p}. \quad (1.68)$$
This region has an ellipsoidal boundary and is centered at $\bar{x}$.
1.7.2 Two sample test
More commonly, we may wish to check whether the evidence supports two multivariate samples coming from the same population or from different populations. In particular, we would like to compare their means, i.e. to test $H_0: \mu_1 - \mu_2 = \delta_0$ versus $H_1: \mu_1 - \mu_2 \ne \delta_0$ based on two samples $X^{(1)}$ and $X^{(2)}$ of size $n_1$ and $n_2$, respectively. We assume that the two groups have a common, but unknown covariance matrix $\Sigma$. A typical use of this test will use a null value of $\delta_0 = 0$.
The sample mean vector $\bar{X}^{(l)}$, $l = 1, 2$, has distribution $N(\mu_l, \frac{1}{n_l}\Sigma)$. This can be shown using 1.12 by choosing to consider all $n_l$ observations $X_i^{(l)} \sim N(\mu_l, \Sigma)$, $i = 1, \ldots, n_l$, laid out underneath each other in a single vector $Z$ of length $n_l p$. $Z$ has a covariance matrix composed of copies of $\Sigma$ down the diagonal and 0 matrices elsewhere:
$$\mathrm{cov}(Z) = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \Sigma \end{pmatrix}. \quad (1.69)$$
We can then consider $Y = AZ$, with $A = (I \cdots I)$ being a $p \times n_l p$ matrix comprised of $n_l$ copies of the $p \times p$ identity matrix. Hence $Y = \sum_{i=1}^{n_l} X_i^{(l)}$. The covariance matrix for $Y$ is then $\mathrm{cov}(Y, Y^T) = A\,\mathrm{cov}(Z)\,A^T = n_l\Sigma$. Since the sample mean vector is $\bar{X}^{(l)} = \frac{1}{n_l}Y$, we can use 1.9 to show that $\bar{X}^{(l)} \sim N(\mu_l, \frac{1}{n_l}\Sigma)$.
Under the null hypothesis (with $\delta_0 = 0$) it can be shown that $\bar{X}^{(1)} - \bar{X}^{(2)}$ has distribution
$$N\left(0, \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma\right) = N\left(0, \frac{n_1 + n_2}{n_1 n_2}\Sigma\right). \quad (1.70)$$
The natural approach to estimating the common covariance matrix would be to treat each observation as equally informative and sum their contributions. This is done in the following, which uses two degrees of freedom to estimate the mean of each population:
$$S = \frac{1}{n_1 + n_2 - 2}\left\{\sum_{i=1}^{n_1}(x_i^{(1)} - \bar{x}^{(1)})(x_i^{(1)} - \bar{x}^{(1)})^T + \sum_{i=1}^{n_2}(x_i^{(2)} - \bar{x}^{(2)})(x_i^{(2)} - \bar{x}^{(2)})^T\right\}. \quad (1.71)$$
Then $(n_1 + n_2 - 2)S$ has the same distribution as $\sum_{m=1}^{n_1+n_2-2} U_m U_m^T$, where each $U_m \sim N(0, \Sigma)$. By extending the results for the one sample case, it can be shown that
$$T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{X}^{(1)} - \bar{X}^{(2)})^T S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) \quad (1.72)$$
is distributed according to the $T^2$ distribution with $n_1 + n_2 - 2$ degrees of freedom. Equivalently,
$$T^2\,\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p} \sim F_{p, n_1+n_2-p-1}. \quad (1.73)$$
Hence the critical or rejection region for the test is
$$T^2\,\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p} > F_{\alpha; p, n_1+n_2-p-1}. \quad (1.74)$$
As a consequence the $100(1-\alpha)\%$ confidence region for the difference in means $\delta = \mu_1 - \mu_2$ is given by vectors $\delta$ which satisfy
$$(\bar{x}^{(1)} - \bar{x}^{(2)} - \delta)^T S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)} - \delta) \le \frac{n_1 + n_2}{n_1 n_2}T^2_{\alpha; p, n_1+n_2-p-1}, \quad (1.75)$$
where
$$T^2_{\alpha; p, n_1+n_2-p-1} = \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1}F_{\alpha; p, n_1+n_2-p-1}. \quad (1.76)$$
Further, the $100(1-\alpha)\%$ simultaneous confidence intervals for any linear compound of the mean difference $a^T\delta$, $a \in \mathbb{R}^p$, can be shown to be:
$$a^T(\bar{x}^{(1)} - \bar{x}^{(2)}) \pm \sqrt{a^T S a\,\frac{n_1 + n_2}{n_1 n_2}T^2_{\alpha; p, n_1+n_2-p-1}}. \quad (1.77)$$
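A minimal Python sketch of the two-sample test with the pooled covariance matrix 1.71, applied to simulated data with an arbitrary mean shift:

import numpy as np
from scipy.stats import f

def hotelling_two_sample(X1, X2):
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance matrix, equation 1.71
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.inv(S) @ d
    F = T2 * (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)   # equation 1.73
    return T2, F, f.sf(F, p, n1 + n2 - p - 1)

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=30)
X2 = rng.multivariate_normal([0.5, 0.0], np.eye(2), size=35)
print(hotelling_two_sample(X1, X2))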
Note that we assumed that Σ was common to the two groups in the above. This
can be an unreasonable assumption. However, Ito and Schull (1964) have shown that
when the two sample sizes involved are equal, the type I error rate is unaffected by
unequal covariance matrices and the type II error rate is unaffected if the two samples
are large. Hence the test above can be used in many circumstances even when there
are apparent differences in the covariance matrices of the two groups.
In general, the task of such hypothesis testing is known as the multivariate Behrens-Fisher problem, and a variety of solutions have been proposed. The procedure by Krishnamoorthy and Yu (2004) is one of the most powerful. For this test, we first define $\tilde{S}_1 = S_1/n_1$, $\tilde{S}_2 = S_2/n_2$, $\tilde{S} = \tilde{S}_1 + \tilde{S}_2$ and
$$T^2 = (\bar{X}^{(1)} - \bar{X}^{(2)})^T\tilde{S}^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}). \quad (1.78)$$
Then
$$\frac{\nu - p + 1}{\nu p}T^2 \sim F_{p, \nu - p + 1}, \quad (1.79)$$
where
$$\nu = \frac{p + p^2}{\{\mathrm{tr}[(\tilde{S}_1\tilde{S}^{-1})^2] + [\mathrm{tr}(\tilde{S}_1\tilde{S}^{-1})]^2\}/n_1 + \{\mathrm{tr}[(\tilde{S}_2\tilde{S}^{-1})^2] + [\mathrm{tr}(\tilde{S}_2\tilde{S}^{-1})]^2\}/n_2}. \quad (1.80)$$
This type of analysis can be continued further with multivariate analysis of variance.
See e.g. Morrison (2005), Chapter 3 for details.
Chapter 2
Clustering
Clustering is the task of allocating observations into a number of groups which are suggested by structure in the data. They may or may not correspond to meaningful groups in the context producing the observations. The hope is that the groups or clusters produced will be meaningful and will lead to new insights into the population.
This naturally leads to many questions, including the following:
• Will we make use of probability models of the data?
• What measure of similarity will we use to guide which observations are put together in a cluster, and which are put in separate clusters?
• How will we penalise using large numbers of clusters so that we find a reasonable number of clusters?
• How does this relate to (supervised) classification?
2.1 K-means clustering
We will begin by considering a well-known method of clustering which does not explicitly use a probability model. It also avoids the question of the number of clusters by requiring the user to choose this number: K. The general idea of the algorithm is that the data will have been generated from K round clusters, and we should assign each point to its closest cluster center, in this case its cluster mean. This is actually equivalent to a mixture model with spherical components and equal weights, and so is a special case of models considered in the next section.
We would generally be wise to normalise the data such that each dimension has variance 1. The algorithm does not require a particular measure of dissimilarity, but for continuous predictors, a squared Euclidean (L2) norm is generally used, i.e.
$$d(x_i, x_l) = \sum_{j=1}^{p}(x_{ij} - x_{lj})^2 = \|x_i - x_l\|^2. \quad (2.1)$$
We then need something to optimise, and here we wish to minimise the sum of squared Euclidean distances between the observations within the clusters. This turns out to be equivalent to minimising the sum of squared distances from each observation to its cluster mean. We wish to minimise the sum of squared distances within clusters, which is also called the within-cluster point scatter (with a convenient constant multiplier):
$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(l)=k}\|x_i - x_l\|^2, \quad (2.2)$$
where $C(\cdot)$ maps observation number $i$ to a cluster number in the range $1, \ldots, K$, i.e. it records cluster membership. In principle, we could directly optimise this function. However, the number of combinations to be compared grows rapidly with sample size, as follows (see e.g. Jain and Dubes (1988)):
$$S(n, K) = \frac{1}{K!}\sum_{k=1}^{K}(-1)^{K-k}\binom{K}{k}k^n. \quad (2.3)$$
This tends to encourage the use of heuristic optimisation, such as greedy approaches. Greedy optimisation is the idea of starting somewhere, typically at a random point, and proposing a sequence of limited changes to the state, each of which will only be accepted if it improves the cost function. As a result, it can become trapped in a local minimum if none of the available proposals offer an improvement in cost.
It can be shown using a method similar to that of the ANOVA identity (see Appendix A) that 2.2 can be rewritten as
$$W(C) = \sum_{k=1}^{K}N_k\sum_{C(i)=k}\|x_i - \bar{x}_k\|^2, \quad (2.4)$$
where $\bar{x}_k$ is the mean of observations in the kth cluster and $N_k = \sum_{i=1}^{n}I(C(i) = k)$ is the number of observations which are assigned to the kth cluster. We have now phrased the optimisation problem in terms of distances to the cluster mean. Note that each observation is weighted by the size of the cluster, but this could be altered if desired.
A greedy solution to this problem forms the K-means algorithm:
1. Assign each observation to the closest cluster mean
2. Recompute the cluster means
3. Iterate steps 1 and 2 until the assignments of observations to clusters
do not change.
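A minimal Python sketch of this greedy algorithm (initialising, for illustration, from K randomly chosen observations):

import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    rng = rng or np.random.default_rng()
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each observation to the closest cluster mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                      # step 3: assignments did not change
        assign = new_assign
        # Step 2: recompute the cluster means
        for k in range(K):
            if np.any(assign == k):
                means[k] = X[assign == k].mean(axis=0)
    return assign, means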
One must initialise this algorithm somehow, and probably not particularly well, or the initialisation would itself constitute another clustering algorithm. All that must be chosen is the K initial p-dimensional "means". These do not have to actually be the mean of any subset of the observations, but they will be after one iteration of the algorithm.
The algorithm eventually leads to a local minimum, and it will usually not take many iterations. Since the initialisation is random and the result is a local minimum, restarting the algorithm from another initialisation could easily lead to a different local minimum. Note that the global minimum is also a local minimum. The descent into a local minimum is deterministic, so one could in theory partition the Kp-dimensional initialisation space into (possibly disconnected) regions, with all points in a region leading to the same local minimum, for the given dataset. Given that the K-means algorithm usually converges quickly, it is reasonable to run it a number of times and choose the final clustering which has produced the lowest value of the cost function. It is not generally feasible to check whether this is a global minimum or not.
There is no consensus on how to initialise the K-means algorithm, and many authors assume it will be done randomly. This still leaves many choices, e.g. uniform over a bounded feasible parameter space or over an object (e.g. hypercube) enclosing the data, or (discrete) uniform choice of K separate individual observations, or random division of the observations into K subsets, taking the means of these. A large number of variations on K-means have been developed, but here we suggest that you consider the basic version again later from the perspective of mixture modelling.
The key remaining issue is the choice of K. This is an example of the general problem of statistical model selection, i.e. deciding which model out of a set or family of models is most appropriate for the data, in addition to fitting them. The Bayesian version places a prior over the models and asks what the posterior distribution is over the models, but for now we will only consider frequentist approaches, which primarily seek a single best model.
There are many approaches to choosing K in K-means clustering, none of which are particularly satisfactory. One of the currently most popular is based on a "gap statistic" (Tibshirani et al., 2001), as described below.
For the given dataset, one finds the (sample) minimum and maximum value for each dimension: $x_{\min 1}, \ldots, x_{\min p}$ and $x_{\max 1}, \ldots, x_{\max p}$. We also run the K-means algorithm on this data and obtain a value $W^K_{data}(C)$ of the within-cluster scatter cost function. We then take B samples of size n from a uniform distribution over the box (hyperrectangle) whose vertices are these minimum and maximum values. This approach is similar in spirit to the bootstrap. For each of these samples, we run the K-means algorithm and obtain a value $W^K_b(C)$, for $b = 1, \ldots, B$. We then consider a range of k values $k = 1, \ldots, K$. For each k, we determine the average value
$$\bar{W}^k(C) = \frac{1}{B}\sum_{b=1}^{B}W^k_b(C). \quad (2.5)$$
The gap statistic is then defined as
$$\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B}\log W^k_b(C) - \log W^k_{data}(C). \quad (2.6)$$
We also define
$$s'(k) = \sqrt{1 + 1/B}\;\mathrm{s.d.}\big(\log W^k_1(C), \ldots, \log W^k_B(C)\big), \quad (2.7)$$
and finally choose the optimal
$$k^* = \min\left\{k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s'(k+1)\right\}, \quad (2.8)$$
i.e. find the smallest value of k such that the above is true. Put another way,
$$k^* = \min\left\{k : \mathrm{Gap}(k+1) - \mathrm{Gap}(k) \le s'(k+1)\right\}, \quad (2.9)$$
that is, find the smallest k such that any increase in gap score is within a measure of variability of such scores in the bootstrap sample with k + 1 clusters.
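A sketch of the gap statistic computation in Python, here using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squared distances and so plays the role of W(C); the values of B and the range of k are arbitrary choices:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=8, B=20, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = X.min(axis=0), X.max(axis=0)   # bounding box of the data

    def log_w(data, k):
        return np.log(KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

    gaps, s = np.zeros(k_max), np.zeros(k_max)
    for k in range(1, k_max + 1):
        # B uniform reference samples over the bounding box
        ref = np.array([log_w(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gaps[k - 1] = ref.mean() - log_w(X, k)           # equation 2.6
        s[k - 1] = np.sqrt(1 + 1 / B) * ref.std(ddof=1)  # equation 2.7
    return gaps, s

# k* is then the smallest k with Gap(k) >= Gap(k+1) - s'(k+1), equation 2.8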
The within-cluster scatter is usually a decreasing function as k increases. However, it is hard to predict what pattern will be seen in a plot of the gap statistic versus k. This heuristic rule will increase k provided the gap statistic increases more than the measure of variability. The optimal value $k^*$ is chosen as the first value at which the changes level off or fall. There is very little theory to back this method, but it recovered the correct number of clusters in a high proportion of runs in a set of simulation experiments. The method understandably struggled when the clusters were far from spherical. The authors also considered a variation of the method in which the axes of the box were taken to be those of the principal components. This approach was more successful with elongated clusters.
It should be noted that a k-means clustering forms a Voronoi tessellation of the data space, with the means corresponding to the seeds and the clusters corresponding to the cells.
With three clusters, provided that the three cluster centres do not occur along a line (collinearity), the boundaries between these clusters will all meet at a single point. That point is the circumcentre of the triangle whose vertices are the cluster centres, that is, the centre of the smallest circle which passes through all three cluster centres. By the definition of a circle, its centre must be equidistant from all points on its perimeter. If we find the point half way between two cluster centres and from there draw a line (cluster boundary) perpendicular to the line connecting the cluster centres, all points on the boundary line must be equidistant from the two cluster centres, and hence it must include the circumcentre. This is true for all three possible pairs of cluster centres, so the three cluster boundary lines meet at the circumcentre. Note that this is also true for any number of dimensions, because three points always form a two-dimensional triangle. Note that the cluster boundaries extend as hyperplanes (dimension p − 1).
2.2 Mixture Models
A finite mixture model assumes that the data has been generated by a distribution whose density can be decomposed into a finite sum of weighted component densities. Equivalently, the model is that each observation comes from one of a finite set of parametric distributions. The component densities are usually all of the same parametric form, e.g. those of the multivariate normal distribution. The probabilities of the components are free to vary, but must sum to 1. The probability that an observation will be drawn from the hth component is $\pi_h$, $0 \le \pi_h \le 1$, with $\sum_{h=1}^{H}\pi_h = 1$, where H is the number of mixture components. The overall mixture density is given by
$$f(x) = \sum_{h=1}^{H}\pi_h f_h(x), \quad (2.10)$$
where $f_h(\cdot)$ is the hth component density.
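Read generatively, 2.10 says: pick a component with probabilities $\pi_h$, then draw from that component's density. A minimal Python sketch for a two-component normal mixture (all parameter values are arbitrary examples):

import numpy as np

rng = np.random.default_rng(4)
pis = np.array([0.3, 0.7])                         # component probabilities
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5],
                               [0.5, 1.0]])]

n = 500
z = rng.choice(len(pis), size=n, p=pis)            # latent component labels
X = np.array([rng.multivariate_normal(mus[h], Sigmas[h]) for h in z])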
Here we will only consider mixtures whose components are multivariate normal,
but in principle, mixtures of any kind of components may be considered, and many
variations such as mixtures of t and beta distributions and mixtures of generalised
linear models have been considered. See, for example, McLachlan and Peel (2000)
chapter 5 for further details.
As George Box once said “all models are wrong, but some are useful”. If one has
good reason to believe that the data has been drawn from a mixture of distributions of
a particular type, then a key purpose of mixture modelling is to recover or approximate
the parameters of the originating mixture, based on the data. Since one usually uses
parametric component distributions, the fitted parameters of these are then available
for interpretation. Even if one does not believe that the data is drawn from a mixture,
a mixture model can be a useful and flexible option for approximating the underlying
probability density — this is known as density estimation.
A parametric model has a fixed number of parameters which can be fitted to the data, and its density often has a well-defined shape. The density of a nonparametric model does not have a well-defined shape, and either has no parameters or a set of parameters to be fitted whose form or number depend on the data, with the number not necessarily being bounded as the size of the dataset grows. Finite mixture models are an example of a "semi-parametric" model, because the components are generally parametric models, but there is, in principle, no limit to the number of components, and no a priori way to know how many components will be needed to fit the data well. However, the stipulation that the number of components is finite makes this semi-parametric. Infinite mixture models are also possible, and are typically studied from a Bayesian perspective.
A number of major tasks await the mixture modeller, with the following foremost:
1. fit all parameters of the mixture model based on the data
2. choose the number of components
3. quantify uncertainty about the parameters.
2.3 Fitting a normal mixture model
Here we will assume that the number of components H is fixed, that the data is p-dimensional and continuous, that each component is multivariate normal and that the principle used to fit the model is that of maximising the likelihood.
If we observe a random sample $\{x_1, \ldots, x_n\}$ drawn from a mixture distribution with density 2.10, then the observations are i.i.d. and we can store them in an $n \times p$ data matrix $X_{All}$. We assume that the parameters of the component densities are collected together with the probabilities of the components into a parameter vector $\Psi$. In fact, since we know that $\sum_{h=1}^{H}\pi_h = 1$, only $\pi_1, \ldots, \pi_{H-1}$ are needed as component probability parameters.
Hence the likelihood function can be written as
$$L(\Psi; X_{All}) = \prod_{i=1}^{n}\sum_{h=1}^{H}\pi_h f_h(x_i). \quad (2.11)$$
It is generally worthwhile to transform this via the log function, which then gives us the log likelihood
$$l(\Psi; X_{All}) = \sum_{i=1}^{n}\ln\left(\sum_{h=1}^{H}\pi_h f_h(x_i)\right). \quad (2.12)$$
We now assume that the components are each multivariate normal, and write out the log likelihood. Note that now $\Psi = (\mu_1^T, \ldots, \mu_H^T, \xi^T, \pi_1, \ldots, \pi_{H-1})^T$, where $\xi$ is a vector containing the unique parameters of $\Sigma_1, \ldots, \Sigma_H$.
$$l(\Psi; X_{All}) = \sum_{i=1}^{n}\ln\left(\sum_{h=1}^{H}\frac{\pi_h}{(2\pi)^{p/2}|\Sigma_h|^{1/2}}\exp\left[-\frac{1}{2}(x_i - \mu_h)^T\Sigma_h^{-1}(x_i - \mu_h)\right]\right). \quad (2.13)$$
This function cannot really be simplified, although it can be directly maximised using optimisation methods such as Newton-Raphson approaches. In considering such general purpose iterative methods, one wonders if a more specific approach might be suitable for this problem.
One thing to realise is that fitting the mixture model would be quite easy if we knew which component each observation came from. If we know that a subset of the observations belong to the first multivariate normal component, we can just set the mean vector of that component equal to the sample mean over this subset, and the covariance matrix of the component equal to the sample covariance over that subset. We could also set the proportion for this component equal to the number of observations in this subset divided by the total. If we knew this for all components, we would quickly fit the entire mixture model. Unfortunately, we do not. However, this idea is largely the basis of the Expectation-Maximisation (EM) algorithm for fitting mixture models.
The EM algorithm is an iterative procedure to find a maximum of the log likelihood function for a statistical model. It tends to be useful when it is possible to augment the data with additional latent or unobservable data (effectively variables), which would be sufficient to make the maximisation straightforward. One then iterates a process of finding approximate values for the latent variables by taking expectations (the Expectation step), followed by using these expectations to maximise the resulting likelihood over both the observed and latent data (the Maximisation step). The original data is then said to be "incomplete" and the original likelihood is termed the incomplete likelihood. When the latent data is added, we term the data "complete" and the likelihood function of such data is termed the complete likelihood.
In most cases, the maximum likelihood estimate $\hat{\Psi}$ is a solution of the following equation
$$\frac{\partial l(\Psi; X_{All})}{\partial \Psi} = 0, \quad (2.14)$$
and for this problem there are no solutions of interest at the boundary of the parameter space. It can be shown that the MLE satisfies
$$\hat{\pi}_h = \sum_{i=1}^{n}\tau_h(x_i; \hat{\Psi})/n, \quad h = 1, \ldots, H, \quad (2.15)$$
where $\tau_h(x_i; \hat{\Psi})$ is the posterior probability that observation i has been drawn from component h, i.e.
$$\tau_h(x_i; \hat{\Psi}) = \frac{\hat{\pi}_h f_h(x_i; \hat{\theta}_h)}{\sum_{m=1}^{H}\hat{\pi}_m f_m(x_i; \hat{\theta}_m)}, \quad (2.16)$$
where $f_h(\cdot)$ in this case is the multivariate normal density, and so $\theta_h$ is the vector of unique parameters from the mean and covariance matrix of component h. This is not immediately helpful, given that we do not know the $\hat{\pi}_h$ values. In the past some researchers were able to solve equations such as those presented by 2.15 and 2.16 for the unknown parameters, but this was difficult and required iterative methods whose properties were largely unknown. However, these equations add to the hints which suggest that we should choose the latent data to be the set of component memberships of the observations and try an EM algorithm to maximise the likelihood.
We thus let the complete data be
$$X_C = \{X_{All}, Z\}, \quad (2.17)$$
where
$$Z = (z_1^T, \ldots, z_n^T)^T, \quad (2.18)$$
and $z_i$ is a vector which describes which component observation i belongs to. It is a binary vector consisting of a 1 and $H - 1$ zeros. If $x_i$ belongs to component h, then the 1 is placed in the hth position in the vector.
The complete data likelihood for a single observation is the density of that observation, which can be written:
$$P(X_i = x_i, Z_i = z_i; \Psi) = P(Z_i = z_i; \Psi)\,P(X_i = x_i | Z_i = z_i; \Psi) \quad (2.19)$$
$$= \prod_{h=1}^{H}\pi_h^{z_{ih}} f_h(x_i)^{z_{ih}}. \quad (2.20)$$
Note that since all but one of the elements of $z_i$ are zero, the product over the h elements of $z_i$ merely finds the component which the ith observation belongs to, and for that component, we are using the right-hand side of 2.19. For all other components, we are merely multiplying by terms raised to the power 0, i.e. ones. As a result, we can write the complete likelihood function as
$$L_C(\Psi; X_{All}, Z) = \prod_{i=1}^{n}\prod_{h=1}^{H}\pi_h^{z_{ih}} f_h(x_i)^{z_{ih}}. \quad (2.21)$$
Hence the complete data log likelihood is
$$l_C(\Psi; X_{All}, Z) = \sum_{i=1}^{n}\sum_{h=1}^{H}z_{ih}\left[\ln\pi_h + \ln f_h(x_i; \theta_h)\right]. \quad (2.22)$$
For a mixture of multivariate normal distributions, this becomes
$$l_C(\Psi; X_{All}, Z) = \sum_{i=1}^{n}\sum_{h=1}^{H}z_{ih}\left[\ln\pi_h - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_h| - \frac{1}{2}(x_i - \mu_h)^T\Sigma_h^{-1}(x_i - \mu_h)\right]. \quad (2.23)$$
We will then attempt to maximise this function indirectly using the EM algorithm. It can be shown that the maxima of the complete and the incomplete log likelihoods are found at the same set of parameter values. Dempster, Laird and Rubin (Dempster et al., 1977) showed that the incomplete data likelihood $L(\Psi)$ is not decreased after an EM iteration. It will usually converge to a local maximum of this function, which could be the (non-boundary) global maximum, or a saddle point. The convergence of the EM algorithm was proven by Wu (1983).
The E step of the EM algorithm actually requires that we obtain the expectation of the complete log likelihood of the data, taken across the distribution of the latent variables. When fitting mixtures, the latent variables are the z values for each of the observations. We must have some current set of values for the parameters; we will label these $\Psi^{(k)}$. These will be used to determine the distribution of the latent variables. For example, for a given set of mixture parameters, a particular observation will be more likely to belong to one component than another, and these probabilities effectively define the distribution over the z values. The superscript (k) is used to indicate that it is the set of values at the kth iteration. The purpose of the E and M steps is to update $\Psi^{(k)}$ to its next value $\Psi^{(k+1)}$, which is hoped to be an improvement with respect to $L(\Psi; X_{All})$.
In general, for the E step, we need to calculate what is known as the Q function:
$$Q(\Psi; \Psi^{(k)}) = E_{Z; \Psi^{(k)}}\left\{l_C(\Psi) \mid X_{All}\right\}. \quad (2.24)$$
Here the expectation is taken over the set of latent variables Z with $\Psi$ fixed at $\Psi^{(k)}$. The expectation is taken of the complete log likelihood, with the observed data (stored in $X_{All}$) fixed throughout. The M step then maximises the Q function, i.e.
$$\Psi^{(k+1)} = \arg\max_{\Psi} Q(\Psi; \Psi^{(k)}). \quad (2.25)$$
One can then continue taking E and M steps in alternation until some stopping criterion is met. As we have seen before, typical stopping criteria include completing a fixed number of steps or waiting until the size of improvements in some function of interest falls below a small threshold value. Here, the function of most interest is the incomplete log likelihood $l(\Psi; X_{All})$.
We will now determine the details of the E and M steps for mixture models, and specifically the multivariate normal mixture model. We should first notice that the complete data log likelihood for mixtures is linear in the latent variables z (see 2.22),
i.e. we could bring the expectation inside the sum. As a result
$$Q(\Psi; \Psi^{(k)}) = E_{Z; \Psi^{(k)}}\left\{l_C(\Psi) \mid X_{All}\right\} \quad (2.26)$$
$$= E_{Z; \Psi^{(k)}}\left\{\sum_{i=1}^{n}\sum_{h=1}^{H}z_{ih}\left[\ln\pi_h + \ln f_h(x_i; \theta_h)\right] \,\Big|\, X_{All}\right\} \quad (2.27)$$
$$= \sum_{i=1}^{n}\sum_{h=1}^{H}E_{Z; \Psi^{(k)}}(z_{ih} | x_i)\left[\ln\pi_h + \ln f_h(x_i; \theta_h)\right] \quad (2.28)$$
$$= \sum_{i=1}^{n}\sum_{h=1}^{H}\tau_h(x_i; \Psi^{(k)})\left[\ln\pi_h + \ln f_h(x_i; \theta_h)\right] \quad (2.29)$$
Note that the expectation of the $z_{ih}$ variable is the same as the probability that it equals 1, given its binary indicator structure. We have already seen $\tau_h(x_i; \hat{\Psi}_{MLE})$ in 2.16. All we need do now is estimate $\tau$ using $\Psi^{(k)}$ instead, i.e. we set
$$\tau_h(x_i; \Psi^{(k)}) = \frac{\pi_h^{(k)} f_h(x_i; \theta_h^{(k)})}{\sum_{m=1}^{H}\pi_m^{(k)} f_m(x_i; \theta_m^{(k)})} \quad (2.30)$$
The M step requires the global maximisation of the Q function with respect to $\Psi$, which will result in our next parameter vector $\Psi^{(k+1)}$. This is where our realisation of the simplicity of fitting the complete data version of the mixture model becomes useful. We simply set
$$\pi_h^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n}\tau_h(x_i; \Psi^{(k)}) \quad (2.31)$$
Similarly, for the multivariate normal mixture, we weight the mean and variance estimates via the $\tau$ values for each observation, i.e. allowing fractional membership of each of the components via:
$$\mu_h^{(k+1)} = \frac{\sum_{i=1}^{n}\tau_h(x_i; \Psi^{(k)})\,x_i}{\sum_{i=1}^{n}\tau_h(x_i; \Psi^{(k)})} \quad (2.32)$$
and
$$\Sigma_h^{(k+1)} = \frac{\sum_{i=1}^{n}\tau_h(x_i; \Psi^{(k)})(x_i - \mu_h^{(k+1)})(x_i - \mu_h^{(k+1)})^T}{\sum_{i=1}^{n}\tau_h(x_i; \Psi^{(k)})}. \quad (2.33)$$
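A compact Python sketch of one full EM iteration implementing 2.30-2.33; initialisation, iteration and convergence checking are left to the caller, so this is an illustration rather than a robust implementation:

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    n, H = len(X), len(pis)

    # E step: posterior membership probabilities tau, equation 2.30
    dens = np.column_stack([multivariate_normal(mus[h], Sigmas[h]).pdf(X)
                            for h in range(H)])
    tau = pis * dens
    tau /= tau.sum(axis=1, keepdims=True)

    # M step: weighted updates, equations 2.31-2.33
    Nh = tau.sum(axis=0)
    new_pis = Nh / n
    new_mus = [tau[:, h] @ X / Nh[h] for h in range(H)]
    new_Sigmas = []
    for h in range(H):
        d = X - new_mus[h]
        new_Sigmas.append((tau[:, h] * d.T) @ d / Nh[h])
    return new_pis, new_mus, new_Sigmas

Iterating em_step until the incomplete log likelihood 2.12 stops improving gives the usual EM fit.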
It may be apparent that one unfortunate possibility with a mixture model is to produce a component whose mean is exactly the value of a data point, and whose covariance matrix is close to the zero matrix. As the mean moves closer to the data point and the covariance matrix moves closer to the zero matrix, the mixture density for that point increases without bound, and hence so does the likelihood. So the likelihood function for mixtures contains at least nH infinite maxima, because any of the H components can converge in this way on any of the n data points. In most cases, these maxima are spurious, and to be avoided. Instead we seek the parameter set with the largest finite maximum of the likelihood function (which is hence not technically a global maximum).
It should be noted that a mixture model is also not identifiable, that is, there always exist multiple sets of values of the parameters $\Psi$ which produce exactly the same mixture density throughout the support, which is $\mathbb{R}^p$ for a multivariate normal mixture model. However, this loss of identifiability is only due to the fact that one could swap the parameters of any component with those of any other, along with their mixture weights, without changing the overall density. So we can say that a mixture model is identifiable up to a permutation of the component labels or numbering, which have no particular meaning anyway. However, this lack of identifiability does make it difficult to compare two different fits of a mixture model to the same data. One might think one could just order the components by their means to produce a unique and comparable solution. However, when component means are close together, the fact that the fits are different can cause this approach to fail.
EM fitting of a mixture model can be started from any set of valid parameter values and should then converge to a local maximum or saddle point. Once the starting point is chosen, the EM algorithm is deterministic. As in previous examples such as the k-means algorithm, this implies that the initial parameter space could theoretically be decomposed into (possibly disconnected) regions, each of which will lead to the same local maximum or saddle point. So the choice of initial values ultimately determines the conclusion of the search, and it also determines how many steps it will take for the algorithm to come within a given tolerance of this solution, in terms of the likelihood or some measure of distance in the parameter space.
The simplest approach to initialisation would be to randomly fully allocate all the
observations to mixure components, and then fit each components to the observations
so assigned. This has the benefit of keeping the means close to the overall mean of the
data, and hence inside its convex hull, and the covariance matrices will also tend to be
close to those of the data overall, i.e. not too large. However, the means will tend
to be closer together than one might expect in most plausible generating mixtures, and
covariance matrix elements will tend to be larger than desired. A popular alternative
is to initialise via running the k-means algorithm with k = H . Since k-means uses out-
right allocation of observations to clusters, this will tend to separate the means. The
means of the clusters are then used as the means of the components, and the compo-
nent weights and covariance matrices are estimated from the points allocated to the
respective clusters. k-means tends to produce clusters which are as close as possible
to spherical, but at least they will have less spread than the whole dataset. Note that
k-means needs to be started from some position also, and the idea of randomly allocating observations to clusters and then taking their means may work better for that task.
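As a concrete illustration, the following R sketch initialises an H-component normal mixture from a k-means run; init_from_kmeans is an illustrative name, not a package function.

# Initialise mixture parameters from k-means cluster allocations (sketch).
init_from_kmeans <- function(X, H) {
  km  <- kmeans(X, centers = H, nstart = 10)   # base R k-means
  lab <- km$cluster
  list(pi    = as.numeric(table(lab)) / nrow(X),
       mu    = lapply(1:H, function(h) colMeans(X[lab == h, , drop = FALSE])),
       # note: each cluster needs enough points for a non-singular estimate
       Sigma = lapply(1:H, function(h) cov(X[lab == h, , drop = FALSE])))
}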
In reaction to the presence of spurious global maxima and sub-optimal local max-
ima, a typical strategy is to restart the EM algorithm many times, and eventually
choose the outcome with highest likelihood, provided it does not resemble a spuri-
ous solution. The randomness of the initialisation is hence useful, because it allows
each of these restarts to potentially reach a different solution.
In practice, solutions in which the covariance matrix of a component is tending to
zero will be detected by the eventual near-singularity of this matrix, leading to compu-
tational difficulties in the calculation of densities from that component, and hence the
log likelihood. Solutions of this type are then discarded. There can also be spurious
solutions in which a small number of points occupy an isolated cluster, and while the
fit is valid, this may not represent any meaningful sub-population.
One could imagine trying to penalise small variances or constrain the variances to
be above a certain threshold. The former would tend to have a global effect, including
possibly moving the location of the global non-spurious maximum away from its orig-
inal value, which is probably undesirable. Constraining the variance might work, but
the choice of the threshold would be difficult. If the threshold is too small, if the EM
algorithm arrives in the vicinity of the undesirable maximum and hits the constraint,
it will be unable to escape the pull of the maximum and hence be unable to move. If
the threshold is too large, it could block access to legitimate solutions, or those which
might provide useful stepping stones to better solutions. One might more generically
constrain the minimum total number of points allocated to a cluster, considering partial allocations through τ values. It seems clear that this number should be greater
than 1. However, implementing such constraints would require modifications of the
algorithm, and put some of its properties at risk.
Rather than monitor the variances of the components, one can monitor the eigen-
values of the component covariance matrices. These are related to the variance along
the elliptical axes of the component distribution, and it is desirable that none of them
become too small. Often, one is forced to comb through the best solutions found (ac-
cording to their likelihood) by hand, checking to see if they seem reasonable. If the
objective is merely density estimation, such checks may not be necessary.
2.4 Estimating uncertainty about the parameter values
A number of methods exist to obtain standard errors or approximate confidence inter-
vals for the parameters of mixture distributions. Some are based on the information
matrix I(Ψˆ;XAll), which can be expressed in terms of the complete data information
matrix IC(Ψˆ;XAll) and terms involving the score functions for the complete and in-
complete data. These are typically algebraically rather complicated and still require
approximations such as using the observed information matrix. See sections 2.15–2.16
of McLachlan and Peel (2000) for details.
A simpler approach is to use what is known as the parametric bootstrap. It is called
parametric because it derives new samples of size n from a fitted model, in this case
the mixture using our best value of Ψˆ.
The procedure is as follows:
1. Generate B parametric bootstrap samples X^{(1)}, \ldots, X^{(B)}, each of size n, from the
fitted mixture model with parameters Ψˆ.
2. Apply the EM algorithm to each bootstrap sample to obtain B realisations of Ψˆ:
Ψˆ(1), . . . , Ψˆ(B).
3. Calculate the bootstrap covariance matrix of the parameter values:
\mathrm{cov}_B(\hat{\Psi}) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat{\Psi}^{(b)} - \bar{\hat{\Psi}})(\hat{\Psi}^{(b)} - \bar{\hat{\Psi}})^T,   (2.34)
where
\bar{\hat{\Psi}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\Psi}^{(b)}.   (2.35)
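A minimal R sketch of this procedure is below; fit_mixture() (an EM fitting routine returning the parameter vector) and rmixture() (a simulator from a fitted mixture) are hypothetical stand-ins for whatever implementation is used.

# Parametric bootstrap covariance of mixture parameter estimates (sketch).
param_boot_cov <- function(X, Psi_hat, B = 200) {
  boots <- replicate(B, {
    Xb <- rmixture(nrow(X), Psi_hat)   # step 1: simulate a sample of size n
    as.vector(fit_mixture(Xb)$Psi)     # step 2: refit by the EM algorithm
  })
  cov(t(boots))   # step 3: bootstrap covariance matrix (2.34), using (2.35)
}

Square roots of the diagonal entries of the returned matrix then give bootstrap standard errors for the individual parameters.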
2.5 Choosing the number of components
The Bayesian approach to model selection compares one modelM1 with another model
M2 via Bayes factors, given a fixed dataset XAll. We assume that the data must have
arisen under one of these two models. In Bayesian statistics, each model’s parameters
are considered to have a distribution prior to the observation of the data, and there
is also a prior distribution over the models themselves. These have to be set by the
user of the method, and chosen in a reasonable and well-argued way, if it is hoped
that someone else will take the results seriously. The distribution of the data, given the
model is also random, as usual. A few points will be summarised below, but see Kass
and Raftery (1995) for more details.
Using Bayes' theorem, we find that
P(M_k | X_{All}) = \frac{P(X_{All}|M_k)\,P(M_k)}{P(X_{All}|M_1)\,P(M_1) + P(X_{All}|M_2)\,P(M_2)}, \quad k = 1, 2.   (2.36)
The posterior odds of model M_1 are then defined as:
\frac{P(M_1|X_{All})}{P(M_2|X_{All})} = \frac{P(X_{All}|M_1)\,P(M_1)}{P(X_{All}|M_2)\,P(M_2)}   (2.37)
The Bayes factor for model 1 is defined as
B_{12} = \frac{P(X_{All}|M_1)}{P(X_{All}|M_2)}.   (2.38)
Hence we can state that with respect to either model:
posterior odds = Bayes factor× prior odds.
Similar equations hold for comparison of any number of models K. In some cases, it
may be possible to calculate P (XAll|Mk) analytically. Usually, one will instead need
a method of approximation such as Markov chain Monte Carlo sampling. The main
difficulty is that
P(X_{All}|M_k) = \int_{\Theta_k} P(X_{All}|\Psi_k, M_k)\, P(\Psi_k)\, d\Psi_k,   (2.39)
i.e. that this is a marginal distribution, requiring integration over the full multivariate
distribution of Ψk over the parameter space Θk appropriate for model k.
Schwarz considered the asymptotic form of the Bayes factor as the sample size n approaches infinity (Schwarz, 1978). He began by assuming that the observations are drawn from an exponential family of models. Schwarz did not directly specify a prior distribution. He assumed that the prior distribution was of the form
P(\Psi) = \sum_{k=1}^{K} P(M_k)\, P(\Psi_k|M_k).   (2.40)
He also assumed that the penalty for choosing the wrong model is fixed, although this
can be loosened somewhat without altering the asymptotic results.
In the Bayesian paradigm, we usually seek a posterior distribution over models.
However if we must choose a single model, we choose the one with highest posterior
probability. This is also known as the maximum a posteriori (MAP) estimate.
In 2.36, we can see that the denominator is constant for all models. The MAP model choice M_{k^*} is then as follows:
k^* = \arg\max_k\; P(X_{All}|M_k)\, P(M_k)   (2.41)
    = \arg\max_k\; P(M_k) \int_{\Theta_k} P(X_{All}, \Psi_k|M_k)\, d\Psi_k   (2.42)
    = \arg\max_k\; P(M_k) \int_{\Theta_k} \exp\{\log P(X_{All}, \Psi_k|M_k)\}\, d\Psi_k.   (2.43)
If one takes a second-order Taylor series approximation of \log P(X_{All}, \Psi_k|M_k) around one of its modes (local maxima) \Psi_k = \tilde{\Psi}_k, we get
\log P(X_{All}, \Psi_k|M_k) \approx \log P(X_{All}, \tilde{\Psi}_k|M_k) - \frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k),   (2.44)
where H(\tilde{\Psi}_k) is the negative Hessian matrix of \log P(X_{All}, \Psi_k|M_k), evaluated at \Psi_k = \tilde{\Psi}_k. The first order term is zero since the gradient at \tilde{\Psi}_k is zero. If we substitute this
approximation into 2.43, then we find that
k^* \approx \arg\max_k\; P(M_k)\, P(X_{All}, \tilde{\Psi}_k|M_k) \int_{\Theta_k} \exp\left\{-\frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k)\right\} d\Psi_k.   (2.45)
The integrand of 2.45 is only a normalising constant away from being the density of a
normal distribution with mean Ψ˜k and covariance matrix H(Ψ˜k)−1. As a result, we can
multiply by the required normalising constants and their inverse, leaving the integral
of a multivariate normal density. For a finite sample size, this integral will not evalu-
ate to 1 since some of the parameters have limited support, e.g. the variance parame-
ters must be positive and the covariance parameters are constrained by the variances.
However, if we take the sample size n to infinity, the largest posterior mode coincides
with the maximum likelihood estimate Ψˆ, given mild conditions on the prior distri-
butions. At this point, the likelihood and log likelihood become very sharply peaked,
giving the Hessian entries enormous magnitudes. This will make the entries of the in-
verse Hessian matrix tiny, and this is the covariance matrix of the multivariate normal
density. As a result, the implied multivariate normal is tightly concentrated around Ψˆ,
and so the integral is dominated by a region very close to Ψˆ. As a result, the bounds
of the integral have essentially no effect, and the integral is approximately 1. In detail,
noting that |A^{-1}| = |A|^{-1}:
\int_{\Theta_k} \exp\left\{-\frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k)\right\} d\Psi_k   (2.46)
= (2\pi)^{d_k/2} |H(\tilde{\Psi}_k)^{-1}|^{1/2} \int_{\Theta_k} \frac{1}{(2\pi)^{d_k/2} |H(\tilde{\Psi}_k)^{-1}|^{1/2}} \exp\left\{-\frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k)\right\} d\Psi_k   (2.47)
\approx (2\pi)^{d_k/2} |H(\hat{\Psi}_k)|^{-1/2}.   (2.48)
So
k^* \approx \arg\max_k\; P(M_k)\, P(X_{All}, \hat{\Psi}_k|M_k)\, (2\pi)^{d_k/2} |H(\hat{\Psi}_k)|^{-1/2}.   (2.49)
Since log is a monotonic increasing function which does not alter the position of max-
ima, we can take the log of the right hand side.
k^* \approx \arg\max_k\; \log P(M_k) + \log\left(P(X_{All}|\tilde{\Psi}_k, M_k)\, P(\hat{\Psi}_k|M_k)\right) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log|H(\hat{\Psi}_k)|   (2.50)
    \approx \arg\max_k\; \log P(M_k) + l(\tilde{\Psi}_k; X_{All}, M_k) + \log P(\hat{\Psi}_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log|H(\hat{\Psi}_k)|.   (2.51)
For a given model, we generally have a method of estimating the MLE, such as the EM
algorithm for mixture distributions. We can also recognise that H(Ψˆk) is the observed
information matrix I(Ψˆk, XAll). Hence
k^* \approx \arg\max_k\; \log P(M_k) + l(\hat{\Psi}_k; X_{All}, M_k) + \log P(\hat{\Psi}_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log|I(\hat{\Psi}_k, X_{All})|.   (2.52)
Since we are taking n→∞, we can ignore any terms which will not scale with n. This
removes logP (Mk), although this could also be removed by assuming that the prior
has similar probability for all models considered (all values of k). It also removes the
log prior density at the MLE, \log P(\hat{\Psi}_k|M_k), and the constant \frac{d_k}{2}\log(2\pi). In addition,
it can be shown that the determinant of the observed information matrix is approx-
imately ndk , where dk is the dimensionality of Ψk, i.e. the dimension or number of
parameters of the model. Hence
|I(\hat{\Psi}_k, X_{All})| \approx n^{d_k}.   (2.53)
Using this collection of approximations, we can simplify 2.52 to
k^* \approx \arg\max_k\; l(\hat{\Psi}_k; X_{All}, M_k) - \frac{d_k}{2}\log n,   (2.54)
or equivalently
k^* \approx \arg\min_k \mathrm{BIC} = \arg\min_k\; -2\,l(\hat{\Psi}_k; X_{All}, M_k) + d_k \log n.   (2.55)
It should be clear that there is scope to improve on the accuracy of the BIC as a
method of correctly choosing the MAP model if one makes fewer assumptions, given
that we do plan to use it with finite data. If one uses the standard BIC, one is effectively
saying that the prior on the set of models is completely flat (same density for each
model) and that the prior density of the MLE is the same across all models. The latter
would be difficult to achieve for models of various sizes, and its meaning is somewhat
unclear. In addition, \log(2\pi) \approx 1.84 and this term could also be retained.
While the BIC can be used to help find an approximate MAP solution, it is more
widely used in frequentist statistics, in which there is no concept of prior distributions
over models or parameters. In such cases, we need to remove the prior-based terms.
However, one cannot really avoid the Bayesian basis of BIC, and hence ideally, one
would consider which type of prior is most suitable, and its effect.
Effectively, we are still attempting to maximise the likelihood, but we now have a
penalty for model complexity, based purely on the number of parameters. One might
ask where this penalty has come from, given that it didn’t seem to be explicitly in-
cluded in the prior. The answer lies in the dimensionality of the parameter space. In a
larger space, the normal density term has a larger normalising constant, which leads to
a larger penalty. Equivalently, a uniform prior over models or something similar leads
to a more diffuse prior over the larger parameter space, and so the posterior density
for any particular model is lower.
If one wishes to use the BIC in practice to choose the number of components of
a mixture, one typically works through a range of numbers of components (e.g. 1
to 10) and then selects the model with lowest BIC value. It has been shown that the
BIC is consistent for normal mixture models (Roeder and Wasserman, 1997). That is,
if data is drawn from a normal mixture distribution with H components, and we are
using BIC to compare normal mixture models with various numbers of components,
then as n approaches infinity, BIC will have its minimum value for the model with H
components.
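In R, a sketch of this search might look as follows, again using the hypothetical fit_mixture() helper, here assumed to return the maximised log likelihood and the number of free parameters d. The mclust package automates essentially this search, though note, as an assumption worth checking, that mclust reports BIC on the scale 2l - d log n and so maximises its version.

# Choose the number of components by BIC, as in (2.55); sketch only.
bics <- sapply(1:10, function(H) {
  fit <- fit_mixture(X, H)                  # hypothetical EM fitting routine
  -2 * fit$loglik + fit$d * log(nrow(X))    # BIC as in (2.55)
})
H_star <- which.min(bics)                   # chosen number of components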
2.5.1 Cross-validation of the likelihood
Smyth (2000) proposed calculating the log likelihood of a set of test data, using a model
fit on a set of training data. The model with the highest test set log likelihood would
then be chosen.
With respect to model selection, the idea is to choose the model with the highest
expected log likelihood on new data. That is,
k^* = \arg\max_k\; E_X \log P(X|\Psi = \hat{\Psi}_k, M_k),   (2.56)
where the expectation is over X , a new iid observation, and for each model, Ψˆk is
the MLE fit based on the existing dataset XAll. This expectation can be approximated
using a test dataset Xtest, not used for model fitting. Given that argmax is invariant to positive scaling,
one can equivalently consider the log likelihood over the whole test dataset (a sum of
individual observation log likelihoods). I.e.
k^* \approx \arg\max_k\; \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \log P(X_{test,i}|\Psi = \hat{\Psi}_k, M_k)   (2.57)
    \approx \arg\max_k\; \sum_{i=1}^{n_{test}} \log P(X_{test,i}|\Psi = \hat{\Psi}_k, M_k).   (2.58)
The test set approach can be expanded to cross-validation, with the models needing
to be refit on the different training sets with each model order of interest. The log
likelihood for models of a particular order is then summed across all the CV test sets.
The model with the highest total test log likelihood value is chosen. In detail, cross-
validation is used to choose the optimal model type or size, and then this type of model
is refit to all of the data to create the final model.
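A sketch of this scheme in R, using the same hypothetical fit_mixture() helper together with a hypothetical log-density function ldens(X, fit) for the fitted mixture:

# K-fold cross-validated log likelihood for an H-component mixture (sketch).
cv_loglik <- function(X, H, K = 5) {
  fold <- sample(rep(1:K, length.out = nrow(X)))   # random fold labels
  sum(sapply(1:K, function(k) {
    fit <- fit_mixture(X[fold != k, , drop = FALSE], H)
    sum(ldens(X[fold == k, , drop = FALSE], fit))  # test-fold log likelihood
  }))
}
# Choose the H (say over 1:10) with the largest cv_loglik value, then refit
# a mixture with that many components to all of the data.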
It can be shown that the test set log likelihood approximates the expected value of
the Kullback-Leibler (KL) divergence between the density of the true model f(x) and
that of the fitted model fˆ(x), plus a constant which is independent of the model order.
In general, this KL divergence can be written as
D(f(x)\,\|\,\hat{f}(x)) = \int_{-\infty}^{\infty} \ln\left(\frac{f(x)}{\hat{f}(x)}\right) f(x)\, dx.   (2.59)
Note that the KL divergence is not symmetric, and the reference density (here f(x)) is
usually chosen as the first argument for D().
Kullback-Leibler divergence is also the basis of the Akaike information criterion
(AIC) (Akaike, 1974), but this is calculated based on the same data used to fit the model,
and an asymptotically-derived penalty is applied, as follows:
k^* \approx \arg\min_k \mathrm{AIC} = \arg\min_k\; -2\,l(\hat{\Psi}_k; X_{All}, M_k) + 2 d_k.   (2.60)
In practice, the AIC tends to choose models that are too large, implying that its penalty
function is not severe enough on model complexity when sample sizes are small. It
may be reasonable to consider cross-validated likelihood as a rethinking of AIC using
test sets.
2.6 Decompositions
2.6.1 Spectral Decomposition
A p× p real symmetric matrix A can be decomposed as follows:
A = ΓΛΓT
where
Λ = diag (λ1, . . . , λp)
and
Γ = (γ1 . . . γp) ,
where λ_j is the jth eigenvalue of A and γ_j is the jth eigenvector of A. Note that Γ is an orthogonal matrix, i.e. one whose rows and columns are all orthogonal unit vectors, and hence ΓΓ^T = I_p = Γ^T Γ.
It is then possible to define, for any α ∈ R,
A^α = ΓΛ^α Γ^T.
In particular
A^{1/2} = ΓΛ^{1/2}Γ^T,
and so
A^{1/2}A^{1/2} = ΓΛ^{1/2}Γ^T ΓΛ^{1/2}Γ^T = ΓΛ^{1/2}Λ^{1/2}Γ^T = ΓΛΓ^T = A.
Also
A^α A^{-α} = ΓΛ^α Γ^T ΓΛ^{-α}Γ^T = ΓΛ^α Λ^{-α}Γ^T = ΓIΓ^T = ΓΓ^T = I_p.
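In R, these matrix powers can be computed directly from eigen(); the function name mat_power is illustrative.

# A^alpha via the spectral decomposition; for non-integer alpha the
# eigenvalues must be positive (e.g. A a covariance matrix).
mat_power <- function(A, alpha) {
  e <- eigen(A, symmetric = TRUE)    # A = Gamma Lambda Gamma^T
  e$vectors %*% diag(e$values^alpha, nrow(A)) %*% t(e$vectors)
}
# e.g. mat_power(A, 0.5) %*% mat_power(A, 0.5) recovers A up to rounding.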
2.7 Mahalanobis Distances and Transformation
The Mahalanobis distances are analogues of Euclidean distance which take into ac-
count a relevant covariance matrix. A Mahalanobis distance can be used as a mea-
sure of the distance between a point, sample or distribution and another point, sample
or distribution. The main idea is that the points or variables are first normalised by
transformation such that the resulting distribution would have an identity covariance
matrix, i.e. with all variables having the same variances and no correlation between
variables.
When random variables X and Y have means µX and µY and common covariance
matrix Σ, the Mahalanobis distance between the two means is defined as
\Delta(\mu_X, \mu_Y) = \sqrt{(\mu_X - \mu_Y)^T \Sigma^{-1} (\mu_X - \mu_Y)}.
If we have a sample of each variable and an estimated common covariance matrix S, then the Mahalanobis distance between x and y is
D(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}.
If we have a sample of one variable X, namely X_{all} = (x_1, \ldots, x_n) with sample covariance S_X, and a single point y, then the Mahalanobis distance between y and X_{all} is
D(\bar{x}, y) = \sqrt{(\bar{x} - y)^T S_X^{-1} (\bar{x} - y)},
where \bar{x} is the sample mean of X_{all}.
The Mahalanobis distance between a point y and a point x_i, using the sample covariance matrix S_X, is
D(x_i, y) = \sqrt{(x_i - y)^T S_X^{-1} (x_i - y)}.
These distance measures suggest a corresponding transformation so that the result-
ing dataset or distribution will have an identity covariance matrix.
The Mahalanobis transformation of a dataset (x_1, \ldots, x_n) is as follows:
z_i = S^{-1/2}(x_i - \bar{x}), \quad i = 1, \ldots, n,
where S−1/2 is determined via the spectral decomposition.
Then, for the transformed data, Z_{all} = (z_1, \ldots, z_n), the sample covariance matrix is I_p, since
\mathrm{Cov}(Z) = S^{-1/2} S (S^{-1/2})^T = S^{-1/2} S^{1/2} S^{1/2} S^{-1/2} = I_p.
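A short R check of this transformation, reusing the mat_power() sketch from section 2.6.1 (X is assumed to be an n × p data matrix):

xbar <- colMeans(X)
S    <- cov(X)
Z    <- t(mat_power(S, -1/2) %*% (t(X) - xbar))  # z_i = S^(-1/2)(x_i - xbar)
round(cov(Z), 10)                                # approximately I_p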
Chapter 3
Discriminant Analysis
3.1 Introduction
Discriminant analysis, also known as supervised classification, is the task of construct-
ing a decision rule or classifier based on a set of training data, for which the class labels
are given. The classifier can then be used to assign a class to a new observation whose
class is unknown.
The classes can be defined in any way desired, provided that they are distinct from
each other, i.e. that each observation can only properly belong to one of the defined
classes. The classes can overlap with each other, but in the areas of overlap there should
be some method for determining which class is the best choice.
The applications of discriminant analysis are wide and varied. In medicine, it can
be used for diagnosis (determining a patient’s condition) or prognosis (determining
their outcome status sometime in the future). It can be used in security, defence or
robotics applications for face or object recognition, and with documents for classifying
into categories, including whether or not they are relevant to a particular query. In
finance, it can be of interest to try to decide whether or not a loan applicant is credit-
worthy or if a transaction is fraudulent or worthy of further investigation.
In each of these situations, classes can be clearly defined and applied to label a
set of well understood instances. A set of possibly relevant predictor variables can
be defined, mainly comprised of aspects or features of the experimental units which
can be conveniently measured, and which may plausibly allow some discrimination
between the different classes. It is hoped that, despite measurement errors and large
overlaps between classes in terms of their joint densities in the predictor space, that
collectively there will be sufficient information in the set of predictor variables that
it will be possible to train (or fit) a classifier to fairly accurately predict the class of a
randomly chosen new observation.
3.2 Statistical Classifiers
The training data consists of multivariate observations of predictor variables along
with the class labels. As in the previous chapter, the multivariate observations are
placed in a data matrix X_{All} (see 1.37), whose rows are the individual p-dimensional observations x_1, \ldots, x_n, x_i \in R^p, i.e. each recording a value for p predictor variables. The class labels for the n observations are stored in a vector y_{All} = (y_1, \ldots, y_n)^T, where y_i \in \{1, \ldots, G\}, where G is the number of classes.
We can consider each observation (Xi, Yi) to be drawn from a multivariate joint
distribution F . The marginal distribution for the class variable Y is then a categorical
distribution with parameters \pi_1, \ldots, \pi_{G-1}. Note that \pi_G = 1 - \sum_{g=1}^{G-1} \pi_g. These pa-
rameters are known as the prior probabilities of class membership. For n observations,
the numbers drawn from each class are then distributed according to a multinomial
distribution with the above probability parameters and n trials. In some cases, the
prior probability of each class is known fairly accurately, e.g. on the basis of large past
studies. In other cases, it must be estimated from the training data via
\hat{\pi}_g = \sum_{i=1}^{n} I_g(y_i)/n   (3.1)
where I is the indicator function.
In most cases of interest, the classifier is a deterministic function of the new point
to be classified, X = x, and the training set, and we denote it by η(XAll,yAll)(x), with the
range being 1, . . . , G.
Classifiers based on statistical models allow one to calculate the density function for each class at any point x. More usefully, they allow one to determine how likely each class is, given the observation. To determine this, we use Bayes' theorem, which applies to densities as well as probabilities.
P(Y = g|X = x) = \frac{h(X = x|Y = g)\, P(Y = g)}{f(X = x)},   (3.2)
where Y is the class of observation X and P(Y = g) = \pi_g. h(X = x|Y = g) is the density for the gth class, which we will now abbreviate to f_g(x). f(X = x) is the density for the overall distribution of observations, which can be viewed as a mixture distribution over the classes, with component probabilities \pi_1, \ldots, \pi_G. We will henceforth abbreviate this to f_X(x), and note that, by the law of total probability
f_X(x) = \sum_{g=1}^{G} \pi_g f_g(x).   (3.3)
Since each \pi_g can be specified a priori, P(Y = g|X = x) is known as the posterior
probability of class g given observation x, and we will abbreviate this to τg(x). (Note:
this notation is drawn from (McLachlan, G.J., 1992), an important book in this area.)
As a result, we can rewrite equation 3.2 as
\tau_g(x) = \frac{\pi_g f_g(x)}{\sum_{g'=1}^{G} \pi_{g'} f_{g'}(x)}.   (3.4)
The classification rule is then to choose the class for which the posterior probability is
highest, i.e.
B(x) = \arg\max_{g \in 1,\ldots,G} \tau_g(x) = \arg\max_{g \in 1,\ldots,G} \pi_g f_g(x),   (3.5)
with the right hand equality following since fX(x), the denominator of τg(x), has no
effect on which class is maximal. When using a model and parameters which exactly
describe the classes, this can be shown to produce the lowest expected risk (details will
be given in a later section). When using parameter estimates based on data, we use a
similar rule:
\eta_{(X_{All}, y_{All})}(x) = \arg\max_{g \in 1,\ldots,G} \hat{\tau}_g(x) = \arg\max_{g \in 1,\ldots,G} \hat{\pi}_g \hat{f}_g(x).   (3.6)
3.3 Optimal Classifiers
A classification rule η() will always produce a class as its answer, and this predicted
class can be correct or incorrect. An incorrect answer is an error, and we would like to
produce a classifier which results in a low error rate on new observations. In fact, for G classes, there are G^2 - G unique types of error, and of error rate: one for every ordered pair of distinct classes, excluding the correct answers where the predicted class matches the true class. The allocation rate at which a classifier η predicts a class h for an observation
whose actual class is g is
e_{gh}(\eta) = P(\eta(X) = h|Y = g) = \int_{R_h} f_g(x)\, d\nu, \quad g, h \in 1, \ldots, G,   (3.7)
where ν is the appropriate measure on R^p for f_X(x) and R_h is the segment of the predictor space which the rule will allocate to class h. Note that these regions are disjoint.
Then the class-specific error rate for class g can be written as
\mathrm{Err}_g(\eta) = P(\eta(X) \neq g|Y = g) = \int_{\cup_{h \neq g} R_h} f_g(x)\, d\nu, \quad g \in 1, \ldots, G
= 1 - \int_{R_g} f_g(x)\, d\nu, \quad g \in 1, \ldots, G.   (3.8)
Let cgh be the cost (or loss) for misclassifying an observation whose true class is g
into class h. We assume that cgg = 0, ∀g ∈ 1, . . . , G, i.e. that correct classification is
costless. For now, we will consider all types of errors to be equally undesirable, and
set c_{gh} = 1 for all g \neq h. Hence we will only consider the overall error rate. The
true costs can be scaled up to any value desired, but such scaling has no effect on the
classification.
From now on, it will be convenient to change the form of Y , the variable repre-
senting the true class, to be a G-length binary vector with a 1 in the position of the
true class, and 0’s elsewhere. For example, with three possible classes, the second class
would be represented by y = (0, 1, 0). We will then use y_g to refer to the gth element of
this vector, as needed.
Let (X, Y ) be a new point drawn at random from the distribution F , and define the
zero-one loss function Q for the classifier η as follows:
Q[y, \eta_{(X_{All},y_{All})}(x)] =
\begin{cases}
0, & \eta_{(X_{All},y_{All})}(x) = y \\
1, & \eta_{(X_{All},y_{All})}(x) \neq y.
\end{cases}   (3.9)
For a new observation (x, y), the loss due to allocation using the rule η is
l(y, \eta(x)) = \sum_{g=1}^{G} y_g\, Q[g, \eta(x)].   (3.10)
The expected loss or risk conditional on a particular value for x is then
E_{F_{Y|X=x}} l(Y, \eta(x)) = \sum_{g=1}^{G} E_{F_{Y|X=x}}(Y_g|x)\, Q[g, \eta(x)]   (3.11)
= \sum_{g=1}^{G} \tau_g(x)\, Q[g, \eta(x)]   (3.12)
= \sum_{g \neq \eta(x)} \tau_g(x) = 1 - \tau_{\eta(x)}(x).   (3.13)
Here the expectation is over the random variable Y conditional on X = x, i.e. with
Y drawn from the conditional distribution FY |X=x. This distribution can also be writ-
ten Mult(1, \tau_1(x), \ldots, \tau_G(x)). In most cases, every class distribution has support which
includes x, and so \tau_g(x) > 0 for each class g. The decision rule \eta(x) must be determin-
istic, and so must predict a single class for a given x value. When this is the true class,
it will result in a 0 value for Q[, ] in equation 3.13. All other classes will result in a loss
of Q[\cdot, \cdot] = 1.
An optimal allocation rule is one which minimises the conditional risk (3.13) for
every possible value of x. A rule which does this is called a Bayes (decision) rule or
Bayes-optimal rule or minimum risk rule. It follows from 3.13 that the way to minimise
the risk is to choose
\eta(x) = B(x) = \arg\max_{g \in 1,\ldots,G} \tau_g(x),   (3.14)
since this will ensure that the largest term is left out of the sum. This rule is admissible
with respect to the given loss function, i.e. there is no rule which can produce a lower
risk. For a given joint distribution F , the risk value for the Bayes-optimal rule is called
the Bayes risk.
When the costs of misallocation vary across pairs of classes, the loss function and
Bayes-optimal rule also change. For general costs cgh with a new observation (x, y), the
loss due to allocation using the rule η is
l(y, \eta(x)) = \sum_{g=1}^{G} y_g\, c_{g\eta(x)}\, Q[g, \eta(x)].   (3.15)
The conditional risk becomes
E_{F_{Y|X=x}} l(Y, \eta(x)) = \sum_{g \neq \eta(x)} \tau_g(x)\, c_{g\eta(x)}.   (3.16)
This is more difficult to optimise. In general we can only write that the conditional risk
3.16 will be minimised by choosing
\eta(x) = B(x) = \arg\min_{h \in 1,\ldots,G} \sum_{g \neq h} \tau_g(x)\, c_{gh}.   (3.17)
Statistical classifiers essentially estimate pig and fg(x) for g = 1, . . . , G. These are
used to estimate τg(x), and these estimates are then simply plugged into either 3.14 or
3.17, depending on the cost structure. We will examine some of these classifiers in the
following sections.
3.4 Linear Discriminant Analysis
The method of linear discriminant analysis (LDA) assumes that the population in (or
model of) each class has a multivariate normal distribution, and that the covariance
matrices for all the classes are the same. The classes can have different means and
proportions, as in the general case. The jth class has a multivariate normal density,
with parameters \hat{\pi}_j, \hat{\mu}_j and \hat{\Sigma}.
Since we have the class labels, we separate the training data into classes. If we
have no knowledge of the class probabilities, we estimate them by the proportion of
observations in each class, as in 3.1. If we know the class probabilities or have what
we believe to be more accurate estimates from elsewhere, we would use these instead.
The mean of each class distribution is set equal to the corresponding sample mean of
observations from that class. The common covariance matrix is estimated by removing
the relevant class sample mean from each observation, pooling all the observations and
then using the (unbiased) sample covariance matrix, as in 1.45.
For the following, we assume that all costs of misclassification are equal. The dis-
criminant boundary between two classes, j and l, occurs where the estimated proba-
bility of a new observation x belonging to either the jth or the lth class is equal. So
\hat{\pi}_j \hat{f}_j(x) = \hat{\pi}_l \hat{f}_l(x), where \hat{\pi}_j is the estimated prior probability of an observation belonging to the jth class and \hat{f}_j() is the estimated probability density function of the jth class.
We will now derive the equation for the discriminant boundary between two classes, j
and l.
\frac{\hat{\pi}_j}{(2\pi)^{p/2}|\hat{\Sigma}|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_j)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_j)} = \frac{\hat{\pi}_l}{(2\pi)^{p/2}|\hat{\Sigma}|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_l)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_l)}   (3.18)
Cancelling common terms and taking logs of both sides, we find that:
\log \frac{\hat{\pi}_j}{\hat{\pi}_l} = \frac{1}{2}\left[(x - \hat{\mu}_j)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_j) - (x - \hat{\mu}_l)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_l)\right]   (3.19)
= \frac{1}{2}\left[x^T\hat{\Sigma}^{-1}x - x^T\hat{\Sigma}^{-1}\hat{\mu}_j - \hat{\mu}_j^T\hat{\Sigma}^{-1}x + \hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j - \left(x^T\hat{\Sigma}^{-1}x - x^T\hat{\Sigma}^{-1}\hat{\mu}_l - \hat{\mu}_l^T\hat{\Sigma}^{-1}x + \hat{\mu}_l^T\hat{\Sigma}^{-1}\hat{\mu}_l\right)\right]   (3.20)
Only the two xT Σˆ−1x terms cancel immediately, but this is enough to ensure that the
resulting discriminant boundary equation is linear in x. Also, since Σˆ−1 is a symmetric
p× p matrix, aT Σˆ−1b = bT Σˆ−1a for a, b ∈ Rp. This lets us group terms in x.
\log \frac{\hat{\pi}_j}{\hat{\pi}_l} = \frac{1}{2}\left[-2x^T\hat{\Sigma}^{-1}\hat{\mu}_j + 2x^T\hat{\Sigma}^{-1}\hat{\mu}_l + \hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j - \hat{\mu}_l^T\hat{\Sigma}^{-1}\hat{\mu}_l\right]   (3.21)
= x^T\hat{\Sigma}^{-1}(\hat{\mu}_l - \hat{\mu}_j) + \frac{1}{2}\left[\hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j - \hat{\mu}_l^T\hat{\Sigma}^{-1}\hat{\mu}_l\right]   (3.22)
Thus
x^T\hat{\Sigma}^{-1}(\hat{\mu}_l - \hat{\mu}_j) = \log\frac{\hat{\pi}_j}{\hat{\pi}_l} + \frac{1}{2}\left[\hat{\mu}_l^T\hat{\Sigma}^{-1}\hat{\mu}_l - \hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j\right].   (3.23)
Once all the model parameters have been estimated and entered, this will give a linear
equation in terms of the elements of x.
Like all classifiers, LDA partitions the predictor space into regions. Within each
region the classifier will predict a single class. The boundaries between these regions
are based on combinations of discriminant boundaries, and are thus piecewise linear.
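In R, LDA is available via the MASS package; a minimal sketch on the built-in iris data:

library(MASS)
fit  <- lda(Species ~ ., data = iris)   # common covariance matrix model
pred <- predict(fit, iris)$class
mean(pred != iris$Species)              # apparent error rate (see section 3.7)

The qda() function in the same package fits the quadratic version described in the next section.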
3.5 Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) is the same as linear discriminant analysis,
except that the covariance matrix is allowed to be different for each class. Estimation
of means and proportions is as in LDA, but the covariance matrix for each class is
estimated by the sample covariance matrix for the observations in that class.
Assuming equal costs of misclassification, we can again consider the discriminant
boundary between two classes, j and l.
\frac{\hat{\pi}_j}{(2\pi)^{p/2}|\hat{\Sigma}_j|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1}(x - \hat{\mu}_j)} = \frac{\hat{\pi}_l}{(2\pi)^{p/2}|\hat{\Sigma}_l|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1}(x - \hat{\mu}_l)}   (3.24)
Cancelling common terms and taking logs of both sides, we find that:
\log \frac{\hat{\pi}_j|\hat{\Sigma}_l|^{1/2}}{\hat{\pi}_l|\hat{\Sigma}_j|^{1/2}} = \frac{1}{2}\left[(x - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1}(x - \hat{\mu}_j) - (x - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1}(x - \hat{\mu}_l)\right].   (3.25)
It is not generally possible to simplify this further. Hence the discriminant boundaries
are quadratic in x. The regions dedicated to each class thus have piecewise quadratic
boundaries.
Clearly QDA is more flexible than LDA, and able to represent more complex class
structures. However, it also has more parameters to be estimated from the same amount
of data due to the proliferation of covariance matrices. The uncertainty in the estimates
of the covariance parameters tends to be higher for QDA than LDA as a result. When
data is limited and a common covariance matrix is a reasonable assumption or approximation, LDA can produce better performance on new data than QDA. This is an example of
potential over-fitting, where the flexibility in the model can be wasted on fitting noise
or meaningless variation in the training data, and then act to reduce the accuracy of
predictions on new observations. Unfortunately, it is hard to know how much data
is enough, but one can be sure that more data is needed as the number of predictor
variables increases. In deciding whether to use QDA or LDA, one should create and
consider a set of bivariate or trivariate marginal plots to try to determine whether or
not a common covariance matrix seems reasonable.
3.6 Fisher’s Linear Discriminant Analysis
In 1936, Fisher proposed the first form of linear discriminant analysis for two classes in
(Fisher, 1936). This method does not require the class distributions to be multivariate
normal and it does not require the two classes to have the same covariance matrix. It
does not impose a model on the data, but it does consider the mean and covariance
matrix of each class.
The main idea of Fisher’s Linear Discriminant Analysis (FLDA) is to find a univari-
ate projection of the data for which the ratio of between-class variance to within-class
variance is maximised. A hyperplane perpendicular to this line then provides the discriminant boundary.
Let the sample means of the two classes be µˆ1 and µˆ2, respectively, and let their
common sample covariance matrix be S. Let a ∈ Rp, with ||a|| = 1, be a projection to
be considered (although the magnitude is not important for what follows). Under this
projection, the means of the classes become aT µˆ1 and aT µˆ2, respectively. The variance
between classes is then
\left[a^T(\hat{\mu}_1 - (\hat{\mu}_1 + \hat{\mu}_2)/2)\right]^2 + \left[a^T(\hat{\mu}_2 - (\hat{\mu}_1 + \hat{\mu}_2)/2)\right]^2 = (a^T\hat{\mu}_1 - a^T\hat{\mu}_2)^2/2.   (3.26)
The common variance (within classes) in the projected space is given by aTSa. Note
that this does not require that we believe there is a common covariance matrix for the
classes. In the projected space, we are merely adding up the sum of squared distances
from the relevant class sample mean across all observations and multiplying by a con-
stant (1/(N − 2)). Without the constant, this is also known as the within-class scatter.
Hence the ratio of between class to within class variance is:
\frac{(a^T\hat{\mu}_1 - a^T\hat{\mu}_2)^2}{2\, a^T S a}   (3.27)
It can be shown (by taking derivatives and then solving as an eigenvalue problem)
that this ratio is maximised when a is proportional to S^{-1}(\hat{\mu}_1 - \hat{\mu}_2). The discriminant
function is then a^T x = x^T a = x^T S^{-1}(\hat{\mu}_1 - \hat{\mu}_2). The discriminant boundary would
then naturally be set orthogonal to the direction of the projection a. Where it should
be put along this line is open to some debate. The discriminant boundary is given by
aTx−a0 = 0, with a0 determining the position of this hyperplane. A common choice of
a0 is the mid-point of the two class sample means in the projected space, which leads
to a0 = aT (µˆ1 + µˆ2)/2. If any sort of statistical model is applied to each class, one can
choose to place the threshold at the point where posterior density under each class is
equal.
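A direct R sketch of the two-class computation, using the midpoint threshold; fisher_lda is an illustrative name:

# Fisher's discriminant direction for two classes.
# X1, X2: matrices of observations from each class.
fisher_lda <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  S  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)  # pooled S
  a  <- solve(S, colMeans(X1) - colMeans(X2))  # a prop. to S^(-1)(mu1 - mu2)
  a0 <- sum(a * (colMeans(X1) + colMeans(X2))) / 2   # midpoint threshold
  list(a = a, a0 = a0)   # allocate x to class 1 when sum(a * x) > a0
}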
3.7 Estimation of Error Rates
In addition to selecting and training a classifier, we would also like to estimate its
accuracy on new observations. To be precise, we would like to estimate the probability
of the classifier η() misclassifying a new observation (X, Y ) drawn from F :
Err = EF Q(Y, η(X)). (3.28)
If the costs of misclassification vary from one class to another, or just for a better un-
derstanding of the classifier’s capabilities, we would also like to estimate Errg, g =
1, . . . , G.
Note that
\mathrm{Err}_g = E_{F|Y=g}\, Q(Y, \eta(X))
= P(\eta(X) \neq Y | Y = g)
= P(\eta(X) \neq Y,\, Y = g)/P(Y = g)
= E_F\, Q(Y, \eta(X))\, I(Y, g)/\pi_g.
We could go further and try to estimate every type of error rate Errgh, g 6= h =
1, . . . , G. However, this is normally not necessary, and unless we have plenty of data
from every class, these will be difficult to estimate. Note that the bias of any estimator
of the predictive error rate Eˆrr is E(Eˆrr − Err), where the expectation is over the
distribution producing whatever data is used to estimate the error rate.
If we have used all the available labelled data to train our classifier, then we have
no independent observations left for the estimation of error rates. The classic solution,
akin to reporting R2 with regression, is to reuse the training data to test the classifier.
This leads to a set of estimates known as the apparent error rates. These are often
optimistic, i.e. downwardly biased, and are hence less than ideal. The bias will tend to
be more severe if the classifier is capable of over-fitting the data, a problem that tends
to be worse for higher numbers of dimensions p.
The obvious alternative is to separate the labelled data into training and test sets,
using the former to train the classifier and the latter to test it, i.e. estimate the error
rates. This was originally known as the holdout method, but is now more commonly
known as the test set method. This raises a number of questions, however.
1. Which classifier are we trying to estimate the error rate for?
2. What proportion of the data will we allocate to the test set?
3. How will we implement this proportion?
4. How can we provide confidence intervals for the error rate?
The first question is easily forgotten, but needs a clear answer. First we must answer
the question: what classifier would we like to use for the prediction of new observa-
tions? Or more precisely: what data would we like to use to train this classifier? In
most cases, the answer will be to use all the available data to train what may be termed
the final classifier. This is the classifier being trained when using the apparent error
rate estimator. However, it is not possible to assess this classifier directly with the test
set method. We must leave some fraction of the data out for testing. Reducing the size
of the training set will tend to reduce the predictive accuracy of the trained classifier,
and hence lead to an upward bias in the error rate. While any form of bias is unfortu-
nate, this bias is likely to be small if the proportion left out is small, and a conservative
estimate is probably more acceptable than an optimistic one.
This tends to suggest that the answer to the second question is to leave out less
than half the data for testing. Proportions such as 1/3, 1/4, 1/5 or 1/10 of the data
may be reasonable. One would like the test set to represent the distribution as well as
possible, so it will need some observations from every class, and sufficient observations
so that confidence intervals for the error rates are not too wide. All of this is easier if
one has a large dataset. One answer to the fourth question is to view the number of
errors observed out of the total as the outcome of a binomial experiment, and hence use
one of the corresponding confidence intervals for proportions. The width of any such
confidence interval is proportional to 1/\sqrt{n_{test}}, where n_{test} is the size of the relevant test set.
In answer to question 3, one typically randomly reorders the data, and then takes
the last ntest observations (Xtest, ytest) to be the test set, leaving ntrain = n − ntest ob-
servations (Xtrain, ytrain) in the training set. We assume that each observation has been
drawn independently from the distribution F , but it is possible that this is not truly
the case or that there has been some sorting of the data after collection. The random
reordering acts to reduce the effect of either of these. For example, if the observations
from each class have been grouped together, the random reordering will produce a mix
of classes in a test set taken as described above. The overall error rate is then estimated
via
\hat{\mathrm{Err}}_{test} = \frac{\sum_{i=1}^{n_{test}} Q[y_i, \eta_{(X_{train}, y_{train})}(x_i)]}{n_{test}}.   (3.29)
The corresponding class-specific error rate estimators are:
\hat{\mathrm{Err}}_{test,g} = \frac{\sum_{i=1}^{n_{test}} Q[y_i, \eta_{(X_{train}, y_{train})}(x_i)]\, I(y_i, g)}{\sum_{l=1}^{n_{test}} I(y_l, g)},   (3.30)
where I(\cdot, \cdot) is the indicator function.
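As an illustration, a test-set estimate for an LDA classifier in R, with X an n × p matrix and y a factor of class labels (both assumed given):

library(MASS)
set.seed(1)
idx  <- sample(nrow(X))                    # random reordering
test <- idx[1:round(nrow(X) / 4)]          # hold out roughly 1/4 for testing
fit  <- lda(X[-test, , drop = FALSE], grouping = y[-test])
pred <- predict(fit, X[test, , drop = FALSE])$class
mean(pred != y[test])                      # overall error rate (3.29)
tapply(pred != y[test], y[test], mean)     # class-specific error rates (3.30)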
Around 1968, cross-validation (CV) was proposed to improve on the test set approach (Lachenbruch and Mickey, 1968). Further details were presented by Stone (1974) and Geisser (1975). These re-sampling
methods involve a partition into a test set and a training set and the counting of errors,
but then the repetition of this procedure many times with different systematically or
randomly chosen subsets used for testing.
The simplest such approach leaves out only one observation for testing at each
iteration, and uses the remaining n − 1 observations for training. This is then re-
peated n times, each time leaving a different observation out. This leave-one-out cross-
validation (LOO CV) error rate estimator can be written as follows:
\hat{\mathrm{Err}}_{LOOCV} = \frac{\sum_{i=1}^{n} Q[y_i, \eta_{(X_{All}^{-i}, y_{All}^{-i})}(x_i)]}{n},   (3.31)
where −i means that the ith observation has been left out of this set. The corresponding
class-specific error rate estimators are:
\hat{\mathrm{Err}}_{LOOCV,g} = \frac{\sum_{i=1}^{n} Q[y_i, \eta_{(X_{All}^{-i}, y_{All}^{-i})}(x_i)]\, I(y_i, g)}{\sum_{l=1}^{n} I(y_l, g)}.   (3.32)
This method is attractive since it uses nearly all the data for training each time,
and yet manages to eventually test on all n observations. The latter is common to
all cross-validation schemes, and the former should reduce the bias of the estimator.
However, both theory and experiment have shown that the variance of the estimator
can unfortunately be fairly high. This is believed to be due to the high similarity of
each training set, with the effect that any unusual aspects in the sample will tend to be
modelled in each of the n trained classifiers. The other drawback of this approach is
that it requires the model to be refit n times, which can be undesirable or prohibitive
when n is large.
Geisser proposed an alternative known as K-fold cross validation. This requires
the removal of approximately n/K observations (known as a ’fold’) for testing at each
iteration. This is repeated a total of K times with separate non-overlapping folds each
time until each observation has been used for testing once.
\hat{\mathrm{Err}}_{K\,CV} = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n_k} Q[y_i^k, \eta_{(X_{All}^{\prime k}, y_{All}^{\prime k})}(x_i^k)]}{n},   (3.33)
where nk is the number of observations in the kth fold, xki is the ith observation in the
kth fold and ′k means without the kth fold. The corresponding class-specific error rate
estimators are:
\hat{\mathrm{Err}}_{K\,CV,g} = \frac{\sum_{k=1}^{K} \sum_{l=1}^{n_k} Q[y_l^k, \eta_{(X_{All}^{\prime k}, y_{All}^{\prime k})}(x_l^k)]\, I(y_l^k, g)}{\sum_{l=1}^{n} I(y_l, g)}.   (3.34)
K-fold CV requires the model to be fit to data only K times, where K is typically
chosen to be 2, 3, 5 or 10. LOO CV can be seen as n-fold CV. As K increases, the bias
of the estimator due to left out data decreases, but the variance tends to increase, and
K = 5 or K = 10 are common compromise values.
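A K-fold CV sketch for the same LDA classifier, continuing the notation of the earlier test-set example (X and y assumed given):

K    <- 5
fold <- sample(rep(1:K, length.out = nrow(X)))   # random fold assignment
errs <- sapply(1:K, function(k) {
  fit  <- lda(X[fold != k, , drop = FALSE], grouping = y[fold != k])
  pred <- predict(fit, X[fold == k, , drop = FALSE])$class
  sum(pred != y[fold == k])                      # errors on the kth fold
})
sum(errs) / nrow(X)                              # overall K-fold CV rate (3.33)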
Other variations exist, which may improve upon the above. Breiman and col-
leagues developed balanced CV, which allocates observations to folds so that, when-
ever possible, each class is represented in each fold, with little variation in the class
proportions among folds. It is unclear as to whether this method has better properties
than K-fold CV.
The random allocation of observations to folds implies that the results of a single
CV run will not be unique. Hence one can consider v repetitions of K-fold CV. How-
ever, Burman’s results (Burman, 1989) showed that this is not an improvement over
Kv-fold CV.
Another noteworthy option is repeated learning-testing, in which a test set of a
certain size is chosen at random, a classifier is trained on the training portion, tested
on this test set and the errors counted. This is then repeated as many times as desired,
with the error rates estimated from the aggregated test results. Again, the results of
Burman do not seem to indicate any improvement over K-fold CV. Overall, Burman’s
results tend to favour LOO-CV, but these are not conclusive.
If one needs to select a model out of K possible models using CV and then also
test it using CV, one can use the following procedure, first covered by Stone (1974) for
the LOOCV case. It has been called nested CV, two-layer CV and two-level CV in the
literature.
1. Split the data into m folds. m− 2 of these are for training, one for validation and
one for testing.
2. Using the training and validation data : perform m− 1 fold CV for k = 1, . . . , K
• Estimate the overall error for each k (from the CV results)
• Choose k which gives the lowest error rate
3. Now that we have say k = k∗1 , train the classifier on the training and validation
data. Test on the test fold.
4. Repeat for other choices of test fold (total ofm times), getting a different k∗2, . . . , k∗m
each time. Average the results to estimate the error rate of the final classifier on
future observations.
5. For the final classifier, choose a new k = k∗final using a similar procedure, but
without a test set (only training and validation sets defined).
6. The final classifier is trained on all of the data using k = k∗final.
Note that while we advocate reporting the overall error rate and the class-specific
error rates, some authors choose instead to report a confusion matrix, which is a matrix
whose (i, j)th element is the number of observations labelled with class i which the
classifier predicted to have class j. If this matrix is divided through by the sum of its
entries (the total number of observations in the test set(s)), then the sum of the off-
diagonal elements should equal Eˆrr. If instead the gth row is divided by its total, then
the sum over this row, excluding the gth element, will be Eˆrrg.
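Continuing the earlier test-set example (with its pred and test objects), a confusion matrix is one line of base R:

conf <- table(true = y[test], predicted = pred)
conf / sum(conf)                     # off-diagonal sum equals the Err estimate
sweep(conf, 1, rowSums(conf), "/")   # gth row off-diagonals sum to Err_g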
Some other methods of error rate estimation are based on the bootstrap (Efron, 1981,
1983). Efron described the non-parametric bootstrap estimator of error rate as follows:
\hat{\mathrm{Err}}^{(BOOT)} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} Q[y_i, \eta_b(x_i)],   (3.35)
where C−i is the set of bootstrap samples which do not contain observation i and |C−i|
is its cardinality. A non-parametric bootstrap sample is constructed by taking a sample
of size n with replacement from the observed n observations, i.e. sampling from the
empirical distribution of the data. As a result, many observations are not sampled at
all, and others are sampled more than once. The probability that a particular observation becomes part of a bootstrap sample is 1 minus the probability that it is left out, i.e. 1 − (1 − 1/n)^n. As n → ∞, this approaches 1 − 1/e ≈ 0.632. The expected number of unique observations in a bootstrap sample is then approximately 0.632n. As
a result, there is generally some upward bias in Eˆrr(BOOT ). Efron proposed that this be
corrected by combining it with the apparent error rate estimate err as follows:
\hat{\mathrm{Err}}^{(.632)} = 0.368\, \mathrm{err} + 0.632\, \hat{\mathrm{Err}}^{(BOOT)}.   (3.36)
Despite some theoretical encouragement, there is little evidence that this or its succes-
sors are better estimators than those based on cross-validation.
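A sketch of (3.35) and (3.36) for the LDA classifier used earlier; note, as a caveat, that a bootstrap sample can occasionally omit a class entirely, which a real implementation would need to guard against.

B <- 200; n <- nrow(X)
err_ib <- matrix(NA, n, B)            # error of classifier b on left-out point i
for (b in 1:B) {
  s   <- sample(n, replace = TRUE)    # non-parametric bootstrap sample
  fit <- lda(X[s, , drop = FALSE], grouping = y[s])
  out <- setdiff(1:n, s)              # observations not in bootstrap sample b
  err_ib[out, b] <- predict(fit, X[out, , drop = FALSE])$class != y[out]
}
err_boot <- mean(rowMeans(err_ib, na.rm = TRUE), na.rm = TRUE)    # (3.35)
app_err  <- mean(predict(lda(X, grouping = y), X)$class != y)     # apparent
0.368 * app_err + 0.632 * err_boot                                # (3.36)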
LOOCV was preceded by a related LOO procedure, the jackknife. This was devel-
oped to estimate the bias (Quenouille, 1949, 1956) and standard error (Tukey, 1958) of
a particular estimation procedure. The ith jackknife sample x(i), i = 1, . . . , n leaves out
the ith observation. One then estimates the parameter vector θ based on each of these
samples, resulting in θˆ(i), i = 1, . . . , n, and additionally produces the estimate θˆ based
on the full sample. We then define the average jackknife estimate as
\hat{\theta}_{(.)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}.   (3.37)
The jackknife estimate of bias is given by
\hat{\mathrm{bias}}_{jack} = (n-1)\left(\hat{\theta}_{(.)} - \hat{\theta}\right).   (3.38)
The jackknife estimate of standard error is given by
\hat{\mathrm{se}}_{jack} = \left[\frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(i)} - \hat{\theta}_{(.)}\right)^2\right]^{1/2}.   (3.39)
See e.g. Efron and Tibshirani (1993) p141–143 for derivations.
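For example, the jackknife bias and standard error of the sample standard deviation of a numeric vector x can be computed directly:

theta_hat <- sd(x)
theta_i   <- sapply(seq_along(x), function(i) sd(x[-i]))  # leave-one-out
theta_dot <- mean(theta_i)                                # (3.37)
(length(x) - 1) * (theta_dot - theta_hat)                 # bias (3.38)
sqrt((length(x) - 1) / length(x) *
     sum((theta_i - theta_dot)^2))                        # se (3.39)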
3.8 Mixture Discriminant Analysis
For a classifier to approach the optimal Bayes rule, it needs to use sufficiently flexible
models of the class distributions so that these can be modelled accurately, and enough
data so that the fit of each is as close as possible to that of the true class distributions.
One option is to fit a mixture model to each class. This was done in applications by
McLachlan and Gordon (1989) and formalised by Hastie and Tibshirani (1996), who
named the method mixture discriminant analysis (MDA).
MDA assumes that each mixture component in each class is normal and that each
has the same covariance matrix Σ. This method is available in R using the mda pack-
age. The equal covariance matrices restriction was lifted by Fraley and Raftery (2002),
who have since maintained an R package which implements this method and is the
most popular R package for mixture modelling generally: mclust.
To use this method, one does the following. For each of the G classes, fit a normal
mixture model to the observations from that class. Use a method such as BIC to choose
the number of components, separately for each class. For a new observation, calculate
its density using the mixture distributions for each of the classes. Weight these density
values by the class probabilities (known or estimated) and assign the observation to
the class with the highest weighted density at that point.
This imitates the Bayes-optimal rule, with the true class densities each replaced by
a different mixture model. For a new point x, we can write the discriminant rule as
g^* = \arg\max_g \pi_g \hat{f}_g(x),   (3.40)
where \pi_g is the prior probability of class g and
\hat{f}_g(x) = \sum_{h=1}^{H_g} \lambda_{gh}\, \hat{f}_{gh}(x),   (3.41)
where λgh is the probability of component h within the mixture model used to fit the
gth class and fˆgh(x) is the fitted hth component density of the mixture used to fit the
gth class.
3.9 Kernel Density Discriminant Analysis
Kernel density estimation is typically familiar from plots - it gives a quick impression
of the shape of a data distribution by smoothing the empirical distribution. However,
since it provides a flexible non-parametric model of a density it can be used in a variety
of contexts, including discriminant analysis (classification). In this context, one uses a
different kernel density estimate for each class and then uses these to approximate
the Bayes’ classifier, as in LDA, QDA and MDA. While kernel density estimation is
most often done in only one dimension, it can be extended to an arbitrary number of
dimensions. However, it tends to work best for just a few dimensions.
For a random sample in p dimensions {x1, . . . , xn}, the kernel density estimate is
given by
\hat{f}(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i).
Here K(x) is a kernel function, which needs to be a symmetric pdf, H is a symmet-
ric, positive-definite bandwidth matrix and
K_H(x) = |H|^{-1/2} K(H^{-1/2} x),
where H−1/2 is determined via spectral decomposition, as explained in section 2.6.1.
A number of options exist for the kernel, and one of these is the standard multivari-
ate normal density, i.e.
K(x) = \frac{1}{(2\pi)^{p/2}} e^{-\frac{1}{2} x^T x}.
This leads to
K_H(x) = \frac{1}{(2\pi)^{p/2}|H|^{1/2}} e^{-\frac{1}{2} x^T H^{-1} x},
i.e. the kernel function is a multivariate normal density with mean 0 and covariance
matrix H .
So each observation will be the centre of a multivariate normal distribution. Ef-
fectively the kernel density model is a mixture model with n components. Given that
the number of components and the means and proportions have been determined, the
main task is to choose the bandwidth matrix H . Note that the matrix is common to all
of the components.
The two main options for this matrix are to make it diagonal (i.e. without covari-
ance terms) or unconstrained, i.e. a full covariance matrix. In either case the aim of the
density estimation is to provide the best possible approximation to the distribution the
data was drawn from, which we will assume has density f . The usual measure of dis-
tance between f and fˆ is mean integrated squared error (MISE), and here we consider
it as a function of the bandwidth matrix.
\mathrm{MISE}(H) = E \int_{\mathbb{R}^p} \left[\hat{f}(x, H) - f(x)\right]^2 dx.
The optimal choice ofH will minimise MISE(H) over symmetric, positive definite p×p
matrices. MISE(H) can be approximated using a wide range of methods including
cross-validation. The derivations for these are substantial and one can refer to Wand
and Jones (1995) or Chacon and Duong (2011) for examples, bearing in mind that this
remains an active research area. Most methods require numerical optimization of H
to minimise the approximated MISE criterion, and since a general H contains p(p +
1)/2 unique parameters, this provides one of the challenges to utilizing such methods
in higher dimensions. There are also some computational difficulties in using large
sample sizes, although various approximations have been developed to reduce these.
Kernel density discriminant analysis works in the same way as mixture discrimi-
nant analysis or other stochastic-model-based procedures which approximate the Bayes-
optimal rule. Each class density is approximated by a different kernel density model,
fitted using data from that class alone.
For a new point x, the discriminant rule is
g^* = \arg\max_g \pi_g \hat{f}_g(x),
where \pi_g is the prior probability of class g and \hat{f}_g(x) is the kernel density estimate of
the gth class evaluated at x.
One implementation of multivariate kernel density estimation and classification
is via the ks package in R, as described in Duong (2007), which can handle up to 6
dimensions. If wishing to use this with higher-dimensional data, one would have to
reduce the dimensionality first, likely losing some information.
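A sketch using ks is below (the exact interface should be checked against the package documentation); X, y and a vector of prior probabilities pi_hat are as before, with x0 a new observation:

library(ks)  # assumed available; provides kde() and Hpi()
# One kernel density estimate per class, using a plug-in bandwidth matrix.
fits <- lapply(split(as.data.frame(X), y), function(d) {
  d <- as.matrix(d)
  kde(x = d, H = Hpi(d))
})
# Weight each class density at x0 by its prior probability, as in the rule above.
scores <- mapply(function(f, p) p * predict(f, x = x0), fits, pi_hat)
names(which.max(scores))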
3.10 Nearest Neighbour Classifiers
We will now take a direction away from likelihood-based models to consider some
competing forms of classifier.
First we consider the nearest neighbour algorithm, which is probably the simplest.
When presented with a new observation of the predictor variables x, one looks for the
nearest point in the training set, and gives the new observation the same class.
An extension of this idea is the k-nearest neighbours algorithm. When presented
with a new observation of the predictor variables x, one looks for the k nearest points
in the training set - these can be termed the neighbours. We then predict the class of the
new observation to be the class which occurs most frequently among the neighbours.
When k = 1, this is the nearest neighbour classifier.
A number of issues remain:
1. What kind of distance metric should we use?
2. How do we deal with ties?
3. How should we choose k?
4. What properties does this algorithm have?
Any metric that encapsulates our concept of closeness adequately is suitable. The
most common choice is Euclidean distance, i.e. the L2 norm
\|x - x_i\|_2 = \sqrt{(x - x_i) \cdot (x - x_i)} = \sqrt{(x_1 - x_{i1})^2 + \cdots + (x_p - x_{ip})^2}   (3.42)
One problem that arises with most choices of distance metric is a lack of scale invari-
ance. For example, if we have p predictor variables, where one of them is measured
in metres, but we decide to instead report it in centimetres, we will have scaled that
dimension up by a factor of 100. We would hope that the k-nearest neighbour rule
would be unaffected, but with the rule defined above, the effect would be large and
undesirable.
As an example, consider the following plots (R code is in Appendix B). They are intended to show two classes, with the x_1 values both unscaled and scaled. In the first plot, a nearest neighbour classifier would work well. It would work less well in the second case. Note that the plotting commands use equal scales on each axis.
This also casts doubt over any particular use of a distance metric in the predictor
space. An effective metric will have to offer a way of making distances comparable
across predictor variables. Imagine we only have two predictor variables, x1 and x2. If
we have absolutely no prior information on these variables, then we might decide to
scale each dimension so that some measure of spread is similar along each of x1 and
x2, i.e. we would scale each of these so that their sample estimate of some measure
of spread is 1. Available measures of spread include the standard deviation and the
inter-quartile range. The former is probably better when the marginal distributions
appear approximately normal (symmetric, fast-decaying tails), while the latter can tol-
erate ’outliers’ and ’heavy tails’. If x1 is approximately normally distributed and x2 is
approximately t_1 (Cauchy) or t_2 distributed, then extreme points in x_2 will only tend
to be closest to other extreme points (in the same direction) under Euclidean
or most other metrics. When there are many such heavy-tailed distributed predictor
variables, such extreme points are unlikely to be chosen among the neighbours of any
[Figure: scatter plots of the two classes (axes xs and ys), illustrating the effect of scaling the x_1 values.]
new observation.
An alternative approach, proposed by Olshen (1977) and Devroye (1978) uses ranks.
For each predictor variable, the data and the new observation's value are given ranks from 1 to n + 1, corresponding to smallest (or most negative) to largest. The data is then effectively replaced by these ranks: r_i^{(j)} being the rank of the jth component of the ith observation, and r^{(j)} being the rank of the jth component of the new observation. The empirical distance between the new observation x and x_i is then defined to be
\rho(x, x_i) = \max_{1 \le j \le p} |r_i^{(j)} - r^{(j)}|.   (3.43)
Since all the distances will be integer-valued, ties will frequently occur, and are usually
broken by (uniform) random ordering.
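The following is a small R sketch of this rank-based distance (an illustration, not from the notes); the data matrix X and the new observation are hypothetical:

# Empirical rank distance (3.43): the new observation is ranked jointly
# with the n training values in each coordinate, ties broken at random
rank_distance <- function(X, xnew) {
  n <- nrow(X); p <- ncol(X)
  R <- matrix(0, n + 1, p)
  for (j in 1:p) {
    R[, j] <- rank(c(X[, j], xnew[j]), ties.method = "random")
  }
  # maximum over coordinates of the absolute rank differences
  apply(abs(R[1:n, , drop = FALSE] - matrix(R[n + 1, ], n, p, byrow = TRUE)), 1, max)
}

X <- cbind(c(1, 2, 5, 6, 11), c(3, 1, 4, 2, 5))  # 5 hypothetical observations
rank_distance(X, c(4, 2.5))                      # distances to a new point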
If we have prior information that, for example, 2 units in x1 are equivalent to 1 unit
in x2, then we may use this instead. We could then normalise the space so that 1 unit is
equivalent in both by replacing x1 with x′1 = x1/2. See the following for further details:
Devroye et al. (1996).
If there is a tie in terms of the most common class, we typically choose the assigned
class uniformly at random amongst those classes.
One would like to choose k such that it usually picks up observations which have
the same class as x, the new observation. If the regions which have highest (true) pos-
terior probability of membership of a class are reasonably large and close to spherical
in the (possibly transformed) predictor space, then we should be able to pick k such
that k/n is appreciable. If not, we will need to pick a smaller k. Of course the smallest
value we can pick is 1, and there is no guarantee that we have another point in the
dataset sufficiently close to x that we can infer anything useful about its class. Any
suggestions?
It has been shown that the k-nearest neighbour classifier is universally consistent,
i.e. that it approaches the Bayes-optimal rule as n→∞, provided k →∞ and k/n→ 0.
It has also been shown that the risk of the nearest neighbour rule is no more than twice
that of the Bayes-optimal rule, and the risk of the k-nearest neighbours rule is no more
than $(1 + \sqrt{2/k})$ times that of the Bayes-optimal rule. See Devroye et al. (1996) for
proofs and further discussion.
3.11 Classification Trees
Classification trees were developed independently by multiple groups of researchers.
The most famous examples are CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993).
These are rule-based rather than model-based classifiers which iteratively partition the
predictor space into rectangular-shaped areas, each of which is completely assigned to
a single class (or value in the case of regression trees).
For a given dataset, one builds a classification tree by iteratively choosing a predic-
tor variable and then determining the optimal way to split up that variable’s support
among the classes. The aim is that each subgroup is closer to being all of one class. In
other words, we are choosing the location of decision boundaries defined by values of
a particular variable. Once this is done, we look at each part of the space and decide if
we should split it again, and if so, we must again choose a predictor variable and the
value(s) at which to split it. We can continue this splitting until all the “leaf” nodes
(those with no children) each contain training observations from only one class, or no
further useful splits can be found, or some other stopping criterion is met.
Once a tree is constructed, one passes each training observation through it by start-
ing at the root node of the tree and choosing a sub-tree based on the split criteria ap-
plied to that observation. Once one reaches a leaf node, one checks the proportion
of the training observations at this node belonging to each class. The node is then
assigned the class which is most common among training observations at that node.
One uses the classification tree to classify a new observation x by passing it through
the node decision rules until reaching a class-assigning leaf node.
Ultimately we face the following questions:
1. What criterion are we seeking to optimise in selecting a variable and a split point?

2. How will we do this optimization?

3. How do we decide when to stop splitting?

4. How can we interpret the resulting tree?
In answering the first question, one might first think that we should minimise mis-
classifications when applying the classifier to the training set. This is perfectly reason-
able and is sometimes used.
Figure 3.1: Node impurity functions with two classes - from Hastie et al. (2009).
The three most popular measures of what is termed “node impurity” are cross-
entropy, the Gini index and misclassification rate. When there are G possible classes,
then at the mth node, the cross-entropy is defined to be
\[
\text{Cross-entropy} = I_C(m) = -\sum_{g=1}^{G} \hat{p}_{mg} \log \hat{p}_{mg}, \tag{3.44}
\]
where $\hat{p}_{mg}$ is the proportion of the training observations in class g out of those which
pass through the decision tree to the mth node. In cases where a particular class g′ is
not present among the training observations at this node, we take $\hat{p}_{mg'} \log \hat{p}_{mg'}$ to be
zero, as per the asymptotic limit. The Gini index at the mth node is defined to be
\[
\text{Gini index} = I_G(m) = \sum_{g=1}^{G} \hat{p}_{mg}(1 - \hat{p}_{mg}). \tag{3.45}
\]
The (training) misclassification rate at the mth node is defined to be
\[
\text{Misclassification rate} = I_M(m) = 1 - \hat{p}_{mg^*}, \tag{3.46}
\]
where $g^* = \arg\max_g \hat{p}_{mg}$. These three node impurity functions are shown in Figure 3.1 for a
two-class classification case versus p, here the proportion of observations in the first class.
In practice, the choice of impurity type often makes very little difference to the type of
tree formed.
One can stop splitting the tree at a given node once all the observations at this node
are from the same class: each of these impurity measures then takes its minimum value
of zero. We then simply assign that class to any new observations which arrive at this
node. Conversely, the maximum value of these functions occurs when each class is
equally likely in the empirical distribution based on training observations at that node.
Hence the natural aim is to reduce the node impurity to the greatest extent possible
via each split. We define
\[
\Delta I(m) = I(m) - \hat{P}_{Lm} I(m_L) - (1 - \hat{P}_{Lm}) I(m_R), \tag{3.47}
\]
where $m_L$ and $m_R$ are the indices of the left and right descendant nodes (children) of
node m, respectively, and $\hat{P}_{Lm}$ is the proportion of the population reaching node m
which will be assigned to node $m_L$, i.e. take the left branch, as estimated from the
training data.
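To make these definitions concrete, here is a minimal R sketch (not part of the notes) of the three impurity measures (3.44)-(3.46) and the impurity reduction (3.47), each taking a vector of the class labels observed at a node:

impurity <- function(y, type = c("entropy", "gini", "misclass")) {
  type <- match.arg(type)
  p <- table(y) / length(y)
  p <- p[p > 0]                        # 0 log 0 is taken to be zero
  switch(type,
         entropy  = -sum(p * log(p)),  # cross-entropy (3.44)
         gini     = sum(p * (1 - p)),  # Gini index (3.45)
         misclass = 1 - max(p))        # misclassification rate (3.46)
}

# Impurity reduction (3.47) for a proposed binary split
delta_I <- function(y_left, y_right, type = "entropy") {
  n  <- length(y_left) + length(y_right)
  PL <- length(y_left) / n
  impurity(c(y_left, y_right), type) -
    PL * impurity(y_left, type) - (1 - PL) * impurity(y_right, type)
}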
All these terms are naturally dependent on the way the split is carried out, i.e. the
parameters of the split. The split parameters of the mth node consist of the choice of
variable on which to split, which can be labelled $j_m$, and the value of this variable at
which to split. If the chosen predictor variable is continuous, then we need a single real-valued
split point $\lambda_m$. We would assign an observation x to the left branch if $x_{j_m} < \lambda_m$
and otherwise to the right branch.
We can use a similar rule if the chosen variable is discrete or ordinal categorical,
although $\lambda_m$ would then be restricted to the second through to the last of the possible values
for this variable (when this number is finite). If the chosen predictor variable is binary
(e.g. ∈ {0, 1}), then we can assign an observation x to the left branch if $x_{j_m} = 0$ and
otherwise to the right branch. If the chosen predictor variable is nominal, i.e. unordered
categorical, we must still find a way to make a binary decision. One obvious choice is
to consider splitting off a single category versus the rest at this node. One could instead
consider subsets of categories. This may be computationally feasible if the number of
such sets is reasonable, e.g. 100 or fewer.
When the split variable is finitely discrete or categorical, we can find the best choice
of split, with respect to the change in impurity on the training set, by searching over the
entire set of possible splits. When the split variable is continuous, we should ideally
calculate ∆I(m) in analytic form and then seek to maximise this function. We would
then use calculus, consider stationary points and any relevant boundary values. How-
ever, since we will not typically know the form of the class-generating distributions,
for a particular attribute we need only search through the observed values of that
attribute from the second smallest to the largest, considering a split between each such
point and the previous point.
If the split is to be made somewhere between consecutive training set values a and
b, a < b, then any value in that range will result in the same estimated node impurity.
One then commonly chooses the cut-point to be the midpoint (a + b)/2 to encourage
reasonable generalisation capabilities. Another option is to choose a weighted point
$(1 - \hat{P}_{Lm})a + \hat{P}_{Lm}b = a + \hat{P}_{Lm}(b - a)$, where $\hat{P}_{Lm}$ is the estimated proportion of observations
which would be directed to the left node via this rule. This uses the assumption
that observations falling in the interval (a, b) should be allocated to the left and right
sub-nodes with probabilities estimated by $\hat{P}_{Lm}$ and $1 - \hat{P}_{Lm}$, respectively. It is also
consistent with the assumption that the probability density of points in the interval is
uniform, since the length from a to the cut-point is effectively considered proportional
to the probability of the left branch.
Despite appearances, neither assumption is particularly reasonable. If, for example,
the right branch has a higher proportion of observations, we have likely observed more
extreme points from the right branch, that is, b may already be closer to the ideal cut
point than a. While it might be possible to determine ideal cut-points given full knowl-
edge of the class densities fg(x), given the data, none of the cut points in the interval
will appear to produce better performance than any other unless tested on left out
data. The distribution between the points may not be uniform either, though this may
be more reasonable if they are close together, which is more likely with larger datasets.
The theory of maximum margins, as seen later with support vector machines, offers
some more support for the use of the midpoint. However, while the splits sometimes
produce class boundaries, splits are often performed which reduce impurity, but do
not split classes, particularly when class prior probabilities are far from equal. There is
little theory or experimental evidence to support either method of selecting cut-points,
and they are unlikely to make much difference to performance.
Note that the form of the decision tree produced (order of splits, choice of split
variables) is invariant under affine transformations of the variables, i.e. linear trans-
formation plus constant. This useful property means that there is no need to normalise
any of the predictor variables before fitting a tree to them. Note that if the data is given
an affine transformation, the values of the split points will also undergo this transfor-
mation.
Multi-way splits on a single predictor variable are another option for the construc-
tion of trees. However, a well-chosen (K + 1)-way split will always be favoured over a
K-way split if using a simple extension of the ∆I criterion. This can be corrected for
by using the following criterion:
\[
\Delta I(m) = \frac{I(m) - \sum_{k=1}^{K} \hat{P}_{km} I_{mk}}{-\sum_{k=1}^{K} \hat{P}_{km} \log \hat{P}_{km}}, \tag{3.48}
\]
where $\hat{P}_{km}$ is the estimated proportion of observations which would be directed to the
kth child node of node m via the proposed rule and $I_{mk}$ is the impurity of the kth
child node. Note that the denominator can only be zero if the proposed rule directs all
training observations to a single child node; such a degenerate split need not be
considered. Searching all possible multi-way splits out of a single node can be
computationally expensive, and hence is often avoided in favour of binary splits.
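As a sketch (again, not from the notes), the normalised criterion (3.48) can be computed directly from the class labels falling in each proposed child node; the entropy impurity is assumed here:

# 'groups' is a list of class-label vectors, one per child of a K-way split
delta_I_multiway <- function(groups) {
  ent <- function(y) { p <- table(y) / length(y); p <- p[p > 0]; -sum(p * log(p)) }
  Pk <- lengths(groups) / sum(lengths(groups))
  (ent(unlist(groups)) - sum(Pk * sapply(groups, ent))) / (-sum(Pk * log(Pk)))
}

delta_I_multiway(list(c(1, 1, 2), c(2, 2), c(1, 2)))  # a hypothetical 3-way split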
Missing data in training is generally ignored on a predictor-by-predictor basis.
However, when categorical data are missing, one can assign the missing values of this
predictor to an additional “missing” category and consider using that in the tree. If the data is not
missing at random, this may be useful. If a new (test) observation is missing the value
needed for a split point, one assigns it to the branch which contained more training
observations. Given the sparse nature of a typical tree, it is quite likely that an obser-
vation with missing data will not have that predictor tested at a node.
One can extend the idea of impurity to the whole tree, by a summation over the
set of leaf nodes. This allows one to consider penalising the complexity of the tree to
avoid making it too large and hence over-fitting the training data.
\[
I_{\text{penalised}} = \sum_{m \in \text{leaf nodes}} I(m) + \alpha \, |\text{leaf nodes}|, \tag{3.49}
\]
where $|\cdot|$ indicates cardinality. One then attempts to minimise $I_{\text{penalised}}$, with the simplest
method being to not add any more nodes during tree construction if they cannot reduce
Ipenalised. One is still left with the choice of α, and there is little informative guidance
on this. One could choose it using cross-validation, but one would then be tempted to
use this instead to look at classification error directly, and stop adding nodes once this
fails to improve. Note that such cross-validation must be considered part of model-
building, and so separate data would need to be held out for testing. A two layer
cross-validation scheme would be a suitable method for such model building along
with associated estimation of error rates.
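In practice one rarely codes this from scratch; the rpart package, for example, implements CART-style trees. A usage sketch (the data frame mydata and its class column are hypothetical):

library(rpart)
fit <- rpart(class ~ ., data = mydata, method = "class",
             parms = list(split = "information"),  # or split = "gini"
             control = rpart.control(cp = 0.01))   # complexity penalty, analogous to alpha
printcp(fit)                     # cross-validated error for each subtree size
pruned <- prune(fit, cp = 0.02)  # prune back at a chosen complexity value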
Example in one dimension:
Assume we have two classes and 5 observations, all of a single variable. In class 1
are the observations 1 and 6, and in class 2 are the observations 2, 5 and 11. Using the
cross-entropy form of node impurity, what tree will be built?
The first thing is to consider the impurity of the root node, which we will number
m = 1. Since we have 3 observations of class 2, and only 2 from class 1, if we had to
stop at this node, we would make it of class 2. The cross-entropy of this node is
\begin{align}
I_C(m = 1) &= -\sum_{g=1}^{G} \hat{p}_{1g} \log \hat{p}_{1g} \tag{3.50}\\
&= -(\hat{p}_{11} \log \hat{p}_{11} + \hat{p}_{12} \log \hat{p}_{12}) \tag{3.51}\\
&= -\left( \tfrac{2}{5} \log(\tfrac{2}{5}) + \tfrac{3}{5} \log(\tfrac{3}{5}) \right) \tag{3.52}\\
&\approx 0.673. \tag{3.53}
\end{align}
The change in node impurity in response to a split is
\[
\Delta I(m) = I(m) - \hat{P}_{Lm} I(m_L) - (1 - \hat{P}_{Lm}) I(m_R). \tag{3.54}
\]
One needs to calculate the value of ∆I(m) for all (n − 1) possible splits. Note that the
class of the sub-nodes (the two being constructed beneath the current node) is deter-
mined automatically by the more common class among training observations assigned
to that sub-node, so there is no need to search over these allocations.
The ordered observations are 1,2,5,6,11.
Consider the split: 1 | 2,5,6,11 (where | indicates the split point), i.e. the split point is
somewhere in the interval (1, 2). We will do what is most common in the literature and
place the split point at the midpoint of (1, 2), i.e. at 1.5. We first consider the proportion
of future observations we expect to go to the left branch of the split point. The only
information we have is that 1/5 of the training observations would be allocated to the
left branch, so $\hat{P}_{L1} = 0.2$.
Now we need to calculate the IC values for the left and right sub-nodes, which we
will number 2 and 3. Node 2 has just 1 observation of class 1, so it is a pure node and
hence a leaf node (no further splits will be considered). Calculation of its impurity
gives
\[
I_C(m = 2) = -(\hat{p}_{21} \log \hat{p}_{21} + \hat{p}_{22} \log \hat{p}_{22}) = -(1 \log 1 + 0 \log 0) = 0, \tag{3.55}
\]
as expected. Node 3 has 1 observation of class 1 and 3 of class 2. Its impurity is
\[
I_C(m = 3) = -(\hat{p}_{31} \log \hat{p}_{31} + \hat{p}_{32} \log \hat{p}_{32}) = -(0.25 \log 0.25 + 0.75 \log 0.75) \approx 0.562. \tag{3.56}
\]
Hence the change in node impurity for this split is
\[
\Delta I(1) = I(1) - \hat{P}_{L1} I(2) - (1 - \hat{P}_{L1}) I(3) \approx 0.673 - (0.2 \times 0) - (0.8 \times 0.562) \approx 0.223. \tag{3.57}
\]
The following splits are possible, with all cut-points placed at midpoints and $\Delta I$ computed as above:

Split                 Cut-point   $\hat{P}_{Lm}$   $I(m_L)$   $I(m_R)$   $\Delta I$
1 | 2, 5, 6, 11         1.5          0.2            0          0.562      0.223
1, 2 | 5, 6, 11         3.5          0.4            0.693      0.637      0.014
1, 2, 5 | 6, 11         5.5          0.6            0.637      0.693      0.014
1, 2, 5, 6 | 11         8.5          0.8            0.693      0          0.118

The first split, at 1.5, gives the largest reduction in impurity, so it is chosen. Repeating the procedure on the remaining impure node {2, 5, 6, 11} leads to a split at 5.5 (producing the pure class-2 leaf {2, 5}), and finally {6, 11} is split at 8.5, after which all leaf nodes are pure.
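These calculations are easily checked numerically; the following R sketch recomputes the cross-entropy reduction for every candidate split of this example:

x <- c(1, 6, 2, 5, 11)            # observations
y <- c(1, 1, 2, 2, 2)             # their classes
ent <- function(y) { p <- table(y) / length(y); p <- p[p > 0]; -sum(p * log(p)) }

xo <- sort(x)
cuts <- (head(xo, -1) + tail(xo, -1)) / 2   # midpoints 1.5, 3.5, 5.5, 8.5
dI <- sapply(cuts, function(s) {
  L <- y[x < s]; R <- y[x >= s]
  ent(y) - (length(L) / 5) * ent(L) - (length(R) / 5) * ent(R)
})
round(cbind(cut = cuts, deltaI = dI), 3)    # best first split is at 1.5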
Classification trees are generally considered very interpretable by non-statisticians.
They resemble a logical decision-making process that first looks at the most impor-
tant features (predictors), and makes a series of rule-based decisions leading to a final
decision in often only a few steps. However, this likely over-simplifies the problem,
and the rectangular decision boundaries so produced are unlikely to be a close match
to the true Bayes-error-style decision boundaries from the generating distributions. In
addition, the rules are far more fragile than one might expect. If one leaves some obser-
vations out of the data (as is done in cross-validation), rather different trees tend to be
produced. In fact, one should do this as a sensitivity check before encouraging anyone
to interpret a fitted tree definitively. Extensions to decision trees often
consider combinations of many trees, e.g. random forests. These tend to reduce the
(questionable) interpretability, but offer much smoother decision boundaries.
Chapter 4
High-dimensional analysis
4.1 Introduction
As of December 2015, the most highly cited paper with the words “high-dimensional” in
the title published up until 1995 was Andrews (1972), which offers examples with 6 and
8 dimensions. In his influential 1990 book, Fukunaga (1990) listed p = 32 or 64 as
examples of high-dimensionality. Until the mid-1990s, what most statisticians meant
by high-dimensional data was more than 5 dimensions, and 100 dimensions would
be considered extreme. However, machine learning groups had begun applying their
methods to datasets with even more variables, and new data-producing technology
was leading to demands that statistics also offer suitable methods.
In 1958, Dempster (1958) pointed out that neither Hotelling's $T^2$ test to compare two
groups nor Fisher's LDA can be used when $p > n_1 + n_2 - 2$. In both cases, the difficulty
occurs in fitting a multivariate normal distribution in this number of dimensions.
For a single dataset with p > n − 1, the sample covariance matrix is singular and
hence not invertible. This is based on the following. The n × p data matrix $X_{All}$ has
rank at most min(n, p). Since the p × p matrix $S = \frac{1}{n-1}A$, where $A = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$,
the rank of A and S is the same. Each observation contributes one rank to the matrix A
provided the observations are linearly independent. One rank is lost for the estimation
of the mean $\bar{x}$. Hence the rank of A and S is at most n − 1, and these matrices are only
invertible if p ≤ n − 1.
Dempster outlined a non-exact two-sample test that would work under these con-
ditions. Various improvements have been suggested since.
The standard solution for multiple linear regression is $\hat{\beta} = (X_{All}^T X_{All})^{-1} X_{All}^T y_{All}$,
where $y_{All}$ is a vector of continuous response variables. This also requires that p ≤ n:
otherwise the matrix inversion is not possible. It is possible to minimise the sum of
squares for the regression in other ways, but if p > n, the solution is not unique, and
hence largely uninterpretable.
As a result, standard experimental design demands that the number of observations
n be at least p, and usually quite a bit higher. Unfortunately, this is not possible in
all situations, partly for reasons of expense. The modern world has now produced
an enormous array of equipment and facilities which are capable of collecting very
large amounts of data. In many cases, the number of dimensions is very large, and
sometimes much larger than the sample size, i.e. $p \gg n$.
The best known example of a need for high-dimensional data analysis was pre-
sented by microarray technology in genomics. These were developed and commer-
cialised around 1995 and have been an important tool in understanding the genomes
of many species ever since. Humans have approximately 20,000 genes, and many other
complex species have similar numbers of genes. Oversimplifying somewhat, each gene
encodes, and can direct the production of, a particular type of protein. The
amount of each gene product (typically a protein) being produced is controlled by
other genetic architecture, but one can get an idea of the current rate of production by
measuring levels of messenger ribonucleic acid (mRNA), there being a unique form
of this for each gene. Microarrays are compact chemical and photonic experimental
apparatus which can measure the levels of a full set of mRNA levels from a tissue
sample, e.g. a blood sample.
Some medical and biological scientists were and are very interested to discover
relationships between these levels of gene expression and measurable aspects of the
organisms from which the tissue samples were taken - these are known as phenotypes.
Examples of phenotypes include: height, presence of a particular disease, a particular
health outcome, ability to tolerate certain kinds of environment, crop yield, milk yield
(in e.g. cattle) — an enormous array of possibilities.
The set of mRNA levels form the explanatory variables and any phenotype of in-
terest can be the response. This allows the full range of response variable types, with
the main types of interest being continuous and nominal. Since each microarray typi-
cally cost over $1000 to purchase and use, sample sizes tended to be fairly small, in the
range 10-1000. Technology has changed over time and new problems have emerged
with much larger numbers of explanatory variables, but the p > n dataset seems to be
here to stay.
In this chapter, we will consider methods to cope with large values for p, likely at
least 100, where p may be greater than n or smaller. The choice of methods for such
data depends greatly on the question which you wish to answer. Some of the main
possibilities include:
1. Which of the predictor variables is associated with the response variable?

(a) considering each variable separately

(b) considering the set of explanatory variables all at once

2. Can the set of explanatory variables be transformed into a space of smaller dimension without losing anything important in terms of the association with the response variable?

3. Can the set of explanatory variables be transformed into a space of smaller dimension when there is no response variable?

(a) when the aim is to capture structure in the data

(b) when the aim is clustering

4. How can we make the highest-quality predictions of the response variable, based on the explanatory variables?
One should be able to see that (1) is a special case of (2). We will discuss methods
for 3(a), but there will not be time to address 3(b). We will make some comments about
(4).
4.2 Principal Component Analysis
Principal component analysis (PCA) is largely a technique to transform the original
variables into a set of orthogonal variables which are ordered according to the size of
their sample variances. It aims to find a set of underlying uncorrelated latent variables
which produce the dependencies and the variation in the original variables.
The first principal component direction is a unit vector whose direction in the space
defined by the explanatory variables is that for which the sample variance is largest.
The first principal components of the data are the projections of the data onto the first
principal component direction, or equivalently, the result of dot products of the ob-
servations with this vector. The second principal component direction is a direction
orthogonal to the first, and within that constraint, also maximises the sample variance.
The third principal component direction vector must be orthogonal to the first two, and
also maximise the sample variance, and so on, until p principal component directions
and the corresponding principal components can be defined.
Note that the notation is somewhat confused in the literature, with many authors
referring to the principal component directions as the principal components, and not
bothering to name the transformed data. Duda et al. (2001) use this terminology. Mor-
rison (2005) and Anderson (2003) avoid naming the directions until they determine
their form. Here I have adopted the terminology used by Hastie et al. (2009). Some
authors also refer to the principal component directions as loadings.
In practice the point of the exercise is usually dimensionality reduction. This is
achieved by replacing the original dataset with a reduced dataset which contains only
values for the first few principal components. These are generally enough to capture
most of the variance in the data. Whether they capture most of the important informa-
tion is another question. The overall hope is that variables which show little variation
will be of little value, and those that vary the most will be the most important.
Mathematically, in constructing the first principal component, we are seeking a lin-
ear combination Y1 of the original variables X1, . . . , Xp which maximises the sample
variance (and hence, it is hoped, the true variance). Let the first principal component
be the following linear compound:
\[
Y_1 = a_{11}X_1 + \cdots + a_{p1}X_p = a_1^T X. \tag{4.1}
\]
Thus the sample variance of this principal component will be
\[
s^2_{Y_1} = \sum_{j=1}^{p} \sum_{k=1}^{p} a_{j1} a_{k1} s_{jk} = a_1^T S a_1, \tag{4.2}
\]
where S is the sample covariance matrix. The first principal component is chosen such
that $s^2_{Y_1}$ is maximised. Since one could easily scale such a solution up to infinity, we
also require that $a_1^T a_1 = 1$, i.e. that $a_1$ is of length 1 in the original variable space. The
direction $a_1$ is also that of the principal axis of the ellipsoid for a multivariate
normal distribution fitted to the data.
The second principal component,
\[
Y_2 = a_{12}X_1 + \cdots + a_{p2}X_p = a_2^T X, \tag{4.3}
\]
is also chosen to maximise the sample variance $a_2^T S a_2$, but is subject to the additional
constraint that it must be orthogonal to the first principal component. This is achieved
by setting $a_1^T a_2 = 0$. Later principal components are chosen similarly, with the condition
that each be orthogonal to the earlier components.
We can find the weight vector $a_1$ for the first principal component using the method
of Lagrange multipliers. We need to maximise $a_1^T S a_1$ subject to the constraint $a_1^T a_1 = 1$.
Hence, we let λ be an unknown real-valued Lagrange multiplier and consider the
Lagrange function
\[
u_1 = a_1^T S a_1 - \lambda(a_1^T a_1 - 1). \tag{4.4}
\]
It can be shown that there is a stationary point of the Lagrange function which
maximises $a_1^T S a_1$ subject to the constraint. We find stationary points of the
Lagrange function via differentiation. The vector of partial derivatives is
\[
\frac{\partial u_1}{\partial a_1} = 2Sa_1 - 2\lambda a_1. \tag{4.5}
\]
We set this to zero, and obtain
\[
Sa_1 = \lambda a_1. \tag{4.6}
\]
Hence the stationary points in terms of $a_1$ are the eigenvectors of S, with their corresponding
eigenvalues being values for λ. We wished to maximise $a_1^T S a_1$, and we can
see that
\[
a_1^T S a_1 = \lambda a_1^T a_1 = \lambda. \tag{4.7}
\]
Hence the first principal component direction should be chosen to be the eigenvector of S whose
eigenvalue λ is largest. The second and subsequent principal components can be
found similarly by subtracting the first principal component values from the data, and
then searching for a direction which maximises the sample variance. It can be shown
that the directions found will be orthogonal to those previously found, and that the
jth principal component direction is that of the eigenvector corresponding to the jth
largest eigenvalue of the original sample covariance matrix S.
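A short R sketch (for illustration, with simulated data) confirming that the principal component directions are the eigenvectors of S, and that the built-in prcomp gives the same answer:

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)

S  <- cov(X)
eg <- eigen(S)                 # eigenvalues returned in decreasing order
A  <- eg$vectors               # columns: principal component directions
Y  <- scale(X, center = TRUE, scale = FALSE) %*% A   # the principal components

pc <- prcomp(X)                # agrees, up to the sign of each direction
max(abs(eg$values - pc$sdev^2))        # component variances = eigenvalues
eg$values / sum(eg$values)             # proportion of variance explained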
Having produced the principal component directions and the principal compo-
nents (i.e. having transformed the original data), one can then calculate the percentage
of the variance in the data captured by each component. It is common to keep only the
principal components which explain e.g. 90%, 95% or 99% of the variance, which might
not need very many. However, it does need to be remembered that this method does
not consider a response variable, and we do not know how much ability to explain a
response will be lost in dropping some principal components.
There have been many extensions of the technique of principal components, includ-
ing sparse methods for high-dimensional data.
Standard PCA can be used with high-dimensional data where p > n, with some
caveats. Since S is symmetric (and hence Hermitian), all of its eigenvalues are real (i.e.
not complex). As we have seen, the rank of S is at most n − 1, which implies that the
number of non-zero eigenvalues of S is also at most n − 1. This limits the number of
principal components to min(n − 1, p), and hence to n − 1 when p > n. If we
ignore the orthogonality aspect, this should not be surprising. If the data consists of n
vectors in a p-dimensional space, then we could transform the data to a space whose
basis is the n vectors, i.e. an n-dimensional space. The orthogonality requirement
changes this somewhat, but makes no difference to the dimensionality. PCA loses a
degree of freedom due to the calculation of the sample mean, needed to produce S.
The n − 1 dimensional PCA basis appears to be an extreme case of over-fitting the
data, but the effect on subsequent analysis is unclear. However, the main purpose of
using PCA is to reduce the dimensionality, and it is quite likely that we would like
to reduce it well below min(n − 1, p), reducing the tendency to over-fit in the process.
Other methods may require such a reduction before they can be applied to the resulting data. For
example, in order to fit the covariance matrices, QDA requires that the dimension of
the data is less than the sample size of the smallest class. LDA has no such
restriction. While PCA does not take into account any relationship with the response
variable, it completely removes collinearity and can be used to reduce the dimensional-
ity to something more manageable. By not considering the response, it avoids a variety
of potential biases in the assessment of predictive accuracy. As a result, it is a useful
tool when the purpose is prediction.
4.3 Single-variable analysis
In many cases, the belief of the investigators is that just some of the explanatory vari-
ables are related to the response, and the aim is to find as many of these as possible.
For example, if we have two groups of subjects, one with a certain disease, and the
other healthy, we may wish to find out which of the explanatory variables have a dif-
ferent distribution or mean in each group. Alternatively, we might have a continuous
response variable, and wish to know which of the explanatory variables seem most
strongly related to it. The variables that are found to be related are then candidates for
further studies, with a typical eventual aim of intervening to change the response from
a worse to a better state.
The simplest approach is then to work through the explanatory variables one by
one, and quantify their relationship to the response variable. The most common method
for doing so is via a p-value. When the response is nominal and the explanatory vari-
able is continuous, one is likely to use ANOVA if its conditions are met. Once the com-
monly reasonable assumption of independence of observations is checked, we primar-
ily need approximate normality of data in each group, along with similar variances.
Sometimes further transformation of the data can achieve this, and when it cannot, the
data may meet the less stringent conditions of alternatives such as the Kruskal-Wallis
test. When there are just two groups, one can use Welch's t-test, which allows the groups
to have unequal variances. When the response is nominal and the explanatory variable
is categorical, one might use a chi-square or Fisher exact test.
For a continuous response variable, if the explanatory variable is categorical, one
might again use ANOVA or alternatives. When both response and explanatory vari-
ables are continuous, one would typically use linear regression. Some transformations
might be needed to do so.
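A sketch of this screening loop in R, with simulated data standing in for a real n × p expression matrix: Welch's t-test is applied to each column and the p-values are collected.

set.seed(1)
n <- 40; p <- 1000
X   <- matrix(rnorm(n * p), nrow = n)        # no variable is truly related
grp <- factor(rep(1:2, each = n / 2))

pvals <- apply(X, 2, function(x) t.test(x ~ grp)$p.value)
sum(pvals < 0.05)   # roughly 5% of these null variables fall below 0.05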
The net result is that we have a p-value for every one of the explanatory variables.
The difficulty is to then decide what threshold to use to determine which explanatory
variables will be considered significant or of interest. If we set α = 0.05 and report
every variable with a p-value < α, then we have to be aware that we will collect about 5% of
the irrelevant variables, since their p-values are uniformly distributed on [0, 1]. If we
have p = 20000, and most of these variables are unrelated to the response, then this
will produce around 1000 irrelevant variables, and possibly some of the relevant variables.
In courses covering experimental design, we teach how to control the family-wise
or experiment-wise error rate (FWER) to be at most a certain value, e.g. 0.05. Potential
solutions such as the Bonferroni method exist, but are likely to be far too stringent
for situations where p > n. For example, if n = 100 and p = 20,000, the Bonferroni
method suggests that if we wish to have a family-wise error rate of 0.05, we should
use an individual threshold of $0.05/20000 = 2.5 \times 10^{-6}$. If using a two-sample t-test
with equal variances, this requires a test statistic whose magnitude is at least 5. Given
a sample size of 100 with 50 in each group, this requires a difference between the group
means at least as large as the standard deviation within a group. Such differences can
be seen by eye on exploratory plots and barely need statistical analysis.
It is often difficult to know how the set of variables found will be used, but some-
times it is clear. A laboratory may want to investigate up to 30 different genes, and would like
a list of those which are most promising. In one sense the task is then easy: we can
always sort the variables according to their p-values, and our best candidates would
be the 30 with the smallest p-values. Note that for a particular type of test, if one does not have
sufficiently accurate p-values, one can simply sort by the absolute value of the test statistic,
assuming that we are always performing two-sided tests. The problem is that we don't
then know if we are wasting people's time. Are any of the 30 relevant? Where should
we draw the line in a principled fashion?
The currently most popular solution is to attempt to control what is known as the
false discovery rate (FDR) (Benjamini and Hochberg, 1995). Since we will now be dis-
cussing p-values, we will reluctantly join Benjamini and Hochberg (and Hastie, Tibshi-
rani and Friedman) in changing our notation for the number of variables to m, repre-
senting the number of tests to be performed, and the number of resulting p-values.
                 Declared non-significant   Declared significant   Total
H0 true                     U                          V              M0
H0 false                    T                          S              M1
Total                     m − R                        R               m

Table 4.1: Numbers of errors when testing m null hypotheses.
Table 4.1 shows the breakdown of all the hypothesis test results once a threshold
criterion is chosen. The power of the procedure is the probability of declaring an alter-
native hypothesis significant when it really is, and so can also be defined as
\[
\text{Power} = E\left(\frac{S}{M_1}\right). \tag{4.8}
\]
The family-wise error rate can be defined as $\mathrm{FWER} = P(V \ge 1)$.
The proportion of false discoveries is the proportion of alternative hypotheses de-
clared significant which are false. This is labelled Q = V/R, provided R > 0. We define
Q = 0 when V = R = 0.
The False Discovery Rate (FDR) is the expected value of Q:
\[
\text{FDR} = E(Q) = E\left(\frac{V}{R} \,\middle|\, R > 0\right) P(R > 0), \tag{4.9}
\]
where V and R are as defined in the table.
The possibility of R = 0 is the reason why the above is not just E(V/R). However,
P (R = 0) is typically close to zero, so there is not much difference. See (Sun and Cai,
2009) for further details.
R is an observable random variable, while S, T, U and V are unobservable random
variables, as are the derived quantities $M_0$ and $M_1$. Because the rows and columns
must sum to known or observed totals, there are effectively only two free unknown
random variables here, e.g. S and T.
When all the null hypotheses are true, S = 0 and so V = R. If V = 0, then Q = 0. If
V ≥ 1, then Q = 1. As a result, P(V ≥ 1) = E(Q). Hence for this case, the FDR is the
same as the FWER.
When $m_0 < m$, so that $m_1 > 0$ (some alternative hypotheses are true), the
FDR ≤ FWER. This can be shown as follows. We make the following claim:
\[
I_{V \ge 1} \ge Q. \tag{4.10}
\]
There are only two cases to consider. If V = 0, then $I_{V \ge 1} = 0$ and Q = 0, so $I_{V \ge 1} \ge Q$.
If V ≥ 1, then $I_{V \ge 1} = 1$ and $Q = V/R \le 1$, so $I_{V \ge 1} \ge Q$. So the claim holds.
Taking expectations of both sides:
\begin{align}
P(V \ge 1) &\ge E(Q), \tag{4.11}\\
\text{i.e.}\qquad \mathrm{FWER} &\ge \mathrm{FDR}. \tag{4.12}
\end{align}
As a result, any procedure which controls (places an upper bound on) the FWER also
controls the FDR. However, if one controls just the FDR, the FWER can be considerably
higher. This may be of no concern, and the benefit can be a large increase in the power
of the overall procedure.
Benjamini and Hochberg offered a method to control the FDR, which is described
below. Assume we are testing m null hypotheses, which we shall name H1, H2, . . . , Hm
and that based on the collected dataset, these yield p-values P1, . . . , Pm. Note that
upper-case letters are used here to indicate that in general, the p-values are random
variables before the data is collected. We now order the p-values from smallest to
largest and relabel them as follows: P(1) ≤ P(2) ≤ · · · ≤ P(m). So P(i) is the ith smallest
p-value, and is the result of testing the null hypothesis H(i). We define the following
multiple testing procedure:
Let k be the largest value of i for which $P_{(i)} \le \frac{i}{m} q^*$. Then reject all $H_{(i)}$, $i = 1, \ldots, k$.
Theorem: for independent test statistics and any configuration of false null hypotheses,
this procedure controls the FDR to be at most $q^*$.
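The step-up rule can be written directly in R, or applied via the built-in p.adjust; the following is a sketch with an arbitrary set of p-values:

bh_reject <- function(pvals, q = 0.05) {
  m <- length(pvals)
  o <- order(pvals)
  # largest i with P(i) <= (i/m) q*
  k <- max(c(0, which(pvals[o] <= (1:m) / m * q)))
  rejected <- rep(FALSE, m)
  if (k > 0) rejected[o[1:k]] <- TRUE
  rejected
}

pvals <- c(0.001, 0.009, 0.04, 0.041, 0.2, 0.7)
bh_reject(pvals, q = 0.05)                 # rejects the two smallest
p.adjust(pvals, method = "BH") <= 0.05     # equivalent, via adjusted p-values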
4.4 Variable selection and the lasso
We now turn to the problem of explanatory variable selection, when there is a response
variable. Usually it is assumed that only some of the explanatory variables are related
to the response variable, and the task of variable selection is to find these. So far, this
is no different to the task of single variable analysis. The difference is that here we
also plan to use the set of variables found for prediction. Further, we believe that by
selecting the related variables and leaving out the unrelated variables, we will improve
the accuracy of predictions. Further, we will have an interpretable and parsimonious
(very frugal) model. This issue of variable selection is actually a special case of model
selection. The way each variable relates to the response may also be considered un-
known, with a variety of models considered, but one needs a lot of data to obtain clear
answers to such questions.
Here we will consider just two types of models: (multiple) linear regression to pre-
dict a continuous response variable and logistic regression to predict a binary response
variable.
As we saw with neural networks, one way of limiting or avoiding over-fitting is to
penalise the magnitude of the parameters — these are called shrinkage methods. As we
saw with the BIC, another way is to penalise the number of parameters. Another way
to choose a subset of explanatory variables for a model and hence try to avoid over-
fitting begins with choosing a method of assessing predictive accuracy, e.g. the CV-
based PRESS for a continuous response variable, and classic CV for a nominal response
variable (classification). We then search over all possible combinations of the available
variables somehow. When there are p explanatory variables, the number of possible
subsets is $2^p$.
It is not possible to fully explore this space when p is large. A number of approaches
have been tried including forward selection (starting from the simplest model - a con-
stant only, and building up in greedy fashion) and backward selection (starting from
the full model - with all explanatory variables, and removing variables in a greedy
fashion), although these have often been done using “apparent” assessments of predic-
tive accuracy, such as $R^2$. Other authors have tried meta-heuristic algorithms to search
this space, and preliminary conclusions suggest that while they can effectively min-
imise a predictive accuracy cost function, the level of noise is often such that there is
insufficient signal to encourage the removal of enough explanatory variables (Nguyen
and Wood, 2012).
The method of subset selection aimed to fix the number of selected variables at a
certain value k < p and search over this smaller domain. This can be viewed as a hard-
constraint version of BIC. It can also be viewed as a method which uses an $L_0$ norm
penalty. The $L_0$ norm is
\[
\|\beta\|_{L_0} = |\{j : \beta_j \neq 0\}|. \tag{4.13}
\]
By far the most popular current method of variable selection is the lasso, the “least
absolute shrinkage and selection operator” (Tibshirani, 1994). This method aims to
shrink some parameter values and set others to zero, hence achieving both shrinkage
and variable selection.
For standard linear regression, we use the model $Y = \beta_0 + X\beta + \epsilon$, where Y is the
response variable, X is a p-length tuple of explanatory variables, $\beta = (\beta_1, \ldots, \beta_p)^T$ is
a vector of regression slope parameters, $\beta_0$ is the constant regression parameter and
$\epsilon \sim N(0, \sigma^2)$ is the normally-distributed error term, which has variance parameter $\sigma^2$.
We are not interested in shrinking or removing $\beta_0$ or $\sigma^2$; they must be present in all the
models of interest.
Given this model, it can be shown that the maximum likelihood solution is the same
as the least squares solution. I.e. the standard solution is to choose
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\}. \tag{4.14}
\]
It is well known that this has a closed form solution, which is not that difficult to derive,
at least for p < n. That is,
\[
\hat{\beta} = (X_{all}^T X_{all})^{-1} X_{all}^T Y,
\]
where $X_{all}$ is the n × (p + 1) matrix of explanatory variable values, and Y is the vector of
response values. Here $X_{all}$ has been augmented with a vector of 1's (the first column)
to allow the constant $\beta_0$ to be included in the parameter vector.
A long-standing shrinkage method is known as ridge regression, weight decay or
Tikhonov regularization. One typically normalises the predictor variables before start-
ing, i.e. take away their sample mean and divide by their sample standard deviation.
We then include as a penalty the sum of the squared regression weights, also known
as an $L_2$ penalty, as follows:
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}. \tag{4.15}
\]
Here λ is a tuning parameter that controls the amount of shrinkage. It can be chosen
manually, by trial and error, or by optimizing over some measure of predictive accu-
racy, e.g. via cross-validation. Note that we would need something such as two layers
of cross-validation to both select such a parameter and estimate the accuracy of the
resulting model.
Ridge regression was introduced by Hoerl and Kennard (1970) primarily to improve
regression for non-orthogonal predictor variables, i.e. those with substantial
multicollinearity. The concern in this case is that the slope parameter estimates will
be unstable, with large confidence intervals and high sensitivity to individual observations.
This can also be seen via the near-singularity of the matrix $X_{all}^T X_{all}$. The
condition number of the matrix to be inverted is decreased via ridge regression, with the
ridge estimator being given by
\[
\hat{\beta}_R = (X_{all}^T X_{all} + \lambda I)^{-1} X_{all}^T Y.
\]
If p > n, then $X_{all}^T X_{all}$ will not be invertible due to being rank-deficient. However,
it is positive semi-definite, so its eigenvalues are all non-negative, with at least one
equal to zero. The addition of λI increases all the eigenvalues by λ (> 0), making this matrix
positive definite and invertible, hence allowing a unique solution for $\hat{\beta}_R$.
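A brief demonstration in R (with simulated data): the ordinary normal equations fail for p > n, but the ridge system is solvable. Following the formula above, the intercept column is included in the penalised matrix here, though in practice the intercept is usually left unpenalised.

set.seed(1)
n <- 20; p <- 50
X <- cbind(1, matrix(rnorm(n * p), nrow = n))   # first column of 1's for beta0
y <- rnorm(n)
lambda <- 0.1

# solve(t(X) %*% X, t(X) %*% y) would fail: t(X) %*% X is rank-deficient
beta_ridge <- solve(t(X) %*% X + lambda * diag(p + 1), t(X) %*% y)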
Tibshirani had what at first seems a fairly trivial idea: to replace the $L_2$ penalty with
an $L_1$ penalty:
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \tag{4.16}
\]
This is the Lagrangian equivalent of
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\}, \tag{4.17}
\]
subject to
\[
\sum_{j=1}^{p} |\beta_j| \le t, \tag{4.18}
\]
where t is a threshold parameter, which has a one-to-one relationship with λ. With
normalized data, the solution for $\hat{\beta}_0$ is $\bar{y}$, and having chosen this, we then only need to
choose $\beta_1, \ldots, \beta_p$. However, Tibshirani may have been the first person to realise the
potential of this type of penalty in variable selection and its computational feasibility.
What he may not have appreciated at the time was that he had invented it just in time
for a new wave of data and analysis in which variable selection would feature promi-
nently. Most of the examples in the original lasso paper have p < 10, but Tibshirani
tried one simulation with p = 40 and n = 100.
We will not usually be aware of an optimal value for λ or t. Tibshirani offered
three methods for selecting this value. The first was simply the minimization of the
prediction error, as assessed via five-fold cross-validation. The second was using a
form of generalized cross-validation, and the third was based on Stein's unbiased
estimate of risk.
Tibshirani noted that Frank and Friedman (1993) discussed generalizing ridge regression
and subset selection by adding a penalty term of the form $\lambda \sum_{j=1}^{p} |\beta_j|^q$, for $q \ge 0$.
However, they did not investigate other values of q.
However, they did not investigate other values of q. Tibshirani claimed that his choice
of q = 1 was motivated by two factors: (i) that this was closer to q = 0, as used in
subset selection, which seems natural when doing variable selection (ii) that this was
the smallest value of q for which the optimization problem would be convex.
To be able to choose λ, it would be best to be able to determine the set of parameters
which result from optimising (4.16) for every possible value of λ. It can be shown that for
a particular value of λ, the optimal values can be found via quadratic programming.
Further, the problem of finding the complete solution path is also convex and according
to Hastie et al. (2009) (p76), requires “the same order of computation as that of a single
least squares fit using the p predictors”. We will not cover the details in this course, but
a range of R packages are available which implement this and related methods.
The lasso can also be used with any generalised linear model, i.e. one for which
$g(\mu(x)) = x^T \beta$, where $\mu(x) = E(Y|X = x)$, $g(\cdot)$ is a monotone link function and Y is
distributed according to a member of the exponential family. In particular, it can be
applied to logistic regression.
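One such package is glmnet, which computes the entire lasso solution path and can choose λ by cross-validation; a sketch with simulated data, including the logistic case:

library(glmnet)
set.seed(1)
n <- 100; p <- 40
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(2, 5), rep(0, p - 5))       # only 5 truly active variables
y <- drop(X %*% beta) + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)       # alpha = 1 gives the lasso penalty
coef(cvfit, s = "lambda.min")             # sparse coefficients at the best lambda

# the same machinery applies to logistic regression
yb <- rbinom(n, 1, plogis(drop(X %*% beta) / 4))
cvfit_b <- cv.glmnet(X, yb, family = "binomial")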
.1 Appendix A
ANOVA identity:
The total corrected sum of squares may be written as
\begin{align*}
\sum_{i=1}^{a} \sum_{j=1}^{r} (y_{ij} - \bar{y}_{\cdot\cdot})^2
&= \sum_{i=1}^{a} \sum_{j=1}^{r} \big( (y_{ij} - \bar{y}_{i\cdot}) + (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) \big)^2 \\
&= \sum_{i=1}^{a} \sum_{j=1}^{r} (y_{ij} - \bar{y}_{i\cdot})^2 + 2 \sum_{i=1}^{a} \sum_{j=1}^{r} (y_{ij} - \bar{y}_{i\cdot})(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) + r \sum_{i=1}^{a} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2.
\end{align*}
However, the cross-product term in this last equation is zero, because
\[
\sum_{j=1}^{r} (y_{ij} - \bar{y}_{i\cdot}) = y_{i\cdot} - r\bar{y}_{i\cdot} = 0,
\]
where $y_{i\cdot}$ denotes the ith group total and $\bar{y}_{i\cdot}$ the ith group mean. Therefore we have
\[
\sum_{i=1}^{a} \sum_{j=1}^{r} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = r \sum_{i=1}^{a} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^{a} \sum_{j=1}^{r} (y_{ij} - \bar{y}_{i\cdot})^2. \tag{19}
\]
.2 Appendix B: Code for knn example
library(MASS)   # provides eqscplot, which uses equal scales on both axes

scale <- 10     # factor by which x1 is multiplied in the scaled example
offset <- 3.5   # horizontal offset between the two classes

# Two classes of 10 points each, offset in the x1 direction
x <- seq(1, 10)
c1 <- cbind(x, x)
c2 <- cbind(x + offset, x)
call <- rbind(c1, c2)
eqscplot(call, pch = '', xlab = "x1", ylab = "x2")
points(c1)
points(c2, pch = 'x')

# The same two classes with the x1 values scaled up by a factor of 10
cs1 <- cbind(x * scale, x)
cs2 <- cbind((x + offset) * scale, x)
calls <- rbind(cs1, cs2)
eqscplot(calls, pch = '', xlab = "x1", ylab = "x2")
points(cs1)
points(cs2, pch = 'x')
##
### New version below including predicted classes
##
## scaled
library(class)  # provides knn

## Classify every point of a grid over the scaled data using 1-NN,
## then display the predicted class regions with image()
minx <- min(cs1[, 1], cs2[, 1])
maxx <- max(cs1[, 1], cs2[, 1])
miny <- min(cs1[, 2], cs2[, 2]) - 10
maxy <- max(cs1[, 2], cs2[, 2]) + 10
xs <- seq(minx, maxx)
ys <- seq(miny, maxy)
lenxs <- length(xs)
lenys <- length(ys)
classes <- factor(c(rep(1, 10), rep(0, 10)))  # training class labels
z <- matrix(0, nrow = lenxs, ncol = lenys)
for (i in 1:lenxs) {
  for (j in 1:lenys) {
    z[i, j] <- knn(calls, c(xs[i], ys[j]), cl = classes, k = 1)
  }
}
image(xs, ys, z, asp = 1)
points(cs1)
points(cs2, pch = 'x')
## unscaled: repeat the classification over a finer grid on the original data
minu <- min(c1[, 1], c2[, 1])
maxu <- max(c1[, 1], c2[, 1])
minv <- min(c1[, 2], c2[, 2])
maxv <- max(c1[, 2], c2[, 2])
us <- seq(minu, maxu, 0.02)
vs <- seq(minv, maxv, 0.02)
lenus <- length(us)
lenvs <- length(vs)
classes <- factor(c(rep(1, 10), rep(0, 10)))
w <- matrix(0, nrow = lenus, ncol = lenvs)
for (i in 1:lenus) {
  for (j in 1:lenvs) {
    w[i, j] <- knn(call, c(us[i], vs[j]), cl = classes, k = 1)
  }
}
image(us, vs, w, asp = 1)
points(c1)
points(c2, pch = 'x')
Bibliography
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19:716–723.
Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. Wiley, 3rd edi-
tion.
Andrews, D. F. (1972). Plots of high-dimensional data. Biometrics, pages 125–136.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society.
Series B, pages 289–300.
Berger, J. (1975). Minimax estimation of location vectors for a wide class of densities.
Annals of Statistics, 3:1318–1328.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression
Trees. Wadsworth, Belmont, CA.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-
validation and the repeated learning-testing methods. Biometrika, 76:503–514.
Chacon, J. and Duong, T. (2011). Unconstrained pilot selectors for smoothed cross-
validation. Australian & New Zealand Journal of Statistics, 53(3):331–351.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38.
Dempster, A. P. (1958). A high dimensional two sample significance test. The Annals of
Mathematical Statistics, pages 995–1010.
Devroye, L. (1978). A universal k-nearest neighbor procedure in discrimination. In
Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and
Image Processing, pages 142–147, Long Beach, CA. IEEE Press.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recogni-
tion. Springer-Verlag, New York.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. Wiley, New York.
Duong, T. (2007). ks: Kernel density estimation and kernel discriminant analysis for
multivariate data in R. Journal of Statistical Software, 21(7):1–16.
Efron, B. (1981). Nonparametric estimates of standard error: The jackknife, the boot-
strap and other methods. Biometrika, 68:589–599.
Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78(382):316–331.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7(28):179–188.
Fraley, C. and Raftery, A. (2002). Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association, 97:611–631.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regres-
sion tools. Technometrics, 35(2):109–135.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San
Diego.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the
American Statistical Association, 70:320–328.
Graybill, F. (2001). Matrices with Applications in Statistics. Duxbury, Belmont, CA, USA,
2nd edition.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal
of the Royal Statistical Society B, 58:155–176.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
Springer, New York, 2nd edition.
Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthog-
onal problems. Technometrics, 12(1):55–67.
Ito, K. and Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis
of variance when variance-covariance matrices are not equal. Biometrika, 51(1-2):71–82.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice-Hall, Engle-
wood Cliffs, NJ, USA.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Associa-
tion, 90:773–795.
Krishnamoorthy, K. and Yu, J. (2004). Modified nel and van der merwe test for the
multivariate behrens–fisher problem. Statistics & Probability Letters, 66(2):161–169.
Lachenbruch, P. and Mickey, M. (1968). Estimation of error rates in discriminant anal-
ysis. Technometrics, 10:1–11.
McLachlan, G. and Gordon, R. (1989). Mixture models for partially unclassified data:
a case study of renal venous renin levels in essential hypertension. Statistics in
Medicine, 8:1291–1300.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
McLachlan, G. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley,
New York.
Mood, A., Graybill, F., and Boes, D. (1974). Introduction to the Theory of Statistics.
McGraw-Hill, 3rd edition.
Morrison, D. F. (2005). Multivariate Statistical Methods. Duxbury, Belmont, CA, USA,
4th edition.
Nguyen, H. and Wood, I. (2012). Variable selection in statistical models using
population-based incremental learning with applications to genome-wide associa-
tion studies. In Proceedings of the IEEE Congress on Evolutionary Computation, pages
1841–1848.
Olshen, R. (1977). Comments on a paper by C.J. Stone. Annals of Statistics, 5:632–633.
Quenouille, M. (1949). Problems in plane sampling. The Annals of Mathematical Statis-
tics, 20:355–375.
Quenouille, M. (1956). Notes on bias in estimation. Biometrika, 43:353–360.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo,
CA.
Roeder, K. and Wasserman, L. (1997). Practical density estimation using mixtures of
normals. Journal of the American Statistical Association, 89:487–495.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464.
Smyth, P. (2000). Model selection for probabilistic clustering using cross-validated like-
lihood. Statistics and Computing, 9:63–72.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, pages 197–206.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions.
Journal of the Royal Statistical Society, B36:111–133.
Sun, W. and Cai, T. (2009). Large-scale multiple testing under dependence. Journal of
the Royal Statistical Society: Series B, 71(2):393–424.
Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Series B, 58:267–288.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in
a data set via the gap statistic. Journal of the Royal Statistical Society B, 63:411–423.
Tukey, J. (1958). Bias and confidence in not quite large samples. The Annals of Mathe-
matical Statistics, 29:614.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall, London.
Wishart, J. (1928). The generalised product moment distribution in samples from a
normal multivariate population. Biometrika, 20:32–52.
Wu, C. (1983). On the convergence properties of the EM algorithm. Annals of Statistics,
11:95–103.
Zehna, P. (1966). Invariance of maximum likelihood estimation. Annals of Mathematical
Statistics, 37:744.
Zhang, F. (2005). The Schur Complement and its Applications. Springer, New York.