STAT3006: Statistical Learning
Lecture Notes

Ian A. Wood
School of Mathematics and Physics
University of Queensland, St. Lucia 4072, Australia

October 25, 2020

Contents

1 Multivariate Statistics
  1.1 Introduction
  1.2 Marginal and Conditional Distributions
  1.3 Moments
  1.4 Multivariate Normal Distribution
  1.5 Marginal and Conditional Densities of the Multivariate Normal
  1.6 Samples from the multivariate normal
    1.6.1 James–Stein estimators
    1.6.2 Wishart Distribution
  1.7 Hotelling's T² test
    1.7.1 Confidence region for mean vector
    1.7.2 Two sample test
2 Clustering
  2.1 K-means clustering
  2.2 Mixture Models
  2.3 Fitting a normal mixture model
  2.4 Estimating uncertainty about the parameter values
  2.5 Choosing the number of components
    2.5.1 Cross-validation of the likelihood
  2.6 Decompositions
    2.6.1 Spectral Decomposition
  2.7 Mahalanobis Distances and Transformation
3 Discriminant Analysis
  3.1 Introduction
  3.2 Statistical Classifiers
  3.3 Optimal Classifiers
  3.4 Linear Discriminant Analysis
  3.5 Quadratic Discriminant Analysis
  3.6 Fisher's Linear Discriminant Analysis
  3.7 Estimation of Error Rates
  3.8 Mixture Discriminant Analysis
  3.9 Kernel Density Discriminant Analysis
  3.10 Nearest Neighbour Classifiers
  3.11 Classification Trees
4 High-dimensional analysis
  4.1 Introduction
  4.2 Principal Component Analysis
  4.3 Single-variable analysis
  4.4 Variable selection and the lasso
Appendix A
Appendix B: Code for knn example

Chapter 1
Multivariate Statistics

Note: this chapter draws heavily on Morrison (2005).

1.1 Introduction

Assume we will observe data as a result of a random experiment, i.e. we will observe n random variables {X_i, i = 1, . . . , n}, with each X_i being p-dimensional. The values we record for these random variables are written {x_i, i = 1, . . . , n}. Each element can be continuous or discrete, but we will assume they are all continuous in most cases. Note that, when necessary, we will use the index i = 1, . . .
, n to index observations in the sample, and the index j = 1, . . . , p to index the elements of an observation. We typically assume that all n random variables are drawn independently from the same distribution, i.e. that of X. The distribution of X can be described by its joint distribution function

F(x) = P(X_1 ≤ x_1, . . . , X_p ≤ x_p).

When F is absolutely continuous, one can write

F(x) = ∫_{-∞}^{x_p} . . . ∫_{-∞}^{x_1} f(u_1, . . . , u_p) du_1 . . . du_p,

where f(x) is the joint density function of the elements of X. If the elements of X happen to be independent, then

f(x) = ∏_{j=1}^{p} f(x_j)   and   F(x) = ∏_{j=1}^{p} F(x_j).

Conversely, if such a factorisation is possible, it implies independence of the variables. However, we will usually assume that there is some dependency among the p elements of X.

It can be the case that while f(x) is straightforward to write out analytically, F(x) is not. This is true for the multivariate normal distribution in p dimensions. The distribution function is analytically intractable, but the density can be written as

f(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp[ -(1/2)(x - µ)^T Σ^{-1} (x - µ) ],   x ∈ R^p,

where µ is the p-dimensional mean vector and Σ is the p × p covariance matrix.

1.2 Marginal and Conditional Distributions

If we wish to determine the (marginal) joint density of a subset of the random variables, e.g. the first q < p variables, this can be obtained by integrating out the variables which are not in the subset. Hence the joint density is

g(x_1, . . . , x_q) = ∫_{-∞}^{∞} . . . ∫_{-∞}^{∞} f(x_1, . . . , x_p) dx_{q+1} . . . dx_p. (1.1)

Correspondingly, the joint distribution function is given by

G(x_1, . . . , x_q) = P(X_1 ≤ x_1, . . . , X_q ≤ x_q) = F(x_1, . . . , x_q, ∞, . . . , ∞). (1.2)

Sometimes we are instead interested in a conditional distribution, that is, the joint distribution of a subset of the random variables when the remainder are held fixed at certain values. E.g. assume that we want to know the joint density of X_1, . . .
, X_q, q < p, with the remaining variables fixed at values X_{q+1} = x_{q+1}, . . . , X_p = x_p. The definition of conditional probability states that, when P(B) > 0, P(A|B) = P(A ∩ B)/P(B). When extended to densities and considered for this conditional density, we let A = {X_1 = x_1, . . . , X_q = x_q} and B = {X_{q+1} = x_{q+1}, . . . , X_p = x_p}. So

h(x_1, . . . , x_q | x_{q+1}, . . . , x_p) = f(x_1, . . . , x_q, x_{q+1}, . . . , x_p) / [ ∫_{-∞}^{∞} . . . ∫_{-∞}^{∞} f(x_1, . . . , x_p) dx_1 . . . dx_q ]
                                         = f(x_1, . . . , x_p) / g(x_{q+1}, . . . , x_p). (1.3)

Note that g(x_{q+1}, . . . , x_p) is used here to mean the relevant marginal density evaluated at these x values. It is not the same marginal density as in 1.1.

1.3 Moments

The expected value of a p-dimensional random vector X is just the vector of expectations of its elements, i.e.

E(X) = [E(X_1), . . . , E(X_p)]^T. (1.4)

The covariance of two elements, j and k, of the random vector is

σ_jk = cov(X_j, X_k) = E[(X_j - E(X_j))(X_k - E(X_k))] = E(X_j X_k) - E(X_j)E(X_k) (1.5)
     = ∫_{-∞}^{∞} ∫_{-∞}^{∞} f_jk(x_j, x_k) x_j x_k dx_j dx_k - ∫_{-∞}^{∞} f_j(x_j) x_j dx_j ∫_{-∞}^{∞} f_k(x_k) x_k dx_k, (1.6)

where f_jk(x_j, x_k) is the (marginal) joint density of X_j and X_k, and f_j is the (marginal) density of X_j. The variance of the jth element of the random vector X can be written as σ_jj or σ_j², where σ_j is the standard deviation of the jth element of the random vector. The covariance matrix of X is defined to be

var(X) = Σ = cov(X, X^T) = E[(X - E(X))(X - E(X))^T] = [ σ_11  . . .  σ_1p
                                                         . . .  . . .  . . .
                                                         σ_1p  . . .  σ_pp ]. (1.7)

The matrix Σ must be real, symmetric and positive semi-definite. The latter is required to ensure that each covariance term satisfies |σ_jk| ≤ sqrt(σ_jj σ_kk). For Σ to be positive semi-definite, it must be the case that z^T Σ z ≥ 0 for all z ≠ 0, z ∈ R^p. Equivalently, if Σ is a symmetric real matrix, it will be positive semi-definite if all its eigenvalues are greater than or equal to zero.

From 1.5, we can show that translations of either or both variables do not affect their covariance:

cov(X_j + a, X_k + b) = cov(X_j, X_k).
(1.8)

We can also show that changes in the scale of either variable are passed on to the covariance:

cov(cX_j, dX_k) = cd cov(X_j, X_k). (1.9)

Hence, if we define a p-length vector a = (a_1, . . . , a_p)^T, the variance of a^T X is

var(a^T X) = ∑_{j=1}^{p} ∑_{k=1}^{p} a_j a_k σ_jk = a^T Σ a. (1.10)

It can also be shown that if b is another p-length vector,

cov(a^T X, b^T X) = ∑_{j=1}^{p} ∑_{k=1}^{p} a_j b_k σ_jk = a^T Σ b. (1.11)

More generally, if A and B are two real matrices, of dimensions r × p and s × p, respectively, and we let Y = AX and Z = BX, then it can be shown that

cov(Y, Y^T) = A Σ A^T, (1.12)
cov(Z, Z^T) = B Σ B^T, and (1.13)
cov(Y, Z^T) = A Σ B^T. (1.14)

It is sometimes desirable to standardise covariance terms into the range [-1, 1], i.e. to consider the (Pearson) correlation between variables. The correlation coefficient of X_j and X_k is defined to be

corr(X_j, X_k) = ρ_jk = cov(X_j, X_k) / sqrt(var(X_j) var(X_k)) = σ_jk / sqrt(σ_jj σ_kk). (1.15)

If X_j and X_k are independent, then corr(X_j, X_k) = 0, but the converse is not necessarily true. If X_k = cX_j + a, for two scalars a and c > 0, then corr(X_j, X_k) = 1, and if c < 0, then corr(X_j, X_k) = -1. The matrix of correlations among the elements of X is given by

P = [ 1     ρ_12  . . .  ρ_1p
      ρ_12  1     . . .  ρ_2p
      . . .  . . .  . . .  . . .
      ρ_1p  ρ_2p  . . .  1 ]. (1.16)

Let D be a p × p diagonal matrix containing the standard deviations of the elements of X, σ_1, . . . , σ_p. Its inverse, D^{-1}, is also a diagonal matrix, with its entries consisting of the reciprocals of the standard deviations of the elements of X. It can then be shown that the correlation matrix for X is related to the covariance matrix as follows:

P = D^{-1} Σ D^{-1}, (1.17)
Σ = D P D. (1.18)

1.4 Multivariate Normal Distribution

The pdf of the univariate normal distribution is given by

φ(x_j; µ, σ²) = (1 / (sqrt(2π) σ)) exp[ -(x_j - µ)² / (2σ²) ],   x_j ∈ R. (1.19)

If the elements of X = (X_1, . . . , X_p) are each normal, but independent of each other, then X has mean vector µ = (µ_1, . . . , µ_p), where µ_j is the mean of X_j, and covariance matrix

Σ = diag(σ_1², . . . , σ_p²).

Hence the joint density of X is

φ(x; µ, Σ) = ∏_{j=1}^{p} φ(x_j; µ_j, σ_j²) = (1 / ((2π)^{p/2} ∏_{j=1}^{p} σ_j)) exp[ -(1/2) ∑_{j=1}^{p} ((x_j - µ_j)/σ_j)² ],   x ∈ R^p. (1.20)

If we note that for this case |Σ|^{1/2} = ∏_{j=1}^{p} σ_j, then we can rewrite this as

φ(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp[ -(1/2)(x - µ)^T Σ^{-1} (x - µ) ], (1.21)

which is the general form for the normal density, even if Σ contains off-diagonal terms. It can be shown that, when Σ is positive definite, φ(x) > 0 for all finite x and

∫_{-∞}^{∞} . . . ∫_{-∞}^{∞} φ(x) dx_1 . . . dx_p = 1 (1.22)

for all µ ∈ R^p, so φ(x) is a valid density function. This distribution has mean µ and covariance matrix Σ.

The bivariate normal distribution is the simplest multivariate case, but has many applications. When learning about multivariate statistics, it is generally worth looking at the bivariate case before considering higher-dimensional cases. The bivariate normal distribution has 5 parameters: the two elements of the mean vector, the variances (or equivalently, standard deviations) of each element, and the correlation ρ. Its parameters can be written in terms of these as follows:

µ = (µ_1, µ_2)^T,   Σ = [ σ_1²      ρσ_1σ_2
                          ρσ_1σ_2   σ_2²   ]. (1.23)

The joint pdf is

φ(x_1, x_2) = (1 / (2π σ_1 σ_2 sqrt(1 - ρ²))) × exp{ -(1 / (2(1 - ρ²))) [ ((x_1 - µ_1)/σ_1)² - 2ρ ((x_1 - µ_1)/σ_1)((x_2 - µ_2)/σ_2) + ((x_2 - µ_2)/σ_2)² ] }.

If we choose a fixed density value c (e.g. c = 0.1), there must exist a level set of points in the support of the multivariate normal for which the multivariate normal density has this value. In the bivariate case, the level set will be an elliptical contour centered on the mean, with major and minor axes. These axes follow the longest and shortest lines through the mean which touch the contour twice. In higher dimensions, the level set is an ellipsoid. This can be seen mathematically by setting the multivariate normal density equal to c.
Since

f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp[ -(1/2)(x - µ)^T Σ^{-1} (x - µ) ] = c,

we can divide both sides by the leading constant and then take logs of both sides, which still leaves us with a constant on the right hand side, say C, i.e.

(x - µ)^T Σ^{-1} (x - µ) = C.

This equation describes a p-dimensional ellipsoid. The axes of a multivariate normal ellipsoid estimated from data are known as the principal components. These are widely used for the purposes of dimensionality reduction and as a method of describing relationships between variables, and will be considered later in the course.

1.5 Marginal and Conditional Densities of the Multivariate Normal

Property: if X ∼ N(µ, Σ), with Σ of rank p (i.e. p × p without linear redundancy), A is an m × p matrix of rank m ≤ p, and we define Y = AX, then Y is a vector of length m and Y ∼ N(Aµ, AΣA^T). The covariance result is from 1.12 and the mean result is also a general property of multivariate distributions. We will not prove this result in this course. Proofs can be found in both Anderson (2003) and Graybill (2001).

This property has a number of implications. One of the most important is that the joint distribution of any set of elements of X is a multivariate normal distribution, with mean vector and covariance given by the appropriate elements of µ and Σ, i.e. any marginal distribution is multivariate normal. Can you see how the property can be used to show this? A binary matrix A made up of the appropriate rows of the p × p identity matrix produces this result. (Discussed in detail in class.) However, note that having (multivariate) normal marginal distributions does not necessarily imply that the original joint distribution was multivariate normal.

To consider conditional densities further, we begin by splitting X into two parts: X_A containing the first q elements, 0 < q < p, and X_B containing the last p - q elements. We let

X = [ X_A ],   µ = [ µ_A ],   Σ = [ Σ_AA    Σ_AB
    [ X_B ]        [ µ_B ]         Σ_AB^T  Σ_BB ]. (1.24)

Note that the marginal distributions are X_A ∼ N(µ_A, Σ_AA) and X_B ∼ N(µ_B, Σ_BB).
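This marginalisation property is easy to check numerically. Below is a minimal sketch in Python with NumPy (the numbers are hypothetical): choosing A to be rows of the 3 × 3 identity matrix extracts the marginal of (X_1, X_3).

```python
import numpy as np

# Hypothetical 3-dimensional mean vector and covariance matrix.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])

# A is built from rows of the identity matrix, so Y = AX is the
# subvector (X1, X3), and Y ~ N(A mu, A Sigma A^T) by the property above.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

mu_Y = A @ mu
Sigma_Y = A @ Sigma @ A.T

# The result matches the corresponding elements of mu and Sigma directly.
idx = [0, 2]
assert np.allclose(mu_Y, mu[idx])
assert np.allclose(Sigma_Y, Sigma[np.ix_(idx, idx)])
```

The same construction with other 0/1 rows of the identity extracts any other subset of the elements.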
The dimensions of the matrices are: ΣAA, q × q; ΣAB, q × (p− q); ΣBB, (p− q)× (p− q). Also, XA and XB are independent iff (if and only if) ΣAB = 0. We would like to determine the conditional density h(XA = xA|XB = xB) in terms 10 of the quantities mentioned so far. From 1.3 we have h(xA|xB) = f(xA, xB) g(xB) . (1.25) Since g() is a multivariate normal density, g(xB) = 1 (2pi)(p−q)/2|ΣBB|1/2 exp [ −1 2 (xB − µB)TΣ−1BB(xB − µB) ] . (1.26) We next need to express f(xA, xB) in terms of the submatrices of Σ. We will need the matrix inverse and the determinant. The linear algebra involved with the Schur complement (Boyd and Vandenberghe, 2004; Zhang, 2005), which can be used in solving systems of linear equations, gives a number of identities, including the inverse and determinant of block matrices. The Schur complement of ΣBB in Σ is ΣAA−ΣABΣ−1BBΣTAB, which we will abbreviate as ΣA|B. The inverse identity gives Σ−1 = Σ−1A|B −Σ−1A|BΣABΣ−1BB −Σ−1BBΣTABΣ−1A|B Σ−1BB + Σ−1BBΣTABΣ−1A|BΣABΣ−1BB , (1.27) and the determinant is given by |Σ| = |ΣBB||ΣA|B|. (1.28) We substitute these into 1.25. The ratio of the parts involving pi and the determi- nants resolves to (2pi)(p−q)/2|ΣBB|1/2 (2pi)p/2|ΣBB|1/2|ΣA|B|1/2 = 1 (2pi)q/2|ΣA|B|1/2 . (1.29) 11 The argument of the exponential in the ratio becomes: − 1 2 (x− µ)TΣ−1(x− µ) + 1 2 (xB − µB)TΣ−1BB(xB − µB) = −1 2 xA − µA xB − µB T Σ−1A|B −Σ−1A|BΣABΣ−1BB −Σ−1BBΣTABΣ−1A|B Σ−1BB + Σ−1BBΣTABΣ−1A|BΣABΣ−1BB xA − µA xB − µB + 1 2 (xB − µB)TΣ−1BB(xB − µB) = −1 2 xA − µA xB − µB T Σ−1A|B(xA − µA)− Σ−1A|BΣABΣ−1BB(xB − µB) −Σ−1BBΣTABΣ−1A|B(xA − µA) + (Σ−1BB + Σ−1BBΣTABΣ−1A|BΣABΣ−1BB)(xB − µB) + 1 2 (xB − µB)TΣ−1BB(xB − µB) = −1 2 [(xA − µA)TΣ−1A|B(xA − µA)− (xA − µA)TΣ−1A|BΣABΣ−1BB(xB − µB) − (xB − µB)TΣ−1BBΣTABΣ−1A|B(xA − µA) + (xB − µB)T (Σ−1BB − Σ−1BB + Σ−1BBΣTABΣ−1A|BΣABΣ−1BB)(xB − µB)] = −1 2 [(xA − µA)TΣ−1A|B(xA − µA)− 2(xA − µA)TΣ−1A|BΣABΣ−1BB(xB − µB) + (xB − µB)TΣ−1BBΣTABΣ−1A|BΣABΣ−1BB(xB − µB)]. 
Note that (xA−µA)TΣ−1A|BΣABΣ−1BB(xB−µB) = (xB−µB)TΣ−1BBΣTABΣ−1A|B(xA−µA) be- cause the latter is the transpose of the former, and since both are scalars, the transpose has no effect. In considering the above, we must remember that we are trying to get a density for the random variable XA at position xA, and everything else is constant. The expo- nential argument includes a quadratic term −1 2 xTAΣ −1 A|BxA, a linear term and a constant. This essentially ensures that the resulting density is normal, and we also recall that being a random variable, its density must integrate to 1. So what we should try to do is the multivariate equivalent of completing the square for a quadratic equation. This means we should focus on the linear term, with the understanding that the constant term must fall into line for the result to be a density. 12 The linear term is as follows: = −1 2 [−2xTAΣ−1A|BµA − 2xTAΣ−1A|BΣABΣ−1BB(xB − µB)] = xTAΣ −1 A|B(µA + ΣABΣ −1 BB(xB − µB)). Let µA|B(xB) = µA + ΣABΣ−1BB(xB − µB), (1.30) so the linear term can be written xTAΣ −1 A|BµA|B(xB). The only quadratic form which could give rise to these quadratic and linear terms is −1 2 [ xA − µA|B(xB) ]T Σ−1A|B [ xA − µA|B(xB) ] . (1.31) Hence the density of the conditional distribution must be g(xA|xB) = exp { −1 2 [ xA − µA|B(xB) ]T Σ−1A|B [ xA − µA|B(xB) ]} (2pi)q/2 |ΣA|B|1/2 (1.32) This is the density for a multivariate normal distribution with mean µA|B(xB) and co- variance matrix ΣA|B. In summary: XA|(XB = xB) ∼ N(µA|B(xB),ΣA|B). (1.33) Based on this, and remembering the assumption of p-dimensional normality, we can create a vector of residuals XA.B = XA − µA|B(xB) (1.34) These are the discrepancies between elements of XA and the values that would be predicted after conditioning on a known set of values forXB, using the linear mean and covariance equations above. 
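The conditional mean µ_{A|B}(x_B) and covariance Σ_{A|B} are straightforward to compute in practice. A minimal Python/NumPy sketch with a hypothetical 4-dimensional µ and Σ, partitioned with q = 2, which also checks the determinant identity 1.28:

```python
import numpy as np

# Hypothetical 4-dimensional mean and (positive definite) covariance matrix.
mu = np.array([0.0, 1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3, 0.1],
                  [0.5, 1.5, 0.2, 0.4],
                  [0.3, 0.2, 1.0, 0.3],
                  [0.1, 0.4, 0.3, 2.5]])

q = 2  # X_A = (X_1, X_2), X_B = (X_3, X_4)
mu_A, mu_B = mu[:q], mu[q:]
S_AA, S_AB, S_BB = Sigma[:q, :q], Sigma[:q, q:], Sigma[q:, q:]

x_B = np.array([0.0, 1.0])  # observed values of X_B

# Conditional mean: mu_A + Sigma_AB Sigma_BB^{-1} (x_B - mu_B)
mu_cond = mu_A + S_AB @ np.linalg.solve(S_BB, x_B - mu_B)
# Conditional covariance (Schur complement): Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_AB^T
S_cond = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)

# Check the determinant identity |Sigma| = |Sigma_BB| |Sigma_{A|B}| (1.28).
assert np.isclose(np.linalg.det(Sigma),
                  np.linalg.det(S_BB) * np.linalg.det(S_cond))
```

Note that Σ_{A|B} does not depend on the observed value x_B; only the conditional mean does.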
The main difference versus linear regression is that linear 13 regression does not typically assume that the explanatory variables being conditioned upon are jointly normally distributed with the response variables. The covariance of XA with XA.B can be shown to be ΣA|B. The covariance of XB with XA.B can be shown to be 0, implying that the residuals and the fixed variables are independently distributed. When holding constant xq+1, . . . , xp, the elements of the covariance matrix ΣA|B for the conditional distribution are called the partial variances and covariances, with the jkth (1 ≤ j, k ≤ q) element written as σjk|q+1,...,p. The jkth partial correlation is then ρjk|q+1,...,p = σjk|q+1,...,p√ σjj|q+1,...,p σkk|q+1,...,p (1.35) If we let DA|B = diag(ΣA|B), then the matrix of partial correlations is PA|B = D −1/2 A|B ΣA|BD −1/2 A|B , (1.36) where the power −1/2 for a matrix means to take the reciprocal of each element and then take its square root. 1.6 Samples from the multivariate normal If we take n observations from a p-dimensional multivariate normal distributionN(µ,Σ), we can store these in a data matrix XAll = x11 . . . x1p . . . . . . . . . xn1 . . . xnp = xT1 . . . xTn (1.37) The likelihood of these observations is L(µ,Σ;XAll) = 1 (2pi)np/2|Σ|n/2 exp [ −1 2 n∑ i=1 (xi − µ)TΣ−1(xi − µ) ] . (1.38) 14 The maximum likelihood estimators of µ and Σ are then functions of the data which maximise the likelihood. Let the sample mean vector be x = 1 n n∑ i=1 xi, (1.39) and let A be the matrix of sums of squares and cross products A = n∑ i=1 (xi − x)(xi − x)T = ( n∑ i=1 xix T i ) − nxxT . (1.40) The maximum of a function remains in place under monotonic transformations, and so for convenience we take the natural logarithm of the likelihood. 
The log likelihood is given by l(µ,Σ;XAll) = −1 2 np ln(2pi)− 1 2 n ln |Σ| − 1 2 n∑ i=1 (xi − µ)TΣ−1(xi − µ) (1.41) Since Σ−1 is a symmetric positive definite matrix, the quadratic form term will be minimised only when µ = x, so the maximum likelihood estimator is µˆ = x. We next need to find the maximum likelihood estimator for Σ. For this we use a lemma due to Zehna (1966). The lemma was extended by Mood et al. (1974) (see p285 - Theorem 2 and proof). This expresses the invariance of maximum likelihood estimators under transformations: Lemma 1 Let the distribution function of a multivariate random variable depend on the pa- rameters θ = (θ1, . . . , θk), with maximum likelihood estimators θˆ = (θˆ1, . . . , θˆk). Consider a transformation of the original parameter space Θ: τ(θ) = (τ1(θ), . . . , τr(θ)), 1 ≤ r ≤ k. Then the maximum likelihood estimators of τl(θ) , l = 1, . . . , r, are τˆl = τl(θˆ1, . . . , θˆk). Proof: Let θˆ = (θˆ1, . . . , θˆk) be a maximum likelihood estimate of θ = (θ1, . . . , θk). We let M(τ ;x1, . . . , xn) = supθ:τ(θ)=τ L(θ;x1, . . . , xn). It is sufficient to show thatM(τ(θˆ);x1, . . . , xn) ≥ 15 M(τ ;x1, . . . , xn) for any τ ∈ T , where T is the range of the transformation. sup θ:τ(θ)=τ L(θ;x1, . . . , xn) ≤ sup θ∈Θ L(θ;x1, . . . , xn) = L(θˆ;x1, . . . , xn) = sup θ:τ(θ)=τ(θˆ) L(θ;x1, . . . , xn) = M(τ(θˆ);x1, . . . , xn). To use this in deriving the maximum likelihood estimator for Σ, we let θ = Σ−1, so the θl are the k = p(p+ 1)/2 unique elements of Σ−1 and the τl are the unique elements of Σ. The log likelihood for the multivariate normal distribution is given by l(µ,Σ;XAll) = −1 2 np ln(2pi)− 1 2 n ln |Σ| − 1 2 n∑ i=1 (xi − µ)TΣ−1(xi − µ) = −1 2 np ln(2pi) + 1 2 n ln |θ| − 1 2 n∑ i=1 (xi − µ)T θ (xi − µ) = l(µ, θ;XAll) We will take the (matrix) derivative and set it to zero to determine the maximum likelihood estimator, as follows. 
∂l(µ, θ;XAll) ∂θ = n 2 θ−1 − 1 2 n∑ i=1 (xi − µ) (xi − µ)T (see identities (57) and (70) from The Matrix Cookbook, Peterson, 2012.) We know that the MLE for µ is µˆ = x, so we include this when setting the above to zero to find θˆ. 16 n2 θˆ−1 − 1 2 n∑ i=1 (xi − x) (xi − x)T = 0 n θˆ−1 − A = 0 So θˆ−1 = 1 n A Hence the maximum likelihood estimator is θˆ = ( 1 n A )−1. By lemma 1, the maxi- mum likelihood estimator of Σ is Σˆ = 1 n A. (1.42) Using expectations it can be shown that this Σˆ is biased. As a result, the unbiased alternative is more commonly used: S = 1 n− 1A, (1.43) which is known as the sample covariance matrix. It is quite common for the sample to come from G independent groups of experi- mental units, each with mean µg, g = 1, . . . , G and a common covariance matrix Σ. In this case the data matrix for the gth group can be written as XAllg = x11g . . . x1pg . . . . . . . . . xng1g . . . xngpg (1.44) with ith row being xig. The maximum likelihood estimator of µg is the sample mean xg for the gth group and it can be shown that the unbiased estimator of Σ is S = 1 n−G G∑ g=1 ng∑ i=1 (xig − xg)(xig − xg)T = 1 n−G G∑ g=1 Ag. (1.45) In 1956, Stein (1956) showed that when data is generated from a multivariate nor- 17 mal distribution with Σ = I , with p ≥ 3, then the sample mean vector is inadmissible, i.e. that it does not have the smallest expected mean squared error, or that there exists an estimator t such that E [ (t− µ)T (t− µ)] < E [(x− µ)T (x− µ)] (1.46) 1.6.1 James–Stein estimators This was extended by James and Stein in 1961 (James and Stein, 1961) to cover Σ = σ2I , with σ known and what are now known as the James–Stein estimators were intro- duced. For convenience, we assume that σ is such that Y = X has distribution N(µ, I). (You should be able to work out what σ would need to be to achieve this.) 
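Before continuing with the James–Stein estimator, the estimators derived earlier in this section, µ̂ = x̄, Σ̂ = A/n and S = A/(n - 1), can be sketched on simulated data (Python/NumPy; note that np.cov divides by n - 1 by default):

```python
import numpy as np

# Simulate n observations from a p-dimensional standard normal.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.multivariate_normal(mean=np.zeros(p), cov=np.eye(p), size=n)

xbar = X.mean(axis=0)            # maximum likelihood estimator of mu
centred = X - xbar
A = centred.T @ centred          # matrix of sums of squares and cross products
Sigma_mle = A / n                # maximum likelihood estimator of Sigma (biased)
S = A / (n - 1)                  # unbiased sample covariance matrix

# np.cov with rowvar=False uses the n - 1 divisor, matching S.
assert np.allclose(S, np.cov(X, rowvar=False))
```

The difference between the two divisors is negligible for large n, but the unbiased S is the conventional choice.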
Then for a single sample of Y , or equivalently an iid sample of size n of X , the James–Stein estimator is t(y) = ( 1− p− 2||y − ν||2 ) (y − ν) + ν, (1.47) where ν is an arbitrary fixed vector, and p ≥ 3. This estimator shrinks the observed y towards the specified ν. The estimator always improves on y as an estimator with respect to this loss function, and improves more the closer ν is to µ. These results have been extended by e.g. Berger (1975) to a general known covariance matrix and an arbitrary quadratic loss function L(µ, t) = (t− µ)TQ(t− µ), where Q is an arbitrary positive definite matrix which reflects the relative importance of errors in different directions. See Anderson (2003) section 3.5.3 for more details. 1.6.2 Wishart Distribution The matrix A (see 1.40) can be written as the sums of products of n− 1 independent p- dimensional random vectors with common distribution N(0,Σ). Such a matrix is said 18 to have a Wishart distribution (Wishart, 1928), with density w(A; Σ, n− 1) = |A|n−p−22 exp(− 12 tr(AΣ−1)) 2(n−1)p/2 pip(p−1)/4 |Σ|(n−1)/2∏pj=1 Γ(n−j2 ) , for A positive definite 0, otherwise. (1.48) When p = 1 and Σ = 1, the data is drawn from a N(µ, 1) distribution, and the Wishart distribution simplifies to a χ2n−1 distribution. 1.7 Hotelling’s T 2 test The single sample univariate hypothesis test for iid data from a N(µ, σ2) distribution tests H0 : µ = µ0 against Ha : µ 6= µ0, for some µ0 ∈ R that we are free to choose. Inspired by z-scores to normalise to form a test statistic, we choose as a test statistic t = x−µ0 s/ √ n . It can be shown that t ∼ tn−1, where the latter is a t-distribution with n − 1 degrees of freedom. When using a significance level of α (e.g. 0.05), the decision rule is to accept Ha if |t| > tα/2;n−1 and to retain H0 otherwise. In these notes, we will use the first subscript of a distribution (e.g. α/2) to mean the upper quantile at that level, i.e. 
so that the total probability to the right of this point is given by the subscript. Any remaining subscripts will generally be degrees of freedom. We now wish to extend such testing to data drawn from a multivariate normal dis- tribution. We thus assume we will collect n independent observations from a N(µ,Σ) distribution, i.e. a multivariate normal distribution with p elements. The null and alternative hypotheses are then H0 : µ1 ... µp = µ01 ... µ0p (1.49) 19 and H1 : µ1 ... µp 6= µ01 ... µ0p . (1.50) We assume that the standard estimates of mean and covariance, namely x and S have been calculated. Our principal tasks are then to choose an appropriate test statis- tic and to determine its distribution under the null hypothesis. There are a number of ways to show the construction. Here we follow that of An- derson (2003), which considers this as an example of a likelihood ratio test. The likelihood ratio criterion λ is the ratio of the maximum likelihood under the null hypothesis to the maximum likelihood under the alternative hypothesis. So here we have λ = maxΣL(µ0,Σ) maxµ,ΣL(µ,Σ) ∈ [0, 1]. (1.51) For µ = µ0, the likelihood function is maximised with Σˆω = 1 n n∑ i=1 (xi − µ0)(xi − µ0)T (1.52) = 1 n n∑ i=1 [(xi − x) + (x− µ0)] [(xi − x) + (x− µ0)]T = 1 n n∑ i=1 [ (xi − x)(xi − x)T + (xi − x)(x− µ0)T + (x− µ0)(xi − x)T + (x− µ0)(x− µ0)T ] = 1 n [ A+ n(x− µ0)(x− µ0)T + ( n∑ i=1 (xi − x) ) (x− µ0)T + (x− µ0) ( n∑ i=1 (xi − x) )] = 1 n [ A+ n(x− µ0)(x− µ0)T ] . (1.53) Without restrictions, the likelihood function is maximised with µˆΩ = x and ΣˆΩ = 1 n n∑ i=1 (xi − x)(xi − x)T = 1 n A. (1.54) 20 Hence λ = exp(− 12 ∑n i=1(xi−µ0)T Σˆ−1ω (xi−µ0)) (2pi)np/2|Σˆω |n/2 exp(− 12 ∑n i=1(xi−x)T Σˆ−1Ω (xi−x)) (2pi)np/2|ΣˆΩ|n/2 . (1.55) The arguments of the exponential functions can be simplified as follows using prop- erties of the trace. 
n∑ i=1 (xi − µ0)T Σˆ−1ω (xi − µ0) (1.56) = n∑ i=1 (xi − µ0)T [ 1 n n∑ l=1 (xl − µ0)(xl − µ0)T ]−1 (xi − µ0) (1.57) = tr n∑ i=1 (xi − µ0)T [ 1 n n∑ l=1 (xl − µ0)(xl − µ0)T ]−1 (xi − µ0) (1.58) = tr n∑ i=1 (xi − µ0)(xi − µ0)T [ 1 n n∑ l=1 (xl − µ0)(xl − µ0)T ]−1 (1.59) = ntr(Ip×p) = np. (1.60) The exponential function involving Σˆ−1ω produces the same result. Hence, λ = exp(−np/2) (2pi)np/2|Σˆω |n/2 exp(−np/2) (2pi)np/2|ΣˆΩ|n/2 (1.61) = |ΣˆΩ|n/2 |Σˆω|n/2 (1.62) = |A|n/2 |A+ n(x− µ0)(x− µ0)T |n/2 . (1.63) The following matrix identity is useful to simplify the expression for λ: ∣∣∣∣∣∣∣ B C D E ∣∣∣∣∣∣∣ = ∣∣∣∣∣∣∣ B 0 D E −DB−1C ∣∣∣∣∣∣∣ = |B||E −DB−1C|. (1.64) 21 We can write |A+ n(x− µ0)(x− µ0)T | as∣∣∣∣∣∣∣ A −√n(x− µ0) √ n(x− µ0)T 1 ∣∣∣∣∣∣∣ = |A||1 + n(x− µ0)TA−1(x− µ0)|. (1.65) Hence λ2/n = 1 1 + n(x− µ0)TA−1(x− µ0) = 1 1 + T 2/(n− 1) , (1.66) where T 2 = n(x− µ0)TS−1(x− µ0) = (n− 1)n(x− µ0)TA−1(x− µ0). (1.67) The likelihood ratio test is defined by the critical region (set of possible outcomes which will cause the null hypothesis to be rejected) λ < λ0, or equivalently T 2 > T 20 , where T 20 = (n− 1) ( λ −2/n 0 − 1 ) . So then the null hypothesis is rejected whenever T 2 ≥ T 20 , where T0 is chosen so that P (T 2 > T 20 ) is equal to the chosen significance level under the null hypothesis. When H0 is true, F = n−pp(n−1)T 2 ∼ Fp,n−p. So then we reject H0 if F > Fα;p,n−p. 1.7.1 Confidence region for mean vector A 100(1− α)% confidence region consists of all vectors µ for which n(x− µ)TS−1(x− µ) ≤ (n− 1)p n− p Fα;p,n−p. (1.68) This region has an ellipsoidal boundary and is centered at x. 1.7.2 Two sample test More commonly, we may wish to check whether the evidence supports two multi- variate samples coming from the same population or from different populations. In particular, we would like to compare their means, i.e. to test H0 : µ1 − µ2 = δ0 versus H1 : µ1 − µ2 6= δ0 based on two samples X(1) and X(2) of size n1 and n2, respectively. 
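Before developing the two-sample statistic, the one-sample T² test derived above can be sketched numerically (Python/NumPy, simulated data; the final rejection decision compares F against the upper-α quantile of F_{p,n-p}, which could be obtained from e.g. scipy.stats.f.ppf and is left as a comment here):

```python
import numpy as np

# One-sample Hotelling T^2: T^2 = n (xbar - mu0)^T S^{-1} (xbar - mu0),
# and under H0, F = (n - p) / (p (n - 1)) * T^2 ~ F_{p, n-p}.
rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.multivariate_normal(mean=np.zeros(p), cov=np.eye(p), size=n)
mu0 = np.zeros(p)                  # hypothesised mean under H0

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)        # unbiased sample covariance
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)
F = (n - p) / (p * (n - 1)) * T2
# Reject H0 at level alpha if F exceeds the upper-alpha quantile of
# the F distribution with (p, n - p) degrees of freedom.
```

Since the data here were generated under H0, the observed F will usually be unremarkable; rerunning with a shifted mean makes T² grow rapidly.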
22 We assume that the two groups have a common, but unknown covariance matrix Σ. A typical use of this test will use a null value of δ0 = 0. The vector X (l) ; l = 1, 2, has distribution N(µl, 1nlΣ). This can be shown using 1.12 by choosing to consider all nl observationsX (l) i ∼ N(µl,Σ), i = 1, . . . , n, laid out under- neath each other in a single vector Z of length np. Z has a covariance matrix composed of copies of Σ down the diagonal and 0 matrices elsewhere: cov(Z) = Σ 0 . . . 0 0 Σ . . . 0 . . . . . . . . . . . . 0 . . . 0 Σ . (1.69) We can then consider Y = AZ, withA = (I . . . I) being a p×npmatrix comprised of n copies of the p × p identity matrix. Hence Y = ∑nli=1X(l)i . The covariance matrix for Y is then cov(Y, Y T ) = A cov(Z)AT = nΣ. Since the sample mean vector is X (l) = 1 n Y , we can use 1.9 to show that X (l) ∼ N(µl, 1nΣ). Under the null hypothesis it can be shown that X (1) −X(2) has distribution N ( 0, ( 1 n1 + 1 n2 ) Σ ) = N ( 0, n1 + n2 n1n2 Σ ) . (1.70) The natural approach to estimating the common covariance matrix would be to treat each observation as equally informative and sum their contributions. This is done in the following, which uses two degrees of freedom to estimate the mean of each population: S = 1 n1 + n2 − 2 { n1∑ i=1 (x (1) i − x(1))(x(1)i − x(1))T + n2∑ i=1 (x (2) i − x(2))(x(2)i − x(2))T } . (1.71) Then (n1 + n2− 2)S has the same distribution as ∑n1+n2−2 m=1 UmU T m, where each Um ∼ 23 N(0,Σ). By extending the results for the one sample case, it can be shown that T 2 = n1n2 n1 + n2 (X (1) −X(2))TS−1(X(1) −X(2)) (1.72) is distributed according to the T 2 distribution with n1 + n2 − 2 degrees of freedom. Equivalently, T 2 n1 + n2 − p− 1 (n1 + n2 − 2)p ∼ Fp,n1+n2−p−1. (1.73) Hence the critical or rejection region for the test is T 2 n1 + n2 − p− 1 (n1 + n2 − 2)p > Fα;p,n1+n2−p−1. 
(1.74)

As a consequence, the 100(1 - α)% confidence region for the difference in means δ = µ_1 - µ_2 is given by the vectors δ which satisfy

(x̄^(1) - x̄^(2) - δ)^T S^{-1} (x̄^(1) - x̄^(2) - δ) ≤ ((n_1 + n_2)/(n_1 n_2)) T²_{α; p, n_1+n_2-p-1}, (1.75)

where

T²_{α; p, n_1+n_2-p-1} = ((n_1 + n_2 - 2)p / (n_1 + n_2 - p - 1)) F_{α; p, n_1+n_2-p-1}. (1.76)

Further, the 100(1 - α)% simultaneous confidence intervals for any linear compound of the mean difference a^T δ, a ∈ R^p, can be shown to be

a^T (x̄^(1) - x̄^(2)) ± sqrt( a^T S a ((n_1 + n_2)/(n_1 n_2)) T²_{α; p, n_1+n_2-p-1} ). (1.77)

Note that we assumed that Σ was common to the two groups in the above. This can be an unreasonable assumption. However, Ito and Schull (1964) have shown that when the two sample sizes involved are equal, the type I error rate is unaffected by unequal covariance matrices, and the type II error rate is unaffected if the two samples are large. Hence the test above can be used in many circumstances even when there are apparent differences in the covariance matrices of the two groups.

In general, the task of such hypothesis testing is known as the multivariate Behrens-Fisher problem, and a variety of solutions have been proposed. The procedure by Krishnamoorthy and Yu (2004) is one of the most powerful. For this test, we first define S̃_1 = S_1/n_1, S̃_2 = S_2/n_2, S̃ = S̃_1 + S̃_2 and

T² = (X̄^(1) - X̄^(2))^T S̃^{-1} (X̄^(1) - X̄^(2)). (1.78)

Then

((ν - p + 1)/(νp)) T² ∼ F_{p, ν-p+1}, (1.79)

where

ν = (p + p²) / ( {tr[(S̃_1 S̃^{-1})²] + [tr(S̃_1 S̃^{-1})]²}/n_1 + {tr[(S̃_2 S̃^{-1})²] + [tr(S̃_2 S̃^{-1})]²}/n_2 ). (1.80)

This type of analysis can be continued further with multivariate analysis of variance. See e.g. Morrison (2005), Chapter 3 for details.

Chapter 2
Clustering

Clustering is the task of allocating observations into a number of groups which are suggested by structure in the data. These groups may or may not correspond to meaningful groups in the context producing the observations. The hope is that the groups or clusters produced will be meaningful and will lead to new insights into the population.
This naturally leads to many questions, including the following:

• Will we make use of probability models of the data?

• What measure of similarity will we use to guide which observations are put together in a cluster, and which are put in separate clusters?

• How will we penalise using large numbers of clusters, so that we find a reasonable number of clusters?

• How does this relate to (supervised) classification?

2.1 K-means clustering

We will begin by considering a well-known method of clustering which does not explicitly use a probability model. It also avoids the question of the number of clusters by requiring the user to choose this number: $K$. The general idea of the algorithm is that the data will have been generated from $K$ round clusters, and we should assign each point to its closest cluster centre, in this case its cluster mean. This is actually equivalent to a mixture model with spherical components and equal weights, and so is a special case of the models considered in the next section. We would generally be wise to normalise the data so that each dimension has variance 1.

The algorithm does not require a particular measure of dissimilarity, but for continuous predictors, the squared Euclidean ($L_2$) norm is generally used, i.e.

$$d(x_i, x_l) = \sum_{j=1}^{p} (x_{ij} - x_{lj})^2 = \|x_i - x_l\|^2. \qquad (2.1)$$

We then need something to optimise, and here we wish to minimise the sum of squared Euclidean distances between the observations within each cluster, which is also called the within-cluster point scatter (with a convenient constant multiplier):

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(l)=k} \|x_i - x_l\|^2, \qquad (2.2)$$

where $C(\cdot)$ maps observation number $i$ to a cluster number in the range $1, \dots, K$, i.e. it records cluster membership. This turns out to be equivalent to minimising the sum of squared distances from each observation to its cluster mean. In principle, we could directly optimise this function.
However, the number of distinct partitions of $n$ observations into $K$ clusters grows rapidly with the sample size (see e.g. Jain and Dubes (1988)):

$$S(n, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^n. \qquad (2.3)$$

This tends to encourage the use of heuristic optimisation, such as greedy approaches. Greedy optimisation is the idea of starting somewhere, typically at a random point, and proposing a sequence of limited changes to the state, each of which is only accepted if it improves the cost function. As a result, it can become trapped in a local minimum if none of the available proposals offer an improvement in cost.

It can be shown, using a method similar to that of the ANOVA identity (see Appendix A), that 2.2 can be rewritten as

$$W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2, \qquad (2.4)$$

where $\bar{x}_k$ is the mean of the observations in the $k$th cluster and $N_k = \sum_{i=1}^{n} I(C(i) = k)$ is the number of observations assigned to the $k$th cluster. We have now phrased the optimisation problem in terms of distances to the cluster means. Note that each observation is weighted by the size of its cluster, but this could be altered if desired.

A greedy solution to this problem forms the K-means algorithm:

1. Assign each observation to the closest cluster mean.

2. Recompute the cluster means.

3. Iterate steps 1 and 2 until the assignments of observations to clusters do not change.

One must initialise this algorithm somehow, and probably not particularly well, or the initialisation could itself constitute another clustering algorithm. All that must be chosen is the $K$ initial $p$-dimensional "means". These do not have to be the mean of any subset of the observations, but they will be after one iteration of the algorithm. The algorithm eventually arrives at a local minimum, and it will usually not take many iterations. Since the initialisation is random and the result is a local minimum, restarting the algorithm from another initialisation could easily lead to a different local minimum. Note that the global minimum is also a local minimum.
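The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not from the notes: the function name, the initialisation by sampling $K$ observations, and the toy data are all our own choices.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Plain K-means: assign each point to the nearest mean, then update means."""
    rng = np.random.default_rng(rng)
    # Initialise the K "means" as K distinct observations chosen at random.
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 1: squared Euclidean distance from every point to every mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 3: stop when the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each cluster mean (skip empty clusters).
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    # Within-cluster sum of squared distances to the cluster means.
    W = sum(((X[labels == k] - means[k]) ** 2).sum() for k in range(K))
    return means, labels, W

# Illustrative data (not from the notes): two well-separated spherical clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
means, labels, W = kmeans(X, K=2, rng=1)
```

With clusters this well separated, the algorithm recovers the two groups from essentially any initialisation, usually within a handful of iterations.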
The descent into a local minimum is deterministic, so one could in theory partition the $Kp$-dimensional initialisation space into (possibly disconnected) regions, with all points in a region leading to the same local minimum for the given dataset. Given that the K-means algorithm usually converges quickly, it is reasonable to run it a number of times and choose the final clustering which has produced the lowest value of the cost function. It is not generally feasible to check whether this is a global minimum or not.

There is no consensus on how to initialise the K-means algorithm, and many authors assume it will be done randomly. This still leaves many choices: e.g. uniform over a bounded feasible parameter space or over an object (e.g. a hypercube) enclosing the data, or a (discrete) uniform choice of $K$ separate individual observations, or a random division of the observations into $K$ subsets, taking the means of these. A large number of variations on K-means have been developed, but here we suggest that you consider the basic version again later from the perspective of mixture modelling.

The key remaining issue is the choice of $K$. This is an example of the general problem of statistical model selection, i.e. deciding which model out of a set or family of models is most appropriate for the data, in addition to fitting them. The Bayesian version places a prior over the models and asks what the posterior distribution over the models is, but for now we will only consider frequentist approaches, which primarily seek a single best model.

There are many approaches to choosing $K$ in K-means clustering, none of which are entirely satisfactory. One of the most popular at present is based on a "gap statistic" (Tibshirani et al., 2001), as described below.

For the given dataset, one finds the (sample) minimum and maximum values for each dimension: $x_{\min 1}, \dots, x_{\min p}$ and $x_{\max 1}, \dots, x_{\max p}$.
We then take $B$ samples of size $n$ from a uniform distribution over the box (hyperrectangle) whose vertices are given by these minimum and maximum values. This approach is similar in spirit to the bootstrap. We consider a range of values $k = 1, \dots, K$. For each $k$, we run the K-means algorithm on the original data to obtain the value $W_k^{\text{data}}(C)$ of the within-cluster scatter cost function, and on each of the $B$ uniform samples to obtain the values $W_k^b(C)$, $b = 1, \dots, B$. For each $k$, we can determine the average value

$$\bar{W}_k(C) = \frac{1}{B} \sum_{b=1}^{B} W_k^b(C). \qquad (2.5)$$

The gap statistic is then defined as

$$\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_k^b(C) - \log W_k^{\text{data}}(C). \qquad (2.6)$$

We also define

$$s'(k) = \sqrt{1 + 1/B}\;\; \text{s.d.}\!\left(\log W_k^1(C), \dots, \log W_k^B(C)\right), \qquad (2.7)$$

and finally choose the optimal

$$k^* = \min\left\{k : \text{Gap}(k) \ge \text{Gap}(k+1) - s'(k+1)\right\}, \qquad (2.8)$$

i.e. the smallest value of $k$ such that the above condition holds. Put another way,

$$k^* = \min\left\{k : \text{Gap}(k+1) - \text{Gap}(k) \le s'(k+1)\right\}, \qquad (2.9)$$

that is, we find the smallest $k$ such that any increase in the gap score is within a measure of the variability of such scores in the bootstrap samples with $k + 1$ clusters.

The within-cluster scatter is usually a decreasing function of $k$. However, it is hard to predict what pattern will be seen in a plot of the gap statistic versus $k$. This heuristic rule will increase $k$ provided the gap statistic increases by more than the measure of variability; the optimal value $k^*$ is chosen as the first value at which the changes level off or fall. There is very little theory to back this method, but it recovered the correct number of clusters in a high proportion of runs on a set of simulation experiments. The method understandably struggled when the clusters were far from spherical. The authors also considered a variation of the method in which the axes of the box were taken to be those of the principal components. This approach was more successful with elongated clusters.
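The procedure can be sketched as follows, under the assumption of a plain K-means routine with a few random restarts and uniform reference samples over the bounding box as described above. All names, the number of restarts, and the toy data are illustrative choices.

```python
import numpy as np

def _kmeans_W(X, k, rng, n_init=5, max_iter=50):
    """Smallest within-cluster scatter found over a few random restarts."""
    best = np.inf
    for _ in range(n_init):
        means = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(max_iter):
            d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            lab = d2.argmin(axis=1)
            for j in range(k):
                if np.any(lab == j):
                    means[j] = X[lab == j].mean(axis=0)
        W = sum(((X[lab == j] - means[j]) ** 2).sum() for j in range(k))
        best = min(best, W)
    return best

def gap_statistic(X, K_max, B=10, rng=None):
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)       # bounding box of the data
    gap, s = [], []
    for k in range(1, K_max + 1):
        logW_data = np.log(_kmeans_W(X, k, rng))
        # B reference samples drawn uniformly over the bounding box (2.6).
        logW_ref = np.array([np.log(_kmeans_W(rng.uniform(lo, hi, X.shape), k, rng))
                             for _ in range(B)])
        gap.append(logW_ref.mean() - logW_data)
        s.append(np.sqrt(1 + 1 / B) * logW_ref.std())   # s'(k), equation (2.7)
    # Smallest k with Gap(k) >= Gap(k+1) - s'(k+1), equation (2.8).
    for k in range(1, K_max):
        if gap[k - 1] >= gap[k] - s[k]:
            return k, gap, s
    return K_max, gap, s

# Illustrative data: two tight, well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(50, 2)),
               rng.normal(10.0, 0.2, size=(50, 2))])
k_star, gap, s = gap_statistic(X, K_max=3, B=8, rng=1)
```

For data like this, the gap rises sharply between $k=1$ and the true number of clusters; the behaviour beyond the true $k$ is data-dependent, which is exactly what the $s'(k+1)$ allowance in the rule is meant to absorb.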
It should be noted that a K-means clustering forms a Voronoi tessellation of the data space, with the means corresponding to the seeds and the clusters corresponding to the cells. With three clusters, provided that the three cluster centres do not lie along a line (collinearity), the boundaries between these clusters will all meet at a single point. That point is the circumcentre of the triangle whose vertices are the cluster centres, i.e. the centre of the smallest circle which passes through all three cluster centres. By the definition of a circle, its centre must be equidistant from all points on its perimeter. If we find the point halfway between two cluster centres and from there draw a line (cluster boundary) perpendicular to the line connecting the cluster centres, all points on the boundary line must be equidistant from the two cluster centres, and hence the boundary must include the circumcentre. This is true for all three possible pairs of cluster centres, so the three cluster boundary lines meet at the circumcentre. Note that this is also true in any number of dimensions, because three points always form a two-dimensional triangle. Note also that the cluster boundaries extend as hyperplanes (of dimension $p - 1$).

2.2 Mixture Models

A finite mixture model assumes that the data has been generated by a distribution whose density can be decomposed into a finite sum of weighted component densities. Equivalently, the model is that each observation comes from one of a finite set of parametric distributions. The component densities are usually all of the same parametric form, e.g. those of the multivariate normal distribution. The probabilities of the components are free to vary, but must sum to 1. The probability that an observation will be drawn from the $h$th component is $\pi_h$, $0 \le \pi_h \le 1$, with $\sum_{h=1}^{H} \pi_h = 1$, where $H$ is the number of mixture components. The overall mixture density is given by

$$f(x) = \sum_{h=1}^{H} \pi_h f_h(x), \qquad (2.10)$$

where $f_h(\cdot)$ is the $h$th component density.
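Equation 2.10 is straightforward to evaluate, and to sample from by the two-stage mechanism it implies (pick a component with probability $\pi_h$, then draw from that component). A minimal univariate sketch follows; the two-component weights, means and standard deviations below are arbitrary illustrative choices, not from the notes.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical two-component mixture: the weights pi_h must sum to 1.
weights = np.array([0.3, 0.7])
mus     = np.array([-2.0, 3.0])
sigmas  = np.array([1.0, 0.5])

def mixture_density(x):
    """f(x) = sum_h pi_h f_h(x), equation (2.10)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(n, rng=None):
    """Two-stage sampling: pick component h with probability pi_h, then draw from f_h."""
    rng = np.random.default_rng(rng)
    h = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(mus[h], sigmas[h])
```

Because the weights sum to 1 and each component density integrates to 1, the mixture density itself integrates to 1, and the mean of a large sample approaches $\sum_h \pi_h \mu_h$.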
Here we will only consider mixtures whose components are multivariate normal, but in principle, mixtures of any kind of components may be considered, and many variations, such as mixtures of $t$ and beta distributions and mixtures of generalised linear models, have been studied. See, for example, McLachlan and Peel (2000), Chapter 5, for further details.

As George Box once said, "all models are wrong, but some are useful". If one has good reason to believe that the data has been drawn from a mixture of distributions of a particular type, then a key purpose of mixture modelling is to recover or approximate the parameters of the originating mixture, based on the data. Since one usually uses parametric component distributions, the fitted parameters of these are then available for interpretation. Even if one does not believe that the data is drawn from a mixture, a mixture model can be a useful and flexible option for approximating the underlying probability density: this is known as density estimation.

A parametric model has a fixed number of parameters which can be fitted to the data, and its density often has a well-defined shape. The density of a nonparametric model does not have a well-defined shape, and either has no parameters or a set of parameters to be fitted whose form or number depends on the data, with the number not necessarily being bounded as the size of the dataset grows. Finite mixture models are an example of a "semi-parametric" model: the components are generally parametric models, but there is in principle no limit to the number of components, and no a priori way to know how many components will be needed to fit the data well. However, the stipulation that the number of components is finite is what makes this semi-parametric. Infinite mixture models are also possible, and are typically studied from a Bayesian perspective.

A number of major tasks await the mixture modeller, with the following foremost:

1.
fit all parameters of the mixture model based on the data;

2. choose the number of components;

3. quantify uncertainty about the parameters.

2.3 Fitting a normal mixture model

Here we will assume that the number of components $H$ is fixed, that the data is $p$-dimensional and continuous, that each component is multivariate normal, and that the principle used to fit the model is maximisation of the likelihood.

If we observe a random sample $\{x_1, \dots, x_n\}$ drawn from a mixture distribution with density 2.10, then the observations are i.i.d. and we can store them in an $n \times p$ data matrix $X_{\text{All}}$. We assume that the parameters of the component densities are collected, together with the probabilities of the components, into a parameter vector $\Psi$. In fact, since we know that $\sum_{h=1}^{H} \pi_h = 1$, only $\pi_1, \dots, \pi_{H-1}$ are needed as component probability parameters. Hence the likelihood function can be written as

$$L(\Psi; X_{\text{All}}) = \prod_{i=1}^{n} \sum_{h=1}^{H} \pi_h f_h(x_i). \qquad (2.11)$$

It is generally worthwhile to transform this via the log function, which gives us the log likelihood

$$l(\Psi; X_{\text{All}}) = \sum_{i=1}^{n} \ln\left(\sum_{h=1}^{H} \pi_h f_h(x_i)\right). \qquad (2.12)$$

We now assume that the components are each multivariate normal, and write out the log likelihood. Note that now $\Psi = (\mu_1^T, \dots, \mu_H^T, \xi^T, \pi_1, \dots, \pi_{H-1})^T$, where $\xi$ is a vector containing the unique parameters of $\Sigma_1, \dots, \Sigma_H$:

$$l(\Psi; X_{\text{All}}) = \sum_{i=1}^{n} \ln\left(\sum_{h=1}^{H} \frac{\pi_h}{(2\pi)^{p/2} |\Sigma_h|^{1/2}} \exp\left[-\frac{1}{2}(x_i - \mu_h)^T \Sigma_h^{-1} (x_i - \mu_h)\right]\right). \qquad (2.13)$$

This function cannot really be simplified, although it can be directly maximised using optimisation methods such as Newton-Raphson approaches. In considering such general-purpose iterative methods, one wonders whether a more specific approach might be suitable for this problem. One thing to realise is that fitting the mixture model would be quite easy if we knew which component each observation came from.
If we knew that a subset of the observations belonged to the first multivariate normal component, we could just set the mean vector of that component equal to the sample mean over this subset, and the covariance matrix of the component equal to the sample covariance over that subset. We could also set the proportion for this component equal to the number of observations in this subset divided by the total. If we knew this for all components, we could quickly fit the entire mixture model. Unfortunately, we do not. However, this idea is largely the basis of the Expectation-Maximisation (EM) algorithm for fitting mixture models.

The EM algorithm is an iterative procedure to find a maximum of the log likelihood function for a statistical model. It tends to be useful when it is possible to augment the data with additional latent or unobservable data (effectively variables) which would be sufficient to make the maximisation straightforward. One then iterates a process of finding approximate values for the latent variables by taking expectations (the Expectation step), followed by using these expectations to maximise the resulting likelihood over both the observed and latent data (the Maximisation step). The original data is then said to be "incomplete" and the original likelihood is termed the incomplete likelihood. When the latent data is added, we term the data "complete" and the likelihood function of such data is termed the complete likelihood.

In most cases, the maximum likelihood estimate $\hat{\Psi}$ is a solution of the equation

$$\frac{\partial l(\Psi; X_{\text{All}})}{\partial \Psi} = 0, \qquad (2.14)$$

and for this problem there are no solutions of interest at the boundary of the parameter space. It can be shown that the MLE satisfies

$$\hat{\pi}_h = \sum_{i=1}^{n} \tau_h(x_i; \hat{\Psi})/n, \quad h = 1, \dots, H, \qquad (2.15)$$

where $\tau_h(x_i; \hat{\Psi})$ is the posterior probability that observation $i$ has been drawn from component $h$, i.e.
$$\tau_h(x_i; \hat{\Psi}) = \frac{\hat{\pi}_h f_h(x_i; \hat{\theta}_h)}{\sum_{m=1}^{H} \hat{\pi}_m f_m(x_i; \hat{\theta}_m)}, \qquad (2.16)$$

where $f_h(\cdot)$ in this case is the multivariate normal density, and so $\theta_h$ is the vector of unique parameters from the mean and covariance matrix of component $h$. This is not immediately helpful, given that we do not know the $\hat{\pi}_h$ values. In the past, some researchers were able to solve equations such as 2.15 and 2.16 for the unknown parameters, but this was difficult and required iterative methods whose properties were largely unknown. However, these equations add to the hints which suggest that we should choose the latent data to be the set of component memberships of the observations, and try an EM algorithm to maximise the likelihood.

We thus let the complete data be

$$X_C = \{X_{\text{All}}, Z\}, \qquad (2.17)$$

where

$$Z = (z_1^T, \dots, z_n^T)^T, \qquad (2.18)$$

and $z_i$ is a vector which describes which component observation $i$ belongs to. It is a binary vector consisting of a single 1 and $H - 1$ zeros. If $x_i$ belongs to component $h$, then the 1 is placed in the $h$th position of the vector.

The complete data likelihood for a single observation is the density of that observation, which can be written:

$$P(X_i = x_i, Z_i = z_i; \Psi) = P(Z_i = z_i; \Psi)\, P(X_i = x_i | Z_i = z_i; \Psi) \qquad (2.19)$$
$$= \prod_{h=1}^{H} \pi_h^{z_{ih}} f_h(x_i)^{z_{ih}}. \qquad (2.20)$$

Note that since all but one of the elements of $z_i$ are zero, the product over the $h$ elements of $z_i$ merely picks out the component which the $i$th observation belongs to, and for that component, we are using the right-hand side of 2.19. For all other components, we are merely multiplying by terms raised to the power 0, i.e. by ones. As a result, we can write the complete likelihood function as

$$L_C(\Psi; X_{\text{All}}, Z) = \prod_{i=1}^{n} \prod_{h=1}^{H} \pi_h^{z_{ih}} f_h(x_i)^{z_{ih}}. \qquad (2.21)$$

Hence the complete data log likelihood is

$$l_C(\Psi; X_{\text{All}}, Z) = \sum_{i=1}^{n} \sum_{h=1}^{H} z_{ih} \left[\ln \pi_h + \ln f_h(x_i; \theta_h)\right]. \qquad (2.22)$$

For a mixture of multivariate normal distributions, this becomes

$$l_C(\Psi; X_{\text{All}}, Z) = \sum_{i=1}^{n} \sum_{h=1}^{H} z_{ih} \left[\ln \pi_h - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_h| - \frac{1}{2}(x_i - \mu_h)^T \Sigma_h^{-1} (x_i - \mu_h)\right].$$
(2.23)

We will then attempt to maximise this function indirectly, using the EM algorithm. It can be shown that the maxima of the complete and the incomplete log likelihoods are found at the same sets of parameter values. Dempster, Laird and Rubin (Dempster et al., 1977) showed that the incomplete data likelihood $L(\Psi)$ is not decreased by an EM iteration. The algorithm will usually converge to a local maximum of this function, which could be the (non-boundary) global maximum, or a saddle point. The convergence of the EM algorithm was proven by Wu (1983).

The E step of the EM algorithm actually requires that we obtain the expectation of the complete data log likelihood, taken across the distribution of the latent variables. When fitting mixtures, the latent variables are the $z$ values for each of the observations. We must have some current set of values for the parameters; we will label these $\Psi^{(k)}$. These will be used to determine the distribution of the latent variables. For example, for a given set of mixture parameters, a particular observation will be more likely to belong to one component than another, and these probabilities effectively define the distribution over the $z$ values. The superscript $(k)$ indicates that this is the set of values at the $k$th iteration. The purpose of the E and M steps is to update $\Psi^{(k)}$ to its next value $\Psi^{(k+1)}$, which is hoped to be an improvement with respect to $L(\Psi; X_{\text{All}})$.

In general, for the E step, we need to calculate what is known as the Q function:

$$Q(\Psi; \Psi^{(k)}) = E_{Z; \Psi^{(k)}}\left\{l_C(\Psi) \,|\, X_{\text{All}}\right\}. \qquad (2.24)$$

Here the expectation is taken over the set of latent variables $Z$, with the parameters of the distribution of $Z$ fixed at $\Psi^{(k)}$. The expectation is taken of the complete log likelihood, with the observed data (stored in $X_{\text{All}}$) fixed throughout. The M step then maximises the Q function, i.e.

$$\Psi^{(k+1)} = \arg\max_{\Psi}\, Q(\Psi; \Psi^{(k)}). \qquad (2.25)$$

One can then continue taking E and M steps in alternation until some stopping criterion is met.
As we have seen before, typical stopping criteria include completing a fixed number of steps, or waiting until the size of the improvements in some function of interest falls below a small threshold value. Here, the function of most interest is the incomplete log likelihood $l(\Psi; X_{\text{All}})$.

We will now determine the details of the E and M steps for mixture models, and specifically the multivariate normal mixture model. We should first notice that the complete data log likelihood for mixtures is linear in the latent variables $z$ (see 2.22), i.e. we can bring the expectation inside the sum. As a result,

$$Q(\Psi; \Psi^{(k)}) = E_{Z; \Psi^{(k)}}\left\{l_C(\Psi) \,|\, X_{\text{All}}\right\} \qquad (2.26)$$
$$= E_{Z; \Psi^{(k)}}\left\{\sum_{i=1}^{n} \sum_{h=1}^{H} z_{ih} \left[\ln \pi_h + \ln f_h(x_i; \theta_h)\right] \,\Big|\, X_{\text{All}}\right\} \qquad (2.27)$$
$$= \sum_{i=1}^{n} \sum_{h=1}^{H} E_{Z; \Psi^{(k)}}(z_{ih} | x_i) \left[\ln \pi_h + \ln f_h(x_i; \theta_h)\right] \qquad (2.28)$$
$$= \sum_{i=1}^{n} \sum_{h=1}^{H} \tau_h(x_i; \Psi^{(k)}) \left[\ln \pi_h + \ln f_h(x_i; \theta_h)\right]. \qquad (2.29)$$

Note that the expectation of the $z_{ih}$ variable is the same as the probability that it equals 1, given its binary indicator structure. We have already seen $\tau_h(x_i; \hat{\Psi}_{\text{MLE}})$ in 2.16. All we need do now is compute $\tau$ using $\Psi^{(k)}$ instead, i.e. we set

$$\tau_h(x_i; \Psi^{(k)}) = \frac{\pi_h^{(k)} f_h(x_i; \theta_h^{(k)})}{\sum_{m=1}^{H} \pi_m^{(k)} f_m(x_i; \theta_m^{(k)})}. \qquad (2.30)$$

The M step requires the global maximisation of the Q function with respect to $\Psi$, which results in our next parameter vector $\Psi^{(k+1)}$. This is where our realisation of the simplicity of fitting the complete data version of the mixture model becomes useful. We simply set

$$\pi_h^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \tau_h(x_i; \Psi^{(k)}). \qquad (2.31)$$

Similarly, for the multivariate normal mixture, we weight the mean and covariance estimates by the $\tau$ values for each observation, i.e. allowing fractional membership of each of the components, via

$$\mu_h^{(k+1)} = \frac{\sum_{i=1}^{n} \tau_h(x_i; \Psi^{(k)})\, x_i}{\sum_{i=1}^{n} \tau_h(x_i; \Psi^{(k)})} \qquad (2.32)$$

and

$$\Sigma_h^{(k+1)} = \frac{\sum_{i=1}^{n} \tau_h(x_i; \Psi^{(k)})\, (x_i - \mu_h^{(k+1)})(x_i - \mu_h^{(k+1)})^T}{\sum_{i=1}^{n} \tau_h(x_i; \Psi^{(k)})}.$$
(2.33)

It may be apparent that one unfortunate possibility with a mixture model is to produce a component whose mean is exactly the value of a data point, and whose covariance matrix is close to the zero matrix. As the mean moves closer to the data point and the covariance matrix moves closer to the zero matrix, the mixture density at that point increases without bound, and hence so does the likelihood. So the likelihood function for mixtures contains at least $nH$ infinite maxima, because any of the $H$ components can converge in this way on any of the $n$ data points. In most cases, these maxima are spurious, and to be avoided. Instead we seek the parameter set with the largest finite maximum of the likelihood function (which is hence not technically a global maximum).

It should be noted that a mixture model is also not identifiable, that is, there always exist multiple sets of values of the parameters $\Psi$ which produce exactly the same mixture density throughout the support, which is $\mathbb{R}^p$ for a multivariate normal mixture model. However, this loss of identifiability is only due to the fact that one could swap the parameters of any component with those of any other, along with their mixture weights, without changing the overall density. So we can say that a mixture model is identifiable up to a permutation of the component labels or numbering, which have no particular meaning anyway. However, this lack of identifiability does make it difficult to compare two different fits of a mixture model to the same data. One might think one could just order the components by their means to produce a unique and comparable solution. However, when component means are close together, the fact that the fits are different can cause this approach to fail.

EM fitting of a mixture model can be started from any set of valid parameter values and should then converge to a local maximum or saddle point. Once the starting point is chosen, the EM algorithm is deterministic.
As in previous examples such as the K-means algorithm, this implies that the initial parameter space could in theory be decomposed into (possibly disconnected) regions, each of which leads to the same local maximum or saddle point. So the choice of initial values ultimately determines the conclusion of the search, and it also determines how many steps the algorithm will take to come within $\epsilon$ of this solution, in terms of the likelihood or some measure of distance in the parameter space.

The simplest approach to initialisation would be to randomly allocate all the observations outright to mixture components, and then fit each component to the observations so assigned. This has the benefit of keeping the means close to the overall mean of the data, and hence inside its convex hull, and the covariance matrices will also tend to be close to those of the data overall, i.e. not far too large. However, the means will tend to be closer together than one might expect in most plausible generating mixtures, and the covariance matrix elements will tend to be larger than desired. A popular alternative is to initialise by running the K-means algorithm with $k = H$. Since K-means uses outright allocation of observations to clusters, this will tend to separate the means. The means of the clusters are then used as the means of the components, and the component weights and covariance matrices are estimated from the points allocated to the respective clusters. K-means tends to produce clusters which are as close as possible to spherical, but at least they will have less spread than the whole dataset. Note that K-means also needs to be started from some position, and the idea of randomly allocating observations to clusters and then taking their means may work better for that task.
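Putting the E step (2.30) and M step (2.31)-(2.33) together gives the following sketch. This is illustrative only: the function names, the choice to initialise the means at randomly chosen observations (a k-means initialisation, as discussed above, is a popular alternative), and the small ridge added to each covariance matrix to guard against the singularities discussed above are all our own choices.

```python
import numpy as np

def mvn_logpdf(X, mu, Sigma):
    """Log density of N(mu, Sigma) at each row of X, via a Cholesky factor."""
    p = len(mu)
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)            # solves L z = (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2 * np.pi) + logdet + (z ** 2).sum(axis=0))

def em_gmm(X, H, n_iter=100, rng=None):
    """EM for an H-component multivariate normal mixture."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Initialise: means at H random observations, shared data covariance, equal weights.
    mus = X[rng.choice(n, size=H, replace=False)].copy()
    Sigmas = [np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(H)]
    pi = np.full(H, 1.0 / H)
    loglik = -np.inf
    for _ in range(n_iter):
        # E step (2.30): responsibilities tau_ih, computed on the log scale.
        logf = np.column_stack([np.log(pi[h]) + mvn_logpdf(X, mus[h], Sigmas[h])
                                for h in range(H)])
        m = logf.max(axis=1, keepdims=True)
        loglik = float((m[:, 0] + np.log(np.exp(logf - m).sum(axis=1))).sum())
        tau = np.exp(logf - m)
        tau /= tau.sum(axis=1, keepdims=True)
        # M step (2.31)-(2.33): weighted proportions, means and covariances.
        Nh = tau.sum(axis=0) + 1e-12
        pi = Nh / n
        mus = (tau.T @ X) / Nh[:, None]
        for h in range(H):
            D = X - mus[h]
            Sigmas[h] = (tau[:, h, None] * D).T @ D / Nh[h] + 1e-6 * np.eye(p)
    return pi, mus, Sigmas, tau, loglik

# Illustrative data (not from the notes): two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(60, 2)),
               rng.normal(5.0, 0.3, size=(60, 2))])
# Several random restarts; keep the fit with the highest log likelihood.
pi, mus, Sigmas, tau, loglik = max((em_gmm(X, H=2, rng=s) for s in range(5)),
                                   key=lambda fit: fit[-1])
```

The restart loop in the last line anticipates the strategy described next: run EM from several random initialisations and keep the best non-spurious solution.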
In reaction to the presence of spurious global maxima and sub-optimal local maxima, a typical strategy is to restart the EM algorithm many times, and eventually choose the outcome with the highest likelihood, provided it does not resemble a spurious solution. The randomness of the initialisation is hence useful, because it allows each of these restarts to potentially reach a different solution.

In practice, solutions in which the covariance matrix of a component is tending to zero will be detected by the eventual near-singularity of this matrix, leading to computational difficulties in the calculation of densities from that component, and hence of the log likelihood. Solutions of this type are then discarded. There can also be spurious solutions in which a small number of points occupy an isolated cluster; while the fit is valid, this may not represent any meaningful sub-population.

One could imagine trying to penalise small variances, or to constrain the variances to be above a certain threshold. The former would tend to have a global effect, including possibly moving the location of the non-spurious global maximum away from its original value, which is probably undesirable. Constraining the variance might work, but the choice of the threshold would be difficult. If the threshold is too small, then once the EM algorithm arrives in the vicinity of an undesirable maximum and hits the constraint, it will be unable to escape the pull of that maximum and hence be unable to move. If the threshold is too large, it could block access to legitimate solutions, or to those which might provide useful stepping stones to better solutions. One might more generically constrain the minimum total number of points allocated to a cluster, counting partial allocations through the $\tau$ values. It seems clear that this number should be greater than 1. However, implementing such constraints would require modifications of the algorithm, and would put some of its properties at risk.
Rather than monitor the variances of the components, one can monitor the eigenvalues of the component covariance matrices. These are related to the variances along the elliptical axes of the component distribution, and it is desirable that none of them become too small. Often, one is forced to comb through the best solutions found (according to their likelihood) by hand, checking to see if they seem reasonable. If the objective is merely density estimation, such checks may not be necessary.

2.4 Estimating uncertainty about the parameter values

A number of methods exist to obtain standard errors or approximate confidence intervals for the parameters of mixture distributions. Some are based on the information matrix $I(\hat{\Psi}; X_{\text{All}})$, which can be expressed in terms of the complete data information matrix $I_C(\hat{\Psi}; X_{\text{All}})$ and terms involving the score functions for the complete and incomplete data. These are typically algebraically rather complicated and still require approximations, such as using the observed information matrix. See sections 2.15–2.16 of McLachlan and Peel (2000) for details.

A simpler approach is to use what is known as the parametric bootstrap. It is called parametric because it draws new samples of size $n$ from a fitted model, in this case the mixture with our best estimate $\hat{\Psi}$. The procedure is as follows:

1. Generate $B$ parametric bootstrap samples $X^{(1)}, \dots, X^{(B)}$, each of size $n$, from the fitted mixture model with parameters $\hat{\Psi}$.

2. Apply the EM algorithm to each bootstrap sample to obtain $B$ realisations of $\hat{\Psi}$: $\hat{\Psi}^{(1)}, \dots, \hat{\Psi}^{(B)}$.

3. Calculate the bootstrap covariance matrix of the parameter estimates:

$$\text{cov}_B(\hat{\Psi}) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat{\Psi}^{(b)} - \bar{\hat{\Psi}})(\hat{\Psi}^{(b)} - \bar{\hat{\Psi}})^T, \qquad (2.34)$$

where

$$\bar{\hat{\Psi}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\Psi}^{(b)}. \qquad (2.35)$$

2.5 Choosing the number of components

The Bayesian approach to model selection compares one model $M_1$ with another model $M_2$ via Bayes factors, given a fixed dataset $X_{\text{All}}$. We assume that the data must have arisen under one of these two models.
In Bayesian statistics, each model's parameters are considered to have a distribution prior to the observation of the data, and there is also a prior distribution over the models themselves. These have to be set by the user of the method, and chosen in a reasonable and well-argued way if it is hoped that someone else will take the results seriously. The distribution of the data given the model is also random, as usual. A few points will be summarised below, but see Kass and Raftery (1995) for more details.

Using Bayes' theorem, we find that

$$P(M_k | X_{\text{All}}) = \frac{P(X_{\text{All}} | M_k)\, P(M_k)}{P(X_{\text{All}} | M_1)\, P(M_1) + P(X_{\text{All}} | M_2)\, P(M_2)}, \quad k = 1, 2. \qquad (2.36)$$

The posterior odds of model $M_1$ are then defined as

$$\frac{P(M_1 | X_{\text{All}})}{P(M_2 | X_{\text{All}})} = \frac{P(X_{\text{All}} | M_1)\, P(M_1)}{P(X_{\text{All}} | M_2)\, P(M_2)}. \qquad (2.37)$$

The Bayes factor for model 1 is defined as

$$B_{12} = \frac{P(X_{\text{All}} | M_1)}{P(X_{\text{All}} | M_2)}. \qquad (2.38)$$

Hence we can state that, with respect to either model,

posterior odds = Bayes factor × prior odds.

Similar equations hold for the comparison of any number of models $K$. In some cases, it may be possible to calculate $P(X_{\text{All}} | M_k)$ analytically. Usually, one will instead need a method of approximation such as Markov chain Monte Carlo sampling. The main difficulty is that

$$P(X_{\text{All}} | M_k) = \int_{\Theta_k} P(X_{\text{All}} | \Psi_k, M_k)\, P(\Psi_k)\, d\Psi_k, \qquad (2.39)$$

i.e. this is a marginal distribution, requiring integration of the full multivariate distribution of $\Psi_k$ over the parameter space $\Theta_k$ appropriate for model $k$.

Schwarz considered the asymptotic form of the Bayes factor as the sample size $n$ approaches infinity (Schwarz, 1978). He began by assuming that the observations are drawn from an exponential family of models. Schwarz did not directly specify a prior distribution. He assumed that the prior distribution was of the form

$$P(\Psi_k) = \sum_{k=1}^{K} P(M_k)\, P(\Psi_k | M_k). \qquad (2.40)$$

He also assumed that the penalty for choosing the wrong model is fixed, although this can be loosened somewhat without altering the asymptotic results.
In the Bayesian paradigm, we usually seek a posterior distribution over the models. However, if we must choose a single model, we choose the one with the highest posterior probability. This is also known as the maximum a posteriori (MAP) estimate. In 2.36, we can see that the denominator is constant across models. The MAP model choice $M_{k^*}$ is then as follows:

$$k^* = \arg\max_k\, P(X_{\text{All}} | M_k)\, P(M_k) \qquad (2.41)$$
$$= \arg\max_k\, P(M_k) \int_{\Theta_k} P(X_{\text{All}}, \Psi_k | M_k)\, d\Psi_k \qquad (2.42)$$
$$= \arg\max_k\, P(M_k) \int_{\Theta_k} \exp\left\{\log P(X_{\text{All}}, \Psi_k | M_k)\right\} d\Psi_k. \qquad (2.43)$$

If one takes a second-order Taylor series approximation of $\log P(X_{\text{All}}, \Psi_k | M_k)$ around one of its modes (local maxima) $\Psi_k = \tilde{\Psi}_k$, we get

$$\log P(X_{\text{All}}, \Psi_k | M_k) \approx \log P(X_{\text{All}}, \tilde{\Psi}_k | M_k) - \frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k), \qquad (2.44)$$

where $H(\tilde{\Psi}_k)$ is the negative Hessian matrix of $\log P(X_{\text{All}}, \Psi_k | M_k)$, evaluated at $\Psi_k = \tilde{\Psi}_k$. The first-order term is zero since the gradient at $\tilde{\Psi}_k$ is zero. If we substitute this approximation into 2.43, then we find that

$$k^* \approx \arg\max_k\, P(M_k)\, P(X_{\text{All}}, \tilde{\Psi}_k | M_k) \int_{\Theta_k} \exp\left\{-\frac{1}{2}(\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k)(\Psi_k - \tilde{\Psi}_k)\right\} d\Psi_k. \qquad (2.45)$$

The integrand of 2.45 is only a normalising constant away from being the density of a normal distribution with mean $\tilde{\Psi}_k$ and covariance matrix $H(\tilde{\Psi}_k)^{-1}$. As a result, we can multiply by the required normalising constant and its inverse, leaving the integral of a multivariate normal density. For a finite sample size, this integral will not evaluate to 1, since some of the parameters have limited support, e.g. the variance parameters must be positive, and the covariance parameters are constrained by the variances. However, if we take the sample size $n$ to infinity, the largest posterior mode coincides with the maximum likelihood estimate $\hat{\Psi}$, given mild conditions on the prior distributions. At this point, the likelihood and log likelihood become very sharply peaked, giving the Hessian entries enormous magnitudes. This makes the entries of the inverse Hessian matrix tiny, and this is the covariance matrix of the multivariate normal density.
As a result, the implied multivariate normal is tightly concentrated around $\hat{\Psi}$, and so the integral is dominated by a region very close to $\hat{\Psi}$. As a result, the bounds of the integral have essentially no effect, and the integral is approximately 1. In detail, noting that $|A^{-1}| = |A|^{-1}$:

\int_{\Theta_k} \exp\left\{ -\frac{1}{2} (\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k) (\Psi_k - \tilde{\Psi}_k) \right\} d\Psi_k \qquad (2.46)

= (2\pi)^{d_k/2} |H(\tilde{\Psi}_k)^{-1}|^{1/2} \int_{\Theta_k} \frac{1}{(2\pi)^{d_k/2} |H(\tilde{\Psi}_k)^{-1}|^{1/2}} \exp\left\{ -\frac{1}{2} (\Psi_k - \tilde{\Psi}_k)^T H(\tilde{\Psi}_k) (\Psi_k - \tilde{\Psi}_k) \right\} d\Psi_k \qquad (2.47)

\approx (2\pi)^{d_k/2} |H(\hat{\Psi}_k)|^{-1/2}. \qquad (2.48)

So

k^* \approx \arg\max_k P(M_k) P(X_{All}, \hat{\Psi}_k \mid M_k) (2\pi)^{d_k/2} |H(\hat{\Psi}_k)|^{-1/2}. \qquad (2.49)

Since log is a monotonic increasing function which does not alter the position of maxima, we can take the log of the right-hand side:

k^* \approx \arg\max_k \log P(M_k) + \log\left( P(X_{All} \mid \hat{\Psi}_k, M_k) P(\hat{\Psi}_k \mid M_k) \right) + \frac{d_k}{2} \log(2\pi) - \frac{1}{2} \log |H(\hat{\Psi}_k)| \qquad (2.50)

= \arg\max_k \log P(M_k) + l(\hat{\Psi}_k; X_{All}, M_k) + \log P(\hat{\Psi}_k \mid M_k) + \frac{d_k}{2} \log(2\pi) - \frac{1}{2} \log |H(\hat{\Psi}_k)|. \qquad (2.51)

For a given model, we generally have a method of estimating the MLE, such as the EM algorithm for mixture distributions. We can also recognise that $H(\hat{\Psi}_k)$ is the observed information matrix $I(\hat{\Psi}_k, X_{All})$. Hence

k^* \approx \arg\max_k \log P(M_k) + l(\hat{\Psi}_k; X_{All}, M_k) + \log P(\hat{\Psi}_k \mid M_k) + \frac{d_k}{2} \log(2\pi) - \frac{1}{2} \log |I(\hat{\Psi}_k, X_{All})|. \qquad (2.52)

Since we are taking $n \to \infty$, we can ignore any terms which do not scale with $n$. This removes $\log P(M_k)$, although this term could also be removed by assuming that the prior assigns similar probability to all models considered (all values of $k$). It also removes the log prior density at the MLE, $\log P(\hat{\Psi}_k \mid M_k)$, and the constant $\frac{d_k}{2} \log(2\pi)$. In addition, it can be shown that the determinant of the observed information matrix is approximately $n^{d_k}$, where $d_k$ is the dimensionality of $\Psi_k$, i.e. the number of parameters of the model. Hence

|I(\hat{\Psi}_k, X_{All})| \approx n^{d_k}. \qquad (2.53)

Using this collection of approximations, we can simplify 2.52 to

k^* \approx \arg\max_k \; l(\hat{\Psi}_k; X_{All}, M_k) - \frac{d_k}{2} \log n, \qquad (2.54)

or equivalently

k^* \approx \arg\min_k \mathrm{BIC} = \arg\min_k \; -2 l(\hat{\Psi}_k; X_{All}, M_k) + d_k \log n.
(2.55) It should be clear that there is scope to improve on the accuracy of the BIC as a method of correctly choosing the MAP model if one makes fewer assumptions, given that we do plan to use it with finite data. If one uses the standard BIC, one is effectively saying that the prior on the set of models is completely flat (same density for each model) and that the prior density of the MLE is the same across all models. The latter would be difficult to achieve for models of various sizes, and its meaning is somewhat unclear. In addition log(2pi) ≈ 1.84 and this term could also be retained. While the BIC can be used to help find an approximate MAP solution, it is more 46 widely used in frequentist statistics, in which there is no concept of prior distributions over models or parameters. In such cases, we need to remove the prior-based terms. However, one cannot really avoid the Bayesian basis of BIC, and hence ideally, one would consider which type of prior is most suitable, and its effect. Effectively, we are still attempting to maximise the likelihood, but we now have a penalty for model complexity, based purely on the number of parameters. One might ask where this penalty has come from, given that it didn’t seem to be explicitly in- cluded in the prior. The answer lies in the dimensionality of the parameter space. In a larger space, the normal density term has a larger normalising constant, which leads to a larger penalty. Equivalently, a uniform prior over models or something similar leads to a more diffuse prior over the larger parameter space, and so the posterior density for any particular model is lower. If one wishes to use the BIC in practice to choose the number of components of a mixture, one typically works through a range of numbers of components (e.g. 1 to 10) and then selects the model with lowest BIC value. It has been shown that the BIC is consistent for normal mixture models (Roeder and Wasserman, 1997). 
That is, if data is drawn from a normal mixture distribution with $H$ components, and we are using BIC to compare normal mixture models with various numbers of components, then as $n$ approaches infinity, BIC will attain its minimum value at the model with $H$ components.

2.5.1 Cross-validation of the likelihood

Smyth (2000) proposed calculating the log likelihood of a set of test data, using a model fit on a set of training data. The model with the highest test set log likelihood would then be chosen. With respect to model selection, the idea is to choose the model with the highest expected log likelihood on new data. That is,

k^* = \arg\max_k \; E_X \log P(X \mid \Psi = \hat{\Psi}_k, M_k), \qquad (2.56)

where the expectation is over $X$, a new iid observation, and for each model, $\hat{\Psi}_k$ is the MLE fit based on the existing dataset $X_{All}$. This expectation can be approximated using a test dataset $X_{test}$, not used for model fitting. Since the arg max is invariant to multiplication by a positive constant, one can equivalently consider the log likelihood over the whole test dataset (a sum of individual observation log likelihoods). I.e.

k^* \approx \arg\max_k \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \log P(X_{test,i} \mid \Psi = \hat{\Psi}_k, M_k) \qquad (2.57)

= \arg\max_k \sum_{i=1}^{n_{test}} \log P(X_{test,i} \mid \Psi = \hat{\Psi}_k, M_k). \qquad (2.58)

The test set approach can be expanded to cross-validation, with the models refit on the different training sets for each model order of interest. The log likelihood for models of a particular order is then summed across all the CV test sets. The model with the highest total test log likelihood value is chosen. In detail, cross-validation is used to choose the optimal model type or size, and then this type of model is refit to all of the data to create the final model.

It can be shown that the test set log likelihood approximates the expected value of the Kullback-Leibler (KL) divergence between the density of the true model $f(x)$ and that of the fitted model $\hat{f}(x)$, plus a constant which is independent of the model order.
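The test-set selection rule (2.56)-(2.58) can be sketched in a univariate setting. This is an illustrative example only: the data are simulated as bimodal, the candidate "model orders" are a single normal and a two-component normal mixture, and the mixture is fit with a deliberately crude EM loop rather than any production routine:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fit_normal(x):
    # MLE of a single normal density
    mu, var = x.mean(), x.var()
    return lambda t: norm_pdf(t, mu, var)

def fit_mixture2(x, n_iter=200):
    # crude EM for a two-component univariate normal mixture
    pi = 0.5
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for each point
        d1 = pi * norm_pdf(x, mu[0], var[0])
        d2 = (1.0 - pi) * norm_pdf(x, mu[1], var[1])
        tau = d1 / (d1 + d2)
        # M-step: weighted parameter updates
        pi = tau.mean()
        mu[0] = (tau * x).sum() / tau.sum()
        mu[1] = ((1 - tau) * x).sum() / (1 - tau).sum()
        var[0] = (tau * (x - mu[0]) ** 2).sum() / tau.sum()
        var[1] = ((1 - tau) * (x - mu[1]) ** 2).sum() / (1 - tau).sum()
    return lambda t: pi * norm_pdf(t, mu[0], var[0]) + (1 - pi) * norm_pdf(t, mu[1], var[1])

# bimodal data, so the two-component model is the better description
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
rng.shuffle(x)
x_train, x_test = x[:400], x[400:]

fitted = {1: fit_normal(x_train), 2: fit_mixture2(x_train)}
# mean test-set log likelihood for each model order, cf. (2.57)
scores = {k: float(np.mean(np.log(f(x_test)))) for k, f in fitted.items()}
k_star = max(scores, key=scores.get)
```

Here `k_star` selects the two-component model, since the single normal must spread its density over both modes and pays for it in held-out log likelihood.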
In general, this KL divergence can be written as D(f(x)||fˆ(x)) = ∫ ∞ −∞ ln ( f(x) fˆ(x) ) f(x)dx. (2.59) Note that the KL divergence is not symmetric, and the reference density (here f(x)) is usually chosen as the first argument for D(). Kullback-Leibler divergence is also the basis of the Akaike information criterion 48 (AIC) (Akaike, 1974), but this is calculated based on the same data used to fit the model, and an asymptotically-derived penalty is applied, as follows: k∗ ≈ arg min k AIC = arg min k −2l(Ψˆk;XAll,Mk) + 2dk. (2.60) In practice, the AIC tends to choose models that are too large, implying that its penalty function is not severe enough on model complexity when sample sizes are small. It may be reasonable to consider cross-validated likelihood as a rethinking of AIC using test sets. 2.6 Decompositions 2.6.1 Spectral Decomposition A p× p real symmetric matrix A can be decomposed as follows: A = ΓΛΓT where Λ = diag (λ1, . . . , λp) and Γ = (γ1 . . . γp) , where λj is the jth eigenvalue of A and γj is the jth eigenvector of A. Note that Γ is an orthogonal matrix, i.e. whose rows and columns are all orthogonal unit vectors, and hence ΓΓT = Ip = ΓTΓ It is then possible to define, for any α ∈ R, Aα = ΓΛαΓT . 49 In particular A1/2 = ΓΛ1/2ΓT , and so A1/2A1/2 = ΓΛ1/2ΓTΓΛ1/2ΓT = ΓΛ1/2Λ1/2ΓT = ΓΛΓT = A. Also AαA−α = ΓΛαΓTΓΛ−αΓT = ΓΛαΛ−αΓT = ΓIΓT = ΓΓT = Ip. 2.7 Mahalanobis Distances and Transformation The Mahalanobis distances are analogues of Euclidean distance which take into ac- count a relevant covariance matrix. A Mahalanobis distance can be used as a mea- sure of the distance between a point, sample or distribution and another point, sample or distribution. The main idea is that the points or variables are first normalised by transformation such that the resulting distribution would have an identity covariance matrix, i.e. with all variables having the same variances and no correlation between variables. 
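The spectral decomposition of Section 2.6.1, and the normalisation just described, can be sketched in numpy. This is a minimal illustration on simulated correlated data; the helper `sym_power` is our own name, not a library routine:

```python
import numpy as np

def sym_power(A, alpha):
    """A^alpha = Gamma Lambda^alpha Gamma^T for a symmetric positive definite A."""
    lam, gamma = np.linalg.eigh(A)  # spectral decomposition A = Gamma Lambda Gamma^T
    return gamma @ np.diag(lam ** alpha) @ gamma.T

rng = np.random.default_rng(0)
# correlated bivariate sample
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

S = np.cov(X, rowvar=False)   # sample covariance of the correlated data
root = sym_power(S, 0.5)      # S^{1/2}, so that S^{1/2} S^{1/2} = S

# Mahalanobis transformation: z_i = S^{-1/2}(x_i - x_bar), written in row form
Z = (X - X.mean(axis=0)) @ sym_power(S, -0.5)
```

The sample covariance of `Z` is exactly the identity (up to floating point), since $S^{-1/2} S S^{-1/2} = I_p$.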
When random variables $X$ and $Y$ have means $\mu_X$ and $\mu_Y$ and common covariance matrix $\Sigma$, the Mahalanobis distance between the two means is defined as

\Delta(\mu_X, \mu_Y) = \sqrt{(\mu_X - \mu_Y)^T \Sigma^{-1} (\mu_X - \mu_Y)}.

If we have a sample of each variable and an estimated common covariance matrix $S$, then the Mahalanobis distance between the sample means $\bar{x}$ and $\bar{y}$ is

D(\bar{x}, \bar{y}) = \sqrt{(\bar{x} - \bar{y})^T S^{-1} (\bar{x} - \bar{y})}.

If we have a sample of one variable $X$, namely $X_{all} = (x_1, \ldots, x_n)$ with sample mean $\bar{x}$ and sample covariance $S_X$, and a single point $y$, then the Mahalanobis distance between $y$ and $X_{all}$ is

D(\bar{x}, y) = \sqrt{(\bar{x} - y)^T S_X^{-1} (\bar{x} - y)}.

The Mahalanobis distance between a point $y$ and a point $x_i$, using the sample covariance matrix $S_X$, is

D(x_i, y) = \sqrt{(x_i - y)^T S_X^{-1} (x_i - y)}.

These distance measures suggest a corresponding transformation so that the resulting dataset or distribution will have an identity covariance matrix. The Mahalanobis transformation of a dataset $(x_1, \ldots, x_n)$ is as follows:

z_i = S^{-1/2}(x_i - \bar{x}), \quad i = 1, \ldots, n,

where $S^{-1/2}$ is determined via the spectral decomposition. Then, for the transformed data, $Z_{all} = (z_1, \ldots, z_n)$, the sample covariance matrix is $I_p$, since

\mathrm{Cov}(Z, Z^T) = S^{-1/2} S (S^{-1/2})^T = S^{-1/2} S^{1/2} S^{1/2} S^{-1/2} = I_p.

Chapter 3

Discriminant Analysis

3.1 Introduction

Discriminant analysis, also known as supervised classification, is the task of constructing a decision rule or classifier from a set of training data for which the class labels are given. The classifier can then be used to assign a class to a new observation whose class is unknown. The classes can be defined in any way desired, provided that they are distinct from each other, i.e. that each observation can only properly belong to one of the defined classes. The classes can overlap with each other, but in the areas of overlap there should be some method for determining which class is the best choice. The applications of discriminant analysis are wide and varied.
In medicine, it can be used for diagnosis (determining a patient’s condition) or prognosis (determining their outcome status sometime in the future). It can be used in security, defence or robotics applications for face or object recognition, and with documents for classifying into categories, including whether or not they are relevant to a particular query. In finance, it can be of interest to try to decide whether or not a loan applicant is credit- worthy or if a transaction is fraudulent or worthy of further investigation. In each of these situations, classes can be clearly defined and applied to label a set of well understood instances. A set of possibly relevant predictor variables can 52 be defined, mainly comprised of aspects or features of the experimental units which can be conveniently measured, and which may plausibly allow some discrimination between the different classes. It is hoped that, despite measurement errors and large overlaps between classes in terms of their joint densities in the predictor space, that collectively there will be sufficient information in the set of predictor variables that it will be possible to train (or fit) a classifier to fairly accurately predict the class of a randomly chosen new observation. 3.2 Statistical Classifiers The training data consists of multivariate observations of predictor variables along with the class labels. As in the previous chapter, the multivariate observations are placed in a data matrix XAll (see 1.37), whose rows are the individual p-dimensional observations x1, . . . , xn, xi ∈ R, i.e. each recording a value for p predictor variables. The class labels for the n observations are stored in a vector yAll = (y1, . . . , yn)T , where yi ∈ {1, . . . , G}where G is the number of classes. We can consider each observation (Xi, Yi) to be drawn from a multivariate joint distribution F . The marginal distribution for the class variable Y is then a categorical distribution with parameters pi1, . . . 
, piG−1. Note that piG = 1 − ∑G−1 g=1 pig . These pa- rameters are known as the prior probabilities of class membership. For n observations, the numbers drawn from each class are then distributed according to a multinomial distribution with the above probability parameters and n trials. In some cases, the prior probability of each class is known fairly accurately, e.g. on the basis of large past studies. In other cases, it must be estimated from the training data via pˆig = n∑ i=1 Ig(yi)/n (3.1) where I is the indicator function. In most cases of interest, the classifier is a deterministic function of the new point 53 to be classified, X = x, and the training set, and we denote it by η(XAll,yAll)(x), with the range being 1, . . . , G. Classifiers based on statistical models allow one to calculate either the density func- tion for each class at any point x. More usefully, they allow one to to determine how likely each class is, given the observation. To determine this, we use Bayes’ theorem, which applies to densities as well as probabilities. P (Y = g|X = x) = h(X = x|Y = g)P (Y = g) f(X = x) , (3.2) where Y is the class of observation X and P (Y = g) = pig. h(X = x|Y = g) is the den- sity for the gth class, which we will now abbreviate to fg(x). f(X = x) is the density for the overall distribution of observations, which can be viewed as a mixture distri- bution over the classes, with component probabilities pi1, . . . , piG. We will henceforth abbreviate this to fX(x), and note that, by the law of total probability fX(x) = G∑ g=1 pigfg(x). (3.3) Since each pig can be specified a priori, P (Y = g|X = x) is known as the posterior probability of class g given observation x, and we will abbreviate this to τg(x). (Note: this notation is drawn from (McLachlan, G.J., 1992), an important book in this area.) As a result, we can rewrite equation 3.2 as τg(x) = pigfg(x)∑G g′=1 pig′fg′(x) . 
(3.4)

The classification rule is then to choose the class for which the posterior probability is highest, i.e.

B(x) = \arg\max_{g \in 1, \ldots, G} \tau_g(x) = \arg\max_{g \in 1, \ldots, G} \pi_g f_g(x), \qquad (3.5)

with the right-hand equality following since $f_X(x)$, the denominator of $\tau_g(x)$, has no effect on which class is maximal. When using a model and parameters which exactly describe the classes, this rule can be shown to produce the lowest expected risk (details will be given in a later section). When using parameter estimates based on data, we use a similar rule:

\eta_{(X_{All}, y_{All})}(x) = \arg\max_{g \in 1, \ldots, G} \hat{\tau}_g(x) = \arg\max_{g \in 1, \ldots, G} \hat{\pi}_g \hat{f}_g(x). \qquad (3.6)

3.3 Optimal Classifiers

A classification rule $\eta()$ will always produce a class as its answer, and this predicted class can be correct or incorrect. An incorrect answer is an error, and we would like to produce a classifier which has a low error rate on new observations. In fact, for $G$ classes, there are $G^2 - G$ distinct types of error, and corresponding error rates: one for every ordered pair of distinct classes, excluding the correct answers, where the predicted class matches the true class.

The allocation rate at which a classifier $\eta$ predicts class $h$ for an observation whose actual class is $g$ is

e_{gh}(\eta) = P(\eta(X) = h \mid Y = g) = \int_{R_h} f_g(x) \, d\nu, \qquad g, h \in 1, \ldots, G, \qquad (3.7)

where $\nu$ is the appropriate measure on $\mathbb{R}^p$ for $f_X(x)$ and $R_h$ is the segment of the predictor space which the rule allocates to class $h$. Note that these regions are disjoint. The class-specific error rate for class $g$ can then be written as

\mathrm{Err}_g(\eta) = P(\eta(X) \neq g \mid Y = g) = \int_{\cup_{h \neq g} R_h} f_g(x) \, d\nu = 1 - \int_{R_g} f_g(x) \, d\nu, \qquad g \in 1, \ldots, G. \qquad (3.8)

Let $c_{gh}$ be the cost (or loss) for misclassifying an observation whose true class is $g$ into class $h$. We assume that $c_{gg} = 0$ for all $g \in 1, \ldots, G$, i.e. that correct classification is costless. For now, we will consider all types of errors to be equally undesirable, and set $c_{gh} = 1$ for all $g \neq h$. Hence we will only consider the overall error rate.
The true costs can be scaled up to any value desired, but such scaling has no effect on the 55 classification. From now on, it will be convenient to change the form of Y , the variable repre- senting the true class, to be a G-length binary vector with a 1 in the position of the true class, and 0’s elsewhere. For example, with three possible classes, the second class would be represented by y = (010). We will then use yg to refer to the gth element of this vector, as needed. Let (X, Y ) be a new point drawn at random from the distribution F , and define the zero-one loss function Q for the classifier η as follows: Q[y, η(XAll,yAll)(x)] = 0, η(XAll,yAll)(x) = y 1, η(XAll,yAll)(x) 6= y. (3.9) For a new observation (x, y), the loss due to allocation using the rule η is l(y, η(x)) = G∑ g=1 ygQ[g, η(x)]. (3.10) The expected loss or risk conditional on a particular value for x is then EFY |X=xl(Y, η(x)) = G∑ g=1 EFY |X=x(Yg|x)Q[g, η(x)] (3.11) = G∑ g=1 τg(x)Q[g, η(x)] (3.12) = G∑ g 6=η(x) τg(x) = 1− τη(x). (3.13) Here the expectation is over the random variable Y conditional on X = x, i.e. with Y drawn from the conditional distribution FY |X=x. This distribution can also be writ- ten Mult(1,τ1(x), . . . , τG(x)). In most cases, every class distribution has support which includes x, and so τg(x) > 0) for each class g. The decision rule η(x) must be determin- istic, and so must predict a single class for a given x value. When this is the true class, it will result in a 0 value for Q[, ] in equation 3.13. All other classes will result in a loss 56 of Q[, ] = 1. An optimal allocation rule is one which minimises the conditional risk (3.13) for every possible value of x. A rule which does this is called a Bayes (decision) rule or Bayes-optimal rule or minimum risk rule. It follows from 3.13 that the way to minimise the risk is to choose η(x) = B(x) = arg max g∈1,...,G τg(x), (3.14) since this will ensure that the largest term is left out of the sum. 
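The posterior probabilities and the Bayes rule can be sketched in a minimal univariate setting. The two normal classes and their prior probabilities below are hypothetical, chosen only for illustration:

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# two hypothetical univariate normal classes
pi_g = np.array([0.6, 0.4])   # prior probabilities pi_g
mu_g = np.array([0.0, 2.0])
var_g = np.array([1.0, 1.0])

def tau(x):
    """Posterior class probabilities tau_g(x), as in (3.4)."""
    num = pi_g * norm_pdf(x, mu_g, var_g)
    return num / num.sum()

def bayes_rule(x):
    """Bayes-optimal allocation (3.14): choose the class maximising tau_g(x)."""
    return int(np.argmax(tau(x))) + 1
```

Points far to the left are allocated to class 1 and points far to the right to class 2, with the boundary shifted towards class 2 because class 1 has the larger prior probability.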
This rule is admissible with respect to the given loss function, i.e. there is no rule which can produce a lower risk. For a given joint distribution F , the risk value for the Bayes-optimal rule is called the Bayes risk. When the costs of misallocation vary across pairs of classes, the loss function and Bayes-optimal rule also change. For general costs cgh with a new observation (x, y), the loss due to allocation using the rule η is l(y, η(x)) = G∑ g=1 ygcgη(x)Q[g, η(x)]. (3.15) The conditional risk becomes EFY |X=xl(Y, η(x)) = G∑ g 6=η(x) τg(x)cgη(x). (3.16) This is more difficult to optimise. In general we can only write that the conditional risk 3.16 will be minimised by choosing η(x) = B(x) = arg min h∈1,...,G G∑ g 6=h τg(x)cgh. (3.17) Statistical classifiers essentially estimate pig and fg(x) for g = 1, . . . , G. These are used to estimate τg(x), and these estimates are then simply plugged into either 3.14 or 3.17, depending on the cost structure. We will examine some of these classifiers in the following sections. 57 3.4 Linear Discriminant Analysis The method of linear discriminant analysis (LDA) assumes that the population in (or model of) each class has a multivariate normal distribution, and that the covariance matrices for all the classes are the same. The classes can have different means and proportions, as in the general case. The jth class has a multivariate normal density, with parameters pˆij, µˆj and Σˆ. Since we have the class labels, we separate the training data into classes. If we have no knowledge of the class probabilities, we estimate them by the proportion of observations in each class, as in 3.1. If we know the class probabilities or have what we believe to be more accurate estimates from elsewhere, we would use these instead. The mean of each class distribution is set equal to the corresponding sample mean of observations from that class. 
The common covariance matrix is estimated by removing the relevant class sample mean from each observation, pooling all the observations and then using the (unbiased) sample covariance matrix, as in 1.45.

For the following, we assume that all costs of misclassification are equal. The discriminant boundary between two classes, $j$ and $l$, occurs where the estimated probability of a new observation $x$ belonging to either the $j$th or the $l$th class is equal. So $\hat{\pi}_j \hat{f}_j(x) = \hat{\pi}_l \hat{f}_l(x)$, where $\hat{\pi}_j$ is the estimated prior probability of an observation belonging to the $j$th class and $\hat{f}_j()$ is the estimated probability density function of the $j$th class. We will now derive the equation for the discriminant boundary between two classes, $j$ and $l$:

\frac{\hat{\pi}_j}{(2\pi)^{p/2} |\hat{\Sigma}|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_j)} = \frac{\hat{\pi}_l}{(2\pi)^{p/2} |\hat{\Sigma}|^{1/2}} e^{-\frac{1}{2}(x - \hat{\mu}_l)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_l)}. \qquad (3.18)

Cancelling common terms and taking logs of both sides, we find that:

\log \frac{\hat{\pi}_j}{\hat{\pi}_l} = \frac{1}{2} \left[ (x - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_j) - (x - \hat{\mu}_l)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_l) \right] \qquad (3.19)

= \frac{1}{2} \left[ x^T \hat{\Sigma}^{-1} x - x^T \hat{\Sigma}^{-1} \hat{\mu}_j - \hat{\mu}_j^T \hat{\Sigma}^{-1} x + \hat{\mu}_j^T \hat{\Sigma}^{-1} \hat{\mu}_j - \left( x^T \hat{\Sigma}^{-1} x - x^T \hat{\Sigma}^{-1} \hat{\mu}_l - \hat{\mu}_l^T \hat{\Sigma}^{-1} x + \hat{\mu}_l^T \hat{\Sigma}^{-1} \hat{\mu}_l \right) \right]. \qquad (3.20)

Only the two $x^T \hat{\Sigma}^{-1} x$ terms cancel immediately, but this is enough to ensure that the resulting discriminant boundary equation is linear in $x$. Also, since $\hat{\Sigma}^{-1}$ is a symmetric $p \times p$ matrix, $a^T \hat{\Sigma}^{-1} b = b^T \hat{\Sigma}^{-1} a$ for $a, b \in \mathbb{R}^p$. This lets us group terms in $x$:

\log \frac{\hat{\pi}_j}{\hat{\pi}_l} = \frac{1}{2} \left[ -2 x^T \hat{\Sigma}^{-1} \hat{\mu}_j + 2 x^T \hat{\Sigma}^{-1} \hat{\mu}_l + \hat{\mu}_j^T \hat{\Sigma}^{-1} \hat{\mu}_j - \hat{\mu}_l^T \hat{\Sigma}^{-1} \hat{\mu}_l \right] \qquad (3.21)

= x^T \hat{\Sigma}^{-1} (\hat{\mu}_l - \hat{\mu}_j) + \frac{1}{2} \left[ \hat{\mu}_j^T \hat{\Sigma}^{-1} \hat{\mu}_j - \hat{\mu}_l^T \hat{\Sigma}^{-1} \hat{\mu}_l \right]. \qquad (3.22)

Thus

x^T \hat{\Sigma}^{-1} (\hat{\mu}_l - \hat{\mu}_j) = \log \frac{\hat{\pi}_j}{\hat{\pi}_l} + \frac{1}{2} \left[ \hat{\mu}_l^T \hat{\Sigma}^{-1} \hat{\mu}_l - \hat{\mu}_j^T \hat{\Sigma}^{-1} \hat{\mu}_j \right]. \qquad (3.23)

Once all the model parameters have been estimated and entered, this gives a linear equation in the elements of $x$. Like all classifiers, LDA partitions the predictor space into regions. Within each region the classifier will predict a single class.
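The LDA recipe above can be sketched with simulated data. This is a minimal illustration: two well-separated classes share a covariance matrix, and the classifier compares the two linear discriminant scores rather than evaluating the boundary equation directly (the two are equivalent):

```python
import numpy as np

rng = np.random.default_rng(2)

# two classes with a common covariance matrix
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, 100)
X2 = rng.multivariate_normal([3.0, 3.0], Sigma, 100)

# parameter estimates
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
# pooled (unbiased) covariance estimate: centre each class, then pool
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
S_inv = np.linalg.inv(S)

def lda_class(x):
    """Allocate x to the class with the larger linear discriminant score."""
    d1 = x @ S_inv @ mu1 - 0.5 * mu1 @ S_inv @ mu1 + np.log(pi1)
    d2 = x @ S_inv @ mu2 - 0.5 * mu2 @ S_inv @ mu2 + np.log(pi2)
    return 1 if d1 >= d2 else 2
```

Each score is linear in $x$, so the set where the two scores are equal is exactly the linear boundary of equation 3.23.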
The boundaries between these regions are based on combinations of discriminant boundaries, and are thus piecewise linear. 3.5 Quadratic Discriminant Analysis Quadratic Discriminant Analysis (QDA) is the same as linear discriminant analysis, except that the covariance matrix is allowed to be different for each class. Estimation of means and proportions is as in LDA, but the covariance matrix for each class is estimated by the sample covariance matrix for the observations in that class. Assuming equal costs of misclassification, we can again consider the discriminant boundary between two classes, j and l. pˆij (2pi)p/2|Σˆj|1/2 e− 1 2 (x−µˆj)T Σˆj−1(x−µˆj) = pˆil (2pi)p/2|Σˆl|1/2 e− 1 2 (x−µˆl)T Σˆl−1(x−µˆl) (3.24) 59 Cancelling common terms and taking logs of both sides, we find that: log pˆij|Σˆl|1/2 pˆil|Σˆj|1/2 = 1 2 [ (x− µˆj)T Σˆj−1(x− µˆj)− (x− µˆl)T Σˆl−1(x− µˆl) ] . (3.25) It is not generally possible to simplify this further. Hence the discriminant boundaries are quadratic in x. The regions dedicated to each class thus have piecewise quadratic boundaries. Clearly QDA is more flexible than LDA, and able to represent more complex class structures. However, it also has more parameters to be estimated from the same amount of data due to the proliferation of covariance matrices. The uncertainty in the estimates of the covariance parameters tends to be higher for QDA than LDA as a result. When data is limited and a common covariance matrix is a reasonable assumption or approx- imation, LDA can produce better performance on new data than QDA. This is an example of potential over-fitting, where the flexibility in the model can be wasted on fitting noise or meaningless variation in the training data, and then act to reduce the accuracy of predictions on new observations. Unfortunately, it is hard to know how much data is enough, but one can be sure that more data is needed as the number of predictor variables increases. 
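A QDA allocation rule differs from the LDA one only in carrying a separate covariance matrix, and hence a separate log-determinant term, per class. A minimal sketch with hypothetical class parameters:

```python
import numpy as np

def qda_score(x, pi_g, mu, Sigma):
    """Quadratic discriminant score: log pi_g + log of the class normal density
    (dropping the shared (2*pi)^{p/2} constant)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return np.log(pi_g) - 0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma) @ d

# hypothetical classes with different covariance matrices
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([2.0, 2.0]), np.array([[0.3, 0.0], [0.0, 0.3]])

def qda_class(x):
    s1 = qda_score(x, 0.5, mu1, S1)
    s2 = qda_score(x, 0.5, mu2, S2)
    return 1 if s1 >= s2 else 2
```

Since the quadratic terms in $x$ no longer cancel, the boundary where the two scores are equal is a quadric, as in equation 3.25.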
In deciding whether to use QDA or LDA, one should create and consider a set of bivariate or trivariate marginal plots to try to determine whether or not a common covariance matrix seems reasonable. 3.6 Fisher’s Linear Discriminant Analysis In 1936, Fisher proposed the first form of linear discriminant analysis for two classes in (Fisher, 1936). This method does not require the class distributions to be multivariate normal and it does not require the two classes to have the same covariance matrix. It does not impose a model on the data, but it does consider the mean and covariance matrix of each class. 60 The main idea of Fisher’s Linear Discriminant Analysis (FLDA) is to find a univari- ate projection of the data for which the ratio of between-class variance to within-class variance is maximised. A hyperplane perpendicular to this line then provided the dis- criminant boundary. Let the sample means of the two classes be µˆ1 and µˆ2, respectively, and let their common sample covariance matrix be S. Let a ∈ Rp, with ||a|| = 1, be a projection to be considered (although the magnitude is not important for what follows). Under this projection, the means of the classes become aT µˆ1 and aT µˆ2, respectively. The variance between classes is then [ aT (µˆ1 − (µˆ1 + µˆ2)/2) ]2 + [ aT (µˆ2 − (µˆ1 + µˆ2)/2) ]2 = (aT µˆ1 − aT µˆ2)2/2. (3.26) The common variance (within classes) in the projected space is given by aTSa. Note that this does not require that we believe there is a common covariance matrix for the classes. In the projected space, we are merely adding up the sum of squared distances from the relevant class sample mean across all observations and multiplying by a con- stant (1/(N − 2)). Without the constant, this is also known as the within-class scatter. 
Hence the ratio of between-class to within-class variance is:

\frac{(a^T \hat{\mu}_1 - a^T \hat{\mu}_2)^2}{2 a^T S a}. \qquad (3.27)

It can be shown (by taking derivatives and then solving as an eigenvalue problem) that this ratio is maximised when $a$ is proportional to $S^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$. The discriminant function is then $a^T x = x^T a = x^T S^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$. The discriminant boundary would then naturally be set orthogonal to the direction of the projection $a$. Where it should be put along this line is open to some debate. The discriminant boundary is given by $a^T x - a_0 = 0$, with $a_0$ determining the position of this hyperplane. A common choice of $a_0$ is the mid-point of the two class sample means in the projected space, which leads to $a_0 = a^T (\hat{\mu}_1 + \hat{\mu}_2)/2$. If any sort of statistical model is applied to each class, one can choose to place the threshold at the point where the posterior density under each class is equal.

3.7 Estimation of Error Rates

In addition to selecting and training a classifier, we would also like to estimate its accuracy on new observations. To be precise, we would like to estimate the probability of the classifier $\eta()$ misclassifying a new observation $(X, Y)$ drawn from $F$:

\mathrm{Err} = E_F \, Q(Y, \eta(X)). \qquad (3.28)

If the costs of misclassification vary from one class to another, or just for a better understanding of the classifier's capabilities, we would also like to estimate $\mathrm{Err}_g$, $g = 1, \ldots, G$. Note that

\mathrm{Err}_g = E_{F \mid Y=g} \, Q(Y, \eta(X)) = P(\eta(X) \neq Y \mid Y = g) = P(\eta(X) \neq Y, Y = g) / P(Y = g) = E_F \, Q(Y, \eta(X)) I(Y, g) / \pi_g.

We could go further and try to estimate every type of error rate $\mathrm{Err}_{gh}$, $g \neq h = 1, \ldots, G$. However, this is normally not necessary, and unless we have plenty of data from every class, these will be difficult to estimate. Note that the bias of any estimator $\hat{\mathrm{Err}}$ of the predictive error rate is $E(\hat{\mathrm{Err}} - \mathrm{Err})$, where the expectation is over the distribution producing whatever data is used to estimate the error rate.
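Fisher's construction from Section 3.6 can be sketched with simulated data. This is a minimal illustration; the maximising direction is taken proportional to $S^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$, with the threshold at the projected midpoint of the class sample means:

```python
import numpy as np

rng = np.random.default_rng(3)

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, 200)
X2 = rng.multivariate_normal([2.0, 1.0], Sigma, 200)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# pooled within-class sample covariance
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (len(X1) + len(X2) - 2)

# Fisher direction: a proportional to S^{-1}(mu1 - mu2), normalised to unit length
a = np.linalg.solve(S, mu1 - mu2)
a = a / np.linalg.norm(a)

# threshold a0 at the midpoint of the projected class means
a0 = a @ (mu1 + mu2) / 2

def fisher_class(x):
    # class 1 projects above the threshold because a points from mu2 towards mu1
    return 1 if a @ x - a0 > 0 else 2
```

Note that no distributional assumption was needed to derive the direction; only the class means and the pooled covariance enter.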
If we have used all the available labelled data to train our classifier, then we have no independent observations left for the estimation of error rates. The classic solution, akin to reporting $R^2$ in regression, is to reuse the training data to test the classifier. This leads to a set of estimates known as the apparent error rates. These are often optimistic, i.e. downwardly biased, and hence less than ideal. The bias tends to be more severe if the classifier is capable of over-fitting the data, a problem that tends to worsen as the number of dimensions $p$ increases.

The obvious alternative is to separate the labelled data into training and test sets, using the former to train the classifier and the latter to test it, i.e. estimate the error rates. This was originally known as the holdout method, but is now more commonly known as the test set method. This raises a number of questions, however.

1. Which classifier are we trying to estimate the error rate for?
2. What proportion of the data will we allocate to the test set?
3. How will we implement this proportion?
4. How can we provide confidence intervals for the error rate?

The first question is easily forgotten, but needs a clear answer. First we must ask: which classifier would we like to use for the prediction of new observations? Or more precisely: which data would we like to use to train this classifier? In most cases, the answer will be to use all the available data to train what may be termed the final classifier. This is the classifier being trained when using the apparent error rate estimator. However, it is not possible to assess this classifier directly with the test set method. We must leave some fraction of the data out for testing. Reducing the size of the training set will tend to reduce the predictive accuracy of the trained classifier, and hence lead to an upward bias in the error rate.
While any form of bias is unfortunate, this bias is likely to be small if the proportion left out is small, and a conservative estimate is probably more acceptable than an optimistic one.

This tends to suggest that the answer to the second question is to leave out less than half the data for testing. Proportions such as 1/3, 1/4, 1/5 or 1/10 of the data may be reasonable. One would like the test set to represent the distribution as well as possible, so it will need some observations from every class, and sufficient observations so that confidence intervals for the error rates are not too wide. All of this is easier if one has a large dataset. One answer to the fourth question is to view the number of errors observed out of the total as the outcome of a binomial experiment, and hence use one of the corresponding confidence intervals for proportions. The width of any such confidence interval is proportional to $1/\sqrt{n_{test}}$, where $n_{test}$ is the size of the relevant test set.

In answer to question 3, one typically randomly reorders the data, and then takes the last $n_{test}$ observations $(X_{test}, y_{test})$ to be the test set, leaving $n_{train} = n - n_{test}$ observations $(X_{train}, y_{train})$ in the training set. We assume that each observation has been drawn independently from the distribution $F$, but it is possible that this is not truly the case or that there has been some sorting of the data after collection. The random reordering acts to reduce the effect of either of these. For example, if the observations from each class have been grouped together, the random reordering will produce a mix of classes in a test set taken as described above. The overall error rate is then estimated via

\hat{\mathrm{Err}}^{test} = \frac{\sum_{i=1}^{n_{test}} Q[y_i, \eta_{(X_{train}, y_{train})}(x_i)]}{n_{test}}. \qquad (3.29)

The corresponding class-specific error rate estimators are:

\hat{\mathrm{Err}}^{test}_g = \frac{\sum_{i=1}^{n_{test}} Q[y_i, \eta_{(X_{train}, y_{train})}(x_i)] I(y_i, g)}{\sum_{l=1}^{n_{test}} I(y_l, g)}, \qquad (3.30)

where $I(\cdot, \cdot)$ is the indicator function, equal to 1 when its arguments are equal and 0 otherwise.
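The test set estimators (3.29)-(3.30) can be sketched end to end. The data and the classifier here are illustrative only: a one-dimensional two-class problem and a simple midpoint-threshold rule standing in for any trained classifier:

```python
import numpy as np

rng = np.random.default_rng(5)

# simulated two-class data in one dimension
n = 300
y = rng.integers(1, 3, size=n)                       # labels in {1, 2}
x = rng.normal(np.where(y == 1, 0.0, 2.0), 1.0)

# random reordering, then hold out the last n_test observations
perm = rng.permutation(n)
x, y = x[perm], y[perm]
n_test = n // 4
x_tr, y_tr = x[:-n_test], y[:-n_test]
x_te, y_te = x[-n_test:], y[-n_test:]

# a simple rule: threshold halfway between the class training means
thr = (x_tr[y_tr == 1].mean() + x_tr[y_tr == 2].mean()) / 2
pred = np.where(x_te < thr, 1, 2)

err_test = float(np.mean(pred != y_te))                        # overall (3.29)
err_g = {g: float(np.mean(pred[y_te == g] != g)) for g in (1, 2)}  # class-specific (3.30)
```

Counting the errors as binomial outcomes, an approximate confidence interval for the error rate follows from the usual interval for a proportion with `n_test` trials.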
Cross-validation (CV) was proposed around 1968 to improve on the test set approach (Lachenbruch and Mickey, 1968). Further details were presented by Stone (1974) and Geisser (1975). These re-sampling methods involve a partition into a test set and a training set and the counting of errors, but then repeat this procedure many times, with different systematically or randomly chosen subsets used for testing.

The simplest such approach leaves out only one observation for testing at each iteration, and uses the remaining $n - 1$ observations for training. This is then repeated $n$ times, each time leaving a different observation out. This leave-one-out cross-validation (LOO CV) error rate estimator can be written as follows:

\hat{\mathrm{Err}}^{LOOCV} = \frac{\sum_{i=1}^{n} Q[y_i, \eta_{(X_{All}^{-i}, y_{All}^{-i})}(x_i)]}{n}, \qquad (3.31)

where $-i$ means that the $i$th observation has been left out of this set. The corresponding class-specific error rate estimators are:

\hat{\mathrm{Err}}^{LOOCV}_g = \frac{\sum_{i=1}^{n} Q[y_i, \eta_{(X_{All}^{-i}, y_{All}^{-i})}(x_i)] I(y_i, g)}{\sum_{l=1}^{n} I(y_l, g)}. \qquad (3.32)

This method is attractive since it uses nearly all the data for training each time, and yet manages to eventually test on all $n$ observations. The latter property is common to all cross-validation schemes, and the former should reduce the bias of the estimator. However, both theory and experiment have shown that the variance of the estimator can unfortunately be fairly high. This is believed to be due to the high similarity of the training sets, with the effect that any unusual aspects of the sample will tend to be modelled in each of the $n$ trained classifiers. The other drawback of this approach is that it requires the model to be refit $n$ times, which can be undesirable or prohibitive when $n$ is large.

Geisser proposed an alternative known as K-fold cross-validation. This requires the removal of approximately $n/K$ observations (known as a 'fold') for testing at each iteration.
This is repeated a total of $K$ times with separate non-overlapping folds each time until each observation has been used for testing once.

$$\hat{\mathrm{Err}}_{KCV} = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n_k} Q[y_i^k, \eta_{(X_{All}^{\prime k}, y_{All}^{\prime k})}(x_i^k)]}{n}, \quad (3.33)$$

where $n_k$ is the number of observations in the $k$th fold, $x_i^k$ is the $i$th observation in the $k$th fold and $\prime k$ means without the $k$th fold. The corresponding class-specific error rate estimators are:

$$\hat{\mathrm{Err}}_{KCV,g} = \frac{\sum_{k=1}^{K} \sum_{l=1}^{n_k} Q[y_l^k, \eta_{(X_{All}^{\prime k}, y_{All}^{\prime k})}(x_l^k)]\, I(y_l^k, g)}{\sum_{l=1}^{n} I(y_l, g)}. \quad (3.34)$$

K-fold CV requires the model to be fit to data only $K$ times, where $K$ is typically chosen to be 2, 3, 5 or 10. LOO CV can be seen as $n$-fold CV. As $K$ increases, the bias of the estimator due to left out data decreases, but the variance tends to increase, and $K = 5$ or $K = 10$ are common compromise values.

Other variations exist, which may improve upon the above. Breiman and colleagues developed balanced CV, which allocates observations to folds so that, whenever possible, each class is represented in each fold, with little variation in the class proportions among folds. It is unclear as to whether this method has better properties than K-fold CV.

The random allocation of observations to folds implies that the results of a single CV run will not be unique. Hence one can consider $v$ repetitions of K-fold CV. However, Burman's results (Burman, 1989) showed that this is not an improvement over $Kv$-fold CV. Another noteworthy option is repeated learning-testing, in which a test set of a certain size is chosen at random, a classifier is trained on the training portion, tested on this test set and the errors counted. This is then repeated as many times as desired, with the error rates estimated from the aggregated test results. Again, the results of Burman do not seem to indicate any improvement over K-fold CV. Overall, Burman's results tend to favour LOO-CV, but these are not conclusive.
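K-fold CV can be sketched as follows (Python; the dataset, fold count and nearest-centroid classifier are illustrative assumptions, and LOO CV is recovered by setting $K = n$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: two 1-D classes.
X = np.concatenate([rng.normal(0, 1, 30), rng.normal(2, 1, 30)]).reshape(-1, 1)
y = np.repeat([0, 1], 30)
n, K = len(y), 5

def nearest_centroid(X_tr, y_tr, x):
    # Stand-in classifier eta, refit once per fold.
    cents = np.array([X_tr[y_tr == g].mean(axis=0) for g in np.unique(y_tr)])
    return int(np.argmin(((cents - x) ** 2).sum(axis=1)))

# Randomly allocate observations to K non-overlapping folds.
folds = np.array_split(rng.permutation(n), K)
errors = 0
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    for i in test_idx:
        errors += nearest_centroid(X[train_idx], y[train_idx], X[i]) != y[i]
err_kcv = errors / n    # equation (3.33): every observation tested exactly once
```

Only $K$ refits are needed, which is the practical advantage over LOO CV the text describes.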
If one needs to select a model out of $K$ possible models using CV and then also test it using CV, one can use the following procedure, first covered by Stone (1974) for the LOO CV case. It has been called nested CV, two-layer CV and two-level CV in the literature.

1. Split the data into $m$ folds. $m - 2$ of these are for training, one for validation and one for testing.

2. Using the training and validation data: perform $m - 1$ fold CV for $k = 1, \ldots, K$
   • Estimate the overall error for each $k$ (from the CV results)
   • Choose $k$ which gives the lowest error rate

3. Now that we have, say, $k = k_1^*$, train the classifier on the training and validation data. Test on the test fold.

4. Repeat for other choices of test fold (total of $m$ times), getting a different $k_2^*, \ldots, k_m^*$ each time. Average the results to estimate the error rate of the final classifier on future observations.

5. For the final classifier, choose a new $k = k_{final}^*$ using a similar procedure, but without a test set (only training and validation sets defined).

6. The final classifier is trained on all of the data using $k = k_{final}^*$.

Note that while we advocate reporting the overall error rate and the class-specific error rates, some authors choose instead to report a confusion matrix, which is a matrix whose $(i, j)$th element is the number of observations labelled with class $i$ which the classifier predicted to have class $j$. If this matrix is divided through by the sum of its entries (the total number of observations in the test set(s)), then the sum of the off-diagonal elements should equal $\hat{\mathrm{Err}}$. If instead the $g$th row is divided by its total, then the sum over this row, excluding the $g$th element, will be $\hat{\mathrm{Err}}_g$. Some other methods of error rate estimation are based on the bootstrap (Efron, 1981, 1983).
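The nested CV steps above can be sketched as follows. This is an illustrative Python implementation: the data, the use of a small k-NN classifier as the model family being tuned, and the candidate set `ks` are all assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(0, 1, 30), rng.normal(2.5, 1, 30)]).reshape(-1, 1)
y = np.repeat([0, 1], 30)
n, m = len(y), 5
ks = [1, 3, 5]                      # K candidate models: k-NN with these k

def knn_errors(X_tr, y_tr, X_te, y_te, k):
    # Count test errors of a tiny 1-D k-NN classifier (illustrative choice).
    errs = 0
    for x, yi in zip(X_te, y_te):
        votes = y_tr[np.argsort(np.abs(X_tr[:, 0] - x[0]))[:k]]
        errs += np.bincount(votes).argmax() != yi
    return errs

folds = np.array_split(rng.permutation(n), m)
outer_errors = 0
for t in range(m):                  # step 4: each fold serves as test fold once
    test_idx = folds[t]
    inner = [folds[j] for j in range(m) if j != t]
    # step 2: (m-1)-fold CV over training+validation folds to choose k
    cv_err = {k: 0 for k in ks}
    for v in range(m - 1):
        val_idx = inner[v]
        tr_idx = np.concatenate([inner[j] for j in range(m - 1) if j != v])
        for k in ks:
            cv_err[k] += knn_errors(X[tr_idx], y[tr_idx], X[val_idx], y[val_idx], k)
    k_star = min(ks, key=lambda k: cv_err[k])
    # step 3: train on all training+validation folds, test on the test fold
    tr_idx = np.concatenate(inner)
    outer_errors += knn_errors(X[tr_idx], y[tr_idx], X[test_idx], y[test_idx], k_star)
err_nested = outer_errors / n       # step 4: honest error estimate
```

The key point is that the test fold never influences the choice of $k^*$, which is what keeps the final error estimate honest.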
Efron described the non-parametric bootstrap estimator of error rate as follows:

$$\hat{\mathrm{Err}}^{(BOOT)} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} Q[y_i, \eta_b(x_i)], \quad (3.35)$$

where $C^{-i}$ is the set of bootstrap samples which do not contain observation $i$ and $|C^{-i}|$ is its cardinality. A non-parametric bootstrap sample is constructed by taking a sample of size $n$ with replacement from the observed $n$ observations, i.e. sampling from the empirical distribution of the data. As a result, many observations are not sampled at all, and others are sampled more than once. The probability that a particular observation becomes part of a bootstrap sample is 1 minus the probability that it is left out, which is $1 - (1 - 1/n)^n$. As $n \to \infty$, this approaches $1 - 1/e \approx 0.632$. The expected number of unique observations in a bootstrap sample is then approximately $0.632n$. As a result, there is generally some upward bias in $\hat{\mathrm{Err}}^{(BOOT)}$. Efron proposed that this be corrected by combining it with the apparent error rate estimate $\mathrm{err}$ as follows:

$$\hat{\mathrm{Err}}^{(.632)} = 0.368\, \mathrm{err} + 0.632\, \hat{\mathrm{Err}}^{(BOOT)}. \quad (3.36)$$

Despite some theoretical encouragement, there is little evidence that this or its successors are better estimators than those based on cross-validation.

LOO CV was preceded by a related LOO procedure, the jackknife. This was developed to estimate the bias (Quenouille, 1949, 1956) and standard error (Tukey, 1958) of a particular estimation procedure. The $i$th jackknife sample $x_{(i)}$, $i = 1, \ldots, n$ leaves out the $i$th observation. One then estimates the parameter vector $\theta$ based on each of these samples, resulting in $\hat\theta_{(i)}$, $i = 1, \ldots, n$, and additionally produces the estimate $\hat\theta$ based on the full sample. We then define the average jackknife estimate as

$$\hat\theta_{(.)} = \frac{1}{n} \sum_{i=1}^{n} \hat\theta_{(i)}. \quad (3.37)$$

The jackknife estimate of bias is given by

$$\hat{\mathrm{bias}}_{jack} = (n - 1)\left(\hat\theta_{(.)} - \hat\theta\right). \quad (3.38)$$

The jackknife estimate of standard error is given by

$$\hat{\mathrm{se}}_{jack} = \left[\frac{n - 1}{n} \sum_{i=1}^{n} \left(\hat\theta_{(i)} - \hat\theta_{(.)}\right)^2\right]^{1/2}. \quad (3.39)$$

See e.g.
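The jackknife formulas (3.37)–(3.39) can be demonstrated on the plug-in (divide-by-$n$) variance estimator, a standard textbook choice of biased estimator; the data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 2, 50)
n = len(x)

# Estimator under study: the biased (1/n) variance estimate (illustrative).
def theta(sample):
    return np.mean((sample - sample.mean()) ** 2)

theta_hat = theta(x)
# Jackknife replicates: recompute theta with observation i left out.
theta_i = np.array([theta(np.delete(x, i)) for i in range(n)])
theta_dot = theta_i.mean()                                       # (3.37)

bias_jack = (n - 1) * (theta_dot - theta_hat)                    # (3.38)
se_jack = np.sqrt((n - 1) / n * ((theta_i - theta_dot) ** 2).sum())  # (3.39)

# For this estimator the jackknife bias correction is exact: subtracting
# the estimated bias recovers the unbiased sample variance s^2.
theta_corrected = theta_hat - bias_jack
```

The exactness of the correction here is special to quadratic functionals such as the variance; in general the jackknife only removes the leading $O(1/n)$ bias term.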
Efron and Tibshirani (1993) p141–143 for derivations.

3.8 Mixture Discriminant Analysis

For a classifier to approach the optimal Bayes rule, it needs to use sufficiently flexible models of the class distributions so that these can be modelled accurately, and enough data so that the fit of each is as close as possible to that of the true class distributions. One option is to fit a mixture model to each class. This was done in applications by McLachlan and Gordon (1989) and formalised by Hastie and Tibshirani (1996), who named the method mixture discriminant analysis (MDA).

MDA assumes that each mixture component in each class is normal and that each has the same covariance matrix $\Sigma$. This method is available in R using the mda package. The equal covariance matrices restriction was lifted by Fraley and Raftery (2002), who have since maintained an R package which implements this method and is the most popular R package for mixture modelling generally: mclust.

To use this method, one does the following. For each of the $G$ classes, fit a normal mixture model to the observations from that class. Use a method such as BIC to choose the number of components, separately for each class. For a new observation, calculate its density using the mixture distributions for each of the classes. Weight these density values by the class probabilities (known or estimated) and assign the observation to the class with the highest weighted density at that point. This imitates the Bayes-optimal rule, with the true class densities each replaced by a different mixture model. For a new point $x$, we can write the discriminant rule as

$$g^* = \arg\max_g \pi_g \hat f_g(x), \quad (3.40)$$

where $\pi_g$ is the prior probability of class $g$ and

$$\hat f_g(x) = \sum_{h=1}^{H_g} \lambda_{gh} \hat f_{gh}(x), \quad (3.41)$$

where $\lambda_{gh}$ is the probability of component $h$ within the mixture model used to fit the $g$th class and $\hat f_{gh}(x)$ is the fitted $h$th component density of the mixture used to fit the $g$th class.
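The discriminant rule (3.40)–(3.41) can be sketched directly. In Python (the notes use R's mda/mclust for the actual fitting), with hand-specified one-dimensional mixture parameters standing in for the EM estimates that would come from each class's data:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # Univariate normal density.
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Assumed fitted mixtures: class 1 has two components, class 2 has one.
# Each tuple is (lambda_gh, mu, sigma^2); all values are illustrative.
mixtures = {
    1: [(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)],
    2: [(1.0, 4.0, 1.0)],
}
priors = {1: 0.5, 2: 0.5}                    # pi_g, assumed known

def f_hat(g, x):
    # Equation (3.41): fitted mixture density for class g.
    return sum(lam * normal_pdf(x, mu, s2) for lam, mu, s2 in mixtures[g])

def classify(x):
    # Equation (3.40): highest prior-weighted fitted density wins.
    return max(priors, key=lambda g: priors[g] * f_hat(g, x))
```

A point near the bimodal class-1 bulk is assigned to class 1, while a point near 4 is assigned to class 2, imitating the Bayes rule with fitted densities.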
3.9 Kernel Density Discriminant Analysis

Kernel density estimation is typically familiar from plots - it gives a quick impression of the shape of a data distribution by smoothing the empirical distribution. However, since it provides a flexible non-parametric model of a density, it can be used in a variety of contexts, including discriminant analysis (classification). In this context, one uses a different kernel density estimate for each class and then uses these to approximate the Bayes' classifier, as in LDA, QDA and MDA.

While kernel density estimation is most often done in only one dimension, it can be extended to an arbitrary number of dimensions. However, it tends to work best for just a few dimensions. For a random sample in $p$ dimensions $\{x_1, \ldots, x_n\}$, the kernel density estimate is given by

$$\hat f(x; H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i).$$

Here $K(x)$ is a kernel function, which needs to be a symmetric pdf, $H$ is a symmetric, positive-definite bandwidth matrix and $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$, where $H^{-1/2}$ is determined via spectral decomposition, as explained in section 2.6.1.

A number of options exist for the kernel, and one of these is the standard multivariate normal density, i.e.

$$K(x) = \frac{1}{(2\pi)^{p/2}} e^{-\frac{1}{2} x^T x}.$$

This leads to

$$K_H(x) = \frac{1}{(2\pi)^{p/2} |H|^{1/2}} e^{-\frac{1}{2} x^T H^{-1} x},$$

i.e. the kernel function is a multivariate normal density with mean 0 and covariance matrix $H$. So each observation will be the centre of a multivariate normal distribution.

Effectively the kernel density model is a mixture model with $n$ components. Given that the number of components and the means and proportions have been determined, the main task is to choose the bandwidth matrix $H$. Note that the matrix is common to all of the components. The two main options for this matrix are to make it diagonal (i.e. without covariance terms) or unconstrained, i.e. a full covariance matrix.
In either case the aim of the density estimation is to provide the best possible approximation to the distribution the data was drawn from, which we will assume has density $f$. The usual measure of distance between $f$ and $\hat f$ is mean integrated squared error (MISE), and here we consider it as a function of the bandwidth matrix:

$$\mathrm{MISE}(H) = E \int_{\mathbb{R}^p} \left[\hat f(x; H) - f(x)\right]^2 dx.$$

The optimal choice of $H$ will minimise $\mathrm{MISE}(H)$ over symmetric, positive definite $p \times p$ matrices. $\mathrm{MISE}(H)$ can be approximated using a wide range of methods including cross-validation. The derivations for these are substantial and one can refer to Wand and Jones (1995) or Chacon and Duong (2011) for examples, bearing in mind that this remains an active research area. Most methods require numerical optimization of $H$ to minimise the approximated MISE criterion, and since a general $H$ contains $p(p + 1)/2$ unique parameters, this provides one of the challenges to utilizing such methods in higher dimensions. There are also some computational difficulties in using large sample sizes, although various approximations have been developed to reduce these.

Kernel density discriminant analysis works in the same way as mixture discriminant analysis or other stochastic-model-based procedures which approximate the Bayes-optimal rule. Each class density is approximated by a different kernel density model, fitted using data from that class alone. For a new point $x$, the discriminant rule is

$$g^* = \arg\max_g \pi_g \hat f_g(x),$$

where $\pi_g$ is the prior probability of class $g$ and $\hat f_g(x)$ is the kernel density estimate of the $g$th class evaluated at $x$. One implementation of multivariate kernel density estimation and classification is via the ks package in R, as described in Duong (2007), which can handle up to 6 dimensions. If wishing to use this with higher-dimensional data, one would have to reduce the dimensionality first, likely losing some information.
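A minimal kernel density discriminant sketch in Python (the notes use the R ks package), with a normal kernel, an assumed fixed diagonal bandwidth matrix rather than a MISE-optimised one, and illustrative two-class data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative 2-D training data for two classes.
X1 = rng.normal([0, 0], 1, (40, 2))
X2 = rng.normal([3, 3], 1, (40, 2))
p = 2

def kde(X, H_diag, x):
    # f_hat(x; H) = (1/n) sum_i K_H(x - x_i), normal kernel, diagonal H,
    # so the quadratic form (x - x_i)^T H^{-1} (x - x_i) is a scaled sum.
    n = len(X)
    quad = ((x - X) ** 2 / H_diag).sum(axis=1)
    norm = n * (2 * np.pi) ** (p / 2) * np.sqrt(np.prod(H_diag))
    return np.exp(-0.5 * quad).sum() / norm

H = np.array([0.5, 0.5])          # assumed bandwidths, not optimised
priors = {1: 0.5, 2: 0.5}
data = {1: X1, 2: X2}

def classify(x):
    # Discriminant rule: arg max over g of pi_g * f_hat_g(x).
    return max(priors, key=lambda g: priors[g] * kde(data[g], H, np.asarray(x)))
```

Each class density is a mixture of 40 Gaussians centred at that class's training points, exactly the "$n$-component mixture" view given in the text.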
3.10 Nearest Neighbour Classifiers

We will now take a direction away from likelihood-based models to consider some competing forms of classifier. First we consider the nearest neighbour algorithm, which is probably the simplest. When presented with a new observation of the predictor variables $x$, one looks for the nearest point in the training set, and gives the new observation the same class.

An extension of this idea is the k-nearest neighbours algorithm. When presented with a new observation of the predictor variables $x$, one looks for the $k$ nearest points in the training set - these can be termed the neighbours. We then predict the class of the new observation to be the class which occurs most frequently among the neighbours. When $k = 1$, this is the nearest neighbour classifier.

A number of issues remain:

1. What kind of distance metric should we use?
2. How do we deal with ties?
3. How should we choose $k$?
4. What properties does this algorithm have?

Any metric that encapsulates our concept of closeness adequately is suitable. The most common choice is Euclidean distance, i.e. the $L_2$ norm

$$||x - x_i||_2 = \sqrt{(x - x_i) \cdot (x - x_i)} = \sqrt{(x_1 - x_{i1})^2 + \cdots + (x_p - x_{ip})^2}. \quad (3.42)$$

One problem that arises with most choices of distance metric is a lack of scale invariance. For example, if we have $p$ predictor variables, where one of them is measured in metres, but we decide to instead report it in centimetres, we will have scaled that dimension up by a factor of 100. We would hope that the k-nearest neighbour rule would be unaffected, but with the rule defined above, the effect would be large and undesirable. As an example consider the following plots (R code is in Appendix B). It is intended to show two classes, with the $x_1$ values both unscaled and scaled. In the first plot, a nearest neighbour classifier would work well. It would work less well in the second case. Note that the plotting commands use equal scales on each axis.
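The k-nearest neighbours rule with the Euclidean distance of (3.42) can be sketched as follows (Python; the tiny training set is illustrative, and ties are broken here by first encounter rather than the random tie-breaking discussed in the text):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances to every training point: the L2 norm of (3.42).
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Classes of the k nearest points ("the neighbours").
    neighbours = y_train[np.argsort(d)[:k]]
    # Majority vote; Counter.most_common breaks ties by first encounter.
    return Counter(neighbours.tolist()).most_common(1)[0][0]

# Illustrative training data: two clusters, classes 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3)
```

Rescaling one column of `X_train` (say, metres to centimetres) changes `d` and hence potentially the neighbours, which is exactly the scale-invariance problem described above.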
This also casts doubt over any particular use of a distance metric in the predictor space. An effective metric will have to offer a way of making distances comparable across predictor variables. Imagine we only have two predictor variables, $x_1$ and $x_2$. If we have absolutely no prior information on these variables, then we might decide to scale each dimension so that some measure of spread is similar along each of $x_1$ and $x_2$, i.e. we would scale each of these so that their sample estimate of some measure of spread is 1. Available measures of spread include the standard deviation and the inter-quartile range. The former is probably better when the marginal distributions appear approximately normal (symmetric, fast-decaying tails), while the latter can tolerate 'outliers' and 'heavy tails'. If $x_1$ is approximately normally distributed and $x_2$ is approximately $t_1$ (Cauchy) or $t_2$ distributed, then extreme points in $x_2$ will only tend to be closest to other extreme points (in the same direction) under Euclidean or most other metrics. When there are many such heavy-tailed distributed predictor variables, such extreme points are unlikely to be chosen among the neighbours of any new observation.

An alternative approach, proposed by Olshen (1977) and Devroye (1978), uses ranks. For each predictor variable, the data and the new observation's value are given ranks from 1 to $n + 1$ corresponding to smallest (or most negative) to largest. The data is then effectively replaced by these ranks: $r_i^{(j)}$ being the rank of the $j$th component of the $i$th observation, and $r^{(j)}$ being the rank of the $j$th component of the new observation. The empirical distance between the new observation $x$ and $x_i$ is then defined to be

$$\rho(x, x_i) = \max_{1 \le j \le p} |r_i^{(j)} - r^{(j)}|. \quad (3.43)$$

Since all the distances will be integer-valued, ties will frequently occur, and are usually broken by (uniform) random ordering.
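The rank-based distance (3.43) can be sketched as follows (Python; the data are illustrative, and rank ties are broken here by position rather than the random ordering described in the text):

```python
import numpy as np

def ranks(v):
    # Rank values from 1 (smallest) to len(v); stable sort breaks ties
    # by position (a simplification of random tie-breaking).
    order = np.argsort(v, kind="stable")
    r = np.empty(len(v), dtype=int)
    r[order] = np.arange(1, len(v) + 1)
    return r

def rank_distance(X, x_new):
    # Olshen/Devroye empirical distance: for each predictor j, pool the
    # n training values with the new value, rank them 1..n+1, then take
    # the max absolute rank difference over predictors (equation (3.43)).
    n, p = X.shape
    dists = np.zeros(n, dtype=int)
    for j in range(p):
        r = ranks(np.append(X[:, j], x_new[j]))
        dists = np.maximum(dists, np.abs(r[:n] - r[n]))
    return dists

# Illustrative data: note the gross outlier 100.0 in the first predictor.
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [100.0, 25.0]])
d = rank_distance(X, np.array([2.5, 22.0]))
```

Because only ranks enter the distance, the outlier 100.0 is no further from the new point in the first coordinate than its rank position implies: the method is invariant to any monotone transformation of each predictor.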
If we have prior information that, for example, 2 units in $x_1$ are equivalent to 1 unit in $x_2$, then we may use this instead. We could then normalise the space so that 1 unit is equivalent in both by replacing $x_1$ with $x_1' = x_1/2$. See the following for further details: Devroye et al. (1996).

If there is a tie in terms of the most common class, we typically choose the assigned class uniformly at random amongst those classes.

One would like to choose $k$ such that it usually picks up observations which have the same class as $x$, the new observation. If the regions which have highest (true) posterior probability of membership of a class are reasonably large and close to spherical in the (possibly transformed) predictor space, then we should be able to pick $k$ such that $k/n$ is appreciable. If not, we will need to pick a smaller $k$. Of course the smallest value we can pick is 1, and there is no guarantee that we have another point in the dataset sufficiently close to $x$ that we can infer anything useful about its class. Any suggestions?

It has been shown that the k-nearest neighbour classifier is universally consistent, i.e. that it approaches the Bayes-optimal rule as $n \to \infty$, provided $k \to \infty$ and $k/n \to 0$. It has also been shown that the risk of the nearest neighbour rule is no more than twice that of the Bayes-optimal rule, and the risk of the k-nearest neighbours rule is no more than $(1 + \sqrt{2/k})$ times that of the Bayes-optimal rule. See Devroye et al. (1996) for proofs and further discussion.

3.11 Classification Trees

Classification trees were developed independently by multiple groups of researchers. The most famous examples are CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). These are rule-based rather than model-based classifiers which iteratively partition the predictor space into rectangular-shaped areas, each of which is completely assigned to a single class (or value in the case of regression trees).
For a given dataset, one builds a classification tree by iteratively choosing a predictor variable and then determining the optimal way to split up that variable's support among the classes. The aim is that each subgroup is closer to being all of one class. In other words, we are choosing the location of decision boundaries defined by values of a particular variable. Once this is done, we look at each part of the space and decide if we should split it again, and if so, we must again choose a predictor variable and the value(s) at which to split it. We can continue this splitting until all the "leaf" nodes (those with no children) each contain training observations from only one class, or no further useful splits can be found, or according to some criterion.

Once a tree is constructed, one passes each training observation through it by starting at the root node of the tree and choosing a sub-tree based on the split criteria applied to that observation. Once one reaches a leaf node, one checks the proportion of the training observations at this node belonging to each class. The node is then assigned the class which is most common among training observations at that node. One uses the classification tree to classify a new observation $x$ by passing it through the node decision rules until reaching a class-assigning leaf node.

Ultimately we face the following questions:

1. What criterion are we seeking to optimise in selecting a variable and a split point?
2. How will we do this optimization?
3. How do we decide when to stop splitting?
4. How can we interpret the resulting tree?

In answering the first question, one might first think that we should minimise misclassifications when applying the classifier to the training set. This is perfectly reasonable and is sometimes used.

Figure 3.1: Node impurity functions with two classes - from Hastie et al. (2009).
The three most popular measures of what is termed "node impurity" are cross-entropy, the Gini index and misclassification rate. When there are $G$ possible classes, then at the $m$th node, the cross-entropy is defined to be

$$\text{Cross-entropy} = I_C(m) = -\sum_{g=1}^{G} \hat p_{mg} \log \hat p_{mg}, \quad (3.44)$$

where $\hat p_{mg}$ is the proportion of the training observations in class $g$ out of those which pass through the decision tree to the $m$th node. In cases where a particular class $g'$ is not present among the training observations at this node, we take $\hat p_{mg'} \log \hat p_{mg'}$ to be zero, as per the asymptotic limit. The Gini index at the $m$th node is defined to be

$$\text{Gini index} = I_G(m) = \sum_{g=1}^{G} \hat p_{mg}(1 - \hat p_{mg}). \quad (3.45)$$

The (training) misclassification rate at the $m$th node is defined to be

$$\text{Misclassification rate} = I_M(m) = 1 - \hat p_{mg^*}, \quad (3.46)$$

where $g^* = \arg\max_g \hat p_{mg}$. These three node impurity functions are shown in Figure 3.1 for a two class classification case versus $p$, here the proportion of observations in the first class. In practice, the choice of impurity type often makes very little difference to the type of tree formed.

One can stop splitting the tree at a given node once all the observations at this node are from the same class. We then simply assign that class to any new observations which arrive at this node. Conversely, the maximum value of these functions occurs when each class is equally likely in the empirical distribution based on training observations at that node. Hence the natural aim is to reduce the node impurity to the greatest extent possible via each split. We define

$$\Delta I(m) = I(m) - \hat P_{Lm} I(m_L) - (1 - \hat P_{Lm}) I(m_R), \quad (3.47)$$

where $m_L$ and $m_R$ are the indices of the left and right descendent nodes (children) of node $m$, respectively, and $\hat P_{Lm}$ is the proportion of the population reaching node $m$ which will be assigned to node $m_L$, i.e. take the left branch, as estimated based on the training data. All these terms are naturally dependent on the way the split is carried out, i.e. the parameters of the split.
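The three impurity measures (3.44)–(3.46) are straightforward to compute from the class proportions at a node; a short Python sketch, with the $0 \log 0 = 0$ convention handled explicitly:

```python
import numpy as np

def cross_entropy(p):
    # I_C(m), equation (3.44); terms with p = 0 are taken to be zero.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def gini(p):
    # I_G(m), equation (3.45).
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))

def misclassification(p):
    # I_M(m), equation (3.46): 1 minus the largest class proportion.
    return 1 - np.max(p)

# All three are zero at a pure node and maximal at equal class proportions,
# as in Figure 3.1.
for impurity in (cross_entropy, gini, misclassification):
    assert impurity([1.0, 0.0]) == 0.0
    assert impurity([0.5, 0.5]) >= impurity([0.9, 0.1])
```

At class proportions $(0.4, 0.6)$ the cross-entropy is approximately $0.673$, which is the root-node value used in the worked example below equation (3.49).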
The split parameters of the $m$th node consist of the choice of variable on which to split, which can be labelled $j_m$, and the value of this variable at which to split. If the $j_m$th predictor variable is continuous, then we need a single real-valued split point $\lambda_m$. We would assign an observation $x$ to the left branch if $x_{j_m} < \lambda_m$ and otherwise to the right branch. We can use a similar rule if $x_{j_m}$ is a discrete variable or an ordinal categorical variable, although $\lambda_m$ would then be restricted to the second through to last values observed for this variable (when this number is finite). If the $j_m$th predictor variable is binary (e.g. $\in \{0, 1\}$), then we can assign an observation $x$ to the left branch if $x_{j_m} = 0$ and otherwise to the right branch. If the $j_m$th predictor variable is nominal, i.e. unordered categorical, we must still find a way to make a binary decision. One obvious choice is to consider splitting off a single category versus the rest at this node. One could instead consider subsets of categories. This may be computationally feasible if the number of such sets is reasonable, e.g. 100 or fewer.

When the split variable is finitely discrete or categorical, we can find the best choice of split, with respect to the change in impurity on the training set, by searching over the entire set of possible splits. When the split variable is continuous, we should ideally calculate $\Delta I(m)$ in analytic form and then seek to maximise this function. We would then use calculus, consider stationary points and any relevant boundary values. However, since we will not typically know the form of the class-generating distributions, for a particular attribute, we again need merely to search through the possible values of that attribute from the second smallest to the largest and split between that point and the previous point. If the split is to be made somewhere between consecutive training set values $a$ and $b$, $a < b$, then any value in that range will result in the same estimated node impurity.
One then commonly chooses the cut-point to be the midpoint $(a + b)/2$ to encourage reasonable generalisation capabilities. Another option is to choose a weighted point $(1 - \hat P_{Lm})a + \hat P_{Lm} b = a + \hat P_{Lm}(b - a)$, where $\hat P_{Lm}$ is the estimated proportion of observations which would be directed to the left node via this rule. This uses the assumption that observations falling in the interval $(a, b)$ should be allocated to the left and right sub-nodes with probabilities estimated by $\hat P_{Lm}$ and $1 - \hat P_{Lm}$, respectively. It is also consistent with the assumption that the probability density of points in the interval is uniform, since the length from $a$ to the cut-point is effectively considered proportional to the probability of the left branch.

Despite appearances, neither assumption is particularly reasonable. If, for example, the right branch has a higher proportion of observations, we have likely observed more extreme points from the right branch, that is, $b$ may already be closer to the ideal cut point than $a$. While it might be possible to determine ideal cut-points given full knowledge of the class densities $f_g(x)$, given the data, none of the cut points in the interval will appear to produce better performance than any other unless tested on left out data. The distribution between the points may not be uniform either, though this may be more reasonable if they are close together, which is more likely with larger datasets. The theory of maximum margins, as seen later with support vector machines, offers some more support for the use of the midpoint. However, while the splits sometimes produce class boundaries, splits are often performed which reduce impurity, but do not split classes, particularly when class prior probabilities are far from equal. There is little theory or experimental evidence to support either method of selecting cut-points, and they are unlikely to make much difference to performance.
Note that the form of the decision tree produced (order of splits, choice of split variables) is invariant under affine transformations of the variables, i.e. linear transformation plus constant. This useful property means that there is no need to normalise any of the predictor variables before fitting a tree to them. Note that if the data is given an affine transformation, the values of the split points will also undergo this transformation.

Multi-way splits on a single predictor variable are another option for the construction of trees. However, a well-chosen $K + 1$-way split will always be favoured over a $K$-way split if using a simple extension of the $\Delta I$ criterion. This can be corrected for by using the following criterion:

$$\Delta I(m) = \frac{I(m) - \sum_{k=1}^{K} \hat P_{km} I_{m_k}}{-\sum_{k=1}^{K} \hat P_{km} \log \hat P_{km}}, \quad (3.48)$$

where $\hat P_{km}$ is the estimated proportion of observations which would be directed to the $k$th child node of node $m$ via the proposed rule and $I_{m_k}$ is the impurity of the $k$th child node. Note that the denominator could only be zero if the split sent all training observations at node $m$ to a single child node; such a degenerate split would not be considered. Searching all possible multi-way splits out of a single node can be computationally expensive, and hence is often avoided in favour of binary splits.

Missing data in training is generally ignored on a predictor-by-predictor basis. However, when categorical data are missing, one can assign this predictor's value to an additional "missing" category and consider using that in the tree. If the data is not missing at random, this may be useful. If a new (test) observation is missing the value needed for a split point, one assigns it to the branch which contained more training observations. Given the sparse nature of a typical tree, it is quite likely that an observation with missing data will not have that predictor tested at a node.

One can extend the idea of impurity to the whole tree, by a summation over the set of leaf nodes.
This allows one to consider penalising the complexity of the tree to avoid making it too large and hence over-fitting the training data:

$$I_{penalised} = \sum_{m \in \text{leaf nodes}} I(m) + \alpha\, |\text{set of leaf nodes}|, \quad (3.49)$$

where $|\cdot|$ indicates cardinality. One then attempts to minimise $I_{penalised}$, with the simplest method being to not add any more nodes during tree construction if they cannot reduce $I_{penalised}$. One is still left with the choice of $\alpha$, and there is little informative guidance on this. One could choose it using cross-validation, but one would then be tempted to use this instead to look at classification error directly, and stop adding nodes once this fails to improve. Note that such cross-validation must be considered part of model-building, and so separate data would need to be held out for testing. A two layer cross-validation scheme would be a suitable method for such model building along with associated estimation of error rates.

Example in one dimension: Assume we have two classes and 5 observations, all of a single variable. In class 1 are the observations 1 and 6, and in class 2 are the observations 2, 5 and 11. Using the cross-entropy form of node impurity, what tree will be built?

The first thing is to consider the impurity of the root node, which we will number $m = 1$. Since we have 3 observations of class 2, and only 2 from class 1, if we had to stop at this node, we would make it of class 2. The cross-entropy of this node is

$$I_C(m = 1) = -\sum_{g=1}^{G} \hat p_{1g} \log \hat p_{1g} \quad (3.50)$$
$$= -(\hat p_{11} \log \hat p_{11} + \hat p_{12} \log \hat p_{12}) \quad (3.51)$$
$$= -(2/5 \log(2/5) + 3/5 \log(3/5)) \quad (3.52)$$
$$\approx 0.673. \quad (3.53)$$

The change in node impurity in response to a split is

$$\Delta I(m) = I(m) - \hat P_{Lm} I(m_L) - (1 - \hat P_{Lm}) I(m_R). \quad (3.54)$$

One needs to calculate the value of $\Delta I(m)$ for all $(n - 1)$ possible splits.
Note that the class of the sub-nodes (the two being constructed beneath the current node) is determined automatically by the more common class among training observations assigned to that sub-node, so there is no need to search over these allocations.

The ordered observations are 1, 2, 5, 6, 11. Consider the split: 1 | 2, 5, 6, 11 (where | indicates the split point), i.e. the split point is somewhere in the interval (1, 2). We will do what is most common in the literature and place the split point at the midpoint of (1, 2), i.e. at 1.5. We first consider the proportion of future observations we expect to go to the left branch of the split point. The only information we have is that 1/5 of the training observations would be allocated to the left branch, so $\hat P_{L1} = 0.2$. Now we need to calculate the $I_C$ values for the left and right sub-nodes, which we will number 2 and 3. Node 2 has just 1 observation of class 1, so it is a pure node and hence a leaf node (no further splits will be considered). Calculation of its impurity gives

$$I_C(m = 2) = -(\hat p_{21} \log \hat p_{21} + \hat p_{22} \log \hat p_{22}) = -(1 \log 1 + 0 \log 0) = 0, \quad (3.55)$$

as expected. Node 3 has 1 observation of class 1 and 3 of class 2. Its impurity is

$$I_C(m = 3) = -(\hat p_{31} \log \hat p_{31} + \hat p_{32} \log \hat p_{32}) = -(0.25 \log 0.25 + 0.75 \log 0.75) \approx 0.562. \quad (3.56)$$

Hence the change in node impurity for this split is

$$\Delta I(1) = I(1) - \hat P_{L1} I(2) - (1 - \hat P_{L1}) I(3) \approx 0.673 - (0.2 \times 0) - (0.8 \times 0.562) \approx 0.223. \quad (3.57)$$

The following splits are possible: 1 | 2, 5, 6, 11; 1, 2 | 5, 6, 11; 1, 2, 5 | 6, 11; and 1, 2, 5, 6 | 11.

Classification trees are generally considered very interpretable by non-statisticians. They resemble a logical decision-making process that first looks at the most important features (predictors), and makes a series of rule-based decisions leading to a final decision in often only a few steps. However, this likely over-simplifies the problem, and the rectangular decision boundaries so produced are unlikely to be a close match to the true Bayes-error-style decision boundaries from the generating distributions.
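The worked example can be checked by computing $\Delta I$ for all $n - 1 = 4$ candidate splits of the ordered observations 1, 2, 5, 6, 11; a Python sketch:

```python
import numpy as np

def cross_entropy(labels):
    # I_C for a node holding the given class labels (0 log 0 -> 0,
    # since absent classes contribute no term).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

x = np.array([1, 2, 5, 6, 11])       # ordered observations
y = np.array([1, 2, 2, 1, 2])        # their classes, as in the example
n = len(x)
root = cross_entropy(y)              # root impurity, equations (3.50)-(3.53)

# Delta I(1) for each split between consecutive ordered values, (3.54).
gains, midpoints = [], []
for i in range(1, n):
    P_L = i / n
    gains.append(root - P_L * cross_entropy(y[:i])
                 - (1 - P_L) * cross_entropy(y[i:]))
    midpoints.append((x[i - 1] + x[i]) / 2)
```

The first split, at midpoint 1.5, gives $\Delta I \approx 0.223$ in agreement with (3.57), and it is also the largest of the four gains, so it is the split the tree-builder would choose first.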
In addition, the rules are far more fragile than one might expect. If one leaves some observations out of the data (as is done in cross-validation), rather different trees tend to be produced. In fact one should do this for sensitivity testing if considering encouraging someone to interpret the decision tree definitively. Extensions to decision trees often consider combinations of many trees, e.g. random forests. These tend to reduce the (questionable) interpretability, but offer much smoother decision boundaries.

Chapter 4 High-dimensional analysis

4.1 Introduction

As of December 2015, the most highly cited paper with the words "high-dimensional" in the title published up until 1995 is (Andrews, 1972), which offers examples with 6 and 8 dimensions. In his influential 1990 book, Fukunaga (1990) listed $p = 32$ or 64 as examples of high-dimensionality. Until the mid-1990s, what most statisticians meant by high-dimensional data was more than 5 dimensions, and 100 dimensions would be considered extreme. However, machine learning groups had begun applying their methods to datasets with even more variables, and new data-producing technology was leading to demands that statistics also offer suitable methods.

In 1958, Dempster (1958) pointed out that neither Hotelling's $T^2$ test to compare two groups nor Fisher's LDA can be used when $p > n_1 + n_2 - 2$. In both cases, the difficulty occurs in fitting a multivariate normal distribution in this number of dimensions. For a single dataset with $p > n - 1$, the sample covariance matrix is singular and hence not invertible. This is based on the following. The $n \times p$ data matrix $X_{All}$ is at most rank $\min(n, p)$. Since the $p \times p$ matrix $S = \frac{1}{n-1} A$, where $A = \sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^T$, the rank of $A$ and $S$ is the same. Each observation contributes one rank to the matrix $A$ provided the observations are linearly independent. One rank is lost for the estimation of the mean $\bar x$.
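The rank argument is easy to verify numerically; a Python sketch with illustrative dimensions chosen so that $p > n - 1$:

```python
import numpy as np

rng = np.random.default_rng(8)

# n = 5 observations in p = 10 dimensions: p > n - 1, so S must be singular.
n, p = 5, 10
X = rng.normal(size=(n, p))
xbar = X.mean(axis=0)

# A = sum_i (x_i - xbar)(x_i - xbar)^T, and S = A / (n - 1).
A = (X - xbar).T @ (X - xbar)
S = A / (n - 1)

# Centring costs one rank, so rank(S) is at most n - 1 = 4, far below p = 10,
# and S cannot be inverted.
rank_S = np.linalg.matrix_rank(S)
```

With continuous random data the rank equals $n - 1$ exactly: each centred observation contributes one rank, minus the one lost to estimating $\bar x$.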
Hence the rank of A and S is at most n - 1, and these matrices are only invertible if p \le n - 1. Dempster outlined a non-exact two-sample test that would work under these conditions. Various improvements have been suggested since.

The standard solution for multiple linear regression is \hat{\beta} = (X_{All}^T X_{All})^{-1} X_{All}^T y_{All}, where y_{All} is a vector of continuous response values. This also requires that p \le n: otherwise the matrix inversion is not possible. It is possible to minimise the sum of squares for the regression in other ways, but if p > n, the solution is not unique, and hence largely uninterpretable. As a result, standard experimental design demands that the number of observations n be at least p, and usually quite a bit higher. Unfortunately, this is not possible in all situations, partly for reasons of expense. The modern world has now produced an enormous array of equipment and facilities capable of collecting very large amounts of data. In many cases, the number of dimensions is very large, and sometimes much larger than the sample size, i.e. p \gg n.

The best known example of a need for high-dimensional data analysis was presented by microarray technology in genomics. Microarrays were developed and commercialised around 1995 and have been an important tool in understanding the genomes of many species ever since. Humans have approximately 20,000 genes, and many other complex species have similar numbers of genes. Oversimplifying somewhat, each gene both encodes and is capable of leading production of a particular type of protein. The amount of each gene product (typically a protein) being produced is controlled by other genetic architecture, but one can get an idea of the current rate of production by measuring levels of messenger ribonucleic acid (mRNA), there being a unique form of this for each gene.
Microarrays are compact chemical and photonic experimental apparatus which can measure a full set of mRNA levels from a tissue sample, e.g. a blood sample. Some medical and biological scientists were and are very interested to discover relationships between these levels of gene expression and measurable aspects of the organisms from which the tissue samples were taken; these are known as phenotypes. Examples of phenotypes include height, presence of a particular disease, a particular health outcome, ability to tolerate certain kinds of environment, crop yield, and milk yield (in e.g. cattle): an enormous array of possibilities.

The set of mRNA levels form the explanatory variables and any phenotype of interest can be the response. This allows the full range of response variable types, with the main types of interest being continuous and nominal. Since each microarray typically cost over $1000 to purchase and use, sample sizes tended to be fairly small, in the range 10 to 1000. Technology has changed over time and new problems have emerged with much larger numbers of explanatory variables, but the p > n dataset seems to be here to stay.

In this chapter, we will consider methods to cope with large values of p, likely at least 100, where p may be greater or smaller than n. The choice of methods for such data depends greatly on the question you wish to answer. Some of the main possibilities include:

1. Which of the predictor variables are associated with the response variable?
   (a) considering each variable separately
   (b) considering the set of explanatory variables all at once
2. Can the set of explanatory variables be transformed into a space of smaller dimension without losing anything important in terms of the association with the response variable?
3. Can the set of explanatory variables be transformed into a space of smaller dimension when there is no response variable?
   (a) when the aim is to capture structure in the data
   (b) when the aim is clustering
4. How can we make the highest-quality predictions of the response variable, based on the explanatory variables?

One should be able to see that (1) is a special case of (2). We will discuss methods for 3(a), but there will not be time to address 3(b). We will make some comments about (4).

4.2 Principal Component Analysis

Principal component analysis (PCA) is largely a technique to transform the original variables into a set of orthogonal variables which are ordered according to the size of their sample variances. It aims to find a set of underlying uncorrelated latent variables which produce the dependencies and the variation in the original variables. The first principal component direction is a unit vector whose direction in the space defined by the explanatory variables is that for which the sample variance of the projected data is largest. The first principal components of the data are the projections of the data onto the first principal component direction, or equivalently, the result of dot products of the observations with this vector. The second principal component direction is a direction orthogonal to the first which, within that constraint, also maximises the sample variance. The third principal component direction must be orthogonal to the first two and also maximise the sample variance, and so on, until p principal component directions and the corresponding principal components can be defined.

Note that the notation is somewhat confused in the literature, with many authors referring to the principal component directions as the principal components, and not bothering to name the transformed data. Duda et al. (2001) use this terminology. Morrison (2005) and Anderson (2003) avoid naming the directions until they determine their form. Here I have adopted the terminology used by Hastie et al. (2009). Some authors also refer to the principal component directions as loadings.
In practice the point of the exercise is usually dimensionality reduction. This is achieved by replacing the original dataset with a reduced dataset which contains only values for the first few principal components. These are generally enough to capture most of the variance in the data. Whether they capture most of the important information is another question. The overall hope is that variables which show little variation will be of little value, and those that vary the most will be the most important.

Mathematically, in constructing the first principal component, we are seeking a linear combination Y_1 of the original variables X_1, \ldots, X_p which maximises the sample variance (and hence, it is hoped, the true variance). Let the first principal component be the following linear compound:

Y_1 = a_{11} X_1 + \cdots + a_{p1} X_p = a_1^T X.   (4.1)

Thus the sample variance of this principal component will be

s^2_{Y_1} = \sum_{j=1}^p \sum_{k=1}^p a_{j1} a_{k1} s_{jk} = a_1^T S a_1,   (4.2)

where S is the sample covariance matrix. The first principal component is chosen such that s^2_{Y_1} is maximised. Since one could easily scale such a solution up to infinity, we also require that a_1^T a_1 = 1, i.e. that a_1 is of length 1 in the original variable space. The vector a_1 also points along the principal axis of the ellipsoid for a multivariate normal distribution fitted to the data. The second principal component,

Y_2 = a_{12} X_1 + \cdots + a_{p2} X_p = a_2^T X,   (4.3)

is also chosen to maximise the sample variance a_2^T S a_2, but is subject to the additional constraint that it must be orthogonal to the first principal component. This is achieved by setting a_1^T a_2 = 0. Later principal components are chosen similarly, with the condition that each be orthogonal to the earlier components.

We can find the weight vector a_1 for the first principal component using the method of Lagrange multipliers. We need to maximise a_1^T S a_1 subject to the constraint a_1^T a_1 = 1.
Hence, we let \lambda be an unknown real-valued Lagrange multiplier and consider the Lagrange function

u_1 = a_1^T S a_1 - \lambda (a_1^T a_1 - 1).   (4.4)

It can be shown that there is a stationary value of the Lagrange function which maximises a_1^T S a_1 subject to the constraint. We find stationary points of the Lagrange function via differentiation. The vector of partial derivatives is

\partial u_1 / \partial a_1 = 2 S a_1 - 2 \lambda a_1.   (4.5)

We set this to zero, and obtain

S a_1 = \lambda a_1.   (4.6)

Hence the stationary points in terms of a_1 are the eigenvectors of S, with their corresponding eigenvalues being values for \lambda. We wished to maximise a_1^T S a_1, and we can see that

a_1^T S a_1 = \lambda a_1^T a_1 = \lambda.   (4.7)

Hence the first principal component direction should be chosen to be the eigenvector of S whose eigenvalue \lambda is largest. The second and subsequent principal components can be found similarly by subtracting the fitted values for the earlier principal components from the data, and then searching for a direction which maximises the sample variance. It can be shown that the directions found will be orthogonal to those previously found, and that the jth principal component direction is that of the eigenvector corresponding to the jth largest eigenvalue of the original sample covariance matrix S.

Having produced the principal component directions and the principal components (i.e. having transformed the original data), one can then calculate the percentage of the variance in the data captured by each component. It is common to keep only the principal components which together explain e.g. 90%, 95% or 99% of the variance, which might not need very many. However, it must be remembered that this method does not consider a response variable, and we do not know how much ability to explain a response will be lost in dropping some principal components. There have been many extensions of the technique of principal components, including sparse methods for high-dimensional data.
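The eigenvector characterisation above can be checked numerically: power iteration (repeatedly applying S and renormalising) converges to the eigenvector with the largest eigenvalue, i.e. the first principal component direction. A pure-Python sketch on a made-up 2 x 2 covariance matrix:

```python
import math

# Hypothetical 2x2 sample covariance matrix (not from the notes)
S = [[2.0, 1.0],
     [1.0, 2.0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Power iteration: the component along the leading eigenvector grows
# fastest, so the normalised iterate converges to a_1.
a = [1.0, 0.0]
for _ in range(100):
    a = normalise(matvec(S, a))

lam = sum(ai * bi for ai, bi in zip(a, matvec(S, a)))  # Rayleigh quotient
print([round(x, 3) for x in a], round(lam, 3))
```

For this S the first principal component direction is (1, 1)/sqrt(2) with variance (eigenvalue) 3, which the iteration recovers.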
Standard PCA can be used with high-dimensional data where p > n, with some caveats. Since S is symmetric (and hence Hermitian), all of its eigenvalues are real (i.e. not complex). As we have seen, the rank of S is at most n - 1, which implies that the number of nonzero eigenvalues of S is also at most n - 1. This limits the number of principal components to \min(n - 1, p), and hence to n - 1 when p \ge n. If we ignore the orthogonality aspect, this should not be surprising. If the data consist of n vectors in a p-dimensional space, then we could transform the data to a space whose basis is those n vectors, i.e. an n-dimensional space. The orthogonality requirement changes this somewhat, but makes no difference to the dimensionality. PCA loses one further degree of freedom due to the calculation of the sample mean needed to produce S. The (n - 1)-dimensional PCA basis appears to be an extreme case of over-fitting the data, but the effect on subsequent analysis is unclear.

However, the main purpose of using PCA is to reduce the dimensionality, and it is quite likely that we would like to reduce it well below \min(n - 1, p), reducing the tendency to over-fit in the process. Other methods may require such a reduction before they can be applied to the resulting data. For example, in order to fit the class covariance matrices, QDA requires that the dimension of the data be less than the sample size of the smallest class. LDA, which pools the covariance estimates, does not have this per-class restriction. While PCA does not take into account any relationship with the response variable, it completely removes collinearity and can be used to reduce the dimensionality to something more manageable. By not considering the response, it avoids a variety of potential biases in the assessment of predictive accuracy. As a result, it is a useful tool when the purpose is prediction.
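The limit of n - 1 nonzero eigenvalues can be seen directly: with n = 3 observations in p = 3 dimensions (hypothetical data below), the sample covariance matrix has rank at most 2 and so is singular. A pure-Python check:

```python
# Hypothetical data: n = 3 observations in p = 3 dimensions.
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
n, p = len(X), len(X[0])
mean = [sum(row[j] for row in X) / n for j in range(p)]
C = [[row[j] - mean[j] for j in range(p)] for row in X]  # centred data
# Sample covariance S = A / (n - 1), with A the sum of outer products.
S = [[sum(C[i][j] * C[i][k] for i in range(n)) / (n - 1)
      for k in range(p)] for j in range(p)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

print(abs(det3(S)) < 1e-12)  # True: S is singular, so not invertible
```

Since rank(S) <= n - 1 = 2 < p = 3, at least one eigenvalue is zero and the determinant vanishes, in line with the argument above.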
4.3 Single-variable analysis

In many cases, the belief of the investigators is that just some of the explanatory variables are related to the response, and the aim is to find as many of these as possible. For example, if we have two groups of subjects, one with a certain disease and the other healthy, we may wish to find out which of the explanatory variables have a different distribution or mean in each group. Alternatively, we might have a continuous response variable and wish to know which of the explanatory variables seem most strongly related to it. The variables found to be related are then candidates for further studies, with a typical eventual aim of intervening to change the response from a worse to a better state.

The simplest approach is then to work through the explanatory variables one by one and quantify their relationship to the response variable. The most common method for doing so is via a p-value. When the response is nominal and the explanatory variable is continuous, one is likely to use ANOVA if its conditions are met. Once the commonly reasonable assumption of independence of observations is checked, we primarily need approximate normality of the data in each group, along with similar variances. Sometimes transformation of the data can achieve this, and when it cannot, the data may meet the less stringent conditions of alternatives such as the Kruskal-Wallis test. When there are just two groups, one can use Welch's t-test, which allows the groups to have unequal variances. When the response is nominal and the explanatory variable is categorical, one might use a chi-squared or Fisher's exact test.

For a continuous response variable, if the explanatory variable is categorical, one might again use ANOVA or alternatives. When both response and explanatory variables are continuous, one would typically use linear regression. Some transformations might be needed to do so.
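As an illustration of the two-group continuous case, Welch's t statistic and its Welch-Satterthwaite degrees of freedom can be computed directly. A pure-Python sketch (the two samples are made up for illustration):

```python
import math

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)  # sample variances
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

t, df = welch_t([4.1, 5.0, 6.2, 5.5], [2.0, 2.4, 3.1])
print(round(t, 2), round(df, 1))
```

The p-value is then obtained from the t distribution with the (generally non-integer) degrees of freedom df; in practice one would use e.g. `t.test(x, y)` in R, which applies this correction by default.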
The net result is that we have a p-value for every one of the explanatory variables. The difficulty is then to decide what threshold to use to determine which explanatory variables will be considered significant or of interest. If we set \alpha = 0.05 and report every variable with a p-value < \alpha, then we have to be aware that we will collect 5% of the irrelevant variables, since their p-values are uniformly distributed on [0, 1] under the null hypothesis. If we have p = 20,000 and most of these variables are unrelated to the response, then this will produce around 1,000 irrelevant variables, along with possibly some of the relevant variables.

In courses covering experimental design, we teach how to control the family-wise or experiment-wise error rate (FWER) to be at most a certain value, e.g. 0.05. Potential solutions such as the Bonferroni method exist, but are likely to be far too stringent for situations where p > n. For example, if n = 100 and p = 20,000, the Bonferroni method suggests that if we wish to have a family-wise error rate of 0.05, we should use an individual threshold of 0.05/20,000 = 2.5 \times 10^{-6}. If using a two-sample t-test with equal variances, this requires a test statistic whose magnitude is at least 5. Given a sample size of 100 with 50 in each group, this requires a difference between the group means at least as large as the standard deviation within a group. Such differences can be seen by eye on exploratory plots and barely need statistical analysis.

It is often difficult to know how the set of variables found will be used, but sometimes it is clear. A laboratory may want to investigate up to 30 different genes, and would like a list of those which are most promising. In one sense the task is then easy: we can always sort the variables according to their p-values, and our best candidates would be the 30 with the smallest p-values.
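Returning to the Bonferroni example above, the implied critical value can be checked numerically. Under a normal approximation (which slightly understates the exact t critical value with 98 degrees of freedom), bisection on the two-sided tail probability gives a critical value near 4.7; the t distribution pushes this toward 5. A sketch:

```python
import math

# Bonferroni-adjusted two-sided level: 0.05 / 20000 = 2.5e-6
alpha = 0.05 / 20000

def two_sided_p(z):
    # Two-sided standard normal tail probability: P(|Z| > z) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2))

# Bisection: two_sided_p is decreasing in z, so bracket and halve.
lo, hi = 0.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if two_sided_p(mid) > alpha:
        lo = mid
    else:
        hi = mid
print(round(lo, 2))
```

The printed value is roughly 4.71 under the normal approximation, consistent with the claim that the test statistic must be of magnitude about 5.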
Note that for a particular type of test, if one does not have sufficiently accurate p-values, one can simply sort by the absolute value of the test statistic, assuming that we are always performing two-sided tests. The problem is that we don't then know if we are wasting people's time. Are any of the 30 relevant? Where should we draw the line in a principled fashion?

The currently most popular solution is to attempt to control what is known as the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Since we will now be discussing p-values, we will reluctantly join Benjamini and Hochberg (and Hastie, Tibshirani and Friedman) in changing our notation for the number of variables to m, representing the number of tests to be performed and the number of resulting p-values.

            Declared non-significant   Declared significant   Total
H_0 true               U                        V              M_0
H_0 false              T                        S              M_1
Total                m - R                      R               m

Table 4.1: Number of errors in the testing of m null hypotheses.

Table 4.1 shows the breakdown of all the hypothesis test results once a threshold criterion is chosen. The power of the procedure is the probability of declaring an alternative hypothesis significant when it really is, and so can also be defined as

Power = E(S / M_1).   (4.8)

The family-wise error rate can be defined as FWER = P(V \ge 1). The proportion of false discoveries is the proportion of hypotheses declared significant for which the null hypothesis is actually true. This is labelled Q = V/R, provided R > 0. We define Q = 0 when V = R = 0. The false discovery rate (FDR) is the expected value of Q:

FDR = E(Q) = E(V/R \mid R > 0) P(R > 0),   (4.9)

where V and R are as defined in the table. The possibility of R = 0 is the reason why the above is not just E(V/R). However, P(R = 0) is typically close to zero, so there is not much difference. See Sun and Cai (2009) for further details. R is an observable random variable, while S, T, U and V are unobservable random variables, as are the derived quantities M_0 and M_1.
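The Benjamini-Hochberg step-up procedure for controlling this FDR (described in full below) can be sketched in a few lines of Python; the p-values here are hypothetical:

```python
def benjamini_hochberg(pvals, q):
    """Benjamini-Hochberg step-up rule: return indices of rejected hypotheses.

    Rejects the hypotheses with the k smallest p-values, where k is the
    largest rank i such that p_(i) <= (i/m) * q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values from m = 8 tests, FDR controlled at q* = 0.05
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p, 0.05))  # rejects the two smallest: [0, 1]
```

Note that the rule is "step-up": a p-value may exceed its own threshold yet still be rejected if some larger-ranked p-value falls below its threshold.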
Because the counts in Table 4.1 sum to known margins, there are really only two unknown random variables here, e.g. S and T. When all the null hypotheses are true, S = 0 and so V = R. If V = 0, then Q = 0. If V \ge 1, then Q = 1. As a result, P(V \ge 1) = E(Q). Hence for this case, the FDR is the same as the FWER.

When M_0 < m, some alternative hypotheses are true (M_1 > 0) and FDR \le FWER. This can be shown as follows. We make the claim that

I(V \ge 1) \ge Q.   (4.10)

There are only two cases to consider. If V = 0, then I(V \ge 1) = 0 and Q = 0, so I(V \ge 1) \ge Q. If V \ge 1, then I(V \ge 1) = 1 and Q = V/R \le 1, so I(V \ge 1) \ge Q. So the claim holds. Taking expectations of both sides:

P(V \ge 1) \ge E(Q),   (4.11)

i.e.

FWER \ge FDR.   (4.12)

As a result, any procedure which controls (places an upper bound on) the FWER also controls the FDR. However, if one controls just the FDR, the FWER can be considerably higher. This may be of no concern, and the benefit can be a large increase in the power of the overall procedure.

Benjamini and Hochberg offered a method to control the FDR, which is described below. Assume we are testing m null hypotheses, which we shall name H_1, H_2, \ldots, H_m, and that based on the collected dataset, these yield p-values P_1, \ldots, P_m. Note that upper-case letters are used here to indicate that, in general, the p-values are random variables before the data is collected. We now order the p-values from smallest to largest and relabel them as follows: P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}. So P_{(i)} is the ith smallest p-value, and is the result of testing the null hypothesis H_{(i)}. We define the following multiple testing procedure: let k be the largest value of i for which P_{(i)} \le (i/m) q^*. Then reject all H_{(i)}, i = 1, \ldots, k.

Theorem: for independent test statistics and any configuration of false null hypotheses, this procedure controls the FDR to be at most q^*.

4.4 Variable selection and the lasso

We now turn to the problem of explanatory variable selection when there is a response variable.
Usually it is assumed that only some of the explanatory variables are related to the response variable, and the task of variable selection is to find these. So far, this is no different to the task of single-variable analysis. The difference is that here we also plan to use the set of variables found for prediction. Further, we believe that by selecting the related variables and leaving out the unrelated variables, we will improve the accuracy of predictions, and will have an interpretable and parsimonious (very frugal) model. This issue of variable selection is actually a special case of model selection. The way each variable relates to the response may also be considered unknown, with a variety of models considered, but one needs a lot of data to obtain clear answers to such questions.

Here we will consider just two types of models: (multiple) linear regression to predict a continuous response variable and logistic regression to predict a binary response variable. As we saw with neural networks, one way of limiting or avoiding over-fitting is to penalise the magnitude of the parameters; such approaches are called shrinkage methods. As we saw with the BIC, another way is to penalise the number of parameters. Another way to choose a subset of explanatory variables for a model, and hence try to avoid over-fitting, begins with choosing a method of assessing predictive accuracy, e.g. the CV-based PRESS for a continuous response variable, and classic CV for a nominal response variable (classification). We then search over all possible combinations of the available variables somehow. When there are p explanatory variables, the number of possible subsets is 2^p. It is not possible to fully explore this space when p is large.
A number of approaches have been tried, including forward selection (starting from the simplest model, a constant only, and building up in greedy fashion) and backward selection (starting from the full model, with all explanatory variables, and removing variables in greedy fashion), although these have often been done using "apparent" assessments of predictive accuracy, such as R^2. Other authors have tried meta-heuristic algorithms to search this space, and preliminary conclusions suggest that while they can effectively minimise a predictive accuracy cost function, the level of noise is often such that there is insufficient signal to encourage the removal of enough explanatory variables (Nguyen and Wood, 2012).

The method of subset selection aimed to fix the number of selected variables at a certain value k < p and search over this smaller domain. This can be viewed as a hard-constraint version of BIC. It can also be viewed as a method which uses an L_0 norm penalty. The L_0 norm is

||\beta||_{L_0} = |\{j : \beta_j \ne 0\}|.   (4.13)

By far the most popular current method of variable selection is the lasso, the "least absolute shrinkage and selection operator" (Tibshirani, 1994). This method aims to shrink some parameter values and set others to zero, hence achieving both shrinkage and variable selection.

For standard linear regression, we use the model Y = \beta_0 + X^T \beta + \varepsilon, where Y is the response variable, X is a p-vector of explanatory variables, \beta = (\beta_1, \ldots, \beta_p)^T is a vector of regression slope parameters, \beta_0 is the constant regression parameter and \varepsilon \sim N(0, \sigma^2) is the normally-distributed error term, which has variance parameter \sigma^2. We are not interested in shrinking or removing \beta_0 or \sigma^2: they must be present in all the models of interest. Given this model, it can be shown that the maximum likelihood solution is the same as the least squares solution, i.e. the standard solution is to choose

(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2.
(4.14)

It is well known that this has a closed-form solution, which is not that difficult to derive, at least for p < n. That is, \hat{\beta} = (X_{all}^T X_{all})^{-1} X_{all}^T Y, where X_{all} is the n \times (p+1) matrix of explanatory variable values and Y is the vector of response values. Here X_{all} has been augmented with a vector of 1's (the first column) to allow the constant \beta_0 to be included in the parameter vector \beta.

A long-standing shrinkage method is known as ridge regression, weight decay or Tikhonov regularization. One typically normalises the predictor variables before starting, i.e. subtracts their sample mean and divides by their sample standard deviation. We then include as a penalty the sum of the squared regression weights, also known as an L_2 penalty, as follows:

(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2.   (4.15)

Here \lambda is a tuning parameter that controls the amount of shrinkage. It can be chosen manually, by trial and error, or by optimizing over some measure of predictive accuracy, e.g. via cross-validation. Note that we would need something such as two layers of cross-validation to both select such a parameter and estimate the accuracy of the resulting model.

Ridge regression was introduced by Hoerl and Kennard (1970) primarily to improve regression for non-orthogonal predictor variables, i.e. those with substantial multicollinearity. The concern in this case is that the slope parameter estimates will be unstable, with large confidence intervals and high sensitivity to individual observations. This can also be seen via the near-singularity of the matrix X_{all}^T X_{all}. The condition number of the matrix to be inverted is decreased via ridge regression, with the ridge estimator being given by

\hat{\beta}_R = (X_{all}^T X_{all} + \lambda I)^{-1} X_{all}^T Y.

If p > n, then X_{all}^T X_{all} will not be invertible due to being rank-deficient. However, it is positive semi-definite, so its eigenvalues are all non-negative, with the rank deficiency meaning that at least one of them is zero.
The addition of \lambda I increases all the eigenvalues by \lambda (> 0), making this matrix positive definite and invertible, and hence allowing a unique solution for \hat{\beta}.

Tibshirani had what at first seems a fairly trivial idea: to replace the L_2 penalty with an L_1 penalty:

(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.   (4.16)

This is the Lagrangian equivalent of

(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2,   (4.17)

subject to

\sum_{j=1}^p |\beta_j| \le t,   (4.18)

where t is a threshold parameter, which has a one-to-one relationship with \lambda. With normalized data, the solution for \hat{\beta}_0 is \bar{y}, and having chosen this, we then only need to choose \beta_1, \ldots, \beta_p.

However, Tibshirani may have been the first person to realise the potential of this type of penalty in variable selection, and its computational feasibility. What he may not have appreciated at the time was that he had invented it just in time for a new wave of data and analysis in which variable selection would feature prominently. Most of the examples in the original lasso paper have p < 10, but Tibshirani tried one simulation with p = 40 and n = 100.

We will not usually be aware of an optimal value for \lambda or t. Tibshirani offered three methods for selecting this value. The first was simply the minimization of the prediction error, as assessed via five-fold cross-validation. The second was a form of generalized cross-validation, and the third was based on an unbiased estimate of risk due to Stein.

Tibshirani noted that Frank and Friedman (1993) discussed generalizing ridge regression and subset selection by adding a penalty of the form \lambda \sum_{j=1}^p |\beta_j|^q, for q \ge 0. However, they did not investigate other values of q. Tibshirani claimed that his choice of q = 1 was motivated by two factors: (i) that this was closer to q = 0, as used in subset selection, which seems natural when doing variable selection; and (ii) that this was the smallest value of q for which the optimization problem would be convex.
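A standard special case helps to see how (4.16) produces exact zeros: when the predictors are orthonormal, the optimisation decouples across coefficients, and each lasso estimate is the least-squares estimate soft-thresholded at \lambda/2 (for the penalty written as \lambda \sum |\beta_j| added to the residual sum of squares, as in (4.16)). A sketch with hypothetical least-squares estimates:

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

lam = 1.0
# Hypothetical OLS estimates for four orthonormal predictors
ls_estimates = [2.3, -0.4, 0.1, -1.7]
lasso = [soft_threshold(z, lam / 2) for z in ls_estimates]
print([round(b, 2) for b in lasso])  # small coefficients are set exactly to zero
```

Large estimates are shrunk toward zero by \lambda/2, while estimates smaller than \lambda/2 in magnitude are set exactly to zero, which is precisely the combination of shrinkage and selection described above. (Ridge regression, by contrast, rescales each estimate and never produces exact zeros.)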
To be able to choose \lambda, it would be best to be able to determine the set of parameters which result from optimising (4.16) for every possible value of \lambda. It can be shown that for a particular value of \lambda, the optimal values can be found via quadratic programming. Further, the problem of finding the complete solution path is also convex and, according to Hastie et al. (2009, p. 76), requires "the same order of computation as that of a single least squares fit using the p predictors". We will not cover the details in this course, but a range of R packages are available which implement this and related methods.

The lasso can also be used with any generalised linear model, i.e. one for which g(\mu(x)) = \beta^T x, where \mu(x) = E(Y | X = x), g(\cdot) is a monotone function and Y is distributed according to a member of the exponential family. In particular, it can be applied to logistic regression.

Appendix A

ANOVA identity: the total corrected sum of squares may be written as

\sum_{i=1}^a \sum_{j=1}^r (y_{ij} - \bar{y}_{..})^2
  = \sum_{i=1}^a \sum_{j=1}^r ((y_{ij} - \bar{y}_{i.}) + (\bar{y}_{i.} - \bar{y}_{..}))^2
  = \sum_{i=1}^a \sum_{j=1}^r (y_{ij} - \bar{y}_{i.})^2 + 2 \sum_{i=1}^a \sum_{j=1}^r (y_{ij} - \bar{y}_{i.})(\bar{y}_{i.} - \bar{y}_{..}) + r \sum_{i=1}^a (\bar{y}_{i.} - \bar{y}_{..})^2.

However, the cross-product term in this last equation is zero, because

\sum_{j=1}^r (y_{ij} - \bar{y}_{i.}) = y_{i.} - r \bar{y}_{i.} = 0,

where y_{i.} denotes the ith group total and \bar{y}_{i.} the ith group mean. Therefore we have

\sum_{i=1}^a \sum_{j=1}^r (y_{ij} - \bar{y}_{..})^2 = r \sum_{i=1}^a (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^a \sum_{j=1}^r (y_{ij} - \bar{y}_{i.})^2.
(19)

Appendix B: Code for knn example

library(MASS)
scale <- 10
offset <- 3.5
x <- seq(1, 10)
c1 <- cbind(x, x)
c2 <- cbind(x + offset, x)
call <- rbind(c1, c2)
eqscplot(call, pch = "", xlab = "x1", ylab = "x2")
points(c1)
points(c2, pch = "x")
cs1 <- cbind(x * scale, x)
cs2 <- cbind((x + offset) * scale, x)
calls <- rbind(cs1, cs2)
eqscplot(calls, pch = "", xlab = "x1", ylab = "x2")
points(cs1)
points(cs2, pch = "x")

## New version below including predicted classes

## scaled
library(class)
minx <- min(cs1[, 1], cs2[, 1])
maxx <- max(cs1[, 1], cs2[, 1])
miny <- min(cs1[, 2], cs2[, 2]) - 10
maxy <- max(cs1[, 2], cs2[, 2]) + 10
xs <- seq(minx, maxx)
ys <- seq(miny, maxy)
lenxs <- length(xs)
lenys <- length(ys)
classes <- factor(c(rep(1, 10), rep(0, 10)))
z <- matrix(0, nrow = lenxs, ncol = lenys)
for (i in 1:lenxs) {
  for (j in 1:lenys) {
    z[i, j] <- knn(calls, c(xs[i], ys[j]), cl = classes, k = 1)
  }
}
image(xs, ys, z, asp = 1)
points(cs1)
points(cs2, pch = "x")

## unscaled
minu <- min(c1[, 1], c2[, 1])
maxu <- max(c1[, 1], c2[, 1])
minv <- min(c1[, 2], c2[, 2])
maxv <- max(c1[, 2], c2[, 2])
us <- seq(minu, maxu, 0.02)
vs <- seq(minv, maxv, 0.02)
lenus <- length(us)
lenvs <- length(vs)
classes <- factor(c(rep(1, 10), rep(0, 10)))
w <- matrix(0, nrow = lenus, ncol = lenvs)
for (i in 1:lenus) {
  for (j in 1:lenvs) {
    w[i, j] <- knn(call, c(us[i], vs[j]), cl = classes, k = 1)
  }
}
image(us, vs, w, asp = 1)
points(c1)
points(c2, pch = "x")

Bibliography

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723.

Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. Wiley, 3rd edition.

Andrews, D. F. (1972). Plots of high-dimensional data. Biometrics, pages 125–136.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, pages 289–300.

Berger, J. (1975). Minimax estimation of location vectors for a wide class of densities. Annals of Statistics, 3:1318–1328.

Boyd, S. and Vandenberghe, L. (2004).
Convex Optimization. Cambridge University Press.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76:503–514.

Chacon, J. and Duong, T. (2011). Unconstrained pilot selectors for smoothed cross-validation. Australian & New Zealand Journal of Statistics, 53(3):331–351.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38.

Dempster, A. P. (1958). A high dimensional two sample significance test. The Annals of Mathematical Statistics, pages 995–1010.

Devroye, L. (1978). A universal k-nearest neighbor procedure in discrimination. In Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pages 142–147, Long Beach, CA. IEEE Press.

Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. Wiley, New York.

Duong, T. (2007). ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. Journal of Statistical Software, 21(7):1–16.

Efron, B. (1981). Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika, 68:589–599.

Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(28):179–188.

Fraley, C. and Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation.
Journal of the American Statistical Association, 97:611–631.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70:320–328.
Graybill, F. (2001). Matrices with Applications in Statistics. Duxbury, Belmont, CA, USA, 2nd edition.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society B, 58:155–176.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York, 2nd edition.
Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
Ito, K. and Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of variance when variance-covariance matrices are not equal. Biometrika, 51(1-2):71–82.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, USA.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90:773–795.
Krishnamoorthy, K. and Yu, J. (2004). Modified Nel and Van der Merwe test for the multivariate Behrens–Fisher problem. Statistics & Probability Letters, 66(2):161–169.
Lachenbruch, P. and Mickey, M. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10:1–11.
McLachlan, G. and Gordon, R. (1989). Mixture models for partially unclassified data: a case study of renal venous renin levels in essential hypertension. Statistics in Medicine, 8:1291–1300.
McLachlan, G.
and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
Mood, A., Graybill, F., and Boes, D. (1974). Introduction to the Theory of Statistics. McGraw-Hill, 3rd edition.
Morrison, D. F. (2005). Multivariate Statistical Methods. Duxbury, Belmont, CA, USA, 4th edition.
Nguyen, H. and Wood, I. (2012). Variable selection in statistical models using population-based incremental learning with applications to genome-wide association studies. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 1841–1848.
Olshen, R. (1977). Comments on a paper by C. J. Stone. Annals of Statistics, 5:632–633.
Quenouille, M. (1949). Problems in plane sampling. The Annals of Mathematical Statistics, 20:355–375.
Quenouille, M. (1956). Notes on bias in estimation. Biometrika, 43:353–360.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Roeder, K. and Wasserman, L. (1997). Practical density estimation using mixtures of normals. Journal of the American Statistical Association, 89:487–495.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
Smyth, P. (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing, 9:63–72.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pages 197–206.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–133.
Sun, W. and Cai, T. (2009). Large-scale multiple testing under dependence. Journal of the Royal Statistical Society, Series B, 71(2):393–424.
Tibshirani, R. (1994). Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58:267–288.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63:411–423.
Tukey, J. (1958). Bias and confidence in not quite large samples. The Annals of Mathematical Statistics, 29:614.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall, London.
Wishart, J. (1928). The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 20:32–52.
Wu, C. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103.
Zehna, P. (1966). Invariance of maximum likelihood estimation. Annals of Mathematical Statistics, 37:744.
Zhang, F. (2005). The Schur Complement and its Applications. Springer, New York.