MATH5855: Multivariate Analysis
Dr Pavel Krivitsky, based on notes by A/Prof Spiridon Penev
University of New South Wales, School of Mathematics, Department of Statistics
2021 Term 3

This volume of notes is for individual students' use only. It is therefore not to be distributed beyond the University of New South Wales. Since the notes will be uploaded in parts, these page numbers are indicative.

0 Preliminaries 4
1 Exploratory Data Analysis 15
2 The Multivariate Normal Distribution 17
3 Multivariate Normal Estimation 27
4 Intervals and Tests for the Mean 34
5 Correlations 43
6 Principal Components Analysis 50
7 Canonical Correlation Analysis 55
8 MLM and MANOVA 60
9 Tests of a Covariance Matrix 65
10 Factor Analysis 68
11 Structural Equation Modelling 74
12 Discrimination and Classification 79
13 Support Vector Machines 87
14 Cluster Analysis 96
15 Copulae 107
A Exercise Solutions 114

Foreword

These notes are not a substitute for the lectures in Multivariate Analysis for Masters students. You are strongly recommended to attend each and every lecture and laboratory hour, because the conceptual bases of the discussed modelling methods, some additional derivations and explanations, and important portions of pertinent computer output will be focused on there. This volume is therefore not meant to be a substitute for a textbook, computer package manual, or lecture attendance. We rely on the widespread and powerful statistical suites R and SAS to perform the actual calculations during the course. These notes are a compilation from several sources and other notes. Some of the sources are listed in your handout. The closest reference book is: Johnson, R. & Wichern, D. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Prentice Hall. By no means can this book be a substitute for the whole set of notes, though.
It is assumed that you are familiar with some basic concepts of linear algebra. These will be summarised at the beginning and will be used extensively in the rest of the course. These concepts include matrix and vector operations, determinants, traces, ranks, projectors, linear equations, inverses, eigenvectors and eigenvalues, etc. I would appreciate it if you would let me know about any ways these notes could be further improved.

Overview

First we shall discuss some general aspects of Multivariate Analysis. Usually, when studying complex phenomena, many variables are required. Moreover, the process of studying is usually an iterative one, with variables often added to or deleted from the study. Multivariate analysis deals with developing methods for better understanding the relationships between the many variables included in the analysis of such complex phenomena.

What makes Multivariate Analysis different? In your other classes, you have learned about a variety of methods for analysing many variables. For example, you have probably learned about the multiple linear regression model:

Yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ϵi, i = 1, . . . , n,

where Yi is the ith observation of the response variable, xik the ith observation of the kth predictor variable, and ϵi the ith error. However, in this regression, we designate the p predictors as fixed (conditioned on), and only one variable per observation is random. Typically, we assume that the ϵi's, and therefore the Yi's, are independent (conditional on the x's), or at least uncorrelated. Contrast this with a multivariate linear model,

Yi1 = β01 + β11 xi1 + β21 xi2 + · · · + βp1 xip + ϵi1,
Yi2 = β02 + β12 xi1 + β22 xi2 + · · · + βp2 xip + ϵi2,

where Yi1 and Yi2 are the ith observations of two distinct response variables, and ϵi1 and ϵi2 may be correlated.
The multivariate linear model can be used when multiple observations are taken on each individual in the sample, and it can allow us to model the relationships among these measurements. Difficulties in such a process:
• More data to analyse
• More involved mathematics necessary
• Computer-intensive methods involved in the process

Objectives of multivariate methods:

Data reduction: presenting the phenomenon as simply as possible, but without sacrificing valuable information. Typical representative method: Principal Components Analysis. Sometimes, this reduction is achieved by introducing a small number of unobservable (latent) variables when trying to explain a large number of observable output variables. Representative methods: Factor Analysis and covariance structure analysis.

Sorting or grouping: creating groups of "similar" objects or variables that in a sense are closer to each other than to objects outside the group, and finding a reasonable explanation for the existing grouping. Representative methods: Factor Analysis, Cluster Analysis, Discriminant Analysis.

Investigation of dependence among variables: finding which sets of variables can be considered independent and which are "more dependent", and "measuring" the dependence. Representative methods: Correlation Analysis, Partial Correlations, Canonical Correlations.

Prediction: predicting values of one or more variables on the basis of observations of other variables that have been found to influence the former variables: a basic but important goal. Representative method: Multivariate Regression.

Hypothesis testing: either validating assumptions (e.g., normality) on the basis of which certain analysis is being done, or reinforcing some prior modelling convictions (e.g., equality of parameters). Hypothesis testing is relevant to the applications of all multivariate methods we will be dealing with.
As a basic mathematical model for our analyses in this course, the multivariate normal distribution will be used. The reasons are our limited time and the complexity of other approaches. Although in practice other distributions are also relevant, modelling based on the multivariate normal distribution can still be a very good approximation.

0 Preliminaries

0.1 Matrix algebra
  0.1.1 Vectors and matrices
  0.1.2 Inverse matrices
  0.1.3 Rank
  0.1.4 Orthogonal matrices
  0.1.5 Eigenvalues and eigenvectors
  0.1.6 Cholesky Decomposition
  0.1.7 Orthogonal Projection
0.2 Standard facts about multivariate distributions
  0.2.1 Random samples in multivariate analysis
  0.2.2 Joint, marginal, conditional distributions
  0.2.3 Moments
  0.2.4 Density transformation formula
  0.2.5 Characteristic and moment generating functions
0.3 Additional resources
0.4 Exercises

0.1 Matrix algebra

0.1.1 Vectors and matrices

As a shorthand notation, we shall be using X ∈ Mp,n to indicate that X is a matrix with p rows and n columns.
A notation x ∈ Rn will be used to indicate that x is an n-dimensional column vector. Of course, if x ∈ Rn, it also means that x ∈ Mn,1. Transposition will be denoted by ⊤. After a transposition, from a matrix X ∈ Mp,n we get a new matrix X⊤ ∈ Mn,p. In particular, from a column vector x ∈ Rn we arrive, after a transposition, at a row vector x⊤ ∈ M1,n. It is well known that multiplication of a matrix (vector) by a scalar means multiplication of each of the elements of the matrix (vector) by that scalar. Also, two matrices (vectors) of the same dimension can be added (subtracted), and the result is a new matrix (vector) of the same dimension whose elements are the element-wise sums (differences) of the elements of the matrices (vectors) to be added (subtracted).

The Euclidean norm of a vector x = (x1 x2 · · · xp)⊤ ∈ Rp is denoted by ∥x∥ and is defined as ∥x∥ = √(∑_{i=1}^p x_i²). The inner product or, equivalently, the scalar product of two p-dimensional vectors x and y is denoted and defined in the following way:

⟨x, y⟩ = x⊤y = ∑_{i=1}^p x_i y_i  (0.1)

Obviously, the relation ∥x∥² = ⟨x, x⟩ holds. It is well known that if θ is the angle between two p-dimensional vectors x and y then it also holds that

⟨x, y⟩ = ∥x∥ ∥y∥ cos(θ)  (0.2)

Since |cos(θ)| ≤ 1, we have the inequality

|⟨x, y⟩| ≤ ∥x∥ ∥y∥,

which is one variant of the Cauchy–Bunyakovsky–Schwarz Inequality. Further, if we want to orthogonally project the vector x ∈ Rp on the vector y ∈ Rp then (having in mind the geometric interpretation of orthogonal projection) the result will be (x⊤y / y⊤y) y.

Finally, the rules for matrix multiplication are recalled: if X ∈ Mp,k and Y ∈ Mk,n (i.e. the number of columns in X is equal to the number of rows in Y) then the multiplication XY is possible, and the result is a matrix Z = XY ∈ Mp,n with elements

z_{i,j} = ∑_{m=1}^k x_{i,m} y_{m,j}, i = 1, 2, . . . , p, j = 1, 2, . . . , n,  (0.3)

i.e.
the element in the ith row and jth column of Z is the scalar product of the ith row of X and the jth column of Y. Note that the multiplication of matrices is not commutative, and in general it is not necessary for YX to even exist when XY exists. When the matrices are both square of the same dimension p (i.e. both X ∈ Mp,p and Y ∈ Mp,p) then both XY and YX will be defined, but they would in general not give rise to the same result. The following transposition rule is important to mention (and easy to check): if X ∈ Mp,k and Y ∈ Mk,n then the product XY exists and it holds that

(XY)⊤ = Y⊤X⊤  (0.4)

One should be very careful with transposition, though, in order to avoid silly mistakes. If x ∈ Rp, for example, both x⊤x and xx⊤ exist. While the former is a scalar, the latter belongs to Mp,p!

A square matrix X ∈ Mp,p is called symmetric if x_{i,j} = x_{j,i} holds for i = 1, 2, . . . , p and j = 1, 2, . . . , p. For such a matrix, we have X⊤ = X. The p × p square matrix I with entries δ_{ij}, i = 1, 2, . . . , p, j = 1, 2, . . . , p (i.e., ones on the diagonal and zeros outside the diagonal) is called the identity matrix (of dimension p). Obviously, when the multiplication is possible, XI = X and IX = X always hold.

The trace of a square matrix X ∈ Mp,p is denoted by tr(X) = ∑_{i=1}^p x_{ii}. The following properties of traces are easy to obtain:
i) tr(X + Y) = tr(X) + tr(Y)
ii) tr(XY) = tr(YX)
iii) tr(X⁻¹YX) = tr(Y)
iv) If a ∈ Rp and X ∈ Mp,p then a⊤Xa = tr(Xaa⊤)

0.1.2 Inverse matrices

To any square matrix X ∈ Mp,p one can attach a number |X| ≡ det(X), called the determinant of the matrix. It is defined as |X| = ∑ ± x_{1i} x_{2j} · · · x_{pm}, where the summation is over all permutations (i, j, . . . , m) of the numbers (1, 2, . . . , p), taking into account the sign rule: summands with an even permutation get a (+) sign whereas the ones with an odd permutation get a (−) sign.
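As a quick numerical sanity check of the identities above — the inner product (0.1), the projection formula, the transposition rule (0.4), the trace properties, and the permutation definition of the determinant — here is a minimal sketch. (It uses Python/NumPy purely for illustration; the course's actual computing is done in R and SAS, and the numerical values are arbitrary.)

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 0.0, 4.0])

# Inner product (0.1) and the relation ||x||^2 = <x, x>
assert np.isclose(x @ y, 11.0)
assert np.isclose(np.linalg.norm(x) ** 2, x @ x)

# Cauchy–Bunyakovsky–Schwarz: |<x, y>| <= ||x|| ||y||
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)

# Orthogonal projection of x onto y: (x'y / y'y) y; the residual is orthogonal to y
proj = (x @ y) / (y @ y) * y
assert np.isclose((x - proj) @ y, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
Y = rng.standard_normal((2, 2))
a = rng.standard_normal(2)

# Transposition rule (0.4): (XY)' = Y'X'
assert np.allclose((X @ Y).T, Y.T @ X.T)

# Trace identities ii) and iv)
assert np.isclose(np.trace(X @ Y), np.trace(Y @ X))
assert np.isclose(a @ X @ a, np.trace(X @ np.outer(a, a)))

# Permutation definition of the determinant for p = 2:
# the identity permutation contributes +x11*x22, the transposition -x12*x21
assert np.isclose(np.linalg.det(X), X[0, 0] * X[1, 1] - X[0, 1] * X[1, 0])
```

Every assertion is an exact algebraic identity, so each holds up to floating-point tolerance for any choice of x, y, X, Y and a.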
It can be seen that this is equivalent to another, recursive, definition, namely:
• when p = 1 (scalar case), X = a is just a number, and |X| = a in this case;
• when p = 2,
  |X| = x11 x22 − x12 x21;
• when p = 3, the following rule applies:
  |X| = x11 x22 x33 + x12 x23 x31 + x21 x32 x13 − x31 x22 x13 − x11 x23 x32 − x12 x21 x33  (0.5)
• recursively, for X ∈ Mp,p,
  |X| = ∑_i (−1)^{i+j} x_{ij} |X_{ij}| = ∑_j (−1)^{i+j} x_{ij} |X_{ij}|,
  where X_{ij} denotes the matrix we get by deleting the ith row and jth column of X, and |X_{ij}| is therefore the (i, j)th minor of X. (In the first sum j is any fixed column; in the second, i is any fixed row.)

Here we list some elementary properties of determinants that follow directly from the definition:
i) If one row or one column of the matrix contains zeros only, then the value of the determinant is zero.
ii) |X⊤| = |X|
iii) If one row (or one column) of the matrix is multiplied by a scalar c then so is the value of the determinant.
iv) |cX| = c^p |X|
v) If X, Y ∈ Mp,p then |XY| = |X||Y|
vi) If the matrix X is diagonal (i.e. all non-diagonal elements are zero) then |X| = ∏_{i=1}^p x_{ii}. In particular, the determinant of the identity matrix is always equal to one.

Given that |X| ≠ 0 (or, equivalently, that the matrix X ∈ Mp,p is nonsingular), an inverse matrix X⁻¹ ∈ Mp,p can be defined that has to satisfy XX⁻¹ = I_{p,p}. It is easy to check that the inverse X⁻¹ has as its (j, i)th entry (−1)^{i+j} |X_{ij}| / |X|, where |X_{ij}| is, as before, the (i, j)th minor of X. Some elementary properties of inverses follow:
i) XX⁻¹ = X⁻¹X = I
ii) (X⁻¹)⊤ = (X⊤)⁻¹
iii) (XY)⁻¹ = Y⁻¹X⁻¹ when both X and Y are nonsingular square matrices of the same dimension.
iv) |X⁻¹| = |X|⁻¹
v) If X is diagonal and nonsingular then all its diagonal elements are nonzero, and X⁻¹ is again diagonal with diagonal elements equal to 1/x_{ii}, i = 1, 2, . . . , p.

0.1.3 Rank

A set of vectors x1, x2, . . .
, xk ∈ Rn is linearly dependent if there exist k numbers a1, a2, . . . , ak, not all zero, such that

a1 x1 + a2 x2 + · · · + ak xk = 0  (0.6)

holds. Otherwise the vectors are linearly independent. In particular, for k linearly independent vectors the equality (0.6) is only possible if all the numbers a1, a2, . . . , ak are zero. The row rank of a matrix is the maximum number of linearly independent row vectors. The column rank is the rank of its set of column vectors. It turns out that the row rank and the column rank of a matrix are always equal. Thus the rank of a matrix X (denoted rk(X)) is either the row or the column rank. If X ∈ Mp,n and rk(X) = min(p, n), we say that the matrix is of full rank. In particular, a square matrix A ∈ Mp,p is of full rank if rk(A) = p. As is well known from the basic theorem of linear algebra (the Kronecker–Capelli, or Rouché–Capelli, Theorem), this also means that |A| ≠ 0 when A is of full rank. Then the inverse of A will also exist. Let b ∈ Rp be a given vector. Then the linear equation system Ax = b has a unique solution x = A⁻¹b ∈ Rp.

0.1.4 Orthogonal matrices

A square matrix X ∈ Mp,p is orthogonal if XX⊤ = X⊤X = I_{p,p} holds. The following properties of orthogonal matrices are obvious:
i) X is of full rank (rk(X) = p) and X⁻¹ = X⊤.
ii) The name "orthogonal" originates from the fact that the scalar product of any two different column vectors of the matrix equals zero. The same holds for the scalar product of any two different row vectors. The norm of each column vector (and each row vector) is equal to one. These properties are equivalent to the definition.
iii) |X| = ±1

0.1.5 Eigenvalues and eigenvectors

For any square matrix X ∈ Mp,p we can define the characteristic polynomial equation of degree p,

f(λ) = |X − λI| = 0.  (0.7)

Equation (0.7) is a polynomial equation of degree p, so it has exactly p roots. In general, some of them may be complex and some may coincide.
Since the coefficients are real, if there is a complex root of (0.7) then its complex conjugate must also be a root of the same equation. These p roots λ1, . . . , λp are the eigenvalues of X; denote any one of them by λ∗. In addition, tr(X) = ∑_{i=1}^p λ_i and |X| = ∏_{i=1}^p λ_i. Obviously, the matrix X − λ∗I is singular (its determinant is zero). Then, according to the Kronecker theorem, there exists a non-zero vector y ∈ Rp such that (X − λ∗I)y = 0, 0 ∈ Rp. We call y an eigenvector of X that corresponds to the eigenvalue λ∗. Note that the eigenvector is not uniquely defined: µy for any real non-zero µ would also be an eigenvector corresponding to the same eigenvalue. Sparing some details of the derivation, we shall formulate the following basic result:

Theorem 0.1. When the matrix X is real symmetric, all of its p eigenvalues are real. If the eigenvalues are all different, then the p eigenvectors that correspond to them are orthogonal (and hence form a basis in Rp). These eigenvectors are also unique (up to the norming constant µ above). If some of the eigenvalues coincide, then the eigenvectors corresponding to them are not necessarily unique, but even in this case they can be chosen to be mutually orthogonal.

For each of the p eigenvalues λ_i, i = 1, 2, . . . , p, of X, denote its corresponding set of mutually orthogonal eigenvectors of unit length by e_i, i = 1, 2, . . . , p, i.e.

Xe_i = λ_i e_i, i = 1, 2, . . . , p, ∥e_i∥ = 1, e_i⊤e_j = 0, i ≠ j

holds. Then it can be shown that the following decomposition (the spectral decomposition) of any symmetric matrix X ∈ Mp,p holds:

X = λ1 e1 e1⊤ + λ2 e2 e2⊤ + · · · + λp ep ep⊤.  (0.8)

Equivalently, X = PΛP⊤, where Λ = diag(λ1, . . . , λp) is diagonal and P ∈ Mp,p is an orthogonal matrix whose columns are the p orthogonal eigenvectors e1, e2, . . . , ep. The above decomposition is a very important analytical tool. One of its most widely used applications is for defining a square root of a symmetric positive definite matrix.
A symmetric matrix X ∈ Mp,p is positive definite if all of its eigenvalues are positive. (It is called non-negative definite if all eigenvalues are ≥ 0.) For a symmetric positive definite matrix, all the λ_i, i = 1, 2, . . . , p, in the spectral decomposition (0.8) are positive. But then

X⁻¹ = (P⊤)⁻¹Λ⁻¹P⁻¹ = PΛ⁻¹P⊤ = ∑_{i=1}^p (1/λ_i) e_i e_i⊤  (0.9)

(i.e. inverting X is very easy if the spectral decomposition of X is known). Moreover, we can define the square root of the symmetric non-negative definite matrix X in a natural way:

X^{1/2} = ∑_{i=1}^p √λ_i e_i e_i⊤  (0.10)

The definition (0.10) makes sense since X^{1/2} X^{1/2} = X holds. Note that X^{1/2} is also symmetric and non-negative definite. Also, X^{−1/2} = ∑_{i=1}^p λ_i^{−1/2} e_i e_i⊤ = PΛ^{−1/2}P⊤ can be defined, where Λ^{−1/2} is a diagonal matrix with λ_i^{−1/2}, i = 1, 2, . . . , p, as its diagonal elements. These facts will be used extensively in the subsequent sections. As an illustration of the usefulness of the spectral decomposition approach we shall show the following statement:

Example 0.2. Let X ∈ Mp,p be a symmetric positive definite matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and associated eigenvectors of unit length e1, e2, . . . , ep. Show that:
• max_{y ≠ 0} y⊤Xy / y⊤y = λ1, attained when y = e1;
• min_{y ≠ 0} y⊤Xy / y⊤y = λp, attained when y = ep.

Let X = PΛP⊤ be the decomposition (0.8) for X. Denote z = P⊤y. Note that y ≠ 0 implies z ≠ 0, and y⊤y = y⊤PP⊤y = z⊤z since P is orthogonal. Thus

y⊤Xy / y⊤y = y⊤PΛP⊤y / y⊤y = z⊤Λz / z⊤z = (∑_{i=1}^p λ_i z_i²) / (∑_{i=1}^p z_i²) ≤ λ1 (∑_{i=1}^p z_i²) / (∑_{i=1}^p z_i²) = λ1.

If we take y = e1 then, having in mind the structure of the matrix P, we have z = P⊤e1 = (1 0 · · · 0)⊤, and for this choice of y also z⊤Λz / z⊤z = λ1/1 = λ1. The first part of the exercise is shown. Similar arguments (just changing the direction of the inequality) apply to show the second part. In addition, you can try to show that max_{y ≠ 0, y ⊥ e1} y⊤Xy / y⊤y = λ2 holds. How?
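The spectral-decomposition formulas (0.8)–(0.10), the identities tr(X) = ∑ λ_i and |X| = ∏ λ_i, and the Rayleigh-quotient bounds of Example 0.2 can all be verified numerically. Below is a NumPy sketch (illustrative only; the course itself computes in R and SAS), using an arbitrarily chosen positive definite matrix:

```python
import numpy as np

# An arbitrary symmetric positive definite matrix for illustration
X = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, P = np.linalg.eigh(X)   # eigh is for symmetric matrices; columns of P are the e_i

# Spectral decomposition (0.8): X = P Lambda P'
assert np.allclose(P @ np.diag(lam) @ P.T, X)

# tr(X) = sum of the eigenvalues, |X| = product of the eigenvalues
assert np.isclose(np.trace(X), lam.sum())
assert np.isclose(np.linalg.det(X), lam.prod())

# Inverse via (0.9) and square root via (0.10)
assert np.allclose(P @ np.diag(1 / lam) @ P.T, np.linalg.inv(X))
Xhalf = P @ np.diag(np.sqrt(lam)) @ P.T
assert np.allclose(Xhalf @ Xhalf, X)

# Rayleigh quotient bounds of Example 0.2: the quotient always lies in [lam_p, lam_1]
rng = np.random.default_rng(1)
for _ in range(100):
    y = rng.standard_normal(2)
    q = y @ X @ y / (y @ y)
    assert lam.min() - 1e-12 <= q <= lam.max() + 1e-12

# ... and the maximum is attained at the corresponding unit eigenvector
e1 = P[:, np.argmax(lam)]
assert np.isclose(e1 @ X @ e1, lam.max())   # ||e1|| = 1, so the quotient is e1'X e1
```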
0.1.6 Cholesky Decomposition

Computers perform arithmetic to a finite precision, typically around 16 significant decimal figures. Furthermore, the numbers are expressed internally in scientific notation, so the absolute magnitude of a number typically has little effect on precision, but certain operations on numbers with very different magnitudes can sometimes produce severe rounding errors. For example, to a computer 1 × 10^18 + 1 × 10^0 = 1,000,000,000,000,000,000 + 1 = 1,000,000,000,000,000,000: the 1 gets lost to a rounding error.

When it comes to matrix inversion in particular, the key number is the condition number λ1/λp of a positive definite matrix X, where λ1 is the largest eigenvalue of X and λp is the smallest. (The definition for non-positive-definite matrices is different.) The higher this number is, the less numerically stable the inversion is likely to be. (Notice that if the matrix is singular, this number is infinite.) We generally try to avoid asking the computer to invert matrices in ways that lose precision.

An alternative, more numerically stable definition of a "matrix square root" is the Cholesky decomposition. For a symmetric positive definite matrix X ∈ Mp,p, there exists a unique upper-triangular matrix U ∈ Mp,p with positive diagonal such that U⊤U = X holds. Note that many sources use a lower-triangular matrix L such that LL⊤ = X instead. It is easy to see that L ≡ U⊤; which definition is used is arbitrary, provided it is used consistently, since in general UU⊤ ≠ X and L⊤L ≠ X. For example, the Wikipedia article uses L, whereas the R built-in function chol() and SAS/IML's root(x) both return U. This decomposition is particularly useful for generating correlated variables.

0.1.7 Orthogonal Projection

Orthogonal projection of any vector y ∈ Rn onto the space L(X) spanned by the columns of the matrix X ∈ Mn,p is a linear operation.
Hence the result is a vector z ∈ Rn that has the representation z = Py, where the matrix P ∈ Mn,n is called the (orthogonal) projector. Since z ∈ L(X) (being a projection into this space), the projection of z onto L(X) is z itself. Hence Py = z = Pz = PPy = P²y, so (P − P²)y = 0 and therefore P² = P (since y ∈ Rn is arbitrary). That is, P must be idempotent. Further, (y − z)⊤z = 0, i.e. y⊤(P⊤ − I)Py = 0 for all y, hence (P⊤ − I)P = 0, i.e. P⊤P = P. Taking transposes, P⊤P = P⊤, so P = P⊤; that is, P is symmetric. So, the orthogonal projector is a symmetric and idempotent matrix.

Vice versa, consider a symmetric and idempotent matrix P. If we take any y ∈ Rn, then for z = Py we get Pz = P²y = Py, so P(y − z) = 0 (and also P⊤(y − z) = 0, since P = P⊤). Consider L(P) (the space generated by the rows/columns of P). Now: z = Py implies z ∈ L(P), and P⊤(y − z) = 0 means that y − z is perpendicular to L(P). Hence Py is the projection of y onto L(P). Hence, we have seen that P ∈ Mn,n is an orthogonal projection matrix if and only if it is a symmetric and idempotent matrix.

Also, if P is an orthogonal projection onto a given linear space M of dimension dim(M), then I − P is an orthogonal projection onto the orthocomplement of M. It holds that rk(P) = dim(M). Further, it can be seen that the rank of an orthogonal projector is equal to the sum of its diagonal elements. Finally, it can be shown that if the matrix X above has full rank then the projector is P_{L(X)} = X(X⊤X)⁻¹X⊤. If the matrix X is not of full rank, then the generalised inverse (X⊤X)⁻ of X⊤X can be defined instead. Note that the generalised inverse may not be uniquely defined, but no matter which version of it has been chosen, the matrix X(X⊤X)⁻X⊤ is uniquely defined and is the orthogonal projector onto the space L(X) spanned by the columns of X also in cases when the rank of X is not full.
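The last two subsections both lend themselves to a quick numerical check: the rounding example and the Cholesky factorisation from Section 0.1.6, and the projector properties from Section 0.1.7. A NumPy sketch (purely illustrative; the course's computing is in R and SAS, e.g. via chol() or root(x), and the matrices below are arbitrary):

```python
import numpy as np

# Finite precision: the added 1 is lost to rounding
assert 1e18 + 1.0 == 1e18

# Cholesky factorisation of a symmetric positive definite matrix.
# NumPy's cholesky() returns the lower-triangular L with L L' = X;
# R's chol() and SAS/IML's root(x) return U = L' with U'U = X instead.
X = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = np.linalg.cholesky(X)
assert np.allclose(L @ L.T, X)
assert np.allclose(L, np.tril(L))          # L is lower triangular

# Condition number lambda_1 / lambda_p of a positive definite matrix
lam = np.linalg.eigvalsh(X)
assert np.isclose(np.linalg.cond(X), lam.max() / lam.min())

# Generating correlated variables: if Z has identity covariance,
# Var(LZ) = L I L' = X, so the rows of L @ Z have covariance approximately X
rng = np.random.default_rng(3)
Z = rng.standard_normal((2, 100_000))
assert np.allclose(np.cov(L @ Z), X, atol=0.1)

# Orthogonal projector onto L(M): P = M (M'M)^{-1} M'
M = rng.standard_normal((6, 2))            # full column rank with probability 1
P = M @ np.linalg.inv(M.T @ M) @ M.T
assert np.allclose(P, P.T)                 # symmetric
assert np.allclose(P @ P, P)               # idempotent
assert np.isclose(np.trace(P), 2.0)        # rank = trace = dim L(M)
assert np.linalg.matrix_rank(P) == 2

# The residual y - Py is perpendicular to L(M)
y = rng.standard_normal(6)
assert np.allclose(M.T @ (y - P @ y), 0.0)
```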
0.2 Standard facts about multivariate distributions

0.2.1 Random samples in multivariate analysis

In order to study the sampling variability of statistics, with the ultimate goal of making inferences, one needs to make some assumptions about the random variables whose values constitute the data set X ∈ Mp,n in (1.1). Suppose the data have not been observed yet, but we intend to collect n sets of measurements on p variables. Since the actual observations cannot be predicted before the measurements are made, we treat them as random variables. Each set of p measurements can be considered a realisation of a p-dimensional random vector, and we have n independent realisations of such random vectors X_i, i = 1, 2, . . . , n, so we have the random matrix X ∈ Mp,n whose (i, j)th entry X_{ij} is the ith variable measured on the jth case:

X = [X_1, X_2, . . . , X_n]  (0.11)

The vectors X_i, i = 1, 2, . . . , n, are considered independent observations of a p-dimensional random vector. We start by discussing the distribution of such a vector.

0.2.2 Joint, marginal, conditional distributions

A random vector X = (X1 X2 · · · Xp)⊤ ∈ Rp, p ≥ 2, has a joint cdf

F_X(x) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xp ≤ xp) = F_X(x1, x2, . . . , xp).

In the case of a discrete vector of observations X, the probability mass function is defined as P_X(x) = P(X1 = x1, X2 = x2, . . . , Xp = xp). If a density f_X(x) = f_X(x1, x2, . . . , xp) exists such that

F_X(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} f_X(t) dt1 . . . dtp  (0.12)

then X is a continuous random vector with a joint density function of p arguments, f_X(x). From (0.12) we see that in this case f_X(x) = ∂^p F_X(x) / (∂x1 ∂x2 · · · ∂xp) holds.

The marginal cdf of the first k < p components of the vector X is defined in a natural way as follows:

P(X1 ≤ x1, . . . , Xk ≤ xk) = P(X1 ≤ x1, . . . , Xk ≤ xk, Xk+1 ≤ ∞, . . . , Xp ≤ ∞) = F_X(x1, x2, . . . , xk, ∞, ∞, . . . , ∞)  (0.13)

The marginal density of the first k components can be obtained by partial differentiation in (0.13), and we arrive at

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f_X(x1, x2, . . . , xp) dx_{k+1} . . . dx_p.

For any other subset of k < p components of the vector X, their marginal cdf and density can be obtained along the same lines. In particular, each component X_i has marginal cdf F_{X_i}(x_i), i = 1, 2, . . . , p. The conditional density of X1, . . . , Xr when X_{r+1} = x_{r+1}, . . . , Xp = xp is defined by

f_{(X1,...,Xr | X_{r+1},...,Xp)}(x1, . . . , xr | x_{r+1}, . . . , xp) = f_X(x) / f_{X_{r+1},...,Xp}(x_{r+1}, . . . , xp)  (0.14)

The above conditional density is interpreted as the joint density of X1, . . . , Xr when X_{r+1} = x_{r+1}, . . . , Xp = xp, and is only defined when f_{X_{r+1},...,Xp}(x_{r+1}, . . . , xp) ≠ 0. In case X has p independent components,

F_X(x) = F_{X1}(x1) F_{X2}(x2) · · · F_{Xp}(xp)  (0.15)

holds and, equivalently, so do

P_X(x) = P_{X1}(x1) P_{X2}(x2) · · · P_{Xp}(xp), f_X(x) = f_{X1}(x1) f_{X2}(x2) · · · f_{Xp}(xp)  (0.16)

We note that in the case of mutual independence of the p components, all conditional distributions do not depend on the conditions, and the factorisations F_X(x) = ∏_{i=1}^p F_{Xi}(xi), f_X(x) = ∏_{i=1}^p f_{Xi}(xi) hold.

0.2.3 Moments

Given the density f_X(x) of the random vector X, the joint moments of order s1, s2, . . . , sp are defined, in analogy to the univariate case, as

E(X1^{s1} · · · Xp^{sp}) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x1^{s1} · · · xp^{sp} f_X(x1, . . . , xp) dx1 . . . dxp  (0.17)

Note that if some of the s_i in (0.17) are equal to zero then in effect we are calculating the joint moment of a subset of the p random variables. Now, let X ∈ Rp and Y ∈ Rq with densities as above. The following moments are commonly used:

Expectation: µ_X = E(X) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x f_X(x1, . . . , xp) dx1 . . . dxp ∈ Rp.

Variance–covariance matrix (a.k.a. variance or covariance matrix): Σ_X = Var(X) = Cov(X) = E(X − µ_X)(X − µ_X)⊤ = E XX⊤ − µ_X µ_X⊤ ∈ Mp,p, with (i, k)th entry σ_{ik}.

Covariance matrix: Σ_{X,Y} = Cov(X, Y) = E(X − µ_X)(Y − µ_Y)⊤ = E XY⊤ − µ_X µ_Y⊤ ∈ Mp,q, with (i, j)th entry σ_{X_i Y_j}.

Let A ∈ M_{p′,p} and B ∈ M_{q′,q} be fixed and known. Then:
• µ_{AX} = Aµ_X ∈ R^{p′}
• Σ_{AX} = AΣ_X A⊤ ∈ M_{p′,p′}
• Σ_{AX,BY} = AΣ_{X,Y} B⊤ ∈ M_{p′,q′}

As a corollary, if X′, Y′, A′ and B′ are variables and matrices with the same dimensions as the originals (but possibly different distributions and values),
• E(AX + A′X′) = Aµ_X + A′µ_{X′}
• Var(AX + A′X′) = AΣ_X A⊤ + AΣ_{X,X′}(A′)⊤ + A′Σ_{X′,X}A⊤ + A′Σ_{X′}(A′)⊤
• Cov(AX + A′X′, BY + B′Y′) = AΣ_{X,Y}B⊤ + AΣ_{X,Y′}(B′)⊤ + A′Σ_{X′,Y}B⊤ + A′Σ_{X′,Y′}(B′)⊤

These identities are also useful when p = p′ = q = q′ = 1 (i.e., scalars).

0.2.4 Density transformation formula

Assume the p existing random variables X1, X2, . . . , Xp with given density f_X(x) have been transformed by a smooth (i.e. differentiable) one-to-one transformation into p new random variables Y1, Y2, . . . , Yp, i.e. a new random vector Y ∈ Rp has been created by calculating

Y_i = y_i(X1, X2, . . . , Xp), i = 1, 2, . . . , p  (0.18)

The question is how to calculate the density g_Y(y) of Y knowing the transformation functions y_i(X1, X2, . . . , Xp), i = 1, 2, . . . , p, and the density f_X(x) of the original random vector. Naturally, since the transformation (0.18) is assumed to be one-to-one, its inverse transformation X_i = x_i(Y1, Y2, . . . , Yp), i = 1, 2, . . . , p, also exists, and then the following density transformation formula applies:

f_Y(y1, . . . , yp) = f_X[x1(y1, . . . , yp), . . . , xp(y1, . . . , yp)] |J(y1, . . . , yp)|  (0.19)

where J(y1, . . . , yp) is the Jacobian of the transformation: the determinant of the p × p matrix of partial derivatives,

J(y1, . . . , yp) = |∂x/∂y| ≡ det( ∂x_i/∂y_j )_{i,j = 1,...,p} ≡ |∂y/∂x|⁻¹  (0.20)

Note that in (0.19) the absolute value of the Jacobian is substituted.
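The moment rules of Section 0.2.3 (µ_{AX} = Aµ_X, Σ_{AX} = AΣ_X A⊤, Σ_{AX,BY} = AΣ_{X,Y}B⊤) are exact matrix identities, so they also hold exactly for the sample moments of any finite data set — which gives a convenient numerical check. A NumPy sketch (illustrative only; the data and transformation matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
D = rng.standard_normal((3, 50))      # 50 observations (columns) of a 3-dimensional vector
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])
B = rng.standard_normal((2, 3))

# Sigma_{AX} = A Sigma_X A': holds exactly for the sample covariance too
S = np.cov(D)
assert np.allclose(np.cov(A @ D), A @ S @ A.T)

# mu_{AX} = A mu_X for the sample mean
assert np.allclose((A @ D).mean(axis=1), A @ D.mean(axis=1))

# Sigma_{AX,BY} = A Sigma_{X,Y} B', here with Y = X, so Sigma_{X,Y} = S
C = np.cov(A @ D, B @ D)              # 4x4 covariance of the stacked rows [AD; BD]
assert np.allclose(C[:2, 2:], A @ S @ B.T)
```

Because the identities are algebraic rather than asymptotic, the assertions pass up to floating-point tolerance regardless of the sample size or the distribution of the data.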
0.2.5 Characteristic and moment generating functions

The characteristic function (cf) φ(t) of the random vector X ∈ Rp is a function of a p-dimensional argument. For any real vector t = (t1 t2 · · · tp)⊤ ∈ Rp it is defined as φ_X(t) = E(e^{it⊤X}), where i = √−1. Note that the cf always exists, since |φ_X(t)| ≤ E(|e^{it⊤X}|) = 1 < ∞. Perhaps simpler (since it does not involve complex numbers) is the notion of the moment generating function (mgf). It is defined as M_X(t) = E(e^{t⊤X}). Note, however, that in some cases the mgf may not exist for values of t further away from the zero vector.

Characteristic functions are in one-to-one correspondence with distributions, and this is the reason to use them as machinery to operate with in cases where direct operation with the distribution is not very convenient. In fact, when the density exists, under mild conditions the following inversion formulas hold for one-dimensional random variables and random vectors, respectively:

f_X(x) = (1/2π) ∫_{−∞}^{+∞} e^{−itx} φ_X(t) dt,  f_X(x) = (2π)^{−p} ∫_{Rp} e^{−it⊤x} φ_X(t) dt.

One important property of the cf is the following:

Theorem 0.3. If the cf φ_X(t) of the random vector X ∈ Rp is given and Y = AX + b, b ∈ Rq, A ∈ M_{q,p}, is a linear transformation of X ∈ Rp into a new random vector Y ∈ Rq, then for all s ∈ Rq it holds that φ_Y(s) = e^{is⊤b} φ_X(A⊤s).

Proof. At lecture.

0.3 Additional resources

An alternative presentation of these concepts can be found in JW Ch. 2–3.

0.4 Exercises

Exercise 0.1
In an ecological experiment, colonies of 2 different species of insect are confined to the same habitat. The survival times of the two species (in days) are random variables X1 and X2, respectively. It is thought that X1 and X2 have a joint density of the form

f_X(x1, x2) = θ x1 e^{−x1(θ + x2)}  (0 < x1, x2)

for some constant θ > 0.

(a) Show that f_X(x1, x2) is a valid density.
(b) Find the probability that both species die out within t days of the start of the experiment.
(c) Derive the marginal density of X1. Identify this distribution and write down E(X1) and Var(X1). (d) Derive the marginal density of X2, and the conditional density of X2 given X1 = x1. (e) What evidence do you now have that X1 and X2 are not independent? Exercise 0.2 Let X = [X1, X2] ⊤ a random vector with E(X) = µ and Var(X) = Σ = σ2 ( 1 ρ ρ 1 ) . (a) Find Cov(X1 −X2, X1 +X2). (b) Find Cov(X1, X2 − ρX1). (c) Choose b to minimise Var(X2 − bX1). Exercise 0.3 Suppose X is a p-dimensional random vector with cf φX(t). If X is partitioned as [ X(1) X(2) ] , where X(1) is a p1-dimensional subvector, then show that (a) X(1) has cf φ (1) X (t(1)) = φX {[ t(1) 0 ]} , t(1) ∈ Rp1 . (b) X(1) and X(2) are independent if and only if φX(t) = φX {[ t(1) 0 ]} φX {[ 0 t(2) ]} , ∀t(1) ∈ Rp1 , ∀t(2)ϵRp−p1 . Exercise 0.4 Let X ∈Mp,p is a symmetric positive definite matrix with eigenvalues λ1 ≥ λ2 · · · ≥ λp > 0 and associated eigenvectors of unit length ei, i = 1, 2, . . . , p that give rise to the following spectral decomposition: X = λ1e1e ⊤ 1 + λ2e2e ⊤ 2 + . . . λpepe ⊤ p It is known that maxy ̸=0 y ⊤Xy y⊤y = λ1. Now, you show that maxy ̸=0,⟨y,e1⟩=0 y⊤Xy y⊤y = λ2. Can you find further generalisations of this claim? Exercise 0.5 We know that an orthogonal projection matrix has only 0 or 1 as possible eigenvalues. Using this property or otherwise, show that the rank of an orthogonal projector is equal to the sum of its diagonal elements. 14 UNSW MATH5855 2021T3 Lecture 1 Exploratory Data Analysis 1 Exploratory Data Analysis of Multivariate Data 1.1 Data organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.2 Basic summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3 Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1.1 Data organisation

Assume we are dealing with $p \ge 1$ variables. The values of these variables are all recorded for each distinct item, individual, or experimental trial. Each of these three words will sometimes be substituted by the word "case". We will use the notation $x_{ij}$ to indicate a particular value of the $i$th variable that is observed on the $j$th case. Consequently, $n$ measurements on $p$ variables can be represented in the form of a $p \times n$ matrix
\[ X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{in} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn} \end{pmatrix} \tag{1.1} \]
The matrix $X$ above contains the data consisting of all the observations on all the variables. This way of representing the data allows easy manipulations to be performed in order to obtain some simple descriptive statistics for each of the variables.

1.2 Basic summaries

For example, the sample mean of the second variable is just $\bar{x}_2 = \frac{1}{n}\sum_{j=1}^n x_{2j}$, and the sample variance of the second variable is just $s_2^2 = \frac{1}{n}\sum_{j=1}^n (x_{2j} - \bar{x}_2)^2$. (Note that for the sample variance we shall sometimes use the divisor $n-1$ rather than $n$, and each time this will be differentiated by displaying the appropriate expression.) The sample covariance (the simple measure of linear association between variables 1 and 2) is given by $s_{12} = \frac{1}{n}\sum_{j=1}^n (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2)$, and one can easily see how $s_{ik}$, $i = 1, 2, \dots, p$, $k = 1, 2, \dots, p$, can be defined. Finally, the sample correlation coefficient (the measure of linear association between two variables that does not depend on the units of measurement) can be defined. The sample correlation coefficient of the $i$th and $k$th variables is defined by
\[ r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}. \]
Because of the well-known Cauchy–Bunyakovsky–Schwarz inequality, $|r_{ik}| \le 1$ holds. Note also that $r_{ik} = r_{ki}$ for all $i = 1, 2, \dots, p$ and $k = 1, 2, \dots, p$.
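The summaries above are straightforward to compute directly from the $p \times n$ data matrix. A minimal sketch (in Python/NumPy for illustration; the course itself uses R and SAS, and the data values below are made up):

```python
import numpy as np

# A small p x n data matrix: p = 2 variables, n = 5 cases (columns are cases).
X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [2.0, 1.0, 4.0, 3.0, 5.0]])
p, n = X.shape

xbar = X.mean(axis=1)          # vector of sample means
Xc = X - xbar[:, None]         # centred data
S = Xc @ Xc.T / n              # sample covariance matrix (divisor n)
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)         # sample correlation matrix

print(xbar)  # [3. 3.]
print(S)     # [[2.  1.6], [1.6 2. ]]
print(R)     # [[1.  0.8], [0.8 1. ]]
```

Note that $R$ has unit diagonal and $|r_{ik}| \le 1$, as guaranteed by the Cauchy–Bunyakovsky–Schwarz inequality.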
It bears repeating that the sample correlations and covariances are useful only when trying to measure the linear association between two variables. Their value is less informative, and can be misleading, in cases of nonlinear association. In such cases one may need to invoke the quotient correlation instead:

Zhang, Zhengjun. Quotient correlation: A sample based alternative to Pearson's correlation. Annals of Statistics 36 (2008), no. 2, 1007–1030. doi:10.1214/009053607000000866

But because covariance and correlation coefficients are routinely calculated and analysed, they are very widely used and provide nice numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association.

The descriptive statistics that we have discussed so far are usually organised into arrays, namely:

Vector of sample means:
\[ \bar{x} = (\bar{x}_1\ \bar{x}_2\ \cdots\ \bar{x}_p)^\top \]

Matrix ($p \times p$) of sample variances and covariances:
\[ S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix} \tag{1.2} \]

Matrix ($p \times p$) of sample correlations:
\[ R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix} \tag{1.3} \]

1.3 Visualisation

Some simple characteristics of the data are worth studying before the actual multivariate analysis begins:
• drawing scatterplots of the data;
• calculating simple univariate descriptive statistics for each variable;
• calculating sample correlation and covariance coefficients; and
• linking multiple two-dimensional scatterplots.

1.4 Software

SAS In SAS, the procedures that are used for this purpose are called proc means, proc plot and proc corr. Please study their short description in the included SAS handout.

R In R, these are implemented in base::rowMeans, base::colMeans, stats::cor, graphics::plot, graphics::pairs, GGally::ggpairs. Here, the format is PACKAGE::FUNCTION, and you can learn more by running library(PACKAGE) followed by ?
FUNCTION

2 The Multivariate Normal Distribution
2.1 Definition . . . . . . . . . . 17
2.2 Properties of multivariate normal . . . . . . . . . . 21
2.3 Tests for Multivariate Normality . . . . . . . . . . 24
2.4 Software . . . . . . . . . . 25
2.5 Examples . . . . . . . . . . 25
2.6 Additional resources . . . . . . . . . . 25
2.7 Exercises . . . . . . . . . . 25

2.1 Definition

The multivariate normal (MVN) density is a generalisation of the univariate normal for $p \ge 2$ dimensions. Looking at the term $\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)$ in the exponent of the well-known formula
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \quad -\infty < x < \infty \tag{2.1} \]
for the univariate density function, a natural way to generalise this term to higher dimensions is to replace it by $(x-\mu)^\top \Sigma^{-1}(x-\mu)$. Here $\mu = EX \in \mathbb{R}^p$ is the expected value of the random vector $X \in \mathbb{R}^p$ and the matrix
\[ \Sigma = E(X-\mu)(X-\mu)^\top = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \in M_{p,p} \]
is the covariance matrix. Note that on the diagonal of $\Sigma$ we get the variances of each of the $p$ random variables, whereas $\sigma_{ij} = E[(X_i - E(X_i))(X_j - E(X_j))]$, $i \ne j$, are the covariances between the $i$th and $j$th random variables. Sometimes we will also denote $\sigma_{ii}$ by $\sigma_i^2$. Of course, the above replacement would only make sense if $\Sigma$ were positive definite. In general, however, we can only claim that $\Sigma$ is (as any covariance matrix) non-negative definite (try to prove this claim, e.g., using Example 0.2 from Section 0.1.5 or some other argument).
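The non-negative definiteness claim can be checked numerically: for any fixed vector $a$, $a^\top \Sigma a = \operatorname{Var}(a^\top X) \ge 0$, so every eigenvalue of a covariance matrix is $\ge 0$. A quick illustrative sketch (Python/NumPy, with simulated data; not part of the course's R/SAS material):

```python
import numpy as np

rng = np.random.default_rng(0)

# Any matrix of the form E[(X - mu)(X - mu)^T] is non-negative definite,
# since a^T Sigma a = Var(a^T X) >= 0 for every fixed vector a.
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))  # correlated data
Sigma_hat = np.cov(X, rowvar=False)

eigvals = np.linalg.eigvalsh(Sigma_hat)
print(eigvals)                       # all >= 0 (up to rounding error)
assert np.all(eigvals > -1e-10)

# ...and a^T Sigma a >= 0 for an arbitrary direction a:
a = rng.standard_normal(3)
assert a @ Sigma_hat @ a >= 0
```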
If $\Sigma$ is positive definite then the density of the random vector $X$ can be written as
\[ f_X(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(x-\mu)^\top \Sigma^{-1}(x-\mu)/2}, \quad -\infty < x_i < \infty,\ i = 1, 2, \dots, p. \tag{2.2} \]
It can be directly checked that the random vector $X \in \mathbb{R}^p$ has $EX = \mu$ and $E[(X-\mu)(X-\mu)^\top] = \Sigma$. Since the density is uniquely defined by the mean vector and the covariance matrix, we will denote it by $N_p(\mu, \Sigma)$.

In these notes, however, we will introduce the multivariate normal distribution not through its density formula but through more general reasoning that also covers the case of singular $\Sigma$. We will utilise the famous Cramér–Wold argument, according to which the distribution of a $p$-dimensional random vector $X$ is completely characterised by the one-dimensional distributions of all linear transformations $t^\top X$, $t \in \mathbb{R}^p$. Indeed, if we consider $E[e^{is\, t^\top X}]$ (which is assumed to be known for every scalar $s \in \mathbb{R}$ and every $t \in \mathbb{R}^p$), then we see that by substituting $s = 1$ we can get $E[e^{it^\top X}]$, which is the cf of the vector $X$ (and the latter uniquely specifies the distribution of $X$). Hence the following definition will be adopted here:

Definition 2.1. The random vector $X \in \mathbb{R}^p$ has a multivariate normal distribution if and only if (iff) every linear transformation $t^\top X$, $t \in \mathbb{R}^p$, has a univariate normal distribution.

Lemma 2.2. The characteristic function of the (univariate) standard normal random variable $X \sim N(0,1)$ is $\psi_X(t) = \exp(-t^2/2)$.

Proof. (optional, not examinable)
\begin{align*}
\psi_X(t) &= E \exp(itX) = \int_{-\infty}^{+\infty} \exp(itx)\, \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(itx - x^2/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2itx)/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2itx + (it)^2)/2 + (it)^2/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2 + (it)^2/2)\,dx \\
&= \exp(-t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2)\,dx \\
&= \exp(-t^2/2) \lim_{h\to\infty} \int_{-h}^{+h} \frac{1}{\sqrt{2\pi}} \exp(-(x - it)^2/2)\,dx.
\end{align*}
The change of variable $z = x - it$, $x = z + it$, $dx = dz$ results in
\[ \psi_X(t) = \exp(-t^2/2) \lim_{h\to\infty} \int_{-h+it}^{+h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\,dz. \]
The remaining integral is over a complex domain, so we must use Cauchy's theorem: contour integration over the closed contour $+h+it \to +h \to -h \to -h+it \to +h+it$ yields 0, so
\[ \int_{+h+it}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz + \int_{+h}^{-h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz + \int_{-h}^{-h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz + \int_{-h+it}^{+h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = 0 \]
for any real $h$ and $t$. Solving for the integral of interest and taking the limit,
\[ \lim_{h\to\infty} \int_{-h+it}^{+h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = -\lim_{h\to\infty} \int_{+h+it}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz + \underbrace{\lim_{h\to\infty} \int_{-h}^{+h} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz}_{=1} - \lim_{h\to\infty} \int_{-h}^{-h+it} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz, \]
since the standard normal density integrates to 1. Lastly, consider $\lim_{h\to\infty} \int_{+h+it}^{+h} \exp(-z^2/2)\,dz$: the change of variable $y = (z - h)/i$, $z = h + iy$, $dz = i\,dy$ gives
\begin{align*}
\lim_{h\to\infty} \int_{+h+it}^{+h} \exp(-z^2/2)\,dz &= \lim_{h\to\infty} \int_{t}^{0} \exp(-(h + iy)^2/2)\, i\,dy \\
&= \int_{t}^{0} \lim_{h\to\infty} \exp(-(h^2 + 2ihy - y^2)/2)\, i\,dy \\
&= \int_{t}^{0} \lim_{h\to\infty} \exp(-h^2/2) \exp(-ihy) \exp(y^2/2)\, i\,dy = \int_{t}^{0} 0\,dy = 0,
\end{align*}
since the factor $\exp(-h^2/2) \to 0$ dominates on the bounded range of $y$. Analogously, $\lim_{h\to\infty} \int_{-h}^{-h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\,dz = 0$, leaving
\[ \lim_{h\to\infty} \int_{-h+it}^{+h+it} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\,dz = 1 \quad \text{and} \quad \psi_X(t) = \exp(-t^2/2). \]

Aside: The mgf $M_X(t) = E\exp(tX)$ can also be derived and used in the argument below; however, cfs are more general, so they are preferred when possible. We show the (optional, not examinable) derivation here. We begin by completing the square:
\begin{align*}
M_X(t) &= E\exp(tX) = \int_{-\infty}^{+\infty} \exp(tx)\, \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2tx)/2)\,dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x^2 - 2tx + t^2)/2 + t^2/2)\,dx \\
&= \exp(t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-(x - t)^2/2)\,dx.
\end{align*}
The change of variable $z = x - t$, $x = z + t$, $dx = dz$ results in
\[ M_X(t) = \exp(t^2/2) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\,dz = \exp(t^2/2), \]
since the integrand is just a standard normal density.

Theorem 2.3.
Suppose that for a random vector $X \in \mathbb{R}^p$ with a normal distribution according to Definition 2.1 we have $E(X) = \mu$ and $D(X) = E[(X-\mu)(X-\mu)^\top] = \Sigma$. Then:

i) For any fixed $t \in \mathbb{R}^p$, $t^\top X \sim N(t^\top \mu, t^\top \Sigma t)$, i.e. $t^\top X$ has a one-dimensional normal distribution with expected value $t^\top \mu$ and variance $t^\top \Sigma t$.

ii) The cf of $X \in \mathbb{R}^p$ is
\[ \varphi_X(t) = e^{it^\top \mu - \frac{1}{2} t^\top \Sigma t}. \tag{2.3} \]

Proof. Part i) is obvious. For part ii) we recall from Lemma 2.2 that the cf of the standard univariate normal random variable $Z$ is $e^{-t^2/2}$. Since any $U \sim N_1(\mu_1, \sigma_1^2)$ has a distribution that coincides with the distribution of $\mu_1 + \sigma_1 Z$, we have:
\[ \varphi_U(t) = e^{it\mu_1}\, \varphi_{\sigma_1 Z}(t) = e^{it\mu_1}\, E(e^{it\sigma_1 Z}) = e^{it\mu_1}\, \varphi_Z(t\sigma_1) = e^{it\mu_1 - \frac{1}{2} t^2 \sigma_1^2}. \]
But then, for the univariate random variable $t^\top X \sim N_1(t^\top \mu, t^\top \Sigma t)$ we would have as characteristic function (in the scalar argument $s$)
\[ \varphi_{t^\top X}(s) = e^{is\, t^\top \mu - \frac{1}{2} s^2\, t^\top \Sigma t}. \]
Substituting $s = 1$ in the latter formula, we find that $\varphi_X(t) = e^{it^\top \mu - \frac{1}{2} t^\top \Sigma t}$.

As an upshot, we see that given the expected value vector $\mu$ and the covariance matrix $\Sigma$, we can use the cf formula (2.3) rather than the density formula (2.2) to define the $p$-dimensional multivariate normal distribution. The advantage of the former over the latter is that (2.3) uses only $\Sigma$ (not $\Sigma^{-1}$), i.e. this definition also makes sense in cases of singular (i.e. non-invertible) $\Sigma$. We still want to know that in the case of non-singular $\Sigma$, the more general definition gives rise to the density (2.2). This is the content of the next theorem.

Theorem 2.4. Assume the matrix $\Sigma$ in (2.3) is nonsingular. Then the density of the random vector $X \in \mathbb{R}^p$ with cf as in (2.3) is given by (2.2).

Proof. Consider the vector $Y \in \mathbb{R}^p$ such that $Y = \Sigma^{-1/2}(X - \mu)$ (compare (0.10) in Section 0.1.5). Since obviously $E(Y) = 0$ and
\[ D(Y) = E(YY^\top) = \Sigma^{-1/2}\, E[(X-\mu)(X-\mu)^\top]\, \Sigma^{-1/2} = I_p \]
holds, we can substitute to get the cf of $Y \in \mathbb{R}^p$: $\varphi_Y(t) = e^{-\frac{1}{2}\sum_{i=1}^p t_i^2}$.
But the latter can be seen directly to be the characteristic function of a vector of $p$ independent standard normal variables. Hence, from the relation $Y = \Sigma^{-1/2}(X - \mu)$ we can also conclude that $X = \mu + \Sigma^{1/2} Y$, where the density of $Y$ is
\[ f_Y(y) = \frac{1}{(2\pi)^{p/2}}\, e^{-\frac{1}{2}\sum_{i=1}^p y_i^2}. \]
In other words, $X$ is a linear transformation of $Y$, where the density of $Y$ is known. We can therefore apply the density transformation approach (Section 0.2.4 of this lecture) to obtain:
\[ f_X(x) = f_Y(\Sigma^{-1/2}(x - \mu))\, |J(x_1, \dots, x_p)|. \]
It is easy to see (because of the linearity of the transformation) that $|J(x_1, \dots, x_p)| = |\Sigma^{-1/2}| = |\Sigma^{1/2}|^{-1}$. Taking into account that
\[ \sum_{i=1}^p y_i^2 = y^\top y = (x - \mu)^\top \Sigma^{-1/2} \Sigma^{-1/2} (x - \mu) = (x - \mu)^\top \Sigma^{-1}(x - \mu), \]
we finally arrive at the density formula (2.2) for $f_X(x)$.

2.2 Properties of multivariate normal

The following properties of the multivariate normal can be easily derived using the machinery developed so far:

Property 1 If $\Sigma = D(X) = \Lambda$ is a diagonal matrix then the $p$ components of $X$ are independent. (Indeed, in this case $\varphi_X(t) = e^{\sum_{j=1}^p (it_j\mu_j - \frac{1}{2} t_j^2 \sigma_j^2)}$, which can be seen to be the cf of a vector of $p$ independent components, each distributed according to $N(\mu_j, \sigma_j^2)$, $j = 1, \dots, p$.)

The above property can be paraphrased as "for a multivariate normal, if its components are uncorrelated then they are also independent". On the other hand, it is well known that in general, i.e. not only for the normal, independence of components implies that they are uncorrelated. Therefore, for the multivariate normal distribution we can conclude that its components are independent if and only if they are uncorrelated! This equivalence does not hold for general distributions, as the following example shows.

Example 2.5 (Random variables that are marginally normal and uncorrelated but not independent). Consider the two variables $Z_1 = (2W - 1)Y$ and $Z_2 = Y$, where $Y \sim N_1(0,1)$ and, independently, $W \sim \text{Binomial}(1, 1/2)$ (so $2W - 1$ takes $-1$ and $+1$ with equal probability).
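A quick simulation illustrates Example 2.5: $Z_1$ and $Z_2$ are each marginally $N(0,1)$ (by the symmetry of $Y$) and their correlation is zero, yet they are clearly dependent, since $|Z_1| = |Z_2|$ always. A Python sketch (for illustration only; the course's computing is in R and SAS):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

y = rng.standard_normal(n)
w = rng.integers(0, 2, n)          # Binomial(1, 1/2)
z1 = (2 * w - 1) * y               # marginally N(0, 1) by symmetry of Y
z2 = y

# Uncorrelated: Cov(Z1, Z2) = E[(2W - 1) Y^2] = E(2W - 1) E(Y^2) = 0.
print(np.corrcoef(z1, z2)[0, 1])   # close to 0

# ...but certainly not independent: |Z1| = |Z2| always.
assert np.allclose(np.abs(z1), np.abs(z2))
```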
Property 2 If $X \sim N_p(\mu, \Sigma)$ and $C \in M_{q,p}$ is an arbitrary matrix of real numbers, then $Y = CX \sim N_q(C\mu, C\Sigma C^\top)$.

To prove this property, note that (see Section 0.2.5) for any $s \in \mathbb{R}^q$ we have:
\[ \varphi_Y(s) = \varphi_X(C^\top s) = e^{is^\top C\mu - \frac{1}{2} s^\top C\Sigma C^\top s}, \]
which means that $Y = CX \sim N_q(C\mu, C\Sigma C^\top)$.

Note also that if $C$ happens to have full rank and $\operatorname{rk}(\Sigma) = p$, then the rank of $C\Sigma C^\top$ is also full, i.e. the distribution of $Y$ would not be degenerate in this case.

Property 3 (This is a finer version of Property 1.) Assume the vector $X \in \mathbb{R}^p$ is divided into subvectors $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$ and, according to this subdivision, the mean vector is $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$ and the covariance matrix $\Sigma$ has been subdivided into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. Then the vectors $X^{(1)}$ and $X^{(2)}$ are independent iff $\Sigma_{12} = 0$.

Proof. (Exercise (see lecture)).

Property 4 Let the vector $X \in \mathbb{R}^p$ be divided into subvectors $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$, $X^{(1)} \in \mathbb{R}^r$, $r < p$, $X^{(2)} \in \mathbb{R}^{p-r}$, and, according to this subdivision, the mean vector is $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$ and the covariance matrix $\Sigma$ has been subdivided into $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. Assume for simplicity that $\Sigma_{22}$ has full rank. Then the conditional density of $X^{(1)}$ given that $X^{(2)} = x^{(2)}$ is
\[ N_r(\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}). \tag{2.4} \]

Proof. Perhaps the easiest way to proceed is the following. Note that the expression $\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$ (which we want to show equals the conditional mean) is a function of $x^{(2)}$. Denote it by $g(x^{(2)})$ for short. Let us construct the random vectors $Z = X^{(1)} - g(X^{(2)})$ and $Y = X^{(2)} - \mu^{(2)}$. Obviously $EZ = 0$ and $EY = 0$ hold. The vector $\begin{pmatrix} Z \\ Y \end{pmatrix}$ is a linear transformation of a normal vector:
\[ \begin{pmatrix} Z \\ Y \end{pmatrix} = A(X - \mu), \qquad A = \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_{p-r} \end{pmatrix}, \]
and hence its distribution is normal (Property 2).
Calculating therefore the covariance matrix of the vector $\begin{pmatrix} Z \\ Y \end{pmatrix}$, we find, after a simple exercise in block multiplication of matrices, that
\[ \operatorname{Var}\begin{pmatrix} Z \\ Y \end{pmatrix} = A\Sigma A^\top = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}. \]
Hence the two vectors $Z$ and $Y$ are uncorrelated normal vectors and therefore independent (Property 3). But $Y$ is a linear transformation of $X^{(2)}$, and this means that $Z$ and $X^{(2)}$ are independent. Hence the conditional density of $Z$ given $X^{(2)} = x^{(2)}$ does not depend on $x^{(2)}$ and coincides with the unconditional density of $Z$. That is, it is normal with zero mean vector and covariance matrix
\[ \operatorname{Cov}(Z) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{1|2}. \]
Hence we can state that $X^{(1)} - g(x^{(2)}) \sim N(0, \Sigma_{1|2})$ and, correspondingly, the conditional distribution of $X^{(1)}$ given that $X^{(2)} = x^{(2)}$ is (2.4).

Example 2.6. As an immediate consequence of Property 4 we see that if $p = 2$, $r = 1$, then for a two-dimensional normal vector
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left\{ \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right\}, \]
its conditional density $f(x_1|x_2)$ is $N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(x_2 - \mu_2),\ \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}\right)$. As an exercise, try to derive the above result by direct calculation, starting from the joint density $f(x_1, x_2)$, going over to the marginal $f(x_2)$ by integration, and finally getting $f(x_1|x_2) = \frac{f(x_1, x_2)}{f(x_2)}$.

Property 5 If $X \sim N_p(\mu, \Sigma)$ and $\Sigma$ is nonsingular, then
\[ (X - \mu)^\top \Sigma^{-1}(X - \mu) \sim \chi_p^2, \]
where $\chi_p^2$ denotes the chi-square distribution with $p$ degrees of freedom.

Proof. It suffices to use the fact that (see also Theorem 2.4) the vector $Y = \Sigma^{-1/2}(X - \mu) \sim N(0, I_p)$, i.e. it has $p$ independent standard normal components. Then
\[ (X - \mu)^\top \Sigma^{-1}(X - \mu) = Y^\top Y = \sum_{i=1}^p Y_i^2 \sim \chi_p^2 \]
according to the definition of $\chi_p^2$ as the distribution of the sum of squares of $p$ independent standard normals.

Finally, one more interpretation of the result in Property 4 will be given.
Assume that we want, as is typical in statistics, to predict a random variable $Y$ that is correlated with some $p$ random variables (predictors) $X = (X_1\ X_2\ \cdots\ X_p)^\top$. Trying to find the best predictor of $Y$, we would like to minimise the expected value $E[\{Y - g(X)\}^2]$ over all possible choices of the function $g$ such that $E\,g(X)^2 < \infty$. A little careful work and use of basic properties of conditional expectations leads us (see lecture) to the conclusion that the optimal solution to the above minimisation problem is $g^*(x) = E(Y|X = x)$. This optimal solution is also called the regression function. Thus, given a particular realisation $x$ of the random vector $X$, the regression function is just the conditional expected value of $Y$ given $X = x$.

In general, the conditional expected value may be a complicated nonlinear function of the predictors. However, if we assume in addition that the joint $(p+1)$-dimensional distribution of $Y$ and $X$ is normal, then by applying Property 4 we see that, given the realisation $x$ of $X$, the best prediction of the $Y$ value is given by $b + \sigma_0^\top C^{-1} x$, where $b = E(Y) - \sigma_0^\top C^{-1} E(X)$, $C$ is the covariance matrix of the vector $X$, and $\sigma_0$ is the vector of covariances of $Y$ with $X_i$, $i = 1, \dots, p$. Indeed, we know that when the joint $(p+1)$-dimensional distribution of $Y$ and $X$ is normal, the regression function is given by $E(Y) + \sigma_0^\top C^{-1}(x - E(X))$. By introducing the notation $b = E(Y) - \sigma_0^\top C^{-1} E(X)$ we can write this as $b + \sigma_0^\top C^{-1} x$. That is, in the case of normality, the optimal predictor of $Y$ in the least squares sense turns out to be a very simple linear function of the predictors. The vector $C^{-1}\sigma_0 \in \mathbb{R}^p$ is the vector of the regression coefficients. Substituting the optimal values, we get the minimal value of the expected squared prediction error, which is equal to $\operatorname{Var}(Y) - \sigma_0^\top C^{-1}\sigma_0$.
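The linearity of the regression function under joint normality can be checked by simulation: the population coefficients $C^{-1}\sigma_0$ and intercept $b$ should match what an ordinary least-squares fit recovers from a large jointly normal sample. A Python sketch under made-up parameter values (illustrative only; not the course's R/SAS code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint (p+1)-dimensional normal for (Y, X1, X2): partition its covariance into
# Var(Y), sigma0 = Cov(Y, X), and C = Var(X), then form the regression function.
mu_Y, mu_X = 1.0, np.array([0.0, 2.0])
var_Y = 2.0
sigma0 = np.array([0.8, 0.5])            # Cov(Y, X_i)
C = np.array([[1.0, 0.3],
              [0.3, 1.0]])               # Var(X)

beta = np.linalg.solve(C, sigma0)        # regression coefficients C^{-1} sigma0
b = mu_Y - beta @ mu_X                   # intercept b = E(Y) - sigma0' C^{-1} E(X)

# Check against an ordinary least-squares fit on a large simulated sample.
Sigma = np.block([[np.array([[var_Y]]), sigma0[None, :]],
                  [sigma0[:, None], C]])
draws = rng.multivariate_normal(np.r_[mu_Y, mu_X], Sigma, size=200_000)
y, X = draws[:, 0], draws[:, 1:]
coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X], y, rcond=None)

print(np.r_[b, beta])   # population intercept and slopes
print(coef)             # OLS estimates: close to the line above
```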
2.3 Tests for Multivariate Normality

We have seen that the assumption of multivariate normality may bring essential simplifications in analysing data. But applying inference methods based on the multivariate normality assumption in cases where it is grossly violated may introduce serious defects in the quality of the analysis. It is therefore important to be able to check the multivariate normality assumption. Based on the properties of normal distributions discussed in this lecture, we know that all linear combinations of normal variables are normal and that the contours of the multivariate normal density are ellipsoids. Therefore we can (to some extent) check the multivariate normality hypothesis by:

1. checking if the marginal distributions of each component appear to be normal (by using Q–Q plots and the Shapiro–Wilk test, for example);
2. checking if the scatterplots of pairs of observations give the elliptical appearance expected from normal populations;
3. checking whether there are any outlying observations that should be checked for accuracy.

All this can be done by applying univariate techniques and by drawing scatterplots, both of which are well developed in SAS and R. To some extent, however, there is a price to be paid for concentrating on univariate and bivariate examinations of normality. There is a need to construct a "good" overall test of multivariate normality.

One of the simple and tractable ways to verify the multivariate normality assumption is by using tests based on Mardia's multivariate skewness and kurtosis measures. For a general multivariate distribution we define these, respectively, as
\[ \beta_{1,p} = E[(X - \mu)^\top \Sigma^{-1}(Y - \mu)]^3, \tag{2.5} \]
where $Y$ is independent of $X$ but has the same distribution, and
\[ \beta_{2,p} = E[(X - \mu)^\top \Sigma^{-1}(X - \mu)]^2 \tag{2.6} \]
(if the expectations in (2.5) and (2.6) exist). For the $N_p(\mu, \Sigma)$ distribution: $\beta_{1,p} = 0$ and $\beta_{2,p} = p(p + 2)$.
(Note that when $p = 1$, the quantity $\beta_{1,1}$ is the square of the skewness coefficient $\frac{E(X-\mu)^3}{\sigma^3}$, whereas $\beta_{2,1}$ coincides with the kurtosis coefficient $\frac{E(X-\mu)^4}{\sigma^4}$.)

For a sample of size $n$, consistent estimates of $\beta_{1,p}$ and $\beta_{2,p}$ can be obtained as
\[ \hat\beta_{1,p} = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n g_{ij}^3, \qquad \hat\beta_{2,p} = \frac{1}{n}\sum_{i=1}^n g_{ii}^2, \]
where $g_{ij} = (x_i - \bar{x})^\top S_n^{-1}(x_j - \bar{x})$. Notice that for $\hat\beta_{1,p}$ we take advantage of our sample being independent and use the observations $x_j$ for $j \ne i$ as the "$Y$" values for $x_i$.

Both quantities $\hat\beta_{1,p}$ and $\hat\beta_{2,p}$ are nonnegative, and for multivariate normal data one would expect them to be around zero and $p(p+2)$, respectively. Both quantities can be utilised to detect departures from multivariate normality. Mardia has shown that asymptotically,
\[ k_1 = n\hat\beta_{1,p}/6 \sim \chi^2_{p(p+1)(p+2)/6}, \qquad k_2 = \frac{\hat\beta_{2,p} - p(p+2)}{[8p(p+2)/n]^{1/2}} \sim N(0, 1). \]
Thus we can use $k_1$ and $k_2$ to test the null hypothesis of multivariate normality. If neither hypothesis is rejected, the multivariate normality assumption is in reasonable agreement with the data. It has also been observed that Mardia's multivariate kurtosis can be used as a measure to detect outliers in data that are supposedly distributed as multivariate normal.

Shapiro–Wilk, Mardia, and other distribution tests have, as their null hypothesis, that the true population distribution is (multivariate) normal. This means that if the population distribution deviates from normality even a little, then as the sample size increases, the power of the test (the probability of rejecting the null hypothesis of normality) approaches 1. At the same time, as the sample size increases, the Central Limit Theorem tells us that many statistics, including sample means and (much more slowly) sample variances and covariances, approach normality, and multivariate statistics generally approach multivariate normality.
This means that, regardless of the underlying distribution, statistical procedures that depend on the normality assumption often become approximately valid as the sample size grows, even as the chance that a hypothesis test will detect whatever non-normality there is approaches 1. This means that we must not rely on hypothesis testing blindly but consider the situation on a case-by-case basis, particularly when dealing with large datasets. For a decent sample size, the "symmetric, bell-shaped" heuristic may indicate an adequate distribution, even if a hypothesis test reports a small p-value.

2.4 Software

SAS Use the CALIS procedure. The quantity $k_2$ is called Normalized Multivariate Kurtosis there, whereas $\hat\beta_{2,p} - p(p+2)$ bears the name Mardia's Multivariate Kurtosis.

R MVN::mvn, psych::mardia

2.5 Examples

Example 2.7. Testing multivariate normality of microwave oven radioactivity measurements (JW).

2.6 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 4.1–4.2, 4.6.

2.7 Exercises

Exercise 2.1
Let $X_1$ and $X_2$ denote i.i.d. $N(0,1)$ r.v.'s.
(a) Show that the r.v.'s $Y_1 = X_1 - X_2$ and $Y_2 = X_1 + X_2$ are independent, and find their marginal densities.
(b) Find $P(X_1^2 + X_2^2 < 2.41)$.

Exercise 2.2
Let $X \sim N_3(\mu, \Sigma)$ where
\[ \mu = \begin{pmatrix} 3 \\ -1 \\ 2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}. \]
(a) For $A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 1 \end{pmatrix}$, find the distribution of $Z = AX$ and find the correlation between the two components of $Z$.
(b) Find the conditional distribution of $[X_1, X_3]^\top$ given $X_2 = 0$.

Exercise 2.3
Suppose that $X_1, \dots, X_n$ are independent random vectors, with each $X_i \sim N_p(\mu_i, \Sigma_i)$. Let $a_1, \dots, a_n$ be real constants. Using characteristic functions, show that
\[ a_1 X_1 + \dots + a_n X_n \sim N_p(a_1\mu_1 + \dots + a_n\mu_n,\ a_1^2\Sigma_1 + \dots + a_n^2\Sigma_n). \]
Hence deduce that, if $X_1, \dots, X_n$ form a random sample from the $N_p(\mu, \Sigma)$ distribution, then the sample mean vector $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ has distribution $\bar{X} \sim N_p(\mu, \frac{1}{n}\Sigma)$.
Exercise 2.4
Prove that if $X_1 \sim N_r(\mu_1, \Sigma_{11})$ and $(X_2|X_1 = x_1) \sim N_{p-r}(Ax_1 + b, \Omega)$, where $\Omega$ does not depend on $x_1$, then $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$, where
\[ \mu = \begin{pmatrix} \mu_1 \\ A\mu_1 + b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{11}A^\top \\ A\Sigma_{11} & \Omega + A\Sigma_{11}A^\top \end{pmatrix}. \]

Exercise 2.5
Knowing that
i) $Z \sim N_1(0, 1)$,
ii) $Y|Z = z \sim N_1(1 + z, 1)$,
iii) $X|(Y, Z) = (y, z) \sim N_1(1 - y, 1)$,
(a) Find the distribution of $\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$ and of $Y|(X, Z)$.
(b) Find the distribution of $\begin{pmatrix} U \\ V \end{pmatrix} = \begin{pmatrix} 1 + Z \\ 1 - Y \end{pmatrix}$.
(c) Compute $E(Y|U = 2)$.

3 Estimation of the Mean Vector and Covariance Matrix of Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation . . . . . . . . . . 27
3.1.1 Likelihood function . . . . . . . . . . 27
3.1.2 Maximum Likelihood Estimators . . . . . . . . . . 28
3.1.3 Alternative proofs . . . . . . . . . . 29
3.1.4 Application in correlation matrix estimation . . . . . . . . . . 29
3.1.5 Sufficiency of $\hat\mu$ and $\hat\Sigma$ . . . . . . . . . . 29
3.2 Distributions of MLE of mean vector and covariance matrix of multivariate normal distribution . . . . . . . . . . 29
3.2.1 Sampling distribution of $\bar{X}$ . . . . . . . . . . 30
3.2.2 Sampling distribution of the MLE of $\Sigma$ . . . . . . . . . . 31
3.2.3 Aside: The Gram–Schmidt Process (not examinable) . . . . . . . . . . 32
3.3 Additional resources . . . . . . . . . . 33
3.4 Exercises . . . . . . . . . . 33

3.1 Maximum Likelihood Estimation

3.1.1 Likelihood function

Suppose we have observed $n$ independent realisations of $p$-dimensional random vectors from $N_p(\mu, \Sigma)$. Suppose for simplicity that $\Sigma$ is non-singular.
The data matrix has the form
\[ X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ X_{p1} & X_{p2} & \cdots & X_{pj} & \cdots & X_{pn} \end{pmatrix} = [X_1, X_2, \dots, X_n]. \tag{3.1} \]
The goal is to estimate the unknown mean vector and covariance matrix of the multivariate normal distribution by the Maximum Likelihood Estimation (MLE) method. Based on our knowledge from Lecture 2, we can write down the likelihood function
\[ L(x; \mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu)}. \tag{3.2} \]
(Note that we have substituted the observations in (3.2) and consider $L$ as a function of the unknown parameters $\mu, \Sigma$ only.) Correspondingly, we get the log-likelihood function in the form
\[ \log L(x; \mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu). \tag{3.3} \]
It is well known that maximising either (3.2) or (3.3) will give the same solution for the MLE. We start deriving the MLE by trying to maximise (3.3). To this end, first note that by utilising properties of traces from Section 0.1.1, we can transform:
\begin{align*}
\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) &= \sum_{i=1}^n \operatorname{tr}[\Sigma^{-1}(x_i - \mu)(x_i - \mu)^\top] \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top\Big)\Big] \\
&\quad \text{(by adding and subtracting } \bar{x} = \tfrac{1}{n}\sum_{i=1}^n x_i \text{ in each term } (x_i - \mu)\text{)} \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu)(\bar{x} - \mu)^\top\Big)\Big] \\
&= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\Big)\Big] + n(\bar{x} - \mu)^\top \Sigma^{-1}(\bar{x} - \mu).
\end{align*}
Thus
\[ \log L(x; \mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\Big)\Big] - \frac{n}{2}(\bar{x} - \mu)^\top \Sigma^{-1}(\bar{x} - \mu). \tag{3.4} \]

3.1.2 Maximum Likelihood Estimators

The MLEs are the ones that maximise (3.4). Looking at (3.4), we realise that (since $\Sigma^{-1}$ is positive definite) the minimal value for $\frac{n}{2}(\bar{x} - \hat\mu)^\top \Sigma^{-1}(\bar{x} - \hat\mu)$ is zero and is attained when $\hat\mu = \bar{x}$. It remains to find the optimal value for $\Sigma$. We will use the following

Theorem 3.1 (Anderson's lemma).
If $A \in M_{p,p}$ is symmetric positive definite, then the maximum of the function $h(G) = -n\log|G| - \operatorname{tr}(G^{-1}A)$ (defined over the set of symmetric positive definite matrices $G \in M_{p,p}$) exists, occurs at $G = \frac{1}{n}A$, and has the maximal value $np\log(n) - n\log|A| - np$.

Proof. (sketch, details at lecture): Indeed (see properties of traces):
\[ \operatorname{tr}(G^{-1}A) = \operatorname{tr}((G^{-1}A^{\frac{1}{2}})A^{\frac{1}{2}}) = \operatorname{tr}(A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}). \]
Let $\eta_i$, $i = 1, \dots, p$, be the eigenvalues of $A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}$. Then (since the matrix $A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}$ is positive definite) $\eta_i > 0$, $i = 1, \dots, p$. Also, $\operatorname{tr}(A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}) = \sum_{i=1}^p \eta_i$ and $|A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}| = \prod_{i=1}^p \eta_i$ hold. Hence
\[ -n\log|G| - \operatorname{tr}(G^{-1}A) = n\sum_{i=1}^p \log\eta_i - n\log|A| - \sum_{i=1}^p \eta_i. \tag{3.5} \]
Considering the expression $n\sum_{i=1}^p \log\eta_i - n\log|A| - \sum_{i=1}^p \eta_i$ as a function of the eigenvalues $\eta_i$, $i = 1, \dots, p$, we realise that it has a maximum, attained when all $\eta_i = n$, $i = 1, \dots, p$. Indeed, the first partial derivatives with respect to $\eta_i$ are equal to $\frac{n}{\eta_i} - 1$, and hence the stationary points are $\eta_i^* = n$, $i = 1, \dots, p$. The matrix of second derivatives calculated at $\eta_i^* = n$, $i = 1, \dots, p$, is equal to $-\frac{1}{n}I_p$, which is negative definite, and hence the stationary point gives rise to a maximum of the function. Now, we can check directly by substituting the $\eta^*$ values that the maximal value of the function is $np\log(n) - n\log|A| - np$. But a direct substitution in the formula $h(G) = -n\log|G| - \operatorname{tr}(G^{-1}A)$ with $G = \frac{1}{n}A$ also gives rise to $np\log(n) - n\log|A| - np$, i.e. the maximum is attained at $G = \frac{1}{n}A$.

Using the structure of the log-likelihood function in (3.4) and Theorem 3.1 (applied to the case $A = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ (!)), it is now easy to formulate the following:

Theorem 3.2. Suppose $X_1, X_2, \dots, X_n$ is a random sample from $N_p(\mu, \Sigma)$, $p < n$. Then $\hat\mu = \bar{X}$ and $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$ are the maximum likelihood estimators of $\mu$ and $\Sigma$, respectively.
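The estimators in Theorem 3.2 are easy to compute directly. A minimal Python/NumPy sketch with simulated data (illustrative only; note the MLE divisor $n$, versus the $n-1$ that NumPy's `cov` uses by default):

```python
import numpy as np

rng = np.random.default_rng(7)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)   # rows are observations

# MLEs from Theorem 3.2: mu_hat = sample mean; Sigma_hat uses divisor n.
n = len(X)
mu_hat = X.mean(axis=0)
Xc = X - mu_hat
Sigma_hat = Xc.T @ Xc / n

# np.cov uses divisor n - 1 by default; bias=True gives the MLE divisor n.
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
print(mu_hat)      # close to mu
print(Sigma_hat)   # close to Sigma
```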
3.1.3 Alternative proofs

Alternative proofs of Theorem 3.2 are also available that utilise formal rules for vector and matrix differentiation, which have been developed as standard machinery in multivariate analysis (recall that, according to the folklore, in order to find the maximum of the log-likelihood, we need to differentiate it with respect to its arguments, i.e. with respect to the vector $\mu$ and the matrix $\Sigma$, set the derivatives equal to zero, and solve the corresponding system of equations). If time permits, these matrix differentiation rules will also be discussed later in this course.

3.1.4 Application in correlation matrix estimation

The correlation matrix can be defined in terms of the elements of the covariance matrix $\Sigma$. The correlation coefficients $\rho_{ij}$, $i = 1, \dots, p$, $j = 1, \dots, p$, are defined as
\[ \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}, \]
where $\Sigma = (\sigma_{ij},\ i = 1, \dots, p;\ j = 1, \dots, p)$ is the covariance matrix. Note that $\rho_{ii} = 1$, $i = 1, \dots, p$. To derive the MLE of $\rho_{ij}$, we note that these are continuous transformations of the covariances, whose maximum likelihood estimators have already been derived. Then we can claim (according to the transformation invariance property of the MLE) that
\[ \hat\rho_{ij} = \frac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}}\sqrt{\hat\sigma_{jj}}}, \quad i = 1, \dots, p,\ j = 1, \dots, p. \tag{3.6} \]

3.1.5 Sufficiency of $\hat\mu$ and $\hat\Sigma$

Going back to (3.4), we can write the likelihood function as
\[ L(x; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{np}{2}} |\Sigma|^{\frac{n}{2}}}\, e^{-\frac{1}{2}\operatorname{tr}[\Sigma^{-1}(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top + n(\bar{x} - \mu)(\bar{x} - \mu)^\top)]}, \]
which means that $L(x; \mu, \Sigma)$ can be factorised into $L(x; \mu, \Sigma) = g_1(x)\, g_2(\mu, \Sigma; \hat\mu, \hat\Sigma)$, i.e. the likelihood function depends on the observations only through the values of $\hat\mu = \bar{X}$ and $\hat\Sigma$. Hence the pair $\hat\mu$ and $\hat\Sigma$ are sufficient statistics for $\mu$ and $\Sigma$ in the case of a sample from $N_p(\mu, \Sigma)$. Note that the structure of the multivariate normal density was used essentially here, thus underlining the importance of checking the adequacy of the multivariate normality assumption in practice.
If testing indicates significant departures from multivariate normality, then inferences based solely on $\hat\mu$ and $\hat\Sigma$ may not be very reliable.

3.2 Distributions of the MLE of the mean vector and covariance matrix of the multivariate normal distribution

Inference is not restricted to finding point estimators: we also want to construct confidence regions, test hypotheses, etc. To this end we need the distribution of the estimators (or of suitably chosen functions of them).

3.2.1 Sampling distribution of $\bar X$

In the univariate case ($p=1$) it is well known that for a sample of $n$ observations from the normal distribution $N(\mu,\sigma^2)$ the sample mean is normally distributed: $N(\mu, \frac{\sigma^2}{n})$. Moreover, the sample mean and the sample variance are independent when sampling from a univariate normal population (Basu's Lemma). This fact was very useful in developing t-statistics for testing the mean. Do we have similar statements about the sample mean and sample variance in the multivariate ($p>1$) case?

Let $\bar X = \frac1n \sum_{i=1}^n X_i \in \mathbb R^p$. For any $l \in \mathbb R^p$, $l^\top \bar X$ is a linear combination of normals and hence is normal (see Definition 2.1). Since taking expected values is a linear operation, we have $\operatorname{E}\bar X = \frac1n n\mu = \mu$. In analogy with the univariate case we could formally write $\operatorname{Cov}\bar X = \frac{1}{n^2}\, n \operatorname{Cov} X_1 = \frac1n \Sigma$ and hence $\bar X \sim N_p(\mu, \frac1n\Sigma)$. But we would like to develop more appropriate machinery for the multivariate case that would let us prove statements like the last one more rigorously. It is based on operations with Kronecker products.

The Kronecker product of two matrices $A \in M_{m,n}$ and $B \in M_{p,q}$ is denoted by $A \otimes B$ and is defined (in block matrix notation) as
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}. \tag{3.7}$$
The following basic properties of Kronecker products will be used (whenever the corresponding matrix products and inverses exist):
$$(A\otimes B)\otimes C = A \otimes (B \otimes C),$$
$$(A+B)\otimes C = A\otimes C + B\otimes C,$$
$$(A\otimes B)^\top = A^\top \otimes B^\top,$$
$$(A\otimes B)^{-1} = A^{-1}\otimes B^{-1},$$
$$(A\otimes B)(C\otimes D) = AC \otimes BD,$$
$$\operatorname{tr}(A\otimes B) = \operatorname{tr}(A)\operatorname{tr}(B),$$
$$|A\otimes B| = |A|^p |B|^m \quad (\text{in case } A \in M_{m,m},\ B\in M_{p,p}).$$
In addition, the $\vec{\;}$ (vec) operation on a matrix $A \in M_{m,n}$ will be defined. This operation creates the vector $\vec A \in \mathbb R^{mn}$ composed by stacking the $n$ columns of $A$ under each other (the second below the first, etc.). For matrices $A$, $B$ and $C$ (of suitable dimensions) it holds that
$$\overrightarrow{ABC} = (C^\top \otimes A)\vec B.$$

Let us see how we can use the above to derive the distribution of $\bar X$. Denote by $1_n$ the vector of $n$ ones. Note that if $X$ is the random data matrix (see (0.11) in Lecture 0.2), then $\vec X \sim N(1_n \otimes \mu,\, I_n \otimes \Sigma)$ and $\bar X = \frac1n (1_n^\top \otimes I_p)\vec X$. Hence $\bar X$ is multivariate normal with
$$\operatorname{E}\bar X = \frac1n (1_n^\top \otimes I_p)(1_n\otimes\mu) = \frac1n(1_n^\top 1_n \otimes \mu) = \frac1n\, n\mu = \mu,$$
$$\operatorname{Cov}\bar X = n^{-2}(1_n^\top \otimes I_p)(I_n\otimes\Sigma)(1_n\otimes I_p) = n^{-2}(1_n^\top 1_n \otimes \Sigma) = n^{-1}\Sigma.$$

Independence of $\bar X$ and $\hat\Sigma$

How can we show that $\bar X$ and $\hat\Sigma$ are independent? Recall the likelihood function
$$L(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{np}{2}}|\Sigma|^{\frac n2}}\, e^{-\frac12 \operatorname{tr}[\Sigma^{-1}(\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top + n(\bar x-\mu)(\bar x-\mu)^\top)]}.$$
The exponent has two summands: one depends on the observations only through $n\hat\Sigma = \sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)^\top$, and the other only through $\bar x$. The idea is now to transform the original data matrix $X \in M_{p,n}$ into a new matrix $Z \in M_{p,n}$ whose columns are independent normal, in such a way that $\bar X$ is a function of the first column $Z_1$ only, whereas $\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top$ is a function of $Z_2,\dots,Z_n$ only. If we succeed, then clearly $\bar X$ and $\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top = n\hat\Sigma$ will be independent. The claim is now that the sought-after transformation is given by $Z = XA$, with $A \in M_{n,n}$ an orthogonal matrix whose first column equals $\frac{1}{\sqrt n}1_n$.
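This claim, and the vec/Kronecker identity it relies on, can be checked numerically. Below is a quick sketch in Python with NumPy (the course itself uses R and SAS); the dimensions and random data are arbitrary, and the orthogonal matrix $A$ is built by a QR factorisation of a matrix whose first column is $1_n$:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 2, 6
X = rng.normal(size=(p, n))        # data matrix with observations as columns

# The vec identity used above: vec(ABC) = (C^T kron A) vec(B).
vec = lambda M: M.reshape(-1, order="F")      # stack columns, as in the notes
A_ = rng.normal(size=(3, p)); C_ = rng.normal(size=(n, 4))
assert np.allclose(vec(A_ @ X @ C_), np.kron(C_.T, A_) @ vec(X))

# Orthogonal A with first column 1_n/sqrt(n): QR of [1_n | random columns].
M = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
A, _ = np.linalg.qr(M)
A = A * np.sign(A[0, 0])           # fix the sign so the first column is +1_n/sqrt(n)
assert np.allclose(A[:, 0], np.ones(n) / np.sqrt(n))

Z = X @ A
xbar = X.mean(axis=1)
# The first column of Z is sqrt(n) * xbar ...
assert np.allclose(Z[:, 0], np.sqrt(n) * xbar)
# ... and the centred cross-product matrix depends on Z_2, ..., Z_n only:
Xc = X - xbar[:, None]
assert np.allclose(Xc @ Xc.T, Z[:, 1:] @ Z[:, 1:].T)
```

The two final assertions are exactly the algebraic identities derived in the next paragraphs.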
Note that the first column of $Z$ is then $\sqrt n\,\bar X$. (An explicit form of the matrix $A$ can be obtained using the Gram–Schmidt process discussed later.) Since $\vec Z = \overrightarrow{I_p X A} = (A^\top \otimes I_p)\vec X$, the Jacobian of the transformation ($\vec X$ into $\vec Z$) is $|A^\top \otimes I_p| = |A|^p = \pm 1$ (note that $A$ is orthogonal). Therefore the absolute value of the Jacobian equals one. For $\vec Z$ we have
$$\operatorname E(\vec Z) = (A^\top\otimes I_p)(1_n\otimes\mu) = A^\top 1_n \otimes \mu = \begin{pmatrix}\sqrt n\\0\\\vdots\\0\end{pmatrix}\otimes\mu.$$
Further,
$$\operatorname{Cov}(\vec Z) = (A^\top\otimes I_p)(I_n\otimes\Sigma)(A\otimes I_p) = A^\top A\otimes I_p\Sigma I_p = I_n\otimes\Sigma,$$
which means that the $Z_i$, $i=1,\dots,n$, are independent. Note that $Z_1 = \sqrt n\,\bar X$ holds (because of the choice of the first column of the orthogonal matrix $A$). Further,
$$\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top = \sum_{i=1}^n X_iX_i^\top - \frac1n\Big(\sum_{i=1}^n X_i\Big)\Big(\sum_{i=1}^n X_i^\top\Big) = ZA^\top AZ^\top - Z_1Z_1^\top = \sum_{i=1}^n Z_iZ_i^\top - Z_1Z_1^\top = \sum_{i=2}^n Z_iZ_i^\top.$$
Hence we have proved the following:

Theorem 3.3. For a sample of size $n$ from $N_p(\mu,\Sigma)$, $p<n$, the sample average $\bar X\sim N_p(\mu,\frac1n\Sigma)$. Moreover, the MLEs $\hat\mu=\bar X$ and $\hat\Sigma$ are independent.

3.2.2 Sampling distribution of the MLE of $\Sigma$

Definition 3.4. A random matrix $U\in M_{p,p}$ has a Wishart distribution with parameters $\Sigma, p, n$ (denoted $U\sim W_p(\Sigma,n)$) if there exist $n$ independent random vectors $Y_1,\dots,Y_n$, each with the $N_p(0,\Sigma)$ distribution, such that the distribution of $\sum_{i=1}^n Y_iY_i^\top$ coincides with the distribution of $U$.

Note that we require that $p < n$ and that $U$ be non-negative definite. Having in mind the proof of Theorem 3.3, we can claim that the distribution of the matrix $n\hat\Sigma = \sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^\top$ is the same as that of $\sum_{i=2}^n Z_iZ_i^\top$, and is therefore Wishart with parameters $\Sigma, p, n-1$. That is, we can write: $n\hat\Sigma\sim W_p(\Sigma,n-1)$. The density formula for the Wishart distribution is given in several sources, but we will not deal with it in this course. Some properties of the Wishart distribution will be mentioned, though, since we will make use of them later in the course:
1. If $p=1$ and we denote the $1\times1$ "matrix" $\Sigma$ by $\sigma^2$ (as usual), then $W_1(\sigma^2,n)/\sigma^2 = \chi^2_n$. In particular, when $\sigma^2=1$ we see that $W_1(1,n)$ is exactly the $\chi^2_n$ random variable. In that sense the Wishart distribution is a generalisation (with respect to the dimension $p$) of the chi-squared distribution.

2. For an arbitrary fixed matrix $H\in M_{k,p}$, $k\le p$, one has $nH\hat\Sigma H^\top \sim W_k(H\Sigma H^\top, n-1)$. (Why? Show it!)

3. Refer to the previous case with $k=1$. The matrix $H\in M_{1,p}$ is just a $p$-dimensional row vector, which we can denote by $c^\top$. Then:
   i) $\dfrac{nc^\top\hat\Sigma c}{c^\top\Sigma c}\sim\chi^2_{n-1}$;
   ii) $\dfrac{nc^\top\Sigma^{-1}c}{c^\top\hat\Sigma^{-1}c}\sim\chi^2_{n-p}$.

4. Let us partition $S = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^\top\in M_{p,p}$ and $\Sigma$ as
$$S = \begin{pmatrix}S_{11}&S_{12}\\S_{21}&S_{22}\end{pmatrix},\ S_{11}\in M_{r,r},\ r<p,\qquad \Sigma=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{pmatrix},\ \Sigma_{11}\in M_{r,r},\ r<p.$$
Further, denote $S_{1|2}=S_{11}-S_{12}S_{22}^{-1}S_{21}$ and $\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Then it holds that
$$(n-1)S_{11}\sim W_r(\Sigma_{11},n-1),\qquad (n-1)S_{1|2}\sim W_r(\Sigma_{1|2},n-p+r-1).$$

3.2.3 Aside: The Gram–Schmidt Process (not examinable)

Let $A=[a_1,\dots,a_n]\in M_{n,n}$ be an arbitrary full-rank matrix whose first column must be preserved (up to a constant multiple) but which must otherwise be made into an orthogonal matrix. The idea of Gram–Schmidt orthogonalisation (and orthonormalisation) is first to make $a_2$ orthogonal to $a_1$, then $a_3$ orthogonal to $a_1$ and $a_2$, all the way to making $a_n$ orthogonal to all the previous vectors. This is accomplished by the following procedure:

1. For each $i=2,\dots,n$,
2.   for each $j=1,\dots,i-1$,
3.     update $a_i = a_i - \frac{\langle a_i,a_j\rangle}{\langle a_j,a_j\rangle}a_j$.
4. For each $k=1,\dots,n$,
5.   update $a_k = \frac{a_k}{\|a_k\|}$.

Then, after Step 3 for a given $i$ and $j$,
$$\langle a_i,a_j\rangle = \Big\langle a_i-\frac{\langle a_i,a_j\rangle}{\langle a_j,a_j\rangle}a_j,\ a_j\Big\rangle = \langle a_i,a_j\rangle - \frac{\langle a_i,a_j\rangle}{\langle a_j,a_j\rangle}\langle a_j,a_j\rangle = 0.$$
We can use induction to show that by the time we reach Step 4, $\langle a_i,a_j\rangle = 0$ for all $i\ne j$. Observe that after Step 3 completes with $i=2$ (and therefore $j=1$ only), $\langle a_1,a_2\rangle = 0$. Now, suppose that $a_1,\dots,a_{i-1}$ are orthogonal.
Then, after Step 3 for some $j$, for an arbitrary $l<i$, $l\ne j$,
$$\Big\langle a_i - \frac{\langle a_i,a_j\rangle}{\langle a_j,a_j\rangle}a_j,\ a_l\Big\rangle = \langle a_i,a_l\rangle - \frac{\langle a_i,a_j\rangle}{\langle a_j,a_j\rangle}\underbrace{\langle a_j,a_l\rangle}_{=\,0} = \langle a_i,a_l\rangle,$$
since $l,j\le i-1$ and $a_j$, $a_l$ are therefore orthogonal. This means that Step 3 only affects $\langle a_i,a_l\rangle$ for $l=j$: Step 3 cannot make $a_i$ lose orthogonality to any of the vectors $a_1,\dots,a_{i-1}$ to which it was previously orthogonal, so by the time the loop increments $i$, the vectors $a_1,\dots,a_i$ will be orthogonal, completing the proof by induction. Lastly, Steps 4 and 5 simply normalise $a_1,\dots,a_n$. At no point is $a_1$ changed except for being normalised.

Example 3.5. Gram–Schmidt process implemented in R.

3.3 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 4.3–4.5.

3.4 Exercises

Exercise 3.1 Find the product $A\otimes B$ if
$$A = \begin{pmatrix}1&2\\3&4\end{pmatrix},\qquad B=\begin{pmatrix}5&0\\2&1\end{pmatrix}.$$

4 Confidence Intervals and Hypothesis Tests for the Mean Vector

4.1 Hypothesis tests for the multivariate normal mean 34
4.1.1 Hotelling's $T^2$ 34
4.1.2 Sampling distribution of $T^2$ 35
4.1.3 Noncentral Wishart 36
4.1.4 $T^2$ as a likelihood ratio statistic 36
4.1.5 Wilks' lambda and $T^2$ 37
4.1.6 Numerical calculation of $T^2$ 37
4.1.7 Asymptotic distribution of $T^2$ 37
4.2 Confidence regions for the mean vector and for its components 38
4.2.1 Confidence region for the mean vector 38
4.2.2 Simultaneous confidence statements 38
4.2.3 Simultaneous confidence ellipsoid 38
4.3 Comparison of two or more mean vectors 39
4.3.1 Reducing to a single population 40
4.3.2 The two-sample $T^2$-test 40
4.4 Software 41
4.5 Additional resources 41
4.6 Exercises 41

4.1 Hypothesis tests for the multivariate normal mean

4.1.1 Hotelling's $T^2$

Suppose again that, as in Lecture 3, we have observed $n$ independent realisations of $p$-dimensional random vectors from $N_p(\mu,\Sigma)$. Suppose for simplicity that $\Sigma$ is non-singular. The data matrix has the form
$$x = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n}\\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots\\ x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn} \end{pmatrix} = [x_1, x_2,\dots,x_n].$$
Based on our knowledge from Section 3.2, we can claim that $\bar X\sim N_p(\mu,\frac1n\Sigma)$ and $n\hat\Sigma\sim W_p(\Sigma,n-1)$. Consequently, any linear combination $c^\top\bar X$, $0\ne c\in\mathbb R^p$, follows $N(c^\top\mu,\frac1n c^\top\Sigma c)$, and the quadratic form $nc^\top\hat\Sigma c/c^\top\Sigma c\sim\chi^2_{n-1}$. Further, we have shown that $\bar X$ and $\hat\Sigma$ are independently distributed, and hence
$$T = \frac{\sqrt n\, c^\top(\bar X-\mu)}{\sqrt{c^\top\frac{n}{n-1}\hat\Sigma c}}\sim t_{n-1},$$
i.e. $T$ follows the t distribution with $n-1$ degrees of freedom.

This result has useful applications in testing for contrasts. Indeed, if we would like to test $H_0: c^\top\mu=\sum_{i=1}^p c_i\mu_i = 0$, we note that under $H_0$, $T$ becomes simply $T = \sqrt n\, c^\top\bar X/\sqrt{c^\top Sc}$; that is, it does not involve the unknown $\mu$ and can be used as a test statistic whose distribution under $H_0$ is known. If $|T| > t_{1-\alpha/2,n-1}$ we should reject $H_0$ in favour of $H_1: c^\top\mu = \sum_{i=1}^p c_i\mu_i\ne 0$.
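As an illustration, the two-sided contrast test just described takes only a few lines to code. The sketch below is in Python with NumPy/SciPy (the course's own computing uses R and SAS); the data and the contrast $c = (1,-1)^\top$ are hypothetical:

```python
import numpy as np
from scipy import stats

def contrast_t_test(X, c):
    """Two-sided test of H0: c'mu = 0 via T = sqrt(n) c'xbar / sqrt(c'Sc) ~ t_{n-1}."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # sample covariance (divisor n - 1)
    T = np.sqrt(n) * (c @ xbar) / np.sqrt(c @ S @ c)
    return T, 2 * stats.t.sf(abs(T), df=n - 1)

# Hypothetical bivariate sample; c = (1, -1) tests H0: mu_1 = mu_2.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], size=30)
T, p_value = contrast_t_test(X, np.array([1.0, -1.0]))

# Sanity check: for this contrast, T coincides with the paired t statistic
# computed from the differences X_1 - X_2, since c'Sc = Var(X_1 - X_2).
d = X[:, 0] - X[:, 1]
assert np.isclose(T, np.sqrt(30) * d.mean() / d.std(ddof=1))
```

The final assertion illustrates why this is called a contrast test: for $c = (1,-1)^\top$ it reduces exactly to the familiar paired t-test.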
The formulation of the test for other (one-sided) alternatives is left as an exercise.

More often we are interested in testing the mean vector of a multivariate normal. First consider the case of known covariance matrix $\Sigma$ (variance $\sigma^2$ in the univariate case). The standard univariate ($p=1$) test for this purpose is the following: to test $H_0:\mu=\mu_0$ versus $H_1:\mu\ne\mu_0$ at significance level $\alpha$, we look at $U=\sqrt n\,\frac{\bar X-\mu_0}{\sigma}$ and reject $H_0$ if $|U|$ exceeds the upper $\frac\alpha2\cdot100\%$ point of the standard normal distribution. Checking whether $|U|$ is large enough is equivalent to checking whether $U^2 = n(\bar X-\mu_0)(\sigma^2)^{-1}(\bar X-\mu_0)$ is large enough. We can now easily generalise this test statistic in a natural way to the multivariate ($p>1$) case: calculate $U^2 = n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0)$ and reject the null hypothesis $\mu=\mu_0$ when $U^2$ is large enough. Similarly to the proof of Property 5 of the multivariate normal distribution (Section 2.2), and using Theorem 3.3 of Section 3.2, you can convince yourself (do it (!)) that $U^2\sim\chi^2_p$ under the null hypothesis. Hence tables of the $\chi^2$ distribution suffice to perform the above test in the multivariate case.

Now let us turn to the (practically more relevant) case of unknown covariance matrix $\Sigma$. The standard univariate ($p=1$) test for this purpose is the t-test. Let us recall it: to test $H_0:\mu=\mu_0$ versus $H_1:\mu\ne\mu_0$ at significance level $\alpha$, we look at
$$T = \sqrt n\,\frac{\bar X-\mu_0}{S},\qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2,$$
and reject $H_0$ if $|T|$ exceeds the upper $\frac\alpha2\cdot100\%$ point of the t-distribution with $n-1$ degrees of freedom. We note that checking whether $|T|$ is large enough is equivalent to checking whether $T^2 = n(\bar X-\mu_0)(S^2)^{-1}(\bar X-\mu_0)$ is large enough. Of course, under $H_0$ the statistic $T^2$ is F-distributed: $T^2\sim F_{1,n-1}$, which means that $H_0$ would be rejected at level $\alpha$ when $T^2 > F_{1-\alpha;1,n-1}$. We can now easily generalise this test statistic in a natural way to the multivariate ($p>1$) case:

Definition 4.1 (Hotelling's $T^2$).
The statistic
$$T^2 = n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0), \tag{4.1}$$
where $\bar X = \frac1n\sum_{i=1}^n X_i$, $S = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^\top$, $\mu_0\in\mathbb R^p$, $X_i\in\mathbb R^p$, $i=1,\dots,n$, is named after Harold Hotelling.

4.1.2 Sampling distribution of $T^2$

Obviously, the test procedure based on Hotelling's statistic will reject the null hypothesis $H_0:\mu=\mu_0$ if the value of $T^2$ is sufficiently high. It turns out that we do not need special tables for the distribution of $T^2$ under the null hypothesis, because of the following basic result (which represents a true generalisation of the univariate ($p=1$) case):

Theorem 4.2. Under the null hypothesis $H_0:\mu=\mu_0$, Hotelling's $T^2$ is distributed as $\frac{(n-1)p}{n-p}F_{p,n-p}$, where $F_{p,n-p}$ denotes the F-distribution with $p$ and $n-p$ degrees of freedom.

Proof. Indeed, we can write the $T^2$ statistic in the form
$$T^2 = \frac{n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0)}{n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0)}\cdot n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0).$$
Denote $C=\sqrt n(\bar X-\mu_0)$. Conditionally on $C=c$, the first factor
$$\frac{n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0)}{n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0)} = \frac{c^\top S^{-1}c}{c^\top\Sigma^{-1}c}$$
has a distribution that depends on the data only through $S^{-1}$. Noting that $n\hat\Sigma = (n-1)S$ and having in mind the third property of Wishart distributions from Section 3.2.2, we can claim that this distribution is the same as that of $(n-1)/\chi^2_{n-p}$. Note also that the distribution does not depend on the particular $c$. The second factor $n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0)\sim\chi^2_p$, and its distribution depends on the data through $\bar X$ only. Because of the independence of the mean and covariance estimators, the distribution of $T^2$ is the same as the distribution of
$$\frac{\chi^2_p(n-1)}{\chi^2_{n-p}},$$
where the two chi-squares are independent. But this means that $\frac{T^2(n-p)}{p(n-1)}\sim F_{p,n-p}$ and hence $T^2\sim\frac{p(n-1)}{n-p}F_{p,n-p}$. □

4.1.3 Noncentral Wishart

It is possible to extend the definition of the Wishart distribution in Section 3.2.2 by allowing the random vectors $Y_i$, $i=1,\dots,n$, to be independent with $Y_i\sim N_p(\mu_i,\Sigma)$ (instead of having all $\mu_i=0$). In that way one arrives at the noncentral Wishart distribution with parameters $\Sigma, p, n-1, \Gamma$ (denoted $W_p(\Sigma,n-1,\Gamma)$). Here $\Gamma = MM^\top\in M_{p,p}$, with $M = [\mu_1,\mu_2,\dots,\mu_n]$, is called the noncentrality parameter. When all columns of $M\in M_{p,n}$ are zero, this is the usual (central) Wishart distribution. Theorem 4.2 can be extended to derive the distribution of the $T^2$ statistic under alternatives, i.e. the distribution of $T^2 = n(\bar X-\mu_0)^\top S^{-1}(\bar X-\mu_0)$ when the true mean $\mu\ne\mu_0$. This distribution turns out to be related to the noncentral F-distribution. It is helpful in studying the power of the test of $H_0:\mu=\mu_0$ versus $H_1:\mu\ne\mu_0$. We shall spare the details here.

4.1.4 $T^2$ as a likelihood ratio statistic

It is worth mentioning that Hotelling's $T^2$, which we introduced by analogy with the univariate squared t statistic, can in fact also be derived as the likelihood ratio test statistic for testing $H_0:\mu=\mu_0$ versus $H_1:\mu\ne\mu_0$. This safeguards the asymptotic optimality of the test suggested in Sections 4.1.1–4.1.2. To see this, first recall the likelihood function (3.2). Its unconstrained maximisation gives the maximum value
$$L(x;\hat\mu,\hat\Sigma) = \frac{1}{(2\pi)^{\frac{np}2}|\hat\Sigma|^{\frac n2}}\,e^{-\frac{np}2}.$$
On the other hand, under $H_0$,
$$\max_\Sigma L(x;\mu_0,\Sigma) = \max_\Sigma\frac{1}{(2\pi)^{\frac{np}2}|\Sigma|^{\frac n2}}\,e^{-\frac12\sum_{i=1}^n(x_i-\mu_0)^\top\Sigma^{-1}(x_i-\mu_0)}.$$
Since $\log L(x;\mu_0,\Sigma) = -\frac{np}2\log(2\pi)-\frac n2\log|\Sigma|-\frac12\operatorname{tr}[\Sigma^{-1}\sum_{i=1}^n(x_i-\mu_0)(x_i-\mu_0)^\top]$, on applying Anderson's lemma (see Theorem 3.1 in Section 3.1.2) we find that the maximum of $\log L(x;\mu_0,\Sigma)$ (whence also of $L(x;\mu_0,\Sigma)$) is attained at $\hat\Sigma_0 = \frac1n\sum_{i=1}^n(x_i-\mu_0)(x_i-\mu_0)^\top$, and the maximal value is $\frac{1}{(2\pi)^{np/2}|\hat\Sigma_0|^{n/2}}e^{-np/2}$. Hence the likelihood ratio is
$$\Lambda = \frac{\max_\Sigma L(x;\mu_0,\Sigma)}{\max_{\mu,\Sigma}L(x;\mu,\Sigma)} = \Big(\frac{|\hat\Sigma|}{|\hat\Sigma_0|}\Big)^{\frac n2}. \tag{4.2}$$
The equivalent statistic $\Lambda^{2/n} = |\hat\Sigma|/|\hat\Sigma_0|$ is called Wilks' lambda. Small values of Wilks' lambda lead to rejecting $H_0:\mu=\mu_0$.
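A small numerical sketch in Python with NumPy/SciPy (the course uses R and SAS; the simulated data and parameters here are made up) computes $T^2$ with its F-based p-value from Theorem 4.2 alongside Wilks' lambda as a determinant ratio; the final assertion checks the exact one-to-one relation between the two that is derived in the next subsection:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of n = 20 trivariate observations; mu0 is the
# hypothesised mean under H0.
rng = np.random.default_rng(3)
n, p = 20, 3
mu0 = np.zeros(p)
X = rng.multivariate_normal([0.3, 0.0, -0.2], np.eye(p), size=n)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
T2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)

# p-value from Theorem 4.2: T^2 (n-p) / (p (n-1)) ~ F_{p, n-p} under H0.
p_value = stats.f.sf(T2 * (n - p) / (p * (n - 1)), p, n - p)

# Wilks' lambda Lambda^{2/n} = |Sigma_hat| / |Sigma_hat_0| via the two MLEs:
Sigma_hat = (X - xbar).T @ (X - xbar) / n
Sigma_hat0 = (X - mu0).T @ (X - mu0) / n
wilks = np.linalg.det(Sigma_hat) / np.linalg.det(Sigma_hat0)

# Exact algebraic relation: Lambda^{2/n} = (1 + T^2/(n-1))^{-1}.
assert np.isclose(wilks, 1.0 / (1.0 + T2 / (n - 1)))
```

Note that the determinant route never inverts $S$, which is the computational shortcut mentioned below in Section 4.1.6.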
4.1.5 Wilks' lambda and $T^2$

The following theorem shows the relation between Wilks' lambda and $T^2$:

Theorem 4.3. The likelihood ratio test is equivalent to the test based on $T^2$, since $\Lambda^{2/n} = (1+\frac{T^2}{n-1})^{-1}$ holds.

Proof. Consider the matrix $A\in M_{p+1,p+1}$:
$$A = \begin{pmatrix}\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top & \sqrt n(\bar x-\mu_0)\\ \sqrt n(\bar x-\mu_0)^\top & -1\end{pmatrix} = \begin{pmatrix}A_{11}&A_{12}\\A_{21}&A_{22}\end{pmatrix}.$$
It is easy to check that
$$|A| = |A_{22}|\,|A_{11}-A_{12}A_{22}^{-1}A_{21}| = |A_{11}|\,|A_{22}-A_{21}A_{11}^{-1}A_{12}| \tag{4.3}$$
holds, from which we get
$$(-1)\Big|\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top + n(\bar x-\mu_0)(\bar x-\mu_0)^\top\Big| = \Big|\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top\Big|\Big(-1-n(\bar x-\mu_0)^\top\Big(\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top\Big)^{-1}(\bar x-\mu_0)\Big).$$
Hence $(-1)\big|\sum_{i=1}^n(x_i-\mu_0)(x_i-\mu_0)^\top\big| = \big|\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top\big|(-1)\big(1+\frac{T^2}{n-1}\big)$. Thus $|\hat\Sigma_0| = |\hat\Sigma|\big(1+\frac{T^2}{n-1}\big)$, i.e.
$$\Lambda^{2/n} = \Big(1+\frac{T^2}{n-1}\Big)^{-1}. \tag{4.4}$$ □

4.1.6 Numerical calculation of $T^2$

Hence $H_0$ is rejected for small values of $\Lambda^{2/n}$ or, equivalently, for large values of $T^2$. The critical values for $T^2$ are determined from Theorem 4.2. Relation (4.4) can be used to calculate $T^2$ from $\Lambda^{2/n} = |\hat\Sigma|/|\hat\Sigma_0|$, thus avoiding the need to invert the matrix $S$ when calculating $T^2$!

4.1.7 Asymptotic distribution of $T^2$

Since $S^{-1}$ is a consistent estimator of $\Sigma^{-1}$, the limiting distribution of $T^2$ coincides with that of $n(\bar x-\mu)^\top\Sigma^{-1}(\bar x-\mu)$, which, as we know already, is $\chi^2_p$. This coincides with the general claim of asymptotic theory that $-2\log\Lambda$ is asymptotically distributed as $\chi^2_p$. Indeed,
$$-2\log\Lambda = n\log\Big(1+\frac{T^2}{n-1}\Big)\approx\frac{n}{n-1}T^2\approx T^2$$
(by using the fact that $\log(1+x)\approx x$ for small $x$).

4.2 Confidence regions for the mean vector and for its components

4.2.1 Confidence region for the mean vector

For a given confidence level $1-\alpha$ it can be constructed in the form
$$\Big\{\mu\ \Big|\ n(\bar x-\mu)^\top S^{-1}(\bar x-\mu)\le\frac{p(n-1)}{n-p}F_{1-\alpha,p,n-p}\Big\},$$
where $F_{1-\alpha,p,n-p}$ is the upper $\alpha\cdot100\%$ percentage point of the F distribution with $(p,n-p)$ degrees of freedom. This confidence region has the form of an ellipsoid in $\mathbb R^p$ centred at $\bar x$.
The axes of this confidence ellipsoid are directed along the eigenvectors $e_i$ of the matrix $S = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top$. The half-lengths of the axes are given by the expression
$$\sqrt{\lambda_i}\sqrt{\frac{p(n-1)F_{1-\alpha,p,n-p}}{n(n-p)}},$$
with $\lambda_i$, $i=1,\dots,p$, being the corresponding eigenvalues, i.e. $Se_i=\lambda_ie_i$, $i=1,\dots,p$.

Example 4.4. Microwave ovens (Example 5.3, pages 221–223, JW).

4.2.2 Simultaneous confidence statements

For a given confidence level $1-\alpha$, the confidence ellipsoids of Section 4.2.1 correctly reflect the joint (multivariate) knowledge about plausible values of $\mu\in\mathbb R^p$, but one is nevertheless often interested in confidence intervals for the means of the individual components. We would like to formulate these statements in such a way that all of the separate confidence statements hold simultaneously with a prespecified probability. This is why we speak of simultaneous confidence intervals.

First, note that if $X\sim N_p(\mu,\Sigma)$, then for any $l\in\mathbb R^p$, $l^\top X\sim N_1(l^\top\mu,l^\top\Sigma l)$, and hence for any fixed $l$ we can construct a $(1-\alpha)\cdot100\%$ confidence interval for $l^\top\mu$ in the following simple way:
$$\Big(l^\top\bar x - t_{1-\alpha/2,n-1}\frac{\sqrt{l^\top Sl}}{\sqrt n},\ \ l^\top\bar x + t_{1-\alpha/2,n-1}\frac{\sqrt{l^\top Sl}}{\sqrt n}\Big). \tag{4.5}$$
By taking $l^\top = [1,0,\dots,0]$ or $l^\top=[0,1,0,\dots,0]$, etc., we obtain from (4.5) the usual confidence interval for each separate component of the mean. Note, however, that the confidence level of all these statements taken together is not $1-\alpha$. To make it $1-\alpha$ for all possible choices simultaneously, we need to take a larger constant than $t_{1-\alpha/2,n-1}$ on the right-hand side of the inequality $\big|\frac{\sqrt n(l^\top\bar x - l^\top\mu)}{\sqrt{l^\top Sl}}\big|\le t_{1-\alpha/2,n-1}$ (or, equivalently, $\frac{n(l^\top\bar x-l^\top\mu)^2}{l^\top Sl}\le t^2_{1-\alpha/2,n-1}$).

4.2.3 Simultaneous confidence ellipsoid

Theorem 4.5. Simultaneously for all $l\in\mathbb R^p$, the interval
$$\Big(l^\top\bar x-\sqrt{\frac{p(n-1)}{n(n-p)}F_{1-\alpha,p,n-p}\,l^\top Sl},\ \ l^\top\bar x+\sqrt{\frac{p(n-1)}{n(n-p)}F_{1-\alpha,p,n-p}\,l^\top Sl}\Big)$$
will contain $l^\top\mu$ with probability at least $1-\alpha$.

Example 4.6. Microwave ovens (Example 5.4, p. 226 in JW).
Proof. Note that, according to the Cauchy–Bunyakovsky–Schwarz inequality,
$$[l^\top(\bar x-\mu)]^2 = [(S^{1/2}l)^\top S^{-1/2}(\bar x-\mu)]^2\le\|S^{1/2}l\|^2\,\|S^{-1/2}(\bar x-\mu)\|^2 = (l^\top Sl)\,(\bar x-\mu)^\top S^{-1}(\bar x-\mu).$$
Therefore,
$$\max_l\frac{n(l^\top(\bar x-\mu))^2}{l^\top Sl}\le n(\bar x-\mu)^\top S^{-1}(\bar x-\mu) = T^2. \tag{4.6}$$
Inequality (4.6) helps us to claim that whenever a constant $c$ is such that $T^2\le c^2$, then also $\frac{n(l^\top\bar x-l^\top\mu)^2}{l^\top Sl}\le c^2$ holds for any $l\ne0\in\mathbb R^p$. Equivalently,
$$l^\top\bar x-c\sqrt{\frac{l^\top Sl}{n}}\le l^\top\mu\le l^\top\bar x+c\sqrt{\frac{l^\top Sl}{n}} \tag{4.7}$$
for every $l$. Now it remains to choose $c^2 = \frac{p(n-1)}{n-p}F_{1-\alpha,p,n-p}$ to make sure that $1-\alpha = P(T^2\le c^2)$ holds, and this will automatically ensure that (4.7) contains $l^\top\mu$ with probability $1-\alpha$. □

Bonferroni Method

The simultaneous confidence intervals, when applied to the vectors $l^\top=[1,0,\dots,0]$, $l^\top=[0,1,0,\dots,0]$, etc., are much more reliable at a given confidence level than the one-at-a-time intervals. Note that the former also utilise the covariance structure of all $p$ variables in their construction. However, we can sometimes do better when one is interested in only a small number of individual confidence statements. In this latter case, the simultaneous confidence intervals may give too large a region, and the Bonferroni method may prove more efficient instead. The idea of the Bonferroni approach is based on a simple probabilistic inequality. Assume that simultaneous confidence statements about $m$ linear combinations $l_1^\top\mu, l_2^\top\mu,\dots,l_m^\top\mu$ are required. If $C_i$, $i=1,2,\dots,m$, denotes the $i$th confidence statement and $P(C_i\text{ true}) = 1-\alpha_i$, then
$$P(\text{all }C_i\text{ true}) = 1-P(\text{at least one }C_i\text{ false})\ge1-\sum_{i=1}^m P(C_i\text{ false}) = 1-\sum_{i=1}^m(1-P(C_i\text{ true})) = 1-(\alpha_1+\alpha_2+\dots+\alpha_m).$$
Hence, if we choose $\alpha_i=\frac\alpha m$, $i=1,2,\dots,m$ (that is, if we compute each statement at confidence level $(1-\frac\alpha m)\cdot100\%$ instead of $(1-\alpha)\cdot100\%$), then the probability that any statement is false will not exceed $\alpha$.

Example 4.7.
Microwave ovens (based on JW Example 5.4, p. 226).

4.3 Comparison of two or more mean vectors

Finally, let us note that comparing the mean vectors of two or more different multivariate populations, given independent observations from each of the populations, is an important and practically relevant problem. For the purposes of this section, suppose that we observe two samples, $X_1,X_2,\dots,X_{n_X}\in\mathbb R^p$ and $Y_1,Y_2,\dots,Y_{n_Y}\in\mathbb R^p$, with means $\mu_X\in\mathbb R^p$ and $\mu_Y\in\mathbb R^p$ and variances $\Sigma_X\in M_{p,p}$ and $\Sigma_Y\in M_{p,p}$, respectively. Typically, we wish to test $H_0:\mu_X-\mu_Y=\delta_0$. Multivariate ANOVA, for comparing more than two populations, is discussed in Lecture 8.

4.3.1 Reducing to a single population

As with the univariate t-test, under some scenarios the test of a difference between two populations in fact reduces to a one-sample test. For example, if the samples are paired and $n_X=n_Y=n$, we may proceed analogously to the paired t-test: we take $D_i = X_i-Y_i$ for $i=1,\dots,n$ and proceed as if with a one-sample $T^2$ test:
$$T^2 = n(\bar D-\delta_0)^\top S_D^{-1}(\bar D-\delta_0)\sim\frac{(n-1)p}{n-p}F_{p,n-p}, \tag{4.8}$$
where $\bar D\in\mathbb R^p$ and $S_D\in M_{p,p}$ are the sample mean and variance of $D_1,\dots,D_n$, respectively, assuming the $D_i$ are normally distributed. (It is important to note that any diagnostics for this test should be performed on the differences, not on the original values.)

We can also formulate this in a "multivariate" form: let the contrast matrix be $C = (I_p\ \ {-I_p})\in M_{p,2p}$, so that each row of $C$ has a $+1$ matched with a $-1$. Then we can express $D_i = C\binom{X_i}{Y_i}$ and the test as $H_0: C\binom{\mu_X}{\mu_Y} = \delta_0$. It is easy to show that the test statistic reduces to (4.8).

$C$ can have more complex forms. For example, in a repeated measures design, we may measure the results of a series of $p$ treatment outcomes on each sampling unit.
If we then collect each individual $i$'s measurements into a vector $X_i$, we may test whether all outcomes are the same in expectation by forming
$$C = \begin{pmatrix}1&-1&&\\&\ddots&\ddots&\\&&1&-1\end{pmatrix}\in M_{p-1,p}$$
and testing $H_0: C\mu_X = 0_{p-1}$. It is easy to show that $C\mu_X=0_{p-1}$ holds if and only if all elements of $\mu_X$ are equal.

4.3.2 The two-sample $T^2$-test

We now turn to the scenario where $X$ and $Y$ are, in fact, independent samples. As with the univariate test, we must decide whether we are prepared to assume that $\Sigma_X=\Sigma_Y=\Sigma$ in the population and therefore use the pooled test. If so — and necessarily if the sample sizes are small — we evaluate
$$S_{\text{pooled}} = \frac{(n_X-1)S_X+(n_Y-1)S_Y}{n_X+n_Y-2}.$$
Since $S_{\text{pooled}}$ estimates $\Sigma$,
$$\operatorname{Var}(\bar X-\bar Y) = \frac{\Sigma}{n_X}+\frac{\Sigma}{n_Y}\approx\frac{S_{\text{pooled}}}{n_X}+\frac{S_{\text{pooled}}}{n_Y} = S_{\text{pooled}}\Big(\frac1{n_X}+\frac1{n_Y}\Big).$$
And, since $\bar X-\bar Y\sim N_p(\mu_X-\mu_Y,\Sigma(n_X^{-1}+n_Y^{-1}))$, we write
$$T^2 = (\bar X-\bar Y-\delta_0)^\top\Big\{S_{\text{pooled}}\Big(\frac1{n_X}+\frac1{n_Y}\Big)\Big\}^{-1}(\bar X-\bar Y-\delta_0)\sim\frac{(n_X+n_Y-2)p}{n_X+n_Y-p-1}F_{p,n_X+n_Y-p-1}. \tag{4.9}$$
We would thus reject $H_0$ if $T^2$ falls above the F critical value in (4.9), construct a confidence region based on
$$\Big\{\delta\ \Big|\ (\bar x-\bar y-\delta)^\top\Big\{S_{\text{pooled}}\Big(\frac1{n_X}+\frac1{n_Y}\Big)\Big\}^{-1}(\bar x-\bar y-\delta)\le\frac{(n_X+n_Y-2)p}{n_X+n_Y-p-1}F_{1-\alpha,p,n_X+n_Y-p-1}\Big\},$$
and form simultaneous contrast confidence intervals
$$l^\top(\bar x-\bar y)\pm\sqrt{\frac{(n_X+n_Y-2)p}{n_X+n_Y-p-1}F_{1-\alpha,p,n_X+n_Y-p-1}\,l^\top S_{\text{pooled}}\Big(\frac1{n_X}+\frac1{n_Y}\Big)l}.$$
If we are not prepared to make the pooling assumption, our test statistic is instead
$$T^2 = (\bar X-\bar Y-\delta_0)^\top\Big(\frac{S_X}{n_X}+\frac{S_Y}{n_Y}\Big)^{-1}(\bar X-\bar Y-\delta_0).$$
Even for modest sample sizes, under multivariate normality, the distribution of this $T^2$ is reasonably well approximated by $\frac{\nu p}{\nu-p+1}F_{p,\nu-p+1}$, where
$$\nu = \frac{p+p^2}{\sum_{i=1}^2\frac1{n_i}\Big(\operatorname{tr}\Big[\Big\{\frac1{n_i}S_i\Big(\frac1{n_1}S_1+\frac1{n_2}S_2\Big)^{-1}\Big\}^2\Big]+\Big[\operatorname{tr}\Big\{\frac1{n_i}S_i\Big(\frac1{n_1}S_1+\frac1{n_2}S_2\Big)^{-1}\Big\}\Big]^2\Big)}.$$
The confidence regions are then produced by
$$\Big\{\delta\ \Big|\ (\bar x-\bar y-\delta)^\top\Big(\frac{S_X}{n_X}+\frac{S_Y}{n_Y}\Big)^{-1}(\bar x-\bar y-\delta)\le\frac{\nu p}{\nu-p+1}F_{1-\alpha,p,\nu-p+1}\Big\}$$
and simultaneous contrast confidence intervals
$$l^\top(\bar x-\bar y)\pm\sqrt{\frac{\nu p}{\nu-p+1}F_{1-\alpha,p,\nu-p+1}\,l^\top\Big(\frac{S_X}{n_X}+\frac{S_Y}{n_Y}\Big)l}.$$
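The pooled two-sample test can be sketched as follows, in Python with NumPy/SciPy (the course's own software of choice is R, e.g. rrcov::T2.test; the data below are hypothetical):

```python
import numpy as np
from scipy import stats

def two_sample_T2(X, Y, delta0=None):
    """Pooled two-sample Hotelling T^2 test of H0: mu_X - mu_Y = delta0."""
    nX, p = X.shape
    nY = Y.shape[0]
    if delta0 is None:
        delta0 = np.zeros(p)
    d = X.mean(axis=0) - Y.mean(axis=0) - delta0
    S_pooled = ((nX - 1) * np.cov(X, rowvar=False) +
                (nY - 1) * np.cov(Y, rowvar=False)) / (nX + nY - 2)
    T2 = d @ np.linalg.solve(S_pooled * (1 / nX + 1 / nY), d)
    # (4.9): T^2 ~ (nX + nY - 2) p / (nX + nY - p - 1) * F_{p, nX + nY - p - 1}
    F = T2 * (nX + nY - p - 1) / ((nX + nY - 2) * p)
    return T2, stats.f.sf(F, p, nX + nY - p - 1)

# Hypothetical data: two independent bivariate samples with a common covariance.
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=25)
Y = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=30)
T2, p_value = two_sample_T2(X, Y)
```

Since both samples were generated with the same mean, the p-value here should typically be unremarkable; swapping in unequal means drives it towards zero.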
4.4 Software

R: car::confidenceEllipse, package Hotelling, rrcov::T2.test, ergm::approx.hotelling.diff.test, MVTests::TwoSamplesHT2

SAS: See IML implementations.

4.5 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 5.1–5.5 and 6.

4.6 Exercises

Exercise 4.1 Suppose $X_1,X_2,\dots,X_n$ are independent $N_p(\mu,\Sigma)$ random vectors with sample mean vector $\bar X$ and sample covariance matrix $S$. We wish to test the hypothesis $H_0:\mu_2-\mu_1=\mu_3-\mu_2=\dots=\mu_p-\mu_{p-1}=1$, where $\mu_1,\mu_2,\dots,\mu_p$ are the elements of $\mu$.
(a) Determine a $(p-1)\times p$ matrix $C$ so that $H_0$ may be written equivalently as $H_0: C\mu = 1$, where $1$ is a $(p-1)\times1$ vector of ones.
(b) Make an appropriate transformation of the vectors $X_i$, $i=1,2,\dots,n$, and hence find the rejection region of a size-$\alpha$ test of $H_0$ in terms of $\bar X$, $S$, and $C$.

Exercise 4.2 A sample of 50 vector observations, each containing three components, is drawn from a normal distribution having covariance matrix
$$\Sigma = \begin{pmatrix}3&1&1\\1&4&1\\1&1&2\end{pmatrix}.$$
The components of the sample mean are 0.8, 1.1 and 0.6. Can you reject the null hypothesis of zero distribution mean against a general alternative?

Exercise 4.3 Evaluate Hotelling's statistic $T^2$ for testing $H_0:\mu = \binom{7}{11}$ using the data matrix
$$X = \begin{pmatrix}2&8&6&8\\12&9&9&10\end{pmatrix}.$$
Test the hypothesis $H_0$ at level $\alpha=0.05$. What conclusion is reached?

Exercise 4.4 Let $X_1,\dots,X_{n_1}$ be i.i.d. $N_p(\mu_1,\Sigma)$, independently of $Y_1,\dots,Y_{n_2}$ i.i.d. $N_p(\mu_2,\Sigma)$, with $\Sigma$ known. Prove that $\bar X\sim N_p(\mu_1,\frac1{n_1}\Sigma)$ and $\bar Y\sim N_p(\mu_2,\frac1{n_2}\Sigma)$. Hence $W = \bar X-\bar Y\sim N\big(\mu_1-\mu_2,(\frac1{n_1}+\frac1{n_2})\Sigma\big)$, so that $\bar X-\bar Y-(\mu_1-\mu_2)\sim N\big(0,(\frac1{n_1}+\frac1{n_2})\Sigma\big)$. Construct a test of $H_0:\mu_1=\mu_2$.

Exercise 4.5 Let $\bar X$ and $S$ be based on $n$ observations from $N_p(\mu,\Sigma)$, and let $X$ be an additional observation from $N_p(\mu,\Sigma)$. Show that $X-\bar X\sim N_p\big(0,(1+\frac1n)\Sigma\big)$.
Find the distribution of $\frac{n}{n+1}(X-\bar X)^\top S^{-1}(X-\bar X)$ and suggest how to use this result to give a $(1-\alpha)$ prediction region for $X$ based on $\bar X$ and $S$ (i.e., a region in $\mathbb R^p$ such that one has a given confidence $1-\alpha$ that the next observation will fall into it).

5 Correlation, Partial Correlation, Multiple Correlation

5.1 Partial correlation 44
5.1.1 Simple formulae 44
5.1.2 Software 45
5.1.3 Examples 45
5.2 Multiple correlation 45
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of transformed data 46
5.2.2 Interpretation of R 46
5.2.3 Remark about the calculation of R² 46
5.2.4 Examples 47
5.3 Testing of correlation coefficients 48
5.3.1 Usual correlation coefficients 48
5.3.2 Partial correlation coefficients 48
5.3.3 Multiple correlation coefficients 48
5.3.4 Software 48
5.3.5 Examples 48
5.4 Additional resources 49
5.5 Exercises 49

First of all, we would like to make some general comments on the similarities and differences between correlation and dependence. Very often we are interested in correlations (dependencies) between a number of random variables and try to describe the "strength" of the (mutual) dependencies. For example, we might like to know whether there is a correlation (mutual, non-directed dependence) between the length of the arm and of the leg. But if we would like to obtain information about (or to predict) the length of the arm by measuring the length of the leg, we are dealing with dependence of the arm's length on the leg's length. Both problems described in this example make sense. On the other hand, there are other examples and situations in which only one of the problems is interesting or makes sense. Studying the dependence between rain and crops makes perfect sense, but there is no sense at all in studying the (directed) influence of crops on rain. In a nutshell, when studying mutual (linear) dependence we are dealing with correlation theory, whereas when studying the directed influence of one (input) variable on another (output) variable we are dealing with regression theory. It should be clearly pointed out, though, that correlation alone, no matter how strong, cannot help us identify the direction of influence and cannot help us in regression modelling. Our reasoning about the direction of influence must come from outside statistical theory. Another important point to always bear in mind is that, as already discussed in Lecture 2, uncorrelated does not necessarily mean independent if the multivariate data happen to fail the multivariate normality test. Nonetheless, for multivariate normal data, the notions of "uncorrelated" and "independent" coincide.
In general, there are 3 types of correlation coefficients:

• the usual correlation coefficient between 2 variables;
• the partial correlation coefficient between 2 variables after adjusting for the effect (regression, association) of a set of other variables;
• the multiple correlation between a single random variable and a set of p other variables.

5.1 Partial correlation

For X ∼ Np(µ, Σ) we defined the correlation coefficients

ρij = σij / (√σii √σjj), i, j = 1, 2, . . . , p,

and discussed the MLEs ρ̂ij in (3.6). It turned out that they coincide with the sample correlations rij we introduced in the first lecture (formula (1.3)).

To define partial correlation coefficients, recall Property 4 of the multivariate normal distribution from Section 2.2: if the vector X ∈ Rp is divided into

X = ( X(1)
      X(2) ),  X(1) ∈ Rr, r < p, X(2) ∈ Rp−r,

and according to this subdivision the mean vector is µ = ( µ(1) ; µ(2) ) and the covariance matrix Σ has been subdivided into

Σ = ( Σ11 Σ12
      Σ21 Σ22 ),

and Σ22 is of full rank, then the conditional density of X(1) given that X(2) = x(2) is

Nr( µ(1) + Σ12Σ22⁻¹(x(2) − µ(2)), Σ11 − Σ12Σ22⁻¹Σ21 ).

We define the partial correlations of X(1) given X(2) = x(2) as the usual correlation coefficients calculated from the elements σij.(r+1),(r+2),...,p of the matrix Σ1|2 = Σ11 − Σ12Σ22⁻¹Σ21, i.e.

ρij.(r+1),(r+2),...,p = σij.(r+1),(r+2),...,p / (√σii.(r+1),(r+2),...,p √σjj.(r+1),(r+2),...,p).   (5.1)

We call ρij.(r+1),(r+2),...,p the correlation of the ith and jth components when the components (r + 1), (r + 2), . . . , p (i.e. the last p − r components) have been held fixed. The interpretation is that we are looking for the association (correlation) between the ith and jth components after eliminating the effect that the last p − r components might have had on this association.
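The defining computation Σ1|2 = Σ11 − Σ12Σ22⁻¹Σ21, rescaled to correlations as in (5.1), can be sketched in a few lines (Python with numpy, for illustration only; the helper `partial_corr` and its argument names are ours, not from any package):

```python
import numpy as np

def partial_corr(Sigma, keep, given):
    """Partial correlations of the `keep` components of a random vector,
    holding the `given` components fixed, from its covariance (or
    correlation) matrix, via Sigma11 - Sigma12 Sigma22^{-1} Sigma21."""
    keep, given = list(keep), list(given)
    S11 = Sigma[np.ix_(keep, keep)]
    S12 = Sigma[np.ix_(keep, given)]
    S22 = Sigma[np.ix_(given, given)]
    S1_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)  # conditional covariance
    d = np.sqrt(np.diag(S1_2))
    return S1_2 / np.outer(d, d)  # rescale to correlations, as in (5.1)

# Illustration: rho_{12.3} from a made-up 3x3 correlation matrix.
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
print(partial_corr(R, keep=[0, 1], given=[2])[0, 1])  # ≈ 0.4346
```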
To find ML estimates of these, we use the transformation-invariance property of the MLE: if

Σ̂ = ( Σ̂11 Σ̂12
       Σ̂21 Σ̂22 )

is the usual MLE of the covariance matrix, then Σ̂1|2 = Σ̂11 − Σ̂12Σ̂22⁻¹Σ̂21, with elements σ̂ij.(r+1),(r+2),...,p, i, j = 1, 2, . . . , r, is the MLE of Σ1|2, and correspondingly

ρ̂ij.(r+1),(r+2),...,p = σ̂ij.(r+1),(r+2),...,p / (√σ̂ii.(r+1),(r+2),...,p √σ̂jj.(r+1),(r+2),...,p), i, j = 1, 2, . . . , r,

will be the ML estimators of ρij.(r+1),(r+2),...,p, i, j = 1, 2, . . . , r.

5.1.1 Simple formulae

For situations when p is not large, simple plug-in formulae, special cases of the above general result, express the partial correlation coefficients in terms of the usual correlation coefficients:

i) partial correlation between the first and second variables, adjusting for the effect of the third:

ρ12.3 = (ρ12 − ρ13ρ23) / √((1 − ρ13²)(1 − ρ23²));

ii) partial correlation between the first and second variables, adjusting for the effects of the third and fourth variables:

ρ12.3,4 = (ρ12.4 − ρ13.4ρ23.4) / √((1 − ρ13.4²)(1 − ρ23.4²)).

For higher-dimensional cases computers need to be utilised.

5.1.2 Software

SAS: PROC CORR
R: ggm::pcor, ggm::parcor

5.1.3 Examples

Example 5.1. Three variables have been measured for a set of schoolchildren:

i) X1: Intelligence
ii) X2: Weight
iii) X3: Age

The number of observations was large enough that one can take the empirical correlation matrix ρ̂ ∈ M3,3 to be the true correlation matrix:

ρ̂ = ( 1      0.6162 0.8267
       0.6162 1      0.7321
       0.8267 0.7321 1      ).

This suggests there is a high degree of positive dependence between weight and intelligence. But (do the calculation!) ρ̂12.3 = 0.0286, so that, after the effect of age is adjusted for, there is virtually no correlation between weight and intelligence; i.e., weight obviously plays little part in explaining intelligence.
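The calculation requested in Example 5.1 takes only a couple of lines (Python for illustration; formula i) above):

```python
import math

# Correlations from Example 5.1: X1 intelligence, X2 weight, X3 age.
r12, r13, r23 = 0.6162, 0.8267, 0.7321

# Formula i) of Section 5.1.1: adjust the X1-X2 correlation for X3.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))
print(round(r12_3, 4))  # 0.0286
```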
5.2 Multiple correlation

Recall our discussion at the end of Section 2.2 of best prediction in the mean-square sense under multivariate normality: if we want to predict a random variable Y that is correlated with p random variables (predictors) X = (X1, X2, . . . , Xp)⊤ by trying to minimise the expected value E[{Y − g(X)}² | X = x], the optimal solution (i.e. the regression function) was g∗(X) = E(Y | X). When the joint (p + 1)-dimensional distribution of Y and X is normal, this function is linear in X. Given a specific realisation x of X, it is given by b + σ0⊤C⁻¹x, where b = E(Y) − σ0⊤C⁻¹E(X), C is the covariance matrix of the vector X, and σ0 is the vector of covariances of Y with Xi, i = 1, . . . , p. The vector C⁻¹σ0 ∈ Rp was the vector of the regression coefficients.

Now, let us define the multiple correlation coefficient between the random variable Y and the random vector X ∈ Rp to be the maximum correlation between Y and any linear combination α⊤X, α ∈ Rp. This makes sense: we look for the maximal correlation that we can get by trying to predict Y as a linear function of the predictors. The solution, which also gives us an algorithm to calculate (and estimate) the multiple correlation coefficient, is given in the next lemma.

5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of transformed data

Lemma 5.2. The multiple correlation coefficient is the ordinary correlation coefficient between Y and σ0⊤C⁻¹X ≡ β∗⊤X. (I.e., β∗ ≡ C⁻¹σ0.)

Proof. Note that for any α ∈ Rp: Cov(Y, α⊤X) = α⊤σ0 = α⊤Cβ∗ and, in particular, Cov(Y, β∗⊤X) = β∗⊤Cβ∗ holds. Using the Cauchy–Bunyakovsky–Schwarz inequality we have

[Cov(α⊤X, β∗⊤X)]² ≤ Var(α⊤X) Var(β∗⊤X)

and therefore

σY² ρ²(Y, α⊤X) = (α⊤σ0)² / (α⊤Cα) = (α⊤Cβ∗)² / (α⊤Cα) ≤ β∗⊤Cβ∗

holds, σY² denoting the variance of Y. In this last inequality we attain equality by choosing α = β∗, i.e.
the squared correlation coefficient ρ²(Y, α⊤X) of Y and α⊤X is maximised over α when α = β∗.

Coefficient of Determination. From Lemma 5.2 we see that the maximum correlation between Y and any linear combination α⊤X, α ∈ Rp, is

R = √( β∗⊤Cβ∗ / σY² ).

This is the multiple correlation coefficient. Its square R² is called the coefficient of determination. Having in mind that β∗ = C⁻¹σ0, we see that

R = √( σ0⊤C⁻¹σ0 / σY² ).

If

Σ = ( σY²  σ0⊤
      σ0   C   ) = ( Σ11 Σ12
                     Σ21 Σ22 )

is the partitioned covariance matrix of the (p + 1)-dimensional vector (Y, X⊤)⊤, then we know how to calculate the MLE of Σ by Σ̂ = ( Σ̂11 Σ̂12 ; Σ̂21 Σ̂22 ), so the MLE of R would be

R̂ = √( Σ̂12Σ̂22⁻¹Σ̂21 / Σ̂11 ).

5.2.2 Interpretation of R

At the end of Section 2.2 we derived the minimal value of the mean squared error when trying to predict Y by a linear function of the vector X. It is achieved when using the regression function, and the value itself was σY² − σ0⊤C⁻¹σ0. The latter value can also be expressed using the value of R: it is equal to σY²(1 − R²). Thus, our conclusion is that when R² = 0 there is no predictive power at all. In the opposite extreme case, if R² = 1, it turns out that Y can be predicted without any error at all (it is an exact linear function of X).

5.2.3 Remark about the calculation of R²

Sometimes only the correlation matrix may be available. It can be shown that in that case the relation

1 − R² = 1/ρYY   (5.2)

holds. In (5.2), ρYY ≡ (ρ⁻¹)11 is the upper left-hand corner of the inverse of the correlation matrix ρ ∈ Mp+1,p+1 determined from Σ. We note that the relation ρ = V^{−1/2}ΣV^{−1/2} holds with V = diag(σY², c11, . . . , cpp). One can use (5.2) to calculate R² by first calculating the right-hand side of (5.2). To show equality (5.2), we note that

1 − R² = (σY² − σ0⊤C⁻¹σ0) / σY² = |Σ| / (|C|σY²),

with the last equality holding because of the partitioned-determinant identity (4.3), |Σ| = |C|(σY² − σ0⊤C⁻¹σ0).
But |C|/|Σ| = σYY ≡ (Σ⁻¹)11, the entry in the first row and column of Σ⁻¹. (Recall from Section 0.1.2: (X⁻¹)ji = (−1)^{i+j} |Xij| / |X|.) Since ρ⁻¹ = V^{1/2}Σ⁻¹V^{1/2}, we see that ρYY = σYY σY² holds. Therefore 1 − R² = 1/ρYY.

5.2.4 Examples

Example 5.3. Let

µ = ( µY
      µX1
      µX2 ) = ( 5
                2
                0 )  and  Σ = ( 10  1 −1
                                 1  7  3
                                −1  3  2 ) = ( σYY σ0⊤
                                               σ0  ΣXX ).

Calculate:

(a) the best linear prediction of Y using X1 and X2;
(b) the multiple correlation coefficient RY.(X1,X2);
(c) the mean squared error of the best linear predictor.

Solution.

β∗ = ΣXX⁻¹σ0 = ( 7 3
                 3 2 )⁻¹ (  1
                           −1 ) = (  .4 −.6
                                    −.6 1.4 ) (  1
                                                −1 ) = (  1
                                                         −2 )

and b = µY − β∗⊤µX = 5 − (1, −2)(2, 0)⊤ = 3. Hence the best linear predictor is given by 3 + X1 − 2X2. The value of the multiple correlation coefficient is

RY.(X1,X2) = √( (1, −1) ( .4 −.6 ; −.6 1.4 ) (1, −1)⊤ / 10 ) = √(3/10) = .548.

The mean squared error of prediction is σY²(1 − R²Y.(X1,X2)) = 10(1 − 3/10) = 7.

Example 5.4. Relationship between multiple correlation and regression, and equivalent ways of computing it.

5.3 Testing of correlation coefficients

5.3.1 Usual correlation coefficients

When considering the distribution of a particular correlation coefficient ρ̂ij = rij, the problem becomes bivariate because only the variables Xi and Xj are involved. Direct transformations with the bivariate normal can be utilised to derive the exact distribution of rij under the hypothesis H0: ρij = 0. It turns out that in this case the statistic

T = rij√(n − 2) / √(1 − rij²) ∼ tn−2,

and tests can be performed by using the t-distribution. For other hypothesised values the derivations are more painful. There is one most frequently used approximation that holds no matter what the true value of ρij is. We shall discuss it here. Consider Fisher's Z transformation

Z = (1/2) log[(1 + rij)/(1 − rij)].
Under the hypothesis H0: ρij = ρ0 it holds approximately that

Z ≈ N( (1/2) log[(1 + ρ0)/(1 − ρ0)], 1/(n − 3) ).

In particular, in the most common situation, when one would like to test H0: ρij = 0 versus H1: ρij ≠ 0, one would reject H0 at the 5% significance level if |Z|√(n − 3) ≥ 1.96. Based on the above, you can now suggest how to test the hypothesis of equality of two correlation coefficients from two different populations(!).

5.3.2 Partial correlation coefficients

Coming over to testing partial correlations, not much has to be changed. Fisher's Z approximation can be used again in the following way: to test

H0: ρij.r+1,r+2,...,r+k = ρ0 versus H1: ρij.r+1,r+2,...,r+k ≠ ρ0

(i.e., conditioning on k variables) we construct Z = (1/2) log[(1 + rij.r+1,r+2,...,r+k)/(1 − rij.r+1,r+2,...,r+k)] and a = (1/2) log[(1 + ρ0)/(1 − ρ0)]. Asymptotically Z ∼ N(a, 1/(n − k − 3)) holds. Hence the test statistic to be compared with significance points of the standard normal is now √(n − k − 3) |Z − a|. If ρ0 = 0, the t-test can be used, with "n − 2" replaced by "n − k − 2" in both the statistic and the degrees of freedom.

5.3.3 Multiple correlation coefficients

It turns out that under the hypothesis H0: R = 0 the statistic

F = (R̂²/(1 − R̂²)) × ((n − p)/(p − 1)) ∼ Fp−1,n−p.

Hence, when testing significance of the multiple correlation, the rejection region would be

{ (R̂²/(1 − R̂²)) × ((n − p)/(p − 1)) > F1−α,p−1,n−p }

for a given significance level α. It should be stressed that the value of p in Section 5.3.3 refers to the total number of all variables (the output Y and all of the input variables in the input vector X). This is different from the value of p that was used in Section 5.2. In other words, the p in Section 5.3.3 is the p + 1 of Section 5.2.

5.3.4 Software

SAS: PROC CORR
R: ggm::pcor.test

5.3.5 Examples

Example 5.5. Testing ordinary correlations: age, height, and intelligence.

Example 5.6. Testing partial correlations: age, height, and intelligence.
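The Fisher-Z tests of Sections 5.3.1 and 5.3.2 are simple enough to sketch directly (Python for illustration; `fisher_z_test` is our own helper, not a library function):

```python
import math

def fisher_z_test(r, n, rho0=0.0, k=0):
    """Approximate two-sided test of H0: rho = rho0 via Fisher's Z.

    r    -- sample (partial) correlation coefficient
    n    -- sample size
    rho0 -- hypothesised value under H0
    k    -- number of variables conditioned on (0 for an ordinary correlation)
    Returns the test statistic and its approximate two-sided p-value.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))
    a = 0.5 * math.log((1 + rho0) / (1 - rho0))
    stat = math.sqrt(n - k - 3) * abs(z - a)
    p = math.erfc(stat / math.sqrt(2))  # = 2 * (1 - Phi(stat))
    return stat, p

stat, p = fisher_z_test(r=0.5, n=28)
print(round(stat, 3), round(p, 4))  # ≈ 2.747, ≈ 0.006
```

With n = 28 and r = 0.5 the statistic exceeds 1.96, so H0: ρ = 0 is rejected at the 5% level.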
5.4 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 7.8.

5.5 Exercises

Exercise 5.1 Suppose X ∼ N4(µ, Σ) where

µ = ( 1
      2
      3
      4 )  and  Σ = ( 3 1 0  1
                      1 4 0  0
                      0 0 1  4
                      1 0 4 20 ).

Determine:

(a) the distribution of (X1, X2, X3, X1 + X2 + X4)⊤;
(b) the conditional mean and variance of X1 given x2, x3, and x4;
(c) the partial correlation coefficients ρ12.3, ρ12.4;
(d) the multiple correlation between X1 and (X2, X3, X4). Compare it to ρ12 and comment.
(e) Justify that (X2, X3, X4)⊤ is independent of X1 − (1, 0, 1) ( 4 0 0 ; 0 1 4 ; 0 4 20 )⁻¹ (X2, X3, X4)⊤.

Exercise 5.2 A random vector X ∼ N3(µ, Σ) with

µ = (  2
      −3
       1 )  and  Σ = ( 1 1 1
                       1 3 2
                       1 2 2 ).

(a) Find the distribution of 3X1 − 2X2 + X3.
(b) Find a vector a ∈ R² such that X2 and X2 − a⊤(X1, X3)⊤ are independent.

6 Principal Components Analysis

6.1 Introduction . . . 50
6.2 Precise mathematical formulation . . . 50
6.3 Estimation of the Principal Components . . . 51
6.4 Deciding how many principal components to include . . . 52
6.5 Software . . . 53
6.6 Examples . . . 53
6.7 PCA and Factor Analysis . . . 53
6.8 Application to finance: Portfolio optimisation . . . 53
6.9 Additional resources . . . 54
6.10 Exercises . . . 54

6.1 Introduction

Principal components analysis is applied mainly as a variable reduction procedure.
It is usually applied when data are obtained on a possibly large number of variables which may be highly correlated. The goal is to try to "condense" the information: we summarise the data in a (small) number of transformations of the original variables. Our motivation for doing so is the belief that there is some redundancy in how the original set of variables presents the information, since, e.g., many of these variables may be measuring the same construct. In that case we try to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variability in the observed variables. For simplicity, these artificial new variables are constructed as linear combinations of the (optimally weighted) observed variables. If one linear combination is not enough, we can choose to construct two, three, etc. such combinations. Note also that principal components analysis may be just an intermediate step in a much larger investigation. The principal components obtained can be used, for example, as inputs in a regression analysis or in a cluster analysis procedure. They are also a basic method for extracting factors in factor analysis.

6.2 Precise mathematical formulation

Let X ∼ Np(µ, Σ) where p is assumed to be relatively large. To perform a reduction, we look for a linear combination α1⊤X with α1 ∈ Rp suitably chosen such that it maximises the variance of α1⊤X subject to the reasonable normalising constraint ∥α1∥² = α1⊤α1 = 1. Since Var(α1⊤X) = α1⊤Σα1, we need to choose α1 to maximise α1⊤Σα1 subject to α1⊤α1 = 1. This requires us to apply Lagrange's constrained-optimisation procedure:

1. construct the Lagrangian function Lag(α1, λ) = α1⊤Σα1 + λ(1 − α1⊤α1), where λ ∈ R1 is the Lagrange multiplier;

2. take the partial derivative with respect to α1 and equate it to zero:

2Σα1 − 2λα1 = 0 =⇒ (Σ − λIp)α1 = 0.
(6.1)

From (6.1), we see that α1 must be an eigenvector of Σ, and since we know from Example 0.2 what the maximal value of α⊤Σα/α⊤α is, we conclude that α1 should be the eigenvector that corresponds to the largest eigenvalue λ̄1 of Σ. The random variable α1⊤X is called the first principal component.

For the second principal component α2⊤X we want it to be normalised according to α2⊤α2 = 1, uncorrelated with the first component, and to give maximal variance of a linear combination of the components of X under these constraints. To find it, we construct the Lagrange function

Lag1(α2, λ1, λ2) = α2⊤Σα2 + λ1(1 − α2⊤α2) + λ2α1⊤Σα2.

Its partial derivative w.r.t. α2 gives

2Σα2 − 2λ1α2 + λ2Σα1 = 0.   (6.2)

Multiplying (6.2) by α1⊤ from the left and using the two constraints α2⊤α2 = 1 and α2⊤Σα1 = 0 gives

−2λ1α1⊤α2 + λ2α1⊤Σα1 = 0 =⇒ λ2 = 0

(WHY? Have in mind that α1 was an eigenvector of Σ.) But then (6.2) also implies that α2 ∈ Rp must be an eigenvector of Σ (it has to satisfy (Σ − λ1Ip)α2 = 0). Since it has to be different from α1, and having in mind that we aim at variance maximisation, we see that α2 has to be the normalised eigenvector that corresponds to the second largest eigenvalue λ̄2 of Σ.

The process can be continued further. The third principal component should be uncorrelated with the first two, should be normalised, and should give maximal variance of a linear combination of the components of X under these constraints. One can then easily realise that the vector α3 ∈ Rp in the formula α3⊤X should be the normalised eigenvector that corresponds to the third largest eigenvalue λ̄3 of the matrix Σ, etc.

Note that if we extract all possible p principal components, then ∑_{i=1}^{p} Var(αi⊤X) will just equal the sum of all eigenvalues of Σ and hence

∑_{i=1}^{p} Var(αi⊤X) = tr(Σ) = Σ11 + · · · + Σpp.
Therefore, if we take only a small number k of principal components instead of the total possible number p, we can interpret their inclusion as explaining

(Var(α1⊤X) + · · · + Var(αk⊤X)) / (Σ11 + · · · + Σpp) × 100% = (λ̄1 + · · · + λ̄k) / (Σ11 + · · · + Σpp) × 100%

of the total population variance Σ11 + · · · + Σpp.

6.3 Estimation of the Principal Components

In practice, Σ is unknown and has to be estimated. The principal components are then derived from the normalised eigenvectors of the estimated covariance matrix.

Note also that extracting principal components from the (estimated) covariance matrix has the drawback that it is influenced by the scale of measurement of each variable Xi, i = 1, . . . , p. A variable with large variance will necessarily be a large component in the first principal component (note the goal of explaining the bulk of variability by using the first principal component). Yet the large variance of the variable may be just an artefact of the measurement scale used for this variable. Therefore, an alternative practice is sometimes adopted: to extract principal components from the correlation matrix ρ instead of the covariance matrix Σ.

Example 6.1 (Eigenvalues obtained from Covariance and Correlation Matrices: see JW p. 437). It demonstrates the great effect standardisation may have on the principal components. The relative magnitudes of the weights after standardisation (i.e. from ρ) may come to be in direct opposition to the weights attached to the same variables in the principal component obtained from Σ.

For the reasons mentioned above, variables are often standardised before sample principal components are extracted. Standardisation is accomplished by calculating the vectors

Zi = ( (X1i − X̄1)/√s11, (X2i − X̄2)/√s22, . . . , (Xpi − X̄p)/√spp )⊤, i = 1, . . . , n.

The standardised observations matrix Z = [Z1, Z2, . . . , Zn] ∈ Mp,n gives the sample mean vector Z̄ = (1/n)Z1n = 0 and a sample covariance matrix SZ = (1/(n−1))ZZ⊤ = R (the correlation matrix of the original observations). The principal components are now extracted in the usual way from R.

6.4 Deciding how many principal components to include

To reduce the dimensionality (which is the motivating goal), we should restrict attention to the first k principal components. Ideally, k should be kept much less than p, but there is a trade-off to be made here, since we would also like the proportion

ψk = (λ̄1 + · · · + λ̄k) / (λ̄1 + · · · + λ̄p)

to be close to one. How could a reasonable trade-off be made? The following methods are most widely used:

• The "scree plot": basically, a graphical method of plotting the ordered λ̄k against k and deciding visually when the plot has flattened out. Typically, the initial part of the plot is like the side of a mountain, while the flat portion, where each λ̄k is just slightly smaller than λ̄k−1, is like the rough scree at the bottom. This motivates the name of the plot. The task here is to find where "the scree begins".

• Choose an arbitrary constant c ∈ (0, 1) and choose k to be the smallest one with the property ψk ≥ c. Usually c = 0.9 is used, but please note the arbitrariness of the choice here.

• Kaiser's rule: it suggests that of all p principal components only those should be retained whose variances (after standardisation) are greater than unity or, equivalently, only those components which, individually, explain at least (1/p) × 100% of the total variance. (This is the same as excluding all principal components with eigenvalues less than the overall average.) This criterion has a number of positive features that have contributed to its popularity but cannot be defended on safe theoretical grounds.

• Formal tests of significance.
Note that it actually does not make sense to test whether λ̄k+1 = · · · = λ̄p = 0, since if such a hypothesis were true then the population distribution would be contained entirely within a k-dimensional subspace, and the same would be true for any sample from this distribution; hence the estimated λ̄ values for indices k + 1, . . . , p would also equal zero with probability one! What seems reasonable to do instead is to test H0: λ̄k+1 = · · · = λ̄p (without requiring the common value to be zero). This is a more quantitative variant of the scree test. A test for this hypothesis is to form

a0 = the arithmetic mean of the last p − k estimated eigenvalues,
g0 = the geometric mean of the last p − k estimated eigenvalues,

and then construct −2 log λ = n(p − k) log(a0/g0). The asymptotic distribution of this statistic under the null hypothesis is χ²ν, where ν = (p − k + 2)(p − k − 1)/2. The interested student can find more details about this test in the monograph of Mardia, Kent and Bibby. We should note, however, that this result holds under the multivariate normality assumption and is only valid as stated for the covariance-based (not the correlation-based) version of principal component analysis. In practice, many data analysts are reluctant to make a multivariate normality assumption at the early, descriptive stage of the data analysis and hence distrust the above quantitative test, preferring the simple Kaiser criterion.

6.5 Software

Principal components analysis can be performed in SAS by using either the PRINCOMP or the FACTOR procedures, and in R using stats::prcomp, stats::princomp, or about a half-dozen other implementations.

6.6 Examples

Example 6.2. The Crime Rates example will be discussed at the lecture. The data give crime rates per 100,000 people in seven categories for each of the 50 states of the USA in 1997.
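For data of this shape, n = 50 observations on p = 7 variables, the extraction and the selection criteria of Section 6.4 can be sketched as follows (Python with numpy stands in for PROC PRINCOMP or stats::prcomp; the data are simulated, since the crime data themselves are not reproduced in these notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 7
# Two latent factors induce correlation among the 7 simulated variables.
f = rng.standard_normal((n, 2))
X = f @ rng.standard_normal((2, p)) + 0.5 * rng.standard_normal((n, p))

R = np.corrcoef(X, rowvar=False)      # extract from the correlation matrix
eigvals = np.linalg.eigh(R)[0][::-1]  # eigenvalues, sorted descending

psi = np.cumsum(eigvals) / eigvals.sum()   # proportions psi_k
k_90 = int(np.searchsorted(psi, 0.9)) + 1  # smallest k with psi_k >= 0.9
k_kaiser = int((eigvals > 1).sum())        # Kaiser's rule: eigenvalues > 1
print(np.round(eigvals, 2), k_90, k_kaiser)
```

A scree plot is then just a plot of the sorted eigenvalues against 1, . . . , p.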
Principal components are used to summarise the 7-dimensional data in 2 or 3 dimensions only and to help visualise and interpret the data.

6.7 PCA and Factor Analysis

Principal components can serve as a method for initial factor extraction in exploratory factor analysis. But one should mention here that principal component analysis is not factor analysis. The main difference is that in factor analysis (to be studied later in this course) one assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on the observed variables. Factor analysis is used when it is believed that certain latent factors exist and it is hoped to explore the nature and number of these factors. In contrast, in principal component analysis there is no prior assumption about an underlying causal model. The goal here is just variable reduction.

6.8 Application to finance: Portfolio optimisation

Many other problems in multivariate statistics lead to optimisation problems that are similar in spirit to the principal component analysis problem. Here we illustrate the efficient portfolio choice problem. Assume that a p-dimensional vector X of returns of p assets is given, with E X = µ and Var(X) = Σ. Then the return of a portfolio that holds these assets with weights (c1, c2, . . . , cp) (with ∑_{i=1}^{p} ci = 1) is Q = c⊤X, and the mean return is c⊤µ. The risk of the portfolio is c⊤Σc. Further, assume that a prespecified mean return µ̄ is to be achieved. The question is how to choose the weights c so that the risk of a portfolio that achieves the prespecified mean return is as small as possible. Mathematically, this amounts to solving an optimisation problem under two constraints. The Lagrangian function is

Lag(c, λ1, λ2) = c⊤Σc + λ1(µ̄ − c⊤µ) + λ2(1 − c⊤1p),   (6.3)

where 1p is a p-dimensional vector of ones.
Differentiating (6.3) with respect to c, we get the first-order condition for a minimum:

2Σc − λ1µ − λ21p = 0.   (6.4)

To simplify derivations, we shall consider the so-called case of non-existence of a riskless asset with a fixed (non-random) return. Then it makes sense to assume that Σ is positive definite, and hence Σ⁻¹ exists. From (6.4) we then get

c = (1/2) Σ⁻¹(λ1µ + λ21p).   (6.5)

After multiplying both sides of this equality by 1p⊤ from the left, we get

1 = (1/2) 1p⊤Σ⁻¹(λ1µ + λ21p).   (6.6)

We can get λ2 from (6.6) as

λ2 = (2 − λ1 1p⊤Σ⁻¹µ) / (1p⊤Σ⁻¹1p)

and then substitute it into the formula for c to end up with

c = (1/2) λ1 ( Σ⁻¹µ − (1p⊤Σ⁻¹µ / 1p⊤Σ⁻¹1p) Σ⁻¹1p ) + Σ⁻¹1p / (1p⊤Σ⁻¹1p).   (6.7)

In a similar way, if we multiply both sides of (6.5) by µ⊤ from the left and use the restriction µ⊤c = µ̄, we can get one more relationship between λ1 and λ2:

λ1 = (2µ̄ − λ2 µ⊤Σ⁻¹1p) / (µ⊤Σ⁻¹µ).

The linear system of 2 equations with respect to λ1 and λ2 can then be solved and the values substituted into (6.7) to get the final expression for c in terms of µ, µ̄ and Σ. (Do it!)

One special case is of particular interest: the so-called variance-efficient portfolio (as opposed to the mean–variance-efficient portfolio considered above). For the variance-efficient portfolio there is no prespecified mean return, that is, there is no restriction on the mean; it is only required to minimise the variance. Obviously, we then have λ1 = 0, and from (6.7) we get the optimal weights for the variance-efficient portfolio:

copt = Σ⁻¹1p / (1p⊤Σ⁻¹1p).

6.9 Additional resources

An alternative presentation of these concepts can be found in JW Ch. 8.

6.10 Exercises

Exercise 6.1 A random vector Y = (Y1, Y2, Y3)⊤ is normally distributed with zero mean vector and

Σ = ( 1   ρ/2 0
      ρ/2 1   ρ
      0   ρ   1 ),

where ρ is positive.

(a) Find the coefficients of the first principal component and the variance of that component. What percentage of the overall variability does it explain?
(b) Find the joint distribution of Y1, Y2 and Y1 + Y2 + Y3.
(c) Find the conditional distribution of Y1, Y2 given Y3 = y3.
(d) Find the multiple correlation of Y3 with Y1, Y2.

7 Canonical Correlation Analysis

7.1 Introduction . . . 55
7.2 Application in testing for independence of sets of variables . . . 55
7.3 Precise mathematical formulation and solution to the problem . . . 56
7.4 Estimating and testing canonical correlations . . . 57
7.5 Software . . . 57
7.6 Some important computational issues . . . 58
7.7 Examples . . . 58
7.8 Additional resources . . . 58
7.9 Exercises . . . 58

7.1 Introduction

Assume we are interested in the association between two sets of random variables. Typical examples include: the relation between a set of governmental policy variables and a set of economic goal variables; the relation between college "performance" variables (like grades in courses in five different subject-matter areas) and pre-college "achievement" variables (like high-school grade-point averages for junior and senior years, or the number of high-school extracurricular activities); etc. The way this problem of measuring association is solved in canonical correlation analysis is to consider the largest possible correlation between a linear combination of the variables in the first set and a linear combination of the variables in the second set.
The pair of linear combinations obtained through this maximisation process is called the first canonical variables, and their correlation is called the first canonical correlation. The process can be continued (similarly to the principal components procedure) to find a second pair of linear combinations having the largest correlation among all pairs that are uncorrelated with the initially selected pair. This gives us the second pair of canonical variables with their second canonical correlation, etc. The maximisation process that we perform at each step reflects our wish (again, as in principal components analysis) to concentrate the initially high-dimensional relationship between the 2 sets of variables into a few pairs of canonical variables only. Often even only one pair is considered. The rationale in canonical correlation analysis is that when the number of variables is large, interpreting the whole set of correlation coefficients between pairs of variables from each set is hopeless, and in that case one should concentrate on a few carefully chosen representative correlations. Finally, we should note that the traditional (simple) correlation coefficient and the multiple correlation coefficient (Lecture 5) are special cases of canonical correlation in which one or both sets contain a single variable.

7.2 Application in testing for independence of sets of variables

Besides being interesting in its own right (see Section 7.1), calculating canonical correlations turns out to be important for testing independence of sets of random variables. Let us remember that, in the multivariate normal case, testing for independence and testing for uncorrelatedness are equivalent problems. Assume now that X ∼ Np(µ, Σ). Furthermore, let X be partitioned into r and q components (r + q = p), with X(1) ∈ Rr, X(2) ∈ Rq, and let the covariance matrix Σ = E(X − µ)(X − µ)⊤ ∈ Mp,p be partitioned accordingly into

Σ = ( Σ11 Σ12
      Σ21 Σ22 ).

We shall assume for simplicity that the matrices Σ, Σ11, and Σ22 are nonsingular. To test H0: Σ12 = 0 against a general alternative, a sensible way to go would be the following: for fixed vectors a ∈ Rr, b ∈ Rq, let Z1 = a⊤X(1) and Z2 = b⊤X(2), giving

ρa,b = Cor(Z1, Z2) = a⊤Σ12b / √(a⊤Σ11a · b⊤Σ22b).

H0 is equivalent to H0: ρa,b = 0 for all a ∈ Rr, b ∈ Rq. For a particular pair a, b, H0 would be accepted if

|ra,b| = |a⊤S12b| / √(a⊤S11a · b⊤S22b) ≤ k

for a certain positive constant k. (Here the Sij are the corresponding data-based estimators of the Σij.) Hence an appropriate acceptance region for H0 would be given in the form {X ∈ Mp,n : maxa,b r²a,b ≤ k²}. But maximising r²a,b means finding the maximum of (a⊤S12b)² under the constraints a⊤S11a = 1 and b⊤S22b = 1, and this is exactly the data-based version of the optimisation problem to be solved in Section 7.1. For the goals in Sections 7.1 and 7.2 to be achieved, we need to solve problems of the following type.

7.3 Precise mathematical formulation and solution to the problem

Canonical variables are the variables Z1 = a⊤X(1) and Z2 = b⊤X(2), where a ∈ Rr, b ∈ Rq are obtained by maximising (a⊤Σ12b)² under the constraints a⊤Σ11a = b⊤Σ22b = 1. To solve this maximisation problem, we construct

Lag(a, b, λ1, λ2) = (a⊤Σ12b)² + λ1(a⊤Σ11a − 1) + λ2(b⊤Σ22b − 1).

Partial differentiation with respect to the vectors a and b gives

2(a⊤Σ12b)Σ12b + 2λ1Σ11a = 0 ∈ Rr,   (7.1)
2(a⊤Σ12b)Σ21a + 2λ2Σ22b = 0 ∈ Rq.   (7.2)

We multiply (7.1) by the vector a⊤ from the left and equation (7.2) by b⊤ from the left; after subtracting the two equations obtained, we get λ1 = λ2 = −(a⊤Σ12b)² = −µ². Hence

Σ12b = µΣ11a   (7.3)

and

Σ21a = µΣ22b   (7.4)

hold.
Now we first multiply (7.3) by Σ21Σ11^{-1} from the left, then multiply both sides of (7.4) by the scalar µ, and after finally adding the two equations we get:

(Σ21Σ11^{-1}Σ12 − µ²Σ22)b = 0.    (7.5)

The homogeneous equation system (7.5) having a non-trivial solution with respect to b means that

|Σ21Σ11^{-1}Σ12 − µ²Σ22| = 0    (7.6)

must hold. Then, of course,

|Σ22^{-1/2}| |Σ21Σ11^{-1}Σ12 − µ²Σ22| |Σ22^{-1/2}| = |Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} − µ²Iq| = 0

must hold. This means that µ² has to be an eigenvalue of the matrix Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2}. Also, b = Σ22^{-1/2}b̂, where b̂ is the eigenvector of Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} corresponding to this eigenvalue (WHY?!).

(Note, however, that this representation is useful mainly for theoretical purposes, the main advantage being that one is dealing with the eigenvalues of a symmetric matrix. If doing calculations by hand, it is usually easier to calculate b directly as a solution of the linear equation (7.5), i.e., to find the largest eigenvalue of the (non-symmetric) matrix Σ22^{-1}Σ21Σ11^{-1}Σ12 and then find the eigenvector b that corresponds to it. Besides, we also see from the definition of µ that µ² = (a⊤Σ12b)² holds.)

Since we wanted to maximise the right-hand side, it is obvious that µ² must be chosen to be the largest eigenvalue of the matrix Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} (or, which is the same thing, the largest eigenvalue of the matrix Σ21Σ11^{-1}Σ12Σ22^{-1}). Finally, we can obtain the vector a from (7.3): a = (1/µ)Σ11^{-1}Σ12b. That way, the first canonical variables Z1 = a⊤X(1) and Z2 = b⊤X(2) are determined, and the value of the first canonical correlation is just µ. The orientation of the vector b is chosen such that the sign of µ is positive.
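As a concrete sketch, the eigenvalue computation above can be carried out numerically in R. The 4×4 covariance matrix below is a toy example with hypothetical values: only variables 2 and 3 are cross-correlated (at 0.95), so the first canonical correlation should come out as 0.95.

```r
# Sketch of the computation above in R (toy covariance matrix, hypothetical values).
Sigma <- matrix(c(100, 0,    0,    0,
                  0,   1,    0.95, 0,
                  0,   0.95, 1,    0,
                  0,   0,    0,    100), 4, 4, byrow = TRUE)
r <- 2                                      # size of the first set
S11 <- Sigma[1:r, 1:r];  S12 <- Sigma[1:r, -(1:r)]
S21 <- t(S12);           S22 <- Sigma[-(1:r), -(1:r)]

# mu^2 = largest eigenvalue of S22^{-1} S21 S11^{-1} S12 (non-symmetric form)
M   <- solve(S22) %*% S21 %*% solve(S11) %*% S12
mu2 <- max(Re(eigen(M)$values))
mu  <- sqrt(mu2)                            # first canonical correlation: 0.95
b   <- Re(eigen(M)$vectors[, 1])            # coefficients of Z2 = b' X^(2)
a   <- solve(S11) %*% S12 %*% b / mu        # a = (1/mu) S11^{-1} S12 b, from (7.3)
```

Here a is proportional to (0, 1)⊤ and b to (1, 0)⊤, i.e., the first canonical pair is just (X2, X3), with correlation 0.95.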
Now it is easy to see that, if we want to extract a second pair of canonical variables, we need to repeat the same process starting with the second-largest eigenvalue µ² of the matrix Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} (or of the matrix Σ22^{-1}Σ21Σ11^{-1}Σ12). This will automatically ensure that the second pair of canonical variables is uncorrelated with the first pair. The process can theoretically be continued until the number of pairs of canonical variables equals the number of variables in the smaller group. But in practice many fewer canonical variables will be needed. Each canonical variable is uncorrelated with all the other canonical variables of either set except for the one corresponding canonical variable in the opposite set.

Note. It is important to point out that, already by definition, the canonical correlation is at least as large as the multiple correlation between any variable and the opposite set of variables. It is in fact possible for the first canonical correlation to be very large while all the multiple correlations of each separate variable with the opposite set of canonical variables are small. This once again underlines the importance of canonical correlation analysis.

7.4 Estimating and testing canonical correlations

The way to estimate the canonical variables and canonical correlation coefficients is based on the plug-in technique: one follows the steps outlined in Section 7.3, each time substituting Sij in place of Σij. Let us now discuss the independence-testing issue outlined in Section 7.2. The acceptance region of the independence test of H0 in Section 7.2 would be

{X ∈ Mp,n : largest eigenvalue of S22^{-1/2}S21S11^{-1}S12S22^{-1/2} ≤ kα},

where kα has been worked out and is given in the so-called Heck charts. This distribution depends on three parameters: s = min(r, q), m = (|r − q| − 1)/2, and N = (n − r − q − 2)/2, n being the sample size.
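The plug-in estimation can be sketched on simulated data (all values illustrative), cross-checking the eigenvalue route of Section 7.3 against stats::cancor and against the Cholesky route described in Section 7.6:

```r
# Sketch: plug-in estimation of the first canonical correlation on simulated
# data, cross-checked against stats::cancor() and the Cholesky form (Sec. 7.6).
set.seed(5)
n <- 100
X1 <- matrix(rnorm(n * 2), n, 2)                        # first set, r = 2
X2 <- cbind(X1[, 1] + rnorm(n), rnorm(n), rnorm(n))     # second set, q = 3
S <- cov(cbind(X1, X2))
S11 <- S[1:2, 1:2]; S12 <- S[1:2, 3:5]
S21 <- t(S12);      S22 <- S[3:5, 3:5]

mu <- sqrt(max(Re(eigen(solve(S22) %*% S21 %*% solve(S11) %*% S12)$values)))

U <- chol(solve(S22))     # R's chol(): t(U) %*% U = S22^{-1}, i.e. U'U = S22^{-1}
A <- U %*% S21 %*% solve(S11) %*% S12 %*% t(U)          # symmetric form
mu_chol <- sqrt(max(eigen(A, symmetric = TRUE)$values))

cc <- stats::cancor(X1, X2)
c(mu, mu_chol, cc$cor[1])    # all three agree
```

The symmetric Cholesky form is the numerically preferable route when S22 is ill-conditioned.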
Besides using the charts, one can also use good F-distribution-based approximations for (transformations of) this distribution, such as Wilks's lambda, Pillai's trace, the Lawley–Hotelling trace, and Roy's greatest root.

7.5 Software

Here we shall only mention that all these statistics and their P-values (using suitable F-distribution-based approximations) are readily available in the output of the SAS procedure CANCORR, so that performing the test is really easy: one can read the p-value directly from the SAS output. In R, see stats::cancor and the package CCA for computation and visualisation, and the package CCP for testing canonical correlations.

7.6 Some important computational issues

Note that calculating X^{-1/2} and X^{1/2} for a symmetric positive definite matrix X according to the theoretically attractive spectral decomposition method may be numerically unstable. This is especially the case when some of the eigenvalues are close to zero (or, more precisely, when the ratio of the greatest eigenvalue to the least eigenvalue, the condition number, is high). We can use the Cholesky decomposition described in Section 0.1.6 instead. Looking back at (7.5), we see that if U⊤U = Σ22^{-1} gives the Cholesky decomposition of the matrix Σ22^{-1}, then µ² is an eigenvalue of the matrix A = UΣ21Σ11^{-1}Σ12U⊤. Indeed, by multiplying (7.6) by U from the left and by U⊤ from the right we get:

|A − µ²UΣ22U⊤| = 0.

But UΣ22U⊤ = U(U⊤U)^{-1}U⊤ = UU^{-1}(U⊤)^{-1}U⊤ = I holds.

7.7 Examples

Example 7.1. Canonical Correlation Analysis of the Fitness Club Data. Three physiological and three exercise variables were measured on twenty middle-aged men in a fitness club. Canonical correlation is used to determine whether the physiological variables are related in any way to the exercise variables.

Example 7.2. JW Example 10.4, p.
552. Studying canonical correlations between leg and head bone measurements: X1, X2 are skull length and skull breadth, respectively; X3, X4 are leg bone measurements: femur and tibia length, respectively. Observations have been taken on n = 276 White Leghorn chickens. The example is chosen to also illustrate how a canonical correlation analysis can be performed when the original data are not given but the empirical correlation matrix (or empirical covariance matrix) is available.

7.8 Additional resources

An alternative presentation of these concepts can be found in JW Ch. 10.

7.9 Exercises

Exercise 7.1 Let the components of X correspond to scores on tests in arithmetic speed (X1), arithmetic power (X2), memory for words (X3), memory for meaningful symbols (X4), and memory for meaningless symbols (X5). The observed correlations in a sample of 140 are (upper triangle shown; the matrix is symmetric):

1.0000  0.4248  0.0420  0.0215  0.0573
        1.0000  0.1487  0.2489  0.2843
                1.0000  0.6693  0.4662
                        1.0000  0.6915
                                1.0000

Find the canonical correlations and canonical variates between the first two variates and the last three variates. Comment. Write SAS-IML or R code to implement the required calculations.

Exercise 7.2 Students sit 5 different papers, two of which are closed book and the rest open book. For the 88 students who sat these exams, the sample covariance matrix is (upper triangle shown; the matrix is symmetric):

S = 302.3  125.8  100.4  105.1  116.1
           170.9   84.2   93.6   97.9
                  111.6  110.8  120.5
                         217.9  153.8
                                294.4

Find the canonical correlations and canonical variates between the first two variates (closed-book exams) and the last three variates (open-book exams). Comment.

Exercise 7.3 Let the random vector X ∼ N4(µ,Σ) with µ = (0, 0, 0, 0)⊤ and

Σ = ( 1   2ρ  ρ   ρ
      2ρ  1   ρ   ρ
      ρ   ρ   1   2ρ
      ρ   ρ   2ρ  1 ),

where ρ is a small enough positive constant.

(a) Find the two canonical correlations between (X1, X2)⊤ and (X3, X4)⊤. Comment.

(b) Find the first pair of canonical variables.
Exercise 7.4 Consider the following covariance matrix Σ of a four-dimensional normal vector:

Σ = ( Σ11 Σ12 ; Σ21 Σ22 ) = ( 100  0     0     0
                              0    1     0.95  0
                              0    0.95  1     0
                              0    0     0     100 ).

Verify that the first pair of canonical variates is just the second and the third components of the vector, and that the canonical correlation equals 0.95.

8 Multivariate Linear Models and Multivariate ANOVA

8.1 Univariate linear models and ANOVA . . . 60
8.2 Multivariate Linear Model and MANOVA . . . 61
8.3 Computations used in the MANOVA tests . . . 61
8.3.1 Roots distributions . . . 62
8.3.2 Comparisons . . . 64
8.4 Software . . . 64
8.5 Examples . . . 64
8.6 Additional resources . . . 64

8.1 Univariate linear models and ANOVA

Recall the univariate linear model: for observations i = 1, 2, . . . , n, let the response variable Yi = xiβ + ϵi, for predictor row vector xi (with xi⊤ ∈ R^k) assumed fixed and known, coefficient vector β ∈ R^k fixed and unknown, and ϵi i.i.d. ∼ N(0, σ²). In matrix form, Y = (Y1 Y2 · · · Yn)⊤ and X = (x1⊤ x2⊤ · · · xn⊤)⊤ ∈ Mn,k. We will assume that X contains an intercept. Then,

Y = Xβ + ϵ, where ϵ ∼ Nn(0, Inσ²).

The MLE for β requires us to minimise

Σ_{i=1}^n (Yi − xiβ)² = ∥Y − Xβ∥² = (Y − Xβ)⊤(Y − Xβ),

and, after some vector calculus, we get β̂ = (X⊤X)^{-1}X⊤Y with

Var(β̂) = (X⊤X)^{-1}X⊤ Var(Y) X(X⊤X)^{-1} = (X⊤X)^{-1}σ².
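As a minimal numerical check (simulated data; all names and values illustrative), the normal-equations solution β̂ = (X⊤X)^{-1}X⊤Y can be compared against R's lm():

```r
# Minimal check (simulated, illustrative data) that the normal equations
# reproduce lm()'s coefficient estimates.
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
X <- cbind(1, x1, x2)                        # design matrix with intercept
y <- as.vector(2 - x1 + 0.5 * x2 + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # normal equations
fit <- lm(y ~ x1 + x2)
all.equal(unname(beta_hat[, 1]), unname(coef(fit)))   # TRUE
```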
Furthermore, we can consider the projection matrices A = In − X(X⊤X)^{-1}X⊤ and B = X(X⊤X)^{-1}X⊤ − 1n(1n⊤1n)^{-1}1n⊤, with

AY = Y − X{(X⊤X)^{-1}X⊤Y} = Y − Ŷ, the residual vector, and
BY = X(X⊤X)^{-1}X⊤Y − 1n(1n⊤1n)^{-1}1n⊤Y = Ŷ − 1nȲ, the vector of fitted values over and above the mean,

and observe that Cov(AY, BY) = A Var(Y) B⊤ = σ²AB⊤, where

AB⊤ = X(X⊤X)^{-1}X⊤ − X(X⊤X)^{-1}X⊤X(X⊤X)^{-1}X⊤ − 1n(1n⊤1n)^{-1}1n⊤ + X(X⊤X)^{-1}X⊤1n(1n⊤1n)^{-1}1n⊤
    = (1/n)(X(X⊤X)^{-1}X⊤1n − 1n)1n⊤ = 0

if X contains an intercept effect. Then, SSE = Y⊤AY ∼ σ²χ²_{n−k} and SSA = Y⊤BY ∼ σ²χ²_{k−1}, independent, letting us set up F = [SSA/(k − 1)]/[SSE/(n − k)] ∼ F_{k−1,n−k}, etc.

8.2 Multivariate Linear Model and MANOVA

How do we generalise this to a multivariate response? That is, suppose that we observe the following response matrix:

Y = (Y1 Y2 · · · Yn)⊤ = (Yij) ∈ Mn,p,

with xi and X as before, and Yi⊤ = xiβ + ϵi⊤, where β ∈ Mk,p and ϵi ∼ Np(0, Σ), Σ ∈ Mp,p symmetric positive definite. In matrix form, Y = Xβ + E, where E = (ϵ1 ϵ2 · · · ϵn)⊤ ∈ Mn,p. Then, we can write vec(E) ∼ Nnp(0, Σ ⊗ In) or vec(E⊤) ∼ Nnp(0, In ⊗ Σ), and vec(Y) ∼ Nnp({β⊤ ⊗ In} vec(X), Σ ⊗ In) or vec(Y⊤) ∼ Nnp({In ⊗ β⊤} vec(X⊤), In ⊗ Σ).

MLE is equivalent to the OLS problem minimising Σ_{i=1}^n tr{(Yi⊤ − xiβ)⊤(Yi⊤ − xiβ)} = tr{(Y − Xβ)⊤(Y − Xβ)}, leading to β̂ = (X⊤X)^{-1}X⊤Y again, with

Var(vec(β̂⊤)) = Var(vec(Y⊤X(X⊤X)^{-1})) = Var{((X⊤X)^{-1}X⊤ ⊗ Ip) vec(Y⊤)}
             = ((X⊤X)^{-1}X⊤ ⊗ Ip)(In ⊗ Σ)((X⊤X)^{-1}X⊤ ⊗ Ip)⊤
             = ((X⊤X)^{-1}X⊤ ⊗ Σ)(X(X⊤X)^{-1} ⊗ Ip) = (X⊤X)^{-1} ⊗ Σ,

or Var(vec(β̂)) = Σ ⊗ (X⊤X)^{-1}. The projection matrices A and B still work (check it!), and we can write SSE = Y⊤AY ∼ Wp(Σ, n − k) and SSA = Y⊤BY ∼ Wp(Σ, k − 1). Notice that they are now matrices.

8.3 Computations used in the MANOVA tests

In standard (univariate) analysis of variance, with the usual normality assumptions on the errors, testing about the effects of the factors involved in the model description is based on the F test.
The F tests are derived from the ANOVA decomposition SST = SSA + SSE. The argument goes as follows: i) SSE and SSA are independent, (up to constant factors involving the variance σ2 of the errors) χ2 distributed; 61 UNSW MATH5855 2021T3 Lecture 8 MLM and MANOVA ii) By proper norming to account for degrees of freedom, from SSE and SSA one gets statistics that have the following behaviour: the normed SSE always delivers an unbiased estimator of σ2 no matter if the null hypothesis or alternative is true; the normed SSA delivers an unbiased estimator of σ2 under the null hypothesis but delivers an unbiased estimator of a “larger” quantity under the alternative. The above observation is crucial and motivates the F -testing: F statistics are (suitably normed to account for degrees of freedom) ratios of SSA/SSE. When taking the ratio, the factors involving σ2 cancel out and σ2 does not play any role in the distribution of the ratio. Under H0 their distribution is F . When the null hypothesis is violated, then the same statistics will tend to have “larger” values as compared to the case when H0 is true. Hence significant (w.r.t. the corresponding F -distribution) values of the statistic lead to rejection of H0. Aiming at generalising these ideas to the Multivariate ANOVA (MANOVA) case, we should note that instead of χ2 distributions we now have to deal with Wishart distributions and we need to properly define (a proper functional of) the SSA/SSE ratio which would be a “ratio” of matrices now. Obviously, there are more ways to define suitable statistics in this context! It turns out that such functionals are related to the eigenvalues of the (properly normed) Wishart- distributed matrices that enter the decomposition SST = SSA + SSE in the multivariate case. 8.3.1 Roots distributions Let Yi, i = 1, 2, . . . , n ind.∼ Np(µi,Σ). Then the following data matrix: Y = Y ⊤1 Y ⊤2 ... Y ⊤n = Y11 Y12 · · · Y1p Y21 Y22 · · · Y2p ... ... . . . ... 
Yn1 Yn2 · · · Ynp ∈ Mn,p is an n × p matrix containing n p-dimensional (transposed) vectors. Denote E(Y) = M and Var(vec(Y)) = Σ ⊗ In. Let A and B be projectors such that Q1 = Y⊤AY and Q2 = Y⊤BY are two independent Wp(Σ, v)- and Wp(Σ, q)-distributed matrices, respectively. Although the theory is general, to keep you on track you could always think about a multivariate linear model example:

Y = Xβ + E,  Ŷ = Xβ̂,  A = In − X(X⊤X)⁻X⊤,  B = X(X⊤X)⁻X⊤ − 1n(1n⊤1n)^{-1}1n⊤,

and the corresponding decomposition

Y⊤[In − 1n(1n⊤1n)^{-1}1n⊤]Y = Y⊤BY + Y⊤AY = Q2 + Q1

of SST = SSA + SSE = Q2 + Q1, where Q2 is the "hypothesis matrix" and Q1 is the "error matrix".

Lemma 8.1. Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then the roots of the determinant equation |Q2 − θ(Q1 + Q2)| = 0 are related to the roots of the equation |Q2 − λQ1| = 0 by λi = θi/(1 − θi) (or θi = λi/(1 + λi)).

Lemma 8.2. Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then the roots of the determinant equation |Q1 − v(Q1 + Q2)| = 0 are related to the roots of the equation |Q2 − λQ1| = 0 by λi = (1 − vi)/vi (or vi = 1/(1 + λi)).

We can employ the above two lemmas to see that if λi, vi, θi are the roots of |Q2 − λQ1| = 0, |Q1 − v(Q1 + Q2)| = 0, and |Q2 − θ(Q1 + Q2)| = 0, respectively, then:

Λ = |Q1(Q1 + Q2)^{-1}| = ∏_{i=1}^p (1 + λi)^{-1} (Wilks's criterion statistic), or

|Q2Q1^{-1}| = ∏_{i=1}^p λi = ∏_{i=1}^p (1 − vi)/vi = ∏_{i=1}^p θi/(1 − θi), or

|Q2(Q1 + Q2)^{-1}| = ∏_{i=1}^p θi = ∏_{i=1}^p λi/(1 + λi) = ∏_{i=1}^p (1 − vi),

and other functional transformations of these products of (random) roots would have a distribution that depends only on p (the dimension of Yi), v (the Wishart degrees of freedom for Q1), and q (the same for Q2).
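Lemma 8.1 can be sanity-checked numerically (random positive definite matrices, purely illustrative):

```r
# Numerical sanity check of Lemma 8.1: the roots theta of |Q2 - theta(Q1+Q2)| = 0
# and the roots lambda of |Q2 - lambda Q1| = 0 satisfy lambda_i = theta_i/(1-theta_i).
set.seed(3)
p <- 3
Q1 <- crossprod(matrix(rnorm(p * p), p)) + diag(p)   # random p.d. matrices
Q2 <- crossprod(matrix(rnorm(p * p), p)) + diag(p)

lambda <- sort(Re(eigen(solve(Q1) %*% Q2)$values))
theta  <- sort(Re(eigen(solve(Q1 + Q2) %*% Q2)$values))
all.equal(lambda, theta / (1 - theta))    # TRUE
```

The monotone map θ ↦ θ/(1 − θ) preserves the ordering of the roots, which is why sorting both sets suffices.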
There are various ways to choose such functional transformations (statistics), and many have been suggested, such as:

• Λ (Wilks's lambda);
• tr(Q2Q1^{-1}) = tr(Q1^{-1}Q2) = Σ_{i=1}^p λi (Lawley–Hotelling trace);
• max_i λi (Roy's criterion);
• V = tr[Q2(Q1 + Q2)^{-1}] = Σ_{i=1}^p λi/(1 + λi) (Pillai statistic / Pillai's trace).

Tables and charts for their exact or approximate distributions are available. Also, P-values for these statistics are readily calculated in statistical packages. In these applications, Q1 plays the role of the "error matrix" (also sometimes denoted by E) and Q2 that of the "hypothesis matrix" (also sometimes denoted by H). The distributions of the statistics defined above depend on the following three parameters:

• p = the number of responses;
• q = νh = degrees of freedom for the hypothesis;
• v = νe = degrees of freedom for the error.

Based on these, the following quantities are calculated: s = min(p, q), m = (|p − q| − 1)/2, n = (v − p − 1)/2, r = v − (p − q + 1)/2, u = (pq − 2)/4. Moreover, we define t = √((p²q² − 4)/(p² + q² − 5)) if p² + q² − 5 > 0, and t = 1 otherwise. Let us order the eigenvalues of E^{-1}H = Q1^{-1}Q2 as λ1 ≥ λ2 ≥ · · · ≥ λp. Then the following distributional results are exact if s = 1 or 2, and otherwise approximate:

• Wilks's test. The test statistic, Wilks's lambda, is Λ = |E|/|E + H| = ∏_{i=1}^p 1/(1 + λi). Then it holds that F = [(1 − Λ^{1/t})/Λ^{1/t}] · (rt − 2u)/(pq) ∼ F_{pq, rt−2u} df (Rao's F).

• Lawley–Hotelling trace test. The Lawley–Hotelling statistic is U = tr(E^{-1}H) = λ1 + · · · + λp, and F = 2(sn + 1)U/[s²(2m + s + 1)] ∼ F_{s(2m+s+1), 2(sn+1)} df.

• Pillai's test. The test statistic, Pillai's trace, is V = tr[H(H + E)^{-1}] = λ1/(1 + λ1) + · · · + λp/(1 + λp), and F = [(2n + s + 1)/(2m + s + 1)] · V/(s − V) ∼ F_{s(2m+s+1), s(2n+s+1)} df.

• Roy's maximum root criterion. The test statistic is just the largest eigenvalue, λ1.
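As a sketch of how these statistics arise computationally, the eigenvalue route above can be checked against R's built-in MANOVA machinery (the iris data are used purely as an illustration):

```r
# Sketch: the four MANOVA statistics computed from the eigenvalues of E^{-1}H,
# cross-checked against R's summary.manova() on the built-in iris data.
fit <- manova(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)
E <- crossprod(residuals(fit))                      # error matrix (Q1)
H <- crossprod(scale(fitted(fit), scale = FALSE))   # hypothesis matrix (Q2)
lam <- Re(eigen(solve(E) %*% H)$values)

wilks  <- prod(1 / (1 + lam))       # |E| / |E + H|
lawley <- sum(lam)                  # tr(E^{-1} H)
pillai <- sum(lam / (1 + lam))      # tr(H (H + E)^{-1})
roy    <- max(lam)                  # largest eigenvalue

c(wilks,  summary(fit, test = "Wilks")$stats[1, 2])    # agree
c(pillai, summary(fit, test = "Pillai")$stats[1, 2])   # agree
```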
Finally, we shall mention one historically older and very universal approximation to the distribution of the Λ statistic due to Bartlett (1927): It holds: level of −[νe − p−νh+12 ] log Λ = c(p, νh,M)× level of χ2pνh , where the constant c(p, νh,M = νe−p+1) is given in tables. Such tables are prepared for levels α = 0.10, 0.05, 0.025 etc.. In the context of testing the hypothesis about significance of the first canonical correlation, we have: E = S22 − S21S−111 S12, H = S21S−111 S12. The Wilks’s statistic becomes |S||S11||S22| . (Recall (4.3)!) We also see that in this case, if µ 2 i were the squared canonical correlations then µ21 was defined as the maximal eigenvalue to S −1 22 H, that is, it is a solution to |(E +H)−1H − µ21I| = 0 However, setting λ1 = µ 2 1 1−µ21 we see that: |(E+H)−1H−µ21I| = 0 =⇒ |H−µ21(E+H)| = 0 =⇒ |H− µ21 1− µ21 E| = 0 =⇒ |E−1H−λ1I| = 0 holds and λ1 is an eigenvalue of E −1H. Similarly you can argue for the remaining λi = µ2i 1−µ2i values. What are the degrees of freedom of E and H? 8.3.2 Comparisons From all statistics discussed, Wilks’s lambda has been most widely applied. One important reason for this is that this statistic has the virtue of being convenient to use and, more importantly, being related to the Likelihood Ratio Test! Despite the above, the fact that so many different statistics exist for the same hypothesis testing problem, indicates that there is no universally best test. Power comparisons of the above tests are almost lacking since the distribution of the statistic under alternatives is hardly known. 8.4 Software In SAS, both PROC GLM and PROC REG can conduct analysis and perform hypothesis tests. In R, use stats::lm. 8.5 Examples Example 8.3. Multivariate linear modelling of the Fitness dataset. 8.6 Additional resources An alternative presentation of these concepts can be found in JW Ch. 7. 64 UNSW MATH5855 2021T3 Lecture 9 Tests of a Covariance Matrix 9 Tests of a Covariance Matrix 9.1 Test of Σ = Σ0 . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . 65 9.2 Sphericity test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 9.3 General situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 9.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Previously, we developed a number of techniques for decomposing and analysing covariance matrices and their properties. Here, we develop a general family of tests for their structure, which will let you specify almost arbitrary tests for the covariance structure of a multivariate normal population. 9.1 Test of Σ = Σ0 We start with this simpler case since ideas are more transparent. The practically more relevant cases are about comparing covariance matrices of two or more multivariate normal populations but the derivations of the latter tests is more subtle. For these we will only formulate the final results. Assume now that we have the sample X1,X2, . . . ,Xn from a Np(µ,Σ) distribution and we would like to test H0 : Σ = Σ0 against the alternative H1 : Σ ̸= Σ0. Obviously the problem can be easily transformed into testing H¯0 : Σ = Ip since otherwise we can consider the modified observations Yi = Σ − 12 0 Xi which under H0 will be multivariate normal with a covariance matrix being equal to Ip. Therefore we can assume that X1,X2, . . . ,Xn is a sample from a Np(µ,Σ) and we want to test H0 : Σ = Ip versus H1 : Σ ̸= Ip. We will derive the likelihood ratio test for this problem. The likelihood function is L(x;µ,Σ) = (2π)− np 2 |Σ|−n2 e− 12 ∑n i=1(xi−µ)⊤Σ−1(xi−µ) = (2π)− np 2 |Σ|−n2 e− 12 tr[Σ−1 ∑n i=1(xi−µ)(xi−µ)⊤] . Under the hypothesis H0, the maximum of the likelihood function is obtained when µ¯ = x¯. 
Under the alternative we have to maximise with respect to both µ and Σ, and we know from Section 3.1.2 that the maximum of the likelihood function is obtained for µ̂ = x̄ and Σ̂ = (1/n) Σ_{i=1}^n (xi − x̄)(xi − x̄)⊤. Then we easily obtain the likelihood ratio

Λ = [max_µ L(x; µ, Ip)] / [max_{µ,Σ} L(x; µ, Σ)] = e^{−(1/2)tr V} / ( |V|^{−n/2} n^{np/2} e^{−np/2} ),

where V = Σ_{i=1}^n (xi − x̄)(xi − x̄)⊤. Therefore

−2 log Λ = np log n − n log|V| + tr V − np,    (9.1)

and according to the asymptotic theory, the quantity in (9.1) is asymptotically distributed as χ²_{p(p+1)/2} (the degrees of freedom being the difference between the numbers of free parameters under the alternative and under the hypothesis). This test would reject H0 if the value of the −2 log Λ statistic is significantly large.

9.2 Sphericity test

Further, it is more realistic to assume that the structure of the covariance matrix is known only up to some constant. Having in mind the discussion at the beginning of Section 9.1, we can assume without loss of generality that H0 : Σ = σ²Ip, to be tested against a general alternative. This test bears the name "sphericity test". The likelihood ratio test can be developed in a manner similar to the previous case (do it (!)), and the final result is that

−2 log Λ = np log(nσ̂²) − n log|V|,

where σ̂² = (1/(np)) Σ_{i=1}^n (xi − x̄)⊤(xi − x̄). The asymptotic distribution of np log(nσ̂²) − n log|V| under the null hypothesis will again be χ², but the degrees of freedom are this time p(p + 1)/2 − 1 = (p − 1)(p + 2)/2 (WHY (?!)). Again, the hypothesis will be rejected for large values.

9.3 General situation

Testing equality of the covariance matrices of k different multivariate normal populations Np(µi, Σi), i = 1, 2, . . . , k, is a very important problem, especially in discriminant analysis and multivariate analysis of variance. Let k be the number of populations, p the dimension of the vector, and n the total sample size, n = n1 + n2 + · · · + nk, ni being the sample size for each population.
The analysis of deviance test statistic that results is

−2 log [ ∏_{i=1}^k |Σ̂i|^{ni/2} / |Σ̂pooled|^{n/2} ],

with Σ̂i the MLE sample variance (with denominator ni as opposed to ni − 1) of population i, and Σ̂pooled = (1/n) Σ_{i=1}^k ni Σ̂i; it is asymptotically distributed χ²_{(k−1)p(p+1)/2}. It has been noticed that this test has the defect that it is (asymptotically) biased: that is, the probability of rejecting H0 when H0 is false can be smaller than the probability of rejecting H0 when H0 is true (i.e., it may happen that at some points of the parameter space the probability of a correct decision is smaller than the probability of a wrong decision). Hence it is desirable to modify it to make it asymptotically unbiased. Further, let N = n − k and Ni = ni − 1. Under the null hypothesis of equality of all k covariance matrices, it holds that

−2ρ log [ ∏_{i=1}^k |Si|^{Ni/2} / |Spooled|^{N/2} ],    (9.2)

for

ρ = 1 − [ (Σ_{i=1}^k 1/Ni) − 1/N ] · (2p² + 3p − 1) / (6(p + 1)(k − 1)),

Si the sample variance (with ni − 1 denominator) of population i, and Spooled = (1/N) Σ_{i=1}^k Ni Si, is asymptotically distributed as χ²_{(k−1)p(p+1)/2}. Large values of the statistic are significant and lead to the rejection of the hypothesis about equality of the k covariance matrices. In the following, we will avoid the subtle details and refer to Chapter 8 of the monograph Muirhead, R. (1982) Aspects of Multivariate Statistical Theory. Wiley, New York.

The modified LR is achieved by replacing ni and n by Ni and N (that is, by the correct degrees of freedom). We note that ρ is close to 1 anyway if all the sample sizes ni are very large. Finally, the scaling of the test statistic by ρ in (9.2) serves to improve the quality of the asymptotic approximation of the statistic by the limiting χ²_{(k−1)p(p+1)/2} distribution.
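A direct numerical sketch of (9.2), often referred to as Box's M test, is given below (the iris data are purely illustrative; heplots::boxM packages the same computation):

```r
# Direct sketch of the Bartlett-corrected statistic (9.2) on the iris data.
Y <- as.matrix(iris[, 1:4]); g <- iris$Species
k <- nlevels(g); p <- ncol(Y)
ni <- tabulate(g); Ni <- ni - 1; N <- sum(Ni)              # N = n - k

Si <- lapply(split(as.data.frame(Y), g), cov)              # per-group S_i
Sp <- Reduce(`+`, Map(`*`, Si, Ni)) / N                    # pooled covariance
M  <- N * log(det(Sp)) - sum(Ni * sapply(Si, function(s) log(det(s))))
rho <- 1 - (sum(1 / Ni) - 1 / N) * (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))
stat <- rho * M                        # the statistic (9.2)
df <- (k - 1) * p * (p + 1) / 2
pchisq(stat, df, lower.tail = FALSE)   # tiny here: the group covariances differ
```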
Such (asymptotically negligible) scalar transformations of the LR statistic that yield improved test statistic with a chi-squared null distribution of order O(1/n) instead of the ordinary O(1) for the standard LR, are known in the literature under the common name Bartlett corrections. Thus (9.2) is a Bartlett corrected version of the modified LR statistic. 9.4 Software SAS: PROC CALIS, PROC DISCRIM (option) R: heplots::boxM, MVTests::BoxM The statistic (9.2) is the one that is implemented in software packages. 9.5 Exercises Exercise 9.1 Follow the discussion about the sphericity test. Argue that if λˆi, i = 1, 2, . . . , p denote the eigenvalues of the empirical covariance matrix S then −2 log Λ = np log arithm. mean λˆi geom. mean λˆi . Of course, the above statistic is asymptotically χ2(p+2)(p−1)/2 distributed under H0 since it only represents the sphericity test in a different form. Exercise 9.2 Show that the likelihood ratio test of H0 : Σ is a diagonal matrix rejects H0 when −n log |R| is larger than χ21−α,p(p−1)/2. (Here R is the empirical correlation matrix, p is the dimension of the multivariate normal and n is the sample size.) 67 UNSW MATH5855 2021T3 Lecture 10 Factor Analysis 10 Factor Analysis 10.1 ML Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 10.2 Hypothesis testing under multivariate normality assumption . . . . . . . . . . . . 70 10.3 Varimax method of rotating the factors . . . . . . . . . . . . . . . . . . . . . . . 71 10.4 Relationship to Principal Component Analysis . . . . . . . . . . . . . . . . . . . 71 10.4.1 The principal component solution of the factor model . . . . . . . . . . . 71 10.4.2 The Principal Factor Solution . . . . . . . . . . . . . . . . . . . . . . . . . 71 10.5 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 10.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 10.7 Additional resources . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Let Yi, i = 1, 2, .., n be independent Np(µ,Σ) variables (think of the Yis as a results of a battery of p tests applied to the ith individual). Fundamental assumption in factor analysis: Yi = Λfi + ei (10.1) Λ ∈Mp,k factor loading matrix (full rank); fi ∈ Rk (k < p) factor variable. The components of fi are thought to be the (latent) factors. Usually fi are taken to be independent N(α, Ik) (i.e., “orthogonal”) but also “oblique” factors are considered sometimes with a covariance matrix ̸= Ik. ei independent Np(θ,Σe) with Σe diagonal, i.e., Σe = diag(σ 2 1 , σ 2 2 , . . . , σ 2 p). Also, the es are independent of the fs. Then, µ = Λα+ θ; Σ = ΛΛ⊤ +Σe, or, componentwise: Var(Yir) = k∑ j=1 λ2rj + σ 2 r = communality + uniqueness. Cov(Yir, Yis) = k∑ j=1 λrjλsj . The fundamental idea of factor analysis is to describe the covariance relationships among many variables (p “large”) in terms of few (k “small”) underlying, not observable (latent) random quantities (the factors). The model is motivated by the following argument: suppose variables can be grouped by their correlations. That is, all variables in a particular group are highly correlated among themselves but have relatively small correlations with variables in a different group. It is then quite reasonable to assume that each group of variables represents a single underlying construct (factor) that is “responsible” for the observed correlations. Important notes • The model (10.1) is similar to a linear regression model but the key differences are that fi are random and are not observable. 68 UNSW MATH5855 2021T3 Lecture 10 Factor Analysis • If we knew the Λ (or have found estimates of them), then using properties of orthogonal projections on the linear space spanned by the columns of Λ, we would get: αˆ = (Λ⊤Λ)−1Λ⊤Y¯ ; θˆ = Y¯ − Λαˆ. Because of the above observation, we can consider only µ, Λ, and σ2i , i = 1, 2, . . . 
, p as unknown parameters when parameterising the factor analysis model. Note also that primary interest in factor analysis is focused on estimating Λ. • There is a fundamental indeterminacy in this model even when we require that Var(f) = Ik since, if P ∈Mk,k is any orthogonal matrix then obviously ΛΛ⊤ = ΛP (ΛP )⊤; Λfi = (ΛP )(P⊤fi). Hence replacing Λ by ΛP and fi by P ⊤fi leads to the same equations. 10.1 ML Estimation The likelihood function for the n observations Y1,Y2, . . . ,Yn ∈ Rp is L(Y ;µ,Λ, σ21 , σ 2 2 , .., σ 2 p) = (2π) −np/2|Σ|−n/2 exp[−1 2 n∑ i=1 (Yi − µ)⊤Σ−1(Yi − µ)] = (2π)−np/2|Σ|−n/2 exp[−n 2 (tr(Σ−1S) + (Y¯ − µ)⊤Σ−1(Y¯ − µ))] with S = 1n ∑n i=1(Yi − Y¯ )(Yi − Y¯ )⊤. Taking logL, we get: logL(Y ;µ,Λ, σ21 , σ 2 2 , .., σ 2 p) = − np 2 log(2π)− n 2 log(|Σ|)− n 2 [tr(Σ−1S)+ (Y¯ −µ)⊤Σ−1(Y¯ −µ))]. After differentiating w.r.t. µ, ∂ logL ∂µ = nΣ−1(Y¯ − µ) = 0 =⇒ µˆ = Y¯ . It remains to estimate Λ and Σe by minimising: Q = 1 2 log|ΛΛ⊤ +Σe|+ 1 2 tr(ΛΛ⊤ +Σe)−1S. To implement the minimisation of Q we use the following rules for matrix differentiation: ∂ ∂Λ log|ΛΛ⊤ +Σe| = 2(ΛΛ⊤ +Σe)−1Λ (10.2) ∂ ∂A tr(A−1B) = −(A−1BA−1)⊤. (10.3) Applying (10.3) and the chain rule we get: ∂ ∂Λ tr[(ΛΛ⊤ +Σe)−1S] = −2(ΛΛ⊤ +Σe)−1S(ΛΛ⊤ +Σe)−1Λ. 69 UNSW MATH5855 2021T3 Lecture 10 Factor Analysis Hence after substitution: ∂ ∂Λ Q = (ΛΛ⊤ +Σe)−1Λ− (ΛΛ⊤ +Σe)−1S(ΛΛ⊤ +Σe)−1Λ = (ΛΛ⊤ +Σe)−1[ΛΛ⊤ +Σe − S](ΛΛ⊤ +Σe)−1Λ = 0. (10.4) Woodbury Matrix Identity gives (ΛΛ⊤ +Σe)−1 = Σ−1e − Σ−1e Λ(I + Λ⊤Σ−1e Λ)−1Λ⊤Σ−1e . (10.5) Hence form (10.4) and (10.5) we get [ΛΛ⊤ +Σe − S]Σ−1e Λ{I − (I + Λ⊤Σ−1e Λ)−1Λ⊤Σ−1e Λ} = 0. (10.6) Since the rank of the matrix in the curly brackets in (10.6) is full we get [ΛΛ⊤ +Σe − S]Σ−1e Λ = 0, or, equivalently, SΣ−1e Λ = Λ(I + Λ ⊤Σ−1e Λ). The latter can also be written as (Σ−1/2e SΣ −1/2 e )Σ −1/2 e Λ = Σ −1/2 e Λ(I + Λ ⊤Σ−1e Λ). (10.7) To find a particular solution, we require Λ⊤Σ−1e Λ to be diagonal. 
Then (10.7) implies that the matrix Σ −1/2 e Λ has as its columns k eigenvectors that correspond to the k eigenvalues of Σ −1/2 e SΣ −1/2 e . More subtle analysis shows that to obtain the minimum value of Q these have to be the eigenvectors that correspond to the largest eigenvalues of Σ −1/2 e SΣ −1/2 e . Based on this fact, the following iterative solution (due to Lawley) has been proposed that can be described algorithmically as follows: 1. With an initial guess Σ˜e, calculate Σ˜ −1/2 e Λ˜ by using the eigenvectors of the k largest eigenvalues of Σ˜ −1/2 e SΣ˜ −1/2 e . 2. Then from Σ˜ −1/2 e Λ˜, get a (first iteration) value for Λ˜. 3. With this value of Λ˜ we can calculate the value of Q˜(Σ˜e) = 1 2 log|Λ˜Λ˜⊤ + Σ˜e|+ 12 tr(Λ˜Λ˜⊤ + Σ˜e) −1S (which is the value of the functional). This functional only depends on the p nonzero values of Σ˜e and there are several powerful numerical procedures to find its mini- mum. 4. If it is achieved at Σ∗e, then update Σ˜e with the new guess Σ ∗ e and repeat from Step 1 to convergence. 10.2 Hypothesis testing under multivariate normality assumption The most interesting hypothesis is H0 : k factors against H1 : ̸= k factors. logL1 = −np 2 log(2π)− n 2 log|S| − np 2 logL0 = −np 2 log(2π)− n 2 log|Σˆ| − n 2 tr(Σˆ−1S) (where Σˆ = ΛˆΛˆ⊤ + Σˆe). Hence −2 log L0L1 = n[log|Σˆ| − log|S| + tr(Σˆ−1S) − p]. The asymptotic distribution of this statistic is χ2 with df = p(p+1)2 − [pk+p− k(k−1)2 ] = 12 [(p−k)2−p−k]. Why? 70 UNSW MATH5855 2021T3 Lecture 10 Factor Analysis 10.3 Varimax method of rotating the factors If Λˆ0 is the estimated factor loading matrix obtained by the ML method, we know that Λˆ = Λˆ0P with any orthogonal P ∈ Mk,k can be used instead. How to choose a particular P such that Λˆ has some desirable properties? Let dr = ∑p i=1 λ 2 ir, then the varimax method of rotating the factors consists in choosing P to maximise Sd = k∑ r=1 { p∑ i=1 (λ2ir − dr p )2} = k∑ r=1 { p∑ i=1 λ4ir − ( ∑p i=1 λ 2 ir) 2 p }. 
This corresponds to the wish to make, for each column of factor loadings, some of the coordinates “very large” and the rest “very small” (in absolute value). An iterative solution to the above rotation problem exists.

Note: Rotation of factor loadings is particularly recommended for loadings obtained by the ML method, since the initial values Λ̂_0 are constrained to satisfy the condition that Λ̂_0^⊤Σ_e^{-1}Λ̂_0 be diagonal. This is convenient for computational purposes but may not lead to easily interpretable factors.

10.4 Relationship to Principal Component Analysis

There are different ways in which factor analysis can be related to principal component analysis. We will discuss two of them here.

10.4.1 The principal component solution of the factor model

Starting with the matrix
\[ S = \frac1n\sum_{i=1}^n (Y_i-\bar Y)(Y_i-\bar Y)^\top \]
we can write down its spectral decomposition by using all of its p eigenvalues and eigenvectors. In this way we would obtain a perfect reconstruction of S, but since it is achieved by using p factors, it delivers no dimension reduction and is useless. We would prefer to employ a smaller number k of eigenvalues and eigenvectors of S and to obtain only an approximate reconstruction of S:
\[ S \approx \sum_{i=1}^k \tau_i a_i a_i^\top = \Lambda\Lambda^\top, \]
where the τ_i are the characteristic roots of S, taking the k biggest ones (w.l.o.g. τ_1, τ_2, …, τ_k), and the a_i are their corresponding eigenvectors. Since the understanding is that (if k is the right number of factors) all communalities have been taken into account, \( s_{ii} - \sum_{j=1}^k \lambda_{ij}^2 \) would be the estimators of the uniquenesses. This approach shows that the k factors have been extracted from S in the same way as the principal components are calculated. The method is called the principal component solution of the factor model.

10.4.2 The Principal Factor Solution

This is yet another method that uses similar ideas from principal components analysis.
It is similar to the principal component solution, but the factor extraction is not performed directly on S. To describe it, let us assume for a moment that the uniquenesses are known (or can be estimated reasonably well) and that we can decompose S = S_r + Σ_e, where the number k of factors is known and Σ_e is the diagonal matrix containing the uniquenesses. Then the factor analysis model states that (an estimate of) Λ should satisfy
\[ S_r = S - \Sigma_e = \Lambda\Lambda^\top. \]
Hence an estimate of Λ can be found by performing principal component analysis on S_r: if \( S_r = \sum_{i=1}^p t_i b_i b_i^\top \), with the t_i the characteristic roots of S_r, take the k biggest ones (w.l.o.g. t_1, t_2, …, t_k). Denote
\[ B = (b_1\ b_2\ \cdots\ b_k); \qquad \Delta = \operatorname{diag}(t_1, t_2, \ldots, t_k). \]
Then Λ̂ = B∆^{1/2}. This can also be done iteratively!

This approach has some problems:
i) There is no reliable estimate of Σ_e available. (The most commonly used one, in the case where S is the correlation matrix R, is σ̂²_{e_i} = 1/r^{ii}, where r^{ii} is the ith diagonal element of R^{-1}.)
ii) How to select k?

Note: The methods in Section 10.4 are not efficient compared to the ML method, and in general the ML method is the preferred one. However, the ML method requires the normality assumption, so the alternative approaches described here are used in cases where multivariate normality is in serious doubt. Most often in practice, the choice of k is made by combining subject-matter knowledge, “reasonableness” of results, and the proportion of variance explained.

10.5 Software

SAS

As you might expect, factor analysis is implemented in PROC FACTOR. Some remarks:
• if you want to extract different numbers of factors (the example below shows how to extract n = 2 factors), you should run the procedure once for each number of factors;
• the communalities need a preliminary estimate.
If one considers the correlation matrix instead of Σ, then the communalities can be estimated by the squared multiple correlation of each of the variables with the rest (these communality estimates are used to obtain preliminary estimates of the uniquenesses to start the iteration process). If during the iteration process a communality estimate exceeds 1 (a case referred to as an ultra-Heywood case), the heywood option sets such a communality to one, thus allowing the iterations to continue;
• the scree option can be used to produce a plot of the eigenvalues of Σ that is helpful in deciding how many factors to use;
• besides method=ml, you can use method=principal;
• with the ML method option, Akaike’s Information Criterion (and Schwarz’s Bayesian Criterion) are included. These can be used to estimate the “best” number of parameters to include in a model (in case more than one model is acceptable). The number of factors that yields the smallest AIC is considered “best”.

R

Function stats::factanal() is the built-in implementation. Package psych contains additional functions and utilities, as well as its own implementation, psych::fa(), with a number of model selection tools. Package nFactors contains utilities for determining the number of factors (e.g., scree plots).

10.6 Examples

Example 10.1. Data on five socioeconomic variables for 12 census tracts in the Los Angeles area. The five variables represent total population, median school years, total unemployment, miscellaneous professional services, and median house value. Use the ML method and varimax rotation.
• Try to run the above model with n = 3 factors. The message “WARNING: Too many factors for a unique solution” appears. This is not surprising, as the number of parameters in the model will exceed the number of elements in Σ (here \( \tfrac12[(p-k)^2 - p - k] = -2 \)). In this example you can run the procedure for n = 1 and for n = 2 only (do it!)
and you will see that n = 2 gives an adequate representation.
• Try using psych::fa.parallel() to search for the optimal number of factors.

10.7 Additional resources

An alternative presentation of these concepts can be found in JW Ch. 9.

11 Structural Equation Modelling

11.1 General form of the model . . . 74
11.2 Estimation . . . 75
11.3 Model evaluation . . . 76
11.4 Some particular SEM . . . 76
11.5 Relationship between exploratory and confirmatory FA . . . 76
11.6 Software . . . 77
11.7 Examples . . . 78

Factor analysis (FA) is only one example of a new approach to data analysis which is not based on the individual observations. We were not able to use the regression approach since the input factors were latent (not observable); there were too many unknowns. We turned instead to analysing the covariance matrix Σ (and its estimator S), which involved the actual parameters of interest, the σ²_i and Λ. That is, we switched from the level of individual observations to analysing covariance matrices instead. There is a series of methods which are based on the analysis of covariances rather than individual cases. Instead of minimising functions of observed and predicted individual values, we minimise the differences between sample covariances and the covariances predicted by the model. The fundamental hypothesis in these analyses is H_0: Σ = Σ(θ) against H_1: Σ ≠ Σ(θ). Here Σ has p(p + 1)/2 unknown elements (estimated by S), but these are assumed to be reproducible by just k = dim(θ) < p(p + 1)/2 parameters.
Note that, more generally, we could consider fitting means and covariances, or means, covariances, and higher moments, to a given structure. Regression analysis with random inputs, simultaneous equations systems, confirmatory factor analysis, canonical correlations, and (M)ANOVA can be considered special cases.

Structural equation modelling is an important statistical tool in economics and the behavioural sciences. Structural equations express relationships among several variables that can be either directly observed variables (manifest variables) or unobserved hypothetical variables (latent variables). In structural models, as opposed to functional models, all variables are taken to be random rather than having fixed levels. In addition, for maximum likelihood estimation and generalised least squares estimation (see below), the random variables are assumed to have an approximately multivariate normal distribution. Hence you are advised to remove outliers and consider transformations to normality before fitting.

11.1 General form of the model

\[ \eta = B\eta + \Gamma\xi + \zeta. \tag{11.1} \]

Here,
η ∈ R^m is the vector of output latent variables;
ξ ∈ R^{n′} is the vector of input latent variables;
B ∈ M_{m,m}, Γ ∈ M_{m,n′} are coefficient matrices (Note: (I − B) is assumed to be nonsingular);
ζ ∈ R^m is the disturbance vector, with E ζ = 0.

To this modelling equation (11.1) we attach two measurement equations:
\[ Y = \Lambda_Y\eta + \epsilon; \tag{11.2} \]
\[ X = \Lambda_X\xi + \delta; \tag{11.3} \]
with Y ∈ R^p, X ∈ R^q; Λ_Y ∈ M_{p,m}, Λ_X ∈ M_{q,n′}; and ϵ ∈ R^p, δ ∈ R^q zero-mean measurement errors. These errors are assumed to be uncorrelated with ξ and ζ and with each other.

[Path diagram: generative model for X and Y, linking ξ and η (with disturbance ζ) through B and Γ, and the measurements X and Y (with errors δ and ϵ) through Λ_X and Λ_Y.]

The above quite general model (11.1)–(11.2)–(11.3) is called the Keesling–Wiley–Jöreskog model. Its interpretation is that the input and output latent variables ξ and η are connected by a system of linear equations (the structural model (11.1)) with coefficient matrices B and Γ and an error vector ζ.
The random vectors Y and X represent the observable vectors (measurements). The implied covariance matrix for this model can be obtained. Let Var(ξ) = Φ; Var(ζ) = Ψ; Var(ϵ) = Θ_ϵ; Var(δ) = Θ_δ. Then,
\[ \Sigma = \Sigma(\theta) = \begin{pmatrix} \Sigma_{YY}(\theta) & \Sigma_{YX}(\theta) \\ \Sigma_{XY}(\theta) & \Sigma_{XX}(\theta) \end{pmatrix} = \begin{pmatrix} \Lambda_Y(I-B)^{-1}(\Gamma\Phi\Gamma^\top+\Psi)[(I-B)^{-1}]^\top\Lambda_Y^\top + \Theta_\epsilon & \Lambda_Y(I-B)^{-1}\Gamma\Phi\Lambda_X^\top \\ \Lambda_X\Phi\Gamma^\top[(I-B)^{-1}]^\top\Lambda_Y^\top & \Lambda_X\Phi\Lambda_X^\top + \Theta_\delta \end{pmatrix}. \tag{11.4} \]

11.2 Estimation

Under the normality assumption, we can use the MLE. The “data” is the estimated covariance matrix
\[ S = \frac1{n-1}\sum_{i=1}^n \begin{pmatrix} Y_i-\bar Y \\ X_i-\bar X \end{pmatrix}\begin{pmatrix} Y_i-\bar Y \\ X_i-\bar X \end{pmatrix}^\top, \]
and since it is known that (n − 1)S ∼ W_{p+q}(n − 1, Σ), we can utilise the form of the Wishart density to derive that
\[ \log L(S,\Sigma(\theta)) = \text{constant} - \frac{n-1}2\bigl\{\log|\Sigma(\theta)| + \operatorname{tr}[S\Sigma^{-1}(\theta)]\bigr\}. \]
This is the function that has to be maximised. Hence, to find the MLE, we minimise
\[ F_{ML}(\theta) = \log|\Sigma(\theta)| + \operatorname{tr}[S\Sigma^{-1}(\theta)] - \log|S| - (p+q). \tag{11.5} \]
The function (11.5) has the advantage that F_{ML} is zero for the “saturated model” (with Σ̂ = S). I.e., a perfect fit is indicated by zero (and any non-perfect fit gives rise to a value of F_{ML} > 0).

11.3 Model evaluation

Under normality, model adequacy is mostly tested by an asymptotic χ²-test. Under H_0: Σ = Σ(θ) versus H_1: Σ ≠ Σ(θ), the statistic to be used is T = (n − 1)F_{ML}(θ̂_{ML}), and under H_0 its asymptotic distribution is χ² with df = (p + q)(p + q + 1)/2 − dim(θ). Reason:
\[ \log L_0 = \log L(S, \hat\Sigma_{MLE}) = \log L(S,\Sigma(\hat\theta_{ML})) = -\frac{n-1}2\bigl\{\log|\hat\Sigma_{MLE}| + \operatorname{tr}[S\hat\Sigma_{MLE}^{-1}]\bigr\} + \text{constant}; \]
\[ \log L_1 = \log L(S,S) = -\frac{n-1}2\bigl\{\log|S| + (p+q)\bigr\} + \text{constant}. \]
Then,
\[ -2\log\frac{L_0}{L_1} = (n-1)\bigl\{\log|\hat\Sigma_{MLE}| + \operatorname{tr}(S\hat\Sigma_{MLE}^{-1}) - \log|S| - (p+q)\bigr\} = (n-1)F_{ML}(\hat\theta_{ML}). \]

11.4 Some particular SEM

From the general model (11.1)–(11.2)–(11.3), we can obtain the following particular models:

A) Λ_Y = I_m, Λ_X = I_{n′}; p = m; q = n′; Θ_ϵ = 0; Θ_δ = 0 ⟹ Y = BY + ΓX + ζ (the classical econometric model).
B) Λ_Y = I_p, Λ_X = I_q ⟹ the measurement error model:
• η = Bη + Γξ + ζ
• Y = η + ϵ
• X = ξ + δ

C) Factor analysis models: just take the measurement part X = Λ_Xξ + δ.

11.5 Relationship between exploratory and confirmatory FA

In EFA the number of latent variables is not determined in advance; further, the measurement errors are assumed to be uncorrelated. In CFA the model is constructed to a great extent in advance: the number of latent variables ξ is set by the analyst; whether a latent variable influences an observed variable is specified; some direct effects of latent on observed variables are fixed to zero or some other constant (e.g., one); measurement errors δ may correlate; and the covariance of latent variables can be either estimated or set to any value. In practice, the distinction between EFA and CFA is more blurred. For instance, researchers using traditional EFA procedures may restrict their analysis to a group of indicators that they believe are influenced by one factor. Or, researchers with poorly fitting models in CFA often modify their model in an exploratory way with the goal of improving fit.

11.6 Software

SAS

In SAS, the standard PROC CALIS is used for fitting structural equation models, and it has been significantly upgraded in SAS 9.3. In particular, you can now analyse means-and-covariance (or even higher-order) structures, instead of just covariance structures as in classical SEM.

R

There are two packages for SEM in R: lavaan and sem.
sem is an older package, whereas lavaan aims to provide an extensible framework for SEMs and their extensions:
• can mimic commercial packages (including those below)
• provides convenience functions for specifying simple special cases (such as CFA) but also a more flexible interface for advanced users
• mean structures and multiple groups
• different estimators and standard errors (including robust)
• handling of missing data
• linear and nonlinear equality and inequality constraints
• categorical data support
• multilevel SEMs
• package blavaan for Bayesian estimation
• etc.

Others

Note that the general form of the SEM model given here is only one possible description, due to Karl Jöreskog. His paradigm was first implemented in the software called LISREL (Linear Structural Relationships). There are other, equivalent descriptions due to Bentler and Weeks, to McDonald, and to some other prominent researchers in the field. Some of them have also proposed their own software for fitting SEM models according to their model specification. The EQS program for PC, which deals with the Bentler/Weeks model, was very popular for a while. The latest “hit” in the area is the program MPLUS (M is for Bengt Muthén). Muthén is a former PhD student of Jöreskog and was involved in the development of LISREL. During the last 15 years or so, however, he has developed his own program, MPLUS. Its latest version 6 represents a fully integrated framework and is the premier software in the area of general latent variable modelling, specifically in the behavioural sciences.
MPLUS capabilities include: • Exploratory factor analysis • Structural equation modelling • Item response theory analysis • Growth curve modelling 77 UNSW MATH5855 2021T3 Lecture 11 Structural Equation Modelling • Mixture modelling (latent class analysis) • Longitudinal mixture modelling (hidden Markov, latent transition analysis, latent class growth analysis, growth mixture analysis) • Survival analysis (continuous- and discrete-time) • Multilevel analysis • Bayesian analysis • etc. 11.7 Examples Example 11.1. Wheaton, Muthen, Alwin, and Summers (1977) Anomie example. 78 UNSW MATH5855 2021T3 Lecture 12 Discrimination and Classification 12 Discrimination and Classification 12.1 Separation and Classification for two populations . . . . . . . . . . . . . . . . . . 79 12.2 Classification errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 12.3 Summarising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 12.4 Optimal classification rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 12.4.1 Rules that minimise the expected cost of misclassification (ECM) . . . . . 81 12.4.2 Rules that minimise the total probability of misclassification (TPM) . . . 81 12.4.3 Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 12.5 Classification with two multivariate normal populations . . . . . . . . . . . . . . 82 12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ . . . . . . . . . . . . . . . 82 12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2) . . . . . . . . . . . . . . . 83 12.5.3 Optimum error rate and Mahalanobis distance . . . . . . . . . . . . . . . 84 12.6 Classification with more than 2 normal populations . . . . . . . . . . . . . . . . . 84 12.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 12.8 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 12.9 Additional resources . . 
. . . 85
12.10 Exercises . . . 86

12.1 Separation and Classification for two populations

Discriminant analysis and classification are widely used multivariate techniques. The goal is either separating sets of objects (in discriminant analysis terminology) or allocating new objects to given groups (in classification theory terminology). Basically, discriminant analysis is more exploratory in nature than classification. However, the difference is not great, especially because a function that separates may often serve as an allocator and, conversely, a rule of allocation may suggest a discriminatory procedure. In practice, the goals of the two procedures often overlap.

We will consider the case of two populations (classes of objects) first. Typical examples include: an anthropologist wants to classify a skull as male or female; a patient needs to be classified as needing surgery or not needing surgery; etc. Denote the two classes by π_1 and π_2. The separation is to be performed on the basis of measurements of p associated random variables that form a vector X ∈ R^p. The observed values of X belong to different distributions when taken from π_1 and from π_2, and we shall denote the densities of these two distributions by f_1(x) and f_2(x), respectively. Allocation or classification is possible because one has a learning sample at hand, i.e., some measurement vectors that are known to have been generated from each of the two populations. These measurements have been generated in earlier, similar experiments. The goal is to partition the sample space into two mutually exclusive regions, say R_1 and R_2, such that if a new observation falls in R_1 it is allocated to π_1, and if it falls in R_2 it is allocated to π_2.
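As a toy illustration of such a partition (hypothetical univariate normal densities f_1 and f_2, not taken from the notes), one can allocate a new observation simply by comparing the two densities; a minimal Python sketch:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical populations pi_1 and pi_2
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

def allocate(x):
    """Allocate x to pi_1 (region R1) if f1(x) >= f2(x), else to pi_2 (R2)."""
    return 1 if f1(x) >= f2(x) else 2

# With equal variances, the boundary between R1 and R2 is the midpoint x = 1
assert allocate(0.8) == 1
assert allocate(1.2) == 2
```

Here R_1 and R_2 are implicit in the comparison; the optimal rules developed below refine exactly this comparison with priors and costs.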
12.2 Classification errors

There is always a chance of an erroneous classification (misclassification). Our goal will be to develop classification methods that minimise, in a suitably defined sense, the chances of misclassification.

It should be noted that one of the two classes may have a greater likelihood of occurrence because one of the two populations might be much larger than the other. For example, there tend to be a lot more financially sound companies than bankrupt companies. These prior probabilities of occurrence should be taken into account when constructing a classification rule if we want it to perform optimally. In a more detailed study of optimal classification rules, cost is also important. If classifying a π_1 object into the class π_2 represents a much more serious error than classifying a π_2 object into the class π_1, then these cost differences should also be taken into account when designing the optimal rule.

The conditional probabilities of misclassification are defined naturally as
\[ \Pr(2|1) = \Pr(X \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx, \tag{12.1} \]
\[ \Pr(1|2) = \Pr(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx. \tag{12.2} \]

12.3 Summarising

We turn briefly to the question of how to summarise a classifier’s performance. Each object has a true class membership and one predicted by the classifier; for a given dataset for which the true memberships are known, we may summarise the counts of the four resulting possibilities in a contingency table called a confusion matrix:

                      Predicted class 1                  Predicted class 2
Actual class 1        members of 1 correctly classified  members of 1 misclassified as 2
Actual class 2        members of 2 misclassified as 1    members of 2 correctly classified

A confusion matrix can be produced when there are more than two classes as well.
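Tabulating a confusion matrix from known true memberships and predictions is a simple counting exercise; a sketch in Python (the toy label vectors below are made up for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes=(1, 2)):
    """Count (actual, predicted) pairs into a nested dict: cm[a][p]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}

actual    = [1, 1, 1, 2, 2, 2, 2]
predicted = [1, 1, 2, 2, 2, 1, 2]
cm = confusion_matrix(actual, predicted)

# cm[1][2] counts members of class 1 misclassified as 2, and so on
assert cm == {1: {1: 2, 2: 1}, 2: {1: 1, 2: 3}}
```

Passing a longer `classes` tuple produces the multi-class version mentioned above.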
In the special case where there are two classes that can be meaningfully labelled as Negative/Positive, False/True, No/Yes, Null/Alternative, or similar, it is common to use the following terminology:

                      Predicted negative        Predicted positive
Actual negative       True Negative (TN)        False Positive (FP)
Actual positive       False Negative (FN)       True Positive (TP)

One can then define various performance metrics, such as:

sensitivity (a.k.a. recall, true positive rate (TPR)): Pr(Pred. pos. | Act. pos.) = TP/(TP + FN)
specificity (a.k.a. selectivity, true negative rate (TNR)): Pr(Pred. neg. | Act. neg.) = TN/(TN + FP)
false positive rate (a.k.a. FPR, fall-out): Pr(Pred. pos. | Act. neg.) = FP/(TN + FP) = 1 − TNR
accuracy: (TP + TN)/(TP + FP + TN + FN)
total probability of misclassification (a.k.a. TPM): 1 − accuracy
precision (a.k.a. positive predictive value): Pr(Act. pos. | Pred. pos.) = TP/(TP + FP)
negative predictive value: Pr(Act. neg. | Pred. neg.) = TN/(TN + FN)
F1 score: 2TP/(2TP + FP + FN)

Many classifiers return a continuous score that must be thresholded to produce a binary decision (e.g., predict “Yes” if the score exceeds some constant k and “No” otherwise). It is common practice to plot a receiver operating characteristic (ROC) curve by varying the threshold and plotting the resulting TPR (on the vertical axis) against the FPR (on the horizontal axis); both decrease as k increases. A perfect classifier has a threshold for which the curve achieves the point (0, 1), whereas a classifier whose curve is close to the line y = x is no better than chance.

12.4 Optimal classification rules

12.4.1 Rules that minimise the expected cost of misclassification (ECM)

Lemma 12.1. Denote by p_i the prior probability of π_i, i = 1, 2, p_1 + p_2 = 1. Then the overall probabilities of incorrectly classifying objects are
Pr(misclassified as π_1) = Pr(1|2)p_2 and Pr(misclassified as π_2) = Pr(2|1)p_1.
Further, let c(i|j), i ≠ j, i, j = 1, 2, be the misclassification costs (the cost of allocating to π_i an object from π_j). Then the expected cost of misclassification is
\[ \mathrm{ECM} = c(2|1)\Pr(2|1)p_1 + c(1|2)\Pr(1|2)p_2. \tag{12.3} \]
The regions R_1 and R_2 that minimise the ECM are given by
\[ R_1 = \Bigl\{x : \frac{f_1(x)}{f_2(x)} \ge \frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}\Bigr\} \tag{12.4} \]
and
\[ R_2 = \Bigl\{x : \frac{f_1(x)}{f_2(x)} < \frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}\Bigr\}. \tag{12.5} \]

Proof. It is easy to see that
\[ \mathrm{ECM} = \int_{R_1}\bigl[c(1|2)p_2f_2(x) - c(2|1)p_1f_1(x)\bigr]\,dx + c(2|1)p_1. \]
Hence, the ECM will be minimised if R_1 includes those values of x for which the integrand c(1|2)p_2f_2(x) − c(2|1)p_1f_1(x) ≤ 0 and excludes all the complementary values.

Note the significance of the fact that in Lemma 12.1 only ratios are involved. Often in practice one has a much clearer idea about the cost ratio than about the actual costs themselves. As an exercise, consider the special cases of Lemma 12.1 when p_2 = p_1, when c(1|2) = c(2|1), and when both these equalities hold. Comment on the soundness of the classification regions in these cases.

12.4.2 Rules that minimise the total probability of misclassification (TPM)

If we ignore the cost of misclassification, we can define the total probability of misclassification as
\[ \mathrm{TPM} = p_1\int_{R_2}f_1(x)\,dx + p_2\int_{R_1}f_2(x)\,dx. \]
Mathematically, this is a particular case of Lemma 12.1 with equal costs of misclassification, so nothing new here.

12.4.3 Bayesian approach

Here, we try to allocate a new observation x_0 to the population with the larger posterior probability Pr(π_i | x_0), i = 1, 2. According to Bayes’s formula we have
\[ \Pr(\pi_1|x_0) = \frac{p_1f_1(x_0)}{p_1f_1(x_0)+p_2f_2(x_0)}, \qquad \Pr(\pi_2|x_0) = \frac{p_2f_2(x_0)}{p_1f_1(x_0)+p_2f_2(x_0)}. \]
Mathematically, the strategy of classifying an observation x_0 as π_1 if Pr(π_1|x_0) > Pr(π_2|x_0) is again a particular case of Lemma 12.1 with equal costs of misclassification. (Why?) But note that the calculation of the posterior probabilities is in itself a useful and informative operation.
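The ECM-optimal rule (12.4)–(12.5) reduces to a density-ratio threshold, which is easy to sketch numerically; below is a minimal Python version with hypothetical univariate normal populations (all parameter values are made up for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def ecm_allocate(x, f1, f2, p1, p2, c12, c21):
    """Rule (12.4): allocate to pi_1 iff f1(x)/f2(x) >= (c(1|2)/c(2|1)) * (p2/p1)."""
    return 1 if f1(x) / f2(x) >= (c12 / c21) * (p2 / p1) else 2

f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 3.0, 1.0)

# Equal priors and costs: the boundary is the midpoint x = 1.5
assert ecm_allocate(1.4, f1, f2, 0.5, 0.5, 1.0, 1.0) == 1
# Making misclassification of pi_2 objects much costlier (large c(1|2)) shrinks R1
assert ecm_allocate(1.4, f1, f2, 0.5, 0.5, 10.0, 1.0) == 2
```

The second assertion illustrates the remark above: only the cost ratio c(1|2)/c(2|1) matters, and raising it moves the same observation from R_1 into R_2.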
12.5 Classification with two multivariate normal populations

Until now we did not specify any particular form for the densities f_1(x) and f_2(x). Essential simplification occurs under the normality assumption, and we now turn to a more detailed discussion of this particular case. Two different cases will be considered: equal and unequal covariance matrices.

12.5.1 Case of equal covariance matrices Σ_1 = Σ_2 = Σ

Now we assume that the two populations π_1 and π_2 are N_p(µ_1, Σ) and N_p(µ_2, Σ), respectively. Then (12.4) becomes
\[ R_1 = \Bigl\{x : \exp\Bigl[-\tfrac12(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) + \tfrac12(x-\mu_2)^\top\Sigma^{-1}(x-\mu_2)\Bigr] \ge \frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr\}. \]
Similarly, from (12.5) we get
\[ R_2 = \Bigl\{x : \exp\Bigl[-\tfrac12(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1) + \tfrac12(x-\mu_2)^\top\Sigma^{-1}(x-\mu_2)\Bigr] < \frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr\}, \]
and we arrive at the following result:

Theorem 12.2. Under the above assumptions, the allocation rule that minimises the ECM is given by:
1. allocate x_0 to π_1 if
\[ (\mu_1-\mu_2)^\top\Sigma^{-1}x_0 - \tfrac12(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2) \ge \log\Bigl[\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr]; \]
2. otherwise, allocate x_0 to π_2.

Proof. Simple exercise (to be discussed at lectures).

Note also that it is unrealistic in most situations to assume that the parameters µ_1, µ_2, and Σ are known. They will instead need to be estimated from the data. Assume that n_1 and n_2 observations are available from the first and second populations, respectively. If x̄_1 and x̄_2 are the sample mean vectors and S_1 and S_2 the corresponding sample covariance matrices, then under the assumption Σ_1 = Σ_2 = Σ we can derive the pooled covariance matrix estimator
\[ S_{\mathrm{pooled}} = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2} \]
(an unbiased estimator of Σ!). Hence the sample classification rule becomes:
1. allocate x_0 to π_1 if
\[ (\bar x_1-\bar x_2)^\top S_{\mathrm{pooled}}^{-1}x_0 - \tfrac12(\bar x_1-\bar x_2)^\top S_{\mathrm{pooled}}^{-1}(\bar x_1+\bar x_2) \ge \log\Bigl[\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr]; \tag{12.6} \]
2. otherwise, allocate x_0 to π_2.

This empirical classification rule is called an allocation rule based on Fisher’s discriminant function.
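The sample rule (12.6) can be sketched numerically in a few lines; a minimal Python/NumPy version on synthetic bivariate data (the population means, sample sizes, and test points below are made up for illustration):

```python
import numpy as np

def fisher_rule(x0, X1, X2, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Sample ECM rule (12.6): allocate x0 to population 1 or 2."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)   # pooled covariance
    w = np.linalg.solve(Sp, m1 - m2)                       # S_pooled^{-1} (xbar1 - xbar2)
    score = w @ x0 - 0.5 * w @ (m1 + m2)                   # Fisher's linear discriminant function
    return 1 if score >= np.log((c12 / c21) * (p2 / p1)) else 2

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # sample from population 1
X2 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # sample from population 2

assert fisher_rule(np.array([0.2, -0.1]), X1, X2) == 1
assert fisher_rule(np.array([2.8, 3.1]), X1, X2) == 2
```

Note that the rule is linear in x_0 through the single vector w, as emphasised below.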
The function
\[ (\bar x_1-\bar x_2)^\top S_{\mathrm{pooled}}^{-1}x_0 - \tfrac12(\bar x_1-\bar x_2)^\top S_{\mathrm{pooled}}^{-1}(\bar x_1+\bar x_2) \]
itself (which is linear in the vector observation x_0) is called Fisher’s linear discriminant function. Of course, the latter rule is only an estimate of the optimal rule, since the parameters have been replaced by estimated quantities, but we expect it to perform well when n_1 and n_2 are large. It is to be pointed out that the allocation rule (12.6) is linear in the new observation x_0; the simplicity of its form is a consequence of the multivariate normality assumption.

12.5.2 Case of different covariance matrices (Σ_1 ≠ Σ_2)

Theorem 12.3. Now we assume that the two populations π_1 and π_2 are N_p(µ_1, Σ_1) and N_p(µ_2, Σ_2), respectively. Repeating the same steps as in Theorem 12.2, we get
\[ R_1 = \Bigl\{x : -\tfrac12 x^\top(\Sigma_1^{-1}-\Sigma_2^{-1})x + (\mu_1^\top\Sigma_1^{-1}-\mu_2^\top\Sigma_2^{-1})x - k \ge \log\Bigl[\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr]\Bigr\} \]
\[ R_2 = \Bigl\{x : -\tfrac12 x^\top(\Sigma_1^{-1}-\Sigma_2^{-1})x + (\mu_1^\top\Sigma_1^{-1}-\mu_2^\top\Sigma_2^{-1})x - k < \log\Bigl[\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr]\Bigr\} \]
where
\[ k = \tfrac12\log\Bigl(\frac{|\Sigma_1|}{|\Sigma_2|}\Bigr) + \tfrac12\bigl(\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2\bigr), \]
and we see that the classification regions are quadratic functions of the new observation in this case. One obtains the following rule:
1. allocate x_0 to π_1 if
\[ -\tfrac12 x_0^\top(S_1^{-1}-S_2^{-1})x_0 + (\bar x_1^\top S_1^{-1}-\bar x_2^\top S_2^{-1})x_0 - \hat k \ge \log\Bigl[\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\Bigr], \]
where k̂ is the empirical analogue of k;
2. allocate x_0 to π_2 otherwise.

When Σ_1 = Σ_2, the quadratic term disappears and we easily recover the classification regions from Theorem 12.2. The case considered in Theorem 12.3 is, of course, more general, but we should be cautious when applying it in practice.
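For equal priors and costs, the quadratic rule of Theorem 12.3 amounts to comparing, for each population, the score −½ log|Σ_i| − ½ (x − µ_i)^⊤Σ_i^{-1}(x − µ_i); a minimal Python/NumPy sketch with hypothetical parameters (not from any dataset in these notes):

```python
import numpy as np

def quad_score(x, mu, Sigma, prior):
    """Quadratic discriminant score: -log|Sigma|/2 - (x-mu)' Sigma^{-1} (x-mu)/2 + log(prior)."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(prior)

# Hypothetical populations with different covariance matrices
mu1, Sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, Sigma2 = np.array([2.0, 2.0]), 4.0 * np.eye(2)

def allocate(x):
    d1 = quad_score(x, mu1, Sigma1, 0.5)
    d2 = quad_score(x, mu2, Sigma2, 0.5)
    return 1 if d1 > d2 else 2

assert allocate(np.array([0.1, 0.1])) == 1
assert allocate(np.array([2.0, 2.0])) == 2
```

The log-determinant term is exactly the |Σ_1|/|Σ_2| part of the constant k above; the same score reappears for g > 2 populations later in this lecture.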
It turns out that in more than two dimensions, classification rules based on quadratic functions do not always perform nicely and can lead to strange results. This is especially true when the data are not quite normal and when the differences in the covariance matrices are significant. The rule is very sensitive (non-robust) to departures from normality. Therefore, it is advisable first to try to transform the data to more nearly normal by using some classical normality transformations. A detailed discussion of these effects will be provided during the lecture. Also, the tests discussed in Lecture 9 can be used to check whether the equal-covariance assumption is valid.

12.5.3 Optimum error rate and Mahalanobis distance

We defined the TPM quantity in general terms for any classification rule. When the regions R_1 and R_2 are selected in an optimal way, one obtains the minimal value of the TPM, which is called the optimum error rate (OER) and is used to characterise the difficulty of the classification problem at hand. Here we illustrate the calculation of the OER for the simple case of two normal populations with Σ_1 = Σ_2 = Σ and prior probabilities p_1 = p_2 = ½. In this case
\[ \mathrm{TPM} = \tfrac12\int_{R_2}f_1(x)\,dx + \tfrac12\int_{R_1}f_2(x)\,dx, \]
and the OER is obtained by choosing
\[ R_1 = \bigl\{x : (\mu_1-\mu_2)^\top\Sigma^{-1}x - \tfrac12(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2) \ge 0\bigr\} \]
and
\[ R_2 = \bigl\{x : (\mu_1-\mu_2)^\top\Sigma^{-1}x - \tfrac12(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2) < 0\bigr\}. \]
If we introduce the random variable Y = (µ_1 − µ_2)^⊤Σ^{-1}X = l^⊤X, then Y | i ∼ N_1(µ_{iY}, ∆²), i = 1, 2, for the two populations π_1 and π_2, where µ_{iY} = (µ_1 − µ_2)^⊤Σ^{-1}µ_i, i = 1, 2. The quantity
\[ \Delta = \sqrt{(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2)} \]
is the Mahalanobis distance between the two normal populations, and it plays an important role in many applications of multivariate analysis. Now
\[ \Pr(2|1) = \Pr\Bigl(Y < \tfrac12(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1+\mu_2)\Bigr) = \Pr\Bigl(\frac{Y-\mu_{1Y}}{\Delta} < -\frac\Delta2\Bigr) = \Phi\Bigl(-\frac\Delta2\Bigr), \]
with Φ(·) denoting the cumulative distribution function of the standard normal.
Along the same lines we can get (do it!): Pr(1|2) = Φ(−∆/2), so that finally
\[ \mathrm{OER} = \text{minimum TPM} = \Phi\Bigl(-\frac\Delta2\Bigr). \]
In practice, ∆ is replaced by its estimated value \( \hat\Delta = \sqrt{(\bar x_1-\bar x_2)^\top S_{\mathrm{pooled}}^{-1}(\bar x_1-\bar x_2)} \).

12.6 Classification with more than 2 normal populations

Formal generalisation of the theory to the case of g > 2 groups π_1, π_2, …, π_g is straightforward, but optimal-error-rate analysis is difficult when g > 2. It is easy to see that the ECM classification rule with equal misclassification costs now becomes (compare to (12.4) and (12.5)):
1. Allocate x_0 to π_k if p_k f_k(x_0) > p_i f_i(x_0) for all i ≠ k.
Equivalently, one can check whether log p_k f_k(x_0) > log p_i f_i(x_0) for all i ≠ k. When applied to g normal populations f_i(x) ∼ N_p(µ_i, Σ_i), i = 1, 2, …, g, this classification rule becomes:
1. Allocate x_0 to π_k if
\[ \log p_kf_k(x_0) = \log p_k - \frac p2\log(2\pi) - \frac12\log|\Sigma_k| - \frac12(x_0-\mu_k)^\top\Sigma_k^{-1}(x_0-\mu_k) = \max_i \log p_if_i(x_0). \]
Ignoring the constant \( \frac p2\log(2\pi) \), we get the quadratic discriminant score for the ith population:
\[ d_i^Q(x) = -\tfrac12\log|\Sigma_i| - \tfrac12(x-\mu_i)^\top\Sigma_i^{-1}(x-\mu_i) + \log p_i, \tag{12.7} \]
and the rule allocates x to the population with the largest quadratic discriminant score. It is obvious how one would estimate from the data the unknown quantities involved in (12.7) in order to obtain the estimated minimum total probability of misclassification rule. (Formulate the precise statement!)

In the case where we are justified in assuming that all covariance matrices of the g populations are equal, a simplification is possible (as in the case g = 2). Looking only at the terms in (12.7) that vary with i = 1, 2, …, g, we can define the linear discriminant score:
\[ d_i(x) = \mu_i^\top\Sigma^{-1}x - \tfrac12\mu_i^\top\Sigma^{-1}\mu_i + \log p_i. \]
Correspondingly, a sample version of the linear discriminant score is obtained by substituting the sample means x̄_i for µ_i and
\[ S_{\mathrm{pooled}} = \frac{n_1-1}{n_1+n_2+\cdots+n_g-g}S_1 + \cdots + \frac{n_g-1}{n_1+n_2+\cdots+n_g-g}S_g \]
for Σ, thus arriving at
\[ \hat d_i(x) = \bar x_i^\top S_{\mathrm{pooled}}^{-1}x - \tfrac12\bar x_i^\top S_{\mathrm{pooled}}^{-1}\bar x_i + \log p_i. \]
Therefore the Estimated Minimum TPM Rule for Equal-Covariance Normal Populations is the following:
1. Allocate x to π_k if d̂_k(x) is the largest of the g values d̂_i(x), i = 1, 2, …, g.
In this form, the classification rule has been implemented in many computer packages.

12.7 Software

SAS: PROC DISCRIM
R: MASS::lda(), MASS::qda()

12.8 Examples

Example 12.4. Linear and quadratic discriminant analysis for Edgar Anderson’s Iris data, using cross-validation to assess the classifiers.

12.9 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 11.1–11.6.

12.10 Exercises

Exercise 12.1 Three bivariate normal populations, labelled i = 1, 2, 3, have the same covariance matrix, given by
\[ \Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \]
and means
\[ \mu_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \]
respectively.
(a) Suggest a classification rule for an observation x = (x_1, x_2)^⊤ that corresponds to one of the three populations. You may assume equal priors for the three populations and equal misclassification costs.
(b) Classify the following observations to one of the three distributions:
\[ \begin{pmatrix} 0.2 \\ 0.6 \end{pmatrix}, \quad \begin{pmatrix} 2 \\ 0.8 \end{pmatrix}, \quad \begin{pmatrix} 0.75 \\ 1 \end{pmatrix}. \]
(c) Show that in R², the three classification regions are bounded by straight lines, and draw a graph of these three regions.

13 Support Vector Machines

13.1 Introduction and motivation . . . 87
13.2 Expected versus Empirical Risk minimisation . . . 87
13.3 Basic idea of SVMs . . . 89
13.4 Estimation .
13.4.1 Linear SVM: Separable Case 90
13.4.2 Linear SVM: Nonseparable Case 91
13.5 Nonlinear SVMs 93
13.6 Multiple classes 94
13.7 SVM specification and tuning 94
13.8 Examples 94
13.9 Conclusion 95

13.1 Introduction and motivation

As seen in Lecture 12, when classifying into one of two p-dimensional multivariate normal populations, the scores are either linear (when the covariance matrices are equal) or quadratic (when the covariance matrices are different). Optimality could even be shown for such simple classifiers, thanks to the multivariate normality assumption. However, when the two populations are not multivariate normal, the situation is more difficult: the boundaries between the populations may be more blurry, and significantly more flexible, non-linear classification techniques may be necessary to achieve a good classification. Support vector machines (SVM) are an example of such non-linear statistical classification techniques. They usually achieve superior results in comparison to more traditional non-linear parametric classification techniques such as logit analysis, or non-parametric techniques such as neural networks. Mathematically, when using SVM, we formulate the classification as an empirical risk minimisation problem and solve it under additional restrictions on the allowed (nonlinear) classifier functions.
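Before moving on to SVMs, the equal-covariance discriminant rule of Exercise 12.1 above can be checked numerically: with equal priors, the log p_i terms are a common constant and each population is scored by µ_i⊤Σ⁻¹x − (1/2)µ_i⊤Σ⁻¹µ_i. The following is an illustrative sketch in Python (rather than the R/SAS used in the course), with the 2×2 inverse written out by hand:

```python
# Exercise 12.1: classify points among three bivariate normal populations
# with common covariance Sigma and equal priors, using the linear
# discriminant scores d_i(x) = mu_i' Sigma^{-1} x - 0.5 * mu_i' Sigma^{-1} mu_i.

def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

sigma_inv = inv2([[1.0, 0.5], [0.5, 1.0]])
mus = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]  # mu_1, mu_2, mu_3

def classify(x):
    # Score each population; equal priors, so log p_i drops out.
    scores = [dot(mu, matvec(sigma_inv, x)) - 0.5 * dot(mu, matvec(sigma_inv, mu))
              for mu in mus]
    return scores.index(max(scores)) + 1  # population label 1, 2, or 3

labels = [classify(x) for x in [(0.2, 0.6), (2.0, 0.8), (0.75, 1.0)]]
print(labels)  # the three observations of part (b) go to pi_3, pi_2, pi_1
```

The printed labels agree with what one obtains by evaluating the three linear scores by hand.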
13.2 Expected versus Empirical Risk minimisation

Let Y be an “indicator” with values +1 and −1 that indicates whether a certain p-dimensional observation belongs to one of two groups of interest. We want to find a “best” classifier in a class F of functions f. Each classifier function f(x) is meant to deliver a value of +1 or −1 for a given observation vector x. To this end, we consider the expected risk

R(f) = ∫ (1/2)|f(x) − y| dP(x, y).

Since the joint distribution P(x, y) is unknown in practice, we consider the empirical risk over a training set (x_i, y_i), i = 1, 2, . . . , n of observations instead:

R̂(f) = (1/n) Σ_{i=1}^n (1/2)|f(x_i) − y_i|.

The loss in the risk's definition is the “zero–one loss” given by L(x, y) = (1/2)|f(x) − y| and, thanks to the chosen labels ±1 for Y, obviously takes the values 0 (if the classification is correct) and 1 (if the classification is wrong).

Minimising the empirical (instead of the unknown expected) risk means finding f_n = argmin_{f∈F} R̂(f) as an approximation to f_opt = argmin_{f∈F} R(f). Generally speaking, the two solutions f_n and f_opt do not coincide and, without further assumptions, may be quite different. However, thanks to some groundbreaking work by V. Vapnik, there are theoretical results which, loosely speaking, state that if F is not too large and n → ∞, there is an upper bound on their difference that holds with probability (1 − η):

R(f) ≤ R̂(f) + ϕ(h/n, (log η)/n).

The above inequality can be interpreted as stating that the test error is bounded from above by the sum of the training error and the complexity of the set of models under consideration. We can then try to minimise this upper bound and hope that in that way we keep the (unknown) test error under control. The function ϕ above is monotone increasing in h (at least for large enough sample sizes n). Here h denotes the Vapnik–Chervonenkis (VC) dimension (i.e., a measure of the complexity of the class F).
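The empirical risk is straightforward to compute once a classifier f is fixed. A tiny illustrative Python sketch (the classifier and labels below are made up for illustration) evaluates R̂(f) directly from the definition:

```python
# Empirical risk under the zero-one loss: since f(x) and y are coded +1/-1,
# 0.5 * |f(x) - y| equals 0 for a correct classification and 1 for an error.

def empirical_risk(f, xs, ys):
    return sum(0.5 * abs(f(x) - y) for x, y in zip(xs, ys)) / len(xs)

# Toy classifier: the sign of the first coordinate.
f = lambda x: 1 if x[0] > 0 else -1

xs = [(2.0, 1.0), (-1.0, 0.5), (0.5, -2.0), (-3.0, 1.0)]
ys = [1, -1, -1, -1]  # f misclassifies the third point

print(empirical_risk(f, xs, ys))  # 1 error out of 4 -> 0.25
```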
For a linear classification rule f(x) = sign(w⊤x + b) with a p-dimensional predictor x it is known that

ϕ(h/n, (log η)/n) = √{ [h(log(2n/h) + 1) − log(η/4)] / n },

and that the VC dimension is h = p + 1. You can now directly check that

∂/∂h { [h(log(2n/h) + 1) − log(η/4)] / n } = (1/n) log(2n/h) > 0

as long as h < 2n, which confirms the monotone increasing property stated above. In general, the VC dimension of a given set of functions is equal to the maximal number of points that can be separated in all possible ways by that set of functions.

At first glance, the “more rich” the function class F, the better the classification rule would be. Indeed, you can construct a classifier that has zero classification error on the training set. However, this classifier will be too specialised for the given training set, with no ability to generalise to other sets. Hence such a classifier would be undesirable. “More rich” is tantamount to requiring greater complexity of F, or equivalently a higher value of h (and therefore of ϕ). The term ϕ(h/n, (log η)/n) can be considered a penalty for the excessive complexity of the classifier function. You can see directly that the derivative ∂ϕ(h/n, (log η)/n)/∂h ≥ 0 if and only if 2n ≥ h. For large enough n this means that the function ϕ is increasing with the complexity of the model. Hence the sum of the two terms R̂(f) (precision) and ϕ(h/n, (log η)/n) (complexity) represents the compromise between precision in the risk estimation and the complexity of the classifier. Therefore minimising this sum is the sensible thing to do in order to perform “optimally”.
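To get a feel for the complexity penalty, ϕ can be evaluated numerically from the formula above; the values of n and η below are arbitrary choices for illustration.

```python
import math

def phi(h, n, eta):
    """Complexity penalty phi(h/n, log(eta)/n) for VC dimension h,
    sample size n, and confidence level 1 - eta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

n, eta = 1000, 0.05
# For a linear rule in p dimensions, h = p + 1; the penalty grows with h:
for p in (2, 9, 99):
    print(p + 1, round(phi(p + 1, n, eta), 3))

# Monotone increasing in h (for h < 2n), and decreasing in n, as claimed:
assert phi(3, n, eta) < phi(10, n, eta) < phi(100, n, eta)
assert phi(10, 4 * n, eta) < phi(10, n, eta)
```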
The rest of the lecture focuses on ways to solve (or solve approximately) this minimisation problem for some classes F. For additional information, see Section 19.4 of

Härdle, W. and Simar, L., Applied Multivariate Statistical Analysis, Third Edition, Springer, 2012.

A treatment along similar lines can also be found in the e-book (available from the library)

Hastie, T., Friedman, J. and Tibshirani, R., The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition, Springer, 2009.

13.3 Basic idea of SVMs

A linear classifier is one that, given a feature vector x_new and weights w, classifies y_new based on the value of w⊤x_new; for example,

ŷ_new = +1 if w⊤x_new + b > 0, and −1 if w⊤x_new + b < 0,

for a threshold −b. Here, we see that every element x_i of x gets a weight w_i: the sign of w_i determines whether increasing x_i pushes the prediction toward y = −1 or y = +1, and the magnitude of w_i determines how strongly. The regions of x for which the model predicts +1 as opposed to −1 are separated by w⊤x + b = 0. Points x that satisfy that equation exactly form a line (if d = 2), a plane (if d = 3), or a hyperplane (if d > 3). We call the data linearly separable if a hyperplane that separates them exists. Let us focus on this linearly separable case (and consider the nonseparable case later).

[Figure: points of the two classes, y = −1 and y = +1, in the (x₁, x₂) plane, with a separating line w⊤x + b = 0; the normal vector w is perpendicular to the line, which lies at distance −b/∥w∥ from the origin.]

Now, usually, there are infinitely many different hyperplanes which could be used to separate a linearly separable dataset. We therefore have to define the “best” one.
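The linear classification rule just described amounts to a single dot product followed by a sign; a minimal Python sketch (the weights below are made up for illustration):

```python
# Linear classifier: yhat = +1 if w'x + b > 0, and -1 if w'x + b < 0.

def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1

# Illustrative weights: the decision boundary is the line x1 + x2 = 3.
w, b = [1.0, 1.0], -3.0

print(predict(w, b, [2.5, 2.0]))  # above the line -> +1
print(predict(w, b, [0.5, 1.0]))  # below the line -> -1
```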
The “best” choice can be regarded as the middle of the widest empty strip (or higher-dimensional analogue) between the two classes, one that maximises the margin |b₊ − b₋|/∥w∥ in the following illustration:

[Figure: the separating hyperplane w⊤x + b = 0 flanked by two parallel “outer” hyperplanes w⊤x + b₊ = 0 and w⊤x + b₋ = 0 touching the two classes; the empty strip between them has width |b₊ − b₋|/∥w∥.]

=⇒ We want to make the margin |b₊ − b₋|/∥w∥ as big as possible.

The scale of w and b is arbitrary: for arbitrary α ≠ 0, any x that satisfies w⊤x + b = 0 also satisfies (αw)⊤x + (αb) = α(w⊤x + b) = 0, so (w, b) and (αw, αb) define the same plane. We fix |b₊ − b| = |b₋ − b| = 1 and only vary w: our “outer” hyperplanes become

w⊤x + (b − 1) = 0,
w⊤x + (b + 1) = 0.

Then, the margin of |b₊ − b₋|/∥w∥ = 2/∥w∥ is maximised by minimising ∥w∥. Therefore, a Linear Support Vector Machine minimises ∥w∥² subject to separating the −1s and the +1s.

13.4 Estimation

13.4.1 Linear SVM: Separable Case

We write the boundaries of the empty region as

w⊤x + (b − 1) = 0 =⇒ w⊤x + b = +1,
w⊤x + (b + 1) = 0 =⇒ w⊤x + b = −1,

and observe that

ŷ_i = +1 if w⊤x_i + b > 0, and −1 if w⊤x_i + b < 0; that is, ŷ_i = sign(w⊤x_i + b).

This means that if w⊤x + b = 0 separates the −1s and +1s (i.e., y_i = ŷ_i for all i = 1, . . . , n), then

y_i(w⊤x_i + b) ≥ 1.

Therefore, a linear SVM learning task can be expressed as a constrained optimisation problem:

argmin_w (1/2)∥w∥² subject to y_i(w⊤x_i + b) ≥ 1, i = 1, . . . , n.

(Here and elsewhere, argmin_a h(a) is that a which minimises the value of h(a).) The objective is quadratic (convex) and the constraints are linear. This problem can be solved by Lagrange multipliers. The following outlines the steps and the key results.

1. Rewrite the objective function as the Lagrangian (note the use of α_i s instead of λ_i s):

Lag(w, b; α) = (1/2)∥w∥² − Σ_{i=1}^n α_i [ y_i(w⊤x_i + b) − 1 ].

2. As the constraints are inequalities rather than equalities, apply the so-called KKT (Karush–Kuhn–Tucker) conditions: the saddle point (w, b, α) : Lag′(w, b; α) = 0 will be the constrained optimum if α_i ≥ 0, i = 1, . . .
, n. Thus, our goal becomes to solve Lag′(w, b; α) = 0 subject to α_i ≥ 0.

3. Set the derivatives of Lag with respect to w and b equal to zero:

∂Lag/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0 =⇒ w = Σ_{i=1}^n α_i y_i x_i,
∂Lag/∂b = −Σ_{i=1}^n α_i y_i = 0 =⇒ Σ_{i=1}^n α_i y_i = 0.

4. Note, also, that

y_i(w⊤x_i + b) − 1 ≥ 0, i = 1, . . . , n,
α_i ( y_i(w⊤x_i + b) − 1 ) = 0, i = 1, . . . , n,

for some α_i ≥ 0, i = 1, . . . , n. Notice that the second equation implies that either α_i = 0 or y_i(w⊤x_i + b) = 1 (or both). But that means that if α_i ≠ 0, the observation lies on the corresponding hyperplane and is known as a support vector.

Dual Optimisation Problem Substituting the expression of w in terms of α and expanding ∥w∥², we get the dual problem:

Lag_D(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i⊤x_j,

to be maximised subject to

α_i ≥ 0, i = 1, . . . , n, and Σ_{i=1}^n α_i y_i = 0.

This is a quadratic programming problem, for which many software tools are available.

13.4.2 Linear SVM: Nonseparable Case

Of course, in real-world problems, it is often not possible to find hyperplanes which perfectly separate the target classes. The soft margin approach considers a trade-off between margin width and the number of training misclassifications. Slack variables ξ_i ≥ 0 are included in the constraints: we insist that

y_i(w⊤x_i + b) ≥ 1 − ξ_i. (13.1)

The optimisation then becomes

argmin_{w,ξ} { (1/2)∥w∥² + C Σ_{i=1}^n ξ_i } subject to y_i(w⊤x_i + b) ≥ 1 − ξ_i, i = 1, . . . , n,

for a tuning constant C. Small C means a lot of slack, whereas large C means little slack. In particular, if we set C = ∞, we require separation to be perfect: a hard margin. Now, taking (13.1) and solving for ξ_i gives ξ_i ≥ 1 − y_i(w⊤x_i + b). We want to make ξ_i as small as possible, so (keeping in mind that ξ_i ≥ 0) we can set ξ_i = max{0, 1 − y_i(w⊤x_i + b)}.

Dual Optimisation Problem The Lagrangian is now (with additional multipliers µ)

Lag(w, b, ξ; α, µ) = (1/2)∥w∥² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i [ y_i(w⊤x_i + b) − 1 + ξ_i ] − Σ_{i=1}^n µ_i ξ_i.
Now,

∂Lag/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0 =⇒ w = Σ_{i=1}^n α_i y_i x_i,
∂Lag/∂b = −Σ_{i=1}^n α_i y_i = 0 =⇒ Σ_{i=1}^n α_i y_i = 0,
∂Lag/∂ξ = C1_n − α − µ = 0 =⇒ C − α_i − µ_i = 0, i = 1, . . . , n,

with additional KKT conditions for i = 1, . . . , n:

α_i ≥ 0, µ_i ≥ 0, α_i ( y_i(w⊤x_i + b) − 1 + ξ_i ) = 0.

Substituting into the Lagrangian leads to

Lag_D(α, µ) = Σ_{i=1}^n α_i − (1/2) Σ_{j=1}^n Σ_{k=1}^n α_j α_k y_j y_k (x_j⊤x_k) + Σ_{i=1}^n ξ_i (C − α_i − µ_i).

But C − α_i − µ_i = 0, so as long as α_i ≤ C, µ_i ≥ 0 is completely determined by α_i, and we get a dual problem

argmax_α Σ_{i=1}^n α_i − (1/2) Σ_{j=1}^n Σ_{k=1}^n α_j α_k y_j y_k (x_j⊤x_k)

subject to Σ_{i=1}^n α_i y_i = 0 and 0 ≤ α_i ≤ C, i = 1, . . . , n.

We can also express the prediction in two ways:

Primal: ŷ(x) = sign(w⊤x + b), (13.2)
Dual: ŷ(x) = sign{ Σ_{j=1}^n α_j y_j (x_j⊤x) + b }. (13.3)

The primal (w) form requires d parameters, while the dual (α) form requires n parameters. This means that for high-dimensional problems—those with d ≫ n, a huge number of predictors—the dual representation can be more efficient. But it gets better! Notice that only the x_i s closest to the separating hyperplane—those with α_j > 0—matter in determining ŷ(x), so most of them will have no effect. Thus, computationally, the effective “n” will actually be much smaller than the sample size, so the above condition can be met far more often than one might expect. Again, those x_i s that “support” the hyperplane are called support vectors.

In addition, notice that the dual form only depends on the (x_j⊤x_k)s. This opens the door to nonlinear SVMs.

13.5 Nonlinear SVMs

Consider the following situation. [Figure: a scatterplot in the (x₁, x₂) plane in which one class forms a ring around the other.] The true classification for these points is

y = +1 if x₁² + x₂² > 0.75², and −1 if x₁² + x₂² < 0.75²,

but one can hardly draw a line separating them. What we can do is transform x so that a linear decision boundary can separate them. In this case, suppose we augmented our x with squared terms: (x₁, x₂) → (x₁, x₂, x₁², x₂²).

[Figure: a pairs plot of the augmented features x₁, x₂, x₁², x₂², in which the two classes become linearly separable.]
Now, a linear separator exists! Better yet, recall that the dual form (13.3) depends only on the dot products x_i⊤x_j. However, we can specify other kernels k(x_i, x_j). For example, a “kernel” function of the form k(u, v) = (u⊤v + 1)² can be regarded as a dot product:

(u⊤v + 1)² = u₁²v₁² + u₂²v₂² + 2u₁u₂v₁v₂ + 2u₁v₁ + 2u₂v₂ + 1
= (u₁², u₂², √2 u₁u₂, √2 u₁, √2 u₂, 1)⊤ (v₁², v₂², √2 v₁v₂, √2 v₁, √2 v₂, 1),

whose implicit feature map contains the above augmentation (along with a cross term and a constant). In general, kernel functions can be expressed in terms of high-dimensional dot products. Computing dot products via kernel functions is computationally “cheaper” than using the transformed attributes directly.

A common type of kernel is a radial basis function: a function of the distance from the origin, or from another fixed point v. Usually, the distance is Euclidean, i.e.,

∥u − v∥ = √{ (u₁ − v₁)² + ··· + (u_p − v_p)² }.

A common radial basis function is the Gaussian:

ϕ(u, v) = exp(−γ∥u − v∥²).

We can use ϕ(·, ·) as our SVM kernel.

13.6 Multiple classes

Finally, we briefly consider the problem when there are more than two classes. Suppose that there are K > 2 categories. Recall that w⊤x_i gives us a “score” that we normally compare to b. However, we do not have to do so. Instead, for each k = 1, . . . , K, we can fit a separate SVM (i.e., w_k and b_k) for whether an observation is in class k vs. not. We can then predict ŷ_new by evaluating w_k⊤x_new + b_k for each k and taking the biggest one. This is called the one-against-rest approach.

A computationally more expensive approach that tends to perform better is the one-against-one: for every distinct pair k₁, k₂ = 1, . . . , K, fit an SVM for k₁ vs. k₂, and predict the “winner” of all the rounds (if any). This requires fitting K(K − 1)/2 binary classifiers, but to smaller datasets.
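In practice the soft-margin problem of Section 13.4.2 is solved by quadratic programming on the dual (e.g., by e1071 in R), but for a small illustration it can also be attacked directly by subgradient descent on the equivalent unconstrained primal objective (1/2)∥w∥² + C Σ_i max{0, 1 − y_i(w⊤x_i + b)}. The following Python sketch uses a made-up, linearly separable toy dataset and arbitrary tuning choices; it is not a production implementation.

```python
# Soft-margin linear SVM fit by subgradient descent on the primal objective
# 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w'x_i + b)).

def fit_svm(xs, ys, C=10.0, lr=0.002, steps=20000):
    p = len(xs[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        # Subgradient of the objective at the current (w, b).
        gw, gb = list(w), 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active for this observation
                for j in range(p):
                    gw[j] -= C * y * x[j]
                gb -= C * y
        w = [wj - lr * gwj for wj, gwj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# A small linearly separable toy dataset.
xs = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.5), (3.0, 2.0),
      (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
ys = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = fit_svm(xs, ys)
print([predict(w, b, x) for x in xs] == ys)  # perfect training classification
```

With a separable dataset and large C, the fitted (w, b) approaches the hard-margin solution; the slow, fixed step size trades speed for stability of this crude scheme.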
13.7 SVM specification and tuning

Categorical data can be handled by introducing binary dummy variables to indicate each possible value. When fitting an SVM, the user must specify some control parameters; these include the cost constant C for the slack variables, the type of kernel function, and its parameters. Unlike the more probabilistic forms of classification, it is difficult to predict the out-of-sample classification error for SVMs, so cross-validation is used.

The following kernel functions are available via the R e1071 package:

linear: u⊤v
polynomial: (γu⊤v + c₀)^p
radial basis: exp(−γ∥u − v∥²)
sigmoid: tanh(γu⊤v + c₀)

for constants γ, p, and c₀.

13.8 Examples

Example 13.1. SVM classification for the Edgar Anderson's Iris data, and using ROC curves.

13.9 Conclusion

We conclude with a brief discussion of the advantages and disadvantages of SVMs. SVM training can be formulated as a convex optimisation problem, with efficient algorithms for finding the global minimum, and the final result involves support vectors rather than the whole training set. This is both a computational benefit and a robustness benefit: outliers have less effect than in other methods. On the other hand, SVMs are much more difficult to interpret than model-based classification techniques like linear discriminant analysis. Furthermore, SVMs do not actually provide class probability estimates. These can, however, be estimated by cross-validation.

14 Cluster Analysis

14.1 “Classical” 96
14.1.1 Components 96
14.1.2 Example: K-means 97
14.1.3 Extension: K-medoids 98
14.1.4 Hierarchical clustering 98
14.1.5 Software 99
14.1.6 Assessing 100
14.1.7 Examples 100
14.2 Model-based clustering 101
14.2.1 Mixture Models 101
14.2.2 Multivariate normal clusters 102
14.2.3 Model selection 103
14.2.4 Software 104
14.2.5 Examples 104
14.2.6 Expectation–Maximisation Algorithm 104
14.3 Additional resources 106

The goal of cluster analysis is to identify groups in data. In contrast to SVMs and discriminant analysis, no preexisting group labels are provided. This makes it an example of unsupervised learning. The input of cluster analysis is therefore an unlabelled sample x₁, . . . , x_n ∈ R^p, and the output is a grouping of the observations such that more similar (in some sense) observations are placed in the same group. That is, cluster analysis assigns to each x_i a group index G_i ∈ {1, . . . , K} such that if G_i = G_j, x_i and x_j are “on average” more similar in some sense than if G_i ≠ G_j.

Throughout this lecture, we will use the following additional notation.

G = (G₁, . . . , G_n)⊤: a column vector of cluster memberships.

S₁, . . . , S_K: a partitioning of the observations {1, . . . , n} into K non-overlapping sets such that for every i ∈ S_k, G_i = k.

S = (S₁, . . . , S_K): a shorthand for the clustering expressed in terms of sets.
We will consider a taxonomy of approaches to clustering. The “classical” approach is to specify an algorithm that assigns observations to clusters. (Often, but not always, an objective function may be defined that is optimised by the algorithm.) Classical approaches can be further subdivided into hierarchical clustering, which produces a hierarchy of nested clusterings in a tree which has observations as leaves; and non-hierarchical, which merely assigns a label to each point. The model-based approach to clustering is to postulate a mixture model—a model consisting of a mixture of probability distributions with different location parameters. The parameters of this model embody information about the clusters (e.g., their means and frequencies), and estimating them enables probabilistic, or soft, clusterings. We discuss these approaches in turn.

14.1 “Classical”

14.1.1 Components

In order to cluster data—particularly multivariate data—we must first define a proximity measure: some function d(x₁, x₂) that quantifies the difference between two observations. (Equivalently, we can define a similarity score and negate or invert it.) Here are some common measures:

Euclidean: ∥x₁ − x₂∥ = √{ Σ_{j=1}^p (x_{1j} − x_{2j})² }, the “ordinary” straight-line distance.

taxicab/Manhattan: ∥x₁ − x₂∥₁ = Σ_{j=1}^p |x_{1j} − x_{2j}|, the distance if one is only allowed to travel parallel to the axes (like a taxicab on the Manhattan city grid).

Gower: p⁻¹ Σ_{j=1}^p I(x_{1j} ≠ x_{2j}), for binary measurements.

A metric should be substantively meaningful and appropriate for the data. It is also common to scale all of the dimensions (say, to have variance 1 or to lie between 0 and 1) before clustering. Given these distances, we specify an algorithm that minimises within-cluster and maximises between-cluster distances in some sense—that sense often operationalised in an objective function.

14.1.2 Example: K-means

Perhaps the best known clustering algorithm is the K-means.
It has the advantage of being simple and intuitive. The objective function that it ultimately minimises (over the partitioning S = (S₁, . . . , S_K)) is

Σ_{k=1}^K (1/(2|S_k|)) Σ_{i,j∈S_k} ∥x_i − x_j∥²,

the sum of squared Euclidean distances between every distinct pair of observations within each cluster (appropriately scaled). It can be shown (using a decomposition similar to that of ANOVA) that this is equivalent to minimising

Σ_{k=1}^K Σ_{i∈S_k} ∥x_i − x̄_{S_k}∥², where x̄_{S_k} = (1/|S_k|) Σ_{i∈S_k} x_i,

which is simply the sum of the squared Euclidean distances between each data point and the mean of its cluster. The following algorithm often does a good job finding such a clustering:

1. Randomly assign a cluster index to each element of G^(0).

2. Calculate the cluster means (centroids): x̄_{S_k^(t−1)} = (1/|S_k^(t−1)|) Σ_{i∈S_k^(t−1)} x_i, k = 1, . . . , K.

3. Calculate the distance of each data point from each mean: d_{ik} = ∥x_i − x̄_{S_k^(t−1)}∥, i = 1, . . . , n, k = 1, . . . , K.

4. Reassign each point to its nearest mean: G_i^(t) = argmin_k d_{ik}. (Here and elsewhere, argmin_a h(a) is that a which minimises the value of h(a).)

5. Repeat from Step 2 until G^(t) = G^(t−1).

14.1.3 Extension: K-medoids

A generalisation of K-means is the K-medoids technique. We define a medoid x̃_{S_k} of cluster k to be a specific observation that has the smallest summed distance (however defined) to all other observations in S_k:

x̃_{S_k} = argmin_{x_j} Σ_{i∈S_k} d(x_j, x_i).

The method of K-medoids, or partitioning around medoids (PAM), minimises the sum of these distances:

argmin_S Σ_{k=1}^K Σ_{i∈S_k} d(x_i, x̃_{S_k}).

This method is much more expensive computationally than K-means, but it is also more robust to outliers. It is typically fit as follows:

1. Randomly assign a cluster index to each element of G^(0).

2. Calculate the cluster medoids: x̃_{S_k^(t−1)} = argmin_{x_j} Σ_{i∈S_k^(t−1)} d(x_j, x_i), k = 1, . . . , K.

3. Calculate the distance of each data point from each medoid: d_{ik} = d(x_i, x̃_{S_k^(t−1)}), i = 1, . . . , n, k = 1, . . .
, K.

4. Reassign each point to its nearest medoid: G_i^(t) = argmin_k d_{ik}.

5. Repeat from Step 2 until G^(t) = G^(t−1).

14.1.4 Hierarchical clustering

Hierarchical clustering, instead of partitioning the data into K groups, produces a hierarchy of clusterings whose sizes range from 1 (no splits) to as high as n (every observation its own cluster). This clustering is typically visualised in a dendrogram: a tree diagram whose branching represents subdivisions of the data into clusters and whose height represents the distances between points or clusters.

The algorithms for producing these clusterings are either agglomerative, in that they start with each observation in its own cluster, then combine the nearest observations into clusters, the nearest clusters into bigger clusters, etc.; or divisive, starting with the whole dataset, then splitting it into a small number of clusters, those clusters into smaller clusters, etc. The former require defining a notion of distance between clusters. The latter require defining a criterion based on which a cluster is split. Some common examples of between-cluster distances are provided in the following table:

Single linkage: d(S₁, S₂) = min{ d(x_i, x_j) : i ∈ S₁, j ∈ S₂ }

Complete linkage: d(S₁, S₂) = max{ d(x_i, x_j) : i ∈ S₁, j ∈ S₂ }

Average linkage (unweighted): d(S₁, S₂) = (1/(|S₁||S₂|)) Σ_{i∈S₁} Σ_{j∈S₂} d(x_i, x_j)

Average linkage (weighted): d(S₁ ∪ S₂, S₃) = [ d(S₁, S₃) + d(S₂, S₃) ] / 2

Centroid: d(S₁, S₂) = ∥x̄_{S₁} − x̄_{S₂}∥

Ward: d(S₁, S₂) = Σ_{i∈S₁∪S₂} ∥x_i − x̄_{S₁∪S₂}∥² − Σ_{i∈S₁} ∥x_i − x̄_{S₁}∥² − Σ_{i∈S₂} ∥x_i − x̄_{S₂}∥² = (|S₁||S₂| / (|S₁| + |S₂|)) ∥x̄_{S₁} − x̄_{S₂}∥²

A framework that is useful for expressing different between-cluster distances is the Lance–Williams framework. Given three clusters S₁, S₂, and S₃, suppose that we have some metric for evaluating the pairwise distances between them, i.e., d(S₁, S₂), d(S₁, S₃), and d(S₂, S₃).
Then, we define the distance resulting from combining S₁ and S₂ in terms of these pairwise distances and coefficients α₁, α₂, β, and γ:

d(S₁ ∪ S₂, S₃) = α₁ d(S₁, S₃) + α₂ d(S₂, S₃) + β d(S₁, S₂) + γ |d(S₁, S₃) − d(S₂, S₃)|.

This, plus the distance metric between individual points (which applies when the clusters have only one observation in them), allows us to define and efficiently calculate distances between clusters. For example, the unweighted average linkage can be expressed in this framework as follows:

d(S₁ ∪ S₂, S₃) = (1/(|S₁ ∪ S₂||S₃|)) Σ_{i∈S₁∪S₂} Σ_{j∈S₃} d(x_i, x_j)
= (1/((|S₁| + |S₂|)|S₃|)) [ Σ_{i∈S₁} Σ_{j∈S₃} d(x_i, x_j) + Σ_{i∈S₂} Σ_{j∈S₃} d(x_i, x_j) ]
= [ |S₁||S₃| d(S₁, S₃) + |S₂||S₃| d(S₂, S₃) ] / ((|S₁| + |S₂|)|S₃|)
=⇒ α₁ = |S₁| / (|S₁| + |S₂|), α₂ = |S₂| / (|S₁| + |S₂|), β = γ = 0.

Ward's method—the most popular hierarchical clustering criterion—similarly uses the squared Euclidean distances d(x_i, x_j) = ∥x_i − x_j∥² between points and then

α₁ = (|S₁| + |S₃|) / (|S₁| + |S₂| + |S₃|), α₂ = (|S₂| + |S₃|) / (|S₁| + |S₂| + |S₃|), β = −|S₃| / (|S₁| + |S₂| + |S₃|), γ = 0.

Ward's method joins the groups that will increase the within-group variance least.

14.1.5 Software

SAS:

Hierarchical: PROC CLUSTER (PROC TREE to visualise, PROC DISTANCE to preprocess), PROC VARCLUS

Non-hierarchical: PROC FASTCLUS, PROC MODECLUS, PROC FASTKNN

R:

Hierarchical: stats::hclust, cluster::agnes

Non-hierarchical: stats::kmeans, cluster::pam

• Many others

14.1.6 Assessing

Lastly, we briefly discuss how a clustering G may be assessed. Ideally, this measurement should be “fair” to the number of clusters K. For example, in K-means clustering, splitting a cluster will always reduce the within-cluster variances, and so those cannot be used as a criterion. A popular method, inspired by K-medoid clustering, is the silhouette. For each i = 1, . . .
, n, let

a(i) = (1/(|S_{G_i}| − 1)) Σ_{j∈S_{G_i}} d(x_i, x_j),
b(i) = min_{k≠G_i} (1/|S_k|) Σ_{j∈S_k} d(x_i, x_j).

Observe that a(i) is the average distance between i and the other observations in its own cluster, and b(i) is the average distance between i and the observations in the nearest cluster to which i does not belong. In a good clustering, each observation will be much closer to its own cluster than to its neighbouring cluster, so b(i) ≫ a(i). Then, the silhouette of i is a value between −1 and +1 calculated as follows:

s(i) = [b(i) − a(i)] / max{a(i), b(i)} if |S_{G_i}| > 1, and 0 otherwise.

That is, s(i) evaluates how much closer i is to the rest of its cluster than it is to its nearest neighbouring cluster, and a higher silhouette indicates a better clustering for point i. The mean silhouette n⁻¹ Σ_{i=1}^n s(i) then measures the overall quality of the clustering.

14.1.7 Examples

Example 14.1. Hierarchical and non-hierarchical clustering and assessment illustrated on the Edgar Anderson's Iris data.

14.2 Model-based clustering

14.2.1 Mixture Models

Lastly, we turn to model-based clustering. We will discuss the theoretical underpinnings of this approach—mixture models—and the important special case of Gaussian clustering and its parametrisation. The Expectation–Maximisation algorithm, often used to estimate these models, will also be described, as it is useful in a wide variety of circumstances, but it is not examinable.

A finite mixture model is a probability model under which each observation comes from one of several distributions, but we do not observe from which one. (Infinite mixture models exist as well, but they are outside the scope of this class.) A mixture model is specified as follows. We set K to be the number of distributions (clusters), and specify a collection of K density functions on the support of x_i, f_k(x_i; θ_k) (for k = 1, . . . , K), each having a parameter vector θ_k (e.g., its expectation), which we do not know and must estimate.
We also postulate K (unknown) probabilities π_k that an observation (any observation) comes from cluster k. (The standard restrictions apply: 0 ≤ π_k ≤ 1, Σ_{k=1}^K π_k = 1.) For brevity, we define π = (π₁, . . . , π_K)⊤, a vector of these probabilities, and Ψ = {θ₁, . . . , θ_K, π}, the collection of all model parameters. Then, we assume the following data-generating process: for each i = 1, . . . , n,

1. Sample G_i ∈ {1, . . . , K} with Pr(G_i = k; π) = π_k.

2. Sample X_i | G_i ∼ f_{G_i}(·; θ_{G_i}).

3. Observe X_i, and “forget” G_i.

The pdf of this mixture density is

f_{X_i}(x_i; Ψ) = Σ_{k=1}^K π_k f_k(x_i; θ_k). (14.1)

We wish to estimate the parameters Ψ from the sample x = [x₁, . . . , x_n]. This leads to the likelihood

L_x(Ψ) = Π_{i=1}^n Σ_{k=1}^K π_k f_k(x_i; θ_k). (14.2)

This formulation is convenient for a number of reasons. It is a probability model for the X_i s, and therefore we can use it to obtain a soft clustering: rather than a hard clustering that assigns a point to a single cluster, we can apportion an observation's membership by how likely it is to have come from each cluster. An application of Bayes's rule and (14.1) gives

Pr(G_i = k | x_i; Ψ) = π_k f_k(x_i; θ_k) / Σ_{k′=1}^K π_{k′} f_{k′}(x_i; θ_{k′}).

We can also embed it into a hierarchical model (a meaning distinct from the hierarchical clustering above), in which either the x_i s are parameters for some model for the data or for the observation process, or the θs are functions of some hyper-parameters. Lastly, the fact that we have a well-defined likelihood facilitates model selection.

14.2.2 Multivariate normal clusters

As with other analysis scenarios discussed in this course, the multivariate normal distribution provides a useful formulation for the clusters. Consider the following parametrisation:

f_k(x_i; θ_k) = (2π)^{−p/2} |Σ(θ_k)|^{−1/2} exp{ −(1/2) (x_i − µ(θ_k))⊤ Σ(θ_k)^{−1} (x_i − µ(θ_k)) }.

Here, µ(θ_k) is the mean vector of cluster k (e.g., the first p elements of θ_k), and Σ(θ_k) is the model for the variances.
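The soft-clustering probabilities given by Bayes's rule above are easy to evaluate. For a toy univariate two-component normal mixture (the parameter values below are made up for illustration), the posterior membership probabilities can be sketched in Python:

```python
import math

def dnorm(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two-component mixture: pi = (0.6, 0.4), means 0 and 5, unit variances.
pis = (0.6, 0.4)
mus = (0.0, 5.0)

def posterior(x):
    """Pr(G = k | x), k = 1, 2, by Bayes's rule applied to the mixture pdf."""
    num = [pi * dnorm(x, mu, 1.0) for pi, mu in zip(pis, mus)]
    tot = sum(num)
    return [n / tot for n in num]

q = posterior(0.3)
print(q)  # an observation near 0 almost surely belongs to cluster 1
```

The two probabilities always sum to 1, and an observation midway between the means is apportioned between the clusters rather than assigned to a single one.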
We may also have different clusters “share” elements of θ, and a more general case is

f_k(x_i; θ) = (2π)^{−p/2} |Σ_k(θ)|^{−1/2} exp{ −(1/2) (x_i − µ_k(θ))⊤ Σ_k(θ)^{−1} (x_i − µ_k(θ)) }, (14.3)

where µ_k(θ) and Σ_k(θ) “extract” the appropriate elements from θ.

One advantage of multivariate normal clusters is the flexibility in specifying cluster size and shape. (Recall your exercises from Week 1.) Recall the eigendecomposition of the covariance matrix Σ = PΛP⊤, with P orthogonal and Λ diagonal and nonnegative. Let us further parametrise it as Σ = λPAP⊤, with P ∈ M_{p,p} orthogonal, A ∈ M_{p,p} diagonal and nonnegative with |A| = 1 (unimodular), and scalar λ > 0. This allows us to interpret the structure of the matrix in simple, substantive terms.

Starting with λ, recall that the determinant of a matrix can be viewed as its volume. Then, |Σ| = λ^p |P||A||P⊤| = λ^p, which makes λ the “spread”, “size”, or “volume” of the cluster.

To interpret the diagonal, unimodular matrix A, observe that if A = I_p, then Σ = λPAP⊤ = λPP⊤ = λI_p, making the cluster spherical—equal variances on all dimensions. Similarly, if some diagonal elements of A are much larger than others, then the cluster will be an ellipsoid more stretched in some directions than in others.

Lastly, observe that if P = I_p, then Σ = λPAP⊤ = λA, an ellipsoid whose axes are parallel to the coordinate axes, implying that the elements of X_i within each cluster are uncorrelated, with unequal variances. More generally, P controls the rotation of the ellipsoid—the correlation between the dimensions and the orientation of the cluster.

When it comes to estimating K clusters, we can permit the λs, the As, and the P s to vary between the clusters, be constant between the clusters, or, for A and P, be fixed at the identity. Each combination embodies a different assumption about the shape of and the relationship between the clusters; and, in general, the more we permit to vary, the more parameters we must estimate and the more data we therefore require. Generally,

1.
For a mixture of $K$ clusters, we must, invariably, estimate the cluster membership probabilities $\pi_1, \ldots, \pi_K$ ($K - 1$ parameters) and cluster means $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$ ($Kp$ parameters).

2. Then, $\lambda$ can be constrained $\lambda_1 = \lambda_2 = \cdots = \lambda_K$ (1 parameter) or allowed to vary ($K$ parameters).

3. Then, $A$ can be fixed $A_1 = A_2 = \cdots = A_K = I_p$ (0 parameters), constrained $A_1 = A_2 = \cdots = A_K$ ($p - 1$ parameters), or allowed to vary ($K(p - 1)$ parameters).

4. Lastly, if $A$ is not fixed at the identity matrix, $P$ can either be fixed $P_1 = P_2 = \cdots = P_K = I_p$ (0 parameters), constrained $P_1 = P_2 = \cdots = P_K$ ($\binom{p}{2}$ parameters), or allowed to vary ($K\binom{p}{2}$ parameters).

The different cluster shapes, identified by their constraint triple $(\lambda, A, P)$, encoding being fixed at identity as I, being constrained to equality between clusters as E, and being allowed to vary freely as V, are given in the following figure:

(Figure incorporated under the terms of the Creative Commons Attribution 3.0 Unported license from Figure 2 of: Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery (2016). mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal 8:1, pages 289-317.)

14.2.3 Model selection

As mentioned before, model-based clustering requires one to specify both the number of clusters $K$ and the within-cluster models $f_k(\boldsymbol{x}_i; \Psi)$. In the case of multivariate normal clustering, we have a large number of possible specifications for the $\Sigma_k$s, and the number of parameters can grow quickly for "XXV" models (those with freely varying orientations $P_k$) in particular. At the same time, because it is likelihood-based, a variety of standard model-selection techniques can be used. For example, BIC is recommended:
$$\mathrm{BIC}_\nu = -2 \log L_{\boldsymbol{x}}(\hat\Psi) + \nu \log n,$$
where $\nu$ is the number of parameters estimated. (Here, lower BIC is better, but some authors and software packages use $2 \log L_{\boldsymbol{x}}(\hat\Psi) - \nu \log n$, with higher BIC being better.) Substantive considerations also matter.
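The parameter tallies above can be collected into a short helper. The sketch below (Python, illustrative only; the function names are hypothetical, and the I/E/V codes for the $(\lambda, A, P)$ triple follow the encoding described above) also computes the corresponding BIC:

```python
import math

def n_params(K, p, lam, A, P):
    """Count estimated parameters for a K-cluster, p-dimensional normal
    mixture with constraint codes: lam in {"E","V"}; A, P in {"I","E","V"}."""
    nu = (K - 1) + K * p                  # mixing probabilities and means
    nu += 1 if lam == "E" else K          # volumes lambda_k
    if A == "E":
        nu += p - 1                       # one shared unimodular diagonal
    elif A == "V":
        nu += K * (p - 1)
    if A != "I":                          # orientation matters only if non-spherical
        if P == "E":
            nu += p * (p - 1) // 2        # one shared rotation: p-choose-2
        elif P == "V":
            nu += K * p * (p - 1) // 2
    return nu

def bic(loglik, nu, n):
    """BIC_nu = -2 log L(Psi-hat) + nu log n (lower is better here)."""
    return -2.0 * loglik + nu * math.log(n)

# For K = 3 clusters in p = 2 dimensions: spherical with equal volumes
# ("EII") needs 9 parameters, while the fully free "VVV" model needs 17.
```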
For example, how many clusters does our research hypothesis predict? Do we expect correlations between dimensions to vary between clusters?

14.2.4 Software

SAS: PROC MBC
R: package mclust and others

14.2.5 Examples

Example 14.2. Model-based clustering and model selection illustrated on Edgar Anderson's Iris data.

14.2.6 Expectation–Maximisation Algorithm

Lastly, we discuss the typical computational approach for estimating these mixture models. The $\log L(\Psi)$ in (14.2) is computationally tractable, but it does not simplify or decompose much, because while the logarithm of a product is a sum of the logarithms, the logarithm of a sum does not, in general, simplify further. Thus, we introduce the Expectation–Maximisation (EM) algorithm:

1. Introduce an unobserved (latent) variable $G_i$, $i = 1, \ldots, n$ giving the cluster membership of $i$.

2. Suppose that $G_1, \ldots, G_n$ are observed; then, the complete-data likelihood is
$$L_{\boldsymbol{x}, G_1, \ldots, G_n}(\Psi) = \prod_{i=1}^{n} \pi_{G_i} f_{G_i}(\boldsymbol{x}_i; \boldsymbol{\theta}_{G_i}):$$
we "know" the exact cluster from which each observation came, so we no longer have to sum over the possible clusters. Then, the log-likelihood decomposes into two summations:
$$\log L_{\boldsymbol{x}, G_1, \ldots, G_n}(\Psi) = \sum_{i=1}^{n} \log \pi_{G_i} + \sum_{i=1}^{n} \log f_{G_i}(\boldsymbol{x}_i; \boldsymbol{\theta}_{G_i}), \qquad (14.4)$$
one that depends only on the $\pi_k$s and the other only on the $\boldsymbol{\theta}_k$s.

3. Start with an initial guess $\Psi^{(0)}$.

4. Iterate the E-step and M-step described below to convergence.

E-step. The Expectation step consists of starting with a parameter guess $\Psi^{(t-1)}$ and evaluating
$$Q(\Psi \mid \Psi^{(t-1)}) = \mathrm{E}_{G_1, \ldots, G_n \mid \boldsymbol{x}; \Psi^{(t-1)}}\left(\log L_{\boldsymbol{x}, G_1, \ldots, G_n}(\Psi)\right):$$
the expected value of the complete-data log-likelihood. We can evaluate it by calculating (using Bayes's rule)
$$q_{ik}^{(t-1)} = \Pr(G_i = k \mid \boldsymbol{x}; \Psi^{(t-1)}) = \frac{\pi_k^{(t-1)} f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k^{(t-1)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t-1)} f_{k'}(\boldsymbol{x}_i; \boldsymbol{\theta}_{k'}^{(t-1)})}, \quad i = 1, \ldots, n, \quad k = 1, \ldots, K,$$
then substituting them in as
$$Q(\Psi \mid \Psi^{(t-1)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k + \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k). \qquad (14.5)$$
Observe that, like (14.4), (14.5) decomposes into a summation that depends only on the $\pi_k$s and a summation that depends only on the $\boldsymbol{\theta}_k$s.

M-step. The Maximisation step then consists of maximising $Q(\Psi \mid \Psi^{(t-1)})$ with respect to $\Psi$ to obtain the next parameter guess:
$$\Psi^{(t)} = \operatorname*{argmax}_{\Psi} Q(\Psi \mid \Psi^{(t-1)}), \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k = 1.$$
Conveniently, the form (14.5) separates the $\pi_k$s from the $\boldsymbol{\theta}_k$s, and so we can maximise them separately (i.e., if we differentiate with respect to one, the summation involving the other will vanish).

Maximising (14.5) with respect to the $\boldsymbol{\theta}_k$s, we take the derivative
$$\frac{\partial Q(\Psi \mid \Psi^{(t-1)})}{\partial \boldsymbol{\theta}_k} = \sum_{i=1}^{n} q_{ik}^{(t-1)} \frac{\partial \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}_k)}{\partial \boldsymbol{\theta}_k},$$
and set it to 0. This is a weighted maximum likelihood estimator.

Maximising (14.5) with respect to the $\pi_k$s is also straightforward. We will use Lagrange multipliers to do so:
$$\mathrm{Lag}(\boldsymbol{\pi}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k - \alpha\left(\sum_{k=1}^{K} \pi_k - 1\right).$$
Differentiating,
$$\mathrm{Lag}_k'(\boldsymbol{\pi}) = \sum_{i=1}^{n} q_{ik}^{(t-1)} \pi_k^{-1} - \alpha.$$
Setting to 0,
$$\pi_k = \sum_{i=1}^{n} q_{ik}^{(t-1)} / \alpha.$$
Summing and solving for $\alpha$,
$$\sum_{k=1}^{K} \pi_k = \frac{1}{\alpha} \sum_{k=1}^{K} \sum_{i=1}^{n} q_{ik}^{(t-1)} = 1, \quad \alpha = \sum_{k=1}^{K} \sum_{i=1}^{n} q_{ik}^{(t-1)}.$$
Therefore,
$$\pi_k^{(t)} = \frac{\sum_{i=1}^{n} q_{ik}^{(t-1)}}{\sum_{k=1}^{K} \sum_{i=1}^{n} q_{ik}^{(t-1)}}.$$

"Sharing" $\boldsymbol{\theta}$s. Lastly, recall that when we select one of the "E" models in (14.3) of Section 14.2.2, we no longer have a separate $\boldsymbol{\theta}_k$ for every $f_k$. We may then need to redefine $\boldsymbol{\theta} \in \mathbb{R}^{Kp+1}$ or larger to contain parameters for all groups (separate means, distinct variance parameters, etc.), and $f_k(\boldsymbol{x}_i; \boldsymbol{\theta})$ to "extract" those elements of $\boldsymbol{\theta}$ that it needs, with $\Psi = (\boldsymbol{\theta}, \boldsymbol{\pi})$. Inferentially, $\boldsymbol{\theta}$ replaces $\boldsymbol{\theta}_k$ in all derivations above. In particular,
$$Q(\Psi \mid \Psi^{(t-1)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log \pi_k + \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta}),$$
so
$$\frac{\partial Q(\Psi \mid \Psi^{(t-1)})}{\partial \boldsymbol{\theta}} = \sum_{i=1}^{n} \sum_{k=1}^{K} q_{ik}^{(t-1)} \frac{\partial \log f_k(\boldsymbol{x}_i; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}},$$
which is still a weighted MLE, but now it is joint for all groups, and without simplification.
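For univariate normal clusters with all parameters free, both steps have closed forms, and one EM iteration can be sketched as follows (in Python, for illustration; a real implementation such as mclust adds initialisation strategies and convergence checks):

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_step(x, pis, mus, sigmas):
    """One EM iteration for a K-component univariate normal mixture."""
    # E-step: responsibilities q_ik = Pr(G_i = k | x_i; current parameters).
    dens = np.stack([pi * norm_pdf(x, m, s)
                     for pi, m, s in zip(pis, mus, sigmas)], axis=1)  # n x K
    q = dens / dens.sum(axis=1, keepdims=True)
    # M-step: pi_k is the average responsibility (alpha = n here);
    # mu_k and sigma_k are weighted MLEs with weights q_ik.
    w = q.sum(axis=0)                     # effective cluster sizes
    pis_new = w / len(x)
    mus_new = (q * x[:, None]).sum(axis=0) / w
    sigmas_new = np.sqrt((q * (x[:, None] - mus_new) ** 2).sum(axis=0) / w)
    return pis_new, mus_new, sigmas_new

# Two well-separated synthetic clusters; iterate to convergence.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 300)])
pis, mus, sigmas = np.array([0.5, 0.5]), np.array([1.0, 4.0]), np.array([1.0, 1.0])
for _ in range(50):
    pis, mus, sigmas = em_step(x, pis, mus, sigmas)
```

After iterating, the estimates settle near the generating values: means near 0 and 5 and mixing probabilities near 0.4 and 0.6.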
14.3 Additional resources

An alternative presentation of these concepts can be found in JW Sec. 12.1–12.5.

Additional software demonstration of model-based clustering can be found in Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289.

15 Copulae

15.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
15.2 Common copula types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15.2.1 Elliptical copulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15.2.2 Archimedean copulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
15.3 Margins, estimation, and simulation . . . . . . . . . . . . . . . . . . . . . . . . . 111
15.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
15.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
15.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

15.1 Formulation

For the multivariate normal, independence is equivalent to the absence of correlation between any two components. In this case the joint cdf is a product of the marginals. When independence is violated, the relation between the joint multivariate distribution and the marginals is more involved. An interesting concept that can be used to describe this more involved relation is that of the copula. We focus on the two-dimensional case for simplicity. Then a copula is a function $C : [0,1]^2 \to [0,1]$ with the properties:

i) $C(0, u) = C(u, 0) = 0$ for all $u \in [0, 1]$.

ii) $C(u, 1) = C(1, u) = u$ for all $u \in [0, 1]$.

iii) For all pairs $(u_1, u_2), (v_1, v_2) \in [0, 1] \times [0, 1]$ with $u_1 \le v_1$, $u_2 \le v_2$:
$$C(v_1, v_2) - C(v_1, u_2) - C(u_1, v_2) + C(u_1, u_2) \ge 0.$$
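These defining properties can be spot-checked numerically. The sketch below (in Python, for illustration) verifies them on a grid for the product function $C(u, v) = uv$, which indeed turns out to be a copula:

```python
import itertools
import numpy as np

def C(u, v):
    """Candidate copula: the product function C(u, v) = u*v."""
    return u * v

grid = np.linspace(0.0, 1.0, 11)

# i) Grounding: C(0, u) = C(u, 0) = 0.
assert all(C(0.0, u) == 0.0 and C(u, 0.0) == 0.0 for u in grid)

# ii) Uniform margins: C(u, 1) = C(1, u) = u.
assert all(C(u, 1.0) == u and C(1.0, u) == u for u in grid)

# iii) 2-increasing (rectangle) property:
#      C(v1, v2) - C(v1, u2) - C(u1, v2) + C(u1, u2) >= 0 whenever u <= v.
for u1, v1, u2, v2 in itertools.product(grid, repeat=4):
    if u1 <= v1 and u2 <= v2:
        mass = C(v1, v2) - C(v1, u2) - C(u1, v2) + C(u1, u2)
        assert mass >= -1e-12  # here it equals (v1 - u1)*(v2 - u2)
```

Property iii) assigns nonnegative probability mass to every rectangle, which is exactly what a bivariate cdf with uniform margins must do.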
The name is due to the implication that the copula links the multivariate distribution to its marginals. This is explicated in the following theorem:

Theorem 15.1 (Sklar's Theorem). Let $F(\cdot, \cdot)$ be a joint cdf with marginal cdfs $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$. Then there exists a copula $C(\cdot, \cdot)$ with the property
$$F(x_1, x_2) = C(F_{X_1}(x_1), F_{X_2}(x_2))$$
for every pair $(x_1, x_2) \in \mathbb{R}^2$. When $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$ are continuous, the above copula is unique. Conversely, if $C(\cdot, \cdot)$ is a copula and $F_{X_1}(\cdot)$, $F_{X_2}(\cdot)$ are cdfs, then the function $F(x_1, x_2) = C(F_{X_1}(x_1), F_{X_2}(x_2))$ is a joint cdf with marginals $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$.

Taking derivatives we also get:
$$f(x_1, x_2) = c(F_{X_1}(x_1), F_{X_2}(x_2)) f_{X_1}(x_1) f_{X_2}(x_2), \qquad (15.1)$$
where $c(u, v) = \frac{\partial^2}{\partial u \partial v} C(u, v)$ is the density of the copula. This relation clearly shows that the contribution to the joint density of $X_1, X_2$ comes from two parts: one that comes from the copula and is "responsible" for the dependence ($c(u, v)$) and another which takes into account marginal information only ($f_{X_1}(x_1) f_{X_2}(x_2)$).

It is also clear that independence implies that the corresponding copula is $\Pi(u, v) = uv$ (this is called the independence copula). These concepts also generalise to $p$ dimensions with $p > 2$.

(Figure: perspective and contour plots of the independence copula $C(u, v)$ and its density $c(u, v)$, dimension $d = 2$.)

15.2 Common copula types

15.2.1 Elliptical copulae

An interesting example is the Gaussian copula. For $p = 2$ it is equal to:
$$C_\rho(u, v) = \Phi_\rho(\Phi^{-1}(u), \Phi^{-1}(v)) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} f_\rho(x_1, x_2)\, dx_2\, dx_1.$$
Here $f_\rho(\cdot, \cdot)$ is the joint bivariate normal density with zero means, unit variances, and correlation $\rho$; $\Phi_\rho(\cdot, \cdot)$ is its cdf; and $\Phi^{-1}(\cdot)$ is the inverse of the cdf of the standard normal. (This is "the formula that killed Wall Street".) When $\rho = 0$ we see that we get $C_0(u, v) = uv$ (as is to be expected).

Non-Gaussian copulae are much more important in practice, and inference methods for copulae are a hot topic in Statistics. The reason for the importance of non-Gaussian copulae is that Gaussian copulae do not allow us to model tail dependence reasonably well; that is, joint extreme events have virtually zero probability. Especially in financial applications, it is very important to be able to model dependence in the tails.

The t-copula, based on the multivariate t-distribution, does a slightly better job in tail behaviour. The multivariate t-distribution with variance parameter $\Sigma$ and $\nu$ degrees of freedom is defined as $T = Z/\sqrt{X/\nu}$, where $Z \sim N(0, \Sigma)$ and, independently, $X \sim \chi^2_\nu$. Note that $\mathrm{Var}(T) \ne \Sigma$.

(Figure: perspective and contour plots of the Gaussian copula and its density, dimension $d = 2$, $\rho = 0.9$.)

(Figure: perspective and contour plots of the t-copula and its density, dimension $d = 2$, $\rho = 0.9$, $df = 4$.)

15.2.2 Archimedean copulae

The Gumbel–Hougaard copula is much more flexible in modelling dependence in the upper tails. For an arbitrary dimension $p$ it is defined as
$$C^{GH}_\theta(u_1, u_2, \ldots, u_p) = \exp\left\{-\left[\sum_{j=1}^{p} (-\log u_j)^\theta\right]^{1/\theta}\right\},$$
where $\theta \in [1, \infty)$ is a parameter that governs the strength of the dependence. You can easily see that the Gumbel–Hougaard copula reduces to the independence copula when $\theta = 1$ and to the Fréchet–Hoeffding upper bound copula $\min(u_1, \ldots, u_p)$ when $\theta \to \infty$.

(Figure: perspective and contour plots of the Gumbel copula and its density, dimension $d = 2$, $\theta = 2$.)

The Gumbel–Hougaard copula is also an example of the so-called Archimedean copulae. The latter are characterised by their generator $\phi(\cdot)$: a continuous, strictly decreasing, convex function from $[0, 1]$ to $[0, \infty)$ such that $\phi(1) = 0$. Then the Archimedean copula is defined via the generator as follows:
$$C(u_1, u_2, \ldots, u_p) = \phi^{-1}(\phi(u_1) + \cdots + \phi(u_p)).$$
Here, $\phi^{-1}(t)$ is defined to be 0 if $t$ is not in the image of $\phi(\cdot)$.

Example 15.2. Show that the Gumbel–Hougaard copula is an Archimedean copula with generator $\phi(t) = (-\log t)^\theta$.

The benefit of using the Archimedean copulae is that they allow for a simple description of the $p$-dimensional dependence by using a function of one argument only (the generator). However, it is immediately seen that an Archimedean copula is symmetric in its arguments, and this limits its applicability for modelling dependencies that are not symmetric in their arguments.
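Example 15.2 can also be checked numerically: composing the generator $\phi(t) = (-\log t)^\theta$ as $\phi^{-1}(\phi(u) + \phi(v))$ reproduces the closed-form bivariate Gumbel–Hougaard copula, and the limiting cases behave as stated. A sketch in Python (illustrative only):

```python
import numpy as np

def phi(t, theta):
    """Archimedean generator of the Gumbel–Hougaard copula."""
    return (-np.log(t)) ** theta

def phi_inv(s, theta):
    """Inverse generator: phi_inv(s) = exp(-s**(1/theta))."""
    return np.exp(-s ** (1.0 / theta))

def gumbel_direct(u, v, theta):
    """C^GH_theta(u, v) from the closed-form definition."""
    return np.exp(-((-np.log(u)) ** theta + (-np.log(v)) ** theta) ** (1.0 / theta))

def gumbel_generator(u, v, theta):
    """The same copula built as phi_inv(phi(u) + phi(v))."""
    return phi_inv(phi(u, theta) + phi(v, theta), theta)

u, v = 0.3, 0.7
assert np.isclose(gumbel_direct(u, v, 2.0), gumbel_generator(u, v, 2.0))
assert np.isclose(gumbel_direct(u, v, 1.0), u * v)        # independence at theta = 1
assert np.isclose(gumbel_direct(u, v, 200.0), min(u, v))  # upper bound as theta grows
```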
The so-called Liouville copulae are an extension of the Archimedean copulae and can also be used to model dependencies that are not symmetric in their arguments.

15.3 Margins, estimation, and simulation

So far, we have discussed the copula functions $C(\cdot, \cdot)$ and copula density $c(\cdot, \cdot)$, but using copulae also requires marginal cdfs $F_{X_1}(\cdot)$ and $F_{X_2}(\cdot)$ and pdfs $f_{X_1}(\cdot)$ and $f_{X_2}(\cdot)$ (and so on, for more than two variables). We can, in fact, specify arbitrary univariate continuous distributions (e.g., normal, gamma, beta, Laplace, etc.) for them. This choice is driven by substantive considerations. (E.g., is the distribution positive?) Then, the density (15.1), appropriately parametrised, provides the likelihood, e.g.,
$$L(\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = f_{\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2}(x_1, x_2) = c_\rho(F_{X_1|\boldsymbol{\theta}_1}(x_1), F_{X_2|\boldsymbol{\theta}_2}(x_2)) f_{X_1|\boldsymbol{\theta}_1}(x_1) f_{X_2|\boldsymbol{\theta}_2}(x_2),$$
which we can maximise in terms of the parameters of the copula and of the marginal distributions to obtain their estimates. A closed form for these estimators is rarely available, and so the maximisation is typically done numerically.

However, we might not want to specify margins in the first place. What can we do then? The empirical distribution function (edf) $\hat F(\cdot)$ is an unbiased estimator for the true cdf $F(\cdot)$. Given observations $X_{ij}$, $i = 1, 2$, $j = 1, \ldots, n$, we can obtain one for each of the 2 variables:
$$\hat F_{X_i}(x) = n^{-1} \sum_{j=1}^{n} I(X_{ij} \le x).$$
We can then use it in the copula cdf, i.e., $F(x_1, x_2) = C(\hat F_{X_1}(x_1), \hat F_{X_2}(x_2))$.

How do we estimate the parameters of the copula? Although $\hat F(\cdot)$ is straightforward, $\hat f(\cdot)$ is not, and requires further assumptions and tuning parameters (e.g., kernel bandwidth). This means that the likelihood $L(\rho, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$ is no longer available to maximise. However, other methods are possible. Typically, we convert the data into empirical quantiles $P_{ij} = \frac{n}{n+1} \hat F_{X_i}(X_{ij})$, with the denominator $n + 1$ used to ensure that the $P_{ij}$ run from $\frac{1}{n+1}$ to $\frac{n}{n+1}$. The resulting empirical quantiles will be uniform but maintain their correlations (approximately).
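The conversion to empirical quantiles amounts to replacing each observation by its rank divided by $n + 1$. A minimal sketch (in Python; it assumes no ties, as expected for continuous data):

```python
import numpy as np

def pseudo_obs(x):
    """Empirical quantiles P_j = (n/(n+1)) * Fhat(x_j) = rank(x_j)/(n+1)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.empty(n)
    ranks[np.argsort(x)] = np.arange(1, n + 1)  # rank 1 = smallest (no ties assumed)
    return ranks / (n + 1)

x = np.array([3.2, -1.0, 0.7, 10.5, 2.2])
p = pseudo_obs(x)
# p takes the values 1/6, ..., 5/6 in the rank order of x, so the ordering
# (and hence rank correlations with other variables) is preserved exactly.
```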
Then, we can tune our copula function's parameters until the correlations it induces among the empirical quantiles match their observed correlations.

Lastly, simulating copulae with parametric margins is straightforward, and simulating copulae with empirical margins is possible as well. $C(\cdot, \cdot)$ and $c(\cdot, \cdot)$ themselves represent a valid distribution with uniform margins and can therefore be used to make random dependent draws of marginally uniform quantiles $\boldsymbol{P}_\star = [P_{1\star}, P_{2\star}]^\top$. The variables on the original scale can be obtained using inverse-transform sampling as $X_{i\star} = F^{-1}_{X_i}(P_{i\star})$, $i = 1, 2$ for parametric margins and $X_{i\star} = \hat F^{-1}_{X_i}(P_{i\star})$, $i = 1, 2$ for empirical margins. Here, $\hat F^{-1}_{X_i}(\cdot)$ is the inverse of $\hat F_{X_i}(\cdot)$, typically smoothed in some way, since $\hat F_{X_i}(\cdot)$ represents a discrete distribution.

15.4 Software

SAS: PROC COPULA
R: Packages copula, VineCopula, and others.

15.5 Examples

Example 15.3. Microwave Ovens example (with empirical and gamma margins).

Example 15.4. Stock and portfolio modelling.

15.6 Exercises

Exercise 15.1 The ($p$-dimensional) Clayton copula is defined for a given parameter $\theta > 0$ as
$$C_\theta(u_1, u_2, \ldots, u_p) = \left[\sum_{i=1}^{p} u_i^{-\theta} - p + 1\right]^{-1/\theta}.$$
Show that it is an Archimedean copula and that its generator is $\phi(x) = \theta^{-1}(x^{-\theta} - 1)$.

A Exercise Solutions

Note that these solutions omit the steps of differentiation and integration, as well as arithmetic, as those can be performed by a computer.

0.1 (a) 1. $\theta x_1 e^{-x_1(\theta + x_2)} \ge 0$ as long as $\theta$, $x_1$, and $x_2 > 0$. 2. $\int_0^\infty \int_0^\infty \theta x_1 e^{-x_1(\theta + x_2)}\, dx_2\, dx_1 = 1$.

(b)
$$\Pr(X_1 < t, X_2 < t) = F(t, t) = \int_0^t \int_0^t \theta x_1 e^{-x_1(\theta + x_2)}\, dx_2\, dx_1 = \frac{t}{\theta + t} + e^{-t\theta}\left(\frac{\theta e^{-t^2}}{\theta + t} - 1\right).$$

(c)
$$f_{X_1}(x_1) = \int_0^\infty \theta x_1 e^{-x_1(\theta + x_2)}\, dx_2 = \theta e^{-x_1 \theta} \mathbb{1}_{x_1 > 0} \sim \mathrm{Exponential}(\theta).$$
Then $\mathrm{E}(X_1) = \theta^{-1}$ and $\mathrm{Var}(X_1) = \theta^{-2}$.
(d) fX2(x2) = ∫ ∞ 0 θx1 e −x1(θ+x2) dx1 = θ (θ + x2)2 1x2>0, so fX2|X1(x2|x1) = fX(x1, x2) fX1(x1) = θx1 e −x1(θ+x2) θ e−x1θ = x1 e −x1x2 1x2>0 ∼ Exponential(x1). (e) fX2|X1(x2|x1) = x1 e−x1x2 ̸= θ(θ+x2)2 = fX2(x2). More simply, the conditional distribution of X2|X1 depends on X1. 114 UNSW MATH5855 2021T3 Lecture A Exercise Solutions 0.2 (a) Let Y = ( 1 −1 1 1 ) X = ( X1 −X2 X1 +X2 ) . Then Cov(Y ) = ( 1 −1 1 1 ) σ2 ( 1 ρ ρ 1 )( 1 −1 1 1 )⊤ = σ2 ( 2− 2ρ 0 0 2ρ+ 2 ) , so Cov(X1 −X2, X1 +X2) = 0. Note that we only actually require Cov(X1 −X2, X1 +X2) = ( 1 −1)σ2(1 ρ ρ 1 )( 1 1 ) = 0. (b) Cov(X1, X2 − ρX1) = ( 1 0 )(1 ρ ρ 1 )(−ρ 1 ) = 0. (c) Var(X2 − bX1) = (−b 1)σ2(1 ρ ρ 1 )(−b 1 ) = σ2(b2 − 2bρ+ 1). ∂b2 − 2bρ+ 1 ∂b = 2b− 2ρ set= 0 =⇒ b = ρ, and ∂ 2b2−2bρ+1 ∂b2 = 2 > 0 =⇒ b = ρ is a minimum. 0.3 (a) This is trivial, but for additional rigour, we can use Theorem 0.3 letting A = ( Ip1 0p1,p2 ) ∈ Mp1,p and b = 0. Then X(1) = AX = b, and φ (1) X (s) = φX { A⊤s } = φX { A⊤s } = φX {( Ip1 0p1,p2 ) s } = φX {[ s 0 ]} . (b) If X(1) and X(2) are independent, then fX(x) = fX(1)(x(1))fX(2)(x(2)). Then for φX(t) = E(e it⊤X) = ∫ Rp eit ⊤x fX(x)dx = ∫ Rp2 ∫ Rp1 eit ⊤ (1)x(1) eit ⊤ (2)x(2) fX(1)(x(1))fX(2)(x(2))dx(1)dx(2) = ∫ Rp1 eit ⊤ (1)x(1) fX(1)(x(1))dx(1) ∫ Rp2 eit ⊤ (2)x(2) fX(2)(x(2))dx(2) = φX {[ t(1) 0 ]} φX {[ 0 t(2) ]} . 115 UNSW MATH5855 2021T3 Lecture A Exercise Solutions Conversely, if φX(t) = φX {[ t(1) 0 ]} φX {[ 0 t(2) ]} = φX(1)(t(1))φX(2)(t(2)), since always, e−it ⊤x = e−it ⊤ (1)x(1) e−it ⊤ (2)x(2) , we can take the inverse of the Fourier transform (which cf is), fX(x) = (2π) −p ∫ Rp φX(t) e −it⊤x dt = (2π)−p1(2π)−p2 ∫ Rp2 ∫ Rp1 φX(1)(x(1))φX(2)(x(2)) e −it⊤(1)x(1) e−it ⊤ (2)x(2) dt(1)dt(2) = (2π)−p1 ∫ Rp1 φX(1)(x(1)) e −it⊤(1)x(1) dt(1)(2π)−p2 ∫ Rp2 φX(2)(x(2)) e −it⊤(2)x(2) dt(2) = fX(1)(x(1))fX(2)(x(2)). 0.4 Using the notation from Example 0.2, write X = PΛP⊤, and denote z = P⊤y. 
Now, since we constrain ⟨y, e1⟩ = 0, then z1 = ⟨y, e1⟩ = 0, so y⊤Xy y⊤y = y⊤PΛP⊤y y⊤y = z⊤Λz z⊤z = ∑p i=2 λiz 2 i∑p i=2 z 2 i , which we maximise by setting z = (0 1 · · · 0)⊤ resulting in z⊤Λz z⊤z = λ2. 0.5 First, let us show that an orthogonal projection matrix P has only 0 or 1 as possible eigenvalues. This stems directly from its idempotency: let λ be an eigenvalue of P and y the corresponding eigenvector. Then, P 2y = PPy = λPy = λ2y, but idempotency implies that P 2y = Py = λy, and so λ = λ2, forcing it to be either 0 or 1. Now, spectral decomposition implies that P = ∑n i=1 λieie ⊤ i , and so rk(P ) is the number of its nonzero eigenvalues. Meanwhile, tr(P ) = tr( n∑ i=1 λieie ⊤ i ) = n∑ i=1 λi tr(e ⊤ i ei) = n∑ i=1 λi1 = rk(P ). 116 UNSW MATH5855 2021T3 Lecture A Exercise Solutions 2.1 (a) Write the joint distribution of Y1 and Y2 as( Y1 Y2 ) = ( 1 −1 1 1 )( X1 X2 ) then, Var ( Y1 Y2 ) = ( 1 −1 1 1 ) I2 ( 1 1 −1 1 ) = ( 2 0 0 2 ) , and Y1 and Y2 are independent (being multivariate normal and uncorrelated) and identically distributed N(0, 2). (b) P (χ22 < 2.41) = 0.7 (i.e., pchisq(2.41,2)). 2.2 (a) Z ∼ N (( 4 7 ) , ( 16 −2 −2 7 )) . Hence Cor (Z1, Z2) = − 2√16×7 . (b) Take ( X˜(1) X˜(2) ) = X1X3 X2 , and rearrange to get distribution isN3 32 −1 , 3 1 21 2 1 2 1 3 . Call its mean and variance µ˜ and Σ˜. Then, X1, X3 | X2 ∼ N ( µ˜(1)|(2), Σ˜(1)|(2) ) where µ˜(1)|(2) = µ˜(1) + Σ˜(1)(2)Σ˜ −1 (2)(2) ( X˜(2) − µ˜(2) ) = ( 3 2 ) + ( 2 1 ) 1 3 (X2 + 1) = ( 3 2 ) + ( 2/3 1/3 ) (X2 + 1) Σ˜(1)|(2) = ( 3 1 1 2 ) − ( 2 1 ) 1 3 ( 2 1 ) = ( 3 1 1 2 ) − ( 4/3 2/3 2/3 1/3 ) = 1 3 ( 5 1 1 5 ) In particular, for x2 = 0 we get, X1, X3 | X2 ∼ N (( 3 23 2 13 ) , 1 3 ( 5 1 1 5 )) . 117 UNSW MATH5855 2021T3 Lecture A Exercise Solutions 2.3 Take t ∈ Rp. Observe that a1X1+· · ·+anXn =XA for A = [a1, . . . , an] andX = [X1, . . . ,Xn]. 
Then, along the lines of Theorem 0.3, φa1X1+···+anXn(t) = n∏ j=1 φajXj (t) = n∏ j=1 eiajt ⊤µj− a2j 2 t ⊤Σjt = eit ⊤( ∑n j=1 ajµj)− 12 t⊤( ∑n j=1 a 2 jΣj)t = φN( ∑n j=1 ajµj , ∑n j=1 a 2 jΣj) (t) (by definition). Then, substitute µi = µ, Σi = Σ, and ai = 1 n for all i = 1, . . . , n to obtain the distribution of X¯. 2.4 By Property 4, the conditional distribution X2 | X1 = x1 must have the form Ax1 + b +X3 (i.e., a linear combination of x1, a constant, and some noise X3 ∼ N(0,Ω) independent of X1). Hence, the marginal distribution of X2 is the same as the distribution of AX1 + b +X3. But then, X = ( X1 X2 ) = ( Ir 0 A Ip−r )( X1 X3 ) + ( 0 b ) and will be multivariate normal. We only need the mean and the covariance matrix. Now, E(X1) = µ1 and E(X2) = EX1 [EX2(X2 |X1)] = EX1 [EX3(AX1+b+X3)] = Aµ1+b, and Var(X2) = Var(AX1 +X3) = AΣ11A ⊤ +Ω, with Cov(X1,X2) = E[(X1 − µ1)(AX1 + b−Aµ1 − b)⊤] = E[(X1 − µ1)(X1 − µ1)⊤A⊤] = E[(X1 − µ1)(X1 − µ1)⊤]A⊤ = Σ11A⊤, hence X ∼ Np (( µ1 Aµ1 + b ) , ( Σ11 Σ11A ⊤ AΣ11 Ω+AΣ11A ⊤ )) . 2.5 (a) Using Exercise 2.4, we can get the joint distribution of ( Z Y ) ∼ N2 (( 0 1 ) , ( 1 1 1 2 )) (or, equivalently ( Y Z ) ∼ N2 (( 1 0 ) , ( 2 1 1 1 )) . Applying the same procedure again, we can get (with Ω = 1, b = 1, and A = (−1, 0))YZ X ∼ N3 10 0 , 2 1 −21 1 −1 −2 −1 3 118 UNSW MATH5855 2021T3 Lecture A Exercise Solutions or, equivalently, XY Z ∼ N 01 0 , 3 −2 −1−2 2 1 −1 1 1 . Then, Y | (X,Z) is normal with µY |(X,Z) = 1 + (−2 1)( 3 −1−1 1 )−1(( X Z ) − ( 0 0 )) = 1 + (−2 1) 1 2 ( 1 1 1 3 )( X Z ) = 1 + 1 2 (Z −X) and σ2Y |(X,Z) = 2− (−2 1)( 3 −1−1 1 )−1(−2 1 ) = 1 2 : Y | (X,Z) ∼ N(1 + 1 2 (Z −X), 1 2 ) (b) ( U V ) = ( 1 + Z 1− Y ) is obviously normal. Moreover, µU = 1 + E(Z) = 1, µV = E(1 − Y ) = 0, σ2U = σ2Z = 1, σ2V = σ 2 Y = 2, σU,V = −σZ,Y = −1. Hence,( U V ) ∼ N2 (( 1 0 ) , ( 1 −1 −1 2 )) . (c) Y | (U = 2) has the same distribution as Y | Z + 1 = 2, that is, Y | Z = 1. Using (b), we get Y | U = 2 ∼ N1(2, 1). 
3.1 (Figure: solution given as a set of plots.)

4.1 (a)
$$C = \begin{pmatrix} -1 & 1 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & 0 & -1 & 1 \end{pmatrix} \in M_{p-1,p}$$
is the required matrix.

(b) $\boldsymbol{Y}_j = C\boldsymbol{X}_j \Rightarrow \boldsymbol{Y}_j$ are i.i.d. $N(C\boldsymbol{\mu}, C\Sigma C^\top)$, with $S_Y = CSC^\top$, $\bar{\boldsymbol{Y}} = C\bar{\boldsymbol{X}}$, $\boldsymbol{\mu}_Y = C\boldsymbol{\mu}$, and
$$n(\bar{\boldsymbol{Y}} - \boldsymbol{\mu}_Y)^\top S_Y^{-1}(\bar{\boldsymbol{Y}} - \boldsymbol{\mu}_Y) = n(\bar{\boldsymbol{X}} - \boldsymbol{\mu})^\top C^\top (CSC^\top)^{-1} C(\bar{\boldsymbol{X}} - \boldsymbol{\mu}) \sim \frac{(n-1)(p-1)}{n-p+1} F_{p-1,n-p+1}.$$
Hence, the rejection region would be
$$\left\{\boldsymbol{X} : n(C\bar{\boldsymbol{X}} - \boldsymbol{1})^\top (CSC^\top)^{-1}(C\bar{\boldsymbol{X}} - \boldsymbol{1}) > \frac{(n-1)(p-1)}{n-p+1} F_{1-\alpha, p-1, n-p+1}\right\},$$
where $\boldsymbol{1}_{p-1} \in \mathbb{R}^{p-1}$ is a $(p-1)$-vector of ones.

4.2 Use the fact that $Y = n(\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0)^\top \Sigma^{-1}(\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0) \sim \chi^2_3$: plug in $n = 50$,
$$\bar{\boldsymbol{X}} = \begin{pmatrix} 0.8 \\ 1.1 \\ 0.6 \end{pmatrix}, \quad \boldsymbol{\mu}_0 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 4 & 1 \\ 1 & 1 & 2 \end{pmatrix},$$
find $\Sigma^{-1}$, and hence reject if $Y > \chi^2_{1-\alpha,3}$.

4.3 From the data, $\bar{\boldsymbol{X}} = [6, 10]^\top$ and $S = \begin{pmatrix} 24 & -10 \\ -10 & 6 \end{pmatrix}/3$ and $S^{-1} = \begin{pmatrix} 18 & 30 \\ 30 & 72 \end{pmatrix}/44$. Then, $T^2 = n(\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0)^\top S^{-1}(\bar{\boldsymbol{X}} - \boldsymbol{\mu}_0) = 13.636$. To compute the $P$-value, evaluate $F = \frac{n-p}{(n-1)p} T^2 = 4.545$ and $P$-value $= \Pr(F \ge F_{p,n-p}) = 0.180 > 0.05$ (i.e., pf(4.545455, 2, 2, lower.tail=FALSE)). Do not reject $H_0$: there is not sufficient evidence to believe that the population mean differs from $[7, 11]^\top$.

4.4 Use Exercise 2.3 on the two samples, then the property of the difference of means. Observe that the variance does not depend on the means, and so we can use the pooled $T^2$ test (4.9).

4.5 For a difference of independent variables, means subtract and variances add, so $\boldsymbol{X} - \bar{\boldsymbol{X}} \sim N_p(\boldsymbol{0}, (1 + \frac{1}{n})\Sigma)$ and $(n-1)S \sim W_p(\Sigma, n-1)$ by definition, and they are independent. Call $\boldsymbol{C} = \boldsymbol{X} - \bar{\boldsymbol{X}}$. Then,
$$\frac{n}{n+1}(\boldsymbol{X} - \bar{\boldsymbol{X}})^\top S^{-1}(\boldsymbol{X} - \bar{\boldsymbol{X}}) = \frac{\boldsymbol{C}^\top S^{-1}\boldsymbol{C}}{\boldsymbol{C}^\top \Sigma^{-1}\boldsymbol{C}} \cdot \frac{n}{n+1} \boldsymbol{C}^\top \Sigma^{-1}\boldsymbol{C}.$$
Now, $\frac{\boldsymbol{C}^\top S^{-1}\boldsymbol{C}}{\boldsymbol{C}^\top \Sigma^{-1}\boldsymbol{C}} = \frac{n-1}{\chi^2_{n-p}}$, independent of $\boldsymbol{X}$ or $\bar{\boldsymbol{X}}$, so
$$\frac{n}{n+1} \boldsymbol{C}^\top \Sigma^{-1}\boldsymbol{C} = (\boldsymbol{X} - \bar{\boldsymbol{X}})^\top \left(\left(1 + \frac{1}{n}\right)\Sigma\right)^{-1}(\boldsymbol{X} - \bar{\boldsymbol{X}}) \sim \chi^2_p$$
and independent of $S$.
Hence n n+ 1 (X − X¯)⊤S−1(X − X¯) ∼ (n− 1)χ 2 p χ2n−p , i.e., the distribution asked: ∼ p(n−1)(n−p) Fp,n−p (same as distribution of T 2). Then, (1 − α)100% prediction region would be:{ X : n n+ 1 (X − X¯)⊤S−1(X − X¯) < p(n− 1) (n− p) F1−α,p,n−p } . 122 UNSW MATH5855 2021T3 Lecture A Exercise Solutions 5.1 (a) Let X˜4 = X1 +X2 +X4. We can obtain what we are looking for as a linear combination: X1 X2 X3 X˜4 = X1 X2 X3 X1 +X2 +X4 = 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 X1 X2 X3 X4 , Then, E X1 X2 X3 X˜4 = 1 2 3 7 and Var X1 X2 X3 X˜4 = 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 3 1 0 1 1 4 0 0 0 0 1 4 1 0 4 20 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 = 3 1 0 5 1 4 0 5 0 0 1 4 5 5 4 31 , so X1 X2 X3 X˜4 ∼ N 1 2 3 7 , 3 1 0 5 1 4 0 5 0 0 1 4 5 5 4 31 . (b) Using the expression for the conditional distribution of a normal distribution, E X1∣∣∣∣ X2X3 X4 = 1 + ( 1 0 1 ) 4 0 00 1 4 0 4 20 −1 x2 − 2x3 − 3 x4 − 4 = 1 + ( 1 0 1 ) 1/4 0 00 5 −1 0 −1 1/4 x2 − 2x3 − 3 x4 − 4 = 1 + ( 1 0 1 ) −0.5 + x2411 + 5x3 − x4 1− x3 + x44 = 1− 0.5 + x2 4 + 2− x3 + x4 4 = 2.5 + x2 4 − x3 + x4 4 . 123 UNSW MATH5855 2021T3 Lecture A Exercise Solutions And, Var X1∣∣∣∣ X2X3 X4 = 3− ( 1 0 1 ) 14 0 00 5 −1 0 −1 14 10 1 = 3− ( 1 0 1 ) 14−1 1 4 = 2.5. (c) Looking at the upper part 3 1 01 4 0 0 0 1 of the covariance matrix, we see that X3 is independent of ( X1, X2 ) . Hence x3 does not influence the correlation of X1 and X2 =⇒ ρ12.3 = ρ12 =√ 3 6 = 0.2887. For ρ12.4, Σ11 − Σ12Σ−122 Σ21 = ( 3 1 1 4 ) − 1 20 ( 1 0 0 0 ) = ( 59 20 1 1 4 ) . Hence, ρ12.4 = √ 5 59 = 0.291. (d) R1.234 = √√√√√√√( 1 0 1 ) 4 0 00 1 4 0 4 20 −1 10 1 3 = √√√√√1 3 ( 1 0 1 ) 14 0 00 5 −1 0 −1 14 10 1 = √ 1 6 = 0.408 > ρ12. Of course R1.234 should be larger than ρ12 (or at least no smaller), and this is supported numer- ically (0.408 > 0.2887). (e) Consider X2X3 X4 and X1 − ( 1 0 1 ) X245X3 −X4 X4 4 −X3 = X1 − X2 4 − X4 4 +X3. 
124 UNSW MATH5855 2021T3 Lecture A Exercise Solutions Then directly you can check: Cov(X2, X1 − X2 4 − X4 4 +X3) = 1− 1 = 0 Cov(X3,X1 − X2 4 − X4 4 +X3) = −1 + 1 = 0 Cov(X4,X1 − X2 4 − X4 4 +X3) = 1− 5 + 4 = 0. But more clever is to say: X1−E X1∣∣∣∣ x2x3 x4 and x2x3 x4 are uncorrelated. This general argument was put forward and proved as a part of the proof of Property 4 of the Multivariate Normal Distribution in Section 2.2. 5.2 (a) 3−2 1 ⊤ X1X2 X3 ∼ N ( 3 −2 1 ) 2−3 1 , ( 3 −2 1 ) 1 1 11 3 2 1 2 2 3−2 1 ∼ N(13, 9). (b) Let vector a = ( U V ) . Cov(X2, X2 − UX1 − V X3) = Var(X2)− U Cov(X2, X1)− V Cov(X2, X3) = 3− U − 2V = 0. Then, if, say, U = 1, then V = 1, so a = ( 1 1 ) . 125 UNSW MATH5855 2021T3 Lecture A Exercise Solutions 6.1 First, let us note that not all ρ > 0 are allowed since Σ must be non-negative definite. It must hold that ∣∣∣∣ 1 ρ/2ρ/2 1 ∣∣∣∣ = 1− ρ2/4 ≥ 0 (since otherwise, for some a ∈ R and b ∈ R,ab 0 ⊤ 1 ρ/2 0ρ/2 1 ρ 0 ρ 1 ab 0 = a2 + aρ/2 + bρ/2 + b2 < 0, making the whole matrix no longer non-negative definite) and∣∣∣∣∣∣ 1 ρ/2 0 ρ/2 1 ρ 0 ρ 1 ∣∣∣∣∣∣ = 1− 54ρ2 ≥ 0. This means that 0 < ρ ≤ 2√ 5 . (a) First, let us find the 3 eigenvalues of Σ:∣∣∣∣∣∣ 1− λ ρ/2 0 ρ/2 1− λ ρ 0 ρ 1− λ ∣∣∣∣∣∣ = 1− 3λ+ 3λ2 − λ3 − 54ρ2(1− λ) = 0 and (1− λ) [ λ2 − 2λ+ 1− 5 4 ρ2 ] = 0. Solving this equation, we obtain three roots: λ1 = 1, λ2 = 1 − √ 5 2 ρ, and λ3 = 1 + √ 5 2 ρ. The larges eigenvalue is λ3 = 1 + √ 5 2 ρ. By definition, its corresponding eigenvector a1a2 a3 satisfies, a1 + ρ 2 a2 = a1 + √ 5 2 ρa1 ρ 2 a1 + a2 + ρa3 = a2 + √ 5 2 ρa2 ρa2 + a3 = a3 + √ 5 2 ρa3. Solving (up to a constant), a2 = √ 5a1, a3 = 2√ 5 a2 = 2a1. 126 UNSW MATH5855 2021T3 Lecture A Exercise Solutions So a1 1√5 2 is an eigenvector. To normalise it, choose a1 = 1√10 . Thus, the first principal component is 1√ 10 Y1 + √ 1 2 Y2 + 2√ 10 Y3. It explains 1+ √ 5 2 ρ 3 · 100% of the overall variability. 
(b)
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_1 + Y_2 + Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & \frac{\rho}{2} & 0 \\ \frac{\rho}{2} & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}\right) \sim N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & \frac{\rho}{2} & 1 + \frac{\rho}{2} \\ \frac{\rho}{2} & 1 & 1 + \frac{3}{2}\rho \\ 1 + \frac{\rho}{2} & 1 + \frac{3}{2}\rho & 3(1 + \rho) \end{pmatrix}\right).$$

(c)
$$N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ \rho \end{pmatrix} y_3,\ \begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 \end{pmatrix} - \begin{pmatrix} 0 \\ \rho \end{pmatrix}\begin{pmatrix} 0 & \rho \end{pmatrix}\right) = N\left(\begin{pmatrix} 0 \\ \rho y_3 \end{pmatrix},\ \begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 - \rho^2 \end{pmatrix}\right).$$

(d)
$$\mathrm{Cov}\begin{pmatrix} Y_3 \\ Y_2 \\ Y_1 \end{pmatrix} = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & \frac{\rho}{2} \\ 0 & \frac{\rho}{2} & 1 \end{pmatrix},$$
so
$$R = \sqrt{\begin{pmatrix} \rho & 0 \end{pmatrix}\begin{pmatrix} 1 & \frac{\rho}{2} \\ \frac{\rho}{2} & 1 \end{pmatrix}^{-1}\begin{pmatrix} \rho \\ 0 \end{pmatrix}} = \frac{1}{\sqrt{1 - \frac{\rho^2}{4}}}\sqrt{\begin{pmatrix} \rho & 0 \end{pmatrix}\begin{pmatrix} 1 & -\frac{\rho}{2} \\ -\frac{\rho}{2} & 1 \end{pmatrix}\begin{pmatrix} \rho \\ 0 \end{pmatrix}} = \frac{\rho}{\sqrt{1 - \frac{\rho^2}{4}}}.$$

7.1 Split up the matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ into
$$\Sigma_{11} = \begin{pmatrix} 1 & 0.4248 \\ 0.4248 & 1 \end{pmatrix}, \quad \Sigma_{12} = \begin{pmatrix} 0.0420 & 0.0215 & 0.0573 \\ 0.1487 & 0.2489 & 0.2843 \end{pmatrix}, \quad \Sigma_{22} = \begin{pmatrix} 1 & 0.6693 & 0.4662 \\ 0.6693 & 1 & 0.6915 \\ 0.4662 & 0.6915 & 1 \end{pmatrix},$$
with $\Sigma_{21} = \Sigma_{12}^\top$. We need to find the eigenvalues of $\Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1}$ if calculating by hand (this would be easier than finding the eigenvalues of $\Sigma_{22}^{-1/2} \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$). If using SAS, we would use the following statements:

proc iml;
S_11 = {1 0.4248, 0.4248 1};
S_12 = {0.0420 0.0215 0.0573, 0.1487 0.2489 0.2843};
S_22 = {1 0.6693 0.4662, 0.6693 1 0.6915, 0.4662 0.6915 1};
S_22inv = inv(S_22);
S_r = root(S_22inv);
a = S_r*S_12`*inv(S_11)*S_12*S_r`;
call eigen(c, d, a);
print c;
print d;

The result:
$$\boldsymbol{c} = \begin{pmatrix} 0.0946455 \\ 0.0035185 \\ 2.252 \times 10^{-18} \end{pmatrix}, \quad D = \begin{pmatrix} -0.1281 & 0.7192 & 0.6829 \\ 0.2840 & -0.6331 & 0.7201 \\ 0.9502 & 0.2862 & -0.1232 \end{pmatrix}.$$
Further,

b = S_r`*d[,1];
a = 1/sqrt(0.09464557)*inv(S_11)*S_12*b;

gives
$$\boldsymbol{a} = \begin{pmatrix} 0.3262 \\ -1.0940 \end{pmatrix} \quad \text{and} \quad \boldsymbol{b} = \begin{pmatrix} 0.1724 \\ -0.5079 \\ -0.6794 \end{pmatrix},$$
with $\boldsymbol{a}^\top \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ and $\boldsymbol{b}^\top \begin{pmatrix} X_3 \\ X_4 \\ X_5 \end{pmatrix}$ the canonical variates, and relevant eigenvalues $\lambda = 0.0946, 0.0035$.

In R,

s <- c(1, 0.4248, 0.0420, 0.0215, 0.0573,
       1, 0.1487, 0.2489, 0.2843,
       1, 0.6693, 0.4662,
       1, 0.6915,
       1)
S <- matrix(NA, 5, 5)
S[lower.tri(S, TRUE)] <- s
S[upper.tri(S)] <- t(S)[upper.tri(S)]
S_11 <- S[1:2, 1:2]
S_12 <- S[1:2, 3:5]
S_22 <- S[3:5, 3:5]
S_22inv <- solve(S_22)
S_r <- chol(S_22inv) # Can use Cholesky instead of square root.
A <- S_r %*% t(S_12) %*% solve(S_11) %*% S_12 %*% t(S_r)
(e <- eigen(A))
(b <- t(S_r) %*% e$vectors[,1])
(a <- 1/sqrt(e$values[1]) * solve(S_11) %*% S_12 %*% b)

This suggests that the first canonical correlation is sufficient. The first canonical correlation represents primarily a positive association between arithmetic power and memory for symbols (both kinds).

7.2 R and SAS implementations as in the previous exercise, but with modified matrices, give
$$\boldsymbol{a} = \begin{pmatrix} -0.0260 \\ -0.0518 \end{pmatrix}, \quad \boldsymbol{b} = \begin{pmatrix} -0.0823 \\ -0.0081 \\ -0.0035 \end{pmatrix},$$
with the relevant eigenvalues $\lambda = 0.4396, 0.0016$. The eigenvalues suggest that there is little left for the second canonical correlation to explain. (I.e., a factor of over 200.) The first canonical correlation appears to indicate a positive relationship (i.e., negative $\times$ negative) between the first open book exam and the two closed book exams, whereas the other two open book exams are weakly associated with the closed book exams. (Rerunning after converting to a correlation matrix does not change this.)

7.3 (a) The following is an outline of the solution:

1. Since this makes the calculations easier to perform, work with the matrix $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}$.
2. Using the $2 \times 2$ matrix inversion formula, evaluate it.
3. Using the $2 \times 2$ matrix determinant formula, find the expression for the characteristic polynomial, in terms of $\rho$ and $\lambda$.

You should get $\lambda_1 = \frac{4\rho^2}{1 + 4\rho + 4\rho^2}$ and $\lambda_2 = 0$, which means that one canonical-variable pair is enough.

(b) Similarly, solve for the eigenvectors of $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}$ and transform them.

7.4 (a) Splitting this up,
$$\Sigma_{11} = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Sigma_{22} = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}, \quad \Sigma_{12} = \begin{pmatrix} 0 & 0 \\ 0.95 & 0 \end{pmatrix},$$
$$\Sigma_{22}^{-1} = \frac{1}{100}\begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Sigma_{22}^{-1/2} = \frac{1}{10}\begin{pmatrix} 10 & 0 \\ 0 & 1 \end{pmatrix},$$
so
$$\Sigma_{22}^{-1/2}\Sigma_{12}^\top\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2} = \begin{pmatrix} (0.95)^2 & 0 \\ 0 & 0 \end{pmatrix}.$$
$\mu^2 = (0.95)^2$ with eigenvector $\boldsymbol{b} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, hence $Z_2 = 1 \times X_3 + 0 \times X_4 = X_3$ and
$$\boldsymbol{a} = \frac{1}{0.95} \cdot \frac{1}{100}\begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}\begin{pmatrix} 0 & 0 \\ 0.95 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
so $Z_1 = 0 \times X_1 + 1 \times X_2 = X_2$.
Since $\mu^2 = (0.95)^2$, $\mu = 0.95$ is the first canonical correlation. Can you give another argument for the canonical variables and canonical correlation in this problem that will help you to avoid all the calculations above?

9.1 Since $S = \frac{1}{n}V = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)^\top$ (using $n$ instead of $n-1$ here to simplify notation; the factor cancels), observe that
\[
\text{arith. mean}\,\hat\lambda_i = \frac{1}{p}\sum_{i=1}^p \hat\lambda_i = \frac{1}{p}\operatorname{tr}(S)
= \frac{1}{pn}\operatorname{tr}\Bigl\{\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)^\top\Bigr\}
= \frac{1}{pn}\sum_{i=1}^n (x_i - \bar x)^\top(x_i - \bar x) = \hat\sigma^2
\]
and
\[
\text{geom. mean}\,\hat\lambda_i = \Bigl(\prod_{i=1}^p \hat\lambda_i\Bigr)^{1/p} = |S|^{1/p} = \Bigl(\frac{1}{n^p}|V|\Bigr)^{1/p} = \frac{1}{n}|V|^{1/p}.
\]
Substituting,
\[
-2\log\Lambda = np\log\frac{\text{arith. mean}\,\hat\lambda_i}{\text{geom. mean}\,\hat\lambda_i}
= np\log\frac{\hat\sigma^2}{\frac{1}{n}|V|^{1/p}}
= np\log n\hat\sigma^2 - np\log|V|^{1/p}
= np\log n\hat\sigma^2 - n\log|V|,
\]
the test statistic from Section 9.2.

9.2 Observe that we can write the sample correlation matrix as $R = \operatorname{diag}(\hat\Sigma)^{-1/2}\hat\Sigma\operatorname{diag}(\hat\Sigma)^{-1/2}$, where
\[
\operatorname{diag}(A)_{ij} = \begin{cases} A_{ii} & i = j \\ 0 & \text{otherwise} \end{cases},
\]
a diagonal matrix whose diagonal elements are the diagonal elements of $A$. Recall that for a diagonal matrix, the matrix inverse, the matrix square root, etc., become simply elementwise operations on the diagonal, and its determinant is the product of its diagonal values. Then, let $V = \sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)^\top = n\hat\Sigma$ as before. If $\Sigma$ is diagonal, then the elements of $X$ are independent, so if $\sigma_j^2 = \operatorname{Var} X_j$, then $\hat\sigma_j^2 = n^{-1}\sum_{i=1}^n (x_{ji} - \bar x_j)^2 = n^{-1}V_{jj} = \hat\Sigma_{jj}$. Then,
\[
\Lambda = \frac{\prod_{j=1}^p (\hat\Sigma_{jj})^{-\frac{n}{2}}\, e^{-\frac{1}{2\hat\Sigma_{jj}}\sum_{i=1}^n (x_{ji} - \bar x_j)^2}}{|V|^{-\frac{n}{2}}\, n^{\frac{np}{2}}\, e^{-\frac{np}{2}}}
= \frac{\prod_{j=1}^p (\hat\Sigma_{jj})^{-\frac{n}{2}}\, e^{-\frac{n}{2}}}{|\hat\Sigma|^{-\frac{n}{2}}\, e^{-\frac{np}{2}}}
= \frac{|\operatorname{diag}(\hat\Sigma)^{-1/2}|^{n}}{|\hat\Sigma|^{-\frac{n}{2}}}
= |\operatorname{diag}(\hat\Sigma)^{-1/2}\hat\Sigma\operatorname{diag}(\hat\Sigma)^{-1/2}|^{\frac{n}{2}} = |R|^{\frac{n}{2}},
\]
so $-2\log\Lambda = -n\log|R|$. Lastly, the degrees of freedom for the $\chi^2$ distribution are
\[
\overbrace{\frac{p(p+1)}{2}}^{\#\text{ param. SPD matrix}} - \overbrace{p}^{\#\text{ param. diag. matrix}} = \frac{p(p-1)}{2}.
\]
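The algebra in Exercises 9.1 and 9.2 is easy to verify numerically. The notes use R and SAS; the following is a quick sketch in Python/NumPy instead, with a purely illustrative simulated data set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
x = rng.normal(size=(n, p))              # illustrative data only
xbar = x.mean(axis=0)
V = (x - xbar).T @ (x - xbar)            # scatter matrix, V = n * Sigma-hat
S = V / n                                # MLE of the covariance matrix

# Exercise 9.1: np*log(arith. mean / geom. mean of the eigenvalues of S)
# equals np*log(n*sigma2-hat) - n*log|V|.
lam = np.linalg.eigvalsh(S)
lhs = n * p * np.log(lam.mean() / np.exp(np.log(lam).mean()))
rhs = n * p * np.log(n * lam.mean()) - n * np.log(np.linalg.det(V))
assert np.isclose(lhs, rhs)

# Exercise 9.2: -2*log(Lambda) = -n*log|R|, R the sample correlation matrix.
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)
stat = -n * np.log(np.linalg.det(R))
assert stat >= 0                         # |R| <= 1, so the statistic is nonnegative
```

The first assertion checks an exact algebraic identity, so it holds up to floating-point error for any data set, not just this one.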
12.1 (a) Normal populations with equal covariance matrices imply LDA, so use the expression from Section 12.6, with $\mu_i$ replacing $\bar x_i$ and $\Sigma$ replacing $S_{\text{pooled}}$, since these are given to us rather than estimated from the sample. This leads to the following rule:
1. Evaluate $d_i(x) = \mu_i^\top\Sigma^{-1}x - \frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i + \log\frac{1}{3}$ for $i = 1, 2, 3$.
2. Classify $x$ into the category with the highest $d_i(x)$.

(b) We shall illustrate the first case in detail, and give only the results for the remainder. Let $x = \begin{pmatrix} 0.2 \\ 0.6 \end{pmatrix} = \begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}$, and evaluate $\Sigma^{-1} = \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}$. Then,
\[
d_1(x) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} + \log\frac{1}{3} = -\frac{2}{15} + \log\frac{1}{3},
\]
\[
d_2(x) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} + \log\frac{1}{3} = -\frac{4}{5} + \log\frac{1}{3},
\]
\[
d_3(x) = \begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 1/5 \\ 3/5 \end{pmatrix}
- \frac{1}{2}\begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top\begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} + \log\frac{1}{3} = 0 + \log\frac{1}{3}.
\]
Thus, we classify into Category 3.

For $x = \begin{pmatrix} 2 \\ 0.8 \end{pmatrix}$: $d_1(x) = \frac{6}{5} + \log\frac{1}{3}$, $d_2(x) = \frac{22}{15} + \log\frac{1}{3}$, $d_3(x) = -\frac{14}{15} + \log\frac{1}{3}$. Classify into Category 2.

For $x = \begin{pmatrix} 0.75 \\ 1 \end{pmatrix}$: $d_1(x) = \frac{1}{2} + \log\frac{1}{3}$, $d_2(x) = -\frac{1}{3} + \log\frac{1}{3}$, $d_3(x) = \frac{1}{6} + \log\frac{1}{3}$. Classify into Category 1.

(c) To be on the boundary between two regions, say $i$ and $j$, the point $x$ must have $d_i(x) = d_j(x)$. Then,
\[
\mu_i^\top\Sigma^{-1}x - \frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i + \log\pi_i = \mu_j^\top\Sigma^{-1}x - \frac{1}{2}\mu_j^\top\Sigma^{-1}\mu_j + \log\pi_j
\]
\[
\mu_i^\top\Sigma^{-1}x - \mu_j^\top\Sigma^{-1}x = -\frac{1}{2}\mu_j^\top\Sigma^{-1}\mu_j + \frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i + \log\pi_j - \log\pi_i
\]
\[
(\mu_i^\top\Sigma^{-1} - \mu_j^\top\Sigma^{-1})x = -\frac{1}{2}\mu_j^\top\Sigma^{-1}\mu_j + \frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i + \log\pi_j - \log\pi_i.
\]
If we call $a = (\mu_i^\top\Sigma^{-1} - \mu_j^\top\Sigma^{-1})^\top \in \mathbb{R}^2$ and $c = -\frac{1}{2}\mu_j^\top\Sigma^{-1}\mu_j + \frac{1}{2}\mu_i^\top\Sigma^{-1}\mu_i + \log\pi_j - \log\pi_i \in \mathbb{R}$, neither of them depending on $x$, we can write
\[
a^\top x = c \implies a_1 x_1 + a_2 x_2 = c \implies x_2 = \frac{c}{a_2} - \frac{a_1}{a_2} x_1,
\]
an equation for a line with slope $-a_1/a_2$ and $y$-intercept $c/a_2$. Here is a sketch of the region boundaries:

[Figure: sketch of the three classification-region boundaries in the $(x_1, x_2)$ plane.]
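The hand computations in part (b) can be double-checked in code. The notes use R and SAS; here is an equivalent sketch in Python/NumPy, with $\mu_1 = (1, 1)^\top$, $\mu_2 = (1, 0)^\top$, $\mu_3 = (0, 1)^\top$ and $\Sigma^{-1}$ read off the working above:

```python
import numpy as np

# Parameters from Exercise 12.1: equal priors 1/3, Sigma^{-1} as computed above.
Sigma_inv = np.array([[ 4/3, -2/3],
                      [-2/3,  4/3]])
mus = [np.array([1.0, 1.0]),    # mu_1
       np.array([1.0, 0.0]),    # mu_2
       np.array([0.0, 1.0])]    # mu_3

def classify(x):
    # d_i(x) = mu_i' Sigma^{-1} x - (1/2) mu_i' Sigma^{-1} mu_i + log(1/3)
    d = [m @ Sigma_inv @ x - 0.5 * (m @ Sigma_inv @ m) + np.log(1/3) for m in mus]
    return int(np.argmax(d)) + 1  # categories are numbered 1, 2, 3

print(classify(np.array([0.2, 0.6])))    # Category 3
print(classify(np.array([2.0, 0.8])))    # Category 2
print(classify(np.array([0.75, 1.0])))   # Category 1
```

The three classifications agree with the hand calculations: Categories 3, 2, and 1 respectively.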
15.1 First, we solve for the inverse of the generator: $\phi^{-1}(x) = (\theta x + 1)^{-1/\theta}$. Then, substitute into the Archimedean form:
\[
C_\theta(u_1, u_2, \ldots, u_p) = \phi^{-1}\Bigl\{\sum_{i=1}^p \phi(u_i)\Bigr\}
= \Bigl[\theta\Bigl\{\sum_{i=1}^p \theta^{-1}(u_i^{-\theta} - 1)\Bigr\} + 1\Bigr]^{-1/\theta}
= \Bigl[\theta\theta^{-1}\Bigl\{\Bigl(\sum_{i=1}^p u_i^{-\theta}\Bigr) - p\Bigr\} + 1\Bigr]^{-1/\theta}
= \Bigl[\sum_{i=1}^p u_i^{-\theta} - p + 1\Bigr]^{-1/\theta}.
\]
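This generator yields the Clayton family of copulae. As a quick sanity check of the final formula (a Python/NumPy sketch; the function name is ours), setting all but one argument to 1 should recover the remaining margin, and letting $\theta \to 0^+$ should approach the independence copula $\prod_i u_i$:

```python
import numpy as np

def clayton(u, theta):
    # C_theta(u_1,...,u_p) = (sum_i u_i^{-theta} - p + 1)^{-1/theta}
    u = np.asarray(u, dtype=float)
    return (np.sum(u ** -theta) - len(u) + 1.0) ** (-1.0 / theta)

# Setting all but one argument to 1 recovers the remaining margin:
assert np.isclose(clayton([0.3, 1.0, 1.0], theta=2.0), 0.3)
# As theta -> 0+, the copula approaches independence, prod(u_i):
assert np.isclose(clayton([0.3, 0.7], theta=1e-8), 0.3 * 0.7, atol=1e-6)
```

Both checks pass for any choice of arguments in $(0, 1]$, which is a useful way to catch sign or bracketing errors in the derivation.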