Lecture 1: Exploratory Data Analysis of Multivariate Data
UNSW MATH5855 2020T3 Lecture 1

Outline
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

1.1 Data organisation

Representation

- case: a.k.a. item, individual, or experimental trial
- $p \ge 1$ variables: recorded on each unit of analysis
- $x_{ij}$: the $i$th (of $p$) variable observed on the $j$th (of $n$) case
- data matrix:

$$
\underset{p \times n}{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{in} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn}
\end{pmatrix}
\tag{1.1}
$$

1.2 Basic summaries

Univariate summaries

- sample mean (of variable $i$):

$$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}$$

- sample variance (of variable $i$):

$$s_i^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$$

  - Sometimes, we will use a divisor of $n - 1$ instead.

Bivariate summaries

- sample covariance (of variables $i$ and $k$):

$$s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)(x_{kj} - \bar{x}_k)$$

  - Measures linear association only!
  - Symmetric: $s_{ik} \equiv s_{ki}$.

- sample correlation (of variables $i$ and $k$):

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} \equiv \frac{s_{ik}}{s_i s_k}$$

  - A unitless measure.
  - Also symmetric.
  - Cauchy–Bunyakovsky–Schwarz inequality $\implies |r_{ik}| \le 1$.
  - Also measures linear association only; the quotient correlation can be used instead for nonlinear association.

Calculations on matrix data

The descriptive statistics discussed so far are usually organised into arrays, namely:

- Vector of sample means:

$$\bar{x} = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{pmatrix}^\top$$

- Matrix of sample variances and covariances:

$$
\underset{p \times p}{S} =
\begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{pmatrix}
\tag{1.2}
$$

- Matrix of sample correlations:

$$
\underset{p \times p}{R} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}
\tag{1.3}
$$

1.3 Visualisation

Graphical representations

Some simple characteristics of the data are worth studying before the actual multivariate analysis begins:

- drawing scatterplots of the data;
- calculating simple univariate descriptive statistics for each variable;
- calculating sample correlation and covariance coefficients; and
- linking multiple two-dimensional scatterplots.

1.4 Software

SAS

In SAS, the procedures used for this purpose are proc means, proc plot and proc corr. Please study their short descriptions in the included SAS handout.

R

In R, these are implemented in base::rowMeans, base::colMeans, stats::cor, graphics::plot, graphics::pairs, and GGally::ggpairs. Here, the format is PACKAGE::FUNCTION, and you can learn more by running library(PACKAGE) followed by ?FUNCTION.
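
As a cross-check on the formulas in Section 1.2, the following is a minimal sketch that computes the vector of sample means, the covariance matrix S and the correlation matrix R directly from their definitions. It is written in plain Python rather than in SAS or R, purely for illustration. The function names and the small data matrix are hypothetical and do not come from the course materials. The data matrix is stored as in (1.1), with one row per variable and one column per case, and the divisor is n unless stated otherwise.

```python
import math


def mean_vector(X):
    """Sample mean of each variable; X is p x n (rows are variables)."""
    return [sum(row) / len(row) for row in X]


def covariance_matrix(X, use_n_minus_1=False):
    """p x p matrix of sample variances and covariances.

    The divisor is n by default, or n - 1 if use_n_minus_1 is True.
    """
    p, n = len(X), len(X[0])
    divisor = n - 1 if use_n_minus_1 else n
    means = mean_vector(X)
    # Centre each variable by subtracting its sample mean.
    centred = [[x - m for x in row] for row, m in zip(X, means)]
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / divisor
         for k in range(p)]
        for i in range(p)
    ]


def correlation_matrix(X):
    """p x p matrix of sample correlations r_ik = s_ik / (s_i * s_k).

    Assumes every variable has non-zero sample variance.
    """
    S = covariance_matrix(X)
    p = len(S)
    sd = [math.sqrt(S[i][i]) for i in range(p)]
    return [[S[i][k] / (sd[i] * sd[k]) for k in range(p)] for i in range(p)]


# Hypothetical data: p = 3 variables observed on n = 4 cases.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # exactly twice the first variable
    [4.0, 3.0, 2.0, 1.0],   # decreases as the first variable increases
]

x_bar = mean_vector(X)
S = covariance_matrix(X)
R = correlation_matrix(X)

print("means:", x_bar)
print("S:", S)
print("R:", R)
```

In this toy data the second and third variables are exact linear functions of the first, so the corresponding correlations sit at the bounds +1 and -1 (up to floating-point rounding), consistent with |r_ik| ≤ 1.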