STAT3017/7017 - Big Data Statistics - Assessment 3

Due by Monday 19 September 2022 09:00

An important problem in multivariate analysis is the test of sphericity of the data, that is, the null hypothesis $H_0 : \Sigma = \sigma^2 I_p$ where $\sigma^2$ is unspecified. This hypothesis expresses the fact that the errors are cross-sectionally uncorrelated (independent if the data are normally distributed) and have the same variance (homoscedasticity). Clearly, data sampled from a multivariate normal $N_p(\mu, \Sigma)$ would exhibit sphericity as the density function is constant on the ellipsoids

$$(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) = k$$

for every positive value of $k$ and $\mathbf{x} \in \mathbb{R}^p$. A general class of distributions with this property is the class of elliptical distributions. A random vector $\mathbf{x}$ with zero mean follows an elliptical distribution if (and only if) it has the stochastic representation

$$\mathbf{x} = \xi A \mathbf{u}, \qquad (\star)$$

where the matrix $A \in \mathbb{R}^{p \times p}$ is nonrandom with $\operatorname{rank}(A) = p$, $\xi \ge 0$ is a random variable representing the radius of $\mathbf{x}$, and $\mathbf{u} \in \mathbb{R}^p$ is the random direction, which is independent of $\xi$ and uniformly distributed on the unit sphere $S^{p-1}$ in $\mathbb{R}^p$, denoted by $\mathbf{u} \sim \mathrm{Unif}(S^{p-1})$.

Question 1 [8 marks]

(a) [2] Write a function runifsphere(n,p) that samples $n$ observations from the distribution $\mathrm{Unif}(S^{p-1})$ using the fact that if $\mathbf{z} \sim N_p(0, I_p)$ then $\mathbf{z}/\|\mathbf{z}\| \sim \mathrm{Unif}(S^{p-1})$. Check your results by: (1) setting $p = 10$, $n = 100$ and showing that the (Euclidean) norm of each observation is equal to 1, (2) generating a scatter plot in the case $p = 2$, $n = 500$ to show that the samples lie on a circle.

(b) [2] A classic statistic for testing sphericity (called John's test), proposed in [A] and [B], is

$$U = \frac{1}{p} \operatorname{tr}\left[ \left( \frac{\mathbf{S}_n}{(1/p) \operatorname{tr} \mathbf{S}_n} - I_p \right)^2 \right] = \frac{(1/p) \operatorname{tr} \mathbf{S}_n^2}{\left( (1/p) \operatorname{tr} \mathbf{S}_n \right)^2} - 1,$$

where it is shown that when $p$ is fixed and $n \to \infty$, under the null hypothesis, it holds that

$$\frac{np}{2} U \xrightarrow{d} \chi^2_\rho \quad \text{with} \quad \rho := \tfrac{1}{2} p(p + 1) - 1.$$

Perform a simulation to show that $\frac{np}{2} U$ is distributed like $\chi^2_\rho$ under the null hypothesis in the case of $n = 5000$ observations, $p = 5$, and with data generated from $N_p(0, I_p)$.
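The two ingredients in parts (a) and (b), normalising Gaussian vectors and evaluating John's statistic, can be sketched in R as follows. This is a minimal illustration under stated assumptions, not a model answer. It assumes the matrix in the formula for U is the sample covariance matrix and computes it as X^T X / n without centring, because the simulated data have zero mean by construction. The helper name john_u is hypothetical, and runifsphere is the name requested in part (a).

```r
# Sketch only: draw n points uniformly on the unit sphere S^{p-1}
# by normalising standard Gaussian vectors, as described in part (a).
runifsphere <- function(n, p) {
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)  # rows are N_p(0, I_p) draws
  z / sqrt(rowSums(z^2))                         # divide each row by its Euclidean norm
}

# John's statistic U from an n x p data matrix with zero-mean rows.
# Assumption: the covariance estimate is X^T X / n (no centring).
john_u <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  s <- crossprod(x) / n                          # p x p covariance estimate
  # For symmetric s, tr(s %*% s) equals the sum of squared entries.
  (sum(s^2) / p) / (sum(diag(s)) / p)^2 - 1
}

set.seed(1)
u <- runifsphere(100, 10)
range(sqrt(rowSums(u^2)))          # both ends should equal 1 up to rounding error

x <- matrix(rnorm(5000 * 5), nrow = 5000)
stat <- 5000 * 5 / 2 * john_u(x)   # a single draw of (np/2) U under the null
```

Repeating the last two lines many times and comparing the draws of stat with a chi-squared distribution on p(p + 1)/2 - 1 degrees of freedom is one way to set up the simulation that part (b) asks for.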
(c) [2] Check the impact on the distribution of $\frac{np}{2} U$ when the data are sampled from a double exponential distribution (i.e., a particular case of an elliptical distribution). Such data can be generated using $(\star)$ with $\xi \sim \mathrm{Gamma}(p, 1)$ and $A = I_p$.

(d) [2] Implement a hypothesis test for sphericity ($H_0 : \Sigma = \sigma^2 I_p$) using John's test. Plot its empirical size and power in the case that the data are normal (as per part (b)) and in the case that the data are double exponential (as per part (c)).

Dale Roberts - Australian National University
Last updated: September 2, 2022

Question 2 [6 marks]

Recently, a few research papers have considered high-dimensional sample covariance matrices in the case where the data are sampled from an elliptical distribution.

(a) [2] Have a look at the paper [C], consider Theorem 2.2, Eq. (2.10), and the notation used (for all the following terms in this question). Perform a simulation experiment to examine the fluctuations of $\hat{\beta}_{n1}$ and $\hat{\beta}_{n2}$. In the experiment, take $H_p = \frac{1}{2}\delta_1 + \frac{1}{2}\delta_2$ and choose the distribution of $\xi \sim k_1 \mathrm{Gamma}(p, 1)$ with $k_1 = 1/\sqrt{p + 1}$. Set the dimensions to be $p = 200$ and $n = 400$. Choose the number of simulations based on the computational power of your machine. Similar to Figure 1 in [C], use a QQ-plot to show normality.

(b) [2] Unfortunately, the results of [C] do not cover all elliptical distributions due to a moment condition on the distribution; see Table 1 in [C]. The results in [D] extend them to more general elliptical distributions such as multivariate Gaussian mixtures¹. A $p$-dimensional vector $\mathbf{x} \in \mathbb{R}^p$ is a multivariate Gaussian mixture with $k$ subpopulations if its density function has the form

$$f(\mathbf{x}) = \sum_{j=1}^{k} p_j \, \varphi(\mathbf{x}; \mu_j, \Sigma_j),$$

where $(p_j)$ are the $k$ mixing weights and $\varphi(\cdot; \mu_j, \Sigma_j)$ denotes the density function of the $j$th subpopulation with mean vector $\mu_j$ and covariance $\Sigma_j$.
Consider the case where $\mu_1 = \mu_2 = \cdots = \mu_k = 0 \in \mathbb{R}^p$ and $\Sigma_j = v_j \Sigma$ for some $v_j > 0$ with $j = 1, \ldots, k$. Write an R function to sample from such a distribution using the representation from Eq. (11) in [D].

(c) [2] Using your code from (b), perform a simulation experiment to simulate the fluctuations of $\hat{\beta}_2$ under a Gaussian scale mixture model where the variable $\xi$ has a discrete distribution with two mass points, $\mathbb{P}(\xi = 1.8\sqrt{p}) = 0.8$ and $\mathbb{P}(\xi = 1.5\sqrt{p}) = 0.2$. Consider the cases: (i) $p = 100$, $n = 150$; (ii) $p = 600$, $n = 900$. In both cases, plot a histogram of the distribution of $\hat{\beta}_2$ against the theoretical limiting density, and also a QQ-plot similar to Figure 1 in [D]. Note: this is the experiment just above Section 3 in [D].

References

[A] John (1971). Some optimal multivariate tests. Biometrika.
[B] John (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika.
[C] Hu, Li, Liu, Zhou (2019). High-dimensional covariance matrices in elliptical distributions with application to spherical test. Annals of Statistics.
[D] Zhang, Hu, Li (2022). CLT for linear spectral statistics of high-dimensional sample covariance matrices in elliptical distributions. Journal of Multivariate Analysis.

¹ Recall I mentioned in Lecture 1 that one difficulty in big datasets is the presence of multiple subpopulations.
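For the zero-mean scale-mixture case in Question 2(b), the definition of the mixture density suggests a direct sampler: pick a subpopulation label with the mixing weights, then draw from that subpopulation's normal distribution. The R sketch below does this and is offered only as a cross-check. It is not the Eq. (11) representation from [D], which is not reproduced in this handout. The function name rscalemix and the argument names w (mixing weights, written $p_j$ in the text), v and Sigma are hypothetical.

```r
# Sketch only: n draws from a zero-mean Gaussian scale mixture with
# mixing weights w[j] and component covariances v[j] * Sigma.
# Assumption: labels are sampled directly; this is NOT Eq. (11) of [D].
rscalemix <- function(n, w, v, Sigma) {
  p <- nrow(Sigma)
  j <- sample(length(w), size = n, replace = TRUE, prob = w)  # component labels
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)               # N_p(0, I_p) rows
  r <- chol(Sigma)                                            # Sigma = t(r) %*% r
  sqrt(v[j]) * (z %*% r)                                      # row i ~ N_p(0, v[j[i]] * Sigma)
}

set.seed(1)
x <- rscalemix(2000, w = c(0.8, 0.2), v = c(1, 4), Sigma = diag(3))
dim(x)
```

With these illustrative weights and scales, each coordinate has variance 0.8 * 1 + 0.2 * 4 = 1.6, which gives a quick sanity check on the output via mean(x^2).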