STAT3017/7017 - Big Data Statistics - Assessment 3

Assessment 3

Due by Monday 19 September 2022 09:00

An important problem in multivariate analysis is the test of sphericity of the data, that is, the null hypothesis $H_0 : \Sigma = \sigma^2 I_p$ where $\sigma^2$ is unspecified. This hypothesis expresses the fact that the errors are cross-sectionally uncorrelated (independent if the data are normally distributed) and have the same variance (homoscedasticity). Clearly, data sampled from a multivariate normal $N_p(\mu, \Sigma)$ would exhibit sphericity as the density function is constant on the ellipsoids

$$(x - \mu)^T \Sigma^{-1} (x - \mu) = k$$

for every positive value of $k$ and $x \in \mathbb{R}^p$. A general class of distributions with this property is the class of elliptical distributions. A random vector $x$ with zero mean follows an elliptical distribution if (and only if) it has the stochastic representation

$$x = \xi A u, \tag{$\star$}$$

where the matrix $A \in \mathbb{R}^{p \times p}$ is nonrandom with $\operatorname{rank}(A) = p$, $\xi \ge 0$ is a random variable representing the radius of $x$, and $u \in \mathbb{R}^p$ is the random direction, which is independent of $\xi$ and uniformly distributed on the unit sphere $S^{p-1}$ in $\mathbb{R}^p$, denoted by $u \sim \mathrm{Unif}(S^{p-1})$.

Question 1 [8 marks]

(a)[2] Write a function runifsphere(n,p) that samples n observations from the distribution $\mathrm{Unif}(S^{p-1})$ using the fact that if $z \sim N_p(0, I_p)$ then $z/\|z\| \sim \mathrm{Unif}(S^{p-1})$. Check your results as follows: (1) set p = 10, n = 100 and show that the (Euclidean) norm of each observation is equal to 1; (2) generate a scatter plot in the case p = 2, n = 500 to show that the samples lie on a circle.
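The normalisation trick in part (a) can be sketched as follows. This is a minimal illustration in Python using only the standard library (the assessment itself expects R); the optional `rng` argument is an assumption added here for reproducibility and is not part of the requested signature.

```python
import math
import random


def runifsphere(n, p, rng=None):
    """Draw n points uniformly from the unit sphere S^{p-1} in R^p."""
    rng = rng or random.Random()
    samples = []
    for _ in range(n):
        # Standard normal vector z ~ N_p(0, I_p).
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        # Dividing z by its Euclidean norm projects it onto the unit sphere.
        samples.append([v / norm for v in z])
    return samples


if __name__ == "__main__":
    points = runifsphere(100, 10, random.Random(1))
    worst = max(abs(math.sqrt(sum(v * v for v in x)) - 1.0) for x in points)
    print("largest deviation of the norm from 1:", worst)
```

The check at the bottom mirrors step (1) of the question: every row should have norm 1 up to floating-point rounding.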

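(b)[2] A classic statistic for testing sphericity (called John's test), proposed in [A] and [B], is

$$U = \frac{1}{p} \operatorname{tr}\left[\left(\frac{S_n}{(1/p)\operatorname{tr} S_n} - I_p\right)^2\right] = \frac{(1/p)\operatorname{tr} S_n^2}{\left((1/p)\operatorname{tr} S_n\right)^2} - 1,$$

where $S_n$ denotes the sample covariance matrix. It is shown there that when $p$ is fixed and $n \to \infty$, under the null hypothesis, $\frac{np}{2} U \xrightarrow{d} \chi^2_\rho$ with $\rho := \frac{1}{2} p(p + 1) - 1$. Perform a simulation to show that $\frac{np}{2} U$ is distributed like $\chi^2_\rho$ under the null hypothesis in the case of $n = 5000$ observations, $p = 5$, and with data generated from $N_p(0, I_p)$.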
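As a rough illustration of how the statistic in part (b) can be evaluated, the sketch below computes $U$ from an $n \times p$ data matrix using the second form of the formula. It is written in Python with the standard library only (the assessment expects R). The function name `johns_u` is hypothetical, and centring the data with divisor $n$ is an assumption; $U$ is unchanged when $S_n$ is rescaled, so the choice of divisor does not affect the value.

```python
import random


def johns_u(data):
    """John's statistic U for an n x p data matrix given as a list of rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance S_n (divisor n; U is invariant to rescaling S_n).
    s = [[sum(r[i] * r[j] for r in centred) / n for j in range(p)]
         for i in range(p)]
    trace_s = sum(s[i][i] for i in range(p))
    # tr(S_n^2) equals the sum of squared entries because S_n is symmetric.
    trace_s2 = sum(s[i][j] ** 2 for i in range(p) for j in range(p))
    return (trace_s2 / p) / (trace_s / p) ** 2 - 1.0


if __name__ == "__main__":
    rng = random.Random(2022)
    n, p = 2000, 5
    data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # One draw of (np/2) U, to be compared with chi-square on p(p+1)/2 - 1 d.o.f.
    print(n * p / 2 * johns_u(data))
```

Repeating the draw many times and comparing the empirical distribution with the $\chi^2_\rho$ density gives the simulation the question asks for.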

(c)[2] Check the impact on the distribution of $\frac{np}{2} U$ when the data are sampled from a double exponential distribution (i.e., a particular case of an elliptical distribution). Such data can be generated using (⋆) with $\xi \sim \mathrm{Gamma}(p, 1)$ and $A = I_p$.
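One way to read the recipe in part (c) is sketched below: draw a uniform direction, draw an independent Gamma radius, and multiply. This is a Python standard-library sketch (the assessment expects R); the function name `relliptical_gamma` is hypothetical, and Gamma(p, 1) is taken to mean shape p and scale 1.

```python
import math
import random


def relliptical_gamma(n, p, rng=None):
    """Draw n vectors x = xi * u with xi ~ Gamma(p, 1), u ~ Unif(S^{p-1}), A = I_p."""
    rng = rng or random.Random()
    samples = []
    for _ in range(n):
        # Uniform direction u obtained by normalising a standard normal vector.
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        # Radius xi, drawn independently of the direction (shape p, scale 1).
        xi = rng.gammavariate(p, 1.0)
        samples.append([xi * v / norm for v in z])
    return samples
```

Because $A = I_p$, the Euclidean norm of each sample equals its radius $\xi$, which gives a quick sanity check: the average norm should be close to $p$.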

(d)[2] Implement a hypothesis test for sphericity ($H_0 : \Sigma = \sigma^2 I_p$) using John's test. Plot its empirical size and power in the case that the data are normal (as in part (b)) and in the case that the data are double exponential (as in part (c)).

Dale Roberts - Australian National University

Last updated: September 2, 2022


Question 2 [6 marks]

Recently, a few research papers have considered high-dimensional sample covariance matrices in the case where the data are sampled from an elliptical distribution.

(a)[2] Have a look at the paper [C] and consider Theorem 2.2, Eq. (2.10), and the notation used there (for all the following terms in this question). Perform a simulation experiment to examine the fluctuations of $\hat\beta_{n1}$ and $\hat\beta_{n2}$. In the experiment, take $H_p = \frac{1}{2}\delta_1 + \frac{1}{2}\delta_2$ and choose the distribution $\xi \sim k_1 \mathrm{Gamma}(p, 1)$ with $k_1 = 1/\sqrt{p + 1}$. Set the dimensions to be $p = 200$ and $n = 400$. Choose the number of simulations based on the computational power of your machine. Similar to Figure 1 in [C], use a QQ-plot to show normality.

(b)[2] Unfortunately, the results of [C] do not cover all elliptical distributions due to a moment condition on the distribution; see Table 1 in [C]. The results in [D] extend them to more general elliptical distributions such as multivariate Gaussian mixtures¹. A p-dimensional vector $x \in \mathbb{R}^p$ is a multivariate Gaussian mixture with $k$ subpopulations if its density function has the form

$$f(x) = \sum_{j=1}^{k} p_j \, \phi(x; \mu_j, \Sigma_j),$$

where $(p_j)$ are the $k$ mixing weights and $\phi(\cdot\,; \mu_j, \Sigma_j)$ denotes the density function of the $j$th subpopulation with mean vector $\mu_j$ and covariance $\Sigma_j$. In the case where $\mu_1 = \mu_2 = \cdots = \mu_k = 0 \in \mathbb{R}^p$ and $\Sigma_j = v_j \Sigma$ for some $v_j > 0$ with $j = 1, \ldots, k$, write an R function to sample from such a distribution using the representation from Eq. (11) in [D].
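For orientation only, the mixture density above with zero means and $\Sigma_j = v_j \Sigma$ can be sampled directly by first choosing a subpopulation and then scaling a Gaussian draw. The sketch below does this in Python with the standard library (the assessment expects R). It is a plain component-sampling approach and is not claimed to be the representation in Eq. (11) of [D]. The function name `rscalemixture` and the argument `a` (a matrix with $A A^T = \Sigma$) are assumptions made for illustration.

```python
import math
import random


def rscalemixture(n, weights, scales, a, rng=None):
    """Draw n vectors from sum_j weights[j] * N_p(0, scales[j] * A A^T)."""
    rng = rng or random.Random()
    p = len(a)
    samples = []
    for _ in range(n):
        # Pick the scale v_j of subpopulation j with probability weights[j].
        v = rng.choices(scales, weights=weights, k=1)[0]
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # x = sqrt(v_j) * A z has covariance v_j * A A^T.
        az = [sum(a[i][k] * z[k] for k in range(p)) for i in range(p)]
        samples.append([math.sqrt(v) * c for c in az])
    return samples
```

With weights (0.5, 0.5), scales (1, 9) and $\Sigma = I_2$, each coordinate should have variance close to $0.5 \cdot 1 + 0.5 \cdot 9 = 5$, which is a simple check on the sampler.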

(c)[2] Using your code from (b), perform a simulation experiment to simulate the fluctuations of $\hat\beta_2$ under a Gaussian scale mixture model where the variable $\xi$ has a discrete distribution with two mass points, $\mathbb{P}(\xi = 1.8\sqrt{p}) = 0.8$ and $\mathbb{P}(\xi = 1.5\sqrt{p}) = 0.2$. Consider the cases (i) $p = 100$, $n = 150$ and (ii) $p = 600$, $n = 900$. In both cases, plot a histogram of the distribution of $\hat\beta_2$ against the theoretical limiting density and also a QQ-plot similar to Figure 1 in [D]. Note: this is the experiment just above Section 3 in [D].

References

[A] John (1971). Some optimal multivariate tests. Biometrika.

[B] John (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika.

[C] Hu, Li, Liu, Zhou (2019). High-dimensional covariance matrices in elliptical distributions with application to spherical test. Annals of Statistics.

[D] Zhang, Hu, Li (2022). CLT for linear spectral statistics of high-dimensional sample covariance matrices in elliptical distributions. Journal of Multivariate Analysis.

¹ Recall I mentioned in Lecture 1 that one difficulty in big datasets is the presence of multiple subpopulations.
