STAT 460/560 + 461/561 STATISTICAL INFERENCE I & II
2019/2020, Terms I & II
Jiahua Chen and Ruben Zamar
© Department of Statistics, University of British Columbia

Contents

1 Some basics
   1.1 Discipline of Statistics
   1.2 Probability and Statistics models
   1.3 Statistical inference
   1.4 Assignment problems

2 Normal distributions
   2.1 Uni- and Multivariate normal
   2.2 Standard Chi-square distribution
   2.3 Non-central chi-square distribution
   2.4 Cochran Theorem
   2.5 F- and t-distributions
   2.6 Examples
   2.7 Assignment problems

3 Exponential distribution families
   3.1 One parameter exponential distribution family
   3.2 The multiparameter case
   3.3 Other properties
   3.4 Assignment problems

4 Optimality criteria of point estimation
   4.1 Point estimator and some optimality criteria
   4.2 Uniformly minimum variance unbiased estimator
   4.3 Information inequality
   4.4 Other desired properties of a point estimator
   4.5 Consistency and asymptotic normality
   4.6 Assignment problems

5 Approaches of point estimation
   5.1 Method of moments
   5.2 Maximum likelihood estimation
   5.3 Estimating equation
   5.4 M-estimation
   5.5 L-estimator
   5.6 Assignment problems

6 Maximum likelihood estimation
   6.1 MLE examples
   6.2 Newton-Raphson algorithm
   6.3 EM-algorithm
   6.4 EM-algorithm for finite mixture models
      6.4.1 Data examples
   6.5 EM-algorithm for finite mixture models repeated
   6.6 Assignment problems

7 Properties of MLE
   7.1 Trivial consistency
   7.2 Trivial consistency for one-dimensional θ
   7.3 Asymptotic normality of MLE after the consistency is established
   7.4 Asymptotic efficiency, super-efficient, one-step update scheme
   7.5 Assignment problems

8 Analysis of regression models
   8.1 Least absolute deviation and least squares estimators
   8.2 Linear regression model
   8.3 Local kernel polynomial method
   8.4 Spline method
   8.5 Cubic spline
   8.6 Smoothing spline
   8.7 Effective number of parameters and the choice of λ
   8.8 Assignment problems

9 Bayes method
   9.1 An artificial example
   9.2 Classical issues related to Bayes analysis
   9.3 Decision theory
   9.4 Some comments
   9.5 Assignment problems

10 Monte Carlo and MCMC
   10.1 Monte Carlo Simulation
   10.2 Biased or importance sampling
   10.3 Rejective sampling
   10.4 Markov chain Monte Carlo
      10.4.1 Discrete time Markov chain
   10.5 MCMC: Metropolis sampling algorithms
   10.6 The Gibbs samplers
   10.7 Relevance to Bayes analysis
   10.8 Assignment problems

11 More on asymptotic theory
   11.1 Modes of convergence
   11.2 Convergence in distribution
   11.3 Stochastic Orders
      11.3.1 Application of stochastic orders

12 Hypothesis test
   12.1 Null hypothesis
   12.2 Alternative hypothesis
   12.3 Pure significance test and p-value
   12.4 Issues related to p-value
   12.5 General notion of statistical hypothesis test
   12.6 Randomized test
   12.7 Three ways to characterize a test

13 Uniformly most powerful test
   13.1 Simple null and alternative hypothesis
   13.2 Making more from N-P lemma
   13.3 Monotone likelihood ratio
   13.4 Assignment problems

14 Pushing Neyman–Pearson Lemma Further
   14.1 One parameter exponential family
   14.2 Two-sided alternatives
   14.3 Unbiased test
      14.3.1 Existence of UMPU tests
   14.4 UMPU for normal models
   14.5 Assignment problems

15 Locally most powerful test
   15.1 Score test and its local optimality
   15.2 General score test
   15.3 Implementation remark
   15.4 Assignment problems
   15.5 Assignment problems

16 Likelihood ratio test
   16.1 Likelihood ratio test: as a pure procedure
   16.2 Wilks Theorem under regularity conditions
   16.3 Asymptotic chisquare of LRT statistic
   16.4 Assignment problems
   16.5 Assignment problems

17 Likelihood with vector parameters
   17.1 Asymptotic normality of MLE after the consistency is established
   17.2 Asymptotic chisquare of LRT for composite hypotheses
   17.3 Asymptotic chisquare of LRT: one-step further
      17.3.1 Some notational preparations
   17.4 The most general case: final step
   17.5 Statistical application of these results
   17.6 Assignment problems
   17.7 Assignment problems

18 Wald and Score tests
   18.1 Wald test
      18.1.1 Variations of Wald test in the aspect of Fisher information
      18.1.2 Variations of Wald test in the aspect of H0
      18.1.3 Variations of Wald test in the aspect of H0
   18.2 Score Test
   18.3 Power and consistency
   18.4 Assignment problems
   18.5 Assignment problems

19 Tests under normality
   19.1 One-sample problem under normality
   19.2 Two-sample problem under normality assumption
   19.3 Test for equal mean under equal variance assumption
   19.4 Test for equal mean without equal variance assumption
   19.5 Assignment problems

20 Non-parametric tests
   20.1 One-sample sign test
   20.2 Sign test for paired observations
   20.3 Wilcoxon signed-rank test
   20.4 Two-sample permutation test
   20.5 Kolmogorov-Smirnov and Cramér-von Mises tests
   20.6 Pearson's goodness-of-fit test
   20.7 Fisher's exact test
   20.8 Assignment problems

21 Confidence intervals or regions
   21.1 Confidence intervals based on hypothesis test
   21.2 Confidence interval by pivotal quantities
   21.3 Likelihood intervals
   21.4 Intervals based on asymptotic distribution of θ̂
   21.5 Bayes Interval
   21.6 Prediction intervals
   21.7 Hypothesis test and confidence region
   21.8 Assignment problems

22 Empirical likelihood
   22.1 Definition of the empirical likelihood
   22.2 Profile likelihood
   22.3 Large sample properties
   22.4 Likelihood ratio function
   22.5 Numerical computation
   22.6 Empirical likelihood applied to estimating functions
   22.7 Adjusted empirical likelihood
   22.8 Assignment problems

23 Resampling methods
   23.1 Problems addressed by resampling
   23.2 Resampling procedures
   23.3 Bias correction
   23.4 Variance estimation
   23.5 The cumulative distribution function
   23.6 Recipes for confidence limit
   23.7 Implementation based on resampling
   23.8 A word of caution
   23.9 Assignment problems
24 Multiple comparison
   24.1 Analysis of variance for one-way layout
   24.2 Multiple comparison
   24.3 The Bonferroni Method
   24.4 Tukey Method
   24.5 False discovery rate
   24.6 Method of Benjamini and Hochberg
   24.7 How to apply this principle?
   24.8 Theory and proof

Chapter 1

Some basics

1.1 Discipline of Statistics

Statistics is a discipline that serves other scientific disciplines, though statistics itself may not be considered by many to be a branch of science. A scientific discipline constantly develops theories to describe how nature works. These theories are falsified whenever their predictions contradict the observations. Based on these theories and hypotheses, scientists form a model for the natural world, and the model is then used to predict what happens in nature under new circumstances. Scientific experiments are constantly designed to find evidence that may contradict the prediction of the proposed model, aiming at DISPROVING the hypotheses behind the model/theory. If a theory is able to make useful predictions and we fail to find contradicting evidence, it gains broad acceptance. We may then temporarily consider it "the truth". Even if a model/theory does not give a perfect prediction, if its prediction is precise enough for practical purposes and it is much simpler than a more precise model/theory, we tend to retain it as a working model. I regard Newton's laws as such an example, compared with Einstein's more elaborate relativity.
If a theory does not provide any prediction that can potentially be disproved by some experiment, then it is not a scientific theory. Religious theories form a rich group of such examples.

Statistics is, in a way, a branch of mathematics. It does not model nature. For example, it does not claim that when a fair die is rolled, the probability of observing 1 is 1/6. Rather, it claims that if the probability of observing 1 is 1/6, and if the outcomes of two dice are independent, then the probability of observing (1, 1) is 1/36, and the probability of observing either (1, 2) or (2, 1) is 2/36. If one applies a similar model to the spatial distribution of two electrons, the experimental outcomes may contradict the prediction of this probability model, yet the contradiction does not imply that the statistical theory is wrong. Rather, it implies that the statistical model does not apply to the distribution of the electrons. The moral of this example is that a statistical theory cannot be disproved by physical experiments. Its theories are logical truths, and this makes it unqualified as a scientific discipline in the sense we mentioned earlier.

We should distinguish between the inconsistency of a probability model with the real world, and an inconsistency within our logical derivations. If we err in proving a proposition, that proposition is very likely false within our logical system; the error does not disprove the logical system. We call logically proved propositions theorems. In comparison, the propositions regarded as temporary truths in science are called laws. Of course, we sometimes abuse these terminologies, as in "Law of Large Numbers".

In a scientific investigation, one may not always be able to find clear-cut evidence against a hypothesis. For instance, genetic theory indicates that tall fathers have tall sons in general. Yet there are many factors behind the height of the son.
Suppose we collect 1000 father-son pairs randomly from a human population, and measure their heights as (xi, yi), i = 1, 2, ..., 1000. A regression model of the form

   yi = a + b xi + εi,

with regression coefficients (a, b) and random error εi, can be a useful summary of the data. If the statistical analysis of the data supports the model with some b > 0, then the genetic theory survives the attack. If we have strong evidence to suggest that b is not very different from 0, or may even be negative, then the genetic theory has to be abandoned. In this case, the genetic theory is disproved not by statistics, but by physical experiments (data collected on father-son heights) assisted by statistical analysis. Whatever the outcome of the statistical analysis is, the statistical theory is not falsified. It is the genetic theory that is being put on trial.

1.2 Probability and Statistics models

In scientific investigations, we often quantify the outcomes of an experiment in order to develop a useful model for the real world. An existing scientific theory can often give a precise prediction: water boils at 100 degrees Celsius at sea level on the Earth. In other cases, precise prediction is nearly impossible. For example, scientists still cannot predict when and where the next serious earthquake will occur. There used to be a belief that a yet-to-be-discovered perfect scientific model exists which can explain away all randomness. In the case of earthquakes, a precise prediction might be possible if we knew the exact tensions between the geological structures all around the world, the amount of heat being generated at the core of the earth, the positions of all heavenly bodies, and a lot more. In other words, the claim is that we study randomness only because we are incompetent in science, or because a perfect model is too complicated to be practically useful. This is now believed not to be the case.
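As a quick aside, the least-squares fit of the father-son regression model above is easy to sketch numerically. The following is an illustration only, not part of the notes: the heights are simulated, and the true intercept 100 and slope 0.45 are invented values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical father heights in cm, and son heights from yi = a + b*xi + eps
x = rng.normal(175.0, 7.0, size=n)
y = 100.0 + 0.45 * x + rng.normal(0.0, 5.0, size=n)

# Least-squares estimates: b = S_xy / S_xx, a = ybar - b * xbar
b_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_hat = y.mean() - b_hat * x.mean()
print(a_hat, b_hat)  # b_hat should be close to the true slope 0.45
```

With 1000 pairs, the slope estimate is tight enough that a clearly positive b_hat would be read as the genetic theory surviving the attack.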
The uncertainty principle in quantum theory indicates that randomness might be more fundamental than many of us are willing to accept. It strongly justifies the study of statistics as an "academic discipline".

A probability space is generally denoted as (Ω, B, P). We call Ω the sample space, which is linked to all possible outcomes of an experiment under consideration. The notion of an experiment becomes vague when the real-world problem becomes complex; it is better to follow the mathematical convention and simply assume its existence. B is a σ-algebra. Mathematically, it stands for a collection of subsets of Ω with some desirable properties. We require that it is possible to assign a probability to each subset of Ω that is a member of B without violating some desired rules. How large a probability is assigned to a particular member of B is specified by the rule denoted P.

A random variable (vector) X is a measurable function on Ω. It takes values in R^n if X has length n. It induces a probability space (R^n, B, F), where F is its distribution. In statistics, we consider problems of inferring about F within a pre-specified set of distributions. This set of distributions is called a statistical model, and it is presented as a probability distribution family F, sometimes with additional structures. If the vector X has n components and they are independent and identically distributed (i.i.d.), we use F for the individual distribution, not for the joint distribution. This convention will be clear when we work with specific problems. In this case, we call F the population, defined on (R, B). The components of X are samples from the population F.

When the individual probability distributions in F are conveniently labelled by a subset of R^d, the Euclidean space of dimension d, we say that F is a parametric distribution family. The label is often denoted θ, and the set Θ of all its possible values is called the parameter space.
In applications, we usually only consider parametric models whose probability distributions have a density function with respect to a common σ-finite measure. In such situations, we write

   F = {f(x; θ) : θ ∈ Θ}.

The σ-finite measure is usually the Lebesgue measure, which makes f(x; θ) the commonly referred-to density function. When the σ-finite measure is the counting measure, the density functions are known as probability mass functions. If F is not parameterized, we have a non-parametric model.

Probability theory and statistics. Probability theory studies the properties of stochastic systems: for instance, the convergence property of the empirical distribution based on an i.i.d. sample. Statistical theory aims at inferring about the stochastic system based on (often) an i.i.d. sample from this system: for instance, does the system (population) appear to be a mixture of two more homogeneous subpopulations? Probability theory is the foundation of statistical inference.

Given an inference goal, statisticians may propose many possible approaches. Some approaches may be deemed inferior and dismissed over time. Most approaches have merits that are not completely overshadowed by other approaches. Some statistical techniques are used as standard methods in other disciplines, yet most statisticians have never heard of them. As a statistician, I hope to have the knowledge to understand these approaches, not the knowledge of all statistical approaches.

1.3 Statistical inference

Let X = (X1, X2, ..., Xn) be a random sample from a statistical model F. That is, we assume the components are independent and identically distributed with a distribution which is a member of F. Let their realized values be x = (x1, x2, ..., xn). A statistical inference is to infer about the specific member F of F based on the realized value x.
If we take a single guess of F, the result is a point estimate. If we provide a collection of possible F, the result is (usually) an interval estimate. If we make a judgement on whether a single member or a subset of F contains the "true" distribution, the procedure is called a hypothesis test. In general, in the last case we are required to quantify the strength of the evidence on which the judgement is based. If we partition the model F into several submodels and infer to which submodel F belongs, the procedure is called model selection. In general, for model selection we do not quantify the evidence favouring the selected submodel. This is the difference between "hypothesis test" and "model selection".

Another general category of statistical inference is based on the Bayesian paradigm. The Bayesian approach does not identify any F or any set of F. Instead, it provides a probabilistic judgement on every member or subset of F. The probabilistic judgement is obtained as a conditional distribution, by placing a prior distribution on F and conditioning on observations in the form X = x. We call it the posterior distribution. The final decision is made based on considerations such as minimizing an expected loss.

Definition 1.1. A statistic is a function of the data which does not depend on any unknown parameters. More concretely, the value of a statistic can be evaluated without knowing the values of the unknown parameters in the model.

The sample mean x̄n = n⁻¹(x1 + x2 + ... + xn) is a statistic. However, x̄n − E(X1) is in general not a statistic because it is a function of both the data, through x̄n, and the usually unknown value E(X1). The value of E(X1) often depends on the parameter θ behind F.

Let T(x) be a statistic. We may also regard T(x) as the realized value of T when the realized value of X is x. We may regard T = T(X) as a quantity to be "realized". Since X is random, the outcome of T is also random. The distribution of T(X) is called its sampling distribution. Unfortunately,
Unfortunately, 6 CHAPTER 1. SOME BASICS it is often hard to be completely consistent when we deal with T (X) and T (x). We may have to read between lines to tell which one of the two is under discussion. Since the distribution of X is usually only known up to being a member of F which is often labeled by a parameter θ, the (sample) distribution of T is also only known up to the unknown parameter θ. Definition 1.2. Let T (x) be a statistic. If the conditional distribution of X given T does not depend on unknown parameter values, we say T is a sufficient statistics. When T is sufficient, all information contained in X about θ is contained in T . In this case, one may choose to ignore X but work only on T without loss of any efficiency. Such a simplification is most helpful if T is much simpler than X or it is a substantial reduction of X. Directly verifying the sufficiency of a statistic is often difficult. We gen- erally use factorization theorem to identify sufficient statistics. If the density function of X can be written as f(x; θ) = h(x)g(T (x); θ) for some function h(·) and g(·; ·), then T (x) is sufficient for θ. In some situations, direct verification is not too complex. For example, if X1, X2 are independent Poisson distributed with mean parameter θ. Then the conditional distribution of X1, X2 given T = X1 + X2 are binomial (T , 1/2) which is free from the unknown parameter θ. Hence, T is sufficient for θ. Definition 1.3. Sufficient statistic T (x) is minimum sufficient if T is the function of every other sufficient statistic. A minimum sufficient statistic may still contain some redundancy. If a statistic has the property that none of its non-zero function can have identi- cally 0 expectation, this statistic is called complete. When the requirement is reduced to included only “bounded functions”, then T is called bounded- complete. We have a few more such notions. Definition 1.4. 
A sufficient statistic T(x) is complete if E(g(T)) = 0 under every F ∈ F implies g(·) ≡ 0 almost surely.

In contrast, if the distribution of T does not depend on θ, or equivalently on the specific distribution of X, we say that T is an ancillary statistic.

Definition 1.5. If the distribution of the statistic T(x) does not depend on any parameter values, it is an ancillary statistic.

Example: Suppose X = (X1, ..., Xn) is a random sample from N(θ, 1) with θ ∈ R. Recall that T = X̄ is a complete and sufficient statistic for θ. At the same time, X − T = (X1 − X̄, ..., Xn − X̄) is an ancillary statistic. It does not contain any information about the value of θ. However, it is not completely useless. Under the normality assumption, X − T is multivariate normal. We can study the realized value of X − T to see whether it looks like a realized value from a multivariate normal. If the conclusion is negative, the normality assumption is in serious question. If the validity of a statistical inference heavily depends on normality, such a diagnostic procedure is very important.

Remark: In this example the probability model F consists of all normal distributions with mean θ and known variance σ² = 1. Notationally, F = {N(θ, 1) : θ ∈ R}.

Definition 1.6. If T is a function of both the data X and the parameter θ, but its distribution is not a function of θ, we call T a pivotal quantity.

In the last example, S = X̄ − θ is a pivotal quantity. Note that this claim is made under the assumption that θ is the "true" parameter value of the distribution of X; it is not a dummy variable. This is another common practice in the statistical literature: if not declared, the notation θ is used both as a dummy variable and as the "true" value behind the distribution of the random sample X. This notion also applies to Bayes methods: θ is often regarded as a realized value from its prior distribution, and X is then a sample from the distribution labeled by this "true" value of θ.
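The Poisson sufficiency example above admits a direct numerical check, sketched below (the check itself is not part of the notes). The conditional probability P(X1 = k | X1 + X2 = t), computed from the Poisson joint distribution and the fact that T = X1 + X2 is Poisson(2θ), should coincide with the binomial (t, 1/2) mass function for every value of θ.

```python
import math

def pois_pmf(k, mu):
    # Poisson probability mass function
    return math.exp(-mu) * mu**k / math.factorial(k)

def cond_pmf(k, t, theta):
    # P(X1 = k | X1 + X2 = t) for independent Poisson(theta) X1, X2;
    # the denominator uses T = X1 + X2 ~ Poisson(2*theta)
    return pois_pmf(k, theta) * pois_pmf(t - k, theta) / pois_pmf(t, 2 * theta)

def binom_half(k, t):
    # binomial(t, 1/2) probability mass function
    return math.comb(t, k) * 0.5**t

t = 7
for theta in (0.5, 2.0, 9.3):  # arbitrary theta values for illustration
    for k in range(t + 1):
        # the conditional distribution is free of theta
        assert abs(cond_pmf(k, t, theta) - binom_half(k, t)) < 1e-10
```

The assertion passing for several unrelated values of θ is exactly the point: the conditional distribution given T carries no information about θ.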
Note that the parameter θ is a label of the member F of F in parametric models. It may as well be regarded as a function of F; call it a functional if you please. Any function of F can be regarded as a parameter by the same token. For example, the median of F is a parameter. This works even if F is a popularly used parametric distribution family such as the Poisson.

1.4 Assignment problems

1. Let X1, X2, ..., Xn be a random sample (i.i.d.) from a continuous distribution f(x). Namely, the distribution family F contains all univariate continuous distributions. Let R1, R2, ..., Rn be the rank statistics; that is, R1 is the rank of X1 among the n random variables.

   (a) Show that the vector R = (R1, R2, ..., Rn)^τ is an ancillary statistic and find its distribution.

   (b) What information contained in R might be useful for statistical inference?

2. Let X1, X2, ..., Xn be a random sample (i.i.d.) from N(θ, σ²). Let X̄n and s²n be the sample mean and variance.

   (a) Verify that (X1 − X̄n, ..., Xn − X̄n)/sn is an ancillary statistic.

   (b) Verify by the factorization theorem that (X̄n, s²n) are jointly sufficient.

   (c) Suppose σ = 1 is known. Show that X̄n is complete for θ by definition.

Chapter 2

Normal distributions

Let X be a random variable; namely, a function on a probability space (Ω, B, P). Its randomness is inherited from the probability measure P. By the definition of a random variable,

   {X ≤ t} = {ω : ω ∈ Ω, X(ω) ≤ t}

is a member of B for any real value t. Hence, there is a definitive value F_X(t) = P({X ≤ t}) for any t ∈ R. We refer to F_X(t) as the cumulative distribution function (c.d.f.) of X. Often, we omit the subscript and write it as F(t). Note that t itself is a dummy variable, so it carries no specific meaning other than standing for a real number. In most practices, we use F(x) for the c.d.f. of X. This can lead to confusion: once F(x) is used as the c.d.f. of X, F(y) remains the c.d.f.
of X, not necessarily that of another random variable called Y.

The c.d.f. of a random variable largely determines its randomness properties. This is the basis for forming distribution families: distributions whose c.d.f.s have a specific algebraic form. Of course, there are often physical origins behind the algebraic form. For instance, the success-failure experiment is behind the binomial distribution family.

The uni- and multivariate normal distribution families occupy a special place in classical mathematical statistics. We provide a quick review as follows.

2.1 Uni- and Multivariate normal

A random variable has the standard normal distribution if its density function is given by

   φ(x) = (1/√(2π)) exp(−x²/2).

We generally use Φ(x) = ∫_{−∞}^{x} φ(t) dt to denote the corresponding c.d.f. If X has probability density function

   φ(x; µ, σ) = σ⁻¹ φ((x − µ)/σ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)),

then it has the normal distribution with mean µ and variance σ². We use Φ(x; µ, σ) to denote the corresponding c.d.f. If Z has the standard normal distribution, then X = σZ + µ has the normal distribution with parameters (µ, σ²), which represent the mean and variance. The moment generating function of X is given by

   M_X(t) = exp(µt + σ²t²/2),

which exists for all t ∈ R. The moments of the standard normal Z are E(Z) = 0, E(Z²) = 1, E(Z³) = 0 and E(Z⁴) = 3.

Why is the normal distribution normal? The central limit theorem tells us that if X1, X2, ..., Xn, ... is a sequence of i.i.d. random variables with E(X) = 0 and var(X) = 1, then

   P(n^{−1/2} Σ_{i=1}^{n} Xi ≤ x) → ∫_{−∞}^{x} φ(t) dt

for all x, where φ(t) is the density function of the standard normal distribution (normal with mean 0 and variance 1).

Recall that many distributions we investigated can be viewed as distributions of sums of i.i.d. random variables; hence, when properly scaled as in the central limit theorem, their distributions are well approximated by the normal. These examples include: binomial, Poisson, negative binomial, Gamma.
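The central limit theorem above is easy to see by simulation. The sketch below (an illustration, not part of the notes; the sample sizes are arbitrary choices) standardizes sums of i.i.d. Uniform(0, 1) variables, which have mean 1/2 and variance 1/12, and compares the empirical c.d.f. with Φ at a few points.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, reps = 400, 20000          # terms per sum, number of simulated sums
u = rng.random((reps, n))     # Uniform(0, 1) draws

# Standardize each row sum: subtract n*mean, divide by sqrt(n*variance)
s = (u.sum(axis=1) - n / 2) / sqrt(n / 12)

# Standard normal c.d.f. via the error function
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))

for x in (-1.0, 0.0, 1.5):
    print(x, (s <= x).mean(), Phi(x))  # empirical vs limiting c.d.f.
```

The two columns agree to roughly Monte Carlo accuracy even though the summands are far from normal.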
In general, if the outcome of a random quantity is influenced by numerous factors and none of them plays a determining role, then the sum of their effects is approximately normally distributed. This reasoning is used to support the normality assumption on our "height" distribution, even though none of us ever had a negative height.

Multivariate normal. Let the vector Z = (Z1, Z2, ..., Zd)^τ consist of independent, standard normally distributed components. Their joint density function is given by

   f(z) = (2π)^{−d/2} exp(−z^τ z / 2) = (2π)^{−d/2} exp(−(1/2) Σ_{j=1}^{d} z_j²).

Easily, we have E(Z) = 0 and var(Z) = I_d, the identity matrix. The (joint) moment generating function of Z is given by

   M_Z(t) = exp(t^τ t / 2),

which is in vector form. Let B be a matrix of size m × d and µ be a vector of length m. Then X = BZ + µ is multivariate normally distributed with

   E(X) = µ,  var(X) = BB^τ.

We will use the notation Σ = BB^τ. It is seen that if X is multivariate normally distributed, N(µ, Σ), then its linear function Y = AX + b is also multivariate normally distributed: N(Aµ + b, AΣA^τ). Note that this claim requires neither Σ nor A to have full rank. It also implies that all marginal distributions of a multivariate normal random vector are normally distributed. The converse is not completely true: if all marginal distributions of a random vector are normal, the random vector does not necessarily have a multivariate normal distribution. However, if all linear combinations of X have normal distributions, then the random vector X has a multivariate normal distribution.

When Σ has full rank, N(µ, Σ) has a density function given by

   φ(x; µ, Σ) = (2π)^{−d/2} {det(Σ)}^{−1/2} exp(−(x − µ)^τ Σ^{−1} (x − µ) / 2),

where det(·) is the determinant of a matrix. We use Φ(x; µ, Σ) for the multivariate c.d.f.

Partition of X. Assume that a multivariate normal random vector is partitioned into two parts: X^τ = (X1^τ, X2^τ).
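Before working with the partitioned form, the construction X = BZ + µ with var(X) = BB^τ can be checked by simulation. A minimal sketch, not from the notes; the entries of B and µ are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
B = np.array([[1.0, 0.0, 0.5],
              [0.3, 2.0, 0.0]])   # arbitrary 2 x 3 matrix (m = 2, d = 3)
mu = np.array([1.0, -1.0])        # arbitrary mean vector of length m

Z = rng.standard_normal((3, 100000))  # columns are i.i.d. N(0, I_3) draws
X = B @ Z + mu[:, None]               # each column is one draw of X = BZ + mu

emp_mean = X.mean(axis=1)  # should be close to mu
emp_cov = np.cov(X)        # should be close to B @ B.T
print(emp_mean, emp_cov, B @ B.T, sep="\n")
```

Note that B here is not square, yet X is still (degenerate-free in this case) multivariate normal; no full-rank assumption was used.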
The mean vector and covariance matrix can be partitioned accordingly. In particular, we denote the partition of the mean vector as \mu^\tau = (\mu_1^\tau, \mu_2^\tau) and that of the covariance matrix as
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]

Theorem 2.1. Suppose X^\tau = (X_1^\tau, X_2^\tau) is multivariate normal, N(\mu, \Sigma). Then
(1) X_1 is multivariate N(\mu_1, \Sigma_{11}).
(2) X_1 and X_2 are independent if and only if \Sigma_{12} = 0.
(3) Assume \Sigma_{22} has full rank. Then the conditional distribution of X_1 | X_2 is normal with conditional mean \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2) and variance matrix \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

That is, for multivariate normal random vectors, zero correlation is equivalent to independence. The above result for the conditional distribution is given when \Sigma_{22} has full rank. The situation where \Sigma_{22} does not have full rank can be worked out by removing the redundancy in X_2 before applying the above result.

2.2 Standard Chi-square distribution

We first fix the idea with a definition.

Definition 2.1. Let Z_1, Z_2, \ldots, Z_d be a set of i.i.d. standard normally distributed random variables. The sum of squares
\[
T = Z_1^2 + Z_2^2 + \cdots + Z_d^2
\]
is said to have the chi-square distribution with d degrees of freedom.

For convenience of future discussion, we first put down a simple result without a proof here.

Theorem 2.2. Let Z_1, Z_2, \ldots, Z_d be a set of i.i.d. standard normally distributed random variables. The sum of squares
\[
T = a_1 Z_1^2 + a_2 Z_2^2 + \cdots + a_d Z_d^2
\]
has a chi-square distribution if and only if each of a_1, \ldots, a_d is either 0 or 1.

We use the notation \chi^2_d as a symbol for the chi-square distribution with d degrees of freedom. The above definition is how we understand the chi-square distribution. Yet without seeing its probability density function and related properties, we may only have a superficial understanding. To obtain the density function of T, we may work on the density function of Z_1^2 first. It is seen that
P(Z_1^2 \le x) = P(-\sqrt{x} \le Z_1 \le \sqrt{x})
\[
= \int_{-\sqrt{x}}^{\sqrt{x}} \phi(t)\,dt.
\]
Hence, by taking the derivative with respect to x, we get its p.d.f. as
\[
f_{Z_1^2}(x) = \frac{1}{2\sqrt{\pi}}\Bigl(\frac{x}{2}\Bigr)^{1/2-1}\exp\Bigl(-\frac{x}{2}\Bigr).
\]
This is the density function of a specific Gamma distribution with 1/2 degrees of freedom and scale parameter 2. Because of this and from the properties of the Gamma distribution, we conclude that T has a Gamma distribution with d/2 degrees of freedom and scale parameter 2. Its p.d.f. is given by
\[
f_T(x) = \frac{1}{2\Gamma(d/2)}\Bigl(\frac{x}{2}\Bigr)^{d/2-1}\exp\Bigl(-\frac{x}{2}\Bigr).
\]
Its moment generating function can also be obtained easily:
\[
M_T(t) = \Bigl(\frac{1}{1-2t}\Bigr)^{d/2}.
\]
Note that this function is defined only for t < 1/2. The mean of T is d, and its variance is 2d.

Clearly, if X is N(\mu, \Sigma) of length d and \Sigma has full rank, then W = (X-\mu)^\tau\Sigma^{-1}(X-\mu) has the chi-square distribution with d degrees of freedom.

The cumulative distribution function of the standard chi-square distribution with (virtually) any degrees of freedom has been well investigated. There used to be detailed numerical tables for its quantiles and so on; we have easy-to-use R functions these days. Hence, whenever a statistic is found to have a chi-square distribution, we consider its distribution known.

If A is a symmetric matrix such that AA = A, we say that it is idempotent. In this case, when Z is N(0, I), the distribution of Z^\tau A Z is chi-square with degrees of freedom equaling the trace of A.

2.3 Non-central chi-square distribution

We again first fix the idea with a definition.

Definition 2.2. Let Z_1, Z_2, \ldots, Z_d be a set of i.i.d. standard normally distributed random variables. The sum of squares
\[
T = (Z_1 + \gamma)^2 + Z_2^2 + \cdots + Z_d^2
\]
is said to have the non-central chi-square distribution with d degrees of freedom and non-centrality parameter \gamma^2.

Let T' = (Z_1 - \gamma)^2 + Z_2^2 + \cdots + Z_d^2 with the same \gamma as in the definition. The distribution of T' is the same as the distribution of T. This can be proved as follows. Let W_1 = -Z_1 and W_j = Z_j for j = 2, \ldots, d.
Clearly, T' = (W_1 + \gamma)^2 + W_2^2 + \cdots + W_d^2 and W_1, W_2, \ldots, W_d remain i.i.d. standard normally distributed. Hence, T and T' must have the same distribution. However, T \ne T' when they are regarded as random variables on the same probability space.

The second remark is about the stochastic order of the two distributions. Without loss of generality, \gamma > 0. When d = 1, for any x > 0 we find
\[
P\{(Z_1 + \gamma)^2 \ge x^2\} = 1 - \Phi(x - \gamma) + \Phi(-x - \gamma).
\]
Taking the derivative with respect to \gamma, we get
\[
\phi(x - \gamma) - \phi(-x - \gamma) = \phi(x - \gamma) - \phi(x + \gamma) > 0.
\]
That is, the above probability increases with \gamma over the range \gamma > 0. In other words, (Z_1 + \gamma)^2 is always more likely to take large values than Z_1^2 is.

For convenience, let \chi^2_d and \chi^2_d(\gamma^2) be two random variables with, respectively, central and non-central chi-square distributions with the same degrees of freedom d. We can show that for any x,
\[
P\{\chi^2_d(\gamma^2) \ge x^2\} \ge P\{\chi^2_d \ge x^2\}.
\]
The proof of this result will be left as an exercise.

In data analysis, a statistic or random quantity T often has a central chi-square distribution under one model assumption, say A, but a non-central chi-square distribution under another model assumption, say B. Which model assumption is better supported by the data? Due to the above result, a large observed value of T is supportive of B while a small observed value of T is supportive of A. This provides a basis for hypothesis tests: we set up a threshold value for T so that we accept B when the observed value of T exceeds this value.

Let X be multivariate normal N(\mu, I_d). Then X^\tau X has a non-central chi-square distribution with non-centrality parameter \mu^\tau\mu. This can be proved as follows. Without loss of generality, assume \mu \ne 0. Let A be an orthogonal matrix whose first row equals \mu/\|\mu\|. Let Y = AX and write Y^\tau = (Y_1, Y_2, \ldots, Y_d). Then Y_1' = Y_1 - \|\mu\|, Y_2, \ldots, Y_d are i.i.d. standard normal random variables. Hence,
\[
X^\tau X = Y^\tau Y = (Y_1' + \|\mu\|)^2 + Y_2^2 + \cdots + Y_d^2
\]
has a non-central chi-square distribution with non-centrality parameter \mu^\tau\mu.

As an exercise, please show that if X is multivariate normal N(\mu, \Sigma) with \Sigma of full rank, then Q = X^\tau\Sigma^{-1}X has a non-central chi-square distribution with non-centrality parameter \gamma^2 = \mu^\tau\Sigma^{-1}\mu. It can be verified that
\[
E(Q) = d + \gamma^2; \quad var(Q) = 2(d + 2\gamma^2).
\]
When \Sigma = \sigma^2 I_d, then \sigma^{-2} X^\tau X has a non-central chi-square distribution with d degrees of freedom and non-centrality parameter \gamma^2 = \|\mu\|^2/\sigma^2.

Suppose W_1 and W_2 are two independent non-central chi-square distributed random variables with d_1 and d_2 degrees of freedom and non-centrality parameters \gamma_1^2 and \gamma_2^2. Then W_1 + W_2 is also non-central chi-square distributed, with d_1 + d_2 degrees of freedom and non-centrality parameter \gamma_1^2 + \gamma_2^2.

2.4 Cochran Theorem

We first look into a simple case.

Theorem 2.3. Suppose X is N(0, I_d) and
\[
X^\tau X = X^\tau A X + X^\tau B X = Q_A + Q_B
\]
such that both A and B are symmetric with ranks a and b respectively. If a + b = d, then Q_A and Q_B are independent and have \chi^2_a and \chi^2_b distributions.

Proof: By a standard linear algebra result, there exist an orthogonal matrix R and a diagonal matrix \Lambda such that A = R^\tau\Lambda R. This implies
\[
B = I_d - A = R^\tau(I_d - \Lambda)R,
\]
in which (I_d - \Lambda) is also diagonal. The rank of A equals the number of non-zero entries of \Lambda, and that of B is the number of entries of \Lambda not equalling 1. Since a + b = d, this necessitates that all entries of \Lambda are either 0 or 1. Without loss of generality, \Lambda = diag(1, \ldots, 1, 0, \ldots, 0).

Note that the orthogonal transformation Y = RX makes the entries of Y i.i.d. standard normal. Therefore,
\[
Q_A = Y^\tau\Lambda Y = Y_1^2 + \cdots + Y_a^2,
\]
which has a \chi^2_a distribution. Similarly,
\[
Q_B = Y^\tau(I_d - \Lambda)Y = Y_{a+1}^2 + \cdots + Y_d^2,
\]
which has a \chi^2_b distribution. In addition, they are quadratic forms of disjoint segments of Y. Therefore, they are independent.

Remark: Since X^\tau A X = X^\tau A^\tau X, we have Q_A = X^\tau\{(A + A^\tau)/2\}X, in which (A + A^\tau)/2 is symmetric.
Hence, we do not lose much generality by assuming both A and B are symmetric. The result does not hold without the symmetry assumption, though I cannot find references. Try
\[
A = \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.
\]
Under the symmetry assumption, take it as a simple exercise to show that if
\[
X^\tau X = X^\tau A_1 X + \cdots + X^\tau A_p X = \sum_{j=1}^p Q_j
\]
such that rank(A_1) + \cdots + rank(A_p) = d, then the Q_j's are independent and each has a chi-square distribution with rank(A_j) degrees of freedom.

2.5 F- and t-distributions

If X and Y are independent and have chi-square distributions with degrees of freedom m and n respectively, then the distribution of
\[
F = \frac{X/m}{Y/n}
\]
is called the F-distribution with m and n degrees of freedom. Note that X/(X + Y) = (1 + Y/X)^{-1} has a Beta distribution. Thus, there is a very simple relationship between the F-distribution and the Beta distribution.

t-distribution. Suppose X has the standard normal distribution and S^2 has a chi-square distribution with n degrees of freedom. Further, when X and S^2 are independent,
\[
t = \frac{X}{\sqrt{S^2/n}}
\]
has the t-distribution with n degrees of freedom. When n = 1, this distribution reduces to the famous Cauchy distribution, none of whose moments exist. When n is large, S^2/n converges to 1. Thus, the t-distribution is then not very different from the standard normal distribution. A general consensus is that when n \ge 20, it is good enough to regard the t-distribution with n degrees of freedom as the standard normal in statistical inference.

2.6 Examples

In this section, we give a few commonly used distributional results in mathematical statistics. The two examples are generally referred to as one-sample and two-sample problems.

Example 2.1. Consider the normal location-scale model in which, for i = 1, \ldots, n, we have
\[
Y_i = \mu + \sigma\epsilon_i
\]
such that \epsilon_1, \ldots, \epsilon_n are i.i.d. N(0, 1). Let Y be the corresponding vector of the Y_i, which is multivariate normal with mean vector \mu 1, where 1^\tau = (1, 1, \ldots, 1), and covariance matrix \sigma^2 I. Similarly, we use \epsilon for the vector of the \epsilon_i.
The sample variance can be written as
\[
s_n^2 = (n-1)^{-1} Y^\tau(I - n^{-1}11^\tau)Y = (n-1)^{-1}\sigma^2\, \epsilon^\tau(I - n^{-1}11^\tau)\epsilon.
\]
The key matrix (I - n^{-1}11^\tau) is idempotent with trace n - 1. Hence, other than the factor (n-1)^{-1}\sigma^2, the sample variance has a chi-square distribution with n - 1 degrees of freedom.

In addition, the sample mean \bar{Y}_n = n^{-1}1^\tau Y is uncorrelated with (I - n^{-1}11^\tau)Y. Hence, they are independent. This further implies that the sample mean and the sample variance are independent.

Example 2.2. Consider the classical two-sample problem in which we have two i.i.d. samples from normal distributions: X^\tau = (X_1, X_2, \ldots, X_m) are i.i.d. N(\mu_1, \sigma^2) and Y^\tau = (Y_1, Y_2, \ldots, Y_n) are i.i.d. N(\mu_2, \sigma^2). We are often interested in examining the possibility that \mu_1 = \mu_2. Let \bar{X}_m and \bar{Y}_n be the two sample means. It is seen that
\[
RSS_0 = \frac{mn}{m+n}\{\bar{X}_m - \bar{Y}_n\}^2
\]
is a quadratic form that represents the variation between the two samples. At the same time,
\[
RSS_1 = \sum_{i=1}^m \{X_i - \bar{X}_m\}^2 + \sum_{j=1}^n \{Y_j - \bar{Y}_n\}^2
\]
is a quadratic form that represents the internal variation within the two samples. It is natural to compare the relative size of RSS_0 against RSS_1 to decide whether the two means are significantly different. For this purpose, it is useful to know their sampling distributions and independence relationship. It is easy to directly verify that RSS_0 and RSS_1 are independent and both have chi-square distributions. We may also find
\[
X^\tau X + Y^\tau Y = RSS_0 + RSS_1 + (m+n)^{-1}(X^\tau 1_m + Y^\tau 1_n)(1_m^\tau X + 1_n^\tau Y).
\]
The ranks of the three quadratic forms on the right-hand side are 1, m + n - 2 and 1, which sum to m + n. The decomposition remains the same when we replace X by (X - \mu)/\sigma and Y by (Y - \mu)/\sigma. Hence, when \mu_1 = \mu_2 = \mu, RSS_0 and RSS_1 (after being scaled by \sigma^2) are independent and chi-square distributed by the Cochran Theorem. This further implies that
\[
F = \frac{RSS_0}{RSS_1/(m+n-2)}
\]
has the F-distribution with degrees of freedom 1 and m + n - 2.
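As an illustration not taken from the notes, the claims of Example 2.2 can be checked by simulation under \mu_1 = \mu_2 and \sigma = 1. The sample sizes m = n = 10 are arbitrary; recall that an F_{1,18} random variable has mean 18/16 = 1.125.

```python
import numpy as np

# Sketch: under mu_1 = mu_2 and sigma = 1, F = RSS0 / {RSS1/(m+n-2)}
# should follow the F-distribution with 1 and m+n-2 degrees of freedom.
rng = np.random.default_rng(1)
m, n, reps = 10, 10, 100_000
X = rng.standard_normal((reps, m))
Y = rng.standard_normal((reps, n))
RSS0 = (m * n / (m + n)) * (X.mean(axis=1) - Y.mean(axis=1)) ** 2
RSS1 = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
     + ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
F = RSS0 / (RSS1 / (m + n - 2))

# E(F_{1,18}) = 18/16 = 1.125; the simulated mean should be close to it,
# and RSS1 should average near its chi-square mean m+n-2 = 18.
print(round(float(F.mean()), 2), round(float(RSS1.mean()), 1))
```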
The F-distribution conclusion is the basis for the analysis of variance, the two-sample t-test and so on.

2.7 Assignment problems

1. Let \chi^2_d and \chi^2_d(\gamma^2) be two random variables with, respectively, central and non-central chi-square distributions with the same degrees of freedom d. Show that for any x,
\[
P\{\chi^2_d(\gamma^2) \ge x^2\} \ge P\{\chi^2_d \ge x^2\}.
\]
Suppose \chi^2_n is a third random variable with a central chi-square distribution and n degrees of freedom. Then, when the two chi-square random variables are independent,
\[
F_{d,n}(\gamma^2) = \frac{\chi^2_d(\gamma^2)/d}{\chi^2_n/n}
\]
is said to have the non-central F-distribution. Show that for any x,
\[
P\{F_{d,n}(\gamma^2) \ge x^2\} \ge P\{F_{d,n}(0) \ge x^2\}.
\]

2. Let Z be a multivariate normal N(0, I_d) random vector. Show that for a symmetric matrix A, Z^\tau A Z has a central chi-square distribution if and only if A^2 = A.

3. Let Z be a multivariate normal N(0, I_d) random vector. Under symmetry assumptions on A_1, \ldots, A_p, show that if
\[
Z^\tau Z = Z^\tau A_1 Z + \cdots + Z^\tau A_p Z = \sum_{j=1}^p Q_j
\]
such that rank(A_1) + \cdots + rank(A_p) = d, then the Q_j's are independent and each has a chi-square distribution with degrees of freedom equaling rank(A_j).

Chapter 3 Exponential distribution families

In mathematical statistics, the normal distribution family plays a very important role for its simplicity and for the reason that many distributions are well approximated by a normal distribution. We have also seen that many other useful distributions are derived from normal distributions.

There are many other commonly used distribution families in mathematical statistics. Many of them have density functions that conform to a specific algebraic structure. The algebraic structure further enables simple statistical conclusions in data analysis. Hence, it is often useful to have this structure discussed in mathematical statistics.
3.1 One parameter exponential distribution family

Consider a one-parameter distribution family whose probability distributions have density functions with respect to a common \sigma-finite measure. That is, the family is made of \{f(x; \theta) : \theta \in \Theta \subset \mathbb{R}\} with \Theta being its parameter space.

Definition 3.1. Suppose there exist real-valued functions \eta(\theta), T(x), A(\theta) and h(x) such that
\[
f(x; \theta) = \exp\{\eta(\theta)T(x) - A(\theta)\}h(x). \tag{3.1}
\]
We say \{f(x; \theta) : \theta \in \Theta \subset \mathbb{R}\} is a one-parameter exponential family.

The definition does not give much insight into why this specific algebraic form is of interest. Let us build some intuition from several examples.

Example 3.1. Suppose X_1, \ldots, X_n are i.i.d. from Binomial(m, \theta). Their joint density (probability mass) function is given by
\[
f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n \Bigl[\binom{m}{x_i}\theta^{x_i}(1-\theta)^{m-x_i}\Bigr].
\]
Let T(X) = \sum X_i, T(x) = \sum x_i and
\[
h(x) = \prod_{i=1}^n \binom{m}{x_i}.
\]
Then we find
\[
f(x_1, \ldots, x_n; \theta) = \exp\{T(x)\log\theta + (nm - T(x))\log(1-\theta)\}h(x)
= \exp[\log\{\theta/(1-\theta)\}T(x) + nm\log(1-\theta)]h(x).
\]
This conforms to the definition of a one-parameter exponential family with \eta = \log\{\theta/(1-\theta)\} and A(\theta) = -nm\log(1-\theta).

As an exercise, you can follow this example to show that both the negative binomial and Poisson distributions form one-parameter exponential families.

In the above example, \eta is called the log-odds because \theta/(1-\theta) is the odds of success against failure in typical binary experiments. It is equally useful to "label" the binomial distribution family by the log-odds. Note that
\[
\theta = \frac{\exp(\eta)}{1 + \exp(\eta)}.
\]
Hence, we may equivalently state that the joint density function of X is given by
\[
g(x_1, \ldots, x_n; \eta) = \exp\{\eta T(x) - nm\log(1 + \exp(\eta))\}h(x).
\]
This form also conforms to the definition of the one-parameter exponential family.

Definition 3.2. Let X be a random variable or vector. The support of X, or that of its distribution, is the set of all x such that for any \delta > 0, P\{X \in (x - \delta, x + \delta)\} > 0.
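As a small illustration not from the notes, the log-odds reparameterization in Example 3.1 and its inverse can be coded directly; the value \theta = 0.3 is an arbitrary choice.

```python
import numpy as np

# Sketch of the log-odds reparameterization in Example 3.1:
# eta = log{theta/(1-theta)} and its inverse theta = e^eta/(1+e^eta).
def log_odds(theta):
    return np.log(theta / (1.0 - theta))

def theta_from_eta(eta):
    return np.exp(eta) / (1.0 + np.exp(eta))

theta = 0.3                       # arbitrary success probability
eta = log_odds(theta)
recovered = theta_from_eta(eta)   # the round trip recovers theta
print(abs(recovered - theta) < 1e-12)
```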
For the sake of accuracy, a definition sometimes has to be abstract. The support of X is intuitively the set of x such that X = x is a "possible event". When Z is N(0, 1), we have P(Z = z) = 0. Hence, we cannot interpret "possible event" as a positive-probability event. The above definition first expands x into a neighbourhood and then judges its "possibility". Hence, the support contains all x at which the density function is positive and continuous.

We do not ask you to memorize this definition. Rather, we merely point out that if two distributions belong to the same one-parameter exponential family, then they have the same support. In comparison, a standard exponential distribution has support [0, \infty) while a standard normal distribution has support \mathbb{R}.

Let us now show you another interesting property.

Example 3.2. Consider the natural form of the one-parameter exponential family:
\[
f(x_1, \ldots, x_n; \eta) = \exp\{\eta T(x) - A(\eta)\}h(x),
\]
with \eta being a real value whose parameter space is an interval. The moment generating function of T(X) is given by
\[
M_T(s) = E\exp\{sT(X)\} = \exp\{A(\eta + s) - A(\eta)\}.
\]
This implies that
\[
E\{T\} = M_T'(0) = A'(\eta)
\]
and
\[
E\{T^2\} = M_T''(0) = A''(\eta) + \{A'(\eta)\}^2.
\]
Hence, var(T) = A''(\eta).

This example shows that exponential families have some neat properties, which make them an interesting object to study.

3.2 The multiparameter case

We can practically copy the previous definition without any changes.

Definition 3.3. Suppose there exist vector-valued functions \eta(\theta), T(x) and real-valued functions A(\theta) and h(x) such that
\[
f(x; \theta) = \exp\{\eta^\tau(\theta)T(x) - A(\theta)\}h(x). \tag{3.2}
\]
We say \{f(x; \theta) : \theta \in \Theta \subset \mathbb{R}^d\} is a multi-parameter exponential family.

Without the above expansion, the exponential family would not even include the normal distribution.

Example 3.3. Let X_1, X_2, \ldots, X_n be i.i.d. with distribution N(\mu, \sigma^2). Their joint density function
is
\[
\phi(x_1, \ldots, x_n; \mu, \sigma^2) = (2\pi)^{-n/2}\sigma^{-n}\exp\Bigl\{-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\Bigr\}
= (2\pi)^{-n/2}\exp\Bigl\{\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 - \frac{n\mu^2}{2\sigma^2} - n\log\sigma\Bigr\}.
\]
We now regard \theta as a vector made of \mu and \sigma. The above density function fits the definition (3.2) with the following functions:
\[
\eta(\theta) = \Bigl(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\Bigr)^\tau, \quad
T(x) = \Bigl(\sum x_i, \sum x_i^2\Bigr)^\tau,
\]
\[
A(\theta) = \frac{n\mu^2}{2\sigma^2} + n\log\sigma, \quad h(x) = (2\pi)^{-n/2}.
\]
Recall the binomial distribution example. We had the joint density function given by
\[
f(x_1, \ldots, x_n; \theta) = \exp\{T(x)\log\theta + (nm - T(x))\log(1-\theta)\}h(x).
\]
It can also be regarded as a multi-parameter exponential family with d = 2 and
\[
\eta = (\log\theta, \log(1-\theta))^\tau; \quad T_{new}(x) = (T(x), nm - T(x))^\tau.
\]
The parameter space in terms of values of \eta is a curve in \mathbb{R}^2 which does not contain any non-empty open subset of \mathbb{R}^2. We generally avoid having distribution families with such degenerate parameter spaces.

As an exercise, one can verify that the two-parameter Gamma distribution family is a multi-parameter exponential family.

3.3 Other properties

Suppose X_1 and X_2 both have distributions belonging to some exponential families and they are independent. Then their joint distribution also belongs to an exponential family.

By the factorization theorem, T(X) in an exponential family is a sufficient statistic. It is also a complete statistic when the family does not degenerate. The distribution of T itself belongs to an exponential family.

Definition 3.4. Let T be a k-dimensional vector-valued function and h a real-valued function. The canonical k-dimensional exponential family generated by T and h is
\[
g(x; \eta) = \exp\{\eta^\tau T(x) - A(\eta)\}h(x).
\]
The parameter space for \eta is the set of all \eta \in \mathbb{R}^k such that \exp\{\eta^\tau T(x)\}h(x) has a finite integral with respect to the corresponding \sigma-finite measure. We call this parameter space, E, the natural parameter space. We call T and h the generators.
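As a numerical check not in the notes, the exponential-family form of Example 3.3 (with the functions \eta, T, A, h above, taking n = 1) can be compared against the N(\mu, \sigma^2) log-density computed directly; the test point and parameter values are arbitrary.

```python
import numpy as np

# Sketch: the N(mu, sigma^2) log-density in exponential-family form,
# log h(x) + eta^T T(x) - A(theta), versus the direct formula (n = 1).
def logpdf_direct(x, mu, sigma):
    return (-0.5 * np.log(2 * np.pi) - np.log(sigma)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def logpdf_expfam(x, mu, sigma):
    eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
    T = np.array([x, x ** 2])
    A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)
    log_h = -0.5 * np.log(2 * np.pi)
    return float(eta @ T) - A + log_h

print(np.isclose(logpdf_direct(1.3, 0.5, 2.0), logpdf_expfam(1.3, 0.5, 2.0)))
```

The agreement confirms, in particular, the sign of A(\theta) in the display above.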
Because the integral of a density function equals 1, the integral of \exp\{\eta^\tau T(x)\}h(x) equals \exp\{A(\eta)\} if it is finite. Hence, the natural parameter space E contains all \eta at which A(\cdot) is well defined.

Definition 3.5. We say that an exponential family F is of rank k if and only if the generating statistic T is k-dimensional and 1, T_1, \ldots, T_k are linearly independent with positive probability. That is,
\[
P\Bigl(a_0 + \sum_{j=1}^k a_j T_j = 0; \eta\Bigr) < 1
\]
for some \eta unless all the non-random coefficients a_0 = a_1 = \cdots = a_k = 0.

In the above definition, we only need to verify the probability inequality for one \eta value. If it is satisfied for one \eta value, then it is satisfied for any other \eta value.

Theorem 3.1. Suppose F = \{g(x; \eta) : \eta \in E\} is a canonical exponential family generated by (T, h) with natural parameter space E such that E is open. Then the following are equivalent:
(a) F is of rank k.
(b) var(T; \eta) is positive definite.
(c) \eta is identifiable: g(x; \eta_1) \equiv g(x; \eta_2) for all x implies \eta_1 = \eta_2.

These discussions on exponential families suffice for the moment, so we move to the next topic.

3.4 Assignment problems

1. Show that both the negative binomial and Poisson distributions are one-parameter exponential families.

2. Show that the family of uniform distributions f(x; \theta) = \theta^{-1}1(0 < x < \theta) over \theta \in \mathbb{R}^+ is not a one-parameter exponential family.

3. (a) Show that the two-parameter Gamma distribution family is a multi-parameter exponential family, and select a T so that var(T; \eta) is positive definite.
(b) Show that the multinomial distribution family with a fixed number of trials n is a multi-parameter exponential family, and select a T such that var(T; \eta) is positive definite.

Chapter 4 Optimality criteria of point estimation

A general setting of mathematical statistics is: we are given data x believed to be the observed value of a random object X.
The probability distribution of X will be denoted F^*, and F^* is believed to be a member of a distribution family F. Based on the fact that X has the observed value x, we identify a single F, or a set of F's, in F which might be the "true" F^* that describes the probability distribution of X.

There are many serious fallacies related to the above thinking. The first one is the specification of F, which is referred to as a model in this course. If a specific form of F is given, how certain are we that F^* is a member of F? Even if the distribution of X is a member of F, X may not be accurately observed. What we record may be Y = X + \epsilon. Hence, we may unknowingly work on the distribution of Y instead of that of X. In this course, we do not discuss these possible fallacies but leave them to applied courses. We take the approach that, if the distribution of X is indeed a member of F and x is its accurately observed value, what can we say about F^*?

Also, we often study the situation where X consists of i.i.d. replications of some random system, so that X = (X_1, \ldots, X_n). The model for the distribution of X is then taken over by the model for X_1, which is representative of every X_i, i = 1, 2, \ldots, n. We state that X_1, \ldots, X_n is a random, or i.i.d., sample from a population/distribution F in F. In this case, n is referred to as the sample size. With many replications, or when n \to \infty, we should be able to learn a lot more about F^*.

4.1 Point estimator and some optimality criteria

Let \theta be a parameter in the probability model F and suppose we have a random sample X. The parameter space is, loosely, \Theta = \{\theta : \theta = g(F), F \in F\} for some functional g. A point estimator of \theta is a statistic T whose range is \Theta. The realized value of T, T(x), is an estimate of \theta. We generally allow, at the least, T to take values in the smallest closed set containing \Theta, that is, to take values at limit points of \Theta.

Definition 4.1.
A point estimator of \theta is a statistic T whose range is \Theta. The realized value of T, T(x), is an estimate of \theta.

The definition implies that, as an estimator, T(X) is regarded as a mechanism/rule mapping X to \Theta; as an estimate, T(x) is the value in \Theta corresponding to the data x. In both cases, we may use \hat\theta as their common notation.

One must realize that T(x) = 0 is an estimator of \theta as long as 0 \in \Theta. Hence, we can always estimate the parameter in any statistical model, no matter how complex the model is. We may not, however, be able to find an estimator with satisfactory precision or certain desired properties.

Suppose the parameter space is a subset of \mathbb{R}^d for some integer d. Hence, T(X) takes values in \mathbb{R}^d. The distribution of X is given by an F \in F, or equivalently by the c.d.f. F(x; \theta) or p.d.f. f(x; \theta). Hence, T(X) has a distribution induced by F(x; \theta), or simply by \theta. To fix the idea, we assume the "true" parameter value behind X is \theta, the generic \theta. When \hat\theta = T(X) has finite expectation under any \theta, we define
\[
bias(\hat\theta) = E\{T(X); \theta\} - \theta
\]
as the bias of \hat\theta = T(X) when it is used as an estimator of \theta and when the true parameter value is \theta.

Definition 4.2. Suppose X has a distribution F \in F which is parameterized by \theta \in \Theta. Suppose T(X) is an estimator of \theta such that E\{T(X); \theta\} = \theta for all \theta \in \Theta. Then we say T(X) is an unbiased estimator of \theta.

For some reason, statisticians and others prefer estimators that are unbiased. This is not always well justified.

Example 4.1. Suppose X has a binomial distribution with parameters n and \theta, where n is known and \theta is an unknown parameter. A commonly used estimator of \theta is
\[
\hat\theta = \frac{X}{n}.
\]
An estimator motivated by the Bayesian approach is
\[
\tilde\theta = \frac{X + 1}{n + 2}.
\]
It is seen that E\{\hat\theta; \theta\} = \theta. Hence, it is an unbiased estimator. We find that, other than at \theta = 0.5,
\[
bias(\tilde\theta) = \frac{1 - 2\theta}{n + 2} \ne 0.
\]
Hence, \tilde\theta is a biased estimator. Which estimator makes more sense to you?
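As a check not in the notes, the bias formula in Example 4.1 can be verified exactly by summing over the Binomial(n, \theta) pmf; the values n = 10 and \theta = 0.3 are arbitrary.

```python
import math

# Sketch of Example 4.1: the exact bias of theta_tilde = (X+1)/(n+2)
# under Binomial(n, theta), computed from the pmf; it should equal
# (1 - 2*theta)/(n + 2).
def exact_bias_tilde(n, theta):
    pmf = [math.comb(n, x) * theta**x * (1 - theta)**(n - x)
           for x in range(n + 1)]
    mean_tilde = sum((x + 1) / (n + 2) * p for x, p in enumerate(pmf))
    return mean_tilde - theta

n, theta = 10, 0.3                       # arbitrary choices
print(abs(exact_bias_tilde(n, theta) - (1 - 2 * theta) / (n + 2)) < 1e-12)
```

At \theta = 0.5 the computed bias vanishes, matching the exception noted in the example.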
In the above example, the bias of \tilde\theta has limit 0 as n goes to infinity. Often, we discuss situations where the data set contains n i.i.d. observations from a distribution F which is a member of F. The above result indicates that even though \tilde\theta is biased, the size of the bias diminishes as the sample size n gets large. Many of us tend to declare that \tilde\theta is asymptotically unbiased when this happens.

While we do not feel such a notion of "asymptotically unbiased" is wrong, this terminology is often abused. In the statistical literature, people may use this term when
\[
\sqrt{n}(\hat\theta - \theta)
\]
has a limiting distribution whose mean is zero. In this case, the bias of \hat\theta does not necessarily go to zero. To avoid such confusion, let us invent a formal definition.

Definition 4.3. Suppose there is an index n such that X_n has a distribution in F_n, and a_n \to \infty as n \to \infty, while the parameter space \Theta of F_n does not depend on n. Let \theta be the true parameter value and \hat\theta_n be an estimator (a sequence of estimators). If a_n(\hat\theta_n - \theta) has a limiting distribution whose expectation is zero, for any \theta \in \Theta, then we say \hat\theta_n is asymptotically rate-a_n unbiased.

Most often, we take a_n = n^{1/2} in the above definition. We do not have good reasons to require an estimator to be unbiased. Yet we feel that being asymptotically unbiased for some a_n is a necessity. When n \to \infty in common settings, the amount of information about which F is the right one becomes infinite. If we cannot get the estimate right in this situation, the estimation method is likely very poor.

The variance of an estimator is just as important a criterion in judging an estimator. Clearly, having a lower variance implies the estimator is more accurate. In fact, let \varphi(\cdot) be a convex function. Then an estimator is judged superior if E\{\varphi(\hat\theta - \theta)\} is smaller. When \varphi(x) = x^2, this criterion becomes the mean squared error:
\[
mse(\hat\theta) = E\{(\hat\theta - \theta)^2\}.
\]
It is seen that
\[
mse(\hat\theta) = bias^2(\hat\theta) + var(\hat\theta).
\]
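Continuing the illustration (not from the notes), the mse's of the two binomial estimators in Example 4.1 can be computed exactly from the pmf. At \theta = 0.5, where \tilde\theta is unbiased, \tilde\theta beats the unbiased \hat\theta: it trades a little bias elsewhere for a smaller variance everywhere.

```python
import math

# Sketch: exact mse of the two estimators in Example 4.1, computed from
# the Binomial(n, theta) pmf; this realizes mse = bias^2 + var directly.
def exact_mse(estimator, n, theta):
    pmf = [math.comb(n, x) * theta**x * (1 - theta)**(n - x)
           for x in range(n + 1)]
    return sum((estimator(x) - theta) ** 2 * p for x, p in enumerate(pmf))

n = 10
theta_hat = lambda x: x / n
theta_tilde = lambda x: (x + 1) / (n + 2)
# At theta = 0.5: mse(theta_hat) = 0.25/10 = 0.025, while theta_tilde is
# unbiased there with variance n*0.25/(n+2)^2 = 2.5/144.
print(round(exact_mse(theta_hat, n, 0.5), 6),
      round(exact_mse(theta_tilde, n, 0.5), 6))
```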
To achieve a lower mse, an estimator must balance the losses due to variance and bias. Similar to asymptotic bias, it helps to give definite notions of the asymptotic variance and mse of an estimator.

Definition 4.4. Suppose there is an index n such that X_n has a distribution in F_n, and a_n \to \infty as n \to \infty, while the parameter space \Theta of F_n does not depend on n. Let \theta be the true parameter value and \hat\theta_n be an estimator (a sequence of estimators). Suppose a_n(\hat\theta_n - \theta) has a limiting distribution with mean B(\theta) and variance \sigma^2(\theta), for \theta \in \Theta. We say \hat\theta_n has asymptotic bias B(\theta) and asymptotic variance \sigma^2(\theta) at rate a_n. Furthermore, we define the asymptotic mse at rate a_n as \sigma^2(\theta) + B^2(\theta).

Unfortunately, the mse is often a function of \theta. In any specific application, the "true value" of \theta behind X is not known. Hence, it is not possible to find an estimator which is better in terms of variance or mse no matter which value of \theta is the true value.

Example 4.2. Suppose X_1, X_2, \ldots, X_n form an i.i.d. sample from N(\theta, 1) with \Theta = \mathbb{R}. Define \hat\theta = n^{-1}\sum X_i and \tilde\theta = 0. It is seen that
\[
var(\hat\theta) = n^{-1} > var(\tilde\theta) = 0
\]
for any \theta \in \mathbb{R}. However, no one would be happy to use \tilde\theta as his/her estimator. In addition,
\[
mse(\hat\theta) = n^{-1} > mse(\tilde\theta)
\]
for all |\theta| < n^{-1/2}. Hence, even if we use a more sensible performance criterion, it still does not imply that our preferred sample mean is indisputably a superior estimator.

4.2 Uniformly minimum variance unbiased estimator

This section contains some material that most modern statisticians believe should not be included in statistics classes. Yet we feel a quick discussion is still a good idea.

Any of bias, var and mse can be used to separate the performance of the estimators we can think of. Yet without any performance measure, how can statisticians recommend any method to scientists?
This is the same problem as when professors are asked to recommend their students. Everyone is unique. Simplistically declaring one of them the best will draw more criticism than praise. Yet at the least, we can timidly say that one of the students has the highest average mark in mathematics courses, in this term, among all students with green hair, and so on.

Definition 4.5. Suppose X is a random sample from F with parameter \theta \in \Theta. An unbiased estimator \hat\theta is a uniformly minimum variance unbiased estimator of \theta, UMVUE, if for any other unbiased estimator \tilde\theta of \theta,
\[
var_\theta(\hat\theta) \le var_\theta(\tilde\theta)
\]
for all \theta \in \Theta.

In the above definition, we added a subscript \theta to highlight the fact that the variance calculation is based on the assumption that the distribution of X has true parameter value \theta. We do not always do so in other parts of the course notes.

Upon the introduction of the UMVUE, an urgent question to be answered is its existence. The answer is positive, at least in textbook examples.

Example 4.3. Suppose X_1, X_2, \ldots, X_n form an i.i.d. sample from the Poisson distribution with mean parameter \theta and parameter space \Theta = \mathbb{R}^+. Let \hat\theta = \bar{X}_n = n^{-1}\sum X_i. It is easily seen that \hat\theta is an unbiased estimator of \theta.

Suppose \tilde\theta is another unbiased estimator of \theta. Because \bar{X}_n is a complete and sufficient statistic, we find that \breve\theta = E(\tilde\theta \mid \bar{X}_n) is a function of the data only. Hence, it is an estimator of \theta. Using the formula that for any two random variables,
\[
var(Y) = E\{var(Y|Z)\} + var\{E(Y|Z)\},
\]
we find var(\breve\theta) \le var(\tilde\theta). Furthermore, this estimator is also unbiased. Hence,
\[
E\{\hat\theta - \breve\theta\} = 0
\]
for all \theta \in \mathbb{R}^+. Because both estimators are functions of \bar{X}_n, the completeness of \bar{X}_n implies \hat\theta = \breve\theta. Hence,
\[
var(\hat\theta) = var(\breve\theta) \le var(\tilde\theta).
\]
Therefore, \bar{X}_n is the UMVUE. Now, among all unbiased estimators of \theta, the sample mean has the lowest possible variance. If UMVUE is a criterion we accept, then the sample mean is the best possible estimator of the mean parameter \theta under the Poisson model.
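As a simulation sketch not in the notes: the single observation X_1 is also unbiased for \theta, and its conditional expectation given the complete sufficient statistic, E(X_1 | \bar{X}_n) = \bar{X}_n, should have a much smaller variance, which is the mechanism behind Example 4.3. The choices \theta = 3 and n = 20 are arbitrary.

```python
import numpy as np

# Sketch of Example 4.3: conditioning an unbiased estimator (X_1) on the
# complete sufficient statistic X_bar reduces variance (Rao-Blackwell step).
rng = np.random.default_rng(2)
theta, n, reps = 3.0, 20, 100_000
X = rng.poisson(theta, size=(reps, n))
var_X1 = X[:, 0].var()            # roughly var(X_1) = theta = 3
var_mean = X.mean(axis=1).var()   # roughly var(X_bar) = theta/n = 0.15
print(var_mean < var_X1)
```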
Why is such a beautiful conclusion out of fashion these days? Some of the considerations are as follows. In real-world applications, having a random sample strictly i.i.d. from a Poisson distribution is merely a fantasy. If so, why should we bother?

Our defence is as follows. If the sample mean is optimal in the sense of UMVUE under the ideal situation, it is likely a superior estimator even if the situation is slightly different from the ideal. In addition, the optimality consideration is a good way of thinking. Suppose \lambda = 1/\theta, which is called the rate parameter under the Poisson model assumption. How would you estimate \lambda? Many will suggest that \bar{X}_n^{-1} is a good candidate estimator. Sadly, this estimator is biased and has infinite variance, since P(\bar{X}_n = 0) > 0! Lastly, in modern applications, we rarely work with such simplistic models. In those cases, it is nearly impossible to have a UMVUE. If so, we probably should not bother our students with such technical notions.

4.3 Information inequality

At least in textbook examples, some estimators are fully justified as optimal. This implies that there is an intrinsic limit on how precise an estimator can be.

Let X be a random variable modelled by F, or more specifically by a parametric family f(x; \theta). Let T(X) be a statistic with finite variance given any \theta \in \Theta. Denote
\[
\psi(\theta) = E\{T(X); \theta\} = \int T(x)f(x; \theta)dx,
\]
where the Lebesgue measure can be replaced by any other suitable measure. Suppose some regularity conditions on f(x; \theta) are satisfied so that our following manipulations are valid. Taking derivatives with respect to \theta on both sides of the equality, we find
\[
\psi'(\theta) = \int T(x)f'(x; \theta)dx = \int T(x)s(x; \theta)f(x; \theta)dx,
\]
where
\[
s(x; \theta) = \frac{f'(x; \theta)}{f(x; \theta)} = \frac{\partial}{\partial\theta}\{\log f(x; \theta)\}.
\]
It is seen that
\[
\int s(x; \theta)f(x; \theta)dx = \int f'(x; \theta)dx = \frac{d}{d\theta}\int f(x; \theta)dx = 0.
\]
We define the Fisher information
\[
I(\theta) = E\Bigl[\frac{\partial}{\partial\theta}\{\log f(X; \theta)\}\Bigr]^2 = E\{s(X; \theta)\}^2.
\]
Hence, by the Cauchy–Schwarz inequality,
\[ \{\psi'(\theta)\}^2 = \Big[\int \{T(x) - \psi(\theta)\}\, s(x;\theta) f(x;\theta)\, dx\Big]^2 \le \int \{T(x) - \psi(\theta)\}^2 f(x;\theta)\, dx \times \int \{s(x;\theta)\}^2 f(x;\theta)\, dx = \mathrm{var}(T(X))\, I(\theta). \]
This leads to the following theorem.

Theorem 4.1. Cramér–Rao information inequality. Let $T(X)$ be any statistic with finite variance for all $\theta \in \Theta$. Under some regularity conditions,
\[ \mathrm{var}(T(X)) \ge \frac{\{\psi'(\theta)\}^2}{I(\theta)}, \]
where $\psi(\theta) = E\{T(X); \theta\}$.

If $T(X)$ is unbiased for $\theta$, then $\psi'(\theta) = 1$ and therefore $\mathrm{var}(T) \ge I^{-1}(\theta)$. When $I(\theta)$ is larger, the variance of $T$ can be smaller; hence $I(\theta)$ indeed measures the information content in the data $X$ with respect to $\theta$. For convenience of reference, we call $I^{-1}(\theta)$ the information lower bound for estimating $\theta$.

In assignment problems, $X$ is often made of $n$ i.i.d. observations from $f(x;\theta)$. Let $X_1$ be one component of $X$. It is a simple exercise to show that, in the obvious notation, $I(\theta; X) = n I(\theta; X_1)$. We need to pay attention to what $I(\theta)$ stands for on many occasions: it could be the information contained in a single $X_1$, but it could also be the information contained in the i.i.d. sample $X_1, \ldots, X_n$.

Example 4.4. Suppose $X_1, X_2, \ldots, X_n$ form an i.i.d. sample from the Poisson distribution with mean parameter $\theta$ and parameter space $\Theta = \mathbb{R}^+$. The density function of $X_1$ is given by
\[ f(x;\theta) = P(X_1 = x; \theta) = \frac{\theta^x}{x!}\exp(-\theta). \]
Hence $s(x;\theta) = x/\theta - 1$, and the information in $X_1$ is given by
\[ I(\theta) = E\Big\{\frac{X}{\theta} - 1\Big\}^2 = \frac{1}{\theta}. \]
Therefore, for any unbiased estimator $T_n$ of $\theta$ based on the whole sample, we have
\[ \mathrm{var}(T_n) \ge \frac{1}{nI(\theta)} = \frac{\theta}{n}. \]
Since the sample mean is unbiased and has variance $\theta/n$, it is an estimator that attains the information lower bound.

The definition of Fisher information depends on how the distribution family is parameterized. If $\eta$ is a smooth function of $\theta$, the Fisher information with respect to $\eta$ is not the same as the Fisher information with respect to $\theta$.
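To see the information lower bound at work, the sketch below (arbitrary illustration values, not from the notes) compares two unbiased estimators of the Poisson mean: the sample mean, which attains the bound $\theta/n$, and the sample variance, which is also unbiased for $\theta$ because a Poisson distribution has mean equal to variance, but whose variance exceeds the bound:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 20, 100_000   # arbitrary illustration values
x = rng.poisson(theta, size=(reps, n))

xbar = x.mean(axis=1)               # unbiased; attains the bound
s2 = x.var(axis=1, ddof=1)          # also unbiased for θ (mean = variance)

bound = theta / n                   # Cramér–Rao bound 1/(n I(θ)) = θ/n
print(xbar.mean(), s2.mean())       # both close to θ
print(xbar.var(), s2.var(), bound)  # X̄ sits at the bound; S² does not
```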
As an exercise, find the information lower bound for estimating $\eta = \exp(-\theta)$ under the Poisson distribution model, and derive its UMVUE given $n$ i.i.d. observations.

4.4 Other desired properties of a point estimator

Given a data set from an assumed model $\mathcal{F}$, we often ask, or are asked, whether a certain aspect of $F$ can be estimated. This can be the mean or the median of $F$, where $F$ is any member of $\mathcal{F}$. In general, we may write the parameter as $\theta = \theta(F)$, a functional defined on $\mathcal{F}$.

Definition 4.6. Obsolete concept of estimability. Suppose the data set $X$ is a random sample from a model $F$ and suppose $\theta = \theta(F)$ is a parameter. We say $\theta$ is estimable if there exists a function $T(\cdot)$ such that
\[ E(T(X); F) = \theta(F) \quad \text{for all } F \in \mathcal{F}. \]
In other words, a parameter is estimable if we can find an unbiased estimator for it. We can give many textbook examples of estimability. In contemporary applications, we are often asked to "train" a model given a data set with very complex structure. In such cases we do not even have a good description of $\mathcal{F}$; because of this, being estimable is a luxury for any useful functional on $\mathcal{F}$. We have to give up this concept but remain aware of the definition.

It is not hard to give an example of un-estimable parameters according to the above definition, though such examples can be overly technical. Instead, we show that there is a basic requirement for a parameter to be estimable.

Definition 4.7. Identifiability of a statistical model. Let $\mathcal{F}$ be a parametric model in statistics and $\Theta$ be its parameter space. We say $\mathcal{F}$ is identifiable if for any $\theta_1, \theta_2 \in \Theta$, $F(x;\theta_1) = F(x;\theta_2)$ for all $x$ implies $\theta_1 = \theta_2$.

A necessary condition for a parameter $\theta$ to be estimable is that it is identifiable. Otherwise, suppose $F(x;\theta_1) = F(x;\theta_2)$ for all $x$ but $\theta_1 \neq \theta_2$. For any estimator $\hat\theta$, we cannot have both
\[ E\{\hat\theta; \theta_1\} = \theta_1 \quad \text{and} \quad E\{\hat\theta; \theta_2\} = \theta_2, \]
because the two expectations are equal while $\theta_1 \neq \theta_2$.

Definition 4.8.
Proposed notion of estimability. Let $\mathcal{F}$ be a parametric model in statistics and $\Theta$ be its parameter space. Suppose the sampling plan under consideration may be regarded as one of a sequence of sampling plans indexed by $n$, with sample $X_n$ from $F$. If there exists an estimator $T_n$, a function of $X_n$, such that
\[ P(|T_n - \theta| \ge \epsilon; \theta) \to 0 \]
for any $\theta \in \Theta$ and $\epsilon > 0$ as $n \to \infty$, then we say $\theta$ is (asymptotically) estimable.

The sampling plans in my mind include the plan of obtaining i.i.d. observations, obtaining observations of a time series with extended length, and so on. This definition makes sense, but we will not be surprised if it draws serious criticism.

Example 4.5. Suppose we have an i.i.d. sample of size $n$ from a Poisson distribution. Let $\lambda$ be the rate parameter. It is seen that $\lambda$ is asymptotically estimable because
\[ P\Big(\Big|\frac{1}{n^{-1} + \bar X_n} - \lambda\Big| > \epsilon\Big) \to 0 \]
as $n \to \infty$, where $\bar X_n$ is the sample mean. In this example, I have implicitly regarded "having an i.i.d. sample of size $n$" as a sequence of sampling plans. If one cannot obtain more and more i.i.d. observations from this population, then asymptotic estimability does not make a lot of sense.

Suppose two random variables are related by $Y = (5/9)(X - 32)$, as in the case where $Y$ and $X$ are temperatures measured in Celsius and Fahrenheit. Given measurements $X_1, X_2, \ldots, X_n$ on a random sample from some population, it is most sensible to estimate the mean temperature by $\bar X_n$, the sample mean of $X$. If one instead measures the temperatures in Celsius to get $Y_1, \ldots, Y_n$ on the same random sample, we should have estimated the mean by $\bar Y_n$, the sample mean of $Y$. Luckily, we have $\bar Y_n = (5/9)(\bar X_n - 32)$, so some internal consistency is maintained. Such a desirable property is termed equivariance, and is sometimes also called invariance; see Lehmann for references.

On another occasion, one might be interested in estimating the mean parameter $\mu$ of a Poisson distribution.
This parameter tells us the average number of events occurring in a time period of interest. At the same time, one might be interested in the chance that nothing happens in the period, which is $\exp(-\mu)$. Let $\bar X_n$ be the sample mean of the numbers of events over $n$ distinct periods of time. We naturally estimate $\mu$ by $\bar X_n$ and $\exp(-\mu)$ by $\exp(-\bar X_n)$. If so, we find $\widehat{g(\mu)} = g(\hat\mu)$ with $g(x) = \exp(-x)$. This is a property most of us find desirable; when an estimator satisfies it, we say the estimator is invariant. Rigorous definitions of equivariance and invariance can be lengthy; we will be satisfied with the general discussion above.

In the Poisson distribution example, it is seen that
\[ E\{\exp(-\bar X_n)\} = \exp\{n\mu[\exp(-1/n) - 1]\} \neq \exp(-\mu). \]
Hence the most natural estimator of $\exp(-\mu)$ is not unbiased. The UMVUE of $\exp(-\mu)$ is given by $E\{1(X_1 = 0) \mid \bar X_n\}$, while the UMVUE of $\mu$ is $\bar X_n$. Thus the UMVUE is not invariant when the population is the Poisson distribution family. As a helpful exercise for improving one's technical strength, work out the explicit expression of $E\{1(X_1 = 0) \mid \bar X_n\}$.

4.5 Consistency and asymptotic normality

A point estimator is a function of the data, and the data are a random sample from a distribution/population that is a member of a distribution family. Hence the estimator is random in general: it does not take any single value with probability one. In other words, we can never be completely sure about the unknown parameter. However, as the sample size increases, we gain more and more information about the underlying population. Hence we should be able to determine the "true" parameter value with higher and higher precision.

Definition 4.9. Let $\hat\theta_n$ be an estimator of $\theta$ based on a sample of size $n$ from a distribution family $\{F(x;\theta): \theta \in \Theta\}$. We say that $\hat\theta_n$ is weakly consistent if, as $n \to \infty$, for any $\epsilon > 0$ and $\theta \in \Theta$,
\[ P(|\hat\theta_n - \theta| \ge \epsilon; \theta) \to 0. \]
In comparison, we have a stronger version of consistency.

Definition 4.10.
Let $\hat\theta_n$ be an estimator of $\theta$ based on a sample of size $n$ from a distribution family $\{F(x;\theta): \theta \in \Theta\}$. We say that $\hat\theta_n$ is strongly consistent if, for any $\theta \in \Theta$,
\[ P\Big(\lim_{n\to\infty} \hat\theta_n = \theta; \theta\Big) = 1. \]

Here are a few remarks, not to be taken too seriously but worth pointing out. First, the i.i.d. structure in the above definitions is not essential; however, it is not easy to give a more general and rigorous definition without it. Second, consistency is not really a property of one estimator but of a sequence of estimators. Unless $\hat\theta_n$ for all $n$ are constructed based on the same principle, consistency is not relevant in applications: your $n$ is far from infinity. For this reason, there is a more sensible notion called Fisher consistency; to avoid too much technicality, it is mentioned but not spelled out here. Lastly, when we say an estimator is consistent, we mean weakly consistent unless otherwise stated.

The next topic is asymptotic normality; in fact, it is best described as the study of limiting distributions. Suppose $\hat\theta_n$ is an estimator of $\theta$ based on $n$ i.i.d. observations from some distribution family. The precision of this estimator can be judged by its bias, variance, mean squared error, and so on. Ultimately, the precision of $\hat\theta_n$ is captured by its sampling distribution. Unfortunately, the sampling distribution of $\hat\theta_n$ is often not easy to work with directly. At the same time, when $n$ is very large, the distribution of a standardized version of $\hat\theta_n$ stabilizes: this is the limiting distribution. If we regard the limiting distribution as the sampling distribution of $\hat\theta_n$, the difference is not large, and the error diminishes as $n$ increases. For this reason, statisticians are fond of finding limiting distributions.

Definition 4.11. Let $T_n$ be a sequence of random variables. We say its distribution converges to that of $T$ if
\[ \lim_{n\to\infty} P(T_n \le t) = P(T \le t) \]
for all $t \in \mathbb{R}$ at which $F(t) = P(T \le t)$ is continuous.
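Definition 4.11 can be illustrated by a quick Monte Carlo check (values arbitrary, not from the notes): for Poisson data, the standardized sample mean $T_n = \sqrt{n}(\bar X_n - \theta)/\sqrt{\theta}$ converges in distribution to $N(0,1)$ by the central limit theorem, so $P(T_n \le t)$ should approach the standard normal CDF:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
theta, n, reps = 1.0, 400, 200_000   # arbitrary illustration values

# The sum of n i.i.d. Poisson(θ) variables is Poisson(nθ),
# so X̄_n can be sampled directly and cheaply
xbar = rng.poisson(n * theta, size=reps) / n
tn = np.sqrt(n) * (xbar - theta) / np.sqrt(theta)   # standardized mean

def Phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(t / sqrt(2)))

for t in (-1.0, 0.5, 2.0):
    print(t, np.mean(tn <= t), Phi(t))   # empirical CDF vs the limit
```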
In this definition, $T_n$ is just any sequence of random variables; it may contain unknown parameters in specific examples. The index $n$ need not be the sample size in a typical setup. The multivariate case will not be given here. In typical applications, the limiting distribution in question concerns asymptotic normality.

Example 4.6. Suppose we have an i.i.d. sample $X_1, \ldots, X_n$ from a distribution family $\mathcal{F}$. A typical estimator of $F(t)$, the cumulative distribution function of $X$, is the empirical distribution
\[ F_n(t) = n^{-1}\sum_{i=1}^n 1(X_i \le t). \]
For each given $t$, $nF_n(t)$ has a binomial distribution. At the same time,
\[ \sqrt{n}\{F_n(t) - F(t)\} \xrightarrow{d} N(0, \sigma^2) \]
with $\sigma^2 = F(t)\{1 - F(t)\}$ as $n \to \infty$. Remark: in this example, we have a random variable on one side but a distribution on the other. This is interpreted as follows: the sequence of distributions of the random variables, indexed by $n$, converges to the distribution specified on the right-hand side.

As an exercise, one can work out the following example.

Example 4.7. Suppose we have an i.i.d. sample $X_1, \ldots, X_n$ from the uniform distribution family with $F(x;\theta)$ uniform on $(0,\theta)$ and $\Theta = \mathbb{R}^+$. Define $\hat\theta_n = \max\{X_1, X_2, \ldots, X_n\}$, which is often denoted $X_{(n)}$ and called the largest order statistic. It is well known that
\[ n\{\theta - \hat\theta_n\} \xrightarrow{d} \mathrm{Exp}(\theta), \]
namely, the limiting distribution is exponential with mean $\theta$. Is $\hat\theta_n$ asymptotically unbiased at rate $\sqrt{n}$? At rate $n$?

4.6 Assignment problems

1. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from the uniform distribution $\mathrm{Unif}(0,\theta)$. Define $\hat\theta_n = \max\{X_1, X_2, \ldots, X_n\}$, which is often denoted $X_{(n)}$ and called the largest order statistic. Find the limiting distribution of $n(\theta - \hat\theta_n)$ as $n \to \infty$. Is $\hat\theta_n$ asymptotically unbiased at rate $\sqrt{n}$? At rate $n$?

2. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from Poisson$(\theta)$, and let $\eta = \exp(-\theta)$. From the previous assignment, we find that the UMVUE for $\eta$ is given by $\hat\eta = (1 - 1/n)^{n\bar X}$.
(a) Following Definition 4.9 as given in the lecture notes, prove that $\hat\eta$ is weakly consistent, i.e., prove that for any $\epsilon > 0$ and $\theta > 0$,
\[ P(|\hat\eta - \eta| > \epsilon; \eta) \to 0 \quad \text{as } n \to \infty. \]
(b) Conduct a simulation study to find the probability in part (a). Let $\epsilon = 0.01$ and $\eta = \exp(-1)$, and repeat the simulation with sample sizes $n = 100$ and $1000$, with $N = 20000$ repetitions. Report your findings.

3. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from the following mixture model, with density function
\[ f(x; \lambda, \pi) = (1-\pi)\exp(-x) + \pi\lambda^{-1}\exp(-x/\lambda), \quad x > 0. \]
Suppose we observe the sample data

0.61683384, 0.49301343, 0.08751571, 6.32112518, 1.46224603, 0.17420356, 1.07460011, 0.18795447, 2.01524287, 0.83013365, 0.04476622, 2.01365679, 1.63824658, 0.01627277, 5.71925356, 3.85095169, 0.75024996, 1.26231923, 0.70529060, 1.66594757

(a) Derive an analytical expression for the moment estimates of the parameters $\lambda$ and $\pi$.
(b) Obtain their numerical values.

4. Given a positive constant $k$, we define a function for the purpose of M-estimation:
\[ \varphi(x;\theta) = \begin{cases} (x-\theta)^2 & \text{if } |x-\theta| \le k; \\ k^2 & \text{if } |x-\theta| > k. \end{cases} \]
(a) The M-estimator $\hat\theta$ of $\theta$ is the value at which $M_n(\theta) = \sum_{i=1}^n \varphi(X_i; \theta)$ is minimized. Assume that no $i$ makes $|X_i - \hat\theta| = k$, where $\hat\theta$ is the solution to the optimization problem. Show that $\hat\theta$ is the mean of those $X_i$ such that $|X_i - \hat\theta| < k$.
(b) Given the sample data

1.551 -1.170 -0.201 1.143 0.138 3.103 1.455 -2.121 -1.672 6.150

and that $k = 2.0$, calculate the value of the M-estimate defined in part (a).

5. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from the exponential distribution $\mathrm{Exp}(\theta)$ with mean $\theta$ (the density is $f(x;\theta) = \theta^{-1}\exp(-x/\theta)$). Denote by $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ the corresponding order statistics for this random sample. Then $W_k = X_{(k+1)} - X_{(k)}$, $1 \le k \le n-1$, are called the spacings of the order statistics. By convention, define $W_0 = X_{(1)}$, the first order statistic.
(a) It is known that $W_0, W_1, \ldots, W_{n-1}$ are independent of each other, with
\[ W_k \sim \mathrm{Exp}\Big(\frac{\theta}{n-k}\Big), \quad k = 0, 1, \ldots, n-1. \]
Verify this result for the case $n = 2$.
(b) Let $T_n = X_{(1)} + X_{(2)} + \cdots + X_{(k)} + (n-k)X_{(k)}$. Suppose $n = 10$ and $k = 8$; find the mean and variance of the statistic $T_n$.
(c) Suppose $n = 10$ and $k = 8$. Using your result from part (b), work out an unbiased L-estimator for the parameter $\theta$.

Chapter 5 Approaches of point estimation

Even though any statistic with the proper range is a point estimator, we generally prefer estimators derived from some principle. This leads to a few common estimation procedures.

5.1 Method of moments

Suppose $\mathcal{F}$ is a parametric distribution family, so that it permits a general expression $\mathcal{F} = \{F(x;\theta): \theta \in \Theta\}$ with $\Theta \subset \mathbb{R}^d$ for some positive integer $d$. We assume the parameter is identifiable. In most classical examples, the distributions are labelled smoothly by $\theta$: two distributions with close parameter values are similar in some metric. In addition, the first $d$ moments are smooth functions of $\theta$, and they map $\Theta$ to $\mathbb{R}^d$ in a one-to-one fashion: different $\theta$ values lead to different first $d$ moments.

Suppose we have an i.i.d. sample $X_1, \ldots, X_n$ of size $n$ from $F$, with $X$ univariate. For $k = 1, 2, \ldots, d$, define equations with respect to $\theta$:
\[ n^{-1}\{X_1^k + X_2^k + \cdots + X_n^k\} = E\{X^k; \theta\}. \]
The solution in $\theta$, if it exists and is unique, is called the moment estimator of $\theta$.

Example 5.1. If $X_1, \ldots, X_n$ is an i.i.d. sample from the negative binomial distribution whose probability mass function (p.m.f.) is given by
\[ f(x;\theta) = \binom{-m}{x}(\theta - 1)^x \theta^m \]
for $x = 0, 1, 2, \ldots$. It is known that $E\{X;\theta\} = m(1-\theta)/\theta$. Hence, the moment estimator of $\theta$ is given by $\hat\theta = m/(\bar X_n + m)$.

If $X_1, \ldots, X_n$ is an i.i.d. sample from the $N(\mu, \sigma^2)$ distribution, it is known that $E\{X, X^2\} = (\mu, \mu^2 + \sigma^2)$.
The moment equations are given by
\[ n^{-1}\{X_1 + X_2 + \cdots + X_n\} = \mu; \qquad n^{-1}\{X_1^2 + X_2^2 + \cdots + X_n^2\} = \mu^2 + \sigma^2. \]
The moment estimators are found to be
\[ \hat\mu = \bar X_n; \qquad \hat\sigma^2 = n^{-1}\sum X_i^2 - \bar X_n^2. \]
Note that $\hat\sigma^2$ differs from the sample variance by a scale factor $n/(n-1)$.

Moment estimators are often easy to construct and have simple distributional properties. In classical examples, they are also easy to compute numerically.

The use of a moment estimator depends on the existence and uniqueness of the solutions to the corresponding equations. There seems to be little discussion of this topic. We suggest that moment estimators are estimators of an ancient tradition, from an era in which only simplistic models were considered; such complications do not seem to occur often for those models. We will provide an example based on an exponential mixture as an exercise problem. One may find the classical example in Pearson (1894), where a heroic effort was devoted to solving the moment equations to fit a two-component normal mixture model.

Other than general convention, there is nearly no theory to support the use of the first $d$ moments for the method of moments rather than other moments. The method of moments also does not have to be restricted to situations where i.i.d. observations are available.

Example 5.2. Suppose we have $T$ observations from a simple linear regression model:
\[ Y_t = \beta X_t + \epsilon_t, \quad t = 1, \ldots, T, \]
such that $\epsilon_1, \ldots, \epsilon_T$ are i.i.d. $N(0,1)$ and $X_1, \ldots, X_T$ are non-random constants. It is seen that $E\{\sum Y_t\} = \beta\sum X_t$. Hence, a moment estimator of $\beta$ is given by
\[ \hat\beta = \frac{\sum Y_t}{\sum X_t}. \]

The method of moments makes sense based on our intuition. What statistical properties does it have? Under some conditions, we can show that it is consistent and asymptotically normal. Specifying the exact conditions, however, is surprisingly more tedious than one may expect. Consider the situation where an i.i.d.
sample of size $n$ from a parametric statistical model $\mathcal{F}$ is available. Let $\theta$ denote the parameter and $\Theta \subset \mathbb{R}^d$ be the parameter space. Let $\mu_k(\theta)$ be the $k$th moment of $X$, the random variable whose distribution $F(x;\theta)$ is a member of $\mathcal{F}$. Assume that $\mu_k(\theta)$ exists and is continuous in $\theta$ for $k = 1, 2, \ldots, d$. Assume also that the moment estimator $\hat\theta$ of $\theta$ is the unique solution to the moment equations for large enough $n$.

Recall the law of large numbers:
\[ n^{-1}\{X_1^k + X_2^k + \cdots + X_n^k\} \to \mu_k(\theta) \]
almost surely as $n \to \infty$. By the definition of the moment estimator, we have $\mu_k(\hat\theta) \to \mu_k(\theta)$ for $k = 1, 2, \ldots, d$ almost surely as $n \to \infty$. Assume that the vector-valued function $\mu(\theta)$ made of the first $d$ moments is "inversely continuous," a term we invent on the spot: for any fixed $\theta^*$ and dynamic $\theta$,
\[ \|\mu(\theta) - \mu(\theta^*)\| \to 0 \]
only if $\theta \to \theta^*$. Then $\mu_k(\hat\theta) \to \mu_k(\theta)$ almost surely implies $\hat\theta \to \theta$ almost surely. We omit the discussion of asymptotic normality here.

5.2 Maximum likelihood estimation

Suppose one can find a σ-finite measure such that each distribution in $\mathcal{F}$ has a density function $f(x)$. Then the likelihood function is given by (not defined as)
\[ L(F) = f(x), \]
which is a function of $F$ on $\mathcal{F}$. To remove the mystic notion of $\mathcal{F}$, under a parametric model the likelihood becomes $L(\theta) = f(x;\theta)$, because we can use $\theta$ to represent each $F$ in $\mathcal{F}$.

If $\hat\theta$ is a value in $\Theta$ such that
\[ L(\hat\theta) = \sup_\theta L(\theta), \]
then it is a maximum likelihood estimate (estimator) of $\theta$. If we can find a sequence $\{\theta_m\}_{m=1}^\infty$ such that
\[ \lim_{m\to\infty} L(\theta_m) = \sup_\theta L(\theta) \]
and $\lim \theta_m = \hat\theta$ exists, then we also call $\hat\theta$ a maximum likelihood estimate (estimator) of $\theta$.

The observation $x$ includes the situation where it is a vector. The common i.i.d. situation is a special case where $x$ is made of $n$ i.i.d. observations from a distribution family $\mathcal{F}$. In this case, the likelihood function is given by the product of the $n$ densities evaluated at $x_1, \ldots, x_n$ respectively. It remains a function of the parameter $\theta$.
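As a small illustration of the i.i.d. likelihood just described, the sketch below (simulated data with arbitrary values, not from the notes) evaluates a normal log-likelihood as a sum of log densities and locates its maximizer in $\mu$ by a crude grid search; the grid maximizer lands next to the sample mean, anticipating the normal MLE example in Section 6.1:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # arbitrary illustration values

def loglik(mu, sigma2, data):
    # i.i.d. case: the log-likelihood is the sum of the log densities
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (data - mu) ** 2 / (2 * sigma2))

# A crude grid search in μ with σ² fixed at the plug-in value
mus = np.linspace(1.0, 3.0, 201)
best_mu = max(mus, key=lambda m: loglik(m, x.var(), x))
print(best_mu, x.mean())   # grid maximizer sits next to the sample mean
```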
The probability mass function, when $x$ is discrete, is also regarded as a density function; this remark takes care of discrete models. In general, the likelihood function is defined as follows.

Definition 5.1. The likelihood function on a model $\mathcal{F}$, based on observed values of $X$, is proportional to $P(X = x; F)$, where the probability is computed when $X$ has distribution $F$. When $F$ is a continuous distribution, the probability is computed as the probability of the event that $X$ belongs to a small neighbourhood of $x$.

The argument of "proportionality" leads to the joint density function $f(x)$, or $f(x;\theta)$ in general. The proportionality is a property in terms of $F$; the likelihood function is a function of $F$. The phrase "proportional to" in the definition implies that the likelihood function is not unique: if $L(\theta)$ is a likelihood function based on some data, then $cL(\theta)$ for any $c > 0$ is also a likelihood function based on the same data.

5.3 Estimating equation

The MLE of a parameter is often obtained by solving the score equation
\[ \frac{\partial \log L_n(\theta)}{\partial\theta} = 0. \]
It is generally true that
\[ E\Big[\frac{\partial \log L_n(\theta)}{\partial\theta}; \theta\Big] = 0, \]
where the expectation is computed when the parameter value (of the distribution of the data) is given by $\theta$. Because of this, the MLE is often regarded as a solution to $\partial\log L_n(\theta)/\partial\theta = 0$. It appears to matter very little whether or not $\partial\log L_n(\theta)/\partial\theta$ is actually the derivative of a log-likelihood function. This leads to the following consideration.

In applications, we may have reasons to justify that a parameter $\theta$ solves the equation $E\{g(X;\theta)\} = 0$. Given a set of i.i.d. observations in $X$, we may then solve
\[ \sum_{i=1}^n g(x_i; \theta) = 0 \]
and use its solution as an estimate of $\theta$ (or an estimator if the $x_i$'s are replaced by $X_i$'s). Clearly, such estimators are sensible and may be preferred when completely specifying a model for $X$ carries a great risk of misspecification.

Example 5.3. Suppose $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is a set of i.i.d.
observations from some $F$ such that $E(Y_i \mid X_i = x_i) = x_i^\tau\beta$. We may estimate $\beta$ by the solution to
\[ \sum_{i=1}^n x_i(y_i - x_i^\tau\beta) = 0. \]
The solution is given by
\[ \hat\beta = \Big\{\sum_{i=1}^n x_i x_i^\tau\Big\}^{-1}\Big\{\sum_{i=1}^n x_i y_i\Big\}, \]
which is the well-known least squares estimator. The spirit of this example is that we do not explicitly spell out any distributional assumptions on $(X, Y)$ other than the form of the conditional expectation.

5.4 M-estimation

Motivated by a similar consideration, one may replace $L_n(\theta)$ by some other function in some applications. Let $\varphi(x;\theta)$ be a function of the data and $\theta$; we are mostly interested in its behaviour in $\theta$ after $x$ is given. In the i.i.d. case, we may minimize
\[ M_n(\theta) = \sum_{i=1}^n \varphi(x_i;\theta) \]
and use the minimizer as an estimate of $\theta$ (or an estimator if the $x_i$'s are replaced by $X_i$'s). In this situation, the parameter $\theta$ is defined as the minimum point of $E\{\varphi(X;\xi); F\}$ in $\xi$, where $F$ is the true distribution of $X$.

Example 5.4. Suppose $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is a set of i.i.d. observations from some $F$ such that $E(Y_i \mid X_i = x_i) = x_i^\tau\beta$. We may estimate $\beta$ by the solution to the minimization problem
\[ \min_\beta \sum_{i=1}^n (y_i - x_i^\tau\beta)^2. \]
In this case, $\varphi(x, y;\beta) = (y - x^\tau\beta)^2$, and the solution is again the well-known least squares estimator
\[ \hat\beta = \Big\{\sum_{i=1}^n x_i x_i^\tau\Big\}^{-1}\Big\{\sum_{i=1}^n x_i y_i\Big\}. \]

In some applications, the data set may contain a few observations whose $y$ values are much larger than those of the rest. Their presence leaves the other observed values with almost no influence on the fitted regression coefficient $\hat\beta$. Hence, Huber suggested using
\[ \varphi(x, y;\beta) = \begin{cases} (y - x^\tau\beta)^2 & |y - x^\tau\beta| \le k; \\ k(y - x^\tau\beta) & y - x^\tau\beta > k; \\ -k(y - x^\tau\beta) & y - x^\tau\beta < -k, \end{cases} \]
for some selected constant $k$ instead. This choice limits the influence of observations with huge values. Sometimes such abnormal values, often referred to as outliers, are caused by recording errors.

5.5 L-estimator

Suppose we have a set of univariate i.i.d.
observations, and it is simple to record them in order of size:
\[ X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}. \]
We call them order statistics. To avoid the influence of outliers, one may estimate the population mean by a trimmed mean:
\[ (n-2)^{-1}\sum_{i=2}^{n-1} X_{(i)}. \]
This practice is used in judging Olympic games, though those trimmed scores are not estimators. One can certainly remove more observations from consideration and make the estimator more robust. The extreme case is to use the sample median to estimate the population mean; in this case, the estimator makes sense only if the mean and the median are the same parameter under the model assumption.

In general, an L-estimator is any linear combination of the order statistics. The coefficients are required to be non-random and naturally must not depend on unknown parameters.

5.6 Assignment problems

1. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from the uniform distribution $\mathrm{Unif}(0,\theta)$. Define $\hat\theta_n = \max\{X_1, X_2, \ldots, X_n\}$, which is often denoted $X_{(n)}$ and called the largest order statistic. Find the limiting distribution of $n(\theta - \hat\theta_n)$ as $n \to \infty$. Is $\hat\theta_n$ asymptotically unbiased at rate $\sqrt{n}$? At rate $n$?

2. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from Poisson$(\theta)$, and let $\eta = \exp(-\theta)$. From the previous assignment, we find that the UMVUE for $\eta$ is given by $\hat\eta = (1 - 1/n)^{n\bar X}$.
(a) Following Definition 4.9 as given in the lecture notes, prove that $\hat\eta$ is weakly consistent, i.e., prove that for any $\epsilon > 0$ and $\theta > 0$,
\[ P(|\hat\eta - \eta| > \epsilon; \eta) \to 0 \quad \text{as } n \to \infty. \]
(b) Conduct a simulation study to find the probability in part (a). Let $\epsilon = 0.01$ and $\eta = \exp(-1)$, and repeat the simulation with sample sizes $n = 100$ and $1000$, with $N = 20000$ repetitions. Report your findings.

3. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from the following mixture model, with density function
\[ f(x; \lambda, \pi) = (1-\pi)\exp(-x) + \pi\lambda^{-1}\exp(-x/\lambda), \quad x > 0. \]
Suppose we observe the sample data

0.61683384, 0.49301343, 0.08751571, 6.32112518, 1.46224603, 0.17420356, 1.07460011, 0.18795447, 2.01524287, 0.83013365, 0.04476622, 2.01365679, 1.63824658, 0.01627277, 5.71925356, 3.85095169, 0.75024996, 1.26231923, 0.70529060, 1.66594757

(a) Derive an analytical expression for the moment estimates of the parameters $\lambda$ and $\pi$, and (b) obtain their numerical values.

4. Given a positive constant $k$, let us define a function for the purpose of M-estimation:
\[ \varphi(x;\theta) = \begin{cases} (x-\theta)^2 & \text{if } |x-\theta| \le k; \\ k^2 & \text{if } |x-\theta| > k. \end{cases} \]
(a) The M-estimator $\hat\theta$ of $\theta$ is the value at which $M_n(\theta) = \sum_{i=1}^n \varphi(X_i; \theta)$ is minimized. Assume that no $i$ makes $|X_i - \hat\theta| = k$, where $\hat\theta$ is the solution to the optimization problem. Show that $\hat\theta$ is the mean of those $X_i$ such that $|X_i - \hat\theta| < k$.
(b) Given the sample data

1.551 -1.170 -0.201 1.143 0.138 3.103 1.455 -2.121 -1.672 6.150

and that $k = 2.0$, calculate the value of the M-estimate defined in part (a).

5. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from the exponential distribution $\mathrm{Exp}(\theta)$ with mean $\theta$ (the density is $f(x;\theta) = \theta^{-1}\exp(-x/\theta)$). Denote by $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ the corresponding order statistics for this random sample. Then $W_k = X_{(k+1)} - X_{(k)}$, $1 \le k \le n-1$, are called the spacings of the order statistics. By convention, define $W_0 = X_{(1)}$, the first order statistic.
(a) It is known that $W_0, W_1, \ldots, W_{n-1}$ are independent of each other, with
\[ W_k \sim \mathrm{Exp}\Big(\frac{\theta}{n-k}\Big), \quad k = 0, 1, \ldots, n-1. \]
Verify this result for the case $n = 2$.
(b) Let $T_n = X_{(1)} + X_{(2)} + \cdots + X_{(k)} + (n-k)X_{(k)}$. When $n = 10$ and $k = 8$, find the mean and variance of the statistic $T_n$.
(c) Suppose $n = 10$ and $k = 8$. Using your result from part (b), give an unbiased L-estimator for the parameter $\theta$.

Chapter 6 Maximum likelihood estimation

In textbooks such as this one, we have plenty of examples where the solution to the maximum likelihood estimation problem is easy to obtain. We now give some examples where the routine approaches do not work.
6.1 MLE examples

The simplest example is when we have i.i.d. data of size $n$ from the $N(\mu, \sigma^2)$ distribution (family). In this case, the log-likelihood function is given by
\[ \ell_n(\mu, \sigma^2) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2. \]
Note that we have omitted the constant that does not depend on the parameters. Regardless of the value of $\sigma^2$, the maximum point in $\mu$ is $\hat\mu = \bar X_n$, the sample mean. Let $\tilde\sigma^2 = n^{-1}\sum_{i=1}^n (x_i - \hat\mu)^2$, and for the moment do not regard it as an estimator but as a statistic. Then we find
\[ \ell_n(\hat\mu, \sigma^2) = -n\log\sigma - \frac{n\tilde\sigma^2}{2\sigma^2}. \]
This function is maximized at $\sigma^2 = \tilde\sigma^2$. Hence, the MLE of $\sigma^2$ is given by $\hat\sigma^2 = \tilde\sigma^2$.

Type I censoring. The next example is a bit unusual. In industry, it is vital to ensure that the components in a product will last a long time. Hence, we need a clear idea of their survival distributions. Such information can be obtained by collecting complete failure-time data on a random sample of the components. When the average survival time is very long, one has to terminate the experiment at some point, likely before all sampled components fail. Let the lifetime of a component be $X$ and the termination time be a non-random $T$. Then the observation may be censored, and we only observe $\min(X, T)$. This type of censoring is commonly referred to as type I censoring.

Suppose the failure-time data can be properly modelled by the exponential distribution $f(x;\theta) = \theta^{-1}\exp(-x/\theta)$, $x > 0$. Let $x_1, x_2, \ldots, x_m$ be the observed failure times of $m$ out of $n$ components; the remaining $n-m$ components have not failed by time $T$ (which is not random). In this case, the likelihood function is given by
\[ L_n(\theta) = \theta^{-m}\exp\Big\{-\theta^{-1}\Big[\sum_{i=1}^m x_i + (n-m)T\Big]\Big\}. \]
Interpreting the likelihood function via the definition above makes it easier to obtain this expression. Some mathematics behind this likelihood is as follows.
The probability of observing that $n-m$ components last longer than $T$ is given by
\[ \binom{n}{n-m}\{\exp(-\theta^{-1}T)\}^{n-m}\{1 - \exp(-\theta^{-1}T)\}^m. \]
Given that $m$ components failed before time $T$, their joint distribution is that of an i.i.d. sample from the conditional (truncated) exponential distribution whose density is
\[ \frac{\theta^{-1}\exp(-\theta^{-1}x)}{1 - \exp(-\theta^{-1}T)}, \quad 0 < x < T. \]
Hence, the joint density of $x_1, \ldots, x_m$ is given by
\[ \prod_{i=1}^m \Big[\frac{\theta^{-1}\exp(-\theta^{-1}x_i)}{1 - \exp(-\theta^{-1}T)}\Big]. \]
The product of the two factors gives, up to a constant, the algebraic expression of $L_n(\theta)$. Once the likelihood function is obtained, the explicit solution for the MLE of $\theta$ is easily found. There is more than one way to arrive at the above likelihood function.

Discrete parameter space. Suppose a finite population is made of two types of units, A and B. The population size is $N = A + B$, where $A$ and $B$ also denote the numbers of type A and type B units. Assume the value of $B$ is known, as occurs in a capture–recapture experiment. A sample of size $n$ is obtained by simple random sampling without replacement, and $x$ of the sampled units are of type B. Based on this observation, what is the MLE of $A$? To answer this question, we notice that the likelihood function is given by
\[ L(A) = \frac{\binom{A}{n-x}\binom{B}{x}}{\binom{A+B}{n}}. \]
Our task is to find an expression for the MLE of $A$. Note that "find the MLE" is not a very rigorous statement; let us leave this problem for classroom discussion.

Non-smooth density functions. Suppose we have an i.i.d. sample of size $n$ from the uniform distribution on $(0, \theta)$ with parameter space $\Theta = \mathbb{R}^+$. Find the MLE of $\theta$.

6.2 Newton Raphson algorithm

Other than in textbook examples, most applied problems do not permit an analytical solution to the maximum likelihood estimation problem. In this case, we resort to any optimization algorithm that works. For illustration, we still resort to "textbook examples."

Example 6.1. Let $X_1, \ldots, X_n$ be i.i.d.
random variables from the Weibull distribution with fixed scale parameter:
\[ f(x;\theta) = \theta x^{\theta-1}\exp(-x^\theta), \]
with parameter space $\Theta = \mathbb{R}^+$ and support $x > 0$. Clearly, the log-likelihood function of $\theta$ is given by
\[ \ell_n(\theta) = n\log\theta + (\theta - 1)\sum_{i=1}^n \log x_i - \sum_{i=1}^n x_i^\theta. \]
It is seen that
\[ \ell'_n(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log x_i - \sum_{i=1}^n x_i^\theta \log x_i; \qquad \ell''_n(\theta) = -\frac{n}{\theta^2} - \sum_{i=1}^n x_i^\theta \log^2 x_i < 0. \]
Therefore, the log-likelihood function is concave and hence has a unique maximum in $\theta$. Both when $\theta \to 0^+$ and when $\theta \to \infty$, we have $\ell_n(\theta) \to -\infty$. For numerical computation, we can easily locate $\theta_1 < \theta_2$ such that the maximum point of $\ell_n(\theta)$ is within the interval $[\theta_1, \theta_2]$.

Following the above example, a bisection algorithm can be applied to locate the maximum point of $\ell_n(\theta)$:

1. Compute $y_1 = \ell'_n(\theta_1)$, $y_2 = \ell'_n(\theta_2)$, and $\theta = (\theta_1 + \theta_2)/2$.
2. If $\ell'_n(\theta) > 0$, let $\theta_1 = \theta$; otherwise, let $\theta_2 = \theta$.
3. Repeat the last step until $|\theta_1 - \theta_2| < \epsilon$ for a pre-specified precision constant $\epsilon > 0$.
4. Report $\theta$ as the numerical value of the MLE $\hat\theta$.

It will be an exercise to numerically find upper and lower bounds and the MLE of $\theta$ given a data set.

The bisection method is easy to understand. Its convergence rate, in terms of how many steps it must take to reach the final result, is often judged to be not high enough. When $\theta$ is one-dimensional, our experience shows this criticism is not well founded. Nevertheless, it is useful to understand another standard method in numerical data analysis.

Suppose one has an initial guess of the maximum point of the likelihood function, say $\theta^{(0)}$. For any $\theta$ close to this point, we have
\[ \ell'_n(\theta) \approx \ell'_n(\theta^{(0)}) + \ell''_n(\theta^{(0)})(\theta - \theta^{(0)}). \]
If the initial guess is fairly close to the maximum point, then the second derivative satisfies $\ell''_n(\theta^{(0)}) < 0$. From the above approximation, we would guess that
\[ \theta^{(1)} = \theta^{(0)} - \ell'_n(\theta^{(0)})/\ell''_n(\theta^{(0)}) \]
is closer to the solution of $\ell'_n(\theta) = 0$. This consideration leads to the repeated updating
\[ \theta^{(k+1)} = \theta^{(k)} - \ell'_n(\theta^{(k)})/\ell''_n(\theta^{(k)}). \]
Starting from $\theta^{(0)}$, we therefore obtain a sequence $\theta^{(k)}$. If the problem is not tricky, this sequence converges to the maximum point of $\ell_n(\theta)$. Once it stabilizes, we regard the outcome as the numerical value of the MLE.

This iterative scheme is called the Newton--Raphson method. Its success depends on a good choice of $\theta^{(0)}$ and on the behaviour of $\ell_n(\theta)$ as a function of $\theta$. If the likelihood has many local maxima, then the outcome of the algorithm can be any one of these local maxima. For complex models and multi-dimensional $\theta$, convergence is far from guaranteed, and a good or lucky choice of $\theta^{(0)}$ is crucial. Although in theory each Newton--Raphson iteration moves $\theta^{(k+1)}$ toward the true maximum faster, we pay an extra cost in computing the second derivative. For multi-dimensional $\theta$, we must also invert a matrix, which is not always a pleasant task. The implementation of this method is not always so simple. Implementing Newton--Raphson for a simple data example will be an exercise.

Example 6.2. Logistic distribution. Let $X_1, X_2, \ldots, X_n$ be i.i.d. with density function

f(x; \theta) = \frac{\exp\{-(x-\theta)\}}{[1+\exp\{-(x-\theta)\}]^2}.

The support of the distribution is the whole real line, and the parameter space is $\mathbb{R}$. We usually call it a location distribution family. The log-likelihood function is given by

\ell_n(\theta) = n\theta - n\bar{x}_n - 2\sum_{i=1}^n \log[1+\exp\{-(x_i-\theta)\}].

Its score function is

\ell'_n(\theta) = s(\theta) = n - 2\sum_{i=1}^n \frac{\exp\{-(x_i-\theta)\}}{1+\exp\{-(x_i-\theta)\}}.

The MLE is a solution to $s(\theta) = 0$. One may easily find that

\ell''_n(\theta) = s'(\theta) = -2\sum_{i=1}^n \frac{\exp\{-(x_i-\theta)\}}{[1+\exp\{-(x_i-\theta)\}]^2} < 0.

Thus, the score function is monotone in $\theta$, which implies that the solution to $s(\theta) = 0$ is unique. It also implies that the solution is the maximum point of the likelihood, not a minimum or a stationary point. It is also evident that there is no analytical solution to this equation, so the Newton--Raphson algorithm can be a good choice for numerically evaluating the MLE in applications.
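A minimal Python sketch of the Newton--Raphson iteration for this logistic example, with a hypothetical sample and the sample mean as the starting value $\theta^{(0)}$ (note that $e^{-(x-\theta)}/\{1+e^{-(x-\theta)}\} = 1/\{1+e^{x-\theta}\}$, which is what the code evaluates):

```python
import math

def logistic_score(theta, x):
    """s(theta) = n - 2 * sum of 1/(1 + exp(x_i - theta))."""
    return len(x) - 2 * sum(1 / (1 + math.exp(xi - theta)) for xi in x)

def logistic_score_deriv(theta, x):
    """s'(theta) = -2 * sum p_i (1 - p_i), p_i = 1/(1 + exp(x_i - theta))."""
    ps = [1 / (1 + math.exp(xi - theta)) for xi in x]
    return -2 * sum(p * (1 - p) for p in ps)

def newton_mle(x, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - s(theta)/s'(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = logistic_score(theta, x) / logistic_score_deriv(theta, x)
        theta -= step
        if abs(step) < tol:
            break
    return theta

x = [-0.8, 0.4, 1.9, -0.1, 0.6]                    # hypothetical sample
theta_hat = newton_mle(x, theta0=sum(x) / len(x))  # sample mean as start
```

Because the score is strictly decreasing, the root found is the unique maximum point; the central starting value keeps the iteration away from the flat tails of $s(\theta)$.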
6.3 EM-algorithm

Suppose we have $n$ observations from a trinomial distribution. That is, there are $n$ independent trials, each with 3 possible outcomes. The corresponding probabilities are $p_1, p_2, p_3$. We summarize these observations into counts $n_1, n_2, n_3$. The log-likelihood function is

\ell_n(p_1, p_2, p_3) = n_1\log p_1 + n_2\log p_2 + n_3\log p_3.

Using the Lagrange method, we can easily show that the MLEs are $\hat p_j = n_j/n$ for $j = 1, 2, 3$.

Suppose, however, another $m$ trials were carried out but we know only that their outcomes are not of the third kind. In other words, the data contain some missing information. The log-likelihood function when the additional data are included becomes

\ell_n(p_1, p_2, p_3) = n_1\log p_1 + n_2\log p_2 + n_3\log p_3 + m\log(p_1+p_2).

Working out the MLE is no longer straightforward. There are many numerical algorithms that can be used to compute the MLE; we recommend the EM-algorithm in this case.

If we knew how many of these $m$ observations were of each type, we would have obtained the complete-data log-likelihood

\ell^c(p_1, p_2, p_3) = (n_1+m_1)\log p_1 + (n_2+m_2)\log p_2 + n_3\log p_3

where $c$ stands for "complete data". Since we do not know what $m_1$ and $m_2$ are, we replace them with predictions based on what we know already. In this case, we use conditional expectations.

E-step: Suppose the current estimates are $\hat p_1 = n_1/n$ and $\hat p_2 = n_2/n$. Then we might expect that, out of the $m$ non-type-III observations, $\hat m_1 = m\hat p_1/(\hat p_1+\hat p_2)$ are of type I and $\hat m_2 = m\hat p_2/(\hat p_1+\hat p_2)$ are of type II. That is, the conditional expectations (given the data and the current estimates of the parameter values) of $m_1$ and $m_2$ are given by $\hat m_1$ and $\hat m_2$. When $m_1$ and $m_2$ are replaced by their conditional expectations, we get a function

Q(p_1, p_2, p_3) = (n_1+\hat m_1)\log p_1 + (n_2+\hat m_2)\log p_2 + n_3\log p_3.

This is called the E-step because we replace the unobserved values by their conditional expectations.
M-step: In this step, we update the unknown parameters by the maximizer of $Q(p_1, p_2, p_3)$. The updated estimates are

\tilde p_1 = (n_1+\hat m_1)/(n+m), \quad \tilde p_2 = (n_2+\hat m_2)/(n+m), \quad \tilde p_3 = n_3/(n+m).

If they represent a better guess of the MLE, then we should update the Q-function accordingly, after which we should carry out the M-step again to obtain a more satisfactory approximation to the MLE. We therefore iterate between the E- and M-steps until some notion of convergence is met. This idea is particularly useful when the data structure is complex. In most cases, the EM iteration is guaranteed to increase the likelihood. Thus, it should converge, and converge at least to a local maximum.

6.4 EM-algorithm for finite mixture models

Let us envisage a population made of a finite number of subpopulations, each governed by a specific distribution from some distribution family. Taking a random sample from a finite mixture model, we obtain a set of units without knowing their subpopulation identities. The resulting random variable has density function

f(x; G) = \sum_{j=1}^m \pi_j f(x; \theta_j)

with $G$ denoting a mixing distribution on the parameter space $\Theta$ of $\theta$, assigning probability $\pi_j$ to $\theta_j$. Given a random sample of size $n$, $x_1, x_2, \ldots, x_n$, from this distribution, the log-likelihood function is given by

\ell_n(G) = \sum_{i=1}^n \log f(x_i; G). \quad (6.1)

Other than the order $m$, we regard $\pi_j, \theta_j$ as parameters to be estimated. Computing the maximum likelihood estimate of $G$ is to find the values of the $m$ pairs of $\pi_j$ and $\theta_j$ such that $\ell_n(G)$ is maximized. Taking advantage of the mixture model structure, the EM-algorithm can often be effectively implemented to locate the maximum point of the likelihood function.

Conceptually, each observation $x$ from a mixture model is part of a complete vector observation $(x, z^{\tau})$, where $z$ is a vector of length $m$ with a single 1 and 0s elsewhere. The position of the 1 is its subpopulation identity.
Suppose we have a set of complete observations in the form $(x_i, z_i^{\tau})$: $i = 1, 2, \ldots, n$. The log-likelihood function of the mixing distribution $G$ is given by

\ell^c(G) = \sum_{i=1}^n \sum_{j=1}^m z_{ij}\log\{\pi_j f(x_i; \theta_j)\}. \quad (6.2)

Since for each $i$, $z_{ij}$ equals 0 except for a single $j$ value, only one $\log\{\pi_j f(x_i; \theta_j)\}$ actually enters the log-likelihood function.

We use $x$ for the vector of the $x_i$ and $X$ for its corresponding random vector, and start the EM-algorithm with an initial mixing distribution with $m$ support points:

G^{(0)}(\theta) = \sum_{j=1}^m \pi_j^{(0)} 1(\theta_j^{(0)} \le \theta).

E-step. This step is to find the expected values of the missing data in the full-data likelihood function; in the context of the finite mixture model, these are the $z_i$. If the mixing distribution $G$ is given by $G^{(0)}$, its corresponding random variable has conditional expectation

E\{Z_{ij} \mid X = x; G^{(0)}\} = pr(Z_{ij} = 1 \mid X_i = x_i; G^{(0)})
= \frac{f(x_i; \theta_j^{(0)})\, pr(Z_{ij} = 1; G^{(0)})}{\sum_{k=1}^m f(x_i; \theta_k^{(0)})\, pr(Z_{ik} = 1; G^{(0)})}
= \frac{\pi_j^{(0)} f(x_i; \theta_j^{(0)})}{\sum_{k=1}^m \pi_k^{(0)} f(x_i; \theta_k^{(0)})}.

The first equality uses two facts: the expectation of an indicator random variable equals the probability of "success", and only the $i$th observation is relevant to the subpopulation identity of the $i$th unit. The second equality comes from the standard Bayes formula. The third spells out the probability of "success" when $G^{(0)}$ is the true mixing distribution. The superscript $(0)$ reminds us that the corresponding quantities are from $G^{(0)}$, the initial mixing distribution. One should also note that the expression is explicit and numerically easy to compute, as long as the density function itself can be easily computed.

We use the notation $w_{ij}^{(0)}$ for $E\{Z_{ij} \mid X = x; G^{(0)}\}$. Replacing $z_{ij}$ by $w_{ij}^{(0)}$ in $\ell^c(G)$, we obtain a function usually denoted as

Q(G; G^{(0)}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^{(0)} \log\{\pi_j f(x_i; \theta_j)\}. \quad (6.3)

In this expression, $Q$ is a function of $G$, and its functional form is determined by $G^{(0)}$.
The E-step ends at producing this function.

M-step. Given this Q-function, it is often simple to find a mixing distribution $G$ that maximizes it. Note that $Q$ has the following decomposition:

Q(G; G^{(0)}) = \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)}\Big\}\log \pi_j + \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)} \log f(x_i; \theta_j)\Big\}.

In this decomposition, the two additive terms are functions of two separate parts of $G$. The first term is a function of the mixing probabilities only. The second term is a function of the subpopulation parameters only. Hence, we can search for the maxima of these two functions separately to find the overall solution.

The algebraic form of the first term is identical to the log-likelihood of a multinomial distribution. The maximizing solution is given by

\pi_j^{(1)} = n^{-1}\sum_{i=1}^n w_{ij}^{(0)}

for $j = 1, 2, \ldots, m$. The second term further decomposes into the sum of $m$ log-likelihood functions, one for each subpopulation. When $f(x; \theta)$ is a member of a classical parametric distribution family, the maximization with respect to $\theta$ often has an explicit analytical solution. With a generic $f(x; \theta)$, we cannot give an explicit expression but only an abstract one:

\theta_j^{(1)} = \arg\sup_{\theta}\Big\{\sum_{i=1}^n w_{ij}^{(0)} \log f(x_i; \theta)\Big\}.

The mixing distribution

G^{(1)}(\theta) = \sum_{j=1}^m \pi_j^{(1)} 1(\theta_j^{(1)} \le \theta)

then replaces the role of $G^{(0)}$, and we go back to the E-step.

Iterating between the E-step and M-step leads to a sequence of intermediate estimates of the mixing distribution: $G^{(k)}$. Often, this sequence converges to at least a local maximum of $\ell_n(G)$. With some luck, the limit is the global maximum. In most applications, one would try a number of choices of $G^{(0)}$ and compare the values of $\ell_n(G^{(k)})$ the EM-algorithm leads to. The one with the highest value has its $G^{(k)}$ regarded as the maximum likelihood estimate of $G$.

The algorithm stops after many iterations when the difference between $G^{(k)}$ and $G^{(k-1)}$ is considered too small to continue. Other convergence criteria may also be used.
6.4.1 Data Examples

Leroux and Puterman (1992) and Chen and Kalbfleisch (1996) analyze data on the movements of a fetal lamb in each of 240 consecutive 5-second intervals and propose a mixture of Poisson distributions. The observations can be summarized by the following table.

x     0    1   2   3  4  5  6  7
freq  182  41  12  2  2  0  0  1

It is easily seen that the distribution of the counts is over-dispersed: the sample mean is 0.358, which is significantly smaller than the sample variance of 0.658, given that the sample size is 240. A finite mixture model is very effective at explaining the over-dispersion. There is general agreement that a finite Poisson mixture model with order $m = 2$ is most suitable. We use this example to demonstrate the use of the EM-algorithm for computing the MLE of the mixing distribution given $m = 2$.

Since the sample mean is 0.358 and the data contain a lot of zeros, let us choose an initial mixing distribution with

(\pi_1^{(0)}, \pi_2^{(0)}, \theta_1^{(0)}, \theta_2^{(0)}) = (0.7, 0.3, 0.1, 4.0).

We do not have more specific reasons behind the above choice. A simplistic implementation of the EM-algorithm for this data set is as follows.

pp = 0.7; theta = c(0.1, 4.0)
xx = c(rep(0, 182), rep(1, 41), rep(2, 12), 3, 3, 4, 4, 7)
# data inputted, initial mixing distribution chosen
last = c(pp, theta)
dd = 1
while (dd > 0.000001) {
  temp1 = pp * dpois(xx, theta[1])
  temp2 = (1 - pp) * dpois(xx, theta[2])
  w1 = temp1 / (temp1 + temp2)
  w2 = 1 - w1
  # E-step completed
  pp = mean(w1)
  theta[1] = sum(w1 * xx) / sum(w1)
  theta[2] = sum(w2 * xx) / sum(w2)
  # M-step completed
  updated = c(pp, theta)
  dd = sum((last - updated)^2)
  last = updated
}
print(updated)

When the EM-algorithm converges, we get $\hat\pi_1 = 0.938$, $\hat\theta_1 = 0.229$ and $\hat\theta_2 = 2.307$. The log-likelihood value at this $\hat G$ equals $-186.99$ (based on the usual expression of the Poisson probability mass function).
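As a check, the fitted frequencies $240\,f(x; \hat G)$ reported below can be recomputed directly from these (rounded) estimates; a Python sketch, where small discrepancies in the last digit come from the rounding of the estimates:

```python
from math import exp, factorial

def pois(x, lam):
    """Poisson probability mass function."""
    return exp(-lam) * lam ** x / factorial(x)

# estimates reported above, rounded to three digits
p1, t1, t2, n = 0.938, 0.229, 2.307, 240

# fitted frequency n * {p1 Pois(x; t1) + (1 - p1) Pois(x; t2)} for x = 0..7
fitted = [n * (p1 * pois(x, t1) + (1 - p1) * pois(x, t2)) for x in range(8)]
print([round(f, 1) for f in fitted])
```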
The fitted frequency vector is given by

x            0      1     2    3    4    5    6    7
freq         182    41    12   2    2    0    0    1
fitted freq  180.4  44.5  8.6  3.4  1.8  0.8  0.3  0.1

6.5 EM-algorithm for finite mixture models repeated

Let us envisage a population made of a finite number of subpopulations, each governed by a specific distribution from some distribution family. Taking a random sample from a finite mixture model, we obtain a set of units without knowing their subpopulation identities. The resulting random variable has density function

f(x; G) = \sum_{j=1}^m \pi_j f(x; \theta_j)

with $G$ denoting a mixing distribution on the parameter space $\Theta$ of $\theta$, assigning probability $\pi_j$ to $\theta_j$. Given a random sample of size $n$, $x_1, x_2, \ldots, x_n$, from this distribution, the log-likelihood function is given by

\ell_n(G) = \sum_{i=1}^n \log f(x_i; G). \quad (6.4)

Other than the order $m$, we regard $\pi_j, \theta_j$ as parameters to be estimated. Computing the maximum likelihood estimate of $G$ is to find the values of the $m$ pairs of $\pi_j$ and $\theta_j$ such that $\ell_n(G)$ is maximized. Taking advantage of the mixture model structure, the EM-algorithm can often be effectively implemented to locate the maximum point of the likelihood function.

Conceptually, each observation $x$ from a mixture model is part of a complete observation $(x, z)$, where $z$ takes value $j$ with probability $\pi_j$ for $j = 1, 2, \ldots, m$. Suppose we have a set of complete observations in the form $(x_i, z_i)$: $i = 1, 2, \ldots, n$. The log-likelihood function of the mixing distribution $G$ is given by

\ell^c(G) = \sum_{i=1}^n \sum_{j=1}^m 1(z_i = j)\log\{\pi_j f(x_i; \theta_j)\}. \quad (6.5)

Clearly, for each $i$ only one $\log\{\pi_j f(x_i; \theta_j)\}$ actually enters the log-likelihood function. We use $x$ for the vector of the $x_i$ and $X$ for its corresponding random vector, and start the EM-algorithm with an initial mixing distribution with $m$ support points:

G^{(0)}(\theta) = \sum_{j=1}^m \pi_j^{(0)} 1(\theta_j^{(0)} \le \theta).

E-step. This step is to find the expected values of the missing data in the full-data likelihood function.
If the mixing distribution $G$ is given by $G^{(0)}$, its corresponding random variable has conditional expectation

E\{1(Z_i = j) \mid X = x; G^{(0)}\} = \frac{f(x_i; \theta_j^{(0)})\, pr(Z_i = j; G^{(0)})}{\sum_{k=1}^m f(x_i; \theta_k^{(0)})\, pr(Z_i = k; G^{(0)})} = \frac{\pi_j^{(0)} f(x_i; \theta_j^{(0)})}{\sum_{k=1}^m \pi_k^{(0)} f(x_i; \theta_k^{(0)})}.

The first equality follows from the standard Bayes formula together with two facts: the expectation of an indicator random variable equals the probability of "success", and only the $i$th observation is relevant to the subpopulation identity of the $i$th unit. The second equality spells out the probability of "success" when $G^{(0)}$ is the true mixing distribution. The superscript $(0)$ reminds us that the corresponding quantities are from $G^{(0)}$, the initial mixing distribution. One should also note that the expression is explicit and numerically easy to compute, as long as the density function itself can be easily computed.

We use the notation $w_{ij}^{(0)}$ for $E\{1(Z_i = j) \mid X = x; G^{(0)}\}$. Replacing $1(z_i = j)$ by $w_{ij}^{(0)}$ in $\ell^c(G)$, we obtain a function usually denoted as

Q(G; G^{(0)}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^{(0)} \log\{\pi_j f(x_i; \theta_j)\}. \quad (6.6)

In this expression, $Q$ is a function of $G$, and its functional form is determined by $G^{(0)}$. The E-step ends at producing this function. In other words, $Q(G; G^{(0)})$ is the conditional expectation of $\ell^c(G)$ when $X = x$ is given and $G^{(0)}$ is regarded as the true mixing distribution behind $X$.

M-step. Given this Q-function, it is often simple to find a mixing distribution $G$ that maximizes it. Note that $Q$ has the following decomposition:

Q(G; G^{(0)}) = \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)}\Big\}\log \pi_j + \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)} \log f(x_i; \theta_j)\Big\}.

In this decomposition, the two additive terms are functions of two separate parts of $G$. The first term is a function of the mixing probabilities only. The second term is a function of the subpopulation parameters only. Hence, we can search for the maxima of these two functions separately to find the overall solution.
The algebraic form of the first term is identical to the log-likelihood of a multinomial distribution. The maximizing solution is given by

\pi_j^{(1)} = n^{-1}\sum_{i=1}^n w_{ij}^{(0)}

for $j = 1, 2, \ldots, m$. The second term further decomposes into the sum of $m$ log-likelihood functions, one for each subpopulation. When $f(x; \theta)$ is a member of a classical parametric distribution family, the maximization with respect to $\theta$ often has an explicit analytical solution. With a generic $f(x; \theta)$, we cannot give an explicit expression but only an abstract one:

\theta_j^{(1)} = \arg\sup_{\theta}\Big\{\sum_{i=1}^n w_{ij}^{(0)} \log f(x_i; \theta)\Big\}

for $j = 1, 2, \ldots, m$. The mixing distribution

G^{(1)}(\theta) = \sum_{j=1}^m \pi_j^{(1)} 1(\theta_j^{(1)} \le \theta)

is an updated estimate of $G$ from $G^{(0)}$ based on the data. We then replace the role of $G^{(0)}$ by $G^{(1)}$ and go back to the E-step.

Iterating between the E-step and M-step leads to a sequence of intermediate estimates of the mixing distribution: $G^{(k)}$. Often, this sequence converges to at least a local maximum of $\ell_n(G)$. With some luck, the limit is the global maximum. In most applications, one would try a number of choices of $G^{(0)}$ and compare the values of $\ell_n(G^{(k)})$ the EM-algorithm leads to. The one with the highest value has its $G^{(k)}$ regarded as the maximum likelihood estimate of $G$.

The algorithm stops after many iterations when the difference between $G^{(k)}$ and $G^{(k-1)}$ is considered too small to continue. Other convergence criteria may also be used.

6.6 Assignment problems

1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables from the Weibull distribution with fixed scale parameter, whose density function is given by

f(x; \theta) = \theta x^{\theta-1}\exp(-x^{\theta}), \quad x > 0,\ \theta > 0.

(You may want to first go over Example 6.1 in the Lecture Notes.) Suppose we observe the sample data
0.6944788, 0.3285051, 0.7165376, 0.8865894, 1.0858084, 0.4040884, 1.0538935, 1.2487677, 1.1523552, 0.9977360, 0.7251880, 1.0716697, 1.0382114, 1.1535934, 0.9175693, 0.5537849, 0.9701821, 0.5486354, 1.0168818, 0.5193687

(a) For this sample, numerically find a lower bound $\theta_1$ and an upper bound $\theta_2$, so that the maximum point of the likelihood function is within the interval $[\theta_1, \theta_2]$.

(b) Use a bisection algorithm (as discussed in Section 6.2 of the Lecture Notes) to numerically find the MLE of $\theta$ for this observed sample. You need to attach your code, preferably in R.

2. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables from the Weibull distribution with fixed scale parameter.

(a) Work out analytically the updating formula for the parameter $\theta$ in the Newton--Raphson method (as discussed in Section 6.2 of the Lecture Notes). In other words, how do you obtain the value $\theta^{(k+1)}$ in the $(k+1)$th iterative step based on the value $\theta^{(k)}$?

(b) For the same observed sample as in Problem 1, numerically find the MLE of the parameter $\theta$ using the Newton--Raphson algorithm. Start from the initial value $\theta^{(0)} = 1$ and report the first 5 values of the iteration. Code is part of the required solution.

3. Let $f(x; \theta)$ be the p.m.f. of the Poisson distribution with mean $\theta$. Derive the KL divergence function $K(\theta_1, \theta_2)$. Repeat this problem when $f(x; \alpha) = \{\Gamma(\alpha)\}^{-1}x^{\alpha-1}\exp(-x)$.

Chapter 7 Properties of MLE

Consider the situation where we have a data set $x$ whose joint density function is a member of a distribution family specified by density functions $\{f(x; \theta): \theta \in \Theta\}$. Suppose $\eta = g(\theta)$ is an invertible parameter transformation; denote the inverse transformation by $\theta = h(\eta)$ and let the parameter space of $\eta$ be $\Upsilon$. Clearly, for each $\theta$, there is an $\eta$ such that

f(x; \theta) = f(x; h(\eta)) = \tilde f(x; \eta)

where we have introduced $\tilde f(x; \eta)$ for the function under the new parameterization. In other words, $\{f(x; \theta): \theta \in \Theta\} = \{\tilde f(x; \eta): \eta \in \Upsilon\}$.
The likelihood functions in the two systems are related by $\ell(\theta) = \tilde\ell(\eta)$ for $\eta = g(\theta)$. If $\hat\theta$ is a value such that $\ell(\hat\theta) = \sup_{\theta\in\Theta}\ell(\theta)$, we must also have

\tilde\ell(g(\hat\theta)) = \ell(\hat\theta) = \sup_{\theta\in\Theta}\ell(\theta) = \sup_{\eta\in\Upsilon}\tilde\ell(\eta).

Hence, $g(\hat\theta)$ is the MLE of $\eta = g(\theta)$. In conclusion, the MLE, as a general method of point estimation, is equivariant. If we estimate $\mu$ by $\bar x$, then we estimate $\mu^2$ by $\bar x^2$, in common notation.

Next, we give results to motivate the use of the MLE. The following inequality plays an important role.

Jensen's inequality. Let $X$ be a random variable with finite mean and let $g$ be a convex function. Then $E[g(X)] \ge g[E(X)]$.

Proof: We give a heuristic proof. A function $g$ is convex if and only if for every set of points $x_1, x_2, \ldots, x_n$ and positive numbers $p_1, p_2, \ldots, p_n$ such that $\sum_{i=1}^n p_i = 1$, we have

\sum_{i=1}^n p_i g(x_i) \ge g\Big(\sum_{i=1}^n p_i x_i\Big).

This essentially proves the inequality when $X$ is a discrete random variable with a finite number of possible values. Since every random variable can be approximated by such random variables, we can take a limit to get the general case. This is always possible when $X$ has a finite first moment.

Kullback--Leibler divergence. Suppose $f(x)$ and $g(x)$ are two density functions with respect to some $\sigma$-finite measure. The Kullback--Leibler divergence between $f$ and $g$ is defined to be

K(f, g) = E\{\log[f(X)/g(X)]; f\}

where the expectation is computed when $X$ has distribution $f$. Let $Y = g(X)/f(X)$ and $h(y) = -\log(y)$; it is seen that $h(y)$ is a convex function. It is easily seen that $E\{Y\} \le 1$, where the inequality can be strict if the support of $f(x)$ is a proper subset of that of $g(x)$. In any case, by Jensen's inequality, we have

E\{h(Y)\} \ge h(E\{Y\}) \ge 0.

This implies that $K(f, g) \ge 0$ for any $f$ and $g$, and clearly $K(f, f) = 0$. Because $K(f, g)$ is positive unless $f = g$, it serves to measure how different $g$ is from $f$. At the same time, the KL divergence is not a distance in the mathematical sense, because $K(f, g) \ne K(g, f)$ in general.
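Both claims, non-negativity and asymmetry, can be illustrated numerically. The sketch below integrates $f\log(f/g)$ by the trapezoid rule for two normal densities, for which the standard closed form $K(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2)) = \log(\sigma_2/\sigma_1) + \{\sigma_1^2+(\mu_1-\mu_2)^2\}/(2\sigma_2^2) - 1/2$ is available for comparison:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl(f, g, lo=-20.0, hi=20.0, steps=80000):
    """Trapezoid-rule approximation of K(f, g) = E_f log{f(X)/g(X)}."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        fx = f(x)
        if fx > 0:                     # integrand is 0 where f vanishes
            w = 0.5 if i in (0, steps) else 1.0
            total += w * fx * math.log(fx / g(x))
    return total * h

f = lambda x: normal_pdf(x, 0.0, 1.0)  # N(0, 1)
g = lambda x: normal_pdf(x, 1.0, 2.0)  # N(1, 4)

kfg, kgf = kl(f, g), kl(g, f)
print(kfg, kgf)   # both positive, and clearly unequal
```

With these choices the closed form gives $K(f,g) = \log 2 + 2/8 - 1/2 \approx 0.443$ and $K(g,f) = -\log 2 + 5/2 - 1/2 \approx 1.307$, so the asymmetry is pronounced.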
Let $\mathcal{F}$ be a parametric distribution family possessing densities $f(x; \theta)$ with parameter space $\Theta$. Let $f(x)$ be a density function that may or may not be a member of $\mathcal{F}$. If we wish to find a density in $\mathcal{F}$ that is the best approximation to $f(x)$ in the KL-divergence sense, a sensible choice is $f(x; \hat\theta)$ such that

\hat\theta = \arg\min_{\theta\in\Theta} K(f(x), f(x; \theta)).

In most applications, $f(x)$ is not known, but we have an i.i.d. sample $X_1, \ldots, X_n$ from it. In this case, we may approximate $K(f(x), f(x; \theta))$ as follows:

K(f(x), f(x; \theta)) = \int \log\{f(x)/f(x; \theta)\} f(x)\,dx
\approx n^{-1}\sum_{i=1}^n \log\{f(x_i)/f(x_i; \theta)\}
= n^{-1}\sum_{i=1}^n \log f(x_i) - n^{-1}\ell_n(\theta)

where the second term is the usual log-likelihood function. Hence, minimizing the KL-divergence is approximately the same as maximizing the likelihood function. The analogy extends to situations where non-i.i.d. observations are available.

Unlike the UMVUE and some other estimators, the MLE does not aim at most precisely determining the best possible value of the "true" $\theta$. One may wonder whether it measures up when critically examined from different angles. This is the topic of the next section.

7.1 Trivial consistency

Under very general conditions, the MLE is strongly consistent. We work out a simple case here. Consider the situation where $\Theta = \{\theta_j: j = 1, \ldots, k\}$ for some finite $k$. Assume that

F(x; \theta_j) \ne F(x; \theta_l)

for at least one $x$ value when $j \ne l$, where $F(x; \theta)$ is the cumulative distribution function of $f(x; \theta)$. This condition means that the model is identifiable by its parameters. We assume an i.i.d. sample from $F(x; \theta_0)$ has been obtained but pretend that we do not know $\theta_0$; instead, we want to estimate it by the MLE.

Let $\ell_n(\theta)$ be the log-likelihood function based on the i.i.d. sample of size $n$. By the strong law of large numbers, we have

n^{-1}\{\ell_n(\theta) - \ell_n(\theta_0)\} \to -K(f(x; \theta_0), f(x; \theta))

almost surely for any $\theta \in \Theta$. The identifiability condition implies that $K(f(x; \theta_0), f(x; \theta)) > 0$ for any $\theta \ne \theta_0$.
Therefore, we have $\ell_n(\theta) < \ell_n(\theta_0)$ almost surely as $n \to \infty$. When there are only finitely many choices of $\theta$ in $\Theta$, we must have

\max\{\ell_n(\theta): \theta \ne \theta_0\} < \ell_n(\theta_0)

almost surely. Hence, the MLE $\hat\theta_n = \theta_0$ almost surely. Let us summarize the result as follows.

Theorem 7.1. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the distribution family $\{f(x; \theta): \theta \in \Theta\}$ with true parameter value $\theta = \theta_0$. Assume the identifiability condition that

F(x; \theta') \ne F(x; \theta'') \quad (7.1)

for at least one $x$ whenever $\theta' \ne \theta''$. Assume also that

E|\log f(X; \theta)| < \infty \quad (7.2)

for any $\theta \in \Theta$, where the expectation is computed under $\theta_0$. Then, the MLE $\hat\theta \to \theta_0$ almost surely when $\Theta = \{\theta_j: j = 0, 1, \ldots, k\}$ for some finite $k$.

Although the above proof is very simple, the idea behind it can be applied to prove the general result. For any subset $B$ of $\Theta$, define

f(x; B) = \sup_{\theta\in B} f(x; \theta).

We assume that $f(x; B)$ is a measurable function of $x$ for all $B$ under consideration. We can generalize the above theorem as follows.

Theorem 7.2. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the distribution family $\{f(x; \theta): \theta \in \Theta\}$, and suppose $\Theta = \cup_{j=0}^k B_j$ for some finite $k$. Assume that the true value of the parameter is $\theta = \theta_0 \in B_0$ and that

E\{\log f(X; B_j)\} < E\{\log f(X; \theta_0)\} \quad (7.3)

for $j = 1, 2, \ldots, k$. Then, the MLE $\hat\theta \in B_0$ almost surely.

7.2 Trivial consistency for one-dimensional θ

Consider the situation where we have a set of i.i.d. observations from a one-dimensional parametric family $\{f(x; \theta): \theta \in \Theta \subset \mathbb{R}\}$. The log-likelihood function remains the same as

\ell_n(\theta) = \sum_{i=1}^n \log f(x_i; \theta).

We have likely defined the score function earlier; given i.i.d. observations, it is

S_n(\theta; x) = \sum_{i=1}^n \frac{\partial \log f(x_i; \theta)}{\partial\theta}.

We will use the plain $S(\theta; x)$ when $x$ is regarded as a single observation. We may be sloppy and use the notation $E\{S(\theta)\}$, in which $x$ has to be interpreted as the random variable $X$ whose distribution is $f(x; \theta)$, with the same $\theta$ in $S$ and $f$.
Let us put down a few regularity conditions. They are not the most general, but they suffice in the current situation.

R0 The parameter space of $\theta$ is an open subset of $\mathbb{R}$.

R1 $f(x; \theta)$ is differentiable to order three with respect to $\theta$ at all $x$.

R2 For each $\theta_0 \in \Theta$, there exist functions $g(x)$, $H(x)$ such that for all $\theta$ in a neighborhood $N(\theta_0)$,

(i) \Big|\frac{\partial f(x; \theta)}{\partial\theta}\Big| \le g(x); \quad
(ii) \Big|\frac{\partial^2 f(x; \theta)}{\partial\theta^2}\Big| \le g(x); \quad
(iii) \Big|\frac{\partial^3 \log f(x; \theta)}{\partial\theta^3}\Big| \le H(x)

hold for all $x$, and

\int g(x)\,dx < \infty; \quad E_{\theta_0}\{H(X)\} < \infty.

R3 For each $\theta \in \Theta$,

0 < E_\theta\Big\{\frac{\partial \log f(x; \theta)}{\partial\theta}\Big\}^2 < \infty.

Although the integration is stated with respect to $dx$, the results we are going to state remain valid if it is replaced by some $\sigma$-finite measure. For instance, the result is applicable to the MLE under the Poisson model, where $dx$ must be replaced by summation over the non-negative integers. All conditions are stated as if they are required for all $x$. An exception over a 0-measure set of $x$ is allowed, as long as this 0-measure set is the same for all $\theta \in \Theta$.

Lemma 7.1. (1) Under the regularity conditions, we have

E\Big\{\frac{\partial \log f(X; \theta)}{\partial\theta}; \theta\Big\} = 0.

(2) Under the regularity conditions, we have

E\Big\{\frac{\partial \log f(X; \theta)}{\partial\theta}\Big\}^2 = -E\Big\{\frac{\partial^2 \log f(X; \theta)}{\partial\theta^2}\Big\} = I(\theta).

Proof. We first remark that the first result is the same as stating $E\{S(\theta)\} = 0$. The proof of (1) is based on the fact that

\int f(x; \theta)\,dx = 1.

Taking the derivative with respect to $\theta$ on both sides, permitting the exchange of derivative and integration under regularity condition R2, and expressing the result properly, we get (1). To prove (2), notice that

\frac{\partial^2 \log f(X; \theta)}{\partial\theta^2} = \Big\{\frac{f''(X; \theta)}{f(X; \theta)}\Big\} - \Big\{\frac{f'(X; \theta)}{f(X; \theta)}\Big\}^2.

The result is obtained by taking expectations on both sides and using the fact that

E\Big\{\frac{f''(X; \theta)}{f(X; \theta)}\Big\} = \int f''(x; \theta)\,dx = 0.

This completes the proof.

We now give a simple consistency proof when $\theta$ is one-dimensional.

Theorem 7.3. Given an i.i.d. sample of size $n$ from a one-parameter model $\{f(x; \theta): \theta \in \Theta \subset \mathbb{R}\}$, suppose $\theta^*$ is the true parameter value.
Under Conditions R0--R3, there exists a sequence $\hat\theta_n$ such that (i) $S_n(\hat\theta_n) = 0$ almost surely; (ii) $\hat\theta_n \to \theta^*$ almost surely.

Proof. (i) As a function of $\theta$, $E\{S(\theta)\}$ has derivative equal to $-I(\theta^*)$ at $\theta = \theta^*$. Hence, it is a decreasing function at $\theta^*$. This implies the existence of a sufficiently small $\epsilon > 0$ such that

E\{S(\theta^* + \epsilon)\} < 0 < E\{S(\theta^* - \epsilon)\}.

By the law of large numbers, we have

n^{-1}S_n(\theta^* \pm \epsilon) \stackrel{a.s.}{\longrightarrow} E\{S(\theta^* \pm \epsilon)\}.

Hence, almost surely, we have $S_n(\theta^* + \epsilon) < 0 < S_n(\theta^* - \epsilon)$. By the intermediate value theorem, there exists a $\hat\theta \in (\theta^* - \epsilon, \theta^* + \epsilon)$ such that $S_n(\hat\theta) = 0$. This proves (i). (ii) is a direct consequence of (i), as $\epsilon$ can be made arbitrarily small.

7.3 Asymptotic normality of MLE after the consistency is established

Under the assumption that $f(x; \theta)$ is smooth and the MLE $\hat\theta$ is a consistent estimator of $\theta$, we must have $S_n(\hat\theta) = 0$. By the mean value theorem of mathematical analysis, we have

S_n(\theta^*) = S_n(\hat\theta) + S'_n(\tilde\theta)(\theta^* - \hat\theta)

where $\tilde\theta$ is a parameter value between $\theta^*$ and $\hat\theta$. By the result in the last lemma, we have

n^{-1}S'_n(\tilde\theta) \to -I(\theta^*),

the negative of the Fisher information, almost surely. In addition, the classical central limit theorem implies

n^{-1/2}S_n(\theta^*) \to N(0, I(\theta^*)).

Thus, by Slutsky's theorem, we find

\sqrt{n}(\hat\theta - \theta^*) = n^{-1/2}I^{-1}(\theta^*)S_n(\theta^*) + o_p(1) \to N(0, I^{-1}(\theta^*))

in distribution as $n \to \infty$.

Many users, including statisticians, ignore the regularity conditions. Indeed, they are satisfied by most commonly used models. If one does not bother with full rigour, he or she should at least make sure that the parameter value under consideration is an interior point and that the likelihood function is smooth enough. If the data set does not have an i.i.d. structure, one should make sure that some form of uniformity holds.

7.4 Asymptotic efficiency, super-efficiency, one-step update scheme

By the Cramer--Rao information inequality, for any estimator of $\theta$ given i.i.d.
data and a sufficiently regular model, we have

var(\hat\theta_n) \ge I_n^{-1}(\theta^*)

for any unbiased estimator $\hat\theta_n$. The MLE under regularity conditions has asymptotic variance $I^{-1}(\theta^*)$ at rate $\sqrt{n}$. Loosely speaking, the above inequality becomes an equality for the MLE. Hence, the MLE is "efficient": no other estimator can achieve a lower asymptotic variance.

Let us point out that the strict interpretation of asymptotic efficiency is not correct. Suppose we have a set of i.i.d. observations from $N(\theta, 1)$. The MLE of $\theta$ is $\bar X_n$. Clearly, if $\theta^*$ is the true value, we have

\sqrt{n}(\bar X_n - \theta^*) \stackrel{d}{\longrightarrow} N(0, 1).

Can we do better than the MLE? Let

\tilde\theta_n = 0 \text{ if } |\bar X_n| \le n^{-1/4}; \quad \tilde\theta_n = \bar X_n \text{ otherwise}.

When the true value $\theta^* = 0$, then $pr(|\bar X_n| \le n^{-1/4}) \to 1$ as $n \to \infty$. Hence,

\sqrt{n}(\tilde\theta_n - \theta^*) \stackrel{d}{\longrightarrow} N(0, 0)

with asymptotic variance 0 at rate $\sqrt{n}$. When the true value $\theta^* \ne 0$, then $pr(|\bar X_n| \le n^{-1/4}) \to 0$, which implies $pr(\tilde\theta_n = \bar X_n) \to 1$. Consequently,

\sqrt{n}(\tilde\theta_n - \theta^*) \stackrel{d}{\longrightarrow} N(0, 1).

What have we seen? If $\theta^* \ne 0$, then $\tilde\theta_n$ has the same limiting distribution as $\bar X_n$ at the same rate, so they have the same asymptotic efficiency. When $\theta^* = 0$, the asymptotic variance of $\tilde\theta_n$ is 0, which is smaller than that of $\bar X_n$ (at rate $\sqrt{n}$). It appears that the unattractive $\tilde\theta_n$ is superior to the MLE in this example.

Is there any way to discredit $\tilde\theta_n$? Statisticians find that if $\theta^* = n^{-1/4}$, namely a value that changes with $n$, then the variance of $\sqrt{n}\tilde\theta_n$ goes to infinity while that of $\sqrt{n}\bar X_n$ remains the same. It is a good exercise to compute its variance in this specific case. If some performance uniformity in $\theta$ is required, the MLE is the one with the lowest asymptotic variance. Hence, the MLE is generally referred to as asymptotically efficient under regularity conditions, or simply asymptotically optimal. Estimators such as $\tilde\theta_n$ are called super-efficient estimators. Their existence makes us think harder. We do not recommend these estimators.
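A small simulation sketch of the fixed-parameter case $\theta^* = 0$ with $n = 1600$ (the moving case $\theta^* = n^{-1/4}$ is left to the assignment problems):

```python
import random

random.seed(1)
n, reps = 1600, 500
cutoff = n ** (-0.25)           # the n^{-1/4} threshold of the text

hits = 0                        # times the thresholded estimator returns 0
sq_mle, sq_tilde = 0.0, 0.0     # accumulated squared errors; true theta* = 0
for _ in range(reps):
    xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    tilde = 0.0 if abs(xbar) <= cutoff else xbar
    hits += (tilde == 0.0)
    sq_mle += xbar ** 2
    sq_tilde += tilde ** 2

print(hits / reps)              # close to 1: theta-tilde collapses to 0
print(n * sq_mle / reps)        # close to 1, the MLE's variance limit
print(n * sq_tilde / reps)      # close to 0: "super-efficiency" at theta* = 0
```

Since the standard deviation of $\bar X_n$ is $1/40$ while the cutoff is about $0.158$, the threshold is essentially never exceeded here, which is exactly the degenerate limiting behaviour at $\theta^* = 0$.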
If one estimator has asymptotic variance σ21 and the other one has asymp- totic variance σ22 at the same rate and both asymptotically unbiased, then the relative efficiency of θˆ1 against θˆ2 is defined as σ 2 2/σ 2 1. A higher ratio implies higher relative efficiency. This definition is no longer emphasized in contemporary textbooks. 7.5. ASSIGNMENT PROBLEMS 85 Suppose θ˜ is not asymptotically efficient. However, it is good enough such that for any > 0, we have pr{n1/4|θ˜ − θ| ≥ } → 0 as n→∞. Let θˆn = θ˜n − `′n(θ˜n)/`′′n(θ˜n) in apparent notation. Under regularity conditions, it can be shown that √ n(θˆ − θ∗) d−→ N(0, I−1(θ∗)). Namely, the Newton-Raphson update formula can turn an ordinary estimator into an asymptotically efficient estimator easily. Suppose we have a set of i.i.d. observations from Cauchy distribution with location parameter θ. Under this setting, the score function has multiple solutions. It is not straightforward to obtain the MLE in applications. One way to avoid this problem is to estimate θ by the sample median which is not optimal. The above updating formula can then be used to get an asymptotically efficient (optimal) estimator. Let us leave it as an exercise problem. 7.5 Assignment problems 1. Let X1, X2, . . . , Xn be a set of i.i.d. random variables from N(θ, 1), and let X¯n be the sample mean. Suppose θ ∗ is the true value of the mean parameter θ. Let θ˜n = { 0 if |X¯n| ≤ n−1/4 ; X¯n otherwise . (a) For θ∗ = n−1/4 which changes with n, show that P (θ˜n = 0)→ 0.5 as n→∞. 86 CHAPTER 7. PROPERTIES OF MLE (b) Under the same condition as in (a), show that the MSE of θ˜n nE{(θ˜n − θ∗)2} → ∞. Hint: develop an inequality based on result (a). (c) Use computer to generate data of size n = 1600 from N(θ∗ = n−1/4, 1), and compute the values of θˆ = X¯n and θ˜n. Repeat it N = 1000 times so that you have N many pairs of these values. Compare their simulated total MSE: N∑ k=1 (θˆ − θ∗)2; N∑ k=1 (θ˜ − θ∗)2. 2. Let X1, X2, ..., X2n+1 be an i.i.d. 
random sample from Cauchy distribu- tion with location parameter θ, whose density function is given by f(x; θ) = 1 pi{1 + (x− θ)2} . The sample median is given by the order statistic x(n+1). (a) Show that the sample median satisfies P (n1/4|x(n+1) − θ| ≥ )→ 0 for any > 0 as n→∞. Remark: directly proving this result is challenging for inexperienced. Proving it by directly quoting an existing result is not satisfactory. (b) Derive the explicit expression of the Newton-Raphson iteration for Cauchy distribution. (c) Simulation N = 1000 times with 2n + 1 = 201, θ = 0 and obtain total MSEs in the same way as the last example. Clearly present your results. (d) Plot the histogram of the 1000 X(n+1). Do the same for the one-step Newtwon-Raphson estimator. (e) Do these histograms support our asymptotic results on MLE and on median? 7.5. ASSIGNMENT PROBLEMS 87 3. Derive the EM-iteration formulas for data from two component Bino- mial mixture model: f(x;G) = ( m x ) {piθx1(1− θ1)m−x + (1− pi)θx2(1− θ2)m−x} under the setting of n i.i.d. observations with m ≥ 3. (Sizes n and m are not relevant in these formulas) Be sure to have E-step and M-step clearly presented together with the corresponding Q function. 88 CHAPTER 7. PROPERTIES OF MLE Chapter 8 Analysis of regression models In this chapter, we investigate the estimation problems when data are pro- vided in the form (yi; xi) : i = 1, 2, . . . , n. (8.1) The range of y is R and the range of x is Rp. We call them response vari- able and explanatory variables (sometimes covariates). In many applications, such data are collected because the users believe a large proportion of the variability in y from independent trials can be explained away from the vari- ations in x. Often, we feel that they are linked via a regression relationship with additive error: yi = g(xi;θ) + σi (8.2) such that the error terms i are uncorrelated with mean 0 and variance 1. 
In the current form, the expression hints that an analytical form of $g(x; \theta)$ is specified. All that is left to statisticians is to decide the most "appropriate" value of $\theta$ for the specific occasion. The distributional information about $\epsilon$ may or may not be specified, depending on the circumstances. Factoring $\sigma$ out of the error term may not always be most convenient for statistical discussion; we may choose to replace $\sigma\epsilon_i$ by $\epsilon_i$ while allowing $\epsilon$ to have a variance different from 1.

The observations on the explanatory variables, $x_i$, are either regarded as chosen by scientists (users) so that their values are not random, or they are independent samples from some population whose distribution is related to neither $g(\cdot)$ nor $\theta$. In addition, they are independent of $\epsilon$. The appropriateness of a regression model in specific applications will not be discussed in this course. We continue our discussion under the assumption that all premises of (8.2) are solid.

It is generally convenient to use matrix notation here. We define and denote the covariate matrix as
$$X_n = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^\tau \\ x_2^\tau \\ \vdots \\ x_n^\tau \end{pmatrix} = (X_1, X_2, \ldots, X_p).$$
We define the design matrix as $Z_n = (\mathbf{1}, X_1, X_2, \ldots, X_p)$, which is the covariate matrix supplemented by a column vector of ones. We also use boldfaced $y$ and $\epsilon$ for the column vectors of length $n$ of response values and error terms. When necessary, we write $y_n, X_n$ with subindex $n$ to highlight the sample size $n$. Be cautious that $X_3$ stands for the column vector of the third explanatory variable, not the covariate matrix when $n = 3$. We trust that such abuses will not cause much confusion, though they are not mathematically rigorous.

8.1 Least absolute deviation and least squares estimators

Suppose we are given a data set in the form of (8.1) and we are asked to use the data to fit model (8.2). Let us look into the problem of how best to estimate $\theta$ and $\sigma$.
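The notation above can be made concrete with a few lines of code. The sketch below is our own illustration (assuming NumPy; all names are ours): it builds a covariate matrix $X_n$, the design matrix $Z_n$, and simulates responses from model (8.2) with a linear $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))           # covariate matrix X_n; row i is x_i^T
Z = np.column_stack([np.ones(n), X])  # design matrix Z_n = (1, X_1, ..., X_p)

beta = np.array([1.0, 2.0, -0.5])     # (beta_0, beta_1, beta_2)
sigma = 0.3
y = Z @ beta + sigma * rng.normal(size=n)  # y = Z beta + sigma * eps

print(Z.shape)  # → (100, 3)
```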
We do not discuss issues such as the fitness of the function $g(\cdot)$ or the distribution of $\epsilon$.

There are many potential approaches to estimating $\theta$. One is to select the $\theta$ value such that the average difference between $y_i$ and $g(x_i; \theta)$ is minimized. To implement this idea, one may come up with many potential distances. The absolute difference is one of the favourites. With this choice, we define
$$M_n(\theta) = \sum_{i=1}^n|y_i - g(x_i; \theta)|$$
and find the corresponding M-estimator of $\theta$. This estimator is generally called the least absolute deviation estimator. A disadvantage of this approach is the inconvenience of working with the absolute value function, both analytically and numerically. A more convenient choice is
$$M_n(\theta) = \sum_{i=1}^n\{y_i - g(x_i; \theta)\}^2.$$
The resulting estimator is called the least squares estimator.

We may place a parametric distributional assumption on $\epsilon$. If $\epsilon$ has the standard normal $N(0, 1)$ distribution, then the MLE of $\theta$ equals the least squares estimator. If $\epsilon$ has the double exponential distribution with density function
$$f(u) = \tfrac{1}{2}\exp\{-|u|\},$$
then the least absolute deviation estimator is also the MLE under this model.

8.2 Linear regression model

The linear regression model is a special signal-plus-error model. In this case, the regression function $E(Y|X = x)$ has a specific form:
$$E(Y|X = x) = g(x; \theta) = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p.$$
We can write it in vector form with $z^\tau = (1, x^\tau)$ as
$$g(x; \theta) = z^\tau\beta \tag{8.3}$$
which is linear in the regression coefficient $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\tau$. While we generally prefer to include $\beta_0$ in most applications, this is not a mathematical necessity. In some applications, scientific principle may demand a model with $\beta_0 = 0$. Luckily, even though the subsequent developments are based on $z$, which implies $\beta_0$ is part of the model, most theoretical results remain valid when $z$ is reduced to $x$ so that $\beta_0 = 0$ is enforced. We will not rewrite the same results twice for this reason.
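The two M-functions of Section 8.1 can be compared on simulated data from the linear model just introduced. The sketch below is our own illustration (assuming NumPy; the iteratively reweighted least squares loop used to approximate the LAD fit is a standard numerical device, not from the notes). One response is contaminated with a gross error to show the contrast.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0.0, 3.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, n)   # true (beta0, beta1) = (1, 2)
y[0] += 40.0                                   # one gross outlier

Z = np.column_stack([np.ones(n), x])

# Least squares: explicit solution.
beta_ls = np.linalg.lstsq(Z, y, rcond=None)[0]

# Least absolute deviation: approximated by iteratively reweighted least
# squares, since M_n(theta) is not differentiable at zero residuals.
beta_lad = beta_ls.copy()
for _ in range(200):
    w = 1.0 / np.maximum(np.abs(y - Z @ beta_lad), 1e-6)
    ZW = Z * w[:, None]
    beta_lad = np.linalg.solve(ZW.T @ Z, ZW.T @ y)

print("LS :", beta_ls)   # pulled toward the outlier
print("LAD:", beta_lad)  # close to (1, 2)
```

The run illustrates the robustness that motivates the absolute-deviation distance: the outlier drags the least squares fit but barely moves the LAD fit.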
We have boldfaced two terms without formally defining them. It is worth emphasizing here that the model is linear not because the regression function $g(x; \theta)$ is linear in $x$, but because it is linear in $\theta$, which is denoted $\beta$ here. In applications, we may use $x_1$ for some explanatory variable such as dosage and include $x_2 = \log(x_1)$ as another explanatory variable in the linear model. In this case, the linear regression model has a regression function $g(x; \theta)$ that is not linear in $x_1$.

Suppose we have $n$ independent observations from regression model (8.2) with linear regression function (8.3). One way to estimate the regression coefficient vector is by least squares. The M-function now has the form
$$M_n(\beta) = (y_n - Z_n\beta)^\tau(y_n - Z_n\beta) = \sum_{i=1}^n(y_i - z_i^\tau\beta)^2. \tag{8.4}$$
For the linear regression model, there is an explicit solution to the least squares problem in neat matrix notation.

Theorem 8.1. Suppose $(y_i, x_i)$ are observations from linear regression model (8.2) with $g(x; \theta)$ given by (8.3). The solution to the least squares problem defined by (8.4) is given by
$$\hat\beta_n = (Z_n^\tau Z_n)^{-1}Z_n^\tau y_n \tag{8.5}$$
if $Z_n^\tau Z_n$ has full rank. If $Z_n^\tau Z_n$ does not have full rank, one solution to the least squares problem is given by
$$\hat\beta_n = (Z_n^\tau Z_n)^-Z_n^\tau y_n$$
where $A^-$ denotes a specific generalized inverse.

Remark: the statement hints that if $Z_n^\tau Z_n$ does not have full rank, the solution is not unique. However, we will not discuss this in detail.

Proof. We only give a proof for the case where $Z_n^\tau Z_n$ has full rank. It is seen that
$$M_n(\beta) = \{(y_n - Z_n\hat\beta) + Z_n(\hat\beta - \beta)\}^\tau\{(y_n - Z_n\hat\beta) + Z_n(\hat\beta - \beta)\} = (y_n - Z_n\hat\beta)^\tau(y_n - Z_n\hat\beta) + (\hat\beta - \beta)^\tau(Z_n^\tau Z_n)(\hat\beta - \beta) \ge (y_n - Z_n\hat\beta)^\tau(y_n - Z_n\hat\beta),$$
where the cross terms vanish because $Z_n^\tau(y_n - Z_n\hat\beta) = 0$. The lower bound implied by the above inequality is attained when $\beta = \hat\beta$. Hence, $\hat\beta$ is the solution to the least squares problem.

Let $\hat\beta_n$ be the least squares estimator of $\beta$ and let $\beta$ be the true value of the parameter, without giving it a special notation. We find
$$E\{\hat\beta_n|X_n\} = (Z_n^\tau Z_n)^{-1}Z_n^\tau\{Z_n\beta\} = \beta.$$
Hence, $\hat\beta_n$ is an unbiased estimator of the regression coefficient vector. Notice that this conclusion is obtained under the assumption that $x$ and $\epsilon$ are independent. Also notice that we assumed $\epsilon$ has mean zero and constant variance, but placed no assumption on its distribution. Writing the additive error terms in the form $\sigma\epsilon_n$, we have
$$\hat\beta_n - \beta = \sigma(Z_n^\tau Z_n)^{-1}Z_n^\tau\epsilon_n.$$
Hence,
$$\mathrm{var}(\hat\beta_n) = \sigma^2(Z_n^\tau Z_n)^{-1}.$$
Because we made a distinction between the covariate matrix $X_n$ and the design matrix $Z_n$, the above expression may appear different from those in standard textbooks.

With $\beta$ estimated by $\hat\beta$, it is natural to regard
$$\hat y_n = Z_n\hat\beta_n = H_ny_n$$
as the estimated value of $y_n$, where the hat matrix $H_n = Z_n(Z_n^\tau Z_n)^{-1}Z_n^\tau$. In fact, we call $\hat y_n$ the fitted value(s). How closely does $\hat y_n$ match $y_n$? The residual of the fit is given by
$$\hat\epsilon_n = (I_n - H_n)y_n = \sigma(I_n - H_n)\epsilon_n.$$
One can easily verify that $H_n$ and $I_n - H_n$ are symmetric and idempotent, and $(I_n - H_n)Z_n = 0$. From a geometric angle, $H_n$ is a projection matrix: the operation $H_ny_n$ projects $y_n$ onto the linear space spanned by the columns of $Z_n$. Naturally, $(I_n - H_n)y_n$ is the projection of $y_n$ onto the linear space orthogonal to that of $Z_n$. This leads to a decomposition of the sum of squares:
$$y_n^\tau y_n = y_n^\tau H_ny_n + y_n^\tau(I_n - H_n)y_n.$$
The second term is the "residual sum of squares". It is an easy exercise to prove that $y_n^\tau(I_n - H_n)y_n = \hat\epsilon_n^\tau\hat\epsilon_n$.

We directly verified that $\hat\beta$ solves the least squares problem. One may also derive this result by searching for solutions to
$$\frac{\partial M_n(\beta)}{\partial\beta} = 0.$$
This leads to the normal equation
$$Z_n^\tau\{y_n - Z_n\beta\} = 0.$$
We again leave it as an easy exercise.

We have seen that the least squares estimator $\hat\beta_n$ has a few neat properties. Yet we cannot help but ask: can we find other, superior estimators? The answer is no, at least in one respect: the least squares estimator has the lowest variance among all unbiased linear estimators of $\beta$. A linear estimator is defined as one that can be written as a linear combination of the $y_i$.
That is, it must be expressible in the form $Ay_n$ for some matrix $A$ not dependent on $y_n$.

Theorem 8.2 (Gauss–Markov Theorem). Let $\hat\beta_n$ be the least squares estimator and let $\tilde\beta_n = Ay_n$, for some nonrandom matrix $A$ (which may depend on $X_n$), be an unbiased linear estimator of $\beta$ under the linear regression model with $n$ independent observations. Then
$$\mathrm{var}(\tilde\beta) - \mathrm{var}(\hat\beta) \ge 0.$$

Proof. Suppose $Ay_n$ is unbiased for $\beta$. We must have $E(Ay_n) = AZ_n\beta = \beta$ for any $\beta$. Hence, we must have $AZ_n = I_{p+1}$. This implies
$$\mathrm{var}(\tilde\beta - \hat\beta) = \sigma^2\{A - (Z_n^\tau Z_n)^{-1}Z_n^\tau\}\{A^\tau - Z_n(Z_n^\tau Z_n)^{-1}\} = \mathrm{var}(\tilde\beta) - \mathrm{var}(\hat\beta),$$
where the cross terms reduce to $\mathrm{var}(\hat\beta)$ terms because $AZ_n = I_{p+1}$. The variance matrix of any random vector is non-negative definite. Hence, we must have $\mathrm{var}(\tilde\beta) - \mathrm{var}(\hat\beta) \ge 0$.

An estimator which is linear in the data and unbiased for the target parameter is called the best linear unbiased estimator (BLUE) if it has the lowest possible variance matrix. Not only is the least squares estimator $\hat\beta$ BLUE for $\beta$, but $b^\tau\hat\beta$ is BLUE for $b^\tau\beta$ for any non-random vector $b$. At the same time, be aware that if we have additional information about the distribution of $\epsilon_n$ in the linear model, then we may obtain a more efficient estimator of $\beta$, but that estimator is either not linear or not unbiased.

8.3 Local kernel polynomial method

Naturally, a linear regression model is not always appropriate in applications, but we may still believe a signal-plus-noise relationship is sound. In this section, we consider the situation where the regression function $g(x)$ is smooth in $x$, but we are unwilling to place more restrictions on it. At the same time, we only study the simple situation where $x$ is a univariate covariate.

Suppose we wish to estimate $g(x)$ at some specific value $x^*$. By definition, $g(x^*) = E(Y|X = x^*)$. Suppose that among the $n$ observations $\{(y_i, x_i)\}$, $i = 1, \ldots, n$, we collected, there are many $x_i$ values such that $x_i = x^*$. The average of their corresponding $y_i$ would be a good estimate of $g(x^*)$. In reality, there may not be any $x_i$ equalling $x^*$ exactly.
Hence, this idea does not work as stated. On the other hand, when $n$ is very large, there might be many $x_i$ which are very close to $x^*$. Hence, the average of their corresponding $y_i$ should be a sensible estimate of $g(x^*)$. To make use of this idea, one must decide how close is close enough. Even within the small neighbourhood, should we merely use a constant, rather than some other smooth function of $x$, to approximate $g(x)$?

For any $u$ close enough to $x$ (rather than $x^*$, for notational simplicity) and some positive integer $p$, when $g(x)$ is sufficiently smooth at $x$, we have
$$g(u) \approx g(x) + g'(x)(u - x) + \cdots + (1/p!)g^{(p)}(x)(u - x)^p.$$
Let $\beta_0 = g(x)$, $\beta_1 = g'(x)$, ..., $\beta_p = (1/p!)g^{(p)}(x)$. Then the approximation can be written as
$$g(u) \approx \beta_0 + \beta_1(u - x) + \cdots + \beta_p(u - x)^p.$$
Note that at $u = x$, we have $g(x) \approx \beta_0$.

Suppose that for some $h > 0$, $g(u)$ perfectly coincides with the above polynomial function for $u \in [x - h, x + h]$. If so, within this region, we have a linear regression model with regression coefficient $\beta_x$. A natural approach to estimating this local $\beta_x$ is least squares:
$$\hat\beta_x = \arg\min_\beta\sum_{i=1}^n 1(|x_i - x| \le h)\{y_i - z_i^\tau\beta\}^2$$
where $z_i = \{1, (x_i - x), (x_i - x)^2, \ldots, (x_i - x)^p\}^\tau$. Note again that $z_i$ is defined depending on the $x$-value, the location at which $g(x)$ is being estimated. Note that we have added a subindex $x$ to $\beta$. This is helpful because this vector is specific to the regression function $g(u)$ at $u = x$. When we change the target from $u = x_1$ to $u = x_2 \ne x_1$, we must refit the data and obtain the $\beta$ specific to $u = x_2$. We repeatedly state this to emphasize the local nature of the current approach.

The above formulation implies that the $i$th observation will be excluded even if $|x_i - x|$ is only slightly larger than $h$. At the same time, all observations with $|x_i - x| \le h$ are treated equally. This does not seem right to our intuition.
One way to avoid this problem is to replace the indicator function by a general kernel function $K(x)$, often selected to satisfy the following properties:

1. $K(x) \ge 0$;
2. $\int_{-\infty}^\infty K(x)\,dx = 1$;
3. $K(x) = K(-x)$; that is, $K(x)$ is a symmetric function.

For instance, the density function $\phi(x)$ of $N(0, 1)$ has these properties. In fact, any symmetric density function does. Let $K_h(x) = h^{-1}K(x/h)$. We now define the local polynomial kernel estimator of $\beta_x$ as
$$\hat\beta_x = \arg\min_\beta\sum_{i=1}^nK_h(x_i - x)\{y_i - z_i^\tau\beta\}^2.$$
An explicit solution to the above optimization problem is readily available using matrix notation. Let $y_n$ be the response vector, and define the design matrix
$$Z_x = \begin{pmatrix} 1 & x_1 - x & \cdots & (x_1 - x)^p \\ \vdots & \vdots & & \vdots \\ 1 & x_n - x & \cdots & (x_n - x)^p \end{pmatrix}$$
and the weight matrix
$$W_x = \mathrm{diag}\{K_h(x_1 - x), K_h(x_2 - x), \ldots, K_h(x_n - x)\}.$$
The M-function can then be written as
$$M_n(\beta) = (y - Z_x\beta)^\tau W_x(y - Z_x\beta).$$
It is an easy exercise to show that the solution is given by
$$\hat\beta_x = (Z_x^\tau W_xZ_x)^{-1}Z_x^\tau W_xy_n.$$
Let $e_j$ be a $(p + 1) \times 1$ vector whose $j$th element is 1 and all other elements are 0, $j = 1, \ldots, p + 1$. Then we estimate $g(x)$ by
$$\hat g(x) = \hat\beta_0 = e_1^\tau(Z_x^\tau W_xZ_x)^{-1}Z_x^\tau W_xy_n$$
where $\hat\beta_0$ is the first element of $\hat\beta_x$.

Remark: notationally, the above local kernel polynomial estimator remains the same for any choice of $p$.

Suppose $g(x)$ is differentiable up to order $p$. Then, for $k = 1, \ldots, p$, we estimate the $k$th derivative $g^{(k)}(x)$ by
$$\hat g^{(k)}(x) = k!\,\hat\beta_k = k!\,e_{k+1}^\tau(Z_x^\tau W_xZ_x)^{-1}Z_x^\tau W_xy_n.$$
When we decide to use $p = 0$ in this approach, the estimator $\hat g(x)$ becomes
$$\hat g(x) = \frac{\sum_{i=1}^nK_h(x_i - x)y_i}{\sum_{i=1}^nK_h(x_i - x)},$$
which is known as the local constant kernel estimator, the kernel regression estimator, and the Nadaraya–Watson estimator. This estimator can be motivated by regarding $g(u)$ as a constant function in a small neighbourhood of $x$: $u \in [x - h, x + h]$ for some sufficiently small $h$. The estimator is a weighted average of the response values whose $x$-values are within a small neighbourhood of $x$.
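The closed form $\hat\beta_x = (Z_x^\tau W_xZ_x)^{-1}Z_x^\tau W_xy_n$ is short to code. The sketch below is our own illustration (assuming NumPy and the normal kernel; all names are ours): it estimates $g$ at a single point, and $p = 0$ reproduces the Nadaraya–Watson estimator.

```python
import numpy as np

def local_poly_fit(x0, x, y, h, p=1):
    """Local polynomial kernel estimate of g(x0): the first entry of
    (Z_x^T W_x Z_x)^{-1} Z_x^T W_x y, with a normal kernel K."""
    Z = np.vander(x - x0, N=p + 1, increasing=True)  # columns 1, (x_i-x0), ..., (x_i-x0)^p
    w = np.exp(-0.5 * ((x - x0) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))  # K_h(x_i - x0)
    ZW = Z * w[:, None]
    beta = np.linalg.solve(ZW.T @ Z, ZW.T @ y)
    return beta[0]  # g_hat(x0); more generally, k! * beta[k] estimates g^(k)(x0)

rng = np.random.default_rng(7)
x = rng.uniform(0.0, np.pi, 400)
y = np.sin(x) + rng.normal(0.0, 0.1, 400)

print(local_poly_fit(1.0, x, y, h=0.15, p=0))  # Nadaraya-Watson estimate of sin(1)
print(local_poly_fit(1.0, x, y, h=0.15, p=1))  # local linear estimate of sin(1)
```

Both printed values should be near $\sin(1) \approx 0.841$; refitting at a different $x_0$ requires a fresh call, reflecting the local nature of the method.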
When we decide to use $p = 1$ in this approach, the estimator is called the local linear kernel estimator of $g(x)$.

Before this estimator is applied to any specific data, we must choose the kernel function $K$, the degree of the polynomial $p$, and the bandwidth $h$. We now go over these issues.

Choice of $K(x)$.

The choice of kernel function $K(x)$ is not crucial. Other than that it should have a few desired properties, its specific form does not markedly change the variance or bias of $\hat g(x)$. In our future examples, we will mostly use the normal density function. Clearly, the normal density function has the three listed properties.

Choice of $p$.

For a given bandwidth $h$ and kernel $K(x)$, a larger value of $p$ is expected to reduce the bias of the estimator, because the local approximation becomes more accurate as $p$ increases. At the same time, when $p$ is large, we have more parameters to estimate, as reflected in the dimension of $\beta$. Hence, the variance of the estimator will increase, and there will be a larger computational cost. Fan and Gijbels (1996) showed that when the degree of the polynomial employed increases from $p = k + 2q$ to $p = k + 2q + 1$ for estimating $g^{(k)}(x)$, the variance does not increase. However, if we increase the degree from $p = k + 2q + 1$ to $p = k + 2q + 2$, the variance increases. Therefore, for estimating $g^{(k)}(x)$, it is beneficial to use a degree $p$ such that $p - k$ is odd. Since the bandwidth $h$ also controls the bias-variance trade-off of $\hat g^{(k)}(x)$, they recommended the lowest odd order for $p - k$, namely $p = k + 1$, or occasionally $p = k + 3$. For the regression function itself, they recommended the local linear kernel estimator (i.e., $p = 1$) instead of the Nadaraya–Watson estimator (i.e., $p = 0$).

To better understand the above information, we summarize some theoretical results about the local linear kernel estimator and the Nadaraya–Watson estimator here. Let them be denoted $\hat g_{ll}(x)$ and $\hat g_{nw}(x)$, respectively.
We have
$$\hat g_{nw}(x) = \frac{\sum_{i=1}^nK_h(x_i - x)y_i}{\sum_{i=1}^nK_h(x_i - x)};$$
$$\hat g_{ll}(x) = \hat\beta_0 = \arg\min_{\beta_0}\Big\{\min_{\beta_1}\sum_{i=1}^nK_h(x_i - x)\{y_i - \beta_0 - \beta_1(x_i - x)\}^2\Big\}.$$
Under the regression model assumption that $y_i = g(x_i) + \sigma\epsilon_i$, for random $x_i$ with density function $f(x)$, and under many conditions regulating $f(x)$, $g(x)$ and the distribution of $\epsilon$, we have
$$E\{\hat g_{nw}(x)|x\} \approx g(x) + 0.5h^2\mu_2(K)\Big\{g''(x) + \frac{2f'(x)g'(x)}{f(x)}\Big\};$$
$$E\{\hat g_{ll}(x)|x\} \approx g(x) + 0.5h^2g''(x)\mu_2(K);$$
$$\mathrm{var}\{\hat g_{nw}(x)|x\} \approx \frac{\sigma^2}{nhf(x)}R(K); \qquad \mathrm{var}\{\hat g_{ll}(x)|x\} \approx \frac{\sigma^2}{nhf(x)}R(K),$$
where $\mu_2(K)$ and $R(K)$ are positive constants depending on the kernel function $K$.

The above results show that the local linear kernel estimator $\hat g_{ll}(x)$ and the Nadaraya–Watson estimator $\hat g_{nw}(x)$ have the same asymptotic variance conditional on $x$, which is the conclusion we discussed before. The asymptotic bias of $\hat g_{nw}(x)$ has the extra term $h^2\mu_2(K)f'(x)g'(x)/f(x)$. The coefficient $2f'(x)g'(x)/f(x)$ is also called the design bias because it depends on the design, namely the distribution of $x$. This implies that the bias is sensitive to the positions of the design points $x_i$. Note that $f'(x)/f(x)$ can have a high influence on the bias when $x$ is close to the boundary. For example, when the design points $x_i$ have a standard normal distribution, $|f'(x)/f(x)| = |x|$, which is very large when $x$ approaches $\infty$. Hence $2f'(x)g'(x)/f(x)$ is also known as the boundary bias. These two biases are reduced by using the local linear kernel estimator. In summary, the local linear kernel estimator is free from the design and boundary biases, but the Nadaraya–Watson estimator is not.

Choice of bandwidth $h$.

Suppose we have chosen the kernel function $K(x)$ and $p$. We now discuss the choice of the bandwidth $h$. The bandwidth plays a very important role in estimating the regression function $g(x)$. First, as $h$ increases, the local approximation becomes worse and worse, and hence the bias of the local polynomial kernel estimator increases.
On the other hand, more and more observations are included in estimating $g(x)$, and hence the variance of the local polynomial kernel estimator decreases. A good choice of bandwidth helps to balance the bias and the variance. Second, as $h$ increases, the local polynomial kernel estimate becomes smoother and smoother. This can be observed in Figure 8.1, in which we compare the Nadaraya–Watson estimates of $g(x)$ constructed when the bandwidth $h$ takes the three values 0.1, 1, and 4. Conceptually, the number of parameters required to describe the curve decreases. In this sense, $h$ controls the model complexity. We should choose a bandwidth that balances the model fitting and the model complexity.

[Figure 8.1: Motorcycle data: Nadaraya–Watson estimates of $g(x)$ with the normal kernel and bandwidths 0.1, 1, and 4. Horizontal axis: time in milliseconds after impact; vertical axis: acceleration (in g).]

We introduce two bandwidth selection methods here: leave-one-out cross-validation (cv) and generalized cross-validation (gcv). These two methods are also widely used in other regression problems.

The idea of leave-one-out cv is as follows. Recall that one purpose of fitting a regression model is to predict the response value in a new trial. So a reasonable choice of $h$ should result in a small prediction error. Unfortunately, we do not observe the response at a new trial, and therefore we cannot know how good the prediction $\hat g(x)$ given $h$ is. The idea of cross-validation is to first delete one observation from the data set, treat the remaining $n - 1$ observations as the training data set and the deleted observation as the test data, and then assess the goodness of the prediction for the test observation based on the training data set. We repeat the process for all observations and obtain the prediction errors for all observations. We choose $h$ by minimizing the sum of the prediction errors.
Mathematically, let $\hat g_{-i}(x_i)$ be the estimate of $g(x_i)$ based on the $n - 1$ observations without the $i$th. For a given $h$, the cv score is defined as
$$cv(h) = \sum_{i=1}^n\{y_i - \hat g_{-i}(x_i)\}^2.$$
The optimal $h$ based on the leave-one-out cross-validation idea is
$$h_{cv} = \arg\min_h cv(h).$$
It might seem time-consuming to evaluate $cv(h)$, since we apparently need to recompute the estimate after dropping each observation. Fortunately, there is a shortcut formula for computing $cv(h)$. Let
$$l(x) = (l_1(x), \ldots, l_n(x)) = e_1^\tau(Z_x^\tau W_xZ_x)^{-1}Z_x^\tau W_x.$$
Then
$$\hat g(x) = \sum_{j=1}^nl_j(x)y_j \quad\text{and}\quad \hat g(x_i) = \sum_{j=1}^nl_j(x_i)y_j.$$
Define the fitted value vector $\hat y = (\hat y_1, \ldots, \hat y_n)^\tau = (\hat g(x_1), \ldots, \hat g(x_n))^\tau$. It then follows that $\hat y = Ly$, where $L$ is an $n \times n$ matrix whose $i$th row is $l(x_i)$; thus $L_{ij} = l_j(x_i)$ and $L_{ii} = l_i(x_i)$. It can be shown that
$$cv(h) = \sum_{i=1}^n\Big\{\frac{y_i - \hat g(x_i)}{1 - L_{ii}}\Big\}^2.$$
We can minimize the above $cv(h)$ to get $h_{cv}$.

The second method for choosing $h$ is called generalized cross-validation. Rather than minimizing $cv(h)$, an alternative is to use an approximation called the generalized cross-validation (gcv) score, in which each $L_{ii}$ is replaced with its average $v/n$, where $v = \mathrm{tr}(L) = \sum_{i=1}^nL_{ii}$ is called the effective degrees of freedom. Thus, we would minimize the gcv score
$$gcv(h) = \sum_{i=1}^n\Big\{\frac{y_i - \hat g(x_i)}{1 - v/n}\Big\}^2$$
to obtain the bandwidth $h_{gcv}$. That is, $h_{gcv} = \arg\min_h gcv(h)$. Usually $h_{cv}$ is quite close to $h_{gcv}$.

In Appendix I, we include the R function bw.cv() to choose the bandwidth for the local polynomial kernel estimate for a continuous response. The source code is saved in bw_cv.R. In this function, if the option cv=T, then the cv method is used; if the option cv=F, then the gcv method is used. The R function regCVBwSelC() in the R package locpol can also be used to obtain $h_{cv}$ for a continuous response. It gives the same result as bw.cv() with cv=T, and it is much faster.
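For a linear smoother $\hat y = Ly$, the cv and gcv scores are a few lines of code. The sketch below is our own illustration (assuming NumPy; it uses the $p = 0$ Nadaraya–Watson smoother, whose rows of $L$ are the normalized kernel weights, rather than the general local polynomial $L$):

```python
import numpy as np

def smoother_matrix(x, h):
    """Matrix L with y_hat = L y for the Nadaraya-Watson (p = 0) estimator
    with a normal kernel: L_ij = K_h(x_i - x_j) / sum_j K_h(x_i - x_j)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def cv_score(x, y, h):
    L = smoother_matrix(x, h)
    return np.sum(((y - L @ y) / (1.0 - np.diag(L))) ** 2)

def gcv_score(x, y, h):
    L = smoother_matrix(x, h)
    v = np.trace(L)  # effective degrees of freedom
    return np.sum(((y - L @ y) / (1.0 - v / len(y))) ** 2)

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0.0, 6.0, 150))
y = np.sin(x) + rng.normal(0.0, 0.3, 150)

hs = np.linspace(0.05, 1.0, 40)
h_cv = hs[np.argmin([cv_score(x, y, h) for h in hs])]
h_gcv = hs[np.argmin([gcv_score(x, y, h) for h in hs])]
print(h_cv, h_gcv)
```

For this smoother the shortcut is exact: deleting observation $i$ simply renormalizes the remaining kernel weights, so $y_i - \hat g_{-i}(x_i) = \{y_i - \hat g(x_i)\}/(1 - L_{ii})$.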
Figure 8.2 gives $cv(h)$ and $gcv(h)$ for $p = 0, 1$. Here the normal kernel is used. (Remark by your instructor: these programs are not included.)

Similar to kernel density estimation, Wand and Jones (1995) applied the idea of direct plug-in methods to bandwidth selection for the local linear kernel estimate. This idea is implemented in the R function dpill() in the package KernSmooth. I did not cover this idea because it is only applicable to the local linear kernel estimate. Further, it is more complicated to implement compared with the cv and gcv methods.

[Figure 8.2: Motorcycle data: cv(h) and gcv(h) for p = 0, 1 with the normal kernel.]

Applying the above-mentioned R functions: for $p = 0$, $h_{cv} = 0.914$ and $h_{gcv} = 1.089$; for $p = 1$, $h_{cv} = 1.476$, $h_{gcv} = 1.570$, and the direct plug-in method gives $h_{DPI} = 1.445$. Figure 8.3 gives the fitted curves of $g(x)$ for $p = 0, 1$, in which the bandwidth is selected by cv or gcv. Here the normal kernel is used. The two curves for $p = 0$ are almost the same. The fitted curves for $p = 1$ with the bandwidths $h_{cv}$, $h_{gcv}$, and $h_{DPI}$ are almost the same; hence we only plot the curves with the bandwidths selected by cv and gcv. The four fitted curves are very close to each other and do not show much difference when plotted in the same panel.

Properties of $\hat g(x)$.

Let $h$ be given. We have $E\{\hat g(x)|x\} \approx g(x)$ and
$$\mathrm{var}\{\hat g(x)|x\} = \sigma^2e_1^\tau(Z_x^\tau W_xZ_x)^{-1}(Z_x^\tau W_x^2Z_x)(Z_x^\tau W_xZ_x)^{-1}e_1.$$
Therefore the standard error is given by
$$se\{\hat g(x)\} = \sqrt{\hat\sigma^2e_1^\tau(Z_x^\tau W_xZ_x)^{-1}(Z_x^\tau W_x^2Z_x)(Z_x^\tau W_xZ_x)^{-1}e_1},$$
where $\hat\sigma^2$ is an estimator of $\sigma^2$. Wand and Jones (1995) suggested the following form for $\hat\sigma^2$:
$$\hat\sigma^2 = \frac{\sum_{i=1}^n\{y_i - \hat g(x_i)\}^2}{n - 2v + \tilde v}$$
with
$$v = \mathrm{tr}(L) = \sum_{i=1}^nL_{ii}, \qquad \tilde v = \mathrm{tr}(L^\tau L) = \sum_{i=1}^n\sum_{j=1}^nL_{ij}^2.$$

[Figure 8.3: Motorcycle data: fitted curves for p = 0, 1 with the normal kernel, in which the bandwidth is selected by cv or gcv. Horizontal axis: time in milliseconds after impact; vertical axis: acceleration (in g).]

8.4 Spline method

Let us again go back to model (8.2), but do not assume a parametric regression function $g(x; \theta)$. Instead, we only postulate that $E(Y|X = x) = g(x)$ for some smooth function $g(\cdot)$. Suppose we try to estimate $g(\cdot)$ by the simplistic least squares approach without careful deliberation. The estimate would be the solution to the problem of minimizing
$$\sum_{i=1}^n\{y_i - g(x_i)\}^2.$$
If all $x_i$ values are different, the solution is given by any function $\hat g$ such that $\hat g(x_i) = y_i$. Such a perfect fit clearly does not have any prediction power for a new observation whose covariate value does not equal one of the existing covariate values. Furthermore, if $\hat g(x)$ just connects all the points formed by the observations, it lacks the smoothness we may expect. If we require $g(x)$ to be a linear function of $x$, then it is a very smooth function, but the fit is unsatisfactory if $E(Y|X = x)$ is far from linear in $x$.

One way to balance the need for smoothness and fit is to use a smoothing spline. Among all functions with two continuous derivatives, let us find the one that minimizes the penalized $L_2$-loss function:
$$\hat g_\lambda(x) = \arg\min_{g(x)}\Big[\sum_{i=1}^n\{y_i - g(x_i)\}^2 + \lambda\int\{g''(x)\}^2dx\Big], \tag{8.6}$$
for some positive tuning parameter $\lambda$, which is called the smoothing parameter.
In the penalized $L_2$-loss function, the first term measures the goodness of the model fit, while the second term penalizes the curvature of the function. We will remain vague about the range of $x$. When we use $\lambda = 0$, $\hat g_\lambda(x)$ becomes the ordinary least squares estimator; the solution is not unique and has little prediction power. When we use $\lambda = \infty$, the optimal solution must satisfy $g''(x) = 0$ for all $x$, so the solution must be linear in $x$; we are back to the linear regression model and the associated least squares estimator.

Clearly, a good fit is possible by choosing a $\lambda$ value between 0 and $\infty$ to get a smooth function with a reasonable fit. Note that the above minimization is taken over all possible functions $g(x)$, and such functions form an infinite-dimensional space. Remarkably, it can be shown that the solution $\hat g_\lambda(x)$ to the penalized least squares problem is a natural cubic spline with knots at the unique values of $\{x_i\}_{i=1}^n$. Here we consider the case where $x$ is one-dimensional.

8.5 Cubic spline

We now give a brief introduction to the cubic spline. A cubic spline is a piecewise cubic polynomial: we partition the real line into a finite number of intervals, and on each interval the function is a polynomial in $x$ of degree 3, with the pieces joined so that the function has continuous first and second derivatives. More precisely, suppose $-\infty = t_0 < t_1 < t_2 < \cdots < t_k < t_{k+1} = \infty$ with $t_1, \ldots, t_k$ being $k$ distinct real values. Then $s(x)$ is a cubic spline if

1. it is a cubic function on each interval $[t_i, t_{i+1}]$:
$$s_i(x) = a_i + b_ix + c_ix^2 + d_ix^3, \qquad s(x) = \sum_{i=0}^ks_i(x)1(t_i < x \le t_{i+1});$$

2. $s(x)$ and its first and second derivatives are continuous:
$$s_i(t_{i+1}) = s_{i+1}(t_{i+1}), \quad s_i'(t_{i+1}) = s_{i+1}'(t_{i+1}), \quad s_i''(t_{i+1}) = s_{i+1}''(t_{i+1}).$$

The connection values $t_1, \ldots, t_k$ are called the knots of the cubic spline. In particular, $t_1$ and $t_k$ are called the boundary knots, and $t_2, \ldots, t_{k-1}$ are called the interior knots. Furthermore, if
3. $s(x)$ is linear outside the interval $[t_1, t_k]$; that is,
$$s(x)1(x \le t_1) = (a_0 + b_0x)1(x \le t_1); \qquad s(x)1(x \ge t_k) = (a_k + b_kx)1(x \ge t_k)$$
for some $a_0, b_0, a_k, b_k$, then we call $s(x)$ a natural cubic spline with knots at $t_1, \ldots, t_k$. Note that this also means $c_0 = d_0 = c_k = d_k = 0$.

The following result shows that there is a simpler way to express a cubic spline.

Theorem 8.3. Any cubic spline $s(x)$ with knots at $\{t_1, \ldots, t_k\}$ can be written as
$$s(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \sum_{j=1}^k\beta_{j+3}(x - t_j)_+^3, \tag{8.7}$$
where $(x)_+ = \max(0, x)$, for some coefficients $\beta_0, \ldots, \beta_{k+3}$. In other words, the cubic splines form a linear space with basis functions $1, x, x^2, x^3, (x - t_1)_+^3, \ldots, (x - t_k)_+^3$.

Proof. The function defined by (8.7) is clearly a cubic function on every interval $[t_i, t_{i+1}]$. We can also easily verify that its first two derivatives are continuous. This shows that such functions are cubic splines. To prove the theorem, we must further show that every cubic spline with knots at $\{t_1, \ldots, t_k\}$ can be written in the form specified by (8.7).

Let $g(x)$ be a cubic spline with knots at $\{t_1, \ldots, t_k\}$. Denote $\gamma_i = g''(t_i)$ for $i = 1, 2, \ldots, k$. We show that there exists a function $s(x)$ in the form of (8.7) such that $\beta_3 = 0$, $\beta_{k+3} = 0$, and $s''(t_i) = \gamma_i$ for $i = 1, \ldots, k$. If such a function exists, the remaining $\beta$ values must satisfy
$$\beta_2/3 = \gamma_1/6;$$
$$\beta_2/3 + \beta_4(t_2 - t_1) = \gamma_2/6;$$
$$\beta_2/3 + \beta_4(t_3 - t_1) + \beta_5(t_3 - t_2) = \gamma_3/6;$$
$$\cdots$$
$$\beta_2/3 + \beta_4(t_{k-1} - t_1) + \cdots + \beta_{k+1}(t_{k-1} - t_{k-2}) = \gamma_{k-1}/6;$$
$$\beta_2/3 + \beta_4(t_k - t_1) + \cdots + \beta_{k+1}(t_k - t_{k-2}) + \beta_{k+2}(t_k - t_{k-1}) = \gamma_k/6.$$
Taking differences of consecutive equations, we find another set of equations whose solutions clearly exist:
$$\beta_4 = (1/6)(\gamma_2 - \gamma_1)/(t_2 - t_1);$$
$$\beta_4 + \beta_5 = (1/6)(\gamma_3 - \gamma_2)/(t_3 - t_2);$$
$$\beta_4 + \beta_5 + \beta_6 = (1/6)(\gamma_4 - \gamma_3)/(t_4 - t_3);$$
$$\cdots$$
$$\beta_4 + \beta_5 + \cdots + \beta_{k+2} = (1/6)(\gamma_k - \gamma_{k-1})/(t_k - t_{k-1}).$$
The solution $s(x)$ just obtained, with any choice of $\beta_0$ and $\beta_1$, has the same second derivative as the cubic spline $g(x)$ at $\{t_1, t_2, \ldots, t_k\}$. Now we can select $\beta_0$ and $\beta_1$ such that $s(t_1) = g(t_1)$ and $s'(t_1) = g'(t_1)$. Together with $s''(t_1) = g''(t_1)$ and $s''(t_2) = g''(t_2)$, and since both are cubic functions, we must have $s(x) = g(x)$ for all $x \in [t_1, t_2]$. Applying the same argument interval by interval, they must be identical over $[t_1, t_k]$. This proves the existence. As a remark, there can be multiple cubic splines identical on $[t_1, t_k]$ but different outside this interval.

Suppose
$$s(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \sum_{j=1}^k\beta_{j+3}(x - t_j)_+^3$$
is a natural cubic spline with knots $\{t_1, t_2, \ldots, t_k\}$. Since it is linear below $t_1$, we must have $\beta_2 = \beta_3 = 0$. At the same time, being linear beyond $t_k$ implies that the second derivative
$$s''(x) = 6\sum_{j=1}^k\beta_{j+3}(x - t_j) = 0$$
for all $x \ge t_k$. This is possible only if both
$$\sum_{j=1}^k\beta_{j+3} = 0 \quad\text{and}\quad \sum_{j=1}^kt_j\beta_{j+3} = 0.$$
In conclusion, out of the $k + 4$ entries of $\beta$, only $k$ are free for a natural cubic spline. For this reason, we need to think a bit about how to fit a natural cubic spline when data and knots are given. One approach is as follows. Define, for $j = 1, \ldots, k - 1$, the functions
$$d_j(x) = \frac{(x - t_j)_+^3 - (x - t_k)_+^3}{t_k - t_j}.$$
Further, let $N_1(x) = 1$, $N_2(x) = x$, and for $j = 3, \ldots, k$, let
$$N_j(x) = d_{j-1}(x) - d_1(x).$$
The following theorem says that every natural cubic spline is a linear combination of the $N_j(x)$.

Theorem 8.4. Let $t_1 < t_2 < \cdots < t_k$ be $k$ knots and let $\{N_1(x), \ldots, N_k(x)\}$ be the functions defined above. Then every natural cubic spline $s(x)$ with knots in $\{t_1, \ldots, t_k\}$ can be expressed as
$$s(x) = \sum_{j=1}^k\beta_jN_j(x)$$
for some coefficients $\beta_1, \ldots, \beta_k$.

Proof. Note that $(t_k - t_j)d_j(x) = (x - t_j)_+^3 - (x - t_k)_+^3$, or equivalently,
$$(x - t_j)_+^3 = (t_k - t_j)d_j(x) + (x - t_k)_+^3.$$
Substituting this expression into the generic form of a cubic spline, activating the constraints on the $\beta_j$ implied by a natural cubic spline, and using
$$\sum_{j=1}^k\beta_{j+3}(t_k - t_j)d_1(x) = d_1(x)\Big\{t_k\sum_{j=1}^k\beta_{j+3} - \sum_{j=1}^kt_j\beta_{j+3}\Big\} = 0,$$
we find
$$s(x) = \beta_0N_1(x) + \beta_1N_2(x) + \sum_{j=2}^{k-1}\beta_{j+3}(t_k - t_j)\{d_j(x) - d_1(x)\} = \beta_0N_1(x) + \beta_1N_2(x) + \sum_{j=2}^{k-1}\beta_{j+3}(t_k - t_j)N_{j+1}(x),$$
where the $j = 1$ term vanishes because $d_1 - d_1 = 0$ and the $j = k$ term vanishes because $t_k - t_k = 0$. The conclusion is therefore true.
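The basis of Theorem 8.4 translates directly into code. The sketch below is our own illustration (assuming NumPy; 0-based array indexing shifts the subscripts): it evaluates $N_1, \ldots, N_k$ and checks that every basis function is linear outside the boundary knots.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Evaluate the natural cubic spline basis N_1, ..., N_k of Theorem 8.4
    at the points x; returns a len(x) x k matrix."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)

    def d(j):  # d_j(x) = {(x - t_j)_+^3 - (x - t_k)_+^3} / (t_k - t_j), 0-based j
        return (np.maximum(x - t[j], 0.0) ** 3
                - np.maximum(x - t[-1], 0.0) ** 3) / (t[-1] - t[j])

    cols = [np.ones_like(x), x]                      # N_1 = 1, N_2 = x
    cols += [d(j) - d(0) for j in range(1, k - 1)]   # N_{j+2} = d_{j+1} - d_1
    return np.column_stack(cols)

# Check linearity beyond the last knot: second differences on an equally
# spaced grid out there should vanish.
knots = np.linspace(0.0, 1.0, 6)
B = natural_spline_basis(np.array([1.2, 1.3, 1.4]), knots)
print(np.max(np.abs(B[2] - 2.0 * B[1] + B[0])))  # essentially 0
```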
In general, a natural cubic spline can give a very good approximation to any smooth function on a finite interval. This makes it useful for fitting nonparametric signal-plus-noise regression models. Given data {yi; xi} and the k knots t1, . . . , tk, we may postulate that

g(x) ≈ Σ_{j=1}^{k} βj Nj(x).

For the ith observation, we have

g(xi) ≈ Σ_{j=1}^{k} βj Nj(xi),

which is now a linear combination of k derived covariates. Let y be the response vector, β the regression coefficient vector, and ε the error vector. Define the design matrix Z as the n × k matrix whose (i, j)th entry is Nj(xi). The approximate regression model becomes

y ≈ Zβ + ε.   (8.8)

We may use the least squares estimator of β given by βˆ = (ZτZ)⁻¹Zτy.

Let N(x) = {N1(x), . . . , Nk(x)}τ. Once βˆ is obtained, we estimate the regression function by gˆ(x) = Nτ(x)βˆ. Suppose (8.8) is in fact exact; then the properties of the least squares estimator are applicable. We summarize them as follows:

(a) E{βˆ} = β and E{gˆ(x)} = g(x);
(b) var(βˆ) = σ²(ZτZ)⁻¹;
(c) var{gˆ(x)} = σ²Nτ(x)(ZτZ)⁻¹N(x).

If (8.8) is merely approximate, then the above equalities are approximate. The approximation errors will not be discussed here.

The above idea is known as the regression spline, which is a large research topic in nonparametric regression. This approach is widely used in many applications to model a nonlinear and unknown function g(x). To apply this method, we must decide the number of knots k, and then choose the knots t1, . . . , tk.

8.6 Smoothing spline

The smoothing spline addresses the knot-selection problem of the regression spline by taking all distinct covariate values as knots. It uses the size of a penalty to determine the level of smoothness. Recall our claim that the solution of the smoothing spline problem (8.6) is a natural cubic spline with knots at all distinct values (t1 < · · · < tk) of {xi}, i = 1, . . . , n. This conclusion is implied by the following two claims.
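The least squares fit of the approximate model (8.8) can be sketched in a few lines. The sketch below is ours, not the text's: for simplicity it uses the generic truncated-power cubic basis of Theorem 8.3 rather than the Nj basis, and the knot positions, signal, and noise level are our own choices.

```python
import numpy as np

def cubic_spline_design(x, knots):
    """Columns 1, x, x^2, x^3, (x - t_j)_+^3: the basis of Theorem 8.3."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    for t in knots:
        cols.append(np.clip(x - t, 0.0, None) ** 3)
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=60))
g_true = np.sin(x)                              # unknown signal (our choice)
y = g_true + 0.1 * rng.standard_normal(60)      # signal plus noise
Z = cubic_spline_design(x, knots=[2.5, 5.0, 7.5])
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least squares estimate
g_hat = Z @ beta_hat                            # fitted regression function
```

With k interior knots the design has k + 4 columns; the fitted values g_hat are the spline evaluated at the observed xi.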
Suppose gˆλ(x) is the solution to the penalized sum of squares. Two claims about this function are as follows.

1. Given {ti; gˆλ(ti)}, based on the discussion in the last section, there is a unique natural cubic spline s(x) with knots in {t1, . . . , tk} such that s(ti) = gˆλ(ti), i = 1, . . . , k. Because of this, we have

Σ_{i=1}^{n} {yi − s(xi)}² = Σ_{i=1}^{n} {yi − gˆλ(xi)}².

2. For the s(x) defined above, we have

∫ {gˆλ″(x)}² dx ≥ ∫ {s″(x)}² dx,

with equality if and only if gˆλ(x) = s(x) for all x.

If these claims are true, we must have gˆλ(x) = s(x), a natural cubic spline.

A serious proof is needed for the second claim. Here is the proof. Let γi = s″(ti) for i = 1, . . . , k, with s(x) being a cubic spline with knots at t1, . . . , tk. Being “natural”, we have γ1 = γk = 0. Let g(x) be another function with finite second derivatives such that g(ti) = s(ti) for i = 1, 2, . . . , k. Integration by parts gives

∫_{ti}^{t_{i+1}} g″(x)s″(x) dx = ∫_{ti}^{t_{i+1}} s″(x) dg′(x) = [s″(t_{i+1})g′(t_{i+1}) − s″(ti)g′(ti)] − ∫_{ti}^{t_{i+1}} g′(x)s‴(x) dx.

Note that

Σ_{i=1}^{k−1} [s″(t_{i+1})g′(t_{i+1}) − s″(ti)g′(ti)] = γk g′(tk) − γ1 g′(t1) = 0.

Since s″(x) is linear on every interval [ti, t_{i+1}], we have there

s‴(x) = (γ_{i+1} − γi)/(t_{i+1} − ti) = αi,

where we have used αi for the slope. With this, we find

∫_{ti}^{t_{i+1}} g′(x)s‴(x) dx = αi{g(t_{i+1}) − g(ti)} = αi{s(t_{i+1}) − s(ti)},

where the last equality comes from the fact that g(x) and s(x) are equal at the knots. Hence, we arrive at the conclusion that

∫_{t1}^{tk} g″(x)s″(x) dx = −Σ_{i=1}^{k−1} αi{s(t_{i+1}) − s(ti)}.

This result is also applicable with g(x) replaced by s(x) itself. Hence, we also have

∫_{t1}^{tk} s″(x)s″(x) dx = −Σ_{i=1}^{k−1} αi{s(t_{i+1}) − s(ti)}.

This implies that

∫_{t1}^{tk} g″(x)s″(x) dx = ∫_{t1}^{tk} s″(x)s″(x) dx.

Making use of this result, we get

∫_{t1}^{tk} {g″(x) − s″(x)}² dx = ∫_{t1}^{tk} {g″(x)}² dx − ∫_{t1}^{tk} {s″(x)}² dx ≥ 0.

Equality holds only if g″(x) = s″(x) for all x ∈ [t1, tk]. Hence the overall conclusion is proved.
Consider the problem of searching for a natural cubic spline that minimizes the penalized optimization problem (within this class of functions). Given a function

g(x) = Σ_{j=1}^{k} βj Nj(x)

for some constants β1, . . . , βk, its sum of squared residuals is given by

Σ_{i=1}^{n} {yi − g(xi)}² = (y − Zβ)τ(y − Zβ),

where Z is the n × k matrix with (i, j)th entry Nj(xi).

The penalty term over the interval [t1, tk] for this g(x) becomes

∫ {g″(x)}² dx = ∫ Σ_{j=1}^{k} Σ_{l=1}^{k} βj βl Nj″(x) Nl″(x) dx = βτNβ

with N = (Njl)_{k×k} and Njl = ∫_{t1}^{tk} Nj″(x) Nl″(x) dx. The penalized sum of squares of g(x) is given by

(y − Zβ)τ(y − Zβ) + λβτNβ.

Given λ, it is minimized at

βˆλ = (ZτZ + λN)⁻¹Zτy,

and the fitted regression function is

gˆλ(x) = Σ_{j=1}^{k} βˆλ,j Nj(x).

8.7 Effective number of parameters and the choice of λ

If we regard gˆλ(x) as a fit based on a linear regression, then we seem to have employed k independent parameters. Due to the regularization induced by the penalty, the effective number of parameters is lower than k. Note that the fitted value of the response vector is given by

yˆλ = Z(ZτZ + λN)⁻¹Zτy = Aλy.

We call Aλ the smoother matrix. Similar to the local polynomial kernel method, we define the effective degrees of freedom (df), or effective number of parameters, to be

dfλ = trace(Aλ).

As λ increases, the effective number of parameters dfλ decreases and gˆλ(x) becomes smoother and smoother. We can hence try out a range of λ values, examine the resulting gˆλ(x), and select the most satisfactory one. However, this procedure needs human intervention and cannot be automated. To overcome this deficiency, one may choose λ using the cv or gcv criteria. Similar to the local polynomial kernel method, we define the gcv score as a function of λ to be

gcv(λ) = (y − yˆλ)τ(y − yˆλ) / {1 − trace(Aλ)/n}².

The gcv method chooses λ as the minimizer of gcv(λ). The cv approach is similar.
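The ridge-type formula βˆλ = (ZτZ + λN)⁻¹Zτy and the shrinking trace of Aλ can be illustrated with a small simulation. This is a sketch only, with stand-ins of our own choosing: random numbers replace the basis values in Z, and an identity matrix replaces the penalty matrix N (whose real entries are the integrals ∫ Nj″ Nl″).

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 6))   # stand-in for the n-by-k basis matrix
N = np.eye(6)                      # stand-in penalty; really N_jl = ∫ N_j'' N_l''
y = rng.standard_normal(50)

def smoothing_fit(lmb):
    """Return (beta_hat_lambda, df_lambda) for the penalized criterion."""
    M = np.linalg.solve(Z.T @ Z + lmb * N, Z.T)   # (Z'Z + λN)^{-1} Z'
    beta = M @ y
    df = np.trace(Z @ M)                          # trace of smoother matrix A_λ
    return beta, df
```

With λ = 0 the fit is ordinary least squares and dfλ equals k; as λ grows, dfλ decreases (toward zero here, because the stand-in penalty penalizes every direction, whereas the true N leaves linear functions unpenalized).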
Let gˆ−i(xi) be the estimate of g(xi) based on the n − 1 observations without the ith observation. We define the cv score as a function of λ to be

cv(λ) = Σ_{i=1}^{n} {yi − gˆ−i(xi)}².

It turns out that

cv(λ) = Σ_{i=1}^{n} [{yi − gˆλ(xi)} / {1 − Aλ,ii}]²,

where Aλ,ii is the ith diagonal entry of Aλ. This expression enables us to fit the model only once for each λ in order to compute cv(λ). The cv method chooses the λ value as the minimizer of cv(λ).

Remark: The so-called R-functions for these computations are not included here.

8.8 Assignment problems

1. Find the asymptotic efficiency of the least absolute deviation estimator when the data are i.i.d. samples from a normal distribution, and the asymptotic efficiency of the least squares estimator when the data are i.i.d. samples from a double exponential distribution.

2. Let βˆ be the least squares estimator of β under the linear model. Show that for any non-random vector b, bτβˆ is the BLUE of bτβ.

Chapter 9 Bayes method

Most of the data analysis methods we have discussed so far are regarded as frequentist methods. More precisely, these methods are devised based on the conviction that the data are generated from a fixed system which is a member of a family of systems. While the system is chosen by nature, the outcomes are random. By analyzing the data obtained/generated/sampled from this system, we infer the properties of THIS system. The methods devised subsequently are judged by their average performance when they are repeatedly applied to all possible realized data from this system. For instance, we regard the sample mean as an optimal estimator for the population mean under the normal model in some sense: whichever N(θ, σ²) is the truth, E(x̄ − θ)² is the smallest among all estimators θˆ whose expectation equals θ. A procedure is judged optimal only if this optimality holds at each and every possible (θ, σ²) value. When considered from such a frequentist point of view, statisticians do not play favourites with any specific system against the rest of them in this family.
Simplistically, each system in the family is regarded as equally likely beforehand. This view is subject to dispute. In some applications, we may actually have some preference among such systems. What is the chance that a patient entering a clinic with fever actually has a simple flu? If this occurs during flu season, the doctor will immediately look for more signs of flu. If it is not flu season, the doctor will cast a bigger net for the cause of the fever. The conclusion arrived at by the doctor is not completely dependent on the evidence of having fever. This example shows that most human beings act on their prior beliefs. The famous Bayes theorem provides one way to formally utilize prior information.

Let A and B be two events in the context of probability theory. It is seen that the conditional probability of B given A is

pr(B|A) = pr(A|B)pr(B) / {pr(A|B)pr(B) + pr(A|Bc)pr(Bc)},

where Bc is the complement of B, or the event that B does not occur. This formula is useful for computing the conditional probability of B after A is known to have occurred, when all probabilities on the right hand side are known. The comparison between pr(B|A) and pr(B) reflects what we learn from event A about the likeliness of event B.

9.1 An artificial example

Suppose one of two students is randomly selected to write a typical exam. Their historical averages are 70 and 80 percent. After we are told that the mark of this exam is 85%, which student was selected in the first place? Clearly, both are possible, but most of us would bet on the one who has a historical average of 80%. It turns out that Bayes theorem gives us a quantitative way to justify our decision if we are willing to accept some model assumptions. Suppose the outcomes of the exam have distributions whose densities are given by

fa(x) = x^{7−1}(1 − x)^{3−1}/B(7, 3) × 1(0 < x < 1);  fb(x) = x^{8−1}(1 − x)^{2−1}/B(8, 2) × 1(0 < x < 1)

for students A and B, with the beta function defined to be

B(a, b) = ∫₀¹ x^{a−1}(1 − x)^{b−1} dx
for a, b > 0. The probability that each is selected to write the exam is pr(A) = pr(B) = 0.5, which is our prior belief and reflects the random selection very well. Let X denote the outcome of the exam. It is seen that

pr(A|X = x) = 0.5 fa(x) / {0.5 fa(x) + 0.5 fb(x)}.

If X = 0.85, we find pr(A|X = 0.85) = 0.3818. If X = 0.60, we find pr(A|X = 0.60) = 0.7000. Based on these calculations, we seem to know what to do next.

To use the frequentist approach discussed earlier, we re-state this experiment as follows. One observation X has been obtained from a Beta distribution family with parameter space Θ = {(7, 3); (8, 2)}, or {A, B}. If X = 0.85, what is your estimate of θ? The likelihood values at these two parameter points are given by

ℓ((7, 3)) = fa(0.85) = 2.138;  ℓ((8, 2)) = fb(0.85) = 3.462.

Hence, the MLE is given by θˆ = (8, 2), corresponding to student B. Based on the frequentist approach, which ignores the prior information, we are told that it is more likely that student B wrote the exam.

If the MLE has been chosen as the frequentist method to be used, then student B is our choice, even though we know it is not certain. Using Bayes analysis together with the prior information provided, we claim that there is a 62% chance that student B wrote the exam. At this moment, we have yet to make a decision. The calculation of the posterior probability itself does not directly provide one. Suppose wrongfully concluding that it was written by student B may result in a loss of a million dollars, while wrongfully concluding that it was student A may result in a loss of a single dollar; then we may still claim/act that it was student A who wrote the exam.

Figure 9.1: Posterior probability pr(A|X = x) as a function of the exam score x. [Plot omitted; axes: Exam Score versus Prob(A|X = x).]
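The posterior probabilities quoted above are easy to reproduce. Here is a minimal sketch in Python rather than R (the function names are ours):

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def dbeta(x, a, b):
    """Beta(a, b) density at x in (0, 1)."""
    return x ** (a - 1.0) * (1.0 - x) ** (b - 1.0) / beta_fn(a, b)

def prob_A(x):
    """Posterior probability that student A (Beta(7, 3) scores) was selected."""
    fa = dbeta(x, 7.0, 3.0)   # student A: historical average 70%
    fb = dbeta(x, 8.0, 2.0)   # student B: historical average 80%
    return 0.5 * fa / (0.5 * fa + 0.5 * fb)
```

Evaluating prob_A over a grid of scores reproduces the curve in Figure 9.1.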
9.2 Classical issues related to Bayes analysis

We suggested that a statistical model is a family of distributions, often represented as a collection of parameterized density functions. We use {f(x; θ) : θ ∈ Θ} as a generic notation. In most applications, we let Θ be a subset of R^d. When a set of observations X is obtained and a statistical model is assumed, a frequentist regards X as generated from ONE member of {f(x; θ) : θ ∈ Θ}, but usually we do not know which one. The information contained in X helps us decide which one is most likely, or a close approximation of this ONE.

In comparison, a Bayesian may also regard X as generated from ONE member of {f(x; θ) : θ ∈ Θ}. However, this θ value itself is generated from another distribution called the prior distribution, Π(θ). In other words, it is a realized value of a random variable whose distribution is given by Π(θ). If we have full knowledge of Π(θ), then it should be combined with X to infer which θ has been THE θ in {f(x; θ) : θ ∈ Θ} that generated X. We generally cannot nail down a single θ value given X and Π(θ). With the help of Bayes theorem, we are able to compute the conditional distribution of θ given X, which is called the posterior. That is, we retain the random nature of θ but update our knowledge about its distribution when X becomes available. Statistical inference about θ will then be made based on the updated knowledge.

From the above discussion, it is seen that a preliminary step in Bayes analysis is to obtain the posterior distribution of θ, assuming the model itself has been given and the data have been collected. That is, we have already decided on the statistical model f(x; θ), the prior distribution Π(θ), and the data X collected in the application. Note that this X can be a vector of i.i.d. observations given θ. The notion of GIVEN θ is important because θ is a random variable in the context of Bayes analysis.
Particularly in the early days, Bayes analysis was possible only if some kind of neat analytical expression for the posterior was available. Indeed, I can give you many such examples where things line up nicely.

Example 9.1. Suppose we have an observation X from a binomial distribution

f(x; θ) = C(n, x)θ^x(1 − θ)^{n−x} for x = 0, 1, . . . , n.

Suppose we set the prior distribution with density function

π(θ) = θ^{a−1}(1 − θ)^{b−1}/B(a, b) × 1(0 < θ < 1).

By Bayes rule, the density function of the posterior distribution of θ is given by

fp(θ|X = x) = f(x; θ)π(θ) / ∫ f(x; θ)π(θ) dθ.

It appears that to get an explicit expression, we must work out the integration. However, this can often be avoided. Note that

f(x; θ)π(θ) ∝ θ^{a+x−1}(1 − θ)^{b+n−x−1}1(0 < θ < 1).

Hence, we must have

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1}1(0 < θ < 1) / c(n, a, b, x)

for some constant c(n, a, b, x) not depending on θ. As a function of θ, this matches the density function of the Beta distribution with degrees of freedom a + x and b + n − x. At the same time, its integration must be 1. This shows that we must have c(n, a, b, x) = B(a + x, b + n − x). The posterior distribution is Beta with a + x and b + n − x degrees of freedom:

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1}1(0 < θ < 1) / B(a + x, b + n − x).

This will be the posterior distribution used for Bayes decisions.

You may notice that the binomial distribution and the Beta distribution are perfectly paired up to permit an easy conclusion on the posterior distribution. There are many such pairs. For instance, if X has a Poisson distribution with mean θ, and θ has a one-parameter Gamma prior distribution, then the posterior distribution of θ is also Gamma. We leave this case as an exercise. Such prior distributions are called conjugate priors. Another good exercise is to draw the density functions of many beta distributions. It helps to build intuition on what you have assumed when a beta prior is applied.

Definition 9.1.
Let {f(x; θ) : θ ∈ Θ} be a statistical model; namely, it is a family of distributions. Suppose that for any prior distribution π(θ) that is a member of the distribution family {π(θ; ξ) : ξ ∈ Ξ}, the posterior distribution of θ given a set of i.i.d. observations from f(x; θ) is also a member of {π(θ; ξ) : ξ ∈ Ξ}. Then we say that {π(θ; ξ) : ξ ∈ Ξ} is a conjugate prior distribution family of {f(x; θ) : θ ∈ Θ}.

Remark: We have seen that the posterior density is given by

fp(θ|X = x) = f(x; θ)π(θ) / ∫ f(x; θ)π(θ) dθ.

This formula is generally applicable. In addition, one should note that the denominator in this formula does not depend on θ. Hence, the denominator merely serves as a scale factor in fp(θ|X = x). In classical examples, its value can be inferred from the analytical form of the numerator. In complex examples, its value does not play a role in Bayes analysis.

Example 9.2. Suppose that given µ, X1, . . . , Xn are i.i.d. from N(µ, σ0²) with known σ0². Namely, σ0² is not regarded as random. The prior distribution of µ is N(µ0, τ0²) with both parameter values known. The posterior distribution of µ given the sample is still normal, with parameters

µB = (n x̄/σ0² + µ0/τ0²) / (n/σ0² + 1/τ0²) and σB² = [n/σ0² + 1/τ0²]⁻¹.

The philosophy behind Bayes data analysis is to accommodate our prior information/belief about the parameter in statistical inference. Sometimes, prior information naturally exists. For instance, we have a good idea of the human sex ratio. In other applications, we may have some idea about certain parameters; for example, the score distribution of a typical course. Even if we cannot perfectly summarize our belief with a prior distribution, one of the distributions in the beta distribution family can be good enough.

It is probably not unusual that we do not have much idea about the parameter value under a statistical model assumption.
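The posterior mean in Example 9.2 is a precision-weighted average of the sample mean x̄ and the prior mean µ0. A minimal sketch (the function name is ours):

```python
def normal_posterior(x, mu0, tau0_sq, sigma0_sq):
    """Posterior (mu_B, sigma_B^2) for mu given i.i.d. N(mu, sigma0^2) data
    and a N(mu0, tau0^2) prior; both variance parameters are known."""
    n = len(x)
    xbar = sum(x) / n
    precision = n / sigma0_sq + 1.0 / tau0_sq     # posterior precision
    mu_B = (n * xbar / sigma0_sq + mu0 / tau0_sq) / precision
    return mu_B, 1.0 / precision
```

With σ0² = τ0² = 1 and µ0 = 0, the posterior mean reduces to n x̄/(n + 1): the data mean shrunk toward the prior mean.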
Yet one may be attracted by the ease of the Bayesian approach and would like to use Bayes analysis anyway. One may then decide to use something called a non-informative prior. Yet there seems to be no rigorous definition of what kind of priors are non-informative. In the normal distribution example, one may not have much idea about the mean of the distribution in a specific application. If one insists on using the Bayesian approach, he or she may simply use a prior density function π(µ) = 1 for all µ ∈ R. This prior seems to reflect the lack of any idea on which µ value is more likely than any other. In this case, π(µ) is not even a proper density function with respect to Lebesgue measure. Yet one may obtain a proper posterior density by following the rule of Bayes theorem.

It appears to me that Bayes analysis makes sense when prior information about the parameter truly exists. On some occasions, it does not hurt to employ this tool even if we do not have much prior information. If so, the Bayes inference conclusion should be critically examined just like any other inference conclusion.

9.3 Decision theory

Let us go back to the position that a statistical model f(x; θ) is given, a prior distribution Π(θ) is chosen, and data X have been collected. At least in principle, the Bayes theorem has enabled us to obtain the posterior distribution of θ: fp(θ|X). At this point, we need to decide how to estimate θ, the value generated from Π(θ), with X a random sample from f(x; θ) with this θ.

With fp(θ|X) at hand, how do you estimate θ? First of all, you may pick any function of X as your estimator of θ. This has not changed. Second, if you wish to find a superior estimator, then you must provide a criterion to judge superiority. In the context of Bayes data analysis, the criterion for point estimation is specified through loss functions.

Definition 9.2. Assume a probability model with parameter space Θ.
A loss function ℓ(·, ·) is a non-negative valued function on Θ × Θ such that ℓ(θ1, θ2) = 0 when θ1 = θ2.

Finally, since we do not know what the true θ value is, with the posterior distribution we can only hope to minimize the average loss. Hence, the decision based on the Bayes rule is to look for θˆ such that the expected posterior loss

∫ L(θˆ, θ)fp(θ|X) dθ

is minimized. A natural choice of the loss function is L(θˆ, θ) = (θˆ − θ)². The solution under this loss function is clearly the posterior mean of θ for one-dimensional θ. This extends to the situation where θ is multidimensional. One may instead use the loss function L(θˆ, θ) = |θˆ − θ|. If so, the solution is the posterior median for one-dimensional θ. The extension to multidimensional θ is possible.

Example 9.3. Suppose we have an observation X from a binomial distribution

f(x; θ) = C(n, x)θ^x(1 − θ)^{n−x} for x = 0, 1, . . . , n.

Suppose we set the prior distribution with density function

π(θ) = θ^{a−1}(1 − θ)^{b−1}/B(a, b) × 1(0 < θ < 1).

By Bayes rule, the density function of the posterior distribution of θ is given by

fp(θ|X = x) = f(x; θ)π(θ) / ∫ f(x; θ)π(θ) dθ.

The posterior distribution is Beta with a + x and b + n − x degrees of freedom:

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1}1(0 < θ < 1) / B(a + x, b + n − x).

If the squared loss is employed, then the Bayes estimator of θ is given by

∫ θ fp(θ|X = x) dθ = (a + x)/(a + b + n).

When a = b = 1, the prior distribution of θ is uniform on (0, 1). This is regarded as a non-informative prior. With this prior, we find

θˆ = (x + 1)/(n + 2),

which seems to make more sense than the MLE x/n.

Since the Bayes estimator is generally chosen as the minimizer of some expected posterior loss, it is optimal in this sense by definition. However, the optimality is judged with respect to the specific loss function and under the assumed prior. Blindly claiming that a Bayes estimator is optimal out of context is not recommended here.
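The Beta posterior and its mean in Example 9.3 can be checked numerically. A sketch (the particular values a = 2, b = 3, n = 10, x = 4 and the midpoint-rule check are our own):

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def posterior_params(a, b, n, x):
    """Beta(a, b) prior + Binomial(n, θ) observation x → Beta(a + x, b + n − x)."""
    return a + x, b + n - x

a, b, n, x = 2.0, 3.0, 10, 4
ap, bp = posterior_params(a, b, n, x)

# midpoint-rule check that the posterior mean equals (a + x)/(a + b + n)
m = 100_000
grid = [(i + 0.5) / m for i in range(m)]
post_mean = sum(t * t ** (ap - 1) * (1 - t) ** (bp - 1) for t in grid) / m / beta_fn(ap, bp)
```

Here the analytic Bayes estimate is (2 + 4)/(2 + 3 + 10) = 0.4, and the numeric integration agrees.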
If this logic were applicable, then we could just as rightfully claim that the MLE is optimal, because it maximizes a criterion function called the likelihood. Such a claim would be ridiculous because we have many examples where the MLE is not even consistent.

We will have an exercise problem to work out Bayes estimators under squared loss under the normal model with a conjugate prior distribution on both the mean and the variance. Once the posterior distribution is ready, we are not restricted to merely giving a point estimate. These issues will be discussed in other parts of this course. At the same time, we may get the sense that being able to precisely describe the posterior distribution is one of the most important topics in Bayes data analysis.

9.4 Some comments

There are two major schools of thought on how statistical data analysis should be carried out: frequentist and Bayesian. If some prior information exists and can be reasonably well summarized by some prior distribution, then I feel the inference based on Bayes analysis is fully justified. If one does not have much sensible prior information on the statistical model appropriate to the data at hand, it is still acceptable to use the formality of the Bayes analysis. Yet blindly claiming the superiority of a Bayesian approach is not to my taste. Particularly in the latter case, the Bayes conclusion should be critically examined as much as any other data analysis method. To make things worse, many statisticians seem to regard themselves as doing research on Bayesian methods, yet they are not aware of the principles of Bayes analysis. Probably, they merely feel that this is an easy topic for publishing papers (not true if one is a serious Bayesian). Strictly speaking, a Bayesian should have a strong conviction that the model parameters are invariably realized values from some distribution. There is an interesting and very valid question: is/was Bayes a Bayesian?

9.5 Assignment problems

1.
(a) Using an R package, plot the beta density functions with degrees of freedom (a, b) = (0.5, 0.1), (0.1, 0.5), (0.5, 0.5), (1, 1), (5, 1), (1, 5), (5, 20), (5, 50). (b) Select two additional pairs based on your own curiosity and plot them. (c) Show that the density functions with parameters (a, b) and (b, a) are mirror images of each other. Remark: (c) is an observational question.

2. Show that the following two pairs of distribution families permit analytical descriptions of the posterior distribution. Identify the posterior distribution together with specific parameter values. Assume that we have a single observation from the statistical model.

(a) Statistical model: Poisson with parameter θ; prior distribution family: one-parameter Gamma with degree of freedom d. Namely, the density function is given by π(θ) = θ^{d−1} exp(−θ)/Γ(d) for θ ≥ 0.

(b) Statistical model: N(µ, σ²); prior distribution: N(µ0, σ²) for µ given σ², and one-parameter Gamma for 1/σ² with degree of freedom d0 = 5. Namely, the prior distribution is specified through µ given σ² and the distribution of 1/σ².

3. Given a set of i.i.d. observations of size n = 20 from N(µ, σ²) and the prior distribution specified as in the previous problem with µ0 = 0. (a) Find the posterior 75% quantile of the mean parameter µ. (b) Find the posterior expectation of µ².

4. Following the last problem, assume the data set contains n = 20 observations as follows:

1.1777518 -0.5867896 0.2283789 -0.1735369 -0.2328192
1.0955114 1.2053680 -0.7216797 -0.3387580 0.1620835
1.4173256 0.0240219 -0.6647623 0.6214567 0.7466441
1.9525066 -1.2017093 1.9736293 -0.1168171 0.4511754

(a) Given d0 = 5, plot the posterior mean of µ as a function of µ0 over [−2, 2]. (b) Given µ0 = 0, plot the posterior mean of σ² as a function of d0 over [0.5, 10]. Remark: use Monte Carlo simulation if direct numerical/analytical computation is too difficult/infeasible.

5.
Based on the class example where the statistical model is binomial and the prior distribution of θ is Beta(a, b): (a) Compute the expected posterior squared loss of the MLE θˆ = X/n and of the Bayes estimator θˇ = (X + 1)/(n + 2). Remark: they are functions of (a, b). (b) Compute the frequentist MSEs of these two estimators. That is, regard the MSEs as functions of θ, with θ a non-random unknown parameter.

6. Consider the problem where an i.i.d. sample x1, x2, . . . , xn of size n is taken from the Gamma distribution family

f(x; θ) = {1/Γ(θ)} x^{θ−1} exp(−x).

We have a discrete prior distribution of θ given by P(θ = j) = j/9, j = 2, 3, 4. (a) Give the expression for the posterior probability mass function of θ. (b) When n = 3 and the observed values are 4.134, 2.116, 4.105, so that Σxi = 10.355 and Πxi = 35.90, what is your estimated value of θ? Remark: be a good Bayesian.

Chapter 10 Monte Carlo and MCMC

Recall that a statistical model is a distribution family; at least this is what we suggest. Let us first focus on parametric models: {f(x; θ) : θ ∈ Θ}. In this case, θ is generally a real-valued vector and Θ is a subset of a Euclidean space with nice properties such as convexity, openness, and so on. After placing a prior distribution on θ, we have created a Bayes model. We do not seem to have an explicit consensus on a definition of, and a notation for, a Bayes model, even though statisticians are not shy about using this terminology. Based on our understanding, we define a Bayes model as a system with two necessary components: a family of distributions, and a prior distribution on the parameter space of this distribution family:

Bayes Model = [{f(x; θ) : θ ∈ Θ}, π(θ)].

When Θ is a subset of a Euclidean space, we generally regard π(·) as a density function with respect to Lebesgue measure on Θ. For an abstract {f(x; θ) : θ ∈ Θ}, π(·) represents an abstract distribution. Logically, a Bayes model is not the same as Bayes analysis.
Bayes analysis is generally carried out based on the posterior distribution. Yet there is no formal requirement on this rule. Frequentists often take the likelihood function as the basis for inference, yet they may design inference procedures in any way they like. In my opinion, this includes procedures based on posterior distributions.

Suppose a θ value is generated according to π(·), and subsequently a data set X is generated from THIS f(x; θ). Here we implicitly assume that X is accurately measured and available for the purpose of inference, and that the value of θ is hidden from us. The inference target is θ based on data from this experiment. Any decision about the possible value of θ in Bayes analysis is generally based on the posterior density of θ given X. We use the notation fp(θ|X) for the posterior distribution (density). It is conceptually straightforward to define and derive the posterior distribution; hence, there is not much left for a statistician to do there. Bayes analysis makes decisions based on the posterior distribution. Research on Bayes methods includes: (a) the most suitable prior distributions in specific applications; (b) the influence of the choice of prior distribution on the final decision; (c) numerical or theoretical methods for determining the posterior distribution; (d) properties of the posterior distribution; (e) decision rules. There might be more topics out there. This chapter is about topic (c).

For some well paired up f(x; θ) and π(θ) (when π(·) is a conjugate prior for f(x; θ)), it is simple to work out the analytical form of the posterior density function. A Bayesian needs only decide on the best choice of π(θ) and the subsequent decision rule. In many real world problems, however, the posterior density is on a high dimensional space and does not have a simple analytical form. Before the era of contemporary computing power, Bayes analysis faced a serious challenge because of this formidable task.
This task becomes less and less of an issue today. We discuss a number of commonly used techniques in this chapter.

10.1 Monte Carlo Simulation

The content of this section is related, but not limited, to Bayes analysis. Suppose in some application we wish to compute E{g(X)}, and X is known to have a certain distribution. This is certainly a simple task in many textbook examples. For instance, if X has a Poisson distribution with mean θ and g(x) = x(x − 1)(x − 2)(x − 3), then E{g(X)} = θ⁴. However, if g(x) = x log(x + 1), the answer to E{g(X)} is not analytically available. Suppose we have an i.i.d. sample x1, . . . , xn with sufficiently large n from this distribution; then by the law of large numbers,

E{g(X)} ≈ n⁻¹ Σ_{i=1}^{n} xi log(1 + xi).

Let us generate n = 100 values from the Poisson distribution with θ = 2. Using a function in R, we get the 100 values

5 2 3 4 1 2 1 2 1 1 2 3 2 2 2 3 1 2 0 4
1 2 5 1 1 2 3 1 1 1 2 0 2 1 1 3 0 5 1 5
1 2 1 0 2 3 5 2 6 3 2 4 3 1 1 2 2 1 1 2
2 5 0 2 1 3 3 1 3 1 1 2 2 3 1 2 1 4 0 4
2 3 0 0 2 1 3 1 0 2 1 0 3 1 3 6 1 3 3 3

Based on this sample, we get the approximate value E{g(X)} ≈ 2.691. I can just as easily use n = 10,000 and find E{g(X)} ≈ 2.648 in one try. With a contemporary computer, we can afford to repeat it as many times as we like: E{g(X)} ≈ 2.642, 2.641, 2.648. It appears that E{g(X)} ≈ 2.645 would be a very accurate approximation. Computation based on simulated data is generally called the Monte Carlo method.

We must answer two questions before we continue. The first is why we do not use a numerical approach if we need to compute E{g(X)}. Indeed, we can put up a quick line of R code,

ii = 0:50; sum(ii*log(1+ii)*dpois(ii, 2))

and get the value 2.647645. This is a very accurate answer to this specific problem. Yet if we wish to compute

E{(X1 + √X2)² log(1 + X1 + X3X4)},

where X1, X2, X3, X4 may have a not very simple joint distribution, a neat numerical solution becomes hard.
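The Poisson illustration above can be replayed in Python (NumPy is used in place of R's rpois and dpois; the seed and sample size are our own choices):

```python
import numpy as np
from math import exp, factorial, log

theta = 2.0
# truncated-sum "exact" value, mirroring sum(ii*log(1+ii)*dpois(ii, 2))
exact = sum(i * log(1 + i) * exp(-theta) * theta ** i / factorial(i)
            for i in range(51))

rng = np.random.default_rng(42)
x = rng.poisson(theta, size=200_000)
mc = float(np.mean(x * np.log(1.0 + x)))       # Monte Carlo approximation
```

The Monte Carlo value fluctuates around the truncated-sum value 2.647645, with error of order n^{−1/2}.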
Since contemporary computers are so powerful, the above problem is only “slightly” harder. Yet there are real world problems of this nature that involve hundreds or more random variables. For those problems, the numerical approach quickly becomes infeasible even for contemporary computers. In comparison, the complexity of the Monte Carlo method remains the same even when g(X) is a function of a vector X with a very high dimension.

The second question is how easy it is to generate quality “random samples” from a given distribution by computer. There are two issues related to this question. First, the computer does not have an efficient way to generate truly random numbers. However, with some well designed algorithms, it can produce massive amounts of data which appear purely random. We call such algorithms pseudo random number generators. We do not discuss this part of the problem in this course. The other issue is how to make sure these random numbers behave like samples from the desired distributions. Our starting point is that it is easy to generate i.i.d. observations (pseudo random numbers) from the uniform distribution on [0, 1]. We investigate techniques for generating i.i.d. observations from other distributions.

Theorem 10.1. Let F(x) be any univariate continuous distribution function and U be a standard uniformly distributed random variable. Let Y = inf{x : F(x) ≥ U}. Then the distribution function of Y is given by F(·).

Proof. We only need to work out the c.d.f. of Y. If it is the same as F(·), then the theorem is proved. Routinely, we have

pr(Y ≤ t) = pr(inf{x : F(x) ≥ U} ≤ t) = pr(F(t) ≥ U) = F(t),

because pr(U ≤ u) = u for any u ∈ (0, 1). This completes the proof.

Since we generally only have pseudo random numbers for U, applying the above transformation will only lead to “pseudo random numbers” for Y.

Example 10.1. Let g(u) = −log u. Then Y = g(U) has an exponential distribution if U has the standard uniform distribution.
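Theorem 10.1 applied to the Exp(1) distribution: F(x) = 1 − e^{−x} gives F⁻¹(u) = −log(1 − u), which has the same distribution as −log U from Example 10.1. A sketch using only the standard library (we write 1 − U inside the logarithm to avoid log 0; the seed is our choice):

```python
import math
import random

random.seed(7)
n = 100_000
# inverse-c.d.f. transform: F(x) = 1 - exp(-x)  =>  F^{-1}(u) = -log(1 - u)
y = [-math.log(1.0 - random.random()) for _ in range(n)]

mean_y = sum(y) / n                                          # near E(Y) = 1
frac_below_median = sum(v <= math.log(2.0) for v in y) / n   # near 1/2
```

The sample mean settles near 1 and about half the draws fall below the Exp(1) median log 2, as the theorem predicts.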
Let g(u) = (− log u)^a for some positive constant a. Then Y = g(U) has a Weibull distribution.

As an exercise problem, find the function g(·) which makes g(U) standard Cauchy distributed. Here is another useful exercise: if Z_1, Z_2 are independent standard normally distributed random variables, then r^2 = Z_1^2 + Z_2^2 is exponentially distributed. One should certainly know that r^2 is also chi-square distributed with 2 degrees of freedom.

Example 10.2. Let U_1, U_2 be two independent standard uniform random variables. Let

g_1(s, t) = sqrt(−2 log s) cos(2πt); g_2(s, t) = sqrt(−2 log s) sin(2πt).

Then g_1(U_1, U_2), g_2(U_1, U_2) are two independent standard normal random variables.

If we can efficiently generate pseudo-random numbers from the uniform distribution, then the above result enables us to efficiently generate pseudo-random numbers from the standard normal distribution. Since general normal random variables are merely location-scale shifts of standard normal random variables, they can hence also be generated efficiently this way. Due to the well established relationships between various distributions, pseudo-random numbers from many classical distributions can be generated efficiently. Here are a few well-known results which were also given in the chapter about normal distributions.

Example 10.3. Let Z_1, Z_2, . . . be i.i.d. standard normally distributed random variables.

(a) X_n^2 = Z_1^2 + Z_2^2 + · · · + Z_n^2 has a chi-square distribution with n degrees of freedom.

(b) F_{n,m} = (X_n^2/n)/(Y_m^2/m) has an F distribution with (n, m) degrees of freedom when X_n^2 and Y_m^2 are independent.

(c) B = X_n^2/(X_n^2 + Y_m^2) has a Beta(n/2, m/2) distribution when X_n^2 and Y_m^2 are independent.

We can also generate multinomial pseudo-random numbers with any probabilities p_1, p_2, . . . , p_m: generate U from the uniform distribution, then let X = k for the k such that

p_1 + · · · + p_{k−1} < U ≤ p_1 + · · · + p_{k−1} + p_k.
The left hand side is regarded as zero for k = 1.

10.2 Biased or importance sampling

Back to the problem of computing E{g(X)} when X has a distribution with density or probability mass function f(x). If generating pseudo-random numbers from f(x) is efficient, then it is a good idea to approximate this expectation by

n^{-1} Σ_{i=1}^n g(x_i).

If it is instead more convenient to generate pseudo-random numbers from a different distribution f_0(x) which has the same support as f(x), then it is easier to approximate this expectation by

n^{-1} Σ_{i=1}^n {g(y_i) f(y_i)/f_0(y_i)},

where the observations y_1, . . . , y_n are generated from f_0(x). If Y has a distribution with density f_0(x), we have

E{g(Y)f(Y)/f_0(Y)} = ∫ {g(y)f(y)/f_0(y)} f_0(y) dy = ∫ g(y)f(y) dy = E{g(X)},

where X has distribution f(x). Note that it is important that f and f_0 have the same support so that the range of integration remains the same. If X has a discrete distribution, the integration is changed to summation; the conclusion is not affected.

In sample surveys, the units in the finite population often have different probabilities of being included in the sample due to various considerations. The population total

Y = Σ_{i=1}^N y_i,

where N is the number of units in the finite population and y_i is the response value of the ith unit, is often estimated by the Horvitz-Thompson estimator:

Ŷ = Σ_{i∈s} y_i/π_i,

where s is the set of units sampled and π_i is the probability that unit i is in the sample. The role of π_i is the same as that of f_0(x) in the importance sampling context.

In sampling practice, some units with specific properties of particular interest are hard to obtain in an ordinary sampling plan. Specific measures are often taken so that these units have a higher probability of being included than they would if all units were treated equally. This practice may also be regarded as finding a specific f_0(x) to replace f(x), even though the expectation of g(X) under the f(x) distribution is the final target.
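The importance-sampling identity above can be checked numerically. In this Python sketch (the notes use R), the target f is N(0, 1), the proposal f_0 is N(0, 2^2), and g(x) = x^2, all chosen by us for illustration.

```python
import math
import random

random.seed(3)

def normal_pdf(x, sd):
    # Density of N(0, sd^2).
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Target f: N(0,1); proposal f0: N(0, 4). Both have the whole real line
# as support. E_f{g(X)} is approximated by the average of g(Y) f(Y)/f0(Y)
# with Y drawn from f0.
n = 100_000
total = 0.0
for _ in range(n):
    y = random.gauss(0.0, 2.0)                    # draw from f0
    w = normal_pdf(y, 1.0) / normal_pdf(y, 2.0)   # importance weight f/f0
    total += (y * y) * w                          # g(y) = y^2
estimate = total / n
print(estimate)  # E_f{X^2} = 1, so the estimate should be near 1
```

The wider proposal guarantees the weights f/f_0 are bounded here; a proposal with thinner tails than the target could make the weights, and hence the estimator's variance, explode.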
One such example is estimating the proportion of HIV-positive persons in the Vancouver population. A simple random sample may end up containing only HIV-negative individuals, giving a poor estimate of the HIV-positive rate. The same motivation applies in numerical computation. If f(x) has low values in certain regions of x, then a straightforward random number generator will produce very few values in those regions. This makes such numerical approximations inefficient. Searching for a suitable f_0(x) can be a good remedy for this shortcoming.

Here is another example. Suppose we wish to estimate the survival time of cancer patients, and we take a random sample from all cancer patients alive at a specific time point. Denote their survival times by Y_1, Y_2, . . . , Y_n, with distribution f_0(y). The actual survival distribution would be different if every cancer patient were counted equally. This is because f_0(y) ∝ y f(y), where f(y) is the "true" survival time distribution. This may also be regarded as importance sampling created by nature.

10.3 Rejective sampling

Instead of generating data from an original target distribution f(x), we may generate data from f_0(x) and obtain a more effective numerical approximation of E{g(X)}. This is what we have seen in the last section. The same idea is at work in rejective sampling. The goal of this game is to obtain pseudo-random numbers which may be regarded as random samples from f(x). Of course, to make it a good tool, we must select an f_0(x) which is easy to handle.

Let f(x) be the density function from which we wish to get random samples. Let f_0(x) be a density function with the same support such that

sup_x f(x)/f_0(x) = u < ∞.

Denote

π(x) = f(x)/{u f_0(x)}.

Clearly, π(x) ≤ 1 for any x. In addition, if f(x) is known only up to a multiplicative constant, the above calculations remain feasible. One potential example of such an f(x) is

f(x) = C exp(−x^4)/{1 + x^2 + sin^2(x)}.
Since f(x) > 0 and its integral converges, we are sure that

C^{-1} = ∫ exp(−x^4)/{1 + x^2 + sin^2(x)} dx

is well defined. Yet we do not have its exact value. In this example, an accurate approximate value of C is not hard to get. Yet if f(·) is the joint density of many variables, even a numerical approximation is not feasible. Particularly in Bayes analysis, this can occur. If an effective way to generate "random" samples from f(x) is available, then in many applications we no longer need to know C.

Now we present the procedure of the rejective sampling method.

1. Generate a sequence of i.i.d. samples X_1, X_2, . . . from f_0(x).

2. Generate a sequence of i.i.d. samples U_1, U_2, . . . from the standard uniform distribution.

3. For i = 1, 2, . . ., if U_i ≤ π(X_i), let Y_i = X_i; otherwise, leave Y_i undefined.

4. Collect the X_i values not rejected in the last step to form a sequence of i.i.d. samples: Y_1, Y_2, . . ..

It is easy to see why this procedure is called rejective sampling. We now show that the above procedure indeed produces a set of i.i.d. samples from the distribution f(x).

Theorem 10.2. The output of the rejective sampling, Y_i, has distribution F(x) with density function f(x) for any i.

Proof. First, we consider the case i = 1. It is seen that

pr{U > π(X)} = E{1 − π(X)} = 1 − ∫ π(x) f_0(x) dx = 1 − u^{-1}.

Hence, the distribution of Y_1 is given by

pr(Y_1 ≤ y) = Σ_{k=1}^∞ pr(U_1 > π(X_1), . . . , U_{k−1} > π(X_{k−1}), U_k ≤ π(X_k), X_k ≤ y)
= Σ_{k=1}^∞ (1 − u^{-1})^{k−1} pr(U_k ≤ π(X_k), X_k ≤ y)
= Σ_{k=1}^∞ (1 − u^{-1})^{k−1} pr(U ≤ π(X), X ≤ y)
= u E{pr(X ≤ y, U ≤ π(X) | X)}
= u E{π(X) 1(X ≤ y)}.

Taking the definition of π(x) into consideration, we find

pr(Y_1 ≤ y) = u ∫_{−∞}^y [f(x)/{u f_0(x)}] f_0(x) dx = F(y).

This shows that the rejective sampling method indeed leads to random numbers from the target distribution.
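A minimal Python sketch of the four steps above (the notes use R). To avoid spoiling the normal-distribution exercise coming up, the target here is instead Beta(2, 2), with density f(x) = 6x(1 − x) on [0, 1] and a uniform proposal; these choices are ours.

```python
import random

random.seed(11)

# Rejective sampling for the Beta(2, 2) target with proposal f0(x) = 1.
f = lambda x: 6.0 * x * (1.0 - x)
u_bound = 1.5                      # u = sup f(x)/f0(x), attained at x = 1/2
accept_prob = lambda x: f(x) / u_bound   # pi(x) <= 1 for all x

samples, tries = [], 0
while len(samples) < 10_000:
    x = random.random()                  # Step 1: draw from f0
    tries += 1
    if random.random() <= accept_prob(x):  # Steps 2-3: accept with prob pi(x)
        samples.append(x)                  # Step 4: keep accepted values

mean = sum(samples) / len(samples)
print(mean, tries / len(samples))  # mean near 0.5; tries per sample near u = 1.5
```

The observed ratio of proposals to accepted samples estimates the mean waiting time, which the next passage shows equals u.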
Let us define the waiting time

T = min{i : U_i ≤ π(X_i)},

which is the number of pairs of pseudo-random numbers (X, U) it takes to get one pseudo-random observation Y. Its probability mass function is given by

pr(T = k) = pr(U_1 > π(X_1), . . . , U_{k−1} > π(X_{k−1}), U_k ≤ π(X_k)) = (1 − u^{-1})^{k−1} u^{-1}.

That is, T has a geometric distribution with mean u. If we use an f_0(·) which leads to a large u, the rejective sampling is numerically less efficient: it takes more tries on average to obtain one sample from the target distribution. The best choice in terms of computational efficiency is f_0(·) = f(·). Of course, this means we are not using a rejective sampling tool at all.

Here is an exercise problem. Suppose we can easily generate random numbers from the double exponential distribution whose density function is f_0(x) = (1/2) exp(−|x|), and we wish to generate data from the standard normal distribution whose density is given by φ(x) = (2π)^{−1/2} exp(−x^2/2). Rejective sampling is a choice. Compute the constant u as defined above. Write code in R to implement the rejective sampling method to generate n = 1000 observations from N(0, 1). Show the Q-Q plot of the data generated and report the number of pairs of (X, U) required by the rejective sampling. How many pairs of (X, U) do you expect to be needed to generate n = 1000 normally distributed random numbers with this method?

10.4 Markov chain Monte Carlo

Not an expert myself, my comments here may not be accurate. The rejective sampling approach appears to be effective for generating univariate random variables (pseudo-random numbers). In applications, we may wish to generate a large quantity of vector valued observations. Markov chain Monte Carlo seems to be one of the solutions to this problem. To introduce this method, we need a dose of Markov chain theory.

10.4.1 Discrete time Markov chain

A Markov chain is a special type of stochastic process. A stochastic process in turn is a collection of random variables.
Yet we cannot pay an equal amount of attention to all stochastic processes, only to the ones that behave themselves. The Markov chain is one of them. We narrow our focus even further to processes consisting of a sequence of random variables having a beginning but no end:

X_0, X_1, X_2, . . . .

The indices {0, 1, 2, . . .} are naturally called time. In addition, we consider the case where X_n takes values in the same space with countably many members for all n. Without loss of generality, we assume the space is S = {0, ±1, ±2, . . .}. We call S the state space. For such a stochastic process, we define the transition probabilities for s < t to be

p_{ij}(s, t) = pr(X_t = j | X_s = i).

Definition 10.1. A discrete time Markov chain is an ordered sequence of random variables with discrete state space S that has the Markov property:

pr(X_{s+t} = j | X_s = i, X_{s−1} = i_1, . . . , X_{s−k} = i_k) = p_{ij}(s, s + t)

for all i, j ∈ S and s, t ≥ 0. If, further, all one-step transition probabilities p_{ij}(s, s + 1) do not depend on s, we say the Markov chain is time homogeneous.

The Markov property is often summarized as: given the present, the future is independent of the past. In this section, we further restrict ourselves to homogeneous, discrete time Markov chains. We will work as if S is finite and S = {1, 2, . . . , N}. The subsequent discussion does not depend on this assumption, yet most conclusions are easier to understand under it. We simplify the one-step transition probability notation to

p_{ij} = pr(X_1 = j | X_0 = i).

Let P be the matrix formed by the one-step transition probabilities: P = (p_{ij}). For a finite state space Markov chain, its size is N × N. We may also notice that each of its row sums equals 1. It is well known that the t-step transition matrix

P(t) = {pr(X_t = j | X_0 = i)} = P^t

for any positive integer t. For convenience, we may take the 0-step transition matrix to be P^0 = I, the identity matrix. The relationship is so simple that we do not need a specific notation for the t-step transition matrix.
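The identity P(t) = P^t is easy to check numerically. In this Python sketch (the notes use R), the 3-state transition matrix is an arbitrary example of ours; because all its entries are positive, the chain is irreducible and aperiodic, and the rows of P^t converge to a common limit.

```python
# A small transition matrix P; each row sums to 1.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, t):
    # Start from the identity matrix (the 0-step transition matrix).
    result = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(t):
        result = mat_mult(result, A)
    return result

P50 = mat_pow(P, 50)
# Row sums stay 1, and after 50 steps the rows are essentially identical:
# every row approaches the same limiting distribution over the 3 states.
print(P50[0])
print(P50[1])
```

That the rows agree in the limit is exactly the equilibrium behavior discussed next: pr(X_t = j | X_0 = i) loses its dependence on the starting state i.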
Let Π_t be the column vector made of pr(X_t = i), i = 1, 2, . . . , N, for t = 0, 1, . . .. This vector fully characterizes the distribution of X_t; hence, we simply call it the distribution of X_t. It is seen that

Π_t^τ = Π_0^τ P^t,

where τ denotes transpose. Namely, the distribution of X_t in a homogeneous discrete time Markov chain is fully determined by the distribution of X_0 and the transition probability matrix P.

Under some conditions, lim_{t→∞} Π_t exists. The limit itself is unique and is a distribution on the state space S. For a homogeneous discrete time Markov chain with finite state space, the following conditions are sufficient:

(a) irreducible: for any (i, j) ∈ S, there exists a t ≥ 1 such that pr(X_t = j | X_0 = i) > 0;

(b) aperiodic: the greatest common divisor of {t : pr(X_t = i | X_0 = i) > 0} is 1 for any i ∈ S.

When a Markov chain is irreducible, all states in S have the same period, which is defined as the greatest common divisor of {t : pr(X_t = i | X_0 = i) > 0}.

Theorem 10.3. If a homogeneous discrete time Markov chain has finite state space and properties (a) and (b), then for any initial distribution Π_0,

lim_{t→∞} Π_t = Π

exists and is unique.

We call Π in the above theorem the equilibrium distribution, and such a Markov chain ergodic. It can be shown further that when these conditions are satisfied, then for any i, j ∈ S,

lim_{t→∞} pr(X_t = j | X_0 = i) = π_j,

where π_j is the jth entry of the equilibrium distribution Π.

Definition 10.2. For any homogeneous discrete time Markov chain with transition matrix P and state space S, if Π is a distribution on the state space such that Π^τ = Π^τ P, then we call it a stationary distribution.

It is easily seen that the equilibrium distribution is a stationary distribution. However, there are examples where many stationary distributions exist but there is no equilibrium distribution. If a probability mass function π_i, i = 0, 1,
. . . , N satisfies the balance equation

π_i p_{ij} = π_j p_{ji} for any i, j,

then it is the limiting distribution of the Markov chain. In other words, the balance equation serves as a criterion for whether (π_i) is the limiting distribution of the Markov chain.

Finally, we comment on the relevance of this section to MCMC. Suppose one wishes to generate observations from a distribution f(x). It is always possible to find a discrete distribution Π whose p.m.f. is very close to that of f(x). Suppose we can further create a Markov chain with a proper state space and a transition matrix that has Π as its equilibrium distribution. If so, we may generate random numbers from this Markov chain: x_1, x_2, . . .. When t is large enough, the distribution of X_t is nearly the same as the target distribution Π. One should be aware that x_1, x_2, . . . are not observed values of independent and identically distributed random variables. Being successive states of a Markov chain, they have nearly the same marginal distribution for large t, but they are not independent.

The Markov chain Monte Carlo method also works for continuous distributions. However, the general theory cannot be presented without a full course on Markov chains. This section provides some intuitive justification for the Markov chain Monte Carlo methods in the next section.

10.5 MCMC: Metropolis sampling algorithms

Sometimes, direct generation of i.i.d. observations from a distribution f(·) is not feasible. Rejective sampling can also be difficult because finding a proper f_0(·) is not easy. This happens when f(·) is the distribution of a high-dimensional random vector, or when it does not have an exact analytical form. Markov chain Monte Carlo is regarded as a way out in the recent literature. Yet you will see that the solution is not to provide i.i.d. random numbers/vectors, but dependent ones with the required marginal distributions.

Let X_0, X_1, X_2, . . . be random variables that form a time-homogeneous Markov process.
We use "process" here instead of "chain" to allow the range of X to be R^d or something generic. It has all the properties we mentioned in the last section. We define the kernel function K(x, y) to be the conditional density function of X_1 given X_0. Roughly speaking,

K(x, y) = pr(X_1 = y | X_0 = x) = f(x, y)/f_X(x),

which is the transition probability when the process is in fact a chain. We may also write K(x, y) = f_{1|0}(y | x) for the conditional density of X_1 given X_0 when the joint density is needed.

One Metropolis sampling algorithm goes as follows.

1. Let t = 0 and choose a value x_0.

2. Choose a proposal kernel K_0(x, y) such that it is convenient to generate random numbers/vectors from the corresponding conditional density.

3. Choose a function r(x, y) taking values in [0, 1] with r(x, x) = 1.

4. Generate a y value from the conditional distribution K_0(x_t, y) and a standard uniform random number u. If u < r(x_t, y), let x_{t+1} = y; otherwise, let x_{t+1} = x_t. Update t = t + 1.

5. Repeat Step 4 until a sufficient number of random numbers are obtained.

In the above algorithm, we initially generate random numbers from a Markov chain with transition probability matrix specified by K_0(x, y). Due to a rejective sampling step, many outcomes are not accepted, in which case the previous value x_t is retained.

What have we obtained? It is easily seen that {x_0, x_1, . . .} remains a Markov chain with the same state space in spite of rejecting many y values generated according to K_0. We use a Markov chain to illustrate the point. The transition probability of this Markov chain is computed as follows. Consider the case when X_0 = i and the subsequent Y is generated according to the conditional distribution K_0(i, ·). Let U be a standard uniform random variable. For any j ∈ S with j ≠ i, we have

K(i, j) = pr(X_1 = j | X_0 = i) = pr(U < r(i, Y), Y = j | X_0 = i) = r(i, j) K_0(i, j).
Clearly, the chance of not making a move is

K(i, i) = 1 + K_0(i, i) − Σ_j r(i, j) K_0(i, j).

Suppose the target distribution has probability mass function Π. We hope to select K_0 and r so that Π is the equilibrium distribution of the Markov chain with transition matrix K.

Consider the situation where the working transition matrix K_0 is symmetric and we choose, for all i, j,

r(i, j) = min{1, Π(j)/Π(i)}

in the above so-called Metropolis algorithm. One important property of this choice is that we need not know the individual values Π(i) for each i, only their ratios. This is a useful property in the Bayes method, where the posterior density function is often known only up to a constant factor. Computing the value of the constant factor is not a pleasant task. The above choice of r(i, j) makes this computation unnecessary, which is a big relief.

With this choice of r, we find

Π(i) K(i, j) = min{Π(i), Π(j)} K_0(i, j) = min{Π(i), Π(j)} K_0(j, i) = Π(j) K(j, i).

This property is a sufficient condition for Π to be the equilibrium distribution of the Markov chain with transition probabilities given by K(i, j). Note that the existence of the equilibrium distribution is assumed, and can be ensured by the choice of an appropriate K_0(i, j).

Although Step 4 in the Metropolis algorithm is very similar to rejective sampling, they are not the same. In rejective sampling, if a proposed value is rejected, this value is thrown out and a new candidate is generated. In the current Step 4, if a proposed value is rejected, the previous value in the Markov chain is adopted.

We presented the result for discrete time homogeneous Markov chains with countable state space. The symbolic derivation for a general state space is the same. The symmetry requirement on K_0(x, y) is not absolutely needed to ensure that the limiting distribution is given by Π.
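As an illustration of Steps 1 to 5 with the symmetric choice of r, here is a Python sketch (the notes use R) targeting the chapter's earlier density f(x) = C exp(−x^4)/{1 + x^2 + sin^2(x)}; the unknown constant C cancels in the acceptance ratio. The proposal scale and the chain length are arbitrary choices of ours.

```python
import math
import random

random.seed(5)

# The target density, known only up to the constant C.
def unnormalized_f(x):
    return math.exp(-x ** 4) / (1.0 + x * x + math.sin(x) ** 2)

x = 0.0
chain = []
for t in range(50_000):
    y = x + random.gauss(0.0, 1.0)     # symmetric random-walk proposal K0
    r = min(1.0, unnormalized_f(y) / unnormalized_f(x))  # C cancels here
    if random.random() < r:            # accept the move with probability r
        x = y
    chain.append(x)                    # on rejection, the old x is kept

burned = chain[1000:]                  # discard a burn-in period
mean = sum(burned) / len(burned)
print(mean)  # the target is symmetric about 0, so the mean is near 0
```

Because f here is symmetric about zero, the long-run average of the chain should be close to 0, even though consecutive states are dependent.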
When K_0(x, y) is not symmetric, we may instead choose

r(x, y) = min{1, f(y) K_0(y, x) / [f(x) K_0(x, y)]}.

We use x, y here to reinforce the impression that both x and y can be real values, not just integers. A toy exercise is to show that this choice also leads to f(x) satisfying the balance equation:

f(x) K(x, y) = f(y) K(y, x).

Finally, because f(x) is the density function of the equilibrium distribution, as t → ∞ the distribution of X_t generated from the Metropolis algorithm approaches the one with density function f(x). At the same time, the distribution of X_t for any finite t is not f(x) unless that of X_0 is. However, for large enough t, we may regard the distribution of X_t as f(x). This is the reason why a burn-in period is needed before we use X_t as random samples from f(x) in many applications.

Obviously, X_t and X_{t+1} generated by this algorithm are not independent except in very special cases. However, in many applications a non-i.i.d. sequence suffices. For instance, when the Markov chain is ergodic,

n^{-1} Σ_{t=1}^n g(X_t) → E{g(X)}

almost surely, where E is computed with respect to the limiting distribution.

10.6 The Gibbs samplers

Gibbs samplers are another class of algorithms to generate random numbers based on a Markov chain. Suppose X = (U, V) has some joint distribution, with both U and V taking values as real vectors. Suppose that given U = u for any u, it is easy to generate a value v from the conditional distribution of V | (U = u), and vice versa. The goal is to generate vectors with the distribution of U, with the distribution of V, or with the distribution of (U, V). We should add that directly generating random vectors (U, V) itself may not be an easy task.

The following Gibbs sampler leads to a Markov chain/process whose equilibrium distribution is that of U.

1. Pick a value u_0 for U_0. Let t = 0.

2. Generate a value v_t from the conditional distribution V | (U = u_t).

3.
Generate a value u_{t+1} from the conditional distribution U | (V = v_t).

4. Let t = t + 1 and go back to Step 2.

Theorem 10.4. The random numbers generated from the above sampler applied to a joint distribution/density f(u, v) form an observed sequence of a Markov chain/process {U_0, U_1, . . .}. Assume the limiting distribution of this Markov chain exists and is unique. Then the limiting distribution of U_t is the corresponding marginal distribution of f(u, v).

Proof. We give the proof only for the discrete case. Let p_{u|v}(u|v) be the conditional probability mass function of U given V, and similarly define p_{v|u}(v|u). The transition probability of the Markov chain is given by

p_{ij} = pr(U_{t+1} = j | U_t = i) = Σ_k p_{u|v}(j|k) p_{v|u}(k|i).

Let g_u(u) and g_v(v) be the marginal distributions of U and V. Multiplying both sides by g_u(i) and summing over i, we have

Σ_i g_u(i) p_{ij} = Σ_i {Σ_k p_{u|v}(j|k) p_{v|u}(k|i) g_u(i)}
= Σ_k p_{u|v}(j|k) {Σ_i p_{v|u}(k|i) g_u(i)}
= Σ_k p_{u|v}(j|k) g_v(k)
= g_u(j).

This implies that the distribution of U, as a column vector Π, satisfies the relationship Π^τ = Π^τ P, where P is the transition matrix of the discrete Markov chain.

Since the limiting distribution of U_t is g_u(·) and the conditional distribution of V_t given U_t is p_{v|u}(·), it is immediately clear that the marginal distribution of V_t in the limit is g_v(v). Their joint limiting distribution is f(u, v) = p_{v|u}(v|u) g_u(u), as desired.

There are clearly many other issues with the use of Gibbs samplers. Not an expert myself, it is best for me to not say too much here.

10.7 Relevance to Bayes analysis

As we pointed out, the basis of Bayes data analysis is the posterior distribution of the model parameters. However, we often only have the analytical form of the posterior distribution up to a multiplicative constant.
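Before continuing, the two-block Gibbs sampler of the previous section can be sketched for a case where both conditional distributions are explicit: a bivariate normal (U, V) with correlation ρ, for which V | U = u is N(ρu, 1 − ρ^2) and U | V = v is N(ρv, 1 − ρ^2). This Python sketch (the notes use R) and its parameter values are for illustration only.

```python
import math
import random

random.seed(9)

rho = 0.8
cond_sd = math.sqrt(1.0 - rho * rho)   # sd of each conditional distribution

u, us = 0.0, []
for t in range(50_000):
    v = random.gauss(rho * u, cond_sd)   # Step 2: draw V | U = u_t
    u = random.gauss(rho * v, cond_sd)   # Step 3: draw U | V = v_t
    us.append(u)                         # Step 4: iterate

burned = us[1000:]                       # discard a burn-in period
m = sum(burned) / len(burned)
var = sum((x - m) ** 2 for x in burned) / len(burned)
print(m, var)  # marginal of U is N(0, 1): mean near 0, variance near 1
```

The recorded u values are dependent, yet in the limit they carry the N(0, 1) marginal of U, in line with Theorem 10.4.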
In the Metropolis sampling algorithm, a density known only up to a multiplicative constant is all we need to generate random numbers from such distributions.

In the case of Gibbs samplers, the idea can be extended. Suppose U = (U_1, U_2, . . . , U_k) and we wish to obtain samples whose marginal distribution is that of U. Let U_{−i} be the subvector of U with U_i removed. Suppose it is efficient to generate data from the conditional distribution of U_i given U_{−i} for all i. Then one may iteratively generate the U_i to obtain samples from the distribution of U using the Gibbs sampler.

10.8 Assignment problems

1. Find a function g(·) such that g(U) has the standard Cauchy distribution when U has the uniform distribution on [0, 1].

2. Suppose we want to generate random numbers from the standard normal distribution whose density is given by φ(x) = (2π)^{−1/2} exp(−x^2/2). We decide to use a rejective sampling plan via the double exponential distribution whose density function is given by f_0(x) = (1/2) exp(−|x|).

(a) Compute the constant u as defined in our notes.

(b) Write code in R to implement the rejective sampling method to generate n = 1000 observations from N(0, 1).

(c) Work out the Q-Q plot of the data generated and report the number of pairs of (X, U) required by the rejective sampling.

(d) How many pairs of (X, U) do you expect to be needed to generate n = 1000 normally distributed random numbers with this method?

3. In the Metropolis sampling algorithm, one choice of the r(·, ·) function is

r(x, y) = min{1, f(y) K_0(y, x) / [f(x) K_0(x, y)]}.

Show that this choice also leads to f(x) satisfying the balance equation

f(x) K(x, y) = f(y) K(y, x).

Remark: for the discrete case.

4. Suppose P is the transition probability matrix of a finite state space Markov chain and Π = [π_i : i = 0, 1, . . . , N]^τ is a probability vector. Prove that if the π_i satisfy the balance equation π_i p_{ij} = π_j p_{ji} for any i, j, then Π is the limiting distribution of the Markov chain under conditions (a) and (b) in the chapter.

5.
We used to have difficulty "identifying" the marginal posterior distributions of µ and 1/σ^2 when their priors are independent. For the purpose of generating samples from the posterior distributions of µ and λ = 1/σ^2, this is not an issue if the Gibbs sampler is used. This problem is designed to illustrate this point. Assume that we have n observations x_1, . . . , x_n i.i.d. from N(µ, σ^2).

(a) Let N(0, 4) be the prior for µ, and Gamma(d_0 = 5) be the prior for λ = 1/σ^2. Let µ and λ be independent. Write down the joint posterior density function of µ and 1/σ^2 up to a multiplicative constant.

(b) Write code to generate data by the Gibbs sampler method from the above posterior distribution. Generate N = 1000 pairs. Obtain the means of µ and 1/σ^2 based on the following data:

dat <- c(1.1777518, -0.5867896, 0.2283789, -0.1735369, -0.2328192,
         1.0955114, 1.2053680, -0.7216797, -0.3387580, 0.1620835,
         1.4173256, 0.0240219, -0.6647623, 0.6214567, 0.7466441,
         1.9525066, -1.2017093, 1.9736293, -0.1168171, 0.4511754)

(c) Plot the posterior density function in (a) and a kernel density estimator of µ, λ based on the posterior sample obtained in (b). Remark: make use of existing functions for generating Gamma and normal random numbers, and for density estimation.

6. Construct an example where the Gibbs sampler fails to generate random numbers from the marginal distribution of f(u, v), as stated in the theorem in this chapter, when some condition is violated.

Chapter 11 More on asymptotic theory

Various approaches to point estimation have been discussed so far. An estimator is recommended when it has certain desirable properties. Among many things, we like to know its bias and variance, which can be derived from its finite sample distribution. Characterizing exact finite sample distributions is difficult in most cases.
Fortunately, also in most cases, an estimator has a limiting distribution as the sample size increases to infinity. The limiting distribution approximates the finite sample distribution and enables us to make inferences accordingly. In this chapter, we provide additional discussions of asymptotic theory.

11.1 Modes of convergence

Let X, X_1, X_2, . . . be a sequence of random variables defined on some probability space (Ω, B, P).

Definition 11.1. We say {X_n}_{n=1}^∞, or simply X_n, converges in probability to the random variable X if for every ε > 0,

lim_{n→∞} pr(|X_n − X| > ε) = 0.

We use the notation X_n →p X.

Here is an example in which the convergence in probability can be directly verified.

Example 11.1. Let X_1, X_2, . . . be a sequence of i.i.d. random variables, each having an exponential distribution with rate λ > 0. Let X_(1) = min{X_1, X_2, . . . , X_n}. Then X_(1) →p 0.

Proof: Here 0 is considered as a random variable which takes the value 0 with probability 1. Note that for every ε > 0,

pr(|X_(1) − 0| > ε) = pr(X_(1) > ε) = pr(X_1 > ε, . . . , X_n > ε) = pr(X_1 > ε) · · · pr(X_n > ε) = exp(−nλε) → 0

as n → ∞. Hence, by Definition 11.1, X_(1) →p 0.

Definition 11.2. We say X_n converges to X almost surely (or with probability 1) if and only if

P{ω : lim_{n→∞} X_n(ω) = X(ω)} = 1.

We use the notation X_n →a.s. X.

Here is a quick example for the mode of almost sure convergence.

Example 11.2. Let Y be a random variable and let X_n = n^{-1} Y for n = 1, 2, . . .. For any sample point ω ∈ Ω, as n → ∞, we have X_n(ω) = n^{-1} Y(ω) → 0. Hence, pr(ω : lim X_n(ω) = 0) = 1. Therefore X_n → 0 almost surely.

It is natural to ask whether the two modes of convergence defined so far are equivalent. The following example shows that convergence in probability does not imply almost sure convergence. The construction is somewhat involved; please do not spend a lot of time on it.

Example 11.3.
Consider a probability space (Ω, B, P) where Ω = [0, 1], B is the usual Borel σ-algebra, and the probability measure pr is the Lebesgue measure. For any event A ∈ B, 1(A) is an indicator random variable. Define, for k = 0, 1, 2, . . . , 2^{n−1} − 1 and n = 1, 2, . . .,

X_{2^{n−1}+k} = 1([k/2^{n−1}, (k + 1)/2^{n−1}]).

Since any positive integer m can be uniquely written as 2^{n−1} + k for some n and k between 0 and 2^{n−1} − 1, we have well defined X_m for all positive integers m.

On one hand, for every ε > 0, it is seen that

pr(|X_m − 0| > ε) ≤ 2^{−(n−1)} → 0.

Hence, X_m →p 0. On the other hand, for each ω ∈ Ω and any given n, there is a k such that k/2^{n−1} ≤ ω ≤ (k + 1)/2^{n−1}. Hence, no matter how large N is, we can always find an m = 2^{n−1} + k > N for which X_m(ω) = 1, as well as a larger m′ for which X_{m′}(ω) = 0. Therefore, X_m(ω) does not have a limit. This claim is true for every sample point in Ω. Hence, X_m does not converge almost surely to anything.

The following theorem shows that almost sure convergence is the stronger mode of convergence.

Theorem 11.1. If X_n converges almost surely to X, then X_n →p X.

Let B_n, n = 1, 2, . . . be a sequence of events. That is, they are subsets of the sample space Ω and members of B. If a sample point belongs to infinitely many B_n, for example if it belongs to all B_{2n}, we say it occurs infinitely often. The event consisting of the sample points that occur infinitely often is denoted as

{B_n i.o.} = ∩_{n=1}^∞ ∪_{i=n}^∞ B_i.

Theorem 11.2 (Borel-Cantelli Lemma).

1. Let {B_n} be a sequence of events. Then Σ_{n=1}^∞ pr(B_n) < ∞ implies pr({B_n i.o.}) = 0.

2. If B_n, n = 1, 2, . . . are mutually independent, then Σ_{n=1}^∞ pr(B_n) = ∞ implies pr({B_n i.o.}) = 1.

The proof of this lemma relies on the expression {B_n i.o.} = ∩_{n=1}^∞ ∪_{i=n}^∞ B_i. We now introduce other modes of convergence.

11.2 Convergence in distribution

Convergence in distribution is usually discussed together with the modes of convergence for a sequence of random variables.
Although they are connected, convergence in distribution is very different in nature from the other modes of convergence.

Definition 11.3. Let G_1, G_2, . . . be a sequence of (univariate) cumulative distribution functions and let G be another cumulative distribution function. We say G_n converges to G in distribution, denoted G_n →d G, if

lim_{n→∞} G_n(x) = G(x)

for all points x at which G(x) is continuous.

This definition is not based on a sequence of random variables. If there is a sequence of random variables X_1, X_2, . . . and X whose distributions are given by G_1, G_2, . . . and G, we also say that X_n →d X. These random variables need not be defined on the same probability space. When we state that X_n →d X, it means that the distribution of X_n converges to the distribution of X as n → ∞.

Theorem 11.3. If X_n →p X, then X_n →d X. Suppose c is a non-random constant. If X_n →d c, then X_n →p c.

A probability space is generally irrelevant to convergence in distribution. Yet we can create a shadow probability space for the corresponding random variables.

Theorem 11.4 (Skorokhod's representation theorem). If G_n →d G, then there exist a probability space (Ω, B, P) and random variables Y_1, Y_2, . . . and Y such that

1. Y_n has distribution G_n for n = 1, 2, . . . and Y has distribution G;

2. Y_n →a.s. Y.

The following result is intuitively right but hard to prove unless the above theorem is applied.

Example 11.4. If X_n →d X and g is a real, continuous function, then g(X_n) →d g(X).

This is a simple exercise problem. There is an equivalent definition of the mode of convergence in distribution. We state it here as a theorem.

Theorem 11.5. Let X_1, X_2, . . . be a sequence of random variables. Then X_n →d X if and only if E{g(X_n)} → E{g(X)} for all bounded, uniformly continuous real valued functions g.

11.3 Stochastic Orders

Random variables come with different sizes.
When a number of random variable sequences are involved in a problem, it is helpful to know their relative sizes. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables. If $X_n \xrightarrow{p} 0$, we say $X_n = o_p(1)$. That is, compared with the constant $1$, the size of $X_n$ becomes less and less noticeable. Naturally, we may also want to compare $X_n$ with other sequences of numbers.

Definition 11.4. Let $\{a_n\}$ be a sequence of positive constants. We say $X_n = o_p(a_n)$ if $X_n/a_n \xrightarrow{p} 0$ as $n \to \infty$. Let $\{Y_n\}_{n=1}^{\infty}$ be another sequence of random variables. We say $X_n = o_p(Y_n)$ if and only if $X_n/Y_n = o_p(1)$.

How do we describe that $X_n$ and $a_n$ are of about the same magnitude? Intuitively, this should be the case when $X_n/a_n$ stays clear of both $0$ and infinity. In common practice, we only exclude the latter. A rigorous mathematical definition is as follows:

Definition 11.5. We say $X_n = O_p(a_n)$ if and only if for every $\epsilon > 0$, there exists an $M$ such that for all $n$, $pr(|X_n/a_n| \ge M) < \epsilon$.

Note that $X_n = O_p(a_n)$ only reveals that $|X_n|$ is not larger than $a_n$ in order; the size of $|X_n|$ can, however, be much smaller than $a_n$.

Example 11.5. Assume $X_1, X_2, \ldots$ is a sequence of i.i.d. Poisson random variables with mean $1$. Then $\max\{X_1, X_2, \ldots, X_n\} = O_p(\log n)$.

The next example essentially gives an equivalent definition.

Example 11.6. If for every $\epsilon > 0$ there exists an $M$ such that for all $n$ large enough, $pr(|X_n/a_n| \ge M) < \epsilon$, then $X_n = O_p(a_n)$.

The following is a useful result.

Example 11.7. Suppose $X_n \to X$ in distribution. Then $X_n = O_p(1)$.

11.3.1 Application of stochastic orders

Stochastic orders enable us to ignore irrelevant details about $X_n$ and $Y_n$ in asymptotic derivations. Some useful facts are as follows.

Lemma 11.1. The stochastic orders have the following properties.
1. If $X_n = O_p(1)$ and $Y_n = o_p(1)$, then $-X_n = O_p(1)$ and $-Y_n = o_p(1)$.
2. If $X_n = O_p(1)$ and $Y_n = O_p(1)$, then $X_n Y_n = O_p(1)$ and $X_n + Y_n = O_p(1)$.
3. If $X_n = o_p(1)$ and $Y_n = o_p(1)$, then $X_n Y_n = o_p(1)$ and $X_n + Y_n = o_p(1)$.
4.
If $X_n = o_p(1)$ and $Y_n = O_p(1)$, then $X_n Y_n = o_p(1)$ and $X_n + Y_n = O_p(1)$.

If $X_n$ converges to $X$ in distribution and $Y_n$ differs from $X_n$ by a random amount of size $o_p(1)$, we expect that $Y_n$ also converges to $X$ in distribution. This is a building block for more complex approximation theorems.

Lemma 11.2. Assume $X_n \xrightarrow{d} X$ and $Y_n = X_n + o_p(1)$. Then $Y_n \xrightarrow{d} X$.

Proof: Let $x$ be a continuity point of the c.d.f. of $X$, and let $\epsilon > 0$ be such that $x + \epsilon$ is also a continuity point of the c.d.f. of $X$. Then
\begin{align*}
pr(Y_n \le x) &= pr(Y_n \le x, |Y_n - X_n| \le \epsilon) + pr(|Y_n - X_n| > \epsilon, Y_n \le x) \\
&\le pr(X_n \le x + \epsilon) + pr(|Y_n - X_n| > \epsilon) \\
&\to pr(X \le x + \epsilon).
\end{align*}
The second term goes to zero because $Y_n - X_n = o_p(1)$. For any given $x$, $\epsilon$ can be chosen arbitrarily small, and letting $\epsilon \downarrow 0$ is justified by the monotonicity of distribution functions. Thus we must have
\[
\limsup_{n \to \infty} pr(Y_n \le x) \le pr(X \le x).
\]
Similarly, we can show
\[
\liminf_{n \to \infty} pr(Y_n \le x) \ge pr(X \le x).
\]
The two inequalities together imply $pr(Y_n \le x) \to pr(X \le x)$ for all $x$ at which the c.d.f. of $X$ is continuous. Hence $Y_n \xrightarrow{d} X$.

The above result makes the next lemma obvious.

Lemma 11.3. Here are some additional properties of the stochastic orders. If $a_n \to a$, $b_n \to b$, and $X_n \xrightarrow{d} X$, then $a_n X_n + b_n \xrightarrow{d} aX + b$. If $Y_n \xrightarrow{p} a$, $Z_n \xrightarrow{p} b$, and $X_n \xrightarrow{d} X$, then $Y_n X_n + Z_n \xrightarrow{d} aX + b$.

The following well-known theorem becomes a simple implication.

Theorem 11.6 (Slutsky's Theorem). Let $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ where $c$ is a finite constant. Then
1. $X_n + Y_n \xrightarrow{d} X + c$;
2. $X_n Y_n \xrightarrow{d} cX$;
3. $X_n / Y_n \xrightarrow{d} X/c$ when $c \ne 0$.

Although the formal Slutsky's Theorem is stated as above, many of us regard Lemma 11.2 as the conclusion of this theorem. Each of the conclusions in the above theorem can be easily proved using Lemma 11.2. In these lecture notes, we will refer to Lemma 11.2 as Slutsky's Theorem.

Here is another convenient theorem.

Theorem 11.7. Let $a_n$ be a sequence of real values and $X_n$ a sequence of random variables. Suppose $a_n \to \infty$ and $a_n(X_n - \mu) \xrightarrow{d} Y$.
If $g(x)$ is a function with a continuous derivative at $x = \mu$, then
\[
a_n\{g(X_n) - g(\mu)\} \xrightarrow{d} g'(\mu) Y.
\]

The most useful result for convergence in distribution is the central limit theorem.

Theorem 11.8 (Central Limit Theorem). Assume $X_1, X_2, \ldots$ are i.i.d. random variables with $E(X) = 0$ and $var(X) = 1$. Then as $n \to \infty$,
\[
\sqrt{n}\,\bar{X}_n \xrightarrow{d} N(0, 1).
\]
If, instead, $E(X) = \mu$ and $var(X) = \sigma^2$, then
1. $\sqrt{n}\,\sigma^{-1}(\bar{X}_n - \mu) \xrightarrow{d} N(0, 1)$;
2. $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$;
3. $n^{-1/2} \sum_{i=1}^n \{(X_i - \mu)/\sigma\} \xrightarrow{d} N(0, 1)$;
4. $n^{-1/2} \sum_{i=1}^n (X_i - \mu) \xrightarrow{d} N(0, \sigma^2)$.

It is not advisable to state $\bar{X}_n - \mu \xrightarrow{d} N(0, \sigma^2/n)$: the right-hand side is not a limit at all.

Example 11.8. Let $X_n, Y_n$ be a pair of independent Poisson distributed random variables with means $n\lambda_1$ and $n\lambda_2$. Define $T_n = (Y_n/X_n) 1(X_n > 0)$. Then $T_n$ is asymptotically normal.

1. Prove that if $X_n$ converges almost surely to $X$, then $X_n \xrightarrow{p} X$.

2. Let $X_1, X_2, \ldots$ be a sequence of random variables such that $X_n \xrightarrow{d} X$ for some random variable $X$. Prove that $X_n = O_p(1)$.

3. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, each having an exponential distribution with rate $\lambda > 0$. Let $X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}$. Prove that, as $n \to \infty$,
(a) $X_{(1)} = o_p(n^{-1/2})$;
(b) $X_{(1)} = O_p(n^{-1})$;
(c) $X_{(1)} = O_p(1)$.
Remark: the orders proposed in this assignment problem are not the most accurate possible.

4. Assume we have a sequence of random variables and consider the dynamics as $n \to \infty$.
(a) If $X_n = O_p(n^{-1})$, does it imply $X_n = o(1)$ in the almost sure sense?
(b) If $E\{nX_n\} = 1$ for all $n$, show that $X_n = O_p(n^{-1})$. Is it possible that $X_n = O_p(n^{-2})$ in some case?

5. Let $a_n$ be a sequence of real values and $X_n$ a sequence of random variables. Suppose $a_n \to \infty$ and $a_n(X_n - \mu) \xrightarrow{d} Y$. If $g(x)$ is a function satisfying the Lipschitz condition
\[
|g(x) - g(y)| \le C|x - y|
\]
for some constant $C$, show that $a_n\{g(X_n) - g(\mu)\} = O_p(1)$.

6. Prove the following stochastic order properties.
(a) If $X_n = O_p(1)$ and $Y_n = o_p(1)$, then $-X_n = O_p(1)$ and $-Y_n = o_p(1)$.
(b) If $X_n = O_p(1)$ and $Y_n = O_p(1)$, then $X_n Y_n = O_p(1)$ and $X_n + Y_n = O_p(1)$.
(c) If $X_n = o_p(1)$ and $Y_n = o_p(1)$, then $X_n Y_n = o_p(1)$ and $X_n + Y_n = o_p(1)$.
(d) If $X_n = o_p(1)$ and $Y_n = O_p(1)$, then $X_n Y_n = o_p(1)$ and $X_n + Y_n = O_p(1)$.

7. Let $X_n, Y_n$ be a pair of independent Poisson distributed random variables with means $n\lambda_1$ and $n\lambda_2$. Define $T_n = (Y_n/X_n) 1(X_n > K)$ for some $K > 0$. Show that $\sqrt{n}(T_n - \lambda_2/\lambda_1) \xrightarrow{d} N(0, \sigma^2)$ and work out the expression for $\sigma^2$.

8. Suppose $X_n \xrightarrow{d} X$ as $n \to \infty$.
(a) Give an example in which $E(X_n) \not\to E(X)$.
(b) Give a nontrivial condition on top of $X_n \xrightarrow{d} X$ so that, with its addition, $E(X_n) \to E(X)$; supply a solid proof.

9. Let $X_1, X_2, \ldots, X_n$ be a random sample (i.i.d.) from the family of uniform distributions with density function $f(x; \theta) = \theta^{-1} 1(0 < x < \theta)$ and parameter space $\Theta = \mathbb{R}^+$; namely, it contains all positive real values.
(a) Find the maximum likelihood estimator $\hat{\theta}$ of $\theta$. What is the value of the likelihood function at $\hat{\theta}$?
(b) Find an appropriate $a_n$ such that $a_n(\hat{\theta} - \theta)$ has a non-degenerate limiting distribution. Work out the limiting distribution.

Chapter 12

Hypothesis test

Recall again that a statistical model is a family of distributions. When the distributions are parameterized, the model is parametric; otherwise, the model is nonparametric. One may notice that regression models are not exceptions to this definition. Suppose a random sample from a distribution $F$ is obtained/observed. A statistical model assumption is to specify a distribution family $\mathcal{F}$ such that $F$ is believed to be a member of it. Often, we are interested in a special subfamily $\mathcal{F}_0$ of $\mathcal{F}$. The statistical problem is to decide whether or not $F$ is a member of $\mathcal{F}_0$ based on a random sample from this unknown $F$. There might be situations where the question can be answered with certainty. Most often, statistics are used to quantify the strength of the evidence against $\mathcal{F}_0$ from chosen angles.
Hypothesis testing is an approach which recommends whether or not $\mathcal{F}_0$ should be rejected. It also implicitly recommends a distribution in the complement of $\mathcal{F}_0$ if $\mathcal{F}_0$ is rejected. We regard $\mathcal{F}_0$ as the null hypothesis and also denote it as $H_0$. Its complement in $\mathcal{F}$ forms the alternative hypothesis and is denoted $H_a$ or $H_1$.

The specification of $\mathcal{F}$ is based on our knowledge of the subject matter and the properties of probability distributions. For instance, a binomial distribution family is used for the number of passengers who show up for a specific flight, the number of students who show up for a class, and so on. The choice of $\mathcal{F}_0$ often relates to the background of the application. We provide a number of scenarios in the next section.

12.1 Null hypothesis

Where is $\mathcal{F}_0$ from? The question is more complicated than we may believe. Here are some examples motivated by various classical books.

(a) The null hypothesis may correspond to a prediction arising from scientific curiosity, and one wishes to use data to examine its validity. We suspect that the sex ratio of newborn babies is 50%. In this case, one may collect data to critically examine how well this belief approximates the real world.

(b) In genetics, when two genes are located on two different chromosomes, their recombination rate is exactly $\theta = 0.5$ according to Mendel's law. Rejection of the null hypothesis $\theta = 0.5$ based on experimental or observational data leads to meaningful scientific claims. Scientists or geneticists in this and similar cases must bear the burden of proof: the null hypothesis stands on the opposite side of their convictions.

(c) Some statistical methods are developed under certain distributional assumptions on the data, such as the analysis of variance. If the normality assumption is severely violated, the related statistical conclusions become dubious. A test with normality as the null hypothesis is often conducted.
We are alarmed only if there is a serious departure from normality; otherwise, we go ahead and analyze the data under the normality assumption.

(d) $H_0$ may assert a complete absence of structure in some sense. So long as the data are consistent with $H_0$, it is not justified to claim that the data provide clear evidence in favour of some particular kind of structure. Does living near a hydro power line make children more likely to have leukaemia? The null hypothesis would suggest the cases are distributed geographically at random.

(e) The quality of products from a production line fluctuates randomly within some range over time. One may set up a null hypothesis that the system is in its normal status, characterized by some key specific parameter values. The rejection of the null hypothesis sets off an alarm that the system is out of control.

(f) When a new medical treatment is developed, its superiority over the standard treatment must be established in order for it to be approved. Naturally, we set the null hypothesis to be "there is no difference between the two treatments".

(g) There are situations where we wish to show that a new medicine is not inferior to the existing one. This is often motivated by the desire to produce a specific medicine at a lower cost. One needs to think carefully about what the null hypothesis should be here.

(h) In linear regression models, we are often interested in testing whether a regression coefficient has a value different from zero. We put the zero value as the null value; rejection of it implies the corresponding explanatory variable has a non-nil influence on the response.

In all these examples, we do not reject $H_0$ unless the evidence against it is mounting. Often, $H_0$ is not rejected not because it holds true perfectly, but because the data set does not contain sufficient information, or the departure is too mild to matter in a scientific sense, or the departure from $H_0$ is not in the direction of concern.
It is hard to distinguish these causes. We will come to this issue again after the introduction of the alternative hypothesis.

12.2 Alternative hypothesis

In the last section, we discussed the motivation for choosing a subset $\mathcal{F}_0$ of $\mathcal{F}$ to form $H_0$. It is natural to form the alternative hypothesis $H_a$ or $H_1$ from the remaining distributions in $\mathcal{F}$. If so, the alternative hypothesis depends heavily on our choice of $\mathcal{F}$. Since any data set is extreme in some respects, severe departure from $\mathcal{F}_0$ can always be established. Thus, it can be meaningless to ask absolutely whether $\mathcal{F}_0$ is true by allowing $\mathcal{F}$ to contain all imaginable distributions. The question becomes meaningful only when a proper alternative hypothesis is proposed.

The alternative hypothesis serves the purpose of specifying the direction of departure of the true model from the null hypothesis that we care about. In the example where a new medicine is introduced, the ultimate goal is to show that it extends our lives. We put down a null hypothesis that the new medicine is not better than the existing one. The goal of the experiment, and hence of the statistical hypothesis test, is to show the contrary: the new medicine is better. Thus, the alternative hypothesis specifies the direction of the departure we intend to detect.

In regression analysis, we may want to test the normality assumption on the error term to ensure the suitability of the least squares approach. In this case, we often worry about whether the true distribution has a heavier tail probability than the normal distribution. Thus, we want to detect departures toward "having a heavy tail". If the error distribution is not normal but uniform on a finite interval, for instance, we may not care at all. Therefore, if $H_0$ is not rejected by a hypothesis test, we have not provided any evidence that $H_0$ is true. All we have shown is that the error distribution does not seem to have a heavy tail.
According to genetic theory, the recombination rate $\theta$ of two genes on the same chromosome is lower than 0.5. Hence, if the data lead to a very high observed recombination rate, we may have evidence to reject the null hypothesis of $\theta = 0.5$. However, this does not support the sought-after genetic claim that the two genes are linked. To establish linkage, $\mathcal{F}$ would be chosen as all binomial distributions with probability of success no more than 0.5.

In many social sciences, theories are developed in which the response of interest is related to some explanatory variable. When one can afford to collect a very large data set, such a connection is always confirmed by rejecting the null hypothesis that the correlation is nil. As long as the theory is not complete nonsense, a low level of connection inevitably exists. When the data size is large, even a practically meaningless connection will be detected with statistical significance.

In summary, specifying the alternative hypothesis is more than simply writing down the possible distributions of the data in addition to those included in the null. It specifies the direction of the departure from the null model which we hope to detect, or whose lack of fit we wish to declare.

We generally investigate the hypothesis test problem under the assumption that the data are generated from a distribution inside $H_0$, and study what happens if this distribution is instead a member of $H_1$. This practice is convenient for statistical research. We should not take it as the truth in applications. It could happen that the data suggest the truth is not in $H_0$, that $H_1$ is a slightly better choice, and yet the truth is in neither $H_0$ nor $H_1$. Hence, by rejecting $H_0$, the hypothesis test itself does not prove that $H_1$ contains the truth.

12.3 Pure significance test and p-value

Suppose a random sample $X = x$ is obtained from a distribution $F_0$ and the statistical model is $\mathcal{F}$. We hope to test the null hypothesis $H_0: F_0 \in \mathcal{F}_0$.
Let $T(x)$ be a statistic to be used for the statistical hypothesis test; hence, we call it the test statistic. Ideally, it is chosen to have two desirable properties:

(a) the sampling distribution of $T$ when $H_0$ is true is known (not merely up to a distribution family but as a specific distribution), at least approximately. If $H_0$ contains many distributions, this property implies that the sampling distribution of $T$ remains the same, at least approximately, whichever distribution in $\mathcal{F}_0$ that $X$ may have. In other words, $T$ is an ancillary statistic under $H_0$.

(b) the larger the observed value of $T$, the stronger the evidence of departure from $H_0$ in the direction of $H_1$.

If a statistic has these two properties, we are justified in rejecting the null hypothesis when the realized value of $T$ is large. Let $t_0 = T(x)$ be its realized/observed value and
\[
p_0 = pr(T(X) \ge t_0; H_0),
\]
which is the probability that $T(X)$ is at least the observed value when the null hypothesis is true. When $pr(T(X) = t_0; H_0) > 0$, a continuity correction may be applied. That is, we may revise the definition to
\[
p_0 = pr(T(X) > t_0; H_0) + 0.5\, pr(T(X) = t_0; H_0).
\]
In general, this is just a convention, not an issue of "correctness". The smaller the value of $p_0$, the stronger the evidence that the null hypothesis is false. We call $p_0$ the p-value of the test.

Remark: the definition of the p-value is well motivated when a test statistic with the above two desired properties has been introduced. Without the known-distribution property, $pr(T(X) \ge t_0; H_0)$ does not have a definitive answer. Without the other property, we are not justified in being exclusively concerned with the event $T(X) \ge t_0$ rather than other possible values of $T(X)$. If $T$ is a test statistic with properties (a) and (b), and $g$ is a strictly monotone increasing function, then $g(T)$ makes another test statistic, and the p-value based on $g(T)$ will be the same as the p-value based on $T$.
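For a discrete statistic, the continuity-corrected p-value above is a short computation. Here is a Python sketch; the binomial-under-$H_0$ setup and the numbers $n = 10$, $t_0 = 8$ are illustrative choices, not from the text:

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    # pr(T = k) when T ~ Binomial(n, p); p = 0.5 plays the role of H0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def corrected_p_value(n, t0, p=0.5):
    # p0 = pr(T > t0; H0) + 0.5 * pr(T = t0; H0), as in the text
    tail = sum(binom_pmf(n, k, p) for k in range(t0 + 1, n + 1))
    return tail + 0.5 * binom_pmf(n, t0, p)

print(corrected_p_value(10, 8))   # (11 + 0.5 * 45) / 1024, about 0.0327
```

The uncorrected version $pr(T \ge t_0; H_0)$ would give $56/1024 \approx 0.0547$ here, so the convention chosen can move a borderline case across the traditional 5% line.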
Since there is no standard choice of $T(x)$, there is not a definite p-value for a specific pair of hypotheses, even if the test statistic $T(x)$ has these two properties. Because of this, the definition of the p-value has been elusive in many books.

Assume the issues mentioned above have been settled. If, magically, $p_0 = 0$, then $H_0$ cannot be true, or something impossible would have been observed. When $p_0$ is very small, then either we have observed an unlikely event under $H_0$, or the rare event is much better explained by a distribution in $H_1$. Hence, we are justified in rejecting $H_0$ in favour of $H_1$. Take notice that a larger $T(x)$ value is more likely if the distribution $F$ is a member of $H_1$.

How small should $p_0$ be in order for us to reject $H_0$? A statistical practice is to set up a standard, say 5%, so we commonly reject $H_0$ when $p_0 < 5\%$. The choice of 5% is merely a convention; there is no scientific truth behind this magic cut-off point. There is a joke related to this number: scientists tell their students that 5% was found to be optimal by statisticians, and statisticians tell their students that 5% was chosen based on some scientific principles. Incidentally, the Food and Drug Administration (FDA) in the United States uses 5% as its gold standard. If a new medicine beats the existing one by a pre-specified margin, and this is demonstrated by a test of significance at the 5% level, then the new medicine will be approved; of course, we assume that all other requirements have been met. Most research journals used to accept results established via statistical significance at the 5% level. You will pretty soon be under pressure to find a statistical method that results in a p-value smaller than 5% for a scientist. Recently, however, this practice has been discredited.

Not all test statistics we recommend have both properties (a) and (b). There are practical reasons behind the use of statistics without these properties.
When their usage leads to controversies, it is helpful to review the reasons why properties (a) and (b) are desirable and to interpret the data analysis outcomes accordingly.

In the above discussion, no specifics about $H_0$ and $H_1$ are given, nor anything specific about the test statistic. So the discussion is purely conceptual, leading to the term "pure significance" test. One is advised not to take this term very seriously.

12.4 Issues related to p-value

After one has seen the data, one can easily find that the data are extreme in some way. One may select a null hypothesis accordingly, and most likely the p-value will be small enough to declare statistical significance (if the old standard is permitted). This problem is well known but hard to prevent. After you have seen the final exam results of stat460/560, you may compare the average marks between undergraduate and graduate students, between male and female students, between foreign and domestic students, between younger and older students, and in many more ways. If the 5% standard on the p-value is applied to each test, pretty soon we will find one set of hypotheses that tests significant. This is statistically invalid: to find one out of 20 tests with its p-value below 5% is much more likely than to find a p-value below 5% for a single pre-decided test.

A pharmaceutical company must provide a detailed protocol before a clinical trial is carried out. If the data fail to reject the null hypothesis but point to another meaningful phenomenon, the FDA will not accept a result based on analysis of the current data. The company must conduct another clinical trial to establish the new claim. For example, if they try to show that eating carrots reduces the rate of stomach cancer, yet the data collected imply a reduction in the rate of liver cancer, the conclusion will not be accepted. One could have examined the rates of a thousand cancers: liver
By this standard, Columbus did not discover America because he did not put discovering America into his protocol. Rather, he aimed to find a short cut to India. Another issue is the difference between Statistical significance and the Scientific significance. Consider a problem in lottery business, each ball, numbered from 1 to 49, should be equally likely to be selected. Suppose I claim that the odd numbers are more likely to be sampled than the even numbers. The rightful probability of an odd ball being selected should be p = 25/49. In the real world, nothing is perfect. Assume that the truth is p = 25/49 + 10−6. It is not hard to show that if we conduct 1024 trials, the chance that the null hypothesis p = 25/49 being rejected is practically 1, at 5% level or any reasonable level based on a reasonable test. Yet such a statistical significant result is nonsensical to a lottery company. They need not be alarmed unless the departure from p = 25/49 is more than 10−3, presumably. In a more practical example, if a drug extends the average life expectancy by one-day, it is not significant no matter how small the p-value of the test produces. There are abundant discussions on the usefulness of p-value. There has been suggestions of not teaching the concept of the p-value which I beg to differ. The key is to make everyone understand what it presents, rather than frantically searching for a test (analysis) that gives a p-value smaller than 0.05. Here is an example suggested by students. It is not as meaningful to be 100% sure that someone stole 10 dollars from a store. It is a serious claim if we are 50% sure that someone killed the store owner. In regression analysis, a regression coefficient is often declared highly significant. It generally refers to a very small p-value is obtained when testing for its value being zero. 
This is unfortunate: the regression coefficient may be scientifically indistinguishable from zero, but its effect is magnified under a microscope created by a big data set.

12.5 General notion of statistical hypothesis test

Suppose a random sample $X$ from $F$ is taken. The null hypothesis $H_0$, as a subset of $\mathcal{F}$, is specified, and $H_1$ is made of the rest of the distributions in $\mathcal{F}$. No matter how a test statistic is constructed, in the end one divides the range of $X$ into two (potentially three) non-overlapping regions: $C$ and its complement $C^c$. We will come back to the potential third region. The hypothesis test procedure rejects $H_0$ when the observed value of $X$ satisfies $x \in C$. Thus, $C$ is called the critical region. When $x \notin C$, we retain the null hypothesis. However, I do not advocate the terminology "accept $H_0$": such a statement can be misleading. When we fail to prove an accused guilty, it does not imply innocence. "Not guilty $\ne$ innocent."

When the true distribution $F \in H_0$ and yet $x \in C$ occurs, the null hypothesis $H_0$ is erroneously rejected. The probability $pr(X \in C)$ in this case is called the type I error. We use $\alpha(F) = pr(X \in C; F)$ as a function of $F$ on $H_0$, and define
\[
\alpha = \sup_{F \in H_0} pr(X \in C; F) = \sup_{F \in H_0} \alpha(F)
\]
as the size of the test. The type I error is not the same as the size of the test, because $H_0$ may contain many distributions. The size of a test is determined by the "least favourable distribution", the one that maximizes the probability of $X \in C$. Under simple models, it is easy to identify such a least favourable distribution; in a general context, we have long given up the effort of doing so.

If $x \notin C$ yet $F \in H_1$, we fail to reject $H_0$; the corresponding probability is called the type II error. For each distribution $F \in H_1$, we call $\beta(F) = pr(X \in C; F)$ the power function of $F$ on $H_1$. If $\mathcal{F}$ is a parametric model with parameter $\theta$, it is more convenient to use $\beta(\theta) = pr(X \in C; \theta)$, $\theta \in H_1$.
The type II error is therefore $\gamma(\theta) = 1 - \beta(\theta)$. Notational conventions differ from one textbook to another; one should always read the "fine print" before determining whether $\beta(\theta)$ denotes the power function or the type II error.

We do not usually discuss the situation where $F \notin \mathcal{F}$. If this happens, a "third type" of error has occurred. One should take this possibility into serious consideration in real-world applications.

Example 12.1 (One-sample t-test). Assume we have a random sample from a distribution that belongs to the family $\mathcal{F} = \{N(\theta, \sigma^2)\}$. We test the null hypothesis $H_0: \theta = 0$. Let
\[
T(x) = \frac{\sqrt{n}\,\bar{x}}{s}
\]
where $\bar{x} = n^{-1}(x_1 + x_2 + \cdots + x_n)$ is the realized value of $\bar{X}$ and $s^2$ is the realized value of the sample variance. It is seen that $T(X)$ has a t-distribution regardless of which distribution in $H_0$ is the true distribution of $X$; thus, it has property (a). At the same time, the larger the value of $|T|$, the more obvious it is that the null hypothesis is inconsistent with the data. Thus, $|T|$ also has property (b). In other words, $|T|$, rather than $T$, makes a desirable test statistic.

Let $t_{0.975, n-1}$ be the 97.5% quantile of the t-distribution with $n - 1$ degrees of freedom. We may put
\[
C = \{x : |T(x)| \ge t_{0.975, n-1}\}
\]
as the critical region of a test. If so, its size is $\alpha = pr(|T(X)| \ge t_{0.975, n-1}; H_0) = 0.05$. It is less convenient to write down its power function. The p-value of this test is $p_0 = pr(|T(X)| \ge |T(x)|; H_0)$ where $T(x)$ is the realized value of $T$. Rejecting $H_0$ whenever $p_0 < 0.05$ is equivalent to rejecting $H_0$ whenever $x \in C$. Providing a p-value has an added benefit: we know whether $H_0$ is rejected with barely sufficient evidence or with very strong evidence.

Again, a p-value should be read with a pinch of salt. Even if the true $\theta$-value is only slightly different from 0, the evidence against $H_0$ can be made very strong with a large sample size $n$.
Hence, a small p-value shows how strong the evidence is against $H_0$; it does not necessarily indicate that $H_0$ is an extremely poor model for the data. To avoid the dilemma implied by over-reliance on small p-values, it might be better to specify $H_1$ as $|\theta| > 0.1$ and put $H_0$ as $|\theta| < 0.1$ instead. We have placed an arbitrary value 0.1 here; it is not hard to come up with a sensible small value in a real-world application.

12.6 Randomized test

Particularly in theoretical development, we often hope to construct a test with exactly a pre-given size. The above approach may not be feasible in some circumstances.

Example 12.2. Suppose we observe $X$ from a binomial model with $n = 2$ and probability of success $\theta \in (0, 1)$. Let the desired size of the test be $\alpha = 0.05$ for the null hypothesis $\theta = 0.5$. In this case, we have only 8 candidates for the critical region $C$, and none of them results in a test of the exact size $\alpha = 0.05$.

An artificial approach to finding a test with the pre-specified size is as follows. We do not reject $H_0$ if $X = 1$. When $X = 0$ or $2$, we toss a biased coin and reject $H_0$ when the outcome is a head. By selecting a coin such that $pr(\mbox{Head}) = 0.1$, the probability of rejecting $H_0$ based on this approach is exactly 0.05 when $\theta = 0.5$. Thus, we have artificially attained the required size 0.05. The region $\{0, 2\}$ is the third region in the range of $X$ mentioned previously.

Abstractly, a statistical hypothesis test is represented as a function $\phi(x)$ such that $0 \le \phi(x) \le 1$. We reject $H_0$ with probability $\phi(x)$ when $X = x$. When $\phi(x)$ takes only the values 0 and 1, the sample space is neatly divided into the critical region and its complement. Otherwise, the region where $0 < \phi(x) < 1$ is a randomization region; when $x$ falls into that region, we randomize the decision. Defining a test by a function $\phi(x)$ is mathematically convenient. Note that its size is
\[
\alpha = \sup_{F \in H_0} E\{\phi(X); F\}
\]
and its power function on $F \in H_1$ is given by $\beta(F) = E\{\phi(X); F\}$.
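The test-function formulation can be made concrete on Example 12.2. In this Python sketch the dictionary representation of $\phi$ is an illustrative choice; the size and power computations follow the formulas above:

```python
from math import comb

# The randomized test of Example 12.2 as a test function phi:
# X ~ Binomial(2, theta), phi(1) = 0, phi(0) = phi(2) = 0.1.
def pmf(k, theta, n=2):
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

phi = {0: 0.1, 1: 0.0, 2: 0.1}

def expected_phi(theta):
    # E{phi(X); theta}: the type I error for theta in H0,
    # the power for theta in H1
    return sum(phi[k] * pmf(k, theta) for k in range(3))

size = expected_phi(0.5)   # H0 is simple, so the sup is at theta = 0.5
print(size)                # exactly 0.05, as claimed in the example
print(expected_phi(0.9))   # the power function evaluated at theta = 0.9
```

No non-randomized critical region achieves size 0.05 here, since under $\theta = 0.5$ the candidate regions have probabilities built from $\{0.25, 0.5\}$ only; the randomization supplies the missing granularity.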
The type I error is defined for $F \in H_0$ and is given by $\alpha(F) = E\{\phi(X); F\}$. We do not place many restrictions on $\phi(x)$ for it to be used as a test function. Instead, we ask when $\phi(x)$ is a good test. This question leads to the call for optimality definitions; we will come to this issue soon.

12.7 Three ways to characterize a test

The discussions in the previous sections have presented three hypothesis test procedures.

1. Define a test statistic $T$ such that we reject $H_0$ when $T$ is large. Preferably, $T$ has the two specific properties: a known sampling distribution that is the same under whichever distribution in $H_0$; and larger observed values of $T$ indicating more extreme departure of $F$ from $H_0$ in the direction we try to capture. We compute the p-value as $p = pr(T \ge t_{obs}; H_0)$ where $t_{obs}$ is the observed value. When $T$ has a discrete distribution, we may apply a continuity correction: $p = pr(T > t_{obs}; H_0) + 0.5\, pr(T = t_{obs}; H_0)$. We reject $H_0$ if $p$ is below some pre-decided level, usually 5%.

There seems to be no universal and rigorous definition of the p-value. A general requirement for p-value calculation is to have a test statistic which takes larger values when $H_0$ is violated. After that, one identifies a most likely distribution of the data, say $\hat{F}$, and computes $p = pr(T \ge t_{obs}; \hat{F})$, in which $\hat{F}$ is not regarded as random. This value is generally regarded as the p-value of the test.

2. Define a critical region $C$ in terms of the range of $X$. When the realized value $x \in C$, we reject $H_0$. The region $C$ is often required to have a given size $\alpha$: $\sup_{H_0} pr(X \in C) = \alpha$.

3. When $X$ is discrete, we may get into a situation where no critical region has a pre-specified size $\alpha$. This is not problematic in applications, but it is problematic for theoretical discussions. Hence, we define a test as a function $\phi(x)$ taking values between 0 and 1. We reject $H_0$ with probability $\phi(x)$ where $x$ is the realized/observed value of $X$. The size of this test is calculated as $\sup_{H_0} E\{\phi(X)\}$.
Method 1 is a special case of method 2, obtained by letting $C = \{x : T(x) > k\}$ for some $k$. Both methods 1 and 2 can be regarded as special cases of method 3 by letting $\phi(x) = 1(x \in C)$: we reject $H_0$ with probability 1 when $x \in C$, and do not reject $H_0$ otherwise.

Clearly, the trivial test $\phi(x) = \alpha$ has size $\alpha$. Its existence ensures that a test of any specific size between 0 and 1 is possible. The statistical issue is to find one with better properties. Suppose $\tilde{\phi}(x)$ is a test of size $\tilde{\alpha} < \alpha$ for a pair of hypotheses $H_0$ and $H_1$. There must exist a test $\phi(x)$ of size $\alpha$ such that $E\{\phi(X); F\} \ge E\{\tilde{\phi}(X); F\}$ for every $F \in H_1$.

Chapter 12 questions

1. Suppose $T(x)$ is a test statistic with the desired properties and $g(\cdot)$ is a strictly monotone increasing function.
(a) Show that a test based on $g(T)$ is equivalent to a test based on $T$. That is, the two resultant tests will have the same rejection region when the size of the test is set at the same level $\alpha$.
(b) What is the consequence of using $T$ as a test statistic when its distribution depends on $F$ in the null hypothesis $H_0$?

2. Let $X_i$, $i = 1, 2, \ldots, n$ be a set of i.i.d. observations from $N(\mu, \sigma^2)$. Consider the test problem $H_0: \mu = \sigma$ versus $H_1: \mu > \sigma$. Let $\bar{X}_n$ and $s_n^2$ be the sample mean and variance. Define
\[
T_n = \frac{\bar{X}_n}{s_n} = \frac{(\bar{X}_n - \mu) + \mu}{s_n}.
\]
(a) Illustrate that the statistic $T_n$ has the desired properties for the purpose of a statistical significance test.
(b) Suppose the observed value of $T_n$ is $t_0$. What is the p-value of the test based on $T_n$, expressed as a probability?

3. Based on your life experience, make up a story to demonstrate the difference between "statistical significance" and "scientific significance". Try your best so that even a politician can understand your point.

4. Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$ be a set of i.i.d. bivariate observations with joint probability density function
\[
f(x, y; \theta_1, \theta_2) = \frac{xy}{\theta_1^2 \theta_2^2} \exp\Big(-\frac{x}{\theta_1} - \frac{y}{\theta_2}\Big).
\]
Consider the test problem of H0 : θ1 = θ2 versus H1 : θ1 > θ2. Let X̄n and Ȳn be the sample means and define Tn = log{X̄n} − log{Ȳn}.

(a) Illustrate that Tn has the desired properties for the purpose of a statistical significance test.

(b) Suppose the observed value of Tn is t0. What is the p-value of the test based on Tn, expressed as a probability? Remark: I am looking for an expression in the spirit of P(Tn ∈ [1, 2]).

(c) Bonus: give a concrete justification (not mathematical details) that X̄n/Ȳn has an F-distribution, and indicate the degrees of freedom.

5. (a) What is the difference between the type I error and the size of a test?

(b) Consider the t-test for a one-sided hypothesis under the normal model as in Example 12.1. Let H0 : θ ≤ 0 and H1 : θ > 0. Suppose n = 20 and we reject H0 when T(X) > 2.0. (i) What is the size of this test? (ii) Plot the type I error of the test as a function of θ/σ. (iii) Plot the power of the test as a function of θ/σ. Remark: pick appropriate ranges in your plots.

6. Consider the case where we have i.i.d. observations X1, . . . , Xn from N(θ, σ²). We wish to use the one-sample t-test for the one-sided hypothesis H0 : θ ≤ 0 against H1 : θ > 0. We set the size of the test at α = 0.04.

(a) What is the minimum sample size n needed in order to have power 75% at the distribution with θ = 0.24 and σ² = 1.44?

(b) If you double the sample size obtained in (a), what would be the power of the test at the same distribution?

(c) With the sample size obtained in (a), what would be the power of the test if θ = 0.24 and σ² = 2.56?

7. Suppose we have a random sample X1, . . . , Xn from a continuous distribution whose density function f(x) > 0 for all x. We wish to test the hypothesis that the median of f(x) is m = 0 against the simple alternative m = 1.

(a) Discuss why Tn, the number of observations larger than 0, is a useful test statistic.

(b) Suppose n = 12 and Tn = 8. Calculate the p-value of the test.

(c) If the significance level is set at 5% and n = 12, what is the rejection region of this test? Remark: randomization has to be utilized to achieve the "exact size" requirement.

8. Suppose we have a random sample X1, . . . , Xn which is likely a sample from the Gamma distribution with density function

f(x; θ) = (x/θ²) exp(−x/θ).

The parameter space is Θ = R+, and the range of the random variable is also R+. At the same time, it is suspected that the actual distribution might be a Gamma mixture:

f(x; γ, θ1, θ2) = γ f(x; θ1) + (1 − γ) f(x; θ2)

for some γ ∈ (0, 1) and θ1 ≠ θ2.

(a) Denote µ = E(X) and σ² = var(X). Compute µ and σ² when γ = 0, hence under the pure distribution f(x; θ), and find the function g such that σ² = g(µ).

(b) Compute µ and σ² when 0 < γ < 1 and θ1 ≠ θ2. Show that σ² > g(µ), where g(·) is the function obtained in (a).

(c) The result in (b) helps to motivate a test statistic for the null hypothesis that the data are from a pure Gamma distribution, against the alternative of a Gamma mixture. Find such a statistic and give a brief justification based on the two desired properties of a test statistic.

Chapter 13 Uniformly most powerful test

Contemporary statistical education emphasizes teaching students effective data analysis methods in a time-efficient way. Success is measured by whether a student can quickly answer a statistical question raised in applications. This is sometimes done at the cost of not knowing why the statistical method actually answers the applied question, other than that it gives an "answer". Against this trend, in this course we insist on discussing what is meant by effective data analysis methods. We preach that even though the topic "uniformly most powerful test" is itself not of great practical importance, the idea behind this concept is.

Definition 13.1. Let φ(x) be a test of size α for a test problem with null and alternative hypotheses H0 and H1.
If, for any size-α test φ1(x) and any F ∈ H1, we have E{φ(X); F} ≥ E{φ1(X); F}, then φ(x) is a uniformly most powerful test.

Let us emphasize again that both H0 and H1 are subsets of a distribution family. When the data X come from some F ∈ H1, we wish to have as high a probability as possible of declaring that the distribution F is not in H0. At the same time, we do not do so at any cost: we require that H0 be retained with high probability when F ∈ H0. Since X is random, by the nature of the game there is unlikely to be any perfect solution.

The task of finding Uniformly Most Powerful (UMP) tests is often difficult or even impossible. Some may argue that such a result is therefore not meaningful/useful. I agree to a large degree. However, the idea behind the UMP test is important and serves a good purpose: the knowledge we gain from such exercises helps us develop sensible methods for general problems. There are also special cases where UMP tests exist. We therefore do not wish to eliminate this concept from classroom discussions entirely. Next, we work with the simplest case.

13.1 Simple null and alternative hypothesis

Once a null hypothesis is identified, the task of a statistical significance test is to see whether or not the data suggest a departure from the null models in a specific direction. The simplest situation is where the statistical model F contains only two distinct distributions: the null hypothesis contains one and the alternative hypothesis contains the other. More specifically, we may present them as two density functions (with respect to some σ-finite measure):

H0 : f0(x), H1 : f1(x).

Note that if X represents a set of i.i.d. random variables, the above setting still applies. We will use E1 and E0 for expectations under the alternative and the null models when applicable.
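To make the simple-versus-simple setup concrete (an illustration of ours, not part of the notes): take f0 to be the N(0, 1) density and f1 the N(1, 1) density for a single observation. The ratio f1(x)/f0(x), which plays the central role in the lemma below, reduces to exp(x − 1/2) and so is increasing in x:

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def f0(x):
    # Null density: standard normal N(0, 1)
    return math.exp(-x**2 / 2) / SQRT_2PI

def f1(x):
    # Alternative density: N(1, 1)
    return math.exp(-(x - 1)**2 / 2) / SQRT_2PI

def likelihood_ratio(x):
    # f1(x)/f0(x); the algebra shows this equals exp(x - 1/2)
    return f1(x) / f0(x)

print(round(likelihood_ratio(0.5), 6))  # 1.0 (= exp(0))
```

Because the ratio is monotone in x, rejecting when the ratio is large is the same as rejecting when x is large, which foreshadows the form the most powerful test takes below.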
Based on measure theory, for any two given distributions it is possible to find a σ-finite measure with respect to which the density functions of both distributions exist. This justifies the generality of the above assumption.

Lemma 13.1 (Neyman–Pearson Lemma). Consider the simple null and alternative hypothesis test problem as specified.

(1) For any size α between 0 and 1, there exist a test φ and a constant k such that

E0{φ(X)} = α    (13.1)

and

φ(x) = 1 when f1(x) > k f0(x); 0 when f1(x) < k f0(x).    (13.2)

(2) If a test has properties (13.1) and (13.2), then it is most powerful for testing H0 against H1.

(3) If φ is most powerful with size no more than α, then it satisfies (13.2) for some k. It also satisfies (13.1) unless there exists a test of size smaller than α with power 1.

Proof and discussion.

Proof of (1): a likelihood ratio test of size α exists. To prove the existence, let

α(t) = pr(f1(X) > t f0(X); H0).

It is a nonincreasing function of t. Hence, there exists a t0 such that α(t0) ≤ α ≤ α(t0−). Let

φ(x) = 1(f1(x) > t0 f0(x)) + c 1(f1(x) = t0 f0(x))

with

c = {α − α(t0)}/{α(t0−) − α(t0)}

if needed. Then this φ(x) is a test with the required properties.

Remark: the seemingly complex construction is needed to cover the discrete situation where pr{f1(X) = t0 f0(X)} ≠ 0. Otherwise, the claim is trivial.

Proof of (2): suppose φ(x) is the test given in (1), and φ̃ is another test of size α. Then

{φ(x) − φ̃(x)}{f1(x) − k f0(x)} ≥ 0.

By integrating both sides with respect to the measure under which the densities are defined, this implies

E1{φ(X) − φ̃(X)} ≥ k E0{φ(X) − φ̃(X)} = 0.

The right-hand side equals 0 because the two tests have the same size. Hence, φ has power at least as high as that of φ̃.

Proof of (3): if φ̃(X) is also a most powerful test, then we should have

P[{φ(x) − φ̃(x)}{f1(x) − k f0(x)} > 0] = 0.
Otherwise, the derivation in the proof of (2) would imply that φ̃(X) has lower power, contradicting the assumption that φ̃(X) is also most powerful. From {φ(x) − φ̃(x)}{f1(x) − k f0(x)} = 0 with probability one, we conclude that φ(x) = φ̃(x) for all x except when f1(x) − k f0(x) = 0. Hence, φ̃(x) also has the form (13.2). ♦

This lemma claims that the most powerful test has to be the likelihood ratio test. At the same time, the third part of the lemma leaves room for non-uniqueness. This is due to the flexibility in making decisions on the set of x such that f1(x)/f0(x) = k. A randomized test can be used to achieve the exact size of the test. It may also be possible to split this set in other ways and obtain a non-randomized test of the right size. These tests are all MP; hence, the MP test is not necessarily unique.

What is the relevance of this classical, famous, albeit seemingly obsolete lemma? Introductory statistics courses generally recommend the one-sample t-test or z-test for the zero-mean hypothesis under the normal model, yet we usually do not comment on why they are recommended. An important reason for the z-test is that it is the Uniformly Most Powerful test, as shown below.

Example 13.1. Let X = (X1, . . . , Xn) be a random sample from N(θ, 1). Let us test H0 : θ = 0 against H1 : θ = 1. By the Neyman–Pearson Lemma, the most powerful test has the form

φ(x) = 1(fn(x; θ = 1) > k fn(x; θ = 0)),

where fn denotes the n-variate density, and θ = 1 and θ = 0 highlight the parameter values under the alternative and null hypotheses. The constant k is to be chosen so that the test has the given size. Randomization is not needed in this example because the density ratio, regarded as a random variable, has a continuous distribution. Note that the critical region can be represented equivalently in many forms.
Clearly,

{fn(x; θ = 1) > k fn(x; θ = 0)}
= {log fn(x; θ = 1) > log fn(x; θ = 0) + log k}
= {−(1/2) Σ_{i=1}^n (Xi − 1)² > −(1/2) Σ_{i=1}^n Xi² + k′}
= {Σ_{i=1}^n (Xi − 1)² < Σ_{i=1}^n Xi² − k″}
= {−2 Σ_{i=1}^n Xi + n < −k″}
= {Σ_{i=1}^n Xi > k‴}.

In other words, there exists a k‴ such that

φ(x) = 1(fn(x; θ = 1) > k fn(x; θ = 0)) = 1(Σ_{i=1}^n Xi > k‴).

Since all we care about is the size of the test, there is no need to find exactly how k‴ is related to k. We need only work out the critical value k‴ in the last step each time a size of the test is specified. Suppose we want the test to have size α = 0.05. Because the null hypothesis in this example contains only a single distribution, under which Σ Xi ~ N(0, n), we need only solve the equation

P(Σ_{i=1}^n Xi > c; θ = 0) = 0.05,

whose solution is c = 1.645√n. If we set α = 0.025, the solution is c = 1.960√n.

Suppose that, in addition to requiring the size of the test to be 0.05, we also want the power of the test to be β(1) = 80%. This can be achieved by selecting an appropriate sample size n:

P(Σ_{i=1}^n Xi > 1.645√n; θ = 1) ≥ 0.8.

Because n is discrete, the problem should be interpreted as finding the smallest n such that the power is at least 0.8. When θ = 1, we have

P(Σ_{i=1}^n Xi > 1.645√n; θ = 1) = P(n^{−1/2} Σ_{i=1}^n (Xi − 1) > 1.645 − n^{1/2}; θ = 1) = P(Z > 1.645 − n^{1/2}),

with Z a standard normal random variable. The 20% quantile of the standard normal is −0.842. Thus, we require 1.645 − n^{1/2} ≤ −0.842, or n ≥ (1.645 + 0.842)² = 6.18. Thus, n = 7 meets the requirement.

Remark: if the alternative hypothesis H1 is replaced by θ = θ1 for any θ1 > 0, the most powerful test itself remains the same. That is, the test is most powerful for any alternative with θ1 > 0. In other words, the above test is also a UMP test against H1 : θ > 0.
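The critical-value and sample-size calculations in Example 13.1 can be reproduced in a few lines. A sketch using Python's statistics.NormalDist (the constants 1.645 and −0.842 above are the 95% and 20% standard normal quantiles):

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()                      # standard normal distribution

alpha, target_power = 0.05, 0.80
z_alpha = z.inv_cdf(1 - alpha)        # ~ 1.645; reject when sum(X_i) > z_alpha * sqrt(n)
z_beta = z.inv_cdf(1 - target_power)  # ~ -0.842; the 20% quantile of N(0, 1)

# Power at theta = 1 is P(Z > z_alpha - sqrt(n)); requiring it to be at
# least 0.80 means z_alpha - sqrt(n) <= z_beta, i.e. n >= (z_alpha - z_beta)^2.
n = ceil((z_alpha - z_beta) ** 2)
print(n)                              # 7
print(1 - z.cdf(z_alpha - sqrt(n)))   # attained power, a little above 0.80
```

Using exact quantiles rather than the rounded constants reproduces the same conclusion: (1.6449 + 0.8416)² ≈ 6.18, so n = 7 is the smallest sample size meeting the power requirement.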
However, to attain power 80% at a different θ value, such as θ1 = 0.5, a larger sample size is required.

Remark: it is easy to verify that the critical region of the most powerful test when H1 becomes θ = θ1 < 0 has the form Σ Xi < c. Clearly, a most powerful test for H1 : θ > 0 cannot also be most powerful for H1 : θ < 0. Hence, the notion of most powerful is in general specific to the alternative hypothesis. It is often impossible to have a test that is uniformly most powerful against a composite alternative hypothesis. Here, composite means that the alternative hypothesis F − F0 contains more than a single distribution.

Remark: the point of the example is that the UMP test derived from the Neyman–Pearson Lemma is the same test we generally recommend in other courses.

13.2 Making more from the N-P lemma

The N-P lemma is more relevant than it appears. Here is a helpful theorem that simplifies future derivations.

Theorem 13.1. Suppose that a test φ(X) of size α is most powerful for H0 against H̃1 : F = F1, for every F1 ∈ H1. Then it is UMP for H0 against H1.

Proof: Suppose φ̃(X) is another test of size α for testing H0 versus H1. For any F1 ∈ H1, by the assumption on φ(X), we have E{φ(X); F1} ≥ E{φ̃(X); F1}. This trivially shows that φ(X) is UMP against H1. ♦

Example 13.2. Suppose X1, . . . , Xn is an i.i.d. sample from a Poisson distribution. We test H0 : θ ≤ 1 versus H1 : θ > 1 at nominal level α. Consider first testing H̃0 : θ = 1 versus H̃1 : θ = 2. The likelihood ratio is f(x; 2)/f(x; 1) = c exp{(log 2) Σ xi}. By the Neyman–Pearson Lemma, a most powerful test has the form

φ(X) = 1 if Σ xi > k; c if Σ xi = k; 0 if Σ xi < k

for some k and c making the size of the test equal to α; that is, they are chosen so that E{φ(X); θ = 1} = α. The choice of k and c thus does not depend on H̃1. Hence, by Theorem 13.1, φ(X) is UMP for H̃0 versus H1.

Next, we hope to retain the same conclusion with H̃0 replaced by H0. It is clear that E{φ(X); θ} < α when θ < 1.
Hence, φ(X) remains a size-α test for H0, and therefore no other test of size α can have greater power at any θ > 1; that is, φ(X) is UMP for H0 versus H1.

The above argument is more generally applicable. Note that allowing randomization makes the discussion under the Poisson model smoother; we do not recommend this type of randomization in applications.

13.3 Monotone likelihood ratio

In the two previous examples, we started by searching for a most powerful test for simple H0 and H1. In the end, however, the test was found to be uniformly most powerful for some composite null and alternative hypotheses. This is because the distributions in the statistical model are parameterized in a monotone way. The following definition provides a specific terminology for distribution families with such a property.

Definition 13.2. Suppose that the distribution of X belongs to a parametric family with density functions {f(x; θ) : θ ∈ Θ ⊂ R}. The family is said to have monotone likelihood ratio in T(x) if and only if, for any θ1 < θ2,

f(x; θ2)/f(x; θ1)

is a nondecreasing function of T(x) for values x at which at least one of f(x; θ1) and f(x; θ2) is positive.

It is seen that T(X) is a useful statistic for the purpose of hypothesis testing because it is stochastically increasing in θ.

Lemma 13.2 (Monotonicity of E{T(X); θ}). Suppose X has a distribution from a family with monotone likelihood ratio in T(x). Then E{T(X); θ} is nondecreasing in θ.

Proof: When θ2 > θ1, f(x; θ2)/f(x; θ1) is nondecreasing in T(x). Hence the two random variables T(X) and f(X; θ2)/f(X; θ1) are positively correlated when the distribution of X is f(x; θ1). Let µ1 = E1{T(X)}, the expectation under θ1. We have

E1{[T(X) − µ1] f(X; θ2)/f(X; θ1)} ≥ 0.

Expanding this inequality gives E{T(X); θ2} ≥ µ1 = E{T(X); θ1}, which is the conclusion. ♦

Extension: this conclusion applies to any nondecreasing function g(T).

Example 13.3.
The one-parameter exponential family

f(x; θ) = exp{η(θ)T(x) − ξ(θ)}h(x)

has monotone likelihood ratio in T(x) when η(θ) is a nondecreasing function of θ. The result remains true for the joint distribution of i.i.d. observations.

It will be seen that UMP tests exist for one-parameter exponential families with the above property. It helps to remember that most commonly employed one-parameter distribution families are one-parameter exponential families. Therefore, the result shown in the next theorem is broadly applicable. Before introducing that theorem, we point out another monotone likelihood ratio family; this one is more of theoretical interest.

Example 13.4. Let X1, . . . , Xn be an i.i.d. sample from f(x; θ) = θ^{−1} 1(0 < x < θ). Then the distribution family of X = (X1, . . . , Xn) has monotone likelihood ratio in X(n), the largest order statistic.

Theorem 13.2. Suppose the distribution of X is in a parametric family with real-valued parameter θ that has monotone likelihood ratio in T(X). Consider H0 : θ ≤ θ0 and H1 : θ > θ0.

(i) There exists a UMP test of size α, given by

φ(X) = 1 if T(X) > k; c if T(X) = k; 0 if T(X) < k.

(ii) For any θ < θ0, φ(X) minimizes the type I error α(θ) among all φ̃ such that E{φ̃(X); θ0} = α.

Proof.

(i) By the Neyman–Pearson lemma, this test is one of the most powerful tests for H̃0 : θ = θ0 against H̃1 : θ = θ1, for any θ1 > θ0, because the density ratio is a nondecreasing function of T. Hence, φ(X) is uniformly most powerful for H̃0 against H1 : θ > θ0.

By the lemma on the monotonicity of E{T(X); θ} (applied, via its extension, to the nondecreasing function φ of T), E{φ(X); θ} is a nondecreasing function of θ. Therefore E{φ(X); θ} ≤ α for all θ ∈ H0. Thus, φ(X) is a size-α test for H0 versus H1. Subsequently, it is UMP for H0 versus H1 by Theorem 13.1, the extended N-P lemma.

(ii) Let us define ξ = −θ so that we have the density function g(x; ξ) = f(x; −ξ). In terms of ξ, the family has monotone likelihood ratio in T̃(x) = −T(x).
Consider testing H0* : ξ ≤ ξ0 = −θ0 versus H1* : ξ > ξ0 = −θ0 with size α* = 1 − α. By part (i), a UMP test has the form

φ*(X) = 1 − φ(X) = 1 if T(X) < k; 1 − c if T(X) = k; 0 if T(X) > k,

with k and c chosen such that the test has size α*. We remark that the middle part is not unique, but this does not invalidate our claim. That this uniformly most powerful test has maximum power is the same as E{φ(X); θ} = 1 − E{φ*(X); ξ} being minimized for ξ ∈ H1*, which is the same as θ ∈ H0. This completes the proof. ♦

Example 13.5 (Uniform distribution). Let X1, . . . , Xn be a random sample from the uniform distribution on (0, θ). Then the distribution family of X = (X1, . . . , Xn) has monotone likelihood ratio in X(n). For any θ1 < θ2, the density ratio is

f(x; θ2)/f(x; θ1) = (θ1/θ2)^n {1(0 < x(n) < θ2)}/{1(0 < x(n) < θ1)}.

Other than the constant factor (θ1/θ2)^n, the ratio takes three values: 1, ∞, and undefined. The last case does not matter, as the case of both densities being zero is excluded in the definition. This ratio is clearly a nondecreasing function of x(n).

Consider the hypotheses H0 : θ ≤ θ0 and H1 : θ > θ0. By the theorem just proved, the UMP test can be written as

φ(X) = 1 if X(n) > k; c if X(n) = k; 0 if X(n) < k

for some k and c. Because the distribution of X(n) is continuous, we have P(X(n) = k) = 0 for any k. Hence, the test can be simplified to

φ(X) = 1 if X(n) > k; 0 otherwise.

The c.d.f. of X(n) under the null is (x/θ0)^n for 0 < x < θ0. Hence, the choice of k is determined by α = 1 − (k/θ0)^n, and k = θ0(1 − α)^{1/n} is the solution. The power at θ > θ0 is

β(θ) = 1 − (1 − α)(θ0/θ)^n.

Remark: the UMP test is not unique, as the density ratio is a discrete random variable.

13.4 Assignment problems

1. Let X1, . . . , Xn be an i.i.d. sample from N(1, θ = σ²). Find a UMP test of size α for testing H0 : θ ≤ θ0 versus H1 : θ > θ0.

(a) When n = 10, θ0 = 3, and α = 0.05, obtain the power function β(θ) over the interval (3, 10).
Hint: take a grid of 100 points in this interval and get numerical values using R. Plot this function.

(b) Use Monte Carlo simulation to verify the values β(3.5) and β(5). Give a thoughtful justification of the size of your simulation.

2. Let X1, . . . , Xn be a random sample from a distribution with density function f(x; θ) = 2θ^{−2} x 1(0 < x < θ) and parameter space R+. The required size of the test is α in the following questions.

(a) Work out the rejection region of a most powerful test for θ = 1 against θ = 1.03.

(b) Work out the rejection region of a most powerful test for θ = 1 against θ = 0.95.

(c) Use computer simulation to check the precision of the type I errors of both tests when we set the size α = 0.08 and the sample size n = 20. Be thoughtful about the size of your simulation.

(d) Compute the power of the two tests and use simulation to verify these values.

3. Let X1, X2, X3 be three independent random variables with respective density functions

f1(x; θ) = (1/θ) exp(−x/θ); f2(x; θ) = (x/θ²) exp(−x/θ); f3(x; θ) = {x²/(2θ³)} exp(−x/θ)

and shared parameter θ, with parameter space Θ = R+.

(a) Show that the distribution of (X1, X2, X3) belongs to a distribution family with monotone likelihood ratio in T(X) = X1 + X2 + X3.

(b) Verify that Eθ{T(X)} is monotone in θ.

(c) Construct a uniformly most powerful test of size α ∈ (0, 1) for H0 : θ ≤ 3 against H1 : θ > 3. This should be done by (i) first giving the form of the test; (ii) then specifying the constant for the rejection region.

Remark: For the famous z-test, the rejection region has the form |T| > z_{1−α/2}, with z_{1−α/2} the solution to

∫_{−∞}^{z} (2π)^{−1/2} exp(−t²/2) dt = 1 − α/2.

In other words, z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. Your answer should follow this format.

4. Let X1, X2, . . . , Xn be an i.i.d.
sample from a negative binomial distribution family with parameter θ and a known positive integer m:

pr(X = x) = C(m + x − 1, x) (1 − θ)^x θ^m = c(m, x)(1 − θ)^x θ^m

for x = 0, 1, . . ., with parameter space Θ = (0, 1).

(a) Show that the joint distribution is in a monotone likelihood ratio family in Tn = −X̄n, the negative of the sample mean.

(b) In view of (a), specify the uniformly most powerful test for H0 : θ ≤ 0.5 against H1 : θ > 0.5 with size α = 0.05.

(c) Describe a physical experiment leading to the negative binomial distribution, with a brief explanation.

Chapter 14 Pushing Neyman–Pearson Lemma Further

The famous Neyman–Pearson Lemma is established for two simple hypotheses. We have seen its generalization to the situation where the alternative hypothesis is made of an interval of parameter values and the data are from a monotone likelihood ratio family. The null hypothesis can also be extended so that the resulting test is UMP: uniformly most powerful. The main purpose of this chapter is to develop tools to cover two-sided alternative hypotheses.

We start with a statistically not-so-meaningful result. It will be the basis for something statistically meaningful.

Theorem 14.1. Consider the situation where H0 = {f1, f2} and H1 = {f3}. Let α1, α2 be constants taking values between 0 and 1, and let Ej denote expectation when the data X have distribution fj, j = 1, 2, 3. Let T be the class of tests such that Ej{φ(X)} ≤ αj, j = 1, 2. More formally,

T = {φ(·) : Ej{φ(X)} ≤ αj, j = 1, 2}.

Let T0 be the subset of T with the above inequalities replaced by equalities, namely

T0 = {φ(·) : Ej{φ(X)} = αj, j = 1, 2}.

Suppose there are constants k1 and k2 such that

φ*(x) = 1 if f3(x) > k1 f1(x) + k2 f2(x); 0 if f3(x) < k1 f1(x) + k2 f2(x)    (14.1)

is a member of T0. We have two conclusions:

(i) E3{φ*(X)} ≥ E3{φ(X)} for any φ(x) ∈ T0.
(ii) If both k1 ≥ 0 and k2 ≥ 0, then E3{φ*(X)} ≥ E3{φ(X)} for any φ(x) ∈ T.

Proof.

(i) Simply construct the function

{φ*(x) − φ(x)}{f3(x) − (k1 f1(x) + k2 f2(x))},

which is non-negative at every x. If both φ*(x), φ(x) ∈ T0, integrating this function gives E3{φ*(X)} ≥ E3{φ(X)} right away.

(ii) Since φ*(x) ∈ T0, we have E1{φ*(X)} = α1 and E2{φ*(X)} = α2. When φ(x) ∈ T merely, we have E1{φ(X)} ≤ α1 and E2{φ(X)} ≤ α2 by definition. Integrating

{φ*(x) − φ(x)}{f3(x) − (k1 f1(x) + k2 f2(x))}

with respect to the corresponding σ-finite measure, we find

E3{φ*(X)} − E3{φ(X)} ≥ k1[α1 − E1{φ(X)}] + k2[α2 − E2{φ(X)}] ≥ 0,

where the last inequality uses the condition that both k1 and k2 are nonnegative. Hence, the conclusion is verified. ♦

One should read the result this way: the most powerful test of a specific size becomes hard to obtain when H0 is composite; there may be none. We have only managed to produce a result in a still very simplistic situation. The proposition can be generalized slightly to the situation where H0 contains a finite number of density functions. Another shortcoming of this result is that it is hard to determine whether such k1 and k2 exist. Answering this question is technically involved, so I only copy the following result for your reference.

Theorem 14.2. Let f1, f2, f3 be three density functions with respect to the same σ-finite measure. The following two conclusions are true:

(a) The set M = {(E1{φ(X)}, E2{φ(X)}) : φ is a test} is convex and closed.

(b) If (α1, α2) is an interior point of M, then there exist constants k1, k2 such that a test φ*(x) of the form (14.1) with type I errors α1 and α2 at f1 and f2 exists.

Discussion: The N-P lemma gives us a UMP test when both H0 and H1 contain a single distribution. We have previously generalized the N-P lemma to the situation where H1 contains many distributions.
Theorem 14.1 expands the N-P lemma a bit further: it allows H0 to contain two distributions when H1 contains only a single distribution. When H0 is given in the form 1 ≤ θ ≤ 2, say under the N(θ, 1) model assumption, the distributions in H0 that matter are the ones with θ = 1 and θ = 2. Here, by "matter" we mean that the type I errors and the size of a good test are determined by these two distributions. Once a UMP test is obtained for H̃0 : {θ = 1, θ = 2}, this test is likely also a UMP test for H0 itself. See Lehmann (Vol. II, p. 96) for details.

14.1 One parameter exponential family

The generalized N-P lemma has its targeted application in problems related to the one-parameter exponential family.

Theorem 14.3. Suppose we have an i.i.d. sample x1, x2, . . . , xn from a one-parameter exponential family with density function given by

f(x; θ) = exp{θY(x) − A(θ)}h(x).

This family has monotone likelihood ratio in Tn(x) = Σ Y(xi). Suppose we want to test H0 : θ ∉ (θ1, θ2) versus H1 : θ ∈ (θ1, θ2) for some θ1 ≠ θ2.

(i) A UMP test of size α is given by

φ(T) = 1 if k1 < Tn(x) < k2; cj if Tn(x) = kj, j = 1, 2; 0 if Tn(x) < k1 or Tn(x) > k2,

where k1, k2, c1, c2 are chosen such that E{φ(X); θj} = α, j = 1, 2. (Note 0 < c1, c2 < 1.)

(ii) The test given in (i) minimizes the type I error at every θ ∈ H0 among the tests satisfying E{φ(T); θj} = α, j = 1, 2.

Proof. Since Tn(x) = Σ Y(xi) is sufficient for θ, we need only work with tests defined as functions of Tn(x); otherwise, E{φ(X)|Tn} is a test with the same size and power function.

(i) We first work out a UMP test for H̃0 : {θ1, θ2} versus H̃1 : {θ3} for some θ3 ∈ (θ1, θ2). Note the structure: the alternative model is a single distribution inside the interval, while the null models are the two distributions at the two ends. According to the generalized Neyman–Pearson lemma (Theorem 14.1), such a UMP test may exist.
For any test φ(T ), we denote its rejection probability by β(θ;φ) = E{φ(T ); θ}. One candidate test for having UMP property is proposed to be φ(T ) = 1 f(x; θ3) > k1f(x; θ1) + k2f(x; θ2); c f(x; θ3) = k1f(x; θ1) + k2f(x; θ2); 0 f(x; θ3) < k1f(x; θ1) + k2f(x; θ2). We do not elaborate but assume the existence of c, k1 and k2 such that β(θ1;φ) = β(θ2;φ) = α. The inequality f(x; θ3) > k1f(x; θ1) + k2f(x; θ2) 14.1. ONE PARAMETER EXPONENTIAL FAMILY 201 used in defining the above φ(T ) under the exponential family can be written as a1 exp(b1T ) + a2 exp(b2T ) < 1 for some constants a1, a2, b1 and b2. Due to the relative sizes of θ1, θ2 and θ3, we must have b1b2 < 0. We find the sign information about a1 and a2 is helpful and give it a careful discussion as follows: (1) If both a1, a2 are smaller than 0, then the inequality holds with prob- ability 1. That is, the size of the test would be 1. This is disallowed. (2) If a1 ≤ 0 but a2 > 0, together with the known function b1b2 < 0, it implies that a1 exp(b1T ) + a2 exp(b2T ) is monotone in T . That is, the inequality in the form of a1 exp(b1T ) + a2 exp(b2T ) < 1 is equivalent to one of T < t or T > t for some constant t. If so, the rejection probability β(θ;φ) would be an monotone function in θ. This contradicts β(θ1;φ) = β(θ2;φ) = α. (3) The only choice left is a1 > 0 and a2 > 0. Note that a1 exp(b1T ) + a2 exp(b2T ) is now convex in T . The inequality in the form of a1 exp(b1T ) + a2 exp(b2T ) < 1 is equivalent to the one in the form of k1 < T < k2 for another set of k1 and k2. In summary, our discussion leads to conclusion that the test is to reject H0 when k1 < T < k2. This is in good agreement with our intuition. Based on the generalized Neyman-Pearson together with a1 > 0 and a2 > 0, this φ(T ) is UMP for testing H˜0 : {θ1, θ2} versus H˜1 : {θ3}. Because this φ(T ) does not depend on the specific choice of θ3, the UMP conclusion extends to H˜0 : {θ1, θ2} versus H1. 202CHAPTER 14. 
PUSHING NEYMAN–PEARSON LEMMA FURTHER To get the full generality that φ(T ) is UMP for testing H0 : θ 6∈ (θ1, θ2) versus H1, we only need to verify that β(θ;φ) ≤ α at every θ 6∈ [θ1, θ2]. This is true due to the concavity of β function. Consider the test problem with H˜0 : {θ1, θ2} against H˜1 : {θ3} for some θ3 in the original H0. Consider the test φ ∗(T ) = 1− φ(T ). It can be verified (similar to what have been done) that this φ∗(T ) has the form specified in the generalized N-P lemma. Therefore, φ∗(T ) has the best power at θ3 (among those with β(θ1) = 1 − α, β(θ2) = 1 − α. This implies that φ(T ) has the lowest type I error possible, which makes it at least as low as α. (ii) It has been proved by the above step. ♦ Remark: The result itself is mathematically interesting but it is awkward to come up with a situation in applications where it will be used. Its usefulness will be seen in the next section. For students with interest in mathematical techniques, this is a proof for us to gain mathematical insight. The result is stated for a one-parameter exponential family of distribution in a specific form. A general one-parameter exponential family can usually be transformed into this form by a monotone function to the parameter. Hence, the conclusion is more general than it appears. 14.2 Two-sided alternatives Consider the hypothesis θ = 1 versus the alternative H1 : θ 6= 1 given observations from exponential distribution with mean θ. Let us separate H1 into H11 : θ > 1 and H12 : θ < 1. The alternatives in the form of H1 is called two-sided. Assume the size of the test is required to be α. We now work out the test in the situation where i.i.d. observations from an exponential distribution family f(x; θ) = (1/θ) exp(−x/θ) are provided. The UMP for H0 versus H11, according to discussion in the last chapter, 14.2. TWO-SIDED ALTERNATIVES 203 is given by φ1(x) = 1( ∑ xi > k1) for so that E0{φ1(X)} = α. 
The UMP test for H0 versus H12, for the same reason, is given by φ2(x) = 1(Σ xi < k2), with k2 chosen so that E0{φ2(X)} = α.

Suppose a UMP test φ(x) exists for H0 versus H1. This test remains UMP for H0 versus H11; hence, we must have φ(x) = φ1(x) except on a set of measure zero. For the same reason, we must also have φ(x) = φ2(x) except on a set of measure zero. Such a φ(x) clearly does not exist. Hence, there exists no UMP test for this problem. This phenomenon is not restricted to the exponential distribution but holds in general.

Although there is no UMP test for two-sided alternatives, we may provide a sensible test based on the idea of a "pure significance test". Define Tn = max{x̄, 1/x̄}. A large value of Tn (deviating from 1) is a good indication that θ = 1 is violated. Thus, we may compute p0 = P(Tn ≥ t_obs; θ = 1) as the p-value and reject the null hypothesis θ = 1 when, say, p0 < 0.05. We may agree that this is a sensible test. However, we cannot help but ask whether this is the best we can do, and in what sense this test is best. We could have defined the test statistic as T′n = max{x̄, 2/x̄}; a test based on T′n has the same properties. In some situations, it is possible to set up a useful standard. This is our next topic.

14.3 Unbiased test

A great person is not necessarily the best compared to everyone else in every respect in a large population; he/she might be the best in a small community or in a specific aspect. We may be disappointed that in many situations, or in nearly all realistic situations, there exist no UMP tests. However, we may look for the best test(s) among those which are not weird in some respect but outsmart the others in a narrow sense. These words motivate us to look for sensible tests that are optimal within a restricted class. In this section, we compare tests that are unbiased in the sense of the following definition.

Definition 14.1.
Consider the problem of testing a null hypothesis denoted as H0 against an alternative hypothesis denoted as H1 based on data X. A test φ(X) of size α is unbiased if

sup_{F∈H0} E{φ(X); F} ≤ α;  inf_{F∈H1} E{φ(X); F} ≥ α.

Justification of unbiasedness: every guilty party should be more likely to be sent to prison than every innocent party in a court. Be aware of the wording: merely more likely. One can easily show that unbiased tests always exist for any pair of null and alternative hypotheses. Be aware that most tests proposed in the literature under complex models are not unbiased. Yet it does not hurt to think about the unbiasedness issue.

Definition 14.2. Consider the problem of testing a null hypothesis denoted as H0 against an alternative hypothesis denoted as H1 based on data X. If a test is most powerful at every F ∈ H1 within the class of unbiased tests of size α, we say it is a Uniformly Most Powerful Unbiased (UMPU) test of size α.

14.3.1 Existence of UMPU tests

The notion of unbiasedness is helpful in some typical situations. We only discuss this topic for a one-parameter exponential family with density function given by

f(x; θ) = exp{θY(x) − A(θ)}h(x).

This family has monotone likelihood ratio in T = ∑ Y(xi). Of course, we also know that T is complete and sufficient for θ. The above parameterization is a natural one.

Theorem 14.4. Suppose we want to test H0 : θ ∈ [θ1, θ2] versus H1 : θ ∉ [θ1, θ2] for some θ1 ≠ θ2. A UMPU test of size α is given by

φ(T) = 1 when T < k1 or T > k2;  cj when T = kj, j = 1, 2;  0 when k1 < T < k2,

where k1, k2, c1, c2 are chosen such that E{φ(T); θj} = α, j = 1, 2. (Note 0 < c1, c2 < 1.)

Proof: A good test should clearly be based on T as it is complete and sufficient for θ. Thus, we will not look into other possibilities. According to Theorem 14.3 proved earlier, 1 − φ(T) for the φ(T) defined above is a UMP test for H̃0 : θ ∉ [θ1, θ2] versus H̃1 : θ ∈ [θ1, θ2] of size α̃ = 1 − α.
We put a side proposition with a proof here: under an exponential family, E{φ(T); θ} is continuous in θ for any test φ(T). Because of this proposition, if φ(T) is an unbiased test for H0 versus H1, we must have E{φ(T); θj} = α, j = 1, 2. If another unbiased test φ*(T) of size α for H0 versus H1 had higher power at some θ3 ∈ H1, we would have

E{φ*(T); θ1} = E{φ*(T); θ2} = α and E{φ*(T); θ3} > E{φ(T); θ3}.

In terms of H̃0 and H̃1, we would then have a pair of tests, 1 − φ*(T) and 1 − φ(T), both of size 1 − α and unbiased, but with the type I error of 1 − φ*(T) lower than that of 1 − φ(T) at θ3 ∈ H̃0. This contradicts the UMP result of Theorem 14.3 (ii) given earlier. ♦

UMPU when θ1 = θ2. Theorem 14.4 is not directly applicable to the situation where θ1 = θ2. Direct application is not theoretically justified and also leads to some difficulties: a test of the same form would require us to select k1, k2, c1, c2 such that E{φ(T); θ} = α at θ = θ1. Ignoring the "continuity correction" step of choosing the constants c1 and c2, we would have many choices of k1 and k2 satisfying a single constraint like this one. The solution comes from the following consideration. Let us apply this theorem to the situation where θ2 = θ1 + δ and let δ ↓ 0. Clearly, we would have

lim_{δ↓0} [E{φ(T); θ2} − E{φ(T); θ1}]/δ = 0.

This implies, in the context of a one-parameter exponential family,

E{Tφ(T); θ1} = αE{T; θ1}.

Hence, a UMPU test for H0 : θ = θ0 against H1 : θ ≠ θ0 of size α is given by

φ(T) = 1 when T < k1 or T > k2;  cj when T = kj, j = 1, 2;  0 when k1 < T < k2,

where k1, k2, c1, c2 are chosen such that

E{φ(T); θ0} = α,  E{Tφ(T); θ0} = αE{T; θ0}.

To implement this procedure, one may resort to numerical approximations to find these constants. The constants c1, c2 serve the purpose of ensuring that the equality requirements are met exactly. They have little relevance in terms of statistical practice.
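As a sketch of such a numerical approximation, consider the exponential-mean model with H0 : θ = 1. There T = ∑ Xi is continuous, so c1, c2 play no role, and the two constraints reduce to a pair of equations in (k1, k2). A Python sketch (the notes use R; n = 10 and α = 0.05 are illustrative choices, and the Gamma identity E{T·1(T ≤ k); θ = 1} = nF_{n+1}(k), with F_m the Gamma(m, 1) cdf, is an assumption I supply for the derivation):

```python
# Solving for the UMPU cutoffs (k1, k2): E{phi(T)} = alpha and
# E{T phi(T)} = alpha E{T} reduce, via E{T 1(T <= k)} = n F_{n+1}(k), to
#   F_n(k1) + 1 - F_n(k2) = alpha,
#   F_{n+1}(k1) + 1 - F_{n+1}(k2) = alpha.
import numpy as np
from scipy.stats import gamma
from scipy.optimize import fsolve

n, alpha = 10, 0.05

def constraints(k):
    k1, k2 = k
    eq1 = gamma.cdf(k1, n) + gamma.sf(k2, n) - alpha          # size alpha
    eq2 = gamma.cdf(k1, n + 1) + gamma.sf(k2, n + 1) - alpha  # E{T phi} = alpha E{T}
    return [eq1, eq2]

# start from the equal-tailed cutoffs and refine
k1, k2 = fsolve(constraints, x0=[gamma.ppf(alpha / 2, n), gamma.ppf(1 - alpha / 2, n)])
print(round(k1, 3), round(k2, 3))  # reject when T < k1 or T > k2
```

The resulting cutoffs are not equal-tailed; the second constraint shifts them so that the power function has a stationary point at θ0.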
14.4 UMPU for normal models

The normal distribution has two parameters. Thus, what we have discussed so far does not even allow us to show the optimality of the most famous t-test. We will pick this topic up later.

14.5 Assignment problems

1. Let X be a sample of size n = 1 from a distribution with density function

f(x; θ) = 3θ⁻³(θ − x)² 1(0 < x < θ).

Let H0 : θ ≤ θ0 and H1 : θ > θ0.

(a) Verify that this distribution family has monotone likelihood ratio in x.
(b) Derive the UMP test of size α ∈ (0, 0.5). That is, specify the critical region or give the expression of the decision function φ(x).
(c) Prove that a UMP test is always an unbiased test.
(d) Analytically verify that the test given in (b) is unbiased.

2. Let X1, . . . , Xn be an i.i.d. sample from the Poisson distribution with mean parameter θ.

(a) Specify the analytical form of the UMPU test of size α for testing H0 : θ = 1 versus H1 : θ ≠ 1.
(b) Use R to get the critical region numerically when n = 9 and α = 0.1.
(c) A conventional rather than UMPU test would have selected k1, k2, c1, c2 such that

φ(x) = 1 when ∑ xi ∉ (k1, k2);  c1 when ∑ xi = k1;  c2 when ∑ xi = k2;  0 when ∑ xi ∈ (k1, k2)

and such that

pr(∑ Xi < k1) + c1 pr(∑ Xi = k1) = α/2,
pr(∑ Xi > k2) + c2 pr(∑ Xi = k2) = α/2.

Use R to get the critical region numerically when n = 9 and α = 0.1 for this test.
(d) Use computer simulation to confirm (or shed doubt on) the claim that the test in (b) is superior to the test given in (c).

3. Let X1, . . . , Xn be an i.i.d. sample from a distribution with density function given by

f(x; θ) = 2θ⁻²x exp(−x²/θ²) 1(x ≥ 0)

with parameter space Θ = (0, ∞).

(a) Derive the UMP test for H0 : θ ∉ (0.9, 1.2) versus H1 : θ ∈ (0.9, 1.2) when n = 10 and the size is α = 0.08. Hint: use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.5 and the type II error at θ = 1.1.

4. Let X1, . . . , Xn be an i.i.d. sample from a distribution with density function given by

f(x; θ) = 2θ⁻²x exp(−x²/θ²) 1(x ≥ 0)

with parameter space Θ = (0, ∞).

(a) Obtain the rejection region of the UMPU test for H0 : θ ∈ [1, 2] versus H1 : θ ∉ [1, 2] when n = 10. Hint: use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.5 and the type II error at θ = 2.5.

5. Let X1, . . . , Xn be an i.i.d. sample from a distribution with density function given by

f(x; θ) = 2θ⁻²x exp(−x²/θ²) 1(x ≥ 0)

with parameter space Θ = (0, ∞).

(a) Obtain the rejection region of the UMPU test for H0 : θ ∈ [0.9, 1.2] versus H1 : θ ∉ [0.9, 1.2] when n = 10 with α = 0.08. Hint: use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.1 and the type II error at θ = 1.5.

6. Let X1, . . . , Xn be a random sample from a N(µ0, σ² = θ) distribution, where 0 < θ < ∞ and µ0 is known.

(a) Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 can be based upon the statistic W = ∑(Xi − µ0)²/θ0.
(b) State the null distribution of W.
(c) Give an explicit rejection rule based on φ(W) and describe how to get the constants needed in φ(W).
(d) For n = 9, obtain these constants numerically for α = 0.05.

Chapter 15

Locally most powerful test

While the UMP theorems seem impressive mathematically, they are not broad enough. Besides being hard to find, UMP tests often do not exist unless the data are from some classical well-behaved parametric models. We have no choice but to relax the optimality requirements if we wish to recommend some effective methods for hypothesis testing in real-world applications.

Definition 15.1. Consider the simple null hypothesis H0 : {θ0} against H1 : θ > θ0 in a one-parameter setting. Let β(θ) be the power function of a test φ(x) of size α.
Suppose that for any other test φ*(x) of size α, there exists an ε > 0 such that E{φ*(X); θ} ≤ β(θ) for all θ ∈ (θ0, θ0 + ε). Then we say φ(X) is locally most powerful.

There are a number of easily missed details in this definition. One is that the locally most powerful criterion is only applicable to one-parameter distribution families. In addition, it is restricted to a specific type of null and alternative hypothesis: the null hypothesis contains a single distribution and the alternative hypothesis is one-sided. An immediate question after this definition is: under what conditions does such a locally most powerful test exist? We give the answer in the next section.

15.1 Score test and its local optimality

We give a straightforward theorem about existence as follows.

Theorem 15.1. Let {f(x; θ) : θ ∈ Θ} be a regular statistical model with score function defined to be

S(θ; x) = ∂ log f(x; θ)/∂θ.

We consider the case where Θ is an interval of real numbers. A test defined by φ(x) = 1(S(θ0; x) > k) is a locally most powerful test for H0 : {θ0} against H1 : θ > θ0 among the tests with size α = E{φ(X); θ0}.

Remark: For mathematical simplicity, we have totally ignored the requirement of a pre-specified size α. The test defined above is a test of whatever size it ends up with. Also, even though the result is presented as if applicable only when there is a single observation x, it is applicable when x is a vector, particularly when it is a vector made of i.i.d. observations. In that case, we use S(xi; θ) for the contribution of the ith sample, and the overall score function is ∑i S(xi; θ). We have switched the two entries of S(·, ·) because we intend to study its randomness induced by the randomness of X. If we intend to study it as a function of θ given some observed value x, we use S(θ; x).
Being regular for a model here means that for any integrable T(X),

E{T(X); θ} = ∫ T(x)f(x; θ)dν(x)

is differentiable with respect to θ and the derivative can be taken inside the integration sign. In simple words, the order of differentiation and integration can be exchanged without altering the outcome.

The local optimality holds only for a simple null hypothesis against the one-sided alternative; it is lost immediately if either of these conditions is violated. Nevertheless, the above score test itself is broadly applicable.

Proof: Being locally most powerful is the same as requiring β(θ) = E{φ(X); θ} to have the largest possible derivative at θ = θ0 among all tests of size α. Thus, we show the test defined by φ(x) = 1(S(θ0; x) > k) makes β(θ) have the largest derivative. Let φ*(x) be another test of the same size. Then

{φ(x) − φ*(x)}{S(θ0; x) − k} ≥ 0.

Taking expectations under the distribution f(x; θ0), and noticing E{φ(X) − φ*(X); θ0} = 0, we get

E{[φ(X) − φ*(X)]S(θ0; X)} ≥ 0.

Under regularity conditions, the left-hand side is the difference of the derivatives of the two power functions. ♦

The proof seems not tight enough: a problem occurs when E{[φ(X) − φ*(X)]S(θ0; X)} = 0. Further investigation reveals that this occurs only if φ(x) = φ*(x) with probability one with respect to f(x; θ0).

Example 15.1. Let X1, . . . , Xn be an i.i.d. sample from the Cauchy distribution with

f(x; θ) = 1/[π{1 + (x − θ)²}].

Consider the test for H0 : θ = 0 against H1 : θ > 0. The locally most powerful test is φ(x) = 1(2∑ xi/(1 + xi²) > k) for some k such that the test has the required size.

The distribution of ∑ Xi/(1 + Xi²) is not well investigated, and there is no simple way to compute the k value with which the size requirement is met precisely. However, it is easy to show that ∑ Xi/(1 + Xi²) is asymptotically N(0, n/8). Thus, when n is large (say larger than 20), we may use the normal approximation to get a k value so that the size of the test is close to the required size.

From this example, we notice that the above discussion leaves out a practical consideration: choosing the constant k so that the test can be implemented in a real-world problem. A general principle is to work out the distribution of the score function ∑i S(xi; θ0) and let k be its (1 − α)th quantile. If ∑i S(xi; θ0) has a discrete distribution, one may use randomization to achieve exact size α. Apparently, randomization is not so important in real-world applications.

When n is not large, the normal approximation "can be" used, but the precision "may be" poor. Such a problem will probably not occur in real-world applications. If we have to work on such a problem, then one may simulate the distribution of ∑ Xi/(1 + Xi²). For instance, when n = 10, the 95% quantile of the normal distribution with variance n/8 is 1.839. Based on a simulation with 100,000 data sets, the observed 95% sample quantile is 1.848. It turns out that the normal approximation is not poor at all in this particular example. One reason is that even though the Cauchy distribution does not have even a first moment, the random variable X/(1 + X²) is bounded and has a symmetric distribution. Hence, the normal approximation works nicely.

15.2 General score test

The locally most powerful test we gave in the last section is a score test. When the model assumption f(x; θ) is correct, we have E{S(X; θ); θ} = 0 for any θ under regularity conditions. To emphasize the random aspect of the score function, we put x ahead of θ in the score function in this section. If a statistician is asked to judge whether or not θ = θ0 is a plausible value, he or she could take a look at the value

∑ᵢ₌₁ⁿ S(xi; θ0)

where the summation is needed when we are given a set of i.i.d. observations.
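The quantile comparison in Example 15.1 can be reproduced by simulation. A Python sketch (the notes use R; the seed and array shapes are incidental choices):

```python
# The 95% quantile of T = sum X_i/(1 + X_i^2) under the null Cauchy(0) model,
# by simulation, versus the N(0, n/8) approximation; n = 10 and 100,000
# replicates as in Example 15.1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 10, 100_000

x = rng.standard_cauchy((reps, n))
t = np.sum(x / (1 + x**2), axis=1)

q_sim = np.quantile(t, 0.95)
q_normal = norm.ppf(0.95, loc=0, scale=np.sqrt(n / 8))
print(round(q_sim, 3), round(q_normal, 3))  # both close to 1.84
```
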
From the pure significance test point of view, this is an informative statistic about whether θ0 is an acceptable value. More specifically, if θ is the only parameter under consideration, the null hypothesis is θ = θ0 and the alternative is θ ≠ θ0, then the score test rejects the null hypothesis when |∑ S(xi; θ0)| > k for some k. When the sample size n is large enough, say over 20, we use the central limit theorem to decide the value k so that the test has size approximately equal to the specified α.

Note that the alternative hypothesis in this section is H1 : θ ≠ θ0, which is two-sided. Because of this, there generally exists no locally most powerful test similar to the one discussed in the beginning. Another point is that the rejection region is more conveniently defined as

{(x1, . . . , xn) : {∑ S(xi; θ0)}² > k}

for some k. Let I(θ) = E{S(X; θ)}² be the Fisher information of a single observation. Under some conditions, {∑ S(xi; θ0)}²/{nI(θ0)} has a chisquare limiting distribution with one degree of freedom. This result is often used to find an approximate k value so that the test has (approximately) the required size. Once we leave the territory of optimality considerations, we generally do not make a fuss about "randomization" to ensure the size of the test is exactly as pre-specified.

The above suggestion for the score test also works when θ is a vector parameter. Suppose we can split a vector parameter as θ = (ξ, η) and wish to test H0 : ξ = ξ0. Note that the null hypothesis becomes composite: it contains a set of distributions instead of a single one. One may work out

S(x; ξ0, η) = ∂ log f(x; ξ, η)/∂ξ evaluated at ξ = ξ0

and build a test statistic based on

∑ᵢ₌₁ⁿ S(xi; ξ0, η̂0)

where η̂0 is the MLE of η given ξ = ξ0; that is, η̂0 = arg max{ℓn(ξ0, η) : η}. To effectively use this statistic, we need to find out its distribution, at least the asymptotic one, in order to specify the rejection region so that the size of the test meets the specification.
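As a sketch of the two-sided score test with the chisquare approximation, take the exponential-mean model, where S(x; θ) = −1/θ + x/θ² and the per-observation information is I(θ) = 1/θ²; the statistic {∑ S(xi; θ0)}²/{nI(θ0)} then simplifies to n(x̄ − θ0)²/θ0². In Python (the data, the helper name, and α = 0.05 are illustrative, not from the notes):

```python
# Two-sided score test for H0: theta = theta0 in the exponential-mean model,
# using the chi-square(1) approximation for the squared standardized score.
import numpy as np
from scipy.stats import chi2

def score_test_exponential(x, theta0, alpha=0.05):
    n = len(x)
    score_sum = np.sum(-1.0 / theta0 + x / theta0**2)
    stat = score_sum**2 / (n / theta0**2)  # = n (xbar - theta0)^2 / theta0^2
    return stat, stat > chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50)
stat, reject = score_test_exponential(x, theta0=1.0)
```

The same recipe applies to any regular one-parameter family: square the summed score and divide by the total information nI(θ0).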
We leave further details on the asymptotic distribution to a later chapter. In general, if there is a function g(x; θ) such that E{g(X; θ); θ} = 0 for all θ ∈ H0, then the value of

T = inf_{θ∈H0} |∑ g(xi; θ)|

could be used as some kind of statistic for a "pure significance test". Among all such choices, the score test is optimal in some sense.

15.3 Implementation remark

Whether we test against a one-sided alternative or a two-sided one, we need to give a "rejection region", which is linked to the choice of k given the desired size of the test. Particularly in assignment problems, the distribution of the score function at θ = θ0 under the null hypothesis may be a member of a well-known distribution family, so a constant k can be selected accordingly without much difficulty. In more realistic situations, we generally use the limiting distribution of the score function at θ = θ0 to choose a k such that the size of the test is approximately α.

When both approaches are feasible, the first choice is preferred. Yet this does not mean the second choice is wrong: it is an approximate answer. The approximation may not be accurate when the sample size is not very large. Thus, we do not recommend computing the value k based on the limiting distribution unless the sample size is reasonably large. This recommendation is not applicable to classroom, assignment or textbook problems. In real-world situations, use simulation to decide how good the approximation is and whether it is satisfactory given the current sample size.

15.4 Assignment problems

1. Let X1, . . . , Xn be i.i.d. observations from the Cauchy distribution with density function

f(x; θ) = 1/[π{1 + (x − θ)²}]

with location parameter θ ∈ R. We wish to test the hypothesis H0 : θ = 0 against an alternative to be specified. We set the size of the test at α = 0.05.

(a) Derive the score test statistic against the alternative H1 : θ > 0. If n = 10 and the size of the test is set at α = 0.05, specify the rejection region based on the asymptotic normality of the test statistic.
(b) Derive the score test statistic against the alternative H1 : θ < 0.
(c) Suppose one chooses Tn, the sample median, as his/her test statistic against H1 : θ > 0. If n = 10 and the size of the test is set at α = 0.05, specify the rejection region. Remark: the sample median is asymptotically normal. Let ξ̂n and ξ be the sample and population medians. We have √n(ξ̂n − ξ) → N(0, 1/{4f²(ξ)}) in distribution, where f(x) is the density function of the corresponding distribution.
(d) Following (c), what would be the rejection region if the hypotheses are replaced by H0 : θ = 6 against H1 : θ > 6? Remark: make use of the invariance property.
(e) Use computer simulation to compare the powers of the tests in (a) and (c) at θ = 0.2. Remark: generate at least 20K data sets so that the simulation error is most likely below 0.3%.

2. Suppose we have one observation from the Binomial distribution with parameters m = 50 and probability of success p. We set the size of the test at α = 0.05.

(i) Obtain the locally most powerful test for H0 : p = 0.3 versus H1 : p > 0.3.
(ii) Obtain the score test for H0 : p = 0.3 versus H1 : p ≠ 0.3.

Remark: (a) use a test statistic and critical value to present the tests; (b) obtain the rejection region but ignore the need for randomization in (i), and for convenience use the normal and chisquare limiting distributions to determine the critical values; (c) use the software R to obtain numerical values for the given m and α.

Chapter 16

Likelihood ratio test

The conclusion of the famous Neyman–Pearson lemma may not be too useful when we must work with more complex models. However, it tells us that the "optimal metric" in testing a null model containing only a single distribution with parameter value θ0 against an alternative which also contains only a single distribution with parameter value θ1 is their relative likelihood size. This motivates the likelihood ratio test.

16.1 Likelihood ratio test: as a pure procedure

Let us consider the situation where we have a random sample from a distribution that belongs to a parametric distribution family {f(x; θ) : θ ∈ Θ}. Let H0 and H1 be subsets of Θ. In common practice, we take H1 = Θ − H0, the complement of H0.
Hence, when both H0 and H1 are explicitly specified in a problem, we also automatically set Θ = H0 ∪ H1. Let Ln(θ) = ∏ᵢ₌₁ⁿ f(Xi; θ) be the likelihood function of θ defined on Θ under the commonly assumed i.i.d. setting. We call {f(x; θ) : θ ∈ H0}, or simply H0, the null model. We also call H1 the alternative model, but things will be slightly different, as will be seen.

The distribution in H0 that fits the data best from the likelihood angle is the one with the θ value that maximizes Ln(θ) within H0. Let θ̂0 be this maximizer. Similarly, the best value under the alternative model is the one that maximizes Ln(θ) for θ ∈ H1. Yet we do not directly utilize this value subsequently. Instead, we let θ̂1 be the maximizer of Ln(θ) over Θ = H0 ∪ H1, the entire parameter space. Namely, θ̂1 is the MLE under the full model. For the definitions of θ̂0 and θ̂1 to be viable, the suprema under the null and full models must be attained at some parameter values. This is generally true, and we assume it is the case without truly losing much generality; this technical issue has an easy fix.

The commonly used likelihood ratio statistic in the literature is defined to be

Λn = Ln(θ̂0)/Ln(θ̂1) = exp{sup_{θ∈H0} ℓn(θ) − sup_{θ∈Θ} ℓn(θ)}.

The likelihood ratio test statistic is defined to be

Rn = −2 log Λn = 2{ℓn(θ̂1) − ℓn(θ̂0)}

where we have used the log-likelihood function ℓn(θ) = log Ln(θ) = ∑i log f(xi; θ). The multiplicative factor 2 in Rn does not play a role in defining a test; it makes the limiting distribution of Rn a neat chisquare under regularity conditions to be seen. We define the likelihood ratio test as

φ(x) = 1{Rn ≥ c}

for some c such that the test has the pre-specified size. From now on, we will not pay attention to the situation where Rn has a discrete distribution. More precisely, the test will be regarded as if randomization is never needed to make the size of the test exactly the same as pre-specified.
One reason for this convention is that finding the precise critical value c is generally difficult, or numerically infeasible, even without this complication. In addition, when the sample size n is large, we have the following result due to Wilks that works well enough. This result enables us to come up with an approximate critical value. If it is an approximation already, it is pointless to add another layer of approximation.

From the data analysis point of view, we have to go over several steps to perform a likelihood ratio test.

1. Understand the data structure and come to an agreed model from which the data were supposedly collected.
2. Work out the likelihood function. Identify the null and alternative hypotheses from the application background.
3. Numerically find the MLE of the unknown parameters under the null and the full models. Numerically obtain the value of the likelihood ratio statistic Rn.
4. Based on the user-specified size of the test, α, and our knowledge of the sampling distribution of Rn under the null model, determine the rightful c such that rejection of the null model is recommended when Rn ≥ c. We may instead compute pr(Rn > Robs) and report this value as the p-value for this test and the specific pair of hypotheses, leaving the decision to the user.

The model choice should be made after a thorough scientific understanding of the applied problem, and the statistical properties of the model should reflect this understanding; this is a topic in Statistical Consulting courses. We mostly discuss situations where the observations are i.i.d. here. Point 2 is generally a topic in specialized courses such as "Generalized Linear Models". Numerical computation can be fitted into a specialized course in statistics or computer science. A course in mathematical statistics focuses on the last point: how we determine the appropriate value of c or compute p-values symbolically.
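Steps 3 and 4 above can be sketched for the normal model N(θ, σ²) with both parameters unknown and H0 : θ = 0 (the setting of Example 16.2 below); the data and function name here are illustrative only:

```python
# Likelihood ratio test for H0: theta = 0 in N(theta, sigma^2), both unknown.
# Profiling out sigma^2 gives Rn = n log(s2_null / s2_full).
import numpy as np
from scipy.stats import chi2

def lrt_normal_mean(x, theta0=0.0):
    n = len(x)
    s2_full = np.mean((x - np.mean(x))**2)  # MLE of sigma^2 under the full model
    s2_null = np.mean((x - theta0)**2)      # MLE of sigma^2 under H0
    rn = n * np.log(s2_null / s2_full)      # Rn = 2{l(theta_hat) - l(theta0_hat)}
    return rn, chi2.sf(rn, df=1)            # approximate p-value, d = 1

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=200)
rn, pval = lrt_normal_mean(x)
```

Here d = 1 because H0 fixes one coordinate of a two-dimensional parameter, matching the theorem in the next section.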
16.2 Wilks theorem under regularity conditions

The likelihood ratio test is most popular in applications not only because it is "optimal" due to the Neyman–Pearson lemma, but also because the distribution of its test statistic is "model-free" when the sample size n is very large and the model and hypotheses are regular. Here is a simplified version of the elegant result by Wilks (1938).

Theorem 16.1. Suppose H0 is an open subset of an m-dimensional subspace of Θ and Θ is an open subset of R^{m+d}. Under some regularity conditions, and assuming the data set is made of n i.i.d. observations, we have

pr{Rn ≤ t} → pr{Z₁² + Z₂² + · · · + Z_d² ≤ t}

as the sample size n → ∞, under any null model θ = θ0 ∈ H0. We have used Z1, . . . , Zd for a set of i.i.d. standard normal random variables.

Based on the above theorem, when n is large, a test with approximate size α is obtained by choosing the critical value c = χ²_{d,1−α}, the (1 − α)th quantile of the chisquare distribution with d degrees of freedom. When H0 contains many distributions, this theorem says that whichever θ0 ∈ H0 is the specific distribution that generated the data, the distribution of Rn stays the same, asymptotically.

In many research papers, the distribution of the test statistic is often referred to as the distribution of the test. Such a statement is not rigorous, but does not seem to cause many problems. If you get confused, it can be helpful to question the meaning of this statement. We do not give a proof nor list the conditions at the moment. Let us examine a few examples of the likelihood ratio test.

Example 16.1. Consider the exponential distribution model with mean parameter θ and parameter space R⁺, and a hypothesis testing problem in which H0 : θ = 1 and H1 : θ ≠ 1. Given a random sample of observations, we find θ̂1 = X̄n. Since H0 contains a single distribution, we have θ̂0 = 1. The likelihood ratio statistic is given by

Rn = −2n{log X̄n − (X̄n − 1)}.
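The chisquare approximation for this Rn can be checked by simulation. A Python sketch (the notes use R; n = 50 and 20,000 replicates are illustrative choices):

```python
# Under H0: theta = 1, Rn = 2n{(xbar - 1) - log xbar}; with critical value
# 3.841 the rejection rate should be near 0.05 if the chi-square(1)
# approximation is adequate.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 20_000

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
rn = 2 * n * ((xbar - 1) - np.log(xbar))
print(round(np.mean(rn >= 3.841), 3))  # close to 0.05
```
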
Under the null hypothesis, it is known that X̄n → 1 almost surely. Thus, we have, approximately,

2n{(X̄n − 1) − log X̄n} = n(X̄n − 1)² + op(1)

where op(1) is an asymptotically negligible random quantity; more precisely, op(1) is a random quantity that goes to 0 in probability (see the definition in an earlier chapter). By the central limit theorem, √n(X̄n − 1) is asymptotically N(0, 1) under the null hypothesis θ = 1. Using Slutsky's theorem, we find that Rn is asymptotically χ²₁. Because of this, an approximate rejection/critical region for a size-0.05 likelihood ratio test is C = {x : Rn ≥ 3.841}. In the form of a test function, φ(x) = 1{Rn ≥ 3.841}.

Suppose we instead take H1 to be the set of θ-values larger than 1 and subsequently regard the parameter space as Θ = [1, ∞). If so, the MLE of θ under H1 is no longer always X̄n. In this case, the limiting distribution of Rn is not χ²₁. We will see that the regularity conditions are not satisfied with this H1. That is, Theorem 16.1 does not always apply.

Example 16.2. Consider the test problem where an i.i.d. sample is from N(θ, σ²) and H0 : θ = 0 against H1 : θ ≠ 0. The MLE under H1 is given by θ̂ = X̄n and σ̂²n = n⁻¹∑(Xi − X̄n)². Under the null hypothesis, the MLE of σ² is σ̂²0 = n⁻¹∑ Xi². It is not too hard to find that

Rn = n log[1 + (σ̂²0 − σ̂²n)/σ̂²n] ≈ nX̄²n/σ̂²n.

Thus, its limiting distribution is χ²₁.

There are many reasons why the likelihood ratio test is preferred by statisticians and practitioners. Let me try to give you a list of those I am aware of.

a) Because the limiting distribution of the likelihood ratio statistic under regularity conditions is chisquare, it does not depend on unknown parameters. We say that it is asymptotically pivotal. One may recall that one of the two preferred properties of a test statistic is that it has a sampling distribution free of unknown parameters under the null hypothesis.
b) Due to the Neyman–Pearson lemma, we believe that the LRT is nearly "most powerful". The claim is unproven, and likely false. Yet when lacking any evidence to the contrary, we love to believe the power of the LRT is superior.

c) Whether a limiting distribution is useful or not for statistical inference depends on how closely it approximates the finite-sample distribution when the sample size is in the range that often occurs in applications. For example, if a clinical trial typically recruits 200 patients, then the limiting distribution is useful when it provides a good approximation at n = 200. It would not be so useful if the approximation remained poor until n = 2000. There is a general belief that the chisquare approximation for the LRT is often good for moderate n.

d) The LRT is invariant to parameter transformations. If a one-to-one transformation is applied to θ to get ξ = g(θ), the LRT remains the same when testing g(H0) against g(H1). Note that I am regarding H0 and H1 as subsets of parameter values. If one makes a one-to-one transformation of the data, the inference conclusion based on the likelihood approach will also remain the same. A user should be aware that a data transformation changes the model; keep this in mind before making use of this claim.

Let us also point out that the LRT is often abused. The asymptotic chisquare distribution is valid only if (a) the true value of the parameter is an interior point of the parameter space; (b) the distribution family is regular; (c) the observations are i.i.d. The result may still be valid when (c) is violated, yet the validity depends on the structure of the model, which should be examined before the LRT together with the chisquare approximation is used. If (a) is violated, the result is almost surely void. If (b) is violated, we probably should not use the LRT, although there are examples, I believe, in which the asymptotic result remains valid. Yet there is no reason to assume so in general.
Example 16.3. Suppose we have an i.i.d. sample from

f(x; π) = (1 − π)N(0, 1) + πN(1, 1).

The parameter space is [0, 1]. Suppose we want to test H0 : π = 0 against H1 : π > 0. Under the null model, that is, assuming the true value is π = 0, the MLE satisfies π̂ = 0 with probability approximately 0.5. Because of this, the limiting distribution of the likelihood ratio statistic equals 0 with probability 0.5. Hence, the chisquare limiting distribution does not apply. The reason for the failure is that π = 0 is on the boundary of the parameter space. You may work out the asymptotics involved in the above example following mathematical principles; we do not provide details here.

16.3 Asymptotic chisquare of the LRT statistic

Let us consider the simplest case, where H0 = {θ0} and θ is one-dimensional. In this case, the LRT statistic is

Rn = 2{ℓn(θ̂) − ℓn(θ0)}.

We carry out a simplistic proof for the situation where the MLE is consistent; thus, it is within an infinitesimal neighbourhood of θ0. We do not spell out, but assume, that the model under consideration is regular. The more technical discussion will be given in subsequent chapters. Applying Taylor's expansion, we have

ℓn(θ0) = ℓn(θ̂) + ℓ′n(θ̂)(θ0 − θ̂) + (1/2)ℓ″n(θ̃)(θ0 − θ̂)².

However, being the MLE, θ̂ makes ℓ′n(θ̂) = 0. In addition, by consistency, we find

n⁻¹ℓ″n(θ̃) = n⁻¹ℓ″n(θ0) + op(1) = −I(θ0) + op(1)

where I(·) is the Fisher information. Hence, we find

Rn = 2{ℓn(θ̂) − ℓn(θ0)} = {nI(θ0) + op(n)}(θ0 − θ̂)².

Recalling that

√n(θ̂ − θ0) = n^{−1/2}I⁻¹(θ0)Sn(θ0) + op(1)

we get

Rn = I⁻¹(θ0){n^{−1/2}Sn(θ0)}² + op(1).

Because n^{−1/2}Sn(θ0) → N(0, I(θ0)) in distribution, we find Rn → χ²₁ in distribution.

16.4 Assignment problems

1. Suppose we have one bivariate normally distributed observation (X1, X2) with mean µ = (µ1, µ2) and identity variance matrix. Consider the likelihood ratio test for

H0 : (µ1 ≤ 0) and (µ2 ≤ 0);  H1 : µ1 > 0 or µ2 > 0.

(i) Obtain the expression of the likelihood ratio test statistic R.
(ii) What is the distribution of R when (µ1, µ2) = (0, 0)? That is, obtain its cumulative distribution function. Hint: notation such as X⁺ = max{0, X} can be useful.

2. Let (Xi, Yi), i = 1, 2, . . . , n, be a set of i.i.d. bivariate observations with joint probability mass function

f(x, y; θ1, θ2) = (θ1^x θ2^y / (x! y!)) exp{−(θ1 + θ2)}.

(i) Obtain the analytical expression of the likelihood ratio test statistic Rn for H0 : θ1 = θ2 versus H1 : θ1 ≠ θ2.

(ii) Directly prove that Rn has a chisquare limiting distribution with one degree of freedom under the null model.

(iii) Let n = 50 and θ1 = θ2 = 2. Use computer simulation to find out how close the null rejection probability is to the nominal level 0.05.

(iv) Repeat (iii) when θ1 = θ2 = 20. Should the type I error in this case be closer to 0.05 than that of the test in (iii)? Explain why, and check whether it is.

Remark: Do not cite a generic result when you prove (ii). Based on the asymptotic result, the rejection region for a level 0.05 test is Rn > 3.84. Employ at least 20K repetitions in the simulation.

3. Let X1, X2, . . . , Xn be an i.i.d. sample from a distribution in the Poisson distribution family with mean parameter θ:

P(X = k) = (θ^k / k!) exp(−θ) for k = 0, 1, 2, . . ..

(a) Derive the expression of the likelihood ratio test statistic Rn for H0 : θ = 1 against H1 : θ ≠ 1.

(b) Verify Wilks' theorem: that Rn in (a) has a χ²₁ limiting distribution under H0.

(c) Derive the expression of the likelihood ratio test statistic Rn for H0 : θ = 1 against H1 : θ > 1.

(d) Derive the score function Sn(θ) of θ in the current context.

(e) Derive the score test statistic for H0 : θ = 1 against H1 : θ ≠ 1.

(f) Obtain the limiting distribution of the likelihood ratio test in (c).

4. Let X1, X2, . . . , Xn be an i.i.d. sample from a negative binomial distribution family with parameter θ and a known constant m (a positive integer):

P(X = x) = C(m + x − 1, x)(1 − θ)^x θ^m = c(m, x)(1 − θ)^x θ^m

for x = 0, 1, . . . and the parameter space Θ = (0, 1). Use X̄n as notation for the sample mean.

(a) Derive the expression of the likelihood ratio test statistic Rn for H0 : θ = 0.5 against H1 : θ ≠ 0.5.

(b) Verify Wilks' theorem: that Rn has a χ²₁ limiting distribution under H0.

Chapter 17

Likelihood with vector parameters

Consider the situation where we have a set of i.i.d. observations from a parametric family {f(x; θ) : θ ∈ Θ ⊂ R^d} for some positive integer d. The log likelihood function remains the same:

ℓn(θ) = Σ_{i=1}^n log f(xi; θ).

Note that the dimension of X is not an issue here. The score function is still

Sn(θ) = Σ_{i=1}^n ∂{log f(xi; θ)}/∂θ

but we should regard it as a vector. Having n observations in the i.i.d. setting will be assumed throughout this chapter.

The regularity conditions are the same, though sometimes we should interpret them "element-wise". In addition, the regularity conditions are required for both {f(x; θ) : θ ∈ H0} and {f(x; θ) : θ ∈ Θ}.

R0 The parameter space of θ is an open set of R^m or R^{m+d}.

R1 f(x; θ) is differentiable to order three with respect to θ at all x.

R2 For each θ0 ∈ Θ, there exist functions g(x), H(x) such that for all θ in a neighborhood N(θ0),

(i) |∂f(x; θ)/∂θ| ≤ g(x);
(ii) |∂²f(x; θ)/∂θ²| ≤ g(x);
(iii) |∂³ log f(x; θ)/∂θ³| ≤ H(x)

hold for all x, and

∫ g(x) dx < ∞; E0{H(X)} < ∞.
We have used E0 for the expectation calculated at θ0.

R3 For each θ ∈ Θ,

0 < Eθ{∂ log f(x; θ)/∂θ}² < ∞.

For vector θ, this inequality is interpreted as positive definiteness of the corresponding matrix.

Although the integration is stated with respect to dx, the results we are going to state remain valid if dx is replaced by some σ-finite measure. All conditions are stated as if required at all x; exceptions over a set of x of measure 0 (with respect to f(x; θ0)) are allowed.

Lemma 17.1. (1) Under regularity conditions, we have

Eθ{∂ log f(x; θ)/∂θ} = 0.

(2) Under regularity conditions, we also have

Eθ[{∂ log f(x; θ)/∂θ}{∂ log f(x; θ)/∂θ}ᵀ] = −Eθ{∂² log f(x; θ)/∂θ∂θᵀ}.

The proof of the above lemma remains the same as the one for one-dimensional θ. The second identity in this lemma is called the Bartlett identity in the literature.

Theorem 17.1. Suppose θ0 is the true parameter value. Under Conditions R0–R3, there exists a sequence θ̂n such that (i) Sn(θ̂n) = 0 almost surely; (ii) θ̂n → θ0 almost surely.

Proof. (i) Let ε be a small enough positive number. Consider a θ* value such that ‖θ* − θ0‖ = ε. That is, θ* is on the sphere centred at θ0 with radius ε. We aim to show that, almost surely,

ℓn(θ*) < ℓn(θ0)    (17.1)

simultaneously for all such θ*. If (17.1) is true, it implies that ℓn(θ) has a local maximum within this ball. Because the likelihood function is smooth, the derivative at this local maximum is 0. Hence, conclusion (i) is true.

Is (17.1) true? By Taylor's series, we have

ℓn(θ*) = ℓn(θ0) + {ℓ′n(θ0)}ᵀ(θ* − θ0) + (1/2)(θ* − θ0)ᵀ ℓ″n(θ̃)(θ* − θ0)

for some θ̃ in the ε-ball. It is known that ℓ′n(θ0) = Op(n^{1/2}). In addition, we have

n^{-1}ℓ″n(θ0) → −I(θ0)

almost surely. Here I(θ0) is the Fisher information, which is positive definite by R3. Activating R2(iii), it is easy to show that, almost surely,

sup_{θ*} n^{-1}‖ℓ″n(θ̃) − ℓ″n(θ0)‖ ≤ Cε

in some norm, for some constant C that is neither random nor dependent on θ*. These assessments lead to

ℓn(θ*) − ℓn(θ0) = {ℓ′n(θ0)}ᵀ(θ* − θ0) − (n/2)(θ* − θ0)ᵀ I(θ0)(θ* − θ0) + ε³O(n).
Roughly, the first term is of size n^{1/2}ε, the second is −nε², and the remainder is of size nε³. Thus, the overall size is determined by −nε², which is negative. This completes the proof of (i). The order assessments can be made rigorous but will not be given here.

(ii) is a direct consequence of (i).

This result is not equivalent to the consistency of the MLE, even in this special case. There exists a proof of the consistency of the MLE based on much more relaxed conditions. However, that proof is too complex to be explained clearly in this course.

17.1 Asymptotic normality of MLE after the consistency is established

Under the assumption that f(x; θ) is smooth and θ̂ is a consistent estimator of θ, we must have Sn(θ̂) = 0. By the mean-value theorem of mathematical analysis, we have

Sn(θ0) = Sn(θ̂) + S′n(θ̃)(θ0 − θ̂)

where θ̃ is a parameter value between θ0 and θ̂. This claim is not exactly true but is commonly accepted; a more rigorous proof is very similar conceptually but tedious in its details.

By one of the lemmas proved previously, we have

n^{-1}S′n(θ̃) → −I(θ0),

the Fisher information, almost surely. In addition, the classical multivariate central limit theorem can be applied to obtain

n^{-1/2}Sn(θ0) → N(0, I(θ0)).

Thus, by Slutsky's theorem, we find

√n(θ̂ − θ0) = n^{-1/2} I^{-1}(θ0) Sn(θ0) + op(1) → N(0, I^{-1}(θ0))

in distribution as n → ∞.

17.2 Asymptotic chisquare of LRT for composite hypotheses

Let us still consider the simplest case, where H0 = {θ0}, θ0 is an interior point of Θ, and Θ has dimension d. The alternative is θ ≠ θ0. Assume the regularity conditions are satisfied by the full model {f(x; θ) : θ ∈ Θ}. In this case, the LRT statistic is

Rn = 2{ℓn(θ̂) − ℓn(θ0)}.

Remember, we work on the case in which the MLE is consistent. Thus, it is within an infinitesimal neighborhood of θ0.
Applying Taylor's expansion, we have

ℓn(θ0) = ℓn(θ̂) + {ℓ′n(θ̂)}ᵀ(θ0 − θ̂) + (1/2)(θ0 − θ̂)ᵀ{ℓ″n(θ̃)}(θ0 − θ̂).

However, being the MLE, θ̂ makes ℓ′n(θ̂) = 0. In addition, with θ̂ being consistent, we find

n^{-1}ℓ″n(θ̃) = n^{-1}ℓ″n(θ0) + op(1) = −I(θ0) + op(1).

Hence, we find

Rn = 2{ℓn(θ̂) − ℓn(θ0)} = n(θ0 − θ̂)ᵀ{I(θ0) + op(1)}(θ0 − θ̂).

Recalling that

√n(θ̂ − θ0) = n^{-1/2} I^{-1}(θ0) Sn(θ0) + op(1)

we get

Rn = n^{-1} Snᵀ(θ0) I^{-1}(θ0) Sn(θ0) + op(1).

Because n^{-1/2} Sn(θ0) → N(0, I(θ0)), we find Rn → χ²_d in distribution.

Remark: d is the dimension difference between H0 and H1.

Counter-example. Suppose that we have an i.i.d. sample of size n from

(1 − γ)N(0, 1) + γN(2, 1)

where γ is the mixing proportion. We would like to test the hypothesis H0 : γ = 0 versus H1 : γ > 0. The log likelihood function is given by

ℓn(γ) = Σ_{i=1}^n log{1 + γ[exp(2xi − 2) − 1]}.

Here exp(2x − 2) is the density ratio of N(2, 1) to N(0, 1). We have

ℓ′n(γ) = Σ_{i=1}^n [exp(2xi − 2) − 1] / {1 + γ[exp(2xi − 2) − 1]}.

At γ = 0, we find

ℓ′n(0) = Σ_{i=1}^n {exp(2xi − 2) − 1}

which has expectation 0 under H0. According to the CLT, we find

P(ℓ′n(0) > 0) → 0.5

as n → ∞. It is clear that ℓ′n(γ) is a decreasing function of γ over γ > 0. Thus, when ℓ′n(0) < 0, we get ℓ′n(γ) < 0 for all γ > 0. These two facts imply that if the data are generated from H0, and we look for the MLE in general, we would find P(γ̂ = 0) → 0.5.

Case I: when ℓ′n(0) ≤ 0, we have γ̂ = 0. This further leads to Rn = 2{ℓn(γ̂) − ℓn(0)} = 0.

Case II: when ℓ′n(0) > 0, we have γ̂ > 0. It solves the equation

Σ_{i=1}^n [exp(2xi − 2) − 1] / {1 + γ[exp(2xi − 2) − 1]} = 0.

For brevity, let us assume the solution is in a small neighborhood of γ = 0. Thus, the above equation is approximated by

Σ_{i=1}^n {exp(2xi − 2) − 1} − γ Σ_{i=1}^n {exp(2xi − 2) − 1}² + op(n) = 0.

This leads to

γ̂ = Σ_{i=1}^n {exp(2xi − 2) − 1} / Σ_{i=1}^n {exp(2xi − 2) − 1}² + op(n^{-1/2}).

Consequently,

ℓn(γ̂) = [Σ_{i=1}^n {exp(2xi − 2) − 1}]² / [2 Σ_{i=1}^n {exp(2xi − 2) − 1}²] + op(1).
Combining the two cases, we can unify the expansion to

ℓn(γ̂) = {[Σ_{i=1}^n {exp(2xi − 2) − 1}]⁺}² / [2 Σ_{i=1}^n {exp(2xi − 2) − 1}²] + op(1).

As n → ∞, the limiting distribution of Rn = 2ℓn(γ̂) is given by that of (Z⁺)², which is often denoted as 0.5χ²₀ + 0.5χ²₁.

Moral of this example: The full model has parameter space Θ = [0, 1]. The null model has parameter space {0}. The parameter space under the full model is not an open subset of R. This invalidates the result obtained under the regularity conditions. Most people will tell you the reason for not having a chisquare limiting distribution is that the true value γ = 0 is not an interior point of Θ. This is a reasonable explanation, but it does not survive serious scrutiny.

In many applications, the i.i.d. assumption is violated. The regularity conditions are no longer sensible. Yet, particularly in biostatistics applications, users still regard the MLEs as asymptotically normal and the likelihood ratio statistics as asymptotically chisquare. Often, they are not wrong. At the same time, it is a worrisome trend that our scientific claims are built on a less and less solid foundation.

I hope that these lectures help you to get a sense of when the "chisquare" distribution is valid, and that you become able to rigorously establish whatever conclusions are needed in various applications rather than merely have an impression that some claims are true.

17.3 Asymptotic chisquare of LRT: one-step further

Write θᵀ = (θ1ᵀ, θ2ᵀ) so that θ1 is a vector of length d and θ2 is a vector of length m. The superscript ᵀ is to make all vectors column vectors. Consider the composite null hypothesis H0 that θ1 = 0 in the vector sense. The alternative is H1 : θ1 ≠ 0. The full model has θ a vector of length m + d, while the null model has θ living in a subspace of dimension m. Both spaces are open subsets of their corresponding Euclidean spaces R^{m+d} and R^m.
In this section, we denote θ0ᵀ = (θ10ᵀ, θ20ᵀ) as the true parameter vector whose corresponding distribution generated the data x1, . . . , xn. In addition, this θ0 is one of the parameter vectors in H0. We assume that θ0 is an interior point of the parameter space of H0 as usual. This is part of the regularity conditions that ensure the validity of the asymptotic result to be introduced.

We use θ̂ as the MLE of θ without placing any restrictions on the range of θ. We use θ̂0 as the MLE, or maximum point, of θ in the space of H0. The consistency results discussed before ensure that both θ̂ and θ̂0 almost surely converge to θ0 when H0 is true. When notationally necessary, they will be partitioned into (θ̂1ᵀ, θ̂2ᵀ)ᵀ and (θ̂01ᵀ, θ̂02ᵀ)ᵀ respectively. Of course, we have θ̂01 = 0 under the null hypothesis.

17.3.1 Some notational preparations

The Fisher information with respect to θ is now a matrix. We denote

I(θ) = E[{∂ log f(X; θ)/∂θ}{∂ log f(X; θ)/∂θ}ᵀ] = −E{∂² log f(X; θ)/∂θ∂θᵀ}.

The expectation is computed with the distribution of X given by f(x; θ): the same θ inside and out. This matrix can be partitioned into 4 blocks:

Iij(θ) = E[{∂ log f(X; θ)/∂θi}{∂ log f(X; θ)/∂θj}ᵀ] = −E{∂² log f(X; θ)/∂θi∂θjᵀ}

for i, j = 1, 2. In other words, we have

I(θ) = [ I11(θ)  I12(θ)
         I21(θ)  I22(θ) ].

The regularity conditions make I(θ) positive definite, which implies both I11 and I22 are positive definite.

The score function is now also a vector. Let us write

Snᵀ(θ) = (Sn1ᵀ, Sn2ᵀ) = Σ_{i=1}^n ( ∂ log f(xi; θ)/∂θ1ᵀ , ∂ log f(xi; θ)/∂θ2ᵀ ).

The superscripts ᵀ stand for transpose; they make every column vector a row vector and have no other practical purpose.

Matrix result. Let

I11,2 = I11 − I12 I22^{-1} I21.
It is laborious to verify that

I^{-1}(θ) = [ I            0 ] [ I11,2^{-1}   0        ] [ I   −I12 I22^{-1} ]
            [ −I22^{-1}I21  I ] [ 0            I22^{-1} ] [ 0    I            ]

where I by itself denotes an identity matrix of the proper size; we allow the same symbol I to denote identity matrices of different sizes when this causes no confusion.

Based on matrix theory, or by direct verification, we have

xᵀ I^{-1} x = (x1ᵀ − x2ᵀ I22^{-1} I21) I11,2^{-1} (x1 − I12 I22^{-1} x2) + x2ᵀ I22^{-1} x2

for any vector x of proper length and partition. Applying this matrix result to Sn and I, we find

Snᵀ I^{-1}(θ) Sn = (Sn1ᵀ − Sn2ᵀ I22^{-1} I21) I11,2^{-1} (Sn1 − I12 I22^{-1} Sn2) + Sn2ᵀ I22^{-1} Sn2.

It is known that n^{-1/2} Sn is asymptotically normal with covariance matrix given by I(θ). This implies that n^{-1/2}(Sn1 − I12 I22^{-1} Sn2) is asymptotically normal with covariance matrix I11,2. Hence, the first term

n^{-1} (Sn1ᵀ − Sn2ᵀ I22^{-1} I21) I11,2^{-1} (Sn1 − I12 I22^{-1} Sn2) → χ²_d

where d is the dimension of θ1.

Let us now use these results to prove the claim of the theorem. The LRT statistic now becomes

Rn = 2{ℓn(θ̂) − ℓn(θ̂0)} = 2{ℓn(θ̂) − ℓn(θ0)} − 2{ℓn(θ̂0) − ℓn(θ0)} = Rn1 − Rn2.

For the first term, we apparently have

Rn1 = n^{-1} Snᵀ(θ0) I^{-1}(θ0) Sn(θ0) + op(1).

Based on the same principle, we have

Rn2 = n^{-1} Sn2ᵀ(θ0) I22^{-1}(θ0) Sn2(θ0) + op(1).

Combining the two expansions, we find

Rn = n^{-1}[Snᵀ(θ0) I^{-1}(θ0) Sn(θ0) − Sn2ᵀ(θ0) I22^{-1}(θ0) Sn2(θ0)] + op(1).

With all the preparatory results already established, we have Rn → χ²_d in distribution as n → ∞. ♦

Final remark on regularity conditions: The regularity conditions at first look are placed on the full distribution family under consideration. A second look reveals that H0 forms a sub-family of distributions. We require the listed regularity conditions to be satisfied by the model formed by H0. The conditions on finite Fisher information and so on ensure the applicability of the Law of Large Numbers and the Central Limit Theorem, and ensure that the remainder terms in Taylor's expansion are of higher order.
These conditions do not influence the existence or the form of the limiting distributions at the various stages of the proof.

17.4 The most general case: final step

To highlight the fact that θ is a parameter vector, we use boldface θ in this section. The null hypothesis discussed in the last section can be expressed as H0 : Aθ = 0 with the specific matrix A = diag{1, 1, . . . , 1, 0, 0, . . . , 0}. Denote the number of 1's as d and the number of 0's as m. We can easily generalize this result to any matrix A of rank d and θ of length m + d.

It is well known in linear algebra that the solution set of Aθ = 0 forms a linear space of dimension m. There exist m + d linearly independent vectors ξ1, ξ2, . . . , ξ_{m+d} such that all solutions to Aθ = 0 can be expressed as

θ = λ1ξ1 + · · · + λmξm.

Namely, in the space of λ, H0 becomes

λ = (λ1, . . . , λm, λ_{m+1} = 0, . . . , λ_{m+d} = 0)

which is the same as the special case we have discussed. Namely, the conclusion Rn → χ²_d remains solid.

Most generally, assume the parameter space is a subset of R^{d+m}. The composite hypothesis is either expressed as

R(θ) = 0    (Hypothesis form I)

for a continuously differentiable vector-valued function R, or expressed as

θ = g(λ)    (Hypothesis form II)

for a continuously differentiable g(·). When it is in form I, denote the rank of the differential matrix at θ0 as d. Hence, it puts d constraints on the parameter in a small neighborhood of θ0. After this, based on the inverse function theorem, there exists a smooth function g such that, in a neighborhood of θ0, the solution to R(θ) = 0 can be written as θ = g(λ), where the dimension of λ is m.

In both cases, we may interpret the null hypothesis as setting d elements of θ to 0 and leaving m of them free. The same proof presented earlier makes Rn → χ²_d in distribution.
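The reduction from the linear constraint Aθ = 0 to a free parameter λ can be sketched numerically: a basis of the null space of A plays the role of ξ1, . . . , ξm. A minimal Python sketch follows; the particular matrix A is a made-up example, and the null-space basis is extracted from the singular value decomposition.

```python
import numpy as np

# H0: A theta = 0 with A of rank d and theta of length m + d. The rows of
# Vt beyond the first d span the null space of A, so their transpose Xi
# gives the reparameterization theta = Xi @ lam with free lam of length m.
A = np.array([[1.0, 0.0, 2.0, -1.0],
              [0.0, 1.0, -1.0, 3.0]])   # example: d = 2 constraints in R^4
d = np.linalg.matrix_rank(A)
_, _, Vt = np.linalg.svd(A)
Xi = Vt[d:].T                            # columns span {theta : A theta = 0}
m = Xi.shape[1]                          # m = 4 - d free directions

lam = np.array([0.7, -1.2])              # an arbitrary lam of length m
theta = Xi @ lam                         # a point of the null model
print(m, bool(np.allclose(A @ theta, 0.0)))
```

Any θ of the form Ξλ satisfies the constraint, and conversely, which is exactly the change of parameterization used in the argument above.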
The regularity conditions must be applicable to the distribution family formed by the parameters that solve R(θ) = 0. In particular:

• the true parameter value θ0 is an interior point of Θ and an interior point of the solution space of R(θ) = 0;

• there is a neighborhood of θ0 within Θ over which R(θ) = 0 admits a smooth solution θ = g(λ);

• there are neighborhoods of λ0 and θ0 respectively over which g(λ) is differentiable with a full-rank derivative matrix.

17.5 Statistical application of these results

The whole purpose of proving Rn → χ²_d is to test hypotheses in applications. As the size of Rn represents the departure from the null model, the test based on the likelihood ratio is mathematically given by

φ(x) = 1(Rn ≥ c)

and this c will be chosen as χ²_d(1 − α) for a size-α test.

Example 17.1. Suppose we have an i.i.d. sample from a trinomial distribution. That is, each outcome of a trial is one of three types. Let the corresponding probabilities of occurrence be p1, p2, p3. Clearly, p1 + p2 + p3 = 1. After n trials, we have n1, n2, n3 observations of the three types. The log likelihood function is given by

ℓn(p1, p2, p3) = n1 log p1 + n2 log p2 + n3 log p3.

The maximum likelihood estimators of these parameters are given by

p̂j = nj/n

for j = 1, 2, 3.

(i) Consider the test for pj = pj0 ≠ 0, j = 1, 2, 3, versus pj ≠ pj0 for at least one of j = 1, 2, 3. The likelihood ratio test statistic is apparently given by

Rn = 2n1 log(p̂1/p10) + 2n2 log(p̂2/p20) + 2n3 log(p̂3/p30) = 2n Σj p̂j log(p̂j/pj0).

According to our theorem on the LRT, when n → ∞, Rn is approximately χ²₂ distributed under the null model. The MLEs under this model are consistent and asymptotically normal. We have p̂j = pj0 + Op(n^{-1/2}). Therefore, we have

log(p̂j/pj0) = −log{1 − (p̂j − pj0)/p̂j} = (p̂j − pj0)/p̂j + (1/2)(p̂j − pj0)²/p̂j² + Op(n^{-3/2}).

Hence,

Rn = n Σj (p̂j − pj0)²/p̂j + Op(n^{-1/2}) = n Σj (p̂j − pj0)²/pj0 + Op(n^{-1/2}).
Note that the change from p̂j to pj0 in the second equality leads to a discrepancy of size Op(n^{-1/2}), which is understood as having been absorbed into the remainder term.

The leading term is the famous Pearson chisquare test statistic. It is often used for "goodness-of-fit" tests. Another version of this test appears as an assignment problem. The result remains similar if there are more than 3 categories. For the purpose of the assignment, we do not require a rigorous justification of why these Op(n^{-1/2}) terms are indeed Op(n^{-1/2}).

17.6 Assignment Problems

1. Suppose that X1, . . . , Xn are i.i.d. from the Weibull distribution with pdf

f(x; θ, γ) = θ^{-1} γ x^{γ-1} exp(−x^γ/θ)

with the range of x being x > 0. The parameter space is γ > 0 and θ > 0. Consider the problem of testing H0 : γ = 1 versus H1 : γ ≠ 1.

(a) Go over the regularity conditions one by one and confirm whether they are satisfied.

(b) Find the expression of the likelihood ratio test statistic as a function of γ̂, the MLE of γ under H1. Remark: a full analytical solution may not be possible.

(c) Generate a data set from the null model with θ = 1.5 and γ = 1, and compute the value of Rn:

    set.seed(2014561)
    y = rweibull(120, 1, 1.5)

Whoever is interested in this problem: obtain a histogram of Rn based on 2000 repetitions and a qq-plot against the χ²₁ distribution.

2. Let X1, . . . , Xn be i.i.d. from N(µ, σ²).

(a) Suppose that σ² = γµ² with unknown γ > 0 and µ ∈ R. Find an analytical expression of the likelihood ratio test statistic for H0 : γ = 1 versus H1 : γ ≠ 1.

(b) Repeat (a) when σ² = γµ with unknown γ > 0 and µ > 0.

(c) Are the regularity conditions satisfied (for the chisquare limiting distribution of the LRT)?

3. Consider the 2 × 3 table that is often encountered in many applications. The outcomes of n objects are summarized as

    counts   I     II    III
    a        n11   n12   n13
    b        n21   n22   n23
The problem of interest is to see whether the attribute in terms of being a or b is independent of the attribute in terms of categories I, II and III. Let pij be the probability that a random subject falls into cell (i, j).

(i) Derive the likelihood ratio test statistic for the null hypothesis pij = pi·p·j, where pi· and p·j are marginal probabilities, against the alternative pij ≠ pi·p·j. Identify (rather than prove) the limiting distribution of this statistic as n = Σij nij → ∞.

(ii) Show that this statistic is asymptotically equivalent to the Pearson chisquare test statistic

n Σ_{i,j} (p̂ij − p̂i·p̂·j)²/p̂ij

where p̂i· = Σj nij/n, p̂·j = Σi nij/n and p̂ij = nij/n. That is, the difference between the two statistics has limit 0 under H0.

4. Let us perform a simulation study to check the conclusion of Q3. Let n = 300 and simulate the table repeatedly N = 10,000 times.

(a) Simulate from pij = {(0.3, 0.7) × (0.2, 0.35, 0.45)} and obtain the value of the likelihood ratio test statistic Rn. Record all Rn values and draw a QQ plot against the null limiting distribution. Report the simulated rejection rate for the size 0.05 likelihood ratio test.

(b) Simulate from (p11, p12, p13, p21, p22, p23) = (0.1, 0.15, 0.4, 0.2, 0.1, 0.05) and obtain the value of the likelihood ratio test statistic Rn. Record all Rn values and draw a QQ plot against the null limiting distribution. Report the simulated rejection rate for the size 0.05 likelihood ratio test.

Chapter 18

Wald and Score tests

We discuss two types of tests that are closely related to the likelihood ratio test in this chapter.

18.1 Wald test

We still consider the situation where n i.i.d. observations from a model {f(x; θ) : θ ∈ Θ} are provided. Under regularity conditions, we have shown that the MLE of θ is asymptotically normal. That is,

√n(θ̂n − θ) → N(0, I^{-1}(θ))

in distribution as the sample size n → ∞. Note that this claim implicitly assumes that the true parameter value is the same θ as in these expressions.

Because of the above generically applicable asymptotic normality, to test the simple null hypothesis H0 : θ = θ0 against H1 : θ ≠ θ0, we may define

Wn(θ0) = n(θ̂n − θ0)ᵀ I(θ0)(θ̂n − θ0)

which approximately has the two desired properties of a test statistic. We reject H0 when Wn(θ0) ≥ χ²_d(1 − α) in favour of the generic alternative H1 : θ ≠ θ0.

One may notice that the most crucial requirement for the validity of this test is

√n(θ̂n − θ) → N(0, I^{-1}(θ))

in distribution as n → ∞. It is not crucial for θ̂n to be the MLE nor for the matrix I to be the Fisher information. Hence, the Wald test is more generally applicable.
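The construction above can be illustrated with a small Monte Carlo sketch (Python for concreteness; the notes' own snippets use R). For i.i.d. Poisson(θ) data, the MLE is the sample mean and I(θ) = 1/θ, so Wn = n(θ̂n − θ0)²/θ0. The settings (θ0 = 2, n = 200, 2000 repetitions) are arbitrary illustrative choices; the simulated null rejection rate should sit near the nominal 0.05.

```python
import numpy as np

# Monte Carlo check of the Wald test size for H0: theta = theta0 with
# i.i.d. Poisson(theta) data; I(theta0) = 1/theta0.
rng = np.random.default_rng(7)
theta0, n, reps = 2.0, 200, 2000
crit = 3.84  # chi^2_1 upper 0.05 quantile, as used earlier in the notes

rejections = 0
for _ in range(reps):
    x = rng.poisson(theta0, size=n)
    Wn = n * (x.mean() - theta0) ** 2 / theta0
    rejections += Wn >= crit
rate = rejections / reps
print(round(rate, 3))  # close to the nominal level 0.05
```

Only the asymptotic normality of θ̂n and a consistent variance estimate are used, which is the sense in which the Wald construction is more general than the likelihood ratio test.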
18.1.1 Variations of Wald test in the aspect of Fisher information

Because replacing I(θ0) with any of its consistent estimators does not change the limiting distribution of Wn, such manipulations lead to many versions of the Wald test.

1. We may replace I(θ0) in Wn by

Î n(θ0) = −(1/n) Σ_{i=1}^n ∂² log f(xi; θ)/∂θ² evaluated at θ = θ0.

This expression may be called the observed Fisher information at θ0. This change simplifies the test when the analytical form of the Fisher information matrix is too complex to find.

2. We may replace I(θ0) in Wn by

Î n(θ̂) = −(1/n) Σ_{i=1}^n ∂² log f(xi; θ)/∂θ² evaluated at θ = θ̂

where θ̂ is the MLE of θ. Note that we ignore the given value θ0 in favour of an estimated value. This quantity is often computed when the MLE is obtained by iterative methods such as Newton–Raphson.

3. We may replace I(θ0) in Wn by I(θ̂) where θ̂ is the MLE of θ. Note that we again ignore the given value θ0 in favour of an estimated value.

4. When the regularity conditions are satisfied, we may replace I(θ0) by

(1/n) Σ_{i=1}^n {∂ log f(xi; θ)/∂θ}{∂ log f(xi; θ)/∂θ}ᵀ evaluated at θ = θ0.

Unlike the earlier choices, this quantity is always positive (or at least non-negative) definite.

18.1.2 Variations of Wald test in the aspect of H0

The Wald test we introduced works for a simple null hypothesis. Suppose the vector parameter θ can be written as θᵀ = (θ1ᵀ, θ2ᵀ). To fix ideas, we denote the dimension of θ as d + k and the dimensions of θ1 and θ2 as d and k. Consider the problem of testing H0 : θ1 = θ10 against H1 : θ1 ≠ θ10. Assume the regularity conditions are satisfied by this model and these hypotheses. In this case, the null hypothesis is composite, as opposed to simple as in the last section.

Let θ̂n be the MLE over the whole parameter space Θ and θ̂nᵀ = (θ̂1nᵀ, θ̂2nᵀ) be the corresponding partition. Because θ̂n is asymptotically normal, so is any of its sub-vectors (or linear combinations).
This implies

√n(θ̂1n − θ10) → N(0, I¹¹(θ0))

in distribution, where I¹¹ is the upper-left block of I^{-1} corresponding to θ1. This leads to a sensible test statistic

Wn(θ10) = n(θ̂1n − θ10)ᵀ{I¹¹(θ̂)}^{-1}(θ̂1n − θ10).    (18.1)

Clearly, we have Wn(θ10) → χ²_d with d being the dimension of θ1. A test of approximate size α is therefore given by

φ(X) = 1 when Wn(θ10) ≥ χ²_d(1 − α), and 0 otherwise.

Rather than defining Wn(θ10) as in (18.1), one may try to use

n(θ̂1n − θ10)ᵀ{I¹¹(θ0)}^{-1}(θ̂1n − θ10).

The above quantity is in fact not a statistic, because we do not know the value of θ0 even when H0 is true: the null hypothesis H0 here is composite. There are, however, many well-justified choices in the place of I¹¹(θ̂). For instance, one may fix θ1 = θ10 and obtain the restricted MLE θ̂20, that is,

θ̂20 = arg max_{θ2} ℓn(θ10, θ2).

After this, we replace I¹¹(θ̂) by I¹¹(θ10, θ̂20) in (18.1). Just as in the discussion of the last subsection, many other statistics can be used in place of I¹¹(θ̂) without changing the asymptotic conclusion. We do not have a rule to decide which one is the "optimal" choice. More accurately speaking, we are not aware of any commonly accepted rules.

18.1.3 Variations of Wald test in the aspect of H0

We work under the same title but in a slightly different situation here. Suppose the null hypothesis is specified in the form H0 : ϕ(θ) = 0 where ϕ(·) takes vector values of dimension d. The dimension of θ is d + k, the same as before. Note that when ϕ(θ) = θ1 − θ10 with some known value θ10, this H0 reduces to the last case. Naturally, the alternative hypothesis is H1 : ϕ(θ) ≠ 0. Assume both ϕ(·) and ϕ′(·) are smooth and that the rank of ϕ′(θ) is d for θ in a neighbourhood of the true parameter value.
If one regards $\varphi(\theta)$ as a parameter itself, with $\varphi(\hat\theta)$ its asymptotically normal estimator, then applying the principle behind the Wald test, we would define
\[
W_n = n\,\varphi^\tau(\hat\theta)\{\varphi'(\hat\theta)\, I^{-1}(\hat\theta)\, \varphi'(\hat\theta)^\tau\}^{-1}\varphi(\hat\theta)
\]
as a test statistic. It can be shown that we still have $W_n \to \chi^2_d$ under $H_0$. Clearly, an approximate size-$\alpha$ test can be constructed from this $W_n$ as before.

18.2 Score Test

We have seen that under regularity conditions,
\[
E\left\{\frac{\partial \log f(X;\theta)}{\partial\theta}\right\} = 0,
\]
where the expectation is taken under the assumption that the distribution of $X$ is given by $f(x;\theta)$. Thus, when we test $H_0: \theta = \theta_0$, the value of the score function
\[
S_n(\theta_0) = \sum_{i=1}^n \frac{\partial \log f(X_i;\theta_0)}{\partial\theta}
\]
is indicative of the validity of $H_0$. Recall that $n^{-1/2} S_n(\theta_0)$ is asymptotically multivariate normal with asymptotic variance $I(\theta_0)$. Let us define the test statistic
\[
T_n = S_n^\tau(\theta_0)\{n I(\theta_0)\}^{-1} S_n(\theta_0).
\]
The limiting distribution of $T_n$ is chi-square with $d$ degrees of freedom, where $d$ is the dimension of $\theta$. Based on this result, a score test of approximate size $\alpha$ is given by
\[
\phi(X) = \begin{cases} 1 & \text{when } T_n \ge \chi^2_d(1-\alpha),\\ 0 & \text{otherwise.}\end{cases}
\]
Unlike the likelihood ratio test or the Wald test, this statistic does not require us to compute the MLE of $\theta$. We do need to compute the Fisher information matrix and its inverse.

As with the Wald test, let us now consider the more complex situation where the null hypothesis is $H_0: \theta_1 = \theta_{10}$ while the dimension of $\theta$ is $d + k$. This means the second part of the $\theta$ vector is unspecified under $H_0$. Let $\hat\theta_0$ be the MLE under $H_0$, and let
\[
S_{n1}(\theta) = \frac{\partial \ell_n(\theta)}{\partial\theta_1},
\]
which was defined earlier; it is a column vector of length $d$. If the same asymptotic techniques are used here, we find that $n^{-1/2} S_{n1}(\hat\theta_0)$ is asymptotically multivariate normal with mean $0$ and variance matrix $I_{11,2}(\theta^*)$ under the null hypothesis. Here $\theta^*$ stands for the true parameter value, and $I_{11,2} = I_{11} - I_{12} I_{22}^{-1} I_{21}$ was also defined before.
These asymptotic results lead to the conclusion
\[
T_n = S_{n1}^\tau(\hat\theta_0)\{n I_{11,2}(\hat\theta_0)\}^{-1} S_{n1}(\hat\theta_0) \to \chi^2_d
\]
as $n \to \infty$. This time, $d$ is the dimension of $\theta_1$. A test can be constructed in the same way as earlier.

Finally, consider the null hypothesis specified by $H_0: \varphi(\theta) = 0$, where $\varphi$ is a smooth function. Under regularity conditions, and in most applied problems, we may equivalently write $H_0$ as $\theta = g(\lambda)$ for some smooth function $g$ with a new parameter $\lambda$. In this case, we may obtain the MLE of $\lambda$ as the maximum point of $\ell(g(\lambda))$; denote it by $\hat\lambda$. Next, we redefine the score statistic to be
\[
T_n = S_n^\tau(g(\hat\lambda))\{n I(g(\hat\lambda))\}^{-1} S_n(g(\hat\lambda)).
\]
Under regularity conditions, we still have $T_n \to \chi^2_d$, where this $d$ is the dimension of $\theta$ minus the dimension of $\lambda$. As in the discussion of the Wald test, we can use variations of $I(g(\hat\lambda))$ to construct the score test. I am not aware of definitive statements on which of them works best.

18.3 Power and consistency

The three tests, likelihood ratio, Wald and score, are asymptotically equivalent. By this statement we mean that if the true parameter value is at an $n^{-1/2}$-distance from the null model space, then the powers of these tests are asymptotically equal, and the common value is not 1.

Most tests recommended in mathematical statistics are consistent: the power of the test at any specific alternative distribution (a distribution in $H_1$) goes to 1 as the sample size $n \to \infty$ under the i.i.d. setting. There are generally no discussions of whether these tests are unbiased. Admittedly, there is a discrepancy between the optimality theories for hypothesis testing and the properties of the generally recommended tests. The optimality theory provides a high ground from which we discuss the pros and cons of test procedures. We do not insist on using only tests with confirmed optimality properties in applications.
The recommended tests are often designed to mimic or follow the principles of optimal tests, after some tolerated compromises for the sake of convenience or feasibility of implementation.

We often use simulation studies to compare the performances of various tests in specific applications: one advocated by the user, the others existing methods. It is not unusual for a student to claim that the "new method" is better because it rejects more null hypotheses, without further qualification. This practice is not right: one may reject all null hypotheses to achieve the highest possible power of 100% in any application. Clearly, such a test ignores the need to control the type I error, or the desire to have an unbiased test. The comparison should only be made after the tests are calibrated so that their sizes are practically equal.

18.4 Assignment problems

1. Suppose that $X = (X_1, \ldots, X_k)^\tau$ has a multinomial distribution with parameter $P = (p_1, \ldots, p_k)^\tau$. It is known that $\sum x_j = n$ and $n$ is not random. Consider the problem of testing $P = P_0$, where $P_0$ is a probability vector with all entries positive. (i) Derive the quadratic form of the Wald test statistic; (ii) derive the quadratic form of the score test statistic; (iii) (challenge) prove or disprove that these two statistics are identical.

2. Consider the hypothesis test for $H_0: \varphi(\theta) = 0$ against the alternative $H_1: \varphi(\theta) \neq 0$. Assume the derivative $\varphi'(\theta)$ is continuous and has full rank $d$, while the dimension of $\theta$ is $d + k$ for some $k$. The test problem satisfies the regularity conditions, and we have i.i.d. observations of size $n$. Prove that the Wald test statistic
\[
W_n = n\,\varphi^\tau(\hat\theta)\{\varphi'(\hat\theta)\, I^{-1}(\hat\theta)\, \varphi'(\hat\theta)^\tau\}^{-1}\varphi(\hat\theta)
\]
satisfies, under $H_0$ and as $n \to \infty$, $W_n \xrightarrow{d} \chi^2_d$.

3. Consider the null hypothesis $H_0: \theta_1 = \theta_{10}$, where the dimension of $\theta^\tau = (\theta_1^\tau, \theta_2^\tau)$ is $d + k$ and the dimension of $\theta_1$ is $d$. Assume the regularity conditions are satisfied for the corresponding model, and we have i.i.d. observations of size $n$. Let $\hat\theta_0$ be the MLE under $H_0$ and $S_{n1}(\theta) = \partial\ell_n(\theta)/\partial\theta_1$. Prove that the score test statistic
\[
T_n = S_{n1}^\tau(\hat\theta_0)\{n I_{11,2}(\hat\theta_0)\}^{-1} S_{n1}(\hat\theta_0) \to \chi^2_d
\]
in distribution as $n \to \infty$.

Chapter 19 Tests under normality

There are many classical tests, taught in introductory courses, for various aspects of the population under the normality assumption. They include tests for hypotheses such as whether the population mean is equal to or larger than a specific value when a single i.i.d.
sample is given, and tests for whether two population means are equal, or differ by a specific amount, when two independent i.i.d. samples from two populations are given. We summarize them in this chapter in the light of some forms of optimality.

19.1 One-sample problem under normality

Suppose we have a random sample $x_1, \ldots, x_n$ of size $n$ from $N(\theta, \sigma^2)$. We adopt the usual notation: sample mean $\bar x = n^{-1}\sum_{i=1}^n x_i$ and sample variance $s_n^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2$. Let
\[
T_n = \frac{\sqrt{n}(\bar x - \theta_0)}{s_n}
\]
for some given value $\theta_0$. The well-known test for $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$ of size $\alpha$ is given by
\[
\phi(x) = \begin{cases} 1 & \text{when } T_n \ge t_{n-1,1-\alpha},\\ 0 & \text{otherwise,}\end{cases}
\]
where $t_{n-1,1-\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution with $n-1$ degrees of freedom.

The well-known t-test for $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ of size $\alpha$ is given by
\[
\phi(x) = \begin{cases} 1 & \text{when } |T_n| \ge t_{n-1,1-\alpha/2},\\ 0 & \text{otherwise.}\end{cases}
\]
This is the famous two-sided t-test. Both tests are convenient to use and have nice properties. Yet, having studied the "UMP" theory, we may question their "optimality". We will not prove anything here but cite a few classical theorems; their proofs are too involved to be lectured on in this course. Interested readers are referred to the reference textbooks. Even better, you may practise your technical skills by attempting to prove or disprove these results.

Theorem 19.1. Suppose we have an i.i.d. sample from a distribution with density function from an exponential family
\[
f(x; \theta, \lambda) = \exp\{\theta U(x) + \lambda T(x) + A(\theta, \lambda)\}
\]
with respect to some $\sigma$-finite measure. The parameter $\theta$ is one-dimensional, while $\lambda$ can be multi-dimensional.

(i) Suppose that $V = h(U, T)$ is independent of $T$ (as two independent random variables/vectors) when $\theta = \theta_0$, and in addition, for each $t$, $h(u, t)$ is an increasing function of $u$. Then the test
\[
\phi(v) = \begin{cases} 1 & \text{when } v > k,\\ c & \text{when } v = k,\\ 0 & \text{otherwise,}\end{cases}
\]
with $k$ and $c$ chosen so that $E\{\phi(V); \theta_0\} = \alpha$, is a UMPU test for $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$.
(ii) Assume the same conditions as in (i), but in addition that $h(u, t) = a(t)u + b(t)$ with $a(t) > 0$. Define
\[
\phi(v) = \begin{cases} 1 & \text{when } v > k_1 \text{ or } v < k_2,\\ c_j & \text{when } v = k_j,\ j = 1, 2,\\ 0 & \text{otherwise,}\end{cases}
\]
for constants $k_1 > k_2$ such that $E\{\phi(V); \theta_0\} = \alpha$ and $E\{V\phi(V); \theta_0\} = \alpha E\{V; \theta_0\}$. This test is a UMPU test for $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.

Clearly, this theorem targets problems in which the hypothesis involves one specific component $\theta$ of the parameter vector in the model, while leaving the other components $\lambda$ unspecified. Because of this, we refer to $\lambda$ as a nuisance parameter. According to the results of this theorem, when the model is an "exponential family" satisfying some conditions, the UMPU tests have the same format as the ones we obtained in the absence of nuisance parameters.

We use this theorem to construct a test for hypotheses about the variance $\sigma^2$ under the normal model in the following example.

Example 19.1. Suppose we have an i.i.d. sample from $N(\xi, \sigma^2)$. The joint density function can be written as
\[
f(x; \xi, \sigma^2) = \exp\{\theta U(x) + \lambda T(x) + A(\theta, \lambda)\}
\]
with $\theta = -1/(2\sigma^2)$, $\lambda = (n\xi)/\sigma^2$, $U(x) = \sum x_i^2$ and $T(x) = \bar x$. Let
\[
V = h(U, T) = U - nT^2 = \sum (x_i - \bar x)^2.
\]
It is seen that for any given value of $\sigma^2$, $h(U, T)$ is independent of $T$. Thus, a UMPU test of size $\alpha$ for $H_0: \sigma \le \sigma_0$ against $H_1: \sigma > \sigma_0$ is given by
\[
\phi(V) = \begin{cases} 1 & \text{when } V > k,\\ 0 & \text{otherwise.}\end{cases}
\]
Because $V/\sigma_0^2$ has a chi-square distribution with $n-1$ degrees of freedom, $k$ is $\sigma_0^2$ times the $(1-\alpha)$th quantile of this distribution.

In the next example, we exchange the roles of $\sigma$ and $\xi$ (in terms of notation), and therefore of $U$ and $T$, to construct a UMPU test about the size of the population mean.

Example 19.2. Suppose we have an i.i.d. sample from $N(\xi, \sigma^2)$. The joint density function can be written as
\[
f(x; \xi, \sigma^2) = \exp\{\theta U(x) + \lambda T(x) + A(\theta, \lambda)\}
\]
with $\lambda = -1/(2\sigma^2)$, $\theta = (n\xi)/\sigma^2$, $T(x) = \sum x_i^2$ and $U(x) = \bar x$.
This time, we find that
\[
V = h(U, T) = \frac{U}{\sqrt{T - nU^2}} = \frac{\bar X}{\sqrt{\sum (X_i - \bar X)^2}}
\]
is independent of $T(x)$ when $\xi = 0$; see the remark after this example. It is easily seen that $V$ is an increasing function of $U$ given $T$. Hence, a UMPU test for $H_0: \xi \le 0$ versus $H_1: \xi > 0$ is easily obtained from Theorem 19.1.

In order to construct a UMPU test for $H_0: \xi = 0$ versus $H_1: \xi \neq 0$, we need a $V$ that is also linear in $U$ given $T$, which is not the case here. However, it is possible to use a mathematical trick to show that a UMPU test for this pair of hypotheses is still given by $\phi(V) = 1(|V| > k)$ with $k$ satisfying $E\{\phi(V); \theta_0\} = \alpha$ and $E\{V\phi(V); \theta_0\} = \alpha E\{V; \theta_0\}$. The solution is, in essence, the famous two-sided t-test; it is not in the exact form of the t-test, as some normalization steps are needed, which we omit here.

Remark: When $\xi = 0$ is given, $T(x)$ is complete and sufficient for $\sigma^2$. At the same time, the distribution of $V$ does not depend on $\sigma$. Thus, the classical theorem (Basu's theorem) implies that they are independent.

19.2 Two-sample problem under the normality assumption

The purpose of the first example is to show that the commonly used F-test is a UMPU test for the one-sided hypothesis. The conclusion is likely also "true" for the two-sided hypothesis, with some complications. For the specific hypothesis to be discussed, we may simply put $\Delta = 1$ in the subsequent discussion; ignore this $\Delta$ if you find it disturbing.

Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be i.i.d. samples from $N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$ respectively. Their joint density function is given by
\[
f(x, y; \xi, \eta, \sigma, \tau) = \exp\left\{-\frac{1}{2\sigma^2}\sum x_i^2 - \frac{1}{2\tau^2}\sum y_j^2 + \frac{m\xi}{\sigma^2}\bar x + \frac{n\eta}{\tau^2}\bar y - A(\xi, \eta, \sigma, \tau)\right\}.
\]
Next, let us transform the parameters by
\[
\theta = -\frac{1}{2\tau^2} + \frac{1}{2\Delta\sigma^2}, \qquad \lambda_1 = -\frac{1}{2\sigma^2}, \quad \lambda_2 = \frac{m\xi}{\sigma^2}, \quad \lambda_3 = \frac{n\eta}{\tau^2}
\]
for some constant $\Delta > 0$. Let the corresponding sufficient statistics be
\[
U = \sum_{j=1}^n Y_j^2, \quad T_1 = \sum_{i=1}^m X_i^2 + \frac{1}{\Delta}\sum_{j=1}^n Y_j^2, \quad T_2 = \bar X, \quad T_3 = \bar Y.
\]
Test for equal variances. Consider the test for
\[
H_0: \tau^2 \le \sigma^2 \quad \text{versus} \quad H_1: \tau^2 > \sigma^2.
\]
With $\Delta = 1$, this is the same as $H_0: \theta \le 0$ versus $H_1: \theta > 0$. Define
\[
V = h(U, T_1, T_2, T_3) = \frac{\sum_{j=1}^n (Y_j - \bar Y)^2}{\sum_{i=1}^m (X_i - \bar X)^2}.
\]
It is seen that given $\theta = 0$, $V$ has an F-distribution (after some scale adjustment) which does not depend on any parameters. Thus, it is independent of the sufficient and complete statistic $(T_1, T_2, T_3)$. To verify the sufficiency and completeness, work out the analytical form of the distribution family when $\theta = 0$. It is also easy to show that $h(U, T)$ is monotone in $U$ given $T$.

These discussions show that the conditions of Theorem 19.1 are satisfied with these parameters and statistics. Hence a proper test based on $V$ is a UMPU test. That is, a UMPU test for $H_0: \tau^2 \le \sigma^2$ versus $H_1: \tau^2 > \sigma^2$ is given by $\phi(V) = 1(V > k)$, with $k$ chosen according to the F-distribution to make the size correct.

Extension. By putting $\Delta$ at other values, we obtain many variations. One may also easily obtain the F-test for the two-sided hypothesis. It turns out that the UMPU test is not the same as the test whose rejection regions on the two tails of $V$ have equal probability $\alpha/2$.

19.3 Test for equal means under the equal-variance assumption

We certainly expect the two-sample t-test to show up here. Consider the case where $\tau^2 = \sigma^2$. Under this assumption, the joint density of the two samples is given by
\[
f(x, y; \xi, \eta, \sigma) = \exp\left\{-\frac{1}{2\sigma^2}\Big\{\sum x_i^2 + \sum y_j^2\Big\} + \frac{m\xi}{\sigma^2}\bar x + \frac{n\eta}{\sigma^2}\bar y - A(\xi, \eta, \sigma)\right\}.
\]
Let
\[
\theta = \frac{\eta - \xi}{(m^{-1} + n^{-1})\sigma^2}, \qquad \lambda_1 = \frac{m\xi + n\eta}{(m+n)\sigma^2}, \quad \lambda_2 = -\frac{1}{2\sigma^2}.
\]
The sufficient statistics are
\[
U = \bar Y - \bar X, \quad T_1 = m\bar X + n\bar Y, \quad T_2 = \sum_{i=1}^m X_i^2 + \sum_{j=1}^n Y_j^2.
\]
To test $H_0: \xi = \eta$ versus $H_1: \xi \neq \eta$, we construct the statistic
\[
V = \frac{\bar Y - \bar X}{\sqrt{\sum_{i=1}^m (X_i - \bar X)^2 + \sum_{j=1}^n (Y_j - \bar Y)^2}},
\]
which is a function of $U$, $T_1$ and $T_2$. Its distribution when $\xi = \eta$ does not depend on $\lambda_1$ or $\lambda_2$. Thus, it serves as the proper statistic for constructing the UMPU test.
A UMPU test is given by $\phi(V) = 1(|V| > k)$ with $k$ satisfying $E\{\phi(V); \eta = \xi\} = \alpha$ and $E\{V\phi(V); \eta = \xi\} = \alpha E\{V; \eta = \xi\}$. This is, in essence, the two-sample t-test.

Here are a few missing technical steps. First, the square of the denominator of $V$ can be written as
\[
T_2 - \frac{1}{m+n}T_1^2 - \frac{mn}{m+n}U^2.
\]
This ensures that $V$ is indeed a function of the required form. The second issue is the linearity of $V$ in $U$ given $T$. The linearity is not exactly true; however, $V$ is a monotone function of
\[
W = \frac{\bar Y - \bar X}{\sqrt{\sum x_i^2 + \sum y_j^2 - (m+n)^{-1}\left(\sum x_i + \sum y_j\right)^2}},
\]
so a test based on $W$ satisfies all the conditions of the theorem. The two tests are, however, equivalent. The reason for using $V$ instead of $W$ is that the distribution of $V$ is clearly related to the t-distribution, while the distribution of $W$ is not "standard".

19.4 Test for equal means without the equal-variance assumption

If $\sigma^2 = \tau^2$ is not assumed (or not known to hold), there is no simple solution such as a UMPU test. This is the so-called Behrens–Fisher problem.

In searching for "optimal tests", one usually starts by placing restrictions on the test: we may require the test to be "unbiased", "invariant", "similar", and so on. After some consideration, it appears that a good test should reject the null hypothesis when
\[
\frac{\bar Y - \bar X}{\sqrt{(1/m)S_x^2 + (1/n)S_y^2}} \ge g(S_y^2/S_x^2)
\]
for some suitable function $g$. If the test is required to be unbiased, then only "pathological functions $g$" can have this property.

"Approximate solutions are available which provide tests that are satisfactory for all practical purposes." Among them, we probably know Welch's approximate t-test. In this case, the t-test statistic is defined to be
\[
t_n = \frac{\bar Y - \bar X}{\sqrt{(1/m)S_x^2 + (1/n)S_y^2}}.
\]
Clearly, it is simply the "standardized difference in sample means". Its distribution under $H_0$ depends on the actual values of $\sigma^2$ and $\tau^2$; namely, it does not have the desired "pivotal" property under $H_0$.
However, it is generally recommended in the literature that its null distribution is well approximated by a t-distribution with degrees of freedom
\[
\mathrm{df} = \frac{[(1/m)S_x^2 + (1/n)S_y^2]^2}{[(1/m)S_x^2]^2/(m-1) + [(1/n)S_y^2]^2/(n-1)}.
\]
Because this df is an approximation based on certain considerations, I am reluctant to declare it a "correct approximation of the degrees of freedom"; it is best to call it Welch's approximation.

I like the description of this famous problem in Wikipedia: "One difficulty with discussing the Behrens–Fisher problem and proposed solutions is that there are many different interpretations of what is meant by 'the Behrens–Fisher problem'. These differences involve not only what is counted as being a relevant solution, but even the basic statement of the context being considered."

My summary: if one attempts to find an "optimal" test for $\xi = \eta$ without knowing whether $\sigma = \tau$ in a two-sample problem, there may not be such a solution. If "optimality" is not strictly insisted upon, there are many sensible methods.

Summary. We have gone over the most famous tests based on data from normal models, though not a complete list of all important cases, and we did not go over the theorems based on which these tests are justified to have various optimality properties. The optimality goes away if the data are not from a normal distribution. My experience indicates, however, that the two-sample t-test works really nicely even when normality is severely violated: the size of the test remains close to what is promised, and the power is superior to that of many alternatives. One can construct situations where this t-test is very poor in these respects, but such situations are often too extreme to be taken seriously. The simplest case I can think of is when a few extremely large observations are present compared with the majority of the observed values. If so, robust approaches are recommended; some of them will be discussed next.
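As a small computational aside (my own illustration, not part of the notes), the Welch statistic $t_n$ and its approximate degrees of freedom can be computed directly from the two samples; the function name is mine.

```python
import math

def welch_t(xs, ys):
    """Welch's approximate two-sample t-test quantities from Section 19.4:
    t_n = (ybar - xbar) / sqrt(Sx^2/m + Sy^2/n), together with Welch's
    approximate degrees of freedom.  Returns the pair (t_n, df)."""
    m, n = len(xs), len(ys)
    xbar, ybar = sum(xs) / m, sum(ys) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / (m - 1)   # sample variance S_x^2
    sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)   # sample variance S_y^2
    vx, vy = sx2 / m, sy2 / n                          # per-sample variance terms
    t = (ybar - xbar) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df
```

One would then compare $t_n$ to a quantile of the t-distribution with (a rounded version of) this df, keeping in mind that the df is only Welch's approximation.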
19.5 Assignment problems

1. Suppose $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, is a random sample from a bivariate normal distribution, so that the joint density of the sample is
\[
f(x, y; \xi, \eta, \sigma, \tau) = \left\{2\pi\sigma\tau\sqrt{1-\rho^2}\right\}^{-n}\exp\left(-\frac{h(x, y; \xi, \eta, \sigma, \tau)}{2(1-\rho^2)}\right),
\]
where
\[
h(x, y; \cdots) = \frac{1}{\sigma^2}\sum (x_i - \xi)^2 - \frac{2\rho}{\sigma\tau}\sum (x_i - \xi)(y_i - \eta) + \frac{1}{\tau^2}\sum (y_i - \eta)^2.
\]
(a) Determine the form of the UMPU test for $H_0: \rho \le 0$ versus $H_1: \rho > 0$. (b) Determine the rejection region of the test of size $\alpha$ in terms of the quantile of a well-known distribution (the t-distribution).

2. Let $X_1, \ldots, X_n$ be a random sample from $N(\xi, \sigma^2)$. (a) Show that the power of Student's t-test is an increasing function of $\xi/\sigma$ for testing $H_0: \xi < 0$ versus $H_1: \xi > 0$ (one-sided test). (b) Show that the power of Student's t-test is an increasing function of $|\xi|/\sigma$ for testing $H_0: \xi = 0$ versus $H_1: \xi \neq 0$ (two-sided test).

3. Suppose that $X_i = \beta_0 + \beta_1 t_i + \epsilon_i$, where the $t_i$'s are fixed constants that are not all the same, the $\epsilon_i$'s are i.i.d. from $N(0, \sigma^2)$, and $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown parameters. Derive a UMPU test of size $\alpha$ for testing (a) $H_0: \beta_0 \le \theta_0$ versus $H_1: \beta_0 > \theta_0$; (b) $H_0: \beta_0 = \theta_0$ versus $H_1: \beta_0 \neq \theta_0$.

Chapter 20 Non-parametric tests

The methods we have discussed so far are based on the assumption that the data are generated i.i.d. from a distribution belonging to some regular parametric model. These methods become either inapplicable or inferior if the data do not listen to our command: "assume they have this or that distribution".

Strictly speaking, the tests designed under a "parametric assumption" can be carried out smoothly whether or not the model assumption is violated. The real issue is that these tests may not have the prescribed size, nor reasonable power to detect the departure of the underlying distribution from the null hypothesis in the specific aspect we are interested in. For instance, suppose we have i.i.d. bivariate observations $(x_i, y_i)$ from some distribution.
We wish to test whether the two component random variables $X$ and $Y$ are independent. If the joint distribution is bivariate normal, we may simply test the null hypothesis that their correlation is 0; the likelihood ratio test is a valid choice. However, if the joint distribution is not normal, this test will not be able to detect violations of the independence hypothesis in which $X$ and $Y$ are dependent but have zero correlation. It can also happen that $X$ and $Y$ are independent, but the normality-based likelihood ratio test statistic has a limiting distribution very different from chi-square; if so, the null hypothesis may be rejected with a much higher type I error than intended.

In other examples, the performance of a parametric test can still be respectable. In the two-sample problem, the typical null hypothesis of interest is that the two populations have the same mean. In this case, the two-sample t-test is very hard to beat in terms of having both an accurate type I error and good power; one has to subject this test to very weird data sets to make it look bad.

Even though some parametric tests are rather robust, there is a need for tests whose validity does not depend heavily on the correctness of the model assumption.

20.1 One-sample sign test

Suppose we have an i.i.d. sample from some distribution whose c.d.f. is $F(x)$, a member of a family $\mathcal F$. The family $\mathcal F$ to which $F$ belongs is not very important, so we do not specify it carefully; in some applications, we do not know much about this family other than that it is very broad. The one-sample sign test is designed for the null hypothesis
\[
H_0: p = F(u) \le p_0 \quad \text{versus} \quad H_1: p = F(u) > p_0
\]
for some user-specified $u$ and $p_0$. Let $x_i$, $i = 1, 2, \ldots, n$, be the observed values. Apparently, the key information from a single observation in this problem is whether $x_i > u$ or $x_i \le u$. Consequently, we define $\Delta_i = 1(x_i - u \le 0)$ for $i = 1, \ldots, n$. If $\Delta_i$, $i = 1, \ldots, n$, are the only data we observe, then $Y = \sum_{i=1}^n \Delta_i$ is sufficient for the probability of success $p$, the unknown value of $F(u)$. The UMP test for $H_0$ versus $H_1$ has the form
\[
\phi(Y) = 1(Y > k) + c\,1(Y = k)
\]
with proper choices of $k$ and $c$ so that the test has a pre-specified size.

When $n$ is large, the distribution of $Y$ is well approximated by a normal distribution, so the usefulness of the randomization constant $c$ for achieving an exact size disappears; for this reason, we give up the effort of determining the value of $c$ explicitly. If $n$ is not extremely large and the observed value of $Y$ is $y_0$, the numerical value of $\mathrm{pr}(Y \ge y_0)$ can be obtained from many standard statistical software packages, so one may simply compute the p-value of the test. One may also activate the continuity correction if it seems necessary: compute $(1/2)\mathrm{pr}(Y = y_0) + \mathrm{pr}(Y > y_0)$ instead.

This test does not depend on the specific form of $F$; that is, we do not have to specify a parametric model $\{f(x;\theta): \theta \in \Theta\}$ for $F$. For this reason, the test is referred to as non-parametric. The statistic $Y$ is the number of observations with $x_i - u \le 0$, that is, the number of observations for which this quantity has a negative sign; the test $\phi(Y)$ is accordingly called the sign test. The test is in fact UMP in general, rather than merely in the restricted sense specified above; however, it may not be so interesting to seriously prove this claim. In the literature, the one-sample sign test often refers to the special case where $p_0 = 0.5$.

20.2 Sign test for paired observations

Suppose a pair of observations of the same nature is obtained on each of $n$ experimental units (more commonly, sampling units). One wishes to test whether or not the two observations have the same distribution. Let the observations be denoted $(x_i, y_i)$, $i = 1, 2, \ldots, n$, and assume independence between pairs. Define $\Delta_i = 1(y_i > x_i)$ and $Y = \sum \Delta_i$.
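As an aside, the exact p-value of the one-sample sign test of Section 20.1 follows directly from the Binomial$(n, p_0)$ distribution at the boundary $p = p_0$. Here is a minimal sketch of my own (the function name is mine, not from the notes):

```python
from math import comb

def sign_test_pvalue(xs, u, p0=0.5):
    """Exact p-value pr(Y >= y0) for the one-sample sign test of
    H0: F(u) <= p0 versus H1: F(u) > p0, where Y = #{i : x_i <= u}
    has a Binomial(n, p0) distribution at the boundary p = p0."""
    n = len(xs)
    y0 = sum(1 for x in xs if x <= u)   # observed count of x_i <= u
    return sum(comb(n, y) * p0 ** y * (1 - p0) ** (n - y)
               for y in range(y0, n + 1))
```

The continuity-corrected version described in the text would instead report $(1/2)\,\mathrm{pr}(Y = y_0) + \mathrm{pr}(Y > y_0)$.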
If the two marginal distributions $F$ and $G$ are identical, then $Y$ has a conditional binomial distribution with probability of success $p_0 = 0.5$ given $n' = \sum 1(y_i \neq x_i)$.

The sign test recommended in the literature rejects the null hypothesis $H_0: F(x) \equiv G(x)$ when $Y$ exceeds $k$, a critical value determined by the conditional distribution of $Y$ given $n'$ and the desired size $\alpha$. The presumed alternative hypothesis is $H_1: F < G$ in some stochastic sense. Apparently, there are many distribution pairs $F \neq G$ at which this test has rejection probability smaller than $\alpha$, a violation of the usual requirement that a test be unbiased. One need not be overly alarmed: in most applications where the sign test is used, one looks for evidence of $F < G$ from a specific angle, so being unable to reject all possible violations of $H_0: F(x) \equiv G(x)$ with good power is not much of a concern. Nevertheless, a statistician should be aware of this issue. As a reminder, a paired t-test would be used if the data are from a paired experiment and normality is not in serious doubt.

20.3 Wilcoxon signed-rank test

The tests of the last two sections do not take the magnitude of the differences into account. Hence, one may exploit this information and come up with superior tests in some ways. Consider the paired experiment first. Let $n'$ be the number of observations with $x_i \neq y_i$, and remove the sampling units for which $x_i = y_i$ from the sample. For simplicity, we assume $x_i \neq y_i$ in the first place and write $n$ for $n'$. Define
\[
\delta_i = \begin{cases} 1 & \text{if } y_i > x_i,\\ -1 & \text{if } y_i < x_i,\end{cases} \qquad \Delta_i = |y_i - x_i|
\]
for $i = 1, 2, \ldots, n$. Let
\[
R_i = \sum_{j=1}^n 1(\Delta_j < \Delta_i) + \frac{1}{2}\sum_{j=1}^n 1(\Delta_j = \Delta_i) + \frac{1}{2}.
\]
Do not be fooled by this seemingly complex formula: it merely gives the rank of $\Delta_i$ among all the absolute differences $\Delta_j$, with the convention that when $\Delta_i = \Delta_j$, only one half of the tied pair $j$ is counted as having a lower value than $\Delta_i$. Finally, the Wilcoxon signed-rank statistic is defined to be
\[
W = \sum_{i=1}^n \delta_i R_i.
\]
Note that $W$ uses both the sign $\delta_i$ and the rank $R_i$ of each paired observation. Because of this, a pair with a large observed difference has a higher rank $R_i$ and therefore contributes more towards increasing the size of $W$.

The distribution of $W$ under the null hypothesis stays the same as long as the marginal distributions satisfy $F \equiv G$, given the number of unequal pairs $n$. When $F < G$ in some sense, we expect $W$ to reflect the degree of departure from $H_0$ in favour of this type of alternative better than the pure sign test does. Hence, the signed-rank test rejects $H_0$ when $W$ is large. One may numerically evaluate $\mathrm{pr}(W > w_0)$ and use it as the p-value of the test.

The distribution of $W$ is also asymptotically normal. When the population distributions are continuous, so that there cannot be any ties in rank, the mean of $W$ under $H_0$ is 0 and its variance is
\[
\mathrm{var}(W) = \frac{1}{6}n(n+1)(2n+1).
\]
When $n$ is large, the test rejects $H_0$ if $W > \sqrt{\mathrm{var}(W)}\, z_{1-\alpha}$ for the one-sided alternative. This is the Wilcoxon signed-rank test for paired experiments. When there are ties in $|y_i - x_i|$, the variance of $W$ is more complex; we do not go over that case.

In most books, this statistic is defined as the total rank of the positive $y_i - x_i$. Since the total of all ranks is non-random, the test based on our $W$ is the same as the test based on that statistic. There is also a Wilcoxon signed-rank test for the one-sample problem, which we do not discuss here.

20.4 Two-sample permutation test

Consider the situation where we have one random sample $x_1, \ldots, x_m$ from $F$ and another random sample $y_1, \ldots, y_n$ from $G$. It is of interest to test
\[
H_0: F = G \quad \text{versus} \quad H_1: F \neq G.
\]
Note that this problem differs from the one in the last section: for instance, $x_1$ and $y_1$ are not linked here, unlike in the paired experiment. To have a meaningful discussion, assume both $F$ and $G$ are continuous.
That is, the model space contains all continuous distribution functions. Denote the pooled sample by $z = \{x_1, \ldots, x_m\} \cup \{y_1, \ldots, y_n\}$. We first regard $z$ as a set, then turn it into a vector of length $m+n$ for the subsequent discussion. Define the set of vectors
\[
\Pi(z) = \{(z_{i_1}, z_{i_2}, \ldots, z_{i_{m+n}}): (i_1, \ldots, i_{m+n}) \text{ is a permutation of } (1, 2, \ldots, m+n)\}.
\]
That is, $\Pi(z)$ contains all vectors of length $m+n$ that are permutations of each other, whose entries are the observed values from the two samples. We denote the members of $\Pi(z)$ by $\pi(z)$.

Let $\phi(X, Y)$ be a test, namely a function taking values between 0 and 1, such that
\[
\frac{1}{(m+n)!}\sum_{(x, y) \in \Pi(z)} \phi(x, y) = \alpha.
\]
In the above expression, we write $(x, y)$ instead of $z$ in places to remind us of the connections. Such a test is called a permutation test at significance level $\alpha$.

At first look, this definition does not make much sense. Suppose the observed values are all different: $x_i \neq x_j$, $y_i \neq y_j$ for any $i \neq j$, and $x_i \neq y_j$ for any $i$ and $j$. In this case, once the pooled sample $z$ is specified, $\Pi(z)$ contains $(m+n)!$ distinct vectors. Suppose the test function $\phi(x, y)$ does not involve randomization; then this function decides which of these $(m+n)!$ vectors are in the rejection region. Under the null hypothesis $F = G$ with both distributions continuous, given the set $z$, every member of $\Pi(z)$ has probability $1/(m+n)!$ of occurring. Hence, if no randomization is involved, a permutation test of size $\alpha = 0.05$ selects 5% of the $(m+n)!$ possible arrangements of $z$ to form its rejection region.

The name of the test is now sensible, as the rejection region is formed by permuted observation vectors. The size of the test is computed based on the fact that the null distribution on $\Pi(z)$ is uniform conditional on the set of values in $z$.

The key remaining issue in a permutation test is: which 5% of the permutations should be placed in the rejection region? In applications, we introduce a function $T_{m+n}$ defined on $\Pi(z)$. This function can take at most $(m+n)!$ distinct values. We reject $H_0$ if the observed $T_{m+n}$ is among the top $100\alpha\%$ of the values. We now give two specific choices of $T_{m+n}$.

Difference in two sample means. The question of which permutations to reject depends on the "optimality requirement" and the potential alternative hypothesis: what direction of departure do we care about? Without such a direction, we can always find two samples that differ significantly in one way or another. One should review the two desired properties of a test statistic at this point.

Consider the situation where the alternative is $H_1: G(x) = F(x - \delta)$ for some $\delta > 0$; in statistics, we say $G(x)$ is obtained from $F$ by a location shift in this case. Under this alternative hypothesis, the samples from $G$ are stochastically larger than the samples from $F$, so any statistic that tends to take larger values under $H_1$ is a suitable candidate. Suppose $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ are the two random samples. Let
\[
T_{m+n} = n^{-1}\sum_{j=1}^n y_j - m^{-1}\sum_{i=1}^m x_i = \bar y_n - \bar x_m.
\]
For each permuted arrangement $x_1', \ldots, x_m'; y_1', \ldots, y_n'$ of the pooled sample, we compute
\[
T'_{m+n} = n^{-1}\sum_{j=1}^n y_j' - m^{-1}\sum_{i=1}^m x_i' = \bar y_n' - \bar x_m'.
\]
The observed $T_{m+n}$ is one of the $\binom{m+n}{n}$ possible outcomes denoted $T'_{m+n}$. It makes sense to select the permutations that result in the largest values of $T'_{m+n}$ to form the rejection region.

To carry out this test, we compute all $\binom{m+n}{n}$ possible values of $T'_{m+n}$, one of which is the observed value $T_{m+n}$. If the observed value is among the top $100\alpha\%$, we reject $H_0$ in favour of $H_1: G(x) = F(x - \delta)$ for some $\delta > 0$. In applications, if $m+n$ is large, computing all possible values in "finite time" is not feasible; computer simulation may be used to evaluate only a random subset of them and obtain an accurate enough rank for $T_{m+n}$.
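The enumeration just described can be sketched as follows (my own illustration, with a function name of my choosing; full enumeration is feasible only for small $m+n$):

```python
from itertools import combinations

def perm_pvalue_mean_diff(xs, ys):
    """Permutation p-value based on T = ybar - xbar for the one-sided
    alternative H1: G(x) = F(x - delta), delta > 0.  Enumerates all
    C(m+n, n) ways of assigning the pooled values to the y-role and
    returns the proportion of relabelled statistics >= the observed one
    (the observed arrangement is included in the count)."""
    m, n = len(xs), len(ys)
    z = list(xs) + list(ys)
    total = sum(z)
    t_obs = sum(ys) / n - sum(xs) / m
    hits = trials = 0
    for idx in combinations(range(m + n), n):  # entries playing the y-role
        sy = sum(z[i] for i in idx)
        t = sy / n - (total - sy) / m
        hits += t >= t_obs
        trials += 1
    return hits / trials
```

Rejecting when this p-value is at most $\alpha$ corresponds to the "top $100\alpha\%$" rule in the text; for large $m+n$ one would sample permutations at random instead of enumerating.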
If $m+n$ is small, some $T'_{m+n}$ may equal $T_{m+n}$. A continuity correction is then often used: each $T'_{m+n}$ value equal to $T_{m+n}$ is counted as half larger and half smaller than $T_{m+n}$. Under mild conditions, this test is asymptotically equivalent to the t-test; that is, the two tests give very close p-values over a wide range of p-values. One may check against the definition to verify that this test is a permutation test.

Difference in ranks. Consider the same alternative $H_1: G(x) = F(x - \delta)$ for some $\delta > 0$. Instead of examining the size of the difference in sample means $\bar{y} - \bar{x}$, we may first replace each observed value by its rank in the set of all observed values. Define
\[
r(x) = \sum_{j=1}^{m} 1(x_j \leq x) + \sum_{k=1}^{n} 1(y_k \leq x).
\]
Thus, $r(y_j)$ is the number of observations in the pooled sample that are smaller than or equal to $y_j$. Suppose both $F$ and $G$ are continuous. In this case, we do not need to look into the possibility of tied observations. Remedies for data with equal observed values are given in various places; our focus is on conceptual issues. Let
\[
T_{m+n} = \sum_{j=1}^{n} r(y_j).
\]
The largest possible value of $T_{m+n}$ is attained when $x_i \leq y_j$ for every pair $(i, j)$. A large observed value of $T_{m+n}$ is indicative of departure from $H_0$ in favour of $H_1$. Thus, a rank-based permutation test rejects $H_0$ when the observed $T_{m+n}$ is among the top $100\alpha\%$ of the values.

If $H_0$ holds, then $T_{m+n}$ has the same distribution as the total of a simple random sample of size $n$ drawn without replacement from the population $\{1, 2, \ldots, N\}$ with $N = m+n$. Hence, by some simple calculations, we have
\[
E\{T_{m+n}\} = \frac{1}{2} n(m+n+1) \quad \text{and} \quad \mathrm{var}(T_{m+n}) = \frac{1}{12} nm(m+n+1).
\]
It can be proved that
\[
\frac{T_{m+n} - E\{T_{m+n}\}}{\sqrt{\mathrm{var}(T_{m+n})}} \to N(0, 1)
\]
in distribution, as both $n, m \to \infty$ with $n/(n+m)$ having a limit in $(0, 1)$. An approximate one-sided rejection region can be determined from this limiting distribution. This test is called the Wilcoxon two-sample rank-sum test.
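A sketch of the rank-sum computation with its normal approximation (pure Python; the function name is my own choice, and ties are assumed absent, in line with the continuous-model discussion above):

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank-sum statistic T = sum of the pooled-sample ranks of
    the y's, its null mean n(m+n+1)/2 and variance nm(m+n+1)/12 from the
    sampling-without-replacement argument, and the one-sided p-value from
    the N(0, 1) approximation.  Assumes no tied observations."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    m, n = len(x), len(y)
    t = sum(rank[v] for v in y)
    mean = n * (m + n + 1) / 2
    var = n * m * (m + n + 1) / 12
    z = (t - mean) / math.sqrt(var)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))   # 1 - Phi(z)
    return t, z, p_one_sided
```

With $x = (1, 3, 5)$ and $y = (2, 4, 6)$, the ranks of the $y$'s are $2 + 4 + 6 = 12$ against a null mean of $10.5$, so the standardized statistic is small and the departure is not significant.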
It is also called the Mann-Whitney U test, the Mann-Whitney-Wilcoxon test, or the Wilcoxon-Mann-Whitney test. I am among those who are really confused by these names. Note that $T_{m+n}$ is related to $\sum_i \sum_j 1(x_i < y_j)$, which is a U-statistic; see references on U-statistics. This might be the reason behind the name U test.

Neither of the above two tests is uniformly most powerful. The test based on ranks is nonparametric. Such tests are valued because their validity is free from model mis-specification.

An additional remark concerns the alternative model. The formulation above is clearly geared toward a one-sided alternative. However, a two-sided Wilcoxon two-sample rank test can be built on the same principle: we reject the null hypothesis when $T_{m+n}$ is extremely large or extremely small among the $T'_{m+n}$. I leave it to you to decide on a way to define the p-value. Clearly, we do not truly have a principle for what quantity should be called a p-value.

20.5 Kolmogorov-Smirnov and Cramér-von Mises tests

Let $x_1, x_2, \ldots, x_n$ be a set of i.i.d. observations from a continuous distribution $F$. The model under consideration is $\mathcal{F}$: all continuous univariate distributions. Without any additional knowledge about the specific $F$ from which we obtained the sample, one estimator of the cumulative distribution function $F$ is the empirical distribution
\[
F_n(x) = n^{-1} \sum_{i=1}^{n} 1(x_i \leq x).
\]
When the $x_i$'s are all different, it is a uniform distribution on $x_1, \ldots, x_n$. We may not be too happy, as this estimator is not a continuous c.d.f. while the model $\mathcal{F}$ is made of continuous distributions. Nevertheless, $F_n$ is a good estimator of $F$ in many ways. Let
\[
D_n(F) = \sup_x |F_n(x) - F(x)|.
\]
By the famous Glivenko-Cantelli theorem, $D_n(F) \to 0$ almost surely as $n \to \infty$ when $F$ is the true distribution. Suppose we want to test $H_0: F = F_0$ versus $H_1: F \neq F_0$. It is sensible to reject $H_0$ when $D_n(F_0)$ is large.
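Computing $D_n(F_0)$ is easy once one notes that the supremum is attained at (or just before) an order statistic, since $F_n$ jumps by $1/n$ at each one. A sketch (hypothetical helper; $F_0$ is passed in as a Python function):

```python
def ks_statistic(data, F0):
    """D_n(F0) = sup_x |F_n(x) - F0(x)|.  Because F_n jumps by 1/n at
    each order statistic and F0 is nondecreasing, it suffices to check
    the discrepancy on either side of each order statistic."""
    xs = sorted(data)
    n = len(xs)
    return max(max(i / n - F0(x), F0(x) - (i - 1) / n)
               for i, x in enumerate(xs, start=1))
```

For example, against the uniform null $F_0(u) = u$ on $[0, 1]$, the sample $(0.1, 0.2, 0.3, 0.4)$ gives $D_n = 1 - 0.4 = 0.6$, driven by the gap above the largest observation.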
The test of the form $\phi(x) = 1(D_n(F_0) > k)$ for some $k > 0$ is called the Kolmogorov-Smirnov test. In applications, we would like to choose $k$ so that the test has some pre-specified size. This is possible only if we have an easy-to-compute expression for $\mathrm{pr}\{D_n(F_0) > k\}$, which is likely a mission impossible. However, Kolmogorov proved that
\[
P\{\sqrt{n}\, D_n(F_0) \leq t\} \to 1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} \exp(-2 j^2 t^2)
\]
as $n \to \infty$. Thus, when $n$ is large, we may use the right-hand side to pick a value of $t$ so that
\[
2 \sum_{j=1}^{\infty} (-1)^{j-1} \exp(-2 j^2 t^2) = \alpha
\]
and reject $H_0$ when $\sqrt{n}\, D_n(F_0) > t$. The expression is certainly easy to use for computing an approximate p-value. How large must $n$ be for the approximation to have satisfactory accuracy? I do not have an answer, but one exists somewhere. I will not try to give a proof. All I can say is that this large-sample result is crazily elegant!

The Kolmogorov-Smirnov test measures the maximum discrepancy between $F_n$ and $F$. It might be more helpful to examine an average difference. The Cramér-von Mises test works in this fashion:
\[
C_n(F) = \int \{F_n(x) - F(x)\}^2 \, dF(x).
\]
Under the null distribution $F_0$, it has been shown that
\[
n C_n(F_0) \to \sum_{j=1}^{\infty} \lambda_j \chi^2_{1j},
\]
where $\lambda_j = j^{-2} \pi^{-2}$ and $\chi^2_{1j}$, $j = 1, 2, \ldots$ are independent chi-square random variables with one degree of freedom. The sum of the coefficients is $\sum_j (j\pi)^{-2} = 1/6$, which matches the fact that $E\{n C_n(F_0)\} = 1/6$ exactly under $H_0$. There can certainly be many other ways to examine the difference between $F_n$ and $F$. By my latest check, there is an R function designed to carry out the Kolmogorov-Smirnov test; see its help file if you are interested in how the p-value is numerically computed for various ranges of the sample size $n$.

20.6 Pearson's goodness-of-fit test

Suppose the observations are naturally categorized into $K$ groups. At the same time, these $n$ observations are believed to be i.i.d. Let $p_k$ be the probability of an observation falling into category $k$, $k = 1, 2, \ldots, K$.
One simple question is: do the data support or contradict the hypothesis that $p_k = p_{k0}$, $k = 1, 2, \ldots, K$? One possible approach to such a concern is Pearson's goodness-of-fit test. We phrase the question from the opposite angle: is there significant evidence against the null hypothesis $H_0: p_k = p_{k0}$?

Let $o_k$ be the number of observations, out of the total $n$, that fall into category $k$. Let $e_k = n p_{k0}$ denote the expected value of $o_k$ under the null model. Pearson's statistic for this test problem is defined to be
\[
W_n = \sum_{k=1}^{K} \frac{(o_k - e_k)^2}{e_k}.
\]
This statistic clearly has one desired property for a test: when the true model deviates from the null hypothesis, we expect larger differences between $o_k$ and $e_k$. Thus, $W_n$ is stochastically larger when $H_0$ is severely violated. Naturally, we reject $H_0$ when the value of $W_n$ is large.

The next desired property for a test statistic is a known distribution under $H_0$. This is not exactly available here. However, when $n \to \infty$ while $K$ is fixed, it can be shown that
\[
W_n \xrightarrow{d} \chi^2_{K-1}.
\]
Since the chi-square distribution is well documented, we may use its upper quantile as the critical value for this test. Namely, the test is: reject $H_0$ when $W_n > \chi^2_{K-1}(1 - \alpha)$. Of course, this assumes a size-$\alpha$ test is desired in the first place.

In a more realistic situation, suppose the $K$ categories represent the number of boys in a family with $K - 1$ children. Is this number truly binomially distributed, as it would be under the assumption that there is no correlation between siblings and the population is homogeneous? In this case, we do not have the $p_{k0}$'s completely specified, but we have an analytical expression:
\[
p_{0k}(\theta) = \binom{K-1}{k-1} \theta^{k-1} (1 - \theta)^{K-k}.
\]
Namely, they are specified by a single parameter, the probability of success. In this case, let $\hat{\theta}$ be the maximum likelihood estimate of $\theta$ and compute $\hat{e}_k = n p_{0k}(\hat{\theta})$. We then revise the definition of $W_n$ to get
\[
W_n = \sum_{k=1}^{K} \frac{(o_k - \hat{e}_k)^2}{\hat{e}_k}.
\]
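For the fully specified null (no estimated parameters), the computation of $W_n$ is immediate. A minimal sketch, with hypothetical example data; the comparison value $\chi^2_5(0.95) \approx 11.07$ is a standard table entry:

```python
def pearson_gof(observed, p0):
    """Pearson's W_n = sum_k (o_k - e_k)^2 / e_k with e_k = n * p_k0,
    for a completely specified null hypothesis p_k = p_k0."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, p0))

# Hypothetical data: 100 rolls of a die, testing fairness (K = 6).
counts = [20, 15, 12, 18, 17, 18]
wn = pearson_gof(counts, [1 / 6] * 6)
# W_n is about 2.36, well below the chi-square(5) critical value 11.07,
# so there is no evidence against fairness at the 5% level.
```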
Although we have to estimate $\theta$, the limiting distribution of $W_n$ is only altered slightly:
\[
W_n \xrightarrow{d} \chi^2_{K-2}.
\]
In general, if the $p_{0k}$ are functions of $\theta$ and $\theta$ has dimension $d$, the same approach is applicable; the limiting distribution remains chi-square, with degrees of freedom $K - d - 1$.

This being a course in mathematical statistics, one may ask how to establish the asymptotic result. One approach is to connect $W_n$ with the likelihood ratio test. This will be left as an assignment problem.

The applied aspect of this test can be more troublesome. The biggest concern is: when does the chi-square approximation kick in? The rumour is: do not use the goodness-of-fit test unless $\min_k \{e_k\} \geq 5$. In other applications, the observations are not "naturally categorized". The step of creating $K$ categories in order to examine the goodness-of-fit can be controversial.

20.7 Fisher's exact test

In a classical and likely fictional example, a lady claimed that she could tell, by tasting the mixture, whether tea was added after milk or the other way around. Carrying out an experiment and analyzing the resulting data is the best way to settle the existence of such an ability. We take the absence of this ability as the null hypothesis; rejecting it supports the lady's claim.

Suppose $A + B$ cups of tea of the two types of preparation were prepared. Assume the lady had no ability to tell the two types apart. Her selection of $A$ cups of type A would then be no different from randomly selecting $A$ cups out of $A + B$ and identifying them as the cups where tea was added after milk. Note that in this experiment, the total number of cups, as well as how many of them are of type A, are fixed and known. The randomness comes from the lady, who attempts to identify the type A teas given the knowledge of the split into $A$ and $B$. Let $X$ be the number of correctly identified type A cups.
Under the null hypothesis that there is no association between being type A and being identified as type A, the random variable $X$ has a hypergeometric distribution:
\[
\mathrm{pr}(X = x) = \frac{\binom{A}{x} \binom{B}{A-x}}{\binom{A+B}{A}}.
\]
Clearly, this distribution does not depend on any unknown parameters, as $A$ and $B$ are known. A large value of $X$ is evidence against the null hypothesis, pointing in the direction of positive association. It is therefore sensible to reject the null hypothesis when $X$ is large. The statistic $X$ has the two desired properties of a test statistic. We would therefore compute the p-value, for the alternative that she has the skill, as
\[
\mathrm{pr}(X > x_0) + 0.5\, \mathrm{pr}(X = x_0),
\]
where $x_0$ is the observed value. The continuity correction here is intuitively helpful, but there is no theory to support this practice.

More generally, a $2 \times 2$ table may be formed as follows:
\[
\begin{array}{ccc}
n_{11} & n_{12} & n_{1+} \\
n_{21} & n_{22} & n_{2+} \\
n_{+1} & n_{+2} & n
\end{array}
\]
The experiment places $n$ units into the 4 possible cells of this table. Without any restrictions, the probability of a unit falling into cell $(i, j)$ may be denoted $p_{ij}$ for $i, j = 1, 2$. The joint distribution of $(n_{11}, n_{12}, n_{21}, n_{22})$ is multinomial with these probabilities. Suppose the row and column classifications are independent, so that $p_{ij} = p_{i\cdot} p_{\cdot j}$ for some $p_{i\cdot}$ and $p_{\cdot j}$. Conditioning on the marginal totals (corresponding to the knowledge of $A$ and $B$ in the tea experiment), $n_{11}$ has a hypergeometric distribution. Again, $n_{11}$ has the two desired properties of a test statistic in the conditional sense, regarding the marginal totals as fixed. Extreme values of $n_{11}$ give evidence against the independence assumption. One may select $n_{11}$ or $-n_{11}$ as a test statistic, or find a way to combine them. Regardless of the choice, the subsequent p-value of the test can be computed via the hypergeometric distribution, which does not depend on any unknown parameters.
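The mid-p computation from the tea example can be sketched as follows (hypothetical function name; `math.comb` supplies the binomial coefficients):

```python
from math import comb

def fisher_midp(A, B, x0):
    """One-sided mid-p value pr(X > x0) + 0.5 * pr(X = x0), where X is
    the number of correctly identified type-A cups and
    pr(X = x) = C(A, x) C(B, A - x) / C(A + B, A) under the null."""
    denom = comb(A + B, A)
    pmf = lambda x: comb(A, x) * comb(B, A - x) / denom
    return sum(pmf(x) for x in range(x0 + 1, A + 1)) + 0.5 * pmf(x0)
```

With $A = B = 4$ cups and $x_0 = 3$ correct identifications, the mid-p value is $1/70 + 0.5 \times 16/70 = 9/70 \approx 0.13$: not significant at the 5% level, so three out of four correct cups would not establish the claimed ability.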
In theory the resulting p-value is exact, provided rounding error is not taken into consideration. In other words, no large-sample approximations are needed for the p-value computation. This property, together with its inventor, gives the test its name.

20.8 Assignment problems

1. Carry out two permutation tests on the Precambrian iron formation data given below. Regard it as a two-sample problem, not a paired-sample problem. Consider the hypothesis that the first two types have the same mean ($H_0$) versus the hypothesis that the first two formations have unequal means ($H_1$). Namely, we wish to carry out two-sided tests for the mean. You are asked to carry out a number of parametric and nonparametric tests as follows.

(a.1) Carry out the permutation test based on the difference in means. Namely, compute $T'_{n+m} = |\bar{y}_n - \bar{x}_m|$ for all permuted $X, Y$ values. Obtain exact counts, with continuity correction, of how many of them are larger than $T_{n+m}$.

(a.2) Carry out the permutation test based on ranks. Namely, obtain the rank of each observed value, $r(x) = \sum 1(x_i \leq x) + \sum 1(y_j \leq x)$, and compute $T'_{n+m} = \sum_{i=1}^{n} r(x'_i)$. The rest is the same as in (a.1), but you need to adjust for this being a two-sided test. Report both p-values.

(b) Use the t-test and the Wilcoxon rank-sum test based on the CLT, and report both p-values. Again, use two-sided tests accordingly. Directly apply R functions.

(c) Compare these 4 p-values and comment on what you find. Taking the significance level as 0.05, do they contradict each other?

Remark: (c) is not a right-or-wrong question. The number of permutations in this example is around 180K.

The data set is from an article on the origin of Precambrian iron formation.
The article reported the following data on percentage iron for 4 types of iron formation (1 = carbonate, 2 = silicate, 3 = magnetite, 4 = hematite); only the first two groups are given below:

1: 20.5 28.1 27.8 27.0 28.0 25.2 25.3 27.1 20.5 31.3
2: 26.3 24.0 26.2 20.2 23.7 34.0 17.1 26.8 23.7 24.9

2. In a clinical trial, a total of 214 oral cancer patients were recruited. Their recurrences after a number of years were observed, together with information on the site of the original tumour. This part of the outcomes can be summarized by the following $2 \times 2$ table.

              Recurrence   Non-recurrence   Marginal total
  Low site        11             26               37
  High site       28            149              177
  Total           39            175              214

(a) Does the "high site" tumour exhibit a higher rate of recurrence?

(b) Do the recurrence risks differ between the two groups of patients?

Remark: answer these questions as a real-world problem. I recommend the Fisher exact test in (a) and the Wald test for $H_0: r_1 = r_2$ in (b). It is unsatisfactory to simply give two p-values via an R function. Give a few sentences on how the p-values are calculated, in statistical/probability terminology.

3. Based on the same data as in the last question, use Pearson's goodness-of-fit test for the hypothesis that rows and columns are independent. Give all intermediate values.

(d) Find the 10% upper quantile of the limiting distribution of the Kolmogorov-Smirnov test statistic to the precision of the third decimal place. Use a short program and justify your precision. Do not use an all-powerful R function.

Chapter 21

Confidence intervals or regions

Suppose we have a sample $X$ from a distribution that belongs to $\mathcal{F}$. Under a parametric setting, the distributions in $\mathcal{F}$ are labelled by $\theta$, and the "true" distribution of $X$ is the one with label $\theta = \theta_0$, a value that conceptually exists but is unknown to us.
We generally do not bother to use a special symbol $\theta_0$ for the true parameter value unless it would otherwise become ambiguous. Most often, we simply declare that the parameter of the distribution of $X$ is $\theta$ and that its range, the parameter space, is $\Theta$. Furthermore, we implicitly assume that $\Theta$ is a subset of $R^d$ with all the mathematical properties needed, such as being open, convex, and so on. In addition, the distribution of $X$ depends on the value of $\theta$ in a continuous fashion.

Based on the realized value of $X$, one can estimate $\theta$ using any preferred method; this is called point estimation. One may also make a judgement on whether or not $\theta$ is a member of an elite subset $H_0$ by conducting a hypothesis test. The third option is to specify a subset $\Theta_0$ of $\Theta$ so that we are confident that $\theta \in \Theta_0$. When $d = 1$, we usually prefer $\Theta_0$ to be an interval in $\Theta$. When $d > 1$, $\Theta_0$ is called a confidence region. It is preferable that $\Theta_0$ be connected and contain no holes; more often than not, a convex set is most appealing.

We require that $\Theta_0$ be decided by the value of $X$ and not depend on any unknown parameters. Thus, it is a random set, and the character of its randomness depends on the distribution of $X$. Just as a statistic is a function exclusively of the data, a confidence region is a set-valued function exclusively of the data.

In Bayesian data analysis, the distribution of $X$ is regarded as a realized distribution from a "super-population". The super-population is specified by a prior distribution, which is regarded as known. In the parametric setting of Bayes analysis, the distribution of $X$ is $f(x; \theta_0)$, where $\theta_0$ is a realized but unobserved value of a random variable $\theta$ whose distribution is the known prior. The prior distribution is often denoted $\pi(\theta)$, which also stands for its density function; the corresponding cumulative distribution function is often denoted $\Pi(\theta)$.
The conditional distribution of $\theta$ given $X$ is called the posterior distribution, and it is derived via the Bayes formula. A region of $\theta$ on which the posterior distribution has high density is referred to as a credible region. The topic of credible regions will be discussed later.

Constructing a confidence interval or region is easy. The real challenge is to construct an interval with desirable properties. We should specify what properties such an interval estimate should have, and how to construct intervals with these properties.

Definition 21.1. An interval/region $C(x)$, as a function of the realized value of $X$, is a confidence interval/region for $\theta$ at level $1 - \alpha$ for some $\alpha \in (0, 1)$, if
\[
\inf_{\theta} \mathrm{pr}\{\theta \in C(X); \theta\} = 1 - \alpha.
\]
The probability calculation in this definition is done by regarding $\theta$ as the true parameter value of the distribution of $X$. The value of $\theta$ is not random in the definition of a frequentist confidence region; the interval $C(X)$ is random through the randomness of $X$. In comparison, when $\theta$ is regarded as a random variable in Bayes analysis, a similar notion is needed but must be defined separately. Corresponding to the confidence region, the Bayes version is called a "credible region".

There is no specific shape requirement on a confidence/credible region in these definitions, yet we have preferences. The probability that $C(X)$ covers $\theta$ generally depends on the specific value $\theta$ takes. It is desirable that the coverage probability not depend on this specific value; if this is achieved, the infimum operation in the above definition becomes redundant.

The models in real-world applications are often too complex for a sensible $C(X)$ meeting the standard of Definition 21.1 to be found. People then often, implicitly, use the following convention, which is wrong in a strict sense. To make the distinction, we call it an asymptotic confidence region and place a formal definition here.

Definition 21.2.
Suppose the observation $X$, from a distribution that is a member of a distribution family, is regarded as an observation in an imaginary sequence $X_1, X_2, \ldots, X_n, \ldots$ from a corresponding imaginary population sequence, so that the parameter $\theta$ remains interpretable throughout. An interval $C_n(x_n)$, as a function of the current realized value of $X_n$, is an asymptotic confidence interval for $\theta$ at level $1 - \alpha$ for some $\alpha \in (0, 1)$, if
\[
\inf_{\theta} \lim_{n \to \infty} \mathrm{pr}\{\theta \in C_n(X_n); \theta\} = 1 - \alpha.
\]
The $n$ in this definition usually stands for the sample size, and $C_n$ is a confidence region derived from a principled procedure applicable at any sample size $n$ (or other dynamic index). The relevance of this definition largely depends on the sensibility of the population sequence and of the principled approach to constructing $C_n(X_n)$. In addition, the sample size $n$ in an application should be large enough that the value of $\mathrm{pr}\{\theta \in C_n(X_n); \theta\}$ is not far off from its limit when $\theta$ is within the anticipated region.

Similar to the optimality notion in hypothesis testing, comparison between different confidence regions is possible only if they are lined up by their confidence levels. If two confidence intervals (or two construction procedures) have the same confidence level, the one with the shorter average/expected length is preferred. One may in addition prefer that the variation in the length of the interval be low.

Suppose $C(X) = [-2, 2]$ is a confidence interval for a population mean. If this interval is sensibly constructed, it is generally true that the most likely value of $\theta$ is located at the centre of the interval; namely, $[-1, 1]$ is more likely to contain $\theta$ than $[-2, -1] \cup [1, 2]$ is. Yet this belief is neither supported by, nor part of, the formal definition of the confidence interval.

Unlike the theory for hypothesis testing, there seem to be fewer solid mathematical criteria for the optimality of confidence regions. Confidence regions are often derived from other well-known procedures.
If these procedures have optimal properties in the sense of their original purpose, statisticians seem to feel comfortable recommending the corresponding confidence regions. There are some optimality criteria in classical textbooks, though they are not convenient to use and are generally ignored in contemporary textbooks. We now give a few generally recommended approaches for constructing confidence intervals.

21.1 Confidence intervals based on hypothesis tests

Assume a size-$\alpha$ test $\phi(x; \theta_0)$ is given for each simple null hypothesis $H_0: \theta = \theta_0$ against the composite alternative $H_1: \theta \neq \theta_0$. Thus, $\theta_0$ is rejected when $\phi(x; \theta_0) = 1$, assuming no randomization is involved; that is, we consider only the case where $\phi(x; \theta_0)$ takes the value 0 or 1. Based on this test $\phi(x; \theta)$, with $\theta$ generic, we define
\[
C(x) = \{\theta : \phi(x; \theta) < 1\}.
\]
It is easy to see that
\[
\mathrm{pr}\{\theta \in C(x); \theta\} = \mathrm{pr}\{\phi(X; \theta) < 1; \theta\} \geq 1 - E\{\phi(X; \theta); \theta\} \geq 1 - \alpha
\]
for all $\theta \in \Theta$. Thus, $C(x)$ is a confidence region at level $1 - \alpha$. In most cases, the $C(x)$ obtained this way for a one-dimensional $\theta$ is an interval, though it does not have to be. Clearly, the coverage probability may exceed $1 - \alpha$ at some $\theta$ values; at the same time, it is never below $1 - \alpha$.

Example 21.1. Suppose we have a random sample from $N(\theta, \sigma^2)$ and wish to construct a confidence interval for $\theta$. One approach is to use the likelihood ratio test for each $H_0: \theta = \theta_0$. The test statistic can be simplified to
\[
T(x; \theta_0) = \frac{\sqrt{n}\, |\bar{X} - \theta_0|}{s_n}.
\]
The rejection region for each $\theta_0$ is $\{x : T(x; \theta_0) \geq t_{n-1}(1 - \alpha/2)\}$. Consequently, the confidence interval based on this test is $\{\theta_0 : T(x; \theta_0) \leq t_{n-1}(1 - \alpha/2)\}$, or
\[
[\bar{x} - t_{n-1}(1 - \alpha/2)\, s_n/\sqrt{n},\ \bar{x} + t_{n-1}(1 - \alpha/2)\, s_n/\sqrt{n}].
\]
It is nice to see that the outcome is indeed an interval.
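Example 21.1 translates directly into code. A sketch (function name mine; the t quantile is passed in as a plain number taken from tables, e.g. $t_9(0.975) = 2.262$ for $n = 10$ at the 95% level, since the Python standard library has no t quantile function):

```python
import math

def t_interval(data, t_crit):
    """Confidence interval {theta0 : T(x; theta0) <= t_{n-1}(1 - alpha/2)}
    obtained by inverting the t-test, i.e. xbar +/- t_crit * s_n / sqrt(n),
    where s_n is the usual sample standard deviation."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in data) / (n - 1))
    half = t_crit * s / math.sqrt(n)
    return xbar - half, xbar + half
```

The interval is automatically centred at $\bar{x}$, reflecting the symmetry of the t-test's rejection region.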
21.2 Confidence intervals by pivotal quantities

A pivot is a function of both the data and the unknown parameter whose distribution does not depend on unknown parameters. Suppose $q(x; \theta)$ is a pivot. Then, when $q(X; \theta)$ has a continuous distribution, there is a quantity $q_\alpha$ such that
\[
P\{q(X; \theta) > q_\alpha; \theta\} = \alpha.
\]
The existence of $q_\alpha$ is ensured because the distribution of $q(X; \theta)$ does not depend on the unknown value of $\theta$. If $q(X; \theta)$ has a discrete distribution, some continuity correction may be used. Let
\[
C(x) = \{\theta : q(x; \theta) < q_\alpha\}.
\]
It is easily seen that $C(x)$ is a confidence region for $\theta$ at level $1 - \alpha$. Pivotal quantities are most readily available in location-scale families.

Example 21.2. Suppose we have a random sample of size $n$ from $N(\theta, \sigma^2)$, and let us find a confidence interval for $\sigma^2$. It is well known that
\[
q(x; \sigma^2) = \frac{\sum (x_i - \bar{x})^2}{\sigma^2}
\]
has a chi-square distribution with $n - 1$ degrees of freedom; thus, it is a pivot. Let $\chi^2_{n-1}(0.95)$ be the 95th percentile of the chi-square distribution with $n - 1$ degrees of freedom. Then
\[
\Big\{\sigma^2 : \frac{\sum (x_i - \bar{x})^2}{\sigma^2} < \chi^2_{n-1}(0.95)\Big\} = \Big[\frac{\sum (x_i - \bar{x})^2}{\chi^2_{n-1}(0.95)},\ \infty\Big)
\]
is a 95% confidence interval for $\sigma^2$. If a two-sided confidence interval is asked for, then
\[
\Big\{\sigma^2 : \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \in \big(\chi^2_{n-1}(0.025),\ \chi^2_{n-1}(0.975)\big)\Big\} = \Big[\frac{\sum (x_i - \bar{x})^2}{\chi^2_{n-1}(0.975)},\ \frac{\sum (x_i - \bar{x})^2}{\chi^2_{n-1}(0.025)}\Big]
\]
is a choice with 95% confidence level.

It was natural to use the 0.025 and 0.975 quantiles in the above example. However, using the 0.02 and 0.97 quantiles also gives a 95% two-sided confidence interval. Which one is better? Should we have a look at their average lengths?

In applications, some functions of both data and parameter have a distribution that does not depend on the "true" parameter value in an asymptotic sense. In this case, one may activate Definition 21.2 to justify asymptotic confidence regions.

21.3 Likelihood intervals

By definition, a confidence region is characterized by its level of confidence. Yet the interval makes more sense if a parameter value within the region is more "likely" than a parameter value outside the region to be the "true" value of the parameter. This is particularly the case for the confidence interval for $\sigma^2$ in the last example. The notion of a likelihood interval, and the related Bayesian approach, seem to be improvements in this direction.

Suppose we have a random sample of size $n$ from a parametric family $\{f(x; \theta) : \theta \in R^d\}$, and consider the problem of constructing a confidence interval/region for $\theta$. Since by "definition" the maximum likelihood estimator is the most "likely" value of the parameter, the interval should contain the MLE $\hat{\theta}$. In addition, if the likelihood at $\theta'$ is almost as large as the likelihood at $\hat{\theta}$, then $\theta'$ is also a good candidate for inclusion in the interval. This notion quickly leads to a likelihood region/interval of the form
\[
C(X) = \{\theta : L(\theta)/L(\hat{\theta}) \geq c\},
\]
where $\hat{\theta}$ is the MLE and $c$ is a positive constant to be chosen. By Definition 21.1, to make a likelihood interval into a confidence interval, all we need is to choose $c$ such that
\[
P\{\theta \in C(X); \theta\} \geq 1 - \alpha
\]
for every $\theta$, when the pre-specified level is $1 - \alpha$. A meaningful constant $c$ for which the coverage probability is no less than $1 - \alpha$ under all $\theta$ may not exist. However, when the sample size $n$ is large and the model is regular, it is possible to find a $c_n$ such that the coverage probability is approximately $1 - \alpha$ for each $\theta$; that is, the difference is a quantity converging to 0 as $n \to \infty$, whichever $\theta$ is the true value. This is an asymptotic confidence region at level $1 - \alpha$ by Definition 21.2. Students with a rigorous mathematics background may notice that this asymptotic notion is not uniform in $\theta$: we only require pointwise convergence, not uniform convergence over the parameter space.

Example 21.3. Consider the situation where we have an i.i.d.
sample of size $n$ from an exponential distribution parameterized by its mean $\theta$, and a confidence interval for $\theta$ is desired. Let $\bar{X}_n$ be the sample mean, regarded as a random variable; it is also the MLE of $\theta$. The log-likelihood function is
\[
\ell_n(\theta) = -n \log \theta - n \theta^{-1} \bar{X}_n.
\]
The likelihood ratio statistic is
\[
R_n(\theta) = 2n\{-\log(\bar{X}_n/\theta) - 1 + (\bar{X}_n/\theta)\},
\]
which is convex in $\bar{X}_n/\theta$. Hence, a likelihood interval for $\theta$ has the form
\[
C(X) = \{c_1 \bar{X}_n \leq \theta \leq c_2 \bar{X}_n\}
\]
for some $c_1 < c_2$ such that $R_n(c_1 \bar{X}_n) = R_n(c_2 \bar{X}_n)$ and
\[
P\{\bar{X}_n/\theta \in (1/c_2,\ 1/c_1)\} = 1 - \alpha,
\]
where $1 - \alpha$ is the pre-specified confidence level.

Suppose one would like a confidence interval for the rate parameter $\lambda = 1/\theta$ in this example. It is easily seen that the likelihood-based confidence interval would simply be
\[
C'(X) = \{(c_2 \bar{X}_n)^{-1} \leq \lambda \leq (c_1 \bar{X}_n)^{-1}\},
\]
where $c_1$ and $c_2$ are the same constants as above. Note that $\theta \in C(X)$ if and only if $1/\theta = \lambda \in C'(X)$. We say that likelihood intervals are equivariant, just like their counterpart, the MLE. This property is not shared by all other methods, such as the one to be introduced next.

21.4 Intervals based on the asymptotic distribution of $\hat{\theta}$

It is arguable whether or not this is a new method. We might call it Wald's method, yet it has too many moving parts to be solidly given this name. Often, $\sqrt{n}(\hat{\theta} - \theta)$ is asymptotically normal with limiting variance $\sigma^2$. When $\sigma^2$ is known,
\[
q(X; \theta) = \frac{\sqrt{n}(\hat{\theta} - \theta)}{\sigma}
\]
is an approximate pivotal quantity. Because of this, an approximate two-sided $1 - \alpha$ confidence interval for $\theta$ is given by
\[
\hat{\theta} \pm z_{1-\alpha/2}\, \sigma/\sqrt{n}.
\]
If $\sigma$ is unknown but a consistent estimator $\hat{\sigma}$ is available, a substitute is
\[
\hat{\theta} \pm z_{1-\alpha/2}\, \hat{\sigma}/\sqrt{n}.
\]
It may be more convenient to write the above as $\hat{\theta} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{var}}(\hat{\theta})}$; the meaning of this notation is obvious.

Example 21.4. Let $X_1, \ldots, X_n$ be an i.i.d.
sample from a Poisson distribution with mean parameter $\theta$. The MLE of $\theta$ is $\hat{\theta} = \bar{X}_n$, the sample mean. Construct a 95% CI for $\theta$.

Solution: It is well known that $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \theta)$. Thus, a 95% CI for $\theta$ is given by
\[
\bar{X}_n \pm 1.96 \sqrt{\bar{X}_n/n}.
\]
When $1.96\sqrt{\bar{X}_n/n} > \bar{X}_n$, one must set the lower confidence limit (bound) to 0.

It is equally appropriate to notice that $\sqrt{n}(\sqrt{\hat{\theta}} - \sqrt{\theta}) \xrightarrow{d} N(0, 1/4)$. Hence, one may construct a 95% CI based on $\sqrt{n}\,|\sqrt{\hat{\theta}} - \sqrt{\theta}| \leq 1.96/2$. Solving this inequality, we get
\[
\big[\{\sqrt{\bar{X}_n} - 1.96/(2\sqrt{n})\}^{+}\big]^2 < \theta < \{\sqrt{\bar{X}_n} + 1.96/(2\sqrt{n})\}^2.
\]
The third choice is to work with
\[
\frac{\sqrt{n}(\hat{\theta} - \theta)}{\sqrt{\theta}} \xrightarrow{d} N(0, 1).
\]
With this asymptotic pivotal quantity, the CI has lower and upper limits
\[
\bar{X}_n + \frac{1.96^2}{2n} \mp \sqrt{\frac{1.96^4}{4n^2} + \frac{1.96^2 \bar{X}_n}{n}}.
\]
♦

Many students have a natural tendency to ask: which of the above confidence intervals is correct? The answer is: none of them. The reason is that the critical value 1.96 is based on the limiting distribution of $\hat{\theta}$ in every case; hence, none of them has exactly 95% coverage probability (even before rounding is taken into account). If this answer is unsatisfactory to you, then you need to think hard about what "correct" means. If approximate 95% CIs are acceptable, all three are fine.

The real question in your mind might be: which one is the best? Answering this question needs an optimality criterion, and we do not have one at the moment. It then boils down to a weaker question: what are their relative merits?

The first interval is analytically simple. If the sample size $n$ is not very large, the normal approximation can be poor, and the CI may even have a negative lower bound. Chopping off the segment of the CI containing the negative values is a mathematical must but somewhat unnatural. The interval is otherwise always symmetric with respect to $\hat{\theta} = \bar{X}_n$, which is somewhat unattractive.
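The three intervals of Example 21.4 can be computed side by side; a minimal sketch (function name mine; $z = 1.96$ hard-coded for the 95% level):

```python
import math

def poisson_cis(xbar, n, z=1.96):
    """The three approximate 95% CIs for the Poisson mean of Example 21.4:
    (1) Wald: xbar +/- z * sqrt(xbar / n), truncated at 0;
    (2) variance-stabilized: ({sqrt(xbar) -/+ z / (2 sqrt(n))}^+)^2;
    (3) the interval from sqrt(n)(thetahat - theta)/sqrt(theta) -> N(0,1)."""
    wald = (max(0.0, xbar - z * math.sqrt(xbar / n)),
            xbar + z * math.sqrt(xbar / n))
    root_lo = max(0.0, math.sqrt(xbar) - z / (2 * math.sqrt(n)))
    stab = (root_lo ** 2, (math.sqrt(xbar) + z / (2 * math.sqrt(n))) ** 2)
    centre = xbar + z * z / (2 * n)
    half = math.sqrt(z ** 4 / (4 * n * n) + z * z * xbar / n)
    third = (centre - half, centre + half)
    return wald, stab, third
```

For $\bar{X}_n = 4$ and $n = 100$ all three are close to $(3.6, 4.4)$; for small $n\bar{X}_n$ they differ noticeably, which is where the relative merits discussed here matter.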
I would use this one when $n$ and $\bar{X}_n$ are both large. How large is large? I do not have an absolute standard.

The second interval is nice in one way: after transforming $\hat{\theta}$ into $g(\hat{\theta}) = \sqrt{\hat{\theta}}$, the quantity $\sqrt{n}\{g(\hat{\theta}) - g(\theta)\}$ has a limiting distribution that is free of the parameter. For this reason, this type of transformation is called a variance-stabilizing transformation. Since the limiting distribution does not depend on unknown parameter values, this interval is truly based on an approximate pivot. If $n$ is not large, this is a good choice.

The third interval has its own merit. Scaling $\hat{\theta} - \theta$ by a function of $\theta$ creates a more complex pivot, which often leads to more naturally shaped confidence regions (intervals). While I have intuition for this approach, I cannot come up with concrete evidence for this preference.

Recall that testing a hypothesis on the value of $\theta$ based on the limiting distribution of the MLE $\hat{\theta}$ is called Wald's method. I am not sure whether this group of intervals should be credited to Wald, but I feel it is natural to call them Wald intervals/regions.

Topics we do not have time to go over: the multi-parameter case; the binomial example; odds ratios; intervals for quantiles.

21.5 Bayes intervals

Under the Bayesian setup, the parameter $\theta$ is a sample from some prior distribution; its value is itself a realization of a random variable. Constructing a confidence interval for a random quantity is a different topic. However, we may combine our prior information, if any, with data from $f(x; \theta)$ to take a guess at this realized value of $\theta$. It is generally accepted that the information about $\theta$ is completely summarized in its posterior distribution. If one must guess a region in which this $\theta$ is located, based on the Bayesian setup, one would select the region with the highest posterior density.

Definition 21.3. Let $\pi(\cdot|x)$ denote the posterior density function of the parameter $\theta$ given $X = x$.
Then
$$C_k = \{\theta : \pi(\theta|x) \ge k\}$$
is called a level $1-\alpha$ credible region for $\theta$ if $\mathrm{pr}(\theta \in C_k \,|\, x) \ge 1-\alpha$. Note that $\mathrm{pr}(\cdot|x)$ is used for the posterior distribution of $\theta$.

If one can credibly regard $\theta$ as an outcome from a prior distribution, then the above credible region has a very strong appeal. If $\theta$ is not a vector but a real value, then we may choose to ignore the above definition of the credible region and insist on having a credible interval.

Definition 21.4. Let $\Pi(\cdot|x)$ denote the posterior cumulative distribution function of the parameter $\theta$ given $X = x$. Suppose $\underline\theta$ is the largest value satisfying $\mathrm{pr}(\theta \ge \underline\theta \,|\, x) \ge 1-\alpha$, and $\bar\theta$ is the smallest value satisfying $\mathrm{pr}(\theta \le \bar\theta \,|\, x) \ge 1-\alpha$. Then $\underline\theta$ and $\bar\theta$ are level $1-\alpha$ lower and upper credible bounds for $\theta$.

The source of these two definitions is Bickel and Doksum (2001). Some changes are made to avoid potential non-uniqueness. The following is an example directly copied from Bickel and Doksum (2001).

Example 21.5. Suppose that given $\mu$, $X_1, \ldots, X_n$ are i.i.d. from $N(\mu, \sigma_0^2)$ with known $\sigma_0^2$. The prior distribution of $\mu$ is $N(\mu_0, \tau_0^2)$ with both parameter values known. Find the credible bounds and regions according to the above definitions.

Solution. The posterior distribution of $\mu$ given the sample is still normal, with parameters
$$\mu_B = \frac{n\bar x/\sigma_0^2 + \mu_0/\tau_0^2}{n/\sigma_0^2 + 1/\tau_0^2}
\quad\text{and}\quad
\sigma_B^2 = \Big[\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}\Big]^{-1}.$$
The lower and upper $1-\alpha$ credible bounds are simply
$$\mu_B \pm z_{1-\alpha}\,\frac{\sigma_0}{\sqrt{n + \sigma_0^2/\tau_0^2}}.$$
The $1-\alpha$ credible region is also an interval, with lower and upper limits given by
$$\mu_B \pm z_{1-\alpha/2}\,\frac{\sigma_0}{\sqrt{n + \sigma_0^2/\tau_0^2}}. \qquad ♦$$

Note that the centre of the credible interval is shifted toward $\mu_0$ compared with the usual confidence interval. The length is shortened too.

21.6 Prediction intervals

In general, the notion of a confidence region is defined for unknown parameters of a distribution family. There are cases where we hope to predict the outcome of a future trial from the same probability model.
Suppose we have an i.i.d. sample $X_1, X_2, \ldots, X_n$ from $\{f(x;\theta): \theta \in \Theta\}$. Based on this sample, we might have an estimate of $\theta$. The question is: if another independent observation is to be taken from the same distribution, what are the possible values of this future $X$?

If the value of $\theta$ for this experiment were known, we could use the high density region of $f(x;\theta)$ as our prediction region. This should allow us to catch the future value with the lowest volume prediction region. That is, let
$$C(\theta) = \{x: f(x;\theta) > c\}$$
with the known value $\theta$. We may choose $c$ such that $P(X \in C(\theta)) = 1-\alpha$ for $1-\alpha$ coverage probability. Note that this region does not depend on the random sample $X_1, \ldots, X_n$.

If $\theta$ is unknown, as is usually the case, it is natural to replace $\theta$ by its estimator, say $\hat\theta$. Although $C(\hat\theta)$ is a very sensible prediction region for $X$, its coverage probability is likely lower than $1-\alpha$ due to the uncertainty brought in by $\hat\theta$.

The event $X \in C(\hat\theta)$ contains two random components: $X$ and $\hat\theta$. The randomness in $X$ is unaffected by how well $\theta$ is estimated, while the precision of $\hat\theta$ usually improves with the sample size $n$. The limit of the improvement is $C(\theta)$. Due to the built-in randomness in $X$, one cannot do better than $C(\theta)$ no matter what. In comparison, the precision of the confidence region for $\theta$ usually improves with $n$: when $n \to \infty$, the size of the confidence region with fixed confidence level shrinks to 0.

Example. Suppose we have a random sample $X_1, \ldots, X_n$ from $N(\theta, 1)$. It is well known that $\hat\theta = \bar X_n$ is the MLE of $\theta$.

If $X$ is the outcome of a future experiment, then $X - \bar X_n$ has a normal distribution with mean 0 and variance $1 + n^{-1}$. Thus, a 95% prediction interval for $X$ is given by
$$(\bar X_n - 1.96\sqrt{1 + n^{-1}},\ \bar X_n + 1.96\sqrt{1 + n^{-1}}).$$
Clearly, increasing the sample size does not have much impact on reducing the length of the prediction interval.
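The normal prediction interval above is straightforward to compute; here is a minimal sketch (the function name is my own, purely illustrative):

```python
import numpy as np

def normal_prediction_interval(x, z=1.96):
    """95% prediction interval for a future draw from N(theta, 1),
    based on X - xbar ~ N(0, 1 + 1/n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    half = z * np.sqrt(1.0 + 1.0 / n)  # half-width never drops below z
    return xbar - half, xbar + half
```

Note that the half-width tends to $z$, not to 0, as $n \to \infty$, in contrast with a confidence interval for $\theta$.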
In general, a prediction interval can be obtained via some "pivotal" quantities. That is, we look for a function of the random quantity to be predicted and a statistic based on observations such that the resulting random quantity has a distribution free of unknown parameters. Thus, it is possible to find a subset of its range whose probability equals $1-\alpha$, the confidence level we hope to attain. This subset can often be converted into a prediction interval or region.

21.7 The relationship between hypothesis test and confidence region

Both the hypothesis test and the confidence region are frequentist concepts. The subsequent discussions are not applicable to Bayes analysis.

Particularly in recent years, the use of the hypothesis test has been questioned in the scientific community. At the centre of this dispute is the interpretation of the p-value. Even for researchers who have received many years of rigorous statistical training, obtaining a small p-value based on some test on some data set remains holy. This is bad. The majority of the researchers in the same group fully understand the non-equivalence of "statistical significance" and "scientific significance". Yet the motivation outside of scientific considerations can simply be too strong to take the non-equivalence seriously.

Here is a fictional example based on linear regression. How should we judge the importance of explanatory variables? One may find the p-value of one variable is $10^{-5}$, extremely small and much lower than the commonly used nominal level 0.05. The p-value for another is, say, 0.04, just small enough to declare its statistical significance. Which one is more important?

If it were us, we would first ask what hypotheses are under test. Most likely, the hypotheses are that their coefficients are zero in the regression relationship. Nevertheless, I prefer to have them explicitly stated. I would then ask what the purpose of this regression analysis is.
If predicting the response value in a future experiment is the goal, then it is best to examine how much variation in the response is explained by the variation of each of these two explanatory variables. The larger one is more important. The size of the p-value merely tells us how sure we are about the conclusion that its effect is non-zero.

One way to avoid such confusion, suggested by many, is to make use of confidence intervals. We feel that this is not necessarily fool-proof nor always feasible. Let $\beta_1$ and $\beta_2$ denote the two regression coefficients under consideration. Suppose the two-sided 95% confidence intervals for these coefficients are $[0.1, 1]$ and $[1, 10]$ respectively. If these two variables have been standardized, the second explanatory variable is more important in determining the size of the response variable.

The foundation for replacing a hypothesis test with a confidence interval/region construction is their equivalence: one may reject every null hypothesis that does not contain any parameter value in the confidence region; one may construct a confidence region made of all parameter values that are not rejected by a chosen hypothesis test procedure when each specific value forms the null hypothesis. This foundation does not always work. Consider the example of the Wilcoxon rank test, which is often used as a nonparametric method for the two-sample problem. In this case, there is no meaningful parameter whose confidence interval can be constructed based on this test. The goodness-of-fit test is another example. In these cases, the hypothesis test is indispensable in spite of its deficiencies. The only defence against the abuse of statistical inference procedures is to uphold the statistical principle: a small p-value based on a valid hypothesis test implies statistical significance; it does not imply scientific significance.

Let us end this chapter with another trivial example.
It is definitely statistically significant that those who buy 10 lottery tickets have a higher chance to win than those who buy only 1 lottery ticket. Yet, multiplying a number that is practically zero by 10 does not lead to a meaningfully sized chance. For the same reason, we do not take advice such as "drinking 10 cups of water a day will reduce your risk of cancer by a factor of 10". We will drink water when we are thirsty, not to reduce the risk of cancer.

21.8 Assignment problems

1. The following values are i.i.d. observations from a binomial distribution with $m = 10$ and probability of success $\theta$:
4 3 3 3 2 3 4 3 2 1 3 7 5 2 2 2 2 1 3 4
(1) Obtain the 95% confidence interval of $\theta$ based on the likelihood method.
(2) Let $\bar x$ be the sample mean and let
$$T_n(\bar x, \theta) = \frac{\sqrt{20}(\bar x - 10\theta)}{\sqrt{10\theta(1-\theta)}}$$
be used as a test statistic for $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. Note that the sample size is $n = 20$. Based on the CLT, $T_n$ is asymptotically $N(0, 1)$. Thus, we reject $H_0$ when $|T_n(\bar x, \theta_0)| > 1.96$ at the 5% level. Numerically find all values of $\theta$ that are not rejected by the above test. The outcome is a confidence interval for $\theta$.

2. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the Cauchy distribution family
$$f(x;\theta) = \frac{1}{\pi\{1 + (x-\theta)^2\}}.$$
(a) Derive the score test statistic for $H_0: \theta = 0$.
(b) Generate a set of random observations of size $n = 20$ to construct a two-sided 95% confidence interval for $\theta$ based on the score test.
Remark: the locally most powerful test is one-sided. For this problem, we use $S_n^2(\theta)/I(\theta)$ as the test statistic. The locally most powerful test uses $S_n(\theta)/\sqrt{I(\theta)}$.

3. Let $X_1, \ldots, X_n$ be a random sample from the exponential distribution with density function $f(x;\theta) = \theta^{-1}\exp(-\theta^{-1}x)$. Consider the case $n = 201$ and $\theta = 1$.
(a) Generate 1000 data sets with $n = 201$ to estimate the bias and variance of the sample median for estimating the population median.
(b) Let $\bar x$ and $s_n^2$ be the sample mean and variance.
Obtain a two-sided 95% confidence interval for $\theta$ based on the test statistic
$$T = \frac{\sqrt{n}(\bar x - \theta)}{\theta}$$
and its asymptotic $N(0, 1)$ distribution.
(c) Obtain a two-sided 95% confidence interval for $\theta$ based on the test statistic
$$T = \frac{\sqrt{n}(\bar x - \theta)}{s_n}$$
and its asymptotic $N(0, 1)$ distribution.
(d) Which interval, or rather which CI construction method, in (b) or in (c), seems to work better?

4. The following values are i.i.d. observations from a geometric distribution with probability of success $\theta$:
0 2 0 0 2 10 6 0 0 6 0 3 0 1 0 8 3 5 3 5.
The sample size is $n = 20$ and $\sum x_i = 54$.
(a) Provide the algebraic expressions needed for constructing an approximate 95% confidence interval of $\theta$ based on the likelihood method. Obtain the 95% approximate likelihood interval based on the data provided, with precision up to the third decimal place.
(b) The log odds is defined to be $\xi = \log\{\theta/(1-\theta)\}$. Obtain the approximate likelihood interval of $\xi$ based on your solution in (a).

5. Suppose that we have an i.i.d. sample $x_1, \ldots, x_n$ of size $n$ from the Poisson distribution whose p.m.f. is given by
$$f(x;\theta) = \frac{\theta^x}{x!}\exp\{-\theta\};\quad x = 0, 1, \ldots,$$
with parameter space $\theta > 0$. In the context of Bayes analysis, a prior distribution for $\theta$ with density function $\pi(\theta) = \theta\exp(-\theta)$ is specified.
(a) Consider the simplest case where $n = 2$ and $x_1 + x_2 = 9$. Show that the p.d.f. of the posterior distribution is given by
$$\pi(\theta|x) = C\theta^{10}\exp(-3\theta)$$
for some constant $C$.
(b) Find the lower and upper limits of the 95% credible region to the precision of the third decimal place.
(c) Obtain the level 95% lower and upper credible bounds of $\theta$.
Remark: the credible region and the credible bounds are mathematically different in their definitions. Their statistical difference is not so meaningful.

Chapter 22 Empirical likelihood

The likelihood method for regular parametric models has many nice properties. One potential problem, though, is the risk of model mis-specification.
If a data set is a random sample from a Cauchy distribution but we use a normal model in the analysis, the statistical claims could be grossly false. Of course, the problem is not always so serious. If the data set is a sample from a double exponential model but we use a normal model as the basis for data analysis, many statistical claims will still be asymptotically valid. For instance, the sample mean remains a good estimator of the population mean, and its variance remains well estimated by the sample variance after scaling by the sample size. The efficiency of the point estimator, however, is compromised.

To avoid the risk of model mis-specification, nonparametric methods are sensible alternatives. The empirical likelihood methodology is a systematic nonparametric approach to statistical inference. It preserves some treasured properties of the likelihood approach while being nonparametric.

22.1 Definition of the empirical likelihood

Suppose we have a set of i.i.d. observations $X_1, X_2, \ldots, X_n$. We hope to make statistical inferences without placing restrictive assumptions on their common distribution $F$. Can we still make meaningful and effective inferences on $F$? The answer is positive because it is widely known that the empirical distribution $F_n(x)$ is a good estimate of $F$. This is an estimator based on no parametric assumptions. The empirical distribution will be seen to be a nonparametric maximum likelihood estimator of $F$, and it has many "optimal" properties.

Let $F(\{x_i\}) = P(X = x_i)$, where $x_i$ is the observed value of $X_i$, $i = 1, 2, \ldots, n$. When all the $x_i$'s are distinct, the likelihood function becomes
$$L_n(F) = \prod_{i=1}^n F(\{x_i\}).$$
Denote $p_i = F(\{x_i\})$. This likelihood can also be written as
$$L_n(F) = \prod_{i=1}^n p_i.$$
Clearly, we have $0 \le p_i \le 1$ and $\sum_{i=1}^n p_i \le 1$. It is often more convenient to work with the log empirical likelihood function
$$\ell_n(F) = \sum_{i=1}^n \log p_i.$$
If $F$ is a continuous distribution, we have $L_n(F) = 0$.
Because of this, the empirical likelihood appears insensible: in its eyes, no continuous distribution is likely at all. Yet we will find the empirical likelihood is not bogged down by this deficiency.

When there are ties in the data, that is, when some $x_i$ are equal, $L_n(F)$ given above in terms of the $p_i$ is not authentic. For instance, the requirement $\sum p_i \le 1$ is no longer valid. To justify the continued use of this $L_n(F)$ via the $p_i$ as a likelihood function, we may add a set of independent and very small continuous noises to the observed values. After this, $L_n(F)$ remains a valid likelihood function, but it is constructed on a slightly different data set and for a different $F$. We can then proceed to whatever analysis first, and then let the amount of noise go to zero. In most situations, the analysis conclusions on the original $F$ remain valid. Owen (2001) contains a more rigorous justification resolving the "philosophical issue" caused by tied observations. The justification here might be regarded as a lazy man's approach.

It is easy to see that the likelihood is maximized when $F(x) = F_n(x)$. Hence, the empirical distribution $F_n(x)$ based on an i.i.d. sample is also the nonparametric MLE. One may note that this conclusion does not depend on whether or not there are ties in the sample.

22.2 Profile likelihood for population mean and the Lagrange multiplier

The empirical likelihood may seem to have limited usage. The picture completely changes once we introduce the concept of profile likelihood.

Consider the inference problem related to the population mean when a set of i.i.d. observations from a distribution $F \in \mathcal F$ is available. Naturally, we now assume that $\mathcal F$ contains all distributions with finite first moment. Let $\theta = \int x\,dF(x) = E(X)$ under distribution $F$. The empirical likelihood is a function of $F$. There are many distributions whose expectation equals $\theta$. What should be the likelihood value of $\theta$?
We do not have a widely acceptable answer to this question. Let $\mathcal F_\theta$ be the set of all distributions whose expectations equal $\theta$. The original concept of the profile likelihood would give
$$L_n^{\mathrm{wrong}}(\theta) = \sup\{L_n(F): F \in \mathcal F_\theta\}.$$
This definition is found not to be useful, however: it can be shown that
$$L_n^{\mathrm{wrong}}(\theta) = (1/n)^n$$
for any $\theta$. That is, the above "profile likelihood" lacks the discriminative power to tell the true $\theta$-value from other values.

To avoid the above dilemma, we define the profile likelihood by first introducing a "distribution family"
$$\mathcal F_{n,\theta} = \Big\{F: F(x) = \sum_{i=1}^n p_i 1(x_i \le x),\ \sum_{i=1}^n p_i x_i = \theta\Big\}.$$
Note that this class of distributions is data dependent. When $n$ increases, this family expands, and in the limit it can approximate any distribution well in some sense.

We now define the profile likelihood function for the population mean $\theta$ as
$$L_n(\theta) = \sup\{L_n(F): F \in \mathcal F_{n,\theta}\}.$$
Note that we use $L_n(\cdot)$ both for the empirical likelihood and for the profile empirical likelihood. Mathematically, it is an abuse of notation, but such abuse does not seem to cause much confusion. The question of whose likelihood it stands for is answered by whether the input is a vector $\theta$ or a c.d.f. $F$. As usual, it is often more convenient to work with the logarithm of the likelihood function. We use $\ell_n(\cdot) = \log L_n(\cdot)$.

Does the profile likelihood function of $\theta$ work like a likelihood? To answer this question, we need to get some idea of the numerical problem that comes with the empirical likelihood. Suppose we have $n$ observed values or vectors $x_1, \ldots, x_n$. To compute the profile likelihood $\ell_n(\theta)$, the numerical problem is:
$$\text{maximize } \sum_{i=1}^n \log p_i \quad \text{subject to } 0 < p_i < 1,\ i = 1, \ldots, n;\quad \sum_{i=1}^n p_i = 1;\quad \sum_{i=1}^n p_i x_i = \theta.$$
The method of Lagrange multipliers is very effective in solving this constrained maximization problem. Suppose that $\theta$ is an interior point of the convex hull formed by the $n$ observed values.
Define
$$g(p, s, \lambda) = \sum_{i=1}^n \log p_i + s\Big(\sum_{i=1}^n p_i - 1\Big) - n\lambda^\tau\Big(\sum_{i=1}^n p_i x_i - \theta\Big)$$
where $p$ represents all the $p_i$, and $s$ and $\lambda$ are Lagrange multipliers. When the $x$'s are vectors, $\lambda$ is also a vector and the multiplication is interpreted as the dot product. The method of Lagrange multipliers requires us to find the stationary points of $g(p, s, \lambda)$ with respect to $p$, $s$ and $\lambda$. After some routine derivations, we find the stationary point is given by
$$p_i = \frac{1}{n(1 + \lambda^\tau\{x_i - \theta\})}$$
with $\lambda$ satisfying
$$\sum_{i=1}^n \frac{x_i - \theta}{1 + \lambda^\tau\{x_i - \theta\}} = 0. \tag{22.1}$$
In the univariate case, since all $0 < p_i < 1$, we find
$$\frac{1 - n^{-1}}{\theta - x_{(n)}} < \lambda < \frac{1 - n^{-1}}{\theta - x_{(1)}}$$
where $x_{(1)}$ and $x_{(n)}$ are the minimum and maximum observed values. In addition, the function on the left-hand side of (22.1) is monotone decreasing in $\lambda$; one may verify this claim by finding its derivative with respect to $\lambda$. Hence, the numerical value of $\lambda$ may be easily computed. In the vector case, the function in (22.1) is the derivative of a convex function. A revised Newton's method may be designed to ensure that the numerical solution is obtained.

Once the value of $\lambda$ is obtained numerically, we have
$$\ell_n(\theta) = -\sum_{i=1}^n \log\{1 + \lambda^\tau(x_i - \theta)\} - n\log n$$
and the corresponding $F$ has
$$p_i = \frac{1}{n(1 + \lambda^\tau\{x_i - \theta\})}.$$
This result paves the way for studying the asymptotic properties of the profile likelihood.

22.3 Large sample properties

As a preparation step, we first show that the true population mean $\theta_0$ is within the convex hull of the data with probability approaching 1 as $n \to \infty$. Mathematically, this means that
$$\inf\{\max\{a^\tau(x_i - \theta_0): i = 1, \ldots, n\}: a \text{ is a unit vector}\} > 0.$$
The reason is: viewed from $\theta_0$ in whichever direction, there should always be data located in that direction. Remember that a unit vector is a vector of length one; we use the Euclidean norm to define the length. A mathematical result presented in Owen (2001) is needed here.

Lemma 22.1.
Let $X$ be a $d$-dimensional random vector with mean 0 and finite variance–covariance matrix $V$ of full rank. We have
$$\inf_a \mathrm{pr}(a^\tau X > 0) > 0$$
where the infimum is taken over all unit $d$-dimensional vectors.

Proof: Since $V$ is positive definite, there cannot be a unit length vector $a_0$ such that $\mathrm{pr}(a_0^\tau X > 0) = 0$. We show that because of this, the conclusion of the lemma is true.

If the conclusion is not true, then there must be a sequence $a_m$ such that $\mathrm{pr}(a_m^\tau X > 0) \to 0$ as $m \to \infty$. Since the set of all unit $d$-dimensional vectors is compact, we must be able to find a subsequence of $a_m$ such that $a_m \to a_0$ for some $a_0$, also of unit length. Without loss of generality, assume $a_m \to a_0$ as $m \to \infty$. Clearly,
$$\lim_{m\to\infty} 1(a_m^\tau X > 0) = 1(a_0^\tau X > 0).$$
Hence, by Fatou's lemma in real analysis, we have
$$0 = \lim_{m\to\infty} \mathrm{pr}(a_m^\tau X > 0) \ge \mathrm{pr}(a_0^\tau X > 0).$$
This is impossible, as pointed out in the beginning.

The empirical measure approximates the true probability measure uniformly over half spaces of the form $\{x: a^\tau x > 0\}$; this claim can be found in advanced probability theory books. It implies that the solution exists with probability converging to 1. By Slutsky's theorem, the limiting distribution of a statistic is not affected by an event whose probability goes to zero. We now assume that the solution exists for all data sets observed. This is acceptable for deriving asymptotic results, though one should not use it for other purposes. At the least, you should be very cautious about invoking this "assumption".

The next lemma shows that
$$\max_{1\le i\le n} \|X_i\| = o_p(n^{1/2}).$$
This fact is helpful for determining the closeness of $\hat p_i$ to $1/n$ as $n \to \infty$.

Lemma 22.2. Let $Y_1, \ldots, Y_n$ be a set of i.i.d. positive random variables with $E[Y_1^2] < \infty$. Then $Y_{(n)} = \max_i Y_i = o(n^{1/2})$ almost surely.

Proof: There is a simple inequality for positive valued random variables:
$$\sum_{j=1}^\infty \mathrm{pr}(Y_1^2 > j) \le E\{Y_1^2\}.$$
Due to the i.i.d. assumption, it can also be written as
$$\sum_{j=1}^\infty \mathrm{pr}(Y_j^2 > j) \le E\{Y_1^2\} < \infty.$$
The finiteness is the lemma condition.
The inequality can then be easily refined to show that
$$\sum_{j=1}^\infty \mathrm{pr}(Y_j^2 > \epsilon j) < \infty$$
for any $\epsilon > 0$. By the Borel–Cantelli lemma, this implies that
$$\mathrm{pr}\big\{B_j = \{Y_j^2 > \epsilon j\}\ \text{i.o.}\big\} = 0.$$
That is, there exists an event $A$ such that $\mathrm{pr}(A) = 1$ and for each $\omega \in A$, $Y_n^2(\omega) > \epsilon n$ for only a finite number of $n$. Let $\omega \in A$: it implies there exists an $M$ such that $Y_n^2(\omega) \le \epsilon n$ when $n > M$. Let
$$N(\omega) = \epsilon^{-1}\max\{Y_j^2(\omega): j \le M\}$$
which is a large but finite value. For all $n \ge \max\{M, N\}$,
$$Y_{(n)}^2 \le \max\big[\max\{Y_j^2(\omega): j \le M\},\ \epsilon n\big] \le \max\{\epsilon N, \epsilon n\} = \epsilon n.$$
That is, $Y_{(n)}^2 \le \epsilon n$ almost surely for all $\epsilon > 0$, which is the conclusion of the lemma.

After two rather technical lemmas, we are ready to prove the following statistically meaningful conclusion. For simplicity, we use $\theta$ for the true population mean, rather than a special notation $\theta_0$.

Lemma 22.3. Under the conditions of Theorem 22.1, for the Lagrange multiplier corresponding to the true population mean $\theta$, we have $\lambda_n = O_p(n^{-1/2})$. Further, we have
$$\lambda_n = \Big[\sum_{i=1}^n (x_i - \theta)(x_i - \theta)^\tau\Big]^{-1}\sum_{i=1}^n (x_i - \theta) + o_p(n^{-1/2})$$
and $\max_i |\lambda^\tau(X_i - \theta)| = o_p(1)$.

Proof: We omit the subscript $n$ on $\lambda$ for simplicity. Let $\rho = \|\lambda\|$ and denote $\xi = \lambda/\rho$. For brevity, assume $\theta = 0$ so that the equation for $\lambda$ becomes
$$\sum_{i=1}^n \frac{x_i}{1 + \rho\xi^\tau x_i} = 0.$$
We have
$$0 = \sum_{i=1}^n (\xi^\tau x_i) - \rho\sum_{i=1}^n \frac{\{\xi^\tau x_i\}^2}{1 + \rho\xi^\tau x_i}.$$
This implies
$$\sum_{i=1}^n (\xi^\tau x_i) = \rho\sum_{i=1}^n \frac{\{\xi^\tau x_i\}^2}{1 + \rho\xi^\tau x_i} \ge 0.$$
Let $t_i = \xi^\tau x_i$ and $\delta_n = \max_i |t_i|$. It is known that $1 + \rho t_i > 0$ for all $i$, and therefore $1 + \rho\delta_n > 0$. Further, by the finiteness of the second moment of $x_i$ and Lemma 22.2, we know $\delta_n = o(n^{1/2})$. This order assessment leads to
$$\sum_{i=1}^n \xi^\tau x_i = \rho\sum_{i=1}^n \frac{\{\xi^\tau x_i\}^2}{1 + \rho\xi^\tau x_i} \ge \rho\,\frac{\sum_{i=1}^n \{\xi^\tau x_i\}^2}{1 + \rho\delta_n}.$$
Multiplying both sides by the positive constant $1 + \rho\delta_n$, and after some simple algebra, we get
$$\sum_{i=1}^n (\xi^\tau x_i) \ge n\rho\Big[n^{-1}\sum_{i=1}^n (\xi^\tau x_i)^2 - n^{-1}\delta_n\sum_{i=1}^n (\xi^\tau x_i)\Big].$$
By the law of large numbers,
$$n^{-1}\sum_{i=1}^n x_i x_i^\tau \to \mathrm{var}(X_1)$$
which is a positive definite matrix.
Hence, $n^{-1}\sum_{i=1}^n (\xi^\tau x_i)^2 \ge \sigma_1^2 > 0$ almost surely, with $\sigma_1^2$ being the smallest eigenvalue of the covariance matrix. At the same time, it is clear that
$$n^{-1}\delta_n\sum_{i=1}^n (\xi^\tau x_i) = o_p(1).$$
Consequently, we have shown
$$\rho \le \Big[\sum_{i=1}^n (\xi^\tau x_i)^2\Big]^{-1}\Big\{\sum_{i=1}^n \xi^\tau x_i\Big\}(1 + o_p(1)) = O_p(n^{-1/2}).$$
This conclusion implies $\max_i |\lambda^\tau X_i| = o_p(1)$. Substituting back into
$$\sum_{i=1}^n \frac{x_i}{1 + \lambda^\tau x_i} = 0,$$
we get the claimed expansion of $\lambda$. This concludes the proof.

These preparations help establish the useful statistical results in the next section.

22.4 Likelihood ratio function

Since $L_n(F_n) > L_n(F)$ for any $F \ne F_n$, it is useful to introduce the empirical likelihood ratio function
$$R_n(F) = L_n(F)/L_n(F_n) = \prod_{i=1}^n (np_i).$$
This function has the maximum value of 1. Similarly, for the population mean $\theta$, we define
$$R_n(\theta) = L_n(\theta)/L_n(F_n) = \prod_{i=1}^n (np_i)$$
with $np_i = \{1 + \lambda^\tau(x_i - \theta)\}^{-1}$ for the Lagrange multiplier given earlier.

In parametric inference we may base hypothesis tests and confidence regions on the size of the likelihood ratio function. When $R_n(\theta)$ is large, $\theta$ is a likely value of the true parameter. A confidence region is hence made of the $\theta$'s such that $R_n(\theta)$ is larger than a threshold value. As in parametric likelihood inference, we need to know the distribution of $R_n(\theta)$ to define a proper threshold value. This value is to be selected so that, at least asymptotically, the size of the test is the pre-specified $\alpha$, or the likelihood interval/region has coverage probability $1-\alpha$. Such a threshold value can be determined based on the following much celebrated result.

Theorem 22.1. Let $X_1, X_2, \ldots, X_n$ be a set of i.i.d. random vectors of dimension $d$ with common distribution $F_0$. Let $\theta_0 = E[X_1]$, and suppose $0 < \mathrm{var}(X_1) < \infty$. Then
$$-2\log[R_n(\theta_0)] \to \chi^2_d$$
in distribution as $n \to \infty$.
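As a univariate numerical illustration of Theorem 22.1, the sketch below evaluates $-2\log R_n(\theta)$ by solving for the Lagrange multiplier of Section 22.2 with bisection; the function name, data, and iteration count are my own choices for illustration:

```python
import numpy as np

def minus2logR(x, theta, iters=200):
    """-2 log R_n(theta) for a scalar mean: find lambda solving
    sum (x_i - theta)/(1 + lambda*(x_i - theta)) = 0 by bisection,
    then return 2 * sum log(1 + lambda*(x_i - theta)).  (Sketch.)"""
    y = np.asarray(x, dtype=float) - theta
    n = len(y)
    if y.min() * y.max() >= 0:
        raise ValueError("theta is outside the convex hull of the data")
    def g(lam):
        return np.sum(y / (1 + lam * y))
    # g is decreasing in lambda; the root has the same sign as ybar
    L, U = ((0.0, (1 / n - 1) / y.min()) if y.mean() > 0
            else ((1 / n - 1) / y.max(), 0.0))
    for _ in range(iters):
        lam = (L + U) / 2
        L, U = (L, lam) if g(lam) < 0 else (lam, U)
    lam = (L + U) / 2
    return 2 * np.sum(np.log1p(lam * y))
```

At $\theta = \bar x_n$ the statistic is 0, and $\{\theta: \texttt{minus2logR}(x, \theta) \le 3.841\}$ gives an approximate 95% confidence region, in line with the theorem.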
Because of the above Wilks-type result, an effective empirical likelihood based hypothesis test procedure is possible: reject $H_0: E(X) = \theta_0$ in favour of $H_1: E(X) \ne \theta_0$ when
$$T_n = -2\log[R_n(\theta_0)] \ge \chi^2_d(1-\alpha).$$
Note that this $d$ is the dimension of $X$.

I generally use $R_n$ for the LRT statistic, which is twice the difference of the log likelihood values maximized respectively under the full model $\hat\theta_1$ and under the null model $\hat\theta_0$. Namely, $R_n$ is generally used for $2\{\ell_n(\hat\theta_1) - \ell_n(\hat\theta_0)\}$. In the context of empirical likelihood, there is a compelling reason to use $R_n(\theta)$ as the straight ratio of two likelihood values. Hence, one has to be careful to avoid some potential confusion here.

Proof of Theorem: Because $\max_i \|\lambda^\tau X_i\| = o_p(1)$, let us focus on events where it is no more than 1/10 in absolute value. For $|t| \le 1/10$, it is simple to see that
$$\big|\log(1+t) - \{t - \tfrac{1}{2}t^2\}\big| \le |t|^3/2.$$
We have in fact allowed a big margin for the error.

Without loss of generality, $\theta_0 = 0$. With this convention, we have
$$-2\log R_n(\theta_0) = 2\sum_{i=1}^n \log\{1 + \lambda^\tau x_i\}
= 2\lambda^\tau\sum_{i=1}^n x_i - \lambda^\tau\Big\{\sum_{i=1}^n x_i x_i^\tau\Big\}\lambda + \epsilon_n
= \Big\{\sum_{i=1}^n x_i\Big\}^\tau\Big\{\sum_{i=1}^n x_i x_i^\tau\Big\}^{-1}\Big\{\sum_{i=1}^n x_i\Big\} + o_p(1) + \epsilon_n.$$
The leading term has a chi-square limiting distribution. We need only verify that $\epsilon_n = o_p(1)$. This is true because
$$|\epsilon_n| \le \sum_{i=1}^n |\lambda^\tau x_i|^3 \le \max_i |\lambda^\tau x_i|\sum_{i=1}^n |\lambda^\tau x_i|^2 = o_p(1).$$
This completes the proof.

This theorem can be used to construct confidence intervals for the population mean $\theta$, or to conduct hypothesis tests regarding the value of the population mean. For instance, an approximate level $1-\alpha$ confidence region for $\theta$ is given by
$$\{\theta: -2\log R_n(\theta) \le \chi^2_d(1-\alpha)\}.$$
It can be shown that the profile likelihood function $\ell_n(\theta)$ is concave. Hence, the above confidence region is always convex. On top of being derived from a nonparametric procedure, EL confidence regions are praised for having a data-shaped confidence region and for not demanding an estimated covariance matrix.
In general, as the above region is based on a first order asymptotic result, it has slightly lower than nominal $1-\alpha$ coverage probability. A higher-order correction can be made to achieve higher order precision: the actual coverage probability then differs from $1-\alpha$ by a quantity of order $n^{-2}$.

22.5 Numerical computation

The numerical computation appears problematic initially: we have to maximize a function of $n$ variables under various linear constraints. It turns out that once the value of the Lagrange multiplier $\lambda$ is known, the remaining computation is very simple. We illustrate the numerical computation in this section.

Consider the problem of computing the profile likelihood for the mean. The computation is particularly simple when $x$ is a scalar. In this case, we need to solve
$$g(\lambda) = \sum_{i=1}^n \frac{x_i - \theta}{1 + \lambda(x_i - \theta)} = 0$$
for a given set of data and value $\theta$. Our first step is to subtract $\theta$ from the $x_i$ and call the results $y_i$; namely, define $y_i = x_i - \theta$ whenever a $\theta$ value is selected. We then sort the $y_i$ into increasing order and obtain $y_{(1)}$ and $y_{(n)}$. If they have the same sign, there is no solution; the numerical search is mission impossible. Otherwise, the sign of $\lambda$ is the same as that of $\bar y_n$. If $\bar y_n > 0$, we search in the interval $[0, (n^{-1}-1)/y_{(1)})$; otherwise, we search in the interval $((n^{-1}-1)/y_{(n)}, 0]$. We also note that $g(\lambda)$ is a decreasing function.

Let us provide the following pseudo code for computing $\lambda$:
1. Compute $y_i = x_i - \theta$;
2. Sort the $y_i$ to get the $y_{(i)}$;
3. If $y_{(1)}y_{(n)} \ge 0$, stop and report "no solution". Otherwise, continue;
4. Compute $\bar y$. If $\bar y > 0$, set $L = 0$, $U = (n^{-1}-1)/y_{(1)}$; otherwise set $L = (n^{-1}-1)/y_{(n)}$, $U = 0$.
5. Set $\lambda = (L+U)/2$.
6. If $g(\lambda) < 0$, set $U = \lambda$; otherwise set $L = \lambda$.
7. If $U - L < \epsilon$, stop and report $\lambda = (U+L)/2$. Otherwise, go to Step 5.

This algorithm is guaranteed to terminate. The constant $\epsilon$ is the tolerance level set by the user or by default. Often, it is chosen to be $10^{-8}$ or so.
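The pseudo code can be transcribed directly into Python; a sketch, with the default tolerance mirroring the $10^{-8}$ suggested above:

```python
import numpy as np

def solve_lambda(x, theta, eps=1e-8):
    """Steps 1-7 of the pseudo code: bisection for the Lagrange
    multiplier lambda at a given theta; returns None when theta is
    outside the range of the data (no solution)."""
    y = np.sort(np.asarray(x, dtype=float) - theta)    # steps 1-2
    n = len(y)
    if y[0] * y[-1] >= 0:                              # step 3
        return None                                    # "no solution"
    if y.mean() > 0:                                   # step 4
        L, U = 0.0, (1 / n - 1) / y[0]
    else:
        L, U = (1 / n - 1) / y[-1], 0.0
    while U - L >= eps:                                # steps 5-7
        lam = (L + U) / 2
        if np.sum(y / (1 + lam * y)) < 0:              # g(lam) < 0, step 6
            U = lam
        else:
            L = lam
    return (U + L) / 2
```

Since $U - L$ halves at every pass, the loop always terminates, exactly as claimed for the pseudo code.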
In applications, we should take the scale of the $x_i$'s into consideration. If all of them are small in absolute value (after subtracting $\theta$), $\lambda$ will be larger and hence the above tolerance is fine. If the $x_i - \theta$ are of the order $10^8$, then the tolerance for $\lambda$ must be reduced substantially, say to $10^{-16}$. A sensible choice is to compute the sample standard error $s_n$ and set the tolerance level at $\epsilon/s_n$.

To find the upper and lower limits of the confidence interval for the mean, we first note that $\bar x_n$ is always included in the interval. The upper and lower limits cannot exceed the largest and smallest observed values. A simple method is to bisect the interval between $\bar x_n$ and $x_{(n)}$ iteratively until we find the location $\theta_U$ at which the profile likelihood ratio statistic equals the quantile of the chi-square distribution set according to the confidence level suggested by the user. The typical value is of course 3.841, for a one-dimensional problem at the 95% confidence level.

When the $X_i$'s are vector valued, Chen, Sitter and Wu (2002, Biometrika) showed that a revised Newton–Raphson method can be used for computing the profile likelihood ratio function for the mean. The algorithm is guaranteed to converge when the solution exists.

22.6 Empirical likelihood applied to estimating functions

In some applications, particularly in econometrics, the parameter of interest is defined through estimating functions. Namely, if $X$ is a sample from a population of interest, the parameter vector $\theta$ is the unique solution to
$$E\{g(X;\theta)\} = 0$$
for some vector valued and smooth function $g$. In this setting, the distribution of $X$ is left unspecified. Some restrictions will be needed to permit a meaningful discussion of large sample properties.

Let the dimension of $g$ be denoted $m$ and the dimension of $\theta$ be denoted $d$. When $m < d$, the solution to the equation $E\{g(X;\theta)\} = 0$ is likely not unique for a hypothetical distribution $F$ of $X$. In this case, $\theta$ is under-defined.
When $m = d$, the same equation usually has a unique solution; the parameter is then just-defined. When $m > d$, a solution to $E\{g(X;\theta)\} = 0$ exists only for special $F$; the model is then over-defined. If an i.i.d. sample from a distribution $F$ is available, the corresponding estimating equation
$$\sum_{i=1}^n g(x_i;\theta) = 0$$
may not have any solution in $\theta$.

Generalized Method of Moments. In mathematical textbooks, we often postulate a linear regression model in which the response variable $Y$ and the $p$-dimensional covariate $X$ are assumed to be linked through
$$Y = X^\tau\beta + \epsilon$$
in which $\beta$ is a non-random regression coefficient. The so-called error term $\epsilon$ is a random variable independent of $X$; it has mean zero and finite variance. The statistical problem is to make inference about $\beta$ based on an i.i.d. sample from this system. Typically, we estimate $\beta$ by least squares. Equivalently, we estimate $\beta$ by the solution to the normal equations:
$$\sum_{i=1}^n X_i(Y_i - X_i^\tau\beta) = 0.$$
This approach fits into the framework of the estimating function definition with
$$g(x, y;\beta) = x(y - x^\tau\beta).$$
The system contains $p$ equations and $p$ parameters, and is thus just-defined.

In econometrics, however, a linear model with dependent $X$ and $\epsilon$ is often more appropriate. One such example is the relationship between earning potential $Y$ and the number of years spent in education $X_1$, combined with other controlling factors $X_2$. A sensible model is
$$\log(Y) = \beta_0 + X_1\beta_1 + X_2^\tau\beta_2 + \epsilon.$$
It is argued that $X_1$ is probably related to unobserved factors such as the individual cost and benefit of schooling. These unobserved factors in turn are likely present in the error term (it is hard to identify them all to be included in $X_2$). Hence, $X_1$ and $\epsilon$ are not independent.

If one uses the least squares estimate for $\beta_1$ and $\beta_2$ in this situation, then the estimator $\hat\beta$ is biased and in fact is not consistent as the sample size $n$ goes to infinity.
To obtain a consistent estimator of β, one may look for instrument variable(s) Z such that, given Z, X and ε are independent. In this case, an unbiased estimating function (meaning one with zero expectation) is given by

g(x, y, z; β) = z(y − x1 β1 − x2^τ β2).

Apparently, when the dimension of Z is larger than p, the combined dimension of (x1, x2^τ), we have an over-defined system.

When a system is over-defined, the sample estimating equation ∑_{i=1}^n g(x_i; θ) = 0 generally has no solution in β. Hence it is not viable to use its solution as an estimate. This scenario leads to the generalized method of moments (GMM), extensively discussed in econometrics. Let S_n(θ) = ∑_{i=1}^n g(x_i; θ); it is a kind of score function in the context of likelihood-based methods. Let A_n be a well-specified positive definite matrix of appropriate size. The general idea of GMM is to estimate θ by the value θ̃ that minimizes

S_n^τ(θ) A_n S_n(θ).

The GMM approach leads to an inevitable question: how do we choose A_n? One choice is to obtain an initial estimate of θ, for instance by solving ∑_{i=1}^n g_d(x_i; θ) = 0, where g_d(·) consists of the first d entries of g(·). Let the solution be θ̂0 and let

A_n = n^{−1} ∑_{i=1}^n g(x_i; θ̂0) g^τ(x_i; θ̂0),

after which we obtain θ̃ by GMM. A second possibility is to iterate the previous choice: update A_n with θ̂1 = θ̃ to obtain a new θ̃, and continue until (hopefully) the sequence converges. A third choice is to use a θ-dependent weight matrix

A_n(θ) = n^{−1} ∑_{i=1}^n g(x_i; θ) g^τ(x_i; θ)

and estimate θ by the minimizer of S_n^τ(θ) A_n(θ) S_n(θ). The three approaches are asymptotically equivalent and "optimal" under some criterion.

Empirical likelihood. In comparison, for each given value of the d-dimensional θ, one may define a profile empirical likelihood function

L_n(θ) = sup{ ∏ p_i : ∑_{i=1}^n p_i g(x_i; θ) = 0 }.

We have omitted the requirements p_i > 0 and ∑ p_i = 1 from the display, but they are required.
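As a concrete illustration of the GMM recipe, here is a minimal sketch under assumed moment conditions that are not from the text: an Exponential(mean θ) model supplies two moments, E(X) = θ and E(X²) = 2θ², giving an over-defined system with m = 2, d = 1. With a fixed identity weight matrix A_n, the objective is minimized numerically by golden-section search.

```python
import math

def gmm_objective(theta, x, A):
    """S(theta)' A S(theta) with g(x; theta) = (x - theta, x^2 - 2 theta^2),
    the two (assumed) moment conditions of an Exponential(mean theta) model."""
    s1 = sum(xi - theta for xi in x)
    s2 = sum(xi * xi - 2.0 * theta * theta for xi in x)
    return A[0][0] * s1 * s1 + (A[0][1] + A[1][0]) * s1 * s2 + A[1][1] * s2 * s2

def gmm_estimate(x, A, lo, hi, iters=60):
    """Golden-section minimization of the GMM objective over [lo, hi];
    assumes the objective is unimodal on the bracket."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if gmm_objective(c, x, A) < gmm_objective(d, x, A):
            b = d
        else:
            a = c
    return (a + b) / 2.0
```

The minimizer compromises between the two sample moment equations: with the illustrative data below, the first moment alone points at 1.5 and the second at about 1.17, and the GMM estimate lands between them.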
For each given θ, the computation of L_n(θ) in the current case is no different from the case where θ is the population mean. We may also notice that the dimensions m and d do not matter in the theoretical development. The optimal solution to the maximization problem is given by

p_i = 1 / ( n[1 + λ^τ g(x_i; θ)] )

with the Lagrange multiplier λ being the solution to

∑_{i=1}^n g(x_i; θ) / ( n[1 + λ^τ g(x_i; θ)] ) = 0.

The profile empirical likelihood defined here works almost the same way as the parametric likelihood. Asymptotically, as long as the model is valid in the sense that there exists a value θ* such that E{g(X; θ*)} = 0 and E‖g(X; θ*)‖² < ∞, then

∑_{i=1}^n p_i g(x_i; θ*) = 0

has a solution in the p_i with probability approaching 1 as n → ∞. That is, ℓ_n(θ) = log L_n(θ) is at least well defined at θ = θ*.

Theorem 22.2. Let x1, ..., xn be a set of i.i.d. observations from some distribution F satisfying E{g(X; θ)} = 0 for some θ, and assume that g(x; θ) and F jointly satisfy suitable regularity conditions. Let θ̂ be the maximum empirical likelihood estimator and θ* the true value of the parameter. Then, as n → ∞,

2{ℓ_n(θ̂) − ℓ_n(θ*)} → χ²_d and 2{−n log n − ℓ_n(θ̂)} → χ²_{m−d},

where m is the dimension of g and d is the dimension of θ.

This theorem provides a simple way to construct a likelihood interval/region for θ. In ℓ_n(θ*) we allow only a single value of θ; this effectively reduces the dimension of θ under consideration to 0. In ℓ_n(θ̂) we allow any value of θ in the parameter space, which has dimension d. The difference between the two dimensions is d. Hence the degrees of freedom of the limiting distribution is d, the same as if we worked with a parametric likelihood. For the second conclusion, the degrees of freedom can be interpreted in the same fashion, though with a twist. In −n log n, we permit any F in the likelihood computation; this means no constraints from g are involved.
In ℓ_n(θ̂), we have introduced constraints in the form of m estimating equations. At the same time, these equations contain d parameters as free variables. Hence the effective number of constraints is reduced to m − d, and the degrees of freedom of the limiting distribution is m − d, the number of restrictions applied.

The first limiting-distribution result concerns the difference in log likelihood at two parameter values; its size therefore judges the fit of a specific parameter value. The second concerns the difference between placing a set of constraints and placing none; its size therefore judges the fit of these constraints.

The maximum empirical likelihood estimator is, in general, asymptotically normally distributed. Among estimators of a certain type, it is also known to be "optimal": it has the lowest asymptotic variance in a certain class of estimators.

A much liked advantage of the EL method, compared with GMM, is that one does not need to estimate the variance of θ̂ in order to construct confidence intervals or regions for θ. Another valued advantage of EL is that it is "Bartlett correctable": there exists a non-random constant b_n such that the distribution of 2 b_n {ℓ_n(θ̂) − ℓ_n(θ*)} is approximated by a chi-square with very high precision, the difference decreasing to 0 quickly as the sample size n → ∞. My experience suggests this is more a nice piece of theory than of practical value.

22.7 Adjusted empirical likelihood

One problem with EL in the estimating-function setting is that the solution to the maximization problem may not exist. That is, given a θ value,

∑_{i=1}^n p_i g(x_i; θ) = 0

may not have a solution in the p_i with p_i > 0 and ∑ p_i = 1. This can happen for any θ value. When it happens, the statistical literature generally refers to it as the "empty set" problem.
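The computation of ℓ_n(θ) and λ described above can be sketched in the simplest one-dimensional case, the population mean, where g(x; θ) = x − θ. This is a minimal illustrative sketch, not production code: Newton's method solves the Lagrange multiplier equation, and the profile log likelihood ratio 2∑ log(1 + λ z_i) is returned.

```python
import math

def el_lambda(z, tol=1e-10, max_iter=100):
    """Solve sum z_i / (1 + lam * z_i) = 0 for the Lagrange multiplier
    by Newton's method; the z_i = x_i - theta must straddle zero."""
    lam = 0.0
    for _ in range(max_iter):
        f = sum(zi / (1.0 + lam * zi) for zi in z)
        fp = -sum(zi * zi / (1.0 + lam * zi) ** 2 for zi in z)
        step = f / fp
        lam -= step
        if abs(step) < tol:
            break
    return lam

def el_log_ratio(x, theta):
    """-2 log empirical likelihood ratio for the mean at theta,
    i.e. 2 * sum log(1 + lam * (x_i - theta))."""
    z = [xi - theta for xi in x]
    lam = el_lambda(z)
    return 2.0 * sum(math.log(1.0 + lam * zi) for zi in z)
```

At θ = x̄_n the multiplier is λ = 0 and the ratio is 0; moving θ away from x̄_n increases the ratio, which is what the bisection search for confidence limits exploits.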
The Lagrange multiplier λ is well defined only if 0 lies in the convex hull of {g(x_i; θ): i = 1, ..., n}. Thus, for each given θ value, one must first make sure that L_n(θ) is actually defined; searching for its maximum point θ̂ can only be done in a second step. If the set of θ on which L_n(θ) is well defined is empty, the rest of the inference strategy falls apart. In theory, if the model is correct and g has finite second moment, then L_n(θ*) is well defined with probability approaching 1 as n → ∞, where θ* is the true value. In applications, there is no guarantee that we can locate a θ value at which L_n(θ) is well defined. In fact, it can be an issue merely to determine whether or not it is.

A few remedies have been proposed in the literature. One of them is given in Chen, Variyath and Abraham (2008). Let us define

g(x_{n+1}; θ) = −a_n ḡ_n,

where ḡ_n = n^{−1} ∑_{i=1}^n g(x_i; θ), for any θ, with a positive constant a_n. In this definition, we do not look for an x_{n+1} value at which the above relationship holds; we only need a g(x_{n+1}; θ) value. Next, we define the profile empirical likelihood as

L_N(θ) = sup{ ∏_{i=1}^N p_i : ∑_{i=1}^N p_i g(x_i; θ) = 0 }

with N = n + 1. Namely, we have added a pseudo-observation g(x_{n+1}; θ) to the usual definition of the original empirical likelihood. Note that the restrictions p_i > 0 and ∑ p_i = 1 are satisfied by p_i = a_n/c for i = 1, 2, ..., n and p_{n+1} = n/c with c = n a_n + n for the expanded data set g1, ..., gN. Hence L_N(θ) is well defined for every value of θ. Under mild conditions, the first-order asymptotic properties of L_n(θ) remain valid for L_N(θ). This so-called adjusted empirical likelihood has attracted a lot of attention; read the related papers yourself if you are interested.

22.8 Assignment problems

1. Let x1, ..., xn be a set of i.i.d. observations from a nonparametric distribution family F with finite first moment. Let θ = ∫ x dF(x) be the mean of the distribution F.
Define the profile (non-empirical) likelihood of θ to be

ℓ_n(θ) = sup{ ∑_{i=1}^n log F({x_i}) : ∫ x dF(x) = θ, F ∈ F }.

Show that for any θ value, ℓ_n(θ) = −n log n when all x_i values are distinct.

2. The authentic empirical likelihood is defined differently from Q1. Let x1, ..., xn be a set of i.i.d. observations from a nonparametric distribution family F with finite first moment. Let

F_n = { F : F(x) = ∑_{i=1}^n p_i 1(x_i ≤ x) }.

The profile empirical likelihood of θ is defined to be

ℓ_n(θ) = sup{ ∑_{i=1}^n log F({x_i}) : ∫ x dF(x) = θ, F ∈ F_n }.

Show that ℓ_n(θ) is a concave function of θ.

3. The following are 10 i.i.d. observations of a random vector of dimension 2 (each column is one vector observation):

20.5 28.1 27.8 27.0 28.0 25.2 25.3 27.1 20.5 31.3
26.3 24.0 26.2 20.2 23.7 34.0 17.1 26.8 23.7 24.9

(a) Use some R functions to draw the asymptotic empirical likelihood 90% confidence region for the mean.

(b) Do the same based on the parametric likelihood, assuming the observations are bivariate normally distributed.

Remark: show your code and the source of any R functions you located. Give sufficient interpretation.

4. A bivariate gamma distributed random vector can be obtained as follows. Generate U1, U2 i.i.d. from the beta distribution with density function

{Γ(a + b) / (Γ(a)Γ(b))} u^{a−1} (1 − u)^{b−1} 1(0 < u < 1).

Generate W from a gamma distribution with density function

{β^{a+b} w^{a+b−1} exp(−βw) / Γ(a + b)} 1(0 < w).

Let Y^τ = W × (U1, U2). The distribution of Y is then the bivariate gamma BG(a, b, β) with correlation ρ = a/(a + b). The marginal distribution of Y1 = U1 W is gamma with shape parameter a and rate parameter β.

(a) Verify that the marginal distribution of Y is as claimed.

(b) Let the sample size be n = 100 and repeat the simulation N = 20000 times. Put a = 3, b = 5 and β = 0.5. The population mean is hence (6, 6).
Put the size of the test at α = 0.08. Set your seed value to 2018. Write R simulation code for the EL test to obtain (i) the null rejection rate for H0: μ = (6, 6), and (ii) a QQ plot of the EL test statistic against the theoretical χ²₂ distribution.

5. Suppose {0, 2, 3, 3, 12} is an i.i.d. sample from some distribution F. Let θ be the population mean.

(a) Write down the analytical expression of the profile log likelihood function given this data set, leaving the value of λ unspecified.

(b) Compute numerically the value of ℓ_n(4) based on the illustration data: the profile log likelihood at θ = 4.

(c) Compute numerically the value of ℓ_n(3) based on the illustration data: the profile log likelihood at θ = 3.

(d) Plot the profile log likelihood function over the range θ ∈ (2.5, 5.5). (Not required this year; do it if you are interested.)

6. Let η be the second moment of F.

(a) Give the corresponding analytical form of the profile log empirical likelihood function of η with the same data as in the last question.

(b) Over what range of η is the profile empirical likelihood function well defined?

Chapter 23

Resampling methods

23.1 Problems addressed by resampling

We have routinely started a statistical inference problem with "let x1, ..., xn be an i.i.d. sample from F". When a parametric distribution family f(x; θ) for F is proposed, the focus is to get θ well estimated. Often, the focus is instead to investigate certain aspects of F. For this purpose, we may define a function in the generic form θ = T(F). The formulation T(F) is applicable whether or not F is a member of a parametric family. In both cases, a natural estimator of θ is θ̂_n = T(F_n), where F_n(x) is the empirical distribution based on the i.i.d. sample. When θ is the population mean, the estimator is the sample mean. A point estimate of θ is generally only a starting point of statistical inference.
An immediate question might be: what is the (sampling) distribution of θ̂_n? If F is a member of the Poisson family and T(F) is the population mean, then θ̂ = X̄_n, whose distribution is a scaled Poisson. If F is a member of the normal family, then θ̂ = X̄_n has a normal distribution with some mean and variance. If T(F) is more complex than the sample mean and F is a member of a generic distribution family, the answer to "what is the distribution of θ̂_n" is much more involved. Typically, if n is very large, θ̂_n = T(F_n) is asymptotically normal, which partly answers the question. Yet even so, we are burdened with analytically obtaining the mean and variance of the asymptotic distribution.

Bootstrap and other resampling methods provide alternative ways to answer such questions. They are labor intensive in terms of computation, but simple in terms of mathematical derivation. Because of these properties, this line of approach is admired by many applied statisticians. At the same time, resampling methods can be abused by those who know very little about their limitations. This chapter aims to inform you about the idea behind the resampling methods and about when they work as intended. In no circumstances should we blindly trust a cure-all magic.

23.2 Resampling procedures

Let x1, x2, ..., xn be a set of i.i.d. observations. We have available the empirical distribution function F_n(x), which is a good estimate of their common distribution. Note that F_n(x) is the uniform distribution on these n observed values. If the observations have ties, this interpretation remains useful and harmless. Let X* denote a random variable with distribution F_n. That is,

pr(X* = x_i) = 1/n for i = 1, 2, ..., n.

In addition, let X*₁, ..., X*_n be i.i.d. random variables with the same distribution as X*. Let F*_n(x) be the empirical distribution based on the observed values of X*₁, ..., X*_n.
We regard F*_n and its related entities as the mirror image of F_n in the bootstrap world; for this reason, F_n is regarded as a real-world object. For parameters of interest of the form θ = T(F), we may estimate their values by θ̂ = T(F_n). In the bootstrap world, the image of θ is θ* = T(F_n) and the image of θ̂ is θ̂* = T(F*_n). The distribution of T(F_n) has, as its bootstrap-world image, the distribution of T(F*_n) conditional on F_n. When F ≈ F_n, as is the case when n is large, we anticipate that the distribution of T(F_n) is well approximated by the conditional distribution of T(F*_n) given the data, namely F_n. We should note that such claims are meaningful when T(F) is smooth in F in some sense; the approximation does not always work, but it often works well.

Along this line of thinking, the distributions of the sample mean X̄_n and sample variance s²_n should be approximately the same as the conditional distributions of

X̄*_n = n^{−1} ∑_{i=1}^n X*_i and s*²_n = (n − 1)^{−1} ∑_{i=1}^n (X*_i − X̄*_n)².

In fact, these claims extend to their joint distribution and to functions of them.

Instead of working hard mathematically on the distributions of the sample mean, variance and so on, we intend to approximate them by those of their bootstrap images. At this point, we cannot help questioning the practicality of this suggestion: if deriving the distribution of X̄_n is hard, deriving the distribution of X̄*_n is likely harder. This is true. However, if we can generate a million independent and identically distributed copies of X̄*_n, its distribution can then be determined numerically to high accuracy. If this idea works, we will have successfully offloaded our technical burden to computers. We will be able to work on many problems without placing restrictive normality or other parametric assumptions on the population. Remember: placing assumptions on the population does not make the population satisfy these assumptions.
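The "generate many copies of X̄*_n" step just described can be sketched as follows. This is a minimal illustrative sketch with hypothetical data; it draws B bootstrap samples from F_n and collects the replicated statistics.

```python
import random
import statistics

def bootstrap_draws(x, stat, B=2000, seed=1):
    """Draw B bootstrap replicates of stat(x*), where x* is an i.i.d.
    sample of size n from the empirical distribution F_n."""
    rng = random.Random(seed)
    n = len(x)
    return [stat([x[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]

# hypothetical data set
x = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8, 3.9, 4.6, 2.2, 3.3]
reps = bootstrap_draws(x, statistics.mean)
# numerical approximation to the (conditional) variance of Xbar*_n,
# which in turn approximates var(Xbar_n) = sigma^2 / n
boot_var = statistics.variance(reps)
```

The same `bootstrap_draws` call works for the median or any other functional of F_n by swapping the `stat` argument, which is exactly the appeal of the method.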
Bootstrapping and other resampling procedures are generally portrayed as nonparametric methods. They are used for many purposes far beyond approximating the distributions of X̄_n and s²_n. For example, the bootstrap method can be used to approximate the distribution of the sample median under very general conditions. Such universality makes it a popular choice. Furthermore, when the data do not have an i.i.d. structure, a carefully designed scheme may be used to create a faithful resampling mirror image in the bootstrap world. Hence, the resampling method is not restricted to data with an i.i.d. structure.

In many applications, a parametric model f(x; θ) is itself an acceptable assumption. Suppose θ̂ is the maximum likelihood estimator of θ. What is the distribution of θ̂? The large sample answer can be mathematically difficult. In this case, one may study the distribution of θ̂ when the data are a random sample from f(x; θ̂). To implement this idea by resampling, one may generate samples from f(x; θ̂) and obtain a large number of θ̂*, the MLEs based on the generated data sets. The empirical distribution of the θ̂* can be an accurate approximation to the distribution of θ̂. When the resampled data are drawn from a parametric distribution, the bootstrap method is called the parametric bootstrap.

23.3 Bias correction

Let θ = T(F) be a parameter. Since the empirical distribution F_n(x) is a good estimator of F, we have proposed to use θ̂ = T(F_n) to estimate θ. At the same time, the bias of T(F_n) is itself a functional of F. Thus, the bootstrap can be used to estimate the bias of T(F_n) and subsequently reduce it. Although E{F_n(x)} = F(x) for all x, it is not necessarily true that E{T(F_n)} = T(F). If not, how large is the bias of θ̂ = T(F_n)? Let us denote the bias by

ξ = E{T(F_n)} − T(F).

Let X*₁, ..., X*_n be i.i.d. observations from the empirical distribution F_n.
Let F*_n(x) be the corresponding empirical distribution. Then θ̂* = T(F*_n) is a bootstrap estimator of θ̂. Its (conditional) bias is given by

ξ̂* = E*{T(F*_n)} − T(F_n),

where E* denotes expectation conditional on X1, ..., Xn. If ξ̂* cannot be evaluated theoretically, we can evaluate it by simulation. Is ξ̂* a good estimator of ξ?

Example 23.1. (a) Assume θ = ∫ x dF(x) and E|X| < ∞. Consequently, the bias ξ = 0. At the same time,

E*{T(F*_n)} = E*{ n^{−1} ∑_{i=1}^n X*_i } = n^{−1} ∑_{i=1}^n X_i = T(F_n).

Thus, we also have ξ̂* = 0. This result shows that ξ̂* works fine as an estimator of ξ. Of course, in this example, the exercise does not lead to anything useful.

(b) Let us consider estimating the parameter θ = T(F) = E²(X) = {∫ x dF(x)}². Assume that σ² = var(X1) < ∞. We have T(F_n) = X̄²_n, and its bias is ξ = n^{−1}σ². The conditional expectation of T(F*_n) given F_n is

E*{T(F*_n)} = E*{ n^{−1} ∑_{i=1}^n X*_i }² = {E* X*₁}² + n^{−1} var*(X*₁) = T(F_n) + (n − 1) s²_n / n².

Thus, if we estimate ξ by ξ̂* = E*{T(F*_n)} − T(F_n), we have ξ̂* = (n − 1) s²_n / n². This is a very reasonable estimator of ξ, though we certainly do not need to go through a bootstrap resampling procedure to find it.

23.4 Variance estimation

Consider the problem of assessing the variance of T(F_n). The bootstrap method estimates the variance of T(F_n) by the conditional variance of T(F*_n), where F*_n is the empirical distribution based on an i.i.d. sample from the distribution F_n.

Example 23.2. (a) Let the parameter of interest again be θ = T(F) = ∫ x dF. It is seen that θ̂ = T(F_n) = X̄_n. Let us work as if we had no good idea about its variance, and use the resampling method to estimate it. Take an i.i.d. sample from the empirical distribution F_n, and let X̄*_n be the resulting sample mean. We now use the conditional variance of X̄*_n to estimate the variance of X̄_n.
We can easily calculate the conditional variance:

var*(X̄*_n) = n^{−1} var*(X*₁) = n^{−2} ∑_{i=1}^n (X_i − X̄_n)².

Recall that the true variance of X̄_n is n^{−1}σ² with σ² = var(X1). The bootstrap variance estimate is n^{−1} s²_n + O(n^{−2}). Clearly, we have

var*(X̄*_n) / var(X̄_n) → 1

almost surely as n → ∞. This result shows that the bootstrap variance estimator is well justified.

It is important to realize that var(X̄_n) → 0 as n → ∞. Hence, even if a variance estimator v̂ar(X̄_n) satisfies v̂ar(X̄_n) − var(X̄_n) → 0 almost surely, this property alone does not make it a good estimator.

(b) Let the parameter of interest be θ = {∫ x dF}². Its natural estimator is θ̂ = X̄²_n. How large is the variance of θ̂? Assume that X has a finite 4th moment. Then

E(X̄⁴_n) = E{(X̄_n − μ) + μ}⁴
= E{(X̄_n − μ)⁴ + 4(X̄_n − μ)³μ + 6(X̄_n − μ)²μ² + 4(X̄_n − μ)μ³ + μ⁴}
= μ⁴ + 6μ²σ²/n + O(n^{−2}).

We also have

E²(X̄²_n) = (μ² + σ²/n)² = μ⁴ + 2μ²σ²/n + O(n^{−2}).

Therefore,

var(X̄²_n) = E(X̄⁴_n) − E²(X̄²_n) = 4μ²σ²/n + O(n^{−2}).

For the bootstrap method, it is easy to get

var*({X̄*_n}²) = 4X̄²_n E*{X̄*_n − X̄_n}² + E*{X̄*_n − X̄_n}⁴ − {E*[X̄*_n − X̄_n]²}² + 4X̄_n E*{X̄*_n − X̄_n}³.

The last three terms are of order O_p(n^{−2}). The first is of order O_p(n^{−1}) when the true mean is not zero. Thus, the leading term of this bootstrap variance estimator is (4X̄²_n/n²) ∑_{i=1}^n (X_i − X̄_n)². This matches the approximate variance of X̄²_n, which equals 4μ²σ²/n.

In both examples, we analytically obtained the properties of the bootstrap method for bias and variance estimation of estimators of the form T(F_n) for a parameter T(F). Analytical derivation is not always feasible. For instance, if θ is the location parameter of a Cauchy distribution, we will not be able to find var*(T(F*_n)) by theoretical computation. Instead, computer simulation is likely the only option, which can be carried out as follows. First, draw an i.i.d.
sample of size n, x*₁, ..., x*_n, from F_n using some computer package, and repeat this a large number B of times. Compute, based on the b-th sample, θ̂*_b = T(F*_n), where F*_n(x) = n^{−1} ∑_{i=1}^n 1(x*_i ≤ x) is the empirical distribution of the bootstrap sample. Next, define the simulated var*(T(F*_n)) value to be

v²* = (B − 1)^{−1} ∑_{b=1}^B {θ̂*_b − θ̄*}²,

where θ̄* = B^{−1} ∑_{b=1}^B θ̂*_b. If θ is a vector, we modify the above formula to give a variance-covariance matrix. Under some conditions, v²* is a consistent estimator of var(T(F_n)). Yet we must be more cautious about the meaning of consistency here: v²*/var(T(F_n)) → 1 in some mode of convergence.

One may instead define the bootstrap variance estimator to be

ṽ²* = (B − 1)^{−1} ∑_{b=1}^B {θ̂*_b − θ̂}².

Since the difference between θ̂ and θ̄* is likely very small in asymptotic terms, both versions are well justified. Neither can be judged "wrong", as many would like to ask. In addition, simulation studies will likely find situations where v²* is more accurate and other situations where ṽ²* is superior. In summary, being a statistician does not make you an authority to decide between these estimators. We do notice that ṽ²* resembles a mean squared error and therefore takes a larger value. If one prefers a more conservative statistical procedure, using ṽ²* is a good choice.

23.5 The cumulative distribution function

Consider the problem of approximating the distribution of T(F_n) by that of T(F*_n). The idea here is the same as for variance estimation: we hope that the conditional distribution of T(F*_n) is a good approximation to the distribution of T(F_n). Consider the simplest situation where the parameter to be estimated is θ = T(F) = ∫ x dF. The estimator of θ is the sample mean X̄_n, and we aim to estimate the cumulative distribution function of X̄_n. Under the assumption of a finite second moment, √n(X̄_n − θ)/σ is asymptotically normal. This fact pretty much tells us not to bother estimating its distribution.
Nevertheless, if we insist on using the bootstrap to estimate the distribution of X̄_n, we should have, as n → ∞,

pr( √n(X̄*_n − X̄_n)/s_n ≤ x | F_n ) → Φ(x)

almost surely for every x, where s²_n is the sample variance and Φ(x) is the c.d.f. of the standard normal distribution. Note that this is a limit in which both the event under investigation and the conditioning change as n increases. As n increases, we may use the central limit theorem for triangular arrays to obtain the above result. To prove the asymptotic normality, the Berry-Esseen bound is the simplest tool, though it requires comparatively strong conditions.

Theorem 23.1. Let X1, ..., Xn be an i.i.d. sample from a distribution F with finite mean θ, finite variance σ² and finite E|X|³. Then we have

sup_x | pr( √n(X̄_n − θ)/σ ≤ x ) − Φ(x) | ≤ (33/4) E|X − θ|³ / ( √n [E(X − θ)²]^{3/2} ).

Note that this conclusion holds for all n. In other words, it is not an asymptotic result but a universal one. At the same time, it shows that the precision of the normal approximation improves with the sample size at rate n^{−1/2}. Applying this bound to the bootstrap sample, we find

sup_x | pr( √n(X̄*_n − X̄_n)/s_n ≤ x | F_n ) − Φ(x) | ≤ (33/4) E*|X*₁ − X̄_n|³ / ( √n [E*|X*₁ − X̄_n|²]^{3/2} ).

Again, this result holds for any F_n and n. In view of this inequality, the (conditional) asymptotic normality is valid when

E*|X*₁ − X̄_n|³ / [E*|X*₁ − X̄_n|²]^{3/2} = o(n^{1/2})

almost surely, or in any appropriate mode. Suppose the model satisfies E|X1|³ < ∞ and, without loss of generality, θ = 0 and σ² = 1. In this case, we have

E*|X*₁ − X̄_n|³ = n^{−1} ∑_{i=1}^n |X_i − X̄_n|³ → E|X1|³ almost surely

and

E*|X*₁ − X̄_n|² = n^{−1} ∑_{i=1}^n |X_i − X̄_n|² → σ² = 1 almost surely.

Thus, it is trivial to find that

E*|X*₁ − X̄_n|³ / [E*|X*₁ − X̄_n|²]^{3/2} → E|X1|³ < ∞.

Hence, when E|X1|³ < ∞, the (conditional) asymptotic normality is proved. This simple proof benefits from an unnecessarily strong assumption on the finiteness of the third moment. A generalization can easily be made.
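The ratio appearing in the condition above is a simple function of the data, so it can be computed directly. A minimal sketch (the function name is ours, not standard):

```python
def bootstrap_skew_ratio(x):
    """Compute E*|X* - xbar|^3 / [E*|X* - xbar|^2]^{3/2}, the quantity
    that must be o(n^{1/2}) for the conditional CLT to hold.  Both
    conditional moments are averages over the empirical distribution."""
    n = len(x)
    xbar = sum(x) / n
    m2 = sum(abs(xi - xbar) ** 2 for xi in x) / n   # E*|X* - xbar|^2
    m3 = sum(abs(xi - xbar) ** 3 for xi in x) / n   # E*|X* - xbar|^3
    return m3 / m2 ** 1.5
```

The ratio is scale invariant; for a symmetric two-point sample it equals exactly 1, and under a finite third moment it converges to a constant rather than growing, which is far below the o(n^{1/2}) requirement.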
If g(X̄_n) is a smooth function of X̄_n, then g(X̄_n) is asymptotically normal. By the same logic, g(X̄*_n) is also asymptotically normal conditional on F_n. Thus, the conditional distribution of g(X̄*_n) still matches that of g(X̄_n).

Although the above example is very supportive of the usefulness of the bootstrap method, it is not without limitations. For the sample mean, asymptotic normality can be established easily, and the calculation of the limiting distribution is also very simple. Why should we bootstrap in such simple situations? In situations where the asymptotics become complex, do we have a good theory to support the bootstrap? One crucial justification of the bootstrap method comes from Singh (1981). There are many results contained in that paper; here I pick out a relatively simple case.

Theorem 23.2 (Singh, 1981). Assume X1, ..., Xn is an i.i.d. sample from F with E X1 = θ, σ² = var(X1) > 0 and E|X1|³ < ∞. Let X̄_n be the sample mean and s²_n the sample variance. In addition, let X̄*_n be the bootstrap sample mean. Then

sup_x | pr( √n(X̄_n − θ)/σ ≤ x ) − pr( √n(X̄*_n − X̄_n)/s_n ≤ x | F_n ) | = O(n^{−1/2})

almost surely. If F is a continuous distribution, then

sup_x | pr( √n(X̄_n − θ)/σ ≤ x ) − pr( √n(X̄*_n − X̄_n)/s_n ≤ x | F_n ) | = o(n^{−1/2}).

The first result is implied by the Berry-Esseen bound. We do not prove the second result here. The second result shows that the bootstrap approximation has better precision than the normal approximation. This is surprisingly good news.

The bootstrap sampling procedure for approximating the c.d.f. of a functional T(F_n) is very simple. First, we draw an i.i.d. sample of size n, X*₁, ..., X*_n, from F_n using some computer software package, and we repeat this step a sufficiently large number B of times. Next we compute, based on the b-th sample, θ̂*_b = T(F*_n), where F*_n(x) = n^{−1} ∑_{i=1}^n 1(x*_i ≤ x).
The last step is to define the estimated cumulative distribution function as

Ĥ_n(t) = B^{−1} ∑_{b=1}^B 1(θ̂*_b ≤ t).

Needless to say, under some conditions, Ĥ_n(t) is consistent for H(t) = pr(θ̂ ≤ t). We should be aware that the bootstrap approximation is meaningful only if the limiting distribution of T(F_n) does not degenerate. The result of Singh, for instance, is effective for

T(F_n) = √n(X̄_n − θ)/σ.

A similar result for T(F_n) = (X̄_n − θ)/σ would be meaningless. One should be aware that both of these T(F_n) are not statistics, but functions of both the data and the parameter.

23.6 Recipes for confidence limits

We still discuss the inference problem under the assumption that an i.i.d. sample of size n from distribution F is given. The parameter of interest is some θ = T(F). It is estimated by θ̂ = T(F_n), where F_n is the empirical distribution function. We will also use some variance estimator based on F_n and denote it σ̂. It is not necessarily the variance of the distribution F, but some quantity used for constructing pivotal quantities.

Percentile method. Consider the case where θ̂ − θ is more or less a pivotal quantity. Suppose that its distribution is given by H(x), namely pr(θ̂ − θ ≤ x) = H(x). Let H^{−1}(α) be the α-th quantile of H(x). Then

pr( θ̂ − θ ≥ H^{−1}(α) ) = 1 − α

when H(·) is a strictly increasing function. This implies that an upper 1 − α confidence limit for θ is given by θ̂ − H^{−1}(α). A two-sided confidence interval can be formed by combining one upper and one lower confidence limit. Let Ĥ(x) be an estimator of H(x). Define

θ_BP = θ̂ − inf{ t : Ĥ_n(t) ≥ α } = θ̂ − Ĥ_n^{−1}(α).

This is an approximate upper confidence bound for θ because

pr(θ < θ_BP) = pr( θ̂ − θ ≥ Ĥ_n^{−1}(α) ) ≈ pr( θ̂ − θ ≥ H^{−1}(α) ) = 1 − α.

Computing confidence limits by the above approach is generally called the percentile method.
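The percentile limit above can be sketched in a few lines once Ĥ_n is replaced by the sorted bootstrap values of θ̂* − θ̂. This is an illustrative sketch with hypothetical data; the crude `int(alpha * B)` index stands in for a proper empirical quantile.

```python
import random
import statistics

def percentile_limit(x, stat, alpha=0.05, B=2000, seed=11):
    """Approximate upper confidence bound theta_hat - H_hat^{-1}(alpha),
    where H is estimated by the bootstrap distribution of
    theta_hat* - theta_hat (the percentile method)."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = stat(x)
    diffs = sorted(stat([x[rng.randrange(n)] for _ in range(n)]) - theta_hat
                   for _ in range(B))
    h_inv_alpha = diffs[int(alpha * B)]  # crude empirical alpha-quantile of H
    return theta_hat - h_inv_alpha

# hypothetical data set
x = [4.1, 2.3, 5.6, 3.3, 4.8, 2.9, 5.1, 3.7, 4.4, 3.0]
upper = percentile_limit(x, statistics.mean, alpha=0.05)
```

Because H^{−1}(α) is negative for small α when θ̂ − θ is roughly centered at zero, the returned bound sits above θ̂, as an upper limit should.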
The subscript BP stands for "bootstrap percentile", though we motivated this limit without any bootstrapping procedure.

Ordinary and studentized methods. Consider the case where we have estimators θ̂ and σ̂ of θ and of the standard error of √n θ̂. Note that the latter is not the population standard error. In many cases, it may be more realistic that

√n(θ̂ − θ)/σ and √n(θ̂ − θ)/σ̂

are approximate pivotal quantities. If they are, we may without any ambiguity define

H(x) = pr{ √n(θ̂ − θ)/σ ≤ x } and K(y) = pr{ √n(θ̂ − θ)/σ̂ ≤ y }.

If we had complete knowledge of H(x) and K(y), constructing confidence intervals for θ would be a simple task. We further note that the task reduces to finding upper and lower confidence limits. Let x_α = H^{−1}(α) and y_α = K^{−1}(α), with 1 − α the targeted level of confidence. Depending on whether we have knowledge of H or of K, the lower confidence limits are respectively

θ̂_ord(α) = θ̂ − x_{1−α} σ/√n and θ̂_stud(α) = θ̂ − y_{1−α} σ̂/√n.

Both of them have the form we presented in a previous chapter. The subscripts ord and stud are abbreviations for ordinary and studentized. The quantiles would have been z_{1−α} or t_{1−α} if H and K were the c.d.f.'s of the normal and t distributions.

Hybrid and backward methods. As is well known, when the sample size is large, σ̂ ≈ σ and hence x_α ≈ y_α. One may therefore use a hybrid lower confidence limit:

θ̂_hyb = θ̂ − x_{1−α} σ̂/√n.

This can be compared to the situation where a quantile of the t-distribution should be used, yet we mistakenly use the quantile of the normal distribution.

Under the normal distribution, z_α = −z_{1−α} because the normal distribution is symmetric. If H is symmetric, then x_α = −x_{1−α} for the same reason. Hence, when H is believed to be symmetric, we may use another lower confidence limit:

θ̂_back(α) = θ̂ + x_α σ̂/√n.

It is clearly confusing to present so many possibilities. Which one is correct? The answer depends on what we mean by "correct".
If a (random) interval covers the true value of θ with probability 1 − α + o(1), we consider it okay, or "correct". When the sizes of these intervals are not taken into consideration, we may want to examine the exact sizes of the o(1) terms in the coverage probabilities.

23.7 Implementation based on resampling

Having complete knowledge of H(x) and K(x) is not possible. More often than not, they also depend on unknown parameter values. Nonetheless, bootstrap simulation can be used to properly estimate H and K when the population distribution is F and the parameter θ is a functional of F. We will see what is meant by "can be estimated". The distribution H is estimated by

Ĥ(x) = pr( √n(θ̂* − θ̂) ≤ σ̂ x | F_n ).

The distribution K is estimated by

K̂(x) = pr( √n(θ̂* − θ̂) ≤ σ̂* x | F_n ).

Once they are obtained via bootstrap simulation, we define x̂_α = Ĥ^{−1}(α) and ŷ_α = K̂^{−1}(α). All the lower confidence limits proposed in the last section are transformed into bootstrap lower confidence limits by putting a hat on either x_α or y_α.

Now, which one makes the o(1) inside the coverage probability 1 − α + o(1) smallest? Peter Hall had a discussion paper in the Annals of Statistics specifically on this problem (I do not recall the year). The technical discussion is too complex for this course, and the results are not that insightful either. Without going back to the paper itself, I put down my unverified memory here: the studentized approach together with bootstrap resampling reduces this o(1) to O(n^{−1}); without studentization, this o(1) is o(n^{−1/2}). Both conclusions are obtained under the assumption that θ̂ is a smooth function of the sample mean X̄_n, broadly interpreted. For instance, θ = μ/σ has its estimator given by

θ̂ = x̄_n / s̃_n, where s̃²_n = n^{−1} ∑ x_i² − x̄²_n.

This estimator is a smooth function of the sample means of (x_i, x_i²).

23.8 A word of caution

The bootstrap method is generally used to simulate the variance and the distribution of a point estimator.
Based on bootstrap simulation, we can often subsequently make inferences about various parameters. Most noticeably, the results are used to construct confidence intervals for a parameter $\theta$ and to test hypotheses such as $\theta = \theta_0$. We are often freed from complex technical issues.

At the same time, one has to have a good point estimator $\hat\theta$ before the resampling procedure can even start. The statistical properties of the corresponding data analysis are largely determined by those of $\hat\theta$. The resampling methods help to determine these properties; they do not instill good properties into these procedures.

There is no guarantee that resampling methods always lead to valid statistical inferences. For instance, a nominal $1-\alpha$ level confidence interval may have far lower coverage probability, and the under-coverage problem may not go away even as the sample size increases. The theory of mathematical statistics cannot be thrown out simply because the resampling procedure is powerful at freeing us from a lot of technical derivations.

23.9 Assignment problems

1. Let $X_1, \ldots, X_n$ be a random sample from the exponential distribution with density function $f(x; \theta) = \theta^{-1}\exp(-\theta^{-1}x)$. Consider the case $n = 201$ and $\theta = 1$.

(a) Theoretically determine the median of this distribution.

(b) Generate 1000 data sets with $n = 201$ to estimate the bias and variance of the sample median for estimating the population median.

(c) Bootstrap the first sample in (b) to obtain estimates of the bias and variance of the sample median for estimating the population median.

2. Continue from the last problem.

(a) Use a bootstrap method to construct a 95% confidence interval for $\theta$ based on the following two asymptotic pivotal quantities:
\[
T_1 = \frac{\sqrt{n}(\bar x - \theta)}{\theta}; \qquad T_2 = \frac{\sqrt{n}(\bar x - \theta)}{s_n}.
\]
Set $B = 1999$. Present your intervals for the first data set generated. Show your code.

(b) Repeat (a) for $N = 100$ data sets.
Compute their average lower and upper confidence limits, average lengths, and the standard error of the lengths. Which one do you recommend based on these outcomes?

3. Generate 7 observations from the uniform $(0, 1)$ distribution.

(a) How many distinct bootstrap samples (standard bootstrap as in this book) are possible?

(b) Draw the c.d.f. of $\bar X^*$ and compute its 0.25, 0.5 and 0.75 quantiles.

(c) Compute the difference between the variance of $\bar X^*$ and the variance of $\bar X$, both numerically and by theoretical derivation. Of course, the two results are the same if round-off is ignored.

4. Based on an i.i.d. sample of size $n = 99$ from the Cauchy distribution with only a location parameter $\theta$; namely, the density function is given by
\[
f(x; \theta) = \frac{1}{\pi\{1 + (x - \theta)^2\}}.
\]
One wishes to test the hypothesis $H_0: \theta = 0$ against $H_1: \theta \ne 0$. Two potential approaches are: (1) the score test and (2) the Wald test based on the sample median (the 50th order statistic).

(a) For the purpose of implementation, you need to work out some additional details for these two tests, such as the asymptotic variances. Get them done and present your work and results.

(b) Suppose the precision of the asymptotic distributions for these two tests is not sufficiently high. Design bootstrap procedures to carry out these two tests.

(c) Implement the bootstrap procedures in (b) based on $B = 1999$ and repeat $N = 500$ times to obtain the observed rejection rate under the null hypothesis.

5. Suppose $\{0, 2, 3, 5, 10\}$ is an i.i.d. sample from some distribution $F$.

(a) Let $X^*$ be a single bootstrap observation in the conventional form used in this course. What is the distribution of $X^*$ (conditional on these observed values)? Compute its mean and variance.

(b) Let $X_1^*, \ldots, X_5^*$ be an i.i.d. bootstrap sample from $F_n$, where $n = 5$. Let $X_{(n)}^* = \max\{X_1^*, \ldots, X_5^*\}$. What is the conditional distribution of $X_{(n)}^*$? Namely, work out its probability mass function.

6. Suppose $x_1, \ldots$
, $x_n$ is an i.i.d. sample from a distribution $F$ with finite and positive variance $\sigma^2$. Let us estimate the population mean $\theta$ by $\hat\theta = \bar x$ as usual, and we certainly have $\mathrm{var}(\hat\theta) = \sigma^2/n$. However, a researcher insists on using a bootstrap method to estimate the variance $\mathrm{var}(\hat\theta)$, or more precisely $n\,\mathrm{var}(\hat\theta)$. In addition, she suggests generating $b = 1, 2, \ldots, 2B$ sets of conditionally i.i.d. samples of size $n$, $x_1^*, x_2^*, \ldots, x_n^*$, from the empirical distribution $F_n(x)$. For each $b$, she computes
\[
\bar x_b^* = n^{-1}\sum_{i=1}^n x_i^*.
\]
She then defines
\[
\hat\nu^* = (2B)^{-1}\sum_{b=1}^{B}(\bar x_{2b-1}^* - \bar x_{2b}^*)^2
\]
as an estimate of $\mathrm{var}(\hat\theta)$.

(a) Given $F_n$, compute the conditional mean and variance of $x_1^*$.

(b) For each $b$, compute the conditional mean and variance of $\bar x_b^*$ given $F_n$.

(c) Show that the conditional expectation
\[
E\{\hat\nu^* \mid F_n\} = \frac{1}{n^2}\sum_{i=1}^n (x_i - \bar x_n)^2.
\]

(d) Show that as $B \to \infty$,
\[
n\hat\nu^* \xrightarrow{p} n^{-1}\sum_{i=1}^n (x_i - \bar x_n)^2.
\]

(e) Show that $n\{\hat\nu^* - \mathrm{var}(\hat\theta)\} \to 0$ in probability as $B, n \to \infty$. More precisely, almost surely or in probability,
\[
\lim_{n\to\infty}\lim_{B\to\infty} n\{\hat\nu^* - \mathrm{var}(\hat\theta)\} = 0.
\]

Chapter 24 Multiple comparison

One-way ANOVA is a typical method for comparing a number of treatments in terms of a specific measurement of some experimental outcome. For example, an experiment might be designed to compare the volumes of harvest when different fertilizers are used.

Let the number of treatments be $k$. Let $N = n_1 + n_2 + \cdots + n_k$ experimental units be randomly assigned to the $k$ treatments, with $n_1, n_2, \ldots, n_k$ units each. Let the response variable be denoted $y$, and suppose the $j$th treatment is replicated $n_j$ times. The outputs of the experiment can be displayed as
\[
\begin{array}{l}
y_{11}, y_{12}, \ldots, y_{1n_1};\\
y_{21}, y_{22}, \ldots, y_{2n_2};\\
\cdots\\
y_{k1}, y_{k2}, \ldots, y_{kn_k}.
\end{array}
\]
The output $y_{ij}$ is the reading of the unit assigned to the $i$th treatment and the $j$th replication. A linear model for this setup is
\[
y_{ij} = \eta + \tau_i + \varepsilon_{ij}
\]
for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$.
Here $\eta$ is the overall mean and $\tau_i$ is the mean response of the $i$th treatment after subtracting the overall mean. The error term $\varepsilon_{ij}$ is what cannot be explained by the treatment effect $\tau_i$. The statistical analysis is often based on the assumption that $\varepsilon_{ij} \sim N(0, \sigma^2)$, independently of each other. The normality assumption and the equal-variance assumption are the ones that may be violated in the real world. The decomposition of the treatment means is always feasible.

24.1 Analysis of variance for one-way layout

Let
\[
\bar y_{\cdot\cdot} = N^{-1}\sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij}
\]
be the overall sample mean, and let
\[
\bar y_{i\cdot} = n_i^{-1}\sum_{j=1}^{n_i} y_{ij}
\]
be the sample mean restricted to the $i$th treatment. In general, whenever an index is replaced by a dot, the resulting notation represents the sample mean over the corresponding index. For example, $\bar y_{\cdot 1}$ would be the average of $y_{11}, y_{21}, \ldots, y_{k1}$.

Now we may decompose the response as
\[
y_{ij} = \bar y_{\cdot\cdot} + (\bar y_{i\cdot} - \bar y_{\cdot\cdot}) + (y_{ij} - \bar y_{i\cdot}),
\]
which will also be written as
\[
y_{ij} = \hat\eta + \hat\tau_i + r_{ij}.
\]
The quantities marked with hats are estimates/estimators of the corresponding parameters in the linear model.

The sum of squares of $(\bar y_{i\cdot} - \bar y_{\cdot\cdot})$ represents the variation in mean response between different levels of the factor (or between treatments), while $r_{ij} = y_{ij} - \bar y_{i\cdot}$ represents the residual variation, the variation not explainable by the treatment effect. The analysis of variance compares the relative sizes of these two sources of variation. The resulting ANOVA table is as follows.

ANOVA for One-Way Layout

Source      D.F.     SS
Treatment   $k-1$    $\sum_{i=1}^k n_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$
Residual    $N-k$    $\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i\cdot})^2$
Total       $N-1$    $\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{\cdot\cdot})^2$

One may notice that each sum of squares contains $N$ terms, when duplicated entries are also counted. Consider the test problem with the null hypothesis
\[
H_0: \tau_1 = \tau_2 = \cdots = \tau_k.
\]
The alternative hypothesis is that not all treatment means are equal. The test statistic commonly used is
\[
F = \frac{(k-1)^{-1}\sum_{i=1}^k n_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}{(N-k)^{-1}\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i\cdot})^2},
\]
which, under normality, equal variance, and $H_0$, has the F-distribution with $k-1$ and $N-k$ degrees of freedom. This might be an opportunity to refresh our memory of the desired properties of a test statistic; it is also a useful exercise to recall what a UMPU test is.

If $H_0$ is true, the distribution of the test statistic $F$ is completely determined and does not depend on any external factors. Hence, the following p-value computation is justified:
\[
p = \mathrm{pr}(F > F_{\mathrm{obs}}).
\]
Rejecting $H_0$ when $p < 0.05$ is a common practice.

24.2 Multiple comparison

Once (and if) the null model is rejected, it is natural to ask: which pair or pairs of treatments are the culprits that led to the rejection? The rejection may be caused by a single treatment whose effect differs substantially from the rest; it may also be caused by smaller differences spread across all treatments. Of course, the rejection may be erroneous. Regardless of these possibilities, let us ask which pairs of treatments are significantly different. The technique used to address this question is called multiple comparison, because many pairs are compared simultaneously.

Borrowing the idea of the two-sample test, we may define
\[
t_{ij} = \frac{\bar y_{j\cdot} - \bar y_{i\cdot}}{\sqrt{(1/n_i + 1/n_j)\hat\sigma^2}},
\]
where $\hat\sigma^2$ is the variance estimator from the ANOVA table. The denominator of this t-statistic differs from that of the usual two-sample t-test: it is obtained by pooling information from all $k$ treatments. Each $t_{ij}$ has a t-distribution with $N-k$ degrees of freedom. Hence, if a level-$\alpha$ test is desired, we may reject $H_0: \mu_i = \mu_j$ when $|t_{ij}| > t(1-\alpha/2; N-k)$. This test has probability $\alpha$ of falsely rejecting the hypothesis that the corresponding pair of treatment means are equal.
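The F statistic and the pooled pairwise t statistic above can be computed directly. A sketch with made-up data (the numbers are purely illustrative), checked against scipy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Toy data: k = 3 treatments with unequal replication (hypothetical numbers)
groups = [np.array([4.1, 3.8, 4.4, 4.0]),
          np.array([5.0, 5.4, 4.9]),
          np.array([3.5, 3.9, 3.6, 3.3, 3.7])]
k = len(groups)
N = sum(g.size for g in groups)
grand = np.concatenate(groups).mean()                  # \bar y_..
means = np.array([g.mean() for g in groups])           # \bar y_i.
ns = np.array([g.size for g in groups])

ss_trt = np.sum(ns * (means - grand) ** 2)             # treatment SS, df = k - 1
ss_res = sum(((g - g.mean()) ** 2).sum() for g in groups)  # residual SS, df = N - k
F = (ss_trt / (k - 1)) / (ss_res / (N - k))
p = stats.f.sf(F, k - 1, N - k)

# Pooled-variance pairwise t statistic for treatments 1 and 2 (0-based indices)
sigma2_hat = ss_res / (N - k)
t12 = (means[1] - means[0]) / np.sqrt((1 / ns[0] + 1 / ns[1]) * sigma2_hat)

# scipy's one-way ANOVA agrees with the hand computation
F_sp, p_sp = stats.f_oneway(*groups)
```

The agreement between `F` and `F_sp` confirms that the denominator of the F statistic is the residual mean square from the ANOVA table.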
Suppose we set $\alpha = 0.05$ as in common practice, and $k = 5$, so there are 10 pairs of treatments. Even when no two treatments differ, each individual comparison has about a 5% chance of being declared significant. The chance that at least one of the 10 comparisons is declared significant by simple t-tests is much larger: it would be $1 - 0.95^{10} \approx 40\%$ if the tests were independent. Such a high probability of false rejection is clearly not acceptable.

24.3 The Bonferroni Method

To address the problem of inflated type I error in multiple comparison, we could simply set a higher standard for every pair $i$ and $j$ so that the overall type I error is guaranteed to be below a pre-specified value $\alpha$. Let $k' = k(k-1)/2$ be the number of possible treatment pairs. We may reject $H_{ij}: \mu_i = \mu_j$ only if
\[
|t_{ij}| > t(1 - \alpha/(2k'); N - k).
\]
Since the probability that any given pair of treatments is wrongfully judged different is no more than $\alpha/k'$ (note this is a two-sided test), and there are $k'$ such pairs, it is simple to see that the chance that at least one pair is declared different, when none of them actually differ, is controlled tightly below $100\alpha\%$.

24.4 Tukey Method

Particularly when $k$ is large (5 or more), the Bonferroni method is too conservative: the actual type I error can be far lower than the targeted level $100\alpha\%$. Having a small type I error is not strictly wrong in terms of being a valid test. The real drawback is the increased type II error. When $k$ is large, the statistical power for detecting any departure from the null hypothesis is too small if the conservative Bonferroni method is used. If such a method were the standard, scientists would have to work unjustifiably hard to prove their point (even when their point is valid).

Let us define
\[
t^* = \sqrt{2}\max_{i<j}\{|t_{ij}|\}.
\]
It is seen that, under the null hypothesis that all treatment effects are equal, the distribution of $t^*$ does not depend on any unknown parameters.
However, its distribution does depend on $k$ and $N-k$ and, in fact, also on how the $N$ units are divided among the $k$ treatments. It is a test statistic with almost all the desirable properties we specified for a pure statistical significance test. Unlike the t-distribution, however, the c.d.f. of this distribution is not as well documented. When all $n_j$ are equal, the distribution may be named after Tukey (it is the studentized range distribution). Let $q_{\mathrm{tukey}}(1-\alpha; k, N-k)$ be its upper quantile when all $n_i$ are equal; that is, under that restriction,
\[
P\{t^* > q_{\mathrm{tukey}}(1-\alpha; k, N-k)\} = \alpha
\]
for any $\alpha \in (0, 1)$. We may reject the hypothesis that the $(i, j)$ pair of treatments have equal means when
\[
|t_{ij}| > q_{\mathrm{tukey}}(1-\alpha; k, N-k)/\sqrt{2}.
\]
The type I error of this approach is only approximate when $n_1, \ldots, n_k$ are not all equal; in fact, based on my memory, it has been proved that the error is bounded by $\alpha$.

My observation: Tukey's method is not so much a new method. It simply requires us to use a critical value such that the probability of wrongfully rejecting any pair $\tau_i = \tau_j$ is below $100\alpha\%$.

Pitfalls of Bonferroni and Tukey Methods

In the case of Bonferroni, the adjustment is too conservative. If we hope to test 1000 hypotheses based on a single data set, then the per-test significance level would be placed at 0.005%. Applied to a t-test with 20 degrees of freedom, the critical value is 5.134, to be compared with 2.09 if only one hypothesis were being tested. The actual type I error of the test is likely much lower (assignment problem).

As a side remark, in statistical consulting practice we often look into many aspects of the data. Based on what we spot, various hypotheses are proposed and then tested. In the end, we report the p-value of the hypothesis that falls below 0.05 (or several such p-values). This practice seriously violates the statistical principle we preach. Nonetheless, statisticians do it routinely, and our collaborators would not be pleased otherwise.
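Returning to the numbers in the Bonferroni pitfall above: they can be checked numerically, and the Bonferroni and Tukey critical values for $|t_{ij}|$ can be compared. This sketch assumes equal group sizes so that scipy's `studentized_range` distribution applies; the choice $k = 3$, $N - k = 10$ is illustrative only:

```python
import numpy as np
from scipy import stats

# Bonferroni pitfall: 1000 hypotheses, t-test with 20 degrees of freedom
df = 20
cv_single = stats.t.ppf(1 - 0.05 / 2, df)         # one hypothesis: about 2.09
cv_bonf = stats.t.ppf(1 - 0.05 / (2 * 1000), df)  # 1000 hypotheses: about 5.13

# Tukey vs Bonferroni thresholds for |t_ij|, equal group sizes assumed
k, df2, alpha = 3, 10, 0.05
kprime = k * (k - 1) // 2
tukey_thresh = stats.studentized_range.ppf(1 - alpha, k, df2) / np.sqrt(2)
bonf_thresh = stats.t.ppf(1 - alpha / (2 * kprime), df2)
# bonf_thresh exceeds tukey_thresh: Bonferroni demands a larger |t_ij|,
# i.e. it is the more conservative rule
```

The Tukey quantile is exact for equal group sizes, so its threshold is never larger than Bonferroni's for the same family-wise level.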
At the same time, to avoid this problem, some scientific journals go so far as to prohibit the use of p-values as justification for scientific findings. These problems are further compounded by the fact that the p-value is often not carefully defined in the first place.

As for the Tukey method, I feel that it is specifically designed for one-way ANOVA; I am not aware of other situations where it is applied. Regardless, my understanding is that it simply requires statisticians to ensure that the probability of wrongfully rejecting even one of many hypotheses is below $\alpha$, the pre-specified level. This principle leads to a technical issue: we may not be able to find even a well-approximated critical value.

24.5 False discovery rate

In modern statistical applications, we are confronted with problems radically different from one-way ANOVA. Thanks to advances in bio- and information technology, we can now cost-effectively and promptly measure the expression levels of thousands of genes in each subject. It is of interest to identify genes whose expression levels differ between groups of people. Typically, one group consists of healthy controls and the other of patients with a specific disease. The genes that are significantly differentially expressed might be related to the disease.

There are two aspects to this new problem. First, if 500,000 genes are inspected on 50+50 subjects, then even if we use $\alpha = 0.001$ for each test of whether a gene is differentially expressed, and none of them actually are, about 500 of them will likely be found statistically significant. This is bad. Second, suppose a handful of genes are indeed differentially expressed but the differences are not exceedingly large. Applying the Bonferroni method will likely result in none of them being judged significant. The high standard set by the Bonferroni method may fail the researchers for this wrong reason.
The dilemma seems to be resolved by giving up the notion of type I error. When thousands upon thousands of hypotheses are examined simultaneously, we probably should not mind a larger probability of "wrongfully declaring a few genes significantly differentially expressed". Rather, we should ask: among the many genes judged significantly differentially expressed, what percentage of them are falsely significant?

Because "rejecting a null hypothesis" in such a context is regarded as a scientific discovery, the percentage of falsely significant outcomes among all declared significant outcomes is called the "false discovery rate". In such applications, controlling the false discovery rate is regarded as a better principle. A widely accepted standard is again 5%. In comparison, the classical practice of controlling the overall type I error is renamed control of the "family-wise error rate".

There is a need here to be reminded of the difference between "statistical significance" and "real-world" significance. How large a difference in expression levels is scientifically significant should be judged by scientists. When two expression levels are judged statistically significantly different, it means we have sufficient statistical evidence to declare the difference genuine. However, the magnitude of the difference could be so small as to be scientifically meaningless.

24.6 Method of Benjamini and Hochberg

We will only discuss the result of Benjamini and Hochberg (1995, JRSSB). There have been many new developments since, which I have not followed closely.

False discovery rate. Suppose $m$ hypotheses are being tested, and let $m_0$ denote the number of them that are true. Let $R$ be the number of hypotheses rejected; note that $R$ is random. We have the decomposition $m_0 = U + V$, with $U$ of them tested non-significant and $V$ of them tested significant.
Similarly, $m - m_0 = T + S$: $T$ of them are tested non-significant and $S$ are tested significant. The total number of hypotheses tested significant is $R = V + S$; the total number tested non-significant is $m - R = U + T$.

When $R > 0$, the proportion of false discoveries is $V/R$. When $R = 0$, there cannot be any false discovery. Thus, they propose to define
\[
Q = \frac{V}{V + S}\,\mathbf{1}(V + S > 0).
\]
Clearly, this quantity is not observed and is random in any application. The false discovery rate (FDR) is defined to be $Q_e = E(Q)$.

In comparison, the type I error in the current situation is also called the family-wise error rate (FWER). It measures the probability of rejecting at least one hypothesis when all of them are true.

According to Benjamini and Hochberg (direct quote):

(a) If all null hypotheses are true, the FDR is equivalent to the FWER: in this case $s = 0$ and $v = r$, so if $v = 0$ then $Q = 0$, and if $v > 0$ then $Q = 1$, leading to $\mathrm{pr}(V \ge 1) = E(Q) = Q_e$. Therefore control of the FDR implies the control of the FWER in the weak sense.

(b) When $m_0 < m$, the FDR is smaller than or equal to the FWER: in this case, if $v > 0$ then $v/r \le 1$, leading to $\mathbf{1}(V \ge 1) \ge Q$. Taking expectations on both sides we obtain $\mathrm{pr}(V \ge 1) \ge Q_e$, and the two can be quite different. As a result, any procedure that controls the FWER also controls the FDR. However, if a procedure controls the FDR only, it is less stringent and a gain in power is expected. In particular, the larger the number of false null hypotheses is, the larger $S$ tends to be, and so is the difference between the error rates. As a result, the potential for increase in power is larger when more of the hypotheses are untrue.

24.7 How to apply this principle?

Suppose the hypotheses to be tested are $H_1, H_2, \ldots, H_m$. Whatever methods are used, the outcome of each test is summarized by a p-value: $P_1, \ldots, P_m$. We assume these p-values are calculated based on valid tests.
Sort these values to get $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$, and denote the corresponding hypotheses by $H_0^{(i)}$ accordingly. Select an upper bound for the false discovery rate and denote it by $q^*$.

The BH procedure:

Step I. Let $k$ be the largest $i$ for which $P_{(i)} \le (i/m)q^*$; namely
\[
k = \max\{i : P_{(i)} \le (i/m)q^*\}.
\]

Step II. Reject all $H_0^{(i)}$, $i = 1, 2, \ldots, k$.

Numerically, the BH procedure can be carried out as follows.

• If $p_{(m)} \le q^*$, reject all null hypotheses and stop;
• else if $p_{(m-1)} \le \frac{m-1}{m} q^*$, reject these $m-1$ null hypotheses and stop;
• else if $p_{(m-2)} \le \frac{m-2}{m} q^*$, reject these $m-2$ null hypotheses and stop;
• continue this process until the last step: if $p_{(1)} \le (1/m)q^*$, reject $H_0^{(1)}$ and stop;
• else, reject none and terminate.

Moral of this procedure: for the targeted application, it is not a serious issue if one falsely declares 10 genes differentially expressed for diabetes patients when 2 of them are not; we can figure out the true set subsequently. The procedure is more effective than declaring none of them significantly differentially expressed.

Suppose we choose $q^* = 0.05$. The procedure declares at least one gene significantly differentially expressed whenever $p_{(1)} \le 0.05/m$. Thus, if Bonferroni's method rejects "all $H_0$'s are true", then at least one hypothesis is also rejected by the Benjamini-Hochberg procedure. The new procedure may reject many more individual $H_0$'s. For instance, Bonferroni's method rejects $H_0^{(2)}$ only if $p_{(1)} \le p_{(2)} \le 0.05/m$, but the FDR method will do so when $p_{(1)} \le p_{(2)} \le (2/m) \times 0.05$. Hence, the FDR method will have more hypotheses rejected in the long run. Rejecting both $H_0^{(1)}$ and $H_0^{(2)}$ requires only $p_{(2)} \le (2/m)q^*$.

24.8 Theory and proof

Theorem 24.1. For independent test statistics and for any configuration of false null hypotheses, the Benjamini-Hochberg procedure controls the FDR at $q^*$.
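Before turning to the proof, the step-up procedure described above can be implemented in a few lines (my own sketch, not code from the notes):

```python
import numpy as np

def bh_reject(pvals, q_star=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array: True for hypotheses rejected at FDR level q_star.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices sorting the p-values
    thresholds = (np.arange(1, m + 1) / m) * q_star
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest i with p_(i) <= (i/m) q*
        reject[order[: k + 1]] = True            # reject H_(1), ..., H_(k)
    return reject

# Example with m = 5 p-values and q* = 0.05: thresholds are 0.01, 0.02, ..., 0.05,
# so exactly the two smallest p-values are rejected here
r = bh_reject([0.001, 0.008, 0.039, 0.041, 0.60])
```

Note the step-up character: a p-value may exceed its own threshold yet still be rejected if a larger p-value falls below its threshold.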
Remark: by "independent test statistics", we mean that the p-values are independent of each other when regarded as random variables. When a null hypothesis is true, its corresponding p-value, however obtained, has the uniform $[0, 1]$ distribution as long as the test is valid.

Lemma 24.1. Consider the Benjamini-Hochberg procedure applied to $m$ null hypotheses. For any $0 \le m_0 \le m$, independent p-values corresponding to the true null hypotheses, and any values $p_1, \ldots, p_{m_1}$ that the $m_1 = m - m_0$ p-values corresponding to the false null hypotheses may take, the Benjamini-Hochberg procedure satisfies the inequality
\[
E\{Q \mid P_{m_0+1} = p_1, \ldots, P_m = p_{m_1}\} \le (m_0/m) q^*.
\]

Interpreting this lemma: suppose $m_1$ of the hypotheses are false. Whatever the joint distribution of their corresponding p-values, integrating the inequality in the lemma yields
\[
E(Q) \le (m_0/m) q^* \le q^*,
\]
so the FDR is controlled. Namely, the conclusion of the theorem is implied by this lemma. The independence of the test statistics corresponding to the false null hypotheses is not needed for the proof of the theorem.

Proof of the Lemma. Recall $m$ is the number of hypotheses and $m_0$ is the number of true hypotheses. Denote by $H_0^{(i)}$ and $P'_{(i)}$, $i = 1, 2, \ldots, m_0$, the true null hypotheses and their p-values, with the p-values in increasing order; $P'_{(i)}$, $i = 1, 2, \ldots, m_0$, are the order statistics of $m_0$ i.i.d. uniform $[0, 1]$ random variables. Denote the false null hypotheses by $H_f^{(i)}$, $i = m_0+1, m_0+2, \ldots, m$. Their p-values as random variables are denoted $P_i$, capitalized and indexed by $i$; their realized values are denoted $p_1 \le p_2 \le \cdots \le p_{m_1}$.

The proof uses mathematical induction. We work through a few simple cases before truly starting the induction.

Case I: the case $m = 1$ is immediate.

(a) $m_1 = 1$ so that $m_0 = 0$. Hence $Q \equiv 0$ and
\[
E(Q \mid P_1) = 0 \le \frac{m_0}{m} q^*.
\]

(b) $m_1 = 0$ so that $m_0 = 1$. Hence $Q = \mathbf{1}(P'_{(1)} < q^*)$, and there is nothing to condition on.
We have
\[
E(Q) = P(P'_{(1)} < q^*) = q^* = \frac{m_0}{m} q^*.
\]
Combining (a) and (b), the conclusion of the lemma holds for $m = 1$.

Case II: the case $m = 2$.

(a) $m_1 = 2$ so that $m_0 = 0$. In this case $Q \equiv 0$ and
\[
E(Q \mid P_1, P_2) = 0 \le \frac{m_0}{m} q^*.
\]

(b) $m_1 = 1$ so that $m_0 = 1$. In this case $Q$ can take the values $0$, $1/2$ and $1$. When $P_1 > q^*$, $H_f^{(2)}$ is never rejected, and $H_0^{(1)}$ is rejected when $P'_{(1)} < 0.5 q^*$. Hence,
\[
E(Q \mid P_1 > q^*) = P(P'_{(1)} \le 0.5 q^*) = 0.5 q^* = \frac{m_0}{m} q^*.
\]
When $P_1 < q^*$, both $H_0^{(1)}$ and $H_f^{(2)}$ are rejected if $P'_{(1)} < q^*$. When this happens, $Q = 0.5$; otherwise $Q = 0$. Hence,
\[
E(Q \mid P_1 < q^*) = 0.5\, P(P'_{(1)} < q^* \mid P_1 < q^*) = 0.5 q^* = \frac{m_0}{m} q^*.
\]

(c) $m_1 = 0$ so that $m_0 = 2$. In this case, any rejection leads to $Q = 1$, and there is nothing to condition on. Hence,
\begin{align*}
E(Q) &= P\{P'_{(1)} < (1/2) q^* \text{ or } P'_{(2)} < (2/2) q^*\}\\
&= P\{P'_{(1)} < (1/2) q^*\} + P\{P'_{(1)} > (1/2) q^*,\, P'_{(2)} < (2/2) q^*\}\\
&= 1 - (1 - 0.5 q^*)^2 + (0.5 q^*)^2 = q^* = (m_0/m) q^*.
\end{align*}
Combining (a), (b) and (c), the conclusion of the lemma holds for $m = 2$.

Induction assumption: assume the lemma is true for all $m \le N - 1$. We now prove the lemma for $m = N$.

Suppose $m_0 = 0$, so that all null hypotheses are false. Then the false discovery proportion $Q \equiv 0$, hence
\[
E\{Q \mid P_{m_0+1} = p_1, \ldots, P_m = p_{m_1}\} = 0 \le (m_0/m) q^*.
\]
That is, the lemma holds when $m = N$ and $m_0 = 0$. Thus, we need only discuss the situation where $m_0 > 0$.

Let $j_0$ be the largest $0 \le j \le m_1$ satisfying
\[
p_j \le \frac{m_0 + j}{N} q^*.
\]
Note that these are p-values corresponding to false null hypotheses. Denote
\[
p'' = \frac{m_0 + j_0}{N} q^*.
\]
This value will be used as a cut-off point.

The key steps of the proof start from here.

Step 1. Conditioning on $P'_{(m_0)}$, the largest p-value in the group of true null hypotheses, we find
\begin{align*}
&E(Q \mid P_{m_0+1} = p_1, \ldots, P_m = p_{m_1})\\
&\quad = \int_0^{p''} E(Q \mid P'_{(m_0)} = p, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1})\, f_{m_0}(p)\,dp\\
&\qquad + \int_{p''}^1 E(Q \mid P'_{(m_0)} = p, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1})\, f_{m_0}(p)\,dp,
\end{align*}
with $f_{m_0}(p) = m_0 p^{m_0 - 1}$ being the density function of $P'_{(m_0)}$.
We will work out an upper bound for each of these two integrals.

Step 2. Analyzing the integral over each of the two intervals.

In the first integral, we are dealing with the situation $p \le p''$ because the integration is over the region $[0, p'']$. In this case, the BH procedure rejects all $m_0$ true null hypotheses plus $j_0$ false ones. The false discovery proportion is hence $Q = m_0/(m_0 + j_0)$. Substituting this value into the first integral, and using the density function $f_{m_0}(p)$, we have
\[
\int_0^{p''} \{\cdot\}\,dp = \frac{m_0}{m_0 + j_0}\int_0^{p''} m_0 p^{m_0-1}\,dp = \frac{m_0}{m_0 + j_0}(p'')^{m_0}.
\]
Recalling $p'' = \frac{m_0 + j_0}{N} q^*$, we get
\[
\frac{m_0}{m_0 + j_0}(p'')^{m_0} = \frac{m_0}{m_0 + j_0}(p'')^{m_0 - 1}\Big\{\frac{m_0 + j_0}{N} q^*\Big\} = \frac{m_0}{N} q^* (p'')^{m_0 - 1}.
\]
Keep this result in this form and work on the second integral.

In the second integral, by the definition of $j_0$, the largest p-value corresponding to a true $H_0$ satisfies
\[
P'_{(m_0)} = p \ge p'' = \frac{m_0 + j_0}{N} q^* \quad\text{and}\quad p_{j_0} \le p''.
\]
Let $j$ be the integer satisfying
\[
p_{j_0} < p_j \le P'_{(m_0)} = p < p_{j+1}.
\]
Note that the value $p$ may exceed many more of the p-values of the false null hypotheses. If such a $j$ does not exist, it implies
\[
p_{j_0} \le p'' < P'_{(m_0)} = p < p_{j_0+1},
\]
which occurs when the value $p$ is barely larger than $p'' = \frac{m_0 + j_0}{N} q^*$; in that case, set $j = j_0$. Now we regard $j$ as fixed, satisfying one of the two inequalities above; namely, we will work with conditional probability.

Because of the way $j_0$ and $p''$ are defined, no hypothesis can be rejected as a result of the values $p, p_{j+1}, \ldots, p_{m_1}$. That is, none of $H_0^{(m_0)}$, $H_f^{(m_0+j+1)}$, $H_f^{(m_0+j+2)}$, $\ldots$, $H_f^{(m_0+m_1)}$ is rejected. (Reminder: $j$ is a fixed value in this argument.) Hence, the pool of hypotheses that might be rejected shrinks to
\[
H_0^{(i)}: i = 1, 2, \ldots, m_0 - 1; \qquad H_f^{(i)}: i = m_0+1, m_0+2, \ldots, m_0+j.
\]
There are $m_0 + j - 1 < N$ hypotheses in this pool.
In this pool of null hypotheses, whether a hypothesis is true or false, order their p-values together and denote the corresponding hypotheses by $\tilde H_0^{(i)}$, $i = 1, 2, \ldots, m_0 + j - 1$. A hypothesis $\tilde H_0^{(i)}$ is rejected only if there exists $k$, $i \le k \le m_0 + j - 1$, for which $\tilde p_{(k)} \le (k/N) q^*$. Namely, we look for the largest $k$ such that
\[
\frac{\tilde p_{(k)}}{p} \le \frac{k}{m_0 + j - 1}\cdot\frac{m_0 + j - 1}{Np}\, q^*. \tag{24.1}
\]
We now explain that this corresponds to a BH procedure with different $m$, $m_0$ and $q^*$ values. Conditional on $P'_{(m_0)} = p$, the ratios $P'_i/p$ are i.i.d. $U(0,1)$ random variables (before sorting). Also, $\{p_i/p,\ i = 1, 2, \ldots, j\}$ are numbers between 0 and 1 corresponding to the false null hypotheses ($H_f^{(m_0+1)}, \ldots, H_f^{(m_0+j)}$). Using inequality (24.1) to test the $m_0 + j - 1 = m' < N$ hypotheses is equivalent to the BH procedure with the constant
\[
\tilde q^* = \frac{m_0 + j - 1}{Np}\, q^*
\]
taking the role of $q^*$. Applying the induction hypothesis to this procedure, in which the total number of hypotheses being tested is $\tilde m = m_0 + j - 1 < N$, we have
\[
E(Q \mid P'_{(m_0)} = p, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1})
\le \frac{m_0 - 1}{m'}\,\tilde q^*
= \frac{m_0 - 1}{m_0 + j - 1}\cdot\frac{m_0 + j - 1}{Np}\, q^*
= \frac{m_0 - 1}{Np}\, q^*.
\]
The above bound depends on $p$ but not on the segment $p_j < p < p_{j+1}$ over which it was evaluated (that is, it holds for whichever fixed $j$ is under consideration). Therefore, the second integral satisfies
\[
\int_{p''}^1 E(Q \mid P'_{(m_0)} = p, P_{m_0+1}, \ldots, P_m)\, f_{m_0}(p)\,dp
\le \int_{p''}^1 \frac{m_0 - 1}{Np}\, q^*\, m_0 p^{m_0-1}\,dp.
\]
The outcome of the integration is
\[
\frac{m_0}{N} q^* \int_{p''}^1 (m_0 - 1) p^{m_0-2}\,dp = \frac{m_0}{N} q^*\big(1 - (p'')^{m_0-1}\big).
\]
Adding the two upper bounds gives
\[
\frac{m_0}{N} q^* (p'')^{m_0-1} + \frac{m_0}{N} q^*\big(1 - (p'')^{m_0-1}\big) = \frac{m_0}{N} q^*,
\]
which completes the proof of the lemma for the case $m = N$. The induction is now complete and the lemma is fully proved.
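The theorem's conclusion can also be checked by a small simulation (my own sketch; the Beta(0.1, 1) distribution for the p-values of false nulls is an arbitrary illustrative choice). Under independence, the FDR of the BH procedure should be close to $(m_0/m)q^*$ and in particular no larger than $q^*$:

```python
import numpy as np

rng = np.random.default_rng(2)

m, m0, q_star, reps = 20, 15, 0.1, 4000
Q = np.empty(reps)
for rep in range(reps):
    p = np.concatenate([
        rng.uniform(size=m0),              # true nulls: uniform p-values
        rng.beta(0.1, 1.0, size=m - m0),   # false nulls: small p-values (assumed)
    ])
    # BH step-up rule applied to the m p-values
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q_star
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= (i/m) q*
        reject[order[: k + 1]] = True
    V = reject[:m0].sum()                  # false discoveries (true nulls rejected)
    R = reject.sum()
    Q[rep] = V / R if R > 0 else 0.0

fdr_hat = Q.mean()   # theory: E(Q) = (m0/m) q_star = 0.075 under independence
```

The Monte Carlo estimate `fdr_hat` should land near 0.075, comfortably below the nominal level $q^* = 0.1$.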