STAT 460/560 + 461/561
STATISTICAL INFERENCE I & II
2019/2020, TERMs I & II
Jiahua Chen and Ruben Zamar ©
Department of Statistics
University of British Columbia
Contents
1 Some basics 1
1.1 Discipline of Statistics . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Probability and Statistics models . . . . . . . . . . . . . . . . 3
1.3 Statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 8
2 Normal distributions 9
2.1 Uni- and Multivariate normal . . . . . . . . . . . . . . . . . . 10
2.2 Standard Chi-square distribution . . . . . . . . . . . . . . . . 12
2.3 Non-central chi-square distribution . . . . . . . . . . . . . . . 14
2.4 Cochran Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 F- and t-distributions . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 20
3 Exponential distribution families 23
3.1 One parameter exponential distribution family . . . . . . . . . 23
3.2 The multiparameter case . . . . . . . . . . . . . . . . . . . . . 26
3.3 Other properties . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 29
4 Optimality criteria of point estimation 31
4.1 Point estimator and some optimality criteria . . . . . . . . . . 32
4.2 Uniformly minimum variance unbiased estimator . . . . . . . . 35
4.3 Information inequality . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Other desired properties of a point estimator . . . . . . . . . . 40
4.5 Consistency and asymptotic normality . . . . . . . . . . . . . 42
4.6 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 45
5 Approaches of point estimation 49
5.1 Method of moments . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Maximum likelihood estimation . . . . . . . . . . . . . . . . . 52
5.3 Estimating equation . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 M-Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.5 L-estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 56
6 Maximum likelihood estimation 59
6.1 MLE examples . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Newton–Raphson algorithm . . . . . . . . . . . . . . . . . . 61
6.3 EM-algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 EM-algorithm for finite mixture models . . . . . . . . . . . . . 66
6.4.1 Data Examples . . . . . . . . . . . . . . . . . . . . . . 69
6.5 EM-algorithm for finite mixture models repeated . . . . . . . 70
6.6 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 73
7 Properties of MLE 75
7.1 Trivial consistency . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Trivial consistency for one-dimensional θ . . . . . . . . . . . . 79
7.3 Asymptotic normality of MLE after the consistency is established . . 82
7.4 Asymptotic efficiency, super-efficient, one-step update scheme 83
7.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 85
8 Analysis of regression models 89
8.1 Least absolute deviation and least squares estimators . . . . 90
8.2 Linear regression model . . . . . . . . . . . . . . . . . . . . . . 91
8.3 Local kernel polynomial method . . . . . . . . . . . . . . . . . 95
8.4 Spline method . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.5 Cubic spline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.6 Smoothing spline . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.7 Effective number of parameters and the choice of λ . . . . . . 116
8.8 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 117
9 Bayes method 119
9.1 An artificial example . . . . . . . . . . . . . . . . . . . . . . . 120
9.2 Classical issues related to Bayes analysis . . . . . . . . . . . . 123
9.3 Decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.4 Some comments . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 129
10 Monte Carlo and MCMC 133
10.1 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 134
10.2 Biased or importance sampling . . . . . . . . . . . . . . . . . 138
10.3 Rejective sampling . . . . . . . . . . . . . . . . . . . . . . . . 139
10.4 Markov chain Monte Carlo . . . . . . . . . . . . . . . . . . . . 142
10.4.1 Discrete time Markov chain . . . . . . . . . . . . . . . 143
10.5 MCMC: Metropolis sampling algorithms . . . . . . . . . . . . 146
10.6 The Gibbs samplers . . . . . . . . . . . . . . . . . . . . . . . . 149
10.7 Relevance to Bayes analysis . . . . . . . . . . . . . . . . . . . 151
10.8 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 151
11 More on asymptotic theory 155
11.1 Modes of convergence . . . . . . . . . . . . . . . . . . . . . . . 155
11.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . 158
11.3 Stochastic Orders . . . . . . . . . . . . . . . . . . . . . . . . . 159
11.3.1 Application of stochastic orders . . . . . . . . . . . . . 161
12 Hypothesis test 167
12.1 Null hypothesis. . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12.2 Alternative hypothesis . . . . . . . . . . . . . . . . . . . . . . 169
12.3 Pure significance test and p-value . . . . . . . . . . . . . . . . 171
12.4 Issues related to p-value . . . . . . . . . . . . . . . . . . . . . 173
12.5 General notion of statistical hypothesis test . . . . . . . . . . 175
12.6 Randomized test . . . . . . . . . . . . . . . . . . . . . . . . . 177
12.7 Three ways to characterize a test . . . . . . . . . . . . . . . . 178
13 Uniformly most powerful test 183
13.1 Simple null and alternative hypothesis . . . . . . . . . . . . . 184
13.2 Making more from N-P lemma . . . . . . . . . . . . . . . . . . 189
13.3 Monotone likelihood ratio . . . . . . . . . . . . . . . . . . . . 190
13.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 193
14 Pushing Neyman–Pearson Lemma Further 197
14.1 One parameter exponential family . . . . . . . . . . . . . . . . 199
14.2 Two-sided alternatives . . . . . . . . . . . . . . . . . . . . . . 202
14.3 Unbiased test . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
14.3.1 Existence of UMPU tests . . . . . . . . . . . . . . . . . 204
14.4 UMPU for normal models . . . . . . . . . . . . . . . . . . . . 207
14.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 207
15 Locally most powerful test 211
15.1 Score test and its local optimality . . . . . . . . . . . . . . . . 212
15.2 General score test . . . . . . . . . . . . . . . . . . . . . . . . . 214
15.3 Implementation remark . . . . . . . . . . . . . . . . . . . . . . 216
15.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 217
15.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 218
16 Likelihood ratio test 221
16.1 Likelihood ratio test: as a pure procedure . . . . . . . . . . . . 221
16.2 Wilks Theorem under regularity conditions . . . . . . . . . . . 224
16.3 Asymptotic chisquare of LRT statistic . . . . . . . . . . . . . 227
16.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 228
16.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 230
17 Likelihood with vector parameters 233
17.1 Asymptotic normality of MLE after the consistency is established . . 236
17.2 Asymptotic chisquare of LRT for composite hypotheses . . . . 237
17.3 Asymptotic chisquare of LRT: one-step further . . . . . . . . . 240
17.3.1 Some notational preparations . . . . . . . . . . . . . . 240
17.4 The most general case: final step . . . . . . . . . . . . . . . . 243
17.5 Statistical application of these results . . . . . . . . . . . . . . 244
17.6 Assignment Problems . . . . . . . . . . . . . . . . . . . . . . . 246
17.7 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 248
18 Wald and Score tests 251
18.1 Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
18.1.1 Variations of Wald test in the aspect of Fisher information . . 252
18.1.2 Variations of Wald test in the aspect of H0 . . . . . . . 253
18.1.3 Variations of Wald test in the aspect of H0 . . . . . . . 254
18.2 Score Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
18.3 Power and consistency . . . . . . . . . . . . . . . . . . . . . . 256
18.4 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 257
18.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 258
19 Tests under normality 261
19.1 One-sample problem under normality . . . . . . . . . . . . . . 261
19.2 Two-sample problem under normality assumption . . . . . . . 264
19.3 Test for equal mean under equal variance assumption . . . . . 266
19.4 Test for equal mean without equal variance assumption . . . . 267
19.5 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 269
20 Non-parametric tests 271
20.1 One-sample sign test. . . . . . . . . . . . . . . . . . . . . . . . 272
20.2 Sign test for paired observations. . . . . . . . . . . . . . . . . 273
20.3 Wilcoxon signed-rank test . . . . . . . . . . . . . . . . . . . . 274
20.4 Two-sample permutation test. . . . . . . . . . . . . . . . . . . 276
20.5 Kolmogorov–Smirnov and Cramér–von Mises tests . . . . . . 280
20.6 Pearson’s goodness-of-fit test . . . . . . . . . . . . . . . . . . . 281
20.7 Fisher’s exact test . . . . . . . . . . . . . . . . . . . . . . . . . 283
20.8 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 285
21 Confidence intervals or regions 289
21.1 Confidence intervals based on hypothesis test . . . . . . . . . . 292
21.2 Confidence interval by pivotal quantities . . . . . . . . . . . . 293
21.3 Likelihood intervals. . . . . . . . . . . . . . . . . . . . . . . . 294
21.4 Intervals based on asymptotic distribution of θ̂ . . . . . . . . 296
21.5 Bayes Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
21.6 Prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . 301
21.7 Hypothesis test and confidence region . . . . . . . . . . . . . . 302
21.8 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 304
22 Empirical likelihood 307
22.1 Definition of the empirical likelihood . . . . . . . . . . . . . . 307
22.2 Profile likelihood . . . . . . . . . . . . . . . . . . . . . . . . 309
22.3 Large sample properties . . . . . . . . . . . . . . . . . . . . . 311
22.4 Likelihood ratio function . . . . . . . . . . . . . . . . . . . . . 315
22.5 Numerical computation . . . . . . . . . . . . . . . . . . . . . . 317
22.6 Empirical likelihood applied to estimating functions . . . . . . 319
22.7 Adjusted empirical likelihood . . . . . . . . . . . . . . . . . . 324
22.8 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 325
23 Resampling methods 329
23.1 Problems addressed by resampling . . . . . . . . . . . . . . . . 329
23.2 Resampling procedures . . . . . . . . . . . . . . . . . . . . . . 330
23.3 Bias correction . . . . . . . . . . . . . . . . . . . . . . . . . . 332
23.4 Variance estimation . . . . . . . . . . . . . . . . . . . . . . . . 333
23.5 The cumulative distribution function . . . . . . . . . . . . . . 336
23.6 Recipes for confidence limit . . . . . . . . . . . . . . . . . . . 339
23.7 Implementation based on resampling . . . . . . . . . . . . . . 341
23.8 A word of caution . . . . . . . . . . . . . . . . . . . . . . . . . 342
23.9 Assignment problems . . . . . . . . . . . . . . . . . . . . . . . 343
24 Multiple comparison 347
24.1 Analysis of variance for one-way layout. . . . . . . . . . . . . . 348
24.2 Multiple comparison . . . . . . . . . . . . . . . . . . . . . . . 350
24.3 The Bonferroni Method . . . . . . . . . . . . . . . . . . . . . 350
24.4 Tukey Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
24.5 False discovery rate . . . . . . . . . . . . . . . . . . . . . . . . 353
24.6 Method of Benjamini and Hochberg . . . . . . . . . . . . . . . 354
24.7 How to apply this principle? . . . . . . . . . . . . . . . . . . . 355
24.8 Theory and proof . . . . . . . . . . . . . . . . . . . . . . . . . 357
Chapter 1
Some basics
1.1 Discipline of Statistics
Statistics is a discipline that serves other scientific disciplines, and many would not consider statistics itself a branch of science. A scientific discipline constantly develops theories to describe how nature works. These theories are falsified whenever their predictions contradict the observations. Based on these theories and hypotheses, scientists form a model for the natural world, and the model is then used to predict what happens in nature under new circumstances. Scientific experiments are constantly designed to find evidence that may contradict the predictions of the proposed model, aiming at DISPROVING the hypotheses behind the model/theory. If a theory is able to make useful predictions and we fail to find contradicting evidence, it gains broad acceptance. We may then temporarily consider it as "the truth". Even if a model/theory does not give a perfect prediction, as long as its predictions are precise enough for practical purposes and it is much simpler than a more precise model/theory, we tend to retain it as a working model. I regard Newton's laws as such an example, as compared with the more elaborate relativity theory of Einstein.
If a theory does not provide any prediction that can potentially be dis-
proved by some experiments, then it is not a scientific theory. Religious
theories form a rich group of such examples.
Statistics, in a way, is a branch of mathematics. It does not model nature. For example, it does not claim that when a fair die is rolled, the probability of observing 1 is 1/6. Rather, it claims that if the probability of observing 1 is 1/6, and if the outcomes of two dice are independent, then the probability of observing (1, 1) is 1/36, and the probability of observing either (1, 2) or (2, 1) is 2/36. If one applies a similar model to the spatial distribution of two electrons, the experimental outcomes may contradict the prediction of this probability model, yet the contradiction does not imply that the statistical theory is wrong. Rather, it implies that the statistical model does not apply to the distribution of the electrons. The moral of this example is that a statistical theory cannot be disproved by physical experiments. Its results are logical truths, and this makes it unqualified as a scientific discipline in the sense we mentioned earlier.
We should distinguish between an inconsistency between a probability model and the real world, and an inconsistency within our logical derivations. If we err in proving a proposition, that proposition is very likely false within our logical system; it does not disprove the logical system. We call logically proved propositions theorems. In comparison, the propositions regarded as temporary truths in science are called laws. Of course, we sometimes abuse these terminologies, as in the "Law of Large Numbers".
In a scientific investigation, one may not always be able to find clear-cut evidence against a hypothesis. For instance, genetic theory indicates that tall fathers have tall sons in general, yet there are many factors behind the height of a son. Suppose we collect 1000 father–son pairs randomly from a human population and measure their heights as (x_i, y_i), i = 1, 2, . . . , 1000. A regression model of the form

y_i = a + b x_i + ε_i,

with regression coefficients (a, b) and random error ε, can be a useful summary of the data.
If the statistical analysis of the data supports the model with some b > 0, then the genetic theory survives the attack. If we have strong evidence suggesting that b is not very different from 0, or may even be negative, then the genetic theory has to be abandoned. In this case, the genetic theory is not disproved by statistics, but by physical experiments (data collected on father–son heights) assisted by the statistical analysis. Whatever the outcome of the statistical analysis, the statistical theory is not falsified; it is the genetic theory that is being tortured.
1.2 Probability and Statistics models
In scientific investigations, we often quantify the outcomes of an experiment in order to develop a useful model for the real world. An existing scientific theory can often give a precise prediction: water boils at 100 degrees Celsius at sea level on the Earth. In other cases, precise prediction is nearly impossible. For example, scientists still cannot predict when and where the next serious earthquake will occur. There used to be a belief that a yet-to-be-discovered perfect scientific model exists which can explain away all randomness. In the case of earthquakes, a precise prediction might be possible if we knew the exact tensions between the geological structures all around the world, the amount of heat being generated at the core of the earth, the positions of all heavenly bodies, and a lot more.

In other words, the claim is that we study randomness only because we are incompetent in science or because a perfect model is too complicated to be practically useful. This is now believed not to be the case. The uncertainty principle in quantum theory indicates that randomness might be more fundamental than many of us are willing to accept. It strongly justifies the study of statistics as an "academic discipline".
A probability space is generally denoted as (Ω, B, P). We call Ω the sample space, which is linked to all possible outcomes of the experiment under consideration. The notion of an experiment becomes vague when the real-world problem becomes complex; it is better to follow the mathematical convention and simply assume its existence. B is a σ-algebra. Mathematically, it stands for a collection of subsets of Ω with some desirable properties. We require that it is possible to assign a probability to each subset of Ω that is a member of B without violating some desired rules. How large a probability is assigned to a particular member of B is governed by the rule denoted P.
A random variable (vector) X is a measurable function on Ω. It takes values in R^n if X has length n. It induces a probability space (R^n, B, F), where F is its distribution. In statistics, we consider the problem of inferring about F within a pre-specified set of distributions. This set of distributions is called a statistical model, and it is presented as a probability distribution family F, sometimes with additional structures. If the vector X has n components and they are independent and identically distributed (i.i.d.), we use F for the individual distribution, not for the joint distribution. This convention will be clear when we work with specific problems. In this case, we call F the population, defined on (R, B). The components of X are samples from the population F.
When the individual probability distributions in F are conveniently labelled by a subset of R^d, the Euclidean space of dimension d, we say that F is a parametric distribution family. The label is often denoted θ, and the set Θ of all its possible values is called the parameter space. In applications, we usually consider only parametric models whose probability distributions have a density function with respect to a common σ-finite measure. In such situations, we write

F = {f(x; θ) : θ ∈ Θ}.

The σ-finite measure is usually the Lebesgue measure, which makes f(x; θ) the commonly referred-to density functions. When the σ-finite measure is the counting measure, the density functions are known as probability mass functions.
If F is not parameterized, we have a non-parametric model.
Probability theory and statistics. Probability theory studies the properties of stochastic systems: for instance, the convergence of the empirical distribution based on an i.i.d. sample. Statistical theory aims at inferring about a stochastic system based on (often) an i.i.d. sample from that system: for instance, does the system (population) appear to be a mixture of two more homogeneous subpopulations? Probability theory is the foundation of statistical inference.

Given an inference goal, statisticians may propose many possible approaches. Some approaches may be deemed inferior and dismissed over time; most have merits that are not completely overshadowed by other approaches. Some statistical techniques are used as standard methods in other disciplines, yet most statisticians have never heard of them. As a statistician, I hope to have the knowledge to understand these approaches, not to have knowledge of all statistical approaches.
1.3 Statistical inference
Let X = (X1, X2, . . . , Xn) be a random sample from a statistical model F. That is, we assume they are independent and identically distributed with a distribution that is a member of F. Let their realized values be x = (x1, x2, . . . , xn). Statistical inference is to infer about the specific member F of F based on the realized value x. If we take a single guess at F, the result is a point estimate; if we provide a collection of possible F, the result is (usually) an interval estimate; if we make a judgement on whether a single member or a subset of F contains the "true" distribution, the procedure is called a hypothesis test. In the last case, we are generally required to quantify the strength of the evidence on which the judgement is based. If we partition the space F into several submodels and infer which submodel F belongs to, the procedure is called model selection. In general, for model selection, we do not quantify the evidence favouring the specific submodel. This is the difference between "hypothesis test" and "model selection".

Another general category of statistical inference is based on the Bayesian paradigm. The Bayesian approach does not identify any F or any subset of F. Instead, it provides a probabilistic judgement on every member or subset of F. The probabilistic judgement is obtained as a conditional distribution: one places a prior distribution on F and conditions on the observations X = x. We call the result the posterior distribution. A final decision is then made based on considerations such as minimizing an expected loss.
Definition 1.1. A statistic is a function of data which does not depend on
any unknown parameters.
More concretely, the value of a statistic can be evaluated without knowing the values of the unknown parameters in the model. The sample mean x̄n = n^{-1}(x1 + x2 + · · · + xn) is a statistic. However, x̄n − E(X1) is in general not a statistic, because it is a function of both the data, through x̄n, and the usually unknown value E(X1). The value of E(X1) often depends on the parameter θ behind F.
Let T(x) be a statistic. We may also regard T(x) as the realized value of T when the realized value of X is x, and regard T = T(X) as a quantity to be "realized". Since X is random, the outcome of T is also random. The distribution of T(X) is called its sampling distribution. Unfortunately, it is often hard to be completely consistent when we deal with T(X) and T(x); we may have to read between the lines to tell which of the two is under discussion. Since the distribution of X is usually known only up to being a member of F, which is often labelled by a parameter θ, the (sampling) distribution of T is also known only up to the unknown parameter θ.
Definition 1.2. Let T(x) be a statistic. If the conditional distribution of X given T does not depend on unknown parameter values, we say T is a sufficient statistic.
When T is sufficient, all the information about θ contained in X is contained in T. In this case, one may choose to ignore X and work only with T without any loss of efficiency. Such a simplification is most helpful when T is much simpler than X, or represents a substantial reduction of X.

Directly verifying the sufficiency of a statistic is often difficult. We generally use the factorization theorem to identify sufficient statistics: if the density function of X can be written as

f(x; θ) = h(x) g(T(x); θ)

for some functions h(·) and g(·; ·), then T(x) is sufficient for θ.

In some situations, direct verification is not too complex. For example, suppose X1, X2 are independent Poisson random variables with mean parameter θ. Then the conditional distribution of X1 given T = X1 + X2 is Binomial(T, 1/2), which is free of the unknown parameter θ. Hence, T is sufficient for θ.
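This Poisson example can be checked numerically. The following is a small simulation sketch, not part of the original notes; the helper names (`poisson`, `cond_freq`) and the chosen values of θ and t are illustrative. It tabulates the empirical conditional distribution of X1 given X1 + X2 = t for two different θ values; both should be close to Binomial(t, 1/2).

```python
import math
import random
from collections import Counter

random.seed(1)

def poisson(lam):
    # Knuth's method for generating one Poisson(lam) draw
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def cond_freq(theta, t, n_rep=200_000):
    # empirical distribution of X1 among draws with X1 + X2 == t
    counts, total = Counter(), 0
    for _ in range(n_rep):
        x1, x2 = poisson(theta), poisson(theta)
        if x1 + x2 == t:
            counts[x1] += 1
            total += 1
    return {k: v / total for k, v in counts.items()}

# Binomial(2, 1/2) puts mass (1/4, 1/2, 1/4) on 0, 1, 2 -- for any theta.
f_small = cond_freq(theta=1.0, t=2)
f_large = cond_freq(theta=3.0, t=2)
print(f_small, f_large)
```

Both printed distributions agree with (1/4, 1/2, 1/4), illustrating that conditioning on T removes all dependence on θ.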
Definition 1.3. A sufficient statistic T(x) is minimal sufficient if T is a function of every other sufficient statistic.
A minimal sufficient statistic may still contain some redundancy. If a statistic has the property that none of its non-zero functions can have identically zero expectation, the statistic is called complete. When the requirement is reduced to include only bounded functions, T is called boundedly complete. We have a few more such notions.
Definition 1.4. A sufficient statistic T(x) is complete if E(g(T)) = 0 under every F ∈ F implies g(·) ≡ 0 almost surely.
In contrast, if the distribution of T does not depend on θ or equivalently
on the specific distribution of X, we say that T is an ancillary statistic.
Definition 1.5. If the distribution of the statistic T (x) does not depend on
any parameter values, it is an ancillary statistic.
Example: Suppose X = (X1, . . . , Xn) is a random sample from N(θ, 1) with θ ∈ R. Recall that T = X̄ is a complete and sufficient statistic for θ. At the same time, X − T = (X1 − X̄, . . . , Xn − X̄) is an ancillary statistic. It does not contain any information about the value of θ. However, it is not completely useless: under the normality assumption, X − T is multivariate normal. We can study the realized value of X − T to see whether it looks like a realized value from a multivariate normal distribution. If the conclusion is negative, the normality assumption is in serious question. If the validity of a statistical inference depends heavily on normality, such a diagnostic procedure is very important.
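The ancillarity of X − T can be seen in a quick simulation. The sketch below is our own illustration (the function name `max_residual` and the choice of summary are arbitrary): any function of the residual vector, here the largest residual, has the same distribution no matter what θ is.

```python
import random
import statistics

random.seed(2)

def max_residual(theta, n=10):
    # largest entry of the ancillary residual vector (X_i - Xbar)
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    return max(xi - xbar for xi in x)

n_rep = 20_000
m0 = statistics.mean(max_residual(0.0) for _ in range(n_rep))
m10 = statistics.mean(max_residual(10.0) for _ in range(n_rep))
print(m0, m10)  # nearly identical despite very different theta
```

Shifting θ from 0 to 10 leaves the behaviour of the residuals unchanged, which is exactly why they are useless for estimating θ but still useful for checking normality.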
Remark: In this example the probability model F is all normal distributions with mean θ and known variance σ² = 1. Notationally, F = {N(θ, 1) : θ ∈ R}.
Definition 1.6. If T is a function of both data X and the parameter θ, but
its distribution is not a function of θ, we call T a pivotal quantity.
In the last example, S = X̄ − θ is a pivotal quantity. Note that this claim is made under the assumption that θ is the "true" parameter value of the distribution of X; it is not a dummy variable. This is another common practice in the statistical literature: if not declared, the notation θ is used both as a dummy variable and as the "true" value behind the distribution of the random sample X. The same convention applies to Bayes methods: θ is often regarded as a realized value from its prior distribution, and X is then a sample from the distribution labelled by this "true" value of θ.
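The pivotal property of S = X̄ − θ is easy to check by simulation. The following sketch is our own (the sample size and the two θ values are arbitrary): under N(θ, 1) sampling, S should be distributed as N(0, 1/n) for every θ, which is what makes intervals of the form X̄ ± z/√n work.

```python
import random
import statistics

random.seed(3)
n, n_rep = 25, 20_000

def pivot(theta):
    # S = Xbar - theta for one N(theta, 1) sample of size n
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    return sum(x) / n - theta

s_a = [pivot(-5.0) for _ in range(n_rep)]
s_b = [pivot(7.0) for _ in range(n_rep)]

# Both collections should look like N(0, 1/25): mean near 0, sd near 0.2.
print(statistics.mean(s_a), statistics.stdev(s_a))
print(statistics.mean(s_b), statistics.stdev(s_b))
```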
Note that the parameter θ is a label of the F that belongs to F in parametric models. It may as well be regarded as a function of F; call it a functional if you please. Any function of F can be regarded as a parameter by the same token. For example, the median of F is a parameter. This works even if F belongs to a popularly used parametric distribution family such as the Poisson.
1.4 Assignment problems
1. Let X1, X2, . . . , Xn be a random sample (i.i.d.) from a continuous distribution f(x). Namely, the distribution family F contains all univariate continuous distributions.

Let R1, R2, . . . , Rn be the rank statistics; that is, R1 is the rank of X1 among the n random variables.

(a) Show that the vector R = (R1, R2, . . . , Rn)^τ is an ancillary statistic and find its distribution.

(b) What information contained in R might be useful for statistical inference?
2. Let X1, X2, . . . , Xn be a random sample (i.i.d.) from N(θ, σ²). Let X̄n and s²n be the sample mean and sample variance.

(a) Verify that (X1 − X̄n, . . . , Xn − X̄n)/sn is an ancillary statistic.

(b) Verify by the factorization theorem that (X̄n, s²n) are jointly sufficient.

(c) Suppose σ = 1 is known. Show that X̄n is complete for θ by definition.
Chapter 2
Normal distributions
Let X be a random variable; namely, a function on a probability space (Ω, B, P). Its randomness is inherited from the probability measure P. By the definition of a random variable,

{X ≤ t} = {ω : ω ∈ Ω, X(ω) ≤ t}

is a member of B for any real value t. Hence, there is a definitive value

F_X(t) = P({X ≤ t})

for any t ∈ R. We refer to F_X(t) as the cumulative distribution function (c.d.f.) of X. Often, we omit the subscript and write it as F(t). Note that t itself is a dummy variable, so it does not carry any specific meaning other than standing for a real number. In most practice, we use F(x) for the c.d.f. of X. This can lead to confusion: once F(x) is used as the c.d.f. of X, F(y) remains the c.d.f. of X, not necessarily that of another random variable called Y.
The c.d.f. of a random variable largely determines its randomness properties. This is the basis of forming distribution families: distributions whose c.d.f.'s share a specific algebraic form. Of course, there is often a physical cause behind the algebraic form. For instance, the success–failure experiment is behind the binomial distribution family.

The uni- and multivariate normal distribution families occupy a special place in classical mathematical statistics. We provide a quick review as follows.
2.1 Uni- and Multivariate normal
A random variable has the standard normal distribution if its density function is given by

φ(x) = (1/√(2π)) exp(−x²/2).

We generally use

Φ(x) = ∫_{−∞}^{x} φ(t) dt

to denote the corresponding c.d.f. If X has probability density function

φ(x; µ, σ) = σ⁻¹ φ((x − µ)/σ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)),

then X has the normal distribution with mean µ and variance σ². We use Φ(x; µ, σ) to denote the corresponding c.d.f.
If Z has the standard normal distribution, then X = σZ + µ has the normal distribution with parameters (µ, σ²), which represent its mean and variance. The moment generating function of X is given by

M_X(t) = exp(µt + σ²t²/2),

which exists for all t ∈ R. The moments of the standard normal Z are: E(Z) = 0, E(Z²) = 1, E(Z³) = 0 and E(Z⁴) = 3.
Why is the normal distribution normal? The central limit theorem tells us that if X1, X2, . . . , Xn, . . . is a sequence of i.i.d. random variables with E(X) = 0 and var(X) = 1, then

P(n^{−1/2} Σ_{i=1}^{n} X_i ≤ x) → ∫_{−∞}^{x} φ(t) dt

for all x, where φ(t) is the density function of the standard normal distribution (normal with mean 0 and variance 1).

Recall that many of the distributions we have investigated can be viewed as distributions of sums of i.i.d. random variables; hence, when properly scaled as in the central limit theorem, their distributions are well approximated by the normal. These examples include the binomial, Poisson, negative binomial, and Gamma distributions.
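The central limit theorem above can be illustrated numerically. In this sketch (our own; the choice of centred Exponential(1) variables and of n is arbitrary), the summands have mean 0 and variance 1, so the standardized sums should have c.d.f. close to Φ for moderately large n.

```python
import math
import random

random.seed(4)
n, n_rep = 100, 40_000

def standardized_sum():
    # n^{-1/2} * sum of i.i.d. (Exp(1) - 1) variables: mean 0, variance 1
    return sum(random.expovariate(1.0) - 1.0 for _ in range(n)) / math.sqrt(n)

draws = [standardized_sum() for _ in range(n_rep)]
p0 = sum(d <= 0.0 for d in draws) / n_rep   # compare with Phi(0) = 0.5
p1 = sum(d <= 1.0 for d in draws) / n_rep   # compare with Phi(1) = 0.8413
print(p0, p1)
```

Even though Exp(1) − 1 is quite skewed, the empirical probabilities land close to Φ(0) and Φ(1); the small remaining discrepancy at 0 shrinks at rate n^{−1/2}.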
In general, if the outcome of a random quantity is influenced by numerous factors and none of them plays a determining role, then the sum of their effects is approximately normally distributed. This reasoning is used to support the normality assumption on the "height" distribution, even though none of us ever had a negative height.
Multivariate normal. Let the vector Z = (Z1, Z2, . . . , Zd)^τ consist of independent, standard normally distributed components. Their joint density function is given by

f(z) = (2π)^{−d/2} exp{−z^τ z/2} = (2π)^{−d/2} exp{−(1/2) Σ_{j=1}^{d} z_j²}.

Easily, we have E(Z) = 0 and var(Z) = I_d, the identity matrix. The (joint) moment generating function of Z is given by

M_Z(t) = exp{t^τ t/2},

where t is a vector of length d.
Let B be a matrix of size m × d and µ be a vector of length m. Then

    X = BZ + µ

is multivariate normally distributed with

    E(X) = µ,  var(X) = BB^τ.

We will use the notation Σ = BB^τ. It is seen that if X is multivariate
normally distributed, N(µ, Σ), then any linear function of it, Y = AX + b, is
also multivariate normally distributed: N(Aµ + b, AΣA^τ).
Note this claim requires neither Σ nor A to have full rank. It also implies
that all marginal distributions of a multivariate normal random vector are
normally distributed. The converse is not completely true: if all marginal
distributions of a random vector are normal, the random vector does not
necessarily have a multivariate normal distribution. However, if every linear
combination of X has a normal distribution, then the random vector X has a
multivariate normal distribution.
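The closure of the normal family under linear maps is easy to verify by simulation. The sketch below (assuming NumPy is available; the matrix B and vector µ are arbitrary illustrations) confirms that X = BZ + µ has mean µ and covariance BB^τ even when B is not square.

```python
import numpy as np

rng = np.random.default_rng(1)

# X = B Z + mu with Z standard normal implies X ~ N(mu, B B^T).
d, n = 3, 200000
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])   # 2 x 3, deliberately not full rank in d
mu = np.array([1.0, -1.0])

Z = rng.standard_normal((n, d))
X = Z @ B.T + mu

Sigma = B @ B.T                   # theoretical covariance
print(np.abs(X.mean(axis=0) - mu).max())   # near 0
print(np.abs(np.cov(X.T) - Sigma).max())   # near 0
```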
When Σ has full rank, N(µ, Σ) has a density function given by

    φ(x; µ, Σ) = (2π)^{−d/2} {det(Σ)}^{−1/2} exp{−(x − µ)^τ Σ^{−1} (x − µ)/2},

where det(·) is the determinant of a matrix. We use Φ(x; µ, Σ) for the
multivariate c.d.f.
Partition of X. Assume that a multivariate normal random vector is
partitioned into two parts: X^τ = (X_1^τ, X_2^τ). The mean vector and
covariance matrix can be partitioned accordingly. In particular, we denote the
partition of the mean vector as µ^τ = (µ_1^τ, µ_2^τ) and that of the covariance
matrix as

    Σ = [ Σ_{11}  Σ_{12} ]
        [ Σ_{21}  Σ_{22} ].
Theorem 2.1. Suppose X^τ = (X_1^τ, X_2^τ) is multivariate normal, N(µ, Σ).
Then

(1) X_1 is multivariate N(µ_1, Σ_{11}).

(2) X_1 and X_2 are independent if and only if Σ_{12} = 0.

(3) Assume Σ_{22} has full rank. Then the conditional distribution of X_1 | X_2
is normal with conditional mean µ_1 + Σ_{12} Σ_{22}^{−1} (X_2 − µ_2) and
variance matrix Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}.
That is, for multivariate normal random variables, zero correlation is
equivalent to independence. The above result for the conditional distribution
is given when Σ_{22} has full rank. The situation where Σ_{22} does not have
full rank can be worked out by removing the redundancy in X_2 before applying
the above result.
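The conditional-mean formula in Theorem 2.1(3) can be checked by simulation, since in the bivariate case it coincides with the least-squares regression line of X_1 on X_2. A minimal sketch, assuming NumPy is available and with an arbitrarily chosen µ and Σ:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bivariate normal: E(X1 | X2) = mu1 + (Sigma12 / Sigma22) (X2 - mu2),
# so the theoretical slope Sigma12 / Sigma22 should match a least-squares
# fit of X1 on X2.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((100000, 2)) @ L.T + mu

slope_theory = Sigma[0, 1] / Sigma[1, 1]
slope_fit = np.polyfit(X[:, 1], X[:, 0], 1)[0]
print(slope_theory, slope_fit)   # should agree closely
```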
2.2 Standard Chi-square distribution
We first fix the idea with a definition.
Definition 2.1. Let Z_1, Z_2, . . . , Z_d be a set of i.i.d. standard normally
distributed random variables. The sum of squares

    T = Z_1² + Z_2² + · · · + Z_d²

is said to have a chi-square distribution with d degrees of freedom.
For convenience of future discussion, we first put down a simple result
without proof here.

Theorem 2.2. Let Z_1, Z_2, . . . , Z_d be a set of i.i.d. standard normally
distributed random variables. The sum of squares

    T = a_1 Z_1² + a_2 Z_2² + · · · + a_d Z_d²

has a chi-square distribution if and only if each of a_1, . . . , a_d is either 0
or 1.
We use the notation χ²_d for the chi-square distribution with d degrees of
freedom. The above definition is how we understand the chi-square distribution.
Yet without seeing its probability density function and so on, we may only have
a superficial understanding.
To obtain the density function of T, we may work on the density function
of Z_1² first. It is seen that

    P(Z_1² ≤ x) = P(−√x ≤ Z_1 ≤ √x) = ∫_{−√x}^{√x} φ(t) dt.

Hence, by taking the derivative with respect to x, we get its p.d.f. as

    f_{Z_1²}(x) = {1/(2√π)} (x/2)^{1/2−1} exp(−x/2).
This is the density function of a Gamma distribution with shape parameter 1/2
and scale parameter 2. Because of this and from the additivity property of the
Gamma distribution, we conclude that T has a Gamma distribution with shape
parameter d/2 and scale parameter 2. Its p.d.f. is given by

    f_T(x) = {1/(2Γ(d/2))} (x/2)^{d/2−1} exp(−x/2).
Its moment generating function can also be obtained easily:

    M_T(t) = (1 − 2t)^{−d/2}.

Note that this function is defined only for t < 1/2. The mean of T is d, and
its variance is 2d.
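The mean and variance just stated are easy to confirm by simulating sums of squared standard normals, as in Definition 2.1. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(3)

# T = Z_1^2 + ... + Z_d^2 should have mean d and variance 2d.
d, reps = 5, 200000
Z = rng.standard_normal((reps, d))
T = (Z ** 2).sum(axis=1)

print(T.mean(), T.var())   # near d = 5 and 2d = 10
```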
Clearly, if X is N(µ, Σ) of length d and Σ has full rank, then W =
(X − µ)^τ Σ^{−1} (X − µ) has a chi-square distribution with d degrees of
freedom. The cumulative distribution function of the standard chi-square
distribution with (virtually) any degrees of freedom has been well investigated.
There used to be detailed numerical tables for its quantiles and so on. We have
easy-to-use R functions these days. Hence, whenever a statistic is found to
have a chi-square distribution, we consider its distribution known.

If A is a symmetric matrix such that AA = A, we say that it is idempotent.
In this case, when Z is N(0, I), the distribution of Z^τAZ is chi-square with
degrees of freedom equal to the trace of A.
2.3 Non-central chi-square distribution
We again first fix the idea with a definition.
Definition 2.2. Let Z_1, Z_2, . . . , Z_d be a set of i.i.d. standard normally
distributed random variables. The sum of squares

    T = (Z_1 + γ)² + Z_2² + · · · + Z_d²

is said to have a non-central chi-square distribution with d degrees of freedom
and non-centrality parameter γ².
Let

    T′ = (Z_1 − γ)² + Z_2² + · · · + Z_d²

with the same γ as in the definition. The distribution of T′ is the same as
the distribution of T. This can be proved as follows. Let W_1 = −Z_1 and
W_j = Z_j for j = 2, . . . , d. Clearly,

    T′ = (W_1 + γ)² + W_2² + · · · + W_d²

and W_1, W_2, . . . , W_d remain i.i.d. standard normally distributed. Hence, T
and T′ must have the same distribution. However, T ≠ T′ when they are
regarded as random variables on the same probability space.
The second remark is about the stochastic order of the two distributions.
Without loss of generality, γ > 0. When d = 1, for any x > 0 we find

    P{(Z_1 + γ)² ≥ x²} = 1 − Φ(x − γ) + Φ(−x − γ).

Taking the derivative with respect to γ, we get

    φ(x − γ) − φ(−x − γ) = φ(x − γ) − φ(x + γ) > 0.

That is, the above probability increases with γ over the range γ > 0, so
(Z_1 + γ)² is always more likely to take larger values than Z_1² is.

For convenience, let χ²_d and χ²_d(γ²) be two random variables with,
respectively, central and non-central chi-square distributions with the same
degrees of freedom d. We can show that for any x,

    P{χ²_d(γ²) ≥ x²} ≥ P{χ²_d ≥ x²}.
The proof of this result will be left as an exercise.
In data analysis, a statistic or random quantity T often has a central
chi-square distribution under one model assumption, say A, but a non-central
chi-square distribution under another model assumption, say B. Which model
assumption is better supported by the data? Due to the above result, a large
observed value of T supports B while a small observed value of T supports A.
This provides a basis for hypothesis testing. We set up a threshold value for
T so that we accept B when the observed value of T exceeds this value.
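The stochastic ordering above is easy to see in a Monte Carlo sketch (assuming NumPy is available; the choice of d, γ and thresholds below is arbitrary): at every threshold, the non-central variable exceeds it at least as often as the central one.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of P{chi^2_d(gamma^2) >= x} >= P{chi^2_d >= x}.
d, gamma, reps = 4, 1.5, 100000
Z = rng.standard_normal((reps, d))
central = (Z ** 2).sum(axis=1)
noncentral = (Z[:, 0] + gamma) ** 2 + (Z[:, 1:] ** 2).sum(axis=1)

for x in (2.0, 5.0, 10.0):
    p_nc = np.mean(noncentral >= x)
    p_c = np.mean(central >= x)
    print(x, p_nc, p_c)   # p_nc >= p_c at every threshold
```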
Let X be multivariate normal N(µ, I_d). Then X^τX has a non-central
chi-square distribution with non-centrality parameter µ^τµ. This can be proved
as follows. Without loss of generality, assume µ ≠ 0. Let A be an orthogonal
matrix whose first row equals µ/‖µ‖. Let

    Y = AX.

Write Y^τ = (Y_1, Y_2, . . . , Y_d). Then Y′_1 = Y_1 − ‖µ‖, Y_2, . . . , Y_d are
i.i.d. standard normal random variables. Hence,

    X^τX = Y^τY = (Y′_1 + ‖µ‖)² + Y_2² + · · · + Y_d²
has a non-central chi-square distribution with non-centrality parameter µ^τµ.
As an exercise, please show that if X is multivariate normal N(µ, Σ) with Σ
of full rank, then

    Q = X^τ Σ^{−1} X

has a non-central chi-square distribution with non-centrality parameter
γ² = µ^τ Σ^{−1} µ.
It can be verified that

    E(Q) = d + γ²;  var(Q) = 2(d + 2γ²).

When Σ = σ² I_d, then X^τX/σ² has a non-central chi-square distribution with
d degrees of freedom and non-centrality parameter γ² = ‖µ‖²/σ².
Suppose W_1 and W_2 are two independent non-central chi-square distributed
random variables with d_1 and d_2 degrees of freedom, and non-centrality
parameters γ_1² and γ_2². Then W_1 + W_2 is also non-central chi-square
distributed, with d_1 + d_2 degrees of freedom and non-centrality parameter
γ_1² + γ_2².
2.4 Cochran Theorem
We first look into a simple case.
Theorem 2.3. Suppose X is N(0, I_d) and that

    X^τX = X^τAX + X^τBX = Q_A + Q_B

such that both A and B are symmetric with ranks a and b respectively.
If a + b = d, then Q_A and Q_B are independent and have χ²_a and χ²_b
distributions.
Proof: By a standard linear algebra result, there exist an orthogonal matrix
R and a diagonal matrix Λ such that

    A = R^τ Λ R.

This implies

    B = I_d − A = R^τ (I_d − Λ) R
in which (I_d − Λ) is also diagonal.

The rank of A equals the number of non-zero entries of Λ and that of
B is the number of entries of Λ not equal to 1. Since a + b = d, this
necessitates that all entries of Λ are either 0 or 1. Without loss of
generality,

    Λ = diag(1, . . . , 1, 0, . . . , 0).

Note that the orthogonal transformation Y = RX makes the entries of Y
i.i.d. standard normal. Therefore,

    Q_A = Y^τ Λ Y = Y_1² + · · · + Y_a²,

which has a χ²_a distribution. Similarly,

    Q_B = Y^τ (I_d − Λ) Y = Y_{a+1}² + · · · + Y_d²,

which has a χ²_b distribution. In addition, they are quadratic forms of
disjoint segments of Y. Therefore, they are independent.
Remark: Since X^τAX = X^τA^τX, we have Q_A = X^τ{(A + A^τ)/2}X, in
which (A + A^τ)/2 is symmetric. Hence, we do not lose much generality by
assuming both A and B are symmetric. The result does not hold without the
symmetry assumption, though I cannot find references. Try

    A = [ 1  −1 ]      B = [ 0  1 ]
        [ 0   0 ],         [ 0  1 ].
Under the symmetry assumption, take it as a simple exercise to show that if

    X^τX = X^τA_1X + · · · + X^τA_pX = ∑_{j=1}^{p} Q_j

such that

    rank(A_1) + · · · + rank(A_p) = d,

then the Q_j's are independent, and each has a chi-square distribution with
rank(A_j) degrees of freedom.
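A simulation illustrates the theorem. The sketch below (assuming NumPy is available; the projection matrices are my own illustration, not from the notes) splits X^τX into two complementary idempotent quadratic forms and checks the claimed degrees of freedom and independence.

```python
import numpy as np

rng = np.random.default_rng(5)

# A = P (symmetric projection of rank a), B = I - P (rank d - a):
# Q_A ~ chi^2_a and Q_B ~ chi^2_{d-a}, independently.
d, a, reps = 6, 2, 100000
A = np.diag([1.0] * a + [0.0] * (d - a))   # projection onto first a coords
B = np.eye(d) - A

X = rng.standard_normal((reps, d))
QA = np.einsum('ij,jk,ik->i', X, A, X)     # row-wise quadratic forms
QB = np.einsum('ij,jk,ik->i', X, B, X)

print(QA.mean(), QA.var())         # near a = 2 and 2a = 4
print(QB.mean(), QB.var())         # near 4 and 8
print(np.corrcoef(QA, QB)[0, 1])   # near 0
```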
2.5 F- and t-distributions
If X and Y are independent and have chi-square distributions with m and n
degrees of freedom respectively, then the distribution of

    F = (X/m)/(Y/n)
is called the F-distribution with m and n degrees of freedom. Note that

    X/(X + Y) = (1 + Y/X)^{−1}

has a Beta distribution. Thus, there is a very simple relationship between the
F-distribution and the Beta distribution.
t-distribution. Suppose X has a standard normal distribution and S² has a
chi-square distribution with n degrees of freedom. When X and S² are
independent,

    t = X/√(S²/n)

has a t-distribution with n degrees of freedom.

When n = 1, this distribution reduces to the famous Cauchy distribution,
none of whose moments exist.
When n is large, S²/n converges to 1 in probability. Thus, the t-distribution
is not very different from the standard normal distribution. A general
consensus is that when n ≥ 20, it is good enough to regard the t-distribution
with n degrees of freedom as the standard normal in statistical inference.
2.6 Examples
In this section, we give a few commonly used distributional results in mathe-
matical statistics. Two examples are generally referred to as one-sample and
two-sample problems.
Example 2.1. Consider the normal location-scale model in which, for i =
1, . . . , n, we have

    Y_i = µ + σ ε_i

such that ε_1, . . . , ε_n are i.i.d. N(0, 1). Let Y be the corresponding vector
of Y_i's, which is multivariate normal with mean vector µ1, where
1^τ = (1, 1, . . . , 1), and covariance matrix σ²I. Similarly, we use ε for the
vector of ε_i's.
The sample variance can be written as

    s²_n = (n − 1)^{−1} Y^τ (I − n^{−1}11^τ) Y
         = (n − 1)^{−1} σ² ε^τ (I − n^{−1}11^τ) ε.

The key matrix (I − n^{−1}11^τ) is idempotent with trace n − 1. Hence, apart
from the factor (n − 1)^{−1}σ², the sample variance has a chi-square
distribution with n − 1 degrees of freedom: (n − 1)s²_n/σ² ∼ χ²_{n−1}.

In addition, the sample mean Ȳ_n = n^{−1}1^τY is uncorrelated with
(I − n^{−1}11^τ)Y. Hence, they are independent. This further implies that the
sample mean and the sample variance are independent.
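Both claims of this example can be confirmed numerically. A minimal sketch, assuming NumPy is available and with arbitrary µ, σ and n:

```python
import numpy as np

rng = np.random.default_rng(6)

# Example 2.1: (n-1) s_n^2 / sigma^2 ~ chi^2_{n-1}, independent of Y-bar.
n, mu, sigma, reps = 10, 3.0, 2.0, 100000
Y = rng.normal(mu, sigma, size=(reps, n))
ybar = Y.mean(axis=1)
s2 = Y.var(axis=1, ddof=1)
W = (n - 1) * s2 / sigma ** 2

print(W.mean(), W.var())             # near n-1 = 9 and 2(n-1) = 18
print(np.corrcoef(ybar, s2)[0, 1])   # near 0
```

Zero correlation alone does not prove independence, but for jointly normal-derived quantities it is the expected symptom of it.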
Example 2.2. Consider the classical two-sample problem in which we have
two i.i.d. samples from normal distributions with a common variance:
X^τ = (X_1, X_2, . . . , X_m) are i.i.d. N(µ_1, σ²) and Y^τ = (Y_1, Y_2, . . . , Y_n)
are i.i.d. N(µ_2, σ²). We are often interested in examining whether µ_1 = µ_2.

Let X̄_m and Ȳ_n be the two sample means. It is seen that

    RSS_0 = {mn/(m + n)} (X̄_m − Ȳ_n)²

is a quadratic form that represents the variation between the two samples. At
the same time,

    RSS_1 = ∑_{i=1}^{m} (X_i − X̄_m)² + ∑_{j=1}^{n} (Y_j − Ȳ_n)²

is a quadratic form that represents the internal variation within the two
samples. It is natural to compare the relative size of RSS_0 against RSS_1 to
decide whether the two means are significantly different. For this purpose, it
is useful to know their sampling distributions and independence relationship.
It is easy to directly verify that RSS_0 and RSS_1 are independent and both
have chi-square distributions. We may also find the decomposition

    X^τX + Y^τY = (m + n)^{−1}(X^τ1_m + Y^τ1_n)(1_m^τX + 1_n^τY) + RSS_1 + RSS_0.

The ranks of the three quadratic forms on the right-hand side are 1, m + n − 2
and 1, which sum to m + n. The decomposition remains the same when we
replace X by (X − µ)/σ and Y by (Y − µ)/σ. Hence, when µ_1 = µ_2 = µ,
RSS_0/σ² and RSS_1/σ² are independent and chi-square distributed, with 1 and
m + n − 2 degrees of freedom, by the Cochran Theorem.

This further implies that

    F = RSS_0 / {RSS_1/(m + n − 2)}

has an F-distribution with 1 and m + n − 2 degrees of freedom.

The F-distribution conclusion is the basis for the analysis of variance, the
two-sample t-test and so on.
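Under the null hypothesis µ_1 = µ_2, the statistic F above should follow an F-distribution with 1 and m + n − 2 degrees of freedom, whose mean is (m + n − 2)/(m + n − 4). A Monte Carlo sketch (assuming NumPy is available; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two-sample problem under H0 (mu1 = mu2 = 0, sigma = 1):
# F = RSS0 / {RSS1 / (m + n - 2)} should be F(1, m + n - 2).
m, n, reps = 8, 12, 200000
X = rng.normal(0.0, 1.0, size=(reps, m))
Y = rng.normal(0.0, 1.0, size=(reps, n))

rss0 = m * n / (m + n) * (X.mean(axis=1) - Y.mean(axis=1)) ** 2
rss1 = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
     + ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
F = rss0 / (rss1 / (m + n - 2))

print(rss0.mean())   # near 1: RSS0 ~ chi^2_1 under H0
print(F.mean())      # near (m + n - 2) / (m + n - 4) = 18/16
```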
2.7 Assignment problems
1. Let χ²_d and χ²_d(γ²) be two random variables with respectively central
and non-central chi-square distributions with the same degrees of freedom d.
Show that for any x,

    P{χ²_d(γ²) ≥ x²} ≥ P{χ²_d ≥ x²}.

Suppose χ²_n is a third random variable with a central chi-square distribution
and n degrees of freedom. When the two chi-square random variables in the
ratio are independent,

    F_{d,n}(γ²) = {χ²_d(γ²)/d} / {χ²_n/n}

is said to have a non-central F-distribution.

Show that for any x,

    P{F_{d,n}(γ²) ≥ x²} ≥ P{F_{d,n}(0) ≥ x²}.
2. Let Z be a multivariate normal N(0, I_d) random vector. Show that for
a symmetric matrix A,

    Z^τAZ

has a central chi-square distribution if and only if A² = A.
3. Let Z be a multivariate normal N(0, I_d) random vector. Under symmetry
assumptions on A_1, . . . , A_p, show that if

    Z^τZ = Z^τA_1Z + · · · + Z^τA_pZ = ∑_{j=1}^{p} Q_j

such that

    rank(A_1) + · · · + rank(A_p) = d,

then the Q_j's are independent, and each has a chi-square distribution with
rank(A_j) degrees of freedom.
Chapter 3

Exponential distribution families
In mathematical statistics, the normal distribution family plays a very
important role for its simplicity and because many distributions are well
approximated by a normal distribution. We have also seen that many other
useful distributions are derived from normal distributions.

There are many other commonly used distribution families in mathematical
statistics. Many of them have density functions that conform to a specific
algebraic structure. This algebraic structure in turn enables simple
statistical conclusions in data analysis. Hence, it is often useful to have
this structure discussed in mathematical statistics.
3.1 One parameter exponential distribution family

Consider a one parameter distribution family whose probability distributions
have density functions with respect to a common σ-finite measure. That is,

    {f(x; θ) : θ ∈ Θ ⊂ R}

with Θ being its parameter space.
Definition 3.1. Suppose there exist real valued functions η(θ), T(x), A(θ)
and h(x) such that

    f(x; θ) = exp{η(θ)T(x) − A(θ)} h(x).                    (3.1)

We say {f(x; θ) : θ ∈ Θ ⊂ R} is a one-parameter exponential family.
The definition does not give much insight into why this specific algebraic
form is of interest. Let us build some intuition from several examples.
Example 3.1. Suppose X_1, . . . , X_n are i.i.d. from Binomial(m, θ). Their
joint density (probability mass) function is given by

    f(x_1, . . . , x_n; θ) = ∏_{i=1}^{n} (m choose x_i) θ^{x_i} (1 − θ)^{m−x_i}.
Let

    T(X) = ∑ X_i,  T(x) = ∑ x_i

and

    h(x) = ∏_{i=1}^{n} (m choose x_i).
Then we find

    f(x_1, . . . , x_n; θ) = exp{T(x) log θ + (nm − T(x)) log(1 − θ)} h(x)
                           = exp{log{θ/(1 − θ)} T(x) + nm log(1 − θ)} h(x).

This conforms to the definition of a one-parameter exponential family with

    η(θ) = log{θ/(1 − θ)}

and

    A(θ) = −nm log(1 − θ).
As an exercise, you can follow this example to show that both the Negative
Binomial and Poisson distributions are one-parameter exponential families.
In the above example, η is called the log-odds because θ/(1 − θ) is the odds
of success against failure in typical binary experiments. It is equally useful
to “label” the Binomial distribution family by the log-odds. Note that

    θ = exp(η)/{1 + exp(η)}.

Hence, we may equivalently state that the joint density function of X is given
by

    g(x_1, . . . , x_n; η) = exp{η T(x) − nm log(1 + exp(η))} h(x).

This form also conforms to the definition of the one-parameter exponential
family.
Definition 3.2. Let X be a random variable or vector. The support of X, or
that of its distribution, is the set of all x such that for any δ > 0,

    P{X ∈ (x − δ, x + δ)} > 0.
For the sake of accuracy, a definition sometimes has to be abstract. The
support of X is intuitively the set of x such that X = x is a “possible event”.
When Z is N(0, 1), we have P(Z = z) = 0. Hence, we cannot interpret
“possible event” as a positive-probability event. The above definition first
expands x into a neighbourhood and then judges its “possibility”. Hence, the
support contains all x at which the density function is positive and
continuous.

We do not ask you to memorize this definition. Rather, we merely point
out that if two distributions belong to the same one-parameter exponential
family, then they have the same support. In comparison, a standard exponential
distribution has support [0, ∞) while a standard normal distribution has
support R, so they cannot belong to the same one-parameter exponential family.
Let us now show you another interesting property.
Example 3.2. Let us now consider the natural form of the one-parameter
exponential family:

    f(x_1, . . . , x_n; η) = exp{η T(x) − A(η)} h(x)

with η being a real value whose parameter space is an interval. The moment
generating function of T(X) is given by

    M_T(s) = E exp{s T(X)} = exp{A(η + s) − A(η)}.
This implies that

    E{T} = M′_T(0) = A′(η)

and

    E{T²} = M″_T(0) = A″(η) + {A′(η)}².

Hence,

    var(T) = A″(η).
This example shows that exponential families have some neat properties which
make them an interesting object to study.
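The identities E(T) = A′(η) and var(T) = A″(η) can be checked directly on a concrete family. A minimal sketch (standard library only), using the Poisson distribution in its natural form, where η = log λ, T(x) = x and A(η) = exp(η), so A′(η) = A″(η) = λ:

```python
import math

# Poisson(lam) in natural form: f(x; eta) = exp(eta*x - exp(eta)) / x!,
# so A(eta) = exp(eta) and both E(T) and var(T) should equal lam.
lam = 2.5
eta = math.log(lam)

def pmf(x):
    return math.exp(eta * x - math.exp(eta)) / math.factorial(x)

# Direct summation of the pmf (mass beyond x = 60 is negligible).
mean = sum(x * pmf(x) for x in range(60))
second = sum(x * x * pmf(x) for x in range(60))
var = second - mean ** 2

print(mean, var)   # both close to A'(eta) = A''(eta) = lam = 2.5
```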
3.2 The multiparameter case
We can practically copy the previous definition without any changes.

Definition 3.3. Suppose there exist vector valued functions η(θ), T(x), and
real valued functions A(θ) and h(x) such that

    f(x; θ) = exp{η^τ(θ) T(x) − A(θ)} h(x).                 (3.2)

We say {f(x; θ) : θ ∈ Θ ⊂ R^d} is a multi-parameter exponential family.

Without the above expansion, the exponential family would not even include
the normal distribution.
Example 3.3. Let X_1, X_2, . . . , X_n be i.i.d. with distribution N(µ, σ²).
Their joint density function is

    φ(x_1, . . . , x_n; µ, σ²)
      = (2π)^{−n/2} σ^{−n} exp{−∑_{i=1}^{n} (x_i − µ)²/(2σ²)}
      = (2π)^{−n/2} exp{(µ/σ²) ∑_{i=1}^{n} x_i − (1/(2σ²)) ∑_{i=1}^{n} x_i² − nµ²/(2σ²) − n log σ}.
We now regard θ as a vector made of µ and σ. The above density function
fits the definition (3.2) with the following functions:

    η(θ) = (µ/σ², −1/(2σ²))^τ,
    T(x) = (∑ x_i, ∑ x_i²)^τ,
    A(θ) = nµ²/(2σ²) + n log σ,
    h(x) = (2π)^{−n/2}.
Recall the Binomial distribution example. We had the joint density function

    f(x_1, . . . , x_n; θ) = exp{T(x) log θ + (nm − T(x)) log(1 − θ)} h(x).

It can also be regarded as a multi-parameter exponential family with d = 2
and

    η = (log θ, log(1 − θ))^τ;  T_new(x) = (T(x), nm − T(x))^τ.

The parameter space in terms of values of η is a curve in R², which does not
contain any non-empty open subset of R². We generally avoid having
distribution families with such degenerate parameter spaces.

As an exercise, one can verify that the two-parameter Gamma distribution
family is a multi-parameter exponential family.
3.3 Other properties
Suppose X_1 and X_2 both have distributions belonging to some exponential
families and they are independent. Then their joint distribution also belongs
to an exponential family.

By the factorization theorem, T(X) in an exponential family is a sufficient
statistic. It is also a complete statistic when the family does not degenerate.
The distribution of T itself belongs to an exponential family.
Definition 3.4. Let T be a k-dimensional vector valued function and h a
real valued function. The canonical k-dimensional exponential family generated
by T and h is

    g(x; η) = exp{η^τ T(x) − A(η)} h(x).
The parameter space for η is the set of all η ∈ R^k such that
exp{η^τ T(x)} h(x) has a finite integral with respect to the corresponding
σ-finite measure.

We call this parameter space, E, the natural parameter space. We call T
and h the generators.

Because the integral of a density function equals 1, the integral of
exp{η^τ T(x)} h(x) equals exp{A(η)} if it is finite. Hence, the natural
parameter space E contains all η at which A(·) is well defined.
Definition 3.5. We say that an exponential family F is of rank k if and only
if the generating statistic T is k-dimensional and 1, T_1, . . . , T_k are
linearly independent with positive probability. That is,

    P(a_0 + ∑_{j=1}^{k} a_j T_j = 0; η) < 1

for some η unless the non-random coefficients satisfy
a_0 = a_1 = · · · = a_k = 0.

In the above definition, we only need to verify the probability inequality
for one η value. If it is satisfied for one η value, then it is satisfied for
every other η value.
Theorem 3.1. Suppose F = {g(x; η) : η ∈ E} is a canonical exponential
family generated by (T, h) with natural parameter space E such that E is
open. Then the following are equivalent:
(a) F is of rank k.
(b) var(T; η) is positive definite.
(c) η is identifiable: g(x; η1) ≡ g(x; η2) for all x implies η1 = η2.
These discussions on the exponential family suffice for the moment, so we
move to the next topic.
3.4 Assignment problems
1. Show that both the Negative Binomial and Poisson distributions are
one-parameter exponential families.

2. Show that the family of uniform distributions

    f(x; θ) = θ^{−1} 1(0 < x < θ)

over θ ∈ R⁺ is not a one-parameter exponential family.

3. (a) Show that the two-parameter Gamma distribution family is a
multi-parameter exponential family, and select a T so that var(T; η) is
positive definite.

(b) Show that the multinomial distribution family with a fixed number of
trials n is a multi-parameter exponential family, and select a T such that
var(T; η) is positive definite.
Chapter 4
Optimality criteria of point
estimation
A general setting of mathematical statistics is: we are given data x
believed to be the observed value of a random object X. The probability
distribution of X will be denoted F* and F* is believed to be a member
of a distribution family F. Based on the fact that X has observed value
x, we identify a single F, or a set of them, in F which might be the “true” F*
that describes the probability distribution of X.

There are many serious fallacies related to the above thinking. The first
one is the specification of F, which is referred to as a model in this course.
If a specific form of F is given, how certain are we that F* is a member of F?
Even if the distribution of X is a member of F, X may not be accurately
observed. What we have recorded may be Y = X + ε. Hence, we may unknowingly
work on the distribution of Y instead of that of X.

In this course, we do not discuss these possible fallacies but leave them to
applied courses. We take the approach that if the distribution of X is indeed
a member of F and x is its accurate observed value, what can we say about F*?
Also, we often study the situation where X consists of i.i.d. replications of
some random system so that X = (X_1, . . . , X_n). The model for the
distribution of X is then determined by the model for X_1, which is
representative of every X_i, i = 1, 2, . . . , n. We say that X_1, . . . , X_n is
a random or an i.i.d. sample from a population/distribution F in F. In this
case n is referred to as
the sample size. With many replications, or when n → ∞, we should be able to
learn a lot more about F*.
4.1 Point estimator and some optimality criteria

Let θ be a parameter in the probability model F and suppose we have a
random sample X. The parameter space is, loosely, Θ = {θ : θ = g(F), F ∈ F}
for some functional g.

Definition 4.1. A point estimator of θ is a statistic T whose range is Θ.
The realized value of T, T(x), is an estimate of θ.

We generally allow, at the least, T to take values on the smallest closed set
containing Θ, that is, to take values at limiting points of Θ.
The definition implies that, as an estimator, T(X) is regarded as a
mechanism/rule mapping X to Θ; as an estimate, T(x) is a value in Θ
corresponding to the data x. In both cases, we may use θ̂ as their common
notation.

One must realize that T(X) = 0 is an estimator of θ as long as 0 ∈ Θ. Hence,
we can always estimate the parameter in any statistical model, no matter how
complex the model is. However, we may not be able to find an estimator with
satisfactory precision or certain desired properties.
Suppose the parameter space is a subset of R^d for some integer d. Then
T(X) takes values in R^d. The distribution of X is given by an F ∈ F, or
equivalently by a c.d.f. F(x; θ) or p.d.f. f(x; θ). Hence, T(X) has a
distribution induced by F(x; θ), or simply by θ. To fix the idea, we assume
the “true” parameter value of F is θ, the generic θ. When θ̂ = T(X) has finite
expectation under any θ, we define

    bias(θ̂) = E{T(X); θ} − θ

as the bias of θ̂ = T(X) when it is used as an estimator of θ and the true
parameter value is θ.
Definition 4.2. Suppose X has a distribution F ∈ F which is parameterized
by θ ∈ Θ. If T(X) is an estimator of θ such that

    E{T(X); θ} = θ

for all θ ∈ Θ, then we say T(X) is an unbiased estimator of θ.

For some reason, statisticians and others prefer estimators that are unbiased.
This is not always well justified.
Example 4.1. Suppose X has a binomial distribution with parameters n and
θ, where n is known and θ is an unknown parameter.

A commonly used estimator for θ is

    θ̂ = X/n.

An estimator motivated by the Bayesian approach is

    θ̃ = (X + 1)/(n + 2).

It is seen that E{θ̂; θ} = θ. Hence, θ̂ is an unbiased estimator.

We find that, other than at θ = 0.5,

    bias(θ̃) = (1 − 2θ)/(n + 2) ≠ 0.

Hence, θ̃ is a biased estimator.

Which estimator makes more sense to you?
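Both biases can be computed exactly from the binomial pmf; no simulation is needed. A minimal sketch (standard library only; n and θ are arbitrary illustrations):

```python
from math import comb

# Exact bias of theta_hat = X/n and theta_tilde = (X+1)/(n+2) by summing
# estimator(x) * P(X = x) over the binomial pmf.
def bias(estimator, n, theta):
    mean = sum(estimator(x) * comb(n, x) * theta**x * (1 - theta)**(n - x)
               for x in range(n + 1))
    return mean - theta

n, theta = 20, 0.3
b_hat = bias(lambda x: x / n, n, theta)
b_tilde = bias(lambda x: (x + 1) / (n + 2), n, theta)

print(b_hat)                              # 0 up to rounding: unbiased
print(b_tilde, (1 - 2*theta)/(n + 2))     # matches (1 - 2*theta)/(n + 2)
```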
In the above example, the bias of θ̃ has limit 0 as n goes to infinity.
Often, we discuss situations where the data set contains n i.i.d. observations
from a distribution F which is a member of F. The above result indicates that
even though θ̃ is biased, the size of the bias diminishes as the sample size
n gets large. Many of us tend to declare that θ̃ is asymptotically unbiased
when this happens.

While we do not feel such a notion of “asymptotically unbiased” is wrong,
this terminology is often abused. In the statistical literature, people may use
this term when

    √n (θ̂ − θ)
has a limiting distribution whose mean is zero. In this case, the bias of θ̂
does not necessarily go to zero.
To avoid such confusion, let us give a formal definition.
Definition 4.3. Suppose there is an index n such that X_n has a distribution
in F_n and a_n → ∞ as n → ∞, while the parameter space Θ of F_n does not
depend on n. Let θ be the true parameter value and θ̂_n be an estimator (a
sequence of estimators). If

    a_n(θ̂_n − θ)

has a limiting distribution whose expectation is zero for every θ ∈ Θ, then we
say θ̂_n is asymptotically rate-a_n unbiased.

Most often, we take a_n = n^{1/2} in the above definition. We do not have
good reasons to require an estimator to be unbiased. Yet we feel that being
asymptotically unbiased for some a_n is a necessity. When n → ∞ in common
settings, the amount of information about which F is the right one becomes
infinite. If we cannot get the estimate right in this situation, the estimation
method is likely very poor.
The variance of an estimator is an equally important criterion in judging an
estimator: having a lower variance implies the estimator is more accurate. In
fact, let ϕ(·) be a convex function. Then an estimator is judged superior if

    E{ϕ(θ̂ − θ)}

is smaller. When ϕ(x) = x², the above criterion becomes the Mean Squared
Error:

    mse(θ̂) = E{(θ̂ − θ)²}.

It is seen that

    mse(θ̂) = bias²(θ̂) + var(θ̂).

To achieve a lower mse, the estimator must balance the loss due to variation
against that due to bias.
Similar to asymptotic bias, it helps to give definite notions of asymptotic
variance and mse of an estimator.
Definition 4.4. Suppose there is an index n such that X_n has a distribution
in F_n and a_n → ∞ as n → ∞, while the parameter space Θ of F_n does not
depend on n. Let θ be the true parameter value and θ̂_n be an estimator (a
sequence of estimators). Suppose

    a_n(θ̂_n − θ)

has a limiting distribution with mean B(θ) and variance σ²(θ), for θ ∈ Θ.
We say θ̂_n has asymptotic bias B(θ) and asymptotic variance σ²(θ) at
rate a_n.

Furthermore, we define the asymptotic mse at rate a_n as σ²(θ) + B²(θ).
Unfortunately, the mse is often a function of θ. In any specific application,
the “true value” of θ behind X is not known. Hence, it is generally not
possible to find an estimator with the smallest variance or mse no matter which
value of θ is the true value.
Example 4.2. Suppose X_1, X_2, . . . , X_n form an i.i.d. sample from N(θ, 1)
with Θ = R.

Define θ̂ = n^{−1} ∑ X_i and θ̃ = 0.

It is seen that var(θ̂) = n^{−1} > var(θ̃) = 0 for every θ ∈ R. However, no
one would be happy to use θ̃ as his/her estimator.

In addition, mse(θ̂) = n^{−1} > mse(θ̃) = θ² for all |θ| < n^{−1/2}. Hence,
even if we use a more sensible performance criterion, it still does not follow
that our preferred sample mean is indisputably the superior estimator.
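The mse comparison is easy to see numerically. A minimal Monte Carlo sketch (assuming NumPy is available; n and the two θ values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(8)

# Example 4.2: mse(theta_hat) = 1/n for every theta, while the constant
# estimator theta_tilde = 0 has mse = theta^2, so the silly estimator
# "wins" whenever |theta| < n^{-1/2}.
n, reps = 25, 200000
mse_hat = {}
for theta in (0.1, 1.0):
    X = rng.normal(theta, 1.0, size=(reps, n))
    mse_hat[theta] = np.mean((X.mean(axis=1) - theta) ** 2)
    print(theta, mse_hat[theta], theta ** 2)
# mse_hat stays near 1/25 = 0.04; theta^2 = 0.01 is smaller at theta = 0.1
# but theta^2 = 1.0 is much larger at theta = 1.0
```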
4.2 Uniformly minimum variance unbiased estimator

This section contains some material that most modern statisticians believe
should not be included in statistics classes. Yet we feel a quick discussion
is still a good idea.
Any of the bias, var and mse can be used to compare the performance of the
estimators we can think of. Yet without any performance measure, how can
statisticians recommend any method to scientists? This is the same problem as
when professors are asked to recommend their students. Everyone is unique.
Simplistically declaring one of them the best will draw more criticism than
praise. Yet at the least, we can timidly say that one of the students has the
highest average mark in mathematics courses, in this term, among all students
with green hair, and so on.
Definition 4.5. Suppose X is a random sample from F with parameter θ ∈ Θ.
An unbiased estimator θ̂ is a uniformly minimum variance unbiased estimator of
θ, UMVUE, if for any other unbiased estimator θ̃ of θ,

    var_θ(θ̂) ≤ var_θ(θ̃)

for all θ ∈ Θ.
In the above definition, we added the subscript θ to highlight the fact that
the variance calculation is based on the assumption that the distribution of X
has true parameter value θ. We do not always do so in other parts of these
course notes.

Upon the introduction of the UMVUE, an urgent question to be answered is
its existence. The answer is positive, at least in textbook examples.
Example 4.3. Suppose X_1, X_2, . . . , X_n form an i.i.d. sample from a
Poisson distribution with mean parameter θ and parameter space Θ = R⁺.

Let θ̂ = X̄_n = n^{−1} ∑ X_i. It is easily seen that θ̂ is an unbiased
estimator of θ.

Suppose that θ̃ is another unbiased estimator of θ. Because X̄_n is a complete
and sufficient statistic, we find that

    θ̆ = E{θ̃ | X̄_n}

is a function of the data only. Hence, it is an estimator of θ. Using the
formula that for any two random variables,
var(Y) = E{var(Y|Z)} + var{E(Y|Z)}, we find

    var(θ̆) ≤ var(θ̃).

Furthermore, this estimator is also unbiased. Hence,

    E{θ̂ − θ̆} = 0
for all θ ∈ R⁺. Because both estimators are functions of X̄_n, the
completeness of X̄_n implies

    θ̂ = θ̆.

Hence,

    var(θ̂) = var(θ̆) ≤ var(θ̃).

Therefore, X̄_n is the UMVUE.
Now, among all estimators of θ that are unbiased, the sample mean has
the lowest possible variance. If UMVUE is a criterion we accept, then the
sample mean is the best possible estimator under the Poisson model for the
mean parameter θ.
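As a quick numerical illustration of this optimality claim, the following simulation sketch checks that the sample mean is unbiased with variance θ/n under the Poisson model. The parameter values, the seed, and the Knuth-style Poisson sampler are illustrative choices, not part of the notes:

```python
import math
import random

random.seed(1)

def rpois(theta):
    # Knuth's method for Poisson draws; adequate for moderate theta
    L = math.exp(-theta)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

theta, n, reps = 2.0, 50, 10000
means = [sum(rpois(theta) for _ in range(n)) / n for _ in range(reps)]
avg = sum(means) / reps
var = sum((m - avg) ** 2 for m in means) / reps
print(round(avg, 2), round(var, 3))   # avg near theta = 2, var near theta/n = 0.04
```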
Why is such a beautiful conclusion out of fashion these days? Some of the considerations are as follows. In real-world applications, having a sample strictly i.i.d. from a Poisson distribution is merely a fantasy. If so, why should we bother? Our defence is as follows. If the sample mean is optimal in the sense of UMVUE under the ideal situation, it is likely a superior estimator even if the situation deviates slightly from the ideal. In addition, the optimality consideration is a good way of thinking.
Suppose λ = 1/θ, which is called the rate parameter under the Poisson model assumption. How would you estimate λ? Many will suggest that 1/X¯n is a good candidate estimator. Sadly, this estimator is biased and has infinite variance! Lastly, in modern applications, we rarely work with such simplistic models. In these cases, it is nearly impossible to have a UMVUE. If so, we probably should not bother our students with such technical notions.
4.3 Information inequality
At least in textbook examples, some estimators are fully justified as optimal. This implies that there is an intrinsic limit on the precision an estimator can achieve.
Let X be a random variable modelled by F or more specifically a para-
metric family f(x; θ). Let T (X) be a statistic with finite variance given any
θ ∈ Θ. Denote

ψ(θ) = E{T(X); θ} = ∫ T(x) f(x; θ) dx
where the Lebesgue measure can be replaced by any other suitable measure. Suppose some regularity conditions on f(x; θ) are satisfied so that the following manipulations are valid. Taking derivatives with respect to θ on both sides of the equality, we find
ψ′(θ) = ∫ T(x) f′(x; θ) dx = ∫ T(x) s(x; θ) f(x; θ) dx
where

s(x; θ) = f′(x; θ)/f(x; θ) = ∂/∂θ {log f(x; θ)}.
It is seen that

∫ s(x; θ) f(x; θ) dx = ∫ f′(x; θ) dx = (d/dθ) ∫ f(x; θ) dx = 0.
We define the Fisher information

I(θ) = E[∂/∂θ {log f(X; θ)}]^2 = E{s(X; θ)}^2.
Hence, by the Cauchy–Schwarz inequality,

{ψ′(θ)}^2 = [∫ {T(x) − ψ(θ)} s(x; θ) f(x; θ) dx]^2
≤ ∫ {T(x) − ψ(θ)}^2 f(x; θ) dx × ∫ {s(x; θ)}^2 f(x; θ) dx
= var(T(X)) I(θ).
This leads to the following theorem.
Theorem 4.1. Cramér–Rao information inequality. Let T(X) be any statistic with finite variance for all θ ∈ Θ. Under some regularity conditions,

var(T(X)) ≥ {ψ′(θ)}^2 / I(θ)

where ψ(θ) = E{T(X); θ}.
If T(X) is unbiased for θ, then ψ′(θ) = 1. Therefore, var(T) ≥ I^{-1}(θ). When I(θ) is larger, the variance of T can be smaller. Hence, I(θ) indeed measures the information content in the data X with respect to θ. For convenience of reference, we call I^{-1}(θ) the information lower bound for estimating θ.
In assignment problems, X is often made of n i.i.d. observations from f(x; θ). Let X1 be one component of X. It is a simple exercise to show that

I(θ; X) = nI(θ; X1)

in the obvious notation. We need to pay attention to what I(θ) stands for on many occasions: it could be the information contained in a single X1, but it could also be the information contained in the i.i.d. sample X1, . . . , Xn.
Example 4.4. Suppose X1, X2, . . . , Xn form an i.i.d. sample from Poisson
distribution with mean parameter θ and the parameter space is Θ = R+.
The density function of X1 is given by

f(x; θ) = P(X1 = x; θ) = (θ^x/x!) exp(−θ).
Hence,

s(x; θ) = x/θ − 1

and the information in X1 is given by

I(θ) = E{X/θ − 1}^2 = 1/θ.
Therefore, for any unbiased estimator Tn of θ based on the whole sample, we
have
var(Tn) ≥ 1/{nI(θ)} = θ/n.
Since the sample mean is unbiased and has variance θ/n, it is an estimator
that attains the information lower bound.
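The computation I(θ) = 1/θ can be checked numerically by summing s(x; θ)^2 against the Poisson p.m.f. A sketch with an illustrative θ; the truncation at 200 terms is chosen so the neglected tail is negligible:

```python
import math

def poisson_pmf(x, theta):
    # evaluated in log space to avoid overflow for large x
    return math.exp(x * math.log(theta) - theta - math.lgamma(x + 1))

theta = 2.0
# Fisher information in one observation: E{s(X; theta)^2} with s(x; theta) = x/theta - 1
info = sum((x / theta - 1) ** 2 * poisson_pmf(x, theta) for x in range(200))
print(round(info, 6))   # 0.5, i.e. 1/theta
```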
The definition of Fisher information depends on how the distribution
family is parameterized. If η is a smooth function of θ, the Fisher information
with respect to η is not the same as the Fisher information with respect to
θ.
As an exercise, find the information lower bound for estimating η = exp(−θ) under the Poisson model. Derive its UMVUE given n i.i.d. observations.
4.4 Other desired properties of a point estimator
Given a data set from an assumed model F, we often ask, or are asked, whether a certain aspect of F can be estimated. This can be the mean or the median of F, where F is any member of F. In general, we may write the parameter as θ = θ(F), a functional defined on F.
Definition 4.6. Obsolete Concept of Estimability. Suppose the data set
X is a random sample from a model F and suppose θ = θ(F ) is a parameter.
We say θ is estimable if there exists a function T (·) such that
E(T (X);F ) = θ(F )
for all F ∈ F .
In other words, a parameter is estimable if we can find an unbiased estima-
tor for this parameter. We can give many textbook examples of estimability.
In contemporary applications, we are often asked to “train” a model given
a data set with very complex structure. In this case, we do not even have
a good description of F . Because of this, being estimable for a useful func-
tional on F is a luxury. We have to give up this concept but remain aware
of such a definition.
It is not hard to give an example of un-estimable parameters according to the above definition, though the example can be overly technical. Instead, we show that there is a basic requirement for a parameter to be estimable.
Definition 4.7. Identifiability of a statistical model. Let F be a para-
metric model in statistics and Θ be its parameter space. We say F is iden-
tifiable if for any θ1, θ2 ∈ Θ,
F (x; θ1) = F (x; θ2)
for all x implies θ1 = θ2.
A necessary condition for a parameter θ to be estimable is that θ is identifiable. Otherwise, suppose F(x; θ1) = F(x; θ2) for all x but θ1 ≠ θ2. For any estimator θˆ, we cannot have both

E{θˆ; θ1} = θ1; E{θˆ; θ2} = θ2

because the two expectations are equal while θ1 ≠ θ2.
Definition 4.8. Proposed notion of estimability. Let F be a parametric model in statistics and Θ be its parameter space. Suppose the sampling plan under consideration may be regarded as one of a sequence of sampling plans indexed by n, with sample Xn from F. If there exists an estimator Tn, a function of Xn, such that

P(|Tn − θ| ≥ ε; θ) → 0

for any θ ∈ Θ and ε > 0 as n → ∞, then we say θ is (asymptotically) estimable.
The sampling plans we have in mind include obtaining i.i.d. observations, obtaining observations of a time series of extended length, and so on. This definition makes sense, but we will not be surprised if it draws serious criticisms.
Example 4.5. Suppose we have an i.i.d. sample of size n from a Poisson distribution. Let λ be the rate parameter. It is seen that λ is asymptotically estimable because

P(|1/(n^{-1} + X¯n) − λ| > ε) → 0

as n → ∞, where X¯n is the sample mean.
In this example, I have implicitly regarded “having an i.i.d. sample of size n” as a sequence of sampling plans. If one cannot obtain more and more i.i.d. observations from this population, then asymptotic estimability does not make a lot of sense.
Suppose two random variables are related by Y = (5/9)(X − 32), as is the case when Y and X are the temperatures measured in Celsius and Fahrenheit. Given measurements X1, X2, . . . , Xn on a random sample from some population, it is most sensible to estimate the mean temperature by X¯n, the sample mean of X. If one measures the temperatures in Celsius to get Y1, . . . , Yn on the same random sample, we should estimate the mean by Y¯n, the sample mean of Y. Luckily, we have Y¯n = (5/9)(X¯n − 32), so some internal consistency is maintained. Such a desirable property is termed equivariance, and is sometimes also called invariance. See Lehmann for references.
On another occasion, one might be interested in estimating the mean parameter µ in a Poisson distribution. This parameter tells us the average number of events occurring in a time period of interest. At the same time, one might be interested in knowing the chance that nothing happens in the period, which is exp(−µ). Let X¯n be the sample mean of the numbers of events over n distinct periods of time. We naturally estimate µ by X¯n and exp(−µ) by exp(−X¯n). If so, we find

ĝ(µ) = g(µˆ)

with g(x) = exp(−x). This is a property most of us will find desirable. When an estimator satisfies the above property, we say it is invariant.
Rigorous definitions of equivariance and invariance can be lengthy. We
will be satisfied with a general discussion as above.
In the Poisson distribution example, it is seen that
E{exp(−X¯n)} = exp{nµ[exp(−1/n)− 1]} 6= exp(−µ).
Hence, the most natural estimator of exp(−µ) is not unbiased.
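The displayed identity can be verified numerically: ∑ Xi is Poisson(nµ), so E{exp(−X¯n)} can be computed by direct summation. The values of µ and n below are illustrative, and the truncation at 400 terms is numerically safe here:

```python
import math

def poisson_pmf(k, lam):
    # log-space evaluation to avoid overflow in lam ** k
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

mu, n = 1.5, 10
# the sum of n i.i.d. Poisson(mu) variables is Poisson(n * mu); Xbar = S / n
exact = sum(math.exp(-k / n) * poisson_pmf(k, n * mu) for k in range(400))
formula = math.exp(n * mu * (math.exp(-1 / n) - 1))
print(abs(exact - formula) < 1e-12)        # the displayed formula holds
print(abs(exact - math.exp(-mu)) > 0.01)   # exp(-Xbar) is visibly biased
```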
The UMVUE of exp(−µ) is given by E{1(X1 = 0)|X¯n}. The UMVUE of
µ is given by X¯n. Thus, the UMVUE is not invariant when the population
is the Poisson distribution family. As a helpful exercise for improving one’s
technical strength, work out the explicit expression of E{1(X1 = 0)|X¯n}.
4.5 Consistency and asymptotic normality
A point estimator is a function of the data, and the data are a random sample from a distribution/population that is a member of a distribution family. Hence, the estimator is random in general: it does not take a fixed value with probability one. In other words, we can never be completely sure about the unknown parameter. However, when the sample size increases, we gain more and more information about the underlying population. Hence, we should be able to decide what the “true” parameter value is with higher and higher precision.
Definition 4.9. Let θˆn be an estimator of θ based on a sample of size n from a distribution family {F(x; θ) : θ ∈ Θ}. We say that θˆn is weakly consistent if, as n → ∞, for any ε > 0 and θ ∈ Θ,

P(|θˆn − θ| ≥ ε; θ) → 0.
In comparison, we have a stronger version of consistency.
Definition 4.10. Let θˆn be an estimator of θ based on a sample of size n from a distribution family {F(x; θ) : θ ∈ Θ}. We say that θˆn is strongly consistent if, for any θ ∈ Θ,

P(lim_{n→∞} θˆn = θ; θ) = 1.
Here are a few remarks that one should not take too seriously but are worth pointing out. First, the i.i.d. structure in the above definitions is not essential. However, it is not easy to give a more general and rigorous definition without this structure. Second, consistency is not really a property of one estimator but of a sequence of estimators. Unless θˆn for all n are constructed based on the same principle, consistency is hardly relevant in applications: your n is far from infinity. For this reason, there is a more sensible definition called Fisher consistency. To avoid too much technicality, it is mentioned but not spelled out here. Lastly, when we say an estimator is consistent, we mean weakly consistent unless otherwise stated.
The next topic is asymptotic normality. It is in fact better described as finding limiting distributions. Suppose θˆn is an estimator of θ based on n i.i.d. observations from some distribution family. The precision of this estimator can be judged by its bias, variance, mean squared error, and so on. Ultimately, the precision of θˆn is captured by its sampling distribution. Unfortunately, the sampling distribution of θˆn is often not easy to work with directly. At the same time, when n is very large, the distribution of its standardized version stabilizes. This is the limiting distribution. If we regard the limiting distribution as the sampling distribution of θˆn, the error is not so large; that is, the error diminishes as n increases. For this reason, statisticians are fond of finding limiting distributions.
Definition 4.11. Let Tn be a sequence of random variables. We say that the distribution of Tn converges to that of T if

lim_{n→∞} P(Tn ≤ t) = P(T ≤ t)

for all t ∈ R at which F(t) = P(T ≤ t) is continuous.
In this definition, Tn is just a sequence of random variables; it may involve unknown parameters in specific examples. The index n need not be the sample size in a typical setup. The multivariate case will not be given here. In typical applications, the limiting distribution concerns asymptotic normality.
Example 4.6. Suppose we have an i.i.d. sample X1, . . . , Xn from a distribution family F. A typical estimator for F(t), the cumulative distribution function of X, is the empirical distribution

Fn(t) = n^{-1} ∑_{i=1}^n 1(Xi ≤ t).

For each given t, nFn(t) has a binomial distribution. At the same time,

√n {Fn(t) − F(t)} →d N(0, σ^2)

with σ^2 = F(t){1 − F(t)} as n → ∞.
Remark: in this example, we have a random variable on one side but a distribution on the other. This is interpreted as: the sequence of distributions of the random variables, indexed by n, converges to the distribution specified on the right-hand side.
As an exercise, one can work out the following example.
Example 4.7. Suppose we have an i.i.d. sample X1, . . . , Xn from a uniform distribution family F such that F(x; θ) is uniform on (0, θ) and Θ = R+. Define

θˆn = max{X1, X2, . . . , Xn},

which is often denoted as X(n) and called the largest order statistic. It is well known that

n{θ − θˆn} →d Exp(θ),

the exponential distribution with mean θ. Namely, the limiting distribution is exponential.
Is θˆn asymptotically unbiased at rate √n, at rate n?
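A small simulation sketch makes the exponential limit visible. The values of θ, n, the seed, and the number of repetitions are illustrative; we compare one tail probability of the simulated n(θ − X(n)) with the Exp(mean θ) value:

```python
import math
import random

random.seed(7)
theta, n, reps = 3.0, 200, 10000
# draws of n * (theta - X_(n)) for Unif(0, theta) samples
vals = [n * (theta - max(random.uniform(0, theta) for _ in range(n)))
        for _ in range(reps)]
# for an exponential with mean theta, P(V > theta) = exp(-1)
frac = sum(v > theta for v in vals) / reps
print(round(frac, 3), round(math.exp(-1), 3))   # both near 0.368
```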
4.6 Assignment problems
1. Let X1, X2, ..., Xn be an i.i.d. sample from the Uniform distribution Unif(0, θ). Define θˆn = max{X1, X2, . . . , Xn}, which is often denoted as X(n) and called the largest order statistic.
Find the limiting distribution of n(θ − θˆn) as n→∞.
Is θˆn asymptotically unbiased at rate √n, at rate n?
2. Let X1, X2, ..., Xn be an i.i.d. random sample from Poisson (θ), and let
η = exp(−θ). From the previous assignment, we find that the UMVUE
for η is given by
ηˆ = (1 − 1/n)^{nX¯}.
(a) Following Definition 4.9 as given in the Lecture Notes, prove that ηˆ is weakly consistent, i.e., prove that, for any ε > 0 and θ > 0,
P (|ηˆ − η| > ε; η)→ 0,
as n→∞.
(b) Conduct a simulation study to find the probability in part (a). Let ε = 0.01, η = exp(−1), and repeat the simulation with sample sizes n = 100 and 1000, with N = 20000 repetitions. Report your findings.
3. Let X1, X2, ..., Xn be an i.i.d. sample from the following mixture model, with density function

f(x; λ, π) = (1 − π) exp(−x) + π λ^{-1} exp(−x/λ), x > 0.
Suppose we observe the sample data
0.61683384, 0.49301343, 0.08751571, 6.32112518, 1.46224603,
0.17420356, 1.07460011, 0.18795447, 2.01524287, 0.83013365,
0.04476622, 2.01365679, 1.63824658, 0.01627277, 5.71925356,
3.85095169, 0.75024996, 1.26231923, 0.70529060, 1.66594757
(a) Derive an analytical expression for the moment estimates of the parameters λ and π.
(b) Obtain their numerical values.
4. Given a positive constant k, we define a function for the purpose of M-estimation:

ϕ(x; θ) = (x − θ)^2 if |x − θ| ≤ k; k^2 if |x − θ| > k.

(a) The M-estimator θˆ of θ is the value at which Mn(θ) = ∑_{i=1}^n ϕ(Xi; θ) is minimized.
Assume that no i makes |Xi − θˆ| = k, where θˆ is the solution to the optimization problem.
Show that θˆ is the mean of the Xi such that |Xi − θˆ| < k.
(b) Given the sample data
1.551 -1.170 -0.201 1.143 0.138 3.103 1.455 -2.121 -1.672 6.150
and that k = 2.0, calculate the value of the M-Estimate defined in part
(a).
5. Let X1, X2, ..., Xn be an i.i.d. random sample from the Exponential distribution Exp(θ) with mean θ (the density is f(x; θ) = θ^{-1} exp(−x/θ)). Denote by X(1), X(2), . . . , X(n) the corresponding order statistics for this random sample. Then Wk = X(k+1) − X(k), 1 ≤ k ≤ n − 1, are called the spacings of the order statistics. By convention, define W0 = X(1), the first order statistic.
(a) It is known that W0, W1, ..., Wn−1 are independent of each other, with

Wk ∼ Exp(θ/(n − k)),

for k = 0, 1, . . . , n − 1. Verify this result for the case where n = 2.
(b) Let Tn = X(1) + X(2) + · · · + X(k) + (n − k)X(k).
Suppose n = 10, k = 8; find the mean and variance of this statistic Tn.
(c) Suppose n = 10, k = 8, and using your result from part (b), work out an unbiased L-estimator for the parameter θ.
Chapter 5
Approaches of point estimation
Even though any statistic with a proper range is a point estimator, we generally prefer estimators derived from some principles. This leads to a few common estimation procedures.
5.1 Method of moments
Suppose F is a parametric distribution family so that it permits a general
expression
F = {F (x; θ) : θ ∈ Θ}
such that Θ ⊂ Rd for some positive integer d. We assume the parameter is
identifiable.
In most classical examples, the distributions are labelled smoothly by θ: two distributions having close parameter values are similar in some metric. In addition, the first d moments are smooth functions of θ. They map Θ to R^d in a one-to-one fashion: different θ values lead to different first d moments.
Suppose we have an i.i.d. sample X1, . . . , Xn of size n from F and X is univariate. For k = 1, 2, . . . , d, define equations with respect to θ as

n^{-1}{X1^k + X2^k + · · · + Xn^k} = E{X^k; θ}.

The solution in θ, if it exists and is unique, is called the moment estimator of θ.
Example 5.1. Suppose X1, . . . , Xn is an i.i.d. sample from the Negative binomial distribution whose probability mass function (p.m.f.) is given by

f(x; θ) = C(−m, x)(θ − 1)^x θ^m

for x = 0, 1, 2, . . ., where C(a, b) denotes the binomial coefficient. It is known that E{X; θ} = m(1 − θ)/θ. Hence, the moment estimator of θ is given by

θˆ = m/(m + X¯n).
If X1, . . . , Xn is an i.i.d. sample from the N(µ, σ^2) distribution, it is known that E{(X, X^2)} = (µ, µ^2 + σ^2). The moment equations are given by

n^{-1}{X1 + X2 + · · · + Xn} = µ;
n^{-1}{X1^2 + X2^2 + · · · + Xn^2} = µ^2 + σ^2.

The moment estimators are found to be

µˆ = X¯n; σˆ^2 = n^{-1} ∑ Xi^2 − X¯n^2.

Note that σˆ^2 equals the sample variance multiplied by the scale factor (n − 1)/n.
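As a minimal numerical sketch of these two moment estimates (the data are made-up illustrative numbers):

```python
# Moment estimates for the N(mu, sigma^2) model from a small illustrative data set
data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2]
n = len(data)
m1 = sum(data) / n                  # first sample moment
m2 = sum(x * x for x in data) / n   # second sample moment
mu_hat = m1                         # solves m1 = mu
sigma2_hat = m2 - m1 ** 2           # solves m2 = mu^2 + sigma^2
print(round(mu_hat, 4), round(sigma2_hat, 4))   # about 4.8875 and 0.4561
```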
Moment estimators are often easy to construct and have simple distributional properties. In classical examples, they are also easy to compute numerically.
The use of a moment estimator depends on the existence and uniqueness of the solutions to the corresponding equations. There seems to be little discussion of this topic. We suggest that moment estimators are estimators of an older tradition, in which era only simplistic models were considered. Such complications do not seem to occur too often for those models. We will provide an example based on an exponential mixture as an exercise problem. One may find the classical example in Pearson (1904?), where a heroic effort was devoted to solving moment equations to fit a two-component normal mixture model. Other than general convention, there exists nearly no theory to support the use of the first d moments for the method of moments rather than other moments. The method of moments also does not have to be restricted to situations where i.i.d. observations are available.
Example 5.2. Suppose we have T observations from a simple linear regression model:

Yt = βXt + εt

for t = 1, . . . , T, such that ε1, . . . , εT are i.i.d. N(0, 1) and X1, . . . , XT are non-random constants.
It is seen that

E{∑ Yt} = β ∑ Xt.

Hence, a moment estimator of β is given by

βˆ = ∑ Yt / ∑ Xt.
The method of moments makes sense based on our intuition. What statistical properties does it have? Under some conditions, we can show that it is consistent and asymptotically normal. Specifying the exact conditions, however, is surprisingly more tedious than one may expect.
Consider the situation where an i.i.d. sample of size n from a parametric
statistical model F is available. Let θ denote the parameter and Θ ⊂ Rd
be the parameter space. Let µk(θ) be the kth moment of X, the random
variable whose distribution is F (x; θ) which is a member of F .
Assume that µk(θ) exists and is continuous in θ for k = 1, 2, . . . , d. Assume also that the moment estimator θˆ is the unique solution to the moment equations for large enough n. Recall the law of large numbers:

n^{-1}{X1^k + X2^k + · · · + Xn^k} → µk(θ)

almost surely as n→∞.
By the definition of moment estimates, we have
µk(θˆ)→ µk(θ)
for k = 1, 2, . . . , d when n→∞, almost surely.
Assume that, as a vector-valued function made of the first d moments, µ(θ) is “inversely continuous”, a term we invent on the spot: for any fixed θ∗ and dynamic θ,

‖µ(θ) − µ(θ∗)‖ → 0

only if θ → θ∗. Then µk(θˆ) → µk(θ) almost surely implies θˆ → θ almost surely.
We omit the discussion of asymptotic normality here.
5.2 Maximum likelihood estimation
If one can find a σ-finite measure such that each distribution in F has a density function f(x), then the likelihood function is given by (not defined as)

L(F) = f(x),

which is a function of F on F. To remove the mystic notion of F, under a parametric model the likelihood becomes

L(θ) = f(x; θ)

because we can use θ to represent each F in F. If θˆ is a value in Θ such that

L(θˆ) = sup_θ L(θ),
then it is a maximum likelihood estimate (estimator) of θ. If we can find a
sequence {θm} such that

lim_{m→∞} L(θm) = sup_θ L(θ)

and lim θm = θˆ exists, then we also call θˆ a maximum likelihood estimate (estimator) of θ.
The observation x includes the situation where it is a vector. The common
i.i.d. situation is a special case where x is made of n i.i.d. observations from
a distribution family F . In this case, the likelihood function is given by
the product of n densities evaluated at x1, . . . , xn respectively. It remains a
function of parameter θ.
The probability mass function, when x is discrete, is also regarded as a
density function. This remark looks after discrete models. In general, the
likelihood function is defined as follows.
Definition 5.1. The likelihood function on a model F based on observed
values of X is proportional to
P (X = x;F )
where the probability is computed when X has distribution F .
When F is a continuous distribution, the probability is computed as the probability of the event that X belongs to a small neighbourhood of x. The argument of “proportionality” then leads to the joint density function f(x), or f(x; θ) in general. The proportionality is a property in terms of F; the likelihood function is a function of F.
The phrase “proportional to” in the definition implies the likelihood func-
tion is not unique. If L(θ) is a likelihood function based on some data, then
cL(θ) for any c > 0 is also a likelihood function based on the same data.
5.3 Estimating equation
The MLE of a parameter is often obtained by solving a score equation:

∂Ln(θ)/∂θ = 0.

It is generally true that

E[∂ log Ln(θ)/∂θ; θ] = 0,

where the expectation is computed when the parameter value (of the distribution of the data) is given by θ. Because of this, the MLE is often regarded as a solution to

∂ log Ln(θ)/∂θ = 0.
It appears that whether or not ∂ log Ln(θ)/∂θ is the derivative function of a log-likelihood function matters very little. This leads to the following consideration.
In applications, we may have reasons to believe that a parameter θ solves the equation

E{g(X; θ)} = 0.
Given a set of i.i.d. observations in X, we may solve

∑_{i=1}^n g(xi; θ) = 0

and use its solution as an estimate of θ (or an estimator if the xi's are replaced by Xi's).
Clearly, such estimators are sensible and may be preferred when com-
pletely specifying a model for X is at great risk of misspecification.
Example 5.3. Suppose (xi, yi), i = 1, 2, . . . , n is a set of i.i.d. observations from some F such that E(Yi | Xi = xi) = xi^τ β.
We may estimate β by the solution to

∑_{i=1}^n xi(yi − xi^τ β) = 0.

The solution is given by

βˆ = {∑_{i=1}^n xi xi^τ}^{-1} {∑_{i=1}^n xi yi},

which is the well-known least squares estimator.
The spirit of this example is: we do not explicitly spell out any distributional assumptions on (X, Y) other than the form of the conditional expectation.
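For a scalar covariate, the estimating equation above can be solved in one line. A sketch with made-up numbers, checking that the equation is indeed zero at the solution:

```python
# Solve sum_i x_i (y_i - x_i * beta) = 0 for a scalar beta (illustrative data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
# the estimating equation is (numerically) zero at beta_hat
resid = sum(x * (y - x * beta_hat) for x, y in zip(xs, ys))
print(round(beta_hat, 4), abs(resid) < 1e-9)
```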
5.4 M-Estimation
Motivated by a similar consideration, one may replace Ln(θ) by some other function in some applications. Let ϕ(x; θ) be a function of the data and θ; we are mostly interested in its behaviour in θ after x is given. In the i.i.d. case, we may minimize

Mn(θ) = ∑_{i=1}^n ϕ(xi; θ)

and use its solution as an estimate of θ (or an estimator if the xi's are replaced by Xi's). In this situation, the parameter θ is defined as the minimum point of E{ϕ(X; ξ); F} in ξ, where F is the true distribution of X.
Example 5.4. Suppose (xi, yi), i = 1, 2, . . . , n is a set of i.i.d. observations from some F such that E(Yi | Xi = xi) = xi^τ β.
We may estimate β by the solution to the minimization problem:

min_β ∑_{i=1}^n (yi − xi^τ β)^2.

In this case, ϕ(x, y; β) = (y − x^τ β)^2. The solution is again given by

βˆ = {∑_{i=1}^n xi xi^τ}^{-1} {∑_{i=1}^n xi yi},

which is the well-known least squares estimator.
In some applications, the data set may contain a few observations whose y values are much larger than those of the rest. Their presence leaves the other observed values with almost no influence on the fitted regression coefficient βˆ. Hence, Huber suggested using

ϕ(x, y; β) =
(y − x^τ β)^2,  if |y − x^τ β| ≤ k;
k(y − x^τ β),  if y − x^τ β > k;
−k(y − x^τ β),  if y − x^τ β < −k,

for some selected constant k instead.
This choice limits the influence of observations with huge values. Sometimes such abnormal values, often referred to as outliers, are caused by recording errors.
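For the simpler location problem (xi ≡ 1), a Huber-type M-estimate can be computed by iteratively reweighting the sample mean; this is a common fixed-point scheme, not a procedure from the notes, and the tuning constant k = 1.345 and the data below are illustrative assumptions:

```python
def huber_location(data, k=1.345, iters=50):
    # iteratively reweighted mean: weight 1 if |x - theta| <= k, else k/|x - theta|
    theta = sum(data) / len(data)   # start from the sample mean
    for _ in range(iters):
        w = [1.0 if abs(x - theta) <= k else k / abs(x - theta) for x in data]
        theta = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    return theta

sample = [0.2, -0.5, 0.1, 0.4, -0.1, 0.3, 25.0]   # one gross outlier
print(round(huber_location(sample), 2))   # 0.29, while the sample mean is about 3.63
```

The outlier receives a small weight k/|x − θ|, so it barely moves the estimate, which is exactly the bounded-influence behaviour described above.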
5.5 L-estimator
Suppose we have a set of univariate i.i.d. observations and it is simple to sort them by size so that X(1) ≤ X(2) ≤ · · · ≤ X(n). We call them order statistics. To avoid the influence of outliers, one may estimate the population mean by a trimmed mean:

(n − 2)^{-1} ∑_{i=2}^{n−1} X(i).

This practice is used in Olympic games, though there the trimmed scores are not estimators. One can certainly remove more observations from consideration and make the estimator more robust. The extreme case is to use the sample median to estimate the population mean. In this case, the estimator makes sense only if the mean and the median are the same parameter under the model assumption.
In general, an L-estimator is any linear combination of these order statistics. The coefficients are required to be non-random and naturally do not depend on unknown parameters.
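A minimal sketch of the trimmed mean above, on made-up data:

```python
def trimmed_mean(data):
    xs = sorted(data)        # order statistics X_(1) <= ... <= X_(n)
    inner = xs[1:-1]         # drop the smallest and the largest
    return sum(inner) / len(inner)

sample = [3.1, 2.9, 3.0, 3.3, 2.8, 9.9]   # 9.9 looks like an outlier
print(round(trimmed_mean(sample), 3))     # 3.075, the mean of the middle four values
```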
Chapter 6
Maximum likelihood estimation
In textbooks, including this one, there are plenty of examples where the MLE is easy to obtain. We now give some examples where the routine approaches do not work.
6.1 MLE examples
The simplest example is when we have i.i.d. data of size n from the N(µ, σ^2) distribution (family). In this case, the log-likelihood function is given by

ℓn(µ, σ^2) = −n log σ − {1/(2σ^2)} ∑_{i=1}^n (xi − µ)^2.

Note that I have omitted the constant that does not depend on the parameters. Regardless of the value of σ^2, the maximum point in µ is µˆ = X¯n, the sample mean. Let σ˜^2 = n^{-1} ∑_{i=1}^n (xi − µˆ)^2, and for the moment regard it not as an estimator but as a statistic. Then we find

ℓn(µˆ, σ^2) = −n log σ − nσ˜^2/(2σ^2).

This function is maximized at σ^2 = σ˜^2. Hence, the MLE of σ^2 is given by σˆ^2 = σ˜^2.
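A numeric sketch of these closed forms on made-up data; note that they coincide with the moment estimators of Chapter 5:

```python
# MLE for N(mu, sigma^2): mu_hat = xbar, sigma2_hat = n^{-1} * sum (x_i - xbar)^2
data = [1.4, 2.2, 0.9, 1.8, 2.5, 1.1]
n = len(data)
mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
print(round(mu_hat, 3), round(sigma2_hat, 3))   # about 1.65 and 0.329
```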
Type I censoring. The next example is a bit unusual. In industry, it is vital to ensure that components in a product will last a long time. Hence, we need to have a clear idea of their survival distributions. Such information can be obtained by collecting complete failure time data on a random sample of the components. When the average survival time is very long, one has to terminate the experiment at some point, likely before all sampled units fail. Let the lifetime of a component be X and the termination time be a nonrandom T. Then the observation may be censored, and we only observe min(X, T). This type of censoring is commonly referred to as Type I censoring.
Suppose the failure time data can be properly modelled by the exponential distribution f(x; θ) = θ^{-1} exp(−x/θ), x > 0. Let x1, x2, · · · , xm be the observed failure times of m out of n components. The remaining n − m components have not failed by time T (which is not random). In this case, the likelihood function is given by

Ln(θ) = θ^{-m} exp{−θ^{-1}[∑_{i=1}^m xi + (n − m)T]}.

Interpreting the likelihood function based on the earlier definition makes it easier to obtain this expression.
Some mathematics behind this likelihood is as follows. The probability that n − m components last longer than T is given by

C(n, n − m) {exp(−θ^{-1}T)}^{n−m} {1 − exp(−θ^{-1}T)}^m.

Given that m components failed before time T, their failure times are i.i.d. from the conditional exponential distribution whose density is

θ^{-1} exp(−θ^{-1}x) / {1 − exp(−θ^{-1}T)}, 0 < x < T.

Hence, the joint density of x1, . . . , xm is given by

∏_{i=1}^m [θ^{-1} exp(−θ^{-1}xi) / {1 − exp(−θ^{-1}T)}].
The product of the two factors gives us, up to a constant factor, the algebraic expression of Ln(θ). Once the likelihood function is obtained, we can easily find the explicit solution for the MLE of θ.
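Setting the derivative of log Ln(θ) to zero gives θˆ = {∑ xi + (n − m)T}/m, the total time on test divided by the number of observed failures. The following sketch evaluates it on made-up data:

```python
# MLE under Type I censoring for the exponential model:
# log L(theta) = -m log(theta) - [sum x_i + (n - m) T] / theta,
# and setting the derivative to zero gives theta_hat = total time on test / m
failures = [0.8, 2.1, 1.3, 0.5, 3.0]   # observed failure times (illustrative)
n, T = 8, 4.0                          # 3 of the 8 units survive past time T
m = len(failures)
total_time = sum(failures) + (n - m) * T
theta_hat = total_time / m
print(round(theta_hat, 2))   # 3.94
```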
There is more than one way to arrive at the above likelihood function.
Discrete parameter space. Suppose a finite population is made of two types of units, A and B. The population size is $N = A + B$, where $A$ and $B$ also denote the numbers of type-A and type-B units. Assume the value of $B$ is known, as occurs in capture-recapture experiments. A sample of size $n$ is obtained by simple random sampling without replacement, and $x$ of the sampled units are of type B. Based on this observation, what is the MLE of $A$?

To answer this question, we notice that the likelihood function is given by
\[
L(A) = \frac{\dbinom{A}{n-x}\dbinom{B}{x}}{\dbinom{A+B}{n}}.
\]
Our task is to find an expression for the MLE of $A$. Note that "find the MLE" is not a very rigorous statement. Let us leave this problem to classroom discussion.
Non-smooth density functions. Suppose we have an i.i.d. sample of size $n$ from the uniform distribution on $(0,\theta)$, and the parameter space is $\Theta = \mathbb{R}^+$. Find the MLE of $\theta$.
6.2 Newton-Raphson algorithm
Other than textbook examples, most applied problems do not permit an analytical solution to the maximum likelihood estimation problem. In this case, we resort to whatever optimization algorithm works. For illustration, we still resort to "textbook examples."
Example 6.1. Let $X_1, \dots, X_n$ be i.i.d. random variables from the Weibull distribution with fixed scale parameter:
\[
f(x;\theta) = \theta x^{\theta-1}\exp(-x^\theta)
\]
with parameter space $\Theta = \mathbb{R}^+$ on the support $x > 0$.

Clearly, the log likelihood function of $\theta$ is given by
\[
\ell_n(\theta) = n\log\theta + (\theta-1)\sum_{i=1}^n \log x_i - \sum_{i=1}^n x_i^\theta.
\]
It is seen that
\[
\ell_n'(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log x_i - \sum_{i=1}^n x_i^\theta \log x_i;
\qquad
\ell_n''(\theta) = -\frac{n}{\theta^2} - \sum_{i=1}^n x_i^\theta \log^2 x_i < 0.
\]
Therefore, the log likelihood function is concave and hence has a unique maximum in $\theta$. Both when $\theta \to 0^+$ and when $\theta \to \infty$, we have $\ell_n(\theta) \to -\infty$.
For numerical computation, we can easily locate $\theta_1 < \theta_2$ such that the maximum point of $\ell_n(\theta)$ is within the interval $[\theta_1, \theta_2]$. Following the above example, a bisection algorithm can be applied to locate the maximum point of $\ell_n(\theta)$.
1. Compute $y_1 = \ell_n'(\theta_1)$, $y_2 = \ell_n'(\theta_2)$, and $\theta = (\theta_1+\theta_2)/2$;
2. If $\ell_n'(\theta) > 0$, let $\theta_1 = \theta$; otherwise, let $\theta_2 = \theta$;
3. Repeat the last step until $|\theta_1 - \theta_2| < \epsilon$ for a pre-specified precision constant $\epsilon > 0$;
4. Report $\theta$ as the numerical value of the MLE $\hat\theta$.
It will be an exercise to numerically find lower and upper bounds and the MLE of $\theta$ given a data set.
The bisection method is easy to understand. Its convergence rate, in terms of the number of steps it must take to reach the final result, is often judged to be too low. When $\theta$ is one-dimensional, our experience shows this criticism is not well founded. Nevertheless, it is useful to understand another standard method in numerical data analysis.
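The bisection steps above can be sketched as follows (a Python illustration with made-up data, not the assignment sample; `weibull_score` is our name for $\ell_n'(\theta)$ from Example 6.1, and the notes themselves prefer R):

```python
import math

def weibull_score(theta, xs):
    """l'_n(theta) for the Weibull model of Example 6.1."""
    n = len(xs)
    return (n / theta + sum(math.log(x) for x in xs)
            - sum(x ** theta * math.log(x) for x in xs))

def bisect_mle(xs, th1, th2, eps=1e-8):
    """Bisection on the score; assumes l'_n(th1) > 0 > l'_n(th2)."""
    while abs(th1 - th2) >= eps:
        mid = (th1 + th2) / 2.0
        if weibull_score(mid, xs) > 0:
            th1 = mid          # maximum lies to the right of mid
        else:
            th2 = mid          # maximum lies to the left of mid
    return (th1 + th2) / 2.0

xs = [0.5, 0.9, 1.2, 0.7, 1.1]     # illustrative data only
theta_hat = bisect_mle(xs, 0.1, 10.0)
```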
Suppose one has an initial guess of the maximum point of the likelihood function, say $\theta^{(0)}$. For any $\theta$ close to this point, we have
\[
\ell_n'(\theta) \approx \ell_n'(\theta^{(0)}) + \ell_n''(\theta^{(0)})(\theta - \theta^{(0)}).
\]
If the initial guess is reasonably close to the maximum point, then the second derivative satisfies $\ell_n''(\theta^{(0)}) < 0$. From the above approximation, we would guess that
\[
\theta^{(1)} = \theta^{(0)} - \ell_n'(\theta^{(0)})/\ell_n''(\theta^{(0)})
\]
is closer to the solution of $\ell_n'(\theta) = 0$. This consideration leads to the repeated updating
\[
\theta^{(k+1)} = \theta^{(k)} - \ell_n'(\theta^{(k)})/\ell_n''(\theta^{(k)}).
\]
Starting from $\theta^{(0)}$, we therefore obtain a sequence $\theta^{(k)}$. If the problem is not tricky, this sequence converges to the maximum point of $\ell_n(\theta)$. Once it stabilizes, we regard the outcome as the numerical value of the MLE.

This iterative scheme is called the Newton-Raphson method. Its success depends on a good choice of $\theta^{(0)}$ and on the properties of $\ell_n(\theta)$ as a function of $\theta$. If the likelihood has many local maxima, then the outcome of the algorithm may be any one of these local maxima. For complex models and multi-dimensional $\theta$, convergence is far from guaranteed, and a good (or lucky) choice of $\theta^{(0)}$ is crucial.

Although in theory each Newton-Raphson iteration moves $\theta^{(k+1)}$ toward the true maximum faster, we pay an extra cost in computing the second derivative. For multi-dimensional $\theta$, we also need to invert a matrix, which is not always a pleasant task. The implementation of this method is not always simple.

Implementing Newton-Raphson for a simple data example will be an exercise.
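A minimal Newton-Raphson sketch for the Weibull model of Example 6.1 (Python, with made-up data; the function names are ours, and the notes themselves prefer R), using the updating formula above:

```python
import math

def score_and_hessian(theta, xs):
    """l'_n and l''_n for the Weibull model of Example 6.1."""
    n = len(xs)
    s = (n / theta + sum(math.log(x) for x in xs)
         - sum(x ** theta * math.log(x) for x in xs))
    h = -n / theta ** 2 - sum(x ** theta * math.log(x) ** 2 for x in xs)
    return s, h

def newton_raphson(xs, theta0=1.0, tol=1e-10, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        s, h = score_and_hessian(theta, xs)
        step = s / h
        theta -= step                # theta^(k+1) = theta^(k) - l'/l''
        if abs(step) < tol:
            break
    return theta

xs = [0.5, 0.9, 1.2, 0.7, 1.1]      # illustrative data only
theta_hat = newton_raphson(xs)
```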
Example 6.2. Logistic distribution. Let $X_1, X_2, \dots, X_n$ be i.i.d. with density function
\[
f(x;\theta) = \frac{\exp\{-(x-\theta)\}}{[1+\exp\{-(x-\theta)\}]^2}.
\]
The support of the distribution is the whole real line, and the parameter space is $\mathbb{R}$. We usually call this a location distribution family.

The log-likelihood function is given by
\[
\ell_n(\theta) = n\theta - n\bar x_n - 2\sum_{i=1}^n \log[1+\exp\{-(x_i-\theta)\}].
\]
Its score function is
\[
\ell_n'(\theta) = s(\theta) = n - 2\sum_{i=1}^n \frac{\exp\{-(x_i-\theta)\}}{1+\exp\{-(x_i-\theta)\}}.
\]
The MLE is a solution to $s(\theta) = 0$.

One may easily find that
\[
\ell_n''(\theta) = s'(\theta) = -2\sum_{i=1}^n \frac{\exp\{-(x_i-\theta)\}}{[1+\exp\{-(x_i-\theta)\}]^2} < 0.
\]
Thus, the score function is monotone in $\theta$, which implies that the solution to $s(\theta) = 0$ is unique. It also implies that this solution is the maximum point of the likelihood, not a minimum or other stationary point.
It is also evident that this equation has no analytical solution, so the Newton-Raphson algorithm can be a good choice for numerically evaluating the MLE in applications.
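A sketch of Newton-Raphson for this logistic location model (Python, made-up data; the notes themselves prefer R). The code uses the fact that $e^{-(x-\theta)}/\{1+e^{-(x-\theta)}\}$ equals the logistic cdf evaluated at $\theta - x$:

```python
import math

def logistic_mle(xs, theta0=None, tol=1e-10, max_iter=100):
    """Newton-Raphson for the location-logistic MLE."""
    n = len(xs)
    theta = sum(xs) / n if theta0 is None else theta0   # start at sample mean
    for _ in range(max_iter):
        # sigma(theta - x) = exp{-(x-theta)} / [1 + exp{-(x-theta)}]
        sig = [1.0 / (1.0 + math.exp(-(theta - x))) for x in xs]
        s = n - 2.0 * sum(sig)                       # score l'_n(theta)
        sp = -2.0 * sum(p * (1.0 - p) for p in sig)  # l''_n(theta) < 0
        step = s / sp
        theta -= step
        if abs(step) < tol:
            break
    return theta

theta_hat = logistic_mle([-0.3, 0.8, 1.5, 0.2, -1.1])  # illustrative data
```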
6.3 EM-algorithm
Suppose we have $n$ observations from a trinomial distribution. That is, there are $n$ independent trials, each with three possible outcomes. The corresponding probabilities are $p_1, p_2, p_3$. We summarize these observations into counts $n_1, n_2, n_3$. The log-likelihood function is
\[
\ell_n(p_1,p_2,p_3) = n_1\log p_1 + n_2\log p_2 + n_3\log p_3.
\]
Using the Lagrange multiplier method, we can easily show that the MLEs are $\hat p_j = n_j/n$ for $j = 1, 2, 3$.

Suppose, however, that another $m$ trials were carried out, but we only know that their outcomes are not of the third kind. In other words, the data contain some missing information.
The log-likelihood function with the additional data included becomes
\[
\ell_n(p_1,p_2,p_3) = n_1\log p_1 + n_2\log p_2 + n_3\log p_3 + m\log(p_1+p_2).
\]
Working out the MLE is no longer straightforward. Many numerical algorithms could be used to compute the MLE; we recommend the EM-algorithm in this case.
If we knew which of these $m$ observations were of type I or type II, we would have obtained the complete data log-likelihood
\[
\ell^c(p_1,p_2,p_3) = (n_1+m_1)\log p_1 + (n_2+m_2)\log p_2 + n_3\log p_3,
\]
where $c$ stands for "complete data". Since we do not know what $m_1$ and $m_2$ are, we replace them with predictions based on what we already know. In this case, we use conditional expectations.

E-step: Suppose the current estimates are $\hat p_1 = n_1/n$ and $\hat p_2 = n_2/n$. Then we would expect that, of the $m$ non-type-III observations, $\hat m_1 = m\hat p_1/(\hat p_1+\hat p_2)$ are of type I and $\hat m_2 = m\hat p_2/(\hat p_1+\hat p_2)$ are of type II. That is, the conditional expectations of $m_1$ and $m_2$ (given the data and the current estimates of the parameter values) are $\hat m_1$ and $\hat m_2$. When $m_1$ and $m_2$ are replaced by their conditional expectations, we get the function
\[
Q(p_1,p_2,p_3) = (n_1+\hat m_1)\log p_1 + (n_2+\hat m_2)\log p_2 + n_3\log p_3.
\]
This is called the E-step because we replace the unobserved values by their conditional expectations.
M-step: In this step, we update the unknown parameters by the maximizer of $Q(p_1,p_2,p_3)$. The updated estimates are
\[
\tilde p_1 = (n_1+\hat m_1)/(n+m), \quad \tilde p_2 = (n_2+\hat m_2)/(n+m), \quad \tilde p_3 = n_3/(n+m).
\]
If they represent a better guess of the MLE, then we should update the Q-function accordingly, after which we carry out the M-step again to obtain a more satisfactory approximation to the MLE. We therefore iterate between the E- and M-steps until some notion of convergence is reached.
This idea is particularly useful when the data structure is complex. In most cases, the EM iteration is guaranteed to increase the likelihood. Thus, it should converge, at least to a local maximum.
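The E- and M-steps just described can be coded in a few lines. Below is a Python sketch with made-up counts (the notes' own listings use R; this is only to make the iteration concrete):

```python
# Made-up counts: n1, n2, n3 fully classified trials, plus m extra
# trials known only to be of type I or type II.
n1, n2, n3, m = 30, 20, 50, 10
n = n1 + n2 + n3

p1, p2 = n1 / n, n2 / n              # initial values from the complete part
for _ in range(200):
    # E-step: split the m partially observed trials in proportion to p1, p2
    m1 = m * p1 / (p1 + p2)
    m2 = m - m1
    # M-step: maximize Q, a multinomial likelihood with the filled-in counts
    p1_new = (n1 + m1) / (n + m)
    p2_new = (n2 + m2) / (n + m)
    converged = abs(p1_new - p1) + abs(p2_new - p2) < 1e-12
    p1, p2 = p1_new, p2_new
    if converged:
        break
p3 = n3 / (n + m)
```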
6.4 EM-algorithm for finite mixture models
Let us envisage a population made of a finite number of subpopulations, each governed by a specific distribution from some distribution family. Taking a random sample from a finite mixture model, we obtain a set of units without knowing their subpopulation identities. The resulting random variable has density function
\[
f(x;G) = \sum_{j=1}^m \pi_j f(x;\theta_j),
\]
with $G$ denoting a mixing distribution on the parameter space $\Theta$ of $\theta$ that assigns probability $\pi_j$ to $\theta_j$.

Given a random sample of size $n$, $x_1, x_2, \dots, x_n$, from this distribution, the log likelihood function is given by
\[
\ell_n(G) = \sum_{i=1}^n \log f(x_i;G). \qquad (6.1)
\]
Other than the order $m$, we regard the $\pi_j$ and $\theta_j$ as parameters to be estimated. Computing the maximum likelihood estimate of $G$ amounts to finding the values of the $m$ pairs $(\pi_j, \theta_j)$ that maximize $\ell_n(G)$.

Taking advantage of the mixture model structure, the EM-algorithm can often be implemented effectively to locate the maximum point of the likelihood function.
Conceptually, each observation $x$ from a mixture model is part of a complete vector observation $(x, z^\tau)$, where $z$ is a vector of length $m$ consisting of a single 1 and otherwise 0s; the position of the 1 is the subpopulation identity. Suppose we have a set of complete observations of the form $(x_i, z_i^\tau)$, $i = 1, 2, \dots, n$. The log likelihood function of the mixing distribution $G$ is given by
\[
\ell^c(G) = \sum_{i=1}^n \sum_{j=1}^m z_{ij} \log\{\pi_j f(x_i;\theta_j)\}. \qquad (6.2)
\]
Since, for each $i$, $z_{ij}$ equals 0 except for a single value of $j$, only one term $\log\{\pi_j f(x_i;\theta_j)\}$ actually enters the log likelihood function.
We use $x$ for the vector of the $x_i$ and $X$ for its corresponding random vector, and start the EM-algorithm with an initial mixing distribution with $m$ support points:
\[
G^{(0)}(\theta) = \sum_{j=1}^m \pi_j^{(0)}\, 1(\theta_j^{(0)} \le \theta).
\]
E-Step. This step finds the expected values of the missing data in the full data likelihood function; in the context of the finite mixture model, they are the $z_i$. If the mixing distribution $G$ is given by $G^{(0)}$, the corresponding random variable has conditional expectation
\[
E\{Z_{ij}\mid X = x; G^{(0)}\} = \operatorname{pr}(Z_{ij} = 1\mid X_i = x_i; G^{(0)})
= \frac{f(x_i;\theta_j^{(0)})\operatorname{pr}(Z_{ij}=1; G^{(0)})}{\sum_{k=1}^m f(x_i;\theta_k^{(0)})\operatorname{pr}(Z_{ik}=1; G^{(0)})}
= \frac{\pi_j^{(0)} f(x_i;\theta_j^{(0)})}{\sum_{k=1}^m \pi_k^{(0)} f(x_i;\theta_k^{(0)})}.
\]
The first equality uses two facts: the expectation of an indicator random variable equals the probability of "success", and only the $i$th observation is relevant to the subpopulation identity of the $i$th unit. The second equality comes from the standard Bayes formula. The third spells out the probability of "success" when $G^{(0)}$ is the true mixing distribution. The superscript $(0)$ reminds us that the corresponding quantities come from $G^{(0)}$, the initial mixing distribution. One should also note that the expression is explicit and numerically easy to compute as long as the density function itself can be easily computed.
We use the notation $w_{ij}^{(0)}$ for $E\{Z_{ij}\mid X = x; G^{(0)}\}$. Replacing $z_{ij}$ by $w_{ij}^{(0)}$ in $\ell^c(G)$, we obtain a function usually denoted as
\[
Q(G;G^{(0)}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^{(0)} \log\{\pi_j f(x_i;\theta_j)\}. \qquad (6.3)
\]
In this expression, $Q$ is a function of $G$, and its functional form is determined by $G^{(0)}$; in other words, $Q(G;G^{(0)})$ is the conditional expectation of $\ell^c(G)$ given $X = x$, with $G^{(0)}$ regarded as the true mixing distribution behind $X$. The E-Step ends with the production of this function.
M-Step. Given this $Q$ function, it is often simple to find a mixing distribution $G$ that maximizes it. Note that $Q$ has the following decomposition:
\[
Q(G;G^{(0)}) = \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)}\Big\}\log \pi_j
+ \sum_{j=1}^m \Big\{\sum_{i=1}^n w_{ij}^{(0)}\log f(x_i;\theta_j)\Big\}.
\]
In this decomposition, the two additive terms are functions of two separate parts of $G$: the first term is a function of the mixing probabilities only, and the second is a function of the subpopulation parameters only. Hence, we can search for the maxima of these two functions separately to find the overall solution. The algebraic form of the first term is identical to the log likelihood of a multinomial distribution, so the maximizing solution is
\[
\pi_j^{(1)} = n^{-1}\sum_{i=1}^n w_{ij}^{(0)}, \quad j = 1, 2, \dots, m.
\]
The second term further decomposes into the sum of $m$ log likelihood functions, one for each subpopulation. When $f(x;\theta)$ is a member of a classical parametric distribution family, the maximization with respect to $\theta$ often has an explicit analytical solution. For a generic $f(x;\theta)$, we can only give the abstract expression
\[
\theta_j^{(1)} = \arg\sup_{\theta}\Big\{\sum_{i=1}^n w_{ij}^{(0)}\log f(x_i;\theta)\Big\}, \quad j = 1, 2, \dots, m.
\]
The mixing distribution
\[
G^{(1)}(\theta) = \sum_{j=1}^m \pi_j^{(1)}\, 1(\theta_j^{(1)} \le \theta)
\]
then replaces $G^{(0)}$, and we go back to the E-step.
Iterating between the E-step and the M-step leads to a sequence of intermediate estimates of the mixing distribution, $G^{(k)}$. Often, this sequence converges to at least a local maximum of $\ell_n(G)$. With some luck, this limit is the global maximum. In most applications, one tries a number of initial values $G^{(0)}$ and compares the values of $\ell_n(G^{(k)})$ the EM-algorithm leads to; the run with the highest value has its $G^{(k)}$ regarded as the maximum likelihood estimate of $G$.

The algorithm stops after a number of iterations, once the difference between $G^{(k)}$ and $G^{(k-1)}$ is considered too small to continue. Other convergence criteria may also be used.
6.4.1 Data Examples
Leroux and Puterman (1992) and Chen and Kalbfleisch (1996) analyze data
on the movements of a fetal lamb in each of 240 consecutive 5-second intervals
and propose a mixture of Poisson distributions. The observations can be
summarized by the following table.
x 0 1 2 3 4 5 6 7
freq 182 41 12 2 2 0 0 1
It is easily seen that the distribution of the counts is over-dispersed: the sample mean is 0.358, which is significantly smaller than the sample variance of 0.658, given that the sample size is 240.
A finite mixture model is very effective at explaining the over-dispersion. There is general agreement that a finite Poisson mixture model of order $m = 2$ is most suitable. We use this example to demonstrate the use of the EM-algorithm for computing the MLE of the mixing distribution given $m = 2$.

Since the sample mean is 0.358 and the data contain a lot of zeros, let us choose an initial mixing distribution with
\[
(\pi_1^{(0)}, \pi_2^{(0)}, \theta_1^{(0)}, \theta_2^{(0)}) = (0.7,\ 0.3,\ 0.1,\ 4.0).
\]
We do not have more specific reasons behind the above choice.
A simplistic implementation of the EM-algorithm for this data set is as follows.

pp = 0.7
theta = c(0.1, 4.0)
xx = c(rep(0, 182), rep(1, 41), rep(2, 12), 3, 3, 4, 4, 7)
# data inputted, initial mixing distribution chosen
last = c(pp, theta)
dd = 1
while (dd > 0.000001) {
  temp1 = pp * dpois(xx, theta[1])
  temp2 = (1 - pp) * dpois(xx, theta[2])
  w1 = temp1 / (temp1 + temp2)
  w2 = 1 - w1
  # E-step completed
  pp = mean(w1)
  theta[1] = sum(w1 * xx) / sum(w1)
  theta[2] = sum(w2 * xx) / sum(w2)
  # M-step completed
  updated = c(pp, theta)
  dd = sum((last - updated)^2)
  last = updated
}
print(updated)
When the EM-algorithm converges, we get $\hat\pi_1 = 0.938$, $\hat\theta_1 = 0.229$, and $\hat\theta_2 = 2.307$. The log likelihood value at this $\hat G$ equals $-186.99$ (based on the usual expression of the Poisson probability mass function). The fitted frequency vector is given by
x 0 1 2 3 4 5 6 7
freq 182 41 12 2 2 0 0 1
fitted freq 180.4 44.5 8.6 3.4 1.8 0.8 0.3 0.1
6.6 Assignment problems
1. Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables from the Weibull distribution with fixed scale parameter, whose density function is given by
\[
f(x;\theta) = \theta x^{\theta-1}\exp(-x^\theta), \quad x > 0,\ \theta > 0.
\]
(You may want to first go over Example 6.1 in the Lecture Notes.) Suppose we observe the sample data
0.6944788, 0.3285051, 0.7165376, 0.8865894, 1.0858084,
0.4040884, 1.0538935, 1.2487677, 1.1523552, 0.9977360,
0.7251880, 1.0716697, 1.0382114, 1.1535934, 0.9175693,
0.5537849, 0.9701821, 0.5486354, 1.0168818, 0.5193687
(a) For this sample, numerically find a lower bound $\theta_1$ and an upper bound $\theta_2$ such that the maximum point of the likelihood function is within the interval $[\theta_1, \theta_2]$.
(b) Use a bisection algorithm (as discussed in Section 6.2 of the Lecture Notes) to numerically find the MLE of $\theta$ for this observed sample. You need to attach your code, preferably in R.
2. Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables from the Weibull distribution with fixed scale parameter.
(a) Work out analytically the updating formula for the parameter $\theta$ in the Newton-Raphson method (as discussed in Section 6.2 of the Lecture Notes). In other words, how do you obtain the value $\theta^{(k+1)}$ in the $(k+1)$-th iterative step from the value $\theta^{(k)}$?
(b) For the same observed sample as in Problem 1, numerically find the MLE of the parameter $\theta$ using the Newton-Raphson algorithm. Start from the initial value $\theta^{(0)} = 1$ and report the first 5 values of the iteration. Code is part of the required solution.
3. Let $f(x;\theta)$ be the p.m.f. of the Poisson distribution with mean $\theta$. Derive the KL divergence function $K(\theta_1, \theta_2)$. Repeat this problem when $f(x;\alpha) = \{\Gamma(\alpha)\}^{-1}x^{\alpha-1}\exp(-x)$.
Chapter 7
Properties of MLE
Consider the situation where we have a data set $x$ whose joint density function is a member of the distribution family specified by density functions $\{f(x;\theta): \theta\in\Theta\}$.

Suppose $\eta = g(\theta)$ is an invertible parameter transformation; denote the inverse transformation by $\theta = h(\eta)$ and the parameter space of $\eta$ by $\Upsilon$. Clearly, for each $\theta$ there is an $\eta$ such that
\[
f(x;\theta) = f(x;h(\eta)) = \tilde f(x;\eta),
\]
where we have introduced $\tilde f(x;\eta)$ for the density under the new parameterization. In other words,
\[
\{f(x;\theta): \theta\in\Theta\} = \{\tilde f(x;\eta): \eta\in\Upsilon\}.
\]
The likelihood functions in these two systems are related by $\ell(\theta) = \tilde\ell(\eta)$ for $\eta = g(\theta)$. If $\hat\theta$ is a value such that
\[
\ell(\hat\theta) = \sup_{\theta\in\Theta}\ell(\theta),
\]
we must also have
\[
\tilde\ell(g(\hat\theta)) = \ell(\hat\theta) = \sup_{\theta\in\Theta}\ell(\theta) = \sup_{\eta\in\Upsilon}\tilde\ell(\eta).
\]
Hence, $g(\hat\theta)$ is the MLE of $\eta = g(\theta)$.
In conclusion, the MLE, as a general method of point estimation, is equivariant: if we estimate $\mu$ by $\bar x$, then we estimate $\mu^2$ by $\bar x^2$, in common notation.
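As a toy numerical illustration of this equivariance (Python, with a made-up sample): under an exponential model with mean $\theta$, the MLE of $\theta$ is $\bar x$, so the MLE of the rate $\eta = g(\theta) = 1/\theta$ is $1/\bar x$:

```python
xs = [1.2, 0.4, 2.5, 0.9]          # made-up sample
theta_hat = sum(xs) / len(xs)      # MLE of the mean theta
eta_hat = 1.0 / theta_hat          # MLE of eta = 1/theta, by equivariance
```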
Next, we give results to motivate the use of the MLE. The following inequality plays an important role.

Jensen's inequality. Let $X$ be a random variable with finite mean and $g$ be a convex function. Then
\[
E[g(X)] \ge g[E(X)].
\]
Proof: We give a heuristic proof. A function $g$ is convex if and only if for every set of points $x_1, x_2, \dots, x_n$ and positive weights $p_1, p_2, \dots, p_n$ with $\sum_{i=1}^n p_i = 1$, we have
\[
\sum_{i=1}^n p_i g(x_i) \ge g\Big(\sum_{i=1}^n p_i x_i\Big).
\]
This essentially proves the inequality when $X$ is a discrete random variable with a finite number of possible values. Since every random variable can be approximated by such random variables, we can take a limit to obtain the general case. This is always possible when $X$ has a finite first moment.
Kullback-Leibler divergence. Suppose $f(x)$ and $g(x)$ are two density functions with respect to some $\sigma$-finite measure. The Kullback-Leibler divergence between $f$ and $g$ is defined to be
\[
K(f,g) = E\{\log[f(X)/g(X)];\, f\},
\]
where the expectation is computed when $X$ has distribution $f$.

Let $Y = g(X)/f(X)$ and $h(y) = -\log(y)$; clearly $h(y)$ is a convex function. It is easily seen that
\[
E\{Y\} \le 1,
\]
where strict inequality can occur when the support of $f(x)$ is a proper subset of that of $g(x)$. In any case, by Jensen's inequality, we have
\[
E\{h(Y)\} \ge h(E\{Y\}) \ge 0.
\]
This implies that
\[
K(f,g) \ge 0
\]
for any $f$ and $g$; clearly, $K(f,f) = 0$.

Because $K(f,g)$ is positive unless $f = g$, it serves as a measure of how different $g$ is from $f$. At the same time, the KL divergence is not a distance in the mathematical sense because $K(f,g) \ne K(g,f)$ in general.
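A numerical sanity check of these two properties (Python; the normal densities and grid limits are our choices for illustration). For $f = N(0,1)$ and $g = N(1, 2^2)$, a crude Riemann sum gives $K(f,g) \approx 0.443$ and $K(g,f) \approx 1.307$: both nonnegative, and unequal.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl(mu_f, sf, mu_g, sg, lo=-30.0, hi=30.0, steps=100000):
    """Crude midpoint Riemann sum for E{log[f(X)/g(X)]; f}."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        fx = normal_pdf(x, mu_f, sf)
        gx = normal_pdf(x, mu_g, sg)
        total += fx * math.log(fx / gx) * h
    return total

k_fg = kl(0.0, 1.0, 1.0, 2.0)   # K(N(0,1), N(1,4)): about 0.443
k_gf = kl(1.0, 2.0, 0.0, 1.0)   # K(N(1,4), N(0,1)): about 1.307
```

The closed form for two normals, $\log(\sigma_g/\sigma_f) + \{\sigma_f^2 + (\mu_f-\mu_g)^2\}/(2\sigma_g^2) - 1/2$, confirms both values.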
Let $\mathcal{F}$ be a parametric distribution family possessing densities $f(x;\theta)$ with parameter space $\Theta$. Let $f(x)$ be a density function that may or may not be a member of $\mathcal{F}$. If we wish to find a density in $\mathcal{F}$ that best approximates $f(x)$ in the KL-divergence sense, a sensible choice is $f(x;\hat\theta)$ such that
\[
\hat\theta = \arg\min_{\theta\in\Theta} K(f(x), f(x;\theta)).
\]
In most applications, $f(x)$ is not known, but we have an i.i.d. sample $X_1, \dots, X_n$ from it. In this case, we may approximate $K(f(x), f(x;\theta))$ as follows:
\[
K(f(x), f(x;\theta)) = \int \log\{f(x)/f(x;\theta)\} f(x)\,dx
\approx n^{-1}\sum_{i=1}^n \log\{f(x_i)/f(x_i;\theta)\}
= n^{-1}\sum_{i=1}^n \log f(x_i) - n^{-1}\ell_n(\theta),
\]
where the second term is the usual log likelihood function. Hence, minimizing the KL-divergence is approximately the same as maximizing the likelihood function. The analogy extends to situations where non-i.i.d. observations are available.

Unlike the UMVUE and other estimators, the MLE does not aim at most precisely determining the best possible value of the "true" $\theta$. One may wonder whether it measures up when critically examined from different angles. This will be the topic of the next section.
7.1 Trivial consistency
Under very general conditions, the MLE is strongly consistent. We work out a simple case here. Consider the situation where $\Theta = \{\theta_j : j = 1, \dots, k\}$ for some finite $k$. Assume that
\[
F(x;\theta_j) \ne F(x;\theta_l)
\]
for at least one $x$ value when $j \ne l$, where $F(x;\theta)$ is the cumulative distribution function of $f(x;\theta)$. This condition means that the model is identifiable by its parameters. We assume an i.i.d. sample from $F(x;\theta_0)$ has been obtained but pretend that we do not know $\theta_0$; instead, we want to estimate it by the MLE.

Let $\ell_n(\theta)$ be the log likelihood function based on the i.i.d. sample of size $n$. By the strong law of large numbers, we have
\[
n^{-1}\{\ell_n(\theta) - \ell_n(\theta_0)\} \to -K(f(x;\theta_0), f(x;\theta))
\]
almost surely for any $\theta \in \Theta$. The identifiability condition implies that
\[
K(f(x;\theta_0), f(x;\theta)) > 0
\]
for any $\theta \ne \theta_0$. Therefore, we have
\[
\ell_n(\theta) < \ell_n(\theta_0)
\]
almost surely as $n \to \infty$. Since there are only finitely many choices of $\theta$ in $\Theta$, we must have
\[
\max\{\ell_n(\theta): \theta \ne \theta_0\} < \ell_n(\theta_0)
\]
almost surely. Hence, the MLE $\hat\theta_n = \theta_0$ almost surely.
Let us summarize the result as follows.

Theorem 7.1. Let $X_1, \dots, X_n$ be an i.i.d. sample from the distribution family $\{f(x;\theta): \theta\in\Theta\}$, with true parameter value $\theta = \theta_0$. Assume the identifiability condition that
\[
F(x;\theta') \ne F(x;\theta'') \qquad (7.1)
\]
for at least one $x$ whenever $\theta' \ne \theta''$. Assume also that
\[
E|\log f(X;\theta)| < \infty \qquad (7.2)
\]
for any $\theta \in \Theta$, where the expectation is computed under $\theta_0$. Then, the MLE $\hat\theta \to \theta_0$ almost surely when $\Theta = \{\theta_j : j = 0, 1, \dots, k\}$ for some finite $k$.
Although the above proof is very simple, the idea behind it can be applied to prove the general result. For any subset $B$ of $\Theta$, define
\[
f(x;B) = \sup_{\theta\in B} f(x;\theta).
\]
We assume that $f(x;B)$ is a measurable function of $x$ for all $B$ under consideration. We can then generalize the above theorem as follows.

Theorem 7.2. Let $X_1, \dots, X_n$ be an i.i.d. sample from the distribution family $\{f(x;\theta): \theta\in\Theta\}$, and suppose $\Theta = \cup_{j=0}^k B_j$ for some finite $k$. Assume that the true parameter value $\theta = \theta_0 \in B_0$ and that
\[
E|\log f(X;B_j)| < E[\log f(X;\theta_0)] \qquad (7.3)
\]
for $j = 1, 2, \dots, k$. Then, the MLE $\hat\theta \in B_0$ almost surely.
7.2 Trivial consistency for one-dimensional θ
Consider the situation where we have a set of i.i.d. observations from a one-dimensional parametric family $\{f(x;\theta): \theta\in\Theta\subset\mathbb{R}\}$. The log likelihood function remains the same:
\[
\ell_n(\theta) = \sum_{i=1}^n \log f(x_i;\theta).
\]
We likely have defined the score function earlier; given i.i.d. observations, it is
\[
S_n(\theta; x) = \sum_{i=1}^n \frac{\partial \log f(x_i;\theta)}{\partial\theta}.
\]
We will use the plain $S(\theta; x)$ when $x$ is regarded as a single observation. We may be sloppy and use the notation $E\{S(\theta)\}$, in which $x$ has to be interpreted as the random variable $X$ whose distribution is $f(x;\theta)$, with the same $\theta$ in $S$ and $f$.
Let us put down a few regularity conditions. They are not the most general, but they suffice in the current situation.

R0: The parameter space of $\theta$ is an open subset of $\mathbb{R}$.

R1: $f(x;\theta)$ is differentiable to order three with respect to $\theta$ at all $x$.

R2: For each $\theta_0 \in \Theta$, there exist functions $g(x)$ and $H(x)$ such that for all $\theta$ in a neighborhood $N(\theta_0)$, the bounds
\[
\Big|\frac{\partial f(x;\theta)}{\partial\theta}\Big| \le g(x); \qquad
\Big|\frac{\partial^2 f(x;\theta)}{\partial\theta^2}\Big| \le g(x); \qquad
\Big|\frac{\partial^3 \log f(x;\theta)}{\partial\theta^3}\Big| \le H(x)
\]
hold for all $x$, with
\[
\int g(x)\,dx < \infty; \qquad E_0\{H(X)\} < \infty.
\]

R3: For each $\theta \in \Theta$,
\[
0 < E_\theta\Big\{\frac{\partial \log f(x;\theta)}{\partial\theta}\Big\}^2 < \infty.
\]

Although the integration is stated as with respect to $dx$, the results we are going to state remain valid if it is replaced by some $\sigma$-finite measure. For instance, the result is applicable to the MLE under a Poisson model, where $dx$ must be replaced by summation over the non-negative integers. All conditions are stated as holding for all $x$; an exception over a set of measure 0 is allowed, as long as this null set is the same for all $\theta \in \Theta$.
Lemma 7.1. Under the regularity conditions, we have
(1)
\[
E\Big\{\frac{\partial \log f(X;\theta)}{\partial\theta};\ \theta\Big\} = 0;
\]
(2)
\[
E\Big\{\frac{\partial \log f(X;\theta)}{\partial\theta}\Big\}^2
= -E\Big\{\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\Big\} = I(\theta).
\]
Proof. We first remark that the first result is the same as stating $E\{S(\theta)\} = 0$. Its proof is based on the fact that
\[
\int f(x;\theta)\,dx = 1.
\]
Taking the derivative with respect to $\theta$ on both sides, permitting the exchange of derivative and integration under regularity condition R2, and expressing the result properly, we get result (1).

To prove (2), notice that
\[
\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}
= \frac{f''(X;\theta)}{f(X;\theta)} - \Big\{\frac{f'(X;\theta)}{f(X;\theta)}\Big\}^2.
\]
The result is obtained by taking expectations on both sides and using the fact that
\[
E\Big\{\frac{f''(X;\theta)}{f(X;\theta)}\Big\} = \int f''(x;\theta)\,dx = 0.
\]
This completes the proof.
We now give a simple consistency proof when $\theta$ is one-dimensional.

Theorem 7.3. Given an i.i.d. sample of size $n$ from a one-parameter model $\{f(x;\theta): \theta\in\Theta\subset\mathbb{R}\}$, suppose $\theta^*$ is the true parameter value. Under Conditions R0-R3, there exists a sequence $\hat\theta_n$ such that
(i) $S_n(\hat\theta_n) = 0$ almost surely;
(ii) $\hat\theta_n \to \theta^*$ almost surely.

Proof. (i) As a function of $\theta$, $E\{S(\theta)\}$ has derivative equal to $-I(\theta^*)$ at $\theta = \theta^*$. Hence, it is a decreasing function at $\theta^*$. This implies the existence of a sufficiently small $\epsilon > 0$ such that
\[
E\{S(\theta^*+\epsilon)\} < 0 < E\{S(\theta^*-\epsilon)\}.
\]
By the law of large numbers, we have
\[
n^{-1}S_n(\theta^*\pm\epsilon) \xrightarrow{a.s.} E\{S(\theta^*\pm\epsilon)\}.
\]
Hence, almost surely, we have
\[
S_n(\theta^*+\epsilon) < 0 < S_n(\theta^*-\epsilon).
\]
By the intermediate value theorem, there exists a $\hat\theta \in (\theta^*-\epsilon, \theta^*+\epsilon)$ such that
\[
S_n(\hat\theta) = 0.
\]
This proves (i).

(ii) is a direct consequence of (i), as $\epsilon$ can be made arbitrarily small.
7.3 Asymptotic normality of MLE after the
consistency is established
Under the assumption that $f(x;\theta)$ is smooth and the MLE $\hat\theta$ is a consistent estimator of $\theta$, we must have
\[
S_n(\hat\theta) = 0.
\]
By the mean value theorem of mathematical analysis,
\[
S_n(\theta^*) = S_n(\hat\theta) + S_n'(\tilde\theta)(\theta^* - \hat\theta),
\]
where $\tilde\theta$ is a parameter value between $\theta^*$ and $\hat\theta$. By the result in the last lemma, we have
\[
n^{-1}S_n'(\tilde\theta) \to -I(\theta^*),
\]
the negative of the Fisher information, almost surely. In addition, the classical central limit theorem implies
\[
n^{-1/2}S_n(\theta^*) \to N(0, I(\theta^*))
\]
in distribution. Thus, by Slutsky's theorem, we find
\[
\sqrt n(\hat\theta - \theta^*) = n^{-1/2}I^{-1}(\theta^*)S_n(\theta^*) + o_p(1) \to N(0, I^{-1}(\theta^*))
\]
in distribution as $n \to \infty$.

Many users, including statisticians, ignore the regularity conditions. Indeed, they are satisfied by most commonly used models. If one does not bother with full rigour, he or she should at least make sure that the parameter value under consideration is an interior point and that the likelihood function is smooth enough. If the data set does not have an i.i.d. structure, one should make sure that some form of uniformity holds.
7.4 Asymptotic efficiency, super-efficient, one-
step update scheme
By the Cramér-Rao information inequality, given i.i.d. data from a sufficiently regular model, we have
var(θ̂n) ≥ Iₙ⁻¹(θ∗)
for any unbiased estimator θ̂n. The MLE under regularity conditions has asymptotic variance I⁻¹(θ∗) at rate √n. Loosely speaking, the above inequality becomes an equality for the MLE. Hence, the MLE is “efficient”: no other estimator can achieve a lower asymptotic variance.
Let us point out that the strict interpretation of asymptotic efficiency is not correct. Suppose we have a set of i.i.d. observations from N(θ, 1). The MLE of θ is X̄n. Clearly, if θ∗ is the true value, we have
√n(X̄n − θ∗) → N(0, 1) in distribution.
Can we do better than the MLE? Let
θ̃n = 0 if |X̄n| ≤ n^(−1/4), and θ̃n = X̄n otherwise.
When the true value θ∗ = 0, then
pr(|X̄n| ≤ n^(−1/4)) → 1
as n → ∞. Hence,
√n(θ̃n − θ∗) → N(0, 0) in distribution,
with asymptotic variance 0 at rate √n.
When the true value θ∗ ≠ 0, then
pr(|X̄n| ≤ n^(−1/4)) → 0,
which implies
pr(θ̃n = X̄n) → 1.
Consequently,
√n(θ̃n − θ∗) → N(0, 1) in distribution.
What have we seen? If θ∗ ≠ 0, then θ̃n has the same limiting distribution as X̄n at the same rate, so they have the same asymptotic efficiency. When θ∗ = 0, the asymptotic variance of θ̃n is 0, which is smaller than that of X̄n (at rate √n). It appears that the unattractive θ̃n is superior to the MLE in this example.
Is there any way to discredit θ̃n? Statisticians find that if θ∗ = n^(−1/4), namely a value that changes with n, then the variance of √n θ̃n goes to infinity while that of √n X̄n remains the same. It is a good exercise to compute its variance in this specific case.
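The breakdown at the drifting value θ∗ = n^(−1/4) is easy to see in simulation. A Python sketch (the text's own exercises use R; sample sizes are those of assignment problem 1):

```python
import numpy as np

# Simulate the super-efficient estimator at theta* = n^{-1/4}: the scaled MSE
# n * E{(theta_tilde - theta*)^2} blows up while that of X_bar stays near 1.
rng = np.random.default_rng(2)
n, reps = 1600, 2000
theta_star = n ** (-0.25)

xbar = rng.normal(theta_star, 1.0, size=(reps, n)).mean(axis=1)
theta_tilde = np.where(np.abs(xbar) <= n ** (-0.25), 0.0, xbar)

mse_bar = n * np.mean((xbar - theta_star) ** 2)          # approximately 1
mse_tilde = n * np.mean((theta_tilde - theta_star) ** 2)  # much larger
```

With n = 1600, θ̃n equals 0 about half the time, and each such replicate contributes n·θ∗² = √n = 40 to the scaled MSE, so mse_tilde is far above mse_bar.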
If some performance uniformity in θ is required, the MLE is the one
with the lowest asymptotic variance. Hence, the MLE is generally referred
to as asymptotically efficient under regularity conditions, or simply
asymptotically optimal.
Estimators such as θ˜n are called super-efficient estimators. Their existence
makes us think harder. We do not recommend these estimators.
If one estimator θ̂1 has asymptotic variance σ₁² and another θ̂2 has asymptotic variance σ₂² at the same rate, and both are asymptotically unbiased, then the relative efficiency of θ̂1 against θ̂2 is defined as σ₂²/σ₁². A higher ratio implies higher relative efficiency. This definition is no longer emphasized in contemporary textbooks.
Suppose θ̃n is not asymptotically efficient. However, it is good enough that for any ε > 0, we have
pr{n^(1/4)|θ̃n − θ| ≥ ε} → 0
as n → ∞. Let
θ̂n = θ̃n − ℓ′n(θ̃n)/ℓ′′n(θ̃n)
in the apparent notation. Under regularity conditions, it can be shown that
√n(θ̂n − θ∗) → N(0, I⁻¹(θ∗)) in distribution.
Namely, the Newton-Raphson update formula can turn an ordinary estimator
into an asymptotically efficient estimator easily.
Suppose we have a set of i.i.d. observations from the Cauchy distribution with location parameter θ. Under this setting, the score equation has multiple solutions, so it is not straightforward to obtain the MLE in applications. One way to avoid this problem is to estimate θ by the sample median, which is not optimal. The above updating formula can then be used to get an asymptotically efficient (optimal) estimator. Let us leave it as an exercise problem.
7.5 Assignment problems
1. Let X1, X2, . . . , Xn be a set of i.i.d. random variables from N(θ, 1), and let X̄n be the sample mean. Suppose θ∗ is the true value of the mean parameter θ. Let
θ̃n = 0 if |X̄n| ≤ n^(−1/4); θ̃n = X̄n otherwise.
(a) For θ∗ = n^(−1/4), which changes with n, show that
P(θ̃n = 0) → 0.5
as n → ∞.
(b) Under the same condition as in (a), show that the scaled MSE of θ̃n satisfies
nE{(θ̃n − θ∗)²} → ∞.
Hint: develop an inequality based on result (a).
(c) Use a computer to generate data of size n = 1600 from N(θ∗ = n^(−1/4), 1), and compute the values of θ̂ = X̄n and θ̃n. Repeat this N = 1000 times so that you have N pairs of these values. Compare their simulated total MSEs:
Σ_{k=1}^{N} (θ̂ₖ − θ∗)² and Σ_{k=1}^{N} (θ̃ₖ − θ∗)².
2. Let X1, X2, . . . , X2n+1 be an i.i.d. random sample from the Cauchy distribution with location parameter θ, whose density function is given by
f(x; θ) = 1/[π{1 + (x − θ)²}].
The sample median is given by the order statistic x(n+1).
(a) Show that the sample median satisfies
P(n^(1/4)|x(n+1) − θ| ≥ ε) → 0
for any ε > 0 as n → ∞.
Remark: directly proving this result is challenging for the inexperienced. Proving it by directly quoting an existing result is not satisfactory.
(b) Derive the explicit expression of the Newton-Raphson iteration for the Cauchy distribution.
(c) Simulate N = 1000 times with 2n + 1 = 201 and θ = 0, and obtain the total MSEs in the same way as in the last problem. Clearly present your results.
(d) Plot the histogram of the 1000 values of X(n+1). Do the same for the one-step Newton-Raphson estimator.
(e) Do these histograms support our asymptotic results on the MLE and on the median?
3. Derive the EM-iteration formulas for data from the two-component binomial mixture model
f(x; G) = (m choose x){π θ₁ˣ(1 − θ₁)^(m−x) + (1 − π) θ₂ˣ(1 − θ₂)^(m−x)}
under the setting of n i.i.d. observations with m ≥ 3. (The sizes n and m are not relevant in these formulas.)
Be sure to have the E-step and M-step clearly presented together with the corresponding Q function.
Chapter 8
Analysis of regression models
In this chapter, we investigate estimation problems when data are provided in the form
(yi, xi) : i = 1, 2, . . . , n. (8.1)
The range of y is R and the range of x is Rp. We call them the response variable and the explanatory variables (sometimes covariates). In many applications, such data are collected because the users believe a large proportion of the variability in y from independent trials can be explained by the variation in x. Often, we feel that they are linked via a regression relationship
yi = g(xi; θ) + σεi (8.2)
such that the error terms εi are uncorrelated with mean 0 and variance 1. In the current form of expression, it hints that an analytical form of g(x; θ) is specified. All that is left to statisticians is to decide the most “appropriate” value of θ on the specific occasion. The distributional information about ε may or may not be specified, depending on the specific circumstances. Factoring out σ in the error term may not always be most convenient for statistical discussion. We may choose to replace σεi by εi while allowing ε to have a variance different from 1.
The observations on the explanatory variable, xi, are either regarded as
chosen by scientists (users) so that their values are not random, or they are
independent samples from some population whose distribution is not related
to g(·) or θ. In addition, they are independent of ε.
The appropriateness of a regression model in specific applications will not be discussed in this course. We continue our discussion under the assumption that all the assumptions behind (8.2) are solid.
It is generally convenient to use matrix notation here. We define and denote the covariate matrix as
Xn =
x11 x12 · · · x1p
x21 x22 · · · x2p
· · ·
xn1 xn2 · · · xnp
whose rows are xτ1, xτ2, . . . , xτn and whose columns are the explanatory variables, so that Xn = (X1, X2, . . . , Xp).
We define the design matrix as
Zn = (1, X1, X2, . . . , Xp),
which is the covariate matrix supplemented by a column of ones.
We also use boldfaced y and ε for the column vectors of length n of response values and error terms. When necessary, we use yn, Xn with subindex n to highlight the sample size n. Be cautious that X3 stands for the column vector of the third explanatory variable, not the covariate matrix when n = 3. We trust that such abuses will not cause much confusion, though they are mathematically not rigorous.
8.1 Least absolute deviation and least squares estimators
Suppose we are given a data set in the form of (8.1) and we are asked to
use the data to fit model (8.2). Let us look into the problem of how to best
estimate θ and σ. We do not discuss issues such as the fitness of the function g(·) or the distribution of ε.
There are many potential approaches for estimating θ. One is to select the θ value such that the average difference between yi and g(xi; θ) is minimized. To implement this idea, one may come up with many potential distances. The absolute difference is one of the favourites. With this choice, we define
Mn(θ) = Σ_{i=1}^{n} |yi − g(xi; θ)|
and find the corresponding M-estimator for θ. This estimator is generally called the least absolute deviation estimator. A disadvantage of this approach is the inconvenience of working with the absolute value function both analytically and numerically.
A more convenient choice is
Mn(θ) = Σ_{i=1}^{n} {yi − g(xi; θ)}².
The resultant estimator is called the least squares estimator.
We may place a parametric distributional assumption on ε. If ε has the standard normal N(0, 1) distribution, then the MLE of θ equals the least squares estimator. If ε has the double exponential distribution with density function
f(u) = (1/2) exp{−|u|},
then the least absolute deviation estimator is also the MLE under this model.
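The two criteria can be contrasted in the simplest location model g(x; θ) = θ, where minimizing the sum of squares gives the sample mean and minimizing the sum of absolute deviations gives the sample median. A Python sketch (the grid search below is for illustration only; closed forms exist):

```python
import numpy as np

# Location model y_i = theta + e_i with double exponential errors.
# Minimize both M-functions over a grid of candidate theta values.
rng = np.random.default_rng(4)
y = rng.laplace(loc=1.0, scale=1.0, size=501)

grid = np.linspace(-2.0, 4.0, 6001)
ls_loss = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)   # least squares
lad_loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)   # least absolute dev.

ls_argmin = grid[np.argmin(ls_loss)]    # approximately the sample mean
lad_argmin = grid[np.argmin(lad_loss)]  # approximately the sample median
```

The minimizers land (up to the grid spacing) on the sample mean and the sample median, respectively.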
8.2 Linear regression model
The linear regression model is a special signal plus error model. In this case, the regression function E(Y |X = x) has a specific form:
E(Y |X = x) = g(x; θ) = β0 + β1x1 + · · · + βpxp.
We can write it in vector form with zτ = (1, xτ) as
g(x; θ) = zτβ (8.3)
which is linear in the regression coefficient β = (β0, β1, . . . , βp)τ. While we
generally prefer to include β0 in most applications, this is not a mathematical
necessity. In some applications, the scientific principle may seriously demand
a model with β0 = 0. Luckily, even though the subsequent developments will be based on z, which implies β0 is part of the model, most theoretical results remain valid when z is reduced to x so that β0 = 0 is enforced. We will not rewrite the same results twice for this reason.
We have boldfaced two terms without formally defining them. It is worth emphasizing here that the model is linear not because the regression function g(x; θ) is linear in x, but because it is linear in θ, which is denoted as β here. In applications, we may use x1 for some explanatory variable such as dosage and include x2 = log(x1) as another explanatory variable in the linear model. In this case, the linear regression model has a regression function g(x; θ) that is not linear in x1.
Suppose we have n independent observations from regression model (8.2) with linear regression function (8.3). One way to estimate the regression coefficient vector is by least squares. The M-function now has the form
Mn(β) = (yn − Znβ)τ(yn − Znβ) = Σ_{i=1}^{n} (yi − zτiβ)². (8.4)
For the linear regression model, there is an explicit solution to the least squares problem in neat matrix notation.
Theorem 8.1. Suppose (yi, xi) are observations from the linear regression model (8.2) with g(x; θ) given by (8.3). The solution to the least squares problem defined by (8.4) is given by
β̂n = (ZτnZn)⁻¹Zτnyn (8.5)
if ZτnZn has full rank.
If ZτnZn does not have full rank, one solution to the least squares problem is given by
β̂n = (ZτnZn)⁻Zτnyn,
where A⁻ denotes a specific generalized inverse of A.
Remark: the statement hints that if ZτnZn does not have full rank, the solution is not unique. However, we will not discuss this in detail.
Proof. We only give a proof when ZτnZn has full rank. It is seen that
Mn(β) = {(yn − Znβ̂) + Zn(β̂ − β)}τ{(yn − Znβ̂) + Zn(β̂ − β)}
= (yn − Znβ̂)τ(yn − Znβ̂) + (β̂ − β)τ(ZτnZn)(β̂ − β)
≥ (yn − Znβ̂)τ(yn − Znβ̂).
The lower bound implied by the above inequality is attained when β = β̂. Hence, β̂ is the solution to the least squares problem.
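Formula (8.5) is immediate to compute. A Python sketch (the coefficients and design below are invented for illustration): the estimator solves the normal equations ZτZ β = Zτy, which numerically is better done with a linear solver than with an explicit inverse.

```python
import numpy as np

# Least squares estimator (8.5): beta_hat = (Z'Z)^{-1} Z'y, computed by
# solving the normal equations, and cross-checked against np.linalg.lstsq.
rng = np.random.default_rng(5)
n, p = 200, 3
X = rng.normal(size=(n, p))                  # covariate matrix X_n
Z = np.column_stack([np.ones(n), X])         # design matrix Z_n: prepend 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # invented true coefficients
y = Z @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_lstsq = np.linalg.lstsq(Z, y, rcond=None)[0]
```

Both computations return the same vector, close to the coefficients used to generate the data.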
Let β̂n be the least squares estimator of β, and let β be the true value of the parameter without giving it a special notation. We find
E{β̂n|Xn} = (ZτnZn)⁻¹Zτn{Znβ} = β.
Hence, β̂n is an unbiased estimator of the regression coefficient vector. Notice that this conclusion is obtained under the assumption that x and ε are independent. Also notice that we assumed ε has zero mean and constant variance, but placed no assumption on its distribution. Putting the additive error terms in the form σεn, we have
β̂n − β = σ(ZτnZn)⁻¹Zτnεn.
Hence,
var(β̂n) = σ²(ZτnZn)⁻¹.
Because we made a distinction between the covariate matrix Xn and the
design matrix Zn, the above expression may appear different from those in
standard textbooks.
With β estimated by β̂, it is natural to regard
ŷn = Znβ̂n = Hnyn
as the estimated value of yn, where the hat matrix
Hn = Zn(ZτnZn)⁻¹Zτn.
In fact, we call ŷn the fitted value(s). How closely does ŷn match yn? The residual of the fit is given by
ε̂n = (In − Hn)yn = σ(In − Hn)εn.
One can easily verify that Hn and In − Hn are symmetric and idempotent, and that (In − Hn)Zn = 0. From a geometric angle, Hn is a projection matrix. The operation Hnyn projects yn into the linear space spanned by the columns of Zn. Naturally, (In − Hn)yn is the projection of yn into the linear space orthogonal to it. This leads to a decomposition of the sum of squares:
yτnyn = yτnHnyn + yτn(In − Hn)yn.
The second term is the “residual sum of squares”. It is an easy exercise to prove that
yτn(In − Hn)yn = ε̂τnε̂n.
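The projection properties just listed can be verified numerically. A Python sketch with an invented small design, checking symmetry, idempotence, (In − Hn)Zn = 0, and the sum-of-squares decomposition:

```python
import numpy as np

# Numerical check of the hat matrix properties for an arbitrary design.
rng = np.random.default_rng(6)
n = 50
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # design matrix
y = rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)    # hat matrix Z (Z'Z)^{-1} Z'
M = np.eye(n) - H                        # projection onto the orthogonal space

sym_ok = np.allclose(H, H.T)             # H symmetric
idem_ok = np.allclose(H @ H, H)          # H idempotent
annih_ok = np.allclose(M @ Z, 0.0)       # (I - H) Z = 0
decomp_ok = np.isclose(y @ y, y @ H @ y + y @ M @ y)  # sum-of-squares split
```

All four checks hold up to floating-point rounding, for any y and any full-rank Z.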
We directly verified that β̂ solves the least squares problem. One may also derive this result by searching for solutions to
∂Mn(β)/∂β = 0,
that is, the normal equations
Zτn{yn − Znβ} = 0.
We again leave this as an easy exercise.
We have seen that the least squares estimator β̂n has a few neat properties. Yet we cannot help asking: can we find other, superior estimators? The answer is no, at least in one respect. The least squares estimator has the lowest variance among all unbiased linear estimators of β. A linear estimator is defined as one that can be written as a linear combination of the yi; that is, it must be expressible in the form Ayn for some matrix A not dependent on yn.
Theorem 8.2 (Gauss-Markov Theorem). Let β̂n be the least squares estimator, and let
β̃n = Ayn
for some nonrandom matrix A (which may depend on Xn) be an unbiased linear estimator of β under the linear regression model with n independent observations. Then
var(β̃n) − var(β̂n) ≥ 0.
Proof. Suppose Ayn is unbiased for β. We must have
E(Ayn) = AZnβ = β
for any β. Hence, we must have AZn = Ip+1. This implies
var(β̃n − β̂n) = σ²{A − (ZτnZn)⁻¹Zτn}{A − (ZτnZn)⁻¹Zτn}τ = var(β̃n) − var(β̂n).
Because the variance matrix of any random vector is non-negative definite, we must have
var(β̃n) − var(β̂n) ≥ 0.
An estimator which is linear in the data and unbiased for the target parameter is called the best linear unbiased estimator (BLUE) if it has the lowest possible variance matrix.
Not only is the least squares estimator β̂ BLUE for β, but bτβ̂ is BLUE for bτβ for any non-random vector b.
At the same time, be aware that if we have additional information about the distribution of εn in the linear model, then we may obtain a more efficient estimator for β, but that estimator is either not linear or not unbiased.
8.3 Local kernel polynomial method
Naturally, a linear regression model is not always appropriate in applications,
but we may still believe a signal plus noise relationship is sound. In this sec-
tion, we consider the situation where the regression function g(x) is smooth
in x, but we are unwilling to place more restrictions on it. At the same time,
we only study the simple situation where x is a univariate covariate.
Suppose we wish to estimate g(x) at some specific value x∗. By definition, g(x∗) = E(Y |X = x∗). Suppose that among the n observations {(yi, xi)}, i = 1, . . . , n, we collected, there are many xi values such that xi = x∗. The average of the corresponding yi would be a good estimate of g(x∗). In reality, there may not be any xi equalling x∗ exactly, so this idea does not work. On the other hand, when n is very large, there might be many xi which are very close to x∗. Hence, the average of the corresponding yi should be a sensible estimate of g(x∗). To make use of this idea, one must decide how close is close enough. Even within a small neighbourhood, should we merely use a constant, rather than some other smooth function of x, to approximate g(x)?
For any u close enough to x (rather than x∗, for notational simplicity) and some positive integer p, when g(x) is sufficiently smooth at x, we have
g(u) ≈ g(x) + g′(x)(u − x) + · · · + (1/p!)g⁽ᵖ⁾(x)(u − x)ᵖ.
Let
β0 = g(x), β1 = g′(x), . . . , βp = (1/p!)g⁽ᵖ⁾(x).
Then the approximation can be written as
g(u) ≈ β0 + β1(u − x) + · · · + βp(u − x)ᵖ.
Note that at u = x, we have g(x) ≈ β0.
Suppose that for some h > 0, g(u) perfectly coincides with the above polynomial function for u ∈ [x − h, x + h]. If so, within this region, we have a linear regression model with regression coefficient βx. A natural approach to estimating this local βx is least squares:
β̂x = arg min_β Σ_{i=1}^{n} 1(|xi − x| ≤ h){yi − zτiβ}²
where
zi = {1, (xi − x), (xi − x)², . . . , (xi − x)ᵖ}τ.
Note again that zi is defined dependent on x-value, the location at which
g(x) is being estimated.
Note that we have added a subindex x to β. This is helpful because this vector is specific to the regression function g(u) at u = x. When we change the target from u = x1 to u = x2 ≠ x1, we must refit the data and obtain the β specific to u = x2. We repeatedly state this to emphasize the local nature of the current approach.
The above formulation implies that the ith observation is excluded even if |xi − x| is only slightly larger than h. At the same time, all observations with |xi − x| ≤ h are treated equally. This does not seem right to our intuition. One way to avoid this problem is to replace the indicator function by a general kernel function K(x), often selected to satisfy the following properties:
1. K(x) ≥ 0;
2. ∫K(x)dx = 1;
3. K(x) = K(−x), that is, K(x) is a symmetric function.
For instance, the density function φ(x) of N(0, 1) has these properties. In fact, any symmetric density function does.
Let Kh(x) = h⁻¹K(x/h). We now define the local polynomial kernel estimator of βx as
β̂x = arg min_β Σ_{i=1}^{n} Kh(xi − x){yi − zτiβ}².
An explicit solution to the above optimization problem is readily available using matrix notation. Let y be the response vector, define the design matrix
Zx =
1 x1 − x · · · (x1 − x)ᵖ
· · ·
1 xn − x · · · (xn − x)ᵖ
and the weight matrix
Wx = diag{Kh(x1 − x), Kh(x2 − x), . . . , Kh(xn − x)}.
The M-function can then be written as
Mn(β) = (y − Zxβ)τWx(y − Zxβ).
It is an easy exercise to show that the solution is given by
β̂x = (ZτxWxZx)⁻¹ZτxWxy.
Let ej be the (p + 1) × 1 vector whose jth element is 1 and all other elements are 0, j = 1, . . . , p + 1. Then we estimate g(x) by
ĝ(x) = β̂0 = eτ1(ZτxWxZx)⁻¹ZτxWxy
where βˆ0 is the first element of βˆx.
Remark: notationally, the above local polynomial kernel estimator remains the same for any choice of p.
Suppose g(x) is differentiable up to order p. Then, for k = 1, . . . , p, we estimate the kth derivative g⁽ᵏ⁾(x) by
ĝ⁽ᵏ⁾(x) = k! β̂k = k! eτ_{k+1}(ZτxWxZx)⁻¹ZτxWxy.
When we decide to use p = 0 in this approach, the estimator ĝ(x) becomes
ĝ(x) = Σ_{i=1}^{n} Kh(xi − x)yi / Σ_{i=1}^{n} Kh(xi − x),
which is known as the local constant kernel estimator, the kernel regression estimator, or the Nadaraya-Watson estimator. This estimator can be motivated by regarding g(u) as a constant function in a small neighbourhood of x: u ∈ [x − h, x + h] for some sufficiently small h. The estimator is the weighted average of the response values whose x is within a small neighbourhood of x.
When we decide to use p = 1 in this approach, the estimator is called the
local linear kernel estimator of g(x).
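Both estimators can be sketched in a few lines. The following is in Python (the text's own code is in R); the data and bandwidth are invented, and the two sanity checks exploit exact reproduction properties: the Nadaraya-Watson estimator reproduces constant responses exactly, and the local linear estimator reproduces linear responses exactly, for any bandwidth.

```python
import numpy as np

# K is the standard normal density; K_h(u) = K(u/h)/h.
def K_h(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def nw(x0, x, y, h):
    # Nadaraya-Watson (p = 0): weighted average of nearby responses.
    w = K_h(x - x0, h)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, h):
    # Local linear (p = 1): weighted least squares fit, return beta_0.
    Z = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(K_h(x - x0, h))
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return beta[0]

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, size=300)

g_nw = nw(0.5, x, np.full_like(x, 3.0), h=0.2)         # constant responses
g_ll = local_linear(0.5, x, 2.0 + 3.0 * x, h=0.2)      # linear responses
```

Here g_nw recovers the constant 3 and g_ll recovers 2 + 3(0.5) = 3.5 up to rounding.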
Before this estimator is applied to any specific data, we must make a
choice on the kernel function K, the degree of the polynomial p and the
bandwidth h. We now go over these issues.
Choice of K(x).
The choice of kernel function K(x) is not crucial. Other than that it should have a few desired properties, its specific form does not markedly change the variance or bias of ĝ(x). In our future examples, we will mostly use the normal density function. Clearly, the normal density function has the three listed properties.
Choice of p.
For a given bandwidth h and kernel K(x), a large value of p would expectedly reduce the bias of the estimator because the local approximation becomes more and more accurate as p increases. At the same time, when p is large, we have more parameters to estimate, as reflected in the dimension of β. Hence, the variance of the estimator will increase and there will be a larger computational cost.
Fan and Gijbels (1996) showed that when the degree of the polynomial
employed increases from p = k + 2q to p = k + 2q + 1 for estimating g(k)(x),
the variance does not increase. However, if we increase the degree from
p = k + 2q + 1 to p = k + 2q + 2, the variance increases. Therefore for
estimating g(k)(x), it is beneficial to use a degree p such that p − k is odd.
Since bandwidth h also controls the bias and variance trade-off of g(k)(x),
they recommended the lowest odd order for p − k, namely p = k + 1, or
occasionally p = k+ 3. For the regression function itself, they recommended
local linear kernel estimator (i.e. p = 1) instead of the Nadaraya-Watson
estimator (i.e. p = 0).
To better understand the above information, we summarize the local linear kernel estimator and the Nadaraya-Watson estimator here. Let them be denoted as ĝll(x) and ĝnw(x), respectively. We have
ĝnw(x) = Σ_{i=1}^{n} Kh(xi − x)yi / Σ_{i=1}^{n} Kh(xi − x);
ĝll(x) = β̂0 = arg min_{β0} [min_{β1} Σ_{i=1}^{n} Kh(xi − x){yi − β0 − β1(xi − x)}²].
Under the regression model assumption that
yi = g(xi) + σεi,
for random xi with density function f(x), and under many conditions regulating f(x), g(x), and the distribution of ε, we have
E{ĝnw(x)|x} ≈ g(x) + 0.5h²µ2(K){g′′(x) + 2f′(x)g′(x)/f(x)};
E{ĝll(x)|x} ≈ g(x) + 0.5h²g′′(x)µ2(K);
var{ĝnw(x)|x} ≈ σ²R(K)/{nhf(x)};
var{ĝll(x)|x} ≈ σ²R(K)/{nhf(x)},
where µ2(K) and R(K) are some positive constants depending on the kernel function K.
The above results show that the local linear kernel estimator ĝll(x) and the Nadaraya-Watson estimator ĝnw(x) have the same asymptotic variance conditional on x, which is the conclusion that we discussed before. The asymptotic bias of ĝnw(x) has an extra term 2f′(x)g′(x)µ2(K)h²/f(x). The factor 2f′(x)g′(x)/f(x) is also called the design bias because it depends on the design, namely, the distribution of x. This implies that the bias is sensitive to the positions of the design points xi. Note that f′(x)/f(x) can have high influence on the bias when x is close to the boundary. For example, when the design points xi have the standard normal distribution, |f′(x)/f(x)| = |x|, which is very large when x approaches ∞. Hence 2f′(x)g′(x)/f(x) is also known as the boundary bias. These two biases are reduced by using the local linear kernel estimator. In summary, the local linear kernel estimator is free from the design and boundary biases, but the Nadaraya-Watson estimator is not.
Choice of bandwidth h.
Suppose we have chosen the kernel function K(x) and p. We now discuss the choice of bandwidth h. The bandwidth plays a very important role in estimating the regression function g(x).
First, as h increases, the local approximation becomes worse and worse, and hence the bias of the local polynomial kernel estimator increases. On the other hand, more and more observations are included in estimating g(x), hence the variance of the local polynomial kernel estimator decreases. A good choice of bandwidth helps to balance the bias and variance. Second, as h increases, the local polynomial kernel estimate becomes smoother and smoother. This can be observed in Figure 8.1, in which we compare the Nadaraya-Watson estimates of g(x) constructed when the bandwidth h takes the three values 0.1, 1, and 4. Conceptually, the number of parameters required to describe the curve decreases. In this sense, h controls the model complexity. We should choose a bandwidth that balances model fitting and model complexity.
Figure 8.1: Motorcycle data: Nadaraya-Watson estimates of g(x) with normal kernel and bandwidths 0.1, 1, and 4 (x-axis: times in milliseconds after impact; y-axis: acceleration in g).
We introduce two bandwidth selection methods here: leave-one-out cross-validation (cv) and generalized cross-validation (gcv). These two methods are also widely used in other regression problems.
The idea of leave-one-out cv is as follows. Recall that one purpose of fitting a regression model is to predict the response value in a new trial, so a reasonable choice of h should result in a small prediction error. Unfortunately, we do not know the true response, and therefore we cannot know how good the prediction ĝ(x) is for a given h. The idea of cross-validation is to first delete one observation from the data set, treat the remaining n − 1 observations as the training data set and the deleted observation as the testing data, and then test the goodness of the prediction for the testing observation using the training data set. We repeat this process for all observations and obtain the prediction errors for all observations. We choose h by minimizing the sum of prediction errors. Mathematically, let ĝ₋ᵢ(xi) be the estimate of g(xi) based on the n − 1 observations without the ith. For a given h, the cv score is defined as
cv(h) = Σ_{i=1}^{n} {yi − ĝ₋ᵢ(xi)}².
The optimal h based on the leave-one-out cross-validation idea is
hcv = arg min_h cv(h).
It seems that it might be time consuming to evaluate cv(h), since we apparently need to recompute the estimate after dropping each observation. Fortunately, there is a shortcut formula for computing cv(h). Let
l(x) = (l1(x), . . . , ln(x)) = eτ1(ZτxWxZx)⁻¹ZτxWx.
Then
ĝ(x) = Σ_{j=1}^{n} lj(x)yj and ĝ(xi) = Σ_{j=1}^{n} lj(xi)yj.
Define the fitted value vector
ŷ = (ŷ1, . . . , ŷn)τ = (ĝ(x1), . . . , ĝ(xn))τ.
It then follows that
ŷ = Ly
where L is an n × n matrix whose ith row is l(xi); thus Lij = lj(xi) and Lii = li(xi). It can be shown that
cv(h) = Σ_{i=1}^{n} {(yi − ĝ(xi))/(1 − Lii)}².
We can minimize the above cv(h) to get hcv.
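The shortcut formula can be checked directly for the Nadaraya-Watson smoother (p = 0), where the identity is exact when the leave-one-out fit renormalizes the remaining kernel weights. A Python sketch with invented data:

```python
import numpy as np

# Compare brute-force leave-one-out cv(h) with the shortcut formula
# sum{(y_i - g_hat(x_i)) / (1 - L_ii)}^2 for the Nadaraya-Watson smoother.
def K_h(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(8)
n, h = 80, 0.15
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)

W = K_h(x[:, None] - x[None, :], h)       # W[i, j] = K_h(x_j - x_i)
L = W / W.sum(axis=1, keepdims=True)      # smoother matrix; rows sum to 1
y_hat = L @ y                             # fitted values g_hat(x_i)

# Brute force: drop observation i and renormalize the remaining weights.
cv_brute = 0.0
for i in range(n):
    w = np.delete(W[i], i)
    yi_hat = np.sum(w * np.delete(y, i)) / np.sum(w)
    cv_brute += (y[i] - yi_hat) ** 2

cv_shortcut = np.sum(((y - y_hat) / (1.0 - np.diag(L))) ** 2)
```

The two numbers agree to machine precision, so the shortcut avoids the n refits entirely.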
The second method for choosing h is called generalized cross-validation. Rather than minimizing cv(h), an alternative is to use an approximation called the generalized cross-validation (gcv) score, in which each Lii is replaced by its average v/n, where v = tr(L) = Σ_{i=1}^{n} Lii is called the effective degrees of freedom. Thus, we minimize the gcv score
gcv(h) = Σ_{i=1}^{n} {(yi − ĝ(xi))/(1 − v/n)}²
to obtain the bandwidth hgcv. That is,
hgcv = arg min_h gcv(h).
Usually hcv is quite close to hgcv.
In Appendix I, we include the R function bw.cv() to choose the bandwidth for the local polynomial kernel estimate for a continuous response. The source code is saved in bw cv.R. In this function, if the option cv=T, the cv method is used; if the option cv=F, the gcv method is used. The R function regCVBwSelC() in the R package locpol can also be used to obtain hcv for a continuous response. It gives the same result as bw.cv() with cv=T, and it is much faster. Figure 8.2 gives cv(h) and gcv(h) for p = 0, 1, with the normal kernel. (Remark by your instructor: these programs are not included.)
Similarly to kernel density estimation, Wand and Jones (1995) applied the idea of direct plug-in methods to bandwidth selection for the local linear kernel estimate. This idea is implemented in the R function dpill() in the package KernSmooth. I did not cover this idea because it is only applicable to the local linear kernel estimate, and it is more complicated to implement than the cv and gcv methods.
Figure 8.2: Motorcycle data: cv(h) and gcv(h) for p = 0, 1 with normal kernel (four panels: CV and GCV scores plotted against h ∈ [0.5, 2] for p = 0 and p = 1).
Applying the above-mentioned R functions, for p = 0, hcv = 0.914 and hgcv = 1.089; for p = 1, hcv = 1.476, hgcv = 1.570, and the direct plug-in gives hDPI = 1.445. Figure 8.3 gives the fitted curves of g(x) with p = 0, 1, in which the bandwidth is selected by cv or gcv; the normal kernel is used. The two curves for p = 0 are almost the same. The fitted curves for p = 1 with the bandwidths hcv, hgcv, and hDPI are almost the same, hence we only plot the curves with the bandwidths selected by cv and gcv. The four fitted curves are very close to each other; they do not show much difference when plotted in the same panel.
Properties of ĝ(x)
Let h be given. We have
E{ĝ(x)|x} ≈ g(x)
and
var{ĝ(x)|x} = σ²eτ1(ZτxWxZx)⁻¹(ZτxW²xZx)(ZτxWxZx)⁻¹e1.
Therefore the standard error is given by
se{ĝ(x)} = [σ̂²eτ1(ZτxWxZx)⁻¹(ZτxW²xZx)(ZτxWxZx)⁻¹e1]^(1/2),
where σ̂² is an estimator of σ². Wand and Jones (1995) suggested the following form for σ̂²:
σ̂² = Σ_{i=1}^{n} {yi − ĝ(xi)}² / (n − 2v + ṽ)
with
v = tr(L) = Σ_{i=1}^{n} Lii, ṽ = tr(LτL) = Σ_{i=1}^{n} Σ_{j=1}^{n} L²ij.
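This variance estimator can be sketched for a linear smoother. A Python illustration using the Nadaraya-Watson weights (p = 0) and invented data with true σ = 1, so that σ̂² should land near 1:

```python
import numpy as np

# Wand-Jones style variance estimator sigma2_hat = RSS / (n - 2v + v_tilde)
# for the Nadaraya-Watson smoother, on data with known sigma = 1.
def K_h(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(9)
n, h, sigma = 400, 0.05, 1.0
x = rng.uniform(0.0, 1.0, size=n)
g = np.cos(2.0 * np.pi * x)                  # smooth true regression function
y = g + rng.normal(scale=sigma, size=n)

W = K_h(x[:, None] - x[None, :], h)
L = W / W.sum(axis=1, keepdims=True)         # smoother matrix

rss = np.sum((y - L @ y) ** 2)               # residual sum of squares
v = np.trace(L)                              # effective degrees of freedom
v_tilde = np.sum(L ** 2)                     # tr(L'L)
sigma2_hat = rss / (n - 2.0 * v + v_tilde)
```

The denominator n − 2v + ṽ corrects the residual sum of squares for the degrees of freedom absorbed by the smoother.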
Figure 8.3: Motorcycle data: fitted curves for p = 0, 1 with normal kernel, in which the bandwidth is selected by cv or gcv (four panels: CV and GCV fits for p = 0 and p = 1; x-axis: times in milliseconds after impact; y-axis: acceleration in g).
8.4 Spline method
Let us again go back to model (8.2), but do not assume a parametric regression function g(x; θ). Instead, we only postulate that E(Y |X = x) = g(x) for some smooth function g(·). Suppose we try to estimate g(·) by the simplistic least squares approach without careful deliberation. The estimate would be the solution to the problem of minimizing
Σ_{i=1}^{n} {yi − g(xi)}².
If all xi values are different, the solution is given by any function ĝ such that ĝ(xi) = yi. Such a perfect fit clearly does not have any prediction power for a new observation whose covariate value does not equal any of the existing covariate values. Furthermore, if ĝ(x) just connects all the points formed by the observations, it lacks the smoothness we may expect.
If we require g(x) to be a linear function of x, then it is a very smooth function, but the fit is unsatisfactory if E(Y |X = x) is far from linear in x. One way to balance the need for smoothness and fit is to use a smoothing spline. Among all functions with two continuous derivatives, let us find the one that minimizes the penalized L2-loss function:
ĝλ(x) = arg min_{g(x)} [Σ_{i=1}^{n} {yi − g(xi)}² + λ∫{g′′(x)}²dx], (8.6)
for some positive tuning parameter λ, called the smoothing parameter. In the penalized L2-loss function, the first term measures the goodness of the model fit, while the second term penalizes curvature in the function. We will remain vague on the range of x.
When we use λ = 0, ĝλ(x) becomes the ordinary least squares estimator. The solution is not unique and has little prediction power.
When we use λ = ∞, the optimal solution must satisfy g′′(x) = 0 for all x. The solution must be linear in x, and we are back to the linear regression model and the associated least squares estimator.
Clearly, a good fit is possible by choosing a λ value between 0 and ∞, yielding a smooth function with a reasonable fit. Note that the above minimization is taken over all possible functions g(x), and such functions form an infinite-dimensional space. Remarkably, it can be shown that the solution ĝλ(x) to the penalized least squares problem is a natural cubic spline with knots at the unique values of {xi}^n_{i=1}. Here we consider the case where x is one-dimensional.
8.5 Cubic spline
We now give a brief introduction to the cubic spline. A cubic spline is a piecewise cubic polynomial: we partition the real line into a finite number of intervals, and on each interval the function is a polynomial in x of degree 3, with continuous derivatives at the partition points.

More precisely, suppose t1 < t2 < . . . < tk are k distinct real values, and set t0 = −∞ and tk+1 = ∞. Then s(x) is a cubic spline if:

1. It is a cubic function on each interval [ti, ti+1]:

si(x) = ai + bix + cix^2 + dix^3,   s(x) = ∑_{i=0}^k si(x) 1(ti < x ≤ ti+1).
2. s(x) and its first and second derivatives are continuous at the knots:

si(ti+1) = si+1(ti+1),   s′i(ti+1) = s′i+1(ti+1),   s′′i(ti+1) = s′′i+1(ti+1).
The connection values t1, . . . , tk are called the knots of the cubic spline. In
particular, t1 and tk are called the boundary knots, and t2, . . . , tk−1 are called
the interior knots.
Furthermore, if
3. s(x) is linear outside the interval [t1, tk]; that is,

s(x)1(x ≤ t1) = (a0 + b0x)1(x ≤ t1),   s(x)1(x ≥ tk) = (ak + bkx)1(x ≥ tk)

for some a0, b0, ak, bk,

we call s(x) a natural cubic spline with knots at t1, . . . , tk. Note that this also means c0 = d0 = ck = dk = 0.
The following result shows that there is a simpler way to express a cubic
spline.
Theorem 8.3. Any cubic spline s(x) with knots at {t1, . . . , tk} can be written as

s(x) = β0 + β1x + β2x^2 + β3x^3 + ∑_{j=1}^k βj+3 (x − tj)^3_+,   (8.7)

where (x)_+ = max(0, x), for some coefficients β0, . . . , βk+3.
In other words, the cubic spline is a member of the linear space with basis functions

1, x, x^2, x^3, (x − t1)^3_+, . . . , (x − tk)^3_+.
Proof. The function defined by (8.7) is clearly a cubic function on every interval [ti, ti+1]. We can also easily verify that its first two derivatives are continuous. This shows that such functions are cubic splines.
To prove this theorem, we further need to show that every cubic spline with
knots at {t1, . . . , tk} can be written in the form specified by (8.7).
Let g(x) be a cubic spline with knots at {t1, . . . , tk}. Denote γi = g′′(ti)
for i = 1, 2, . . . , k. We show that there exists a function s(x) in the form of
(8.7) such that
β3 = 0, βk+3 = 0,
and s′′(ti) = γi for i = 1, . . . , k.
If such a function exists, the remaining β values must satisfy
β2/3 = γ1/6;
β2/3 + β4(t2 − t1) = γ2/6;
β2/3 + β4(t3 − t1) + β5(t3 − t2) = γ3/6;
· · ·
β2/3 + β4(tk−1 − t1) + · · ·+ βk+1(tk−1 − tk−2) = γk−1/6;
β2/3 + β4(tk − t1) + · · ·+ βk+1(tk − tk−2) + βk+2(tk − tk−1) = γk/6;
Taking differences, we find another set of equations whose solutions clearly
exist:
β4 = (1/6)(γ2 − γ1)/(t2 − t1);
β4 + β5 = (1/6)(γ3 − γ2)/(t3 − t2);
β4 + β5 + β6 = (1/6)(γ4 − γ3)/(t4 − t3);
· · ·
β4 + β5 + · · ·+ βk+2 = (1/6)(γk − γk−1)/(tk − tk−1).
The solution s(x) just obtained, with any choice of β0 and β1, has the same second derivative as the cubic spline g(x) at {t1, t2, . . . , tk}. Now we can select β0 and β1 such that s(t1) = g(t1) and s′(t1) = g′(t1). Together with s′′(t1) = g′′(t1) and s′′(t2) = g′′(t2), and since both are cubic functions on [t1, t2], we must have s(x) = g(x) for all x ∈ [t1, t2]. Applying the same argument interval by interval, they must be identical over [t1, tk]. This proves the existence.
As a remark, there can be multiple cubic splines identical on [t1, tk] but
different outside this interval.
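The representation (8.7) is easy to probe numerically. The sketch below (plain Python; the knots and coefficients are arbitrary choices of mine, not from the text) builds a function from the truncated power basis and checks that s, s′ and s′′ match across a knot, i.e., the function is twice continuously differentiable as Theorem 8.3 asserts.

```python
# Numerical sanity check of Theorem 8.3: a function built from the basis
# 1, x, x^2, x^3, (x - t_j)_+^3 is twice continuously differentiable.
knots = [0.3, 0.5, 0.8]                       # illustrative knots t_1 < t_2 < t_3
beta = [1.0, -2.0, 0.5, 1.5, 2.0, -3.0, 4.0]  # arbitrary beta_0 .. beta_{k+3}

def s(x):
    # s(x) = sum_m beta_m x^m + sum_j beta_{j+3} (x - t_j)_+^3
    val = sum(beta[m] * x ** m for m in range(4))
    val += sum(beta[j + 4] * max(x - t, 0.0) ** 3 for j, t in enumerate(knots))
    return val

def d1(x, h=1e-5):
    # first derivative by central difference
    return (s(x + h) - s(x - h)) / (2 * h)

def d2(x, h=1e-4):
    # second derivative by central difference
    return (s(x + h) - 2 * s(x) + s(x - h)) / h ** 2

t = knots[1]
for f in (s, d1, d2):
    # each of s, s', s'' agrees across the knot up to finite-difference error
    print(abs(f(t - 1e-3) - f(t + 1e-3)))
```

The third derivative, by contrast, jumps by 6βj+3 at tj, which is exactly the piecewise-cubic structure.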
Suppose

s(x) = β0 + β1x + β2x^2 + β3x^3 + ∑_{j=1}^k βj+3 (x − tj)^3_+

is a natural cubic spline with knots {t1, t2, . . . , tk}. Since it is linear below t1, we must have

β2 = β3 = 0.
At the same time, being linear beyond tk means that s′′(x) = 0 there, which requires

∑_{j=1}^k βj+3 (x − tj) = 0

for all x ≥ tk. This is possible only if both

∑_{j=1}^k βj+3 = 0   and   ∑_{j=1}^k tj βj+3 = 0.
In conclusion, out of the k + 4 entries of β, only k are free for a natural cubic spline. For this reason, we need to think about how to fit a natural cubic spline when the data and knots are given.
One approach is as follows. For j = 1, . . . , k − 1, define the functions

dj(x) = {(x − tj)^3_+ − (x − tk)^3_+} / (tk − tj).

Further, let N1(x) = 1, N2(x) = x, and for j = 3, . . . , k, let

Nj(x) = dj−1(x) − d1(x).
The following theorem says that every natural cubic spline is a linear com-
bination of Nj(x).
Theorem 8.4. Let t1 < t2 < . . . < tk be k knots and let {N1(x), . . . , Nk(x)} be the functions defined above. Then every natural cubic spline s(x) with knots at {t1, . . . , tk} can be expressed as

s(x) = ∑_{j=1}^k βj Nj(x)

for some coefficients β1, . . . , βk.
Proof. Note that

(tk − tj)dj(x) = (x − tj)^3_+ − (x − tk)^3_+,

or equivalently,

(x − tj)^3_+ = (tk − tj)dj(x) + (x − tk)^3_+.
Substituting this expression into the generic form (8.7) of a cubic spline and applying the constraints on the βj implied by the natural cubic spline, we find

s(x) = β0N1(x) + β1N2(x) + ∑_{j=1}^k βj+3 (tk − tj) Nj+1(x).

Note that the kth term is zero. The conclusion therefore follows.
In general, a natural cubic spline can give a very good approximation to any smooth function on a finite interval. This makes it useful for fitting nonparametric signal-plus-noise regression models. Given data {yi; xi} and the k knots t1, . . . , tk, we may posit that

g(x) ≈ ∑_{j=1}^k βj Nj(x).
For the ith observation, we have

g(xi) ≈ ∑_{j=1}^k βj Nj(xi),

which is now a linear combination of k derived covariates. Let y be the response vector, β the regression coefficient vector, and ε the error vector.
Define the n × k design matrix Z whose ith row is {N1(xi), . . . , Nk(xi)}, so that the (i, j) entry of Z is Nj(xi). The approximate regression model becomes

y ≈ Zβ + ε.   (8.8)
We may use the least squares estimator of β given by

β̂ = (ZτZ)^{−1}Zτ y.
Let N(x) = {N1(x), . . . , Nk(x)}τ . Once βˆ is obtained, we estimate the re-
gression function by
gˆ(x) = Nτ (x)βˆ.
If (8.8) were in fact exact, the standard properties of the least squares estimator would apply. We summarize them as follows:

(a) E{β̂} = β and E{ĝ(x)} = g(x);

(b) var(β̂) = σ^2(ZτZ)^{−1};

(c) var{ĝ(x)} = σ^2 Nτ(x)(ZτZ)^{−1}N(x).
If (8.8) is merely approximate, then the above equalities are approximate.
The approximation errors will not be discussed here.
The above idea is known as the regression spline, a large research topic in nonparametric regression. The approach is very widely used in applications to model a nonlinear, unknown function g(x). To apply this method, we must decide the number of knots k and then choose the knot locations t1, . . . , tk.
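The regression-spline recipe can be sketched as follows (Python with NumPy; the data-generating function, sample size and knot locations below are illustrative assumptions of mine, not from the text): build the basis N1, . . . , Nk defined above, form the design matrix Z, and apply least squares.

```python
# Regression-spline fit via least squares in the natural-spline basis
# N_1(x)=1, N_2(x)=x, N_j(x)=d_{j-1}(x)-d_1(x).  Data and knots are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # true g(x) = sin(2*pi*x)

knots = np.linspace(0.05, 0.95, 8)                  # t_1 < ... < t_k with k = 8
k = len(knots)

def d(j, pts):
    # d_j(x) = {(x - t_j)_+^3 - (x - t_k)_+^3} / (t_k - t_j), j = 1, ..., k-1
    tj, tk = knots[j - 1], knots[-1]
    return (np.clip(pts - tj, 0, None) ** 3
            - np.clip(pts - tk, 0, None) ** 3) / (tk - tj)

def basis(pts):
    # matrix with columns N_1, ..., N_k evaluated at pts
    cols = [np.ones_like(pts), pts] + [d(j - 1, pts) - d(1, pts)
                                       for j in range(3, k + 1)]
    return np.column_stack(cols)

Z = basis(x)                                  # design matrix of (8.8)
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)  # least squares estimate of beta
g_hat = Z @ beta_hat                          # fitted values g_hat(x_i)
print(float(np.mean((g_hat - np.sin(2 * np.pi * x)) ** 2)))
```

With these synthetic data the fitted curve tracks sin(2πx) closely; the printed mean squared error against the true g is far below the noise variance 0.09.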
8.6 Smoothing spline
The smoothing spline addresses the knot-selection problem of the regression spline by taking all distinct covariate values as knots. It uses the size of the penalty to determine the level of smoothness.
Recall our claim that the solution of the smoothing spline problem (8.6) is a natural cubic spline with knots at the distinct values t1 < · · · < tk of {xi}^n_{i=1}. This conclusion is implied by the following two claims. Suppose ĝλ(x) is the solution to the penalized sum of squares.

1. Given {ti; ĝλ(ti)}, based on the discussion in the last section, there is a unique natural cubic spline s(x) with knots at {t1, . . . , tk} such that

s(ti) = ĝλ(ti),   i = 1, . . . , k.
Because of the above, we have

∑_{i=1}^n {yi − s(xi)}^2 = ∑_{i=1}^n {yi − ĝλ(xi)}^2.
2. For the s(x) defined above, we have

∫ {ĝ′′λ(x)}^2 dx ≥ ∫ {s′′(x)}^2 dx,

with equality if and only if ĝλ(x) = s(x) for all x. Since ĝλ(x) minimizes the penalized criterion while s(x) attains the same sum of squares with no larger penalty, we must have ĝλ(x) = s(x), a natural cubic spline.
A serious proof is needed for the second claim; here it is.

Let γi = s′′(ti) for i = 1, . . . , k, with s(x) the natural cubic spline with knots at t1, . . . , tk constructed above. Being “natural”, we have γ1 = γk = 0.
Let g(x) be another function with a finite second derivative such that g(ti) = s(ti) for i = 1, 2, . . . , k. Integration by parts shows that

∫_{ti}^{ti+1} g′′(x)s′′(x)dx = ∫_{ti}^{ti+1} s′′(x)dg′(x) = [s′′(ti+1)g′(ti+1) − s′′(ti)g′(ti)] − ∫_{ti}^{ti+1} g′(x)s′′′(x)dx.
Note that

∑_{i=1}^{k−1} [s′′(ti+1)g′(ti+1) − s′′(ti)g′(ti)] = γk g′(tk) − γ1 g′(t1) = 0.
Since s′′(x) is linear on every interval [ti, ti+1], we have

s′′′(x) = (γi+1 − γi)/(ti+1 − ti) = αi,

where we write αi for this slope. With this, we find

∫_{ti}^{ti+1} g′(x)s′′′(x)dx = αi{g(ti+1) − g(ti)} = αi{s(ti+1) − s(ti)},
where the last equality is from the fact that g(x) and s(x) are equal at knots.
Hence, we arrive at the conclusion that

∫_{t1}^{tk} g′′(x)s′′(x)dx = −∑_{i=1}^{k−1} αi{s(ti+1) − s(ti)}.
This result is applicable in particular with g(x) = s(x), so that g′′(x) = s′′(x). Hence, we also have

∫_{t1}^{tk} s′′(x)s′′(x)dx = −∑_{i=1}^{k−1} αi{s(ti+1) − s(ti)}.
This implies that

∫_{t1}^{tk} g′′(x)s′′(x)dx = ∫_{t1}^{tk} s′′(x)s′′(x)dx.
Making use of this result, we get

∫_{t1}^{tk} {g′′(x) − s′′(x)}^2 dx = ∫_{t1}^{tk} {g′′(x)}^2 dx − ∫_{t1}^{tk} {s′′(x)}^2 dx ≥ 0.

The equality holds only if g′′(x) = s′′(x) for all x ∈ [t1, tk]. Hence the overall conclusion is proved.
Consider now the problem of searching for the natural cubic spline that minimizes the penalized optimization problem (within this class of functions). Given a function

g(x) = ∑_{j=1}^k βj Nj(x)

for some constants β1, . . . , βk, its sum of squared residuals is

∑_{i=1}^n {yi − g(xi)}^2 = (y − Zβ)τ(y − Zβ),

where Z is the n × k design matrix with (i, j) entry Nj(xi), as before.
The penalty term over the interval [t1, tk] for this g(x) becomes

∫ {g′′(x)}^2 dx = ∫ ∑_{j=1}^k ∑_{l=1}^k βj βl N′′j(x) N′′l(x) dx = βτ N β,

with

N = (Njl)_{k×k},   Njl = ∫_{t1}^{tk} N′′j(x) N′′l(x) dx.

The penalized sum of squares of g(x) is given by

(y − Zβ)τ(y − Zβ) + λ βτNβ.
Given λ, it is minimized at

β̂λ = (ZτZ + λN)^{−1} Zτ y,

and the fitted regression function is

ĝλ(x) = ∑_{j=1}^k β̂λ,j Nj(x).
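The penalized solve above can be sketched as follows (Python with NumPy; the data and the value of λ are synthetic choices of mine, and the penalty matrix Njl = ∫ N′′j N′′l dx is approximated by quadrature on a fine grid rather than computed in closed form):

```python
# Smoothing-spline solve beta_hat = (Z'Z + lambda N)^{-1} Z'y.  The penalty
# matrix N_jl = int N_j'' N_l'' dx is approximated by a Riemann sum;
# the data and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

knots = np.unique(x)        # smoothing spline: knots at all distinct x_i
k = len(knots)
tk = knots[-1]

def d_val(j, pts):
    # d_j(x) = {(x - t_j)_+^3 - (x - t_k)_+^3}/(t_k - t_j)
    tj = knots[j - 1]
    return (np.clip(pts - tj, 0, None) ** 3
            - np.clip(pts - tk, 0, None) ** 3) / (tk - tj)

def d_pp(j, pts):
    # second derivative: d_j''(x) = 6{(x - t_j)_+ - (x - t_k)_+}/(t_k - t_j)
    tj = knots[j - 1]
    return 6 * (np.clip(pts - tj, 0, None)
                - np.clip(pts - tk, 0, None)) / (tk - tj)

def basis(pts):
    cols = [np.ones_like(pts), pts] + [d_val(j - 1, pts) - d_val(1, pts)
                                       for j in range(3, k + 1)]
    return np.column_stack(cols)

# N_1'' = N_2'' = 0; approximate N_jl on a fine grid over [t_1, t_k]
grid = np.linspace(knots[0], tk, 2000)
h = grid[1] - grid[0]
D2 = np.zeros((grid.size, k))
for j in range(3, k + 1):
    D2[:, j - 1] = d_pp(j - 1, grid) - d_pp(1, grid)
Npen = h * (D2.T @ D2)      # quadrature approximation of the matrix N

Z = basis(x)
lam = 1e-4
beta_hat = np.linalg.solve(Z.T @ Z + lam * Npen, Z.T @ y)
fit = Z @ beta_hat          # the smoothing-spline fitted values
```

Here Z is square (k = n) because every distinct covariate value is a knot; the penalty λN is what keeps the solve well behaved and the fit smooth.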
8.7 Effective number of parameters and the
choice of λ
If we regard ĝλ(x) as a fit based on a linear regression, then we seem to have employed k independent parameters. Due to the regularization induced by the penalty, the effective number of parameters is lower than k. Note that the fitted response vector is given by

ŷλ = Z(ZτZ + λN)^{−1}Zτ y = Aλ y.

We call Aλ the smoother matrix. As with the local polynomial kernel method, we define the effective degrees of freedom (df), or effective number of parameters, to be

dfλ = trace(Aλ).
As λ increases, the effective number of parameters dfλ decreases and ĝλ(x) becomes smoother and smoother. We can hence try a range of λ values, examine the resulting ĝλ(x), and select the most satisfactory one. However, this procedure needs human intervention and cannot be automated.

To overcome this deficiency, one may choose λ using the CV or GCV criteria. As with the local polynomial kernel method, we define the GCV score as a function of λ to be

gcv(λ) = (y − ŷλ)τ(y − ŷλ) / {1 − trace(Aλ)/n}^2.
The GCV method chooses λ as the minimizer of gcv(λ).

The CV approach is similar. Let ĝ−i(xi) be the estimate of g(xi) based on the n − 1 observations with the ith observation removed. We define the CV score as a function of λ to be

cv(λ) = ∑_{i=1}^n {yi − ĝ−i(xi)}^2.

It turns out that

cv(λ) = ∑_{i=1}^n [ {yi − ĝλ(xi)} / {1 − (Aλ)ii} ]^2,

where (Aλ)ii is the ith diagonal entry of Aλ. This expression enables us to fit the model only once for each λ in order to compute cv(λ). The CV method chooses the λ value that minimizes cv(λ).
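A minimal illustration of dfλ = trace(Aλ) and the GCV score follows (Python with NumPy). To keep it self-contained, I use a polynomial design with a ridge penalty λI as a stand-in for the curvature penalty N; these are my own illustrative choices, and the df/GCV computations are identical for any linear smoother ŷλ = Aλy.

```python
# Effective degrees of freedom df = trace(A_lambda) and the GCV score for a
# family of linear smoothers y_hat = A_lambda y.  A polynomial design with a
# ridge penalty lambda*I stands in for the curvature penalty N.
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = np.sort(rng.uniform(-1, 1, n))
y = x ** 3 - x + rng.normal(0, 0.2, n)

Z = np.vander(x, 10, increasing=True)   # basis 1, x, ..., x^9

def smoother(lam):
    # A_lambda = Z (Z'Z + lam I)^{-1} Z'
    return Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)

def gcv(lam):
    A = smoother(lam)
    resid = y - A @ y
    return float(resid @ resid / (1 - np.trace(A) / n) ** 2)

lams = [10.0 ** p for p in range(-8, 3)]
dfs = [float(np.trace(smoother(l))) for l in lams]   # decreasing in lambda
best = min(lams, key=gcv)                            # GCV choice of lambda
print(dfs)
print(best)
```

As the text states, dfλ decreases monotonically from the number of basis columns toward the dimension of the unpenalized subspace as λ grows, and the GCV criterion picks a λ from the grid without human intervention.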
Remark: The so-called R-functions are not included.
8.8 Assignment problems
1. Find the asymptotic efficiency of the least absolute deviation estimator when the data are an i.i.d. sample from a normal distribution, and the asymptotic efficiency of the least squares estimator when the data are an i.i.d. sample from a double exponential distribution.
2. Let β̂ be the least squares estimator of β under the linear model. Show that for any non-random vector b, bτβ̂ is the BLUE of bτβ.
Chapter 9
Bayes method
Most of the data analysis methods we have discussed so far are regarded as
frequentist methods. More precisely, these methods are devised based on
the conviction that the data are generated from a fixed system which is a
member of a family of systems. While the system is chosen by nature, the
outcomes are random. By analyzing the data obtained/generated/sampled
from this system, we infer the properties of THIS system. The methods
devised subsequently are judged by their average performances if they are
repeatedly applied to all possible realized data from this system. For instance, we regard the sample mean as an optimal estimator of the population mean under the normal model in a certain sense: whichever N(θ, σ^2) is the truth, the average of (x̄ − θ)^2 is the lowest among all estimators θ̂ whose average equals θ. A procedure is judged optimal only if this optimality holds at each and every possible (θ, σ^2) value.
When considered from such a frequentist point of view, statisticians do not play favourites among the systems in the family: simplistically, each system in the family is regarded as equally likely beforehand. This view is subject to dispute. In some applications, we may actually have some preference among the systems. What is the chance that a patient entering a clinic with fever actually has a simple flu? If this occurs in flu season, the doctor would immediately look for more signs of flu. If it is not flu season, the doctor will cast a wider net for the cause of the fever. The conclusion reached by the doctor does not depend solely on
the evidence of having a fever. This example shows that most human beings act on their prior beliefs.
The famous Bayes theorem provides one way to formally utilize prior
information. Let A and B be two events in the context of probability theory.
It is seen that the conditional probability of B given A is

pr(B|A) = pr(A|B)pr(B) / {pr(A|B)pr(B) + pr(A|Bc)pr(Bc)},

where Bc is the complement of B, that is, the event that B does not occur. This formula is useful for computing the conditional probability of B after A is known to have occurred, when all probabilities on the right-hand side are known. The comparison between pr(B|A) and pr(B) reflects what we learn from event A about the likeliness of event B.
9.1 An artificial example
Suppose one of two students is randomly selected to write a typical exam.
Their historical averages are 70 and 80 percent. After we are told the mark
of this exam is 85%, which student has been selected in the first place?
Clearly, both are possible but most of us will bet on the one who has
historical average of 80%. It turns out that Bayes theorem gives us a quan-
titative way to justify our decision if we are willing to accept some model
assumptions.
Suppose the exam outcomes have distributions whose densities are

fa(x) = x^{7−1}(1 − x)^{3−1} / B(7, 3) · 1(0 < x < 1);

fb(x) = x^{8−1}(1 − x)^{2−1} / B(8, 2) · 1(0 < x < 1)

for students A and B respectively, where the beta function is defined as

B(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1} dx
9.1. AN ARTIFICIAL EXAMPLE 121
for a, b > 0. The probability that each student is selected to write the exam is

pr(A) = pr(B) = 0.5,

which is our prior belief and reflects the random selection very well. Let X denote the outcome of the exam. It is seen that

pr(A|X = x) = 0.5fa(x) / {0.5fa(x) + 0.5fb(x)}.
If X = 0.85, we find

pr(A|X = 0.85) = 0.3818.

If X = 0.60, we find

pr(A|X = 0.60) = 0.7000.
Based on these calculations, we seem to know what to do next.
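The two posterior probabilities quoted above can be reproduced directly from pr(A|X = x) = 0.5fa(x)/{0.5fa(x) + 0.5fb(x)}, using only the model stated in the text (a plain-Python sketch):

```python
# Posterior probability that student A was selected, given the exam score x,
# under the Beta(7, 3) (student A) and Beta(8, 2) (student B) score models.
from math import gamma

def beta_fn(a, b):
    # B(a, b) = Gamma(a)Gamma(b)/Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x, a, b):
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def post_A(x):
    fa = beta_pdf(x, 7, 3)   # score density for student A
    fb = beta_pdf(x, 8, 2)   # score density for student B
    return 0.5 * fa / (0.5 * fa + 0.5 * fb)

print(post_A(0.85))   # about 0.3818, as in the text
print(post_A(0.60))   # about 0.7000
```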
To use the frequentist approach discussed earlier, we re-state this ex-
periment as follows. One observation X has been obtained from a Beta
distribution family with parameter space
Θ = {(7, 3); (8, 2)} or {A,B}.
If X = 0.85, what is your estimate of θ?
The likelihood values at the two parameter points are given by

ℓ((7, 3)) = fa(0.85) = 2.138;   ℓ((8, 2)) = fb(0.85) = 3.462.

Hence, the MLE is θ̂ = (8, 2), corresponding to student B.
Based on the frequentist approach, which ignores the prior information, we are told it is more likely that student B wrote the exam. If the MLE is the frequentist method of choice, then student B is our answer, even though we know it is not certain.
Using Bayes analysis together with the prior information provided, we claim that there is a 62% chance that student B wrote the exam. At this moment, we have yet to make a decision: the calculation of the posterior probability itself does not directly provide one. Suppose wrongfully concluding it was written by student B may result in a loss of a million dollars, while wrongfully concluding it was student A may result in a loss of a single dollar; then we may still claim/act as if it was student A who wrote the exam.
Figure 9.1: Posterior probability as a function of x
[Horizontal axis: exam score x; vertical axis: Prob(A|X = x).]
9.2 Classical issues related to Bayes analysis
We suggested that a statistical model is a family of distributions often rep-
resented as a collection of parameterized density functions. We use {f(x; θ) :
θ ∈ Θ} as a generic notation. In most applications, we let Θ be a subset of
Rd.
When a set of observations X is obtained and a statistical model is assumed, a frequentist regards X as generated from ONE member of {f(x; θ) : θ ∈ Θ}, though usually we do not know which one. The information contained in X helps us decide which one is most likely, or find a close approximation of this ONE.
In comparison, a Bayesian may also regard X as generated from ONE member of {f(x; θ) : θ ∈ Θ}. However, this θ value is itself generated from another distribution called the prior distribution, Π(θ). In other words, it is a realized value of a random variable whose distribution is Π(θ). If we have full knowledge of Π(θ), then it should be combined with X to infer which θ has been THE θ in {f(x; θ) : θ ∈ Θ} that generated X. We generally cannot nail down a single θ value given X and Π(θ). With the help of the Bayes theorem, we can compute the conditional distribution of θ given X, which is called the posterior distribution. That is, we retain the random nature of θ but update our knowledge about its distribution when X becomes available. Statistical inference about θ is then based on this updated knowledge.
From the above discussion, it is seen that a preliminary step in Bayes analysis is to obtain the posterior distribution of θ, assuming the model has been given and the data have been collected. That is, we have already decided on the statistical model f(x; θ), the prior distribution Π(θ), and the data X collected in the application. Note that X can be a vector of observations that are i.i.d. given θ. The notion of GIVEN θ is important because θ is a random variable in the context of Bayes analysis.

Particularly in the early days, Bayes analysis was possible only if some neat analytical expression for the posterior was available. Indeed, one can give many such examples where things line up nicely.
Example 9.1. Suppose we have an observation X from a binomial distribution f(x; θ) = C(n, x)θ^x(1 − θ)^{n−x} for x = 0, 1, . . . , n. Suppose we set the prior distribution with density function

pi(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b) · 1(0 < θ < 1).
By the Bayes rule, the density function of the posterior distribution of θ is given by

fp(θ|X = x) = f(x; θ)pi(θ) / ∫ f(x; θ)pi(θ)dθ.

It appears that to get an explicit expression we must carry out the integration. However, this can often be avoided. Note that

f(x; θ)pi(θ) = {C(n, x)/B(a, b)} θ^{a+x−1}(1 − θ)^{b+n−x−1} 1(0 < θ < 1).
Hence, we must have

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1} 1(0 < θ < 1) / c(n, a, b, x)

for some constant c(n, a, b, x) not depending on θ. As a function of θ, this matches the density of the Beta distribution with parameters a + x and b + n − x. At the same time, its integral must equal 1. This shows that we must have

c(n, a, b, x) = B(a + x, b + n − x).

The posterior distribution is Beta with parameters a + x and b + n − x:

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1} 1(0 < θ < 1) / B(a + x, b + n − x).

This will be the posterior distribution used for Bayes decisions.
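The conjugacy conclusion is easy to check numerically (plain Python; the values of n, x, a, b below are arbitrary illustrations of mine): the product f(x; θ)pi(θ), normalized by a simple quadrature, agrees with the Beta(a + x, b + n − x) density.

```python
# Numerical check that the posterior in the Beta-binomial model is
# Beta(a + x, b + n - x); the values n, x, a, b are illustrative.
from math import comb, gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

n, x, a, b = 10, 4, 2.0, 3.0

def product(theta):
    # f(x; theta) * pi(theta), the unnormalized posterior
    lik = comb(n, x) * theta ** x * (1 - theta) ** (n - x)
    prior = theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)
    return lik * prior

# normalizing constant by a midpoint Riemann sum on (0, 1)
m = 100000
grid = [(i + 0.5) / m for i in range(m)]
c = sum(product(t) for t in grid) / m

def post_pdf(theta):
    return product(theta) / c

def beta_pdf(theta, a_, b_):
    return theta ** (a_ - 1) * (1 - theta) ** (b_ - 1) / beta_fn(a_, b_)

for t in (0.2, 0.5, 0.8):
    print(post_pdf(t), beta_pdf(t, a + x, b + n - x))
```

The two printed columns agree at every test point, confirming that the constant C(n, x)/B(a, b) and the normalization cancel exactly as described above.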
You may notice that the binomial distribution and the beta distribution pair up perfectly to permit an easy conclusion about the posterior distribution. There are many such pairs. For instance, if X has a Poisson distribution with mean θ, and θ has a one-parameter Gamma prior, then the posterior distribution of θ is also Gamma. We leave this case as an exercise. Such prior distributions are called conjugate priors. Another good exercise is to draw the density functions of many beta distributions; it helps build intuition about what you have assumed when a beta prior is applied.
9.2. CLASSICAL ISSUES RELATED TO BAYES ANALYSIS 125
Definition 9.1. Let {f(x; θ) : θ ∈ Θ} be a statistical model, namely a family of distributions. Suppose that for any prior distribution pi(θ) in the family {pi(θ; ξ) : ξ ∈ Ξ}, the posterior distribution of θ given a set of i.i.d. observations from f(x; θ) is also a member of {pi(θ; ξ) : ξ ∈ Ξ}. Then we say that {pi(θ; ξ) : ξ ∈ Ξ} is a conjugate prior distribution family of {f(x; θ) : θ ∈ Θ}.
Remark: We have seen that the posterior density is given by

fp(θ|X = x) = f(x; θ)pi(θ) / ∫ f(x; θ)pi(θ)dθ.

This formula is generally applicable. In addition, one should take note that the denominator does not depend on θ; it merely serves as a scale factor in fp(θ|X = x). In classical examples, its value can be inferred from the analytical form of the numerator. In complex examples, its value does not play a role in the Bayes analysis.
Example 9.2. Suppose that, given µ, X1, . . . , Xn are i.i.d. from N(µ, σ0^2) with known σ0^2. Namely, σ0^2 is not regarded as random. The prior distribution of µ is N(µ0, τ0^2), with both parameter values known. The posterior distribution of µ given the sample is still normal, with parameters

µB = (n x̄/σ0^2 + µ0/τ0^2) / (n/σ0^2 + 1/τ0^2)

and

σB^2 = [ n/σ0^2 + 1/τ0^2 ]^{−1}.
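A minimal sketch of the posterior computation in Example 9.2 (plain Python; the numbers x̄, σ0^2, µ0, τ0^2 are illustrative assumptions of mine):

```python
# Posterior parameters in the normal-normal model of Example 9.2;
# the numbers xbar, sigma0^2, mu0, tau0^2 are illustrative.
n = 20
xbar = 1.3                 # sample mean
sigma0_sq = 4.0            # known sampling variance sigma_0^2
mu0, tau0_sq = 0.0, 1.0    # prior N(mu0, tau0^2)

precision = n / sigma0_sq + 1 / tau0_sq            # posterior precision
mu_B = (n * xbar / sigma0_sq + mu0 / tau0_sq) / precision
sigma_B_sq = 1 / precision

print(mu_B, sigma_B_sq)
```

With these numbers, µB = 6.5/6 ≈ 1.083 and σB^2 = 1/6: the posterior mean shrinks x̄ = 1.3 toward the prior mean 0, and the posterior variance is smaller than both σ0^2/n = 0.2 and τ0^2 = 1.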
The philosophy behind Bayes data analysis is to accommodate our prior information/belief about the parameter in statistical inference. Sometimes prior information naturally exists: for instance, we have a good idea of the prevalence of the human sex ratio. In other applications, we may have some idea about certain parameters, for example the score distribution of a typical course. Even if we cannot perfectly summarize our belief with a prior distribution, one of the distributions in the beta family may be good enough.
It is probably not unusual that we do not have much idea about the parameter value under a statistical model assumption. Yet one may be attracted to the convenience of the Bayesian approach and would like to use Bayes analysis anyway. One may then decide to use a so-called non-informative prior, although there seems to be no rigorous definition of what kind of prior is non-informative.
In the normal distribution example, one may not have much idea about the mean of the distribution in a specific application. If one insists on the Bayesian approach, one may simply use the prior density function

pi(µ) = 1

for all µ ∈ R. This prior seems to reflect the lack of any idea about which µ value is more likely than others. In this case, pi(µ) is not even a proper density function with respect to Lebesgue measure. Yet one may obtain a proper posterior density following the rule of the Bayes theorem.
It appears to me that Bayes analysis makes sense when prior information about the parameter truly exists. On some occasions, it does not hurt to employ this tool even if we do not have much prior information. If so, the Bayesian conclusion should be critically examined just like any other inference conclusion.
9.3 Decision theory
Let us go back to the position that a statistical model f(x; θ) is given, a prior distribution Π(θ) is chosen, and data X have been collected. At least in principle, the Bayes theorem has enabled us to obtain the posterior distribution of θ: fp(θ|X). At this point, we need to decide how to estimate θ: the value generated from Π(θ), with X a random sample from f(x; θ) at this θ. With fp(θ|X) at hand, how do you estimate θ?
First of all, you may pick any function of X as your estimator of θ. This
has not changed.
Second, if you wish to find a superior estimator, you must provide a criterion to judge superiority. In the context of Bayes data analysis, the criterion for point estimation is specified through a loss function.
Definition 9.2. Assume a probability model with parameter space Θ. A
loss function `(·, ·) is a non-negative valued function on Θ × Θ such that
`(θ1, θ2) = 0 when θ1 = θ2.
Finally, since we do not know the true θ value, with the posterior distribution we can only hope to minimize the average loss. Hence, the Bayes decision rule is to look for θ̂ that minimizes the expected posterior loss:

∫ L(θ̂, θ) fp(θ|X) dθ = min.

A natural choice of the loss function is

L(θ̂, θ) = (θ̂ − θ)^2.

The solution under this loss function is clearly the posterior mean of θ for one-dimensional θ; this extends to the situation where θ is multidimensional. One may also use the loss function

L(θ̂, θ) = |θ̂ − θ|.

If so, the solution is the posterior median for one-dimensional θ. The extension to multidimensional θ is possible.
Example 9.3. Suppose we have an observation X from a binomial distribution f(x; θ) = C(n, x)θ^x(1 − θ)^{n−x} for x = 0, 1, . . . , n, and we set the prior distribution with density function

pi(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b) · 1(0 < θ < 1).

By the Bayes rule, as in Example 9.1, the posterior distribution is Beta with parameters a + x and b + n − x:

fp(θ|X = x) = θ^{a+x−1}(1 − θ)^{b+n−x−1} 1(0 < θ < 1) / B(a + x, b + n − x).
If the squared loss is employed, then the Bayes estimator of θ is given by

∫ θ fp(θ|X = x) dθ = (a + x)/(a + b + n).

When a = b = 1, the prior distribution of θ is uniform on (0, 1). This is regarded as a non-informative prior. With this prior, we find

θ̂ = (x + 1)/(n + 2),

which seems to make more sense than the MLE x/n.
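A small check of the Bayes estimator under squared loss (plain Python; n, x and the uniform prior a = b = 1 are my illustrative choices): the closed form (a + x)/(a + b + n) matches a numerically computed posterior mean, and differs from the MLE x/n most visibly at x = 0.

```python
# Bayes estimator under squared loss versus the MLE, in the Beta-binomial
# setting with the uniform prior a = b = 1; n and x are illustrative.
n, x, a, b = 10, 0, 1.0, 1.0    # x = 0 observed successes

def unnorm_post(theta):
    # posterior is proportional to theta^(a+x-1) * (1-theta)^(b+n-x-1)
    return theta ** (a + x - 1) * (1 - theta) ** (b + n - x - 1)

# posterior mean by a midpoint Riemann sum on (0, 1)
m = 100000
grid = [(i + 0.5) / m for i in range(m)]
w = [unnorm_post(t) for t in grid]
post_mean = sum(t * wi for t, wi in zip(grid, w)) / sum(w)

bayes = (a + x) / (a + b + n)   # closed form; here (x + 1)/(n + 2) = 1/12
mle = x / n                     # degenerate estimate 0 when x = 0
print(bayes, mle, post_mean)
```

When no successes are observed, the Bayes estimate 1/12 still leaves room for θ > 0, while the MLE declares the event impossible.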
Since the Bayes estimator is generally chosen as the minimizer of some expected posterior loss, it is optimal in this sense by definition. However, the optimality is judged with respect to the specific loss function and under the assumed prior. Blindly claiming that a Bayes estimator is optimal out of context is not recommended here. If such logic were applicable, we could just as rightfully claim that the MLE is optimal because it maximizes a criterion function called the likelihood. Such a claim would be ridiculous because there are many examples where the MLE is not even consistent.
We will have an exercise problem working out Bayes estimators under squared loss for the normal model with a conjugate prior distribution on both the mean and the variance.
Once the posterior distribution is ready, we are not restricted to merely giving a point estimate. These issues will be discussed in other parts of this course. At the same time, we may get a sense that being able to precisely describe the posterior distribution is one of the most important topics in Bayes data analysis.
There are two major schools of thought on how statistical data analysis should be carried out: frequentist and Bayesian. If some prior information exists and can be reasonably well summarized by a prior distribution, then I feel the inference based on Bayes analysis is fully justified. If one does not have much sensible prior information on the statistical model appropriate to the data at hand, it is still acceptable to use the formality of Bayes analysis. Yet blindly claiming the superiority of a Bayesian approach is not to my taste. Particularly in the latter case, the Bayesian conclusion should be critically examined as much as the conclusions of any other data analysis method.
To make things worse, many statisticians seem to regard themselves as doing research on Bayesian methods, yet they are not aware of the principles of Bayes analysis. Probably they merely feel that this is an easy topic in which to publish papers (not true if one is a serious Bayesian). Strictly speaking, a Bayesian should have a strong conviction that model parameters are invariably realized values from some distribution. There is an interesting and very valid question: is/was Bayes a Bayesian?
9.5 Assignment problems
1. (a) Using an R package, plot the beta density functions with parameter pairs

(a, b) = (0.5, 0.1), (0.1, 0.5), (0.5, 0.5), (1, 1), (5, 1), (1, 5), (5, 20), (5, 50).
(b) Select two additional pairs based on your own curiosity and plot
them.
(c) Show that the density functions with parameters (a, b) and (b, a)
are mirror images of each other.
Remark: (c) is an observational question.
2. Show that the following two pairs of distribution families permit ana-
lytical descriptions of the posterior distribution. Identify the posterior
distribution together with specific parameter values. Assume that we
have a single observation from the statistical model.
(a) Statistical model: Poisson with parameter θ; prior distribution family: one-parameter Gamma with degree of freedom d. Namely, the prior density function is given by

pi(θ) = θ^{d−1} exp(−θ)/Γ(d)

for θ ≥ 0.
(b) Statistical model: N(µ, σ^2); prior distribution: N(µ0, σ^2) for µ given σ^2, and one-parameter Gamma for 1/σ^2 with degree of freedom d0 = 5. Namely, the prior distribution is specified through µ given σ^2 and the distribution of σ^2.
3. Given a set of i.i.d. observations of size n = 20 from N(µ, σ2) and the
prior distribution specified as in the previous problem with µ0 = 0.
(a) Find the posterior 75% quantile of the mean parameter µ;
(b) Find the posterior expectation of µ2.
4. Following the last problem. Assume the data set contains n = 20
observations as follows:
1.1777518 -0.5867896 0.2283789 -0.1735369 -0.2328192
1.0955114 1.2053680 -0.7216797 -0.3387580 0.1620835
1.4173256 0.0240219 -0.6647623 0.6214567 0.7466441
1.9525066 -1.2017093 1.9736293 -0.1168171 0.4511754
(a) Given d0 = 5, plot the posterior mean of µ as a function of µ0 over
[−2, 2].
(b) Given µ0 = 0, plot the posterior mean of σ^2 as a function of d0 over [0.5, 10].
Remark: use Monte Carlo simulation if direct numerical/analytical
computation is too difficult/infeasible.
5. Based on the class example where the statistical model is Binomial and
the prior distribution of θ is Beta(a, b).
(a) Compute the expected posterior squared loss of the MLE θˆ = X/n
and the Bayes estimator θˇ = (X + 1)/(n + 2). Remark: they are
functions of (a, b).
(b) Compute the frequentist MSE of these two estimators; that is, regard the MSEs as functions of θ, with θ a non-random unknown parameter.
6. Consider an i.i.d. sample x1, x2, . . . , xn of size n from the Gamma distribution family

f(x; θ) = {1/Γ(θ)} x^{θ−1} exp(−x).

We have a discrete prior distribution of θ given by

P(θ = j) = j/9,   j = 2, 3, 4.

(a) Give the expression for the posterior probability mass function of θ.

(b) Suppose n = 3 and the observed values are 4.134, 2.116, 4.105, so that ∑ xi = 10.355 and ∏ xi = 35.90. What is your estimated value of θ?
Remark: Be a good Bayesian.
Chapter 10
Monte Carlo and MCMC
Recall that a statistical model is a distribution family, at least this is what
we suggest. Let us first focus on parametric models: {f(x; θ) : θ ∈ Θ}. In
this case, θ is generally a real-valued vector and Θ is a subset of Euclidean
space with nice properties such as convexity and openness. After placing a
prior distribution on θ, we have created a Bayes model. There does not seem
to be an explicit consensus on a definition of, and a notation for, a Bayes model,
even though statisticians are not shy about using this terminology. Based on
our understanding, we define a Bayes model as a system with two necessary
components: a family of distributions, and a prior distribution on the parameter
space of this distribution family:
Bayes Model = [{f(x; θ) : θ ∈ Θ}, π(θ)].
When Θ is a subset of Euclidean space, we generally regard π(·) as a density
function with respect to Lebesgue measure on Θ. For an abstract {f(x; θ) :
θ ∈ Θ}, π(·) represents an abstract distribution.
Logically, a Bayes model is not the same as Bayes analysis. Bayes analysis
is generally carried out based on the posterior distribution, yet there is no
formal requirement of this rule. Frequentists often take the likelihood function
as the basis for inference, yet they may design inference procedures in any
way they like. In my opinion, this includes procedures based on posterior
distributions.
Suppose a θ value is generated according to pi(·), and subsequently, a
data set X is generated from THIS f(x; θ). Here we implicitly assume that
X is accurately measured and available to use for the purpose of inference,
and the value of θ is hidden from us. The inference target is θ based on
data from this experiment. Any decision about the possible value of θ in
Bayes analysis is generally based on the posterior density of θ given X. We
use notation fp(θ|X) for posterior distribution (density). It is conceptually
straightforward to define and derive the posterior distribution. Hence, there
is not much left for a statistician to do.
Bayes analysis makes a decision based on the posterior distribution. Research
on Bayesian methods includes: (a) the most suitable prior distributions in specific
applications; (b) the influence of the choice of prior distribution on the final
decision; (c) numerical or theoretical methods for determining the posterior
distribution; (d) properties of the posterior distribution; (e) decision rules.
There might be more topics out there. This chapter is about topic (c).
For some well paired up f(x; θ) and π(θ) (when π(·) is a conjugate prior for
f(x; θ)), it is simple to work out the analytical form of the posterior density
function. A Bayesian needs only decide the best choice of π(θ) and the
subsequent decision rule. In many real world problems, however, the posterior
density is on a high dimensional space and does not have a simple analytical form.
Before the era of contemporary computing power, Bayes analysis was seriously
challenged by this formidable task. The task becomes less and less of
an issue today. We discuss a number of commonly used techniques in this
chapter.
10.1 Monte Carlo Simulation
The content of this section is related to, but not limited to, Bayes analysis.
Suppose in some application we wish to compute E{g(X)}, where X is known
to have a certain distribution. This is certainly a simple task in many textbook
examples. For instance, if X has the Poisson distribution with mean θ and
g(x) = x(x − 1)(x − 2)(x − 3), then
E{g(X)} = θ⁴.
However, if g(x) = x log(x + 1), the answer to E{g(X)} is not analytically
available.
Suppose we have an i.i.d. sample x1, . . . , xn with sufficiently large n from
this distribution; then by the law of large numbers,
E{g(X)} ≈ n⁻¹ ∑_{i=1}^n xi log(1 + xi).
Let us generate n = 100 values from the Poisson distribution with θ = 2. Using
the R function rpois, we get 100 values:
5 2 3 4 1 2 1 2 1 1 2 3 2 2 2 3 1 2 0 4 1 2 5 1 1
2 3 1 1 1 2 0 2 1 1 3 0 5 1 5 1 2 1 0 2 3 5 2 6 3
2 4 3 1 1 2 2 1 1 2 2 5 0 2 1 3 3 1 3 1 1 2 2 3 1
2 1 4 0 4 2 3 0 0 2 1 3 1 0 2 1 0 3 1 3 6 1 3 3 3
Based on this sample, we get the approximate value
E{g(X)} ≈ 2.691.
We can just as easily use n = 10,000 and find E{g(X)} ≈ 2.648 in one try.
With a contemporary computer, we can afford to repeat the experiment as many
times as we like: E{g(X)} ≈ 2.642, 2.641, 2.648. It appears E{g(X)} ≈ 2.645
would be a very accurate approximation. Computation based on simulated data
is generally called the Monte Carlo method.
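The computation above can be reproduced in a few lines. The sketch below is in Python rather than R (an arbitrary choice for illustration); since the Python standard library has no Poisson generator, it draws Poisson values with Knuth's product-of-uniforms method.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) value by Knuth's method:
    count uniforms until their running product drops below exp(-lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

def g(x):
    return x * math.log(1 + x)

rng = random.Random(2020)
n = 100_000

# Monte Carlo approximation of E{g(X)} for X ~ Poisson(2)
approx = sum(g(rpois(2.0, rng)) for _ in range(n)) / n

# Direct numerical answer for comparison (truncated series, as in the R one-liner)
exact = sum(g(k) * math.exp(-2.0) * 2.0 ** k / math.factorial(k) for k in range(51))

print(round(approx, 3), round(exact, 3))  # both near 2.648
```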
We must answer two questions before we continue. The first is why we do
not use a numerical approach if we need to compute E{g(X)}. Indeed,
we can put up a quick piece of R code
{ii = 0:50; sum(ii*log(1+ii)*dpois(ii, 2))}
and get the value 2.647645. This is a very accurate answer to this specific
problem. Yet if we wish to compute
E{(X1 + √X2)² log(1 + X1 + X3X4)},
where X1, X2, X3, X4 may have a not-so-simple joint distribution, a neat
numerical solution becomes hard. Since contemporary computers are so
powerful, the above problem is only "slightly" harder. Yet there are real
world problems of this nature that involve hundreds or more random variables.
For these problems, numerical computation quickly becomes infeasible
even for contemporary computers. In comparison, the complexity of the
Monte Carlo method remains the same even when g(X) is a function of
a vector X of very high dimension.
The second question is: how easy is it to generate quality "random samples"
from a given distribution by computer? There are two issues related
to this question. First, the computer does not have an efficient way to
generate truly random numbers. However, with some well designed algorithms,
it can produce massive amounts of data which appear purely random. We call
such algorithms pseudo random number generators. We do not discuss this part
of the problem in this course. The other issue is how to make sure these random
numbers behave like samples from the desired distributions.
Our starting point is that it is easy to generate i.i.d. observations (pseudo
numbers) from the uniform distribution on [0, 1]. We now investigate techniques
for generating i.i.d. observations from other distributions.
Theorem 10.1. Let F (x) be any univariate continuous distribution function
and U be a standard uniformly distributed random variable. Let
Y = inf{x : F (x) ≥ U}.
Then the distribution function of Y is given by F(·).
Proof. We only need to work out the c.d.f. of Y . If it is the same as F (·),
then the theorem is proved.
Routinely, we have
pr(Y ≤ t) = pr(inf{x : F (x) ≥ U} ≤ t) = pr(F (t) ≥ U) = F (t)
because pr(U ≤ u) = u for any u ∈ (0, 1). This completes the proof.
Since we generally only have pseudo numbers in U , applying the above
transformation will only lead to “pseudo numbers” in Y .
Example 10.1. Let g(u) = − log u. Then Y = g(U) has the exponential
distribution if U has the standard uniform distribution.
Let g(u) = (− log u)^a for some positive constant a. Then Y = g(U) has a
Weibull distribution.
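Theorem 10.1 is easy to try out. A minimal Python sketch of the exponential case: F(y) = 1 − exp(−y) gives F⁻¹(u) = −log(1 − u), which has the same distribution as −log u but is safe against u = 0 since random() returns values in [0, 1).

```python
import math
import random

rng = random.Random(1)
n = 100_000

# Inverse-CDF sampling: F(y) = 1 - exp(-y) gives F^{-1}(u) = -log(1 - u).
# Using 1 - u instead of u avoids log(0) and changes nothing in distribution.
ys = [-math.log(1.0 - rng.random()) for _ in range(n)]

mean_y = sum(ys) / n
print(round(mean_y, 2))  # the standard exponential has mean 1
```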
As an exercise problem, find the function g(·) which makes g(U) standard
Cauchy distributed.
Here is another useful exercise problem for knowledge. If Z1, Z2 are
independent standard normally distributed random variables, then
r² = Z1² + Z2² is exponentially distributed with mean 2. One should
certainly know that r² is also chi-square distributed with 2 degrees of freedom.
Example 10.2. Let U1, U2 be two independent standard uniform random
variables. Let
g1(s, t) = √(−2 log s) cos(2πt);
g2(s, t) = √(−2 log s) sin(2πt).
Then g1(U1, U2), g2(U1, U2) are two independent standard normal random
variables.
If we can efficiently generate pseudo numbers from the uniform distribution,
then the above result enables us to efficiently generate pseudo numbers from
the standard normal distribution. Since general normally distributed random
variables are merely location-scale shifts of standard normal random variables,
they can also be generated efficiently this way.
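Example 10.2 (the Box–Muller transform) is equally short to implement; a sketch, again replacing s by 1 − s to guard against log(0):

```python
import math
import random

rng = random.Random(7)

def box_muller(rng):
    """Turn two independent uniforms into two independent N(0,1) values."""
    s, t = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - s))  # 1 - s avoids log(0)
    return r * math.cos(2.0 * math.pi * t), r * math.sin(2.0 * math.pi * t)

zs = []
for _ in range(50_000):
    zs.extend(box_muller(rng))

m = sum(zs) / len(zs)
v = sum(z * z for z in zs) / len(zs) - m * m
print(round(m, 2), round(v, 2))  # sample mean near 0, sample variance near 1
```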
Due to well established relationships between various distributions, pseudo
numbers from many classical distributions can be generated efficiently.
Here are a few well-known results which were also given in the chapter on
normal distributions.
Example 10.3. Let Z1, Z2, . . . be i.i.d. standard normally distributed random
variables.
(a) Xn² = Z1² + Z2² + · · · + Zn² has the chi-square distribution with n degrees of
freedom.
(b) Fn,m = (Xn²/n)/(Ym²/m) has the F distribution with (n, m) degrees of
freedom when Xn², Ym² are independent.
(c) Bn = Xn²/(Xn² + Ym²) has the Beta(n/2, m/2) distribution when Xn², Ym²
are independent.
We can also generate multinomial pseudo numbers with any probabilities
p1, p2, . . . , pm: generate U from the uniform distribution, then let X = k for
the k such that
p1 + · · · + p_{k−1} < U ≤ p1 + · · · + p_{k−1} + p_k.
The left hand side is regarded as zero for k = 1.
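The multinomial recipe above amounts to an inverse-CDF search through the cumulative sums; a sketch with assumed toy probabilities (0.2, 0.5, 0.3):

```python
import random

def draw_category(probs, rng):
    """Return k (1-based) with P(X = k) = probs[k-1]:
    find the k with p1 + ... + p_{k-1} < U <= p1 + ... + p_k."""
    u = rng.random()
    cum = 0.0
    for k, p in enumerate(probs, start=1):
        cum += p
        if u <= cum:
            return k
    return len(probs)  # guard against floating-point round-off

rng = random.Random(3)
probs = [0.2, 0.5, 0.3]
n = 100_000
counts = {1: 0, 2: 0, 3: 0}
for _ in range(n):
    counts[draw_category(probs, rng)] += 1

print([round(counts[k] / n, 2) for k in (1, 2, 3)])  # near [0.2, 0.5, 0.3]
```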
10.2 Biased or importance sampling
Back to the problem of computing E{g(X)} when X has a distribution with
density or probability mass function f(x). If generating pseudo numbers
from f(x) is efficient, then it is a good idea to approximate this expectation
by
n⁻¹ ∑_{i=1}^n g(xi).
If it is more convenient to generate pseudo numbers from a different
distribution f0(x) which has the same support as f(x), then it is easier to
approximate this expectation by
n⁻¹ ∑_{i=1}^n g(yi)f(yi)/f0(yi),
where the observations y1, . . . , yn are generated from f0(x).
If Y has a distribution with density f0(x), we have
E{g(Y)f(Y)/f0(Y)} = ∫ {g(y)f(y)/f0(y)} f0(y) dy = ∫ g(y)f(y) dy = E{g(X)},
where X has distribution f(x). Note that it is important that f and f0 have
the same support so that the range of integration remains the same. If X
has a discrete distribution, the integration is changed to summation. The
conclusion is not affected.
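A sketch of the reweighting identity above: estimating the tail probability P(X > 3) for X ~ N(0, 1), i.e. E{g(X)} with g(x) = 1(x > 3). Direct sampling from f almost never lands in the tail; the proposal f0 = N(3, 1) is an assumed convenient choice that puts most draws where g matters.

```python
import math
import random

rng = random.Random(11)

def weight(y):
    """f(y)/f0(y) for f = N(0,1), f0 = N(3,1):
    exp(-y^2/2 + (y-3)^2/2) = exp(9/2 - 3y)."""
    return math.exp(4.5 - 3.0 * y)

n = 100_000
total = 0.0
for _ in range(n):
    y = rng.gauss(3.0, 1.0)        # draw from the proposal f0
    if y > 3.0:                    # g(y) = 1(y > 3)
        total += weight(y)         # reweight by f/f0
est = total / n

exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(X > 3) for N(0,1), about 0.00135
print(est, exact)
```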
In sample surveys, the units in the finite population often have different
probabilities of being included in the sample due to various considerations. The
population total
Y = ∑_{i=1}^N yi,
where N is the number of units in the finite population and yi is
the response value of the ith unit, is often estimated by the Horvitz–Thompson
estimator:
Ŷ = ∑_{i∈s} yi/πi,
where s is the set of units sampled and πi is the probability that unit i is
in the sample. The role of πi is the same as that of f0(x) in the importance
sampling context.
In sampling practice, some units with specific properties of particular interest
are hard to obtain under an ordinary sampling plan. Specific measures
are often taken so that these units have a higher probability of being included
than when all units are treated equally. This practice may also be
regarded as finding a specific f0(x) to replace f(x), even though the expectation
of g(X) under the f(x) distribution is the final target. One such example
is estimating the proportion of HIV+ persons in the Vancouver population. A
simple random sample may end up containing only HIV− individuals,
giving a poor estimate of the HIV+ rate. The same motivation arises
in numerical computation. If f(x) has low values in a certain region
of x, then a straightforward random number generator will generate very few
values from that region. This makes such numerical approximations
inefficient. Searching for a suitable f0(x) can be a good remedy to this problem.
Here is another example. Suppose we wish to estimate the survival time of
cancer patients. Take a random sample from all cancer patients alive at a
specific time point, and denote their survival times by Y1, Y2, . . . , Yn, with
distribution denoted f0(y). The actual survival distribution would be different
if every cancer patient were counted equally. This is because f0(y) ∝ yf(y),
where f(y) is the "true" survival time distribution: patients with longer survival
times are more likely to be in the sample. This may also be regarded as
importance sampling created by nature.
10.3 Rejective sampling
Instead of generating data from an original target distribution f(x), we may
generate data from f0(x) and obtain more effective numerical approximation
of E{g(X)}. This is what we have seen in the last section. The same idea is
at work in rejective sampling. The target of this game is to obtain pseudo
numbers which may be regarded as random samples from f(x). Of course,
to make it a good tool, we must select an f0(x) which is easy to handle.
Let f(x) be the density function from which we wish to get random
samples. Let f0(x) be a density function with the same support and, further,
sup_x f(x)/f0(x) = u < ∞.
Denote
π(x) = f(x)/{u f0(x)}.
Apparently, π(x) ≤ 1 for any x. In addition, if f(x) is known only up to a
constant multiplier, the above calculations remain feasible. One potential example
of such an f(x) is
f(x) = C exp(−x⁴)/{1 + x² + sin²(x)}.
Since f(x) > 0 and its integral converges, we are sure that
C⁻¹ = ∫ exp(−x⁴)/{1 + x² + sin²(x)} dx
is well defined. Yet we do not have its exact value. In this example, an
accurate approximate value of C is not hard to get. Yet if f(·) is the joint
density of many variables, even a numerical approximation is not feasible.
This occurs particularly in Bayes analysis. If an effective way to generate
"random" samples from f(x) is available, then in many applications we no
longer need to know C.
Now we present the procedure of the rejective sampling method.
1. Generate a sequence of i.i.d. samples X1, X2, . . . from f0(x).
2. Generate a sequence of i.i.d. samples U1, U2, . . . from the standard uni-
form distribution.
3. For i = 1, 2, . . ., if Ui ≤ π(Xi), let Yi = Xi; otherwise, leave Yi
undefined.
4. Collect the Xi values not rejected in the last step to form a sequence of
i.i.d. samples: Y1, Y2, . . ..
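The four steps above can be sketched directly. As an assumed toy target, take f(x) = 6x(1 − x) (the Beta(2, 2) density) with proposal f0 = Uniform(0, 1), so u = sup f/f0 = 1.5 and π(x) = f(x)/{u f0(x)} = 4x(1 − x):

```python
import random

rng = random.Random(5)

def draw_one(rng):
    """Rejective sampling for f(x) = 6x(1-x) with f0 = Uniform(0,1), u = 1.5."""
    tries = 0
    while True:
        tries += 1
        x = rng.random()                          # candidate X from f0
        if rng.random() <= 4.0 * x * (1.0 - x):   # accept with probability pi(x)
            return x, tries

n = 50_000
samples, total_tries = [], 0
for _ in range(n):
    x, t = draw_one(rng)
    samples.append(x)
    total_tries += t

mean = sum(samples) / n
print(round(mean, 2), round(total_tries / n, 2))  # mean near 0.5; tries per draw near u = 1.5
```

The average number of (X, U) pairs per accepted draw matches the waiting-time result derived below: a geometric distribution with mean u.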
It is easy to see why this procedure is called rejective sampling. We
now show that the above procedure indeed produces a set of
i.i.d. samples from the distribution f(x).
Theorem 10.2. The output of the rejective sampling, Yi, has distribution
F (x) with density function f(x) for any i.
Proof. First, we consider the case i = 1. It is seen that
pr{U > π(X)} = E{1 − π(X)} = 1 − ∫ π(x)f0(x) dx = 1 − u⁻¹.
Hence, the distribution of Y1 is given by
pr(Y1 ≤ y) = ∑_{k=1}^∞ pr(U1 > π(X1), . . . , U_{k−1} > π(X_{k−1}), Uk ≤ π(Xk), Xk ≤ y)
= ∑_{k=1}^∞ (1 − u⁻¹)^{k−1} pr(U ≤ π(X), X ≤ y)
= u pr(U ≤ π(X), X ≤ y)
= u E{pr(X ≤ y, U ≤ π(X) | X)}
= u E{π(X) 1(X ≤ y)},
where the third equality uses the geometric series ∑_{k=1}^∞ (1 − u⁻¹)^{k−1} = u.
Taking the definition of π(x) into consideration, we find
pr(Y1 ≤ y) = u ∫_{−∞}^y {f(x)/(u f0(x))} f0(x) dx = F(y).
This shows that the rejective sampling method indeed leads to random num-
bers from the target distribution.
Let us define the waiting time
T = min{i : Ui ≤ π(Xi)},
which is the number of pairs of pseudo numbers (X, U) it takes to get one
pseudo observation Y. Its probability mass function is given by
pr(T = k) = pr(U1 > π(X1), . . . , U_{k−1} > π(X_{k−1}), Uk ≤ π(Xk)) = (1 − u⁻¹)^{k−1} u⁻¹.
That is, T has a geometric distribution with mean u.
If we use an f0(·) which leads to a large u, rejective sampling is numerically
less efficient: it takes more tries on average to obtain one sample from
the target distribution. The best choice in terms of computational efficiency
is f0(·) = f(·). Of course, this means we are not really using rejective
sampling at all.
Here is an exercise problem. Suppose we wish to generate random
numbers from the standard normal distribution, whose density is given by
φ(x) = (2π)^{−1/2} exp(−x²/2), and we can easily generate data from the
double exponential distribution, whose density is
f0(x) = (1/2) exp(−|x|).
Rejective sampling is one choice. Compute the constant u as defined above.
Write code in R to implement the rejective sampling method to generate
n = 1000 observations from N(0, 1). Show the Q-Q plot of the data generated
and report the number of pairs (X, U) required by the rejective sampling. How
many pairs (X, U) do you expect to be needed to generate n = 1000
normally distributed random numbers with this method?
10.4 Markov chain Monte Carlo
As I am not an expert myself, my comments here may not be accurate. The
rejective sampling approach appears to be effective for generating univariate
random variables (pseudo numbers). In applications, we may wish to generate
a large quantity of vector valued observations. Markov chain Monte Carlo seems
to be one of the solutions to this problem. To introduce this method, we need
a dose of Markov chain theory.
10.4.1 Discrete time Markov chain
A Markov chain is a special type of stochastic process; a stochastic process
in turn is a collection of random variables. We cannot pay an equal amount
of attention to all stochastic processes, only to the ones that behave themselves.
The Markov chain is one of them.
We narrow our focus further to processes containing a sequence of
random variables having a beginning but no end:
X0, X1, X2, . . . .
The subindices {0, 1, 2, . . .} are naturally called time. In addition, we consider
the case where Xn takes values in the same space with countably many members
for all n. Without loss of generality, we assume the space is
S = {0, ±1, ±2, . . .}.
We call S the state space. For such a stochastic process, we define the transition
probabilities for s < t to be
pij(s, t) = pr(Xt = j|Xs = i).
Definition 10.1. A discrete time Markov chain is an ordered sequence of
random variables with discrete state space S and has Markov property:
pr(Xs+t = j|Xs = i,Xs−1 = i1, . . . , Xs−k = ik) = pij(s, s+ t)
for all i, j ∈ S and s, t ≥ 0.
If further, all one-step transition probabilities pij(s, s+ 1) do not depend
on s, we say the Markov chain is time homogeneous.
The Markov property is often phrased as: given the present, the future
is independent of the past. In this section, we further restrict ourselves to
homogeneous, discrete time Markov chains. We will work as if S is finite and
S = {1, 2, . . . , N}.
The subsequent discussion does not depend on this assumption, yet most
conclusions are easier to understand under it. We simplify the
one-step transition probability notation to pij = pr(X1 = j|X0 = i).
Let P be the matrix formed by the one-step transition probabilities: P = (pij).
For a finite state space Markov chain, its size is N × N. We may also notice
that its row sums equal 1. It is well known that the t-step transition matrix
P(t) = {pr(Xt = j|X0 = i)} = P^t
for any positive integer t. For convenience, we may take the 0-step transition
matrix as P^0 = I, the identity matrix. The relationship is so simple that we
do not need a specific notation for the t-step transition matrix.
Let Πt be the column vector made of pr(Xt = i), i = 1, 2, . . . , N, for
t = 0, 1, . . .. This vector fully characterizes the distribution of Xt; hence, we
simply call it the distribution of Xt. It is seen that
Πt^τ = Π0^τ P^t.
Namely, the distribution of Xt in a homogeneous discrete time Markov chain
is fully determined by the distribution of X0 and the transition probability
matrix P.
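The relation Πt^τ = Π0^τ P^t is easy to verify numerically; a sketch with an assumed 3-state transition matrix, iterating the one-step update until the distribution stops changing:

```python
def step(v, P):
    """One step of Pi_t^T = Pi_{t-1}^T P: (v^T P)_j = sum_i v_i P_ij."""
    n = len(v)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

# A small irreducible, aperiodic chain (an assumed toy transition matrix).
P = [[0.5, 0.5, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]

pi_t = [1.0, 0.0, 0.0]       # Pi_0: start in state 1 with certainty
for _ in range(200):         # Pi_t^T = Pi_0^T P^t
    pi_t = step(pi_t, P)

print([round(p, 4) for p in pi_t])           # close to the equilibrium distribution
print([round(p, 4) for p in step(pi_t, P)])  # one more step leaves it unchanged
```

Starting from any other Π0 produces the same limit, which anticipates the uniqueness statement in the theorem below.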
Under some conditions, lim_{t→∞} Πt exists. The limit is unique and is
a distribution on the state space S. For a homogeneous discrete time Markov
chain with finite state space, the following conditions are sufficient:
(a) irreducibility: for any (i, j) ∈ S, there exists a t ≥ 1 such that pr(Xt =
j|X0 = i) > 0;
(b) aperiodicity: the greatest common divisor of {t : pr(Xt = i|X0 = i) > 0}
is 1 for any i ∈ S.
When a Markov chain is irreducible, all states in S have the same period,
which is defined as the greatest common divisor of {t : pr(Xt = i|X0 = i) > 0}.
Theorem 10.3. If a homogeneous discrete time Markov chain has finite
state space and properties (a) and (b), then for any initial distribution Π0,
lim_{t→∞} Πt = Π
exists and is unique.
We call the Π in the above theorem the equilibrium distribution, and such a
Markov chain ergodic. It can further be shown that when these conditions
are satisfied, then for any i, j ∈ S,
lim_{t→∞} pr(Xt = j|X0 = i) = πj,
where πj is the jth entry of the equilibrium distribution Π.
Definition 10.2. For any homogeneous discrete time Markov chain with
transition matrix P and state space S, if Π is a distribution on the state
space such that
Π^τ = Π^τ P,
then we call it a stationary distribution.
It is easily seen that an equilibrium distribution is a stationary distribution.
However, there are examples where many stationary distributions exist but
there is no equilibrium distribution.
If a probability mass vector (πi : i = 0, 1, . . . , N) satisfies the balance equation
πi pij = πj pji
for any i, j, then it is a stationary distribution, and under conditions (a) and (b)
it is the limiting distribution of the Markov chain. In other words, the balance
equation serves as a criterion for whether (πi) is the limiting distribution of
the Markov chain.
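A quick numerical check of this criterion: build a transition matrix to satisfy the balance equation for an assumed target π (a birth-death construction, moving one state up or down), and verify that π^τ P = π^τ:

```python
# Target distribution (an assumed example).
pi = [0.1, 0.2, 0.3, 0.4]
n = len(pi)

# Build a birth-death transition matrix satisfying detailed balance
# pi_i p_ij = pi_j p_ji: propose a move one state up or down (prob 1/2 each)
# and accept with probability min(1, pi[target]/pi[current]).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    if i + 1 < n:
        P[i][i + 1] = 0.5 * min(1.0, pi[i + 1] / pi[i])
    if i - 1 >= 0:
        P[i][i - 1] = 0.5 * min(1.0, pi[i - 1] / pi[i])
    P[i][i] = 1.0 - sum(P[i])        # remaining mass: stay put

# The balance equation holds for every pair (i, j) ...
for i in range(n):
    for j in range(n):
        assert abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12

# ... and therefore pi is stationary: (pi^T P)_j = pi_j for every j.
piP = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print([round(x, 6) for x in piP])
```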
Finally, we comment on the relevance of this section to MCMC. Suppose one
wishes to generate observations from a distribution f(x). It is always possible
to find a discrete distribution Π whose p.m.f. is very close to that
of f(x). Suppose we can further create a Markov chain with a proper state
space and a transition matrix having Π as its equilibrium distribution. If so, we
may generate random numbers from this Markov chain: x1, x2, . . .. When
t is large enough, the distribution of Xt is nearly the same as the target
distribution Π. One should be aware that x1, x2, . . . are not observed values
of independent and identically distributed random variables. By the nature
of a Markov chain, they most likely have the same distribution, but they are
not independent.
The Markov chain Monte Carlo also works for continuous distributions.
However, the general theory cannot be presented without a full course on
Markov chain theory. This section provides some intuitive justification for
the Markov chain Monte Carlo methods of the next section.
10.5 MCMC: Metropolis sampling algorithms
Sometimes, direct generation of i.i.d. observations from a distribution f(·) is
not feasible. Rejective sampling can also be difficult, because finding a proper
f0(·) is not easy. These situations arise when f(·) is the distribution of a
high-dimensional random vector, or when it does not have an exact analytical
form. Markov chain Monte Carlo is regarded as a way out in the recent
literature. Yet you will see that the solution does not provide i.i.d. random
numbers/vectors, but dependent ones with the required marginal distributions.
Let X0, X1, X2, . . . be random variables that form a time-homogeneous
Markov process. We use "process" here instead of "chain" to allow the range of
X to be R^d or something generic. It has all the properties we mentioned
in the last section. We define the kernel function K(x, y) to be the conditional
density function of X1 given X0. Roughly speaking,
K(x, y) = pr(X1 = y|X0 = x) = f(x, y)/fX(x),
which is the transition probability when the process is in fact a chain. We
may also write
K(x, y) = f1|0(x1|x0)
for the conditional density of X1 given X0 when the joint density notation is
needed.
One Metropolis sampling algorithm goes as follows.
1. Let t = 0 and choose a x0 value.
2. Choose a proposal kernel K0(x, y) such that it is convenient to generate
random numbers/vectors from the corresponding conditional density of
the Markov process.
3. Choose a function r(x, y) taking values in [0, 1] and r(x, x) = 1.
4. Generate a y value from the conditional distribution K0(xt, y) and a
standard uniform random number u. If u < r(xt, y), let xt+1 = y; otherwise,
let xt+1 = xt. Update t = t + 1.
5. Repeat Step 4 until a sufficient number of random numbers is obtained.
In the above algorithm, we initially generate random numbers from a
Markov chain with transition probabilities specified by K0(x, y). Due
to a rejective sampling step, many outcomes are not accepted, in which
case the previous value xt is retained. What have we obtained?
It is easily seen that {x0, x1, . . .} remains a Markov chain with the
same state space, in spite of rejecting many y values generated according to
K0. We use a Markov chain to illustrate the point. The transition probability
of this Markov chain is computed as follows. Consider the case where X0 = i
and the subsequent Y is generated according to the conditional distribution
K0(i, ·). Let U be a uniform [0, 1] random variable. For any j ≠ i ∈ S,
we have
K(i, j) = pr(X1 = j|X0 = i)
= pr(U < r(i, Y), Y = j|X0 = i)
= r(i, j)K0(i, j).
Clearly, the chance of not making a move is
K(i, i) = 1 + K0(i, i) − ∑_{j∈S} r(i, j)K0(i, j).
Suppose the target distribution has probability mass function Π. We
hope to select K0(x, y) and r(x, y) so that Π is the equilibrium distribution
of the Markov chain with transition kernel K(x, y). Consider the situation
where the working kernel K0(x, y) is symmetric, and choose for all i, j
r(i, j) = min{1, Π(j)/Π(i)}
in the above; this is the so-called Metropolis algorithm. One important property
of this choice is that we need not know the individual values of Π(i) for each i,
but only their ratios. This is a useful property in the Bayes method, where
the posterior density
function is often known only up to a constant factor. Computing the value of
the constant factor is not a pleasant task. The above choice of r(i, j) makes
that computation unnecessary, which is a big relief.
With this choice of r(x, y), we find
Π(i)K(i, j) = min{Π(i),Π(j)}K0(i, j)
= min{Π(i),Π(j)}K0(j, i)
= Π(j)K(j, i).
This property is a sufficient condition for Π to be the equilibrium distribution
of the Markov chain with transition probabilities K(i, j). Note that
the existence of the equilibrium distribution is assumed here and can be
ensured by an appropriate choice of K0(i, j).
Although Step 4 in the Metropolis algorithm is very similar to rejective
sampling, they are not the same. In rejective sampling, if a proposed
value is rejected, it is thrown out and a new candidate is
generated. In Step 4 here, if a proposed value is rejected, the previous
value in the Markov chain is retained.
We presented the result for a discrete time homogeneous Markov chain with
countable state space. The symbolic derivation for a general state space is the
same.
The symmetry requirement on K0(x, y) is not absolutely needed to ensure
that the limiting distribution is given by Π. When K0(x, y) is not symmetric,
we use
r(x, y) = min{1, f(y)K0(y, x) / [f(x)K0(x, y)]}.
We use x, y here to reinforce the impression that both x, y can be real values,
not just integers.
A toy exercise is to show that this choice also leads to f(x) satisfying the
balance equation:
f(x)K(x, y) = f(y)K(y, x).
Finally, because f(x) is the density function of the equilibrium distribu-
tion, when t → ∞, the distribution of Xt generated from the Metropolis
algorithm has density function f(x). At the same time, the distribution of
Xt for any finite t is not f(x) unless that of X0 is. However, for large enough
t, we may regard the distribution of Xt as f(x). This is the reason why a
burn-in period is needed before we use the Xt as random samples from f(x) in
many applications.
Obviously, Xt and Xt+1 generated by this algorithm are not independent
except in very special cases. However, in many applications a non-i.i.d.
sequence suffices. For instance, when the Markov chain is ergodic,
n⁻¹ ∑_{t=1}^n g(Xt) → E{g(X)}
almost surely, where the expectation is computed with respect to the limiting
distribution.
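A sketch of the whole algorithm, using as target the unnormalized density f(x) ∝ exp(−x⁴)/{1 + x² + sin²(x)} from the rejective sampling section (so the constant C is never needed), a symmetric uniform random-walk proposal for K0, an assumed burn-in of 2,000 steps, and the ergodic average as output:

```python
import math
import random

rng = random.Random(13)

def f_unnorm(x):
    """The chapter's unnormalized target; the constant C is never needed."""
    return math.exp(-x ** 4) / (1.0 + x * x + math.sin(x) ** 2)

x = 0.0
burn_in, n = 2_000, 100_000
chain = []
for t in range(burn_in + n):
    y = x + rng.uniform(-1.0, 1.0)               # symmetric proposal K0
    if rng.random() < min(1.0, f_unnorm(y) / f_unnorm(x)):
        x = y                                    # accept the move
    if t >= burn_in:                             # discard the burn-in period
        chain.append(x)                          # on rejection, x_t = x_{t-1}

# Ergodic average: n^{-1} sum g(x_t) -> E{g(X)}; here g(x) = x, and E(X) = 0
# because f is an even function.
print(round(sum(chain) / n, 2))
```

Note that only the ratio f_unnorm(y)/f_unnorm(x) ever appears, which is why knowing f up to a constant suffices.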
10.6 The Gibbs samplers
Gibbs samplers are another class of algorithms that generate random numbers
based on a Markov chain. Suppose X = (U, V) has some joint distribution,
with both U and V taking values as real vectors. Suppose that given U = u
for any u, it is easy to generate a value v from the conditional distribution of
V |(U = u), and the opposite is also true. The goal is to generate
vectors with the distribution of U, with the distribution of V, or with the
distribution of (U, V). We should add that directly generating random vectors
(U, V) is not an easy task.
The following Gibbs sampler leads to a Markov chain/process whose
equilibrium distribution is that of U.
1. Pick a value u0 for U0. Let t = 0.
2. Generate a value vt from the conditional distribution V |(U = ut).
3. Generate a value ut+1 from the conditional distribution U |(V = vt).
4. Let t = t+ 1 and go back to Step 2.
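The four steps translate directly into code. As an assumed toy example, let (U, V) be bivariate normal with correlation ρ = 0.8, for which both conditionals are available in closed form: V |(U = u) ~ N(ρu, 1 − ρ²), and symmetrically for U given V.

```python
import math
import random

rng = random.Random(17)

# Assumed toy joint distribution: (U, V) bivariate normal, mean 0, unit
# variances, correlation rho. Both full conditionals are normal:
#   V | (U = u) ~ N(rho * u, 1 - rho^2),  U | (V = v) ~ N(rho * v, 1 - rho^2).
rho = 0.8
sd = math.sqrt(1.0 - rho * rho)

u = 0.0                          # Step 1: pick u0 and set t = 0
chain = []
for t in range(105_000):
    v = rng.gauss(rho * u, sd)   # Step 2: draw V | (U = u)
    u = rng.gauss(rho * v, sd)   # Step 3: draw U | (V = v)
    if t >= 5_000:               # discard an assumed burn-in period
        chain.append((u, v))

n = len(chain)
mean_u = sum(p[0] for p in chain) / n
mean_uv = sum(p[0] * p[1] for p in chain) / n
print(round(mean_u, 2), round(mean_uv, 2))  # E(U) = 0 and E(UV) = rho = 0.8
```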
Theorem 10.4. The random numbers generated from the above sampler,
applied to a joint distribution/density f(u, v), form an observed sequence of a
Markov chain/process {U0, U1, . . .}.
Assume that the limiting distribution of this Markov chain exists and is unique.
Then the limiting distribution of Ut is the corresponding marginal distribution
of f(u, v).
Proof. We give a proof for the discrete case only. Let pu|v(u|v) be the
conditional probability mass function of U given V, and similarly define
pv|u(v|u). The transition probability of the Markov chain is given by
pij = pr(Ut+1 = j|Ut = i) = ∑_k pu|v(j|k) pv|u(k|i).
Let gu(u) and gv(v) be the marginal distributions of U and V. Multiplying
both sides by gu(i) and summing over i, we have
∑_i gu(i) pij = ∑_i {∑_k pu|v(j|k) pv|u(k|i) gu(i)}
= ∑_k pu|v(j|k) {∑_i pv|u(k|i) gu(i)}
= ∑_k pu|v(j|k) gv(k)
= gu(j).
This implies that the distribution of U, as a column vector Π, satisfies the
relationship
Π^τ = Π^τ P,
where P is the transition matrix of the discrete Markov chain.
Since the limiting distribution of Ut is gu(·) and the conditional distribution
of Vt given Ut is pv|u(·|·), it is immediately clear that the marginal
distribution of Vt in the limit is gv(v). The joint limiting distribution is
f(u, v) = pv|u(v|u)gu(u), as desired.
There are clearly many other issues with the use of Gibbs sampling.
Not an expert myself, it is best for me not to say too much here.
10.7 Relevance to Bayes analysis
As we pointed out, the basis of Bayes data analysis is the posterior distribution
of the model parameters. However, we often only have the analytical
form of the posterior distribution up to a multiplicative constant. As seen
in the Metropolis sampling algorithm, this is all we need to generate random
numbers from such distributions.
In the case of Gibbs samplers, the idea can be extended. Suppose U =
(U1, U2, . . . , Uk) and we wish to obtain samples whose marginal distribution
is that of U. Let U−i be the subvector of U with Ui removed. Suppose it is
efficient to generate data from the conditional distribution of Ui given U−i
for each i. Then one may iteratively generate the Ui in turn to obtain samples
from the distribution of U using the Gibbs sampler.
10.8 Assignment problems
1. Find a function g(·) such that g(U) has the standard Cauchy distribu-
tion when U has uniform distribution on [0, 1].
2. Suppose we want to generate random numbers from the standard normal
distribution, whose density is given by φ(x) = (2π)^{−1/2} exp(−x²/2).
We decide to use a rejective sampling plan via the double exponential
distribution, whose density function is given by
f0(x) = (1/2) exp(−|x|).
(a) Compute the constant u as defined in our notes.
(b) Write a code in R to implement the rejective sampling method to
generate n = 1000 observations from N(0, 1).
(c) Work out the Q-Q plot of the data generated and report the number
of pairs of (X,U) in rejective sampling required.
(d) How many pairs of (X,U) do you expect to be needed to generate
n = 1000 normally distributed random numbers with this method?
3. In the Metropolis sampling algorithm, one choice of the r(·, ·) function is
r(x, y) = min{1, f(y)K0(y, x) / [f(x)K0(x, y)]}.
Show that this choice also leads to f(x) satisfying the balance equation:
f(x)K(x, y) = f(y)K(y, x).
Remark: for discrete case.
4. Suppose P is the transition probability matrix of a finite state space
Markov chain and Π = [πi : i = 0, 1, . . . , N]^τ is a probability vector. Prove
that if the πi satisfy the balance equation
πi pij = πj pji
for any i, j, then Π is the limiting distribution of the Markov chain
under conditions (a) and (b) in the chapter.
5. We used to have difficulty "identifying" the marginal posterior
distributions of µ and 1/σ² when their priors are independent. For the
purpose of generating samples from the posterior distributions of µ and
λ = 1/σ², this is not an issue if the Gibbs sampler is used. This problem
is designed to illustrate this point.
Assume that we have n observations x1, . . . , xn i.i.d. from N(µ, σ²).
(a) Let N(0, 4) be the prior for µ, and Gamma(d0 = 5) be the prior for
λ = 1/σ². Let µ and λ be independent. Write down the joint posterior
density function of µ and 1/σ² up to a multiplicative constant.
(b) Write code to generate data by the Gibbs sampler method from the
above posterior distribution. Generate N = 1000 pairs. Obtain the
posterior means of µ and 1/σ² based on the following data:
dat <- c(1.1777518, -0.5867896, 0.2283789, -0.1735369, -0.2328192,
1.0955114, 1.2053680, -0.7216797, -0.3387580, 0.1620835,
1.4173256, 0.0240219, -0.6647623, 0.6214567, 0.7466441,
1.9525066, -1.2017093, 1.9736293, -0.1168171, 0.4511754)
(c) Plot the posterior density function in (a) and kernel density estimators
of µ and λ based on the posterior sample obtained in (b).
Remark: make use of existing functions for generating Gamma and normal
random numbers, and for density estimation.
6. Construct an example in which the Gibbs sampler fails to generate
random numbers from the marginal distributions of f(u, v), as stated in
the theorem in this chapter, when some condition is violated.
Chapter 11
More on asymptotic theory
Various approaches to point estimation have been discussed so far. An
estimator is recommended when it has certain desirable properties. Among
other things, we would like to know its bias and variance, which can be derived
from its finite sample distribution. Characterizing exact finite sample
distributions is difficult in most cases. Fortunately, also in most cases, an estimator
has a limiting distribution when the sample size increases to infinity. The
limiting distribution approximates the finite sample distribution and enables
us to make inferences accordingly. In this chapter, we provide additional
discussions on asymptotic theories.
11.1 Modes of convergence
Let X,X1, X2, . . . be a sequence of random variables defined on some prob-
ability space (Ω,B, P ).
Definition 11.1. We say {Xn}∞n=1, or simply Xn, converges in probability to
the random variable X if for every ε > 0,
lim n→∞ pr(|Xn −X| > ε) = 0.
We use notation Xn p−→ X.
Here is an example in which the convergence in probability can be directly
verified.
Example 11.1. Let X1, X2, . . . be a sequence of i.i.d. random variables, each
having the exponential distribution with rate λ > 0. Let
X(1) = min{X1, X2, . . . , Xn}.
Then X(1) p−→ 0.
Proof: Here 0 is considered as a random variable which takes the value 0 with
probability 1. Note that for every ε > 0,
pr(|X(1) − 0| > ε) = pr(X(1) > ε)
= pr(X1 > ε, . . . , Xn > ε)
= pr(X1 > ε) · · · pr(Xn > ε)
= exp(−nλε) → 0
as n→∞. Hence, by Definition 11.1, X(1) p−→ 0.
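This convergence can also be seen numerically. The following Python sketch, an illustrative simulation rather than part of the argument, estimates pr(X(1) > ε) for ε = 0.1 and compares n = 10 with n = 100; the computation above predicts exp(−nλε):

```python
import random

def prob_min_exceeds(n, eps, lam=1.0, reps=2000, seed=7):
    # Monte Carlo estimate of pr(X_(1) > eps), where X_(1) is the
    # minimum of n i.i.d. Exponential(lam) variables.
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xmin = min(rng.expovariate(lam) for _ in range(n))
        if xmin > eps:
            count += 1
    return count / reps

p10 = prob_min_exceeds(10, 0.1)    # theory: exp(-1), about 0.37
p100 = prob_min_exceeds(100, 0.1)  # theory: exp(-10), essentially 0
```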
Definition 11.2. We say Xn converges to X almost surely (or with probability
1) if and only if
pr{ω : lim n→∞ Xn(ω) = X(ω)} = 1.
We use notation Xn a.s.−→ X.
Here is a quick example for the mode of almost sure convergence.
Example 11.2. Let Y be a random variable and let Xn = n^{−1}Y for n =
1, 2, . . .. For any sample point ω ∈ Ω, as n→∞, we have
Xn(ω) = n^{−1}Y (ω)→ 0.
Hence,
pr(ω : limXn(ω) = 0) = 1.
Therefore Xn → 0 almost surely.
It is natural to ask whether the two modes of convergence defined so
far are equivalent. The following example shows that convergence in
probability does not imply almost sure convergence. The construction is
somewhat involved; please do not spend a lot of time on it.
Example 11.3. Consider a probability space (Ω,B, P ) where Ω = [0, 1], B
is the usual Borel σ-algebra, and the probability measure pr is the Lebesgue
measure. For any event A ∈ B, 1(A) is an indicator random variable. Define,
for k = 0, 1, 2, . . . , 2^{n−1} − 1 and n = 1, 2, . . .,
X_{2^{n−1}+k} = 1([k/2^{n−1}, (k + 1)/2^{n−1}]).
Since any positive integer m can be uniquely written as 2^{n−1} + k for some n
and k between 0 and 2^{n−1} − 1, Xm is well defined for all positive integers m.
On one hand, for every ε > 0, it is seen that
pr(|Xm − 0| > ε) ≤ 2^{−(n−1)} → 0.
Hence, Xm p−→ 0.
On the other hand, for each ω ∈ Ω and any given n, there is a k such
that
(k − 1)/2^n ≤ ω < k/2^n.
Hence, no matter how large N is, we can always find an m = 2^{n−1} + k > N
for which Xm(ω) = 1 and Xm+1(ω) = 0. Therefore, Xm(ω) does not have
a limit. This claim is true for every sample point in Ω. Hence, Xm does not
converge almost surely to anything.
The following theorem shows that the mode of almost sure convergence
is a stronger mode of convergence.
Theorem 11.1. If Xn converges almost surely to X, then Xn p−→ X.
Let Bn, n = 1, 2, . . . be a sequence of events. That is, they are subsets of
the sample space Ω and members of B. If a sample point belongs to infinitely
many Bn, for example to all B2n, we say it occurs infinitely often. The
subset consisting of sample points that occur infinitely often is denoted
as
{Bn i.o.} = ∩∞n=1 ∪∞i=n Bi.
Theorem 11.2 (Borel-Cantelli Lemma). 1. Let {Bn} be a sequence
of events. Then
∑∞n=1 pr(Bn) < ∞
implies
pr({Bn i.o.}) = 0;
2. If Bn, n = 1, 2, . . . are mutually independent, then
∑∞n=1 pr(Bn) = ∞
implies
pr({Bn i.o.}) = 1.
The proof of this lemma relies on the expression {Bn i.o.} = ∩∞n=1∪∞i=nBi.
We now introduce other modes of convergence.
11.2 Convergence in distribution
The convergence in distribution is usually discussed together with the modes
of convergence for a sequence of random variables. Although they are
connected, convergence in distribution is very different in nature from the
other modes of convergence.
Definition 11.3. Let G1, G2, . . . be a sequence of (univariate) cumulative
distribution functions, and let G be another cumulative distribution function.
We say Gn converges to G in distribution, denoted as Gn d−→ G, if
lim n→∞ Gn(x) = G(x)
for all points x at which G(x) is continuous.
This definition is not based on a sequence of random variables. If there
is a sequence of random variables X1, X2, . . . and X whose distributions are
given by G1, G2, . . . and G, we also say that Xn d−→ X. These random
variables need not be defined on the same probability space. When we state
that Xn d−→ X, it means that the distribution of Xn converges to the
distribution of X as n→∞.
Theorem 11.3. If Xn p−→ X, then Xn d−→ X.
Suppose c is a non-random constant. If Xn d−→ c, then Xn p−→ c.
A probability space is generally irrelevant to the convergence in distri-
bution. Yet we can create a shadow probability space for the corresponding
random variables.
Theorem 11.4 (Skorokhod’s representation theorem). If Gn d−→ G,
then there exist a probability space (Ω,B, P ) and random variables Y1, Y2, . . .
and Y , such that
1. Yn has distribution Gn for n = 1, 2, . . . and Y has distribution G;
2. Yn a.s.−→ Y .
The following result is intuitively right but hard to prove unless the above
theorem is applied.
Example 11.4. If Xn d−→ X and g is a real, continuous function, then
g(Xn) d−→ g(X).
This is a simple exercise problem. There is an equivalent definition of the
mode of convergence in distribution. We state it here as a theorem.
Theorem 11.5. Let X1, X2, . . . be a sequence of random variables. Then,
Xn d−→ X if and only if E{g(Xn)} → E{g(X)} for all bounded, uniformly
continuous real-valued functions g.
11.3 Stochastic Orders
Random variables come in different sizes. When a number of random
variable sequences are involved in a problem, it is helpful to know their
relative sizes. Let {Xn}∞n=1 be a sequence of random variables. If Xn p−→ 0,
we say Xn = op(1). That is, compared with constant 1, the size of Xn
becomes less and less noticeable. Naturally, we may also want to compare
Xn with other sequences of numbers.
Definition 11.4. Let {an} be a sequence of positive constants. We say Xn =
op(an) if Xn/an p−→ 0 as n→∞.
Let {Yn}∞n=1 be another sequence of random variables. We say Xn =
op(Yn) if and only if
Xn/Yn = op(1).
How do we describe that Xn and an are about the same magnitude?
Intuitively, this should be the case when Xn/an stays clear of both 0
and infinity. In common practice, we only exclude the latter. A rigorous
mathematical definition is as follows:
Definition 11.5. We say Xn = Op(an) if and only if for every ε > 0, there
exists an M such that for all n,
pr(|Xn/an| ≥M) < ε.
Note that Xn = Op(an) only reveals that |Xn| is not large compared
with an. The size of |Xn| can, however, be much smaller than an.
Example 11.5. Assume X1, X2, . . . is a sequence of i.i.d. Poisson random
variables with mean 1. Then
max{X1, X2, . . . , Xn} = Op(log n).
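A quick simulation makes this order plausible. The Python sketch below, an illustrative check with arbitrarily chosen sample sizes, computes the maximum of n Poisson(1) draws divided by log n; the ratio stays bounded as n grows:

```python
import math
import random

def max_over_logn(n, seed=3):
    # Ratio of the maximum of n i.i.d. Poisson(1) draws to log(n).
    rng = random.Random(seed)
    def pois1():
        # Knuth's multiplication method for a Poisson variable with mean 1.
        threshold, k, p = math.exp(-1.0), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1
    return max(pois1() for _ in range(n)) / math.log(n)

ratios = [max_over_logn(n) for n in (100, 1000, 10000)]  # all moderate in size
```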
The next example essentially gives an equivalent definition.
Example 11.6. If for every ε > 0, there exists an M such that for all n large
enough,
pr(|Xn/an| ≥M) < ε,
then Xn = Op(an).
The following is a useful result.
Example 11.7. Suppose Xn → X in distribution, then Xn = Op(1).
11.3.1 Application of stochastic orders
Stochastic orders enable us to ignore irrelevant details about Xn and Yn in
asymptotic derivations. Some useful facts are as follows.
Lemma 11.1. The stochastic orders have the following properties.
1. If Xn = Op(1) and Yn = op(1), then −Xn = Op(1), −Yn = op(1).
2. If Xn = Op(1) and Yn = Op(1), then XnYn = Op(1), Xn + Yn = Op(1).
3. If Xn = op(1) and Yn = op(1), then XnYn = op(1), Xn + Yn = op(1).
4. If Xn = op(1) and Yn = Op(1), then XnYn = op(1), Xn + Yn = Op(1).
If Xn converges to X in distribution and Yn differs from Xn by a random
amount of size op(1), we expect that Yn also converges to X in distribution.
This is a building block for more complex approximation theorems.
Lemma 11.2. Assume Xn d−→ X and Yn = Xn + op(1). Then Yn d−→ X.
Proof: Let x be a continuity point of the c.d.f. of X. Let ε > 0 be such that
x+ ε is also a continuity point of the c.d.f. of X. Then
pr(Yn ≤ x) = pr(Yn ≤ x, |Yn −Xn| ≤ ε) + pr(|Yn −Xn| > ε, Yn ≤ x)
≤ pr(Xn ≤ x+ ε) + pr(|Yn −Xn| > ε)
→ pr(X ≤ x+ ε).
The second term goes to zero because Yn −Xn = op(1).
For any given x, ε can be chosen arbitrarily small due to the
monotonicity of distribution functions. Thus we must have
lim sup n→∞ pr(Yn ≤ x) ≤ pr(X ≤ x).
Similarly, we can show
lim inf n→∞ pr(Yn ≤ x) ≥ pr(X ≤ x).
The two inequalities together imply
pr(Yn ≤ x)→ pr(X ≤ x)
for all x at which the c.d.f. of X is continuous. Hence Yn d−→ X.
The above result makes the next lemma obvious.
Lemma 11.3. Here are some additional properties of the stochastic orders:
If an → a, bn → b, and Xn d−→ X, then anXn + bn d−→ aX + b.
If Yn p−→ a, Zn p−→ b, and Xn d−→ X, then YnXn + Zn d−→ aX + b.
The following well-known theorem becomes a simple implication.
Theorem 11.6 (Slutsky’s Theorem). Let Xn d−→ X and Yn p−→ c where
c is a finite constant. Then
1. Xn + Yn d−→ X + c;
2. XnYn d−→ cX;
3. Xn/Yn d−→ X/c when c ≠ 0.
Although the formal Slutsky’s Theorem is stated as above, many of us
regard Lemma 11.2 as the conclusion of this theorem. Each of the conclusions
in the above theorem can be easily proved by using Lemma 11.2. In these
lecture notes, we will refer to Lemma 11.2 as Slutsky’s Theorem.
Here is another convenient theorem.
Theorem 11.7. Let an be a sequence of real values and Xn a sequence
of random variables. Suppose an →∞ and an(Xn − µ) d−→ Y . If g(x) is a
function which has a continuous derivative at x = µ, then
an{g(Xn)− g(µ)} d−→ g′(µ)Y.
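This delta-method statement can be checked by simulation. In the Python sketch below, the choices an = √n, X̄n averaging Exp(1) data so that µ = 1 and Y ~ N(0, 1), and g(x) = x² are all illustrative; since g′(1) = 2, the quantity √n{g(X̄n) − g(1)} should be approximately N(0, 4):

```python
import math
import random

def delta_method_check(n=400, reps=3000, seed=11):
    # Simulate sqrt(n) * (Xbar^2 - 1) where Xbar averages n Exp(1) draws;
    # Theorem 11.7 predicts a N(0, 4) limit since g'(1) = 2.
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        vals.append(math.sqrt(n) * (xbar ** 2 - 1.0))
    mean = sum(vals) / reps
    var = sum((v - mean) ** 2 for v in vals) / reps
    return mean, var

m, v = delta_method_check()   # m should be near 0, v near 4
```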
The most useful result for convergence in distribution is the central limit
theorem.
Theorem 11.8 (Central Limit Theorem). Assume X1, X2, . . . are i.i.d.
random variables with E(X) = 0 and var(X) = 1. Then as n→∞,
√n X̄n d−→ N(0, 1).
If, instead, E(X) = µ and var(X) = σ², then
1. √n σ^{−1}(X̄n − µ) d−→ N(0, 1);
2. √n (X̄n − µ) d−→ N(0, σ²);
3. n^{−1/2} ∑_{i=1}^n {(Xi − µ)/σ} d−→ N(0, 1);
4. n^{−1/2} ∑_{i=1}^n (Xi − µ) d−→ N(0, σ²).
It is not advised to state
X̄n − µ d−→ N(0, σ²/n).
The righthand side is not a limit at all.
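A simulation illustrates form 1 of the theorem. In the Python sketch below, the Uniform(0, 1) data, with µ = 1/2 and σ² = 1/12, are chosen purely as an example; about 95% of the standardized means should fall within ±1.96 if the normal approximation holds:

```python
import math
import random

def clt_coverage(n=200, reps=4000, seed=5):
    # Fraction of standardized means sqrt(n) * (Xbar - mu) / sigma that
    # land in (-1.96, 1.96); expected to be near 0.95 under the CLT.
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    inside = 0
    for _ in range(reps):
        xbar = sum(rng.random() for _ in range(n)) / n
        z = math.sqrt(n) * (xbar - mu) / sigma
        if abs(z) < 1.96:
            inside += 1
    return inside / reps

cov = clt_coverage()   # should be close to 0.95
```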
Example 11.8. Let Xn, Yn be a pair of independent Poisson distributed
random variables with means nλ1 and nλ2. Define
Tn = (Yn/Xn)1(Xn > 0).
Then Tn is asymptotically normal.
11.4 Assignment problems
1. Prove that if Xn converges almost surely to X, then Xn p−→ X.
2. Let X1, X2, . . . be a sequence of random variables such that Xn d−→ X
for some random variable X. Prove that Xn = Op(1).
3. Let X1, X2, . . . be a sequence of i.i.d. random variables each having the
exponential distribution with rate λ > 0. Let
X(1) = min{X1, X2, . . . , Xn}.
Prove that, as n→∞,
(a) X(1) = op(n^{−1/2});
(b) X(1) = Op(n^{−1});
(c) X(1) = Op(1).
Remark: the orders proposed in this assignment problem are not
necessarily sharp.
4. Assume we have a sequence of random variables and consider the
dynamics as n→∞.
(a) If Xn = Op(n^{−1}), does it imply Xn = o(1) in the almost sure sense?
(b) If E{nXn} = 1 for all n, show that Xn = Op(n^{−1}). Is it possible
that Xn = Op(n^{−2}) in some cases?
5. Let an be a sequence of real values and Xn a sequence of random
variables. Suppose an →∞ and an(Xn−µ) d−→ Y . If g(x) is a function
satisfying the Lipschitz condition:
|g(x)− g(y)| ≤ C|x− y|
for some constant C, show that
an{g(Xn)− g(µ)} = Op(1).
6. Prove the following stochastic orders properties.
(a) If Xn = Op(1) and Yn = op(1), then −Xn = Op(1), −Yn = op(1).
(b) If Xn = Op(1) and Yn = Op(1), then XnYn = Op(1), Xn + Yn =
Op(1).
(c) If Xn = op(1) and Yn = op(1), then XnYn = op(1), Xn +Yn = op(1).
(d) If Xn = op(1) and Yn = Op(1), then XnYn = op(1), Xn+Yn = Op(1).
7. Let Xn, Yn be a pair of independent Poisson distributed random
variables with means nλ1 and nλ2. Define
Tn = (Yn/Xn)1(Xn > K)
for some K > 0. Show that
√n(Tn − λ2/λ1) d−→ N(0, σ²)
and work out the expression of σ².
8. Suppose Xn d−→ X as n→∞.
(a) Give an example in which E(Xn) does not converge to E(X).
(b) Give a nontrivial condition on top of Xn d−→ X so that, with its
addition, E(Xn)→ E(X); provide a solid proof.
9. Let X1, X2, . . . , Xn be a random sample (i.i.d.) from the family of uniform
distributions with density function
f(x; θ) = θ^{−1} 1(0 < x < θ)
and the parameter space Θ = R+; namely, it contains all positive real
values.
(a) Find the maximum likelihood estimator θ̂ of θ. What is the value
of the likelihood function at θ̂?
(b) Find an appropriate an such that an(θ̂ − θ) has a non-degenerate
limiting distribution. Work out the limiting distribution.
Chapter 12
Hypothesis test
Recall again that a statistical model is a family of distributions. When the
distributions are parameterized, the model is parametric; otherwise, the model
is nonparametric. One may notice that regression models are no exception
to this definition. Suppose a random sample from a distribution F is
obtained/observed. A statistical model assumption specifies a distribution
family F such that F is believed to be a member of it.
Often, we are interested in a special subfamily F0 of F . The statistical
problem is to decide whether or not F is a member of F0 based on a random
sample from this unknown F . There might be situations where the question
can be answered with certainty. Most often, statistics are used to quantify
the strength of the evidence against F0 from chosen angles. Hypothesis test
is an approach which recommends whether or not F0 should be rejected. It
also implicitly recommends a distribution in the complement of F0 if F0 is
rejected. We consider F0 as the null hypothesis and also denote it as H0. Its
complement in F forms the alternative hypothesis and is denoted as Ha or H1.
The specification of F is based on our knowledge of the subject matter
and the properties of probability distributions. For instance, a binomial
distribution family is used to model the number of passengers who show up
for a specific flight, the number of students who attend a class, and so on. The choice of
F0 often relates to the background of the application. We provide a number
of scenarios in the next section.
12.1 Null hypothesis.
Where is F0 from? The question is more complicated than we may believe.
Here are some examples motivated from various classical books.
(a) The null hypothesis may correspond to the prediction out of some sci-
entific curiosity. One wishes to use data to examine its validity.
We suspect that the sex ratio of new babies is 50%. In this case, one
may collect data to critically examine how well this belief approximates
the real world.
(b) In genetics, when two genes are located in two different chromosomes,
their recombination rate is exactly θ = 0.5 according to Mendel’s law.
Rejection of a null hypothesis of θ = 0.5 based on experimental or
observational data leads to meaningful scientific claims.
Scientists or geneticists in this and similar cases must bear the bur-
den of proof. The null hypothesis stands on the opposite side of their
convictions.
(c) Some statistical methods are developed under certain distributional
assumptions on the data such as the analysis of variance. If the nor-
mality assumption is severely violated, the related statistical conclu-
sions become dubious. A test of normality as the null hypothesis is
often conducted. We are alarmed only if there is a serious departure
from normality. Otherwise, we will go ahead to analyze the data under
normality assumption.
(d) H0 may assert complete absence of structure in some sense. So long as
the data are consistent with H0, it is not justified to claim that the data
provide clear evidence in favour of some particular kind of structure.
Does living near a hydro power line make children more likely to develop
leukaemia? The null hypothesis would suggest that the cases are
distributed geographically at random.
(e) The quality of products from a production line fluctuates randomly
within some range over time. One may set up a null hypothesis
that the system is in a normal status characterized by some key
parameter values. The rejection of the null hypothesis sets off an alarm
that the system is out of control.
(f) When a new medical treatment is developed, its superiority over the
standard treatment must be established in order to be approved. Nat-
urally, we will set the null hypothesis to be “there is no difference
between two treatments”.
(g) There are situations where we wish to show that a new medicine is not
inferior to the existing one. This is often motivated by the desire to
produce a specific medicine at a lower cost. One needs to think carefully
about what the null hypothesis should be here.
(h) In linear regression models, we are often interested in testing whether
a regression coefficient has a value different from zero. We put the zero
value as the null value; its rejection implies that the corresponding
explanatory variable has a non-nil influence on the response.
In all examples, we do not reject H0 unless the evidence against it is
mounting. Often, H0 is not rejected not because it holds true perfectly, but
because the data set does not contain sufficient information, or the departure
is too mild to matter in a scientific sense, or the departure from H0 is not in
the direction of concern. It is hard to distinguish these causes. We will come
back to this issue again after introducing the alternative hypothesis.
12.2 Alternative hypothesis
In the last section, we discussed the motivation of choosing a subset F0
of F to form H0. It is natural to form the alternative hypothesis Ha or
H1 as the remaining distributions in F . If so, the alternative hypothesis
is heavily dependent on our choice of F . Since any data set is extreme in
some respects, severe departure from F0 can always be established. Thus, it
can be meaningless to ask absolutely whether F0 is true, by allowing F to
contain all imaginable distributions. The question becomes meaningful only
when a proper alternative hypothesis is proposed.
The alternative hypothesis serves the purpose of specifying the direction
of the departure of the true model from the null hypothesis that we care about! In the
example when a new medicine is introduced, the ultimate goal is to show that
it extends our lives. We put down a null hypothesis that the new medicine is
not better than the existing one. The goal of the experiment and hence the
statistical hypothesis test is to show the contrary: the new medicine is better.
Thus, the alternative hypothesis specifies the direction of the departure we
intend to detect.
In regression analysis, we may want to test the normality assumption on
the error term to ensure the suitability of the least squares approach.
In this case, we often worry whether the true distribution has a heavier tail
probability than the normal distribution. Thus, we want to detect departures
toward “having a heavy tail”. If the error distribution is not normal but
uniform on a finite interval, for instance, we may not care at all. Therefore,
if H0 is not rejected based on a hypothesis test, we have not provided any
evidence to claim H0 is true. All we have shown is that the error distribution
does not seem to have a heavy tail.
According to genetic theory, the recombination rate θ of two genes on
the same chromosome is lower than 0.5. Hence, if the data lead to a very
high observed recombination rate, we may have evidence to reject the null
hypothesis of θ = 0.5. However, this does not support the often sought
genetic claim that the two genes are linked. To establish linkage, F would be
chosen as all binomial distributions with probability of success no more than
0.5.
In many social sciences, theories are developed in which the response
of interest is related to some explanatory variable. When one can afford
to collect a very large data set, such a connection is always confirmed by
rejecting the null hypothesis that the correlation is nil. As long as the theory
is not complete nonsense, a low level of connection inevitably exists.
When the data size is large, even a practically meaningless connection will
be detected with statistical significance.
In summary, specifying the alternative hypothesis is more than simply putting
down the possible distributions of the data in addition to those included in the
null already. It specifies the direction of the departure from the null model
which we hope to detect or to declare non-fitting. We generally investigate
the hypothesis test problem under the assumption that the data are generated
from a distribution inside H0, and ask what happens if this distribution is a
member of H1. This practice is convenient for statistical research. We should
not take it as truth in applications. It could happen that the data suggest
the truth is not in H0, that H1 is only a slightly better choice, and yet the
truth is in neither H0 nor H1. Hence, by rejecting H0, the hypothesis test
itself does not prove that H1 contains the truth.
12.3 Pure significance test and p-value
Suppose a random sample X = x is obtained from a distribution F and the
statistical model is F . We hope to test the null hypothesis H0 : F ∈ F0. Let
T (x) be a statistic to be used for the statistical hypothesis test; hence, we call
it a test statistic. Ideally, it is chosen to have two desirable properties:
(a) the sampling distribution of T when H0 is true is known (not
merely up to a distribution family but as a specific distribution), at least
approximately. If H0 contains many distributions, this property implies
that the sampling distribution of T remains the same, at least approximately,
whichever distribution in F0 that X may have. In other words,
T is an ancillary statistic under H0.
(b) the larger the observed value of T , the stronger the evidence of depar-
ture from H0, in the direction of H1.
If a statistic has these two properties, we are justified in rejecting the null
hypothesis when the realized value of T is large. Let t0 = T (x) be its
realized/observed value and
p0 = pr(T (X) ≥ t0;H0),
which is the probability that T (X) is no smaller than the observed value when
the null hypothesis is true. When pr(T (X) = t0;H0) > 0, a continuity
correction may be applied. That is, we may revise the definition to
p0 = pr(T (X) > t0;H0) + 0.5 pr(T (X) = t0;H0).
In general, this is just a convention, not an issue of “correctness”. The
smaller the value of p0, the stronger the evidence that the null hypothesis
is false. We call p0 the p-value of the test.
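The two definitions are easy to compute in a discrete example. The Python sketch below uses the Binomial(10, 0.5) null with observed value t0 = 8, an assumed example, to evaluate both the plain upper-tail p-value and the continuity-corrected version defined above:

```python
from math import comb

def binom_tail_pvalues(n, p, t0):
    # Exact upper-tail p-value and continuity-corrected ("mid") p-value
    # for T ~ Binomial(n, p) with observed value t0.
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    p_plain = sum(pmf[t0:])                    # pr(T >= t0; H0)
    p_mid = sum(pmf[t0 + 1:]) + 0.5 * pmf[t0]  # pr(T > t0) + 0.5 pr(T = t0)
    return p_plain, p_mid

p_plain, p_mid = binom_tail_pvalues(10, 0.5, 8)
# p_plain equals 56/1024, while the corrected value 33.5/1024 is smaller.
```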
Remark: the definition of the p-value is well motivated when a test statistic
with the above two desired properties has been introduced. Without the
known-distribution property, pr(T (X) ≥ t0;H0) does not have a definitive
answer. Without the other property, we are not justified in being exclusively
concerned with the event T (X) ≥ t0 rather than other possible values of
T (X).
If T is a test statistic with properties (a) and (b), and g is a monotone,
strictly increasing function, then g(T ) makes another test statistic, and the
p-value based on g(T ) will be the same as the p-value based on T .
Since there is no standard choice of T (x), there is no definite p-value
for a specific pair of hypotheses even if the test statistic T (x) has these two
properties. Because of this, the definition of the p-value has been elusive in
many books.
Assume the issues mentioned above have been settled. If, magically, p0 = 0,
then H0 cannot be true, or something impossible would have been observed.
When p0 is very small, either we have observed an unlikely event under
H0, or the event is much better explained by a distribution in H1. Hence,
we are justified in rejecting H0 in favour of H1. Take notice that a larger T (x)
value is more likely if the distribution F is a member of H1.
How small should p0 be in order for us to reject H0? A statistical practice
is to set up a standard, say 5%, so we commonly reject H0 when p0 < 5%. The
choice of 5% is merely a convention; there is no scientific truth behind this
magic cut-off point. There is a joke related to this number: scientists tell their
students that 5% was found to be optimal by statisticians, and statisticians
tell their students that 5% was chosen based on some scientific principle.
Incidentally, the Food and Drug Administration in the United States
uses 5% as its gold standard. If a new medicine beats the existing one by
a pre-specified margin, and this is demonstrated by a test of significance at the
5% level, then the new medicine will be approved. Of course, we assume that
all other requirements have been met. Most research journals used to accept
results established via statistical significance at the 5% level. You will pretty
soon be under pressure to find a statistical method that results in a p-value
smaller than 5% for a scientist. Recently, however, this practice has been
discredited.
Not all test statistics we recommend have both properties (a) and (b).
There are practical reasons behind the use of statistics without these prop-
erties. When their usage leads to controversies, it is helpful to review the
reasons why properties (a) and (b) are desirable and interpret the data anal-
ysis outcomes accordingly.
In the above discussion, no specifics about H0 and H1 are given, nor any-
thing specific about the test statistic. So the discussion is purely conceptual,
leading to the term “pure significance test”. One is advised not to take
this term too seriously.
12.4 Issues related to p-value
After one has seen the data, one can easily find that the data are extreme in
some way. One may then select a null hypothesis accordingly and, most likely,
the p-value will be small enough to declare statistical significance (if the old
standard is permitted). This problem is well known but hard to prevent.
After you have seen the final exam results of stat460/560, you may compare
the average marks between undergraduate and graduate students, between
male and female students, between foreign and domestic students, between
younger and older students, and in many more ways. If the 5% standard on
the p-value is applied to each test, pretty soon we will find one set of hypotheses
that tests significant. This is statistically invalid. To find one out of 20 tests
with its p-value below 5% is much more likely than to find a p-value below
5% in a single pre-decided test.
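The point is easy to quantify. With m independent tests of true null hypotheses, each at level α, the chance that at least one rejects is 1 − (1 − α)^m. A minimal Python sketch, with 20 tests matching the example above:

```python
def prob_some_rejection(m, alpha=0.05):
    # Probability that at least one of m independent level-alpha tests
    # rejects when every null hypothesis is actually true.
    return 1.0 - (1.0 - alpha) ** m

p1 = prob_some_rejection(1)    # a single pre-decided test: 0.05
p20 = prob_some_rejection(20)  # twenty searches through the data: about 0.64
```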
A pharmaceutical company must provide a detailed protocol before a
clinical trial is carried out. If the data fail to reject the null hypothesis,
but point to another meaningful phenomenon, the FDA will not accept the
result based on analysis of the current data. The company must conduct
another clinical trial to establish the new claim. For example, if they try to
show that eating carrots reduces the rate of stomach cancer, yet the data
collected imply a reduction in the rate of liver cancer, the conclusion will not
be accepted. One could have examined the rates of a thousand cancers: liver
cancer happened to produce a low p-value. By this standard, Columbus did
not discover America because he did not put discovering America into his
protocol. Rather, he aimed to find a shortcut to India.
Another issue is the difference between statistical significance and
scientific significance. Consider a problem in the lottery business: each
ball, numbered from 1 to 49, should be equally likely to be selected. Suppose
I claim that the odd numbers are more likely to be sampled than the even
numbers. The rightful probability of an odd ball being selected should be
p = 25/49. In the real world, nothing is perfect. Assume that the truth is
p = 25/49 + 10^{−6}. It is not hard to show that if we conduct 10^{24} trials,
the chance that the null hypothesis p = 25/49 is rejected is practically 1,
at the 5% level or any reasonable level based on a reasonable test. Yet such a
statistically significant result is nonsensical to a lottery company. They need
not be alarmed unless the departure from p = 25/49 is more than 10^{−3},
presumably. In a more practical example, if a drug extends the average life
expectancy by one day, the result is not significant no matter how small a
p-value the test produces.
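A normal-approximation power calculation makes the contrast concrete. In the Python sketch below, the one-sided test and both sample sizes are illustrative choices; for the tiny alternative p = 25/49 + 10^{−6} discussed above, the power is essentially the level when n = 10^6 but essentially 1 when n = 10^{24}:

```python
import math

def power_one_sided(n, p0, p1, alpha_quantile=1.6448536269514722):
    # Normal-approximation power of the one-sided level-5% test of
    # H0: p = p0 against p > p0 when the true success probability is p1.
    se0 = math.sqrt(p0 * (1 - p0) / n)        # null standard error
    se1 = math.sqrt(p1 * (1 - p1) / n)        # standard error under p1
    z = (p0 + alpha_quantile * se0 - p1) / se1
    return 0.5 * math.erfc(z / math.sqrt(2))  # pr(phat exceeds the cutoff)

p0 = 25.0 / 49.0
tiny = power_one_sided(10 ** 6, p0, p0 + 1e-6)   # close to the 5% level
huge = power_one_sided(10 ** 24, p0, p0 + 1e-6)  # essentially 1
```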
There are abundant discussions on the usefulness of the p-value. There have
been suggestions of not teaching the concept of the p-value, with which I beg
to differ. The key is to make everyone understand what it represents, rather
than frantically searching for a test (analysis) that gives a p-value smaller
than 0.05.
Here is an example suggested by students. It is not very meaningful to be
100% sure that someone stole 10 dollars from a store. It is a serious claim if
we are 50% sure that someone killed the store owner.
In regression analysis, a regression coefficient is often declared highly
significant. This generally means that a very small p-value is obtained when
testing whether its value is zero. This is unfortunate: the regression coefficient
may be scientifically indistinguishable from zero, but its effect is magnified by
a microscope created by a big data set.
12.5 General notion of statistical hypothesis test
Suppose a random sample X from F is taken. The null hypothesis H0, as a
subset of F , is specified, and H1 is made of the rest of the distributions in F .
No matter how a test statistic is constructed, in the end one divides the range
of X into two, potentially three, non-overlapping regions: C and its complement
Cc. We will come back to the potential third region.
The hypothesis test procedure then rejects H0 when the observed value of
X satisfies x ∈ C. Thus, C is called the critical region. When x ∉ C, we
retain the null hypothesis. However, I do not advocate the terminology of
“accept H0”. Such a statement can be misleading. When we fail to prove
an accused person guilty, it does not imply innocence. “Not guilty ≠ innocent.”
When the true distribution F ∈ H0 yet x ∈ C occurs, the null hypothesis
H0 is erroneously rejected. The probability pr(X ∈ C; F) is called the Type I
error. We use

α(F) = pr(X ∈ C; F)

as a function of F on H0. We define

α = sup_{F∈H0} pr(X ∈ C; F) = sup_{F∈H0} α(F)
as the size of the test. Type I error is not the same as the size of the test
because H0 may contain many distributions. The size of a test is determined
by the “least favourable distribution” which is the one that maximizes the
probability of X ∈ C. Under simple models, it is easy to identify such a
least favourable distribution. In a general context, we have long given up the
effort of doing so.
If x ∉ C yet F ∈ H1, we fail to reject H0; the corresponding probability
is called the Type II error. For each distribution F ∈ H1, we call

β(F) = pr(X ∈ C; F)

the power function of F on H1. If F is a parametric model with parameter
θ, it is more convenient to use

β(θ) = pr(X ∈ C; θ), θ ∈ H1.
The Type II error is therefore γ(θ) = 1 − β(θ). The notational convention may
differ from one textbook to another; one should always read the "fine print"
before determining whether β(θ) is the power function or the Type II error.
We do not usually discuss the situation where F ∉ F. If this happens,
a "third type" of error has occurred. One should take this possibility into
serious consideration in real-world applications.
Example 12.1. (One-sample t-test). Assume we have a random sample
from a distribution that belongs to the family F = {N(θ, σ²)}. We test the null
hypothesis H0 : θ = 0.
Let

T(x) = √n x̄/s

where x̄ = n^{−1}(x1 + x2 + · · · + xn) is the realized value of X̄ and s² is the
realized value of the sample variance. It is seen that T(X) has the t-distribution
with n − 1 degrees of freedom regardless of which distribution in H0 is the true
distribution of X. Thus, it has property (a). At the same time, the larger
the value of |T|, the more obvious it is that the null hypothesis is inconsistent
with the data. Thus, |T| also has property (b). In other words, |T| rather
than T makes a desirable test statistic.
Let t0.975,n−1 be the 97.5% quantile of the t-distribution with n− 1 degrees
of freedom. We may put
C = {x : |T (x)| ≥ t0.975,n−1}
as the critical region of a test. If so, its size is
α = pr(|T (X)| ≥ t0.975,n−1;H0) = 0.05.
It is less convenient to write down its power function.
The p-value of this test is

p0 = pr(|T(X)| ≥ |T(x)|; H0)

where T(x) is the realized value of T. Rejecting H0 whenever p0 < 0.05 is
equivalent to rejecting H0 whenever x ∈ C. Providing a p-value has an added
benefit: we know whether H0 is rejected with barely sufficient evidence or with very
strong evidence.
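The equivalence between the p-value and the critical region can be checked numerically. Below is a minimal sketch (not part of the notes; the function names t_stat and mc_p_value are mine) that approximates p0 by Monte Carlo. It exploits property (a): T(X) has the same null distribution for every member of H0, so simulating from N(0, 1) suffices.

```python
import random
import statistics

def t_stat(xs):
    # T(x) = sqrt(n) * xbar / s, as in Example 12.1
    n = len(xs)
    return (n ** 0.5) * statistics.mean(xs) / statistics.stdev(xs)

def mc_p_value(x_obs, n_sim=20000, seed=1):
    # Approximate p0 = pr(|T(X)| >= |T(x)|; H0) by simulating from N(0, 1),
    # one convenient member of H0 (any sigma yields the same law for T)
    rng = random.Random(seed)
    n, t_obs = len(x_obs), abs(t_stat(x_obs))
    hits = sum(abs(t_stat([rng.gauss(0.0, 1.0) for _ in range(n)])) >= t_obs
               for _ in range(n_sim))
    return hits / n_sim
```

With data centred near 0 the p-value comes out large; a sample far from 0 gives a p-value near 0, and H0 is rejected at any conventional level.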
Again, a p-value should be taken with a pinch of salt. Even if the true θ-value
is only slightly different from 0, the evidence against H0 can be made very
strong with a large sample size n. Hence, a small p-value shows how strong the
evidence against H0 is; it does not necessarily indicate that H0 is an extremely
poor model for the data.
To avoid the dilemma implied by overly relying on small p-values, it might
be better to specify H1 as |θ| > 0.1 and put H0 as |θ| < 0.1 instead. We have
placed an arbitrary value 0.1 here; it is not hard to come up with a sensible
small value in a real-world application.
12.6 Randomized test
Particularly in theoretical development, we often hope to construct a test
with exactly the pre-given size. The above approach may not be feasible in
some circumstances.
Example 12.2. Suppose we observe X from a binomial model with n = 2
and probability of success θ ∈ (0, 1). Let the desired size of the test
be α = 0.05 for the null hypothesis θ = 0.5. In this case, X takes only the
values 0, 1, 2, so we have only 2³ = 8 candidates for the critical region C.
None of them results in a test of the exact size α = 0.05.
An artificial approach to find a test with the pre-specified size is as follows.
We do not reject H0 if X = 1. When X = 0, 2, we toss a biased coin
and reject H0 when the outcome is a head. By selecting a coin such that
pr(Head) = 0.1, the probability of rejecting H0 based on this approach is
exactly 0.05 when θ = 0.5. Thus, we have artificially attained the required
size 0.05.
The region {0, 2} is the third region in the range of X mentioned previously.
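A quick numerical check of Example 12.2 (an illustrative sketch; the function names are mine): the rejection probability of the randomized test is φ(0)·pr(X = 0) + φ(2)·pr(X = 2), which equals exactly 0.05 at θ = 0.5.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def rejection_prob(theta, coin=0.1):
    # phi(1) = 0; phi(0) = phi(2) = pr(Head) = coin, as in Example 12.2
    phi = {0: coin, 1: 0.0, 2: coin}
    return sum(phi[x] * binom_pmf(x, 2, theta) for x in phi)
```

At θ = 0.5 the probability mass on {0, 2} is 0.5, so the rejection probability is 0.1 × 0.5 = 0.05, exactly the required size.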
Abstractly, a statistical hypothesis test is represented as a function φ(x)
such that 0 ≤ φ(x) ≤ 1. We reject H0 with probability φ(x) when X =
x. When φ(x) = 0 or 1 only, the sample space is neatly divided into the
critical region and its complement. Otherwise, the region of 0 < φ(x) < 1
is a randomization region. When x falls into that region, we randomize the
decision.
Defining a test by a function φ(x) is mathematically convenient. Note
that its size is

α = sup_{F∈H0} E{φ(X); F}

and its power function at F ∈ H1 is given by

β(F) = E{φ(X); F}.

The Type I error is defined for F ∈ H0 and given by

α(F) = E{φ(X); F}.
We do not place many restrictions on φ(x) for it to serve as a test function;
meaningful restrictions enter only through optimality definitions. We will come to this issue soon.
12.7 Three ways to characterize a test
Discussions in the previous sections have presented three ways to specify a hypothesis test
procedure.
1. Define a test statistic, T, such that we reject H0 when T is large.
Preferably, T has two specific properties: (a) a known sampling distribution
that is the same under whichever distribution in H0 is true; (b) a larger
observed value of T indicates a more extreme departure of F from H0 in the
direction we try to capture. We compute the p-value as

p = pr(T ≥ tobs; H0)

where tobs is the observed value. When T has a discrete distribution, we
may apply a continuity correction

p = pr(T > tobs; H0) + 0.5 pr(T = tobs; H0).

We reject H0 if p is below some pre-decided level, usually 5%.
There seems to be no universal and rigorous definition of the p-value.
A general requirement for p-value calculation is to have a test statistic
that takes larger values when H0 is violated. After that, one
identifies a most likely distribution of the data, say F̂, and computes

p = pr(T ≥ tobs; F̂)

in which F̂ is not regarded as random. This value is generally regarded
as the p-value of the test.
2. Define a critical region C in terms of the range of X. When the realized
value x ∈ C, we reject H0. The region C is often required to have a
given size α:

sup_{H0} pr(X ∈ C) = α.
3. When X is discrete, we may get into a situation where no critical region
has the pre-specified size α. This is not problematic in applications, but
is problematic for theoretical discussions. Hence, we define a test as
a function φ(x) taking values between 0 and 1. We reject H0 with
probability φ(x), where x is the realized/observed value of X. The size
of this test is calculated as sup_{H0} E{φ(X)}.
Method 1 is a special case of method 2, obtained by letting C = {x : T(x) > k} for
some k. Both methods 1 and 2 can be regarded as special cases of method
3 by letting φ(x) = 1(x ∈ C): we reject H0 with probability 1 when x ∈ C,
and do not reject H0 otherwise.
Clearly, a trivial test φ(x) = α has size α. Its existence ensures that a
test with any specific size between 0 and 1 is possible. The statistical issue
is finding one with better properties.
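To illustrate the p-value recipe of method 1 for a discrete statistic, here is a small sketch with a hypothetical statistic T ~ Binomial(n, 0.5) under H0 (the statistic and the numbers n = 10, tobs = 7 are illustration choices of mine, not from the notes):

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def corrected_p(t_obs, n, p0=0.5):
    # p = pr(T > t_obs; H0) + 0.5 * pr(T = t_obs; H0), with T ~ Binomial(n, p0)
    tail = sum(binom_pmf(x, n, p0) for x in range(t_obs + 1, n + 1))
    return tail + 0.5 * binom_pmf(t_obs, n, p0)
```

For n = 10 and tobs = 7 this gives (56 + 60)/1024 ≈ 0.113, so H0 would not be rejected at the 5% level.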
Suppose φ̃(x) is a test of size α̃ < α for a pair of hypotheses H0 and H1.
There must then exist a test φ(x) of size α such that

E{φ(X); F} ≥ E{φ̃(X); F}

for every F ∈ H1.

Chapter 12 questions
1. Suppose T(x) is a test statistic with the desired properties and g(·) is
a strictly monotone increasing function.
(a) Show that a test based on g(T ) is equivalent to a test based on T .
That is, two resultant tests will have the same rejection region when
the size of the test is set at the same level, α.
(b) What is the consequence of using T as a test statistic when its
distribution depends on F in the null hypothesis H0?
2. Let Xi, i = 1, 2, . . . , n be a set of i.i.d. observations from N(µ, σ²).
Consider the test problem for H0 : µ = σ versus H1 : µ > σ. Let X̄n
and s²n be the sample mean and variance. Define

Tn = X̄n/sn = {(X̄n − µ) + µ}/sn.
(a) Illustrate that the statistic Tn has the desired properties for the
purpose of statistical significance test.
(b) Suppose the observed value of Tn = t0. What is the p-value of the
test based on Tn in a probability expression?
3. Based on your life experience, make up a story to demonstrate the
difference between "statistical significance" and "scientific significance".
Try your best so that even a politician can understand your point.
4. Let (Xi, Yi), i = 1, 2, . . . , n be a set of i.i.d. bivariate observations with
joint probability density function

f(x, y; θ1, θ2) = {xy/(θ1²θ2²)} exp(−x/θ1 − y/θ2).
Consider the test problem for H0 : θ1 = θ2 versus H1 : θ1 > θ2.
Let X¯n and Y¯n be sample means and define Tn = log{X¯n} − log{Y¯n}.
(a) Illustrate that Tn has the desired properties for the purpose of
statistical significance test.
(b) Suppose the observed value of Tn = t0. What is the p-value of the
test based on Tn in a probability expression?
Remark: I look for an expression in the spirit of P (Tn ∈ [1, 2]).
(c) Bonus for giving a concrete justification (not mathematical details)
that X̄n/Ȳn has an F-distribution, and indicate the degrees of freedom.
5. (a) What is the difference between the type I error and the size of a
test?
(b) Consider the t-test for one-sided hypothesis under the normal model
as in Example 12.1. Let H0 : θ ≤ 0 and H1 : θ > 0. Suppose n = 20
and we reject H0 when T (X) > 2.0.
(i) What is the size of this test?
(ii) Plot the type I error of the test as a function of θ/σ.
(iii) Plot the power of the test as a function of θ/σ.
Remark: pick appropriate ranges in your plots.
6. Consider the case when we have i.i.d. observations X1, . . . , Xn from
N(θ, σ²). We wish to use the one-sample t-test for the one-sided hypothesis
H0 : θ ≤ 0 against H1 : θ > 0. We put the size of the test at α = 0.04.
(a) What is the minimum sample size n needed in order to have power 75%
at the distribution with θ = 0.24 and σ² = 1.44?
(b) If you double the sample size obtained in (a), what would be the
power of the test at the same distribution?
(c) With the sample size obtained in (a), what would be the power of
the test if θ = 0.24 and σ2 = 2.56?
7. Suppose we have a random sample X1, . . . , Xn from a continuous
distribution whose density function satisfies f(x) > 0 for all x. We wish to test
the hypothesis that the median of f(x) is m = 0 against the simple
alternative m = 1.
(a) Discuss why Tn, the number of observations larger than 0, is a
useful test statistic.
(b) Suppose n = 12 and Tn = 8. Calculate the p-value of the test.
(c) If the significance level is set at 5% and n = 12, what is the rejection
region of this test?
Remark: randomization has to be utilized to achieve “exact size” re-
quirement.
8. Suppose we have a random sample X1, . . . , Xn which is likely a sample
from a Gamma distribution with density function

f(x; θ) = (x/θ²) exp(−x/θ).
The parameter space is Θ = R+, and the range of the random variable
is also R+.
At the same time, it is suspected that the actual distribution might be
a Gamma mixture:

f(x; γ, θ1, θ2) = γ f(x; θ1) + (1 − γ) f(x; θ2)

for some γ ∈ (0, 1) and θ1 ≠ θ2.
(a) Denote µ = E(X) and σ² = var(X). Compute µ and σ² when
γ = 0, hence under the pure distribution f(x; θ), and find the function g such
that σ² = g(µ).
(b) Compute µ and σ² when 0 < γ < 1. Show that σ² > g(µ), where
g(·) is the function obtained in (a) and θ1 ≠ θ2.
(c) The result in (b) is helpful to motivate a test statistic for the null
hypothesis that the data is from a pure Gamma distribution, against
the alternative of Gamma mixture. Find such a statistic and give a
brief justification based on two desired properties of a test statistic.
Chapter 13
Uniformly most powerful test
Contemporary statistical education emphasizes teaching students effective
data analysis methods in a time efficient way. The success is measured in
terms of whether a student can quickly answer a statistical question raised
from applications. This sometimes is done at the cost of not knowing why the
statistical method actually answers the applied question other than it gives
an “answer”. Against this trend, in this course, we insist on discussing what
it means by effective data analysis methods. We preach that even though the
topic “uniformly most powerful test” itself is not important, the idea behind
this concept is.
Definition 13.1. Let φ(x) be a test of size α for a test problem with null and
alternative hypotheses H0 and H1. If for any size-α test φ1(x) and F ∈ H1,
we have
E{φ(X);F} ≥ E{φ1(X);F}
then φ(x) is a uniformly most powerful test.
Let us emphasize again that both H0 and H1 are subsets of a distribution
family. When the data X are from some F ∈ H1, we wish to have as high
a probability as possible of telling that its distribution F is not in H0. At the
same time, we do not do so at any cost: we require that, when F ∈ H0, H0
be retained with high probability. Since X is random by the nature of the game,
a perfect solution is unlikely.
The task of finding Uniformly Most Powerful (UMP) tests is often dif-
ficult or even impossible. Some may argue that such a result is not mean-
ingful/useful. I agree to a large degree. However, the idea behind UMP is
important and serves a good purpose. The knowledge we gain from such ex-
ercises helps us to develop sensible methods for general problems. There are
special cases where UMP tests exist. We therefore do not wish to completely
eliminate this concept from classroom discussions. Next, we work with the
simplest case.
13.1 Simple null and alternative hypothesis
When a null hypothesis is identified, the task of statistical significance test
is to see whether or not the data suggest a departure from the null models
in a specific direction. The simplest situation is where the statistical model
F contains only two distinct distributions. The null hypothesis contains
one and the alternative hypothesis contains the other. More specifically, we
may present them as two density functions (with respect to some σ-finite
measure):
H0 : f0(x), H1 : f1(x).
Note that if X represents a set of i.i.d. random variables, the above setting
still applies. We will use E1 and E0 for expectations under the alternative
and the null models when applicable.
Based on measure theory, for any two given distributions it is possible to
find a σ-finite measure with respect to which the density functions of both
distributions exist. This justifies the above general assumption.
Lemma 13.1. Neyman-Pearson Lemma: Consider the simple null and
alternative hypothesis test problem as specified.
(1) For any size α between 0 and 1, there exists a test φ and a constant
k such that
E0{φ(X)} = α (13.1)
and
φ(x) = 1 when f1(x) > kf0(x);
φ(x) = 0 when f1(x) < kf0(x). (13.2)
(2) If a test has the properties (13.1) and (13.2), then it is the most
powerful for testing H0 against H1.
(3) If φ is most powerful with size no more than α, then it satisfies (13.2)
for some k. It also satisfies (13.1) unless there exists a test of size smaller
than α and with power 1.
Proof and discussion.
Proof of (1):
A likelihood ratio test of size α exists. To prove the existence, let

α(t) = pr(f1(X) > tf0(X); H0).

It is a decreasing function of t. Hence, there exists a t0 such that

α(t0) ≤ α ≤ α(t0−).
Let

φ(x) = 1(f1(x) > t0f0(x)) + c·1(f1(x) = t0f0(x))

with

c = {α − α(t0)}/{α(t0−) − α(t0)}

if needed. Then this φ(x) is the test with the required properties.
Remark: The seemingly overly complex proof is caused by the need to
cover the discrete situation where pr{f1(X) = t0f0(X)} ≠ 0. Otherwise,
the claim is trivial.
Proof of (2):
Suppose φ(x) is the test given in (1), and φ̃ is another test of size α. Then

{φ(x) − φ̃(x)}{f1(x) − kf0(x)} ≥ 0.

This implies, by integrating both sides with respect to the measure under which the
densities are defined,

E1{φ(X) − φ̃(X)} ≥ kE0{φ(X) − φ̃(X)} = 0
where we used E1 and E0 for expectations under the alternative and the null
models. The right-hand side equals 0 because the two tests have the same size.
Hence, φ has power at least as high.
Proof of (3):
If φ̃(X) is also a most powerful test, then we should have

P[{φ(X) − φ̃(X)}{f1(X) − kf0(X)} > 0] = 0.

Otherwise, the derivation in the proof of (2) would imply that φ̃(X) has lower
power, in contradiction with the assumption that φ̃(X) is also most
powerful.
From

{φ(x) − φ̃(x)}{f1(x) − kf0(x)} = 0

with probability one, we conclude φ(x) = φ̃(x) for all x except when
f1(x) − kf0(x) = 0.
Hence, φ̃(x) also has form (13.2). ♦
This lemma claims that the most powerful test has to be the likelihood
ratio test. At the same time, the third part of the lemma leaves room for
non-uniqueness. This is due to the flexibility of making decisions on the set of
x such that f1(x)/f0(x) = k. A randomized test can be used to achieve
the exact size. It may also be possible to split this set in other
ways and obtain a non-randomized test with the right size. These tests
are all MP. Hence, the MP test is not necessarily unique.
What is the relevance of this classical, famous albeit somewhat obsolete lemma?
Introductory statistics courses generally recommend the one-sample t-test or z-test
for the zero-mean hypothesis under the normal model. Yet we usually do not
comment on why they are recommended. An important reason for the
z-test is that it is the uniformly most powerful test, as shown below.
Example 13.1. Let X = (X1, . . . , Xn) be a random sample from N(θ, 1).
Let us test H0 : θ = 0 against H1 : θ = 1.
By the Neyman–Pearson Lemma, the most powerful test has the form

φ(x) = 1{fn(x; θ = 1) > kfn(x; θ = 0)}
where I use fn for the n-variate density, and use θ = 1 and θ = 0 to highlight
the parameter values under the alternative and null hypotheses. The constant
k is to be chosen such that the test has the given size. Randomization is not
needed in this example because the density ratio, when regarded as a random
variable, has a continuous distribution.
Note that the critical region can be represented equivalently in many forms.
Clearly,

{fn(x; θ = 1) > kfn(x; θ = 0)}
= {log fn(x; θ = 1) > log fn(x; θ = 0) + log k}
= {−(1/2)∑_{i=1}^n (Xi − 1)² > −(1/2)∑_{i=1}^n Xi² + k′}
= {∑_{i=1}^n (Xi − 1)² < ∑_{i=1}^n Xi² − k′′}
= {−2∑_{i=1}^n Xi + n < −k′′}
= {∑_{i=1}^n Xi > k′′′}.

In other words, there exists a k′′′ such that

φ(x) = 1{fn(x; θ = 1) > kfn(x; θ = 0)} = 1{∑_{i=1}^n Xi > k′′′}.
Since all we care about is the size of the test, there is no need to find exactly how
k′′′ is related to k. We need only work out the critical value k′′′ in the last
step each time a size of the test is specified.
Suppose we want the test to have size α = 0.05. This requires us to pick a
specific value of k′′′. Because the null hypothesis in this example, under which
the size is computed, contains only a single distribution, we need only solve the
equation

P(∑_{i=1}^n Xi > c; θ = 0) = 0.05

which implies that c = 1.645√n is the solution. If we set α = 0.025, then
c = 1.960√n is the solution.
Suppose that, in addition to requiring the size of the test to be 0.05, we also want
the power of the test to be β(1) = 80%. This can be achieved by selecting an
appropriate sample size n:

P(∑_{i=1}^n Xi > 1.645√n; θ = 1) ≥ 0.8.

Because n is discrete, the problem should be interpreted as finding the smallest
n such that the power is at least 0.8.
When θ = 1, we have

P(∑_{i=1}^n Xi > 1.645√n; θ = 1) = P(n^{−1/2}∑_{i=1}^n (Xi − 1) > 1.645 − n^{1/2}; θ = 1)
= P(Z > 1.645 − n^{1/2})

with Z being a standard normal random variable. The 20% quantile of the
standard normal is −0.842. Thus, we require 1.645 − n^{1/2} ≤ −0.842, or
n ≥ (1.645 + 0.842)² = 6.18. Thus, n = 7 meets the requirement.
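The sample-size calculation above can be reproduced mechanically. A minimal sketch (function names are mine) using the standard normal c.d.f.:

```python
from statistics import NormalDist

z = NormalDist()
z_alpha = z.inv_cdf(0.95)  # about 1.645, for a size-0.05 one-sided test

def power(n, theta=1.0):
    # P(sum X_i > z_alpha * sqrt(n); theta) = P(Z > z_alpha - sqrt(n) * theta)
    return 1.0 - z.cdf(z_alpha - (n ** 0.5) * theta)

n = 1
while power(n) < 0.8:  # smallest n with power at least 80% at theta = 1
    n += 1
```

This reproduces n = 7, with power(7) ≈ 0.84 and power(6) ≈ 0.79.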
Remark: It is seen that if the alternative hypothesis H1 is replaced by θ = θ1
for any θ1 > 0, the most powerful test itself remains the same. That is, the
test is most powerful for any alternative such that θ1 > 0. In other words,
the above test is also a UMP test against H1 : θ > 0. However, to attain the
power of 80% at a different θ value such as θ1 = 0.5, the required sample size
will be higher.
Remark: It is easy to verify that the critical region of the most powerful
test when H1 becomes θ = θ1 < 0 has the form

∑ Xi < c.

Clearly, a most powerful test for H1 being θ > 0 cannot also be most
powerful for H1 being θ < 0. Hence, the notion of most powerful is in
general "alternative hypothesis" specific. It is often impossible to have a test
that is uniformly most powerful against a composite alternative hypothesis.
Here, composite means the alternative hypothesis F − F0 contains more
than a single distribution.
Remark: The point of the example is that the UMP test derived from the Neyman–Pearson
Lemma is the same test we generally recommend in other courses.
13.2 Making more from N-P lemma
N-P lemma is more relevant than it appears. Here is a helpful theorem to
make the future derivation simpler.
Theorem 13.1. Suppose that a test φ(X) of size α is most powerful for testing H0
against H̃1 : F = F1, for every F1 ∈ H1. Then it is UMP for H0 against H1.
Proof: Suppose φ̃(X) is another test of size α for testing H0 versus H1.
For any F1 ∈ H1, by the assumption on φ(X), we have

E{φ(X); F1} ≥ E{φ̃(X); F1}.
This trivially shows that φ(X) is UMP against H1. ♦
Example 13.2. Suppose X1, . . . , Xn is an i.i.d. sample from a Poisson
distribution with mean θ. We test H0 : θ ≤ 1 versus H1 : θ > 1 at the nominal level
α.
Consider testing H̃0 : θ = 1 versus H̃1 : θ = 2. The likelihood
ratio satisfies f(x; 2)/f(x; 1) ∝ exp{(log 2)∑ xi}. By the Neyman–Pearson Lemma, a
most powerful test has the form

φ(x) = 1 when ∑ xi > k;  c when ∑ xi = k;  0 when ∑ xi < k

for some k and c chosen to make the size of the test equal α. That is, they are
chosen so that

E{φ(X); θ = 1} = α.
Thus, the choice of k and c does not depend on H̃1. Hence, the test is UMP for
H̃0 versus H1 : θ > 1.
Next, we hope to retain the same proposition with H̃0 replaced by H0.
It is clear that E{φ(X); θ} < α when θ < 1. Hence, φ(X) remains a size-α
test for H0. Therefore, there cannot be any other test of size α having
greater power at any θ > 1.
The above result is more generally applicable. Note that allowing
randomization makes the discussion under the Poisson model smoother. We do not
recommend this type of randomization in applications.
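The constants k and c in Example 13.2 can be computed from the distribution of S = ∑Xi under θ = 1, namely S ~ Poisson(n). A sketch assuming this setup (the function names and the choice n = 5 are mine):

```python
from math import exp, factorial

def pois_pmf(s, lam):
    return exp(-lam) * lam ** s / factorial(s)

def ump_constants(n, alpha):
    # Under theta = 1, S = sum X_i ~ Poisson(n).  Find the smallest k with
    # pr(S > k) <= alpha, then c so that pr(S > k) + c * pr(S = k) = alpha.
    lam = float(n)
    k, tail = 0, 1.0 - pois_pmf(0, lam)
    while tail > alpha:
        k += 1
        tail -= pois_pmf(k, lam)
    c = (alpha - tail) / pois_pmf(k, lam)
    return k, c, tail

k, c, tail = ump_constants(n=5, alpha=0.05)
```

For n = 5 and α = 0.05 this gives k = 9 and c ≈ 0.50: reject outright when ∑xi > 9, and with probability about 0.50 when ∑xi = 9, attaining size 0.05 exactly.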
13.3 Monotone likelihood ratio
In two previous examples, we started with searching for a most powerful test
for simple H0 and H1. In the end, however, the test is found uniformly most
powerful for some composite null and alternative hypotheses. This is because
the distributions in the statistical model are parameterized in a monotonic
way. The following definition provides a specific terminology for distribution
families with such a property.
Definition 13.2. Suppose that the distribution of X belongs to a parameter
family with density functions {f(x; θ) : θ ∈ Θ ⊂ R}.
The family is said to have monotone likelihood ratio in T (x) if and only
if, for any θ1 < θ2,
f(x; θ2)
f(x; θ1)
is a nondecreasing function of T (x) for values x at which at least one of
f(x; θ1) and f(x; θ2) is positive.
It is seen that T(X) is a useful statistic for the purpose of hypothesis testing
because it is stochastically increasing in θ.
Lemma 13.2. Monotonicity of E{T(X)}. Suppose X has a distribution
from a family with monotone likelihood ratio in T(x). Then E{T(X); θ} is
nondecreasing in θ.
Proof: When θ2 > θ1, the ratio

f(x; θ2)/f(x; θ1)

is non-decreasing in T(x). Hence the two random variables T(X) and f(X; θ2)/f(X; θ1)
are positively correlated when the distribution of X is f(x; θ1). Let µ1 =
E1{T(X)}, the expectation under θ1. We have

E1{[T(X) − µ1] f(X; θ2)/f(X; θ1)} ≥ 0.

Since E1{f(X; θ2)/f(X; θ1)} = 1 and E1{T(X) f(X; θ2)/f(X; θ1)} = E2{T(X)},
expanding this inequality gives E2{T(X)} ≥ E1{T(X)}, which is the conclusion. ♦
Extension: This conclusion applies to any nondecreasing function g(T).
Example 13.3. The one parameter exponential family

f(x; θ) = exp(η(θ)T(x) − ξ(θ))h(x)

has monotone likelihood ratio in T(x) when η(θ) is a nondecreasing function
of θ.
The result remains true for the joint distribution of an i.i.d. sample of such
observations.
It will be seen that UMP tests exist for one parameter exponential
families with the above property. It is helpful to remember that most
commonly employed one parameter distribution families are one parameter
exponential families; the result to be shown in the next theorem therefore applies broadly.
Before introducing the theorem, we point out another monotone
likelihood ratio family. This one is more of theoretical interest.
Example 13.4. Let X1, . . . , Xn be an iid sample from
f(x; θ) = θ−11(0 < x < θ).
Then the distribution family of X = (X1, . . . , Xn) has monotone likelihood
ratio in X(n), the largest order statistic.
Theorem 13.2. Suppose the distribution of X is in a parametric family with
real valued parameter θ and has monotone likelihood ratio in T (X).
Consider H0 : θ ≤ θ0 and H1 : θ > θ0.
(i) There exists a UMP test of size α, given by

φ(X) = 1 when T(X) > k;  c when T(X) = k;  0 when T(X) < k.
(ii) For any θ < θ0, φ(X) minimizes the Type I error α(θ) among all tests φ̃
such that E{φ̃(X); θ0} = α.
Proof
(i) By the Neyman–Pearson lemma, this test is one of the most powerful tests
for H̃0 : θ = θ0 against H̃1 : θ = θ1 for any θ1 > θ0, because the density ratio
is an increasing function of T. Hence, φ(X) is uniformly most powerful for
H̃0 against H1 : θ > θ0.
By the lemma on the monotonicity of E{T(X)}, E{φ(X); θ} is a nondecreasing
function of θ. Therefore E{φ(X); θ} ≤ α for all θ ∈ H0. Thus, φ(X) is a
size-α test for H0 versus H1. Subsequently, it is UMP for H0 versus H1 by the
extended N-P lemma.
(ii) Let us define ξ = −θ so that we have the density function

g(x; ξ) = f(x; −θ).

In terms of ξ, the family has monotone likelihood ratio in T̃(x) = −T(x).
Consider testing H∗0 : ξ ≤ ξ0 = −θ0 versus H∗1 : ξ > ξ0 = −θ0 with
size α∗ = 1 − α.
Hence, the UMP tests have the following form:

φ∗(X) = 1 − φ(X) = 1 when T(X) < k;  1 − c when T(X) = k;  0 when T(X) > k

with k and c chosen such that the test has size α∗. We remark here that
the middle part is not unique, but this does not invalidate our claim. Having the
power of this uniformly most powerful test maximized is the same as having

E{φ(X); θ} = 1 − E{φ∗(X); ξ}

minimized when ξ ∈ H∗1, which is the same as θ ∈ H0. This completes the
proof. ♦
Example 13.5. Uniform distribution Let X1, . . . , Xn be a random sample
from the uniform distribution on (0, θ). Then the distribution family of X =
(X1, . . . , Xn) has monotone likelihood ratio in X(n).
For any θ1 < θ2, the density ratio is

f(x; θ2)/f(x; θ1) = (θ1/θ2)^n {1(0 < x(n) < θ2)}/{1(0 < x(n) < θ1)}.

Other than the constant factor (θ1/θ2)^n, the ratio takes three values: 1, ∞,
and undefined. The last case does not matter, as the case of both densities
being zero is excluded in the definition. This ratio is clearly an increasing
function of x(n).
Consider the hypotheses H0 : θ ≤ θ0 and H1 : θ > θ0. By the theorem we
have just proved, the UMP test can be written as

φ(X) = 1 when X(n) > k;  c when X(n) = k;  0 when X(n) < k

for some k and c.
Because the distribution of X(n) is continuous, we have P(X(n) = k) = 0
for any k. Hence, the test can be simplified to

φ(X) = 1 when X(n) > k;  0 when X(n) < k.

The c.d.f. of X(n) under the null is (x/θ0)^n for 0 < x < θ0. Hence, the
choice of k is determined by

α = 1 − (k/θ0)^n

and k = θ0(1 − α)^{1/n} is the solution.
The power at θ > θ0 is

β(θ) = 1 − (1 − α)(θ0/θ)^n.
Remark: The UMP test is not unique, as the density ratio is a discrete random
variable.
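Example 13.5 is easy to verify numerically. The sketch below (function names are mine) computes k = θ0(1 − α)^{1/n}, evaluates the power formula, and cross-checks it by simulation:

```python
import random

def critical_value(theta0, n, alpha):
    # k = theta0 * (1 - alpha)**(1/n), so that pr(X_(n) > k; theta0) = alpha
    return theta0 * (1.0 - alpha) ** (1.0 / n)

def power(theta, theta0, n, alpha):
    # beta(theta) = 1 - (k/theta)**n = 1 - (1 - alpha)*(theta0/theta)**n,
    # valid for theta >= theta0
    return 1.0 - (critical_value(theta0, n, alpha) / theta) ** n

def mc_power(theta, theta0, n, alpha, n_sim=20000, seed=2):
    # Monte Carlo estimate of pr(X_(n) > k; theta) for Uniform(0, theta) samples
    rng = random.Random(seed)
    k = critical_value(theta0, n, alpha)
    rej = sum(max(rng.uniform(0.0, theta) for _ in range(n)) > k
              for _ in range(n_sim))
    return rej / n_sim
```

At θ = θ0 the power formula reduces to α, the size of the test; away from θ0 the Monte Carlo estimate agrees with β(θ) up to simulation error.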
13.4 Assignment problems
1. Let X1, . . . , Xn be an i.i.d. sample from N(1, θ = σ²). Find a UMP
test of size α for testing H0 : θ ≤ θ0 versus H1 : θ > θ0.
(a) When n = 10, θ0 = 3, and α = 0.05, obtain the power function β(θ)
over an interval (3, 10).
Hint: take a grid of 100 points in this interval and get numerical values
using R. Plot this function.
(b) Use Monte Carlo simulation to verify the values β(3.5) and β(5).
Give a thoughtful justification on the size of your simulation.
2. Let X1, . . . , Xn be a random sample from a distribution with density
function f(x; θ) = 2θ^{−2}x 1(0 < x < θ) and parameter space R+. The
required size of the test is α in the following questions.
(a) Work out the rejection region of a most powerful test for θ = 1
against θ = 1.03.
(b) Work out the rejection region of the most powerful test for θ = 1
against θ = 0.95.
(c) Use computer simulation to check the precision of the type I errors
of both, when we set the size α = 0.08 and the sample size is n = 20.
Be thoughtful on the size of your simulation.
(d) Compute the power of two tests and use simulation to verify these
values.
3. Let X1, X2, X3 be three independent random variables with respective
density functions

f1(x; θ) = (1/θ) exp(−x/θ);
f2(x; θ) = (x/θ²) exp(−x/θ);
f3(x; θ) = (x²/(2θ³)) exp(−x/θ)

and the shared parameter θ, with parameter space Θ = R+.
(a) Show that the distribution of (X1, X2, X3) belongs to a distribution
family with monotone likelihood ratio in T (X) = X1 +X2 +X3.
(b) Verify that Eθ{T (X)} is monotone in θ.
(c) Construct a uniformly most powerful test of size α ∈ (0, 1) for
H0 : θ ≤ 3 against H1 : θ > 3. This should be done by (i) first giving the
format; (ii) then specifying the constant for the rejection region.
Remark: For the famous z-test, the rejection region is of the form
|T| > z1−α/2, with z1−α/2 being the solution to

∫_{−∞}^{z} (2π)^{−1/2} exp(−t²/2) dt = 1 − α/2.

In other words, z1−α/2 is the (1 − α/2) quantile of the standard normal.
4. Let X1, X2, . . . , Xn be an i.i.d. sample from a negative binomial
distribution family with parameter θ and a known constant m (which is a
non-negative integer):

pr(X = x) = C(m + x − 1, x)(1 − θ)^x θ^m = c(m, x)(1 − θ)^x θ^m

for x = 0, 1, . . . and the parameter space Θ = (0, 1).
(a) Show that the joint distribution is in a monotone likelihood ratio
family in Tn = −X¯n, the negative of the sample mean.
(b) In view of (a), specify the uniformly most powerful test for H0 :
θ ≤ 0.5 against H1 : θ > 0.5 and put the size α = 0.05.
(c) Describe a physical experiment leading to the negative binomial
distribution with brief explanation.
Chapter 14
Pushing Neyman–Pearson
Lemma Further
The famous Neyman–Pearson Lemma is established for two simple hypotheses.
We have seen its generalization to the situation where the alternative
hypothesis is made of an interval of parameter values and the data are from
a monotone likelihood ratio family. The null hypothesis can also be extended
so that the resulting test is UMP: uniformly most powerful. The main purpose
of this chapter is to develop tools to cover two-sided alternative hypotheses.
We start with a statistically not so meaningful result. It will be the basis for
something statistically meaningful.
Theorem 14.1. Consider the situation where H0 = {f1, f2} and H1 = {f3}.
Let α1, α2 be constants taking values between 0 and 1. We use Ej to denote
expectation operation when the data X has distribution fj, j = 1, 2, 3.
Let T be the class of tests such that Ej{φ(X)} ≤ αj, j = 1, 2. More
formally,

T = {φ(·) : Ej{φ(X)} ≤ αj, j = 1, 2}.

Let T0 be the subset of T with the above inequalities replaced by equalities.
Namely,

T0 = {φ(·) : Ej{φ(X)} = αj, j = 1, 2}.
Suppose there are constants k1 and k2 such that

φ∗(x) = 1 when f3(x) > k1f1(x) + k2f2(x);
φ∗(x) = 0 when f3(x) < k1f1(x) + k2f2(x) (14.1)

is a member of T0.
We have two conclusions:
(i) E3{φ∗(X)} ≥ E3{φ(X)} for any φ(x) ∈ T0.
(ii) If both k1 ≥ 0 and k2 ≥ 0, then E3{φ∗(X)} ≥ E3{φ(X)} for any
φ(x) ∈ T .
Proof
(i) Consider the function
{φ∗(x) − φ(x)}{f3(x) − (k1f1(x) + k2f2(x))},
which is non-negative for all x. If both φ∗(x), φ(x) ∈ T0, integrating this
function immediately gives E3{φ∗(X)} ≥ E3{φ(X)}.
(ii) If φ∗(x) ∈ T0, then E1{φ∗(X)} = α1 and E2{φ∗(X)} = α2. When
φ(x) ∈ T merely, we have E1{φ(X)} ≤ α1 and E2{φ(X)} ≤ α2 by definition.
Integrating {φ∗(x) − φ(x)}{f3(x) − (k1f1(x) + k2f2(x))} with respect to the
corresponding σ-finite measure, we find
E3{φ∗(X)} − E3{φ(X)} ≥ k1[α1 − E1{φ(X)}] + k2[α2 − E2{φ(X)}] ≥ 0,
where we have used the condition that both k1 and k2 are non-negative.
Hence, the conclusion is verified.
One should read the result this way. The most powerful test of a specific
size becomes hard to obtain when H0 is composite; there may be none.
We have only managed to produce a result in a still very simple situation.
This proposition can be generalized slightly to the situation where H0 consists
of a finite number of density functions. Another shortcoming of this result is
that it is hard to determine whether such k1 and k2 exist. Answering this
question is technically involved, so we only state the following result for
reference.
Theorem 14.2. Let f1, f2, f3 be three density functions with respect to the
same σ-finite measure. The following two conclusions are true:
(a) The set M = {(E1{φ(X)},E2{φ(X)}) : φ is a test} is convex and
closed.
(b) If (α1, α2) is an interior point of M, then there exist constants k1, k2
and a test φ∗(x) of the form (14.1) with type I errors α1 and α2 at f1 and f2.
Discussion: The N–P lemma gives us a UMP test when both H0 and H1 contain
a single distribution. We have previously generalized the N–P lemma to the
situation where H1 contains many distributions. Theorem 14.1 expands the
N–P lemma a bit further: it allows H0 to contain two distributions when H1
contains only a single distribution.
When H0 is given in the form 1 ≤ θ ≤ 2, say under an N(θ, 1) model
assumption, the distributions in H0 that matter are the ones with θ = 1 and
θ = 2. Here by “matter”, we mean the type I errors and the size of a good
test are determined by these two distributions. Once a UMP test is obtained
for H̃0 : {θ = 1, θ = 2}, this test is likely also a UMP test for H0 itself.
See Lehmann (Vol. II, p. 96) for details.
14.1 One parameter exponential family
The generalized N–P lemma finds its targeted application in problems related
to the one-parameter exponential family.
Theorem 14.3. Suppose we have an i.i.d. sample x1, x2, . . . , xn from a one-parameter
exponential family with density function given by
f(x; θ) = exp{θY (x) − A(θ)}h(x).
This family is a monotone density ratio family in Tn(x) = Σ Y (xi).
Suppose we want to test H0 : θ ∉ (θ1, θ2) versus H1 : θ ∈ (θ1, θ2) for
some θ1 ≠ θ2.
(i) A UMP test of size α is given by
φ(T) =
  1,  k1 < Tn(x) < k2;
  cj, Tn(x) = kj, j = 1, 2;
  0,  Tn(x) < k1 or Tn(x) > k2,
where k1, k2, c1, c2 are chosen such that
E{φ(X); θj} = α, j = 1, 2.
(Note 0 < c1, c2 < 1.)
(ii) The test given in (i) minimizes the type I error at every θ ∈ H0 among
the tests satisfying E{φ(T); θj} = α, j = 1, 2.
Proof: Since Tn(x) = Σ Y (xi) is sufficient for θ, we need only work with
tests defined as functions of Tn(x); otherwise, E{φ(X)|T} is a test with the
same size and power function.
(i) We first work on a UMP test for H̃0 : {θ1, θ2} versus H̃1 : {θ3}
for some θ3 ∈ (θ1, θ2). Note the structure: the alternative model is a single
distribution within the interval, while the null model consists of two distributions
at the two ends.
According to the generalized Neyman–Pearson lemma (Theorem 14.1),
such a UMP test may exist.
For any test φ(T), we denote its rejection probability by β(θ; φ) = E{φ(T); θ}.
A candidate test with the UMP property is
φ(T) =
  1,  f(x; θ3) > k1f(x; θ1) + k2f(x; θ2);
  c,  f(x; θ3) = k1f(x; θ1) + k2f(x; θ2);
  0,  f(x; θ3) < k1f(x; θ1) + k2f(x; θ2).
We do not elaborate but assume the existence of c, k1 and k2 such that
β(θ1; φ) = β(θ2; φ) = α.
The inequality
f(x; θ3) > k1f(x; θ1) + k2f(x; θ2)
used in defining the above φ(T) under the exponential family can be written
as
a1 exp(b1T) + a2 exp(b2T) < 1
for some constants a1, a2, b1 and b2. Due to the relative sizes of θ1, θ2 and θ3,
we must have b1b2 < 0. A careful discussion of the signs of a1 and a2 follows:
(1) If both a1 and a2 are negative, then the inequality holds with probability
1. That is, the size of the test would be 1. This is disallowed.
(2) If a1 ≤ 0 but a2 > 0 (or, symmetrically, a1 > 0 but a2 ≤ 0), then
together with the fact that b1b2 < 0, the function a1 exp(b1T) + a2 exp(b2T)
is monotone in T. That is, the inequality
a1 exp(b1T) + a2 exp(b2T) < 1
is equivalent to T < t or T > t for some constant t. If so, the
rejection probability β(θ; φ) would be a monotone function of θ. This
contradicts β(θ1; φ) = β(θ2; φ) = α.
(3) The only choice left is a1 > 0 and a2 > 0. Note that a1 exp(b1T) +
a2 exp(b2T) is now convex in T. The inequality
a1 exp(b1T) + a2 exp(b2T) < 1
is then equivalent to one of the form
k1 < T < k2
for another set of constants k1 and k2.
In summary, our discussion leads to the conclusion that the test is to reject
H0 when k1 < T < k2. This is in good agreement with our intuition. Based
on the generalized Neyman–Pearson lemma together with a1 > 0 and a2 > 0,
this φ(T) is UMP for testing H̃0 : {θ1, θ2} versus H̃1 : {θ3}.
Because this φ(T) does not depend on the specific choice of θ3, the UMP
conclusion extends to H̃0 : {θ1, θ2} versus H1.
To get the full generality that φ(T) is UMP for testing H0 : θ ∉ (θ1, θ2)
versus H1, we only need to verify that
β(θ; φ) ≤ α
at every θ ∉ [θ1, θ2]. This is true due to the concavity of the β function; the
following argument makes it precise.
Consider the testing problem with H̃0 : {θ1, θ2} against H̃1 : {θ3} for some
θ3 in the original H0, and consider the test φ∗(T) = 1 − φ(T). It can be verified
(similarly to what has been done) that this φ∗(T) has the form specified in the
generalized N–P lemma. Therefore, φ∗(T) has the best power at θ3 among
those with β(θ1) = 1 − α and β(θ2) = 1 − α. This implies that φ(T) has the
lowest possible type I error at θ3, which makes it at least as low as α.
(ii) This has been proved in the step above. ♦
Remark: The result itself is mathematically interesting, but it is awkward to
come up with an applied situation where it would be used directly; its usefulness
will be seen in the next section. For students interested in mathematical
techniques, the proof is a good source of mathematical insight.
The result is stated for a one-parameter exponential family of a specific
form. A general one-parameter exponential family can usually be brought into
this form by applying a monotone transformation to the parameter. Hence,
the conclusion is more general than it appears.
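To make Theorem 14.3 concrete, the constants k1 and k2 can be found numerically once the null distribution of Tn is known. The sketch below is in Python with scipy (the notes suggest R for such computations) and uses an assumed setting, not one from the text: i.i.d. Exponential observations with mean θ, so Tn = ΣXi has a Gamma(n, θ) distribution. Since Tn is continuous, c1 and c2 are irrelevant, and the boundary values θ1 = 1, θ2 = 2 with n = 10 are illustrative choices only.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.stats import gamma

# Assumed setting: X_i ~ Exponential(mean theta), so T = sum X_i ~ Gamma(n, scale=theta).
n, alpha = 10, 0.05
theta1, theta2 = 1.0, 2.0  # boundary values of the null H0: theta not in (theta1, theta2)

def size_equations(k):
    # Rejection probability P(k1 < T < k2; theta) must equal alpha at both boundaries.
    k1, k2 = k
    return [gamma.cdf(k2, n, scale=th) - gamma.cdf(k1, n, scale=th) - alpha
            for th in (theta1, theta2)]

# Initial guess placed in the overlap region of the two null densities of T
k1, k2 = fsolve(size_equations, x0=[13.0, 14.0])
print(round(k1, 3), round(k2, 3))
```

The two size constraints pin down the rejection interval (k1, k2) uniquely, in line with part (i) of the theorem.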
14.2 Two-sided alternatives
Consider the null hypothesis H0 : θ = 1 against the alternative H1 : θ ≠ 1
given observations from an exponential distribution with mean θ. Let us separate
H1 into H11 : θ > 1 and H12 : θ < 1. An alternative of the form H1 is called
two-sided. Assume the size of the test is required to be α. We now work
out the test in the situation where i.i.d. observations from the exponential
distribution family f(x; θ) = (1/θ) exp(−x/θ) are provided.
The UMP test for H0 versus H11, according to the discussion in the last chapter,
is given by
φ1(x) = 1(Σ xi > k1),
with k1 chosen so that E0{φ1(X)} = α.
The UMP test for H0 versus H12, for the same reason, is given by
φ2(x) = 1(Σ xi < k2),
with k2 chosen so that E0{φ2(X)} = α.
Suppose a UMP test φ(x) exists for H0 versus H1. This test remains
UMP for H0 versus H11. Hence, we must have φ(x) = φ1(x) except on a
set of x of measure zero. For the same reason, we must also have φ(x) = φ2(x)
except on a set of measure zero. Such a φ(x) clearly does not exist. Hence,
there exists no UMP test for this problem.
This phenomenon is not restricted to the exponential distribution but holds in
general. Although there is no UMP test for two-sided alternatives, we may
provide a sensible test based on the idea of a “pure significance test”. Define
Tn = max{x̄, 1/x̄}.
A large value of Tn (a large deviation from 1) is a good indication that θ = 1 is
violated. Thus, we may compute
p0 = P(Tn ≥ tobs; θ = 1)
as the p-value and reject the null hypothesis θ = 1 when, say, p0 < 0.05.
We may agree that this is a sensible test. However, we cannot help asking
whether this is the best we can do. Furthermore, in what sense is this test
best? We could instead have defined the test statistic as
T′n = max{x̄, 2/x̄}.
A test based on T′n has the same properties.
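The p-value p0 of the pure significance test based on Tn is easy to approximate by simulating the null distribution of x̄ under θ = 1. A minimal Python sketch follows (the notes use R); the observed sample mean 1.4 and the sample size n = 20 are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(2020)
n, reps = 20, 50_000
xbar_obs = 1.4                      # hypothetical observed sample mean (assumed value)
t_obs = max(xbar_obs, 1 / xbar_obs)

# Null distribution of Tn = max(xbar, 1/xbar) under Exponential with theta = 1
xbar_null = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
tn = np.maximum(xbar_null, 1 / xbar_null)

p0 = np.mean(tn >= t_obs)           # simulated p-value
print(p0)
```

Rejecting when p0 falls below the chosen level reproduces the pure significance test described above.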
In some situations, it is possible to set up a useful standard. This is our
next topic.
14.3 Unbiased test
A great person is not necessarily the best in every respect within a large
population; he or she might be the best in a small community or in a specific
aspect. We may be disappointed that in many, or nearly all, realistic
situations, no UMP tests exist. However, we may look for the best test among
those that behave reasonably and outperform others in a narrower sense.
These words are meant to motivate the search for sensible tests that are
optimal within a restricted class. In this section, we compare tests that are
unbiased in the sense of the following definition.
Definition 14.1. Consider the problem of testing a null hypothesis H0
against an alternative hypothesis H1 based on data X. A test φ(X) of size α
is unbiased if
sup_{F∈H0} E{φ(X); F} ≤ α  and  inf_{F∈H1} E{φ(X); F} ≥ α.
Justification of unbiasedness: in court, every guilty party should be more
likely to be sent to prison than every innocent party. Be aware of the
wording: merely more likely.
One can easily show that unbiased tests always exist for any pair of null
and alternative hypotheses. Be aware that most tests proposed in the literature
under complex models are not unbiased. Yet it does not hurt to search for
the best test within the unbiased class.
Definition 14.2. Consider the problem of testing a null hypothesis H0
against an alternative hypothesis H1 based on data X. If a test is most
powerful at every F ∈ H1 within the class of unbiased tests of size α, we say
it is a uniformly most powerful unbiased (UMPU) test of size α.
14.3.1 Existence of UMPU tests
The notion of unbiasedness is helpful in some typical situations. We only
discuss this topic for a one-parameter exponential family with density function
given by
f(x; θ) = exp{θY (x) − A(θ)}h(x).
This family has a monotone density ratio in T = Σ Y (xi). Of course, we also
know that T is complete and sufficient for θ. The above parameterization is
the natural one.
Theorem 14.4. Suppose we want to test H0 : θ ∈ [θ1, θ2] versus H1 : θ ∉
[θ1, θ2] for some θ1 ≠ θ2.
A UMPU test of size α is given by
φ(T) =
  1,  T < k1 or T > k2;
  cj, T = kj, j = 1, 2;
  0,  k1 < T < k2,
where k1, k2, c1, c2 are chosen such that
E{φ(T); θj} = α, j = 1, 2.
(Note 0 < c1, c2 < 1.)
Proof: A good test should clearly be based on T, as it is complete and
sufficient for θ; thus, we will not look into other possibilities.
According to Theorem 14.3 proved earlier, 1 − φ(T) for the φ(T) defined
above is a UMP test for H̃0 : θ ∉ [θ1, θ2] versus H̃1 : θ ∈ [θ1, θ2] of size α̃ = 1 − α.
We state a side proposition without proof here: under an exponential family,
E{φ(T); θ} is continuous in θ for any test φ(T). Because of this proposition,
if φ(T) is an unbiased test for H0 versus H1, we must have
E{φ(T); θj} = α, j = 1, 2.
If another unbiased test φ∗(T ) is of size α for H0 versus H1 but has higher
power at some θ3 ∈ H1, we would have
E{φ∗(T ); θ1} = E{φ∗(T ); θ2} = α
and
E{φ∗(T ); θ3} > E{φ(T ); θ3}.
In terms of H̃0 and H̃1, we obtain a pair of tests 1 − φ∗(T) and 1 − φ(T), both
unbiased and of size 1 − α, but the type I error of 1 − φ∗(T) is lower than that
of 1 − φ(T) at θ3 ∈ H̃0. This contradicts the UMP result in Theorem 14.3 (ii)
given earlier. ♦
UMPU when θ1 = θ2. Theorem 14.4 is not directly applicable to the
situation where θ1 = θ2. Direct application is not theoretically justified and
also leads to some difficulties: a test of the same form would require us to
select k1, k2, c1, c2 such that
E{φ(T); θ} = α
at θ = θ1. Ignoring the “continuity correction” step of choosing the constants
c1 and c2, we would have many choices of k1 and k2 satisfying a single
constraint like this one.
The solution comes from the following consideration. Let us apply the
theorem to the situation where θ2 = θ1 + δ and let δ ↓ 0. Clearly, we would
have
lim_{δ↓0} [E{φ(T); θ2} − E{φ(T); θ1}]/δ = 0.
This implies, in the context of the one-parameter exponential family,
E{Tφ(T); θ1} = αE{T; θ1}.
Hence, a UMPU test for H0 : θ = θ0 against H1 : θ ≠ θ0 of size α is given by
φ(T) =
  1,  T < k1 or T > k2;
  cj, T = kj, j = 1, 2;
  0,  k1 < T < k2,
where k1, k2, c1, c2 are chosen such that
E{φ(T); θ0} = α,  E{Tφ(T); θ0} = αE{T; θ0}.
To implement this procedure, one may resort to numerical approximations
to find these constants. Constants c1, c2 serve the purpose of ensuring the
equality requirements are met exactly. They have little relevance in terms of
statistical practice.
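When T is continuous, the constants c1, c2 drop out and k1, k2 can be solved numerically from the two constraints. As a sketch (Python with scipy; an assumed setting, not taken from the notes), consider i.i.d. N(θ, 1) data, so T = ΣXi ~ N(nθ0, n) under H0 : θ = θ0. By symmetry the solution should reduce to the familiar equal-tailed z-test, which the computation confirms.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve
from scipy.stats import norm

# Assumed setting: X_i ~ N(theta, 1), so T = sum X_i ~ N(n*theta0, n) under H0.
n, alpha, theta0 = 9, 0.05, 0.0
mu, sd = n * theta0, np.sqrt(n)

def constraints(k):
    k1, k2 = k
    # E{phi(T); theta0}: probability of the rejection region {T < k1 or T > k2}
    size = norm.cdf(k1, mu, sd) + norm.sf(k2, mu, sd)
    # E{T phi(T); theta0}: integral of t*density over the rejection region
    et_phi = (quad(lambda t: t * norm.pdf(t, mu, sd), -np.inf, k1)[0]
              + quad(lambda t: t * norm.pdf(t, mu, sd), k2, np.inf)[0])
    return [size - alpha, et_phi - alpha * mu]

k1, k2 = fsolve(constraints, x0=[mu - 2 * sd, mu + 2 * sd])
```

With θ0 = 0 the solver returns cutoffs close to ∓1.96·√n, the usual two-sided normal critical values.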
14.4 UMPU for normal models
The normal distribution has two parameters. Thus, what we have discussed
so far does not even allow us to show the optimality of the most famous t-test.
We will pick this topic up later.
14.5 Assignment problems
1. Let X be a sample of size n = 1 from a distribution with density
function
f(x; θ) = 3θ^(−3)(θ − x)² 1(0 < x < θ)
Let H0 : θ ≤ θ0 and H1 : θ > θ0.
(a) Verify that this distribution family has a monotone likelihood ratio
in x.
(b) Derive the UMP test of size α ∈ (0, 0.5). That is, specify the critical
region or give the expression of the decision function φ(x).
(c) Prove that a UMP test is always an unbiased test.
(d) Analytically verify that the test given in (b) is unbiased.
2. Let X1, . . . , Xn be an i.i.d. sample from Poisson distribution with mean
parameter θ.
(a) Specify the analytical form of the UMPU test of size α for testing
H0 : θ = 1 versus H1 : θ ≠ 1.
(b) Use R to get the critical region numerically when n = 9 and α = 0.1.
(c) A conventional test, rather than the UMPU test, would select k1, k2,
c1, c2 such that
φ(x) =
  1,  Σxi ∉ (k1, k2);
  c1, Σxi = k1;
  c2, Σxi = k2;
  0,  Σxi ∈ (k1, k2),
and such that
pr(ΣXi < k1) + c1 pr(ΣXi = k1) = α/2,
pr(ΣXi > k2) + c2 pr(ΣXi = k2) = α/2.
Use R to get the critical region numerically when n = 9 and α = 0.1
for this test.
(d) Use computer simulation to confirm (or cast doubt on) the claim that
the test in (b) is superior to the test given in (c).
3. Let X1, . . . , Xn be an i.i.d. sample from a distribution with density
function given by
f(x; θ) = 2θ^(−2) x exp(−x²/θ²) 1(x ≥ 0)
with parameter space Θ = (0,∞).
(a) Derive the UMP test for H0 : θ ∉ (0.9, 1.2) versus H1 : θ ∈ (0.9, 1.2)
when n = 10 and size α = 0.08.
Hint: Use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.5 and the type II
error at θ = 1.1.
4. Let X1, . . . , Xn be an iid sample from a distribution with density func-
tion given by
f(x; θ) = 2θ^(−2) x exp(−x²/θ²) 1(x ≥ 0)
with parameter space Θ = (0,∞).
(a) Obtain the rejection region of the UMPU test for H0 : θ ∈ [1, 2] versus
H1 : θ ∉ [1, 2] when n = 10.
Hint: Use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.5 and the type II
error at θ = 2.5.
5. Let X1, . . . , Xn be an i.i.d. sample from a distribution with density
function given by
f(x; θ) = 2θ^(−2) x exp(−x²/θ²) 1(x ≥ 0)
with parameter space Θ = (0,∞).
(a) Obtain the rejection region of the UMPU test for H0 : θ ∈ [0.9, 1.2]
versus H1 : θ ∉ [0.9, 1.2] when n = 10 with α = 0.08.
Hint: Use R to get the critical region numerically.
(b) Following (a), compute the type I error at θ = 1.1 and the type II
error at θ = 1.5.
6. Let X1, . . . , Xn be a random sample from a N(µ0, σ² = θ) distribution,
where 0 < θ < ∞ and µ0 is known.
(a) Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0
can be based upon the statistic W = Σ_{i=1}^n (Xi − µ0)²/θ0.
(b) State the null distribution of W .
(c) Give an explicit rejection rule based on φ(W ) and describe how to
get the constants needed in φ(W ).
(d) For n = 9, obtain these constants numerically for α = 0.05.
Chapter 15
Locally most powerful test
While the UMP theorems seem impressive mathematically, they are not
broad enough. Besides being hard to find, UMP tests often do not exist
unless the data are from certain classical, well-behaved parametric models. We
have no choice but to relax the optimality requirements if we wish to recommend
effective methods for hypothesis testing in real-world applications.
Definition 15.1. Consider the simple null hypothesis H0 : {θ0} against H1 :
θ > θ0 in a one-parameter setting. Let β(θ) be the power function of a test
φ(x) of size α. Suppose for any other test φ∗(x) of size α, there exists an
ε > 0 such that
E{φ∗(X); θ} ≤ β(θ)
for all θ ∈ (θ0, θ0 + ε). Then we say φ(X) is locally most powerful.
There are a number of easily missed details in this definition. One is
that the locally most powerful criterion is only applicable to one-parameter
distribution families. In addition, it is restricted to a specific type of null
and alternative hypotheses: the null hypothesis contains a single distribution
and the alternative hypothesis is one-sided.
An immediate question after this definition is: under what conditions does
such a locally most powerful test exist? We give the answer in the next
section.
15.1 Score test and its local optimality
We give a straightforward theorem about existence as follows.
Theorem 15.1. Let {f(x; θ) : θ ∈ Θ} be a regular statistical model with score
function defined by
S(θ; x) = ∂ log f(x; θ)/∂θ.
We consider the case where Θ is an interval of real numbers. A test defined
by
φ(x) = 1(S(θ0; x) > k)
is a locally most powerful test for H0 : {θ0} against H1 : θ > θ0 among the
tests with size α = E{φ(X); θ0}.
Remark: For mathematical simplicity, we have ignored the requirement of
a pre-specified size α; the test defined above has whatever size it ends up with.
Also, even though the result is presented as if it were applicable only to a
single observation x, it applies when x is a vector, particularly a vector of
i.i.d. observations. In that case, we use S(xi; θ) for the contribution of the
ith observation, and the overall score function is Σi S(xi; θ).
We have switched the two entries of S(·; ·) because we intend to study its
randomness induced by the randomness of X. If we intend to study it as a
function of θ given some observed value x, we use S(θ; x).
A model is regular here if, for any integrable T(X),
E{T(X); θ} = ∫ T(x)f(x; θ) dν(x)
is differentiable with respect to θ and the derivative can be taken inside the
integral sign. In simple words, the order of differentiation and integration can
be exchanged without altering the outcome.
The local optimality holds only for a simple null hypothesis against a one-sided
alternative, and it is lost immediately if either condition is violated.
Nevertheless, the score test itself is broadly applicable.
Proof: Being locally most powerful is the same as requiring
β(θ) = E{φ(X); θ}
to have the largest possible derivative at θ = θ0 among all tests of size α.
Thus, we show that the test defined by φ(x) = 1(S(θ0; x) > k) makes β(θ)
have the largest derivative.
Let φ∗(x) be another test of the same size. Then
{φ(x) − φ∗(x)}{S(θ0; x) − k} ≥ 0.
Taking expectation under the distribution f(x; θ0), and noticing that
E{φ(X) − φ∗(X); θ0} = 0, we get
E{[φ(X) − φ∗(X)]S(θ0; X)} ≥ 0.
Under regularity conditions, the left-hand side is the difference of the
derivatives of the two power functions at θ0. ♦
The proof may not seem tight: a problem occurs when
E{[φ(X) − φ∗(X)]S(θ0; X)} = 0.
Further investigation reveals that this occurs only if φ(x) = φ∗(x) with
probability one with respect to f(x; θ0).
Example 15.1. Let X1, . . . , Xn be an i.i.d. sample from the Cauchy distribution
with
f(x; θ) = 1/[π{1 + (x − θ)²}].
Consider the test for H0 : θ = 0 against H1 : θ > 0.
The locally most powerful test is
φ(x) = 1(2Σ xi/(1 + xi²) > k)
for some k such that the test has the required size.
The distribution of Σ Xi/(1 + Xi²) is not well investigated, and there is
no simple way to compute a k value with which the size requirement is met
precisely. However, it is easy to show that Σ Xi/(1 + Xi²) is asymptotically
N(0, n/8). Thus, when n is large (say larger than 20), we may use the normal
approximation to get a k value so that the size of the test is close to the
required size.
From this example, we notice that the above discussion leaves out a practical
consideration: choosing the constant k so that the test can be implemented
in a real-world problem. A general principle is to work out the distribution
of the score function Σi S(xi; θ0) and let k be its (1 − α)th quantile.
If Σi S(xi; θ0) has a discrete distribution, one may use randomization to
achieve exact size α; apparently, randomization is not so important in real-world
applications.
When n is not large, the normal approximation “can be” used, but the
precision “may be” poor. Such a problem will probably not occur in real-world
applications, but if we must work on one, we may simulate the distribution of
Σ Xi/(1 + Xi²). For instance, when n = 10, the 95% quantile of the normal
distribution with variance n/8 is 1.839. Based on a simulation of 100,000
data sets, the observed 95% sample quantile is 1.848. It turns out that the
normal approximation is not poor at all in this particular example. One
reason is that even though the Cauchy distribution does not have even a first
moment, the random variable X/(1 + X²) is bounded and has a symmetric
distribution. Hence, the normal approximation works nicely.
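The simulation described above is easy to reproduce. The sketch below (Python; the notes use R) draws standard Cauchy samples under H0 : θ = 0 and compares the simulated 95% quantile of Σ Xi/(1 + Xi²) with its N(0, n/8) approximation; the seed is arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2019)
n, reps = 10, 100_000
x = rng.standard_cauchy(size=(reps, n))
tn = (x / (1 + x**2)).sum(axis=1)    # half the locally most powerful score statistic at theta0 = 0

normal_q = norm.ppf(0.95, scale=np.sqrt(n / 8))   # about 1.839, as in the text
sim_q = np.quantile(tn, 0.95)                     # close to the 1.848 reported above
print(normal_q, sim_q)
```

The two quantiles differ by less than one percent, confirming that the normal approximation is adequate even for n = 10 here.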
15.2 General score test
The locally most powerful test we gave in the last section is a score test.
When the model assumption f(x; θ) is correct, we have
E{S(X; θ); θ} = 0
for any θ under regularity conditions. To emphasize the random aspect of
the score function, we put x ahead of θ in the score function in this section.
If a statistician is asked to judge whether or not θ = θ0 is a plausible
value, he or she could take a look at the value of
Σ_{i=1}^n S(xi; θ0),
where the summation is needed when we are given a set of i.i.d. observations.
From the pure significance test point of view, this is an informative statistic
about whether θ0 is an acceptable value.
More specifically, suppose θ is the only parameter under consideration,
the null hypothesis is θ = θ0, and the alternative is θ ≠ θ0. The score test
then rejects the null hypothesis when |Σ S(xi; θ0)| > k for some k. When
the sample size n is large enough, say over 20, we use the central limit
theorem to choose k so that the test has size approximately equal to the
specified α.
Note that the alternative hypothesis in this section is H1 : θ ≠ θ0, which
is two-sided. Because of this, there generally exists no locally most powerful
test similar to the one discussed in the beginning. Another point is that the
rejection region is more conveniently defined as
{(x1, . . . , xn) : {Σ S(xi; θ0)}² > k}
for some k. Let
I(θ) = E{S(X; θ)}²
be the Fisher information of a single observation. Under some conditions,
{Σ S(xi; θ0)}²/{nI(θ0)} has a chi-square limiting distribution with one degree
of freedom. This result is often used to find an approximate k value so that
the test has (approximately) the required size.
Once we leave the territory of optimality considerations, we generally do
not make a fuss over “randomization” to ensure the size of the test is exactly
as pre-specified. The above score test also works when θ is a vector
parameter.
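As an illustration of the two-sided score test, consider the Poisson(θ) model, for which S(x; θ) = x/θ − 1 and I(θ) = 1/θ, so the standardized statistic {Σ S(xi; θ0)}²/{nI(θ0)} reduces to n(x̄ − θ0)²/θ0 and is compared with a chi-square critical value. The Python sketch below uses an assumed toy sample; it is not data from the notes.

```python
import numpy as np
from scipy.stats import chi2

# Poisson model: S(x; theta) = x/theta - 1, Fisher information I(theta) = 1/theta.
theta0, alpha = 1.0, 0.05
x = np.array([0, 2, 1, 1, 3, 0, 1, 2, 1])   # assumed toy sample, n = 9
n = len(x)

score_sum = np.sum(x / theta0 - 1)
stat = score_sum**2 / (n / theta0)          # equals n*(xbar - theta0)^2 / theta0
critical = chi2.ppf(1 - alpha, df=1)        # about 3.84
reject = stat > critical
print(stat, critical, reject)
```

The same recipe applies to any regular one-parameter model once S and I are worked out.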
Suppose we split a vector parameter as θ = (ξ, η) and wish to test
H0 : ξ = ξ0. Note that the null hypothesis becomes composite; namely, it
contains a set of distributions instead of a single one. One may work out
S(x; ξ0, η) = ∂ log f(x; ξ, η)/∂ξ |_{ξ=ξ0}
and build a test statistic based on
Σ_{i=1}^n S(xi; ξ0, η̂0),
where η̂0 is the MLE of η given ξ = ξ0. That is,
η̂0 = arg max{ℓn(ξ0, η) : η}.
To use this statistic effectively, we need to find its distribution, at least
asymptotically, in order to specify the rejection region so that the size
of the test meets the specification. We leave the details of the asymptotic
distribution to a later chapter.
In general, if there is a function g(x; θ) such that E{g(X; θ); θ} = 0 for
all θ ∈ H0, then the value of
T = inf_{θ∈H0} |Σ g(xi; θ)|
could be used as a statistic for a “pure significance hypothesis test”.
Among all such choices, the score test is optimal in some sense.
15.3 Implementation remark
Whether we test against a one-sided or a two-sided alternative, we need to
give a “rejection region”, which is linked to the choice of k given the desired
size of the test. Particularly in assignment problems, the distribution of the
score function at θ = θ0 under the null hypothesis may belong to a well-known
distribution family; a constant k can then be selected without much
difficulty. In more realistic situations, we generally use the limiting
distribution of the score function at θ = θ0 to choose a k such that the size
of the test is approximately α.
When both approaches are feasible, the first is preferred. Yet this does
not mean the second is wrong: it gives an approximate answer. The
approximation may not be accurate when the sample size is not very large.
Thus, we do not recommend computing the value of k based on the limiting
distribution unless the sample size is reasonably large. This recommendation
is not applicable to classroom, assignment or textbook problems. In real-world
situations, use simulation to decide how good the approximation is and
whether it is satisfactory given the current sample size.
15.5 Assignment problems
1. Let X1, . . . , Xn be i.i.d. observations from Cauchy distribution with
density function
f(x; θ) = 1/[π{1 + (x − θ)²}]
with the location parameter θ ∈ R. We wish to test the hypothesis for
H0 : θ = 0 against an alternative to be specified. We set the size of the
test at α = 0.05.
(a) Derive the score test statistic against the alternative H1 : θ > 0. If
n = 10 and the size of the test is set at α = 0.05, specify the rejection
region based on asymptotic normality of the test statistic.
(b) Derive the score test statistic against the alternative H1 : θ < 0.
(c) Suppose one chooses Tn to be the sample median as his/her test
statistic to test against H1 : θ > 0. If n = 10 and the size of the test is
set at α = 0.05, specify the rejection region.
Remark: The sample median is asymptotically normal. Let ξ̂n and ξ be
the sample and population medians. We have
√n(ξ̂n − ξ) →d N(0, 1/{4f²(ξ)}),
where f(·) is the density function of the corresponding distribution.
(d) Following (c), what would be the rejection region if the hypotheses
are replaced by H0 : θ = 6 against H1 : θ > 6?
Remark: Make use of invariance property.
(e) Use computer simulation to compare the powers of the tests in (a)
and (c) at θ = 0.2.
Remark: generate at least 20K data sets so that the simulation error
is most likely below 0.3%.
2. Suppose we have one observation from Binomial distribution with pa-
rameters m = 50 and probability of success p. We set the size of the
test at α = 0.05.
(i) Obtain the locally most powerful test for H0 : p = 0.3 versus H1 :
p > 0.3.
(ii) Obtain the score test for H0 : p = 0.3 versus H1 : p ≠ 0.3.
Remark: (a) present the tests via a test statistic and a critical value;
(b) for convenience, use the normal and chi-square limiting distributions to
determine the critical values; (c) use R to obtain numerical values for the
given m and α.
Chapter 16
Likelihood ratio test
The conclusion of the famous Neyman–Pearson lemma may not be directly useful
when we must work with more complex models. However, it tells us that the
“optimal metric” for testing a null model containing only a single distribution
with parameter value θ0 against an alternative which also contains only
a single distribution with parameter value θ1 is their relative likelihood.
This motivates the use of the likelihood ratio test.
16.1 Likelihood ratio test: as a pure procedure
Let us consider the situation where we have a random sample from a distri-
bution that belongs to a parametric distribution family:
{f(x; θ) : θ ∈ Θ}.
Let H0 and H1 be subsets of Θ. In common practice, we take H1 = Θ−H0,
the complement of H0. Hence, when both H0 and H1 are explicitly specified
in a problem, we also automatically set Θ = H0 ∪H1.
Let Ln(θ) = Π_{i=1}^n f(Xi; θ) be the likelihood function of θ defined on Θ
under the commonly assumed i.i.d. setting. We call
{f(x; θ) : θ ∈ H0}
or simply H0 the null model. We also call H1 the alternative model, though
things will be slightly different, as will be seen. The distribution in H0 that
fits the data best from the likelihood angle is the one with the θ value that
maximizes Ln(θ) within H0. Let θ̂0 be this maximizer. Similarly, the best
value under the alternative model is the one that maximizes Ln(θ) for θ ∈ H1,
yet we do not directly utilize this value subsequently. Instead, we let θ̂1 be
the maximizer of Ln(θ) over Θ = H0 ∪ H1, the entire parameter space.
Namely, θ̂1 is the MLE under the full model. For the definitions of θ̂0 and θ̂1
to be viable, the suprema under the null and full models must be attained at
some parameter values. This is generally true, and we assume it is the case
without truly losing much generality; this technical issue has an easy fix.
The likelihood ratio statistic commonly used in the literature is defined to be

Λn = Ln(θ̂0)/Ln(θ̂1) = exp{ sup_{θ∈H0} ℓn(θ) − sup_{θ∈Θ} ℓn(θ) }.
The likelihood ratio test statistic is defined to be

Rn = −2 log Λn = 2{ℓn(θ̂1) − ℓn(θ̂0)}

where we have used the log likelihood function

ℓn(θ) = log Ln(θ) = Σ_i log f(xi; θ).
The multiplicative factor 2 in Rn does not play a role in defining a test. It makes the limiting distribution of Rn a neat chisquare under regularity conditions, as will be seen.
We define the likelihood ratio test as
φ(x) = 1{Rn ≥ c}
for some c such that the test has the pre-specified size. From now on, we will not pay attention to the situation where Rn has a discrete distribution. More precisely, the test will be treated as if randomization is never needed to make the size of the test exactly equal to the pre-specified value. One reason for this convention is that finding the precise critical value c is generally difficult, if not numerically infeasible, even without this complication. In addition,
when the sample size n is large, we have the following result due to Wilks that works well enough. This result enables us to come up with an approximate critical value. Since the critical value is then already an approximation, it is pointless to add another layer of approximation through randomization.
From the data analysis point of view, we have to go over several steps to
perform a likelihood ratio test.
1. Understand the data structure and come to an agreed model from which
the data were supposedly collected.
2. Work out the likelihood function. Identify the null and alternative
hypotheses from the application background.
3. Numerically find the MLE of the unknown parameters under the null and the full models. Numerically obtain the value of the likelihood ratio statistic, Rn.
4. Based on the user-specified size of the test, α, and our knowledge of the sampling distribution of Rn under the null model, determine the critical value c such that a rejection of the null model is recommended when Rn ≥ c.
We may instead compute pr(Rn > Robs), report this value as the p-value of this test and the specific pair of hypotheses, and leave the decision to the user.
The model choice should be made after a thorough scientific understanding of the applied problem, and the statistical properties of the model should reflect this understanding; this is a topic for Statistical Consulting courses. We mostly discuss situations where the observations are i.i.d. here. Point 2 is generally a topic in specialized courses such as “Generalized Linear Models”. Numerical computation fits into a specialized course in statistics or computer science. A course in mathematical statistics focuses on the last point: how to determine the appropriate value of c or compute p-values analytically.
16.2 Wilks Theorem under regularity conditions
The likelihood ratio test is popular in applications not only because it is “optimal” in the light of the Neyman–Pearson lemma, but also because the distribution of its test statistic is “model-free” when the sample size n is large and the model and hypotheses are regular. Here is a simplified version of the elegant result by Wilks (1938).
Theorem 16.1. Suppose H0 is an open subset of an m-dimensional subspace of Θ and Θ is an open subset of R^{m+d}. Under some regularity conditions, and assuming the data set consists of n i.i.d. observations, we have

pr{Rn ≤ t} → pr(Z1² + Z2² + · · · + Zd² ≤ t)

as the sample size n → ∞, under any null model θ = θ0 ∈ H0.
We have used Z1, . . . , Zd as a set of i.i.d. standard normal random variables. Based on the above theorem, when n is large, a test with approximate size α is obtained by choosing the critical value c = χ²_{d, 1−α}, the (1 − α) quantile of the chisquare distribution with d degrees of freedom.
When H0 contains many distributions, this theorem says that whichever θ0 ∈ H0 indexes the specific distribution that generated the data, the distribution of Rn stays the same, asymptotically.
In many research papers, the distribution of the test-statistic is often
referred to as the distribution of the test. Such a statement is not rigorous,
but does not seem to cause many problems. If you get confused, it can be
helpful to question the meaning of this statement.
We give neither a proof nor a list of the conditions at the moment. Let us examine a few examples of the likelihood ratio test.
Example 16.1. Consider the exponential distribution model with mean parameter θ and parameter space R⁺, and a hypothesis test problem in which H0 : θ = 1 and H1 : θ ≠ 1. Given a random sample of observations, we find θ̂1 = X̄n. Since H0 contains a single distribution, we have θ̂0 = 1. The likelihood ratio statistic is given by

Rn = −2n{log X̄n − (X̄n − 1)}.
Under the null hypothesis, it is known that X¯n → 1 almost surely. Thus, we
have, approximately,
2n{(X¯n − 1)− log X¯n} = n(X¯n − 1)2 + op(1)
where op(1) is an asymptotically zero random quantity. More precisely, op(1)
is a random quantity that goes to 0 in probability. See the definition in an
earlier chapter.
By the central limit theorem, √n(X̄n − 1) is asymptotically N(0, 1) under the null hypothesis θ = 1. Using Slutsky's theorem, we find that Rn is asymptotically χ²₁.
Because of this, an asymptotic rejection/critical region for a size-0.05 likelihood ratio test is approximately

C = {x : Rn ≥ 3.841}.

In the form of a test function, φ(x) = 1{Rn ≥ 3.841}.
Suppose we instead take H1 to be the set of θ-values larger than 1. Subsequently, we regard the parameter space as Θ = [1, ∞). If so, the MLE of θ is no longer always X̄n. In this case, the limiting distribution of Rn is not χ²₁. We will see that the regularity conditions are not satisfied with this H1; that is, Theorem 16.1 does not always apply.
Example 16.2. Consider the test problem where an i.i.d. sample is from N(θ, σ²) and H0 : θ = 0 against H1 : θ ≠ 0.
The MLE under H1 is given by θ̂ = X̄n and σ̂²n = n⁻¹Σ(Xi − X̄n)². Under the null hypothesis, the MLE of σ² is σ̂²0 = n⁻¹ΣXi². Since σ̂²0 − σ̂²n = X̄²n, it is not too hard to find that

Rn = n log[1 + (σ̂²0 − σ̂²n)/σ̂²n] ≈ nX̄²n/σ̂²n.

Thus, its limiting distribution is χ²₁.
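As a quick numerical check of this example (our own sketch, not from the text), one can simulate a single data set under H0 with a non-unit variance and compare the exact statistic n log(σ̂²0/σ̂²n) with its quadratic approximation nX̄²n/σ̂²n (the denominator matters when σ² ≠ 1):

```python
import math
import random

random.seed(7)
n = 500
x = [random.gauss(0.0, 2.0) for _ in range(n)]  # H0 is true; sigma = 2 is arbitrary

xbar = sum(x) / n
s2_full = sum((xi - xbar) ** 2 for xi in x) / n  # MLE of sigma^2 under the full model
s2_null = sum(xi * xi for xi in x) / n           # MLE of sigma^2 under H0: theta = 0

rn_exact = n * math.log(s2_null / s2_full)       # exact likelihood ratio statistic
rn_approx = n * xbar * xbar / s2_full            # leading-term approximation

print(rn_exact, rn_approx)
```

For moderate n the two values agree to a few decimal places, since their difference is of order Rn²/n.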
There are many reasons why the likelihood ratio test is preferred by statis-
ticians and practitioners. Let me try to give you a list that I am aware of.
a) Because the limiting distribution of the likelihood ratio statistic under
regularity conditions is chisquare, it does not depend on unknown pa-
rameters. We say that it is asymptotically pivotal. One may recall
that one of the two preferred properties of a test statistic is that the
statistic has a sample distribution free from unknown parameters under
the null hypothesis.
b) Due to the Neyman–Pearson lemma, we believe that the LRT is nearly “most powerful”. The claim is unproven, and likely false. Yet, lacking any evidence to the contrary, we love to believe that the power of the LRT is superior.
c) Whether a limiting distribution is useful for statistical inference depends on how closely it approximates the finite-sample distribution when the sample size is in the range that often occurs in applications. For example, if a clinical trial typically recruits 200 patients, then the limiting distribution is useful when it provides a good approximation at n = 200. It would not be so useful if the approximation remains poor until n = 2000. There is a general belief that the chisquare approximation for the LRT is often good for moderate n.
d) The LRT is invariant to parameter transformations. If a one-to-one transformation is applied to θ to get ξ = g(θ), the LRT remains the same when testing g(H0) against g(H1). Note that I am regarding H0 and H1 as subsets of parameter values. If one makes a one-to-one transformation of the data, the inference conclusion based on the likelihood approach will also remain the same. A user should be aware that the data transformation changes the model before making use of this claim.
Let us also point out that the LRT is often abused. The asymptotic chisquare distribution is valid only if (a) the true value of the parameter is an interior point of the parameter space; (b) the distribution family is regular; (c) the observations are i.i.d. The result may still be valid when (c) is violated, yet the validity depends on the structure of the model, which should be examined before the LRT together with the chisquare approximation is used. If (a) is violated, the result is almost surely void. If (b) is violated, we
probably should not use the LRT, although there are examples, I believe, where the asymptotic result remains valid. Yet there is no reason to assume so in general.
Example 16.3. Suppose we have an i.i.d. sample from

f(x; π) = (1 − π)N(0, 1) + πN(1, 1).

The parameter space is [0, 1]. Suppose we want to test H0 : π = 0 against H1 : π > 0.
Under the null model, that is, assuming the true value of π is 0, the MLE π̂ equals 0 with probability approximately 0.5. Because of this, the limiting distribution of the likelihood ratio statistic places probability 0.5 at 0. Hence, the chisquare limiting distribution does not apply. The reason for the failure is that π = 0 is on the boundary of the parameter space.
You may work out the asymptotics involved in the above example following mathematical principles. We do not provide details here.
16.3 Asymptotic chisquare of LRT statistic
Let us consider the simplest case where H0 = {θ0} and θ is one-dimensional. In this case, the LRT statistic is

Rn = 2{ℓn(θ̂) − ℓn(θ0)}.

We carry out a simplified proof for the situation where the MLE is consistent; thus, it lies within an infinitesimal neighbourhood of θ0. We do not spell out the regularity conditions but assume the model under consideration is regular. A more technical discussion will be given in subsequent chapters.
Applying Taylor’s expansion, we have
`n(θ0) = `n(θˆ) + `

n(θˆ)(θ0 − θˆ) + (1/2)`′′n(θ˜)(θ0 − θˆ)2.
However, being the MLE, θ̂ satisfies ℓ′n(θ̂) = 0. In addition, with θ̂ being consistent, we find

n⁻¹ℓ″n(θ̃) = n⁻¹ℓ″n(θ0) + op(1) = −I(θ0) + op(1)
where I(·) is the Fisher information. Hence, we find

Rn = 2{ℓn(θ̂) − ℓn(θ0)} = {nI(θ0) + op(n)}(θ0 − θ̂)².

Recalling that

√n(θ̂ − θ0) = n^{-1/2} I⁻¹(θ0) Sn(θ0) + op(1),

we get

Rn = I⁻¹(θ0){n^{-1/2} Sn(θ0)}² + op(1).

Because n^{-1/2} Sn(θ0) → N(0, I(θ0)), we find

Rn → χ²₁

in distribution.
16.4 Assignment problems
1. Suppose we have one bi-variate normally distributed observation (X1, X2)
with mean µ = (µ1, µ2) and identity variance matrix. Consider the like-
lihood ratio test for
H0 : (µ1 ≤ 0) and (µ2 ≤ 0); H1 : µ1 > 0 or µ2 > 0.
(i) Obtain the expression of the likelihood ratio test statistic R.
(ii) What is the distribution of R when (µ1, µ2) = (0, 0)? That is,
obtain its cumulative distribution function.
Hint: Notation such as X+ = max{0, X} can be useful.
2. Let (Xi, Yi), i = 1, 2, . . . , n be a set of i.i.d. bivariate observations with joint probability mass function

f(x, y; θ1, θ2) = (θ1^x θ2^y)/(x! y!) exp(−{θ1 + θ2}).
(i) Obtain the analytical expression of the likelihood ratio test statistic Rn for H0 : θ1 = θ2 versus H1 : θ1 ≠ θ2.
(ii) Directly prove that Rn has a chi-square limiting distribution with
one degree of freedom under the null model.
(iii) Let n = 50 and θ1 = θ2 = 2. Use computer simulation to find out
how closely the null rejection probability is to the nominal level 0.05.
(iv) Repeat (iii) when θ1 = θ2 = 20. Should the type I error in this case be closer to 0.05 than that of the test in (iii)? Why, and is it?
Remark: Do not cite a generic result when you prove (ii). Based on the
asymptotic result, the rejection region for a level 0.05 test is Rn > 3.84.
Employ at least 20K repetitions in the simulation.
3. Let X1, X2, . . . , Xn be an i.i.d. sample from a distribution in the Poisson distribution family with mean parameter θ:

P(X = k) = (θ^k/k!) exp(−θ), k = 0, 1, 2, . . . .
(a) Derive the expression of the likelihood ratio test statistic Rn for H0 : θ = 1 against H1 : θ ≠ 1.
(b) Verify Wilks theorem: show that Rn in (a) has a χ²₁ limiting distribution under H0.
(c) Derive the expression of the likelihood ratio test statistic Rn for
H0 : θ = 1 against H1 : θ > 1.
(d) Derive the score function Sn(θ) of θ in the current context.
(e) Derive the score test statistic for H0 : θ = 1 against H1 : θ ≠ 1.
(f) Obtain the limiting distribution of the likelihood ratio test in (c).
4. Let X1, X2, . . . , Xn be an i.i.d. sample from a negative binomial distribution family with parameter θ and a known constant m (which is a positive integer):

P(X = x) = \binom{m+x−1}{x} (1 − θ)^x θ^m = c(m, x)(1 − θ)^x θ^m
for x = 0, 1, . . . and the parameter space Θ = (0, 1). Use X¯n as notation
for the sample mean.
(a) Derive the expression of the likelihood ratio test statistic Rn for H0 : θ = 0.5 against H1 : θ ≠ 0.5.
(b) Verify Wilks theorem: show that Rn has a χ²₁ limiting distribution under H0.
Chapter 17
Likelihood with vector
parameters
Consider the situation where we have a set of i.i.d. observations from a para-
metric family {f(x; θ) : θ ∈ Θ ⊂ Rd} for some positive integer d. The log
likelihood function remains the same as
ℓn(θ) = Σ_{i=1}^n log f(xi; θ).
Note that the dimension of X is not an issue here. The score function is still

Sn(θ; x) = Σ_{i=1}^n ∂{log f(xi; θ)}/∂θ

but we should regard it as a vector. Having n observations in an i.i.d. setting will be assumed throughout this chapter.
The regularity conditions are the same though sometimes we should in-
terpret them as “element wise”. In addition, the regularity conditions are
required for both {f(x; θ) : θ ∈ H0} and {f(x; θ) : θ ∈ Θ}.
R0 The parameter space of θ is an open set of R^m or R^{m+d}, for the null and full models respectively.
R1 f(x; θ) is differentiable to order three with respect to θ at all x.
R2 For each θ0 ∈ Θ, there exist functions g(x) and H(x) such that for all θ in a neighborhood N(θ0),

(i) |∂f(x; θ)/∂θ| ≤ g(x);
(ii) |∂²f(x; θ)/∂θ²| ≤ g(x);
(iii) |∂³ log f(x; θ)/∂θ³| ≤ H(x)

hold for all x, and

∫ g(x) dx < ∞; E0{H(X)} < ∞.

We have used E0 for the expectation calculated at θ0.
R3 For each θ ∈ Θ,

0 < Eθ[{∂ log f(X; θ)/∂θ}{∂ log f(X; θ)/∂θ}^τ] < ∞.

This inequality is interpreted as the matrix being positive definite with finite entries.
Although the integration is stated as with respect to dx, the results we
are going to state remain valid if it is replaced by some σ-finite measure.
All conditions are stated as if required at all x; exceptions over a set of x of 0 measure (with respect to f(x; θ0)) are allowed.
Lemma 17.1. (1) Under regularity conditions, we have

Eθ{∂ log f(X; θ)/∂θ} = 0.

(2) Under regularity conditions, we also have

Eθ[{∂ log f(X; θ)/∂θ}{∂ log f(X; θ)/∂θ}^τ] = −Eθ{∂² log f(X; θ)/∂θ∂θ^τ}.
The proof of the above lemma remains the same as the one for one-dimensional θ. The second identity in this lemma is called the Bartlett identity in the literature.
Theorem 17.1. Suppose θ0 is the true parameter value. Under Conditions R0–R3, there exists a sequence θ̂n such that
(i) Sn(θ̂n) = 0 almost surely;
(ii) θ̂n → θ0 almost surely.
Proof.
(i) Let ε be a small enough positive number. Consider a θ∗ value such that ‖θ∗ − θ0‖ = ε; that is, θ∗ is on the sphere centred at θ0 with radius ε. We aim to show that, almost surely,

ℓn(θ∗) < ℓn(θ0)   (17.1)

simultaneously for all such θ∗.
If (17.1) is true, it implies that ℓn(θ) has a local maximum within this ball. Because the likelihood function is smooth, the derivative at this local maximum is 0. Hence, conclusion (i) is true.
Is (17.1) true? By Taylor's expansion, we have

ℓn(θ∗) = ℓn(θ0) + {ℓ′n(θ0)}^T(θ∗ − θ0) + (1/2)(θ∗ − θ0)^T ℓ″n(θ̃)(θ∗ − θ0)

for some θ̃ in the ε-ball.
It is known that ℓ′n(θ0) = Op(n^{1/2}) and that

n⁻¹ℓ″n(θ0) → −I(θ0)

almost surely. Here I(θ0) is the Fisher information, which is positive definite by R3. Activating R2(iii), it is easy to show that, almost surely,

sup_{θ∗} n⁻¹‖ℓ″n(θ̃) − ℓ″n(θ0)‖ ≤ Cε

in some norm, for some constant C that is neither random nor dependent on θ∗.
These give

ℓn(θ∗) − ℓn(θ0) = {ℓ′n(θ0)}^T(θ∗ − θ0) − (n/2)(θ∗ − θ0)^T I(θ0)(θ∗ − θ0) + ε³O(n).

Roughly, the first term is of size εn^{1/2}, the second is of size −nε², and the remainder is of size nε³. Thus, the overall size is determined by −nε², which is negative. This completes the proof of (i).
The order assessments can be made rigorous but will not be given here.
(ii) is a direct consequence of (i).
This result is not equivalent to the consistency of MLE even for this
special case. There exists a proof of the consistency of MLE based on much
more relaxed conditions. However, the proof is too complex to be explained
clearly in this course.
17.1 Asymptotic normality of MLE after the consistency is established
Under the assumption that f(x; θ) is smooth, and θˆ is a consistent estimator
of θ, we must have
Sn(θˆ) = 0.
By the mean-value theorem in mathematical analysis, we have

Sn(θ0) = Sn(θ̂) + S′n(θ̃)(θ0 − θ̂)

where θ̃ is a parameter value between θ0 and θ̂. This claim is not exactly true for vector-valued functions but is commonly accepted. A more rigorous proof is conceptually very similar but tedious in its details.
By one of the lemmas proved previously, we have

n⁻¹S′n(θ̃) → −I(θ0),

the Fisher information, almost surely. In addition, the classical multivariate central limit theorem can be applied to obtain

n^{-1/2} Sn(θ0) → N(0, I(θ0)).

Thus, by Slutsky's theorem, we find

√n(θ̂ − θ0) = n^{-1/2} I⁻¹(θ0) Sn(θ0) + op(1) → N(0, I⁻¹(θ0))

in distribution as n → ∞.
17.2 Asymptotic chisquare of LRT for composite hypotheses
Let us still consider the simplest case, where H0 = {θ0} with θ0 an interior point of Θ, and Θ has dimension d. The alternative is θ ≠ θ0. Assume the regularity conditions are satisfied by the full model {f(x; θ) : θ ∈ Θ}.
In this case, the LRT statistic is

Rn = 2{ℓn(θ̂) − ℓn(θ0)}.

Remember, we work on the case in which the MLE is consistent; thus, it lies within an infinitesimal neighborhood of θ0.
Applying Taylor’s expansion, we have
`n(θ0) = `n(θˆ) + {`′n(θˆ)}T (θ0 − θˆ) + (1/2)(θ0 − θˆ)T{`′′n(θ˜)}(θ0 − θˆ).
However, being MLE, θˆ makes `′n(θˆ) = 0. In addition, with θˆ being consistent,
we find
n−1`′′n(θ˜) = n
−1`′′n(θ0) + op(1) = −I(θ0) + op(1).
Hence, we find

Rn = 2{ℓn(θ̂) − ℓn(θ0)} = n(θ0 − θ̂)^T{I(θ0) + op(1)}(θ0 − θ̂).

Recalling that

√n(θ̂ − θ0) = n^{-1/2} I⁻¹(θ0) Sn(θ0) + op(1),

we get

Rn = n⁻¹ Sn^T(θ0) I⁻¹(θ0) Sn(θ0) + op(1).

Because n^{-1/2} Sn(θ0) → N(0, I(θ0)), we find

Rn → χ²_d

in distribution.
Remark: d is the dimension difference between H0 and H1.
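To see the χ²_d limit in action for d = 2, here is a small simulation of our own (not from the text): two independent coordinates, each an i.i.d. N(θj, 1) sample with known unit variance, testing H0 : θ1 = θ2 = 0. In this special case Rn = n(X̄² + Ȳ²) holds exactly, and the χ²₂ upper 5% point has the closed form −2 log 0.05 ≈ 5.991.

```python
import math
import random

random.seed(5)
n, reps = 100, 4000
crit = -2.0 * math.log(0.05)  # chi-square(2) upper 5% point, about 5.991
rejections = 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]  # H0 holds in both coordinates
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar, ybar = sum(x) / n, sum(y) / n
    # With known unit variances the LRT statistic is exactly n(xbar^2 + ybar^2),
    # which is chi-square with d = 2 degrees of freedom under H0.
    rn = n * (xbar * xbar + ybar * ybar)
    if rn >= crit:
        rejections += 1

print(rejections / reps)  # close to the nominal size 0.05
```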
Counter Example. Suppose that we have an iid sample of size n from
(1− γ)N(0, 1) + γN(2, 1)
where γ is the mixing proportion.
We would like to test the hypothesis H0 : γ = 0 versus H1 : γ > 0.
The log likelihood function, after dropping an additive term that does not involve γ, is given by

ℓn(γ) = Σ_{i=1}^n log{1 + γ[exp(2(xi − 1)) − 1]}

since the density ratio of N(2, 1) to N(0, 1) is exp(2x − 2) = exp(2(x − 1)).
We have

ℓ′n(γ) = Σ_{i=1}^n [exp(2(xi − 1)) − 1] / {1 + γ[exp(2(xi − 1)) − 1]}.
At γ = 0, we find

ℓ′n(0) = Σ_{i=1}^n {exp(2(xi − 1)) − 1}

which has 0 expectation under H0. According to the CLT, we find

P(ℓ′n(0) > 0) → 0.5
as n → ∞. It is clear that ℓ′n(γ) is a decreasing function over γ > 0. Thus, when ℓ′n(0) < 0, we get ℓ′n(γ) < 0 for all γ > 0, and hence γ̂ = 0.
These two facts imply that if the data are generated from H0 and we look for the MLE in general, we would find

P(γ̂ = 0) → 0.5.
Case I: when ℓ′n(0) ≤ 0, we have γ̂ = 0. This further leads to

Rn = 2{ℓn(γ̂) − ℓn(0)} = 0.

Case II: when ℓ′n(0) > 0, we have γ̂ > 0. It solves the equation

Σ_{i=1}^n [exp(2(xi − 1)) − 1] / {1 + γ[exp(2(xi − 1)) − 1]} = 0.
For brevity, let us assume the solution is in a small neighborhood of γ = 0. Thus, the above equation is approximated by

Σ_{i=1}^n {exp(2(xi − 1)) − 1} − γ Σ_{i=1}^n {exp(2(xi − 1)) − 1}² + op(n) = 0.
This gives

γ̂ = [Σ_{i=1}^n {exp(2(xi − 1)) − 1}] / [Σ_{i=1}^n {exp(2(xi − 1)) − 1}²] + op(n^{-1/2}).

Consequently, noting that ℓn(0) = 0 with the above definition,

ℓn(γ̂) = [Σ_{i=1}^n {exp(2(xi − 1)) − 1}]² / [2 Σ_{i=1}^n {exp(2(xi − 1)) − 1}²] + op(1).
Combining the two cases, we can unify the expansion to

ℓn(γ̂) = {[Σ_{i=1}^n {exp(2(xi − 1)) − 1}]⁺}² / [2 Σ_{i=1}^n {exp(2(xi − 1)) − 1}²] + op(1).

As n → ∞, the limiting distribution of Rn = 2ℓn(γ̂) is given by that of

(Z⁺)²

with Z standard normal, which is often denoted as

0.5χ²₀ + 0.5χ²₁.
Moral of this example: the full model has parameter space Θ = [0, 1], while the null model has parameter space {0}. The parameter space under the full model is not an open subset of R. This invalidates the result obtained under the regularity conditions. Most people will tell you that the reason for not having a chisquare limiting distribution is that the true value γ = 0 is not an interior point of Θ. This is a reasonable explanation but does not survive serious scrutiny.
In many applications, the i.i.d. assumption is violated and the regularity conditions are no longer sensible. Yet, particularly in biostatistics applications, users still regard the MLEs as asymptotically normal and the likelihood ratio statistics as asymptotically chisquare. Often, they are not wrong. At the same time, it is a worrisome trend that our scientific claims are built on a less and less solid foundation.
I hope that these lectures help you get a sense of when the “chisquare” distribution is valid, and that you become able to rigorously establish whatever conclusions are needed in various applications rather than merely have an impression that some claims are true.
17.3 Asymptotic chisquare of LRT: one-step further
Write θ^T = (θ1^T, θ2^T) so that θ1 is a vector of length d and θ2 is a vector of length m. The superscript T is there to make all vectors column vectors.
Consider the composite null hypothesis H0 that

θ1 = 0

in the vector sense. The alternative is H1 : θ1 ≠ 0. The full model has θ a vector of length m + d, while under the null model θ lives in a subspace of dimension m. Both spaces are open subsets of the corresponding Euclidean spaces R^{m+d} and R^m.
In this section, we denote by θ0^T = (θ10^T, θ20^T) the true parameter vector whose corresponding distribution generated the data x1, . . . , xn. In addition, this θ0 is one of the parameter vectors in H0. We assume that θ0 is an interior point of the parameter space of H0 as usual. This is part of the regularity conditions that ensure the validity of the asymptotic result to be introduced.
We use θ̂ for the MLE of θ without placing any restriction on the range of θ, and θ̂0 for the MLE, or the maximum point of ℓn(θ), over the space of H0. The consistency results discussed before ensure that both θ̂ and θ̂0 converge almost surely to θ0 when H0 is true. When notationally necessary, they will be partitioned into (θ̂1^T, θ̂2^T)^T and (θ̂01^T, θ̂02^T)^T respectively. Of course, we have θ̂01 = 0 under the null hypothesis.
17.3.1 Some notational preparations
The Fisher information with respect to θ is now a matrix. We denote

I(θ) = E[{∂ log f(X; θ)/∂θ}{∂ log f(X; θ)/∂θ}^T] = −E{∂² log f(X; θ)/∂θ∂θ^T}.
The expectation is computed with the distribution of X given by f(x; θ): the same θ inside and out. This matrix can be partitioned into 4 blocks:

Iij(θ) = E[{∂ log f(X; θ)/∂θi}{∂ log f(X; θ)/∂θj}^T] = −E{∂² log f(X; θ)/∂θi∂θj^T}

for i, j = 1, 2.
for i, j = 1, 2. In other words, we have
I(θ) =
{
I11(θ) I12(θ)
I21(θ) I22(θ)
}
The regularity conditions make I(θ) positive definite, which implies that both I11 and I22 are positive definite. The expectations are understood as taken with the distribution of X given by f(x; θ); namely, the same parameter value for the operator E and for the subject.
The score function is now also a vector. Let us write

Sn^T(θ) = (Sn1^T(θ), Sn2^T(θ)) = Σ_{i=1}^n ( ∂ log f(xi; θ)/∂θ1^T , ∂ log f(xi; θ)/∂θ2^T ).

The superscripts T stand for transpose; they make every vector here a row vector and serve no other practical purpose.
Matrix result. Let I11,2 = I11 − I12 I22⁻¹ I21. It is laborious to verify that

I⁻¹(θ) = [ I 0 ; −I22⁻¹I21 I ] [ I11,2⁻¹ 0 ; 0 I22⁻¹ ] [ I −I12I22⁻¹ ; 0 I ]

where each bracket is a 2 × 2 block matrix whose block rows are separated by semicolons, and I by itself is an identity matrix of proper size. We allow the same symbol I to denote identity matrices of different sizes when it causes no confusion.
Based on matrix theory, or by direct verification, we have

x^T I⁻¹ x = (x1^T − x2^T I22⁻¹I21) I11,2⁻¹ (x1 − I12I22⁻¹x2) + x2^T I22⁻¹ x2

for any vector x of proper length and partition. Applying this matrix result to Sn and I, we find

Sn^T I⁻¹(θ) Sn = (Sn1^T − Sn2^T I22⁻¹I21) I11,2⁻¹ (Sn1 − I12I22⁻¹Sn2) + Sn2^T I22⁻¹ Sn2.
It is known that n^{-1/2} Sn is asymptotically normal with covariance matrix I(θ). This implies that

n^{-1/2}(Sn1 − I12I22⁻¹Sn2)

is asymptotically normal with covariance matrix I11,2. Hence, for the first term,

n⁻¹(Sn1^T − Sn2^T I22⁻¹I21) I11,2⁻¹ (Sn1 − I12I22⁻¹Sn2) → χ²_d

where d is the dimension of θ1.
Let us now use these results to prove the claim of the theorem. The LRT statistic now becomes

Rn = 2{ℓn(θ̂) − ℓn(θ̂0)} = 2{ℓn(θ̂) − ℓn(θ0)} − 2{ℓn(θ̂0) − ℓn(θ0)}.

For the first term, we apparently have

Rn1 = n⁻¹ Sn^T(θ0){I⁻¹(θ0)} Sn(θ0) + op(1).

Based on the same principle, we have

Rn2 = n⁻¹ Sn2^T(θ0){I22⁻¹(θ0)} Sn2(θ0) + op(1).

Combining the two expansions, we find

Rn = n⁻¹[Sn^T(θ0){I⁻¹(θ0)}Sn(θ0) − Sn2^T(θ0){I22⁻¹(θ0)}Sn2(θ0)] + op(1).

With all the preparatory results already established, we have

Rn → χ²_d

in distribution as n → ∞. ♦
Final remark on regularity conditions: At first look, the regularity conditions are placed on the full distribution family under consideration. A second look reveals that H0 forms a sub-distribution family, and we require the listed regularity conditions to be satisfied by the model formed by H0. The conditions on finite Fisher information and so on allow the use of the Law of Large Numbers and the Central Limit Theorem, and ensure that the remainder terms in Taylor's expansion are of higher order. They influence neither the existence nor the form of the limiting distributions at the various stages of the proof.
17.4 The most general case: final step
To highlight the fact that θ is a parameter vector, we use boldface θ in this
section. The null hypothesis discussed in the last section can be expressed
as
H0 : Aθ = 0
with the specific matrix A = diag{1, 1, . . . , 1, 0, 0, . . . , 0}. Denote the number of 1's as d and the number of 0's as m.
We can easily generalize this result to be applicable to any matrix A of
rank d and θ of length m + d. It is well known in linear algebra that the
solution set of Aθ = 0 forms a linear space of dimension m. There exist
m+ d linearly independent vectors ξ1, ξ2, . . . , ξd+m such that all solutions to
Aθ = 0 can be expressed as
θ = λ1ξ1 + · · · + λmξm.
Namely, in the space of λ, H0 becomes
λ = (λ1, · · · , λm, λm+1 = 0, . . . , λm+d = 0)
which is the same as the special case we have discussed. Namely, the conclu-
sion Rn → χ2d remains solid.
Most generally, assume the parameter space is a subset of Rd+m. The
composite hypothesis is either expressed as
R(θ) = 0 Hypothesis form I
for a continuously differentiable vector valued function R, or expressed as
θ = g(λ) Hypothesis form II
for a continuously differentiable g(·).
When it is in form I, denote the rank of the differential matrix of R at θ0 as d. Hence, it puts d constraints on the parameter in a small neighborhood of θ0. Then, based on the implicit function theorem, there exists a smooth function g such that the solutions to R(θ) = 0 can be written as θ = g(λ), where the dimension of λ is m, in a neighborhood of θ0.
In both cases, we may interpret that the null hypothesis sets d elements
in θ to 0 and leave m of them free. The same proof presented earlier makes
Rn → χ2d in distribution.
The regularity conditions must be applicable to the distribution family formed by the parameters that solve R(θ) = 0.
• true parameter value θ0 is an interior point of Θ and an interior point
of the solution space of R(θ) = 0.
• There is a neighborhood of θ0 in terms of Θ, over which R(θ) = 0
admits a smooth solution θ = g(λ).
• There are neighborhoods of λ0 and θ0 respectively such that g(λ) is
differentiable with full rank derivative matrix.
17.5 Statistical application of these results
The whole purpose of proving Rn → χ²_d is to test hypotheses in applications. As the size of Rn represents the departure from the null model, the test based on the likelihood ratio is mathematically given by

φ(x) = 1(Rn ≥ c)

and this c is chosen as χ²_d(1 − α), the (1 − α) quantile of the χ²_d distribution, for a size-α test.
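For reference, such critical values can be computed without tables for small d; the following is a sketch of our own using the closed-form survival functions of χ²₁ and χ²₂ together with simple bisection (the function names are ours):

```python
import math

def chi2_sf(t, d):
    """Survival function P(chi2_d > t), closed forms for d = 1, 2."""
    if d == 1:
        return math.erfc(math.sqrt(t / 2.0))
    if d == 2:
        return math.exp(-t / 2.0)
    raise ValueError("only d = 1, 2 implemented here")

def chi2_quantile(alpha, d):
    """Critical value c with P(chi2_d > c) = alpha, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(200):  # the survival function is decreasing in t
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, d) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(chi2_quantile(0.05, 1))  # about 3.841
print(chi2_quantile(0.05, 2))  # about 5.991
```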
Example 17.1. Suppose we have an i.i.d. sample from a trinomial distribu-
tion. That is, each outcome of a trial is one of three types. Let the corre-
sponding probabilities of occurrence be p1, p2, p3. Clearly, p1 + p2 + p3 = 1.
After n trials, we have n1, n2, n3 observations of three types. The log
likelihood function is given by
`n(p1, p2, p3) = n1 log p1 + n2 log p2 + n3 log p3.
The maximum likelihood estimators of these parameters are given by

p̂j = nj/n
for j = 1, 2, 3.
(i) Consider the test for pj = pj0, j = 1, 2, 3, with each pj0 > 0, versus pj ≠ pj0 for at least one of j = 1, 2, 3. The likelihood ratio test statistic is apparently given by

Rn = 2n1 log(p̂1/p10) + 2n2 log(p̂2/p20) + 2n3 log(p̂3/p30) = 2n Σj p̂j log(p̂j/pj0).
According to our theorem on the LRT, when n → ∞, Rn is approximately χ²₂: the constraint p1 + p2 + p3 = 1 leaves two free parameters, both of which are fixed under the null hypothesis.
The MLEs under this model are consistent and asymptotically normal.
We have p̂j = pj0 + Op(n^{-1/2}). Therefore, we have

log(p̂j/pj0) = −log{1 − (p̂j − pj0)/p̂j} = (p̂j − pj0)/p̂j + (1/2)(p̂j − pj0)²/p̂j² + Op(n^{-3/2}).
Hence, using Σj(p̂j − pj0) = 0,

Rn = n Σj (p̂j − pj0)²/p̂j + Op(n^{-1/2}) = n Σj (p̂j − pj0)²/pj0 + Op(n^{-1/2}).

Note that the change from p̂j to pj0 in the second equality leads to a discrepancy of size Op(n^{-1/2}); this discrepancy is understood as having been absorbed into the remainder term Op(n^{-1/2}).
The leading term is the famous Pearson's chi-square test statistic, often used for "goodness-of-fit" tests.
Another version of this test will appear as an assignment problem. The result remains similar when there are more than 3 categories. For the purpose of the assignment, we do not require a rigorous justification of why these Op(n^{−1/2}) terms are indeed Op(n^{−1/2}).
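As a numerical check of the expansion above, the following sketch (in Python; the counts and null probabilities are invented for illustration) computes Rn and Pearson's statistic side by side:

```python
# Compare the likelihood ratio statistic Rn with Pearson's chi-square
# statistic on a hypothetical trinomial sample; the two agree up to
# a term of size Op(n^{-1/2}).
import math

counts = [30, 50, 20]            # invented observed counts n1, n2, n3
p0 = [0.3, 0.4, 0.3]             # invented null probabilities p10, p20, p30
n = sum(counts)
p_hat = [c / n for c in counts]  # MLEs: nj / n

# Rn = 2 * sum_j nj * log(p_hat_j / p_j0)
Rn = 2 * sum(c * math.log(ph / p) for c, ph, p in zip(counts, p_hat, p0))
# Pearson: n * sum_j (p_hat_j - p_j0)^2 / p_j0
pearson = n * sum((ph - p) ** 2 / p for ph, p in zip(p_hat, p0))

print(round(Rn, 3), round(pearson, 3))
```

With these counts the two statistics differ by well under one unit, consistent with the Op(n^{−1/2}) discrepancy.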
17.6 Assignment Problems
1. Suppose that X1, . . . , Xn are i.i.d. from the Weibull distribution with pdf
f(x; θ, γ) = θ^{−1} γ x^{γ−1} exp(−x^γ/θ)
for x > 0. The parameter space is γ > 0 and θ > 0. Consider the problem of testing H0 : γ = 1 versus H1 : γ ≠ 1.
(a) Go over the regularity conditions one by one and confirm whether they are satisfied.
(b) Find the expression of the likelihood ratio test statistic as a function of γ̂, the MLE of γ under H1. Remark: a full analytical solution may not be possible.
(c) Generate a data set from the null model with θ = 1.5 and γ = 1, and compute the value of Rn:
set.seed(2014561)
y = rweibull(120, 1, 1.5)
If you are interested in this problem, also obtain a histogram of Rn based on 2000 repetitions and a QQ plot against the χ²₁ distribution.
2. Let X1, . . . , Xn be i.i.d. from N(µ, σ²).
(a) Suppose that σ² = γµ² with unknown γ > 0 and µ ∈ R. Find the likelihood ratio test for H0 : γ = 1 versus H1 : γ ≠ 1.
(b) Repeat (a) when σ² = γµ with unknown γ > 0 and µ > 0.
(c) Are the regularity conditions satisfied (for the chi-square limiting distribution of the LRT)?
3. Consider the 2×3 table that is often encountered in applications. The outcomes of n objects are often summarized as
counts I II III
a n11 n12 n13
b n21 n22 n23
The problem of interest is to see whether the attribute in terms of being
a or b is independent of the attribute in terms of category I, II and III.
Let pij be the probability that a random subject falls into cell (i, j).
(i) Derive the likelihood ratio test statistic for the null hypothesis pij = pi· p·j, where pi· and p·j are the marginal probabilities, against the alternative that pij ≠ pi· p·j for some (i, j). Identify (rather than prove) the limiting distribution of this statistic as n = ∑_{ij} nij → ∞.
(ii) Show that this statistic is asymptotically equivalent to Pearson's chi-square test statistic
n ∑_{i,j} (p̂ij − p̂i· p̂·j)²/(p̂i· p̂·j),
where p̂i· = ∑_j nij/n, p̂·j = ∑_i nij/n and p̂ij = nij/n. That is, the difference between the two statistics converges to 0 in probability under H0.
4. Let us perform a simulation study to check the conclusion of Q3. Let n = 300 and simulate the table N = 10,000 times.
(a) Simulate from the independence model pij = pi· p·j with marginal probabilities (p1·, p2·) = (0.3, 0.7) and (p·1, p·2, p·3) = (0.2, 0.35, 0.45),
and obtain the value of the likelihood ratio test statistic Rn. Record all
Rn values and draw a QQ plot against the null limiting distribution.
Report the simulated rejection rate for the size 0.05 likelihood ratio
test.
(b) Simulate from
(p11, p12, p13, p21, p22, p23) = {(0.1, 0.15, 0.4, 0.2, 0.1, 0.05)}
and obtain the value of the likelihood ratio test statistic Rn. Record all
Rn values and draw a QQ plot against the null limiting distribution.
Report the simulated rejection rate for the size 0.05 likelihood ratio test.
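A possible implementation of part (a) is sketched below (in Python; N is reduced from 10,000 to 1,000 to keep the run short, and the χ²₂ critical value 5.991 is hard-coded from tables):

```python
# Simulation sketch: simulate 2x3 tables under independence and record
# the rejection rate of the size-0.05 likelihood ratio test.
# Cell probabilities are the outer product (0.3, 0.7) x (0.2, 0.35, 0.45).
import math
import random

random.seed(2014561)
rows, cols = (0.3, 0.7), (0.2, 0.35, 0.45)
cells = [(i, j) for i in range(2) for j in range(3)]
probs = [rows[i] * cols[j] for i, j in cells]

def lrt_stat(n=300):
    """One simulated table and its likelihood ratio statistic Rn."""
    draws = random.choices(cells, weights=probs, k=n)
    nij = {c: 0 for c in cells}
    for c in draws:
        nij[c] += 1
    ni = [sum(nij[(i, j)] for j in range(3)) for i in range(2)]
    nj = [sum(nij[(i, j)] for i in range(2)) for j in range(3)]
    # Rn = 2 sum nij log( nij * n / (ni * nj) ), skipping empty cells
    return 2 * sum(
        nij[(i, j)] * math.log(nij[(i, j)] * n / (ni[i] * nj[j]))
        for i, j in cells if nij[(i, j)] > 0
    )

N = 1000
crit = 5.991  # chi-square(2) 0.95 quantile; df = (2 - 1) * (3 - 1) = 2
reject = sum(lrt_stat() >= crit for _ in range(N)) / N
print(reject)  # simulated rejection rate; near the nominal 0.05 under H0
```

The same loop with the alternative probabilities of part (b) would show the rejection rate rising well above 0.05.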
Chapter 18
Wald and Score tests
We discuss two types of tests that are closely related to the likelihood ratio
test in this chapter.
18.1 Wald test
We still consider the situation where n i.i.d. observations from a model {f(x; θ) : θ ∈ Θ} are provided. Under regularity conditions, we have shown that the MLE of θ is asymptotically normal. That is,
√n(θ̂n − θ) → N(0, I^{−1}(θ)) in distribution
as the sample size n → ∞. Note that this claim implicitly assumes that the true parameter value is the same θ appearing in these expressions.
Because of the above generically applicable asymptotic normality, to test the simple null hypothesis H0 : θ = θ0 against H1 : θ ≠ θ0, we may define
Wn(θ0) = n(θ̂n − θ0)^τ I(θ0)(θ̂n − θ0),
which approximately has the desired properties of a test statistic. We reject H0 in favour of the generic alternative H1 : θ ≠ θ0 when Wn(θ0) ≥ χ²_d(1 − α).
One may notice that the most crucial requirement for the validity of this test is
√n(θ̂n − θ) → N(0, I^{−1}(θ)) in distribution
as the sample size n → ∞. It is not crucial for θ̂n to be the MLE, nor for the matrix I to be the Fisher information. Hence, the Wald test is more generally applicable.
18.1.1 Variations of Wald test in the aspect of Fisher information
Because replacing I(θ0) with any of its consistent estimators does not change the limiting distribution of Wn, such substitutions lead to many versions of the Wald test.
1. We may replace I(θ0) in Wn by
Î_n(θ0) = −(1/n) ∑_{i=1}^n ∂² log f(xi; θ)/∂θ∂θ^τ |_{θ=θ0}.
This expression may be called the observed Fisher information at θ0. This change simplifies the test when the analytical form of the Fisher information matrix is too complex to obtain.
2. We may replace I(θ0) in Wn by
Î_n(θ̂) = −(1/n) ∑_{i=1}^n ∂² log f(xi; θ)/∂θ∂θ^τ |_{θ=θ̂}
where θ̂ is the MLE of θ. Note that we ignore the given value θ0 in favour of an estimated value. This quantity is often available as a by-product when the MLE is obtained by iterative methods such as Newton–Raphson.
3. We may replace I(θ0) in Wn by I(θ̂) where θ̂ is the MLE of θ. Note that we again ignore the given value θ0 in favour of an estimated value.
4. When the regularity conditions are satisfied, we may replace I(θ0) by
(1/n) ∑_{i=1}^n {∂ log f(xi; θ)/∂θ}{∂ log f(xi; θ)/∂θ}^τ |_{θ=θ0}.
Unlike the earlier choices, this quantity is always non-negative definite.
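To make the four choices concrete, here is a hypothetical sketch in the Poisson(θ) model, where I(θ) = 1/θ and the MLE is the sample mean; the seed, sample size and null value are arbitrary:

```python
# Four versions of the Wald statistic Wn = n (mle - theta0)^2 * I_hat
# in the Poisson(theta) model, where the score is x/theta - 1,
# the second derivative of log f is -x/theta^2, and I(theta) = 1/theta.
import math
import random

random.seed(1)
theta0 = 2.0
n = 200

def rpois(lam):
    # crude Poisson sampler via Knuth's method (stdlib has no Poisson draw)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

x = [rpois(theta0) for _ in range(n)]
mle = sum(x) / n                       # MLE of theta is the sample mean

diff2 = n * (mle - theta0) ** 2
W_fisher = diff2 / theta0                        # true I(theta0) = 1/theta0
W_observed = diff2 * (mle / theta0 ** 2)         # observed information at theta0
W_plug_in = diff2 / mle                          # I evaluated at the MLE
W_outer = diff2 * sum((xi / theta0 - 1) ** 2 for xi in x) / n  # score outer product

print(W_fisher, W_observed, W_plug_in, W_outer)
```

All four versions are asymptotically χ²₁ under H0 and differ only by Op(n^{−1/2}) factors.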
18.1.2 Variations of Wald test in the aspect of H0
The Wald test we introduced works for a simple null hypothesis. Suppose the vector parameter θ can be written as θ^τ = (θ1^τ, θ2^τ). To fix ideas, let the dimension of θ be d + k, with θ1 and θ2 of dimensions d and k respectively. Consider the problem of testing H0 : θ1 = θ10 against H1 : θ1 ≠ θ10. Assume the regularity conditions are satisfied by this model and the hypotheses. In this case, the null hypothesis is composite, as opposed to the simple one of the last section.
Let θ̂n be the MLE over the whole parameter space Θ and let θ̂n^τ = (θ̂1n^τ, θ̂2n^τ) be the corresponding partition. Because θ̂n is asymptotically normal, so is any of its sub-vectors (or linear combinations). This implies
√n(θ̂1n − θ10) → N(0, I11(θ0)) in distribution,
where I11 is the upper-left block of I^{−1} corresponding to θ1. This leads to a sensible test statistic
Wn(θ10) = n(θ̂1n − θ10)^τ {I11(θ̂)}^{−1}(θ̂1n − θ10). (18.1)
Clearly, we have Wn(θ10) → χ²_d in distribution, with d being the dimension of θ1. A test of approximate size α is therefore given by
φ(X) = 1 when Wn(θ10) ≥ χ²_d(1 − α); 0 otherwise.
Rather than defining Wn(θ10) as in (18.1), one may be tempted to use
n(θ̂1n − θ10)^τ {I11(θ0)}^{−1}(θ̂1n − θ10).
The above quantity is in fact not a statistic because we do not know the value of θ0 even when H0 is true; the null hypothesis H0 here is composite.
There are, however, many well justified choices in place of I11(θ̂). For instance, one may fix θ1 = θ10 and obtain the restricted MLE θ̂20. That is,
θ̂20 = arg max_{θ2} ℓn(θ10, θ2).
After which, we replace I11(θ̂) in (18.1) by I11(θ10, θ̂20).
Just as in the last section, many other statistics can be used in place of I11(θ̂) without changing the asymptotic conclusion. We do not have a rule to decide which one is the "optimal" choice; more accurately, we are not aware of any commonly accepted rule.
18.1.3 Variations of Wald test in the aspect of H0
We work under the same title but in a slightly different situation here. Suppose the null hypothesis is specified in the form
H0 : ϕ(θ) = 0
where ϕ(·) takes vector values of dimension d. The dimension of θ is d + k, the same as before. Note that when ϕ(θ) = θ1 − θ10 for some known value θ10, this H0 reduces to the case of the last subsection.
Naturally, the alternative hypothesis is H1 : ϕ(θ) ≠ 0. Assume both ϕ(·) and ϕ′(·) are smooth and that the rank of ϕ′(θ) is d for θ in a neighbourhood of the true parameter value.
If one regards ϕ(θ) as a parameter itself, with ϕ(θ̂) as its asymptotically normal estimator, then applying the principle behind the Wald test, we would define
Wn = n ϕ^τ(θ̂){ϕ′(θ̂) I^{−1}(θ̂) ϕ′(θ̂)^τ}^{−1} ϕ(θ̂)
as a test statistic. It can be shown that we still have, under H0,
Wn → χ²_d in distribution.
Clearly, an approximate size-α test can be constructed based on this Wn in the same way.
18.2 Score Test
We have seen that under regularity conditions,
E{∂ log f(X; θ)/∂θ} = 0,
where the expectation is taken under the assumption that the distribution of X is given by f(x; θ).
Thus, when we test H0 : θ = θ0, the value of the score function
Sn(θ0) = ∑_{i=1}^n ∂ log f(Xi; θ0)/∂θ
is indicative of the validity of H0.
Recall that n^{−1/2} Sn(θ0) is asymptotically multivariate normal with asymptotic variance I(θ0). Let us define a test statistic
Tn = Sn^τ(θ0){n I(θ0)}^{−1} Sn(θ0).
The limiting distribution of Tn is chi-square with d degrees of freedom, where d is the dimension of θ.
Based on this result, a score test of approximate size α is given by
φ(X) = 1 when Tn ≥ χ²_d(1 − α); 0 otherwise.
Unlike the likelihood ratio test or the Wald test, this statistic does not require us to compute the MLE of θ. We do need to compute the Fisher information matrix and its inverse.
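As a minimal illustration, consider the Poisson(θ) model, where the score at θ0 is Sn(θ0) = ∑(Xi/θ0 − 1) and I(θ0) = 1/θ0. The data below are invented and the χ²₁ critical value 3.841 is taken from standard tables:

```python
# Score test sketch in the Poisson(theta) model; no MLE is required,
# in contrast with the likelihood ratio and Wald tests.
x = [3, 1, 2, 2, 4, 0, 2, 3, 1, 4]   # made-up data for illustration
n = len(x)
theta0 = 2.0                          # H0: theta = theta0

Sn = sum(xi / theta0 - 1 for xi in x)   # score evaluated at theta0
Tn = Sn ** 2 / (n * (1 / theta0))       # Tn = Sn^2 / {n I(theta0)}

crit = 3.841  # chi-square(1) 0.95 quantile, standard table value
print(Tn, Tn >= crit)
```

Here Tn = 0.2, far below the critical value, so this invented sample gives no evidence against θ0 = 2.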
Similar to the Wald test, let us now consider the more complex situation where the null hypothesis is H0 : θ1 = θ10 while the dimension of θ is d + k. This means the second part of the θ vector is unspecified under H0. Let θ̂0 be the MLE under H0, and let
Sn1(θ) = ∂ℓn(θ)/∂θ1,
which was defined earlier. This is a column vector of length d. If the same asymptotic techniques are used here, we find that
n^{−1/2} Sn1(θ̂0)
is asymptotically multivariate normal with mean 0 and variance matrix I11,2(θ*) under the null hypothesis. Here θ* stands for the true parameter value and I11,2 = I11 − I12 I22^{−1} I21 was also defined before.
These asymptotic results lead to the conclusion that
Tn = Sn1^τ(θ̂0){n I11,2(θ̂0)}^{−1} Sn1(θ̂0) → χ²_d
in distribution as n → ∞. This time, d is the dimension of θ1. A test can be constructed the same way as earlier.
Finally, consider the null hypothesis specified by H0 : ϕ(θ) = 0 where ϕ is a smooth function. Under regularity conditions, and in most applied problems, we may equivalently write H0 as θ = g(λ) for some smooth function g with a new parameter λ.
In this case, we may obtain the MLE of λ as the maximum point of ℓ(g(λ)). Denote it as λ̂. Next, we redefine the score statistic to be
Tn = Sn^τ(g(λ̂)){n I(g(λ̂))}^{−1} Sn(g(λ̂)).
Under regularity conditions, we still have
Tn → χ²_d
in distribution, where d is the dimension of θ minus the dimension of λ.
As in the discussion of the Wald test, we can use variations of I(g(λ̂)) to construct the score test. I am not aware of definitive statements on which of them works the best.
18.3 Power and consistency
The three tests, likelihood ratio, Wald and score, are asymptotically equivalent. By this statement, we mean that if the true parameter value is at n^{−1/2}-distance from the null model space, then the powers of these tests are asymptotically equal, and the common limit is not 1.
Most tests recommended in mathematical statistics are consistent: the power of the test at any specific alternative distribution (a distribution in H1) goes to 1 as the sample size n → ∞ under the i.i.d. setting.
There are generally no discussions on whether these tests are unbiased. Admittedly, there is a discrepancy between optimality theories for hypothesis testing and the properties of generally recommended tests. The optimality theory provides a high ground from which we discuss the pros and cons of test procedures. We do not insist on using only tests with confirmed optimality properties in applications. The recommended tests are often designed to mimic or follow the principles of optimal tests after some tolerated compromises for the sake of convenience or feasibility in implementation.
We often use simulation studies to compare the performance of various tests, one advocated by the user and the others existing methods, in specific applications. It is not unusual for a student to claim that the "new method" is better because it rejects more null hypotheses, without further qualification. This practice is not right: one may reject all null hypotheses to achieve the highest power of 100% in any application. Clearly, such a test ignores the need to control the type I error, or the desire for an unbiased test. The comparison should only be made after the tests are calibrated so that their sizes are practically equal.
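As an illustration of checking a test's size before comparing power, the sketch below estimates the simulated size of a nominal 5% two-sided z-test under H0 (all settings invented):

```python
# Size check by simulation: generate data under H0 many times and record
# how often a nominal-5% test rejects. A trustworthy power comparison
# should start from tests whose simulated sizes match like this.
import random

random.seed(7)
n, N = 30, 2000
crit = 1.96  # standard normal 0.975 quantile

rejections = 0
for _ in range(N):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]   # data generated under H0
    z = sum(x) / n * n ** 0.5                        # sqrt(n) * sample mean
    if abs(z) >= crit:
        rejections += 1
rate = rejections / N
print(rate)  # simulated size; should be close to the nominal 0.05
```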
18.4 Assignment problems
1. Suppose that X = (X1, . . . , Xk)^τ has a multinomial distribution with parameter P = (p1, . . . , pk)^τ. It is known that ∑_j Xj = n and n is not random. Consider the problem of testing P = P0 where P0 is a probability vector with all entries positive.
(i) Derive the quadratic form of the Wald test statistic;
(ii) derive the quadratic form of the score test statistic;
(iii) (challenge) prove or disprove that these two statistics are identical.
2. Consider the hypothesis test for H0 : ϕ(θ) = 0 against the alternative H1 : ϕ(θ) ≠ 0. Assume the derivative ϕ′(θ) is continuous and has full rank d, while the dimension of θ is d + k for some k. The testing problem satisfies the regularity conditions, and we have i.i.d. observations of size n. Prove that the Wald test statistic
Wn = n ϕ^τ(θ̂){ϕ′(θ̂) I^{−1}(θ̂) ϕ′(θ̂)^τ}^{−1} ϕ(θ̂)
satisfies, under H0 and as n → ∞,
Wn → χ²_d in distribution.
3. Consider the null hypothesis H0 : θ1 = θ10 where the dimension of θ^τ = (θ1^τ, θ2^τ) is d + k and the dimension of θ1 is d. Assume the regularity conditions are satisfied for the corresponding model and we have i.i.d. observations of size n. Let θ̂0 be the MLE under H0 and
Sn1(θ) = ∂ℓn(θ)/∂θ1.
Prove that the score test statistic
Tn = Sn1^τ(θ̂0){n I11,2(θ̂0)}^{−1} Sn1(θ̂0) → χ²_d
in distribution as n → ∞.
Chapter 19
Tests under normality
There are many classical tests taught in introductory courses addressing various aspects of a population under the normality assumption. They include tests of hypotheses such as whether the population mean is equal to, or larger than, a specific value when a single i.i.d. sample is given, and tests of whether two population means are equal, or differ by a specific amount, when two independent i.i.d. samples from two populations are given. We summarize them in this chapter in the light of some types of optimality.
19.1 One-sample problem under normality
Suppose we have a random sample x1, . . . , xn of size n from N(θ, σ²). We adopt common notation: sample mean x̄ = n^{−1} ∑_{i=1}^n xi and sample variance s_n² = (n − 1)^{−1} ∑_{i=1}^n (xi − x̄)². Let
Tn = √n(x̄ − θ0)/s_n
for some given value θ0. The well known test for H0 : θ ≤ θ0 against H1 : θ > θ0 of size α is given by
φ(x) = 1 when Tn ≥ t_{n−1,1−α}; 0 otherwise,
where t_{n−1,1−α} is the (1 − α) quantile of the t-distribution with n − 1 degrees of freedom.
The well known t-test for H0 : θ = θ0 against H1 : θ ≠ θ0 of size α is given by
φ(x) = 1 when |Tn| ≥ t_{n−1,1−α/2}; 0 otherwise.
This is the famous two-sided t-test.
Both tests are convenient to use and have nice properties. Yet after having studied the UMP theory, we may question their "optimality". We will not prove anything but cite a few classical theorems here. Their proofs are too involved to be lectured in this course. Those interested are referred to reference textbooks. Even better, you may practice your technical skills through an attempt to prove or disprove these results.
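The two tests above can be sketched numerically. Below is an invented one-sample example with n = 12, using the standard table value t_{11,0.975} = 2.201 for the two-sided test:

```python
# Two-sided one-sample t-test on invented data: compute
# Tn = sqrt(n) (xbar - theta0) / s_n and compare |Tn| with t_{n-1, 1-alpha/2}.
import math

x = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.4, 5.0, 4.6, 5.5]
theta0 = 5.0
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # sample variance
Tn = math.sqrt(n) * (xbar - theta0) / math.sqrt(s2)

crit = 2.201  # t_{11, 0.975}, standard table value
reject = abs(Tn) >= crit
print(round(Tn, 3), reject)
```

For this sample |Tn| is about 1, well inside the acceptance region.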
Theorem 19.1. Suppose we have an i.i.d. sample from a distribution with density function from an exponential family
f(x; θ, λ) = exp{θU(x) + λT(x) + A(θ, λ)}
with respect to some σ-finite measure. The parameter θ is one-dimensional, while λ can be multi-dimensional.
(i) Suppose that V = h(U, T) is independent of T (being two independent random variables/vectors) when θ = θ0. In addition, for each t, h(u, t) is an increasing function of u. Then the test defined as
φ(v) = 1 when v > k; c when v = k; 0 otherwise,
satisfying E{φ(V); θ0} = α, is a UMPU test for H0 : θ ≤ θ0 against H1 : θ > θ0.
(ii) Assume the same conditions as in (i), but in addition,
h(u, t) = a(t)u + b(t) with a(t) > 0.
Let us define
φ(v) = 1 when v > k1 or v < k2; cj when v = kj, j = 1, 2; 0 otherwise,
for some constants k1 > k2 such that E{φ(V); θ0} = α and E{V φ(V); θ0} = αE{V; θ0}. This test is a UMPU test for H0 : θ = θ0 against H1 : θ ≠ θ0.
Clearly, this theorem targets problems where the hypothesis involves one specific component θ of the parameter vector while leaving the other components λ unspecified. Because of this, we refer to λ as a nuisance parameter. According to this theorem, under some conditions the UMPU tests have the same form as the ones we obtained in the absence of nuisance parameters when the model is an exponential family. We use this theorem to construct a test for hypotheses about the variance σ² under the normal model in the following example.
Example 19.1. Suppose we have an i.i.d. sample from N(ξ, σ²). The joint density function can be written as
f(x; ξ, σ²) = exp{θU(x) + λT(x) + A(θ, λ)}
with θ = −1/(2σ²), λ = (nξ)/σ²; and U(x) = ∑x_i², T(x) = x̄.
Let
V = h(U, T) = U − nT² = ∑(x_i − x̄)².
It is seen that for any given value of σ², h(U, T) is independent of T. Thus, a UMPU test of size α for H0 : σ ≤ σ0 against H1 : σ > σ0 is given by
φ(V) = 1 when V > k; 0 otherwise.
Because V/σ0² has a chi-square distribution with n − 1 degrees of freedom when σ = σ0, k is σ0² times the (1 − α) quantile of this distribution.
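A numerical sketch of this UMPU variance test follows, with invented settings and the standard table value χ²₁₉(0.95) = 30.144:

```python
# UMPU test for H0: sigma <= sigma0, rejecting when
# V = sum (xi - xbar)^2 exceeds sigma0^2 times the chi-square(n-1) quantile.
import random

random.seed(3)
n, sigma0 = 20, 1.0
x = [random.gauss(0.0, sigma0) for _ in range(n)]   # data generated under H0
xbar = sum(x) / n
V = sum((xi - xbar) ** 2 for xi in x)

crit = sigma0 ** 2 * 30.144  # chi-square(19) 0.95 quantile, table value
print(round(V, 3), V >= crit)
```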
In the next example, we exchange the roles of the mean and variance parameters (in terms of notation), and therefore of U and T, to construct a UMPU test about the population mean.
Example 19.2. Suppose we have an i.i.d. sample from N(ξ, σ²). The joint density function can be written as
f(x; ξ, σ²) = exp{θU(x) + λT(x) + A(θ, λ)}
with λ = −1/(2σ²), θ = (nξ)/σ²; and T(x) = ∑x_i², U(x) = x̄.
This time, we find that
V = h(U, T) = U/√(T − nU²) = X̄/√(∑(Xi − X̄)²)
is independent of T(x) when ξ = 0; see the remark after this example. It is easily seen that V is an increasing function of U given T. Hence, a UMPU test for H0 : ξ ≤ 0 versus H1 : ξ > 0 follows directly from Theorem 19.1.
In order to construct a UMPU test for H0 : ξ = 0 versus H1 : ξ ≠ 0, we would need V to be linear in U given T, which is not the case here. However, a mathematical trick shows that a UMPU test for this pair of hypotheses is still given by
φ(V) = 1(|V| > k)
with k satisfying E{φ(V); ξ = 0} = α and E{V φ(V); ξ = 0} = αE{V; ξ = 0}. The statistic V is not in the exact form of the famous t-statistic; after some normalization steps, omitted here, the solution becomes the familiar two-sided t-test.
Remark: When ξ = 0 is given, T(x) is complete and sufficient for σ². At the same time, the distribution of V does not depend on σ. Thus, the classical theorem of Basu implies that they are independent.
19.2 Two-sample problem under the normality assumption
The purpose of the first example is to show that the commonly used F-test is a UMPU test for the one-sided hypothesis. The conclusion is likely also true for the two-sided hypothesis, with some complications. For the specific hypothesis to be discussed, we may simply put ∆ = 1 in the subsequent discussion; ignore this ∆ if you find it distracting.
Let X1, . . . , Xm and Y1, . . . , Yn be i.i.d. samples from N(ξ, σ²) and N(η, τ²) respectively. Their joint density function is given by
f(x, y; ξ, η, σ, τ) = exp{−(1/(2σ²))∑x_i² − (1/(2τ²))∑y_j² + (mξ/σ²)x̄ + (nη/τ²)ȳ − A(ξ, η, σ, τ)}.
Next, let us transform the parameters by
θ = −1/(2τ²) + 1/(2∆σ²)
and
λ1 = −1/(2σ²); λ2 = mξ/σ²; λ3 = nη/τ²
for some constant ∆ > 0. Let the corresponding sufficient statistics be
U = ∑_{j=1}^n Y_j²; T1 = ∑_{i=1}^m X_i² + (1/∆)∑_{j=1}^n Y_j²; T2 = X̄; T3 = Ȳ.
Test for equal variances. Consider the test for H0 : τ² ≤ σ² versus H1 : τ² > σ². This is the same as, with ∆ = 1,
H0 : θ ≤ 0 versus H1 : θ > 0.
Define
V = h(U, T1, T2, T3) = ∑_{j=1}^n (Yj − Ȳ)² / ∑_{i=1}^m (Xi − X̄)².
It is seen that given θ = 0, V has an F-distribution (after some scale adjustment) which does not depend on any parameters. Thus, it is independent of the sufficient and complete statistic (T1, T2, T3). To verify the sufficiency and completeness, work out the analytical form of the distribution family when θ = 0. It is also easy to show that h(U, T) is monotone in U given T.
These discussions show that the conditions of Theorem 19.1 are satisfied with these parameters and statistics. Hence, a proper test based on V is UMPU. That is, a UMPU test for H0 : τ² ≤ σ² versus H1 : τ² > σ² is given by
φ(V) = 1(V > k)
where k is chosen according to the F-distribution to make the size correct.
Extension. By setting ∆ to other values, we obtain many variations. One may also get the F-test for the two-sided hypothesis. It turns out that the UMPU test is not the same as the test whose rejection region places equal probability α/2 on each tail of V.
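A numerical sketch of the one-sided test with ∆ = 1 follows; the statistic V is rescaled to the usual variance ratio, the data are invented, and the standard table value F_{9,9}(0.95) ≈ 3.18 is hard-coded:

```python
# One-sided variance-ratio (F) test on invented samples with m = n = 10:
# reject H0: tau^2 <= sigma^2 when the ratio of sample variances is large.
X = [2.1, 1.8, 2.5, 2.0, 1.9, 2.3, 2.2, 1.7, 2.4, 2.1]
Y = [3.0, 1.2, 2.8, 1.5, 3.4, 1.1, 2.9, 1.6, 3.2, 1.3]
m, n = len(X), len(Y)
xbar, ybar = sum(X) / m, sum(Y) / n

ss_x = sum((x - xbar) ** 2 for x in X)
ss_y = sum((y - ybar) ** 2 for y in Y)
# V rescaled by (m-1)/(n-1): the usual ratio of sample variances
F = (ss_y / (n - 1)) / (ss_x / (m - 1))

crit = 3.18  # F(9, 9) 0.95 quantile, standard table value
print(round(F, 2), F >= crit)
```

Here the Y sample is visibly more dispersed and F = 13, so H0 is rejected.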
19.3 Test for equal means under the equal variance assumption
We certainly know that the two-sample t-test will show up here. Consider the case where τ² = σ². Under this assumption, the joint density of the two samples is given by
f(x, y; ξ, η, σ) = exp{−(1/(2σ²))(∑x_i² + ∑y_j²) + (mξ/σ²)x̄ + (nη/σ²)ȳ − A(ξ, η, σ)}.
Let
θ = (η − ξ)/{(m^{−1} + n^{−1})σ²}
and
λ1 = (mξ + nη)/{(m + n)σ²}; λ2 = −1/(2σ²).
The sufficient statistics are
U = Ȳ − X̄; T1 = mX̄ + nȲ; T2 = ∑_{i=1}^m X_i² + ∑_{j=1}^n Y_j².
To test H0 : ξ = η versus H1 : ξ ≠ η, we construct the statistic
V = (Ȳ − X̄)/√{∑_{i=1}^m (Xi − X̄)² + ∑_{j=1}^n (Yj − Ȳ)²},
which is a function of U, T1 and T2. Its distribution when ξ = η does not depend on λ1 or λ2. Thus, it serves as the proper statistic for constructing a UMPU test.
A UMPU test is given by
φ(V) = 1(|V| > k)
with k satisfying E{φ(V); η = ξ} = α and
E{V φ(V); η = ξ} = αE{V; η = ξ}.
This is apparently the two-sample t-test. Here are a few missing technical steps. First, the squared denominator in V can be written as
T2 − T1²/(m + n) − {mn/(m + n)}U².
This ensures that V is indeed a function of the required form.
The second concerns the linearity of V in U given T. The linearity does not hold exactly. However, V is a monotone function of
W = (Ȳ − X̄)/√{∑x_i² + ∑y_j² − (m + n)^{−1}(∑x_i + ∑y_j)²},
so a test based on W satisfies all conditions specified in the theorem. The two tests are, however, equivalent. The reason for using V instead of W is that the distribution of V is clearly related to the t-distribution, while the distribution of W is not "standard".
19.4 Test for equal means without the equal variance assumption
If σ² = τ² is not assumed (or not known to hold), there is no simple UMPU solution. This is the so-called Behrens–Fisher problem.
In the search for "optimal tests", one usually starts by placing restrictions on the test: we require the test to be "unbiased", "invariant", "similar" and so on.
With some consideration, it appears that a good test should reject the null hypothesis when
(Ȳ − X̄)/√{(1/m)S_x² + (1/n)S_y²} ≥ g(S_y²/S_x²)
for some suitable function g. If the test is required to be unbiased, however, only "pathological" functions g can have this property.
"Approximate solutions are available which provide tests that are satisfactory for all practical purposes." Among them, we probably know Welch's approximate t-test. In this case, the t-test statistic is defined to be
tn = (Ȳ − X̄)/√{(1/m)S_x² + (1/n)S_y²}.
Clearly, it is simply the standardized difference in sample means. Its distribution under H0 depends on the actual values of σ² and τ²; namely, it does not have the desired "pivotal" property under H0. However, it is generally recommended in the literature to approximate its distribution by a t-distribution with degrees of freedom
df = {(1/m)S_x² + (1/n)S_y²}² / [{(1/m)S_x²}²/(m − 1) + {(1/n)S_y²}²/(n − 1)].
Because this df is an approximation based on certain considerations, I am reluctant to declare that this is a "correct" degrees of freedom. It is best to call it Welch's approximation.
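Welch's approximation can be computed directly; the samples below are invented, and the computed df always falls between min(m, n) − 1 and m + n − 2:

```python
# Welch's t statistic and approximate degrees of freedom for two
# invented samples with visibly different spreads.
X = [5.2, 4.9, 5.5, 5.1, 4.8, 5.3]
Y = [6.1, 4.2, 7.0, 3.9, 6.5, 4.4, 5.8, 4.0]
m, n = len(X), len(Y)
xbar, ybar = sum(X) / m, sum(Y) / n
s2x = sum((x - xbar) ** 2 for x in X) / (m - 1)
s2y = sum((y - ybar) ** 2 for y in Y) / (n - 1)

vx, vy = s2x / m, s2y / n
tn = (ybar - xbar) / (vx + vy) ** 0.5            # standardized mean difference
df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
print(round(tn, 3), round(df, 1))
```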
I like the description of this famous problem on Wikipedia: "One difficulty with discussing the Behrens–Fisher problem, and proposed solutions, is that there are many different interpretations of what is meant by 'the Behrens–Fisher problem'. These differences involve not only what counts as a relevant solution, but even the basic statement of the context being considered."
My summary: if one attempts to find an "optimal" test for ξ = η without knowing that σ = τ in a two-sample problem, there may not be such a solution. If "optimality" is not strictly required, there are many sensible methods.
Summary. We have gone over the most famous tests based on data from normal models. We have not covered a complete list of all important cases, nor the theorems based on which these tests are justified to have various optimality properties.
The optimality goes away if the data are not from a normal distribution. My experience indicates, however, that the two-sample t-test works really nicely even when normality is severely violated. That is, the size of the test remains close to what is promised, and the power is superior compared to many alternatives. One can find situations where this t-test is very poor in these respects, but these situations are often too extreme to be taken seriously. The simplest case I can think of is when a few extremely large observed values are present compared to the majority of the observations.
In such cases, robust approaches are recommended. Some of them will be discussed next.
19.5 Assignment problems
1. Suppose (Xi, Yi), i = 1, 2, . . . , n, are a random sample from a bivariate normal distribution with joint density function
f(x, y; ξ, η, σ, τ, ρ) = {2πστ√(1 − ρ²)}^{−n} exp{−h(x, y; ξ, η, σ, τ, ρ)/(2(1 − ρ²))}
where
h(x, y; · · ·) = (1/σ²)∑(xi − ξ)² − (2ρ/(στ))∑(xi − ξ)(yi − η) + (1/τ²)∑(yi − η)².
(a) Determine the form of the UMPU test for H0 : ρ ≤ 0 versus H1 : ρ > 0.
(b) Determine the rejection region of the size-α test in terms of a quantile of a well known distribution (the t-distribution).
2. Let X1, . . . , Xn be a random sample from N(ξ, σ²).
(a) Show that the power of Student's t-test is an increasing function of ξ/σ for testing H0 : ξ ≤ 0 versus H1 : ξ > 0 (one-sided test).
(b) Show that the power of Student's t-test is an increasing function of |ξ|/σ for testing H0 : ξ = 0 versus H1 : ξ ≠ 0 (two-sided test).
3. Suppose that Xi = β0 + β1 ti + εi, where the ti's are fixed constants that are not all equal, the εi's are i.i.d. N(0, σ²), and β0, β1 and σ² are unknown parameters. Derive a UMPU test of size α for testing
(a) H0 : β0 ≤ θ0 versus H1 : β0 > θ0;
(b) H0 : β0 = θ0 versus H1 : β0 ≠ θ0.
Chapter 20
Non-parametric tests
The methods we have discussed so far are based on the assumption that
the data are generated i.i.d. from a distribution that is a member of some
regular parametric model. These methods become either inapplicable or
inferior if the data do not obey our command: "assume they have this or
that distribution".
Strictly speaking, the tests designed under a "parametric assumption" can
be carried out smoothly whether or not the model assumption is violated.
The real issue is that these tests may not have the prescribed size, nor
reasonable power to detect the departure of the underlying distribution from
the null hypothesis in the specific aspects we are interested in. For instance, suppose
we have i.i.d. bivariate observations (xi, yi) from some distribution. We wish
to test whether the two component random variables X, Y are independent. If the
joint distribution is bivariate normal, we may simply test the null hypothesis
that their correlation is 0. The likelihood ratio test is a valid choice. However,
if the joint distribution is not normal, the test will not be able to detect the
violation of the independence hypothesis when X, Y are not independent
but have 0 correlation. It can also happen that X, Y are independent, but
the normality-assumption-based likelihood ratio test statistic has a limiting
distribution very different from chi-square. If so, the null hypothesis can
be rejected with a much higher type I error than intended.
In other examples, the performance of a parametric test can still be respectable.
In the two-sample problem, the typical null hypothesis of interest is that two
populations have the same mean. In this case, the two-sample t-test is very
hard to beat in terms of having both accurate type-I error and good power.
One has to subject this test to very weird data sets to make it look bad.
Even though some parametric tests are rather robust, there is a need for
tests whose validity is not heavily dependent on the correctness of the model
assumption.
20.1 One-sample sign test.
Suppose we have an i.i.d. sample from some distribution whose c.d.f. is given
by F (x), a member of a family F . The family F to which F belongs is not very
important, so we do not specify it carefully.
The one-sample sign test is designed for the null hypothesis
H0 : p = F (u) ≤ p0
versus H1 : p = F (u) > p0 for some user-specified u and p0.
Let xi, i = 1, 2, . . . , n be the observed values. Apparently, the key
information from a single observation in this problem is whether xi > u or xi ≤ u.
Consequently, we define
∆i = 1(xi − u ≤ 0)
for i = 1, . . . , n.
If ∆i, i = 1, . . . , n are the only data we observe, then Y = ∑ᵢ ∆i is
sufficient for the probability of success p, the unknown value of F (u). The
UMP test for H0 versus H1 has the form
φ(Y ) = 1(Y > k) + c1(Y = k)
with k and c chosen so that the test has a pre-specified size.
When n is large, the distribution of Y is well-approximated by a normal
distribution. Hence, the usefulness of having a c to ensure the exact size of the
test disappears. For this reason, we give up the effort of determining the
value of c explicitly.
If n is not extremely large, let the observed value of Y be y0. The numerical
value of
pr(Y ≥ y0)
can be obtained via many standard statistical software packages, so one may
simply compute the p-value of the test. One may also activate the continuity
correction if it is deemed necessary: compute (1/2)pr(Y = y0) + pr(Y > y0).
This test does not depend on the specific form of F ; that is, we do not
have to specify a parametric model {f(x; θ) : θ ∈ Θ} for F . For this reason,
the test is referred to as non-parametric. The statistic Y is the number of
observations with xi − u ≤ 0, that is, the number of observations for which
xi − u has a non-positive sign. The test φ(Y ) is accordingly called
the sign test.
This test is UMP in general, rather than merely in the restricted sense
specified above, though it may not be so interesting to seriously prove
this claim. In the literature, the one-sample sign test may refer to the special
case where p0 = 0.5.
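The binomial tail computation behind the p-value can be sketched as follows; the function name and arguments are my own illustration, not a standard API:

```python
from math import comb

def sign_test_pvalue(x, u, p0=0.5, continuity=False):
    """One-sample sign test of H0: F(u) <= p0 against H1: F(u) > p0.

    Y counts observations with x_i <= u.  The p-value is the upper
    binomial tail pr(Y >= y0) at p = p0; with continuity=True the
    mid-p version (1/2) pr(Y = y0) + pr(Y > y0) is returned instead.
    """
    n = len(x)
    y0 = sum(1 for xi in x if xi <= u)
    pmf = [comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(n + 1)]
    if continuity:
        return 0.5 * pmf[y0] + sum(pmf[y0 + 1:])
    return sum(pmf[y0:])
```

For instance, with n = 10 and every observation below u, the ordinary tail gives pr(Y ≥ 10) = 0.5¹⁰ when p0 = 0.5.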
20.2 Sign test for paired observations.
Suppose a pair of observations of the same nature is obtained on each of n
experimental units (more commonly, sampling units). One wishes to test whether
or not the two observations have the same distribution.
Let the observations be denoted as (xi, yi), i = 1, 2, . . . , n. Assume
independence between pairs. Define
∆i = 1(yi > xi)
and Y = ∑ᵢ ∆i. If the two marginal distributions F and G are identical, then
Y has a conditional binomial distribution with probability of success p0 = 0.5
given n′ = ∑ᵢ 1(yi ≠ xi).
The sign test recommended in the literature rejects the null hypothesis
H0 : F (x) ≡ G(x) when Y exceeds k, a critical value decided by the conditional
distribution of Y given n′ and the desired size of the test α. The presumed
alternative hypothesis is H1 : F < G in some stochastic sense. Apparently,
there are many distribution pairs F ≠ G at which this test has rejection
probability smaller than α, a violation of the usual unbiasedness requirement
on a test.
One need not be overly alarmed. In most applications where the sign test
is used, one looks for evidence of F < G from a specific angle. Being unable to
reject all possible violations of H0 : F (x) ≡ G(x) with good power is not
much of a concern. Nevertheless, a statistician should be aware of
this issue.
As a reminder, a paired-sample t-test would be used if the data are from a
paired experiment and the normality assumption is not in serious doubt.
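The conditional binomial calculation can be sketched as follows, assuming ties are simply dropped; the function name is my own:

```python
from math import comb

def paired_sign_test(pairs):
    """Sign test for paired data: H0: F = G against H1: F < G.

    Tied pairs (y == x) are dropped; given n' informative pairs,
    Y = #{y_i > x_i} is Binomial(n', 1/2) under H0, so the p-value
    is the upper binomial tail pr(Y >= y0).
    """
    informative = [(x, y) for x, y in pairs if y != x]
    np_ = len(informative)
    y0 = sum(1 for x, y in informative if y > x)
    return sum(comb(np_, k) for k in range(y0, np_ + 1)) / 2 ** np_
```

With eight informative pairs all favouring y, the p-value is 1/2⁸ = 1/256.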
20.3 Wilcoxon signed-rank test
The tests in the last two sections do not take the magnitude of the differences
into account. Hence, one may exploit this fact and come up with superior
tests in some way.
Consider the paired experiment first. Let n′ be the number of observations
for which xi ≠ yi, and remove the sample units with xi = yi from
the sample. For simplicity, we assume xi ≠ yi in the first place and use n for
n′. Define
δi = 1 if yi > xi, and δi = −1 if yi < xi,
and ∆i = |yi − xi| for i = 1, 2, . . . , n. Let
Ri = ∑ⱼ 1(∆j < ∆i) + (1/2) ∑ⱼ 1(∆j = ∆i) + 1/2.
Do not be fooled by the seemingly complex formula above. It merely counts
the rank of ∆i among all the absolute differences ∆j, with tied values counted
as one half each (a midrank). Finally,
the Wilcoxon signed-rank statistic is defined to be
W = ∑ᵢ δiRi.
Note that W looks into both the sign δi and the rank Ri of each paired
observation. Because of this, if a pair has a large observed difference, its rank
Ri is higher, and it therefore contributes more toward increasing the size of
W.
The distribution of W under the null hypothesis stays the same as long
as the marginal distributions satisfy F ≡ G, given the number of unequal pairs
n. When F < G in some sense, we expect W to reflect this type of departure
from H0 better than the pure sign test does. Hence, the signed-rank test is
designed to reject H0 when W is large.
One may numerically evaluate pr(W > w0) and use it as the p-value of the
test.
The distribution of W is also asymptotically normal. When the population
distributions are continuous, so that there cannot be any ties in rank,
the mean of W is 0 and its variance is given by
var(W ) = n(n + 1)(2n + 1)/6.
When n is large, the test rejects H0 if
W > √{var(W )} z1−α
for a one-sided alternative.
This is the Wilcoxon signed-rank test for the paired experiment.
When there are ties in |yi − xi|, the variance of W is more complex; we
do not go over that case.
In most books, this statistic is defined as the total rank of the positive
yi − xi. Since the overall rank total is non-random, the test based on our W
is equivalent to the tests based on those statistics.
There is also a Wilcoxon signed-rank test for the one-sample problem; we
do not discuss it here.
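The midrank formula and the normal approximation above can be sketched in a few lines; the function name is mine, and the no-ties variance formula is used even when midranks are present, which is only an approximation:

```python
from math import sqrt, erfc

def signed_rank_test(pairs):
    """Wilcoxon signed-rank test for paired data, one-sided (G > F).

    Computes W = sum(delta_i * R_i) with midranks for tied |y - x|,
    and a normal-approximation p-value using the no-ties variance
    var(W) = n(n + 1)(2n + 1) / 6.
    """
    diffs = [y - x for x, y in pairs if y != x]   # drop zero differences
    n = len(diffs)
    abs_d = [abs(d) for d in diffs]
    # midrank: #(strictly smaller) + half of all ties (self included) + 1/2
    ranks = [sum(a < ai for a in abs_d) + 0.5 * sum(a == ai for a in abs_d) + 0.5
             for ai in abs_d]
    W = sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks))
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 6)
    p = erfc(W / (sd * sqrt(2))) / 2   # pr(Z > W / sd), Z standard normal
    return W, p
```

With eight pairs whose differences are 1, 2, . . . , 8, W attains its maximum 36 and the approximate p-value is well below 0.05.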
20.4 Two-sample permutation test.
Consider a situation where we have one random sample x1, . . . , xm from F
and another random sample y1, . . . , yn from G. It is of interest to test
H0 : F = G versus H1 : F ≠ G. Note that this problem is different from the
one in the last section: for instance, x1 and y1 are not linked here, unlike in
the paired experiment.
To have a meaningful discussion, assume both F and G are continuous.
That is, the model space contains all continuous distribution functions. Denote
the pooled sample as z = {x1, . . . , xm} ∪ {y1, . . . , yn}. We first regard
z as a set in the above, and then turn it into a vector of length m + n for the
subsequent discussion.
Define a set of vectors
Π(z) = {(zi1 , zi2 , . . . , zim+n) : (i1, . . . , im+n) is a permutation of (1, 2, . . . , m + n)}.
That is, Π(z) contains all vectors of length m + n that are permutations of
each other. The entries of the vectors are the observed values from the two
samples. We denote a member of Π(z) by π(z).
Let φ(X, Y ) be a test, namely a function taking values between 0 and 1,
such that
{(m + n)!}⁻¹ ∑ φ(x, y) = α,
where the sum is over (x, y) ∈ Π(z). In the above expression, we write (x, y)
in place of z to remind us of the connections. The above test is called a
permutation test of significance level α.
At first glance, this definition does not make much sense. Suppose that
the observed values are all different; that is, xi ≠ xj, yi ≠ yj for any i ≠ j,
and xi ≠ yj for any i and j. In this case, once the pooled sample z is
specified, Π(z) contains (m + n)! distinct vectors. Suppose the test
function φ(x, y) does not involve randomization. Then this function decides
which of these (m + n)! vectors are in the rejection region. Under the null
hypothesis F = G with both distributions continuous, given the set z, every
member of Π(z) has probability 1/(m + n)! of occurring. Hence, if no
randomization is involved, a permutation test of size α = 0.05 selects 5% of the
(m + n)! possible outcomes to form its rejection region.
The name of the test is now sensible, as the rejection region is formed by
permuted observed vectors. The size of the test is computed based on the
premise that the null distribution on Π(z) is uniform conditional on the set
of values in z.
The key issue left in a permutation test is: which (100α)% of the (m +
n)! permutations should be placed in the rejection region? In applications,
we introduce a function Tm+n defined on Π(z). This function can take at
most (m + n)! different values. We reject H0 if the observed Tm+n is among
the top (100α)% of these values. We now give two specific choices of Tm+n.
Difference in two sample means.
The choice of which permutations to place in the rejection region depends
on the "optimality requirement" and the potential alternative hypothesis.
What direction of departure do we care about? Without such a direction,
we can always find two samples that differ significantly in one way or another.
One should review the notion of the two desired properties of a test
statistic now.
Consider the situation where the alternative is H1 : G(x) = F (x − δ)
for some δ > 0. In statistics, we say G is obtained from F by a location
shift in this case. Under this alternative hypothesis, the samples from G are
stochastically larger than the samples from F . Any statistic that tends to
take larger values under H1 is a suitable candidate.
Suppose x1, . . . , xm and y1, . . . , yn are the two random samples respectively.
Let
Tm+n = n⁻¹ ∑ⱼ yj − m⁻¹ ∑ᵢ xi = ȳn − x̄m.
For each permutation
x′1, . . . , x′m; y′1, . . . , y′n
of the pooled values x1, . . . , xm; y1, . . . , yn, we compute
T′m+n = n⁻¹ ∑ⱼ y′j − m⁻¹ ∑ᵢ x′i = ȳ′n − x̄′m.
The observed Tm+n is one of the (m + n choose n) possible outcomes
denoted by T′m+n. It makes sense to select the permutations that result in
the largest values of T′m+n to form the rejection region.
To carry out this test, we compute all (m + n choose n) possible values of
T′m+n. One of them is the observed value Tm+n. If the observed value is among
the top 100α%, we reject H0 in favour of H1 : G(x) = F (x − δ) for some δ > 0.
In applications, if m + n is large, computing all possible values in "finite
time" is not feasible. Computer simulation may be used to evaluate only a
random subset of them and obtain an accurate enough rank of Tm+n.
If m + n is small, some T′m+n may equal Tm+n. A continuity correction
is often used: each tied T′m+n value is counted as half larger and half smaller
than Tm+n.
Under mild conditions, this test is asymptotically equivalent to the t-test.
That is, the two tests will give very close p-values over a wide range of p-values.
One may check against the definition to verify that this test is a permu-
tation test.
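The Monte Carlo shortcut described above, including the continuity correction for ties, can be sketched as follows; the function name and defaults are my own choices:

```python
import random

def perm_test_mean_diff(x, y, n_perm=10000, seed=1):
    """Monte Carlo permutation test against H1: G(t) = F(t - delta), delta > 0,
    using T = mean(y) - mean(x).  Permuted values tied with the observed T
    are counted half on each side (the continuity correction)."""
    rng = random.Random(seed)
    m, n = len(x), len(y)
    pooled = list(x) + list(y)
    t_obs = sum(y) / n - sum(x) / m
    larger = tied = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = sum(pooled[m:]) / n - sum(pooled[:m]) / m
        if t > t_obs:
            larger += 1
        elif t == t_obs:
            tied += 1
    return (larger + 0.5 * tied) / n_perm
```

For two small, well-separated samples the estimated p-value is close to the exact permutation value 0.5/(m + n choose n).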
Difference in ranks.
Consider the same alternative H1 : G(x) = F (x − δ) for some δ > 0.
Instead of examining the size of the difference in sample means ȳ − x̄, we
may first replace each observed value by its rank in the set of all observed
values.
Define
r(x) = ∑ⱼ 1(xj ≤ x) + ∑ₖ 1(yk ≤ x)
with sums over j = 1, . . . , m and k = 1, . . . , n.
Thus, r(yj) is the number of observations in the pooled sample that are
smaller than or equal to yj. Suppose both F and G are continuous. In
this case, we do not need to look into the possibility of tied observations.
Remedies to handle data with equal observed values are given in various
places; our focus is on conceptual issues. Let
Tm+n = ∑ⱼ r(yj).
The largest possible value of Tm+n is attained when xi ≤ yj for every pair
(i, j). A large observed value of Tm+n is indicative of departure from H0 in
favour of H1. Thus, the rank-based permutation test rejects H0 when the
observed Tm+n is among the top 100α% of values.
If H0 holds, then Tm+n has the same distribution as the sample total of a
simple random sample of size n drawn without replacement from the population
{1, 2, . . . , N} with N = m + n. Hence, by some simple calculations, we
have
E{Tm+n} = n(m + n + 1)/2
and
var(Tm+n) = nm(n + m + 1)/12.
It can be proved that
{Tm+n − E(Tm+n)} / √{var(Tm+n)} → N(0, 1)
in distribution as both n, m → ∞ with n/(n + m) converging to a limit in (0, 1).
An approximate one-sided rejection region can be determined by using this
limiting distribution.
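Under the no-ties assumption, the rank sum and its normal approximation can be sketched as follows; the function name is my own:

```python
from math import sqrt, erfc

def rank_sum_test(x, y):
    """Wilcoxon two-sample rank-sum test (distinct values assumed).

    T is the sum of the pooled ranks of the y's; it is compared with
    E(T) = n(m + n + 1)/2 and var(T) = nm(n + m + 1)/12 through the
    normal limit to give a one-sided p-value.
    """
    m, n = len(x), len(y)
    rank = {v: i + 1 for i, v in enumerate(sorted(list(x) + list(y)))}
    T = sum(rank[v] for v in y)
    mean = n * (m + n + 1) / 2
    sd = sqrt(n * m * (n + m + 1) / 12)
    return T, erfc((T - mean) / (sd * sqrt(2))) / 2   # pr(Z > (T - mean)/sd)
```

When the y's occupy the top n ranks, T takes its maximum and the approximate p-value is small.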
This test is called the Wilcoxon two-sample rank-sum test. It is also called
the Mann-Whitney U test, the Mann-Whitney-Wilcoxon test, or the
Wilcoxon-Mann-Whitney test. I am among those who are really confused by these names.
Note that Tm+n is related to ∑ᵢ∑ⱼ 1(xi < yj), which is a U-statistic; see
standard references on U-statistics. This might be the reason behind the name
U test.
Neither of the above two tests is uniformly most powerful. Tests based
on ranks are nonparametric; such tests are valued because their validity is
free from model mis-specification.
The formulation above is clearly geared toward a one-sided alternative. However,
a two-sided Wilcoxon two-sample rank test can be built on the same principle:
we reject the null hypothesis when Tm+n is extremely large or extremely small
among the T′m+n. I leave it to you to decide on a way to define the p-value. Clearly,
we do not truly have a principle for what quantity should be called the p-value.
20.5 Kolmogorov-Smirnov and Cramér-von Mises tests
Let x1, x2, . . . , xn be a set of i.i.d. observations from a continuous distribution
F . The model under consideration is F : the family of all continuous univariate
distributions. Having obtained the sample, one estimator of the cumulative
distribution function F is given by the empirical distribution
Fn(x) = n⁻¹ ∑ᵢ 1(xi ≤ x).
When the xi's are all different, Fn is the uniform distribution on {x1, . . . , xn}.
We may not be too happy, as this estimator is not a continuous c.d.f. while the
model F is made of continuous distributions. Nevertheless, Fn is a good estimator
of F in many ways.
Let
Dn(F ) = supₓ |Fn(x) − F (x)|.
By the famous Glivenko-Cantelli theorem, Dn(F ) → 0 almost surely as n →
∞ when F is the true distribution.
Suppose we want to test H0 : F = F0 versus H1 : F ≠ F0. It is
sensible to reject H0 when Dn(F0) is large. A test of the form
φ(x) = 1(Dn(F0) > k)
for some k > 0 is called the Kolmogorov-Smirnov test.
In applications, we would like to choose k so that the test has some pre-specified
size. This is possible only if we have an easy-to-compute expression for
pr{Dn(F0) > k}.
This is likely a mission impossible. However, Kolmogorov proved that
P{√n Dn(F0) ≤ t} → 1 − 2 ∑ⱼ (−1)ʲ⁻¹ exp(−2j²t²),
with the sum over j = 1, 2, . . . ,
as n → ∞. Thus, when n is large, we may use the right-hand side to pick a
value of t so that
2 ∑ⱼ (−1)ʲ⁻¹ exp(−2j²t²) = α
and reject H0 when √n Dn(F0) > t. The expression is certainly easy to use
to compute an approximate p-value.
How large does n have to be for the approximation to have satisfactory
accuracy? I do not have an answer, but one exists somewhere. I will
not try to give a proof. All I can say is that this large-sample result is crazily
elegant!
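Solving for t at a given level is a one-line bisection once the limiting c.d.f. is coded; the function names below are my own:

```python
from math import exp

def kolmogorov_cdf(t, terms=100):
    """Limiting c.d.f. of sqrt(n) * D_n(F0) under H0:
    K(t) = 1 - 2 * sum_{j>=1} (-1)**(j-1) * exp(-2 * j**2 * t**2)."""
    return 1 - 2 * sum((-1) ** (j - 1) * exp(-2 * j * j * t * t)
                       for j in range(1, terms + 1))

def kolmogorov_quantile(p):
    """Solve K(t) = p by bisection on [0, 5]."""
    lo, hi = 0.0, 5.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, kolmogorov_quantile(0.95) is roughly 1.358, matching the usual K-S critical value at α = 0.05.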
The Kolmogorov-Smirnov test measures the maximum discrepancy between
Fn and F . It might be more helpful to examine the average difference. The
Cramér-von Mises test works in this fashion:
Cn(F ) = ∫ {Fn(x) − F (x)}² dF (x).
Under the null distribution F0, it has been shown that
nCn(F0) → ∑ⱼ λj χ²₁ⱼ
in distribution, where λj = j⁻²π⁻² and χ²₁ⱼ, j = 1, 2, . . . are independent
chi-square random variables with one degree of freedom. The sum of the
coefficients is ∑ⱼ j⁻²π⁻² = 1/6, which is right: a direct calculation gives
E{nCn(F0)} = n ∫ F (1 − F )/n dF = 1/6.
There can certainly be many other ways to examine the difference between
Fn and F . At the latest check, there is an R function (ks.test) designed
to carry out the Kolmogorov-Smirnov test. See its help file if you
are interested in how the p-value is numerically computed for various ranges
of the sample size n.
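The statistic nCn(F0) itself is easy to compute via a standard identity obtained by expanding the integral over the ordered sample; the function name is mine:

```python
def cramer_von_mises_stat(x, F0):
    """n * C_n via the standard computational formula
    n * C_n = 1/(12n) + sum_i { F0(x_(i)) - (2i - 1)/(2n) }**2,
    where x_(1) <= ... <= x_(n) is the ordered sample."""
    xs = sorted(x)
    n = len(xs)
    return 1 / (12 * n) + sum((F0(v) - (2 * i - 1) / (2 * n)) ** 2
                              for i, v in enumerate(xs, start=1))
```

As a sanity check, if F0(x_(i)) lands exactly on (2i − 1)/(2n) for every i, the statistic reduces to its minimum 1/(12n).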
20.6 Pearson’s goodness-of-fit test
Suppose the observations are naturally categorized into K groups. At the
same time, these n observations are believed to be i.i.d. Let pk be the probability
that an observation falls into category k, k = 1, 2, . . . , K. One simple question
is: do the data support or contradict the hypothesis that pk = pk0, k =
1, 2, . . . , K? One possible approach to addressing such a concern is Pearson's
goodness-of-fit test. We phrase the question from the opposite angle: is there
significant evidence against the null hypothesis H0 : pk = pk0?
Let ok be the number of observations, out of the total n, that fall into category
k, and let ek = npk0 denote the expected value of ok under the null model.
Pearson's statistic for this test problem is defined to be
Wn = ∑ₖ (ok − ek)²/ek.
This statistic clearly has one desired property for a test: when the true
model deviates from the null hypothesis, we expect larger differences
between ok and ek. Thus, Wn is stochastically larger when H0 is severely
violated. Naturally, we reject H0 when the value of Wn is large.
The next desired property for a test statistic is to have a known distribution
under H0. This holds only approximately here: when n → ∞ while
K is a fixed value, it can be shown that
Wn → χ²K−1 in distribution.
Since the chi-square distribution is well documented, we may use its (1 − α)
quantile as the critical value for this test. Namely, the test would be:
reject H0 when Wn > χ²K−1(1 − α).
Of course, this writing has assumed that a size-α test is desired in the first place.
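The recipe above, computing Wn and comparing it with a chi-square tail, can be sketched as follows. The counts are hypothetical, the function names are mine, and the tail formula used is the closed form valid for even degrees of freedom only:

```python
from math import exp

def pearson_gof(obs, p0):
    """Pearson's W_n = sum_k (o_k - e_k)**2 / e_k with e_k = n * p_k0."""
    n = sum(obs)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(obs, p0))

def chi2_sf_even_df(x, df):
    """Chi-square upper-tail probability, valid for EVEN df only:
    pr(chi2_df > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)**k / k!."""
    assert df % 2 == 0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        if k:
            term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

# Hypothetical counts for K = 5 equally likely categories, so df = K - 1 = 4.
obs = [22, 18, 25, 15, 20]
wn = pearson_gof(obs, [0.2] * 5)        # W_n = 2.9
p_value = chi2_sf_even_df(wn, df=4)     # about 0.57: no evidence against H0
```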
In a more realistic situation, suppose these K categories represent
the number of boys in a family with K − 1 children. Is this number truly
binomially distributed, as it would be under the assumption that there is
no correlation between siblings and the population is homogeneous? In this
case, the pk0's are not completely specified, but we have the analytical form
p0k(θ) = (K − 1 choose k − 1) θᵏ⁻¹(1 − θ)ᴷ⁻ᵏ.
Namely, they are specified by a single parameter, the probability of success.
In this case, let θ̂ be the maximum likelihood estimate of θ and compute
êk = np0k(θ̂).
Let us revise the definition of Wn to get
Wn = ∑ₖ (ok − êk)²/êk.
Although we have to estimate θ, the limiting distribution of Wn is only altered
slightly:
Wn → χ²K−2 in distribution.
In general, if the p0k are functions of a parameter θ of dimension d, the same
approach is applicable. The limiting distribution remains chi-square, with
degrees of freedom K − d − 1.
Since this is a course in mathematical statistics, one may ask how to establish
the asymptotic result. One approach is to connect Wn with the likelihood
ratio test. This is left as an assignment problem.
The applied aspect of this test can be more troublesome. The biggest
concern is when the chi-square approximation kicks in. The rule of thumb is: do
not use the goodness-of-fit test unless min{ek} ≥ 5. In other applications,
the observations are not "naturally categorized"; the step of creating K
categories in order to examine the goodness-of-fit can be controversial.
20.7 Fisher’s exact test
In a classical and likely fictional example, a lady claimed that she could tell,
by tasting the mixture, whether tea was added to the cup before or after the
milk. Carrying out an experiment and analyzing the resulting data is
the best way to settle the existence of such an ability. We may take the
inability as the null hypothesis, rejection of which leads to the claim of this
ability.
Suppose A + B cups of tea of the two types of preparation were prepared
as such. Assume that the lady has no ability to tell the two types of tea apart.
Her selection of A cups as type A would then be no different from randomly
selecting A cups out of the A + B and identifying them as cups in which the
tea was added after the milk. Note that in this experiment, the total number
of cups, as well as how many of them are of type A, is non-random and known.
The randomness comes from the lady, who attempts to identify the type-A
teas given the knowledge of the split into A and B. Let X be the number of
correctly identified type-A cups.
Under the null hypothesis that there is no association between being type
A and being identified as type A, the random variable X has a hypergeometric
distribution:
pr(X = x) = (A choose x)(B choose A − x) / (A + B choose A).
Clearly, this distribution does not depend on any unknown parameters, as
A and B are known. A large value of X is evidence against the null
hypothesis, pointing in the direction of positive association. It is therefore
sensible to reject the null hypothesis when X is large. The statistic X has the
two desired properties of a test statistic. Therefore, we would compute the
p-value for the alternative that she has the skill as
pr(X > x0) + 0.5 pr(X = x0)
where x0 is the observed value. The continuity correction here is intuitively
helpful, but there is no theory to support this practice.
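The hypergeometric tail above is straightforward to evaluate exactly; the function name is mine:

```python
from math import comb

def tea_pvalue(A, B, x0):
    """Fisher's exact p-value pr(X > x0) + 0.5 * pr(X = x0), where
    pr(X = x) = C(A, x) * C(B, A - x) / C(A + B, A)."""
    denom = comb(A + B, A)
    def pmf(x):
        return comb(A, x) * comb(B, A - x) / denom
    return sum(pmf(x) for x in range(x0 + 1, A + 1)) + 0.5 * pmf(x0)
```

In the classic design with A = B = 4, identifying 3 of the 4 type-A cups correctly gives a p-value of 9/70, roughly 0.13.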
More generally, a 2× 2 table may be formed as follows:
n11 n12 n1+
n21 n22 n2+
n+1 n+2 n
The experiment is carried out so that n units are placed into the 4 possible
cells of this table. Without placing any restrictions, the probability of each unit
falling into cell (i, j) may be denoted as pij for i, j = 1, 2. The joint distribution
of (n11, n12, n21, n22) is multinomial with these probabilities.
Suppose the row and column classifications are independent, so that pij =
pi,·p·,j for some pi,· and p·,j. Conditioning on the marginal totals (corresponding
to the knowledge of A and B in the tea experiment), n11 has a hypergeometric
distribution. Again, n11 has the two desired properties of a test statistic, in a
conditional sense: the marginal totals are regarded as fixed. Extreme values of
n11 give evidence against the independence assumption. One may select n11 or
−n11 as a test statistic, or find a way to combine them.
Regardless of the choice of test statistic, the subsequent p-value of the test
can be computed via the hypergeometric distribution, which does not depend on
any unknown parameters. The resulting p-value is exact in theory, when
rounding error is not taken into consideration. In other words, no large-sample
approximations are needed for the p-value computation. This property,
together with its inventor, gives the test its name.
20.8 Assignment problems
1. Carry out two permutation tests on the Precambrian iron formation
data to be given. Regard it as a two-sample problem, not a paired-sample
problem.
Consider the hypothesis that the first two types have the same mean
(H0) versus the hypothesis that the first two formations have unequal
means (H1). Namely, we wish to carry out two-sided tests for the mean.
You are asked to carry out a number of parametric and non-parametric
tests as follows.
(a.1) Carry out the permutation test based on the difference in means. Namely,
compute T′n+m = |ȳn − x̄m| for all permuted X, Y values. Obtain exact
counts, with continuity correction, of how many of them are larger than
Tn+m.
(a.2) Carry out the permutation test based on ranks. Namely, obtain the
rank of each observed value, r(x) = ∑ 1(xi ≤ x) + ∑ 1(yj ≤ x), and
compute T′n+m = ∑ᵢ r(x′i). The rest is the same as in (a.1), but you
need to adjust for this being a two-sided test.
Report both p-values.
(b) Use the t-test and the Wilcoxon rank-sum test based on the CLT and report
both p-values. Again, use two-sided tests accordingly. Directly apply R
functions.
(c) Compare these 4 p-values and comment on what you find. Taking
the significance level as 0.05, do they contradict each other?
Remark: (c) is not a right-or-wrong question. The number of permutations
in this example is around 180K.
The data set is from an article on the origin of Precambrian iron formation,
which reported data on percentage iron for 4 types of iron formation
(1 = carbonate, 2 = silicate, 3 = magnetite, 4 = hematite). Only the first two
groups are given below:
1: 20.5 28.1 27.8 27.0 28.0 25.2 25.3 27.1 20.5 31.3
2: 26.3 24.0 26.2 20.2 23.7 34.0 17.1 26.8 23.7 24.9
2. In a clinical trial, a total of 214 oral cancer patients were recruited.
Their recurrences after a number of years were observed, together with
information on the site of the original tumour.
This part of the outcomes can be summarized by the following 2×2
table.
Recurrence Non-recurrence marginal total
Low site 11 26 37
High site 28 149 177
39 175 214
(a) Does the "high site" tumour exhibit a higher rate of recurrence?
(b) Do the recurrence risks differ between the two groups of patients?
Remark: answer these questions as a real-world problem. I recommend
Fisher's exact test in (a) and the Wald test for H0 : r1 = r2 in (b).
It is unsatisfactory to simply give two p-values via an R
function; give a few sentences on how the p-values are calculated in
statistical/probability terminology.
3. Based on the same data as in the last question, use Pearson's goodness-of-fit
test for the hypothesis that rows and columns are independent.
Give all intermediate values.
4. Find the 10% upper quantile of the limiting distribution of the
Kolmogorov-Smirnov test statistic to the precision of the third decimal
place. Use a short program and justify your precision. Do not use an
all-powerful R function.
Chapter 21
Confidence intervals or regions
Suppose we have a sample X from a distribution that belongs to F . In a
parametric setting, the distributions in F are labeled by θ, and the "true"
distribution of X is the one with label θ = θ0, a value that conceptually
exists but is unknown to us.
We generally do not bother to use a special symbol θ0 for the true
parameter value unless it would otherwise become ambiguous. Most often, we
simply declare that the parameter of the distribution of X is θ and that its
range, the parameter space, is Θ. Furthermore, we implicitly assume that Θ is
a subset of Rd with all the mathematical properties needed, such as being open,
convex and so on. In addition, the distribution of X depends on the value
of θ in a continuous fashion.
Based on the realized value of X, one can estimate θ using any preferred
method; this is called point estimation. One may also make a judgement on
whether or not θ is a member of an elite subset H0 by conducting a hypothesis
test. The third option is to specify a subset Θ0 of Θ such that we are confident
that θ ∈ Θ0. When d = 1, we usually prefer Θ0 to be an interval in Θ.
When d > 1, Θ0 will be called a confidence region. It is preferable that Θ0 be
connected and contain no holes; more often than not, a convex set is most
appealing.
We require that Θ0 be decided by the value of X alone, not depending
on any unknown parameters. Thus, it is a random set whose randomness is
inherited from the distribution of X. Just as a statistic is a function of the
data exclusively, a confidence region is a set-valued function of the data
exclusively.
In Bayesian data analysis, the distribution of X is regarded as a realized
distribution from a "super-population", which is specified by a prior distribution
regarded as known. In a parametric Bayes analysis, the distribution of X is
f(x; θ0) such that θ0 is a realized but unobserved value of a random variable θ
whose distribution is the known prior distribution. The prior distribution is
often denoted as π(θ), which also stands for its density function; the
corresponding cumulative distribution function is often denoted as Π(θ). The
conditional distribution of θ given X is called the posterior distribution, and it
is derived via the Bayes formula. A θ-region on which the posterior distribution
has high density is referred to as a credible region. The topic of credible regions
will be discussed later.
Constructing a confidence interval or region is easy; the real challenge is
to construct one with desirable properties. We should specify what
properties such an interval estimator should have, and how to construct
intervals with these properties.
Definition 21.1. An interval/region C(x), as a function of the realized value
of X, is a confidence interval/region of θ at level 1 − α for some α ∈ (0, 1), if
inf_θ pr{θ ∈ C(X); θ} = 1 − α.
The probability calculation in this definition is done by regarding θ as the
true parameter value of the distribution of X. The value of θ is not random
in the definition of the frequentist confidence region; the interval C(X)
is random due to the randomness of X. In comparison, when θ is regarded
as a random variable in a Bayes analysis, a similar notion is needed and must
be defined separately. Corresponding to the confidence region, the Bayes
version is called a "credible region". There is no specific shape requirement on a
confidence/credible region in their definitions, yet we have preferences.
The probability that C(X) covers θ generally depends on the specific value
θ takes. It is desirable to have the coverage probability not depend on this
specific value of θ. If this is achieved, the infimum operation in the above
definition is redundant.
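The coverage probability interpretation can be checked by simulation; the following sketch, with function name and defaults of my own choosing, estimates pr{θ ∈ C(X)} for the familiar mean interval x̄ ± 1.96 s/√n under normal data:

```python
import random
import statistics

def coverage(theta=0.0, sigma=1.0, n=30, z=1.96, trials=2000, seed=7):
    """Monte Carlo estimate of pr{theta in C(X)} for the interval
    xbar +/- z * s / sqrt(n) under i.i.d. N(theta, sigma^2) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(theta, sigma) for _ in range(n)]
        xbar = statistics.fmean(x)
        half = z * statistics.stdev(x) / n ** 0.5
        hits += xbar - half <= theta <= xbar + half
    return hits / trials
```

For normal data the estimated coverage sits near 0.95 regardless of the true θ, illustrating the pivotal construction where the infimum in the definition is redundant.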
The models in real-world applications are often too complex for us to find a
sensible C(X) that meets the standard of Definition 21.1. Often, people
implicitly use the following convention, which is wrong in the strict sense. To
make a distinction, we call it an asymptotic confidence region and place a formal
definition here.
Definition 21.2. Suppose the observation X from a distribution, a member of
a distribution family, is regarded as an observation Xn in an imaginary sequence
X1, X2, . . . , Xn, . . . from a corresponding imaginary population sequence, so
that the parameter θ remains interpretable throughout.
An interval Cn(xn), as a function of the current realized value of Xn, is an
asymptotic confidence interval of θ at level 1 − α for some α ∈ (0, 1), if
inf_θ lim_{n→∞} pr{θ ∈ Cn(Xn); θ} = 1 − α.
The n in the above definition usually stands for the sample size and Cn
is a confidence region derived based on a principled procedure applicable to
any sample size n (or other dynamic index). The relevance of this definition largely depends on the sensibility of the population sequence and the principled approach to constructing Cn(Xn). In addition, the sample size n in application should be large enough so that the value of pr{θ ∈ Cn(Xn); θ} is not far off from its limit when θ is within the anticipated region.
Similar to the optimality notion in hypothesis testing, comparisons between different confidence regions are possible only if they are lined up by their confidence levels. If two confidence intervals (or two construction procedures) have the same confidence level, the one having a shorter average/expected length is preferred. One may in addition prefer that the variation in the length of the interval is low.
Suppose C(X) = [−2, 2] is a confidence interval for the population mean.
If this interval is sensibly constructed, it is generally true that the most likely value of θ is located at the centre of this interval. Namely, [−1, 1] is more likely to contain θ than [−2,−1] ∪ [1, 2] does. Yet this belief is neither supported by nor part of the formal definition of the confidence interval.
Unlike the theory for hypothesis testing, there seem to be fewer solid mathematical criteria for the optimality of confidence regions. Confidence regions are often derived from other well-known procedures. If these procedures
292 CHAPTER 21. CONFIDENCE INTERVALS OR REGIONS
have optimal properties in the sense of their original purpose, statisticians seem comfortable recommending the corresponding confidence regions. There are some optimality criteria in classical textbooks, though they are not convenient to use and are generally ignored in contemporary textbooks.
We now give a few generally recommended approaches for constructing
confidence intervals.
21.1 Confidence intervals based on hypothesis test
Assume a size-α test is given for each simple null hypothesis H0 : θ = θ0 against the composite alternative hypothesis H1 : θ ≠ θ0; denote this test by φ(x; θ0). Thus, θ0 is rejected when φ(x; θ0) = 1, assuming no randomization is involved. That is, we consider only the case where φ(x; θ0) takes value either 0 or 1.
Based on this test φ(x; θ) where θ is generic, we define
C(x) = {θ : φ(x; θ) < 1}.
It is easy to see that
pr{θ ∈ C(x); θ} = pr{φ(X; θ) < 1; θ} = 1 − E{φ(X; θ); θ} ≥ 1 − α
for all θ ∈ Θ. Thus, C(x) is a 1 − α level confidence region. In most cases,
C(x) obtained this way for one dimensional θ is an interval though it does
not have to be an interval. Clearly, the coverage probability may exceed 1−α
at some θ values. At the same time, the coverage probability is never below
1− α.
Example 21.1. Suppose we have a random sample from N(θ, σ2). We hope
to construct a confidence interval for θ. One approach is to use the likelihood ratio test for each H0 : θ = θ0. The test statistic can be simplified to
T(x; θ0) = √n |x̄ − θ0|/sn.
The rejection region for each θ0 is given by
{x : T (x; θ0) ≥ tn−1(1− α/2)}.
Consequently, the confidence interval based on this test is
{θ0 : T (x; θ0) ≤ tn−1(1− α/2)}
or
[x̄ − tn−1(1 − α/2) sn/√n, x̄ + tn−1(1 − α/2) sn/√n].
It is nice to see that the outcome is indeed an interval.
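As a quick numerical check, the interval in this example can be computed directly. The sketch below assumes a simulated sample (the true θ = 5 and σ = 2 are illustrative choices, not from the text) and uses scipy for the t quantile.

```python
import numpy as np
from scipy import stats

# Simulated data standing in for a real sample from N(theta, sigma^2);
# theta = 5 and sigma = 2 are illustrative choices.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=30)

n = x.size
xbar = x.mean()
s_n = x.std(ddof=1)                      # sample standard deviation s_n
t_crit = stats.t.ppf(0.975, df=n - 1)    # t_{n-1}(1 - alpha/2), alpha = 0.05

# Inverting the t-test of H0: theta = theta0 for every theta0 gives exactly
# the interval [xbar - t_crit * s_n / sqrt(n), xbar + t_crit * s_n / sqrt(n)].
lower = xbar - t_crit * s_n / np.sqrt(n)
upper = xbar + t_crit * s_n / np.sqrt(n)
```

The same interval is returned by `stats.t.interval`, which can serve as a cross-check.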
21.2 Confidence interval by pivotal quantities
A pivot is a function of both data and unknown parameter such that it has
a distribution not dependent on unknown parameters. Suppose q(x; θ) is a
pivot. Then there is a quantity, say qα such that
P{q(X; θ) > qα; θ} = α
when q(X; θ) has a continuous distribution. The existence of qα is ensured
because the distribution of q(X; θ) does not depend on the unknown value
θ. If q(X; θ) has a discrete distribution, some continuity corrections may be
used.
Let
C(x) = {θ : q(x; θ) < qα}.
It is easily seen that C(x) is a (1 − α)-level confidence region of θ.
Examples of pivotal quantity are most readily available in location-scale
families.
Example 21.2. Suppose we have a random sample of size n from N(θ, σ2).
Let us try to find a confidence interval for σ2.
It is well known that
q(x; σ²) = Σ(xi − x̄)²/σ²
has a chi-square distribution with n − 1 degrees of freedom. Thus, it is a pivot.
Let χ²_{n−1}(0.95) be the 95th percentile of the chi-square distribution with n − 1 degrees of freedom. Then,

{σ² : Σ(xi − x̄)²/σ² < χ²_{n−1}(0.95)} = [ Σ(xi − x̄)²/χ²_{n−1}(0.95), ∞ )
is a 95% confidence interval for σ2.
If a two-sided confidence interval is asked for, then

{σ² : Σ(xi − x̄)²/σ² ∈ (χ²_{n−1}(0.025), χ²_{n−1}(0.975))} = [ Σ(xi − x̄)²/χ²_{n−1}(0.975), Σ(xi − x̄)²/χ²_{n−1}(0.025) ]

is a choice with 95% confidence level.
It is natural for us to use the 0.025 and 0.975 quantiles in the above example. However, using the 0.02 and 0.97 quantiles will also give us a 95% two-sided confidence interval. Which one is better? Should we have a look at their average lengths?
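The question about average lengths can be answered directly here: an interval of this family has length Σ(xi − x̄)² × {1/χ²_{n−1}(a) − 1/χ²_{n−1}(b)}, so comparing two quantile pairs (a, b) with b − a = 0.95 reduces to comparing this deterministic factor. A minimal sketch, assuming n = 20 for illustration:

```python
from scipy.stats import chi2

n = 20            # assumed sample size, for illustration only
df = n - 1

def length_factor(a, b, df):
    # Length of [S / chi2_df(b), S / chi2_df(a)] divided by S = sum (x_i - xbar)^2.
    return 1.0 / chi2.ppf(a, df) - 1.0 / chi2.ppf(b, df)

equal_tail = length_factor(0.025, 0.975, df)
shifted    = length_factor(0.020, 0.970, df)
# Both choices give 95% coverage; for every data set the interval with the
# smaller factor is shorter, so the comparison of average lengths is immediate.
```

For this df the equal-tailed choice turns out shorter than the (0.02, 0.97) choice, although equal tails do not minimize length among all 95% pairs.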
In applications, some function of both data and parameter may have a distribution not dependent on the "true" parameter value in an asymptotic sense. In this case, one may activate Definition 21.2 to justify asymptotic confidence regions.
21.3 Likelihood intervals.
By definition, a confidence region is characterized by its level of confidence. Yet the interval makes more sense if a parameter value within the region is more "likely" than a parameter value outside of the region to be the "true" value of the parameter. This is particularly the case for the confidence interval for σ² in the last example. The notion of a likelihood interval or the related Bayesian approach seems to be an improvement in this direction.
Suppose we have a random sample of size n from a parametric family
{f(x; θ) : θ ∈ Rd}. Consider the problem of constructing a confidence inter-
val/region for θ. Since by “definition”, the maximum likelihood estimator
is the most "likely" value of the parameter, the interval should contain the MLE θ̂. In addition, if the likelihood value at θ′ is almost as large as the likelihood value at θ̂, then θ′ is also a good candidate for inclusion in the interval.
This notion quickly leads to a likelihood region/interval of the form

C(X) = {θ : L(θ)/L(θ̂) ≥ c}

where θ̂ is the MLE, and c is a positive constant to be chosen.
By Definition 21.1, to make a likelihood interval into a confidence interval,
all we need is to choose c such that
P{θ ∈ C(X); θ} ≥ 1− α
is true for any θ, when the pre-specified level is 1− α.
There may not exist a meaningful constant c such that the coverage probability is no less than 1 − α under all θ. However, when the sample size n is large and the model is regular, it is possible to find a cn such that the coverage probability is approximately 1 − α for each θ. That is, the difference is a quantity that converges to 0 as n → ∞, whichever θ is the true value. This gives an asymptotic confidence region at (1 − α)-level by Definition 21.2.
Students with a rigorous mathematics background may notice that the asymptotic notion here is not uniform in θ. We only require pointwise convergence, not uniform convergence over the parameter space.
Example 21.3. Consider the situation where we have an i.i.d. sample of size
n from an exponential distribution parameterized by its mean θ. A confidence
interval for θ is desired.
Let X¯n be the sample mean and regard it as a random variable. It is also
the MLE of θ. The log likelihood function is given by
ℓn(θ) = −n log θ − n θ⁻¹ X̄n.
The likelihood ratio statistic is given by
Rn(θ) = 2n{− log(X¯n/θ)− 1 + (X¯n/θ)}
which is convex in X¯n/θ. Hence, a likelihood interval for θ has the form
C(X) = {c1X¯n ≤ θ ≤ c2X¯n}
for some c1 < c2 such that
Rn(c1X¯n) = Rn(c2X¯n)
and
P{X¯n/θ ∈ (1/c2, 1/c1)} = 1− α
where 1− α is the pre-specified confidence level.
Suppose one would like to construct a confidence interval for the rate parameter λ = 1/θ in this example. It is easily seen that the confidence interval based on the likelihood approach would simply be

C′(X) = {(c2 X̄n)⁻¹ ≤ λ ≤ (c1 X̄n)⁻¹}
where c1 and c2 are the same constants in the example. Note that
θ ∈ C(X)
if and only if
1/θ = λ ∈ C ′(X).
We say that likelihood intervals are equivariant, just like their counterpart, the MLE. This property is not shared by other methods, such as the one to be introduced.
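For Example 21.3, the constants c1 and c2 can be computed numerically once a calibration for Rn is chosen. The sketch below uses the asymptotic χ²₁ calibration of the likelihood ratio (so this gives an asymptotic 95% interval, not the exact one from the example) and an assumed sample size n = 50; `brentq` finds the two roots of Rn written as a function of u = X̄n/θ.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

n = 50                              # assumed sample size, for illustration
cutoff = chi2.ppf(0.95, df=1)       # asymptotic chi-square(1) calibration

def R(u):
    # R_n as a function of u = Xbar_n / theta; convex with minimum 0 at u = 1.
    return 2.0 * n * (u - 1.0 - np.log(u))

# Find the two roots of R(u) = cutoff on either side of u = 1.
u_lo = brentq(lambda u: R(u) - cutoff, 1e-8, 1.0 - 1e-8)
u_hi = brentq(lambda u: R(u) - cutoff, 1.0 + 1e-8, 50.0)

# Since theta = Xbar_n / u, the likelihood interval is [c1 Xbar_n, c2 Xbar_n] with
c1, c2 = 1.0 / u_hi, 1.0 / u_lo
```

The interval for λ = 1/θ follows by equivariance, with no further computation.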
21.4 Intervals based on asymptotic distribution of θ̂
It is arguable whether or not this is a new method. We might call it Wald's method, yet it has too many moving parts to be solidly attributed as such. Often, √n(θ̂ − θ) is asymptotically normal with limiting variance σ². When σ² is known,

q(X; θ) = √n(θ̂ − θ)/σ
is an approximate pivotal quantity. Because of this, an approximate two-sided 1 − α confidence interval for θ is given by

θ̂ ± z_{1−α/2} σ/√n.

If σ is unknown but a consistent estimator σ̂ is available, then a substitute is given by

θ̂ ± z_{1−α/2} σ̂/√n.

It might be more convenient to write the above as

θ̂ ± z_{1−α/2} √(v̂ar(θ̂)).
The meaning of the above notation is obvious.
Example 21.4. Let X1, . . . , Xn be an i.i.d. sample from the Poisson distribution with mean parameter denoted as θ. The MLE of θ is given by θ̂ = X̄n, the sample mean. Construct a 95% CI for θ.
Solution: It is well known that √n(θ̂ − θ) →d N(0, θ). Thus, a 95% CI for θ is given by

X̄n ± 1.96 √(X̄n/n).

When 1.96 √(X̄n/n) > X̄n, one must set the lower confidence limit (bound) to 0.
It is equally appropriate to notice that

√n(√θ̂ − √θ) →d N(0, 1/4).

Hence, one may construct a 95% CI based on

√n |√θ̂ − √θ| ≤ 1.96/2.

Solving this inequality, we get

[{√X̄n − 1.96/(2√n)}⁺]² < θ < {√X̄n + 1.96/(2√n)}².
The third choice is to work with

√n(θ̂ − θ)/√θ →d N(0, 1).
With this asymptotic pivotal quantity, the CI has lower and upper limits

X̄n + 1.96²/(2n) − √(1.96⁴/(4n²) + 1.96² X̄n/n)

and

X̄n + 1.96²/(2n) + √(1.96⁴/(4n²) + 1.96² X̄n/n). □
Many students have a natural tendency to ask the following question: which of the above confidence intervals is correct? The answer is: none of them. The reason is that the critical value 1.96 is based on the limiting distribution of θ̂ in every case. Hence, none of them has an exact 95% coverage probability (even if rounding is not taken into account).
If the above answer is unsatisfactory to you, then you need to think hard about what "correct" means. If approximate 95% CIs are acceptable, all three are fine.
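The three intervals are easy to compute side by side. The sketch below uses a simulated Poisson sample (the true mean 4.0 and n = 100 are illustrative assumptions); all three are approximate 95% intervals.

```python
import numpy as np

# Simulated Poisson sample; true mean 4.0 and n = 100 are illustrative.
rng = np.random.default_rng(7)
x = rng.poisson(lam=4.0, size=100)
n, xbar = x.size, x.mean()
z = 1.96

# (1) Plug-in (Wald) interval, truncated at 0.
ci1 = (max(xbar - z * np.sqrt(xbar / n), 0.0), xbar + z * np.sqrt(xbar / n))

# (2) Variance-stabilized interval via the square-root transformation.
ci2 = (max(np.sqrt(xbar) - z / (2 * np.sqrt(n)), 0.0) ** 2,
       (np.sqrt(xbar) + z / (2 * np.sqrt(n))) ** 2)

# (3) Score-type interval: solve n * (xbar - theta)^2 <= z^2 * theta for theta.
centre = xbar + z ** 2 / (2 * n)
half = np.sqrt(z ** 4 / (4 * n ** 2) + z ** 2 * xbar / n)
ci3 = (centre - half, centre + half)
```

With n this large the three intervals nearly coincide; the differences between them matter mainly for small n.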
The real question in your mind might be: which one is the best? An-
swering this question needs an optimality criterion. We do not have one at
the moment. Now it boils down to a weak question: what are their relative
merits?
The first one is analytically simple. If the sample size n is not very large, the normal approximation can be poor. The CI may even have a negative lower bound. Chopping off the segment of the CI containing the negative values is a mathematical must but somewhat unnatural. The interval is otherwise always symmetric with respect to θ̂ = X̄n. This is somewhat unattractive. I would use this one when n and X̄n are both large. How large is large? I do not have an absolute standard.
The second one is nice in one way: after transforming the parameter θ into g(θ) = √θ, the estimator g(θ̂) has a limiting distribution with constant variance. For this reason, this type of transformation is called a variance-stabilizing transformation. Since the limiting distribution does not depend on unknown parameter values, this interval is truly based on an approximate pivot. If n is not large, this is a good choice.
The third one has its own merit. Scaling θ̂ − θ by a function of θ creates a more complex pivot. This often leads to more naturally shaped confidence regions (intervals). While I have intuitions for this approach, I cannot come up with concrete evidence for this preference.
Recall that testing a hypothesis on the value of θ based on the limiting distribution of the MLE θ̂ is called Wald's method. I am not sure if this group of intervals should be credited to Wald, but I feel it is natural to call them Wald's intervals/regions.
Topics we do not have time to go over: Multi-parameter; Binomial exam-
ple; Odds ratio; Intervals for quantiles.
21.5 Bayes Interval
Under the Bayesian setup, the parameter θ is a sample from some prior distribution. Thus, its value is itself a realization of a random variable. Constructing a confidence interval for a random quantity is a different topic. However, we may combine our prior information, if any, with data from f(x; θ). The resulting information about θ is completely summarized in the posterior distribution of θ. If one must take a guess on a region in which this θ has been located based on the Bayesian setup, she would select the region with the highest posterior density.
Definition 21.3. Let π(·|x) denote the posterior density function of the parameter θ given X = x. Then

Ck = {θ : π(θ|x) ≥ k}

is called a level 1 − α credible region for θ if pr(θ ∈ Ck|x) ≥ 1 − α.
Note that pr(·|x) is used for the posterior distribution of θ. If one can credibly regard θ as an outcome from a prior distribution, then the above credible region has a very strong appeal.
If θ is not a vector but a real value, then we may choose to ignore the above definition of the credible region and insist on having a credible interval.
Definition 21.4. Let Π(·|x) denote the posterior cumulative distribution function of the parameter θ given X = x. Suppose θ̲ is the largest and θ̄ the smallest value satisfying

pr(θ ≥ θ̲ | x) ≥ 1 − α;  pr(θ ≤ θ̄ | x) ≥ 1 − α.

Then θ̲ and θ̄ are level 1 − α lower and upper credible bounds for θ.
The source of these two definitions is Bickel and Doksum (2001). Some
changes are made to avoid potential non-uniqueness. The following is an
example directly copied from Bickel and Doksum (2001).
Example 21.5. Suppose that, given µ, X1, . . . , Xn are i.i.d. from N(µ, σ₀²) with known σ₀². The prior distribution of µ is N(µ₀, τ₀²) with both parameter values known. Find the credible bounds and regions according to the above definitions.
Solution. The posterior distribution of µ given the sample is still normal, with parameters

µB = (n x̄/σ₀² + µ₀/τ₀²) / (n/σ₀² + 1/τ₀²)

and

σB² = [n/σ₀² + 1/τ₀²]⁻¹.

The lower and upper 1 − α bounds are simply

µB ± z_{1−α} σ₀/√(n + σ₀²/τ₀²).

The 1 − α credible region is also an interval, with lower and upper limits given by

µB ± z_{1−α/2} σ₀/√(n + σ₀²/τ₀²). □
Note that the centre of the credible interval is shifted toward µ₀ compared with the usual confidence interval. The length is shortened too.
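A small numerical sketch of Example 21.5, with illustrative values for σ₀, µ₀, τ₀, n and x̄ (these numbers are assumptions, not from the text):

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (not from the text): known data variance sigma0^2,
# prior N(mu0, tau0^2), sample size n and observed sample mean xbar.
sigma0, mu0, tau0 = 2.0, 0.0, 1.5
n, xbar = 25, 1.2

mu_B = (n * xbar / sigma0 ** 2 + mu0 / tau0 ** 2) / (n / sigma0 ** 2 + 1 / tau0 ** 2)
sigma_B = (n / sigma0 ** 2 + 1 / tau0 ** 2) ** -0.5
# Equivalently sigma_B = sigma0 / sqrt(n + sigma0^2 / tau0^2).

z = norm.ppf(0.975)                       # alpha = 0.05
credible = (mu_B - z * sigma_B, mu_B + z * sigma_B)
# mu_B sits between xbar and mu0, and sigma_B < sigma0 / sqrt(n), so the
# credible interval is shifted toward mu0 and shorter than the usual CI.
```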
21.6 Prediction intervals
In general, the notion of confidence region is defined for unknown parame-
ters of a distribution family. There are cases where we hope to predict the
outcome of a future trial from the same probability model.
Suppose we have an i.i.d. sample X1, X2, . . . , Xn from {f(x; θ) : θ ∈ Θ}. Based on this sample, we might have an estimate of θ. The question is: if another independent observation X is to be taken from the same distribution, what are its possible values?
If the value of θ for this experiment were known, we could use the high density region of f(x; θ) as our prediction region. This should allow us to catch the future value with the lowest volume prediction region. That is, let

C(θ) = {x : f(x; θ) > c}

with the known value θ. We may choose c such that

P(X ∈ C(θ)) = 1 − α

for 1 − α coverage probability. Note that this region does not depend on the random sample X1, . . . , Xn.
If θ is unknown, as is usually the case, it is natural to replace θ by its estimator, say θ̂. Although C(θ̂) is a very sensible prediction region for X, its coverage probability is likely lower than 1 − α due to the uncertainty brought in by θ̂. The event

X ∈ C(θ̂)

contains two random components: X and θ̂. The randomness in X is unaffected by how well θ is estimated, while the precision of θ̂ usually improves with the sample size n. The limit of the improvement is C(θ). Due to the built-in randomness in X, one cannot do better than C(θ) no matter what.
In comparison, the precision of the confidence region for θ usually im-
proves with n. When n → ∞, the size of the confidence region with fixed
confidence level shrinks to 0.
Example. Suppose we have a random sample X1, . . . , Xn from N(θ, 1). It is well known that θ̂ = X̄n is the MLE of θ.
If X is the outcome of a future experiment, then X − X̄n has a normal distribution with mean 0 and variance 1 + n⁻¹. Thus, a 95% prediction interval for X is given by

(X̄n − 1.96 √(1 + n⁻¹), X̄n + 1.96 √(1 + n⁻¹)).
Clearly, increasing the sample size does not have much impact on reducing
the length of the prediction interval.
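The contrast between the two kinds of intervals can be checked numerically; the sample below is simulated and its size is an illustrative choice.

```python
import numpy as np

# Simulated sample from N(theta, 1); theta = 1 and n = 40 are illustrative.
rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=1.0, size=40)
n, xbar = x.size, x.mean()

# X - Xbar_n ~ N(0, 1 + 1/n), so a 95% prediction interval for a future X is:
half_pred = 1.96 * np.sqrt(1 + 1 / n)
pred = (xbar - half_pred, xbar + half_pred)

# In contrast, the half-width of the 95% CI for theta itself is 1.96 / sqrt(n),
# which shrinks to 0 as n grows; the prediction half-width never drops below 1.96.
half_ci = 1.96 / np.sqrt(n)
```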
In general, a prediction interval can be obtained via some “pivotal” quan-
tities. That is, we look for a function of the random quantity to be predicted,
and a statistic based on observations such that the resulting random quantity
has a distribution free from unknown parameters. Thus, it is possible to find
a subset of its range such that its probability equals 1 − α, the confidence
level we hope to get. The range can often be converted to obtain a prediction
interval or region.
21.7 The relationship between hypothesis test
and confidence region
Both hypothesis tests and confidence regions are frequentist concepts. The subsequent discussions are not applicable to Bayes analysis.
Particularly in recent years, the use of hypothesis testing is being questioned in the science community. At the centre of this dispute is the interpretation of the p-value. Even for researchers who have received many years of rigorous statistical training, obtaining a small p-value from some test on some data set remains holy. This is bad. The majority of the researchers in the same group fully understand the non-equivalence of "statistical significance" and "scientific significance". Yet the motivation outside of scientific considerations can simply be too strong for the non-equivalence to be taken seriously.
Here is a fictional example based on linear regression. How should we judge the importance of explanatory variables? One may find that the p-value of one variable is 10⁻⁵, extremely small and much lower than the commonly used nominal level 0.05. The p-value for another is, say, 0.04, just small enough to declare its statistical significance. Which one is more important?
If it were us, we would first ask what hypotheses are under test. Most likely, the hypotheses are that their coefficients are zero in the regression relationship. Nevertheless, I prefer to have them explicitly stated. I would then ask what the purpose of this regression analysis is. If predicting the response value in a future experiment is the goal, then it is best to examine how much variation in the response is explained by the variation of each of these two explanatory variables. The larger one is more important. The size of the p-value merely tells us how sure we are about the conclusion that its effect is non-zero.
One way to avoid such confusion, suggested by many, is to make use of confidence intervals. We feel that this is not necessarily fool-proof nor always feasible. Let β1 and β2 denote the two regression coefficients under consideration. Suppose the two-sided 95% confidence intervals for these coefficients are [0.1, 1] and [1, 10] respectively. If these two variables have been standardized, then the second explanatory variable is more important in determining the size of the response variable.
The foundation for replacing hypothesis tests with confidence interval/region construction is their equivalence: one may reject every null hypothesis that contains no parameter value inside the confidence region; conversely, one may construct a confidence region made of all parameter values that are not rejected by a chosen hypothesis test procedure when each specific value forms a null hypothesis. This foundation does not always work. Consider the example of the Wilcoxon rank test, which is often used as a nonparametric method for the two-sample problem. In this case, there is not a meaningful parameter whose confidence interval can be constructed based on this test. The goodness-of-fit test is another one. In these cases, hypothesis testing is indispensable in spite of
its deficiency. The only defence against the abuse of statistical inference procedures is to uphold the statistical principle: a small p-value based on a valid hypothesis test implies statistical significance. It does not imply scientific significance.
Let us end this chapter with another trivial example. It is definitely statistically significant that those who buy 10 lottery tickets have a higher chance of winning than those who buy only 1 lottery ticket. Yet multiplying a number that is practically zero by 10 does not lead to a meaningfully sized chance. For the
same reason, we do not take advice such as "drinking 10 cups of water a day will reduce your risk of cancer by a factor of 10". We will drink water when we are thirsty, not to reduce the risk of cancer.
21.8 Assignment problems
1. The following values are iid observations from a binomial distribution
with m = 10 and the probability of success θ.
4 3 3 3 2 3 4 3 2 1 3 7 5 2 2 2 2 1 3 4
(1) Obtain the 95% confidence interval of θ based on likelihood method.
(2) Let x̄ be the sample mean and let

Tn(x̄, θ) = √20 (x̄ − 10θ)/√(10θ(1 − θ))
be used as a test statistic for H0 : θ = θ0 versus H1 : θ ≠ θ0. Note that the sample size n = 20. Based on the CLT, Tn is asymptotically N(0, 1). Thus, we reject H0 when |Tn(x̄, θ0)| > 1.96 at the 5% level.
Numerically find all values of θ that are not rejected by the above test. The outcome is a confidence interval for θ.
2. Let X1, . . . , Xn be an iid sample from the Cauchy distribution family

f(x; θ) = 1/[π{1 + (x − θ)²}].
(a) Derive the score test statistic for H0 : θ = 0.
(b) Generate a set of random observations of size n = 20 to construct
a two-sided 95% confidence interval for θ based on the score test.
Remark: the locally most powerful test is one-sided. For this problem, we use S²n(θ)/I(θ) as the test statistic. The locally most powerful test uses Sn(θ)/√(I(θ)).
3. Let X1, . . . , Xn be a random sample from exponential distribution with
density function
f(x; θ) = θ⁻¹ exp(−θ⁻¹ x).
Consider the case n = 201 and θ = 1.
(a) Generate 1000 data sets with n = 201 to estimate the bias and
variance of the sample median for estimating the population median.
(b) Let x¯ and s2n be the sample mean and variance. Obtain a two-sided
95% confidence interval for θ based on test statistic
T = √n (x̄ − θ)/θ
and its asymptotic N(0, 1) distribution.
(c) Obtain a two-sided 95% confidence interval for θ based on the test statistic

T = √n (x̄ − θ)/sn
and its asymptotic N(0, 1) distribution.
(d) Which interval (rather, which CI construction method), the one in (b) or the one in (c), seems to work better?
4. The following values are i.i.d. observations from a geometric distribu-
tion with probability of success θ:
0 2 0 0 2 10 6 0 0 6 0 3 0 1 0 8 3 5 3 5.
The sample size n = 20 and Σ xi = 54.
(a) Provide the algebraic expressions needed for constructing an approximate 95% confidence interval of θ based on the likelihood method. Obtain the 95% approximate likelihood interval based on the data provided, with precision up to the third decimal place.
(b) The log odds is defined to be ξ = log{θ/(1 − θ)}. Obtain the
approximate likelihood interval of ξ based on your solution in (a).
5. Suppose that we have an i.i.d. sample x1, . . . , xn of size n from the Poisson distribution whose p.m.f. is given by

f(x; θ) = (θ^x/x!) exp{−θ}; x = 0, 1, . . . ,
with parameter space θ > 0.
In the context of Bayes analysis, a prior distribution for θ with density function

π(θ) = θ exp(−θ)

is specified.
(a) Consider the simplest case where n = 2 and x1 +x2 = 9. Show that
the p.d.f. of the posterior distribution is given by
π(θ|x) = C θ¹⁰ exp(−3θ)
for some constant C.
(b) Find the lower and upper limits of the 95% credible region to the
precision of the 3rd decimal place.
(c) Obtain the level 95% lower and upper credible bounds of θ.
Remark: credible regions and credible bounds differ in their mathematical definitions. Their statistical difference is not so meaningful.
Chapter 22
Empirical likelihood
The likelihood method for regular parametric models has many nice properties. One potential problem, though, is the risk of model mis-specification. If a data set is a random sample from a Cauchy distribution but we use a normal model in the analysis, the statistical claims could be grossly false.
Of course, the problem is not always so serious. If the data set is a sample from a double exponential model, but we use a normal model as the basis for data analysis, many statistical claims will still be asymptotically valid. For instance, the sample mean remains a good estimator of the population mean, and its variance remains well estimated by the sample variance scaled by the sample size. The efficiency of the point estimator, however, is compromised.
To avoid the risk of model mis-specification, non-parametric methods are
sensible alternatives. The empirical likelihood methodology is a systematic
non-parametric approach to statistical inference. It preserves some treasured
properties of the likelihood approach while being nonparametric.
22.1 Definition of the empirical likelihood
Suppose we have a set of i.i.d. observations X1, X2, . . . , Xn. We hope to make
statistical inferences without placing restrictive assumptions on their com-
mon distribution F . Can we still make meaningful and effective inferences
on F? The answer is positive because it is widely known that the empirical
distribution Fn(x) is a good estimate of F . This is an estimator based on
no parametric assumptions. The empirical distribution will be seen to be a non-parametric maximum likelihood estimator of F, and it has many "optimal" properties.
Let F({xi}) = P(X = xi), where xi is the observed value of Xi, i = 1, 2, . . . , n. When all xi's are distinct, the likelihood function becomes

Ln(F) = ∏_{i=1}^n F({xi}).

Denote pi = F({xi}). This likelihood can also be written as

Ln(F) = ∏_{i=1}^n pi.
Clearly, we have 0 ≤ pi ≤ 1 and Σ_{i=1}^n pi ≤ 1. It is often more convenient to work with the log-empirical likelihood function

ℓn(F) = Σ_{i=1}^n log pi.
If F is a continuous distribution, we have Ln(F) = 0. Because of this, the empirical likelihood appears insensible: in its eyes, no continuous distribution is likely at all. Yet we will find that the empirical likelihood is not bogged down by this deficiency.
When there are ties in the data, that is, when some xi are equal, Ln(F) given above in terms of pi is not authentic. For instance, the requirement of Σ pi ≤ 1 is no longer valid. To justify the continued use of this Ln(F) via pi as a likelihood function, we may add a set of independent and very small continuous noises to the observed values. Afterwards, Ln(F) remains a valid likelihood function, but it is constructed on a slightly different data set and a different F. We can then proceed with whatever analysis first, and then let the amount of noise go to zero. In most situations, the analysis conclusions on the original F remain valid. Owen (2001) contains a more rigorous justification that resolves the "philosophical issue" caused by tied observations. The justification here might be regarded as a lazy-man's approach.
It is easy to see that the likelihood is maximized when F(x) = Fn(x). Hence, the empirical distribution Fn(x) based on an i.i.d. sample is also the non-parametric MLE. One may note that this conclusion does not depend on whether or not there are any ties in the sample.
22.2 Profile likelihood for population mean
and the Lagrange multiplier
The empirical likelihood may seem to have limited usage. The picture com-
pletely changes once we introduce the concept of profile likelihood.
Consider inference problems related to the population mean when a set of i.i.d. observations from a distribution F ∈ F is available. Naturally, we now assume that F contains all distributions with a finite first moment.
Let θ = ∫ x dF(x) = E(X) under distribution F. The empirical likelihood is a function of F. There are many distributions whose expectations equal θ. What should be the likelihood value of θ? We do not have a widely acceptable answer to this question. Let F_θ be the set of all distributions whose expectations equal θ. The original concept of the profile likelihood would be

wrong-Ln(θ) = sup{Ln(F) : F ∈ F_θ}.

This definition is found not useful, however. It can be shown that

wrong-Ln(θ) = (1/n)^n

for any θ. That is, the above "profile likelihood" lacks the discriminative power to tell the true θ-value from other values.
To avoid the above dilemma, we define the profile likelihood by first
introducing a “distribution family”:
F_{n,θ} = {F : F(x) = Σ_{i=1}^n pi 1(xi ≤ x), Σ_{i=1}^n pi xi = θ}.
Note that this class of distributions is data dependent. When n increases,
this family expands, and in the limit, it can approximate any distribution
well in some sense.
We now define the profile likelihood function for population mean θ as
Ln(θ) = sup{Ln(F ) : F ∈ Fn,θ}.
Note that we use Ln(·) for both the empirical likelihood and the profile empirical likelihood. Mathematically, it is an abuse of notation, but such an abuse does not seem to cause much confusion. The question of which likelihood it stands for is answered by whether the input is a vector θ or a c.d.f. F. As usual, it is often more convenient to work with the logarithm of the likelihood function. We use ℓn(·) = log Ln(·).
Does the profile likelihood function of θ work like a likelihood? To answer this question, we need to get some idea of the numerical problem that comes with the empirical likelihood. Suppose we have n observed values or vectors x1, . . . , xn. To compute the profile likelihood ℓn(θ), the numerical problem is:
maximize:  Σ_{i=1}^n log pi
subject to:  0 < pi < 1, i = 1, 2, . . . , n;  Σ_{i=1}^n pi = 1;  Σ_{i=1}^n pi xi = θ.
The method of Lagrange multipliers is very effective for solving this constrained maximization problem. Suppose that θ is an interior point of the convex hull formed by the n observed values. Define
g(p, s, λ) = Σ_{i=1}^n log pi + s(Σ_{i=1}^n pi − 1) − nλᵀ(Σ_{i=1}^n pi xi − θ)
where p represents all the pi, and s and λ are Lagrange multipliers. When the x's are vectors, λ is also a vector and the multiplication is interpreted as the dot product. The method of Lagrange multipliers requires us to find the stationary points of g(p, s, λ) with respect to p, s and λ. After some routine derivations, we find the stationary point is given by
pi = 1/[n(1 + λᵀ{xi − θ})]
with λ satisfying

Σ_{i=1}^n (xi − θ)/(1 + λ{xi − θ}) = 0.    (22.1)
In the univariate case, since all 0 < pi < 1, we find

(1 − n⁻¹)/(θ − x(n)) < λ < (1 − n⁻¹)/(θ − x(1))
where x(1) and x(n) are the minimum and maximum observed values. In addition, the function on the left-hand side of (22.1) is a monotone decreasing function of λ. One may verify this claim by finding its derivative with respect to λ. Hence, the numerical value of λ may be easily computed. In the vector case, the function in (22.1) is the derivative of a convex function. A revised Newton's method may be designed to ensure the numerical solution is obtained.
Once the value of λ is obtained numerically, we have

ℓn(θ) = −Σ_{i=1}^n log{1 + λ(xi − θ)} − n log n

and the corresponding F has

pi = 1/[n(1 + λ{xi − θ})].
This result paves the way to study the asymptotic property of the profile
likelihood.
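The univariate computation described above can be sketched as follows: solve (22.1) for λ by a root search between the bounds just derived, then evaluate ℓn(θ). The function name and tolerances are illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

def el_loglik(x, theta):
    """Profile empirical log-likelihood l_n(theta) for a univariate mean.

    Solves (22.1) for the Lagrange multiplier lambda by a root search within
    the bounds implied by 0 < p_i < 1, then returns
    -sum log(1 + lambda (x_i - theta)) - n log n.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not (x.min() < theta < x.max()):   # theta must be inside the convex hull
        return -np.inf
    def lhs(lam):                         # left-hand side of (22.1)
        return np.sum((x - theta) / (1.0 + lam * (x - theta)))
    eps = 1e-10
    lo = (1 - 1 / n) / (theta - x.max()) + eps
    hi = (1 - 1 / n) / (theta - x.min()) - eps
    lam = brentq(lhs, lo, hi)             # lhs is monotone decreasing in lambda
    return -np.sum(np.log1p(lam * (x - theta))) - n * np.log(n)
```

At θ = x̄ the multiplier is λ = 0 and ℓn(x̄) = −n log n, the maximum; moving θ away from x̄ lowers ℓn(θ).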
22.3 Large sample properties
As a preparation step, we first show that the true population mean θ0 is
within the convex hull of the data with probability approaching 1 as n→∞.
Mathematically, this means that
inf{ max{aτ(xi − θ0) : i = 1, . . . , n} : a is a unit vector } > 0.
The reason is: viewed from θ0 toward any direction, there should always
be data located in that direction. Remember that a unit vector is a vector
of length one; we use the Euclidean norm to define the length. A mathematical
result presented in Owen (2001) is needed here.
Lemma 22.1. Let X be a d-dimensional random vector with mean 0 and
finite variance-covariance matrix V of full rank. We have
inf_a pr(aτX > 0) > 0
where the infimum is taken over all unit d-dimensional vectors.
Proof: Since V is positive definite, there cannot be a unit length vector
a0 such that pr(a0τX > 0) = 0. We show that because of this, the lemma
conclusion is true.
If the conclusion is not true, then there is a sequence am such that
pr(amτX > 0) → 0 as m → ∞. Since the set of all unit d-dimensional
vectors is compact, we can find a subsequence of am converging to some a0,
also of unit length. Without loss of generality, assume
am → a0 as m → ∞. Clearly,
lim_{m→∞} 1(amτX > 0) = 1(a0τX > 0).
Hence, by Fatou's lemma in real analysis, we have
0 = lim_{m→∞} pr(amτX > 0) ≥ pr(a0τX > 0).
This is impossible as pointed out in the beginning.
The empirical measure approximates the true probability measure
uniformly over half spaces of the form {x : aτx > 0}; this claim can be found in
more advanced probability texts. Together with Lemma 22.1, it implies that the
solution exists with probability converging to 1.
By Slutsky's theorem, the limiting distribution of a statistic is not affected
by an event whose probability goes to zero. We now assume that the solution
exists for all observed data sets. This is acceptable for deriving asymptotic
results, though one should not use it for other purposes; at the least, one
should be very cautious when invoking this “assumption”.
The next lemma shows that
max_{1≤i≤n} ‖Xi‖ = op(n^{1/2}).
This fact helps determine the closeness of p̂i to 1/n as n → ∞.
Lemma 22.2. Let Y1, . . . , Yn be a set of i.i.d. positive random variables
with E[Y1²] < ∞. Then Y(n) = max_i Yi = o(n^{1/2}) almost surely.
Proof: There is a simple inequality for positive valued random variables:
∑_{j=1}^∞ pr(Y1² > j) ≤ E{Y1²}.
Due to the i.i.d. assumption, it can also be written as
∑_{j=1}^∞ pr(Yj² > j) ≤ E{Y1²} < ∞.
The finiteness is the lemma condition. The inequality can then easily be
refined to show that
∑_{j=1}^∞ pr(Yj² > εj) < ∞
for any ε > 0. By the Borel–Cantelli lemma, this implies that
pr{ Bj = {Yj² > εj} i.o. } = 0.
That is, there exists an event A such that pr(A) = 1 and, for each ω ∈ A,
Yn²(ω) > εn for only a finite number of n.
Let ω ∈ A: there then exists an M such that Yn²(ω) ≤ εn whenever
n > M. Let
N(ω) = ε^{-1} max{Yj²(ω) : j ≤ M},
which is a large but finite value. For all n ≥ max{M, N},
Y(n)² ≤ max[ max{Yj²(ω) : j ≤ M}, εn ] ≤ max{εN, εn} = εn.
That is, Y(n)² ≤ εn for all sufficiently large n and any ε > 0 almost surely,
which is the conclusion of the lemma.
After two rather technical lemmas, we are ready to prove the following
statistically meaningful conclusion. For simplicity, we use θ for the true
population mean, rather than the special notation θ0.
Lemma 22.3. Under the conditions of Theorem 22.1, for the Lagrange
multiplier corresponding to the true population mean θ, we have
λn = Op(n^{-1/2}).
Further, we have
λn = [∑_{i=1}^n (xi − θ)(xi − θ)τ]^{-1} ∑_{i=1}^n (xi − θ) + op(n^{-1/2})
and max_i |λnτ(xi − θ)| = op(1).
Proof: We omit the subscript n on λ for simplicity. Let ρ = ‖λ‖ and denote
ξ = λ/ρ. For brevity, assume θ = 0 so that the equation for λ becomes
∑_{i=1}^n xi/(1 + ρξτxi) = 0.
We have
0 = ∑_{i=1}^n (ξτxi) − ρ ∑_{i=1}^n {ξτxi}²/(1 + ρξτxi).
This implies
∑_{i=1}^n (ξτxi) = ρ ∑_{i=1}^n {ξτxi}²/(1 + ρξτxi) ≥ 0.
Let ti = ξτxi and δn = max_i |ti|. It is known that 1 + ρti > 0 for all i, and
clearly 1 + ρδn > 0. Further, by the finiteness of the second moment of xi
and Lemma 22.2, we know δn = o(n^{1/2}). This order assessment leads to
∑_{i=1}^n ξτxi = ρ ∑_{i=1}^n {ξτxi}²/(1 + ρξτxi) ≥ ρ ∑_{i=1}^n {ξτxi}²/(1 + ρδn).
Multiplying both sides by the positive constant 1 + ρδn and doing some simple
algebra, we get
∑_{i=1}^n (ξτxi) ≥ nρ [ n^{-1} ∑_{i=1}^n (ξτxi)² − n^{-1} δn ∑_{i=1}^n (ξτxi) ].
By the law of large numbers,
n^{-1} ∑_{i=1}^n xixiτ → var(X1),
which is a positive definite matrix. Hence, n^{-1} ∑_{i=1}^n (ξτxi)² ≥ σ1² > 0 almost
surely, with σ1² being the smallest eigenvalue of the covariance matrix. At the
same time, it is clear that
n^{-1} δn ∑_{i=1}^n (ξτxi) = op(1).
Consequently, we have shown
ρ ≤ [∑_{i=1}^n (ξτxi)²]^{-1} {∑_{i=1}^n ξτxi} (1 + op(1)) = Op(n^{-1/2}).
This conclusion implies max_i |λτxi| = op(1). Substituting back into
∑_{i=1}^n xi/(1 + λτxi) = 0,
we get the expression for the expansion of λ. This concludes the proof.
These preparations help to establish the useful statistical results in the
next section.
22.4 Likelihood ratio function
Since Ln(Fn) > Ln(F) for any F ≠ Fn, it is useful to introduce the empirical
likelihood ratio function
Rn(F) = Ln(F)/Ln(Fn) = ∏_{i=1}^n (npi).
This function has the maximum value of 1. Similarly, for the population mean
θ, we define
Rn(θ) = Ln(θ)/Ln(Fn) = ∏_{i=1}^n (npi)
with npi = {1 + λτ(xi − θ)}^{-1} for the Lagrange multiplier λ given earlier.
In parametric inference we may base hypothesis tests and confidence regions
on the size of the likelihood ratio function. When Rn(θ) is large, θ is
a likely value of the true parameter. A confidence region is hence made
of the θ's for which Rn(θ) is larger than a threshold value. As in parametric
likelihood inference, we need to know the distribution of Rn(θ) to define a
proper threshold value. This value is selected so that, at least asymptotically,
the size of the test is the pre-specified α, or the likelihood interval/region
has coverage probability 1 − α. Such a threshold value can be determined
from the following celebrated result.
Theorem 22.1. Let X1, X2, . . . , Xn be a set of i.i.d. random vectors of dimension
d with common distribution F0. Let θ0 = E[X1], and suppose
var(X1) is finite and positive definite. Then
−2 log[Rn(θ0)] → χ²_d
in distribution as n → ∞.
Because of the above Wilks type result, an effective empirical likelihood
based hypothesis test procedure is possible: reject H0 : E(X) = θ0 in
favour of H1 : E(X) ≠ θ0 when Tn = −2 log[Rn(θ0)] ≥ χ²_d(1 − α). Note that
this d is the dimension of X.
I generally use Rn for the LRT statistic, which is twice the difference of
the log likelihood values maximized under the full model (at θ̂1) and
under the null model (at θ̂0); namely, Rn is generally used for 2{ℓn(θ̂1) − ℓn(θ̂0)}.
In the context of empirical likelihood, however, there is a compelling reason to use
Rn(θ) for the straight ratio of two likelihood values. Hence, one has to be
careful to avoid some potential confusion here.
Proof of Theorem: Because max_i |λτxi| = op(1), let us focus on the event
that it is no more than 1/10. For |t| ≤ 1/10, it is
simple to see that
|log(1 + t) − {t − t²/2}| ≤ |t|³/2.
We have in fact allowed a big margin for the error.
Without loss of generality, θ0 = 0. With this convention, we have
−2 log Rn(θ0) = 2 ∑_{i=1}^n log{1 + λτxi}
= 2λτ ∑_{i=1}^n xi − λτ{∑_{i=1}^n xixiτ}λ + εn
= {∑_{i=1}^n xi}τ {∑_{i=1}^n xixiτ}^{-1} {∑_{i=1}^n xi} + op(1) + εn.
The leading term has a chisquare limiting distribution. We need only verify
that εn = op(1). This is true as
|εn| ≤ ∑_{i=1}^n |λτxi|³ ≤ max_i |λτxi| ∑_{i=1}^n |λτxi|² = op(1).
This completes the proof.
This theorem can be used to construct confidence intervals for the population
mean θ, or to conduct hypothesis tests regarding its value. For
instance, an approximate level 1 − α confidence region for θ is given by
{θ : −2 log Rn(θ) ≤ χ²_d(1 − α)}.
It can be shown that the profile log likelihood function ℓn(θ) is concave. Hence,
the above confidence region is always convex.
On top of being derived from a non-parametric procedure, EL confidence
regions are praised for having a data-driven shape and for not demanding
an estimated covariance matrix. Since the above region is based on a first
order asymptotic result, its coverage probability is in general slightly lower
than the nominal 1 − α. A high-order (Bartlett) correction can be made
to achieve higher order precision: the actual coverage probability then differs
from 1 − α by a quantity of order n^{-2}.
22.5 Numerical computation
The numerical computation appears problematic initially: we have to
maximize a function of n variables under several linear constraints. It turns
out that once the value of the Lagrange multiplier λ is known, the remaining
computation is very simple. We illustrate the numerical computation in this
section.
Consider the problem of computing the profile likelihood for the mean.
The computation is particularly simple when x is a scalar. In this case, we
need to solve
g(λ) = ∑_{i=1}^n (xi − θ)/(1 + λ(xi − θ)) = 0
for a given set of data and a given value of θ. Our first step is to subtract θ
from the xi: define yi = xi − θ once a θ value is selected. We
then sort the yi in increasing order and obtain y(1) and y(n). If these have the
same sign, the equation has no solution.
Otherwise, the sign of λ is the same as that of ȳn. If ȳn > 0, we search in
the interval [0, (n^{-1} − 1)/y(1)); otherwise, we search in the interval
((n^{-1} − 1)/y(n), 0]. We also note that g(λ) is a decreasing function. The
following pseudo code computes λ:
1. Compute yi = xi − θ;
2. Sort the yi to get y(i);
3. If y(1)y(n) ≥ 0, stop and report “no solution”. Otherwise, continue;
4. Compute ȳ. If ȳ > 0, set L = 0, U = (n^{-1} − 1)/y(1); otherwise set
L = (n^{-1} − 1)/y(n), U = 0.
5. Set λ = (L + U)/2.
6. If g(λ) < 0, set U = λ; otherwise set L = λ.
7. If U − L < ε, stop and report λ = (U + L)/2. Otherwise, go to Step 5.
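The pseudo code above can be sketched in runnable form as follows (a minimal illustration in Python rather than the R used elsewhere in these notes; the function names el_lambda and el_loglik are our own):

```python
import math

def el_lambda(x, theta, tol=1e-10):
    """Solve g(lam) = sum (x_i - theta)/(1 + lam (x_i - theta)) = 0 by the
    bisection steps 1-7; returns None when theta lies outside the open
    interval (min x, max x), where no solution exists."""
    n = len(x)
    y = sorted(xi - theta for xi in x)            # steps 1-2: y_i = x_i - theta
    if y[0] * y[-1] >= 0:                         # step 3: no sign change
        return None
    g = lambda lam: sum(yi / (1.0 + lam * yi) for yi in y)
    if sum(y) > 0:                                # step 4: bracket the root
        lo, hi = 0.0, (1.0 / n - 1.0) / y[0]
    else:
        lo, hi = (1.0 / n - 1.0) / y[-1], 0.0
    while hi - lo > tol:                          # steps 5-7: g is decreasing
        lam = (lo + hi) / 2.0
        if g(lam) < 0:
            hi = lam
        else:
            lo = lam
    return (lo + hi) / 2.0

def el_loglik(x, theta):
    """Profile log empirical likelihood l_n(theta); -inf when undefined."""
    lam = el_lambda(x, theta)
    if lam is None:
        return -math.inf
    n = len(x)
    return -sum(math.log(1.0 + lam * (xi - theta)) for xi in x) - n * math.log(n)
```

At θ = x̄n the solution is λ = 0 and ℓn(x̄n) = −n log n, the overall maximum of the profile log empirical likelihood.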
This algorithm is guaranteed to terminate. The constant ε is the tolerance
level set by the user or by default; often it is chosen to be 10^{-8} or so. In
applications, we should take the scale of the xi's into consideration. If all of them
are small in absolute value (after subtracting θ), λ will be larger, hence the
above tolerance is fine. If the xi − θ are of order 10^8, then the tolerance for
λ must be reduced substantially, say to 10^{-16}. A sensible choice is to compute
the sample standard deviation sn and set the tolerance level at ε/sn.
To find the upper and lower limits of the confidence interval for the mean,
we first note that x̄n is always included in the interval, and that the upper and
lower limits cannot exceed the largest and smallest observed values. A
simple method is to bisect the interval between x̄n and x(n) iteratively until
we find the location θU at which the profile likelihood ratio statistic equals
the chisquare quantile corresponding to the confidence level suggested by the
user; the typical value is of course 3.841 for a one-dimensional problem at the
95% confidence level.
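Building on the same bisection idea, the search for θU can be sketched as follows (again an illustrative Python sketch with our own function names neg2logR and el_upper_limit; the cut-off 3.841 is the 95% quantile of the chisquare distribution with one degree of freedom):

```python
import math

def neg2logR(x, theta, tol=1e-12):
    """-2 log R_n(theta) for scalar data; infinite outside (min x, max x)."""
    n = len(x)
    y = [xi - theta for xi in x]
    if min(y) * max(y) >= 0:
        return math.inf                       # theta outside the data hull
    lo = (1.0 / n - 1.0) / max(y)             # bracket containing lambda
    hi = (1.0 / n - 1.0) / min(y)
    while hi - lo > tol:                      # bisection; g is decreasing
        lam = (lo + hi) / 2.0
        if sum(yi / (1.0 + lam * yi) for yi in y) < 0:
            hi = lam
        else:
            lo = lam
    lam = (lo + hi) / 2.0
    return 2.0 * sum(math.log(1.0 + lam * yi) for yi in y)

def el_upper_limit(x, cutoff=3.841, tol=1e-6):
    """Bisect between xbar and max(x) for the upper confidence limit."""
    lo, hi = sum(x) / len(x), max(x)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if neg2logR(x, mid) > cutoff:
            hi = mid                          # mid is outside the region
        else:
            lo = mid
    return (lo + hi) / 2.0
```

The lower limit is found the same way, bisecting between x(1) and x̄n.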
When Xi’s are vector valued, Chen, Sitter and Wu (2002, Biometrika)
showed that a revised Newton-Raphson method can be used for computing
the profile likelihood ratio function for the mean. The algorithm is guaran-
teed to converge when the solution exists.
22.6 Empirical likelihood applied to estimat-
ing functions
In some applications, particularly in econometrics, the parameter of interest
is defined through estimating functions. Namely, if X is a sample from a
population of interest, the parameter vector θ is the unique solution to
E{g(X;θ)} = 0
for some vector valued and smooth function g. In this setting, the distribution
of X is left unspecified. Some restrictions will be needed to permit meaningful
discussion of large sample properties.
Let the dimension of g be m and the dimension of θ be d. When m < d,
the solution to the equation E{g(X;θ)} = 0 is generally not unique for a given
distribution F of X; in this case, θ is under-defined. When m = d, the same
equation usually has a unique solution; the parameter is then just-defined.
When m > d, a solution to E{g(X;θ)} = 0 exists only for special F; the
model is then over-defined. If an i.i.d. sample from a distribution F is
available, the corresponding estimating equation
∑_{i=1}^n g(xi; θ) = 0
may not have any solution in θ.
Generalized Method of Moments. In mathematical statistics textbooks, we often
postulate a linear regression model in which the response variable Y and the
p-dimensional covariate X are assumed to be linked through
Y = Xτβ + ε,
in which β is a non-random regression coefficient. The so-called error term ε
is a random variable independent of X, with mean zero and finite variance.
The statistical problem is to make inference about β based on an i.i.d. sample
from this system.
Typically, we estimate β by the least sum of squares. Equivalently, we
estimate β by the solution to the normal equations:
∑_{i=1}^n Xi(Yi − Xiτβ) = 0.
This approach fits into the framework of the estimating function definition with
g(x, y; β) = x(y − xτβ). The system contains as many equations as parameters,
so it is just-defined.
In econometrics, however, a linear model with X and ε dependent is often
more appropriate. One such example is the relationship between the earning
potential Y and the number of years spent in education X1, combined with
other controlling factors X2. A sensible model is
log(Y) = β0 + X1β1 + X2τβ2 + ε.
It is argued that X1 is probably related to unobserved factors such as the
individual cost and benefit of schooling. These unobserved factors in turn are likely
present in the error term ε (it is hard to identify them all and include them
in X2). Hence, X1 and ε are not independent.
If one uses the least sum of squares estimate for β1 and β2 in this situation,
then the estimator β̂ is biased and in fact not consistent as the sample
size n goes to infinity. To obtain a consistent estimator of β, one may look for
some instrumental variable(s) Z such that, given Z, X and ε are independent.
In this case, an unbiased estimating function (meaning zero expectation) is
given by
g(x, y, z; β) = z(y − x1β1 − x2τβ2).
Apparently, when the dimension of Z is larger than p, the combined dimension
of (x1, x2τ), we have an over-defined system.
When a system is over-defined, the sample estimating equation
∑_{i=1}^n g(xi; θ) = 0
generally has no solution in θ. Hence, it is not viable to use its solution as an
estimate. This scenario leads to the generalized method of moments (GMM)
extensively discussed in econometrics. Let Sn(θ) = ∑_{i=1}^n g(xi; θ); it is a
kind of score function in the context of likelihood based methods. Let An be
a well specified positive definite matrix of appropriate size. The general
idea of GMM is to estimate θ by the θ̃ that minimizes
{Snτ(θ)} An {Sn(θ)}.
The GMM approach leads to an inevitable question: how do we choose An?
One choice is to get an initial estimate of θ, say by solving ∑_{i=1}^n gd(xi; θ) =
0 where gd(·) consists of the first d entries of g(·). Let the solution be θ̂0 and let
An be the inverse of the estimated moment covariance:
An = [n^{-1} ∑_{i=1}^n g(xi; θ̂0) gτ(xi; θ̂0)]^{-1}.
After this, we get θ̃ by GMM.
The second possibility is to iterate the previous choice: update An with
θ̂1 = θ̃ to obtain a new θ̃, and continue until, hopefully, something converges.
The third choice is to define a θ-dependent weight
An(θ) = [n^{-1} ∑_{i=1}^n g(xi; θ) gτ(xi; θ)]^{-1}
and estimate θ by the minimizer of
{Snτ(θ)} An(θ) {Sn(θ)}.
The three approaches are asymptotically equivalent and “optimal” according to
some criterion.
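As a toy illustration of the two-step recipe (our own construction, not an example from the text): let X be exponential with mean θ, so that g(x; θ) = (x − θ, x² − 2θ²) gives an over-defined system with m = 2 and d = 1; a crude grid search stands in for a proper optimizer.

```python
import random

def gmm_two_step(x, grid):
    """Two-step GMM for the toy moments E X = theta, E X^2 = 2 theta^2."""
    n = len(x)

    def S(theta):                     # S_n(theta) = sum_i g(x_i; theta)
        return (sum(xi - theta for xi in x),
                sum(xi * xi - 2.0 * theta * theta for xi in x))

    # step 1: initial estimate from the first, just-identified moment
    theta0 = sum(x) / n
    # A_n = [n^{-1} sum g g^t]^{-1} evaluated at theta0 (2x2 inverse)
    m11 = m12 = m22 = 0.0
    for xi in x:
        g1, g2 = xi - theta0, xi * xi - 2.0 * theta0 * theta0
        m11 += g1 * g1 / n; m12 += g1 * g2 / n; m22 += g2 * g2 / n
    det = m11 * m22 - m12 * m12
    a11, a12, a22 = m22 / det, -m12 / det, m11 / det

    def Q(theta):                     # quadratic form S^t A_n S
        s1, s2 = S(theta)
        return a11 * s1 * s1 + 2.0 * a12 * s1 * s2 + a22 * s2 * s2

    # step 2: minimize the quadratic form over the supplied grid
    return min(grid, key=Q)

random.seed(2019)
data = [random.expovariate(0.5) for _ in range(2000)]   # true theta = 2
theta_hat = gmm_two_step(data, [1.5 + 0.001 * k for k in range(1001)])
```

With a large sample the minimizer lands near the true mean; replacing the grid by a Newton or quasi-Newton step is the obvious refinement.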
Empirical Likelihood. In comparison, for each given value of θ of dimension
d, one may define a profile empirical likelihood function as
Ln(θ) = sup{ ∏_{i=1}^n pi : ∑_{i=1}^n pi g(xi; θ) = 0 }.
We have omitted the requirements pi > 0 and ∑ pi = 1 in the display, but
they are still imposed.
For each given θ, the computation of Ln(θ) in the current case is no
different from the case where θ is the population mean. We may also notice
that the dimensions m and d do not matter in the theoretical development.
The optimal solution to the maximization problem is given by
pi = 1/{n[1 + λτ g(xi; θ)]}
with the Lagrange multiplier λ being the solution to
∑_{i=1}^n g(xi; θ)/{n[1 + λτ g(xi; θ)]} = 0.
The profile empirical likelihood defined here works almost the same way
as a parametric likelihood. Asymptotically, as long as the model is valid
in the sense that there exists a value θ* such that E{g(X; θ*)} = 0 and
E{‖g(X; θ*)‖²} < ∞, then
∑_{i=1}^n pi g(xi; θ*) = 0
has a solution in the pi with probability approaching 1 as n → ∞. That is,
ℓn(θ) = log Ln(θ) is at least well defined at θ = θ*.
Theorem 22.2. Assume that x1, . . . , xn form a set of i.i.d. observations
from some distribution F satisfying E{g(X; θ*)} = 0 for some θ*, and that
g(x; θ) and F jointly satisfy some regularity conditions. Let θ̂ be the maximum
empirical likelihood estimator and θ* be the true value of the parameter.
Then, as n → ∞, we have
2{ℓn(θ̂) − ℓn(θ*)} → χ²_d
and
2{−n log n − ℓn(θ̂)} → χ²_{m−d},
where m is the dimension of g and d is the dimension of θ.
This theorem provides a simple way to construct a likelihood interval/region
for θ. In ℓn(θ*), we allow only a single value of θ; this effectively reduces
the dimension of θ under consideration to 0. In ℓn(θ̂), we allow any value
of θ in the parameter space, which has dimension d. The difference of the two
dimensions is d. Hence, the degrees of freedom of the limiting distribution is
d, the same as if we worked with a parametric likelihood.
For the second conclusion, the degrees of freedom can be interpreted in the
same fashion, but from a different angle. With −n log n, we permit any F in the
likelihood computation; no constraints from g are involved. With ℓn(θ̂), we
have introduced constraints in the form of m estimating equations. At the same
time, these equations contain d parameters as free variables. Hence, the
effective number of constraints is reduced to m − d, and the degrees of freedom
of the limiting distribution is m − d, the number of restrictions applied.
The first result on limiting distribution concerns the difference in log
likelihood at two parameter values. Hence, the size judges the fitness of a
specific parameter value. The second result on limiting distribution concerns
the difference between placing a set of constraints and placing no constraints.
Hence, the size judges the fitness of these constraints.
The maximum empirical likelihood estimator is in general asymptotically
normally distributed. Among estimators of a certain type, it is also known to
be “optimal”: it has the lowest asymptotic variance in that class.
A much liked advantage of the EL method, compared with GMM, is that one
does not need to estimate the variance of θ̂ in order to construct confidence
intervals or regions for θ.
Another valued advantage of EL is that it is “Bartlett correctable”: there
exists a non-random constant bn such that the distribution of
2bn{ℓn(θ̂) − ℓn(θ*)} is approximated by a chisquare distribution with very high
precision, so that the difference decreases to 0 quickly as the sample size n → ∞.
In my experience, this is more a nice piece of theory than something of practical
value.
One problem with EL under the estimating function setting is that the solution
to the maximization problem may not exist. That is, given a θ value,
∑_{i=1}^n pi g(xi; θ) = 0
may not have a solution in the pi such that pi > 0 and ∑ pi = 1. This can
happen for every θ value. When it happens, the statistical literature generally
refers to it as the ‘empty set’ problem.
The Lagrange multiplier λ is well defined only if 0 is in the convex hull
of {g(xi; θ) : i = 1, . . . , n}. Thus, for each given θ value, one must first make
sure Ln(θ) is actually defined; looking for its maximum point θ̂ can only
be accomplished as a second step. If the set of θ on which Ln(θ) is well
defined is empty, the rest of the inference strategy falls apart.
In theory, if the model is correct and g has finite second moment, then Ln(θ*)
is well defined with probability approaching 1 as n → ∞, where θ* is the true
value. In applications, there is no guarantee that we can locate a θ value at which
Ln(θ) is well defined. In fact, it can be an issue merely to determine whether
or not it is well defined.
There have been a few remedies proposed in the literature. One of them
is given in Chen, Variyath and Abraham (2008). Let us define
g(xn+1; θ) = −an ḡn,
where ḡn = n^{-1} ∑_{i=1}^n g(xi; θ), for any θ, with a positive constant an. In this
definition, we do not look for an actual xn+1 value at which the above relationship
holds; we only need the value g(xn+1; θ).
Next, we define the adjusted profile empirical likelihood as
LN(θ) = sup{ ∏_{i=1}^N pi : ∑_{i=1}^N pi g(xi; θ) = 0 }
with N = n + 1. Namely, we have added a pseudo observation g(xn+1; θ)
to the usual definition of the empirical likelihood. Note that the
restrictions pi > 0 and ∑ pi = 1 are satisfied by pi = an/c for i = 1, 2, . . . , n
and pn+1 = n/c with c = nan + n for the expanded data set g1, . . . , gN. Hence,
LN(θ) is well defined for every value of θ.
Under mild conditions, the first order asymptotic properties of Ln(θ)
remain valid for LN(θ). This so-called adjusted empirical likelihood has received
a lot of attention; read the related papers yourself if you are interested.
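The feasibility claim above is easy to check numerically. In the sketch below (our own illustration, with scalar g values treated as already computed numbers), the stated weights satisfy all three constraints by construction:

```python
def adjusted_weights(g, a_n):
    """Append the pseudo value g_{n+1} = -a_n * gbar_n and return the
    weights p_i = a_n/c (i <= n), p_{n+1} = n/c with c = n*a_n + n,
    which satisfy p_i > 0, sum p_i = 1 and sum p_i g_i = 0."""
    n = len(g)
    gbar = sum(g) / n
    g_aug = g + [-a_n * gbar]          # the added pseudo observation
    c = n * a_n + n
    p = [a_n / c] * n + [n / c]
    return g_aug, p
```

Since these weights are always feasible, the supremum defining LN(θ) is never taken over an empty set.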
22.8 Assignment problems
1. Let x1, . . . , xn be a set of i.i.d. observations from a nonparametric dis-
tribution family F with finite first moment. Let θ = ∫ xdF (x) be the
mean of distribution F .
Define the profile non-empirical likelihood of θ to be
ℓn(θ) = sup{ ∑_{i=1}^n log F({xi}) : ∫ x dF(x) = θ, F ∈ F }.
Show that for any θ value, `n(θ) = −n log n when all xi values are
distinct.
2. The authentic empirical likelihood is defined differently from Q1. Let
x1, . . . , xn be a set of i.i.d. observations from a nonparametric distribu-
tion family F with finite first moment. Let
Fn = { F : F(x) = ∑_{i=1}^n pi 1(xi ≤ x) }.
The profile empirical likelihood of θ is defined to be
ℓn(θ) = sup{ ∑_{i=1}^n log F({xi}) : ∫ x dF(x) = θ, F ∈ Fn }.
Show that `n(θ) is a concave function of θ.
3. The following are 10 i.i.d. observations of a random vector of dimension
2 (every column is one vector observation):
20.5 28.1 27.8 27.0 28.0 25.2 25.3 27.1 20.5 31.3
26.3 24.0 26.2 20.2 23.7 34.0 17.1 26.8 23.7 24.9
(a) Use some R-functions to draw the asymptotic empirical likelihood
90% confidence region of the mean.
(b) Do the same based on parametric likelihood: assuming they are
bivariate normally distributed.
Remark: show your code and the source of R-functions you located.
Give sufficient interpretations.
4. A bivariate Gamma distributed random vector can be obtained as fol-
lows. Generate U1, U2 i.i.d. from the beta distribution with density function
[Γ(a + b)/(Γ(a)Γ(b))] u^{a−1}(1 − u)^{b−1} 1(0 < u < 1).
Generate W from a Gamma distribution with density function
[β^{a+b} w^{a+b−1} exp(−βw)/Γ(a + b)] 1(0 < w).
Let Yτ = W × (U1, U2). The distribution of Y is then the bivariate
gamma BG(a, b, β) with correlation ρ = a/(a + b). The marginal dis-
tribution of Y1 = U1W is gamma with shape parameter a and rate
parameter β.
(a) Verify that the marginal distribution of Y1 is as claimed.
(b) Let the sample size n = 100 and repeat the simulation N = 20000
times. Put a = 3, b = 5 and β = 0.5. The population mean is hence
(6, 6). Put the size of the test at α = 0.08. Set your seed value as
2018.
Write simulation R-code for the EL test to obtain (i) the null rejection
rate for H0 : µ = (6, 6), and (ii) a QQ plot of the EL test statistic
against the theoretical χ²₂ distribution.
5. Suppose {0, 2, 3, 3, 12} is an i.i.d. sample from some distribution F . Let
θ be the population mean.
(a) Write down the analytical expression of the profile log-likelihood
function given this data set, leaving the value of λ unspecified.
(b) Compute numerically the value of ℓn(4) based on the illustrative
data: the profile log likelihood at θ = 4.
(c) Compute numerically the value of ℓn(3) based on the illustrative
data: the profile log likelihood at θ = 3.
(d) Plot the profile log-likelihood function over the range θ ∈ (2.5, 5.5).
(Not required this year; do it if you are interested.)
6. Let η be the second moment of F .
(a) Give the corresponding analytical form of the profile log empirical
likelihood function of η with the same data given in the last question.
(b) Over what range of η is the profile empirical likelihood function
well-defined?
Chapter 23
Resampling methods
We have routinely started a statistical inference problem with “let x1, . . . , xn
be an i.i.d. sample from F”. When a parametric distribution family f(x; θ)
for F is proposed, the focus is to get θ well estimated. More generally, the
focus is to investigate a certain aspect of F. For this purpose, we may define
a function in the generic form θ = T(F). The formulation T(F) is commonly
applicable whether or not F is a member of a parametric family. In both
cases, a natural estimator of θ is θ̂n = T(Fn), where Fn(x) is the empirical
distribution based on the i.i.d. sample. When θ is the population mean, this
estimator is the sample mean.
A point estimate of θ is generally only a starting point of statistical
inference. An immediate question might be: what is the sampling distribution
of θ̂n? If F is a member of the Poisson family and T(F) is the population
mean, then θ̂ = X̄n, whose distribution is a scaled Poisson. If F is a member
of the normal family, then θ̂ = X̄n has a normal distribution with some mean
and variance.
If T(F) is more complex than the sample mean and F is a member of a generic
distribution family, the answer to “what is the distribution of θ̂n” is much
more complex. Typically, if n is very large, θ̂n = T(Fn) is asymptotically normal.
This partly answers the above question; yet even so, we are burdened with
analytically obtaining the mean and variance of the asymptotic distribution.
Bootstrap and other resampling methods provide alternative ways
to answer such questions. They are labour intensive in terms of computation,
but simple in terms of mathematical derivation. Because of these properties,
this line of approach is admired by many applied statisticians. At the same
time, resampling methods can be abused by those who know very little about
their limitations. This chapter aims to get you informed about the idea behind
the resampling methods and about when the methods work as intended.
In no circumstances should we blindly trust a cure-all magic.
23.2 Resampling procedures
Let x1, x2, . . . , xn be a set of i.i.d. observations. We have available the
empirical distribution function Fn(x), which is a good estimate of their common
distribution. Note that Fn(x) is the uniform distribution on these n observed
values; if the observations have ties, this interpretation remains useful and
harmless. Let X* denote a random variable with distribution Fn. That is,
pr(X* = xi) = 1/n for i = 1, 2, . . . , n.
In addition, let X*1, . . . , X*n be i.i.d. random variables with the same
distribution as that of X*. Let F*n(x) be the empirical distribution based on the
observed values of X*1, . . . , X*n. We regard F*n and its related entities as the
mirror image of Fn in the bootstrap world; for this reason, Fn is regarded as a
real world subject.
For parameters of interest of the form θ = T(F), we may estimate
their value by θ̂ = T(Fn). In the bootstrap world, we find their images
θ* = T(Fn) and θ̂* = T(F*n). The distribution of T(Fn) has as its bootstrap
world image the distribution of T(F*n), conditional on Fn. When F ≈ Fn,
as is the case when n is large, we anticipate that the distribution of T(Fn)
is well approximated by the conditional distribution of T(F*n) given the data,
namely Fn. We should take note that such claims are meaningful when T(F)
is smooth in F in some sense; the approximation does not always work, but it
often works well.
Along this line of thinking, the distributions of the sample mean X̄n and
sample variance s²n should be approximately the same as the conditional
distributions of
X̄*n = n^{-1} ∑_{i=1}^n X*i  and  s²*n = (n − 1)^{-1} ∑_{i=1}^n (X*i − X̄*n)².
In fact, these claims extend to their joint distribution and functions thereof.
Instead of working hard mathematically on the distributions of the sample
mean, sample variance and so on, we intend to approximate them by those of
their bootstrap images. At this point, we cannot help questioning the practicality
of this suggestion: if deriving the distribution of X̄n is hard, deriving the
distribution of X̄*n is likely harder. This is true. However, if we can generate
a million independent and distributionally identical copies of X̄*n, the
distribution of X̄*n can then be numerically determined to high accuracy. If this
idea works, we will have successfully unloaded our technical burden onto
computers, and we will be able to work on many problems without placing
restrictive normality assumptions or other parametric assumptions on the
populations. Remember, placing assumptions on the population does not
make the population satisfy these assumptions.
Bootstrapping and other resampling procedures are generally portrayed
as non-parametric methods. They are used for many purposes far beyond
merely approximating the distributions of X̄n and s²n. For example,
the bootstrap method can be used to approximate the distribution of the sample
median under very general conditions; such universality makes it a popular
choice. Furthermore, when the data do not have an i.i.d. structure, a carefully
designed scheme may be used to create a faithful resampling mirror image
in the bootstrap world. Hence, the resampling method is not restricted to
data with i.i.d. structure.
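For instance, the sample median case mentioned above can be sketched as follows (Python for illustration; the number of bootstrap samples B and the seed are arbitrary choices of ours):

```python
import random
import statistics

def bootstrap_medians(x, B=2000, seed=2019):
    """Draw B samples of size n from F_n (i.e. resample the data with
    replacement) and return the B resampled medians; their spread
    approximates the sampling distribution of the sample median."""
    rng = random.Random(seed)
    n = len(x)
    return [statistics.median(rng.choices(x, k=n)) for _ in range(B)]
```

The standard deviation of the returned values serves as a bootstrap standard error for the median, with no analytic derivation required.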
In many applications, a parametric model f(x; θ) is itself an acceptable
assumption. Suppose θ̂ is the maximum likelihood estimator of θ. What is
the distribution of θ̂? Even the large sample answer can be mathematically
difficult. In this case, one may study the distribution of θ̂ when the data
are a random sample from f(x; θ̂). To implement this idea by resampling,
one may generate samples from f(x; θ̂) and obtain a large number of
θ̂*, the MLEs based on the generated data sets. The empirical distribution
of the θ̂* can be an accurate approximation of the distribution of θ̂. When the
resampled data are drawn from a parametric distribution, the bootstrapping
method becomes the parametric bootstrap.
23.3 Bias correction
Let θ = T(F) be a parameter. Since the empirical distribution Fn(x) is a
good estimator of F, we have proposed to use θ̂ = T(Fn) to estimate θ. At
the same time, the bias of T(Fn) is itself a functional of F; thus, the bootstrap
can be used to estimate the bias of T(Fn) and subsequently reduce it.
Although E{Fn(x)} = F(x) for all x, it is not necessarily true that
E{T(Fn)} = T(F). If not, how large is the bias of θ̂ = T(Fn)? Let us
denote the bias by ξ = E[T(Fn)] − T(F).
Let X*1, . . . , X*n be i.i.d. observations from the empirical distribution Fn,
and let F*n(x) be the corresponding empirical distribution. Then θ̂* =
T(F*n) is a bootstrap version of θ̂. Its (conditional) bias is given by
ξ̂* = E*{T(F*n)} − T(Fn),
where E* is the expectation conditional on X1, . . . , Xn. If ξ̂* cannot be
evaluated theoretically, we can evaluate it by simulation. Is ξ̂* a good estimator
of ξ?
Example 23.1. (a) Assume $\theta = \int x\,dF(x)$ and $E|X| < \infty$. Consequently,
the bias $\xi = 0$. At the same time,
$$E_*T(F_n^*) = E_*\Big\{n^{-1}\sum_{i=1}^n X_i^*\Big\} = n^{-1}\sum_{i=1}^n X_i = T(F_n).$$
Thus, we also have ξˆ∗ = 0. This result shows that ξˆ∗ works fine as an
estimator of ξ. Of course, in this example, the exercise does not lead to any
useful results.
(b) Let us consider estimation of the parameter
$$\theta = T(F) = \{E(X)\}^2 = \Big[\int x\,dF(x)\Big]^2.$$
Assume that $\sigma^2 = \mathrm{var}(X_1) < \infty$. We have $T(F_n) = [\bar X_n]^2$ and its bias is
given by $\xi = n^{-1}\sigma^2$. The conditional expectation of $T(F_n^*)$ given $F_n$ is given
by
$$E_*T(F_n^*) = E_*\Big\{n^{-1}\sum_{i=1}^n X_i^*\Big\}^2
= \{E_*X_1^*\}^2 + n^{-1}\mathrm{var}_*(X_1^*)
= T(F_n) + (n-1)s_n^2/n^2.$$
Thus, if we estimate $\xi$ by $\hat\xi^* = E_*T(F_n^*) - T(F_n)$, we have $\hat\xi^* = (n-1)s_n^2/n^2$.
This is a very reasonable estimator of $\xi$, though we certainly do not have to
go through the bootstrap resampling procedure to find it.
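The conclusion of part (b) can be checked numerically. A sketch follows; the data, seed, and number of resamples are arbitrary choices, and the closed form $(n-1)s_n^2/n^2$ is the one derived above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.normal(1.0, 1.0, size=n)           # mu = 1, sigma^2 = 1
t_fn = x.mean() ** 2                       # T(F_n) = (x-bar)^2

# Bootstrap estimate of the bias E_*{T(F_n^*)} - T(F_n), by simulation.
B = 5000
t_star = np.array([rng.choice(x, size=n, replace=True).mean() ** 2 for _ in range(B)])
xi_boot = t_star.mean() - t_fn

# Theoretical conditional bias: (n - 1) s_n^2 / n^2 with s_n^2 the sample variance.
xi_theory = (n - 1) * x.var(ddof=1) / n ** 2
print(xi_boot, xi_theory)
```

The two numbers agree up to Monte Carlo error, as the derivation predicts.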
23.4 Variance estimation
Consider the problem of assessing the variance of $T(F_n)$. The bootstrap
method estimates the variance of $T(F_n)$ by the conditional variance of $T(F_n^*)$,
where $F_n^*$ is the empirical distribution based on an i.i.d. sample from the
distribution $F_n$.
Example 23.2. (a) Let the parameter of interest again be $\theta = T(F) = \int x\,dF$.
It is seen that $\hat\theta = T(F_n) = \bar X_n$. Let us work as if we did not have a
good idea of its variance, so we use the resampling method to estimate it.
Take an i.i.d. sample from the empirical distribution $F_n$,
and let $\bar X_n^*$ be the resulting sample mean. We now use the conditional variance
of $\bar X_n^*$ to estimate the variance of $\bar X_n$.
We can easily calculate the conditional variance as
$$\mathrm{var}_*(\bar X_n^*) = n^{-1}\mathrm{var}_*(X_1^*) = n^{-2}\sum_{i=1}^n (X_i - \bar X_n)^2.$$
Recall the true variance of $\bar X_n$ is $n^{-1}\sigma^2$ where $\sigma^2 = \mathrm{var}(X_1)$. The bootstrap
variance estimator is $n^{-1}s_n^2 + O(n^{-2})$. Clearly, we have
$$\mathrm{var}_*(\bar X_n^*)/\mathrm{var}(\bar X_n) \to 1$$
almost surely as n → ∞. This result shows that the bootstrap variance esti-
mator is well justified.
It is important to realize that var(X¯n)→ 0 as n→∞. Hence, even if a
variance estimator v̂ar(X¯n) makes
v̂ar(X¯n)− var(X¯n)→ 0
almost surely, this property alone does not make it a good estimator.
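The exact formula $\mathrm{var}_*(\bar X_n^*) = n^{-2}\sum(X_i-\bar X_n)^2$ from part (a) can be compared with a direct Monte Carlo evaluation of the same conditional variance; a sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
n = x.size

# Exact conditional variance of the bootstrap sample mean given F_n.
exact = ((x - x.mean()) ** 2).sum() / n ** 2

# Monte Carlo approximation of the same quantity by actual resampling.
B = 20000
xbar_star = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])
mc = xbar_star.var()
print(exact, mc)
```

As B grows, the simulated value converges to the closed-form conditional variance.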
(b) Let the parameter of interest be $\theta = \{\int x\,dF\}^2$. Its natural estimator
is $\hat\theta = \bar X_n^2$. How large is the variance of $\hat\theta$? Assume that $X$ has a finite 4th
moment. Then
$$E(\bar X_n^4) = E\{(\bar X_n - \mu) + \mu\}^4
= E\{(\bar X_n-\mu)^4 + 4(\bar X_n-\mu)^3\mu + 6(\bar X_n-\mu)^2\mu^2 + 4(\bar X_n-\mu)\mu^3 + \mu^4\}
= \mu^4 + \frac{6\mu^2\sigma^2}{n} + O(n^{-2}).$$
We also have, putting $s_n^2 = n^{-1}\sum (x_i - \bar x_n)^2$,
$$E^2(\bar X_n^2) = (\mu^2 + \sigma^2/n)^2 = \mu^4 + \frac{2\mu^2\sigma^2}{n} + O(n^{-2}).$$
Therefore,
$$\mathrm{var}(\bar X_n^2) = E(\bar X_n^4) - E^2(\bar X_n^2) = \frac{4\mu^2\sigma^2}{n} + O(n^{-2}).$$
In the bootstrap method, it is easy to get
$$\mathrm{var}_*(\{\bar X_n^*\}^2)
= 4\bar X_n^2 E_*\{\bar X_n^* - \bar X_n\}^2 + E_*\{\bar X_n^* - \bar X_n\}^4
- \{E_*[\bar X_n^* - \bar X_n]^2\}^2 + 4\bar X_n E_*\{\bar X_n^* - \bar X_n\}^3.$$
The last three terms are of order $O_p(n^{-2})$. The first one is $O_p(n^{-1})$
when the true mean is not zero. Thus, the leading term in this
bootstrap variance estimator is $(4\bar X_n^2/n^2)\sum_{i=1}^n (X_i - \bar X_n)^2$. This matches the
approximate variance of $\bar X_n^2$, which equals $(4\mu^2\sigma^2)/n$.
In both examples, we analytically obtained the properties of the bootstrap
method for bias and variance estimation of estimators of the form $T(F_n)$ for
a parameter $T(F)$. Analytical derivation is not always feasible. For instance,
if $\theta$ is the location parameter of a Cauchy distribution, we will not be
able to find $\mathrm{var}_*(T(F_n^*))$ by theoretical computation. Instead, computer
simulation is likely the only option, which can be carried out as follows.
First, draw an i.i.d. sample of size $n$, $x_1^*, \ldots, x_n^*$, from $F_n$ using some
computer package, and repeat this step $B$ times. Compute, based on the $b$th sample,
$$\hat\theta_b^* = T(F_n^*),$$
where $F_n^*(x) = n^{-1}\sum_{i=1}^n 1(x_i^* \le x)$ is the empirical distribution based
on the bootstrap sample.
Next, define the simulated $\mathrm{var}_*(T(F_n^*))$ value to be
$$v_*^2 = \frac{1}{B-1}\sum_{b=1}^B \{\hat\theta_b^* - \bar\theta^*\}^2,$$
where $\bar\theta^* = B^{-1}\sum_{b=1}^B \hat\theta_b^*$. If $\theta$ is a vector, we need to modify the above
formula to give a variance-covariance matrix.
Under some conditions, $v_*^2$ is a consistent estimator of $\mathrm{var}(T(F_n))$. Yet
we must be cautious about the meaning of consistency here:
$$v_*^2/\mathrm{var}(T(F_n)) \to 1$$
in a suitable mode of convergence.
One may also define the bootstrap variance estimator to be
$$\tilde v_*^2 = \frac{1}{B-1}\sum_{b=1}^B \{\hat\theta_b^* - \hat\theta\}^2.$$
Since the difference between $\hat\theta$ and $\bar\theta^*$ is likely very small in an asymptotic
argument, both estimators are well justified. Neither can be judged as
"wrong", as many would like to ask.
In addition, a simulation study will likely find situations where $v_*^2$ is more
accurate and other situations where $\tilde v_*^2$ is superior.
In summary, being a statistician does not make one an authority to decide
between these estimators. We do notice that $\tilde v_*^2$ resembles a mean squared error
and therefore takes a larger value. If one prefers a more conservative
statistical procedure, $\tilde v_*^2$ is a good choice.
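For a case with no closed form, such as the sample median of Cauchy data, both $v_*^2$ and $\tilde v_*^2$ are computed by resampling. A sketch (the sample size and $B$ are my own choices):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 101
x = rng.standard_cauchy(size=n)        # location parameter 0
theta_hat = np.median(x)               # T(F_n): the sample median

B = 1999
theta_star = np.array([np.median(rng.choice(x, size=n, replace=True))
                       for _ in range(B)])

v2 = theta_star.var(ddof=1)                                 # centred at theta-bar*
v2_tilde = ((theta_star - theta_hat) ** 2).sum() / (B - 1)  # centred at theta-hat
print(v2, v2_tilde)
```

By construction $\tilde v_*^2 \ge v_*^2$, in line with the remark that $\tilde v_*^2$ resembles a mean squared error and takes a larger value.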
23.5 The cumulative distribution function
Consider the problem of approximating the distribution of T (Fn) by that of
T (F ∗n). The idea here is the same as the one for variance estimation. We
hope that the conditional distribution of T (F ∗n) is a good approximation to
the distribution of T (Fn).
Consider the simplest situation where the parameter to be estimated is
$\theta = T(F) = \int x\,dF$. The estimator of $\theta$ is the sample mean $\bar X_n$, and we aim
to estimate the cumulative distribution function of $\bar X_n$.
Under the assumption of a finite second moment, $\sqrt n(\bar X_n - \theta)/\sigma$ is
asymptotically normal. This fact pretty much tells us not to bother
estimating its distribution. Nevertheless, if we insist on using the bootstrap to
estimate the distribution of $\bar X_n$, we should have, as $n \to \infty$,
$$\mathrm{pr}\Big(\frac{\sqrt n(\bar X_n^* - \bar X_n)}{s_n} \le x \,\Big|\, F_n\Big) \to \Phi(x) \quad \text{almost surely}$$
for any $x$, with $s_n^2$ being the sample variance and $\Phi(x)$ being the c.d.f. of the
standard normal distribution. Note that this is a limit in which both the event
under investigation and the conditioning change as $n$ increases. We may use
the central limit theorem for triangular arrays to obtain the above result.
To prove the asymptotic normality, the Berry-Esseen bound is the simplest
tool, though it requires relatively stronger conditions.
Theorem 23.1. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a distribution $F$
with finite mean $\theta$, finite variance $\sigma^2$ and finite $E|X|^3$. Then, we have
$$\sup_x \Big|\mathrm{pr}\Big(\frac{\sqrt n(\bar X_n - \theta)}{\sigma} \le x\Big) - \Phi(x)\Big|
\le \frac{33}{4}\,\frac{E(|X - \theta|^3)}{\sqrt n\,\sigma^3}.$$
Note that this conclusion holds for all $n$. In other words, it is not an
asymptotic result but a universal one. At the same time, it shows that the
precision of the normal approximation improves with increased sample size
at rate $n^{-1/2}$. Applying this bound to the bootstrap sample, we find
$$\sup_x \Big|\mathrm{pr}\Big(\frac{\sqrt n(\bar X_n^* - \bar X_n)}{s_n} \le x \,\Big|\, F_n\Big) - \Phi(x)\Big|
\le \frac{33}{4}\,\frac{E_*|X_1^* - \bar X_n|^3}{\sqrt n\,[E_*|X_1^* - \bar X_n|^2]^{3/2}}.$$
Again, this result holds for any $F_n$ and $n$.
In view of such an inequality, the asymptotic normality is valid when
$$\frac{E_*|X_1^* - \bar X_n|^3}{[E_*|X_1^* - \bar X_n|^2]^{3/2}} = o(n^{1/2})$$
almost surely or in any appropriate mode.
Suppose the model satisfies $E|X_1|^3 < \infty$ and, without loss of generality,
$\theta = 0$ and $\sigma^2 = 1$. In this case, we have
$$E_*|X_1^* - \bar X_n|^3 = \frac{1}{n}\sum_{i=1}^n |X_i - \bar X_n|^3 \to E|X_1|^3, \quad \text{almost surely}$$
and
$$E_*|X_1^* - \bar X_n|^2 = \frac{1}{n}\sum_{i=1}^n |X_i - \bar X_n|^2 \to \sigma^2 = 1, \quad \text{almost surely}.$$
Thus, it is trivial to find that
$$\frac{E_*|X_1^* - \bar X_n|^3}{[E_*|X_1^* - \bar X_n|^2]^{3/2}} \to E|X_1|^3.$$
Since the limit is finite, the ratio is $o(n^{1/2})$; hence, when $E|X_1|^3 < \infty$, the
(conditional) asymptotic normality is proved. The simplicity of the proof is
bought by an unnecessarily strong assumption: the finiteness of the third moment.
A generalization can easily be made. If $g$ is a smooth function, then $g(\bar X_n)$
is asymptotically normal. By the same logic, $g(\bar X_n^*)$ is also
asymptotically normal conditional on $F_n$. Thus, the conditional distribution
of $g(\bar X_n^*)$ still matches that of $g(\bar X_n)$.
Although the above example is very supportive of the usefulness of the
bootstrap method, it is not without its limitations. For the sample mean,
the asymptotic normality can be established easily, and the calculation of the
limiting distribution is also very simple. Why should we bootstrap in these
simple situations? In situations where the asymptotics become complex, do
we have a good theory to support the bootstrap?
One crucial justification of the bootstrap method comes from Singh
(1981). There are many results contained in this paper; here I only pick out
a relatively simple case.
Theorem 23.2. (Singh, 1981). Assume $X_1, \ldots, X_n$ are i.i.d. samples from
$F$. Assume $EX_1 = \theta$, $\sigma^2 = \mathrm{var}(X_1) > 0$, and $E|X_1|^3 < \infty$. Let $\bar X_n$ be
the sample mean and $s_n^2$ be the sample variance. In addition, let $\bar X_n^*$ be the
bootstrap sample mean. Then
$$\sup_x \Big|\mathrm{pr}\Big(\frac{\sqrt n(\bar X_n - \theta)}{\sigma} \le x\Big)
- \mathrm{pr}\Big(\frac{\sqrt n[\bar X_n^* - \bar X_n]}{s_n} \le x \,\Big|\, F_n\Big)\Big| = O(n^{-1/2})$$
almost surely. If $F$ is a continuous distribution, then
$$\sup_x \Big|\mathrm{pr}\Big(\frac{\sqrt n(\bar X_n - \theta)}{\sigma} \le x\Big)
- \mathrm{pr}\Big(\frac{\sqrt n[\bar X_n^* - \bar X_n]}{s_n} \le x \,\Big|\, F_n\Big)\Big| = o(n^{-1/2}).$$
The first result is implied by the Berry-Esseen bound. We do not prove
the second result here. The second result shows that the bootstrap
approximation has better precision than the normal approximation. This is
surprisingly good news.
The bootstrap sampling procedure for approximating the c.d.f. of some
functional $T(F_n)$ is very simple. First, we draw an i.i.d. sample of size $n$,
$X_1^*, \ldots, X_n^*$, from $F_n$ using some computer software package. We repeat this
step a sufficiently large number $B$ of times. Next we compute, based on the $b$th sample,
$$\hat\theta_b^* = T(F_n^*),$$
where $F_n^*(x) = n^{-1}\sum_{i=1}^n 1(x_i^* \le x)$. The last step is to define the estimated
cumulative distribution function to be
$$\hat H_n(t) = B^{-1}\sum_{b=1}^B 1(\hat\theta_b^* \le t).$$
Needless to say, under some conditions, $\hat H_n(t)$ is consistent for $H(t) = \mathrm{pr}(\hat\theta \le t)$.
We should be aware that the bootstrap approximation is meaningful only when
the limiting distribution of $T(F_n)$ does not degenerate. The result of
Singh, for instance, is effective for
$$T(F_n) = \frac{\sqrt n(\bar X_n - \theta)}{\sigma}.$$
A similar result for
$$T(F_n) = \frac{\bar X_n - \theta}{\sigma}$$
would be meaningless. One should be aware that both of these $T(F_n)$ are not
statistics, but functions of both the data and the parameter.
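The three-step recipe for $\hat H_n(t)$ can be sketched as follows, here for the studentized functional $\sqrt n(\bar X_n^* - \bar X_n)/s_n$ evaluated at a single point $t$; the data, seed, and constants are arbitrary choices:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
n = 80
x = rng.exponential(1.0, size=n)
s_n = x.std(ddof=1)

# Step 1 and 2: resample B times and compute the statistic on each bootstrap sample.
B = 4000
t_stats = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)
    t_stats[b] = sqrt(n) * (xs.mean() - x.mean()) / s_n

# Step 3: H-hat_n(t) = B^{-1} sum_b 1(theta*_b <= t), here at t = 1.0.
t = 1.0
H_hat = (t_stats <= t).mean()
Phi = 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard normal c.d.f., for comparison
print(H_hat, Phi)
```

The bootstrap c.d.f. estimate is close to the normal-approximation value, as the theory above indicates.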
23.6 Recipes for confidence limit
We still discuss the inference problem under the assumption that an i.i.d. sample
of size $n$ from distribution $F$ is given. The parameter of interest is some
$\theta = T(F)$. It is estimated by $\hat\theta = T(F_n)$, where $F_n$ is the empirical
distribution function. We will also use some variance estimator based on $F_n$ and
denote it as $\hat\sigma$. It is not necessarily the variance of the distribution $F$, but some
quantity used for constructing pivotal quantities.
Percentile method. Consider the case when $\hat\theta - \theta$ is more or less a pivotal
quantity. Suppose that its distribution is given by $H(x)$, namely,
$$\mathrm{pr}(\hat\theta - \theta \le x) = H(x).$$
Let $H^{-1}(\alpha)$ be the $\alpha$th quantile of $H(x)$. Then
$$\mathrm{pr}(\hat\theta - \theta \ge H^{-1}(\alpha)) = 1 - \alpha$$
when $H(\cdot)$ is a strictly increasing function. This implies that an upper $1-\alpha$
confidence limit for $\theta$ is given by
$$\hat\theta - H^{-1}(\alpha).$$
A two-sided confidence interval can be formed by using one upper and one
lower confidence limits.
Let $\hat H(x)$ be an estimator of $H(x)$. Define
$$\theta_{BP} = \hat\theta - \inf\{t: \hat H_n(t) \ge \alpha\} = \hat\theta - \hat H^{-1}(\alpha).$$
This is an approximate upper confidence bound for $\theta$ because
$$\mathrm{pr}(\theta < \theta_{BP}) = \mathrm{pr}(\hat\theta - \theta \ge \hat H_n^{-1}(\alpha))
\approx \mathrm{pr}(\hat\theta - \theta \ge H^{-1}(\alpha)) = 1 - \alpha.$$
Computing confidence limits based on the above approach is generally
called the percentile method. The subscript BP stands for "bootstrap percentile",
though we motivated this upper limit without a bootstrapping procedure.
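A sketch of the percentile upper limit $\hat\theta - \hat H^{-1}(\alpha)$, with $\hat H$ obtained by bootstrapping $\hat\theta^* - \hat\theta$; the use of the mean of an exponential sample is my own choice of example:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 60, 0.05
x = rng.exponential(2.0, size=n)
theta_hat = x.mean()

# Bootstrap the distribution H of theta-hat - theta via theta* - theta-hat.
B = 1999
delta_star = np.array([rng.choice(x, size=n, replace=True).mean() - theta_hat
                       for _ in range(B)])

# H-hat^{-1}(alpha): the alpha-quantile of the bootstrap distribution.
q_alpha = np.quantile(delta_star, alpha)
upper_bp = theta_hat - q_alpha      # approximate upper 1 - alpha confidence limit
print(upper_bp)
```

Since the $\alpha$-quantile of the centred bootstrap distribution is negative here, the upper limit sits above $\hat\theta$, as it should.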
Ordinary and studentized methods. Consider the case where we have estimators
$\hat\theta$ and $\hat\sigma$ for $\theta$ and for the standard error of $\sqrt n\,\hat\theta$. Note that the latter is not the
population standard error. In many cases, it might be more realistic that
$$\frac{\sqrt n(\hat\theta - \theta)}{\sigma}, \qquad \frac{\sqrt n(\hat\theta - \theta)}{\hat\sigma}$$
are approximate pivotal quantities. If they are, without any ambiguity, we
may define
$$H(x) = \mathrm{pr}\Big\{\frac{\sqrt n(\hat\theta - \theta)}{\sigma} \le x\Big\}$$
and
$$K(y) = \mathrm{pr}\Big\{\frac{\sqrt n(\hat\theta - \theta)}{\hat\sigma} \le y\Big\}.$$
If we had complete information about $H(x)$ and $K(y)$, constructing confidence
intervals for $\theta$ would be a simple task.
We further notice that the task reduces to finding upper and lower
confidence limits. Let $x_\alpha = H^{-1}(\alpha)$ and $y_\alpha = K^{-1}(\alpha)$, and let $1-\alpha$ be the targeted
level of confidence. Depending on whether we have knowledge of $H$ or of
$K$, the lower confidence limits are respectively
$$\hat\theta_{ord}(\alpha) = \hat\theta - x_{1-\alpha}\sigma/\sqrt n$$
and
$$\hat\theta_{stud}(\alpha) = \hat\theta - y_{1-\alpha}\hat\sigma/\sqrt n.$$
Both of them have the format we presented in a previous chapter. The
subscripts, ord and stud, are abbreviations for ordinary and studentized. The
quantiles would have been $z_{1-\alpha}$ or $t_{1-\alpha}$ had $H$ and $K$ been c.d.f.'s of normal and t
distributions.
Hybrid and backward methods. As is well known, when the sample
size is large, $\hat\sigma \approx \sigma$ and hence $x_\alpha \approx y_\alpha$. One may therefore use a hybrid lower
confidence limit:
$$\hat\theta_{hyb} = \hat\theta - x_{1-\alpha}\hat\sigma/\sqrt n.$$
This can be compared to the situation where the quantile of the t-distribution should
be used, yet we mistakenly use the quantile of the normal distribution.
Under the normal distribution, $z_\alpha = -z_{1-\alpha}$ because the normal distribution
is symmetric. If $H$ is symmetric, then $x_\alpha = -x_{1-\alpha}$ for the same reason.
Hence, when $H$ is believed to be symmetric, we may use another lower confidence
limit:
$$\hat\theta_{back}(\alpha) = \hat\theta + x_\alpha\hat\sigma/\sqrt n.$$
It may be confusing to present so many possibilities. Which one is
correct? The answer depends on what we mean by "correct". If a (random)
interval covers the true value of $\theta$ with probability $1-\alpha+o(1)$, we feel
that it is okay, or "correct". When the sizes of these intervals are not
taken into consideration, we may want to examine the exact size of the
$o(1)$ term in the coverage probability.
23.7 Implementation based on resampling
Having complete knowledge of $H(x)$ and $K(x)$ is generally not possible. More often
than not, they also depend on unknown parameter values. Nonetheless,
bootstrap simulation can be used to properly estimate $H$ and $K$ when
the population distribution is given by $F$ and the parameter $\theta$ is a functional
of $F$.
We will see what it means by "can be estimated". The distribution $H$ is
estimated by
$$\hat H(x) = \mathrm{pr}(\sqrt n(\hat\theta^* - \hat\theta) \le \hat\sigma x \mid F_n).$$
The distribution $K$ is estimated by
$$\hat K(x) = \mathrm{pr}(\sqrt n(\hat\theta^* - \hat\theta) \le \hat\sigma^* x \mid F_n).$$
Once they are obtained via bootstrap simulation, we define $\hat x_\alpha = \hat H^{-1}(\alpha)$
and $\hat y_\alpha = \hat K^{-1}(\alpha)$. All lower confidence limits proposed in the last section are
transformed to bootstrap lower confidence limits by putting a hat on either
$x_\alpha$ or $y_\alpha$.
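Putting hats on $x_\alpha$ and $y_\alpha$ by simulation might look like the following sketch, with $\hat\theta = \bar x$, $\hat\sigma = s_n$, and $\hat\sigma^*$ recomputed on each bootstrap sample; the data and constants are my own choices:

```python
import numpy as np

rng = np.random.default_rng(8)
n, alpha = 50, 0.05
x = rng.gamma(2.0, 1.0, size=n)
theta_hat, sigma_hat = x.mean(), x.std(ddof=1)

B = 1999
h_stats = np.empty(B)   # for H-hat: sqrt(n)(theta* - theta-hat)/sigma-hat
k_stats = np.empty(B)   # for K-hat: sqrt(n)(theta* - theta-hat)/sigma-hat*
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)
    d = np.sqrt(n) * (xs.mean() - theta_hat)
    h_stats[b] = d / sigma_hat
    k_stats[b] = d / xs.std(ddof=1)

x_hi = np.quantile(h_stats, 1 - alpha)   # x-hat_{1-alpha}
y_hi = np.quantile(k_stats, 1 - alpha)   # y-hat_{1-alpha}

# Hybrid and studentized bootstrap lower confidence limits.
theta_hyb = theta_hat - x_hi * sigma_hat / np.sqrt(n)
theta_stud = theta_hat - y_hi * sigma_hat / np.sqrt(n)
print(theta_hyb, theta_stud)
```

Both hatted quantiles are positive here, so both lower limits fall below $\hat\theta$, as expected.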
Now, which one makes the $o(1)$ term inside the coverage probability $1-\alpha+o(1)$
the smallest? Peter Hall (AOS, some year) had a discussion paper specifically
for this problem. The technical discussion is too complex for this course,
and the results are not that insightful either. Without going back to the paper
itself, I put down my unverified memory here: the studentized approach
together with bootstrap resampling has this $o(1)$ reduced to $O(n^{-1})$; without
studentization, this $o(1)$ is $o(n^{-1/2})$. Both conclusions are obtained under the
assumption that $\hat\theta$ is a smooth function of the sample mean $\bar X_n$. For example,
$$\theta = \frac{\mu}{\sigma}$$
has its estimator given by
$$\hat\theta = \frac{\bar x_n}{\sqrt{\overline{x^2} - (\bar x)^2}}.$$
This estimator is a smooth function of the sample mean of $(x_i, x_i^2)$.
23.8 A word of caution
The bootstrap method is generally used to simulate the variance and the
distribution of a point estimator. Based on bootstrap simulation, we can often
subsequently make inferences on various parameters. Most noticeably, the
results are used to construct confidence intervals for a parameter
$\theta$ and to test hypotheses such as $\theta = \theta_0$. We are often freed from complex
technical issues.
At the same time, one has to have a good point estimator $\hat\theta$ before the
resampling procedure can even start. The statistical properties of the
corresponding data analysis are largely determined by those of $\hat\theta$. The resampling
methods help to determine these properties; they do not instill good properties
into these procedures.
There is no guarantee that the resampling methods always lead to valid
statistical inferences. For instance, a $1-\alpha$ level confidence
interval may have far lower coverage probability, and the under-coverage
problem may not go away even as the sample size increases. The theory
of mathematical statistics cannot be thrown out simply because the
resampling procedure is powerful at freeing us from a lot of technical
derivations.
23.9 Assignment problems
1. Let $X_1, \ldots, X_n$ be a random sample from the exponential distribution with
density function
$$f(x; \theta) = \theta^{-1}\exp(-\theta^{-1}x), \quad x > 0.$$
Consider the case $n = 201$ and $\theta = 1$.
(a) Theoretically determine the median of this distribution.
(b) Generate 1000 data sets with n = 201 to estimate the bias and
variance of the sample median for estimating the population median.
(c) Bootstrap the first sample in (b) to obtain estimates of the bias and
variance of the sample median for estimating the population median.
2. Continue from the last problem.
(a) Use a bootstrap method to construct a 95% confidence interval for
$\theta$ based on the following two asymptotic pivotal quantities:
$$T_1 = \frac{\sqrt n(\bar x - \theta)}{\theta}; \qquad T_2 = \frac{\sqrt n(\bar x - \theta)}{s_n}.$$
Set $B = 1999$. Present your intervals for the first data set generated.
(b) Repeat (a) for $N = 100$ data sets. Compute their average lower
and upper confidence limits, average lengths and the standard error of the
lengths. Which one do you recommend based on these outcomes?
3. Generate 7 observations from uniform (0, 1) distribution.
(a) How many distinct bootstrap samples (standard bootstrap as in
this book) are possible?
(b) Draw the c.d.f. of X¯∗ and compute its 0.25, 0.5 and 0.75 quantiles.
(c) Compute the difference between the variance of X¯∗ and the variance
of X¯ both numerically and by theoretical derivation. Of course, two
results are the same if round-off is ignored.
4. Suppose we have an i.i.d. sample of size $n = 99$ from the Cauchy distribution with
only a location parameter $\theta$; namely, the density function is given by
$$f(x; \theta) = \frac{1}{\pi\{1 + (x-\theta)^2\}}.$$
One wishes to test the hypothesis $H_0: \theta = 0$ against $H_1: \theta \ne 0$. Two
potential approaches are: (1) a score test and (2) a Wald test based on the
sample median (the 50th order statistic).
(a) For the purpose of implementation, you need to work out some
additional details for these two tests such as the asymptotic variances.
Get them done and present your work and results.
(b) Suppose the precisions of the asymptotic distributions for these two
tests are not sufficiently high. Design bootstrap procedures to carry out
these two tests.
(c) Implement the bootstrap procedures in (b) based on $B = 1999$
and repeat $N = 500$ times to obtain the observed rejection rate under the null
hypothesis.
5. Suppose {0, 2, 3, 5, 10} is an i.i.d. sample from some distribution F .
(a) Let X∗ be a single bootstrap observation in the conventional form
used in this course. What is the distribution of X∗ (conditional on
these observed values)? Compute its mean and variance.
(b) Let $X_1^*, \ldots, X_5^*$ be an i.i.d. bootstrap sample from $F_n$, where $n = 5$.
Let $X_{(n)}^* = \max\{X_1^*, \ldots, X_5^*\}$. What is the conditional distribution of
$X_{(n)}^*$? Namely, work out its probability mass function.
6. Suppose $x_1, \ldots, x_n$ is an i.i.d. sample from a distribution $F$ with
finite and positive variance $\sigma^2$.
Let us estimate the population mean $\theta$ by $\hat\theta = \bar x$ as usual, and we
certainly have $\mathrm{var}(\hat\theta) = \sigma^2/n$.
However, a researcher insists on using the bootstrap method to estimate
the variance $\mathrm{var}(\hat\theta)$, or more precisely $n\,\mathrm{var}(\hat\theta)$. In addition, she suggests
to generate $b = 1, 2, \ldots, 2B$ sets of conditionally i.i.d. samples of size $n$,
$$x_1^*, x_2^*, \ldots, x_n^*,$$
from the empirical distribution $F_n(x)$. For each $b$, she would compute
$$\bar x_b^* = n^{-1}\sum_{i=1}^n x_i^*.$$
She then defines
$$\hat\nu^* = (2B)^{-1}\sum_{b=1}^B (\bar x_{2b-1}^* - \bar x_{2b}^*)^2$$
as an estimate of $\mathrm{var}(\hat\theta)$.
(a) Given $F_n$, compute the conditional mean and variance of $x_1^*$.
(b) For each $b$, compute the conditional mean and variance of $\bar x_b^*$ given
$F_n$.
(c) Show that the conditional expectation
$$E\{\hat\nu^* \mid F_n\} = \frac{1}{n^2}\sum_{i=1}^n (x_i - \bar x_n)^2.$$
(d) Show that as $B \to \infty$,
$$n\hat\nu^* \xrightarrow{p} n^{-1}\sum_{i=1}^n (x_i - \bar x_n)^2.$$
(e) Show that $n\{\hat\nu^* - \mathrm{var}(\hat\theta)\} \to 0$ in probability as $B, n \to \infty$. More
precisely, almost surely or in probability,
$$\lim_{n\to\infty}\lim_{B\to\infty} n\{\hat\nu^* - \mathrm{var}(\hat\theta)\} = 0.$$
Chapter 24
Multiple comparison
One-way ANOVA is a typical method to compare a number of treatments
in terms of a specific measurement of some experimental outcomes. For
example, an experiment might be designed to compare the volumes of harvest
when different fertilizers are used.
Let the number of treatments be $k$, and let $N = n_1 + n_2 + \cdots + n_k$ experimental
units be randomly assigned to the $k$ treatments, with $n_1, n_2, \ldots, n_k$ units each. Let
the response variable be denoted as $y$, and suppose the $j$th treatment is replicated
$n_j$ times. The outputs of the experiment can be displayed as
$$y_{11}, y_{12}, \ldots, y_{1n_1};$$
$$y_{21}, y_{22}, \ldots, y_{2n_2};$$
$$\ldots$$
$$y_{k1}, y_{k2}, \ldots, y_{kn_k}.$$
The output yij is the reading of the unit assigned to the ith treatment
and the jth replication.
A linear model for this setup is
$$y_{ij} = \eta + \tau_i + \epsilon_{ij}$$
for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$. We assume $\eta$ is the overall mean, and
$\tau_i$ is the mean response from the $i$th treatment after subtracting the overall
mean. The error term $\epsilon_{ij}$ is what cannot be explained by the treatment
effect $\tau_i$. The statistical analysis is often done based on the assumption that
$$\epsilon_{ij} \sim N(0, \sigma^2)$$
and that they are independent of each other. The normality assumption
and the equal-variance assumption are the ones that may be violated in the
real world. The decomposition of the treatment means is always feasible.
24.1 Analysis of variance for one-way layout.
Let
$$\bar y_{\cdot\cdot} = N^{-1}\sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij}$$
be the overall sample mean. Let
$$\bar y_{i\cdot} = n_i^{-1}\sum_{j=1}^{n_i} y_{ij}$$
be the sample mean restricted to samples from the $i$th treatment. In general,
whenever an index is replaced by a dot, the resulting notation represents the
sample mean over the corresponding index. For example, $\bar y_{\cdot 1}$ would be the
average of $y_{11}, y_{21}, \ldots, y_{k1}$.
Now we may decompose the response as
yij = y¯·· + (y¯i· − y¯··) + (yij − y¯i·)
which will also be written as
yij = ηˆ + τˆi + rij.
These quantities marked with hats are estimates/estimators of the corre-
sponding parameters in the linear model.
The sum of squares in (y¯i· − y¯··) represents the variation in the mean
responses between different levels of the factor (or between treatments), while
rij = (yij − y¯i·) represents the residual variations. The residual variation is
the variation not explainable by the treatment effect.
The analysis of variance aims to compare the relative sizes of these two
sources of variation. The resulting ANOVA table is as follows.
ANOVA for One-Way Layout
Source       D.F.     SS
Treatment    $k-1$    $\sum_{i=1}^k n_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$
Residual     $N-k$    $\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i\cdot})^2$
Total        $N-1$    $\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{\cdot\cdot})^2$
One may notice that each sum of squares contains $N$ terms when
duplicated entries are also counted. Consider the test problem on the null
hypothesis:
$$H_0: \tau_1 = \tau_2 = \cdots = \tau_k.$$
The alternative hypothesis is that not all population means are equal. The
test statistic we commonly use is
$$F = \frac{(k-1)^{-1}\sum_{i=1}^k n_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}{(N-k)^{-1}\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij} - \bar y_{i\cdot})^2}$$
which, under the normality/equal-variance assumptions and $H_0$, has an F-distribution
with $k-1$ and $N-k$ degrees of freedom. This might be an opportunity for us to
refresh our memory of the desired properties of a test statistic. It is also a
useful exercise to recall what a UMPU test is.
If H0 is true, the randomness of the test statistic F is completely deter-
mined and it does not depend on any external factors. Hence, the following
p-value computation is justified:
p = pr(F > Fobs).
Rejecting H0 when p < 0.05 is a common practice.
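The computations in this section can be sketched directly from the ANOVA table; the three small treatment groups below are made-up data:

```python
import numpy as np
from scipy import stats

# Made-up responses for k = 3 treatments with unequal group sizes.
groups = [np.array([4.1, 3.9, 4.5, 4.0]),
          np.array([5.0, 5.2, 4.8, 5.1, 4.9]),
          np.array([4.4, 4.6, 4.3])]
k = len(groups)
N = sum(g.size for g in groups)
grand = np.concatenate(groups).mean()

# Treatment and residual sums of squares, as in the ANOVA table.
ss_treat = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
ss_resid = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_treat / (k - 1)) / (ss_resid / (N - k))
p = stats.f.sf(F, k - 1, N - k)    # p = pr(F > F_obs)
print(F, p)
```

The same F statistic and p-value can be obtained with `scipy.stats.f_oneway(*groups)`, which is a convenient cross-check.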
24.2 Multiple comparison
Once (and if) the null model is rejected, it is natural to ask: which pair or
pairs of treatments are the culprits that lead to the rejection? The rejection
may be caused by a single treatment that has substantially different effect
from the rest. It may also be caused by smaller differences between all
treatments. Of course, the rejection may be erroneous.
Regardless of these possibilities, let us ask which pairs of treatments are
significantly different. The technique used to address this question is called
multiple comparison, because many pairs are being compared simultaneously.
Borrowing the idea of the two-sample test, we may define
$$t_{ij} = \frac{\bar y_{j\cdot} - \bar y_{i\cdot}}{\sqrt{(1/n_i + 1/n_j)\hat\sigma^2}},$$
where $\hat\sigma^2$ is the variance estimator from the ANOVA table. The denominator
of this t-statistic is different from that of the usual two-sample t-test; it is obtained
by pooling information from all $k$ treatments. Each $t_{ij}$ has a t-distribution
with N − k degrees of freedom. Hence, if a level-α test is desired, we may
reject H0 : µi = µj when
|tij| > t(1− α/2;N − k).
This test has probability α to falsely reject the hypothesis that the corre-
sponding pair of treatment means are equal.
Suppose we set $\alpha = 0.05$ as in common practice and $k = 5$, so there are
10 pairs of treatments. Even if no two treatments differ, there is a chance of
about 5% of declaring any particular one of them significant. The chance of
declaring at least one of them significant by simple t-tests is much larger,
possibly approaching 50%. Such a high probability of false rejection is clearly
not acceptable.
24.3 The Bonferroni Method
To address the problem of inflated type I error in multiple comparison, we
could simply set up a high standard for every pair of i and j such that the
24.4. TUKEY METHOD 351
overall type I error is guaranteed to be lower than pre-specified value α. Let
k′ = k(k − 1)/2 be the number of possible treatment pairs. We may reject
Hij : µi = µj only if
$$|t_{ij}| > t(1 - \alpha/(2k'); N - k).$$
Since the probability that any given pair of treatments is wrongfully judged different
is no more than $\alpha/k'$ (note this is a two-sided test), and there are $k'$ such
pairs, it is simple to see that the chance that at least one pair is declared
different, when none of them differ, is controlled to be at most $100\alpha\%$.
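A sketch of the Bonferroni-adjusted pairwise comparisons, with $\hat\sigma^2$ pooled from the ANOVA residuals; the data are made up:

```python
import numpy as np
from itertools import combinations
from scipy import stats

groups = [np.array([4.1, 3.9, 4.5, 4.0]),
          np.array([5.0, 5.2, 4.8, 5.1, 4.9]),
          np.array([4.4, 4.6, 4.3])]
k = len(groups)
N = sum(g.size for g in groups)
alpha = 0.05
kprime = k * (k - 1) // 2

# Pooled variance estimator from the ANOVA residual sum of squares.
sigma2_hat = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

# Bonferroni-adjusted two-sided critical value t(1 - alpha/(2 k'); N - k).
crit = stats.t.ppf(1 - alpha / (2 * kprime), N - k)

rejected = []
for i, j in combinations(range(k), 2):
    gi, gj = groups[i], groups[j]
    t_ij = (gj.mean() - gi.mean()) / np.sqrt((1 / gi.size + 1 / gj.size) * sigma2_hat)
    if abs(t_ij) > crit:
        rejected.append((i, j))
print(rejected)
```

The adjusted critical value is larger than the unadjusted $t(1-\alpha/2; N-k)$, which is exactly the conservatism discussed next.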
24.4 Tukey Method
Particularly when $k$ is large (5 or more), the Bonferroni method is too
conservative: the actual type I error can be far lower than the
targeted level $100\alpha\%$. Having a small type I error is not strictly wrong in
terms of being a valid test. The real drawback of such a test is that it
increases the type II error. When $k$ is large, the statistical power for detecting
any departure from the null hypothesis is too small if the conservative
Bonferroni method is used. If such a method is used as a standard, scientists
have to work unjustifiably harder to prove their point (even if their point is
valid).
Let us define
$$t^* = \sqrt 2\,\max\{|t_{ij}|\}.$$
It is seen that $t^*$ has a distribution which does not depend on any unknown
parameters under the null hypothesis that all treatment effects are equal.
However, its distribution does depend on $k$ and $N-k$, and in fact also on how the
$N$ units are divided and assigned to the $k$ treatments. It is a test statistic with
almost all the desirable properties we specified for a pure statistical significance
test. Unlike the t-distribution, however, the c.d.f. of this distribution is not as
well documented. When all the $n_j$ are equal, the distribution might be named
after Tukey. Let $qtukey(1-\alpha; k, N-k)$ be its upper $\alpha$ quantile when all $n_i$
are equal. That is, under that restriction,
$$P\{t^* > qtukey(1-\alpha; k, N-k)\} = \alpha$$
for any $\alpha \in (0, 1)$. We may reject the hypothesis that the $i$ and $j$ pair of
treatments have equal means when $|t_{ij}| > qtukey(1-\alpha; k, N-k)/\sqrt 2$.
The type I error of this approach is only approximate when $n_1, \ldots, n_k$
are not all equal. In fact, it is bounded by $\alpha$, a result which has been proved
by someone, if my memory serves.
My observation: Tukey's method is not so much a new method. It
simply requires us to use a critical value such that the probability of wrongfully
rejecting any pair of hypotheses $\tau_i = \tau_j$ is below $100\alpha\%$.
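When the critical value of $t^*$ is not tabulated, for instance with unequal $n_i$, its null distribution can be simulated. A rough sketch (the group sizes and number of repetitions are arbitrary choices):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
sizes = [4, 5, 3]                     # n_i for k = 3 treatments
k, N = len(sizes), sum(sizes)
alpha = 0.05

def t_star_once():
    # Under H0 all treatment means are equal; the scale cancels in t_ij,
    # so simulating from N(0, 1) suffices.
    groups = [rng.normal(size=n) for n in sizes]
    sigma2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)
    tmax = max(abs(gj.mean() - gi.mean()) / np.sqrt((1/gi.size + 1/gj.size) * sigma2)
               for gi, gj in combinations(groups, 2))
    return np.sqrt(2) * tmax

draws = np.array([t_star_once() for _ in range(5000)])
crit = np.quantile(draws, 1 - alpha)   # simulated critical value for t*
print(crit)
```

Rejecting any pair with $|t_{ij}| > \text{crit}/\sqrt 2$ then controls the probability of even one wrongful pairwise rejection at about $\alpha$.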
Pitfalls of Bonferroni and Tukey methods. In the case of Bonferroni,
the adjustment is too conservative. If we hope to test 1000 hypotheses based
on a single data set, then the significance level would be placed at 0.005%.
If this is applied to a t-test with $n = 20$ degrees of freedom, the critical value is
5.134, compared to 2.09 if only one hypothesis is being tested.
The actual type I error of the test is likely much lower (assignment problem).
As a side remark, in statistical consulting practice, we often look into
many aspects of the data. Based on what we spot, various hypotheses are
proposed and then tested. In the end, we report the p-value of the hypothesis
that is below 0.05 (or several such p-values). This practice seriously violates
the statistical principles we preach. Nonetheless, statisticians do it routinely,
and our collaborators would not be pleased otherwise. At the same time, to
avoid this problem, some scientific journals go so far as to prohibit the
use of the p-value as a justification of scientific findings. These problems are
further compounded by the fact that the p-value is often not carefully defined in the
first place.
In the case of the Tukey method, I feel that it is specifically designed for
one-way ANOVA; I am not aware of other situations where it is applied.
Regardless, my understanding is that it simply requires statisticians to make
sure the probability of wrongfully rejecting even one of many hypotheses is
below $\alpha$, the pre-specified level. This principle leads to a technical issue: we
may not be able to find even a well-approximated critical value.
24.5 False discovery rate
In modern statistical applications, we are confronted with a problem that
is radically different from one-way ANOVA. Due to advances in bio- and
information technology, we can now cost-effectively and promptly measure the
expression levels of thousands of genes for each subject. It is of interest to identify
genes whose expression levels differ between different groups of people.
Typically, one group is made of healthy controls and another group is
made of patients with a specific type of disease. The genes that are significantly
differentially expressed might be related to the disease.
There are two aspects of this new problem.
First, if 500,000 genes are inspected on 50+50 subjects, and we test each gene
for differential expression at level $\alpha = 0.001$, then even if none of them are
differentially expressed, about 500 of them will likely be found statistically
significant. This is bad.
Second, suppose a handful of genes are indeed differentially expressed
but the differences are not exceedingly large. Applying Bonferroni method
likely results in none of them judged significant. The high standard set by
Bonferroni method may fail the researchers for this wrong reason.
The dilemma seems to be solved by giving up the notion of type I error. When
thousands and thousands of hypotheses are examined simultaneously, we probably
should not mind having a larger probability of "wrongfully declaring
a few genes significantly differentially expressed". Rather, we should probably
ask: among the many genes judged significantly differentially expressed, what
percentage of them are falsely significant?
Because "rejecting a null hypothesis" in such a context is regarded as a
scientific discovery, the percentage of falsely significant outcomes among all
declared significant outcomes is called the "false discovery rate". In such
applications, controlling the false discovery rate is regarded as a better principle.
A widely accepted standard is again 5%.
In comparison, the classical practice of controlling the overall type I error
is renamed as “family-wise error rate”.
There is a need to be reminded of the difference between "statistical
significance" and "real-world" significance here. How large a difference in
the expression levels is scientifically significant should be judged by scientists.
When two expression levels are judged statistically significantly different, it
means that we have sufficient statistical evidence to declare that the difference
is genuine. However, the magnitude of the difference could be so small that
it is scientifically meaningless.
24.6 Method of Benjamini and Hochberg
We will only discuss the result of Benjamini and Hochberg (1995, JRSSB).
There have been a lot of new developments and I have not followed them
very closely.
False discovery rate. Suppose m hypotheses are being tested. Let m0 denote the number of them that are true, and let R be the number of hypotheses rejected. Note that R is random.
We have the decomposition
m0 = U + V,
where U of the true null hypotheses are tested non-significant and V are tested significant. Similarly, m − m0 = T + S: T of the false null hypotheses are tested non-significant and S are tested significant.
The total number of hypotheses tested significant is R = V + S; the total number tested non-significant is m − R = U + T.
When R > 0, the proportion of false discoveries is V/R. When R = 0, there cannot be any false discovery. Thus, Benjamini and Hochberg propose to define
Q = {V/(V + S)} 1(V + S > 0).
Clearly, Q is not observed and is random in any application.
The false discovery rate (FDR) is defined to be
Qe = E(Q).
In comparison, the type I error in the current situation is also called the family-wise error rate (FWER). It measures the probability of rejecting at least one hypothesis when all of them are true.
According to Benjamini and Hochberg (direct quote):
(a) If all null hypotheses are true, the FDR is equivalent to the FWER:
in this case s = 0 and v = r, so if v = 0 then Q = 0, and if v > 0 then Q = 1,
pr(V ≥ 1) = E(Q) = Qe.
Therefore control of the FDR implies the control of the FWER in the weak
sense.
(b) When m0 < m, the FDR is smaller than or equal to the FWER: in this
case, if v > 0 then v/r ≤ 1, leading to 1(V ≥ 1) ≥ Q. Taking expectations
on both sides we obtain pr(V ≥ 1) ≥ Qe, and the two can be quite different.
As a result, any procedure that controls the FWER also controls the FDR. A procedure that controls only the FDR is less stringent, so a gain in power is expected. In particular, the larger the number of false null hypotheses, the larger S tends to be, and so does the difference between the two error rates. Hence the potential increase in power is larger when more of the hypotheses are untrue.
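These two points can be checked by a small Monte Carlo sketch. The model below is an assumption made purely for illustration, not part of the text: true-null p-values are U(0, 1), and false-null p-values follow a Beta(0.05, 1) distribution so that they concentrate near zero. The variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m0, q_star, reps = 100, 90, 0.05, 2000   # 10 false null hypotheses
is_null = np.arange(m) < m0                  # first m0 hypotheses are true nulls

q_values = np.empty(reps)         # realized Q = V/(V+S) 1(V+S > 0)
false_rejection = np.empty(reps)  # indicator 1(V >= 1); its mean estimates the FWER
for r in range(reps):
    # toy model: U(0,1) p-values under the null, Beta(0.05, 1) under the alternative
    p = np.where(is_null, rng.uniform(size=m), rng.beta(0.05, 1.0, size=m))
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q_star
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()          # largest i with p_(i) <= (i/m) q*
        reject[order[: k + 1]] = True
    V = (reject & is_null).sum()              # true nulls rejected
    R = reject.sum()
    q_values[r] = V / R if R > 0 else 0.0
    false_rejection[r] = float(V >= 1)

print("estimated FDR :", q_values.mean())        # near (m0/m) q* = 0.045 <= q*
print("estimated FWER:", false_rejection.mean()) # larger, since 1(V >= 1) >= Q
```

The last line illustrates point (b): since 1(V ≥ 1) ≥ Q in every repetition, the estimated FWER always dominates the estimated FDR.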
24.7 How to apply this principle?
Suppose the hypotheses to be tested are H1, H2, . . . , Hm. Whatever methods are used, the outcome of each test is summarized by a p-value: P1, . . . , Pm. We assume these p-values are computed from valid tests. Sort these values to get P(1) ≤ P(2) ≤ · · · ≤ P(m), and denote the corresponding hypotheses by H^{(i)}_0 accordingly. Select an upper bound for the false discovery rate and denote it by q∗.
The BH procedure:
Step I. Let k be the largest i for which P(i) ≤ (i/m)q∗; namely,
k = max{i : P(i) ≤ (i/m)q∗}.
Step II. Reject H^{(i)}_0 for i = 1, 2, . . . , k.
Numerically, the BH procedure can be carried out as follows:
• If p(m) ≤ q∗, reject all m null hypotheses and stop;
• else if p(m−1) ≤ {(m − 1)/m}q∗, reject the null hypotheses with the m − 1 smallest p-values and stop;
• else if p(m−2) ≤ {(m − 2)/m}q∗, reject the null hypotheses with the m − 2 smallest p-values and stop;
• continue this process until the last step: if p(1) ≤ (1/m)q∗, reject H^{(1)}_0 and stop;
• else, reject none and terminate.
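The step-down search above can be sketched in a few lines of code; the function name `benjamini_hochberg` and the NumPy phrasing are our choices, not part of the text.

```python
import numpy as np

def benjamini_hochberg(pvalues, q_star=0.05):
    """Return a boolean mask marking the hypotheses rejected by the BH procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                 # sorts p into p_(1) <= ... <= p_(m)
    below = p[order] <= (np.arange(1, m + 1) / m) * q_star
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()      # largest i with p_(i) <= (i/m) q*
        reject[order[: k + 1]] = True     # reject H^{(1)}_0, ..., H^{(k)}_0
    return reject

# the step-down nature matters: 0.039 > (3/4) * 0.05 fails on its own,
# but p_(4) = 0.0395 <= (4/4) * 0.05 rescues it, so all four are rejected
print(benjamini_hochberg([0.039, 0.001, 0.0395, 0.008]))  # [ True  True  True  True]
```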
The moral of this procedure: for the targeted application, it is not a serious issue if one falsely declares 10 genes differentially expressed in diabetes patients when 2 of them are not; the true set can be sorted out subsequently. The procedure is more useful than declaring none of them significantly differentially expressed.
Suppose we choose q∗ = 0.05. The procedure declares at least one gene significantly differentially expressed when
p(1) ≤ 0.05/m.
Thus, if Bonferroni’s method rejects “all H0’s are true”, then at least one H0 is also rejected by the Benjamini–Hochberg procedure. The new procedure may reject many more individual H0’s.
For instance, Bonferroni’s method rejects H^{(2)}_0 only if
p(1) ≤ p(2) ≤ 0.05/m,
but the FDR method does so already when
p(1) ≤ p(2) ≤ (2/m) × 0.05.
Hence, the FDR method rejects more hypotheses in the long run: rejecting both H^{(1)}_0 and H^{(2)}_0 requires only p(2) ≤ (2/m)q∗.
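For concreteness, the two kinds of cut-offs can be tabulated for the gene-expression setting mentioned earlier (m = 500,000 is the figure quoted above; the particular indices shown are our illustrative choices):

```python
m, q_star = 500_000, 0.05
bonferroni_cutoff = q_star / m                       # same bar for every p-value
bh_cutoffs = [(i / m) * q_star for i in (1, 2, 10, 100)]
print(bonferroni_cutoff)   # about 1e-07
print(bh_cutoffs)          # about [1e-07, 2e-07, 1e-06, 1e-05]
```

The BH bar for the smallest p-value coincides with Bonferroni’s, but it relaxes steadily for the later order statistics.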
24.8 Theory and proof
Theorem 24.1. For independent test statistics and for any configuration of
false null hypotheses, the Benjamini-Hochberg procedure controls the FDR at
q∗.
Remark: by “independent test statistics”, we mean that the p-values, regarded as random variables, are independent of each other. When a null hypothesis is true, its p-value, however obtained, has the uniform [0, 1] distribution as long as the test is valid.
Lemma 24.1. Consider the problem where the Benjamini–Hochberg procedure is applied to m null hypotheses. For any 0 ≤ m0 ≤ m, independent p-values corresponding to the true null hypotheses, and any values p1, . . . , pm1 that the m1 = m − m0 p-values corresponding to the false null hypotheses may take, the Benjamini–Hochberg procedure satisfies the inequality
E{Q | Pm0+1 = p1, . . . , Pm = pm1} ≤ (m0/m)q∗.
Interpreting this lemma: suppose m1 of the hypotheses are false. Whatever the joint distribution of their corresponding p-values, integrating the inequality in the lemma we obtain
E(Q) ≤ (m0/m)q∗ ≤ q∗,
so the FDR is controlled. Namely, the conclusion of the theorem is implied by this lemma.
The independence of the test statistics corresponding to the false null
hypotheses is not needed for the proof of the theorem.
Proof of the Lemma. Recall that m is the number of hypotheses and m0 is the number of true null hypotheses.
Denote by H^{(i)}_0 and P′(i), i = 1, 2, . . . , m0, the true null hypotheses and their p-values, with the p-values in increasing order. Thus P′(1), . . . , P′(m0) are the order statistics of m0 iid uniform [0, 1] random variables.
Denote the false null hypotheses by H^{(i)}_f, i = m0 + 1, m0 + 2, . . . , m. Their p-values, as random variables, will be denoted Pi: capital P indexed by i. Their realized values are denoted p1 ≤ p2 ≤ · · · ≤ pm1.
The proof uses mathematical induction. We work through a few simple cases before starting the induction proper.
Case I: the case m = 1 is immediate.
(a) m1 = 1 so that m0 = 0. Hence Q ≡ 0 and
E(Q | P1) = 0 ≤ (m0/m)q∗.
(b) m1 = 0 so that m0 = 1. Hence
Q = 1(P′(1) < q∗).
There is nothing to condition on, and we have
E(Q) = P(P′(1) < q∗) = q∗ = (m0/m)q∗.
Combining (a) and (b), the conclusion of the lemma holds when m = 1.
Case II: the case m = 2.
(a) m1 = 2 so that m0 = 0. In this case Q ≡ 0 and
E(Q | P1, P2) = 0 ≤ (m0/m)q∗.
(b) m1 = 1 so that m0 = 1. In this case Q can take the values 0, 1/2 and 1.
When P1 > q∗, H^{(2)}_f is never rejected, and H^{(1)}_0 is rejected (giving Q = 1) when P′(1) ≤ 0.5q∗. Hence
E(Q | P1 > q∗) = P(P′(1) ≤ 0.5q∗) = 0.5q∗ = (m0/m)q∗.
When P1 < q∗, both H^{(1)}_0 and H^{(2)}_f are rejected if P′(1) < q∗. When this happens, Q = 0.5; otherwise Q = 0. Hence
E(Q | P1 < q∗) = 0.5 × P(P′(1) < q∗ | P1 < q∗) = 0.5q∗ = (m0/m)q∗.
(c) m1 = 0 so that m0 = 2. In this case any rejection leads to Q = 1, and there is nothing to condition on. Hence
E(Q) = P{P′(1) < (1/2)q∗ or P′(2) < (2/2)q∗}
= P{P′(1) < (1/2)q∗} + P{P′(1) > (1/2)q∗, P′(2) < (2/2)q∗}
= 1 − (1 − 0.5q∗)^2 + (0.5q∗)^2
= q∗ = (m0/m)q∗.
Combining (a), (b) and (c), the conclusion of the lemma holds when m = 2.
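Part (c) can also be checked numerically: with two true null hypotheses, the probability that the BH procedure rejects anything — and hence E(Q) — should be exactly q∗. A Monte Carlo sketch follows; the value q∗ = 0.2 and the seed are our choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
q_star = 0.2                                          # large q* keeps the MC error relatively small
u = np.sort(rng.uniform(size=(200_000, 2)), axis=1)   # rows are (P'_(1), P'_(2)), order statistics of two U(0,1)
# with m = 2, BH rejects something iff P'_(1) < (1/2) q* or P'_(2) < (2/2) q*
reject_any = (u[:, 0] < 0.5 * q_star) | (u[:, 1] < q_star)
print(reject_any.mean())                              # close to q* = 0.2
```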
Induction assumption: assume the lemma is true for every m ≤ N − 1. We now prove the lemma for m = N.
Suppose first that m0 = 0, so that all null hypotheses are false. Then Q ≡ 0, and
E{Q | Pm0+1 = p1, . . . , Pm = pm1} = 0 ≤ (m0/m)q∗.
That is, the lemma is true when m = N and m0 = 0. Thus, we need only discuss the case where m0 > 0.
Let j0 be the largest j, 0 ≤ j ≤ m1, satisfying
pj ≤ {(m0 + j)/N}q∗.
Note that these pj are p-values corresponding to false null hypotheses. Denote
p′′ = {(m0 + j0)/N}q∗.
This value will be used as a cut-off point.
The key steps of the proof start from here.
Step 1. Conditioning on P′(m0), the largest p-value in the group of true null hypotheses, we find
E(Q | Pm0+1 = p1, . . . , Pm = pm1)
= ∫_0^{p′′} E(Q | P′(m0) = p, Pm0+1 = p1, . . . , Pm = pm1) fm0(p) dp
+ ∫_{p′′}^1 E(Q | P′(m0) = p, Pm0+1 = p1, . . . , Pm = pm1) fm0(p) dp,
where fm0(p) = m0 p^{m0−1} is the density function of P′(m0). We will work out an upper bound for each of the two integrals.
Step 2. Analyzing the integral over the two intervals.
In the first integral, we are dealing with the situation p ≤ p′′, because the integration is over the region [0, p′′]. In this case, the BH procedure rejects all m0 true null hypotheses plus j0 false null hypotheses. The false discovery proportion is hence
Q = m0/(m0 + j0).
Substituting this value into the first integral, and recalling the density function fm0(p), we have
∫_0^{p′′} {·} dp = {m0/(m0 + j0)} ∫_0^{p′′} m0 p^{m0−1} dp = {m0/(m0 + j0)}(p′′)^{m0}.
Recalling p′′ = {(m0 + j0)/N}q∗, we get
{m0/(m0 + j0)}(p′′)^{m0} = {m0/(m0 + j0)}(p′′)^{m0−1} × {(m0 + j0)/N}q∗ = (m0/N)q∗(p′′)^{m0−1}.
We keep this result in this form and work on the second integral.
In the second integral, by the definition of j0, the largest p-value corresponding to a true H0 satisfies
P′(m0) = p ≥ p′′ = {(m0 + j0)/N}q∗,
while pj0 ≤ p′′. Let j be the integer satisfying
pj0 < pj ≤ P′(m0) = p < pj+1;
note that the value p may exceed many more p-values of the false null hypotheses. If no such j exists, then
pj0 ≤ p′′ < P′(m0) = p < pj0+1,
which occurs when p is barely larger than p′′ = {(m0 + j0)/N}q∗; in that case set j = j0. We now regard j as fixed, satisfying one of the two inequalities above; namely, we work conditionally.
Because of the way j0 and p′′ are defined, no hypothesis can be rejected as a result of the values of
p, pj+1, . . . , pm1.
That is, none of H^{(m0)}_0, H^{(m0+j+1)}_f, H^{(m0+j+2)}_f, . . . , H^{(m0+m1)}_f is rejected. (Reminder: j is a fixed value in this argument.) Hence, the pool of hypotheses that might be rejected shrinks to
H^{(i)}_0 : i = 1, 2, . . . , m0 − 1;  H^{(i)}_f : i = m0 + 1, m0 + 2, . . . , m0 + j.
There are m0 + j − 1 < N hypotheses in this pool.
In this pool of null hypotheses, true and false alike, order the p-values together and denote the resulting hypotheses by H̃^{(i)}_0, i = 1, 2, . . . , m0 + j − 1. A hypothesis H̃^{(i)}_0 is rejected only if there exists k, i ≤ k ≤ m0 + j − 1, for which p̃(k) ≤ (k/N)q∗. Namely, we look for the largest k such that
p̃(k)/p ≤ {k/(m0 + j − 1)} × {(m0 + j − 1)/(Np)}q∗. (24.1)
We now explain that this corresponds to a BH procedure with different values of m, m0 and q∗.
When conditioning on P′(m0) = p, the ratios P′(i)/p, i = 1, 2, . . . , m0 − 1, are iid U(0, 1) random variables (before sorting). Also, {pi/p, i = 1, 2, . . . , j} are numbers between 0 and 1 corresponding to the false null hypotheses H^{(m0+1)}_f, . . . , H^{(m0+j)}_f.
Using inequality (24.1) to test the m0 + j − 1 = m′ < N hypotheses is equivalent to the BH procedure, with the constant
q̃∗ = {(m0 + j − 1)/(Np)}q∗
taking the role of q∗.
Applying the induction hypothesis to this procedure, in which the total number of hypotheses being tested is
m′ = m0 + j − 1 < N,
we have
E(Q | P′(m0) = p, Pm0+1 = p1, . . . , Pm = pm1) ≤ {(m0 − 1)/m′}q̃∗
= {(m0 − 1)/(m0 + j − 1)} × {(m0 + j − 1)/(Np)}q∗
= {(m0 − 1)/(Np)}q∗.
The above bound depends on p, but not on the segment pj < p < pj+1 on which it was evaluated (that is, it holds whichever fixed j is under consideration). Therefore, the second integral satisfies
∫_{p′′}^1 E(Q | P′(m0) = p, Pm0+1, . . . , Pm) fm0(p) dp ≤ ∫_{p′′}^1 {(m0 − 1)/(Np)}q∗ × m0 p^{m0−1} dp.
The outcome of the integration is
(m0/N)q∗ ∫_{p′′}^1 (m0 − 1)p^{m0−2} dp = (m0/N)q∗{1 − (p′′)^{m0−1}}.
Adding the two upper bounds gives
(m0/N)q∗(p′′)^{m0−1} + (m0/N)q∗{1 − (p′′)^{m0−1}} = (m0/N)q∗,
which is the claimed bound (m0/m)q∗ with m = N. This proves the lemma for the case m = N. The induction is now complete, and the lemma is fully proved.