STAT 5361: Statistical Computing Jun Yan Department of Statistics University of Connecticut Outline I 1 Non-Stochastic Optimization Some Background Univariate Problems Multivariate Problems 2 Combinatorial Optimization Local Search Simulated Annealing Genetic Algorithm 3 EM and MM Algorithms EM Algorithm Majorization-Minimization Algorithm Example: Penalized AFT model with induced smoothing Non-Stochastic Optimization Some Background Non-stochastic Optimization What we will learn: Some basics about statistical inference Univariate problems I Newton’s method I Fisher scoring I Secant method I Fixed-point method I Connections of the above Multivariate problems I Newton’s method I Newton-like methods Jun Yan (University of Connecticut) STAT 5361 3 / 125 Non-Stochastic Optimization Some Background Background: likelihood inference Let x1, . . . ,xn be an i.i.d. sample from f (x |θ∗), with the true parameter value θ∗ being unknown. The likelihood function is L(θ) = n∏ i=1 f (xi |θ), The maximum likelihood estimator (MLE) of the parameter value is the maximizer of L(θ). MLE has the invariate property. Usually it is easier to work with the log likelihood function l(θ) = logL(θ). Jun Yan (University of Connecticut) STAT 5361 4 / 125 Non-Stochastic Optimization Some Background Typically, maximization of l(θ) is done by solving l ′(θ) = 0. I l ′(θ) is called the score function. I For each possible parameter value θ, l(θ) is a random variable, because it depends on the observed values of x1, . . . ,xn. For any θ, Eθ { l ′(θ) } = 0, Eθ { l ′(θ)l ′(θ)T } = −Eθ { l ′′(θ) } . where Eθ is the expectation with respect to f (x |θ). The equality holds true under mild regularity conditions. I (θ) = Eθ { l ′(θ)l ′(θ)T } is known as the Fisher information. Jun Yan (University of Connecticut) STAT 5361 5 / 125 Non-Stochastic Optimization Some Background The importance of the Fisher information I (θ) is that it sets the limit on how accurate an unbiased estimate of θ can be. 
As n →∞, the asymptotic distribution of √n(θ̂MLE − θ∗) is Np(0,nI (θ∗)−1). Since θ∗ is unknown, the asymptotic covariance matrix I (θ∗)−1) needs to be estimated. I If dim(θ) = 1, I (θ) is a nonnegative number. I If dim(θ) > 1, I (θ) is a nonnegative definite matrix. The observed Fisher information is −l ′′(θ). The expected Fisher information I (θ) may not be easily computed. The observed Fisher information is a good approximation to I (θ) that improves as n gets bigger. Jun Yan (University of Connecticut) STAT 5361 6 / 125 Non-Stochastic Optimization Some Background Besides MLEs, there are other likelihoods for parameter estimation. Suppose θ has two parts: θ = (φ,µ) and we are only interested in φ. The profile likelihood for φ is L(φ) := max µ L(φ,µ). µ is the nuisance parameter. Need to maximize L(φ,µ) for every fixed φ. Also need to maximize L(φ). Jun Yan (University of Connecticut) STAT 5361 7 / 125 Non-Stochastic Optimization Some Background A general method Suppose g(x) is a differentiable function, where x = (x1, . . . , xn). To find its maximum (or minimum), one method is to solve the equation g′(x) = 0 where g′(x) = ( ∂g ∂x1 , . . . , ∂g ∂xn )T . Then the maximization is equivalent to solving f (x) = 0, where f = g′. For maximum likelihood estimation, g is the log likelihood function l, and x is the corresponding parameter vector θ. Jun Yan (University of Connecticut) STAT 5361 8 / 125 Non-Stochastic Optimization Univariate Problems Univariate Problems Jun Yan (University of Connecticut) STAT 5361 9 / 125 Non-Stochastic Optimization Univariate Problems Optimization: Univariate case Goal: optimize a real-valued function g with respect to its argument, a p dimensional vector x. We will first consider the case p = 1. We will limit consideration mainly to smooth and differentiable functions. Root finding methods. Solving unconstrained nonlinear equations. Iterative algorithms: starting value, updating equation, stopping rule/convergence criterion. 
Using the bisection method to maximize g(x) = log x/(1 + x), or to solve f(x) = g′(x) = (1 + 1/x − log x)/(1 + x)^2 = 0.

Newton’s method

This is a fast approach to finding roots of a differentiable function f(x). First, set an initial value x0. Then for t = 0, 1, . . ., compute xt+1 = xt + ht, with ht = −f(xt)/f′(xt). Continue the iterations until xt converges. Also known as Newton-Raphson iteration. Need to specify x0. If f(x) = 0 has multiple solutions, the end result depends on x0.

Newton’s method requires computing the derivative of a function. Algorithms are available to find root(s) of a univariate function without having to compute its derivative; for example, in R, one can use uniroot. Newton’s method can be applied to optimize g by applying it to f = g′. Both g′ (gradient) and g′′ (Hessian) are needed. Many variants of Newton’s method avoid the computation of the Hessian, which can be difficult, especially for multivariate functions.

To maximize g(x) = log x/(1 + x), first find f(x) = g′(x) = (1 + 1/x − log x)/(1 + x)^2 and f′(x) = g′′(x) = −(3 + 4/x + 1/x^2 − 2 log x)/(1 + x)^3. So in Newton’s method, ht = (xt + 1)(1 + 1/xt − log xt)/(3 + 4/xt + 1/xt^2 − 2 log xt). Note that to solve f(x) = 0, one can instead solve 1 + 1/x − log x = 0. Treat this as a new f function. Then in Newton’s method, ht = xt − xt^2 log xt/(1 + xt), so that xt+1 = 2xt − xt^2 log xt/(1 + xt).

To maximize the log likelihood l(θ), Newton’s method computes θt+1 = θt − l′(θt)/l′′(θt). Example: consider x1, . . . , xn ∼ i.i.d. N(µ, σ^2).
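The simplified Newton update for the running example g(x) = log x/(1 + x) can be checked numerically. The course illustrates these methods in R; the following is an illustrative Python sketch (the function name is mine) iterating xt+1 = 2xt − xt^2 log xt/(1 + xt):

```python
import math

def newton_log_ratio(x0=3.0, tol=1e-10, max_iter=100):
    """Newton iteration for f(x) = 1 + 1/x - log x, the stationarity
    condition of g(x) = log(x)/(1 + x), using the simplified update
    x_{t+1} = 2*x - x**2 * log(x)/(1 + x) derived in the notes."""
    x = x0
    for _ in range(max_iter):
        x_new = 2 * x - x ** 2 * math.log(x) / (1 + x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = newton_log_ratio()
print(x_star)  # the maximizer of log(x)/(1+x), about 3.591
```

Starting from x0 = 3, the iterates settle in a handful of steps, consistent with the quadratic convergence order discussed below.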
Jun Yan (University of Connecticut) STAT 5361 14 / 125 Non-Stochastic Optimization Univariate Problems Example: consider the model on shift p(x | θ) = p(x − θ). Given observations x1, . . . , xn i.i.d. ∼ p(x | θ) l(θ) = n∑ i=1 log p(xi − θ) l ′(θ) = − n∑ i=1 p′(xi − θ) p(xi − θ) l ′′(θ) = n∑ i=1 p′′(xi − θ) p(xi − θ) − n∑ i=1 {p′(xi − θ) p(xi − θ) }2 . Note that we need to update θ in the iterations, not x1, . . . , xn . Jun Yan (University of Connecticut) STAT 5361 15 / 125 Non-Stochastic Optimization Univariate Problems In R, to minimize a function, one can use z = nlminb(x0, g, gr.g, hess.g) here x0 is the initial value, g is the function being minimized, gr.g its gradient and hess.g its Hessian. (Unconstrained and box-constrained optimization using PORT routines). In the above function, the derivatives g′ and g′′ have to be analytically calculated. One can also use z = nlminb(x0, g, gr.g) without inputing the analytic expression of g′′, or, even simpler, z = nlminb(x0, g) without inputing the analytic expressions of either derivatives. In these cases, numerical approximations of derivatives will be computed during the iterations. Other functions for optimization in R include optim, optimize, nlm and constrOptim. Jun Yan (University of Connecticut) STAT 5361 16 / 125 Non-Stochastic Optimization Univariate Problems For profile likelihood L(φ) = max µ L(φ,µ) we need to optimize L(φ,µ) for each given φ. In R, in order to get minx g(x, y) for each fixed y, one can do the following. First, define g(x,y), gr.g(x,y), etc. Then call, say, nlminb(x0, g, gr.g, hess.g, y=1), or nlminb(x0, g, gr.g, y=.3), or nlminb(x0, g, y=-1) If one only wants minimization for x ∈ [a, b], use, nlminb(x0, ..., lower=a, upper=b) Jun Yan (University of Connecticut) STAT 5361 17 / 125 Non-Stochastic Optimization Univariate Problems Fisher scoring This is a variant of Newton’s method specific for MLE. Recall −l ′′(θ) is the observed Fisher information at θ. 
To maximize l(θ), an alternative is to replace −l ′′(θt) with I (θt) to get θt+1 = θt + l ′(θt) I (θt) . To minimize −l(θ), one still can use z = nlminb(x0, f, grf, fs) where f is −l(θ), grf is −l ′(θ), and fs is I (θ) instead of −l ′′(θ). Generally, use Fisher scoring in the beginning to make rapid improvements, and Newton’s method for refinement near the end. Jun Yan (University of Connecticut) STAT 5361 18 / 125 Non-Stochastic Optimization Univariate Problems Continuing the example on p(x | θ) = p(x − θ), to use Fisher scoring, we need to compute I (θ) = −Eθ[l ′′(θ)]. We already know l ′′(θ) = n∑ i=1 p′′(xi − θ) p(xi − θ) − n∑ i=1 {p′(xi − θ) p(xi − θ) }2 . Therefore I (θ) = −nEθ [ p′′(X − θ) p(X − θ) − {p′(X − θ) p(X − θ) }2] . Jun Yan (University of Connecticut) STAT 5361 19 / 125 Non-Stochastic Optimization Univariate Problems Since under parameter θ, X has density p(x − θ), the last Eθ[...] equals∫ [p′′(x − θ) p(x − θ) − {p′(x − θ) p(x − θ) }2] p(x − θ) dx = ∫ p′′(x − θ) dx − ∫ [p′(x − θ)]2 p(x − θ) dx = d 2 dθ2 ∫ p(x − θ) dx − ∫ {p′(x)}2 p(x) dx = d 2 dθ2 1− ∫ {p′(x)}2 p(x) dx = − ∫ {p′(x)}2 p(x) dx. Therefore, I (θ) = n ∫ {p′(x)}2 p(x) dx. So, in this case, Fisher information is a constant. Jun Yan (University of Connecticut) STAT 5361 20 / 125 Non-Stochastic Optimization Univariate Problems Secant method Approximating f ′(xt) by f (xt)− f (xt−1) xt − xt−1 , the Newton’s method turns into the secant method xt+1 = xt − f (xt)(xt − xt−1)f (xt)− f (xt−1) . Need to specify x0 and x1. Jun Yan (University of Connecticut) STAT 5361 21 / 125 Non-Stochastic Optimization Univariate Problems Fixed point iteration Compute xt+1 = xt + αf (xt), t = 0, 1, . . . , where α 6= 0 is a tuning parameter so that |1 + αf ′(x)| ≤ λ < 1, all x or more generally, |x − y + α[f (x)− f (y)]| ≤ λ|x − y|, any x, y with 0 ≤ λ < 1 a constant, i.e., F(x) = x + αf (x) is a contraction. If such α exists, the speed of convergence depends on λ. The smaller λ is, the faster the convergence. 
Some Details on Fixed-point iteration

Definition: A fixed point of a function is a point at which the value of the function equals the point itself, i.e., x = G(x). Fixed-point iteration: the natural way of hunting for a fixed point is to iterate xt+1 = G(xt).

Definition: A function G is contractive on [a, b] if (1) G(x) ∈ [a, b] whenever x ∈ [a, b], and (2) |G(x1) − G(x2)| ≤ λ|x1 − x2| for all x1, x2 ∈ [a, b] and some λ ∈ [0, 1).

Theorem: if G is contractive on [a, b], then there is a unique fixed point x∗ ∈ [a, b], and the fixed-point iteration converges to it when started inside the interval.

Convergence: |xt+1 − xt| = |G(xt) − G(xt−1)| ≤ λ|xt − xt−1| ≤ λ^t |x1 − x0| → 0 as t → ∞. It follows (by a Cauchy-sequence argument) that {xt} converges to a limit x∗.

Connection between the fixed-point method and Newton’s method

Root finding via fixed-point iteration: to solve f(x) = 0, we can simply let G(x) = x + αf(x), where α ≠ 0 is a constant. Required Lipschitz condition: |x − y + α[f(x) − f(y)]| ≤ λ|x − y| for some λ ∈ [0, 1) and all x, y ∈ [a, b]. This holds if |G′(x)| ≤ λ for some λ ∈ [0, 1) and all x ∈ [a, b], i.e., |1 + αf′(x)| ≤ λ (use the mean value theorem). Newton’s method: G(x) = x − f(x)/f′(x). So it is as if αt is chosen adaptively as αt = −1/f′(xt). This leads to a faster (quadratic) convergence order.

Convergence Order

Define εt = xt − x∗. A method has convergence order β if limt→∞ εt = 0 and limt→∞ |εt+1|/|εt|^β = c for some constants c ≠ 0 and β > 0. For Newton’s method, β = 2 (if it converges). For the secant method, β ≈ 1.62. For fixed-point iteration, β = 1.
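The contraction condition and the convergence orders above can be illustrated on the running example f(x) = 1 + 1/x − log x. A minimal Python sketch (function names and the choices x0 = 3, α = 2 are mine):

```python
import math

def f(x):
    return 1 + 1 / x - math.log(x)  # f = g' for g(x) = log(x)/(1+x)

def fixed_point(f, x0, alpha, tol=1e-10):
    """Scaled fixed-point iteration x_{t+1} = x_t + alpha*f(x_t);
    converges linearly (beta = 1) when G(x) = x + alpha*f(x) contracts."""
    x, steps = x0, 0
    while steps < 1000:
        x_new = x + alpha * f(x)
        steps += 1
        if abs(x_new - x) < tol:
            return x_new, steps
        x = x_new
    return x, steps

def secant(f, x0, x1, tol=1e-10):
    """Secant iteration x_{t+1} = x_t - f(x_t)(x_t - x_{t-1})
    / (f(x_t) - f(x_{t-1})); convergence order beta ~ 1.62."""
    steps = 0
    while steps < 1000:
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        steps += 1
        if abs(x2 - x1) < tol:
            return x2, steps
        x0, x1 = x1, x2
    return x1, steps

root_fp, n_fp = fixed_point(f, 3.0, alpha=2.0)
root_sc, n_sc = secant(f, 3.0, 4.0)
print(root_fp, n_fp)  # about 3.591; more iterations (linear rate)
print(root_sc, n_sc)  # same root in fewer iterations
```

Both find the same root; the fixed-point iteration needs noticeably more steps at the same tolerance, matching β = 1 versus β ≈ 1.62.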
Jun Yan (University of Connecticut) STAT 5361 25 / 125 Non-Stochastic Optimization Univariate Problems Convergence Order For Newton’s method: suppose f has two continuous derivatives and f ′(x∗) 6= 0. There then exists a neighborhood of x∗ within which f ′(x) 6= 0 for all x. By Taylor expansion, 0 = f (x∗) = f (xt) + f ′(xt)(x∗ − xt) + 12 f ′′(q)(x∗ − xt)2, for some q between x∗ and xt . Rearranging terms, we find that εt+1 ε2t = f ′′(q) 2f ′(xt) → f ′′(x∗) 2f ′(x∗) . Jun Yan (University of Connecticut) STAT 5361 26 / 125 Non-Stochastic Optimization Univariate Problems Convergence Order For fixed point iteration: let G be a continuous function on the closed interval [a, b] with G : [a, b]→ [a, b] and suppose that G ′ is continuous on the open interval (a, b) with |G ′(x)| ≤ k < 1 for all x ∈ (a, b). If g′(x∗) 6= 0, then for any x0 ∈ [a, b], the fixed point iteration converges linearly to the fixed point x∗. This is because |xt+1 − x∗| = |G(xt)−G(x∗)| = |G ′(q)||xt − x∗|, for some q between xt and x∗. This implies that |εt+1| |εt | → |G ′(x∗)|. Jun Yan (University of Connecticut) STAT 5361 27 / 125 Non-Stochastic Optimization Univariate Problems Stopping Rules Absolute convergence criterion: ‖xt+1 − xt‖ < , where is a chosen tolerance. Relative convergence criterion ‖xt+1 − xt‖ ‖xt‖ < . If xt close to zero, ‖xt+1 − xt‖ ‖xt‖+ < . Many other rules depending on the algorithm. Jun Yan (University of Connecticut) STAT 5361 28 / 125 Non-Stochastic Optimization Multivariate Problems Multivariate Problems Jun Yan (University of Connecticut) STAT 5361 29 / 125 Non-Stochastic Optimization Multivariate Problems Newton’s method Let g now be a function in x = (x1, . . . , xp)T . The generalization is straightforward: to maximize g(x), set xt+1 = xt − [g′′(xt)]−1g′(xt). g′′(x) is a p × p matrix with the (i, j)th entry equal to ∂2g(x) ∂xi∂xj ; g′(x) is a p × 1 vector with the ith entry equal to ∂g(x) ∂xi . 
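The multivariate Newton update above can be sketched on a familiar statistical example: maximizing the N(µ, σ^2) log likelihood over θ = (µ, σ). This illustrative Python sketch (all names mine) solves the 2 × 2 Newton system by hand and adds a step-halving safeguard to keep σ > 0 and guarantee ascent (the safeguard is my addition, not part of the plain update):

```python
import math

def normal_mle_newton(x, mu0, sigma0, tol=1e-10, max_iter=200):
    """Two-dimensional Newton iteration for the N(mu, sigma^2)
    log likelihood, with step halving for safety."""
    n = len(x)

    def loglik(mu, s):
        return -n * math.log(s) - sum((xi - mu) ** 2 for xi in x) / (2 * s ** 2)

    mu, s = mu0, sigma0
    for _ in range(max_iter):
        r1 = sum(xi - mu for xi in x)
        r2 = sum((xi - mu) ** 2 for xi in x)
        # gradient l'(theta)
        g1 = r1 / s ** 2
        g2 = -n / s + r2 / s ** 3
        # Hessian l''(theta)
        h11 = -n / s ** 2
        h12 = -2 * r1 / s ** 3
        h22 = n / s ** 2 - 3 * r2 / s ** 4
        # solve the 2x2 system  H h = -g  by hand
        det = h11 * h22 - h12 * h12
        d_mu = -(h22 * g1 - h12 * g2) / det
        d_s = -(h11 * g2 - h12 * g1) / det
        step = 1.0
        while s + step * d_s <= 0 or loglik(mu + step * d_mu, s + step * d_s) < loglik(mu, s):
            step /= 2
            if step < 1e-8:
                break
        mu, s = mu + step * d_mu, s + step * d_s
        if abs(step * d_mu) + abs(step * d_s) < tol:
            break
    return mu, s

x = [2.1, 1.3, 3.7, 2.9, 0.8, 2.5]
mu_hat, sigma_hat = normal_mle_newton(x, mu0=2.0, sigma0=1.5)
print(mu_hat, sigma_hat)  # closed form: mean(x), sqrt(mean((x - mean)^2))
```

The iterates agree with the closed-form MLE, which gives a convenient check of the gradient and Hessian formulas.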
Jun Yan (University of Connecticut) STAT 5361 30 / 125 Non-Stochastic Optimization Multivariate Problems Newton-like methods For MLE, the generalization is straightforward. Simply use θt+1 = θt + I (θt)−1l ′(θt). These methods in each iteration approximate g′′(xt) by some p × p matrix M t = M t(xt) to get xt+1 = xt −M−1t g′(xt). (1) Of course, M t should be easier to compute than g′′(xt). Fisher scoring is a Newton-like method, because it uses M t = −I (θt) in place of −l ′′(θt). Jun Yan (University of Connecticut) STAT 5361 31 / 125 Non-Stochastic Optimization Multivariate Problems Steepest ascent methods In (1), set M t = −α−1t I p, so that xt+1 = xt + αtg′(xt), where αt > 0 is the step size at t which can shrink to ensure ascent. If at step t, the original step turns out to be downhill, the updating can backtrack by halving αt . Jun Yan (University of Connecticut) STAT 5361 32 / 125 Non-Stochastic Optimization Multivariate Problems Discrete Newton and Fixed-point methods In fixed-point methods, usually M t is fixed to be M , e.g., g′′(x0). This amounts to applying univariate scaled fixed-point iteration to each component. In discrete Newton methods, g′′(xt) is approximated as follows. Compute matrix M t with the (i, j)th entry equal to M t(i, j) = g′i(xt + ht(i, j)ej)− g′i(xt) ht(i, j) for some constant ht(i, j) 6= 0, where g′i(x) = ∂g(x) ∂xi . If ht(i, j) is small, then M t(i, j) ≈ ∂ 2g(xt) ∂xi∂xj and so M t approximates g′′(xt). Jun Yan (University of Connecticut) STAT 5361 33 / 125 Non-Stochastic Optimization Multivariate Problems However, since g′′(xt) is symmetric, instead of using M t directly, use the symmetric (M t +MTt )/2 as the approximation of g′′(xt). No need to calculate second order derivatives Inefficient: all p2 entries have to be updated each time. Jun Yan (University of Connecticut) STAT 5361 34 / 125 Non-Stochastic Optimization Multivariate Problems Quasi-Newton methods These methods aim to achieve the following goals. 
Each step satisfies the secant condition g′(xt+1)− g′(xt) = M t+1(xt+1 − xt). No need to compute second order derivatives. Maintain symmetry of M t . Aim to update M t efficiently. Jun Yan (University of Connecticut) STAT 5361 35 / 125 Non-Stochastic Optimization Multivariate Problems There is a unique rank-one method that satisfies the secant condition and maintains the symmetry of M t : after getting xt+1 = xt −M−1t g′(xt) compute zt = xt+1 − xt , yt = g′(xt+1)− g′(xt), vt = yt −M tzt , ct = 1 vTt zt . Then update M t+1 = M t + ctvtvTt . Note M t+1 −M t is of rank one. We can verify the secant condition is satisfied by multiplying both sides by zt . Jun Yan (University of Connecticut) STAT 5361 36 / 125 Non-Stochastic Optimization Multivariate Problems There are several rank-two method satisfying the secant condition and the symmetry requirement. The Broyden class updates M t as follows. After getting xt+1 and zt , yt as above, compute dt = yt zTt yt − M tztzTt M tzt . Then update M t+1 = M t − M tzt(M tzt) T zTt M tzt + yty T t zTt yt + δt(zTt M tzt)dtdTt , where δt is a parameter. A popular method to solve unconstrained nonlinear optimization problems is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, with δt ≡ 0. In R, optim can be called to minimize a function, using the BFGS method or its variant, the “L-BFGS-B” method. “L” stands for limited memory, and “B” stands for box constraint. Jun Yan (University of Connecticut) STAT 5361 37 / 125 Non-Stochastic Optimization Multivariate Problems Approximating Hessian for MLE For MLE, the Hessian l ′′(θ̂MLE) is critical because it provides estimates of standard error and covariance. Quasi-Newton methods may provide poor approximation because it is based on the idea of using poor approximation of the Hessian to find the root for l ′(θ) = 0. Use the discrete multivariate Newton method with M t(i, j) = l ′i(θt + hijej)− l ′i(θt − hijej) 2hij . 
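The symmetric rank-one update described above can be checked numerically: after updating, M must satisfy the secant condition M z = y and remain symmetric. A minimal sketch with a made-up 2 × 2 example (all values are for illustration only):

```python
def sr1_update(M, z, y):
    """Symmetric rank-one update: M+ = M + c v v^T with
    v = y - M z and c = 1/(v^T z), so that M+ z = y."""
    p = len(z)
    Mz = [sum(M[i][j] * z[j] for j in range(p)) for i in range(p)]
    v = [y[i] - Mz[i] for i in range(p)]
    c = 1.0 / sum(v[i] * z[i] for i in range(p))
    return [[M[i][j] + c * v[i] * v[j] for j in range(p)] for i in range(p)]

# made-up example: current approximation M, step z, gradient change y
M = [[2.0, 0.0], [0.0, 3.0]]
z = [1.0, -1.0]
y = [2.5, -2.0]
M_new = sr1_update(M, z, y)
Mz = [sum(M_new[i][j] * z[j] for j in range(2)) for i in range(2)]
print(M_new, Mz)  # M_new z equals y, and M_new stays symmetric
```

Multiplying M_new by z reproduces y exactly, which is the algebraic verification mentioned in the notes.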
Jun Yan (University of Connecticut) STAT 5361 38 / 125 Non-Stochastic Optimization Multivariate Problems Gauss-Newton method Example: nonlinear regression. In many cases, we want to maximize g(θ) = − n∑ i=1 (yi − fi(θ))2, where each fi(θ) is differentiable. Recall that for linear regression: yi = xTi θ + ε, i = 1, . . . ,n the LS estimator of θ maximizes g(θ) with fi(θ) = xTi θ and equals θ̂ = (XTX)−1XTy where X = xT1... xTn , y = y1... yn . Jun Yan (University of Connecticut) STAT 5361 39 / 125 Non-Stochastic Optimization Multivariate Problems The Gauss-Newton method applies a similar idea to the nonlinear case. Let θ∗ be the (unknown) maximizer of g(θ). Given any candidate θ, the function h(u) = − n∑ i=1 (yi − fi(θ + u))2. is maximized by β = θ∗ − θ. Of course, β is unknown. However, if θ is near θ∗, then β ≈ 0, so by Taylor expansion of h(u), it may be close to the maximizer of − n∑ i=1 (yi − fi(θ)− f ′i (θ)Tu)2. Jun Yan (University of Connecticut) STAT 5361 40 / 125 Non-Stochastic Optimization Multivariate Problems Treat yi − fi(θ) the same way as yi in the linear regression, and f ′i (θ) the same way as xi , then θ∗ − θ ≈ (ATA)−1ATz where z = z(θ) = y1 − f1(θ)... yn − fn(θ) , A = A(θ) = f ′ 1(θ)T ... f ′n(θ)T . The update rule thus is θt+1 = θt + (ATt At)−1ATt zt , where zt = z(θt) and At = A(θt). In R, the Gauss-Newton method is the default method of the function nls. Jun Yan (University of Connecticut) STAT 5361 41 / 125 Non-Stochastic Optimization Multivariate Problems Practical Issues Initial value Convergence criteria: based on a distance measures for vectors Multiple local optima I Multiple initial values I Compare the objective function value at convergences Trade-off between efficiency and robustness Jun Yan (University of Connecticut) STAT 5361 42 / 125 Combinatorial Optimization Combinatorial Optimization Motivation: There are hard optimization problems for which most methods including what we have discussed so far are useless. 
These problems are usually combinatorial in nature. Maximization requires a discrete search of a very large space. Problem: maximize f(θ) with respect to θ = (θ1, . . . , θp), where θ ∈ Θ and Θ consists of N elements. I Each θ ∈ Θ is called a candidate solution. I Usually N is a very large number that depends on the problem size p. I The difficulty of a particular size-p problem can be characterized by the number of operations required to solve it in the worst-case scenario using the best known algorithm. I Suppose a problem is O(p!). If it requires 1 minute to solve for p = 20, it would take 12.1 years for p = 25 and 207 million years for p = 30.

What can we do: We have to make some sacrifices: we abandon global algorithms and focus on algorithms that can find a good local solution. Heuristic strategies. What we will learn: Local search methods; Simulated annealing; Genetic algorithms.

Motivating Example: Subset regression

Suppose x1, . . . , xp are predictors that can be used to construct a linear model for Y . Each candidate model is Y = ∑_{j=1}^{s} β_{ij} x_{ij} + ε, where 1 ≤ i1 < i2 < . . . < is ≤ p, s ≥ 0. The Akaike information criterion (AIC) for the candidate model is AIC = n log(RSS/n) + 2s, where n is the sample size and RSS is the residual sum of squares of the model. The best model is the one that minimizes AIC. To parameterize the model, set θ = (θ1, . . . , θp), such that θi = 1 if xi is included as a predictor, and 0 otherwise. The set Θ of all possible θ has 2^p values.

Local search

First, define a neighborhood N(θ) for each θ, so that it contains candidate solutions that are “near” θ, obtainable by a small number of changes to the current θ. For example, if Θ = {θ = (θ1, . . .
, θp) : each θi = 0, 1}, one may define N (θ) to be the set of θ′ which are different from θ in at most one coordinate. After neighborhoods are defined, at iteration t, choose θt+1 from N (θt) according to a certain rule. For example, steepest ascent: θt+1 = arg maxθ∈N (θt) f (θ); ascent algorithm: θt+1 ∈ N (θt) uphill from θt , i.e., f (θt+1) ≥ f (θt). Jun Yan (University of Connecticut) STAT 5361 46 / 125 Combinatorial Optimization Local Search To avoid trapping into a local maximum, two often used variants are random-starts local search: repeatedly run an ascent algorithm to termination from a large number of randomly chosen starting points. steepest ascent/mildest descent: set to the least unfavorable θt+1 ∈ N (θt); I if θt is a local maximum θt+1 is the one with least decrease; I otherwise, θt+1 is the one with the largest increase. Jun Yan (University of Connecticut) STAT 5361 47 / 125 Combinatorial Optimization Local Search To select a linear model to minimize AIC (or maximize −AIC), randomly select a set of predictors at iteration t, decrease the AIC by adding a predictor to the set of selected predictors or deleting a predictor that has been selected; continue the iterations until the AIC can not be decreased; repeat Steps 1 and 2 many times, and choose the set of predictors at termination that has the lowest AIC. Jun Yan (University of Connecticut) STAT 5361 48 / 125 Combinatorial Optimization Local Search A salesman must visit each of p cities exactly once and return to his starting city, using the shortest total travel distance. A candidate solution is θ = (i1, i2, . . . , ip, i1) where i1, . . . , ip is a permutation of 1, . . ., p. The objective function to be minimized is f (θ) = d(i1, i2) + d(i2, i3) + · · ·+ d(ip−1, ip) + d(ip, i1). There are (p − 1)!/2 all possible routes, since the point of origin and direction of travel are arbitrary. For a traveling plan to visit 20 cities, that amounts to more than 6× 1016 possible routes. 
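A random-starts local search of the kind described above can be sketched for a small traveling salesman instance, using the “exchange two cities” neighborhood. The city coordinates and all names here are made up for illustration:

```python
import itertools, random

def tour_length(tour, d):
    """f(theta) = d(i1,i2) + ... + d(ip,i1), closing the loop."""
    return sum(d[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_swap_descent(tour, d):
    """Steepest-descent local search: move to the best neighboring tour
    (two cities exchanged) until no neighbor improves."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        best = tour_length(tour, d)
        for i, j in itertools.combinations(range(len(tour)), 2):
            cand = list(tour)
            cand[i], cand[j] = cand[j], cand[i]
            if tour_length(cand, d) < best:
                best, tour, improved = tour_length(cand, d), cand, True
    return tour

def random_starts(d, n_starts=20, seed=5361):
    """Random-starts local search: run descent from many starting tours
    and keep the best termination point."""
    rng = random.Random(seed)
    p = len(d)
    starts = [list(range(p))] + [rng.sample(range(p), p) for _ in range(n_starts - 1)]
    best_tour, best_len = None, float("inf")
    for s in starts:
        t = two_swap_descent(s, d)
        if tour_length(t, d) < best_len:
            best_tour, best_len = t, tour_length(t, d)
    return best_tour, best_len

# made-up planar cities; Euclidean distance matrix
pts = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 5), (1, 1)]
d = [[((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 for xb, yb in pts] for xa, ya in pts]
tour, length = random_starts(d)
print(tour, length)
```

Each descent terminates at a local minimum; restarting from many random tours is the standard remedy for getting trapped, exactly as described above.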
Jun Yan (University of Connecticut) STAT 5361 49 / 125 Combinatorial Optimization Local Search One could define N (θ) as the set of sequences that only differ from θ at two entries. In other words, each sequence in N (θ) is obtained by exchanging two entries of θ. For example, if θ = (1, 2, 5, 4, 6, 3, 1) then (1, 2, 3, 4, 6, 5, 1) is a neighbor of θ, because it only has 3 and 5 exchanged; while (1, 2, 4, 3, 6, 5, 1) is not a neighbor of θ, because it has 5, 4, and 3 rotated. Jun Yan (University of Connecticut) STAT 5361 50 / 125 Combinatorial Optimization Simulated Annealing Simulated annealing In most cases, the simulated annealing algorithm can be thought of as a randomized local search algorithm. It uses a “temperature” parameter to control the randomness of the search. The algorithm starts with a high temperature and cools down gradually so that a global optimum may be reached. This is analogous to the annealing of metal or glass. In the following description, a global minimum of f is being searched. The algorithm is run in stages, such that for the iterations within a stage, the temperature is a constant τj . Jun Yan (University of Connecticut) STAT 5361 51 / 125 Combinatorial Optimization Simulated Annealing Suppose iteration t belongs to stage j. 1. Sample a candidate θ∗ ∈ N (θ) according to a proposal density gt(θ |θt); for different t, the proposal density gt can be different. 2. Let ∆ = f (θ∗)− f (θt). a) If ∆ ≤ 0, then set θt+1 = θ∗. b) If ∆ > 0, then set θt+1 = θ∗ with probability e−∆/τj , and θt+1 = θt otherwise. This can be done as follows. Sample U ∼ Unif(0, 1). Then θt+1 = { θ∗ if U ≤ e−∆/τj θt otherwise. 3. Repeat steps 1 and 2 a total of mj times. 4. Update τj+1 = α(τj), mj+1 = β(mj) and move to stage j + 1, where α and β are two deterministic functions that govern how to cool down the temperature and how long to stay at a given temperature. 
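The staged scheme in steps 1-4 can be sketched as follows, here minimizing a made-up multimodal function over a grid of integers; the proposal, the cooling function α(τ) = 0.9τ, and the stage length m are my illustrative choices:

```python
import math, random

def anneal(f, x0, neighbor, tau0=1.0, alpha=0.9, m=50, stages=40, seed=1):
    """Simulated annealing for minimization: within stage j the
    temperature is a constant tau_j; an uphill candidate is accepted
    with probability exp(-delta / tau_j)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    tau = tau0
    for _ in range(stages):
        for _ in range(m):
            cand = neighbor(x, rng)            # step 1: propose from N(x)
            delta = f(cand) - fx               # step 2: Delta = f* - f(x_t)
            if delta <= 0 or rng.random() <= math.exp(-delta / tau):
                x, fx = cand, f(cand)
                if fx < fbest:
                    best, fbest = x, fx        # track the best visited state
        tau *= alpha                           # step 4: cool down
    return best, fbest

# made-up multimodal objective over the integers 0..99
def f(k):
    return (k - 63) ** 2 / 100.0 + 2 * math.sin(k)

def neighbor(k, rng):
    return min(99, max(0, k + rng.choice([-3, -2, -1, 1, 2, 3])))

x_best, f_best = anneal(f, x0=5, neighbor=neighbor)
print(x_best, f_best)
```

Early stages (high τ) accept many uphill moves and roam widely; late stages (low τ) behave like a plain descent, which is the annealing analogy in the notes.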
Jun Yan (University of Connecticut) STAT 5361 52 / 125 Combinatorial Optimization Simulated Annealing Suppose I is an image consisting of “pixels” I (i, j), i, j = 1, . . . ,N . The image is corrupted by noise Z (i, j) and only J = I + Z is observed. To reconstruct I , one way is to minimize a function f (I ) = ∑ i,j |J (i, j)− I (i, j)|2 + K (I ), where K (I ) is a function that has large values if I has many “irregularities”. The idea is that, as long as the noise is not too strong, the real image should be similar to J ; on the other hand, irregularities observed in J are likely to be due to noise and should be reduced. One way to use the simulated annealing to minimize f (I ) is as follows. In each iteration, only one pixel of I can be updated. That means two I and I ′ are neighbors only when they are different at just one pixel. Then at each iteration, choose a pixel I (i, j) and select a candidate value for it. Update I (i, j) according to the rule in step 2 and move to the next iteration. Jun Yan (University of Connecticut) STAT 5361 53 / 125 Combinatorial Optimization Simulated Annealing Some guidelines: R function optim can implement some simple versions of simulated annealing; the temperature τj should slowly decrease to 0; the number mj of iterations at each temperature τj should be large and increasing in j; reheating strategies that allow sporadic, systematic, or interactive temperature increases to prevent getting stuck in a local minimum at low temperatures can be effective. Jun Yan (University of Connecticut) STAT 5361 54 / 125 Combinatorial Optimization Genetic Algorithm Genetic algorithms First, each θ ∈ Θ is a string of alphabets: θ = (θ1, . . . , θC ), where each θi is a symbol from a finite set, such as {0, 1}, {0, 1, 2}, {‘a’, ‘b’, ...}. Genetic algorithms regard the maximization of f (θ) over Θ as a process of natural selection, with f (θ) being a measure of fitness and θ the genetic code, or “chromosome”. 
It assumes that fitness is a result of some “good” pieces of θ and by inheriting these pieces plus some mutations fitness is enhanced. In a genetic algorithm, at each iteration t, at least two candidate solutions θt,1, . . . , θt,P have to be tracked. Each iteration consists of several steps. Jun Yan (University of Connecticut) STAT 5361 55 / 125 Combinatorial Optimization Genetic Algorithm 1. Selection. Randomly select from θt,1, . . . , θt,P to form a set of pairs. The selection should be based on a fitness function φ(θ). In general, larger values of f (θ) result in larger values of φ(θ), and higher chance for θ to be selected. There are many selection mechanisms: a) select one parent θ with probability proportional to φ(θ) and the other parent completely at random; b) select each parent independently with probability proportional to φ(θ); φ must be selected carefully: using φ(θt,i) = f (θt,i) may result in rapid convergence into a local maximum. A common choice is φ(θt,i) = 2ri P(P + 1) where ri is the rank of f (θt,i) in f (θt,1), . . . , f (θt,P). Jun Yan (University of Connecticut) STAT 5361 56 / 125 Combinatorial Optimization Genetic Algorithm 2. Breeding. For each pair, (θa,θb), generate one or more “offspring” θ′ = c(θa,θb) ∈ Θ, where c is a random operator. Typically, c is a “crossover”. If θa = (θa1, . . . , θaC ), θb = (θb1, . . . , θbC ), then a crossover works as follows, a) randomly choose a position 1 ≤ d ≤ C ; and b) combine (θa1, . . . , θad) and (θb,d+1, . . . , θbC ) to form θ′ = (θa1, . . . , θad , θb,d+1, . . . , θbC ). If necessary, also take θ′′ = (θa,d+1, . . . , θaC , θb1, . . . , θbd) to be another offspring. Jun Yan (University of Connecticut) STAT 5361 57 / 125 Combinatorial Optimization Genetic Algorithm 3. Mutation. Make some random modifications to each offspring. Typically, if an offspring chromosome is θ = (θ1, . . . , θC ) then for i = 1, . . . 
,C , with a small probability 0 < p < 1 (mutation rate), change θi to a different value. The offspring produced at iteration t are taken as the candidate solutions for iteration t + 1. Jun Yan (University of Connecticut) STAT 5361 58 / 125 Combinatorial Optimization Genetic Algorithm Initialization. Usually the first generation consists of completely random individuals. Large values of the size of a generation, P, are preferred. For binary encoding of θ, one suggestion is to have C ≤ P ≤ 2C , where C is the chromosome length. In most real applications, population sizes have ranged between 10 and 200. Mutation rates are typically very low, in the neighborhood of 1%. Jun Yan (University of Connecticut) STAT 5361 59 / 125 Combinatorial Optimization Genetic Algorithm Termination. A genetic algorithm is usually terminated after a maximum number of iterations. Alternatively, the algorithm can be stopped once the genetic diversity in the current generation is sufficiently low. Jun Yan (University of Connecticut) STAT 5361 60 / 125 Combinatorial Optimization Genetic Algorithm In many problems, like the traveling salesman problem, it is natural to write θ as a permutation of 1, 2, . . . , p. Standard crossover usually produces invalid sequences For example, if θa = {7, 5, 2, 6, 3, 1, 9, 4, 8} θb = {9, 1, 2, 3, 8, 6, 7, 5, 4} and if the crossover point is between 2nd and 3rd positions, then it produces {7, 5, 2, 3, 8, 6, 7, 5, 4} an invalid traveling route. Jun Yan (University of Connecticut) STAT 5361 61 / 125 Combinatorial Optimization Genetic Algorithm A remedy is order crossover: pick a number of positions from one parent and get the values at those positions, say i1, . . . , is. Next identify those values from the 2nd parent. Suppose the set of values are found at j1 < j2 < . . . < js. Put i1 at j1, i2 at j2, . . . , is at js while keeping the values of the 2nd on other locations unchanged. 
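The order-crossover rule just described can be sketched in a few lines (the function name is mine; positions are 1-based as in the notes):

```python
def order_crossover(pa, pb, positions):
    """Order crossover: take the values of parent `pa` at the chosen
    (1-based) positions, locate those values in parent `pb`, and write
    them back into `pb` at those locations, in position order."""
    vals = [pa[k - 1] for k in sorted(positions)]      # i1, ..., is
    slots = sorted(pb.index(v) for v in vals)          # j1 < ... < js
    child = list(pb)
    for v, j in zip(vals, slots):
        child[j] = v                                    # put i_k at j_k
    return child

pa = [7, 5, 2, 6, 3, 1, 9, 4, 8]
pb = [9, 1, 2, 3, 8, 6, 7, 5, 4]
c1 = order_crossover(pa, pb, (4, 6, 7))
c2 = order_crossover(pb, pa, (4, 6, 7))
print(c1)  # [6, 1, 2, 3, 8, 9, 7, 5, 4], i.e. (612389754)
print(c2)  # [3, 5, 2, 6, 7, 1, 9, 4, 8], i.e. (352671948)
```

Because the child is a rearrangement of a valid permutation, it is always a valid route, unlike the standard crossover.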
Example: Consider parents (752631948) and (912386754) and random positions (4, 6, 7). The offspring are (612389754) and (352671948). The drawback of this operation is that it destroys some important structures of the parent routes, in particular, the links.

Edge-recombination crossover uses a different idea. It generates an offspring whose edges belong to those of the parent routes. An edge here means a link into or out of a city in a route. For example, the above two parents have the following links (the order of the numbers in each link is unimportant): (1, 2), (1, 3), (1, 9), (2, 3), (2, 5), (2, 6), (3, 6), (3, 8), (4, 5), (4, 8), (4, 9), (5, 7), (6, 7), (6, 8), (7, 8). First, choose one of the initial cities of the parents as the initial city of the offspring. At the k-th step, if i1, . . . , ik have been chosen as the first k cities on the route, then choose among the cities that are linked to ik and different from i1, . . . , ik−1 as the (k + 1)st city of the offspring.

EM and MM Algorithms

EM Optimizations

What it is: for inference when there is missing information I incomplete observations I auxiliary parameters that are unobserved but can facilitate inference on the main parameter. I Some important applications of the EM algorithm are in problems where one has what we might call pseudo missing data. We never had any chance of obtaining such data, but we could pretend they are missing and use the EM algorithm to facilitate the computation of MLEs.
Why it is important:
- widely useful;
- simple to implement;
- numerically stable.

Example: finite mixture. Suppose a population consists of several sub-populations, the k-th following a distribution p(x | θk) and having population fraction qk. Typically q = (q1, ..., qK) and θ = (θ1, ..., θK) are unknown, and the goal is to estimate them. Let x1, ..., xn be a sample. If for each xi we knew that it comes from the zi-th sub-population, then, letting x = (x1, ..., xn) and z = (z1, ..., zn), the likelihood function of (q, θ) would be
L(q, θ | x, z) = ∏_{i=1}^n q_{z_i} p(xi | θ_{z_i}).

However, in many cases z is unobservable:
- cluster analysis;
- the "sub-populations" may be constructs that facilitate inference or decision making, and so are nonexistent in reality.
The likelihood function then becomes
L(q, θ | x) = ∑_z L(q, θ | x, z).
The question is how to compute the MLE of q and θ efficiently. In this example, y = (x, z) is the complete data. When only x is observable, z is called the missing data.

Example: censored observations. Suppose a population follows a distribution p(y | θ), where θ is an unknown parameter, and a random sample is collected from p(y | θ). However, when an observation yi is greater than a cut-off C, it is censored and recorded as C, as is common in clinical trials; in that case we only know yi ≥ C but not its exact value. If yi ≤ C, then yi is fully observed. The observed data are therefore x1 = min(y1, C), ..., xn = min(yn, C). Here y = (y1, ..., yn) is the complete data and x = (x1, ..., xn) is the observed data. The question is how to estimate θ based on x.

Example: latent time series. Suppose Z1, . . . ,Zn and X1, . . .
,Xn are time series related by Xk = f(Zk | θ), where the transformation f is parametrized by θ. Zk may include not only variables of interest, but also some "noise" that interacts with those variables. Suppose the joint distribution of Z1, ..., Zn has the parametric form h(z1, ..., zn | γ), with the values of θ and γ both unknown. If X1, ..., Xn are observed but Z1, ..., Zn are only partially observed or completely hidden, how do we estimate θ and γ?

The Hidden Markov Model (HMM) is a typical case. Let S1, ..., Sn be a discrete Markov chain with a finite number of states. Conditional on S1, ..., Sn, the observations X1, ..., Xn are independent, such that, given Sk = s, Xk ~ N(μ_s, σ_s²).

[Figure: graphical model of the HMM; the hidden chain S1 → S2 → S3 → S4 → ..., with each state Sk emitting the observation Xk.]

In this case Zk = (Sk, Wk), where W1, ..., Wn ~ N(0, 1) are independent of S1, ..., Sn, such that Xk = μ_{S_k} + σ_{S_k} Wk. Typically μk, σk and the transition probabilities of Sk are unknown and only the Xk are observed.

Example: positron emission tomography. Positron emission tomography (PET) is a tool to study the metabolic activity in organs. To generate a PET scan, some radioactive material is administered into the organ. The emissions of the material are then recorded using a PET scanner, which consists of an array of detectors surrounding the patient's body. The region of interest is divided into "voxels". The number of photons Yij emitted from voxel i and received by detector j is assumed to follow a Poisson distribution,
Yij ~ Poisson(θi aij),
where the coefficient aij can be determined accurately beforehand, and θi is the emission density of the radioactive material in voxel i, which quantifies the metabolic activity there.
If we could observe Yij for each pair of voxel and detector, the θi's could be estimated by MLE. In reality, however, each detector j receives photons from all the voxels, so only the sum Xj = ∑_i Yij is observable. The goal is to use X = (Xj) instead of Y = (Yij) to estimate θ = (θi).

General formulation. Suppose the complete data Y ~ pY(y | θ), while the observed data is X = M(Y), where M is a many-to-one mapping. Then
pX(x | θ) = ∫_{M(y)=x} pY(y | θ) dy,
where the left-hand side is the observed-data likelihood L(θ | x) and the integrand is the complete-data likelihood L(θ | y). If only X = x is observed, then the MLE θ̂ satisfies L′(θ̂ | x) = 0, i.e.,
∫_{M(y)=x} L′(θ̂ | y) dy = 0.

Write the log-likelihood l(θ | y) = ln L(θ | y), so that
[L(θ | y)]′ = [e^{l(θ | y)}]′ = [l(θ | y)]′ e^{l(θ | y)} = [l(θ | y)]′ L(θ | y).
Therefore ∫_{M(y)=x} L′(θ̂ | y) dy = 0 is the same as
∫_{M(y)=x} [l(θ̂ | y)]′ L(θ̂ | y) dy = 0.
In principle, Newton's method can be used to find a solution. However, since θ̂ appears in two places in the integrand, the implementation is often messy and, because of that, numerically unstable.

We may try to "decouple" the two θ's in the integrand to get simpler iterations. An iterative procedure could go as follows. At step t, given θt, find θt+1 to solve
∫_{M(y)=x} [l(θt+1 | y)]′ L(θt | y) dy = 0,
or, equivalently, to optimize the function
θ ↦ ∫_{M(y)=x} l(θ | y) L(θt | y) dy.
Since θt is different at each step, this function changes from step to step. If the θt converge, then the limit is a solution to the original likelihood equation.

The explicit numerical integration in the above function is inconvenient.
To remove it, note that, since θt is fixed, the above way of finding θt+1 is equivalent to optimizing
∫_{M(y)=x} l(θ | y) [L(θt | y) / L(θt | x)] dy.
For any θt,
L(θt | y) / L(θt | x) = pY(y | θt) / pX(x | θt) = p_{Y|X}(y | x, θt),
the conditional density of Y given X = x under pY(y | θt). The integral is then simply E[l(θ | Y) | x, θt]. It turns out that, to find a suitable θt+1, we need to maximize this function.

EM (Expectation-Maximization) algorithm. Define
Q(θ | θ′) = E[l(θ | Y) | x, θ′] = E[ln L(θ | Y) | x, θ′].
Initialize θ0. At iteration t = 1, 2, ..., with θt given, set θt+1 to maximize Q(θ | θt). The "E step" of the algorithm refers to the computation of Q(θ | θt), and the "M step" refers to its maximization. Stopping criteria are usually based on |θt+1 − θt| or |Q(θt+1 | θt) − Q(θt | θt)|.

If Y = (X, Z), where Z is the missing data, then, given X = x, the only unknown part of Y is Z, so
Q(θ | θ′) = E[ln L(θ | Y) | x, θ′] = E[ln L(θ | x, Z) | x, θ′]
= ∑_z p(z | x, θ′) ln p(x, z | θ) if Z is discrete,
= ∫ p(z | x, θ′) ln p(x, z | θ) dz if Z is continuous.

Gaussian mixture clustering. Suppose a population is a mixture of Gaussian distributions N(μk, Σk) with population fractions qk, k = 1, ..., K. The goal is to estimate the μk, Σk, and qk. Let x1, ..., xn ∈ R^d be an i.i.d. sample from the population, and let zi be the index of the Gaussian distribution from which xi is sampled. Suppose z = (z1, ..., zn) cannot be observed, so the observed data are x = (x1, ..., xn).

Suppose at iteration t of the EM algorithm one has obtained θt = (μt1, ..., μtK, Σt1, ..., ΣtK, qt1, ..., qtK).
Then, since the (xi, zi) are independent and the zi are discrete, for any candidate θ = (μ1, ..., μK, Σ1, ..., ΣK, q1, ..., qK),
Q(θ | θt) = ∑_z p(z | x, θt) ln p(x, z | θ) = ∑_{i=1}^n ∑_{k=1}^K w_{ik} ln p(zi = k, xi | θ),
where w_{ik} = p(zi = k | xi, θt).

To maximize Q(θ | θt), note that w_{ik} does not involve θ, so it should be computed first. By Bayes' rule,
w_{ik} = p(zi = k, xi | θt) / p(xi | θt) = p(zi = k, xi | θt) / ∑_{s=1}^K p(zi = s, xi | θt),
with
p(zi = k, xi | θt) = p(zi = k | θt) p(xi | zi = k, θt) = q_{tk} φ(xi | μ_{tk}, Σ_{tk}),   (1)
where φ(x | μ, Σ) is the density of N(μ, Σ) at x. Once the w_{ik} are computed,
Q(θ | θt) = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln p(zi = k, xi | θ),
which looks a lot simpler.

As in (1), p(zi = k, xi | θ) = qk φ(xi | μk, Σk). Since
φ(xi | μk, Σk) = (2π)^{−d/2} |Σk|^{−1/2} exp{ −(1/2)(xi − μk)′ Σk^{−1} (xi − μk) },
we have
Q(θ | θt) = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk] − (1/2) ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln|Σk| − (1/2) ∑_{k=1}^K ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk)
= I1 − I2/2 − I3/2,
where
I1 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk],
I2 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln|Σk|,
I3 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk).

First, only I3 contains the μk, and it is a sum of K quadratic forms, each involving a single μk:
I3k = ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk).
We only need to find the μk minimizing each I3k. By the property of the sample mean, μk must be the mean of the weighted sample x1, ..., xn, each xi having weight w_{ik}:
μk = ∑_{i=1}^n w_{ik} xi / ∑_{i=1}^n w_{ik}.
Next, only I2 and I3 contain the Σk, and I2 + I3 is the sum of K terms, each involving a single Σk:
Sk = ∑_{i=1}^n w_{ik} ln|Σk| + ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk),
so we only need to find the Σk minimizing each Sk. Given that μk equals the weighted mean of the xi, to minimize Sk, Σk must be the variance of the weighted sample x1, ..., xn:
Σk = ∑_{i=1}^n w_{ik} (xi − μk)(xi − μk)′ / ∑_{i=1}^n w_{ik}.

Finally, only I1 contains the qk. Since
I1 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk]
= −(d/2) ln(2π) ∑_{k=1}^K ∑_{i=1}^n w_{ik} + ∑_{k=1}^K ( ∑_{i=1}^n w_{ik} ) ln qk,
it suffices to maximize
∑_{k=1}^K Wk ln qk, where Wk = ∑_{i=1}^n w_{ik}.

The q1, ..., qK must satisfy q1 + ... + qK = 1, so the maximization can be carried out by solving ∂L/∂qk = 0, k = 1, ..., K, where the Lagrangian is
L(q1, ..., qK) = ∑_{k=1}^K Wk ln qk − λ( ∑_{k=1}^K qk − 1 ),
with λ a Lagrange multiplier. This gives
qk = Wk / ∑_{k=1}^K Wk.
Since ∑_{k=1}^K w_{ik} = 1 for each i, qk = (1/n) ∑_{i=1}^n w_{ik}.

The above steps constitute a basic EM Gaussian clustering algorithm, summarized as follows. Given θt = (μt1, ..., μtK, Σt1, ..., ΣtK, qt1, ..., qtK), for k = 1, ..., K:
- compute w_{ik} = p(zi = k | xi, θt), i = 1, ..., n;
- set q_{t+1,k} equal to the mean of w_{1k}, ..., w_{nk};
- set μ_{t+1,k} and Σ_{t+1,k} equal to the weighted average and weighted covariance of x1, ..., xn, with weights w_{1k}, ..., w_{nk}, respectively.
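The clustering updates just summarized can be sketched in NumPy; a minimal illustration (the function name, the deterministic initialization, and the small ridge added to each covariance are our own choices, not from the slides):

```python
import numpy as np

def em_gmm(x, K, n_iter=100):
    """Basic EM for a Gaussian mixture, following the slide updates:
    E-step computes responsibilities w_ik, M-step sets q, mu, Sigma to
    weighted fractions, means, and covariances.  x has shape (n, d)."""
    n, d = x.shape
    q = np.full(K, 1.0 / K)
    # initialize means at K data points spread along the first coordinate
    idx = np.argsort(x[:, 0])[np.linspace(0, n - 1, K).astype(int)]
    mu = x[idx].copy()
    Sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: w_ik proportional to q_k * phi(x_i | mu_k, Sigma_k), in logs
        logw = np.empty((n, K))
        for k in range(K):
            diff = x - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
            logw[:, k] = np.log(q[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        logw -= logw.max(axis=1, keepdims=True)    # stabilize before exponentiating
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted fractions, means, covariances
        Wk = w.sum(axis=0)
        q = Wk / n
        mu = (w.T @ x) / Wk[:, None]
        for k in range(K):
            diff = x - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / Wk[k] + 1e-6 * np.eye(d)
    return q, mu, Sigma
```

On two well-separated synthetic clusters the recovered means land near the true centers and the fractions sum to one by construction.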
For high-dimensional data there are many variants of the algorithm that put constraints on the Σk to improve computational and estimation efficiency. The strongest constraint (and computationally the simplest) is Σ1 = ... = ΣK = σ²I, with σ² the only covariance parameter. Others include:
1. Σk = σk² I, k = 1, ..., K, with the σk being the parameters;
2. Σk diagonal, equal to or different from each other;
3. Σk = σk² Σ, with the parameters being the σk² and Σ.

If Newton's method were used to directly maximize the log-likelihood l(θ | x) = ln L(θ | x), one would have to differentiate
l(θ | x) = ∑_{i=1}^n li(θ),
where, for every i = 1, ..., n,
li(θ) = ln p(xi | θ) = ln{ ∑_{z_i} p(xi, zi | θ) } = ln{ ∑_{k=1}^K p(xi | zi = k, θ) p(zi = k | θ) }
= ln{ ∑_{k=1}^K (2π)^{−d/2} qk |Σk|^{−1/2} e^{−(1/2)(xi − μk)′ Σk^{−1} (xi − μk)} }.
The iterations are apparently quite involved.

Computation. The R package mclust provides a large number of clustering methods, including EM-based methods. Use install.packages("mclust") to install it on your computer if it is not yet installed. To use it, first execute library("mclust"), which loads the functions in the package into the R session.

- mclustBIC computes the Bayesian information criterion for different Gaussian mixture models, characterized by the number of clusters and the constraints on the covariance matrices.
- mclustModel estimates the parameters of the best model determined via mclustBIC.
- mclustModelNames lists the names of the models used in the mclust package.
- mclust2Dplot plots 2-D data given the parameters of a Gaussian mixture model for the data.
Some MATLAB routines, such as EM-GM.m, can perform some of the functions of the R package.

EM for Exponential Families

Many important parametric distributions belong to an exponential family, which has the form
pY(y | θ) = c1(y) c2(θ) exp{θ′ s(y)},
where θ is a vector of parameters and s(y) a vector of sufficient statistics.

1. The normal densities N_d(θ, Σ) with Σ fixed form an exponential family, because
pY(y | θ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(y − θ)′ Σ^{−1} (y − θ) }
= [(2π)^{−d/2} e^{−y′Σ^{−1}y/2}] · [|Σ|^{−1/2} e^{−θ′Σ^{−1}θ/2}] · exp{θ′(Σ^{−1} y)},
with c1(y) = (2π)^{−d/2} e^{−y′Σ^{−1}y/2}, c2(θ) = |Σ|^{−1/2} e^{−θ′Σ^{−1}θ/2}, and s(y) = Σ^{−1} y.

2. The Poisson distribution Poisson(θ), θ > 0, has probability mass function
p(y | θ) = e^{−θ} θ^y / y!, y = 0, 1, ....
With c1(y) = 1/y!, c2(θ) = e^{−θ}, and s(y) = y, we have p(y | θ) = c1(y) c2(θ) e^{(ln θ) s(y)}, so the Poisson distributions form an exponential family with natural parameter ln θ.

3. The gamma distribution Gamma(a, r), a > 0, r > 0, has density
p(y | a, r) = e^{−y/r} y^{a−1} / (r^a Γ(a)) for y > 0, and 0 for y ≤ 0.
Indeed, with b = 1/r, θ = (a, b), and
c1(y) = 1/y, c2(θ) = c2(a, b) = b^a / Γ(a), s(y) = (ln y, −y),
we get c1(y) c2(θ) exp{θ′ s(y)} = (1/y) [b^a/Γ(a)] e^{a ln y − by} = p(y | a, r).

4. The beta distribution Beta(a, b), a > 0, b > 0, has density
p(y | a, b) = y^{a−1} (1 − y)^{b−1} / B(a, b) for 0 < y < 1, and 0 otherwise.
With θ = (a, b),
c1(y) = 1/[y(1 − y)], c2(θ) = 1/B(a, b), s(y) = (ln y, ln(1 − y)),
we get c1(y) c2(θ) exp{θ′ s(y)} = [1/(y(1 − y))] [1/B(a, b)] e^{a ln y + b ln(1−y)} = p(y | a, b).

Suppose x = M(y) is observed and we want to estimate θ.
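The exponential-family factorizations above are easy to verify numerically; a sketch for the gamma case, using only the standard library (function names are ours):

```python
import math

def gamma_pdf(y, a, r):
    """Gamma(a, r) density as given in the slides: e^{-y/r} y^{a-1} / (r^a Gamma(a))."""
    return math.exp(-y / r) * y ** (a - 1) / (r ** a * math.gamma(a))

def gamma_expfam(y, a, r):
    """Same density via c1(y) c2(theta) exp(theta . s(y)) with b = 1/r."""
    b = 1.0 / r
    c1 = 1.0 / y
    c2 = b ** a / math.gamma(a)
    theta = (a, b)
    s = (math.log(y), -y)
    return c1 * c2 * math.exp(theta[0] * s[0] + theta[1] * s[1])

for y in (0.5, 1.0, 3.7):
    assert math.isclose(gamma_pdf(y, 2.5, 1.3), gamma_expfam(y, 2.5, 1.3), rel_tol=1e-10)
```

The two routes agree to floating-point precision, confirming the factorization.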
In this setting the EM algorithm is significantly simplified, with no explicit evaluation of Q(θ | θ′). It works as follows: given θt, set θt+1 such that
E[s(Y) | θt+1] = E[s(Y) | x, θt].   (2)

To get this result, we start by evaluating
Q(θ | θt) = ∫ [ln pY(y | θ)] p(y | x, θt) dy
= ∫ [ln c1(y) + ln c2(θ) + θ′ s(y)] p(y | x, θt) dy
= ∫ [ln c1(y)] p(y | x, θt) dy + ln c2(θ) + θ′ E[s(Y) | x, θt]
= k(θt) + ln c2(θ) + θ′ E[s(Y) | x, θt],
where k(θt) does not depend on θ.

To maximize Q(θ | θt) as a function of θ, it suffices to solve
[ln c2(θ)]′ + E[s(Y) | x, θt] = 0, i.e., c2′(θ)/c2(θ) + E[s(Y) | x, θt] = 0.
The following identity is basic for exponential families: for any θ,
c2′(θ)/c2(θ) = −E[s(Y) | θ] = −∫ s(y) pY(y | θ) dy.   (3)
Assuming (3) for now, it follows that θt+1 should be set so that E[s(Y) | θt+1] = E[s(Y) | x, θt], i.e., as in (2).

To get (3), notice that ∫ pY(y | θ) dy = 1 for any θ, i.e.,
c2(θ) ∫ c1(y) exp{θ′ s(y)} dy = 1.
Then ln c2(θ) = −ln{ ∫ c1(y) exp{θ′ s(y)} dy }. Differentiating both sides gives
c2′(θ)/c2(θ) = −∫ c1(y) s(y) exp{θ′ s(y)} dy / ∫ c1(y) exp{θ′ s(y)} dy
= −c2(θ) ∫ c1(y) s(y) exp{θ′ s(y)} dy = −∫ s(y) pY(y | θ) dy,
as claimed.

Convergence of EM

We first show that each step of the algorithm increases the observed-data log-likelihood l(θ | x) = ln pX(x | θ), that is, l(θt+1 | x) ≥ l(θt | x). Once this is done, one can say that l(θt | x) always climbs toward a local maximum of l(θ | x); in general there is no guarantee that the limit is the global maximum. EM converges more slowly than Newton's method, sometimes substantially so.
Nevertheless, the ease of implementation and the stable ascent of EM are often very attractive.

Since X = M(Y), pY(y | θ) = p(x, y | θ) = pX(x | θ) p_{Y|X}(y | x, θ), so
pX(x | θ) = pY(y | θ) / p_{Y|X}(y | x, θ).
Because x is fixed, write f(y | θ) = p_{Y|X}(y | x, θ) to simplify notation. Then
ln pX(x | θ) = ln pY(y | θ) − ln f(y | θ).

Integrate both sides with respect to the density f(y | θt). From
∫ [ln pY(y | θ)] f(y | θt) dy = ∫ [ln pY(y | θ)] p_{Y|X}(y | x, θt) dy = Q(θ | θt),
it follows that
ln pX(x | θ) = Q(θ | θt) − H(θ | θt),   (4)
where H(θ | θt) = ∫ [ln f(y | θ)] f(y | θt) dy.

To show ln pX(x | θt+1) ≥ ln pX(x | θt), note that by (4) the difference equals
[Q(θt+1 | θt) − Q(θt | θt)] − [H(θt+1 | θt) − H(θt | θt)].
By the M-step, Q(θt+1 | θt) ≥ Q(θt | θt). On the other hand, by Jensen's inequality,
H(θt+1 | θt) − H(θt | θt) = ∫ ln[ f(y | θt+1) / f(y | θt) ] f(y | θt) dy
≤ ln{ ∫ [ f(y | θt+1) / f(y | θt) ] f(y | θt) dy } = ln{ ∫ f(y | θt+1) dy } = 0,
which shows that the likelihood increases at each iteration.

Variants of EM

- E step is hard: Monte Carlo EM (MCEM).
- M step is hard: instead of choosing θt+1 to maximize Q(θ | θt), one simply chooses any θt+1 for which Q(θt+1 | θt) > Q(θt | θt); the resulting algorithm is called generalized EM, or GEM. In GEM, l(θt | x) is still increasing. An example is the ECM algorithm, which replaces the M-step of EM with several computationally simpler conditional maximization (CM) steps.
Variance Estimation

- Louis' method (1982, JRSSB)
- Supplemented EM (SEM) algorithm (Meng and Rubin, 1991, JASA)
- Oakes' method (1999, JRSSB)
- Bootstrapping
- Empirical information
- Numerical differentiation

Example: Hidden Markov Model

The hidden states s = (st), t = 1, ..., T, follow a Markov chain with initial state distribution πk, k = 1, ..., M, and transition probability matrix {a_{k,l} = Pr(st+1 = l | st = k)} (note this matrix is row stochastic). Given the state st, the observation ut is independent of the other observations and states; given state k, the observation follows a distribution bk(u). For example, bk(u | γk) can be N(μk, Σk), or discrete. The observed data are u = (ut), t = 1, ..., T.

Likelihood.
- Parameters: θ contains the πk's, a_{k,l}'s, and γk's.
- Missing (latent, unobserved) states: st, t = 1, ..., T.
- Complete data: (s, u).
- Complete-data likelihood:
π_{s_1} ∏_{t=2}^T a_{s_{t−1}, s_t} ∏_{t=1}^T b_{s_t}(ut | γ_{s_t}).
- Complete-data log-likelihood:
ln p(s, u | θ) = ln π_{s_1} + ∑_{t=2}^T ln a_{s_{t−1}, s_t} + ∑_{t=1}^T ln b_{s_t}(ut | γ_{s_t}).

E-step. Taking the expectation of ln p(s, u | θ′) with respect to p(s | u, θ):
E[ln p(s, u | θ′) | u, θ] = ∑_s p(s | u, θ) ln p(s, u | θ′)
= ∑_{k=1}^M Lk(1) ln π′k + ∑_{t=2}^T ∑_{k=1}^M ∑_{l=1}^M H_{k,l}(t) ln a′_{k,l} + ∑_{t=1}^T ∑_{k=1}^M Lk(t) ln bk(ut | γ′k),
where
Lk(t) = Pr(st = k | u) = ∑_{s: s_t = k} p(s | u),
H_{k,l}(t) = Pr(st = k, st+1 = l | u) = ∑_{s: s_t = k, s_{t+1} = l} p(s | u).
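In practice the Lk(t) and H_{k,l}(t) are computed with the forward and backward recursions introduced on the following slides. A minimal NumPy sketch, assuming the emission densities bk(ut) have been tabulated into a T × M matrix (the function and variable names are ours):

```python
import numpy as np

def forward_backward(pi, A, B):
    """Forward-backward recursions as in the slides.

    pi: (M,) initial state distribution.
    A:  (M, M) row-stochastic transition matrix, A[k, l] = a_{k,l}.
    B:  (T, M) emission matrix, B[t, k] = b_k(u_t).
    Returns L of shape (T, M) with L[t, k] = Pr(s_t = k | u), and p(u).
    """
    T, M = B.shape
    alpha = np.zeros((T, M))
    beta = np.zeros((T, M))
    alpha[0] = pi * B[0]                       # alpha_k(1) = pi_k b_k(u_1)
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)   # alpha_k(t) = b_k(u_t) sum_l alpha_l(t-1) a_{l,k}
    beta[T - 1] = 1.0                          # beta_k(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1]) # beta_k(t) = sum_l a_{k,l} b_l(u_{t+1}) beta_l(t+1)
    pu = alpha[T - 1].sum()                    # p(u) = sum_k alpha_k(t) beta_k(t), any t
    L = alpha * beta / pu
    return L, pu
```

On a tiny two-state chain with discrete emissions, p(u) from the recursion agrees with brute-force summation over all state paths, and each row of L sums to one.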
M-step. If the bk are multivariate normal, the M-step has closed-form solutions:
μk = ∑_{t=1}^T Lk(t) ut / ∑_{t=1}^T Lk(t),
Σk = ∑_{t=1}^T Lk(t)(ut − μk)(ut − μk)′ / ∑_{t=1}^T Lk(t),
a_{k,l} = ∑_{t=1}^{T−1} H_{k,l}(t) / ∑_{t=1}^{T−1} Lk(t),
πk = ∑_{t=1}^T Lk(t) / ∑_{k=1}^M ∑_{t=1}^T Lk(t).

Forward and backward probabilities. These allow efficient computation of the Lk(t)'s and H_{k,l}(t)'s, with computation of order M²T and memory requirement of order MT.
- Forward probability: αk(t) = Pr(u1, ..., ut, st = k).
- Backward probability: βk(t) = Pr(ut+1, ..., uT | st = k).

Forward recursion:
αk(1) = πk bk(u1), 1 ≤ k ≤ M,
αk(t) = bk(ut) ∑_{l=1}^M αl(t − 1) a_{l,k}, 1 < t ≤ T, 1 ≤ k ≤ M.
Backward recursion:
βk(T) = 1, 1 ≤ k ≤ M,
βk(t) = ∑_{l=1}^M a_{k,l} bl(ut+1) βl(t + 1), 1 ≤ t < T, 1 ≤ k ≤ M.

Fast computation of the E-step quantities:
Lk(t) = Pr(st = k | u) = αk(t) βk(t) / p(u),
H_{k,l}(t) = Pr(st = k, st+1 = l | u) = αk(t) a_{k,l} bl(ut+1) βl(t + 1) / p(u),
p(u) = ∑_{k=1}^M αk(t) βk(t), for any t.

Majorization-Minimization Algorithm

The MM algorithm is not a single algorithm but a prescription for constructing optimization algorithms; the EM algorithm is a special case. An MM algorithm operates by creating a surrogate function that majorizes (minorizes) the objective function. When the surrogate is optimized, the objective function is driven downhill (uphill). Hence: Majorization-Minimization, or Minorization-Maximization.
Advantages of MM:
- it may yield an algorithm that avoids matrix inversion;
- it can linearize an optimization problem;
- it can conveniently deal with equality and inequality constraints;
- it can solve a non-differentiable problem by iteratively solving smooth problems.

A function g(θ | θs) is said to majorize the function f(θ) at θs if
f(θs) = g(θs | θs), and f(θ) ≤ g(θ | θs) for all θ.
Let θs+1 = argmin_θ g(θ | θs). It follows that
f(θs+1) ≤ g(θs+1 | θs) ≤ g(θs | θs) = f(θs).
The inequality is strict unless g(θs+1 | θs) = g(θs | θs) and f(θs+1) = g(θs+1 | θs). Therefore, by alternating between the majorization and minimization steps, the objective function decreases monotonically, and convergence of the objective values is guaranteed.

Any inequality can be used to construct the desired majorized/minorized version of the objective function. Typical choices include:
- Jensen's inequality: for a convex function f, f(E(X)) ≤ E[f(X)];
- the convexity inequality: for any λ ∈ [0, 1], f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2);
- the Cauchy-Schwarz inequality;
- supporting hyperplanes;
- the arithmetic-geometric mean inequality.

Example: Penalized AFT model with induced smoothing

AFT model. Let {Ti, Ci, xi}, i = 1, . . .
,n be n independent copies of {T, C, X}, where Ti and Ci are the log-transformed failure time and log-transformed censoring time, xi is a p × 1 covariate vector, and, given X, C and T are independent. The accelerated failure time (AFT) model has the form
Ti = xi′β + εi, i = 1, ..., n.
The observed data are (Yi, δi, xi), where Yi = min(Ti, Ci) and δi = I(Ti < Ci).

For a case-cohort study, the weight-adjusted estimating equation with induced smoothing is
Ũn(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi (xi − xj) Φ( (ej(β) − ei(β)) / rij ) = 0,
where ej(β) = Yj − xj′β and rij² = n^{−1} (xi − xj)′(xi − xj) = n^{−1} ‖xi − xj‖².
Define H(x) = xΦ(x) + φ(x) and
L̃n(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi rij H( (ej(β) − ei(β)) / rij ).
Then solving the estimating equation Ũn(β) = 0 is equivalent to minimizing the objective function L̃n(β), which is convex in β.

We propose the penalized AFT method, which amounts to minimizing
P̃Ln(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi rij H( (ej(β) − ei(β)) / rij ) + λ P(β),
where P(β) is some penalty function and λ is the tuning parameter.

To solve this problem, the key is to identify a surrogate function for L̃n(β) that is easy to work with. In view of the form of the objective function, an immediate idea is to majorize H(x). Because H(x) is twice differentiable with H″(x) = φ(x) ≤ φ(0) ≈ 0.4 for all x ∈ R, it follows that for any x0 ∈ R,
H(x) ≤ G(x | x0) ≡ H(x0) + (x − x0)Φ(x0) + (φ(0)/2)(x − x0)².
Note that in a standard optimization method based on quadratic approximation, e.g., Newton-Raphson, the φ(0) term is replaced with φ(x0), which, however, does not guarantee the majorization.
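The claimed majorization H(x) ≤ G(x | x0) is easy to check numerically; a small sketch using only the standard library (Φ is computed from math.erf):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def H(x):
    return x * Phi(x) + phi(x)

def G(x, x0):
    """Quadratic majorizer of H at x0: slope Phi(x0), curvature bounded by phi(0)."""
    return H(x0) + (x - x0) * Phi(x0) + 0.5 * phi(0.0) * (x - x0) ** 2

# H(x) <= G(x | x0) everywhere, with equality at x = x0
for x0 in (-2.0, 0.0, 1.5):
    assert abs(G(x0, x0) - H(x0)) < 1e-12
    for i in range(-400, 401):
        x = i / 50.0
        assert H(x) <= G(x, x0) + 1e-12
```

The bound holds because H′(x) = Φ(x) and H″(x) = φ(x) never exceeds φ(0), so the quadratic with curvature φ(0) dominates the second-order Taylor remainder.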
Write
P̃Ln(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n wij H(aij + zij′β) + λ P(β),
where wij = Δi rij, aij = (Yj − Yi)/rij, and zij = (xi − xj)/rij. Then, for the current iterate βs,
P̃Ln(β) ≤ (1/n) ∑_{i=1}^n ∑_{j=1}^n wij G(aij + zij′β | aij + zij′βs) + λ P(β)
= (1/2) β′Aβ − bs′β + λ P(β) + const ≡ L̃n*(β | βs) + const,
where
A = (φ(0)/n) ∑_{i=1}^n ∑_{j=1}^n wij zij zij′,
bs = Aβs − (1/n) ∑_{i=1}^n ∑_{j=1}^n wij Φ(aij + zij′βs) zij.
Minimizing L̃n*(β | βs) is a lasso-type problem, for which a coordinate descent algorithm can be applied.

MM for the penalized AFT model:

Initialize β0 ∈ R^p; compute A = (φ(0)/n) ∑_{i=1}^n ∑_{j=1}^n wij zij zij′; set k ← 0.
repeat
  (1) Majorization step: b^{k+1} ← Aβ^k − (1/n) ∑_{i=1}^n ∑_{j=1}^n wij Φ(aij + zij′β^k) zij.
  (2) Minimization step: initialize β̃^0 = β^k and s = 0.
      while ‖β̃^{s+1} − β̃^s‖ / ‖β̃^s‖ ≥ ε and s < smax do
        (i) for j = 1, ..., p: β̃^s_j ← (1/ajj) S( b^{k+1}_j − aj′β̃^s + ajj β̃^s_j, λ ), where aj is the j-th column of A and S(·, λ) is the soft-thresholding operator;
        (ii) β̃^{s+1} ← β̃^s;
        (iii) s ← s + 1.
      end while
      β^{k+1} ← β̃^{s+1}; set k ← k + 1.
until convergence, i.e., ‖β^{k+1} − β^k‖ / ‖β^k‖ ≤ ε.
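The inner minimization step is cyclic coordinate descent with soft-thresholding for a quadratic-plus-lasso objective (1/2)β′Aβ − b′β + λ‖β‖₁. A self-contained sketch of that inner solver (our own minimal implementation, not the authors' code):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_quadratic_lasso(A, b, lam, beta0=None, tol=1e-10, max_iter=1000):
    """Cyclic coordinate descent for (1/2) beta'A beta - b'beta + lam*||beta||_1,
    with A positive definite.  Mirrors step (2) of the MM algorithm:
    beta_j <- S(b_j - a_j'beta + a_jj beta_j, lam) / a_jj."""
    p = len(b)
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        old = beta.copy()
        for j in range(p):
            r = b[j] - A[j] @ beta + A[j, j] * beta[j]   # partial residual for coordinate j
            beta[j] = soft_threshold(r, lam) / A[j, j]
        if np.linalg.norm(beta - old) <= tol * (np.linalg.norm(old) + 1e-12):
            break
    return beta
```

As a sanity check, with λ = 0 the solver reduces to a Gauss-Seidel sweep for Aβ = b, and with λ large enough that every |b_j| ≤ λ the solution is exactly zero.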