STAT 5361: Statistical Computing Jun Yan Department of Statistics University of Connecticut Outline I 1 Non-Stochastic Optimization Some Background Univariate Problems Multivariate Problems 2 Combinatorial Optimization Local Search Simulated Annealing Genetic Algorithm 3 EM and MM Algorithms EM Algorithm Majorization-Minimization Algorithm Example: Penalized AFT model with induced smoothing Non-Stochastic Optimization Some Background Non-stochastic Optimization What we will learn: Some basics about statistical inference Univariate problems I Newton’s method I Fisher scoring I Secant method I Fixed-point method I Connections of the above Multivariate problems I Newton’s method I Newton-like methods Jun Yan (University of Connecticut) STAT 5361 3 / 125 Non-Stochastic Optimization Some Background Background: likelihood inference Let x1, . . . ,xn be an i.i.d. sample from f (x |θ∗), with the true parameter value θ∗ being unknown. The likelihood function is L(θ) = n∏ i=1 f (xi |θ), The maximum likelihood estimator (MLE) of the parameter value is the maximizer of L(θ). MLE has the invariate property. Usually it is easier to work with the log likelihood function l(θ) = logL(θ). Jun Yan (University of Connecticut) STAT 5361 4 / 125 Non-Stochastic Optimization Some Background Typically, maximization of l(θ) is done by solving l ′(θ) = 0. I l ′(θ) is called the score function. I For each possible parameter value θ, l(θ) is a random variable, because it depends on the observed values of x1, . . . ,xn. For any θ, Eθ { l ′(θ) } = 0, Eθ { l ′(θ)l ′(θ)T } = −Eθ { l ′′(θ) } . where Eθ is the expectation with respect to f (x |θ). The equality holds true under mild regularity conditions. I (θ) = Eθ { l ′(θ)l ′(θ)T } is known as the Fisher information. Jun Yan (University of Connecticut) STAT 5361 5 / 125 Non-Stochastic Optimization Some Background The importance of the Fisher information I (θ) is that it sets the limit on how accurate an unbiased estimate of θ can be. 
As n →∞, the asymptotic distribution of √n(θ̂MLE − θ∗) is Np(0,nI (θ∗)−1). Since θ∗ is unknown, the asymptotic covariance matrix I (θ∗)−1) needs to be estimated. I If dim(θ) = 1, I (θ) is a nonnegative number. I If dim(θ) > 1, I (θ) is a nonnegative definite matrix. The observed Fisher information is −l ′′(θ). The expected Fisher information I (θ) may not be easily computed. The observed Fisher information is a good approximation to I (θ) that improves as n gets bigger. Jun Yan (University of Connecticut) STAT 5361 6 / 125 Non-Stochastic Optimization Some Background Besides MLEs, there are other likelihoods for parameter estimation. Suppose θ has two parts: θ = (φ,µ) and we are only interested in φ. The profile likelihood for φ is L(φ) := max µ L(φ,µ). µ is the nuisance parameter. Need to maximize L(φ,µ) for every fixed φ. Also need to maximize L(φ). Jun Yan (University of Connecticut) STAT 5361 7 / 125 Non-Stochastic Optimization Some Background A general method Suppose g(x) is a differentiable function, where x = (x1, . . . , xn). To find its maximum (or minimum), one method is to solve the equation g′(x) = 0 where g′(x) = ( ∂g ∂x1 , . . . , ∂g ∂xn )T . Then the maximization is equivalent to solving f (x) = 0, where f = g′. For maximum likelihood estimation, g is the log likelihood function l, and x is the corresponding parameter vector θ. Jun Yan (University of Connecticut) STAT 5361 8 / 125 Non-Stochastic Optimization Univariate Problems Univariate Problems Jun Yan (University of Connecticut) STAT 5361 9 / 125 Non-Stochastic Optimization Univariate Problems Optimization: Univariate case Goal: optimize a real-valued function g with respect to its argument, a p dimensional vector x. We will first consider the case p = 1. We will limit consideration mainly to smooth and differentiable functions. Root finding methods. Solving unconstrained nonlinear equations. Iterative algorithms: starting value, updating equation, stopping rule/convergence criterion. 
Using the bisection method to maximize g(x) = log x/(1 + x), or to solve f(x) = g′(x) = (1 + 1/x − log x)/(1 + x)^2 = 0.

Newton’s method

This is a fast approach to finding roots of a differentiable function f(x). First, set an initial value x0. Then for t = 0, 1, . . ., compute xt+1 = xt + ht, with ht = −f(xt)/f′(xt). Continue the iterations until xt converges. Also known as Newton-Raphson iteration. Need to specify x0. If f(x) = 0 has multiple solutions, the end result depends on x0.

Newton’s method requires computing the derivative of a function. Algorithms are available to find root(s) of a univariate function without having to compute its derivative; for example, in R, one can use uniroot. Newton’s method can be applied to optimize g by applying it to f = g′. Both g′ (gradient) and g′′ (Hessian) are needed. Many variants of Newton’s method avoid the computation of the Hessian, which can be difficult, especially for multivariate functions.

To maximize g(x) = log x/(1 + x), first find f(x) = g′(x) = (1 + 1/x − log x)/(1 + x)^2 and f′(x) = g′′(x) = −(3 + 4/x + 1/x^2 − 2 log x)/(1 + x)^3. So in Newton’s method, ht = (xt + 1)(1 + 1/xt − log xt)/(3 + 4/xt + 1/xt^2 − 2 log xt). Note that to solve f(x) = 0, one can instead solve 1 + 1/x − log x = 0. Treat this as a new f function. Then in Newton’s method, ht = xt − xt^2 log xt/(1 + xt), so that xt+1 = 2xt − xt^2 log xt/(1 + xt).

To maximize the log likelihood l(θ), Newton’s method computes θt+1 = θt − l′(θt)/l′′(θt). Example: consider x1, . . . , xn ∼ i.i.d. N(µ, σ^2).
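The simplified Newton update for the running example g(x) = log x/(1 + x) can be checked numerically. The course illustrates these methods in R; the following is an illustrative Python sketch (the function name is mine) iterating xt+1 = 2xt − xt^2 log xt/(1 + xt):

```python
import math

def newton_log_ratio(x0=3.0, tol=1e-10, max_iter=100):
    """Newton iteration for f(x) = 1 + 1/x - log x, the stationarity
    condition of g(x) = log(x)/(1 + x), using the simplified update
    x_{t+1} = 2*x - x**2 * log(x)/(1 + x) derived in the notes."""
    x = x0
    for _ in range(max_iter):
        x_new = 2 * x - x ** 2 * math.log(x) / (1 + x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = newton_log_ratio()
print(x_star)  # the maximizer of log(x)/(1+x), about 3.591
```

Starting from x0 = 3, the iterates settle in a handful of steps, consistent with the quadratic convergence order discussed below.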
Jun Yan (University of Connecticut) STAT 5361 14 / 125 Non-Stochastic Optimization Univariate Problems Example: consider the model on shift p(x | θ) = p(x − θ). Given observations x1, . . . , xn i.i.d. ∼ p(x | θ) l(θ) = n∑ i=1 log p(xi − θ) l ′(θ) = − n∑ i=1 p′(xi − θ) p(xi − θ) l ′′(θ) = n∑ i=1 p′′(xi − θ) p(xi − θ) − n∑ i=1 {p′(xi − θ) p(xi − θ) }2 . Note that we need to update θ in the iterations, not x1, . . . , xn . Jun Yan (University of Connecticut) STAT 5361 15 / 125 Non-Stochastic Optimization Univariate Problems In R, to minimize a function, one can use z = nlminb(x0, g, gr.g, hess.g) here x0 is the initial value, g is the function being minimized, gr.g its gradient and hess.g its Hessian. (Unconstrained and box-constrained optimization using PORT routines). In the above function, the derivatives g′ and g′′ have to be analytically calculated. One can also use z = nlminb(x0, g, gr.g) without inputing the analytic expression of g′′, or, even simpler, z = nlminb(x0, g) without inputing the analytic expressions of either derivatives. In these cases, numerical approximations of derivatives will be computed during the iterations. Other functions for optimization in R include optim, optimize, nlm and constrOptim. Jun Yan (University of Connecticut) STAT 5361 16 / 125 Non-Stochastic Optimization Univariate Problems For profile likelihood L(φ) = max µ L(φ,µ) we need to optimize L(φ,µ) for each given φ. In R, in order to get minx g(x, y) for each fixed y, one can do the following. First, define g(x,y), gr.g(x,y), etc. Then call, say, nlminb(x0, g, gr.g, hess.g, y=1), or nlminb(x0, g, gr.g, y=.3), or nlminb(x0, g, y=-1) If one only wants minimization for x ∈ [a, b], use, nlminb(x0, ..., lower=a, upper=b) Jun Yan (University of Connecticut) STAT 5361 17 / 125 Non-Stochastic Optimization Univariate Problems Fisher scoring This is a variant of Newton’s method specific for MLE. Recall −l ′′(θ) is the observed Fisher information at θ. 
To maximize l(θ), an alternative is to replace −l ′′(θt) with I (θt) to get θt+1 = θt + l ′(θt) I (θt) . To minimize −l(θ), one still can use z = nlminb(x0, f, grf, fs) where f is −l(θ), grf is −l ′(θ), and fs is I (θ) instead of −l ′′(θ). Generally, use Fisher scoring in the beginning to make rapid improvements, and Newton’s method for refinement near the end. Jun Yan (University of Connecticut) STAT 5361 18 / 125 Non-Stochastic Optimization Univariate Problems Continuing the example on p(x | θ) = p(x − θ), to use Fisher scoring, we need to compute I (θ) = −Eθ[l ′′(θ)]. We already know l ′′(θ) = n∑ i=1 p′′(xi − θ) p(xi − θ) − n∑ i=1 {p′(xi − θ) p(xi − θ) }2 . Therefore I (θ) = −nEθ [ p′′(X − θ) p(X − θ) − {p′(X − θ) p(X − θ) }2] . Jun Yan (University of Connecticut) STAT 5361 19 / 125 Non-Stochastic Optimization Univariate Problems Since under parameter θ, X has density p(x − θ), the last Eθ[...] equals∫ [p′′(x − θ) p(x − θ) − {p′(x − θ) p(x − θ) }2] p(x − θ) dx = ∫ p′′(x − θ) dx − ∫ [p′(x − θ)]2 p(x − θ) dx = d 2 dθ2 ∫ p(x − θ) dx − ∫ {p′(x)}2 p(x) dx = d 2 dθ2 1− ∫ {p′(x)}2 p(x) dx = − ∫ {p′(x)}2 p(x) dx. Therefore, I (θ) = n ∫ {p′(x)}2 p(x) dx. So, in this case, Fisher information is a constant. Jun Yan (University of Connecticut) STAT 5361 20 / 125 Non-Stochastic Optimization Univariate Problems Secant method Approximating f ′(xt) by f (xt)− f (xt−1) xt − xt−1 , the Newton’s method turns into the secant method xt+1 = xt − f (xt)(xt − xt−1)f (xt)− f (xt−1) . Need to specify x0 and x1. Jun Yan (University of Connecticut) STAT 5361 21 / 125 Non-Stochastic Optimization Univariate Problems Fixed point iteration Compute xt+1 = xt + αf (xt), t = 0, 1, . . . , where α 6= 0 is a tuning parameter so that |1 + αf ′(x)| ≤ λ < 1, all x or more generally, |x − y + α[f (x)− f (y)]| ≤ λ|x − y|, any x, y with 0 ≤ λ < 1 a constant, i.e., F(x) = x + αf (x) is a contraction. If such α exists, the speed of convergence depends on λ. The smaller λ is, the faster the convergence. 
Some Details on Fixed-point iteration

Definition: A fixed point of a function is a point at which the value of the function equals the point itself, i.e., x = G(x). Fixed-point iteration: the natural way of hunting for a fixed point is to iterate xt+1 = G(xt).

Definition: A function G is contractive on [a, b] if (1) G(x) ∈ [a, b] whenever x ∈ [a, b], and (2) |G(x1) − G(x2)| ≤ λ|x1 − x2| for all x1, x2 ∈ [a, b] and some λ ∈ [0, 1).

Theorem: if G is contractive on [a, b], then there is a unique fixed point x∗ ∈ [a, b], and the fixed-point iteration converges to it when started inside the interval.

Convergence: |xt+1 − xt| = |G(xt) − G(xt−1)| ≤ λ|xt − xt−1| ≤ λ^t |x1 − x0| → 0 as t → ∞. It follows (by a Cauchy-sequence argument) that {xt} converges to a limit x∗.

Connection between the fixed-point method and Newton’s method

Root finding via fixed-point iteration: to solve f(x) = 0, we can simply let G(x) = x + αf(x), where α ≠ 0 is a constant. Required Lipschitz condition: |x − y + α[f(x) − f(y)]| ≤ λ|x − y| for some λ ∈ [0, 1) and all x, y ∈ [a, b]. This holds if |G′(x)| ≤ λ for some λ ∈ [0, 1) and all x ∈ [a, b], i.e., |1 + αf′(x)| ≤ λ (use the mean value theorem). Newton’s method: G(x) = x − f(x)/f′(x). So it is as if αt is chosen adaptively as αt = −1/f′(xt). This leads to a faster (quadratic) convergence order.

Convergence Order

Define εt = xt − x∗. A method has convergence order β if limt→∞ εt = 0 and limt→∞ |εt+1|/|εt|^β = c for some constants c ≠ 0 and β > 0. For Newton’s method, β = 2 (if it converges). For the secant method, β ≈ 1.62. For fixed-point iteration, β = 1.
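The contraction condition and the convergence orders above can be illustrated on the running example f(x) = 1 + 1/x − log x. A minimal Python sketch (function names and the choices x0 = 3, α = 2 are mine):

```python
import math

def f(x):
    return 1 + 1 / x - math.log(x)  # f = g' for g(x) = log(x)/(1+x)

def fixed_point(f, x0, alpha, tol=1e-10):
    """Scaled fixed-point iteration x_{t+1} = x_t + alpha*f(x_t);
    converges linearly (beta = 1) when G(x) = x + alpha*f(x) contracts."""
    x, steps = x0, 0
    while steps < 1000:
        x_new = x + alpha * f(x)
        steps += 1
        if abs(x_new - x) < tol:
            return x_new, steps
        x = x_new
    return x, steps

def secant(f, x0, x1, tol=1e-10):
    """Secant iteration x_{t+1} = x_t - f(x_t)(x_t - x_{t-1})
    / (f(x_t) - f(x_{t-1})); convergence order beta ~ 1.62."""
    steps = 0
    while steps < 1000:
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        steps += 1
        if abs(x2 - x1) < tol:
            return x2, steps
        x0, x1 = x1, x2
    return x1, steps

root_fp, n_fp = fixed_point(f, 3.0, alpha=2.0)
root_sc, n_sc = secant(f, 3.0, 4.0)
print(root_fp, n_fp)  # about 3.591; more iterations (linear rate)
print(root_sc, n_sc)  # same root in fewer iterations
```

Both find the same root; the fixed-point iteration needs noticeably more steps at the same tolerance, matching β = 1 versus β ≈ 1.62.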
Jun Yan (University of Connecticut) STAT 5361 25 / 125 Non-Stochastic Optimization Univariate Problems Convergence Order For Newton’s method: suppose f has two continuous derivatives and f ′(x∗) 6= 0. There then exists a neighborhood of x∗ within which f ′(x) 6= 0 for all x. By Taylor expansion, 0 = f (x∗) = f (xt) + f ′(xt)(x∗ − xt) + 12 f ′′(q)(x∗ − xt)2, for some q between x∗ and xt . Rearranging terms, we find that εt+1 ε2t = f ′′(q) 2f ′(xt) → f ′′(x∗) 2f ′(x∗) . Jun Yan (University of Connecticut) STAT 5361 26 / 125 Non-Stochastic Optimization Univariate Problems Convergence Order For fixed point iteration: let G be a continuous function on the closed interval [a, b] with G : [a, b]→ [a, b] and suppose that G ′ is continuous on the open interval (a, b) with |G ′(x)| ≤ k < 1 for all x ∈ (a, b). If g′(x∗) 6= 0, then for any x0 ∈ [a, b], the fixed point iteration converges linearly to the fixed point x∗. This is because |xt+1 − x∗| = |G(xt)−G(x∗)| = |G ′(q)||xt − x∗|, for some q between xt and x∗. This implies that |εt+1| |εt | → |G ′(x∗)|. Jun Yan (University of Connecticut) STAT 5361 27 / 125 Non-Stochastic Optimization Univariate Problems Stopping Rules Absolute convergence criterion: ‖xt+1 − xt‖ < , where is a chosen tolerance. Relative convergence criterion ‖xt+1 − xt‖ ‖xt‖ < . If xt close to zero, ‖xt+1 − xt‖ ‖xt‖+ < . Many other rules depending on the algorithm. Jun Yan (University of Connecticut) STAT 5361 28 / 125 Non-Stochastic Optimization Multivariate Problems Multivariate Problems Jun Yan (University of Connecticut) STAT 5361 29 / 125 Non-Stochastic Optimization Multivariate Problems Newton’s method Let g now be a function in x = (x1, . . . , xp)T . The generalization is straightforward: to maximize g(x), set xt+1 = xt − [g′′(xt)]−1g′(xt). g′′(x) is a p × p matrix with the (i, j)th entry equal to ∂2g(x) ∂xi∂xj ; g′(x) is a p × 1 vector with the ith entry equal to ∂g(x) ∂xi . 
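The multivariate Newton update above can be sketched on a familiar statistical example: maximizing the N(µ, σ^2) log likelihood over θ = (µ, σ). This illustrative Python sketch (all names mine) solves the 2 × 2 Newton system by hand and adds a step-halving safeguard to keep σ > 0 and guarantee ascent (the safeguard is my addition, not part of the plain update):

```python
import math

def normal_mle_newton(x, mu0, sigma0, tol=1e-10, max_iter=200):
    """Two-dimensional Newton iteration for the N(mu, sigma^2)
    log likelihood, with step halving for safety."""
    n = len(x)

    def loglik(mu, s):
        return -n * math.log(s) - sum((xi - mu) ** 2 for xi in x) / (2 * s ** 2)

    mu, s = mu0, sigma0
    for _ in range(max_iter):
        r1 = sum(xi - mu for xi in x)
        r2 = sum((xi - mu) ** 2 for xi in x)
        # gradient l'(theta)
        g1 = r1 / s ** 2
        g2 = -n / s + r2 / s ** 3
        # Hessian l''(theta)
        h11 = -n / s ** 2
        h12 = -2 * r1 / s ** 3
        h22 = n / s ** 2 - 3 * r2 / s ** 4
        # solve the 2x2 system  H h = -g  by hand
        det = h11 * h22 - h12 * h12
        d_mu = -(h22 * g1 - h12 * g2) / det
        d_s = -(h11 * g2 - h12 * g1) / det
        step = 1.0
        while s + step * d_s <= 0 or loglik(mu + step * d_mu, s + step * d_s) < loglik(mu, s):
            step /= 2
            if step < 1e-8:
                break
        mu, s = mu + step * d_mu, s + step * d_s
        if abs(step * d_mu) + abs(step * d_s) < tol:
            break
    return mu, s

x = [2.1, 1.3, 3.7, 2.9, 0.8, 2.5]
mu_hat, sigma_hat = normal_mle_newton(x, mu0=2.0, sigma0=1.5)
print(mu_hat, sigma_hat)  # closed form: mean(x), sqrt(mean((x - mean)^2))
```

The iterates agree with the closed-form MLE, which gives a convenient check of the gradient and Hessian formulas.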
Jun Yan (University of Connecticut) STAT 5361 30 / 125 Non-Stochastic Optimization Multivariate Problems Newton-like methods For MLE, the generalization is straightforward. Simply use θt+1 = θt + I (θt)−1l ′(θt). These methods in each iteration approximate g′′(xt) by some p × p matrix M t = M t(xt) to get xt+1 = xt −M−1t g′(xt). (1) Of course, M t should be easier to compute than g′′(xt). Fisher scoring is a Newton-like method, because it uses M t = −I (θt) in place of −l ′′(θt). Jun Yan (University of Connecticut) STAT 5361 31 / 125 Non-Stochastic Optimization Multivariate Problems Steepest ascent methods In (1), set M t = −α−1t I p, so that xt+1 = xt + αtg′(xt), where αt > 0 is the step size at t which can shrink to ensure ascent. If at step t, the original step turns out to be downhill, the updating can backtrack by halving αt . Jun Yan (University of Connecticut) STAT 5361 32 / 125 Non-Stochastic Optimization Multivariate Problems Discrete Newton and Fixed-point methods In fixed-point methods, usually M t is fixed to be M , e.g., g′′(x0). This amounts to applying univariate scaled fixed-point iteration to each component. In discrete Newton methods, g′′(xt) is approximated as follows. Compute matrix M t with the (i, j)th entry equal to M t(i, j) = g′i(xt + ht(i, j)ej)− g′i(xt) ht(i, j) for some constant ht(i, j) 6= 0, where g′i(x) = ∂g(x) ∂xi . If ht(i, j) is small, then M t(i, j) ≈ ∂ 2g(xt) ∂xi∂xj and so M t approximates g′′(xt). Jun Yan (University of Connecticut) STAT 5361 33 / 125 Non-Stochastic Optimization Multivariate Problems However, since g′′(xt) is symmetric, instead of using M t directly, use the symmetric (M t +MTt )/2 as the approximation of g′′(xt). No need to calculate second order derivatives Inefficient: all p2 entries have to be updated each time. Jun Yan (University of Connecticut) STAT 5361 34 / 125 Non-Stochastic Optimization Multivariate Problems Quasi-Newton methods These methods aim to achieve the following goals. 
Each step satisfies the secant condition g′(xt+1)− g′(xt) = M t+1(xt+1 − xt). No need to compute second order derivatives. Maintain symmetry of M t . Aim to update M t efficiently. Jun Yan (University of Connecticut) STAT 5361 35 / 125 Non-Stochastic Optimization Multivariate Problems There is a unique rank-one method that satisfies the secant condition and maintains the symmetry of M t : after getting xt+1 = xt −M−1t g′(xt) compute zt = xt+1 − xt , yt = g′(xt+1)− g′(xt), vt = yt −M tzt , ct = 1 vTt zt . Then update M t+1 = M t + ctvtvTt . Note M t+1 −M t is of rank one. We can verify the secant condition is satisfied by multiplying both sides by zt . Jun Yan (University of Connecticut) STAT 5361 36 / 125 Non-Stochastic Optimization Multivariate Problems There are several rank-two method satisfying the secant condition and the symmetry requirement. The Broyden class updates M t as follows. After getting xt+1 and zt , yt as above, compute dt = yt zTt yt − M tztzTt M tzt . Then update M t+1 = M t − M tzt(M tzt) T zTt M tzt + yty T t zTt yt + δt(zTt M tzt)dtdTt , where δt is a parameter. A popular method to solve unconstrained nonlinear optimization problems is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, with δt ≡ 0. In R, optim can be called to minimize a function, using the BFGS method or its variant, the “L-BFGS-B” method. “L” stands for limited memory, and “B” stands for box constraint. Jun Yan (University of Connecticut) STAT 5361 37 / 125 Non-Stochastic Optimization Multivariate Problems Approximating Hessian for MLE For MLE, the Hessian l ′′(θ̂MLE) is critical because it provides estimates of standard error and covariance. Quasi-Newton methods may provide poor approximation because it is based on the idea of using poor approximation of the Hessian to find the root for l ′(θ) = 0. Use the discrete multivariate Newton method with M t(i, j) = l ′i(θt + hijej)− l ′i(θt − hijej) 2hij . 
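The symmetric rank-one update described above can be checked numerically: after updating, M must satisfy the secant condition M z = y and remain symmetric. A minimal sketch with a made-up 2 × 2 example (all values are for illustration only):

```python
def sr1_update(M, z, y):
    """Symmetric rank-one update: M+ = M + c v v^T with
    v = y - M z and c = 1/(v^T z), so that M+ z = y."""
    p = len(z)
    Mz = [sum(M[i][j] * z[j] for j in range(p)) for i in range(p)]
    v = [y[i] - Mz[i] for i in range(p)]
    c = 1.0 / sum(v[i] * z[i] for i in range(p))
    return [[M[i][j] + c * v[i] * v[j] for j in range(p)] for i in range(p)]

# made-up example: current approximation M, step z, gradient change y
M = [[2.0, 0.0], [0.0, 3.0]]
z = [1.0, -1.0]
y = [2.5, -2.0]
M_new = sr1_update(M, z, y)
Mz = [sum(M_new[i][j] * z[j] for j in range(2)) for i in range(2)]
print(M_new, Mz)  # M_new z equals y, and M_new stays symmetric
```

Multiplying M_new by z reproduces y exactly, which is the algebraic verification mentioned in the notes.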
Jun Yan (University of Connecticut) STAT 5361 38 / 125 Non-Stochastic Optimization Multivariate Problems Gauss-Newton method Example: nonlinear regression. In many cases, we want to maximize g(θ) = − n∑ i=1 (yi − fi(θ))2, where each fi(θ) is differentiable. Recall that for linear regression: yi = xTi θ + ε, i = 1, . . . ,n the LS estimator of θ maximizes g(θ) with fi(θ) = xTi θ and equals θ̂ = (XTX)−1XTy where X = xT1... xTn , y = y1... yn . Jun Yan (University of Connecticut) STAT 5361 39 / 125 Non-Stochastic Optimization Multivariate Problems The Gauss-Newton method applies a similar idea to the nonlinear case. Let θ∗ be the (unknown) maximizer of g(θ). Given any candidate θ, the function h(u) = − n∑ i=1 (yi − fi(θ + u))2. is maximized by β = θ∗ − θ. Of course, β is unknown. However, if θ is near θ∗, then β ≈ 0, so by Taylor expansion of h(u), it may be close to the maximizer of − n∑ i=1 (yi − fi(θ)− f ′i (θ)Tu)2. Jun Yan (University of Connecticut) STAT 5361 40 / 125 Non-Stochastic Optimization Multivariate Problems Treat yi − fi(θ) the same way as yi in the linear regression, and f ′i (θ) the same way as xi , then θ∗ − θ ≈ (ATA)−1ATz where z = z(θ) = y1 − f1(θ)... yn − fn(θ) , A = A(θ) = f ′ 1(θ)T ... f ′n(θ)T . The update rule thus is θt+1 = θt + (ATt At)−1ATt zt , where zt = z(θt) and At = A(θt). In R, the Gauss-Newton method is the default method of the function nls. Jun Yan (University of Connecticut) STAT 5361 41 / 125 Non-Stochastic Optimization Multivariate Problems Practical Issues Initial value Convergence criteria: based on a distance measures for vectors Multiple local optima I Multiple initial values I Compare the objective function value at convergences Trade-off between efficiency and robustness Jun Yan (University of Connecticut) STAT 5361 42 / 125 Combinatorial Optimization Combinatorial Optimization Motivation: There are hard optimization problems for which most methods including what we have discussed so far are useless. 
These problems are usually combinatorial in nature. Maximization requires a discrete search of a very large space. Problem: maximize f(θ) with respect to θ = (θ1, . . . , θp), where θ ∈ Θ and Θ consists of N elements. I Each θ ∈ Θ is called a candidate solution. I Usually N is a very large number that depends on the problem size p. I The difficulty of a particular size-p problem can be characterized by the number of operations required to solve it in the worst-case scenario using the best known algorithm. I Suppose a problem is O(p!). If it requires 1 minute to solve for p = 20, it would take 12.1 years for p = 25 and 207 million years for p = 30.

What can we do: We have to make some sacrifices: we abandon global algorithms and focus on algorithms that can find a good local solution. Heuristic strategies. What we will learn: Local search methods; Simulated annealing; Genetic algorithms.

Motivating Example: Subset regression

Suppose x1, . . . , xp are predictors that can be used to construct a linear model for Y . Each candidate model is Y = ∑_{j=1}^{s} β_{ij} x_{ij} + ε, where 1 ≤ i1 < i2 < . . . < is ≤ p, s ≥ 0. The Akaike information criterion (AIC) for the candidate model is AIC = n log(RSS/n) + 2s, where n is the sample size and RSS is the residual sum of squares of the model. The best model is the one that minimizes AIC. To parameterize the model, set θ = (θ1, . . . , θp), such that θi = 1 if xi is included as a predictor, and 0 otherwise. The set Θ of all possible θ has 2^p values.

Local search

First, define a neighborhood N(θ) for each θ, so that it contains candidate solutions that are “near” θ, obtainable by a small number of changes to the current θ. For example, if Θ = {θ = (θ1, . . .
, θp) : each θi = 0, 1}, one may define N (θ) to be the set of θ′ which are different from θ in at most one coordinate. After neighborhoods are defined, at iteration t, choose θt+1 from N (θt) according to a certain rule. For example, steepest ascent: θt+1 = arg maxθ∈N (θt) f (θ); ascent algorithm: θt+1 ∈ N (θt) uphill from θt , i.e., f (θt+1) ≥ f (θt). Jun Yan (University of Connecticut) STAT 5361 46 / 125 Combinatorial Optimization Local Search To avoid trapping into a local maximum, two often used variants are random-starts local search: repeatedly run an ascent algorithm to termination from a large number of randomly chosen starting points. steepest ascent/mildest descent: set to the least unfavorable θt+1 ∈ N (θt); I if θt is a local maximum θt+1 is the one with least decrease; I otherwise, θt+1 is the one with the largest increase. Jun Yan (University of Connecticut) STAT 5361 47 / 125 Combinatorial Optimization Local Search To select a linear model to minimize AIC (or maximize −AIC), randomly select a set of predictors at iteration t, decrease the AIC by adding a predictor to the set of selected predictors or deleting a predictor that has been selected; continue the iterations until the AIC can not be decreased; repeat Steps 1 and 2 many times, and choose the set of predictors at termination that has the lowest AIC. Jun Yan (University of Connecticut) STAT 5361 48 / 125 Combinatorial Optimization Local Search A salesman must visit each of p cities exactly once and return to his starting city, using the shortest total travel distance. A candidate solution is θ = (i1, i2, . . . , ip, i1) where i1, . . . , ip is a permutation of 1, . . ., p. The objective function to be minimized is f (θ) = d(i1, i2) + d(i2, i3) + · · ·+ d(ip−1, ip) + d(ip, i1). There are (p − 1)!/2 all possible routes, since the point of origin and direction of travel are arbitrary. For a traveling plan to visit 20 cities, that amounts to more than 6× 1016 possible routes. 
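A random-starts local search of the kind described above can be sketched for a small traveling salesman instance, using the “exchange two cities” neighborhood. The city coordinates and all names here are made up for illustration:

```python
import itertools, random

def tour_length(tour, d):
    """f(theta) = d(i1,i2) + ... + d(ip,i1), closing the loop."""
    return sum(d[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_swap_descent(tour, d):
    """Steepest-descent local search: move to the best neighboring tour
    (two cities exchanged) until no neighbor improves."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        best = tour_length(tour, d)
        for i, j in itertools.combinations(range(len(tour)), 2):
            cand = list(tour)
            cand[i], cand[j] = cand[j], cand[i]
            if tour_length(cand, d) < best:
                best, tour, improved = tour_length(cand, d), cand, True
    return tour

def random_starts(d, n_starts=20, seed=5361):
    """Random-starts local search: run descent from many starting tours
    and keep the best termination point."""
    rng = random.Random(seed)
    p = len(d)
    starts = [list(range(p))] + [rng.sample(range(p), p) for _ in range(n_starts - 1)]
    best_tour, best_len = None, float("inf")
    for s in starts:
        t = two_swap_descent(s, d)
        if tour_length(t, d) < best_len:
            best_tour, best_len = t, tour_length(t, d)
    return best_tour, best_len

# made-up planar cities; Euclidean distance matrix
pts = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 5), (1, 1)]
d = [[((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 for xb, yb in pts] for xa, ya in pts]
tour, length = random_starts(d)
print(tour, length)
```

Each descent terminates at a local minimum; restarting from many random tours is the standard remedy for getting trapped, exactly as described above.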
Jun Yan (University of Connecticut) STAT 5361 49 / 125 Combinatorial Optimization Local Search One could define N (θ) as the set of sequences that only differ from θ at two entries. In other words, each sequence in N (θ) is obtained by exchanging two entries of θ. For example, if θ = (1, 2, 5, 4, 6, 3, 1) then (1, 2, 3, 4, 6, 5, 1) is a neighbor of θ, because it only has 3 and 5 exchanged; while (1, 2, 4, 3, 6, 5, 1) is not a neighbor of θ, because it has 5, 4, and 3 rotated. Jun Yan (University of Connecticut) STAT 5361 50 / 125 Combinatorial Optimization Simulated Annealing Simulated annealing In most cases, the simulated annealing algorithm can be thought of as a randomized local search algorithm. It uses a “temperature” parameter to control the randomness of the search. The algorithm starts with a high temperature and cools down gradually so that a global optimum may be reached. This is analogous to the annealing of metal or glass. In the following description, a global minimum of f is being searched. The algorithm is run in stages, such that for the iterations within a stage, the temperature is a constant τj . Jun Yan (University of Connecticut) STAT 5361 51 / 125 Combinatorial Optimization Simulated Annealing Suppose iteration t belongs to stage j. 1. Sample a candidate θ∗ ∈ N (θ) according to a proposal density gt(θ |θt); for different t, the proposal density gt can be different. 2. Let ∆ = f (θ∗)− f (θt). a) If ∆ ≤ 0, then set θt+1 = θ∗. b) If ∆ > 0, then set θt+1 = θ∗ with probability e−∆/τj , and θt+1 = θt otherwise. This can be done as follows. Sample U ∼ Unif(0, 1). Then θt+1 = { θ∗ if U ≤ e−∆/τj θt otherwise. 3. Repeat steps 1 and 2 a total of mj times. 4. Update τj+1 = α(τj), mj+1 = β(mj) and move to stage j + 1, where α and β are two deterministic functions that govern how to cool down the temperature and how long to stay at a given temperature. 
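The staged scheme in steps 1-4 can be sketched as follows, here minimizing a made-up multimodal function over a grid of integers; the proposal, the cooling function α(τ) = 0.9τ, and the stage length m are my illustrative choices:

```python
import math, random

def anneal(f, x0, neighbor, tau0=1.0, alpha=0.9, m=50, stages=40, seed=1):
    """Simulated annealing for minimization: within stage j the
    temperature is a constant tau_j; an uphill candidate is accepted
    with probability exp(-delta / tau_j)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    tau = tau0
    for _ in range(stages):
        for _ in range(m):
            cand = neighbor(x, rng)            # step 1: propose from N(x)
            delta = f(cand) - fx               # step 2: Delta = f* - f(x_t)
            if delta <= 0 or rng.random() <= math.exp(-delta / tau):
                x, fx = cand, f(cand)
                if fx < fbest:
                    best, fbest = x, fx        # track the best visited state
        tau *= alpha                           # step 4: cool down
    return best, fbest

# made-up multimodal objective over the integers 0..99
def f(k):
    return (k - 63) ** 2 / 100.0 + 2 * math.sin(k)

def neighbor(k, rng):
    return min(99, max(0, k + rng.choice([-3, -2, -1, 1, 2, 3])))

x_best, f_best = anneal(f, x0=5, neighbor=neighbor)
print(x_best, f_best)
```

Early stages (high τ) accept many uphill moves and roam widely; late stages (low τ) behave like a plain descent, which is the annealing analogy in the notes.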
Jun Yan (University of Connecticut) STAT 5361 52 / 125 Combinatorial Optimization Simulated Annealing Suppose I is an image consisting of “pixels” I (i, j), i, j = 1, . . . ,N . The image is corrupted by noise Z (i, j) and only J = I + Z is observed. To reconstruct I , one way is to minimize a function f (I ) = ∑ i,j |J (i, j)− I (i, j)|2 + K (I ), where K (I ) is a function that has large values if I has many “irregularities”. The idea is that, as long as the noise is not too strong, the real image should be similar to J ; on the other hand, irregularities observed in J are likely to be due to noise and should be reduced. One way to use the simulated annealing to minimize f (I ) is as follows. In each iteration, only one pixel of I can be updated. That means two I and I ′ are neighbors only when they are different at just one pixel. Then at each iteration, choose a pixel I (i, j) and select a candidate value for it. Update I (i, j) according to the rule in step 2 and move to the next iteration. Jun Yan (University of Connecticut) STAT 5361 53 / 125 Combinatorial Optimization Simulated Annealing Some guidelines: R function optim can implement some simple versions of simulated annealing; the temperature τj should slowly decrease to 0; the number mj of iterations at each temperature τj should be large and increasing in j; reheating strategies that allow sporadic, systematic, or interactive temperature increases to prevent getting stuck in a local minimum at low temperatures can be effective. Jun Yan (University of Connecticut) STAT 5361 54 / 125 Combinatorial Optimization Genetic Algorithm Genetic algorithms First, each θ ∈ Θ is a string of alphabets: θ = (θ1, . . . , θC ), where each θi is a symbol from a finite set, such as {0, 1}, {0, 1, 2}, {‘a’, ‘b’, ...}. Genetic algorithms regard the maximization of f (θ) over Θ as a process of natural selection, with f (θ) being a measure of fitness and θ the genetic code, or “chromosome”. 
It assumes that fitness is a result of some “good” pieces of θ and by inheriting these pieces plus some mutations fitness is enhanced. In a genetic algorithm, at each iteration t, at least two candidate solutions θt,1, . . . , θt,P have to be tracked. Each iteration consists of several steps. Jun Yan (University of Connecticut) STAT 5361 55 / 125 Combinatorial Optimization Genetic Algorithm 1. Selection. Randomly select from θt,1, . . . , θt,P to form a set of pairs. The selection should be based on a fitness function φ(θ). In general, larger values of f (θ) result in larger values of φ(θ), and higher chance for θ to be selected. There are many selection mechanisms: a) select one parent θ with probability proportional to φ(θ) and the other parent completely at random; b) select each parent independently with probability proportional to φ(θ); φ must be selected carefully: using φ(θt,i) = f (θt,i) may result in rapid convergence into a local maximum. A common choice is φ(θt,i) = 2ri P(P + 1) where ri is the rank of f (θt,i) in f (θt,1), . . . , f (θt,P). Jun Yan (University of Connecticut) STAT 5361 56 / 125 Combinatorial Optimization Genetic Algorithm 2. Breeding. For each pair, (θa,θb), generate one or more “offspring” θ′ = c(θa,θb) ∈ Θ, where c is a random operator. Typically, c is a “crossover”. If θa = (θa1, . . . , θaC ), θb = (θb1, . . . , θbC ), then a crossover works as follows, a) randomly choose a position 1 ≤ d ≤ C ; and b) combine (θa1, . . . , θad) and (θb,d+1, . . . , θbC ) to form θ′ = (θa1, . . . , θad , θb,d+1, . . . , θbC ). If necessary, also take θ′′ = (θa,d+1, . . . , θaC , θb1, . . . , θbd) to be another offspring. Jun Yan (University of Connecticut) STAT 5361 57 / 125 Combinatorial Optimization Genetic Algorithm 3. Mutation. Make some random modifications to each offspring. Typically, if an offspring chromosome is θ = (θ1, . . . , θC ) then for i = 1, . . . 
,C , with a small probability 0 < p < 1 (mutation rate), change θi to a different value. The offspring produced at iteration t are taken as the candidate solutions for iteration t + 1. Jun Yan (University of Connecticut) STAT 5361 58 / 125 Combinatorial Optimization Genetic Algorithm Initialization. Usually the first generation consists of completely random individuals. Large values of the size of a generation, P, are preferred. For binary encoding of θ, one suggestion is to have C ≤ P ≤ 2C , where C is the chromosome length. In most real applications, population sizes have ranged between 10 and 200. Mutation rates are typically very low, in the neighborhood of 1%. Jun Yan (University of Connecticut) STAT 5361 59 / 125 Combinatorial Optimization Genetic Algorithm Termination. A genetic algorithm is usually terminated after a maximum number of iterations. Alternatively, the algorithm can be stopped once the genetic diversity in the current generation is sufficiently low. Jun Yan (University of Connecticut) STAT 5361 60 / 125 Combinatorial Optimization Genetic Algorithm In many problems, like the traveling salesman problem, it is natural to write θ as a permutation of 1, 2, . . . , p. Standard crossover usually produces invalid sequences For example, if θa = {7, 5, 2, 6, 3, 1, 9, 4, 8} θb = {9, 1, 2, 3, 8, 6, 7, 5, 4} and if the crossover point is between 2nd and 3rd positions, then it produces {7, 5, 2, 3, 8, 6, 7, 5, 4} an invalid traveling route. Jun Yan (University of Connecticut) STAT 5361 61 / 125 Combinatorial Optimization Genetic Algorithm A remedy is order crossover: pick a number of positions from one parent and get the values at those positions, say i1, . . . , is. Next identify those values from the 2nd parent. Suppose the set of values are found at j1 < j2 < . . . < js. Put i1 at j1, i2 at j2, . . . , is at js while keeping the values of the 2nd on other locations unchanged. 
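The order-crossover rule just described can be sketched in a few lines (the function name is mine; positions are 1-based as in the notes):

```python
def order_crossover(pa, pb, positions):
    """Order crossover: take the values of parent `pa` at the chosen
    (1-based) positions, locate those values in parent `pb`, and write
    them back into `pb` at those locations, in position order."""
    vals = [pa[k - 1] for k in sorted(positions)]      # i1, ..., is
    slots = sorted(pb.index(v) for v in vals)          # j1 < ... < js
    child = list(pb)
    for v, j in zip(vals, slots):
        child[j] = v                                    # put i_k at j_k
    return child

pa = [7, 5, 2, 6, 3, 1, 9, 4, 8]
pb = [9, 1, 2, 3, 8, 6, 7, 5, 4]
c1 = order_crossover(pa, pb, (4, 6, 7))
c2 = order_crossover(pb, pa, (4, 6, 7))
print(c1)  # [6, 1, 2, 3, 8, 9, 7, 5, 4], i.e. (612389754)
print(c2)  # [3, 5, 2, 6, 7, 1, 9, 4, 8], i.e. (352671948)
```

Because the child is a rearrangement of a valid permutation, it is always a valid route, unlike the standard crossover.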
Example: Consider parents (752631948) and (912386754) and random positions (4, 6, 7). The offspring are (612389754) and (352671948). The drawback of this operation is that it destroys some important structures of the parent routes, in particular, the links.

Edge-recombination crossover uses a different idea. It generates an offspring whose edges belong to those of the parent routes. An edge here means a link into or out of a city in a route. For example, the above two parents have the following links (the order of the numbers in each link is unimportant): (1, 2), (1, 3), (1, 9), (2, 3), (2, 5), (2, 6), (3, 6), (3, 8), (4, 5), (4, 8), (4, 9), (5, 7), (6, 7), (6, 8), (7, 8). First, choose one of the initial cities of the parents as the initial city of the offspring. At the k-th step, if i1, . . . , ik have been chosen as the first k cities on the route, then choose among the cities that are linked to ik and different from i1, . . . , ik−1 as the (k + 1)st city of the offspring.

EM and MM Algorithms

EM Optimizations

What it is: for inference when there is missing information I incomplete observations I auxiliary parameters that are unobserved but can facilitate inference on the main parameter. I Some important applications of the EM algorithm are in problems where one has what we might call pseudo missing data. We never had any chance of obtaining such data, but we could pretend they are missing and use the EM algorithm to facilitate the computation of MLEs.
Why it is important:
- widely useful;
- simple to implement;
- numerically stable.

Example: finite mixture. Suppose a population consists of several sub-populations, the k-th following a distribution p(x | θk) and having population fraction qk. Typically q = (q1, ..., qK) and θ = (θ1, ..., θK) are unknown, and the goal is to estimate them. Let x1, ..., xn be a sample. If for each xi we knew that it comes from the zi-th sub-population, then, letting x = (x1, ..., xn) and z = (z1, ..., zn), the likelihood function of (q, θ) would be
L(q, θ | x, z) = ∏_{i=1}^n q_{z_i} p(xi | θ_{z_i}).

However, in many cases z is unobservable:
- cluster analysis;
- the "sub-populations" may be constructs that facilitate inference or decision making, and so are nonexistent in reality.
The likelihood function then becomes
L(q, θ | x) = ∑_z L(q, θ | x, z).
The question is how to compute the MLE of q and θ efficiently. In this example, y = (x, z) is the complete data. When only x is observable, z is called the missing data.

Example: censored observations. Suppose a population follows a distribution p(y | θ), where θ is an unknown parameter, and a random sample is collected from p(y | θ). However, when an observation yi is greater than a cut-off C, it is censored and recorded as C, as is common in clinical trials; in that case we only know yi ≥ C but not its exact value. If yi ≤ C, then yi is fully observed. The observed data are therefore x1 = min(y1, C), ..., xn = min(yn, C). Here y = (y1, ..., yn) is the complete data and x = (x1, ..., xn) is the observed data. The question is how to estimate θ based on x.

Example: latent time series. Suppose Z1, . . . ,Zn and X1, . . .
,Xn are time series related by Xk = f(Zk | θ), where the transformation f is parametrized by θ. Zk may include not only variables of interest, but also some "noise" that interacts with those variables. Suppose the joint distribution of Z1, ..., Zn has the parametric form h(z1, ..., zn | γ), with the values of θ and γ both unknown. If X1, ..., Xn are observed but Z1, ..., Zn are only partially observed or completely hidden, how do we estimate θ and γ?

The Hidden Markov Model (HMM) is a typical case. Let S1, ..., Sn be a discrete Markov chain with a finite number of states. Conditional on S1, ..., Sn, the observations X1, ..., Xn are independent, such that, given Sk = s, Xk ~ N(μ_s, σ_s²).

[Figure: graphical model of the HMM; the hidden chain S1 → S2 → S3 → S4 → ..., with each state Sk emitting the observation Xk.]

In this case Zk = (Sk, Wk), where W1, ..., Wn ~ N(0, 1) are independent of S1, ..., Sn, such that Xk = μ_{S_k} + σ_{S_k} Wk. Typically μk, σk and the transition probabilities of Sk are unknown and only the Xk are observed.

Example: positron emission tomography. Positron emission tomography (PET) is a tool to study the metabolic activity in organs. To generate a PET scan, some radioactive material is administered into the organ. The emissions of the material are then recorded using a PET scanner, which consists of an array of detectors surrounding the patient's body. The region of interest is divided into "voxels". The number of photons Yij emitted from voxel i and received by detector j is assumed to follow a Poisson distribution,
Yij ~ Poisson(θi aij),
where the coefficient aij can be determined accurately beforehand, and θi is the emission density of the radioactive material in voxel i, which quantifies the metabolic activity there.
If we could observe Yij for each pair of voxel and detector, the θi's could be estimated by MLE. In reality, however, each detector j receives photons from all the voxels, so only the sum Xj = ∑_i Yij is observable. The goal is to use X = (Xj) instead of Y = (Yij) to estimate θ = (θi).

General formulation. Suppose the complete data Y ~ pY(y | θ), while the observed data is X = M(Y), where M is a many-to-one mapping. Then
pX(x | θ) = ∫_{M(y)=x} pY(y | θ) dy,
where the left-hand side is the observed-data likelihood L(θ | x) and the integrand is the complete-data likelihood L(θ | y). If only X = x is observed, then the MLE θ̂ satisfies L′(θ̂ | x) = 0, i.e.,
∫_{M(y)=x} L′(θ̂ | y) dy = 0.

Write the log-likelihood l(θ | y) = ln L(θ | y), so that
[L(θ | y)]′ = [e^{l(θ | y)}]′ = [l(θ | y)]′ e^{l(θ | y)} = [l(θ | y)]′ L(θ | y).
Therefore ∫_{M(y)=x} L′(θ̂ | y) dy = 0 is the same as
∫_{M(y)=x} [l(θ̂ | y)]′ L(θ̂ | y) dy = 0.
In principle, Newton's method can be used to find a solution. However, since θ̂ appears in two places in the integrand, the implementation is often messy and, because of that, numerically unstable.

We may try to "decouple" the two θ's in the integrand to get simpler iterations. An iterative procedure could go as follows. At step t, given θt, find θt+1 to solve
∫_{M(y)=x} [l(θt+1 | y)]′ L(θt | y) dy = 0,
or, equivalently, to optimize the function
θ ↦ ∫_{M(y)=x} l(θ | y) L(θt | y) dy.
Since θt is different at each step, this function changes from step to step. If the θt converge, then the limit is a solution to the original likelihood equation.

The explicit numerical integration in the above function is inconvenient.
To remove it, note that, since θt is fixed, the above way of finding θt+1 is equivalent to optimizing
∫_{M(y)=x} l(θ | y) [L(θt | y) / L(θt | x)] dy.
For any θt,
L(θt | y) / L(θt | x) = pY(y | θt) / pX(x | θt) = p_{Y|X}(y | x, θt),
the conditional density of Y given X = x under pY(y | θt). The integral is then simply E[l(θ | Y) | x, θt]. It turns out that, to find a suitable θt+1, we need to maximize this function.

EM (Expectation-Maximization) algorithm. Define
Q(θ | θ′) = E[l(θ | Y) | x, θ′] = E[ln L(θ | Y) | x, θ′].
Initialize θ0. At iteration t = 1, 2, ..., with θt given, set θt+1 to maximize Q(θ | θt). The "E step" of the algorithm refers to the computation of Q(θ | θt), and the "M step" refers to its maximization. Stopping criteria are usually based on |θt+1 − θt| or |Q(θt+1 | θt) − Q(θt | θt)|.

If Y = (X, Z), where Z is the missing data, then, given X = x, the only unknown part of Y is Z, so
Q(θ | θ′) = E[ln L(θ | Y) | x, θ′] = E[ln L(θ | x, Z) | x, θ′]
= ∑_z p(z | x, θ′) ln p(x, z | θ) if Z is discrete,
= ∫ p(z | x, θ′) ln p(x, z | θ) dz if Z is continuous.

Gaussian mixture clustering. Suppose a population is a mixture of Gaussian distributions N(μk, Σk) with population fractions qk, k = 1, ..., K. The goal is to estimate the μk, Σk, and qk. Let x1, ..., xn ∈ R^d be an i.i.d. sample from the population, and let zi be the index of the Gaussian distribution from which xi is sampled. Suppose z = (z1, ..., zn) cannot be observed, so the observed data are x = (x1, ..., xn).

Suppose at iteration t of the EM algorithm one has obtained θt = (μt1, ..., μtK, Σt1, ..., ΣtK, qt1, ..., qtK).
Then, since the (xi, zi) are independent and the zi are discrete, for any candidate θ = (μ1, ..., μK, Σ1, ..., ΣK, q1, ..., qK),
Q(θ | θt) = ∑_z p(z | x, θt) ln p(x, z | θ) = ∑_{i=1}^n ∑_{k=1}^K w_{ik} ln p(zi = k, xi | θ),
where w_{ik} = p(zi = k | xi, θt).

To maximize Q(θ | θt), note that w_{ik} does not involve θ, so it should be computed first. By Bayes' rule,
w_{ik} = p(zi = k, xi | θt) / p(xi | θt) = p(zi = k, xi | θt) / ∑_{s=1}^K p(zi = s, xi | θt),
with
p(zi = k, xi | θt) = p(zi = k | θt) p(xi | zi = k, θt) = q_{tk} φ(xi | μ_{tk}, Σ_{tk}),   (1)
where φ(x | μ, Σ) is the density of N(μ, Σ) at x. Once the w_{ik} are computed,
Q(θ | θt) = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln p(zi = k, xi | θ),
which looks a lot simpler.

As in (1), p(zi = k, xi | θ) = qk φ(xi | μk, Σk). Since
φ(xi | μk, Σk) = (2π)^{−d/2} |Σk|^{−1/2} exp{ −(1/2)(xi − μk)′ Σk^{−1} (xi − μk) },
we have
Q(θ | θt) = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk] − (1/2) ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln|Σk| − (1/2) ∑_{k=1}^K ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk)
= I1 − I2/2 − I3/2,
where
I1 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk],
I2 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln|Σk|,
I3 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk).

First, only I3 contains the μk, and it is a sum of K quadratic forms, each involving a single μk:
I3k = ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk).
We only need to find the μk minimizing each I3k. By the property of the sample mean, μk must be the mean of the weighted sample x1, ..., xn, each xi having weight w_{ik}:
μk = ∑_{i=1}^n w_{ik} xi / ∑_{i=1}^n w_{ik}.
Next, only I2 and I3 contain the Σk, and I2 + I3 is the sum of K terms, each involving a single Σk:
Sk = ∑_{i=1}^n w_{ik} ln|Σk| + ∑_{i=1}^n w_{ik} (xi − μk)′ Σk^{−1} (xi − μk),
so we only need to find the Σk minimizing each Sk. Given that μk equals the weighted mean of the xi, to minimize Sk, Σk must be the variance of the weighted sample x1, ..., xn:
Σk = ∑_{i=1}^n w_{ik} (xi − μk)(xi − μk)′ / ∑_{i=1}^n w_{ik}.

Finally, only I1 contains the qk. Since
I1 = ∑_{k=1}^K ∑_{i=1}^n w_{ik} ln[(2π)^{−d/2} qk]
= −(d/2) ln(2π) ∑_{k=1}^K ∑_{i=1}^n w_{ik} + ∑_{k=1}^K ( ∑_{i=1}^n w_{ik} ) ln qk,
it suffices to maximize
∑_{k=1}^K Wk ln qk, where Wk = ∑_{i=1}^n w_{ik}.

The q1, ..., qK must satisfy q1 + ... + qK = 1, so the maximization can be carried out by solving ∂L/∂qk = 0, k = 1, ..., K, where the Lagrangian is
L(q1, ..., qK) = ∑_{k=1}^K Wk ln qk − λ( ∑_{k=1}^K qk − 1 ),
with λ a Lagrange multiplier. This gives
qk = Wk / ∑_{k=1}^K Wk.
Since ∑_{k=1}^K w_{ik} = 1 for each i, qk = (1/n) ∑_{i=1}^n w_{ik}.

The above steps constitute a basic EM Gaussian clustering algorithm, summarized as follows. Given θt = (μt1, ..., μtK, Σt1, ..., ΣtK, qt1, ..., qtK), for k = 1, ..., K:
- compute w_{ik} = p(zi = k | xi, θt), i = 1, ..., n;
- set q_{t+1,k} equal to the mean of w_{1k}, ..., w_{nk};
- set μ_{t+1,k} and Σ_{t+1,k} equal to the weighted average and weighted covariance of x1, ..., xn, with weights w_{1k}, ..., w_{nk}, respectively.
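The clustering updates just summarized can be sketched in NumPy; a minimal illustration (the function name, the deterministic initialization, and the small ridge added to each covariance are our own choices, not from the slides):

```python
import numpy as np

def em_gmm(x, K, n_iter=100):
    """Basic EM for a Gaussian mixture, following the slide updates:
    E-step computes responsibilities w_ik, M-step sets q, mu, Sigma to
    weighted fractions, means, and covariances.  x has shape (n, d)."""
    n, d = x.shape
    q = np.full(K, 1.0 / K)
    # initialize means at K data points spread along the first coordinate
    idx = np.argsort(x[:, 0])[np.linspace(0, n - 1, K).astype(int)]
    mu = x[idx].copy()
    Sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: w_ik proportional to q_k * phi(x_i | mu_k, Sigma_k), in logs
        logw = np.empty((n, K))
        for k in range(K):
            diff = x - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
            logw[:, k] = np.log(q[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        logw -= logw.max(axis=1, keepdims=True)    # stabilize before exponentiating
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted fractions, means, covariances
        Wk = w.sum(axis=0)
        q = Wk / n
        mu = (w.T @ x) / Wk[:, None]
        for k in range(K):
            diff = x - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / Wk[k] + 1e-6 * np.eye(d)
    return q, mu, Sigma
```

On two well-separated synthetic clusters the recovered means land near the true centers and the fractions sum to one by construction.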
For high-dimensional data there are many variants of the algorithm that put constraints on the Σk to improve computational and estimation efficiency. The strongest constraint (and computationally the simplest) is Σ1 = ... = ΣK = σ²I, with σ² the only covariance parameter. Others include:
1. Σk = σk² I, k = 1, ..., K, with the σk being the parameters;
2. Σk diagonal, equal to or different from each other;
3. Σk = σk² Σ, with the parameters being the σk² and Σ.

If Newton's method were used to directly maximize the log-likelihood l(θ | x) = ln L(θ | x), one would have to differentiate
l(θ | x) = ∑_{i=1}^n li(θ),
where, for every i = 1, ..., n,
li(θ) = ln p(xi | θ) = ln{ ∑_{z_i} p(xi, zi | θ) } = ln{ ∑_{k=1}^K p(xi | zi = k, θ) p(zi = k | θ) }
= ln{ ∑_{k=1}^K (2π)^{−d/2} qk |Σk|^{−1/2} e^{−(1/2)(xi − μk)′ Σk^{−1} (xi − μk)} }.
The iterations are apparently quite involved.

Computation. The R package mclust provides a large number of clustering methods, including EM-based methods. Use install.packages("mclust") to install it on your computer if it is not yet installed. To use it, first execute library("mclust"), which loads the functions in the package into the R session.

- mclustBIC computes the Bayesian information criterion for different Gaussian mixture models, characterized by the number of clusters and the constraints on the covariance matrices.
- mclustModel estimates the parameters of the best model determined via mclustBIC.
- mclustModelNames lists the names of the models used in the mclust package.
- mclust2Dplot plots 2-D data given the parameters of a Gaussian mixture model for the data.
Some MATLAB routines, such as EM-GM.m, can perform some of the functions of the R package.

EM for Exponential Families

Many important parametric distributions belong to an exponential family, which has the form
pY(y | θ) = c1(y) c2(θ) exp{θ′ s(y)},
where θ is a vector of parameters and s(y) a vector of sufficient statistics.

1. The normal densities N_d(θ, Σ) with Σ fixed form an exponential family, because
pY(y | θ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(y − θ)′ Σ^{−1} (y − θ) }
= [(2π)^{−d/2} e^{−y′Σ^{−1}y/2}] · [|Σ|^{−1/2} e^{−θ′Σ^{−1}θ/2}] · exp{θ′(Σ^{−1} y)},
with c1(y) = (2π)^{−d/2} e^{−y′Σ^{−1}y/2}, c2(θ) = |Σ|^{−1/2} e^{−θ′Σ^{−1}θ/2}, and s(y) = Σ^{−1} y.

2. The Poisson distribution Poisson(θ), θ > 0, has probability mass function
p(y | θ) = e^{−θ} θ^y / y!, y = 0, 1, ....
With c1(y) = 1/y!, c2(θ) = e^{−θ}, and s(y) = y, we have p(y | θ) = c1(y) c2(θ) e^{(ln θ) s(y)}, so the Poisson distributions form an exponential family with natural parameter ln θ.

3. The gamma distribution Gamma(a, r), a > 0, r > 0, has density
p(y | a, r) = e^{−y/r} y^{a−1} / (r^a Γ(a)) for y > 0, and 0 for y ≤ 0.
Indeed, with b = 1/r, θ = (a, b), and
c1(y) = 1/y, c2(θ) = c2(a, b) = b^a / Γ(a), s(y) = (ln y, −y),
we get c1(y) c2(θ) exp{θ′ s(y)} = (1/y) [b^a/Γ(a)] e^{a ln y − by} = p(y | a, r).

4. The beta distribution Beta(a, b), a > 0, b > 0, has density
p(y | a, b) = y^{a−1} (1 − y)^{b−1} / B(a, b) for 0 < y < 1, and 0 otherwise.
With θ = (a, b),
c1(y) = 1/[y(1 − y)], c2(θ) = 1/B(a, b), s(y) = (ln y, ln(1 − y)),
we get c1(y) c2(θ) exp{θ′ s(y)} = [1/(y(1 − y))] [1/B(a, b)] e^{a ln y + b ln(1−y)} = p(y | a, b).

Suppose x = M(y) is observed and we want to estimate θ.
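The exponential-family factorizations above are easy to verify numerically; a sketch for the gamma case, using only the standard library (function names are ours):

```python
import math

def gamma_pdf(y, a, r):
    """Gamma(a, r) density as given in the slides: e^{-y/r} y^{a-1} / (r^a Gamma(a))."""
    return math.exp(-y / r) * y ** (a - 1) / (r ** a * math.gamma(a))

def gamma_expfam(y, a, r):
    """Same density via c1(y) c2(theta) exp(theta . s(y)) with b = 1/r."""
    b = 1.0 / r
    c1 = 1.0 / y
    c2 = b ** a / math.gamma(a)
    theta = (a, b)
    s = (math.log(y), -y)
    return c1 * c2 * math.exp(theta[0] * s[0] + theta[1] * s[1])

for y in (0.5, 1.0, 3.7):
    assert math.isclose(gamma_pdf(y, 2.5, 1.3), gamma_expfam(y, 2.5, 1.3), rel_tol=1e-10)
```

The two routes agree to floating-point precision, confirming the factorization.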
In this setting the EM algorithm is significantly simplified, with no explicit evaluation of Q(θ | θ′). It works as follows: given θt, set θt+1 such that
E[s(Y) | θt+1] = E[s(Y) | x, θt].   (2)

To get this result, we start by evaluating
Q(θ | θt) = ∫ [ln pY(y | θ)] p(y | x, θt) dy
= ∫ [ln c1(y) + ln c2(θ) + θ′ s(y)] p(y | x, θt) dy
= ∫ [ln c1(y)] p(y | x, θt) dy + ln c2(θ) + θ′ E[s(Y) | x, θt]
= k(θt) + ln c2(θ) + θ′ E[s(Y) | x, θt],
where k(θt) does not depend on θ.

To maximize Q(θ | θt) as a function of θ, it suffices to solve
[ln c2(θ)]′ + E[s(Y) | x, θt] = 0, i.e., c2′(θ)/c2(θ) + E[s(Y) | x, θt] = 0.
The following identity is basic for exponential families: for any θ,
c2′(θ)/c2(θ) = −E[s(Y) | θ] = −∫ s(y) pY(y | θ) dy.   (3)
Assuming (3) for now, it follows that θt+1 should be set so that E[s(Y) | θt+1] = E[s(Y) | x, θt], i.e., as in (2).

To get (3), notice that ∫ pY(y | θ) dy = 1 for any θ, i.e.,
c2(θ) ∫ c1(y) exp{θ′ s(y)} dy = 1.
Then ln c2(θ) = −ln{ ∫ c1(y) exp{θ′ s(y)} dy }. Differentiating both sides gives
c2′(θ)/c2(θ) = −∫ c1(y) s(y) exp{θ′ s(y)} dy / ∫ c1(y) exp{θ′ s(y)} dy
= −c2(θ) ∫ c1(y) s(y) exp{θ′ s(y)} dy = −∫ s(y) pY(y | θ) dy,
as claimed.

Convergence of EM

We first show that each step of the algorithm increases the observed-data log-likelihood l(θ | x) = ln pX(x | θ), that is, l(θt+1 | x) ≥ l(θt | x). Once this is done, one can say that l(θt | x) always climbs toward a local maximum of l(θ | x); in general there is no guarantee that the limit is the global maximum. EM converges more slowly than Newton's method, sometimes substantially so.
Nevertheless, the ease of implementation and the stable ascent of EM are often very attractive.

Since X = M(Y), pY(y | θ) = p(x, y | θ) = pX(x | θ) p_{Y|X}(y | x, θ), so
pX(x | θ) = pY(y | θ) / p_{Y|X}(y | x, θ).
Because x is fixed, write f(y | θ) = p_{Y|X}(y | x, θ) to simplify notation. Then
ln pX(x | θ) = ln pY(y | θ) − ln f(y | θ).

Integrate both sides with respect to the density f(y | θt). From
∫ [ln pY(y | θ)] f(y | θt) dy = ∫ [ln pY(y | θ)] p_{Y|X}(y | x, θt) dy = Q(θ | θt),
it follows that
ln pX(x | θ) = Q(θ | θt) − H(θ | θt),   (4)
where H(θ | θt) = ∫ [ln f(y | θ)] f(y | θt) dy.

To show ln pX(x | θt+1) ≥ ln pX(x | θt), note that by (4) the difference equals
[Q(θt+1 | θt) − Q(θt | θt)] − [H(θt+1 | θt) − H(θt | θt)].
By the M-step, Q(θt+1 | θt) ≥ Q(θt | θt). On the other hand, by Jensen's inequality,
H(θt+1 | θt) − H(θt | θt) = ∫ ln[ f(y | θt+1) / f(y | θt) ] f(y | θt) dy
≤ ln{ ∫ [ f(y | θt+1) / f(y | θt) ] f(y | θt) dy } = ln{ ∫ f(y | θt+1) dy } = 0,
which shows that the likelihood increases at each iteration.

Variants of EM

- E step is hard: Monte Carlo EM (MCEM).
- M step is hard: instead of choosing θt+1 to maximize Q(θ | θt), one simply chooses any θt+1 for which Q(θt+1 | θt) > Q(θt | θt); the resulting algorithm is called generalized EM, or GEM. In GEM, l(θt | x) is still increasing. An example is the ECM algorithm, which replaces the M-step of EM with several computationally simpler conditional maximization (CM) steps.
Variance Estimation

- Louis' method (1982, JRSSB)
- Supplemented EM (SEM) algorithm (Meng and Rubin, 1991, JASA)
- Oakes' method (1999, JRSSB)
- Bootstrapping
- Empirical information
- Numerical differentiation

Example: Hidden Markov Model

The hidden states s = (st), t = 1, ..., T, follow a Markov chain with initial state distribution πk, k = 1, ..., M, and transition probability matrix {a_{k,l} = Pr(st+1 = l | st = k)} (note this matrix is row stochastic). Given the state st, the observation ut is independent of the other observations and states; given state k, the observation follows a distribution bk(u). For example, bk(u | γk) can be N(μk, Σk), or discrete. The observed data are u = (ut), t = 1, ..., T.

Likelihood.
- Parameters: θ contains the πk's, a_{k,l}'s, and γk's.
- Missing (latent, unobserved) states: st, t = 1, ..., T.
- Complete data: (s, u).
- Complete-data likelihood:
π_{s_1} ∏_{t=2}^T a_{s_{t−1}, s_t} ∏_{t=1}^T b_{s_t}(ut | γ_{s_t}).
- Complete-data log-likelihood:
ln p(s, u | θ) = ln π_{s_1} + ∑_{t=2}^T ln a_{s_{t−1}, s_t} + ∑_{t=1}^T ln b_{s_t}(ut | γ_{s_t}).

E-step. Taking the expectation of ln p(s, u | θ′) with respect to p(s | u, θ):
E[ln p(s, u | θ′) | u, θ] = ∑_s p(s | u, θ) ln p(s, u | θ′)
= ∑_{k=1}^M Lk(1) ln π′k + ∑_{t=2}^T ∑_{k=1}^M ∑_{l=1}^M H_{k,l}(t) ln a′_{k,l} + ∑_{t=1}^T ∑_{k=1}^M Lk(t) ln bk(ut | γ′k),
where
Lk(t) = Pr(st = k | u) = ∑_{s: s_t = k} p(s | u),
H_{k,l}(t) = Pr(st = k, st+1 = l | u) = ∑_{s: s_t = k, s_{t+1} = l} p(s | u).
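In practice the Lk(t) and H_{k,l}(t) are computed with the forward and backward recursions introduced on the following slides. A minimal NumPy sketch, assuming the emission densities bk(ut) have been tabulated into a T × M matrix (the function and variable names are ours):

```python
import numpy as np

def forward_backward(pi, A, B):
    """Forward-backward recursions as in the slides.

    pi: (M,) initial state distribution.
    A:  (M, M) row-stochastic transition matrix, A[k, l] = a_{k,l}.
    B:  (T, M) emission matrix, B[t, k] = b_k(u_t).
    Returns L of shape (T, M) with L[t, k] = Pr(s_t = k | u), and p(u).
    """
    T, M = B.shape
    alpha = np.zeros((T, M))
    beta = np.zeros((T, M))
    alpha[0] = pi * B[0]                       # alpha_k(1) = pi_k b_k(u_1)
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)   # alpha_k(t) = b_k(u_t) sum_l alpha_l(t-1) a_{l,k}
    beta[T - 1] = 1.0                          # beta_k(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1]) # beta_k(t) = sum_l a_{k,l} b_l(u_{t+1}) beta_l(t+1)
    pu = alpha[T - 1].sum()                    # p(u) = sum_k alpha_k(t) beta_k(t), any t
    L = alpha * beta / pu
    return L, pu
```

On a tiny two-state chain with discrete emissions, p(u) from the recursion agrees with brute-force summation over all state paths, and each row of L sums to one.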
M-step. If the bk are multivariate normal, the M-step has closed-form solutions:
μk = ∑_{t=1}^T Lk(t) ut / ∑_{t=1}^T Lk(t),
Σk = ∑_{t=1}^T Lk(t)(ut − μk)(ut − μk)′ / ∑_{t=1}^T Lk(t),
a_{k,l} = ∑_{t=1}^{T−1} H_{k,l}(t) / ∑_{t=1}^{T−1} Lk(t),
πk = ∑_{t=1}^T Lk(t) / ∑_{k=1}^M ∑_{t=1}^T Lk(t).

Forward and backward probabilities. These allow efficient computation of the Lk(t)'s and H_{k,l}(t)'s, with computation of order M²T and memory requirement of order MT.
- Forward probability: αk(t) = Pr(u1, ..., ut, st = k).
- Backward probability: βk(t) = Pr(ut+1, ..., uT | st = k).

Forward recursion:
αk(1) = πk bk(u1), 1 ≤ k ≤ M,
αk(t) = bk(ut) ∑_{l=1}^M αl(t − 1) a_{l,k}, 1 < t ≤ T, 1 ≤ k ≤ M.
Backward recursion:
βk(T) = 1, 1 ≤ k ≤ M,
βk(t) = ∑_{l=1}^M a_{k,l} bl(ut+1) βl(t + 1), 1 ≤ t < T, 1 ≤ k ≤ M.

Fast computation of the E-step quantities:
Lk(t) = Pr(st = k | u) = αk(t) βk(t) / p(u),
H_{k,l}(t) = Pr(st = k, st+1 = l | u) = αk(t) a_{k,l} bl(ut+1) βl(t + 1) / p(u),
p(u) = ∑_{k=1}^M αk(t) βk(t), for any t.

Majorization-Minimization Algorithm

The MM algorithm is not a single algorithm but a prescription for constructing optimization algorithms; the EM algorithm is a special case. An MM algorithm operates by creating a surrogate function that majorizes (minorizes) the objective function. When the surrogate is optimized, the objective function is driven downhill (uphill). Hence: Majorization-Minimization, or Minorization-Maximization.
Advantages of MM:
- it may yield an algorithm that avoids matrix inversion;
- it can linearize an optimization problem;
- it can conveniently deal with equality and inequality constraints;
- it can solve a non-differentiable problem by iteratively solving smooth problems.

A function g(θ | θs) is said to majorize the function f(θ) at θs if
f(θs) = g(θs | θs), and f(θ) ≤ g(θ | θs) for all θ.
Let θs+1 = argmin_θ g(θ | θs). It follows that
f(θs+1) ≤ g(θs+1 | θs) ≤ g(θs | θs) = f(θs).
The inequality is strict unless g(θs+1 | θs) = g(θs | θs) and f(θs+1) = g(θs+1 | θs). Therefore, by alternating between the majorization and minimization steps, the objective function decreases monotonically, and convergence of the objective values is guaranteed.

Any inequality can be used to construct the desired majorized/minorized version of the objective function. Typical choices include:
- Jensen's inequality: for a convex function f, f(E(X)) ≤ E[f(X)];
- the convexity inequality: for any λ ∈ [0, 1], f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2);
- the Cauchy-Schwarz inequality;
- supporting hyperplanes;
- the arithmetic-geometric mean inequality.

Example: Penalized AFT model with induced smoothing

AFT model. Let {Ti, Ci, xi}, i = 1, . . .
,n be n independent copies of {T, C, X}, where Ti and Ci are the log-transformed failure time and log-transformed censoring time, xi is a p × 1 covariate vector, and, given X, C and T are independent. The accelerated failure time (AFT) model has the form
Ti = xi′β + εi, i = 1, ..., n.
The observed data are (Yi, δi, xi), where Yi = min(Ti, Ci) and δi = I(Ti < Ci).

For a case-cohort study, the weight-adjusted estimating equation with induced smoothing is
Ũn(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi (xi − xj) Φ( (ej(β) − ei(β)) / rij ) = 0,
where ej(β) = Yj − xj′β and rij² = n^{−1} (xi − xj)′(xi − xj) = n^{−1} ‖xi − xj‖².
Define H(x) = xΦ(x) + φ(x) and
L̃n(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi rij H( (ej(β) − ei(β)) / rij ).
Then solving the estimating equation Ũn(β) = 0 is equivalent to minimizing the objective function L̃n(β), which is convex in β.

We propose the penalized AFT method, which amounts to minimizing
P̃Ln(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n Δi rij H( (ej(β) − ei(β)) / rij ) + λ P(β),
where P(β) is some penalty function and λ is the tuning parameter.

To solve this problem, the key is to identify a surrogate function for L̃n(β) that is easy to work with. In view of the form of the objective function, an immediate idea is to majorize H(x). Because H(x) is twice differentiable with H″(x) = φ(x) ≤ φ(0) ≈ 0.4 for all x ∈ R, it follows that for any x0 ∈ R,
H(x) ≤ G(x | x0) ≡ H(x0) + (x − x0)Φ(x0) + (φ(0)/2)(x − x0)².
Note that in a standard optimization method based on quadratic approximation, e.g., Newton-Raphson, the φ(0) term is replaced with φ(x0), which, however, does not guarantee the majorization.
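The claimed majorization H(x) ≤ G(x | x0) is easy to check numerically; a small sketch using only the standard library (Φ is computed from math.erf):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def H(x):
    return x * Phi(x) + phi(x)

def G(x, x0):
    """Quadratic majorizer of H at x0: slope Phi(x0), curvature bounded by phi(0)."""
    return H(x0) + (x - x0) * Phi(x0) + 0.5 * phi(0.0) * (x - x0) ** 2

# H(x) <= G(x | x0) everywhere, with equality at x = x0
for x0 in (-2.0, 0.0, 1.5):
    assert abs(G(x0, x0) - H(x0)) < 1e-12
    for i in range(-400, 401):
        x = i / 50.0
        assert H(x) <= G(x, x0) + 1e-12
```

The bound holds because H′(x) = Φ(x) and H″(x) = φ(x) never exceeds φ(0), so the quadratic with curvature φ(0) dominates the second-order Taylor remainder.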
Write
P̃Ln(β) = (1/n) ∑_{i=1}^n ∑_{j=1}^n wij H(aij + zij′β) + λ P(β),
where wij = Δi rij, aij = (Yj − Yi)/rij, and zij = (xi − xj)/rij. Then, for the current iterate βs,
P̃Ln(β) ≤ (1/n) ∑_{i=1}^n ∑_{j=1}^n wij G(aij + zij′β | aij + zij′βs) + λ P(β)
= (1/2) β′Aβ − bs′β + λ P(β) + const ≡ L̃n*(β | βs) + const,
where
A = (φ(0)/n) ∑_{i=1}^n ∑_{j=1}^n wij zij zij′,
bs = Aβs − (1/n) ∑_{i=1}^n ∑_{j=1}^n wij Φ(aij + zij′βs) zij.
Minimizing L̃n*(β | βs) is a lasso-type problem, for which a coordinate descent algorithm can be applied.

MM for the penalized AFT model:

Initialize β0 ∈ R^p; compute A = (φ(0)/n) ∑_{i=1}^n ∑_{j=1}^n wij zij zij′; set k ← 0.
repeat
  (1) Majorization step: b^{k+1} ← Aβ^k − (1/n) ∑_{i=1}^n ∑_{j=1}^n wij Φ(aij + zij′β^k) zij.
  (2) Minimization step: initialize β̃^0 = β^k and s = 0.
      while ‖β̃^{s+1} − β̃^s‖ / ‖β̃^s‖ ≥ ε and s < smax do
        (i) for j = 1, ..., p: β̃^s_j ← (1/ajj) S( b^{k+1}_j − aj′β̃^s + ajj β̃^s_j, λ ), where aj is the j-th column of A and S(·, λ) is the soft-thresholding operator;
        (ii) β̃^{s+1} ← β̃^s;
        (iii) s ← s + 1.
      end while
      β^{k+1} ← β̃^{s+1}; set k ← k + 1.
until convergence, i.e., ‖β^{k+1} − β^k‖ / ‖β^k‖ ≤ ε.
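The inner minimization step is cyclic coordinate descent with soft-thresholding for a quadratic-plus-lasso objective (1/2)β′Aβ − b′β + λ‖β‖₁. A self-contained sketch of that inner solver (our own minimal implementation, not the authors' code):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_quadratic_lasso(A, b, lam, beta0=None, tol=1e-10, max_iter=1000):
    """Cyclic coordinate descent for (1/2) beta'A beta - b'beta + lam*||beta||_1,
    with A positive definite.  Mirrors step (2) of the MM algorithm:
    beta_j <- S(b_j - a_j'beta + a_jj beta_j, lam) / a_jj."""
    p = len(b)
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        old = beta.copy()
        for j in range(p):
            r = b[j] - A[j] @ beta + A[j, j] * beta[j]   # partial residual for coordinate j
            beta[j] = soft_threshold(r, lam) / A[j, j]
        if np.linalg.norm(beta - old) <= tol * (np.linalg.norm(old) + 1e-12):
            break
    return beta
```

As a sanity check, with λ = 0 the solver reduces to a Gauss-Seidel sweep for Aβ = b, and with λ large enough that every |b_j| ≤ λ the solution is exactly zero.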