M3S2/M4S2 Statistical Modelling II
Recommended Literature
All the material used in this course can be found in various textbooks.
• McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. Chapman &
Hall,
• Lee, Y., Nelder, J. A. & Pawitan, Y. (2006). Generalized linear models with random
effects: unified analysis via H-likelihood. CRC Press
• Hastie, T. & Tibshirani, R. (1990). Generalized additive models. Wiley Online
Library
• Venables, W. N. & Ripley, B. D. (2013). Modern applied statistics with S-PLUS.
• Dobson, A. J. & Barnett, A. (2008). An introduction to generalized linear models.
CRC press
• Wood, S. (2006). Generalized Additive Models: An Introduction with R. CRC Press
• Ravishanker, N. & Dey, D. K. (2001). A first course in linear model theory. CRC
Press
• Faraway, J. J. (2016). Extending the linear model with R: generalized linear, mixed
effects and nonparametric regression models, vol. 124. CRC press
• Fox, J. & Weisberg, S. (2011). An R companion to applied regression. Sage
Publications
These books can be found (some electronically) in the college library.
Chapter 1. Introduction to Statistical Models
Consider situations where we observe some data y from experiments, events etc. We
interpret the observation y as a realisation of some random variable Y.
A statistical model is the specification of the distribution of Y up to an unknown
parameter θ. Often, the observations y = (y1, . . . , yn) ∈ Rn is a vector and
Y = (Y1, . . . ,Yn) is a random vector. In this case a statistical model is the specification
of the joint distribution of Y1, . . . ,Yn up to an unknown parameter θ.
Example Mobile Phones
Observations: yi = whether student i looks at their mobile in the first 10 minutes of the lecture.
Model: Y1, . . . ,Yn iid, P(Yi = true) = 1− P(Yi = false) = θ, θ ∈ [0, 1].

In many experiments and situations, the observations Y1, . . . ,Yn do not have the same
distribution. The distribution of Y1, . . . ,Yn may depend on non-random quantities
x1, . . . , xn called covariates.
Example Do students who attend more lectures get higher marks in Maths exams?
yi = average mark, xi = number of lectures attended
Model: Yi = a + bxi + ei, i = 1, . . . , n, where e1, . . . , en are iid N(0, σ2) and θ = (a, b, σ2).

Model Fitting
We shall describe the process of the model fitting as follows:
1. Model specification — specify the distribution of the observations Y1, . . . ,Yn up
to unknown parameters.
2. Estimation of the unknown parameters of the model.
3. Inference — this involves constructing confidence intervals and testing
4. Diagnostics — to check how well the model fits the data.
An “ideal” model should
• agree with the observed data reasonably well.
• not contain more parameters than are necessary.
• be easy to interpret.
Chapter 2. Normal Linear Models
In this chapter we study normal linear models, a class of models whose structure leads to some closed form results. However, we shall also point out that linear models have their limitations.
The intuition behind the fit
prestige.R prestige-data.RData
Consider the following dataset relating to the prestige of different occupations. Each row relates to a different occupation and contains a prestige score and the log income. Can the (log) income be used to measure the prestige of occupations?
occupation         prestige  log income
general.managers   69.1      10.16
accountants        63.4      9.135
chemists           73.5      9.036
physicists         77.6      9.308
biologists         72.6      9.019
architects         78.1      9.558
civil.engineers    73.1      9.339
mining.engineers   68.8      9.308
...
[Scatter plot of Prestige (vertical axis, 20 to 80) against log income (horizontal axis, 7 to 10).]
A possible linear model:
Yi = β1 + xiβ2 + ei, i = 1, . . . , n,
where
• Yi is the outcome or response of interest (random variable) — observed
realisations yi.
• xi is a covariate — observed constant
• β1, β2 are unknown parameters.
• e1, . . . , en are iid errors (random variables) with E(ei) = 0 and var(ei) = σ2. Note that σ2 is another unknown parameter — not observed.
[Sketch: data points (x1, y1), . . . , (x5, y5) plotted with the line β1 + β2x; the vertical distance y1 − β1 − x1β2 is the first residual.]
The least-squares estimators β̂1 and β̂2 are defined as the minimisers of
∑_{i=1}^n (yi − β1 − xiβ2)².
2.1 Specification of Normal Linear Models
Definition. A linear model is defined as
Y = Xβ + ε,
where
• Y is the n-dimensional random vector of observations, called the response.
• X ∈ Rn×p is the design matrix (known) which contains the covariates (predictors). Let r := rank(X).
• β ∈ Rp is the unknown parameter vector.
• ε is the n-variate unobserved error vector such that E(ε) = 0.
All vectors are column vectors and are presented in bold.
Further, if ε ∼ N(0, σ2 In) for some σ2 > 0, then the linear model is normal. The normal linear model can be written as
Y ∼ N(Xβ, σ2 In),
where In is the n × n identity matrix.
The assumption that ε ∼ N(0, σ2 In) is called the Normal Theory Assumption (NTA) and has two parts:
• the errors are uncorrelated; and
• the variance of each Yi is the same, namely σ2.
Examples of normal linear models
Example Consider the model where Y1, . . . ,Yn are independent normally distributed
random variables with variance σ2 and E(Yi) = β1 + xiβ2 for i = 1, . . . , n. This is a
normal linear model as it can be written in the form

Example Consider the model where Yi ∼ N(µi, σ2) independently with µi =
xi1β1 + xi2β2 + xi3xi4β3 for i = 1, . . . , n. This is a normal linear model as it can be
written as

Example Consider two school classes each consisting of 10 children. Denote the two
classes as class A and class B. Suppose we model the observations of all children
as normally distributed random variables with variance σ2. Further suppose for
child i from class A, denoted as YiA, that E(YiA) = µA for all i and analogously
the observation for child i from class B, denoted by YiB, that E(YiB) = µB. Then this is
a normal linear model as it can be written as

Therefore, the linear model accommodates a large variety of different models, i.e.
those with and without a constant intercept term and those structured according to
classes/groups. However, . . .
Example Suppose the independent observations Y1, . . . , Yn are normal random variables with E(Yi) = β1 + x_{i1}^{β2} and var(Yi) = σ2 for i = 1, . . . , n.
. . . is not a linear model, as it cannot be written in the form Y = Xβ + ε.
2.2 Estimation
So far we have just introduced the normal linear model, where Y ∼ N(Xβ, σ2 In). The joint probability density function of Y is
f(y; β, σ2) = (2πσ2)^{−n/2} exp(−(y − Xβ)T(y − Xβ)/(2σ2)).
We now estimate the unknown parameter β using the maximum likelihood approach;
that is, we want to find (keeping σ2 fixed)
β̂ = argmax_β L(β, σ2; y),
where L(β, σ2; y) is the likelihood function. Typically, it is easier to obtain the MLE by maximising ℓ(β, σ2; y) = log L(β, σ2; y).∗
First, the log-likelihood of the data is
ℓ(β, σ2; y) = −(n/2) log(2πσ2) − (y − Xβ)T(y − Xβ)/(2σ2).
To find the maximum likelihood estimator (MLE), β̂, it suffices to minimise the quadratic form (Y − Xβ)T(Y − Xβ); differentiating it with respect to β (keeping σ2 fixed and treating it as a constant) gives
−2XTY + 2XTXβ.
Setting this expression equal to zero gives the so-called normal equations
XTXβ̂ = XTY .
If XTX is invertible (iff X has full rank), the normal equations can be solved directly
giving the MLE as
β̂ = (XTX)−1XTY. (2.1)
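As a numerical sanity check of (2.1), here is a short sketch in Python/NumPy (the notes use R elsewhere; the data below are made up for illustration):

```python
import numpy as np

# Hypothetical toy data for the model y = beta1 + beta2*x + error.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X betahat = X^T y.
betahat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
betahat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(betahat, betahat_lstsq))  # True
```

Solving the normal equations directly is fine here; for ill-conditioned design matrices a QR- or SVD-based solver such as `lstsq` is numerically preferable.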
The MLE for σ2 can be obtained using the same method.
∗The probability density function (pdf) f(y; θ) is considered a function of y for fixed or given θ, whereas the likelihood L(θ; y), and therefore the log-likelihood ℓ(θ; y) = log L(θ; y), is considered a function of θ for the particular data y observed.
Properties of the Maximum Likelihood Estimator
The MLE of β has the following properties.
• β̂ is linear in Y .
• β̂ is unbiased for β
• cov(β̂) = σ2(XTX)−1.
In fact there is good reason to use the MLE β̂; it can be used to construct a best linear
unbiased estimator (BLUE). An estimator γ̂ is called linear if there exists L ∈ Rn such
that γ̂ = LTY .
Theorem 1. (The Gauss Markov Theorem for full rank linear models) Assume a
linear model with unknown parameter β, a design matrix of full rank and cov(e) =
σ2 In. Fix any c ∈ Rp and let β̂ be the least squares estimator of the linear model.
Then the following holds: The estimator cTβ̂ has the smallest variance among all
linear unbiased estimators for cTβ.
It so happens that the least squares estimator i.e. the β that minimises:
S(β) := (Y − Xβ)T(Y − Xβ)
and the MLE, β̂, are the same (see problem sheet).
Before we present an estimator of the variance σ2, it is necessary to introduce the
notion of projection matrices.
Projection Matrices
Let L be a linear subspace of Rn, where dim L = p ≤ n.
Definition. P ∈ Rn×n is projection matrix onto L, if
1. Px = x for all x ∈ L,
2. Px = 0 for all x ∈ L⊥ := {z ∈ Rn : zTy = 0 ∀y ∈ L}, the orthogonal complement of L.
It follows that
Lemma 2. P is a projection matrix ⇐⇒ PT = P (symmetric) and P2 = P (idempotent).
Definition. Ŷ = Xβ̂, where β̂ is the MLE, is called the vector of fitted values. In the
case where X has full rank, Ŷ = X(XTX)−1XTY .
It turns out that
P := X(XTX)−1XT
is a projection matrix, since (by Lemma 2)
PT = X((XTX)T)−1XT = X(XTX)−1XT = P and P2 = X(XTX)−1XTX(XTX)−1XT = X(XTX)−1XT = P,
where the first equation holds because (A−1)T = (AT)−1 for a non-singular matrix A.
We now define the residuals and the residual sum of squares.
Definition. e = Y − Ŷ is called the vector of residuals.
Notice that the residuals can be written as
e = Y − Ŷ = Y − PY = (In − P)Y .
It can be shown that In − P is also a projection matrix (see problem sheet). The
projection matrix P is sometimes referred to as the hat matrix, as it puts the hat on Y.
Definition. RSS = eTe is called the residual sum of squares.
The RSS quantifies the departure of the data from the model; ideally we want a small RSS value. It is also the minimum of S(β).
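These facts are easy to check numerically. A Python/NumPy sketch with hypothetical data (the notes' own examples are in R):

```python
import numpy as np

# Hypothetical design matrix (intercept + one covariate) and response.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([0.5, 1.0, 2.5, 2.0])

# The hat matrix P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(P, P.T))      # symmetric
print(np.allclose(P @ P, P))    # idempotent

# Residuals e = (I - P) Y and the residual sum of squares.
e = (np.eye(len(Y)) - P) @ Y
rss = e @ e
print(rss >= 0.0)               # True
```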
Theorem 3. σ̂2 := RSS/(n − r) is an unbiased estimator of σ2.
Proof. See problem sheet
If we derived the MLE for σ2 in the usual way — maximising the log-likelihood and setting the derivative equal to zero — we would obtain a biased estimator. To see this:
ℓ(β, σ2; Y) =
∂ℓ/∂(σ2) =
Solving ∂ℓ/∂(σ2) = 0 gives:
σ̃2 =
Notice that if we plug in β = β̂, we have σ̃2 = RSS/n. Since E(RSS) = σ2(n − r), it follows that E(σ̃2) = σ2(n − r)/n. This is why we use the estimator in Theorem 3.
Example In the following, we consider a normal linear model where the MLE of
β can be derived explicitly. Let Y1, . . . ,Yn be independent random variables with
Yi ∼ N(µi, σ2) with
µi = β1 + β2ai,
where a1, . . . , an are known deterministic constants. Take n ≥ 2. Then
Assume that not all ai are equal, to ensure that X has full rank. Then
Now we can write the MLE β̂ explicitly as β̂ = (XTX)−1XTY:

Identifiability
In this section we have discussed how to estimate the unknown parameter, β, in the
normal linear model. It can be the case that two parameter values lead to the same
distribution for the observed data. This occurs when r < p and we say that β is not
identifiable.
Example in R
gas.R gas-data.RData
Suppose we have data consisting of household gas consumption and average external
temperature (see table below). Can the external temperature be used to measure the
amount of gas consumed in a household?
# Temp Gas
1 -0.8 7.2
2 -0.7 6.9
3 0.4 6.4
4 2.5 6.0
5 2.9 5.8
6 3.2 5.8
7 3.6 5.6
8 3.9 4.7
9 4.2 5.8
10 4.3 5.2
11 5.4 4.9
12 6.0 4.9
13 6.0 4.3
14 6.0 4.4
15 6.2 4.5
16 6.3 4.6
17 6.9 3.7
18 7.0 3.9
19 7.4 4.2
20 7.5 4.0
21 7.5 3.9
22 7.6 3.5
23 8.0 4.0
24 8.5 3.6
25 9.1 3.1
26 10.2 2.6
Table 2.1: Gas consumption data
We take Gas as the response, Yi, and Temp as the covariate xi. Suppose that we use a normal linear model to explain the data, where Yi are independent N(µi, σ2) with µi = β1 + β2xi for i = 1, . . . , 26. For this model, we have
Then we compute the MLE β̂ with the following commands. First, we assemble
the vectors of observations Y and the design matrix X:
> Y <- dat$Gas
> X <- cbind(1,dat$Temp)
As XTX has full rank, we can compute its inverse:
> solve(t(X)%*%X)
[,1] [,2]
[1,] 0.17718685 -0.025929965
[2,] -0.02592996 0.004846722
The MLE β̂ = (XTX)−1XTY can be computed as follows:
> betahat <- solve(t(X)%*%X)%*%t(X)%*%Y
> betahat
[,1]
[1,] 6.8538277
[2,] -0.3932388
Then to plot the model fit we can use
> plot(dat,pch=16) #pch=16 selected for better presentation
> xs <- seq(-2,12,by=0.5) #dummy x values
> lines(x=xs,y=betahat[1]+xs*betahat[2])
[Scatter plot of Gas (vertical axis, 3 to 7) against Temp (horizontal axis, 0 to 10) with the fitted line overlaid.]
Further, the residual sum of squares (RSS) is
> ehat <- Y-X%*%betahat
> t(ehat)%*%ehat
[,1]
[1,] 1.899568
Lastly, we can compute σ̂2 = RSS/(n − r):
> sig2hat <- t(ehat)%*%ehat/(26-2)
> sig2hat
[,1]
[1,] 0.07914867

2.3 Inference
In section 2.1 we specified a normal linear model. Then in section 2.2 we discussed
how to estimate the parameters of the linear model.
In this section we shall consider two main tools of statistical inference
1. Confidence Intervals. The width of these intervals indicates the uncertainty of parameter estimates and inferences.
2. Hypothesis Testing. Hypothesis tests are used to compare how two related models fit the data. To compare models we require a measure of their goodness of fit. We shall present goodness-of-fit statistics based on the log-likelihood function.
In order to construct confidence intervals and measure the goodness of fit of normal
linear models, we require sampling distributions. Note that the derivations of these
sampling distributions are omitted — see Statistical Modelling I for details.
Before we begin, here is an example of constructing a confidence interval and performing a hypothesis test, as a reminder:
Example
Confidence Interval
Let X1, . . . , Xn be iid N(µ, σ2) random variables with σ2 > 0 known. We know that
(X̄ − µ)/(σ/√n) ∼ N(0, 1),
where X̄ = (1/n) ∑_{i=1}^n Xi. This is our pivotal quantity for µ. We can construct a 1 − α confidence interval for µ as follows:

Notes
• (X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n) is a random interval.
• (x̄ − zα/2 σ/√n, x̄ + zα/2 σ/√n) is the observed interval.
• We could use asymmetric values, but symmetric values, i.e. using ±zα/2, give the shortest intervals in this example.
Definition. A (1− α) confidence interval for θ is a random interval I that contains
the true parameter with probability ≥ 1− α, i.e.
Pθ(θ ∈ I) ≥ 1− α ∀θ ∈ Θ
The interval, I, can be any type of interval. For example, if L and U are random
variables with L ≤ U then I could be the open interval (L,U), the closed interval
[L,U], the unbounded interval [L,∞), etc.
Example
Hypothesis Test (Continuing the previous example)
If we wish to test the null hypothesis H0 : µ = µ0 versus H1 : µ ≠ µ0, a test rejection region is {x : |x̄ − µ0| > zα/2 σ/√n}. So the null hypothesis, H0, is not rejected for samples with
x̄ − zα/2 σ/√n < µ0 < x̄ + zα/2 σ/√n.
This test is constructed to have significance level α i.e. P(H0 is not rejected|µ = µ0) =
1− α.
Example
Pipes
In a manufacturing process that produces pipes, historical data suggest that the diameters of the pipes are N(30, 1.5²). The process is modified but the variance remains the same (at 1.5²). After the modification n = 13 pipes are produced and the average diameter of the pipes is 32.05; so that the new diameter ∼ N(µ, 1.5²). Let's construct a 95% confidence interval for µ:
n = , σ = , x̄ = , α = , zα/2 =
which gives the 95% confidence interval: . If we were to test the
hypothesis H0 : µ = 30 versus H1 : µ 6= 30; that is the null is that the process
has not changed, we would reject the null hypothesis as 30 is not contained in the
interval.

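As a sketch of the arithmetic in this example (in Python, with the critical value z_{0.025} ≈ 1.96 hard-coded):

```python
import math

# Quantities from the pipes example.
n, sigma, xbar = 13, 1.5, 32.05
z = 1.96                                  # z_{alpha/2} for alpha = 0.05
half_width = z * sigma / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(ci)            # roughly (31.23, 32.87)
print(ci[0] > 30.0)  # True: 30 lies outside the interval, so reject H0
```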
Confidence Intervals and Regions
We now discuss how to construct confidence regions for the model parameter β. First we require a pivotal quantity for σ2. It turns out that
RSS/σ2 ∼ χ2_{n−r}, (2.2)
where r = rank(X) and χ2_k denotes the chi-squared distribution with k degrees of freedom. Then it follows, for any c ∈ Rp, and when X has full rank, i.e. rank(X) = p, that
(cTβ̂ − cTβ) / √(cT(XTX)−1c σ̂2) ∼ t_{n−p}, (2.3)
where t_k denotes the Student-t distribution with k degrees of freedom. This follows from
(cTβ̂ − cTβ) / √(cT(XTX)−1c σ2) ∼ N(0, 1) (2.4)
and recalling that:
By selecting c appropriately, we can use (2.3) to construct confidence intervals for
components of β e.g. β1 — using c = (1, 0, . . . , 0).
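A Python/NumPy sketch of such an interval for β2 via (2.3), on hypothetical data, with the critical value t_3^{(0.025)} ≈ 3.182 hard-coded (df = n − p = 3):

```python
import numpy as np

# Hypothetical data for a straight-line model.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ betahat) ** 2)
sig2hat = rss / (n - p)                  # unbiased estimator of sigma^2

c = np.array([0.0, 1.0])                 # picks out beta_2
se = np.sqrt(sig2hat * (c @ XtX_inv @ c))
tcrit = 3.182                            # t_{n-p}^{(alpha/2)}, alpha = 0.05
ci = (c @ betahat - tcrit * se, c @ betahat + tcrit * se)
print(ci[0] > 0.0)                       # True: the interval excludes zero here
```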
In fact, we can go further and construct confidence regions for β by using
[(β̂ − β)TXTX(β̂ − β)/RSS] · (n − p)/p ∼ F_{p,n−p}, (2.5)
where F_{a,b} denotes the F-distribution with degrees of freedom a and b.
Example Suppose we use a normal linear model with known σ2 = 1 and design
matrix
X =
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
.
A hypothesis test of H0 : β2 = 0 versus H1 : β2 ≠ 0 is constructed as follows:

Note
• If σ is known, use the normal distribution. If σ is unknown, use the t-distribution.
• We are now able to construct confidence intervals (and regions) and conduct hypothesis tests for β — see examples at the beginning of this section.
Warning
Following from (Wasserman, 2013, p. 111):
“There is sometimes much confusion over the interpretation of a confidence
interval. A confidence interval is not a probability statement about µ (or some other
parameter of interest) since µ is a fixed quantity, not a random variable. Some texts
interpret confidence intervals as follows”:
if I repeat the experiment over and over again, the interval will contain the true parameter
value 95% of the time.
“This is correct but useless since we rarely repeat the same experiment over and
over. A better interpretation is this”:
On day 1, you collect data and construct a 95% confidence interval for a parameter θ1.
On day 2, you collect new data and construct a 95% confidence interval for an unrelated
parameter θ2. On day 3, you collect new data and construct a 95% confidence interval for an
unrelated parameter θ3. You continue this way, constructing confidence intervals for a sequence
of unrelated parameters θ1, θ2, . . . . Then 95% of the time your intervals will contain the
true parameter. There is no need to introduce the idea of repeating the same experiment over
and over.
p-values
Sometimes one does not want to specify the significance level α in advance. In this
case, the so-called p-value is reported.
p = sup_{θ∈Θ0} Pθ(observing something at least as extreme as the observation).
Then we reject H0 iff the p-value p ≤ α.
Hypothesis Testing: The F-test
Suppose we wish to test whether a sub-model of a linear model E(Y) = Xβ is adequate, i.e. we wish to test
H0 : E(Y) ∈ span(X0) vs H1 : E(Y) ∉ span(X0),
for some matrix X0 with span(X0) ⊂ span(X). The span is the linear space spanned
by the column vectors of the matrix.
Example Consider comparing the sub-model H0 : Yi = β1 + β2x1,i + ei against the full model H1 : Yi = β1 + β2x1,i + β3x2,i + ei.
Theorem 4. Under H0 : E(Y) ∈ span(X0),
F = [(RSS0 − RSS)/RSS] · (n − r)/(r − s) ∼ F_{r−s,n−r},
where r = rank(X), s = rank(X0), RSS is the residual sum of squares of the full model and RSS0 that of the sub-model.
Think: what is the difference between the F-test and the t-test?
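A Python/NumPy sketch of the F statistic in Theorem 4, on hypothetical data where the full model adds a covariate x2 that is, by construction, irrelevant:

```python
import numpy as np

# Hypothetical data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
n = 20
x1 = np.arange(n, dtype=float)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def rss_of(M, y):
    """Residual sum of squares of the least-squares fit of y on M."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = y - M @ beta
    return resid @ resid

X0 = np.column_stack([np.ones(n), x1])       # sub-model, rank s = 2
X = np.column_stack([np.ones(n), x1, x2])    # full model, rank r = 3
rss0, rss = rss_of(X0, y), rss_of(X, y)

# F = [(RSS0 - RSS)/(r - s)] / [RSS/(n - r)], compared with F_{1, 17}.
F = ((rss0 - rss) / (3 - 2)) / (rss / (n - 3))
print(F >= 0.0)  # True; a large value would favour the bigger model
```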
A possible use of this F-test is the ANOVA test, which we now discuss.
One-Way Analysis of Variance (ANOVA)
Suppose we have m groups of observations, each group consisting of k observations. Label Yij as the jth observation from the ith group. Assume the model is
E(Yij) = µ+ βi i = 1, . . . ,m; j = 1, . . . , k,
and that var(Yij) = σ2 with all observations independent. We want to test the null
hypothesis that all the observations Yij come from the same population; that is
H0 : β1 = β2 = · · · = βm,
or equivalently
H0 : E(Yij) = µ i = 1, . . . ,m; j = 1, . . . , k.
We can test this hypothesis using the F-test (Theorem 4).
2.4 Prediction
Once we have fitted a model, we may use it to predict outcomes. Prediction, as
interpreted here, is concerned with answers to ‘what-if’ questions rather than actual
prediction into the future.
Suppose we have fitted a linear model with design matrix X to obtain the estimates
β̂ and σ̂2. For given covariates, x⋆, the predicted (expected) response is ŷ⋆ = x⋆Tβ̂. To ascertain the uncertainty in this prediction, we need to be clear about the type of prediction we are making.
Suppose we have fitted a linear regression model that predicts the selling price of homes in a given area based on predictors such as the number of bedrooms
and closeness to a major highway. There are two kinds of predictions that can be made:
• Suppose a specific house comes on the market with characteristics x⋆. Its selling price will be x⋆Tβ + e. Since E(e) = 0, the predicted price is x⋆Tβ̂, but in assessing the variance of this prediction, we must include the variance of e.
• Suppose we ask the question: “What would a house with characteristics x⋆ sell for on average?” This selling price is x⋆Tβ and is again predicted by x⋆Tβ̂, but now only the variance in β̂ needs to be taken into account.
We have that var(x⋆Tβ̂) = x⋆T(XTX)−1x⋆ σ2. So a 100(1 − α)% confidence interval for a single future response is
ŷ⋆ ± t_{n−p}^{(α/2)} σ̂ √(1 + x⋆T(XTX)−1x⋆),
where P(T ≤ t_{n−p}^{(α/2)}) = 1 − α/2, T ∼ t_{n−p}. The confidence interval for the mean response for given x⋆ is
ŷ⋆ ± t_{n−p}^{(α/2)} σ̂ √(x⋆T(XTX)−1x⋆).
These confidence intervals are sometimes referred to as prediction intervals.
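The two interval half-widths can be compared numerically. A Python/NumPy sketch with hypothetical data and the critical value t_3^{(0.025)} ≈ 3.182 hard-coded:

```python
import numpy as np

# Hypothetical data for a straight-line model.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
sig2hat = np.sum((y - X @ betahat) ** 2) / (n - p)

xstar = np.array([1.0, 2.5])       # new covariate vector (with intercept)
tcrit = 3.182                      # t_{n-p}^{(alpha/2)}, df = 3
hstar = xstar @ XtX_inv @ xstar

w_mean = tcrit * np.sqrt(sig2hat) * np.sqrt(hstar)      # mean response
w_pred = tcrit * np.sqrt(sig2hat) * np.sqrt(1 + hstar)  # single new response
print(w_pred > w_mean)  # True: a single response is harder to pin down
```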
2.4.1 Worked Example
lm-preds.R
We now show how predictions and their confidence intervals can be computed in R.
Consider the Galapagos dataset (which is in the faraway library).
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58.00 23.00 25.09 346.00 0.60 0.60 1.84
Bartolome 31.00 21.00 1.24 109.00 0.60 26.30 572.33
Caldwell 3.00 3.00 0.21 114.00 2.80 58.70 0.78
Champion 25.00 9.00 0.10 46.00 1.90 47.40 0.18
Coamano ...
...
This dataset consists of 30 islands and seven variables. We shall be interested in Species, which is the number of species of tortoise found on the island. Let's start by fitting a linear model:
> library("faraway")
> X <- cbind(1,gala$Area,gala$Elevation,gala$Nearest,gala$Scruz,gala$Adjacent)
> y <- gala$Species
> XTX <- solve(t(X)%*%X)
> betahat <- XTX%*%t(X)%*%y
> ehat <- y-X%*%betahat
> sig2hat <- sum(ehat^2)/(30-6) # RSS/(n-p); needed below for the intervals
Now suppose we want to predict the number of species of tortoise on an island with predictors 0.08, 93, 3, 12, 0.34. The predicted (expected) response from the model is:
> xstar <- c(1,0.08,93,3,12,0.34) # include 1 for an intercept term.
> ystar <- sum(xstar*betahat)
> ystar
[1] 33.89224
Now consider the two types of confidence intervals: we must decide whether we
are predicting the number of species on one new island or the mean response for all
islands with the same predictors x⋆.
For a mean response, we would use the second type of interval. First, we require
the critical value from the t-distribution
> qcrit <- qt(0.975,24)
> qcrit
[1] 2.063899
The width of the confidence interval is:
> ciw <- qcrit*sqrt(sig2hat)*sqrt(xstar%*%XTX%*%xstar)
> ciw
[,1]
[1,] 32.41583
and so the interval is
> c(ystar-ciw,ystar+ciw)
[1] 1.476401 66.308070
The prediction interval for a single future response is:
> ciw <- qcrit*sqrt(sig2hat)*sqrt(1+xstar%*%XTX%*%xstar)
> c(ystar-ciw,ystar+ciw)
[1] -96.06219 163.84667
2.5 Diagnostics
Coefficient of Determination
A way of measuring the goodness of fit of a linear model is by inspecting the
coefficient of determination, which we now explain. In the simplest model with only
an intercept term
Y = (1, . . . , 1)T β1 + ε, E(ε) = 0,
we have RSS = ∑_{i=1}^n (Yi − Ȳ)². Larger models, with more parameters and larger design matrices, will have a smaller RSS.
For models that include an intercept term, a measure of the quality of a linear model is
R² := 1 − RSS / ∑_{i=1}^n (Yi − Ȳ)²,
which is called the coefficient of determination or R2 statistic. Notice that 0 ≤ R2 ≤ 1 and R2 = 1 corresponds to the “perfect” model.
Intuition
For the intercept-only model, (1/n) ∑_{i=1}^n (Yi − Ȳ)² is an estimator of σ2 — call this the total variance. Then
1 − R² = (RSS/n) / [(1/n) ∑_{i=1}^n (Yi − Ȳ)²] ≈ (residual variance of the model) / (total variance).
Therefore, R2 ≈ proportion of total variance of the data that is explained by the model.
There is also an alternative version of the R2 statistic, called the adjusted coefficient of determination, defined as
R²_adj := 1 − [RSS/(n − p)] / [∑_{i=1}^n (Yi − Ȳ)²/(n − 1)].
Notice that the two sums of squares are divided by their respective degrees of freedom; thus the adjusted R2 attempts to account for the number of parameters in each model.
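A Python/NumPy sketch computing both statistics directly from these definitions, on hypothetical data with p = 2 parameters (including the intercept):

```python
import numpy as np

# Hypothetical data for a straight-line model.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

betahat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ betahat) ** 2)
tss = np.sum((y - y.mean()) ** 2)   # RSS of the intercept-only model

r2 = 1.0 - rss / tss
r2_adj = 1.0 - (rss / (n - p)) / (tss / (n - 1))
print(0.0 <= r2 <= 1.0, r2_adj <= r2)  # True True
```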
2.5.1 The Danger of Using R2: Anscombe's Quartet
Anscombe.R
Using the R2 and the adjusted R2 summary statistics alone can be dangerous, as illustrated by Anscombe's quartet, a collection of 4 datasets — see table below:
x1 x2 x3 x4 y1 y2 y3 y4
1 10.00 10.00 10.00 8.00 8.04 9.14 7.46 6.58
2 8.00 8.00 8.00 8.00 6.95 8.14 6.77 5.76
3 13.00 13.00 13.00 8.00 7.58 8.74 12.74 7.71
4 9.00 9.00 9.00 8.00 8.81 8.77 7.11 8.84
5 11.00 11.00 11.00 8.00 8.33 9.26 7.81 8.47
6 14.00 14.00 14.00 8.00 9.96 8.10 8.84 7.04
7 6.00 6.00 6.00 8.00 7.24 6.13 6.08 5.25
8 4.00 4.00 4.00 19.00 4.26 3.10 5.39 12.50
9 12.00 12.00 12.00 8.00 10.84 9.13 8.15 5.56
10 7.00 7.00 7.00 8.00 4.82 7.26 6.42 7.91
11 5.00 5.00 5.00 8.00 5.68 4.74 5.73 6.89
If we proceed to fit a linear model for the 4 datasets, we obtain the following
parameter estimates and R2 values:
        β̂1    β̂2    R2    adjusted R2
Model 1 3.000 0.500 0.667 0.629
Model 2 3.001 0.500 0.666 0.629
Model 3 3.002 0.500 0.666 0.629
Model 4 3.002 0.500 0.667 0.630
Table 2.2: Summary statistics
The parameter estimates and the R2 statistic values are approximately the same for the
4 datasets. So these models are (more or less) equal, in terms of the fit and summary
statistics. Let’s do what we should have done at the start; look at the data:
[Four scatter plots, one per dataset (x1, y1), . . . , (x4, y4), each with essentially the same fitted line.]
Figure 2.1: Anscombe's Quartet
Clearly, the data is different for each dataset and the fitted linear model may be
inappropriate in some cases.
Note
• Looking at plots of the data before fitting is important.
• Using a summary statistic, such as R2, on its own can be dangerous.
2.5.2 Outliers
An outlier is an observation that does not conform to the general pattern of the
rest of the data. Outliers can occur due to, among other causes, errors in the data recording, the data being a mixture of two or more populations, or a model that requires improvement. These outlying cases may involve large residuals and often
have dramatic effects on the fitted function. Therefore, it is important to study these
outlying cases carefully.
A case may be outlying or extreme with respect to its Y value, its X value(s) or
both.
[Scatter plot of y against x with four labelled outlying cases, 1 to 4.]
Figure 2.2: Illustration of different types of outliers.
In the scatter plot (Fig. 2.2) case 1 is outlying with respect to its Y value, given X.
Cases 2, 3 and 4 are outlying with respect to their X values since they have much
larger X values; cases 3 and 4 are also outlying with respect to their Y values given X.
Not all outlying cases have a strong influence on the fit. Case 1 may not be too
influential because a number of other cases have similar X values. This will keep
the fitted function from being displaced too far by the outlying case. Likewise, case 2
may not be too influential because its Y value is consistent with the regression relation
displayed by the nonextreme cases. Cases 3 and 4, on the other hand, are likely to be
very influential in affecting the fit of the regression function.
Notes
For models with one or two predictor variables, it is relatively easy to identify
outlying cases, with respect to their X or Y values by means of box-plots, scatter
plots, residuals etc and to study whether they are influential in affecting the fitted
regression function. However, when more than two predictor variables are included
in the model, identification of outliers by simple plots becomes difficult.
Residuals
A practical method for detecting outliers is to look at observations with residuals that
are “large”. We have
e = (In − P)Y,
E(e) = 0,
cov(e) = σ2(In − P).
We now discuss some refined residuals for identifying observations with outlying Y values. Later, we shall also be interested in identifying cases that are outliers with respect to their X values. We shall assume a design matrix of full rank.
Standardized Residuals
To use residuals for detecting outlying observations, it is important to consider their variances — note that the variance of each residual may be substantially different. Therefore it is appropriate to consider the magnitude of each ei relative to its estimated standard deviation. An estimator of the standard deviation of ei is
√(σ̂2(1 − hii)),
where hii ≡ Pii. Recall that in practice σ2 is typically unknown. Then the standardized residual is
ri := ei / √(σ̂2(1 − hii)).
The standardized residuals ri have constant variance when the model is appropriate. Standardized residuals are often called internally studentized residuals.
Using the plug-in estimate for σ2 means that we lose exact normality; however, the standardised residuals should be approximately normal. A plot of ri against some other variable should not reveal any trends or patterns.
Residual Plots
[Three plots of standardised residuals against x: the first shows no pattern (fine); the second shows heteroscedastic error (non-constant variance); the third shows a non-linear trend.]
Leverages
We may be interested in how much each observation influences the model fit. For
instance, consider the residuals e where
var(ei) = σ2(1− hii).
An observation’s leverage is related to the variance of its residual.
Definition. The ith observation in a linear model has leverage equal to hii.
If an observation has leverage close to 1, its residual has a small variance. Notice
that the leverage only depends on the design matrix X.
In practice, the leverages are compared to their “average” value r/n, and we look for observations with hii > 2r/n, since ∑_{i=1}^n hii = trace(P) = rank(X) = r.
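A Python/NumPy sketch of this rule of thumb, with one deliberately extreme covariate value (hypothetical data):

```python
import numpy as np

# Hypothetical covariate values; the last one is made extreme on purpose.
x = np.array([0.0, 1.0, 2.0, 3.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
n, r = X.shape                      # full rank, so r = p = 2

P = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(P)                      # leverages h_ii
print(np.isclose(h.sum(), r))       # True: trace(P) = rank(X) = r
print(h > 2 * r / n)                # only the extreme point is flagged
```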
Deleted Residuals
Intuition:
[Scatter plot of y against x; one observation, labelled N, has a much larger x value than the rest.]
Figure 2.3: Illustration of observations with high leverage. The dotted line is a linear model fit including N; the solid line is a linear model fit without N.
Fit the model without the ith observation. Estimate (or predict) the ith observation
using xi; denote the expected value by Ŷ(i). The difference between the actual
observed value Yi and the estimated expected value Ŷ(i) is the deleted residual
denoted by di. Specifically,
di = Yi − Ŷ(i)
An algebraically equivalent expression for di that does not require recomputation of the fit omitting the ith case is
di = ei / (1 − hii).
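This identity can be verified numerically by actually refitting without case i. A Python/NumPy sketch on hypothetical data:

```python
import numpy as np

# Hypothetical data for a straight-line model.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])
n = len(y)

P = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - P @ y                       # ordinary residuals
h = np.diag(P)                      # leverages

i = 2                               # delete the third case
mask = np.arange(n) != i
beta_del = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
d_refit = y[i] - X[i] @ beta_del    # d_i by refitting without case i
d_formula = e[i] / (1 - h[i])       # the algebraic shortcut
print(np.isclose(d_refit, d_formula))  # True
```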
The estimated variance of the deleted residual for the ith case, di, is
σ̂2(i) (1 + xiT(X(i)TX(i))−1xi),
where xi is the predictor vector for the ith case, σ̂2(i) is the mean square error when the ith case is omitted, and X(i) is the design matrix with the ith row deleted. An equivalent expression for this variance is
σ̂2(i) / (1 − hii).
Note: The estimated variance of the deleted residual can be obtained by “predicting”
the ith observation and using section 2.4.
Studentized Deleted Residuals
As before, we can studentize the deleted residuals by dividing each by its estimated standard deviation. The studentized deleted residual, denoted by ti, is
ti = di / √(σ̂2(i)/(1 − hii)) = ei / √(σ̂2(i)(1 − hii)).
The studentized deleted residual ti is also called an externally studentized residual.
2.5.3 Cook’s Distance
Another way to measure the influence of an observation is to consider the change or influence it has on the estimator β̂. One such measure is Cook's distance
Ci = (β̂(i) − β̂)TXTX(β̂(i) − β̂) / (p σ̂2),
where β̂(i) is the estimator calculated without using the ith observation. The rule of thumb is to look at observations with Ci close to 1.
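A Python/NumPy sketch that computes Ci by refitting without each case in turn, and checks it against the standard identity Ci = ri² hii / (p(1 − hii)) (hypothetical data):

```python
import numpy as np

# Hypothetical data with one point that is extreme in both x and y.
x = np.array([0.0, 1.0, 2.0, 3.0, 20.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 30.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX = X.T @ X
betahat = np.linalg.solve(XtX, X.T @ y)
e = y - X @ betahat
sig2hat = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(XtX) @ X.T)

C = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_del = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    diff = b_del - betahat
    C[i] = diff @ XtX @ diff / (p * sig2hat)

# The identity linking C_i to the leverage and the standardised residual.
r_std = e / np.sqrt(sig2hat * (1 - h))
print(np.allclose(C, r_std**2 * h / (p * (1 - h))))  # True
```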
2.5.4 Distributional Checks
If Y1, . . . ,Yn are independent N(µ, σ2) distributed random variables, then
P(Yi ≤ y) = P
(
Yi − µ
σ
≤ y− µ
σ
)
= Φ
(
y− µ
σ
)
.
Then the empirical cdf
Fn(x) :=
1
n
n

i=1
I(Yi ≤ x)→ Φ
(
x− µ
σ
)
n→ ∞.
Therefore,
Φ−1(Fn(yi)) ≈ yi − µσ .
A plot of Φ−1(Fn(yi)) against yi should look like a straight line. This is called a Q-Q
plot.
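A sketch of computing Q-Q plot coordinates in Python. To avoid Φ−1(1) = ∞ at the largest observation, the common plotting positions (i − 1/2)/n are used in place of the raw empirical cdf (an assumption beyond the text); the stdlib NormalDist supplies Φ−1:

```python
# Q-Q plot coordinates for simulated N(5, 2^2) data: the points should
# fall near a line with slope sigma = 2 and intercept mu = 5.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
y = np.sort(rng.normal(loc=5.0, scale=2.0, size=200))

n = len(y)
probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# A least-squares line through (theoretical, y) estimates (sigma, mu).
slope, intercept = np.polyfit(theoretical, y, 1)
print(slope, intercept)                          # roughly 2 and 5
```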
2.5.5 Interpretation of Cook’s Distance
cookdistance-data.RData
We now investigate residuals, leverage and Cook's distance in more detail. Consider four artificial data sets, presented below, each with a normal linear model fitted. For each, a plot of the residuals against the leverages is also presented.
[Figure: four artificial data sets, each shown as a data plot with model fit next to a standardised-residuals-versus-leverage plot with Cook's distance contours. The four cases are labelled: 'Fine'; 'High Leverage, Low Residual'; 'Low Leverage, High Residual'; and 'High Leverage, High Residual'.]
The red data point is a suspicious data point i.e. either high leverage, high residual
(in absolute value) or both. The red line is the fit including the red data point, and
the black line is the fit excluding the red data point.
Note
A data point with a high leverage has the potential to change the model fit significantly. We also need to inspect the residual to see how much the point actually affects the fit. This is why we look at Cook's distance: it combines the leverage and the standardised residual, as it can be written equivalently as
Ci = ri² hii / ((1 − hii) r),
where ri is the standardised residual and r = rank(X).
Cook's distance is presented in the plots above in the form of red dashed contour lines; for instance, points lying between the 0.5 contours have a Cook's distance less than or equal to 0.5.
2.6 Worked Example
carbo.R carbo-data.RData
The data in Table 2.3 shows the percentage of total calories obtained from complex
carbohydrates, for twenty male diabetics who have been on a high-carbohydrate diet,
along with their age, weight and percentage of calories as protein.
# Carbohydrate Age Weight Protein
1 33 33 100 14
2 40 47 92 15
3 37 49 135 18
4 27 35 144 12
5 30 46 140 15
6 43 52 101 15
7 34 62 95 14
8 48 23 101 17
9 30 32 98 15
10 38 42 105 14
11 50 31 108 17
12 51 61 85 19
13 30 63 130 19
14 36 40 127 20
15 41 50 109 15
16 42 64 107 16
17 46 56 117 18
18 24 61 100 13
19 35 48 118 18
20 37 28 102 14
Table 2.3: Example Dataset
We take the Carbohydrate value as our response Yi, with age, weight and protein as the covariates. Then we fit a normal linear model with Yi ∼ N(µi, σ2) where
µi = E(Yi) = β1 + β2 agei + β3 weighti + β4 proteini (i = 1, . . . , 20).
> y <- dat$Carb #response Y is the Carbs.
> X <- cbind(1,dat$Age,dat$Weight,dat$Protein) #design matrix
Then, to find the maximum likelihood estimator of β we need to solve XTXβ̂ = XTy:
> beta.hat <- solve(t(X)%*%X)%*%t(X)%*%y
> t(beta.hat)
[,1] [,2] [,3] [,4]
[1,] 36.96006 -0.1136764 -0.2280174 1.957713
The unbiased estimator of the variance, RSS/(n − p), can be computed and used to obtain the standard error of each component of β̂.
> sig.sq.hat <- sum((y-X%*%beta.hat)^2)/(length(y)-length(beta.hat))
> sqrt(diag(sig.sq.hat*solve(t(X)%*%X)))
[1] 13.07128293 0.10932548 0.08328895 0.63489286
The residuals are
> ehat <- y-X%*%beta.hat
> summary(ehat)
V1
Min. :-10.3424
1st Qu.: -4.8203
Median : 0.9897
Mean : 0.0000
3rd Qu.: 3.8553
Max. : 7.9087
The coefficient of determination R2 and the adjusted R2 are
> R2
[,1]
[1,] 0.4805428
[,1]
[1,] 0.3831445
We can check these results using the lm function in R as follows:
> mylm1 <- lm(Carbohydrate~Age+Weight+Protein,data=dat)
> summary(mylm1)
Call:
lm(formula = Carbohydrate ~ Age + Weight + Protein, data = dat)
Residuals:
Min 1Q Median 3Q Max
-10.3424 -4.8203 0.9897 3.8553 7.9087
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.96006 13.07128 2.828 0.01213 *
Age -0.11368 0.10933 -1.040 0.31389
Weight -0.22802 0.08329 -2.738 0.01460 *
Protein 1.95771 0.63489 3.084 0.00712 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 5.956 on 16 degrees of freedom
Multiple R-squared: 0.4805, Adjusted R-squared: 0.3831
F-statistic: 4.934 on 3 and 16 DF, p-value: 0.01297
To compare two models using the F-test (Theorem 4), we can use the anova command (not to be confused with the ANOVA test):
> mylm2 <- lm(Carbohydrate~Age,data=dat)
> anova(mylm2,mylm1)
Analysis of Variance Table
Model 1: Carbohydrate ~ Age
Model 2: Carbohydrate ~ Age + Weight + Protein
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 1088.98
2 16 567.66 2 521.32 7.3469 0.005452 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Double-check that the F-statistic agrees:
> F
[,1]
[1,] 7.346886
and also check the corresponding p-values
> pf(F,df1=2,df2=16,lower.tail=FALSE)
[,1]
[1,] 0.005452024
Conclusion: Since the corresponding p-value is less than 0.01, there is sufficient evidence to reject the null hypothesis (corresponding to the smaller model) at the 1% level, and therefore accept the alternative. This means we should proceed with the larger model.
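The F statistic in the anova() table can also be recomputed by hand from the two residual sums of squares. A small Python check, with the values copied from the output above:

```python
# F = ((RSS_small - RSS_large)/(df_small - df_large)) / (RSS_large/df_large),
# using the RSS and residual degrees of freedom reported by anova().
rss_small, df_small = 1088.98, 18   # Carbohydrate ~ Age
rss_large, df_large = 567.66, 16    # Carbohydrate ~ Age + Weight + Protein

F = ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)
print(round(F, 4))                  # 7.3469, matching the anova() table
```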
Lastly, R can also produce some residual plots for us as follows:
> plot(mylm2)
[Figure: the four default lm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; observations 8, 11, 12 and 18 are flagged.]
2.7 Transformations
The way that the data are presented may not necessarily reflect the way we should
use the data in our linear models. Recall the prestige dataset — I used the logarithm
of income, rather than income directly. Learning to use transformations effectively is
part of data analysis. Fortunately, computational tools exist to help us transform the data.
Transformations of the response and predictors can improve the fit and correct
violations of model assumptions such as non-constant error variance.
2.7.1 Transforming the Response
Suppose that you are contemplating a logged response in a simple regression
situation:
log y = β0 + β1x+ e.
On the original scale of the response, this model becomes:
y = exp(β0 + β1x) exp(e)
In this model, the errors enter multiplicatively and not additively. Therefore, for
linear models the logged response model requires that we believe the errors enter
multiplicatively on the original scale.
After transforming the response, any interpretations will need to be expressed on the original scale. For example, in the logged model above, a prediction ŷ∗ on the log scale corresponds to exp(ŷ∗) on the original scale. If your prediction confidence interval on the logged scale was [l, u], then you would use [exp l, exp u]. When you use a log transformation on the response, the regression coefficients have a particular interpretation:
log ŷ = β̂0 + β̂1x1 + · · · + β̂pxp
ŷ = e^β̂0 e^(β̂1x1) · · · e^(β̂pxp)
An increase of one unit in x1 multiplies the predicted response (on the original scale) by exp(β̂1).
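This multiplicative interpretation can be illustrated with simulated data (Python/NumPy used here for illustration; all names and values are made up):

```python
# Fit a logged-response model and check that moving x from 3 to 4
# multiplies the back-transformed prediction by exactly exp(b1).
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 10, size=n)
y = np.exp(0.5 + 0.3 * x) * np.exp(rng.normal(scale=0.1, size=n))

X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]  # regress log y on x

pred = lambda x0: np.exp(b0 + b1 * x0)     # prediction on the original scale
ratio = pred(4.0) / pred(3.0)              # effect of a one-unit increase
print(np.isclose(ratio, np.exp(b1)))       # True by construction
```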
Box-Cox Transformation
boxcox.R
The Box-Cox method is a popular way to determine a transformation on the response
(and also the predictors). The idea is to work with the transformation:
y(λ) = (y^λ − 1)/λ for λ ≠ 0, and y(λ) = log(y) for λ = 0.
For fixed y > 0, y(λ) is continuous in λ. For y ≤ 0 the transformation is undefined.
The idea is to choose λ that maximises the likelihood.
By a change of variables, we obtain the distribution of y(λ); the resulting log-likelihood is:
ℓ(λ, β, σ2) = −(n/2) log(2πσ2) − (1/(2σ2)) (y(λ) − Xβ)T(y(λ) − Xβ) + (λ − 1) ∑_{i=1}^n log(yi).
Substituting the MLEs of β and σ2, for a fixed λ, we obtain the (profile) log-likelihood:
ℓ(λ) = c − (n/2) log(RSSλ/n) + (λ − 1) ∑_{i=1}^n log(yi), (2.6)
where c is a constant not involving λ and RSSλ is the residual sum of squares using y(λ) as the response.
The procedure is to evaluate the profile log-likelihood (2.6) for a range of possible
values of λ. Rather than selecting the maximum, one often rounds to a value such as
−1, 0, 1/2, 1 or 2, particularly if the profile log-likelihood is relatively flat around the
maximum.
Formally, let λ̂ denote the value that maximises the profile log-likelihood. We can
test the hypothesis H0 : λ = λ0 for any fixed value λ0 by calculating the likelihood
ratio
2(ℓ(λ̂) − ℓ(λ0)),
which is approximately χ²₁-distributed. This can also be used to construct confidence intervals.
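A sketch of evaluating the profile log-likelihood (2.6) on a grid of λ values, using NumPy and simulated data whose square root is linear in the covariate, so the maximising λ should land near 1/2 (all data and names here are illustrative assumptions):

```python
# Profile log-likelihood (2.6) on a lambda grid; the constant c is dropped.
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(1, 5, size=n)
y = (2.0 + 1.5 * x + rng.normal(scale=0.3, size=n)) ** 2   # sqrt(y) is linear in x
X = np.column_stack([np.ones(n), x])

def profile_loglik(lam):
    z = np.log(y) if lam == 0 else (y**lam - 1) / lam      # y^(lambda)
    rss = np.sum((z - X @ np.linalg.lstsq(X, z, rcond=None)[0]) ** 2)
    return -n / 2 * np.log(rss / n) + (lam - 1) * np.sum(np.log(y))

grid = np.linspace(-1, 2, 61)
vals = [profile_loglik(lam) for lam in grid]
lam_hat = grid[int(np.argmax(vals))]
print(lam_hat)                        # near 0.5, the square-root transformation
```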
Fortunately, we can maximise the log-likelihood and construct confidence intervals
numerically in R. We will need the boxcox function from the MASS package:
> library("MASS")
We shall demonstrate the boxcox function on the savings dataset
> library("faraway")
> data(savings)
> mylm <- lm(sr~pop15+pop75+dpi+ddpi,data=savings)
> boxcox(mylm,plotit=TRUE)
> boxcox(mylm,plotit=TRUE,lambda=seq(0.5,1.5,by=0.1))
[Figure: two plots of the profile log-likelihood against λ, over (−2, 2) and zoomed in to (0.5, 1.5), each with the 95% confidence interval marked.]
Figure 2.4: Output plots from the boxcox command: log-likelihood for Box-Cox transformations.
Not much can be seen from the first plot, so the range of λ is narrowed (second
plot). These plots also include the confidence interval for λ. The confidence interval
for λ is roughly (0.6, 1.4). We can see that there is no good reason to transform.
Some general considerations concerning the Box-Cox method are
1. The Box-Cox method is affected by outliers.
2. If maxi yi/mini yi is small, then the Box-Cox will not have much effect.
3. There is some doubt whether the estimation of λ counts as an extra parameter to be considered in the degrees of freedom.
Notes
• The Box-Cox method is not the only way of transforming — e.g. Yeo-Johnson
transformations.
• You can take a Box-Cox style approach for each of the predictors, choosing the transformation of each predictor that best improves the fit.
2.8 Contrasts for Categorical Variables
We have encountered different types of variables so far — continuous and
categorical. The way we handle categorical variables within a model needs some
careful attention which we now discuss. As a running example in this section, let’s
return to the Prestige data set. Suppose we want to fit the model:
[dropping the subscript i for now] prestige = β1 + β2 education + β3 type + e, e ∼ N(0, σ2).
The variable education is assumed to be continuous, whereas the occupation type is categorical, taking the levels: blue collar (bc), professional (prof), and white collar (wc).
Let’s first discuss the wrong way to include the occupation type into the linear model.
[START OF THE WRONG WAY]
Suppose we enter the occupation type into the linear model by encoding occupation
type numerically as shown below.
Label (used in dataset) Encoding
blue collar 1
professional 2
white collar 3
In this model we assume that all three occupations exhibit the same relationship
between education and prestige but that they start from different baselines. Using
the encoding scheme described in the table above yields the following equations for
the occupations.
blue collar: prestige = β1 + β2education+ β3
professional: prestige = β1 + β2education+ 2β3
white collar: prestige = β1 + β2education+ 3β3
This translates into the following illustrative diagram:
[Figure: three parallel lines of prestige against education, with equally spaced intercepts for the three occupation types.]
The numeric coding scheme we have used for occupation has imposed an unwanted
assumption/constraint on the location of the intercepts:
• While the location of the intercepts of blue collar and professional workers are
arbitrary, the location of white collar workers is not. The numeric coding scheme
has constrained, for a fixed level of education, the difference in prestige between
blue collar and professional workers to be exactly the same as the difference
between white collar and professional workers, β3.
• Furthermore the prestige difference between white collar and blue collar
workers, for fixed education level, is twice the difference between either one
and professional workers.
These assumptions may or may not be correct — in general we would prefer to test
this instead of assuming them through the postulated model. In order to correct this
problem, we require a more general encoding of the categories.
Note
Just changing the number assignments, i.e. instead of using 1, 2 and 3 use other
numbers, will lead to the same problem – just with a different set of constraints on
the intercepts. Therefore, we require a way of encoding that does not enforce any
constraints on the estimates of the model.
[END OF THE WRONG WAY]
The standard approach when handling a categorical variable with n distinct levels is
to create n− 1 different regressors. These are often called dummy regressors.
If the Prestige data set is read into R using the read.table function, the character
variable occupational type is automatically converted to a variable of class type
"factor". When a factor is used in a linear model it automatically employs the correct
number of dummy regressors. The coding used is termed a contrast and in R the
default contrast type for a factor is "treatment" and denoted contr.treatment. This
produces the two dummy regressors X1 and X2 shown below.
X1 X2
bc 0 0
prof 1 0
wc 0 1
Note
By default R orders the levels of the character variable and defines the dummy coding
levels in alphabetical order. As a result the first level alphabetically becomes the
baseline level.
If we fit a normal linear model in R with prestige as the response and education and
occupation type as predictors by lm(prestige∼education+type,data=Prestige) the
following model is fitted:
prestige = β1 + β2education+ β3X1 + β4X2 + e.
• In the case X1 = 0 and X2 = 0 for blue collar workers, the model is
prestige = β1 + β2education+ e.
• In the case X1 = 1 and X2 = 0 for professional workers, the model is
prestige = (β1 + β3) + β2education+ e.
• In the case X1 = 0 and X2 = 1 for white collar workers, the model is
prestige = (β1 + β4) + β2education+ e.
These fitted models are illustrated in the following diagram:
[Figure: three parallel lines of prestige against education, with freely located intercepts for the three occupation types.]
Since β1, β3, and β4 are estimated separately it is clear from the diagram above there
are no constraints placed on the location of the intercepts.
It is also clear why we only need n − 1 regressors to independently describe n
levels of a categorical variable. In the additive model, the categorical variable levels
just serve to change the intercept of the regression line. Since there is already an
intercept in the model, it is used for one of the levels of the categorical variable.
2.8.1 Other Codings
There are many possible coding schemes for categorical variables other than the
treatment contrast. To simplify matters we shall consider a model in which the
only predictor is the variable occupation. Thus in R notation we fit the model:
lm(prestige∼type,data=Prestige). For the following coding schemes, we represent
the occupation type using two regressors.
Dummy Coding
As explained above with dummy coding the occupation variable is converted into
dummy variables X1 and X2 whose values for the various levels of occupation were
listed above. By default the level that is first alphabetically becomes the baseline level.
This is the contr.treatment coding scheme of R. If we specify the regression model
lm(prestige∼type,data=Prestige), R fits the model:
prestige = β1 + β2X1 + β3X2 + e
The model equation can be used to predict the mean as a function of X1 and X2.
When we choose values for X1 and X2 that correspond to the various occupations,
we obtain the means of those professions. More precisely, with the dummy coding
scheme we obtain the following equations for the means:
µbc = β1
µprof = β1 + β2
µwc = β1 + β3
By subtracting the mean for blue collar workers from each of the other equations we
obtain expressions for β2 and β3.
µbc = β1
µprof − µbc = β2
µwc − µbc = β3
Thus in the dummy coding scheme each coefficient measures a difference in
(conditional) mean between one classification level and the classification level that was
chosen as baseline. The intercept corresponds to the mean of the baseline classification
level.
Deviation Coding
In R deviation coding is denoted contr.sum. To assign this contrast to the variable
occupation type we would use the following statement:
contrasts(Prestige$type) <- "contr.sum"
The deviation coding scheme used in R yields the two dummy regressors shown
below:
X1 X2
bc 1 0
prof 0 1
wc -1 -1
If we run lm(prestige∼type), R fits the model:
prestige = β1 + β2X1 + β3X2 + e.
As before, the regression model estimates the conditional mean of y, conditional on
the values that are specified for the predictors. Thus when we choose values for X1
and X2 that correspond to the various professions, we obtain the means of those
professions. The deviation coding scheme yields the following equations for the
means.
µbc = β1 + β2
µprof = β1 + β3
µwc = β1 − β2 − β3
Further, summing these equations gives
µbc + µprof + µwc = 3β1, so that β1 = (µbc + µprof + µwc)/3.
Thus the intercept in deviation coding corresponds to the mean of all three levels.
From the equations for blue collar and professional and using the formula for β1
above we immediately obtain interpretations for β2 and β3.
β2 = the difference between the mean for blue collar and the overall mean
β3 = the difference between the mean for professional and the overall mean
Thus in deviation coding the coefficients measure the distance between individual
levels and the mean of all the levels.
Helmert Coding
In R, Helmert coding is denoted contr.helmert. Assigning this contrast to the variable occupation type is done with the following statement.
contrasts(Prestige$type) <- "contr.helmert"
The Helmert coding scheme yields the two dummy regressors shown below
X1 X2
bc -1 -1
prof 1 -1
wc 0 2
As before when we specify the regression model lm(prestige∼type), R actually fits
the model: prestige = β1 + β2X1 + β3X2. The Helmert coding scheme yields the
following equations for the means.
µbc = β1 − β2 − β3
µprof = β1 + β2 − β3
µwc = β1 + 2β3
As with deviation coding, if we add all three equations together we isolate β1 and
find that the intercept represents the mean of all the levels.
β1 = (µbc + µprof + µwc)/3.
We can isolate β2 by subtracting the blue collar mean from the professional mean.
µprof − µbc = 2β2, so that β2 = (µprof − µbc)/2.
The best way to interpret this coefficient is to observe that if β2 is not significantly
different from zero we would conclude that the mean prestige for professionals and
the mean prestige for blue collar workers are not significantly different from each
other.
We can isolate β3 by taking the average of the blue collar and professional means
and then subtracting the result from the white collar mean.
µwc − (µbc + µprof)/2 = 3β3, so that β3 = (1/3)( µwc − (µbc + µprof)/2 ).
The best way to interpret this coefficient is to observe that if β3 is not significantly
different from zero we would conclude that the mean prestige of white collar workers
is not significantly different from the mean prestige of blue collar and professional
workers together.
Thus Helmert coding compares the current level with the average of all the levels that preceded it. Helmert contrast coding is therefore especially appropriate when there is a natural order to the categories, because then sequential comparisons of this sort make sense.
Chapter 3. Generalized Linear Models
Motivating Example
beetle.R beetle-data.RData
Let us consider the following dataset on the number of beetles that died after five hours of exposure to gaseous carbon disulphide at various dose levels. Can the dosage be used to explain the proportion of deaths?
Dose xi Number ni Number Killed yi
1.6907 59 6
1.7242 60 13
1.7552 62 18
1.7842 56 28
1.8113 63 52
1.8369 59 53
1.861 62 61
1.8839 60 60
[Figure: scatter plot of the proportion killed (yi/ni) against dose (xi).]
Here the response of interest is the proportion of deaths, which must lie between 0 and 1. Thus, a linear model will not explain the data adequately; we need a different class of model, in this case one where the fitted values remain between 0 and 1. In this chapter, we discuss the class of generalized linear models (GLMs), which is able to give a suitable fit. For this example, a model fitted using a generalized linear model is presented below.
[Figure: scatter plot of the proportion killed (yi/ni) against dose (xi) with the fitted GLM curve overlaid.]
3.1 Specification of Generalized Linear Models
3.1.1 The components of GLMs
NORMAL LINEAR MODELS
1. Random Component
The components of Y have independent normal distributions with E(Y ) = µ and common variance σ2.
2. Systematic Component
Using the covariates x1, . . . ,xp, form the linear predictor
η = Xβ = ∑_{j=1}^p xjβj.
The link between the random and systematic components is η = µ.

GENERALIZED LINEAR MODELS
1. Random Component
The components of Y are independent and have distributions of the same form. This distribution is a member of an exponential family with E(Y ) = µ.
2. Systematic Component
Using the covariates x1, . . . ,xp, form the linear predictor
η = Xβ = ∑_{j=1}^p xjβj.
The link between the random and systematic components is ηi = g(µi) for i = 1, . . . , n.
In GLMs
• The random component specifies the probability distribution of the response
variables. Specifically, the components of y have pdf or pmf from an exponential
family of distributions (see section 3.1.2), with E(Y ) = µ.
• The systematic component specifies a linear predictor η = Xβ as a function of
the covariates and the unknown parameters.
• The link function g may be any monotonic differentiable function. The link
function provides a functional relationship between the systematic component
and the expectation of the response in the random component; namely η = g(µ)
3.1.2 Exponential Families
Definition. Consider a random variable Y whose probability density/mass function
depends on the two parameters θ and φ. The distribution of Y is said to be a member
of an exponential family if its probability density function can be written in the form
f(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) }, (3.1)
for some specific functions a(·), b(·) and c(·). If φ is known, this is an exponential
family model with canonical parameter θ.
The function a(φ) commonly takes the form
a(φ) = φ/w,
where φ is called the dispersion parameter that is constant over all observations and
w is a known prior weight which can vary across the observations. However, the
practical cases of interest dealt with in this course will use the form a(φ) = φ/w,
where in most cases w = 1.
Example Consider Y ∼ N(µ, σ2). The density function for Y can be written in
exponential family form

Example Consider Y ∼ Poisson(λ). The mass function for Y can be written in
exponential family form

Example Consider Y ∼ Inverse Gaussian(µ, λ). Its probability density function can be written in exponential family form as follows:
√(λ/(2πy³)) exp{ −λ(y − µ)²/(2µ²y) }
= exp{ −λ(y − µ)²/(2µ²y) + (1/2) log( λ/(2πy³) ) }
= exp{ [ y(−1/(2µ²)) + 1/µ ] / (1/λ) − λ/(2y) + (1/2) log( λ/(2πy³) ) },
where we identify θ = −1/(2µ²), φ = 1/λ and
a(φ) = φ, b(θ) = −(−2θ)^(1/2), c(y, φ) = −(1/2)( log(2πφy³) + 1/(φy) ).
Note that if φ is unknown, it is called a nuisance parameter of the distribution
as it will interfere with inference of µ or θ. We need to be wary about this when
performing hypothesis tests or constructing confidence intervals (see later in section
3.3).
Properties of exponential family distributions
The mean and variance of an exponential family distribution are easily derived as follows. From the definition of any probability density function,
∫ f(y; θ, φ) dy = 1.
Under regularity conditions, we can differentiate with respect to θ to obtain
(∂/∂θ) ∫ f(y; θ, φ) dy = ∫ (∂/∂θ) f(y; θ, φ) dy = (∂/∂θ) 1 = 0. (3.2)
If we differentiate again with respect to θ, we get
∫ (∂²/∂θ²) f(y; θ, φ) dy = 0. (3.3)
These general results can be applied to distributions from an exponential family. From (3.1), recall that the general form of the pdf from an exponential family is:

so that
(∂/∂θ) f(y; θ, φ) =

Then by (3.2) we find
by definition of the expected value. Thus
µ ≡ E(Y) = b′(θ). (3.4)
Next, we have
(∂²/∂θ²) f(y; θ, φ) = −( b″(θ)/a(φ) ) f(y; θ, φ) + ( (y − b′(θ))/a(φ) )² f(y; θ, φ). (3.5)
Plugging this into (3.3) gives
Rearranging yields
var(Y) = b′′(θ)a(φ).
The function b′′(θ) is referred to as the variance function. The variance function
considered as a function of µ will be denoted as V(µ). To summarise
E(Y) ≡ µ = b′(θ)
var(Y) = b′′(θ)a(φ) ≡ V(µ)a(φ)
Warning! In general, the variance function is not the same as the variance of Y.
Example Suppose Y ∼ N(µ, σ2). Recall that the exponential family form for the pdf
of Y is defined as θ = µ, φ = σ2 with a(φ) = φ and b(θ) = θ2/2. Then E(Y) and
var(Y) are easily obtained:

Example Suppose Y ∼ Poisson(λ). Then for any λ > 0 and y ∈N
f (y;λ) = exp(y logλ− λ− log(y!)),
where we identify θ = logλ, a(φ) = 1 and
b(θ) = exp(θ) and c(y, φ) = − log(y!).
We can check the mean and variance:

Therefore, we are able to compute the mean and variance of any exponential
family distribution just by looking at the form of its probability density function and
differentiating b.
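For instance, in the Poisson case b(θ) = exp(θ), so b′(θ) = b″(θ) = exp(θ) = λ, matching E(Y) = var(Y) = λ. A quick finite-difference check in Python (illustrative only):

```python
# Check E(Y) = b'(theta) and var(Y) = b''(theta)a(phi) for the Poisson
# family, where b(theta) = exp(theta), a(phi) = 1 and theta = log(lambda).
import numpy as np

lam = 3.7
theta = np.log(lam)
b = np.exp                              # b(theta) = exp(theta)

eps = 1e-5                              # central finite differences
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)          # b'(theta)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2  # b''(theta)

print(np.isclose(b1, lam, atol=1e-3), np.isclose(b2, lam, atol=1e-3))
```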
Recall that in the specification of GLMs, the function g describes the link between the expected response µ ≡ E(Y ) and the linear predictor η = Xβ. We introduced the link function,
ηi = g(µi) for i = 1, . . . , n,
as any monotonic differentiable function. Recall that a function g is monotonically increasing (or decreasing) if for all z1 and z2 with z1 ≤ z2, we have g(z1) ≤ g(z2) (or g(z1) ≥ g(z2)).
Example Suppose that Y1, . . . ,Yn are independent random variables such that Yi ∼ Poisson(λi) for some λi > 0. Then µi ≡ E(Yi) = λi > 0. It is plausible that the linear predictor ηi = ∑_{j=1}^p xijβj can take any value in R. Thus using the identity link function, i.e. g(u) = u, is not recommended, as ηi may be negative while λi is always positive. Therefore, using another link function g is required.
For exponential family distributions, there is a natural or canonical choice of link
function. These canonical links occur when
θ = η
where θ is the canonical parameter (see Definition 3.1.2). Recall that in the
GLM specification, the link function can be any monotonic differentiable function.
Therefore, using the canonical link is not necessary. However, as we shall see later,
using the canonical link leads to some desirable results (see later in section 3.2).
Example of GLMs
Example Suppose Y1, . . . ,Yn are independent random variables each with a Poisson
distribution and
E(Yi) = exp(β1 + β2xi)
Then this is a GLM as the Poisson distribution can be written in exponential family
form and the link function g(z) = log(z) is monotonic and differentiable with linear
predictor η = Xβ.
Example Suppose Y1, . . . ,Yn are independent N(µi, σ2) random variables where
µi = β1 + β2xi
Then this is a GLM as the Normal distribution can be written in exponential family
form and the link function g(z) = z (the identity) is monotonic and differentiable.
The generalization: from normal linear models to GLMs
Notice the two generalizations:
• The distribution of the response can be any member of the exponential family
— not just a normal distribution.
• The link function, connecting the mean of the response and the linear predictor
η = Xβ, may be any monotonic differentiable function — not only the identity
function.
These generalizations lead to more complicated estimation and inference procedures,
as we shall see.
3.2 Estimation
So far we have just introduced the class of models called the GLMs. Now we discuss
how to estimate the unknown parameter β in a GLM. Recall the estimation procedure
for normal linear models; we want to find β̂ by maximising the log-likelihood. The
log-likelihood for GLMs is
ℓ(β; φ, y) = ∑_{i=1}^n { (yiθi − b(θi))/a(φ) + c(yi, φ) }
and depends on β through:
µi = b′(θi), g(µi) = ηi, ηi = ∑_{j=1}^p xijβj, for i = 1, . . . , n.
The approach used to find the estimator of β for GLMs is the same as before; namely, to maximise the log-likelihood by solving
∂ℓ/∂β = 0.
However, for GLMs, this (generally) involves solving a non-linear set of equations.
Therefore, we use an iterative method to obtain a numerical solution.
Before we present the maximum likelihood estimation procedure for GLMs, we
require a brief detour to introduce the numerical method and some maximum
likelihood theory.
3.2.1 Maximum Likelihood Theory
Suppose that Y = (Y1, . . . ,Yn)T has a pdf or pmf f (y;ψ) for y ∈ Rn and parameter
vector ψ ∈ Rp. Then likelihood function is a function of ψ for each fixed y given by
L(ψ;y) = f (y;ψ).
Similarly, the log-likelihood function is given by
`(ψ;y) = log L(ψ;y) = log f (y;ψ).
We then have the following:
Score Vector: U (ψ; y) = ∇ℓ(ψ; y), where ∇ = ( ∂/∂ψ1, . . . , ∂/∂ψp )T, so that
Uj(ψ; y) = (∂/∂ψj) ℓ(ψ; y), j = 1, . . . , p.
Observed Information Matrix: I = −∇∇Tℓ(ψ; y), so that
Ijk(ψ; y) = −( ∂²/∂ψj∂ψk ) ℓ(ψ; y).
Fisher's Information Matrix: J (ψ) = E( I(ψ; Y ) ).
If Y1, . . . ,Yn are iid with pdf or pmf f(yi; ψ), yi ∈ R, then
L(ψ; y1, . . . , yn) = ∏_{i=1}^n L(ψ; yi) and ℓ(ψ; y1, . . . , yn) = ∑_{i=1}^n ℓ(ψ; yi).
Example Suppose Y1, . . . ,Yn iid∼ Exponential(1/µ) for some µ > 0. We have
f(yi; µ) = (1/µ) exp(−yi/µ), yi ≥ 0.
ℓ(µ; y1, . . . , yn) =
U = (∂/∂µ) ℓ(µ; y) =
I = −(∂²/∂µ²) ℓ(µ; y1, . . . , yn) =
J =
Example Suppose Y1, . . . ,Yn ∼ N(µi, σ2) independently, with σ2 > 0 known and some µi ∈ R. We have
f(yi; µi) = (1/√(2πσ2)) exp( −(yi − µi)²/(2σ2) )
f(y1, . . . , yn; µ) = ∏_{i=1}^n (1/√(2πσ2)) exp( −(yi − µi)²/(2σ2) )
ℓ(µ; y1, . . . , yn) =
Uj = (∂/∂µj) ℓ(µ; y) =
Ijk =
Jjk =
Theorem 5. Let Y = (Y1, . . . ,Yn)T have a pdf or pmf f (y;ψ) for y ∈ Rn and
parameter vector ψ ∈ Rp. Then, under certain regularity conditions,
E(U ) = 0 and cov(U ) = E(UUT) = J .
Proof. For j = 1, . . . , p we have
E(Uj) =
It then follows that cov(U ) = E(UUT) as E(U ) = 0. Thus we only need to show that
E(UUT) = J = E(I). See problem sheet for remainder of proof
We now return to exponential families. We shall write xij = (X)ij; also recall that the
parameter of interest is β ∈ Rp.
Theorem 6. Suppose we have Y1, . . . ,Yn independent random variables from an exponential family with means µ1, . . . , µn and variance functions V(µ1), . . . ,V(µn), with common dispersion parameter φ. Then for j = 1, . . . , p and k = 1, . . . , p,
Uj = ∑_{i=1}^n [ ( (yi − µi)/var(Yi) ) xij (∂µi/∂ηi) ] and Jjk = ∑_{i=1}^n [ ( xijxik/var(Yi) ) (∂µi/∂ηi)² ].
Proof. The likelihood function for Y1, . . . ,Yn is
L(β; y1, . . . , yn) =
where θ(µi) = θ(g−1(ηi)) and ηi = ∑_{j=1}^p xijβj. Therefore, the log-likelihood is
ℓ(β; y) =
For the score function,
Uj = ∂ℓ(β; y)/∂βj = ∑_{i=1}^n ∂ℓi/∂βj for j = 1, . . . , p,
we decompose the partial derivative as
∂ℓi/∂βj = (∂ℓi/∂θi)(∂θi/∂µi)(∂µi/∂βj),
where θi ≡ θ(µi).
Now consider each partial derivative in turn. First,
∂ℓi/∂θi =
Second,
∂θi/∂µi =
Third,
∂µi/∂βj =
Inserting these into the score Uj yields
Uj = ∑_{i=1}^n [ ( (yi − µi)/var(Yi) ) xij (∂µi/∂ηi) ], (3.6)
as required.
For the Fisher information matrix, J = E(UUT), we use (3.6) to get
Jjk =
where we used that the observations are independent.
These results can be succinctly written as
U = XTWc, (3.7)
where the components of c are given by
ci = (yi − µi) ∂ηi/∂µi,
and the Fisher information matrix can be represented as
J = XTWX, (3.8)
with W ∈ Rn×n a diagonal matrix with entries
wii = (1/var(Yi)) (∂µi/∂ηi)².
Recall that to find the MLE, β̂, the procedure is to differentiate the log-likelihood with respect to β and set it equal to zero:
∂ℓ/∂β = 0.
This is equivalent to solving Uj = 0 for j = 1, . . . , p. Therefore, the maximum likelihood equations are
∑_{i=1}^n [ ( (yi − µi)/var(Yi) ) xij (∂µi/∂ηi) ] = 0, for j = 1, . . . , p. (3.9)
In general, this is a non-linear system of equations in β and a numerical solution is
required. We proceed to solve (3.9) using a variant of the Newton-Raphson algorithm.
57
Algorithm. Newton-Raphson: For f(x) : X ⊆ R^p → R; x ↦ f(x), the iteration

x^(m+1) = x^(m) − H⁻¹g, where H = ∂²f(x)/∂xi∂xj |_{x=x^(m)}, g = ∂f(x)/∂x |_{x=x^(m)},

will converge to x⋆ = argmax_x f(x) (under regularity conditions).

Using the Newton-Raphson algorithm for ℓ(β; y) yields:

β̂^(m+1) = β̂^(m) + ( I^(m) )⁻¹ U^(m),

where U^(m) is the score vector evaluated at β = β̂^(m), and similarly for the observed information matrix I^(m). In maximum likelihood estimation it is common to estimate I by E(I) = J, i.e. Fisher's information matrix∗. The result is the Fisher scoring algorithm:

β̂^(m+1) = β̂^(m) + ( J^(m) )⁻¹ U^(m).   (3.10)
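As a toy illustration of the Newton-Raphson update, the sketch below maximises a one-parameter Poisson log-likelihood in θ = log λ, for which ℓ(θ) = ∑(yi θ − e^θ) up to a constant. The data are hypothetical; the iteration x ← x − H⁻¹g must recover the sample mean, the known MLE of a Poisson mean.

```python
import math

# Hypothetical count data; the MLE of a Poisson mean is the sample mean.
y = [2, 4, 6, 7, 8, 10, 11, 12, 14]
n, s = len(y), sum(y)

theta = 0.0                               # theta = log(lambda), initial guess
for _ in range(50):
    grad = s - n * math.exp(theta)        # g = dl/dtheta
    hess = -n * math.exp(theta)           # H = d2l/dtheta2 (negative: l is concave)
    step = grad / hess
    theta -= step                         # x <- x - H^{-1} g
    if abs(step) < 1e-12:
        break

lam_hat = math.exp(theta)
print(lam_hat)   # equals the sample mean of y
```

The first step overshoots from a poor starting value, but the concavity of ℓ in θ brings the iteration back and it then converges quadratically.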
Continuing to find simpler representations for the iterations of the scoring algorithm: plugging (3.7) and (3.8) into (3.10) gives

β̂^(m+1) = β̂^(m) + (X^TWX)⁻¹X^TWc = (X^TWX)⁻¹X^TW( Xβ̂^(m) + c ) = (X^TWX)⁻¹X^TWz,   (3.11)

where z = Xβ̂^(m) + c, and W and c are evaluated at β̂^(m). These iterated weighted least squares (IWLS) equations (3.11) may be viewed as the weighted least squares equations of the linear model

z = Xβ + c, where c = (y − µ)(∂η/∂µ) ∼ N(0, W⁻¹).   (3.12)

To see this, start by rewriting (3.12) as

z = Xβ + W^{−1/2}ε, ε ∼ N(0, In).

Then multiplying on the left by W^{1/2},

z̃ := W^{1/2}z = W^{1/2}Xβ + ε = X̃β + ε, ε ∼ N(0, In),
∗see problem sheet
58
which is of the form of a normal linear model. We can write the estimate down directly:

β̂ = (X̃^TX̃)⁻¹X̃^Tz̃ = (X^TWX)⁻¹X^TWz,

which is similar to (3.11). This motivates the iterative weighted least squares (IWLS) algorithm.
Algorithm 3.1 Iterative weighted least squares (IWLS) algorithm
1: Given a current estimate β̂, form the linear predictor η̂ and the fitted values µ̂.
2: Form the adjusted dependent variable
   zi = η̂i + (yi − µ̂i) ∂η/∂µ |_{µ=µ̂i}.
3: Form the estimated weights w̃ii by
   w̃ii⁻¹ = (∂η/∂µ)² V(µ) |_{µ=µ̂i}.
4: Regress zi on xi with weights w̃ii and obtain the new estimate β̂.
5: Repeat steps 1 to 4 until convergence.

Note the relationship between the two weights: w̃ii = φ wii¶ (see below).
Lastly, we have that

cov(β̂) = cov{ (X^TWX)⁻¹X^TWz }
        = [ (X^TWX)⁻¹X^TW ] cov(z) [ (X^TWX)⁻¹X^TW ]^T, with cov(z) = W⁻¹,
        = (X^TWX)⁻¹
        = φ(X^TW̃X)⁻¹
        = J⁻¹.   (3.13)

Notice that J involves the dispersion parameter φ, which needs to be known in order to compute J.
We estimate cov(β̂) by "plugging in" the last W^(m) from the IWLS algorithm to get

cov(β̂) ≈ (X^TW^(m)X)⁻¹.
59
To be clear¶ why we are using two weights: recall that we want to solve the maximum likelihood equations

Uj = ∑_{i=1}^n [ (yi − µi)/var(Yi) · xij · (∂µi/∂ηi) ] = 0, for j = 1, . . . , p.   (3.9 revisited)

Since we have that var(Yi) = a(φ)V(µi) = φV(µi), solving (3.9) is equivalent to solving

∑_{i=1}^n [ (yi − µi)/V(µi) · xij · (∂µi/∂ηi) ] = 0, for j = 1, . . . , p.

Therefore, within the IWLS algorithm (Algorithm 3.1) we use

w̃ii = (1/V(µ)) (∂µ/∂η)² |_{µ=µ̂i}

rather than

wii = (1/var(Yi)) (∂µ/∂η)² |_{µ=µ̂i} = (1/(φV(µ))) (∂µ/∂η)² |_{µ=µ̂i}.
60
Example calculations
Example For independent Y1, . . . , Yn with Yi ∼ Binomial(ni, πi), for known ni,

f(yi; πi) = (ni choose yi) πi^{yi} (1 − πi)^{ni − yi}.

Then, working with the canonical link:
µi ≡ E(Yi) = ni πi
θi = log( πi/(1 − πi) ) = log( µi/(ni − µi) )
b(θi) = ni log(1 + e^{θi})
b′(θi) = ni e^{θi}/(1 + e^{θi}) = µi
b″(θi) = ni e^{θi}/(1 + e^{θi})² = µi(ni − µi)/ni
V(µi) = µi(ni − µi)/ni
ηi = θi = log( µi/(ni − µi) )
∂ηi/∂µi = ni/( µi(ni − µi) )
zi = η̂i + (yi − µ̂i) ni/( µ̂i(ni − µ̂i) )
wii⁻¹ = ni/( µ̂i(ni − µ̂i) )

61
Example For independent Y1, . . . , Yn with Yi ∼ Normal(ζi, σ²) for known σ² > 0,

f(yi; ζi) = (1/√(2πσ²)) exp{ −(yi − ζi)²/(2σ²) }.

Then, working with the canonical link:
µi ≡ E(Yi) = ζi
θi = µi
a(φ) = φ = σ²
b(θi) = θi²/2
b′(θi) = θi = µi
b″(θi) = 1
V(µi) = 1
ηi = µi (the canonical link is the identity)
∂η/∂µ = 1
zi = η̂i + (yi − µ̂i) = yi
wii⁻¹ = σ²

62
Example For independent Y1, . . . , Yn where Yi ∼ Poisson(λi),

f(yi; λi) = e^{−λi} λi^{yi} / yi!.

Then, working with the identity link:
µi ≡ E(Yi) = λi
θi = log µi
a(φ) = 1
b(θi) = e^{θi}
b′(θi) = e^{θi} = µi
b″(θi) = e^{θi} = µi
V(µi) = µi
ηi = µi (note the identity link is not the canonical link here)
∂ηi/∂µi = 1
zi = η̂i + (yi − µ̂i) = yi
wii⁻¹ = µ̂i

63
Example in R
poisson.R poisson-data.RData
Consider the following artificial dataset:

 i   1   2   3   4   5   6   7   8   9
 y   2   4   6   7   8  10  11  12  14
 x  -1  -1   0   0   0   0   1   1   1

Table 3.1: Example Dataset
[Figure: scatter plot of y against x.]
A possible model may be one using a Poisson distribution as the response distribution. Let's model the relationship between Yi and xi as

µi ≡ E(Yi) = β1 + β2 xi, for i = 1, . . . , 9,

and Yi ∼ Poisson(µi). Thus we have taken the link function to be the identity; then

∂µi/∂ηi = 1 and wii = 1/var(Yi) = 1/(β1 + β2 xi).
The design matrix is

X =
  1  x1        1  -1
  1  x2        1  -1
  .   .    =   .   .
  .   .        .   .
  1  x9        1   1
64
Start by forming the response and covariate vectors
> y <- dat$y
> x <- dat$x
We can then fit the model using the iterative weighted least squares algorithm
(Algorithm 3.1).
> beta <- c(20,4) #initial guess
> for (i in 1:25){
+ eta <- cbind(1,x)%*%beta #estimated linear predictor
+ mu <- eta #estimated mean response
+ z <- y #form the adjusted variate
+ w <- 1/mu #weights
+ lmod <- lm(z~x, weights=w) #regress z on x with weights w
+ beta <- as.numeric(lmod$coeff) #new beta
+ print(beta) #print out the beta estimate every iteration
+ }
[1] 7.703704 4.666667
[1] 7.701896 4.682932
[1] 7.701886 4.683026
[1] 7.701886 4.683027
...
The algorithm converges to β̂ ≈ (7.702, 4.683).
[Figure: scatter plot of y against x with the fitted mean line µ̂ = 7.702 + 4.683x overlaid.]
Using (3.13) we can work out the estimated standard error of these parameters:
65
> X <- cbind(1,x)
> J <- t(X)%*%diag(1/as.vector(mu))%*%X
> invJ <- solve(J)
> sqrt(as.vector(diag(invJ)))
[1] 0.9020884 1.1317672
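The same IWLS iteration and standard-error calculation can be reproduced outside R. Below is a standalone Python (numpy) sketch of Algorithm 3.1 for the identity-link Poisson model above; it is an illustration, not the code used in the notes.

```python
import numpy as np

# Data from Table 3.1.
y = np.array([2, 4, 6, 7, 8, 10, 11, 12, 14], dtype=float)
x = np.array([-1, -1, 0, 0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta = np.array([20.0, 4.0])             # initial guess, as in the R code
for _ in range(25):
    mu = X @ beta                        # identity link: mu = eta
    z = y                                # adjusted variate: eta + (y - mu)*1 = y
    W = np.diag(1.0 / mu)                # weights 1/V(mu) = 1/mu for Poisson
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

# Standard errors from (3.13): sqrt of the diagonal of J^{-1}.
J = X.T @ np.diag(1.0 / (X @ beta)) @ X
se = np.sqrt(np.diag(np.linalg.inv(J)))
print(beta)   # matches R: approx [7.7019, 4.6830]
print(se)     # matches R: approx [0.9021, 1.1318]
```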
In R there is a function, glm, that fits the model for us automatically:
> myglm <- glm(y ~ x, family = poisson(link = "identity"))
> summary(myglm)
Call:
glm(formula = y ~ x, family = poisson(link = "identity"))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6382 -0.4012 -0.1100 0.4495 0.7913
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.7019 0.9021 8.538 < 2e-16 ***
x 4.6830 1.1318 4.138 3.51e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 16.4022 on 8 degrees of freedom
Residual deviance: 2.1658 on 7 degrees of freedom
AIC: 40.682
Number of Fisher Scoring iterations: 4
The parameter estimates and their standard errors agree, which should be no surprise, as the estimation procedures are exactly the same.
66
Notes
• The estimation procedure may fail (e.g. see later in Section 3.8.3) – in R the
maximum number of iterations is 25 by default.
• The convergence criterion used by the glm command in R will be discussed later
in Section 3.6.1.
• A difference between the results may be due to a different initial starting point.
• Notice that the dispersion parameter, φ, is recognised by R as 1.

67
3.3 Inference
As in the inference section for linear models, section 2.3, we discuss how to construct
confidence intervals and perform hypothesis tests. In particular, for hypothesis testing
we shall discuss how to compare two related models. For GLMs, we say that two
models are related if
1. the distribution of the response Y is the same; and
2. the same link function is used.
The models differ in the number of parameters i.e. the dimensionality of β. To
compare models we require a measure of their goodness of fit. We shall present
goodness of fit statistics based on the log-likelihood function.
3.3.1 Sampling Distributions for GLMs
In this section, we discuss the sampling distributions† relevant to GLMs. We shall use
the following notion: under appropriate regularity conditions, which are satisfied for
generalized linear models, if S is a statistic of interest, then
(S − E(S))/√var(S) ·∼· N(0, 1)

or equivalently

(S − E(S))²/var(S) ·∼· χ²₁,
where we shall use ·∼· to denote “approximately distributed as”.
If there is a vector of statistics of interest
S = (S1, . . . , Sp)^T
with asymptotic expectation E(S) and asymptotic variance-covariance matrix Q, then
asymptotically
S − E(S) ·∼· N(0,Q) (3.14)
and further, asymptotically
(S − E(S))T Q−1 (S − E(S)) ·∼· χ2p (3.15)
provided Q is non-singular so that Q−1 exists and is unique.
†The sampling distribution of a statistic is the distribution of that statistic, considered as a random
variable, when derived from a random sample
68
To be clear, these approximations follow from the central limit theorems:
Central Limit Theorem
Suppose A1, . . . , An are iid random variables with E(Ai) = µ and var(Ai) = σ2 (both
finite). Then

√n ( (1/n) ∑_{i=1}^n Ai − µ ) / σ  →d  N(0, 1) as n → ∞,

where →d denotes convergence in distribution.
Multivariate Central Limit Theorem
If A1, . . . , An are iid random vectors with E(Ai) = µ ∈ R^p and finite, positive definite, symmetric covariance matrix Q, then

√n ( (1/n) ∑_{i=1}^n Ai − µ )  →d  N(0, Q) as n → ∞.
The asymptotic chi-squared result (3.15) follows from the fact: if Z ∼ N(µ, Q) where µ ∈ R^p and Q ∈ R^{p×p} is a positive definite, symmetric matrix, then

(Z − µ)^T Q⁻¹ (Z − µ) ∼ χ²p.
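The univariate CLT above is easy to check by simulation. The sketch below (a Monte Carlo illustration with hypothetical Exponential(1) samples, stdlib only) verifies that the standardised sample mean behaves approximately like N(0, 1): close to 95% of the standardised means fall inside (−1.96, 1.96).

```python
import random

random.seed(0)
n, reps = 100, 20000
inside = 0
for _ in range(reps):
    s = sum(random.expovariate(1.0) for _ in range(n))
    # For Exponential(1): E(A) = 1 and sd(A) = 1, so
    # z = sqrt(n) * (mean - 1) / 1.
    z = (s / n - 1.0) * n ** 0.5
    if abs(z) < 1.96:
        inside += 1

coverage = inside / reps
print(coverage)   # close to 0.95
```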
Sampling distribution for the score
We can apply the ideas above to the score vector U . Recall from (3.6) that
Uj = ∂ℓ/∂βj = ∑_{i=1}^n [ (yi − µi)/var(Yi) · xij · (∂µi/∂ηi) ], j = 1, . . . , p.

Since E(U) = 0 and cov(U) = J we have, from (3.14), asymptotically

U ·∼· N(0, J)   (3.16)

and, from (3.15), asymptotically,

U^T J⁻¹ U ·∼· χ²p.   (3.17)
Example Suppose Y1, . . . , Yn are independent and Yi ∼ N(µ, σ²) where µ is the parameter of interest and σ² > 0 is a known constant. The log-likelihood function is

ℓ(µ; y) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (yi − µ)².
69
Then the score is

U = ∂ℓ/∂µ = ∑_{i=1}^n (yi − µ)/σ² = n(Ȳ − µ)/σ²,

and Fisher's information matrix is

J = E(−∂U/∂µ) = n/σ².

Then by rearranging (3.16), we get (asymptotically)

Ȳ ∼ N(µ, σ²/n).

We can use this result to construct a confidence interval for µ. For example, a 95% confidence interval for µ is ȳ ± 1.96σ/√n approximately, due to asymptotic normality‡.
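This interval is a one-liner to compute. A sketch with hypothetical data (the sample values and σ below are illustrative, not from the notes), using the standard library's NormalDist for the quantile:

```python
from statistics import NormalDist

# Hypothetical sample with known sigma.
y = [9.8, 10.4, 10.1, 9.6, 10.2, 10.5, 9.9, 10.0]
sigma = 0.5
n = len(y)
ybar = sum(y) / n

z = NormalDist().inv_cdf(0.975)        # upper 2.5% point, about 1.96
half = z * sigma / n ** 0.5
ci = (ybar - half, ybar + half)
print(ci)
```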
Sampling distribution for MLEs
Let β̂ be the MLE for β in a GLM, i.e.

β̂ := argmax_β ℓ(β),

and so U(β̂) = ∂ℓ(β̂)/∂β = 0.
Now consider the 1st order Taylor expansion of U(β) about the MLE β̂:

U(β) ≈ U(β̂) + U′(β̂)(β − β̂) = −J(β̂)(β − β̂),

where we approximated U′ by E(U′) = −J and used U(β̂) = 0. Consequently, we find

(β̂ − β) = { J(β̂) }⁻¹ U(β),

assuming that J is invertible. If J is regarded as a constant then β̂ is unbiased for β. It turns out that this is at least asymptotically true.
The variance-covariance matrix for β̂ is

E[ (β̂ − β)(β̂ − β)^T ] = J⁻¹ E(UU^T) J⁻¹ = J⁻¹ J J⁻¹ = J⁻¹,
‡Actually, this is exact and not approximate as the observations are Normal in this example. This
coincides with the results presented in Chapter 2
70
where we have used that J = J T and treated J as a constant.
Thus, by (3.14), we have
β̂ ·∼· N(β,J −1). (3.18)
and also, using (3.15), we have
(β̂− β)TJ (β̂)(β̂− β) ·∼· χ2p. (3.19)
This is called the Wald statistic.
We can use (3.18) to construct confidence intervals and conduct hypothesis tests involving the components of β. For instance, from (3.18), we have

(β̂1 − β1)/√var(β̂1) ·∼· N(0, 1), where var(β̂1) = [(X^TWX)⁻¹]₁₁, the (1,1) entry of (X^TWX)⁻¹,

which we can use to construct confidence intervals for β1. Further, under the null hypothesis H0 : β1 = 0 we have

β̂1/√var(β̂1) ·∼· N(0, 1).
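This Wald z-test can be checked numerically. The sketch below reuses the slope estimate and standard error reported in the Poisson worked example earlier in the chapter (β̂ = 4.6830, se = 1.1318) and reproduces the z value and p-value printed by R:

```python
from statistics import NormalDist

beta_hat, se = 4.6830, 1.1318      # slope estimate and its standard error
z = beta_hat / se                  # Wald statistic under H0: beta = 0
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p)                        # z near 4.138, p near 3.5e-05
```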
3.4 Prediction
For given covariates x?, we have§ η̂? = xT? β̂ with variance xT? (XTWX)−1x?.
Therefore, an approximate confidence interval can be constructed using the normal
distribution. To obtain an approximate confidence interval in terms of µ we can use
the inverse of the link function to transform the end points. As with the linear models,
we can use the predict function in R for GLMs. More precisely, an approximate 95%
confidence interval for the mean response of a prediction with covariates x⋆ is

( g⁻¹( η̂⋆ − 1.96 √( x⋆^T (X^TWX)⁻¹ x⋆ ) ), g⁻¹( η̂⋆ + 1.96 √( x⋆^T (X^TWX)⁻¹ x⋆ ) ) ),

where g⁻¹ is the inverse of the link function g.
§x? is a column vector containing covariates and typically an intercept. Also, note that η̂? is just a
number.
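For the identity-link Poisson worked example earlier in the chapter this interval is easy to compute directly, since g⁻¹ is the identity. The Python sketch below plugs in the reported estimates β̂ = (7.701886, 4.683027) and evaluates the interval at a hypothetical new covariate value x⋆ = 0.5:

```python
import numpy as np

# Data and fit from the worked Poisson example (identity link).
y = np.array([2, 4, 6, 7, 8, 10, 11, 12, 14], dtype=float)
x = np.array([-1, -1, 0, 0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
beta = np.array([7.701886, 4.683027])        # reported MLE

mu = X @ beta
J = X.T @ np.diag(1.0 / mu) @ X              # X^T W X with w_ii = 1/mu_i

xs = np.array([1.0, 0.5])                    # hypothetical x_star (with intercept)
eta = xs @ beta                              # eta_hat_star
se = np.sqrt(xs @ np.linalg.solve(J, xs))    # sqrt(x* J^{-1} x*)
lo, hi = eta - 1.96 * se, eta + 1.96 * se
print((lo, hi))
```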
71
3.4.1 Measuring the Goodness of Fit
One measure of the goodness of fit of a GLM is to compare it with a more general
model with the maximum number of parameters that can be estimated. This model
is called the saturated model. It is a GLM with exactly the same response distribution
and link function as the model of interest.
For n observations, y1, . . . , yn, all with potentially different values for the linear predictors ηi = ∑_{j=1}^p xij βj, a saturated model can be specified with n parameters. It
turns out that the fitted saturated model forces µi ≡ E(Yi) = yi for i = 1, . . . , n which
achieves the maximum attainable log-likelihood.
For n observations from a GLM with n parameters, the log-likelihood is

ℓ(θ, φ; y) = ∑_{i=1}^n ( [yi θi − b(θi)]/a(φ) + c(yi, φ) ).

Then, for i = 1, . . . , n,

∂ℓ/∂θi = [yi − b′(θi)]/a(φ) = (yi − µi)/a(φ).

From this it is clear that the log-likelihood is maximised if we force µi = yi for i = 1, . . . , n.
Now reconsider the log-likelihood in terms of the mean response µ:

ℓ(µ, φ; y) = ∑_{i=1}^n ( [yi θ(µi) − b(θ(µi))]/a(φ) + c(yi, φ) ).

Thus, the maximum achievable log-likelihood is ℓ(y, φ; y).
For the model of interest, a GLM with typically fewer parameters than observations (p < n), we can maximise the log-likelihood to get β̂, then η̂ = Xβ̂ and so the estimated mean response µ̂ = g⁻¹(η̂), leading to the maximised log-likelihood ℓ(µ̂, φ; y).
72
Deviance
A measure of the goodness of fit of a GLM is the deviance.
Definition. The deviance for a model with estimated mean response µ̂ is defined as

D = 2φ { ℓ(y, φ; y) − ℓ(µ̂, φ; y) },

and the scaled deviance is D∗ = D/φ.
To make things concrete: Suppose we have a GLM of interest and consider the
following extreme models with the same response distribution and link function:
Saturated Model: The GLM with number of parameters, p, equal to the number of
(distinct) observations. In this case, µi ≡ E(Yi) is equal to the observed response
yi, so there is no variation in the random component.
Null Model: The GLM with only one parameter (p = 1) representing a common
mean response µ for all ys.
In practice, the null model will be too simple and the saturated model is
uninformative as it just repeats the observed response. We aim for a model with a
likelihood close to the likelihood of the saturated model, but with fewer parameters.
73
Deviance — constrained, unconstrained likelihood?
Consider a saturated model where β ∈ R^n and (as usual) η = Xβ, with X ∈ R^{n×n}. Assuming that X is invertible, the model imposes no constraints on the linear predictor η; it can take any value in R^n. In turn, this means µ and also θ are unconstrained. Thus, for the saturated model the maximised log-likelihood is

max_{µ ∈ R^n : µ = g⁻¹(η), η ∈ R^n} ℓ(µ; y) = ℓ(y; y).

For models with p < n parameters, the maximised log-likelihood is

max_{µ ∈ R^n : µ = g⁻¹(Xβ) for some β ∈ R^p} ℓ(µ; y) = ℓ(µ̂; y).

For instance, suppose we have a GLM with n = 3, p = 1 with X = (1 1 1)^T and identity link function g(z) = z. Then

ηi = β ∈ R for i = 1, . . . , n,

and

µ = ( g⁻¹(β), g⁻¹(β), g⁻¹(β) )^T = β (1, 1, 1)^T.

Thus µ is constrained to lie on a line in R³.
74
Why we do not use the saturated model.
We can make the fit as close as possible, in terms of minimising the deviance, by
including sufficiently many parameters. Below are some fits for a GLM based on the
7 data points represented as triangles. The first row corresponds to a saturated model
— note how the fit goes through every data point. The second row is another fit using
just 2 parameters. The second column is a repeat of the first, but with additional
data points — note how the saturated model would give poor predictions for the
additional data points. Thus simplicity, represented by parsimony of parameters, is a
desirable feature for models; we do not include parameters that are unnecessary.
[Figure: 2×2 grid of fits of y against x; first row: saturated model interpolating all 7 points; second row: a 2-parameter fit; second column repeats each fit with additional data points.]
Let us now switch back, considering the log-likelihood in terms of β, and write the scaled deviance as

D∗ = 2{ ℓ(β̂sat; y) − ℓ(β̂; y) },

where β̂sat is the MLE of βsat for the saturated model and β̂ is still the MLE of β in the model of interest.
75
Example Consider the model with Y1, . . . , Yn independent, where Yi ∼ N(µi, σ²) and

E(Yi) = µi = ∑_{j=1}^p xij βj.

The log-likelihood function is

ℓ(µ; y) = −(n/2) log(2πσ²) − (1/(2σ²)) (y − µ)^T(y − µ).

Setting µ = y yields the maximum achievable log-likelihood:

ℓ(y; y) = −(n/2) log(2πσ²).

Then for any other model with p < n (not a saturated model) we find the maximum likelihood estimate β̂ = (X^TX)⁻¹X^Ty, which in turn gives the estimated mean response µ̂ = g⁻¹(η̂) = Xβ̂. So

ℓ(β̂; y) = −(n/2) log(2πσ²) − (1/(2σ²)) (y − Xβ̂)^T(y − Xβ̂).

Therefore, with φ = σ², the deviance is

D = 2φ { ℓ(β̂sat; y) − ℓ(β̂; y) } = (y − Xβ̂)^T(y − Xβ̂),

the residual sum of squares.
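The identity "normal deviance = residual sum of squares" can be verified numerically. A short sketch with hypothetical data and arbitrary fitted values (any µ̂ works, since the identity holds termwise):

```python
import math
import random

random.seed(1)
sigma2 = 2.0
y = [random.gauss(0, 1) for _ in range(10)]
muhat = [0.3 * v for v in y]          # arbitrary fitted values for illustration

def loglik(mu):
    # Normal log-likelihood with known variance sigma2.
    n = len(y)
    return (-n / 2 * math.log(2 * math.pi * sigma2)
            - sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / (2 * sigma2))

D = 2 * sigma2 * (loglik(y) - loglik(muhat))   # deviance, phi = sigma2
rss = sum((yi - mi) ** 2 for yi, mi in zip(y, muhat))
print(abs(D - rss))                             # essentially zero
```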

3.4.2 Pearson's X² statistic
Another important measure of the discrepancy of a GLM is Pearson's X² statistic:

X² = ∑_{i=1}^n (yi − µ̂i)² / V(µ̂i).
Pearson’s X2 statistic shares similar asymptotic distribution properties with the
deviance. However, in this course we shall focus on the deviance as a measure of
goodness-of-fit as
• the deviance has a general advantage as a discrepancy measure in that it is
additive for nested sets of models (see next section).
• the deviance leads to better normalised residuals.
76
3.4.3 Hypothesis Testing
We now discuss how to compare GLMs by hypothesis testing. These tests will involve
the deviance and scaled deviance, therefore we first need their sampling distributions.
Model Comparison with known φ
First, consider the 2nd order Taylor expansion of the log-likelihood around the MLE β̂:

ℓ(β) ≈ ℓ(β̂) + U(β̂)^T(β − β̂) − ½(β − β̂)^T J(β̂)(β − β̂),

where U(β̂) = 0 and we approximated the second derivative U′ by E(U′) = −J. Consequently, we find

2{ ℓ(β̂) − ℓ(β) } ≈ (β − β̂)^T J(β̂)(β − β̂)   (3.20)

which is approximately χ²p distributed by (3.19). Rewriting the deviance as

D∗ = 2{ ℓ(β̂sat; y) − ℓ(β; y) } − 2{ ℓ(β̂; y) − ℓ(β; y) },

and applying (3.20) to each term (with n and p parameters respectively), the two terms behave heuristically like χ²n and χ²p variables. If the model with p parameters is correct, then

D∗ ·∼· χ²n−p.
Now consider comparing two nested models as follows. Consider the null hypothesis

H0 : β = β0 = (β1, . . . , βq)^T,

corresponding to model M0, and a more general hypothesis, the alternative,

H1 : β = β1 = (β1, . . . , βp)^T,
77
corresponding to model M1, with q < p < n. We can test H0 against H1 using the difference of the scaled deviances

D∗0 − D∗1 = 2{ ℓ(β̂sat; y) − ℓ(β̂0; y) } − 2{ ℓ(β̂sat; y) − ℓ(β̂1; y) } = 2{ ℓ(β̂1; y) − ℓ(β̂0; y) }.

The statistic D∗0 − D∗1 is approximately χ²p−q distributed under the null hypothesis.
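For a 1-degree-of-freedom comparison the χ² p-value can be computed with the standard normal CDF alone, since P(χ²₁ > x) = 2(1 − Φ(√x)). The sketch below applies this to the null-versus-fitted comparison in the Poisson worked example earlier in the chapter (null deviance 16.4022 on 8 df, residual deviance 2.1658 on 7 df, so p − q = 1; φ = 1, hence D = D∗):

```python
from statistics import NormalDist

D0, D1 = 16.4022, 2.1658          # null and residual deviances (phi = 1)
diff = D0 - D1                    # approx chi^2 on p - q = 1 df under H0
p_value = 2 * (1 - NormalDist().cdf(diff ** 0.5))
print(diff, p_value)              # strong evidence for including x
```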
Model Comparison with unknown φ
Under the null hypothesis we have

D∗1 ·∼· χ²n−p and D∗0 − D∗1 ·∼· χ²p−q.

If we consider D∗1 and D∗0 − D∗1 as asymptotically independent, then

[ (D∗0 − D∗1)/(p − q) ] / [ D∗1/(n − p) ] ·∼· F(p−q),(n−p).

The advantage of this test statistic is that we can multiply top and bottom by φ to get a test statistic based on the deviance:

[ (D0 − D1)/(p − q) ] / [ D1/(n − p) ] ·∼· F(p−q),(n−p).

The advantage of this hypothesis test for model comparison is that it does not depend on φ.
Warning
There are certain assumptions made when proving that the scaled deviance, D∗ = D/φ, is approximately or asymptotically χ²n−p distributed, which we should be wary of:
• The observations are independent and distributed according to some member of an exponential family.
• The approximation relies on the number of parameters in the model staying fixed while the sample size tends to infinity; but the saturated model has as many parameters as data points. Also, in the Binomial case the χ²n−p approximation (or asymptotics) for the deviance is based upon the following assumptions (McCullagh & Nelder, 1989, p. 118):
  – the observations are truly distributed independently according to the binomial distribution;
  – the approximation is based on a limiting operation in which n is fixed and ni → ∞ for each i (and in fact ni πi(1 − πi) → ∞).
However, the χ² approximation is sound when comparing two nested models, as the deviance for the saturated model cancels out.
78
3.4.4 Estimating the Dispersion
Recall that the MLE for β does not depend on the dispersion φ. However, in cases where the dispersion is unknown, it must be estimated. There are two commonly used estimators.
The first is based on the deviance:

φ̂D = D/(n − p),

where D is the deviance of the model. This follows from the expected value of D/φ = D∗ ·∼· χ²n−p.
The second is

φ̂P = X²/(n − p),

where X² is Pearson's statistic; this is based on the approximation X²/φ ·∼· χ²n−p.
Note
If Z ∼ χ²d then E(Z) = d; that is, the expected value of a χ² distribution is equal to its degrees of freedom.
We can then plug in an estimator of the dispersion parameter φ to estimate cov(β̂):

cov(β̂) ≈ φ̂ (X^TW̃X)⁻¹ (see page 59).
3.4.5 Akaike’s Information Criteria (AIC)
An alternative statistic to compare two models was suggested by Akaike: pick whichever model minimises

AIC = −2ℓ(β̂) + 2p,

for a given set of data. Notice that it is similar to using the log-likelihood of the data, but with a penalisation term for the number of parameters, i.e. the more parameters included, the higher the AIC. Therefore the AIC involves a trade-off between the goodness-of-fit of the model and its complexity (in terms of its number of parameters).
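As a check, the AIC printed by R for the Poisson worked example (40.682) can be reproduced directly from the fitted values; the sketch below assumes the reported estimates β̂ = (7.701886, 4.683027):

```python
import math

y = [2, 4, 6, 7, 8, 10, 11, 12, 14]
x = [-1, -1, 0, 0, 0, 0, 1, 1, 1]
b1, b2 = 7.701886, 4.683027            # reported MLE, identity link
p = 2                                  # number of parameters

loglik = 0.0
for yi, xi in zip(y, x):
    mu = b1 + b2 * xi
    # Poisson log pmf: -mu + y*log(mu) - log(y!)
    loglik += -mu + yi * math.log(mu) - math.lgamma(yi + 1)

aic = -2 * loglik + 2 * p
print(round(aic, 3))
```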
79
3.5 Diagnostics
As with normal linear models, we can use residuals to explore the fit of GLMs. For GLMs we require extensions of the residual definitions to accommodate non-normal response distributions.
3.5.1 Residuals
Pearson’s Residuals
Definition. For a single observation y, Pearson’s residual is defined as
r_p = (y − µ)/√V(µ).
Deviance Residuals
Suppose the deviance, D, is used as a measure of discrepancy of a GLM. Then each observation contributes a quantity di to D, so that D = ∑_{i=1}^n di. Thus, it makes sense to define a deviance-based residual.
Definition. For a single observation, yi, the deviance residual is defined as

r_D = sign(yi − µi) √di,

so that the deviance is D = ∑_i r²_D.

Example For the Poisson distribution (V(µ) = µ), we have
Pearson's residual: r_p = (y − µ̂)/√µ̂
Deviance residual: r_D = sign(y − µ̂) √( 2[ y log(y/µ̂) − (y − µ̂) ] )
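Both residuals can be computed for the identity-link Poisson worked example; the squared deviance residuals should sum to the residual deviance reported by R (2.1658). A sketch, assuming the reported fit β̂ = (7.701886, 4.683027):

```python
import math

y = [2, 4, 6, 7, 8, 10, 11, 12, 14]
x = [-1, -1, 0, 0, 0, 0, 1, 1, 1]
b1, b2 = 7.701886, 4.683027
mu = [b1 + b2 * xi for xi in x]

r_pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

# Per-observation deviance contributions d_i (all y_i > 0 here).
d = [2 * (yi * math.log(yi / mi) - (yi - mi)) for yi, mi in zip(y, mu)]
r_dev = [math.copysign(math.sqrt(di), yi - mi)
         for di, yi, mi in zip(d, y, mu)]

deviance = sum(r ** 2 for r in r_dev)
print(round(deviance, 4))   # close to 2.1658
```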

As with the residuals for linear models, we can standardise to account for each observation's leverage. To this end, we require the corresponding hat matrix for GLMs:

P = W^{1/2} X (X^TWX)⁻¹ X^T W^{1/2}.

The standardised Pearson and deviance residuals are obtained by dividing by √( (1 − Pii) φ̂ ).

¶ sign(x) = +1 if x ≥ 0, −1 if x < 0.
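The GLM hat matrix P behaves like its linear-model counterpart; in particular it is a projection and its trace equals the number of parameters p. A quick numerical check on the identity-link Poisson example (weights w_ii = 1/µ̂_i, fitted values from the reported estimates):

```python
import numpy as np

x = np.array([-1, -1, 0, 0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
mu = 7.701886 + 4.683027 * x            # fitted values, identity link

Whalf = np.diag(1.0 / np.sqrt(mu))      # W^{1/2} with w_ii = 1/mu_i
P = Whalf @ X @ np.linalg.inv(X.T @ np.diag(1.0 / mu) @ X) @ X.T @ Whalf

leverages = np.diag(P)                  # P_ii, used to standardise residuals
print(round(np.trace(P), 6))            # equals p = 2
```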
80
Note
These residuals are only approximately normal in general; the deviance residuals come closest to normality, for n large and (in the Binomial case) ni large. In practice, one should not expect them to lie on a straight line in a QQ plot, but rather on a smooth curve; departure from this curve may indicate an outlier.
3.5.2 Cook’s Distance
Similar to the normal linear model case, we can examine the Cook's distance for the observations:

Ci = ( β̂(i) − β̂ )^T (X^TWX) ( β̂(i) − β̂ ) / (p φ̂),

where β̂(i) is the estimate calculated without using the ith observation. Again we look for large Ci, close to or exceeding 1.
81
3.6 Worked Examples
3.6.1 Worked Example 1
seeds-data.RData seeds.R
The data presented in Table 3.2 shows the number of Orobanche seeds germinated
for two genotypes and two treatments.
# Germinated y Total tested n Genotype x1 Treatment x2
1 10 39 0 0
2 23 62 0 0
3 23 81 0 0
4 26 51 0 0
5 17 39 0 0
6 5 6 0 1
7 53 74 0 1
8 55 72 0 1
9 32 51 0 1
10 46 79 0 1
11 10 13 0 1
12 8 16 1 0
13 10 30 1 0
14 8 28 1 0
15 23 45 1 0
16 0 4 1 0
17 3 12 1 1
18 22 41 1 1
19 15 30 1 1
20 32 51 1 1
21 3 7 1 1
Table 3.2: Orobanche seeds Dataset
We are interested in the proportions yi/ni, so we want the fit to remain between 0 and 1. Suppose Y1, . . . , Yn are independent Binomial(ni, πi) random variables and take the logit link function. Take the design matrix

X =
  1  x1,1   x1,2   x1,1 x1,2
  .    .      .        .
  .    .      .        .
  1  x21,1  x21,2  x21,1 x21,2
Using the binomial distribution we would have µi ≡ E(Yi) = ni πi, with likelihood

∏_{i=1}^{21} (ni choose yi) πi^{yi} (1 − πi)^{ni − yi}.
82
Now we fit the model using the IWLS algorithm (Algorithm 3.1)
> beta <- c(0.5,0.5,0,0) #initial guess
> #inverse logit function (times n, so it returns mu)
> inv.link <- function(u)
+ n*(1/(1+exp(-u)))
> #deviance function
> D <- function(p){
+ a <- y*log(y/p)
+ b <- (n-y)*log((n-y)/(n-p))
+ a[y==0] <- 0
+ 2*sum(a+b)
+ }
> oldD <- Inf
> jj <- 0
> while(jj==0){
+ eta <- X%*%beta #estimated linear predictor
+ mu <- inv.link(eta) #estimated mean response
+ z <- eta + (y-mu)*n/(mu*(n-mu)) #form the adjusted variate
+ w <- mu*(n-mu)/n #weights
+ lmod <- lm(z~X-1, weights=w) #regress z on the columns of X with weights w
+ beta <- as.vector(lmod$coeff) #new beta
+ newD <- D(mu) #current deviance
+ control <- abs(newD-oldD)/(abs(newD)+0.1)
+ if(control<1e-8)
+ jj <- 1
+ oldD <- newD
+ }
> beta #final estimate
[1] -0.5581717 0.1459269 1.3181819 -0.7781037
> newD #last deviance calculated
[1] 33.27779
Notice that this time we have used the change in the deviance as the convergence
criterion; this is what R uses: if
|Dnew − Dold| / ( |Dnew| + 0.1 )
83
is less than 1× 10−8 then the algorithm is deemed to have converged.
By (3.13), the standard errors for these estimates are
> J <- t(X)%*%diag(as.vector(w))%*%X
> invJ <- solve(J)
> beta.sd <- sqrt(as.vector(diag(invJ)))
> beta.sd
[1] 0.1260213 0.2231657 0.1774677 0.3064330
The deviance residuals (section 3.5.1) are
> a <- y*log(y/mu)
> b <- (n-y)*log((n-y)/(n-mu))
> a[y==0] <- 0
> d <- sign(y-mu)*sqrt(2*(a+b))
> summary(d)
V1
Min. :-2.01617
1st Qu.:-1.24398
Median : 0.05995
Mean :-0.08655
3rd Qu.: 0.84695
Max. : 2.12122
We can test individual parameters using (3.18). The corresponding p-values are:
> z <- beta/beta.sd
> z
[1] -4.429187 0.653895 7.427729 -2.539229
> 2*(1-pnorm(abs(z),lower.tail=TRUE))
[1] 9.458885e-06 5.131795e-01 1.105782e-13 1.110970e-02
84
The AIC (section 3.4.5) is
> -2*sum(dbinom(y,n,as.vector(mu/n),log=TRUE)) + 2*length(beta)
[1] 117.874
Of course, R can do this for us:
> dat2 <- seeds
> y <- cbind(dat2$r, dat2$n - dat2$r)
> my.bin.glm <- glm(y ~ seed + extract + seed * extract,
+ family = binomial(link = "logit"), data = dat2)
> summary(my.bin.glm)
Call:
glm(formula = y ~ seed + extract + seed * extract, family = binomial(link = "logit"),
data = dat2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.01617 -1.24398 0.05995 0.84695 2.12123
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5582 0.1260 -4.429 9.46e-06 ***
seed 0.1459 0.2232 0.654 0.5132
extract 1.3182 0.1775 7.428 1.10e-13 ***
seed:extract -0.7781 0.3064 -2.539 0.0111 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 98.719 on 20 degrees of freedom
Residual deviance: 33.278 on 17 degrees of freedom
AIC: 117.87
Number of Fisher Scoring iterations: 4
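The whole fit can also be replicated outside R. The Python (numpy) sketch below runs IWLS (Algorithm 3.1) for the binomial-logit model on the seeds data of Table 3.2 and recovers the coefficients and standard errors reported above; it is an illustration, not the notes' own code.

```python
import numpy as np

# Seeds data from Table 3.2: germinated y, total n, genotype x1, treatment x2.
y = np.array([10,23,23,26,17, 5,53,55,32,46,10, 8,10, 8,23, 0, 3,22,15,32, 3], float)
n = np.array([39,62,81,51,39, 6,74,72,51,79,13,16,30,28,45, 4,12,41,30,51, 7], float)
x1 = np.array([0]*11 + [1]*10, float)
x2 = np.array([0]*5 + [1]*6 + [0]*5 + [1]*5, float)
X = np.column_stack([np.ones(21), x1, x2, x1 * x2])

beta = np.array([0.5, 0.5, 0.0, 0.0])        # same initial guess as the R code
for _ in range(50):
    eta = X @ beta
    mu = n / (1.0 + np.exp(-eta))            # n * inverse-logit(eta)
    w = mu * (n - mu) / n                    # weights n*pi*(1-pi)
    z = eta + (y - mu) * n / (mu * (n - mu)) # adjusted variate
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

se = np.sqrt(np.diag(np.linalg.inv(X.T @ np.diag(w) @ X)))
print(beta)   # matches R: approx [-0.5582, 0.1459, 1.3182, -0.7781]
print(se)     # matches R: approx [ 0.1260, 0.2232, 0.1775,  0.3064]
```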
85
3.6.2 Worked Example 2
The data used in this example can be directly loaded in R using infert
Consider the following data relating to a study of abortions. Table 3.3 presents the first 6 cases only; in total there are 248 cases.
education age parity induced case spontaneous stratum pooled.stratum
1 0-5yrs 26 6 1 1 2 1 3
2 0-5yrs 42 1 1 1 0 2 1
3 0-5yrs 39 6 2 1 0 3 4
4 0-5yrs 34 4 2 1 0 4 2
5 6-11yrs 35 3 1 1 1 5 32
6 6-11yrs 36 4 2 1 1 6 36
Table 3.3: Infertility Dataset
Let us fit a binomial model to see if case, a binary observation, can be predicted using the other measures as covariates. Note that the formula case ~ . means that we include all the other data columns singly as covariates (useful for large datasets).
> myglm0 <- glm(case ~.,data=infert,family=binomial)
> summary(myglm0)
Call:
glm(formula = case ~ ., family = binomial, data = infert)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7975 -0.7836 -0.4599 0.8556 2.8998
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.039297 2.135797 -1.891 0.0586 .
education6-11yrs 1.320471 1.565614 0.843 0.3990
education12+ yrs 3.489701 2.965837 1.177 0.2393
age 0.078590 0.038060 2.065 0.0389 *
parity -0.451423 0.276877 -1.630 0.1030
induced 1.435629 0.320870 4.474 7.67e-06 ***
spontaneous 2.191282 0.329069 6.659 2.76e-11 ***
stratum -0.002842 0.014621 -0.194 0.8459
pooled.stratum -0.078768 0.043800 -1.798 0.0721 .
---
86
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 316.17 on 247 degrees of freedom
Residual deviance: 254.53 on 239 degrees of freedom
AIC: 272.53
Number of Fisher Scoring iterations: 4
At first glance, the residual deviance looks reasonable for the degrees of freedom. We can look at sequentially adding terms using (somewhat confusingly) the anova function again:
> anova(myglm0,test="Chisq")
Analysis of Deviance Table
Response: case
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 247 316.17
education 2 0.002 245 316.17 0.99886
age 1 0.006 244 316.16 0.94012
parity 1 0.026 243 316.14 0.87088
induced 1 0.056 242 316.08 0.81372
spontaneous 1 58.284 241 257.80 2.269e-14 ***
stratum 1 0.003 240 257.79 0.95346
pooled.stratum 1 3.263 239 254.53 0.07085 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova function recognises that a GLM was fitted and produces an analysis of
deviance table. The test="Chisq" option reports the p-values on the right.
87
Model comparisons and selection can be done automatically in R. First, consider
similar models where just one parameter is dropped. This is achieved using the drop1
function:
> drop1(myglm0,test="Chisq")
Single term deletions
Model:
case ~ education + age + parity + induced + spontaneous + stratum +
pooled.stratum
Df Deviance AIC LRT Pr(>Chi)
254.53 272.53
education 2 256.76 270.76 2.230 0.32791
age 1 258.85 274.85 4.315 0.03778 *
parity 1 257.26 273.26 2.728 0.09861 .
induced 1 277.42 293.42 22.890 1.716e-06 ***
spontaneous 1 316.03 332.03 61.504 4.419e-15 ***
stratum 1 254.57 270.57 0.038 0.84591
pooled.stratum 1 257.79 273.79 3.263 0.07085 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests dropping the education and stratum parameters (and perhaps more)
from the model.
The drop1 function has a companion, the add1 function, which adds individual terms to the given model:
> add1(myglm0,~.^2,test="Chisq")
Model:
case ~ education + age + parity + induced + spontaneous + stratum +
pooled.stratum
Df Deviance AIC LRT Pr(>Chi)
254.53 272.53
education:age 2 254.51 276.51 0.0198 0.99013
88
education:parity 2 254.33 276.33 0.1977 0.90589
education:induced 2 247.58 269.58 6.9513 0.03094 *
education:spontaneous 2 252.28 274.28 2.2478 0.32500
education:stratum 2 254.41 276.41 0.1219 0.94089
education:pooled.stratum 2 254.01 276.01 0.5227 0.77003
age:parity 1 254.19 274.19 0.3444 0.55728
age:induced 1 254.20 274.20 0.3282 0.56670
age:spontaneous 1 251.39 271.39 3.1409 0.07635 .
age:stratum 1 254.36 274.36 0.1723 0.67805
age:pooled.stratum 1 254.51 274.51 0.0205 0.88626
parity:induced 1 252.90 272.90 1.6332 0.20126
parity:spontaneous 1 252.57 272.57 1.9616 0.16134
parity:stratum 1 254.48 274.48 0.0521 0.81942
parity:pooled.stratum 1 254.42 274.42 0.1089 0.74145
induced:spontaneous 1 254.50 274.50 0.0323 0.85736
induced:stratum 1 248.77 268.77 5.7651 0.01635 *
induced:pooled.stratum 1 250.36 270.36 4.1678 0.04120 *
spontaneous:stratum 1 251.27 271.27 3.2615 0.07093 .
spontaneous:pooled.stratum 1 253.63 273.63 0.9052 0.34140
stratum:pooled.stratum 1 254.42 274.42 0.1120 0.73790
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ~.^2 informs R to consider all possible two-factor interactions. The result above suggests including the education:induced, induced:stratum and induced:pooled.stratum terms.
The drop1 and add1 functions perform many individual χ² tests between models; they do not select any particular model.
The step function will automatically search through the models for us.
> stepsearch <- step(myglm0,~.^2,test="Chisq")
... this produces a lot of output! To summarise the step search, we can use the anova
component:
> stepsearch$anova
89
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 239 254.5310 272.5310
2 + induced:stratum -1 5.765135 238 248.7658 268.7658
3 - education 2 1.749457 240 250.5153 266.5153
4 + age:spontaneous -1 4.072592 239 246.4427 264.4427
5 - pooled.stratum 1 1.888343 240 248.3310 264.3310
Reading from top to bottom: an induced:stratum term was added, then the education parameter removed, etc. The criterion for adding or removing parameters is based on the AIC. The step search continues until the AIC cannot be reduced any further.
Finally, a summary of the final chosen model can be called (no need to fit again):
> summary(stepsearch)
Call:
glm(formula = case ~ age + parity + induced + spontaneous + stratum +
induced:stratum + age:spontaneous, family = binomial, data = infert)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8253 -0.7699 -0.5257 0.8536 2.6835
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.086896 1.463758 -0.743 0.45776
age 0.000298 0.041023 0.007 0.99420
parity -0.880102 0.193553 -4.547 5.44e-06 ***
induced 2.278841 0.487873 4.671 3.00e-06 ***
spontaneous -0.503841 1.395070 -0.361 0.71798
stratum 0.003730 0.008352 0.447 0.65519
induced:stratum -0.025723 0.009161 -2.808 0.00498 **
age:spontaneous 0.080038 0.044768 1.788 0.07380 .
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 316.17 on 247 degrees of freedom
Residual deviance: 248.33 on 240 degrees of freedom
AIC: 264.33
Number of Fisher Scoring iterations: 4
Finally, let us inspect the diagnostic plots.
[Figure: the four standard glm diagnostic plots for the final model — Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours at 0.5 and 1). Observations 39, 8 and 10 are flagged in the first three panels; the Residuals vs Leverage panel flags observation 74, among others.]
The residual plots are less informative than they were for linear models. A binary
response contains less information than a continuous one. Nevertheless, the issue of outliers
and influential observations is just as prevalent in logistic regression as for linear
models — so look at the Cook's distance plot and investigate any highly influential
observations.‖
Finally, for convenience, the residual values can be obtained as follows:
> residuals(stepsearch,type="pearson")
> residuals(stepsearch,type="deviance")
and the Cook's distances:
> cooks.distance(stepsearch)
‖but this is not the end of the investigation...
and the standardised residuals:
> rstandard(stepsearch,type="pearson")
> rstandard(stepsearch,type="deviance")
The commands above can be executed on any glm model fitted in R.
We can proceed to check for large (in magnitude) residuals and observations with
high leverages:
> plot(abs(rstandard(stepsearch,type="pearson")))
> plot(influence(stepsearch)$hat) # extract hat matrix values
> l.threshold <- 2*8/248 # rule of thumb: 2p/n, with p = 8 coefficients and n = 248
> l.threshold
[1] 0.06451613
> abline(h=l.threshold) #adds a horizontal line
[Figure: two index plots — |standardised Pearson residual| against observation index, and leverage against observation index with a horizontal line at the threshold 2p/n ≈ 0.0645.]
We can mark any suspicious points and investigate if removing them significantly
influences the fit.
A handy way of looking for these: first find the largest (say 5) standardised
residuals:
> order(abs(rstandard(stepsearch,type="pearson")),decreasing=TRUE)[1:5]
[1] 39 8 10 48 81
These are the corresponding index numbers.
Next, find the 5 observations with the largest leverages:
> order(influence(stepsearch)$hat,decreasing=TRUE)[1:5]
[1] 74 239 1 174 51
If we see any recurring indices, we may want to investigate further, as they will have
both high leverage and a high-magnitude residual.
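As a sketch, the whole cross-check can be run as one self-contained chunk, refitting the chosen model directly (the infert data frame ships with base R, so no extra data files are needed):

```r
# Fit the model selected by the step search
fit <- glm(case ~ age + parity + induced + spontaneous + stratum +
             induced:stratum + age:spontaneous,
           family = binomial, data = infert)
# Five largest standardised Pearson residuals and five largest leverages
res.idx <- order(abs(rstandard(fit, type = "pearson")), decreasing = TRUE)[1:5]
lev.idx <- order(influence(fit)$hat, decreasing = TRUE)[1:5]
intersect(res.idx, lev.idx)  # observations flagged on both criteria
```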
3.7 Binomial Regression
Suppose we have independent response variables Y1, . . . ,Yn where Yi ∼ Binomial(ni, πi), so
that
P(Yi = yi) = (ni choose yi) πi^{yi} (1 − πi)^{ni − yi}.
Suppose that for the ith response we also observe covariates xi1, xi2, . . .. Following
the linear model approach, we construct a linear predictor:
ηi = ∑_{j=1}^{p} xij βj.
We shall discuss the three common choices of link function used for binomial
regression:
1) logit: η = log(π/(1 − π))
2) probit: η = Φ^{−1}(π), where Φ^{−1} is the inverse cdf of a standard normal
distribution.
3) complementary log-log: η = log(−log(1 − π))
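Numerically, each link can be inverted with a line of base R; the function names below are our own (the Challenger example later uses a similar helper called inv.clog):

```r
# Inverse link functions mapping the linear predictor eta back to pi in (0, 1)
inv.logit   <- function(eta) 1 / (1 + exp(-eta))   # logit
inv.probit  <- function(eta) pnorm(eta)            # probit
inv.cloglog <- function(eta) 1 - exp(-exp(eta))    # complementary log-log
```

All three are increasing maps from the real line onto (0, 1); they differ mainly in how quickly the tails approach 0 and 1.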
Note
• A logistic binomial regression model is a GLM with a binomial response and
the logit link function; a binary logistic regression model is one with
ni = 1.
• The link above is defined through π and not µ ≡ nπ; this is annoying, but
sometimes used in practice. You can use either π or µ to get the same results.
Worked Example
challenger.R challenger-data.RData
In January 1986, the space shuttle Challenger exploded shortly after launch. An
investigation was conducted into the cause of the crash, paying particular attention
to the O-ring seals in the rocket boosters. Could the failure of the O-rings have
been predicted? Table 3.4 contains information about the damage of O-rings from
22 previous shuttle missions.
Temperature No. Failure O-rings Total No. O-rings
66 0 6
70 1 6
69 0 6
68 0 6
67 0 6
72 0 6
73 0 6
70 0 6
57 1 6
63 1 6
70 1 6
78 0 6
67 0 6
53 2 6
67 0 6
75 0 6
70 0 6
81 0 6
76 0 6
79 0 6
76 0 6
58 1 6
Table 3.4: Challenger Dataset
Let us proceed to fit a logistic regression model in R:
> mat <- cbind(dat$Fail,6-dat$Fail)
> logitmod <- glm(mat~Temp, family=binomial(link="logit"), data=dat)
> probitmod <- glm(mat~Temp, family=binomial(link="probit"), data=dat)
> cloglogmod <- glm(mat~Temp, family=binomial(link="cloglog"), data=dat)
The estimated parameter values using the different link functions look very different:
logit probit cloglog
(Intercept) 8.6615667 4.1452794 7.9384946
Temp -0.1768048 -0.0875188 -0.1665137
However, the actual fits given by each model are similar, especially around the
observations:
> plot(x=dat$Temp,y=dat$Fail/6)
> x <- seq(25,85,by=0.5) # dummy x values
> logiteta <- 8.6615667-0.1768048*x
> probiteta <- 4.1452794-0.0875188*x
> cloglogeta <- 7.9384946-0.1665137*x
> lines(x,1/(1+exp(-logiteta)),lwd=2)
> lines(x,pnorm(probiteta),lwd=2,lty=2)
> lines(x,inv.clog(cloglogeta),lwd=2,lty=3)
[Figure: proportion of damaged O-rings against temperature, with the fitted logit (solid), probit (dashed) and complementary log-log (dotted) curves.]
We can predict the response given by each model at 31F:
> xstar <- 31
> 1/(1+exp(-(8.6615667-0.1768048*xstar)))
[1] 0.9600983
> pnorm(4.1452794-0.0875188*xstar)
[1] 0.9239562
> inv.clog(7.9384946-0.1665137*xstar)
[1] 0.9999999
The models suggest that there is a very high probability of damage at this
temperature.
Let us check the deviance for the logistic model using the χ2 test:
> pchisq(deviance(logitmod),df.residual(logitmod),lower=FALSE)
[1] 0.9776587
This p-value based on the χ2 test of the deviance is well in excess of e.g. the 5% level.
Thus we may conclude that the model fits the data well. We may not say that the
model is correct.
To construct a confidence interval for the prediction from the model using the logit link:
> ustar <- c(1,31)
> etastar <- ustar%*%logitmod$coefficients
> etastar
[,1]
[1,] 3.180617
> 1/(1+exp(-etastar))
[,1]
[1,] 0.9600983
> J <- vcov(logitmod)
> se <- sqrt(t(ustar)%*%J%*%ustar)
Then for an approximate 95% confidence interval (see section 3.4) on the probability
scale:
> 1/(1+exp(-c(etastar - 1.96*se,etastar + 1.96*se)))
[1] 0.3957676 0.9988700
We can obtain the predictions directly using the predict command:
> predict(logitmod,newdata=list(Temp=31))
1
3.180617
> predict(logitmod,newdata=list(Temp=31),type="response")
1
0.9600983
The predict function can also be used to extract the fitted mean response values, µ̂:
> predict(logitmod,type="response")
The same procedure above can be applied for the other link functions.
Odds
Odds are sometimes a better scale than probability to represent chance.
• A 4-1 against bet would pay £4 for every £1.
• A 4-1 on bet would pay £1 for every £4.
Let p be the probability and o be the odds, where we represent e.g. 4-1 against as 1/4
and 4 - 1 on as 4. In general, we have the relationship
o = p/(1 − p)   and   p = o/(1 + o).
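These conversions are one-liners in R (the function names are ours):

```r
odds <- function(p) p / (1 - p)   # probability -> odds
prob <- function(o) o / (1 + o)   # odds -> probability
odds(0.8)    # 4 (up to floating point), i.e. "4-1 on"
prob(1/4)    # 0.2, i.e. "4-1 against"
```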
Odds Ratio
Suppose we have two groups where the probability of an event being in the first group
is p1 and the probability for the second group is p2. Then the odds ratio is:
[p1/(1 − p1)] / [p2/(1 − p2)].
An odds ratio
• equal to 1 indicates the event is equally likely to occur in both groups.
• greater than 1 indicates the event is more likely to occur in the first group.
• less than 1 indicates the event is less likely to occur in the first group.
Odds Interpretation
Suppose we have a logistic Binomial regression model using two covariates x1 and x2.
The linear predictor is given by
η = log(p/(1 − p)) = log(o) = β0 + β1x1 + β2x2.
We can interpret β1 as follows: a unit increase in x1, holding x2 fixed, increases
the log-odds of success by β1, or equivalently multiplies the odds of success by a factor
of exp(β1).
Note
This interpretation follows from using the logit link — no such simple interpretation
exists for the other link functions.
3.7.1 Worked Example
breastfeeding.R breastfeeding-data.RData
Consider the data in Table 3.5 containing information from a study on infant
respiratory disease by type of breast feeding and gender. The ratios presented are
of the form “# with disease / total #”.
Bottle only Some breast with supplement Breast only
Boys 77/458 19/147 47/494
Girls 48/384 16/127 31/464
Table 3.5: Breast Feeding data
Can gender and feeding type be used to measure whether or not infants contract a
respiratory disease?
Let us proceed to fit a logistic regression model in R:
> mat <- cbind(babyfood$disease, babyfood$nondisease)
> lrmod <- glm(mat~ sex + food, family=binomial, data=babyfood)
> summary(lrmod)
Call:
glm(formula = mat ~ sex + food, family = binomial, data = babyfood)
Deviance Residuals:
1 2 3 4 5 6
0.1096 -0.5052 0.1922 -0.1342 0.5896 -0.2284
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6127 0.1124 -14.347 < 2e-16 ***
sexGirl -0.3126 0.1410 -2.216 0.0267 *
foodBreast -0.6693 0.1530 -4.374 1.22e-05 ***
foodSuppl -0.1725 0.2056 -0.839 0.4013
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.37529 on 5 degrees of freedom
Residual deviance: 0.72192 on 2 degrees of freedom
AIC: 40.24
Number of Fisher Scoring iterations: 4
We expect the χ2 approximation for the deviance to be accurate for this data (why?).
Consider the interpretation of the coefficient for breast feeding. We have
> exp(-0.6693)
[1] 0.5120669
Following the log-odds interpretation: breast feeding reduces the odds of respiratory
disease to 51% of those for bottle feeding. We could compute the confidence interval
on the odds scale. However, to get better coverage properties we compute the interval
on the log-odds scale and then transform the endpoints as follows:
> z <- qnorm(0.025,lower.tail=FALSE) # critical value
> z
[1] 1.959964
> exp(c(-0.669-z*0.153,-0.669+z*0.153))
[1] 0.3795099 0.6913386
3.7.2 Prospective and Retrospective Sampling
Prospective Sampling : In prospective sampling, the covariates are fixed and then
the outcome is observed.
Retrospective Sampling : In retrospective sampling, the outcome is fixed and then
the covariates are observed.
Consider a study in which the contraction of a disease is of interest. Let
• v0 be the probability that an individual is included in the study if they do not
have the disease,
• v1 be the probability that an individual is included in the study if they do have
the disease.
For a prospective study, v0 = v1 as we have no knowledge of the outcome. For a
retrospective study, typically v1 is much greater than v0.
For a given covariate x, let
• p∗(x) denote the conditional probability that an individual has the disease given
inclusion in the study.
• p(x) denote the unconditional probability that an individual has the disease, as
we would obtain from a prospective study.
Then by Bayes' Theorem:
p∗(x) = v1 p(x) / (v1 p(x) + v0 (1 − p(x))).
Rearranging yields
logit(p∗(x)) = log(v1/v0) + logit(p(x)).
So the only difference between the retrospective and prospective study is the intercept
term log(v1/v0).
• Generally, v1/v0 is unknown — meaning we cannot estimate β0.
• However, knowledge of the other β can be used to assess the relative effects of
the covariates.
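The identity above is easy to verify numerically; the values of p, v0 and v1 below are arbitrary illustrations:

```r
logit <- function(p) log(p / (1 - p))
p  <- 0.3; v1 <- 0.9; v0 <- 0.05          # hypothetical probabilities
# Conditional probability of disease given inclusion, by Bayes' Theorem
pstar <- v1 * p / (v1 * p + v0 * (1 - p))
all.equal(logit(pstar), log(v1 / v0) + logit(p))  # TRUE
```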
Prospective Sampling : In the infant respiratory example, we would select a sample
of newborns whose parents had chosen a particular method of feeding and then
monitor them for their first year.
Retrospective Sampling : In the infant respiratory example, typically, we would find
infants visiting a doctor with a respiratory disease in the first year and then
record their gender and method of feeding. We would also obtain a sample of
respiratory disease-free infants. How these samples are collected is important
— we require that the probability of inclusion in the study is independent of the
predictor.
Suppose that the respiratory disease example had been a prospective study. Then,
focussing on the boys only,
• Given the infant is breast fed, the log-odds of having a respiratory disease are
log(47/447) = −2.25.
• Given the infant is bottle fed, the log-odds of having a respiratory disease are
log(77/381) = −1.60.
The difference between these two log-odds, ∆ = −1.60 − (−2.25) = 0.65,
represents the increased risk of respiratory disease incurred by bottle feeding relative
to breast feeding. This is the log-odds ratio.
Now suppose that the respiratory disease example had been a retrospective study —
we could compute the log-odds of feeding type given respiratory disease status and
then find the difference. The log-odds ratio would be exactly the same:
∆ = log(77/47)− log(381/447) = 0.65
This shows that a retrospective design is as effective as a prospective design for
estimating ∆.∗∗
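As a check, the two computations above can be reproduced in R using the boys' counts from Table 3.5:

```r
# Prospective view: difference in log-odds of disease (bottle vs breast)
delta.pro   <- log(77/381) - log(47/447)
# Retrospective view: difference in log-odds of bottle feeding (disease vs none)
delta.retro <- log(77/47) - log(381/447)
c(delta.pro, delta.retro)   # the two are identical, approximately 0.65
```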
Notes
• Retrospective designs are cheaper, faster and more efficient, so it is convenient
that the same results may be obtained as from a prospective study.
• Retrospective studies are typically less reliable than prospective studies — they rely
on historical records that may be incomplete or inaccurate.
∗∗See Supplementary handout for further details.
Summary for using the logit link
The canonical choice for binomial response GLMs is the logit link. Using the logit link
• gives an easy interpretation using odds;
• allows for easy analysis of retrospective data.
3.8 Poisson Regression
So far we have examined the following cases for the response:
• Response is real → normal linear model.
• Response is a probability → binomial regression.
• Response is a bounded integer → binomial regression.
What if the response is an unbounded integer?
One possible approach is to use a Poisson distribution as the response. Let the
response Y follow a Poisson distribution with mean λ > 0:
P(Y = y) = e^{−λ} λ^y / y!,   y = 0, 1, 2, . . .
Examples where the response can be modelled as Poisson are:
• counting the occurrence of a rare event (i.e. a small probability of success).
• counting the number of events in a time interval.
Recall that the Poisson distribution is a member of the exponential family as its pmf
can be written as
exp {y log(λ)− λ− log(y!)} ,
where θ = log(λ), b(θ) = λ = exp(θ), a(φ) = φ = 1 and c(y, φ) = − log(y!).
The canonical link is therefore η = θ = log(λ).
Using the canonical link, the likelihood, for independent Y1, . . . ,Yn where Yi ∼
Poisson(λi), is:
L(β;y) =
and the log-likelihood is:
ℓ(β; y) =
The discrepancy of the model can be measured using the deviance, which is
D = 2 ∑_{i=1}^{n} [ yi log(yi/λ̂i) − (yi − λ̂i) ].
We can use the deviance to
• judge the goodness of fit of the model using the χ2n−p approximation.
• compare two nested models.
Alternatively, we can use Pearson's X2 statistic as a goodness of fit measure, which
for a Poisson response distribution takes the form:
X2 = ∑_{i=1}^{n} (yi − λ̂i)^2 / λ̂i.
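Both statistics are easy to compute by hand from a fitted model; a self-contained sketch on simulated Poisson data (not from the notes), checking the hand computation against R's deviance():

```r
set.seed(1)
x <- runif(50)
y <- rpois(50, exp(1 + 0.5 * x))      # simulated counts
fit <- glm(y ~ x, family = poisson)
lam <- fitted(fit)
# Deviance, with the convention that the y*log(y/lam) term is 0 when y = 0
D  <- 2 * sum(ifelse(y > 0, y * log(y / lam), 0) - (y - lam))
X2 <- sum((y - lam)^2 / lam)          # Pearson's statistic
all.equal(D, deviance(fit))           # TRUE
```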
Note
A key property of the Poisson distribution is that its mean is equal to its variance.
3.8.1 Worked Example
In R fit the linear model using the following command:
> mylm1 <- lm(interlocks~assets+sector+nation,data=Ornstein)
The normal linear model fitted has response vector and design matrix:
Y = , X =
The Ornstein dataset contains 248 rows of data. The observations are the 248 largest
Canadian firms, recording each firm's assets in millions of dollars, the sector the firm
belongs to, the nation that controls the firm and the number of interlocking director
and executive positions shared with other firms.
Can we use a company’s assets, sector and nation values to predict the number
of interlocking positions?
Let us take a look at the summary of linear model fitted in R:
> summary(mylm1)
Call:
lm(formula = interlocks ~ assets + sector + nation, data = Ornstein)
Residuals:
Min 1Q Median 3Q Max
-25.001 -6.602 -1.629 4.780 40.728
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.027e+01 1.561e+00 6.575 3.14e-10 ***
assets 8.096e-04 6.119e-05 13.231 < 2e-16 ***
sectorBNK -1.781e+01 5.906e+00 -3.016 0.00284 **
sectorCON -4.709e+00 4.728e+00 -0.996 0.32034
sectorFIN 5.153e+00 2.646e+00 1.948 0.05266 .
sectorHLD 8.777e-01 4.004e+00 0.219 0.82669
sectorMAN 1.149e+00 2.065e+00 0.556 0.57849
sectorMER 1.491e+00 2.636e+00 0.566 0.57206
sectorMIN 4.880e+00 2.067e+00 2.361 0.01905 *
sectorTRN 6.171e+00 2.760e+00 2.236 0.02629 *
sectorWOD 8.228e+00 2.679e+00 3.072 0.00238 **
nationOTH -1.241e+00 2.695e+00 -0.461 0.64555
nationUK -5.775e+00 2.674e+00 -2.159 0.03184 *
nationUS -8.618e+00 1.496e+00 -5.760 2.64e-08 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.827 on 234 degrees of freedom
Multiple R-squared: 0.6463, Adjusted R-squared: 0.6267
F-statistic: 32.89 on 13 and 234 DF, p-value: < 2.2e-16
What can we say about the model fit from this summary?
Let's take a look at a residual plot:
[Figure: Residuals vs Fitted plot for mylm1; observations 9, 63 and 86 are flagged.]
This plot shows signs of heteroscedasticity. Perhaps using a GLM with a Poisson
response distribution could explain the data better. We now proceed to fit the model
with independent Y1, . . . ,Yn with Yi ∼ Poisson(λi) and
E(Yi) = exp(β1 + β2assetsi + β3sectori + β4nationi), i = 1, . . . , n.
> myglm <- glm(interlocks~assets+sector+nation,data=Ornstein,family=poisson)
> summary(myglm)
Call:
glm(formula = interlocks ~ assets + sector + nation, family = poisson,
data = Ornstein)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.9908 -2.4767 -0.8582 1.3472 7.3610
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.325e+00 5.193e-02 44.762 < 2e-16 ***
assets 2.085e-05 1.202e-06 17.340 < 2e-16 ***
sectorBNK -4.092e-01 1.560e-01 -2.623 0.00872 **
sectorCON -6.196e-01 2.120e-01 -2.923 0.00347 **
sectorFIN 6.770e-01 6.879e-02 9.841 < 2e-16 ***
sectorHLD 2.085e-01 1.189e-01 1.754 0.07948 .
sectorMAN 5.260e-02 7.553e-02 0.696 0.48621
sectorMER 1.777e-01 8.654e-02 2.053 0.04006 *
sectorMIN 6.211e-01 6.690e-02 9.283 < 2e-16 ***
sectorTRN 6.778e-01 7.483e-02 9.059 < 2e-16 ***
sectorWOD 7.116e-01 7.532e-02 9.447 < 2e-16 ***
nationOTH -1.632e-01 7.362e-02 -2.217 0.02663 *
nationUK -5.771e-01 8.903e-02 -6.482 9.05e-11 ***
nationUS -8.259e-01 4.897e-02 -16.867 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3737.0 on 247 degrees of freedom
Residual deviance: 1887.4 on 234 degrees of freedom
AIC: 2813.4
Number of Fisher Scoring iterations: 5
Based on the (residual) deviance, we can perform a χ2n−p significance test:
> pchisq(deviance(myglm), df.residual(myglm), lower=FALSE)
[1] 5.793237e-256
This is an extremely small p-value indicating an ill-fitting model if the Poisson
response distribution is correct. Why is this a poor fit? Let us look at the diagnostic
plots.
[Figure: Residuals vs Fitted plot for the Poisson fit; observations 9, 86 and 167 are flagged.]
This residual plot looks better than that for the linear model. However, the residuals
are quite large — larger than the Poisson distribution suggests.
The Poisson distribution is restrictive in the sense that it has only one parameter,
forcing the mean to equal the variance of the observations, which is not very flexible
for fitting purposes. This problem can be alleviated by estimating φ, which then leads
to more representative standard errors. To this end, we compute Pearson's dispersion
estimate φ̂P:
> phihat <- sum(residuals(myglm,type="pearson")^2)/df.residual(myglm)
> phihat
[1] 7.943697
We can then “plug-in” this estimate into the model:
> summary(myglm, dispersion=phihat)
Call:
glm(formula = interlocks ~ assets + sector + nation, family = poisson,
data = Ornstein)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.9908 -2.4767 -0.8582 1.3472 7.3610
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.325e+00 1.464e-01 15.882 < 2e-16 ***
assets 2.085e-05 3.389e-06 6.152 7.64e-10 ***
sectorBNK -4.092e-01 4.397e-01 -0.931 0.352038
sectorCON -6.196e-01 5.974e-01 -1.037 0.299703
sectorFIN 6.770e-01 1.939e-01 3.492 0.000480 ***
sectorHLD 2.085e-01 3.350e-01 0.622 0.533800
sectorMAN 5.260e-02 2.129e-01 0.247 0.804857
sectorMER 1.777e-01 2.439e-01 0.728 0.466323
sectorMIN 6.211e-01 1.886e-01 3.294 0.000989 ***
sectorTRN 6.778e-01 2.109e-01 3.214 0.001309 **
sectorWOD 7.116e-01 2.123e-01 3.352 0.000803 ***
nationOTH -1.632e-01 2.075e-01 -0.787 0.431534
nationUK -5.771e-01 2.509e-01 -2.300 0.021456 *
nationUS -8.259e-01 1.380e-01 -5.984 2.17e-09 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 7.943697)
Null deviance: 3737.0 on 247 degrees of freedom
Residual deviance: 1887.4 on 234 degrees of freedom
AIC: 2813.4
Number of Fisher Scoring iterations: 5
Note that the estimation of β is independent of the dispersion parameter φ, so
modifying φ does not change the estimated coefficients. Further, notice in this
example that some of the coefficients have now become non-significant.
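As a sanity check, each plug-in standard error is simply the original standard error multiplied by √φ̂; for the intercept, using the values from the two summaries above:

```r
phihat <- 7.943697
sqrt(phihat) * 5.193e-02   # approximately 0.1464, the new intercept standard error
```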
3.8.2 Overdispersion
The term overdispersion means that the observed variance of the response is larger
than the variation implied by the distribution used to fit the model. Overdispersion
can be caused by several different problems — we state some below:
• Observations for different individuals with the same covariates do not have
exactly the same distribution; that is, there are unaccounted for individual
differences not included in the model.
• Observations may be correlated or clustered, while the specified variance
function wrongly assumes uncorrelated data.
One approach to mitigate the problem of overdispersion is to estimate the dispersion
parameter φ, rather than assume φ = 1 for the binomial and Poisson distributions.
The procedure is to “plug in” the estimated dispersion parameter φ̂ in the analysis,
as done in the worked example above. As mentioned above, estimating the dispersion
parameter has no effect on the estimate of β, but it inflates all of the standard errors.
3.8.3 Estimation Problems
It can be the case that the glm function in R fails to converge. Problems may arise
with the Fisher scoring method or from a “bad” initial starting point;
however, sometimes the problem lies with the data themselves, as exhibited in
the following example.
The following data set contains the values of androgen and estrogen (types of
hormones) from 26 healthy males with their sexual orientation.
androgen estrogen orientation
1 3.9 1.8 s
2 4.0 2.3 s
3 3.8 2.3 s
4 3.9 2.5 s
5 2.9 1.3 s
6 3.2 1.7 s
7 4.6 3.4 s
8 4.3 3.1 s
9 3.1 1.8 s
10 2.7 1.5 s
11 2.3 1.4 s
12 2.5 2.1 g
13 1.6 1.1 g
androgen estrogen orientation
14 3.9 3.9 g
15 3.4 3.6 g
16 2.3 2.5 g
17 1.6 1.7 g
18 2.5 2.9 g
19 3.4 4.0 g
20 1.6 1.9 g
21 4.3 5.3 g
22 2.0 2.7 g
23 1.8 3.6 g
24 2.2 4.1 g
25 3.1 5.2 g
26 1.3 4.0 g
Table 3.6: Hormone Dataset
Suppose we fit a binomial model to see if the orientation can be predicted from the
hormone values.
> myglm <- glm(orientation~estrogen+androgen,data=hormone,family=binomial)
Executing the above command gives the following warnings
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
Let's take a look at the summary:
> summary(myglm)
Call:
glm(formula = orientation ~ estrogen + androgen, family = binomial,
data = hormone)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.759e-05 -2.100e-08 -2.100e-08 2.100e-08 3.380e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -84.49 136095.03 -0.001 1.000
estrogen -90.22 75910.98 -0.001 0.999
androgen 100.91 92755.62 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.5426e+01 on 25 degrees of freedom
Residual deviance: 2.3229e-09 on 23 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 25
Notice
• the residual deviance is extraordinarily small indicating a good fit, yet none of
the parameters are significant as they have very high standard errors.
• the default maximum number of iterations (25) has been reached.
The reason for these results comes from the data.
[Figure: estrogen against androgen for the hormone data, with points labelled by orientation (s or g).]
The plot of the data reveals that the two classes of orientation are linearly separable:
it is possible to draw a straight line that divides the two groups perfectly. In this
case, the data do not contain enough information to make stable estimates of the
coefficients. The likelihood function does not have a finite maximum. This behaviour
reflects the small amount of data used to fit the model.
Possible approaches to deal with these types of problems:
• Exact logistic regression.
• Bias reduction method — available in the R package brlr.
However, these methods are outside the scope of the course.
3.9 Quasi-Likelihood
Recall that a GLM is determined by the

We now discuss an approach that only requires specification of the link and variance
functions of the model, but not the distribution of the response.
Motivation
Suppose we have the independent random variables Y1, · · · ,Yn. Let Yi have mean µi
and variance φV(µi). Define the score as
Ui = (yi − µi) / (φV(µi)).
It follows that
E(Ui) = 0,   var(Ui) = 1/(φV(µi))   and   −E(∂Ui/∂µi) = 1/(φV(µi)).
These properties are shared by the score function of members of the exponential
family. This suggests that we may use Ui as a score. Hence
Qi = ∫_{yi}^{µi} (yi − t)/(φV(t)) dt,
should behave like a log-likelihood function for µi (if the integral exists). We shall
refer to Qi as the log quasi-likelihood (or confusingly as just the quasi-likelihood) for
µi. As we assume the observations are independent, the log quasi-likelihood for the
complete data is just the sum of the components: Q = ∑_{i=1}^{n} Qi.
Example Take V(µ) = 1 and φ = σ2. Then
U =
and Q =
which is the same as the log-likelihood of a normal distribution up to constants.
In general, using variance functions associated with members of the exponential
family recovers the log-likelihood. Further, other choices of V(µ) may not correspond
to any known distribution or may even lead to something that is not a distribution.
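As a further check (not worked in the notes above): taking V(µ) = µ with φ = 1, the Poisson variance function, gives

Qi = ∫_{yi}^{µi} (yi − t)/t dt = yi log(µi/yi) − (µi − yi),

which agrees with the Poisson log-likelihood yi log(µi) − µi up to terms not involving µi, so the Poisson case is indeed recovered.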
Estimation
The estimation of β in the model is obtained by maximising the log quasi-likelihood,
Q. We can again use the IWLS algorithm (Algorithm 3.1). The only difference is that now
the dispersion parameter φ is estimated by
φ̂P = (1/(n − p)) ∑_{i=1}^{n} (yi − µ̂i)^2 / V(µ̂i),
which is based on Pearson's X2 statistic. We do not use the deviance estimator for φ,
as it is based on the likelihood, which is not reliable here.
Inference
By analogy, the quasi-deviance function for a single observation is††
Di = −2φQi = 2 ∫_{µi}^{yi} (yi − t)/V(t) dt,
and the total quasi-deviance is the sum over the Di. This total quasi-deviance can be
used like the ordinary deviance to perform inference on the model.
††see problem sheet
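For instance (a quick check): taking V(t) = 1, the normal variance function, gives

Di = 2 ∫_{µi}^{yi} (yi − t) dt = (yi − µi)^2,

so the total quasi-deviance reduces to the residual sum of squares, as for the normal linear model.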
Example in R
quasi-seeds.R seeds-data.RData
We continue with the seeds example presented in section 3.6.1. We can fit a
corresponding quasi-binomial model as follows:
> my.quasi.bin <- glm(prop ~ seed + extract + seed * extract,
+ family = quasibinomial(link = "logit"), data = dat)
> summary(my.quasi.bin)
Call:
glm(formula = prop ~ seed + extract + seed * extract, family = quasibinomial(link = "logit"),
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.88834 -0.18458 0.01555 0.13622 0.38167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5262 0.2764 -1.904 0.07398 .
seed -0.2000 0.3969 -0.504 0.62079
extract 1.4479 0.3865 3.747 0.00161 **
seed:extract -0.8478 0.5496 -1.543 0.14135
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for quasibinomial family taken to be 0.08915264)
Null deviance: 3.9112 on 20 degrees of freedom
Residual deviance: 1.8151 on 17 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
Notice that the estimated dispersion parameter is not 1 (as in the previous
analysis), but φ̂ = 0.0892, far less than 1. Further, the model's deviance has been
significantly reduced.
Comparison of multiple quasi models can be done using the F-test e.g.
> my.quasi.bin.2 <- glm(prop ~ seed + extract, family = quasibinomial,
+ data = dat)
> anova(my.quasi.bin.2, my.quasi.bin, test = "F")
Analysis of Deviance Table
Model 1: prop ~ seed + extract
Model 2: prop ~ seed + extract + seed * extract
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 18 2.0280
2 17 1.8151 1 0.21282 2.3871 0.1407
This F-value is not significant, therefore there is insufficient evidence to reject the null
hypothesis, i.e. we use the smaller model.
The other options for use of quasi-likelihoods in R are
quasi(link = "identity", variance = "constant")

Note
• The dispersion, φ, can be modelled as a free parameter which is useful in
modelling overdispersion or underdispersion.
• Although the quasi-likelihood approach is attractive as it uses fewer
assumptions, the quasi-based estimators are generally less efficient than the
corresponding regular likelihood-based estimators — so if information about the
distribution is available, use it.
Chapter 4. Normal Linear Mixed Models
Models with mixed effects consist of both fixed effects and random effects. In
the previous chapters we have only considered fixed effects, where the only source
of randomness arises from treating the samples as independent random
samples. Random effects are used to model more complex correlation structures,
such as when there is more than one observation in a group. In this setting, we expect
each group to vary independently, but measurements from the same group may be
correlated.
We shall only consider random effects for normal linear models. However, random
effects can be included in other, more complicated models e.g. non-linear, non-Normal
(outside the scope of this course).
Motivating Example
Consider the dataset containing the math exam scores of students from 3 different
schools. There are 10 scores from each school in this dataset. The data is presented in
the boxplots below:
[Figure: boxplots of exam score for each of the three schools.]
First, ignore the grouping structure of the data and assume the simple model:
Yij = µ + eij,   j = 1, . . . , m,   i = 1, . . . , Kj,
where Yij is the observed score for student i from school j, µ is the mean score across
the population of students being sampled, and the eij are independent N(0, σ2) error
terms.
> mylm <- lm(score~1,data=dat)
Let’s look at the summary of this model and some plots of the residuals:
Call:
lm(formula = score ~ 1, data = dat)
Residuals:
Min 1Q Median 3Q Max
-9.966 -5.624 -1.810 5.941 14.031
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.96 1.26 8.701 1.4e-09 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 6.9 on 29 degrees of freedom
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
5 10 15 20 25

10
0
5
10
Exam Score y
R
es
id
ua
ls
l
l
l
l
l
l
l
l
l
l
l
l
ll
llllll
l
l
l
l
l
l
l
l
l
l
0 5 10 15 20 25 30

10
0
5
10
Index
R
es
id
ua
ls
Notice that the school group effects are incorporated into the residuals: we see a
clear division in the residuals between the different schools. This leads to an inflated
estimate of the variability, i.e. an inflated RSS.
One approach would be to incorporate the group effects by introducing a separate
mean for each school:
> mylm2 <- lm(score~as.factor(school)+0,data=dat)
Call:
lm(formula = score ~ as.factor(school) + 0, data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.5675 -2.1747 0.0172 2.9415 10.4295
Coefficients:
Estimate Std. Error t value Pr(>|t|)
as.factor(school)140 14.564 1.751 8.318 6.30e-09 ***
as.factor(school)142 4.927 1.751 2.814 0.00902 **
as.factor(school)155 13.394 1.751 7.650 3.15e-08 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 5.537 on 27 degrees of freedom
Multiple R-squared: 0.834, Adjusted R-squared: 0.8155
F-statistic: 45.21 on 3 and 27 DF, p-value: 1.162e-10
Notice that the residual standard error has been slightly reduced in comparison to the
previous model. Let us inspect the residual plots again.
[Residual plots for the school-means model: residuals against score y, and residuals
against observation index]
These residual plots still show a trend.
However, this approach has drawbacks:
• it only models the specific sample of students used in the experiment. We
cannot say anything about the population of students from which the sample
was drawn.
• it does not provide an estimate of the between-school variability.
We clearly need a class of models that incorporates the group structure of the data.
Then we can pose interesting questions such as: what was the average maths exam
score across the population, and how much variation was there between schools?
4.1 Specification of Normal Linear Mixed Models
Definition. The normal linear mixed model is defined as
Y = Xβ + Zν + ε,
where
• Y ∈ Rn is the response vector,
• X ∈ Rn×p is the design matrix for the fixed effect,
• Z ∈ Rn×m model matrix for the random effects,
• β ∈ Rp is the unknown parameter,
• ν is an m-variate random vector with ν ∼ N(0, σ2ν Im),
• ε is an n-variate random vector with ε ∼ N(0, σ2e In),
and ν and ε are independent.
The joint distribution of Y can be easily derived as follows. First, since E(ν) = 0 and
E(ε) = 0,

E(Y) = Xβ.

Next, by the independence of ν and ε,

cov(Y) = Z cov(ν) ZT + cov(ε) = σ2ν ZZT + σ2e In = σ2e (In + ZΨZT).

Since Y is a sum of normally distributed variables, it follows that

Y ∼ N(Xβ, σ2e (In + ZΨZT)), where Ψ = (σ2ν/σ2e) Im.

For compactness, we sometimes group the variances into one parameter τ = (σ2e, σ2ν)T,
called the variance components of the model.
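The two moment calculations above can be checked numerically. The following sketch uses Python/NumPy rather than the R of these notes, with made-up group sizes and variance values:

```python
import numpy as np

# Numerical check of cov(Y) = sigma_e^2 (I_n + Z Psi Z^T) for a one-way
# grouping. The sizes (m = 3 groups, K = 4 per group) and the values
# sigma_e^2 = 1.0, sigma_nu^2 = 2.0 are illustrative, not from the notes.
m, K = 3, 4
n = m * K
sigma_e2, sigma_nu2 = 1.0, 2.0

# Z: n x m indicator matrix mapping each observation to its group
Z = np.kron(np.eye(m), np.ones((K, 1)))
Psi = (sigma_nu2 / sigma_e2) * np.eye(m)
V = sigma_e2 * (np.eye(n) + Z @ Psi @ Z.T)  # cov(Y)

assert np.isclose(V[0, 0], sigma_nu2 + sigma_e2)  # var(Y_ij) = sigma_nu^2 + sigma_e^2
assert np.isclose(V[0, 1], sigma_nu2)             # same group: cov = sigma_nu^2
assert np.isclose(V[0, K], 0.0)                   # different groups: cov = 0
```

The asserts confirm the three covariance patterns implied by the formula: common variance, within-group covariance σ2ν, and independence across groups.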
Example A simple normal linear mixed model is

Yij = µ + νj + eij, i = 1, . . . , Kj, j = 1, . . . , m,

where, as before, µ is the fixed effect, while νj ∼ N(0, σ2ν) and eij ∼ N(0, σ2e). Casting
this model into the definition above: Y and ε stack the observations and errors group
by group,

Y = (Y11, . . . , YK1,1, Y12, . . . , YK2,2, . . . , Y1m, . . . , YKm,m)T,
ε = (e11, . . . , eK1,1, e12, . . . , eK2,2, . . . , e1m, . . . , eKm,m)T,

X = (1, 1, . . . , 1)T is a column of n ones, β = µ, ν = (ν1, ν2, . . . , νm)T, and Z is the
n × m matrix of group indicators, with block-diagonal form

Z = blockdiag(1K1, 1K2, . . . , 1Km),

where 1Kj denotes a column vector of Kj ones. Here n = ∑j=1..m Kj, Y ∈ Rn, X ∈ Rn×1,
Z ∈ Rn×m, β ∈ R. This model is called the one-way random effects model.
4.2 Estimation
Estimation for models with random effects is not as straightforward as it was for fixed
effects models. The standard methods of estimation for normal linear mixed models
are maximum likelihood (ML) and restricted maximum likelihood (REML). We shall
discuss both methods in this section.
4.2.1 Estimation of β
Under a normal linear mixed model, the density of Y is

f(y) = 1/((2π)n/2 |σ2e V|1/2) exp{ −(1/(2σ2e)) (y − Xβ)TV−1(y − Xβ) },

where V := In + ZΨZT. Thus, the log-likelihood function is given by

ℓ(β, τ; y) = −(n/2) log(2π) − (1/2) log|σ2e V| − (1/(2σ2e)) (y − Xβ)TV−1(y − Xβ),

where the variance-components term τ appears in V. By differentiating the
log-likelihood with respect to β, we obtain

∂ℓ/∂β = (1/σ2e) XTV−1(y − Xβ).    (4.1)

The standard approach to finding the MLEs is to solve ∂ℓ/∂β = 0 (and ∂ℓ/∂τ = 0).
For simplicity, assume that X has full rank, so that rank(X) = p. Let (β̂, τ̂) be the
MLE. From (4.1) we obtain

β̂ = (XTV−1τ̂ X)−1 XTV−1τ̂ y,    (4.2)

where Vτ̂ = V(τ̂), that is, V with the MLE τ̂ plugged in. Thus, once the MLE of τ is
found, the MLE of β can be calculated from expression (4.2). One procedure is therefore
to first solve for τ̂ and then compute β̂ using (4.2). However, finding τ̂ is not so simple.
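As a numerical sanity check of (4.2) (Python/NumPy with an invented design, not part of the notes' R sessions), the estimator zeroes the score XTV−1(y − Xβ) exactly when τ, and hence V, is treated as known:

```python
import numpy as np

# Check that beta_hat from (4.2) zeroes the score X^T V^{-1} (y - X beta)
# when tau (hence V) is treated as known. Design and values are illustrative.
rng = np.random.default_rng(0)
n, m = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
Z = np.kron(np.eye(m), np.ones((n // m, 1)))           # one-way grouping
V = np.eye(n) + 2.0 * Z @ Z.T                          # I_n + Z Psi Z^T with sigma_nu^2/sigma_e^2 = 2
y = rng.normal(size=n)

Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
score = X.T @ Vinv @ (y - X @ beta_hat)
assert np.allclose(score, 0.0, atol=1e-10)
```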
4.2.2 Estimation of Variance Components
There are several approaches to estimating the variance components, σ2e and σ2ν , of
normal linear mixed models. The first approach we discuss is to use the analysis of
variances:
4.2.2.1 ANOVA estimators
Let’s start with the simplest possible random effects model to gain intuition:

Yij = µ + νj + eij, i = 1, . . . , Kj, j = 1, . . . , m,

where νj is a random effect with E(νj) = 0, var(νj) = σ2ν, and E(eij) = 0, var(eij) = σ2e.
This induces a correlation between the observations in the same group:

ρ = σ2ν / (σ2ν + σ2e).

Proof. For two distinct observations i ≠ i′ in the same group j,
cov(Yij, Yi′j) = cov(νj + eij, νj + ei′j) = var(νj) = σ2ν,
while var(Yij) = σ2ν + σ2e. Dividing gives ρ = σ2ν/(σ2ν + σ2e).

The correlation, ρ, is known as the intraclass correlation coefficient (ICC). Note
• if σ2ν = 0 then ρ = 0,
• if σ2ν ≫ σ2e then ρ ≈ 1.
For simplicity, suppose there are an equal number of observations in each of the
m groups, i.e. set Kj = K; this is referred to as a balanced design. We can decompose the
variation as follows:

∑i=1..K ∑j=1..m (Yij − Ȳ)2 = ∑i=1..K ∑j=1..m (Yij − Ȳ•j)2 + ∑i=1..K ∑j=1..m (Ȳ•j − Ȳ)2,

where

Ȳ•j = (1/K) ∑i=1..K Yij and Ȳ = (1/(mK)) ∑i=1..K ∑j=1..m Yij.

This decomposition can be written as SST = SSE + SSA, where SSE is the residual
sum of squares, SST is the total sum of squares (corrected for the mean) and SSA is
the sum of squares due to ν. It turns out that (see the Lemmas below):

E(SSE) = m(K − 1)σ2e, E(SSA) = (m − 1)(Kσ2ν + σ2e),

which suggests using the estimators:

σ̂2e = SSE/(m(K − 1)) =: MSE, σ̂2ν = (SSA/(m − 1) − σ̂2e)/K =: (MSA − MSE)/K,

where MSA = SSA/(m − 1). These estimators, σ̂2e and σ̂2ν, are known as the ANOVA
estimators.
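A small simulation (Python/NumPy, with illustrative true values; not part of the notes' R code) confirms that the ANOVA estimators recover the variance components in a balanced design:

```python
import numpy as np

# Simulate a balanced one-way random effects model and check that the ANOVA
# estimators roughly recover the variance components. The true values
# sigma_e^2 = 1.0, sigma_nu^2 = 0.5 are illustrative.
rng = np.random.default_rng(1)
m, K = 200, 10  # many groups, so the estimators are stable
sigma_e2, sigma_nu2 = 1.0, 0.5

nu = rng.normal(0.0, np.sqrt(sigma_nu2), size=m)
Y = 5.0 + nu[:, None] + rng.normal(0.0, np.sqrt(sigma_e2), size=(m, K))

Ybar_j = Y.mean(axis=1)                  # group means
SSE = ((Y - Ybar_j[:, None]) ** 2).sum()
SSA = K * ((Ybar_j - Y.mean()) ** 2).sum()

MSE = SSE / (m * (K - 1))                # estimates sigma_e^2
MSA = SSA / (m - 1)
sigma_nu2_hat = (MSA - MSE) / K          # estimates sigma_nu^2
assert abs(MSE - sigma_e2) < 0.15
assert abs(sigma_nu2_hat - sigma_nu2) < 0.25
```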
Lemma 7. SST = SSE + SSA.
Proof. Write Yij − Ȳ = (Yij − Ȳ•j) + (Ȳ•j − Ȳ) and square; on summing over i and j,
the cross term vanishes because ∑i=1..K (Yij − Ȳ•j) = 0 for each j.
Lemma 8. The expectations of these sums of squares are:
i) E(SSE) = m(K− 1)σ2e
ii) E(SSA) = (m− 1)(Kσ2ν + σ2e )
iii) E(SST) = (mK− 1)σ2e + K(m− 1)σ2ν
Proof. i) For SSE, the expectation of the summand is

E{(Yij − Ȳ•j)2} = var(eij − ē•j) = (1 − 1/K) σ2e,

since Yij − Ȳ•j = eij − ē•j, where ē•j = (1/K)∑i eij. Summing over the mK terms gives
E(SSE) = mK(1 − 1/K)σ2e = m(K − 1)σ2e.

ii) For E(SSA), again start by looking at the expectation of the summand. Writing
Ȳ•j − Ȳ = (νj − ν̄) + (ē•j − ē), where ν̄ = (1/m)∑j νj and ē = (1/(mK))∑i,j eij,

E{(Ȳ•j − Ȳ)2} = (1 − 1/m) σ2ν + (1 − 1/m) σ2e/K,

so summing over the mK terms gives E(SSA) = K(m − 1)σ2ν + (m − 1)σ2e = (m − 1)(Kσ2ν + σ2e).

iii) Plugging in the results from i) and ii) yields:

E(SST) = m(K − 1)σ2e + (m − 1)(Kσ2ν + σ2e)
       = (mK − 1)σ2e + K(m − 1)σ2ν.
These were the first estimators developed for the variance components. They have an
explicit form suitable for hand calculations, which was important in precomputing
days. However, they have a number of disadvantages:
1. The estimates can take negative values. For instance, if MSA < MSE then
σ̂2ν < 0, even though a variance cannot be negative.
2. A balanced design is assumed, where the number of observations in each
group is equal. In such settings, the ANOVA decomposition into sums of squares
is unique. For unbalanced data this is not true, and we must therefore choose
which ANOVA decomposition, and hence which ANOVA estimator, to use.
3. The need for complicated algebraic calculations. Formulae for the simpler
models are easy to derive and code — but more complex models will require
difficult constructions.
4.2.2.2 Maximum Likelihood Approach
Recall that the normal linear mixed model can be written as

Y ∼ N(Xβ, σ2e V), where V := In + ZΨZT and Ψ = (σ2ν/σ2e) Im.

If Ψ (i.e. the variance components) were known, we could estimate β using generalised
least squares. However, the estimation of the variance components is often the purpose
of the analysis. Standard maximum likelihood is one method of estimation that we can
use. The density of Y is

f(y) = 1/((2π)n/2 |σ2e V|1/2) exp{ −(1/(2σ2e)) (y − Xβ)TV−1(y − Xβ) }.

Thus the log-likelihood takes the form

ℓ(β, τ; y) = −(n/2) log(2π) − (1/2) log|σ2e V| − (1/(2σ2e)) (y − Xβ)TV−1(y − Xβ),    (4.3)

where the variance-components term τ appears in V.
We can then proceed to optimise (4.3) to find the maximum likelihood estimates of β
and τ. This is straightforward in principle, but difficult in practice. More complex
models involving larger numbers of random effects parameters can be difficult to
estimate. One particular problem is that a variance cannot be negative, so the MLE
of a variance component may lie on the boundary at zero. Nevertheless, we rely on
numerical techniques to find the MLEs.∗ A problem with MLEs is that they can be
biased, as illustrated in the following example.
∗the optimisation techniques are not examinable.
Example (Neyman-Scott)
Consider the model

Yij = µj + eij, where eij iid∼ N(0, σ2),

for i = 1, 2 and j = 1, . . . , m. We are interested in estimating σ2.
The joint pdf of Y is

∏j=1..m [ (1/√(2πσ2))2 exp{ −(1/(2σ2)) (Y1j − µj)2 } exp{ −(1/(2σ2)) (Y2j − µj)2 } ].
Hence the log-likelihood is

ℓ(µ1, . . . , µm, σ2) = −m log(2πσ2) − (1/(2σ2)) ∑j=1..m [(Y1j − µj)2 + (Y2j − µj)2].

The derivative with respect to µj is

∂ℓ/∂µj = (1/σ2) [(Y1j − µj) + (Y2j − µj)],

for j = 1, . . . , m. Consequently, their MLEs are

µ̂j = (Y1j + Y2j)/2.

Similarly we find

∂ℓ/∂σ2 = −m/σ2 + (1/(2σ4)) ∑j=1..m [(Y1j − µj)2 + (Y2j − µj)2].

Thus the MLE of σ2 is

σ̂2 = (1/(2m)) ∑j=1..m [(Y1j − µ̂j)2 + (Y2j − µ̂j)2].

We then proceed to plug the MLEs for µj into the formula for σ̂2 to get

σ̂2 = (1/m) ∑j=1..m s2j, where s2j := (Y1j − Y2j)2/4.

Now consider the bias of this estimator of σ2:

E(σ̂2) = (1/m) ∑j=1..m E(s2j).

Since 4s2j = {(Y2j − µj) − (Y1j − µj)}2, it follows from independence that

4E(s2j) = var(Y1j) + var(Y2j) = 2σ2,

and so

E(s2j) = σ2/2.

Finally,

E(σ̂2) = σ2/2,

which shows that σ̂2 is biased and does not even become unbiased as n = 2m → ∞.
Here, the maximum likelihood estimator is not consistent.
4.2.2.3 Restricted Maximum Likelihood
The Neyman-Scott example demonstrates that the MLE approach can lead to a biased
estimator of the variance. Notice that the fixed effects, the µis, are considered to
be nuisance parameters and the main interest lies in the variance term σ2. We now
discuss a method that can be used to estimate the parameters of interest without
dealing with nuisance parameters.
Example (Neyman-Scott continued)
Consider the transformation Bj = Y1j − Y2j. Then B1, . . . , Bm are independent and
N(0, 2σ2) distributed. Notice that this transformation leads to a distribution that does
not involve the nuisance parameters µ1, . . . , µm. The joint pdf of B1, . . . , Bm is

f(b1, . . . , bm) = ∏j=1..m (1/√(2π(2σ2))) exp{ −(1/(2(2σ2))) b2j }.

Hence the log-likelihood is

ℓ(σ2; b) = −(m/2) log(2π) − (m/2) log(2σ2) − (1/(4σ2)) ∑j=1..m b2j,

and it follows that

∂ℓ/∂σ2 = −m/(2σ2) + (1/(4σ4)) ∑j=1..m b2j.

Therefore, the MLE of σ2 is

σ̂2 = (1/(2m)) ∑j=1..m B2j.

Finally, we can compute the expectation of this MLE and find that it is indeed an
unbiased estimator of σ2:

E(σ̂2) = (1/(2m)) ∑j=1..m E(B2j) = (1/(2m)) · m · (2σ2) = σ2.
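Both conclusions can be reproduced by simulation. This Python/NumPy sketch (illustrative values, not from the notes) shows the ML estimator landing near σ2/2 and the difference-based REML estimator near σ2:

```python
import numpy as np

# Neyman-Scott by simulation: the ML estimator concentrates near sigma^2/2,
# while the REML estimator based on B_j = Y_1j - Y_2j concentrates near
# sigma^2. The values m = 100_000 and sigma^2 = 1 are illustrative.
rng = np.random.default_rng(2)
m, sigma2 = 100_000, 1.0
mu = rng.normal(size=m)                          # nuisance means mu_j
Y = mu + rng.normal(0.0, np.sqrt(sigma2), size=(2, m))

mu_hat = Y.mean(axis=0)                          # MLE of each mu_j
sigma2_ml = ((Y - mu_hat) ** 2).sum() / (2 * m)  # ML: biased, E = sigma^2/2

B = Y[0] - Y[1]                                  # B_j ~ N(0, 2 sigma^2)
sigma2_reml = (B ** 2).mean() / 2                # REML: unbiased

assert abs(sigma2_ml - sigma2 / 2) < 0.03
assert abs(sigma2_reml - sigma2) < 0.03
```

Adding more nuisance parameters (larger m) does not rescue the ML estimator: its bias stays at a factor of 1/2.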
The trick shown in this example is to apply a transformation to the data that
eliminates the fixed effects, and then to use the transformed data to estimate the
variance component. This idea can be applied in general.
Assume (w.l.o.g.) that X has full rank, i.e. rank(X) = p. Let L ∈ Rn×(n−p) be such
that rank(L) = n − p and

LTX = 0.

Then, multiplying the mixed normal linear model equation by LT on the left, we get

B := LTY = LTXβ + LTZν + LTε = LTZν + LTε.
After the transformation we have

B ∼ N(0, σ2e LTVL).

It follows that the joint pdf of B is

f(b) = 1/((2π)(n−p)/2 |σ2e LTVL|1/2) exp{ −(1/(2σ2e)) bT(LTVL)−1b },

so that the log-likelihood is

ℓR(τ; b) = −((n − p)/2) log(2π) − (1/2) log|σ2e LTVL| − (1/(2σ2e)) bT(LTVL)−1b.

To emphasise that this is the restricted log-likelihood we use the subscript R. The REML
(restricted maximum likelihood) estimator of τ is defined as the maximiser of ℓR. As
with ordinary MLE, such a maximiser satisfies the REML equation

∂ℓR/∂τ = 0.

Again, we shall rely on numerical methods to maximise the restricted log-likelihood.
After the variance components are estimated, the fixed effects are subsequently
estimated.
Notice that we have already seen a candidate for L. Let LT = In − X(XTX)−1XT;
then†

LTY = LTXβ + LTZν + LTε = LTZν + LTε.
†see problem sheet
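The claim that this projection eliminates the fixed effects can be verified numerically (Python/NumPy, with an arbitrary full-rank design; not part of the notes' R sessions):

```python
import numpy as np

# Check the projection candidate for L: with P = I_n - X (X^T X)^{-1} X^T
# (an n x n matrix of rank n - p, playing the role of L^T), we have P X = 0,
# so P Y eliminates the fixed effects. X is an arbitrary full-rank design.
rng = np.random.default_rng(3)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

P = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(P @ X, 0.0)            # L^T X = 0: fixed effects vanish
assert np.linalg.matrix_rank(P) == n - p  # rank n - p, as required
```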
4.2.3 Worked Example
REM-pulp.R
We now illustrate the fitting methods in R using data from an experiment testing
how paper brightness depends on the shift operator:
> library("faraway")
> data("pulp")
bright operator
1 59.8 a
2 60.0 a
3 60.8 a
4 60.8 a
5 59.8 a
6 59.8 b
7 60.2 b
8 60.4 b
9 59.9 b
10 60.0 b
bright operator
11 60.7 c
12 60.7 c
13 60.5 c
14 60.9 c
15 60.3 c
16 61.0 d
17 60.8 d
18 60.6 d
19 60.5 d
20 60.5 d
Table 4.1: Pulp Dataset
We start by calculating the ANOVA estimators:
> myaov <- aov(bright~operator,data=pulp)
> summary(myaov)
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.4467 4.204 0.0226 *
Residuals 16 1.70 0.1062
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> coef(myaov)
(Intercept) operatorb operatorc operatord
60.24 -0.18 0.38 0.44
The estimate of σ2e is the MSE, 0.1062, and the estimate of σ2ν is (MSA − MSE)/K with
K = 5 observations per operator:
> (0.4467-0.1062)/5
[1] 0.0681
We now demonstrate the maximum likelihood estimators.
> library("lme4")
> mylme <- lmer(bright~1+(1|operator),data=pulp)
Take note of how the model is specified in the lmer function: the first part corresponds
to the fixed effects; the second part corresponds to the random effect, where
(1|operator) indicates that the data are grouped by operator and the 1 indicates
that the random effect is constant within each group.
> summary(mylme)
Linear mixed model fit by REML ['lmerMod']
Formula: bright ~ 1 + (1 | operator)
Data: pulp
REML criterion at convergence: 18.6
Scaled residuals:
Min 1Q Median 3Q Max
-1.4666 -0.7595 -0.1244 0.6281 1.6012
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.06808 0.2609
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1494 404.2
Also note that the default fitting method is REML. The reported standard
deviations are simply the square roots of the reported variances; they are not
estimates of the uncertainty in the variance estimates.
The maximum likelihood estimates can also be computed.
> mylme2 <- lmer(bright~1+(1|operator),data=pulp,REML=FALSE)
> summary(mylme2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bright ~ 1 + (1 | operator)
Data: pulp
AIC BIC logLik deviance df.resid
22.5 25.5 -8.3 16.5 17
Scaled residuals:
Min 1Q Median 3Q Max
-1.50554 -0.78116 -0.06353 0.65850 1.56232
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.04575 0.2139
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1294 466.7
4.3 Inference
To compare two nested models M0 (smaller) and M1, we compute twice the
difference of the log-likelihoods (the difference of deviances)

2{ ℓ(β̂1, τ̂1; y) − ℓ(β̂0, τ̂0; y) },

where β̂0, τ̂0 are the MLEs of the parameters for M0 and β̂1, τ̂1 are the MLEs of the
parameters for M1.
We have two choices: either (wrongly) assume the variance parameters are equal
to their estimated values and perform a χ2-test, or try to take their uncertainty into
account and use an F-test.
Testing whether to include the random effects or not involves the hypothesis:
H0 : σ2ν = 0.
The standard derivation of the asymptotic χ2 distribution requires the null
hypothesis to lie in the interior of the parameter space. This assumption does
not hold for this test, since σ2ν = 0 lies on the boundary. Consequently, one may
resort to sampling procedures; see the parametric bootstrap below (Section 4.3.1).
If the χ2 approximation is used, then the test will tend to be conservative, in the
sense that the p-values will tend to be larger than they should be. In general,
inference in the mixed model is intricate and often controversial.
Warning
We cannot use the REML estimation method if the deviance test is to be used. The
reason is that REML estimates the variance components after eliminating the fixed
effects (we transformed using L). If the fixed effects differ between the two models,
the restricted likelihoods are not directly comparable. Therefore, we should use the
maximum likelihood estimation method in this situation.
4.3.1 Worked Example (Continued) — Parametric Bootstrap
REM-pulp-cont.R
Returning to the pulp example, we fitted the mixed effects model using the maximum
likelihood (ML) approach:
> mylme2 <- lmer(bright~1+(1|operator),data=pulp,REML=FALSE)
We now consider a test concerning the random effects, so we must use ML.
Say we are interested in comparing mylme2 with
> mylm <- lm(bright~1,data=pulp)
that is, testing whether the random effects should be included or not. Our null model
is the smaller linear model mylm. Note that as there are no random effects in this
model, we must use the lm function. In this case we cannot use anova and must
therefore compute things directly:
> d <- as.numeric(2*(logLik(mylme2)-logLik(mylm)))
> d
[1] 2.568371
> pchisq(d,1,lower=FALSE)
[1] 0.1090199
This p-value is now well above the 5% significance level. We cannot say that this result
is necessarily wrong, but the use of the χ2 approximation does cause us to doubt the
result.
We can use the parametric bootstrap approach to obtain a more accurate p-value.
We need to estimate the probability, given that the null hypothesis is true, of observing
a d value of 2.568371 or greater. Under the null hypothesis, Y ∼ N(µ, σ2). A
simulation approach would proceed to generate data under this model, fit the null and
alternative models and then compute the deviance. The process would be repeated a
large number of times and the proportion of deviance values exceeding the observed
value of 2.568371 would be used to estimate the p-value. In practice, we do not know
the true values of µ and σ2 but we can use the estimated values — this distinguishes
the parametric approach from the simulation approach. We can simulate responses
under the null as follows:
> y <- simulate(mylm)
Now, taking the generated data, we fit both the null and alternative models, compute
the deviance, and repeat, say, 1000 times:
> ds <- numeric(1000)
> for(i in 1:1000){
+ y <- unlist(simulate(mylm))
+ nullmod <- lm(y~1)
+ altmod <- lmer(y~1+(1|operator),data=pulp,REML=FALSE)
+ ds[i] <- as.numeric(2*(logLik(altmod)-logLik(nullmod)))
+ }
We may examine the distribution of the bootstrapped deviance values, e.g. by looking
at the proportion of values close to zero:
> mean(ds<0.00001)
[1] 0.698
This clearly indicates that the deviance in this case does not have the χ21 distribution
under the null hypothesis: roughly 70% of the simulated deviances are essentially
zero, whereas a χ21 random variable is positive with probability one.
Our estimated p-value is:
> phat <- mean(ds>2.568371)
> phat
[1] 0.02
We should compute the standard error for this estimate. Since phat is a proportion
based on 1000 independent bootstrap replicates, its standard error is estimated by
sqrt(phat*(1-phat)/1000):
> sqrt(0.02*0.98/1000)
[1] 0.004427189
So we can be fairly sure it is under 5%. If in doubt, do more replications.
4.4 Prediction
4.4.1 Best prediction when all the parameters are known
When the fixed effects, β, and variance components, τ, are known, the best predictor
of ν, in the sense of minimum mean squared error, is its conditional expectation
given the data; that is,

ν̃ := E(ν|y) = ΨZTV−1(y − Xβ).

To show this, first consider the covariance between Y and ν:

cov(Y, ν) = cov(Xβ + Zν + ε, ν) = Z cov(ν) = σ2ν Z = σ2e ZΨ.
The proof then follows by considering the joint distribution of Y and ν, which is an
(n + m)-dimensional multivariate Gaussian:

(Y, ν)T ∼ N( (Xβ, 0)T, σ2e [ In + ZΨZT  ZΨ ; ΨZT  Ψ ] ),

where we also used that ΨT = Ψ. Now recall the result: if

(Wa, Wb)T ∼ N( (µa, µb)T, [ Σaa  Σab ; Σba  Σbb ] ),

then

Wb | Wa = wa ∼ N( µb + ΣbaΣ−1aa(wa − µa), Σbb − ΣbaΣ−1aaΣab ).

This allows us to obtain the conditional distribution of ν given y (i.e. the posterior)
without performing an integration:

ν̃ := E(ν | Y = y) = σ2eΨZT (σ2e V)−1 (y − Xβ) = ΨZTV−1(y − Xβ).
Once the best predictor ν̃ is obtained, the best predictor of η = xT⋆β + zT⋆ν is

η̂B := xT⋆β + zT⋆ν̃ = xT⋆β + zT⋆ΨZTV−1(y − Xβ).

The subscript B refers to the best predictor.
4.4.2 Best linear unbiased prediction (BLUP)
If the fixed effects, β, are unknown but the variance components, τ, are known, η̂B is
not a usable predictor, since it depends on the unknown β. In this case, it is customary
to replace β by its MLE β̂, which is

β̂ = (XTV−1X)−1XTV−1Y.

After replacing β by its MLE, the predictor for η := xT⋆β + zT⋆ν becomes

η̂BLUP = xT⋆β̂ + zT⋆ν̃β̂ = xT⋆β̂ + zT⋆ΨZTV−1(y − Xβ̂).

This is the best linear unbiased predictor (BLUP) of η, in the sense that (i) it is linear in
y, (ii) its expected value equals that of η and (iii) it minimises the MSE among all linear
unbiased predictors. The vector

ν̃β̂ := ΨZTV−1(y − Xβ̂)

is also called the BLUP of ν.
4.4.3 Empirical BLUP
In practice, the fixed effects and variance components are typically unknown.
Therefore, in most cases neither the best predictor nor the BLUP is computable. In
such cases, one replaces the vector of variance components, τ, which appears in the
expression for the BLUP, by an estimate τ̂. That is, instead of predicting the random
effects by ν̃, we use

ν̂ := E(ν | Y = y, τ = τ̂) = ΨZTV−1τ̂(y − Xβ̂),

where Vτ̂ is V with an estimator of τ plugged in. The resulting predictor, often called
the empirical BLUP or eBLUP, is

η̂eBLUP = xT⋆β̂ + zT⋆ν̂ = xT⋆β̂ + zT⋆ΨZTV−1τ̂(y − Xβ̂).
Note
Recall that the random effects, ν, are not parameters: they are random variables.
So rather than estimating their values, we predict them. Therefore, the conditional
expectations are referred to as predictors.
4.5 Worked Example
REM-sim.R
We begin by simulating some data as follows:
> n <- 60
> X <- rnorm(n,0,1)
> Z <- c(rep(1,20),rep(2,20),rep(3,20))
> nu <- rnorm(3,0,1)
> beta <- 1.5
> y <- nu[Z]+beta*X+rnorm(60,0,0.1)
> dat <- data.frame(y=y,Z=Z,X=X)
We have simulated from 3 groups with an associated random effect with standard
deviation 1, an error standard deviation of 0.1, and an additional covariate X
representing a fixed effect with regression coefficient 1.5.
Next, we fit the normal linear mixed model, Y = Xβ + Zν + ε, as follows
> library("lme4")
> mylme <- lmer(y~X+(1|Z)+0,data=dat,REML=FALSE); summary(mylme)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ X + (1 | Z) + 0
Data: dat
AIC BIC logLik deviance df.resid
-71.4 -65.1 38.7 -77.4 57
Scaled residuals:
Min 1Q Median 3Q Max
-2.58247 -0.59496 -0.00208 0.49970 2.02786
Random effects:
Groups Name Variance Std.Dev.
Z (Intercept) 0.77931 0.8828
Residual 0.01123 0.1060
Number of obs: 60, groups: Z, 3
Fixed effects:
Estimate Std. Error t value
X 1.51026 0.01572 96.09
We see that the output includes the usual fixed effects summary and an additional
summary for the random effects, reporting the variance of the random effect (named
(Intercept)) and the residual variance. Notice that the estimated residual variance
is accurate (true value 0.01), but the estimated random effect variance is not as
accurate (true value 1); with only m = 3 groups, there is very little information about
σ2ν. The estimated coefficient, β̂, is also accurate (true value 1.5).
To get the posterior estimates of the random effects, i.e. ν̂ := E(ν|y, τ = τ̂):
> ranef(mylme)
$Z
(Intercept)
1 0.8867433
2 1.1064969
3 -0.5706097
If we wish to predict the response of an observation given the covariate x? = 1, but
do not know which group it belongs to, then the best predictor is xT? β̂ = β̂; for this
example this is 1.51026.
If we want to predict the response of an observation from, say, group j = 1, then we
use the estimate of the random effect to get the estimate
> fixef(mylme)+ranef(mylme)$Z
(Intercept)
1 2.3970000
2 2.6167536
3 0.9396471
4.6 Inference Continued
4.6.1 Inference about the fixed effects, β
ℓ(β, τ; y) = −(n/2) log(2π) − (1/2) log|σ2eVτ| − (1/(2σ2e)) (y − Xβ)TV−1τ(y − Xβ).    (4.4)

For fixed τ, taking the derivative of this log-likelihood with respect to β gives

∂ℓ/∂β = (1/σ2e) XTV−1τ(y − Xβ),

so that the MLE of β is the solution of

XTV−1τXβ̂ = XTV−1τy,

which we recognise as the generalised least squares equations. Recall that
Vτ := In + ZΨZT, where Ψ = (σ2ν/σ2e)Im; therefore Vτ depends on the variance
components τ = (σ2e, σ2ν)T. The (profile) log-likelihood of the variance parameter τ is

ℓ(τ; y) = −(n/2) log(2π) − (1/2) log|σ2eVτ| − (1/(2σ2e)) (y − Xβ̂)TV−1τ(y − Xβ̂),

and the Fisher information of β is

J(β) = (1/σ2e) XTV−1τX.

In practice, the estimated value τ̂ is plugged into the Fisher information, from
which we can find the standard error for the MLE β̂ in the form

β̂ = β̂τ̂, J(β̂) = (1/σ̂2e) XTV−1τ̂X.
Problem
The standard errors computed from this plug-in formula do not take into account the
uncertainty in the estimation of τ, but this approach is nevertheless commonly used.
Note
We can plug any estimator of the variance components τ into the above. This is how
we obtain an estimate of the fixed effects, β̂, using the REML estimators.
4.7 Diagnostics
For normal linear mixed models, the two main distributional assumptions pertain to
the normality of the random effects ν and of the errors ε.
Normality of Random Effects
For normal linear mixed models, we assumed

ν ∼ N(0, σ2ν Im).

To check this assumption, some “estimates” of the random variables ν are required.
Typically, the conditional expectations of the random effects, given the observations,
are used:

ν̂ = E(ν | y, τ = τ̂) = ΨZTV−1τ̂(y − Xβ̂).

It turns out that QQ plots of the predicted random effects are not useful for checking
their normality, because the observed distribution of ν̂ does not necessarily reflect
the true distribution of ν. In practice, checking the normality assumption for ν
should be based on comparing the results for a linear mixed model fitted with and
without this normality assumption (outside the scope of this course).
Residuals
As for linear models, the main tools for checking the assumption of normality of
the errors, ε, are based on residuals. First, recall the normal linear mixed model:

Y = Xβ + Zν + ε, or equivalently Y ∼ N(Xβ, σ2e (In + ZΨZT)).

One set of residuals is the conditional residuals, defined as

ε̂(c) = y − Xβ̂ − Zν̂,

where ν̂ = E(ν|y, τ = τ̂). Another set is the marginal residuals, defined as

ε̂(m) = y − Xβ̂.

These raw residuals are useful for checking heterogeneity of the conditional or
marginal variance. However, they are not useful for checking normality assumptions
or for detecting outlying observations, because the raw residuals are typically
correlated and their variances differ. Moreover, standardised and Pearson residuals
are not appropriate for checking normality either, because the linear mixed model
allows for correlation between the observations. An approximate work-around is to
transform the raw residuals so that the resulting transformed residuals are
approximately independent standard normal; a QQ plot of the transformed residuals
should then show an approximate straight line. This transformation can be based
on the Cholesky decomposition of the variance-covariance matrix. Some recommend
the following diagnostic plots: plots of the marginal residuals against covariates, to
check the linearity assumptions for the covariates; and plots of the conditional
residuals against the estimated conditional means, to detect outlying observations
or heteroscedasticity of the errors.
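The Cholesky-based transformation can be illustrated directly (Python/NumPy, with an illustrative covariance; not part of the notes' R code):

```python
import numpy as np

# Whitening with the Cholesky factor: if cov(Y) = V = C C^T, then
# C^{-1} (y - X beta_hat) has identity covariance, since C^{-1} V C^{-T} = I_n.
# V below is an illustrative one-way covariance with sigma_nu^2/sigma_e^2 = 0.5.
m, K = 3, 4
n = m * K
Z = np.kron(np.eye(m), np.ones((K, 1)))
V = np.eye(n) + 0.5 * Z @ Z.T

C = np.linalg.cholesky(V)                      # lower triangular, V = C C^T
W = np.linalg.inv(C) @ V @ np.linalg.inv(C).T  # cov of the transformed residuals
assert np.allclose(W, np.eye(n))
```

In other words, residuals whitened by C−1 are uncorrelated with unit variance, so their QQ plot is meaningful.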
Question: Does the REML estimator depend on the choice of transformation, L?
4.7.1 Summary: Fixed or Random Effects
In certain cases, random effects are almost obligatory:
• when we wish to generalise to unseen levels of a factor (i.e., we view the
observed levels as samples from a population);
• when we have reason to believe that the grouping effects are actually “tightly”
distributed around an overall population mean;
• when the design forces correlation within groups (e.g., block designs where
only one treatment is tested per plot/block).