Multiple Linear Regression Models - Part 2: Residual Diagnostics and Unusual Observations

Dr. Linh Nghiem
STAT3022 Applied Linear Models

Regression Diagnostics

Background

Recall the MLR model $y = X\beta + \varepsilon$, with $E(y) = X\beta$ and $\operatorname{Var}(y) = \operatorname{Var}(\varepsilon) = \sigma^2 I_n$. Assuming the design matrix $X$ has full rank, the OLS estimate is
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
The vectors of fitted values and residuals are
$$\hat{y} = X\hat{\beta} = X(X^\top X)^{-1} X^\top y = Hy, \qquad e = y - \hat{y} = (I_n - H)y,$$
where $H = X(X^\top X)^{-1} X^\top$ is the $n \times n$ hat matrix.

Background

Similar to model diagnostics for SLR, diagnostics for MLR are based on the residuals, which depend critically on the hat matrix $H$.

• $H$ is symmetric, i.e. $H^\top = H$. As a result, the matrix $I_n - H$ is also symmetric.
• Next, $HX = X$. As a result, $(I_n - H)X = X - X = 0$.
• Third, $H^2 = H$, so we say $H$ is idempotent. As a result, the matrix $I_n - H$ is also idempotent, since
$$(I_n - H)(I_n - H) = I_n - H - H + H^2 = I_n - H.$$
• Finally, as proved in Tutorial 4, $\operatorname{trace}(H) = \sum_{i=1}^{n} h_{ii} = p$.

Residual vector

• First, let us compute its expectation:
$$E(e) = E\{(I_n - H)y\} = (I_n - H)E(y) = (I_n - H)X\beta = 0.$$
• Second, let us compute the variance-covariance matrix:
$$\operatorname{Var}(e) = (I_n - H)\operatorname{Var}(y)(I_n - H)^\top = \sigma^2 (I_n - H)(I_n - H) = \sigma^2 (I_n - H),$$
i.e. $\operatorname{Var}(e_i) = \sigma^2 (1 - h_{ii})$ and $\operatorname{Cov}(e_i, e_j) = -\sigma^2 h_{ij}$ for $i \neq j$.

These computations tell us that (1) each residual $e_i$ has a smaller variance than the true error $\varepsilon_i$, and (2) the residuals are correlated.

Residual plots

We can use residual plots similar to those for simple linear regression for model diagnostics. Specifically,

• To check the constant-variance assumption: use the plot of residuals $e_i$ vs. fitted values $\hat{y}_i$, or the plot of residuals vs. each covariate; no news (i.e. no systematic pattern) is good news.
• To check the normality assumption: use a normal quantile-quantile (QQ) plot, or a formality test for normality.

A short simulated example illustrating both checks follows the figures below.

[Figure: residuals vs. fitted values plot; caption "Constant variance is reasonable".]

[Figure: nine residuals-vs-fitted panels illustrating violations of the constant-variance assumption.]
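The following is a minimal sketch in Python of how these diagnostics can be computed and plotted. The simulated design, coefficients, and seed are illustrative assumptions, not part of the lecture; the hat-matrix computations follow the formulas above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
n, p = 50, 3                              # n observations, p coefficients (incl. intercept)
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(0, 1, n)

# Hat matrix, fitted values, and residuals, exactly as defined above
H = X @ np.linalg.solve(X.T @ X, X.T)     # H = X (X'X)^{-1} X'
y_hat = H @ y                             # fitted values: y_hat = H y
e = y - y_hat                             # residuals:     e = (I_n - H) y

print(np.allclose(H @ H, H))              # idempotent: H^2 = H   -> True
print(np.isclose(np.trace(H), p))         # trace(H) = p          -> True

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(y_hat, e)
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals")   # constant-variance check
stats.probplot(e, dist="norm", plot=ax2)              # normality check (QQ plot)
plt.tight_layout()
plt.show()
```

Since the simulated model satisfies all the assumptions, the left panel should show no systematic pattern and the QQ plot should be close to a straight line.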
[Figure: normal QQ plot of residuals; caption "Normality is reasonable".]

[Figure: nine normal QQ-plot panels illustrating violations of the normality assumption.]

Leverage, Outliers, and Influential Observations

Overview

Roughly speaking,

• An outlier is an observation that appears to contradict the postulated model used to describe the data.
• A high-leverage observation is one that is far from the center of the predictor space.
• An influential observation is one that exerts substantial influence on the fitted model; i.e. when it is removed, the fitted model changes substantially.

Note that one of these properties does not necessarily imply another.

[Figure: three scatterplots of $y$ vs. $x$, each with one highlighted red point: "Outlier, low leverage, not influential" ($R^2 = 0.86$; $0.87$ with the red point removed); "High leverage only" ($R^2 = 0.9$ either way); "High leverage, influential, outlier" ($R^2 = 0.36$; $0.6$ with the red point removed).]

Leverage

• Each observation in the data always tries to "pull" the fitted line toward itself, i.e. tries to make its fitted value as close to the observed outcome as possible.
• Formally, the leverage of the $i$th point represents the change in $\hat{y}_i$ when $y_i$ changes by one unit. In MLR, we have
$$\hat{y}_i = h_i^\top y = \sum_{j=1}^{n} h_{ij} y_j,$$
where $h_{ij}$ denotes the $(i, j)$ element of $H = X(X^\top X)^{-1} X^\top$. Hence, when $y_i$ changes by one unit, $\hat{y}_i$ changes by $h_{ii}$ units. Therefore $h_{ii}$ is a measure of the leverage of the $i$th observation in MLR.

Leverage

Note that we have $\operatorname{trace}(H) = \sum_{i=1}^{n} h_{ii} = p$, so
$$\bar{h} = \frac{1}{n} \sum_{i=1}^{n} h_{ii} = \frac{p}{n};$$
in other words, the average leverage is $p/n$.

• As a rule of thumb, the $i$th observation is said to have high leverage if $h_{ii} > 2\bar{h}$ or $h_{ii} > 3\bar{h}$; see the sketch below.
• $h_{ii} = x_i^\top (X^\top X)^{-1} x_i$; in SLR, $h_{ii} = n^{-1} + (x_i - \bar{x})^2 / S_{xx}$. The further a point is from the center of the predictor space, the higher its leverage.
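A minimal sketch of the leverage computation and the $2\bar{h}$ rule of thumb (the $3\bar{h}$ variant is analogous); as before, the simulated design and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, p - 1))])

# Leverages h_ii = x_i' (X'X)^{-1} x_i, computed without forming the full hat matrix
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

h_bar = p / n                             # average leverage: trace(H)/n = p/n
flagged = np.where(h > 2 * h_bar)[0]      # rule of thumb: h_ii > 2 * h_bar
print(h_bar, flagged, h[flagged])
```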
Outliers

An outlier is an observation that is far from the postulated model. Hence, a natural idea for detecting whether an observation is an outlier is to look at whether the magnitude of the corresponding residual $e_i$ is big. But how big is big? From the background slides,
$$e_i \sim N\left(0, \sigma^2 (1 - h_{ii})\right),$$
so the variance of each residual $e_i$ depends on the scale $\sigma^2$. Therefore, to determine whether an observation is an outlier, we typically look at one of the following types of standardized residuals.

Different types of residuals

Standardized residual: we replace $\sigma^2$ by $\hat{\sigma}^2$ and standardize,
$$r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}.$$
The main problem with this kind of residual is that $e_i$ and $\hat{\sigma}$ are not independent, so the exact distribution of $r_i$ is difficult to calculate.

Different types of residuals

To overcome this problem, we consider the regression with the $i$th observation deleted (leave-one-out):

• Take the $i$th observation $(x_i^\top, y_i)$ out of the dataset.
• Fit the model on the $(n - 1)$ remaining observations and obtain the residual standard error $\hat{\sigma}_{(i)}$.

Finally, we obtain the externally studentized residual
$$t_i = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}} \sim t_{n-1-p},$$
so the $i$th observation can be considered an outlier if $|t_i|$ is large (for example, greater than $t_{1-\alpha/2,\, n-1-p}$ with $\alpha = 0.05$).

Different types of residuals

This procedure seems computationally expensive, since it requires us to run $n$ regressions, each with one observation removed. It turns out, however, that
$$t_i = e_i \left[ \frac{n - p - 1}{\mathrm{SSE}(1 - h_{ii}) - e_i^2} \right]^{1/2},$$
where $\mathrm{SSE} = \sum_{i=1}^{n} e_i^2$ is the sum of squared residuals from the regression on the full dataset, so no refitting is actually needed.

Outlier detection

In practice, we typically obtain studentized residuals for all $n$ observations and check every one of them, which amounts to performing $n$ simultaneous tests. Hence, when deciding whether an observation is an outlier, we should use a more conservative threshold for $|t_i|$. A simple way is the Bonferroni correction: we claim the $i$th observation is an outlier if $|t_i| > t_{1-\alpha/(2n),\, n-1-p}$.
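A minimal sketch of this outlier check, using the shortcut formula above together with the Bonferroni-corrected cutoff. The simulated data and seed are illustrative assumptions, and one artificially inflated response is planted so the check has something to find.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(0, 1, n)
y[0] += 6.0                                   # plant one outlier for illustration

H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y                                 # residuals from the full fit
h = np.diag(H)                                # leverages h_ii

# Externally studentized residuals via the shortcut formula (no refitting)
SSE = np.sum(e**2)
t_i = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

# Bonferroni-corrected rule: flag i if |t_i| > t_{1 - alpha/(2n), n-1-p}
alpha = 0.05
cutoff = stats.t.ppf(1 - alpha / (2 * n), df=n - 1 - p)
print(cutoff, np.where(np.abs(t_i) > cutoff)[0])
```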
Influential Observations

The last kind of unusual observation is the influential observation: a point whose removal results in substantial changes to the fitted model. To inspect whether the $i$th observation is influential, this definition motivates us to fit the model with and without the $i$th observation and examine the changes. The two most common classes of measures of these changes are

• DFBETA and DFBETAS: measure changes in the estimated regression coefficients.
• DFFITS and Cook's distance: measure changes in the model fit.

DFBETA and DFBETAS

Recall that $\hat{\beta}$ is the $p \times 1$ estimated coefficient vector from the full dataset, while $\hat{\beta}_{(i)}$ is the same quantity with the $i$th observation removed. Hence, we define
$$\mathrm{DFBETA}_{(i)} = \hat{\beta} - \hat{\beta}_{(i)},$$
so DFBETA is another $p \times 1$ vector, whose elements are denoted $\mathrm{DFBETA}_{(i)j}$, $j = 1, \dots, p$. Finally, we form the standardized vector DFBETAS by standardizing each element separately,
$$\mathrm{DFBETAS}_{(i)j} = \frac{\mathrm{DFBETA}_{(i)j}}{\hat{\sigma}_{(i)} \sqrt{v_{jj}}},$$
where $v_{jj}$ is the $j$th diagonal element of the matrix $V = (X^\top X)^{-1}$.

DFBETA and DFBETAS

• DFBETAS helps us determine which data points influence which coefficients.
• You can use cutoffs of 1, 2, or 3 to determine whether $\mathrm{DFBETAS}_{(i)j}$ is big enough.
• But the best approach is, for each coefficient $\beta_j$, to make a plot of all the $\mathrm{DFBETAS}_{(i)j}$, $i = 1, \dots, n$, and inspect the large values.

DFFITS and Cook's Distance

The second approach is to measure changes in model fit. The influence of case $i$ on its own fitted value $\hat{y}_i$ is
$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)i}}{\hat{\sigma}_{(i)} \sqrt{h_{ii}}} = \frac{x_i^\top \hat{\beta} - x_i^\top \hat{\beta}_{(i)}}{\hat{\sigma}_{(i)} \sqrt{h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}.$$
Cook's distance measures the influence of the $i$th case on all $n$ fitted values:
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{(i)j}\right)^2}{p \, \mathrm{MSE}} = \frac{r_i^2 \, h_{ii}}{p (1 - h_{ii})},$$
where $r_i$ is the corresponding standardized residual defined earlier. In the first form of each expression, the denominator is an estimate of the standard error of the numerator.

DFFITS and Cook's Distance

• Both DFFITS and Cook's distance simultaneously take both the leverage ($h_{ii}$) and the outlier ($t_i$ or $r_i$) measures into account.
• Hence, influence is associated with at least one of the two: high leverage or an outlying response.
• Similar to DFBETAS, a good approach for determining whether an observation is influential is to plot all the $\mathrm{DFFITS}_i$ or $D_i$, then inspect the large values; a sketch follows below.
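In practice, these measures rarely need to be coded by hand. Below is a sketch using statsmodels, whose OLSInfluence object exposes the quantities discussed in this section; the simulated data are the same illustrative assumption as before, and the final line cross-checks the library's Cook's distance against the formula on the slide.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(0, 1, n)

infl = sm.OLS(y, X).fit().get_influence()   # X already contains an intercept column

h = infl.hat_matrix_diag                    # leverages h_ii
r = infl.resid_studentized_internal         # standardized residuals r_i
t = infl.resid_studentized_external         # externally studentized residuals t_i
dffits, dffits_cutoff = infl.dffits         # DFFITS_i and a suggested cutoff
cooks_d, _ = infl.cooks_distance            # Cook's distance D_i
dfbetas = infl.dfbetas                      # n x p matrix of DFBETAS_(i)j

print(np.argsort(cooks_d)[-3:])             # the three most influential cases
# Cross-check: D_i = r_i^2 h_ii / (p (1 - h_ii))
print(np.allclose(cooks_d, r**2 * h / (p * (1 - h))))
```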