Multiple Linear Regression Models - Part 3
Multicollinearity, Polynomial, Qualitative variables
Dr. Linh Nghiem
STAT 3022 Applied linear models

Multicollinearity in MLR

Overview
• A distinct characteristic of MLR compared to SLR is that it has more than one predictor. As such, the intercorrelation between predictors plays an important role in the estimated coefficients as well as in inference for the MLR.
• Such intercorrelation is known as multicollinearity (multi: many; collinear: linear dependence).
• We will study three cases:
  1. When all predictors are uncorrelated.
  2. When all predictors are perfectly correlated.
  3. When all predictors are correlated, but not perfectly.
• In this section, we denote by rjk the sample correlation between two predictors Xj and Xk.

Uncorrelated predictors
Consider the models
  yi = β0 + β1 xi1 + β2 xi2 + εi   (1)
  yi = β0 + β1 xi1 + εi            (2)
  yi = β0 + β2 xi2 + εi            (3)
If r12 = 0 (i.e. X1 and X2 are uncorrelated), then
• The OLS estimates for β1 in model (1) and model (2) are exactly the same.
• The OLS estimates for β2 in model (1) and model (3) are exactly the same.
• SSR(X1, X2) = SSR(X1) + SSR(X2).

An example: Kutner et al. (Table 7.6)
Example: effect of work crew size (X1) and level of bonus pay (X2) on crew productivity (Y). X1 and X2 are uncorrelated.

Uncorrelated predictors
In general, if all p − 1 predictors are mutually uncorrelated:
• The effect of one predictor on the response does not depend on whether the other predictors are in the model.
• Hence, we can get the effect of one predictor Xj on the response Y just by fitting an SLR of Y on Xj.
• We do not go into the math of this conclusion, but intuitively, when all the predictors are uncorrelated, they have "separate" effects on the response.
• You will see this case again when we talk about experimental designs.
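The claims above are easy to verify numerically. Below is a minimal sketch in Python with numpy (the course uses R; this is a translation for illustration), using a small hypothetical dataset — not the Kutner crew data — in which the sample correlation between X1 and X2 is exactly zero.

```python
import numpy as np

# Hypothetical toy data with sample correlation r12 = 0.
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
y  = np.array([2.0, 5.0, 4.0, 9.0])

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ssr(X, y):
    """Regression sum of squares: sum of (fitted - ybar)^2."""
    yhat = X @ ols(X, y)
    return float(np.sum((yhat - y.mean()) ** 2))

ones = np.ones_like(y)
X12 = np.column_stack([ones, x1, x2])   # model (1): y ~ X1 + X2
X1  = np.column_stack([ones, x1])       # model (2): y ~ X1
X2  = np.column_stack([ones, x2])       # model (3): y ~ X2

b12, b1, b2 = ols(X12, y), ols(X1, y), ols(X2, y)

print(np.isclose(b12[1], b1[1]))                          # same slope for X1
print(np.isclose(b12[2], b2[1]))                          # same slope for X2
print(np.isclose(ssr(X12, y), ssr(X1, y) + ssr(X2, y)))   # SSR is additive
```

All three checks print True: with uncorrelated predictors, each slope is unchanged by the presence of the other predictor, and the sums of squares decompose exactly.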
Perfectly correlated predictors
The second (extreme) case is when one or more predictors are perfectly correlated with one another.
• Essentially, this just means that one predictor can be written as a linear combination of some other predictor variables. In this case, the design matrix X is not of full rank, i.e. rank(X) < p.
• Recall the normal equations for OLS: XᵀXb = Xᵀy, and rank(XᵀX) = rank(X). Hence, in this case the matrix XᵀX is also not of full rank, and we have an infinite number of solutions for b.

An example: Kutner et al. (Table 7.7)
Normal equations:
  [  4   26   33 ] [ b0 ]   [  272 ]
  [ 26  204  232 ] [ b1 ] = [ 2118 ]
  [ 33  232  281 ] [ b2 ]   [ 2419 ]
The fitted value ŷ4 is
  ŷ4 = [1  10  10] [b0, b1, b2]ᵀ
     = [−0.4  0.1  0] (XᵀX) [b0, b1, b2]ᵀ
     = [−0.4  0.1  0] [272, 2118, 2419]ᵀ
     = 103,
no matter which solution b of the normal equations is used.

Perfectly correlated predictors
• Though we have an infinite number of solutions for b, all solutions give the same fitted values (and residuals).
• Therefore, while there is no interpretation for b, the model can still provide a good fit to the data.

Highly correlated predictors
Although the above cases are extreme, in reality it is very common to find that many predictors are highly correlated. After all, highly correlated variables are inherent characteristics of the population of interest.
• Example: In a regression of food expenditures on income, savings, age of head of household, educational level, etc., all the predictors are correlated with one another.
• Mathematically, although the design matrix X and the matrix XᵀX still have full rank, the inverse V = (XᵀX)⁻¹ becomes unstable.
• Recall that Var(β̂) = σ²V, so multicollinearity inflates the variance of the OLS estimator.
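The rank deficiency and the "many solutions, one fit" phenomenon can be sketched numerically. The Python example below uses hypothetical numbers (not the Table 7.7 data): X2 is an exact linear function of X1, so the design matrix loses rank, and two different coefficient vectors produce identical fitted values.

```python
import numpy as np

# Hypothetical data with perfect collinearity: X2 = 2*X1 + 1.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1 + 1.0
y  = np.array([3.0, 5.0, 8.0, 9.0])

X = np.column_stack([np.ones(4), x1, x2])
rank = np.linalg.matrix_rank(X)
print(rank)                              # 2, although p = 3

b1 = np.linalg.pinv(X) @ y               # one least-squares solution
v  = np.array([-1.0, -2.0, 1.0])         # null-space direction: X @ v = 0
b2 = b1 + v                              # a different solution

print(np.allclose(X @ b1, X @ b2))       # True: identical fitted values
```

Because v lies in the null space of X (each row gives −1 − 2x1 + x2 = 0), adding any multiple of v to a solution changes the coefficients but not the fit — exactly the situation described above.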
Body fat example (n = 20)
[Scatterplot matrix of X1, X2, X3 and y, with pairwise sample correlations:
 r(X1, X2) = 0.924, r(X1, X3) = 0.458, r(X2, X3) = 0.085,
 r(X1, y) = 0.843, r(X2, y) = 0.878, r(X3, y) = 0.142.]

Effect on sums of squares
  Model               SSE
  y ~ X1              143.12
  y ~ X2              113.42
  y ~ X1 + X2         109.95
  y ~ X1 + X2 + X3    98.40
Since X1 and X2 are highly correlated, the model including both X1 and X2 reduces the SSE very little compared to the SSE of the model that has only X2.

Effect on regression coefficients
  Model              β̂1     se(β̂1)   t1      β̂2     se(β̂2)   t2
  y ~ X1             0.86    0.13     6.66*
  y ~ X2                                      0.86    0.11     7.78*
  y ~ X1 + X2        0.22    0.30     0.73    0.66    0.29     2.26*
  y ~ X1 + X2 + X3   4.34    3.01     1.44    −2.86   2.58     −1.37
  *: t-test of β1 = 0 or β2 = 0 is significant at α = 0.05
• When predictors are highly correlated, the regression parameters are difficult to interpret. When new variables are added, the OLS estimates can change dramatically.

Effect on regression coefficients
Now consider the last model, y ~ X1 + X2 + X3. We see from the table above that neither the t-test for β1 = 0 nor the t-test for β2 = 0 is significant. However, consider the F-test:
  H0: β1 = β2 = 0
  H1: at least one of β1 and β2 is non-zero.
Under H0, the model is y ~ X3, so SSE(H0) = 485.34 and df(H0) = 18. Hence, the corresponding F statistic is
  F = [(485.34 − 98.40)/(18 − 16)] / (98.40/16) = 31.46
with p-value ≈ 0.00. Hence, we reject H0 and conclude that at least one of β1 and β2 is non-zero.

Effect on regression coefficients
Q: Are the results of the F-test and the t-tests contradictory?
A: No, because they test two different things.
• The t-test for β1 = 0 assumes X2 is already included in the model, and similarly, the t-test for β2 = 0 assumes X1 is already included in the model.
• On the other hand, the F-test essentially checks whether at least one of X1 and X2 has to be in the model.
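The F statistic above is pure arithmetic from the two SSE values; a quick sketch of the general-linear-test computation (in Python, using only the numbers quoted in the slide):

```python
# General linear F-test for H0: beta1 = beta2 = 0 in the body fat example.
sse_reduced, df_reduced = 485.34, 18    # reduced model under H0: y ~ X3
sse_full,    df_full    = 98.40, 16     # full model: y ~ X1 + X2 + X3

F = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(round(F, 2))  # 31.46
```

The numerator is the drop in SSE per degree of freedom gained, and the denominator is the full model's MSE — matching the formula on the slide.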
• Such seemingly contradictory results between marginal t-tests and the F-test are common when the predictors are highly correlated, so do not just look at the p-values of the t-tests in the summary table to decide which variables should be kept.

Implications
• Regression parameters are difficult to interpret.
  - While it is mathematically true to say "when all the other covariates are held fixed, a change of one unit in X1 leads to a change of β̂1 units in the outcome," in practice you are unable to change X1 while keeping the other covariates fixed.
  - Possibly the only exception is when we conduct experiments and are able to completely control the values of the covariates.
• Multicollinearity is mostly an issue for X, and it does not seriously affect the fitted values ŷ of the model.

Implications
• When p is large, multicollinearity tends to become more serious, making variable and model selection more important.
• Variable selection and model evaluation have been covered in DATA2x02:
  - Information criteria: AIC, BIC
  - Backward and forward selection
  - Prediction performance: in-sample (R2) and out-of-sample prediction (cross-validation)
• These topics will not be tested in the quiz/final exam, but you are expected to use them in the project.

Polynomial regression

Polynomial regression
First, we consider models of the following form:
  yi = β0 + β1 xi + β2 xi² + . . . + βp−1 xi^(p−1) + εi,   i = 1, . . . , n.
There is only one outcome Y and one predictor X; however, unlike simple linear regression, the regression model is a polynomial of degree p − 1 in X. Fitting this model is essentially the same as fitting an MLR model with the p − 1 predictors
  X1 = X, X2 = X², . . . , Xp−1 = X^(p−1).

Polynomial regression
The design matrix is
      [ 1  x1  x1²  . . .  x1^(p−1) ]
  X = [ 1  x2  x2²  . . .  x2^(p−1) ]
      [ ⋮   ⋮    ⋮            ⋮     ]
      [ 1  xn  xn²  . . .  xn^(p−1) ]
and the OLS estimate for β is the same as in a regular MLR:
  β̂ = (XᵀX)⁻¹ Xᵀy,
and all other inferences can then be done as before.

Notes for polynomial regression
The polynomial terms are highly correlated.
• Example: Car data, Y = speed; X = dist (stopping distance).
  speed (Y)   dist (X1 = X)   X2 = X²   X3 = X³
  4           2               4         8
  4           10              100       1000
  ⋮           ⋮               ⋮         ⋮
  25          85              7225      614125
• Correlation matrix among the polynomial terms:
        [ 1.00  0.96  0.87 ]
  rXX = [ 0.96  1.00  0.97 ]
        [ 0.87  0.97  1.00 ]

Polynomial regression
A common step to reduce multicollinearity in polynomial regression is to center the predictor before taking higher powers.
  dist (X1 = X)   X1* = X1 − x̄   X2* = (X1*)²   X3* = (X1*)³
  2               −41             (−41)²         (−41)³
  10              −33             (−33)²         (−33)³
  ⋮               ⋮               ⋮              ⋮
  85              42              (42)²          (42)³
with x̄ = 42.98. The correlation matrices after and before centering are
          [ 1.00  0.52  0.75 ]         [ 1.00  0.96  0.87 ]
  rX*X* = [ 0.52  1.00  0.83 ],  rXX = [ 0.96  1.00  0.97 ]
          [ 0.75  0.83  1.00 ]         [ 0.87  0.97  1.00 ]

Polynomial regression
• Centering the data reduces multicollinearity but does not remove it completely.
• One possible way to remove multicollinearity completely in polynomial regression is through the use of orthogonal polynomials.
  - More of the math can be read on the Wikipedia page.
  - In R, the set of orthogonal polynomials for a vector x up to degree k can be created by the function poly(x, k).
• The main disadvantage of using orthogonal polynomials is that the model is not so easy to interpret, since they transform the covariate for mathematical convenience without much substantive meaning.

Polynomial regression
Another important aspect of polynomial regression is hierarchy:
• First, it is essential that higher-order terms be added only sequentially. For example, do NOT add the cubic term (X³) unless the quadratic term (X²) is already in the model.
  Eg: Do NOT fit the models:
    yi = β0 + β1 xi + β3 xi³ + εi
    yi = β0 + β1 xi² + β3 xi³ + εi
The reason is that the lower-order terms provide the more basic information about the regression function, while a higher-order term only provides a refinement of it.
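The effect of centering on the intercorrelation of polynomial terms can be sketched numerically. The Python example below uses a hypothetical evenly spaced predictor (not the cars data) and checks both claims: centering lowers the worst pairwise correlation, but does not eliminate it.

```python
import numpy as np

# Hypothetical predictor values, evenly spaced over a positive range.
x = np.linspace(1.0, 50.0, 50)

def max_abs_corr(cols):
    """Largest absolute off-diagonal correlation among the given columns."""
    r = np.corrcoef(np.column_stack(cols), rowvar=False)
    return float(np.abs(r - np.eye(len(cols))).max())

xc = x - x.mean()                            # centered predictor
raw      = max_abs_corr([x,  x**2,  x**3])   # x, x^2, x^3
centered = max_abs_corr([xc, xc**2, xc**3])  # centered versions

print(raw > centered)      # True: centering lowers the correlations
print(centered > 0.0)      # True: but some correlation remains
```

For a predictor symmetric about its mean, the odd–even pairs (xc, xc²) and (xc², xc³) become uncorrelated, yet (xc, xc³) stays strongly correlated — which is why orthogonal polynomials are needed to remove the correlation entirely.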
Polynomial regression
• On the other hand, if a polynomial term of a given degree is retained, then all terms of lower degree are also retained in the model.
  Eg: If you fit the cubic model
    yi = β0 + β1 xi + β2 xi² + β3 xi³ + εi,
  and β3 is significant, never drop the linear term (xi) or the quadratic term (xi²), regardless of whether β1 or β2 is significant. Otherwise, continue by testing whether β2 is significant. If it is, never drop the linear term xi, regardless of whether β1 is significant.

Polynomial regression
Finally, a good strategy for using polynomial regression in practice is to add polynomial terms sequentially until the model fit no longer improves significantly.
Eg: mtcars data. Y = miles per gallon (mpg), X = weight (wt).
[Scatterplot of mpg versus wt for the mtcars dataset.]

Polynomial regression
First, we fit the linear model Y ~ X using OLS. The slope is significant: β̂1 = −5.34 with p-value < 0.0001.
[Scatterplot of mpg versus wt with the fitted linear term.]
From the scatterplot, we may suspect that including a quadratic term would capture the relationship between mpg and wt better.

Polynomial regression
Before fitting the model with a quadratic term, we first center the predictor wt to a new predictor wtc to reduce multicollinearity: wtc = wt − w̄t = wt − 3.22. Then we fit the model with a quadratic term, mpg ~ wtc + wtc², using OLS. The coefficient of the quadratic term is β̂2 = 1.17 and significant.
[Scatterplot of mpg versus wt_c with the fitted linear and quadratic terms.]

Polynomial regression
We may wonder whether we need the cubic term wtc³. We go ahead and fit the model mpg ~ wtc + wtc² + wtc³ using OLS.
[Scatterplot of mpg versus wt_c with the fitted linear, quadratic (red), and cubic (green) terms.]
The coefficient of wtc³ is non-significant, so we stop here. The best model for us is the model with the quadratic term:
  m̂pg = 19 − 5.84 wtc + 1.17 wtc² = 19 − 5.84(wt − 3.22) + 1.17(wt − 3.22)².

Polynomial regression
Due to this hierarchy, the sequential ANOVA table is particularly useful. We can fit a model with a very high degree, look at the ANOVA table, and stop at the last row where the F-test is still significant.

Using orthogonal polynomials
[Output from fitting the model with poly().]

MLR with Qualitative Predictors

Qualitative Predictors
• Binary: yes/no, male/female, before/after, control/treatment
• Multiclass: type A/B/C, republican/democratic/neither
General principle: A qualitative variable with C classes can be represented by C − 1 indicator (dummy) variables, each taking the values 0 and 1.

An example of a binary predictor
First consider the mtcars dataset: regression of Y = miles per gallon (mpg) on X1 = weight (wt) and engine shape (V-shaped or straight).
• Since engine shape is a qualitative predictor with C = 2 classes, we only need 1 indicator variable to represent it, say
  X2 = 1 if straight, 0 if V-shaped.
• The design matrix for the regression is formed as usual:
      [ 1  x11  0 ]
  X = [ 1  x21  1 ]
      [ ⋮   ⋮   ⋮ ]
      [ 1  xn1  0 ]

An example of a binary predictor
What if we accidentally create another indicator variable,
  X3 = 1 if V-shaped, 0 if straight,
and form the design matrix with X1, X2 and X3?
      [ 1  x11  0  1 ]
  X = [ 1  x21  1  0 ]
      [ ⋮   ⋮   ⋮  ⋮ ]
      [ 1  xn1  0  1 ]
Since X3 = 1 − X2, this design matrix has perfect multicollinearity: column 1 (the intercept) = column 3 + column 4. Hence, it is essential that we create only C − 1 = 1 indicator variable for this qualitative predictor.
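The rank deficiency caused by the redundant indicator can be sketched directly. The Python example below uses hypothetical weights and engine shapes (not the actual mtcars rows):

```python
import numpy as np

# Hypothetical binary factor and continuous predictor.
straight = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # X2: 1 if straight
vshaped  = 1.0 - straight                        # X3: 1 if V-shaped (redundant)
x1 = np.array([2.6, 3.2, 2.8, 3.4, 1.9])         # hypothetical weights

X_ok  = np.column_stack([np.ones(5), x1, straight])
X_bad = np.column_stack([np.ones(5), x1, straight, vshaped])

print(np.linalg.matrix_rank(X_ok))   # 3: full column rank
print(np.linalg.matrix_rank(X_bad))  # 3 < 4: intercept column = X2 + X3
```

With the intercept included, one indicator per binary factor is enough; adding the second reproduces the perfect-collinearity situation from the earlier section.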
An example of a binary predictor
Let's take a closer look at the regression function:
  Yi = β0 + β1 Xi1 + β2 Xi2 + εi.
If X2 = 0, then we have
  E(Y) = β0 + β1 X1,
while if X2 = 1, then we have
  E(Y) = (β0 + β2) + β1 X1,
so the two regression functions have the same slope β1 but different intercepts. The coefficient β2 represents that difference in intercepts, at all levels of X1 (weight).

An example of a binary predictor
[Scatterplot of mpg versus wt by engine shape (vs = 0, 1), with two parallel fitted lines: same slope but different intercepts.]

Interaction with quantitative predictors
Furthermore, we can let the qualitative and quantitative variables interact with each other:
  yi = β0 + β1 xi1 + β2 xi2 + β3 xi1 xi2 + εi,
so if X2 = 0, then E(Y) = β0 + β1 X1, while if X2 = 1, then E(Y) = (β0 + β2) + (β1 + β3) X1. As a result, the two regression functions have different slopes and different intercepts.

Interaction with quantitative predictors
[Scatterplot of mpg versus wt by engine shape (vs = 0, 1), with two fitted lines: different slopes and different intercepts.]

Interaction with quantitative predictors
Fitting and inference for interaction models (not only between quantitative and qualitative predictors) also follow the hierarchy principle.
• Do not include an interaction term unless all of its main-effect terms are included.
  - Do not fit the model Y ~ X1 + X1X2, since it lacks the main-effect term X2.
  - Do not fit Y ~ X1 + X2 + X1²X2, since it lacks the term X1².
• If the interaction term is significant, then do not drop the main-effect terms, regardless of whether they are significant or not.
  - Eg: If β3 in the model Y = β0 + β1 X1 + β2 X2 + β3 X1X2 + ε is significant, then do not drop the main-effect terms β1X1 and β2X2, even when β1 and β2 are not significant.
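The intercept/slope algebra for the interaction model can be verified numerically: the group with X2 = 1 has intercept β0 + β2 and slope β1 + β3, which is exactly what a separate SLR on that group alone would give. A synthetic Python sketch (hypothetical data, not mtcars):

```python
import numpy as np

# Synthetic data: binary indicator x2, continuous x1, true interaction.
rng = np.random.default_rng(1)
x1 = rng.uniform(2.0, 5.0, 20)
x2 = np.repeat([0.0, 1.0], 10)
y = 30.0 - 4.0 * x1 + 3.0 * x2 - 2.0 * x1 * x2 + rng.normal(0.0, 1.0, 20)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones_like(y)
b = ols(np.column_stack([ones, x1, x2, x1 * x2]), y)   # b0, b1, b2, b3

# Separate SLR using only the X2 = 1 rows.
m = x2 == 1.0
b_grp1 = ols(np.column_stack([ones[m], x1[m]]), y[m])

# Group-1 line from the interaction fit: intercept b0+b2, slope b1+b3.
print(np.allclose([b[0] + b[2], b[1] + b[3]], b_grp1))  # True
```

This is why the interaction model is often described as "fitting a separate line per group": its column space, restricted to each group, is exactly that of a group-specific SLR.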
Syntax in R
  Syntax in R   Meaning
  X1 + X2       main effects of both X1 and X2
  X1:X2         second-order interaction effect of X1 and X2
  X1*X2         all main and interaction effects of X1 and X2
  X1:X2:X3      third-order interaction effect of X1, X2 and X3
  X1*X2*X3      all main, second-order, and third-order interaction effects among X1, X2, X3
For example, lm(y ~ X1*X2*X3) is the same as
  lm(y ~ X1 + X2 + X3 + X1:X2 + X2:X3 + X1:X3 + X1:X2:X3)

Fitting models with qualitative predictors in R
• If you want a predictor to be treated as qualitative, you have to convert it to a factor:
  mtcars = mtcars %>% mutate(vs = factor(vs))
• After that, you can put the variable into the lm command. For example, here I fit a model where the two engine shapes have the same slope but different intercepts:
  yi = β0 + β1 xi1 + β2 xi2 + εi,
and the coefficient β2 represents the difference between the intercepts of the two engine shapes.

Fitting models with qualitative predictors in R
From this output, the regression equation for each engine shape is
  Ŷ = 33 − 4.44 X1                              for V-shaped (X2 = 0),
  Ŷ = (33 + 3.15) − 4.44 X1 = 36.15 − 4.44 X1   for straight (X2 = 1).

Fitting models with qualitative predictors in R
I can fit a model with an interaction between weight and engine shape:
  yi = β0 + β1 xi1 + β2 xi2 + β3 xi1 xi2 + εi
  Ŷ = 29.53 − 3.50 X1                                        for V-shaped (X2 = 0),
  Ŷ = (29.53 + 11.77) + (−3.50 − 2.90) X1 = 41.30 − 6.40 X1  for straight (X2 = 1).

Multi-class qualitative predictors
Consider a regression of tool wear (Y) on speed (X1) and tool type, where tool type is a qualitative variable with C = 4 classes (M1, M2, M3, M4). We need C − 1 = 3 indicator variables:
  X2 = 1 if M1, 0 otherwise
  X3 = 1 if M2, 0 otherwise
  X4 = 1 if M3, 0 otherwise
and the regression function is
  yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + εi.
The class of the qualitative predictor for which all indicators are 0 is called the base or reference level.
For example, the base level is M4 in this example, so β2, β3, β4 correspond to the differences between the intercepts of the other tool types and that of the base level.

Multi-class qualitative predictors
[Output for the tool wear example.]

Notes on qualitative predictors
• One can have models with more than one qualitative predictor, as well as interactions between them (but this quickly leads to a huge number of predictors).
• While indicator variables are the most common way to handle a qualitative predictor, alternative codings are possible, so be careful with the interpretation of the coefficients.

Seattle housing data example
• Model the log housing price (Y) with two potential predictors: log square feet (X1) and condition of the house. Condition is a qualitative predictor with 5 levels, from 1 to 5.
• Choose condition 1 as the base level and fit the model including the interaction of log square feet and condition:
  yi = β0 + β1 x1i + β2 x2i + . . . + β5 x5i + β6 x1i x2i + . . . + β9 x1i x5i + εi,
where xji = 1 if the condition of observation i is j, and 0 otherwise.

An example
[Fitted model output for the Seattle housing data.]

An example
We can write out the estimated regression function for each condition:
  Ŷ = 7.67 + 0.69 X1                     condition 1,
  Ŷ = (7.67 + 0.11) + (0.69 − 0.02) X1   condition 2,
  Ŷ = (7.67 − 0.84) + (0.69 + 0.13) X1   condition 3,
  Ŷ = (7.67 − 0.91) + (0.69 + 0.14) X1   condition 4,
  Ŷ = (7.67 − 2.03) + (0.69 + 0.30) X1   condition 5.

An example
We can visualize the relationship between log housing price and log square feet for each condition.
[Scatterplot of log(price) versus log(sqft_living), with one fitted line per condition 1–5.]

Hypothesis testing with qualitative variables
A qualitative variable has a significant (additive or interaction) effect when at least one coefficient of the corresponding indicator variables is significant. In the previous example, condition has an interaction effect with area if at least one of β6, . . . , β9 is not equal to zero:
  H0: β6 = . . . = β9 = 0  vs  H1: at least one of β6, . . . , β9 ≠ 0.
As such, the sequential ANOVA table in R always gives the extra sum of squares jointly, SSR(X6, . . . , X9 | X1, . . . , X5), not for each separate indicator variable.
• What are the degrees of freedom of SSR(X6, . . . , X9 | X1, . . . , X5)?

Hypothesis testing with qualitative variables
Likewise, the Sum Sq for condition is SSR(X2, . . . , X5 | X1), with 4 degrees of freedom.
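With a base level, reading off the per-condition lines is simple arithmetic on the estimated coefficients. A Python sketch using the Seattle housing estimates quoted above (condition 1 is the base level):

```python
# Per-condition intercepts and slopes from the Seattle housing fit above.
b0, b1 = 7.67, 0.69                                  # base level (condition 1)
d_int   = {2: 0.11, 3: -0.84, 4: -0.91, 5: -2.03}    # intercept shifts
d_slope = {2: -0.02, 3: 0.13, 4: 0.14, 5: 0.30}      # slope shifts (interactions)

lines = {1: (b0, b1)}
for c in range(2, 6):
    lines[c] = (round(b0 + d_int[c], 2), round(b1 + d_slope[c], 2))

print(lines[5])  # (5.64, 0.99)
```

Each non-base condition's line is the base line plus its indicator coefficient (intercept shift) and its interaction coefficient (slope shift), which reproduces the five estimated equations listed earlier.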