Multiple Linear Regression Models - Part 3
Multicollinearity, Polynomial, Qualitative variables
Dr. Linh Nghiem
STAT 3022
Applied linear models

Multicollinearity in MLR
Overview
• A distinct characteristic of MLR compared to SLR is that it has
more than one predictor. As such, the intercorrelation between
predictors plays an important role in the estimated coefficients
as well as in inference for the MLR.
• Such intercorrelation is known as multicollinearity (multi:
many; collinear: linear dependence).
• We will study three cases:
1. When all predictors are uncorrelated.
2. When all predictors are perfectly correlated.
3. When all predictors are correlated but not perfectly correlated.
• In this section, we denote rjk as the sample correlation
between two predictors Xj and Xk.
Uncorrelated predictors
Consider the models
yi = β0 + β1xi1 + β2xi2 + εi (1)
yi = β0 + β1xi1 + εi (2)
yi = β0 + β2xi2 + εi (3)
If r12 = 0 (i.e., X1 and X2 are uncorrelated), then:
• The OLS estimates for β1 of model (1) and model (2) are
exactly the same.
• The OLS estimates for β2 of model (1) and model (3) are
exactly the same.
• SSR(X1, X2) = SSR(X1) + SSR(X2)
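These facts are easy to verify numerically. Below is a minimal sketch in R with
simulated data (not the Kutner crew example itself); the predictors are constructed
so that their sample correlation is exactly zero, and all names are illustrative.

# Two predictors with exactly zero sample correlation (balanced design)
x1 <- c(4, 4, 6, 6, 4, 4, 6, 6)
x2 <- c(2, 3, 2, 3, 2, 3, 2, 3)
cor(x1, x2)                      # exactly 0
set.seed(1)
y <- 10 + 2 * x1 + 3 * x2 + rnorm(8)
coef(lm(y ~ x1 + x2))["x1"]      # identical to the estimate from the SLR below
coef(lm(y ~ x1))["x1"]
anova(lm(y ~ x1 + x2))           # sequential Sum Sq for x1 equals SSR(X1) from
                                 # lm(y ~ x1), and similarly for x2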
An example: Kutner et al. (Table 7.6)
Example: effect of work crew size (X1) and level of bonus pay
(X2) on crew productivity (Y). X1 and X2 are uncorrelated.
Uncorrelated predictors
In general, if all p − 1 predictors are mutually uncorrelated:
• The effect of one predictor on the response does not depend
on whether the other predictors are in the model.
• Hence, we can get the effect of a predictor Xj on the
response Y just by fitting the SLR of Y on Xj.
• We do not go into the math of this conclusion, but intuitively,
when all the predictors are uncorrelated, they have “separate”
effects on the response.
• You will see this case again when we talk about experimental
designs.
Perfectly correlated predictors
The second (extreme) case is when one or more predictors are
perfectly correlated with one another.
• Essentially, that just means one predictor can be written as a
linear combination of some other predictor variables. In this
case, the design matrix X is not of full rank, i.e., rank(X) < p.
• Recall the normal equations for OLS:
X^T X b = X^T y,
and rank(X^T X) = rank(X). Hence, in this case, the matrix
X^T X is also not of full rank, and there is an infinite number
of solutions for b (see the sketch below).
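In R, lm() does not return the infinitely many solutions; it detects the rank
deficiency, drops the aliased column, and reports NA for its coefficient. A
minimal sketch with simulated data (all names are illustrative):

set.seed(2)
x1 <- rnorm(10)
x2 <- 2 * x1 + 1                 # x2 is an exact linear function of x1
y  <- 3 + x1 + rnorm(10)
fit <- lm(y ~ x1 + x2)
coef(fit)                        # coefficient of x2 is NA (aliased)
fitted(fit)                      # fitted values are still uniquely determined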
An example: Kutner et al. (Table 7.7)
The normal equations X^T X b = X^T y are

    [  4   26   33 ] [b0]   [  272 ]
    [ 26  204  232 ] [b1] = [ 2118 ]
    [ 33  232  281 ] [b2]   [ 2419 ]

The fitted value ŷ4 is

    ŷ4 = [1  10  10] [b0, b1, b2]^T
       = [−0.4  0.1  0] (X^T X) [b0, b1, b2]^T
       = [−0.4  0.1  0] X^T y
       = −0.4(272) + 0.1(2118) + 0(2419) = 103,

which is the same for every solution b of the normal equations, since the row
(1, 10, 10) can be written as [−0.4  0.1  0] (X^T X).
Perfectly correlated predictors
• Though we have infinitely many solutions for b, all
solutions give the same fitted values (and residuals).
• Therefore, while the individual coefficients in b have no
interpretation, the model can still provide a good fit to the data.
Highly correlated predictors
Although the cases above are extreme, in reality it is very
common to find that many predictors are highly correlated. After all,
highly correlated variables are inherent characteristics of the
population of interest.
• Example: in a regression of food expenditure on income, savings,
age of the head of household, educational level, etc., all the
predictors are correlated with one another.
• Mathematically, although the design matrix X and the matrix
X^T X still have full rank, the inverse V = (X^T X)^(−1)
becomes unstable.
• Recall that Var(β̂) = σ²V, so multicollinearity inflates the
variance of the OLS estimator (see the sketch below).
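A minimal sketch of this instability, using simulated data (all names are
illustrative): the diagonal entries of V = (X^T X)^(−1), and hence the variances
of the OLS estimates, are much larger when two predictors are nearly collinear.

set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2_hi <- x1 + rnorm(n, sd = 0.05)               # nearly collinear with x1
x2_lo <- rnorm(n)                               # roughly uncorrelated with x1
V_hi <- solve(crossprod(cbind(1, x1, x2_hi)))   # (X^T X)^(-1), collinear case
V_lo <- solve(crossprod(cbind(1, x1, x2_lo)))   # (X^T X)^(-1), uncorrelated case
diag(V_hi)     # large entries: Var(beta_hat) = sigma^2 * V is inflated
diag(V_lo)     # much smaller entries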
Body fat example (n=20)
[Scatterplot matrix of X1, X2, X3 and Y. Pairwise correlations:
r(X1, X2) = 0.924, r(X1, X3) = 0.458, r(X2, X3) = 0.085;
correlations with Y: 0.843 (X1), 0.878 (X2), 0.142 (X3).]
Effect on sum of squares
Model                  SSE
y ∼ X1                 143.12
y ∼ X2                 113.42
y ∼ X1 + X2            109.95
y ∼ X1 + X2 + X3        98.40
Since X1 and X2 are highly correlated, the model including both
X1 and X2 reduces SSE very little compared to the SSE of the
model that has only X2.
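These SSE values can be obtained in R with deviance(), which returns the residual
sum of squares of an lm fit. The sketch below assumes a data frame named bodyfat
with columns y, X1, X2, X3 holding these data; the name is hypothetical and the
data are not loaded here.

# Hypothetical data frame `bodyfat` with columns y, X1, X2, X3
deviance(lm(y ~ X1, data = bodyfat))             # 143.12
deviance(lm(y ~ X2, data = bodyfat))             # 113.42
deviance(lm(y ~ X1 + X2, data = bodyfat))        # 109.95
deviance(lm(y ~ X1 + X2 + X3, data = bodyfat))   # 98.40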
Effect on regression coefficients
Model                  β̂1     se(β̂1)   t1      β̂2     se(β̂2)   t2
y ∼ X1                 0.86    0.13     6.66*   —       —        —
y ∼ X2                 —       —        —       0.86    0.11     7.78*
y ∼ X1 + X2            0.22    0.30     0.73    0.66    0.29     2.26*
y ∼ X1 + X2 + X3       4.34    3.01     1.44   −2.86    2.58    −1.37
*: the t-test of β1 = 0 or β2 = 0 is significant at α = 0.05
• When predictors are highly correlated, regression parameters
are difficult to interpret. When new variables are added, the
OLS estimates can change dramatically.
Effect on regression coefficients
Now consider the last model y ∼ X1 + X2 + X3. We see from the
last table that neither the t-test for β1 = 0 nor the t-test for β2 = 0 is
significant. However, consider the F-test:
H0 : β1 = β2 = 0
H1 : at least one of β1 and β2 is non-zero.
Under H0, the model is y ∼ X3, so SSE(H0) = 485.34 and
df(H0) = 18. Hence, the corresponding F statistic is

    F = [(485.34 − 98.40)/(18 − 16)] / (98.40/16) = 31.46

with p-value ≈ 0. Hence, we reject H0 and conclude that at
least one of β1 and β2 is non-zero (see the sketch below).
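A sketch of the same F-test in R, again assuming the hypothetical bodyfat data
frame introduced above:

fit_reduced <- lm(y ~ X3, data = bodyfat)             # model under H0
fit_full    <- lm(y ~ X1 + X2 + X3, data = bodyfat)   # full model
anova(fit_reduced, fit_full)                          # F about 31.46 on (2, 16) df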
Effect on regression coefficients
Q: Are the results of the F-test and the t-tests contradictory?
A: No, because they test two different things.
• The t-test for β1 = 0 assumes X2 is already included in the
model, and similarly, the t-test for β2 = 0 assumes X1 is already
included in the model.
• On the other hand, the F-test essentially checks whether at
least one of X1 and X2 has to be in the model.
• Such seemingly contradictory results between the marginal t-tests and the
F-test are common when the predictors are highly correlated,
so do not just look at the p-values of the t-tests in the
summary table to decide which variables should be kept.
Implications
• Regression parameters are difficult to interpret.
- While it is mathematically true to say, 'when all the other
covariates are held fixed, a change of one unit in X1 leads to a
change of β̂1 units in the outcome,' in practice you are unable to
change X1 while keeping the other covariates fixed.
- Possibly the only exception is when we conduct experiments
and are able to completely control the values of the
covariates.
• Multicollinearity is mostly an issue for X, and it does not
seriously affect the fitted values ŷ of the model.
Implications
• When p is large, multicollinearity tends to become more
serious, making variable and model selection more important.
• Variable selection and model evaluation have been covered in
DATA2x02:
- Information criteria: AIC, BIC
- Backward and forward selection
- Prediction performance: in-sample (R²) and out-of-sample
prediction (cross-validation)
• These topics will not be tested in the quiz/final exam,
but you are expected to use them in the project.
Polynomial regression
Polynomial regression
First, we consider models of the following form:

    yi = β0 + β1xi + β2xi^2 + ... + βp−1 xi^(p−1) + εi,   i = 1, ..., n.

There is only one outcome Y and one predictor X; however, unlike
simple linear regression, the regression function is a polynomial
of degree p − 1 with respect to X.
Fitting this model is essentially the same as fitting an MLR model
with the p − 1 predictors

    X1 = X, X2 = X^2, ..., Xp−1 = X^(p−1).
Polynomial regression
The design matrix is

    X = [ 1   x1   x1^2   ...   x1^(p−1) ]
        [ 1   x2   x2^2   ...   x2^(p−1) ]
        [ ⋮    ⋮     ⋮            ⋮       ]
        [ 1   xn   xn^2   ...   xn^(p−1) ]

and the OLS estimate for β is the same as in regular MLR:

    β̂ = (X^T X)^(−1) X^T y,

and then all other inferences can be done as previously.
Notes for polynomial regression
The polynomial terms are highly correlated.
• Example: cars data, Y = speed, X = dist.

    speed (Y)   dist (X1 = X)   X2 = X^2   X3 = X^3
    4           2               4          8
    4           10              100        1000
    ...         ...             ...        ...
    25          85              7225       614125

• Correlation matrix among the polynomial terms:

    rXX = [ 1.00  0.96  0.87 ]
          [ 0.96  1.00  0.97 ]
          [ 0.87  0.97  1.00 ]
Polynomial regression
A common step to reduce multicollinearity in polynomial regression
is to center the predictor before forming the higher-degree terms.

    dist (X1 = X)   X1* = X1 − x̄   X2* = (X1*)^2   X3* = (X1*)^3
    2               −41             (−41)^2         (−41)^3
    10              −33             (−33)^2         (−33)^3
    ...             ...             ...             ...
    85              42              (42)^2          (42)^3

with X̄ = 42.98. The correlation matrices after and before centering are

    rX*X* = [ 1.00  0.52  0.75 ]      rXX = [ 1.00  0.96  0.87 ]
            [ 0.52  1.00  0.83 ]            [ 0.96  1.00  0.97 ]
            [ 0.75  0.83  1.00 ]            [ 0.87  0.97  1.00 ]
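The comparison above can be reproduced with the cars data that ships with R:

x  <- cars$dist
xc <- x - mean(x)                       # centered predictor; mean(x) is about 42.98
round(cor(cbind(x, x^2, x^3)), 2)       # correlations among the raw polynomial terms
round(cor(cbind(xc, xc^2, xc^3)), 2)    # noticeably smaller after centering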
Polynomial regression
• Centering the data reduces multicollinearity, but does not
remove it completely.
• One possible way to remove multicollinearity completely in
polynomial regression is through the use of orthogonal
polynomials.
- More on the mathematics can be found on the Wikipedia page.
- In R, the set of orthogonal polynomials for a vector x up to
degree k can be created by the function poly(x, k); see the
sketch below.
• The main disadvantage of using orthogonal polynomials is that
the model is not so easy to interpret, since they transform the
covariate for mathematical convenience, without much
substantive meaning.
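A short illustration with the same cars data: the columns produced by poly() are
uncorrelated by construction, so the multicollinearity among the polynomial terms
disappears entirely.

P <- poly(cars$dist, 3)                 # orthogonal polynomials of degree 1, 2, 3
round(cor(P), 2)                        # identity matrix: zero correlations
fit_op <- lm(speed ~ poly(dist, 3), data = cars)   # regression on the orthogonal basis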
Polynomial regression
Another important aspect of polynomial regression is hierarchy:
• First, it is essential that higher-order terms are only
added sequentially. For example, do NOT add the cubic term
(X^3) unless the quadratic term (X^2) is already in the model.
Eg: Do NOT fit the models:

    yi = β0 + β1xi + β3xi^3 + εi
    yi = β0 + β2xi^2 + β3xi^3 + εi

The reason is that the lower-order terms provide more basic
information about the regression function, while the
higher-order terms only provide a refinement of that.
Polynomial regression
• On the other hand, if a polynomial term of a given degree is
retained, then all terms of lower degree should also be
retained in the model.
Eg: If you fit the cubic model

    yi = β0 + β1xi + β2xi^2 + β3xi^3 + εi,

and β3 is significant, never drop the linear term (xi) or the
quadratic term (xi^2), regardless of whether β1 or β2 is
significant.
Otherwise, continue to test whether β2 is significant. If it is,
never drop the linear term xi, regardless of whether β1 is
significant.
Polynomial regression
Finally, a good strategy for using polynomial regression in practice
is to add polynomial terms sequentially until the model fit does not
significantly improve.
Eg: mtcars data, Y = miles per gallon (mpg), X = weight (wt).
[Scatterplot of mpg versus wt for the mtcars dataset.]
Polynomial regression
First, we fit the linear model Y ∼ X using OLS. The slope is
significant: β̂1 = −5.34 with p-value < 0.0001.
[Scatterplot of mpg versus wt with the fitted linear regression line.]
From the scatterplot, we may suspect that including a quadratic
term would capture the relationship between mpg and wt better.
Polynomial regression
Before fitting the model with a quadratic term, we first center the
predictor wt to a new predictor wt_c to reduce multicollinearity:
wt_c = wt − w̄t = wt − 3.22. Then we fit the model with the quadratic
term, mpg ∼ wt_c + wt_c^2, using OLS. The coefficient of the quadratic
term is β̂2 = 1.17 and significant.
[Scatterplot of mpg versus wt_c with the fitted linear and quadratic curves.]
Polynomial regression
We may wonder whether we also need the cubic term wt_c^3. We go ahead and fit
the model mpg ∼ wt_c + wt_c^2 + wt_c^3 using OLS.
[Scatterplot of mpg versus wt_c with the fitted linear, quadratic (red),
and cubic (green) curves.]
The coefficient of wt_c^3 is non-significant, so we stop here. The best
model for us is the model with the quadratic term:

    m̂pg = 19 − 5.84 wt_c + 1.17 wt_c^2
        = 19 − 5.84 (wt − 3.22) + 1.17 (wt − 3.22)^2.
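A sketch of the whole sequence of fits described above; the mtcars data ship with
R, and the variable name wt_c is created here for the centered weight.

mtcars$wt_c <- mtcars$wt - mean(mtcars$wt)    # centered weight (mean is about 3.22)
fit_lin  <- lm(mpg ~ wt_c, data = mtcars)
fit_quad <- lm(mpg ~ wt_c + I(wt_c^2), data = mtcars)
fit_cub  <- lm(mpg ~ wt_c + I(wt_c^2) + I(wt_c^3), data = mtcars)
summary(fit_quad)   # quadratic coefficient about 1.17 and significant
summary(fit_cub)    # cubic coefficient not significant, so keep the quadratic model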
Polynomial regression
Due to this hierarchy, the sequential ANOVA table is particularly
useful. We can fit a model with a very high degree, look at the
sequential ANOVA table, and stop at the last row where the F-test is still
significant.
Using orthogonal polynomials
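A hedged sketch of this strategy with the mtcars example used above (degree 5
chosen arbitrarily for illustration): build the orthogonal polynomial columns
explicitly so that the sequential ANOVA table has one F-test per degree.

P <- poly(mtcars$wt, 5)                 # orthogonal polynomials of wt up to degree 5
fit5 <- lm(mtcars$mpg ~ P[, 1] + P[, 2] + P[, 3] + P[, 4] + P[, 5])
anova(fit5)    # one sequential F-test per degree; keep terms up to the
               # last significant row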
MLR with Qualitative Predictors
Qualitative Predictors
• Binary: yes/no, male/female, before/after, control/treatment
• Multiclass: type A/B/C, republican/democratic/neither
General principle: A qualitative variable with C classes can be
represented by C − 1 indicator variables (dummy variables), each
taking the value 0 or 1.
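In R, model.matrix() shows the indicator columns that lm() builds for a factor. A
minimal sketch with a made-up three-class factor: three classes produce C − 1 = 2
indicator columns, and the omitted class becomes the reference level.

grp <- factor(c("A", "B", "C", "A", "B", "C"))
model.matrix(~ grp)   # an intercept column plus indicators grpB and grpC;
                      # "A" is the reference level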
An example of binary predictor
First consider the mtcars dataset: regression of Y = miles per gallon
(mpg) on X1 = weight (wt) and engine shape (V-shaped or straight).
• Since engine shape is a qualitative predictor with C = 2
classes, we only need 1 indicator variable to represent it, say

    X2 = 1 if straight, 0 if V-shaped.

• The design matrix for the regression is formed as usual:

    X = [ 1   x11   0 ]
        [ 1   x21   1 ]
        [ ...         ]
        [ 1   xn1   0 ]
An example of binary predictor
What if we accidentally create another indicator variable,

    X3 = 1 if V-shaped, 0 if straight,

and form the design matrix with X1, X2 and X3?

    X = [ 1   x11   0   1 ]
        [ 1   x21   1   0 ]
        [ ...             ]
        [ 1   xn1   0   1 ]

Since X3 = 1 − X2, this design matrix has perfect
multicollinearity: column 1 (the intercept) equals column 3 +
column 4. Hence, it is essential that we create only C − 1
indicator variables (here, only 1) for each qualitative variable.
An example of binary predictor
Let’s take a closer look at the regression function:
Yi = β0 + β1Xi1 + β2Xi2 + εi
If X2 = 0, then we have
E(Y ) = β0 + β1X1
while if X2 = 1, then we have
E(Y ) = (β0 + β2) + β1X1
so the two regression functions have the same slope β1 but
different intercepts. The coefficient β2 represents that difference
in intercepts, which is the same at every level of X1 (weight).
An example of binary predictor
[Scatterplot of mpg versus wt by engine shape (vs = 0: V-shaped, vs = 1: straight),
with two parallel fitted lines: same slope, different intercepts.]
Interaction with quantitative predictors
Furthermore, we can let the qualitative and quantitative variables
interact with each other.
yi = β0 + β1xi1 + β2xi2 + β3xi1xi2 + εi,
so if X2 = 0, then
E(Y ) = β0 + β1X1,
while if X2 = 1 then
E(Y ) = (β0 + β2) + (β1 + β3)X1.
As a result, the two regression functions have different slopes and
different intercepts.
Interaction with quantitative predictors
[Scatterplot of mpg versus wt by engine shape (vs = 0, 1), with two fitted lines:
different slopes and different intercepts.]
Interaction with quantitative predictors
Fitting and inference for interaction models (not only those between
quantitative and qualitative predictors) also follow the hierarchy principle.
• Do not include an interaction term unless all the main effect
terms are included.
- Do not fit the model Y ∼ X1 + X1X2, since it lacks the main
effect term X2.
- Do not fit Y ∼ X1 + X2 + X1^2 X2, since it lacks the main
effect term X1^2.
• If the interaction term is significant, then do not drop the
main effect terms, regardless of whether they are significant or
not.
- Eg: If β3 in the model Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
is significant, then do not drop the main effect terms β1X1 and
β2X2, even when β1 and β2 are not significant.
Syntax in R
Syntax in R     Meaning
X1 + X2         main effects of both X1 and X2
X1:X2           second-order interaction effect of X1 and X2
X1*X2           all main and interaction effects of X1 and X2
X1:X2:X3        third-order interaction effect of X1, X2 and X3
X1*X2*X3        all main, second-order, and third-order interaction
                effects among X1, X2, X3

For example, lm(y ∼ X1*X2*X3) is the same as
lm(y ∼ X1 + X2 + X3 + X1:X2 + X2:X3 + X1:X3 + X1:X2:X3).
Fitting model with qualitative predictors in R
• If you want a predictor to be treated as qualitative, you have to
convert it to a factor:
mtcars = mtcars %>% mutate(vs = factor(vs))
• After that, you can put the variable into the lm command.
For example, here I fit a model where the two types of engine
shape have the same slope but different intercepts:

    yi = β0 + β1xi1 + β2xi2 + εi,

and the coefficient β2 represents the difference in the
intercepts of the two engine types.
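A hedged sketch of the lm call for this fit; the numbers in the comment are those
quoted in the fitted equations below.

fit_vs <- lm(mpg ~ wt + vs, data = mtcars)   # vs was converted to a factor above
summary(fit_vs)    # intercept about 33.00, wt about -4.44, vs1 about +3.15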
Fitting model with qualitative predictors in R
From this output, the regression equation for each type of engine is

    Ŷ = 33 − 4.44 X1                               for V-shaped (X2 = 0)
    Ŷ = (33 + 3.15) − 4.44 X1 = 36.15 − 4.44 X1    for straight (X2 = 1)
Fitting model with qualitative predictors in R
I can also fit a model with an interaction between weight and engine shape:

    yi = β0 + β1xi1 + β2xi2 + β3xi1xi2 + εi

    Ŷ = 29.53 − 3.50 X1                                          for V-shaped (X2 = 0)
    Ŷ = (29.53 + 11.77) + (−3.50 − 2.90) X1 = 41.30 − 6.40 X1    for straight (X2 = 1)
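A sketch of the interaction fit; the commented values are those quoted above.

fit_int <- lm(mpg ~ wt * vs, data = mtcars)  # main effects of wt and vs plus wt:vs
summary(fit_int)   # intercept 29.53, wt -3.50, vs1 11.77, wt:vs1 -2.90 (approximately)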
Multi-Classes qualitative predictor
Consider a regression of tool wear (Y) on speed (X1) and tool
type, where tool type is a qualitative variable with C = 4 classes
(M1, M2, M3, M4). We need C − 1 = 3 indicator variables:

    X2 = 1 if M1, 0 otherwise;
    X3 = 1 if M2, 0 otherwise;
    X4 = 1 if M3, 0 otherwise,

and the regression function is

    yi = β0 + β1xi1 + β2xi2 + β3xi3 + β4xi4 + εi.

The class of the qualitative predictor for which all indicators
are 0 is called the base or reference level; here it is M4.
The coefficients β2, β3, β4 correspond to the differences in
intercept between the other tool types and the base level.
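A hedged sketch in R, assuming a data frame named toolwear with columns wear, speed,
and type (a factor with levels M1-M4); the name and columns are hypothetical and the
data are not loaded here. relevel() sets M4 as the base level so that the coefficients
match the parameterisation above.

toolwear$type <- relevel(factor(toolwear$type), ref = "M4")   # make M4 the reference level
fit_tool <- lm(wear ~ speed + type, data = toolwear)
summary(fit_tool)   # typeM1, typeM2, typeM3 give the intercept differences from M4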
Notes on qualitative predictor
• One can have models with more than one qualitative predictor,
as well as interactions between them (but this quickly leads to a
large number of predictors).
• While indicator variables are the most common way to handle
qualitative predictors, alternative codings are possible, so
be careful with the interpretation of the coefficients.
Seattle housing data example
• Model the log housing price (Y) with two potential predictors:
log square feet (X1) and the condition of the house. Condition is
a qualitative predictor with 5 levels, from 1 to 5.
• Choose condition 1 as the base level and fit the model including
the interaction of log square feet and condition:

    yi = β0 + β1x1i + β2x2i + ... + β5x5i + β6x1ix2i + ... + β9x1ix5i + εi,

where xji = 1 if the condition of observation i is j (j = 2, ..., 5), and 0
otherwise.
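A hedged sketch of this fit, assuming a data frame named seattle with columns price,
sqft_living, and condition (hypothetical name; the data are not loaded here):

seattle$condition <- factor(seattle$condition)    # treat condition as qualitative;
                                                  # level 1 is the base level
fit_house <- lm(log(price) ~ log(sqft_living) * condition, data = seattle)
summary(fit_house)   # main effects plus interaction terms for conditions 2-5
                     # versus condition 1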
An example
We can write out the estimated regression function for each
condition to be
    Ŷ = 7.67 + 0.69 X1                          condition 1
    Ŷ = (7.67 + 0.11) + (0.69 − 0.02) X1        condition 2
    Ŷ = (7.67 − 0.84) + (0.69 + 0.13) X1        condition 3
    Ŷ = (7.67 − 0.91) + (0.69 + 0.14) X1        condition 4
    Ŷ = (7.67 − 2.03) + (0.69 + 0.30) X1        condition 5
An example
We can visualize the relationships between log housing price and
log square feet corresponding to each condition.
[Plot of log(price) versus log(sqft_living), with a fitted line for each
condition level (1-5).]
Hypotheses testing with qualitative variables
A qualitative variable has a significant (additive or interaction) effect
when at least one coefficient of the corresponding indicator
variables is significant.
In the previous example, condition has an interaction effect with
area if at least one of β6, ..., β9 is not equal to zero:

    H0 : β6 = ... = β9 = 0   vs   H1 : at least one of β6, ..., β9 ≠ 0.

As such, the sequential ANOVA table in R always gives the extra
sum of squares for the whole group, SSR(X6, ..., X9 | X1, ..., X5), not for
each separate indicator variable.
• What is the degree of freedom of SSR(X6, . . . , X9|X1, . . . , X5)?
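A sketch of this group test, continuing the hypothetical seattle example above:
compare the additive model with the interaction model, or read the pooled
interaction row from the sequential ANOVA table.

fit_add <- lm(log(price) ~ log(sqft_living) + condition, data = seattle)  # no interaction
anova(fit_add, fit_house)   # F-test of H0: beta6 = ... = beta9 = 0
anova(fit_house)            # sequential table: the interaction row pools
                            # SSR(X6, ..., X9 | X1, ..., X5)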
Hypothesis testing with qualitative variables
Likewise, the Sum Sq for condition is SSR(X2, ..., X5 | X1), with 4
degrees of freedom.