程序代写案例-GR 5205 004

Logistic Regression
Instructor: Yuqi Gu
Linear Regression Models, Fall 2022
Statistics GR 5205 004 / GU 4205 005
Columbia University
1 r>Binary Response Variable
▶ In many regression applications, the response Y has only two possible
qualitative outcomes:
– Financial status of firm: sound status/headed toward insolvency
– Coronary heart disease status: has the disease/does not have the disease
– In a study of labor force participation of married women: married woman
in labor force/married woman not in labor force
▶ The constraint on the mean responses to belong to [0, 1] rule out a
linear response function
2
Heart Disease Data
▶ Response: “chd”: indicates whether the person has heart disease or not
▶ The men vary in height (in inches) and the number of cigarettes
(cigs) smoked per day
> data(wcgs, package="faraway")
> summary(wcgs[,c("chd","height","cigs")])
chd height cigs
no :2897 Min. :60.00 Min. : 0.0
yes: 257 1st Qu.:68.00 1st Qu.: 0.0
Median :70.00 Median : 0.0
Mean :69.78 Mean :11.6
3rd Qu.:72.00 3rd Qu.:20.0
Max. :78.00 Max. :99.0
3
Heart Disease Data
> plot(height ~ chd, wcgs)
> wcgs$y <- ifelse(wcgs$chd == "no",0,1)
> plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",
+ ylab="Heart Disease", pch=".")
Figure: Plots of the presence/absence of heart disease according to height.
4
Heart Disease Data
▶ Predict heart disease and explain the relationship between height,
cigarette usage and heart disease.
▶ For the same height and cigs, both outcomes occur. So we model the
probability of getting heart disease P(Y = 1 | X ) rather than Y itself
Figure: Interleaved histograms of the distribution of heights and cigarette usage for
men with and without heart disease.
5
Logistic Regression
Logistic regression defines the probability mass function:
P(Y = 1 | X) = exp(β
⊤X)
1 + exp(β⊤X)
which implies that
P(Y = 0 | X) = 1− P(Y = 1 | X) = 1
1 + exp(β⊤X)
where X is a (p + 1)-dim. vector with X0 ≡ 1, and β0 is the intercept
6
Logistic Regression
This plot shows P(Y = 1 | X) and P(Y = 0 | X), plotted as functions of β⊤X
7
Logistic Regression
▶ The logit function
logit(x) = log
(
x
1− x
)
maps the unit interval (0, 1) to the entire real line (−∞,∞)
▶ The inverse logit function, or expit function
expit(x) = logit−1(x) =
exp(x)
1 + exp(x)
maps the real line to the unit interval
▶ In logistic regression, the inverse logit function is used to map the linear
predictor β⊤X to a probability of Y = 1:
P(Y = 1 | X) = logit−1(β⊤X)
8
Logistic Regression
Geometric interpretation: a logistic regression fit based on two predictors can
be represented by a S-shape surface in the 3D space
9
Logistic Regression
▶ The linear predictor in logistic regression is the conditional log odds:
log
[
P(Y = 1 | X)
P(Y = 0 | X)
]
= β⊤X = β0 + β1X1 + · · ·+ βpXp
▶ Interpret logistic regression: a one unit increase in Xj results in a change
of βj in the (conditional) log odds
▶ Or: a one unit increase in Xj results in a multiplicative change of exp(βj)
in the conditional odds
▶ exp(βj) is also called the odds ratio, as it is the ratio of the two odds,
corresponding to two scenarios where the values of Xj differ by one unit
10
Heart Disease Example
> lmod <- glm(chd ~ height + cigs, family = binomial, wcgs)
> summary(lmod)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0041 -0.4425 -0.3630 -0.3499 2.4357
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.50161 1.84186 -2.444 0.0145 *
height 0.02521 0.02633 0.957 0.3383
cigs 0.02313 0.00404 5.724 1.04e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1781.2 on 3153 degrees of freedom
Residual deviance: 1749.0 on 3151 degrees of freedom
AIC: 1755
Number of Fisher Scoring iterations: 5 11
Heart Disease Example
(beta <- coef(lmod))
plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",
ylab="Heart Disease",pch=".")
curve(ilogit(beta[1] + beta[2]*x + beta[3]*0),add=TRUE)
curve(ilogit(beta[1] + beta[2]*x + beta[3]*20),add=TRUE,lty=2)
plot(jitter(y,0.1) ~ jitter(cigs), wcgs, xlab="Cigarette Use",
ylab="Heart Disease",pch=".")
curve(ilogit(beta[1] + beta[2]*60 + beta[3]*x),add=TRUE)
curve(ilogit(beta[1] + beta[2]*78 + beta[3]*x),add=TRUE,lty=2)
12
Heart Disease Example
Figure: Predicted probability of heart disease. Left: solid line represents a
nonsmoker, the dashed line is a pack-a-day smoker. Right: solid line represents a
very short man (60 in.), the dashed line represents a very tall man (78 in.)
13
Latent variable model for logistic regression
▶ It may make sense to view the binary outcome Y as being a
dichotomization of a latent continuous outcome Yc :
Y = I(Yc ≥ 0)
▶ Suppose Yc | X follows a logistic distribution with CDF
F (Yc | X) = exp(Yc − β
⊤X)
1 + exp(Yc − β⊤X)
▶ In this case, Y | X follows the logistic regression model:
P(Y = 1 | X) = P(Yc ≥ 0 | X) = 1− exp(0− β
⊤X)
1 + exp(0− β⊤X)
=
exp(β⊤X)
1 + exp(β⊤X)
14
Mean and variance relationship for logistic regression
▶ Since Y | X follows Bernoulli(logit(β⊤X)), the mean is
E[Y | X] = P(Y = 1 | X) = exp(β
⊤X)
1 + exp(β⊤X)
And the variance is
Var[Y | X] = P(Y = 1 | X) · P(Y = 0 | X)
=
exp(β⊤X)
(1 + exp(β⊤X))2
▶ Since the variance depends on X, logistic regression models are always
heteroscedastic (unequal error variances)
15
Estimation in logistic regression
▶ Assuming independent observations (x1, y1), . . . , (xn, yn), the
log-likelihood for logistic regression is
L(β | Y,X) = log
n∏
i=1
(
exp(β⊤xi )
1 + exp(β⊤xi )
)yi (
1
1 + exp(β⊤xi )
)1−yi
= log
n∏
i=1
exp(yi · β⊤xi )
1 + exp(β⊤xi )
=
n∑
i=1
yi · β⊤xi −
n∑
i=1
log(1 + exp(β⊤xi ))
▶ This likelihood is for the conditional distribution of Y given X. As in
linear regression, we do not model the marginal distribution of X
16
Estimation in logistic regression
▶ Logistic regression models are usually fit using maximum likelihood
estimation
▶ This means that the parametric likelihood above is maximized as a
function of β
▶ The gradient of the log-likelihood function (the score function) is
∇L(β | Y,X) = ∂
∂β
L(β | Y,X)
=
n∑
i=1
yix i −
n∑
i=1
exp(β⊤xi )
1 + exp(β⊤xi )
xi
=
N∑
i=1
xi
(
yi − exp(β
⊤xi )
1 + exp(β⊤xi )
)
This is a (p + 1)× 1 vector
17
Estimation in logistic regression
▶ The Hessian of the log-likelihood is
H(β | Y,X) = ∂
2
∂ββ⊤
L(β | Y,X)
= −
n∑
i=1
exp(β⊤xi )
(1 + exp(β⊤xi ))2
xix

i
This is (p + 1)× (p + 1) symmetric matrix
▶ The Hessian is strictly negative definite as long as the design matrix
Xn×(p+1) has independent columns (full column rank (p + 1))
(see proof on the next slide)
▶ Therefore L(β | Y,X) is a concave function of β, so has a unique
maximizer, and hence the MLE is unique
18
Hessian matrix in logistic regression
▶ Rewrite the Hessian matrix as
H(β | Y,X) = −X⊤

exp(β⊤x1)
(1+exp(β⊤x1))2
. . .
exp(β⊤xn)
(1+exp(β⊤xn))2
X
=: −X⊤DX
where the design matrix X of size n × (p + 1) is
X =

1 x11 · · · x1p
1 x21 · · · x2p
...
...
. . .
...
1 xn1 · · · xnp

▶ The n × n diagonal matrix D has positive diagonal entries, so if X has
full rank p + 1, the Hessian matrix is negative definite
19
Estimation in logistic regression
▶ We need to use some optimization method to numerically find the MLE
under logistic regression, because it has no closed form
▶ Newton-Rapson (Newton’s) Numerical Optimization Method:
β(t+1) = β(t) − H
(
β(t)
)−1
∇L(β(t)),
where t = 0, 1, 2, . . . index the iterations
▶ Newton’s method for logistic regression can be shown to be equivalent
to a version of Iteratively (Re-)Weighted Least Squares (IRLS)
20
MLE of logistic regression
▶ Since β̂ is an MLE for a regular problem, it is consistent, asymptotically
unbiased, and asymptotically normal
The estimated approximate covariance matrix of β̂ (also the in-
verse of the Fisher information matrix):
S2(β̂) = −H(β̂ | Y,X)−1
There is
β̂
approx.∼ N(β,S2(β̂))
▶ This provides a useful basis for statistical inference (confidence intervals,
hypothesis testing)
▶ Note: such inferences rely on large sample sizes (large n)
21
Inference for logistic regression
▶ When n is large,
β̂k − βk
S(β̂k)
approx.∼ N(0, 1), k = 0, 1, . . . , p
▶ Consider
H0 : βk = 0
Ha : βk ̸= 0
▶ Wald test: test statistic
z∗ =
β̂k
S(β̂k)
Decision rule:
If |z∗| > z(1− α/2), reject H0
If |z∗| ≤ z(1− α/2), fail to reject H0
22
Likelihood Ratio Test
▶ Likelihood Ratio Test (LRT) can be used to test whether several βk = 0.
▶ LRT is a general procedure for use with maximum likelihood estimation,
analogous to the general linear test procedure (F test) for linear models
▶ LRT is also based on a comparison of full and reduced models, and valid
for large sample sizes
H0 : βq+1 = · · · = βp = 0
Ha : not all of the βk in H0 equal zero
▶ Reduced model: (π = P(Y = 1 | X))
π = logit−1(β⊤RX)
where β⊤RX = β0 + β1X1 + · · ·+ βqXq
23
Likelihood Ratio Test
▶ The maximum likelihood under the reduced model, L(Reduced), can not
exceed that under the full model, L(Full)
▶ Actual test statistic for the LRT, denoted by G 2, is:
G 2 = −2 log
[
L(Reduced)
L(Full)
]
▶ If the ratio L(Reduced)/L(Full) is small, indicating that H0 seems
inappropriate, then G 2 is large
▶ Large-sample theory of LRT states that: when n is large, G 2 is
distributed approximately as χ2(p − q) when H0 holds
p − q = (n − q − 1)− (n − p − 1)
▶ Decision rule: If G 2 > χ2(1− α; p − q), reject H0
24
Deviance in logistic regression
▶ In logistic regression analysis, deviance is used instead of a sum of
squares calculations in linear regression
Deviance = −2× the maximum log likelihood
▶ Deviance is a measure of the goodness of fit to the data in a logistic
regression model (analogous to R2 in linear models)
▶ The smaller the deviance, the better fit is the model to the training data
25
Heart Disease Data Example
> sumary(lmod)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.5016140 1.8418627 -2.4441 0.01452
height 0.0252078 0.0263274 0.9575 0.33833
cigs 0.0231274 0.0040402 5.7243 1.038e-08
n = 3154 p = 3
Deviance = 1749.04923 Null Deviance = 1781.24374 (Difference = 32.19451)
▶ Deviance is the deviance of the current model; Null Deviance is that for
a model with not predictors but an intercept
▶ Use the deviance to compare two nested models: The LRT becomes
G 2 = DReduced − DFull approx.∼ χ22
▶ The p-value for testing whether βheight = βcigs = 0 is
> 1-pchisq(32.2, 2)
[1] 1.01826e-07
26
Heart Disease Data Example
We can test the individual predictors by fitting models that drop the
predictors and computing the difference in deviance.
> lmodc <- glm(chd ~ cigs, family = binomial, wcgs)
> anova(lmodc, lmod, test="Chi")
Analysis of Deviance Table
Model 1: chd ~ cigs
Model 2: chd ~ height + cigs
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 3152 1750
2 3151 1749 1 0.92025 0.3374
So height is not significant in a model that already includes cigarette consumption
27
Heart Disease Data Example
We can test all the predictors in this fashion:
> drop1(lmod, test="Chi")
Single term deletions
Model:
chd ~ height + cigs
Df Deviance AIC LRT Pr(>Chi)
1749.0 1755.0
height 1 1750.0 1754.0 0.9202 0.3374
cigs 1 1780.1 1784.1 31.0695 2.49e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
An alternative to this test is the aforementioned z-value β̂/S(β̂); recall:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.5016140 1.8418627 -2.4441 0.01452
height 0.0252078 0.0263274 0.9575 0.33833
cigs 0.0231274 0.0040402 5.7243 1.038e-08
28
Heart Disease Data Example
▶ In this example, the p-values from the two tests are similar
▶ The outcomes of the deviance-based test and z-value-based test are not
identical, unlike the linear regression with normal errors
▶ In general, deviance-based test is preferred
29
Inference for logistic regression
▶ Approximate 1− α confidence interval for βk :
β̂k ± z(1− α/2)S(β̂k)
▶ Corresponding confidence limits for the odds ratio exp(βk) are
exp
{
β̂k ± z(1− α/2)S(β̂k)
}
▶ 95% confidence interval for β1:
> 0.02521 + c(-1,1) * 1.96 * 0.02633
[1] -0.0263968 0.0768168
A better way is to construct a profile likelihood-based confidence interval:
> confint(lmod)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -8.13475465 -0.91297018
height -0.02619902 0.07702835
cigs 0.01514949 0.03100534
30
Model selection methods
▶ For logistic regression, the AIC and BIC formulas incorporate the
log-likelihood function instead of the RSS (Residual Sum of Squares)
▶ For a model containing k predictors,
AIC = −2 log(L) + 2(k + 1)
BIC = −2 log(L) + (k + 1) log(n)
where L refers to the maximum of the likelihood
▶ As with multiple linear regression, the smaller the AIC/BIC, the better
the model
▶ BIC generally would favor a more parsimonious model, i.e., a model
containing fewer predictors
31
Prediction of a New Observation
To predict yh, a cutoff point τ for π̂h = logit
−1(β̂ ⊤xh) needs to be
determined: Predict 1 if π̂h ≥ τ ; predict 0 if π̂h < τ
32
Inferences about Mean Response
▶ Given
xh = (1, xh1, . . . , xhp)

the population mean response is
πh =
1
1 + exp(−x⊤i β)
and its point estimator is
π̂h =
1
1 + exp(−x⊤i β̂)
▶ Interval estimation for πh: first calculate confidence limits for the linear
predictor x⊤h β, then apply the inverse logit function to obtain confidence
limits for πh
33
Inferences about Mean Response
▶ Recall the large sample approximate distribution for β̂ is
β̂
approx.∼ N(β,−H(β̂ | Y,X)−1)
so
x⊤h β̂
approx.∼ N(x⊤h β, −x⊤h H(β̂ | Y,X)−1xh︸ ︷︷ ︸
denote this value by s2(x⊤h β̂)
)
▶ Approximate 1− α large-sample confidence limits for x⊤h β are
L = x⊤h β − z(1− α/2) · s(x⊤h β̂)
U = x⊤h β + z(1− α/2) · s(x⊤h β̂)
▶ Finally, using the monotonic transformation of the inverse logit function,
the approximate 1− α large-sample confidence limits for πh are
L∗ =
1
1 + exp(−L) , U
∗ =
1
1 + exp(−U)
34
Simultaneous confidence intervals
▶ We may estimate several mean responses πh corresponding to different
xh vectors (h = 1, . . . , g) with family confidence level 1− α
▶ Bonferroni simultaneous confidence intervals for g such mean responses:
Just replace the multiplier z(1− α/2) with z(1− α/(2g))
▶ Confidence limits for linear predictor terms, h = 1, . . . ,H
Lh = x

h β − z(1− α/(2g)) · s(x⊤h β̂)
Uh = x

h β + z(1− α/(2g)) · s(x⊤h β̂)
35
Polytomous Logistic Regression for Nominal Response
▶ When the response variable have > 2 qualitative levels, we can use
logistic regression with a polytomous or multicategory response:
– In business, relate a consumer’s choice of product (product A, B, C) to
the consumer’s age, gender, geographic location and other variables
▶ Suppose the response Y has J response categories and define
yij =
{
1 if case i response is category j
0 otherwise
Denote πij = P(yij = 1), then
∑J
j=1 πij = 1 and
∑J
j=1 yij = 1
▶ Recall the binary response case
log
{
P(yi = 1)
1− P(yi = 1)
}
=: x⊤i β1,base
If yi can be 1 or 2, then category 2 is the baseline category above
36
Polytomous Logistic Regression for Nominal Response
▶ For J polytomous categories, arbitrarily specify a baseline, say the Jth
one, as the baseline, and define
log
{
πij
πiJ
}
= x⊤i βj,base, j = 1, . . . , J − 1
▶ Logits for any other comparisions can be obtained from above:
log
{
πij1
πij2
}
= log
{
πij1
πiJ
· πiJ
πij2
}
= x⊤i βj1,base − x⊤i βj2,base
▶ After some algebra, we can show
πij =
exp(x⊤i βj,base)
1 +
∑J−1
j′=1 exp(x

i βj′,base)
, j = 1, . . . , J − 1
This is multinomial logistic regression
37
Maximum Likelihood Estimation
▶ Write βj,base simply as βj . We can use MLE to estimate β1, . . . ,βJ−1
▶ Since yi follows a Categorical distribution with parameters (πi1, . . . , πiJ),
for example, P(yi = 3) = πi3 =
J∏
j=1
π
yij
ij
▶ Log-likelihood for n independent observations and J categories:
L(β | Y,X) = log
n∏
i=1
 J∏
j=1
π
yij
ij

=
n∑
i=1
J−1∑
j=1
(yijx

i β)− log
1 + J−1∑
j=1
exp(x⊤i βj)

▶ The J − 1 fitted response functions π̂ij may be obtained by substituting
the MLE β̂j ’s into the expression of πij ’s
38
Polytomous Logistic Regression for Ordinal Response
▶ The ordinal response where categories of Y are ordered are also common:
– Employees asked to rate working conditions using a 7-point scale
(unacceptable, poor, fair, acceptable, good, excellent, outstanding)
– The severity of cancer is rated by stages on a 1-4 basis
▶ A more effective model for such responses explicitly takes into account
the orderings
▶ Propotional odds model :
P(yi ≤ j) = exp(αj + x

i β)
1 + exp(αj + x⊤i β)
, j = 1, . . . , J − 1
where there are J − 1 cumulative logits:
log
[
P(yi ≤ j)
1− P(yi ≤ j)
]
= αj + x

i β
MLE for this model can be similarly obtained
39

欢迎咨询51作业君
51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: abby12468