Instructor: Yuqi Gu

Linear Regression Models, Fall 2022

Statistics GR 5205 004 / GU 4205 005

Columbia University

1 r>Binary Response Variable

▶ In many regression applications, the response Y has only two possible

qualitative outcomes:

– Financial status of firm: sound status/headed toward insolvency

– Coronary heart disease status: has the disease/does not have the disease

– In a study of labor force participation of married women: married woman

in labor force/married woman not in labor force

▶ The constraint on the mean responses to belong to [0, 1] rule out a

linear response function

2

Heart Disease Data

▶ Response: “chd”: indicates whether the person has heart disease or not

▶ The men vary in height (in inches) and the number of cigarettes

(cigs) smoked per day

> data(wcgs, package="faraway")

> summary(wcgs[,c("chd","height","cigs")])

chd height cigs

no :2897 Min. :60.00 Min. : 0.0

yes: 257 1st Qu.:68.00 1st Qu.: 0.0

Median :70.00 Median : 0.0

Mean :69.78 Mean :11.6

3rd Qu.:72.00 3rd Qu.:20.0

Max. :78.00 Max. :99.0

3

Heart Disease Data

> plot(height ~ chd, wcgs)

> wcgs$y <- ifelse(wcgs$chd == "no",0,1)

> plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",

+ ylab="Heart Disease", pch=".")

Figure: Plots of the presence/absence of heart disease according to height.

4

Heart Disease Data

▶ Predict heart disease and explain the relationship between height,

cigarette usage and heart disease.

▶ For the same height and cigs, both outcomes occur. So we model the

probability of getting heart disease P(Y = 1 | X ) rather than Y itself

Figure: Interleaved histograms of the distribution of heights and cigarette usage for

men with and without heart disease.

5

Logistic Regression

Logistic regression defines the probability mass function:

P(Y = 1 | X) = exp(β

⊤X)

1 + exp(β⊤X)

which implies that

P(Y = 0 | X) = 1− P(Y = 1 | X) = 1

1 + exp(β⊤X)

where X is a (p + 1)-dim. vector with X0 ≡ 1, and β0 is the intercept

6

Logistic Regression

This plot shows P(Y = 1 | X) and P(Y = 0 | X), plotted as functions of β⊤X

7

Logistic Regression

▶ The logit function

logit(x) = log

(

x

1− x

)

maps the unit interval (0, 1) to the entire real line (−∞,∞)

▶ The inverse logit function, or expit function

expit(x) = logit−1(x) =

exp(x)

1 + exp(x)

maps the real line to the unit interval

▶ In logistic regression, the inverse logit function is used to map the linear

predictor β⊤X to a probability of Y = 1:

P(Y = 1 | X) = logit−1(β⊤X)

8

Logistic Regression

Geometric interpretation: a logistic regression fit based on two predictors can

be represented by a S-shape surface in the 3D space

9

Logistic Regression

▶ The linear predictor in logistic regression is the conditional log odds:

log

[

P(Y = 1 | X)

P(Y = 0 | X)

]

= β⊤X = β0 + β1X1 + · · ·+ βpXp

▶ Interpret logistic regression: a one unit increase in Xj results in a change

of βj in the (conditional) log odds

▶ Or: a one unit increase in Xj results in a multiplicative change of exp(βj)

in the conditional odds

▶ exp(βj) is also called the odds ratio, as it is the ratio of the two odds,

corresponding to two scenarios where the values of Xj differ by one unit

10

Heart Disease Example

> lmod <- glm(chd ~ height + cigs, family = binomial, wcgs)

> summary(lmod)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.0041 -0.4425 -0.3630 -0.3499 2.4357

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.50161 1.84186 -2.444 0.0145 *

height 0.02521 0.02633 0.957 0.3383

cigs 0.02313 0.00404 5.724 1.04e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1781.2 on 3153 degrees of freedom

Residual deviance: 1749.0 on 3151 degrees of freedom

AIC: 1755

Number of Fisher Scoring iterations: 5 11

Heart Disease Example

(beta <- coef(lmod))

plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",

ylab="Heart Disease",pch=".")

curve(ilogit(beta[1] + beta[2]*x + beta[3]*0),add=TRUE)

curve(ilogit(beta[1] + beta[2]*x + beta[3]*20),add=TRUE,lty=2)

plot(jitter(y,0.1) ~ jitter(cigs), wcgs, xlab="Cigarette Use",

ylab="Heart Disease",pch=".")

curve(ilogit(beta[1] + beta[2]*60 + beta[3]*x),add=TRUE)

curve(ilogit(beta[1] + beta[2]*78 + beta[3]*x),add=TRUE,lty=2)

12

Heart Disease Example

Figure: Predicted probability of heart disease. Left: solid line represents a

nonsmoker, the dashed line is a pack-a-day smoker. Right: solid line represents a

very short man (60 in.), the dashed line represents a very tall man (78 in.)

13

Latent variable model for logistic regression

▶ It may make sense to view the binary outcome Y as being a

dichotomization of a latent continuous outcome Yc :

Y = I(Yc ≥ 0)

▶ Suppose Yc | X follows a logistic distribution with CDF

F (Yc | X) = exp(Yc − β

⊤X)

1 + exp(Yc − β⊤X)

▶ In this case, Y | X follows the logistic regression model:

P(Y = 1 | X) = P(Yc ≥ 0 | X) = 1− exp(0− β

⊤X)

1 + exp(0− β⊤X)

=

exp(β⊤X)

1 + exp(β⊤X)

14

Mean and variance relationship for logistic regression

▶ Since Y | X follows Bernoulli(logit(β⊤X)), the mean is

E[Y | X] = P(Y = 1 | X) = exp(β

⊤X)

1 + exp(β⊤X)

And the variance is

Var[Y | X] = P(Y = 1 | X) · P(Y = 0 | X)

=

exp(β⊤X)

(1 + exp(β⊤X))2

▶ Since the variance depends on X, logistic regression models are always

heteroscedastic (unequal error variances)

15

Estimation in logistic regression

▶ Assuming independent observations (x1, y1), . . . , (xn, yn), the

log-likelihood for logistic regression is

L(β | Y,X) = log

n∏

i=1

(

exp(β⊤xi )

1 + exp(β⊤xi )

)yi (

1

1 + exp(β⊤xi )

)1−yi

= log

n∏

i=1

exp(yi · β⊤xi )

1 + exp(β⊤xi )

=

n∑

i=1

yi · β⊤xi −

n∑

i=1

log(1 + exp(β⊤xi ))

▶ This likelihood is for the conditional distribution of Y given X. As in

linear regression, we do not model the marginal distribution of X

16

Estimation in logistic regression

▶ Logistic regression models are usually fit using maximum likelihood

estimation

▶ This means that the parametric likelihood above is maximized as a

function of β

▶ The gradient of the log-likelihood function (the score function) is

∇L(β | Y,X) = ∂

∂β

L(β | Y,X)

=

n∑

i=1

yix i −

n∑

i=1

exp(β⊤xi )

1 + exp(β⊤xi )

xi

=

N∑

i=1

xi

(

yi − exp(β

⊤xi )

1 + exp(β⊤xi )

)

This is a (p + 1)× 1 vector

17

Estimation in logistic regression

▶ The Hessian of the log-likelihood is

H(β | Y,X) = ∂

2

∂ββ⊤

L(β | Y,X)

= −

n∑

i=1

exp(β⊤xi )

(1 + exp(β⊤xi ))2

xix

⊤

i

This is (p + 1)× (p + 1) symmetric matrix

▶ The Hessian is strictly negative definite as long as the design matrix

Xn×(p+1) has independent columns (full column rank (p + 1))

(see proof on the next slide)

▶ Therefore L(β | Y,X) is a concave function of β, so has a unique

maximizer, and hence the MLE is unique

18

Hessian matrix in logistic regression

▶ Rewrite the Hessian matrix as

H(β | Y,X) = −X⊤

exp(β⊤x1)

(1+exp(β⊤x1))2

. . .

exp(β⊤xn)

(1+exp(β⊤xn))2

X

=: −X⊤DX

where the design matrix X of size n × (p + 1) is

X =

1 x11 · · · x1p

1 x21 · · · x2p

...

...

. . .

...

1 xn1 · · · xnp

▶ The n × n diagonal matrix D has positive diagonal entries, so if X has

full rank p + 1, the Hessian matrix is negative definite

19

Estimation in logistic regression

▶ We need to use some optimization method to numerically find the MLE

under logistic regression, because it has no closed form

▶ Newton-Rapson (Newton’s) Numerical Optimization Method:

β(t+1) = β(t) − H

(

β(t)

)−1

∇L(β(t)),

where t = 0, 1, 2, . . . index the iterations

▶ Newton’s method for logistic regression can be shown to be equivalent

to a version of Iteratively (Re-)Weighted Least Squares (IRLS)

20

MLE of logistic regression

▶ Since β̂ is an MLE for a regular problem, it is consistent, asymptotically

unbiased, and asymptotically normal

The estimated approximate covariance matrix of β̂ (also the in-

verse of the Fisher information matrix):

S2(β̂) = −H(β̂ | Y,X)−1

There is

β̂

approx.∼ N(β,S2(β̂))

▶ This provides a useful basis for statistical inference (confidence intervals,

hypothesis testing)

▶ Note: such inferences rely on large sample sizes (large n)

21

Inference for logistic regression

▶ When n is large,

β̂k − βk

S(β̂k)

approx.∼ N(0, 1), k = 0, 1, . . . , p

▶ Consider

H0 : βk = 0

Ha : βk ̸= 0

▶ Wald test: test statistic

z∗ =

β̂k

S(β̂k)

Decision rule:

If |z∗| > z(1− α/2), reject H0

If |z∗| ≤ z(1− α/2), fail to reject H0

22

Likelihood Ratio Test

▶ Likelihood Ratio Test (LRT) can be used to test whether several βk = 0.

▶ LRT is a general procedure for use with maximum likelihood estimation,

analogous to the general linear test procedure (F test) for linear models

▶ LRT is also based on a comparison of full and reduced models, and valid

for large sample sizes

H0 : βq+1 = · · · = βp = 0

Ha : not all of the βk in H0 equal zero

▶ Reduced model: (π = P(Y = 1 | X))

π = logit−1(β⊤RX)

where β⊤RX = β0 + β1X1 + · · ·+ βqXq

23

Likelihood Ratio Test

▶ The maximum likelihood under the reduced model, L(Reduced), can not

exceed that under the full model, L(Full)

▶ Actual test statistic for the LRT, denoted by G 2, is:

G 2 = −2 log

[

L(Reduced)

L(Full)

]

▶ If the ratio L(Reduced)/L(Full) is small, indicating that H0 seems

inappropriate, then G 2 is large

▶ Large-sample theory of LRT states that: when n is large, G 2 is

distributed approximately as χ2(p − q) when H0 holds

p − q = (n − q − 1)− (n − p − 1)

▶ Decision rule: If G 2 > χ2(1− α; p − q), reject H0

24

Deviance in logistic regression

▶ In logistic regression analysis, deviance is used instead of a sum of

squares calculations in linear regression

Deviance = −2× the maximum log likelihood

▶ Deviance is a measure of the goodness of fit to the data in a logistic

regression model (analogous to R2 in linear models)

▶ The smaller the deviance, the better fit is the model to the training data

25

Heart Disease Data Example

> sumary(lmod)

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.5016140 1.8418627 -2.4441 0.01452

height 0.0252078 0.0263274 0.9575 0.33833

cigs 0.0231274 0.0040402 5.7243 1.038e-08

n = 3154 p = 3

Deviance = 1749.04923 Null Deviance = 1781.24374 (Difference = 32.19451)

▶ Deviance is the deviance of the current model; Null Deviance is that for

a model with not predictors but an intercept

▶ Use the deviance to compare two nested models: The LRT becomes

G 2 = DReduced − DFull approx.∼ χ22

▶ The p-value for testing whether βheight = βcigs = 0 is

> 1-pchisq(32.2, 2)

[1] 1.01826e-07

26

Heart Disease Data Example

We can test the individual predictors by fitting models that drop the

predictors and computing the difference in deviance.

> lmodc <- glm(chd ~ cigs, family = binomial, wcgs)

> anova(lmodc, lmod, test="Chi")

Analysis of Deviance Table

Model 1: chd ~ cigs

Model 2: chd ~ height + cigs

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 3152 1750

2 3151 1749 1 0.92025 0.3374

So height is not significant in a model that already includes cigarette consumption

27

Heart Disease Data Example

We can test all the predictors in this fashion:

> drop1(lmod, test="Chi")

Single term deletions

Model:

chd ~ height + cigs

Df Deviance AIC LRT Pr(>Chi)

height 1 1750.0 1754.0 0.9202 0.3374

cigs 1 1780.1 1784.1 31.0695 2.49e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

An alternative to this test is the aforementioned z-value β̂/S(β̂); recall:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.5016140 1.8418627 -2.4441 0.01452

height 0.0252078 0.0263274 0.9575 0.33833

cigs 0.0231274 0.0040402 5.7243 1.038e-08

28

Heart Disease Data Example

▶ In this example, the p-values from the two tests are similar

▶ The outcomes of the deviance-based test and z-value-based test are not

identical, unlike the linear regression with normal errors

▶ In general, deviance-based test is preferred

29

Inference for logistic regression

▶ Approximate 1− α confidence interval for βk :

β̂k ± z(1− α/2)S(β̂k)

▶ Corresponding confidence limits for the odds ratio exp(βk) are

exp

{

β̂k ± z(1− α/2)S(β̂k)

}

▶ 95% confidence interval for β1:

> 0.02521 + c(-1,1) * 1.96 * 0.02633

[1] -0.0263968 0.0768168

A better way is to construct a profile likelihood-based confidence interval:

> confint(lmod)

Waiting for profiling to be done...

2.5 % 97.5 %

(Intercept) -8.13475465 -0.91297018

height -0.02619902 0.07702835

cigs 0.01514949 0.03100534

30

Model selection methods

▶ For logistic regression, the AIC and BIC formulas incorporate the

log-likelihood function instead of the RSS (Residual Sum of Squares)

▶ For a model containing k predictors,

AIC = −2 log(L) + 2(k + 1)

BIC = −2 log(L) + (k + 1) log(n)

where L refers to the maximum of the likelihood

▶ As with multiple linear regression, the smaller the AIC/BIC, the better

the model

▶ BIC generally would favor a more parsimonious model, i.e., a model

containing fewer predictors

31

Prediction of a New Observation

To predict yh, a cutoff point τ for π̂h = logit

−1(β̂ ⊤xh) needs to be

determined: Predict 1 if π̂h ≥ τ ; predict 0 if π̂h < τ

32

Inferences about Mean Response

▶ Given

xh = (1, xh1, . . . , xhp)

⊤

the population mean response is

πh =

1

1 + exp(−x⊤i β)

and its point estimator is

π̂h =

1

1 + exp(−x⊤i β̂)

▶ Interval estimation for πh: first calculate confidence limits for the linear

predictor x⊤h β, then apply the inverse logit function to obtain confidence

limits for πh

33

Inferences about Mean Response

▶ Recall the large sample approximate distribution for β̂ is

β̂

approx.∼ N(β,−H(β̂ | Y,X)−1)

so

x⊤h β̂

approx.∼ N(x⊤h β, −x⊤h H(β̂ | Y,X)−1xh︸ ︷︷ ︸

denote this value by s2(x⊤h β̂)

)

▶ Approximate 1− α large-sample confidence limits for x⊤h β are

L = x⊤h β − z(1− α/2) · s(x⊤h β̂)

U = x⊤h β + z(1− α/2) · s(x⊤h β̂)

▶ Finally, using the monotonic transformation of the inverse logit function,

the approximate 1− α large-sample confidence limits for πh are

L∗ =

1

1 + exp(−L) , U

∗ =

1

1 + exp(−U)

34

Simultaneous confidence intervals

▶ We may estimate several mean responses πh corresponding to different

xh vectors (h = 1, . . . , g) with family confidence level 1− α

▶ Bonferroni simultaneous confidence intervals for g such mean responses:

Just replace the multiplier z(1− α/2) with z(1− α/(2g))

▶ Confidence limits for linear predictor terms, h = 1, . . . ,H

Lh = x

⊤

h β − z(1− α/(2g)) · s(x⊤h β̂)

Uh = x

⊤

h β + z(1− α/(2g)) · s(x⊤h β̂)

35

Polytomous Logistic Regression for Nominal Response

▶ When the response variable have > 2 qualitative levels, we can use

logistic regression with a polytomous or multicategory response:

– In business, relate a consumer’s choice of product (product A, B, C) to

the consumer’s age, gender, geographic location and other variables

▶ Suppose the response Y has J response categories and define

yij =

{

1 if case i response is category j

0 otherwise

Denote πij = P(yij = 1), then

∑J

j=1 πij = 1 and

∑J

j=1 yij = 1

▶ Recall the binary response case

log

{

P(yi = 1)

1− P(yi = 1)

}

=: x⊤i β1,base

If yi can be 1 or 2, then category 2 is the baseline category above

36

Polytomous Logistic Regression for Nominal Response

▶ For J polytomous categories, arbitrarily specify a baseline, say the Jth

one, as the baseline, and define

log

{

πij

πiJ

}

= x⊤i βj,base, j = 1, . . . , J − 1

▶ Logits for any other comparisions can be obtained from above:

log

{

πij1

πij2

}

= log

{

πij1

πiJ

· πiJ

πij2

}

= x⊤i βj1,base − x⊤i βj2,base

▶ After some algebra, we can show

πij =

exp(x⊤i βj,base)

1 +

∑J−1

j′=1 exp(x

⊤

i βj′,base)

, j = 1, . . . , J − 1

This is multinomial logistic regression

37

Maximum Likelihood Estimation

▶ Write βj,base simply as βj . We can use MLE to estimate β1, . . . ,βJ−1

▶ Since yi follows a Categorical distribution with parameters (πi1, . . . , πiJ),

for example, P(yi = 3) = πi3 =

J∏

j=1

π

yij

ij

▶ Log-likelihood for n independent observations and J categories:

L(β | Y,X) = log

n∏

i=1

J∏

j=1

π

yij

ij

=

n∑

i=1

J−1∑

j=1

(yijx

⊤

i β)− log

1 + J−1∑

j=1

exp(x⊤i βj)

▶ The J − 1 fitted response functions π̂ij may be obtained by substituting

the MLE β̂j ’s into the expression of πij ’s

38

Polytomous Logistic Regression for Ordinal Response

▶ The ordinal response where categories of Y are ordered are also common:

– Employees asked to rate working conditions using a 7-point scale

(unacceptable, poor, fair, acceptable, good, excellent, outstanding)

– The severity of cancer is rated by stages on a 1-4 basis

▶ A more effective model for such responses explicitly takes into account

the orderings

▶ Propotional odds model :

P(yi ≤ j) = exp(αj + x

⊤

i β)

1 + exp(αj + x⊤i β)

, j = 1, . . . , J − 1

where there are J − 1 cumulative logits:

log

[

P(yi ≤ j)

1− P(yi ≤ j)

]

= αj + x

⊤

i β

MLE for this model can be similarly obtained

39