Homework 3

CS534 Machine Learning, Fall 2021

This homework will explore validation and model selection procedures

using regularized logistic regression models and nested cross validation.

Problem 1 - Cross validation (60 points)

This elastic-net regularized logistic regression model is derived by minimiz-

ing the negative log likelihood function for samples (xi, gi), gi ∈ {1, 2}

max

β,β0

`(β, β0) = max

β,β0

1

N

N∑

i=1

{I(gi−1)log p(xi)+I(gi−2)log (1−p(xi))}−λPα(β),

where the regularization penalty Pα(β) = (1−α)12‖β‖22+α‖β‖1 is a function

of the free parameters λ, α, and where the indicator function I(0) = 1 and

zero elsewhere. The weight λ determines the magnitude of regularization,

and the mixing parameter α determines the proportion of the penalty allo-

cated to the 1 and 2 norms. The mixing parameter α is often not chosen

through cross validation but just set using intuition or experience, where

the weight λ is typically determined through cross validation.

In this problem you will use a built-in function for elastic-net logistic

regression. The mixing parameter will be fixed α = 0.95 and you will search

for the optimal regularization weight lambda in the range λ ∈ [0, 100]. These

are often spaced on a logarithmic scale at 100 or more intervals.

Perform a nested cross validation using five folds to determine the op-

timal regularization weight and report test error. In each step, 4/5 of the

data will be used for training and validation, and 1/5 will be used to report

test error.

1.a. Validation diagram (20 points) Draw a diagram of the cross valida-

tion where you indicate which samples are in training, testing, and validation

in each of the 5 nested folds. Depict both the inner and outer loops of the

nested CV. Indicate clearly how the data is segmented, and the fraction of

the data used in each segment.

1

1.b. Model selection (20 points) For each outer CV fold, generate a plot

of the classification error on the validation set as a function of the sequence

of λ values (five plots total). Use the inner CV folds to calculate standard

deviations of the error σE(λ) at each λ, as well as the mean error µE(λ).

Choose the optimal lambda λ∗ as the largest λ within 1-standard devi-

ation of the minimum average error

λmin = argmin

λ

µE(λ),

λ∗ = max{λ} subject to λ ≤ µE(λmin) + σE(λmin)

In each plot, indicate µE(λ) with a solid black line, and the intervals of

variance µE(λ) ± σE(λ) in red, µE(λmin) as a blue point and µE(λ∗) as a

green point.

1.c. Test error (20 points) Generate a box plot of the five test errors

generated by cross-validation. Display the validation errors µE(λ∗) in two

additional boxplots in the same graph (there are 20 training errors and 20

validation errors). Compare and discuss the errors.

2

Problem 2 - Elastic net logistic regression (40 points)

Implement the elastic net regression algorithm using the soft-thresholding

and iterative reweighted least-squares approach described in Friedman 2010.

This problem can be solved by optimizing the penalized negative log likeli-

hood

min`(β, β0) =

1

N

N∑

i=1

{I(gi− 1)log p(xi) + I(gi− 2)log (1− p(xi))}−λPα(β),

where the indicator function I(0) = 1 and zero elsewhere, and p(xi) repre-

sents the class probability

Pr(G = 1|X = xi) = p(xi) = 1

1 + e−(β0+xTi β)

.

A quadratic approximation of this likelihood `Q can be used to transform

this problem into a weighted least-squares problem with weights wi and

response zi (see Equation 10 in Friedman paper)

`Q(β, β0) = − 1

2N

N∑

i=1

wi(zi − β0 − xTi β)2.

where

zi = β˜0 + x

T

i β˜ +

yi − p˜(xi)

wi

,

wi = p˜(xi)(1− p˜(xi)).

2.a. Solution path

Implement this algorithm using the mixing parameter α = 0.95 and plot the

solution path β(λ) for a sequence of 1000 logarithmically spaced λ in the

range λ ∈ [1, 100]. Label each feature j using text on the plot at the point

where βj enters the model (when this model coefficient becomes non-zero).

3

Notes

• Understanding the approach at a high-level is important. You are

going to solve the quadratic approximation repeatedly using equation

10 from the paper, and at each iteration you will recalculate zi, wi to

update the quadratic approximation `Q.

• You will generate a solution for each value of λ. A fixed number of

iterations is the easiest approach for deciding when each problem is

solved (50 worked for me).

• To accelerate convergence you can use a warm-starting technique.

Start with the largest λ which will have the fewest non-zero coeffi-

cients. Use the solution of this regularization level as the start for the

next smallest lambda, etc.

• This problem has some numerical issues that need to be addressed.

The weights, wi can shrink to zero and cause NaN values where used

in division. The parameters β will continue to grow in an attempt to

push the probabilities pi to zero or one. For this reason, you can clamp

these to 0, 1 once they are within a reasonable distance (say 1e − 5).

The solutions may also become unstable when the regularization gets

too small (this will be apparent in the solution path plot).

4

欢迎咨询51作业君