Homework 3 CS534 Machine Learning, Fall 2021 This homework will explore validation and model selection procedures using regularized logistic regression models and nested cross validation. Problem 1 - Cross validation (60 points) This elastic-net regularized logistic regression model is derived by minimiz- ing the negative log likelihood function for samples (xi, gi), gi ∈ {1, 2} max β,β0 `(β, β0) = max β,β0 1 N N∑ i=1 {I(gi−1)log p(xi)+I(gi−2)log (1−p(xi))}−λPα(β), where the regularization penalty Pα(β) = (1−α)12‖β‖22+α‖β‖1 is a function of the free parameters λ, α, and where the indicator function I(0) = 1 and zero elsewhere. The weight λ determines the magnitude of regularization, and the mixing parameter α determines the proportion of the penalty allo- cated to the 1 and 2 norms. The mixing parameter α is often not chosen through cross validation but just set using intuition or experience, where the weight λ is typically determined through cross validation. In this problem you will use a built-in function for elastic-net logistic regression. The mixing parameter will be fixed α = 0.95 and you will search for the optimal regularization weight lambda in the range λ ∈ [0, 100]. These are often spaced on a logarithmic scale at 100 or more intervals. Perform a nested cross validation using five folds to determine the op- timal regularization weight and report test error. In each step, 4/5 of the data will be used for training and validation, and 1/5 will be used to report test error. 1.a. Validation diagram (20 points) Draw a diagram of the cross valida- tion where you indicate which samples are in training, testing, and validation in each of the 5 nested folds. Depict both the inner and outer loops of the nested CV. Indicate clearly how the data is segmented, and the fraction of the data used in each segment. 1 1.b. Model selection (20 points) For each outer CV fold, generate a plot of the classification error on the validation set as a function of the sequence of λ values (five plots total). Use the inner CV folds to calculate standard deviations of the error σE(λ) at each λ, as well as the mean error µE(λ). Choose the optimal lambda λ∗ as the largest λ within 1-standard devi- ation of the minimum average error λmin = argmin λ µE(λ), λ∗ = max{λ} subject to λ ≤ µE(λmin) + σE(λmin) In each plot, indicate µE(λ) with a solid black line, and the intervals of variance µE(λ) ± σE(λ) in red, µE(λmin) as a blue point and µE(λ∗) as a green point. 1.c. Test error (20 points) Generate a box plot of the five test errors generated by cross-validation. Display the validation errors µE(λ∗) in two additional boxplots in the same graph (there are 20 training errors and 20 validation errors). Compare and discuss the errors. 2 Problem 2 - Elastic net logistic regression (40 points) Implement the elastic net regression algorithm using the soft-thresholding and iterative reweighted least-squares approach described in Friedman 2010. This problem can be solved by optimizing the penalized negative log likeli- hood min`(β, β0) = 1 N N∑ i=1 {I(gi− 1)log p(xi) + I(gi− 2)log (1− p(xi))}−λPα(β), where the indicator function I(0) = 1 and zero elsewhere, and p(xi) repre- sents the class probability Pr(G = 1|X = xi) = p(xi) = 1 1 + e−(β0+xTi β) . A quadratic approximation of this likelihood `Q can be used to transform this problem into a weighted least-squares problem with weights wi and response zi (see Equation 10 in Friedman paper) `Q(β, β0) = − 1 2N N∑ i=1 wi(zi − β0 − xTi β)2. where zi = β˜0 + x T i β˜ + yi − p˜(xi) wi , wi = p˜(xi)(1− p˜(xi)). 2.a. Solution path Implement this algorithm using the mixing parameter α = 0.95 and plot the solution path β(λ) for a sequence of 1000 logarithmically spaced λ in the range λ ∈ [1, 100]. Label each feature j using text on the plot at the point where βj enters the model (when this model coefficient becomes non-zero). 3 Notes • Understanding the approach at a high-level is important. You are going to solve the quadratic approximation repeatedly using equation 10 from the paper, and at each iteration you will recalculate zi, wi to update the quadratic approximation `Q. • You will generate a solution for each value of λ. A fixed number of iterations is the easiest approach for deciding when each problem is solved (50 worked for me). • To accelerate convergence you can use a warm-starting technique. Start with the largest λ which will have the fewest non-zero coeffi- cients. Use the solution of this regularization level as the start for the next smallest lambda, etc. • This problem has some numerical issues that need to be addressed. The weights, wi can shrink to zero and cause NaN values where used in division. The parameters β will continue to grow in an attempt to push the probabilities pi to zero or one. For this reason, you can clamp these to 0, 1 once they are within a reasonable distance (say 1e − 5). The solutions may also become unstable when the regularization gets too small (this will be apparent in the solution path plot). 4
欢迎咨询51作业君