
CS6923 Machine Learning, Fall 2023

Prof. Linda Sellie, NYU School of Engineering

Homework 5¹

Due Saturday, October 10 at 8:00 p.m. Submission is required only on GradeScope.

You may work together with one other person on this homework. If you do that, hand in JUST ONE homework for the two of you, with both of your names on it. You may *discuss* this homework with other students, but YOU MAY NOT SHARE WRITTEN ANSWERS OR CODE WITH ANYONE BUT YOUR PARTNER.

Part I: Written Exercises

1. When using gradient ascent/descent to minimize an error function, there are common problems that people encounter. For each of the following problems, explain why it might be happening, and suggest a way to fix the problem.

(a) The error doesn’t decrease steadily with the number of iterations; it sometimes goes up, and it sometimes goes down. Also, the weights don’t converge.

(b) The error decreases steadily, but very slowly. Even after 1,000,000 iterations, the error is still not much smaller than it was at the beginning.

(c) The error decreases and it converges to a single value, but that value is large.

2. A data scientist is hired by a political candidate to predict who will donate money. The data scientist decides to use two predictors for each possible donor:

• x1 = the income of the person (in thousands of dollars), and

• x2 = the number of websites with similar political views as the candidate that the person follows on Facebook.

To train the model, the scientist tries to solicit donations from a randomly selected subset of people and records who donates or not. She obtains the following data:

Income (thousands $), x1^(i)    Num websites, x2^(i)    Donate (1=yes or 0=no), y^(i)
          30                            0                          0
          50                            1                          1
          70                            1                          0
          80                            2                          1
         100                            1                          1

(a) Draw a scatter plot of the data labeling the two classes with different markers.

(b) Find a linear classifier that makes at most one error on the training data. The classifier should be of the form,

$$\hat{y}^{(i)} = \begin{cases} 1 & \text{if } z^{(i)} > 0 \\ 0 & \text{if } z^{(i)} < 0 \end{cases}, \qquad z^{(i)} = w^T x^{(i)}$$

What is the weight vector w of your classifier?

¹Some of these are modified from Prof. Rangan's questions.


 

(c) Now consider a logistic model of the form,

$$P(y^{(i)} = 1 \mid x^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}, \qquad z^{(i)} = w^T x^{(i)}$$

Using w from the previous part, which sample i is the least likely to be from class 1 (i.e., P(y^(i) = 1 | x^(i)) is the smallest)? If you do the calculations correctly, you should not need a calculator.

(d) Now consider a new set of parameters

w′ = αw,

where α > 0 is a positive scalar. Would using the new parameters change the values ŷ^(i) in part (b)? Would they change the likelihoods P(y^(i) | x^(i)) in part (c)? If they do not change, state why. If they do change, qualitatively describe the change as a function of α.

3. Weighted Logistic Regression Formula

Weighted logistic regression allows for differential weighting of observations in a dataset. This is especially useful in scenarios like dealing with imbalanced classes, factoring in varying confidence levels in data points, or considering different misclassification costs. By incorporating these weights, the model can be more attuned to specific applications or challenges, ensuring that certain observations have a proportionally greater influence on the training process.

Derive the formula for weighted logistic regression.

The weighted likelihood function for the dataset is given by:

$$L(w) = \prod_{i=1}^{N} \left( \left[\sigma(w^T x^{(i)})\right]^{y^{(i)}} \left[1 - \sigma(w^T x^{(i)})\right]^{1 - y^{(i)}} \right)^{a^{(i)}}$$

In this function, the weights a(i) act as exponents, amplifying or diminishing the contribution of each observation based on its significance or reliability. A greater weight signifies a more pronounced influence on the likelihood and, consequently, on estimating model parameters.

1. Calculate the log-likelihood of the dataset.

2. Derive the gradient of the log-likelihood with respect to w, which can be used in gradient ascent/descent. Note: in this context, every data point is accompanied by a weight a^(i). (A numerical way to check your gradient is sketched below.)
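If you want to sanity-check the gradient you derive, a finite-difference comparison is one option. The sketch below is a minimal Python example, not part of the assignment: it uses made-up toy arrays X, y, and a, and a helper weighted_log_likelihood written directly from the likelihood above; compare its output against your closed-form gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_log_likelihood(w, X, y, a):
    # Weighted log-likelihood: sum_i a^(i) * [ y^(i) log p_i + (1 - y^(i)) log(1 - p_i) ]
    p = sigmoid(X @ w)
    return np.sum(a * (y * np.log(p) + (1 - y) * np.log(1 - p)))

def numerical_gradient(w, X, y, a, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad[j] = (weighted_log_likelihood(w + e, X, y, a)
                   - weighted_log_likelihood(w - e, X, y, a)) / (2 * eps)
    return grad

# Made-up toy data (hypothetical values); the first column of X is the intercept term.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
a = np.array([2.0, 1.0, 0.5])   # per-example weights a^(i)
w = np.array([0.1, -0.3])

print(numerical_gradient(w, X, y, a))   # should match your closed-form gradient
```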

4. (Do not turn in this question) Regularization in Logistic Regression:

In machine learning, overfitting is a common problem. This occurs when a model learns the training data so well, including its noise and outliers, that it performs poorly on unseen or new data. Just as linear regression can overfit the training data, logistic regression is also prone to this pitfall.

What can we do to prevent this?

One common strategy to combat overfitting is adding a penalty to the objective function. This discourages the model from fitting the training data too closely, in particular by penalizing large weights on any one feature.

One widely used penalty is L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients. This means that, with L2 regularization, we are not just trying to fit the data well; we are also trying to keep the model weights as small as possible.

Now, let’s delve deeper and see how this regularization is applied to the logistic regression model.

Regularization:


• Add ridge regularization to the log-likelihood function for logistic regression

• Determine the derivative of the log-likelihood function for logistic regression with ridge regularization.

• Now, you can continue with the details of how to add ridge regularization to the logistic regression function and derive its derivative. (A code sketch of the penalized objective follows this list.)
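Since this question is not turned in, here is a minimal sketch in Python of what the ridge-penalized objective and its gradient can look like in code. It assumes the common convention of subtracting (λ/2)‖w‖² from the log-likelihood and, for simplicity, penalizes the intercept as well; your derivation may use a slightly different convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_log_likelihood(w, X, y, lam):
    # Log-likelihood with an L2 (ridge) penalty subtracted:
    #   l(w) - (lam / 2) * ||w||^2
    p = sigmoid(X @ w)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll - 0.5 * lam * np.sum(w ** 2)

def ridge_gradient(w, X, y, lam):
    # Gradient of the penalized log-likelihood: the usual log-likelihood term minus lam * w.
    p = sigmoid(X @ w)
    return X.T @ (y - p) - lam * w
```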

5. (Extra Practice Questions)

Suppose we are training a logistic classifier to solve a binary classification problem (i.e. we are performing logistic regression). The classifier corresponds to a function of the form

$$h(x) = \frac{1}{1 + e^{-w^T x}}$$

whose output is an estimate of the probability that x belongs to class 1.

Suppose while performing gradient ascent, the weights become wT = [3, −5, −6].

The table below shows the result of using these weights to predict the class on the training examples.

 i     x1     x2     h(x)    y
 1    0.49   0.09   0.502    0
 2    1.69   0.04   0.003    0
 3    0.04   0.64   0.261    0
 4    1.00   0.16   0.049    0
 5    0.16   0.09   0.840    1
 6    0.25   0.00   0.852    1
 7    0.49   0.00   0.634    1
 8    0.04   0.01   0.939    1

(a) For a decision threshold of 0.5, create the confusion matrix.

(b) Plot the points on a graph and draw the decision boundary (I would suggest using some sort of plotting library and an image editor).

(c) For the data set above, what is the FPR (false positive rate)?

(d) For the data set above, what is the TPR (true positive rate)?

(e) What is the accuracy?

(f) What is the recall?

(g) What is the precision?

(h) In logistic regression, we are trying to maximize the log likelihood

$$\ell(w) = \sum_{i=1}^{N} \left[ y^{(i)} \ln\left(h(x^{(i)})\right) + (1 - y^{(i)}) \ln\left(1 - h(x^{(i)})\right) \right]$$

which is the same as minimizing the error function

$$- \sum_{i=1}^{N} \left[ y^{(i)} \ln\left(h(x^{(i)})\right) + (1 - y^{(i)}) \ln\left(1 - h(x^{(i)})\right) \right]$$

This quantity is sometimes called the cross-entropy of the classifier on the dataset. Using the initial weights, what is the cross-entropy of the classifier on the given training set?

(i) Given w as described above and w′ = (2, −3, −3)^T, which is more likely to have generated the dataset given above?²

(j) Perform one step of gradient ascent using the w given above. Use learning rate α = 0.1.³

²w′ here just means a new coefficient vector; it does not mean the derivative of w.

³You do not need to perform this by hand, but make sure you can do so by hand. A short sketch of these computations appears at the end of this question.

 

(k) How did the data points near the decision boundary contribute to the new value of w?

(l) How did the data points which were correctly classified and far away from the decision boundary contribute to the new value of w?

(m) How did incorrectly classified points contribute to the new value of w?

(n) Using the updated weights, what is the cross-entropy (error) of the classifier on the given training set?

(o) Did the cross-entropy (error) go up or down after one iteration of the gradient ascent (or descent)? Is this what you expected? Why or why not?
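Because these are practice questions, you may find it useful to check your hand computations against a short script. The sketch below (Python, using the table above) forms the confusion matrix at a 0.5 threshold, computes the requested rates, the cross-entropy, and one gradient-ascent step with α = 0.1; it uses the standard log-likelihood gradient from lecture, and variable names are just placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Data from the table above; the first column of X is the constant 1 for the intercept.
X = np.array([[1, 0.49, 0.09],
              [1, 1.69, 0.04],
              [1, 0.04, 0.64],
              [1, 1.00, 0.16],
              [1, 0.16, 0.09],
              [1, 0.25, 0.00],
              [1, 0.49, 0.00],
              [1, 0.04, 0.01]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = np.array([3.0, -5.0, -6.0])

h = sigmoid(X @ w)                 # predicted probabilities (should match the h(x) column)
yhat = (h >= 0.5).astype(int)      # class predictions at threshold 0.5

# Confusion matrix entries.
TP = np.sum((yhat == 1) & (y == 1))
FP = np.sum((yhat == 1) & (y == 0))
TN = np.sum((yhat == 0) & (y == 0))
FN = np.sum((yhat == 0) & (y == 1))

accuracy  = (TP + TN) / len(y)
recall    = TP / (TP + FN)          # also the TPR
precision = TP / (TP + FP)
fpr       = FP / (FP + TN)

# Cross-entropy (negative log-likelihood) at the current weights.
cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# One gradient-ascent step on the log-likelihood with learning rate alpha = 0.1.
alpha = 0.1
w_new = w + alpha * X.T @ (y - h)

print(TP, FP, TN, FN, accuracy, recall, precision, fpr, cross_entropy, w_new)
```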


Part II: Programming Exercise

In this exercise, you will implement a logistic regression classifier using gradient ascent. This classifier will be used to address a binary classification problem: detecting breast cancer. You are provided with the gradient ascent algorithm in the lecture notes, and you should base your implementation on this. (A minimal skeleton of such a training loop is sketched at the end of this section.)

Initial Setup:

• Coefficient weights should be initialized to 0 before the first iteration.

• Use a threshold value of 0.5.

Hyperparameters:

Start with the following hyperparameters for this assignment:

• Learning rate: 0.5

• Number of iterations: 5000

Post-Classification Tasks:

After successfully implementing and running your logistic regression classifier, provide the following:

• Values of the coefficient vector w.

• Create a plot showcasing the value of the log-likelihood (objective function) every 100 iterations during the gradient ascent. This should have the iteration number on the x-axis and the objective value on the y-axis. The plotting function is pre-written for you in the provided Jupyter notebook.

• For the test dataset, compute and report:

– Precision

– Recall

– (Optional) F1 score

– Confusion matrix

• Utilize the test dataset as a validation set. Experiment with different hyperparameter settings to potentially find better values. Report any noteworthy findings. Your main goal is to experiment and observe the effects of varying values; finding the “best” values is not mandatory.

Disclaimer: In this assignment, we’re using the test set as a validation set for the sake of simplicity and to facilitate learning. However, in real-world machine learning applications, this is considered bad practice. Typically, you’d have a separate training set, validation set, and test set. The validation set is used for tuning hyperparameters, and the test set is reserved exclusively for evaluating the final model’s performance. Using the test set for validation can lead to over-optimistic estimates of a model’s performance and potential overfitting to the test data.
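As a starting point, one possible shape of the training loop is sketched below. It is only an outline based on the gradient-ascent update for the log-likelihood from lecture; the function and variable names are placeholders, X is assumed to already include a column of ones for the intercept, and the features are assumed to be preprocessed as in the provided notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gradient_ascent(X, y, learning_rate=0.5, n_iters=5000):
    """Fit a logistic regression classifier by gradient ascent on the log-likelihood.

    X is assumed to already contain a leading column of ones for the intercept.
    Returns the weight vector and the objective value recorded every 100 iterations.
    """
    w = np.zeros(X.shape[1])                      # coefficients initialized to 0
    history = []
    for it in range(n_iters):
        p = sigmoid(X @ w)
        w = w + learning_rate * X.T @ (y - p)     # gradient-ascent update on the log-likelihood
        if it % 100 == 0:
            p_safe = np.clip(p, 1e-12, 1 - 1e-12) # avoid log(0) when recording the objective
            ll = np.sum(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
            history.append((it, ll))
    return w, history

def predict(X, w, threshold=0.5):
    # Class predictions at the given probability threshold.
    return (sigmoid(X @ w) >= threshold).astype(int)
```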
