程序辅导案例 > C/C++ >

代写接单-COGS 118A: Supervised Machine Learning Algorithms Midterm Exam II Practice

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

1 COGS 118A: Supervised Machine Learning Algorithms Midterm Exam II Practice Problems True or False

1. The false positive rate is the how frequently truly positive data samples are misclassified as negative class. [True] [False] 2. For a linear classifier that is well-calibrated (i.e., wT x + b can be interpreted as a probabil- ity), the ROC curve is constructed by varying the classification threshold t ∈ [−1, 1] when predicting outputs on a given dataset according to f(x;w,b) = ��+1, ifwTx+b≥t, −1, otherwise. (1) and then plotting the true positive rate vs the false positive rate for each choice of t [True] [False] 3. The best possible ROC curve lies along the diagonal. [True] [False] 4. Perceptrons, logistic regressions, and SVMs all share the identical prediction formulation sign(w⊤x + b) at test time. [True] [False] 2 5. Typically, when the gradient descent algorithm is used to train a classifier f(x;w,b), the better the current model (wt, bt) predicts a sample xi, the less contribution xi makes to the gradient in the update of wt. [True] [False] 6. The regularization term in the SVM loss function improves the robustness of the classifier because it increases the margin through maximizing the magnitude of the weights. [True] [False] 7. The kernel trick can only be applied to SVMs, not other ML algorithms. [True] [False] 8. Perceptrons can learn the XOR task. [True] [False] 9. Even with a convex loss function, if you pick the wrong step size, gradient descent could end up never converging on the answer or taking a very long time to do so [True] [False] 10. A one-vs-rest multi-class prediction scheme uses fewer classifiers than a one-vs-one scheme. [True] [False] Conceptual Questions Select the correct option(s). Note that there might be multiple correct options. 1. Choose all the valid answers to the description about gradient descent from the options below: A. With different initial weights, it is possible for the gradient descent algorithm to obtain to different local minimum. B. Every gradient descent iteration can always decrease the value of loss function even when the gradient of the loss function is zero. C. It is possible for the gradient descent algorithm to find the global minimum, though it is not guaranteed. D. Large learning rate is always better than a small one because it leads to faster convergence. Computing the margin width 2. In the Figure 1, we use a black line to mark the decision boundary of a linear SVM classifier parameterized by weight vector w and bias b. If we start to decrease the margin of the classifier, what would happen to w and b? SCR© 7 Figure 1: A linear SVM classifier with data points. A. b will be larger. B. w will have smaller magnitude. C. w will have greater magnitude. D. Both A and C. 3. Pick all of the following classifiers that can fit a non-linear decision boundary. A. Decision tree B. Perceptron C. Kernel SVM D. Logistic regression E. k-NN 4. Suppose we are given a dataset S = {(xi,yi),i = 1,...,n} where x is the feature vector and y ∈ {−1, +1} is the corresponding label. We train a logistic regression classifier parameterized by weight w and bias b on the dataset S. Please choose the correct loss formulation L(w,b) of the logistic regression classifier: A.L(w,b)=√ 2 w⊤w +C��n max(0,1−yi(w⊤xi+b)) i=1 B. L(w, b) = − ��n i=1 E.L(w,b)=√ 2 −C��n max(0,1−yi(w⊤xi+b)) w⊤w i=1 5. Choose all the valid answers to the question from the options below. When training a classifier, the goal of cross-validation is to: A. Reduce the training error. B. Reduce the training time. C. Tune/Choose the hyper-parameters. D. Estimate the generalization error. ln ln D. L(w, b) = ��ni=1 max(0, −yi(w⊤xi + b)) 1 1+e−yi (w⊤ xi +b) C. L(w, b) = − ��n i=1 1 1+e−(w⊤ xi +b) 3 Error Metrics There are 20 people tested for SARS-CoV-2 by the DeepMLBuzzword3000 test. If someone is actually well or sick they have the appropriate emoji. Our test results are always negative (no virus) unless the thermometer icon appears on that person’s head; those people test positive for the virus. Evaluate the following metrics for this (not very good) virus test. 1. Compute the Recall using the following formula: Recall = number of true positives (correct guesses) number of actual targets 2. Compute the Precision using the following formula: Precision = number of true positives (correct guesses) number of guesses 3. Compute the F-value using the following formula: F-value = 2 * Precision * Recall Precision + Recall 4 Support Vector Machine Consider a dataset {(xi, yi), i = 1, 2, ..., 10} where x = [x1, x2]⊤ and y ∈ {−1, +1}, which is shown in the Figure 2 below. Suppose we have trained a support vector machine (SVM) classifier on the dataset, which has the decision boundary l : w⊤x + b = 0. The SVM classifier is optimized as following: Find: argmin1||w||2 w2 Subject to, ∀i :w⊤xi + b ≥ +1, if yi = +1, w⊤xi + b ≤ −1, if yi = −1. We define l+ : w⊤x+b = +1 as the positive plane and l− : w⊤x+b = −1 as the negative plane. After training the SVM classifier using the gradient descent method for several iterations to reach an intermediate state, we have obtained the positive plane and the negative plane, which are indicated as solid lines in the Figure 2 below. Figure 2: the positive plane and the negative plane, and the data points 1. Calculate the w and b from your drawn positive plane and negative plane. 2. Calculate the size of the margin. 5 K Nearest Neighbors Consider a training dataset Straining = {(xi,yi),i = 1,2,...,10} where each data point (x,y) has a feature vector x = [x1,x2]⊤ and the corresponding label y ∈ {−1,+1}. The points with the corresponding labels in the training set are shown in the figure below. In this question, we attempt to predict the label of a new point with the k nearest neighbors (k-NN) algorithm. Specifically, in the k-NN, we use Euclidean distance as the distance metric and k is set to 1. 1. Predict the label for feature vector (1, 3). 2. Predict the label for feature vector (5, 4). 6 Perceptron In this section, we will apply perceptron learning algorithm to solve a binary classification problem: We need to predict a binary label y ∈ {−1,+1} for a feature vector x = [x0,x1]⊤. The decision rule of the perceptron model is defined as: Straining = {(xi, yi)}, i = 1, . . . , n}, the outline of the perceptron algorithm is shown as below: • Initialize weight w and bias b. • If not all data points in Straining are correctly predicted: – Obtain the prediction for each feature vector in Straining. – If the prediction is wrong, then update the parameters w and b. • Otherwise return current weight w and bias b. In this problem, you will implement a perceptron algorithm. Suppose X is the matrix of all feature vectors xi and Y is the matrix of all labels yi in the training set Straining. Besides, assume a function calc error(X, Y, W, b) is given, which computes the error from the feature matrix X, the label matrix Y, the weight W and the bias b. Please reorder the following lines of Python code to complete your perceptron algorithm. Fill the indexes (a-g) into the blanks. def perceptron_algorithm(X, Y): # Initialization. W = np.zeros(2) b =0.0 lam = 1 # Main algorithm. while calc_error(X, Y, W, b) > 0: for xi, yi in zip(X, Y): __________________________________________________ ______________________________________________ else: ______________________________________________ __________________________________________________ ______________________________________________ else: ______________________________________________ ______________________________________________ return W, b ��+1, ifwTx+b≥0, −1, otherwise. f(x;w,b) = where w = [w0,w1]⊤ is the weight vector, and b is the bias scalar. Given a training dataset (2) (a) W = W + lam * (yi - yi_pred) * xi (b) yi_pred = -1 (c) b = b + lam * (yi - yi_pred) (d) yi_pred = +1 (e) if W.T.dot(x) + b >= 0: (f) continue (g) if yi_pred == yi: 7 Model selection For the tasks you have to complete below, you have been given a dataset S = {(xi, yi), i = 1, 2, ..., n}, where xi is a vector of d dimensions. 1. You wish to estimate the generalization error of a classifier that is VERY computationally intensive to train. The number of samples in the dataset, n, is huge, on the order of the number of cat videos on YouTube. You should use A. leave one out cross-validation B. 10 fold cross-validation C. 100 bootstrap resamples D. 50%/50% train/test split 2. You wish to find out which one of h different sets of hyperparameters optimize logistic regres- sion’s performance on S. This classifier has training time complexity O(nd) and testing time complexity O(d). Assuming you use k fold cross-validation to select the hyperparameters, approximately how long will it take to do the model selection? A. O(kd) B. O(hkd) C. O(knd) D. O(hknd) 3. You want to both select the correct hyperparameters and also estimate the generalization error. The number of samples is probably just barely enough. You should A. do a randomized 50%/50% train/test split; select hyperparameters on the training set, fit the best hyperparameters on the training set, estimate generalization error with the fitted best version on the test set B. do a randomized 50%/25%/25% train/validation/test split; train on the training set, select hyperparameters on the validation set, fit the best hyperparameters on the train- ing+validation sets, estimate generalization error with the fitted best version on the test set C. do a randomized 70%/30% train/test split; use 5 fold cross validation on the training set to select hyperparameters, fit the best hyperparameters on the training set, and estimate generalization error with the fitted best version on the test set 8 Kernels Consider a toy dataset S = {(xi, yi), i = 1, 2, 3} where each example is represented by a 2 dimen- sional feature vector. The design matrix for this problem is as follows: 2 4 X=3 5 (3) 00 Based on your domain expertise, you would like to use the kernel trick to make predictions, e.g., predictions will be make using a linear combination of the training points based on kernel similarity. In other words, you have a good intuition of how to compute the similarity between data points rather than an underlying feature transformation φ(x) (of possibly unbounded dimensionality). Given a new test point x′ = [2, 3]⊤, you would like to make a prediction. Assuming (1) you chose the following Gaußian kernel (please remember that exp is base e and that ∥x∥2 is the square of the L2 norm of x): ′ −∥x − x′∥2 k(x, x ) = exp( 2 ) and assuming (2) that k(x′) has entries ki(x′) = k(xi,x′) and represents the similarity of your test point to all points in the training dataset, please compute the kernel similarity of x′ (up to 3 decimal points):    k(x′) =   (4)    9 Logistic Regression Consider a logistic regression classifier that predicts a binary label y ∈ {−1, 1} for a feature vector x=[x1,x2]⊤ ∈R2: ��+1, if f(x; w, b) ≥ 0.5, y= −1, otherwise. P(y = +1|x) = f(x;w,b) = 1 (5) 1 + e−(w⊤x+b) where weight vector w = [w1,w2]⊤ ∈ R2 and bias b ∈ R. Besides, P(y = +1|x) represents the probability of prediction y = +1 given a feature vector x. 1. Write down the formula for P (y = −1|x) (in a similar form as Equation (5) above). 2. Suppose that after training the logistic regression classifier on a dataset, we have learned the weight vector w = [1, −2]⊤ and the bias b = 2. (a) Sketch the decision boundary. Make sure you have marked the x1 and x2 axes and the intercept points on these axes. (b) Predict the label of xpred = [3, 4]⊤ using the learned logistic regression classifier.