University of Waterloo CS480/680 2021 Winter
CS480/680: Introduction to Machine Learning
Homework 1
Due: 11:59 pm, January 28, 2021, submit on Crowdmark (yet to be set up, stay tuned).
Last Updated: January 17, 2021
Include your name and student number!
Submit your writeup in PDF and all source code in a zip file (with proper documentation). Write a script for each
programming exercise so that the TAs can easily run and verify your results. Make sure your code runs!
[Text in square brackets is a hint and can be ignored.]
Exercise 1: Perceptron Implementation (5 pts)
Convention: All algebraic operations, when applied to a vector or matrix, are understood to be element-wise
(unless otherwise stated).
Algorithm 1: The perceptron algorithm.
Input: X ∈ R^{n×d}, y ∈ {−1, 1}^n, w = 0_d, b = 0, max_pass ∈ N
Output: w, b, mistake
1  for t = 1, 2, ..., max_pass do
2      mistake(t) ← 0
3      for i = 1, 2, ..., n do
4          if y_i(⟨x_i, w⟩ + b) ≤ 0 then
5              w ← w + y_i x_i    // x_i is the i-th row of X
6              b ← b + y_i
7              mistake(t) ← mistake(t) + 1
Implement the perceptron in Algorithm 1. Your implementation should take as input X = [x_1^T, ..., x_n^T]^T ∈ R^{n×d},
y ∈ {−1, 1}^n, an initialization of the hyperplane parameters w ∈ R^d and b ∈ R, and the maximum number of passes
over the training set [suggested max_pass = 500]. Run your perceptron algorithm on the spambase dataset (use the
version on the course website), and plot the number of mistakes (y-axis) against the number of passes (x-axis).
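A minimal NumPy sketch of Algorithm 1 (the function name, the max_pass default of 500, and the data loading are assumptions; adapt it to the provided spambase files):

    import numpy as np

    def perceptron(X, y, w=None, b=0.0, max_pass=500):
        """Run Algorithm 1; return w, b and the number of mistakes per pass."""
        n, d = X.shape
        w = np.zeros(d) if w is None else np.asarray(w, dtype=float)
        mistakes = []
        for _ in range(max_pass):
            mistake = 0
            for i in range(n):
                if y[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                    w = w + y[i] * X[i]
                    b = b + y[i]
                    mistake += 1
            mistakes.append(mistake)
        return w, b, mistakes

Plotting the returned per-pass mistake counts against the pass index (e.g. with matplotlib.pyplot.plot) gives the requested figure.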
Ans:
Exercise 2: Perceptron Questions (5 pts)
1. (3 pts) The perceptron algorithm makes an update every time it witnesses a mistake. What if it makes
an update on every point, regardless of whether or not that point is correctly classified? For simplicity,
consider a setting where b is fixed to 0. Give an example of an infinite sequence of points (x_i, y_i)
with the following properties:
• The sequence is strictly linearly separable with b = 0 (i.e., the margin is some constant γ > 0),
• The sequence has max_i ‖x_i‖_2 ≤ 1,
• The modified perceptron algorithm makes an infinite number of mistakes on this sequence.
Prove that it has these properties. Note that the perceptron convergence theorem, together with the first
two conditions, implies that the unmodified perceptron algorithm would make only a finite number of
mistakes on such a sequence.
Ans:
2. (1 pt) Give examples in which the perceptron algorithm converges to a zero-margin halfspace, and in
which it converges to a maximum-margin halfspace.
Ans:
3. (1 pt) Suppose that in each iteration of the perceptron algorithm, a point is chosen uniformly at random
from the dataset. Show how the perceptron algorithm can be viewed as an instantiation of stochastic
gradient descent (SGD). In particular, you must provide a loss function and a learning rate such that
SGD with these parameters and perceptron are identical.
Ans:
Exercise 3: Regression Implementation (8 pts)
Recall that ridge regression refers to

    min_{w ∈ R^d, b ∈ R}  loss(w, b) := (1/(2n)) ‖Xw + b·1 − y‖_2^2 + λ‖w‖_2^2,    (1)

where the first term (1/(2n)) ‖Xw + b·1 − y‖_2^2 is the (training) error and the entire objective is the (training) loss;
X ∈ R^{n×d} and y ∈ R^n are the given dataset and λ > 0 is the regularization hyperparameter.
1. (1 pt) Show that the derivatives are

    ∂loss/∂w = (1/n) X^T (Xw + b·1 − y) + 2λw,    (2)
    ∂loss/∂b = (1/n) 1^T (Xw + b·1 − y).    (3)
Ans:
2. (2 pts) Implement the gradient descent algorithm for solving ridge regression. The following incomplete
pseudo-code may be of help.
Test your implementation on the Boston housing dataset (to predict the median house price, i.e., y). Use
the train and test splits provided on the course website. Try λ ∈ {0, 10} and report your training error,
training loss, and test error. [Your training loss should decrease monotonically over the iterations; if not,
try tuning your step size η, e.g., by making it smaller.]
Algorithm 2: Gradient descent for ridge regression.
Input: X ∈ R^{n×d}, y ∈ R^n, w_0 = 0_d, b_0 = 0, max_pass ∈ N, η > 0, tol > 0
Output: w, b
1  for t = 1, 2, ..., max_pass do
2      w_t ←
3      b_t ←
4      if ‖w_t − w_{t−1}‖ ≤ tol then    // can use other stopping criteria
5          break
6  w ← w_t, b ← b_t
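One possible way to fill in the blanks, as a minimal NumPy sketch that plugs the gradients (2) and (3) into the update (the step size eta, tolerance, and iteration budget below are assumed defaults to be tuned):

    import numpy as np

    def ridge_gd(X, y, lam, max_pass=1000, eta=1e-4, tol=1e-6):
        """Gradient descent for the ridge objective (1)."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(max_pass):
            r = X @ w + b - y                              # residual Xw + b*1 - y
            w_new = w - eta * (X.T @ r / n + 2 * lam * w)  # gradient step using (2)
            b_new = b - eta * (r.sum() / n)                # gradient step using (3)
            if np.linalg.norm(w_new - w) <= tol:           # can use other stopping criteria
                w, b = w_new, b_new
                break
            w, b = w_new, b_new
        return w, b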
Ans:
For the next part, you may use the Python package scikit-learn.
3. (5 pts) Train (unregularized) linear regression, ridge regression, and lasso on the mystery datasets A, B,
and C on the course website (using X_train and Y_train for each dataset). For ridge regression and lasso,
use regularization parameters 1 and 10 (note that you will have to divide these by n for lasso in scikit-learn – why?).
Report the average mean squared error on the test set for each method. Which approach performs best
in each case? Plot the five parameter vectors obtained for each dataset on the same histogram, so they’re
all visible at once (change the opacity setting for the bars if necessary): specifically, for each parameter
vector, plot a histogram of its values across coordinates. Given which approach performs best, and how
its parameter histogram looks, how do you suspect the true parameter vector and responses might have
been generated?
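A minimal scikit-learn sketch of this comparison, assuming each dataset has already been loaded into arrays X_train, Y_train, X_test, Y_test (file names and loading are up to you):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.metrics import mean_squared_error

    def compare_models(X_train, Y_train, X_test, Y_test):
        """Fit the five models; return test MSE and coefficient vector for each."""
        n = X_train.shape[0]
        models = {
            "linear": LinearRegression(),
            "ridge_1": Ridge(alpha=1),
            "ridge_10": Ridge(alpha=10),
            "lasso_1": Lasso(alpha=1 / n),    # sklearn's lasso error term is divided by n
            "lasso_10": Lasso(alpha=10 / n),
        }
        results = {}
        for name, model in models.items():
            model.fit(X_train, Y_train)
            mse = mean_squared_error(Y_test, model.predict(X_test))
            results[name] = (mse, model.coef_)
        return results

For the plot, calling matplotlib's plt.hist on each coefficient vector with a reduced alpha (opacity) on shared axes makes all five histograms visible at once.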
Ans:
Exercise 4: An Alternative to Least Squares? (3 pts)
Suppose we are in a setting with a dataset X ∈ R^{n×d}, and labels are generated according to Y = XW + ε, where
W ∈ R^d, Y ∈ R^n, and ε ∈ R^n is a random vector whose entries are independent random variables with 0
mean and variance σ² (you can imagine Gaussian, but it isn't necessary). As we saw in class, the least-squares
solution Ŵ can be written as Ŵ = (X^T X)^{−1} X^T Y; this is a linear transformation of the response vector Y.
Consider a different linear transformation ((X^T X)^{−1} X^T + N) Y, where N ∈ R^{d×n} is a non-zero matrix.
1. (1 pt) Show that the expected value of this linear transformation is (I_d + NX)W. Conclude that its
expected value is W if and only if NX = 0.
Ans:
2. (2 pts) Compute the covariance matrix of this linear transformation when NX = 0, and show that it is
equal to σ²(X^T X)^{−1} + σ²NN^T. Since the former term is the covariance of the least-squares solution*
and the latter matrix is positive semi-definite, this implies that this alternative estimator can only increase
the variance of our estimate.
Ans:
* Verify this for yourself, but no need to submit it.
Exercise 5: Sample Statistics (2 pts)
1. (1 pt) Suppose there is a dataset x_1, ..., x_n sampled from a distribution with mean µ and variance σ².
Compute the expected value of the sample mean x̄ = (1/n) Σ_{i=1}^n x_i. Describe any modifications that
might be required to make the expected value µ (recall that µ and σ² are unknown).
Ans:
2. (1 pt) Suppose there is a dataset x_1, ..., x_n sampled from a distribution with mean µ and variance σ².
Compute the expected value of the sample variance (1/n) Σ_{i=1}^n (x_i − x̄)², where x̄ is the sample mean
from the previous part. Describe any modifications that might be required to make the expected value σ²
(recall that µ and σ² are unknown).
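A small Monte Carlo sketch (purely a numerical sanity check, not a substitute for the derivation) for estimating both expectations empirically; the distribution, sample size, and trial count are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 2.0, 3.0, 10, 200000

    means = np.empty(trials)
    vars_naive = np.empty(trials)
    for t in range(trials):
        x = rng.normal(mu, sigma, size=n)
        means[t] = x.mean()                             # sample mean
        vars_naive[t] = ((x - x.mean()) ** 2).mean()    # (1/n) * sum_i (x_i - x_bar)^2

    print("average sample mean:    ", means.mean(), "   (mu =", mu, ")")
    print("average sample variance:", vars_naive.mean(), "   (sigma^2 =", sigma ** 2, ")")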
Ans:
Gautam Kamath ([email protected]) ©2021
