EECE 5645
Homework 2
Parallel Regression
Follow the Discovery Cluster Rules:
• Never run jobs on the gateways.
• Do not reserve more than one node from the eece5645 reservation in the last 24 hours before a homework deadline.
• Start working on homework assignments early.
Preparation. Make sure that a folder on Discovery named after your username exists under the directory /scratch. You can confirm this by logging into Discovery and typing
ls /scratch/ | grep $USER
You should see a directory named after your username. Copy the directory
/scratch/EECE5645/HW2
to this folder, by typing:
cp -r /scratch/EECE5645/HW2 /scratch/$USER/
Make the contents of this directory private, by typing:
chmod -R go-rx /scratch/$USER/HW2
The directory contains (a) a python file called ParallelRegression.py, (b) a directory called data containing 6 files: three
training datasets and three test datasets. In this assignment, you are asked to modify the provided code ParallelRegression.py
and use it to train and test a linear model over the data using ridge regression.
Your code should be run on Discovery in “local” mode, on a single compute node that you have reserved on the Discovery cluster. Use sr5645 to reserve a node.
Submission. You must:
1. Provide a report, in pdf format, outlining the answers of the questions below. The report should be type-written in a word
processor of your choice (e.g., MS Word, LaTeX, etc.).
2. Provide the ParallelRegression.py you wrote, along with any auxiliary files you may have created in the assignment.
Before uploading, rename ParallelRegression.py, and any other auxiliary .py files you may have created, so that
they have a .txt ending. For example, xxx.py should become xxx.txt.
The report along with your final code should be uploaded on canvas by Tuesday, Nov. 3rd, 2020. Upload files separately. DO
NOT UPLOAD .zip FILES.
Question 1: In your answers below, provide a detailed derivation, not just the final answer. Given a vector x ∈ R^d and a real number y ∈ R, define the function f : R^d → R as
\[
f(\beta; x, y) = \left( y - x^\top \beta \right)^2 = \left( y - \sum_{k=1}^{d} \beta_k x_k \right)^2 .
\]
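For concreteness, here is a small numerical instance with d = 2 (the values are arbitrary and only illustrate the definition): for x = (1, 2)^⊤, β = (3, −1)^⊤ and y = 4,
\[
f(\beta; x, y) = \bigl(y - x^\top \beta\bigr)^2 = \bigl(4 - (1\cdot 3 + 2\cdot(-1))\bigr)^2 = (4 - 1)^2 = 9 .
\]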
1(a) Derive the partial derivatives ∂f/∂β_k, k = 1, . . . , d, as functions of x, β ∈ R^d and y ∈ R.
1(b) Show that the gradient ∇f(β; x, y) = [∂f/∂β_k]_{k=1,...,d} ∈ R^d can be written as
\[
\nabla f(\beta; x, y) = -2\,\left( y - x^\top \beta \right) x .
\]
1(c) Consider n vectors and n real numbers: x_i ∈ R^d, y_i ∈ R, for i = 1, . . . , n. Define F : R^d → R as:
\[
F(\beta) = \frac{1}{n} \sum_{i=1}^{n} f(\beta; x_i, y_i) + \lambda \|\beta\|_2^2
         = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \beta^\top \beta,
\]
where λ ≥ 0 is a non-negative real number and β ∈ R^d. Show that
\[
\nabla F(\beta) = -\frac{2}{n} \sum_{i=1}^{n} \left[ \left( y_i - x_i^\top \beta \right) x_i \right] + 2 \lambda \beta .
\]
1(d) Let β1, β2 ∈ R^d be two vectors and γ ∈ R be a real number. Show that the function h : R → R defined as:
\[
h(\gamma) = F(\beta_1 + \gamma \beta_2),
\]
can be written as:
\[
h(\gamma) = a\gamma^2 + b\gamma + c,
\]
where a, b, c ∈ R. Compute a, b, c ∈ R in terms of β1, β2, λ, and x_i, y_i, for i = 1, . . . , n.
1(e) Show that a > 0.
1(f) Show that h is a convex function, and that it is minimized at γ* ≡ −b/(2a).
1(g) Consider the optimization problem:
Minimize: h(γ)
subject to: γ ∈ [0, 1].
Assume that a ≠ 0. Find the optimal γ ∈ [0, 1] in each of these cases:
(a) γ∗ ∈ [0, 1],
(b) γ∗ < 0,
(c) γ∗ > 1,
where γ∗ is given by 1(f). Using a similar reasoning, determine the optimal γ ∈ [0, 1] when a = 0.
1(h) Note: this question is unrelated to (a)-(g). Given K ≥ 0 and some vector z ∈ R^d, where z ≠ 0, consider the optimization problem:
Minimize: z^⊤β
subject to: ∑_{k=1}^{d} |β_k| ≤ K.
Prove that the optimal solution vector β* ∈ R^d can be constructed as follows:
1. Find a coordinate k* ∈ {1, . . . , d} such that:
|z_{k*}| ≥ |z_k|, for all k = 1, . . . , d.
2. Set:
\[
\beta^*_k =
\begin{cases}
-K \,\mathrm{sign}(z_{k^*}) & \text{if } k = k^*, \\
0 & \text{otherwise.}
\end{cases}
\]
Here,
\[
\mathrm{sign}(s) =
\begin{cases}
+1 & \text{if } s \ge 0, \\
-1 & \text{if } s < 0.
\end{cases}
\]
Hint: Prove that (a) β* is feasible, i.e., it satisfies the constraint ∑_{k=1}^{d} |β*_k| ≤ K, and (b) that for every other feasible β, i.e., every β that also satisfies the constraint, it must be true that z^⊤β ≥ z^⊤β*.
Question 2: Go to the directory that contains ParallelRegression.py and start the pyspark interpreter:
pyspark --master "local[20]"
Within the interpreter, type:
import ParallelRegression as PR
help(PR.readData)
help(PR.f)
Which part of the code in PR.readData causes this output to be printed? Try to use the function PR.readData to read the file data/small.test and load its contents into an RDD. Describe the format of a line in data/small.test and the format of an element of the resulting RDD.
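For illustration, a minimal sketch of these inspection steps inside the pyspark shell is shown below; the exact signature of PR.readData (here assumed to take the file name and the SparkContext sc) should be checked against help(PR.readData).

import ParallelRegression as PR
raw = sc.textFile('data/small.test')       # raw text lines, one per record
print(raw.take(1))                         # format of a line in the file
rdd = PR.readData('data/small.test', sc)   # parsed RDD produced by readData (signature assumed)
print(rdd.take(1))                         # format of an element of the resulting RDD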
Question 3: In linear regression, the predicted score of a vector x ∈ R^d under a parameter vector β ∈ R^d is given by:
\[
\hat{y} = x^\top \beta .
\]
3(a) Modify the code in ParallelRegression.py so that function predict
• receives as input an x and a β represented as numpy arrays, and
• returns the predicted score ŷ, i.e., their inner product x^⊤β.
Add the corresponding code snippet in your report.
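As a point of reference, a minimal sketch of what 3(a) asks for is given below; the argument order predict(x, beta) is an assumption and should be matched to the stub in the provided file.

import numpy as np

def predict(x, beta):
    # Predicted score: the inner product x^T beta of the two numpy arrays.
    return np.dot(x, beta)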
3(b) Import your code within the pyspark interpreter to compute the predicted score for the following two arrays:
import numpy as np
x = np.array([np.cos(t) for t in range(5)])
beta = np.array([np.sin(t) for t in range(5)])
Add the code that you entered in pyspark to execute this, as well as the result, in your report. Tip. Whenever you change the
contents of ParallelRegression.py, you can update the function definitions in pyspark by typing, e.g., reload(PR).
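Note that in Python 3 reload is not a built-in; if the cluster's pyspark runs Python 3, import it from importlib first:

from importlib import reload
reload(PR)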
Question 4: Given a feature vector x ∈ R^d, a true value y ∈ R, and a parameter vector β ∈ R^d, the square error of a prediction is given by
\[
f(\beta; x, y) = \left( y - x^\top \beta \right)^2 .
\]
4(a) Modify the program ParallelRegression.py so that the function f:
• receives as input an x and a β represented as numpy arrays, and a float y, and
• returns the square error f(β;x, y).
Include the code snippet you wrote in the report.
4(b) Similarly, modify ParallelRegression.py so that the function localGradient
• receives as input an x and a β represented as numpy arrays, and a float y, and
• returns the gradient ∇f(β; x, y) w.r.t. β.
Include the code snippet you wrote in the report.
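For reference, minimal sketches of 4(a) and 4(b) are given below; the argument order (x, y, beta) is an assumption and should be matched to the stubs in the provided file.

import numpy as np

def f(x, y, beta):
    # Square error (y - x^T beta)^2 for numpy arrays x, beta and a float y.
    return (y - np.dot(x, beta)) ** 2

def localGradient(x, y, beta):
    # Gradient of f with respect to beta: -2 (y - x^T beta) x, cf. Question 1(b).
    return -2.0 * (y - np.dot(x, beta)) * x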
4(c) Write a python script that verifies that your computation of the gradient is correct, or, at least, is consistent with your implementation of f. The script should import f and localGradient from ParallelRegression.py, as well as the function estimateGrad. Correctness can be tested by confirming that localGradient agrees with the estimate produced by estimateGrad when δ is small. Test this extensively, with multiple random values of x, y, and β (using, e.g., a for loop), and appropriate assert statements.
Your script should perform tests without modifying the definitions/arguments of any of these three functions, even though
estimateGrad assumes that fun has a single argument! Include the verification code/script that you wrote, as well as its output,
in your report.
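A possible shape for such a verification script is sketched below. It assumes estimateGrad(fun, beta, delta) numerically estimates the gradient of a single-argument function fun at beta with perturbation delta, and that f and localGradient take arguments in the order (x, y, beta); both assumptions should be checked against the provided file.

import numpy as np
from ParallelRegression import f, localGradient, estimateGrad

np.random.seed(0)
d, delta, tol = 5, 1e-6, 1e-3
for trial in range(20):
    x = np.random.randn(d)
    y = float(np.random.randn())
    beta = np.random.randn(d)
    # Close over x and y so that fun has a single argument, as estimateGrad expects.
    fun = lambda b: f(x, y, b)
    numeric = estimateGrad(fun, beta, delta)
    analytic = localGradient(x, y, beta)
    assert np.allclose(numeric, analytic, atol=tol), (numeric, analytic)
print('All localGradient checks passed.')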
Question 5: Given a dataset D of pairs (x_i, y_i) ∈ R^d × R, i = 1, . . . , n, ridge regression attempts to find a β that fits the dataset by minimizing the following ℓ2-regularized mean square error:
\[
F(\beta) = \frac{1}{n} \sum_{i=1}^{n} f(\beta; x_i, y_i) + \lambda \|\beta\|_2^2
         = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \beta^\top \beta,
\]
where X ∈ R^{n×d} is the matrix whose rows comprise the feature vectors in the dataset D, and y ∈ R^n is the vector of target/ground-truth values. The parameter λ is called the regularization parameter. When λ = 0, ridge regression corresponds precisely to least squares estimation.
5(a) Modify the program ParallelRegression.py so that the function F:
• receives as input an RDD comprising (x, y) pairs, a β represented as a numpy array, and a float λ, and
• returns F (β), the regularized MSE.
The function should make use of f, the per-data-point square error function. Include the code snippet you wrote in the report.
5(b) Modify the program ParallelRegression.py so that the function gradient:
• receives as input an RDD comprising (x, y) pairs, a β represented as a numpy array, and a float λ, and
• returns the gradient ∇F(β) of the regularized MSE F.
The function should make use of localGradient, the gradient ∇f(β; x, y) of the per-data-point square error f. Include the code snippet you wrote in the report.
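Minimal sketches of 5(a) and 5(b) are shown below, assuming the RDD data holds (x, y) pairs with x a numpy array and y a float, and assuming the argument orders used in the earlier sketches.

import numpy as np

def F(data, beta, lam=0.0):
    # Regularized MSE: (1/n) * sum_i f(beta; x_i, y_i) + lam * ||beta||_2^2.
    n = data.count()
    total = data.map(lambda pair: f(pair[0], pair[1], beta)).reduce(lambda a, b: a + b)
    return total / n + lam * np.dot(beta, beta)

def gradient(data, beta, lam=0.0):
    # Gradient of F: (1/n) * sum_i localGradient(x_i, y_i, beta) + 2 * lam * beta.
    n = data.count()
    g = data.map(lambda pair: localGradient(pair[0], pair[1], beta)).reduce(lambda a, b: a + b)
    return g / n + 2.0 * lam * beta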
5(c) Write a script that uses the above definitions as well as estimateGrad to test whether your gradient estimate is correct.
Use data/small.test as a dataset. Test this extensively, with multiple random values of β and λ (using, e.g., a for loop),
and appropriate assert statements. Include your code and verification results in your report. Again, you should perform this
verification without altering the argument list of any of the aforementioned functions, including estimateGrad.
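One possible verification sketch, to be run from pyspark (so that sc exists) under the same signature assumptions as before:

import numpy as np
import ParallelRegression as PR

data = PR.readData('data/small.test', sc)   # signature assumed: (file name, SparkContext)
d = len(data.first()[0])                    # dimension of the feature vectors
np.random.seed(1)
for lam in [0.0, 0.5, 2.0]:
    for trial in range(5):
        beta = np.random.randn(d)
        fun = lambda b: PR.F(data, b, lam)            # single-argument wrapper around F
        numeric = PR.estimateGrad(fun, beta, 1e-6)
        analytic = PR.gradient(data, beta, lam)
        assert np.allclose(numeric, analytic, atol=1e-3)
print('All gradient checks passed.')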
Question 6: Let F be the ℓ2-regularized mean square error defined in Question 5. Given β1, β2 ∈ R^d and γ ∈ R, define:
\[
h(\gamma) = F(\beta_1 + \gamma \beta_2).
\]
Recall from question 1(d) that h : R → R is a quadratic function, i.e., it can be written as:
\[
h(\gamma) = a\gamma^2 + b\gamma + c,
\]
for some a, b, c ∈ R.
Modify the program ParallelRegression.py so that the function hcoeff:
• receives as input an RDD comprising (x, y) pairs, two vectors β1, β2 represented as numpy arrays, and a float λ, and
• returns the coefficients a, b, c ∈ R of the function h.
Write a script that tests the correctness of the function you wrote by picking many different values of β1, β2 ∈ R^d and γ ∈ R, computing a, b, c, and confirming that
\[
F(\beta_1 + \gamma \beta_2) = a\gamma^2 + b\gamma + c .
\]
Include the function hcoeff you wrote in your report, along with the script that tests its correctness and the corresponding output.
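One possible way to implement hcoeff, sketched below, avoids expanding the sum by hand: because h is exactly quadratic, its coefficients can be recovered from three evaluations of F. The closed-form expressions from 1(d) are an equally valid, and cheaper, alternative, since they need only a single pass over the data. The sketch assumes F(data, beta, lam) as in the sketch for 5(a).

def hcoeff(data, beta1, beta2, lam=0.0):
    # Return (a, b, c) such that F(beta1 + gamma * beta2) = a*gamma^2 + b*gamma + c.
    h0 = F(data, beta1, lam)            # h(0)  = c
    hp = F(data, beta1 + beta2, lam)    # h(1)  = a + b + c
    hm = F(data, beta1 - beta2, lam)    # h(-1) = a - b + c
    a = 0.5 * (hp + hm) - h0
    b = 0.5 * (hp - hm)
    c = h0
    return a, b, c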
Question 7: In this question, you are asked to solve the problem:
Minimize: F(β)
subject to: β ∈ R^d
through gradient descent. Starting from β_0 = 0 ∈ R^d, gradient descent performs the following iterations:
\[
\beta_{k+1} = \beta_k - \gamma_k \cdot \nabla F(\beta_k), \quad \text{for } k = 0, 1, 2, \dots
\]
In the above iterative process, the step-size γ_k ∈ R is determined through exact line search. That is, γ_k should be selected as:
\[
\gamma_k = \arg\min_{\gamma \in \mathbb{R}} F\bigl(\beta_k - \gamma \nabla F(\beta_k)\bigr).
\]
Given some tolerance ε > 0, gradient descent should be repeated until the convergence condition ‖∇F(β_k)‖_2 ≤ ε is satisfied or a maximum number of iterations is reached.
A trained parameter vector β ∈ R^d, trained over a dataset D.train, can be tested over a dataset D.test by computing the mean square error of predictions over the test set; that is, if D.test contains m datapoints (x_i, y_i), i = 1, . . . , m, then
\[
\mathrm{MSE}(\beta) = \frac{1}{m} \sum_{i=1}^{m} f(\beta; x_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \beta^\top x_i \right)^2 .
\]
7(a) Modify the program ParallelRegression.py so that the function test:
• receives as input an RDD comprising (x, y) pairs, a β represented as a numpy array, and
• returns MSE(β).
Hint: This should be extremely short.
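A minimal sketch, assuming f(x, y, beta) and an RDD of (x, y) pairs as before (the test MSE is the unregularized F, i.e., λ = 0):

def test(data, beta):
    # Mean square error of the predictions x^T beta over the (x, y) pairs in data.
    return data.map(lambda pair: f(pair[0], pair[1], beta)).mean()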
7(b) Modify the program ParallelRegression.py so that the function exactLineSearch:
• receives as input an RDD comprising (x, y) pairs, a vector β1 ∈ R^d and a direction g ∈ R^d, both represented as numpy arrays, and a real number λ, and
• returns γ* = arg min_{γ∈R} F(β1 − γg).
Include the code snippet you wrote in your report.
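One way to sketch this: F(β1 − γg) is exactly the quadratic h(γ) with β2 = −g, so by 1(f) its unconstrained minimizer is −b/(2a). The sketch below assumes hcoeff(data, beta1, beta2, lam) as in Question 6 and a > 0 (cf. 1(e)).

def exactLineSearch(data, beta1, g, lam=0.0):
    # gamma* = argmin over gamma in R of F(beta1 - gamma * g); see 1(f).
    a, b, c = hcoeff(data, beta1, -g, lam)
    return -b / (2.0 * a)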
7(c) Use the code you have been provided to train a model over data/small.train and test it over data/small.test with
λ = 10 and ε = 0.01. You can do so by typing:
spark-submit --master "local[20]" --driver-memory 60G ParallelRegression.py \
--train data/small.train --test data/small.test \
--beta beta_small_gd --lam 10.0 --eps 0.01
Include the output of this execution in your report.
7(d) Use your code to train a model over data/large.train and test it over data/large.test with λ = 10 and ε = 0.01.
You can do so by typing:
spark-submit --master "local[20]" --driver-memory 60G ParallelRegression.py \
--train data/large.train --test data/large.test \
--beta beta_large_gd --lam 10.0 --eps 0.01
Include the output of this execution in your report.
Question 8: In this question, you are asked to solve the problem:
\[
\begin{aligned}
\text{Minimize: } & F_0(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 \\
\text{subject to: } & \|\beta\|_1 \le K,
\end{aligned}
\]
where ‖β‖_1 = ∑_{k=1}^{d} |β_k| is the ℓ1 norm of the vector β ∈ R^d, and K > 0 is some positive number. You will solve this using the Frank-Wolfe algorithm (named after Marguerite Frank and Philip Wolfe, the mathematicians who invented it). Starting from β_0 = 0 ∈ R^d, the Frank-Wolfe algorithm performs the following iterations:
\[
\begin{cases}
s_k = \arg\min_{s : \|s\|_1 \le K} \; s^\top \nabla F_0(\beta_k), \\
\beta_{k+1} = (1 - \gamma_k)\, \beta_k + \gamma_k\, s_k,
\end{cases}
\qquad \text{for } k = 0, 1, 2, \dots
\]
In other words, at each iteration, the algorithm obtains s_k ∈ R^d by solving the problem:
Minimize: s^⊤∇F_0(β_k)
subject to: ‖s‖_1 ≤ K,
and then interpolates between the current solution β_k and the constructed vector s_k.
In this interpolation, the step-size γ_k ∈ [0, 1] is determined through exact line search. That is, γ_k should be selected as:
\[
\gamma_k = \arg\min_{\gamma \in [0, 1]} F_0\bigl((1 - \gamma)\,\beta_k + \gamma\, s_k\bigr).
\]
Given some tolerance ε > 0, the algorithm should be repeated until the convergence condition
\[
(\beta_k - s_k)^\top \nabla F_0(\beta_k) \le \varepsilon
\]
is satisfied or a maximum number of iterations is reached.
8(a) Implement function solveLin as described in its docstring. Include the code snippet you wrote in your report.
Hint: See question 1(h).
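A minimal sketch of the construction from 1(h) is given below; the signature is a guess and should be matched to the docstring of solveLin in the provided file.

import numpy as np

def solveLin(grad, K):
    # argmin of s^T grad over ||s||_1 <= K: put all the mass -K*sign(grad[k*])
    # on the coordinate k* where |grad| is largest (Question 1(h)).
    k_star = int(np.argmax(np.abs(grad)))
    s = np.zeros(len(grad))
    s[k_star] = -K if grad[k_star] >= 0 else K
    return s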
8(b) Implement function exactLineSearchFW as described in its docstring. Include the code snippet you wrote in your report.
Hint 1: You do not need to re-implement hcoeff!
Hint 2: See question 1(g).
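A possible sketch: F0((1 − γ)β + γs) = F0(β + γ(s − β)), which is h(γ) with β1 = β, β2 = s − β and λ = 0, so hcoeff can be reused (Hint 1); restricting the minimizer to [0, 1] then follows 1(g). The argument list below is a guess and should be matched to the docstring in the provided file.

def exactLineSearchFW(data, beta, s):
    # argmin over gamma in [0, 1] of F0((1 - gamma) * beta + gamma * s).
    a, b, c = hcoeff(data, beta, s - beta, lam=0.0)
    if a > 0:
        gamma_star = -b / (2.0 * a)
        return min(max(gamma_star, 0.0), 1.0)   # clip to [0, 1], per 1(g)
    # a == 0: h is linear in gamma, so the optimum is at an endpoint of [0, 1].
    return 0.0 if b >= 0 else 1.0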
8(c) Implement function train_FW as described in its docstring. The function train_FW should receive as input:
• an RDD comprising (x, y) pairs,
• a β0 ∈ Rd, represented as a numpy array, indicating the starting value of the Frank-Wolfe algorithm.
• a λ ∈ R, indicating the regularization parameter/upper bound in the constraint,
• the maximum number of iterations max_iter, and
• the tolerance ε > 0.
The function should run Frank-Wolfe with exact line search to minimize the (non-regularized) MSE F0. The function should return
three values:
• The computed β,
• the last value of the convergence criterion (β_k − s_k)^⊤∇F_0(β_k) at the returned β, and
• the number of iterations performed.
In doing so, the function should make use of gradient, F, solveLin, as well as exactLineSearchFW.
At each iteration, print a single line containing some basic information about the progress of your execution, including:
• the present iteration number k (1,2,3, etc.),
• the time that has elapsed since the function started being executed,
• the present function value F0(βk),
• the present feasibility λ‖βk‖1, and
• the present criterion (β_k − s_k)^⊤∇F_0(β_k).
Include the code snippet you wrote in the report.
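A possible skeleton for this loop is sketched below; it reuses gradient, F, solveLin and exactLineSearchFW with the signatures assumed in the earlier sketches, and treats the lam argument as the ℓ1 bound K. Argument names, order, and the exact printed quantities should be matched to the docstring of train_FW in the provided file.

import time
import numpy as np

def train_FW(data, beta_0, lam, max_iter, eps):
    # Frank-Wolfe with exact line search for the l1-constrained MSE F0.
    start = time.time()
    beta = beta_0
    criterion = float('inf')
    k = 0
    for k in range(1, max_iter + 1):
        grad = gradient(data, beta, lam=0.0)        # gradient of F0 (no regularizer)
        s = solveLin(grad, lam)                     # lam plays the role of the bound K
        criterion = float(np.dot(beta - s, grad))   # (beta_k - s_k)^T grad F0(beta_k)
        # Progress line: iteration, elapsed time, F0(beta_k), ||beta_k||_1, criterion.
        print(k, time.time() - start, F(data, beta, lam=0.0), np.abs(beta).sum(), criterion)
        if criterion <= eps:
            break
        gamma = exactLineSearchFW(data, beta, s)
        beta = (1.0 - gamma) * beta + gamma * s
    return beta, criterion, k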
Question 9: The labels y_i in all three datasets (small, medium, large) have been created using linear models β, with Gaussian noise added. For all datasets, the labels depend only on a handful of features: only the first few coordinates of β are non-zero.
Use your code (with any number of partitions you prefer) to regress values of β for the medium datasets in the assignment for different values of λ and K. You can extract the relevant information from the output printed by the program, but feel free to modify it so that the values requested below are stored in a file. In particular:
9(a) Train a model over data/medium.train and test it over data/medium.test using gradient descent for the following
values of λ:
0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0
Produce either a table or a bar plot with the resulting test MSE values for each value of λ in your report. Include the vector β that
attains the smallest test MSE in your report.
9(b) Train a model over data/medium.train and test it over data/medium.test using Frank-Wolfe for the following
values of K:
1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0
Produce either a table or a bar plot with the resulting test MSE values for each value of K in your report. Include the vector β that
attains the smallest test MSE in your report.
9(c) What differences do you observe between the two solutions in 9(a) and 9(b)?
