程序代写案例-COMS 4995-Assignment 1

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
Assignment 1
Version 2: Fixed up some typos and broken links.
Deadline: October 6, 2021 by 2pm.
Assignment structure: There are two sections of problems. One includes written problems, which
are presented in this file. The other set involves programming. More information about these are
in the Notebook “hw1_code.ipynb”.
Submission:
• You must submit your solutions of written problems as a PDF file through GradeScope.
Note that the PDF file of written problems needs to be concatenated with the PDF file gener-
ated from the Notebook before being submitted to GradeScope. You may access the Grade-
Scope page through the navigation bars of the CourseWorks page. Please make sure you
link the solutions and questions correctly to receive credits on Gradescope. Please include
justification for all your answers - an answer with no work shown will not receive credit.
• The hw1_code_.ipynb iPython Notebook, where is your uni. This
file needs to be submitted to the CourseWorks assignment page.
1 Written Problems
1.1 Hard-Coding Networks: Verify Element in List [10 pts]
The reading on multilayer perceptrons located at https://www.cs.columbia.edu/~zemel/Class/
Nndl/files/Grosse-Multilayer-Perceptrons.pdf may be useful for this question.
In this problem, you need to find a set of weights and biases for a multilayer perceptron that
determines if an input is in a list of length 3. You receive an input list with three elements x1, x2, x3,
and fourth input x4 where xi ∈ Z. If x4 = xj for j = 1, 2 or 3, the network should output 1, and 0
otherwise. You may assume all elements in the input list are distinct for simplicity. You will use
the following architecture: All of the hidden units and the output unit use an indicator activation
function:
φ(z) = I(z ∈ [−1, 1]) =
{
1 if z ∈ [−1, 1]
0 otherwise
Please give a set of weights and biases for the network which correctly implements this function.
Your answer should include:
• A 3× 4 weight matrix W(1) for the hidden layer
• A 3-dimensional vector of biases b(1) for the hidden layer
• A 3-dimensional weight vector w(2) for the output layer
• A scalar bias b(2) for the output layer
You do not need to show your work.
1.2 Backpropagation
The reading on backpropagation located at https://www.cs.columbia.edu/~zemel/Class/Nndl/
files/Grosse-Backpropagation.pdf may be useful for this question.
1
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
1.2.1
Consider a neural network defined with the following procedure:
z1 = W(1)x+ b(1)
h = ReLU(z1)
z2 = W(2)x+ b(2)
g1 = σ(z2)
g2 = h g1
y = W(3)g2 + b(3),
y′ = softmax(y)
S =
N
∑
k=1
I(tk = 1) log(y′k)
J = −S
for input x with class label t where ReLU(z) = max(z, 0) denotes the ReLU activation function,
σ(z) = 11+e−z denotes the Sigmoid activation function, both applied elementwise, and softmax(y) =
exp(y)
∑Ni=1 exp(yi)
. Here, denotes element-wise multiplication.
Computation Graph [5 pts]: Draw the computation graph relating x, t, z1, h, z2, g1, g2, y, y′, S
and J .
Backward Pass [5 pts]: Derive the backprop equations for computing x¯ = ∂J∂x , one variable at a
time, similar to the vectorized backward pass derived in Lec 2.
1.2.2 Automatic Differentiation
Consider the functionL : Rn → RwhereL(x) = 12x>vv>x, and v ∈ Rn×1 and x ∈ Rn×1. Here, we
will explore the relative costs of evaluating Jacobians and vector-Jacobian products. Specifically,
we will study vector-Hessian products, which is a special case of vector-Jacobian products, where
the Jacobian is of the gradient of our function. We denote the gradient of L with respect to x as
g ∈ R1×n and the Hessian of L w.r.t. x with H ∈ Rn×n. The Hessian of L w.r.t. x is defined as:
H =

∂2L
∂x21
∂2L
∂x1∂x2
· · · ∂2L∂x1∂xn
∂2L
∂x2∂x1
∂2L
∂x22
· · · ∂2L∂x2∂xn
...
...
. . .
...
∂2L
∂xn∂x1
∂2L
∂xn∂x2
· · · ∂2L
∂x2n

Compute Hessian [5 pts]: Compute H for n = 3 and v> = [4, 2, 3] at x> = [1, 2, 3]. In other words,
write down the numbers in:
H =

∂2L
∂x21
∂2L
∂x1∂x2
∂2L
∂x1∂x3
∂2L
∂x2∂x1
∂2L
∂x22
∂2L
∂x2∂x3
∂2L
∂x3∂x1
∂2L
∂x3∂x2
∂2L
∂x23

Computation Cost [5 pts]: What is the number of scalar multiplications and memory cost to
compute the Hessian H in terms of n? You may use big O notation.
2
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
1.3 Vector-Hessian Products [10 pts]
Compute z = Hy = vv>y where n = 3, v> = [1, 2, 3], y> = [1, 1, 1] using two algorithms: reverse-
mode and forward-mode autodiff. In backpropagation (also known as reverse-mode autodiff),
you will compute M = v>y first, then compute vM. Whereas, in forward-mode, you will compute
H = vv> then compute Hy. Write down the numerical values of zT = [z1, z2, z3] for the given
v and y. What is the number of scalar multiplications and memory cost in evaluating z with
backpropagation (reverse-mode) in terms of n? What about forward-mode?
1.3.1 Trade-off of Reverse- and Forward-mode Autodiff [10 pts]
Consider computing Z = Hy1y>2 where v ∈ Rn×1, y1 ∈ Rn×1 and y2 ∈ Rm×1. What are number
of scalar multiplications and memory cost in evaluating Z with reverse-mode in terms of n and
m? What about forward-mode? When is forward-mode a better choice? (Hint: Think about the
shape of Z, “tall” versus “wide”.)
2 Programming Problems
The programming assignments are individual work.
You should attempt all questions for this assignment. Most of them can be answered at least
partially even if you were unable to finish earlier questions. If you think your computational
results are incorrect, please say so; that may help you get partial credit.
Introduction
In this section we will learn about word embeddings and make neural networks learn about
words. We could try to match statistics about the words, or we could train a network that takes a
sequence of words as input and learns to predict the word that comes next.
This assignment will ask you to implement a linear embedding and then the backpropagation
computations for a neural language model and then run some experiments to analyze the learned
representation. The amount of code you have to write is very short but each line will require you
to think very carefully. You will need to derive the updates mathematically, and then implement
them using matrix and vector operations in NumPy.
Starter code and data
In the Notebook, you will see a list of packages to be imported to start the programming questions.
You can also manually download and unzip the data from http://www.cs.columbia.edu/~zemel/
Class/Nndl/files/a1_data.tar.gz and put them in the same folder as where you store this note-
book.
Feel free to use a different way to access the files data.pk , partially_trained.pk, and raw_sentences.txt.
The file raw_sentences.txt contains the sentences that we will be using for this assignment. These
sentences are fairly simple ones and cover a vocabulary of only 250 words (+ 1 special [MASK]
token word).
3
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
Now data is a Python dict which contains the vocabulary, as well as the inputs and targets for all
three splits of the data. data['vocab'] is a list of the 251 words in the dictionary; data['vocab'][0]
is the word with index 0, and so on. data['train_inputs'] is a 372,500 x 4 matrix where each
row gives the indices of the 4 consecutive context words for one of the 372,500 training cases. The
validation and test sets are handled analogously.
Even though you only have to modify two specific locations in the code, you may want to read
through this code before starting the assignment.
2.1 GLoVE Word Representations [Total of 16 pts]
In this part, you will implement a simplified version of the GLoVE embedding (please see the
handout for detailed description of the algorithm) with the loss defined as
L({wi, w˜i, bi, b˜i}Vi=1) =
V
∑
i,j=1
(w>i w˜j + bi + b˜j − logXij)2
.
Note that each word is represented by two d-dimensional embedding vectorswi, w˜i and two scalar
biases bi, b˜i.
Answer the following questions:
2.1.1 GLoVE Parameter Count [1 pt]
Given the vocabulary size V and embedding dimensionality d, how many parameters does the
GLoVE model have? Note that each word in the vocabulary is associated with 2 embedding
vectors and 2 biases.
2.1.2 Expression for gradient ∂L∂wi [6 pts]
Write the expression for ∂L∂wi , the gradient of the loss function L with respect to one parameter
vector wi. The gradient should be a function of w, w˜, b, b˜,X with appropriate subscripts (if any).
2.1.3 Implement the gradient update of GLoVE. [6 pts]
See YOUR CODE HERE Comment below for where to complete the code
We have provided a few functions for training the embedding:
• calculate_log_co_occurence computes the log co-occurrence matrix of a given corpus
• train_GLoVE runs momentum gradient descent to optimize the embedding
• loss_GLoVE:
• INPUT - V × d matrix W (collection of V embedding vectors, each d-dimensional); V × d
matrix W_tilde; V × 1 vector b (collection of V bias terms); V × 1 vector b_tilde; V ×V log
co-occurrence matrix.
• OUTPUT - loss of the GLoVE objective
• grad_GLoVE: TO BE IMPLEMENTED.
• INPUT:
4
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
– V× d matrix W (collection of V embedding vectors, each d-dimensional), embedding for
first word;
– V × d matrix W_tilde, embedding for second word;
– V × 1 vector b (collection of V bias terms);
– V × 1 vector b_tilde, bias for second word;
– V ×V log co-occurrence matrix.
• OUTPUT:
– V × d matrix grad_W containing the gradient of the loss function w.r.t. W;
– V × d matrix grad_W_tilde containing the gradient of the loss function w.r.t. W_tilde;
– V × 1 vector grad_b which is the gradient of the loss function w.r.t. b.
– V × 1 vector grad_b_tilde which is the gradient of the loss function w.r.t. b_tilde.
Run the code to compute the co-occurrence matrix. Make sure to add a 1 to the occurrences, so
there are no 0’s in the matrix when we take the elementwise log of the matrix.
• [ ] TO BE IMPLEMENTED: Calculate the gradient of the loss function w.r.t. the parameters
W, W˜, b, and b. You should vectorize the computation, i.e. not loop over every word.
2.1.4 Effect of embedding dimension d [3 pts]
Train both the symmetric and asymmetric GLoVe model with varying dimensionality d by run-
ning the cell below. Comment on: 1. Which d leads to optimal validation performance for the
asymmetric and symmetric models? 2. Why does / doesn’t larger d always lead to better valida-
tion error? 3. Which model is performing better, and why?
2.2 Training the model [Total of 26 pts]
We will modify the architecture slightly inspired by BERT [?]. Instead of having only one output,
the architecture will now take in N = 4 context words, and also output predictions for N = 4
words. See Figure 2 diagram in the handout for the diagram of this architecture.
During training, we randomly sample one of the N context words to replace with a [MASK] token.
The goal is for the network to predict the word that was masked, at the corresponding output
word position. In practice, this [MASK] token is assigned the index 0 in our dictionary. The weights
W(2) = hid_to_output_weights now has the shape NV × H, as the output layer has NV neurons,
where the first V output units are for predicting the first word, then the next V are for predicting
the second word, and so on. We call this as concatenating output uniits across all word positions,
i.e. the (j+ nV)-th column is for the word j in vocabulary for the n-th output word position. Note
here that the softmax is applied in chunks of V as well, to give a valid probability distribution
over the V words. Only the output word positions that were masked in the input are included in
the cross entropy loss calculation:
C = −
B
∑
i
N
∑
n
V
∑
j
m(i)n (t
(i)
n,j log y
(i)
n,j),
where y(i)n,j denotes the output probability prediction from the neural network for the i-th training
example for the word j in the n-th output word, and t(i)n,j is 1 if for the i-th training example, the
word j is the n-th word in context. Finally, m(i)n ∈ {0, 1} is a mask that is set to 1 if we are predicting
5
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
the n-th word position for the i-th example (because we had masked that word in the input), and
0 otherwise.
There are three classes defined in this part: Params, Activations, Model. You will make changes
to Model, but it may help to read through Params and Activations first.
In this part of the assignment, you implement a method which computes the gradient using back-
propagation. To start you out, theModel class contains several important methods used in training:
• compute_activations computes the activations of all units on a given input batch
• compute_loss computes the total cross-entropy loss on a mini-batch
• evaluate computes the average cross-entropy loss for a given set of inputs and targets
You will need to complete the implementation of two additional methods which are needed for
training, and print the outputs of the gradients.
2.2.1 Implement gradient with respect to output layer inputs [8 pts]
* [ ] TOBE IMPLEMENTED: compute_loss_derivative computes the derivative of the loss func-
tion with respect to the output layer inputs.
In other words, if C is the cost function, and the softmax computation for the j-th word in vocab-
ulary for the n-th output word position is:
yn,j =
ezn,j
∑l ezn,l
This function should compute a B×NV matrix where the entries correspond to the partial deriva-
tives ∂C/∂znj . Recall that the output units are concatenated across all positions, i.e. the (j+ nV)-th
column is for the word j in vocabulary for the n-th output word position.
2.2.2 Implement gradient with respect to parameters [8 pts]
• [ ] TO BE IMPLEMENTED: back_propagate is the function which computes the gradient
of the loss with respect to model parameters using backpropagation. It uses the derivatives
computed by compute_loss_derivative. Some parts are already filled in for you, but you need
to compute the matrices of derivatives for embed_to_hid_weights, hid_bias,
hid_to_output_weights, and output_bias. These matrices have the same sizes as the pa-
rameter matrices (see previous section).
In order to implement backpropagation efficiently, you need to express the computations in terms
of matrix operations, rather than for loops. You should first work through the derivatives on pen-
cil and paper. First, apply the chain rule to compute the derivatives with respect to individual
units, weights, and biases. Next, take the formulas you’ve derived, and express them in matrix
form. You should be able to express all of the required computations using only matrix multiplica-
tion, matrix transpose, and elementwise operations — no for loops! If you want inspiration, read
through the code for Model.compute_activations and try to understand how the matrix operations
correspond to the computations performed by all the units in the network.
To make your life easier, we have provided the routine checking.check_gradients, which checks
your gradients using finite differences. You should make sure this check passes before continuing
with the assignment.
6
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
2.2.3 Print the gradients [8 pts]
To make your life easier, we have provided the routine check_gradients, which checks your gra-
dients using finite differences. You should make sure this check passes before continuing with the
assignment. Once check_gradients() passes, call print_gradients() and include its output in
your write-up.
2.2.4 Run model training [2 pts]
Once you’ve implemented the gradient computation, you’ll need to train the model. The function
train implements the main training procedure. It takes two arguments:
• embedding_dim: The number of dimensions in the distributed representation.
• num_hid: The number of hidden units
As the model trains, the script prints out some numbers that tell you how well the training is
going. It shows:
• The cross entropy on the last 100 mini-batches of the training set. This is shown after every
100 mini-batches.
• The cross entropy on the entire validation set every 1000 mini-batches of training.
At the end of training, this function shows the cross entropies on the training, validation and test
sets. It will return a Model instance.
To convince us that you have correctly implemented the gradient computations, please include
the following with your assignment submission:
• [ ] You will submit hw1_code_.ipynb through CourseWorks. You do not need to
modify any of the code except the parts we asked you to implement.
• [ ] In your PDF file, include the output of the function print_gradients. This prints out part
of the gradients for a partially trained network which we have provided, and we will check
them against the correct outputs. Important: make sure to give the output of print_gradients,
not check_gradients.
Since we gave you a gradient checker, you have no excuse for not getting full points on this part.
2.3 Arithmetics and Analysis [8 pts]
In this part, you will perform arithmetic calculations on the word embeddings learned from pre-
vious models and analyze the representation learned by the networks with t-SNE plots.
t-SNE
You will first train the models discussed in the previous sections; you’ll use the trained models for
the remainding of this section.
Important: if you’ve made any changes to your gradient code, you must reload the hw1_code
module and then re-run the training procedure. Python does not reload modules automatically,
and you don’t want to accidentally analyze an old version of your model.
These methods of the Model class can be used for analyzing the model after the training is done:
* tsne_plot_representation creates a 2-dimensional embedding of the distributed representa-
7
COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1
tion space using an algorithm called t-SNE. (You don’t need to know what this is for the assign-
ment, but we may cover it later in the course.) Nearby points in this 2-D space are meant to
correspond to nearby points in the 16-D space. * display_nearest_words lists the words whose
embedding vectors are nearest to the given word * word_distance computes the distance between
the embeddings of two words
Plot the 2-dimensional visualization for the trained model from part 3 using the method
tsne_plot_representation. Look at the plot and find a few clusters of related words. What do
the words in each cluster have in common? Plot the 2-dimensional visualization for the GloVe
model from part 1 using the method tsne_plot_GLoVe_representation. How do the t-SNE
embeddings for both models compare? Plot the 2-dimensional visualization using the method
plot_2d_GLoVe_representation. How does this compare to the t-SNE embeddings? Please an-
swer in 2 sentences for each question and show the plots in your submission.
8

欢迎咨询51作业君