COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 Assignment 1 Version 2: Fixed up some typos and broken links. Deadline: October 6, 2021 by 2pm. Assignment structure: There are two sections of problems. One includes written problems, which are presented in this file. The other set involves programming. More information about these are in the Notebook “hw1_code.ipynb”. Submission: • You must submit your solutions of written problems as a PDF file through GradeScope. Note that the PDF file of written problems needs to be concatenated with the PDF file gener- ated from the Notebook before being submitted to GradeScope. You may access the Grade- Scope page through the navigation bars of the CourseWorks page. Please make sure you link the solutions and questions correctly to receive credits on Gradescope. Please include justification for all your answers - an answer with no work shown will not receive credit. • The hw1_code_
.ipynb iPython Notebook, where is your uni. This file needs to be submitted to the CourseWorks assignment page. 1 Written Problems 1.1 Hard-Coding Networks: Verify Element in List [10 pts] The reading on multilayer perceptrons located at https://www.cs.columbia.edu/~zemel/Class/ Nndl/files/Grosse-Multilayer-Perceptrons.pdf may be useful for this question. In this problem, you need to find a set of weights and biases for a multilayer perceptron that determines if an input is in a list of length 3. You receive an input list with three elements x1, x2, x3, and fourth input x4 where xi ∈ Z. If x4 = xj for j = 1, 2 or 3, the network should output 1, and 0 otherwise. You may assume all elements in the input list are distinct for simplicity. You will use the following architecture: All of the hidden units and the output unit use an indicator activation function: φ(z) = I(z ∈ [−1, 1]) = { 1 if z ∈ [−1, 1] 0 otherwise Please give a set of weights and biases for the network which correctly implements this function. Your answer should include: • A 3× 4 weight matrix W(1) for the hidden layer • A 3-dimensional vector of biases b(1) for the hidden layer • A 3-dimensional weight vector w(2) for the output layer • A scalar bias b(2) for the output layer You do not need to show your work. 1.2 Backpropagation The reading on backpropagation located at https://www.cs.columbia.edu/~zemel/Class/Nndl/ files/Grosse-Backpropagation.pdf may be useful for this question. 1 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 1.2.1 Consider a neural network defined with the following procedure: z1 = W(1)x+ b(1) h = ReLU(z1) z2 = W(2)x+ b(2) g1 = σ(z2) g2 = h g1 y = W(3)g2 + b(3), y′ = softmax(y) S = N ∑ k=1 I(tk = 1) log(y′k) J = −S for input x with class label t where ReLU(z) = max(z, 0) denotes the ReLU activation function, σ(z) = 11+e−z denotes the Sigmoid activation function, both applied elementwise, and softmax(y) = exp(y) ∑Ni=1 exp(yi) . Here, denotes element-wise multiplication. Computation Graph [5 pts]: Draw the computation graph relating x, t, z1, h, z2, g1, g2, y, y′, S and J . Backward Pass [5 pts]: Derive the backprop equations for computing x¯ = ∂J∂x , one variable at a time, similar to the vectorized backward pass derived in Lec 2. 1.2.2 Automatic Differentiation Consider the functionL : Rn → RwhereL(x) = 12x>vv>x, and v ∈ Rn×1 and x ∈ Rn×1. Here, we will explore the relative costs of evaluating Jacobians and vector-Jacobian products. Specifically, we will study vector-Hessian products, which is a special case of vector-Jacobian products, where the Jacobian is of the gradient of our function. We denote the gradient of L with respect to x as g ∈ R1×n and the Hessian of L w.r.t. x with H ∈ Rn×n. The Hessian of L w.r.t. x is defined as: H = ∂2L ∂x21 ∂2L ∂x1∂x2 · · · ∂2L∂x1∂xn ∂2L ∂x2∂x1 ∂2L ∂x22 · · · ∂2L∂x2∂xn ... ... . . . ... ∂2L ∂xn∂x1 ∂2L ∂xn∂x2 · · · ∂2L ∂x2n Compute Hessian [5 pts]: Compute H for n = 3 and v> = [4, 2, 3] at x> = [1, 2, 3]. In other words, write down the numbers in: H = ∂2L ∂x21 ∂2L ∂x1∂x2 ∂2L ∂x1∂x3 ∂2L ∂x2∂x1 ∂2L ∂x22 ∂2L ∂x2∂x3 ∂2L ∂x3∂x1 ∂2L ∂x3∂x2 ∂2L ∂x23 Computation Cost [5 pts]: What is the number of scalar multiplications and memory cost to compute the Hessian H in terms of n? You may use big O notation. 2 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 1.3 Vector-Hessian Products [10 pts] Compute z = Hy = vv>y where n = 3, v> = [1, 2, 3], y> = [1, 1, 1] using two algorithms: reverse- mode and forward-mode autodiff. In backpropagation (also known as reverse-mode autodiff), you will compute M = v>y first, then compute vM. Whereas, in forward-mode, you will compute H = vv> then compute Hy. Write down the numerical values of zT = [z1, z2, z3] for the given v and y. What is the number of scalar multiplications and memory cost in evaluating z with backpropagation (reverse-mode) in terms of n? What about forward-mode? 1.3.1 Trade-off of Reverse- and Forward-mode Autodiff [10 pts] Consider computing Z = Hy1y>2 where v ∈ Rn×1, y1 ∈ Rn×1 and y2 ∈ Rm×1. What are number of scalar multiplications and memory cost in evaluating Z with reverse-mode in terms of n and m? What about forward-mode? When is forward-mode a better choice? (Hint: Think about the shape of Z, “tall” versus “wide”.) 2 Programming Problems The programming assignments are individual work. You should attempt all questions for this assignment. Most of them can be answered at least partially even if you were unable to finish earlier questions. If you think your computational results are incorrect, please say so; that may help you get partial credit. Introduction In this section we will learn about word embeddings and make neural networks learn about words. We could try to match statistics about the words, or we could train a network that takes a sequence of words as input and learns to predict the word that comes next. This assignment will ask you to implement a linear embedding and then the backpropagation computations for a neural language model and then run some experiments to analyze the learned representation. The amount of code you have to write is very short but each line will require you to think very carefully. You will need to derive the updates mathematically, and then implement them using matrix and vector operations in NumPy. Starter code and data In the Notebook, you will see a list of packages to be imported to start the programming questions. You can also manually download and unzip the data from http://www.cs.columbia.edu/~zemel/ Class/Nndl/files/a1_data.tar.gz and put them in the same folder as where you store this note- book. Feel free to use a different way to access the files data.pk , partially_trained.pk, and raw_sentences.txt. The file raw_sentences.txt contains the sentences that we will be using for this assignment. These sentences are fairly simple ones and cover a vocabulary of only 250 words (+ 1 special [MASK] token word). 3 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 Now data is a Python dict which contains the vocabulary, as well as the inputs and targets for all three splits of the data. data['vocab'] is a list of the 251 words in the dictionary; data['vocab'][0] is the word with index 0, and so on. data['train_inputs'] is a 372,500 x 4 matrix where each row gives the indices of the 4 consecutive context words for one of the 372,500 training cases. The validation and test sets are handled analogously. Even though you only have to modify two specific locations in the code, you may want to read through this code before starting the assignment. 2.1 GLoVE Word Representations [Total of 16 pts] In this part, you will implement a simplified version of the GLoVE embedding (please see the handout for detailed description of the algorithm) with the loss defined as L({wi, w˜i, bi, b˜i}Vi=1) = V ∑ i,j=1 (w>i w˜j + bi + b˜j − logXij)2 . Note that each word is represented by two d-dimensional embedding vectorswi, w˜i and two scalar biases bi, b˜i. Answer the following questions: 2.1.1 GLoVE Parameter Count [1 pt] Given the vocabulary size V and embedding dimensionality d, how many parameters does the GLoVE model have? Note that each word in the vocabulary is associated with 2 embedding vectors and 2 biases. 2.1.2 Expression for gradient ∂L∂wi [6 pts] Write the expression for ∂L∂wi , the gradient of the loss function L with respect to one parameter vector wi. The gradient should be a function of w, w˜, b, b˜,X with appropriate subscripts (if any). 2.1.3 Implement the gradient update of GLoVE. [6 pts] See YOUR CODE HERE Comment below for where to complete the code We have provided a few functions for training the embedding: • calculate_log_co_occurence computes the log co-occurrence matrix of a given corpus • train_GLoVE runs momentum gradient descent to optimize the embedding • loss_GLoVE: • INPUT - V × d matrix W (collection of V embedding vectors, each d-dimensional); V × d matrix W_tilde; V × 1 vector b (collection of V bias terms); V × 1 vector b_tilde; V ×V log co-occurrence matrix. • OUTPUT - loss of the GLoVE objective • grad_GLoVE: TO BE IMPLEMENTED. • INPUT: 4 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 – V× d matrix W (collection of V embedding vectors, each d-dimensional), embedding for first word; – V × d matrix W_tilde, embedding for second word; – V × 1 vector b (collection of V bias terms); – V × 1 vector b_tilde, bias for second word; – V ×V log co-occurrence matrix. • OUTPUT: – V × d matrix grad_W containing the gradient of the loss function w.r.t. W; – V × d matrix grad_W_tilde containing the gradient of the loss function w.r.t. W_tilde; – V × 1 vector grad_b which is the gradient of the loss function w.r.t. b. – V × 1 vector grad_b_tilde which is the gradient of the loss function w.r.t. b_tilde. Run the code to compute the co-occurrence matrix. Make sure to add a 1 to the occurrences, so there are no 0’s in the matrix when we take the elementwise log of the matrix. • [ ] TO BE IMPLEMENTED: Calculate the gradient of the loss function w.r.t. the parameters W, W˜, b, and b. You should vectorize the computation, i.e. not loop over every word. 2.1.4 Effect of embedding dimension d [3 pts] Train both the symmetric and asymmetric GLoVe model with varying dimensionality d by run- ning the cell below. Comment on: 1. Which d leads to optimal validation performance for the asymmetric and symmetric models? 2. Why does / doesn’t larger d always lead to better valida- tion error? 3. Which model is performing better, and why? 2.2 Training the model [Total of 26 pts] We will modify the architecture slightly inspired by BERT [?]. Instead of having only one output, the architecture will now take in N = 4 context words, and also output predictions for N = 4 words. See Figure 2 diagram in the handout for the diagram of this architecture. During training, we randomly sample one of the N context words to replace with a [MASK] token. The goal is for the network to predict the word that was masked, at the corresponding output word position. In practice, this [MASK] token is assigned the index 0 in our dictionary. The weights W(2) = hid_to_output_weights now has the shape NV × H, as the output layer has NV neurons, where the first V output units are for predicting the first word, then the next V are for predicting the second word, and so on. We call this as concatenating output uniits across all word positions, i.e. the (j+ nV)-th column is for the word j in vocabulary for the n-th output word position. Note here that the softmax is applied in chunks of V as well, to give a valid probability distribution over the V words. Only the output word positions that were masked in the input are included in the cross entropy loss calculation: C = − B ∑ i N ∑ n V ∑ j m(i)n (t (i) n,j log y (i) n,j), where y(i)n,j denotes the output probability prediction from the neural network for the i-th training example for the word j in the n-th output word, and t(i)n,j is 1 if for the i-th training example, the word j is the n-th word in context. Finally, m(i)n ∈ {0, 1} is a mask that is set to 1 if we are predicting 5 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 the n-th word position for the i-th example (because we had masked that word in the input), and 0 otherwise. There are three classes defined in this part: Params, Activations, Model. You will make changes to Model, but it may help to read through Params and Activations first. In this part of the assignment, you implement a method which computes the gradient using back- propagation. To start you out, theModel class contains several important methods used in training: • compute_activations computes the activations of all units on a given input batch • compute_loss computes the total cross-entropy loss on a mini-batch • evaluate computes the average cross-entropy loss for a given set of inputs and targets You will need to complete the implementation of two additional methods which are needed for training, and print the outputs of the gradients. 2.2.1 Implement gradient with respect to output layer inputs [8 pts] * [ ] TOBE IMPLEMENTED: compute_loss_derivative computes the derivative of the loss func- tion with respect to the output layer inputs. In other words, if C is the cost function, and the softmax computation for the j-th word in vocab- ulary for the n-th output word position is: yn,j = ezn,j ∑l ezn,l This function should compute a B×NV matrix where the entries correspond to the partial deriva- tives ∂C/∂znj . Recall that the output units are concatenated across all positions, i.e. the (j+ nV)-th column is for the word j in vocabulary for the n-th output word position. 2.2.2 Implement gradient with respect to parameters [8 pts] • [ ] TO BE IMPLEMENTED: back_propagate is the function which computes the gradient of the loss with respect to model parameters using backpropagation. It uses the derivatives computed by compute_loss_derivative. Some parts are already filled in for you, but you need to compute the matrices of derivatives for embed_to_hid_weights, hid_bias, hid_to_output_weights, and output_bias. These matrices have the same sizes as the pa- rameter matrices (see previous section). In order to implement backpropagation efficiently, you need to express the computations in terms of matrix operations, rather than for loops. You should first work through the derivatives on pen- cil and paper. First, apply the chain rule to compute the derivatives with respect to individual units, weights, and biases. Next, take the formulas you’ve derived, and express them in matrix form. You should be able to express all of the required computations using only matrix multiplica- tion, matrix transpose, and elementwise operations — no for loops! If you want inspiration, read through the code for Model.compute_activations and try to understand how the matrix operations correspond to the computations performed by all the units in the network. To make your life easier, we have provided the routine checking.check_gradients, which checks your gradients using finite differences. You should make sure this check passes before continuing with the assignment. 6 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 2.2.3 Print the gradients [8 pts] To make your life easier, we have provided the routine check_gradients, which checks your gra- dients using finite differences. You should make sure this check passes before continuing with the assignment. Once check_gradients() passes, call print_gradients() and include its output in your write-up. 2.2.4 Run model training [2 pts] Once you’ve implemented the gradient computation, you’ll need to train the model. The function train implements the main training procedure. It takes two arguments: • embedding_dim: The number of dimensions in the distributed representation. • num_hid: The number of hidden units As the model trains, the script prints out some numbers that tell you how well the training is going. It shows: • The cross entropy on the last 100 mini-batches of the training set. This is shown after every 100 mini-batches. • The cross entropy on the entire validation set every 1000 mini-batches of training. At the end of training, this function shows the cross entropies on the training, validation and test sets. It will return a Model instance. To convince us that you have correctly implemented the gradient computations, please include the following with your assignment submission: • [ ] You will submit hw1_code_.ipynb through CourseWorks. You do not need to modify any of the code except the parts we asked you to implement. • [ ] In your PDF file, include the output of the function print_gradients. This prints out part of the gradients for a partially trained network which we have provided, and we will check them against the correct outputs. Important: make sure to give the output of print_gradients, not check_gradients. Since we gave you a gradient checker, you have no excuse for not getting full points on this part. 2.3 Arithmetics and Analysis [8 pts] In this part, you will perform arithmetic calculations on the word embeddings learned from pre- vious models and analyze the representation learned by the networks with t-SNE plots. t-SNE You will first train the models discussed in the previous sections; you’ll use the trained models for the remainding of this section. Important: if you’ve made any changes to your gradient code, you must reload the hw1_code module and then re-run the training procedure. Python does not reload modules automatically, and you don’t want to accidentally analyze an old version of your model. These methods of the Model class can be used for analyzing the model after the training is done: * tsne_plot_representation creates a 2-dimensional embedding of the distributed representa- 7 COMS 4995 Fall 2021 with Professor Richard Zemel Assignment 1 tion space using an algorithm called t-SNE. (You don’t need to know what this is for the assign- ment, but we may cover it later in the course.) Nearby points in this 2-D space are meant to correspond to nearby points in the 16-D space. * display_nearest_words lists the words whose embedding vectors are nearest to the given word * word_distance computes the distance between the embeddings of two words Plot the 2-dimensional visualization for the trained model from part 3 using the method tsne_plot_representation. Look at the plot and find a few clusters of related words. What do the words in each cluster have in common? Plot the 2-dimensional visualization for the GloVe model from part 1 using the method tsne_plot_GLoVe_representation. How do the t-SNE embeddings for both models compare? Plot the 2-dimensional visualization using the method plot_2d_GLoVe_representation. How does this compare to the t-SNE embeddings? Please an- swer in 2 sentences for each question and show the plots in your submission. 8 欢迎咨询51作业君