CS 471 Intro Artificial Intelligence, 2020 Spring — Practice Final

Note: These questions are meant only to help guide your studying for the exam. It is not guaranteed that this practice final covers all of the material (though we have tried to cover as much as possible), and it is not guaranteed that every topic on this practice will show up on the final exam. Please note that while the final exam will emphasize material covered since Midterm 2, material from earlier parts of the course may be tested as well. We will go over this practice final and its solutions in the review session in class. Good luck!

Bayes Nets

1. After consulting a behavioral psychologist, we obtained the following complete set of conditional relationships:
• HW depends only on Party and Smart
• Mac depends only on Smart and Creative
• Project depends only on Smart and Creative
• Success depends only on HW and Project
• Happy depends only on Party, Mac, and Success

i. Draw the Bayesian network.
ii. Write the joint distribution as a product of conditional probabilities.
iii. What is the number of independent parameters needed for each conditional probability table?
iv. What is the total number of independent parameters?
v. Using only the Bayesian network structure from part i, answer the following True/False questions and provide a brief explanation:
a. Party is independent of Success given HW.
b. Party is independent of Smart given Success.
c. Party is independent of Creative given Happy.
vi. Given the simple Bayes net shown above, what is the probability that it is raining given that the grass is wet, P(R = T | G = T)? Show your work.

2. (a) Write down the joint probability distribution associated with the following Bayes Net.
Express the answer as a product of terms representing the individual conditional probability tables associated with this Bayes Net.

(b) Draw the Bayes net associated with the following joint distribution:
P(A) · P(B) · P(C|A, B) · P(D|C) · P(E|B, C)

(c) Do the following products of factors correspond to a valid joint distribution over the variables A, B, C, D? (Circle TRUE or FALSE and give a sentence of explanation.)
(i) TRUE FALSE P(A) · P(B) · P(C|A) · P(C|B) · P(D|C)
(ii) TRUE FALSE P(A) · P(B|A) · P(C) · P(D|B, C)
(iii) TRUE FALSE P(A) · P(B|A) · P(C) · P(C|A) · P(D)
(iv) TRUE FALSE P(A|B) · P(B|C) · P(C|D) · P(D|A)

(d) What factor can be multiplied with the following factors to form a valid joint distribution? (Write "none" if the given set of factors cannot be turned into a joint by the inclusion of exactly one more factor.)
(i) P(A) · P(B|A) · P(C|A) · P(E|B, C, D)
(ii) P(D) · P(B) · P(C|D, B) · P(E|C, D, A)

(e) Answer the next questions based on the Bayes Net below. All variables have domains of {−1, 0, 1}.
(i) Before eliminating any variables or including any evidence, how many entries does the factor at G have?
(ii) Now we observe e = 1 and want to query P(D | e = 1), and you get to pick the first variable to be eliminated.
• Which choice would create the largest factor f1?
• Which choice would create the smallest factor f1?

Regression

3. We would like to classify some data. We have N samples, where each sample consists of a feature vector x = {x1, · · · , xk} and a label y ∈ {0, 1}. We introduce a new type of classifier called logistic regression, which produces predictions as follows:

h(x) = s(Σ_{j=1}^{k} wj · xj)

where s(γ) = 1 / (1 + exp(−γ)) is the logistic function, exp(x) = e^x, and w = {w1, · · · , wk} are the learned weights. Let's find the weights wj for logistic regression using stochastic gradient descent. We would like to minimize the following loss function for each sample:

L = −y log s(w · x) − (1 − y) log(1 − s(w · x))

(a) Find dL/dwi. Hint: s′(γ) = s(γ)(1 − s(γ)).
(b) Write the stochastic gradient descent update for wi.
Our step size is η.

4. In many real-world scenarios our data has millions of dimensions, but a given example has only hundreds of non-zero features. For example, in document analysis with word counts for features, our dictionary may have millions of words, but a given document has only hundreds of unique words. In this question we will make l2-regularized SGD efficient when our input data is sparse. Recall that in l2-regularized logistic regression, we want to maximize the following objective (in this problem we have excluded w0 for simplicity):

Σ_{j=1}^{N} l(x^(j), y^(j), w) − λ Σ_{i=1}^{d} w_i^2

where l(x^(j), y^(j), w) is the logistic objective function and the remaining sum is our regularization penalty. When we do stochastic gradient descent on point (x^(j), y^(j)), we are approximating the objective function as

l(x^(j), y^(j), w) − λ Σ_{i=1}^{d} w_i^2

Definition of sparsity: Assume that our input data has d features, i.e. x^(j) ∈ R^d. In this problem, we will consider the scenario where x^(j) is sparse. Formally, let s be the average number of nonzero elements in each example. We say the data is sparse when s ≪ d. In the following questions, your answer should take the sparsity of x^(j) into consideration when possible. Note: when we use a sparse data structure, we can iterate over the non-zero elements in O(s) time, whereas a dense data structure requires O(d) time.

a) Let us first consider the case when λ = 0. Write down the SGD update rule for wi when λ = 0, using step size η, given the example (x^(j), y^(j)).
b) If we use a dense data structure, what is the average time complexity to update wi when λ = 0? What if we use a sparse data structure? Justify your answer in one or two sentences.
c) Now let us consider the general case when λ > 0. Write down the SGD update rule for wi when λ > 0, using step size η, given the example (x^(j), y^(j)).
d) If we use a dense data structure, what is the average time complexity to update wi when λ > 0?
e) Let w_i^(t) be the weight after the t-th update.
Now imagine that we perform k SGD updates on w using examples (x^(t+1), y^(t+1)), · · · , (x^(t+k), y^(t+k)), where x_i^(j) = 0 for every example in the sequence (i.e. the i-th feature is zero for all of the examples in the sequence). Express the new weight w_i^(t+k) in terms of w_i^(t), k, η, and λ.
f) Using your answer to the previous part, come up with an efficient algorithm for regularized SGD when we use a sparse data structure. What is the average time complexity per example? (Hint: when do you need to update wi?)

5. Below is a data set to be linearly separated into two sets (+) and (o) by the blue line. The blue line is initialized as x2 = −2·x1 by the weights w = [0 1 0.5]′. Use data points from the set below to test their classification by the blue line. Once you find a point that is incorrectly classified, use it to update the weights until correct weights are found. Use η = 0.2.

6. We are dealing with samples x where x is a single value. We would like to test two alternative regression models:
1. y = ax + e
2. y = ax + bx^2 + e
We make the same assumptions we had in class about the distribution of e, namely e ~ N(0, σ^2).
a) Assume we have n samples x1 … xn with their corresponding y values y1 … yn. Derive the value assigned to b in model 2. You can use a in the equation for b.
b) Which of the two models is more likely to fit the training data better? Explain your answer.
i) model 1
ii) model 2
iii) both will fit equally well
iv) impossible to tell
c) Which of the two models is more likely to fit the test data better? Explain your answer.
i) model 1
ii) model 2
iii) both will fit equally well
iv) impossible to tell

Perceptron

7. You are learning a multi-class perceptron between 4 classes. The weight vectors are currently [1, 0]^T, [0, 1]^T, [−1, 0]^T, [0, −1]^T for the classes A, B, C, and D. The next training example x has a label of A and feature vector f(x). For the following questions, do not make any assumptions about tie-breaking.
(Do not write down a solution that creates a tie.) Show your work.
(i) Write down a feature vector for which no weight vectors will be updated.
f(x) = [ ] or Not possible
(ii) Write down a feature vector for which only wA will be updated by the perceptron.
f(x) = [ ] or Not possible
(iii) Write down a feature vector for which only wA and wB will be updated by the perceptron.
f(x) = [ ] or Not possible
(iv) Write down a feature vector for which only wA and wC will be updated by the perceptron.
f(x) = [ ] or Not possible

The weight vectors are the same as before, but now there is a bias feature with value 1 for all x, and the weight of this bias feature is 0, −2, 1, −1 for classes A, B, C, and D respectively. As before, the next training example x has a label of A and a feature vector f(x). The always-"1" bias feature is the first entry in f(x).
(v) Write down a feature vector for which only wB and wC will be updated by the perceptron.
f(x) = [ 1 ] or Not possible
(vi) Write down a feature vector for which only wA and wC will be updated by the perceptron.
f(x) = [ 1 ] or Not possible

8. We would like to use a perceptron to train a classifier for drone position in a 3D world coordinate system, with labels 1 (active tracking of adversaries) or 0 (passive surveillance). Let's use a learning rate of α = 0.25. Consider the following labeled training data:

Features – Drone Position (x, y, z) | Label y*
(−1, 2, 3) | 1
(3, −1, 3) | 0
(1, 2, 3) | 0
(3, 1, 3) | 1

(a) Our three perceptron weights have been initialized to w1 = 2, w2 = −2, w3 = 0.5. After processing the first point with the perceptron learning rule, what will be the updated values for these weights?
(b) After how many steps will the perceptron algorithm converge? Write "never" if it will never converge. Note: one step means processing one point. Points are processed in order and then repeated, until convergence.
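As a sanity check for parts (a) and (b) above, the update can be simulated in a few lines. This sketch assumes the common convention that the perceptron predicts 1 when w · x ≥ 0 and uses the update w ← w + α(y* − y)x; the threshold convention is an assumption, since the problem does not state it.

```python
# Binary perceptron step, assuming predict 1 iff w.x >= 0 (convention assumed).
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def perceptron_step(w, x, y_star, alpha=0.25):
    # Update only on a mismatch: w <- w + alpha * (y* - y) * x
    y = predict(w, x)
    if y != y_star:
        w = [wi + alpha * (y_star - y) * xi for wi, xi in zip(w, x)]
    return w

# First training point (-1, 2, 3) with label 1, initial w = [2, -2, 0.5]:
# activation is 2*(-1) + (-2)*2 + 0.5*3 = -4.5 < 0, so y = 0 and an update fires.
w = perceptron_step([2.0, -2.0, 0.5], (-1, 2, 3), 1)
print(w)  # -> [1.75, -1.5, 1.25]
```

Cycling this step over the four points in order (and repeating) is exactly the process part (b) asks about.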
(c) Instead of the perceptron, we decide to treat the perceptron as a single-node neural network and update the weights using gradient descent on the loss function. The loss function for one data point is Loss(y, y*) = (y − y*)^2, where y* is the training label for a given point and y is the output of our single-node network for that point.
(i) Given a general activation function g(z) and its derivative g′(z), what is the derivative of the loss function with respect to w2, in terms of g, g′, y*, x1, x2, x3, w1, w2, and w3?
(ii) For this question, a threshold activation function is used: g(z) = 1 if z ≥ 0 and g(z) = 0 if z < 0.
Use the gradient descent update equation below to update the weights given a single data point. With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after processing the first point?
Gradient descent update equation: wi = wi − α ∂Loss/∂wi. Show all of your work.
(iii) For this question, a TanH activation function is used: g(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)).
Use the gradient descent update equation below to update the weights given a single data point. With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after processing the second point?
Gradient descent update equation: wi = wi − α ∂Loss/∂wi. Show all of your work. (You can round off the values to 1 digit after the decimal.)

Neural networks

9. The network below is a neural network with inputs x1 and x2, and outputs y1 and y2. All variables are scalar values. Note that ReLU(x) = max(0, x). The expressions for the internal nodes in the network are given here for convenience:
h1 = w11·x1 + w12·x2
h2 = w21·x1 + w22·x2
h3 = w31·x1 + w32·x2
r1 = ReLU(h1)
r2 = ReLU(h2)
r3 = ReLU(h3)
s1 = x1 + r1 + r2
s2 = x2 + r2 + r3
y1 = exp(s1) / (exp(s1) + exp(s2))
y2 = exp(s2) / (exp(s1) + exp(s2))

(a) Forward Propagation
Suppose for this part only, x1 = 3, x2 = 5, w11 = −10, w12 = 7, w21 = 2, w22 = 5, w31 = 4, w32 = −4.
What are the values of the following internal nodes? Please simplify any fractions.
(i) h1 =
(ii) s1 =
(iii) y2 =

(b) Back Propagation
Compute the following gradients analytically. The answer should be an expression of any of the nodes in the network (x1, x2, h1, h2, h3, r1, r2, r3, s1, s2, y1, y2) or the weights w11, w12, w21, w22, w31. In the case where a gradient depends on the values of nodes in the network, please list all possible analytical expressions, caused by active/inactive ReLUs, separated by commas.
Hint 1: If z is a function of y, and y is a function of x, the chain rule for the derivative is dz/dx = (dz/dy)(dy/dx).
Hint 2: Recall that for the softmax outputs, ∂y1/∂s1 = y1(1 − y1) and ∂y1/∂s2 = −y1·y2.
(i) (ii) (iii) (iv) (v) (vi)
(c) In roughly 15 words, what role could the non-negative values in node r2 play, given its location in the network architecture?

10. For each of the piecewise-linear functions below, mark all networks from the list above that can represent the function exactly on the range x ∈ (−∞, ∞). In the networks above, relu denotes the element-wise ReLU nonlinearity: relu(z) = max(0, z). The networks Gi use 1-dimensional layers, while the networks Hi have some 2-dimensional intermediate layers. Briefly explain your reasoning.
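The forward pass in question 9(a) is mechanical, so a short script is a handy way to check hand computations. This sketch (the helper name `forward` is ours) implements the node equations exactly as given, shifting by max(s1, s2) before exponentiating so the softmax stays numerically stable:

```python
import math

def forward(x1, x2, w11, w12, w21, w22, w31, w32):
    """Forward pass for the question-9 network; returns (h1, s1, y2)."""
    relu = lambda z: max(0.0, z)
    h1 = w11 * x1 + w12 * x2
    h2 = w21 * x1 + w22 * x2
    h3 = w31 * x1 + w32 * x2
    r1, r2, r3 = relu(h1), relu(h2), relu(h3)
    s1 = x1 + r1 + r2
    s2 = x2 + r2 + r3
    # Softmax over (s1, s2), shifted by the max for numerical stability.
    m = max(s1, s2)
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return h1, s1, e2 / (e1 + e2)

# Part (a) values: x1 = 3, x2 = 5, w11 = -10, w12 = 7, w21 = 2, w22 = 5, w31 = 4, w32 = -4.
h1, s1, y2 = forward(3, 5, -10, 7, 2, 5, 4, -4)
print(h1, s1, y2)  # h1 = 5, s1 = 39, y2 = 1 / (1 + e^3)
```

Since s1 = 39 and s2 = 36, the simplified fraction for (iii) is y2 = 1 / (1 + e^3).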