CS 188
Fall 2019
Introduction to
Artificial Intelligence Final Exam
• You have approximately 170 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam.
• The exam is closed book, closed calculator, and closed notes except your two-page crib sheet.
• Mark your answers ON THE EXAM IN THE DESIGNATED ANSWER AREAS. Provide a brief explanation if applica-
ble.
• In the interest of fairness, we want everyone to have access to the same information. To that end, we will not be answering
questions about the content. If a clarification is needed, it will be projected at the front of the room. Make sure to
periodically check the clarifications.
• For multiple choice questions,
– □ means mark all options that apply
– # means mark a single choice
– When selecting an answer, please fill in the bubble or square completely (● and ■)
– If you need to undo a selection, either erase it completely or lay a giant cross (X) over the box. The staff reserves the right to subtract points from ambiguous answers.
First name
Last name
SID
Student to your right
Student to your left
For staff use only:
Q1. Coin Stars /10
Q2. Five Card Game /14
Q3. Vehicle Perception Indication /10
Q4. Hidden CSP /14
Q5. Snail Bayes /15
Q6. We Are Getting Close... /11
Q7. Generating and classifying text /10
Q8. Deep “Blackjack” /16
Total /100
Q1. [10 pts] Coin Stars
In a new online game called Coin Stars, all players are walking around an M x N grid to collect hidden coins, which only
appear when you’re on top of them. There are also power pellets scattered across the board, which are visible to all players.
If you walk onto a square with a power pellet, your power level goes up by 1, and the power pellet disappears. Players will also
attack each other if one player enters a square occupied by another player. In an attack, the player with a higher power level
will steal all the coins from the other player. If they have equal power levels, nothing happens. Each turn, players go in order to
move in one of the following directions: {N, S, E, W}.
In this problem, you and your friend Amy are playing Coin Stars against each other. You are player 1, and your opponent
Amy is player 2. Our state space representation includes the locations of the power pellets (x_{p_j}, y_{p_j}) and the following player information:
• Each player’s location (푥푖, 푦푖)
• Each player’s power level 푙푖
• Each player’s coin count 푐푖
(a) (i) [2 pts] Suppose a player wins by collecting more coins at the end of a number of rounds, so we can formulate this as a minimax problem with the value of a node being c_1 - c_2. Consider the following game tree, where you are the maximizing player (maximizing your net advantage, as seen above) and the opponent is the minimizer. Assuming both players act optimally, if a branch can be pruned, fill in its square completely; otherwise leave the square unmarked.
■ None of the above can be pruned
If you traverse the tree with α-β pruning, you will see that no branches can be pruned.
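As an illustrative aside, here is a minimal Python sketch of α-β pruning on a generic game tree (the exam's specific tree is not reproduced here, so the Node structure and leaf utilities below are hypothetical):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        value: float = 0.0                          # utility, only meaningful at leaves
        children: List["Node"] = field(default_factory=list)

    def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
        # Return the minimax value of node, skipping branches that cannot change it.
        if not node.children:                       # leaf node
            return node.value
        if maximizing:
            v = float("-inf")
            for child in node.children:
                v = max(v, alphabeta(child, alpha, beta, False))
                alpha = max(alpha, v)
                if alpha >= beta:                   # remaining children are pruned
                    break
            return v
        v = float("inf")
        for child in node.children:
            v = min(v, alphabeta(child, alpha, beta, True))
            beta = min(beta, v)
            if alpha >= beta:                       # remaining children are pruned
                break
        return v

    # Tiny two-ply example with made-up leaf utilities.
    tree = Node(children=[Node(children=[Node(3), Node(5)]),
                          Node(children=[Node(2), Node(9)])])
    print(alphabeta(tree))                          # 3 (the leaf with value 9 is pruned)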
(ii) [1 pt] Suppose that instead of the player with more coins winning, every player receives a payout equal to the number of coins they've collected. Can we still use a multi-layer minimax tree (like the one above) to find the optimal action?
# Yes, because the update in payout policy does not affect the minimax structure of the game.
# Yes, but not for the reason above
# No, because we can no longer model the game under the updated payout policy with a game tree.
● No, but not for the reason above
No, because the game is no longer zero-sum: your opponent obtaining more coins does not necessarily make you
worse off, and vice versa. We can still model this game with a game-tree, where each node contains a tuple of two
values, instead of a single value. But this means the tree is no longer a minimax tree.
An example of using the minimax tree but not optimizing the number of coins collected: when given a choice between gathering 3 coins or stealing 2 coins from the opponent, the minimax solution with c_1 - c_2 will steal the 2 coins (net gain of 4), even though this will cause it to end up with fewer coins.
(b) Suppose we want to train a Reinforcement Learning (RL) agent using Q-learning to play against a wide range of human participants, with the objective of obtaining more coins than its opponent, as in (a.i).
Your friend Blake discovers that a computer scientist and expert Coin Stars player, Oswald, has published (open-sourced) his policy π_e, which consists of a set of expert features F_e = [f_1, f_2, ..., f_n], all of which can be computed using the current state representation mentioned in the problem statement. Oswald has hand-tuned the weights W_e = [w_1, w_2, ..., w_n] of the policy π_e such that π_e is near optimal assuming the opponent also plays optimally.
You decide to use the same set of features as Oswald's F_e, and come up with three proposed training procedures, all of which will learn the optimal policy (i.e., the best response) against that particular opponent:
I Playing against an opponent who plays π_e, Oswald's expert-tuned policy that assumes our agent plays optimally
II Playing against an opponent who plays a fixed policy π_p, a sub-optimal policy that moves to maximize the collection of power pellets, and chases its opponent afterwards
III Playing against an opponent who plays a fixed policy π_c, a sub-optimal policy that ignores power pellets and its opponent, and moves to maximize the collection of hidden coins
(i) [1 pt] With which of the training procedures is our agent most likely to learn a set of weights that can achieve the highest win rate against Oswald, an expert human participant who plays according to π_e?
● (I)   # (II)   # (III)   # There is not enough information to distinguish among the three
Procedure (I) is the direct fit to the expert human players, and thus is selected. Procedure (II)’s and (III)’s opponents
are not expert human players, thus it’s unlikely that the policies trained to cope with these opponents can win as
frequently as the policy from Procedure (I).
(ii) [1 pt] Is either of (II) or (III) guaranteed to result in an agent that achieves a higher win rate than (I) when playing against novice human participants who play sub-optimally, in an unknown way?
# Yes, because (II) and (III) are trained with sub-optimal opponents, whereas (I) is trained with expert opponents.
# Yes, but not for the reason above.
● No, because the novice human may not be sub-optimal in the same way as the opponents in (II) or (III).
# No, but not for the reason above.
Procedure (I) could yield a policy that is robust enough to still beat sub-optimal human players, but the policy also makes the incorrect assumption that the human opponent plays optimally, which is not true for novice human players. There are many ways to play sub-optimally, so, similar to Procedure (I), with high probability Procedures (II) and (III) make incorrect assumptions about the novice human player's strategy and therefore cannot guarantee a higher win rate.
(iii) [2 pts] Suppose instead of training an RL agent against an agent with a fixed strategy, we instead train the RL agent
against itself. That is, the RL agent plays against an opponent with its weights, and the features are computed from
each agent’s point of view.
Which of the following are potential issues in this self-play Reinforcement Learning approach?
■ The RL agent’s weights are not guaranteed to converge because it is playing against itself.
□ Because the opponent you play against can be arbitrarily bad, the RL agent cannot improve at the game at all.
□ Because the RL agent is playing against the same opponent (itself), the best (highest reward) policy from its
perspective does not change over the course of training.
□ Because the RL agent is playing against itself, it does not need to do exploration (for example, it can just take the greedy action with respect to its current Q-values).
# None of the above
The 1st choice: suppose there exist three strategies A, B, C in the game such that A beats B, B beats C, and C beats A. (For an analogy, consider rock-paper-scissors.) If the agent is currently playing B, then it learns that playing A is better, and will change its weights to play A. Once it plays A, playing C will look better, so it switches to C; then B looks better again, and so on, ad infinitum. So this choice is true.
The 2nd choice: even if the opponent is arbitrarily bad, you still might learn some basic skills such as how to gather
coins. Then the opponent is better in the next iteration, and the RL agent might get better yet.
The 3rd choice: again, consider the rock-paper-scissors example. Over time, the best policy (against the previous
iteration of the agent) changes.
The 4th choice: You need to do exploration, as, for example, your initial Q-values can be arbitrarily bad.
(c) [1 pt] For this part only, suppose the game randomly reveals some hidden coins throughout the game, making them visible to both you and your friend. Under which of the conditions below will both players benefit the most from the feature "Distance to closest visible coins"?
● Coins are sparse (~1 coin per 50 squares), reveal frequency high (~1 coin becomes visible per 1 move)
# Coins are sparse (~1 coin per 50 squares), reveal frequency low (~1 coin becomes visible per 50 moves)
# Coins are dense (~1 coin per 2 squares), reveal frequency high (~1 coin becomes visible per 1 move)
# Coins are dense (~1 coin per 2 squares), reveal frequency low (~1 coin becomes visible per 50 moves)
● Equally useful in all conditions above
Sparse coins make information about their location more valuable. This is because if coins are dense, agents are able to collect lots of coins without knowing their locations, just by walking on the grid. A higher reveal frequency is strictly better because players can make their plans more efficient with strictly more information. We also accepted "Equally useful in all conditions above" as an answer, since "benefit" can be interpreted as a better chance to win the game, and this situation equally benefits both players in all coin conditions.
(d) Using the original problem setup, we have the following features and weights for a given state s:
    Feature: f_1(s, a) = |x_1 - x_2| + |y_1 - y_2|, initial weight w_1 = 1
    Feature: f_2(s, a) = l_1 - l_2, initial weight w_2 = 4
    Feature: f_3(s, a) = 1 / (the Manhattan distance from player 1 to their closest pellet), initial weight w_3 = 2
(i) [1 pt] Calculate Q(s, a) for the q-state where player 1 is at (1, 1) and player 2 is at (5, 4). Player 1 has power level 7, and player 2 has power level 3. The closest pellet to player 1 is located at (2, 1).
# 24   ● 25   # 26   # 27   # 28   # A value different from all of the above   # There is not enough information to calculate this value
1 * (|1 - 5| + |1 - 4|) + 4 * (7 - 3) + 2 * (1/1) = 1 * 7 + 4 * 4 + 2 * 1 = 25
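As a sketch, the same linear feature-based Q-value computation in Python, using the numbers given in this subpart:

    # Q(s, a) = w1*f1 + w2*f2 + w3*f3 for the q-state described above
    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    p1, p2, pellet = (1, 1), (5, 4), (2, 1)
    l1, l2 = 7, 3
    weights  = [1, 4, 2]
    features = [manhattan(p1, p2),          # f1 = |x1 - x2| + |y1 - y2| = 7
                l1 - l2,                    # f2 = 4
                1 / manhattan(p1, pellet)]  # f3 = 1 / (distance to closest pellet) = 1
    q_value = sum(w * f for w, f in zip(weights, features))
    print(q_value)                          # 25.0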
(ii) [1 pt] We observe a sample (s, a, r) for the q-state in the previous part with r = 23. Assuming a learning rate of α = 0.5 and a discount factor of γ = 0.5, calculate the new weight w_1,new after a single update of approximate Q-learning. This part will be graded independently of the previous part, so please check your work before proceeding.
w_1,new =   # -13   # -6   # 1   # 8   # 15   # A value different from all of the above   ● There is not enough information to calculate this value
We do not have any information about the values of features for any outgoing state-action pair 푠′, 푎′ from this current
푠, 푎 state-action pair.
Q2. [14 pts] Five Card Game
There is a two-player zero-sum card game called Five Card Game. Player A plays odd-numbered turns (and thus plays first),
and he initially has the cards labeled 1, 3, 5 in his hand. Player B plays even-numbered turns, and she initially has the cards
labeled 2 and 4 in her hand. They take turns to give out one of their cards to the judge based on their policies. The game ends
after 5 turns and forms the sequence X_1, X_2, X_3, X_4, X_5, where X_1, X_3, X_5 is a permutation of 1, 3, 5 provided by player A, and X_2, X_4 is a permutation of 2, 4 provided by player B.
However, neither of the two players has access to what cards their opponent plays throughout the game. The only hint they receive is from the judge: when X_t has been determined by either of the players (t >= 2), the judge of the game will compare X_t and X_{t-1}, and assign one point to one of the players. With probability p_t (0 <= p_t <= 1), the point will be given to the player whose card is larger, and with probability 1 - p_t, the point will be given to the player whose card is smaller. p_2, p_3, p_4, p_5 are pre-defined before the game and everyone knows their values. Both the players and the audience know the point assignment right after the judge assigns it. The player who wins more points wins the game.
We denote the point that player A earned at each time step as U^A_2, U^A_3, U^A_4, U^A_5 (each value can be either 0 or 1), and thus all the variables are determined in the following order: X_1, X_2, U^A_2, X_3, U^A_3, X_4, U^A_4, X_5, U^A_5. Note player A determines X_3 after knowing U^A_2, player B determines X_4 after knowing U^A_2 and U^A_3, and player A determines X_5 after knowing U^A_2, U^A_3, U^A_4.
As an illustration, if X_1, X_2, X_3, X_4, X_5 is 3, 2, 5, 4, 1, and p_t = 1 for all t (so that the bigger card's owner always gets the point), then U^A_2, U^A_3, U^A_4, U^A_5 is 1, 1, 1, 0 and player A wins this game. In this example, U^A_2 = 1 because X_2 < X_1 and hence the point is awarded to player A, while U^A_5 = 0 because X_5 < X_4 and hence the point is awarded to player B.
(a) Suppose 푝푡 = 1 for all 푡.
(i) [1 pt] Each player uses their own optimal deterministic policy and is aware that their opponent is also using the
optimal deterministic policy. How many points will Player A obtain? Hint: Drawing a game tree might help you
here.
Answer: 2
Player A is the maximizer, and player B is the minimizer. Utility of the root node = 2.
(ii) [1 pt] Player B decides to give out her cards randomly (with equal probability for each ordering). If Player A
becomes aware of this, what is the expected number of points Player A will obtain if he plays optimally?
Answer: 2.5
Player A is again the maximizer, but here player B has the chance nodes. Utility of the root node = 2.5.
Now we assume both players have their own random policies. Suppose a player's policy for each X_t is a conditional probability table, conditioned on the MINIMUM set of DEPENDENT information they know when playing the card X_t. Then at each time step t, they randomly sample which card to play according to their probability table. During the game, each player only knows their own policy. In addition, the policies do not change over time.
We want to construct a Bayes Net with the minimum number of edges to investigate the relations between the nodes X_1, X_2, U^A_2, X_3, U^A_3, X_4, U^A_4, X_5, U^A_5.
Hint 1: Drawing the Bayes Net with nodes X_1, X_2, U^A_2, X_3, U^A_3, X_4, U^A_4, X_5, U^A_5 while answering part (b) will help you in part (c). But only your response to the questions will be graded. No partial credit for the drawings.
Hint 2: Do Player A and Player B know each other's card sequence exactly? Do they know their own card sequence exactly?
Hint 3: As noted in the problem statement, before any player determines his or her turn X_t, he or she knows all the existing hints U^A_2, ..., U^A_{t-1}.
The Bayes Net: (drawing not reproduced here)
Special note about U^A_2 -> X_3: As indicated in the problem, player A determines X_3 after getting the value of U^A_2 from the judge. U^A_2 would be a valuable indication of X_2 played by player B if X_1 = 3. However, for X_4, player B has no other choice: she only has at hand the card that is different from X_2.
(b) (i) [1 pt] Which of the following are directly reflected in the Bayes net’s probability tables (including prior probability
tables and conditional probability tables)?
■ The judge’s point assignment probability.
□ Player A's winning probability.
# None of the above
(ii) [2 pts] For each table, mark the minimal set of variables that each probability table should be conditioned on.
Player B's policy probability table to determine X_2:
□ X_1   ● None of the above
Player B has no information about X_1, because the judge has not yet stepped in to give hints.
Player A's policy probability table to determine X_3:
■ X_1   □ X_2   ■ U^A_2   # None of the above
X_1 eliminates options for X_3, and U^A_2 gives a clue to what X_2 could have been, and thus a clue to what X_3 would be.
Player B's policy probability table to determine X_4:
□ X_1   ■ X_2   □ X_3   □ U^A_2   □ U^A_3   # None of the above
Note that X_4 can be precisely determined by X_2, and hence it does not need to be conditioned on any other variables.
Player A's policy probability table to determine X_5:
■ X_1   □ X_2   ■ X_3   □ X_4   □ U^A_2   □ U^A_3   □ U^A_4   # None of the above
X_1 and X_3 narrow the options for X_5 down to a single card.
(c) Analyze the independence or conditional independence between variables.
Hint: Feel free to use the Bayes Net you constructed in part (b), but this part will be graded independently. Please double check your work in part (b).
(i) [1 pt] Which of the following pairs of variables are independent, given no observations?
□ X_2 and X_3   ■ X_1 and X_4   # None of the above
(ii) [1 pt] Which of the following pairs of variables are independent, given only U^A_2?
□ X_2 and X_3   □ X_1 and X_4   ● None of the above
By applying D-separation to the Bayes Net constructed for this question, we can obtain the correct answers. For the second question, although the path X_2 - U^A_2 - X_3 is cut off, there is still another active path X_2 - U^A_2 - X_1 - X_3 connecting them.
Now the game is over and you have only observed the values of U^A_2, U^A_3, U^A_4, U^A_5, and want to decode the values of X_1, X_2, X_3, X_4, X_5. You want to solve this problem by searching over all the possible states. Each of your states contains the card sequence at some point during the game. For example, one search path from the start state to one goal state could be:
() → (3,) → (3, 2) → (3, 2, 5) → (3, 2, 5, 4) → (3, 2, 5, 4, 1) (Goal State).
(d) [1 pt]
How many goal states are there in your search problem? Answer: 12
There are 12 leaf nodes in the game tree, and hence there are 12 goal states.
Which data structure is more similar to the search graph? ● Your game tree from part (a)   # Your Bayes net from part (b)
There is no valid start state or goal state in the Bayes Net. In the game tree, the root node is the start state, while the leaf
nodes are the goal states.
Now we wish to decode X_1, X_2, X_3, X_4, X_5 given U^A_2, U^A_3, U^A_4, U^A_5 by solving the following problem:
argmax_{X_1, X_2, X_3, X_4, X_5} P(U^A_2 | X_1, X_2) P(U^A_3 | X_2, X_3) P(U^A_4 | X_3, X_4) P(U^A_5 | X_4, X_5)
(e) We then assign proper cost to each edge in the search graph.
(i) [2 pts] Which is a correct edge cost configuration for the edge connecting the states (X_1, ..., X_{t-1}) and (X_1, ..., X_t) (where t >= 2)? Note that a correct configuration means that the optimal path of your search problem is always the correct argmax solution above, for any valid player policy probability table values and p_t values.
# P(U^A_t | X_{t-1}, X_t)   # -P(U^A_t | X_{t-1}, X_t)   # ln P(U^A_t | X_{t-1}, X_t)   ● -ln P(U^A_t | X_{t-1}, X_t)
# e^{P(U^A_t | X_{t-1}, X_t)}   # e^{-P(U^A_t | X_{t-1}, X_t)}   # None of the above
To turn multiplication (in the argmax) into summation (the total cost in search), we need a logarithm. To turn argmax into argmin (search finds the minimum cost), we need the negative sign.
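A small Python sketch of this transformation (the probabilities below are placeholders, not values from the exam):

    import math

    # Maximizing a product of probabilities is equivalent to minimizing the sum of -ln p.
    def edge_cost(p):
        # p plays the role of P(U^A_t | X_{t-1}, X_t) for the observed U^A_t; assumed > 0
        return -math.log(p)

    probs = [0.8, 0.5, 0.9]                       # hypothetical per-edge probabilities
    total_cost = sum(edge_cost(p) for p in probs)
    print(total_cost, math.exp(-total_cost))      # exp(-total cost) recovers the product 0.36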
Now suppose we observe that U^A_2, U^A_3, U^A_4, U^A_5 all equal 1.
(ii) [2 pts]
What is the value of P(U^A_5 | X_4, X_5) used to compute the edge cost from the state (3, 2, 5, 4) to the state (3, 2, 5, 4, 1)? Your answer should just be the value of P(U^A_5 | X_4, X_5) rather than the edge cost computed using your answer to part (e)(i).
# p_5   ● 1 - p_5   # Cannot be determined only by p_t - we also need the players' policy probability tables.
What is the value of P(U^A_3 | X_2, X_3) used to compute the edge cost from the state (3, 2) to the state (3, 2, 5)? Your answer should just be the value of P(U^A_3 | X_2, X_3) rather than the edge cost computed using your answer to part (e)(i).
● p_3   # 1 - p_3   # Cannot be determined only by p_t - we also need the players' policy probability tables.
Note we still have not assigned an edge cost to the edges connecting () and (X_1,). Let us assign a cost of 1 to all edges of this type.
(f) [2 pts] Now you will conduct A* search on your constructed search graph with the proper edge cost from part (e). The heuristic for each node will be set to the number of time steps remaining until the game ends. For example, the heuristic for the state (3, 2, 5) is 2 (2 steps from the game end). Which of the following statements is correct?
# Neither tree search nor graph search will be complete.
● Both tree search and graph search will be complete, but neither is guaranteed to be optimal.
# Both tree search and graph search will be complete, and only the tree search is guaranteed to be optimal.
# Both the tree search and the graph search will be complete and optimal.
In this question, the graph search is exactly the same as the tree search, because the search graph is a tree and all the edges are directed from the parent node to the child node. The heuristic is not admissible because if p_t > 1/e, then -ln p_t < 1. Similarly, if p_t < 1 - 1/e, then -ln(1 - p_t) < 1. So this heuristic can overestimate the true cost. The counterexample for
optimality (for both tree search and graph search) can be: p_2 = p_3 = 1, p_4 = 0.5·e^ε, p_5 = (1/e)·e^δ, where ε and δ are very small positive numbers satisfying δ > 2ε, and we have U^A_2 = U^A_3 = U^A_4 = 1 and U^A_5 = 0. Then the optimal path should reach the goal state (1, 2, 3, 4, 5). But the A* search will return the path reaching the goal state (1, 2, 5, 4, 3) or (3, 4, 5, 2, 1) (the tie breaker decides which). It is worth pointing out that if all the edge costs are set to -log_a P(U^A_t | X_{t-1}, X_t), and if 1 < a < 2, then both the tree search and the graph search versions of A* will be complete and optimal. Try a proof if you are interested. Hint: the edge costs at the same tree depth can only take two values A and B, and we have a^{-A} + a^{-B} = 1.
Q3. [10 pts] Vehicle Perception Indication
A vehicle is trying to identify the situation of the world around it using a set of sensors located around the vehicle.
Each sensor reading is based on an object's location (LOC) and an object's movement (MOVE). The sensor reading will then
produce various values for its predicted location, predicted movement, and image rendered, which is then sent back to the user.
(a) The vehicle takes an action, and we assign some utility to the action based on the object’s location and movement. Possible
actions are MOVE TOWARDS, MOVE AWAY, and STOP. Suppose the decision network faced by the vehicle is the
following.
(i) [2 pts] Based on the diagram above, which of the following could possibly be true?
■ VPI (Image) = 0
□ VPI (SRead) < 0
□ VPI (SMove,SRead) > VPI (SRead)
■ VPI (Feedback) = 0
# None of the above
VPI(Image) = 0 because there is no active path connecting Image and U.
VPI cannot be negative, so option 2 is not selected.
VPI(SMove, SRead) = VPI(SMove | SRead) + VPI(SRead), so we can cancel VPI(SRead) from both sides, and the question becomes whether VPI(SMove | SRead) > 0. We can see that this cannot be true, because after shading in SRead, there is no active path connecting SMove and U.
There is an active path connecting Feedback and U, therefore VPI(Feedback) ≥ 0. It could still be 0 because an active path only gives the possibility of being > 0.
(ii) [2 pts] Based on the diagram above, which of the following must necessarily be true?
■ VPI (Image) = 0
□ VPI (SRead) = 0
■ VPI (SMove,SRead) = VPI (SRead)
□ VPI (Feedback) = 0
# None of the above
VPI(Image) = 0 because there is no active path connecting Image and U.
VPI(SRead) could be > 0 because SRead-MOVE-U is an active path between SRead and U.
VPI(SMove, SRead) = VPI(SMove | SRead) + VPI(SRead), so we can cancel VPI(SRead) from both sides, and the question becomes whether VPI(SMove | SRead) = 0. We can see that this must be true, because after shading in SRead, there is no active path connecting SMove and U.
VPI(Feedback) could be >0 because feedback-SLoc-SRead-MOVE-U is an active path
Let's assume that your startup has less money, so we use a simpler sensor network. One possible sensor network can be represented as follows.
You have distributions of 푃 (LOC), 푃 (MOVE), 푃 (푆푅푒푎푑|LOC,MOVE), 푃 (푆퐿표푐|푆푅푒푎푑) and utility values 푈 (푎, 푙, 푚).
(b) Complete the equation for determining the expected utility for some ACTION a:
EU(a) = ( (i) (ii) (iii) ) U(a, l, m)
(i) [1 pt]   ● Σ_l P(l)   # Σ_{sloc} P(sloc | l)   # Σ_l Σ_{sloc} P(sloc | l)   # 1
(ii) [1 pt]   ● Σ_m P(m)   # Σ_m P(sloc | m)   # Σ_l Σ_m Σ_{sloc} P(sloc | l) P(sloc | m)   # 1
(iii) [1 pt]   # * Σ_l Σ_m Σ_{sloc} P(sloc | l) P(sloc | m)   # + Σ_l Σ_m Σ_{sloc} P(sloc | l) P(sloc | m)   # + Σ_l Σ_m Σ_{sloc} P(sloc | l, m) P(l) P(m)   ● * 1
EU(a) = Σ_l P(l) Σ_m P(m) U(a, l, m)
We can eliminate SRead and SLoc via marginalization, so they don't need to be included in the expression.
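As a sketch, the same expected-utility computation in Python; the distributions and utility values below are hypothetical placeholders, not the exam's tables:

    # EU(a) = sum_l P(l) * sum_m P(m) * U(a, l, m)
    P_loc  = {'near': 0.3, 'far': 0.7}            # hypothetical P(LOC)
    P_move = {'toward': 0.4, 'away': 0.6}         # hypothetical P(MOVE)
    U = {('STOP', 'near', 'toward'): 10, ('STOP', 'near', 'away'): 2,
         ('STOP', 'far',  'toward'): 1,  ('STOP', 'far',  'away'): 0}

    def expected_utility(a):
        return sum(P_loc[l] * P_move[m] * U[(a, l, m)]
                   for l in P_loc for m in P_move)

    print(expected_utility('STOP'))               # ~1.84 with the placeholder numbers above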
(c) Your colleague Bob invented a new sensor to observe values of 푆퐿표푐.
(i) [1 pt] Suppose that your company had no sensors until this point. Which of the following expressions are equivalent to VPI(SLoc)?
■ VPI(SLoc) = (Σ_{sloc} P(sloc) MEU(SLoc = sloc)) - max_a EU(a)
■ VPI(SLoc) = MEU(SLoc) - MEU(∅)
□ VPI(SLoc) = max_{sloc} MEU(SLoc = sloc) - MEU(∅)
# None of the above
Option 2 is correct by definition, and option 1 is the expanded version of option 2.
(ii) [2 pts] Gaagle, an established company, wants to sell your startup a device that gives you 푆푅푒푎푑. Given that
you already have Bob’s device (that gives you 푆퐿표푐), what is the maximum amount of money you should pay for
Gaagle’s device? Suppose you value $1 at 1 utility.
□ VPI(푆푅푒푎푑)
■ VPI(푆푅푒푎푑) − VPI(푆퐿표푐)
□ VPI(푆푅푒푎푑, 푆퐿표푐)
■ VPI(푆푅푒푎푑, 푆퐿표푐) − VPI(푆퐿표푐)
■ VPI(푆푅푒푎푑|푆퐿표푐)
# None of the above
Choices 4 and 5 are correct by definition.
Choice 2 is true because VPI(SLoc | SRead) = 0, and thus VPI(SRead) = VPI(SRead) + 0 = VPI(SRead) + VPI(SLoc | SRead) = VPI(SRead, SLoc), which makes choice 2 the same as choice 4.
Q4. [14 pts] Hidden CSP
We have studied lots of algorithms to solve a CSP with a list of known constraints. But can we solve a CSP without explicitly
knowing the constraints?
The wildfire reaches Berkeley in the middle of a CS 188 lecture, and all N ≥ 1 students need to be evacuated on M ≥ 1 buses (each with capacity N).
Suppose the only constraints are 푘 ≥ 0 pairs of students who have to be on two different buses (for example, "Zelda and Yoda"
constitute one pair of students who have to be on two different buses). You don’t know these constraints explicitly, but you can
observe the yelling from the buses, which is correlated with whether all constraints are satisfied.
In this question, 푆 = +푠 means that the current assignment satisfies all the constraints, and 푆 = −푠 means it violates at least 1
constraint. We also denote 푌 = +푦 as the event where there is yelling from the buses, and 푌 = −푦 as the event where there is
no yelling from the buses.
(a) Your friend Alice suggests starting with a random assignment S_0 (where you randomly assign each of the N students to one of the M buses).
(i) [1 pt] What is p_0 = P(S_0 = +s), the probability that the randomly generated assignment satisfies all the constraints, for any choice of N, M, k? (Hint: try writing down the expression for the probability that it satisfies two different constraints.)
# 0   # 1 - (1 / (M choose 2))^k   # 1 - (1/M)^k   # (1 - 1/(M choose 2))^k   # (1 - 1/M)^k   # 1   ● None of the above
We present solutions from 3 different angles.
Solution 1: Observe the unwarranted independence assumption in the chain rule.
First of all, P(violating c_i) = 1/M for every constraint c_i, because that is the probability of assigning the two students in conflict to the same bus.
1 - (1/M)^k is wrong because (1/M)^k is the probability of violating all k constraints (which is already the wrong event), and it also makes unwarranted independence assumptions.
The bogus solution (1 - 1/M)^k is wrong because it makes the unwarranted assumption that the non-violation events of different constraints are independent.
We know P(not violating c_i) = 1 - 1/M. But
P(not violating c_i, not violating c_j, not violating c_h)
= P(not violating c_h | not violating c_j, not violating c_i) * P(not violating c_j | not violating c_i) * P(not violating c_i)
≠ P(not violating c_h) * P(not violating c_j) * P(not violating c_i) in general.
Since P(not violating c_i, not violating c_j, not violating c_h) ≠ (1 - 1/M)^3 in general, the expression is also not correct for any k ≥ 3.
Solution 2: Observe how the probability should intuitively behave.
It's clear that 0 and 1 are wrong, because if k = 0 the probability is 1, and if M = 1 and k ≥ 1 the probability is 0. So the probability must depend on M and k. As k → ∞ with M fixed, the probability should go to 0, which immediately eliminates the second and third expressions (they go to 1 instead).
Now let's consider the other two expressions. Suppose that we have N = 3 students, M = 2 buses, and k = 3 constraints. If the constraints are:
Student A not on the same bus as student B. Student B not on the same bus as student C. Student A not on the same bus as student C.
Then the probability is 0. Yet neither of the remaining expressions is 0 in this case. Therefore, the answer has to be none of the above.
Solution 3: Find one good counterexample.
With M = 2 and k = 3, we see that none of the expressions work, because the probability depends on the particular constraints. For example, if the constraints are:
Student A not on the same bus as student B. Student B not on the same bus as student C. Student A not on the same bus as student C.
Then the probability is 0.
If the constraints are: Student A not on the same bus as student B. Student B not on the same bus as student C. Student C not on the same bus as student D.
Then the probability is 1/8 > 0. So we can see that there is no expression for the probability that depends only on M, N, and k.
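A quick Monte Carlo sketch of the two k = 3 examples above (M = 2 buses), confirming that the probability depends on which constraints are present, not only on M, N, k:

    import random

    def satisfaction_prob(n_students, m_buses, constraints, trials=100_000):
        # Estimate P(a uniformly random assignment satisfies every pairwise constraint).
        hits = 0
        for _ in range(trials):
            bus = [random.randrange(m_buses) for _ in range(n_students)]
            if all(bus[i] != bus[j] for i, j in constraints):
                hits += 1
        return hits / trials

    triangle = [(0, 1), (1, 2), (0, 2)]        # A-B, B-C, A-C: unsatisfiable with 2 buses
    chain    = [(0, 1), (1, 2), (2, 3)]        # A-B, B-C, C-D: true probability 1/8
    print(satisfaction_prob(3, 2, triangle))   # ~0.0
    print(satisfaction_prob(4, 2, chain))      # ~0.125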
(ii) [1 pt] Since we don’t know the constraints explicitly, our observation of yelling is the only way to infer whether all
constraints are satisfied.
Alice, an expert in the human behavior of yelling, constructs the table P(Y_t | S_t) below to describe the relation between yelling (Y_t) and the hidden assignment being satisfying (S_t). What is P(S_0 = +s | Y_0 = +y)?
# 0.1   # 0.1(p_0) + 0.2(1 - p_0)   # 0.1(p_0) + 0.8(1 - p_0)   # 0.1(p_0) / [0.1(p_0) + 0.2(1 - p_0)]   ● 0.1(p_0) / [0.1(p_0) + 0.8(1 - p_0)]   # None of the above
Please note that we can use the complement rule to get P(+y | -s) = 1 - P(-y | -s) = 1 - 0.2 = 0.8.
P(+y) = P(+y, +s) + P(+y, -s) = P(+y | +s) P(+s) + P(+y | -s) P(-s) = 0.1 * p_0 + 0.8 * (1 - p_0)
Using Bayes' rule, P(+s | +y) = P(+y, +s) / P(+y) = 0.1(p_0) / [0.1(p_0) + 0.8(1 - p_0)]
P(S_0):   S_0 = +s: p_0;   S_0 = -s: 1 - p_0
P(Y_t | S_t):   (S_t = +s, Y_t = +y): 0.1;   (S_t = +s, Y_t = -y): –;   (S_t = -s, Y_t = +y): –;   (S_t = -s, Y_t = -y): 0.2
P(S_{t+1} | S_t):   (S_t = +s, S_{t+1} = +s): r_{++};   (S_t = +s, S_{t+1} = -s): –;   (S_t = -s, S_{t+1} = +s): –;   (S_t = -s, S_{t+1} = -s): r_{--}
Note: "–" in the tables means the value is not given in the problem.
(b) Your friend Bob suggests iteratively updating the assignment: at each time step, we randomly select a student and randomly assign them to a different bus. Assume for the remainder of Q4 that there are at least 2 buses (M ≥ 2).
The resulting transition probability table is listed above, where 푃 (푆푡+1|푆푡) is the probability of transitioning from 푆푡 to
푆푡+1 after the iterative improvement at the end of time 푡.
(i) [2 pts] Which of the following conditions are sufficient for 푟++ = 푃 (푆푡+1 = +푠|푆푡 = +푠) = 0? Select all that apply.
□ The underlying constraint graph is fully connected
□ The underlying constraint graph does not contain a cycle
■ There exists exactly one satisfying assignment
# None of the above
Reason for 1 and 2 being unselected: when there are more buses than students (M > N), you can always move one student to one of the at least M - N vacant buses and still satisfy all constraints. Since the probability of doing so is at least (1/N) * (M - N)/(M - 1) > 0, r_{++} is definitely non-zero.
Reason for 3 being selected: if there is only one satisfying assignment, changing any variable to any other value will violate at least one constraint.
(ii) [2 pts] Ben claims that although r_{++} and r_{--} can be approximated with constant numbers, they actually fluctuate throughout the execution of the program. Is Ben right, and why?
# Ben is right, because r_{++} and r_{--} are non-constant functions of the time step t.
# Ben is right, because forcing the randomly selected student onto a different bus changes the belief over the hidden states.
# Ben is right, but not for the reasons above.
# Ben is wrong, because r_{++} and r_{--} only change monotonically (either always going up or always going down), and never fluctuate.
● Ben is wrong, because r_{++} and r_{--} are always constants.
# Ben is wrong, but not for the reasons above.
Let a_i, a_j ∈ A_t, the set of all possible assignments, whose domain is {(1, 1, 1, ..., 1), (1, 1, 1, ..., 2), (1, 1, 1, ..., 3), ..., (M, M, M, ..., M - 1), (M, M, M, ..., M)} for all times t.
(1) P(A_t = a_i) is a constant for all t, since the uniform distribution is the stationary distribution: for every a_i, there are N(M - 1) adjacent states, each reached with equal probability. The number N(M - 1) comes from the fact that each of the N students could be re-assigned to any of the M - 1 buses other than the one it is currently on.
(1.analogy) Model the distribution P(A_t = a_i) as water levels in many connected buckets: if all buckets start at the same level and flow to each other in the same way, there will be no change in the distribution of water across the buckets.
(2) Since P(A_{t+1} = a_j | A_t = a_i) is a constant for all t (the swapping procedure is the same over time), it follows that P(A_{t+1} = a_j, A_t = a_i) is a constant for all t.
(3) Since S = +s just sums over all values of a_i that satisfy the constraints, and S = -s just sums over all the values of a_i that don't, it follows that
(a) P(S_t = +s) is a constant and
(b) P(S_{t+1} = +s, S_t = +s) is a constant.
So the conditional probability r_{++} = P(S_{t+1} = +s | S_t = +s) is also a constant.
(iii) [1 pt] Does your selection above change to one of the other 5 choices if, hypothetically, we initially assign all students to bus 1 instead?
# Yes, because "r_{++} = undefined" is true immediately (at time t = 0) if we initially assign all students to bus 1, but not true if the initial assignment is randomly generated.
● Yes, but not for the reason above.
# No, because the properties of r_{++} and r_{--} are independent of how we generate the initial assignment.
# No, but not for the reason above.
Choice 1 is not selected because if k = 0, all assignments are satisfying, and thus r_{++} = 1, not undefined. Also, r_{++} = undefined can be true even if we assign students uniformly, if no satisfying assignment exists.
The real reason is that "initially assign all students to bus 1" creates a huge bias in the distribution over the assignment space A_0: now (1, 1, 1, ..., 1) has probability 1 while all other assignments, including (2, 1, 1, ..., 1), have probability 0. This drastic difference gets balanced out after we start re-assigning students to different buses, because now (2, 1, 1, ..., 1) has non-zero probability in the assignment space, so the probability of its projection onto the {+s, -s} space changes, and thus r_{++} changes over time.
(analogy) Following the analogy in the previous part, if all buckets are connected and one bucket starts full while the others start empty, then over time all buckets will reach the same level, but during this process the distribution of water across the buckets fluctuates instead of staying constant. (In this case, (1, 1, 1, ..., 1) starts full and all others start empty.)
(c) Your friend Charlie suggests the following policy: since the absence of yelling (-y) is a good indicator that the current assignment satisfies all constraints (+s), we will not alter any variable when we currently observe no yelling (-y).
P(S_0):   S_0 = +s: p_0;   S_0 = -s: 1 - p_0
P(Y_t | S_t):   (S_t = +s, Y_t = +y): 0.1;   (S_t = +s, Y_t = -y): –;   (S_t = -s, Y_t = +y): –;   (S_t = -s, Y_t = -y): 0.2
P(S_{t+1} | S_t, Y_t):
(S_t = +s, Y_t = +y, S_{t+1} = +s): r_{++}
(S_t = +s, Y_t = +y, S_{t+1} = -s): (I)
(S_t = +s, Y_t = -y, S_{t+1} = +s): (II)
(S_t = +s, Y_t = -y, S_{t+1} = -s): (III)
(S_t = -s, Y_t = +y, S_{t+1} = +s): (IV)
(S_t = -s, Y_t = +y, S_{t+1} = -s): r_{--}
(S_t = -s, Y_t = -y, S_{t+1} = +s): (V)
(S_t = -s, Y_t = -y, S_{t+1} = -s): (VI)
Note: "–" in the tables means the value is not given in the problem.
(i) [1 pt] Which of the quantities in the P(S_{t+1} | S_t, Y_t) table are guaranteed to be 1, per Charlie's policy?
□ (I)   ■ (II)   □ (III)   □ (IV)   □ (V)   ■ (VI)   # None of the above
Both (II) and (VI) correspond to -y and keep S on the same side, so they are guaranteed to be 1. Also, (III) and (V) are guaranteed to be 0 since we don't swap values.
(ii) [1 pt] We are following Charlie's algorithm, and for the first T time steps all observations are no yelling (-y), and thus we have not altered any variables. Conditioned on Y_0 = Y_1 = ... = Y_{T-1} = -y, what is the probability that the initial assignment indeed satisfies all constraints (+s)?
# p_0   # 0.9(p_0)   # (0.9)^T p_0   # (0.9(p_0))^T   # 0.9(p_0) / [0.9(p_0) + 0.2(1 - p_0)]   ● (0.9)^T p_0 / [(0.9)^T p_0 + (0.2)^T (1 - p_0)]   # (0.9(p_0))^T / [(0.9(p_0))^T + (0.2(1 - p_0))^T]   # None of the above
Key observation: since we never changed the assignment, either S_0 = S_1 = ... = S_T = +s or S_0 = S_1 = ... = S_T = -s.
P(S_0 = +s, Y_0 = Y_1 = ... = Y_{T-1} = -y)
= P(S_{T-1} = +s, Y_0 = Y_1 = ... = Y_{T-1} = -y)   (because the assignment is never altered)
= P(S_{T-2} = +s, Y_0 = ... = Y_{T-2} = -y) * P(Y_{T-1} = -y | S_{T-2} = +s) * P(S_{T-1} = +s | S_{T-2} = +s, Y_{T-2} = -y) + P(S_{T-2} = -s, Y_0 = ... = Y_{T-2} = -y) * P(Y_{T-1} = -y | S_{T-2} = -s) * P(S_{T-1} = +s | S_{T-2} = -s, Y_{T-2} = -y)
= P(S_{T-2} = +s, Y_0 = ... = Y_{T-2} = -y) * 0.9 * 1 + P(S_{T-2} = -s, Y_0 = ... = Y_{T-2} = -y) * 0.2 * 0
= 0.9 * P(S_{T-2} = +s, Y_0 = ... = Y_{T-2} = -y)   (because the second term has a 0 factor)
= (0.9)^2 * P(S_{T-3} = +s, Y_0 = ... = Y_{T-3} = -y)
= ...
= (0.9)^{T-1} * P(Y_0 = -y | S_0 = +s) * P(S_0 = +s)
= (0.9)^{T-1} * 0.9 * p_0
= (0.9)^T * p_0
Similarly, P(S_0 = -s, Y_0 = Y_1 = ... = Y_{T-1} = -y) = (0.2)^T (1 - p_0).
Therefore, P(S_0 = +s | Y_0 = Y_1 = ... = Y_{T-1} = -y)
= P(S_0 = +s, Y_0 = ... = Y_{T-1} = -y) / [P(S_0 = +s, Y_0 = ... = Y_{T-1} = -y) + P(S_0 = -s, Y_0 = ... = Y_{T-1} = -y)]
= (0.9)^T p_0 / [(0.9)^T p_0 + (0.2)^T (1 - p_0)]
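A one-line sketch of the closed-form posterior derived above (p_0 and T are the inputs):

    def p_satisfying_given_quiet(p0, T):
        # P(S_0 = +s | Y_0 = ... = Y_{T-1} = -y) from the derivation above
        num = (0.9 ** T) * p0
        return num / (num + (0.2 ** T) * (1 - p0))

    print(p_satisfying_given_quiet(p0=0.1, T=5))   # approaches 1 quickly as T grows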
(iii) [1 pt] In which of the following ways does Charlie's suggestion improve on Bob's algorithm?
□ When the assignment is likely to violate a lot of constraints, Charlie's suggestion helps reduce the number of violated constraints better than Bob's.
■ When the assignment is likely to satisfy all constraints, Charlie's suggestion helps retain that state better than Bob's.
# None of the above
Because P(-y | -s) = 0.2 is not negligible, and we preserve the assignment when observing -y, Charlie's algorithm actually makes it harder to get out of -s than Bob's does.
(iv) [3 pts] Charlie’s algorithm is likely to work well to find a satisfying assignment (if one exists) in which of the
following scenarios?
■ When the constraint graph is sparse
□ When the constraint graph is dense
■ When 푟++ is close to 1
□ When 푟++ is close to 0
□ When 푟−− is close to 1
■ When r_{--} is close to 0
# None of the above
Since this algorithm is largely random and stops swapping once yelling is no longer observed, it relies heavily on 2 aspects:
1. When we have a non-satisfying assignment (-s), the algorithm has a high probability of jumping out of it (and thus a low probability of returning to another non-satisfying assignment (-s)), indicated by a low r_{--}.
2. When we have a satisfying assignment (+s), the algorithm should not shortly fall into a non-satisfying assignment (-s) as a result of accidentally observing noise (P(+y | +s) = 0.1 > 0). Therefore, having a high retention rate for satisfying assignments is also important, indicated by a high r_{++}.
A sparse constraint graph tends to have the two properties above, while a dense graph tends to have the opposite of the two properties.
(d) [1 pt] The approach used in this question (especially in Charlie’s algorithm) is the most similar to which of the following
concepts?
# AC-3 Algorithm   ● Local Search   # Forward Checking   # Likelihood Weighting
Q5. [15 pts] Snail Bayes
Celebrating the near-end of the semester, the CS188 TAs have gathered around the staff aquarium to check up on the snails and
their search for love. To our excitement, two snails decided to go on a date! We don’t know who the snails are, but we spent
enough time around the terrarium to know that the first one (S_1) is either Alex (a) or Bubbles (b), and the second one (S_2) is either Cuddles (c) or Scorpblorg (s). On the date, the snails will eat some dinner (D), which can be a beautiful flower (+d) or a
poisonous mushroom (-d), and they will walk (W) around wonderful rocks (+w) or some treacherous puddle (-w). The snails
are in the quest for love (L), which, depending on how the date goes, they can find (+l) or not (-l).
(a) [1 pt] What is the probability of the outcome (S_1 = a, S_2 = c, D = -d, W = +w, L = +l), the probability that Cuddles and Alex are on a date, where they share a poisonous mushroom, walk around the wonderful rocks, and find love?
# 0.5 * 0.4 * 0.7 * 0.5 * 0.4   # 0.4 * 0.6 * 0.7 * 0.5 * 0.4   ● 0.6 * 0.6 * 0.7 * 0.5 * 0.4   # None of the above
P(a, c, -d, +w, +l) = P(+l | a, c) * P(a | -d, +w) * P(c | -d, +w) * P(-d) * P(+w) = 0.6 * 0.6 * 0.7 * 0.5 * 0.4 = 0.0504
(b) [2 pts] Which of the following independence statements are guaranteed to be true by the Snail Bayes’ Net graph structure?
□ 푆1 ⫫ 푆2 ∣ 퐿
■ 퐷 ⫫ 푊
□ 퐷 ⫫ 푊 ∣ 퐿
■ 푆1 ⫫ 푆2 ∣ 퐷,푊
■ 퐿 ⫫ 푊 ∣ 푆1, 푆2
□ 퐷 ⫫ 푊 ∣ 푆1, 푆2
# None of the above
We can use D-Separation Algorithm to check for independence guarantees
The date is about to start and people are making guesses for what’s going to happen. One TA notices how adorable it would be
if the snails were Bubbles and Cuddles.
(c) If Bubbles and Cuddles are on the date, we want to compute the probability of them eating a beautiful flower and walking
around the wonderful rocks.
(i) [1 pt] What is the equivalent expression for this probability?
# P(b, c, +d, +w)   # P(b, c | +d, +w)   ● P(+d, +w | b, c)   # P(+d, +w)
(ii) [1 pt] What minimal set of probability tables will we use in calculating this probability?
■ P(D)   ■ P(W)   ■ P(S_1 | D, W)   ■ P(S_2 | D, W)   □ P(L | S_1, S_2)   # None of the above
We need all the conditional probability tables containing D, W, S_1, S_2 on the unconditioned side, which are the first four. The last table is left out because L is not part of the query and none of D, W, S_1, S_2 is conditioned on L.
The snails are starving, so the date begins with a delightful dinner. The snails start sharing a mushroom as their dish of choice.
(d) Given their choice of dinner, what is 푃 (푆1 ∣ −푑), the belief over which snail is S1? Please answer in decimal numbers.
You should not need a calculator.
(i) [1 pt] P(S_1 = a | D = -d) = 0.36
P(S_1 = a | -d) = Σ_w P(a | -d, w) * P(w) = 0.6 * 0.4 + 0.2 * 0.6 = 0.36.
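As a sketch, the same marginalization in Python, using the CPT entries quoted above:

    # P(S1 = a | -d) = sum_w P(a | -d, w) * P(w)
    P_w = {'+w': 0.4, '-w': 0.6}
    P_a_given_negd_w = {'+w': 0.6, '-w': 0.2}   # P(S1 = a | D = -d, W = w)
    p = sum(P_a_given_negd_w[w] * P_w[w] for w in P_w)
    print(p)                                    # 0.36 (up to floating-point rounding)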
A late TA rushes in and is thrilled to see us around the aquarium. You see, he spent many hours befriending the little ones and
immediately recognizes the second snail as Cuddles! Apparently, Cuddles is the ooziest and is easiest to identify.
(e) [1 pt] What is P(S_1 = b | S_2 = c, D = -d), the probability that the first snail is Bubbles given that the second is Cuddles and they ate a mushroom?
# [Σ_w P(b | -d, w) * P(w) * P(c | -d, w) * P(w)] / P(c | -d)
# [Σ_w P(b | -d, w) * P(w) * P(c | -d, w) * P(w)] / [Σ_w P(c | -d, w) * P(-d)]
# [Σ_w P(b | -d, w) * P(-d) * P(c | -d, w) * P(w)] / P(c)
● [Σ_w P(b | -d, w) * P(c | -d, w) * P(w)] / P(c | -d)
# None of the above
P(b | c, -d) = P(b, c | -d) / P(c | -d) = [Σ_w P(b, c | -d, w) * P(w)] / P(c | -d) = [Σ_w P(b | -d, w) * P(c | -d, w) * P(w)] / P(c | -d) = (0.4 * 0.7 * 0.4 + 0.8 * 0.8 * 0.6) / 0.76 = 0.6526,
where we used S_1 ⫫ S_2 | D, W from part (b) for the numerator.
(f) [2 pts] What is P(L = +l | S_2 = c, D = -d), the probability that the snails will find love given that the second snail is Cuddles and they ate a mushroom?
□ [Σ_{s1} P(+l | c, s1) * P(-d | c) * P(s1 | -d, c)] / P(c | -d)
□ [Σ_{s1} P(+l | c, s1) * P(c | -d) * P(s1 | -d, c)] / P(-d | c)
■ Σ_{s1} P(+l | c, s1) * P(s1 | -d, c)
■ [Σ_{s1} P(+l, -d | c, s1) * P(s1 | c)] / P(-d | c)
# None of the above
P(+l | c, -d) = P(+l, -d | c) / P(-d | c) = [Σ_{s1} P(+l, -d | c, s1) * P(s1 | c)] / P(-d | c) = [Σ_{s1} P(+l | c, s1) * P(-d | c, s1) * P(s1 | c)] / P(-d | c) = [Σ_{s1} P(+l | c, s1) * P(-d | c) * P(s1 | -d, c)] / P(-d | c) = 0.6 * 0.3474 + 0.5 * 0.6526 = 0.53474,
where we used L ⫫ W | S_1, S_2 and the answer from part (e).
The snails found love! We are now trying to find the probability that the other snail was Bubbles given all evidence so far,
푃 (푏 ∣ 푐,−푑,+푙). The TAs are tired of multiplying probabilities, so they instead try another way. The late TA actually wrote
down memories of previous dates he has witnessed in a notebook. He can sample some of his memories from the notebook and
help us learn probabilities.
(g) [1 pt] If the TA uses prior sampling, what is the probability of obtaining the sample [퐷 = −푑,푊 = +푤,푆1 = 푏, 푆2 =
푐, 퐿 = −푙]?
# 0.5*0.4*0.6*0.3*0.2   # 0.4*0.4*0.9*0.2*0.8
# 0.6*0.1*0.7*0.1*0.2   ● 0.5*0.4*0.4*0.7*0.5
# 0.25*0.24*0.9*0.1*0.5 # 0.4*0.5*0.24*0.21*0.25
# None of the above
Prior sampling samples without taking the evidence into account, so the probability of the sample is 푃 (−푑)푃 (+푤)푃 (푏 ∣
−푑,+푤)푃 (푐 ∣ −푑,+푤)푃 (−푙 ∣ 푏, 푐).
(h) [1 pt] If the TA samples [퐷 = −푑,푊 = +푤,푆1 = 푏, 푆2 = 푐, 퐿 = −푙], would rejection sampling discard the memory?
● Yes   # No
Rejection sampling discards samples inconsistent with the evidence. In this case, −푙 is inconsistent with the fact that our
snails did in fact find love.
(i) [1 pt] Assuming that the TA actually sampled using likelihood weighting and obtained [D = -d, W = +w, S_1 = b, S_2 = c, L = +l], what is the weight of this sample?
# 0.5*0.5*0.5   ● 0.5*0.7*0.5
# 0.4*0.9*0.6 # 0.5*0.3*0.5
# 0.4*0.24*0.6 # 0.6*0.3*0.6
# None of the above
The weight of a sample in likelihood weighting is the product of the probabilities of the evidence variables given their parents: P(-d) P(c | -d, +w) P(+l | b, c).
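A minimal sketch of that weight computation, using the CPT values quoted in this solution:

    # Weight of the sample [-d, +w, S1=b, S2=c, +l] under likelihood weighting:
    # the product of P(evidence variable | its parents) over the evidence variables.
    evidence_factors = {
        'P(D = -d)':          0.5,
        'P(S2 = c | -d, +w)': 0.7,
        'P(L = +l | b, c)':   0.5,
    }
    weight = 1.0
    for p in evidence_factors.values():
        weight *= p
    print(weight)   # 0.175 = 0.5 * 0.7 * 0.5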
(j) [1 pt] Sampling using likelihood weighting will systematically underestimate the probability of a variable conditioned on one of its ancestors.
# Yes, because likelihood weighting does not sample all the variables, and thus creates a bias
# Yes, but not for the reason above
● No, because likelihood weighting is unbiased
# No, but not for the reason above
Likelihood weighting is an unbiased sampling procedure.
(k) [2 pts] To estimate 푃 (푏 ∣ 푐,−푑,+푙), the TA samples five memories in a row:
[퐷 = −푑,푊 = +푤,푆1 = 푏, 푆2 = 푐, 퐿 = +푙],
[퐷 = −푑,푊 = +푤,푆1 = 푏, 푆2 = 푐, 퐿 = +푙],
[퐷 = −푑,푊 = +푤,푆1 = 푎, 푆2 = 푐, 퐿 = +푙],
[퐷 = −푑,푊 = +푤,푆1 = 푎, 푆2 = 푐, 퐿 = +푙],
[D = -d, W = +w, S_1 = b, S_2 = c, L = +l].
Could these memories have been generated using Gibbs sampling?
# Yes, because all evidence variables are consistent with their values in the query P(b | c, -d, +l).
● Yes, but the reason above is incorrect because there exist other sample sequences that fit the condition in the previous choice but cannot be generated by Gibbs sampling.
# No, because the sequence of samples only differs in S_1, the query variable. The value of W, a variable that is not part of the query, never changes throughout the sequence.
# No, but the reason above is incorrect because there exist other sample sequences that fit the condition in the previous choice but cannot be generated by Gibbs sampling.
Since each pair of neighboring samples differs in at most one variable value AND the evidence variables do not change, this sequence could be generated via Gibbs sampling, which eliminates choices 3 and 4. [1->2: no change, 2->3: only changes S_1, 3->4: no change, 4->5: only changes S_1]
But choice one, "all evidence variables are consistent with their values in the query P(b | c, -d, +l)", is insufficient, because the following sequence fits that condition yet cannot be generated by Gibbs sampling:
[퐷 = −푑,푊 = +푤,푆1 = 푏, 푆2 = 푐, 퐿 = +푙],
[퐷 = −푑,푊 = −푤,푆1 = 푎, 푆2 = 푐, 퐿 = +푙],
[퐷 = −푑,푊 = −푤,푆1 = 푏, 푆2 = 푐, 퐿 = +푙].
This is because the transition from the first sample to the second changes both W and S_1.
Just a quick note about choice 3: it is totally possible that when it is W's turn to be re-sampled from P(W | everything else), the result just so happens to remain +w every time throughout the 5 consecutive samples.
Q6. [11 pts] We Are Getting Close...
The CS 188 TAs have built an autonomous vehicle, and it’s finally on the street! Approaching a crossroad, our vehicle must
avoid bumping into pedestrians. However, how close are we?
X is the signal received from sensors on our vehicle. We have an estimation model E, which estimates the current distance of any
object in our view. Our vehicle also needs a model to detect objects and label their classes as one of {pedestrian, stop sign, road,
other}. The TAs trained a detection model D that does the above and with a simple classification, outputs one of {no pedestrian,
pedestrian on the road, pedestrian beside the stop sign}. Our vehicle has a control operator C, which determines the velocity by
changing the acceleration.
(a) [5 pts] For the above Dynamic Bayes Net, complete the equations for performing updates. (Hint: think about the predic-
tion update and observation update equations in the forward algorithm for HMMs.)
Time elapse: (i) = (ii) (iii) (iv) P(x_{t-1} | e_{0:t-1}, d_{0:t-1}, c_{0:t-1})
(i)   # P(x_t)   ● P(x_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1})   # P(e_t, d_t, c_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1})
(ii)   # P(c_{0:t-1})   # P(x_{0:t-1}, c_{0:t-1})   # P(e_{0:t-1}, d_{0:t-1}, c_{0:t-1})   # P(e_{0:t}, d_{0:t}, c_{0:t})   ● 1
(iii)   ● Σ_{x_{t-1}}   # Σ_{x_t}   # max_{x_{t-1}}   # max_{x_t}   # 1
(iv)   # P(x_{t-1} | x_{t-2})   # P(x_{t-1}, x_{t-2})   # P(x_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1})   # P(x_t | x_{t-1})   # P(x_t, x_{t-1})   # P(x_t, e_{0:t-1}, d_{0:t-1}, c_{0:t-1})   ● P(x_t | x_{t-1}, c_{t-1})   # P(x_t, x_{t-1}, c_{t-1})   # 1
Recall the prediction update of the forward algorithm: P(x_t | o_{0:t-1}) = Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1} | o_{0:t-1}), where o is the observation. Here it is similar, except that there are several observations at each time step, so o_t corresponds to e_t, d_t, c_t for each t, and X depends on the C value of the previous time step, so we need P(x_t | x_{t-1}, c_{t-1}) instead of P(x_t | x_{t-1}). Also note that X_t is independent of D_{t-1}, E_{t-1} given C_{t-1}, X_{t-1}.
Update to incorporate new evidence at time t:
P(x_t | e_{0:t}, d_{0:t}, c_{0:t}) = (v) (vi) (vii) [your choice for (i)]
(v)   # (P(c_t | c_{0:t-1}))^{-1}   # (P(e_t | e_{0:t-1}) P(d_t | d_{0:t-1}) P(c_t | c_{0:t-1}))^{-1}   ● (P(e_t, d_t, c_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1}))^{-1}   # (P(e_{0:t-1} | e_t) P(d_{0:t-1} | d_t) P(c_{0:t-1} | c_t))^{-1}   # (P(e_{0:t-1}, d_{0:t-1}, c_{0:t-1} | e_t, d_t, c_t))^{-1}   # 1
(vi)   # Σ_{x_{t-1}}   # Σ_{x_t}   # Σ_{x_{t-1}, x_t}   # max_{x_{t-1}}   # max_{x_t}   ● 1
(vii)   □ P(x_t | e_t, d_t, c_t)   □ P(x_t, e_t, d_t, c_t)   □ P(x_t | e_t, d_t, c_t, c_{t-1})   □ P(x_t, e_t, d_t, c_t, c_{t-1})   ■ P(e_t, d_t | x_t) P(c_t | e_t, d_t, c_{t-1})   □ P(e_t, d_t, c_t | x_t)   # 1
Recall the observation update of the forward algorithm: P(x_t | o_{0:t}) ∝ P(x_t, o_t | o_{0:t-1}) = P(o_t | x_t) P(x_t | o_{0:t-1}). Here the observations o_t correspond to e_t, d_t, c_t for each t. Applying the chain rule, we have
P(x_t | e_{0:t}, d_{0:t}, c_{0:t}) ∝ P(x_t, e_t, d_t, c_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1}) = P(e_t, d_t, c_t | x_t, c_{t-1}) P(x_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1})
= P(e_t, d_t | x_t) P(c_t | e_t, d_t, c_{t-1}) P(x_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1}).
Note that in P(e_t, d_t, c_t | x_t, c_{t-1}) we cannot omit c_{t-1}, due to the arrow between c_t and c_{t-1}.
To calculate the normalizing constant, use Bayes' rule: P(x_t | e_{0:t}, d_{0:t}, c_{0:t}) = P(x_t, e_t, d_t, c_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1}) / P(e_t, d_t, c_t | e_{0:t-1}, d_{0:t-1}, c_{0:t-1}).
(viii) Suppose we want to do the above updates in one step and use normalization to reduce computation. Select all the
terms that are not explicitly calculated in this implementation.
DO NOT include the choices if their values are 1.
□ (ii) □ (iii) □ (iv) ■ (v) □ (vi) □ (vii) # None of the above
(v) is a constant, so we don’t calculate it during implementation and simply do a normalization instead. Everything else
is necessary.
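As an illustrative aside, a minimal Python sketch of the two updates above for a generic discrete X; the CPT dictionaries (trans, emit_ed, emit_c) are hypothetical placeholders with the factorization used in this part:

    def time_elapse(belief, trans, c_prev):
        # Prediction: P(x_t | evidence so far) = sum_{x_prev} P(x_t | x_prev, c_prev) * belief[x_prev]
        new_belief = {x: 0.0 for x in belief}
        for x_prev, b in belief.items():
            for x, p in trans[(x_prev, c_prev)].items():   # trans[(x_{t-1}, c_{t-1})][x_t]
                new_belief[x] += p * b
        return new_belief

    def observe(belief, emit_ed, emit_c, e_t, d_t, c_t, c_prev):
        # Weight by P(e_t, d_t | x_t) * P(c_t | e_t, d_t, c_{t-1}), then normalize.
        unnorm = {x: b * emit_ed[x][(e_t, d_t)] * emit_c[(e_t, d_t, c_prev)][c_t]
                  for x, b in belief.items()}
        z = sum(unnorm.values())                           # the normalizer plays the role of (v)
        return {x: v / z for x, v in unnorm.items()}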
(b) Suppose X outputs 1024 × 1024 greyscale images and our vehicle stays stationary. As before, E includes precise estima-
tion of the distance between our vehicle and the pedestrian evaluated from outputs of X. Unfortunately, a power outage
happened, and before the power is restored, E will not be available for our vehicle. But we still have the detection model
D, which outputs one of {no pedestrian, pedestrian on the road, pedestrian beside the stop sign} for each state.
(i) [1 pt] During the power outage, it is best to
# do particle filtering because the particles are easier to track for D than for both D and E
● do particle filtering because of memory constraints
# do particle filtering, but not for the reasons above
# do exact inference because it saves computation
# do exact inference, but not for the reason above
E is unavailable and C does not change its value since our vehicle stays stationary, so we only consider X and D.
Although D has only 3 possible values, X is huge and it is impossible to store the belief distribution.
(ii) [1 pt] The power outage was longer than expected. As the sensor outputs of X have degraded to 2×2 binary images,
it is best to
# do particle filtering because the particles are easier to track for D than for both D and E
# do particle filtering because of memory constraints
# do particle filtering, but not for the reasons above
# do exact inference because it saves computation
● do exact inference, but not for the reason above
In this case we do not have the "X is huge" problem in (i), and we can do exact inference, which is always more
accurate than particle filtering and thus more favorable in this setting.
(iii) [1 pt] After power is restored and we have E, it is reasonable to
● do particle filtering because of memory constraints
# do particle filtering, but not for the reason above
# do exact inference because E gives more valuable information than D
# do exact inference because it's impractical to do particle filtering for E
# do exact inference, but not for the reasons above
The belief distribution is too big to store in memory.
(c) Now we formulate the Dynamic Bayes Net for this question into a non-deterministic two-player game (analogous to an MDP in the single-player setting). Each state S = (X, E, D).
There are 2 agents in the game: our vehicle (with action set A), and a pedestrian (with action set B). The vehicle and the
pedestrian take turns to perform their actions.
The TAs implemented 3 modes for the autonomous vehicle to act in the same space with the kind pedestrian, the confused
pedestrian, and the naughty pedestrian. In each round of testing, a TA will be the pedestrian, and one of the modes will
be tested. The vehicle and the pedestrian are both in the corresponding mode.
Below, Q*_v is the Q-function for the autonomous vehicle. For each subquestion, given the standard notation for an MDP, select the expression f_n that would complete the blank part of the Q-Value Iteration under the specified formulation.
Q*_v(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ _____ ]
f_1 = Σ_{b∈B} Σ_{s''} ( T(s', b, s'') [ R(s', b, s'') + γ max_{a'∈A} Q*_v(s'', a') ] )
f_2 = max_{b∈B} Σ_{s''} ( T(s', b, s'') [ R(s', b, s'') + γ max_{a'∈A} Q*_v(s'', a') ] )
f_3 = min_{b∈B} Σ_{s''} ( T(s', b, s'') [ R(s', b, s'') + γ max_{a'∈A} Q*_v(s'', a') ] )
f_4 = Σ_{b∈B} Σ_{s''} ( T(s', b, s'') [ R(s', b, s'') + γ (1/|B|) max_{a'∈A} Q*_v(s'', a') ] )
f_5 = max_{a'∈A} Q*_v(s', a')
f_6 = min_{a'∈A} Q*_v(s', a')
f_7 = (1/|A|) Σ_{a'∈A} Q*_v(s', a')
(i) [1 pt] The kind pedestrian that acts friendly, maximizing the vehicle's utility.
□ f_1   ■ f_2   □ f_3   □ f_4   □ f_5   □ f_6   □ f_7   # None of the above
(ii) [1 pt] The confused pedestrian that acts randomly.
□ f_1   □ f_2   □ f_3   □ f_4   ■ f_5   □ f_6   □ f_7   # None of the above
(iii) [1 pt] The naughty pedestrian that performs adversarial actions, minimizing the vehicle's utility.
□ f_1   □ f_2   ■ f_3   □ f_4   □ f_5   □ f_6   □ f_7   # None of the above
Recall the Q-value iteration formula: Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_k(s', a')]. Here it is similar, but in addition to the vehicle there is also a pedestrian taking actions from a different set, so we need to do something similar for the pedestrian, as in f_1, f_2, f_3, f_4. That is, instead of using the maximum of Q_v right away, we substitute in the Q-value iteration formula for the pedestrian with respect to Q_v. The value of this formula is maximized (as in f_2) in the case of the kind pedestrian, minimized (as in f_3) in the case of the naughty pedestrian, and averaged in the case of the confused pedestrian, which would be (1/|B|) Σ_{b∈B} Σ_{s''} (T(s', b, s'') [R(s', b, s'') + γ max_{a'∈A} Q*_v(s'', a')]). Hence f_1 and f_4 are incorrect. Since the confused pedestrian acts completely randomly, we can include the pedestrian in the transition dynamics and just use regular Q-value iteration, which is f_5.
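A sketch of one sweep of Q-value iteration under the "naughty pedestrian" model (f_3); the nested dictionaries T and R are hypothetical placeholders indexed by the state and by the action of whichever agent is moving:

    def backup_naughty(Q, T, R, A, B, gamma):
        # One update of Q*_v(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma * f3(s')]
        new_Q = {}
        for s in T:
            for a in A:
                total = 0.0
                for s1, p in T[s][a].items():
                    # f3: the pedestrian minimizes over b; the vehicle then maximizes over a'
                    f3 = min(
                        sum(p2 * (R[s1][b][s2] + gamma * max(Q[(s2, a2)] for a2 in A))
                            for s2, p2 in T[s1][b].items())
                        for b in B)
                    total += p * (R[s][a][s1] + gamma * f3)
                new_Q[(s, a)] = total
        return new_Q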
Q7. [10 pts] Generating and classifying text
In this question, we are interested in modelling sentences. Assume each sentence is represented using a bag-of-words (i.e. a
vector which contains counts for each word in a sentence). We are interested in classifying whether a sentence (represented as
a bag of words X) is a positive-sounding (class C = 1) or negative-sounding (C = -1) sentence. X_1, X_2, ..., X_p represent the entries for individual words in the bag-of-words representation X.
(a) In this question, we are interested in the basics of Naive Bayes and logistic regression.
(i) [1 pt] Which of these are modelled explicitly in Naive Bayes? Select all that apply.
□ 푃 (퐶|푋)
■ 푃 (푋|퐶)
■ P(C)
# None of the above
(ii) [1 pt] Which of these decompositions reflect the naive assumption in Naive Bayes?
# 푃 (퐶) = 푃 (퐶1) ⋅ 푃 (퐶2)
● 푃 (푋|퐶) = 푃 (푋1|퐶) ⋅ 푃 (푋2|퐶) ⋅ 푃 (푋3|퐶) ⋅ 푃 (푋4|퐶) ⋅ …
# 푃 (푋) = 푃 (푋1) ⋅ 푃 (푋2|푋1) ⋅ 푃 (푋3|푋2, 푋1) ⋅ …
# 푃 (푋1) = (푃 (푋1) + 1) ∕ (∑푥푖 푃 (푋푖))
# None of the above
The naive assumption is that each feature is independent of all other features given the class label. That is, the 푋푖 are conditionally independent given 퐶, which is reflected in the second choice.
(iii) [1 pt] Which of these are modelled directly in Logistic Regression? Select all that apply.
■ 푃 (퐶|푋)
□ 푃 (푋|퐶)
□ 푃 (퐶)
□ 푃 (푋)
# None of the above
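To make part (a) concrete, here is a small sketch assuming scikit-learn and a made-up three-word vocabulary: MultinomialNB exposes its estimates of 푃 (퐶) and 푃 (푋|퐶), while LogisticRegression only ever produces 푃 (퐶|푋).

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Made-up bag-of-words counts for four sentences and their labels (+1 positive, -1 negative).
X = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 1]])
y = np.array([1, -1, 1, -1])

nb = MultinomialNB().fit(X, y)
print(np.exp(nb.class_log_prior_))    # Naive Bayes' estimate of P(C)
print(np.exp(nb.feature_log_prob_))   # Naive Bayes' estimate of P(X_i | C) for each class

lr = LogisticRegression().fit(X, y)
print(lr.predict_proba(X))            # logistic regression models P(C | X) directly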
(b) In this part, we will consider different scenarios about the data we have.
(i) [1 pt] Assume we only have two points in our training dataset which are linearly separable using the feature 푋1. Which of the following methods will be able to achieve zero training error? Select all that apply.
■ Naive Bayes
■ Bayes classifier (i.e. naive bayes with no naive assumption)
■ Logistic Regression
■ Perceptron
■ A very large neural network with many nonlinearities
# None of the above
(ii) [1 pt] Assume we now have a very large training dataset which is not linearly separable using any individual variable,
but is separable given 푋1 ⋅ 푋2. Which of the following methods will be able to achieve zero training error (without augmenting the data set with additional features)? Select all that apply.
□ Naive Bayes
■ Bayes classifier (i.e. naive bayes with no naive assumption)
□ Logistic Regression
□ Perceptron (with no softmax on the output)
■ A very large neural network with many nonlinearities
# None of the above
The Bayes classifier is able to model the joint distribution 푃 (푋1, 푋2) (it drops the naive independence assumption), which lets us classify correctly. A very large neural network is able to model the interaction between 푋1 and 푋2.
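A tiny made-up illustration of (b)(ii), using ±1 features rather than counts just to keep the picture minimal: the four points below form the classic XOR layout, which no linear function of 푋1 and 푋2 separates, but the single product feature 푋1 ⋅ 푋2 does.

import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])   # XOR-style layout
y = np.array([1, 1, -1, -1])                         # class is the sign of X1 * X2

prod = X[:, 0] * X[:, 1]
print(all(np.sign(prod) == y))   # True: thresholding X1 * X2 at 0 classifies perfectly,
                                 # which is why capturing the interaction (or augmenting
                                 # the features) succeeds where a linear model on
                                 # [X1, X2] alone fails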
(iii) [1 pt] Now assume that the true dataset is linearly separable but our training set has a single mis-labeled data point.
Which of the following are true? Select all that apply.
□ Logistic regression may output probabilities greater than 1
■ Perceptron (with no softmax on the output) may not converge
# None of the above
Perceptron updates may continue to oscillate due to the misclassified outlier.
(iv) [1 pt] Assume we initially have a model trained on a very large dataset with the same number of positive and negative
examples. Then, we duplicate all the positive examples in our training set and re-add them (resulting in a training
set 1.5 times the size of the original training set). We then train a second model on the new training set. Which of
the following are true? Select all that apply.
■ In logistic regression, 푃 (퐶 = 1|푋) will generally be higher for the retrained model than the original model
□ In naive bayes, 푃 (푋 = 푥|퐶 = 1) will be higher for some 푥 for the retrained model than the original model
■ In naive bayes, 푃 (퐶 = 1) will generally be higher for the retrained model than the original model
□ In naive bayes, 푃 (푋 = 1) will generally be higher for the retrained model than the original model
# None of the above
Duplicating all the positive samples does not change the distribution of X given that the example is positive.
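A quick worked check of (b)(iv) with made-up counts: duplicating the positive examples raises the class prior but leaves the class-conditional word distribution untouched.

n_pos, n_neg = 1000, 1000
print(n_pos / (n_pos + n_neg))             # P(C=1) before duplication: 0.5
print(2 * n_pos / (2 * n_pos + n_neg))     # P(C=1) after duplication: about 0.667

# Word counts inside the positive class double along with the class count, so the
# maximum-likelihood estimate count(x, C=1) / count(C=1) of P(X=x | C=1) is unchanged.
count_x_pos = 300
print(count_x_pos / n_pos, 2 * count_x_pos / (2 * n_pos))   # 0.3 0.3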
(c) We are now interested in generating text (still in the form of bag-of-words) from each of our classes.
(i) [1 pt] If we have already fit naive bayes for text classification, which distribution can we use to generate text for the
positive class?
# 푃 (퐶)
# 푃 (푋)
● 푃 (푋|퐶)
# 푃 (퐶|푋)
# None of the above
(ii) [1 pt] Assuming we have an infinite amount of data, which of the following modeling assumptions is generally able to
more accurately model the true distribution 푃 (푋) (and thus to generate more realistic bag-of-words sentences)?
# 푃 (푋) = 푃 (푋1) ⋅ 푃 (푋2) ⋅ 푃 (푋3) ⋅ 푃 (푋4) ⋅ …
● 푃 (푋) = 푃 (푋1) ⋅ 푃 (푋2|푋1) ⋅ 푃 (푋3|푋2, 푋1) ⋅ …
# They are the same
(iii) [1 pt] Which of the following will best help us generate text with words in the correct order?
# Using Laplace smoothing
# Predicting 푃 (푋|퐶) using a neural network
● Changing the representation of the text (to use something other than bag-of-words)
# None of the above
Regardless of the model, bag of words cannot represent the order of words.
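For (c)(i), generating a positive "sentence" amounts to sampling word counts from the fitted 푃 (푋|퐶 = 1). A minimal sketch with a made-up vocabulary and made-up fitted probabilities:

import numpy as np

vocab = ["great", "fun", "boring", "awful"]
p_word_given_pos = np.array([0.45, 0.35, 0.15, 0.05])   # hypothetical fitted P(X_i | C = 1)

rng = np.random.default_rng(0)
counts = rng.multinomial(6, p_word_given_pos)           # a 6-word bag-of-words "sentence"
print(dict(zip(vocab, counts)))

As (c)(iii) notes, however good these probabilities are, the sample is only a bag of words; recovering word order requires a different representation.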
Q8. [16 pts] Deep “Blackjack”
To celebrate the end of the semester, you visit Las Vegas and decide to play a good, old fashioned game of “Blackjack”!
Recall that the game has states 0, 1, …, 8, corresponding to dollar amounts, and a 퐷표푛푒 state where the game ends. The player
starts with $2, i.e. at state 2. The player has two actions: Stop (푎 = 0) and Roll (푎 = 1), and is forced to take the Stop action at
states 0, 1, and 8.
When the player takes the Stop action (푎 = 0), they transition to the 퐷표푛푒 state and receive reward equal to the amount of dollars
of the state they transitioned from: e.g. taking the stop action at state 3 gives the player $3. The game ends when the player
transitions to 퐷표푛푒.
The Roll action (푎 = 1) is available from states 2-7. The player rolls a biased 6-sided die. If the player Rolls from state s and
the die lands on outcome 표, the player transitions to state 푠 + 표 − 2, as long as 푠 + 표 − 2 ≤ 8 (푠 is the amount of dollars of the
current state, 표 is the amount rolled, and the negative 2 is the price to roll). If 푠 + 표 − 2 > 8, the player busts, i.e. transitions to
Done and does NOT receive reward.
As the bias of the die is unknown, you decided to perform some good, old-fashioned reinforcement learning (RL) to solve the
game. However, unlike in the midterm, you have decided to flex and solve the game using approximate Q-learning. Not only
that, you decided not to design any features - the features for the Q-value at (푠, 푎) will simply be the vector [푠 푎], where 푠 is the
state and 푎 is the action.
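For reference, here is a minimal sketch of the game's transition model described above, assuming a generic die distribution p_die mapping outcomes 1-6 to probabilities (the true bias is exactly what the Q-learner never gets to see).

def successors(s, a, p_die):
    """Return (next_state, probability, reward) triples; "done" is the terminal state."""
    if a == 0:                          # Stop: cash out s dollars and end the game
        return [("done", 1.0, s)]
    out = []
    for o, p in p_die.items():          # Roll: pay $2, gain the die outcome o
        nxt = s + o - 2
        if nxt > 8:
            out.append(("done", p, 0))  # bust: transition to Done with no reward
        else:
            out.append((nxt, p, 0))
    return out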
(a) First, we will investigate how your choice of features impacts whether or not you can learn the optimal policy. Suppose
the unique optimal policy in the MDP is the following:
State 2 3 4 5 6 7
휋∗(푠) Roll Roll Roll Stop Stop Stop
For each of the cases below, select “Possible with large neural net” if the policy can be expressed by using a large neural
net to represent the Q-function using the features specified as input. (That is, the greedy policy with respect to some
Q-function representable with a large neural network is the optimal policy: 푄(푠, 휋∗(푠)) > 푄(푠, 푎) for all states 푠 and
actions 푎 ≠ 휋∗(푠).) Select “Possible with weighted sum” if the policy can be expressed by using a weighted linear sum to
represent the Q-function. Select “Not Possible” if expressing the policy with given features is impossible no matter the
function.
(i) [1 pt] Suppose we decide to use the state 푠 and action 푎 as the features for 푄(푠, 푎).
■ Possible with large neural network □ Possible with linear weighted sum of features # Not possible
A sufficiently large neural network could represent the true optimal 푄-function using this feature representation. The
optimal 푄-function satisfies the desired property (there are no ties as the optimal policy is unique). Alternatively, a
sufficiently large neural network could represent a function that is 1 for the optimal action in each state, and 0 otherwise,
which also suffices.
No linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be 푤0푠 +
푤1푎 + 푤2. We need 푄(4, 1) > 푄(4, 0) and 푄(5, 0) > 푄(5, 1). Plugging in the expression for 푄, the former inequality requires that 푤1 > 0, and the latter requires that 푤1 < 0, a contradiction. So we cannot represent the policy with a weighted sum of features.
(ii) [1 pt] Now suppose we decide to use 푠 + 푎 as the feature for 푄(푠, 푎).
■ Possible with large neural network □ Possible with linear weighted sum of features # Not possible
Indeed, it is possible that no neural network can represent the optimal 푄-function with this feature representation: any function of 푠 + 푎 must assign 푄(4, 1) and 푄(5, 0) the same value, while the optimal 푄-values need not be equal there. However, the question is not asking about representing the optimal 푄-function, but instead the optimal policy, which merely requires that 푄(푠, 1) > 푄(푠, 0) for 푠 ≤ 4 and 푄(푠, 0) > 푄(푠, 1) for 푠 ≥ 5. This can be done with this feature representation. For example, the neural network in part (d) can represent the following function (using 푤0 = −5 and 푤1 = −2) that represents the optimal policy (see Figure 1).
Again, no linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be
푤0푠 + 푤0푎 + 푤1. This is a special case of the linear weighted sums in part (i), which we know cannot represent the optimal policy.
Figure 1: A Q-function of 푠 + 푎 that represents the optimal policy.
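A quick check that the function described above (the part (d) architecture with 푤0 = −5 and 푤1 = −2, i.e. 푄(푠, 푎) = (푠 + 푎) − 2 ⋅ 푅푒퐿푈 (푠 + 푎 − 5), which depends only on 푠 + 푎) induces the optimal policy:

def relu(x):
    return max(0, x)

def q(s, a):
    return (s + a) - 2 * relu(s + a - 5)

for s in range(2, 8):
    print(s, "Roll" if q(s, 1) > q(s, 0) else "Stop")
# Prints Roll for s = 2, 3, 4 and Stop for s = 5, 6, 7, matching the optimal policy table.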
(iii) [1 pt] Now suppose we decide to use 푎 as the feature for 푄(푠, 푎).
□ Possible with large neural network □ Possible with linear weighted sum of features ● Not possible
This isn’t possible, regardless of what function you use. With this representation, we will have 푄(4, 1) = 푄(5, 1) and
푄(4, 0) = 푄(5, 0). So we cannot both have 푄(4, 1) > 푄(4, 0) and 푄(5, 0) > 푄(5, 1).
(iv) [1 pt] Now suppose we decide to use sign(푠 − 4.5) ⋅ 푎 as the feature for 푄(푠, 푎), where sign(푥) is −1 if 푥 < 0, 1 if
푥 > 0, and 0 if 푥 = 0.
■ Possible with large neural network ■ Possible with linear weighted sum of features # Not possible
푄(푠, 푎) = −sign(푠 − 4.5) ⋅ 푎 is sufficient to represent the optimal policy, as we have 푄(푠, 0) = 0 for all 푠, 푄(푠, 1) =
1 > 푄(푠, 0) for 푠 ≤ 4, and 푄(푠, 1) = −1 < 푄(푠, 0) for 푠 ≥ 5. This is a linear function of the input, and so can also be
represented using a neural network.
(b) [4 pts] Next, we investigate the effect of different neural network architectures on your ability to learn the optimal policy.
Recall that our features for the Q-value at (푠, 푎) will simply be the vector [푠 푎], where 푠 is the state and 푎 is the action. In
addition, suppose that the unique optimal policy is the following:
State 2 3 4 5 6 7
휋∗(푠) Roll Roll Roll Stop Stop Stop
Which of the following neural network architectures can express Q-values that represent the optimal policy? That is,
the greedy policy with respect to some Q-function representable with the given neural network is the optimal policy:
푄(푠, 휋∗(푠)) > 푄(푠, 푎) for all states 푠 and actions 푎 ≠ 휋∗(푠). Hint: Recall that $ReLU(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}$
□ Neural Network 1:
□ Neural Network 2:
□ Neural Network 3:
□ Neural Network 4:
■ Neural Network 5:
# None of the above.
Recall from the previous question that no linear function of the features [푠 푎] can represent the optimal policy. So network
1, which is linear (as it has no activation function), cannot represent the optimal policy.
Network 2 cannot represent the optimal policy, as it does not take the action as input, so 푄(푠, 0) = 푄(푠, 1) for all states 푠.
Network 3 cannot simultaneously satisfy 푄(4, 0) < 푄(4, 1) and 푄(5, 0) > 푄(5, 1). This is because the rectified linear
unit is a monotonic function: if 푥 ≥ 푦, then 푅푒퐿푈 (푥) ≥ 푅푒퐿푈 (푦). Since we cannot represent the optimal policy using a
linear function of 푠, 푎, we cannot represent it with a ReLU of a linear function of 푠, 푎.
Network 4 is always 0, so it cannot represent the (unique) optimal policy.
Network 5 can represent the optimal policy. For example, 푤0 = −4, 푤1 = −2 represents the optimal policy.
(c) [1 pt] As with the linear approximate q-learning, you decide to minimize the squared error of the Bellman residual. Let
푄퐰(푠, 푎) be the approximate 푄-values of 푠, 푎. After taking action 푎 in state 푠 and transitioning to state 푠′ with reward 푟, you first compute the target: target = 푟 + 훾 max푎′ 푄퐰(푠′, 푎′). Then your loss is:
$\text{loss}(\mathbf{w}) = \frac{1}{2} \left( Q_{\mathbf{w}}(s, a) - \text{target} \right)^2$
You then perform gradient descent to minimize this loss. Note that we will not take the gradient through the target - we
treat it as a fixed value.
Which of the following updates represents one step of gradient descent on the weight parameter 푤푖 with learning rate
훼 ∈ (0, 1) after taking action 푎 in state 푠 and transitioning to state 푠′ with reward 푟? [Hint: which of these is equivalent to
the normal approximate Q-learning update when 푄퐰(푠, 푎) = 퐰 ⋅ 퐟 (푠, 푎)?]
# $w_i = w_i + \alpha \left( Q_{\mathbf{w}}(s, a) - \left( r + \gamma \max_{a'} Q_{\mathbf{w}}(s', a') \right) \right) \frac{\partial Q_{\mathbf{w}}(s, a)}{\partial w_i}$
● $w_i = w_i - \alpha \left( Q_{\mathbf{w}}(s, a) - \left( r + \gamma \max_{a'} Q_{\mathbf{w}}(s', a') \right) \right) \frac{\partial Q_{\mathbf{w}}(s, a)}{\partial w_i}$
# $w_i = w_i + \alpha \left( Q_{\mathbf{w}}(s, a) - \left( r + \gamma \max_{a'} Q_{\mathbf{w}}(s', a') \right) \right) s$
# $w_i = w_i - \alpha \left( Q_{\mathbf{w}}(s, a) - \left( r + \gamma \max_{a'} Q_{\mathbf{w}}(s', a') \right) \right) s$
# None of the above.
Note that the gradient of the loss with respect to the parameter, via the chain rule, is:
$\left( Q_{\mathbf{w}}(s, a) - \left( r + \gamma \max_{a'} Q_{\mathbf{w}}(s', a') \right) \right) \frac{\partial Q_{\mathbf{w}}(s, a)}{\partial w_i}$
The second option performs gradient descent, the first option is gradient ascent, and the other options compute the gradient
incorrectly.
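A minimal sketch of this update, assuming a generic differentiable approximator: q(w, s, a) and grad_q(w, s, a) are hypothetical stand-ins for the network's forward pass and its gradient with respect to the weight vector.

def q_learning_step(w, s, a, r, s_next, actions, q, grad_q, alpha, gamma):
    # The target is treated as a constant: no gradient flows through it.
    target = r + gamma * max(q(w, s_next, ap) for ap in actions)
    residual = q(w, s, a) - target
    grad = grad_q(w, s, a)               # vector of partial derivatives dQ/dw_i
    # Gradient descent on 0.5 * residual**2: w_i <- w_i - alpha * residual * dQ/dw_i
    return [w_i - alpha * residual * g_i for w_i, g_i in zip(w, grad)]

With a linear Q-function, grad_q(w, s, a) is just the feature vector f(s, a), which recovers the familiar approximate Q-learning update from the hint.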
(d) While programming the neural network, you’re getting some bizarre errors. To debug these, you decide to calculate the
gradients by hand and compare them to the result of your code.
Suppose your neural network is the following:
Figure 2: Neural Network 6
That is, 푄퐰(푠, 푎) = 푠 + 푎 + 푤1 푅푒퐿푈 (푤0 + 푠 + 푎).
You are able to recall that $\frac{d}{dx} ReLU(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$.
(i) [1 pt] Suppose 푤0 = −4, and 푤1 = −1. What is 푄퐰(5, 0)?
푄퐰(5, 0) = 4
Plugging in values, we get
푄퐰(5, 0) = 5 − 푅푒퐿푈 (−4 + 5) = 4.
(ii) [2 pts] Suppose 푤0 = −4, and 푤1 = −1. What is the gradient with respect to 푤0, evaluated at 푠 = 5, 푎 = 0?
$\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(5, 0) = -1$
Since the input to the ReLU is positive, the gradient of the ReLU is 1. Applying the chain rule, we get that $\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(5, 0) = w_1 = -1$.
(iii) [2 pts] Suppose 푤0 = −4, and 푤1 = −1. What is the gradient with respect to 푤0, evaluated at 푠 = 3, 푎 = 0?
$\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(3, 0) = 0$
Since the input to the ReLU is negative, the gradient of the ReLU is 0. Applying the chain rule, the gradient with respect to 푤0 has to be 0.
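The hand calculations in (i)-(iii) can be cross-checked in a few lines, using the stated convention that the ReLU derivative is 1 at 0:

def relu(x):
    return max(0, x)

def drelu(x):
    return 1 if x >= 0 else 0

def q_and_grad_w0(w0, w1, s, a):
    z = w0 + s + a
    q = s + a + w1 * relu(z)
    dq_dw0 = w1 * drelu(z)      # chain rule through the ReLU
    return q, dq_dw0

print(q_and_grad_w0(-4, -1, 5, 0))   # (4, -1): matches parts (i) and (ii)
print(q_and_grad_w0(-4, -1, 3, 0))   # (3, 0): the ReLU is inactive, so the gradient is 0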
(e) After picking a feature representation, neural network architecture, and update rule, as well as calculating the gradients,
it's time to turn to the age-old question... will this even work?
(i) [1 pt] Without any other assumptions, is it guaranteed that your approximate 푄-values will converge to the optimal
policy, if each (푠, 푎) pair is observed an infinite number of times?
# Yes
● No
(ii) [1 pt] Without any other assumptions, is it guaranteed that your approximate 푄-values will converge to the optimal
policy, if each (푠, 푎) pair is observed an infinite number of times and there exists some 퐰 such that 푄퐰(푠, 푎) = 푄∗(푠, 푎)?
# Yes
● No
Note that there’s no guarantee that your neural network will converge in this case. For example, the learning rate can be
too large! (As in the RL Blackjack question on the midterm.)