HW 2

1 Problem 1 [20 pts]

An agent is exploring an MDP M = (S, A, R, P, γ) where S = {s1, s2, s3}, A = {a1, a2, a3}, γ = 0.5, and P(s_i | a_i, s) = 1 for any s, for all i. The reward for transitioning into a state s_i is R(s_i) = i.

1. The agent follows the trajectory (s1, a1, 1, s1, a2, 2, s2). Consider a Q-learning agent with ε-greedy exploration. Further assume that a random action never happens to pick the greedy action, and that ties are broken by choosing the a_i with the smallest i. The learning rate is set to 0.5, and Q is initialized to zeros. Could such an agent generate this trajectory for ε ≠ 0? If yes, label which actions are greedy and which are random, and justify your answer. [10 points]

2. Now, given the above trajectory, suppose we use Q-learning for off-policy learning. Suppose all Q values are initialized to 6, and the learning rate is set to 0.5. Write down the updated Q values as the states, actions, and rewards are visited along the trajectory. [10 points]

2 Problem 2 [20 points]

In class we defined Q values for the discounted-sum-reward problem. We can similarly define Q values for a finite-horizon MDP with γ = 1. Let π*_{H−t}(s) be the action taken by the optimal policy in state s at time t (i.e., with H − t time steps left in the horizon), for t = 0, 1, ..., H − 1. Let Q*_{H−t}(s, a) be the Q value of the optimal policy upon taking action a in state s with H − t time steps left in the horizon.

1. In the first time step t = 0, we are in state s0 and choose action a, which is suboptimal. If we then follow the optimal policy from t = 1 until the end of the episode, what is the value of this policy compared to the optimal one? Express your result using only Q* values. [2 points]

2. Suppose that instead of the optimal policy π*, we follow another policy π. Denote by V^π_H(s0) and V*_H(s0) the value functions of policy π and of the optimal policy, respectively, starting at state s0.
Show that the loss can be computed as: [8 points]

    |V*_H(s0) − V^π_H(s0)| = Σ_{t=0}^{H−1} [ Q*_{H−t}(s_t, π*_{H−t}(s_t)) − Q*_{H−t}(s_t, π_{H−t}(s_t)) ]

3. Now suppose we do not have access to the true Q* but have estimated it by Q̂, a close approximation to Q*. In particular, for every state s, action a, and time step t, we have |Q̂_{H−t}(s, a) − Q*_{H−t}(s, a)| ≤ ε. Suppose we follow the actions of the policy π̃_{H−t}(s) = argmax_a Q̂_{H−t}(s, a). Show that the loss in the total reward can be bounded as: [10 points]

    |V*_H(s0) − V^π̃_H(s0)| ≤ 2εH

3 Problem 3 [10 pts]

Read the Cliff Walking example (Example 6.6) in Sutton and Barto. Write a summary of what you think is the difference between the "online performance" of SARSA and Q-learning.

4 Programming: Frozen Lake MDP - Part 2 [50 pts]

In HW1, you implemented policy iteration assuming knowledge of the model. In this homework, you will implement methods that do not require knowledge of the model: SARSA and Q-learning. We have provided starter code in the form of a Jupyter notebook. You can write and execute your Python code directly in your web browser, or locally on your machine with Jupyter.
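Both algorithms share the same tabular update skeleton; only the bootstrap target differs: Q-learning bootstraps on the greedy action in the next state (off-policy), while SARSA bootstraps on the action actually taken (on-policy). The sketch below illustrates these update rules and the ε-greedy action choice from Problem 1; the function names, the array-based environment interface, and the tie-breaking convention are illustrative assumptions, not the starter code's actual API.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: bootstrap on the best action available in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: bootstrap on the action a_next actually taken in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, eps, rng):
    """With prob. eps pick a uniformly random action, else the greedy one.
    np.argmax breaks ties by the smallest action index, as in Problem 1."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Example of a single Q-learning step (arbitrary numbers, zero-initialized table):
# taking action 0 in state 0, receiving reward 1, landing back in state 0.
Q = np.zeros((3, 3))
q_learning_update(Q, s=0, a=0, r=1.0, s_next=0, alpha=0.5, gamma=0.5)
# Q[0, 0] is now 0.5 * (1.0 + 0.5 * 0 - 0) = 0.5
```

For the notebook, these updates would be applied inside an episode loop; the key structural difference is that SARSA must select a_next with the ε-greedy policy before it can update, whereas Q-learning can update as soon as s_next is observed.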