HW 2
1 Problem 1 [20 pts]
An agent is exploring an MDP M = (S, A, R, P, γ) where S = {s_1, s_2, s_3}, A = {a_1, a_2, a_3}, γ = 0.5, and P(s_i | s, a_i) = 1 for every state s and every i (i.e., taking action a_i always leads to state s_i). The reward for transitioning into state s_i is R(s_i) = i.
1. The agent follows the trajectory
(s_1, a_1, 1, s_1, a_2, 2, s_2)
Consider a Q-learning agent with ε-greedy exploration. Further assume that a random action never happens to pick the greedy action, and that ties are broken by choosing the a_i with the smallest i. The learning rate is set to 0.5. Initialize Q to zeros.
Could such an agent generate this trajectory for ε ≠ 0? If yes, label which actions are greedy and which are random, and justify your answer. [10 points]
2. Now, given the above trajectory, suppose we use Q-learning for off-policy learning. Suppose all Q values are initialized to 6 and the learning rate is set to 0.5. Write out the updated Q values as the states, actions, and rewards in the trajectory are visited, in order; a sketch of the tabular update rule is given below. [10 points]
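For reference, here is a minimal sketch (plain Python/NumPy, not part of the official starter code) of the tabular Q-learning update that part 2 refers to, applied to the two transitions contained in the trajectory above. The zero-based indexing of states/actions and the function name are illustrative choices, not part of the assignment.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
        # Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        return Q

    # States s1, s2, s3 and actions a1, a2, a3 are indexed 0, 1, 2 below.
    # Greedy selection with ties broken by the smallest index corresponds to np.argmax(Q[s]).
    Q = np.full((3, 3), 6.0)                        # part 2 setup: all Q values initialized to 6
    # The trajectory (s1, a1, 1, s1, a2, 2, s2) contains two transitions:
    q_learning_update(Q, s=0, a=0, r=1, s_next=0)   # s1 --a1--> s1, reward 1
    q_learning_update(Q, s=0, a=1, r=2, s_next=1)   # s1 --a2--> s2, reward 2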
2 Problem 2 [20 pts]
In class we defined Q values for the discounted-sum reward problem. We can similarly define Q values for a finite-horizon MDP with γ = 1. Let π^*_{H−t}(s) be the action taken by the optimal policy in state s at time t (i.e., with H − t time steps in the horizon), for t = 0, 1, ..., H − 1. Let Q^*_{H−t}(s, a) be the Q value of the optimal policy upon taking action a in state s with H − t time steps in the horizon.
1. In the first time step t = 0, we are in state s_0 and choose action a, which is suboptimal. If we then follow the optimal policy from t = 1 until the end of the episode, what is the value of this policy compared to the optimal one? Express your result using only Q^* values. [2 points]
2. Suppose that instead of the optimal policy π^*, we follow another policy π. Denote by V^π_H(s_0) and V^*_H(s_0) the value functions of policy π and of the optimal policy, respectively, starting at state s_0. Show that the loss can be computed as: [8 points]

|V^*_H(s_0) − V^π_H(s_0)| = Σ_{t=0}^{H−1} ( Q^*_{H−t}(s_t, π^*_{H−t}(s_t)) − Q^*_{H−t}(s_t, π_{H−t}(s_t)) )
3. Now suppose we do not have access to the true Q^* but have estimated it by Q̂, which is a close approximation to Q^*. In particular, for every state s, action a, and time step t, we have |Q̂_{H−t}(s, a) − Q^*_{H−t}(s, a)| ≤ ε. Suppose we follow the actions of the policy π̃_{H−t}(s) = argmax_a Q̂_{H−t}(s, a). Show that the loss in total reward can be bounded as: [10 points]

|V^*_H(s_0) − V^{π̃}_H(s_0)| ≤ 2εH
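As a reminder of how the finite-horizon quantities above fit together, one common way to write the optimal Q values recursively is sketched below, using k = H − t for the number of steps remaining and a generic transition reward R(s, a, s'); your course notes may use a slightly different reward convention.

    \begin{aligned}
    Q^*_0(s, a) &= 0, \\
    Q^*_k(s, a) &= \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, R(s, a, s') + \max_{a'} Q^*_{k-1}(s', a') \,\big], \qquad k = 1, \dots, H, \\
    V^*_k(s)   &= \max_a Q^*_k(s, a) = Q^*_k\big(s, \pi^*_k(s)\big).
    \end{aligned}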
3 Problem 3 [10 pts]
Read the Cliff Walking example (Example 6.6) in Sutton and Barto. Write a summary of what you
think is the difference between the “online performance” of SARSA and Q-learning.
4 Programming: Frozen Lake MDP - Part 2 [50 pts]
In HW1, you implemented policy iteration assuming knowledge of the model. In this homework, you will implement methods that do not require knowledge of the model: SARSA and Q-learning. We have provided starter code in the form of a Jupyter notebook. You can write and execute your Python code directly in your web browser, or locally on your machine with Jupyter.
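For orientation, the sketch below contrasts the two update rules you will implement. The environment name "FrozenLake-v1", the gymnasium-style reset/step interface, and the hyperparameter values are assumptions for illustration; the provided starter notebook may use a different interface, so adapt the episode loops to it rather than copying them verbatim.

    import gymnasium as gym
    import numpy as np

    def epsilon_greedy(Q, s, eps, rng):
        # Random action with probability eps, otherwise greedy (first index on ties).
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def sarsa_episode(env, Q, eps, alpha, gamma, rng):
        # SARSA (on-policy): the bootstrap target uses the next action actually
        # chosen by the epsilon-greedy behavior policy.
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(Q, s_next, eps, rng)
            target = r + (0.0 if terminated else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next

    def q_learning_episode(env, Q, eps, alpha, gamma, rng):
        # Q-learning (off-policy): the bootstrap target uses the greedy action in
        # the next state, regardless of what the behavior policy does next.
        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, eps, rng)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    # Hypothetical usage; environment id and hyperparameters are placeholders.
    env = gym.make("FrozenLake-v1")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(5000):
        q_learning_episode(env, Q, eps=0.1, alpha=0.5, gamma=0.99, rng=rng)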