DS8010 - Interactive Learning in Decision Processes
Assignment # 3
Out April 14
Due April 25, 11:59PM
Late submissions will not be accepted!
Solve the following problems. You may work in groups of 1-2 students. You must cite any references (texts,
papers, or websites) you have used to help you solve these problems. You do not need to do this for the programming
questions, e.g., when you check a function name on the internet; however, you cannot copy or use someone
else's code without a reference. For the written questions, submit a report (as a "pdf" file; this means you
either type your answers in Word/LaTeX or scan your hand-written answers, e.g., using an app), and for
the programming questions submit a single Python file that contains your solution in the respective assignment
submission folder in D2L. Assignment submission folders are under Assignments in the Assessment tab.
• Pay attention to the following guidelines for the coding problems:
– Each submitted file must include the names of the group members at the top of the file (in a Jupyter
notebook, you should do so in the first cell).
– Each submitted file must follow the naming convention below (replace the terms questionID,
"student1LastName" and "student2LastName"; student1LastName and student2LastName should be in
alphabetical order):
hw3 questionID student1LastName student2LastName(.pdf or .py)
For instance, for the 3rd question, the corresponding file would be named
hw3 3 Cevik Jahanshahi.pdf
– You are required to submit a summary report (as a pdf file, as described above) for the coding questions
as well (include plots, summary results, etc. in this report).
– Only one group member should submit the homework.
Not following the submission guidelines will result in point deductions!
1. [30 pts, written] You are given the following trajectory in a single episode:
{S0, A0, R1, S1, A1, R2, S2, A2, R3, S3, A3, R4, S4, A4, R5, S5}
= {X, "right", 5, Y, "left", 3, X, "left", −3, Z, "left", −1, W, "left", 2, T}
where {X, Y, Z, W, T} is the set of (observed) states, T being the terminal state. Assume a discount factor of
γ = 0.8.
For state S0 = X:
(a) [10 pts] Calculate the Monte Carlo (MC) return G0 (show your work).
(b) [10 pts] Calculate the 2-step (G0:2) and 3-step (G0:3) returns (show your work).
(c) [10 pts] Calculate the compounded λ-return where λ = 0.5 (show your work).
(The standard definitions of these returns are recalled below for reference.)
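For reference, the standard return definitions, written here in LaTeX and following Sutton and Barto's notation (T is the terminal time step and V denotes the current state-value estimates), are:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t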
2. [20 pts, written] Assume that the path given in Figure 1 has been taken in the Dyna Maze in the first episode.
The rewards are deterministic: transitioning into the goal state (G) yields +1, and all other transitions yield 0. The state
is given by the tuple (x, y), where x is the row number and y is the column number. Assume all state and
action values are initialized to zero.
Figure 1: A 6×8 Dyna maze example (S: starting state, G: goal state)
(a) [10 pts] Considering the Tabular Dyna-Q algorithm, fill out the Model(s,a) table that would be obtained
from the episode illustrated in Figure 1, which can then be used in the planning steps of the
algorithm. (Hint: see Table 1.)

State | Action | Next State | Reward
(2,0) | Right  | ...        | ...

Table 1: Model(s,a) table template
(b) [10 pts] If the planning step were to implement TD-style Q-learning updates (assume α = 1), write
down the planning updates under a prioritized sweeping rule with a threshold value of θ = 0.1. (A
generic sketch of the prioritized-sweeping planning loop is given after this question for reference.)
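The following is a minimal Python sketch of the prioritized-sweeping planning loop (Sutton and Barto, Section 8.4) applied to a deterministic one-episode model. All names (model, Q, predecessors) and the value of gamma are illustrative assumptions; they are not taken from the assignment or the helper files.

import heapq
from collections import defaultdict

# Minimal sketch of prioritized-sweeping planning updates over a deterministic
# one-episode model. The single transition shown for ((2, 0), "Right") and
# gamma = 0.95 are illustrative assumptions; fill in the model from Table 1.
gamma, alpha, theta = 0.95, 1.0, 0.1
Q = defaultdict(float)                      # Q[(state, action)], initialized to zero
model = {                                   # Model(s, a) -> (next_state, reward)
    ((2, 0), "Right"): ((2, 1), 0.0),
    # ... remaining transitions observed in the episode of Figure 1
}

def max_q(state):
    """Greedy value of a state over the actions recorded in the model."""
    return max((Q[(s, a)] for (s, a) in model if s == state), default=0.0)

def priority(s, a):
    """Magnitude of the one-step Q-learning TD error for (s, a)."""
    s_next, r = model[(s, a)]
    return abs(r + gamma * max_q(s_next) - Q[(s, a)])

# Predecessor lookup: (s, a, r) triples predicted to lead into a given state.
predecessors = defaultdict(list)
for (s, a), (s_next, r) in model.items():
    predecessors[s_next].append((s, a, r))

# Seed the queue with every modeled pair whose priority exceeds the threshold.
pqueue = []
for (s, a) in model:
    p = priority(s, a)
    if p > theta:
        heapq.heappush(pqueue, (-p, (s, a)))    # negate: heapq is a min-heap

# Planning: pop the highest-priority pair, apply the Q-learning update,
# then re-examine its predecessors.
while pqueue:
    _, (s, a) = heapq.heappop(pqueue)
    s_next, r = model[(s, a)]
    Q[(s, a)] += alpha * (r + gamma * max_q(s_next) - Q[(s, a)])
    for (s_prev, a_prev, r_prev) in predecessors[s]:
        p = abs(r_prev + gamma * max_q(s) - Q[(s_prev, a_prev)])
        if p > theta:
            heapq.heappush(pqueue, (-p, (s_prev, a_prev)))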
3. [50 pts (+ 20 pts Bonus), implementation] Solving the Cartpole problem with DQN variants (all implementations
should use Tensorflow and Keras, and you are recommended to use Google Colab for your
implementations). The Cartpole environment can be obtained from the gym library as follows:

env = gym.make('CartPole-v0')
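As a reference for the parts below, here is a minimal sketch of the basic interaction loop with this environment. It assumes the classic gym API (gym versions before 0.26, where reset returns the state and step returns a 4-tuple); newer gym/gymnasium releases return extra values, so adjust accordingly. The random policy is only a placeholder.

import gym

# Minimal interaction-loop sketch for CartPole-v0, assuming the classic gym API.
env = gym.make('CartPole-v0')
state = env.reset()                      # 4-dimensional observation
done, episode_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # placeholder: replace with the agent's action
    next_state, reward, done, info = env.step(action)
    episode_reward += reward
    state = next_state
print("episode reward:", episode_reward)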
(a) [30 pts] Implement the Double DQN (i.e., DDQN) algorithm for the Cartpole problem. Note that the generic
DDQN algorithm involves two networks: the prediction network and the target network. In your
implementation, the target network should be updated every 10 episodes. The rest of the algorithm parameters
are identical to those you can find in the DQN exercise posted in D2L (see ds8010 lec9 ex cartPole v0.ipynb).
Specifically, you are required to create a memory of size 2000 to keep track of the latest steps,
and the batch size for the training phase should be 32. In addition, you should run the DDQN algorithm
for at least 1,000 episodes. Your implementation is required to output a plot showing the
collected reward for each episode (e.g., see Figure 2). A helper file for this question is provided in
ds8004 hw3 q3a CartPole DDQN helper.ipynb. (A sketch of the double-Q target computation is given after Figure 2 for reference.)
Figure 2: Sample plot for DDQN rewards over 100 episodes
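A minimal sketch of the part that distinguishes DDQN from DQN, the target computation, is given below. The layer sizes, optimizer, and gamma are illustrative assumptions; take the actual hyperparameters from the D2L exercise and the helper notebook. The key point is that the prediction (online) network selects the next action while the target network evaluates it.

import numpy as np
from tensorflow import keras

# Sketch of a DDQN target computation in Keras. Layer sizes, optimizer and gamma
# are illustrative assumptions; use the hyperparameters from the D2L exercise.
def build_q_network(state_dim=4, n_actions=2):
    return keras.Sequential([
        keras.layers.Input(shape=(state_dim,)),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())   # re-sync every 10 episodes
online_net.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

def ddqn_targets(states, actions, rewards, next_states, dones, gamma=0.99):
    """Double-DQN targets: the online network selects the next action,
    the target network evaluates it."""
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)
    q_next_online = online_net.predict(next_states, verbose=0)
    q_next_target = target_net.predict(next_states, verbose=0)
    best_actions = np.argmax(q_next_online, axis=1)
    bootstrap = q_next_target[np.arange(len(actions)), best_actions]
    targets = online_net.predict(states, verbose=0)     # keep non-taken actions unchanged
    targets[np.arange(len(actions)), actions] = rewards + gamma * (1.0 - dones) * bootstrap
    return targets                                      # then: online_net.fit(states, targets, verbose=0)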
(b) [20 pts, Bonus] Using the same settings described in part (a), implement a dueling DDQN algorithm.
A review of the dueling DDQN algorithm is provided below.
Q-values correspond to how good it is to be at state s and take action a, i.e., Q(s,a).
Accordingly, we can decompose Q(s,a) as the sum of V(s) (i.e., the value of being in state
s) and A(s,a) (the advantage of taking that action at that state, i.e., how much better it is to take
this action versus all other possible actions at that state). That is, Q(s,a) = A(s,a) + V(s).
With dueling DDQN, we aim to separate the estimation of these two elements, using two
new streams: one that estimates V(s), and one that estimates the advantage for each action,
A(s,a). Then, we combine these two streams through a special aggregation layer to get an
estimate of Q(s,a). By decoupling the estimation, intuitively our dueling DDQN can learn
which states are valuable without having to learn the effect of each action at each state (since
it is also calculating V(s)).
Run the DDQN (from part (a)) and dueling DDQN algorithms for 1000 episodes and comment on their
relative performance (e.g., as indicated by the reward plots). (A sketch of a dueling network head is given below for reference.)
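A minimal Keras sketch of a dueling head is given below, assuming the same input/output sizes as part (a). The aggregation layer uses the common mean-subtracted form Q(s,a) = V(s) + A(s,a) − mean_a A(s,a), which keeps the V and A streams identifiable; this particular aggregation choice is an assumption, not something specified in the assignment.

import tensorflow as tf
from tensorflow import keras

# Sketch of a dueling Q-network for CartPole (4 state dimensions, 2 actions).
# Layer sizes are illustrative; reuse the settings from part (a).
def build_dueling_q_network(state_dim=4, n_actions=2):
    inputs = keras.layers.Input(shape=(state_dim,))
    x = keras.layers.Dense(24, activation="relu")(inputs)
    x = keras.layers.Dense(24, activation="relu")(x)

    value = keras.layers.Dense(1, activation="linear")(x)              # V(s) stream
    advantage = keras.layers.Dense(n_actions, activation="linear")(x)  # A(s, a) stream

    # Aggregation layer: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q_values = keras.layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([value, advantage])
    return keras.Model(inputs=inputs, outputs=q_values)

# The rest of the training loop is identical to part (a); only the network changes.
dueling_online_net = build_dueling_q_network()
dueling_target_net = build_dueling_q_network()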
(c) [20 pts] Solve the Cartpole problem using the REINFORCE algorithm. Note that the implementation provided
in D2L (see ds8010 lec12 ex REINFORCE cartPole v0.ipynb) is done in PyTorch. However,
for this question, you are required to implement the same algorithm in Tensorflow. The (high-level)
steps of the REINFORCE algorithm are provided below.
− Initialize a random policy (a NN that takes the state as input and returns the probability of each action)
− Repeat:
1. Use the policy to play N steps of the game – record the action probabilities from the policy, the reward
from the environment, and the action sampled by the agent
2. Calculate the discounted reward for each step by working backward through the episode
3. Calculate the expected reward G
4. Adjust the weights θ of the policy (back-propagate the error in the NN)
You should run the REINFORCE algorithm for at least 1,000 episodes, and similar to part (a), your implementation
is required to output a plot showing the collected reward for each episode. How do your
results compare to part (a)? In theory, which algorithm is better suited for the Cartpole problem (explain
your reasoning)? (A sketch of the REINFORCE policy update in TensorFlow is given after this question for reference.)
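The following is a minimal TensorFlow sketch of a single REINFORCE policy update, assuming states, actions, and rewards collected from one completed episode (the interaction loop itself is omitted). The network size, learning rate, gamma, and the return normalization are illustrative assumptions rather than required settings.

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Sketch of one REINFORCE update for CartPole-v0 with a small softmax policy.
policy = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),   # action probabilities
])
optimizer = keras.optimizers.Adam(1e-2)

def discounted_returns(rewards, gamma=0.99):
    """Work backward through the episode to get the discounted return G_t for every step."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(states, actions, rewards):
    returns = discounted_returns(rewards)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # optional normalization
    states = tf.convert_to_tensor(np.asarray(states, dtype=np.float32))
    actions = tf.convert_to_tensor(np.asarray(actions, dtype=np.int32))
    with tf.GradientTape() as tape:
        probs = policy(states)                                       # (T, 2)
        taken = tf.reduce_sum(probs * tf.one_hot(actions, 2), axis=1)
        loss = -tf.reduce_mean(tf.math.log(taken + 1e-8) * returns)  # policy-gradient loss
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))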
