HW 1
1 Problem 1 [20 pts]
Consider the simple MDP shown below. Starting from state s1, the agent can move to the right
(a0) or left (a1) from any state si. Actions are deterministic (e.g. choosing a1 at state s2 results in
transition to state s1). Taking any action from the goal state G earns a reward of r = +1 and the
agent stays in state G. Otherwise, each move has zero reward (r = 0). Assume a discount factor
γ < 1.
[Figure: a chain of states s1, s2, … with goal state G; every transition has reward r = 0, except that any action taken from G earns r = +1.]
(a) What is the optimal action at any state si ≠ G? Find the optimal value function for all states
si and the goal state G. [5 pts]
(b) Does the optimal policy depend on the value of the discount factor γ? Explain your answer.
[5 pts]
(c) Consider adding a constant c to all rewards. Find the new optimal value function for all states
si and the goal state G. Does adding a constant reward c change the optimal policy? Explain
your answer. [5 pts]
(d) After adding a constant c to all rewards, now consider scaling all the rewards by a constant a
(i.e., r_new = a(c + r_old)). Find the new optimal value function for all states si and the goal
state G. Does that change the optimal policy? Explain your answer. If yes, give an example
of a and c that changes the optimal policy. [5 pts]
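A quick numerical sanity check for parts (a)–(d) is to run value iteration on a small instance of this chain. The sketch below assumes n = 5 states with G at the right end; the state encoding, discount factor, and sweep count are illustrative choices, not part of the problem:

```python
# Value iteration on a 5-state chain: indices 0..3 play the role of
# s1..s4 and index 4 is the goal G (illustrative encoding).
gamma = 0.9
n = 5
G = n - 1

def step(s, a):
    """Deterministic transition: action 0 moves right, action 1 moves
    left; any action from G earns +1 and stays in G, all else r = 0."""
    if s == G:
        return G, 1.0
    s2 = min(s + 1, G) if a == 0 else max(s - 1, 0)
    return s2, 0.0

V = [0.0] * n
for _ in range(1000):  # plenty of sweeps to converge at gamma = 0.9
    V = [max(r + gamma * V[s2] for s2, r in (step(s, 0), step(s, 1)))
         for s in range(n)]

# V should match gamma^d / (1 - gamma), with d the distance to G.
print([round(v, 3) for v in V])
```

Checking how V changes under r → a(c + r) is then a one-line edit to `step`.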
2 Problem 2 [20 pts]
Consider the infinite MDP with discount factor γ < 1 illustrated in Figure 1. It consists of 3 states,
and rewards are given upon taking an action from the state. From state s0, action a1 has zero
immediate reward and causes a deterministic transition to state s1 where there is reward +1 for
every time step afterwards (regardless of action). From state s0, action a2 causes a deterministic
transition to state s2 with immediate reward of γ²/(1−γ), but state s2 has zero reward for every
time step afterwards (regardless of action).
[Figure: three states s0, s1, s2; from s0, action a1 leads to s1 with r = 0 and action a2 leads to s2 with r = γ²/(1−γ); s1 self-loops with r = +1 and s2 self-loops with r = 0.]
Figure 1: infinite 3-state MDP
(a) What is the total discounted return (∑_{t=0}^∞ γ^t r_t) of taking action a1 from state s0 at time step
t = 0? [5 pts]
(b) What is the total discounted return (∑_{t=0}^∞ γ^t r_t) of taking action a2 from state s0 at time step
t = 0? What is the optimal action? [5 pts]
(c) Assume we initialize the value of each state to zero (i.e., at iteration n = 0, ∀s : V_{n=0}(s) = 0).
Show that value iteration continues to choose the suboptimal action until iteration n*, where
n* ≥ log(1−γ)/log γ ≥ (1/2) · log(1/(1−γ)) · 1/(1−γ).
Thus, value iteration has a running time that grows faster than 1/(1−γ). (You only need to
show the first inequality.) [10 pts]
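A useful companion to part (c) is to watch value iteration run on this MDP and record the first sweep at which the greedy action at s0 switches from a2 to a1, then compare that sweep count against log(1−γ)/log γ. In the sketch below, γ = 0.95 and the 500-sweep cap are arbitrary illustrative choices:

```python
import math

gamma = 0.95  # illustrative discount factor
V = {'s0': 0.0, 's1': 0.0, 's2': 0.0}  # V_{n=0}(s) = 0 for all s
switch_iter = None

for n in range(1, 500):
    # One-step lookahead at s0 using the previous sweep's values.
    q1 = 0.0 + gamma * V['s1']                     # a1: r = 0, go to s1
    q2 = gamma**2 / (1 - gamma) + gamma * V['s2']  # a2: big immediate reward
    if switch_iter is None and q1 > q2:
        switch_iter = n  # first sweep where the optimal action a1 wins
    V = {'s0': max(q1, q2),
         's1': 1.0 + gamma * V['s1'],  # +1 per step forever in s1
         's2': 0.0 + gamma * V['s2']}  # 0 per step forever in s2

bound = math.log(1 - gamma) / math.log(gamma)
print(switch_iter, bound)  # the switch happens only after the bound
```

The printed sweep index exceeds the bound, matching the first inequality in part (c).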
3 Problem 3 [20 pts]
In class, we stated the result regarding the convergence of the Value Iteration Algorithm. In this
homework, we will prove it.
(a) [10 pts] Prove that the convergence rate of the Value Iteration Algorithm satisfies
‖Vk − V*‖ ≤ (γ^k / (1−γ)) ‖V1 − V0‖
(b) [10 pts] Let πk be the policy at the end of the kth iteration. Show that
‖V^{πk} − V*‖ ≤ (2γ^k / (1−γ)) ‖V1 − V0‖
Hint: use the contraction property of Bellman operators. The above is Theorem 6.3.3, Section 6.3.2
in Puterman’s book (Markov Decision Processes: Discrete Stochastic Dynamic Programming. John
Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994). Try to prove it yourself. If you need
help, you can read the proof from the book and summarize it here.
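For orientation on the hint (this is the standard contraction fact behind the theorem, not the proof itself): the Bellman optimality operator T is a γ-contraction in the sup norm,

```latex
\|TU - TV\|_{\infty} \le \gamma \, \|U - V\|_{\infty},
\qquad\text{where}\qquad
(TV)(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V(s') \Big],
```

and Vk = T Vk−1 at every iteration. Part (a) follows by applying this inequality repeatedly; part (b) additionally uses the contraction property of the policy-specific operator T^{πk}.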
4 Programming: Frozen Lake MDP [40 pts]
Now you will implement policy iteration for the Frozen Lake environment from OpenAI Gym:
https://gym.openai.com/envs/FrozenLake-v0. We have provided starter code in the form of
a Jupyter notebook. This allows you to write and execute your Python code directly in your
web browser or on your machine locally with Jupyter. You will implement policy_evaluation,
policy_improvement and policy_iteration.
Instructions: If you want to run in your browser, you can simply open it in Google Colab:
https://colab.research.google.com/notebooks/intro.ipynb.
If you want to run locally on your machine, we recommend installing Anaconda which includes
Python, the Jupyter Notebook, and other commonly used packages for data science. Installation
instructions are here: https://docs.anaconda.com/anaconda/install/. Then you can proceed
to install gym using the instructions here: https://gym.openai.com/docs/.
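As a rough shape for two of the three functions, here is a minimal sketch using Gym's transition format, P[s][a] = [(prob, next_state, reward, done), ...]. The function names and signatures below are assumptions about the starter notebook's interface (adapt them to match), and the sketch is exercised on a tiny hand-built MDP rather than FrozenLake itself:

```python
import numpy as np

def policy_evaluation(P, nS, nA, policy, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup for a fixed deterministic
    policy until the sup-norm change drops below tol."""
    V = np.zeros(nS)
    while True:
        V_new = np.zeros(nS)
        for s in range(nS):
            for prob, s2, r, _ in P[s][policy[s]]:
                V_new[s] += prob * (r + gamma * V[s2])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, nS, nA, V, gamma=0.9):
    """Return the policy that acts greedily with respect to V."""
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        q = [sum(prob * (r + gamma * V[s2]) for prob, s2, r, _ in P[s][a])
             for a in range(nA)]
        policy[s] = int(np.argmax(q))
    return policy

# Tiny 2-state stand-in for env.P: from state 0, action 1 reaches the
# absorbing state 1 with reward +1; absorbing states self-loop with
# r = 0, mirroring how Gym encodes terminal states in FrozenLake.
P = {0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
     1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]}}
V = policy_evaluation(P, nS=2, nA=2, policy=[1, 0], gamma=0.9)
pi = policy_improvement(P, nS=2, nA=2, V=V, gamma=0.9)
print(V, pi)
```

On FrozenLake, `env.P` together with the state and action counts (`env.nS`/`env.nA` in older Gym, or `env.observation_space.n`/`env.action_space.n`) supplies these arguments; `policy_iteration` then alternates the two functions until the policy stops changing.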