CSCI 360 – Project #3
1 Introduction
In this project, you will learn about MDPs by playing Frozen Lake¹. The description of
Frozen Lake from the OpenAI website is as follows:
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted. If
you step into one of those holes, you’ll fall into the freezing water. At this time,
there’s an international frisbee shortage, so it’s absolutely imperative that you
navigate across the lake and retrieve the disc. However, the ice is slippery, so you
won’t always move in the direction you intend.
Markov Decision Process Formulation
We formalize the problem as a Markov Decision Process (MDP). Before diving into the
definitions of states, actions, transition probabilities, and rewards, we introduce the slightly
different notation used in this project and the changes you need to make in order to apply
the algorithms you learned in class to this project.
Notation
In the lecture, we defined the cost c(s) = c_i of state s = s_i as

    c(s) = min_a Σ_{s′} p(s′|s, a) (c(s, a, s′) + γ c(s′)),    (1)
where c(s, a, s′) is the action cost that you incur when you start in state s, execute action a,
and transition to state s′. c(s) is the minimum expected total discounted cost that you incur
when you start in state s and behave optimally. We further define the q-value q(s, a) of state
s and action a as
    q(s, a) = Σ_{s′} p(s′|s, a) (c(s, a, s′) + γ c(s′)).    (2)
¹ https://gym.openai.com/envs/FrozenLake-v0/
q(s, a) is the minimum expected cost that you incur when you start in state s, execute action
a, and then behave optimally. Then, c(s) can be written as
    c(s) = min_a q(s, a).    (3)
In this project, we deal with action rewards r(s, a, s′) instead of action costs c(s, a, s′),
where r(s, a, s′) is the action reward that you incur when you start in state s, execute action
a, and transition to state s′. Costs are simply negative rewards, so the objective changes from
minimizing costs to maximizing rewards. We define the value v(s) of state s similarly to c(s) as

    v(s) = max_a Σ_{s′} p(s′|s, a) (r(s, a, s′) + γ v(s′)),    (4)
and the q-value q(s, a) of state s and action a as
    q(s, a) = Σ_{s′} p(s′|s, a) (r(s, a, s′) + γ v(s′)).    (5)
In other words, except for the different notation, we only changed the minimization to a
maximization in the value iteration, policy iteration, and q-learning algorithms learned in
class.
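As a concrete illustration of Equations (4) and (5) with purely hypothetical numbers (not
taken from any of the provided maps): suppose that in state s, action a leads to s′_1 with
probability 0.8 and to s′_2 with probability 0.2, that r(s, a, s′_1) = 1, r(s, a, s′_2) = −10,
v(s′_1) = 2, v(s′_2) = 0, and γ = 0.9. Then

    q(s, a) = 0.8 (1 + 0.9 · 2) + 0.2 (−10 + 0.9 · 0) = 2.24 − 2 = 0.24,

and v(s) is the maximum of such q-values over all actions that can be executed in s.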
    S1  F   F   S2
    F   H   F   H
    F   H   F   F
    F   R   H   G

Figure 1: 4x4 Frozen Lake Example. The x coordinate indexes the rows (x = 0 at the top)
and the y coordinate indexes the columns (y = 0 at the left).
States
We show an example of a 4x4 frozen lake in Figure 1. The states are locations. The letters in
the locations mean the following: S1 and S2 are start states. You will be placed with uniform
probability in one of the start states when the game starts. Such states are safe. The reward
for visiting them is zero. F is a state with a frozen surface. Such states are safe. The reward
for visiting them is zero. R is a state with a reward. Such states are safe. The reward for
visiting them is positive. H is a terminal state with a hole, where you lose (by falling into the
lake) and the game terminates. The reward for visiting such states is negative. G is a terminal
state with a frisbee, where you win (by retrieving the frisbee) and the game terminates. The
reward for visiting such states is positive. There can be multiple states of each kind in a map.
We use S1(0, 0) to denote a start state at location (x = 0, y = 0) and G(3, 3) to denote the
winning terminal state at location (x = 3, y = 3).
Actions
In each state, you can choose to move UP, DOWN, LEFT, or RIGHT, except for moves that would
take you off the map. For example, in S2(0, 3), you cannot choose to move UP or RIGHT.
Transition Probabilities
The transition probability p(s′|s, a) encodes the probability with which you transition to
state s′ when you start in state s and execute action a. Since the ice is slippery, you move in
an unintended direction with slip probability p. (If p = 0, then actions have deterministic
rather than stochastic effects.) We assume that you move in the intended direction with
probability 1 − p and slip to the left or right of the intended direction (if possible) with equal
probability. For example, if you are in state F(1, 2) and choose to move DOWN, then you
transition to F(2, 2) with probability 1 − p, to H(1, 1) with probability p/2, and to H(1, 3)
with probability p/2. If you are in state S2(0, 3) and you choose to move LEFT, then you
transition to F(0, 2) with probability 1 − p and to H(1, 3) with probability p (since slipping
to the other side would take you off the map).
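The following C++ sketch (not part of the provided code) illustrates this slip model for a
square map. The Cell type and the neighbor() and onMap() helpers are hypothetical stand-ins
for the real GameState and map representation; following the S2(0, 3) example, the full slip
probability goes to the remaining perpendicular direction when the other one would leave the
map.

#include <map>
#include <utility>

// Illustrative sketch only. Cell is a hypothetical (x = row, y = column) location.
using Cell = std::pair<int, int>;
enum class Direction { UP, DOWN, LEFT, RIGHT };

Cell neighbor(const Cell &c, Direction d) {
  switch (d) {
    case Direction::UP:    return {c.first - 1, c.second};
    case Direction::DOWN:  return {c.first + 1, c.second};
    case Direction::LEFT:  return {c.first, c.second - 1};
    default:               return {c.first, c.second + 1};   // RIGHT
  }
}

bool onMap(const Cell &c, int size) {
  return c.first >= 0 && c.first < size && c.second >= 0 && c.second < size;
}

// Successor distribution for an intended move: probability 1 - p for the intended
// direction and probability p split equally between the two perpendicular
// directions; if one of them would leave the map, the full slip probability p
// goes to the other one (as in the S2(0, 3) LEFT example above).
std::map<Cell, double> slipDistribution(const Cell &c, Direction intended,
                                        double p, int size) {
  std::map<Cell, double> dist;
  dist[neighbor(c, intended)] += 1.0 - p;
  Direction side1, side2;                       // perpendicular to the intended move
  if (intended == Direction::UP || intended == Direction::DOWN) {
    side1 = Direction::LEFT;  side2 = Direction::RIGHT;
  } else {
    side1 = Direction::UP;    side2 = Direction::DOWN;
  }
  const Cell s1 = neighbor(c, side1), s2 = neighbor(c, side2);
  const bool ok1 = onMap(s1, size), ok2 = onMap(s2, size);
  if (ok1 && ok2)      { dist[s1] += p / 2.0; dist[s2] += p / 2.0; }
  else if (ok1)        { dist[s1] += p; }
  else if (ok2)        { dist[s2] += p; }
  return dist;
}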
Rewards
The rewards for visiting states R or G are positive, the rewards for visiting states S and F
are zero, and the rewards for visiting states H are negative. Different maps have different
rewards. The action reward r(s, a, s′) is the reward for visiting state s′.
Objective
You want to maximize the expected total discounted reward for different frozen lakes using
value iteration, policy iteration, and q-learning. Value iteration and policy iteration apply in
case the dynamics of the MDP are known a priori, that is, you know all transition probabilities
and rewards before executing any actions. This allows you to predict the outcomes of action
executions and enables you to plan offline, that is, before executing any actions. Q-learning
can be used even in case the dynamics of the MDP are unknown. You are then not able to
predict the outcomes of action executions and have to execute actions to learn the dynamics
of the MDP, which can be dangerous because you are (at least initially) not able to predict
whether you might fall into the lake.
2 Tasks
2.1 Installing Python 3
The frozen lake environment is written in Python 3. It requires a Python 3 installation with
numpy. You can ignore this part if you kept your Project 2 development environment.
On Ubuntu Linux, install Python by running:

python3 --version

On Mac OS X, download Anaconda 2019.03 with Python 3.7 from
https://www.anaconda.com/distribution/#download-section and follow the installer. You can
verify your installation by running:

python3 --version

On Windows, download Anaconda 2019.03 with Python 3.7 from
https://www.anaconda.com/distribution/#download-section.

On Ubuntu Linux and Mac OS X, use python3 to run Python. On Windows, use python
instead.
2.2 Experiments
Download the project folder and unzip it. Change the directory with cd to the frozen_lake
folder and run:
python3 play_game.py --map maps/bridge.json
You should see a GUI like the following:
Figure 2: Example Frozen Lake Environment
The top two lines contain metadata, including the slip probability p and the mapping
from states to the rewards for visiting them. You start a game episode by clicking RESET.
Then you play the game by clicking the different MOVE buttons. Your current state is always
highlighted in yellow.
2.3 Questions [5 pts]
First, review book chapters 17.2, 17.3, and 21.3 and the lecture slides. Always show all
your work, not just your answers, for the following questions. When specifying a policy, you
only need to show the actions for non-terminal states.
2.3.1 bridge.json
Run python3 play_game.py --map maps/bridge.json and answer the following question
for γ = 0.99:
• What is the optimal policy (that is, mapping from states to actions) and the value v(S)
of start state S? (You can compute the optimal policy backward from the winning
terminal states.)
2.3.2 bridge_stochastic.json
Run python3 play_game.py --map maps/bridge_stochastic.json and answer the follow-
ing questions for γ = 1.0:
• Which winning terminal state (G1 or G2) should you choose to move to if you are at
S(1, 1)? Which winning terminal state (G1 or G2) should you choose to move to if you
are at F(1, 3)? Explain why.
• What are the q-values q(S(1, 1), LEFT) of the start state S(1, 1) and action LEFT and
q(F(1, 4), RIGHT) of state F(1, 4) and action RIGHT?
• What are the values v(S(1, 1)), v(F(1, 2)), v(F(1, 3)), and v(F(1, 4))?
2.3.3 bridge_near_stochastic.json
Run python3 play_game.py --map maps/bridge_near_stochastic.json and answer the
following questions for γ = 1.0:
• What is the optimal policy?
• What is a value for the discount factor γ so that the optimal policy changes and what
is the optimal policy for this value of γ? Explain why the optimal policy changes using
your understanding of the effect of the discount factor.
2.3.4 frozen_lake_8x8.json
Run python3 play_game.py --map maps/frozen_lake_8x8.json --not_fully_visible
and answer the following question for γ = 0.99. The flag --not_fully_visible hides the
dynamics from you:
• What exploration strategy did you use when playing this frozen lake environment?
Remember that you want to achieve an expected total discounted reward that is as
large as possible. This means that you need to balance between exploring actions and
exploiting the currently seemingly best action.
3 Code Structure and APIs
The source code is in the src folder, and the test code is in the test folder. Try to understand
the code in
• FrozenLake.hpp/cpp,
• ValueIterationAgent.hpp/cpp,
• PolicyIterationAgent.hpp/cpp, and
• QLearningAgent.hpp/cpp.
You probably want to look into the test folder when you debug. You are not required to
look at any other support files.
3.1 APIs
The environment is implemented in FrozenLake.hpp/cpp. We implement two classes,
FrozenLakeEnv and FrozenLakeMDP, where FrozenLakeMDP is a subclass of FrozenLakeEnv.
You will work with FrozenLakeMDP when implementing value iteration and policy iteration
and with FrozenLakeEnv when implementing q-learning. The only difference between them
is that FrozenLakeMDP has access to all rewards and transition probabilities of the underlying
MDP. The APIs of FrozenLakeEnv are:
• GameState reset(), which resets the environment and returns a start state chosen
with uniform probability (among the start states);
• GameState getNextState(const GameState &state, const Action &action),
which samples the successor state for the given state and action;
• std::vector<Action> getPossibleActions(const GameState &state) const,
which returns a vector of actions that can be executed in the given state;
• double getReward(const GameState &state, const Action &action, const
GameState &nextState) const, which returns the reward when the given successor
state results from executing the given action in the given state (here: the reward of
visiting the given successor state); and
• bool isTerminal(const GameState &state) const, which returns whether the
given state is a terminal state.
There are two additional APIs for FrozenLakeMDP:
• std::map<GameState, double> getTransitionStatesAndProbs(const GameState
&state, const Action &action) const, which returns a mapping from successor
states to the probabilities with which executing the given action in the given state
transitions to them; and
• std::set<GameState> getStates() const, which returns a set of all states.
The above APIs are all the APIs that you need for your code.
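To see how these APIs fit together, here is a hedged sketch of a single backup of Equation (5)
for one state-action pair. It is not part of the provided code: the free function computeQValue
and the treatment of value-table entries missing for terminal states as 0 are illustrative
choices, and only the APIs listed above are used.

#include <map>
#include "FrozenLake.hpp"   // provides FrozenLakeMDP, GameState, Action

// Sketch only: one backup of Equation (5),
//   q(s, a) = sum_{s'} p(s'|s, a) (r(s, a, s') + gamma * v(s')),
// computed with the FrozenLakeMDP APIs listed above. 'values' is the current
// value table v; successor states missing from it (e.g. terminal states) are
// treated as having value 0.
double computeQValue(const FrozenLakeMDP &mdp, double gamma,
                     const std::map<GameState, double> &values,
                     const GameState &state, const Action &action) {
  double q = 0.0;
  for (const auto &entry : mdp.getTransitionStatesAndProbs(state, action)) {
    const GameState &nextState = entry.first;    // successor state s'
    const double prob = entry.second;            // transition probability p(s'|s, a)
    const double reward = mdp.getReward(state, action, nextState);
    const auto it = values.find(nextState);
    const double nextValue = (it != values.end()) ? it->second : 0.0;
    q += prob * (reward + gamma * nextValue);
  }
  return q;
}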
3.2 Building the Project
Build the project as follows:
cd frozen_lake
mkdir build; cd build;
cmake ../
make
which will create two targets, namely frozen_lake and run_test. The code has been tested
on macOS Catalina 10.15.1, Ubuntu 18.04, and Windows 10 using Visual Studio.
frozen_lake is in the src folder. ./frozen_lake --help prints the help message. The
program takes an agent, a map, and a set of parameters. It trains the agent on the map and
tests the agent on the same map for test_episodes test episodes and records the average
and standard deviation of the total discounted reward over all episodes. For example,
./frozen_lake --agent v --map ../maps/cliff_stochastic.json --theta 1e-7 --iteration 50 --gamma 0.99 --test_episodes 10000
loads map ../maps/cliff_stochastic.json, solves the MDP using value iteration with a
limit of 50 iterations, error threshold ε = 1e-7, and discount factor γ = 0.99. Afterwards, it
tests on the same map for 10,000 episodes and prints the mean and standard deviation of the
total discounted reward over all episodes.
run_test is in the test folder. It runs basic tests to verify your implementation. While
grading, we will run these tests as well as additional tests (for example, to catch attempts to
hard-code policies).
4 Value Iteration [25 pts]
In this part, you will implement the ValueIterationAgent class and test it on various
FrozenLakeMDP configurations. The constructor of the ValueIterationAgent class is:
ValueIterationAgent::ValueIterationAgent(FrozenLakeMDP const &mdp, double gamma,
                                         int iterations, double threshold) :
    ValueEstimateAgent(gamma, iterations, threshold), m_mdp(mdp) {
  initialize();
  solve();
}
The arguments of the constructor are:
• FrozenLakeMDP const &mdp, which is a frozen lake MDP instance defined in
FrozenLake.hpp;
• double gamma, which is the discount factor;
• int iterations, which is the iteration limit; and
• double threshold, which is the error threshold.
The constructor calls initialize(), where you should initialize any variables that you want
to define, and solve(), where you should solve the MDP.
Your task is to implement the class methods marked with TODO in
ValueIterationAgent.cpp:
• double ValueIterationAgent::getValue(const GameState &state), which re-
turns the value of the given state;
• double ValueIterationAgent::getQValue(const GameState &state, const
Action &action), which returns the q-value of the given state and the given action;
• Action ValueIterationAgent::getPolicy(const GameState &state), which re-
turns the action to execute in the given state;
• void ValueIterationAgent::solve(), which solves the MDP with value iteration
and stores the optimal policy in a private class member (a sketch of this loop follows
the list). The autograder will call getValue, getQValue, and getPolicy to verify its
correctness. Value iteration should terminate once it reaches the given iteration limit
m_iterations or the absolute change in the value of every state from one iteration to
the next is less than the given error threshold m_threshold; and
• void ValueIterationAgent::initialize(), which initializes the internal variables.
(You need to declare them as private class members before initializing them.)
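A minimal sketch of the value-iteration loop is shown below; it is not the required implementation.
The members m_values, m_policy, and m_gamma are assumptions about how you might store the
results (only m_mdp, m_iterations, and m_threshold are given), and getQValue is assumed to
compute Equation (5) over m_values.

// Sketch of the value-iteration loop (not the required implementation).
// Assumed private members that you would declare and set up in initialize():
//   std::map<GameState, double> m_values;   // current value estimates v(s), start at 0
//   std::map<GameState, Action> m_policy;   // greedy policy
// m_gamma is assumed to be the discount factor stored by ValueEstimateAgent.
// Requires <algorithm>, <cmath>, <limits>, and <map>.
void ValueIterationAgent::solve() {
  for (int i = 0; i < m_iterations; ++i) {
    std::map<GameState, double> newValues;
    double maxChange = 0.0;
    for (const GameState &state : m_mdp.getStates()) {
      if (m_mdp.isTerminal(state)) {
        newValues[state] = 0.0;                    // terminal states keep value 0
        continue;
      }
      double bestQ = -std::numeric_limits<double>::infinity();
      Action bestAction{};
      for (const Action &action : m_mdp.getPossibleActions(state)) {
        const double q = getQValue(state, action); // Equation (5) over m_values
        if (q > bestQ) { bestQ = q; bestAction = action; }
      }
      newValues[state] = bestQ;                    // Equation (4): v(s) = max_a q(s, a)
      m_policy[state] = bestAction;
      maxChange = std::max(maxChange, std::fabs(bestQ - m_values[state]));
    }
    m_values = newValues;
    if (maxChange < m_threshold) break;            // every value changed by less than the threshold
  }
}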
You can run all value iteration tests using ./run_test
--gtest_filter="ValueIteration*". You can run individual tests using CLion.
5 Policy Iteration [25 pts]
In this part, you will implement the PolicyIterationAgent class. The class methods look
very similar to the ones of value iteration, with the following differences:
• You are not allowed to create any new class members. The optimal policy is stored in
std::map<GameState, Action> m_policy. Remember to initialize this data structure
in void PolicyIterationAgent::initialize().
• Implement the helper function virtual std::map<GameState, double>
evaluateCurrentPolicy(), which evaluates the current policy and returns a
map that stores the value of each state. Instead of solving linear equations using
Gaussian elimination, you can choose to implement the iterative approach discussed
in the lecture (see the sketch after this list). The iterative approach should terminate
when the absolute change in the value of every state from one iteration to the next is
less than the given error threshold m_threshold.
• Policy iteration in void PolicyIterationAgent::solve() should terminate once it
reaches the given iteration limit m_iterations or the policy does not change from one
iteration to the next one.
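A minimal sketch of the iterative policy-evaluation approach follows. It is not the required
implementation; m_mdp and m_gamma are assumed to already be available as base-class or
existing members (you may not add new ones), and m_policy is the member described above.

// Sketch of iterative policy evaluation (not the required implementation).
// Repeatedly applies v(s) <- sum_{s'} p(s'|s, pi(s)) (r(s, pi(s), s') + gamma * v(s'))
// for the fixed current policy m_policy until no value changes by m_threshold or more.
// m_mdp and m_gamma are assumed existing members. Requires <algorithm>, <cmath>, <map>.
std::map<GameState, double> PolicyIterationAgent::evaluateCurrentPolicy() {
  std::map<GameState, double> values;              // starts at v(s) = 0 for every state
  double maxChange = 0.0;
  do {
    maxChange = 0.0;
    std::map<GameState, double> newValues;
    for (const GameState &state : m_mdp.getStates()) {
      if (m_mdp.isTerminal(state)) {
        newValues[state] = 0.0;                    // terminal states keep value 0
        continue;
      }
      const Action &action = m_policy.at(state);   // action chosen by the current policy
      double v = 0.0;
      for (const auto &entry : m_mdp.getTransitionStatesAndProbs(state, action)) {
        const double reward = m_mdp.getReward(state, action, entry.first);
        v += entry.second * (reward + m_gamma * values[entry.first]);
      }
      newValues[state] = v;
      maxChange = std::max(maxChange, std::fabs(v - values[state]));
    }
    values = newValues;
  } while (maxChange >= m_threshold);
  return values;
}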
You can run all policy iteration tests using ./run_test
--gtest_filter="PolicyIteration*".
6 Q-Learning [40 pts]
In this part, you will implement the QLearningAgent class. We have already implemented
void QLearningAgent::solve() for you, which implements the following learning procedure
for m_iterations iterations:
1. Reset the environment to obtain your start state.
2. Call getAction to obtain the next action to execute.
3. Call m_env.getNextState(state, action) to obtain the resulting successor state.
4. Call m_env.getReward(state, action, nextState) to obtain the resulting reward.
5. Call update(state, action, nextState, reward) to update the corresponding q-
value. (You need to implement it.)
6. If the successor state is not a terminal state, then go to 2.
7. Evaluate the current policy and record its expected total discounted reward into a
file so that you can plot it later. Do this by calling getPolicy instead of getAction.
getPolicy executes the current policy, which consists of the seemingly best action
in every state, while getAction also explores other actions. We use the ε-greedy
exploration strategy, where you choose a random action in your current state s with
probability ε and the currently seemingly best action argmax_a q(s, a) with probability
1 − ε.
Your task is to implement getValue, getQValue, getPolicy, getAction with the ε-greedy
exploration strategy, update, and initialize; a sketch of update and an ε-greedy getAction
follows the list of hyperparameters below. You can declare any variables inside the class.
The class has two more members as hyperparameters:
• m_alpha, which is the learning rate of q-learning.
• m_epsilon, which is the probability of choosing a random action of the ε-greedy
exploration strategy.
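Below is a minimal sketch of the q-learning update and ε-greedy action selection, not the
required implementation. The method signatures and the q-table member m_qValues are
assumptions based on the descriptions above; m_env, m_alpha, and m_epsilon are the members
described above, m_gamma is assumed to be the inherited discount factor, and the random-number
helpers are only one possible choice.

// Sketch of the q-learning update and epsilon-greedy action selection (not the
// required implementation). Assumed member:
//   std::map<GameState, std::map<Action, double>> m_qValues;   // q(s, a), default 0
// Requires <map>, <random>, and <vector>.

void QLearningAgent::update(const GameState &state, const Action &action,
                            const GameState &nextState, double reward) {
  // max_{a'} q(s', a'); terminal successors contribute 0.
  double maxNextQ = 0.0;
  if (!m_env.isTerminal(nextState)) {
    bool first = true;
    for (const Action &a : m_env.getPossibleActions(nextState)) {
      const double q = m_qValues[nextState][a];
      if (first || q > maxNextQ) { maxNextQ = q; first = false; }
    }
  }
  double &q = m_qValues[state][action];
  q += m_alpha * (reward + m_gamma * maxNextQ - q);   // q-learning update
}

Action QLearningAgent::getAction(const GameState &state) {
  // epsilon-greedy: explore a uniformly random action with probability m_epsilon,
  // otherwise exploit the currently seemingly best action (what getPolicy returns).
  static std::mt19937 rng(std::random_device{}());
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  if (coin(rng) < m_epsilon) {
    const std::vector<Action> actions = m_env.getPossibleActions(state);
    std::uniform_int_distribution<std::size_t> pick(0, actions.size() - 1);
    return actions[pick(rng)];
  }
  return getPolicy(state);                             // argmax_a q(s, a)
}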
For reference, we provide the performance of our code on various maps, but please note
that the results of q-learning vary across runs even with the same hyperparameters. The
provided performance was achieved in most of our runs.
• ./frozen_lake --agent q --map ../maps/bridge.json --epsilon 0.99
--alpha 1.0 --iteration 500 --gamma 0.99 should achieve an expected total
discounted reward of 9.703.
• ./frozen_lake --agent q --map ../maps/bridge_stochastic.json --epsilon
0.2 --alpha 0.05 --iteration 500 --gamma 0.99 should achieve an expected
total discounted reward of around [−20,−18].
• ./frozen_lake --agent q --map ../maps/bridge_near_stochastic.json
--epsilon 0.2 --alpha 0.1 --iteration 200 --gamma 0.99 should achieve
an expected total discounted reward of around [70, 90].
• ./frozen_lake --agent q --map ../maps/cliff_stochastic.json --epsilon
0.2 --alpha 0.2 --iteration 100 --gamma 0.99 should achieve an expected total
discounted reward of around 0.94.
Subtask 1: Experiment with α
Run frozen_lake_8x8.json with γ = 0.99, ε = 0.1, and α = 0.01, 0.05, 0.1, 0.2 and 0.4 for
2,000 iterations and plot a figure with 5 graphs, where the x-axis is the iteration and the
y-axis is the expected total discounted reward during the evaluation of the policy (since the
expected total discounted reward while learning the policy can be less due to the exploration
strategy). Each graph should be the average over 3 learning episodes since you might move
differently during each learning episode. Make sure to clearly label each graph with the value
of α. Run value iteration (or policy iteration) on this environment and report the difference
in expected total discounted reward between q-learning with each value of α and value
iteration (or policy iteration). Summarize your findings on how α affects the results
of q-learning and why q-learning on large maps is difficult.
After learning, QLearningAgent will output a CSV file named result.csv in the current
working directory. You can open it with Microsoft Excel or the Python package pandas and
plot it with matplotlib.
Subtask 2: Experiment with ε
Run frozen_lake_8x8.json with γ = 0.99, α = 0.05, and ε = 0.01, 0.1, 0.2, and 0.5 for
2,000 iterations and plot a figure with 4 graphs, where the x-axis is the iteration and the
y-axis is the expected total discounted reward. Each graph should be the average over 3
learning episodes. Make sure to clearly label each graph with the value of ε. Summarize your
findings on how ε affects the results of q-learning.
Subtask 3: Experiment with Counting-Based Exploration
Implement the counting-based exploration strategy, which prefers to execute actions that
have not been executed many times yet. It augments the q-value with the number of times
that action a has been executed in state s as q′(s, a) = q(s, a) + β N(s, a)^(−1/2) [1]. The
modified q update rule becomes

    q′(s, a) = q′(s, a) + α (r(s, a, s′) + γ max_{a′} q′(s′, a′) − q′(s, a)),    (6)

and the optimal policy is argmax_a q′(s, a). You will need to
• maintain N(s, a) for each state s and action a;
• update N(s, a) in update(); and
• modify getAction() and getQValue() to include the counting-based exploration strategy
while disabling ε-greedy exploration (a sketch follows this list).
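A minimal sketch of the exploration bonus is given below; it is not the required implementation.
The helper getAugmentedQValue and the members m_counts and m_beta are hypothetical names
(the assignment asks you to fold the bonus into getQValue() and getAction() instead), and
m_qValues is the assumed q-table from the earlier sketch.

// Sketch of the counting-based exploration bonus (not the required implementation).
// Assumed members (names are illustrative):
//   std::map<GameState, std::map<Action, int>> m_counts;   // N(s, a), default 0
//   double m_beta;                                          // the beta hyperparameter
// The augmented value q'(s, a) = q(s, a) + beta * N(s, a)^(-1/2) would be used in
// getQValue() and for action selection in getAction(); giving unvisited pairs
// (N = 0) a large bonus so that they are tried first is one possible convention.
// Requires <cmath> and <map>.
double QLearningAgent::getAugmentedQValue(const GameState &state,
                                          const Action &action) {
  const double q = m_qValues[state][action];
  const int n = m_counts[state][action];
  const double bonus = (n > 0) ? m_beta / std::sqrt(static_cast<double>(n)) : 1e9;
  return q + bonus;
}

// In update(), also count the executed state-action pair in addition to applying
// the q update of Equation (6):
//   m_counts[state][action] += 1;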
Run frozen_lake_8x8.json with γ = 0.99, α = 0.05, and different values of β ≥ 0 for 2,000
iterations and plot a figure with 2 graphs, where the x-axis is the iteration and the y-axis is the
expected total discounted reward. One graph is for q-learning with the ε-greedy exploration
strategy for the best value of ε that you found, and the other graph is for q-learning with
the counting-based exploration strategy for the best value of β that you found. The graph
for q-learning should be the average over 3 learning episodes. Make sure to clearly label
each graph with the exploration strategy and the value of ε or β used. You should be able
to achieve the optimal policy for an appropriate β. We will grade based on how close the
performance of q-learning with counting-based exploration comes to the optimal performance,
which can be computed with value or policy iteration. Summarize your findings on how the
exploration strategy affects the results of q-learning.
Submissions
You need to submit your solutions through the blackboard system by Monday, November 25th,
11:59pm (which is the time when the grace period starts). For the programming part, submit
your code in a zip file named "code.zip" and your report as a PDF file named "report.pdf"
that contains the answers to all questions in Section 2.3 and the figures and insights for the
tasks in Section 6. Upload the two files individually and not as an archive. If you have any
questions about the project, you can post them on Piazza or ask a TA directly during office
hours. If you have a bug in your code that you are unable to find, you can ask a CP for help.
References
[1] Strehl, A. L., and Littman, M. L. An analysis of model-based interval estimation
for Markov decision processes. J. Comput. Syst. Sci. 74 (2008), 1309–1331.