FIT 5226 (2022) Assignment 2

Reinforcement Learning

Due midnight, September 11, 2022

This assignment is for a total of 25 marks and accounts for 15% of your final mark.

A note on assignments and tutorials: During assignment periods tutorials are structured to give you

the maximum support for solving the assignment tasks. We will discuss the assignment in full detail

and guide you in solving similar tasks to give you a solid grounding for your own work. You can also

discuss your questions and ideas for the assignments with your tutors. Of course, there are limitations

- tutors can’t give you solutions but they can vet your ideas and approaches and they can help you to

overcome difficulties with coding.

The tutorial in Week 6 will lay the groundwork for defining an MDP and for solving it using an appropri-

ate library.

There are three bonus marks for the implementation part. Bonus questions are optional and can help

you make up for a shortfall in other tasks in this or other assignments. You can reach 25 marks (i.e.

100% of the assignment) without the bonus question and bonus marks will be added to the total of all

your assignment results.

You can submit solutions in Python or in Mathematica and your code must be submitted in the form of

a Jupyter Notebook or a Mathematica Notebook. If you are using Python you should use PyMDPtool-

box . We are only using value iteration to solve the MDP in this assignment. PyMDPtoolbox supports

this directly so that you will not have to implement this yourself. Note that q-learning appears to be

broken in the current version of PyMDPtoolbox. You will not have to use q-learning in this assignment.

If you prefer to do so, you are allowed to use any other MDP solver of your choice (at your own risk).

1. James Bot 007 [25 marks]

In the following you will define an MDP for a dynamic grid world and solve it using a predefined

MDP solver.

MI6 has decided that James Bond’s habit of fast cars and luxury is becoming too expensive: it’s time

to replace him with a robot. James Bond’s trusty old colleague Q, the inventor of the wonder

weapons arsenal, has been charged with the task of devising this robot, code-named James Bot.

Since it is expected that James Bot will have to face a laser chamber protecting a secret vault in its

first assignment, Q has decided to tackle this part of the problem first.

A laser chamber is a room that is protected by a laser-based security system. The security system

fires random laser beams into the room and if one of the beams hits the intruder, the alarm system

is triggered. The agent has to traverse this room from a given entry point to the position of the vault

without triggering the security system (which would have very unpleasant consequences, given the

vault does, of course, belong to a super villain).

If you have never been in a laser chamber yourself or at least seen one, you can take inspiration

from this link to a video on YouTube. Of course, James Bot cannot even dream of achieving this

level of elegance.

Like any good engineer, Q decides to to develop the AI for the bot by first tackling the problem in a

drastically simplified form. To this end, he defines a grid world of size with an entry point in the

bottom-right corner and the vault in the top-left corner. James Bot has to start at the entry point

and reach the vault without being hit by a laser beam. For each time step, the bot is allowed to

make a single 1-step move (which can be any of north, south, west, east, or rest). After the bot has

made its move, a single laser beam is fired into the room. If this beam hits the bot it is considered

dead. This ends the episode. It’s a terminal outcome from which the robot cannot recover. If it’s not

hit it can make its next move. See below for an illustration of point in time as the bot moves

through the chamber.

Laser beams go in a straight line and can be fired along a column, a row, or any diagonal (not just

the main diagonals). We will work in a 3x3 grid world. To clarify, here are all diagonals that follow

the direction of the major main diagonal:

beam type 1 beam type 2 beam type 3

beam type 4 beam type 5 beam type 6

2 Assignment 2.nb

beam type 7 beam type 8 beam type 9 beam type 10 beam type 11

beam type 12 beam type 13 beam type 14 beam type 15 beam type 16

For the purpose of this exercise the bot will need to reach the target in at most 10 steps. Exactly 10

beams will be fired during this process, one after each step. Q decides to select the directions of the

ten beams randomly before the training and make them available to the program at the start of the

training.

Your task is to help Q implement a learning mechanism for James Bot that can master this laser

chamber.

Tasks

Design [9 marks]

1.1 [1 mark] Define the action set for the robot.

1.2 [2 mark] Q suggests to simply use the position of the bot as the state, which he has seen done in

many other grid worlds. You suggest instead to use a tuple of position and time step. Give the

reasons why Q’s idea cannot work but yours does.

1.3 [3 marks] Give the reward definition in plain language taking into account that our solver will

only be able to handle infinite episodes. How will you handle this given that the robot is only

allowed a fixed number of steps?

1.4 [3 marks] Write down the full state-transition matrix and the reward function for the laser

chamber problem given in the following. Note that If you first solve the programming parts below,

you can do this automatically. If you don’t write the code you can still solve this sub-task manually

instead.

To make it possible for you to do this manually we restrict the problem for this sub-task (and only

this subtask) to a 2x2 grid. Start and vault are still in the bottom--right and top-left corners, respec-

tively. We give the beams fired in the order of time steps graphically below.

Out[ ]=

fired after step 1 fired after step 2 fired after step 3 fired after step 4 fired after step 5

Assignment 2.nb 3

Implementation [12 marks]

Q has heard of MDPs and value iteration, so he asks you to implement it this way. Note that you

need to map the (position, timestep) state tuples to plain state numbers for this. Your program

must work for a 3x3 grid world. beams will be given as a list of beam type numbers as defined

above.

1.5 [2 mark] Define a function reward(state, move, beams) that returns the reward for a given move

in a given state (see Subtask 1.2 for the definition of states) and a given list of beams.

1.6 [2 mark] Definite a function nextState(state, move, beams) that returns the next state for a given

move in a given state and a given list of beams.

1.7 [3 mark] Define a function transitionMatrix() that generates the complete transition matrix for

this MDP as a nested list of lists indexed by states.

1.8 [3 mark] Define a function rewardMatrix() that generates the complete transition matrix for this

MDP as a nested list of lists indexed by states and actions.

1.9 [2 mark] Define function approach(beams) that takes a list of beams as the argument and uses

value iteration in the PyMDPtoolbox solver to learn an optimal way to navigate the grid for the given

set of beams, which will be passed as a list of beam type numbers as explained above. You function

should return the optimal list of actions the robot takes.

[for 3 bonus marks] Extend your program to produce an animation that clearly shows the grid, the

robot’s position and the ray fired for each time step.

Critique of Q’s Design Thinking [4 marks]

1.10 [2 mark] After you have dutifully implemented Q’s idea to use value-iteration for this problem

you realise that this is a fundamentally flawed idea in the first place. It works in Q’s example task

but it will never be able to be used in practice. The robot is doomed to fail on its first mission. Give

the reasons why this is a fundamentally flawed idea.

1.11 [2 mark] After considering your thoughts on the use of value-iteration, Q goes back to the

drawing board and attempts to fix the problem by inventing a new learning algorithm. He calls this

“Q-learning” in honor of himself. This happens to be the same q-learning you know from our

course. (Of course, we have slightly deviated from the truth, Q-Learning was invented by Chris

Watkins in 1989). Part of Q’s design thinking for q-learning is right, but he forgot something else…

Give the reasons why the new James Bot AI equipped with Q-learning will still fail miserably in a

real situation when trying to conquer a new laser chamber.

... and so James Bond will remain indispensable for another sequel.

4 Assignment 2.nb

欢迎咨询51作业君