程序辅导案例 > Program >

代写辅导接单-COMP0037

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

COMP0037 2024/25 Coursework 2

COMP0037 Teaching Team

14th March 2025

Overview

• Coursework 02 Part 01 Release Date: Friday, 14th March 20245

•Assignment Due Date: Wednesday, 23rd April (16:00 UK time) 2025

• Weighting: 60% of the module total

•Final Submission Format. Each group must submittwothings:

1. A zip file (which contains the source code implemented to tackle this coursework).

The name of the zip file must be of the formCOMP0037_CW2_GROUP_g.zip,

wheregis the letter code of your group.

2. A report in PDF format. It will be namedCOMP0037_CW2_GROUP_g.pdf.

Note that a penalty of 2% will be applied if the files are submitted with the wrong

name.

The total marks for each question are written in the form [Xmarks].

Approximately 67% of the total coursework marks are from your analysis and 33% from

your coding of the algorithms. Therefore, please take time and care with your writing and

analysis of solutions.

For the submitted code, you must use the notation variable names and code provided in the

lectures. If you do not use these, we will consider the work a potential plagiarism case and

investigate further.

You will need to implement additional routines to support the functionality required in these

questions. The code to implement this is available on Moodle. It is based upon but extends and

refactors the code provided in the labs. The general areas requiring adjustment will be shown

using comments.

As discussed in Lecture 00 (Overview), we expect you to take care in writing your report.

For example, figures and graphs must have captions and be numbered and labelled. Equations

captured by a screenshot, from the slides or elsewhere, and pasted into the document are of low

quality and are likely to cause a penalty for your marks. When we ask you to write some code,

provide an explanation in your report of what you wrote.

The final mark will be computed by summing all the marks awarded on individual questions

and dividing by the mark total.

The code will not be independently marked but must be provided before the submission

deadline. It might be checked, and if it does not run or produce the results presented in the

report, it can impact your marks.

Use Case

The use case continue the scenario you encountered in Coursework 01. A robot has been tasked

with automatically cleaning various terminals in an airport. Airports are generally very busy

places and require constant cleaning.

In this coursework, we will extend this in two ways:

1. Part 1: Use model-free methods to implement a low-level grid-based planner. This will

use discrete tabular methods to plan the overall trajectory of the robot.

2. Part 2: Use model-free methods to implement the low-level control of the robot. This will

use continuous-valued variables, and so function approximation must be used.

Note that we have had problems getting Part 2 to work properly over the past couple

of years. If we are still unable to get it to work, the coursework will only consist of Part

1 and marks will be scaled accordingly. The decision about whether to release Part 2 will

be made by Friday, 21st March 2025.

2CONTINUED

Part 1

The airport still hasn’t learned its lesson and is still using the noisy robot from Coursework 1.

As a reminder, when commanded to move in a given direction, it is not guaranteed to move

in that direction. Rather, as illustrated in Figure 1, the robot moves in the nominal direction

with a probabilitypor one square on either side of the nominal direction with a probability

q= 0.5(1−p).

Figure 1: Examples of the errors in the process model. In both cases, the original robot position

is the black square. The nominal direction of motion is the dark grey square. The errors in the

robot mean that the robot will either end up in the dark grey square or the light grey square.

The engineers propose to use a model-free approach to learn a policy using a grid-based

controller to plan the movement of the robot. This uses a simplified version of the robot state

and the environment. The robot’s state consists of its position expressed as the grid coordinates

(x,y).

Two methods for model-free evaluation and control have been proposed: Monte Carlo meth-

ods, and Temporal Difference (TD) Learning.

1.a. What are the on-policy and off-policy Monte Carlo-based approaches for estimat-

ingv

(s)? What are the main differences between them and the model-based ap-

proaches encountered in the first part of the module? What are the potential issues

with using MC-style learning?

[3 marks]

3TURN OVER

Both the on-policy and off-policy MC algorithms are to be used to estimatev

(s)for a

simple scenario and a single policy. In this scenario, the policy evaluator from Lecture 07

can be used to provide a ground truth solution.

b. Run the scriptq1b.pyand compare the effects of both first visit and every visit

strategies on the MC solutions. By discussing the results of each, which algorithm

appears to perform better? How does changing the number of episodes change the

results?

Hint:you should consider a way toquantitativelyevaluate the difference between

the ground truth and different predictor algorithms.

[6 marks]

c. Off policy MC policy prediction algorithms use a target policy and a behavior policy.

Explain what these policies are. Using the scriptq1

c.py, test several different

values ofε

to explore the behaviour. Are there any choices ofε

where the algorithm

will fail (crash or throw an exception)? Are there any choices ofε

where the system

converges extremely slowly? Explain your reasoning for both answers.

[4 marks]

d. Temporal Difference Learning (TDL) is another sample-based approach for eval-

uating policies and learning controllers. Describe a TD learning algorithm to es-

timatev

(s). What are the advantages or disadvantages of TD compared to MC

approaches. Are TD learning algorithms on or off policy? Is that good or bad?

[4 marks]

e. ModifyTDPolicyPredictorintd/td

policypredictor.pyto finish

implementing the Tabular TD(0) algorithm for estimatingv

π(s)

. Explain your im-

plementation.

[6 marks]

4CONTINUED

f. Using the scriptq1f.py, try several values ofαin the TD algorithm. How do

these values affect the rate of convergence? Do they influence the final values the

solutions converge to? Why do you think you get this result?

[5 marks]

g. Being able to figure out the policy directly from observing a set of episodes is very

useful for behaviour cloning and learning by example. One way to do this is to use

the Monte Carlo / TD learned prediction ofv

(s)to reverse engineer the policyπ

that the robot is running. What algorithm would be used to work out the policy?

What do you need to know to be able to apply this?

[4 marks]

[Total 33 marks]

5TURN OVER

Some policies now need to be designed to guide the movement of the robot through the

airport. In this case, the policy needs to take the robot from an arbitrary start location and bring

it to one of a set of terminal states. TD algorithms will be used to determine the policies.

2.a. Q-Learning is described as an off-policy control algorithm. Briefly describe Q-

learning algorithm for TD(0) and explain why it is off-policy.

[2 marks]

b. Modify theQLearnerclass to complete the implementation of the Q-learning al-

gorithm. Explain your code.

Hint:You might find the small scenarios provided inq2

f.pyhelpful for debug-

ging your code and illustrating that it functions correctly.

[6 marks]

The next few subquestions will use thecorridorscenario.

c. Using the scriptq2c.py, find and show the policy and associated value function

for thecorridorscenario. Discuss the policy and value functions. Do you

think the converged policy appears to be sensible or not? Explain your reasoning.

[4 marks]

d. One way to change the performance of the algorithm is to vary the number of it-

erations required. By examining the policy and value function for several different

choices, do you see evidence that changing the number of iterations affects the over-

all behaviour? Discuss and present the evidence for your conclusions.

[4 marks]

e. Instrumentq2

e.pyto store the time required for each episode. Present the tim-

ings. Are the timings constant or do they vary? If they vary, what do you think

causes this variation? Provide evidence to support your answer.

[5 marks]

6CONTINUED

f. The Sarsa algorithm is an on-policy controller. Briefly describe the Sarsa algorithm.

What are its potential advantages and disadvantages compared to Q-learning?

[2 marks]

g. Modify theSARSAclass to complete the implementation of the SARSA algorithm.

Explain your code.

Hint:You might find the small scenarios provided inq2f.pyhelpful for debug-

ging your code and illustrating that it functions correctly.

[6 marks]

h. Compare the performance of Q-learning and SARSA on the corridor scenario using

h.py. What do you notice about the state values and the extracted policies for

each algorithm? What do you think might be the reason for this?

[4 marks]

The following is purely optional andnomarks are awarded for it. It’s provided just in

case you might find it interesting.

The floor of the airport can be very smooth. As a result, the robot can slip with a proba-

bilityp

slip

(s). When the robot slips, it stays in the same state and has a penalty -1. With

probability1−p

slip

(s)robot does not slip. In this case it uses the noisy model above.

i. Run the scriptq2

i.pyon this example and compare how the algorithm behaves

compared with the earlier examples where the floor was not slippery. You might

want to look at things such as the computed policy, the value function, and the time

required.

[Total 33 marks]

[Coursework part 1 total 66 marks]