
UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS

INFR11010 REINFORCEMENT LEARNING

Wednesday 18th May 2022, 13:00 to 15:00

INSTRUCTIONS TO CANDIDATES

1. Note that ALL QUESTIONS ARE COMPULSORY.

2. DIFFERENT QUESTIONS MAY HAVE DIFFERENT NUMBERS OF TOTAL MARKS. Take note of this in allocating time to questions.

3. This is an OPEN BOOK examination.

4. CALCULATORS MAY BE USED IN THIS EXAMINATION.

MSc Courses

Convener: A.Pieris

External Examiners: A.Cali, V.Gutierrez Basulto.

THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

 

1. (a) Is the following statement true or false? Explain your answer.

• If a policy π is greedy with respect to its own value function vπ, then it is an optimal policy.

[3 marks]

(b) In TD(0) policy evaluation, if we use a constant learning rate α > 0, is TD(0) guaranteed to converge to the correct value function vπ for a fixed policy π? Explain your answer.

[4 marks]

(c) In temporal-difference (TD) learning methods, what is the trade-off between using a large vs. small constant learning rate α?

[4 marks]

(d) Give the Sarsa learning rule. Based on this learning rule, explain why Sarsa is an on-policy method.

[4 marks]

(e) Comparing rollout planning with forward vs. backward updating, which method typically converges faster to the optimal value function q∗? Explain your answer.

[4 marks]

(f) The REINFORCE learning rule is defined as

Gt ∇π(At|St, θ) / π(At|St, θ)

where Gt is the return and π is the policy with parameters θ. Explain the purpose of each of the three factors (Gt, ∇π(At|St, θ), and 1/π(At|St, θ)) in this learning rule (1-2 sentences for each factor suffices).

[6 marks]
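For concreteness, the rule above is equivalent to Gt ∇ log π(At|St, θ), since ∇π/π = ∇ log π. The following is a minimal illustrative sketch of the resulting gradient-ascent update for a tabular softmax policy; the function names, array layout, discount factor, and step size are hypothetical choices rather than anything prescribed by this paper.

    import numpy as np

    def softmax(prefs):
        # Numerically stable softmax over a vector of action preferences.
        z = np.exp(prefs - prefs.max())
        return z / z.sum()

    def reinforce_update(theta, episode, gamma=0.99, alpha=0.1):
        """Apply theta += alpha * G_t * grad log pi(A_t|S_t, theta) once per time-step.

        theta:   array of shape (n_states, n_actions), tabular softmax parameters;
                 states and actions are integer indices.
        episode: list of (state, action, reward) tuples from one rollout.
        gamma, alpha: hypothetical discount factor and step size.
        """
        rewards = [r for (_, _, r) in episode]
        for t, (s, a, _) in enumerate(episode):
            G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))  # return G_t
            pi = softmax(theta[s])
            grad_log_pi = -pi          # grad of log pi(a|s) w.r.t. theta[s] for softmax
            grad_log_pi[a] += 1.0      # equals (grad pi) / pi, the two factors in the rule
            theta[s] += alpha * G * grad_log_pi
        return theta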


 

2. Please solve all of the problems below.

(a) Monte Carlo & Episode Termination

Assume an agent acting under a given policy for some unknown Markov Decision Process. You are presented with the following trajectory from an episode:

• {state 1, action 1, +1}, {state 1, action 1, +1}, {state 2, action 1, +4}, {state 4, action 1, 0}, {state 3}

i. Use Monte Carlo every-visit Policy Evaluation (use the averaging rule) with a discount factor of 0.5 to compute an estimate of the action-value function for the policy the agent is following (you do not need to know what the policy is), using the above episode’s trajectory. Make sure you explicitly state what samples you are extracting from the trajectory for each of the state-action pairs and what the estimate is after each update.

[6 marks]

ii. Monte Carlo Policy Evaluation (MCPE) assumes terminating episodes. Assume a Markov Decision Process that has been modelled to include an absorbing state. If we change the MCPE algorithm to terminate when the agent has arrived 3 times at the absorbing state, will this negatively affect our computations? Explain why.

[4 marks]

(Note: Disregard any concerns about computational time).
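Purely for reference, the every-visit averaging computation asked for in part i can be mechanised as below. The sketch is a generic implementation of every-visit Monte Carlo action-value estimation; the function name and the short example episode at the end are hypothetical and are not the trajectory (or the answer) from the question.

    from collections import defaultdict

    def every_visit_mc_q(episode, gamma):
        """Every-visit Monte Carlo action-value estimation with the simple averaging rule.

        episode: list of (state, action, reward) steps; the terminal state carries
        no action or reward and is therefore omitted.
        Returns a dict mapping (state, action) -> average of the sampled returns.
        """
        sampled_returns = defaultdict(list)
        G = 0.0
        # Sweep backwards so G always holds the discounted return from the current step.
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            sampled_returns[(state, action)].append(G)
        return {sa: sum(gs) / len(gs) for sa, gs in sampled_returns.items()}

    # Hypothetical example episode (not the one in the question):
    print(every_visit_mc_q([("s1", "a1", 2), ("s2", "a1", 0), ("s1", "a1", 1)], gamma=0.5))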


 

(b) Policy & TD-Learning

Assume we are evaluating the following policy π for some Markov Decision Process:

Policy for Evaluation

π(a|s)     action 1    action 2
state 1    1           0
state 2    0.5         0.5

  Assume the following trajectory for the same Markov Decision Process: {state 1, action 1, +1}, {state 2, action 1, +1}, {state 1, action 2, −10}, {state 3}

i. Is the above trajectory possible under the given policy? Explain why.

[2 marks]

ii. We are looking to use an off-policy Monte Carlo Policy Evaluation algorithm for evaluating the given policy π. We are also handed the following policy μ to use as our behaviour policy.

Behaviour Policy

μ(a|s)      state 1    state 2
action 1    1          0
action 2    0          1

Unfortunately, μ is not an appropriate behaviour policy for our evaluation. Explain why this is the case. Then, propose a new behaviour policy μ′ that would be appropriate for the task.

[3 marks]

iii. Q-Learning is an Off-Policy Temporal-Difference Control algorithm. Assume an initial estimate of 0 for each state-action pair, a discount factor of 0.9, and a learning rate of 0.5.

Use Q-learning to update your estimates of the action-value function Q(s,a) on the trajectory given above. Show the update at every time-step, and the calculations in detail.

[5 marks]
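For reference, the tabular Q-learning update used in part iii is Q(s,a) ← Q(s,a) + α [r + γ max over a′ of Q(s′,a′) − Q(s,a)], with the max term taken to be 0 when s′ is terminal. A minimal sketch of one such update is given below; the dict-based representation and function name are illustrative choices, while α = 0.5 and γ = 0.9 are the values given in the question.

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, terminal=False):
        """One tabular Q-learning update on Q, a dict mapping (state, action) -> value.

        Missing entries are treated as the initial estimate 0. If s_next is terminal,
        its value is taken to be 0, so the target is just the immediate reward r.
        """
        q_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in actions)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)
        return Q

    # Hypothetical usage: one update for (s, a) = (1, 1) with reward +1 and next state 2.
    Q = {}
    q_learning_update(Q, s=1, a=1, r=1, s_next=2, actions=[1, 2])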


 

(c) Modelling

Assume the following problem to be modelled as a Markov Decision Process (MDP):

An elevator can move between 3 floors: floor 0 (or “ground floor”), floor 1, and floor 2. The elevator has space for 1 person only, and each floor’s elevator waiting area also allows for 1 person at maximum. When an elevator arrives at a floor the following events are assumed to happen in order:

i. If there is a person in the elevator and this is the floor they requested, that person exits and disappears (they do not take up space in the floor’s waiting area).

ii. If the elevator is empty (after the above event) and there is a person in this floor’s waiting area, that person enters the elevator (therefore also leaving the waiting area). If a person enters the elevator, they then make a request to move to one of the other two floors (with some known probability of requesting each floor).

iii. For every waiting area, if that waiting area is empty (after the above events) a person may arrive at that waiting area (with some known probability). They then request (“call”) an elevator.

A person waiting in any of the 3 floors’ waiting areas makes a request (“calls an elevator”) to move towards one of the other 2 floors, in accordance with the following rules:

• A person in the ground/floor 0 waiting area can only request an elevator “going up” (meaning they are either going to floor 1 or floor 2, but they cannot specify to which floor exactly before having entered the elevator).

• A person in the floor 2 waiting area can only request an elevator “going down” (meaning they are either going to floor 1 or floor 0, but they cannot specify to which floor exactly before having entered the elevator).

• A person in the floor 1 waiting area can request to either “go up” (to floor 2) or “go down” (to floor 0).

We are interested in controlling the elevator such that people are brought to their desired floor with the minimum amount of waiting (whether in the waiting area or in the elevator).

Define a state space for this MDP.

(hint 1: define any relevant state variables and each one’s domain)

(hint 2: a correct answer will also accurately represent whether there is a person in each waiting area, and what their request is).


[5 marks]
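One possible, purely illustrative way to encode such a state in code is sketched below. It shows one modelling choice under stated assumptions, not necessarily the intended answer; all field names and value conventions are assumptions introduced here for illustration.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class ElevatorState:
        # Illustrative encoding only; field names and conventions are assumptions.
        elevator_floor: int               # 0, 1, or 2
        passenger_request: Optional[int]  # floor requested by the person inside, or None if empty
        call_floor0: Optional[str]        # None = nobody waiting, otherwise "up"
        call_floor1: Optional[str]        # None, "up", or "down"
        call_floor2: Optional[str]        # None = nobody waiting, otherwise "down"

    # Example: elevator at floor 1 carrying someone who requested floor 0,
    # with a person waiting on floor 2 who has called "down".
    s = ElevatorState(elevator_floor=1, passenger_request=0,
                      call_floor0=None, call_floor1=None, call_floor2="down")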

 

 
