COMP3106—Introduction to Artificial Intelligence Assignment 3 Winter 2022 School of Computer Science Carleton University Instructor: Yuhong Guo Due: Friday, April 15, 2022 at 23:59:59 EST time Worth: 13 points, i.e., 13% of final grade Submission through Brightspace: You need to submit two files: 1. One “A3.py” file that contains the functions you implemented. 2. One “A3-answer.pdf” file that contains your answers/solutions to the questions in this assignment. This PDF file should be no more than 8 pages. Please use a standard page format (i.e. page size 8.5′′ × 11′′, minimum 1/2′′ margins, minimum font size 10, no condensed fonts or spacing). Note: • Programming language: Python 3. You may use standard Python libraries and the NumPy package. You must implement the code yourself. Do not copy-and-paste code from other sources. Your implementation must follow the required specifica- tions. Implementations which do not follow the specifications may receive a grade of zero. Please make sure your code is readable. Your implementation will be evaluated on a set of test cases for correctness. • Please prepare your answers in the PDF file using computer tools. Problem In this assignment, we consider MDPs and reinforcement learning in the world of a 4 × 4 grid. One example of this grid world is shown in Figure 1: The agent’s state is described by its position in this grid world; a state s is described as [si,sj] where si is the row index and sj is the column index for the position 1 Figure 1: An example of 4 × 4 grid world of state s. This grid world has two terminal states: a victory terminal state s+ and a failure terminal state s−. In the grid world example in Figure 1, the victory terminal state (in green color) is s+ = [0, 3] and the failure terminal state (in red color) is s− = [1, 3]. The immediate reward function for the agent in this environment is 0 Reward(s, a, s′) = R+ if s′ is a non-terminal state if s′ is the victory terminal state s+ if s′ is the failure terminal state s− R− For the grid world example in Figure 1, we have R+ = 10 and R− = −10. Ateachnon-terminalstates,therearefour possible deterministic actions, A={‘L’, ‘R’, ‘U’, ‘D’}, such that: ‘L’: move left—if sj > 0, the move left action will reduce sj by 1; if sj = 0, the move left action will make no change in position. ‘R’: move right—if sj < 3, the move right action will increase sj by 1; if sj = 3, the move left action will make no change in position. ‘U’: move up—if si > 0, the move up action will reduce si by 1; if si = 0, the move up action will make no change in position. ‘D’: move down—if si < 3, the move down action will increase si by 1; if si = 3, the move down action will make no change in position. Let S(s,a) denotes the resulting state by applying an action a on state s. For example, we have S([1, 2], ‘L’) = [1, 1], S([1, 0], ‘L’) = [1, 0], S([2, 2], ‘R’) = [2, 3], S([2, 3], ‘R’) = [2, 3], S([1, 2], ‘U’) = [0, 2], S([0, 2], ‘U’) = [0, 2], S([2, 2], ‘D’) = [3, 2], and S([3, 2], ‘D’) = [3, 2]. 2 Let N (s) denote the set of neighbor states of s. Note any internal state s (i.e., 0 < si < 3 & 0 < sj < 3) has four neighbor states, such that N(s)={[si,sj −1],[si,sj +1],[si −1,sj],[si +1,sj]}; any corner state s (i.e., [0,0], [0,3], [3,0], or [3,3]) has only two neighbors, such that N ([0, 0]) = {[0, 1], [1, 0]}, N ([0, 3]) = {[0, 2], [1, 3]}, N ([3, 0]) = {[2, 0], [3, 1]}, and N ([3, 3]) = {[2, 3], [3, 2]}; any non-corner border state s (i.e., [0, x], [3, x], [y, 0], [y, 3] for 0 < x,y < 3) has three neighbors, such that N ([0, x]) = {[1, x], [0, x − 1], [0, x + 1]}, N ([3, x]) = {[2, x], [3, x − 1], [3, x + 1]}, N ([y, 0]) = {[y, 1], [y − 1, 0], [y + 1, 0]}, and N ([y, 3]) = {[y, 2], [y − 1, 3], [y + 1, 3]}. Next we define the transition model P (or T) for the grid world. We consider stochastic transitions; i.e., when taking an action a in a state s, the transition model not only will have a large probability (1 − pn) to enter into the resulting state S(s, a), but also will have a small probability pn to enter into the resulting states for two other actions that can be triggered by the given action a. Let G(a) denote the actions that can be triggered by an action a; we specify G as follows: G(‘L’)= {‘U’, ‘D’}, G(‘R’)= {‘U’, ‘D’}, G(‘U’)= {‘L’, ‘R’}, G(‘D’)= {‘L’, ‘R’}. Let M(s,G(a)) denote the set of resulting states from s when applying the two actions triggered by action a. Then M(s, G (a)) is the union of S (s, a′ ) for all a′ ∈ G (a). Since each action a triggers two other actions, each M(s, G(a)) should have two states. Moreover, when s is a corner state, S(s,a) and M(s,G(a)) can have a common state s, when both action a and a triggered action make no change in position. Overall, we define the stochastic transition model as follows: if s′ = S(s, a) and s′ ∈/ M(s, G(a)) ifs′ =S(s,a)ands′ ∈M(s,G(a)) where 0 ≤ pn ≤ 0.3 is the transition noise probability. 3 1 − pn ′ 1−pn/2 P(s, a, s ) = 0 ′ ′ pn/2 if s ̸= S(s,a) and s ∈ M(s,G(a)) if s′ ̸= S(s,a) and s′ ∈/ M(s,G(a)) With the definitions above, we can provide values for s+, s−, R+, R− and pn to define a 4×4 grid world with its specific terminal states, reward function, and transition model. In this assignment, you will need to implement MDP and reinforcement learning functions to find the corresponding utility values or policies for each specific 4 × 4 grid world. Representation We consider deterministic policy π. We represent a policy π as a nested list in Python. For example, for the grid world environment above with terminal states s+ = [0,3] and s− = [1, 3], one policy π can be written as: π =[ [‘R’, ‘R’, ‘R’, ‘X’], [‘R’, ‘U’, ‘U’, ‘X’], [‘R’, ‘U’, ‘U’, ‘L’], [‘U’, ‘U’, ‘U’, ‘U’] ] where π[si][sj] provides the action for state s under policy π. Note since the policy mapping is for non-terminal states, we put a constant ‘X’ for the terminal states. For the Q(s, a) function, we represent it as a three-dimensional matrix (i.e., a nested list in Python as well). We consider the four actions in the order of ‘L’, ‘R’, ‘U’, ‘D’. Let Qvalues denote this nested list of Q function values for the 4 × 4 grid world, then Qvalues[si][sj][0]: Qvalues[si][sj][1]: Qvalues[si][sj][2]: Qvalues[si][sj][3]: the Q(s,a) value for state s and action a=‘L’. the Q(s,a) value for state s and action a=‘R’. the Q(s,a) value for state s and action a=‘U’. the Q(s,a) value for state s and action a=‘D’. Fortheutility function U(s),werepresentitasanestedlistinPython(atwo-dimensional matrix). Let Uvalues denote the nested list of the utility values for for the 4 × 4 grid world, then Uvalues[si][sj] denotes the U(s) value for a non-terminal state s = [si,sj]. (We do not require U values for the terminal states and you can keep them as zeros.) Part 1 (Implementation —9 points) 1.1 (4.5 points) Implement the Value Iteration method (for MDP) in the 4 × 4 grid world environment introduced above: value iteration(noiseP, vState, fState, victoryR, failureR, gamma, max iternum) The function takes the following arguments: • noiseP: this is the 0 ≤ pn ≤ 0.3 used to determine the transition model P (or T ). 4 • vState and fState: they specify the positions for the victory terminal state s+ = [s+i , s+j ] and the failure terminal state s− = [s−i , s−j ] respectively. • victoryR and failureR: they are the rewards, R+ and R−, for entering the victory terminal state s+ and the failure terminal state s− respectively. • gamma: the discount factor γ for calculating the expected utility; e.g. gamma = 0.9. • max iternum: the maximum number of iterations for conducting value iteration; in each iteration, U values for all the non-terminal states are updated. The function returns one item: • Uvalues: the utility values for all the non-terminal states, in the representation format introduced above. Note: Initialize the U values for all the states as zeros. 1.2 (4.5 points) Implement the Q-Learning with ε-greedy exploration to perform reinforcement learning in the 4 × 4 grid world environment: QL explore(noiseP, initS0, vState, fState, victoryR, failureR, gamma, alpha, epsilon, max iternum) The function takes the following arguments: • noiseP: this is the 0 ≤ pn ≤ 0.3 used to determine the transition model P (or T ). • initS0: the position for initial state s0 • vState and fState: they specify the positions for the victory terminal state s+ = [s+i , s+j ] and the failure terminal state s− = [s−i , s−j ] respectively. • vicotryR and failureR: they are the rewards, R+ and R−, for entering the victory terminal state s+ and the failure terminal state s− respectively. • gamma: the discount factor γ for calculating the expected utility; e.g. gamma = 0.9. • alpha: the learning rate. • epsilon: the ε parameter for the ε-greedy exploration; e.g., epsilon = 0.2 • max iternum: the maximum number of iterations. The function returns two items in order: • Qvalue: the learned Q(s, a) value for all non-terminal state s and all actions a. • policy: the policy produced from the Qvalues: π(s) = argmax Q(s, a) a The representation formats for Qvalues and policy have been described above. 5 Algorithm Note: Here are the main points of the algorithms: • In each iteration, the agent starts from the given initial state • At each step within the iteration – The agent takes an action a from s with ε-greedy exploration from current policy – Observe new state and receive reward: the agent makes TD-update for Q(s,a) with a constant learning rate α – The agent moves into new state. If it is a terminal state, goes to the next iteration • Q values are only for non-terminal states Part 2 (Question Answering—4 points) 2.1 (2 point) Analyze the value iteration method you implemented above. Let noiseP=0.2, vState=[0,3], fState=[1,3], victoryR=1, failureR=-1. • Let gamma = 0.9, test your method with different max iternum from {10, 50, 100, 200}. Please present the U values obtained with different max iternum (show each set of U values on the cells of one grid). Did the U values converge (i.e., the max change is smaller than a very small constant such as 1e-4) after 200 iterations? • Test your method with different gamma values from {0.7, 0.8, 0.9, 0.95}. Please analyze how does the gamma value affect the convergence of value iteration? You can show the maximum change in U-value at iteration number {10, 50, 100, 200} for different gamma values and state your analysis and conclusion. 2.2 (2 points)Analyze the Q-learning method you implemented above. LetnoiseP=0.2, initS0=[3,0], vState=[0,3], fState=[1,3], victoryR=1, failureR=-1, gamma=0.9, max iternum = 200, epsilon = 0.2. • Please try different learning rate alpha from {0.1, 0.2, 0.5, 0.7} and analyze how does the learning rate alpha affect convergence. Note, you can plot the maximum Q-value change (and/or #of policy change) across iterations (e.g., 10, 50, 100, 150, 200) for each learning rate as a curve and compare these curves for different learning rates. Please choose the most suitable learning rate. • Use the learning rate chosen from above. The epsilon parameter controls the trade- off between exploitation and exploration. Please test the Q-learning with different epsilon values from {0, 0.1, 0.2, 0.5} (note when epsilon=0, there is no exploration) and analyze how does the epsilon value affect the convergence of Q-learning. 6