SDSC4001 (Semester B, 2022) Foundation of Reinforcement Learning
Assignment 2

All questions are weighted equally. For questions that require Python, please submit the .py file, not a screenshot of the code.

Question 1. Use the idea of the Bellman operator T_pi to solve the following equation:

    v_pi = [1, 2, 3]^T + gamma * P * v_pi,   where  P = [ 3/4    1/4        0
                                                          1/4    3/4 - eps  eps
                                                          0      1 - eps    eps ]

and v_pi is in R^3. By applying T_pi, solve the above problem (in Python) for all combinations of gamma = 0.9, 0.999 and eps = 0.5, 0.01. What is the difference in terms of convergence? Discuss your results and please submit your Python code.

Question 2. As a stock enthusiast, Tom decides to participate in the stock market and make extremely risky investments every Friday.

• He plans to start with an initial budget of $300, and he will stop when he either (i) loses it all or (ii) doubles his money or more (i.e., reaches ≥ $600).
• On each Friday, he invests in expiring option contracts that either (a) give him an immediate reward at the end of that day or (b) lose all the money he invested that day.
• He has two strategies. On each Friday, he chooses either
  Strategy A: Invest $100. With probability 0.45, it returns $200 (i.e., net gain $100), and $0 (i.e., net gain -$100) otherwise.
  Strategy B: Invest $100. With probability 0.4, it returns $300 (i.e., net gain $200), and $0 (i.e., net gain -$100) otherwise.
• His discount factor is gamma = 1 (or you may use gamma = 0.99999 as an approximation for computation).

Based on the above information,

(a) Model this problem as an MDP. Define the state space, action space, reward function, and transition kernel.

(b) Suppose Tom chooses one strategy (A or B) at the beginning and sticks with it until the process stops. His good friend, Pete, claims that Strategy A is the better choice. Do you agree with him? Explain your answer. You may use Python to help with the computations; in that case, please submit your Python code as part of your submission.
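As a starting point for Question 1, the Bellman operator can be applied as a simple fixed-point iteration v <- r + gamma * P v until successive iterates stop changing. The sketch below assumes the reconstructed matrix P above and a tolerance of 1e-8; both are choices you may tune, not part of the question statement.

```python
import numpy as np

def bellman_iterate(gamma, eps, tol=1e-8, max_iter=1_000_000):
    """Iterate v <- r + gamma * P @ v until convergence; return (v, iterations)."""
    r = np.array([1.0, 2.0, 3.0])
    P = np.array([[3 / 4, 1 / 4, 0.0],
                  [1 / 4, 3 / 4 - eps, eps],
                  [0.0, 1 - eps, eps]])
    v = np.zeros(3)
    for k in range(1, max_iter + 1):
        v_new = r + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, k
        v = v_new
    return v, max_iter

for gamma in (0.9, 0.999):
    for eps in (0.5, 0.01):
        v, k = bellman_iterate(gamma, eps)
        print(f"gamma={gamma}, eps={eps}: v={v.round(4)}, iterations={k}")
```

Comparing the iteration counts across the four (gamma, eps) combinations is one concrete way to discuss the convergence behaviour; the fixed point can be cross-checked against the direct solution of (I - gamma * P) v = r.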
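For Question 2(b), one possible approach is iterative policy evaluation of each fixed strategy. The sketch below encodes one hypothetical MDP formulation (it is an assumption, not the unique answer to part (a)): states are budgets 0, 100, ..., with 0 and >= 600 absorbing, and the reward is the net gain on each Friday.

```python
GAMMA = 0.99999  # approximation for gamma = 1, as the assignment suggests

# Hypothetical encoding: win probability and net gain on a win;
# a loss always costs the $100 invested.
STRATEGIES = {
    "A": (0.45, 100),
    "B": (0.40, 200),
}

def evaluate(strategy, tol=1e-10):
    """Iterative policy evaluation of a fixed strategy; returns v by budget."""
    p_win, gain = STRATEGIES[strategy]
    v = {s: 0.0 for s in range(0, 700, 100)}  # key 600 stands for ">= 600"
    while True:
        delta = 0.0
        for s in range(100, 600, 100):        # transient (non-absorbing) states
            win_state = min(s + gain, 600)    # cap: >= $600 means he stops
            new = (p_win * (gain + GAMMA * v[win_state])
                   + (1 - p_win) * (-100 + GAMMA * v[s - 100]))
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

for name in STRATEGIES:
    print(f"Strategy {name}: v(300) = {evaluate(name)[300]:.2f}")
```

Comparing v(300) under the two strategies gives a quantitative basis for agreeing or disagreeing with Pete.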
Please include code that prints/displays your result(s).

Note: Questions 3 and 4 require the book "Reinforcement Learning: An Introduction" (2nd edition), which can be found here: http://incompleteideas.net/book/RLbook2018.pdf

Question 3. Choose either (a) or (b) below.

(a) Write Python code to reproduce the right figure in Example 6.2 (Random Walk) on page 125 of the book. Please submit your code.

(b) Write Python code to reproduce the lower figure in Example 6.6 (Cliff Walking) on page 132 of the book. Please submit your code.

Question 4. Write Python code to reproduce Figure 13.1 on page 328 of the book (a softmax policy is used). Please submit your code.
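For Question 3(a), the core of the experiment is TD(0) value estimation on the 5-state random walk of Example 6.2 (states A..E, episodes start in C, only the right terminal pays reward 1, true values 1/6..5/6). The sketch below shows just that core; reproducing the full figure additionally needs constant-alpha Monte Carlo, several step sizes, and RMS error averaged over many runs and plotted against episode number.

```python
import random

TRUE_V = [i / 6 for i in range(1, 6)]  # true values of A..E: 1/6 .. 5/6

def td0_random_walk(episodes=100, alpha=0.1, seed=0):
    """TD(0) estimates of the 5 non-terminal state values."""
    rng = random.Random(seed)
    v = [0.0] + [0.5] * 5 + [0.0]      # indices 0 and 6 are terminal
    for _ in range(episodes):
        s = 3                          # every episode starts in C
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == 6 else 0.0   # only the right terminal pays
            v[s] += alpha * (r + v[s2] - v[s])  # TD(0) update, gamma = 1
            s = s2
    return v[1:6]

def rms_error(estimates):
    return (sum((a - b) ** 2 for a, b in zip(estimates, TRUE_V)) / 5) ** 0.5

v = td0_random_walk()
print("estimates:", [round(x, 3) for x in v])
print("RMS error:", round(rms_error(v), 3))
```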
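For Question 4, the environment behind Figure 13.1 is the short-corridor gridworld: 4 states (0 = start, 3 = goal), reward -1 per step, and in state 1 the effect of the two actions is reversed, so the optimal policy must be stochastic. The sketch below implements one REINFORCE pass with a state-independent softmax over two action preferences, as in the book's example; the step cap and episode counts are my own choices, and reproducing the figure additionally requires averaging total reward per episode over many runs for alpha = 2^-12, 2^-13, 2^-14.

```python
import math
import random

def step(s, a):
    """Environment step; a is -1 (left) or +1 (right)."""
    if s == 1:
        a = -a                         # state 1 reverses the actions
    return max(0, s + a), -1.0

def softmax_probs(theta):
    """Softmax over the two action preferences (left, right)."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce(episodes=1000, alpha=2 ** -13, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                 # action preferences (left, right)
    returns = []
    for _ in range(episodes):
        s, traj = 0, []
        while s != 3 and len(traj) < 1000:   # cap pathologically long episodes
            p = softmax_probs(theta)
            a_idx = 0 if rng.random() < p[0] else 1
            s, r = step(s, (-1, 1)[a_idx])
            traj.append((a_idx, r))
        g = sum(r for _, r in traj)    # episode return (gamma = 1)
        returns.append(g)
        for a_idx, r in traj:          # per-step update, as in the book's pseudocode
            p = softmax_probs(theta)
            # gradient of log softmax: one-hot(action) - probabilities
            grad = [(1.0 if i == a_idx else 0.0) - p[i] for i in range(2)]
            theta = [t + alpha * g * gr for t, gr in zip(theta, grad)]
            g -= r                     # g is now the return from the next step
    return returns

rets = reinforce()
print(f"average return over last 100 episodes: {sum(rets[-100:]) / 100:.1f}")
```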