
ELEC-E7851 User Interfaces, Fall 2019 Aalto University

Assignment 7: Multi-Armed Bandits

General: There are two tasks in this assignment, each worth a maximum of 5 points, plus a bonus task. As always, you can choose which (if any) tasks you attempt. I recommend Task 7.1.

A7.1. Modifying the Bandits Problem and the ε-Greedy Solver (5 p) [Recommended]

During the lecture, we used Bernoulli Bandits. Here, every time an arm was pulled, the bandit returned reward 1 with some probability (its reward probability) and 0 otherwise. We constructed different strategies for solving the Bandit problem, including the ε-Greedy solver, which explored constantly with a fixed probability (ε).
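For reference, a minimal sketch of a Bernoulli Bandit and the ε-Greedy solver described above might look as follows (class and method names are illustrative and need not match those in the provided notebook):

```python
import random

class BernoulliBandit:
    """One bandit with n arms; arm i pays reward 1 with probability probas[i]."""
    def __init__(self, probas):
        self.probas = probas

    def pull(self, i):
        # Return 1 with probability probas[i], else 0.
        return 1 if random.random() < self.probas[i] else 0

class EpsilonGreedySolver:
    """Keep running reward estimates; explore a random arm with probability eps."""
    def __init__(self, bandit, eps):
        self.bandit = bandit
        self.eps = eps
        self.counts = [0] * len(bandit.probas)
        self.estimates = [0.0] * len(bandit.probas)

    def run_one_step(self):
        if random.random() < self.eps:
            i = random.randrange(len(self.counts))  # explore: random arm
        else:
            # exploit: arm with the highest estimated reward probability
            i = max(range(len(self.counts)), key=self.estimates.__getitem__)
        r = self.bandit.pull(i)
        self.counts[i] += 1
        # Incremental mean update of the estimated reward probability.
        self.estimates[i] += (r - self.estimates[i]) / self.counts[i]
        return i, r
```

Note that ε stays fixed here for the whole run; this is exactly the limitation that Step 2 below asks you to address.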

Goal: In this task, we will modify the Bandit problem to handle varying rewards, and improve upon the ε-Greedy solver.

Steps:

1. Update the Bandit: In this sub-goal, we will update the Bandit problem such that every arm returns a different reward. Modify the definition of the Bandit object such that each arm is also assigned some reward, along with the reward probability. You should be able to specify the rewards for each arm when instantiating a new object (i.e. pass them as a parameter). As a bonus, instead of having a fixed reward for each arm, you could sample the reward from a distribution. Using this updated definition of a Bandit, create 5 different instances of a Bandit problem. Set the reward probabilities and rewards for the arms such that you can observe how they influence the solution (you could use some exaggerated values to highlight the differences). Run the implemented solvers to get solutions for these 5 problems, and summarise your results.

2. Create a variation of the ε-Greedy Solver: One of the limitations of the ε-Greedy solver is that it continues exploration forever, even though it might have already found good estimates of the reward probabilities. Address this limitation by modifying the solver such that the exploration probability (ε) can be varied. Define a strategy for updating the ε value, and implement it in the solver. Please provide a rationale (in text) for your chosen strategy.
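As an illustration of both sub-steps, the sketch below combines a bandit with per-arm rewards and an ε-Greedy solver whose ε decays over time. The class names and the particular decay schedule are illustrative assumptions, not taken from the provided notebook; any well-reasoned schedule is acceptable.

```python
import random

class VariableRewardBandit:
    """Arm i pays rewards[i] (instead of always 1) with probability probas[i]."""
    def __init__(self, probas, rewards):
        assert len(probas) == len(rewards)
        self.probas = probas
        self.rewards = rewards

    def pull(self, i):
        # Arm i pays its own reward value with its own probability, 0 otherwise.
        return self.rewards[i] if random.random() < self.probas[i] else 0.0

class DecayingEpsilonGreedySolver:
    """epsilon-greedy where eps shrinks over time: eps_t = eps0 / (1 + decay * t).
    Early steps explore heavily; later steps mostly exploit the best estimate."""
    def __init__(self, bandit, eps0=1.0, decay=0.05):
        self.bandit = bandit
        self.eps0, self.decay, self.t = eps0, decay, 0
        n = len(bandit.probas)
        self.counts = [0] * n
        self.estimates = [0.0] * n  # estimated *expected* reward per arm

    def current_eps(self):
        return self.eps0 / (1.0 + self.decay * self.t)

    def run_one_step(self):
        self.t += 1
        if random.random() < self.current_eps():
            i = random.randrange(len(self.counts))  # explore
        else:
            i = max(range(len(self.counts)), key=self.estimates.__getitem__)  # exploit
        r = self.bandit.pull(i)
        self.counts[i] += 1
        # Incremental mean update of the expected reward of arm i.
        self.estimates[i] += (r - self.estimates[i]) / self.counts[i]
        return i, r
```

With eps0=1 and decay=0.05, ε falls to about 0.17 after 100 steps, so exploration fades without ever stopping completely; a schedule such as ε_t = min(1, c/t) would be another reasonable choice, and your report should justify whichever you pick.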

Materials: We have provided a Jupyter notebook with Python code for a basic implementation of the Bandits and some solvers.

Report: Submit a Jupyter notebook with your modified code. Please comment the code appropriately. Textual explanations of your rationale should be included within the notebook (as markdown).


A7.2. Design and Run an Online Experiment (5 p)

Multi-Armed Bandits are commonly used to test different design variations in the wild, to determine how they perform with users. The designer constructs the different design variations, and the Bandit system decides which of the designs it should present to the user next.

Goals: In this task, you will design and run an online experiment in which you test 5 different GUI designs with users. You will create an interactive interface that:

1. At each time step, displays a design (picked by the solver) to the user.
2. Lets the user interact and provide feedback (in the form of a reward).
3. Observes this feedback and uses it to update the solver.
4. Picks a new design and repeats the steps in the next iteration.

You should design and implement this experiment, and report your results after a given number of trials. Which of the designs performed well? How many times was each design presented to users? How did the system's choice of design evolve over time?

For a complete solution, you should implement a system that displays one of the designs visually, together with some technique for collecting user feedback. You can add interactive widgets to your Jupyter notebook using ipywidgets (https://ipywidgets.readthedocs.io/), or you can use the Tkinter library. For a partial solution, you can specify the design variations textually and simulate the user feedback.
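The partial (simulated) variant of the loop above can be sketched as follows. The function name and the "appeal" values are illustrative assumptions; in the full interactive version, the simulated click would be replaced by a real user action, e.g. an ipywidgets button callback.

```python
import random

def run_simulated_experiment(true_appeal, n_trials=1000, eps=0.1, seed=42):
    """Simulate the display/feedback loop: pick a design, show it to a
    simulated user whose click probability is true_appeal[design],
    observe the click (reward 1) or no click (reward 0), update estimates."""
    rng = random.Random(seed)
    n = len(true_appeal)
    counts = [0] * n        # how often each design was shown
    estimates = [0.0] * n   # estimated click rate per design
    for _ in range(n_trials):
        # 1. Solver picks the next design (here: epsilon-greedy).
        if rng.random() < eps:
            design = rng.randrange(n)
        else:
            design = max(range(n), key=estimates.__getitem__)
        # 2.-3. Simulated user feedback; the system updates its estimate.
        reward = 1 if rng.random() < true_appeal[design] else 0
        counts[design] += 1
        estimates[design] += (reward - estimates[design]) / counts[design]
    # 4. The loop then repeats with the next pick.
    return counts, estimates
```

The returned `counts` answers "how many times was each design presented?", and `estimates` answers "which designs performed well?"; logging the chosen design per trial additionally shows how the selection evolved over time.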

Materials: You can use the provided notebook as a starting point.

Report: Your solution (with textual explanations/comments) as a Jupyter notebook or Python program. If you create the interactive experiment, please record a short screen capture of a few experiment trials and submit the video.

A7.3. Bonus: Comparison of Cumulative Reward (1 p)

The goal of multi-armed bandits is to minimize the cumulative regret or, equivalently, maximize the cumulative reward.

Goals: Based on the provided notebook, calculate the cumulative reward over time for each solver. Construct a comparison plot of the cumulative reward for:

1. bandit1 (in the provided notebook)
2. bandit2 (in the provided notebook)
3. a third bandit, designed by you, with 6 arms whose probabilities are similar but not identical.
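The quantity to plot is simply the running total of the per-step rewards. A minimal sketch (the function name and the idea of keeping a per-step reward list are assumptions about your solver, not part of the provided notebook):

```python
def cumulative_reward(rewards):
    """Turn a per-step reward sequence into a running total,
    i.e. the curve to plot over time for each solver."""
    curve, total = [], 0.0
    for r in rewards:
        total += r
        curve.append(total)
    return curve
```

If each solver records its per-step rewards in a list, one call to this function per solver produces the curves for the comparison plot (e.g. with `matplotlib.pyplot.plot`, one line per solver, per bandit).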

Materials: You can use the provided notebook as a starting point.

Report: Your three plots and calculations as a Jupyter notebook or Python program. Add 1) a discussion of how the solvers of each bandit compare to each other, and the main reasons for this; 2) a final discussion of the differences between solvers across the three bandits, and the main reasons for these differences.