COMP0090 2020/21 Assignment 1: “Bag, not bag”

Pontus SAITO STENETORP

[email protected]

1 Instructions

For this assignment the maximum number of points obtainable is 100, which maps one-to-one for

the mark given for the assignment. The assignment are to be carried out in groups of up to five

students, which may or may not be the same groups as those for future assignments.

You are free to use any programming language and even mix programming languages between tasks.

You are not to use any automatic differentiation packages (such as Flux, TensorFlow, PyTorch, etc.),

you are however free to use any of the (sloppy) code provided in the lecture notebooks, packages

that do not implement automatic differentiation, or to implement automatic differentiation on your

own.

The assignment is due by Thursday, November 5th 2020 at 16:00 (Europe/London) and is to be

submitted via Moodle. You shall use LATEX1 to produce a single PDF document. Use the following

template as a starting point: https://www.overleaf.com/read/gyphwtvffnxj, but you are free

to improve upon it as you see fit.

Describe briefly (about a paragraph) in your submission which group member contributed to which

part to the assignment. It is expected that each group member contributes towards at least one of the

tasks, but as with all group work all members are collectively responsible for the submission in its

entirety and should confirm that they are familiar with all solutions prior to submission.

Kindly report errors (even typos) and ask for clarifications when needed, this assignment is to be

an exercise in deep learning, not mind reading. Corrections to this assignment post-release will be

listed in Section 4.

1https://www.overleaf.com/learn/latex/Learn_LaTeX_in_30_minutes

Figure 1: Examples from each of ten classes from Xiao et al. (2017), each class occupies three rows.

2 Data

A standard dataset in machine learning is the MNIST dataset of handwritten digits that has for

many years been used to train and evaluate machine learning algorithms. However, even simple

algorithms can perform relatively well on MNIST and its usage has increasingly come into question.

As a response, alternative “drop-in” replacement datasets have been made available and we will use

exactly such a dataset for all tasks in this assignment.

Zalando2 is a German e-commerce company and that has released a dataset “Fashion-MNIST” (Xiao

et al., 2017) which instead of digits from 0 to 10, contains 28-by-28 pixel, greyscale images of ten

distinct fashion items (see Figure 1 for an example subset). The module organisers have prepared a

subset of the dataset where the task is to classify a given image as to whether it depicts a bag or not.

We have also created train, validation, and test splits.

A Colaboratory notebook loading the data can be found at: https://colab.research.google.

com/drive/1Tt4LTpLm_vjwkeOwrjNkNUAf1e49pF6I.

2https://www.zalando.de

2

3 Tasks

3.1 A memory-efficient perceptron (20 points)

Freund & Schapire (1999) introduced the voted-perceptron algorithm, which scales the contribution

to its weight vectors by a factor depending on how many correct classifications it has taken part in

during training. They also demonstrated that despite its simplicity, the algorithm can be competitive

with significantly more involved algorithms such as the support vector machine. However, a major

drawback is that it has a memory complexity of O(k), where k is the number of times a sample is

misclassified during training. For this task we will consider a novel perceptron algorithm inspired

by the voted perceptron and thus suffering from the same drawback in terms of memory complexity.

Data: D := {(x1, y1), . . . , (xn, yn)}

Result: v := {(w1, c1), . . . , (wk, ck)}

w1 := 0

d;

c1 := 0;

k := 1;

while not converged do

for i := 1 . . . , n do

yˆ :=

{

1 wk · xi ≥ 0

0 otherwise

;

if yˆ = y then

ck := ck + 1;

else

wk+1 := wk + (y − yˆ)xi;

ck+1 := 1;

k := k + 1;

end

end

end

Algorithm 1: Training algorithm with bias terms omitted.

Once trained, the algorithm performs predictions as follows (bias terms omitted):

f(x) =

{

1

[∑k

i:=1 ci(wi · x)

]

≥ 0

0 otherwise

(1)

To complete this task you are expected to:

1. Derive a variant of the algorithm above with memory complexity O(1) – informally, that

the memory usage remains constant regardless of the number of misclassifications.

2. Prove that your efficient variant is functionally equivalent to the algorithm above.

3. Implement your efficient variant.

4. Train your model to convergence3 on the training set.

5. Provide a plot of the accuracy on the training set for each epoch.

6. Provide the accuracy on the training and validation set, for the epoch on which you obtain

your highest accuracy on the validation set.

3When you observe a lack of improvement on the training set over several epochs after having seen a large

amount of initial improvement.

3

3.2 Mean squared-loss logistic regression (20 points)

In our lectures we have covered using the log-likelihood loss function for logistic regression, but it

is of course possible to use a wide range of loss functions. Consider for example the mean squared

loss below which is common for regression tasks, but can also be applied for binary classification:

L = 1

n

n∑

i=1

(yi − yˆi)2

2

(2)

To complete this task you are expected to:

1. Derive the analytical gradients for the model parameters given the square loss.

2. Provide code verifying that your analytical gradients are correct using finite difference4 for

two or more samples from the training data (you are free to use an external package that

calculates the finite difference).

3. Implement the algorithm.

4. Train your model to convergence5 using full-batch gradient descent on the training set.

5. Provide a plot of the square loss on the training set for each epoch.

6. Provide a plot of the accuracy on the training set for each epoch.

7. Provide the accuracy on the training and validation set, for the epoch on which you obtain

your highest accuracy on the validation set.

4Keep in mind that the precision of your floating point type will be a major factor when using finite differ-

ence, it is thus common to use Float64 and avoid using the GPU when verifying gradients this way.

5When you observe a lack of improvement on the training set over several epochs after having seen a large

amount of initial improvement.

4

3.3 Three-layer multi-layer perceptron (20 points)

In our lectures we have introduced and implemented a two-layer multi-layer perceptron (MLP), a

natural next step would be a three-layer MLP:

W1 ∈ Rh1×282 (3)

W2 ∈ Rh2×h1 (4)

W3 ∈ R1×h2 (5)

b1 ∈ Rh1 (6)

b2 ∈ Rh2 (7)

b3 ∈ R (8)

f(x) = σ(W3σ(W2σ(W1x+ b1) + b2) + b3) (9)

To complete this task you are expected to:

1. Assuming the log-likelihood loss, derive the analytical gradients for the model parameters.

2. Provide code verifying that your analytical gradients are correct using finite difference6 for

two or more samples from the training data (you are free to use an external package that

calculates the finite difference).

3. Implement the algorithm.

4. Train your model to convergence7 using full-batch gradient descent on the training set.

5. Provide a plot of the loss on the training set for each epoch.

6. Provide a plot of the accuracy on the training set for each epoch.

7. Provide the accuracy on the training and validation set, for the epoch on which you obtain

your highest accuracy on the validation set.

6Keep in mind that the precision of your floating point type will be a major factor when using finite differ-

ence, it is thus common to use Float64 and avoid using the GPU when verifying gradients this way.

7When you observe a lack of improvement on the training set over several epochs after having seen a large

amount of initial improvement.

5

3.4 Hyperparameter tuning (20 points)

For any deep learning (or machine learning for that matter) model you will be faced with the task of

determining a set of “good” hyperparameters given a validation loss.

To complete this task you are expected to:

1. Starting with the two-layer MLP implementation we used in class, implement the follow-

ing:

(a) (Full-)batch gradient descent without momentum

(b) Stochastic gradient descent without momentum

(c) Mini-batch gradient descent without momentum

(d) (Full-)batch gradient descent with momentum

(e) Stochastic gradient descent with momentum

(f) Mini-batch gradient descent with momentum

2. Attempt to find “good” values for the following hyperparameters:

(a) Learning rate

(b) Batch size

(c) Momentum coefficient

3. List the hyperparameter values you have evaluated and describe briefly (a single paragraph)

your strategy and how you ultimately arrived at your “good” hyperparameters.

4. Provide a plot of the loss on the training and validation set for each epoch for your “good”

hyperparameters.

5. Provide a plot of the accuracy on the training and validation set for each epoch for your

“good” hyperparameters.

6. Provide the accuracy on the training and validation set for your “good” hyperparameters,

for the epoch on which you obtain your highest accuracy on the validation set.

6

3.5 Model shootout (20 points)

To prove a models viability for a dataset, it is common to first establish the performance of a “weaker”

model and show that performance can be improved using a more complex model.

To complete this task you are expected to:

1. Implement a “vanilla” perceptron model, we will refer to this as our baseline model.

2. Train your baseline model on the training set and use the validation set to determine when

to stop training.

3. Consider the models we have covered so far and make a reasonable attempt to outperform

the baseline model on the validation set.

4. Briefly describe in one or three paragraphs how you approached the task and what your

final best model is, also listing its hyperparameters.

5. Provide a plot of the loss on the training and validation set for each epoch for your best

model.

6. Provide a plot of the accuracy on the training and validation set for each epoch for your

best model.

7. Provide the accuracy on the training and validation set for your best model, for the epoch

on which you obtain your highest accuracy on the validation set.

7

4 Errata

26-10-2020 (a) Corrected Equation 1 to correspond to the Voted Perceptron.

(b) Corrected ci initialisation to 0, as opposed to 1, in Algorithm 1.

(c) Removed references and requirement to evaluate on the test set in Sections 2 and 3.5,

as this procedure caused confusion and marking concerns for a number of students.

(d) Removed duplicate sub-task in Section 3.3.

(e) Removed spurious left-over Colaboratory mentions.

27-10-2020 (a) Reverted Equation 1 to correspond to its previous state and corrected the writing to be

unambiguous in regards to the given algorithm being novel.

8

References

Yoav Freund and Robert E. Schapire. Large Margin Classification Using the Perceptron Algorithm.

Machine Learning, 37(3):277–296, Dec 1999. ISSN 1573-0565. doi: 10.1023/A:1007662407062.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Bench-

marking Machine Learning Algorithms, 2017.

9

欢迎咨询51作业君