Applied Machine Learning
Alternative Final Assessment
COMP0081 2019–20
Always provide justification and show any intermediate work for your answers.
A correct but unsupported answer may not receive any marks.
Suitable for current and prior cohorts
1. Optimisation
(a) Define the gradient of a function and show that it points in the direction of maximal change. [3 marks]
(b) For a quadratic function f(x) = (1/2) x^T A x − b^T x with positive definite matrix A
i. Write the gradient descent updates. [2 marks]
ii. Explain which learning rates lead to the convergence of the gradient descent algorithm. [3 marks]
iii. What is the optimal learning rate and the rate of convergence for it?
[3 marks]
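To make part (b) concrete, here is a minimal NumPy sketch of the gradient descent update on a quadratic of this form; the 2×2 matrix A, the vector b, the learning rate and the iteration count are illustrative assumptions, not values from the question.

import numpy as np

# Illustrative positive definite A and vector b (assumptions for this sketch only).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    # For symmetric A, the gradient of f(x) = (1/2) x^T A x - b^T x is A x - b.
    return A @ x - b

eta = 0.1                     # learning rate; parts (ii)-(iii) ask which values converge and which is best
x = np.zeros(2)               # starting point
for _ in range(200):
    x = x - eta * grad_f(x)   # gradient descent update: x <- x - eta * grad f(x)

print("iterate:", x, "exact minimiser:", np.linalg.solve(A, b))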
(c) Explain the tradeoffs in using stochastic, mini-batch and pure gradient descent.
When would you select one over the others and why? [4 marks]
(d) Suppose you want to avoid overfitting and implement Early Stopping for your
model. You train your model with mini-batch gradient descent. After every
batch you check the validation loss and stop as soon as it goes up. Is this a
good idea? Why? [5 marks]
(e) Suppose that you have a linear regression problem with a square loss function. That is, the loss is l(y, ŷ) = (y − ŷ)², where the predictions of your model are ŷ = w^T x. The training set consists of N labeled examples (x_i, y_i), x_i ∈ R^D.
i. The problem has an analytic solution for the optimal w. What is it? [3 marks]
ii. What might be the reasons for using gradient-based optimisation rather than using the equation from the previous question? What is the connection with second-order methods? [4 marks]
iii. For sufficiently small learning rates, does batch gradient descent necessarily
converge to the global minimum of the loss function? Why is that?
[3 marks]
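For part (e), the sketch below contrasts the closed-form least-squares solution with batch gradient descent on the same loss, using synthetic data; the data-generating process, learning rate and iteration count are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                  # rows are the x_i
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy targets

# Closed-form solution of the least-squares problem (assumes X^T X is invertible).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the (mean) squared loss.
w = np.zeros(D)
eta = 0.01
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / N
    w -= eta * grad

print("agreement with the analytic solution:", np.allclose(w, w_closed, atol=1e-3))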
[Total: 30 marks]
2. Autodiff
(a) Explain the benefits of using automatic differentiation as opposed to the finite-differences method for estimating the gradient. [3 marks]
(b) Explain what the forward and reverse modes of automatic differentiation are. Explain how you can compute Jacobian-vector products using them. What is the complexity of each method? [4 marks]
(c) A softmax function y = f(x) could be computed as follows. Compute Z = Σ_{i=1}^N e^{x_i}, then compute y_i = e^{x_i} / Z.
i. Draw a computation graph on x_i, Z, y_i for this function. [3 marks]
ii. Write down the explicit expressions for reverse-mode automatic differentiation for this function. [5 marks]
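For part (c), the sketch below implements the forward computation exactly as stated (first Z, then each y_i), plus a central finite-difference Jacobian that can serve as a numerical check on whichever reverse-mode expressions are derived; the input vector is an arbitrary illustrative choice.

import numpy as np

def softmax(x):
    # Forward pass as in the question: Z = sum_i e^{x_i}, then y_i = e^{x_i} / Z.
    Z = np.sum(np.exp(x))
    return np.exp(x) / Z

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite differences; useful for checking hand-derived reverse-mode expressions.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([0.5, -1.0, 2.0])   # illustrative input
print(softmax(x))
print(numerical_jacobian(softmax, x))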
(d) Explain why the reverse mode of automatic differentiation (also known as backpropagation) is preferred to the forward mode as a method of computing the gradients for neural networks. [5 marks]
[Total: 20 marks]
3. ML competitions
(a) Suppose that you take part in a Kaggle competition where the task is binary
classification and the quality of predictions is measured by accuracy. You build
a sophisticated Neural Network model. Is it possible to directly optimise accuracy? Support your arguments. [5 marks]
(b) Suppose that in another competition the quality of predictions is evaluated
using average log-loss. You suspect that the train-test split is such that the
fraction of positive examples in the training set is significantly different from
that in (the public part of) the test set.
i. How can you verify this suspicion of yours? [5 marks]
ii. Suppose that it is indeed the case. What would that mean for your approach to validation? [5 marks]
(c) Suppose that you are designing an image recognition competition. Your dataset
consists of satellite images showing various cities and towns. The goal is to
predict whether a given image contains a certain type of road (gravel tracks, small roads, motorways, etc.). How would you split your dataset into training and test parts? [5 marks]
[Total: 20 marks]
4. Various Problems
(a) Consider the classification problem with binary labels y_i ∈ {0, 1}. You want to use a logistic regression model f(x|θ) = σ(θ^T x), where σ(t) = 1/(1 + e^{−t}) is the logistic function. Instead of using cross-entropy loss you want to optimise the parameters of your model by minimising a squared loss, that is, L(θ) = Σ_{i=1}^N l(y_i, f(x_i|θ)), where l(y, ŷ) = (y − ŷ)². Is L(θ) a convex function? Prove your answer. [6 marks]
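For part (a), one way to build intuition (not a substitute for the requested proof) is to evaluate L(θ) on a grid for a toy one-dimensional problem and inspect its second differences; the data and grid below are illustrative assumptions.

import numpy as np

# Toy 1-D dataset (illustrative only): a single feature per example.
x = np.array([-2.0, -1.0, 1.0, 3.0])
y = np.array([1.0, 0.0, 0.0, 1.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def squared_loss(theta):
    # L(theta) = sum_i (y_i - sigma(theta * x_i))^2 for a scalar parameter theta.
    return np.sum((y - sigmoid(theta * x)) ** 2)

thetas = np.linspace(-5.0, 5.0, 201)
losses = np.array([squared_loss(t) for t in thetas])

# A convex function of one variable has non-negative second differences on a grid;
# a negative value anywhere is numerical evidence against convexity.
second_diffs = np.diff(losses, 2)
print("all second differences non-negative:", bool((second_diffs >= -1e-9).all()))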
(b) Suppose that we would like to train an autoencoder to perform dimensionality reduction. An autoencoder is a pair of functions, f : R^d → R^m and g : R^m → R^d, where m < d. Suppose that g is a linear function, g(z) = W_g z, where W_g is a weight matrix. You have a training set of N examples. You can train the model using squared loss, L(θ) = Σ_{i=1}^N l(x_i, g(f(x_i))), where l(x, x′) = ||x − x′||².
Consider two cases:
i. f(x) is a complex non-linear mapping (say, a neural net or a t-SNE embedding).
ii. f(x) is a linear function.
Which of the models is more powerful in the sense that it can achieve a lower loss on the training set? Support your answer. [6 marks]
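For part (b), it may help to recall that with a bias-free linear encoder and decoder the squared reconstruction loss is minimised in closed form by a singular value decomposition of the data matrix; the sketch below computes that optimal linear solution on an illustrative random dataset.

import numpy as np

rng = np.random.default_rng(0)
N, d, m = 200, 10, 3
X = rng.normal(size=(N, d))           # toy training set (illustrative only)

# With f(x) = W_f x and g(z) = W_g z, the best squared-loss reconstruction is the
# projection onto the top-m right singular vectors of X (Eckart-Young).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_f = Vt[:m]                          # encoder weights, m x d
W_g = Vt[:m].T                        # decoder weights, d x m

recon = X @ W_f.T @ W_g.T
print("best linear training loss:", np.sum((X - recon) ** 2))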
(c) Suppose that you have trained a random forest model and you would like to
study feature importance. A friend tells you that you can look at the top-level split in each of the trees in the model, find the feature that is used to make this split in most of the trees, and conclude that this feature is the most important one in your data. Is this a good idea? If you think it is, explain why; if you think it may fail, provide a counterexample. [6 marks]
(d) Suppose that you have trained a random forest model and a gradient boosting model, each consisting of 1000 decision trees trained sequentially. Suppose that for each model you now remove the first tree from the resulting ensemble. What do you expect to happen to the performance of your models? [6 marks]
(e) Implement a selection of different optimisation algorithms (gradient descent, gradient descent with momentum, conjugate gradient and another one of your choice) for the Rosenbrock function, f(x, y) = (a − x)² + b(y − x²)², with parameters a = 1 and b = 100. Start at point (x_0, y_0) = (5, 5) and experiment with the parameters of your algorithms. For which learning rates do the algorithms converge to the global minimum? Which algorithm is more sensitive to the choice of the learning rate? [6 marks]
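For part (e), here is a minimal sketch of plain gradient descent on the Rosenbrock function with a = 1 and b = 100, started at (5, 5); the learning rate and step budget are only starting guesses, and the remaining algorithms and the learning-rate study are left open as the question intends.

import numpy as np

a, b = 1.0, 100.0

def rosenbrock(p):
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    x, y = p
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])

def gradient_descent(p0, eta, n_steps=50_000):
    p = np.array(p0, dtype=float)
    for _ in range(n_steps):
        p -= eta * rosenbrock_grad(p)
        if not np.isfinite(p).all():      # the iterates blew up: this learning rate diverges
            break
    return p

# eta = 1e-5 is only an initial guess; the question asks you to vary it and compare algorithms.
print(gradient_descent((5.0, 5.0), eta=1e-5))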
[Total: 30 marks]