ECM3420

(with Answers)

UNIVERSITY OF EXETER

COLLEGE OF ENGINEERING, MATHEMATICS

AND PHYSICAL SCIENCES

COMPUTER SCIENCE

Examination, May 2019

Learning from Data

Module Leader: Dr. Lorenzo Livi

Duration: TWO HOURS

Answer question 1, and two out of the other four questions.

Question 1 is worth 40 marks. Other questions are worth 30 marks each.

This assessment will contribute 60% to the final mark for this module.

No electronic calculators of any sort are to be used during the course of this

examination.

This is a CLOSED BOOK examination.

ECM3420 – 28/Jan/2019 – with answers Page 1

SECTION A (Compulsory Question)

Question 1

(a) What is the difference between supervised and unsupervised learning?

Provide an example of each. What is meant by the term semi-supervised

learning?

(8 marks)

In supervised learning, the information regarding the target/desired

output is available and exploited during training. On the other hand,

in unsupervised learning this information is not available. [3] Examples:

linear regression (supervised) and clustering (unsupervised). [2]

In semi-supervised learning a small amount of labelled training data is

available, but usually a large quantity of unlabelled data. Learning can

take advantage of the structure of the data implicit in the unlabelled data

as well as the labels. [3]

(b) Briefly explain what is meant by the kernel trick and how it may be used

to obtain a nonlinear classifier from a linear one. Name and provide

expressions for at least two (positive definite) kernel functions.

(10 marks)

The kernel trick describes the fact that inner products 〈φ(x), φ(y)〉 [2]

between vectors x and y mapped to a (usually high dimensional) feature

space by φ(·) can be computed via a kernel k(x,y) in the original space

(subject to the kernel being a Mercer kernel) [3].

If the computation of the linear classifier can be done via inner products,

then classification can be performed after nonlinearly mapping the data via φ(·) to a high-dimensional space, while the computations are still carried out in the original space. This is the basis of support vector machines and kernel logistic regression. [3]

Two well-known (positive definite) kernel functions are [2]:

• Polynomial: k(x, z) = (x · z + v)^p, v ≥ 0, p ≥ 0

• RBF: k(x, z) = exp(−γ‖x − z‖²₂), which is known as the Gaussian RBF when γ = 1/(2σ²)
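As an illustration, both kernels can be evaluated in a few lines. This is a minimal sketch using NumPy; the parameter values v, p and γ are chosen arbitrarily.

```python
import numpy as np

def poly_kernel(x, z, v=1.0, p=3):
    # Polynomial kernel: k(x, z) = (x . z + v)^p, with v >= 0, p >= 0
    return (np.dot(x, z) + v) ** p

def rbf_kernel(x, z, gamma=0.5):
    # RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(poly_kernel(x, z))   # (2 + 1)^3 = 27
print(rbf_kernel(x, x))    # 1.0 at zero distance
```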

(c) Explain the difference between covariance and correlation. The covariance


matrix for two variables is given by

    S = [  10.2  −45.6
          −45.6    8.1 ]

Calculate the correlation between them and provide comments on the

outcome.

Briefly explain why two decorrelated variables need not be independent.

(8 marks)

Correlation is a standardised measure of covariance [2]. The two are related by ρ = cov_xy/(σ_x σ_y), where σ_x and σ_y denote the standard deviations of x and y, respectively. Using the diagonal entries of S as the scale factors, ρ = −45.6/(10.2 × 8.1) ≈ −0.55. [2]

Identification of correct elements in the covariance matrix and calculation.

Therefore, the two variables are moderately anti-correlated [2].

Correlation is a linear measure, so the variables might be positively correlated in one region and negatively correlated in another, summing to zero; independence requires p(x, y) = p(x)p(y), which is a stronger condition [2].

Standard material, but not all appreciate the differences.
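The arithmetic above can be checked in a couple of lines; this sketch simply reproduces the model answer's calculation from the given matrix entries.

```python
# Covariance matrix from the question
S = [[10.2, -45.6],
     [-45.6, 8.1]]

# rho = cov_xy / (sigma_x * sigma_y); following the model answer,
# the diagonal entries of S are used directly as the scale factors
rho = S[0][1] / (S[0][0] * S[1][1])
print(round(rho, 2))  # -0.55
```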

(d) Explain what is meant by the terms generalisation error and over-fitting.

How can over-fitting be combatted?

(6 marks)

Generalisation error: The average error made on unseen test data after

training. [2]

Overfitting: The phenomenon of learning the details of the training data

rather than the trends. [2]

Combatted by controlling the complexity of the learning machines. [2]

(e) With the aid of a carefully labelled diagram, explain what is meant by a

convex function. Why is it desirable for an error function to be convex?

(8 marks)


Several definitions are possible: in terms of the secant line or the second derivative for functions of one variable; for multivariate functions, the eigenvalues of the Hessian are all non-negative, or the epigraph of the function is a convex set. [4] for a multivariate definition; fewer marks for a univariate definition. Diagram: [2].

Desirable to have a convex error function because convex functions have

only one minimum. Therefore if a minimum is located, it is the global

minimum. [2]

(Total 40 marks)


SECTION B (Optional Questions)

Question 2

(a) Write the equation and draw a diagram describing the perceptron model.

Specify all variables involved in the model (using formal notation) and

describe their role.

(10 marks)

The equation describing a perceptron reads:

    y = φ(v) = φ( ∑_{i=1}^{d} w_i x_i + b ).

[4] Let us describe the elements involved in the above expression:[6]

• x = [x1, x2, ..., xd]T is the input vector (feature vector);

• w = [w1, w2, ..., wd]T are the synaptic weights;

• b ∈ R is called bias;

• v = ∑_{i=1}^{d} w_i x_i + b is called the local field or pre-activation;

• φ : R→ R is called activation (or transfer) function;

• y = φ(v) is the neuron output.
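The model above can be transcribed directly. This is a sketch; the sign function is chosen arbitrarily as the activation φ, and the input, weights and bias are illustrative values.

```python
import numpy as np

def perceptron(x, w, b, phi=np.sign):
    # Local field / pre-activation: v = sum_i w_i x_i + b
    v = np.dot(w, x) + b
    # Neuron output: y = phi(v)
    return phi(v)

x = np.array([1.0, -2.0, 0.5])   # input (feature) vector
w = np.array([0.4, 0.1, -0.6])   # synaptic weights
b = 0.2                          # bias
print(perceptron(x, w, b))       # v = 0.1, so sign(v) = 1.0
```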

(b) Write the equations describing a three-layer MLP. Specify all variables and

functions involved in the model and describe their role. Assume one hidden

layer with L neurons and one output neuron.

(10 marks)

We assume bias is included in the network weights. Let us start from the

output neuron:

    y = φ(v),   v = ∑_{i=0}^{L} x_i α_i.   (1)

[3] Each component x_i in Eq. 1 is the output y_i^(H) of the activation function of a neuron in the hidden layer, i.e., x_i ≡ y_i^(H) [3]. Therefore, a


three-layer MLP can be described as follows:

    y = φ( ∑_{i=0}^{L} α_i y_i^(H) ),   y_i^(H) = φ( ∑_{j=0}^{d} w_{ji} x_j ),   (2)

where d is the input dimensionality. [4]
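A forward pass through Eq. 2 can be sketched as follows. This assumes tanh as the activation φ and absorbs the biases by fixing x_0 = 1 and y_0^(H) = 1; the weights are random illustrative values.

```python
import numpy as np

def mlp_forward(x, W, alpha, phi=np.tanh):
    # Hidden layer: y_i^(H) = phi(sum_j w_ji x_j), with x_0 = 1 absorbing the bias
    x = np.concatenate(([1.0], x))
    y_H = phi(W @ x)
    # Output neuron: y = phi(sum_i alpha_i y_i^(H)), with y_0^(H) = 1 for the bias
    y_H = np.concatenate(([1.0], y_H))
    return phi(alpha @ y_H)

rng = np.random.default_rng(0)
d, L = 3, 4
W = rng.normal(size=(L, d + 1))   # hidden weights w_ji
alpha = rng.normal(size=L + 1)    # output weights alpha_i
y = mlp_forward(rng.normal(size=d), W, alpha)
print(y)                          # a scalar in (-1, 1) for tanh
```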

(c) Write the update rule and describe the Least-Mean-Square (LMS) algorithm.

(10 marks)

Least-Mean-Square (LMS) is an online learning algorithm that is used

for finding the weights of a linear model. LMS updates the solution after

the presentation of each input datum. [5] The LMS update rule reads:

w(k + 1) = w(k) + µx(k)e(k), (3)

where w(k + 1) denotes the updated weights at time-step k + 1, w(k)

are the current weights, µ > 0 is the learning rate, x(k) is the current

input, and e(k) = d(k) − y(k) is the instantaneous error, i.e., the difference between the desired and the computed output. [5]
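The update rule can be seen in action on a noiseless linear target; this is a sketch, with the learning rate, target weights and number of presentations chosen arbitrarily.

```python
import numpy as np

def lms_step(w, x, d, mu=0.05):
    # Instantaneous error: e(k) = d(k) - y(k), with y(k) = w(k) . x(k)
    e = d - np.dot(w, x)
    # LMS update (Eq. 3): w(k+1) = w(k) + mu * x(k) * e(k)
    return w + mu * x * e

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(5000):            # online: one update per presented datum
    x = rng.normal(size=2)
    w = lms_step(w, x, np.dot(w_true, x))
print(w)                         # approaches w_true = [2, -1]
```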

(Total 30 marks)


Question 3

A start-up wishes to use machine learning to predict the polluting emissions in

car exhaust from the characteristics of the car, such as engine size, fuel, driving

speed, etc. They have collected a large data set of measured emissions and the

corresponding features for each measurement.

(a) In supervised learning, an error function, E(w), may be defined as:

    E(w) = − ∑_{n=1}^{N} log p(t_n | x_n; w).

State the meaning of each of the symbols w, N,xn and tn in this definition

and carefully explain how the error function may be derived from the

principle of maximum likelihood.

(11 marks)

• w Parameters/weights of the model. [1]

• N Number of training data pairs. [1]

• xn Training features. [1]

• tn Training targets. [1]

The likelihood is ∏_{n=1}^{N} p(t_n, x_n | w). [1] Taking negative logarithms and assuming that the targets are independent [2], and writing p(t_n, x_n | w) = p(t_n | x_n, w) p(x_n) [2] obtains the error function to be minimised (because log is monotonic [1]) if the constant due to p(x_n) is dropped. [1]

Straightforward, but many find it difficult.
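The derivation can be verified numerically: with Gaussian p(t | x; w) of unit variance, the negative log-likelihood equals the sum-of-squares error plus a constant. This sketch assumes a linear model and σ = 1; the data values are illustrative.

```python
import numpy as np

def neg_log_likelihood(w, X, t):
    # E(w) = -sum_n log p(t_n | x_n; w), with p = N(t; w . x, 1):
    # half the sum of squared residuals plus the normalisation constant
    resid = t - X @ w
    return 0.5 * np.sum(resid ** 2) + len(t) * 0.5 * np.log(2 * np.pi)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 2.5])
w = np.array([0.9, 1.8])
# Compare against summing the Gaussian log-densities term by term
direct = -np.sum(np.log(np.exp(-(t - X @ w) ** 2 / 2) / np.sqrt(2 * np.pi)))
print(np.isclose(neg_log_likelihood(w, X, t), direct))  # True
```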

(b) The start-up uses a radial basis function neural network with a mean squared

error function to learn the relationship between features and emissions.

Explain what the use of the mean squared error function implies about the

noise or measurement errors in the data.

(4 marks)

The mean squared error function is derived from the assumption that the

errors between the systematic model and the measurements are Gaussian

distributed. [4]

Discussed in lectures.


(c) Explain how the parameters of the radial basis function network may be

determined with the mean squared error function.

(7 marks)

Learning by selecting a number of data points as RBF centres and then

linear regression from the basis function activations. [3] This requires

constructing a design/data matrix X whose n, j-th entry is the activation

of basis function j by feature xn. Weights w found by w = X†y where

y is the vector of targets. [2] Regularisation constant found by cross

validation. [2]

Discussed in lectures and illustrated in a workshop.

(d) In order to cope with occasional outliers the start-up decides to use a sum of

absolute errors function. Explain why the sum of absolute errors function

may help in this context.

(4 marks)

The sum of absolute errors function corresponds to Laplacian-distributed errors [2], p(t | x; w) ∝ exp(−|t − y(w)|), so large deviations from the prediction are more likely under this model and do not have as large an influence on the weights as in the Gaussian case. [2]

(e) Suggest how the parameters of the radial basis function network could be

learned using the sum of absolute errors.

(4 marks)

Numerical optimisation rather than linear algebra must be used. [2]

Either Nelder-Mead simplex method or a gradient-based method. [2]

This was illustrated in a workshop for ordinary linear regression, but

most candidates will not put the RBF training and optimisation ideas

together.
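One possible numerical approach is sketched below. Rather than Nelder-Mead, it uses iteratively reweighted least squares, where each pass solves a weighted least-squares problem with weights 1/|residual|; this is an assumption for illustration, not the method used in the workshop, and the data and outliers are synthetic.

```python
import numpy as np

def fit_sum_abs_errors(X, t, steps=50, eps=1e-6):
    # Minimise sum_n |x_n . w - t_n| by iteratively reweighted least squares
    w = np.linalg.lstsq(X, t, rcond=None)[0]     # start from the LS solution
    for _ in range(steps):
        r = np.abs(X @ w - t) + eps              # eps guards against divide-by-zero
        A = X.T @ (X / r[:, None])               # X^T diag(1/r) X
        b = X.T @ (t / r)                        # X^T diag(1/r) t
        w = np.linalg.solve(A, b)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
t = X @ np.array([1.0, -2.0])
t[:5] += 20.0                                    # a few gross outliers
w = fit_sum_abs_errors(X, t)
print(w)                                         # close to [1, -2] despite outliers
```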

(Total 30 marks)


Question 4

(a) Explain what is meant by the term recurrent neural network and what

recurrent neural networks are used for.

(4 marks)

Recurrent neural networks feed the output signal back to the inputs. [2]

Used for modelling and predicting time series. [2]

Bookwork

(b) Illustrating your answer with a diagram, explain what is meant by an echo

state network. Your answer should include equations giving the update rule

and output for an echo state network.

(11 marks)

Diagram showing input layer, reservoir, output layer and feedback. [4]

Discussed in lectures

Update rule:

    x_t = φ(W^r x_{t−1} + W^i u_t + ε_t),
    z_t = W^o x_t,

[2] where φ(·) is the activation function (typically, but not necessarily, a sigmoidal function), x_t is the ESN state at time t, u_t and z_t are the input and output at time t, respectively, and ε_t is an additive white noise term. W^r are the weights of the recurrent layer (connections between hidden neurons) and W^i are the weights between the input and the recurrent layer. [2] The W^o weights connect the ESN state with the output and form the so-called read-out layer. [3]

Standard material.
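Running the state update over an input sequence can be sketched as below. This assumes tanh as φ, omits the noise term, and scales the recurrent weights to spectral radius below 1 (a common echo-state heuristic); the reservoir size and input signal are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
Nr = 50                                            # reservoir size
Wr = rng.normal(size=(Nr, Nr))
Wr *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wr)))  # spectral radius < 1
Wi = rng.normal(size=Nr)                           # input weights (scalar input)

def esn_states(u):
    # x_t = phi(Wr x_{t-1} + Wi u_t), with phi = tanh and the noise term omitted
    x = np.zeros(Nr)
    states = []
    for ut in u:
        x = np.tanh(Wr @ x + Wi * ut)
        states.append(x)
    return np.array(states)                        # one ESN state per row

u = np.sin(np.linspace(0, 8 * np.pi, 200))
X = esn_states(u)
print(X.shape)  # (200, 50)
```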

(c) Describe in detail how the read-out weights of echo state networks are

obtained. You should include in your answer the expressions describing the

related optimization problem and how it is solved.

(11 marks)


The read-out weights Wo are obtained by solving the following

regularized [1] least-square problem [2].

    argmin_{W ∈ R^{N_r × N_o}}  (1/2)‖XW − t‖² + (λ/2)‖W‖²,

[3] where λ ≥ 0 is the regularization parameter [1], X ∈ R^{N × N_r} contains the ESN states produced by an input sequence of length N, and t ∈ R^N is the vector of target values. [2]

The closed-form solution is:

    W^o = (XᵀX + λI)^{−1} Xᵀ t,   (4)

where I is the N_r × N_r identity matrix. [2]
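The closed form amounts to one linear solve. This sketch uses random matrices as a stand-in for collected ESN states, a ridge penalty λI consistent with the (λ/2)‖W‖² objective, and a λ value chosen arbitrarily.

```python
import numpy as np

def readout_weights(X, t, lam=1e-6):
    # Wo = (X^T X + lam I)^{-1} X^T t  -- the ridge-regression read-out
    Nr = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(Nr), X.T @ t)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 30))    # stand-in for the matrix of ESN states
t = X @ rng.normal(size=30)       # targets exactly linear in the states
Wo = readout_weights(X, t)
print(np.abs(X @ Wo - t).max())   # near zero for tiny regularisation
```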

(d) Explain why regularisation is used in finding the read-out weights.

(4 marks)

Regularization improves the numerical stability of the matrix inversion

operation and reduces the magnitudes of entries in Wo, thus mitigating

sensitivity to noise and overfitting. [4]

(Total 30 marks)


Question 5

(a) Explain the difference between a hierarchical clustering algorithm and a

clustering algorithm that yields a partition. Explain when you might prefer

to use a hierarchical algorithm.

(5 marks)

A hierarchical algorithm gives clusters at different resolutions, while a partitional algorithm just divides the data into a fixed number of clusters [3]. Hierarchical algorithms are useful for examining the structure of the data at different resolutions and when the number of clusters is unknown. [2]

(b) Explain how agglomerative clustering is performed.

(6 marks)

Agglomerative clustering begins with each data point as a single

cluster. [1] At each stage the two closest clusters are merged to form a

new cluster. [2] Algorithm terminates when a single cluster remains. [1]

Distances between clusters can be measured in a variety of ways, for

example, as the minimum or maximum distance between any two points,

one from each cluster. [2]

Standard, but many will not appreciate different distance measures
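The procedure can be sketched with the minimum (single-linkage) distance between clusters; this naive O(n³) version is for illustration only, stopping at a chosen number of clusters k rather than merging to a single cluster.

```python
import numpy as np

def single_linkage(X, k):
    # Start with each point as its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, pair = np.inf, None
        # Find the two closest clusters (minimum pointwise distance)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge the closest pair
    return clusters

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(single_linkage(X, 2))   # [[0, 1, 2], [3, 4]]
```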

(c) Carefully defining the symbols you use, give an expression for the objective

function minimised by the k-means algorithm.

(6 marks)

Objective function:

k∑

j=1

∑

xi∈Cj

‖xi − µj‖22, [2]

where ‖·‖22 is the squared Euclidean norm, Cj is the jth cluster in the

partition P , k = |P|, and µj, j = 1, ..., k, are k cluster representatives.

[2] If clusters are represented by centroids, then µj is defined as:

µj =

1

|Cj|

∑

xi∈Cj

xi [2]


(d) Giving pseudo-code, explain how the k-means clustering is achieved.

(7 marks)

Here is detailed pseudo-code, but full marks for less formal presentation

that includes initialisation [2], assignment [2] and update steps [2] and

termination [1] steps.

Input: Input data X = {x_1, ..., x_n}, partition order k, a distance function d : X × X → R⁺₀, MAX number of iterations
Output: A partition P = {C_1, ..., C_k}
 1: t = 0, P = ∅
 2: Initialisation: assign the k cluster representatives as randomly chosen x_i
 3: loop
 4:   t = t + 1
 5:   Assignment step: assign each sample x_j to the cluster with the closest representative
 6:     C_i^(t) = {x_j : d(x_j, μ_i) ≤ d(x_j, μ_h) for all h = 1, ..., k}
 7:   Update step: update the representatives
 8:     μ_i^(t+1) = (1/|C_i^(t)|) ∑_{x_j ∈ C_i^(t)} x_j
 9:   Update the partition with the modified clusters: P^t = {C_1^(t), ..., C_k^(t)}
10:   if t ≥ MAX or P^t = P^(t−1) then
11:     return P^t
12:   end if
13: end loop
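The pseudo-code above translates almost line for line into Python. This sketch uses squared Euclidean distance and two well-separated synthetic blobs as illustrative data.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: k representatives chosen at random from the data
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each sample goes to its closest representative
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        # Update step: representatives become the cluster centroids
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):      # terminate when the partition is stable
            break
        mu = new_mu
    return labels, mu

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)),
               rng.normal(5, 0.1, size=(20, 2))])
labels, mu = kmeans(X, 2)
print(sorted(mu[:, 0]))   # centroids near 0 and 5
```

Note that, as discussed in part (e), the random initialisation means different seeds can yield different partitions; in practice the algorithm is run from several initialisations and the partition with the smallest objective is kept.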

(e) Two users run the k-means algorithm on the same data set, but find that it

gives two different partitions. Explain how this may occur and what may be

done in practice to cope with this phenomenon.

(6 marks)

The reason is that the clusters are initialised at random so the algorithm

may converge to different local minima. [3] In practice the algorithm is

run several times from different initialisations and the partition with the

minimum objective function returned. [3]

Standard, but not appreciated by many.

(Total 30 marks)

End of Paper