UNIVERSITY OF SOUTHAMPTON COMP3206W1

SEMESTER 1 EXAMINATION 2017-2018 MACHINE LEARNING

DURATION 120 MINS (2 Hours)

### All answers must be written within the designated spaces in this booklet.

### No credit will be given for answers presented elsewhere.

You are advised to write using a soft pencil so that you may readily correct mistakes with an eraser.

This paper contains 5 questions.

Answer ANY FOUR, each worth 25% of the total marks of the examination.

An outline marking scheme is shown in brackets to the right of each question. This examination provides 80% of the module’s marks.

University approved calculators MAY be used.

A foreign language translation dictionary (paper version) is permitted provided it contains no notes, additions or annotations.



1. You are given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ is a $p$-dimensional vector of real-valued features associated with an individual $i$ and $y_i$ is a real-valued "output" variable for $i$. Show how you would set up a **linear regression** model by expressing $y_i$ as a function of $\mathbf{x}_i$.

*[4 marks]*

Express the learning problem in terms of a design matrix $A$, so that $\mathbf{y} = A\mathbf{w}$. You must outline what the size of the design matrix is, and what each of its rows represents. The vector of residuals is orthogonal to the column space of $A$. Show how this leads to the solution, which is written out in terms of the pseudo-inverse

$$A^{+} = (A^T A)^{-1} A^T.$$

*[6 marks]*
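A quick numerical sketch of this result, on hypothetical synthetic data: build a design matrix whose rows are $(1, \mathbf{x}_i^T)$, solve via the pseudo-inverse, and check that the residual vector is orthogonal to the columns of $A$.

```python
import numpy as np

# Hypothetical synthetic data: N = 100 individuals, p = 3 features,
# generated from known weights plus a small amount of noise.
rng = np.random.default_rng(0)
N, p = 100, 3
A = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # rows are (1, x_i^T)
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = A @ w_true + 0.01 * rng.normal(size=N)

# Least-squares solution via the pseudo-inverse A+ = (A^T A)^{-1} A^T.
w_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# The residual vector is orthogonal to the column space of A,
# so A^T (y - A w_hat) should vanish up to floating point.
residual = y - A @ w_hat
print(np.abs(A.T @ residual).max())
```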

# TURN OVER


If the design matrix has a singular value decomposition

$$A = U \Sigma V^T = \sum_{k} \mathbf{u}_k \sigma_k \mathbf{v}_k^T$$

where $U = (\mathbf{u}_1, \cdots, \mathbf{u}_r)$ is a matrix composed of $N \times 1$ column vectors and $V = (\mathbf{v}_1, \cdots, \mathbf{v}_r)$ is a matrix of $p$-dimensional column vectors, express $A^{+}$ in terms of $U$, $V$ and $\Sigma$. Rewrite $\mathbf{y}$, $\mathbf{w}$ as linear combinations of the columns $\mathbf{u}_i$, $\mathbf{v}_i$ of the matrices $U$, $V$ respectively:

$$\mathbf{w} = \sum_i \alpha_i \mathbf{v}_i, \qquad \mathbf{y} = \sum_i \beta_i \mathbf{u}_i$$

to express $\mathbf{w}$ in terms of the elements in the SVD. Comment on how the singular values in $\Sigma$ might affect the outcome of the learning.

*[10 marks]*
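The SVD route can be checked numerically; the sketch below (hypothetical data, illustrative only) forms $\alpha_i = \beta_i / \sigma_i$ with $\beta_i = \mathbf{u}_i^T \mathbf{y}$ and confirms agreement with the ordinary least-squares solution.

```python
import numpy as np

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
N, p = 50, 4
A = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Thin SVD: A = U Sigma V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A+ = V Sigma^{-1} U^T, so w = sum_i (beta_i / sigma_i) v_i
# with beta_i = u_i^T y.
beta = U.T @ y
alpha = beta / s          # small singular values inflate alpha: instability
w = Vt.T @ alpha

# Agrees with the direct least-squares solution.
w_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(w, w_ls))
```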


Show how the solution to the linear regression problem can be obtained from an optimisation approach where a weight vector $\mathbf{w}$ is chosen to minimise the loss function

$$l(\mathbf{w}) = (A\mathbf{w} - \mathbf{y})^T (A\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}^T \mathbf{w}.$$

Why is the 'regularization' term involving $\lambda$, that was not indicated in the dataset $(\mathbf{x}_i, y_i)$, introduced?

*[5 marks]*
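Minimising this loss gives the closed form $\mathbf{w} = (A^TA + \lambda I)^{-1} A^T \mathbf{y}$; the sketch below (hypothetical data) illustrates the shrinkage effect of the regulariser.

```python
import numpy as np

# Hypothetical data for illustration.
rng = np.random.default_rng(2)
N, p = 30, 10
A = rng.normal(size=(N, p))
y = A @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

lam = 1.0  # regularisation strength, the lambda in the loss above

# Minimiser of (Aw - y)^T (Aw - y) + lam * w^T w:
#   w = (A^T A + lam I)^{-1} A^T y
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
w_ols = np.linalg.solve(A.T @ A, A.T @ y)

# Shrinkage: the regularised weights have a smaller norm.
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```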


2. For a probability distribution $p$ over an event space $\mathcal{X}$, $p := \{p_i \mid i \in \mathcal{X}\}$, explain why the entropy $H(p)$,

$$H(p) = \sum_i p_i \log\left(\frac{1}{p_i}\right),$$

is often interpreted as the average degree of 'surprise'.

*[3 marks]*

For 2 probability distributions $p = \{p_i \mid i \in \mathcal{X}\}$ and $q = \{q_i \mid i \in \mathcal{X}\}$ over the same event space $\mathcal{X}$, the cross-entropy $H(p, q)$ and the Kullback-Leibler (KL) divergence $KL(p \| q)$ are defined as:

$$H(p, q) = \sum_i p_i \log\left(\frac{1}{q_i}\right) \quad \text{and} \quad KL(p \| q) = H(p, q) - H(p) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right).$$

Provide some intuition for what each of the two metrics over pairs of probability distributions captures. You may treat the distribution $p$ as the one representing 'ground truth.'

*[7 marks]*
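These definitions translate directly into code; a minimal sketch (illustrative distributions, not taken from the question):

```python
import numpy as np

def entropy(p):
    """H(p) = sum_i p_i log(1/p_i), in nats."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i log(1/q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(1.0 / q)))

def kl(p, q):
    """KL(p||q) = H(p, q) - H(p) >= 0, with equality iff p == q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]        # 'ground truth'
q = [0.9, 0.1]        # a mismatched model
print(kl(p, p))       # no extra surprise when the model is exact
print(kl(p, q) > 0)   # a mismatched model costs extra nats on average
```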


You are given a sequence $\mathbf{x} := \{x(1), \ldots, x(N)\}$ of heads ($x(i) = H$) and tails ($x(i) = T$) which are the outcomes of $N$ tosses of a (potentially biased) coin. All possible outcomes of $N$ coin tosses would constitute the event space $\mathcal{X}$. A binomial distribution $B(N, \theta)$ sets the probability of occurrence of $n_1$ events of type 1, and $n_2 = N - n_1$ of type 2, as

$$P(n_1, n_2 \mid N, \theta) = \frac{N!}{n_1! \, n_2!} \, \theta^{n_1} (1 - \theta)^{n_2}.$$

Describe how you would fit the data $\mathbf{x}$ to a binomial distribution using maximum likelihood estimation and find the result to be the same as the empirical distribution $\tilde{p} = (\tilde{p}_H, \tilde{p}_T)$:

$$\theta^{*} = \frac{n_H}{N} =: \tilde{p}_H.$$

*[8 marks]*
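As a sanity check (on a hypothetical toss sequence, not one given in the question), maximising the binomial log-likelihood numerically over a grid recovers the closed-form estimate $\theta^{*} = n_H/N$:

```python
import numpy as np

# Hypothetical coin-toss data: N = 20 tosses of a biased coin.
x = list("HHTHHHTHHHHTHHHTHHTH")
N = len(x)
n_H = x.count("H")

# log P(n_H | N, theta) = const + n_H log(theta) + (N - n_H) log(1 - theta);
# maximise over a fine grid of theta values.
thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
log_lik = n_H * np.log(thetas) + (N - n_H) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

theta_mle = n_H / N  # the empirical frequency p~_H
print(theta_mle, theta_grid)
```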


Use Bayes' theorem to write the posterior probability $P(\theta \mid \mathbf{x})$ of the parameters $\theta$ upon observing some data $\mathbf{x}$. Discuss how a conjugate Beta prior

$$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\, d\theta} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

introduces "pseudo-counts" to affect the estimation of parameters of the binomial distribution as above, and combats the problem of overfitting, particularly for data sets of small size $N$.

(You *won't need* this information, but the definition of $\Gamma(x)$ is given below.)

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt, \quad \text{and} \quad \Gamma(n) = (n-1)! \ \text{for integer } n.$$

*[7 marks]*
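The pseudo-count effect can be seen in a few lines: with a Beta$(\alpha, \beta)$ prior the posterior is Beta$(\alpha + n_H,\ \beta + n_T)$, and its mode gives the MAP estimate below (the counts used are hypothetical, chosen to show the small-$N$ case).

```python
def theta_map(n_H, N, alpha=2.0, beta=2.0):
    """MAP estimate under a Beta(alpha, beta) prior:
    (alpha - 1) and (beta - 1) act as pseudo-counts added to the data."""
    return (n_H + alpha - 1) / (N + alpha + beta - 2)

# Small hypothetical sample: 3 heads in 3 tosses.
n_H, N = 3, 3
print(n_H / N)            # MLE: 1.0, i.e. it asserts tails are impossible
print(theta_map(n_H, N))  # 4/5 = 0.8, the pseudo-counts temper the extreme
```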


3. You are given a dataset $X = \{\mathbf{x}_i\}_{i=1}^{N}$, where $X$ is a $(N \times p)$ matrix, each row of which, $\mathbf{x}_i$, is a $p$-dimensional vector of real-valued features associated with an individual $i$. Explain in detail how you would use Principal Components Analysis (PCA) to reduce the dimensionality $p$ of the feature set. You have to (i) introduce the covariance matrix, (ii) discuss what role its eigenvalues play, (iii) explain how to perform the projection onto a low-dimensional subspace and (iv) comment on the magnitudes $p$ and $N$.

*[12 marks]*
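Steps (i)-(iii) can be sketched directly in code (hypothetical synthetic data with a low-dimensional signal, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, k = 200, 5, 2  # hypothetical sizes: reduce p = 5 features to k = 2

# Correlated data so that a few directions carry most of the variance.
basis = rng.normal(size=(2, p))
X = rng.normal(size=(N, 2)) @ basis + 0.1 * rng.normal(size=(N, p))

# (i) centre the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / (N - 1)

# (ii) eigenvalues of C are the variances along the principal directions.
eigvals, eigvecs = np.linalg.eigh(C)        # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# (iii) project onto the top-k eigenvectors.
Z = Xc @ eigvecs[:, :k]                     # N x k reduced representation

# Fraction of total variance retained by the k components.
print(eigvals[:k].sum() / eigvals.sum())
```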


For classification problems where there is a training set of labelled data, explain why dimensionality reduction using PCA may give rise to poor classification outcomes. You may draw a figure to illustrate your arguments.

*[4 marks]*

A uniform random variable $U(a, b)$ has mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$. For any sample $\mathbf{x}$ of $N$ numbers $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$ drawn from a uniform distribution $U(0, 2)$, its average $\langle \mathbf{x} \rangle$ is a random variable. What are the mean and variance of this random variable? What is the relevance to machine learning implied by the dependence of the result on sample size $N$?

*[4 marks]*
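A Monte Carlo sketch of this result (trial counts are arbitrary choices): the average of $N$ draws from $U(0, 2)$ has mean $1$ and variance $\frac{1/3}{N}$, i.e. the spread of the estimate shrinks as the sample grows.

```python
import numpy as np

# U(0, 2): mean (a + b)/2 = 1, variance (b - a)^2 / 12 = 1/3.
# The average of N draws then has mean 1 and variance (1/3) / N.
rng = np.random.default_rng(4)
N, trials = 100, 20000

averages = rng.uniform(0, 2, size=(trials, N)).mean(axis=1)
print(averages.mean())      # close to 1
print(averages.var() * N)   # close to 1/3
```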


A 2-dimensional Gaussian distribution is defined by the probability density function of $X = (X_1, X_2)^T$, parameterised by its mean $\mu = (\mu_1, \mu_2)^T$ and variance-covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}:$$

$$p(X \mid \mu, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right\}.$$

Draw 3 contour plots of equiprobable values of $(X_1, X_2)$ for the cases (a) $\rho = 0$, (b) $\rho < 0$ and (c) $\rho > 0$, providing reasons for doing so.

*[5 marks]*
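The orientation of those contours follows from the eigenvectors of $\Sigma$: the equiprobable curves are ellipses whose long axis lies along the leading eigenvector, and the sign of $\rho$ tilts it. A small numerical check (illustrative parameter values):

```python
import numpy as np

def principal_axis(s1, s2, rho):
    """Direction of the long axis of the equiprobability ellipses:
    the eigenvector of Sigma with the largest eigenvalue."""
    Sigma = np.array([[s1**2, rho * s1 * s2],
                      [rho * s1 * s2, s2**2]])
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs[:, np.argmax(vals)]

v = principal_axis(1.0, 1.0, 0.8)
print(v[0] * v[1] > 0)   # rho > 0: axis tilted along the diagonal
v = principal_axis(1.0, 1.0, -0.8)
print(v[0] * v[1] < 0)   # rho < 0: axis tilted along the anti-diagonal
```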



4. Describe the k-means clustering algorithm.

*[7 marks]*
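For reference, the algorithm (Lloyd's iteration: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points) can be sketched as follows, on hypothetical well-separated blobs:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid re-estimation until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Two well-separated hypothetical blobs.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)  # one centroid near (0, 0), one near (5, 5)
```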

In Fisher's Linear Discriminant Analysis (LDA), training data, $\mathbf{x}_i^A$ and $\mathbf{x}_j^B$ from 2 classes $A$ and $B$, is projected along some vector $\mathbf{w}$ so that $\mathbf{x}_i^A$, $\mathbf{x}_j^B$ are mapped into

$$y_i^A := \mathbf{w}^T \mathbf{x}_i^A, \ i = 1, \ldots, N_A, \quad \text{and} \quad y_j^B := \mathbf{w}^T \mathbf{x}_j^B, \ j = 1, \ldots, N_B$$

respectively. Mean vectors and variance-covariance matrices are computed from the training data. These empirical means are denoted $\mathbf{m}_A$ and $\mathbf{m}_B$, and the variance-covariance matrices are called $S_A$ and $S_B$ respectively. Let $\mu_A$ and $\mu_B$ be the (scalar) means for the values $\{y_i^A\}$ and $\{y_j^B\}$, and $\sigma_A$, $\sigma_B$ the corresponding variances. These will, of course, depend on the choice of vector $\mathbf{w}$. What is the quantity that LDA seeks to maximise in order to achieve classification accuracy? Explain why that is a good choice.

*[5 marks]*

Express this quantity as a ratio of the form:

$$R(\mathbf{w}) = \frac{\mathbf{w}^T U \mathbf{w}}{\mathbf{w}^T V \mathbf{w}}$$

by finding the suitable $U$, $V$.

*[4 marks]*


Using

$$\frac{\partial R(\mathbf{w})}{\partial \mathbf{w}} = \frac{2}{(\mathbf{w}^T V \mathbf{w})^2} \left[ (\mathbf{w}^T V \mathbf{w})\, U \mathbf{w} - (\mathbf{w}^T U \mathbf{w})\, V \mathbf{w} \right],$$

show that the optimal vector $\mathbf{w}$ is in the direction of $V^{-1}(\mathbf{m}_A - \mathbf{m}_B)$. If the data from each class $A$ and $B$ were scattered isotropically, $V$ would be proportional to the identity matrix. Hence, interpret the result.

*[4 marks]*
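The closed-form direction can be exercised numerically. The sketch below (hypothetical 2-D data; $V = S_A + S_B$ is one common choice of within-class scatter) computes $\mathbf{w} \propto V^{-1}(\mathbf{m}_A - \mathbf{m}_B)$ and checks that the projected class means are well separated relative to the projected spread.

```python
import numpy as np

# Hypothetical anisotropic 2-D data, shared covariance across classes.
rng = np.random.default_rng(6)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(cov)
XA = rng.normal(size=(100, 2)) @ L.T + np.array([2.0, 0.0])
XB = rng.normal(size=(100, 2)) @ L.T + np.array([0.0, 1.0])

mA, mB = XA.mean(axis=0), XB.mean(axis=0)
SA = np.cov(XA, rowvar=False)
SB = np.cov(XB, rowvar=False)
V = SA + SB

# Fisher direction: w proportional to V^{-1} (m_A - m_B).
w = np.linalg.solve(V, mA - mB)
w = w / np.linalg.norm(w)

# Separation of the projected means relative to the projected variances.
yA, yB = XA @ w, XB @ w
fisher_ratio = (yA.mean() - yB.mean())**2 / (yA.var() + yB.var())
print(fisher_ratio)
```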


You are given a scatter plot of 2-dimensional labelled data (labels "x" and "o") as shown in the figure, and equiprobable contour plots of their best fit Gaussian distributions.

[Figure: scatter plot of the "x" and "o" data points, with both axes running from -6 to 6 and the two classes intermingled near the origin.]

Explain how you would find an optimal direction onto which to project the data so as to linearly separate the two classes. You may use the answers to the previous parts to comment on your choices.

*[5 marks]*


5. Suppose you are given the 2 class dataset in the figure below.

[Figure: scatter plot of the 2-class dataset; Class 1 points are marked "+" and Class 2 points are marked "*".]

Would a perceptron be a suitable classifier for this dataset? Briefly explain your answer.

*[3 marks]*

Would a neural network with 1 hidden layer and no bias be a suitable classifier for this dataset? Briefly explain your answer.

*[3 marks]*

Assume you now have a fully functional neural network architecture in place and you are training your model. How can you check to see if you are overfitting? Explain exactly what you could plot from your output and what on this plot may indicate overfitting.

*[7 marks]*

NOTE: THERE ARE 3 MORE PARTS TO THIS QUESTION [(d), (e) and (f)] on the next page!

Now assume you are working with a different dataset of 1000 images. These images are from the MNIST dataset and consist of handwritten digits from 0 to 9. Some samples are given below. Each one of the images consists of 28 x 28 pixels, each with a value of 0 or 1, and you want to build a neural network to classify the digits. Note, for any of the answers below, if you do not have a calculator and cannot work out the final number you will not be penalised. Simply write out the simplest form of the equation that answers the question.

(d) How many inputs would you need in your neural network?

*[3 marks]*

(e) How many neurons would you need in your output layer?

*[3 marks]*

(f) Explain how you would know if your model has converged.

*[6 marks]*
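The size bookkeeping behind parts (d) and (e) is short enough to write out: one input per flattened pixel and one output neuron per digit class. The hidden-layer width below (64) is an illustrative choice, not something the question specifies.

```python
# One input per pixel of a flattened 28 x 28 binary image,
# and one output neuron per digit class 0-9.
n_inputs = 28 * 28          # 784
n_outputs = 10
n_hidden = 64               # hypothetical hidden width, for illustration only

# Parameter count for a 1-hidden-layer network with biases:
n_params = (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)
print(n_inputs, n_outputs, n_params)
```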