# SEMESTER 1 EXAMINATION 2017-2018 MACHINE LEARNING


UNIVERSITY OF SOUTHAMPTON COMP3206W1

DURATION 120 MINS (2 Hours)

No credit will be given for answers presented elsewhere.

You are advised to write using a soft pencil so that you may readily correct mistakes with an eraser.

This paper contains 5 questions.

Answer ANY FOUR, each worth 25% of the total marks of the examination.

An outline marking scheme is shown in brackets to the right of each question. This examination provides 80% of the module’s marks.

University approved calculators MAY be used.

A foreign language translation dictionary (paper version) is permitted provided it contains no notes, additions or annotations.


1. (a) You are given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ is a $p$-dimensional vector of real-valued features associated with an individual $i$ and $y_i$ is a real-valued "output" variable for $i$. Show how you would set up a linear regression model by expressing $y_i$ as a function of $\mathbf{x}_i$. [4 marks]

(b) Express the learning problem in terms of a design matrix $A$, so that $A\mathbf{w} = \mathbf{y}$. [6 marks]

# TURN OVER

(c) If the design matrix has a singular value decomposition $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^T$, use the expansions

$$\mathbf{w} = \sum_{i=1}^{k} \alpha_i \mathbf{v}_i, \qquad \mathbf{y} = \sum_{i=1}^{k} \beta_i \mathbf{u}_i$$

to express $\mathbf{w}$ in terms of the elements in the SVD. Comment on how the singular values $\sigma_i$ might affect the outcome of the learning. [10 marks]
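As an illustrative (non-examinable) aside, the SVD route to the least-squares weights can be sketched numerically. The matrix sizes and random data below are my own choices, not part of the question:

```python
import numpy as np

# Hypothetical small design matrix A and targets y (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
y = rng.normal(size=6)

# Thin SVD: A = U diag(s) V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Expand y in the left singular vectors: beta_i = u_i^T y,
# then alpha_i = beta_i / sigma_i gives w = sum_i alpha_i v_i.
beta = U.T @ y
alpha = beta / s
w = Vt.T @ alpha

# This matches the least-squares solution; note how a tiny sigma_i
# would blow up alpha_i = beta_i / sigma_i, which is why the singular
# values matter for the outcome of the learning.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(w, w_lstsq))
```

The division by each $\sigma_i$ is exactly where near-singular design matrices cause instability.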

(d) Show how the solution to the linear regression problem can be obtained from an optimisation approach where a weight vector $\mathbf{w}$ is chosen to minimise the loss function

$$l(\mathbf{w}) = (A\mathbf{w} - \mathbf{y})^T (A\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}^T \mathbf{w}.$$

Why is the 'regularization' term involving $\lambda$, which was not indicated in the dataset $(\mathbf{x}_i, y_i)$, introduced? [5 marks]
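As a minimal (non-examinable) sketch of the optimisation route, setting the gradient of the loss above to zero gives the regularised normal equations; the data and `lam` value below are arbitrary:

```python
import numpy as np

# Minimal ridge-regression sketch; lam plays the role of the
# regularisation constant lambda in the loss function.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
y = rng.normal(size=8)
lam = 0.5

# Setting the gradient of (Aw - y)^T (Aw - y) + lam * w^T w to zero
# gives the normal equations (A^T A + lam I) w = A^T y.
w = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)

# Check: the gradient 2 A^T (A w - y) + 2 lam w vanishes at the solution.
grad = 2 * A.T @ (A @ w - y) + 2 * lam * w
print(np.allclose(grad, 0))
```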


2. (a) For a probability distribution $p$ over an event space $\mathcal{X}$, $p := \{p_i \mid i \in \mathcal{X}\}$, explain why the entropy

$$H(p) = \sum_i p_i \log\left(\frac{1}{p_i}\right)$$

is often interpreted as the average degree of 'surprise'. [3 marks]

(b) For two probability distributions $p = \{p_i \mid i \in \mathcal{X}\}$ and $q = \{q_i \mid i \in \mathcal{X}\}$ over the same event space $\mathcal{X}$, the cross-entropy $H(p, q)$ and the Kullback-Leibler (KL) divergence $KL(p \| q)$ are defined as:

$$H(p, q) = \sum_i p_i \log\left(\frac{1}{q_i}\right) \quad \text{and} \quad KL(p \| q) = H(p, q) - H(p) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right).$$

Provide some intuition for what each of the two metrics over pairs of probability distributions captures. You may treat the distribution $p$ as the one representing 'ground truth'. [7 marks]
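As a small (non-examinable) numerical illustration of the definitions above, with two arbitrary distributions over a three-element event space:

```python
import numpy as np

# Entropy, cross-entropy and KL divergence for two small distributions
# over the same event space (values chosen arbitrarily for illustration).
p = np.array([0.5, 0.3, 0.2])   # treated as 'ground truth'
q = np.array([0.4, 0.4, 0.2])

H_p = np.sum(p * np.log(1 / p))    # average surprise under p
H_pq = np.sum(p * np.log(1 / q))   # q's surprise, averaged under p
KL = np.sum(p * np.log(p / q))     # extra surprise from believing q

# KL(p||q) = H(p, q) - H(p), and it is non-negative (Gibbs' inequality).
print(np.isclose(KL, H_pq - H_p), KL >= 0)
```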

(c) You are given a sequence $\mathbf{x} := \{x^{(1)}, \ldots, x^{(N)}\}$ of heads ($x^{(i)} = H$) and tails ($x^{(i)} = T$) which are the outcomes of $N$ tosses of a (potentially biased) coin. All possible outcomes of $N$ coin tosses would constitute the event space $\mathcal{X}$. A binomial distribution $B(N, \theta)$ sets the probability of occurrence of $n_1$ events of type 1, and $n_2 = N - n_1$ of type 2, as

$$P(n_1, n_2 \mid N, \theta) = \frac{N!}{n_1! \, n_2!} \, \theta^{n_1} (1 - \theta)^{n_2}.$$

Describe how you would fit the data $\mathbf{x}$ to a binomial distribution using maximum likelihood estimation and find the result to be the same as the empirical distribution $\tilde{p} = (\tilde{p}_H, \tilde{p}_T)$:

$$\hat{\theta} = \frac{n_H}{N} =: \tilde{p}_H.$$

[8 marks]


(d) Use Bayes' theorem to write the posterior probability $P(\theta \mid \mathbf{x})$ of the parameters upon observing some data $\mathbf{x}$. Discuss how a conjugate Beta prior

$$P(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{\int_0^1 \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \, d\theta}$$

introduces "pseudo-counts" to affect the estimation of parameters of the binomial distribution as above, and combats the problem of overfitting, particularly for data sets of small size $N$.

(You won't need this information, but the definition of $\Gamma(x)$ is given below.)

$$\Gamma(x) = \int_0^\infty t^{x - 1} e^{-t} \, dt, \quad \text{and} \quad \Gamma(n) = (n - 1)! \text{ for integer } n.$$

[7 marks]
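A tiny (non-examinable) illustration of the pseudo-count effect, with arbitrary prior parameters and a deliberately extreme small sample:

```python
# Beta-binomial pseudo-count sketch (alpha, beta values are illustrative).
# With a Beta(alpha, beta) prior, the posterior after n_H heads in N
# tosses is Beta(alpha + n_H, beta + N - n_H): the prior effectively
# adds alpha extra heads and beta extra tails to the counts.
alpha, beta = 2.0, 2.0
n_H, N = 0, 3   # small sample, all tails: the MLE says theta = 0 exactly

theta_mle = n_H / N
# Posterior mean of theta under the conjugate Beta prior:
theta_post_mean = (alpha + n_H) / (alpha + beta + N)

# The pseudo-counts pull the estimate away from the overfit extreme 0.
print(theta_mle, theta_post_mean)
```

With only three observed tails, the MLE commits to an impossible coin; the prior's pseudo-counts keep the estimate strictly inside $(0, 1)$.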

3. (a) You are given a dataset $X = \{\mathbf{x}_i\}_{i=1}^{N}$ where $X$ is an $(N \times p)$ matrix, each row of which, $\mathbf{x}_i$, is a $p$-dimensional vector of real-valued features associated with an individual $i$. Explain in detail how you would use Principal Components Analysis (PCA) to reduce the dimensionality $p$ of the feature set. You have to (i) introduce the covariance matrix, (ii) discuss what role its eigenvalues play, (iii) explain how to perform the projection onto a low-dimensional subspace and (iv) comment on the magnitudes $p$ and $N$. [12 marks]
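The steps (i)-(iii) above can be sketched (non-examinably) on synthetic data; the sizes and choice of $k$ are my own:

```python
import numpy as np

# PCA sketch following steps (i)-(iii) in the question (synthetic data).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))           # N = 100 rows, p = 5 features

# (i) centre the data and form the p x p covariance matrix
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / (X.shape[0] - 1)

# (ii) eigenvalues give the variance captured along each eigenvector
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# (iii) project onto the top-k eigenvectors for a k-dimensional summary
k = 2
Z = Xc @ eigvecs[:, :k]

print(Z.shape, bool(eigvals[0] >= eigvals[-1]))
```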


(b) For classification problems where there is a training set of labelled data, explain why dimensionality reduction using PCA may give rise to poor classification outcomes. You may draw a figure to illustrate your arguments. [4 marks]

(c) A uniform random variable $U(a, b)$ has mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$. For any sample $\mathbf{x}$ of $N$ numbers $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$ drawn from a uniform distribution $U(0, 2)$, its average $\langle x \rangle$ is a random variable. What are the mean and variance of this random variable? What is the relevance to machine learning implied by the dependence of the result on sample size $N$? [4 marks]

(d) A 2-dimensional Gaussian distribution is defined by the probability density function of $X = (X_1, X_2)^T$, parameterised by its mean $\mu = (\mu_1, \mu_2)^T$ and variance-covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix},$$

$$p(X \mid \mu, \Sigma) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right).$$

Draw 3 contour plots of equiprobable values of $(X_1, X_2)$ for the cases (a) $\rho = 0$, (b) $\rho < 0$ and (c) $\rho > 0$, providing reasons for doing so. [5 marks]


4. (a) Describe the k-means clustering algorithm. [7 marks]
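A compact (non-examinable) sketch of the algorithm being asked about, on synthetic data; the deterministic initialisation is a naive choice for reproducibility (random initialisation is typical):

```python
import numpy as np

# Plain k-means: assign each point to its nearest centroid, then move
# each centroid to the mean of its assigned points, until convergence.
def kmeans(X, k, iters=100):
    # naive deterministic init: evenly spaced rows of X
    centroids = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(len(set(labels[:30].tolist())), len(set(labels[30:].tolist())))
```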

(b) In Fisher's Linear Discriminant Analysis (LDA), training data ($\mathbf{x}_i^A$ and $\mathbf{x}_j^B$ from 2 classes $A$ and $B$) is projected along some vector $\mathbf{w}$ so that $\mathbf{x}_i^A$, $\mathbf{x}_j^B$ are mapped into

$$y_i^A := \mathbf{w}^T \mathbf{x}_i^A, \; i = 1, \ldots, N_A, \quad \text{and} \quad y_j^B := \mathbf{w}^T \mathbf{x}_j^B, \; j = 1, \ldots, N_B$$

respectively. Mean vectors and variance-covariance matrices are computed from the training data. These empirical means are denoted $\mathbf{m}_A$ and $\mathbf{m}_B$ and the variance-covariance matrices are called $S_A$ and $S_B$ respectively.

Let $\mu_A$ and $\mu_B$ be the (scalar) means for the values $\{y_i^A\}$ and $\{y_j^B\}$, and $\sigma_A$, $\sigma_B$ the corresponding variances. These will, of course, depend on the choice of vector $\mathbf{w}$. What is the quantity that LDA seeks to maximise in order to achieve classification accuracy? Explain why that is a good choice. [5 marks]

(c) Express this quantity as a ratio of the form:

$$R(\mathbf{w}) = \frac{\mathbf{w}^T U \mathbf{w}}{\mathbf{w}^T V \mathbf{w}}$$

by finding the suitable $U$, $V$. [4 marks]


(d) Using

$$\frac{\partial R(\mathbf{w})}{\partial \mathbf{w}} = 0,$$

show that the optimal vector $\mathbf{w}$ is in the direction of $V^{-1}(\mathbf{m}_A - \mathbf{m}_B)$. If the data from each class $A$ and $B$ were scattered isotropically, $V$ would be proportional to the identity matrix. Hence, interpret the result. [4 marks]
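A brief (non-examinable) numerical illustration of the Fisher direction, with $V$ taken as the pooled within-class scatter $S_A + S_B$ (a common concrete choice; the synthetic class shapes are mine):

```python
import numpy as np

# Fisher LDA direction sketch: w proportional to V^{-1}(m_A - m_B),
# here with V = S_A + S_B, the pooled within-class scatter.
rng = np.random.default_rng(5)
XA = rng.normal([0, 0], [1.0, 0.2], size=(200, 2))
XB = rng.normal([2, 1], [1.0, 0.2], size=(200, 2))

mA, mB = XA.mean(axis=0), XB.mean(axis=0)
SA = np.cov(XA, rowvar=False) * (len(XA) - 1)   # class scatter matrices
SB = np.cov(XB, rowvar=False) * (len(XB) - 1)

# Optimal projection direction (up to scale):
w = np.linalg.solve(SA + SB, mA - mB)

# The projected class means should be well separated relative to the
# projected within-class spread -- the ratio LDA maximises.
yA, yB = XA @ w, XB @ w
separation = abs(yA.mean() - yB.mean()) / np.sqrt(yA.var() + yB.var())
print(separation > 1.0)
```

Note how the direction leans toward the low-variance coordinate, exactly the "interpret the result" point when $V$ is not a multiple of the identity.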

(e) You are given a scatter plot of 2-dimensional labelled data (labels "x" and "o") as shown in the figure, and equiprobable contour plots of their best-fit Gaussian distributions.

[Figure: a scatter of overlapping "x" and "o" points in the plane, both axes running from -6 to 6.]

Explain how you would find an optimal direction for projecting the data onto to linearly separate the data into two classes. You may use the answers to the previous parts to comment on your choices. [5 marks]


5. Suppose you are given the 2-class dataset in the figure below.

[Figure: a 2-D scatter plot of Class 1 points (marked "+") and Class 2 points (marked "*").]

(a) Would a perceptron be a suitable classifier for this dataset? Briefly explain your answer. [3 marks]

(b) Would a neural network with 1 hidden layer and no bias be a suitable classifier for this dataset? Briefly explain your answer. [3 marks]

(c) Assume you now have a fully functional neural network architecture in place and you are training your model. How can you check to see if you are overfitting? Explain exactly what you could plot from your output and what on this plot may indicate overfitting. [7 marks]



Now assume you are working with a different dataset of 1000 images. These images are from the MNIST dataset and consist of handwritten digits from 0 to 9. Some samples are given below. Each one of the images consists of 28 × 28 pixels, each with a value of 0 or 1, and you want to build a neural network to classify the digits. Note, for any of the answers below, if you do not have a calculator and cannot work out the final number you will not be penalised; simply write out the simplest form of the equation that answers the question.

[Figure: sample MNIST digit images.]

(d) How many inputs would you need in your neural network? [3 marks]

(e) How many neurons would you need in your output layer? [3 marks]

(f) Explain how you would know if your model has converged. [6 marks]
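The sizing asked about in the two "how many" questions above comes down to simple arithmetic, sketched (non-examinably) here:

```python
# Input/output sizing for the MNIST classifier described above.
# Each 28 x 28 binary image is flattened to one input per pixel,
# and the output layer has one neuron per digit class 0-9.
n_inputs = 28 * 28
n_outputs = 10
print(n_inputs, n_outputs)
```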
