程序辅导案例 > Machine Learning >

SEMESTER 1 EXAMINATION 2017-2018 MACHINE LEARNING

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

C:\Users\92466\Desktop\COMP3206W1_COMP3206_201718_01

UNIVERSITY OF SOUTHAMPTON COMP3206W1

DURATION 120 MINS (2 Hours)

All answers must be written within the designated spaces in this booklet.

No credit will be given for answers presented elsewhere.

You are advised to write using a soft pencil so that you may readily correct mistakes with an eraser.

This paper contains 5 questions.

Answer ANY FOUR, each worth 25% of the total marks of the examination.

An outline marking scheme is shown in brackets to the right of each question. This examination provides 80% of the module’s marks.

University approved calculators MAY be used.

A foreign language translation dictionary (paper version) is permitted provided it contains no notes, additions or annotations

2 COMP3206W1

This page is intentionally left blank—do not write in this space •

COMP3206W1

1

i=1
1. You are given a dataset {(xi, yi)}N
  
  where xi is a p-dimensional vector of
  real-valued features associated with an individual i and yi is a real-valued
  “output" variable for i. Show how you would set up a linear regression
  model by expressing yi as a function of xi. [4 marks]
2. Express the learning problem in terms of a design matrix A, so that Aw =
  y. You must outline what the size of the design matrix is, and what each of its rows represents. The vector of residuals is orthogonal to the column space of A. Show how this leads to the solution which is written out in terms of the pseudo-inverse
  A+ = .AT A.—1 AT .
  
  [6 marks]
  
  TURN OVER
COMP3206W1
1. If the design matrix has a singular value decomposition
  k
  A = U ⌃V T = X ukokvT
  k
  where U = (u1, ·· · , ur ) is a matrix composed of N ⇥ 1 column vectors and
  V = (v1, ·· · , vr ) is a matrix of p-dimensional column vectors, express A+
  in terms of U , V and ⌃. Rewrite y, w as linear combinations of the columns
  ui, vi of the matrices U , V respectively:
  w = X ↵ivi, y = X Øiui
  i i
  
  to express w in terms of the elements in the SVD. Comment on how the singular values in ⌃ might affect the outcome of the learning. [10 marks]
COMP3206W1
1. Show how the solution to the linear regression problem can be obtained from an optimisation approach where a weight vector w is chosen to min- imise the loss function
l(w) = (Aw — y)T (Aw — y)+ LwT w.
Why is the ‘regularization’ term involving L, that was not indicated in the dataset (xi, yi), introduced? [5 marks]

TURN OVER
COMP3206W1

2
1. For a probability distribution p over an event space X p := {pi|i 2 X }, explain why the entropy H(p)
  
  1
  H(p) = X pi log( )
  i
  p
  i
  is often interpreted as the average degree of ‘surprise’. [3 marks]
2. For 2 probability distributions p = {pi|i 2 X } and q = {qi|i 2 X } over the same event space X the cross-entropy H(p, q) and the Kullback-Leibler (KL) divergence KL(pkq) are defined as:
  1
  H(p, q) = X pi log(
  i
  q
  i
  X
  pi
  q
  ) and KL(pkq) = H(p, q) — H(p) = pi log( ).
  i i
  Provide some intuition for what each of the two metrics over pairs of prob- ability distributions captures. You may treat the distribution p as the one representing ‘ground truth.’ [7 marks]
COMP3206W1
1. You are given a sequence x := {x(1),. .., x(N )} of heads (x(i) = H) and tails (x(i) = T ) which are the outcomes of N tosses of a (potentially biased) coin. All possible outcomes of N coin tosses would constitute the event space X . A binomial distribution B(N, ✓) sets the probability of occurrence of n1
  events of type 1, and n2 = N — n1 of type 2 as
  N ! n1 n2
  n
  P (n1, n2|N, ✓) =
  1
  ✓
  !n2!
  (1 — ✓) .
  Describe how you would fit the data X to a binomial distribution using maxi- mum likelihood estimation and find the result to be the same as the empirical distribution p˜ = (p˜H , p˜T ):
  nH
  ✓⇤ =
  N
  =: p˜H
  . [8 marks]
  
  TURN OVER
COMP3206W1
1. Use Bayes’ theorem to write the posterior probability P (✓|x) of the parame- ters ✓ upon observing some data x. Discuss how a conjugate Beta prior
✓↵—1(1 — ✓)Ø—1
R 1
F(↵ + Ø)
= ✓
F(↵)F(Ø)

↵—1
(1 — ✓)Ø—1
0 ✓↵—1(1 — ✓)Ø—1d✓
introduces “psuedo-counts" to affect the estimation of parameters of the binomial distribution as above, and combats the problem of overfitting, par- ticularly for data sets of small size N .
(You won’t need this information, but the definition of F(x) is given below.)
Z 1
F(x) =
0
tx—1e—tdt, and F(n) = (n — 1)! for integer n.

[7 marks]
COMP3206W1

3

i=1
1. You are given a dataset X = {xi}N
  
  where X is a (N ⇥ p) matrix, each row
  of which xi is a p-dimensional vector of real-valued features associated with
  an individual i. Explain in detail how you would use Principal Components
  Analysis (PCA) to reduce the dimensionality p of the feature set. You have to
  1. introduce the covariance matrix, (ii) discuss what role its eigenvalues play,
    (iii) explain how to perform the projection onto a low-dimensional subspace and (iv) comment on the magnitudes p and N . [12 marks]
    
    TURN OVER
COMP3206W1
1. For classification problems where there is a training set of labelled data, explain why dimensionality reduction using PCA may give rise to poor clas- sification outcomes. You may draw a figure to illustrate your arguments.
  [4 marks]
  
  2
2. A uniform random variable U (a, b) has mean a+b
  12
  2
  and variance (b—a)
  . For
  any sample x of N numbers x = {x1, x2,. .., xN } drawn from a uniform distribution U (0, 2) its average < x > is a random variable. What is the mean and variance of this random variable? What is the relevance to ma- chine learning implied by the dependence on sample size N of the result.
  [4 marks]
COMP3206W1
- o2
1. A 2-dimensional Gaussian distribution is defined by the probability density function of X = (X1, X2)T parameterised by its mean µ = (µ1, µ2)T and
variance-covariance matrix ⌃ =
1 ⇢o1o2 ◆.
2
⇢o1o2 o2
1 ✓ 1
◆
T 1
1
2
—
p(X|µµ, ⌃) = 2⇡o o p1
exp
⇢2
— 2 (X — µ) ⌃
(X — µ) .
Draw 3 contour plots of equiprobable values of (X1, X2) for the cases (a)
⇢ = 0, (b) ⇢ < 0 and (c) ⇢ > 0 providing reasons for doing so. [5 marks]

TURN OVER
COMP3206W1

4
1. Describe the k-means clustering algorithm. [7 marks]
COMP3206W1

A
1. In Fisher’s Linear Discriminant Analysis (LDA), training data – xi
  B
  and xj
  from 2 classes A and B – is projected along some vector w so that xi , xj
  A B
  are mapped into
  
  yi T i
  j T j
  A := w
  xA,i = 1,. .., NA, and yB := w
  xB ,j = 1,. .., NB
  respectively. Mean vectors and variance-covariance matrices are computed from the training data. These empirical means are denoted mA and mB and the variance-covariance matrices are called SA and SB respectively.
  Let µA and µB be the (scalar) means for the values {yi } and {yj }, and
  A B
  oA, oB the corresponding variances. These will, of course, depend on the
  choice of vector w. What is the quantity that LDA seeks to maximise in
  order to achieve classification accuracy? Explain why that is a good choice.
  [5 marks]
2. Express this quantity as a ratio of the form:
  wT Uw
  R(w) =
  wT Vw
  by finding the suitable U , V . [4 marks]
  
  TURN OVER
COMP3206W1
1. Using
  
  @R(w)
  @w
  
  2
  = (wT V w)2
  .(wT V w)Uw — (wT U w)V w.
  show that the optimal vector w is in the direction of V —1(mA — mB ). If the data from each class A and B were scattered isotropically, V would be proportional to the identity matrix. Hence, interpret the result. [4 marks]
COMP3206W1
1. You are given a scatter plot of 2-dimensional labelled data (labels "x" and "o") as shown in the figure, and equiprobable contour plots of their best fit Gaussian distributions.
6

o
4
o x
o
o
2 o o
o o oo
o
x x
xox x
0 o oo x xx
o xox x
oo o
x
x
o
-2 xo o x o x x
x xx x
-4 x x
o x
-6
-6 -4 -2 0 2 4 6

Explain how you would find an optimal direction for projecting the data onto to linearly separate the data into two classes. You may use the answers to the previous parts to comment on your choices. [5 marks]

TURN OVER
COMP3206W1

5 Suppose you are given the 2 class dataset in the figure below.

+ Class 1

* Class 2 * *

+ * *

* *

+ +

Would a perceptron be a suitable classifer for this dataset? Briefly explain your answer [3 marks]
Would a neural network with 1 hidden layer and no bias be a suitable clas- sifier for this dataset? Briefly explain your answer [3 marks]
1. COMP3206W1
Assume you now have a fully functional neural network architecture in place and you are training your model. How can you check to see if you are overfitting? Explain exactly what you could plot from your output and what on this plot may indicate overfitting. [7 marks]

NOTE: THERE ARE 3 MORE PARTS TO THIS QUESTION [(d), (e) and (f)]
on the next page!!

TURN OVER
1. COMP3206W1
Now assume you are working with a different dataset of 1000 images. These images are from the MNIST dataset and consist of handwritten digits from 0 to 9. Some samples are given below. Each one of the images consists of 28 x 28 pixels each with a value of 0 or 1 and you want to build a neural network to classify the digits. Note, for any of the answers below, if you do

not have a calculator and cannot work out the final number you will not be penalised. Simply write out the simplest form of the equation to the answer of the question.
How many inputs would you need in your neural network? [3 marks]
How many neurons would you need in your output layer? [3 marks]
Explain how you would know if your model has converged? [6 marks]

SEMESTER 1 EXAMINATION 2017-2018 MACHINE LEARNING

All answers must be written within the designated spaces in this booklet.

No credit will be given for answers presented elsewhere.

y. You must outline what the size of the design matrix is, and what each of its rows represents. The vector of residuals is orthogonal to the column space of A. Show how this leads to the solution which is written out in terms of the pseudo-inverse

A+ = .AT A.—1 AT .

TURN OVER

A = U ⌃V T = X ukokvT

where U = (u1, ·· · , ur ) is a matrix composed of N ⇥ 1 column vectors and

V = (v1, ·· · , vr ) is a matrix of p-dimensional column vectors, express A+

in terms of U , V and ⌃. Rewrite y, w as linear combinations of the columns

ui, vi of the matrices U , V respectively:

w = X ↵ivi, y = X Øiui

l(w) = (Aw — y)T (Aw — y)+ LwT w.

TURN OVER

TURN OVER

You are given a dataset X = {xi}N

TURN OVER

TURN OVER

wT Uw

wT Vw

TURN OVER

= (wT V w)2

.(wT V w)Uw — (wT U w)V w.

show that the optimal vector w is in the direction of V —1(mA — mB ). If the data from each class A and B were scattered isotropically, V would be proportional to the identity matrix. Hence, interpret the result. [4 marks]

TURN OVER

TURN OVER

END OF PAPER