G53MLE-E1 G53MLE-E1 Turn over The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 3 MODULE, AUTUMN SEMESTER 2016-2017 MACHINE LEARNING (G53MLE) Time allowed TWO hours Candidates may complete the front cover of their answer book and sign their desk card but must NOT write anything else until the start of the examination period is announced Answer All 3 Questions Dictionaries are not allowed with one exception. Those whose first language is not English may use a standard translation dictionary to translate between that language and English provided that neither language is the subject of this examination. Subject specific translation dictionaries are not permitted. Only silent, self-contained calculators with a Single-Line Display or Dual-Line Display are permitted in this examination No electronic devices capable of storing and retrieving text, including electronic dictionaries, may be used. DO NOT turn your examination paper over until instructed to do so INFORMATION FOR INVIGILATORS: - Please ensure that Exam Paper and Answer Book are both submitted G53MLE-E1 G53MLE-E1 G53MLE-E1 G53MLE-E1 Question 1: Foundations of Machine Learning [overall 34 marks] (a) (i) give the formula for a multivariate linear regressor with a 3rd degree polynomial and 2 variables. (ii) See the data in the table below of paired values x1 and x2. If the goal is to predict x2 from x1, what type of linear regressor would be used here? Give its general formula (you don’t need to find the actual values of any weights or parameters). (iii) Give the formula for the generalised linear basis function. (iv) Why would you want to introduce non-linear basis functions in linear regression? x1 x2 0 2 5 29.5 10 107 15 234.5 20 412 25 639.5 30 917 (8 marks) (b) Consider a Linear Discriminant Analysis classifier with two parameters to learn a weight for – one weight per parameter. Two different ways of learning the parameters of this algorithm are brute force search and gradient descent. Describe how each works by (i) giving pseudocode and the formula for the update of gradient descent and (ii) pseudocode for brute force search, clearly addressing the range of values tested. (iii) Explain for both gradient descent and brute force search what the biggest drawback is in using them. (iv) Include a sketch illustrating how gradient descent works for a quadratic error function. Include in your sketch a visualisation of the termination criterion. (14 marks) (c) Explain why in Data Mining one is never able to rely purely on machine-measurable objective functions. Use in your explanation the criteria for interesting patterns. (6 marks) (a) (i) Draw a set of 2-dimensional linearly separable data points with binary labels, and clearly indicate the decision boundary found by a maximum-margin classifier, and at least one decision boundary that does not obtain a maximum margin but still represents a perfect classifier. (ii) Explain how it’s possible to obtain such a non-maximum margin solution, using the concept of a loss function. (6 marks) G53MLE-E1 G53MLE-E1 Question 2: Artificial Neural Networks and Deep Learning [overall 33 marks] Below is given a training set of eruption duration and time to next eruption for two type of geyser eruptions of The Old Faithful. This will be used for question 2a. (a) Draw a diagram of an ANN’s topology that can learn this pattern based on the given data, using a single hidden layer with three steerable units plus any additional necessary nodes. Name all relevant elements using indices that indicate source and target layer numbers, and ensure you account for biases. Initialise all weights to 1. (6 marks) (b) Design a Deep-Learning architecture using Convolutional Neural Networks as one of a number of components for the task of classifying images into images that contain cats and images that don’t. Name the different layers/components of the network; explain what function they have, and provide the cost function used in the output layer to inform back-propagation. Use at least one convolutional layer, one dimensionality reduction layer, and any other layers you deem necessary. (12 marks) (c) Explain how the ReLU revolutionised Deep Learning, by relating it to the concept of the vanishing gradient. Sketch the activation function of the ReLU and compare this with at least one other activation function to illustrate your explanation. (11 marks) (d) Consider a CNN that applies a single channel convolutional layer with a 3x3 kernel to a 10x10 input image, followed by a MaxPool layer. How many weighted connections come out of this network? (4 marks) X1: Eruption duration (min) X2: Time to next eruption (min) Y: Eruption Type 3.6 79 1 1.8 54 2 3.3 74 1 2.3 62 3 4.5 85 1 2.9 55 2 4.7 88 1 3.6 85 1 2.0 51 3 4.4 85 3 1.8 54 2 G53MLE-E1 G53MLE-E1 Question 3: Graphs, Dynamics, and Decision Trees (33 marks) (a) In learning decision trees, one has to find for every node the query that maximises the change in entropy, where this change in entropy is given by: For the partially learned tree displayed below, with two classes named c1 and c2, calculate the values of , and the change in entropy for the split made at the root node. The numbers in each node indicate how many examples of that class there are. (12 marks)