Page 1 of 7 MAY 2021 48 HOUR ASSESSMENT SCHOOL OF COMPUTER SCIENCE MODULE CODE: CS5014 MODULE TITLE: Machine Learning TIME TO HAND IN: 48 hours EXAM INSTRUCTIONS a. Answer all questions b. Each question indicates the number of marks it carries. The paper carries a total of 60 marks. This assessment consists of exam-style questions and you should answer as you would in an exam. As such, citations of sources are not expected, but your answers should be from your own memory and understanding and significant stretches of text should not be taken verbatim from sources. Any illustrations or diagrams you include should be original (hand or computer drawn). You may word-process your answers, or hand-write and scan them. In either case, please return your answers as a single PDF. If you handwrite, please make sure the pages are legible, the right way up and in the right order. Your submission should be your own unaided work. While you are encouraged to work with your peers to understand the content of the course while revising, once you have seen the questions you should avoid any further discussion until you have submitted your results. You must submit your completed assessment on MMS within 48 hours of it being sent to you. Assuming you have revised the module contents beforehand, answering the questions should take no more than three hours. Page 2 of 7 1. Machine Learning Concepts: (a) A machine learning model is associated with the following cost function: J() = !()")#"$! +(,% − & − '%0()%$! + (("(#"$! where % is a vector representing the th training sample, % is a scalar representing the target variable (output) for the ith sample, is the number of training samples, and is the number of features. The |∙| operator represents taking the absolute value. Answer the following questions, always explaining your reasoning: (i) Identify all loss terms and regularisation terms. [2 marks] (ii) What is the equation of the model's prediction function f(x)? Is it a linear or a non-linear model and why? [2 marks] (iii) Is this a regression or classification model and why? [2 marks] (iv) What are the parameters and hyperparameters of your model? [2 marks] (v) How would you determine the exact values of parameters and hyperparameters to use in your final model? [2 marks] (b) Do you agree or disagree with the following statements and why? (i) Method A is better than method B if A's training accuracy is better than B's training accuracy. [2 marks] (ii) A neural network is a non-linear model as long as it has one or more hidden layers. [2 marks] (iii) By using Newton's method, both linear regression and logistic regression will converge in one iteration. [2 marks] (iv) The following approach is a correct way to reduce dimensionality from % ∈ ℝ* to ℝ( while retaining as much variation as possible: • zero-centre the data: % ← % − !)∑ %)%$! • calculate the scatter matrix • find the n pairs of eigenvalues/eigenvectors {% , %} of • find the two smallest eigenvalues and corresponding eigenvectors vectors !, ( • output % = @(%)'!, (%)'(A for = 1, . . . , [2 marks] Page 3 of 7 (v) Given the loss function (&, !) = F(,% − !% − &0()%$! H + 100000!&&&&&&( the fitted model corresponds to the dashed red line in the following plot: [2 marks] [Total marks 20] 0 2 4 6 8 10 0 20 40 60 x y Page 4 of 7 2. Support Vector Machines The plot shown in Figure 1 shows the results of applying a Support Vector Machine classifier on a 2-dimensional training set. The solid line indicates the decision boundary between the two classes, and dotted lines represent the margin. Each data point is identified by a number. (a) Based on this plot, answer the following questions, showing your reasoning and any calculations: (i) Calculate classification accuracy, precision, recall, and F1-score on the training data in Figure 1. [4 marks] (ii) Is this a soft-margin or a hard-margin solution and does it use a kernel? [2 marks] (iii) Identify all support vectors and all non-zero slack variables in Figure 1. [2 marks] (iv) Imagine that you retrain the model after removing point 10. What effect would this have on the decision boundary and the margin? [2 marks] (b) The SVM classification equation is given by: () = & +(,M%N, %O-%$! , where 〈, %〉 is the scalar (dot) product of vectors and %. (i) How does the presence of the dot product allow us to use the kernel trick? [2 marks] (ii) Based on Figure 1, list all training points % which are needed to classify a new data point . [2 marks] 1 2 3 4 6 7 8 5 9 10 11 Figure 1 Page 5 of 7 (c) The equation for a third-order polynomial kernel is given by: () = (1 + 〈, ′〉)., where 〈, ′〉 is the scalar (dot) product of vectors and ′. Assuming two-dimensional input samples = [!, (]' , this kernel represents a polynomial expansion into a higher-dimensional feature space [ℎ!(),⋯ , ℎ/()]'. (i) How can you verify that the equation above for () is a valid kernel that can be used in a Support Vector Machine? [2 marks] (ii) What is the number of dimensions R of the expanded features and what is their form? [2 marks] (iii) Once a model has been trained using this kernel, will it be faster or slower to evaluate on new data compared to a linear model? Why? [2 marks] [Total marks 20] Page 6 of 7 3. Unsupervised Learning (a) What is the relationship between K-means and the EM algorithm for finite mixture of Gaussian models? [4 marks] (b) Give an example where you expect the EM algorithm to outperform K- means, and an example where you expect K-means to perform better. Explain your reasoning and show any plots you find useful. [4 marks] (c) James wants to cluster a dataset by using a finite mixture of Gaussians. He has tried K=1, 2, 3, ..., 10 and found the following plot of log-likelihood (Figure 2) as a function of K: Figure 2 Log-likelihood of mixture Gaussian models with different K Based on Figure 2, he decides to choose K=10 as the value with the highest log-likelihood. Do you agree or disagree with this decision? If you agree, explain why. If you disagree, how would you choose K? [2 marks] (d) Write an EM algorithm with a hard assignment E-step in pseudo code for a finite mixture of K Gaussians where the prior distribution of the membership % is the same across the K components, i.e. ! = ( =. . . = 0. Make sure you clearly state all the assumptions you make. [6 marks] 2 4 6 8 10 −2 40 −2 30 −2 20 −2 10 −2 00 −1 90 K log lik eli ho od Page 7 of 7 (e) A von-Mises Fisher distribution is a popular distribution to model directional vectors i.e. norm one unit vectors: ∈ ℝ* and ‖‖( = 1 . Its density can be written as: () = ∙ exp('), where c is a normalisation constant and u is the parameter of the distribution called mean vector ( itself is a norm one directional vector, i.e. ∈ ℝ* and ‖‖( = 1). Its maximum likelihood estimator is: 12 = ∑ %)%$!d∑ %)%$! d(, where || ∙ ||( denotes the ( norm: ‖‖( = f!( + (( + . . . + *(. Its weighted ML estimator is: 3 = ∑ %%)%$!d∑ %%)%$! d( . Write the E step and M step in pseudo code for an EM algorithm that fits a finite mixture of von Mises Fisher distributions (of the given type). Make sure you clearly state all the assumptions you have made. [4 marks] [Total marks 20] *** END OF PAPER ***
欢迎咨询51作业君