IAML INFR10069 (LEVEL 10): Assignment #1 Due on Tues, October 20, 2020 @ 16:00 NO LATE SUBMISSIONS IMPORTANT INFORMATION N.B. This document is best viewed on a screen as it contains a number of (highlighted) clickable hyperlinks. It is very important that you read and follow the instructions below to the letter. You will be deducted marks for not adhering to the advice below. Good Scholarly Practice: Please remember the University requirement regarding all assessed work for credit. Details about this can be found at: http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct Specifically, this assignment should be your own individual work. Moreover, please note that Piazza is NOT a forum for discussing the solutions of the assignment. You may, in exceptional circumstances, ask private questions to the instructors if you deem that something may be incorrect, and if we feel that the issue is justified, we will send out an announcement. General Instructions • There are two versions of this assignment. One for INFR10069 (level 10) and the other for INFR11182 (level 11). The level 11 version has some additional parts. MAKE SURE you are doing the assignment that corresponds to the course you are registered on; you can check this on EUCLID. • You should use Python for implementing your solutions as this will standardise the output and also provide a consistent experience with the labs. Set up your environment as specified in the Labs. It is VERY IMPORTANT that you use IAML INFR10069 (LEVEL 10) the exact same package versions as those specified in the requirements file from the labs! Using the correct environment (i.e. py3iaml) is necessary to ensure that your outputs are consistent with the expected solutions. The correct package versions are specified here. • If running import sklearn; print(sklearn.__version__) in your Jupyter Note- book does not print the package version 0.19.1, then you are not using the correct environment. • This assignment consists of multiple questions. MAKE SURE to use the correct dataset for each question. • This assignment accounts for 20% of your final grade for this course and is graded based on a written report (compiled from a latex template which we provide). • The criteria on which you will be judged include the quality of the textual answers and/or any plots asked for. While code will be needed to generate the results, this is not a programming assignment, and you are not expected to provide code unless explicitly requested. • Read the instructions carefully, answering what is required and only that. Keep your answers brief and concise. Specifically, for textual answers, the size of the text-box in the latex template will give you an idea of themaximum length of your answer. You do not need to fill in the whole text-box but you will be penalised if you go over. This does not apply to figure-based answers. • For answers involving figures, make sure to clearly label your plots and provide legends where necessary. You will be penalised if the visualisations are not clear. • For answers involving numerical values, use correct units where appropriate and format floating point values to an appropriate number of decimal places. Submission Mechanics Important: You must submit this assignment by Tuesday 20/10/2020 at 16:00. We do not accept Late Submissions for this coursework, except in the case of mitigating circumstances. Please refer to the ITO Website for further details. • We will use the Gradescope submission system for uploading PDF assignments. Information describing how to upload your completed assignment will be made available on the IAML Learn page. • You should clone or download the Assignment Repository from https://github.com/ uoe-iaml/INFR10069-2020-CW1. This contains: 1. The data you will need for the assignment under the data directory. Page 2 of 12 IAML INFR10069 (LEVEL 10) 2. Two tex files, Assignment_1.tex and style.tex. These provide the template for you to fill out the assignment questions. In particular, the template forces your answers to appear on separate pages and also controls the length of textual answers. • You should only modify the Assignment_1.tex template by: 1. Uncommenting and specifying your student number at the top of the document (compilation will automatically fail if you forget to do this). Remove the `%' and enter your student number e.g. \newcommand{\assignmentAuthorName}{s1234567} 2. Filling in the answers in the provided answerbox environment. DO NOT modify anything else in the template and certainly DO NOT edit the style file. We reserve the right to not mark assignments which do not adhere to the template. Latex Tips • To fill in text answers, you can modify the corresponding answerbox: \ begin {answerbox }{5em} Your answer here \end{answerbox} with your answer (replacing `Your answer here'): \ begin {answerbox }{5em} Steam locomot ives were f i r s t developed in the United Kingdom during the e a r l y 19 th century and used f o r ra i lway t ranspo r t u n t i l the middle o f the 20 th century . \end{answerbox} which, when compiled gives: Steam locomotives were first developed in the United Kingdom during the early 19th century and used for railway transport until the middle of the 20th century. • To add an image, you can use: \ begin {answerbox }{18em} This image shows a t r a i n . \ begin { cente r } \ i n c l udeg r aph i c s [ width=0.7\ textwidth ] { stock_image . jpg } \end{ cente r } \end{answerbox} Page 3 of 12 IAML INFR10069 (LEVEL 10) which will be compiled to: This image shows a train. Make sure that you specify the correct path to your image. For example. if your image was stored in a directory called results, you would change the relevant line to read: \ i n c l udeg r aph i c s [ width=0.7\ textwidth ] { r e s u l t s / stock_image . jpg } You can find more information about inserting images into latex documents here. • You can also add two images side-by-side: \ begin {answerbox }{18em} Below we see two t r a i n s . \ begin { cente r } \ begin { tabu la r }{ l l } \ i n c l udeg r aph i c s [ width=0.4\ textwidth ] { stock_image . jpg } & \ in c l udeg r aph i c s [ width=0.4\ textwidth ] { stock_image . jpg } \end{ tabu la r } \end{ cente r } \end{answerbox} which will be compiled to: Page 4 of 12 IAML INFR10069 (LEVEL 10) Below we see two trains. • To add an inline equation, you can use the `$' symbol to write: \ begin {answerbox }{3em} I am us ing the f o l l ow i ng model , $y = \mathbf{x}^T\mathbf{w}$ . \end{answerbox} which compiles to: I am using the following model, y = xTw. • To add a table for numerical results you can use: \ begin {answerbox }{7em} Resu l t s are presented in the tab l e below . \ begin { cente r } \ begin { tabu la r }{ | c | c | c | } \ h l i n e Parameter Value & Train Accuracy & Test Accuracy \\ \ h l i n e 1 & 10.1\% & 9.1\% \\ 2 & 12.5\% & 10.1\% \\ \ h l i n e \end{ tabu la r } \end{ cente r } \end{answerbox} which compiles to: Results are presented in the table below. Parameter Value Train Accuracy Test Accuracy 1 10.1% 9.1% 2 12.5% 10.1% You can find more information about tables in latex here. Page 5 of 12 IAML INFR10069 (LEVEL 10) • For a small number of questions we may ask you to report your code. You can include code as an image, but if you prefer you can use the following command: \ begin {answerbox }{5em} \begin {verbatim} import numpy as np mean_time = 10 .0 p r i n t ( `mean time ' , mean_time ) \end{verbatim} \end{answerbox} which, when compiled gives: import numpy as np mean_time = 10.0 print(`mean time', mean_time) • Once you have filled in all the answers, compile the latex document to generate the PDF that you will submit. You can use Overleaf, your favourite latex editor, or just run pdflatex Assignment_1.tex twice on a DICE machine to compile the PDF. Page 6 of 12 IAML INFR10069 (LEVEL 10) Question 1 : (22 total points) Linear Regression In this question we will fit linear regression models to data. Here we will investigate the relationship between the amount of time (in hours) each student in a class studied for an exam and their end of semester exam performance. We will model this relationship for each student using yi = φ(xi)w, where φ(xi) = [1, xi] is a row vector and w are the model parameters we will learn. Here, yi is the exam score for student i and xi is the amount of time they spent revising. The dataset is contained in regression_part1.csv. You should load it into a Pandas DataFrame using pandas.read_csv(). (a) (3 points) Describe the main properties of the data, focusing on the size, data ranges, and data types. . (b) (3 points) Fit a linear model to the data so that we can predict exam_score from revision_time. Report the estimated model parameters w. Describe what the parame- ters represent for this 1D data. For this part, you should use the sklearn implementation of Linear Regression. Hint: By default in sklearn fit_intercept = True. Instead, set fit_intercept = False and pre-pend 1 to each value of xi yourself to create φ(xi) = [1, xi]. . (c) (3 points) Display the fitted linear model and the input data on the same plot. . (d) (3 points) Instead of using sklearn, implement the closed-form solution for fitting a linear regression model yourself using numpy array operations. Report your code in the answer box. It should only take a few lines (i.e. <5). Hint: Only report the relevant lines for estimating w e.g. we do not need to see the data loading code. You can write the code in the answer box directly or paste in an image of it. . (e) (3 points) Mean Squared Error (MSE) is a common metric used for evaluating the performance of regression models. Write out the expression for MSE and list one of its limitations. Hint: For notation, you can use y for the ground truth quantity and yˆ ($\hat{y}$ in latex) in place of the model prediction. . (f) (3 points) Our next step will be to evaluate the performance of the fitted models using Mean Squared Error (MSE). Report the MSE of the data in regression_part1.csv for your prediction of exam_score. You should report the MSE for the linear model fitted using sklearn and the model resulting from your closed-form solution. Comment on any differences in their performance. Question 1 continued on next page. . . Page 7 of 12 IAML INFR10069 (LEVEL 10) . (g) (4 points) Assume that the optimal value of w0 is 20, it is not but let's assume so for now. Create a plot where you vary w1 from −2 to +2 on the horizontal axis, and report the Mean Squared Error on the vertical axis for each setting of w = [w0, w1] across the dataset. Describe the resulting plot. Where is its minimum? Is this value to be expected? Hint: You can try 100 values of w1 i.e. w1 = np.linspace(-2,2, 100). . Page 8 of 12 IAML INFR10069 (LEVEL 10) Question 2 : (18 total points) Nonlinear Regression In this question we will tackle regression using basis functions. Here we will look at a regression dataset where the attribute we want to predict (output) does not have a linear relationship with the input attribute (input) we can measure. To overcome this problem we will first use polynomial basis functions, where again yi = φ(xi)w. However, now the row vector φ(xi) = [1, xi, x 2 i , ..., x M i ], where M is an integer. The dataset is contained in regression_part2.csv. You should load it into a Pandas DataFrame using pandas.read_csv(). (a) (5 points) Fit four different polynomial regression models to the data by varying the degree of polynomial features used i.e. M = 1 to 4. For example, M = 3 means that φ(xi) = [1, xi, x 2 i , x 3 i ]. Plot the resulting models on the same plot and also include the input data. Hint: You can again use the sklearn implementation of Linear Regression and you can also use PolynomialFeatures to generate the polynomial features. Again, set fit_intercept = False. . (b) (3 points) Create a bar plot where you display the Mean Squared Error of each of the four different polynomial regression models from the previous question. . (c) (4 points) Comment on the fit and Mean Squared Error values of theM = 3 andM = 4 polynomial regression models. Do they result in the same or different performance? Based on these results, which model would you choose? . (d) (6 points) Instead of using polynomial basis functions, in this final part we will use another type of basis function - radial basis functions (RBF). Specifically, we will define φ(xi) = [1, rbf(xi; c1, α), rbf(xi; c2, α), rbf(xi; c3, α), rbf(xi; c4, α)], where rbf(x; c, α) = exp(−0.5(x − c)2/α2) is an RBF kernel with center c and width α. Note that in this example, we are using the same width α for each RBF, but different centers for each. Let c1 = −4.0, c2 = −2.0, c3 = 2.0, and c4 = 4.0 and plot the resulting nonlinear predictions using the regression_part2.csv dataset for α ∈ {0.2, 100, 1000}. You can plot all three results on the same figure. Comment on the impact of larger or smaller values of α. . Page 9 of 12 IAML INFR10069 (LEVEL 10) Question 3 : (26 total points) Decision Trees In this question we will train a classifier to predict if a person is smiling or not. Instead of images of faces, our dataset consists of a set of 2D coordinates that encode the location of points on the faces (e.g. the corners of the eyes, the nose, the chin, etc) of a set of different people. Each row in the dataset is a different person and the columns (i.e. the input attributes) are pairs of 2D coordinates e.g. the first coordinate is (x0, y0), the second is (x1, y1), etc. Note, in the notation used in the lectures we often use yi to specify an output attribute, but here each (xi, yi) represents a 2D coordinate on the face and we will concatenate all D of them to form our attribute vector [x0, y0, x1, y1, ..., xD, yD]. We assume that the location of these 2D points for a person that is smiling will be different than those for a person who is not smiling. Included with the 2D coordinates is an attribute called smiling, which is the binary class label that we want to predict. For a given row in the data, smiling = 1 indicates that that person is smiling. The training data is contained in faces_train.csv, and the test data can be found in faces_test.csv. You should load the data into two different Pandas DataFrames using pandas.read_csv(). (a) (4 points) Load the data, taking care to separate the target binary class label we want to predict, smiling, from the input attributes. Summarise the main properties of both the training and test splits. . (b) (4 points) Even though the input attributes are high dimensional, they actually consist of a set of 2D coordinates representing points on the faces of each person in the dataset. Create a scatter plot of the average location for each 2D coordinate. One for (i) smiling and (ii) one not smiling faces. For instance, in the case of smiling faces, you would average each of the rows where smiling = 1. You can plot both on the same figure, but use different colors for each of the two cases. Comment on any difference you notice between the two sets of points. Hint: Your plot should contain two faces. . (c) (2 points) There are different measures that can be used in decision trees when evaluat- ing the quality of a split. What measure of purity at a node does the DecisionTreeClassifier in sklearn use for classification by default? What is the advantage, if any, of using this measure compared to entropy? . (d) (3 points) One of the hyper-parameters of a decision tree classifier is the maximum depth of the tree. What impact does smaller or larger values of this parameter have? Give one potential problem for small values and two for large values. . Question 3 continued on next page. . . Page 10 of 12 IAML INFR10069 (LEVEL 10) (e) (6 points) Train three different decision tree classifiers with a maximum depth of 2, 8, and 20 respectively. Report the maximum depth, the training accuracy (in %), and the test accuracy (in %) for each of the three trees. Comment on which model is best and why it is best. Hint: Set random_state = 2001 and use the predict() method of the DecisionTreeClas- sifier so that you do not need to set a threshold on the output predictions. You can set the maximum depth of the decision tree using the max_depth hyper-parameter. . (f) (5 points) Report the names of the top three most important attributes, in order of importance, according to the Gini importance from DecisionTreeClassifier. Does the one with the highest importance make sense in the context of this classification task? Hint: Use the trained model with max_depth = 8 and again set random_state = 2001. . (g) (2 points) Are there any limitations of the current choice of input attributes used i.e. 2D point locations? If so, name one. . Page 11 of 12 IAML INFR10069 (LEVEL 10) Question 4 : (14 total points) Evaluating Binary Classi- fiers In this question we will perform performance evaluation of binary classifiers. You have been tasked with evaluating the performance of four different binary classifica- tion algorithms, alg_1, alg_2, alg_3, and alg_4. Unfortunately, you do not have access to the models themselves, only their predictions on a held-out test set. Your goal is to evaluate how well the different models perform at predicting the ground truth class labels gt for this test set. The dataset is contained in classification_eval_1.csv. You should load it into a Pandas DataFrame using pandas.read_csv(). (a) (4 points) Report the classification accuracy (in %) for each of the four different models using the gt attribute as the ground truth class labels. Use a threshold of >= 0.5 to convert the continuous classifier outputs into binary predictions. Which model is the best according to this metric? What, if any, are the limitations of the above method for computing accuracy and how would you improve it without changing the metric used? . (b) (4 points) Instead of using classification accuracy, report the Area Under the ROC Curve (AUC) for each model. Does the model with the best AUC also have the best accuracy? If not, why not? Hint: You can use the roc_auc_score function from sklearn. . (c) (6 points) Plot ROC curves for each of the four models on the same plot. Comment on the ROC curve for alg_3? Is there anything that can be done to improve the performance of alg_3 without having to retrain the model? Hint: You can use the roc_curve function from sklearn. . Page 12 of 12
欢迎咨询51作业君