COMP9417 - Machine Learning Homework 1: Gradient Descent & Friends Introduction In this homework, you will be required to manually implement (Stochastic) Gradient Descent in Python to learn the parameters of a linear regression model. You will make use of the publicly avail- able ‘real estate.csv’ dataset, which you can download directly from the course Moodle page. The dataset contains 414 real estate records, each of which contains the following features: • transactiondate: date of transaction • age: age of property • nearestMRT: distance of property to nearest supermarket • nConvenience: number of convenience stores in nearby locations • latitude • longitude The target variable is the property price. The goal is to learn to predict property prices as a function of a subset of the above features. Question 1. (Pre-processing) A number of pre-processing steps are required before we attempt to fit any models to the data. (a) Remove any rows of the data that contain a missing (‘NA’) value. List the indices of the removed data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConve- nience. (b) An important pre-processing step: feature normalisation. Feature normalisation involves rescaling the features such that thay all have similar scales. This is also important for algorithms like gradient descent to ensure the convergence of the algorithm. One common technique is called min-max normalisation, in which each feature is scaled to the range [0, 1]. To do this, for each feature we must find the minimum and maximum values in the available sample, and then use the following formula to make the transformation: xnew = x−min(x) max(x)−min(x) . After applying this normalisation, the minimum value of your feature will be 0, and the maximum will be 1. For each of the features, provide the mean value over your dataset. Question 2. (Train and Test sets) Now that the data is pre-processed, we will create train and test sets to build and then evaluate our 1 models on. Use the first half of observations (in the same order as in the original csv file excluding those you removed in the previous question) to create the training set, and the remaining half for the test set. Print out the first and last rows of both your training and test sets. Question 3. (Loss Function) Consider the loss function Lc(x, y) = √ 1 c2 (x− y)2 + 1− 1, where c ∈ R is a hyper-parameter. Consider the (simple) linear model yˆ(i) = w0 + w1x (i) 1 + w2x (i) 2 + w3x (i) 3 , i = 1, . . . , n. We can write this more succintly by letting w = (w0, w1, w2, w3)T and X(i) = (1, x (i) 1 , x (i) 2 , x (i) 3 ) T , so that yˆ(i) = wTX(i). The mean-loss achieved by our model (w) on a given dataset of n observations is then Lc(y, yˆ) = 1 n n∑ i=1 Lc(y(i), yˆ(i)) = 1 n n∑ i=1 [√ 1 c2 (y(i) − 〈w(t), X(i)〉)2 + 1− 1 ] , Compute the following derivatives: ∂Lc(y(i)yˆ(i)) ∂wk , k = 0, 1, 2, 3. You must show your working for full marks. Question 4. (Gradient Descent Psuedocode) For some value of c > 0, we have ∂Lc(y(i), yˆ(i)) ∂wk = x (i) k (〈w(t), X(i)〉 − y(i)) 2 √ (〈w(t), X(i)〉 − y(i))2 + 4 , k = 0, 1, 2, 3. Using this result, write down the gradient descent updates for w0, w1, w2, w3 (using pseudocode), as- suming a step size of η. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. Further, provide pseudocode for stochastic gradient descent updates. Here, w(t) denotes the weight vector at the t-th iteration of gradient descent. Question 5. (Gradient Descent Implementation) In this section, you will implement gradient descent from scratch on the generated dataset using the gradients computed in Question 3, and the pseudocode in Question 4. (a) Initialise your weight vector to w(0) = [1, 1, 1, 1]T . Consider step sizes η ∈ {10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01} (a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 400 iterations in total (which will require you to loop over your training data in order). Generate a 3×3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file). The following code may help with plotting: Page 2 1 fig, ax = plt.subplots(3,3, figsize=(10,10)) 2 nIter = 400 3 alphas = [10,5,2, 1,0.5, 0.25,0.1, 0.05, 0.01] 4 for i, ax in enumerate(ax.flat): 5 # losses is a list of 9 elements. Each element is an array of length nIter storing the loss at each iteration for 6 # that particular step size 7 ax.plot(losses[i]) 8 ax.set_title(f"step size: {alphas[i]}") # plot titles 9 plt.tight_layout() # plot formatting 10 plt.show() 11 (b) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice. (c) For this part, take η = 0.3, re-run GD under the same parameter initilisations and number of iterations. On a single plot, plot the progression of each of the four weights over the iterations. Print out the final weight vector. Finally, run your model on the train and test set, and print the achieved losses. Question 6. (Stochastic Gradient Descent Implementation) We will now re-run the analysis in question 5, but using stochastic gradient descent (SGD) instead. In SGD, we update the weights after seeing a single observation. (a) Use the same settings as in 5 (a). For each step-size, generate a plot of the loss achieved at each iteration of stochastic gradient descent, which is to be run for 6 Epochs. An epoch is one pass over the entire training dataset, so in total there should be 6 × ntrain updates, where ntrain is the size of your training data. Be sure to run these updates in the same ordering that the data is stored. Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file). (b) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice. (c) Take η = 0.4 and re-run SGD under the same parameter initilisations and number of epochs. On a single plot, plot the progression of each of the four weights over the iterations. Print yoru final model. Finally, run your model on the train and test set, and record the achieved losses. Question 7. Results Analysis In a few lines, comment on your results in Questions 5 and 6. Explain the importance of the step-size in both GD and SGD. Explain why one might have a preference for GD or SGD. Explain why the GD paths look much smoother than the SGD paths. Points Allocation There are a total of 10 marks, the available marks are: • Question 1: 1 mark (split 0.5/0.5) • Question 2: 0.5 marks • Question 3: 1.5 marks • Question 4: 1 mark • Question 5: 2.5 marks (split 1/0.5/1) Page 3 • Question 6: 2.5 marks (split 1/0.5/1) • Question 7: 1 mark What to Submit • A single PDF file which contains solutions to each question. For each question, provide your solution in the form of text and requested plots. For any question in which you use code, provide a copy of your code at the bottom of the relevant section. • .py file(s) containing all code you used for the project, which should be provided in a seperate .zip file. This code must match the code provided in the report. • You may be deducted points for not following these instructions. • You may be deducted points for poorly presented/formatted work. Please be neat and make your solutions clear. When and Where to Submit • Due date: Monday 8 March, 2021 by 5:00pm. • Late submission incur a penalty of 10% per day for the first 5 days and 100% for more than 5 days. • Submission has to be done through Moodle. Final Reminder: You are required to submit one PDF file, AND also to submit the Python file(s) with all your code as a separate .zip file. Page 4
欢迎咨询51作业君