University of Toronto Mississauga Department of Mathematical and Computational Sciences CSC 311 - Introduction to Machine Learning, Fall 2020 Assignment 2 Due date: Sunday November 15, 11:59pm. No late assignments will be accepted. As in all work in this course, 20% of your grade is for quality of presentation, including the use of good English, properly commented and easy-to-understand programs, and clear proofs. In general, short, simple answers are worth more than long, complicated ones. Unless stated otherwise, all answers should be justified. The TA has a limited amount of time to devote to each assignment, so what you hand in should be legible (either typed or neatly hand-written), well-organized and easy to evaluate. (An employer would demand no less.) All computer problems are to be done in Python with the NumPy, SciPy and scikit-learn libraries. Hand in five files: The source code of all your programs (functions and script) in a single Python file, a pdf file of figures generated by the programs, a pdf file of all printed output, a pdf file of answers to all the non-programming questions (such as proofs and explanations), and a scanned, signed copy of the cover sheet at the end of the assignment. All proofs should be typed. (Word, Latex and many other programs have good facilities for typing equations.) Be sure to indicate clearly which question(s) each program and piece of output refers to. All the Python code (functions and script) for a given question should appear in one location in your source file, along with a comment giving the question number. All material in all files should appear in order; i.e., material for Question 1 before Question 2 before Question 3, etc. It should be easy for the TA to identify the material for each question. In particular, all figures should be titled, and all printed output should be identified with the Question number. The five files should be submitted electronically as described on the course web page. In addition, if we run your source file, it should not produce any errors, it should produce all the output that you hand in (figures and print outs), and it should be clear which question each piece of output refers to. Output that is not labelled with the Question number will not be graded. Programs that are suppose to produce output, but don’t, will not be graded. Style: Use the solutions to Assignment 1 and the midterm as a guide/model for how to present your solutions to this assignment. I don’t know policy: If you do not know the answer to a question (or part), and you write “I don’t know”, you will receive 20% of the marks of that question (or part). If you just leave a question blank with no such statement, you get 0 marks for that question. 1 No more questions will be added. Tips on Scientific Programming in Python If you haven’t already done so, please read the NumPy tutorial on the course web page. Special numbers. The term numpy.inf represents infinity. It results from dividing by 0 in numpy. It can also result from overflow (i.e., from numbers that are too large to represent in the computer, like 101000). The term numpy.nan stands for “not a number”, and it results from doing 0/0, inf/inf or inf-inf in numpy. Indexing. Array indexing begins at 0, not 1. Thus, if A is a matrix, then A[7,0] is the element in row 7 and column 0. Likewise, A[0,4] is the element in row 0 and column 4. We use both cardinal and ordinal numbers to refer to rows and columns. Thus, the first row is row 0, the second row is row 1, etc. Slicing allows large segments of an array to be referenced. For example, A[:,5] returns column 5 of matrix A, and A[7,[3,6,8]] returns elements 3, 6 and 8 of row 7. Similarly, if v is a vector, then the statement A[6,:]=v copies v into row 6 of matrix A. You can read more about indexing in the following Numpy tutorial: https://numpy.org/doc/stable/reference/arrays.indexing.html Vectorized code. Whenever possible, do not use loops for numerical computations, as they are very slow in Python. In particular, avoid iterating over the elements of a large vector or matrix. Instead, use NumPy’s vector and matrix operations, which are much faster and can be executed in parallel on a gpu. This is called vectorized code. For example, if A is a matrix and v is a column vector, then A+v will add v to every column of A. Likewise for rows and row vectors. Note that if A and B are matrices, then A*B performs element- wise multiplication, not matrix multiplication. To perform matrix multiplication, you can use numpy.matmul(A,B), or A@B in Python 3. Also, the functions sum and mean in numpy are useful for summing or averaging over all or part of an array. Many NumPy functions that are defined for single numbers can be passed lists, vectors and matrices instead. For example, f([x1, x2, ..., xn]) returns the list [f(x1), f(x2), ..., f(xn)]. The same is true for many user-defined functions. You should also avoid the use of recursion or higher-order functions in numerical compu- tations. This includes the python map function or any numpy function listed under “func- tional programming”, such as apply-along-axis, unless otherwise specified. You should also avoid using Python functions that operate on lists, such as zip. These are often loops in disguise and are very slow. With few exceptions, arrays should be the only large objects in your program, and you should only operate on them with NumPy functions. Sometimes, you will need loops to iterate over short lists or to implement iterative algo- rithms, such as gradient descent, or you may need recursion to traverse a small graph. This is OK. Typically, this represents a tiny fraction of total computation time, since large arrays are processed at each iteration of a loop or at each node in graph. It is these compute- intensive operations on large arrays that must be vectorized. In general, all linear-algebra 2 computations should be vectorized, that is, implemented using Numpy’s matrix and vector functions. In fact, one of the goals of this course is to teach you to write vectorized code, since it is ubiquitous in machine learning. To give you maximum practice, all your vectorized code should only use basic operations of linear algebra, such as matrix addition, multiplication, inverse and transpose, unless specified otherwise. The point here is for you to implement vectorized code yourself, not to use complex Numpy procedures that solve most of a problem for you. You may, of course, use Numpy’s array-indexing facilities to vectorize operations on all or part of an array. Broadcasting. Another index-related feature in Numpy for vectorization is broadcasting, which combines arrays of different shapes. As an example, suppose A and B are Numpy arrays, where shape(A) = [I,J,K] and shape(B) = [I,K]. And suppose we want to define a new array, C, where shape(C) = [I,J,K] and Cijk = AijkBik for all i, j, k. We can do this with the following Numpy statements: B = np.reshape(B,[I,1,K]} C = A*B Similarly, suppose shape(A) = [I,K] and shape(B) = [J,K], and we want to define C, where shape(C) = [I,J,K] and Cijk = Aik + Bjk for all i, j, k. We can do this with the following Numpy statements: A = np.reshape(A,[I,1,K]} B = np.reshape(B,[1,J,K]} C = A+B You can read more about broadcasting in the following Numpy tutorial: https://numpy. org/doc/stable/user/basics.broadcasting.html Plotting. For generating and annotating plots, the following functions in matplotlib.pyplot are used frequently: plot, scatter, figure, xlabel, ylabel, title, suptitle, xlim and ylim. The functions semilogx, semilogy and loglog generate plots with a log scale on one or both axes. You can use Google to conveniently look up these functions. e.g., Google “pyplot suptitle”. To plot a smooth function, y = f(x), you compute y for many closely- spaced values of x, and then plot all the x, y pairs. For example, the following code plots the function y = sinx for x between 0 and 10 by plotting 1000 values of y at 1000 evenly-spaced values of x. import numpy as np import matplotlib.pyplot as plt xmin = 0 xmax = 10 xList = np.linspace(xmin,xmax,1000) yList = np.sin(xList) plt.plot(xList,yList) The plot function draws a tiny line segment between consecutive (x, y) pairs, giving the illusion of a smooth curve. 3 Printouts. Finally, if a program prints any output, you should identify the question (and part) that it comes from by preceding all code for that part with lines like the following: print(’\n’) print(’Question 3(d).’) print(’-------------’) If a program is not suppose to print anything, then do not include these lines in your program, so as to reduce clutter in your output. In any case, you do not have to include these lines in your line-counts of code. Unless otherwise specified, you may assume in this assignment that all inputs are correct and no error-checking is required. Tips on Proving Theorems When proving theorems, all steps should be justified. Appeals to intuition and leaps of logic are not allowed. Explanations in English should be minimized and must not replace careful logical inference. Trivial or obvious steps can be skipped (but if you have to think about something for more than a few seconds, then it is not obvious). Everything should be proved from scratch, ie, from basic results and definitions. Unless specified otherwise, you should not use any powerful theorems or results from the lecture slides, notes, books or any other source. The point is to prove everything yourself. Proofs should be clear and concise. Use the proofs in the solutions to Assignment 1 and the midterm as a guide to what proofs should look like. Proofs like this will receive full marks. Note, in particular, that every step is justified with a short explanation (e.g., by the definition of matrix multiplication, or since Xij = x (i) j , or by Equation (1)). Unless otherwise specified, you should never write out the contents of large vectors or matrices, as in y1... yn = x11 . . . x1m... . . . ... xn1 . . . xnm w1... wm Instead, you should write yi = ∑ j xijwj, which is more compact and leads to much clearer proofs. Often, you can simply write y = Xw, which is the same thing in vectorized form. When you need to refer to matrix elements, it is convenient to use the notation [A]ij to refer to the ijth element of matrix A, and [V ]i for the i th element of vector V . In some cases, the square brackets are unecessary. For example, [w]j = wj in the equations above. However, when A or V is a complicated expression, we do need them. For example, here are some equalities you may find useful: [AB]ij = ∑ k AikBkj matrix multiplication [Aw]i = ∑ j Aijwj matrix-vector multiplication[ AT ] ij = Aji matrix transpose[ ∂f ∂w ] j = ∂f ∂wj gradient 4 You may need other such equalities in your proofs. Some may be trivial (such as [w]j = wj). All others should be justified. Finally, to speed up grading, all proofs should be typed. 1. Linear Regression with Feature Mapping. (? points total) In this question, you will write a Python program to fit a non-linear function to data with linear least-squares regression. As in Question 3 of Assignment 1, the data consists of a set of pairs, (x, t), where the input, x, and the target value, t, are both real numbers. However, in this question, you will use trigonometric functions to define a non-linear feature mapping, ψ, that transforms x into a feature vector, z, before doing linear regression. In particular, z = ψ(x) = [1, sin(x), sin(2x), ... sin(kx), cos(x), cos(2x), ..., cos(kx)] (1) for some positive integer, k. Note that z is a vector with 2k + 1 elements. Likewise, the function ψ is said to consist of 2k + 1 basis functions. For more information on linear regression and basis functions, see Chapter 3.1 in Bishop. The function you will fit to data has the form y(x) = wT z = wTψ(x) (2) where w is a vector of weights. Thus, y is a linear combination of basis functions. Your program should find the weight vector, w, for which the function y(x) best fits the data. In particular, it should find w to minimize the loss function, l(w) = N∑ n=1 [t(n) − y(x(n))]2 = N∑ n=1 [t(n) − wT z(n)]2 (3) where the sum is over all training points, (x(n), t(n)). Your program should also compute the mean squared training and test errors, given by: errtrain = Ntrain∑ n=1 [t(n) − y(x(n))]2/Ntrain errtest = Ntest∑ n=1 [t(n) − y(x(n))]2/Ntest where the two sums are over the training data and test data, respectively, and Ntrain and Ntest are the number of training and test points, respectively. The data you will use is in the file dataA2Q1.pickle.zip on the course web site. Download and uncompress this file. The file contains training and test data. Next, start the Python interpreter and import the pickle module. You can then read the file with the following Python command: 5 Figure 1: with open(’dataA2Q1.pickle’,’rb’) as file: dataTrain,dataTest = pickle.load(file) The variable dataTrain will now contain the training data, and dataTest will contain the test data. Specifically, dataTrain is a 2× 25 Numpy array, where each column is an (x, t) pair. (Equivalently, the first row gives all the input values, and the second row gives all the target values.) Likewise, dataTest is a 2 × 1000 array of test data. The training data is illustrated by the blue dots in Figure 1. In answering the questions below, do not use any Python loops unless specified other- wise and do not use any functions from sklearn. The point is to implement everything yourself from scratch. All code should be vectorized. (a) Write a Python function fit plot(dataTrain,dataTest,K) that uses linear least squares regression to fit a function to the data in dataTrain. You should use the feature mapping in Equation (1) with k = K to construct a feature vector for each data point. You should use the method lstsq(Z,T) in numpy.linag to solve the 6 linear least-squares problem itself.1 Here Z is the feature matrix of the training data, and T is a vector of the target values. That is, row n of Z is the feature vector for training point n, and T[n] is its target value. The function lstsq returns a number of values. The first value is the weight vector, w. Your function should use w to compute the training and test error. (Do not use any of the other values returned by lstsq in this question.) The function should return the weight vector, the training error and the test error. In addition, your function should plot the fitted curve on top of the training data. Use the function scatter to plot the training data as blue dots, and specify a dot size of 20. Use the function plot to draw a fitted curve in red using 1,000 values of x equally spaced between the smallest x-value in the training set and the largest. Use the function ylim in pyplot to put upper and lower limits on the vertical axis. For the upper limit, use the maximum target value in the training data + 5. For the lower limit, use the minimum target value in the training data - 5. If you have done everything correctly you should get exactly the result shown in Figure 1 when the function is called on the training data in the file dataA2Q1.pickle with K=4. You should implement the function so that it can help you later in this question. The function can easily be written in at most 39 lines of modular and highly- readible code (not counting comments and blank lines). Do not use any loops. Hint 1: to avoid loops, you may find the following general fact useful: if u and v are column vectors, then uvT is a matrix and [uvT ]ij = uivj. Hint 2: as an exercise, first try to create the vector [sinx, sin 2x, sin 3x, ... sin kx] for a single value of x without using loops. (b) Run your function fit plot on the training and test data in the file dataA2Q1.pickle with K=3. Print the value of K, the training error, the test error and the weight vector, in that order. Title the figure, Question 1(b): the fitted function (K=3). Label the horizontal axis x, and the vertical axis y. (c) Repeat part (b) using K=9. Title the figure appropriately. You should be able to do this in at most 2 lines of highy-readible code. (d) Repeat part (b) using K=12. Title the figure appropriately. You should be able to do this in at most 2 lines of highy-readible code. If everything is working correctly, the fitted curve should swing wildly up and down, but it should appear to pass through each training point exactly. Likewise, the training error should be almost zero, while the test error is huge. Many of the weights should also be gargantuan. This is extreme over fitting. (e) Run your function fit plot on the training and test data in the file dataA2Q1.pickle for values of K from 1 to 12 inclusive. Using subplot, arrange the 12 plots in a 4× 3 grid with K increasing from left to right and top to bottom. You may use a single loop, but no nested loops, for this purpose. Title the figure, Question 1(e): fitted functions for many values of K. 1You may need to add the keyword argument rcond=None to lstsq to supress warning messages. 7 Figure 2: Manually put the figure into full-screen mode before you save it, to increase res- olution and make the figure more readible. If you have done everything correctly the figure should look approximately like that in Figure 2, which is based on a different training set from the one you are using. Notice that the fitted function underfits the data for K=1, and overfits for large values of K. As K increases, the function becomes more-and-more wiggly, then varies wildly and finally soars far above and below the training data and beyond the limits of the axes. You should notice even more extreme behavior with the data in dataQ2A1.pickle. (f) To be realistic, we will pretend in this question that we have only a limited amount of data for training and testing (ignoring the 1,000 test points). To make the best use of limited data, you will use 5-fold cross validation to estimate the best value of K. Do not use any built-in functions that do (a significant part of) cross validation, such as cross validate in sklearn. Instead, you should implement your own cross-validation procedure from scratch using basic Numpy operations.. You should test values of K from 0 to 12 inclusive. First, divide the training data into five equal-sized folds. For this question, use 8 the first 5 training points as the first fold, the next five training points as the second fold, etc. Use one fold as validation data, and the remaining four folds as training data. For each value of K, use linear least squares to fit a function to the four folds of training data. Use the one fold of validation data to compute the validation error, and the four folds of training data to compute the training error. Do this five times, using each of the five folds in turn as validation data. Thus, for each value of K, you will have five estimates of validation error, and five estimates of training error. Use their averages as overall estimates of validation and training error, respectively. You may use one doubly-nested loop for this. You should reuse code from part (a), but do not cut-and-paste. Instead, part (a) should have well-designed subroutines that you can use here. You can read more about cross validation in Bishop and in Hastie, Tibshirani and Friedman. On a single pair of axes, plot mean training error v.s. K in blue, and mean validation error v.s. K in red. Label the horizontal axis K, and the vertical axis Mean error. Title the figure, Question 1(f): mean training and validation error. When plotting the errors, use the function semilogy in pyplot to put a log scale on the vertical axis. Without this, a few extremely large validation errors will dominate the entire graph, making most of the graph flat and uninformative. You should find that for large values of K, the validation error is extremely large and the training error is almost 0. If everything is working correctly, the plotted curves should look approximately like those in Figure 3, which is for a different data set than the one you are using. Choose the value of K with the smallest mean validation error. Using this value of K, repeat part (b). Use all the training data (five folds) and all the test data (1,000 points). This maximizes the use of the training data for the final classifier, and gives an unbiased estimate of test error (though in practice we would have far fewer than 1,000 test points). Title the plot, Question 1(f): the best fitting function. 2. Probabilistic Multi-class Classification. In this question, you will use three different probabilistic classifiers to do multi-class classification on 2-dimensional (2D) cluster data. You will compute the accuracies of the classifiers, and you will plot the decision boundaries. In the process, you will com- pare the generative and discriminative approaches to classification. For full marks, you should not do any unnecessary calculations. In particular, you often do not have to fully compute output probabilities into order to make predictions and compute accu- racy. Besides increasing execution time, such unnecessary calculations can introduce additional numerical error which can reduce prediction accuracy. The data you will use is in the file dataA2Q2.pickle.zip on the course web site. Download and uncompress this file. The file contains training and test data. Next, start the Python interpreter and import the pickle module. You can then read the file and extract the data with the following Python commands: with open(’dataA2Q2.pickle’,’rb’) as file: 9 Figure 3: 10 dataTrain,dataTest = pickle.load(file) Xtrain,Ttrain = dataTrain Xtest,Ttest = dataTest The variables Xtrain and Ttrain will now contain training data, that is, 2D input points and corresponding target values, respectively. Likewise, Xtest and Ttest con- tain test data. There are 1,800 training points and 1,800 test points. The training data is illustrated in the scatter plot in Figure 4. The figure shows three overlapping clusters coloured red, blue and green, which are called clusters 0, 1 and 2, respectively, in the target data. In addition, download the file bonnerlib2D.py.zip from the course web site and import it into your program with the statement import bonnerlib2D as bl2d. It contains functions for displaying 2-dimensional decision boundaries. In answering the questions below, do not use any Python loops unless specified other- wise. You will be using some functions from sklearn, but only those that are specified. The point is to implement everything else yourself from scratch. All code should be vectorized. (a) Data Visualization. Define a function plot data(X,T) to plot data X,T. Here X is a data matrix in which each row is a 2-dimensional data point, and T is a vector of corresponding class labels. You may assume there are three possible class labels: 0, 1 and 2. You should use the scatter function to plot each point in X as a dot of size of 2. Colour the dots red, blue or green for classes 0, 1 and 2, respectively. Set the limits of the axes using the functions xlim and ylim so that the plot holds all the training points with a margin of 0.1, i.e., so that the distance from the edge of the plot to the nearest training point is 0.1. When you plot the training data in the file dataA2Q2.pickle, it should look exactly like Figure 4. (Do not hand this in.) The function can be written in at most 7 lines of highly readible code (not counting comments or blank lines). (b) Logistic Regression. As in Question 4 of Assignment 1, use the Python class LogisticRegression in sklearn.linear model to define a classifier. This time, use the keyword ar- guments multi class=’multinomial’ and solver=’lbfgs’, to get multi-class classification. Use the fit method to train the classifier on the data that you read in above. Compute the accuracy of your classifier on the test data. Do this in two ways: (1) using the score method of the LogisticRegression class, and (2) using the weight matrix and bias vector of the classifier to make predictions. Call the two estimates of accuracy accuracy1 and accuracy2, respectively. In addition, generate a plot of the training data with the decision boundaries of the classifier superimposed on top. 11 Figure 4: 12 To compute accuracy2, define a Python function accuracyLR(clf,X,T) that computes and returns the accuracy of classifier clf on data X,T, where clf does lo- gistic regression. Your function should not use any methods of LogisticRegression or any functions in sklearn. The point is to implement everything yourself from scratch starting from the fitted linear functions, one linear function for each class. The linear functions are defined by a weight matrix and bias vector, which are stored in the attributes coef and intercept , respectively, of the classifier. The function accuracyLR should use them to make predictions on the data. The ac- curacy is then the average number of correct predictions. The function should work on classifiers with any number of classes, not just 3 classes, and on data of any dimensionality, not just 2 dimensions. It can be written in at most 6 lines of highly-readible code (not counting comment lines or blank lines). Do not use any loops. After computing accuracy1 and accuracy2, print their values and their differ- ence. If everything is working properly, they should have exactly the same value, and the difference should be 0. To generate the plot, use the function plot data from part (a) to plot the training data, and use the function boundaries(clf) in bonnerlib2D to draw the decision boundaries of the classifier. Be sure to call plot data before you call bounaries, to set the limits on the axes. If you have done everything properly, the decision boundaries should appear as three black line segments meeting at a common point between the three clusters. Title the figure, Question 2(b): decision boundaries for logistic regression. This entire question can be implemented as a Python script with at most 11 lines of highly-readible code, not counting the code for accuracyLR and plot data. (c) Gaussian (aka Quadratic) Discriminative Analysis. Repeat part (b) using the Python class QuadraticDiscriminantAnalysis in sklearn.discriminant analysis to define the classifier. Use the keyword ar- gument store covariance=True, to store the covariance matrices for each class. Compute the values of accuracy1 and accuracy2, and generate a plot of the training data with the decision boundaries of the classifier superimposed on top. To compute accuracy2, define a function accuracyQDA(clf,X,T) that com- putes and returns the accuracy of classifier clf on data X,T, where clf does quadratic discriminant analysis. Your function should not use any methods of QuadraticDiscriminantAnalysis or any functions in sklearn. The point is to implement everything yourself from scratch starting from the fitted class mod- els. The class models consist of the mean vector, covariance matrix and prior probability of each class. They are stored in the attributes of the classifier. You should use the class models and Bayes rule to make predictions, from which you should compute accuracy. You may use the function multivariate normal.pdf in scipy.stats to compute probability densities for each class. Do not use any other functions in scipy.stats. Also, the function accuracyQDA should work on classifiers with any number of classes, not just 3 classes, and on data of any 13 Figure 5: dimensionality, not just 2 dimensions. It can be written in at most 13 lines of highly-readible code. You may use 1 loop, and no nested loops. After computing accuracy1 and accuracy2, print their values and their differ- ence. If everything is working properly, they should have exactly the same value, and the difference should be 0. Finally, plot the training data and decision boundaries. Title the figure, Question 2(c): decision boundaries for quadratic discriminant analysis. If everything is working correctly, it should look approximately like Figure 5, which is based on a data set with slightly different properties from the one you are using. This entire question can be implemented as a Python script with at most 11 lines of highly-readible code, not counting the code for accuracyQDA and plot data. (d) Gaussian Naive Bayes. Repeat part (b) using the Python class GaussianNB in sklearn.naive bayes to define the classifier. Compute the values of accuracy1 and accuracy2, and generate a plot of the training data with the decision boundaries of the classifier superimposed on top. 14 To compute accuracy2, define a function accuracyNB(clf,X,T) that computes and returns the accuracy of classifier clf on data X,T, where clf does Gaussian naive Bayes. Your function should not use any methods of GaussianNB or any functions in sklearn. The point is to implement everything yourself from scratch starting from the fitted class models. The class models consist of the mean and variance of each feature of each class, and of the prior probability of each class. They are stored in the attributes of the classifier.2 You should use the class models and Bayes rule to make predictions, from which you should compute accuracy. Do not use any functions in numpy.random or scipy.stats. Also, your function accuracyNB should work on classifiers with any number of classes, not just 3 classes, and on data of any dimensionality, not just 2 dimensions. It can be written in at most 16 lines of highly-readible code. Do not use any loops. Recall from the lecture slides that in Gaussian naive Bayes, if x = (x1, ..., xd) is a data point, then its class-conditional densities are given by P (x|t = k) = ΠiP (xi|t = k) where P (xi|t = k) = e −(xi−µik)2/2σ2ik√ 2piσik The first equation says that within class k, the features of x are independent, and the second equation says that in class k, feature xi has a Gaussian distribution with mean µik and variance σ 2 ik. You should use these formulas in the function accuracyNB. You can use broadcasting to implement them without using loops (See the Tips on Scientific Programming in Python). After computing accuracy1 and accuracy2, print their values and their differ- ence. If everything is working properly, they should have exactly the same value, and the difference should be 0. Finally, plot the training data and decision boundaries. Title the figure, Question 2(d): decision boundaries for Gaussian naive Bayes. If everything is working cor- rectly, it should look similar to the plot in part (c) but the two decision boundaries should be much more circular (but not perfect circles). This entire question can be implemented as a Python script with at most 11 lines of highly-readible code, not counting the code for accuracyNB and plot data. 3. Neural Networks: intro This question is a warm-up execise to familiarize you with neural networks before you implement them yourself in Questions 4 and 5. Here, you will fit a variety of neural networks to the data from Question 2, explore their properties, and write code that does forward propagation from inputs to outputs. We focus here on neural networks for classification, not regression. To define a neural net, use the Python class MLPClassifier in sklearn.neural network. In this question, all neural networks will have a single hidden layer that uses the logis- tic activation function. You should train them using stochastic gradient descent (sgd) 2Note: the attribute clf.sigma is variance, not standard deviation. 15 with an initial learning rate of 0.01 and an optimization tolerance of 10−6. These are parameters of MLPClassifier that you can set. All other parameters should use their default values. When training a neural net, be sure that the training data uses integer encodings for the class labels, not 1-of-K encodings, as MLPclassifier expects integer labels for multi-class classification.3 In answering the questions below, do not use any Python loops, except for generat- ing subplots. You will be using some functions from sklearn, but only those that are specified. The point is to implement everything else yourself from scratch. All code should be vectorized. During training, you will probably get a warning message about convergence. You can ignore these messages, but please remove them from your printout before handing it in. (a) (0 points) Reload the data from the file dataA2Q2.pickle. (b) Because the weights of a neural network are initialized randomly, set the seed for random number generation to 0 by calling numpy.random.seed(0), so that everyone gets the same initial weights, and thus the same results. Then define a neural network as specified above. Also specify 1 hidden unit and a maximum of 1000 epochs of training. Fit the neural network to the training data using the fit method. Compute the accuracy of the trained network on the test data using the score method. Print the test accuracy. Plot the training data and the decision boundaries as in Question 2. If everything is working correctly, the decision boundary should be a straight line. Title the figure, Question 3(b): Neural net with 1 hidden unit. Design the code so that it will help you with later questions. (c) Repeat part (b) for a neural net with 2 hidden units. Title the figure appropriately. This can be done with at most 3 lines of highly-readible code. If everything is working correctly, you should see two curved decision boundaries that divide the input space into three decision regions that roughly correspond to the three clusters. (d) Repeat part (b) for a neural net with 9 hidden units. Title the figure appropriately. This can be done with at most 3 lines of highly-readible code. If everything is working correctly, the figure shold look similar (but not identical) to that of Question 2(c). (e) To see how the decision boundary evolves during learning, repeat part (b) nine times using a neural net with 7 hidden units, but with increasing amounts of training. Specifically, use 22, 23, 24, ... 210 training iterations. Use subplot to arrange the nine plots in a 3 × 3 grid in a single figure, with the number of iterations increasing from left to right and top to bottom. Title the figure, Question 3(e): different numbers of epochs. This can be done with at most 5 lines of highly-readible code. Manually put the figure into full-screen mode before saving it, to increase reso- lution and make the figure more readible. If everything is working correctly, the 3MLPclassifier assumes that class labels encoded as binary arrays are meant for multi-label classification, something we have not studied, and which is different from multi-class classification. 16 Figure 6: results should look similar to those in Figure 6, which is based on a dataset with slightly different properties from the one you are using. In your case, the decision boundaries for 210 iterations (in the bottom right) should look much more similar to those in part (d). Nine test accuracies should also be printed out, generally increasing with the number of iterations (but not always). (f) To see the effect of the initial weights, repeat part (b) nine times using a neural net with 5 hidden units, but using nine different sets of initial weights. To do this, use a different seed for random number generation each time. Specifically, use the numbers 1,2,...,9 as seeds, in that order. Use subplot to arrange the nine plots in a 3× 3 grid in a single figure, with the seed value increasing from left to right and top to bottom. Title the figure, Question 3(f): different initial weights. This can be done with at most 5 lines of highly-readible code. Manually put the figure into full-screen mode before saving it, to increase reso- lution and make the figure more readible. If everything is working correctly, you should get slightly different decision boundaries each time (or most times), all somewhat similar to those in part (d). Nine test accuracies should also be printed 17 out, not all the same. (g) Following the approach in Question 2, compute the test accuracy of a neural net- work in two ways: (1) by using the score method of the MLPClassifer class, and (2) by implementing forward propagation within the neural net to make predic- tions. Call the two estimates of accuracy accuracy1 and accuracy2, respectively. To compute accuracy2, define a Python function accuracyNN(clf,X,T) that computes and returns the accuracy of classifier clf on data X,T, where clf is a neural network with one hidden layer. Your function should not use any methods of MLPClassifier or any functions in sklearn. The point is to implement every- thing yourself from scratch using the fitted weight matrices and bias vectors of the neural net, which are stored in the attributes of clf. The function accuracyNN should use them to propagate the inputs of the neural net to the outputs, and to make predictions. The function should work for neural nets with any number of output classes, not just 3 classes, and for data of any dimensionality, not just 2 dimensions. It can be written in at most 8 lines of highly-readible code (not counting comment lines or blank lines). Do not use any loops. As in part (d), define a neural network with 9 hidden units and at most 1,000 epochs of training. Fit the network to the training data. Be sure to set the seed for random number generation to 0 before you start. After computing accuracy1 and accuracy2, print their values and their difference. If everything is working properly, they should have exactly the same value, and the difference should be 0. This question can be implemented as a Python script with at most 8 lines of highly-readible code, not counting the code for accuracyNN. The line that uses MLPClassifier to define the neural network counts as 1 line of code, even if it is spread over several lines to improve readibility. (h) Repeat part (g), but compute the mean cross entropy of the neural net on the test data, instead of the accuracy, and do so in two different ways. Call the resulting cross entropies CE1 and CE2. Specifically, define a function ceNN(clf,X,T) that computes CE1 and CE2 for classifier clf on data X,T, where clf is a neural net with one hidden layer. For each input vector in X, first compute the log probability of each output class, and do this in two ways. For CE1, use the predict log proba method of clf. For CE2, do not use any methods of clf or any functions in sklearn. Instead, compute the log probabilities yourself after first propagating the inputs of the neural net to the output, as in part (g). From the two log probabilities, compute and return the two cross entropies. Your function should work for neural nets with any number of output classes, not just 3 classes, and for data of any dimensionality, not just 2 dimensions. It can be written in at most 22 lines of highly-readible code (not counting comment lines or blank lines). Do not call any subroutines from part (g), but feel free to cut-and-paste as much code as you like. You may use any of the attributes of clf. Do not use any loops. Hint: use broadcasting and indexing to compute a one-hot encoding of T. As in part (g), define a neural network with 9 hidden units and at most 1,000 epochs of training. Fit the network to the training data. Be sure to set the seed 18 for random number generation to 0 before you start. After using ceNN to compute CE1 and CE2, print their values and their difference. Print the cross entropies out in full (about 17 significant digits). If everything is working properly, they should have exactly the same value (approximately 0.43), and the difference should be 0 (or almost 0, depending on your machine). This question can be implemented as a Python script with at most 7 lines of highly-readible code, not counting the code for ceNN. The line that uses MLPClassifier to define the neural network counts as 1 line of code, even if it is spread over several lines to improve readibility. 4. (? points total) Neural Networks: theory In this question, you will develop matrix equations needed to implement a neural net in Question 5. The neural net: The net has two hidden layers and a single output unit, where the hidden units use a tanh activation function. We will use the network for binary classification, so the activation function at the output is a sigmoid. The operation of the neural net can be specified as follows: o = σ(g˜) g˜ = gU + u0 (4) g = tanh (h˜) h˜ = hV + v0 (5) h = tanh (x˜) x˜ = xW + w0 (6) Here x, h, g and o are row vectors representing the input, the first hidden layer, the second hidden layer, and the output, respectively; g˜, h˜ and x˜ are row vectors representing linear transformations of g, h and x, respectively; U , V and W are weight matrices; u0, v0 and w0 are row vectors representing bias terms; and σ and tanh are the sigmoid function and the hyperbolic tangent function, respectively. Because there is only a single output, the output vector, o, has only a single element, so we will treat o as a real number. Likewise for g˜. Similarly, the weight matrix U has only a single column, so we will treat U as a column vector. Recall that the tanh function is given by tanh (y) = ey − e−y ey + e−y and the sigmoid function is given by σ(y) = 1/(1 + e−y), for any real number, y. In this question, you may use the following properties of the tanh and sigmoid functions without proving them: ∂ tanh (y) ∂y = 1− [tanh (y)]2 (7) ∂σ(y) ∂y = σ(y)[1− σ(y)] (8) 19 You may not use any other results from the lecture slides, although you may copy their proofs if you find them useful. Since we are using the neural network for classification, the loss function used during training is cross entropy: C = ∑ n c(t(n), o(n)) where c(t, o) = − t log o− (1− t) log (1− o) (9) where the sum is over training points. Here, o(n) is the output of the neural net on input x(n), and t(n) is a binary value representing the target class of x(n). The chain rule: Below, you will derive gradient equations for back propagation in the neural net. Back propagation is based on the chain rule, and since each layer has multiple inputs and outputs, we will need a generalization of the chain rule for multi- variate functions. Abstractly, suppose that y = g(x) and z = h(x), where x, y and z are all real numbers. Then, the multi-variate chain rule says that ∂f [g(x), h(x)] ∂x = ∂f(y, z) ∂y ∂y ∂x + ∂f(y, z) ∂z ∂z ∂x for any differentiable function, f(y, z). Intuitively, this says that x affects f through y and through z, and that both these effects must be added up to determine the total effect on f due to a change in x. To put this in the context of our neural net, let h˜i and hj be components of the vectors h˜ and h, respectively, let c be an abreviation for c(o, t), and suppose we want to compute ∂c/∂hj. Observe that c depends on h˜, which depends on h. In particular, changes in the value of hj affect all of the h˜i, and each of these affect c. All these separate effects must be added up to determine the total effect on c due to changes in hj. Formally, ∂c ∂hj = ∑ i ∂c ∂h˜i ∂h˜i ∂hj (10) Note that we do not always need the multivariate chain rule. For example, g˜i = tanhhi, so g˜i depends on hi, but not on any other hj. Similarly, hi affects gi, but not any other gj. So, the univariate chain rule is all we need here: ∂c ∂h˜i = ∂c ∂gi ∂gi ∂h˜i (11) Intuitively, this says that h˜i affects c only through gi. In this question, you will have to think carefuly about the dependencies within a neural net in order to determine which chain rule to use. Be sure to make it clear in your proofs which chain rule you are using, and why. 20 What to do: Prove each of the five matrix equations below from scratch, using only the results stated above, along with basic results from calculus and linear algebra. Use the proofs in the solutions to Assignment 1 and the midterm as a model and guide for how your proofs should be done. Please also see the Tips on Proving Theorems on page 4 of this assignment. To make your proofs easier to mark, use the indices m,n to range over training instances, and use i, j, k to range over hidden and input units, as in the equations above. In the matrix equations below, X, H, H˜, and G are data matrices whose nth rows are x(n), h(n), h˜(n) and g(n), respectively. Likewise, G˜, O and T are column vectors whose nth elements are g˜(n), o(n) and t(n), respectively. ~1 is a row vector of 1’s. Note that your proofs must be rooted in Equations (4) to (6), which describe the behaviour of the neural net. These equations are vector-based, since a neural net operates on one vector at a time. The neural net in this question takes a single vector, x, as input, generates a vector of hidden values, h, from which it generates another vector of hidden values, g, from which it generates a single real number, o, as output. From this vector-based description, you must derive matrix equations that describe the behaviour of the neural net on a set of vectors. These equations are used for efficient batch (or mini-batch) training of a neural net. The derivations can be broken down into several steps. The first is to express Equa- tions (4) to (6) at a finer level of detail. For instance, the two equalities in Equation (5) can be written as gj = tanh h˜j h˜j = ∑ i hiVij + v0j Such equations describe the neural net in terms of individual neurons, such as hi and gj, and individual weights and bias terms, such as Vij and v0j. There are no vectors or matrices in these equations, only real numbers. For this reason, you can infer derivatives using the familiar rules of calculus for real variables. The next step is to derive equations for the derivatives needed by back propagation, such as ∂c/∂Vij and ∂c/∂hi. These equations should be entirely in terms of real num- bers, not vectors or matrices. This step typically involves calculus, but no linear algebra. The final step is to vectorize the equations, that is, to translate them into the vector and matrix-based equations given below. This step typically involves linear algebra, but little or no calculus. It will probably be the least familiar step for you, and therefore the hardest. It will be worth the most marks. You can find examples of it in the solutions to Assignment 1 and the midterm. (a) (? points) Draw the computation graph of the neural net. (See slide 36 of Lec- ture 5). Each node of the graph is a vector or matrix. You may draw the graph by hand and scan it in. (b) (? points) ∂C ∂G˜ = O − T 21 (c) (? points) [ ∂C ∂H˜ ] nk = (1−G2nk) [ ∂C ∂G ] nk (d) (? points) ∂C ∂V = HT ∂C ∂H˜ (e) (? points) ∂C ∂v0 = ~1 ∂C ∂H˜ (f) (? points) ∂C ∂H = ∂C ∂H˜ V T Here is a complete set of matrix equations for back propagation in the neural net. You do not have to prove them, but you may use them in Question 5. Each of them is similar to one of the five matrix equations above, and has a similar proof. ∂C ∂G˜ = O − T ∂C ∂U = GT ∂C ∂G˜ ∂C ∂V = HT ∂C ∂H˜ ∂C ∂W = XT ∂C ∂X˜ ∂C ∂u0 = ~1 ∂C ∂G˜ ∂C ∂v0 = ~1 ∂C ∂H˜ ∂C ∂w0 = ~1 ∂C ∂X˜ ∂C ∂G = ∂C ∂G˜ UT ∂C ∂H = ∂C ∂H˜ V T [ ∂C ∂H˜ ] nk = (1−G2nk) [ ∂C ∂G ] nk [ ∂C ∂X˜ ] nk = (1−H2nk) [ ∂C ∂H ] nk 5. (? points total) Neural Networks: implementation In this question, you will use the theory developed in Question 4 to write Python programs that train neural networks on the MNIST data. As in Question 4, we will only consider neural nets with two hidden layers, a tanh activation function, and a single logistic (sigmoid) output. In addition, the neural nets in this question will have 100 neurons in each hidden layer. You will implement both batch and stochastic gradient descent. The batch implementation is more straighforward, but as you will see, it takes much longer to converge. 22 For comparison, and to provide some quick results, you will first train neural networks on MNIST using MLPclassifier. Unless otherwise specified, do not use any functions in sklearn other than MLPclassifier and its methods. When using MPLClassifier to define a neural net, always specify stochastic gradient descent (sgd) as the optimiza- tion method, with an optimization tolerance of 10−6. You will not be workng with the full MNIST data set, but with a reduced version consisting of only two digits, since you will be doing binary classification. In answering the questions below, do not use any Python loops unless specified other- wise. All code should be vectorized. (a) Reload the data from the file mnistTVT.pickle.zip and create a reduced data set consisting of just the digits 5 and 6, exactly as in Question 6 of Assignment 1 (except you will not need a validation set). Be sure to preserve the relative order of the digits. Also, since we will be doing binary classification, the target labels should be 0s and 1s. Use 1 to labels the 5s, and 0 to label the 6s. There should be 9444 points in the reduced training set, and 1850 points in the reduced test set. You may cut-and-paste and adapt code from Assignment 1 for this purpose. You will also need a reduced data set of the digits 4 and 5. Use 1 to label the 4s, and 0 to label the 5s. This data set is for testing purposes only, and no code or results related to it should be handed in. (b) Write a function evaluateNN(clf,X,T) that evaluates classifier clf on data X,T, where clf is a neural net defined by MLPClassifier as specified above. The function should estimate the accuracy and the cross entropy of the neural net on the data in two different ways, as described in Questions 3(g) and (h). Call these estimates accuracy1, accuracy2, CE1, CE2, as before. The function should return all four of these values. To compute accuracy2 and CE2, you will have to propagate the data in X from the input of the neural net to the output. You will test this code in part (c), and use it again in part (d). If you design the sub- routines properly, you will also be able to reuse most of the code for computing accuracy2 and CE2 in parts (f) and (g). Because this code is already tested, it will simplify your debugging job there. (c) Using MLPclassifier, define a neural net as specified in the introduction. Specify an initial learning rate of 0.01, a batch size of 100, and a maximum of 100 iterations of training. If you want to monitor the loss function as training proceeds, then set verbose=True, but do not hand in the information printed out. Fit the network to the reduced training data. Be sure to set the seed for random number generation to 0 before you start. Call evaluateNN on the trained neural net with the reduced test data. Print the values of accuracy1 and accuracy2 and their difference. If everything is working correctly, both values should be exactly the same and the difference should be 0. Likewise, print the values of CE1 and CE2 23 and their difference. If everything is working correctly, both values should be exactly the same and the difference should be 0 (or almost 0, depending on your machine). If you test your programs by running them on the reduced data set of digits 4 and 5, then the accuracy should be exactly 0.9973319103521878, and the cross entropy should be exactly 0.00891227323537442 (or almost exactly, depending on your machine).4 Do not hand in these results. Instead, hand in results for the reduced data set of digits 5 and 6. (d) In this question, you will explore the effect of batch size by training and testing neural nets using different batch sizes. In particular, use MLPClassifier to define 14 different neural nets. Use the settings specified in the introduction, and also specify an initial learning rate of 0.001, at most one epoch of training, and batch sizes of 2k, for k from 0 to 13, inclusive. (You may use one loop, but no nested loops, for this purpose.) Note that this gives batch sizes from 20 = 1 to 213 = 8192. The latter encompases almost the entire training set of 9444 samples, so it effectively corresponds to batch gradient descent. Fit the neural nets to the reduced training data. Be sure to set the seed for random number generation to 0 before training each one. Use evaluateNN to compute the accuracy and cross entropy of each trained net on the reduced test data. Record the values of accuracy2 and CE2. Finally, generate two plots: one of accuracy2 v.s. batch size, and one of CE2 v.s. batch size. Use a log scale on the horizonal axes. Title the plots, Question 5(d): Accuracy v.s. batch size, and Question 5(d): Cross entropy v.s. batch size, respectively. Label the horizontal axes, batch size, and label the vertical axes accuracy and cross entropy, respectively. If everything is working correctly, and if you test your programs by running them on the reduced data set of digits 4 and 5, then the two plots should look exactly like those in Figures 7 and 8. Do not hand these in. Instead, hand in plots for the reduced data set of digits 5 and 6. (e) The salient feature of your plots in part (d) is that accuracy decreases with batch size, and cross entropy increases. Explain why this happens. The graphs show in particular that a batch size of 1 gives the geatest accuracy and lowest cross entropy. Why would we not use a batch size of 1. To answer this, pay attention during the execution of your programs, and explain any significant observations that do not show up in your graphs. If you do not notice anything, then print something out after each net has been trained. What difference would it make if you could execute your programs on a massively parallel machine, such as a gpu? (f) Batch Gradient Descent: implementation. 4“Almost exactly” means that all but the last few significant digits should be identical. 24 Figure 7: 25 Figure 8: 26 Use the theory developed in Question 4 to write a Python program that trains a neural net as specified in the introduction using batch gradient descent. Initialize the weight matrices randomly using a standard Gaussian distribution (i.e., mean 0 and variance 1). Initialize them bottom up: first W , then V , then U .5 Initialize the bias terms to 0. Be sure to set the seed for random number generation to 0 at the start of your program. Your program may use one loop but no nested loops, and all code should be vectorized. Using program comments, clearly indicate what portion of your program implements the forward pass of training, what portion implements back propagation and what portion implements the weight updates. When doing weight updates, use the average gradient, not the gradient itself. This means dividing the gradient, as given by the formulas in Question 4, by N , the number of terms in the sum in Equation (9). In batch gradient descent, N is the number of points in the training set, but would be the batch size in stochastic gradient descent. Using the average gradient means that the learning rate does not have to change much when the size of the training set changes. Notice that using the average gradient is equivalent to dividing the learning rate by N . At the beginning of each iteration, before the weights and bias terms are updated, compute the accuracy of the neural net on the reduced test data, then print the iteration number and the test accuracy. (The number of the first iteration is 0.) When training has finished, compute and print the accuracy and cross entropy of the neural net on the reduced test data. Run your program on the reduced training data using a learning rate of 0.1 and 10 iterations of gradient descent. If everything is working correctly, then if you test your program by running it on the reduced data set of digits 4 and 5, the test accuracy printed at the end should be exactly 0.6905016008537886, and the test cross-entropy should be ex- actly 1.5652003569260415 (or almost exactly, depending on your machine). Do not hand in these results. Instead, hand in results for the reduced data set of digits 5 and 6.) (g) Stochastic Gradient Descent: implementation. Modify your program in part (f) to peform stochastic gradient descent with mini- batches. That is, instead of computing the gradient of the loss function on the entire training set at once, compute the gradient on a small, random subset of the training data (called a mini-batch), perform weight updates, and then move on to the next mini-batch, and so on. As you saw in part (d), this can lead to much faster convergence. To produce random mini-batches, shuffle the training data randomly, then sweep across the shuffled data from start to finish. For example, if we want mini-batches of size 100, then the first mini-batch is the first 100 points in the training set. The second mini-batch is the second 100 points. The third mini-batch is the third 5Because initialization is random, the order affects the initial state of the neural net, and thus the final state, and thus the output of the trained net. 27 100 points, etc. (If the number of training points is not a multiple of 100 then the last mini-batch in a sweep will have fewer than 100 points in it.) Each such sweep of the training data is called an epoch. Program comments should clearly indicate where an epoch begins and where mini-batches are created. During each epoch, your program should do the following. First, compute the ac- curacy of the neural net on the reduced test data, and print the epoch number and the accuracy. (The number of the first epoch is 0.) Second, use the following state- ment to randomly shuffle the training data: X,T = sklearn.utils.shuffle(X,T). (Do not do any other shuffling of the data.) Third, sweep through the shuffled data, generating mini-batches and processing each of them in turn as described above. When updating weights, use the average gradient, averaged over the cur- rent mini-batch. Unlike part (f), your program here may use two loops, one nested inside the other. Like part (f), your program should also do the following. Set the seed for random number generation to 0 at the start of the program. Initialize the weight matrices randomly using a standard Gaussian distribution (i.e., mean 0 and variance 1), and initialize the bias terms to 0. When training has finished, compute and print the accuracy and cross entropy of the neural net on the reduced test data. Use program comments to clearly indicate what portion of your program implements the forward pass of training, what portion implements back propagation, and what portion implements the weight updates. Use vectorized code. As in part (f), run your program on the reduced training data using a learning rate of 0.1 and 10 epochs of gradient descent. In addition, use a mini-batch size of 10. If everything is working correctly, then if you test your program by running it on the reduced data set of digits 4 and 5, the test accuracy printed at the end should be above 0.95, and the test cross-entropy should be below 0.1. Do not hand in these results. Instead, hand in results for the reduced data set of digits 5 and 6. Notice that the accuracy for stochastic gradient descent is much higher than for batch gradient descent, and the cross entropy (which we want to minimize) is much lower. ?? points total 28 University of Toronto Mississauga CSC 311 - Introduction to Machine Learning Cover sheet for Assignment 2 Complete this page and hand it in with your assignment. Name: (Underline your last name) Student number: I declare that the solutions to Assignment 1 that I have handed in are solely my own work, and they are in accordance with the University of Toronto Code of Behavior on Academic Matters. Signature: 29
欢迎咨询51作业君