CS6140: Machine Learning
Homework Assignment # 2
Assigned: 02/16/2021
Due: 03/01/2021, 11:59pm, through Canvas

Three problems, 100 points in total. Good luck!
Prof. Predrag Radivojac, Northeastern University

Problem 1. (20 points) Naive Bayes classifier. Consider a binary classification problem where there are eight data points in the training set. That is,

$$D = \{(-1,-1,-1,-),\ (-1,-1,1,+),\ (-1,1,-1,+),\ (-1,1,1,-),\ (1,-1,-1,+),\ (1,-1,1,-),\ (1,1,-1,-),\ (1,1,1,+)\},$$

where each tuple $(x_1, x_2, x_3, y)$ represents a training example with input vector $(x_1, x_2, x_3)$ and class label $y$.

a) (10 points) Construct a naive Bayes classifier for this problem and evaluate its accuracy on the training set. Measure accuracy as the fraction of correctly classified examples.

b) (10 points) Transform the input space into the higher-dimensional space

$$(x_1,\ x_2,\ x_3,\ x_1x_2,\ x_1x_3,\ x_2x_3,\ x_1x_2x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ x_1^2x_2,\ x_1x_2^2,\ x_1x_3^2,\ x_2^2x_3,\ x_2x_3^2)$$

and repeat the previous step.

Carry out all steps manually and show all your calculations. Discuss your main observations.

Problem 2. (25 points) Consider a binary classification problem in which we want to determine the optimal decision surface. A point $x$ is on the decision surface if $P(Y = 1 \mid x) = P(Y = 0 \mid x)$.

a) (10 points) Find the optimal decision surface assuming that each class-conditional distribution is defined as a two-dimensional Gaussian distribution:

$$p(x \mid Y = i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \cdot e^{-\frac{1}{2}(x - m_i)^{\top} \Sigma_i^{-1} (x - m_i)},$$

where $i \in \{0, 1\}$, $m_0 = (1, 2)$, $m_1 = (6, 3)$, $\Sigma_0 = \Sigma_1 = I_2$, $P(Y = 0) = P(Y = 1) = 1/2$, $I_d$ is the $d$-dimensional identity matrix, and $|\Sigma_i|$ is the determinant of $\Sigma_i$.

b) (5 points) Generalize the solution from part (a) using $m_0 = (m_{01}, m_{02})$, $m_1 = (m_{11}, m_{12})$, $\Sigma_0 = \Sigma_1 = \sigma^2 I_2$ and $P(Y = 0) \neq P(Y = 1)$.

c) (10 points) Generalize the solution from part (b) to arbitrary covariance matrices $\Sigma_0$ and $\Sigma_1$. Discuss the shape of the optimal decision surface.

Problem 3. (55 points) Consider a multivariate linear regression problem of mapping $\mathbb{R}^d$ to $\mathbb{R}$, with two different objective functions. The first objective function is the sum of squared errors, as presented in class; i.e., $\sum_{i=1}^{n} e_i^2$, where $e_i = w_0 + \sum_{j=1}^{d} w_j x_{ij} - y_i$. The second objective function is the sum of squared Euclidean distances to the hyperplane; i.e., $\sum_{i=1}^{n} r_i^2$, where $r_i$ is the Euclidean distance from point $(x_i, y_i)$ to the hyperplane $f(x) = w_0 + \sum_{j=1}^{d} w_j x_j$.

a) (10 points) Derive a gradient descent algorithm to find the parameters of the model that minimizes the sum of squared errors.

b) (20 points) Derive a gradient descent algorithm to find the parameters of the model that minimizes the sum of squared distances.

c) (20 points) Implement both algorithms and test them on 3 datasets. Datasets can be randomly generated, as in class, or obtained from resources such as the UCI Machine Learning Repository. Compare the solutions to the closed-form (maximum likelihood) solution derived in class and find the $R^2$ in all cases on the same dataset used to fit the parameters; i.e., do not implement cross-validation. Briefly describe the data you use and discuss your results.

d) (5 points) Normalize every feature and the target using a linear transform such that the minimum value for each feature and the target is 0 and the maximum value is 1. The new value for feature $j$ of data point $i$ can be found as

$$x_{ij}^{\text{new}} = \frac{x_{ij} - \min_{k \in \{1,2,\ldots,n\}} x_{kj}}{\max_{k \in \{1,2,\ldots,n\}} x_{kj} - \min_{k \in \{1,2,\ldots,n\}} x_{kj}},$$

where $n$ is the dataset size. The new value for target $i$ can be found as

$$y_{i}^{\text{new}} = \frac{y_{i} - \min_{k \in \{1,2,\ldots,n\}} y_{k}}{\max_{k \in \{1,2,\ldots,n\}} y_{k} - \min_{k \in \{1,2,\ldots,n\}} y_{k}}.$$

Measure the number of steps towards convergence and compare with the results from part (c). Briefly discuss your results.

Directions and Policies

Submit a single package containing all answers, results and code.
Your submission package should be compressed and named firstnamelastname.zip (e.g., predragradivojac.zip). The package should contain a single pdf file named main.pdf with answers to all questions, all figures, and all relevant results. Your solutions and answers must be typed [1], and you must type your name and Northeastern username (email) at the top of the first page of the main.pdf file. The rest of the package should contain all code that you used. The code should be properly organized in folders and subfolders, one for each question or problem.

All code, if applicable, should be turned in when you submit your assignment, as it may be necessary to demo your programs to the teaching assistants. Use Matlab, Python, R, Java, or C/C++. However, you are encouraged to use languages with good machine learning libraries (e.g., Matlab, Python, R), which may be handy in future assignments.

Unless there are legitimate circumstances, late assignments will be accepted up to 5 days after the due date and graded using the following rules:

on time: your score × 1
1 day late: your score × 0.9
2 days late: your score × 0.7
3 days late: your score × 0.5
4 days late: your score × 0.3
5 days late: your score × 0.1

For example, if you submit 3 days late and get 80 points for your answers, your total number of points will be 80 × 0.5 = 40 points.

All assignments are individual, except when collaboration is explicitly allowed. All the sources used for problem solution must be acknowledged; e.g., web sites, books, research papers, personal communication with people, etc. Academic honesty is taken seriously! For detailed information see Office of Student Conduct and Conflict Resolution.

[1] We recommend LaTeX; in particular, the TeXShop-MacTeX combination for a Mac and the TeXnicCenter-MiKTeX combination on Windows. An easy way to start with LaTeX is to use the freely available LyX. You can also use Microsoft Word or other programs that can display formulas professionally.
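As an illustration of the normalization described in Problem 3(d) and the $R^2$ measure requested in Problem 3(c), the following is a minimal Python sketch, not a reference solution. The function names `min_max_normalize` and `r_squared` are hypothetical, NumPy is an assumed dependency, the handling of constant columns is an assumption the assignment does not address, and $R^2$ is taken to be the usual coefficient of determination, $1 - SS_{res}/SS_{tot}$, which may differ in detail from the definition used in class.

```python
import numpy as np


def min_max_normalize(values):
    """Linearly rescale each column (or a 1-D vector) to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column (span == 0) is mapped to 0 instead of
    # dividing by zero; the assignment does not specify this case.
    safe_span = np.where(span == 0, 1.0, span)
    return (values - col_min) / safe_span


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy data (made up for illustration): 3 points, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
y = np.array([3.0, 5.0, 9.0])

X_new = min_max_normalize(X)  # each feature column now spans [0, 1]
y_new = min_max_normalize(y)  # the target now spans [0, 1]
```

The normalization is applied per feature column and separately to the target, matching the two formulas in part (d). `r_squared` can then be called on the targets and on the predictions of any fitted model, using the same data that was used to fit it, as part (c) asks.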