Homework 3 for CSI 531
Due: November 22

All homeworks are individual assignments. This means: write your own solutions and do not copy code/solutions from peers or online. Should academic dishonesty be detected, the proper reporting protocols will be invoked (see Syllabus for details).

For the programming part: Submit your solutions in one single zip file, named “[yourFirstName-yourLastName].zip”, to the homework1-code assignment through Blackboard. The zip file should include one folder, named “code”, and one jing.txt file that includes the link to your JING screencast video (see the JING instructions on Blackboard). All the code for the programming problems should be in the “code” folder. Make sure to mention the Python version you are using. Make sure you can import your .py files into Python without error, and test your solutions before submitting.

For the written part: Submit a PDF file named “[yourFirstName-yourLastName].pdf” that includes your solutions, typed in an editor such as MS Word or using LaTeX. Do not submit images of handwritten solutions.

1. Support Vector Machines [21 pts.]

Recall that we have learned about the hard-margin linear SVM (no slack allowed), the soft-margin linear SVM (slack allowed), and the soft-margin kernel SVM in class. Suppose that we want to solve a hard-margin kernel SVM problem. In other words, if φ is our mapping from the input space to the feature space (for each input point x_i we have φ(x_i) in feature space), K(x_i, x_j) denotes the kernel function, and w is the weight vector in feature space, we want to find the hyperplane h(φ(x)) : w^T φ(x) = 0. Since we are looking for a hard-margin kernel SVM, our objective function is:

    min J(w) = (1/2) ||w||^2   s.t.   y_i (w^T φ(x_i)) ≥ 1 for all (x_i, y_i)    (1)

In this homework, you will find a solution to this problem by following these steps:

(a) [3 pts.] Using Lagrange multipliers for the constraints, write the Lagrangian objective function for this problem. What new constraints are added to this objective function?
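As a small illustration of the setup above (this is not part of the assignment): the kernel K(x_i, x_j) plays the role of the feature-space inner product φ(x_i)^T φ(x_j). The sketch below checks this identity for the degree-2 homogeneous polynomial kernel, whose explicit feature map for 2-D inputs is known in closed form.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    # Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel evaluates the feature-space inner product without
# ever constructing phi explicitly.
print(K(x, z), np.dot(phi(x), phi(z)))  # both equal (1*3 + 2*(-1))^2 = 1
```

This is exactly why the dual formulation is convenient: every appearance of φ can be replaced by a kernel evaluation.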
(b) [3 pts.] Write the derivative of the Lagrangian objective function that you obtained in the previous part with respect to w.

(c) [3 pts.] Set the derivative from the previous part to 0 and solve for w. What do you obtain? How do you interpret this result?

(d) [3 pts.] Substitute the w obtained in part (c) back into the Lagrangian objective function from part (a), and use your conclusions from part (c) to rewrite it. The new objective function, the dual Lagrangian, should be defined in terms of the Lagrange multipliers only. What constraints are added to the dual Lagrangian objective function?

(e) [3 pts.] Suppose that we want to solve the dual Lagrangian objective function for the Lagrange multipliers using the gradient ascent algorithm. Recall that gradient ascent is an iterative algorithm that, in each iteration, updates the parameters according to the gradient of the objective with respect to the parameters. In other words, if J(α) is the objective function, then in iteration t+1 we have α_{t+1} = α_t + step · ∂J(α)/∂α, in which step is a constant. As a first step, calculate the partial gradient of the dual Lagrangian objective function with respect to one of the Lagrange multipliers. Write the gradient in terms of the kernel function.

(f) [3 pts.] Suppose that we used the result of part (e) to compute all of the Lagrange multipliers. How can we use them to calculate w? How can we classify any new point z?

(g) [3 pts.] Assume that we chose a kernel function and trained a hard-margin kernel SVM on some training data. Which of the following pictures could represent a hard-margin kernel SVM? Explain why, or why not, for each.

[Pictures a)–d) are not reproduced in this text version.]

2. [30 pts.] Spectral Clustering: Consider the following similarity graph with similarity values annotated on the edges.

[Figure not reproduced: similarity graph on nodes 1–6; the edge weights include 2, 3, 3, 4, 10, 10, and 100.]

(a) [6 pts.]
Cuts: Consider the following cuts:

cut 1: S1 = {1, 2, 3}, T1 = {4, 5, 6}
cut 2: S2 = {1}, T2 = {2, 3, 4, 5, 6}
cut 3: S3 = {1, 2, 3, 4}, T3 = {5, 6}

For each of the cuts (i = 1 ... 3), compute: (1) the cut weight W(S_i, T_i); (2) the ratio cut W(S_i, T_i)/|S_i| + W(S_i, T_i)/|T_i|; and (3) the normalized cut W(S_i, T_i)/vol(S_i) + W(S_i, T_i)/vol(T_i). How do the three cuts rank (from smallest to largest) according to each of the cut criteria?

(b) [6 pts.] Laplacians: Compute the adjacency matrix A, the degree matrix D, the Laplacian matrix L = D − A, and the symmetric Laplacian matrix L_s = D^{−1/2} L D^{−1/2} of the graph above (do this by hand and show them in your solutions file).

(c) [6 pts.] Eigenvectors (requires coding): Encode the L_s and L you obtained above into a Python program and perform an eigendecomposition to obtain the eigenvector corresponding to the smallest non-zero eigenvalue of both L and L_s. (Hint: use numpy.linalg.eig(X) to compute the eigendecomposition of L and L_s.) Let u be the aforementioned eigenvector for L and u_s the one for L_s. Report the value corresponding to each node in u and u_s. Plot these values, with the node id (1 to 6) on the x axis and the corresponding value of u or u_s on the y axis. Hint: since numpy uses numerical methods to find eigenvalue/eigenvector pairs, numerical error can make the first eigenvalue come out not as exactly zero but as some small number like 10^{−15}. This is still actually a zero eigenvalue.

(d) [6 pts.] Hierarchical clusters from eigenvectors: Here we will do agglomerative single-link clustering (on paper, NOT in a program) of the nodes represented by their u values and, separately, by their u_s values computed in the previous part. Note that we will get two dendrograms, one for u and one for u_s, and in each the nodes are one-dimensional points. Draw the two dendrograms.

(e) [6 pts.] Hierarchical clusters from eigenvectors: Cut each of the dendrograms above to obtain two clusters. Show the resulting partitions in your solution file.
What are their cut weights, normalized cuts, and ratio cuts? Discuss why we obtain these partitions by relating the Laplacians to the specific cut measures.

3. [22 pts.] K-means and K-medians: There is a reason why we typically use the mean of the data points as the cluster representative when we do k-means with the L2 distance measure: the mean minimizes the sum of squared distances to all points. Similarly, we use the median coupled with the L1 distance. In this question we will convince ourselves, both theoretically and by example, that these are good choices.

(a) [6 pts.] Example: Consider the following set of numbers {0, 1, 2, 2, 10}. Compute the mean µ and the median m. (1) Show that the sum of squared distances to the mean is smaller than that to the median, i.e.

    Σ_i (x_i − µ)^2 ≤ Σ_i (x_i − m)^2.

(2) Show that the sum of absolute distances to the median is smaller than that to the mean:

    Σ_i |x_i − m| ≤ Σ_i |x_i − µ|.

(b) [8 pts.] Proof (mean + L2): Prove that the mean µ of a set X of numbers (i.e., one-dimensional points) minimizes the sum of squared distances to all numbers x_i in the set. You need to prove that:

    µ = argmin_a Σ_i (x_i − a)^2

(c) [8 pts.] Proof (median + L1): Similarly, prove that the median m of a set X of numbers (i.e., one-dimensional points) minimizes the sum of absolute distances to all numbers x_i in the set. You need to prove that:

    m = argmin_a Σ_i |x_i − a|

4. [27 pts.] Density-based Clustering: Considering the data points in the figure below, answer the following questions.

(a) [6 pts.] L∞: Using ε = 1, minpts = 3, and the L∞ distance, find all core, border, and noise points.

(b) [6 pts.] L∞ Clusters: Find all the clusters found by DBSCAN using the settings in part (a) and show them on the figure.

(c) [6 pts.] L2: Using ε = 1, minpts = 3, and the L2 distance, find all core, border, and noise points. Compare your results with part (a). What differences do you see? Why?

(d) [9 pts.]
L2 Clusters: Find all the clusters found by DBSCAN using the settings in part (c) and show them on the figure. Compare your results with part (b). What differences do you see? Why?
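As an aside, the core/border/noise classification described above can be sketched in a few lines of Python. The points below are hypothetical (the assignment's figure is not reproduced in this text version), and this sketch counts a point as a member of its own ε-neighborhood, a convention that can vary between DBSCAN descriptions.

```python
import numpy as np

def classify_points(points, eps, minpts, norm):
    """Label each point core/border/noise under the given norm.

    A point is core if its eps-neighborhood (including itself) contains
    at least minpts points; border if it is not core but lies in some
    core point's neighborhood; otherwise noise.
    norm=np.inf gives the L-infinity distance, norm=2 gives L2.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise distance matrix via broadcasting.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, ord=norm, axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= minpts
    labels = []
    for i in range(len(points)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical diagonal points (NOT the assignment's figure): under
# L-infinity, consecutive diagonal neighbors are within eps = 1, but
# under L2 they are sqrt(2) apart, so the labels change.
pts = [(0, 0), (1, 1), (2, 2), (4, 4)]
print(classify_points(pts, eps=1, minpts=3, norm=np.inf))
# -> ['border', 'core', 'border', 'noise']
print(classify_points(pts, eps=1, minpts=3, norm=2))
# -> ['noise', 'noise', 'noise', 'noise']
```

The diagonal example makes the difference asked about in parts (a) and (c) concrete: the L∞ unit ball is a square, so it reaches diagonal neighbors that the round L2 ball of the same radius does not.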
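Returning to the coding task in Problem 2(c): a minimal sketch of the eigendecomposition with a numerical-zero tolerance might look like the following. It is run on a small example graph (a 4-cycle, not the assignment's graph), and it uses numpy.linalg.eigh rather than the eig suggested in the hint, since the Laplacian is symmetric and eigh returns real eigenvalues already sorted in ascending order.

```python
import numpy as np

# Small example graph: a 4-cycle (NOT the assignment's graph).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
D = np.diag(A.sum(axis=1))
L = D - A

# eigh: eigendecomposition for symmetric matrices,
# eigenvalues returned in ascending order.
vals, vecs = np.linalg.eigh(L)

# Treat tiny eigenvalues (e.g. ~1e-15 from floating-point error)
# as exact zeros, per the hint in Problem 2(c).
tol = 1e-10
idx = next(i for i, v in enumerate(vals) if v > tol)
u = vecs[:, idx]  # eigenvector of the smallest non-zero eigenvalue

print(vals[idx])  # for the 4-cycle this is 2.0
print(u)
```

The same pattern applies to L_s after replacing L; the reported u and u_s values per node are then what gets plotted against the node ids.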