Homework 3 for CSI 531
Due: November 22
All homeworks are individual assignments. This means: write your own solutions and do not
copy code/solutions from peers or online. Should academic dishonesty be detected, the proper
reporting protocols will be invoked (see Syllabus for details). For the programming part: You should
submit your solutions in one single zip file, named “[yourFirstName-yourLastName].zip”, to the homework1-code
assignment through blackboard. The zip file should include one folder, named “code” and one jing.txt file that
includes the link to your JING screencast video (look at JING instructions on blackboard). All the code for
programming problems should be in the “code” folder. Make sure to mention the Python version you are using.
Make sure you can import your .py files into Python without error and test your solutions before submitting.
For the written part: Submit a PDF file named “[yourFirstName-yourLastName].pdf” that includes your
solution, typed in an editor like MS Word or using LaTeX. Do not submit images of handwritten solutions.
1. Support Vector Machines [21 pts.] Recall that we have learned about hard-margin linear SVM (with
no slack allowed), soft-margin linear SVM (slack is allowed), and soft-margin kernel SVM in class. Suppose
that we want to solve a hard-margin kernel SVM problem. In other words, if φ is our mapping from the
input space to the feature space (for each input point xi we have φ(xi) in feature space), K(xi, xj) denotes
the kernel function, and w is the weight vector in feature space, we want to find the hyperplane h(φ(x)): wTφ(x) = 0. Since
we are looking for a hard-margin kernel SVM, our objective function would be:
min J(w) = (1/2) ‖w‖²
s.t. ∀(xi, yi), yi(wTφ(xi)) ≥ 1        (1)
In this homework, you will find a solution to this problem following these steps:
(a) [3 pts.] Using Lagrange multipliers for the constraints, write the Lagrangian objective function for this
problem. What new constraints are added to this objective function?
(b) [3 pts.] Write the derivative, with respect to w, of the Lagrangian objective function that you obtained in the
previous part.
(c) [3 pts.] Set the derivative from the previous part to 0 and solve for w. What do you obtain? How do you
interpret this result?
(d) [3 pts.] Substitute the w obtained in part (c) into the Lagrangian objective function from part (a). Use your conclusions
from part (c) to rewrite the objective function. The new objective function, the dual Lagrangian, should be
defined in terms of the Lagrange multipliers only. What constraints are added to the dual
Lagrangian objective function?
(e) [3 pts.] Suppose that we want to solve the dual Lagrangian objective function, to obtain the Lagrange
multipliers, using the gradient ascent algorithm. Remember that gradient ascent is an iterative algorithm
that, in each iteration, updates the parameters according to the gradient of the objective with respect to the parameters.
In other words, if J(α) is the objective function, then at iteration t + 1 we have
α_{t+1} = α_t + step · ∂J(α)/∂α, where step is a constant. As a first step to do so, calculate the partial gradient of the
dual Lagrangian objective function with respect to one of the Lagrange multipliers. Write the gradient
in terms of the kernel function.
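As an illustration only (not the required derivation), the coordinate-wise gradient-ascent loop described in this part could be sketched as below. Everything here is an assumption for the sketch: K stands for an n-by-n kernel matrix K(xi, xj), y for the label vector, and grad_J is a placeholder function that you supply yourself once you have derived the partial derivative ∂J(α)/∂αk asked for in this part.

# Minimal, hypothetical sketch of the gradient-ascent loop from part (e).
# K: n x n kernel matrix, y: labels, grad_J: user-supplied function returning
# the partial derivative dJ(alpha)/d(alpha_k) (to be derived in this part).
import numpy as np

def dual_gradient_ascent(K, y, grad_J, step=0.01, n_iters=1000):
    n = len(y)
    alpha = np.zeros(n)                       # one Lagrange multiplier per training point
    for _ in range(n_iters):
        for k in range(n):
            g_k = grad_J(alpha, K, y, k)      # partial gradient with respect to alpha_k
            alpha[k] = alpha[k] + step * g_k  # the update alpha_{t+1} = alpha_t + step * dJ/dalpha
    return alpha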
(f) [3 pts.] Suppose that we used the result of part (e) to calculate all of the Lagrange multipliers. How
can we use them to calculate w? How can we classify any new point z?
(g) [3 pts.] Assume that we chose a kernel function and trained a hard-margin kernel SVM on some training data
points. Which of the following pictures can represent a hard-margin
kernel SVM? Explain why, or why not, for each.
2. [30 pts.] Spectral Clustering: Consider the following similarity graph with similarity values annotated
on edges.
[Figure for Question 1(g): four panels, a), b), c), d).]
[Figure for Question 2: similarity graph on nodes 1–6; the similarity values annotated on its edges are 2, 3, 3, 4, 10, 10, and 100.]
(a) [6 pts.] Cuts: Consider the following cuts:
cut 1: S1 = {1, 2, 3}, T1 = {4, 5, 6}
cut 2: S2 = {1}, T2 = {2, 3, 4, 5, 6}
cut 3: S3 = {1, 2, 3, 4}, T3 = {5, 6}
Compute:
(1) the cut weight W(Si, Ti);
(2) the ratio cut: W(Si, Ti)/|Si| + W(Si, Ti)/|Ti|; and
(3) the normalized cut: W(Si, Ti)/vol(Si) + W(Si, Ti)/vol(Ti)
for each of the cuts (i = 1 . . . 3). How do the three cuts rank (from smallest to largest) according to
each of the cut criteria?
(b) [6 pts.] Laplacians: Compute the adjacency matrix A, degree matrix D, Laplacian matrix L = D − A,
and symmetric Laplacian matrix Ls = D^(−1/2) L D^(−1/2) of the graph above (do this by hand and show
them in your solutions file).
(c) [6 pts.] Eigenvectors (requires coding): Encode the L and Ls you obtained above in a Python
program and perform an eigendecomposition to obtain the eigenvectors corresponding to the smallest
non-zero eigenvalue of both L and Ls. (Hint: Use numpy.linalg.eig(X) to compute the eigendecomposition
of L and Ls.) Let u be the aforementioned eigenvector for L and us for Ls. Report the values
corresponding to each node in u and us. Plot these values, with the node id (1
to 6) on the x axis and the corresponding value of u and us on the y axis.
Hint: Note that since numpy uses numerical methods to find eigenvalue/eigenvector pairs, numerical
errors can lead to the first eigenvalue being not exactly zero but some small number like 10^(−15). This is
still actually a zero eigenvalue.
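A minimal sketch of what this part could look like is given below. Note the assumption it makes: the adjacency matrix A used here is a made-up stand-in (an unweighted path 1–2–3–4–5–6), NOT the assignment's similarity graph, so replace it with the A you computed by hand in part (b).

# Minimal sketch for part (c). NOTE: A below is a made-up stand-in graph;
# replace it with the adjacency matrix from part (b).
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))                        # degree matrix
L = D - A                                         # Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
Ls = D_inv_sqrt @ L @ D_inv_sqrt                  # symmetric Laplacian

def smallest_nonzero_eigvec(M, tol=1e-8):
    # numpy.linalg.eig as in the hint; eigenvalues near 1e-15 are treated as zero
    vals, vecs = np.linalg.eig(M)
    vals, vecs = vals.real, vecs.real             # L and Ls are symmetric, so eigenpairs are real
    for idx in np.argsort(vals):
        if vals[idx] > tol:
            return vecs[:, idx]
    raise ValueError("no non-zero eigenvalue found")

u, us = smallest_nonzero_eigvec(L), smallest_nonzero_eigvec(Ls)

nodes = np.arange(1, 7)
plt.plot(nodes, u, marker="o", label="u (from L)")
plt.plot(nodes, us, marker="s", label="us (from Ls)")
plt.xlabel("node id")
plt.ylabel("eigenvector value")
plt.legend()
plt.show()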
(d) [6 pts.] Hierarchical clusters from eigenvectors: Here we will do agglomerative single-link
clustering (on paper, NOT in a program) of the nodes represented by their u values and by their us
values computed in the previous part. Note that we will get two dendrograms, one for u and one for us,
and in each the nodes will be one-dimensional points. Draw the two dendrograms.
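The part above asks for dendrograms drawn on paper; if you want to sanity-check them afterwards, a minimal sketch using scipy's single-link agglomerative clustering on the one-dimensional values u and us from part (c) might look like the following (the function name and the commented calls are illustrative only).

# Optional sanity check only; the dendrograms themselves must be drawn by hand.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def draw_dendrogram(values, title):
    Z = linkage(np.asarray(values).reshape(-1, 1), method="single")    # single-link on 1-D points
    dendrogram(Z, labels=[str(i) for i in range(1, len(values) + 1)])  # node ids as leaf labels
    plt.title(title)
    plt.show()

# draw_dendrogram(u, "single-link on u (from L)")
# draw_dendrogram(us, "single-link on us (from Ls)")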
(e) [6 pts.] Hierarchical clusters from eigenvectors: Cut the dendrograms above to obtain two clusters.
Show the resulting partitions in your solution file. What are their cut weights, normalized cuts and ratio
cuts? Discuss why we obtain these partitions by relating the Laplacians to the specific cut measures.
3. [22 pts.] K-means and K-medians: There is a reason why we typically use the mean of the data points
as the cluster representative when we do k-means with the L2 distance measure: the mean minimizes the sum of
squared distances to all points. Similarly, we use the median coupled with the L1 distance. In this question
we will convince ourselves, theoretically and by an example, that these are good choices.
(a) [6 pts.] Example: Consider the following set of numbers {0, 1, 2, 2, 10}. Compute the mean µ and
the median m.
(1) Show that the sum of squared distances to the mean is smaller than that to the median, i.e.,
Σi (xi − µ)² ≤ Σi (xi − m)².
(2) Show that the sum of absolute distances to the median is smaller than that to the mean, i.e.,
Σi |xi − m| ≤ Σi |xi − µ|.
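As a quick numerical sanity check of the two inequalities above (it does not replace the computation you are asked to show), something like the following numpy snippet can be used on the given set.

# Numerical check of inequalities (1) and (2) for {0, 1, 2, 2, 10}.
import numpy as np

x = np.array([0, 1, 2, 2, 10])
mu, m = x.mean(), np.median(x)

sq_to_mean    = np.sum((x - mu) ** 2)     # sum of squared distances to the mean
sq_to_median  = np.sum((x - m) ** 2)      # sum of squared distances to the median
abs_to_mean   = np.sum(np.abs(x - mu))    # sum of absolute distances to the mean
abs_to_median = np.sum(np.abs(x - m))     # sum of absolute distances to the median

print(sq_to_mean <= sq_to_median)         # inequality (1)
print(abs_to_median <= abs_to_mean)       # inequality (2)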
(b) [8 pts.] Proof (mean + L2): Prove that the mean µ of a set X of numbers (i.e. one-dimensional
points) minimizes the sum of squared distances to all numbers xi in the set. You need to prove that:
µ = argmin_a Σi (xi − a)²
(c) [8 pts.] Proof (median + L1): Similarly, prove that the median m of a set X of numbers (i.e.
one-dimensional points) minimizes the sum of absolute distances to all numbers xi in the set. You need
to prove that:
m = argmin_a Σi |xi − a|
4. [27 pts.] Density-based Clustering: Considering the data points in the figure below, answer the following
questions.
(a) [6 pts.] L∞: Using ε = 1, minpts = 3, and the L∞ distance, find all core, border, and noise points.
(b) [6 pts.] L∞ Clusters: Report all the clusters found by DBSCAN using the setting in part (a) and show
them on the figure.
(c) [6 pts.] L2: Using ε = 1, minpts = 3, and the L2 distance, find all core, border, and noise points. Compare
your results with part (a). What differences do you see? Why?
(d) [9 pts.] L2 Clusters: Report all the clusters found by DBSCAN using the setting in part (c) and show
them on the figure. Compare your results with part (b). What differences do you see? Why?
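If you want to cross-check your hand-derived answers for parts (a)–(d), a minimal sketch with scikit-learn's DBSCAN is given below. The points array is a placeholder to be filled in with coordinates read off the figure, and note that sklearn's min_samples counts the point itself, which may or may not match the minpts convention used in class.

# Optional cross-check sketch for Question 4; points is a placeholder,
# NOT the data from the figure.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0.0, 0.0], [1.0, 0.0]])      # replace with the figure's data points

for metric in ("chebyshev", "euclidean"):        # chebyshev = L-infinity, euclidean = L2
    db = DBSCAN(eps=1, min_samples=3, metric=metric).fit(points)
    print(metric)
    print("  labels:", db.labels_)                           # -1 marks noise points
    print("  core point indices:", db.core_sample_indices_)  # indices of core points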