MA429 Algorithmic Techniques for Data Mining
Exercises 9 (summative homework)
Gregory Sorkin (version: 2020.03.24 10:37) © LSE 2020

Submission

Submit the summative solutions by 11.59pm on Thursday 2nd April (London time) to the Moodle submission link. This homework is worth 10% of the final course mark. (This is increased from the 5% planned, in order to decrease the weight on the final exam due to coronavirus disruption.)

Submit a single pdf that contains the solutions to all questions. (This can include scanned versions of handwritten arguments, although typesetting is preferred.) Submit your R code for the programming question as an R script or Rmd file; if Rmd, please include the report that includes the code's output. Your files should be named with your candidate exam number, which should also appear within the files to be sure. Do not show your name or registration number anywhere: these submissions are anonymous.

Details of whether you need to submit a plagiarism/receipt form, and how, are under discussion within Mathematics and will follow in due course.

You are to do this homework individually. You are welcome to use the course materials provided and recommended, including all textbooks mentioned. You may also use generic online reference sources such as Wikipedia pages, but (though I doubt that it would help anyway) you must not search specifically for solutions to these problems.

1. The figure below shows some data points in two dimensions. [Figure not reproduced.]

(a) What are the two principal component (PC) directions?

(b) If the coordinates of the red point are (3, −1), roughly what would be its coordinates in the principal component space?

2. You wish to cluster the points shown below into two clusters, hoping for the clusters indicated here by colour. [Figure not reproduced.] We compare K-means with K = 2, DBSCAN, and spectral clustering. What clusters would you expect each to produce? Which method(s) would find the right clusters, as colour-coded here?

3. You begin with standardised data with $n$ data points in $p$ dimensions. (A short numerical illustration in R follows this question.)

(a) What is the sum of squares of all $n \times p$ scores (the standardised values), and why?

(b) You perform PCA on this data, getting $p$ score vectors, each of dimension $n$. What do they represent?

(c) What is the sum of squares of all $pn$ entries of these PCA score vectors, and why?

(d) Is the variance of the entries of the first score vector more or less than 1, and why?

(e) What is the interpretation of the variance of the first score vector divided by $p$?

(f) The biplot below is for the iris dataset, with 4 features. Explain the meaning of the "42" near the top left corner, near coordinates (−2, 2.3).

[Biplot not reproduced: the 150 iris observations are plotted by number against PC1 (horizontal) and PC2 (vertical) on scales of roughly −3 to 3 (bottom and left axes); the loading vectors for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are drawn against scales of −1 to 1 (top and right axes).]

(g) The picture also indicates the loading vectors. (These are the coefficients $w_{ij}$ in the linear combinations in the lecture notes.) Interpret the Sepal.Width loading vector, including its coordinates; the appropriate scales are shown on the top and the right-hand side of the box.
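The following minimal R sketch (not part of the original sheet) illustrates parts 3(a) to 3(e) on the iris data. One caveat: R's scale() and var() use the divisor $n-1$, so the total sum of squares comes out as $(n-1)p$ rather than $np$; under a divisor-$n$ convention it would be exactly $np$.

```r
# Illustration of 3(a)-(e): sums of squares before and after PCA.
X <- scale(iris[, 1:4])   # standardise: n = 150 points, p = 4 dimensions
n <- nrow(X); p <- ncol(X)
sum(X^2)                  # total sum of squares: (n - 1) * p = 596
pca <- prcomp(X)          # PCA; pca$x holds the p score vectors
sum(pca$x^2)              # same total: the orthogonal rotation of PCA
                          # preserves sums of squares
var(pca$x[, 1])           # variance of the first score vector (> 1)
var(pca$x[, 1]) / p       # proportion of total variance explained by PC1
```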
Homework exercises

Again, this is a summative homework set. See instructions at the top of the exercises. The marks indicated sum to 100%, but will be scaled so that the assignment counts as 10% of the final course mark.

4. [50 marks] This exercise is long but self-guiding. Students had good success with, and seemed to learn from, a similar (but different and longer) past exercise. In this exercise you are to do one step of back-propagation on the neural network described, which has one hidden layer. Exercise set 8 problem 2 may be helpful even though the setup is not exactly the same.

In short, we have a 2-layer network with 2 inputs, 2 neurons in the first layer, 1 neuron in the second layer, and bias inputs in both layers. Both layers have sigmoid activation functions, $\sigma(y) = 1/(1+e^{-y})$, and at the end we compute the logistic loss, $L(y, \hat{y}) = -y \ln(\hat{y}) - (1-y)\ln(1-\hat{y})$. We will take the input vector $x^1 = (x^1_1, x^1_2) = (2, 3)$ with desired output $y = 1$. All vectors written as above are to be considered column vectors, and all superscripts are layer indices, not powers. See Figure 1. The weight matrices are

$$W^1 = \begin{pmatrix} W^1_{10} & W^1_{11} & W^1_{12} \\ W^1_{20} & W^1_{21} & W^1_{22} \end{pmatrix} = \begin{pmatrix} -0.7 & -1 & 1 \\ -0.9 & -0.6 & 1.1 \end{pmatrix} \tag{1}$$

$$W^2 = \begin{pmatrix} W^2_{10} & W^2_{11} & W^2_{12} \end{pmatrix} = \begin{pmatrix} -0.3 & 1 & 1.5 \end{pmatrix}. \tag{2}$$

For both matrices we index the first column by 0 to indicate that it acts on the bias input (identically 1). For convenience we define $\sigma$ on a vector, meaning we apply it to each element. Then the net inputs to the first layer of neurons are the matrix product

$$y^1 = (y^1_1, y^1_2) = W^1 \cdot (1, x^1_1, x^1_2) \tag{3}$$

and their outputs are the vector

$$x^2 = (x^2_1, x^2_2) = \sigma(y^1) = (\sigma(y^1_1), \sigma(y^1_2)). \tag{4}$$

Figure 1: Sketch of the network described. To avoid clutter, only 2 of the 6 weights in $W^1$ are labeled. [Figure not reproduced.]

Similarly, the sole neuron at the second layer has net input $y^2 = W^2 \cdot (1, x^2_1, x^2_2)$ and output $x^3 = \sigma(y^2)$, leading to loss $\ell = L(1, x^3)$.

(a) Write to 4-digit accuracy (x.xxxx) the values of $y^1$, $x^2$, $y^2$, $x^3$, $\ell$. (Suggestion: you will need to repeat this several times with slightly different weights, in parts 4(b), 4(h), and 4(i). So, if you are comfortable with any programming language, write a small program to do this, taking $x^1$, $W^1$ and $W^2$ as inputs; one possible version in R is sketched after part 4(d).) (Hint: you should find $\ell = 0.2151$.)

(b) Add $\varepsilon = 0.01$ to $W^2_{11}$, compute the new loss $\ell$, and compare the new and old losses (take the difference and divide by $\varepsilon$) to estimate $\partial\ell/\partial W^2_{11}$. (Hint: in this case you don't need to recompute too much.)

(c) The derivative of the logistic function is $\sigma'(y) = e^{-y}/(1+e^{-y})^2$, and that of the logistic loss (for target class 1, as here) is $\frac{d}{d\hat{y}} L(1, \hat{y}) = -1/\hat{y}$. Find $\partial\ell/\partial x^3$, $\partial x^3/\partial y^2$, and $\partial y^2/\partial W^2$ to compute $\partial\ell/\partial W^2$. Note that $\partial y^2/\partial W^2$ is a 3-vector, $(\partial y^2/\partial W^2_{10}, \partial y^2/\partial W^2_{11}, \partial y^2/\partial W^2_{12})$, and similarly for $\partial\ell/\partial W^2$. Hint: calculus with matrices works (there's a Wikipedia page), if you are careful about notation and what things mean. Here, from $y^2 = W^2 \cdot (1, x^2_1, x^2_2)$, it is true that $\partial y^2/\partial W^2 = (\partial y^2/\partial W^2_{10}, \partial y^2/\partial W^2_{11}, \partial y^2/\partial W^2_{12}) = (1, x^2_1, x^2_2)$.

(d) Just to check your work, compare your results in parts 4(b) and 4(c) to be sure they match. (Remember that the derivative applies to an infinitesimal change, so the result for that and for $\varepsilon = 0.01$ won't be identical.)
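Below is one possible version of the small program suggested in part 4(a), written in R (the course language); the function and variable names are illustrative, not prescribed. It computes the forward pass of equations (3) and (4) and the finite-difference estimate of part 4(b).

```r
# Forward pass through the 2-layer network, taking x1, W1, W2 as inputs.
sigmoid <- function(y) 1 / (1 + exp(-y))

forward <- function(x1, W1, W2) {
  y1 <- as.vector(W1 %*% c(1, x1))  # net inputs to layer 1, equation (3)
  x2 <- sigmoid(y1)                 # layer-1 outputs, equation (4)
  y2 <- sum(W2 * c(1, x2))          # net input of the single layer-2 neuron
  x3 <- sigmoid(y2)                 # network output
  loss <- -log(x3)                  # logistic loss L(1, x3) for target y = 1
  list(y1 = y1, x2 = x2, y2 = y2, x3 = x3, loss = loss)
}

W1 <- matrix(c(-0.7, -1.0, 1.0,
               -0.9, -0.6, 1.1), nrow = 2, byrow = TRUE)  # equation (1)
W2 <- c(-0.3, 1.0, 1.5)                                   # equation (2)
out <- forward(c(2, 3), W1, W2)
round(out$loss, 4)                  # should agree with the hint: 0.2151

# Finite-difference estimate of the partial derivative in part 4(b):
eps <- 0.01
W2b <- W2
W2b[2] <- W2b[2] + eps              # W2_11 is the second entry of W2
(forward(c(2, 3), W1, W2b)$loss - out$loss) / eps
```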
(e) The next three parts are linked. First, compute $\partial\ell/\partial x^2$. Note that this requires just a small change to part 4(c), especially since $W^2$ and $x^2$ play nearly symmetric roles.

(f) Now compute $\partial x^2/\partial y^1 = \sigma'(y^1)$. Match this up appropriately with the previous result and use the chain rule to obtain $\partial\ell/\partial y^1$.

(g) Finally, from $y^1 = W^1 \cdot (1, x^1_1, x^1_2)$, note that $\partial y^1/\partial W^1 = (1, x^1_1, x^1_2)$, meaning that

$$\begin{pmatrix} \partial y^1_1/\partial W^1_{10} & \partial y^1_1/\partial W^1_{11} & \partial y^1_1/\partial W^1_{12} \\ \partial y^1_2/\partial W^1_{20} & \partial y^1_2/\partial W^1_{21} & \partial y^1_2/\partial W^1_{22} \end{pmatrix} = \begin{pmatrix} 1 & x^1_1 & x^1_2 \\ 1 & x^1_1 & x^1_2 \end{pmatrix}. \tag{5}$$

Using the chain rule, match up the results of 4(e), 4(f) and this to compute $\partial\ell/\partial W^1$. It is convenient to organise $\partial\ell/\partial W^1$ as a $2 \times 3$ matrix as in (5). Hint: think for example of small changes in $W^1_{11}$ and $W^1_{21}$ causing small changes to $y^1_1$ and $y^1_2$ respectively, whose effects on $\ell$ you computed in the first paragraphs of this part.

(h) Similar to what you did in part 4(b), add $\varepsilon = 0.001$ to $W^1_{11}$ and compare the resulting loss to the result in part 4(a) to estimate $\partial\ell/\partial W^1_{11}$. Confirm that this roughly matches the corresponding value in part 4(g). (4-digit accuracy is insufficient, but it should agree within 10%.)

(i) Note that parts 4(g) and 4(c) together give the gradient $\nabla\ell$ with respect to the network's 9 weights. Set $\lambda = 0.1$ (normally it would be smaller) and change the weights by $\lambda \cdot \nabla\ell$: say whether this should be added or subtracted to decrease $\ell$. Recompute $\ell$ and state its new value.

5. [25 marks] We use spectral clustering on a dataset with just 6 points, arranged in two tightly packed equilateral triangles with a large gap between them. The similarities are $s_{12} = s_{23} = s_{31} = 0.8$ and $s_{45} = s_{56} = s_{64} = 0.7$, with $s_{ij} = 0$ if $i \in \{1,2,3\}$ and $j \in \{4,5,6\}$. (A numerical check in R is sketched after this question.)

(a) Write down the normalised Laplacian matrix $L$.

(b) Verify that $v_0 = (1,1,1,1,1,1)$ is an eigenvector of $L$ with eigenvalue 0.

(c) The eigenvalues of $L$ are 0 with multiplicity 2, and 1.5 with multiplicity 4. One of the corresponding eigenvectors is $v_0$ as above. Find 5 other eigenvectors, and normalise them to have length 1. (Hint: you can take advantage of the block structure of the matrix to find the eigenvectors.)

(d) Give the representation of the original dataset using the eigenvectors from part 5(c) (not including $v_0$), in an $r$-dimensional space, for $r = 1$ and also for $r = 2$. (See the lecture notes for a reminder of the representation.) For $r = 2$, you can pick any of the eigenvectors corresponding to 1.5. (Normally we would pick all or none of the eigenvectors for the same eigenvalue; for the purpose of this exercise, we just pick one out of these 4.)

(e) What does K-means give for K = 2 in the two cases? For both $r = 1$ and $r = 2$, answer the following question: is it true that you will always obtain the two natural clusters, for any choice of the initial centres?
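As a quick numerical check for question 5 (not part of the sheet), the sketch below assumes the symmetric normalised Laplacian $L = I - D^{-1/2} S D^{-1/2}$; if the lecture notes use another form (such as the random-walk Laplacian $I - D^{-1}S$), adjust the line defining L accordingly.

```r
# Similarity matrix and normalised Laplacian for the 6-point example.
S <- matrix(0, 6, 6)
S[1, 2] <- S[2, 3] <- S[3, 1] <- 0.8  # similarities within the first triangle
S[4, 5] <- S[5, 6] <- S[6, 4] <- 0.7  # similarities within the second triangle
S <- S + t(S)                         # symmetrise
d <- rowSums(S)                       # degrees: 1.6, 1.6, 1.6, 1.4, 1.4, 1.4
L <- diag(6) - diag(1 / sqrt(d)) %*% S %*% diag(1 / sqrt(d))
eig <- eigen(L, symmetric = TRUE)
round(eig$values, 10)                 # 1.5 with multiplicity 4, 0 with multiplicity 2
round(L %*% rep(1, 6), 10)            # part (b): should be the zero vector
```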
6. [10 marks] Spectral clustering uses K-means on a different representation of the point set, via eigenvectors obtained from the normalised Laplacian matrix. Instead, we could try using the principal components obtained by PCA and run K-means clustering on them; let us call this method PCA clustering. Consider the example in the picture [not reproduced: three concentric rings of points], and let K = 3. We have seen in the lecture that spectral clustering would successfully identify the three clusters corresponding to the three rings. Would PCA clustering also succeed in identifying these rings? Justify your answer. (Hint: what are the two principal component directions? What would the points look like in that coordinate system?)

7. [15 marks]

(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. The three classes should be clearly separated from one another. (Hint: there are many ways to obtain such a data set. For example, you can generate the variables randomly, and then add different shifts to certain variables for the different classes; one possible simulation is sketched after this question.)

(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different colour to indicate the observations in each of the three classes. If the three classes appear separated in this plot, continue on to part (c). If not, return to part (a) and modify the simulation so that there is greater separation between the three classes.

(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtain from K-means clustering match the true class labels?

(d) Now perform PCA clustering: K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the $60 \times 2$ matrix whose first column is the first principal component score vector, and whose second column is the second principal component score vector. How do the clusters compare to the true class labels?
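For concreteness, here is one possible simulation along the lines of the hint in part 7(a), together with the plot of part 7(b); the seed, shift size, and choice of shifted variables are arbitrary and may need tuning before the classes separate clearly.

```r
# Simulated data: 3 classes of 20 observations each, 50 variables.
set.seed(429)                               # arbitrary seed, for reproducibility
X <- matrix(rnorm(60 * 50), nrow = 60)      # 60 observations, 50 variables
X[1:20, 1:10]  <- X[1:20, 1:10]  + 5        # shift some variables up for class 1
X[21:40, 1:10] <- X[21:40, 1:10] - 5        # shift them down for class 2
labels <- rep(1:3, each = 20)               # class 3 is left unshifted

pca <- prcomp(X)                            # part (b): first two score vectors
plot(pca$x[, 1:2], col = labels, pch = 19)  # classes should appear separated
```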