MA429 Algorithmic Techniques for Data Mining
Exercises 9 (summative homework)
Gregory Sorkin (version: 2020.03.24 10:37) © LSE 2020

Submission

Submit the summative solutions by 11.59pm on Thursday 2nd April (London time) to the Moodle submission link. This homework is worth 10% of the final course mark. (This is increased from the 5% planned, in order to decrease the weight on the final exam due to coronavirus disruption.)

Submit a single pdf that contains the solutions to all questions. (This can include scanned versions of handwritten arguments, although typesetting is preferred.) Submit your R code for the programming question as an R script or Rmd file; if Rmd, please include the report that includes the code's output. Your files should be named with your candidate exam number, which should also appear within the files to be sure. Do not show your name or registration number anywhere: these submissions are anonymous.

Details of whether you need to submit a plagiarism/receipt form, and how, are under discussion within Mathematics and will follow in due course.

You are to do this homework individually. You are welcome to use the course materials provided and recommended, including all textbooks mentioned. You may also use generic online reference sources such as Wikipedia pages, but (though I doubt that it would help anyway) you must not search specifically for solutions to these problems.

1. The figure below shows some data points in two dimensions. [Figure not reproduced.]

(a) What are the two principal component (PC) directions?

(b) If the coordinates of the red point are (3, −1), roughly what would be its coordinates in the principal component space?

2. You wish to cluster the points shown below into two clusters, hoping for the clusters indicated here by colour. [Figure not reproduced.] We compare K-means with K = 2, DBSCAN, and spectral clustering. What clusters would you expect each to produce? Which method(s) would find the right clusters, as colour-coded here?

3. You begin with standardised data with $n$ data points in $p$ dimensions. (A short numerical illustration in R follows this question.)

(a) What is the sum of squares of all $n \times p$ scores (the standardised values), and why?

(b) You perform PCA on this data, getting $p$ score vectors, each of dimension $n$. What do they represent?

(c) What is the sum of squares of all $pn$ entries of these PCA score vectors, and why?

(d) Is the variance of the entries of the first score vector more or less than 1, and why?

(e) What is the interpretation of the variance of the first score vector divided by $p$?

(f) The biplot below is for the iris dataset, with 4 features. Explain the meaning of the "42" near the top left corner, near coordinates (−2, 2.3).

[Biplot not reproduced: the 150 iris observations are plotted by number against PC1 (horizontal) and PC2 (vertical) on scales of roughly −3 to 3 (bottom and left axes); the loading vectors for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are drawn against scales of −1 to 1 (top and right axes).]

(g) The picture also indicates the loading vectors. (These are the coefficients $w_{ij}$ in the linear combinations in the lecture notes.) Interpret the Sepal.Width loading vector, including its coordinates; the appropriate scales are shown on the top and the right-hand side of the box.
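The following minimal R sketch (not part of the original sheet) illustrates parts 3(a) to 3(e) on the iris data. One caveat: R's scale() and var() use the divisor $n-1$, so the total sum of squares comes out as $(n-1)p$ rather than $np$; under a divisor-$n$ convention it would be exactly $np$.

```r
# Illustration of 3(a)-(e): sums of squares before and after PCA.
X <- scale(iris[, 1:4])   # standardise: n = 150 points, p = 4 dimensions
n <- nrow(X); p <- ncol(X)
sum(X^2)                  # total sum of squares: (n - 1) * p = 596
pca <- prcomp(X)          # PCA; pca$x holds the p score vectors
sum(pca$x^2)              # same total: the orthogonal rotation of PCA
                          # preserves sums of squares
var(pca$x[, 1])           # variance of the first score vector (> 1)
var(pca$x[, 1]) / p       # proportion of total variance explained by PC1
```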
Homework exercises

Again, this is a summative homework set. See instructions at the top of the exercises. The marks indicated sum to 100%, but will be scaled so that the assignment counts as 10% of the final course mark.

4. [50 marks] This exercise is long but self-guiding. Students had good success with, and seemed to learn from, a similar (but different and longer) past exercise. In this exercise you are to do one step of back-propagation on the neural network described, which has one hidden layer. Exercise set 8 problem 2 may be helpful even though the setup is not exactly the same.

In short, we have a 2-layer network with 2 inputs, 2 neurons in the first layer, 1 neuron in the second layer, and bias inputs in both layers. Both layers have sigmoid activation functions, $\sigma(y) = 1/(1+e^{-y})$, and at the end we compute the logistic loss, $L(y, \hat{y}) = -y \ln(\hat{y}) - (1-y)\ln(1-\hat{y})$. We will take the input vector $x^1 = (x^1_1, x^1_2) = (2, 3)$ with desired output $y = 1$. All vectors written as above are to be considered column vectors, and all superscripts are layer indices, not powers. See Figure 1. The weight matrices are

$$W^1 = \begin{pmatrix} W^1_{10} & W^1_{11} & W^1_{12} \\ W^1_{20} & W^1_{21} & W^1_{22} \end{pmatrix} = \begin{pmatrix} -0.7 & -1 & 1 \\ -0.9 & -0.6 & 1.1 \end{pmatrix} \tag{1}$$

$$W^2 = \begin{pmatrix} W^2_{10} & W^2_{11} & W^2_{12} \end{pmatrix} = \begin{pmatrix} -0.3 & 1 & 1.5 \end{pmatrix}. \tag{2}$$

For both matrices we index the first column by 0 to indicate that it acts on the bias input (identically 1). For convenience we define $\sigma$ on a vector, meaning we apply it to each element. Then the net inputs to the first layer of neurons are the matrix product

$$y^1 = (y^1_1, y^1_2) = W^1 \cdot (1, x^1_1, x^1_2) \tag{3}$$

and their outputs are the vector

$$x^2 = (x^2_1, x^2_2) = \sigma(y^1) = (\sigma(y^1_1), \sigma(y^1_2)). \tag{4}$$

Figure 1: Sketch of the network described. To avoid clutter, only 2 of the 6 weights in $W^1$ are labeled. [Figure not reproduced.]

Similarly, the sole neuron at the second layer has net input $y^2 = W^2 \cdot (1, x^2_1, x^2_2)$ and output $x^3 = \sigma(y^2)$, leading to loss $\ell = L(1, x^3)$.

(a) Write to 4-digit accuracy (x.xxxx) the values of $y^1$, $x^2$, $y^2$, $x^3$, $\ell$. (Suggestion: you will need to repeat this several times with slightly different weights, in parts 4(b), 4(h), and 4(i). So, if you are comfortable with any programming language, write a small program to do this, taking $x^1$, $W^1$ and $W^2$ as inputs; one possible version in R is sketched after part 4(d).) (Hint: you should find $\ell = 0.2151$.)

(b) Add $\varepsilon = 0.01$ to $W^2_{11}$, compute the new loss $\ell$, and compare the new and old losses (take the difference and divide by $\varepsilon$) to estimate $\partial\ell/\partial W^2_{11}$. (Hint: in this case you don't need to recompute too much.)

(c) The derivative of the logistic function is $\sigma'(y) = e^{-y}/(1+e^{-y})^2$, and that of the logistic loss (for target class 1, as here) is $\frac{d}{d\hat{y}} L(1, \hat{y}) = -1/\hat{y}$. Find $\partial\ell/\partial x^3$, $\partial x^3/\partial y^2$, and $\partial y^2/\partial W^2$ to compute $\partial\ell/\partial W^2$. Note that $\partial y^2/\partial W^2$ is a 3-vector, $(\partial y^2/\partial W^2_{10}, \partial y^2/\partial W^2_{11}, \partial y^2/\partial W^2_{12})$, and similarly for $\partial\ell/\partial W^2$. Hint: calculus with matrices works (there's a Wikipedia page), if you are careful about notation and what things mean. Here, from $y^2 = W^2 \cdot (1, x^2_1, x^2_2)$, it is true that $\partial y^2/\partial W^2 = (\partial y^2/\partial W^2_{10}, \partial y^2/\partial W^2_{11}, \partial y^2/\partial W^2_{12}) = (1, x^2_1, x^2_2)$.

(d) Just to check your work, compare your results in parts 4(b) and 4(c) to be sure they match. (Remember that the derivative applies to an infinitesimal change, so the result for that and for $\varepsilon = 0.01$ won't be identical.)
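Below is one possible version of the small program suggested in part 4(a), written in R (the course language); the function and variable names are illustrative, not prescribed. It computes the forward pass of equations (3) and (4) and the finite-difference estimate of part 4(b).

```r
# Forward pass through the 2-layer network, taking x1, W1, W2 as inputs.
sigmoid <- function(y) 1 / (1 + exp(-y))

forward <- function(x1, W1, W2) {
  y1 <- as.vector(W1 %*% c(1, x1))  # net inputs to layer 1, equation (3)
  x2 <- sigmoid(y1)                 # layer-1 outputs, equation (4)
  y2 <- sum(W2 * c(1, x2))          # net input of the single layer-2 neuron
  x3 <- sigmoid(y2)                 # network output
  loss <- -log(x3)                  # logistic loss L(1, x3) for target y = 1
  list(y1 = y1, x2 = x2, y2 = y2, x3 = x3, loss = loss)
}

W1 <- matrix(c(-0.7, -1.0, 1.0,
               -0.9, -0.6, 1.1), nrow = 2, byrow = TRUE)  # equation (1)
W2 <- c(-0.3, 1.0, 1.5)                                   # equation (2)
out <- forward(c(2, 3), W1, W2)
round(out$loss, 4)                  # should agree with the hint: 0.2151

# Finite-difference estimate of the partial derivative in part 4(b):
eps <- 0.01
W2b <- W2
W2b[2] <- W2b[2] + eps              # W2_11 is the second entry of W2
(forward(c(2, 3), W1, W2b)$loss - out$loss) / eps
```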
(e) The next three parts are linked. First, compute $\partial\ell/\partial x^2$. Note that this requires just a small change to part 4(c), especially since $W^2$ and $x^2$ play nearly symmetric roles.

(f) Now compute $\partial x^2/\partial y^1 = \sigma'(y^1)$. Match this up appropriately with the previous result and use the chain rule to obtain $\partial\ell/\partial y^1$.

(g) Finally, from $y^1 = W^1 \cdot (1, x^1_1, x^1_2)$, note that $\partial y^1/\partial W^1 = (1, x^1_1, x^1_2)$, meaning that

$$\begin{pmatrix} \partial y^1_1/\partial W^1_{10} & \partial y^1_1/\partial W^1_{11} & \partial y^1_1/\partial W^1_{12} \\ \partial y^1_2/\partial W^1_{20} & \partial y^1_2/\partial W^1_{21} & \partial y^1_2/\partial W^1_{22} \end{pmatrix} = \begin{pmatrix} 1 & x^1_1 & x^1_2 \\ 1 & x^1_1 & x^1_2 \end{pmatrix}. \tag{5}$$

Using the chain rule, match up the results of 4(e), 4(f) and this to compute $\partial\ell/\partial W^1$. It is convenient to organise $\partial\ell/\partial W^1$ as a $2 \times 3$ matrix as in (5). Hint: think for example of small changes in $W^1_{11}$ and $W^1_{21}$ causing small changes to $y^1_1$ and $y^1_2$ respectively, whose effects on $\ell$ you computed in the first paragraphs of this part.

(h) Similar to what you did in part 4(b), add $\varepsilon = 0.001$ to $W^1_{11}$ and compare the resulting loss to the result in part 4(a) to estimate $\partial\ell/\partial W^1_{11}$. Confirm that this roughly matches the corresponding value in part 4(g). (4-digit accuracy is insufficient, but it should agree within 10%.)

(i) Note that parts 4(g) and 4(c) together give the gradient $\nabla\ell$ with respect to the network's 9 weights. Set $\lambda = 0.1$ (normally it would be smaller) and change the weights by $\lambda \cdot \nabla\ell$: say whether this should be added or subtracted to decrease $\ell$. Recompute $\ell$ and state its new value.

5. [25 marks] We use spectral clustering on a dataset with just 6 points, arranged in two tightly packed equilateral triangles with a large gap between them. The similarities are $s_{12} = s_{23} = s_{31} = 0.8$ and $s_{45} = s_{56} = s_{64} = 0.7$, with $s_{ij} = 0$ if $i \in \{1,2,3\}$ and $j \in \{4,5,6\}$. (A numerical check in R is sketched after this question.)

(a) Write down the normalised Laplacian matrix $L$.

(b) Verify that $v_0 = (1,1,1,1,1,1)$ is an eigenvector of $L$ with eigenvalue 0.

(c) The eigenvalues of $L$ are 0 with multiplicity 2, and 1.5 with multiplicity 4. One of the corresponding eigenvectors is $v_0$ as above. Find 5 other eigenvectors, and normalise them to have length 1. (Hint: you can take advantage of the block structure of the matrix to find the eigenvectors.)

(d) Give the representation of the original dataset using the eigenvectors from part 5(c) (not including $v_0$), in an $r$-dimensional space, for $r = 1$ and also for $r = 2$. (See the lecture notes for a reminder of the representation.) For $r = 2$, you can pick any of the eigenvectors corresponding to 1.5. (Normally we would pick all or none of the eigenvectors for the same eigenvalue; for the purpose of this exercise, we just pick one out of these 4.)

(e) What does K-means give for K = 2 in the two cases? For both $r = 1$ and $r = 2$, answer the following question: is it true that you will always obtain the two natural clusters, for any choice of the initial centres?
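As a quick numerical check for question 5 (not part of the sheet), the sketch below assumes the symmetric normalised Laplacian $L = I - D^{-1/2} S D^{-1/2}$; if the lecture notes use another form (such as the random-walk Laplacian $I - D^{-1}S$), adjust the line defining L accordingly.

```r
# Similarity matrix and normalised Laplacian for the 6-point example.
S <- matrix(0, 6, 6)
S[1, 2] <- S[2, 3] <- S[3, 1] <- 0.8  # similarities within the first triangle
S[4, 5] <- S[5, 6] <- S[6, 4] <- 0.7  # similarities within the second triangle
S <- S + t(S)                         # symmetrise
d <- rowSums(S)                       # degrees: 1.6, 1.6, 1.6, 1.4, 1.4, 1.4
L <- diag(6) - diag(1 / sqrt(d)) %*% S %*% diag(1 / sqrt(d))
eig <- eigen(L, symmetric = TRUE)
round(eig$values, 10)                 # 1.5 with multiplicity 4, 0 with multiplicity 2
round(L %*% rep(1, 6), 10)            # part (b): should be the zero vector
```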
6. [10 marks] Spectral clustering uses K-means on a different representation of the point set, via eigenvectors obtained from the normalised Laplacian matrix. Instead, we could try using the principal components obtained by PCA and run K-means clustering on them; let us call this method PCA clustering. Consider the example in the picture [not reproduced: three concentric rings of points], and let K = 3. We have seen in the lecture that spectral clustering would successfully identify the three clusters corresponding to the three rings. Would PCA clustering also succeed in identifying these rings? Justify your answer. (Hint: what are the two principal component directions? What would the points look like in that coordinate system?)

7. [15 marks]

(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. The three classes should be clearly separated from one another. (Hint: there are many ways to obtain such a data set. For example, you can generate the variables randomly, and then add different shifts to certain variables for the different classes; one possible simulation is sketched after this question.)

(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different colour to indicate the observations in each of the three classes. If the three classes appear separated in this plot, continue on to part (c). If not, return to part (a) and modify the simulation so that there is greater separation between the three classes.

(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtain from K-means clustering match the true class labels?

(d) Now perform PCA clustering: K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the $60 \times 2$ matrix whose first column is the first principal component score vector, and whose second column is the second principal component score vector. How do the clusters compare to the true class labels?
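For concreteness, here is one possible simulation along the lines of the hint in part 7(a), together with the plot of part 7(b); the seed, shift size, and choice of shifted variables are arbitrary and may need tuning before the classes separate clearly.

```r
# Simulated data: 3 classes of 20 observations each, 50 variables.
set.seed(429)                               # arbitrary seed, for reproducibility
X <- matrix(rnorm(60 * 50), nrow = 60)      # 60 observations, 50 variables
X[1:20, 1:10]  <- X[1:20, 1:10]  + 5        # shift some variables up for class 1
X[21:40, 1:10] <- X[21:40, 1:10] - 5        # shift them down for class 2
labels <- rep(1:3, each = 20)               # class 3 is left unshifted

pca <- prcomp(X)                            # part (b): first two score vectors
plot(pca$x[, 1:2], col = labels, pch = 19)  # classes should appear separated
```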