EE 240: Pattern Recognition and Machine Learning
Homework 4
Due date: May 30, 2021

Description: K-means clustering, principal component analysis.

Reading assignment and references: Instructor notes; AML Ch. 6, Appendix C; ESL Ch. 13 & 14.

Homework and lab assignment submission policy: All homework and lab assignments must be submitted online via https://iLearn.ucr.edu. Homework solutions should be written and submitted individually, but discussions among students are encouraged. All assignments should be submitted by the due date. You are allowed a total of 5 "late days" over the whole quarter. If you submit an assignment late after using all of your late days, you will not receive any credit.

H4.1 K-means clustering: In this exercise we will perform color-based segmentation using the K-means algorithm.

(a) Implement the K-means algorithm in Python so that it accepts the target number of clusters (K) and a color image as input parameters. Treat each color pixel as a 3-dimensional feature vector x_i. (5 pts)

A general K-means algorithm can be described as follows. Suppose we are given training examples x_1, x_2, ..., x_N, where each x_n \in R^d. We want to group the N data samples into K clusters.

  i. Initialize cluster centers \mu_1, ..., \mu_K \in R^d at random.
  ii. Repeat until convergence {
        For every data point x_i, update its label as
            l_i = \arg\min_j \|x_i - \mu_j\|_2^2.    (1)
        For each cluster j, update its center \mu_j as the mean of all points assigned to cluster j:
            \mu_j = \sum_{i=1}^N \delta\{l_i = j\} x_i / \sum_{i=1}^N \delta\{l_i = j\}.
      }

(b) Take a selfie of yourself against a background whose colors differ from your skin and clothing. Use the K-means script from the previous step to segment your image into K clusters. To create a segmented output image, replace every pixel in your image with the center of the cluster assigned to it. Report your results for K = {2, 4, 8, 16} clusters. (10 pts)

(c) Repeat steps (a) and (b) with absolute distance instead of squared Euclidean distance.
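The update steps above can be sketched in NumPy as follows. This is a minimal illustration, not a required interface; the function name `kmeans` and its parameters are illustrative. For part (b), you would reshape an H x W x 3 image to an (H*W, 3) array, run this routine, and replace each pixel with its assigned cluster center.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means on the rows of X (an N x d array), squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # i. Initialize centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(n_iters):
        # Label update: l_i = argmin_j ||x_i - mu_j||_2^2
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Center update: mean of the points assigned to each cluster
        # (empty clusters keep their previous center).
        for j in range(K):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers
```

For part (c), the label-update line would use `np.abs(...).sum(axis=2)` in place of the squared differences.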
That is, implement a new script that replaces the minimum Euclidean distance in (1) with the minimum absolute distance:
    l_i = \arg\min_j \|x_i - \mu_j\|_1.
Report your results for selfie segmentation using the new distance. (10 pts)

(Here \|u\|_1 = \sum_{j=1}^d |u(j)| denotes the \ell_1 norm of the vector u, defined as the sum of the absolute values of all entries of u.)

H4.2 Principal component analysis (PCA): In this problem we will consider two tasks. First, we will explore the efficiency of PCA as a tool for dimensionality reduction and compression. Then, we will use PCA to construct a rudimentary face recognition algorithm.

Download the ATT Face dataset from Piazza. The ATT Face dataset contains face images of 40 individuals; for each individual, there are 10 images taken under different poses. Divide your data into two sets: select 60% of the images for training and the remaining 40% for testing. You can read about eigenfaces at this link: http://www.scholarpedia.org/article/Eigenfaces. You are allowed to use the PCA module in sklearn:

    from sklearn.decomposition import PCA

(a) Perform PCA on the training images viewed as points in a high-dimensional space (using their pixel values). Plot a curve displaying the amount of "energy" captured by the first k principal components, where energy is the cumulative sum of the top-k component variances divided by the sum of all the variances. How many components do we need in order to capture 50% of the energy? How much of the energy is captured with k = 25? (10 pts)

(b) Visualize the top 25 eigenfaces (the eigenvectors obtained from PCA) discovered above. Order them by the magnitudes of their corresponding eigenvalues and plot them in a single figure. (5 pts)

(c) Let us now try to recognize the identity of a person's face in a previously unseen image. Load an image from the test set, subtract from it the mean of the training images, and project it onto the previously computed top-25 principal components.
Then, use a nearest neighbor search to find its closest image in the training set. If the nearest neighbor found depicts the face of the same person as the unseen image, count this as a successful identification. Repeat this experiment for all the test images and report the mean accuracy on the entire test set. Comment on the test images that are misidentified. (10 pts)

Maximum points: 50
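The H4.2 pipeline of parts (a) and (c) can be sketched as follows, assuming the images have already been loaded and flattened into pixel-row arrays `X_train` and `X_test` with integer identity labels `y_train` and `y_test`; the function names and the array names are illustrative, not part of the assignment statement.

```python
import numpy as np
from sklearn.decomposition import PCA

def energy_curve(X_train):
    """Part (a): fraction of total variance ('energy') captured by the first k components."""
    pca = PCA().fit(X_train)
    return np.cumsum(pca.explained_variance_ratio_)

def eigenface_recognition(X_train, y_train, X_test, y_test, k=25):
    """Part (c): 1-nearest-neighbor identification in a k-dimensional PCA subspace."""
    pca = PCA(n_components=k).fit(X_train)
    Z_train = pca.transform(X_train)  # sklearn's transform subtracts the training mean
    Z_test = pca.transform(X_test)    # before projecting onto the top-k components
    correct = 0
    for z, label in zip(Z_test, y_test):
        d = ((Z_train - z) ** 2).sum(axis=1)  # squared distances to all training points
        nn = d.argmin()                       # index of the nearest training image
        correct += int(y_train[nn] == label)
    return correct / len(y_test)
```

The smallest k with `energy_curve(X_train)[k - 1] >= 0.5` answers the 50%-energy question in part (a).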