
CS 383 - Machine Learning
Assignment 3 - Dimensionality Reduction
Introduction
In this assignment you'll work on visualizing data, reducing its dimensionality, and clustering it.
You may not use any functions from a machine learning library in your code; however, you may use
statistical functions. For example, you may NOT use functions like
• pca
• k-nearest neighbors functions
unless explicitly told to do so. But you MAY use basic statistical functions like:
• std
• mean
• cov
• eig
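For instance, in numpy (assumed here to count as a statistical library) the allowed building blocks look like the following minimal sketch:

```python
import numpy as np

# A tiny 3x2 data matrix: three samples, two features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

mu = X.mean(axis=0)              # per-feature mean
sigma = X.std(axis=0, ddof=1)    # per-feature sample standard deviation
C = np.cov(X, rowvar=False)      # 2x2 covariance matrix (rows are samples)
evals, evecs = np.linalg.eig(C)  # eigenvalues and eigenvectors of C
```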
| Part | Points |
|---|---|
| Part 1 (Theory) | 10 |
| Part 2 (PCA) | 40 |
| Part 3 (Eigenfaces) | 40 |
| Report | 10 |
| TOTAL | 100 |

Table 1: Grading Rubric
Datasets
Labeled Faces in the Wild Dataset. This dataset consists of images of celebrities downloaded from the
Internet in the early 2000s. We use the grayscale version from sklearn.datasets.
We will download the images in a specific way, as shown below. You will have 3,023 images, each
87x65 pixels, belonging to 62 different people.
```python
from sklearn.datasets import fetch_lfw_people
import matplotlib.pyplot as plt
import matplotlib.cm as cm

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape

fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image, cmap=cm.gray)
    ax.set_title(people.target_names[target])
```
1 Theory Questions
1. Consider the following data:

```
 -2   1
 -5  -4
 -3   1
  0   3
 -8  11
 -2   5
  1   0
  5  -1
 -1  -3
  6   1
```
(a) Find the principal components of the data (you must show the math, including how you
compute the eigenvectors and eigenvalues). Make sure you standardize the data first and
that your principal components are normalized to unit length. As for the amount of
detail needed in your work, imagine that you were working on paper with a basic calculator.
Show me whatever you would be writing on that paper. (7pts)
(b) Project the data onto the principal component corresponding to the largest eigenvalue
found in the previous part (3pts).
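If you want to sanity-check your hand computation afterwards, a short numpy sketch (the variable names are made up for this sketch; it does not replace the written work) is:

```python
import numpy as np

# The ten 2-D points from the question above
X = np.array([[-2, 1], [-5, -4], [-3, 1], [0, 3], [-8, 11],
              [-2, 5], [1, 0], [5, -1], [-1, -3], [6, 1]], dtype=float)

# (a) Standardize: subtract the mean, divide by the sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the covariance of the standardized data
C = np.cov(Z, rowvar=False)
evals, evecs = np.linalg.eigh(C)  # eigh: symmetric matrices, ascending order

# The principal component is the unit-length eigenvector
# belonging to the largest eigenvalue
pc1 = evecs[:, -1]

# (b) Project the standardized data onto that component
proj = Z @ pc1
```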
2 Dimensionality Reduction via PCA
Import the data as shown above. This is the Labeled Faces in the Wild dataset.
Verify that you have the correct number of images and classes:
```python
print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))
```

```
people.images.shape: (3023, 87, 65)
Number of classes: 62
```
This dataset is skewed toward George W. Bush and Colin Powell, as you can verify here:
```python
import numpy as np

# count how often each target appears
counts = np.bincount(people.target)
# print counts next to target names
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i + 1) % 3 == 0:
        print()
```
To make the data less skewed, we will only take up to 50 images of each person (otherwise, the
feature extraction would be overwhelmed by the likelihood of George W. Bush):
```python
# keep at most 50 images per person (note: np.bool is deprecated; use bool)
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# scale the grayscale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability
X_people = X_people / 255.
```
We are now going to compute how well a KNN classifier does using the raw pixels alone.
```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

# build a KNeighborsClassifier using one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score of 1-nn: {:.2f}".format(knn.score(X_test, y_test)))
```
You should get an accuracy of around 23% to 27%.
Once you have your setup complete, write a script to do the following:
1. Write your own version of KNN (k=1) that uses the SSD (sum of squared differences) to
compute similarity.
2. Verify that your KNN has a similar accuracy to sklearn's version.
3. Standardize your data (subtract the mean, divide by the standard deviation).
4. Reduce the data to 100D using PCA.
5. Compute the KNN again with k=1 on the 100D data. Report the accuracy.
6. Compute the KNN again with k=1 on the 100D whitened data. Report the accuracy.
7. Reduce the data to 2D using PCA.
8. Graph the data for visualization.
Recall that although you may not use any package ML functions like pca, you may use statistical
functions like eig or svd.
Your graph should end up looking similar to Figure 1 (although it may be rotated differently,
depending on how you ordered things).
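As a rough illustration of steps 1 through 6, here is a sketch using random stand-in data in place of the face matrices (the helper names pca_fit and knn1_ssd are invented for this sketch; your script should operate on X_train / X_test instead):

```python
import numpy as np

def pca_fit(Z, k):
    """Top-k eigenvalues/eigenvectors of the covariance matrix, largest first."""
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))  # ascending order
    order = np.argsort(evals)[::-1][:k]                     # largest first
    return evals[order], evecs[:, order]

def knn1_ssd(X_tr, y_tr, X_te):
    """1-NN classifier using the sum of squared differences as the distance."""
    preds = np.empty(len(X_te), dtype=y_tr.dtype)
    for i, x in enumerate(X_te):
        ssd = ((X_tr - x) ** 2).sum(axis=1)  # SSD to every training sample
        preds[i] = y_tr[np.argmin(ssd)]      # label of the nearest one
    return preds

# Random stand-ins for the face data (use 100D on the real 5655-D pixels)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))
y_train = rng.integers(0, 5, size=200)
X_test = rng.normal(size=(40, 50))

# Standardize both sets with the *training* mean and standard deviation
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mu) / sd, (X_test - mu) / sd

# Project to 10-D here (100-D in the assignment); whitening divides each
# projected coordinate by the square root of its eigenvalue
evals, W = pca_fit(Z_train, 10)
P_train, P_test = Z_train @ W, Z_test @ W
P_white = P_train / np.sqrt(evals)

preds = knn1_ssd(P_train, y_train, P_test)
```

After whitening, every retained component has unit variance, so no single direction dominates the SSD.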
Figure 1: 2D PCA Projection of data
3 Eigenfaces
Import the data as shown above. This is the Labeled Faces in the Wild dataset.
Use the X_train data from above. Let's analyze the first and second principal components.
Write a script that:
1. Imports the data as mentioned above.
2. Standardizes the data.
3. Performs PCA on the data (again, although you may not use any package ML functions like
pca, you may use statistical functions like eig). No need to whiten here.
4. Finds the max and min image along PC1's axis, and the max and min along PC2's. Plot and report
the faces; what variation do these components capture?
5. Visualizes the most important principal component as an 87x65 image (see Figure 2).
6. Reconstructs the X_train[0,:] image using the primary principal component. To best see the full
reconstruction, "unstandardize" the reconstruction by multiplying it by the original standard
deviation and adding back the original mean.
7. Determines the number of principal components, k, necessary to encode at least 95% of the
information.
8. Reconstructs the X_train[0,:] image using the k most significant eigenvectors (found in the
previous step, see Figure 4). For the fun of it, maybe even look to see if you can perfectly
reconstruct the face if you use all the eigenvectors! Again, to best see the full reconstruction,
"unstandardize" the reconstruction by multiplying it by the original standard deviation and
adding back the original mean.
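The 95% bookkeeping in steps 7 and 8 can be sketched like this (again on random stand-in data; on the real faces, substitute your standardized X_train):

```python
import numpy as np

# Random stand-in for the flattened, standardized face matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))
mu, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sd

# Eigendecomposition of the covariance, sorted largest first
evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
evals, evecs = evals[::-1], evecs[:, ::-1]

# Smallest k whose eigenvalues account for at least 95% of the total variance
ratio = np.cumsum(evals) / evals.sum()
k = int(np.searchsorted(ratio, 0.95)) + 1

# Reconstruct the first sample from the top-k components, then
# "unstandardize" back to the original scale
W = evecs[:, :k]
x0_hat = ((Z[0] @ W) @ W.T) * sd + mu
```

Using all the eigenvectors makes W orthogonal, so the reconstruction of the standardized image is exact.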
Your principal eigenface should end up looking similar to Figure 2.
Figure 2: Primary Principal Component
Your principal reconstruction should end up looking similar to Figure 3.
Figure 3: Reconstruction of first person
Your 95% reconstruction should end up looking similar to Figure 4.
Figure 4: Reconstruction of first person
Submission
For your submission, upload to Blackboard a single zip file containing:
1. A LaTeX typeset PDF containing:
(a) Part 1: Your answers to the theory questions.
(b) Part 2: The visualization of the PCA result, KNN accuracies
(c) Part 3:
i. Visualization of the primary principal component
ii. Number of principal components needed to represent 95% of the information, k.
iii. Visualization of the reconstruction of the first person using
A. The original image
B. A single principal component
C. k principal components.
(d) Source Code - python notebook
