Fall 2019 CS 475 Machine Learning: Homework 5
CS 475 Machine Learning: Homework 5
Dimensionality Reduction, Structured Prediction, and FATE
Due: Thursday December 5, 2019, 11:59pm
100 Points Total Version 1.0
Make sure to read from start to finish before beginning the assignment.
Introduction
This assignment will consist of three parts:
• (45 points) - A coding portion, where you will implement two dimensionality re-
duction techniques: PCA and LLE. You will also implement a K-Nearest Neighbors
classifier to operate over your dimensionality reduced data.
• (15 points) - A Jupyter notebook where you will explore bias in word embeddings,
and implement a debiasing technique.
• (40 points) - Written questions about HMMs, CRFs, RNNs and FATE.
1 Programming (45 points)
1.1 K-Nearest Neighbor Classifier
You will implement a K-Nearest Neighbors Classifier (KNN). This is a very simple su-
pervised model, which takes training data xi ∈ X , which is a finite set of d-dimensional
vectors. Each xi is paired with a label yi. “Training” a KNN is trivial: simply store the
data. All of the “work” is done at prediction time. To predict a label for an example x:
1. Find the k nearest examples K(x) = {x^(1), . . . , x^(k)} ⊆ X, where distance to x is
measured by Euclidean distance.
2. Infer the label ŷ as the most common label among y^(1), . . . , y^(k), where y^(i) is the
label for x^(i) ∈ K(x). In case of a tie, select the label with the lowest integer index.
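As a concrete sketch, the two prediction steps above might look like the following in numpy (the function name knn_predict is illustrative; your actual implementation belongs in the predict method in models.py):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict a label for x by majority vote among the k nearest
    training examples, breaking ties toward the lowest label."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every stored example
    nearest = np.argsort(dists, kind="stable")[:k]   # indices of the k closest examples
    votes = np.bincount(y_train[nearest])            # count each label among the neighbors
    return int(np.argmax(votes))                     # argmax returns the lowest label on ties
```

Note that np.argmax already implements the tie-breaking rule, since it returns the first (lowest) index among equal counts.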
We have implemented the fit function for you (it just saves the data in the model
object). Please implement the predict function for the KNN model in models.py.
You should be able to run this without specifying a --dr-algorithm (dimensionality
reduction algorithm) flag, which will just run a KNN model over your raw data. For example,
the command:
python3 main.py --knn-k 5 --test-split data/dev --predictions-file simple-knn-preds.txt
should output line-by-line predictions for the file dev into simple-knn-preds.txt, similar
to past homeworks. You could then evaluate your predictions by running:
python3 accuracy.py --labeled data/labels-mnist-dev.npy --predicted simple-knn-preds.txt
Since KNN uses distances for prediction, higher-dimensional data will be slower to
classify and will also suffer from the curse of dimensionality. We ask you to implement
KNN first, as it will serve as the classifier over your dimensionality-reduced data.
For example, you could run your KNN model on the regular 784-dimensional MNIST
data. But this might take a while! We’re going to attempt to fix that in the next two
sections.
1.2 Principal Component Analysis
We discussed PCA as a dimensionality reduction technique in class. PCA is one of many
different techniques, and the choice of method depends on the data and intended use of
the reduced dimension dataset.
PCA relies on eigendecomposition, which factorizes a matrix into a canonical form
consisting of two parts - eigenvalues and eigenvectors (see https://en.wikipedia.org/
wiki/Eigendecomposition_of_a_matrix). The eigenvectors determine the directions of
the new feature space, and the eigenvalues determine their magnitude.
First, preprocess the data matrix: standardize each feature in the dataset to unit scale,
defined as having a mean of zero and a variance of 1. Note that this is for each feature
index, not the example or the entire dataset. You do this by subtracting the mean from
all points and then dividing all points by the standard deviation. Since PCA yields a
feature space that maximizes variance along the axes, it’s important to ensure all features
are measured on the same scale. (If the standard deviation is 0, which can happen in
MNIST, you should not divide at all. This will leave every value of that feature equal
to 0, which it probably was already.)
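A minimal sketch of this preprocessing step, including the zero-standard-deviation case (the helper name standardize is an assumption, not part of the provided starter code):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to mean 0 and variance 1,
    skipping the division wherever the standard deviation is 0."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    Xc = X - mean
    nonzero = std > 0
    Xc[:, nonzero] /= std[nonzero]   # zero-variance features stay all-zero after centering
    return Xc
```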
Second, calculate the covariance matrix Σ, which is a D ×D matrix (where D is the
number of features). Each element of the matrix Σ represents the covariance between two
features. For a dataset of size N , each element in the matrix is defined as
Σ_st = (1 / (N − 1)) ∑_{i=1}^{N} (x_is − x̄_s)(x_it − x̄_t)    (1)
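Equation (1) translates almost directly into numpy; as a sanity check, the result should match np.cov(X, rowvar=False):

```python
import numpy as np

def covariance(X):
    """D x D covariance matrix of an N x D data matrix, following Equation (1)."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)      # subtract each feature's mean
    return Xc.T @ Xc / (N - 1)
```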
Third, we are now ready for PCA, which uses an eigensolver to perform eigendecom-
position on the covariance matrix (see Section 1.2.1 for implementation details). For some
square matrix A, it can be factorized as
A = QΛQ⁻¹    (2)
where Q are the eigenvectors and Λ are the eigenvalues.
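You can verify the factorization in Equation (2) numerically with scipy's eigensolver (a toy symmetric matrix is used here, since the covariance matrix is symmetric):

```python
import numpy as np
from scipy.linalg import eig, inv

# Check A = Q Lambda Q^{-1} (Equation 2) on a small symmetric matrix.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, Q = eig(A)                              # columns of Q are the eigenvectors
A_rebuilt = (Q @ np.diag(vals) @ inv(Q)).real # drop the zero imaginary parts
```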
Finally, select which eigenvectors to use in the lower-dimensional space: keep those
corresponding to the top d eigenvalues (sorted by absolute value). The eigenvalues
indicate how much of the variance in the overall data each eigenvector captures.
Stack these d eigenvectors into a D × d weight matrix W. We can then calculate our
reduced feature space Y,
Y = XW (3)
The result is a reduced dimension data matrix Y of size N × d, where N is the number
of data points.
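Putting the steps together, a condensed sketch of the whole PCA pipeline (pca_reduce is an illustrative name; in your submission this logic belongs in your PCA Model subclass, and the input is assumed to be standardized already):

```python
import numpy as np
from scipy.linalg import eig

def pca_reduce(X, d):
    """Project an N x D data matrix onto its top-d principal components."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)                 # center (X is assumed standardized)
    cov = Xc.T @ Xc / (N - 1)               # D x D covariance matrix, as in Equation (1)
    vals, vecs = eig(cov)                   # eigendecomposition, as in Equation (2)
    vals, vecs = vals.real, vecs.real       # covariance is symmetric, so both are real
    order = np.argsort(-np.abs(vals))[:d]   # top d eigenvalues by absolute value
    W = vecs[:, order]                      # D x d weight matrix
    return Xc @ W                           # N x d reduced data, as in Equation (3)
```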
Figure 1: From https://link.springer.com/article/10.1007%2Fs11692-018-9464-9, an explanation
of the linearity of PCA
1.2.1 Implementation Details
Most of PCA can be implemented using straightforward numpy operations. The most
challenging step is the eigendecomposition. We recommend using scipy’s eigensolver,
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eig.html.
Additionally, you may use any function in either numpy or scipy.
You should implement PCA as a subclass of Model. Its fit function will take in both
your training and test data and jointly reduce all of your data points together. fit should
then return the dimensionality-reduced data in the same order that it was passed in. This
is important because we will split the data back into training and test splits to perform
KNN. You do not need to implement the predict function for this model.
Here is an example of how your model should run after you’ve implemented PCA:
python3 main.py --knn-k 5 --dr-algorithm pca --target-dim 300 --test-split dev \
--predictions-file pca-knn-preds.txt
You should find this runs a lot faster than without dimensionality reduction!
1.3 Locally Linear Embedding (LLE)
While PCA strictly models linear variabilities in the data (see Figure 1), this strategy fails
when the data lie on a low-dimensional “manifold” (https://en.wikipedia.org/wiki/
Manifold), which can be a non-linear embedding of the data. Non-linear dimensionality
reduction methods attempt to find this manifold and produce a low-dimensional space
that encodes distances and relative position on the manifold.
Locally Linear Embedding (LLE) was introduced by Roweis and Saul in 2000. The
original paper was published in Science, but you can find a more introductory tutorial
here: https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf. The key insight of
LLE is that, given enough data points on a manifold, we can assume a point and its
Figure 2: An illustration of LLE from http://science.sciencemag.org/content/290/5500/
2323
nearest neighbors lie in a patch of the manifold that is approximately linear. Thus, LLE
aims to reduce dimensionality by finding linear weights that best reconstruct each point
from its neighbors, and then finding low-dimensional points for which these same weight
matrices apply. LLE is illustrated in Figure 2.
You should begin by preprocessing your data to have features normalized with mean
0 and variance 1 as in PCA.
Then the LLE algorithm can be broken down into 3 steps:
1. For each data point xi, determine the K nearest neighbors (where K is a hyperparameter).
2. For each data point xi, compute the weights wij that best reconstruct xi from its
neighbors (each neighbor is xj, and ∑_j is a sum over the neighbors of xi), using the
reconstruction error function

ε(W) = ∑_i |xi − ∑_j wij xj|²    (4)
3. Compute the lower-dimensional vectors yi that are best reconstructed by the wij
computed in step 2, using an eigensolver to find the bottom nonzero eigenvectors to
minimize the embedding cost function (same form as in step 2, but now with the wij fixed):

Φ(Y) = ∑_i |yi − ∑_j wij yj|²    (5)
We use X to refer to the input data matrix, which consists of N columns of data points
of dimensionality D. We will be returning a low-dimensional embedding matrix Y, which
is d × N, where d ≪ D. xi and yi refer to the ith columns of X and Y respectively, which
correspond to the ith data point. K is the number of neighbors we will consider. W is
the reconstruction weight matrix, and wij is the entry in the ith row and jth column.
We now present the pseudocode, in the same three high-level steps outlined above.1
We use for loops for simplicity of pseudocode, but we encourage you to consider numpy
speed-ups.
1. Find nearest neighbors
for i ∈ 1 . . . N do
    Keep track of the indices of the K nearest neighbors as an array n_i^(1), . . . , n_i^(K)
    for j ∈ 1 . . . N do
        Calculate the Euclidean distance between xi and xj
        if xj is in the K nearest then
            Assign j as a neighbor of i
        end if
    end for
end for
2. Solve for reconstruction weights
Initialize reconstruction weight matrix W as an N × N matrix of zeros.
for i ∈ 1 . . . N do
    Initialize Z as a D × K matrix.
    for k ∈ 1 . . . K do
        Set the kth column of Z to be x_{n_i^(k)} − xi
    end for
    Compute local covariance C = Zᵀ Z
    Regularization term ε = 1e−3 × trace(C)
    C = C + ε × I
    Solve the linear system C ŵ = 1 for ŵ
    for k ∈ 1 . . . K do
        w_{i, n_i^(k)} = ŵ_k / sum(ŵ), where ŵ_k is the kth element of ŵ
    end for
end for
3. Compute embedding coordinates
Create sparse matrix M = (I − W)ᵀ (I − W)
Find the bottom d + 1 eigenvectors of M (corresponding to the smallest eigenvalues)
Remove the eigenvector with eigenvalue 0
Set the rows of Y to be the remaining d eigenvectors
When we refer to “bottom eigenvectors,” we mean the eigenvectors with the smallest
eigenvalues. “Top eigenvectors” have the largest. We can get these easily from any
standard eigensolver, which we’ll detail more in the next section. Because M is symmetric,
it will have only real eigenvalues and eigenvectors, so you don’t have to worry about
complex numbers here.
1Our pseudocode is adapted from the authors’ web page, https://cs.nyu.edu/~roweis/lle/algorithm.html.
1.3.1 Implementation Details
i Just as a small tip, LLE can take a while to train on all 3,000 train examples + 1,000
test examples. It might be a good idea to reduce it down to 1,000 examples total while
developing and debugging your code.
ii Because we’re working with a symmetric matrix, you’ll want to use the eigensolver pro-
vided with scipy here: https://docs.scipy.org/doc/scipy/reference/generated/
scipy.sparse.linalg.eigsh.html#scipy.sparse.linalg.eigsh. Because we’re look-
ing for the eigenvectors corresponding to the smallest eigenvalues, you’ll want to use
the flag sigma=0.0 (this seems to work better than which=’SM’).
iii Because the eigensolver is a numerical approximation, it is not guaranteed to recover
exact zeros in the bottom eigenvalues. You may use np.finfo(float).eps as the
lower bound when checking for the zero eigenvalue.
iv Just as you implemented a subclass of Model for PCA, you’ll do the same for LLE
with the same behavior.
v It is not always trivial to decide k (the number of nearest neighbors) in LLE. For this
assignment, you can use k = 10. You should feel free to play with this, but this setting
should work.
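Combining the pseudocode with the implementation details above, a condensed sketch of the full algorithm (lle_reduce is an illustrative name; the real logic belongs in your LLE Model subclass, the input is assumed to be standardized already, and the result is returned as N × d, one row per point, i.e., the transpose of the d × N convention used in the text):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lle_reduce(X, d, k):
    """Reduce an N x D data matrix X to N x d with Locally Linear Embedding."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        # Step 1: the k nearest neighbors of x_i (excluding x_i itself).
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        # Step 2: weights that best reconstruct x_i from those neighbors.
        Z = (X[neighbors] - X[i]).T             # D x k matrix of centered neighbors
        C = Z.T @ Z                             # k x k local covariance
        C += 1e-3 * np.trace(C) * np.eye(k)     # regularize before solving
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbors] = w / w.sum()           # normalize so the weights sum to 1
    # Step 3: the bottom d+1 eigenvectors of M, dropping the zero-eigenvalue one.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = eigsh(M, k=d + 1, sigma=0.0)   # eigenpairs nearest 0, ascending
    return vecs[:, 1:]                          # N x d embedding
```

Run on the full training set, this loop is slow; vectorizing the neighbor search is one of the numpy speed-ups the pseudocode hints at.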
As before, here is an example of how we might run your code:
python3 main.py --knn-k 5 --dr-algorithm lle --lle-k 10 --target-dim 300 --test-split dev \
--predictions-file lle-knn-preds.txt
1.4 Programming Details
As before, your code will be evaluated on its predictions on held-out test data. We
have provided you with the test data (but not the labels), so you can ensure that your
code does not fail when running over it. You may compare the scores of your classifier
using different dimensionality reduction techniques on development data with your peers
on Piazza. To make sure that your predictions output is in the correct format, you
should test that accuracy.py can run on your dev predictions, as described in the KNN
instructions.
We will evaluate your code in 2 settings (1) classification accuracy on MNIST test data
when using PCA dimensionality reduction, and (2) classification accuracy on MNIST test
data when using LLE dimensionality reduction.
1.4.1 Data
You will be working with a subset of the popular MNIST² digit recognition dataset. This
is a multiclass classification dataset composed of 10 classes, where each example is a
28 by 28 greyscale image of a handwritten digit (0-9). Our version of the dataset comes
in 3 splits, of which you have 5 files: mnist-train.npy, mnist-dev.npy,
mnist-test.npy, labels-mnist-train.npy, labels-mnist-dev.npy
The training set consists of 3,000 labeled examples of handwritten digits. The dev
and test splits each contain 1,000 examples. While this is not a lot of training data (the
actual MNIST dataset has 60,000 examples), we’ve cut things down considerably to help
2https://en.wikipedia.org/wiki/MNIST_database
you run your code faster. You might be surprised at the accuracy of a simple KNN model
in this setting. Your implementations should not take longer than 7-8 minutes to run
with knn-k=5 and lle-k=10 (LLE will have a much longer runtime than PCA). This
should leave you time for the Notebook portion of this assignment.
1.4.2 Command Line Arguments
Your code should contain the following command line arguments, which you should not
change:
• --data: Points to the directory containing the MNIST data splits.
• --predictions-file: The file that your predictions will be stored in.
• --test-split: The split (test or dev) that your KNN model will predict over.
• --knn-k: A hyperparameter for your KNN model. How many nearest neighbors to
consider during prediction.
• --dr-algorithm: The type of dimensionality reduction algorithm to use (pca or
lle).
• --target-dim: The number of dimensions to keep with your dr technique.
• --lle-k: A hyperparameter for your lle algorithm, as described above.
2 Jupyter Notebook (15 points)
Lastly, you will explore bias in word embeddings, and implement a debiasing method
in the jupyter notebook we’ve provided to you called FATE.ipynb. Please complete the
notebook by filling in all the TODOs in the coding cells, and answering the questions
with red FILL IN text. You will submit your notebook to Gradescope.
3 Analytical (40 points)
Question 1) (15 points) Expressive power of sequence classifiers Consider the fol-
lowing sequence classification task. We are given an arbitrarily long sequence of tokens
from the alphabet X = {(, ), a, b, c, . . . , z}. In other words, each sequence x ∈ X*, where
(·)* denotes the Kleene closure operator. Our sequence classification task is to predict
whether or not a given x is well formed, meaning that the parentheses are balanced.
Let Y = {Yes, No} denote the prediction.
Examples :
x1, . . . , x5 = ( ( ( ( ( ⇒ y = No
x1, . . . , x3 = ) ( a ⇒ y = No
x1, . . . , x4 = ( ) ( ) ⇒ y = Yes
x1, . . . , x8 = ( a ( b ) ( ) ) ⇒ y = Yes
Notice that we are not predicting a y label for each position, we are classifying the entire
sequence with a y label. However, the model for predicting the final classification will
make use of hidden states at each position, z1 . . . zT where T = |x|. Each latent state
zt comes from a set Z, the specific set will depend on the model family. For HMMs
and CRFs, Z is a discrete set of symbols of some fixed size. For RNNs, Z is a real-
valued vector of some fixed dimensionality. The latent state may be used to track the
unobserved properties of the input sequence (e.g., some sort of latent representation of
how the parentheses in the expression are matched).
When we are given a new sequence x to classify, we infer the hidden z states according
to our model parameters. We use the final state zT to predict whether or not the sequence
was well formed. You can make this prediction by either examining the final state itself,
or using the final state as input to a classifier that predicts Yes or No.
1.1: For each of the following families of sequence models, state whether there exist
model parameters that can accurately capture the desired behavior. Explain.
(a) Hidden Markov Model
(b) Conditional Random Field
(c) Recurrent Neural Network
1.2: How would your answers to question 1.1 change if the sequences were upper
bounded in length, |x| ≤ Tmax < ∞? Explain.
Question 2) CRF vs. HMM (15 points) Consider the following sequence models:
• HMM with K = |Z| latent states and M = |X | possible observations.
• Linear chain CRF over the same state space as the HMM, but with feature function
f(x, yt, yt−1) ∈ RD.
For the following questions give a big-O expression that is a function of K, D, M , and
input length T .
(a) Computing the posterior distribution over a specific label pHMM(Yt = y | x)
(b) Computing the posterior distribution over a specific label pCRF(Yt = y | x)
(c) How many parameters are in the HMM?
(d) How many parameters are in the CRF?
Question 3) FATE (10 points) Suppose you work for a bank and you have been tasked
with building a machine learning system that automatically approves or rejects credit
card applications. You are given a dataset that contains a sample of previously processed
applications, for which a human decision maker decided to approve or reject. Each ex-
ample contains information from the application, such as education level, zip code, and
income level, as well as a label indicating the decision (accept/reject). The demographics
of the applicant (gender, age, race, and ethnicity) are not included in the dataset or as
features in the data.
(a) Despite the fact that demographic information is not included in the dataset, you are
concerned that the resulting classifier may be biased against a demographic group.
Name two potential sources of such bias.
(b) Take the same dataset, but add the observation of whether or not the applicant kept in
good credit standing for 5 years after having their application approved (with rejected
candidates labeled as “unknown, due to rejection”). Suppose we trained the model to
predict good credit standing rather than the human decision maker’s judgment. How
would that change the potential for bias? Explain.
(c) Would using an interpretable machine learning model necessarily reduce bias and
increase fairness?
4 What to Submit
In this assignment you will submit two things.
1. Submit your code (.py files) to cs475.org as a zip file. Your code must be
uploaded as code.zip with your code in the root directory. By ‘in the root
directory,’ we mean that the zip should contain *.py at the root (./*.py) and not
in any sort of substructure (for example hw1/*.py). One simple way to achieve this
is to zip using the command line, where you include files directly (e.g., *.py) rather
than specifying a folder (e.g., hw1):
zip code.zip *.py
A common mistake is to use a program that automatically places your code in a
subfolder. It is your job to make sure you have zipped your code correctly.
We will run your code using the exact command lines described earlier, so make sure
it works ahead of time, and make sure that it doesn’t crash when you run it on the
test data. A common mistake is to change the command line flags; if you do this, our
scripts will not be able to run your code.
Remember to submit all of the source code, including what we have provided to you.
We will include requirements.txt and provide the data, but nothing else.
2. Submit your writeup, compiled from LaTeX and uploaded as a PDF. The writeup
should contain all of the answers to the analytical questions asked in the assignment.
Make sure to include your name in the writeup PDF and to use the provided LaTeX
template for your answers. You will submit this to the assignment called “Homework 5:
Dimensionality Reduction, Structured Prediction, and FATE: Written”.