ECE 174: Introduction to Linear and Nonlinear Optimization with Applications

Fall 2019

Programming Assignment # 1

Due: November 5

Instructions: In this programming assignment you will implement a “K-Means Clustering"

heuristic, test your program on real datasets under various settings, and interpret your results.

• You can write the source code in any language of your choice (Python, MATLAB). The

code should be properly commented and you will be required to submit the source code.

You should write your own code and not copy from any resource on the internet. If you

consult any resource online before writing your own code, you must cite the source, add

a comment to your code with a link to the source you got it from. Any uncited code you

did not write yourself, or any code which is substantially similar to another student’s,

will be referred to the Office of Academic Integrity.

• In addition to the source code, you will submit a detailed report that describes the simula-

tions and explains the results using figures and mathematical interpretations. Your report

must address all the questions asked in this assignment. We do not have any detailed re-

quirements on the formatting, but we expect the report to be presented in a complete and

professional manner, with appropriate labels on items and comments which demonstrate

understanding of the results.

• Your grade on the programming assignment will be based on the correctness of the code

as well as the quality of the report. Since our evaluation of your code is based on its results

as presented in your report, correct code is very important. We have every intention of

awarding partial credit, but for instance if a small bug causes your program’s output to be

unusable throughout the report, we cannot award much credit even if the bug was very

small.

• Reading: Read Sec. 4.2 and 4.3 from the textbook to review “k-means" clustering. Also

study Example 4.4.1 on image clustering. You will implement this example and do several

numerical experiments.

• Dataset: We will use the well-known MNIST (Mixed National Institute of Standards)

database of handwritten digits which contains grayscale images of size 28ˆ 28 which can

be represented as vectors of size 784. The dataset consists of N = 60000 images that you

will cluster using k-means. The dataset is provided as mnist.mat which you can directly

read into MATLAB. For those working in Python, you can also refer to http://rasbt.

github.io/mlxtend/user_guide/data/loadlocal_mnist/ and http://rasbt.github.io/

mlxtend/installation/ for obtaining the equivalent data from the MNIST images.

1. Write a program k-means to implement k-means clustering. The functions accepts two in-

puts: a list of N data vectors to be clustered, and the number of clusters, K. The program

outputs three quantities: a list of N group assignment indices (c1, c2, ¨ ¨ ¨ , cN), a list of K

group representative vectors (z1, ¨ ¨ ¨ , zK) and the value of Jclust after each iteration of the

algorithm until convergence or termination.

Initialization and Termination: Initialize k-means by randomly assigning the data points

to K group, i.e., randomly selecting each ci to be an integer between 1 and K. Alternatively,

you can also randomly assign K data points as the K group representatives. Explicitly

mention your initialization technique in the report. You can follow any of the termination

rules suggested in your textbook (Page 76). Again, mention in your report what criterion

you used.

2. Let K = 20. Implement k-means on the MNIST dataset, starting with a random assignment

of the vectors to clusters, and repeating the experiment P = 30 times. Save the output of

the algorithm for each of the 30 runs. Consider the two runs that produce the maximum

and minimum value of Jclust upon convergence. For each of these two runs

• Plot the value of Jclust as a function of the number of iterations. Comment.

• Visualize the K group representatives as images. You should generate something

similar to Fig. 4.8 and 4.9 of the textbook. Interpret the group representatives (if

which digits they represent, or if they represent something else) and compare them

for these two runs. Discuss your results.

• Find the 10 nearest data points to each of the K group representatives. Manually

list what digits they represent (by eye estimation). In your opinion, how many of

these 10 images match the corresponding group representative and how many of

them are misclassified? Generate a table and interpret your results. Plot a few of

the correctly classified as well as misclassified data points as images, alongside the

group representatives (also represented as images).

3. Repeat part 2 for K = 10 and P = 20. Compare with the previous results.

4. Repeat part 2 for K = 5 and P = 10. Compare with the previous results.