COMP61021: Modelling and Visualization of High Dimensional Data

Lab 2: Applications of PCA (Assessed Lab Exercise)

This coursework (a zipped file) must be submitted via Blackboard. The deadline for
this lab exercise is 23:30 on 19th November 2019. The late submission policy
applies (see the teaching website and FAQs for details).

PCA is one of the most important data analysis tools and can be applied to high-dimensional
data visualization and compression. In this exercise, you are asked to apply the appropriate
Matlab implementations of PCA provided to real data sets for visualization and compression.

You can download the relevant Matlab code and data sets from

http://syllabus.cs.manchester.ac.uk/pgt/COMP61021/Lab/lab1.zip

After unzipping it, you should find two sub-directories: code and data. The
code sub-directory contains three Matlab functions:

pca1.m            Matlab function for PCA derived from the covariance matrix
pca2.m            Matlab function for PCA derived from SVD (dual PCA)
display_digit.m   Matlab function for displaying a grey-level image

In the data sub-directory, you will find two files, iris.mat and digit.mat. The file
digit.mat contains two data subsets, train and test, which hold grey-level images
of the hand-written digit “6”.

iris.mat    IRIS data set consisting of 150 four-dimensional data points
digit.mat   Data set of images of the hand-written digit “6”, split into the two subsets below
    train   Training subset of 300 images of the hand-written digit “6”, each 28×28 pixels
    test    Test subset of 10 images of the hand-written digit “6”, each 28×28 pixels

Now… the lab exercise …

PART 1 – Visualization

For this part, you are asked to choose an appropriate implementation of PCA from those
provided. Apply it to the IRIS data set and then project all 150 data points onto a
two-dimensional PCA subspace spanned by two principal components for visualization. Let PC1,
PC2 and PC3 denote the top three principal components of this data set. Use a Matlab
display function to show your results in the PC1-PC2, PC1-PC3 and PC2-PC3 subspaces.
Based on your observations of the visualized results, describe any non-trivial properties you
find in this data set. The three plots showing your results and the description of your
observations must appear in your report.
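As a rough illustration of the projection and plotting steps described above, here is a minimal sketch. It is not the provided pca1.m or pca2.m (their interfaces are not reproduced here); it computes PCA directly from the covariance matrix using built-in functions. The matrix X is random stand-in data, an assumption made only so that the sketch runs on its own; replace it with the 150-by-4 matrix loaded from iris.mat.

```matlab
% Stand-in data (assumption): replace X with the 150-by-4 IRIS matrix
% loaded from iris.mat. Rows are data points, columns are features.
X = randn(150, 4) * diag([3 2 1 0.5]);

% Generic covariance-based PCA (not necessarily what pca1.m does).
Xc = X - mean(X, 1);                      % centre each feature
C  = cov(Xc);                             % 4-by-4 sample covariance
C  = (C + C') / 2;                        % guard against round-off asymmetry
[V, D] = eig(C);                          % eigenvectors and eigenvalues
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);                          % columns ordered PC1, PC2, ...

Z = Xc * V(:, 1:3);                       % 150-by-3 scores on PC1..PC3

% One scatter plot per pair of principal components.
pairs = [1 2; 1 3; 2 3];
figure;
for p = 1:3
    subplot(1, 3, p);
    scatter(Z(:, pairs(p, 1)), Z(:, pairs(p, 2)), 15, 'filled');
    xlabel(sprintf('PC%d', pairs(p, 1)));
    ylabel(sprintf('PC%d', pairs(p, 2)));
    title(sprintf('PC%d-PC%d subspace', pairs(p, 1), pairs(p, 2)));
end
```

The variance of each column of Z equals the corresponding eigenvalue in lambda, which gives a quick sanity check on the projection.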

PART 2 – Image compression

You are asked to apply an appropriate implementation of PCA to hand-written digit images
for compression. The 300 images in the train set are used to build a PCA
compression system, while the 10 images in the test set are used to evaluate it.
In this part, you need to answer the following questions experimentally,
with sensible justification.

• What is the appropriate implementation of PCA for this application? (Justify why.)

• How many principal components are needed to establish a satisfactory compression
  system? (Justify your chosen number of principal components with evidence.)

• For the 10 images in the test set, what are their low-dimensional representations?
  (Explicitly list the low-dimensional representations of all 10 images in a table,
  which should be put in an appendix.)

• How can you reconstruct those images from their low-dimensional representations?
  (Display the 10 original images and their corresponding reconstructions for comparison.)

• How can you estimate the reconstruction errors for the 10 test images? (Report them in
  detail with your chosen measure and justify why you chose that measure to estimate
  the 10 reconstruction errors.)

The answers to all the above questions must explicitly appear in your report.
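To make the compression pipeline behind these questions concrete, here is a minimal sketch under stated assumptions. It is not the provided pca2.m; it implements a generic dual PCA through the eigen-decomposition of the Gram matrix. The matrices Xtrain and Xtest are random stand-ins whose columns play the role of vectorised 28×28 images, the value M = 20 is arbitrary, and mean squared error is only one of several possible error measures.

```matlab
% Stand-in data (assumption): columns are vectorised 28x28 images.
% Replace with matrices built from the train and test subsets.
Xtrain = rand(784, 300);
Xtest  = rand(784, 10);
M = 20;                                   % number of components (arbitrary)

mu = mean(Xtrain, 2);                     % mean training image, 784-by-1
A  = Xtrain - mu;                         % centred training data

% Dual PCA: decompose the 300-by-300 Gram matrix instead of the
% 784-by-784 covariance matrix; their non-zero eigenvalues coincide.
G = (A' * A) / size(A, 2);
G = (G + G') / 2;                         % guard against round-off asymmetry
[Vg, Dg] = eig(G);
[lambda, order] = sort(diag(Dg), 'descend');
Vg = Vg(:, order);

% Map Gram eigenvectors to unit-norm principal directions in pixel space.
U = A * Vg(:, 1:M);
U = U ./ sqrt(sum(U .^ 2, 1));

Z    = U' * (Xtest - mu);                 % M-by-10 low-dimensional codes
Xrec = U * Z + mu;                        % 784-by-10 reconstructions
mse  = mean((Xtest - Xrec) .^ 2, 1);      % per-image mean squared error

% Fraction of training variance retained by the first M components.
kept = sum(lambda(1:M)) / sum(lambda);
```

Plotting kept (or the test error) against M is one way to gather evidence for the choice of the number of components.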

After loading digit.mat in Matlab, you can extract an image from either the train
subset or the test subset directly, e.g., Ik = train{k} extracts the kth image in the
train subset. You can use the display_digit.m function provided to display any image,
e.g., display_digit(Ik) shows the kth image Ik in the train subset.
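Since PCA operates on vectors rather than images, one possible way to arrange the images is sketched below: each 28×28 image is flattened into a column of a 784-by-N data matrix. The cell array built here is a random stand-in (an assumption, so that the sketch runs without digit.mat); with the real data, train would come from loading digit.mat instead.

```matlab
% Stand-in for the cell array loaded from digit.mat (assumption):
% 300 grey-level images of 28x28 pixels.
train = cell(1, 300);
for k = 1:numel(train)
    train{k} = rand(28, 28);
end

% Flatten each image into one column of a 784-by-300 data matrix.
N = numel(train);
X = zeros(28 * 28, N);
for k = 1:N
    Ik = double(train{k});                % kth image, as in the text above
    X(:, k) = Ik(:);                      % column-major vectorisation
end

% A column can be reshaped back into a 28x28 image for display.
I1 = reshape(X(:, 1), 28, 28);
```

The same reshape call turns a reconstructed column back into an image that can be passed to display_digit.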

PART 3 – Bonus marks

Additionally, bonus marks are available for truly exceptional students. To obtain marks in this
category, you should show evidence of learning beyond the supplied lecture notes that is
closely linked to the problems in Part 2. In any case, the data sets given in Part 2 must be
used in this part. An example of what you could do: apply an alternative dimensionality
reduction technique to the problems described in Part 2 to achieve better performance, while
demonstrating a full understanding of the technique rather than only reporting the performance
improvement. Along with the experimental results, the justification for your chosen method and
the reason(s) for its success must be described clearly in the report.

DELIVERABLES

Submit a zipped file, named “yourname-lab1.zip”, containing a report in PDF format (two
single-sided A4 pages in 11pt font, plus a one-page appendix) and all relevant source code,
along with a readme.txt file in plain-text format. The zipped file must be submitted via
Blackboard.

Your report must address all requirements and key points/results specified in Parts 1 and
2. The same requirements apply to Part 3 if you attempt it. Your readme.txt file must
contain a step-by-step procedure so that a marker can follow your instructions to run your
submitted code and replicate the results described in your report.

Note that we are not interested in the details of your code, such as which Matlab functions
are called or what they return. This course unit is about machine learning algorithms and is
indifferent to how you program them in Matlab.

There is no specific format – marks will be allocated roughly on the basis of:

• rigorous experimentation

• how informatively and clearly your results are presented in your report

• imagination/research/understanding/performance in Part 2 and above

• grammar, ease of reading

The lab is marked out of 15:

Part 1 – Visualization       4 marks
Part 2 – Image Compression   8 marks
Part 3 – Bonus               3 marks

Marks and feedback will be available on Blackboard. Once the marking is
completed, you will be notified via email.