CS5487 – Assignment 3 – Course Project

Antoni Chan

Department of Computer Science

City University of Hong Kong

Proposal due date: Fri, Week 10

Presentation date: TBA, Week 14

Report due date: Fri, Week 14

1 Course Project

The final assignment is a student-defined course project. The goal of the project is to get some

hands-on experience using the course material on your own research problems. If you can’t think

of a project, then you can do the “default” project, which is digit classification (see Section 2).

1.1 Project topic

The goal of the project is to get some hands-on experience using the course material on your own

research problems. Keep in mind that there will only be about 4 weeks to do the project, so the

scope should not be too large. Following the major themes of the course, here are some general

topics for the project:

• regression (supervised learning) – use regression methods (e.g. ridge regression, Gaussian

processes) to model data or predict from data.

• classification (supervised learning) – use classification methods (e.g., SVM, BDR, Logistic

Regression) to learn to distinguish between multiple classes given a feature vector.

• clustering (unsupervised learning) – use clustering methods (e.g., K-means, EM, Mean-Shift)

to discover the natural groups in data.

• visualization (unsupervised learning) – use dimensionality reduction methods (e.g., PCA,

kernel-PCA, non-linear embedding) to visualize the structure of high-dimensional data.

You can pick any one of these topics and apply them to your own problem/data. Before actually

doing the project, you need to write a project proposal so that we can make sure the project is

doable within the 3-4 weeks. I can also give you some pointers to relevant methods, if necessary.

• Can my project be my recently submitted or soon-to-be submitted paper? If you plan to just

turn in the results from your paper, then the answer is no. The project cannot be be work

that you have already done. However, your course project can be based on extending your

work. For example, you can try some models introduced in the course on your data/problem.

1

1.2 Project details

• Group project – all projects should have a group of 2 students. To sign up for a group, go

to Canvas⇒ “People” and then join one of the existing “Project Groups”. If you cannot find

a group, please use the Discussion board.

• Project Proposal – For the first part of the project, you need to write a project proposal.

The project proposal should be at most one page with the following contents: 1) an intro-

duction that briefly states the problem; 2) a precise description of what you plan to do –

e.g., What types of features do you plan to use? What algorithms do you plan to use? What

dataset will you use? How will you evaluate your results? How do you define a good outcome

for the project? – The goal of the proposal is to work out, in your head, what your project

will be. Once the proposal is done, it is just a matter of implementation!

• Project Poster Presentation – Project poster presentations will be at the end of the

semester, before the project due date (Week 14). More details will be sent out later. The

poster presentation is optional. However, if you want to get an “A” on your project, then

you must give a presentation. Put it another way, if you don’t give a presentation, then you

will get at most a “B+” on your project.

• Project Report – The project report is essentially the project proposal with all the details

filled in. The report should have the following contents: 1) introduction – what is the problem?

why is important?; 2) methodology – what algorithms did you use and what are the technical

details? what are the advantages and disadvantages?; 3) experimental setup – what data

did you use? how did you pre-process the data? which algorithms did you run on the data?

what is the metric for evaluation?; 4) experimental results – what were the results? what

insight do you get from these results? what are some typical success and failure cases? –

The project report should be at least 4 pages. There is no upper page limit, but probably it

should not be more than 8 pages long. For group projects, the project report must state the

level of contribution from each project member.

• What to hand in – You need to turn in the following things:

1. Project proposal (due Friday, Week 10).

2. Project report (due Friday, Week 14).

3. Presentation poster (due Friday, Week 14).

4. Source code files (due Friday, Week 14).

Only one group member needs to submit the files on Canvas. You must submit your course

project materials using the Canvas website. Go to “Assignments” ⇒ “Course Project” ⇒

select the appropriate entry.

• Third Party Code – In the course project, you may use 3rd party source code, e.g., libsvm,

etc. If you use 3rd party code, you must acknowledge it with an appropriate reference.

• Grading – The marks for this project will be distributed as follows:

– 16.7% – Project proposal.

– 16.7% – Technical correctness (whether you used the algorithms correctly)

2

– 16.7% – Experiments. More points for thoroughness and testing interesting cases (e.g.,

different parameter settings).

– 16.7% – Analysis of the experiments. More points for insightful observations and anal-

ysis.

– 16.7% – Quality of the written report (organized, complete descriptions, etc).

– 16.7% – Project poster presentation.

Note: Here 16.7% means 5/30

3

2 Default Course Project – Digit Classification

The default project is handwritten digit classification on a subset of the MNIST digits dataset.

• Dataset – The provided dataset is a subset of the MNIST digits. The dataset has 10 classes

(digits 0 through 9) with 4000 images (400 images per class). Each feature vector is a

vectorized image (784 dimensions), containing grayscale values [0, 255]. The original image

dimensions are 28× 28. Here is an example montage of the digits:

The MATLAB file digits4000.mat (or digits4000 *.txt for non-MATLAB users) contains

the following data:

– digits vec – a 784 × 4000 matrix, where each column is a vectorized image, i.e. the

feature vector xi ∈ R784.

– digits labels – a 1× 4000 matrix with the corresponding labels yi ∈ {0, · · · , 9}.

– trainset – a 2× 2000 matrix, where each row is a set of indices to be used for training

the classifier.

– testset – a 2 × 2000 matrix, where each row is the corresponding set of indices to be

used for testing the classifier.

The image above was generated with the following MATLAB code:

4

testX = digits_vec(:,testset(1,:)); % get test data (trial 1)

testXimg = reshape(testX, [28 28 1 2000]); % turn into an image sequence

montage(uint8(testXimg), ’size’, [40 50]); % view as a montage

• Methodology – You can use any technique from the course material, e.g., Bayes classifiers,

Fisher’s Discriminant, SVMs, logistic regression, perceptron, kernel functions, etc. You may

also use other classification techniques not learned in class, but you will need to describe them

in detail in your report. Two useful libraries for classification are “libsvm” and “liblinear”.

You can also pre-process the feature vectors, e.g., using PCA or kPCA to reduce the dimen-

sion, or apply other processing techniques (e.g., normalization or some image processing).

Finally, a common trick for doing multi-class classification using only binary classifiers (e.g.

SVMs) is to use a set of 1-vs-all binary classifiers. Each binary classifier is trained to distin-

guish one digit (+1) vs. the rest of the digits (-1). In this case, there are 10 binary classifiers

total. Given a test example, each binary classifier makes a prediction. Hopefully, only one

classifier has a positive prediction, which can then be selected as the class. If not, then the

classifier that has the most confidence in its prediction is selected. For example, for SVMs

the classifier that places the test example furthest from the margin would be selected. For

logistic regression, the selection would be based on the calculated class probability.

• Evaluation – The classifiers are evaluated over 2 experiment trials. In each trial, 50% of the

data has been set aside for training (and cross-validation of parameters), and the remaining

50% is held out for testing only. The indices of the training set and test sets are given in the

trainset and testset matrices. For a given trial, the same writer does not appear in both

the training and test sets.

For each trial, train a classifier using only the training set data (images and labels). You may

also use the training set to select the optimal model parameters using cross-validation. After

training the classifier, apply the classifier to the test data (images only) to predict the class.

Record the accuracy (number correct predictions / total number) for that trial. Do not tune

the parameters to optimize the test accuracy directly! You can only tune the parameters using

the training set.

As a baseline, a simple nearest-neighbors classifier with Euclidean distance was used on the

test data. The resulting classification accuracy for each experiment trial is:

trial 1 2 mean (std)

1-NN 0.9135 0.9185 0.9160 (0.0035)

In your experiments, which classifier does better? What feature pre-processing helps or hurts

the performance? How does the performance vary with parameter values?

• Bonus Challenge – In the bonus challenge, I will give you a new test set containing my own

handwritten digits, and you will try to classify them using your trained classifiers. Whoever

gets the best performance wins a prize!

5