APANPS5335: Machine Learning

Final Projects

(Deadline: 12/6/2019 – no exceptions)

Requirements for the Final Project Report

You need to write a final project report in a conference paper style with related citations. You can

choose any programming language (R is preferred) and use any

machine learning packages to help you achieve the goal.

In your report, you need to give a detailed explanation of each step you take to arrive at your solution.

Please give a justification or explanation of the results you obtain.

The reports should also include the lessons you may have learned from doing the project. We will

provide an anonymous ranking order of the accuracy each team is able to achieve.

Grading guidelines

Here is the grade distribution of each report. Please note that this is an individual project, hence no

teams. You are free to discuss with your peers about the project, but you should do the work yourself.

Details of the grading are given below.

• E-score (45%) for the amount of effort put into tackling the problem (including the clarity of the

report). The more effort demonstrated in the report, the higher the grade for this portion.

One way we gauge effort is by looking at the number of machine learning techniques

you have explored for solving the problem.

• Q-score (45%) for the overall quality of the results and the validity of your code submission. The

higher the quality, the higher the grade for this portion.

• C-score (10%) for the quality of the results in comparison with other teams. The grade will be

assigned in 5 buckets: the top 20% get 10 points, the next 20% get 8

points, etc.

Handwriting recognition

Handwriting recognition is a well-studied subject in computer vision and has found wide applications in

our daily life (such as USPS mail sorting). In this project, we will explore various machine learning

techniques for recognizing handwritten digits. The dataset you will be using is the well-known MNIST

dataset.

(1) The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set

of 10,000 examples. (http://yann.lecun.com/exdb/mnist/)

Below is an example of some digits from the MNIST dataset.

The goal of this project is to build a 10-class classifier to recognize those handwritten digits as accurately

as you can. All the assignments below should use the training data (60K examples) and test data (10K

examples) as given by the dataset.

Here are the basic requirements for this project.

Assignment #1: (E-score: 10%, Q-score: 10%)

Do a clustering of the training dataset with the cluster number K=10 first (without looking at the labels),

and then assign a class label to each cluster based on a majority vote of the cluster members’ known

digit labels.

Based on that, compute the training classification accuracy for 10 classes from clustering. Then use

each cluster’s centroid as the anchor point, and you will obtain 10 class centroids corresponding to 10

anchor points.

To compute the test accuracy, first assign each test image the label of its nearest centroid,

and then compare those assigned labels with the true test labels.
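The labeling and scoring steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed implementation. It assumes the clustering itself has already been run (for example with k-means), so the cluster assignments and centroids are given. The helper names (`label_clusters`, `nearest_centroid`, `accuracy`) and the tiny 2-D toy data are hypothetical; the real task uses 10 clusters of 784-pixel images.

```python
from collections import Counter

def label_clusters(assignments, labels):
    """Map each cluster id to the majority digit label among its members."""
    votes = {}
    for cluster, label in zip(assignments, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Toy data (hypothetical): 2 clusters of 2-D points stand in for 10 clusters of images.
train_assignments = [0, 0, 0, 1, 1, 1]   # output of a clustering step, labels not used
train_labels = [7, 7, 1, 3, 3, 3]        # true digits, consulted only after clustering
centroids = [[0.0, 0.0], [10.0, 10.0]]   # one centroid (anchor point) per cluster

# Majority vote gives each cluster a digit; training accuracy follows from it.
cluster_to_digit = label_clusters(train_assignments, train_labels)
train_pred = [cluster_to_digit[c] for c in train_assignments]
train_acc = accuracy(train_pred, train_labels)

# Each test point takes the digit of its nearest centroid.
test_points = [[1.0, 0.5], [9.0, 11.0]]
test_labels = [7, 3]
test_pred = [cluster_to_digit[nearest_centroid(p, centroids)] for p in test_points]
test_acc = accuracy(test_pred, test_labels)
```

Note that a majority vote can give two clusters the same digit, in which case some digits are never predicted; this is worth checking when you interpret the accuracy.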

Assignment #2: (E-score: 10%, Q-score: 10%)

Build a number of non-DNN-based classifiers using all pixels as features for handwriting recognition. You

need to use at least the following four techniques we have learned from the class to do the work:

• Logistic regression

• SVM

• Decision tree, and

• Random forest.

For each technique, please use your own words to give a general description of the technique, its

pros and cons, and why such a technique is suitable for solving the handwriting recognition problem.

The goal is to make sure you know why you decide to choose this technique.

Please also make a comparison table among the four techniques as well as the method from

Assignment #1 above.

Assignment #3: (E-score: 5%, Q-score: 5%)

Run at least one feature engineering technique (such as PCA) to see if you are able to improve the

accuracy of the techniques discussed in Assignment #2. You are welcome to explore more than one

technique. For each technique, please explain the process and reason why such a technique may be

useful.

Assignment #4 (E-score: 10%, Q-score: 10%)

In this assignment, we will explore various techniques related to neural networks and deep learning to

solve the 10-class classification problem.

Since there are many existing implementations to solve the MNIST problem, we need to give some

twists to this problem to make it worthwhile to do for our final project. (Please refer to the ranking list

for MNIST at http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html.)

The basic network structure that we are trying to explore is something like the following (i.e., the fully

connected deep neural nets). The number of hidden layers and the size of each hidden layer in terms of

neurons are left as tuning parameters that you can explore.

More advanced students can also try to add some convolutional layers, but that’s not required.

The goal of this assignment is to find the neural network structure that can be trained on

the fewest training examples while attaining the best possible test accuracy.

In other words, even though there are 60K training examples, we want you to use as few of them

as possible (say, for example, 30K) and still achieve reasonable accuracy on the 10K test

examples.

Your final results should include three parts:

• The final network topology you have chosen to use (in terms of number of hidden layers,

number of neurons in each hidden layer, and possibly the activation functions etc).

• The number of training data (out of 60K) you have used to train the network from the above. In

your report, please state this number as a percentage of the 60K training data. Denote this

as P1.

• The test accuracy of the trained network on the 10K test data. Denote the test accuracy as

P2.

The quality of your results will be measured using the following figure-of-merit formula:

FOM = P1/2 + (1-P2)

For example, suppose you have used only 6K training data out of 60K to achieve a test accuracy of 70%. Then

P1 = 6K/60K = 10% and P2 = 70%, so FOM = 10%/2 + (1-70%) = 35%.

In other words, we want to obtain a FOM that’s as small as possible. This can be achieved by using as

few training data (P1/2) as possible while achieving as small a test error (1-P2) as possible.

Dividing P1 by 2 gives more weight to the test error than to the training data size.
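The formula and the worked example above can be checked with a few lines of Python. This is a small sketch; the function name `figure_of_merit` and its parameter names are hypothetical, and P1 and P2 are passed as fractions rather than percentages.

```python
def figure_of_merit(n_train_used, test_accuracy, n_train_total=60_000):
    """FOM = P1/2 + (1 - P2), with P1 and P2 expressed as fractions in [0, 1]."""
    p1 = n_train_used / n_train_total   # share of the 60K training data used
    p2 = test_accuracy                  # accuracy on the 10K test data
    return p1 / 2 + (1 - p2)

# The handout's example: 6K training data and 70% test accuracy.
example = round(figure_of_merit(6_000, 0.70), 4)   # 0.35, i.e. 35%
```

Rounding to 4 decimal places also hides floating-point noise in the last digits.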

Some ideas you may want to try (and some of them are discussed in the lectures):

• How to best initialize the network.

• How to artificially generate more training data (as a regularization method) to help address the

reduced training data issue. Note that the artificially generated training data do not count

against the number of training data you have decided to use out of 60K. For example, if you

have decided to use 30K training data, then you can only use those 30K chosen training data to

generate an artificially expanded dataset of whatever size. Since the base is the same 30K data,

we will treat it as if you have used only 30K training data.

• How to adjust/design the network topologies.
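As one illustration of the data-generation idea in the list above, the sketch below creates shifted copies of images drawn only from the chosen training subset. Small pixel shifts are just one possible augmentation, used here as an assumption; the handout does not prescribe a method. The function names, the 3x3 toy image, and the parameters `copies_per_image` and `max_shift` are hypothetical.

```python
import random

def shift_image(image, dx, dy, fill=0):
    """Shift a 2-D image (list of rows) by dx columns and dy rows, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted

def augment(subset, copies_per_image=2, max_shift=1, seed=0):
    """Return the original (image, label) pairs plus randomly shifted copies.

    Every new image is derived from `subset`, so the count of base training
    data (the numerator of P1) does not change.
    """
    rng = random.Random(seed)
    augmented = list(subset)
    for image, label in subset:
        for _ in range(copies_per_image):
            dx = rng.randint(-max_shift, max_shift)
            dy = rng.randint(-max_shift, max_shift)
            augmented.append((shift_image(image, dx, dy), label))
    return augmented

# Toy 3x3 "image" with a single bright pixel, labeled as digit 5.
tiny = [[0, 0, 0],
        [0, 9, 0],
        [0, 0, 0]]
shifted_right = shift_image(tiny, dx=1, dy=0)
expanded = augment([(tiny, 5)], copies_per_image=2)
```

Here `expanded` holds three pairs built from one base image, so P1 would still be computed from a single training example.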

Please feel free to use any existing DNN implementation or modify the existing implementation to

achieve your own goals. You’re also free to use CNN or other more advanced techniques to tune the

network.

Assignment #5 (E-score: 10%, Q-score: 10%)

This assignment reflects the data collection process. Everyone is required to

• Handwrite the digits 0 to 9 on paper in 5 different styles, and make sure you can

recognize your own handwriting. Take a picture of each digit you write

(so you have a total of 5 x 10 = 50 images), then resize and convert each one to the same input format as the

MNIST dataset. In other words, you will have 50 new data points with labels.

• Treat these 50 images as a “brand-new” test dataset and run your own ML models from

Assignment #1 - #4 on these 50 images and report the achieved test accuracy. Note that the goal of

this exercise is not to achieve “high” accuracy, but to show what potential gaps there may be

between existing MNIST dataset and your own test dataset, a scenario you would encounter in

real life.
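The resize-and-convert step could look roughly like the sketch below. MNIST images are 28x28 grayscale with pixel values from 0 to 255, where 0 is the background and high values are the ink, so a photo of dark ink on light paper needs to be inverted. The sketch assumes the photo has already been cropped to a square grayscale grid whose side is a multiple of 28; the function name `to_mnist_like` and the synthetic `photo` are hypothetical. It does not reproduce MNIST's centering of each digit, which you may also want to apply.

```python
def to_mnist_like(gray, size=28):
    """Downsample a square grayscale image to size x size and invert it.

    `gray` is a list of rows with values from 0 (black) to 255 (white); its
    side length is assumed to be a multiple of `size`.
    """
    block = len(gray) // size
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            # Average the block of source pixels that maps to output pixel (r, c).
            cells = [gray[r * block + i][c * block + j]
                     for i in range(block) for j in range(block)]
            mean = sum(cells) / len(cells)
            row.append(255 - round(mean))   # invert so that ink becomes bright
        out.append(row)
    return out

# Hypothetical 56x56 "photo": white paper with a black 2x2 ink blot at the top-left.
photo = [[255] * 56 for _ in range(56)]
for i in range(2):
    for j in range(2):
        photo[i][j] = 0

digit = to_mnist_like(photo)
flat = [p for row in digit for p in row]   # 784 pixel values in row-major order
```

Differences in stroke thickness, centering, and lighting between such photos and MNIST are a likely source of the accuracy gap this assignment asks you to report.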

In your report and submission, the following is required for this assignment:

• Show all 50 images you create with the corresponding labels you intend to assign.

• Make a table showing the test accuracy on these 50 images for each ML model you obtained

from Assignments #1 to #4.

• Submit a subfolder containing your 50-image handwriting dataset in MNIST format.
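If "MNIST format" is taken to mean the IDX files distributed on the MNIST page, the images and labels can be encoded with the standard library as sketched below. That reading is an assumption; check with the instructor whether another representation (such as CSV) is also acceptable. The layout used here is a big-endian header (magic number 2051 for images, 2049 for labels, followed by the counts and dimensions) and then one unsigned byte per pixel or label. The function names and the blank placeholder images are hypothetical.

```python
import struct

def idx_images_bytes(images):
    """Encode a list of 2-D images (rows of ints 0..255) as IDX3 unsigned-byte data."""
    rows, cols = len(images[0]), len(images[0][0])
    # Header: magic number, image count, rows, columns as big-endian 32-bit integers.
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    body = bytes(p for image in images for row in image for p in row)
    return header + body

def idx_labels_bytes(labels):
    """Encode a list of digit labels (0..9) as IDX1 unsigned-byte data."""
    return struct.pack(">II", 2049, len(labels)) + bytes(labels)

# Two blank 28x28 placeholder images with labels 0 and 1.
images = [[[0] * 28 for _ in range(28)] for _ in range(2)]
labels = [0, 1]
image_payload = idx_images_bytes(images)
label_payload = idx_labels_bytes(labels)
```

Writing each payload to a file opened in binary mode (`"wb"`) gives files that standard MNIST loaders should be able to read.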

Assignment #6 (C-score: 10%)

Note that the FOM can also be computed for Assignments #1 - #3, where P1 = 100% because we

asked you to use the entire training dataset.
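Substituting P1 = 100% into the FOM formula from Assignment #4 shows what this means for the full-data methods. Since the test error (1 - P2) cannot be negative, their FOM can never fall below 0.5:

```latex
\mathrm{FOM}\big|_{P_1 = 100\%} = \frac{1}{2} + (1 - P_2) \;\ge\; 0.5
```

A network from Assignment #4 therefore beats every full-data method on FOM whenever its own P1/2 + (1 - P2) is below 0.5.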

Please build a table to document the different FOMs from all of the ML techniques you have

experimented with in this project, and list the corresponding P1 and P2.

In your final report, please clearly indicate the best achieved FOM in your project title. Please round it to

4 decimal places if needed.