Final Exam

MSBD 5002 Data Mining, Spring 2020

Due: 8:00am June 1st

Exam Guidelines

• This is an open book examination.

• Exam Duration: 8:00pm May 29 to 8:00am Jun 1.

• Turn In: You must submit your solutions via Canvas before 8:00am Jun 1. In order to avoid network congestion and submission failure, please submit your solutions in advance. Any late submission will directly receive a zero mark.

• You must pack all solutions together into one zip file, named as itsc studentID final.zip. An example of the directory structure is shown in Fig. 1.

• Academic integrity:

– Your program and report should be based on your own effort. Students cannot collaborate with anyone.

– In case you seek help from any reference source, you should state it clearly in your report. Failure to do so is considered plagiarism, which will lead to appropriate disciplinary action.

– Plagiarism will lead to zero mark directly.

Figure 1: An example of the directory structure to be submitted.


Datasets

The data sets for Q1, Q2, Q5, Q6 and Q7 can be downloaded via the link or Canvas. The data set for Q3 can be downloaded via the link.

Grading

Your score will be given based on:

1. Correctness, completeness and clarity of your report.

2. Code must be runnable, and code comments are necessary; missing the necessary comments will result in a score deduction.

3. Evaluation performance (if any).

4. You can earn a bonus if you propose a novel approach to any question other than Q7.

Notes

Please read the following notes carefully.

1. Please read Exam Guidelines carefully.

2. Please work hard and plan your time reasonably.

3. In principle, Python is the preferred programming language.

4. The computation for some questions is very heavy; students may use a cloud computing platform such as Azure.

5. If your code or answers refer to any blog, GitHub repository, paper and so on, please provide the links in the corresponding QX report.pdf.

6. If you have any question that Google cannot answer, please send an email to {hlicg,lwangcg,sdiaa}@connect.ust.hk. Questions from other channels will not be answered.


1 Dimensionality Reduction on Data Features (10 Points)

In real-world problems, some data have high-dimensional features that contain noise and impede the performance of machine learning models. In this question, for simplicity, we give a small toy dataset, and you are required to design two dimensionality reduction algorithms to reduce the dimension of the features.

You are required to implement your dimensionality reduction algorithms manually, but you are allowed to use any existing package for the classification or clustering model used to test the effect of your algorithms. For example, you feed the features obtained by different dimensionality reduction algorithms to the same classification or clustering model and compare the performance.
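For instance, principal component analysis is one dimensionality reduction algorithm that can be coded manually with NumPy's eigendecomposition. This is an illustration only, not a required choice; the random array below is a toy stand-in for the real 400-dimensional features.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    top = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    return Xc @ eigvecs[:, top]             # reduced (n_samples x k) features

# Toy usage with random data standing in for the provided dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))
Z = pca_reduce(X, 10)
print(Z.shape)  # (100, 10)
```

The reduced features Z can then be fed to the same classifier as the raw X to compare performance, as described above.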

1.1 Data Description

The dataset contains 3686 points, where each point has a 400-dimensional feature vector. In the dataset, X = multi-dimensional point data, y = labels (0 or 1).

1.2 Submission

1. Pack all code files in folder Q1 code.

2. You should write the pseudocode of your dimensionality reduction algorithms in the report Q1 report.pdf and describe them in detail.

3. You should write the experiment settings in your report Q1 report.pdf. For example, if you split the dataset, you should state it clearly.

4. You should compare the performance of your algorithms and analyze them in the report Q1 report.pdf.

5. Please state clearly how to run your program in Q1 readme.md.

1.3 Notes

1. If your pseudocode includes the classification model or the clustering model, you can use a function to denote it directly instead of writing it out in detail.

2. How many dimensions to reduce to is up to you.


2 Imbalanced Data Classification (15 Points)

In this task, you are required to build classifiers for five imbalanced data sets.

2.1 Data Description

You are provided with five imbalanced training data sets as follows:

1. Bi-class Datasets: v train.csv and p train.csv are data sets with binary classes (e.g., positive, negative).

2. Multi-class Datasets: y train.csv, e train.csv and a train.csv are data sets with multiple classes.
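One common remedy for class imbalance (an illustration, not a required method) is to oversample the minority classes before training. A minimal NumPy sketch on toy data standing in for the provided CSVs:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [], []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True) if n < target else []
        keep = np.concatenate([idx, np.asarray(extra, dtype=int)])
        parts_X.append(X[keep])
        parts_y.append(y[keep])
    return np.vstack(parts_X), np.concatenate(parts_y)

# Toy imbalanced data: 90 negatives, 10 positives.
X = np.vstack([np.zeros((90, 3)), np.ones((10, 3))])
y = np.array([0] * 90 + [1] * 10)
Xb, yb = random_oversample(X, y)
print(np.bincount(yb))  # [90 90]
```

After balancing, any classifier can be trained on (Xb, yb); more refined schemes (e.g., class weighting or synthetic oversampling) are also possible.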

2.2 Submission

1. Pack all code files in folder Q2 code.

2. Predict the test data and put your prediction results in folder Q2 output.

3. Please report the algorithm you utilized in detail, the experiment settings, and the training accuracy in your report Q2 report.pdf.

4. Please state clearly how to run your program in Q2 readme.md.

2.3 Notes

1. You are allowed to use any existing package for this question, but please clearly state your reference links in Q2 report.pdf.


3 Fake Video Detection (15 Points)

Nowadays, many people enjoy sharing their lives via video platforms like TikTok and Kuaishou. However, due to the development of deepfake technologies, many online videos are actually synthesised intentionally. In this problem, you need to design and implement a fake video detection algorithm by yourself.

3.1 Data description

1. You can download the whole dataset from here.

2. Each sub-folder denotes one category of videos. For example, the videos in the ./ApplyEyeMakeup folder are all labeled ’ApplyEyeMakeup’. These videos are all real rather than synthesised.

3.2 Submission

1. Please write down your algorithm’s principles and details in Q3 report.pdf. If your code refers to blogs, GitHub code, papers or anything else, please cite them clearly.

2. Please put all your code of this question in folder Q3 code.

3. You need to put your fake video detection results in Q3 output.csv. You can choose arbitrary fake and real videos online to do the validation. The output format should be like:

The Name of Video        Is This Video Real?
Diving Bird.avi          False
Obama Presenting.avi     True

4. Please state clearly how to run your program in Q3 readme.md.
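The output file itself can be produced with Python's standard csv module. The rows below are the illustrative examples from the format table above, and the underscore in the file name is an assumption (the exact naming follows the sample submission):

```python
import csv

# Hypothetical detection results: (video name, is the video real?).
results = [("Diving Bird.avi", False), ("Obama Presenting.avi", True)]

# newline="" is required by the csv module so rows are not double-spaced.
with open("Q3_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["The Name of Video", "Is This Video Real?"])
    for name, is_real in results:
        writer.writerow([name, is_real])  # booleans are written as "True"/"False"
```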

3.3 Notes

1. You can use any algorithm that you know, but you can NOT directly use complete models that others have already trained to do the classification without any detailed process.

2. We will test your algorithm on a separate dataset and use this metric as the grading criterion.

3. The term fake video means that the video did not originate from the natural world; instead, it was generated by some algorithm from noise or some distribution.


4 Stock Prediction (15 Points)

Everybody wants to earn money from the stock market. Is it possible to predict future stock prices with data mining and machine learning technologies? In this problem, you need to design and implement a stock price prediction algorithm to predict the closing prices of AAPL (Apple), PG (P&G) and GM (General Motors) on June 2, 2020.

4.1 Data Description

1. You are free to collect arbitrary data and information from the internet, such as stock prices, tweets, newspapers, etc.

2. Your data collection method is also not restricted.

4.2 Submission

1. Please write down your algorithm’s principles and details in Q4 report.pdf. If your code refers to blogs, GitHub code, papers or anything else, please cite them clearly.

2. Please put all your codes of this question in Q4 code folder.

3. You need to give your predictions for the closing prices of AAPL (Apple), PG (P&G) and GM (General Motors) on June 2 in Q4 output.txt. The .txt file should contain only the float numbers representing the stock prices, like 318.25.

4. Please state clearly how to run your program in Q4 readme.md.
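As a deliberately naive baseline (an illustration only, run here on synthetic prices rather than real market data), one could fit a linear trend to recent closing prices and extrapolate one step:

```python
import numpy as np

def linear_trend_forecast(prices, window=10):
    """Fit a straight line to the last `window` closes and extrapolate one day."""
    y = np.asarray(prices[-window:], dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)  # least-squares line fit
    return slope * len(y) + intercept       # value at the next time step

# Synthetic closes rising by 1.0 per day, standing in for collected data.
closes = [300 + i for i in range(20)]
pred = linear_trend_forecast(closes)
print(round(pred, 2))  # 320.0
```

A serious solution would, of course, combine richer signals (historical prices, news, tweets) with a more expressive model.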

4.3 Notes

1. You can use any algorithm that you know, but you can NOT directly use complete models that others have already trained to do the sentiment analysis without any detailed process.


5 Frequent Itemset Mining (15 Points)

You are required to write a program to mine the frequent itemsets (min sup = 100) in the provided data. Based on the frequent itemsets you mined, please write programs to mine the closed frequent itemsets and the maximal frequent itemsets.

5.1 Data Description

The benchmark data set is provided by the IBM Almaden Quest research group and contains 1,000 items and 100,000 transactions. For simplicity, each number uniquely identifies an item.

5.2 Task Description

1. Task 1 (5 points) Mine the frequent itemsets (min sup = 100).

2. Task 2 (5 points) Mine the closed frequent itemsets.

3. Task 3 (5 points) Mine the maximal frequent itemsets.
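The three notions can be illustrated by brute-force enumeration on a toy transaction list. This sketch is exponential in the number of items, so it is for illustration only; the real data needs an efficient miner such as Apriori or FP-growth, and min sup here is 2 rather than the required 100:

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
min_sup = 2

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(frozenset(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# Closed: no proper frequent superset has the same support.
closed = [X for X, s in frequent.items()
          if not any(X < Y and frequent[Y] == s for Y in frequent)]
# Maximal: no proper frequent superset at all.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted("".join(sorted(X)) for X in maximal))  # ['ab', 'ac']
```

Here {b} is frequent but not closed (its superset {a, b} has the same support 2), while {a} is closed but not maximal.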

5.3 Submission

1. Pack all code files in folder Q5 code.

2. Following the sample submission, put your outputs in folder Q5 output.

3. Describe the algorithm you utilized for task 1 and the running time of

each task in Q5 report.pdf.

4. Please state clearly how to run your program in Q5 readme.md.

5.4 Notes

1. You are allowed to use any existing package for this question, but please clearly state your reference links in Q5 report.pdf.


6 Community Detection (10 Points)

In many social networks, several different relations usually exist between the same set of users. In this task, given several datasets, you are required to detect the communities among 419 Twitter users. The ground truth consists of five communities.

6.1 Data Description

The 419 users are referenced by their unique Twitter IDs in data ids.mtx. You are provided with three different sources of these users’ social networks:

1. follow/followedby This folder records “follow” information between users.

2. mention/mentionedby This folder records “mention” information be-

tween users.

3. retweet/retweetedby This folder records “retweet” information between users.

6.2 Task Description

1. Task 1 (10 points) Detect 5 communities among the 419 users.
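One classical baseline (an illustration, not necessarily the intended method) is spectral partitioning, which splits a graph by the sign of the Fiedler vector of its Laplacian. Detecting the required five communities would need a k-way extension (e.g., clustering several eigenvectors), and the toy graph below stands in for the real Twitter networks:

```python
import numpy as np

def spectral_bipartition(adj):
    """Split a graph into two communities by the sign of the Fiedler vector."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                  # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)
    fiedler = eigvecs[:, 1]          # eigenvector of the 2nd-smallest eigenvalue
    return fiedler >= 0              # boolean community assignment per node

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1
part = spectral_bipartition(A)
print(part[:3], part[3:])  # the two triangles fall on opposite sides
```

The three relation types (follow, mention, retweet) could be merged into a single adjacency matrix, e.g. by summing them, before any such method is applied.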

6.3 Submission

1. Pack all code files in folder Q6 code.

2. Following the sample submission, put your outputs in folder Q6 output.

3. Describe in detail the algorithm you are using in Q6 report.pdf.

4. Please state clearly how to run your program in Q6 readme.md.

6.4 Notes

1. You are allowed to use any existing package for this question, but please clearly state your reference links in Q6 report.pdf.

2. Do NOT use any external data.


7 Matrix Factorization in Graph Representation and Recommendation System (20 Points)

7.1 Preliminary

Matrix Factorization (MF) techniques (e.g., Probabilistic Matrix Factorization and Non-Negative Matrix Factorization) have become the crux of many real-world scenarios, including graph representation and recommendation systems (RecSys), because they are powerful models for finding the hidden properties behind the data. More specifically, Non-Negative Matrix Factorization (NNMF) is a group of models in multivariate analysis and linear algebra where a matrix $A \in \mathbb{R}^{|B| \times |C|}$ is decomposed into $B \in \mathbb{R}^{|B| \times d}$ and $C \in \mathbb{R}^{|C| \times d}$ according to:

$$B, C = \operatorname*{arg\,min}_{B', C'} \left\| A - B' C'^{\top} \right\|_F^2, \qquad (1)$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

Graph Representation Given a graph $G = (V, E)$ ($V$ is the set of nodes and $E$ is the set of edges), the adjacency matrix of $G$ is represented by $A \in \mathbb{R}^{|V| \times |V|}$, where $a_{ij} = 1$ if there is an edge $e \in E$ between node $v_i$ and node $v_j$, and $a_{ij} = 0$ otherwise. Following Eq. 1, you can represent the nodes $V$ as $\mathbf{V} \in \mathbb{R}^{|V| \times d}$ by minimizing the loss $\mathcal{L}_g = \left\| A - \mathbf{V}\mathbf{V}^{\top} \right\|_F^2$. You may refer to the last lecture slides for more details.

User-Movie RecSys Given a collection of users $U$ and a collection of movies $V$, let $X \in \mathbb{R}^{|U| \times |V|}$ denote the users’ rating history, where $x_{ij} = 0$ if the $i$-th user $u_i$ has not rated the $j$-th movie $v_j$, and otherwise $x_{ij} \in (0, 5]$ is the rating of $u_i$ for movie $v_j$. Following Eq. 1, you can represent the users $U$ and movies $V$ as $\mathbf{U} \in \mathbb{R}^{|U| \times d}$ and $\mathbf{V} \in \mathbb{R}^{|V| \times d}$, respectively, by minimizing the loss $\mathcal{L}_r = \left\| X - \mathbf{U}\mathbf{V}^{\top} \right\|_F^2$.

User-Movie RecSys with Movie Tags In real-world applications, it is important to handle the cold-start issue in RecSys, since new movies come out every day. The above user-movie RecSys cannot handle this issue well, since no user has seen a new movie before. There is a variant of MF techniques that alleviates the cold-start issue by incorporating the information of movie tags. Given a collection of movie tags $T$ and a collection of movies $V$, let $Y \in \mathbb{R}^{|T| \times |V|}$ denote the movie tag information, where $y_{ij} = 1$ if movie $v_j$ is tagged with $t_i$, and $y_{ij} = 0$ otherwise. Therefore, you can represent the users $U$, movies $V$, and tags $T$ as $\mathbf{U} \in \mathbb{R}^{|U| \times d}$, $\mathbf{V} \in \mathbb{R}^{|V| \times d}$, and $\mathbf{T} \in \mathbb{R}^{|T| \times d}$, respectively, by minimizing the loss $\mathcal{L}_t = \left\| X - \mathbf{U}\mathbf{V}^{\top} \right\|_F^2 + \left\| Y - \mathbf{T}\mathbf{V}^{\top} \right\|_F^2$ as:

$$\mathbf{U}, \mathbf{V}, \mathbf{T} = \operatorname*{arg\,min}_{\mathbf{U}', \mathbf{V}', \mathbf{T}'} \left( \left\| X - \mathbf{U}'\mathbf{V}'^{\top} \right\|_F^2 + \left\| Y - \mathbf{T}'\mathbf{V}'^{\top} \right\|_F^2 \right).$$


7.2 Task Description

You are required to implement the three different matrix factorization techniques introduced above.

1. Task 1 (5 points) Given the graph data graph.npy, please represent the nodes $V$ by minimizing $\mathcal{L}_g$.

2. Task 2 (5 points) Given the rating data rating.csv, please represent the users $U$ and movies $V$ by minimizing $\mathcal{L}_r$.

3. Task 3 (10 points) Given the rating data rating.csv and the movie data movie.csv, please represent the users $U$, movies $V$, and tags $T$ by minimizing $\mathcal{L}_t$.

7.3 Submission

1. Pack all code files in folder Q7 code.

2. Please report the experiment settings (e.g., learning rate), your final losses ($\mathcal{L}_g$, $\mathcal{L}_r$, and $\mathcal{L}_t$), and the corresponding learning curves (X-axis: iteration; Y-axis: loss) for the three tasks in Q7 report.pdf.

3. Please state clearly how to run your program in Q7 readme.md.

7.4 Notes

1. You are allowed to use any existing package for this question, but please clearly state your reference links in Q7 report.pdf.

2. Hint: Learning in Eq. 1 is usually more stable if $B$ or $C$ is updated in each subproblem by gradient descent, reducing $\mathcal{L}$ slightly each time, for example:

• while not converged:

– Minimize $\mathcal{L}$ over $B$ by gradient descent with $C$ fixed.

– Minimize $\mathcal{L}$ over $C$ by gradient descent with $B$ fixed.
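The alternating scheme in the hint can be sketched as follows. This is a minimal NumPy illustration on a synthetic rank-2 matrix, not the required solution for graph.npy or rating.csv, and it omits the non-negativity constraint of NNMF; learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def factorize(A, d, lr=0.03, iters=5000, seed=0):
    """Minimize ||A - B C^T||_F^2 by alternating gradient steps on B and C."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = rng.normal(scale=0.1, size=(n, d))
    C = rng.normal(scale=0.1, size=(m, d))
    for _ in range(iters):
        R = A - B @ C.T    # residual with C fixed
        B += lr * R @ C    # gradient of the loss w.r.t. B is -2 R C
        R = A - B @ C.T    # recompute residual with the updated B
        C += lr * R.T @ B  # gradient of the loss w.r.t. C is -2 R^T B
    return B, C

# Synthetic rank-2 matrix, so d = 2 can fit it (almost) exactly.
rng = np.random.default_rng(1)
A = 0.5 * rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
B, C = factorize(A, d=2)
rel = np.linalg.norm(A - B @ C.T) ** 2 / np.linalg.norm(A) ** 2
print(rel)  # small relative error once converged
```

The relative reconstruction error traced over the iterations is exactly the kind of learning curve the report asks for.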
