PAPER CODE NO. EXAMINER: Dr. Angrosh Mandya Tel. No. 0151 7954293

COMP527/337 DEPARTMENT: Computer Science

SECOND SEMESTER EXAMINATIONS 2019/20

Data Mining and Visualisation

TIME ALLOWED : Open-Book Exam

INSTRUCTIONS TO CANDIDATES

• Answer all SIX questions.

• The final exam for the course Data Mining and Visualisation is conducted as an open-book exam. The question paper will be provided on 22nd May, 2020 at 10:00 a.m. You are expected to complete the exam and return scanned copies of your answer scripts before 29th May at 10:00 a.m. Answer scripts received after 29th May at 10:00 a.m. will be invalid.

• Completed answer scripts should be submitted using the link: https://sam.csc.liv.ac.uk/COMP/Submissions.pl

• Answer scripts can be prepared in either of the following ways:

1. You can write your answers by hand on plain paper, then scan or photograph the sheets, convert them to a single PDF and submit it at the link provided above.

2. If you do not have access to a scanner or are unable to photograph a handwritten solution, you can use a word processor (MS Word or LaTeX), convert the file into a single PDF and submit it as above.

• Please note that the recommended time for completing the exam, when conducted as a regular exam in previous years, was 2.5 hours. However, the open-book exam is available to you from 22nd May, 2020, 10:00 a.m. to 29th May, 2020, 10:00 a.m. Answer scripts can be submitted at any time during this period. The latest version received before 29th May, 2020, 10:00 a.m. will be considered for evaluation. All submissions received after the deadline will be considered invalid.

• Since the exam is conducted as an open-book exam, students are expected to work on their own to prepare their answers. Any sort of plagiarism will be strictly penalised.

• Calculators can be used for performing calculations.

PAPER CODE COMP527/337 page 1 of 7 Continued

Question 1 Consider the two sentences

S1 = great movie my family loved it

S2 = worst movie my family hated it

Let us consider the words my and it to be stop words and apply lemmatisation to the other words. We represent S1 and S2 respectively by vectors x1, x2 ∈ R^6 in the vector space whose six dimensions are defined respectively by great, movie, family, loved, worst and hated. Answer the following questions.

A. Write the two vectors x1 and x2 in this vector space. (2 marks)

B. Calculate the ℓ1 and ℓ2 norms of x1 and x2. (2 marks)

C. Calculate the Euclidean distance and the Manhattan distance between x1 and x2. (2 marks)

D. Compute the cosine similarity between x1 and x2. (1 mark)

E. Write the set of unigrams, bigrams and trigrams that can be extracted from S1. (3 marks)

F. Write the set of bigrams that can be extracted from S2. (3 marks)

G. State one advantage of removing stop words in text mining tasks. (1 mark)

H. State one disadvantage of removing stop words in text mining tasks. (1 mark)

I. What is meant by part-of-speech in text mining? (2 marks)
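As an illustration (not a model answer), the quantities in parts A–D can be checked with a short script. The layout below assumes the six dimensions are in the order given above and that the stop words my and it have been removed:

```python
import numpy as np

# Dimensions (assumed order): great, movie, family, loved, worst, hated
x1 = np.array([1, 1, 1, 1, 0, 0], dtype=float)  # S1 after stop-word removal
x2 = np.array([0, 1, 1, 0, 1, 1], dtype=float)  # S2 after stop-word removal

l1_x1, l1_x2 = np.sum(np.abs(x1)), np.sum(np.abs(x2))   # l1 norms
l2_x1, l2_x2 = np.linalg.norm(x1), np.linalg.norm(x2)   # l2 norms
euclidean = np.linalg.norm(x1 - x2)                     # Euclidean distance
manhattan = np.sum(np.abs(x1 - x2))                     # Manhattan distance
cosine = x1 @ x2 / (l2_x1 * l2_x2)                      # cosine similarity
```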

Question 2 Consider the following two training datasets D1 = {(x_n, t_n)}, n = 1, ..., 4 and D2 = {(x_n, t_n)}, n = 1, ..., 4, where x_n ∈ R^2.

In D1, x1 = (0, 0)^T, x2 = (0, 1)^T, x3 = (1, 0)^T, and x4 = (1, 1)^T, and the labels are t1 = t2 = t3 = −1 and t4 = 1.

In D2, x1 = (0, 0)^T, x2 = (0, 1)^T, x3 = (1, 0)^T, and x4 = (1, 1)^T, and the labels are t1 = t4 = −1 and t2 = t3 = 1.

A. What is linear separability? Given D1 and D2, which of the two datasets is not linearly separable and why? (3 marks)

B. If the current weight vector and bias are respectively w^(k) and b^(k), write their updated values w^(k+1) and b^(k+1) after misclassifying an instance x with label t. (2 marks)

C. Consider the training set D1. Initialising w^(0) = (0, 0)^T and b^(0) = 0, compute the final values of the weight vector and bias when the Perceptron algorithm is applied to D1 in the order x1, x2, x3 and x4, updating whenever t·a ≤ 0, where a = w^T x + b is the activation function. (4 marks)

D. Show that the Perceptron with weight vector w′ = (3, 2)^T and bias b′ = −4 also correctly classifies the data points in the training dataset D1. (4 marks)

E. Show that the Perceptron with weight vector w′ = (3, 2)^T and bias b′ = −4, which correctly classifies all four instances in D1 in part D, fails to correctly classify all data points in the training dataset D2. (4 marks)
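For illustration only, parts D and E can be verified with a few lines. One assumption is made here: the prediction is taken as the sign of w·x + b, with an activation of exactly 0 treated as −1, matching the update condition t·a ≤ 0:

```python
import numpy as np

# The two datasets from the question: (point, label) pairs.
D1 = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
D2 = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

def predict(w, b, x):
    """Perceptron prediction; an activation of 0 is treated as -1 (an assumption)."""
    a = np.dot(w, x) + b
    return 1 if a > 0 else -1

w, b = np.array([3.0, 2.0]), -4.0
d1_correct = all(predict(w, b, x) == t for x, t in D1)  # separates D1
d2_correct = all(predict(w, b, x) == t for x, t in D2)  # fails on the XOR-like D2
```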

Question 3 Consider five data points in R^2 given by x1 = (1, 1)^T, x2 = (2, 1)^T, x3 = (2, 2)^T, x4 = (1, 2)^T, and x5 = (0, 1)^T. Answer the following questions about this dataset.

A. Let us assume that we clustered this dataset into two clusters S1 = {x1, x2} and S2 = {x3, x4, x5}. Moreover, let us represent S1 and S2 by two arbitrary 2-dimensional vectors µ1 and µ2, respectively. Write the Within-Cluster Sum of Squares (WCSS) objective f(S1, S2) for this clustering. (2 marks)

B. Calculate the partial derivatives of f(S1, S2) with respect to µ1 and µ2, and find the critical points of f(S1, S2) by setting the derivatives to zero. (2 marks)

C. We would like to apply the k-means clustering algorithm taught in the lectures to the above-mentioned dataset to create two clusters. If the initial centroids for the two clusters were set to µ1 = (2, 1)^T and µ2 = (2, 2)^T, what would be the final set of clusters after the first iteration, using Euclidean distance as the similarity measure between the data points? (3 marks)

D. Compute the values of the cluster centroids µ1 and µ2 following the assignment resulting from part (C). (3 marks)
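A minimal sketch (not the official answer) of one k-means iteration with the given initial centroids; ties in the assignment step, if any, are assumed to go to the first centroid:

```python
import numpy as np

points = np.array([[1, 1], [2, 1], [2, 2], [1, 2], [0, 1]], dtype=float)
centroids = np.array([[2, 1], [2, 2]], dtype=float)   # initial mu1, mu2

# Assignment step (part C): each point goes to its nearest centroid (Euclidean).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assign = dists.argmin(axis=1)          # ties resolved toward the first centroid

# Update step (part D): recompute each centroid as the mean of its cluster.
new_centroids = np.array([points[assign == k].mean(axis=0) for k in range(2)])
```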

E. In a different setting, consider two clusters S1 = {B, B, R} and S2 = {R, B, R} consisting of blue (B) and red (R) colour balls. Compute the purity for this clustering. (2 marks)

F. Compute the Rand index for the clustering described in part (E). (3 marks)

G. Compute the precision, recall and F-score for the clustering described in part (E). (2 marks)
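A sketch of the pair-counting evaluation in parts E–G, under the usual definitions (purity as the fraction of items in their cluster's majority class; Rand index and pairwise precision/recall from TP/FP/FN/TN counts over all item pairs). These are illustrative computations, not the official answers:

```python
from itertools import combinations
from collections import Counter

clusters = [["B", "B", "R"], ["R", "B", "R"]]

# Purity: sum of majority-class counts per cluster, divided by total items.
n = sum(len(c) for c in clusters)
purity = sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

# Pair counts over all pairs of (label, cluster-id) items.
items = [(lab, k) for k, c in enumerate(clusters) for lab in c]
tp = fp = fn = tn = 0
for (l1, c1), (l2, c2) in combinations(items, 2):
    same_class, same_cluster = (l1 == l2), (c1 == c2)
    if same_cluster and same_class:
        tp += 1          # same cluster, same colour
    elif same_cluster:
        fp += 1          # same cluster, different colour
    elif same_class:
        fn += 1          # different cluster, same colour
    else:
        tn += 1          # different cluster, different colour

rand_index = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
```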

Question 4 Consider the logistic regression model shown in Figure 1. Here a two-dimensional input (represented by two features x1 and x2) is multiplied by a two-dimensional weight vector w. A bias term b is also used in the network. Let z be the net score and yn be the activation function computed in the network, resulting in an output o.

Figure 1: Logistic regression model

Answer the following questions:

A. Write the formula for computing the net score z in the network. (1 mark)

B. Assuming that the logistic model uses the sigmoid function to compute the activation yn, write down the formula for computing yn. (1 mark)

C. Provide the reasons for using the sigmoid function in the logistic model. (1 mark)

D. Given a training dataset {(x_n, t_n)}, n = 1, ..., N, where t_n ∈ {0, 1}, write down the steps involved in deriving the likelihood as a function of w for this dataset. (3 marks)

E. Define the negative log of the likelihood function obtained in part D to derive the cross-entropy error function E(w). (2 marks)

F. Write down the gradient ∇E(w) by differentiating E(w), obtained in part E, with respect to w. (3 marks)

G. Using a fixed learning rate η, write down the stochastic gradient descent weight update rule using the gradient obtained in part F. (2 marks)

H. Modify the update rule in part G to add an ℓ2 regularisation on w with a regularisation coefficient λ. (2 marks)

I. Discuss some of the possible solutions to avoid overfitting in a logistic regression model. (2 marks)
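The pieces asked for in parts A–H fit together as in the minimal SGD sketch below: z = w·x + b, y = σ(z), cross-entropy E = −[t log y + (1 − t) log(1 − y)], per-example gradient (y − t)x, and an optional ℓ2 term λw. The toy data, learning rate and regularisation coefficient here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2-D dataset (assumed): class 1 iff x1 + x2 > 1.
X = rng.uniform(0, 1, size=(100, 2))
t = (X.sum(axis=1) > 1).astype(float)

w, b = np.zeros(2), 0.0
eta, lam = 0.5, 1e-3                      # learning rate, l2 coefficient

def loss(w, b):
    y = sigmoid(X @ w + b)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

before = loss(w, b)
for _ in range(20):                       # a few epochs of SGD
    for n in rng.permutation(len(X)):
        y_n = sigmoid(X[n] @ w + b)       # forward pass (parts A-B)
        grad_w = (y_n - t[n]) * X[n] + lam * w  # gradient + l2 (parts F, H)
        grad_b = y_n - t[n]
        w -= eta * grad_w                 # SGD update (part G)
        b -= eta * grad_b
after = loss(w, b)
```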

Question 5 Consider the sentence S1 given below. Assume that you are required to train a Continuous Bag of Words (CBOW) Word2Vec model using S1.

S1 = i like data science

A. Illustrate with a diagram the CBOW model with a hidden layer consisting of 2 neurons. Indicate the input, hidden and output layers of the network with x, h and u, respectively. Use W and W′ to indicate the parameters of the model from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. Write down the dimensions of x, h and u. (3 marks)

B. Let

W = | 1.0  0.2 |
    | 0.5  2.4 |
    | 1.2  0.8 |
    | 2.0  0.6 |

be a randomly initialised weight matrix between the input and the hidden layer. Following the vectorised form, compute the output h at the hidden layer of the CBOW model for the input context word “science”, using its one-hot vector representation. (4 marks)

C. Justify the use of the softmax activation at the output layer in the CBOW model. (5 marks)

D. Let

W′ = | 1.0  2.0  3.0  2.5 |
     | 1.5  0.5  1.0  0.5 |

be a randomly initialised weight matrix defined between the hidden layer and the output layer. Write down the softmax function used to compute the output ui at the output layer of the CBOW model. Also compute the output probabilities for each word in V using the hidden output h obtained in part B. (5 marks)
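A sketch of the forward pass in parts B and D. One assumption is required: the vocabulary ordering V = [i, like, data, science], so that “science” is the fourth one-hot vector; this ordering is not specified in the question:

```python
import numpy as np

V = ["i", "like", "data", "science"]          # assumed vocabulary order
W = np.array([[1.0, 0.2],
              [0.5, 2.4],
              [1.2, 0.8],
              [2.0, 0.6]])                    # input -> hidden (|V| x 2)
W_prime = np.array([[1.0, 2.0, 3.0, 2.5],
                    [1.5, 0.5, 1.0, 0.5]])    # hidden -> output (2 x |V|)

x = np.zeros(4)
x[V.index("science")] = 1.0                   # one-hot vector for "science"
h = W.T @ x                                   # hidden output: selects a row of W
u = W_prime.T @ h                             # net scores at the output layer
p = np.exp(u) / np.exp(u).sum()               # softmax probabilities over V
```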

Question 6 Consider that you are hired for a project that studies the adverse side effects reported on Twitter by patients taking a particular drug X. You are given a large collection of tweets that mention the drug name X for this purpose. A subset of these tweets, annotated with regard to the effect of the medicine as positive or negative by the Research and Development (R&D) wing of the company, is also provided to you. Given this scenario, answer the following questions.

A. Describe three challenges that you are likely to encounter in this project. (5 marks)

B. Assuming that you want to use an Artificial Neural Network (ANN) for training a supervised model for sentiment analysis using the annotated corpus, answer the following:

B1. Describe the pre-processing steps that you would use to clean the data. (5 marks)

B2. Provide the details of how you would proceed in implementing the ANN model. (Please note that we don't want you to write or explain implementation code. We are more interested in seeing your approach, for example: your approach to pre-processing the Twitter data, converting the pre-processed text into feature vectors (using pre-trained word embeddings), details of the different layers of the network, and the activation and loss functions used in the model.) (5 marks)
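For B1, a minimal tweet-cleaning pipeline might look like the sketch below. The specific steps chosen here (URL and @-mention removal, lowercasing, stripping non-letter characters, whitespace tokenisation) are illustrative assumptions, not a prescribed answer:

```python
import re

def preprocess_tweet(text):
    """A minimal tweet-cleaning sketch (assumed steps, not the official answer)."""
    text = re.sub(r"http\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)      # drop user mentions
    text = text.lower()                    # normalise case
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return text.split()                    # whitespace tokenisation
```

In practice, steps such as stop-word removal or emoji handling could be added depending on the feature representation chosen for the ANN.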
