# 辅导案例-COMP527/337

PAPER CODE NO. EXAMINER: Dr. Angrosh Mandya Tel. No. 0151 7954293
COMP527/337 DEPARTMENT: Computer Science
SECOND SEMESTER EXAMINATIONS 2019/20
Data Mining and Visualisation
TIME ALLOWED : Open-Book Exam
INSTRUCTIONS TO CANDIDATES
• The final exams for the course Data Mining and Visualisation is conducted as open-book
exam. The question paper will be provided on 22nd May, 2020 at 10:00 a.m. You are
expected to complete the exam and return scanned copies of answer scripts before 29th
May 10:00 a.m. Answer scripts received after 29th May 10:00 a.m. will be invalid.
https://sam.csc.liv.ac.uk/COMP/Submissions.pl
• Answer scripts can be prepared in the following ways:
1. You can write your answers by hand on plain paper, then scan or photograph the
sheets, convert them to a single PDF and submit it at the link provided above.
2. In case you do not have access to scanners or unable to photograph a handwritten
solution, you can use a word processor (MS Word or LaTeX) and convert the file into a
single PDF and submit it as above.
• Please note the recommended time for completing the exam when conducted as a regular
exam during previous years was 2.5 hours. However, the open-book exam is available to
you from 22nd May, 2020 10:00 a.m. - 29th May, 2020, 10:00 a.m. Answer scripts can be
10:00 a.m. will be considered for evaluation. All submissions received after the deadline will
be considered as invalid.
• Since the exam is conducted as an open-book exam, students are expected to work on their
own to prepare their answers. Any sort of plagiarism will be strictly penalised.
• Calculators can be used for performing calculations.
PAPER CODE COMP527/337 page 1 of 7 Continued
Question 1 Consider the two sentences
S1 = great movie my family loved it
S2 = worst movie my family hated it
Let us consider the words my and it to be a stop words and apply lemmatisation on the other
words. We represent S1 and S2 respectively by vectors x1,x2 ∈ R6 in the vector space where the
first six dimensions defined respectively by great, movie, family, loved, worst and hated. Answer
the following questions.
A. Write the two vectors x1 and x2 in this vector space. (2 marks)
B. Calculate `1 and `2 norms of x1 and x2 (2 marks)
C. Calculate the Euclidean distance and Manhattan distance between x1 and x2 (2 mark)
D. Compute the cosine similarity between x1 and x2 (1 mark)
E. Write the set of unigrams, bigrams and trigrams that can be extracted from S1. (3 mark)
F. Write the set of bigrams that can be extracted from S2. (3 mark)
G. State one advantage of removing stop words in text mining tasks. (1 mark)
H. State one disadvantage of removing stop words in text mining tasks. (1 mark)
I. What is meant by part-of-speech in text mining? (2 marks)
PAPER CODE COMP527/337 page 2 of 7 Continued
Question 2 Consider the following two training datasets D1 = {(xn, tn)}4n=1 and D2 = {(xn, tn)}4n=1,
where xn ∈ R2.
In D1, x1 = (0, 0)>, x2 = (0, 1)>, x3 = (1, 0)>, and x4 = (1, 1)> and the labels t1 = t2 = t3 = −1
and t4 = 1.
In D2, x1 = (0, 0)>, x2 = (0, 1)>, x3 = (1, 0)>, and x4 = (1, 1)> and the labels t1 = t4 = −1 and
t2 = t3 = 1.
A. What is linear separability? Given D1 and D2, which of the two datasets is not linearly sepa-
rable and why? (3 marks)
B. If the current weight vector and bias are respectively w (k ) and b(k ), write their updated values
w (k+1) and b(k+1) after misclassifying an instance x with label t . (2 marks)
C. Consider the training set D1. Initialising w (0) = (0, 0)> and b(0) = 0, compute the final values
of the weight vector and bias when the Perceptron algorithm is applied to D1 in the order
x1,x2,x3 and x4 using the update rule ta ≤ 0, where a =
∑4
n=1 w
>x + b is the activation
function. (4 marks)
D. Show that the Perceptron with weight vector w ′ = (3, 2)> and bias b′ = −4 also correctly
classifies the data points in training dataset D1 . (4 marks)
E. Show that the Perceptron with weight vector w ′ = (3, 2)> and bias b′ = −4 that correctly
classifies all four instances in D1 in part D, fails to correctly classify all data points in the
training dataset D2. (4 marks)
PAPER CODE COMP527/337 page 3 of 7 Continued
Question 3 Consider five data points in R2 given by x1 = (1, 1)>, x2 = (2, 1)>, x3 = (2, 2)>,
A. Let us assume that we clustered this dataset into two clusters S1 = {x1, x2} and S2 =
{x3, x4, x5}. Moreover, let us represent S1 and S2 by two 2-dimensional arbitrary vectors
respectively µ1 and µ2. Write the Within-Cluster Sum of Squares (WCSS) objective, f (S1,S2)
for this clustering. (2 marks)
B. Calculate the partial derivatives of f (S1,S2) with respect to µ1 and µ2 and find the critical
points of f (S1,S2) by setting the derivatives to zero. (2 marks)
C. We would like to apply the k -means clustering algorithm taught in the lectures to the above-
mentioned dataset to create two clusters. If the initial centroids for the two clusters were set to
µ1 = (2, 1)> and µ2 = (2, 2)>, then what would be the final set of clusters after the first iteration
using Euclidean distance as a similarity measure between the data points. (3 marks)
D. Compute the values for cluster centroids µ1 and µ2 following the assignment resulting in part
(C). (3 marks)
E. In a different setting, consider two clusters S1 = {B,B,R} and S2 = {R,B,R} consisting of
blue (B) and red (R) colour balls. Compute the purity for this clustering. (2 mark)
F. Compute the rand index for the clustering described in part (E). (3 marks)
G. Compute the precision, recall and F-score for the clustering described in part (F). (2 mark)
PAPER CODE COMP527/337 page 4 of 7 Continued
Question 4 Consider the following logistic regression model shown in Figure 1. Here a two
dimensional input (represented by two features x1 and x2) is multiplied by two-dimensional weight
vector w. A bias term b is also used in the network. Let z be the net score and yn by the activation
function computed in the network resulting in an output o.
Figure 1: Logistic regression model
A. Write the formula for computing net score z in the network. (1 mark)
B. Assuming that the logistic model uses sigmoid function to compute activation yn, write down
the formula for computing yn. (1 mark)
C. Provide the reasons for using sigmoid function in the logistic model. (1 mark)
D. Given a training dataset {(xn, tn)}Nn=1, where tn ∈ {0, 1}, with n = 1, ... ,N, write down the
steps involved in deriving the likelihood as a function of w for this dataset. (3 marks)
E. Define the negative log of the likelihood function obtained in part D to derive cross entropy
error function E(w). (2 marks)
F. Write down the gradient ∇E(w), by differentiating E(w) with respect to w obtained in part E .
(3 marks)
G. Using a fixed learning rate η, write down the stochastic gradient descent weight update rule
using the gradient obtained in F. (2 mark)
H. Modify the update rule in G to add an `2 regularisation on w with a regularisation coefficient
λ. (2 mark)
I. Discuss some of the possible solutions to avoid overfitting in a logistic regression model.
(2 mark)
PAPER CODE COMP527/337 page 5 of 7 Continued
Question 5 Consider the following sentence provided below in S1. Consider that you are required
to train a Continuous Bag of Words (CBOW) Word2Vec model using S1.
S1 = i like data science
A. Illustrate with a diagram the CBOW model with a hidden layer consisting of 2 neurons. In-
dicate the input, hidden and output layer of the network with x,h and u, respectively. Use
W and W ′ to indicate the parameters of the model from input layer to hidden layer and from
hidden layer to output layer, respectively. Write down the dimensions of x,h and u. (3 marks)
B. Let W =

1.0 0.2
0.5 2.4
1.2 0.8
2.0 0.6
 be a randomly initialised weight matrix between the input and the hidden
layer. Following the vectorised form, compute the output h at the hidden layer obtained in
CBOW model for the input context word “science”, using the one-hot vector representation of
“science” obtained in B. (4 marks)
C. Justify the use of softmax activation at the output layer in CBOW model (5 marks)
D. Let W ′ =
 1 2 3.0 2.5
1.5 0.5 1.0 0.5
 be a randomly initialised weight matrix defined between the
hidden layer and the output layer. Write down the softmax function to compute output ui at
the output layer in CBOW model. Also compute the output probabilities for each word in V
using the hidden output h obtained in part D. (5 marks)
PAPER CODE COMP527/337 page 6 of 7 Continued
Question 6 Consider that you are hired for a project that studies what adverse side effects are
reported by patients who are taking a particular drug X on Twitter. You are given a large collection
of tweets that mention the drug name X for this purpose. A subset of these tweets, annotated
with regard to the effect of the medicine as positive or negative by the Research and Development
(R&D) wing of the company is also provided to you. Given this scenario, answer the following
questions
A. Describe three challenges that you are likely to encounter in this project. (5 marks)
B. Assuming that you want to use an Artificial Neural Network (ANN) for training a supervised
model for sentiment analysis using the annotated corpus, answer the following:
B1. Describe the pre-processing steps that you would use to clean the data. (5 marks)
B2. Provide the details as to how you would proceed in implementing the ANN model.
(Please note that we don’t want you to write or explain implementation code. We are
more interested in seeing your approach as to how you would proceed. For example,
your approach to pre-processing the Twitter data, convert pre-processed text into feature
vectors (using pre-trained word embeddings), details of different layers of the network,
activation and loss functions used in the model). (5 marks)
PAPER CODE COMP527/337 page 7 of 7 End

Email:51zuoyejun

@gmail.com