程序辅导案例 > Program >

程序代写案例-COMP3670/6670

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

COMP3670/6670 Introduction to Machine Learning
Semester 2, 2020
Final Exam
• Write your name and UID on the first page (you will be fine if you forget to write them).
• This is an open book exam. You may bring in any materials including electronic and paper-
based ones. Any calculators (programmable included) are allowed. No communication devices
are permitted during the exam.
• Reading time: 30 minutes
• Writing time: 180 minutes
• For all the questions, write your answer CLEARLY on papers prepared by yourself.
• There are totally 8 pages (including the cover page)
• Points possible: 100
• This is not a hurdle.
• When you are asked to provide a justification to your answer, if your justification is incorrect,
you will get 0.
1
• Section 1. Linear Algebra and Matrix Decomposition (13 points)
1. (6 points) Let {v1,v2} be linearly independent vectors in Rn. Let v3 be a vector in Rn that
does not lie in the span of v1,v2. Prove that {v1,v2,v3} is linearly independent.
Solution. Assume for a contradiction that {v1,v2,v3} are not linearly independent. Then
there exists constants c1, c2, c3, not all zero, such that
c1v1 + c2v2 + c3v3 = 0
Now, if c3 = 0, then the above equation becomes
c1v1 + c2v2 = 0
and therefore {v1,v2} are linearly dependent, a contradiction. So, we have that c3 6= 0.
Hence
c1v1 + c2v2 + c3v3 = 0
c1v1 + c2v2 = −c3v3
−c1
c3
v1 +
−c2
c3
v2 = v3
and hence v3 lies in the span of v1,v2, a contradiction.
2. (7 points) Consider the matrix [
0 −1
1 0
]
Find its eigenvalues. What does this matrix geometrically do when applied to a vector?
Explain how this relates to the set of eigenvalues for this matrix.
Solution. We form the characteristic equation det(A− λI)
det(A− λI) = det
[−λ −1
1 λ
]
= λ2 − (−1)(1) = λ2 + 1
Setting it equal to zero and solving,
λ2 = 1⇒ λ = ±√−1
We find no eigenvalues, as we cannot square root a negative number (at least not in the real
numbers). This matrix can be seen to perform a rotation of 90 degrees about the origin,[
0 −1
1 0
] [
x
y
]
=
[−y
x
]
which means that no vector in R2 will be transformed into a scalar multiple of itself. This
is why we see no real eigenvalues.
2
• Section 2. Analytic Geometry and Vector Calculus (12 points)
1. (6 points) Find all matrices T ∈ R2×2 such that for any v ∈ R2,
T (v) · v = 0
Solution. Let
T =
[
a b
c d
]
,v =
[
x
y
]
.
Then,
T (v) · v =
[
a b
c d
] [
x
y
]
·
[
x
y
]
= 0[
ax+ by
cx+ dy
]
·
[
x
y
]
= 0
(ax+ by)x+ (cx+ dy)y = 0
Choose x = 0, then
(a0 + by)0 + (c0 + dy)y = 0⇒ dy2 = 0
Since dy2 = 0 needs to hold for all y, this is only possible when d = 0. Now, choose y = 0,
then
(ax+ b0)x+ (cx+ d0)0 = 0⇒ ax2 = 0
Since ax2 = 0 needs to hold for all x, this is only possible when a = 0. Substituting this
into the above,
(0x+ by)x+ (cx+ 0y)y = 0
bxy + cxy = 0
Since this needs to hold for all x and for all y, it must be the case that b = −c. Hence,
{T ∈ R2×2 : ∀v ∈ R2, (Tv) · v = 0} =
{[
0 α
−α 0
]
: α ∈ R
}
2. (6 points) Let x,a ∈ Rn×1, and define f : Rn×1 → R as
f(x) = xTaaTx
Compute ∇xf(x).
Solution. There are a few ways to do this. Students could take the derivative with respect
to a component of x, or could use product rule, or could recognize that
xTaaTx = xTaxTa = (xTa)2
and use chain rule. I will use the latter approach. Let h(x) = xTa and g(x) = x2. Then
f(x) = g(h(x)), and so
df
dx
=
dg
dh
dh
dx
Then, dgdh = 2h and
dh
dx = a
T , by identities provided in lecture slides. Hence,
df
dx
= 2haT = 2xTaaT .
as required.
3
• Section 3. Probability (15 points)
Consider the following scenario. I flip a fair coin.
If the coin comes up heads, I roll a fair 4 sided die (with sides {1, 2, 3, 4}), and then I tell you
the result of rolling the die.
If the coin comes up tails, I roll a fair 6 sided die (with sides {1, 2, 3, 4, 5, 6}), and then I tell you
the result of rolling the die.
Let X denote the number I tell you.
1. (3 points) What is the set of all possible outcomes X for X?
Solution. We either roll a 4-sided die, and get a number from 1 to 4, or roll a 6-sided die,
and get a number from 1 to 6. Hence X = {1, 2, 3, 4, 5, 6}.
2. (4 points) Compute P (X = x) for all x ∈ X .
Solution. First consider the case where x ∈ {1, 2, 3, 4}. Let C denote the outcome of the
coin, with C = {heads, tails} denoting the set of outcomes for the coin.
P (X = x) =
∑
c∈C
P (X = x | C = c)P (C = c)
= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails)
= 1/4 · 1/2 + 1/6 · 1/2 = 5/24
Then consider the case where x ∈ {5, 6}, which will be the same as before, but now P (X =
x | C = heads) = 0, as the four-sided die can’t roll 5’s or 6’s.
P (X = x) =
∑
c∈C
P (X = x | C = c)P (C = c)
= P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails)
= 0 · 1/2 + 1/6 · 1/2 = 1/12
Hence,
P (X = x) =
{
5/24 x ∈ {1, 2, 3, 4}
1/12 x ∈ {5, 6}
3. (4 points) I tell you that X = 1. How likely is it that the coin flipped heads?
Solution. We apply Bayes rule.
P (C = heads | X = 1)
=
P (X = 1 | C = heads)P (C = heads)
P (X = 1)
=
1/4 · 1/2
5/24
= 3/5
4. (4 points) We repeat the above experiment, but this time whatever die is selected, is rolled
twice. I inform you that the outcome for both rolls was a 1. How likely is it that the coin
flipped was heads?
Solution. Each die roll is i.i.d.
P (C = heads | rolled one twice) = P (rolled one twice | C = heads)P (C = heads)
P (rolled one twice)
Now, P (rolled one twice | C = heads) = P (X = 1 | C = heads)2 = 1/16, and
P (rolled one twice | C = tails) = P (X = 1 | C = tails)2 = 1/36
P (rolled one twice)
= P (rolled one twice | C = heads)P (C = heads) + P (rolled one twice | C = tails)P (C = tails
= P (X = 1 | C = heads)2(1/2) + P (X = 1 | C = tails)2(1/2)
= 1/16 · 1/2 + 1/36 · 1/2 = 13/288
4
Hence,
P (C = heads | rolled one twice) = 1/16 · 1/2
13/288
= 9/13
5
• Section 4. Clustering and Gaussian Mixture Model (GMM) (15 points)
Both Kmeans and GMM can be viewed as aiming to find θ to optimise p(X |θ). Here, X is the
dataset, and θ is related to the model. Answer the following questions.
1. (2 points) In kmeans, use no more than 2 sentences to describe what θ contains.
Solution. θ contains the coordinates of centroids, as well as the cluster ID that each
sample is assigned to. Deduct 0.5 mark if answering the number of clusters.
2. (3 points) In kmeans, use no more than 2 sentences to describe the probabilistic meaning
of p(X |θ).
Solution. Given model parameters θ, the probability that dataset X is generated by this
model.
3. (3 points) Assume that samples in X are from 3 classes. After training a GMM with 3
components on X , we use this GMM as a classifier to predict which class a new sample x
belongs to. In no more than 3 sentences, describe the prediction process. (Use math symbols
where relevant; you do not have to explain the symbols if they are same with lecture slides,
e.g., µ).
Solution. We obtain three Gaussian distributions from training. We compute p(x|θk), the
probability that the test sample x is generated by the kth Gaussian distribution. We pick
the Gaussian distribution that gives the largest probability as the label of this test sample.
4. (3 points) Is it correct to say that the kmeans method enables us to find θ that minimises
p(X |θ)? Explain your answer in 2 sentences.
Solution. Incorrect. First, we aim to maximise instead of minimise p(X |θ). Second, kmeans
enables us to find a local maximum instead of a global one.
5. (4 points) Suppose we have the following 10 data points. They are partitioned into two
classes, red and blue. Which model could generate this partition, kmeans only, GMM only,
both, or neither? Explain your answer in three sentences. (Note: where GMM is relevant,
the classification principal is similar to Question 3 above, except that this question has two
classes.)
Figure 1: 10 data points divided into two classes, blue and red.
Solution. GMM only. It cannot be generated by kmeans because of the right most point
is closer to the center of the red class, so it will be classified as red. GMM can generate this
figure because the blue class has a larger variance under GMM.
6
• Section 5. Linear Regression (13 points)
You are doing a machine learning internship at a bank, analysing user age and their daily
expense. You collected seven samples and plotted them in Fig. 2(a).
20 25 30 35 40 45 50
age
30
40
50
60
70
80
90
ex
pe
ns
e
20 25 30 35 40 45 50
age
30
40
50
60
70
80
90
ex
pe
ns
e
Point !
(a) Regression with normal data (b) Regression with normal and abnormal data
Figure 2: Linear regression with normal and abnormal data.
1. (3 points) From your observation of Fig. 2(a), use 1 sentence to describe the relationship
between user age and expense.
Solution. Expense increases linearly as age increases.
After obtaining Fig. 2(a), you collected a new data point A. Together with the previous seven
points, you do linear regression for a second time and obtain the dashed line in Fig. 2(b).
2. (4 points) Generally when adding new samples to the dataset, it is expected that the fitted
line will be different. In our example, there is quite a large difference between the new model
(dashed line) and the old model (solid line). In two sentences, explain why the change is
large. (Hint: you don’t have to explain why there is a “change”. Focus on “large”.)
Solution. 1) Point A deviates a lot from the other points, thus rendering a large gradi-
ent/loss asking the model to be closer to it. 2) There are limited number of samples in this
dataset, so the big gradient from A causes the model to move a lot.
3. (6 points) You originally used the squared error, i.e., for the nth training sample (xn, yn),
l(xn, yn,θ) = (yn − θTxn)2,
where xn is the feature of the sample, yn is the label, and θ contains the model parameters.
Your supervisor tells you that Point A is an outlier and that it is best to exclude its impact
on your model. Write down an amended loss function that can achieve this goal. Explain
how it excludes the impact of outliers on your linear model. Note: you will get partial marks
if your loss function can merely alleviate the impact of A.
Solution.
l(xn, yn,θ) = min((yn − θTxn)2 −m, 0).
m is a hyperparameter. From Fig. 1, its value should be approximately between 100 and 400
(it’s not necessary for a student to make this estimation). There could be other equivalent
ones.
Explanation. For a proper value of m, the outlier point A will have a large (yn− θTxn)2,
thus causing term (yn − θTxn)2 −m to be greater than 0. In this case, there will be 0 loss
value. For other points, (yn − θTxn)2 − m will be less than 0, thus preserving the value
after the min operator.
7
• Section 6. Principal Component Analysis (PCA) and Linear Regression (20 points)
We are given a centered dataset, (x1, y1), (x2, y2), ..., (xN , yN ), where xn ∈ R, yn ∈ R, and N is
the number of samples. “Centered” means
∑N
n=1 xn = 0, and
∑N
n=1 yn = 0. Now for this dataset,
we apply linear regression and PCA. For linear regression, our model is y = θx+θ0 = θx, where
we treat yn as labels and xn as the feature. For PCA, we obtain the first and second principal
components: pc1 and pc2. An example is shown in Fig. 5.
-6 -4 -2 0 2 4 6
x
-5
-4
-3
-2
-1
0
1
2
3
4
5
y
data points
y= x
pc1
pc2
Figure 3: Linear regression with normal and abnormal data.
1. (3 points) Are pc1 and pc2 orthogonal? Explain your answer in two sentences.
Solution. Yes. The covariance matrx of this 2-D dataset has two different eigenvalues
because the data distribution is oval. Eigenvectors corresponding to different eigenvalues
are orthogonal.
2. (4 points) In usual cases, the regression output θ is not in the same direction with pc1
(and pc2). Explain why θ and pc1 are usually different in direction. You can use whatever
is relevant to help you illustrate, such as figures or maths. (Hint: differences of PCA and
linear regression in their optimisation objective.)
Solution. Both linear regression and PCA aim to find a model to minimise a loss function.
Linear regression aims to find a subspace such that the predicted output and the model
output are close. In comparison, PCA wants to find a subspace such that the orthogonal
projection of samples onto this subspace is minimised. See figure below for an illustration.
Regression
PCA
Figure 4: Difference between PCA and regression.
3. (4 points) On your paper, draw an example dataset for which θ and pc1 are of the same
direction. Your figure should contain the x-axis, the y-axis, at least 3 data points, as well
as θ and pc1 (the latter two should be overlapping). If necessary, write the coordinates of
the data points.
8
Solution. Many scenarios. For example, if three different points are on the same line, θ
and pc1 overlap. Two examples are shown below.
(-1, 0) (0, 0) (1, 0)
(-2, -1)
(-2, 1) (2, 1)
(2, -1)
Example 1 Example 2
Figure 5: Two examples of overlapping PCA and regression outputs.
4. (5 points) Let a = [x1, x2, ..., xN ]
T ∈ RN , and b = [y1, y2, ..., yN ]T ∈ RN . On this centered
dataset, show that when aTb = 0, the regression output is the x-axis. (We assume the MSE
as loss function)
Solution. The model of linear regression is
y = θ1x+ θ0 = θx.
The closed form solution of linear regression is
[θ1, θ0] = θ = (X
TX)−1XTy,
where X =
[
x1 x2 ... xN
1 1 ... 1
]T
,y = [y1, y2, ..., yN ]
T .
Because aTb = 0 and y1 + y2 + ...+ yN = 0, we obtain that
XTy = [
N∑
i=1
xiyi,
N∑
i=1
yi] = [0, 0].
Therefore, the output of linear regression is the x-axis, i.e., y = 0.
5. (4 points) Continuing from Question 4, calculate the covariance matrix of this dataset (you
do not have to do standardization). When
∑N
i=1 x
2
i>
∑N
i=1 y
2
i , show that the first principal
component pc1 is horizontal.
Solution. The covariance matrix is calculated as,
S =
1
N
[
x1 x2 ... xN
y1 y2 ... yN
] [
x1 x2 ... xN
y1 y2 ... yN
]T
=
1
N
[ ∑N
i=1 x
2
i
∑N
i=1 xiyi∑N
i=1 xiyi
∑N
i=1 y
2
i
]
=
1
N
[∑N
i=1 x
2
i 0
0
∑N
i=1 y
2
i
]
.
When
∑N
i=1 x
2
i 6=
∑N
i=1 y
2
i , the two eigenvalues are
λ1 =
N∑
i=1
x2i , λ2 =
N∑
i=1
y2i .
When
∑N
i=1 x
2
i>
∑N
i=1 y
2
i , the eigenvector corresponding to λ1 is the principal component
pc1. We calculate this eigenvector by solving the following equation system,
1
N
[∑N
i=1 x
2
i 0
0
∑N
i=1 y
2
i
]
b = λ1b.
Substituting λ1 with
∑N
i=1 x
2
i , we can obtain the eigenvector as
b = [1, 0]T ,
which is horizontal.
9
• Section 7. Classification (12 points)
You have developed a linear classifier to classify the sentiment of a sentence into positive and
negative. Assume that positive sentences and negative sentences have equal numbers in both
training and testing sets. Your classifier obtains an accuracy of 40% on the test data.
1. (2 points) Is this classifier meaningful? Explain your answer in two sentences.
Solution. No. Random guess would have 50% accuracy, so the classifier is not meaningful.
2. (3 points) Without re-training the classifier, use three sentences to describe how you improve
the previous classifier and why it becomes better.
Solution. Negate every prediction, i.e., if the prediction is negative then change it to
positive and vice versa. Then we get a classifier that has 60% accuracy.
PCA is a useful technique to project data samples onto a lower-dimensional subspace that
preserves data variance. Oftentimes, it is used to preprocess features before training a classifier.
3. (4 points) You have four data points of two classes. Their class labels (A or B) and coordi-
nates are listed below.
- Class A: (10, 1) and (-10, 1).
- Class B: (10, -1) and (-10, -1)
For this case, is it a useful step to project the data onto the first principle component before
training a classifier? If yes, draw the decision boundary after the projection. If no, briefly
explain your answer.
Solution. No. The first principal component is the x-axis. If we project the points on the
x-axix, (10, 1) and (10, -1) will overlap, and (-10, 1) and (-10, -1) will overlap. The two
classes are no longer separable. In this question, the informative dimension is trimmed off.
4. (3 points) Explain why PCA is helpful for classifier training in many real-world cases.
Solution. In real-world cases, features have a higher dimension and there might not be
sufficient training data. PCA allows us to remove the redundant features, alleviate overfit-
ting and improve training efficiency. A student would get full marks if the answer contains
“remove redundant features”.
——— End of the paper ——–
10

欢迎咨询51作业君