COMP0036
LSA - Assignment 1
September 16, 2019
1 Question 1
Using the ordinary least squares (OLS) method to fit such a curve model is not feasible, because OLS does not yield a unique solution here.
A cubic function to model the relationship between x and y takes the following form:
$$y = \sum_{\substack{n_i \ge 0 \\ n_1+n_2+n_3+n_4+n_5 \le 3}} w_{n_1 n_2 n_3 n_4 n_5} \prod_{1 \le i \le 5} x_i^{n_i}$$
The number of free parameters in this model is $\binom{8}{3} = 56$ (one weight per monomial of total degree at most 3 in five variables), which is larger than the sample number 50. Let $Z$ be the design matrix of this model; the OLS solution then takes the form:
$$w = (Z^T Z)^{-1} Z^T y$$
As mentioned before, $Z$ is a $50 \times 56$ matrix, so $Z^T Z$ is a $56 \times 56$ matrix of rank at most 50 and hence non-invertible. That is to say, we do not have enough equations to determine all the unknowns.
If one of the attributes is discarded from the cubic model,then the number of the inde-
pendent variables in the model is C37 = 35 ,and the dimension of the design matrix is
50× 35. We now have enough equations to determine the unknown parameters. So the
problem in (a) can be solved.
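The counting argument above can be checked numerically. The sketch below uses synthetic data (the sample size 50 and five attributes follow the setting of the question; the random seed is arbitrary):

```python
from itertools import combinations_with_replacement
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 50, 5
X = rng.normal(size=(n_samples, n_vars))

# All monomials x1^n1 * ... * x5^n5 with n1 + ... + n5 <= 3.
exponents = [c for d in range(4)
             for c in combinations_with_replacement(range(n_vars), d)]
print(len(exponents))  # 56, i.e. C(8,3)

# Design matrix: one column per monomial (d = 0 gives the intercept column).
Z = np.column_stack([np.prod(X[:, list(idx)], axis=1) for idx in exponents])
print(Z.shape)  # (50, 56)

# Z^T Z is 56 x 56 but has rank at most 50, so it is singular.
print(np.linalg.matrix_rank(Z.T @ Z))
```

With only 50 rows, the 56 columns of $Z$ cannot be linearly independent, which is exactly why the normal equations have no unique solution.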
In (a) we indeed encountered an overfitting problem: the model pays too much attention to fitting the in-sample data and performs poorly on out-of-sample data. To improve this, we need a shrinkage method that can control the complexity of the model. Ridge regression does this by adding a penalty on the weights: it imposes an $\ell_2$ constraint $\|w\|_2^2 < t$ for some threshold $t$. Using the theory of Lagrange multipliers, we can rewrite the objective function as:
$$\operatorname*{argmin}_{w} \; \frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - w x^{(i)}\right)^2 + \frac{\lambda}{2} \|w\|_2^2$$
Setting the gradient to zero, we get the closed-form solution $w = (X^T X + \lambda I)^{-1} X^T y$. In this question our design matrix is $Z$, so we replace $X$ with $Z$. The rank of $Z^T Z + \lambda I$ is always 56 (for any $\lambda > 0$), so $Z^T Z + \lambda I$ is invertible and we can find a unique solution for the model.
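A minimal numerical sketch of the ridge solution, using a synthetic stand-in matrix of the same $50 \times 56$ shape (the value $\lambda = 0.1$ and the seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 56                    # same shape as the design matrix Z above
Z = rng.normal(size=(n, p))      # illustrative stand-in for the real Z
y = rng.normal(size=n)
lam = 0.1                        # arbitrary positive regularisation strength

A = Z.T @ Z + lam * np.eye(p)    # Z^T Z + lambda * I
w = np.linalg.solve(A, Z.T @ y)  # ridge solution; solve() avoids an explicit inverse

# Adding lambda * I lifts every eigenvalue of Z^T Z by lambda, so A is full rank.
print(np.linalg.matrix_rank(A))  # 56
```

Even though $Z^T Z$ alone is singular, the shifted matrix is positive definite, so the linear solve always succeeds.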
If we need to do feature selection and keep only the most important features, ridge regression is quite limited; we should use the LASSO method instead. Different from ridge regression, the lasso uses an $\ell_1$-norm penalty, which shrinks some weights exactly to 0. So if we want to keep the important features in our model and discard the unimportant ones, we can add an $\ell_1$-norm regulariser to the objective function and use a subgradient or proximal method (the $\ell_1$ term is not differentiable at 0) to find the optimal solution. Finally, we just need to keep the features with nonzero weights.
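One concrete solver for the $\ell_1$-penalised objective is proximal gradient descent (ISTA) with soft-thresholding. The sketch below is one possible implementation on synthetic data with two informative features planted among ten; the sizes, the seed, and $\lambda = 20$ are all arbitrary illustrative choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1-norm: shrinks each entry toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise 0.5*||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz const of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)             # gradient of the smooth squared-error part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [3.0, -2.0]                 # only features 0 and 3 are informative
y = X @ true_w + 0.1 * rng.normal(size=100)

w = lasso_ista(X, y, lam=20.0)
print(np.nonzero(w)[0])                      # indices of the retained features
```

The soft-thresholding step sets sub-threshold coordinates exactly to zero at every iteration, which is what makes the lasso perform feature selection rather than mere shrinkage.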
2 Question 2
Since $\epsilon \sim N(0, \alpha^2)$ and $x$ are independent, we have $y \sim N(wx, \alpha^2)$.
$$P(S|w) = \prod_{i=1}^{n} P\left((x^{(i)}, y^{(i)})\mid w\right) = \prod_{i=1}^{n} (2\pi\alpha^2)^{-\frac{1}{2}} \exp\left(-\frac{(y^{(i)} - w x^{(i)})^2}{2\alpha^2}\right) = (2\pi\alpha^2)^{-\frac{n}{2}} \exp\left(-\frac{\sum_{i=1}^{n}(y^{(i)} - w x^{(i)})^2}{2\alpha^2}\right)$$
So we can read off $A$ and $B$ as $A = (2\pi\alpha^2)^{-\frac{n}{2}}$ and $B = -\dfrac{\sum_{i=1}^{n}(y^{(i)} - w x^{(i)})^2}{2\alpha^2}$.
In the Bayesian view, $w$ is now a random variable with prior $w \sim N(0, \beta^2)$. As said before, $y \sim N(wx, \alpha^2)$, so we can write the posterior distribution of $w$ as:
$$P(w|x, y) = \frac{P(x, y, w)}{P(x, y)} = \frac{P(y|x, w)\, P(w)}{P(x, y)}$$
Since $P(x, y)$ is a constant that does not depend on $w$, we write $P(x, y) = Z$; then
$$P(w|x, y) = \frac{1}{Z} P(y|x, w) P(w) = \frac{1}{Z} (2\pi\alpha^2)^{-\frac{1}{2}} \exp\left(-\frac{(y - wx)^2}{2\alpha^2}\right) (2\pi\beta^2)^{-\frac{1}{2}} \exp\left(-\frac{w^2}{2\beta^2}\right) = \frac{1}{Z} (2\pi\alpha^2)^{-\frac{1}{2}} (2\pi\beta^2)^{-\frac{1}{2}} \exp\left(-\left(\frac{(y - wx)^2}{2\alpha^2} + \frac{w^2}{2\beta^2}\right)\right)$$
The maximum a posteriori (MAP) estimate can be derived from the following equivalence:
$$\operatorname*{argmax}_{w} P(w|x, y) \;\Leftrightarrow\; \operatorname*{argmax}_{w} P(y|x, w)\, P(w)$$
Then we use the log-likelihood to find this solution; the log-posterior (up to constants) takes the following form:
$$L = \ln\left(\prod_{i=1}^{n} P(y^{(i)}|x^{(i)}, w)\, P(w)\right) = -\frac{1}{2\alpha^2} \sum_{i=1}^{n} \left(y^{(i)} - w x^{(i)}\right)^2 - \frac{1}{2\beta^2} w^2 + \text{const}$$
Taking the derivative of the above function and setting it to zero, we get the MAP estimate of $w$:
$$w_{\text{MAP}} = \frac{\sum_{i=1}^{n} x^{(i)} y^{(i)}}{\alpha^2} \times \left(\frac{\sum_{i=1}^{n} (x^{(i)})^2}{\alpha^2} + \frac{1}{\beta^2}\right)^{-1}$$
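The closed form can be checked numerically against a direct maximisation of the log-posterior on a grid. This is a sketch on synthetic data; the values of $\alpha$, $\beta$, $n$, and the seed are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 0.5, 2.0              # noise std alpha, prior std beta
n = 30
x = rng.normal(size=n)
y = 1.5 * x + alpha * rng.normal(size=n)

# Closed-form MAP estimate from the derivation above.
w_map = (x @ y / alpha**2) / (x @ x / alpha**2 + 1.0 / beta**2)

# Cross-check: maximise the log-posterior directly on a fine grid.
grid = np.linspace(-5, 5, 20001)
log_post = (-((y[:, None] - grid[None, :] * x[:, None]) ** 2).sum(axis=0) / (2 * alpha**2)
            - grid**2 / (2 * beta**2))
w_grid = grid[np.argmax(log_post)]
print(abs(w_map - w_grid) < 1e-3)   # True: the grid maximiser matches the formula
```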
No, this is not the unique Bayesian answer: MAP estimation gives only a point estimate of $w$, while in Bayesian theory $w$ is a random variable with a full posterior distribution. We only use MAP as a sort of halfway house in Bayesian inference.
The MAP estimate of $w$ is reminiscent of the maximum likelihood estimator (MLE). From the analysis in (c) we can see that the biggest difference between the MAP and MLE methods is that MAP takes the prior distribution of $w$ into consideration. So if the prior distribution of $w$ is a uniform distribution, the two methods give the same result.
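This can be illustrated numerically: as $\beta$ grows, the Gaussian prior flattens toward a uniform one, $1/\beta^2 \to 0$, and $w_{\text{MAP}}$ approaches the MLE. A sketch with synthetic data (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.5
x = rng.normal(size=40)
y = 2.0 * x + alpha * rng.normal(size=40)

w_mle = x @ y / (x @ x)   # maximum likelihood (ordinary least squares) estimate

for beta in [0.1, 1.0, 100.0]:
    w_map = (x @ y / alpha**2) / (x @ x / alpha**2 + 1.0 / beta**2)
    print(beta, w_map)    # w_map approaches w_mle as beta grows
```

A small $\beta$ shrinks the estimate strongly toward the prior mean 0; a large $\beta$ makes the prior term negligible.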
3 Question 3
Discriminative classification, also known as the distribution-free approach, learns a classifier by optimising a loss function directly. The SVM, for example, simply learns a decision rule that minimises the classification loss. This differs from generative classification methods, which make use of the probability distribution of the samples and give the optimal result by maximising the posterior probability that a sample belongs to a class.
The main purpose of classification is to find a relationship between the observed sample $x$ and the class label $y$. Since $y$ is a binary variable, we introduce a latent variable $y^*$ and map the latent variable to $y$ by:
$$y = \begin{cases} 1, & \text{if } y^* \ge 0 \\ 0, & \text{if } y^* < 0 \end{cases}$$
We assume $y^*$ has the following relationship with $x$: $y^* = wx + \epsilon$, where $\epsilon \sim \text{Logistic}(0, 1)$. Then:
$$P_Y(y = 1|x) = P(y^* \ge 0|x) = P(wx + \epsilon \ge 0) = P(\epsilon \ge -wx) = 1 - P(\epsilon < -wx) = 1 - \frac{1}{1 + e^{wx}} = \frac{e^{wx}}{1 + e^{wx}} = \frac{1}{1 + e^{-wx}}$$
And this is equivalent to the posterior probability described in question (b).
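The derivation can be verified by Monte Carlo simulation of the latent-variable model (the values $w = 0.8$ and $x = 1.3$ are arbitrary illustrations):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
w, x = 0.8, 1.3                                # arbitrary illustrative values
eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)

p_mc = np.mean(w * x + eps >= 0)               # P(y* >= 0) estimated by simulation
p_sigmoid = 1.0 / (1.0 + math.exp(-w * x))     # 1 / (1 + e^{-wx})
print(abs(p_mc - p_sigmoid) < 0.01)            # True: simulation matches the formula
```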
If we allow the logistic parameters to take more general values $a \in \mathbb{R}$ and $b > 0$, the final logistic regression model takes the form:
$$P_Y(y = 1|x) = \frac{1}{1 + \exp\left(-\frac{wx + a}{b}\right)}$$
Then, unlike the logistic regression model stated in (b), we get a different linear discriminant separating hyperplane, defined in this situation by:
$$wx + a = 0$$
When we use a Gaussian noise latent variable model, i.e. we assume $\epsilon \sim N(0, 1)$, then similarly to (b) we can write the posterior $P_Y(y|x)$ as:
$$P_Y(y = 1|x) = P(y^* \ge 0|x) = P(wx + \epsilon \ge 0) = P(\epsilon \ge -wx) = 1 - P(\epsilon < -wx) = 1 - \Phi(-wx) = \Phi(wx) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{wx} e^{-\frac{t^2}{2}}\, dt$$
And $P_Y(y = 0|x) = 1 - P_Y(y = 1|x)$.
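As before, the probit posterior can be checked by simulating the Gaussian latent-variable model; $\Phi$ can be computed via the error function, $\Phi(z) = \tfrac{1}{2}(1 + \operatorname{erf}(z/\sqrt{2}))$ (same arbitrary $w$ and $x$ as above):

```python
import math
import numpy as np

rng = np.random.default_rng(6)
w, x = 0.8, 1.3                                           # arbitrary illustrative values
eps = rng.normal(size=1_000_000)                          # standard Gaussian noise

p_mc = np.mean(w * x + eps >= 0)                          # P(y* >= 0) by simulation
p_probit = 0.5 * (1.0 + math.erf(w * x / math.sqrt(2)))   # Phi(wx) via erf
print(abs(p_mc - p_probit) < 0.01)                        # True
```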
There is indeed a difference between these two models when it comes to outliers in the sample. Comparing the two distributions, we find that the CDF of the model derived in (d) rises more steeply than that of the model derived in (b), because the Gaussian has lighter tails than the logistic distribution. As a result, the Gaussian noise latent variable model is less robust to outliers than the logistic noise latent variable model.
When $k = 2$, there are two classes. In order to conform to our previous notation, we label the two classes 0 and 1. Then:
$$P_Y(y = 1|x) = \frac{e^{w_1 x}}{e^{w_0 x} + e^{w_1 x}} = \frac{1}{1 + e^{-(w_1 - w_0)x}}$$
Letting $w = w_1 - w_0$, we can see that this is exactly logistic regression.
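The reduction can be checked directly: the second softmax component with scores $(w_0 x, w_1 x)$ equals the sigmoid of $(w_1 - w_0)x$ (the parameter values below are arbitrary illustrations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

w0, w1, x = -0.4, 1.1, 2.0      # arbitrary illustrative parameters
p_softmax = softmax(np.array([w0 * x, w1 * x]))[1]
p_sigmoid = 1.0 / (1.0 + np.exp(-(w1 - w0) * x))   # logistic with w = w1 - w0
print(bool(np.isclose(p_softmax, p_sigmoid)))       # True
```

This also shows why the two-class softmax is over-parameterised: only the difference $w_1 - w_0$ matters.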
For multinomial logistic regression with $K > 2$, a sample is assigned to class $j$ ($1 \le j \le K$) when $P_Y(y = j|x) > P_Y(y = i|x)$ for all $i \ne j$, $1 \le i \le K$. That is:
$$\frac{\exp(w_j x)}{\sum_{k=1}^{K} \exp(w_k x)} > \frac{\exp(w_i x)}{\sum_{k=1}^{K} \exp(w_k x)} \;\Rightarrow\; \exp(w_j x) > \exp(w_i x) \;\Rightarrow\; w_j x > w_i x \;\Rightarrow\; (w_j - w_i)x > 0$$
So we can obtain the classification boundary for multinomial logistic regression, and we can see that it consists of a series of linear discriminant separating hyperplanes, one for each pair of classes.