COMP0036

LSA - Assignment 1

September 16, 2019


1 Question 1

1.1 Answer for (a)

Using the ordinary least squares (OLS) method to find such a curve model is not feasible, because OLS leads to no unique solution.

A cubic function modelling the relationship between x and y takes the following form:

y = Σ_{n_i ≥ 0, n_1+n_2+n_3+n_4+n_5 ≤ 3} w_{n_1 n_2 n_3 n_4 n_5} ∏_{1 ≤ i ≤ 5} x_i^{n_i}

The number of independent variables (distinct monomials) in this model is C_8^3 = 56, which is larger than the sample size 50. Let Z denote the corresponding design matrix of this model; the OLS solution then takes the form:

W = (Z^T Z)^{-1} Z^T y

As mentioned before, Z is a 50 × 56 matrix, so rank(Z^T Z) ≤ rank(Z) ≤ 50 < 56 and Z^T Z is non-invertible. That is to say, we do not have enough equations to determine all of the unknowns.
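The counting and rank arguments above can be checked numerically. The sketch below (a toy 6 × 10 random design standing in for the actual 50 × 56 matrix; the `rank` helper is an illustrative stdlib implementation, not part of the assignment) verifies the monomial counts with `math.comb` and shows that a "fat" design matrix always yields a singular Gram matrix:

```python
import math
import random

# Monomials x1^n1 * ... * x5^n5 with n1+...+n5 <= 3: stars and bars
# gives C(5+3, 3) = 56, and C(4+3, 3) = 35 once an attribute is dropped.
assert math.comb(8, 3) == 56
assert math.comb(7, 3) == 35

def rank(mat, eps=1e-9):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in mat]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]), default=None)
        if pivot is None or abs(m[pivot][c]) < eps:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r

# Toy fat design matrix: 6 samples, 10 features (analogue of 50 x 56).
random.seed(0)
Z = [[random.gauss(0, 1) for _ in range(10)] for _ in range(6)]
ZtZ = [[sum(Z[k][i] * Z[k][j] for k in range(6)) for j in range(10)]
       for i in range(10)]
# rank(Z^T Z) <= number of samples, so the 10 x 10 Gram matrix is singular.
assert rank(ZtZ) <= 6 < 10
```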

1.2 Answer for (b)

If one of the attributes is discarded from the cubic model, then the number of independent variables in the model is C_7^3 = 35, and the dimension of the design matrix is 50 × 35. We now have enough equations to determine the unknown parameters, so the problem in (a) can be solved.

1.3 Answer for (c)

In (a) we indeed encountered an overfitting problem: the model pays much more attention to its performance on the in-sample data and performs poorly on the out-of-sample data. To improve this, we need to consider a shrinkage method that can control the complexity of the model. Ridge regression does this by adding a penalty on the weights: it imposes an ℓ2 constraint ‖w‖_2^2 < t for some threshold t. Using the theory of Lagrange multipliers, we can rewrite the objective function as:

argmin_w (1/2) Σ_{i=1}^{n} (y^(i) − w x^(i))^2 + (λ/2) ‖w‖_2^2

Setting the gradient with respect to w to zero, we obtain the closed-form solution w = (X^T X + λI)^{-1} X^T y. In this question our design matrix is Z, so we replace X by Z. Since Z^T Z is positive semi-definite, Z^T Z + λI has full rank 56 for any λ > 0, so it is invertible and we can find a unique solution for the model.
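A tiny numerical sketch of why the ridge shift restores invertibility (an illustrative 2 × 2 example, not the actual 56-dimensional problem): the Gram matrix of a rank-deficient design is singular, but adding λI to its diagonal makes its determinant strictly positive.

```python
# Z^T Z for a rank-deficient Z: A = [[1, 1], [1, 1]] has det(A) = 0.
A = [[1.0, 1.0], [1.0, 1.0]]
det = lambda M: M[0][0] * M[1][1] - M[0][1] * M[1][0]
assert det(A) == 0.0

lam = 0.1  # any lambda > 0 works
A_ridge = [[A[0][0] + lam, A[0][1]], [A[1][0], A[1][1] + lam]]
# det(A + lam*I) = (1 + lam)^2 - 1 = lam^2 + 2*lam > 0: invertible.
assert abs(det(A_ridge) - (lam**2 + 2 * lam)) < 1e-12

# Explicit 2x2 inverse; in the regression this is the factor
# (Z^T Z + lam*I)^{-1} appearing in the ridge solution.
d = det(A_ridge)
A_inv = [[A_ridge[1][1] / d, -A_ridge[0][1] / d],
         [-A_ridge[1][0] / d, A_ridge[0][0] / d]]
```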


1.4 Answer for (d)

If we need to do feature selection and keep only the most important features, ridge regression is quite limited; we should use the LASSO method instead. Unlike ridge regression, the LASSO uses an ℓ1-norm penalty, which shrinks some weights exactly to 0. So if we want to keep the important features in our model and discard the unimportant ones, we can add an ℓ1-norm regularisation term to the objective function and use a subgradient or coordinate descent method to find the optimal solution. Finally, we just keep the features with nonzero weights.
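The zeroing effect can be made concrete: for an orthonormal design, the LASSO solution is the soft-thresholded OLS solution. The sketch below (illustrative weights and threshold, assuming orthonormal features) shows weights being driven exactly to zero:

```python
def soft_threshold(z, lam):
    """Proximal operator of the l1-norm: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# OLS weights for an (assumed) orthonormal design; lambda = 0.5.
w_ols = [2.0, 0.25, -1.25, -0.125]
w_lasso = [soft_threshold(w, 0.5) for w in w_ols]
# Small weights are shrunk exactly to zero -> automatic feature selection.
assert w_lasso == [1.5, 0.0, -0.75, 0.0]
```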


2 Question 2

2.1 Answer for (a)

Since ε ∼ N(0, α^2) and x is independent of ε, we have y ∼ N(wx, α^2). Hence:

P(S|w) = ∏_{i=1}^{n} P((x^(i), y^(i))|w)
       = ∏_{i=1}^{n} (2πα^2)^{-1/2} exp(−(y^(i) − wx^(i))^2 / (2α^2))
       = (2πα^2)^{-n/2} exp(−Σ_{i=1}^{n} (y^(i) − wx^(i))^2 / (2α^2))

So we can read off A and B as A = (2πα^2)^{-n/2} and B = −Σ_{i=1}^{n} (y^(i) − wx^(i))^2 / (2α^2).
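A quick numerical sanity check of this factorisation (toy data; the values of w and α are illustrative), comparing the product of per-sample densities with the A · exp(B) form:

```python
import math
import random

random.seed(1)
w, alpha = 0.8, 1.3
xs = [random.uniform(-2, 2) for _ in range(6)]
ys = [w * x + random.gauss(0, alpha) for x in xs]

# Likelihood as a product of per-sample Gaussian densities.
prod = 1.0
for x, y in zip(xs, ys):
    prod *= (2 * math.pi * alpha**2) ** -0.5 * math.exp(
        -(y - w * x) ** 2 / (2 * alpha**2))

# The same likelihood in the A * exp(B) form derived above.
n = len(xs)
A = (2 * math.pi * alpha**2) ** (-n / 2)
B = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * alpha**2)

assert math.isclose(prod, A * math.exp(B))
```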

2.2 Answer for (b)

In the Bayesian view, w is now a random variable with w ∼ N(0, β^2). As said before, y ∼ N(wx, α^2), so we can get the posterior distribution of w as:

P(w|x, y) = P(x, y, w) / P(x, y) = P(y|x, w) P(w) / P(x, y)

We can see that P(x, y) is a constant that does not depend on w. Writing P(x, y) = Z (a normalising constant, not the design matrix of Question 1), we have:

P(w|x, y) = (1/Z) P(y|x, w) P(w)
          = (1/Z) (2πα^2)^{-1/2} exp(−(y − wx)^2 / (2α^2)) · (2πβ^2)^{-1/2} exp(−w^2 / (2β^2))
          = (1/Z) (2πα^2)^{-1/2} (2πβ^2)^{-1/2} exp(−((y − wx)^2 / (2α^2) + w^2 / (2β^2)))

2.3 Answer for (c)

The maximum a posteriori (MAP) estimate is obtained from the following equivalence:

argmax_w P(w|x, y) ⇔ argmax_w P(y|x, w) P(w)

Then we use the log-likelihood to find this solution; over the whole sample it takes the following form:

L = ln(∏_{i=1}^{n} P(y^(i)|x^(i), w) P(w))
  = −(1/(2α^2)) Σ_{i=1}^{n} (y^(i) − wx^(i))^2 − w^2/(2β^2) + const

Setting the derivative of the above function to zero, we get the MAP estimate of w:

w_MAP = (Σ_{i=1}^{n} x^(i) y^(i) / α^2) × (Σ_{i=1}^{n} (x^(i))^2 / α^2 + 1/β^2)^{-1}
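The closed form can be sanity-checked against a brute-force maximisation of the log posterior (toy data; the values of α, β, and the true weight are illustrative):

```python
import math
import random

random.seed(2)
alpha, beta = 1.0, 2.0
true_w = 1.5
xs = [random.uniform(-3, 3) for _ in range(20)]
ys = [true_w * x + random.gauss(0, alpha) for x in xs]

def log_posterior(w):
    # -(1/(2a^2)) * sum (y - wx)^2 - w^2 / (2b^2), up to a constant.
    return (-sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * alpha**2)
            - w**2 / (2 * beta**2))

# Closed-form MAP estimate from the derivation above.
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
w_map = (sxy / alpha**2) / (sxx / alpha**2 + 1 / beta**2)

# Brute-force grid maximisation agrees with the closed form.
grid = [i / 1000 for i in range(-4000, 4000)]
w_grid = max(grid, key=log_posterior)
assert abs(w_map - w_grid) < 1e-2
```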


2.4 Answer for (d)

No, this solution is not the full answer: MAP estimation gives only a point estimate of w, while in Bayesian theory w is a random variable with a full posterior distribution. Using MAP is only a sort of halfway house in Bayesian inference.

2.5 Answer for (e)

The MAP estimate of w is reminiscent of the maximum likelihood estimator (MLE). From the analysis in (c) we can see that the biggest difference between the MAP and MLE methods is that MAP takes the prior distribution of w into consideration. So if the prior distribution of w is a uniform distribution, the two methods will give the same result.
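Relatedly, as the prior variance β^2 grows (the prior flattens out), the 1/β^2 term in the MAP formula vanishes and MAP converges to MLE. A small numerical illustration (toy data, illustrative values):

```python
import random

random.seed(3)
alpha = 1.0
xs = [random.uniform(-2, 2) for _ in range(15)]
ys = [0.9 * x + random.gauss(0, alpha) for x in xs]
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_mle = sxy / sxx  # maximum likelihood estimate

def w_map(beta):
    # MAP estimate from (c); the 1/beta^2 term encodes the prior.
    return (sxy / alpha**2) / (sxx / alpha**2 + 1 / beta**2)

# A flat prior (huge beta) recovers the MLE; a tight prior shrinks more.
assert abs(w_map(1e6) - w_mle) < 1e-9
assert abs(w_map(0.1) - w_mle) > abs(w_map(10.0) - w_mle)
```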


3 Question 3

3.1 Answer for (a)

Discriminative classification, also known as the distribution-free approach, learns a classifier by optimising a loss function directly. An SVM, for example, simply learns a decision boundary that minimises the classification loss. This differs from generative classification methods, which model the probability distribution of the samples and give the optimal result by maximising the posterior probability that a sample belongs to a class.

3.2 Answer for (b)

The main purpose of classification is to find a relationship between the observed sample x and the class label y. Since y is a binary variable, we introduce a latent variable y* and map the latent variable to y by:

y = 1 if y* ≥ 0,  and  y = 0 if y* < 0

We assume y* has the following relationship with x: y* = wx + ε, with ε ∼ Logistic(0, 1). Then:

P_Y(y = 1|x) = P(y* ≥ 0|x)
             = P(wx + ε ≥ 0)
             = P(ε ≥ −wx)
             = 1 − P(ε < −wx)
             = 1 − 1/(1 + e^{wx})
             = e^{wx}/(1 + e^{wx})
             = 1/(1 + e^{−wx})

And this is equivalent to the posterior probability described in question (b).
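The final algebraic steps can be checked numerically, e.g. the identity 1 − 1/(1 + e^{wx}) = 1/(1 + e^{−wx}) (a minimal sketch over arbitrary test points):

```python
import math

def sigmoid(z):
    """Logistic CDF, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    # 1 - P(eps < -z) under Logistic(0, 1) noise equals sigmoid(z).
    lhs = 1.0 - 1.0 / (1.0 + math.exp(z))
    assert math.isclose(lhs, sigmoid(z))
```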

3.3 Answer for (c)

If we allow the logistic parameters to take more general values a ∈ R and b > 0, the final logistic regression model takes the form:

P_Y(y = 1|x) = 1/(1 + exp(−(wx + a)/b))

Then, unlike the logistic regression model stated in (b), we get a different linear discriminant separating hyperplane. The hyperplane in this situation is defined as:

wx + a = 0


3.4 Answer for (d)

When we use a Gaussian noise latent variable model, i.e. we assume ε ∼ N(0, 1), then, similarly to (b), we can describe the posterior P_Y(y|x) as:

P_Y(y = 1|x) = P(y* ≥ 0|x)
             = P(wx + ε ≥ 0)
             = P(ε ≥ −wx)
             = 1 − P(ε < −wx)
             = 1 − Φ(−wx)
             = Φ(wx)
             = (1/√(2π)) ∫_{−∞}^{wx} e^{−t^2/2} dt

And P_Y(y = 0|x) = 1 − P_Y(y = 1|x).
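Φ has no closed form, but it can be evaluated via the standard error function, Φ(z) = (1 + erf(z/√2))/2. A minimal sketch of the probit posterior (the value of w is illustrative):

```python
import math

def probit_p1(w, x):
    """P(y = 1 | x) = Phi(w * x) for the Gaussian noise model."""
    z = w * x
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

w = 1.2
# Probabilities are valid and increase monotonically in x (for w > 0).
for x in [-2.0, -0.1, 0.0, 0.7, 3.0]:
    assert 0.0 <= probit_p1(w, x) <= 1.0
assert probit_p1(w, 3.0) > probit_p1(w, -2.0)
# Phi(0) = 0.5: the decision boundary is wx = 0, as in the logistic case.
assert math.isclose(probit_p1(w, 0.0), 0.5)
```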

3.5 Answer for (e)

There is indeed a difference between these two models when the sample contains outliers. Comparing the two noise distributions, the Gaussian has lighter tails than the logistic, so the probit curve derived in (d) rises more steeply than the logistic curve derived in (b) and assigns vanishingly small probability to points far on the wrong side of the boundary. A badly misclassified point therefore incurs a much larger penalty under the Gaussian noise model, so the logistic noise latent variable model is more robust to outliers than the Gaussian noise latent variable model.
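The tail behaviour can be compared numerically via the negative log-likelihood penalty each model assigns to a point far on the wrong side of the boundary (z = wx; the test values are illustrative):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Penalty -log P(y = 1 | x) for a positive point pushed far negative.
for z in [-2.0, -4.0, -6.0]:
    logistic_penalty = -math.log(sigmoid(z))  # grows roughly like |z|
    probit_penalty = -math.log(phi(z))        # grows roughly like z^2 / 2
    assert probit_penalty > logistic_penalty
```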

3.6 Answer for (f)

When k = 2 there are two classes. To conform to our previous notation, we assume the two classes are 0 and 1. Then:

P_Y(y = 1|x) = e^{w_1 x} / (e^{w_0 x} + e^{w_1 x}) = 1/(1 + e^{−(w_1 − w_0)x})

Let w = w_1 − w_0; then we can see that this is exactly logistic regression.
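The reduction can be verified numerically (illustrative weights):

```python
import math

def softmax2(w0, w1, x):
    """Two-class softmax probability P(y = 1 | x)."""
    e0, e1 = math.exp(w0 * x), math.exp(w1 * x)
    return e1 / (e0 + e1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w0, w1 = -0.4, 1.1
for x in [-2.0, 0.0, 0.5, 3.0]:
    # Two-class softmax collapses to a sigmoid of (w1 - w0) * x.
    assert math.isclose(softmax2(w0, w1, x), sigmoid((w1 - w0) * x))
```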

3.7 Answer for (g)

For multinomial logistic regression with K > 2, a sample belongs to class j (1 ≤ j ≤ K) if and only if P_Y(y = j|x) > P_Y(y = i|x) for all i ≠ j, 1 ≤ i ≤ K. That is:

exp(w_j x) / Σ_{k=1}^{K} exp(w_k x) > exp(w_i x) / Σ_{k=1}^{K} exp(w_k x)
⇒ exp(w_j x) > exp(w_i x)
⇒ w_j x > w_i x
⇒ (w_j − w_i) x > 0


So we can get the classification boundary for multinomial logistic regression, and we can see that it consists of a series of linear discriminant separating hyperplanes of the form (w_j − w_i)x = 0.
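The pairwise-hyperplane characterisation can be checked directly against the softmax argmax (illustrative K = 3 per-class weights):

```python
import math

def softmax(ws, x):
    """Class probabilities P(y = k | x) for per-class weights ws."""
    exps = [math.exp(w * x) for w in ws]
    s = sum(exps)
    return [e / s for e in exps]

ws = [0.5, -1.0, 2.0]  # illustrative weights, K = 3
for x in [-1.5, -0.2, 0.4, 2.0]:
    probs = softmax(ws, x)
    j = max(range(len(ws)), key=lambda k: probs[k])
    # The winning class satisfies (w_j - w_i) * x > 0 for every i != j.
    assert all((ws[j] - ws[i]) * x > 0 for i in range(len(ws)) if i != j)
```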
