Electrical Engineering Data Science

L2 – Linear Methods for Prediction

Dr. Vidhyasaharan Sethu

School of Electrical Engineering & Telecommunications

University of New South Wales

Sydney, Australia


Topics

1. Linear Regression

2. Regularisation

3. Bayesian View of Linear Regression

4. Classification Systems

5. Discriminant functions

6. Logistic Regression

References

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.


Linear Regression

Input data $\to$ Machine $\to$ prediction about some quantity of interest

Input: $\mathbf{x} = (x_1, \dots, x_p)^T \in \mathcal{R}^p$; the machine predicts $\hat{y}$.

Linear model:
$\hat{y} = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$

Parameters of the model: $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$

Residual Sum of Squares:
$\text{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

The least squares estimate is the $\hat\beta$ that corresponds to the minimum RSS.


Least Squares Linear Regression Model

Given a dataset,
$\mathcal{D} = \left\{ (\mathbf{x}_i, y_i) \right\}_{i=1,\dots,N}$

Residual sum of squares:
$\text{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2$

In matrix notation:
$\text{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$

where
$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$ ($N \times 1$), $\quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{bmatrix}$ ($N \times (p+1)$), $\quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$ ($(p+1) \times 1$)


Least Squares Linear Regression Model

For the least squares solution,
$\frac{\partial \text{RSS}(\beta)}{\partial \beta} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = \mathbf{0} \quad \Rightarrow \quad \hat\beta = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$

Noting:
$\frac{\partial^2 \text{RSS}(\beta)}{\partial \beta \, \partial \beta^T} = 2 \mathbf{X}^T\mathbf{X}$, which is positive definite if $\mathbf{X}$ is of full rank, so the stationary point is the unique minimum.

'Hat' matrix:
$\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X} \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{H}\mathbf{y}$

Geometrically, $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the columns of $\mathbf{X}$.
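As a concrete illustration, here is a minimal NumPy sketch (synthetic data; variable names are illustrative) of the closed-form least squares fit and the hat matrix. In practice, routines such as np.linalg.lstsq (QR/SVD based) are preferred over forming the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # N x (p+1), first column of ones
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=N)               # y = X beta + noise

    # Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X
    H = X @ np.linalg.solve(X.T @ X, X.T)
    y_hat = H @ y
    assert np.allclose(y_hat, X @ beta_hat)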


Sequential (on-line) Learning

The least squares solution may be hard to compute in a big data setting,
$\hat\beta = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$

- When $\mathbf{X}$ is very large, this is computationally expensive!
- The data may not be available all at once (e.g., real-time applications where data is continuously streaming).

Consider the error as a sum of per-sample terms,
$E = \sum_{i=1}^{N} E_i \, ; \qquad E_i = \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 = \left( y_i - \mathbf{x}_i^T\beta \right)^2$

Iterative estimation of $\beta$ can then be carried out one sample at a time:
$\beta^{(\tau+1)} = \beta^{(\tau)} - \eta \nabla E_i$

For the least squares case this is (absorbing the constant factor into the learning rate $\eta$),
$\beta^{(\tau+1)} = \beta^{(\tau)} + \eta \left( y_i - \mathbf{x}_i^T \beta^{(\tau)} \right) \mathbf{x}_i$

This is the LMS (Least Mean Squares) algorithm, an instance of stochastic gradient descent.
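A minimal sketch of the LMS update in NumPy (the data stream and learning rate are illustrative; in practice $\eta$ must be small enough for the updates to remain stable):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 3
    beta_true = rng.normal(size=p + 1)
    beta = np.zeros(p + 1)   # initial estimate
    eta = 0.01               # learning rate

    # Process a stream of (x, y) pairs one at a time
    for _ in range(10000):
        x = np.hstack([1.0, rng.normal(size=p)])   # prepend the bias term
        y = x @ beta_true + 0.05 * rng.normal()
        beta += eta * (y - x @ beta) * x           # LMS update

    print(np.round(beta - beta_true, 3))           # should be near zero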


Note on Gradient Descent

The negative of the gradient points in the direction of steepest descent on the cost function surface.

Noisy gradient estimates from single data points (or small batches of data) are used in stochastic gradient descent.

Noisy gradients may be beneficial when negotiating complex cost surfaces (e.g., escaping shallow local minima).

https://en.wikipedia.org/wiki/Gradient_descent


Detour - Principal Component Analysis

The principal components of a set of datapoints in $\mathcal{R}^p$ are a sequence of best linear approximations to that data, of all ranks $q \le p$.

Given the data $\mathbf{x}_1, \dots, \mathbf{x}_N$, we can consider a linear rank-$q$ model for approximating them:
$f(\lambda) = \mu + \mathbf{V}_q \lambda$

where $\mu \in \mathcal{R}^p$, $\mathbf{V}_q$ is a $p \times q$ matrix with orthonormal columns, and $\lambda \in \mathcal{R}^q$.

Fitting this model to the data and obtaining a least squares estimate involves the following minimisation:
$\min_{\mu, \{\lambda_i\}, \mathbf{V}_q} \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mu - \mathbf{V}_q \lambda_i \right\|^2$

It is straightforward to see (by considering the case $\lambda_i = 0$) that
$\hat\mu = \bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$

The optimisation problem then reduces to:
$\min_{\mathbf{V}_q, \{\lambda_i\}} \sum_{i=1}^{N} \left\| \tilde{\mathbf{x}}_i - \mathbf{V}_q \lambda_i \right\|^2, \quad \text{where } \tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.


Detour - Principal Component Analysis

Consider $q = 1$:
$\min_{\mathbf{v}_1, \{\lambda_i\}} \sum_{i=1}^{N} \left\| \tilde{\mathbf{x}}_i - \mathbf{v}_1 \lambda_i \right\|^2 = \min_{\mathbf{v}_1, \{\lambda_i\}} \sum_{i=1}^{N} \left[ \lambda_i^2 \left\| \mathbf{v}_1 \right\|^2 + \left\| \tilde{\mathbf{x}}_i \right\|^2 - 2 \lambda_i \mathbf{v}_1^T \tilde{\mathbf{x}}_i \right]$

Imposing the condition $\|\mathbf{v}_1\| = 1$ and partially differentiating wrt $\lambda_i$:
$\lambda_i = \mathbf{v}_1^T \tilde{\mathbf{x}}_i$

This reduces the minimisation problem to:
$\min_{\mathbf{v}_1} \sum_{i=1}^{N} \left[ \left\| \tilde{\mathbf{x}}_i \right\|^2 - \left( \mathbf{v}_1^T \tilde{\mathbf{x}}_i \right)^2 \right]$

Noting that $\left( \mathbf{v}_1^T \tilde{\mathbf{x}}_i \right)^2 = \mathbf{v}_1^T \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T \mathbf{v}_1$ and writing it in terms of the scatter matrix, $\mathbf{S} = \sum_{i=1}^{N} \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T$, the problem reduces to
$\min_{\mathbf{v}_1} \left( -\mathbf{v}_1^T \mathbf{S} \mathbf{v}_1 \right), \quad \text{subject to } \mathbf{v}_1^T \mathbf{v}_1 = 1$

Equivalently, introducing a Lagrange multiplier $\lambda$,
$\max_{\mathbf{v}_1} \; h = \mathbf{v}_1^T \mathbf{S} \mathbf{v}_1 + \lambda \left( 1 - \mathbf{v}_1^T \mathbf{v}_1 \right)$

Differentiating wrt $\mathbf{v}_1$:
$\frac{\partial h}{\partial \mathbf{v}_1} = 2 \mathbf{S} \mathbf{v}_1 - 2 \lambda \mathbf{v}_1 = 0$


Detour - Principal Component Analysis

i.e.,
$\mathbf{S} \mathbf{v}_1 = \lambda \mathbf{v}_1$

In other words, $\lambda$ is the largest eigenvalue of $\mathbf{S}$ and $\mathbf{v}_1$ is the corresponding eigenvector (the largest eigenvalue is chosen because the quantity being maximised is $\mathbf{v}_1^T \mathbf{S} \mathbf{v}_1 = \lambda$).

Similarly, for $q > 1$, it can be shown that the columns of $\mathbf{V}_q$ are the eigenvectors of $\mathbf{S}$ corresponding to the $q$ largest eigenvalues.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

[Figure (ESL): the rank-2 approximation $f(\lambda) = \bar{\mathbf{x}} + \mathbf{v}_1 \lambda_1 + \mathbf{v}_2 \lambda_2$, with datapoints projected onto the plane spanned by the first two principal components.]
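A minimal NumPy sketch of PCA via the eigendecomposition of the scatter matrix, following the derivation above (the data here is synthetic; for large problems the SVD of the centred data matrix is the usual route):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # N x p, correlated features

    x_bar = X.mean(axis=0)
    X_tilde = X - x_bar                                      # centre the data
    S = X_tilde.T @ X_tilde                                  # scatter matrix, p x p

    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    V_q = eigvecs[:, order[:2]]                              # first q = 2 principal directions

    lam = X_tilde @ V_q                                      # scores: lambda_i = V_q^T (x_i - x_bar)
    X_approx = x_bar + lam @ V_q.T                           # best rank-2 approximation of the data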


Regularisation

Potential issues with the least squares estimate:

- It has low bias but may have high variance, so prediction accuracy may suffer.
- It may be desirable to determine a subset of features that exhibit the strongest effects.

Ridge Regression:
$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

The second term is the regularisation term ($L_2$ regularisation). Note: $\beta_0$ is not included in the penalty.

Equivalent to
$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to } \sum_{j=1}^{p} \beta_j^2 \le t$

There is a one-to-one correspondence between $\lambda$ and $t$.


Regularisation

$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

The solution can be separated into two parts after centering the inputs, $x_{ij} \to x_{ij} - \bar{x}_j$:

- Estimate $\beta_0 = \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
- Estimate $\beta_1, \dots, \beta_p$ by a ridge regression without an intercept.

Ridge solutions are not equivariant under scaling of the inputs, so the inputs are normally standardised before solving. The intercept $\beta_0$ is excluded from the penalty because including it would make the solution dependent on the origin chosen for $y$; i.e., adding a constant $c$ to all $y_i$ would not simply result in $\hat{y}$ being offset by the same amount $c$.

In matrix notation (with centred inputs):
$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \left\{ (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) + \lambda \beta^T \beta \right\}$

where now
$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$ ($N \times 1$), $\quad \mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Np} \end{bmatrix}$ ($N \times p$, no column of ones), $\quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$ ($p \times 1$)


Ridge Regression

$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \left\{ (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) + \lambda \beta^T \beta \right\}$

$\hat\beta^{\text{ridge}} = \left( \mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}$

Adding $\lambda\mathbf{I}$ makes the problem non-singular even if $\mathbf{X}^T\mathbf{X}$ is not of full rank.

Comparing Least Squares ($\hat\beta^{\text{ls}}$) and Ridge Regression ($\hat\beta^{\text{ridge}}$), using the singular value decomposition $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$:

$\mathbf{X}\hat\beta^{\text{ls}} = \mathbf{X} \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{U}\mathbf{U}^T \mathbf{y}$

$\mathbf{X}\hat\beta^{\text{ridge}} = \mathbf{X} \left( \mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y} = \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T \mathbf{y}$

So ridge regression shrinks the coordinates of $\mathbf{y}$ along each $\mathbf{u}_j$ by the factor $d_j^2 / (d_j^2 + \lambda)$.
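A minimal sketch of the closed-form ridge solution and its SVD view (synthetic data; the value of $\lambda$ is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, lam = 50, 4, 5.0
    X = rng.normal(size=(N, p))
    X -= X.mean(axis=0)                                   # centred inputs
    y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)
    y = y - y.mean()                                      # centred target (intercept handled separately)

    # Closed form: beta_ridge = (X^T X + lam I)^{-1} X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # SVD view: X beta_ridge = sum_j u_j [d_j^2 / (d_j^2 + lam)] u_j^T y
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)
    y_hat_svd = U @ (shrink * (U.T @ y))
    assert np.allclose(X @ beta_ridge, y_hat_svd)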


Ridge Regression

(Using the singular value decomposition $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$)

$\mathbf{X}\hat\beta^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T \mathbf{y}$

Note:
$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$

Also, the sample covariance of the centred inputs is $\mathbf{S} = \frac{1}{N} \mathbf{X}^T\mathbf{X}$.

Therefore the $\mathbf{v}_j$ (columns of $\mathbf{V}$) are eigenvectors of the sample covariance, i.e., the principal component directions of $\mathbf{X}$.

Also, $\mathbf{X}\mathbf{v}_j = d_j \mathbf{u}_j$ (from $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$), so small $d_j$ corresponds to input directions of small variance.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

Ridge regression shrinks the directions of least variance (smallest $d_j$) the most.


Tuning the hyperparameter $\lambda$

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

Effective degrees of freedom:
$\text{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$

$\text{df}(\lambda)$ decreases monotonically from $p$ (at $\lambda = 0$, ordinary least squares) towards $0$ as $\lambda \to \infty$, and is a convenient axis against which to tune $\lambda$ (e.g., by cross-validation).
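A small standalone sketch computing $\text{df}(\lambda)$ from the singular values of the centred design matrix (data and $\lambda$ values are illustrative):

    import numpy as np

    def effective_df(X, lam):
        """Effective degrees of freedom of ridge regression at penalty lam."""
        d = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
        return np.sum(d**2 / (d**2 + lam))

    # Example: df shrinks from p towards 0 as lam grows
    X = np.random.default_rng(0).normal(size=(50, 4))
    for lam in [0.0, 1.0, 10.0, 100.0]:
        print(lam, round(float(effective_df(X, lam)), 3))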


Regularisation (Lasso)

$\hat\beta^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right\}$

The second term is the $L_1$ regularisation term.

Equivalent to
$\hat\beta^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to } \sum_{j=1}^{p} \left| \beta_j \right| \le t$

Encourages sparsity – some coefficients will be exactly zero

Makes the problem non-linear – no closed form solution

It is a quadratic programming problem, but efficient algorithms exist (e.g., coordinate descent and least angle regression), as sketched below.
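As an illustration, a minimal coordinate descent sketch for the lasso (one common efficient algorithm; a bare-bones version assuming centred inputs and no intercept, not a production implementation):

    import numpy as np

    def soft_threshold(rho, gamma):
        return np.sign(rho) * np.maximum(np.abs(rho) - gamma, 0.0)

    def lasso_cd(X, y, lam, n_iter=200):
        """Coordinate descent for min ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
        N, p = X.shape
        beta = np.zeros(p)
        z = (X**2).sum(axis=0)                           # per-coordinate curvature x_j^T x_j
        for _ in range(n_iter):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding coordinate j
                rho = X[:, j] @ r_j
                beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    beta_true = np.array([3.0, -2.0] + [0.0] * 8)        # sparse ground truth
    y = X @ beta_true + 0.1 * rng.normal(size=100)
    print(np.round(lasso_cd(X, y, lam=10.0), 2))         # most coefficients driven exactly to zero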


Feature Selection with Lasso

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

[Figure (ESL): profiles of the lasso coefficients as the shrinkage factor $s = t \, / \sum_{j=1}^{p} |\hat\beta_j|$ is varied; coefficients hit exactly zero one at a time, performing feature selection.]


$L_1$ and $L_2$ regularisation (sparsity)

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.

[Figure (ESL): estimation picture for the lasso and ridge regression, showing the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$ together with the elliptical contours of the RSS; the corners of the $L_1$ region favour solutions on the axes, i.e., sparse solutions.]


Bayesian View

Assuming the error model,
$y = \hat{y} + \epsilon = \mathbf{x}^T\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$

we can write,
$p(y \mid \mathbf{x}, \beta, \sigma^2) = \mathcal{N}(y \mid \mathbf{x}^T\beta, \sigma^2)$

In terms of the dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\dots,N}$, the likelihood function is
$p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mathbf{x}_i^T\beta, \sigma^2)$

Taking the logarithm,
$\ln p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = \sum_{i=1}^{N} \ln \mathcal{N}(y_i \mid \mathbf{x}_i^T\beta, \sigma^2)$

$\Rightarrow \ln p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = -N \ln \sigma - \frac{N}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i^T\beta \right)^2$

The maximum likelihood (ML) solution for $\beta$ is therefore equivalent to the least squares solution.

In the Bayesian framework the parameters ($\beta$) are treated as random variables instead of fixed but unknown parameters.


Bayesian View (of Regularisation)

Bayes' theorem states,
$p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2) = \frac{p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) \; p(\beta)}{p(\mathbf{y} \mid \mathbf{X}, \sigma^2)} \qquad \left( \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \right)$

If we assume a Gaussian prior for the parameters ($\beta$) conditional on some hyperparameter ($\alpha$),
$p(\beta \mid \alpha) = \mathcal{N}(\beta \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$

we obtain a Gaussian posterior,
$p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2, \alpha) = \mathcal{N}(\beta \mid \mathbf{m}, \mathbf{S})$

where,
$\mathbf{m} = \frac{1}{\sigma^2} \mathbf{S} \mathbf{X}^T \mathbf{y}, \qquad \mathbf{S} = \left( \alpha\mathbf{I} + \frac{1}{\sigma^2} \mathbf{X}^T\mathbf{X} \right)^{-1}$

Consequently the log posterior is given by,
$\ln p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2, \alpha) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i^T\beta \right)^2 - \frac{\alpha}{2} \beta^T\beta + \text{const}$

The maximum a posteriori (MAP) estimate is therefore equivalent to ridge regression, with $\lambda = \alpha\sigma^2$.
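A small sketch computing the Gaussian posterior and checking that its mean (the MAP estimate) matches ridge regression with $\lambda = \alpha\sigma^2$ (all numerical values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 30, 3
    sigma2, alpha = 0.25, 2.0
    X = rng.normal(size=(N, p))
    y = X @ np.array([1.0, -0.5, 2.0]) + np.sqrt(sigma2) * rng.normal(size=N)

    # Posterior covariance S and mean m under the prior N(0, alpha^{-1} I)
    S = np.linalg.inv(alpha * np.eye(p) + X.T @ X / sigma2)
    m = S @ X.T @ y / sigma2

    # Ridge regression with lambda = alpha * sigma^2 gives the same point estimate
    lam = alpha * sigma2
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    assert np.allclose(m, beta_ridge)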


Visualising Bayesian Learning

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.


Classification

Input data $\to$ Machine $\to$ prediction about some quantity of interest

$\mathbf{x} = (x_1, \dots, x_p)^T \in \mathcal{R}^p$; the machine predicts a class label $\hat\omega \in \{\omega_1, \dots, \omega_K\}$.

The Bayesian Decision Theory picture:
$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j) \, P(\omega_j)}{p(\mathbf{x})}$

Defining a loss function $\lambda(\alpha_i \mid \omega_j)$ that describes the loss incurred by taking action $\alpha_i$ when the true class is $\omega_j$ (note: here $\alpha_i$ is used to denote the decision $\hat\omega = \omega_i$) gives us the expected loss (conditional risk) associated with $\alpha_i$:
$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{K} \lambda(\alpha_i \mid \omega_j) \, P(\omega_j \mid \mathbf{x})$


Bayesian Decision Theory

For any general decision rule $\alpha(\mathbf{x})$, where $\alpha(\mathbf{x})$ assumes one of the values $\alpha_1, \dots, \alpha_K$, the overall risk is given by,
$R = \int R\big(\alpha(\mathbf{x}) \mid \mathbf{x}\big) \, p(\mathbf{x}) \, d\mathbf{x}$

If $\alpha(\mathbf{x})$ is chosen such that $R(\alpha(\mathbf{x}) \mid \mathbf{x})$ is as small as possible for every $\mathbf{x}$, then the overall risk $R$ will be minimised.

$\alpha^*(\mathbf{x}) = \arg\min_{\alpha_i} R(\alpha_i \mid \mathbf{x})$ (the Bayes Decision Rule)

This leads to the minimum overall risk, called the Bayes risk and denoted $R^*$.


Minimum Error Rate Classification

In many classification tasks, only the number of errors/mistakes (the error rate) is of interest. This leads to the so-called symmetric or zero-one loss function,
$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \ne j \end{cases}$

The corresponding conditional risk is then,
$R(\alpha_i \mid \mathbf{x}) = \sum_{j \ne i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$

Therefore for minimum error rate:
Choose $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \;\; \forall j \ne i$


Discriminant Functions

There are many ways to represent classifiers. One of them is in terms of a set of discriminant functions, $\{ g_i(\mathbf{x}) : i = 1, \dots, K \}$, such that:
Choose $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x}) \;\; \forall j \ne i$

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

Bayes classifier: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$

Minimum error rate: $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) \, P(\omega_i)}{\sum_{j=1}^{K} p(\mathbf{x} \mid \omega_j) \, P(\omega_j)}$

Equivalently, $g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i) \, P(\omega_i)$, or $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$

You can replace every $g_i(\mathbf{x})$ by $f\left( g_i(\mathbf{x}) \right)$, where $f(\cdot)$ is monotonically increasing, and the classification is unchanged.


Decision Boundaries/Surfaces

Every decision rule divides the feature space into decision regions, separated by decision boundaries.

The decision regions need not be simply connected.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.


Gaussian Data Distributions – Linear Discriminants

The minimum error rate classification can be achieved by the discriminant functions:
$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$

In the case of a multivariate normal data distribution within each class,
$p(\mathbf{x} \mid \omega_i) = \mathcal{N}(\mathbf{x} \mid \mu_i, \Sigma_i)$

the discriminant functions can be readily evaluated as,
$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$

The first term is the (squared) Mahalanobis distance. The $-\frac{d}{2} \ln 2\pi$ term is independent of the data and class distributions and can be dropped.

Consider the case where the data distributions in all the classes are normal with identical covariance matrices, i.e., $\Sigma_i = \Sigma \;\, \forall i$. This corresponds to a situation where data from all classes fall into hyperellipsoidal clusters of the same shape and size but different locations in the feature space. The $\frac{1}{2} \ln |\Sigma|$ term can now also be dropped, since it is identical for all classes:
$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) + \ln P(\omega_i)$

Expanding the Mahalanobis distance and dropping the quadratic term $\mathbf{x}^T \Sigma^{-1} \mathbf{x}$, which is identical for all the classes, makes the discriminant functions linear,
$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$

where,
$\mathbf{w}_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$
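A minimal sketch of these linear discriminants, with class means, a pooled (shared) covariance, and priors estimated from labelled data (synthetic two-class data; in practice a library implementation such as scikit-learn's LDA would be used):

    import numpy as np

    rng = np.random.default_rng(0)
    # Two Gaussian classes with a shared covariance
    mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
    X = np.vstack([rng.multivariate_normal(m, Sigma, size=100) for m in mu])
    labels = np.repeat([0, 1], 100)

    # Estimate parameters per class, pooling the covariance
    means = [X[labels == i].mean(axis=0) for i in (0, 1)]
    priors = [np.mean(labels == i) for i in (0, 1)]
    Xc = np.vstack([X[labels == i] - means[i] for i in (0, 1)])
    Sigma_hat = Xc.T @ Xc / len(X)
    Sigma_inv = np.linalg.inv(Sigma_hat)

    # Linear discriminants: g_i(x) = w_i^T x + w_i0
    w = [Sigma_inv @ m for m in means]
    w0 = [-0.5 * m @ Sigma_inv @ m + np.log(P) for m, P in zip(means, priors)]

    g = np.stack([X @ w[i] + w0[i] for i in (0, 1)], axis=1)
    pred = np.argmax(g, axis=1)            # choose the class with the largest discriminant
    print("training accuracy:", np.mean(pred == labels))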


Gaussian Data Distributions – Decision Surfaces

The decision surface between decision regions $\mathcal{R}_i$ and $\mathcal{R}_j$ is given by $g_i(\mathbf{x}) = g_j(\mathbf{x})$. In the case of the linear discriminant functions arising from normally distributed data with a shared covariance, the decision surfaces are hyperplanes given by
$\mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0$

where,
$\mathbf{w} = \Sigma^{-1} (\mu_i - \mu_j)$

$\mathbf{x}_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\ln \left[ P(\omega_i) / P(\omega_j) \right]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)$

The hyperplane passes through $\mathbf{x}_0$.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

Arbitrary Gaussian Data Distributions

When the data in every class are normally distributed but with arbitrary covariance matrices, the decision regions need not be simply connected (even for one-dimensional data).

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

The decision surfaces can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids (broadly referred to as hyperquadrics).

Detour - Linear Discriminant Analysis

Principal components give the directions of greatest variance. However, these may not necessarily be the best directions for separating the classes.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

Linear discriminants can help: they identify the hyperplane that best separates classes (clusters) of data that are normally distributed with a shared covariance matrix $\Sigma$. Projections onto the subspace orthogonal to this hyperplane therefore provide maximal separation between the clusters.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York, NY, USA: Springer Series in Statistics.


Detour - Linear Discriminant Analysis

To determine the basis vectors of this subspace we look for the principal components of the set of class centroids (after whitening). Specifically, given data from $K$ classes, we can compute a $K \times p$ matrix of class centroids, $\mathbf{M}$:
$\mathbf{M} = \begin{bmatrix} \mu_{11} & \cdots & \mu_{1p} \\ \vdots & \ddots & \vdots \\ \mu_{K1} & \cdots & \mu_{Kp} \end{bmatrix}$

Given the within-class covariance $\mathbf{W}$ (the shared covariance of all the classes), we can project to a space where the shared covariance is the identity (whitening),
$\mathbf{M}^* = \mathbf{M} \mathbf{W}^{-\frac{1}{2}}$

If the covariance of $\mathbf{M}^*$ is given by $\mathbf{B}^*$, the between-class covariance in the whitened space, then the eigenvectors of $\mathbf{B}^*$ (its principal components) provide the desired coordinates of the optimal subspace. Specifically, the $l$-th linear discriminant direction is given as
$\mathbf{v}_l = \mathbf{W}^{-\frac{1}{2}} \mathbf{v}_l^*$

where the $\mathbf{v}_l^*$ are the columns of $\mathbf{V}^*$ obtained by the following eigendecomposition:
$\mathbf{B}^* = \mathbf{V}^* \mathbf{D}_B \mathbf{V}^{*T}$
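A minimal NumPy sketch of this construction: whiten with $\mathbf{W}^{-1/2}$, then take principal components of the whitened centroids (the data and class structure are synthetic):

    import numpy as np

    rng = np.random.default_rng(0)
    K, p, n = 3, 4, 100
    Sigma = np.diag([1.0, 2.0, 0.5, 1.5])
    centroids = rng.normal(scale=3.0, size=(K, p))
    X = np.vstack([rng.multivariate_normal(c, Sigma, size=n) for c in centroids])
    labels = np.repeat(np.arange(K), n)

    # Within-class covariance W (pooled) and class centroid matrix M (K x p)
    M = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
    Xc = X - M[labels]
    W = Xc.T @ Xc / len(X)

    # Whitening transform W^{-1/2} via eigendecomposition of W
    w_vals, w_vecs = np.linalg.eigh(W)
    W_inv_sqrt = w_vecs @ np.diag(w_vals**-0.5) @ w_vecs.T

    # Principal components of the whitened centroids
    M_star = M @ W_inv_sqrt
    B_star = np.cov(M_star.T, bias=True)          # between-class covariance, whitened space
    b_vals, V_star = np.linalg.eigh(B_star)
    V_star = V_star[:, np.argsort(b_vals)[::-1]]  # sort by decreasing eigenvalue

    V = W_inv_sqrt @ V_star[:, :K - 1]            # discriminant directions (at most K-1 useful)
    Z = X @ V                                     # project data onto the discriminant subspace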


Quantifying Error

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

In a 2-class problem, the classifier separates the feature space into two decision regions, $\mathcal{R}_1$ and $\mathcal{R}_2$.

Errors can arise in two ways: points from $\omega_1$ falling within $\mathcal{R}_2$, and vice versa.

$P(\text{error}) = P(\mathbf{x} \in \mathcal{R}_2, \omega_1) + P(\mathbf{x} \in \mathcal{R}_1, \omega_2)$
$\qquad = P(\mathbf{x} \in \mathcal{R}_2 \mid \omega_1) P(\omega_1) + P(\mathbf{x} \in \mathcal{R}_1 \mid \omega_2) P(\omega_2)$
$\qquad = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1) P(\omega_1) \, d\mathbf{x} + \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2) P(\omega_2) \, d\mathbf{x}$


Receiver Operating Characteristic (ROC) Curve

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

$d'$ denotes discriminability (some measure of the distance between the class distributions), e.g.,
$d' = \frac{\left| \mu_2 - \mu_1 \right|}{\sigma}$
for the 1-D case with equal variances.

Each point on the ROC curve corresponds to a different operating point (decision threshold) of the classifier.


Logistic Regression

Arises from the desire to:

1. Model the posterior probabilities of the classes via linear functions of $\mathbf{x}$ (and treat classification as a regression problem)

2. Ensure that the class posteriors sum to one and remain in $[0, 1]$.

The model has the form,
$\ln \frac{P(\omega_1 \mid \mathbf{x})}{P(\omega_K \mid \mathbf{x})} = \beta_{10} + \beta_1^T \mathbf{x}$
$\ln \frac{P(\omega_2 \mid \mathbf{x})}{P(\omega_K \mid \mathbf{x})} = \beta_{20} + \beta_2^T \mathbf{x}$
$\vdots$
$\ln \frac{P(\omega_{K-1} \mid \mathbf{x})}{P(\omega_K \mid \mathbf{x})} = \beta_{(K-1)0} + \beta_{K-1}^T \mathbf{x}$

Equivalently,
$P(\omega_K \mid \mathbf{x}) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T \mathbf{x}}}$
$P(\omega_k \mid \mathbf{x}) = \frac{e^{\beta_{k0} + \beta_k^T \mathbf{x}}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T \mathbf{x}}}, \quad k = 1, \dots, K-1$


2-Class Logistic Regression

Given data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\dots,N}$, where
$y_i = \begin{cases} 1, & \mathbf{x}_i \in \omega_1 \\ 0, & \mathbf{x}_i \in \omega_2 \end{cases}$

the log-likelihood of the logistic regression model is,
$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \ln P(\omega_1 \mid \mathbf{x}_i, \beta) + (1 - y_i) \ln \left( 1 - P(\omega_1 \mid \mathbf{x}_i, \beta) \right) \right]$

With $P(\omega_1 \mid \mathbf{x}, \beta) = \frac{e^{\beta^T \mathbf{x}}}{1 + e^{\beta^T \mathbf{x}}}$ (absorbing the intercept into $\beta$ by appending a 1 to $\mathbf{x}$), this reduces to,
$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T \mathbf{x}_i - \ln \left( 1 + e^{\beta^T \mathbf{x}_i} \right) \right]$

To obtain the maximum likelihood solution we can set the derivatives to zero,
$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} \mathbf{x}_i \left( y_i - P(\omega_1 \mid \mathbf{x}_i, \beta) \right) = 0$

These are nonlinear equations in $\beta$, so an iterative solution is used (the Newton-Raphson algorithm).


2-Class Logistic Regression

Newton update:
$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}$

In matrix notation,
$\frac{\partial \ell(\beta)}{\partial \beta} = \mathbf{X}^T (\mathbf{y} - \mathbf{p})$

$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X}$ (the Hessian matrix)

Newton update for logistic regression,
$\beta^{\text{new}} = \beta^{\text{old}} + \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})$

where,
$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{bmatrix}, \qquad \mathbf{p} = \begin{bmatrix} P(\omega_1 \mid \mathbf{x}_1, \beta^{\text{old}}) \\ \vdots \\ P(\omega_1 \mid \mathbf{x}_N, \beta^{\text{old}}) \end{bmatrix}$

$\mathbf{W} = \text{diag}(w_1, \dots, w_N), \qquad w_i = P(\omega_1 \mid \mathbf{x}_i, \beta^{\text{old}}) \left( 1 - P(\omega_1 \mid \mathbf{x}_i, \beta^{\text{old}}) \right)$
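A minimal sketch of this Newton-Raphson (iteratively reweighted least squares) update on synthetic data (a small ridge term or step-size control is often added in practice to keep $\mathbf{X}^T \mathbf{W} \mathbf{X}$ well conditioned):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 200, 2
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # prepend a column of ones
    beta_true = np.array([-0.5, 2.0, -1.0])
    y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

    beta = np.zeros(p + 1)
    for _ in range(10):
        prob = 1 / (1 + np.exp(-X @ beta))     # p_i = P(omega_1 | x_i, beta_old)
        W = prob * (1 - prob)                  # diagonal of W
        grad = X.T @ (y - prob)                # X^T (y - p)
        H = X.T @ (W[:, None] * X)             # X^T W X
        beta += np.linalg.solve(H, grad)       # Newton step

    print(np.round(beta, 2))                   # approaches beta_true as N grows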
