# 程序代写案例-L2

Electrical Engineering Data Science
L2 – Linear Methods for Prediction
Dr. Vidhyasaharan Sethu
School of Electrical Engineering & Telecommunications

University of New South Wales
Sydney, Australia
V. Sethu 1
V. Sethu 2
Topics
1. Linear Regression
2. Regularisation
3. Bayesian View of Linear Regression
4. Classification Systems
5. Discriminant functions
6. Logistic Regression
References
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning, New York, NY,
USA:: Springer series in statistics.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.
Bishop, C. M. (2006). Pattern recognition and machine learning.
V. Sethu 3
Linear Regression
MachineInput Data Prediction about some quantity of interest
= 1, … , � ∈ ℛℎ �
� = 0 + �
=1

Parameters of the model: = 0,1, … ,
Residual Sum of Squares: = �
=1

− �
2
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical
learning New York, NY, USA:: Springer series in statistics.
Least squares estimate is the � that
corresponds to the minimum RSS
V. Sethu 4
Least Squares Linear Regression Model
Given a dataset,
= , =1,…,
Residual sum of squares:
= �
=1

− �
2 = �
=1

− 0 −�
=1

2
In matrix notation:
= − −
= ⋮

= 111 11 ⋯ 1⋮ ⋱ ⋮1 ⋯ =
0
1

= 1⋮

+ × × × +
V. Sethu 5
Least Squares Linear Regression Model
For least squares solution,

= − − =
⇒ � = −
Noting:

= 2
‘Hat’ Matrix:
� = � = − =
is Positive Definite if is full rank

Columns of
V. Sethu 6
Sequential (on-line) Learning
The least squares solution may be hard to compute in big data setting,
� = −
Very largeComputationally expensive!
May not be available all at once
(e.g., Real-time applications where
data is continuously streaming)
Consider
= �
=1

; = − 0 −�
=1

2 = − 2
Iterative estimation of can be carried out:
(+) = −

For the least squares case this is,
(+) = − −
LMS (Least Mean Squares) algorithm
V. Sethu 7
Note on Gradient Descent
Negative of the gradient points in the direction of steepest descent (of the surface of cost function).
Noisy gradients estimates from single data points (or small batches of data) used in stochastic
Noise gradients may be beneficial when negotiating complex surfaces.
V. Sethu 8
Detour - Principal Component Analysis
The principal components of a set of datapoints in
ℛ are a sequence of best linear approximations to
that data (of all ranks ≤ ).
Given the data, , … , , we can consider a
linear rank model for approximating them:
Fitting this model to the data and obtaining a least squares estimate involves the following minimisation,min
, ,�
=1

− −
2
It is straight forward to see that (by considering the case when = 0),
= � = 1

=1

The optimisation problem then reduces to:min

=1

� −
2 , ℎ � = − �
× × × ×
� = +
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
V. Sethu 9
Detour - Principal Component Analysis
Consider = 1,min
,�
=1

� − 1
2 = min
, �
=1

1
2
2 + �
=1

2 − 2�
=1

1

Imposing the condition = 1, and partially differentiating wrt 1
1 = 1�1
This reduces the minimisation problem to,min

=1

+ �
=1

2 − 2�
=1

Noting that �
2 = �� and writing in terms of the scatter matrix, = ∑=1 ��, the problem
reduces to min

−1
1 , 1 = 1
Equivalently, max

1 ℎ = 11 + − 1
Differentiating wrt 1
1
1
= 21 − 2 = 0
V. Sethu 10
Detour - Principal Component Analysis
i.e.,
=
In other words is the largest eigenvalue and is the corresponding eigenvector of .
Similarly, for > 1, it can be shown that the columns of are all eigenvectors of corresponding to the
largest eigenvalues.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
= � + 1 + 2 =
V. Sethu 11
Regularisation
Potential issues with Least squares estimate,
Low bias but may have high variance
(Prediction accuracy may suffer)
Maybe desirable to determine a subset of
features that exhibit the strongest effects.
� = − � = �
� = arg min

=1

− 0 −�
=1

2 + �
=1

2
Ridge Regression
Regularlisation term ( regularisation)
Note: 0 is not included
Equivalent to
� = arg min

=1

− 0 −�
=1

2

=1

2 ≤ tsubject to
One-to-one correspondence between and
V. Sethu 12
Regularisation
� = arg min

=1

− 0 −�
=1

2 + �
=1

2
Solution can be separated into two parts after centering inputs :
- Estimate 0
- Estimate 1, … ,
Solutions are not equivariant
under scaling of the inputs Inclusion of 0 will make it dependent on origin chosen
for – i.e., adding a constant to all will not simply
result in � being offset by the same amount .
→ − ̅
� = �
=1

� = arg min

− − +
= ⋮

= 11 ⋯ 1⋮ ⋱ ⋮
1 ⋯
= 1⋮

= 1⋮

× × ×
V. Sethu 13
Ridge Regression
� = arg min

− − +
� = + −
Makes the problem non-singular
even if is not of full rank
Comparing Least Squares (�) and Ridge Regression (�)
� = −(Using singular value decomposition: = ) =
� = + −= + −= �
=

2

2 +
V. Sethu 14
Ridge Regression
(Using singular value decomposition: = )
� = �
=

2

2 +
Note:
=
Also sample covariance, = 1

Therefore, (columns of ) are eigen-vectors
Also, = (from = )
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
Shrinks directions of least
variance the most
V. Sethu 15
Tuning hyperparameter
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
df = �
=1

2

2 + Effective degrees of freedom
V. Sethu 16
Regularisation (Lasso)
� = arg min

=1

− 0 −�
=1

2 + �
=1

regularisation term
Equivalent to
� = arg min

=1

− 0 −�
=1

2

=1

≤ tsubject to
Encourages sparsity – some coefficients will be exactly zero
Makes the problem non-linear – no closed form solution
Quadratic programming problem but efficient algorithms exist
V. Sethu 17
Feature Selection with Lasso
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
=
∑=1

V. Sethu 18
and regularisation (sparsity)
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning
New York, NY, USA:: Springer series in statistics.
1 + 2 ≤ 12 + 22 ≤
V. Sethu 19
Bayesian View
Assuming the error model,
= � + = + , ~ � 0,2
We can write,
,, = ,2
In terms of the dataset, = , =1,…,
,, = �
=1

,2
Taking the logarithm, ln ,, = �
=1
ln ,2
⇒ ln ,, = − ln − 2 ln 2 − 122�
=1

2
Likelihood function,
Maximum likelihood (ML) solution is equivalent to least squares solution
In the Bayesian framework the
parameters () are treated as
random variables instead of fixed
but unknown parameters.
20
Bayesian View (of Regularisation)
Bayes theorem states,
,, = ,,
|,Posterior
Likelihood Prior
Evidence
If we assume a Gaussian prior for the parameters () conditional on some hyperparameter (),
= = ,−1
We obtain
,, = ,
Where,
= 12
= + 12 −1
Consequently the log posterior is given by,ln ,, = − 122�
=1

2

2
Maximum a posteriori (MAP) estimate is equivalent to Ridge regression
V. Sethu 21
Visualising Bayesian Learning
Bishop, C. M. (2006). Pattern recognition and machine learning
V. Sethu 22
Classification
MachineInput Data Prediction about some quantity of interest
= 1, … , ∈ ℛ � ∈ 1, … ,
∈ 1, … ,ℎ �
The Bayesian Decision Theory Picture:
=
Defining a loss function () such that describes the loss incurred by estimating the class as
when the true class was (Note: Here is used to denote � = ) gives us the expected loss
associated with as:
= �
=1

V. Sethu 23
Bayesian Decision Theory
For any general decision rule, , where assumes one of the values 1, … ,, the overall risk is
given by,
= �

If is chosen such that is as small as possible for every , then overall risk () will be
minimised.
∗ = arg min

Bayes Decision Rule
Leads to minimum overall risk, called Bayes risk and
denotes as ∗
V. Sethu 24
Minimum Error Rate Classification
In many classification tasks, only the number of errors/mistakes (error rate) is of interest. This leads to
the so-called symmetric/zero-one loss function,
= �0 = 1 ≠
The corresponding conditional risk is then,
= �

= 1 −
Therefore for minimum error rate:
ℎ = > ∀ ≠
V. Sethu 25
Discriminant Functions
There are many ways to represent classifiers – One of them is in terms of a set of discriminant function, { : = 1, … , } such that,
ℎ = > ,∀ ≠
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
Bayes classifier: = −
Minimum error rate: = = ∑=
=
= ln + ln
You can replace every by where � is
monotonically increasing and the classification is unchanged.
V. Sethu 26
Decision Boundaries/Surfaces
Every decision rule divides the feature space into decision regions, separated by decision boundaries.
The decision regions need not be simply connected.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
27
Gaussian Data Distributions – Linear Discriminants
The minimum error rate classification can be achieved by the discriminant functions:
= ln + ln
In the case of multivariate normal data distribution within each class,
= ,
The discriminant functions can be readily evaluated as,
= − 12 − −1 − − 2 ln 2 − 12 ln Σ + ln
Consider the case where the data distribution in all the classes are normally distributed with identical
covariance matrices. i.e., = ,∀. This corresponds to a situation where data from all classes fall into
hyperellipsoidal clusters of the same shape and size but different locations in the feature space:
= − 12 − −1 − + ln
Expanding the Mahalanobis distance and dropping the quadratic term () which is identical for all
the classes makes the discriminant functions linear,
= + 0
Where,
= −
0 = − 12− + ln
Independent of data and class
distributions – can be dropped
1
2
ln Σ term is dropped since
it is identical for all classes
Mahalanobis distance
28
Gaussian Data Distributions – Decision Surfaces
The decision surfaces between decision regions ℛi and ℛj is given by = . In the case of the
linear discriminant functions arising from normally distributed data, they are hyperplanes given by

− − = 0
Where,
= 12 + − ln ()/() − −1 − −
Hyperplane passes through
V. Sethu
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
29
Arbitrary Gaussian Data Distributions
When data is every class are normally distributed but with
arbitrary covariance matrices, decision regions need not be
simply connected (even in one dimensional data)
V. Sethu
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
Decision surfaces can be hyperplanes, pairs of
hyperplanes, hyperspheres, hyperellipsoids,
hyperparaboloids, and hyperhyperboloids
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
V. Sethu 30
Detour - Linear Discriminant Analysis
Principal components give the directions of the greatest variance. However these may not necessarily be
the best directions
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
Linear discriminants can help, they identify the hyperplane the best separates classes (clusters) of data
that are normally distributed (with a shared covariance matrix ). Therefore projections onto the
subspace orthogonal to this hyperplane would provide maximal separation between the clusters.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical
learning New York, NY, USA:: Springer series in statistics.
V. Sethu 31
Detour - Linear Discriminant Analysis
To determine the basis vectors of this subspace we look for the principal components of the set of
centroids of the classes (after whitening). Specifically, given data from classes, we can compute a
matrix of class centroids, :
= ⋮

= 11 ⋯ 1⋮ ⋱ ⋮
1 ⋯
Given the within class covariance (shared covariance of all the classes), we can project to a space
where the shared covariance is ,
∗ = −12
If the covariance of ∗ is given by ∗, the between class covariance, the eigenvectors of ∗ (principal
components) will provide the desired coordinates of the optimal subspace. Specifically, the linear
discriminant direction, is given as
= −12∗
Where, ∗ are the columns of ∗ obtained by the following eigen decomposition:
∗ = ∗∗
×
V. Sethu 32
Quantifying Error
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
In a 2-class problem, the classifier separations
the feature space into two decision regions,
ℛ1 and ℛ2.
Errors can arise in two ways – Points from 1
falling within ℛ2 and vice versa.
= ∈ ℛ2,1 + ∈ ℛ1,2= ∈ ℛ2 1 1 + ∈ ℛ1 2 2= ∫ℛ2 1 1 + ∫ℛ 2 2
V. Sethu 33
Receiver Operating Characteristic (ROC) Curve
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern
classification. John Wiley & Sons.
′ → Discriminability (some measure of distance
between distributions)
e.g., ′ = 2−

for the 1-D case (equal variance)
Each point on the curve corresponding
to different operating point of classifier
V. Sethu 34
Logistic Regression
Arises from the desire to:
1. Model posterior probability of classes as linear functions (and treat it as a regression problem)
2. Ensure that sum of class posteriors is one.
The model has the form, ln 1

= 10 + ln 2

= 20 +
⋮ln −1

= −1 0 + −
Equivalently,
= 11 + ∑=1−1 0+
= 0+1 + ∑=1−1 0+ , = 1, … , − 1
V. Sethu 35
2-Class Logistic Regression
Given data, = , =,…, where,
= �1, 10, 2
The log-likelihood of the logistic regression model is then,
ℓ = ln = �
=1

ln 1 , + 1 − ln 1 − 1 ,
This reduces to,
ℓ = �
=1

− ln 1 +
To obtained the maximum likelihood solution we can set the derivatives to zero,

= �
=1

− 1 , = 0
Nonlinear equations – Use iterative
solution (Newton-Raphson algorithm)
V. Sethu 36
2-Class Logistic Regression
Newton update:
+1 = − 2ℓ

−1

In matrix notation,

= −
2ℓ

= −
Newton update for logistic regression,
+1 = + −1 −
Hessian matrix
= ⋮

= 111 11 ⋯ 1⋮ ⋱ ⋮1 ⋯ =
1

= 1 , ⋮
1 ,
= 1,2, … , ;
= , 1 − 1 ,

Email:51zuoyejun

@gmail.com