Electrical Engineering Data Science
L2 – Linear Methods for Prediction
Dr. Vidhyasaharan Sethu
School of Electrical Engineering & Telecommunications
University of New South Wales, Sydney, Australia

Topics
1. Linear Regression
2. Regularisation
3. Bayesian View of Linear Regression
4. Classification Systems
5. Discriminant Functions
6. Logistic Regression

References
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. New York, NY: Springer Series in Statistics.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern Classification. John Wiley & Sons.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Linear Regression
A machine takes input data $x = (x_1, \dots, x_p)^T \in \mathcal{R}^p$ and produces a prediction about some quantity of interest, $\hat{y} = f(x)$. The linear model is
$$ f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j $$
Parameters of the model: $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$.
Residual sum of squares:
$$ \mathrm{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 $$
The least squares estimate is the $\hat{\beta}$ that corresponds to the minimum RSS (Friedman et al., 2001).

Least Squares Linear Regression Model
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\dots,N}$, the residual sum of squares is
$$ \mathrm{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 $$
In matrix notation:
$$ \mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) $$
where $\mathbf{y} = (y_1, \dots, y_N)^T$ is $N \times 1$, $\mathbf{X}$ is the $N \times (p+1)$ matrix whose $i$-th row is $(1, x_{i1}, \dots, x_{ip})$, and $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is $(p+1) \times 1$.

Least Squares Linear Regression Model
For the least squares solution, set the derivative of the RSS to zero:
$$ \frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0 \;\Rightarrow\; \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
Noting $\frac{\partial^2 \mathrm{RSS}}{\partial \beta\, \partial \beta^T} = 2\mathbf{X}^T\mathbf{X}$, which is positive definite if $\mathbf{X}$ is of full rank (columns of $\mathbf{X}$ linearly independent), this stationary point is indeed the minimum.
'Hat' matrix: $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$.

Sequential (On-line) Learning
The least squares solution $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ may be hard to compute in a big data setting: $\mathbf{X}^T\mathbf{X}$ is very large and inverting it is computationally expensive. The data may also not be available all at once (e.g., real-time applications where data is continuously streaming).
Consider $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} E_i$, where
$$ E_i = \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 = \left( y_i - x_i^T\beta \right)^2 $$
Iterative estimation of $\beta$ can then be carried out one sample at a time:
$$ \beta^{(\tau+1)} = \beta^{(\tau)} - \eta\, \nabla E_i $$
For the least squares case this is
$$ \beta^{(\tau+1)} = \beta^{(\tau)} + \eta \left( y_i - x_i^T\beta^{(\tau)} \right) x_i $$
the LMS (Least Mean Squares) algorithm, an instance of stochastic gradient descent.

Note on Gradient Descent
The negative of the gradient points in the direction of steepest descent on the surface of the cost function. Noisy gradient estimates from single data points (or small batches of data) are used in stochastic gradient descent. Noisy gradients may be beneficial when negotiating complex surfaces. (See https://en.wikipedia.org/wiki/Gradient_descent)
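To make the LMS update concrete, here is a minimal NumPy sketch of the per-sample update $\beta \leftarrow \beta + \eta(y_i - x_i^T\beta)x_i$, compared against the closed-form least squares estimate. The function name, learning rate and synthetic data are illustrative assumptions, not from the slides.

```python
import numpy as np

def lms_fit(X, y, eta=0.01, n_epochs=100, seed=None):
    """Sequential (LMS / stochastic gradient descent) estimate of beta.
    A column of ones is prepended so beta[0] plays the role of the intercept."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):          # visit samples in random order
            err = y[i] - Xb[i] @ beta              # residual for one sample
            beta += eta * err * Xb[i]              # beta <- beta + eta * (y_i - x_i^T beta) x_i
    return beta

# Quick check against the closed-form least squares solution on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)
Xb = np.hstack([np.ones((200, 1)), X])
beta_ls = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
beta_lms = lms_fit(X, y, seed=1)
print(beta_ls, beta_lms)   # the two estimates should be close
```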
Detour – Principal Component Analysis
The principal components of a set of datapoints in $\mathcal{R}^p$ are a sequence of best linear approximations to that data, of all ranks $q \le p$. Given the data $x_1, \dots, x_N$, we can consider a linear rank-$q$ model for approximating them:
$$ f(\lambda) = \mu + \mathbf{V}_q \lambda $$
where $\mu \in \mathcal{R}^p$, $\mathbf{V}_q$ is a $p \times q$ matrix with orthonormal columns and $\lambda \in \mathcal{R}^q$. Fitting this model to the data and obtaining a least squares estimate involves the minimisation
$$ \min_{\mu,\, \{\lambda_i\},\, \mathbf{V}_q} \sum_{i=1}^{N} \left\| x_i - \mu - \mathbf{V}_q \lambda_i \right\|^2 $$
It is straightforward to see (by partially optimising for $\mu$ and the $\lambda_i$) that
$$ \hat{\mu} = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\lambda}_i = \mathbf{V}_q^T (x_i - \bar{x}) $$
The optimisation problem then reduces to
$$ \min_{\mathbf{V}_q} \sum_{i=1}^{N} \left\| \tilde{x}_i - \mathbf{V}_q\mathbf{V}_q^T \tilde{x}_i \right\|^2, \qquad \text{where } \tilde{x}_i = x_i - \bar{x} $$
(Friedman et al., 2001)

Detour – Principal Component Analysis
Consider $q = 1$:
$$ \min_{v_1,\, \{\lambda_i\}} \sum_{i=1}^{N} \left\| \tilde{x}_i - \lambda_i v_1 \right\|^2 = \min_{v_1,\, \{\lambda_i\}} \left[ \sum_{i=1}^{N} \lambda_i^2 \|v_1\|^2 + \sum_{i=1}^{N} \|\tilde{x}_i\|^2 - 2\sum_{i=1}^{N} \lambda_i v_1^T \tilde{x}_i \right] $$
Imposing the condition $\|v_1\| = 1$ and partially differentiating with respect to $\lambda_i$ gives $\lambda_i = v_1^T \tilde{x}_i$. This reduces the minimisation problem to
$$ \min_{v_1} \sum_{i=1}^{N} \|\tilde{x}_i\|^2 - \sum_{i=1}^{N} \left( v_1^T \tilde{x}_i \right)^2 $$
Noting that $\|\tilde{x}_i\|^2 = \tilde{x}_i^T\tilde{x}_i$ and writing in terms of the scatter matrix $\mathbf{S} = \sum_{i=1}^{N} \tilde{x}_i\tilde{x}_i^T$, the problem reduces to
$$ \min_{v_1} \; -v_1^T\mathbf{S}v_1, \qquad \text{subject to } v_1^T v_1 = 1 $$
Equivalently, maximise the Lagrangian
$$ L = v_1^T\mathbf{S}v_1 + \lambda\left( 1 - v_1^T v_1 \right) $$
Differentiating with respect to $v_1$:
$$ \frac{\partial L}{\partial v_1} = 2\mathbf{S}v_1 - 2\lambda v_1 = 0 $$

Detour – Principal Component Analysis
i.e., $\mathbf{S}v_1 = \lambda v_1$. In other words, $\lambda$ is the largest eigenvalue of $\mathbf{S}$ and $v_1$ is the corresponding eigenvector. Similarly, for $q > 1$, it can be shown that the columns of $\mathbf{V}_q$ are the eigenvectors of $\mathbf{S}$ corresponding to the $q$ largest eigenvalues (Friedman et al., 2001).
[Figure: first principal component direction fitted to a two-dimensional point cloud (Friedman et al., 2001)]

Regularisation
Potential issues with the least squares estimate:
- Low bias but possibly high variance (prediction accuracy may suffer).
- It may be desirable to determine a subset of features that exhibit the strongest effects.
Ridge regression adds an $L_2$ regularisation term to the RSS:
$$ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} $$
Note: $\beta_0$ is not included in the penalty. This is equivalent to
$$ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t $$
with a one-to-one correspondence between $\lambda$ and $t$.

Regularisation
Inclusion of $\beta_0$ in the penalty would make the solution dependent on the origin chosen for $y$ – adding a constant $c$ to all $y_i$ would not simply result in $\hat{y}$ being offset by the same amount $c$. The solutions are also not equivariant under scaling of the inputs, so the inputs are normally standardised before solving.
The solution can be separated into two parts after centering the inputs, $x_{ij} \to x_{ij} - \bar{x}_j$:
- Estimate $\beta_0$ as $\hat{\beta}_0 = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.
- Estimate $\beta_1, \dots, \beta_p$ by ridge regression without an intercept:
$$ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \; (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta $$
where $\mathbf{X}$ is now the $N \times p$ matrix of centred inputs (no column of ones) and $\beta = (\beta_1, \dots, \beta_p)^T$.

Ridge Regression
$$ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \; (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta \;\Rightarrow\; \hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} $$
The addition of $\lambda\mathbf{I}$ makes the problem non-singular even if $\mathbf{X}^T\mathbf{X}$ is not of full rank.
Comparing least squares ($\hat{\beta}^{\mathrm{ls}}$) and ridge regression ($\hat{\beta}^{\mathrm{ridge}}$) using the singular value decomposition $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$:
$$ \mathbf{X}\hat{\beta}^{\mathrm{ls}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{U}\mathbf{U}^T\mathbf{y} $$
$$ \mathbf{X}\hat{\beta}^{\mathrm{ridge}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda}\mathbf{u}_j^T\mathbf{y} $$

Ridge Regression
Note: $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$, and the sample covariance is $\mathbf{S} = \mathbf{X}^T\mathbf{X}/N$. Therefore the columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{S}$, i.e., the principal component directions of $\mathbf{X}$. Also, $\mathbf{X}\mathbf{v}_j = d_j\mathbf{u}_j$ (from $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$), so a small singular value $d_j$ corresponds to a direction of small variance. Ridge regression shrinks the directions of least variance the most (Friedman et al., 2001).

Tuning Hyperparameter $\lambda$
The effective degrees of freedom is
$$ \mathrm{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda} $$
[Figure: ridge coefficient profiles plotted against the effective degrees of freedom df($\lambda$) (Friedman et al., 2001)]

$L_1$ Regularisation (Lasso)
$$ \hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} $$
with an $L_1$ regularisation term. Equivalent to
$$ \hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t $$
- Encourages sparsity – some coefficients will be exactly zero.
- Makes the problem non-linear in the $y_i$ – there is no closed form solution.
- It is a quadratic programming problem, but efficient algorithms exist.

Feature Selection with Lasso
[Figure: lasso coefficient profiles as the constraint $t = \sum_{j} |\beta_j|$ is varied (Friedman et al., 2001)]

$L_1$ and $L_2$ Regularisation (Sparsity)
The lasso constraint region $|\beta_1| + |\beta_2| \le t$ has corners on the coordinate axes, whereas the ridge constraint region $\beta_1^2 + \beta_2^2 \le t^2$ is a disk. The RSS contours are therefore more likely to first touch the lasso region at a corner, where some coefficients are exactly zero (Friedman et al., 2001).
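To make the ridge formulas concrete, the short NumPy sketch below evaluates the closed-form estimate $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ on centred data and the effective degrees of freedom $\mathrm{df}(\lambda)$ from the singular values. Function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; X and y are assumed centred (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_df(X, lam):
    """Effective degrees of freedom: df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)      # singular values d_j
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                              # centre the inputs
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)
y -= y.mean()                                    # intercept handled by centring

for lam in (0.0, 10.0, 100.0):
    beta = ridge_fit(X, y, lam)
    print(lam, np.round(beta, 3), "df =", round(ridge_df(X, lam), 2))
# Coefficients (and df) shrink towards zero as lambda grows, but none become exactly zero,
# in contrast with the lasso.
```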
Bayesian View
Assuming the error model
$$ y = f(x) + \epsilon = x^T\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) $$
we can write $p(y \mid x, \beta, \sigma^2) = \mathcal{N}(y \mid x^T\beta, \sigma^2)$. In terms of the dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\dots,N}$,
$$ p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid x_i^T\beta, \sigma^2) $$
Taking the logarithm gives the log-likelihood function,
$$ \ln p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = \sum_{i=1}^{N} \ln \mathcal{N}(y_i \mid x_i^T\beta, \sigma^2) = -N\ln\sigma - \frac{N}{2}\ln 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left( y_i - x_i^T\beta \right)^2 $$
The maximum likelihood (ML) solution for $\beta$ is therefore equivalent to the least squares solution. In the Bayesian framework the parameters $\beta$ are treated as random variables instead of fixed but unknown quantities.

Bayesian View (of Regularisation)
Bayes' theorem states
$$ p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2) = \frac{p(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2)\, p(\beta)}{p(\mathbf{y} \mid \mathbf{X}, \sigma^2)} \qquad \left( \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \right) $$
If we assume a Gaussian prior for the parameters $\beta$, conditional on some hyperparameter $\alpha$,
$$ p(\beta \mid \alpha) = \mathcal{N}(\beta \mid 0, \alpha^{-1}\mathbf{I}) $$
we obtain a Gaussian posterior $p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2, \alpha) = \mathcal{N}(\beta \mid \mathbf{m}, \mathbf{S})$, where
$$ \mathbf{m} = \frac{1}{\sigma^2}\mathbf{S}\mathbf{X}^T\mathbf{y}, \qquad \mathbf{S} = \left( \alpha\mathbf{I} + \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} \right)^{-1} $$
Consequently the log posterior is given by
$$ \ln p(\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2, \alpha) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left( y_i - x_i^T\beta \right)^2 - \frac{\alpha}{2}\beta^T\beta + \text{const} $$
The maximum a posteriori (MAP) estimate is therefore equivalent to ridge regression (with $\lambda = \alpha\sigma^2$).

Visualising Bayesian Learning
[Figure: sequential Bayesian learning of a linear model – prior/posterior densities over the parameters and corresponding samples in data space (Bishop, 2006)]

Classification
A machine takes input data $x = (x_1, \dots, x_p)^T \in \mathcal{R}^p$ and produces a prediction $\hat{y} \in \{1, \dots, K\}$ about the class $y \in \{1, \dots, K\}$.
The Bayesian decision theory picture is built on the posterior $P(y = k \mid x)$. Defining a loss function $L(j, k)$ that describes the loss incurred by estimating the class as $j$ when the true class is $k$ (note: here $j$ is used to denote $\hat{y} = j$) gives us the expected loss (conditional risk) associated with $\hat{y} = j$ as
$$ R(j \mid x) = \sum_{k=1}^{K} L(j, k)\, P(y = k \mid x) $$

Bayesian Decision Theory
For any general decision rule $\alpha(x)$, where $\alpha(x)$ assumes one of the values $1, \dots, K$, the overall risk is given by
$$ R = \int R(\alpha(x) \mid x)\, p(x)\, dx $$
If $\alpha(x)$ is chosen such that $R(\alpha(x) \mid x)$ is as small as possible for every $x$, then the overall risk $R$ will be minimised:
$$ \alpha^*(x) = \arg\min_{j} R(j \mid x) \qquad \text{(Bayes decision rule)} $$
This leads to the minimum overall risk, called the Bayes risk and denoted $R^*$.

Minimum Error Rate Classification
In many classification tasks, only the number of errors/mistakes (the error rate) is of interest. This leads to the so-called symmetric or zero-one loss function,
$$ L(j, k) = \begin{cases} 0 & j = k \\ 1 & j \ne k \end{cases} $$
The corresponding conditional risk is then
$$ R(j \mid x) = \sum_{k \ne j} P(y = k \mid x) = 1 - P(y = j \mid x) $$
Therefore, for minimum error rate: choose class $j$ if $P(y = j \mid x) > P(y = k \mid x) \;\; \forall k \ne j$.

Discriminant Functions
There are many ways to represent classifiers. One of them is in terms of a set of discriminant functions $\{ g_j(x) : j = 1, \dots, K \}$ such that we choose class $j$ if $g_j(x) > g_k(x) \;\; \forall k \ne j$ (Duda et al., 2012).
Bayes classifier: $g_j(x) = -R(j \mid x)$.
Minimum error rate:
$$ g_j(x) = P(y = j \mid x) = \frac{p(x \mid y = j)\, P(y = j)}{\sum_{k=1}^{K} p(x \mid y = k)\, P(y = k)} $$
or equivalently $g_j(x) = p(x \mid y = j)\, P(y = j)$, or $g_j(x) = \ln p(x \mid y = j) + \ln P(y = j)$.
You can replace every $g_j(x)$ by $f(g_j(x))$, where $f(\cdot)$ is monotonically increasing, and the classification is unchanged.

Decision Boundaries/Surfaces
Every decision rule divides the feature space into $K$ decision regions, separated by decision boundaries. The decision regions need not be simply connected (Duda et al., 2012).

Gaussian Data Distributions – Linear Discriminants
Minimum error rate classification can be achieved by the discriminant functions
$$ g_j(x) = \ln p(x \mid y = j) + \ln P(y = j) $$
In the case of a multivariate normal data distribution within each class, $p(x \mid y = j) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$, the discriminant functions can be readily evaluated as
$$ g_j(x) = -\frac{1}{2}(x - \mu_j)^T\Sigma_j^{-1}(x - \mu_j) - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_j| + \ln P(y = j) $$
Consider the case where the data distributions in all the classes are normal with identical covariance matrices, i.e., $\Sigma_j = \Sigma \;\; \forall j$. This corresponds to a situation where data from all classes fall into hyperellipsoidal clusters of the same shape and size but different locations in the feature space:
$$ g_j(x) = -\frac{1}{2}(x - \mu_j)^T\Sigma^{-1}(x - \mu_j) + \ln P(y = j) $$
(The $-\frac{p}{2}\ln 2\pi$ term is independent of the data and class distributions and can be dropped; the $\frac{1}{2}\ln|\Sigma|$ term is dropped since it is identical for all classes. The remaining quadratic form is the squared Mahalanobis distance.)
Expanding the Mahalanobis distance and dropping the quadratic term $x^T\Sigma^{-1}x$, which is identical for all the classes, makes the discriminant functions linear:
$$ g_j(x) = w_j^T x + w_{j0}, \qquad \text{where } w_j = \Sigma^{-1}\mu_j, \quad w_{j0} = -\frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \ln P(y = j) $$

Gaussian Data Distributions – Decision Surfaces
The decision surface between decision regions $\mathcal{R}_i$ and $\mathcal{R}_j$ is given by $g_i(x) = g_j(x)$. In the case of the linear discriminant functions arising from normally distributed data with a shared covariance, these are hyperplanes given by
$$ w^T(x - x_0) = 0 $$
where
$$ w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[ P(y = i)/P(y = j) \right]}{(\mu_i - \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j)}(\mu_i - \mu_j) $$
The hyperplane passes through $x_0$ (Duda et al., 2012).
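As a worked illustration of the shared-covariance Gaussian case, the NumPy sketch below estimates class means, priors and a pooled covariance from labelled data, builds the linear discriminants $g_j(x) = w_j^T x + w_{j0}$, and classifies by the largest discriminant. The function names and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def fit_gaussian_lda(X, y):
    """Estimate w_j = Sigma^{-1} mu_j and w_j0 = -0.5 mu_j^T Sigma^{-1} mu_j + ln P(y=j)
    from labelled data, using a pooled (shared) covariance estimate."""
    classes = np.unique(y)
    N, p = X.shape
    mus, priors = [], []
    Sigma = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        mus.append(mu)
        priors.append(len(Xc) / N)
        Sigma += (Xc - mu).T @ (Xc - mu)
    Sigma /= N                                            # pooled covariance
    Sigma_inv = np.linalg.inv(Sigma)
    W = np.array([Sigma_inv @ mu for mu in mus])          # rows are w_j
    w0 = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(pr)
                   for mu, pr in zip(mus, priors)])       # biases w_j0
    return classes, W, w0

def predict(X, classes, W, w0):
    g = X @ W.T + w0                                      # discriminant scores g_j(x)
    return classes[np.argmax(g, axis=1)]                  # choose the largest discriminant

# Toy data: two Gaussian clusters with a shared (identity) covariance
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
classes, W, w0 = fit_gaussian_lda(X, y)
print((predict(X, classes, W, w0) == y).mean())           # training accuracy, close to 1
```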
Arbitrary Gaussian Data Distributions
When the data in every class are normally distributed but with arbitrary covariance matrices, the decision regions need not be simply connected (even for one-dimensional data). The decision surfaces can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids (broadly referred to as hyperquadrics) (Duda et al., 2012).

Detour – Linear Discriminant Analysis
Principal components give the directions of greatest variance. However, these may not necessarily be the best directions for separating classes (Duda et al., 2012). Linear discriminants can help: they identify the hyperplane that best separates classes (clusters) of data that are normally distributed with a shared covariance matrix. Projections onto the subspace orthogonal to this hyperplane therefore provide maximal separation between the clusters (Friedman et al., 2001).

Detour – Linear Discriminant Analysis
To determine the basis vectors of this subspace, we look for the principal components of the set of class centroids (after whitening). Specifically, given data from $K$ classes, we can compute the $K \times p$ matrix of class centroids,
$$ \mathbf{M} = \begin{bmatrix} \mu_1^T \\ \vdots \\ \mu_K^T \end{bmatrix} $$
Given the within-class covariance $\mathbf{W}$ (the shared covariance of all the classes), we can project to a space where the shared covariance is the identity,
$$ \mathbf{M}^* = \mathbf{M}\mathbf{W}^{-1/2} $$
If the covariance of the rows of $\mathbf{M}^*$ is $\mathbf{B}^*$, the between-class covariance in the whitened space, then the eigenvectors of $\mathbf{B}^*$ (its principal components) provide the desired coordinates of the optimal subspace. Specifically, the $l$-th linear discriminant direction is given by
$$ v_l = \mathbf{W}^{-1/2}v_l^* $$
where the $v_l^*$ are the columns of $\mathbf{V}^*$ obtained from the eigendecomposition $\mathbf{B}^* = \mathbf{V}^*\mathbf{D}_B\mathbf{V}^{*T}$ (Friedman et al., 2001).

Quantifying Error
In a 2-class problem, the classifier separates the feature space into two decision regions, $\mathcal{R}_1$ and $\mathcal{R}_2$. Errors can arise in two ways – points from class 1 falling within $\mathcal{R}_2$ and vice versa (Duda et al., 2012):
$$ P(\text{error}) = P(x \in \mathcal{R}_2, y = 1) + P(x \in \mathcal{R}_1, y = 2) $$
$$ = P(x \in \mathcal{R}_2 \mid y = 1)P(y = 1) + P(x \in \mathcal{R}_1 \mid y = 2)P(y = 2) $$
$$ = \int_{\mathcal{R}_2} p(x \mid y = 1)P(y = 1)\, dx + \int_{\mathcal{R}_1} p(x \mid y = 2)P(y = 2)\, dx $$

Receiver Operating Characteristic (ROC) Curve
$d'$ denotes the discriminability (some measure of distance between the class-conditional distributions), e.g., $d' = |\mu_2 - \mu_1|/\sigma$ for the 1-D case with equal variances. Each point on the ROC curve corresponds to a different operating point (decision threshold) of the classifier (Duda et al., 2012).

Logistic Regression
Logistic regression arises from the desire to:
1. Model the posterior probabilities of the classes as linear functions of $x$ (and treat it as a regression problem), and
2. Ensure that the class posteriors sum to one.
The model has the form (see the sketch after these equations)
$$ \ln\frac{P(y = 1 \mid x)}{P(y = K \mid x)} = \beta_{10} + \beta_1^T x $$
$$ \ln\frac{P(y = 2 \mid x)}{P(y = K \mid x)} = \beta_{20} + \beta_2^T x $$
$$ \vdots $$
$$ \ln\frac{P(y = K-1 \mid x)}{P(y = K \mid x)} = \beta_{(K-1)0} + \beta_{K-1}^T x $$
Equivalently,
$$ P(y = K \mid x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp\left( \beta_{l0} + \beta_l^T x \right)} $$
$$ P(y = k \mid x) = \frac{\exp\left( \beta_{k0} + \beta_k^T x \right)}{1 + \sum_{l=1}^{K-1} \exp\left( \beta_{l0} + \beta_l^T x \right)}, \qquad k = 1, \dots, K-1 $$
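The following minimal sketch evaluates these class posteriors for given parameters, with class $K$ as the reference class whose logit is zero. The function name and the numbers are illustrative assumptions.

```python
import numpy as np

def logistic_posteriors(X, B, b0):
    """Class posteriors for the K-class logistic model above.

    B  : (K-1, p) array whose rows are beta_1, ..., beta_{K-1}
    b0 : (K-1,)  array of intercepts beta_10, ..., beta_(K-1)0
    Returns an (N, K) array of P(y = k | x_i); class K is the reference class."""
    logits = X @ B.T + b0                                    # beta_k0 + beta_k^T x
    logits = np.hstack([logits, np.zeros((X.shape[0], 1))])  # reference class logit = 0
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability only
    expL = np.exp(logits)
    return expL / expL.sum(axis=1, keepdims=True)

# Example with K = 3 classes and p = 2 features (all numbers are made up)
B = np.array([[1.0, -0.5],
              [0.2,  0.8]])
b0 = np.array([0.1, -0.3])
X = np.array([[0.5, 1.0],
              [2.0, -1.0]])
P = logistic_posteriors(X, B, b0)
print(P, P.sum(axis=1))   # each row of P sums to 1, as required
```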
2-Class Logistic Regression
Given data $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\dots,N}$, where $y_i = 1$ for class 1 and $y_i = 0$ for class 2, the log-likelihood of the logistic regression model is
$$ \ell(\beta) = \ln p(\mathbf{y} \mid \mathbf{X}, \beta) = \sum_{i=1}^{N} \left[ y_i \ln P(y = 1 \mid x_i; \beta) + (1 - y_i)\ln\left( 1 - P(y = 1 \mid x_i; \beta) \right) \right] $$
With $x_i$ taken to include the constant 1 for the intercept, this reduces to
$$ \ell(\beta) = \sum_{i=1}^{N} \left[ y_i\beta^T x_i - \ln\left( 1 + e^{\beta^T x_i} \right) \right] $$
To obtain the maximum likelihood solution we can set the derivatives to zero,
$$ \frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\left( y_i - P(y = 1 \mid x_i; \beta) \right) = 0 $$
These are nonlinear equations in $\beta$ – use an iterative solution (the Newton-Raphson algorithm).

2-Class Logistic Regression
Newton update:
$$ \beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left( \frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} \right)^{-1} \frac{\partial\ell(\beta)}{\partial\beta} $$
In matrix notation,
$$ \frac{\partial\ell(\beta)}{\partial\beta} = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\mathbf{X}^T\mathbf{W}\mathbf{X} \quad \text{(Hessian matrix)} $$
so the Newton update for logistic regression is
$$ \beta^{\mathrm{new}} = \beta^{\mathrm{old}} + \left( \mathbf{X}^T\mathbf{W}\mathbf{X} \right)^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{p}) $$
where $\mathbf{X}$ is the $N \times (p+1)$ matrix whose $i$-th row is $(1, x_{i1}, \dots, x_{ip})$, $\mathbf{y} = (y_1, \dots, y_N)^T$, $\mathbf{p}$ is the vector of fitted probabilities with $i$-th element $P(y = 1 \mid x_i; \beta^{\mathrm{old}})$, and $\mathbf{W}$ is the $N \times N$ diagonal matrix with $i$-th diagonal element $P(y = 1 \mid x_i; \beta^{\mathrm{old}})\left( 1 - P(y = 1 \mid x_i; \beta^{\mathrm{old}}) \right)$, for $i = 1, 2, \dots, N$.
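A minimal NumPy sketch of this Newton-Raphson (iteratively reweighted least squares) update for 2-class logistic regression is given below; the function names, stopping rule and toy data are illustrative assumptions. Storing only the diagonal of $\mathbf{W}$ avoids forming the full $N \times N$ matrix, which is a common implementation choice rather than anything prescribed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=20, tol=1e-8):
    """Newton-Raphson fit of 2-class logistic regression.

    X : (N, p) inputs (a column of ones is prepended for the intercept).
    y : (N,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)                    # fitted probabilities P(y=1|x_i)
        W = p * (1.0 - p)                         # diagonal of W, kept as a vector
        grad = Xb.T @ (y - p)                     # X^T (y - p)
        H = Xb.T @ (Xb * W[:, None])              # X^T W X without building an NxN matrix
        step = np.linalg.solve(H, grad)
        beta = beta + step                        # beta <- beta + (X^T W X)^{-1} X^T (y - p)
        if np.max(np.abs(step)) < tol:            # stop once the update is negligible
            break
    return beta

# Toy example: two overlapping Gaussian clusters (all numbers are illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
beta = logistic_newton(X, y)
acc = ((sigmoid(np.hstack([np.ones((200, 1)), X]) @ beta) > 0.5) == y).mean()
print(beta, acc)
```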