Machine Learning and Big Data: Beyond the Lasso
EC340 - Topics in Applied Economics
Dr Pedro CL Souza, University of Warwick Economics Department, AT 2019/20

Contents
1. Regularization Methods: $\ell_0$, $\ell_1$ and $\ell_2$
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)

Regularization Methods: $\ell_0$, $\ell_1$ and $\ell_2$

• In the previous lecture we defined the Lasso estimator
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} |\beta_j| \right]$$
and the Adaptive Lasso version
$$\hat\beta_{\text{adalasso}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \omega_j |\beta_j| \right]$$
where $\omega_j = |\hat\beta_{j,\text{lasso}}|^{-1}$.
• The Lasso and Adaptive Lasso fit into a broader category of penalized estimation methods.
• Indeed, the Akaike and Schwarz criteria can also be seen as penalized methods. In the previous lecture, we defined
$$\text{AIC} = \ln\left( \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i\beta)^2 \right) + \frac{2(k+1)}{n} \qquad \text{BIC} = \ln\left( \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i\beta)^2 \right) + \frac{(k+1)\ln n}{n}$$
where $k$ is the number of variables included in the model. Since, in this case, no estimated parameter is zero, $k = \|\beta\|_0$. So this can also be seen as $\ell_0$ penalization, but as we showed it is computationally infeasible.
• Ridge regression: $\ell_2$ penalization,
$$\hat\beta_{\text{ridge}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \beta_j^2 \right]$$
The apparently minor difference ($\ell_1$ penalization becomes $\ell_2$) has a profound impact on the properties of the estimator.
• The Ridge regression has been shown to handle highly correlated regressors well, even very close to multicollinearity.
• Define the objective function
$$f(\beta) = \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \beta_j^2$$
With $K = 1$, the first-order condition is
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N} (y_i - x_i\hat\beta_{\text{ridge}})x_i + 2\lambda\hat\beta_{\text{ridge}} = 0$$
• Working out the first-order condition,
$$\sum_{i=1}^{N} y_i x_i - \hat\beta_{\text{ridge}} \sum_{i=1}^{N} x_i^2 - \lambda\hat\beta_{\text{ridge}} = 0 \;\Rightarrow\; \hat\beta_{\text{ridge}} = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda}$$
• Since $\lambda \ge 0$,
$$|\hat\beta_{\text{ridge}}| = \left|\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda}\right| \le \left|\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}\right| = |\hat\beta_{\text{ols}}|$$
and therefore $\hat\beta_{\text{ridge}}$ is also biased towards zero.
• However, $\hat\beta_{j,\text{ridge}}$ is non-zero for every $j$ ⇒ no element of model selection.
• Elastic Net: combining the $\ell_1$ and $\ell_2$ penalties,
$$\hat\beta_{\text{enet}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda_1 \sum_{j=1}^{K} |\beta_j| + \lambda_2 \sum_{j=1}^{K} \beta_j^2 \right]$$
and the adaptive version
$$\hat\beta_{\text{adaenet}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda_1 \sum_{j=1}^{K} \omega_j|\beta_j| + \lambda_2 \sum_{j=1}^{K} \beta_j^2 \right]$$
where $\omega_j = 1/|\hat\beta_{j,\text{enet}}|$. A short sketch comparing these penalties follows below.
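A minimal R sketch of the three penalties (an illustration, not part of the original slides or of beyondLasso.Rmd). The data are simulated with highly correlated regressors, where Ridge is expected to do well. Note that glmnet parameterizes the Elastic Net with a single penalty $\lambda$ and a mixing weight `alpha`, rather than separate $\lambda_1$ and $\lambda_2$: `alpha = 1` gives the Lasso, `alpha = 0` Ridge, and intermediate values the Elastic Net.

```r
library(glmnet)
library(MASS)  # for mvrnorm

set.seed(1)
N <- 200; K <- 10
Sigma <- 0.95^abs(outer(1:K, 1:K, "-"))      # highly correlated regressors
x <- mvrnorm(N, mu = rep(0, K), Sigma = Sigma)
y <- drop(x %*% c(1, -1, rep(0, K - 2))) + rnorm(N)

fit_lasso <- cv.glmnet(x, y, alpha = 1)      # l1 penalty: sparse solution
fit_ridge <- cv.glmnet(x, y, alpha = 0)      # l2 penalty: shrinks, never zeroes
fit_enet  <- cv.glmnet(x, y, alpha = 0.5)    # elastic net: mix of both

# Ridge keeps every coefficient non-zero; Lasso and Elastic Net set many to zero
colSums(cbind(
  lasso = as.numeric(coef(fit_lasso, s = "lambda.min")) != 0,
  ridge = as.numeric(coef(fit_ridge, s = "lambda.min")) != 0,
  enet  = as.numeric(coef(fit_enet,  s = "lambda.min")) != 0))
```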
• Finally, penalized likelihood methods are very useful and find many applications, including textual analysis.
• The Lasso estimator, again,
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \underbrace{\sum_{i=1}^{N} (y_i - x_i\beta)^2}_{\text{sum of squared residuals}} + \lambda\|\beta\|_1$$
is suitable for continuous outcomes $y_i$.
• Penalized likelihood,
$$\hat\beta_{\text{penlik}} = \arg\min_\beta \underbrace{-\ln \mathcal{L}(\beta)}_{\text{negative log-likelihood}} + \lambda\|\beta\|_1$$
is suitable for, for example, binary outcomes $y_i$.
• Why likelihood?
  - Example: logistic regression and limited dependent variable models.
  - Suppose that we were looking for causes that determine employment ($y_i = 1$) or unemployment ($y_i = 0$).
  - Many candidate explanatory variables $x_i$, which can include age, gender, education, parents' education, ...
  - Logit model:
$$P(y_i = 1 | x_i) = G(\beta_0 + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K) = G(x_i\beta), \qquad G(z) = \frac{\exp(z)}{1 + \exp(z)}$$
  - The likelihood of the logit model is
$$\mathcal{L}(\beta) = \prod_{i=1}^{N} [G(x_i\beta)]^{y_i} [1 - G(x_i\beta)]^{1 - y_i}$$
and the log-likelihood is
$$\ln \mathcal{L}(\beta) = \sum_{i=1}^{N} y_i \ln[G(x_i\beta)] + (1 - y_i)\ln[1 - G(x_i\beta)] = \sum_{i: y_i = 1} \ln[G(x_i\beta)] + \sum_{i: y_i = 0} \ln[1 - G(x_i\beta)]$$
  - Fitting the standard logit model is feasible if $K$ is small. If $K$ is large, estimates are very volatile and lead to poor predictive performance ⇒ penalized likelihood solves this issue.
  - Application: analysis of text such as Facebook posts, tweets, speeches, Central Banks' reports. A sketch of a penalized logit fit follows below.

▶ See beyondLasso.Rmd, sections A-C ◀
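A hedged sketch of an $\ell_1$-penalized logit in R (simulated data; the lecture's own code lives in beyondLasso.Rmd and is not reproduced here). `glmnet` with `family = "binomial"` minimizes the negative log-likelihood plus the $\lambda\|\beta\|_1$ penalty:

```r
library(glmnet)

set.seed(2)
N <- 500; K <- 200
x <- matrix(rnorm(N * K), N, K)     # many candidate predictors
p <- plogis(x[, 1] - x[, 2])        # G(z) = exp(z)/(1+exp(z)); only 2 matter
y <- rbinom(N, 1, p)                # binary outcome, e.g. employed / not

# Penalized likelihood: -logLik(beta) + lambda * ||beta||_1
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
sum(coef(fit, s = "lambda.min") != 0)                   # few coefficients survive
predict(fit, newx = x[1:5, ], s = "lambda.min", type = "response")
```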
Forest and Trees

Regression Trees
• Regression trees fall outside the scope of regularized methods, but are very useful for prediction and classification problems.
• Very flexible, and they inherently incorporate non-linearities, while the previous methods were linear.
• The starting point for the Lasso and its varieties was the linear model
$$y_i = x_i\beta + \varepsilon_i$$
and for the logistic regression, the probability still depends on $x_i$ only through the linear index $x_i\beta$,
$$P(y_i = 1|x_i) = G(x_i\beta)$$
for a $K \times 1$ vector $\beta$, where $K$ is large.
• Non-linearities can take many forms:
  - Approximately linear: $y_i = \alpha + x_{i1}\beta_{11} + x_{i1}^2\beta_{12} + \cdots + x_{iK}\beta_{K1} + x_{iK}^2\beta_{K2} + \varepsilon_i$
  - Piecewise linear: $y_i = f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_K(x_{iK}) + \varepsilon_i$
  - Strongly non-linear: $y_i = f(x_{i1}, \ldots, x_{iK}) + \varepsilon_i$
  ... and the econometrician never knows the true model.
• Regression Trees are a very flexible, inherently non-linear model, while still parsimonious.
• To simplify, consider $K = 2$, so there are two predictors, $x_{i1}$ and $x_{i2}$, both belonging to the unit interval $[0, 1]$.
• Start the process by letting the prediction anywhere in the unit square be $\hat c$. If we are minimizing the mean squared prediction error, the prediction is $\hat c = \bar y$.
• The prediction can possibly be made better if we split the square into two regions $R_1$ and $R_2$, say
$$R_1 = \{(x_{i1}, x_{i2}) : x_{i1} \le t_1\} \qquad R_2 = \{(x_{i1}, x_{i2}) : x_{i1} > t_1\}$$
• The best prediction (in the sense of minimizing the mean squared prediction error) within $R_1$ is the average of $y$ conditional on values of $x_{i1}$ and $x_{i2}$ that satisfy $R_1$. We define this prediction as $\hat c_1$; similarly for $R_2$ with prediction $\hat c_2$.
• We can keep on partitioning the areas and conditioning the predictions; for example, with five partitions we obtain predictions $\hat c_1, \ldots, \hat c_5$.
[Figure: recursive partition of the unit square into five regions, from Chapter 9 of Hastie, Tibshirani and Friedman, "The Elements of Statistical Learning", Springer 2009]
• Overall, given a partition $\{R_1, \ldots, R_M\}$, the prediction is
$$\hat y = \sum_{m=1}^{M} \hat c_m \cdot I\{(x_{i1}, x_{i2}) \in R_m\}$$
where $I$ is the indicator function. More generally, for $K > 2$ predictors,
$$\hat y = \sum_{m=1}^{M} \hat c_m \cdot I(x_i \in R_m)$$
• How do we select the partitions $R_m$, and how many of them ($M$) should there be?
• Step-wise algorithm:
  1. Start with the full data and a single region $R_1$, so $\hat y = \bar y$.
  2. Find the variable $j$ and the split point $s$ such that the prediction error is minimized:
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat c_2)^2$$
  3. Keep splitting until $M$ partitions are obtained.
• The algorithm is automated and uses objective criteria, but it is computationally heavy, although not infeasible.
• How to select $M$?
  - The prediction error for each partition $m$ is
$$Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$$
where $N_m = \#\{x_i \in R_m\}$ is the number of observations that fall within $R_m$, and $\hat c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i$ is the prediction in partition $R_m$.
  - The greater $M$, the larger the tree: risk of overfitting the data, with consequences similar to those of the linear regression we saw in lecture 1. But a very small tree might not capture important features of the data.
  - Define a criterion that balances the two objectives,
$$C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M$$
where $\alpha$ governs the trade-off between large and small trees.
  - There are automatic methods that pick $\alpha$ to maximize the performance of the algorithm; a sketch with the rpart package follows below.
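A minimal sketch with the rpart package, which implements this kind of recursive partitioning for regression trees (illustrative data, not from the lecture materials; rpart's complexity parameter `cp` plays the role of $\alpha$ in the cost-complexity criterion above):

```r
library(rpart)

set.seed(3)
n <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- ifelse(x1 <= 0.5, 1, 2 + 3 * (x2 > 0.4)) + rnorm(n, sd = 0.3)
dat <- data.frame(y, x1, x2)

# method = "anova" gives a regression tree; larger cp => smaller tree
tree <- rpart(y ~ x1 + x2, data = dat, method = "anova",
              control = rpart.control(cp = 0.01))
printcp(tree)            # cross-validated error for each candidate subtree
predict(tree, dat)[1:5]  # fitted c_m for the region each observation falls in
```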
Random Trees
• Random Trees: a small modification of the regression trees.
• Step-wise algorithm, random-tree version:
  1. Start with the full data and a single region $R_1$, so $\hat y = \bar y$.
  2a. Randomly select a subset of $p \le K$ covariates as candidates for splitting (a usual value is $p = \sqrt{K}$).
  2b. Find the variable $j$ among those selected in step 2a and the split point $s$ such that the prediction error is minimized:
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat c_2)^2$$
  3. Keep splitting until $M$ partitions are obtained.
• Random Trees may seem counter-intuitive. Why add the random pre-selection of the covariates in step 2a?
  - On its own, the random step 2a might deteriorate the performance, because the search for variables (and split points) in step 2b is conducted on a smaller subset of the covariates.
  - The idea is to generate many random trees and combine their predictions.
  - Many random trees = random forest.

Random Forest
• Random Forest algorithm, version 1:
  1. Follow steps 1-3 of the random tree algorithm $B$ times. For each $b$, let $T_b(x_i)$ be the prediction of the $b$-th tree, giving $\{T_1(x_i), \ldots, T_B(x_i)\}$.
  2. Average the predictions of the $B$ individual random trees for the final output:
$$\hat y_i = \frac{1}{B} \sum_{b=1}^{B} T_b(x_i)$$
  2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output.
• The Random Forest actually goes a step further.
• Bootstrap: a procedure whereby a random subsample is drawn from the original data.
  - In the cross-sectional case, for a dataset of size $N$, the bootstrapped sample is constructed by sampling $N$ observations from the original data with replacement, each with probability $\frac{1}{N}$. For example:
    Original data: $(y_1, x_{11}, x_{12}), (y_2, x_{21}, x_{22}), (y_3, x_{31}, x_{32}), \ldots, (y_N, x_{N1}, x_{N2})$
    Bootstrap data: $(y_7, x_{71}, x_{72}), (y_9, x_{91}, x_{92}), (y_1, x_{11}, x_{12}), \ldots, (y_9, x_{91}, x_{92})$
  - Note that, since the sampling is done with replacement, the same observation may occur more than once in the bootstrapped sample.
  - In the time-series case, sampling the observations in this way would destroy the dependence over time. Instead, observations are sampled in blocks of consecutive observations: for example, split $y_1, \ldots, y_9$ into blocks such as $(y_1, y_2, y_3)$, $(y_4, y_5, y_6)$, $(y_7, y_8, y_9)$, and resample whole blocks with replacement, so the bootstrapped series might be $y_5, y_6, y_7, y_2, y_3, y_4, y_7, y_8, y_9$, with blocks possibly repeating.
• Random Forest algorithm, version 2:
  1. Draw a bootstrap sample from the original data, and follow steps 1-3 of the random tree algorithm. Repeat the process $B$ times. For each $b$, let $T_b(x_i)$ be the prediction of the $b$-th tree, giving $\{T_1(x_i), \ldots, T_B(x_i)\}$.
  2. Average the predictions of the $B$ individual random trees for the final output:
$$\hat y_i = \frac{1}{B} \sum_{b=1}^{B} T_b(x_i)$$
  2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output.
• Why add the bootstrap and random tree steps?
  - The reason is that, perhaps counter-intuitively, it makes the averaging step more efficient.
  - Suppose that we had $B$ independent and identically distributed predictions $\hat y_{i,b}$, each with variance $\sigma^2$. The variance of the average prediction would be
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} \hat y_{i,b}\right] = \frac{1}{B^2}\sum_{b=1}^{B} \mathrm{Var}[\hat y_{i,b}] = \frac{\sigma^2}{B}$$
so the more predictions $B$, the lower the variance of the average prediction (which decreases the mean squared error).
  - However, the various predictions from the random trees are likely to be highly correlated among themselves, since they were generated from the same original set of data, so they are hardly independent.
  - If there is a positive pairwise correlation $\rho$ between the predictions $\hat y_{i,b}$, the variance of the averaged prediction can be worked out to be
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} \hat y_{i,b}\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
The second term disappears as $B \to \infty$, but the first will not go away.
  - Bootstrapping is a way to reduce the dependence among the different samples, and therefore the dependence among the predictions ⇒ a reduction in the variance.
  - Very successful method, as we will see in the following application; a short randomForest sketch follows below.
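A short randomForest sketch (simulated non-linear data, not from the lecture materials). `ntree` corresponds to $B$ and `mtry` to the number $p$ of randomly pre-selected covariates; the package default for regression is actually $\lfloor K/3 \rfloor$, so $\sqrt{K}$ is set explicitly to match the slides:

```r
library(randomForest)

set.seed(4)
n <- 500; K <- 10
x <- data.frame(matrix(runif(n * K), n, K))
y <- sin(2 * pi * x[, 1]) + (x[, 2] > 0.5) + rnorm(n, sd = 0.2)

# B = ntree bootstrap samples, p = mtry covariates tried at each split
rf <- randomForest(x, y, ntree = 500, mtry = floor(sqrt(K)))
rf                   # prints the out-of-bag estimate of the MSE
head(predict(rf))    # out-of-bag version of (1/B) * sum_b T_b(x_i)
```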
Forecasting Macroeconomic Variables
• Paper by Medeiros et al., "Forecasting Inflation in a Data-Rich Environment", Working Paper, 2018:

  "It is difficult to overemphasize the importance of forecasting inflation in rational economic decision-making. Many contracts concerning employment, sales, tenancy, and debt are set in nominal terms. Therefore, inflation forecasting is of great value to households, businesses and policymakers. In addition, central banks rely on inflation forecasts not only to inform monetary policy but also to anchor inflation expectations and thus enhance policy efficacy." (p. 1)

• In the literature, improving forecast accuracy beyond simple benchmark models such as the AR has proven to be a challenge.
• The paper shows that machine learning and big data do just that: a dramatic improvement in the capacity to forecast future inflation.
• Data: FRED-MD database
  - Full sample: January 1960 to December 2015, 672 observations
  - Out-of-sample window: January 1990 to December 2015
  - 122 variables with observations in all sample periods
  - Considering four lags, as well as four autoregressive terms: 508 potential predictors
• Benchmark series: inflation, $\pi_t = \log(P_t) - \log(P_{t-1})$, for a price level $P_t$.
[Figure: inflation series over the sample]
• Consider the following (general) model:
$$\pi_{t+h} = G_h(x_t) + u_{t+h}$$
where $h$ is the forecast horizon, $x_t$ is a vector of covariates (possibly including lags of $\pi_t$), and $G_h(\cdot)$ is the mapping between the covariates and inflation.
• Models: a vast array
  1. Random walk: $\hat\pi_{t+h} = \pi_t$
  2. Autoregressive of $p$-th order: $\hat\pi_{t+h} = \hat\phi_0 + \hat\phi_1\pi_t + \cdots + \hat\phi_p\pi_{t-p+1}$
  3. Shrinkage models: Lasso, Adaptive Lasso, Ridge, Elastic Net
  4. Random Forest (RF)
  (among many others)
• Evaluate the predictive power by the out-of-sample criteria
$$\text{MSE}_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} \hat e_{t,m,h}^2 \qquad \text{MAE}_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} |\hat e_{t,m,h}|$$
for model $m$ and forecast horizon $h$, with $\text{RMSE} = \sqrt{\text{MSE}}$. A rolling-window sketch of this kind of evaluation follows below.
• Result: the Random Forest easily outperforms all other methods, across all forecast horizons, by 20-25%. Many tests, subsamples, etc. in the paper.
[Table: out-of-sample forecast comparison across models and horizons, from Medeiros et al.]
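To make the evaluation concrete, here is a hedged sketch of a rolling out-of-sample comparison between an AR(4) benchmark and a Lasso with many predictors. The names `pi_t` and `X` are placeholders for the inflation series and the predictor matrix (the actual FRED-MD handling in the paper is more involved), and `T0` is assumed large enough for estimation:

```r
library(glmnet)

oos_rmse <- function(pi_t, X, T0, h = 1) {
  Tn <- length(pi_t)
  e_ar <- e_lasso <- rep(NA, Tn)
  for (t in T0:(Tn - h)) {
    # AR(4) benchmark, estimated on data up to time t
    ar_fit <- ar.ols(pi_t[1:t], order.max = 4, aic = FALSE)
    e_ar[t + h] <- pi_t[t + h] - predict(ar_fit, n.ahead = h)$pred[h]
    # Lasso: regress pi_{t+h} on x_t using all data available at t
    ls_fit <- cv.glmnet(X[1:(t - h), ], pi_t[(1 + h):t])
    e_lasso[t + h] <- pi_t[t + h] -
      predict(ls_fit, newx = X[t, , drop = FALSE], s = "lambda.min")
  }
  c(rmse_ar    = sqrt(mean(e_ar^2,    na.rm = TRUE)),
    rmse_lasso = sqrt(mean(e_lasso^2, na.rm = TRUE)))
}
```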
Lasso Theory (optional material)

The model selection problem
• We framed our machine learning problem as one of model selection.
• Our problem is to select among $K$ variables,
$$y_i = \alpha + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K + \varepsilon_i = \alpha + x_i\beta + \varepsilon_i$$
where $K$ is very large, and possibly $K > N$.
• In stacked notation,
$$y = X\beta + \varepsilon$$
where $y$ is $N \times 1$, $X$ is $N \times K$ and $\varepsilon$ is $N \times 1$.
• For example, $y = X\beta + \varepsilon$ where $K = 75$ and $N = 100$, with $\beta_1 = \cdots = \beta_5 = 1$ and $\beta_6 = \cdots = \beta_{75} = 0$.
  - The econometrician has $K = 75$ covariates, but only 5 are relevant ($\beta_j \ne 0$) and explain the outcomes.
  - Unfortunately, ex ante the econometrician doesn't know which are the relevant ($\beta_j \ne 0$) and irrelevant ($\beta_j = 0$) covariates.
  - As we saw in the last lecture, including all $K = 75$ will lead to poor predictive performance, so this is not really an option.
  - I.e., this is a problem of model selection.

The support
• Define the support $S$ of the model as the set of relevant covariates, i.e., the set of $j$ for which the true $\beta_j$ is not zero.
• In the example above, $S = \{1, 2, 3, 4, 5\}$ because the true $\beta_1, \ldots, \beta_5$ are non-zero, and all other true $\beta$'s are zero.
• If $S$ were known, the econometrician would have regressed $y$ on $x_1, x_2, x_3, x_4$ and $x_5$. Denote that regression as
$$y = X_S\beta + \varepsilon$$
where $X_S$ selects the columns of $X$ in the set $S$. This is called the Oracle regression.
• The OLS estimator for the Oracle regression is then
$$\hat\beta_{\text{oracle}} = (X_S'X_S)^{-1}X_S'y$$

Norms
• Define the $\ell_p$ norm of the $v \times 1$ vector $\nu$ as
$$\|\nu\|_p = (|\nu_1|^p + \cdots + |\nu_v|^p)^{1/p}$$
• In particular, we care about the $\ell_1$ and $\ell_2$ norms,
$$\|\nu\|_1 = |\nu_1| + \cdots + |\nu_v| = \sum_{i=1}^{v}|\nu_i| \qquad \|\nu\|_2 = \left(|\nu_1|^2 + \cdots + |\nu_v|^2\right)^{1/2} = \sqrt{\sum_{i=1}^{v}\nu_i^2} = \sqrt{\nu'\nu}$$
... and the $\ell_0$ norm, defined as the number of non-zero elements of $\nu$:
$$\|\nu\|_0 = \text{number of non-zero elements of } \nu$$
• This notation is useful because it allows us to write the Lasso problem more concisely:
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right]$$
where
$$\sum_{i=1}^{N}(y_i - x_i\beta)^2 = \|y - X\beta\|_2^2 \qquad \text{and} \qquad \sum_{j=1}^{K}|\beta_j| = \|\beta\|_1$$
so the Lasso problem may also be written in the concise notation
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
A simulation of the running example follows below.
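The running example ($N = 100$, $K = 75$, five true non-zeros) can be simulated directly; a sketch comparing the Lasso's estimated support with the infeasible Oracle regression (illustrative only, not the lecture's code):

```r
library(glmnet)

set.seed(5)
N <- 100; K <- 75
X <- matrix(rnorm(N * K), N, K)
beta <- c(rep(1, 5), rep(0, K - 5))      # true support S = {1,...,5}
y <- drop(X %*% beta) + rnorm(N)

fit  <- cv.glmnet(X, y, alpha = 1)
bhat <- as.numeric(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
which(bhat != 0)                         # estimated support; compare with 1:5

oracle <- lm(y ~ X[, 1:5] - 1)           # infeasible: uses the true S
coef(oracle)                             # close to (1,1,1,1,1)
```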
• There is a solid and deep theory behind the Lasso and Adaptive Lasso.
• The derivations make use of high-level mathematics and look nothing like standard asymptotic theory... and full derivations are beyond the scope of this module.
• Yet, we will derive the properties of the Lasso under the following special conditions:
  a. The average of $y_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} y_i = 0$
  b. The average of $x_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} x_i = 0$
  c. The variance of each covariate is 1 and the covariates are uncorrelated, so
$$X'X = I_K \;\Rightarrow\; (X'X)^{-1} = I_K$$
where $I_K$ is the identity matrix.
Clearly, our interest is in deriving properties of the Lasso estimator under more general assumptions, but working out this "simple" case clarifies properties that hold more generally.
• Working out the theory allows us to derive and clarify the properties of the estimator: what it achieves and what it does not.

Three types of objectives
1. Prediction loss: the true model is $y = X\beta + \varepsilon$. Since $\varepsilon$ is random noise and cannot be predicted, we focus on $X\beta$. If $\hat\beta$ is the Lasso estimator, the Lasso prediction is $X\hat\beta$, so the prediction error is the vector $X\beta - X\hat\beta = X(\beta - \hat\beta)$. Take the $\ell_2$ norm $\|X(\beta - \hat\beta)\|_2^2$ as the metric of prediction loss. The objective is to show that the prediction loss is small.
2. Parameter estimation: the estimated $\hat\beta$ is close to the true parameter $\beta$. The objective is to show that
$$\|\beta - \hat\beta\|_2^2 = \sum_{j=1}^{K}(\beta_j - \hat\beta_j)^2$$
is small.
3. Model selection: the true model is correctly selected. Also phrased in terms of the "support" of $\beta$: the support function maps positive entries into 1, negative entries into -1, and 0 into 0. For example,
$$\mathrm{supp}\begin{pmatrix} 3.8 \\ 0 \\ -1.1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$$
The objective is to show that
$$P\left\{\mathrm{supp}(\beta) = \mathrm{supp}(\hat\beta)\right\}$$
is as close as possible to 1.

• We will use some additional tools to derive the properties of the Lasso estimator: some inequalities, and the subgradient.

Inequalities
• Triangle inequality: if $u$ and $v$ are vectors,
$$\|u + v\| \le \|u\| + \|v\|$$
which applies in particular to the $\ell_2$ norm, $\|u + v\|_2 \le \|u\|_2 + \|v\|_2$.
• Reverse triangle inequality:
$$\|u - v\| \ge \|u\| - \|v\|$$
with the variation
$$\|u + v\| = \|u - (-v)\| \ge \|u\| - \|-v\| = \|u\| - \|v\|$$

Subgradient
• The Lasso objective function,
$$\hat\beta = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right] = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
is non-differentiable at 0 because of the $|\beta_j|$ component.
[Figure: the function $f(x) = |x|$, with a kink at zero]
• The lack of differentiability prevents us from finding analytical solutions of the Lasso estimator with ordinary first-order conditions.
• Subgradient: a subgradient of a convex function $f$ at a point $x_0$ is a scalar $g$ such that
$$f(x) - f(x_0) \ge g(x - x_0) \quad \text{for all } x$$
• We define the set of subderivatives as the interval $[a, b]$, where $a$ and $b$ are the one-sided limits
$$a = \lim_{x \to x_0^-} \frac{f(x) - f(x_0)}{x - x_0} \qquad b = \lim_{x \to x_0^+} \frac{f(x) - f(x_0)}{x - x_0}$$
• For the absolute-value function $f(x) = |x|$, the subderivative is
$$\partial f(x) = \begin{cases} -1 & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
• For simplicity, let $K = 1$ and work out the first-order conditions for the Lasso objective function
$$f(\beta) = \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda|\beta_1|$$
• If $\hat\beta_{\text{lasso}} > 0$,
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N}(y_i - x_i\hat\beta_{\text{lasso}})x_i + \lambda = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i x_i - \hat\beta_{\text{lasso}}\sum_{i=1}^{N} x_i^2 - \frac{\lambda}{2} = 0$$
• The expression above implies that
$$\hat\beta_{\text{lasso}} = \underbrace{\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}}_{\text{OLS estimator}} - \underbrace{\frac{\lambda}{2}\frac{1}{\sum_{i=1}^{N} x_i^2}}_{\text{downward bias}} = \hat\beta_{\text{ols}} - \frac{\lambda}{2}$$
(using $\sum_{i=1}^{N} x_i^2 = 1$ from assumption c), which only applies if $\hat\beta_{\text{lasso}} > 0$.
• If $\hat\beta_{\text{lasso}} < 0$, then
$$\hat\beta_{\text{lasso}} = \underbrace{\hat\beta_{\text{ols}}}_{\text{OLS estimator}} + \underbrace{\frac{\lambda}{2}}_{\text{upward bias}}$$
which only applies if $\hat\beta_{\text{lasso}} < 0$.
• Shrinkage estimator: in either case, the non-zero estimates are biased towards zero.
• What happens if $\hat\beta_{\text{ols}}$ is small? Example: say, for instance, that $\hat\beta_{\text{ols}} = 0.1$ and $\lambda = 1$.
  - If $\hat\beta_{\text{lasso}} > 0$, then $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}} - \frac{\lambda}{2} = 0.1 - \frac{1}{2} = -0.4$ (contradiction).
  - If $\hat\beta_{\text{lasso}} < 0$, then $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}} + \frac{\lambda}{2} = 0.1 + \frac{1}{2} = 0.6$ (contradiction).
  - Then $\hat\beta_{\text{lasso}}$ must be exactly equal to zero.
• Overall,
$$\hat\beta_{\text{lasso}} = \begin{cases} \hat\beta_{\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
which means that the Lasso estimator selects the coefficient as exactly zero, depending on the penalization parameter. This soft-thresholding rule is sketched in code below.
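The piecewise solution is the soft-thresholding function; a short R implementation of the rule just derived (with the $\lambda/2$ threshold), checked against the $\hat\beta_{\text{ols}} = 0.1$, $\lambda = 1$ example:

```r
# Closed-form Lasso solution for a single (or orthonormal) regressor
soft_threshold <- function(b_ols, lambda) {
  ifelse(abs(b_ols) > lambda / 2,
         b_ols - sign(b_ols) * lambda / 2,  # non-zero estimates shrunk by lambda/2
         0)                                 # small OLS estimates set exactly to 0
}

soft_threshold(0.1, lambda = 1)              # 0, as in the example above
soft_threshold(c(-1, 0.3, 0.8), lambda = 1)  # -0.5, 0.0, 0.3
```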
• The piecewise solution also means that:
  - If $\lambda \to \infty$, $\hat\beta_{\text{lasso}} = 0$
  - If $\lambda = 0$, $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}}$
  - For $\lambda > 0$, the estimate of $\beta_1$ might be zero or not
[Figure: the Lasso estimate as a function of the OLS estimate, flat at zero on $[-\lambda/2, \lambda/2]$]
• In the $K = 1$ case, we had that
$$\hat\beta_{\text{lasso}} = \begin{cases} \hat\beta_{\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{\text{ols}} = (X'X)^{-1}X'y = X'y$.
• For any value of $K$, it can be shown that the same holds coordinate by coordinate:
$$\hat\beta_{j,\text{lasso}} = \begin{cases} \hat\beta_{j,\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{j,\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{j,\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{j,\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{j,\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{\text{ols}} = (X'X)^{-1}X'y = X'y$ and $\hat\beta_{j,\text{ols}} = x_j'y$.
• Since $\hat\beta_{j,\text{ols}} = x_j'y$, the Lasso solution for any $K$ can be summarized as
$$\hat\beta_{j,\text{lasso}} = \begin{cases} x_j'y - \mathrm{sign}(x_j'y)\cdot\frac{\lambda}{2} & \text{if } |x_j'y| > \frac{\lambda}{2} \\ 0 & \text{if } |x_j'y| \le \frac{\lambda}{2} \end{cases}$$
(a numerical illustration under an orthonormal design follows below).
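A numerical illustration under an orthonormal design (illustrative: $X$ is built via a QR factorization so that $X'X = I_K$ exactly, making $\hat\beta_{j,\text{ols}} = x_j'y$). Note how the small non-zero coefficient is shrunk all the way to zero, anticipating the beta-min discussion below:

```r
set.seed(6)
N <- 100; K <- 10
X <- qr.Q(qr(matrix(rnorm(N * K), N, K)))   # orthonormal columns: X'X = I_K
beta <- c(2, -1.5, 0.3, rep(0, K - 3))
y <- drop(X %*% beta) + rnorm(N, sd = 0.1)

soft <- function(b, l) ifelse(abs(b) > l / 2, b - sign(b) * l / 2, 0)

b_ols   <- drop(t(X) %*% y)                 # coordinate-wise OLS: x_j'y
b_lasso <- soft(b_ols, l = 1)               # coordinate-wise soft-thresholding
rbind(truth = beta, ols = round(b_ols, 2), lasso = round(b_lasso, 2))
```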
• The Lasso solution clarifies that $\hat\beta_{j,\text{lasso}}$ will be zero if $|x_j'y| \le \frac{\lambda}{2}$ and non-zero otherwise.
• Model selection refers to the capacity to correctly estimate the support. That is:
  - If the true parameter is zero ($\beta_j = 0$), a zero is estimated ($\hat\beta_{j,\text{lasso}} = 0$), which happens if $|x_j'y| \le \frac{\lambda}{2}$.
  - If the true parameter is non-zero ($\beta_j \ne 0$), a non-zero is estimated ($\hat\beta_{j,\text{lasso}} \ne 0$), which happens if $|x_j'y| > \frac{\lambda}{2}$.
• If $\beta_j = 0$, when will $|x_j'y| \le \frac{\lambda}{2}$?
  - Substituting $y = X\beta + \varepsilon$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon|$$
where the latter equality is due to the fact that the covariance between columns $j_1$ and $j_2$ of $X$ is zero if $j_1 \ne j_2$ and one if $j_1 = j_2$.
  - Define the set $\mathcal{C}$ that limits the correlation of $X$ with the disturbance vector $\varepsilon$: on the set $\mathcal{C}$,
$$\max_j |x_j'\varepsilon| \le \frac{\lambda}{2}$$
To clarify, $\max_j |x_j'\varepsilon|$ is the largest covariance (in absolute value) between $\varepsilon$ and any column of $X$, i.e., $\max\{|\mathrm{cov}(\varepsilon, x_1)|, \ldots, |\mathrm{cov}(\varepsilon, x_K)|\}$, and the condition is less likely to hold if $\lambda$ is small.
  - If $\beta_j = 0$, then on the set $\mathcal{C}$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| = |x_j'\varepsilon| \le \frac{\lambda}{2}$$
  - The last equation means: on the set $\mathcal{C}$, if $\beta_j = 0$ then $|x_j'y| \le \frac{\lambda}{2}$, which implies $\hat\beta_{j,\text{lasso}} = 0$. In other words, true zeros are always selected as zeros! (on the set $\mathcal{C}$)
• If $\beta_j \ne 0$, when will $|x_j'y| > \frac{\lambda}{2}$?
  - If $\beta_j \ne 0$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| \ge |\beta_j| - |x_j'\varepsilon|$$
using the reverse triangle inequality. On the set $\mathcal{C}$,
$$|x_j'y| \ge |\beta_j| - |x_j'\varepsilon| \ge |\beta_j| - \frac{\lambda}{2}$$
  - On the other hand, $\hat\beta_{j,\text{lasso}}$ is non-zero if $|x_j'y| > \frac{\lambda}{2}$.
  - Given that $|x_j'y| \ge |\beta_j| - \frac{\lambda}{2}$, the condition $|x_j'y| > \frac{\lambda}{2}$ is surely satisfied if
$$|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$$
  - The condition $|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$ is equivalent to $|\beta_j| > \lambda$. So $\hat\beta_{j,\text{lasso}}$ will be non-zero as long as $|\beta_j| > \lambda$.
  - This also means that all non-zero $\beta_j$ will be estimated as non-zero as long as the smallest non-zero coefficient is greater than the penalization parameter $\lambda$:
$$\min_{j:\,\beta_j \ne 0} |\beta_j| > \lambda$$
(this is also known as the beta-min condition).
  - Small but non-zero $\beta_j$ will be shrunk to zero.

Takeaway from model selection
• $\lambda$ too low.
  Intuition: many true zeros ($\beta_j = 0$) are not estimated as zeros because the penalization is too small, and the Lasso is not able to achieve model selection.
  Theory: the set $\mathcal{C}$, defined by $\max_j |x_j'\varepsilon| \le \frac{\lambda}{2}$, is unlikely to hold, and therefore the model selection results do not apply.
• $\lambda$ too high.
  Intuition: true non-zeros ($\beta_j \ne 0$) are shrunk to zero because the penalization is too high, and the Lasso is not able to achieve model selection.
  Theory: the set $\mathcal{C}$ is likely to hold, but the beta-min condition $\min_{j:\beta_j\ne 0}|\beta_j| > \lambda$ is not likely to, so model selection is not achieved.
• $\lambda$ somewhere in the middle.
  Intuition: some $\hat\beta_{j,\text{lasso}}$ are zero, and some non-zero.
  Theory: the set $\mathcal{C}$ is likely to hold, and so is the beta-min condition $\min_{j:\beta_j\ne 0}|\beta_j| > \lambda$, so model selection is achieved.

• We have discussed model selection, which is not equivalent to the distance between $\beta$ and $\hat\beta_{\text{lasso}}$.
• On the set $\mathcal{C}$, the Lasso estimator is zero on $S^c$, the complement of the support. Therefore
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{S,\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{\text{oracle}} + \hat\beta_{\text{oracle}} - \hat\beta_{S,\text{lasso}}\|_2$$
• Using the triangle inequality, and recalling that under the orthonormal design $\hat\beta_{j,\text{oracle}} = x_j'y$ and $\hat\beta_{j,\text{lasso}} = x_j'y - \mathrm{sign}(x_j'y)\frac{\lambda}{2}$ for $j \in S$,
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 \le \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \|\hat\beta_{\text{oracle}} - \hat\beta_{S,\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \underbrace{\sqrt{\sum_{j \in S}\left(\mathrm{sign}(x_j'y)\tfrac{\lambda}{2}\right)^2}}_{\frac{\lambda}{2}\sqrt{s}}$$
where $s$ is the number of elements in $S$.
• So the cost of not knowing $S$ is $\frac{\lambda}{2}\sqrt{s}$.
• The result
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 \le \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \frac{\lambda}{2}\sqrt{s}$$
is also called an Oracle Inequality, because it bounds the error in the parameters by the error that would have been obtained if the true support $S$ were known, plus an additional term.
  - It also highlights that the smaller the support size $s$, the smaller the error.
  - Concept of sparsity: most $\beta_j$ are equal to zero, so $s$ is small.
• Prediction loss: under assumption c ($X'X = I_K$),
$$\|X(\beta - \hat\beta_{\text{lasso}})\|_2 = \sqrt{(\beta - \hat\beta_{\text{lasso}})'X'X(\beta - \hat\beta_{\text{lasso}})} = \sqrt{(\beta - \hat\beta_{\text{lasso}})'(\beta - \hat\beta_{\text{lasso}})} = \|\beta - \hat\beta_{\text{lasso}}\|_2$$
so the prediction loss inherits the same bound. A numerical check of the oracle inequality follows below.
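Finally, a numeric check of the oracle inequality in the same orthonormal setup (a sketch under the module's simplifying assumptions, with the $\lambda/2$ threshold used throughout):

```r
set.seed(7)
N <- 100; K <- 20; s <- 4; lambda <- 1
X <- qr.Q(qr(matrix(rnorm(N * K), N, K)))   # X'X = I_K
beta <- c(rep(2, s), rep(0, K - s))         # support S = {1,...,4}
y <- drop(X %*% beta) + rnorm(N, sd = 0.1)

b_ols    <- drop(t(X) %*% y)                # beta_j,ols = x_j'y
b_lasso  <- ifelse(abs(b_ols) > lambda / 2,
                   b_ols - sign(b_ols) * lambda / 2, 0)
b_oracle <- drop(t(X[, 1:s]) %*% y)         # oracle OLS on the true support

lhs <- sqrt(sum((beta - b_lasso)^2))
rhs <- sqrt(sum((beta[1:s] - b_oracle)^2)) + (lambda / 2) * sqrt(s)
c(lhs = lhs, rhs = rhs)                     # lhs <= rhs holds
```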