Machine Learning and Big Data: Beyond the Lasso
EC340 - Topics in Applied Economics
Dr Pedro CL Souza, University of Warwick Economics Department, AT 2019/20

Contents
1. Regularization Methods: $\ell_0$, $\ell_1$ and $\ell_2$
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)

Regularization Methods: $\ell_0$, $\ell_1$ and $\ell_2$

• In the previous lecture we defined the Lasso estimator
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} |\beta_j| \right]$$
and the Adaptive Lasso version
$$\hat\beta_{\text{adalasso}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \omega_j |\beta_j| \right]$$
where $\omega_j = |\hat\beta_{j,\text{lasso}}|^{-1}$.
• The Lasso and Adaptive Lasso fit into a broader category of penalized estimation methods.
• Indeed, the Akaike and Schwarz criteria can also be seen as penalized methods. In the previous lecture, we defined
$$\text{AIC} = \ln\left( \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i\beta)^2 \right) + \frac{2(k+1)}{n} \qquad \text{BIC} = \ln\left( \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i\beta)^2 \right) + \frac{(k+1)\ln n}{n}$$
where $k$ is the number of variables included in the model. Since, in this case, no estimated parameter is zero, $k = \|\beta\|_0$. So this can also be seen as $\ell_0$ penalization, but as we showed it is computationally infeasible.
• Ridge regression: $\ell_2$ penalization,
$$\hat\beta_{\text{ridge}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \beta_j^2 \right]$$
The apparently minor difference ($\ell_1$ penalization becomes $\ell_2$) has a profound impact on the properties of the estimator.
• The Ridge regression has been shown to handle highly correlated regressors well, even very close to multicollinearity.
• Define the objective function
$$f(\beta) = \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \beta_j^2$$
With $K = 1$, the first-order condition is
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N} (y_i - x_i\hat\beta_{\text{ridge}})x_i + 2\lambda\hat\beta_{\text{ridge}} = 0$$
• Working out the first-order condition,
$$\sum_{i=1}^{N} y_i x_i - \hat\beta_{\text{ridge}} \sum_{i=1}^{N} x_i^2 - \lambda\hat\beta_{\text{ridge}} = 0 \;\Rightarrow\; \hat\beta_{\text{ridge}} = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda}$$
• Since $\lambda \ge 0$,
$$|\hat\beta_{\text{ridge}}| = \left|\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda}\right| \le \left|\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}\right| = |\hat\beta_{\text{ols}}|$$
and therefore $\hat\beta_{\text{ridge}}$ is also biased towards zero.
• However, $\hat\beta_{j,\text{ridge}}$ is non-zero for every $j$ ⇒ no element of model selection.
• Elastic Net: combining the $\ell_1$ and $\ell_2$ penalties,
$$\hat\beta_{\text{enet}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda_1 \sum_{j=1}^{K} |\beta_j| + \lambda_2 \sum_{j=1}^{K} \beta_j^2 \right]$$
and the adaptive version
$$\hat\beta_{\text{adaenet}} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda_1 \sum_{j=1}^{K} \omega_j|\beta_j| + \lambda_2 \sum_{j=1}^{K} \beta_j^2 \right]$$
where $\omega_j = 1/|\hat\beta_{j,\text{enet}}|$. A short sketch comparing these penalties follows below.
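A minimal R sketch of the three penalties (an illustration, not part of the original slides or of beyondLasso.Rmd). The data are simulated with highly correlated regressors, where Ridge is expected to do well. Note that glmnet parameterizes the Elastic Net with a single penalty $\lambda$ and a mixing weight `alpha`, rather than separate $\lambda_1$ and $\lambda_2$: `alpha = 1` gives the Lasso, `alpha = 0` Ridge, and intermediate values the Elastic Net.

```r
library(glmnet)
library(MASS)  # for mvrnorm

set.seed(1)
N <- 200; K <- 10
Sigma <- 0.95^abs(outer(1:K, 1:K, "-"))      # highly correlated regressors
x <- mvrnorm(N, mu = rep(0, K), Sigma = Sigma)
y <- drop(x %*% c(1, -1, rep(0, K - 2))) + rnorm(N)

fit_lasso <- cv.glmnet(x, y, alpha = 1)      # l1 penalty: sparse solution
fit_ridge <- cv.glmnet(x, y, alpha = 0)      # l2 penalty: shrinks, never zeroes
fit_enet  <- cv.glmnet(x, y, alpha = 0.5)    # elastic net: mix of both

# Ridge keeps every coefficient non-zero; Lasso and Elastic Net set many to zero
colSums(cbind(
  lasso = as.numeric(coef(fit_lasso, s = "lambda.min")) != 0,
  ridge = as.numeric(coef(fit_ridge, s = "lambda.min")) != 0,
  enet  = as.numeric(coef(fit_enet,  s = "lambda.min")) != 0))
```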
• Finally, penalized likelihood methods are very useful and find many applications, including textual analysis.
• The Lasso estimator, again,
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \underbrace{\sum_{i=1}^{N} (y_i - x_i\beta)^2}_{\text{sum of squared residuals}} + \lambda\|\beta\|_1$$
is suitable for continuous outcomes $y_i$.
• Penalized likelihood,
$$\hat\beta_{\text{penlik}} = \arg\min_\beta \underbrace{-\ln \mathcal{L}(\beta)}_{\text{negative log-likelihood}} + \lambda\|\beta\|_1$$
is suitable for, for example, binary outcomes $y_i$.
• Why likelihood?
  - Example: logistic regression and limited dependent variable models.
  - Suppose that we were looking for causes that determine employment ($y_i = 1$) or unemployment ($y_i = 0$).
  - Many candidate explanatory variables $x_i$, which can include age, gender, education, parents' education, ...
  - Logit model:
$$P(y_i = 1 | x_i) = G(\beta_0 + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K) = G(x_i\beta), \qquad G(z) = \frac{\exp(z)}{1 + \exp(z)}$$
  - The likelihood of the logit model is
$$\mathcal{L}(\beta) = \prod_{i=1}^{N} [G(x_i\beta)]^{y_i} [1 - G(x_i\beta)]^{1 - y_i}$$
and the log-likelihood is
$$\ln \mathcal{L}(\beta) = \sum_{i=1}^{N} y_i \ln[G(x_i\beta)] + (1 - y_i)\ln[1 - G(x_i\beta)] = \sum_{i: y_i = 1} \ln[G(x_i\beta)] + \sum_{i: y_i = 0} \ln[1 - G(x_i\beta)]$$
  - Fitting the standard logit model is feasible if $K$ is small. If $K$ is large, estimates are very volatile and lead to poor predictive performance ⇒ penalized likelihood solves this issue.
  - Application: analysis of text such as Facebook posts, tweets, speeches, Central Banks' reports. A sketch of a penalized logit fit follows below.

▶ See beyondLasso.Rmd, sections A-C ◀
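A hedged sketch of an $\ell_1$-penalized logit in R (simulated data; the lecture's own code lives in beyondLasso.Rmd and is not reproduced here). `glmnet` with `family = "binomial"` minimizes the negative log-likelihood plus the $\lambda\|\beta\|_1$ penalty:

```r
library(glmnet)

set.seed(2)
N <- 500; K <- 200
x <- matrix(rnorm(N * K), N, K)     # many candidate predictors
p <- plogis(x[, 1] - x[, 2])        # G(z) = exp(z)/(1+exp(z)); only 2 matter
y <- rbinom(N, 1, p)                # binary outcome, e.g. employed / not

# Penalized likelihood: -logLik(beta) + lambda * ||beta||_1
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
sum(coef(fit, s = "lambda.min") != 0)                   # few coefficients survive
predict(fit, newx = x[1:5, ], s = "lambda.min", type = "response")
```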
Forest and Trees

Regression Trees
• Regression trees fall outside the scope of regularized methods, but are very useful for prediction and classification problems.
• Very flexible, and they inherently incorporate non-linearities, while the previous methods were linear.
• The starting point for the Lasso and its varieties was the linear model
$$y_i = x_i\beta + \varepsilon_i$$
and for the logistic regression, the probability still depends on $x_i$ only through the linear index $x_i\beta$,
$$P(y_i = 1|x_i) = G(x_i\beta)$$
for a $K \times 1$ vector $\beta$, where $K$ is large.
• Non-linearities can take many forms:
  - Approximately linear: $y_i = \alpha + x_{i1}\beta_{11} + x_{i1}^2\beta_{12} + \cdots + x_{iK}\beta_{K1} + x_{iK}^2\beta_{K2} + \varepsilon_i$
  - Piecewise linear: $y_i = f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_K(x_{iK}) + \varepsilon_i$
  - Strongly non-linear: $y_i = f(x_{i1}, \ldots, x_{iK}) + \varepsilon_i$
  ... and the econometrician never knows the true model.
• Regression Trees are a very flexible, inherently non-linear model, while still parsimonious.
• To simplify, consider $K = 2$, so there are two predictors, $x_{i1}$ and $x_{i2}$, both belonging to the unit interval $[0, 1]$.
• Start the process by letting the prediction anywhere in the unit square be $\hat c$. If we are minimizing the mean squared prediction error, the prediction is $\hat c = \bar y$.
• The prediction can possibly be made better if we split the square into two regions $R_1$ and $R_2$, say
$$R_1 = \{(x_{i1}, x_{i2}) : x_{i1} \le t_1\} \qquad R_2 = \{(x_{i1}, x_{i2}) : x_{i1} > t_1\}$$
• The best prediction (in the sense of minimizing the mean squared prediction error) within $R_1$ is the average of $y$ conditional on values of $x_{i1}$ and $x_{i2}$ that satisfy $R_1$. We define this prediction as $\hat c_1$; similarly for $R_2$ with prediction $\hat c_2$.
• We can keep on partitioning the areas and conditioning the predictions; for example, with five partitions we obtain predictions $\hat c_1, \ldots, \hat c_5$.
[Figure: recursive partition of the unit square into five regions, from Chapter 9 of Hastie, Tibshirani and Friedman, "The Elements of Statistical Learning", Springer 2009]
• Overall, given a partition $\{R_1, \ldots, R_M\}$, the prediction is
$$\hat y = \sum_{m=1}^{M} \hat c_m \cdot I\{(x_{i1}, x_{i2}) \in R_m\}$$
where $I$ is the indicator function. More generally, for $K > 2$ predictors,
$$\hat y = \sum_{m=1}^{M} \hat c_m \cdot I(x_i \in R_m)$$
• How do we select the partitions $R_m$, and how many of them ($M$) should there be?
• Step-wise algorithm:
  1. Start with the full data and a single region $R_1$, so $\hat y = \bar y$.
  2. Find the variable $j$ and the split point $s$ such that the prediction error is minimized:
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat c_2)^2$$
  3. Keep splitting until $M$ partitions are obtained.
• The algorithm is automated and uses objective criteria, but it is computationally heavy, although not infeasible.
• How to select $M$?
  - The prediction error for each partition $m$ is
$$Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$$
where $N_m = \#\{x_i \in R_m\}$ is the number of observations that fall within $R_m$, and $\hat c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i$ is the prediction in partition $R_m$.
  - The greater $M$, the larger the tree: risk of overfitting the data, with consequences similar to those of the linear regression we saw in lecture 1. But a very small tree might not capture important features of the data.
  - Define a criterion that balances the two objectives,
$$C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M$$
where $\alpha$ governs the trade-off between large and small trees.
  - There are automatic methods that pick $\alpha$ to maximize the performance of the algorithm; a sketch with the rpart package follows below.
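A minimal sketch with the rpart package, which implements this kind of recursive partitioning for regression trees (illustrative data, not from the lecture materials; rpart's complexity parameter `cp` plays the role of $\alpha$ in the cost-complexity criterion above):

```r
library(rpart)

set.seed(3)
n <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- ifelse(x1 <= 0.5, 1, 2 + 3 * (x2 > 0.4)) + rnorm(n, sd = 0.3)
dat <- data.frame(y, x1, x2)

# method = "anova" gives a regression tree; larger cp => smaller tree
tree <- rpart(y ~ x1 + x2, data = dat, method = "anova",
              control = rpart.control(cp = 0.01))
printcp(tree)            # cross-validated error for each candidate subtree
predict(tree, dat)[1:5]  # fitted c_m for the region each observation falls in
```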
Random Trees
• Random Trees: a small modification of the regression trees.
• Step-wise algorithm, random-tree version:
  1. Start with the full data and a single region $R_1$, so $\hat y = \bar y$.
  2a. Randomly select a subset of $p \le K$ covariates as candidates for splitting (a usual value is $p = \sqrt{K}$).
  2b. Find the variable $j$ among those selected in step 2a and the split point $s$ such that the prediction error is minimized:
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat c_2)^2$$
  3. Keep splitting until $M$ partitions are obtained.
• Random Trees may seem counter-intuitive. Why add the random pre-selection of the covariates in step 2a?
  - On its own, the random step 2a might deteriorate the performance, because the search for variables (and split points) in step 2b is conducted on a smaller subset of the covariates.
  - The idea is to generate many random trees and combine their predictions.
  - Many random trees = random forest.

Random Forest
• Random Forest algorithm, version 1:
  1. Follow steps 1-3 of the random tree algorithm $B$ times. For each $b$, let $T_b(x_i)$ be the prediction of the $b$-th tree, giving $\{T_1(x_i), \ldots, T_B(x_i)\}$.
  2. Average the predictions of the $B$ individual random trees for the final output:
$$\hat y_i = \frac{1}{B} \sum_{b=1}^{B} T_b(x_i)$$
  2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output.
• The Random Forest actually goes a step further.
• Bootstrap: a procedure whereby a random subsample is drawn from the original data.
  - In the cross-sectional case, for a dataset of size $N$, the bootstrapped sample is constructed by sampling $N$ observations from the original data with replacement, each with probability $\frac{1}{N}$. For example:
    Original data: $(y_1, x_{11}, x_{12}), (y_2, x_{21}, x_{22}), (y_3, x_{31}, x_{32}), \ldots, (y_N, x_{N1}, x_{N2})$
    Bootstrap data: $(y_7, x_{71}, x_{72}), (y_9, x_{91}, x_{92}), (y_1, x_{11}, x_{12}), \ldots, (y_9, x_{91}, x_{92})$
  - Note that, since the sampling is done with replacement, the same observation may occur more than once in the bootstrapped sample.
  - In the time-series case, sampling the observations in this way would destroy the dependence over time. Instead, observations are sampled in blocks of consecutive observations: for example, split $y_1, \ldots, y_9$ into blocks such as $(y_1, y_2, y_3)$, $(y_4, y_5, y_6)$, $(y_7, y_8, y_9)$, and resample whole blocks with replacement, so the bootstrapped series might be $y_5, y_6, y_7, y_2, y_3, y_4, y_7, y_8, y_9$, with blocks possibly repeating.
• Random Forest algorithm, version 2:
  1. Draw a bootstrap sample from the original data, and follow steps 1-3 of the random tree algorithm. Repeat the process $B$ times. For each $b$, let $T_b(x_i)$ be the prediction of the $b$-th tree, giving $\{T_1(x_i), \ldots, T_B(x_i)\}$.
  2. Average the predictions of the $B$ individual random trees for the final output:
$$\hat y_i = \frac{1}{B} \sum_{b=1}^{B} T_b(x_i)$$
  2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output.
• Why add the bootstrap and random tree steps?
  - The reason is that, perhaps counter-intuitively, it makes the averaging step more efficient.
  - Suppose that we had $B$ independent and identically distributed predictions $\hat y_{i,b}$, each with variance $\sigma^2$. The variance of the average prediction would be
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} \hat y_{i,b}\right] = \frac{1}{B^2}\sum_{b=1}^{B} \mathrm{Var}[\hat y_{i,b}] = \frac{\sigma^2}{B}$$
so the more predictions $B$, the lower the variance of the average prediction (which decreases the mean squared error).
  - However, the various predictions from the random trees are likely to be highly correlated among themselves, since they were generated from the same original set of data, so they are hardly independent.
  - If there is a positive pairwise correlation $\rho$ between the predictions $\hat y_{i,b}$, the variance of the averaged prediction can be worked out to be
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} \hat y_{i,b}\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
The second term disappears as $B \to \infty$, but the first will not go away.
  - Bootstrapping is a way to reduce the dependence among the different samples, and therefore the dependence among the predictions ⇒ a reduction in the variance.
  - Very successful method, as we will see in the following application; a short randomForest sketch follows below.
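A short randomForest sketch (simulated non-linear data, not from the lecture materials). `ntree` corresponds to $B$ and `mtry` to the number $p$ of randomly pre-selected covariates; the package default for regression is actually $\lfloor K/3 \rfloor$, so $\sqrt{K}$ is set explicitly to match the slides:

```r
library(randomForest)

set.seed(4)
n <- 500; K <- 10
x <- data.frame(matrix(runif(n * K), n, K))
y <- sin(2 * pi * x[, 1]) + (x[, 2] > 0.5) + rnorm(n, sd = 0.2)

# B = ntree bootstrap samples, p = mtry covariates tried at each split
rf <- randomForest(x, y, ntree = 500, mtry = floor(sqrt(K)))
rf                   # prints the out-of-bag estimate of the MSE
head(predict(rf))    # out-of-bag version of (1/B) * sum_b T_b(x_i)
```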
Forecasting Macroeconomic Variables
• Paper by Medeiros et al., "Forecasting Inflation in a Data-Rich Environment", Working Paper, 2018:

  "It is difficult to overemphasize the importance of forecasting inflation in rational economic decision-making. Many contracts concerning employment, sales, tenancy, and debt are set in nominal terms. Therefore, inflation forecasting is of great value to households, businesses and policymakers. In addition, central banks rely on inflation forecasts not only to inform monetary policy but also to anchor inflation expectations and thus enhance policy efficacy." (p. 1)

• In the literature, improving forecast accuracy beyond simple benchmark models such as the AR has proven to be a challenge.
• The paper shows that machine learning and big data do just that: a dramatic improvement in the capacity to forecast future inflation.
• Data: FRED-MD database
  - Full sample: January 1960 to December 2015, 672 observations
  - Out-of-sample window: January 1990 to December 2015
  - 122 variables with observations in all sample periods
  - Considering four lags, as well as four autoregressive terms: 508 potential predictors
• Benchmark series: inflation, $\pi_t = \log(P_t) - \log(P_{t-1})$, for a price level $P_t$.
[Figure: inflation series over the sample]
• Consider the following (general) model:
$$\pi_{t+h} = G_h(x_t) + u_{t+h}$$
where $h$ is the forecast horizon, $x_t$ is a vector of covariates (possibly including lags of $\pi_t$), and $G_h(\cdot)$ is the mapping between the covariates and inflation.
• Models: a vast array
  1. Random walk: $\hat\pi_{t+h} = \pi_t$
  2. Autoregressive of $p$-th order: $\hat\pi_{t+h} = \hat\phi_0 + \hat\phi_1\pi_t + \cdots + \hat\phi_p\pi_{t-p+1}$
  3. Shrinkage models: Lasso, Adaptive Lasso, Ridge, Elastic Net
  4. Random Forest (RF)
  (among many others)
• Evaluate the predictive power by the out-of-sample criteria
$$\text{MSE}_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} \hat e_{t,m,h}^2 \qquad \text{MAE}_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} |\hat e_{t,m,h}|$$
for model $m$ and forecast horizon $h$, with $\text{RMSE} = \sqrt{\text{MSE}}$. A rolling-window sketch of this kind of evaluation follows below.
• Result: the Random Forest easily outperforms all other methods, across all forecast horizons, by 20-25%. Many tests, subsamples, etc. in the paper.
[Table: out-of-sample forecast comparison across models and horizons, from Medeiros et al.]
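To make the evaluation concrete, here is a hedged sketch of a rolling out-of-sample comparison between an AR(4) benchmark and a Lasso with many predictors. The names `pi_t` and `X` are placeholders for the inflation series and the predictor matrix (the actual FRED-MD handling in the paper is more involved), and `T0` is assumed large enough for estimation:

```r
library(glmnet)

oos_rmse <- function(pi_t, X, T0, h = 1) {
  Tn <- length(pi_t)
  e_ar <- e_lasso <- rep(NA, Tn)
  for (t in T0:(Tn - h)) {
    # AR(4) benchmark, estimated on data up to time t
    ar_fit <- ar.ols(pi_t[1:t], order.max = 4, aic = FALSE)
    e_ar[t + h] <- pi_t[t + h] - predict(ar_fit, n.ahead = h)$pred[h]
    # Lasso: regress pi_{t+h} on x_t using all data available at t
    ls_fit <- cv.glmnet(X[1:(t - h), ], pi_t[(1 + h):t])
    e_lasso[t + h] <- pi_t[t + h] -
      predict(ls_fit, newx = X[t, , drop = FALSE], s = "lambda.min")
  }
  c(rmse_ar    = sqrt(mean(e_ar^2,    na.rm = TRUE)),
    rmse_lasso = sqrt(mean(e_lasso^2, na.rm = TRUE)))
}
```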
Lasso Theory (optional material)

The model selection problem
• We framed our machine learning problem as one of model selection.
• Our problem is to select among $K$ variables,
$$y_i = \alpha + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K + \varepsilon_i = \alpha + x_i\beta + \varepsilon_i$$
where $K$ is very large, and possibly $K > N$.
• In stacked notation,
$$y = X\beta + \varepsilon$$
where $y$ is $N \times 1$, $X$ is $N \times K$ and $\varepsilon$ is $N \times 1$.
• For example, $y = X\beta + \varepsilon$ where $K = 75$ and $N = 100$, with $\beta_1 = \cdots = \beta_5 = 1$ and $\beta_6 = \cdots = \beta_{75} = 0$.
  - The econometrician has $K = 75$ covariates, but only 5 are relevant ($\beta_j \ne 0$) and explain the outcomes.
  - Unfortunately, ex ante the econometrician doesn't know which are the relevant ($\beta_j \ne 0$) and irrelevant ($\beta_j = 0$) covariates.
  - As we saw in the last lecture, including all $K = 75$ will lead to poor predictive performance, so this is not really an option.
  - I.e., this is a problem of model selection.

The support
• Define the support $S$ of the model as the set of relevant covariates, i.e., the set of $j$ for which the true $\beta_j$ is not zero.
• In the example above, $S = \{1, 2, 3, 4, 5\}$ because the true $\beta_1, \ldots, \beta_5$ are non-zero, and all other true $\beta$'s are zero.
• If $S$ were known, the econometrician would have regressed $y$ on $x_1, x_2, x_3, x_4$ and $x_5$. Denote that regression as
$$y = X_S\beta + \varepsilon$$
where $X_S$ selects the columns of $X$ in the set $S$. This is called the Oracle regression.
• The OLS estimator for the Oracle regression is then
$$\hat\beta_{\text{oracle}} = (X_S'X_S)^{-1}X_S'y$$

Norms
• Define the $\ell_p$ norm of the $v \times 1$ vector $\nu$ as
$$\|\nu\|_p = (|\nu_1|^p + \cdots + |\nu_v|^p)^{1/p}$$
• In particular, we care about the $\ell_1$ and $\ell_2$ norms,
$$\|\nu\|_1 = |\nu_1| + \cdots + |\nu_v| = \sum_{i=1}^{v}|\nu_i| \qquad \|\nu\|_2 = \left(|\nu_1|^2 + \cdots + |\nu_v|^2\right)^{1/2} = \sqrt{\sum_{i=1}^{v}\nu_i^2} = \sqrt{\nu'\nu}$$
... and the $\ell_0$ norm, defined as the number of non-zero elements of $\nu$:
$$\|\nu\|_0 = \text{number of non-zero elements of } \nu$$
• This notation is useful because it allows us to write the Lasso problem more concisely:
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right]$$
where
$$\sum_{i=1}^{N}(y_i - x_i\beta)^2 = \|y - X\beta\|_2^2 \qquad \text{and} \qquad \sum_{j=1}^{K}|\beta_j| = \|\beta\|_1$$
so the Lasso problem may also be written in the concise notation
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
A simulation of the running example follows below.
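The running example ($N = 100$, $K = 75$, five true non-zeros) can be simulated directly; a sketch comparing the Lasso's estimated support with the infeasible Oracle regression (illustrative only, not the lecture's code):

```r
library(glmnet)

set.seed(5)
N <- 100; K <- 75
X <- matrix(rnorm(N * K), N, K)
beta <- c(rep(1, 5), rep(0, K - 5))      # true support S = {1,...,5}
y <- drop(X %*% beta) + rnorm(N)

fit  <- cv.glmnet(X, y, alpha = 1)
bhat <- as.numeric(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
which(bhat != 0)                         # estimated support; compare with 1:5

oracle <- lm(y ~ X[, 1:5] - 1)           # infeasible: uses the true S
coef(oracle)                             # close to (1,1,1,1,1)
```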
• There is a solid and deep theory behind the Lasso and Adaptive Lasso.
• The derivations make use of high-level mathematics and look nothing like standard asymptotic theory... and full derivations are beyond the scope of this module.
• Yet, we will derive the properties of the Lasso under the following special conditions:
  a. The average of $y_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} y_i = 0$
  b. The average of $x_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} x_i = 0$
  c. The variance of each covariate is 1 and the covariates are uncorrelated, so
$$X'X = I_K \;\Rightarrow\; (X'X)^{-1} = I_K$$
where $I_K$ is the identity matrix.
Clearly, our interest is in deriving properties of the Lasso estimator under more general assumptions, but working out this "simple" case clarifies properties that hold more generally.
• Working out the theory allows us to derive and clarify the properties of the estimator: what it achieves and what it does not.

Three types of objectives
1. Prediction loss: the true model is $y = X\beta + \varepsilon$. Since $\varepsilon$ is random noise and cannot be predicted, we focus on $X\beta$. If $\hat\beta$ is the Lasso estimator, the Lasso prediction is $X\hat\beta$, so the prediction error is the vector $X\beta - X\hat\beta = X(\beta - \hat\beta)$. Take the $\ell_2$ norm $\|X(\beta - \hat\beta)\|_2^2$ as the metric of prediction loss. The objective is to show that the prediction loss is small.
2. Parameter estimation: the estimated $\hat\beta$ is close to the true parameter $\beta$. The objective is to show that
$$\|\beta - \hat\beta\|_2^2 = \sum_{j=1}^{K}(\beta_j - \hat\beta_j)^2$$
is small.
3. Model selection: the true model is correctly selected. Also phrased in terms of the "support" of $\beta$: the support function maps positive entries into 1, negative entries into -1, and 0 into 0. For example,
$$\mathrm{supp}\begin{pmatrix} 3.8 \\ 0 \\ -1.1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$$
The objective is to show that
$$P\left\{\mathrm{supp}(\beta) = \mathrm{supp}(\hat\beta)\right\}$$
is as close as possible to 1.

• We will use some additional tools to derive the properties of the Lasso estimator: some inequalities, and the subgradient.

Inequalities
• Triangle inequality: if $u$ and $v$ are vectors,
$$\|u + v\| \le \|u\| + \|v\|$$
which applies in particular to the $\ell_2$ norm, $\|u + v\|_2 \le \|u\|_2 + \|v\|_2$.
• Reverse triangle inequality:
$$\|u - v\| \ge \|u\| - \|v\|$$
with the variation
$$\|u + v\| = \|u - (-v)\| \ge \|u\| - \|-v\| = \|u\| - \|v\|$$

Subgradient
• The Lasso objective function,
$$\hat\beta = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right] = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
is non-differentiable at 0 because of the $|\beta_j|$ component.
[Figure: the function $f(x) = |x|$, with a kink at zero]
• The lack of differentiability prevents us from finding analytical solutions of the Lasso estimator with ordinary first-order conditions.
• Subgradient: a subgradient of a convex function $f$ at a point $x_0$ is a scalar $g$ such that
$$f(x) - f(x_0) \ge g(x - x_0) \quad \text{for all } x$$
• We define the set of subderivatives as the interval $[a, b]$, where $a$ and $b$ are the one-sided limits
$$a = \lim_{x \to x_0^-} \frac{f(x) - f(x_0)}{x - x_0} \qquad b = \lim_{x \to x_0^+} \frac{f(x) - f(x_0)}{x - x_0}$$
• For the absolute-value function $f(x) = |x|$, the subderivative is
$$\partial f(x) = \begin{cases} -1 & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
• For simplicity, let $K = 1$ and work out the first-order conditions for the Lasso objective function
$$f(\beta) = \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda|\beta_1|$$
• If $\hat\beta_{\text{lasso}} > 0$,
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N}(y_i - x_i\hat\beta_{\text{lasso}})x_i + \lambda = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i x_i - \hat\beta_{\text{lasso}}\sum_{i=1}^{N} x_i^2 - \frac{\lambda}{2} = 0$$
• The expression above implies that
$$\hat\beta_{\text{lasso}} = \underbrace{\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}}_{\text{OLS estimator}} - \underbrace{\frac{\lambda}{2}\frac{1}{\sum_{i=1}^{N} x_i^2}}_{\text{downward bias}} = \hat\beta_{\text{ols}} - \frac{\lambda}{2}$$
(using $\sum_{i=1}^{N} x_i^2 = 1$ from assumption c), which only applies if $\hat\beta_{\text{lasso}} > 0$.
• If $\hat\beta_{\text{lasso}} < 0$, then
$$\hat\beta_{\text{lasso}} = \underbrace{\hat\beta_{\text{ols}}}_{\text{OLS estimator}} + \underbrace{\frac{\lambda}{2}}_{\text{upward bias}}$$
which only applies if $\hat\beta_{\text{lasso}} < 0$.
• Shrinkage estimator: in either case, the non-zero estimates are biased towards zero.
• What happens if $\hat\beta_{\text{ols}}$ is small? Example: say, for instance, that $\hat\beta_{\text{ols}} = 0.1$ and $\lambda = 1$.
  - If $\hat\beta_{\text{lasso}} > 0$, then $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}} - \frac{\lambda}{2} = 0.1 - \frac{1}{2} = -0.4$ (contradiction).
  - If $\hat\beta_{\text{lasso}} < 0$, then $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}} + \frac{\lambda}{2} = 0.1 + \frac{1}{2} = 0.6$ (contradiction).
  - Then $\hat\beta_{\text{lasso}}$ must be exactly equal to zero.
• Overall,
$$\hat\beta_{\text{lasso}} = \begin{cases} \hat\beta_{\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
which means that the Lasso estimator selects the coefficient as exactly zero, depending on the penalization parameter. This soft-thresholding rule is sketched in code below.
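The piecewise solution is the soft-thresholding function; a short R implementation of the rule just derived (with the $\lambda/2$ threshold), checked against the $\hat\beta_{\text{ols}} = 0.1$, $\lambda = 1$ example:

```r
# Closed-form Lasso solution for a single (or orthonormal) regressor
soft_threshold <- function(b_ols, lambda) {
  ifelse(abs(b_ols) > lambda / 2,
         b_ols - sign(b_ols) * lambda / 2,  # non-zero estimates shrunk by lambda/2
         0)                                 # small OLS estimates set exactly to 0
}

soft_threshold(0.1, lambda = 1)              # 0, as in the example above
soft_threshold(c(-1, 0.3, 0.8), lambda = 1)  # -0.5, 0.0, 0.3
```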
• The piecewise solution also means that:
  - If $\lambda \to \infty$, $\hat\beta_{\text{lasso}} = 0$
  - If $\lambda = 0$, $\hat\beta_{\text{lasso}} = \hat\beta_{\text{ols}}$
  - For $\lambda > 0$, the estimate of $\beta_1$ might be zero or not
[Figure: the Lasso estimate as a function of the OLS estimate, flat at zero on $[-\lambda/2, \lambda/2]$]
• In the $K = 1$ case, we had that
$$\hat\beta_{\text{lasso}} = \begin{cases} \hat\beta_{\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{\text{ols}} = (X'X)^{-1}X'y = X'y$.
• For any value of $K$, it can be shown that the same holds coordinate by coordinate:
$$\hat\beta_{j,\text{lasso}} = \begin{cases} \hat\beta_{j,\text{ols}} - \frac{\lambda}{2} & \text{if } \hat\beta_{j,\text{ols}} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{j,\text{ols}} \le \frac{\lambda}{2} \\ \hat\beta_{j,\text{ols}} + \frac{\lambda}{2} & \text{if } \hat\beta_{j,\text{ols}} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{\text{ols}} = (X'X)^{-1}X'y = X'y$ and $\hat\beta_{j,\text{ols}} = x_j'y$.
• Since $\hat\beta_{j,\text{ols}} = x_j'y$, the Lasso solution for any $K$ can be summarized as
$$\hat\beta_{j,\text{lasso}} = \begin{cases} x_j'y - \mathrm{sign}(x_j'y)\cdot\frac{\lambda}{2} & \text{if } |x_j'y| > \frac{\lambda}{2} \\ 0 & \text{if } |x_j'y| \le \frac{\lambda}{2} \end{cases}$$
(a numerical illustration under an orthonormal design follows below).
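A numerical illustration under an orthonormal design (illustrative: $X$ is built via a QR factorization so that $X'X = I_K$ exactly, making $\hat\beta_{j,\text{ols}} = x_j'y$). Note how the small non-zero coefficient is shrunk all the way to zero, anticipating the beta-min discussion below:

```r
set.seed(6)
N <- 100; K <- 10
X <- qr.Q(qr(matrix(rnorm(N * K), N, K)))   # orthonormal columns: X'X = I_K
beta <- c(2, -1.5, 0.3, rep(0, K - 3))
y <- drop(X %*% beta) + rnorm(N, sd = 0.1)

soft <- function(b, l) ifelse(abs(b) > l / 2, b - sign(b) * l / 2, 0)

b_ols   <- drop(t(X) %*% y)                 # coordinate-wise OLS: x_j'y
b_lasso <- soft(b_ols, l = 1)               # coordinate-wise soft-thresholding
rbind(truth = beta, ols = round(b_ols, 2), lasso = round(b_lasso, 2))
```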
• The Lasso solution clarifies that $\hat\beta_{j,\text{lasso}}$ will be zero if $|x_j'y| \le \frac{\lambda}{2}$ and non-zero otherwise.
• Model selection refers to the capacity to correctly estimate the support. That is:
  - If the true parameter is zero ($\beta_j = 0$), a zero is estimated ($\hat\beta_{j,\text{lasso}} = 0$), which happens if $|x_j'y| \le \frac{\lambda}{2}$.
  - If the true parameter is non-zero ($\beta_j \ne 0$), a non-zero is estimated ($\hat\beta_{j,\text{lasso}} \ne 0$), which happens if $|x_j'y| > \frac{\lambda}{2}$.
• If $\beta_j = 0$, when will $|x_j'y| \le \frac{\lambda}{2}$?
  - Substituting $y = X\beta + \varepsilon$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon|$$
where the latter equality is due to the fact that the covariance between columns $j_1$ and $j_2$ of $X$ is zero if $j_1 \ne j_2$ and one if $j_1 = j_2$.
  - Define the set $\mathcal{C}$ that limits the correlation of $X$ with the disturbance vector $\varepsilon$: on the set $\mathcal{C}$,
$$\max_j |x_j'\varepsilon| \le \frac{\lambda}{2}$$
To clarify, $\max_j |x_j'\varepsilon|$ is the largest covariance (in absolute value) between $\varepsilon$ and any column of $X$, i.e., $\max\{|\mathrm{cov}(\varepsilon, x_1)|, \ldots, |\mathrm{cov}(\varepsilon, x_K)|\}$, and the condition is less likely to hold if $\lambda$ is small.
  - If $\beta_j = 0$, then on the set $\mathcal{C}$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| = |x_j'\varepsilon| \le \frac{\lambda}{2}$$
  - The last equation means: on the set $\mathcal{C}$, if $\beta_j = 0$ then $|x_j'y| \le \frac{\lambda}{2}$, which implies $\hat\beta_{j,\text{lasso}} = 0$. In other words, true zeros are always selected as zeros! (on the set $\mathcal{C}$)
• If $\beta_j \ne 0$, when will $|x_j'y| > \frac{\lambda}{2}$?
  - If $\beta_j \ne 0$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| \ge |\beta_j| - |x_j'\varepsilon|$$
using the reverse triangle inequality. On the set $\mathcal{C}$,
$$|x_j'y| \ge |\beta_j| - |x_j'\varepsilon| \ge |\beta_j| - \frac{\lambda}{2}$$
  - On the other hand, $\hat\beta_{j,\text{lasso}}$ is non-zero if $|x_j'y| > \frac{\lambda}{2}$.
  - Given that $|x_j'y| \ge |\beta_j| - \frac{\lambda}{2}$, the condition $|x_j'y| > \frac{\lambda}{2}$ is surely satisfied if
$$|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$$
  - The condition $|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$ is equivalent to $|\beta_j| > \lambda$. So $\hat\beta_{j,\text{lasso}}$ will be non-zero as long as $|\beta_j| > \lambda$.
  - This also means that all non-zero $\beta_j$ will be estimated as non-zero as long as the smallest non-zero coefficient is greater than the penalization parameter $\lambda$:
$$\min_{j:\,\beta_j \ne 0} |\beta_j| > \lambda$$
(this is also known as the beta-min condition).
  - Small but non-zero $\beta_j$ will be shrunk to zero.

Takeaway from model selection
• $\lambda$ too low.
  Intuition: many true zeros ($\beta_j = 0$) are not estimated as zeros because the penalization is too small, and the Lasso is not able to achieve model selection.
  Theory: the set $\mathcal{C}$, defined by $\max_j |x_j'\varepsilon| \le \frac{\lambda}{2}$, is unlikely to hold, and therefore the model selection results do not apply.
• $\lambda$ too high.
  Intuition: true non-zeros ($\beta_j \ne 0$) are shrunk to zero because the penalization is too high, and the Lasso is not able to achieve model selection.
  Theory: the set $\mathcal{C}$ is likely to hold, but the beta-min condition $\min_{j:\beta_j\ne 0}|\beta_j| > \lambda$ is not likely to, so model selection is not achieved.
• $\lambda$ somewhere in the middle.
  Intuition: some $\hat\beta_{j,\text{lasso}}$ are zero, and some non-zero.
  Theory: the set $\mathcal{C}$ is likely to hold, and so is the beta-min condition $\min_{j:\beta_j\ne 0}|\beta_j| > \lambda$, so model selection is achieved.

• We have discussed model selection, which is not equivalent to the distance between $\beta$ and $\hat\beta_{\text{lasso}}$.
• On the set $\mathcal{C}$, the Lasso estimator is zero on $S^c$, the complement of the support. Therefore
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{S,\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{\text{oracle}} + \hat\beta_{\text{oracle}} - \hat\beta_{S,\text{lasso}}\|_2$$
• Using the triangle inequality, and recalling that under the orthonormal design $\hat\beta_{j,\text{oracle}} = x_j'y$ and $\hat\beta_{j,\text{lasso}} = x_j'y - \mathrm{sign}(x_j'y)\frac{\lambda}{2}$ for $j \in S$,
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 \le \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \|\hat\beta_{\text{oracle}} - \hat\beta_{S,\text{lasso}}\|_2 = \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \underbrace{\sqrt{\sum_{j \in S}\left(\mathrm{sign}(x_j'y)\tfrac{\lambda}{2}\right)^2}}_{\frac{\lambda}{2}\sqrt{s}}$$
where $s$ is the number of elements in $S$.
• So the cost of not knowing $S$ is $\frac{\lambda}{2}\sqrt{s}$.
• The result
$$\|\beta - \hat\beta_{\text{lasso}}\|_2 \le \|\beta_S - \hat\beta_{\text{oracle}}\|_2 + \frac{\lambda}{2}\sqrt{s}$$
is also called an Oracle Inequality, because it bounds the error in the parameters by the error that would have been obtained if the true support $S$ were known, plus an additional term.
  - It also highlights that the smaller the support size $s$, the smaller the error.
  - Concept of sparsity: most $\beta_j$ are equal to zero, so $s$ is small.
• Prediction loss: under assumption c ($X'X = I_K$),
$$\|X(\beta - \hat\beta_{\text{lasso}})\|_2 = \sqrt{(\beta - \hat\beta_{\text{lasso}})'X'X(\beta - \hat\beta_{\text{lasso}})} = \sqrt{(\beta - \hat\beta_{\text{lasso}})'(\beta - \hat\beta_{\text{lasso}})} = \|\beta - \hat\beta_{\text{lasso}}\|_2$$
so the prediction loss inherits the same bound. A numerical check of the oracle inequality follows below.
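Finally, a numeric check of the oracle inequality in the same orthonormal setup (a sketch under the module's simplifying assumptions, with the $\lambda/2$ threshold used throughout):

```r
set.seed(7)
N <- 100; K <- 20; s <- 4; lambda <- 1
X <- qr.Q(qr(matrix(rnorm(N * K), N, K)))   # X'X = I_K
beta <- c(rep(2, s), rep(0, K - s))         # support S = {1,...,4}
y <- drop(X %*% beta) + rnorm(N, sd = 0.1)

b_ols    <- drop(t(X) %*% y)                # beta_j,ols = x_j'y
b_lasso  <- ifelse(abs(b_ols) > lambda / 2,
                   b_ols - sign(b_ols) * lambda / 2, 0)
b_oracle <- drop(t(X[, 1:s]) %*% y)         # oracle OLS on the true support

lhs <- sqrt(sum((beta - b_lasso)^2))
rhs <- sqrt(sum((beta[1:s] - b_oracle)^2)) + (lambda / 2) * sqrt(s)
c(lhs = lhs, rhs = rhs)                     # lhs <= rhs holds
```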