ECON0019: Quantitative Economics & Econometrics
Notes for Term 1
Dennis Kristensen
University College London
November 2, 2021
These notes are meant to accompany Wooldridge's "Introductory Econometrics" by providing more detail on the mathematical results of the book.
Part I
Simple Linear Regression (SLR)
We are interested in estimating the relationship between $x$ and $y$ in a given population. Suppose the following two assumptions are satisfied:
SLR.1 In the population, the following relationship holds between $x$ and $y$:
\[ y = \beta_0 + \beta_1 x + u, \qquad E[u \mid x] = 0, \tag{0.1} \]
where $\beta_0$ and $\beta_1$ are unknown parameters and $u$ is an unobserved error term.
The error term $u$ captures other factors, in addition to $x$, that influence/generate $y$. In order to be able to disentangle the impact of $x$ from these other factors, we will require $u$ to be mean-independent of $x$:
SLR.4 $E[u \mid x] = 0$.
That is, conditional on $x$ the expected value of $u$ in the population is 0; it implies that $x$ conveys no information about $u$ on average. It is here helpful to remind ourselves what conditional expectations are:
1 Refresher on conditional distributions
Consider two random variables, $y$ and $x$. These do not have to satisfy the SLR (or any other model). Suppose for simplicity that both are discrete (all subsequent arguments and results easily generalise to the case where they are continuously distributed). Let $\{x_{(1)}, \ldots, x_{(K)}\}$, for some $K \geq 1$, be the possible values that $x$ can take and $\{y_{(1)}, \ldots, y_{(L)}\}$, for some $L \geq 1$, be the possible values of $y$. Now, let
\[ p_{X,Y}(x_{(i)}, y_{(j)}) = \Pr(x = x_{(i)},\, y = y_{(j)}), \qquad i = 1, \ldots, K, \quad j = 1, \ldots, L, \]
be the joint probability function. From this we can, for example, compute the marginal distributions of $x$ and $y$,
\[ p_X(x) = \sum_{j=1}^{L} p_{X,Y}(x, y_{(j)}), \qquad p_Y(y) = \sum_{i=1}^{K} p_{X,Y}(x_{(i)}, y), \]
for any given $x \in \{x_{(1)}, \ldots, x_{(K)}\}$ and $y \in \{y_{(1)}, \ldots, y_{(L)}\}$.
The conditional distribution of $y \mid x$ is defined as
\[ p_{Y|X}(y \mid x) = \frac{p_{Y,X}(y, x)}{p_X(x)}, \tag{1.1} \]
for any values of $(y, x)$ with $p_X(x) > 0$. The conditional distribution slices the distribution of $y$ up according to the value of $x$: $p_{Y|X}(y_{(j)} \mid x_{(i)})$ is the probability of observing $y = y_{(j)}$ in the subpopulation for which $x = x_{(i)}$. If $x$ is informative about $y$ (that is, they are dependent), then $p_{Y|X}(y \mid x) \neq p_Y(y)$.
In this course we are generally interested in modelling conditional distributions because we are interested in causal effects. So very often we will write up a model for $p_{Y|X}(y \mid x)$. Any given model of $p_{Y|X}(y \mid x)$ can then be used to, for example, make statements about the marginal distribution of $y$ since
\[ p_Y(y) = \sum_{i=1}^{K} p_{Y|X}(y \mid x_{(i)})\, p_X(x_{(i)}). \tag{1.2} \]
Example: Male and Female wages. Consider the population of UK adults. Let $y \in \{0, 1, 2, \ldots, L\}$ be the log weekly earnings in pounds of a UK adult (binned so that it is discrete, with the number of bins $L$ being some very large number) in a given year, and let $x \in \{0, 1\}$ be a dummy variable indicating the gender of that same individual: $x = 0$ if the individual is male, while $x = 1$ if the individual is female.
In this case, $p_{Y|X}(y \mid x = 0)$ is the UK log-wage distribution for men and $p_{Y|X}(y \mid x = 1)$ is the log-wage distribution for women. If men and women have different earnings distributions, then $p_{Y|X}(y \mid x = 0) \neq p_{Y|X}(y \mid x = 1)$.
The overall UK wage distribution is
\[ p_Y(y) = p_X(x = 0)\, p_{Y|X}(y \mid x = 0) + p_X(x = 1)\, p_{Y|X}(y \mid x = 1), \]
where $p_X(x = 0)$ and $p_X(x = 1)$ are the proportions of men and women in the UK, respectively. In 2018, the UK population was 66.78 million, with 33.82 million females and 32.98 million males. Thus, if the year of interest is 2018,
\[ p_X(x = 0) = \frac{32.98}{66.78} = 0.49, \qquad p_X(x = 1) = \frac{33.82}{66.78} = 0.51. \tag{1.3} \]
For any given value of $x$, $p_{Y|X}(y \mid x)$ is a probability distribution. In particular, we can compute means and variances, etc. For example, the conditional mean is defined as
\[ E[y \mid x] = \sum_{j=1}^{L} y_{(j)}\, p_{Y|X}(y_{(j)} \mid x). \]
More generally, for any function $f(y)$,
\[ E[f(y) \mid x] = \sum_{j=1}^{L} f(y_{(j)})\, p_{Y|X}(y_{(j)} \mid x). \]
For example, the conditional variance is given by
\[ \mathrm{Var}(y \mid x) = E\left[ (y - E[y \mid x])^2 \mid x \right] = E[y^2 \mid x] - E[y \mid x]^2. \]
Example: Male and Female wages. In the above wage example,
$E[y \mid x = 0]$ = the average log-wage for men,
$E[y \mid x = 1]$ = the average log-wage for women,
and
$\mathrm{Var}(y \mid x = 0)$ = the spread/variance of log-wages for men,
$\mathrm{Var}(y \mid x = 1)$ = the spread/variance of log-wages for women.
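To make the definitions concrete, the following is a minimal sketch that computes marginal and conditional distributions, conditional means and conditional variances from a small joint probability function. The numbers in `joint` are made up for illustration and are not wage data; the function names are our own.

```python
from fractions import Fraction as F

# Hypothetical joint pmf p_{X,Y}(x, y) with x in {0, 1} and y in {1, 2, 3}.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10), (0, 3): F(2, 10),
    (1, 1): F(2, 10), (1, 2): F(2, 10), (1, 3): F(1, 10),
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

def p_x(x):
    """Marginal distribution of x: sum the joint pmf over y."""
    return sum(joint[(x, y)] for y in ys)

def p_y(y):
    """Marginal distribution of y: sum the joint pmf over x."""
    return sum(joint[(x, y)] for x in xs)

def p_y_given_x(y, x):
    """Conditional pmf, eq. (1.1)."""
    return joint[(x, y)] / p_x(x)

def cond_mean(x, f=lambda y: y):
    """E[f(y) | x]; with the default f this is E[y | x]."""
    return sum(f(y) * p_y_given_x(y, x) for y in ys)

def cond_var(x):
    """Var(y | x) = E[y^2 | x] - E[y | x]^2."""
    return cond_mean(x, lambda y: y * y) - cond_mean(x) ** 2

for x in xs:
    print(x, p_x(x), cond_mean(x), cond_var(x))
```

Exact fractions are used so that eq. (1.2) can be checked without rounding error, for instance by comparing `p_y(y)` with `sum(p_y_given_x(y, x) * p_x(x) for x in xs)`.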
1.1 Some useful rules for computing conditional expectations
Recall that for any two constants $a$ and $b$,
\[ E[ay + b] = aE[y] + b. \]
When we compute expectations conditional on $x$, we can treat $x$ as a constant (we keep it fixed at the particular value). Thus, the following rule applies: For any two functions $a(x)$ and $b(x)$,
\[ E[a(x)\, y + b(x) \mid x] = a(x)\, E[y \mid x] + b(x). \]
Example: Male and Female wages. Suppose $(y, x)$ satisfies SLR.1-SLR.4 with $\beta_0 = 3$ and $\beta_1 = -0.25$. What are the average log-wages for men and women, respectively? Taking conditional expectations on both sides of eq. (0.1) and then using the above rules, we obtain
\[ E[y \mid x] = E[3 - 0.25x + u \mid x] = 3 - 0.25x + E[u \mid x] = 3 - 0.25x, \]
where the last equality uses SLR.4, that is, $E[u \mid x = 0] = E[u \mid x = 1] = 0$ (see footnote 1). Thus,
\[ E[y \mid x = 0] = 3 - 0.25 \cdot 0 = 3, \qquad E[y \mid x = 1] = 3 - 0.25 \cdot 1 = 2.75. \]
So in this example average log-earnings for men are higher than those for women.
Footnote 1: Is it reasonable that SLR.4 holds for this regression? To answer this, you should first determine which factors affect wages besides gender. Next, you should contemplate whether these factors are mean-independent of gender.
Recall eq. (1.2), which relates the marginal distribution of $y$ to its conditional distribution. Similarly, we can link the unconditional mean of $y$, $E[y]$, to the conditional ones, $E[y \mid x]$:
Theorem 1.1 (Law of iterated expectations) For any two random variables $y$ and $x$, the following holds:
\[ E[y] = \sum_{i=1}^{K} E[y \mid x = x_{(i)}]\, p_X(x_{(i)}) = E[E[y \mid x]]. \]
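Proof. By definition,
\[ E[y] = \sum_{j=1}^{L} y_{(j)}\, p_Y(y_{(j)}). \]
But $p_Y(y)$ satisfies eq. (1.2). Substituting the right-hand side of (1.2) into the above equation yields
\begin{align*}
E[y] &= \sum_{j=1}^{L} y_{(j)}\, p_Y(y_{(j)}) \\
&= \sum_{j=1}^{L} y_{(j)} \left\{ \sum_{i=1}^{K} p_{Y|X}(y_{(j)} \mid x_{(i)})\, p_X(x_{(i)}) \right\} \\
&= \sum_{i=1}^{K} \left\{ \sum_{j=1}^{L} y_{(j)}\, p_{Y|X}(y_{(j)} \mid x_{(i)}) \right\} p_X(x_{(i)}),
\end{align*}
where the last equality just changes the order of summation. We now recognise the inner sum as
\[ \sum_{j=1}^{L} y_{(j)}\, p_{Y|X}(y_{(j)} \mid x_{(i)}) = E[y \mid x = x_{(i)}], \]
and so
\[ E[y] = \sum_{i=1}^{K} E[y \mid x = x_{(i)}]\, p_X(x_{(i)}) = E[E[y \mid x]]. \]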
The LIE provides an alternative way of computing the mean of a random variable from knowledge of the conditional distribution of $y \mid x$.
Example: Male and Female wages. Maintaining SLR.1-SLR.4 with $\beta_0 = 3$ and $\beta_1 = -0.25$, what is the average log-wage in the full population? That is, what is $E[y]$? We saw earlier that $E[y \mid x] = 3 - 0.25x$ and we know that $x$ satisfies (1.3). Thus, using the LIE,
\begin{align*}
E[y] &= E[E[y \mid x]] \\
&= E[y \mid x = 0]\, p_X(x = 0) + E[y \mid x = 1]\, p_X(x = 1) \\
&= 3\, p_X(x = 0) + 2.75\, p_X(x = 1) \\
&= 2.87.
\end{align*}
An alternative route to the same result is to first observe
\[ E[y] = E[E[y \mid x]] = E[3 - 0.25x] = 3 - 0.25\, E[x], \]
and then use that
\[ E[x] = 0 \cdot p_X(x = 0) + 1 \cdot p_X(x = 1) = 0.51. \]
Combining these two equations again yields $E[y] = 2.87$.
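The two routes to $E[y]$ in this example can be reproduced numerically. The sketch below uses only the values stated in the text ($\beta_0 = 3$, $\beta_1 = -0.25$ and the rounded proportions in eq. (1.3)); the variable names are our own.

```python
# Values taken from the example: beta_0 = 3, beta_1 = -0.25 and eq. (1.3).
beta0, beta1 = 3.0, -0.25
p_x = {0: 0.49, 1: 0.51}  # rounded shares of men (x = 0) and women (x = 1)

def cond_mean_y(x):
    """E[y | x] = beta0 + beta1 * x under SLR.1 and SLR.4."""
    return beta0 + beta1 * x

# Route 1 (LIE): weight the conditional means by p_X(x).
mean_y_route1 = sum(cond_mean_y(x) * p for x, p in p_x.items())

# Route 2: use linearity, E[y] = beta0 + beta1 * E[x].
mean_x = sum(x * p for x, p in p_x.items())
mean_y_route2 = beta0 + beta1 * mean_x

print(round(mean_y_route1, 2), round(mean_y_route2, 2))
```

Both routes give the same number up to floating-point error, which rounds to the 2.87 reported above.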
2 Two implications of SLR.4
SLR.4 (above) has two useful implications,
E[u] = 0 and E[ux] = 0;
which we will use in several proofs.
2.1 First implication: E[u] = 0.
Lemma 2.1 Under SLR.4, E[u] = 0.
Proof. Let us prove this implication using the LIE. If we substitute $u$ for $y$ in Theorem 1.1, we get
\[ E[u] = E[E[u \mid x]]. \]
Under SLR.4, we know that $E[u \mid x] = 0$. Plugging this into the right-hand side of the above equation gives us
\[ E[u] = E[0] = 0, \]
where the last equality uses that the expected value of a random variable which is 0 with probability 1 is 0. And so we have shown that $E[u \mid x] = 0 \Rightarrow E[u] = 0$.
It is stated in the lectures that $E[u] = 0$ is not a strong restriction. Why is that? Let's illustrate this with an example.
Example: Male and Female wages. Suppose that $(y, x)$ satisfies SLR.1-SLR.3 with $\beta_0 = 3$ and $\beta_1 = -0.25$. However, suppose that $E[u \mid x] = 4$ so that SLR.4 is violated. Is this a serious issue in terms of interpretation of the model? Not really, since the conditional mean is still constant: First, add and subtract $E[u \mid x] = 4$ on the right-hand side of the population equation,
\[ y = \beta_0 + \beta_1 x + u = 3 + 4 - 0.25x + u - 4. \tag{2.1} \]
Obviously this does not change the relationship. Now define two new variables:
\[ \tilde{\beta}_0 \equiv \beta_0 + 4, \qquad \tilde{u} \equiv u - 4. \]
With these new variables, we can rewrite (2.1) as
\[ y = \tilde{\beta}_0 + \beta_1 x + \tilde{u} = 7 - 0.25x + \tilde{u}. \]
Notice that as $\tilde{u} \equiv u - 4$ and $E[u \mid x] = 4$, we must have that $E[\tilde{u} \mid x] = 0$. And so we have shown that a population relationship that initially appeared to break SLR.4 can be rewritten in a way that conforms to it. The intercept changes from $\beta_0 = 3$ to $\tilde{\beta}_0 = 7$ when doing so, but this is just a normalisation. More importantly, $\beta_1$ (which tends to be the key parameter of interest) is unchanged.
2.1.1 Reverse implication?
It is important to note that the reverse implication does not hold. That is, $E[u] = 0 \nRightarrow E[u \mid x] = 0$. We can show this with a simple counter-example. Suppose in the population there are only two possible outcomes:
Outcome A, where $u = 1$ and $x = 5$.
Outcome B, where $u = -1$ and $x = 7$.
Suppose that the two outcomes are equally likely, so $\Pr(A) = \Pr(B) = 0.5$. Notice that as half the time there is an outcome where $u = 1$ and half the time there is an outcome where $u = -1$, we have $E[u] = 0$. But notice that we do not have $E[u \mid x] = 0$. There are only two possible values $x$ can take: 5 and 7. When $x = 5$, $u$ can only be 1, and so $E[u \mid x = 5] = 1$. When $x = 7$, $u$ can only be $-1$, and so $E[u \mid x = 7] = -1$. And so in this example $E[u] = 0$ holds but $E[u \mid x] \neq 0$, and so we know that $E[u] = 0 \nRightarrow E[u \mid x] = 0$.
2.2 Second implication: E[ux] = 0
Lemma 2.2 Under SLR.4, E[ux] = 0.
Proof. We again use the LIE. Notice that as $x$ and $u$ are both random variables, $ux$ is also a random variable. If we substitute $ux$ for $y$ in Theorem 1.1 we get
\[ E[ux] = E[E[ux \mid x]]. \]
In the above equation we have an $x$ inside an expectation conditional on $x$. This means this $x$ can be treated as a constant, and can be taken outside the conditional expectation, $E[ux \mid x] = x\, E[u \mid x]$. Thus,
\[ E[ux] = E[x\, E[u \mid x]]. \]
Under SLR.4, $E[u \mid x] = 0$. Plugging this into the above equation gives us
\[ E[ux] = E[x \cdot 0] = E[0] = 0, \]
which proves the claimed result.
2.2.1 Reverse implication?
It is again important to note that the reverse implication does not hold. That is, $E[ux] = 0 \nRightarrow E[u \mid x] = 0$. And we can again show this with a counter-example. Suppose in the population there are only two possible outcomes:
Outcome A, where $u = 3$ and $x = 4$.
Outcome B, where $u = -2$ and $x = 6$.
Suppose that the two outcomes are equally likely, so $\Pr(A) = \Pr(B) = 0.5$.
Notice that in outcome A, $ux = 12$, and in outcome B, $ux = -12$. And so as the outcomes are equally likely, we have $E[ux] = 0$. But notice that we do not have $E[u \mid x] = 0$. There are only two possible values $x$ can take: 4 and 6. When $x = 4$, $u$ can only be 3, and so $E[u \mid x = 4] = 3$. When $x = 6$, $u$ can only be $-2$, and so $E[u \mid x = 6] = -2$. And so in this example $E[ux] = 0$ holds but $E[u \mid x] \neq 0$, and so we have shown that $E[ux] = 0 \nRightarrow E[u \mid x] = 0$.
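The two counter-examples (Sections 2.1.1 and 2.2.1) can be checked mechanically. The sketch below encodes each two-outcome population as a list of (probability, u, x) triples and computes $E[u]$, $E[ux]$ and $E[u \mid x]$; the helper name `moments` is our own.

```python
from fractions import Fraction as F

def moments(outcomes):
    """outcomes: list of (probability, u, x). Returns (E[u], E[ux], {x: E[u | x]})."""
    e_u = sum(p * u for p, u, x in outcomes)
    e_ux = sum(p * u * x for p, u, x in outcomes)
    cond = {}
    for x0 in sorted({x for _, _, x in outcomes}):
        p_x0 = sum(p for p, _, x in outcomes if x == x0)
        cond[x0] = sum(p * u for p, u, x in outcomes if x == x0) / p_x0
    return e_u, e_ux, cond

half = F(1, 2)
example_1 = [(half, 1, 5), (half, -1, 7)]   # Section 2.1.1: E[u] = 0
example_2 = [(half, 3, 4), (half, -2, 6)]   # Section 2.2.1: E[ux] = 0

print(moments(example_1))
print(moments(example_2))
```

In both populations the conditional means $E[u \mid x]$ are non-zero even though the relevant unconditional moment is zero, which is exactly the point of the counter-examples.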
3 Derivation of OLS Estimators
In most applications, the population is not directly observable; however, we can still learn a lot about the SLR model from a random sample of the population, $\{(x_i, y_i) : i = 1, \ldots, n\}$. That is, we have randomly sampled $n$ units from the population and, for each unit $i$ ($= 1, \ldots, n$) in the sample, we have observed its value of $(x, y)$, denoted $(x_i, y_i)$. Random sampling implies that we can treat the observations as being independently drawn from the population/distribution of interest. Given that they are drawn from the same distribution, they are also identically distributed. We formally state this assumption below:
SLR.2 $(x_i, y_i)$, $i = 1, \ldots, n$, are i.i.d. (independent and identically distributed).
From the sample, we wish to learn about $\beta_0$ and $\beta_1$, where $\beta_1$ measures the impact of $x$ on $y$. Clearly, if there is no variation in $x$, we cannot obtain any reasonable estimate of $\beta_1$. We will formally require that:
SLR.3 $x_i$, $i = 1, \ldots, n$, exhibit variation so that $\mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$.
Based on the sample, we wish to estimate $\beta_0$ and $\beta_1$ by OLS. There are two equivalent ways to derive the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. The first way is the Method of Moments; the second is the minimisation of the Sum of Squared Residuals.
Neither method requires SLR.1-SLR.2 and SLR.4 to hold: we can always compute the OLS estimators as long as SLR.3 is satisfied. But the estimators are motivated by SLR.1-SLR.2 and SLR.4: under these assumptions the estimators can be given a formal/more rigorous interpretation and will enjoy certain desirable properties.
3.1 Method of Moments
We saw earlier that $E[u \mid x] = 0$ implies that (i) $E[u] = 0$ and (ii) $E[xu] = 0$. Using that $u = y - \beta_0 - \beta_1 x$, cf. (0.1), we can write (i)-(ii) as
\[ E[u] = E[y - \beta_0 - \beta_1 x] = 0, \tag{3.1} \]
\[ E[xu] = E[x(y - \beta_0 - \beta_1 x)] = 0. \tag{3.2} \]
That is, given the population (distribution), $\beta_0$ and $\beta_1$ must satisfy (3.1)-(3.2). Do these two equations identify $\beta_0$ and $\beta_1$? If not, the moment conditions are not helpful in learning about $\beta_0$ and $\beta_1$. If $\mathrm{Var}(x) > 0$ the answer is affirmative: One can show that there exists a unique solution to (3.1)-(3.2) in terms of $\beta_0$ and $\beta_1$. This can be shown by following the same steps as for the derivation of the OLS estimators below.
We do not know the population (moments); all we have is the random sample. So let us replace the above population moments by their sample counterparts,
\[ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0, \tag{3.3} \]
\[ \frac{1}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0, \tag{3.4} \]
where $\sum_{i=1}^{n} x_i := x_1 + \cdots + x_n$ and similarly for any other sequence; in the following we will write $\bar{x} := \sum_{i=1}^{n} x_i / n$ and similarly for other variables.
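To find the solutions, $\hat{\beta}_0$ and $\hat{\beta}_1$, to these two equations, we first manipulate (3.3) to obtain
\begin{align*}
0 &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} y_i - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_0 - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_1 x_i \\
&= \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x}.
\end{align*}
We conclude that
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \tag{3.5} \]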
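Next, substitute the right-hand side of (3.5) into (3.4) to obtain:
\begin{align*}
0 &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) x_i \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} + \hat{\beta}_1 \bar{x} - \hat{\beta}_1 x_i \right) x_i \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \{ y_i - \bar{y} \} - \hat{\beta}_1 \{ x_i - \bar{x} \} \right) x_i \\
&= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\, x_i - \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i,
\end{align*}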
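and we conclude that
\[ \hat{\beta}_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\, x_i}{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i} = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{3.6} \]
The last equality in (3.6) uses that (left as an exercise)
\[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\, \bar{x} = 0 \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})\, \bar{x} = 0. \]
Plugging $\hat{\beta}_1$ back into (3.5) gives us our estimator of $\beta_0$.
Once we have computed $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain:
Predicted/fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $i = 1, \ldots, n$.
OLS regression line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ for any given value of $x$ (as chosen by us).
Residuals: $\hat{u}_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$.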
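As an illustration of eqs. (3.5) and (3.6), here is a minimal sketch that computes the OLS estimates, fitted values and residuals for a small made-up data set, and then checks the sample moment conditions (3.3) and (3.4). The function name `ols_slr` and the data are hypothetical.

```python
def ols_slr(x, y):
    """Return (b0_hat, b1_hat) computed from eqs. (3.5) and (3.6)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    if sst_x == 0:
        raise ValueError("SLR.3 fails: no variation in x")
    b1 = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = ols_slr(x, y)
fitted = [b0_hat + b1_hat * xi for xi in x]        # predicted values
resid = [yi - fi for yi, fi in zip(y, fitted)]     # residuals

# Sample moment conditions (3.3) and (3.4); both are zero up to rounding error.
moment_1 = sum(resid) / len(x)
moment_2 = sum(xi * ui for xi, ui in zip(x, resid)) / len(x)
print(b0_hat, b1_hat, moment_1, moment_2)
```

The guard on `sst_x` mirrors SLR.3: without variation in $x$ the denominator of (3.6) is zero and the slope is not defined.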
3.2 Sum of Squared Residuals
An alternative (but equivalent) way to obtain the OLS estimators is by minimising the sum of squared residuals (SSR). First recall the residual $\hat{u}_i$ defined above; what does this measure? Think about observation $i$ with $(x_i, y_i)$; for $x_i$ the aforementioned OLS regression line predicts a $y$-value of $\hat{y}_i$. How far is $\hat{y}_i$ from the actual $y_i$? This information is given by $\hat{u}_i$. In other words, $\hat{u}_i$ is the signed vertical distance between $y_i$ and $\hat{y}_i$; it can be either positive or negative depending on whether the actual point lies above or below the OLS regression line.
If we want our OLS regression line to fit the data well, then we must make the distances $\hat{u}_i$ small for all $i$. How do we do that? One way would be to minimise the sum of residuals $\sum_{i=1}^{n} \hat{u}_i$. But this is not very meaningful and leads to rubbish estimators: since $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, we can always choose $\hat{\beta}_0$ larger and that way decrease the value of $\sum_{i=1}^{n} \hat{u}_i$ further. In the limit, $\hat{\beta}_0 = +\infty$ is the best estimator in terms of this criterion. Instead we choose to minimise the SSR, since this penalises estimates that generate either large negative or large positive residuals.
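Formally, for a given set of candidate values for our estimators, $b_0$ and $b_1$, we compute the corresponding SSR,
\[ \mathrm{SSR}(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2. \]
Given data (so we treat $(y_i, x_i)$, $i = 1, \ldots, n$, as fixed), we then choose our estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, as the minimisers of $\mathrm{SSR}(b_0, b_1)$ w.r.t. $(b_0, b_1)$:
\[ \min_{b_0, b_1} \mathrm{SSR}(b_0, b_1). \]
That is, we seek values $\hat{\beta}_0$ and $\hat{\beta}_1$ so that
\[ \mathrm{SSR}(\hat{\beta}_0, \hat{\beta}_1) < \mathrm{SSR}(b_0, b_1) \quad \text{for all } (b_0, b_1) \neq (\hat{\beta}_0, \hat{\beta}_1). \tag{3.7} \]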
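To solve this bivariate minimisation problem, we derive the first-order conditions (FOCs) and then find the values $(\hat{\beta}_0, \hat{\beta}_1)$ at which they are both zero:
\[ \left. \frac{\partial\, \mathrm{SSR}(b_0, b_1)}{\partial b_0} \right|_{(b_0, b_1) = (\hat{\beta}_0, \hat{\beta}_1)} = 0, \qquad \left. \frac{\partial\, \mathrm{SSR}(b_0, b_1)}{\partial b_1} \right|_{(b_0, b_1) = (\hat{\beta}_0, \hat{\beta}_1)} = 0. \]
Using the chain rule, the two FOCs take the form:
FOC with respect to $b_0$:
\[ 0 = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right), \]
which is equivalent to (3.3).
FOC with respect to $b_1$:
\[ 0 = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) x_i, \]
which is equivalent to (3.4).
Thus, the solution $(\hat{\beta}_0, \hat{\beta}_1)$ to (3.7) is identical to (3.5) and (3.6).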
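The equivalence between the two derivations can be checked numerically. The sketch below, on made-up data, evaluates the two FOCs at the closed-form estimates from (3.5) and (3.6) and compares the SSR there with the SSR at a few nearby candidate values $(b_0, b_1)$. The function names and the perturbation grid are our own choices.

```python
def ssr(b0, b1, x, y):
    """Sum of squared residuals at candidate values (b0, b1)."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def ssr_gradient(b0, b1, x, y):
    """Partial derivatives of SSR with respect to b0 and b1 (the two FOCs)."""
    g0 = -2 * sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))
    g1 = -2 * sum((yi - b0 - b1 * xi) * xi for xi, yi in zip(x, y))
    return g0, g1

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Closed-form estimates from eqs. (3.5) and (3.6).
b1_hat = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

gradient = ssr_gradient(b0_hat, b1_hat, x, y)
ssr_min = ssr(b0_hat, b1_hat, x, y)

# SSR at candidate values around the estimates; each should exceed ssr_min.
steps = (-0.1, 0.0, 0.1)
perturbed = [ssr(b0_hat + d0, b1_hat + d1, x, y)
             for d0 in steps for d1 in steps if (d0, d1) != (0.0, 0.0)]
print(gradient, ssr_min, min(perturbed))
```

A grid of perturbations is of course not a proof of (3.7); it only illustrates that moving away from the solution of the FOCs increases the SSR.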
4 Unbiasedness
Every time we draw a new random sample from the population, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be different. To formally assess the variation of the estimators across different samples, we will treat their outcomes as random variables. We then wish to better understand the features of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.
The first question we will ask about their distributions is the following: On average (across all the different samples), will our estimation method get it right? That is, do the expected values of $\hat{\beta}_0$ and $\hat{\beta}_1$ equal the unknown values of $\beta_0$ and $\beta_1$, respectively? The answer is affirmative as long as our regularity conditions are satisfied:
Theorem 4.1 Under SLR.1-SLR.4, $E[\hat{\beta}_0 \mid X_n] = \beta_0$ and $E[\hat{\beta}_1 \mid X_n] = \beta_1$, where $X_n = \{x_1, x_2, \ldots, x_n\}$, and so the OLS estimators are unbiased.
Proof. We here show that $E[\hat{\beta}_1] = \beta_1$. The proof of $E[\hat{\beta}_0] = \beta_0$ is left as an exercise. Our proof proceeds in three steps:
1. Obtain a convenient expression of the estimator.
2. Decompose the estimator as population parameter + sampling error.
3. Show that $E[\text{sampling error}] = 0$.
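Step 1: Obtain a convenient expression of estimator
It is convenient to rewrite the expression of $\hat{\beta}_1$ in eq. (3.6) as
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \]
where we have used that $\sum_{i=1}^{n} (x_i - \bar{x})\, \bar{y} = 0$. Define $\mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$, the total variation in the $x_i$, and write
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\mathrm{SST}_x}. \tag{4.1} \]
The existence of $\hat{\beta}_1$ follows from SLR.3, which guarantees that $\mathrm{SST}_x > 0$.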
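Step 2: Write estimator as parameter + sampling error
Replace each $y_i$ in the numerator of (4.1) with $y_i = \beta_0 + \beta_1 x_i + u_i$ (which holds due to SLR.1):
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})\, y_i &= \sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i) \\
&= \beta_0 \sum_{i=1}^{n} (x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})\, x_i + \sum_{i=1}^{n} (x_i - \bar{x})\, u_i \\
&= 0 + \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 + \sum_{i=1}^{n} (x_i - \bar{x})\, u_i \\
&= \beta_1 \mathrm{SST}_x + \sum_{i=1}^{n} (x_i - \bar{x})\, u_i,
\end{align*}
where we have used $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x})\, x_i = \sum_{i=1}^{n} (x_i - \bar{x})^2$. In total,
\[ \hat{\beta}_1 = \frac{\beta_1 \mathrm{SST}_x + \sum_{i=1}^{n} (x_i - \bar{x})\, u_i}{\mathrm{SST}_x} = \beta_1 + \underbrace{\frac{\sum_{i=1}^{n} (x_i - \bar{x})\, u_i}{\mathrm{SST}_x}}_{\text{sampling error}}. \]
Now define
\[ w_i = \frac{x_i - \bar{x}}{\mathrm{SST}_x}, \qquad i = 1, \ldots, n, \tag{4.2} \]
so we can express the sampling error as a linear function of the unobserved errors, $u_1, \ldots, u_n$,
\[ \hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i. \]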
Step 3: Derive (conditional) mean of estimator
We now take conditional expectations w.r.t. $X_n := \{x_1, x_2, \ldots, x_n\}$. That is, we condition on the observed regressors so that the only random components are the $u_i$'s. Under Assumptions SLR.2 and SLR.4, $E[u_i \mid X_n] = E[u_i \mid x_i] = 0$ (left as an exercise). Conditional on $X_n$, $w_1, \ldots, w_n$ are constants (not random variables) since each of them depends on $X_n$ alone, cf. eq. (4.2). This in turn implies
\[ E[w_i u_i \mid X_n] = w_i\, E[u_i \mid X_n] = 0. \]
Now we can complete the proof: Conditional on $X_n = \{x_1, x_2, \ldots, x_n\}$,
\begin{align*}
E[\hat{\beta}_1 \mid X_n] &= E\left[ \beta_1 + \sum_{i=1}^{n} w_i u_i \,\Big|\, X_n \right] = \beta_1 + \sum_{i=1}^{n} E[w_i u_i \mid X_n] = \beta_1 + \sum_{i=1}^{n} w_i\, E[u_i \mid X_n] \\
&= \beta_1,
\end{align*}
where we used two important properties of expected values: (i) the expected value of a sum is the sum of the expected values and (ii) the expected value of a constant, $\beta_1$ in this case, is just itself.
Unbiasedness conditional on any particular draw of regressors in turn implies that OLS is unbiased unconditionally:
\[ E[\hat{\beta}_1] = E\left[ E[\hat{\beta}_1 \mid X_n] \right] = E[\beta_1] = \beta_1, \]
where the first equality uses the Law of Iterated Expectations.
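Unbiasedness is a statement about the average of $\hat{\beta}_1$ across repeated samples, which a small Monte Carlo experiment can illustrate. The sketch below holds the regressors fixed (mimicking conditioning on $X_n$), redraws the errors many times and averages the resulting slope estimates. All population values, the normal error distribution, the sample size and the number of replications are assumptions made for the illustration.

```python
import random

def ols_slope(x, y):
    """Slope estimate from eq. (4.1)."""
    n = len(x)
    x_bar = sum(x) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x

rng = random.Random(0)                        # fixed seed for reproducibility
beta0, beta1, sigma = 3.0, -0.25, 1.0         # hypothetical population values
x = [rng.uniform(0, 10) for _ in range(50)]   # regressors, held fixed across samples
reps = 5000

slopes = []
for _ in range(reps):
    # Errors drawn independently of x with mean zero, so E[u | x] = 0 holds.
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = sum(slopes) / reps

# Monte Carlo standard error of mean_slope, using Var(b1_hat | X_n) = sigma^2 / SST_x.
x_bar = sum(x) / len(x)
sst_x = sum((xi - x_bar) ** 2 for xi in x)
mc_se = (sigma ** 2 / sst_x) ** 0.5 / reps ** 0.5
print(mean_slope, beta1, mc_se)
```

The average of the simulated slopes should differ from $\beta_1$ only by a few Monte Carlo standard errors; any individual estimate in `slopes` can of course be much further away.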
5 Variance of OLS estimator
The previous section showed that the OLS estimators on average target the population values exactly. However, this is not particularly helpful in itself. For a given sample, the observed estimates may be very far below or above the population values. We therefore now wish to understand the degree of variation of the estimators. If, for example, their variance is zero, the OLS estimators get it right every time, not just on average. Unfortunately, this will not be the case: In almost all applications the OLS estimators will have positive variance. But knowledge of the variances of the estimators will prove important when we develop testing procedures later on.
To derive the variance of the estimators we impose the following additional assumption:
SLR.5 The errors are homoskedastic,
\[ \mathrm{Var}[u \mid x] = \sigma^2 \quad \text{for all values of } x. \]
Theorem 5.1 Under Assumptions SLR.1-SLR.5, and conditional on $X_n = \{x_1, x_2, \ldots, x_n\}$,
\[ \mathrm{Var}(\hat{\beta}_1 \mid X_n) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{\mathrm{SST}_x}, \]
\[ \mathrm{Var}(\hat{\beta}_0 \mid X_n) = \frac{\sigma^2 \left( n^{-1} \sum_{i=1}^{n} x_i^2 \right)}{\mathrm{SST}_x}. \]
Proof. To derive the expression of $\mathrm{Var}(\hat{\beta}_1 \mid X_n)$, recall that
\[ \hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i, \qquad w_i = (x_i - \bar{x}) / \mathrm{SST}_x, \quad i = 1, \ldots, n. \]
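Again, conditional on $X_n$, we can treat the $w_i$ as nonrandom in the derivation. Because $\beta_1$ is a constant, it does not affect $\mathrm{Var}(\hat{\beta}_1 \mid X_n)$.
Now, we need to use the fact that, for uncorrelated random variables, the variance of the sum is the sum of the variances. The $\{u_i : i = 1, 2, \ldots, n\}$ are independent across $i$, and so they are uncorrelated. In fact, by SLR.2 and SLR.4, for $i \neq j$,
\[ \mathrm{Cov}(u_i, u_j \mid X_n) = E[u_i u_j \mid x_i, x_j] = 0. \]
Moreover, by SLR.2 and SLR.5,
\[ \mathrm{Var}(u_i \mid X_n) = \mathrm{Var}(u_i \mid x_i) = \sigma^2. \]
The proofs of these two equations are left as an exercise. Therefore,
\begin{align*}
\mathrm{Var}(\hat{\beta}_1 \mid X_n) &= \mathrm{Var}\left( \sum_{i=1}^{n} w_i u_i \,\Big|\, X_n \right) = \sum_{i=1}^{n} \mathrm{Var}(w_i u_i \mid X_n) \\
&= \sum_{i=1}^{n} w_i^2\, \mathrm{Var}(u_i \mid X_n) = \sum_{i=1}^{n} w_i^2 \sigma^2 = \sigma^2 \sum_{i=1}^{n} w_i^2.
\end{align*}
Finally, note that
\[ \sum_{i=1}^{n} w_i^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\mathrm{SST}_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\mathrm{SST}_x^2} = \frac{\mathrm{SST}_x}{\mathrm{SST}_x^2} = \frac{1}{\mathrm{SST}_x}, \]
and so
\[ \mathrm{Var}(\hat{\beta}_1 \mid X_n) = \frac{\sigma^2}{\mathrm{SST}_x}. \]
The derivation of the conditional variance of $\hat{\beta}_0$ follows the same logic and so is left as an exercise.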
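The variance formula for $\hat{\beta}_1$ can be compared with the spread of simulated estimates. The sketch below keeps the regressors fixed, draws homoskedastic normal errors repeatedly, and compares the Monte Carlo variance of the slope estimates with $\sigma^2 / \mathrm{SST}_x$. The population values, error distribution, sample size and number of replications are all assumptions made for the illustration.

```python
import random

rng = random.Random(1)                        # fixed seed for reproducibility
beta0, beta1, sigma = 1.0, 2.0, 1.5           # hypothetical population values
x = [rng.uniform(0, 5) for _ in range(30)]    # regressors X_n, held fixed
n = len(x)
x_bar = sum(x) / n
sst_x = sum((xi - x_bar) ** 2 for xi in x)
w = [(xi - x_bar) / sst_x for xi in x]        # weights from eq. (4.2)

theory_var = sigma ** 2 / sst_x               # Theorem 5.1

reps = 20000
slopes = []
for _ in range(reps):
    # Independent, mean-zero, homoskedastic errors (SLR.2, SLR.4 and SLR.5 hold).
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(sum(wi * yi for wi, yi in zip(w, y)))   # b1_hat = sum of w_i * y_i

mean_slope = sum(slopes) / reps
mc_var = sum((b - mean_slope) ** 2 for b in slopes) / (reps - 1)
print(theory_var, mc_var)
```

The two printed numbers should be close but not identical, since `mc_var` is itself an estimate based on a finite number of replications.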
6 Goodness-of-Fit ($R^2$)
How much of the variation in $y$ is explained by variation in $x$? If we are interested in that question, we are after the coefficient of determination $R^2$. $R^2$ gives us a sense of the goodness-of-fit of our regression; i.e. it informs us about what fraction of the sample variation in $y$ is explained by variation in $x$.
What do we mean by variation in $y$? The squared distance of $y_i$ from the sample mean $\bar{y}$ informs us about the spread of $y_i$; summing over all observations gives $\sum_{i=1}^{n} (y_i - \bar{y})^2$. Notice that if we divide this expression by $n - 1$ we get the sample variance of $y_i$. For what follows we will work with $\sum_{i=1}^{n} (y_i - \bar{y})^2$. As $y_i = \hat{u}_i + \hat{y}_i$ we can write:
\begin{align*}
\sum_{i=1}^{n} (y_i - \bar{y})^2 &= \sum_{i=1}^{n} (\hat{u}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n} \left( \hat{u}_i^2 + (\hat{y}_i - \bar{y})^2 + 2 \hat{u}_i (\hat{y}_i - \bar{y}) \right) \\
&= \sum_{i=1}^{n} \hat{u}_i^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}).
\end{align*}
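The third term on the RHS of the last equation is 0: Use $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ to write:
\begin{align*}
\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) &= \sum_{i=1}^{n} \hat{u}_i (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y}) \\
&= \sum_{i=1}^{n} \hat{u}_i (\hat{\beta}_0 + \hat{\beta}_1 x_i) - \sum_{i=1}^{n} \hat{u}_i \bar{y} \\
&= \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} \hat{u}_i x_i - \bar{y} \sum_{i=1}^{n} \hat{u}_i \\
&= 0,
\end{align*}
where the last line comes from our sample moment conditions $\sum_{i=1}^{n} \hat{u}_i = 0$ and $\sum_{i=1}^{n} \hat{u}_i x_i = 0$. Hence $\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = 0$. We conclude that
\[ \underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\mathrm{SST}} = \underbrace{\sum_{i=1}^{n} \hat{u}_i^2}_{\mathrm{SSR}} + \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\mathrm{SSE}}, \]
where:
SST: Total sum of squares (variation in $y$)
SSR: Sum of squared residuals (unexplained variation in $y$)
SSE: Explained sum of squares (variation in $y$ explained by $x$)
Dividing across by SST we get:
\[ 1 = \frac{\mathrm{SSR}}{\mathrm{SST}} + \frac{\mathrm{SSE}}{\mathrm{SST}} \;\Leftrightarrow\; \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}, \qquad R^2 \equiv \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}. \]
$R^2$ is the fraction of sample variation in $y$ that is explained by $x$.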
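The decomposition SST = SSR + SSE and the two equivalent expressions for $R^2$ can be verified on a small made-up data set. The sketch below recomputes the OLS fit with the formulas from Section 3 and then forms the three sums of squares; the data and variable names are hypothetical.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS estimates from eqs. (3.5) and (3.6).
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

sst = sum((yi - y_bar) ** 2 for yi in y)         # total sum of squares
ssr = sum(ui ** 2 for ui in resid)               # sum of squared residuals
sse = sum((fi - y_bar) ** 2 for fi in fitted)    # explained sum of squares

r_squared = 1 - ssr / sst
print(sst, ssr + sse, r_squared, sse / sst)
```

The decomposition relies on the sample moment conditions, so it holds for the OLS fit with an intercept; for other fitted lines the cross term need not vanish.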
Part II
Multiple Linear Regression (MLR)
The multiple linear regression model in the population takes the form:
\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \]
where:
$y$ is the dependent variable
$\beta_0$ is the constant
$x_j$, $j \in \{1, 2, \ldots, k\}$, are the independent variables
$\beta_j$ is the coefficient on $x_j$
$u$ is the error term
Advantages of controlling for more variables:
Zero conditional mean assumption more reasonable
Closer to estimating causal/ceteris paribus effects (everything else equal)
More general functional form
Better prediction of $y$ / better fit of the model
7 Assumptions
Similar to the SLR model, we need to impose regularity conditions on the MLR model and the sampling process in order for the regression coefficients to have a ceteris paribus interpretation and for OLS to deliver valid estimators of them:
MLR.1 (Linearity) The model is linear in parameters,
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u. \]
As in the SLR case, this does not rule out non-linear transformations of the variables. Notice that
\[ y = \beta_0 + \beta_1 \ln x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u \]
is also linear in parameters. For $z = \ln x_1$ the above model can be rewritten as
\[ y = \beta_0 + \beta_1 z + \beta_2 x_2 + \cdots + \beta_k x_k + u, \]
which is clearly linear.
MLR.2 (Random sample) The sample is a random draw from the population. The data in the sample are $\{(x_{i1}, \ldots, x_{ik}, y_i) : i = 1, \ldots, n\}$, where the $(x_{i1}, \ldots, x_{ik}, y_i)$ are i.i.d. (independent and identically distributed).
MLR.3 (Full rank/no perfect collinearity) No regressor is constant, $x_j \neq c$, and there is no exact linear relationship among the $x_j$ in the population, i.e. no $x_j$ can be written as $\sum_{l \neq j} \alpha_l x_l$, where the $\alpha_l$ are constants and $l = 1, \ldots, j - 1, j + 1, \ldots, k$.
MLR.4 (Mean-independence) Conditional on $x_1, \ldots, x_k$ the mean of $u$ is 0, i.e. $E[u \mid x_1, \ldots, x_k] = 0$.
Notice that this assumption implies that $E[u] = 0$ and $E[x_j u] = 0$ for all $j$.
MLR.5 (Homoscedasticity) The variance of $u$ is constant and independent of $x$, i.e. $E[u^2 \mid x_1, \ldots, x_k] = \sigma^2$.
For the remaining part of this section, we will focus on the case with two regressors (i.e. with two independent variables only),
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i. \tag{7.1} \]
All the results derived herein can be easily extended to accommodate the general case of $k$ independent variables, but the notation and mathematical derivations get more tedious.
8 OLS estimators
Let $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ be the OLS estimators for $\beta_0$, $\beta_1$, and $\beta_2$ respectively in (7.1). Once the OLS estimators have been computed, we can obtain predicted values and residuals as before,
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}, \qquad \hat{u}_i = y_i - \hat{y}_i. \]
Given these, we have
\[ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \hat{u}_i. \]
Notice that these expressions resemble the ones derived for fitted values and residuals in the context of the Simple Linear Regression Model. How can we interpret $\hat{\beta}_1$ in this context? $\hat{\beta}_1$ measures the ceteris-paribus change in $y$ given a one-unit change in $x_1$; put differently, it measures the change in $y$ given a one-unit change in $x_1$, holding $x_2$ fixed. Obviously, $\hat{\beta}_2$ has an analogous interpretation.
How do we actually obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ in the context of the Multiple Linear Regression Model? There are three equivalent ways: the minimisation of the sum of squared residuals, the partialling-out method, and, finally, a method which makes use of a number of moment conditions.
8.1 Sum of squared residuals
We can obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ by minimising the SSR of the model. The SSR now takes the form
\[ \mathrm{SSR}(b_0, b_1, b_2) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2})^2, \]
where $b_0$, $b_1$ and $b_2$ are any given candidate values for our final estimators. We then define the OLS estimators as the solution to the following minimisation problem,
\[ \min_{b_0, b_1, b_2} \mathrm{SSR}(b_0, b_1, b_2). \]
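The estimators can also be characterised as the solutions to the FOCs. These take the form
\[ \left. \frac{\partial\, \mathrm{SSR}(b_0, b_1, b_2)}{\partial b_0} \right|_{(b_0, b_1, b_2) = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} \right) = 0, \tag{8.1} \]
\[ \left. \frac{\partial\, \mathrm{SSR}(b_0, b_1, b_2)}{\partial b_1} \right|_{(b_0, b_1, b_2) = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)} = -2 \sum_{i=1}^{n} x_{i1} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} \right) = 0, \tag{8.2} \]
\[ \left. \frac{\partial\, \mathrm{SSR}(b_0, b_1, b_2)}{\partial b_2} \right|_{(b_0, b_1, b_2) = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)} = -2 \sum_{i=1}^{n} x_{i2} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} \right) = 0. \tag{8.3} \]
These are three equations with three unknowns; solving them simultaneously we get $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ (you do not have to remember the following formulae):
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2 \sum_{i=1}^{n} (y_i - \bar{y})(x_{i1} - \bar{x}_1) - \sum_{i=1}^{n} (y_i - \bar{y})(x_{i2} - \bar{x}_2) \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2 - \left( \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) \right)^2}, \tag{8.4} \]
\[ \hat{\beta}_2 = \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n} (y_i - \bar{y})(x_{i2} - \bar{x}_2) - \sum_{i=1}^{n} (y_i - \bar{y})(x_{i1} - \bar{x}_1) \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2 - \left( \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) \right)^2}, \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2, \]
where $\bar{x}_j = n^{-1} \sum_{i=1}^{n} x_{ij}$ and $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$.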
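The closed-form expressions in (8.4) are easier to trust once they are seen to satisfy the FOCs (8.1)-(8.3). The sketch below implements the formulas for the two-regressor case and evaluates the three FOC sums at the resulting estimates. The function name `ols_two_regressors` and the data (generated from made-up coefficients plus a fixed list of small disturbances) are hypothetical.

```python
def ols_two_regressors(x1, x2, y):
    """Closed-form OLS for y = b0 + b1*x1 + b2*x2 + u, following eq. (8.4)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2          # common denominator in (8.4)
    if det == 0:
        raise ValueError("MLR.3 fails: perfect collinearity")
    b1 = (s22 * s1y - s2y * s12) / det
    b2 = (s11 * s2y - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Hypothetical data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15]
y = [1.0 + 2.0 * a - 0.5 * b + e for a, b, e in zip(x1, x2, noise)]

b0, b1, b2 = ols_two_regressors(x1, x2, y)
resid = [c - b0 - b1 * a - b2 * b for a, b, c in zip(x1, x2, y)]

# The sums appearing in the FOCs (8.1)-(8.3); all three are zero up to rounding error.
focs = (sum(resid),
        sum(a * u for a, u in zip(x1, resid)),
        sum(b * u for b, u in zip(x2, resid)))
print((b0, b1, b2), focs)
```

The denominator `det` is zero exactly when the two regressors are perfectly collinear in the sample, which is why MLR.3 is needed for the estimators to exist.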
8.2 The partialling-out method
An alternative, and perhaps more intuitive, way to characterise the OLS estimators $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ is the following. First, we estimate a Simple Linear Regression of $x_1$ on $x_2$ (and any other independent variables in the context of a general $k$-variable model):
\[ x_{i1} = \delta_0 + \delta_1 x_{i2} + r_{i1}, \]
where $r_{i1}$ is an error term. We use the Simple Linear Regression tools to get the OLS estimates $\hat{\delta}_0$ and $\hat{\delta}_1$, which we then use to compute the residual:
\[ \hat{r}_{i1} = x_{i1} - \hat{\delta}_0 - \hat{\delta}_1 x_{i2}. \]
How should we interpret $\hat{r}_{i1}$? It is the variation in $x_{i1}$ that is left after removing the variation due to $x_{i2}$: we have partialled out the part of the variation in $x_{i1}$ which was explained by $x_{i2}$, and the remaining variation, as captured by $\hat{r}_{i1}$, contains the "clean" signal in $x_{i1}$.
Properties of residual $\hat{r}_{i1}$: The usual properties apply to $\hat{r}_{i1}$:
1. Residuals sum to zero: $\sum_{i=1}^{n} \hat{r}_{i1} = 0$
2. Residuals are orthogonal to regressors: $\sum_{i=1}^{n} \hat{r}_{i1} x_{i2} = 0$
3. Sum of products between residual and dependent variable equals sum of squared residuals: $\sum_{i=1}^{n} \hat{r}_{i1} x_{i1} = \sum_{i=1}^{n} \hat{r}_{i1}^2$
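The first two properties are well-known features of OLS in SLR. The third property is a consequence of the first two:
\begin{align*}
\sum_{i=1}^{n} \hat{r}_{i1} x_{i1} &= \sum_{i=1}^{n} \hat{r}_{i1} \left( x_{i1} - \hat{\delta}_0 - \hat{\delta}_1 x_{i2} + \hat{\delta}_0 + \hat{\delta}_1 x_{i2} \right) \\
&= \sum_{i=1}^{n} \hat{r}_{i1} \left( \hat{r}_{i1} + \hat{\delta}_0 + \hat{\delta}_1 x_{i2} \right) \\
&= \sum_{i=1}^{n} \left( \hat{r}_{i1}^2 + \hat{\delta}_0 \hat{r}_{i1} + \hat{\delta}_1 x_{i2} \hat{r}_{i1} \right) \\
&= \sum_{i=1}^{n} \hat{r}_{i1}^2 + \hat{\delta}_0 \underbrace{\sum_{i=1}^{n} \hat{r}_{i1}}_{=0} + \hat{\delta}_1 \underbrace{\sum_{i=1}^{n} \hat{r}_{i1} x_{i2}}_{=0} \\
&= \sum_{i=1}^{n} \hat{r}_{i1}^2.
\end{align*}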
After having obtained $\hat{r}_{i1}$, we regress $y_i$ on $\hat{r}_{i1}$ and a constant:
\[ y_i = \gamma_0 + \gamma_1 \hat{r}_{i1} + v_i. \]
This is another Simple Linear Regression model. The OLS estimate of the slope coefficient $\gamma_1$ is
\[ \hat{\gamma}_1 = \frac{\sum_{i=1}^{n} \left( \hat{r}_{i1} - \bar{\hat{r}}_1 \right) y_i}{\sum_{i=1}^{n} \left( \hat{r}_{i1} - \bar{\hat{r}}_1 \right)^2} = \frac{\sum_{i=1}^{n} \hat{r}_{i1} y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2}, \tag{8.5} \]
where the second equality uses that the sample mean of the residuals is zero, $\bar{\hat{r}}_1 = 0$. We will not provide a proof of this, but the above expression is numerically identical to (8.4). That is, the partialling-out algorithm leads to the same estimates as minimising the SSR.
The intuition for this result is the following: The expression of $\hat{\gamma}_1$ in (8.5) measures the change in $y$ which is due to a one-unit change in $x_1$ after having $x_2$ partialled out; put differently, $\hat{\gamma}_1$ measures the change in $y$ which is due to a one-unit change in $x_1$ holding $x_2$ fixed. But that is exactly what $\hat{\beta}_1$ is measuring too (return to the discussion at the beginning of this section). Formally, replacing $\hat{r}_{i1}$ in (8.5) with an analytical expression consisting of $x_{i1}$ and $x_{i2}$ only (obtained from the regression in the first stage), one can see that the expression in (8.5) is exactly the same as $\hat{\beta}_1$ from the minimisation of squared residuals.
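Going back to (7.1), how can we obtain $\hat{\beta}_2$? Using the same two-step approach one can show that
\[ \hat{\beta}_2 = \frac{\sum_{i=1}^{n} \hat{r}_{i2} y_i}{\sum_{i=1}^{n} \hat{r}_{i2}^2}, \]
where the $\hat{r}_{i2}$'s are the residuals from first regressing $x_2$ onto $x_1$ and a constant,
\[ \hat{r}_{i2} = x_{i2} - \hat{\delta}_0 - \hat{\delta}_1 x_{i1}, \qquad i = 1, \ldots, n, \]
with $\hat{\delta}_0$ and $\hat{\delta}_1$ here denoting the OLS estimates from this regression of $x_2$ on $x_1$.
More generally, in a model with $k$ independent variables,
\[ \hat{\beta}_j = \frac{\sum_{i=1}^{n} \hat{r}_{ij} y_i}{\sum_{i=1}^{n} \hat{r}_{ij}^2}, \qquad j = 1, 2, \ldots, k, \]
where the $\hat{r}_{ij}$ are the OLS residuals from a regression of $x_j$ on the other explanatory variables and a constant.
Finally, the estimated constant is
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2, \]
where $\bar{x}_j = n^{-1} \sum_{i=1}^{n} x_{ij}$ and $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$. In the general case with $k$ variables, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \cdots - \hat{\beta}_k \bar{x}_k$.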
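The claim that partialling out reproduces (8.4) is stated without proof, so a numerical check is a useful sanity test. The sketch below runs the first-stage regression of $x_1$ on $x_2$, forms the residuals, computes the slope from (8.5), and compares it with the slope from the direct formula (8.4). The helper `slr` and the data are hypothetical.

```python
def slr(x, y):
    """Simple linear regression of y on x and a constant; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Hypothetical data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15]
y = [1.0 + 2.0 * a - 0.5 * b + e for a, b, e in zip(x1, x2, noise)]

# Stage 1: regress x1 on x2 and keep the residuals r1.
d0, d1 = slr(x2, x1)
r1 = [a - d0 - d1 * b for a, b in zip(x1, x2)]

# Stage 2: slope from eq. (8.5), using that the residuals have mean zero.
b1_partial = sum(r * c for r, c in zip(r1, y)) / sum(r * r for r in r1)

# Direct formula (8.4) for comparison.
n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
b1_direct = (s22 * s1y - s2y * s12) / (s11 * s22 - s12 ** 2)

print(b1_partial, b1_direct)
```

The same residuals `r1` can be used to check the three properties listed for $\hat{r}_{i1}$, for example that `sum(r1)` and the sum of `r1[i] * x2[i]` are both zero up to rounding error.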
8.3 Moment conditions
As in the context of the Simple Linear Regression, one can use the sample counterparts of the three moment conditions (or $k + 1$ in the case of a general $k$-variable model) that follow from MLR.4:
\[ E[u] = 0 \quad \text{and} \quad E[x_j u] = 0, \quad j = 1, 2. \]
The corresponding sample moment conditions are equivalent to the FOCs for the least-squares problem as stated in (8.1)-(8.3).
9 Unbiasedness
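Theorem 9.1 Under MLR.1-MLR.5, $E[\hat{\beta}_j \mid X_n] = \beta_j$, $j = 0, 1, \ldots, k$.
Proof. We only give a proof for $\hat{\beta}_1$ in the case of $k = 2$. The proofs for the remaining estimators are left as an exercise. The derivation proceeds along the same lines as for the SLR case. First, we obtain a convenient expression of the OLS estimator:
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} \hat{r}_{i1} y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2} \\
&\overset{\text{MLR.1}}{=} \frac{\sum_{i=1}^{n} \hat{r}_{i1} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i)}{\sum_{i=1}^{n} \hat{r}_{i1}^2} \\
&= \frac{\sum_{i=1}^{n} \hat{r}_{i1} \beta_0}{\sum_{i=1}^{n} \hat{r}_{i1}^2} + \frac{\sum_{i=1}^{n} \hat{r}_{i1} \beta_1 x_{i1}}{\sum_{i=1}^{n} \hat{r}_{i1}^2} + \frac{\sum_{i=1}^{n} \hat{r}_{i1} \beta_2 x_{i2}}{\sum_{i=1}^{n} \hat{r}_{i1}^2} + \frac{\sum_{i=1}^{n} \hat{r}_{i1} u_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2} \\
&= \beta_0 \underbrace{\frac{\sum_{i=1}^{n} \hat{r}_{i1}}{\sum_{i=1}^{n} \hat{r}_{i1}^2}}_{=0} + \beta_1 \underbrace{\frac{\sum_{i=1}^{n} \hat{r}_{i1} x_{i1}}{\sum_{i=1}^{n} \hat{r}_{i1}^2}}_{=1} + \beta_2 \underbrace{\frac{\sum_{i=1}^{n} \hat{r}_{i1} x_{i2}}{\sum_{i=1}^{n} \hat{r}_{i1}^2}}_{=0} + \frac{\sum_{i=1}^{n} \hat{r}_{i1} u_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2} \\
&= \beta_1 + \frac{\sum_{i=1}^{n} \hat{r}_{i1} u_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2}.
\end{align*}
Defining
\[ w_i = \frac{\hat{r}_{i1}}{\sum_{j=1}^{n} \hat{r}_{j1}^2}, \qquad i = 1, \ldots, n, \tag{9.1} \]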
we can express the OLS estimator as
^1 = 1 +
nX
i=1
wiui; (9.2)
where the second term is the sampling error. Thus, as with the SLR model, we can express the
sampling error of the OLS estimator as a weighted sum of the errors $u_1,\dots,u_n$. The weights
are again functions of the observed regressors alone, $X_n = \{(x_{i1}, x_{i2}) : i = 1,\dots,n\}$, cf. (9.1). Thus,
taking expectations conditional on $X_n$,
$$\begin{aligned}
E[\hat\beta_1 \mid X_n] &= \beta_1 + E\left[\sum_{i=1}^n w_i u_i \,\Big|\, X_n\right] \\
&= \beta_1 + \sum_{i=1}^n w_i E[u_i \mid X_n] \\
&= \beta_1 + \sum_{i=1}^n w_i E[u_i \mid x_{i1}, x_{i2}] \\
&= \beta_1,
\end{aligned}$$
where, as in the SLR, we have used that $E[u_i \mid X_n] = E[u_i \mid x_{i1}, x_{i2}]$ under MLR.2, and that this
conditional mean is zero under MLR.4. Unbiasedness ($E[\hat\beta_1] = \beta_1$) now follows from the Law of
Iterated Expectations. By the same arguments, it can be shown that $\hat\beta_2$ is an unbiased estimator of $\beta_2$.
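The representation of the sampling error as a weighted sum of the errors, and the resulting conditional unbiasedness, can be illustrated by holding the regressors fixed and redrawing the errors. This is only a sketch under assumed values: the coefficients, sample size, number of replications, seed and the normal error distribution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
beta = np.array([1.0, 2.0, -1.0])             # (beta_0, beta_1, beta_2), illustrative
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Weights w_i = r_i1 / sum_j r_j1^2: functions of the regressors only.
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
w = r1 / (r1 @ r1)

# Condition on X_n: keep X fixed and redraw the errors many times.
reps = 5000
U = rng.normal(size=(reps, n))                # each row is one draw of (u_1, ..., u_n)
Y = X @ beta + U                              # each row is one sample of y
b1_draws = Y @ w                              # beta_1-hat = sum_i w_i y_i, per draw

# Identity beta_1-hat = beta_1 + sum_i w_i u_i, checked draw by draw.
identity_gap = np.max(np.abs(b1_draws - (beta[1] + U @ w)))

# Monte Carlo approximation of E[beta_1-hat | X_n].
mc_mean = b1_draws.mean()
```

Here `identity_gap` is zero up to rounding, and `mc_mean` lies close to the assumed $\beta_1 = 2$, with the remaining gap due to simulation noise.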
10 Variance of OLS estimator
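Let us first derive the variance of the estimator under MLR.5. We use eq. (9.2) and the fact that
$w_1,\dots,w_n$ are all functions of $X_n$ to obtain
$$\begin{aligned}
\mathrm{Var}(\hat\beta_1 \mid X_n) &= \mathrm{Var}\left(\beta_1 + \sum_{i=1}^n w_i u_i \,\Big|\, X_n\right) \\
&= \mathrm{Var}\left(\sum_{i=1}^n w_i u_i \,\Big|\, X_n\right) \\
&= \sum_{i=1}^n w_i^2\, \mathrm{Var}(u_i \mid X_n) \\
&= \sum_{i=1}^n w_i^2\, \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) \\
&= \sigma^2 \sum_{i=1}^n w_i^2 \\
&= \frac{\sigma^2}{SST_1 (1 - R_1^2)},
\end{aligned} \tag{10.1}$$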
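where we have used that $\mathrm{Cov}(u_i, u_j \mid X_n) = 0$ for $i \neq j$ (MLR.2) and $\mathrm{Var}(u_i \mid X_n) = \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma^2$
(MLR.2+MLR.5). Finally, notice that $\sum_{i=1}^n \hat r_{i1}^2 = SST_1 (1 - R_1^2)$ because
$$R_1^2 = 1 - \frac{\sum_{i=1}^n \hat r_{i1}^2}{\sum_{i=1}^n (x_{i1} - \bar x_1)^2} \equiv 1 - \frac{\sum_{i=1}^n \hat r_{i1}^2}{SST_1},$$
where $R_1^2$ is the $R^2$ from the first-stage regression of $x_1$ on $x_2$ and a constant.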
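The last step of the variance derivation rests on the identity $\sum_i \hat r_{i1}^2 = SST_1(1 - R_1^2)$. A short numerical check, assuming simulated regressors and an illustrative value of $\sigma^2$, is sketched below. To avoid a circular check, $R_1^2$ is computed as the squared correlation between $x_1$ and its first-stage fitted values, which equals the usual $R^2$ when the regression includes a constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 100, 4.0                      # sigma2 is an assumed error variance
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)

# First-stage regression of x1 on a constant and x2.
Z = np.column_stack([np.ones(n), x2])
fitted = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
r1 = x1 - fitted                          # first-stage residuals r_i1

sst1 = np.sum((x1 - x1.mean()) ** 2)
r2_1 = np.corrcoef(x1, fitted)[0, 1] ** 2  # R_1^2 as a squared correlation

# Two routes to Var(beta_1-hat | X_n) under homoskedasticity.
w = r1 / (r1 @ r1)
var_from_weights = sigma2 * np.sum(w ** 2)
var_from_formula = sigma2 / (sst1 * (1.0 - r2_1))
```

The formula also makes the role of collinearity visible: the closer `r2_1` is to one, the smaller the denominator and the larger the variance.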
11 OLS is BLUE
There are many unbiased estimators of $\beta_j$, $j = 0, 1, \dots, k$. A theorem usually referred to as
the Gauss-Markov theorem states that under assumptions MLR.1 through MLR.5, $\hat\beta_j$ is the best
linear unbiased estimator (BLUE) of $\beta_j$, $j = 0, 1, \dots, k$. What does that mean?
First, we say that a given estimator of $\beta_j$, say $\tilde\beta_j$, is a linear estimator if it can be written as
$$\tilde\beta_j = \tilde w_1 y_1 + \cdots + \tilde w_n y_n,$$
where $\tilde w_1,\dots,\tilde w_n$ are functions of the regressors alone, $X_n$. We know that the OLS estimator can be
written as
$$\hat\beta_j = w_1 y_1 + \cdots + w_n y_n,$$
where, with $\hat r_{1j},\dots,\hat r_{nj}$ being the residuals from the first-stage regression,
$$w_i = \frac{\hat r_{ij}}{\sum_{\ell=1}^n \hat r_{\ell j}^2}, \qquad i = 1,\dots,n. \tag{11.1}$$
So OLS is one particular example of a linear estimator. Next, we narrow the class of linear
estimators to only contain the ones that are unbiased. That is, we require the linear estimator
to satisfy $E[\tilde\beta_j] = \beta_j$. Again, the OLS estimator has this property.
Now, within this class of competing estimators, it can be shown that the OLS estimator has
the smallest variance,
$$\mathrm{Var}(\hat\beta_j) \le \mathrm{Var}(\tilde\beta_j) \quad \text{for any alternative linear unbiased estimator } \tilde\beta_j.$$
We say that the OLS estimator is the Best (= smallest variance) Linear Unbiased Estimator.
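To make the comparison concrete, consider the one-regressor special case and a competing linear unbiased estimator. The competitor used here, the difference in mean $y$ between high-$x$ and low-$x$ observations divided by the corresponding difference in mean $x$, is a hypothetical choice made for illustration; it is not an estimator discussed in the notes. Both estimators have weights that sum to zero and satisfy $\sum_i w_i x_i = 1$, so both are unbiased, and under homoskedasticity their conditional variances are $\sigma^2 \sum_i w_i^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 60, 1.0                       # illustrative sample size and error variance
x = rng.normal(size=n)

# OLS weights in the one-regressor case: w_i = (x_i - xbar) / SST_x.
x_dev = x - x.mean()
w_ols = x_dev / (x_dev @ x_dev)

# Competing linear estimator (hypothetical): split the sample at the median of x,
# beta_tilde = (ybar_high - ybar_low) / (xbar_high - xbar_low).
high = x > np.median(x)
w_alt = (high / high.sum() - (~high) / (~high).sum()) / (x[high].mean() - x[~high].mean())

# Conditional variances under homoskedasticity: sigma^2 * sum_i w_i^2.
var_ols = sigma2 * np.sum(w_ols ** 2)
var_alt = sigma2 * np.sum(w_alt ** 2)
```

In line with the Gauss-Markov theorem, `var_ols` does not exceed `var_alt` for any configuration of the regressors.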
Part III
Asymptotic Theory of OLS
In this part we will analyze the asymptotic properties of the OLS estimators. That is, we are
interested in analyzing the distribution of $\hat\beta_j$ along a sequence of ever-larger samples, so that $n \to \infty$.
To this end, we first remind ourselves of some important concepts regarding convergence
of distributions and of two key results used in asymptotic theory.
12 The Law of Large Numbers and consistency
In the following, we consider a random sample $y_i$, $i = 1,\dots,n$. This sample could come from a
population that satisfies the SLR model, but this is not required for the fundamental results such
as the LLN and CLT presented below.
The Law of Large Numbers (LLN) states that under general conditions the average of a random
sample of size $n$ will be near the population mean with high probability when $n$ is large.
Theorem 12.1 (Law of Large Numbers) Let $y_1,\dots,y_n$ be i.i.d. random variables with mean
$\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$. The sample average $\bar y = \sum_{i=1}^n y_i / n$ then satisfies:
$$\text{For any } \varepsilon > 0: \quad \Pr\left(|\bar y - \mu| > \varepsilon\right) \to 0 \quad \text{as } n \to \infty,$$
and we write
$$\bar y \to^p \mu.$$
That $\bar y$ is a "good" estimator of $\mu$ is not surprising. Recall that
$$E[\bar y] = \mu \quad\text{and}\quad \mathrm{Var}(\bar y) = \frac{\sigma^2}{n}. \tag{12.1}$$
In particular, $\mathrm{Var}(\bar y) \to 0$ as $n \to \infty$, and a random variable with zero variance must necessarily
have all its probability mass at its mean, in this case $E[\bar y] = \mu$.
The LLN is illustrated in Figure 1, where the distribution of $\bar y$ is shown for four different sample
sizes in the case where $y$ is a Bernoulli random variable,
$$\Pr(y = 1) = \mu \quad\text{and}\quad \Pr(y = 0) = 1 - \mu,$$
with $\mu = 0.78$. In particular, $E[y] = \mu = 0.78$. For small $n$, the distribution of $\bar y$ is quite spread out
but eventually, as the sample size grows, most of the probability mass of $\bar y$'s distribution is situated
around $\mu$.
Thus, treating $\bar y$ as an estimator of the population mean, we see that it is a consistent estimator:
Definition 12.2 Consider an estimator $\hat\theta$ of a population parameter $\theta$. We say that $\hat\theta$ is a
consistent estimator of $\theta$ if
$$\text{For any } \varepsilon > 0: \quad \Pr\left(|\hat\theta - \theta| > \varepsilon\right) \to 0 \quad \text{as } n \to \infty,$$
and we write $\hat\theta \to^p \theta$.
Finally, we observe that the LLN can be extended to the following result: any population moment
of a distribution can be consistently estimated by the corresponding sample moment:
Corollary 12.3 Let $y_1,\dots,y_n$ be i.i.d. random variables. For a given transformation $f(y)$ suppose
that $E[f^2(y)]$ exists. Then $\mu_f = E[f(y)]$ is consistently estimated by the corresponding sample
moment $\hat\mu_f = \sum_{i=1}^n f(y_i)/n$,
$$\hat\mu_f \to^p \mu_f.$$
Proof. Since $y_1,\dots,y_n$ are i.i.d., so are $f(y_1),\dots,f(y_n)$. We can therefore apply the LLN to the
latter sequence, which yields the claim.
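The statement of the LLN can be illustrated by simulation in the Bernoulli example with mean 0.78. The sketch below estimates $\Pr(|\bar y - \mu| > \varepsilon)$ by Monte Carlo for several sample sizes; the tolerance $\varepsilon = 0.05$, the number of replications, the sample sizes and the seed are assumptions made for the illustration and are not taken from Figure 1.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, eps, reps = 0.78, 0.05, 20000

def tail_prob(n):
    """Monte Carlo estimate of Pr(|ybar - mu| > eps), where ybar is the
    mean of n i.i.d. Bernoulli(mu) draws."""
    # The sum of n Bernoulli(mu) draws is Binomial(n, mu).
    ybar = rng.binomial(n, mu, size=reps) / n
    return np.mean(np.abs(ybar - mu) > eps)

# Estimated tail probability for increasing sample sizes.
tail_probs = {n: tail_prob(n) for n in (10, 100, 1000, 10000)}
```

The entries of `tail_probs` shrink towards zero as `n` grows, which is exactly what consistency of $\bar y$ for $\mu$ requires.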
Figure 1: Sampling distribution of the sample average of n Bernoulli random variables.
13 The Central Limit Theorem and Asymptotic Normality
The central limit theorem (CLT) says that the distribution of $\bar y$ is well-approximated by a normal
distribution when $n$ is large. Recall eq. (12.1), which informs us of the first two moments of
the distribution of $\bar y$ for any given sample size $n$. The CLT states that as $n$ increases the whole
distribution of $\bar y$ will be close to a normal distribution.
The result is formulated in terms of the following standardized version of the sample mean,
$$\frac{\bar y - E[\bar y]}{\mathrm{std}(\bar y)} = \frac{\bar y - \mu}{\sigma/\sqrt{n}}. \tag{13.1}$$
In particular,
$$E\left[\frac{\bar y - \mu}{\sigma/\sqrt{n}}\right] = 0 \quad\text{and}\quad \mathrm{Var}\left(\frac{\bar y - \mu}{\sigma/\sqrt{n}}\right) = 1.$$
The standardisation is introduced in order to obtain a "stable" distribution: we know from the LLN
that $\bar y \to^p \mu$, and so the non-standardised estimator $\bar y$ has a "degenerate" asymptotic distribution
with all probability mass of $\bar y$ being situated at $\mu$ in large samples, cf. Figure 1.
The standardised version in (13.1), on the other hand, has a non-degenerate asymptotic distribution:
Theorem 13.1 (Central Limit Theorem) Let $y_1,\dots,y_n$ be i.i.d. random variables with mean
$\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$. The sample average $\bar y = \sum_{i=1}^n y_i/n$ then satisfies
$$\text{For all } -\infty < z < +\infty: \quad \Pr\left(\frac{\bar y - \mu}{\sigma/\sqrt{n}} \le z\right) \to \Phi(z) \quad \text{as } n \to \infty,$$
where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution,
$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy,$$
and we write
$$\frac{\bar y - \mu}{\sigma/\sqrt{n}} \overset{a}{\sim} N(0, 1).$$
In Figure 2, the CLT is illustrated for the same example as in Figure 1.
By recentering and rescaling $\bar y$, we obtain a "stable" sequence of distributions that does not
degenerate as the sample size grows. Moreover, the sequence converges towards a standard normal
distribution, as shown by the CLT. Note here that even though the Bernoulli distribution is discrete, and so
very far from a normal distribution, a sample average from this distribution looks normal with
only $n = 100$ observations. So the CLT approximation works well even in moderate samples.
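The quality of the normal approximation in the Bernoulli example can be gauged by comparing the simulated distribution function of the standardised sample mean with $\Phi$. The sketch below does this at three evaluation points; the grid of points, the sample sizes, the number of replications and the seed are illustrative assumptions, and the helper names are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
mu = 0.78
sigma = math.sqrt(mu * (1.0 - mu))        # std. dev. of a Bernoulli(mu) variable

def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_means(n, reps=50000):
    """Draws of (ybar - mu) / (sigma / sqrt(n)) for Bernoulli(mu) samples of size n."""
    ybar = rng.binomial(n, mu, size=reps) / n
    return (ybar - mu) / (sigma / math.sqrt(n))

def max_cdf_gap(n, z_grid=(-1.0, 0.0, 1.0)):
    """Largest gap between the simulated CDF and Phi over the grid z_grid."""
    z_stats = standardized_means(n)
    return max(abs(np.mean(z_stats <= z) - std_normal_cdf(z)) for z in z_grid)

gaps = {n: max_cdf_gap(n) for n in (5, 100, 2000)}
```

For very small `n` the gap is sizeable because the sample mean can only take a handful of values; it becomes small as `n` grows, in line with Theorem 13.1.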
14 Slutsky's theorem and continuous mapping theorem
The LLN shows that $\bar y$ is a consistent estimator of $\mu = E[y]$ and the CLT shows that the standardised
statistic $\sqrt{n}(\bar y - \mu)/\sigma \overset{a}{\sim} N(0, 1)$. Suppose we wish to test the null
$$H_0: \mu = 0.$$
Figure 2: Sampling distribution of $(\bar y - \mu)/(\sigma/\sqrt{n})$, where $\bar y$ is the average of $n$ Bernoulli random variables.
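We wish to do so using the corresponding t statistic,
$$\hat t = \frac{\bar y}{\mathrm{se}(\bar y)} = \frac{\bar y}{\hat\sigma/\sqrt{n}},$$
where
$$\hat\sigma^2 = \frac{1}{n - 1}\sum_{i=1}^n (y_i - \bar y)^2.$$
In order to use it for testing, we want to derive the asymptotic distribution of $\hat t$. To this end, first
observe that
$$\hat t = \frac{\sigma}{\hat\sigma} \cdot \frac{\sqrt{n}\,\bar y}{\sigma}, \quad\text{where}\quad \frac{\sqrt{n}\,\bar y}{\sigma} \overset{a}{\sim} N(0, 1) \text{ under } H_0. \tag{14.1}$$
But what about the extra term $\sigma/\hat\sigma$? This is also a random variable, and so one would expect it
to affect the asymptotic distribution of $\hat t$, but how? The following theorem clarifies this: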
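Theorem 14.1 (Slutsky's Theorem + Continuous Mapping Theorem) Let $\hat a$ and $\hat b$ be two
statistics/estimators satisfying
$$\hat a \to^p a \quad\text{and}\quad \hat b \overset{a}{\sim} N(0, 1).$$
Then
$$\text{(i)}\ \ \hat a + \hat b \overset{a}{\sim} N(a, 1); \quad\text{and}\quad \text{(ii)}\ \ \hat a \hat b \overset{a}{\sim} N(0, a^2). \tag{14.2}$$
Moreover, for any continuous function $g$,
$$\text{(iii)}\ \ g(\hat a) \to^p g(a). \tag{14.3}$$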
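This result can, for example, be used to derive the asymptotic distribution of $\hat t$:
Corollary 14.2 Let $y_1,\dots,y_n$ be i.i.d. random variables with mean $\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$.
Under $H_0: \mu = 0$, the t-statistic associated with the sample mean $\bar y$ then satisfies
$$\hat t = \frac{\bar y}{\mathrm{se}(\bar y)} = \frac{\bar y}{\hat\sigma/\sqrt{n}} \overset{a}{\sim} N(0, 1).$$
Thus, replacing $\sigma$ by a consistent estimator $\hat\sigma$ does not affect the asymptotic distribution of the
t statistic.
Proof. Define
$$\hat a = \frac{\sigma}{\hat\sigma}, \qquad \hat b = \frac{\sqrt{n}\,\bar y}{\sigma},$$
so that we can write eq. (14.1) as
$$\hat t = \hat a \hat b.$$
Next, by the LLN observe that $\hat\sigma^2 \to^p \sigma^2$. This in turn implies that $\hat\sigma/\sigma = \sqrt{\hat\sigma^2/\sigma^2} \to^p 1$ by
Theorem 14.1(iii), and so
$$\hat a = \frac{\sigma}{\hat\sigma} \to^p 1, \qquad \hat b \overset{a}{\sim} N(0, 1).$$
Finally, use Theorem 14.1(ii) to conclude that
$$\hat t \overset{a}{\sim} N(0, 1).$$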
15 Consistency and Asymptotic Normality of OLS
We here only cover the theory for the SLR slope estimator,
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \tag{15.1}$$
to keep the notation and the arguments simple. The theory extends to the MLR model with minor
modifications. In the proofs, we will use the following notation to save space: for any two statistics
$\hat a$ and $\hat b$, we write
$$\hat a \simeq \hat b \quad\text{if}\quad |\hat a - \hat b| \to^p 0.$$
Thus, "$\simeq$" means that the difference between $\hat a$ and $\hat b$ is asymptotically negligible.
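Theorem 15.1 Under SLR.1-SLR.4, $\hat\beta_1 \to^p \beta_1$.
Proof. By the LLN, we have $\bar x \to^p E[x]$. This in turn implies
$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar x) u_i}{\sum_{i=1}^n (x_i - \bar x)^2} \simeq \beta_1 + \frac{\sum_{i=1}^n (x_i - E[x]) u_i}{\sum_{i=1}^n (x_i - E[x])^2}.$$
Two more applications of the LLN yield
$$\hat\sigma_{x,u} := \frac{1}{n}\sum_{i=1}^n (x_i - E[x]) u_i \to^p E[(x - E[x]) u] = \mathrm{Cov}(x, u),$$
$$\hat\sigma_x^2 := \frac{1}{n}\sum_{i=1}^n (x_i - E[x])^2 \to^p E\left[(x - E[x])^2\right] = \mathrm{Var}(x),$$
where $\mathrm{Cov}(x, u) = 0$ by SLR.4 and $\mathrm{Var}(x) > 0$ by SLR.3. Thus, by Theorem 14.1(iii),
$$\hat\beta_1 \simeq \beta_1 + \frac{\hat\sigma_{x,u}}{\hat\sigma_x^2} \to^p \beta_1 + \frac{\mathrm{Cov}(x, u)}{\mathrm{Var}(x)} = \beta_1.$$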
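Theorem 15.2 Under SLR.1-SLR.5,
$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} \overset{a}{\sim} N(0, 1), \tag{15.2}$$
where $\mathrm{sd}(\hat\beta_1) = \sigma/\sqrt{SST_x}$. The same convergence holds when we replace $\mathrm{sd}(\hat\beta_1)$ with
$\mathrm{se}(\hat\beta_1) = \hat\sigma/\sqrt{SST_x}$:
$$t_{\hat\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}(\hat\beta_1)} \overset{a}{\sim} N(0, 1). \tag{15.3}$$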
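Proof. Let
$$\hat\sigma_x^2 = \frac{1}{n} SST_x$$
be the sample variance of $x$. We can then write
$$\hat\beta_1 - \beta_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x) u_i}{\hat\sigma_x^2}, \qquad \mathrm{sd}(\hat\beta_1) = \frac{\sigma}{\sqrt{n}\,\hat\sigma_x}.$$
Combining these two expressions yields
$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x) u_i}{\sigma \hat\sigma_x/\sqrt{n}}. \tag{15.4}$$
By the LLN, the numerator satisfies
$$\frac{1}{n}\sum_{i=1}^n (x_i - \bar x) u_i \simeq \frac{1}{n}\sum_{i=1}^n (x_i - E[x]) u_i,$$
where, with $\sigma_x^2 = \mathrm{Var}(x)$,
$$E[(x - E[x]) u] = 0 \quad\text{and}\quad \mathrm{Var}\left((x - E[x]) u\right) = \sigma^2 \sigma_x^2.$$
Thus, by the CLT,
$$\hat b_n := \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x) u_i}{\sigma \sigma_x/\sqrt{n}} \overset{a}{\sim} N(0, 1). \tag{15.5}$$
Comparing (15.4) and (15.5), we see that
$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} \simeq \frac{\sigma_x}{\hat\sigma_x}\, \hat b_n,$$
where, by the LLN and Theorem 14.1(iii),
$$\frac{\sigma_x}{\hat\sigma_x} \to^p 1. \tag{15.6}$$
An application of Theorem 14.1(ii) now yields (15.2).
To show (15.3), write
$$t_{\hat\beta_1} = \frac{\mathrm{sd}(\hat\beta_1)}{\mathrm{se}(\hat\beta_1)} \cdot \frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)},$$
where
$$\frac{\mathrm{sd}(\hat\beta_1)}{\mathrm{se}(\hat\beta_1)} = \frac{\sigma}{\hat\sigma} \to^p 1.$$
The result now follows from yet another application of Theorem 14.1(ii).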
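Both results can be illustrated by simulation. The sketch below draws SLR samples with skewed (centred exponential) homoskedastic errors, so that any normality of the t statistic comes from the CLT rather than from normal errors. The parameter values, the error distribution, the sample sizes, the seed and the function name `slope_and_se` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, sigma = 1.0, 2.0, 1.5       # illustrative population values

def slope_and_se(n):
    """Simulate one SLR sample of size n; return (beta_1-hat, se(beta_1-hat))."""
    x = rng.normal(size=n)
    u = sigma * (rng.exponential(size=n) - 1.0)   # skewed, mean 0, variance sigma^2
    y = beta0 + beta1 * x + u
    x_dev = x - x.mean()
    sst_x = x_dev @ x_dev
    b1 = (x_dev @ (y - y.mean())) / sst_x         # slope estimator (15.1)
    resid = (y - y.mean()) - b1 * x_dev           # OLS residuals
    sigma_hat = np.sqrt(resid @ resid / (n - 2))
    return b1, sigma_hat / np.sqrt(sst_x)

# Consistency: the estimation error is small in large samples.
errors = {n: abs(slope_and_se(n)[0] - beta1) for n in (100, 10_000, 1_000_000)}

# Asymptotic normality: |t| > 1.96 should occur in roughly 5% of samples.
draws = [slope_and_se(200) for _ in range(2000)]
t_stats = np.array([(b1 - beta1) / se for b1, se in draws])
reject_rate = np.mean(np.abs(t_stats) > 1.96)
```

With these settings `reject_rate` should be close to the nominal 5%, even though the errors are far from normal.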
Part IV
Heteroskedasticity
MLR.5 assumes homoskedastic errors. This is a strong assumption which in many applications may
be violated. We here develop inferential tools that are robust to unknown heteroskedasticity.
If we drop MLR.5, the conditional variance of $u_i$ will generally depend on the regressors,
$$\mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma_i^2 = \sigma^2(x_{i1}, x_{i2}),$$
where $\sigma^2(x_{i1}, x_{i2})$ is the unknown conditional variance of $u_i \mid (x_{i1}, x_{i2})$. What are the consequences
of heteroskedasticity for the OLS estimators and the t test we developed earlier? First note that the
OLS estimators remain unbiased and consistent: these two properties were established without
making use of MLR.5. What remains is to understand how the variance and the large-sample
distribution are affected.
16 Heteroskedasticity robust standard errors
Regarding the variance, recall the derivations of eq. (10.1). It is easily checked that the first
four equalities in this display remain valid once we drop MLR.5. However, the fifth equality uses
MLR.5 and so is no longer correct. Instead, we now substitute $\mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma_i^2$ into the
expression and obtain
$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \sum_{i=1}^n w_i^2\, \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sum_{i=1}^n w_i^2 \sigma_i^2 = \frac{\sum_{i=1}^n \hat r_{i1}^2 \sigma_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}.$$
Similar to the homoskedastic case, we would now like to obtain an estimator of the last expression
so that we can compute standard errors of $\hat\beta_1$.
If we are willing to assume a particular functional form of $\sigma^2(x_{i1}, x_{i2})$, we could estimate $\sigma_i^2$,
$i = 1,\dots,n$, from the residuals. We could, for example, assume that $\sigma^2(x_1, x_2) = \delta_0 + \delta_1 x_1 + \delta_2 x_2$.
But, similar to when we imposed MLR.5, there is a risk that $\sigma^2(x_1, x_2)$ is not a linear function;
we do not know. And so any procedure based on ad hoc functional form assumptions is prone to
be invalid in many applications.
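Instead, we will develop an estimator that targets the large-sample limit of $\mathrm{Var}(\hat\beta_1 \mid X_n)$.
Appealing to the LLN, we have
$$\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \sigma_i^2 \to^p E\left[r_1^2 \sigma^2(x_1, x_2)\right], \qquad \frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \to^p E\left[r_1^2\right],$$
and so
$$\mathrm{Var}(\hat\beta_1 \mid X_n) \simeq \frac{1}{n}\, \frac{E\left[r_1^2 \sigma^2(x_1, x_2)\right]}{\left(E\left[r_1^2\right]\right)^2}.$$
Can we estimate the two population moments of the ratio? $\left(\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2\right)^2$ is a consistent estimator
of $\left(E\left[r_1^2\right]\right)^2$, but how do we estimate $E\left[r_1^2 \sigma^2(x_1, x_2)\right]$ when we do not know $\sigma^2(x_1, x_2)$? Let us first
rewrite this moment: using that
$$E\left[u^2 \mid x_1, x_2\right] = \sigma^2(x_1, x_2),$$
we obtain, by the Law of Iterated Expectations,
$$E\left[r_1^2 \sigma^2(x_1, x_2)\right] = E\left[r_1^2\, E\left[u^2 \mid x_1, x_2\right]\right] = E\left[r_1^2 u^2\right].$$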
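Thus, a consistent estimator of the numerator is $\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2$ and so
$$\widehat{\mathrm{Var}}(\hat\beta_1 \mid X_n) = \frac{1}{n}\, \frac{\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2\right)^2} = \frac{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}$$
is a consistent estimator of the large-sample variance of $\hat\beta_1$. Taking the square root of this expression
yields heteroskedasticity robust standard errors that are valid in large samples,
$$\mathrm{se}_{HR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}} = \frac{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}}{\sum_{i=1}^n \hat r_{i1}^2}.$$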
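The robust standard error can be computed directly from the first-stage residuals and the OLS residuals. The sketch below does so on simulated heteroskedastic data and, as a cross-check, compares the result with the matrix "sandwich" form $(X'X)^{-1} X' \mathrm{diag}(\hat u_i^2) X (X'X)^{-1}$, which is the standard matrix expression of the same estimator without any degrees-of-freedom correction. The data-generating process, sample size and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
u = rng.normal(size=n) * np.sqrt(0.5 + x1 ** 2)   # Var(u | x1, x2) depends on x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + u

# OLS fit and residuals u_hat.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_hat

# First-stage residuals r_i1 from regressing x1 on a constant and x2.
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Heteroskedasticity-robust standard error of beta_1-hat.
se_hr = np.sqrt(np.sum(r1 ** 2 * u_hat ** 2)) / np.sum(r1 ** 2)

# Usual standard error, valid only under MLR.5.
sigma2_hat = u_hat @ u_hat / (n - 3)
se_usual = np.sqrt(sigma2_hat / np.sum(r1 ** 2))

# Cross-check with the matrix sandwich formula.
XtX_inv = np.linalg.inv(X.T @ X)
sandwich = XtX_inv @ (X.T * u_hat ** 2) @ X @ XtX_inv
se_hr_matrix = np.sqrt(sandwich[1, 1])
```

The two routes to the robust standard error agree up to rounding, while `se_usual` will generally differ from `se_hr` when the error variance depends on the regressors.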
17 Large sample distribution of t statistic
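Given the proposed standard errors under heteroskedasticity, we need to derive the large-sample
distribution of the corresponding t-statistic. Recall that
$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}.$$
Thus,
$$t_{\hat\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}_{HR}(\hat\beta_1)}
= \frac{\sum_{i=1}^n \hat r_{i1} u_i \big/ \sum_{i=1}^n \hat r_{i1}^2}{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2} \big/ \sum_{i=1}^n \hat r_{i1}^2}
= \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}}
\simeq \frac{\sum_{i=1}^n r_{i1} u_i}{\sqrt{\sum_{i=1}^n r_{i1}^2 u_i^2}}.$$
By the CLT together with $E[r_{i1} u_i] = 0$,
$$\frac{\frac{1}{n}\sum_{i=1}^n r_{i1} u_i}{\sqrt{\frac{1}{n} E\left[r_{i1}^2 u_i^2\right]}} \overset{a}{\sim} N(0, 1),$$
while, by the LLN,
$$\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2 \to^p E\left[r_{i1}^2 u_i^2\right] \quad\Rightarrow\quad \sqrt{\frac{E\left[r_{i1}^2 u_i^2\right]}{\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2}} \to^p 1.$$
In total,
$$t_{\hat\beta_1} \simeq \frac{\sum_{i=1}^n r_{i1} u_i}{\sqrt{\sum_{i=1}^n r_{i1}^2 u_i^2}}
= \sqrt{\frac{E\left[r_{i1}^2 u_i^2\right]}{\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2}} \cdot \frac{\frac{1}{n}\sum_{i=1}^n r_{i1} u_i}{\sqrt{\frac{1}{n} E\left[r_{i1}^2 u_i^2\right]}} \overset{a}{\sim} N(0, 1).$$
Thus, t statistics based on heteroskedasticity robust standard errors will follow standard normal
distributions in large samples irrespective of whether MLR.5 holds or not. In particular, we can
use the same critical values as before.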
Part V
Repeated cross sections and panel data
Suppose we have observed $(y_{it}, x_{it})$, $i = 1,\dots,n$ and $t = 1,\dots,T$, from the following population
regression model,
$$y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}. \tag{17.1}$$
The previous parts of the notes developed OLS estimators of $\beta_0$ and $\beta_1$ when $T = 1$ and explored
their properties. We will here extend the theory to the case with $T \ge 2$. All subsequent results are
easily extended to the case of multiple regressors at the price of more cumbersome notation.
Data could either be repeated cross sections, in which case a new sample of $n$ individuals is
drawn from the population in each time period $t = 1,\dots,T$, or panel data, in which case the same
$n$ individuals are followed across all $T$ time periods. Irrespective of whether data have arrived
from the first or the second scenario, we can still use the same estimators; these will be developed
in the next section. However, the properties of these estimators will be different depending on the
sampling scheme. We first analyse the estimators in the case of repeated cross sections and then
consider the case of panel data.
We will require at a minimum that $E[u_{it} \mid x_{it}] = 0$ so that $u_{it}$ does not contain any relevant
omitted variables. This rules out, for example, the presence of fixed effects in $u_{it}$. In the case
of panel data with fixed effects present in the original data, this means that we should think of
(17.1) as having been obtained after a suitable transformation. Suppose we have observed $(\tilde y_{it}, \tilde x_{it})$ from
the following "original" regression model,
$$\tilde y_{it} = \tilde\beta_0 + \delta_0\, d2_t + \beta_1 \tilde x_{it} + \tilde a_i + \tilde u_{it}, \qquad t = 1,\dots,T + 1, \tag{17.2}$$
where $\tilde a_i$ is a fixed effect and $d2_t$ is a time dummy. We would then first remove the fixed effect
through a suitable transformation before estimation. For example, in the case of the first-differencing
estimator, we would compute
$$y_{it} = \Delta\tilde y_{it}, \qquad x_{it} = \Delta\tilde x_{it},$$
and these new variables would now satisfy (17.1) with $u_{it} = \Delta\tilde u_{it}$.
18 OLS estimation
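Given that the model (17.1) holds for all $n$ individuals across the $T$ time periods, a natural way to
estimate the parameters is through pooled OLS. For given candidate values of the estimators, $b_0$ and $b_1$, we
define the pooled SSR as
$$SSR(b_0, b_1) = \sum_{i=1}^n \sum_{t=1}^T (y_{it} - b_0 - b_1 x_{it})^2,$$
and then choose as estimators the values of $b_0$ and $b_1$ that minimise the SSR,
$$\min_{b_0, b_1} SSR(b_0, b_1).$$
As usual, these estimators, denoted $\hat\beta_0$ and $\hat\beta_1$, can be characterised as the solutions to the first
order conditions of this minimisation problem:
$$\left.\frac{\partial SSR(b_0, b_1)}{\partial b_0}\right|_{(b_0, b_1) = (\hat\beta_0, \hat\beta_1)} = -2\sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) = 0, \tag{18.1}$$
$$\left.\frac{\partial SSR(b_0, b_1)}{\partial b_1}\right|_{(b_0, b_1) = (\hat\beta_0, \hat\beta_1)} = -2\sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) x_{it} = 0. \tag{18.2}$$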
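To derive explicit expressions of $\hat\beta_0$ and $\hat\beta_1$, we proceed as in the cross-sectional setting ($T = 1$):
first, rearrange (18.1) to obtain
$$\hat\beta_0 = \left(\frac{1}{nT}\sum_{i=1}^n \sum_{t=1}^T y_{it}\right) - \hat\beta_1 \left(\frac{1}{nT}\sum_{i=1}^n \sum_{t=1}^T x_{it}\right) = \bar y - \hat\beta_1 \bar x,$$
where the averages $\bar y$ and $\bar x$ are computed using the pooled data set. Substituting this expression
of $\hat\beta_0$ into (18.2) and rearranging yields
$$\begin{aligned}
0 &= \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) x_{it} = \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \bar y - \hat\beta_1 (x_{it} - \bar x)\right) x_{it} \\
&= \sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y) x_{it} - \hat\beta_1 \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x) x_{it},
\end{aligned}$$
with solution
$$\hat\beta_1 = \frac{\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y) x_{it}}{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x) x_{it}} = \frac{\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)(x_{it} - \bar x)}{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2}, \tag{18.3}$$
where we have used that $\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)\bar x = \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)\bar x = 0$. We see that the
estimators are of the same form as when $T = 1$, except that all averages in their expressions are
now computed using the pooled data set.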
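The pooled OLS formulas translate directly into array operations when the data are stored with one row per individual and one column per time period. The sketch below, on simulated data with illustrative dimensions, coefficients and seed, computes the pooled estimators, evaluates the two first-order conditions, and confirms that the same numbers are obtained by stacking all $nT$ observations into one long regression.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 100, 4
x = rng.normal(size=(n, T))                     # x[i, t]
y = 1.0 + 2.0 * x + rng.normal(size=(n, T))     # y[i, t]

# Pooled averages use all n*T observations.
x_bar, y_bar = x.mean(), y.mean()

# Pooled slope and intercept, following the closed-form expressions.
b1 = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# First-order conditions evaluated at (b0, b1).
resid = y - b0 - b1 * x
foc_b0 = -2.0 * resid.sum()
foc_b1 = -2.0 * np.sum(resid * x)

# Same estimates from one long regression on the stacked data.
X_long = np.column_stack([np.ones(n * T), x.ravel()])
b_long = np.linalg.lstsq(X_long, y.ravel(), rcond=None)[0]
```

Both first-order conditions evaluate to zero up to rounding, and `b_long` reproduces `(b0, b1)`.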
19 Analysis of OLS - repeated cross sections
We here analyse the properties of the OLS estimators derived in the previous section in the case
where the data are repeated cross sections. We will work under the following conditions, which are natural
extensions of the ones for the case of $T = 1$:
SLR.1* In the population, the following relationship holds between $x_t$ and $y_t$,
$$y_t = \beta_0 + \beta_1 x_t + u_t, \qquad t = 1,\dots,T. \tag{19.1}$$
SLR.2* $\{x_{it}, y_{it}\}$, $i = 1,\dots,n$ and $t = 1,\dots,T$, are i.i.d.
SLR.3* $x_{it}$, $i = 1,\dots,n$ and $t = 1,\dots,T$, exhibit variation so that $SST_x = \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 > 0$.
SLR.4* $E[u_t \mid x_t] = 0$, $t = 1,\dots,T$.
SLR.1* assumes that the model holds true across all $T$ time periods. We can think of each time
period as corresponding to a new population, and SLR.1* then states that for each of these populations
our regression model holds true with the same common parameters $\beta_0$ and $\beta_1$. SLR.2* formally
states that in each time period $t = 1,\dots,T$, we draw a new sample which is independent of the
samples drawn in previous periods. SLR.3* requires sufficient variation in the regressor across
individuals and time periods so that the OLS estimators are well-defined. SLR.4* is the usual
mean-independence assumption.
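Theorem 19.1 Under SLR.1*-SLR.4*, with $X_n = \{x_{it} : i = 1,\dots,n;\ t = 1,\dots,T\}$,
$$E[\hat\beta_j \mid X_n] = \beta_j, \qquad j = 0, 1,$$
$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2},$$
where $\sigma_{it}^2 = \mathrm{Var}(u_{it} \mid x_{it})$.
Proof. The proof follows along the same lines as the ones of Theorems 4.1 and 5.1. Observe that
(18.3) can be rewritten as
$$\hat\beta_1 = \frac{\sum_{i=1}^n \sum_{t=1}^T y_{it} (x_{it} - \bar x)}{SST_x}.$$
Next, substitute $y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}$ into this expression,
$$\begin{aligned}
\hat\beta_1 &= \frac{\sum_{i=1}^n \sum_{t=1}^T (\beta_0 + \beta_1 x_{it} + u_{it})(x_{it} - \bar x)}{SST_x} \\
&= \beta_0 \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)}{SST_x} + \beta_1 \frac{\sum_{i=1}^n \sum_{t=1}^T x_{it} (x_{it} - \bar x)}{SST_x} + \frac{\sum_{i=1}^n \sum_{t=1}^T u_{it} (x_{it} - \bar x)}{SST_x} \\
&= \beta_1 + \frac{\sum_{i=1}^n \sum_{t=1}^T u_{it} (x_{it} - \bar x)}{SST_x} \\
&= \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} u_{it},
\end{aligned}$$
where
$$w_{it} = \frac{x_{it} - \bar x}{SST_x}, \qquad i = 1,\dots,n \text{ and } t = 1,\dots,T.$$
Now, using that $w_{it}$ is a function of $X_n$ alone,
$$E[\hat\beta_1 \mid X_n] = \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} E[u_{it} \mid X_n] = \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} E[u_{it} \mid x_{it}] = \beta_1,$$
where the second and third equalities use SLR.2* and SLR.4*, respectively.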
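Similarly,
$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2\, \mathrm{Var}(u_{it} \mid X_n) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2\, \mathrm{Var}(u_{it} \mid x_{it}) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2 \sigma_{it}^2 = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2}.$$
Consistent heteroskedasticity robust standard errors of $\hat\beta_1$ are given by
$$\mathrm{se}_{HR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \hat u_{it}^2}{SST_x^2}}, \tag{19.2}$$
and, appealing to the LLN and CLT, it then follows that
$$t_{\hat\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}_{HR}(\hat\beta_1)} \overset{a}{\sim} N(0, 1).$$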
20 Analysis of OLS - panel data
In the case of panel data, we will keep SLR.3* from the previous section but replace SLR.2* and
SLR.4* by:
SLR.2** $\{x_{it}, y_{it} : t = 1,\dots,T\}$, $i = 1,\dots,n$, are i.i.d.
SLR.4** $E[u_t \mid X] = 0$, $t = 1,\dots,T$, where $X = (x_1,\dots,x_T)$.
SLR.2** assumes that the $n$ individuals in our data set are randomly sampled from the population,
in which case the observations from unit $i$ are independent of observations from unit $j$,
$i \neq j$. However, for any given individual $i$, $(y_{is}, x_{is})$ is allowed to be dependent on/correlated with
$(y_{it}, x_{it})$, $s, t = 1,\dots,T$. This is a more realistic assumption compared to SLR.2* when working
with panel data: recall that we here follow the same observational units over time. For a given
unit $i$, it is likely that outcomes in period $t$ will be dependent on past and future outcomes. For
example, if $y_{it}$ is the wage of a given individual in period $t$, then this will most likely be dependent
on past and future earnings of the same individual.
SLR.4** requires that $u_t$ is mean-independent of current, past and future values of the regressor.
This is a stronger assumption than SLR.4* and is needed because we here allow $u_t$ to be dependent
on $x_s$, $s \neq t$.
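In order to derive the variance of the OLS estimator under SLR.2**, we need the following
result: for any $T$ random variables, $v_1,\dots,v_T$,
$$\mathrm{Var}(v_1 + \cdots + v_T) = \sum_{t=1}^T \mathrm{Var}(v_t) + 2\sum_{s=1}^{T-1} \sum_{t=s+1}^{T} \mathrm{Cov}(v_s, v_t). \tag{20.1}$$
To see this, define $\bar v_t = v_t - E[v_t]$ so that
$$\mathrm{Var}(v_1 + \cdots + v_T) = E\left[(\bar v_1 + \cdots + \bar v_T)^2\right],$$
where
$$(\bar v_1 + \cdots + \bar v_T)^2 = \bar v_1^2 + \cdots + \bar v_T^2 + 2\sum_{s=1}^{T-1} \sum_{t=s+1}^{T} \bar v_s \bar v_t.$$
Each pair $s \neq t$ appears twice in the expansion of the square (as $\bar v_s \bar v_t$ and as $\bar v_t \bar v_s$), which is
why the double sum runs over $s < t$ only and is multiplied by 2.
Theorem 20.1 Under SLR.1*, SLR.2**, SLR.3* and SLR.4**, with $X_n = \{x_{it} : i = 1,\dots,n;\ t = 1,\dots,T\}$,
$$E[\hat\beta_j \mid X_n] = \beta_j, \qquad j = 0, 1,$$
$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2} + 2\, \frac{\sum_{i=1}^n \sum_{s=1}^{T-1} \sum_{t=s+1}^{T} (x_{is} - \bar x)(x_{it} - \bar x)\, \mathrm{Cov}(u_{is}, u_{it} \mid X_i)}{SST_x^2},$$
where $\sigma_{it}^2 = \mathrm{Var}(u_{it} \mid X_i)$ and $X_i = (x_{i1},\dots,x_{iT})$.
Proof. The proof proceeds as the one of Theorem 19.1, except for the variance computation. With
panel data, we have
$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \sum_{i=1}^n \mathrm{Var}\left(\sum_{t=1}^T w_{it} u_{it} \,\Big|\, X_n\right),$$
where we have used that observations from unit $i$ are independent of observations from unit $j$.
With $v_t = w_{it} u_{it}$, (20.1) yields
$$\begin{aligned}
\mathrm{Var}\left(\sum_{t=1}^T w_{it} u_{it} \,\Big|\, X_n\right) &= \sum_{t=1}^T w_{it}^2\, \mathrm{Var}(u_{it} \mid X_n) + 2\sum_{s=1}^{T-1} \sum_{t=s+1}^{T} w_{is} w_{it}\, \mathrm{Cov}(u_{is}, u_{it} \mid X_n) \\
&= \sum_{t=1}^T w_{it}^2\, \mathrm{Var}(u_{it} \mid X_i) + 2\sum_{s=1}^{T-1} \sum_{t=s+1}^{T} w_{is} w_{it}\, \mathrm{Cov}(u_{is}, u_{it} \mid X_i),
\end{aligned}$$
where the second equality uses SLR.2**.
Comparing the expressions of $\mathrm{Var}(\hat\beta_1 \mid X_n)$ in Theorems 19.1 and 20.1, we see that an extra term
appears when working with panel data. This is due to potential autocorrelation of the regression
errors, $\mathrm{Cov}(u_{is}, u_{it} \mid X_i) \neq 0$. If the errors do not exhibit autocorrelation, the variance
expressions are identical. To control for potential autocorrelation, we need to adjust the standard
errors stated in (19.2) and use the following heteroskedasticity and autocorrelation robust (HAR)
ones:
$$\mathrm{se}_{HAR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \left[\sum_{t=1}^T (x_{it} - \bar x)^2 \hat u_{it}^2 + 2\sum_{s=1}^{T-1} \sum_{t=s+1}^{T} (x_{is} - \bar x)(x_{it} - \bar x)\, \hat u_{is} \hat u_{it}\right]}{SST_x^2}}. \tag{20.2}$$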
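The HAR standard error can be computed with the data arranged as an $n \times T$ array. In the sketch below the within-unit cross products are summed once per pair $s < t$ and then doubled, which is algebraically the same as squaring the per-unit sum $\sum_t (x_{it} - \bar x)\hat u_{it}$ and adding over units; the code checks this equivalence. The simulated design (a unit-specific component in both the regressor and the error), the dimensions and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n, T = 200, 5
# A unit-specific component makes x and u correlated over time within a unit.
x = rng.normal(size=(n, 1)) + rng.normal(size=(n, T))
u = rng.normal(size=(n, 1)) + rng.normal(size=(n, T))
y = 1.0 + 2.0 * x + u

# Pooled OLS and residuals.
x_dev = x - x.mean()
sst_x = np.sum(x_dev ** 2)
b1 = np.sum(x_dev * (y - y.mean())) / sst_x
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x

# Heteroskedasticity-robust s.e.: ignores within-unit correlation.
se_hr = np.sqrt(np.sum(x_dev ** 2 * u_hat ** 2)) / sst_x

# HAR s.e.: add twice the cross products over pairs s < t within each unit.
s = x_dev * u_hat                         # s[i, t] = (x_it - xbar) * u_hat_it
cross = sum(np.sum(s[:, a] * s[:, b]) for a in range(T) for b in range(a + 1, T))
se_har = np.sqrt(np.sum(s ** 2) + 2.0 * cross) / sst_x

# Equivalent form: sum over units of the squared within-unit sum.
se_har_by_unit = np.sqrt(np.sum(s.sum(axis=1) ** 2)) / sst_x
```

The per-unit form also shows that the quantity under the square root is a sum of squares and hence never negative. In designs like this one, with positively autocorrelated regressors and errors, `se_har` will typically exceed `se_hr`.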