ECON0019: Quantitative Economics & Econometrics

Notes for Term 1

Dennis Kristensen

University College London

November 2, 2021

These notes are meant to accompany Wooldridge's "Introductory Econometrics", providing more details on the mathematical results of the book.

Part I

Simple Linear Regression (SLR)

We are interested in estimating the relationship between $x$ and $y$ in a given population. Suppose the following two assumptions are satisfied:

SLR.1 In the population, the following relationship holds between $x$ and $y$:
$$y = \beta_0 + \beta_1 x + u, \quad E[u|x] = 0, \qquad (0.1)$$
where $\beta_0$ and $\beta_1$ are unknown parameters and $u$ is an unobserved error term.

The error term $u$ captures other factors, in addition to $x$, that influence/generate $y$. In order to be able to disentangle the impact of $x$ from these other factors, we will require $u$ to be mean-independent of $x$:

SLR.4 $E[u|x] = 0$.

That is, conditional on $x$, the expected value of $u$ in the population is 0; it implies that no value of $x$ conveys any information about $u$ on average. It is helpful here to remind ourselves what conditional expectations are:

1 Refresher on conditional distributions

Consider two random variables, $y$ and $x$. These do not have to satisfy the SLR model (or any other model). Suppose for simplicity that both are discrete (all subsequent arguments and results easily generalise to the case where they are continuously distributed). Let $x^{(1)}, \ldots, x^{(K)}$, for some $K \geq 1$, be the possible values that $x$ can take and $y^{(1)}, \ldots, y^{(L)}$, for some $L \geq 1$, be the possible values of $y$. Now, let
$$p_{X,Y}\left(x^{(i)}, y^{(j)}\right) = \Pr\left(x = x^{(i)}, y = y^{(j)}\right), \quad i = 1, \ldots, K, \; j = 1, \ldots, L,$$
be the joint probability function. From this we can, for example, compute the marginal distributions of $x$ and $y$,
$$p_X(x) = \sum_{j=1}^{L} p\left(x, y^{(j)}\right), \qquad p_Y(y) = \sum_{i=1}^{K} p\left(x^{(i)}, y\right),$$
for any given $x \in \{x^{(1)}, \ldots, x^{(K)}\}$ and $y \in \{y^{(1)}, \ldots, y^{(L)}\}$.

The conditional distribution of $y|x$ is defined as
$$p_{Y|X}(y|x) = \frac{p_{Y,X}(y, x)}{p_X(x)}, \qquad (1.1)$$
for any values of $(y, x)$. The conditional distribution slices the distribution of $y$ up according to the value of $x$: $p_{Y|X}\left(y^{(j)}|x^{(i)}\right)$ is the probability of observing $y = y^{(j)}$ in the subpopulation for which $x = x^{(i)}$. If $x$ is informative about $y$ (that is, they are dependent) then $p_{Y|X}(y|x) \neq p_Y(y)$.

In this course we are generally interested in modelling conditional distributions because we are interested in causal effects. So very often we will write up a model for $p_{Y|X}(y|x)$. Any given model of $p_{Y|X}(y|x)$ can then be used to, for example, make statements about the marginal distribution of $y$, since
$$p_Y(y) = \sum_{i=1}^{K} p_{Y|X}\left(y|x^{(i)}\right) p_X\left(x^{(i)}\right). \qquad (1.2)$$

Example: Male and Female wages. Consider the population of UK adults. Let $y \in \{0, 1, 2, \ldots, L\}$ be the log weekly earnings in pounds of a UK adult (binned so that it is discrete, with the number of bins $L$ being some very large number) in a given year, and let $x \in \{0, 1\}$ be a dummy variable indicating the gender of that same individual: $x = 0$ if the individual is male while $x = 1$ if the individual is female.

In this case, $p_{Y|X}(y|x = 0)$ is the UK log-wage distribution for men and $p_{Y|X}(y|x = 1)$ is the log-wage distribution for women. If men and women have different earnings distributions, then $p_{Y|X}(y|x = 0) \neq p_{Y|X}(y|x = 1)$.

The over-all UK wage distribution is
$$p_Y(y) = p_X(x = 0)\, p_{Y|X}(y|x = 0) + p_X(x = 1)\, p_{Y|X}(y|x = 1),$$
where $p_X(x = 0)$ and $p_X(x = 1)$ are the proportions of men and women in the UK, respectively. In 2018, the UK population was 66.78 million, with 33.82 million females and 32.98 million males. Thus, if the year of interest is 2018,
$$p_X(x = 0) = \frac{32.98}{66.78} = 0.49, \qquad p_X(x = 1) = \frac{33.82}{66.78} = 0.51. \qquad (1.3)$$

For any given value of $x$, $p_{Y|X}(y|x)$ is a probability distribution. In particular, we can compute means, variances, etc. For example, the conditional mean is defined as
$$E[y|x] = \sum_{j=1}^{L} y^{(j)} p_{Y|X}\left(y^{(j)}|x\right).$$
More generally, for any function $f(y)$,
$$E[f(y)|x] = \sum_{j=1}^{L} f\left(y^{(j)}\right) p_{Y|X}\left(y^{(j)}|x\right).$$
For example, the conditional variance is given by
$$\mathrm{Var}(y|x) = E\left[(y - E[y|x])^2 \,\middle|\, x\right] = E[y^2|x] - E[y|x]^2.$$

Example: Male and Female wages. In the above wage example,
$$E[y|x = 0] = \text{the average log-wage for men}, \qquad E[y|x = 1] = \text{the average log-wage for women},$$
and
$$\mathrm{Var}(y|x = 0) = \text{the spread/variance of log-wages for men}, \qquad \mathrm{Var}(y|x = 1) = \text{the spread/variance of log-wages for women}.$$

1.1 Some useful rules for computing conditional expectations

Recall that for any two constants $a$ and $b$,
$$E[ay + b] = aE[y] + b.$$
When we compute expectations conditional on $x$, we can treat $x$ as a constant (we keep it fixed at the particular value). Thus, the following rule applies: For any two functions $a(x)$ and $b(x)$,
$$E[a(x)y + b(x)|x] = a(x)E[y|x] + b(x).$$

Example: Male and Female wages. Suppose $(y, x)$ satisfies SLR.1–SLR.4 with $\beta_0 = 3$ and $\beta_1 = -0.25$. What are the average log-wages for men and women, respectively? Taking conditional expectations on both sides of eq. (0.1) and then using the above rules, we obtain
$$E[y|x] = E[3 - 0.25x + u|x] = 3 - 0.25x + E[u|x] = 3 - 0.25x,$$
where the last equality uses SLR.4; that is, $E[u|x = 0] = E[u|x = 1] = 0$.¹ Thus,
$$E[y|x = 0] = 3 - 0.25 \cdot 0 = 3, \qquad E[y|x = 1] = 3 - 0.25 \cdot 1 = 2.75.$$
So in this example average log-earnings for men are higher than those for women.

¹ Is it reasonable that SLR.4 holds for this regression? To answer this, you should first determine which factors affect wages besides gender. Next, you should contemplate whether these factors are mean-independent of gender.

Recall eq. (1.2), which relates the marginal distribution of $y$ to its conditional distribution. Similarly, we can link the unconditional mean of $y$, $E[y]$, to the conditional ones, $E[y|x]$:

Theorem 1.1 (Law of iterated expectations) For any two random variables $y$ and $x$, the following holds:
$$E[y] = \sum_{i=1}^{K} E\left[y|x = x^{(i)}\right] p_X\left(x^{(i)}\right) = E[E[y|x]].$$

Proof. By definition,
$$E[y] = \sum_{j=1}^{L} y^{(j)} p_Y\left(y^{(j)}\right).$$
But $p_Y(y)$ satisfies eq. (1.2). Substituting the right-hand side of (1.2) into the above equation yields
$$E[y] = \sum_{j=1}^{L} y^{(j)} p_Y\left(y^{(j)}\right) = \sum_{j=1}^{L} y^{(j)} \left\{\sum_{i=1}^{K} p_{Y|X}\left(y^{(j)}|x^{(i)}\right) p_X\left(x^{(i)}\right)\right\} = \sum_{i=1}^{K} \left\{\sum_{j=1}^{L} y^{(j)} p_{Y|X}\left(y^{(j)}|x^{(i)}\right)\right\} p_X\left(x^{(i)}\right),$$
where the last equality just changes the order of summation. We now recognise the inner sum as
$$\sum_{j=1}^{L} y^{(j)} p_{Y|X}\left(y^{(j)}|x^{(i)}\right) = E\left[y|x = x^{(i)}\right],$$
and so
$$E[y] = \sum_{i=1}^{K} E\left[y|x = x^{(i)}\right] p_X\left(x^{(i)}\right) = E[E[y|x]].$$

The LIE provides an alternative way of computing the mean of a random variable from knowledge of the conditional distribution of $y|x$.

Example: Male and Female wages. Maintaining SLR.1–SLR.4 with $\beta_0 = 3$ and $\beta_1 = -0.25$, what is the average wage in the full population? That is, what is $E[y]$? We saw earlier that $E[y|x] = 3 - 0.25x$, and we know that $x$ satisfies (1.3). Thus, using the LIE,
$$E[y] = E[E[y|x]] = E[y|x = 0]\, p_X(x = 0) + E[y|x = 1]\, p_X(x = 1) = 3 \cdot p_X(x = 0) + 2.75 \cdot p_X(x = 1) \approx 2.87.$$

An alternative route to the same result is to first observe
$$E[y] = E[E[y|x]] = E[3 - 0.25x] = 3 - 0.25E[x],$$
and then use that
$$E[x] = 0 \cdot p_X(x = 0) + 1 \cdot p_X(x = 1) = 0.51.$$
Combining these two equations again yields $E[y] \approx 2.87$.
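Both routes through the LIE can be confirmed in a few lines; this is just the arithmetic of the example above, using the 2018 proportions from eq. (1.3).

```python
# Numeric check of the LIE in the wage example.
p_male, p_female = 0.49, 0.51    # p_X(x=0), p_X(x=1) from eq. (1.3)
ey_male, ey_female = 3.0, 2.75   # E[y|x=0], E[y|x=1]

# Route 1: E[y] = sum_i E[y|x = x_i] p_X(x_i)
ey = ey_male * p_male + ey_female * p_female

# Route 2: E[y] = 3 - 0.25 * E[x], with E[x] = 0*p_X(x=0) + 1*p_X(x=1)
ex = 0 * p_male + 1 * p_female
ey_alt = 3 - 0.25 * ex
```

Both `ey` and `ey_alt` equal 2.8725, the exact value behind the rounded 2.87 in the text.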

2 Two implications of SLR.4

SLR.4 (above) has two useful implications,
$$E[u] = 0 \quad \text{and} \quad E[ux] = 0,$$
which we will use in several proofs.

2.1 First implication: E[u] = 0

Lemma 2.1 Under SLR.4, $E[u] = 0$.

Proof. Let us prove that this implication holds using the LIE. If we substitute $u$ for $y$ in Theorem 1.1, we get
$$E[u] = E[E[u|x]].$$
Under SLR.4, we know that $E[u|x] = 0$. Plugging this into the right-hand side of the above equation gives us
$$E[u] = E[0] = 0,$$
where the last equality uses that the expected value of a random variable which is 0 with probability 1 is... 0. And so we have shown that $E[u|x] = 0 \Rightarrow E[u] = 0$.

It is stated in the lectures that $E[u] = 0$ is not a strong restriction. Why is that? Let's illustrate this with an example.

Example: Male and Female wages. Suppose that $(y, x)$ satisfies SLR.1–SLR.3 with $\beta_0 = 3$ and $\beta_1 = -0.25$. However, suppose that $E[u|x] = 4$, so that SLR.4 is violated. Is this a serious issue in terms of interpretation of the model? Not really, since the conditional mean is still constant: First, add and subtract $E[u|x] = 4$ on the right-hand side of the population equation,
$$y = \beta_0 + \beta_1 x + u = 3 + 4 - 0.25x + u - 4. \qquad (2.1)$$
Obviously this does not change the relationship. Now define two new variables:
$$\tilde{\beta}_0 \equiv \beta_0 + 4, \qquad \tilde{u} \equiv u - 4.$$
With these new variables, we can rewrite (2.1) as
$$y = \tilde{\beta}_0 + \beta_1 x + \tilde{u} = 7 - 0.25x + \tilde{u}.$$
Notice that as $\tilde{u} \equiv u - 4$ and $E[u|x] = 4$, we must have that $E[\tilde{u}|x] = 0$. And so we have shown that a population relationship that initially appeared to break SLR.4 can be rewritten in a way that conforms to it. The intercept changes from $\beta_0 = 3$ to $\tilde{\beta}_0 = 7$ when doing so, but this is just a normalisation. More importantly, $\beta_1$ (which tends to be the key parameter of interest) is unchanged.

2.1.1 Reverse implication?

It is important to note that the reverse implication does not hold. That is, $E[u] = 0 \nRightarrow E[u|x] = 0$. We can show this with a simple counter-example. Suppose in the population there are only two possible outcomes:

Outcome A, where $u = 1$ and $x = 5$.

Outcome B, where $u = -1$ and $x = 7$.

Suppose that the two outcomes are equally likely, so $\Pr(A) = \Pr(B) = 0.5$. Notice that as half the time there is an outcome where $u = 1$ and half the time there is an outcome where $u = -1$, we have $E[u] = 0$. But notice that we do not have $E[u|x] = 0$. There are only two possible values $x$ can take: 5 and 7. When $x = 5$, $u$ can only be 1, and so $E[u|x = 5] = 1$. When $x = 7$, $u$ can only be $-1$, and so $E[u|x = 7] = -1$. And so in this example $E[u] = 0$ holds but $E[u|x] \neq 0$, and so we know $E[u] = 0 \nRightarrow E[u|x] = 0$.
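The counter-example can be verified mechanically: the sketch below encodes the two outcomes and computes $E[u]$ and $E[u|x]$ directly from the definitions.

```python
# Numeric check: E[u] = 0 does not imply E[u|x] = 0.
outcomes = [(1, 5), (-1, 7)]   # (u, x) pairs for outcomes A and B
probs = [0.5, 0.5]             # equally likely

# Unconditional mean: E[u] = sum over outcomes of Pr(outcome) * u
e_u = sum(p * u for (u, _), p in zip(outcomes, probs))

def e_u_given_x(x0):
    # E[u|x = x0]: probability-weighted average of u over outcomes with x = x0
    mass = sum(p for (_, x), p in zip(outcomes, probs) if x == x0)
    return sum(p * u for (u, x), p in zip(outcomes, probs) if x == x0) / mass
```

Here `e_u` is 0 while `e_u_given_x(5)` is 1 and `e_u_given_x(7)` is -1, exactly as in the text.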

2.2 Second implication: E[ux] = 0

Lemma 2.2 Under SLR.4, $E[ux] = 0$.

Proof. We again use the LIE. Notice that as $x$ and $u$ are both random variables, $ux$ is also a random variable. If we substitute $ux$ for $y$ in Theorem 1.1 we get
$$E[ux] = E[E[ux|x]].$$
In the above equation we have an $x$ inside an expectation conditional on $x$. This means this $x$ can be treated as a constant and taken outside the conditional expectation: $E[ux|x] = xE[u|x]$. Thus,
$$E[ux] = E[xE[u|x]].$$
Under SLR.4, $E[u|x] = 0$. Plugging this into the above equation gives us
$$E[ux] = E[x \cdot 0] = E[0] = 0,$$
which proves the claimed result.

2.2.1 Reverse implication?

It is again important to note that the reverse implication does not hold. That is, $E[ux] = 0 \nRightarrow E[u|x] = 0$. And we can again show this with a counter-example. Suppose in the population there are only two possible outcomes:

Outcome A, where $u = 3$ and $x = 4$.

Outcome B, where $u = -2$ and $x = 6$.

Suppose that the two outcomes are equally likely, so $\Pr(A) = \Pr(B) = 0.5$. Notice that in outcome A, $ux = 12$, and in outcome B, $ux = -12$. And so, as the outcomes are equally likely, we have $E[ux] = 0$. But notice that we do not have $E[u|x] = 0$. There are only two possible values $x$ can take: 4 and 6. When $x = 4$, $u$ can only be 3, and so $E[u|x = 4] = 3$. When $x = 6$, $u$ can only be $-2$, and so $E[u|x = 6] = -2$. And so in this example $E[ux] = 0$ holds but $E[u|x] \neq 0$, and so we have shown that $E[ux] = 0 \nRightarrow E[u|x] = 0$.
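As with the first counter-example, this one can be checked directly from the definitions:

```python
# Numeric check: E[ux] = 0 while E[u|x] != 0.
outcomes = [(3, 4), (-2, 6)]   # (u, x) pairs for outcomes A and B
probs = [0.5, 0.5]             # equally likely

# E[ux] = 0.5 * (3*4) + 0.5 * (-2*6) = 6 - 6
e_ux = sum(p * u * x for (u, x), p in zip(outcomes, probs))

def e_u_given_x(x0):
    # E[u|x = x0]: probability-weighted average of u over outcomes with x = x0
    mass = sum(p for (_, x), p in zip(outcomes, probs) if x == x0)
    return sum(p * u for (u, x), p in zip(outcomes, probs) if x == x0) / mass
```

Here `e_ux` is 0, yet `e_u_given_x(4)` is 3 and `e_u_given_x(6)` is -2.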

3 Derivation of OLS Estimators

In most applications, the population is not directly observable; however, we can still learn a lot about the SLR model from a random sample from the population, $\{(x_i, y_i) : i = 1, \ldots, n\}$. That is, we have randomly sampled $n$ units from the population and for each unit $i$ ($= 1, \ldots, n$) in the sample, we have observed the value of $(x, y)$, $(x_i, y_i)$. Random sampling implies that we can treat the observations as being independently drawn from the population/distribution of interest. Given that they are drawn from the same distribution, they are also identically distributed. We formally state this assumption below:

SLR.2 $\{x_i, y_i\}$, $i = 1, \ldots, n$, are i.i.d. (independent and identically distributed).

From the sample, we wish to learn about $\beta_0$ and $\beta_1$, where $\beta_1$ measures the impact of $x$ on $y$. Clearly, if there is no variation in $x$, we cannot obtain any reasonable estimate of $\beta_1$. We will formally require that:

SLR.3 $x_i$, $i = 1, \ldots, n$, exhibit variation so that $SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2 > 0$.

Based on the sample, we wish to estimate $\beta_0$ and $\beta_1$ by OLS. There are two equivalent ways to derive the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. The first way is the Method of Moments; the second is the minimisation of the Sum of Squared Residuals.

Neither method requires SLR.1-SLR.2 and SLR.4 to hold: we can always compute the OLS estimators as long as SLR.3 is satisfied. But the estimators are motivated by SLR.1-SLR.2 and SLR.4; under these assumptions the estimators can be given a formal/more rigorous interpretation and will enjoy certain desirable properties.

3.1 Method of Moments

We saw earlier that $E[u|x] = 0$ implies that (i) $E[u] = 0$ and (ii) $E[xu] = 0$. Using that $u = y - \beta_0 - \beta_1 x$, c.f. (0.1), we can write (i)-(ii) as
$$E[u] = E[y - \beta_0 - \beta_1 x] = 0, \qquad (3.1)$$
$$E[xu] = E[x(y - \beta_0 - \beta_1 x)] = 0. \qquad (3.2)$$
That is, given the population (distribution), $\beta_0$ and $\beta_1$ must satisfy (3.1)-(3.2). Do these two equations identify $\beta_0$ and $\beta_1$? If not, the moment conditions are not helpful in learning about $\beta_0$ and $\beta_1$. If $\mathrm{Var}(x) > 0$ the answer is affirmative: One can show that there exists a unique solution to (3.1)-(3.2) in terms of $\beta_0$ and $\beta_1$. This can be shown by following the same steps as in the derivation of the OLS estimators below.

We do not know the population (moments); all we have is the random sample. So let us replace the above population moments by their sample counterparts,
$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0, \qquad (3.3)$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0, \qquad (3.4)$$
where $\sum_{i=1}^{n} x_i := x_1 + \cdots + x_n$ and similarly for any other sequence; in the following we will write $\bar{x} := \sum_{i=1}^{n} x_i / n$ and similarly for other variables.

To find the solutions, $\hat{\beta}_0$ and $\hat{\beta}_1$, to these two equations, we first manipulate (3.3) to obtain
$$0 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_0 - \frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_1 x_i = \bar{y} - \hat{\beta}_0 - \hat{\beta}_1\bar{x}.$$
We conclude that
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}. \qquad (3.5)$$
Next, substitute the right-hand side of (3.5) into (3.4) to obtain:
$$0 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)x_i = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y} + \hat{\beta}_1\bar{x} - \hat{\beta}_1 x_i\right)x_i = \frac{1}{n}\sum_{i=1}^{n}\Big[\{y_i - \bar{y}\} - \hat{\beta}_1\{x_i - \bar{x}\}\Big]x_i = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})x_i - \hat{\beta}_1\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})x_i,$$
and we conclude that
$$\hat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})x_i}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})x_i} = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}. \qquad (3.6)$$
The last equality in (3.6) uses that (left as an exercise)
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})\bar{x} = 0 \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\bar{x} = 0.$$
Plugging $\hat{\beta}_1$ back into (3.5) gives us our estimator of $\beta_0$.

Once we have computed $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain:

Predicted/fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $i = 1, \ldots, n$.

OLS regression line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ for any given value of $x$ (as chosen by us).

Residuals: $\hat{u}_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$.
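The formulas (3.5)-(3.6) can be sketched on simulated data; the data-generating parameters below are arbitrary illustrative choices, not part of the notes. By construction, the resulting residuals satisfy the sample moment conditions (3.3)-(3.4) exactly (up to rounding).

```python
# A sketch of the OLS formulas (3.5)-(3.6) on simulated data.
import random

random.seed(0)
n = 200
beta0, beta1 = 3.0, -0.25                      # illustrative "true" parameters
x = [random.random() for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0, 0.1) for xi in x]

xbar = sum(x) / n
ybar = sum(y) / n

# beta1_hat = sum (y_i - ybar)(x_i - xbar) / sum (x_i - xbar)^2, eq. (3.6)
b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar                          # eq. (3.5)

# The residuals satisfy the sample moment conditions (3.3)-(3.4)
uhat = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
sum_u = sum(uhat)
sum_ux = sum(u * xi for u, xi in zip(uhat, x))
```

`sum_u` and `sum_ux` are zero up to floating-point error, and `b0`, `b1` land near the parameters used to generate the data.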

3.2 Sum of Squared Residuals

An alternative (but equivalent) way to obtain the OLS estimators is by minimizing the sum of squared residuals (SSR). First recall the residual $\hat{u}_i$ defined above; what does this measure? Think about observation $i$ with $(x_i, y_i)$; for $x_i$ the aforementioned OLS regression line predicts a $y$-value of $\hat{y}_i$. How far is $\hat{y}_i$ from the actual $y_i$? This information is given by $\hat{u}_i$! In other words, $\hat{u}_i$ is the vertical distance between $y_i$ and $\hat{y}_i$; it can be either positive or negative depending on whether the actual point lies above or below the OLS regression line.

If we want our OLS regression line to fit the data well, then we must make the distances $\hat{u}_i$ small for all $i$. How do we do that? One way would be to minimize the sum of residuals $\sum_{i=1}^{n}\hat{u}_i$. But this is not very meaningful and leads to rubbish estimators: We can always choose $\hat{\beta}_0$ larger and that way decrease the value of $\sum_{i=1}^{n}\hat{u}_i$ further. In the limit, $\hat{\beta}_0 = \infty$ is the best estimator in terms of this criterion. Instead we choose to minimise the SSR, since this penalises estimates that generate both large negative and large positive residuals.

Formally, for a given set of candidate values for our estimators, $b_0$ and $b_1$, we compute the corresponding SSR,
$$SSR(b_0, b_1) = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2.$$
Given data (so we treat $(y_i, x_i)$, $i = 1, \ldots, n$, as fixed), we then choose our estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, as the minimizers of $SSR(b_0, b_1)$ w.r.t. $(b_0, b_1)$:
$$\min_{b_0, b_1} SSR(b_0, b_1).$$
That is, we seek values $\hat{\beta}_0$ and $\hat{\beta}_1$ so that
$$SSR(\hat{\beta}_0, \hat{\beta}_1) < SSR(b_0, b_1) \quad \text{for all } (b_0, b_1) \neq (\hat{\beta}_0, \hat{\beta}_1). \qquad (3.7)$$
To solve this bivariate minimisation problem, we derive the first order conditions (FOCs) and then find the values $(\hat{\beta}_0, \hat{\beta}_1)$ at which they are both zero:
$$\left.\frac{\partial SSR(b_0, b_1)}{\partial b_0}\right|_{(b_0, b_1) = (\hat{\beta}_0, \hat{\beta}_1)} = 0, \qquad \left.\frac{\partial SSR(b_0, b_1)}{\partial b_1}\right|_{(b_0, b_1) = (\hat{\beta}_0, \hat{\beta}_1)} = 0.$$
Using the chain rule, the two FOCs take the form:

FOC with respect to $\hat{\beta}_0$:
$$0 = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right),$$
which is equivalent to (3.3).

FOC with respect to $\hat{\beta}_1$:
$$0 = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)x_i,$$
which is equivalent to (3.4).

Thus, the solution $(\hat{\beta}_0, \hat{\beta}_1)$ to (3.7) is identical to (3.5) and (3.6).
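The minimisation property (3.7) can be illustrated numerically: the closed-form OLS solution should give a strictly smaller SSR than any perturbed pair of candidate values. The data below are simulated for illustration only.

```python
# Illustration of (3.7): the closed-form OLS estimates minimise SSR(b0, b1).
import random

random.seed(1)
n = 100
x = [random.random() for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]  # illustrative DGP

def ssr(b0, b1):
    # Sum of squared residuals for candidate values (b0, b1)
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

xbar, ybar = sum(x) / n, sum(y) / n
b1_hat = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
b0_hat = ybar - b1_hat * xbar

# SSR at the OLS solution vs. small perturbations in every direction
ssr_min = ssr(b0_hat, b1_hat)
perturbed = [ssr(b0_hat + d0, b1_hat + d1)
             for d0 in (-0.1, 0, 0.1) for d1 in (-0.1, 0, 0.1)
             if (d0, d1) != (0, 0)]
```

Every entry of `perturbed` exceeds `ssr_min`, consistent with (3.7): since SSR is a strictly convex quadratic when $SST_x > 0$, any move away from $(\hat{\beta}_0, \hat{\beta}_1)$ increases it.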

4 Unbiasedness

Every time we draw a new random sample from the population, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be different. To formally assess the variation of the estimators across different samples, we will treat their outcomes as random variables. We then wish to better understand the features of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.

The first question we will ask about their distributions is the following: On average (across all the different samples), will our estimation method get it right? That is, do the expected values of $\hat{\beta}_0$ and $\hat{\beta}_1$ equal the unknown values of $\beta_0$ and $\beta_1$, respectively? The answer is affirmative as long as our regularity conditions are satisfied:

Theorem 4.1 Under SLR.1–SLR.4, $E[\hat{\beta}_0|X_n] = \beta_0$ and $E[\hat{\beta}_1|X_n] = \beta_1$, where $X_n = \{x_1, x_2, \ldots, x_n\}$, and so the OLS estimators are unbiased.

Proof. We here show that $E[\hat{\beta}_1|X_n] = \beta_1$. The proof of $E[\hat{\beta}_0|X_n] = \beta_0$ is left as an exercise. Our proof proceeds in three steps:

1. Obtain a convenient expression for the estimator.

2. Decompose the estimator as population parameter + sampling error.

3. Show that E[sampling error] = 0.

Step 1: Obtain a convenient expression for the estimator

It is convenient to rewrite the expression for $\hat{\beta}_1$ in eq. (3.6) as
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
where we have used that $\sum_{i=1}^{n}(x_i - \bar{x})\bar{y} = 0$. Define $SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$, the total variation in the $x_i$, and write
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{SST_x}. \qquad (4.1)$$
The existence of $\hat{\beta}_1$ follows from SLR.3, which guarantees that $SST_x > 0$.

Step 2: Write the estimator as parameter + sampling error

Replace each $y_i$ in the numerator of (4.1) with $y_i = \beta_0 + \beta_1 x_i + u_i$ (which holds due to SLR.1):
$$\sum_{i=1}^{n}(x_i - \bar{x})y_i = \sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i) = \beta_0\sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1\sum_{i=1}^{n}(x_i - \bar{x})x_i + \sum_{i=1}^{n}(x_i - \bar{x})u_i = 0 + \beta_1\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{n}(x_i - \bar{x})u_i = \beta_1 SST_x + \sum_{i=1}^{n}(x_i - \bar{x})u_i,$$
where we have used $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})x_i = \sum_{i=1}^{n}(x_i - \bar{x})^2$. In total,
$$\hat{\beta}_1 = \frac{\beta_1 SST_x + \sum_{i=1}^{n}(x_i - \bar{x})u_i}{SST_x} = \beta_1 + \underbrace{\frac{\sum_{i=1}^{n}(x_i - \bar{x})u_i}{SST_x}}_{\text{sampling error}}.$$
Now define
$$w_i = \frac{x_i - \bar{x}}{SST_x}, \quad i = 1, \ldots, n, \qquad (4.2)$$
so we can express the sampling error as a linear function of the unobserved errors, $u_1, \ldots, u_n$:
$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i.$$

Step 3: Derive the (conditional) mean of the estimator

We now take conditional expectations w.r.t. $X_n := \{x_1, x_2, \ldots, x_n\}$. That is, we condition on the observed regressors so that the only random components are the $u_i$'s. Under Assumptions SLR.2 and SLR.4, $E[u_i|X_n] = E[u_i|x_i] = 0$ (left as an exercise). Conditional on $X_n$, $w_1, \ldots, w_n$ are constants (not random variables) since each of them depends on $X_n$ alone, c.f. eq. (4.2). This in turn implies
$$E[w_i u_i|X_n] = w_i E[u_i|X_n] = 0.$$
Now we can complete the proof: Conditional on $X_n = \{x_1, x_2, \ldots, x_n\}$,
$$E[\hat{\beta}_1|X_n] = E\left[\beta_1 + \sum_{i=1}^{n} w_i u_i \,\middle|\, X_n\right] = \beta_1 + \sum_{i=1}^{n} E[w_i u_i|X_n] = \beta_1 + \sum_{i=1}^{n} w_i E[u_i|X_n] = \beta_1,$$
where we used two important properties of expected values: (i) the expected value of a sum is the sum of the expected values and (ii) the expected value of a constant, $\beta_1$ in this case, is just the constant itself.

Unbiasedness conditional on any particular draw of the regressors in turn implies that OLS is unbiased unconditionally:
$$E[\hat{\beta}_1] = E\left[E[\hat{\beta}_1|X_n]\right] = E[\beta_1] = \beta_1,$$
where the first equality uses the Law of Iterated Expectations.
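Unbiasedness is a statement about averages across repeated samples, which a small Monte Carlo exercise can make concrete. The sketch below repeatedly redraws a sample and computes $\hat{\beta}_1$; the average of the estimates should sit close to $\beta_1$. All parameters are illustrative assumptions.

```python
# Monte Carlo sketch of unbiasedness: mean of beta1_hat across samples ~ beta1.
import random

random.seed(2)
beta0, beta1 = 3.0, -0.25   # illustrative "true" parameters
n, reps = 50, 2000

def ols_slope():
    # Draw a fresh sample from the DGP and compute beta1_hat via (3.6)
    x = [random.random() for _ in range(n)]
    y = [beta0 + beta1 * xi + random.gauss(0, 0.5) for xi in x]
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

estimates = [ols_slope() for _ in range(reps)]
mean_b1 = sum(estimates) / reps
```

Individual estimates scatter noticeably around $-0.25$, but `mean_b1` is very close to it, which is exactly what Theorem 4.1 predicts.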

5 Variance of OLS estimator

The previous section showed that the OLS estimators on average hit the population values exactly. However, this is not particularly helpful in itself. For a given sample, the observed estimates may be very far below or above the population values. We therefore now wish to understand the degree of variation of the estimators. If, for example, their variance were zero, the OLS estimators would get it right every time, not just on average. Unfortunately, this will not be the case: In almost all applications the OLS estimators will have positive variance. But knowledge of the variances of the estimators will prove important when we develop testing procedures later on.

To derive the variance of the estimators we impose the following additional assumption:

SLR.5 The errors are homoskedastic,
$$\mathrm{Var}(u|x) = \sigma^2 \quad \text{for all values of } x.$$

Theorem 5.1 Under Assumptions SLR.1–SLR.5, and conditional on $X_n = \{x_1, x_2, \ldots, x_n\}$,
$$\mathrm{Var}(\hat{\beta}_1|X_n) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x}, \qquad \mathrm{Var}(\hat{\beta}_0|X_n) = \frac{\sigma^2\, n^{-1}\sum_{i=1}^{n} x_i^2}{SST_x}.$$

Proof. To derive the expression for $\mathrm{Var}(\hat{\beta}_1|X_n)$, recall that
$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i, \qquad w_i = (x_i - \bar{x})/SST_x, \quad i = 1, \ldots, n.$$
Again, conditional on $X_n$, we can treat the $w_i$ as non-random in the derivation. Because $\beta_1$ is a constant, it does not affect $\mathrm{Var}(\hat{\beta}_1|X_n)$.

Now, we need to use the fact that, for uncorrelated random variables, the variance of the sum is the sum of the variances. The $\{u_i : i = 1, 2, \ldots, n\}$ are independent across $i$, and so they are uncorrelated. In fact, by SLR.2 and SLR.4, for $i \neq j$,
$$\mathrm{Cov}(u_i, u_j|X_n) = E[u_i u_j|x_i, x_j] = 0.$$
Moreover, by SLR.2 and SLR.5,
$$\mathrm{Var}(u_i|X_n) = \mathrm{Var}(u_i|x_i) = \sigma^2.$$
The proofs of these two equations are left as an exercise. Therefore,
$$\mathrm{Var}(\hat{\beta}_1|X_n) = \mathrm{Var}\left(\sum_{i=1}^{n} w_i u_i \,\middle|\, X_n\right) = \sum_{i=1}^{n}\mathrm{Var}(w_i u_i|X_n) = \sum_{i=1}^{n} w_i^2\,\mathrm{Var}(u_i|X_n) = \sum_{i=1}^{n} w_i^2\sigma^2 = \sigma^2\sum_{i=1}^{n} w_i^2.$$
Finally, note that
$$\sum_{i=1}^{n} w_i^2 = \sum_{i=1}^{n}\frac{(x_i - \bar{x})^2}{SST_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{SST_x^2} = \frac{SST_x}{SST_x^2} = \frac{1}{SST_x},$$
and so
$$\mathrm{Var}(\hat{\beta}_1|X_n) = \frac{\sigma^2}{SST_x}.$$
The derivation of the conditional variance of $\hat{\beta}_0$ follows the same logic and is left as an exercise.
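Because Theorem 5.1 is conditional on $X_n$, a faithful Monte Carlo check holds the regressors fixed and redraws only the errors; the sample variance of $\hat{\beta}_1$ across replications should then match $\sigma^2 / SST_x$. All numbers below are illustrative.

```python
# Sketch of Theorem 5.1: with regressors held fixed, the Monte Carlo variance
# of beta1_hat should be close to sigma^2 / SST_x.
import random

random.seed(3)
n, reps, sigma = 50, 4000, 0.5
beta0, beta1 = 3.0, -0.25

x = [random.random() for _ in range(n)]   # drawn once, then fixed across reps
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
theory_var = sigma ** 2 / sst_x           # Var(beta1_hat | X_n) from Theorem 5.1

def slope(y):
    ybar = sum(y) / n
    return sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sst_x

draws = []
for _ in range(reps):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    draws.append(slope(y))

mc_mean = sum(draws) / reps
mc_var = sum((b - mc_mean) ** 2 for b in draws) / (reps - 1)
```

`mc_var` comes out within a few percent of `theory_var`, and `mc_mean` is again close to $\beta_1$, illustrating Theorems 4.1 and 5.1 together.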

6 Goodness-of-Fit (R²)

How much of the variation in $y$ is explained by variation in $x$? If we are interested in that question, we are after the coefficient of determination, $R^2$. $R^2$ gives us a sense of the goodness-of-fit of our regression; i.e. it informs us about what fraction of the variation in $y$ is due to variation in $x$.

What do we mean by "variation in $y$"? The squared distances of the $y_i$ from the sample mean $\bar{y}$ inform us about the spread of the $y_i$; their sum is $\sum_{i=1}^{n}(y_i - \bar{y})^2$. Notice that if we divide this expression by $n - 1$ we get the sample variance of the $y_i$. For what follows we will work with $\sum_{i=1}^{n}(y_i - \bar{y})^2$. As $y_i = \hat{u}_i + \hat{y}_i$ we can write:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{u}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}\left[\hat{u}_i^2 + (\hat{y}_i - \bar{y})^2 + 2\hat{u}_i(\hat{y}_i - \bar{y})\right] = \sum_{i=1}^{n}\hat{u}_i^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}\hat{u}_i(\hat{y}_i - \bar{y}).$$
The third term on the RHS of the last equation is 0: Use $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ to write:
$$\sum_{i=1}^{n}\hat{u}_i(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}\hat{u}_i(\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y}) = \hat{\beta}_0\sum_{i=1}^{n}\hat{u}_i + \hat{\beta}_1\sum_{i=1}^{n}\hat{u}_i x_i - \bar{y}\sum_{i=1}^{n}\hat{u}_i = 0,$$
where the last equality follows from our sample moment conditions $\sum_{i=1}^{n}\hat{u}_i = 0$ and $\sum_{i=1}^{n}\hat{u}_i x_i = 0$. We conclude that
$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}\hat{u}_i^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSE},$$
where:

SST: Total sum of squares (variation in $y$)

SSR: Sum of squared residuals (unexplained variation in $y$)

SSE: Explained sum of squares (variation in $y$ explained by $x$)

Dividing across by SST we get:
$$1 = \frac{SSR}{SST} + \frac{SSE}{SST} \iff \frac{SSE}{SST} = 1 - \frac{SSR}{SST} \iff R^2 \equiv \frac{SSE}{SST} = 1 - \frac{SSR}{SST}.$$
$R^2$ is the fraction of sample variation in $y$ that is explained by $x$.
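The decomposition SST = SSE + SSR, and hence the two equivalent expressions for $R^2$, can be verified on simulated data (the data-generating values below are illustrative):

```python
# Numeric check of SST = SSE + SSR and R^2 = SSE/SST = 1 - SSR/SST.
import random

random.seed(4)
n = 100
x = [random.random() for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 0.3) for xi in x]  # illustrative DGP

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]            # fitted values
uhat = [yi - yh for yi, yh in zip(y, yhat)]  # residuals

sst = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
ssr = sum(u ** 2 for u in uhat)              # sum of squared residuals
sse = sum((yh - ybar) ** 2 for yh in yhat)   # explained sum of squares
r2 = sse / sst
```

Up to floating-point error, `sst == sse + ssr` and `r2 == 1 - ssr / sst`, with `r2` lying between 0 and 1.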

Part II

Multiple Linear Regression (MLR)

The multiple linear regression model in the population takes the form:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,$$
where:

$y$ is the dependent variable

$\beta_0$ is the constant

$x_j$, $j \in \{1, 2, \ldots, k\}$, are the independent variables

$\beta_j$ is the coefficient on $x_j$

$u$ is the error term

Advantages of controlling for more variables:

Zero conditional mean assumption more reasonable

Closer to estimating causal/ceteris paribus effects (everything else equal)

More general functional form

Better prediction of $y$ / better fit of the model

7 Assumptions

Similar to the SLR model, we need to impose regularity conditions on the MLR model and the sampling process in order for the regression coefficients to have a ceteris paribus interpretation and for OLS to deliver valid estimators of them:

MLR.1 (Linearity) The model is linear in parameters,
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u.$$
As in the SLR case, this does not rule out non-linear transformations of the variables. Notice that
$$y = \beta_0 + \beta_1 \ln x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$
is also linear in parameters. For $z = \ln x_1$ the above model can be rewritten as
$$y = \beta_0 + \beta_1 z + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$
which is clearly linear.

MLR.2 (Random sample) The sample is a random draw from the population. The data in the sample are $\{(x_{i1}, \ldots, x_{ik}, y_i) : i = 1, \ldots, n\}$, where $\{x_{i1}, \ldots, x_{ik}, y_i\}$ are i.i.d. (independent and identically distributed).

MLR.3 (Full rank/no perfect collinearity) No $x_j$ is constant ($x_j \neq c$) and there is no exact linear relationship among the $x_j$'s in the population, i.e. no $x_j$ can be written as $\sum_{\ell \neq j} \alpha_\ell x_\ell$, where the $\alpha_\ell$ are constants and $\ell = 1, \ldots, j - 1, j + 1, \ldots, k$.

MLR.4 (Mean-independence) Conditional on $x_1, \ldots, x_k$, the mean of $u$ is 0, i.e. $E[u|x_1, \ldots, x_k] = 0$. Notice that this assumption implies that $E[u] = 0$ and $E[x_j u] = 0$ for all $j$.

MLR.5 (Homoskedasticity) The variance of $u$ is constant and independent of the regressors, i.e. $E[u^2|x_1, \ldots, x_k] = \sigma^2$.

For the remaining part of this section, we will focus on the case with two regressors,
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i \qquad (7.1)$$
(i.e. with two independent variables only). All the results derived herein can be easily extended to accommodate the general case of $k$ independent variables, but the notation and mathematical derivations get more tedious.

8 OLS estimators

Let $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ be the OLS estimators for $\beta_0$, $\beta_1$, and $\beta_2$, respectively, in (7.1). Once the OLS estimators have been computed, we can obtain predicted values and residuals as before,
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}, \qquad \hat{u}_i = y_i - \hat{y}_i.$$
Given these, we have
$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \hat{u}_i.$$
Notice that these expressions resemble the ones derived for fitted values and residuals in the context of the Simple Linear Regression model. How can we interpret $\hat{\beta}_1$ in this context? $\hat{\beta}_1$ measures the ceteris-paribus change in $y$ given a one-unit change in $x_1$; put differently, it measures the change in $y$ given a one-unit change in $x_1$, holding $x_2$ fixed. Obviously, $\hat{\beta}_2$ has an analogous interpretation.

How do we actually obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ in the context of the Multiple Linear Regression model? There are three equivalent ways: the minimization of the sum of squared residuals, the partialling-out method, and, finally, a method which makes use of a number of moment conditions.

8.1 Sum of squared residuals

We can obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ by minimizing the SSR of the model. The SSR now takes the form
$$SSR(b_0, b_1, b_2) = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_{i1} - b_2 x_{i2})^2,$$
where $b_0$, $b_1$ and $b_2$ are any given candidate values for our final estimators. We then define the OLS estimators as the solution to the following minimisation problem,
$$\min_{b_0, b_1, b_2} SSR(b_0, b_1, b_2).$$
The estimators can also be characterised as the solutions to the FOCs. These take the form
$$\left.\frac{\partial SSR(b_0, b_1, b_2)}{\partial b_0}\right|_{(b_0,b_1,b_2)=(\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2)} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2}\right) = 0, \qquad (8.1)$$
$$\left.\frac{\partial SSR(b_0, b_1, b_2)}{\partial b_1}\right|_{(b_0,b_1,b_2)=(\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2)} = -2\sum_{i=1}^{n} x_{i1}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2}\right) = 0, \qquad (8.2)$$
$$\left.\frac{\partial SSR(b_0, b_1, b_2)}{\partial b_2}\right|_{(b_0,b_1,b_2)=(\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2)} = -2\sum_{i=1}^{n} x_{i2}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2}\right) = 0. \qquad (8.3)$$
These are three equations with three unknowns; solving them simultaneously we get $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ (you do not have to remember the following formulae):
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2 \sum_{i=1}^{n}(y_i - \bar{y})(x_{i1} - \bar{x}_1) - \sum_{i=1}^{n}(y_i - \bar{y})(x_{i2} - \bar{x}_2) \sum_{i=1}^{n}(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2 - \left(\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)\right)^2}, \qquad (8.4)$$
$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n}(y_i - \bar{y})(x_{i2} - \bar{x}_2) - \sum_{i=1}^{n}(y_i - \bar{y})(x_{i1} - \bar{x}_1) \sum_{i=1}^{n}(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2 - \left(\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)\right)^2},$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1 - \hat{\beta}_2\bar{x}_2,$$
where $\bar{x}_j = n^{-1}\sum_{i=1}^{n} x_{ij}$ and $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i$.

8.2 The partialling-out method

An alternative, and perhaps more intuitive, way to characterise the OLS estimators $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ is the following. First, we estimate a Simple Linear Regression of $x_1$ on $x_2$ (and any other independent variables in the context of a general $k$-variable model):
$$x_{i1} = \delta_0 + \delta_1 x_{i2} + r_{i1},$$
where $r_{i1}$ is an error term. We use the Simple Linear Regression tools to get the OLS estimates $\hat{\delta}_0$ and $\hat{\delta}_1$, which we then use to compute the residual
$$\hat{r}_{i1} = x_{i1} - \hat{\delta}_0 - \hat{\delta}_1 x_{i2}.$$
How should we interpret $\hat{r}_{i1}$? It is the variation in $x_{i1}$ that is left after removing the variation explained by $x_{i2}$: we have partialled out the part of the variation in $x_{i1}$ which was explained by $x_{i2}$; the remaining variation, as captured by $\hat{r}_{i1}$, contains the "clean" signal in $x_{i1}$.

Properties of the residual $\hat{r}_{i1}$: The usual properties apply to $\hat{r}_{i1}$:

1. Residuals sum to zero: $\sum_{i=1}^{n}\hat{r}_{i1} = 0$.

2. Residuals are orthogonal to regressors: $\sum_{i=1}^{n}\hat{r}_{i1}x_{i2} = 0$.

3. The sum of products between residual and dependent variable equals the sum of squared residuals: $\sum_{i=1}^{n}\hat{r}_{i1}x_{i1} = \sum_{i=1}^{n}\hat{r}_{i1}^2$.

The first two properties are well-known features of OLS in the SLR. The third property is a consequence of the first two:
$$\sum_{i=1}^{n}\hat{r}_{i1}x_{i1} = \sum_{i=1}^{n}\hat{r}_{i1}\left(x_{i1} - \hat{\delta}_0 - \hat{\delta}_1 x_{i2} + \hat{\delta}_0 + \hat{\delta}_1 x_{i2}\right) = \sum_{i=1}^{n}\hat{r}_{i1}\left(\hat{r}_{i1} + \hat{\delta}_0 + \hat{\delta}_1 x_{i2}\right) = \sum_{i=1}^{n}\hat{r}_{i1}^2 + \hat{\delta}_0\underbrace{\sum_{i=1}^{n}\hat{r}_{i1}}_{=0} + \hat{\delta}_1\underbrace{\sum_{i=1}^{n}\hat{r}_{i1}x_{i2}}_{=0} = \sum_{i=1}^{n}\hat{r}_{i1}^2.$$

After having obtained $\hat{r}_{i1}$, we regress $y_i$ on $\hat{r}_{i1}$ and a constant:
$$y_i = \gamma_0 + \gamma_1\hat{r}_{i1} + v_i.$$
This is another Simple Linear Regression model. The OLS estimate of the slope coefficient $\gamma_1$ is
$$\hat{\gamma}_1 = \frac{\sum_{i=1}^{n}\left(\hat{r}_{i1} - \bar{\hat{r}}_1\right)y_i}{\sum_{i=1}^{n}\left(\hat{r}_{i1} - \bar{\hat{r}}_1\right)^2} = \frac{\sum_{i=1}^{n}\hat{r}_{i1}y_i}{\sum_{i=1}^{n}\hat{r}_{i1}^2}, \qquad (8.5)$$
where the second equality uses that $\bar{\hat{r}}_1 = 0$. We will not provide a proof of this, but the above expression is numerically identical to (8.4). That is, the partialling-out algorithm leads to the same estimate as minimizing the SSR.

The intuition for this result is the following: The expression in (8.5) measures the change in $y$ which is due to a one-unit change in $x_1$ after $x_2$ has been partialled out; put differently, it measures the change in $y$ which is due to a one-unit change in $x_1$ holding $x_2$ fixed. But that is exactly what $\hat{\beta}_1$ is measuring too (return to the discussion at the beginning of this section). Formally, replacing $\hat{r}_{i1}$ in (8.5) with an analytical expression consisting of $x_{i1}$ and $x_{i2}$ only (obtained from the regression in the first stage), one can see that the expression in (8.5) is exactly the same as $\hat{\beta}_1$ from the minimization of squared residuals.

Going back to (7.1), how can we obtain $\hat{\beta}_2$? Using the same two-step approach one can show that
$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n}\hat{r}_{i2}y_i}{\sum_{i=1}^{n}\hat{r}_{i2}^2},$$
where the $\hat{r}_{i2}$'s are the residuals from first regressing $x_2$ onto $x_1$,
$$\hat{r}_{i2} = x_{i2} - \hat{\delta}_0 - \hat{\delta}_1 x_{i1}, \quad i = 1, \ldots, n.$$
More generally, in a model with $k$ independent variables,
$$\hat{\beta}_j = \frac{\sum_{i=1}^{n}\hat{r}_{ij}y_i}{\sum_{i=1}^{n}\hat{r}_{ij}^2}, \quad j = 1, 2, \ldots, k,$$
where $\hat{r}_{ij}$ is the OLS residual from a regression of $x_j$ on the other explanatory variables and a constant.

Finally, the estimated constant is
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1 - \hat{\beta}_2\bar{x}_2,$$
where $\bar{x}_j = n^{-1}\sum_{i=1}^{n} x_{ij}$ and $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i$. In the general case with $k$ variables, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1 - \cdots - \hat{\beta}_k\bar{x}_k$.
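The numerical equivalence between the closed form (8.4) and the partialling-out route can be sketched on simulated data with two deliberately correlated regressors (all data-generating values below are illustrative):

```python
# Sketch: partialling out reproduces the closed-form (8.4) slope on x1.
import random

random.seed(5)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]   # x1, x2 correlated on purpose
y = [1.0 + 2.0 * a - 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def mean(v):
    return sum(v) / len(v)

x1b, x2b, yb = mean(x1), mean(x2), mean(y)
s11 = sum((a - x1b) ** 2 for a in x1)
s22 = sum((b - x2b) ** 2 for b in x2)
s12 = sum((a - x1b) * (b - x2b) for a, b in zip(x1, x2))
sy1 = sum((yi - yb) * (a - x1b) for yi, a in zip(y, x1))
sy2 = sum((yi - yb) * (b - x2b) for yi, b in zip(y, x2))

# Route 1: closed form (8.4)
b1_formula = (s22 * sy1 - sy2 * s12) / (s11 * s22 - s12 ** 2)

# Route 2: partial out x2 from x1, then regress y on the residuals, eq. (8.5)
d1 = s12 / s22                 # SLR slope of x1 on x2
d0 = x1b - d1 * x2b
r1 = [a - d0 - d1 * b for a, b in zip(x1, x2)]
b1_partial = sum(r * yi for r, yi in zip(r1, y)) / sum(r ** 2 for r in r1)
```

The two routes agree up to floating-point error, illustrating that (8.5) is numerically identical to (8.4).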

8.3 Moment conditions

As in the context of the Simple Linear Regression, one can use the sample counterparts of the three moment conditions (or $k + 1$ in the case of a general $k$-variable model) that follow from MLR.4:
$$E[u] = 0 \quad \text{and} \quad E[x_j u] = 0, \quad j = 1, 2.$$
The corresponding sample moment conditions are equivalent to the FOCs for the least-squares problem as stated in (8.1)-(8.3).

9 Unbiasedness

Theorem 9.1 Under MLR.1–MLR.5, $E[\hat\beta_j \mid X_n] = \beta_j$, $j = 0, 1, \dots, k$.

Proof. We only give a proof for $\hat\beta_1$ in the case of $k = 2$. The proofs for the remaining estimators are left as an exercise. The derivation proceeds along the same lines as for the SLR case. First, we obtain a convenient expression for the OLS estimator:

$$\hat\beta_1 = \frac{\sum_{i=1}^n \hat r_{i1} y_i}{\sum_{i=1}^n \hat r_{i1}^2} \overset{\text{MLR.1}}{=} \frac{\sum_{i=1}^n \hat r_{i1}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i\right)}{\sum_{i=1}^n \hat r_{i1}^2}$$

$$= \frac{\sum_{i=1}^n \hat r_{i1}\beta_0}{\sum_{i=1}^n \hat r_{i1}^2} + \frac{\sum_{i=1}^n \hat r_{i1}\beta_1 x_{i1}}{\sum_{i=1}^n \hat r_{i1}^2} + \frac{\sum_{i=1}^n \hat r_{i1}\beta_2 x_{i2}}{\sum_{i=1}^n \hat r_{i1}^2} + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}$$

$$= \beta_0 \underbrace{\frac{\sum_{i=1}^n \hat r_{i1}}{\sum_{i=1}^n \hat r_{i1}^2}}_{=0} + \beta_1 \underbrace{\frac{\sum_{i=1}^n \hat r_{i1} x_{i1}}{\sum_{i=1}^n \hat r_{i1}^2}}_{=1} + \beta_2 \underbrace{\frac{\sum_{i=1}^n \hat r_{i1} x_{i2}}{\sum_{i=1}^n \hat r_{i1}^2}}_{=0} + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}$$

$$= \beta_1 + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}.$$

Defining

$$w_i = \frac{\hat r_{i1}}{\sum_{j=1}^n \hat r_{j1}^2}, \quad i = 1, \dots, n, \qquad (9.1)$$


we can express the OLS estimator as

$$\hat\beta_1 = \beta_1 + \sum_{i=1}^n w_i u_i, \qquad (9.2)$$

where the second term is the sampling error. Thus, as with the SLR model, we can express the sampling error of the OLS estimator as a weighted sum of the errors $u_1, \dots, u_n$. And the weights are again functions of the observed regressors alone, $X_n = \{(x_{i1}, x_{i2}) : i = 1, \dots, n\}$, cf. (9.1). Thus, taking expectations conditional on $X_n$,

$$E[\hat\beta_1 \mid X_n] = \beta_1 + E\left[\sum_{i=1}^n w_i u_i \,\Big|\, X_n\right] = \beta_1 + \sum_{i=1}^n w_i E[u_i \mid X_n] = \beta_1 + \sum_{i=1}^n w_i E[u_i \mid x_{i1}, x_{i2}] = \beta_1,$$

where, as in the SLR, we have used that $E[u_i \mid X_n] = E[u_i \mid x_{i1}, x_{i2}]$ under MLR.2. Unbiasedness ($E[\hat\beta_1] = \beta_1$) now follows from the Law of Iterated Expectations. By the same arguments, it can be shown that $\hat\beta_2$ is an unbiased estimator of $\beta_2$.
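The unbiasedness result lends itself to a quick Monte Carlo check. The sketch below (my illustration, assuming NumPy; all parameter values are hypothetical) holds the regressors fixed, redraws the errors many times, and averages the resulting slope estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
beta0, beta1, beta2 = 1.0, 2.0, -1.0

x1 = rng.normal(size=n)                      # regressors held fixed across draws
x2 = 0.3 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

estimates = np.empty(reps)
for r in range(reps):
    u = rng.normal(size=n)                   # satisfies E[u | x1, x2] = 0
    y = beta0 + beta1 * x1 + beta2 * x2 + u
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(estimates.mean())   # close to beta1 = 2.0
```

The average of the estimates across replications sits very close to the true slope, as Theorem 9.1 predicts.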

10 Variance of OLS estimator

Let us first derive the variance of the estimator under MLR.5. We use eq. (9.2) and the fact that $w_1, \dots, w_n$ are all functions of $X_n$ to obtain

$$\mathrm{Var}\left(\hat\beta_1 \mid X_n\right) = \mathrm{Var}\left(\beta_1 + \sum_{i=1}^n w_i u_i \,\Big|\, X_n\right) \qquad (10.1)$$

$$= \mathrm{Var}\left(\sum_{i=1}^n w_i u_i \,\Big|\, X_n\right) = \sum_{i=1}^n w_i^2 \mathrm{Var}(u_i \mid X_n) = \sum_{i=1}^n w_i^2 \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma^2 \sum_{i=1}^n w_i^2 = \frac{\sigma^2}{SST_1\left(1 - R_1^2\right)},$$

where we have used that $\mathrm{Cov}(u_i, u_j \mid X_n) = 0$ (MLR.2) and $\mathrm{Var}(u_i \mid X_n) = \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma^2$ (MLR.2+MLR.5). Finally, notice that $\sum_{i=1}^n \hat r_{i1}^2 = SST_1(1 - R_1^2)$ because

$$R_1^2 = 1 - \frac{\sum_{i=1}^n \hat r_{i1}^2}{\sum_{i=1}^n (x_{i1} - \bar x_1)^2} = 1 - \frac{\sum_{i=1}^n \hat r_{i1}^2}{SST_1}.$$
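The variance formula can also be checked by simulation. The sketch below (my illustration, assuming NumPy; the design and parameter values are hypothetical) compares the Monte Carlo variance of $\hat\beta_1$ under homoskedastic errors with $\sigma^2 / \sum_{i} \hat r_{i1}^2 = \sigma^2/(SST_1(1-R_1^2))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, sigma = 100, 4000, 1.5
x1 = rng.normal(size=n)                      # fixed design across replications
x2 = 0.6 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Theoretical conditional variance: sigma^2 / (SST_1 (1 - R_1^2))
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
var_theory = sigma ** 2 / np.sum(r1 ** 2)

draws = np.empty(reps)
for r in range(reps):
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + sigma * rng.normal(size=n)
    draws[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(draws.var(), var_theory)   # close for homoskedastic errors
```

The simulated variance matches the formula closely, as it should when MLR.5 holds.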

11 OLS is BLUE

There are many unbiased estimators of $\beta_j$, $j \in \{0, 1, 2, \dots, k\}$. A theorem usually referred to as the Gauss-Markov theorem states that under assumptions MLR.1 through MLR.5, $\hat\beta_j$ is the best linear unbiased estimator (BLUE) of $\beta_j$, $j \in \{0, 1, 2, \dots, k\}$. What does that mean?

First, we say that a given estimator of $\beta_j$, say $\tilde\beta_j$, is a linear estimator if it can be written as

$$\tilde\beta_j = \tilde w_1 y_1 + \cdots + \tilde w_n y_n,$$

where $\tilde w_1, \dots, \tilde w_n$ are functions of the regressors alone, $X_n$. We know that the OLS estimator can be written as

$$\hat\beta_j = w_1 y_1 + \cdots + w_n y_n,$$

where, with $\hat r_{1j}, \dots, \hat r_{nj}$ being the residuals from the first-stage regression,

$$w_i = \frac{\hat r_{ij}}{\sum_{k=1}^n \hat r_{kj}^2}, \quad i = 1, \dots, n. \qquad (11.1)$$

So OLS is one particular example of a linear estimator. Next, we narrow the class of linear estimators to only contain the ones that are unbiased. That is, we require the linear estimator to satisfy $E[\tilde\beta_j] = \beta_j$. Again, the OLS estimator has this property.

Now, within this class of competing estimators, it can be shown that the OLS estimator has the smallest variance,

$$\mathrm{Var}(\hat\beta_j) \le \mathrm{Var}(\tilde\beta_j) \quad \text{for any alternative linear unbiased estimator } \tilde\beta_j.$$

We say that the OLS estimator is the Best (= smallest variance) Linear Unbiased Estimator.

Part III

Asymptotic Theory of OLS

In this part we will analyze the asymptotic properties of the OLS estimators. That is, we are interested in analyzing the distribution of $\hat\beta_j$ along a sequence of ever-larger samples so that $n \to \infty$. To this end, we first remind ourselves of some important concepts regarding convergence of distributions and two key results used in asymptotic theory.


12 The Law of Large Numbers and consistency

In the following, we consider a random sample $y_i$, $i = 1, \dots, n$. This sample could come from a population that satisfies the SLR model, but this is not required for fundamental results such as the LLN and CLT presented below.

The Law of Large Numbers (LLN) states that, under general conditions, the average of a random sample of size $n$ will be near the population mean with high probability when $n$ is large.

Theorem 12.1 (Law of Large Numbers) Let $y_1, \dots, y_n$ be i.i.d. random variables with mean $\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$. The sample average $\bar y = \sum_{i=1}^n y_i / n$ then satisfies:

$$\text{For any } \varepsilon > 0: \quad \Pr\left(|\bar y - \mu| > \varepsilon\right) \to 0 \text{ as } n \to \infty,$$

and we write

$$\bar y \to_p \mu.$$

That $\bar y$ is a "good" estimator of $\mu$ is not surprising. Recall that

$$E[\bar y] = \mu \quad \text{and} \quad \mathrm{Var}(\bar y) = \frac{\sigma^2}{n}. \qquad (12.1)$$

In particular, $\mathrm{Var}(\bar y) \to 0$ as $n \to \infty$, and a random variable with zero variance must necessarily have all its probability mass at its mean, in this case $E[\bar y] = \mu$.

The LLN is illustrated in Figure 1, where the distribution of $\bar y$ is shown for four different sample sizes in the case where $y$ is a Bernoulli random variable,

$$\Pr(y = 1) = \mu \quad \text{and} \quad \Pr(y = 0) = 1 - \mu,$$

with $\mu = 0.78$. In particular, $E[y] = \mu = 0.78$. For small $n$, the distribution of $\bar y$ is quite spread out, but eventually, as the sample size grows, most of the probability mass of $\bar y$'s distribution is situated around $\mu$.

Thus, treating $\bar y$ as an estimator of the population mean, we see that it is a consistent estimator:

Definition 12.2 Consider an estimator $\hat\theta$ of a population parameter $\theta$. We say that $\hat\theta$ is a consistent estimator of $\theta$ if

$$\text{For any } \varepsilon > 0: \quad \Pr\left(|\hat\theta - \theta| > \varepsilon\right) \to 0 \text{ as } n \to \infty,$$

and we write $\hat\theta \to_p \theta$.

Finally, we observe that the LLN can be extended to the following result: any population moment of a distribution can be consistently estimated by the corresponding sample moment:

Corollary 12.3 Let $y_1, \dots, y_n$ be i.i.d. random variables. For a given transformation $f(y)$, suppose that $E\left[f^2(y)\right]$ exists. Then $\mu_f = E[f(y)]$ is consistently estimated by the corresponding sample moment $\hat\mu_f = \sum_{i=1}^n f(y_i)/n$,

$$\hat\mu_f \to_p \mu_f.$$

Proof. Since $y_1, \dots, y_n$ are i.i.d., so are $f(y_1), \dots, f(y_n)$. We can therefore apply the LLN to the latter sequence, which yields the claim.


Figure 1: Sampling distribution of the sample average of n Bernoulli random variables.

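The Bernoulli example in Figure 1 is easy to replicate numerically. The sketch below (my illustration, assuming NumPy) estimates $\Pr(|\bar y - \mu| > \varepsilon)$ by simulation for growing $n$, with $\mu = 0.78$ as in the figure; the choice $\varepsilon = 0.05$ is mine.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 0.78, 0.05, 2000             # Bernoulli mean as in Figure 1

# Estimate Pr(|ybar - mu| > eps) by simulation for growing n
probs = {}
for n in [10, 100, 1000]:
    ybars = rng.binomial(1, mu, size=(reps, n)).mean(axis=1)
    probs[n] = float(np.mean(np.abs(ybars - mu) > eps))
print(probs)   # probabilities fall towards 0 as n grows
```

The estimated probabilities shrink towards zero as $n$ grows, which is exactly the convergence statement in Theorem 12.1.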

13 The Central Limit Theorem and Asymptotic Normality

The central limit theorem (CLT) says that the distribution of $\bar y$ is well-approximated by a normal distribution when $n$ is large. Recall eq. (12.1), which gives the first two moments of the distribution of $\bar y$ for any given sample size $n$. The CLT states that as $n$ increases the whole distribution of $\bar y$ will be close to a normal distribution.

The result is formulated in terms of the following standardized version of the sample mean,

$$\frac{\bar y - E[\bar y]}{\mathrm{std}(\bar y)} = \frac{\bar y - \mu}{\sigma/\sqrt n}. \qquad (13.1)$$

In particular,

$$E\left[\frac{\bar y - \mu}{\sigma/\sqrt n}\right] = 0 \quad \text{and} \quad \mathrm{Var}\left(\frac{\bar y - \mu}{\sigma/\sqrt n}\right) = 1.$$

The standardisation is introduced in order to obtain a "stable" distribution: we know from the LLN that $\bar y \to_p \mu$, and so the non-standardised estimator $\bar y$ has a "degenerate" asymptotic distribution with all probability mass of $\bar y$ being situated at $\mu$ in large samples, cf. Figure 1.

The standardised version in (13.1), on the other hand, has a non-degenerate asymptotic distribution:

Theorem 13.1 (Central Limit Theorem) Let $y_1, \dots, y_n$ be i.i.d. random variables with mean $\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$. The sample average $\bar y = \sum_{i=1}^n y_i/n$ then satisfies

$$\text{For all } -\infty < z < +\infty: \quad \Pr\left(\frac{\bar y - \mu}{\sigma/\sqrt n} \le z\right) \to \Phi(z) \text{ as } n \to \infty,$$

where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution,

$$\Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy,$$

and we write

$$\frac{\bar y - \mu}{\sigma/\sqrt n} \sim^a N(0, 1).$$

In Figure 2, the CLT is illustrated for the same example as in Figure 1: by recentering and rescaling $\bar y$, we obtain a "stable" sequence of distributions that do not degenerate as the sample size grows. Moreover, the sequence converges towards a standard normal, as shown by the CLT. Note here that even though the Bernoulli distribution is discrete, and so very far from a normal distribution, a sample average from this distribution looks normal with only $n = 100$ observations. So the CLT approximation works well even in moderate samples.
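The quality of the normal approximation at $n = 100$ can be probed directly. The sketch below (my illustration, assuming NumPy) simulates standardised Bernoulli sample means as in eq. (13.1) and compares $\Pr(z \le 1)$ with the standard normal value $\Phi(1) \approx 0.8413$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = 0.78
sigma = np.sqrt(mu * (1 - mu))               # sd of one Bernoulli(0.78) draw
n, reps = 100, 20000

# Standardised sample means, as in eq. (13.1)
ybars = rng.binomial(1, mu, size=(reps, n)).mean(axis=1)
z = (ybars - mu) / (sigma / np.sqrt(n))

# Compare Pr(z <= 1) with Phi(1) = 0.8413
print(np.mean(z <= 1.0))
```

Even with a discrete underlying distribution, the simulated probability lands near the normal benchmark, matching the visual impression from Figure 2.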

14 Slutsky's theorem and the continuous mapping theorem

The LLN shows that $\bar y$ is a consistent estimator of $\mu = E[y]$, and the CLT shows that the standardised statistic $\sqrt n\,(\bar y - \mu)/\sigma \sim^a N(0,1)$. Suppose we wish to test the null

$$H_0: \mu = 0.$$


Figure 2: Sampling distribution of $(\bar y - \mu)/(\sigma/\sqrt n)$, where $\bar y$ is the average of $n$ Bernoulli random variables.


We wish to do so using the corresponding t statistic,

$$\hat t = \frac{\bar y}{\mathrm{se}(\bar y)} = \frac{\bar y}{\hat\sigma/\sqrt n},$$

where

$$\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2.$$

In order to use it for testing, we want to derive the asymptotic distribution of $\hat t$. To this end, first observe that

$$\hat t = \frac{\sigma}{\hat\sigma} \cdot \frac{\sqrt n\, \bar y}{\sigma}, \quad \text{where } \frac{\sqrt n\, \bar y}{\sigma} \sim^a N(0,1) \text{ under } H_0. \qquad (14.1)$$

But what about the extra term $\sigma/\hat\sigma$? This is also a random variable, and so you would think this would affect the asymptotic distribution of $\hat t$, but how? The following theorem clarifies this:

Theorem 14.1 (Slutsky's Theorem + Continuous Mapping Theorem) Let $\hat a$ and $\hat b$ be two statistics/estimators satisfying

$$\hat a \to_p a \quad \text{and} \quad \hat b \sim^a N(0,1).$$

Then

$$\text{(i) } \hat a + \hat b \sim^a N(a, 1), \quad \text{and (ii) } \hat a \hat b \sim^a N\left(0, a^2\right). \qquad (14.2)$$

Moreover, for any continuous function $g$,

$$\text{(iii) } g(\hat a) \to_p g(a). \qquad (14.3)$$

This result can, for example, be used to derive the asymptotic distribution of $\hat t$:

Corollary 14.2 Let $y_1, \dots, y_n$ be i.i.d. random variables with mean $\mu = E[y]$ and variance $\sigma^2 = \mathrm{Var}(y)$. Under $H_0$, the t-statistic associated with the sample mean $\bar y$ then satisfies

$$\hat t = \frac{\bar y}{\mathrm{se}(\bar y)} = \frac{\bar y}{\hat\sigma/\sqrt n} \sim^a N(0,1).$$

Thus, replacing $\sigma$ by a consistent estimator $\hat\sigma$ does not affect the asymptotic distribution of the t statistic.

Proof. Define

$$\hat a = \frac{\sigma}{\hat\sigma}, \quad \hat b = \frac{\sqrt n\, \bar y}{\sigma},$$

so we can write eq. (14.1) as

$$\hat t = \hat a \hat b.$$

Next, by the LLN, observe that $\hat\sigma^2 \to_p \sigma^2$. This in turn implies that $\hat\sigma/\sigma = \sqrt{\hat\sigma^2/\sigma^2} \to_p 1$ by Theorem 14.1(iii), and so

$$\hat a \to_p 1, \quad \hat b \sim^a N(0,1).$$

Finally, use Theorem 14.1(ii) to conclude that

$$\hat t \sim^a N(0,1).$$

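Corollary 14.2 holds for non-normal data too, which a simulation makes tangible. The sketch below (my illustration, assuming NumPy; the exponential error distribution is a hypothetical choice) checks that $|\hat t| > 1.96$ rejects close to the nominal 5% under $H_0$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 5000

# Under H0: mu = 0, the t statistic with estimated sigma-hat is still
# asymptotically N(0,1) by Corollary 14.2
tstats = np.empty(reps)
for r in range(reps):
    y = rng.exponential(1.0, size=n) - 1.0   # mean 0, skewed, non-normal
    tstats[r] = y.mean() / (y.std(ddof=1) / np.sqrt(n))

# Rejection rate of |t| > 1.96 should be near the nominal 5%
print(np.mean(np.abs(tstats) > 1.96))
```

Despite the skewed errors, the rejection rate is close to 5%, illustrating that estimating $\sigma$ does not disturb the asymptotic distribution.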

15 Consistency and Asymptotic Normality of OLS

We here only cover the theory for the SLR slope estimator,

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad (15.1)$$

to keep the notation and the arguments simple. The theory extends to the MLR model with minor modifications. In the proofs, we will use the following notation to save space: for any two statistics $\hat a$ and $\hat b$, we write

$$\hat a \simeq \hat b \quad \text{if } |\hat a - \hat b| \to_p 0.$$

Thus, "$\simeq$" means that the difference between $\hat a$ and $\hat b$ is asymptotically negligible.

Theorem 15.1 Under SLR.1–SLR.4, $\hat\beta_1 \to_p \beta_1$.

Proof. By the LLN, we have $\bar x \to_p E[x]$. This in turn implies

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar x)\, u_i}{\sum_{i=1}^n (x_i - \bar x)^2} \simeq \beta_1 + \frac{\sum_{i=1}^n (x_i - E[x])\, u_i}{\sum_{i=1}^n (x_i - E[x])^2}.$$

Two more applications of the LLN yield

$$\hat\sigma_{x,u} := \frac{1}{n}\sum_{i=1}^n (x_i - E[x])\, u_i \to_p E\left[(x - E[x])\, u\right] = \mathrm{Cov}(x, u),$$

$$\hat\sigma_x^2 := \frac{1}{n}\sum_{i=1}^n (x_i - E[x])^2 \to_p E\left[(x - E[x])^2\right] = \mathrm{Var}(x),$$

where $\mathrm{Cov}(x,u) = 0$ by SLR.4 and $\mathrm{Var}(x) > 0$ by SLR.3. Thus, by Theorem 14.1(iii),

$$\hat\beta_1 \simeq \beta_1 + \frac{\hat\sigma_{x,u}}{\hat\sigma_x^2} \to_p \beta_1 + \frac{\mathrm{Cov}(x,u)}{\mathrm{Var}(x)} = \beta_1.$$

Theorem 15.2 Under SLR.1–SLR.5,

$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} \sim^a N(0,1), \qquad (15.2)$$

where $\mathrm{sd}(\hat\beta_1) = \sigma/\sqrt{SST_x}$. The same convergence holds when we replace $\mathrm{sd}(\hat\beta_1)$ with $\mathrm{se}(\hat\beta_1) = \hat\sigma/\sqrt{SST_x}$:

$$\hat t_{\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}(\hat\beta_1)} \sim^a N(0,1). \qquad (15.3)$$

Proof. Let

$$\hat\sigma_x^2 = \frac{1}{n} SST_x$$

be the sample variance of $x$. We can then write

$$\hat\beta_1 - \beta_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)\, u_i}{\hat\sigma_x^2}, \quad \mathrm{sd}(\hat\beta_1) = \frac{\sigma}{\sqrt n\, \hat\sigma_x}.$$

Combining these two expressions yields

$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)\, u_i}{\sigma \hat\sigma_x/\sqrt n}. \qquad (15.4)$$

By the LLN, the numerator satisfies

$$\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)\, u_i \simeq \frac{1}{n}\sum_{i=1}^n (x_i - E[x])\, u_i,$$

where, with $\sigma_x^2 = \mathrm{Var}(x)$,

$$E\left[(x - E[x])\, u\right] = 0 \quad \text{and} \quad \mathrm{Var}\left((x - E[x])\, u\right) = \sigma^2 \sigma_x^2.$$

Thus, by the CLT,

$$\hat b_n := \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)\, u_i}{\sigma \sigma_x/\sqrt n} \sim^a N(0,1). \qquad (15.5)$$

Comparing (15.4) and (15.5), we see that

$$\frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)} \simeq \frac{\sigma_x}{\hat\sigma_x}\, \hat b_n,$$

where, by the LLN and Theorem 14.1(iii),

$$\frac{\sigma_x}{\hat\sigma_x} \to_p 1. \qquad (15.6)$$

An application of Theorem 14.1(ii) now yields (15.2).

To show (15.3), write

$$\hat t_{\beta_1} = \frac{\mathrm{sd}(\hat\beta_1)}{\mathrm{se}(\hat\beta_1)} \cdot \frac{\hat\beta_1 - \beta_1}{\mathrm{sd}(\hat\beta_1)},$$

where

$$\frac{\mathrm{sd}(\hat\beta_1)}{\mathrm{se}(\hat\beta_1)} = \frac{\sigma}{\hat\sigma} \to_p 1.$$

The result now follows from yet another application of Theorem 14.1(ii).
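Consistency of the slope estimator is easy to visualise numerically. The sketch below (my illustration, assuming NumPy; the SLR parameters are hypothetical) computes $|\hat\beta_1 - \beta_1|$ for samples of growing size using formula (15.1).

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1 = 1.0, 2.0

def slope(n, rng):
    # SLR slope estimator, eq. (15.1)
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Estimation error for increasing sample sizes
errs = {n: abs(slope(n, rng) - beta1) for n in [50, 5000, 500000]}
print(errs)
```

For the largest sample the estimation error is tiny, in line with $\hat\beta_1 \to_p \beta_1$ from Theorem 15.1.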

Part IV

Heteroskedasticity

MLR.5 assumes homoskedastic errors. This is a strong assumption which in many applications may

be violated. We here develop inferential tools that are robust to unknown heteroskedasticity.


If we drop MLR.5, the conditional variance of $u_i$ will generally depend on the regressors,

$$\mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma_i^2 = \sigma^2(x_{i1}, x_{i2}),$$

where $\sigma^2(x_{i1}, x_{i2})$ is the unknown conditional variance of $u_i$ given $(x_{i1}, x_{i2})$. What are the consequences of heteroskedasticity for the OLS estimators and the t test we developed earlier? First note that the OLS estimators remain unbiased and consistent: these two properties were established without making use of MLR.5. What remains is to understand how the variance and the large-sample distribution are affected.

16 Heteroskedasticity robust standard errors

Regarding the variance, recall the derivations of eq. (10.1). It is easily checked that the first four equalities in this display remain valid once we drop MLR.5. However, the fifth equality uses MLR.5, and so it is no longer correct. Instead, we now substitute $\mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sigma_i^2$ into the expression and obtain

$$\mathrm{Var}\left(\hat\beta_1 \mid X_n\right) = \sum_{i=1}^n w_i^2 \mathrm{Var}(u_i \mid x_{i1}, x_{i2}) = \sum_{i=1}^n w_i^2 \sigma_i^2 = \frac{\sum_{i=1}^n \hat r_{i1}^2 \sigma_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}.$$

Similar to the homoskedastic case, we would now like to obtain an estimator of the last expression so that we can compute standard errors of $\hat\beta_1$.

If we are willing to assume a particular functional form of $\sigma^2(x_{i1}, x_{i2})$, we could estimate $\sigma_i^2$, $i = 1, \dots, n$, from the residuals. We could, for example, assume that $\sigma^2(x_1, x_2) = \delta_0 + \delta_1 x_1 + \delta_2 x_2$. But, similar to when we imposed MLR.5, there is a risk that $\sigma^2(x_1, x_2)$ is not a linear function, since its true form is unknown to us. And so any procedure based on ad hoc functional form assumptions is prone to be invalid in many applications.

Instead, we will develop an estimator that targets the large-sample limit of $\mathrm{Var}(\hat\beta_1 \mid X_n)$. Appealing to the LLN, we have

$$\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \sigma_i^2 \to_p E\left[r_1^2 \sigma^2(x_1, x_2)\right], \quad \frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \to_p E\left[r_1^2\right],$$

and so

$$\mathrm{Var}\left(\hat\beta_1 \mid X_n\right) \simeq \frac{1}{n} \cdot \frac{E\left[r_1^2 \sigma^2(x_1, x_2)\right]}{E\left[r_1^2\right]^2}.$$

Can we estimate the two population moments in this ratio? $\left(\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2\right)^2$ is a consistent estimator of $E\left[r_1^2\right]^2$, but how do we estimate $E\left[r_1^2 \sigma^2(x_1, x_2)\right]$ when we do not know $\sigma^2(x_1, x_2)$? Let us first rewrite this moment: using that

$$E\left[u^2 \mid x_1, x_2\right] = \sigma^2(x_1, x_2),$$

we obtain, by the Law of Iterated Expectations,

$$E\left[r_1^2 \sigma^2(x_1, x_2)\right] = E\left[r_1^2 E\left[u^2 \mid x_1, x_2\right]\right] = E\left[r_1^2 u^2\right].$$

Thus, a consistent estimator of the numerator is $\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2$, and so

$$\widehat{\mathrm{Var}}\left(\hat\beta_1 \mid X_n\right) = \frac{1}{n} \cdot \frac{\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\frac{1}{n}\sum_{i=1}^n \hat r_{i1}^2\right)^2} = \frac{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}$$

is a consistent estimator of the large-sample variance of $\hat\beta_1$. Taking the square root of this expression yields heteroskedasticity robust standard errors that are valid in large samples,

$$\mathrm{se}_{HR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2}} = \frac{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}}{\sum_{i=1}^n \hat r_{i1}^2}.$$
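The robust standard error above is straightforward to compute from the first-stage residuals. The sketch below (my illustration, assuming NumPy; the heteroskedastic error specification is a hypothetical choice) builds $\mathrm{se}_{HR}(\hat\beta_1)$ from $\hat r_{i1}$ and the OLS residuals $\hat u_i$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
u = (1.0 + np.abs(x1)) * rng.normal(size=n)  # heteroskedastic errors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + u

# Partial x1 out of (1, x2) to obtain the residuals r1
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# OLS estimates and residuals from the full regression
X = np.column_stack([np.ones(n), x1, x2])
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ bhat

# Heteroskedasticity robust standard error for beta1-hat
se_hr = np.sqrt(np.sum(r1 ** 2 * uhat ** 2)) / np.sum(r1 ** 2)
print(bhat[1], se_hr)
```

Note that no functional form for $\sigma^2(x_1, x_2)$ is assumed anywhere; the squared residuals stand in for the unknown conditional variances.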

17 Large sample distribution of t statistic

Given the proposed standard errors under heteroskedasticity, we need to derive the large sample distribution of the corresponding t-statistic. Recall that

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}.$$

Thus,

$$\hat t_{\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}_{HR}(\hat\beta_1)} = \frac{\sum_{i=1}^n \hat r_{i1} u_i \big/ \sum_{i=1}^n \hat r_{i1}^2}{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2} \big/ \sum_{i=1}^n \hat r_{i1}^2} = \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sqrt{\sum_{i=1}^n \hat r_{i1}^2 \hat u_i^2}} \simeq \frac{\sum_{i=1}^n r_{i1} u_i}{\sqrt{\sum_{i=1}^n r_{i1}^2 u_i^2}}.$$

By the CLT together with $E[r_{i1} u_i] = 0$,

$$\frac{\frac{1}{n}\sum_{i=1}^n r_{i1} u_i}{\sqrt{\frac{1}{n} E\left[r_{i1}^2 u_i^2\right]}} \sim^a N(0,1),$$

while, by the LLN,

$$\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2 \to_p E\left[r_{i1}^2 u_i^2\right] \quad \Rightarrow \quad \sqrt{\frac{E\left[r_{i1}^2 u_i^2\right]}{\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2}} \to_p 1.$$

In total,

$$\hat t_{\beta_1} \simeq \frac{\sum_{i=1}^n r_{i1} u_i}{\sqrt{\sum_{i=1}^n r_{i1}^2 u_i^2}} = \sqrt{\frac{E\left[r_{i1}^2 u_i^2\right]}{\frac{1}{n}\sum_{i=1}^n r_{i1}^2 u_i^2}} \cdot \frac{\frac{1}{n}\sum_{i=1}^n r_{i1} u_i}{\sqrt{\frac{1}{n} E\left[r_{i1}^2 u_i^2\right]}} \sim^a N(0,1).$$

Thus, t statistics based on heteroskedasticity robust standard errors will follow standard normal distributions in large samples irrespective of whether MLR.5 holds or not. In particular, we can use the same critical values as before.

Part V

Repeated cross sections and panel data

Suppose we have observed $(y_{it}, x_{it})$, $i = 1, \dots, n$ and $t = 1, \dots, T$, from the following population regression model,

$$y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}. \qquad (17.1)$$

The previous parts of the notes developed OLS estimators of $\beta_0$ and $\beta_1$ when $T = 1$ and explored their properties. We will here extend the theory to the case with $T \ge 2$. All subsequent results are easily extended to the case of multiple regressors at the price of more cumbersome notation.

Data could either be repeated cross sections, in which case a new sample of $n$ individuals is drawn from the population in each time period $t = 1, \dots, T$, or panel data, in which case the same $n$ individuals are followed across all $T$ time periods. Irrespective of whether data have arrived from the first or the second scenario, we can still use the same estimators; these will be developed in the next section. However, the properties of these estimators will differ depending on the sampling scheme. We first analyse the estimators in the case of repeated cross sections and then consider the case of panel data afterwards.

We will require at a minimum that $E[u_{it} \mid x_{it}] = 0$ so that $u_{it}$ does not contain any relevant omitted variables. This rules out, for example, the presence of fixed effects in $u_{it}$. In the case of panel data with fixed effects present in the original data, this means that we should think of (17.1) as having been obtained after a suitable transformation. Suppose we have observed $(\tilde y_{it}, \tilde x_{it})$ from the following "original" regression model,

$$\tilde y_{it} = \tilde\beta_0 + \delta_0 d2_t + \beta_1 \tilde x_{it} + \tilde a_i + \tilde u_{it}, \quad t = 1, \dots, T+1, \qquad (17.2)$$

where $\tilde a_i$ is a fixed effect and $d2_t$ is a time dummy. We would then first remove the fixed effect through a suitable transformation before estimation. For example, in the case of the first-differencing estimator, we would compute

$$y_{it} = \Delta\tilde y_{it}, \quad x_{it} = \Delta\tilde x_{it},$$

and these new variables would now satisfy (17.1) with $u_{it} = \Delta\tilde u_{it}$.

18 OLS estimation

Given that the model (17.1) holds for all $n$ individuals across the $T$ time periods, a natural way to estimate the parameters is through pooled OLS. For given candidate values of the estimators, $b_0$ and $b_1$, we define the pooled SSR as

$$SSR(b_0, b_1) = \sum_{i=1}^n \sum_{t=1}^T (y_{it} - b_0 - b_1 x_{it})^2,$$

and then choose as estimators the values of $b_0$ and $b_1$ that minimise the SSR,

$$\min_{b_0, b_1} SSR(b_0, b_1).$$


As usual, these estimators, denoted $\hat\beta_0$ and $\hat\beta_1$, can be characterised as the solutions to the first order conditions of this minimisation problem:

$$\left.\frac{\partial SSR(b_0, b_1)}{\partial b_0}\right|_{(b_0, b_1) = (\hat\beta_0, \hat\beta_1)} = -2 \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) = 0, \qquad (18.1)$$

$$\left.\frac{\partial SSR(b_0, b_1)}{\partial b_1}\right|_{(b_0, b_1) = (\hat\beta_0, \hat\beta_1)} = -2 \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) x_{it} = 0. \qquad (18.2)$$

To derive explicit expressions for $\hat\beta_0$ and $\hat\beta_1$, we proceed as in the cross-sectional setting ($T = 1$): first, rearrange (18.1) to obtain

$$\hat\beta_0 = \frac{1}{nT}\sum_{i=1}^n \sum_{t=1}^T y_{it} - \hat\beta_1 \cdot \frac{1}{nT}\sum_{i=1}^n \sum_{t=1}^T x_{it} = \bar y - \hat\beta_1 \bar x,$$

where the averages $\bar y$ and $\bar x$ are computed using the pooled data set. Substituting this expression for $\hat\beta_0$ into (18.2) and rearranging yields

$$0 = \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \hat\beta_0 - \hat\beta_1 x_{it}\right) x_{it} = \sum_{i=1}^n \sum_{t=1}^T \left(y_{it} - \bar y - \hat\beta_1 (x_{it} - \bar x)\right) x_{it}$$

$$= \sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)\, x_{it} - \hat\beta_1 \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)\, x_{it},$$

with solution

$$\hat\beta_1 = \frac{\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)\, x_{it}}{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)\, x_{it}} = \frac{\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)(x_{it} - \bar x)}{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2}, \qquad (18.3)$$

where we have used that $\sum_{i=1}^n \sum_{t=1}^T (y_{it} - \bar y)\,\bar x = \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)\,\bar x = 0$. We see that the estimators are of the same form as when $T = 1$, except that all averages in their expressions are now computed using the pooled data set.
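The pooled estimators are direct to implement by flattening the panel into one long sample. The sketch below (my illustration, assuming NumPy; all parameter values are hypothetical) computes $\hat\beta_0$ and $\hat\beta_1$ from eq. (18.3) on a simulated $n \times T$ data set.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 300, 4
beta0, beta1 = 1.0, 0.5

x = rng.normal(size=(n, T))
y = beta0 + beta1 * x + rng.normal(size=(n, T))

# Pooled OLS treats all n*T observations symmetrically, eq. (18.3)
xf, yf = x.ravel(), y.ravel()
b1 = np.sum((yf - yf.mean()) * (xf - xf.mean())) / np.sum((xf - xf.mean()) ** 2)
b0 = yf.mean() - b1 * xf.mean()
print(b0, b1)
```

The estimates land close to the true parameters; note that the formulas are identical to the $T = 1$ case once the data are pooled.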

19 Analysis of OLS - repeated cross sections

We here analyse the properties of the OLS estimators derived in the previous section in the case where data are repeated cross sections. We will work under the following conditions, which are natural extensions of the ones for the case of $T = 1$:

SLR.1* In the population, the following relationship holds between $x_t$ and $y_t$:

$$y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, \dots, T. \qquad (19.1)$$

SLR.2* $\{x_{it}, y_{it}\}$, $i = 1, \dots, n$ and $t = 1, \dots, T$, are i.i.d.

SLR.3* $x_{it}$, $i = 1, \dots, n$ and $t = 1, \dots, T$, exhibit variation so that $SST_x = \sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 > 0$.

SLR.4* $E[u_t \mid x_t] = 0$, $t = 1, \dots, T$.

SLR.1* assumes that the model holds true across all $T$ time periods. We can think of each time period as corresponding to a new population, and SLR.1* then states that for each of these populations our regression model holds true with the same common parameters $\beta_0$ and $\beta_1$. SLR.2* formally states that in each time period $t = 1, \dots, T$, we draw a new sample which is independent of the samples drawn in previous periods. SLR.3* requires sufficient variation in the regressor across individuals and time periods so that the OLS estimators are well-defined. SLR.4* is the usual mean-independence assumption.

Theorem 19.1 Under SLR.1*, SLR.2*–SLR.3* and SLR.4*, with $X_n = \{x_{it} : i = 1, \dots, n;\ t = 1, \dots, T\}$,

$$E[\hat\beta_j \mid X_n] = \beta_j, \quad j = 0, 1,$$

$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2},$$

where $\sigma_{it}^2 = \mathrm{Var}(u_{it} \mid x_{it})$.

Proof. The proof follows along the same lines as the ones of Theorems 4.1 and 5.1. Observe that (18.3) can be rewritten as

$$\hat\beta_1 = \frac{\sum_{i=1}^n \sum_{t=1}^T y_{it} (x_{it} - \bar x)}{SST_x}.$$

Next, substitute $y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}$ into this expression,

$$\hat\beta_1 = \frac{\sum_{i=1}^n \sum_{t=1}^T (\beta_0 + \beta_1 x_{it} + u_{it})(x_{it} - \bar x)}{SST_x}$$

$$= \beta_0 \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)}{SST_x} + \beta_1 \frac{\sum_{i=1}^n \sum_{t=1}^T x_{it}(x_{it} - \bar x)}{SST_x} + \frac{\sum_{i=1}^n \sum_{t=1}^T u_{it}(x_{it} - \bar x)}{SST_x}$$

$$= \beta_1 + \frac{\sum_{i=1}^n \sum_{t=1}^T u_{it}(x_{it} - \bar x)}{SST_x} = \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} u_{it},$$

where

$$w_{it} = \frac{x_{it} - \bar x}{SST_x}, \quad i = 1, \dots, n \text{ and } t = 1, \dots, T.$$

Now, using that $w_{it}$ is a function of $X_n$ alone,

$$E[\hat\beta_1 \mid X_n] = \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} E[u_{it} \mid X_n] = \beta_1 + \sum_{i=1}^n \sum_{t=1}^T w_{it} E[u_{it} \mid x_{it}] = \beta_1,$$

where the second and third equalities use SLR.2* and SLR.4*, respectively.


Similarly,

$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2 \mathrm{Var}(u_{it} \mid X_n) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2 \mathrm{Var}(u_{it} \mid x_{it}) = \sum_{i=1}^n \sum_{t=1}^T w_{it}^2 \sigma_{it}^2 = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2}.$$

Consistent heteroskedasticity robust standard errors of $\hat\beta_1$ are given by

$$\mathrm{se}_{HR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \hat u_{it}^2}{SST_x^2}}, \qquad (19.2)$$

and, appealing to the LLN and CLT, it then follows that

$$\hat t_{\beta_1} = \frac{\hat\beta_1 - \beta_1}{\mathrm{se}_{HR}(\hat\beta_1)} \sim^a N(0,1).$$

20 Analysis of OLS - panel data

In the case of panel data, we will keep SLR.3* from the previous section but replace SLR.2* and SLR.4* by:

SLR.2** $\{x_{it}, y_{it} : t = 1, \dots, T\}$, $i = 1, \dots, n$, are i.i.d.

SLR.4** $E[u_t \mid X] = 0$, $t = 1, \dots, T$, where $X = (x_1, \dots, x_T)$.

SLR.2** assumes that the $n$ individuals in our data set are randomly sampled from the population, in which case the observations from unit $i$ are independent of the observations from unit $j$, $i \ne j$. However, for any given individual $i$, $(y_{is}, x_{is})$ is allowed to be dependent on/correlated with $(y_{it}, x_{it})$, $s, t = 1, \dots, T$. This is a more realistic assumption compared to SLR.2* when working with panel data: recall that we here follow the same observational units over time. For a given unit $i$, it is likely that outcomes in period $t$ will be dependent on past and future outcomes. For example, if $y_{it}$ is the wage of a given individual in period $t$, then it will most likely be dependent on past and future earnings of the same individual.

SLR.4** requires that $u_t$ is mean-independent of current, past and future values of the regressor. This is a stronger assumption than SLR.4* and is needed because we here allow $u_t$ to be dependent on $x_s$, $s \ne t$.

In order to derive the variance of the OLS estimator under SLR.2**, we need the following result: for any $T$ random variables $v_1, \dots, v_T$,

$$\mathrm{Var}(v_1 + \cdots + v_T) = \sum_{t=1}^T \mathrm{Var}(v_t) + 2 \sum_{s=1}^T \sum_{t=s+1}^T \mathrm{Cov}(v_s, v_t). \qquad (20.1)$$

To see this, define $\bar v_t = v_t - E[v_t]$ so that

$$\mathrm{Var}(v_1 + \cdots + v_T) = E\left[(\bar v_1 + \cdots + \bar v_T)^2\right],$$

where

$$(\bar v_1 + \cdots + \bar v_T)^2 = \bar v_1^2 + \cdots + \bar v_T^2 + 2 \sum_{s=1}^T \sum_{t=s+1}^T \bar v_s \bar v_t.$$

Theorem 20.1 Under SLR.1*, SLR.2**, SLR.3* and SLR.4**, with $X_n = \{x_{it} : i = 1, \dots, n;\ t = 1, \dots, T\}$,

$$E[\hat\beta_j \mid X_n] = \beta_j, \quad j = 0, 1,$$

$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \frac{\sum_{i=1}^n \sum_{t=1}^T (x_{it} - \bar x)^2 \sigma_{it}^2}{SST_x^2} + 2\, \frac{\sum_{i=1}^n \sum_{s=1}^T \sum_{t=s+1}^T (x_{is} - \bar x)(x_{it} - \bar x)\, \mathrm{Cov}(u_{is}, u_{it} \mid X_i)}{SST_x^2},$$

where $\sigma_{it}^2 = \mathrm{Var}(u_{it} \mid X_i)$.

Proof. The proof proceeds as the one of Theorem 19.1, except for the variance computation. With panel data, we have

$$\mathrm{Var}(\hat\beta_1 \mid X_n) = \sum_{i=1}^n \mathrm{Var}\left(\sum_{t=1}^T w_{it} u_{it} \,\Big|\, X_n\right),$$

where we have used that the observations from unit $i$ are independent of the observations from unit $j$. With $v_{it} = w_{it} u_{it}$, (20.1) yields

$$\mathrm{Var}\left(\sum_{t=1}^T w_{it} u_{it} \,\Big|\, X_n\right) = \sum_{t=1}^T w_{it}^2 \mathrm{Var}(u_{it} \mid X_n) + 2 \sum_{s=1}^T \sum_{t=s+1}^T w_{is} w_{it} \mathrm{Cov}(u_{is}, u_{it} \mid X_n)$$

$$= \sum_{t=1}^T w_{it}^2 \mathrm{Var}(u_{it} \mid X_i) + 2 \sum_{s=1}^T \sum_{t=s+1}^T w_{is} w_{it} \mathrm{Cov}(u_{is}, u_{it} \mid X_i),$$

where the second equality uses SLR.2**.

Comparing the expressions for $\mathrm{Var}(\hat\beta_1 \mid X_n)$ in Theorems 19.1 and 20.1, we see that an extra term appears when working with panel data. This is due to potential autocorrelation of the regression errors, $\mathrm{Cov}(u_{is}, u_{it} \mid X_i) \ne 0$. If the errors happen to exhibit no autocorrelation, the variance expressions are identical. To control for potential autocorrelation, we need to adjust the standard errors stated in (19.2) and use the following heteroskedasticity and autocorrelation robust (HAR) ones:

$$\mathrm{se}_{HAR}(\hat\beta_1) = \sqrt{\frac{\sum_{i=1}^n \left[\sum_{t=1}^T (x_{it} - \bar x)^2 \hat u_{it}^2 + 2 \sum_{s=1}^T \sum_{t=s+1}^T (x_{is} - \bar x)(x_{it} - \bar x)\, \hat u_{is} \hat u_{it}\right]}{SST_x^2}}. \qquad (20.2)$$

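The contrast between (19.2) and (20.2) is easy to see numerically. The sketch below (my illustration, assuming NumPy; the unit-level shock components are hypothetical choices that make both the regressor and the error autocorrelated within each unit) computes the HR and HAR standard errors for pooled OLS on a simulated panel. Note that the inner bracket of (20.2) equals $(\sum_t (x_{it}-\bar x)\hat u_{it})^2$, which is what the code exploits.

```python
import numpy as np

rng = np.random.default_rng(9)
n, T = 500, 4
beta0, beta1 = 1.0, 0.5

# Unit-level components: both x and u are autocorrelated within each unit
z = rng.normal(size=(n, 1))
x = z + rng.normal(size=(n, T))
a = rng.normal(size=(n, 1))
u = a + rng.normal(size=(n, T))              # Cov(u_is, u_it | X_i) > 0 for s != t
y = beta0 + beta1 * x + u

# Pooled OLS, eq. (18.3)
xf, yf = x.ravel(), y.ravel()
xd = xf - xf.mean()
b1 = np.sum((yf - yf.mean()) * xd) / np.sum(xd ** 2)
b0 = yf.mean() - b1 * xf.mean()
uhat = (yf - b0 - b1 * xf).reshape(n, T)
xdm = x - xf.mean()                          # (x_it - xbar), shape (n, T)
sst_x = np.sum(xd ** 2)

# HAR standard error, eq. (20.2): squaring per-unit sums picks up covariances
s_i = np.sum(xdm * uhat, axis=1)             # sum_t (x_it - xbar) uhat_it per unit
se_har = np.sqrt(np.sum(s_i ** 2)) / sst_x
# HR standard error, eq. (19.2), which ignores the autocorrelation
se_hr = np.sqrt(np.sum((xdm * uhat) ** 2)) / sst_x
print(se_har, se_hr)
```

With positively autocorrelated regressors and errors, the HAR standard error typically exceeds the naive HR one, so ignoring the covariance terms would overstate the precision of $\hat\beta_1$.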
