Group Project 1

ECON5120: Bayesian Data Analysis

Due date: February 27, 2023 at 12 pm Credit: 20% of final grade

This project will have you work with limited dependent variable models. These models are the backbone of many statistical applications in economics and finance. Furthermore, they will introduce you to the idea of data augmentation, which forms the basis of the computational techniques for these class of models (as well as other important techniques such as the Expectation-Maximization algorithm).

ˆ Task: You are asked to program from scratch three variations of limited dependent variable models and then apply your code to three real life datasets. You must comment on the results you obtain for each application, in terms of the coefficients and convergence of the resulting chains. Each exercise has an equal weight of the final grade (adding up to 20%). You will also receive bonus points if you provide the technical details for the derivations of each model (1% per exercise, for a total of up to 3%). These points supplement your score for the coding part, giving you a higher chance of achieving full marks. The details for each model and what you are expected to submit are outlined below. This is a group assignment and each group should have three (3) students each.

ˆ Submittables: For full credit, you must submit to the course’s Moodle page three computer codes that will execute posterior simulations of all three models along with the results of applying your code to each of the datasets. These must be in either Matlab or R. You can provide your discussion of the results in a separate typeset file or within the code. Points will be subtracted for incorrect, unreadable or uncommented code. For the bonus points, you must submit all technical derivations in a typeset document (e.g., using LATEX). No handwritten notes will be accepted. There is no word limit but please try to be concise with your comments and derivations.

The following context is common to all models: We will assume that the observed data (y1, x1), . . . , (yn, xn) come from a random sample of size n. The vector x is of dimension p+1, such that there are p covariates and a constant, and the dependent variable y is limited in the sense that the values it can take are constrained. That is, yi ∈ S, meaning the outcome variable for each observation takes values in a set S and not the usual real line; i.e., S ≠ R. From observed data, we can construct the observation matrices

y1 x1,0 x1,1 ··· x1,p y2 x2,0 x2,1 ··· x2,p

y=. and X= . . .. . . ....

yn xn,0 xn,1 · · · xn,p

where y is an n-dimensional vector and X is an n × (p + 1) design matrix. Remember to include a constant in all models and in your applications by setting xi,0 = 1 for all i (that is, by letting the first column of X be all ones). For all execises, you can use the following noninformative prior values

B = 1000·Ip+1 α = 0.02 β = 0p+1 δ = 0.02

where Ip+1 is an identity matrix of size (p + 1) × (p + 1) and 0p+1 is a vector of zeros of dimension p + 1. Finally, for all posterior simulations you can use a burn-in period of 5,000 iterations and keep 10,000 final iterations to estimate posterior means (for a total of 15,000 iterations). Feel free to experiment with all these values in your applications.

1

Model 1. Tobit: This model is useful when the outcome y is censored, i.e., there is no information about the variable if they fall above or below a certain value. For example, if y represented income, we might not observe the value of income for extremely wealthy individuals, such that the income variable in our sample is top-coded at a given maximum. If this maximum was £10,000, then y ≤ 10 000. This is an example of upper or right censoring. If y represented wages, we might not observe wages lower than the minimum wage (e.g., £9.50 in the UK), so that y ≥ 9.50. This is an example of lower or left censoring.

In this assignment, we will assume y to be lower-censored with a minimum value equal to 0. Specifically, let y∗ be a latent unobserved variable that is related to the observed outcome y, where

yi∗ =β0 +β1xi,1 +···+βpxi,p +ui =x′iβ+ui

Here, β = (β0,β1,...,βp) is a p + 1 vector of coefficients and we assume that ui ∼ N(0,σ2) independently across units i. Then, we say that the outcome variable is censored since the latent yi∗ are only observed when yi∗ > 0 and all other values are set to 0. Mathematically, this means that our observed outcome yi satisfies

(yi∗ ifyi∗ >0 (x′iβ+ui ifx′iβ+ui >0 yi= 0 ifyi∗≤0= 0 ifx′iβ+ui≤0

This model can be estimated using maximum likelihood by writing out the full likelihood and maximizing it numerically. However, the Bayesian approach takes advantage of the idea of data augmentation to simplify the problem. Instead of having the unknown quantities be only β and σ2, we can introduce the vector of latent variables y∗ = (y1∗, . . . , yn∗ ) as additional parameters. We would then write the likelihood of y in terms of y∗, β and σ2. We will use as prior

p(y∗, β, σ2) = p(y∗|β, σ2)p(β)p(σ2)

n

p(y∗|β, σ2) = Y N (yi∗|x′iβ, σ2) i=1

where

The idea behind data augmentation is that including the latent variables greatly simplifies the resulting posterior (even while it increases the number of variables we must sample). Let C be the set of censored observations, i.e., C = {i : yi = 0}. We can show that the conditional posterior distributions in the tobit case are given by

yi∗|β, σ2, y ∼ T N (−∞,0](x′iβ, σ2) for all i ∈ C β|σ2,y∗,y ∼ Np+1(β ̄,B ̄)

2∗ ?α ̄δ ̄? σ |β,y ,y ∼ Inv-Gamma 2, 2

where T N (−∞,0](μ, σ2) is the truncated normal distribution to the interval (−∞, 0] with mean μ and variance σ2. That is, a regular normal distribution for values of the latent variable yi∗ between −∞ and 0, but the density is set to 0 if yi∗ > 0. The updated posterior parameters are

̄ ?X′X −1?−1 B= σ2 +B

̄ ̄?X′y∗ −1? β=B σ2 +B β

α ̄ = α + n

δ ̄=δ+(y∗ −Xβ)′(y∗ −Xβ)

p(β) = Np+1(β|β, B)

2 ?2αδ?

p(σ )=Inv-Gamma σ 2,2

2

ˆ Task: Program a Gibbs sampler that samples from the appropriate conditional posterior distributions. Then, apply your program to the dataset on pension benefits (fringe.csv) where the outcome variable is the amount of pension benefits (pension) and the covariates are a constant, experience (exper), age (age), tenure duration (tenure), education level (educ), number of dependents (depends), marriage status (married), dummy for white race (white), and dummy for sex (male). Comment on your results.

ˆ Bonus: Present the derivations on how to obtain the conditional posterior distributions.

Model 2. Probit: The probit model is useful for binary outcome variables, i.e., where yi can only take the values 0 or 1. For example, if y represented whether someone obtained a loan, then we can let y = 0 be the case where an individual did not get the loan, and y = 1 the case where they did. In this model, we still define

y i∗ = x ′i β + u i

but now for identification purposes we must impose σ2 = 1 and we only observe the sign of the

latent variable:

(1 ifyi∗>0 yi= 0 ifyi∗≤0

Define two sets C0 and C1 for the observations with outcome 0 or 1, i.e., C0 = {i : yi = 0} and C1 = {i : yi = 1}. Using the same idea of data augmentation and the same priors as in the tobit model, we can obtain the conditional posteriors as

(TN (x′β,1) foralli∈C yi∗|β,y ∼ (−∞,0] i 0

T N (0,∞)(x′iβ, 1) for all i ∈ C1 β|y∗,y ∼ Np+1(β ̄,B ̄)

ˆ Task: Program a Gibbs sampler that samples from the appropriate conditional posterior dis- tributions. Then, apply your program to the dataset on mortgage applications (loanapp.csv) where the outcome variable is whether an individual was approved for a mortgage loan (approve) and the covariates are a constant, housing expenditures as percentage of total income (hrat), other obligations as percentage of total income (obrat), loan amount as percentage of house price (loanprc), unemployment rate in industry of occupation (unem), number of dependents (dep), dummy for high school education (sch), dummy for presence of cosigner (cosign), dummy for previous bankrupcy (pubrec), marriage status (married), dummy for white race (white), and dummy for sex (male). Comment on your results.

ˆ Bonus: Present the derivations on how to obtain the conditional posterior distributions.

Model 3. Logit: The logit model is also used for binary variables, similarly to the probit. The difference between the two lies in the distribution assumed for the errors in the latent variable equation. That is, assume as before that

y i∗ = x ′i β + u i

but now ui is assumed to have a logistic distribution P (Ui ≤ ui ) = eui /(1 + eui ) with density

p(ui) = eui /(1+eui )2. We can show that the likelihood using the logistic distribution has the form p(y|β) = Yn ? exp(x′iβ) ?yi ? 1 ?1−yi

i=1 1 + exp(x′iβ) 1 + exp(x′iβ)

The issue with this model is that data augmentation does not simplify the posteriors. Thus, using

the same prior for β as before, we simply write the posterior as

p(β|y) ∝ Yn ? exp(x′iβ) ?yi ? 1 ?1−yi

i=1 1 + exp(x′iβ) 1 + exp(x′iβ) ?1 ′−1 ?

×exp −2(β−β)B (β−β) 3

To sample from this posterior, we will need to use a Metropolis-Hastings algorithm (there is no Gibbs sampler available). As this algorithm requires the choice of a proposal distribution, we will use a common choice for this exercise. Specifically, we will assume that the proposal is g(β) = Np+1(β|βˆ,Ω), i.e., a multivariate normal distribution with mean given by the maximum likelihood estimate βˆ and variance-covariance matrix

Ω = τ 2 ( Σˆ − 1 + B − 1 ) − 1

where τ is a tuning parameter and Σˆ is the maximum likelihood estimate of the variance given by

the negative inverse of the log-likelihood Hessian at βˆ, i.e.,

" ∂2

Σˆ = −[H(βˆ)]−1 = − ∂β2 log p(y|β)

#−1

β=βˆ

You can use existing functions in Matlab or R to obtain these maximum likelihood estimates, you

do not need to code these from scratch as part of the assignment.

ˆ Task: Program a Metropolis-Hastings algorithm using the given proposal distribution to sample from the posterior of β. Start with τ = 1 but feel free to experiment with values of τ and see the effect they have on the chains and the acceptance rate. Then, apply your program to the Mroz dataset on female labor market participation (mroz.csv) where the outcome variable is whether a woman is currently active in the labor force (inlf) and the covariates are other sources of income aside from wage (nwifeinc), level of education (educ), years of experience and its square (exper and exper2), age (age), number of children less than 6 years of age (kidslt6) and number of children between 6 and 18 years of age (kidsge6). Comment on your results.

ˆ Bonus: Present the derivations on how to obtain the maximum likelihood estimates for the logit model. It is not possible to obtain a closed-form solution to these estimates, so please present the form of the first derivatives of the log-likelihood (as these are what you would use to solve the problem numerically) and the form of Σˆ (this means you will need to get the second derivatives as well).

4