PLSC 30600: Final
[YOUR NAME]
May 21, 2022
This final is due at 11:59 pm on Thursday, June 2nd for non-graduating students and at 11:59 pm on Friday, May 27th for graduating students. You should submit your writeup (as a knitted .pdf or .html file along with the accompanying .Rmd file) to the course website before the deadline that applies to you. Please upload your solutions as a .pdf file saved as Yourlastname_Yourfirstinitial_final.pdf. In addition, an electronic copy of your .Rmd file (saved as Yourlastname_Yourfirstinitial_final.Rmd) should accompany this submission.

Late finals will not be accepted, so start early and plan to finish early. Remember that exams often take longer to finish than you might expect.

This exam has 3 questions and is worth a total of 50 points. Show your work in order to receive partial credit. Also, I will not accept .Rmd files that have not been knitted.

In general, you will receive points (partial credit is possible) when you demonstrate knowledge about the questions we have asked, you will not receive points when you demonstrate knowledge about questions we have not asked, and you will lose points when you make inaccurate statements (whether or not they relate to the question asked). Be careful, however, that you provide an answer to all parts of each question.

You may use your notes, books, and internet resources to answer the questions below. However, you are to work on the exam by yourself. You are prohibited from corresponding with any human being regarding the exam (unless following the procedures below).

I will answer clarifying questions during the exam. I will not answer statistical or computational questions until after the exam is over. If you have a question, send an email to me. If your question is a clarifying one, I will remove all identifying information from the email and reply on Canvas. Do not attempt to ask us questions in person (or by phone), and do not post on Stack Overflow.

Problem 1 (20 points)

In “New Evidence on the Impact of Sustained Exposure to Air Pollution on Life Expectancy from China’s Huai River Policy” (2017), Ebenstein et al.
build on the Huai River Policy RD design described in the earlier 2013 PNAS paper to evaluate the effect of air pollution on life expectancy. While the life expectancy data is not publicly available, this problem will have you examine their RD estimates on air quality as measured by PM10 concentration (particulate matter with diameter less than 10 micrometers).

The original source for this data is Ebenstein, A., Fan, M., Greenstone, M., He, G., & Zhou, M. (2017). New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy. Proceedings of the National Academy of Sciences, 114(39), 10384-10389.

The code below will read in the dataset:

    library(dplyr)  # provides %>% and filter()
    river <- haven::read_dta("DSP_PM10.dta")
    river <- river %>% filter(!is.na(pm10))  # Drop stations with missing PM10

Each row in the data contains a measurement of air quality in a particular geographic location along with the distance in degrees of latitude from the Huai River boundary. The relevant variables of interest are

• pm10 - Air quality as measured by PM10 concentration (micrograms per cubic meter, μg/m³)
• dist_huai - Degrees of latitude north of the Huai River boundary
• north_huai - An indicator for whether the location is north of the Huai River boundary (dist_huai > 0)

Part A (5 points)

Estimate the local average effect of the Huai River policy on PM10 concentration at the river boundary using a local quadratic regression with a triangular kernel. Use a bandwidth of h = 8 to the left and right of the cut-point. Provide a 95% confidence interval and discuss your results.

Part B (5 points)

Overlay your regression estimates from Part A onto a binned scatterplot and discuss how well you think the local quadratic regression approximates the conditional expectation function.

Part C (5 points)

Now estimate the effect at the cut-point using a bandwidth of h = 16. Provide a 95% confidence interval and interpret your result.
Overlay the regressions onto a binned scatterplot as in Part B. Compare your results to what you find in Part A and discuss the possible reasons for any differences or similarities that you observe.

Part D (5 points)

Assess whether there is bunching near the discontinuity. Use any appropriate analytical technique or techniques and interpret your results. Is there evidence to suggest that the density of the running variable in this dataset is discontinuous at the cut-point?

Problem 2 (20 points)

Does having a daughter (as opposed to a son) affect how U.S. legislators vote on women’s issues? Washington (2008; American Economic Review) finds that having a daughter causes a legislator to vote more liberally, especially on issues related to women. You will examine this using the washington.dta dataset. While the original paper looks at the 105th - 108th Congresses, this dataset will focus on representatives in the 105th (1997-1999).

The original source for this paper is Washington, E. L. (2008). Female socialization: how daughters affect their legislator fathers. American Economic Review, 98(1), 311-32.

The code below will load the data:

    washington <- haven::read_dta("washington.dta")

The variables you will need are:

• aauw - Outcome variable - Legislator’s voting score as assigned by the American Association of University Women (AAUW) (proxy for feminist/liberal-leaning voting record). Positive values indicate more liberal/feminist voting behavior.
• ngirls - Number of female children
• nboys - Number of male children
• totchi - Total number of children

Part A (5 points)

Our treatment of interest is a multi-valued treatment - the number of female children of a legislator is a count variable ranging from 0 to 7. While we could estimate the effects for each possible comparison (e.g. the effect of having 5 girls vs. 2 girls or 3 girls vs. 0 girls or any girls vs. 0 girls), this could yield very high-variance estimates.
Instead, we would like to pool our effect estimates into a single summary estimate of the Average Treatment Effect of having one additional daughter on the legislator’s AAUW score.

Let’s define a set of potential outcomes $Y_i(d)$ for all possible values of a treatment $d \in \mathcal{D}$. We again assume consistency: that for a unit with treatment level $D_i = d$, the observed outcome $Y_i$ equals the potential outcome $Y_i(d)$. Assume that the potential outcomes take on the following form

$$Y_i(d) = Y_i(0) + \tau_i d$$

What is the average treatment effect of having 3 daughters versus having 1 daughter on a legislator’s AAUW score? How about the average treatment effect of having 5 daughters versus 2 daughters? What assumption are we making about the treatment effects by writing the potential outcomes this way? How do we interpret $E[\tau_i]$?

Part B (5 points)

Estimate the average treatment effect of having one additional daughter on AAUW score assuming that the number of female children is completely ignorable. Provide a 95% confidence interval and interpret your results.

Part C (5 points)

Assume instead that the number of female children is conditionally ignorable given the number of total children. Subset the sample to representatives with at least 1 child (of any sex) and no more than 5 total children (as there are very few representatives with 6+ children). We’ll be working with this sample for the rest of the problem including in Part D.

Estimate the conditional average treatment effects of having one additional daughter on AAUW score conditional on the total number of children. Provide a 95% confidence interval for each CATE and interpret/discuss your results.

Part D (5 points)

Without making any additional assumptions on the outcome model, estimate the average treatment effect of having one additional daughter on AAUW score under the assumption of conditional ignorability. Provide a 95% confidence interval and interpret/discuss your results.
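
The sample restriction described in Part C can be sketched as follows. This is only a minimal illustration on a made-up data frame standing in for washington.dta: the values are invented, and only the column names ngirls, nboys, and totchi come from the variable list above. The cross-tabulation at the end is one possible way to see which combinations of family size and number of daughters are actually populated.

```r
# Toy stand-in for washington.dta: all values are invented for illustration.
toy <- data.frame(
  ngirls = c(0, 1, 2, 0, 3, 1, 4),
  nboys  = c(0, 1, 0, 2, 3, 0, 3)
)
toy$totchi <- toy$ngirls + toy$nboys

# Keep legislators with at least 1 and at most 5 children in total.
restricted <- subset(toy, totchi >= 1 & totchi <= 5)

# Cross-tabulate daughters by family size to see which cells are populated.
cell_counts <- table(totchi = restricted$totchi, ngirls = restricted$ngirls)
print(cell_counts)
```

With the real data, the same filter would be applied to the washington data frame after loading it.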
Problem 3 (10 points)

Consider a setting with $N$ observations indexed by $i \in \{1, 2, \dots, N\}$, a binary treatment $D_i$, an outcome $Y_i$, and pre-treatment covariates $X_i$. Assume consistency/SUTVA ($Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$), positivity ($0 < \Pr(D_i = 1 \mid X_i) < 1$), and conditional ignorability ($\{Y_i(1), Y_i(0)\} \perp D_i \mid X_i$).

Let $\pi(X_i) = \Pr(D_i = 1 \mid X_i)$ denote the true propensity score function. Let $\mu_1(X_i) = E[Y_i(1) \mid X_i]$ and $\mu_0(X_i) = E[Y_i(0) \mid X_i]$ denote the true regression functions for the potential outcome under treatment and control respectively. Our estimand is the ATE

$$\tau = E[Y_i(1)] - E[Y_i(0)]$$

As a general hint for this problem, you will find the law of total expectation very useful:

$$E[Y_i] = E[E[Y_i \mid X_i]]$$

Part A (2 points)

Consider the IPTW estimator

$$\hat{\tau}_{IPW} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{D_i Y_i}{\hat{\pi}(X_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{\pi}(X_i)} \right]$$

Show that if we know the true propensity score ($\hat{\pi}(X_i) = \pi(X_i)$), $\hat{\tau}_{IPW}$ is unbiased for the ATE.

Part B (4 points)

Consider another estimator

$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \hat{\mu}_1(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{\pi}(X_i)} \right) - \left( \hat{\mu}_0(X_i) + \frac{(1 - D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{\pi}(X_i)} \right) \right]$$

For this problem, treat $\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ as constants given $X_i$.

Show that if we know the true propensity score ($\hat{\pi}(X_i) = \pi(X_i)$), $\hat{\tau}$ will be unbiased for the ATE even if $\hat{\mu}_1(X_i) \neq \mu_1(X_i)$ and $\hat{\mu}_0(X_i) \neq \mu_0(X_i)$.

Part C (4 points)

Show that if we know the true regression models, $\hat{\mu}_1(X_i) = \mu_1(X_i)$ and $\hat{\mu}_0(X_i) = \mu_0(X_i)$, $\hat{\tau}$ is unbiased for the ATE even if we misspecify the propensity score (for this part, treat $\hat{\pi}(X_i)$ as a constant but $\hat{\pi}(X_i) \neq \pi(X_i)$).
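
To build intuition for what the two estimators in this problem compute, the sketch below evaluates them on simulated data. It is a numerical illustration only and does not substitute for the proofs requested above. Everything in it is an assumption made for the example: the data-generating process, the true ATE of 2, the function names ipw and aipw, and the deliberately wrong nuisance inputs (mu1_wrong, mu0_wrong, pi_wrong).

```r
# Simulated check of the IPW and augmented estimators (hypothetical setup).
set.seed(30600)
n <- 200000
x <- rnorm(n)
pi_true <- plogis(0.5 * x)        # true propensity score Pr(D = 1 | X)
d  <- rbinom(n, 1, pi_true)       # binary treatment
y0 <- 1 + x + rnorm(n)            # potential outcome under control
y1 <- y0 + 2                      # potential outcome under treatment (ATE = 2)
y  <- d * y1 + (1 - d) * y0       # observed outcome (consistency)

# IPW estimator from Part A.
ipw <- function(y, d, p) mean(d * y / p - (1 - d) * y / (1 - p))

# Augmented estimator from Part B.
aipw <- function(y, d, p, m1, m0) {
  mean((m1 + d * (y - m1) / p) - (m0 + (1 - d) * (y - m0) / (1 - p)))
}

mu1_true <- 3 + x                 # E[Y(1) | X] in this simulation
mu0_true <- 1 + x                 # E[Y(0) | X] in this simulation
mu1_wrong <- rep(0, n)            # deliberately wrong outcome models
mu0_wrong <- rep(5, n)
pi_wrong  <- rep(0.5, n)          # deliberately wrong propensity score

est <- c(
  ipw_true_ps           = ipw(y, d, pi_true),
  aipw_true_ps_wrong_mu = aipw(y, d, pi_true, mu1_wrong, mu0_wrong),
  aipw_wrong_ps_true_mu = aipw(y, d, pi_wrong, mu1_true, mu0_true),
  ipw_wrong_ps          = ipw(y, d, pi_wrong)
)
print(round(est, 3))
```

In this simulated setup, the first three entries should land close to the simulated ATE of 2, while the last one (IPW with the wrong propensity score) should not. A single large sample only suggests the pattern; the unbiasedness claims themselves require the arguments asked for in Parts A through C.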