Problem set
Causal inference
EBA 3530 Spring 2021

The goal of this problem set is to allow you to test your understanding of the material covered in lectures 9, 10 and 11. If you master these problems (without relying on the suggested solutions), you will most likely be prepared to solve these kinds of problems on the final exam.

A suggested solution will be provided, but please note that in many cases there is more than one correct answer; it will depend on your own ability to reason and explain your thinking. For example, it is expected at this stage that you will be able to come up with possible omitted variables and relate them to the treatment and outcome variables. It is also expected that you sketch simple flowcharts showing relationships between variables, just as we have done many times in class. This is a very efficient way for you to communicate your thinking. However, everyone will come up with different examples, which will lead to different conclusions with respect to the case at hand.

Furthermore, the suggested solution will not necessarily reflect a perfect answer that would be awarded a 100% score. The official BI grading scale says the following about the grade A (= Excellent): An excellent performance, clearly outstanding. The candidate demonstrates excellent judgement and a high degree of independent thinking. In order for you to showcase your aptitude for independent thinking on the exam, the suggested solutions are deliberately not exhaustive and down to a T, especially since this is going to be an open-book exam.

Good luck!

1. Randomised trials and linear regression

You work at a medium-sized consultancy firm that specialises in evaluating public transit projects in Norway. During the COVID-19 pandemic, all workers were forced to work from home. Since then, however, home office has no longer been mandatory as the pandemic has faded. Many of your colleagues have continued to work from home rather than come into the office.
You are a leading data scientist at your firm and your manager (who knows a lot about evaluating railway projects, but not much else) approaches you with a task. He shows you a spreadsheet where he has recorded each employee's work output [1] during the previous week, as well as whether each employee worked from home or came into the office. He wants to revert to a pre-pandemic working environment, as he misses walking around the open-plan office to check in on his subordinates, but he is unsure whether that will be good for the firm, as some employees have voiced a high degree of satisfaction with working from home. With the data that he has gathered, he wants to know whether he should make coming into the office mandatory at the next general meeting (which, to his dismay, will be held over Microsoft Office Teams).

[1] Measured as the amount of work completed during a day divided by the benchmark amount of work to be completed during a day. If the measure is equal to 1, the employee produced exactly as much output as required. If the measure is 0.7, the employee underperformed by 30%, etc. This technicality allows for a better comparison across different positions and departments in the firm.

(a) Explain to your manager (possible reasons) why you are not able to provide a causal estimate of the productivity effect based on these two pieces of information. In your answer, you should recast the problem as a causal inference problem. You can use a combination of words, drawings and maths.

Your manager is a bit confused by your reasoning, but pulls out the results of a questionable non-anonymous commuting survey that your firm did back in 2019. The goal of the survey was to gather information about commuting habits following an initiative in 2018, when employee car parking facilities were removed to promote greener and healthier commutes. He argues that the solutions to the problems in (a) will surely be found in these data. In this survey, employees provided information regarding their daily commute time from home to the office in minutes. He used Google a fair bit and figured out how to run a linear regression in Microsoft Office Excel:

$Y_i = \beta_0 + \beta_1 HOME_i + \beta_2 CTime_i + e_i$

$HOME_i$ is a dummy variable that is equal to 1 if individual $i$ works from home and zero otherwise. $CTime_i$ is the individual's commuting time in minutes.

(b) Reassure your manager that he is on the right track. Then try to explain why, even in this case, you should be careful about interpreting $\beta_1$ as a causal effect. In your answer, you can use terms like omitted variable bias and reverse causality. You can use a combination of words, drawings and maths.

Your manager is very displeased by your arguments but is very eager to learn about the effect of home office on worker productivity. He gives you the mandate to obtain a causal effect by any means necessary.

(c) Propose a research design that will sidestep the issues discussed above. Briefly discuss why this approach works. Are there any ethical issues that you should consider? What would happen if your colleagues do not cooperate? Is it really feasible? You can use a combination of words, drawings and maths.

2. Instrumental variable methods

We build on the background information given in Problem 1 above. Since your consultancy firm evaluates transit projects, the senior partners of the firm are huge train nerds. They really enjoy spending time in the boardroom, which features a panoramic view over a large train station and train depot area. The area is thus served mostly by rail, and bus services are limited. The commuting survey conducted in 2019 confirms that most employees (to the delight of the senior partners) commute to work using local and regional trains that call at the nearby station. The commuting survey also asked which train lines the individual employee relied on to get to work.
However, there are also employees who walk or bike to work despite having the option to take the train. The removal of employee parking in 2018 eliminated the use of private cars for commutes.

(a) Given this information, can you think of a valid instrument that could cause random variation in treatment? That is, is there randomness that can induce employees to stay home rather than go to work? Use words and drawings to support your discussion. Here you need to be a bit creative, but start by discussing what characterises a valid instrument. Maybe that will give you some inspiration.

(b) Identify who the compliers, always-takers, never-takers and defiers are in this case. Are there likely to be defiers?

(c) Explain how the instrumental variable approach solves the reverse causality problem discussed in Problem 1. In this case, it can be useful to again use a drawing and think about the interpretation of the first-stage fitted values.

(d) Define what we mean by a Local Average Treatment Effect (LATE) and explain what the LATE will represent in this case. How does this limit the interpretation of the estimated causal effect? Is this LATE really the answer your manager is looking for?

(e) Suppose that your manager hands you the data for yesterday, in which you observe worker output, their treatment status that day, and the value of their instrumental variable. You estimate the reduced form and the first stage equations and obtain the relevant parameters: $\hat{\pi}_1 = -0.12$ from the reduced form and $\hat{\gamma}_1 = 0.4$ from the first stage. Give an interpretation of these estimates. Then compute the second-stage parameter $\hat{\beta}_1$ and give it an interpretation.

3. Difference in differences

We continue where we left off in Problems 1 and 2. Your manager is really getting impatient with all your talk about reverse causality and you being LATE for work. As you tried to explain to him, the inclusion of control variables in the regression could not solve reverse causality.
And the IV approach, while technically valid, gave a LATE interpretation that really did not answer the question he asked. Growing more and more frustrated by the day, your manager sends you his Excel spreadsheet containing productivity data from the beginning of the pandemic until now. You start to ponder whether it is possible to exploit the time dimension as well as the cross-section. You decide to focus on the productivity data for the last day of mandatory home office as well as yesterday's data, when some people worked from home and some came into the office.

(a) Briefly describe an empirical strategy to evaluate the productivity of those who work from home relative to those who do not, using the Difference in differences methodology. In this setting, you might have to switch around who is considered treated and who is considered control (counterfactual).

(b) What necessary assumption must hold for the empirical strategy to be valid? How can you check that the assumption holds?

(c) Prior to the lifting of the restrictions, the control group had an average productivity of 0.92 and the treatment group 0.67. When the treatment group returned to the office, they averaged 0.88. The control group, which decided to stay at home, had an average productivity of 0.97. Compute the estimated treatment effect and interpret the result.

(d) You want a standard error on your estimate and would rather opt for a regression framework where you incorporate both the cross-section and the time dimension using dummy variables with interactions. Set up this regression. With the numbers in the previous problem, compute what the estimated parameters would likely be.

and finally, for the post-treatment treatment group:

$E(Y_{it} \mid TREAT_i = 1, POST_t = 1) = 0.88 = \beta_0 + \beta_1 + \beta_2 + \beta_3 = 0.92 - 0.25 + 0.05 + \beta_3 \Rightarrow \beta_3 = 0.16$

(e) Make a graphical representation of the setup. Clearly mark the axes, the observation points, and the treatment, control and counterfactual graphs. Also mark the axes with the appropriate coefficients.

(f) You are now ready to approach your manager with your findings. What advice will you give him with respect to the future home office policy?
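
The arithmetic behind 3(c) and 3(d) can be checked with a short Python sketch. This is only an illustration, not the official suggested solution: the function name did_from_means and the coefficient labels are assumptions introduced here. It also assumes the regression includes a TREAT dummy, a POST dummy and their interaction, so that the four group means map exactly onto the four coefficients.

```python
def did_from_means(treat_pre, treat_post, control_pre, control_post):
    """Map four group means to the coefficients of a 2x2 DiD regression.

    Assumed (hypothetical) specification:
        Y = beta0 + beta1*TREAT + beta2*POST + beta3*TREAT*POST + e
    With only four group means, the coefficients are pinned down exactly.
    """
    beta0 = control_pre                       # control group, pre-period mean
    beta1 = treat_pre - control_pre           # pre-period gap between the groups
    beta2 = control_post - control_pre        # change over time in the control group
    # DiD: change in the treated group minus change in the control group
    beta3 = (treat_post - treat_pre) - (control_post - control_pre)
    return {"beta0": beta0, "beta1": beta1, "beta2": beta2, "beta3": beta3}


# Group means from 3(c): treated = returned to the office, control = stayed home
coeffs = did_from_means(treat_pre=0.67, treat_post=0.88,
                        control_pre=0.92, control_post=0.97)
for name, value in coeffs.items():
    print(f"{name} = {value:.2f}")
```

With the group means from 3(c), the interaction coefficient comes out at 0.16, in line with the value derived for the post-treatment treatment group. The sketch only reproduces point estimates; a standard error requires the individual-level data.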