ECMT1020 Introduction to Econometrics, Week 8, 2021S1
Lecture 7: Dummy Variables
Instructor: Ye Lu

Please read Chapter 5 of the textbook.

Contents
1 Motivation: two groups in the data
  1.1 Chow test: pool or separate
  1.2 Limitation of separate regressions
2 Regression with dummy variable
  2.1 Separating two groups in one regression: dummy variable
    2.1.1 Intercept dummy
    2.1.2 Slope dummy
  2.2 Dummy variable trap: perfect multicollinearity
3 More than two groups: more than one dummy variable
  3.1 More than two groups from one grouping criterion
    3.1.1 M categories: M − 1 dummies
    3.1.2 Change of reference category
  3.2 Multiple grouping criteria

Figure 1: Cost against number of students for 74 secondary schools in Shanghai, among which 34 are occupational schools and 40 are regular academic schools.

1 Motivation: two groups in the data

Our running example is the simple regression

Yi = β1 + β2Xi + ui,  i = 1, ..., n,  (1)

where
• Y = COST is the annual recurrent cost of running a school;
• X = N is the number of students in a school.

The question is how the number of students affects the cost of running a school. The scatter plot of annual cost against number of students for 74 secondary schools in Shanghai is given in Figure 1. In the plot we can see that the data points fall into two categories/types:
• occupational schools: schools that aim to provide skills for specific occupations;
• regular schools: regular general academic schools.
There are in total 34 occupational schools and 40 regular schools in this sample. It is clear from Figure 1 that the red dots (occupational schools) generally lie above the grey dots (regular schools). This means that overall it is more expensive to run an occupational school than a regular school,¹ or that the overhead cost of occupational schools is higher than that of regular schools. Therefore, if we were to fit a linear regression like (1) separately for the two types of schools, we would expect the intercept estimate from the occupational-school regression to be higher than that from the regular-school regression. Moreover, we can also see from Figure 1 that the marginal cost of each additional student seems to be higher for occupational schools than for regular schools. In other words, if we were to fit a linear regression like (1) separately for the two types of schools, we would also expect the slope estimate from the occupational-school regression to be higher than that from the regular-school regression.

Given the above observations, we have good reason to consider fitting separate regressions for these two distinct subsamples:
• subsample A: observations for occupational schools (sample size nA),
• subsample B: observations for regular schools (sample size nB),
where the subsample sizes must satisfy nA + nB = n. The separate regressions for subsamples A and B are respectively

Regression A: Y_i^A = α1 + α2 X_i^A + u_i^A,  i = 1, ..., nA,  (2)

and

Regression B: Y_i^B = γ1 + γ2 X_i^B + u_i^B,  i = 1, ..., nB.  (3)

Accordingly, we call regression (1) fitted with all the observations the 'pooled regression'.

¹This is reasonable because occupational schools tend to be expensive to run owing to the need to maintain specialized workshops.

First of all, we should understand that if we fit separate regressions A and B rather than a pooled regression, we must get a better model fit in terms of a smaller total sum of squared residuals.
In other words, we should expect²

RSS_A + RSS_B ≤ RSS_P  (4)

where RSS_A, RSS_B and RSS_P denote, respectively, the residual sums of squares for regression A, regression B and the pooled regression (1), which we call regression P. The question is: is the improvement specified in (4) statistically significant or not? Put differently, is the quantity RSS_P − (RSS_A + RSS_B) significantly different from zero? The Chow test (Chow, 1960) provides a formal statistical procedure to answer this question.

1.1 Chow test: pool or separate

Recall how we perform an F test in the multiple regression model to test whether adding extra regressor(s) significantly improves the model fit. The logic of the Chow test is the same, as it is essentially an F test. When we fit the pooled regression (1), we have k = 2 parameters to estimate, whereas the total number of parameters doubles to 2k = 4 if we fit the separate regressions (2) and (3). So the scenario here is again that we improve the model fit by adding more parameters, or in other words, by sacrificing degrees of freedom (DF). Remember that extra parameters cost/use up extra degrees of freedom – no free lunch!

The general formula for the F test statistic is

F(extra DF, DF remaining) = (improvement in fit / extra DF) / (RSS remaining / DF remaining)  (5)

where the improvement in fit comes from using a model with more parameters (fewer DF) versus a model with fewer parameters (more DF), and the RSS remaining and DF remaining are the RSS and DF of the model with more parameters. The null hypothesis is

H0: no improvement in fit from the model with more parameters,

and we reject the null hypothesis if the F test statistic is greater than the critical value at a given significance level. The Chow test aims to decide whether the separate Regressions A and B provide a significantly better fit than the pooled Regression P.
Note that we have:
• Regression A
  – sample size nA;
  – k parameters, which use up k degrees of freedom in fitting Regression A;
  – remaining DF: nA − k;
  – residual sum of squares: RSS_A.
• Regression B
  – sample size nB;
  – k parameters, which use up k degrees of freedom in fitting Regression B;
  – remaining DF: nB − k;
  – residual sum of squares: RSS_B.
• Regression P
  – sample size n = nA + nB;
  – k parameters, which use up k degrees of freedom in fitting Regression P;
  – remaining DF: n − k;
  – residual sum of squares: RSS_P.

To put the Chow test into the F test context, we may consider Regressions A and B together as one model with 2k parameters in total, and Regression P by itself as another model with only k parameters. We want to use the F test given by (5) to test whether the former model with 2k parameters provides a significantly better fit than the latter with k parameters. Let's fit these quantities into the formula in (5):
• improvement in fit = RSS_P − (RSS_A + RSS_B);
• extra DF used up = 2k − k = k;
• RSS remaining = RSS_A + RSS_B;
• DF remaining = (nA − k) + (nB − k) = (nA + nB) − 2k = n − 2k.

Therefore, the F test statistic for the Chow test is

F(k, n − 2k) = [(RSS_P − (RSS_A + RSS_B)) / k] / [(RSS_A + RSS_B) / (n − 2k)],

and this test statistic follows the F distribution with k and n − 2k degrees of freedom under the null hypothesis that there is no improvement in fit.

An example of the Chow test for the cost regression of the 74 secondary schools is given in the textbook and companion slides.
• Regression A (occupational schools)
  – nA = 34;
  – k = 2 parameters, which use up 2 degrees of freedom in fitting Regression A;
  – remaining DF: nA − k = 34 − 2 = 32;
  – residual sum of squares: RSS_A = 3.49 (×10¹¹).

²Think about why. See the textbook companion slides on the Chow test for a nice illustration, and equations (5.41)–(5.43) in the textbook.
• Regression B (regular schools)
  – nB = 40;
  – k = 2 parameters, which use up 2 degrees of freedom in fitting Regression B;
  – remaining DF: nB − k = 40 − 2 = 38;
  – residual sum of squares: RSS_B = 1.22 (×10¹¹).
• Regression P
  – n = 34 + 40 = 74;
  – k = 2 parameters, which use up 2 degrees of freedom in fitting Regression P;
  – remaining DF: n − k = 74 − 2 = 72;
  – residual sum of squares: RSS_P = 8.92 (×10¹¹).

Let's, again, fit these quantities into the formula in (5):
• improvement in fit = RSS_P − (RSS_A + RSS_B) = 8.92 − (3.49 + 1.22) = 8.92 − 4.71 = 4.21 (×10¹¹);
• extra DF used up = 2k − k = k = 2;
• RSS remaining = RSS_A + RSS_B = 3.49 + 1.22 = 4.71 (×10¹¹);
• DF remaining = n − 2k = 74 − 4 = 70 (= 32 + 38).

Therefore, the F test statistic for the Chow test is

F(2, 70) = [(RSS_P − (RSS_A + RSS_B))/k] / [(RSS_A + RSS_B)/(n − 2k)] = (4.21/2) / (4.71/70) = 31.3.

The critical value of the F(2, 70) distribution at the 0.1% level is 7.64 < 31.3. So we reject the null hypothesis at the 0.1% significance level, and conclude that we should run separate regressions for the two types of schools.

1.2 Limitation of separate regressions

When we have the motivation to run the separate regressions (2) and (3), we often want to see how different α̂1 and γ̂1 are, and how different α̂2 and γ̂2 are. They will certainly differ in value, but from separate regressions we cannot tell how statistically significant the differences are. This is the major drawback of separate regressions. In particular, we draw your attention to the following two problems.
1. (major problem) How can we tell how different the coefficients are? How statistically significant are these differences?
2. When you run two regressions on small samples (nA < n, nB < n) instead of one large pooled regression, there is an adverse effect on the precision of the coefficient estimates.

2 Regression with dummy variable

The usual solution to the above problems is to fit a single regression with an extra dummy variable.
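The arithmetic above is easy to check in code. Below is a minimal sketch in Python (not part of the course materials) that plugs the RSS figures from this example into formula (5):

```python
def chow_f(rss_pooled, rss_a, rss_b, n, k):
    """F statistic for the Chow test: H0 says splitting the sample gains nothing."""
    improvement = rss_pooled - (rss_a + rss_b)          # fit gained by splitting
    return (improvement / k) / ((rss_a + rss_b) / (n - 2 * k))

# Figures from the school-cost example (all RSS values in units of 10^11)
f = chow_f(rss_pooled=8.92, rss_a=3.49, rss_b=1.22, n=74, k=2)
print(round(f, 1))   # F(2, 70) statistic
print(f > 7.64)      # compare with the 0.1% critical value of F(2, 70)
```

Since the statistic far exceeds the critical value, the code reproduces the rejection of the null at the 0.1% level.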
The dummy variable is a binary variable, taking only the value 0 or 1, that indicates the category an observation belongs to. Figure 2 illustrates a dummy variable 'OCC' which takes value 1 if the observation is an occupational school, and 0 if not. We can see that the dummy variable is essentially an indicator we use to mark each observation with a certain category in the data set – just as we mark the data points with different colors (red and grey) in Figure 1.

Figure 2: Illustration of dummy variable: OCC is a 0-1 variable indicating the type of school in the data set.

We often call the text labels listed in the second column of the table in Figure 2 'categorical data'. So a dummy variable is a numerical variable which represents categorical data.

2.1 Separating two groups in one regression: dummy variable

Below we consider the intercept dummy and the slope dummy, respectively, to address the two issues raised above:
1. Overhead costs of running occupational schools and regular schools can be different;
2. Marginal costs of each additional student can also be different for occupational schools and regular schools.

2.1.1 Intercept dummy

Let's first see what happens if we introduce the above dummy variable OCC as an extra explanatory variable in the simple regression (1). Now we have a multiple regression:

Yi = β1 + β2Xi + β3Di + ui,  i = 1, ..., n,  (6)

where I use the generic notation Y, X, D for the dependent variable, the first regressor and the dummy variable. In particular,
• Yi = COSTi is the annual recurrent cost of running the ith school;
• Xi = Ni is the number of students in the ith school;
• Di = OCCi = 1 if the ith school is an occupational school, and Di = OCCi = 0 if the ith school is a regular school.

Despite being binary, D can be treated the same as any other explanatory variable in the regression.
We can fit the regression with all n = 74 observations using the OLS method, and obtain the OLS estimates β̂1, β̂2 and β̂3. The fitted regression is written as

Ŷi = β̂1 + β̂2Xi + β̂3Di,  i = 1, ..., n.

How do we interpret the parameter estimates? It becomes clear if we look at the fitted regressions (interpreted here as estimated cost functions) separately for the two types of schools:
• Di = 1: occupational school

  Ŷi = β̂1 + β̂2Xi + β̂3 · 1 = (β̂1 + β̂3) + β̂2Xi,  (7)

  where the index i runs through the indices of the occupational schools in the sample.
• Di = 0: regular school

  Ŷi = β̂1 + β̂2Xi + β̂3 · 0 = β̂1 + β̂2Xi,  (8)

  where the index i runs through the indices of the regular schools in the sample.

Now, comparing (7) and (8), we can interpret the parameter estimates as follows:
1. β̂1 is the estimated annual overhead cost for regular schools;
2. β̂2 is the estimated annual marginal cost of each additional student, for both regular and occupational schools;³
3. β̂3 is the estimated extra annual overhead cost of occupational schools over regular schools.

Note that the intercept in the fitted regression (7) for occupational schools is β̂1 + β̂3, which means the overhead cost of occupational schools is estimated as β̂1 + β̂3.

The interpretation of β̂3 is the key to understanding how the dummy variable D works in regression (6). Implicitly, we have set the regular schools as the 'reference'. We get the overhead cost estimate for the reference type of school as β̂1, the estimate of the intercept in regression (6); and β̂3 tells us how much extra overhead cost is needed for the other type of school. Since the coefficient β̂3 on the dummy variable D turns out to be the difference between the two estimated intercepts in the fitted regressions (7) and (8), capturing the parallel shift from one fitted regression line to the other, we call the dummy variable D in regression (6) an 'intercept dummy'.
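A regression like (6) can be fitted with ordinary least squares on a design matrix whose third column is the dummy. The sketch below uses synthetic data (the coefficients 50, 0.3 and 120, and all sample sizes, are made up for illustration, not taken from the school data) to show that the coefficient on D recovers the intercept shift between the two groups:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of regression (6): cost depends on enrolment,
# with a higher overhead (intercept) for occupational schools (D = 1).
n = 200
X = rng.uniform(100, 1500, n)               # number of students
D = (rng.random(n) < 0.5).astype(float)     # 1 = occupational, 0 = regular
Y = 50 + 0.3 * X + 120 * D + rng.normal(0, 30, n)   # true β1=50, β2=0.3, β3=120

# OLS on the design matrix [1, X, D]
Z = np.column_stack([np.ones(n), X, D])
b1, b2, b3 = np.linalg.lstsq(Z, Y, rcond=None)[0]
print(b1, b2, b3)   # b3 estimates the extra overhead for the D = 1 group
```

The estimate b3 should land near the true intercept gap of 120, while b2 is the common slope shared by both groups, exactly as the interpretation of (7) and (8) says.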
Obtaining standard errors and conducting hypothesis tests in a regression with a dummy variable are no different from usual. It can be very useful to perform a t test on the coefficient of the dummy variable to see whether there is a significant difference in the overhead costs of the two types of school. See the textbook and companion slides for more discussion.

³Note that it is a restriction of this model that the marginal costs for the two types of school have to be the same. Since this restriction sounds unrealistic, we will relax it in the next section.

2.1.2 Slope dummy

The intercept dummy can only shift the fitted regression line in a parallel manner; it cannot allow the two fitted lines to have different slopes. As mentioned above, this is the limitation of the intercept-dummy model, in which the marginal costs for both types of school have to be the same. That is clearly not a plausible assumption from visual inspection of Figure 1: the fitted regression (cost function) for occupational schools should be steeper, and that for regular schools flatter. To allow the slopes to differ, we introduce another variable, the 'slope dummy' X · D, into regression (6), and get

Yi = β1 + β2Xi + β3Di + β4XiDi + ui,  i = 1, ..., n,  (9)

where the variables Y, X and D are the same as before. Again, we can fit the regression with all n = 74 observations and obtain the OLS estimates β̂1, β̂2, β̂3 and β̂4. The fitted regression is written as

Ŷi = β̂1 + β̂2Xi + β̂3Di + β̂4XiDi,  i = 1, ..., n.

We look at the fitted regressions separately for the two types of school:
• Di = 1: occupational school

  Ŷi = β̂1 + β̂2Xi + β̂3 · 1 + β̂4Xi · 1 = (β̂1 + β̂3) + (β̂2 + β̂4)Xi,  (10)

  where the index i runs through the indices of the occupational schools in the sample.
• Di = 0: regular school

  Ŷi = β̂1 + β̂2Xi + β̂3 · 0 + β̂4Xi · 0 = β̂1 + β̂2Xi,  (11)

  where the index i runs through the indices of the regular schools in the sample.
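The slope-dummy regression (9) amounts to adding the product X · D as a fourth column of the design matrix. A minimal synthetic sketch (all coefficients below are made up for illustration) shows the two group slopes being recovered as β̂2 and β̂2 + β̂4:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.uniform(100, 1500, n)
D = (rng.random(n) < 0.5).astype(float)

# Regression (9): both the overhead and the marginal cost differ by group
# (true β1=50, β2=0.3, β3=120, β4=0.2, so the D=1 slope is 0.5).
Y = 50 + 0.3 * X + 120 * D + 0.2 * X * D + rng.normal(0, 30, n)

Z = np.column_stack([np.ones(n), X, D, X * D])   # intercept, X, intercept dummy, slope dummy
b1, b2, b3, b4 = np.linalg.lstsq(Z, Y, rcond=None)[0]
print(b2, b2 + b4)   # estimated slopes for the D = 0 and D = 1 groups
```

Here b4 plays the role of β̂4 in (10): the extra marginal cost per student for the D = 1 group.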
The fitted regressions (10) and (11) clearly show that both the intercepts and the slopes can now differ, with the differences captured by β̂3 and β̂4, respectively. Specifically, the parameter estimates are now interpreted as follows:
1. β̂1 is the estimated annual overhead cost for regular schools (the reference);
2. β̂2 is the estimated annual marginal cost of each additional student for regular schools (the reference);
3. β̂3 is the estimated extra annual overhead cost for occupational schools;
4. β̂4 is the estimated extra annual marginal cost for occupational schools.

Again, we can perform t tests as usual. The t test on the slope-dummy coefficient β̂4 can tell us whether the marginal cost per student in an occupational school is significantly higher than that in a regular school.

We can also perform an F test of the joint explanatory power of the intercept dummy and slope dummy in regression (9) by testing

H0: β3 = β4 = 0.  (12)

To do this, we compare the RSS from regression (9), where both dummies are included, with the RSS from regression (1), where they are not, and use the usual F test statistic and critical values to make the testing decision. If we reject the null, then at least one of β3 and β4 is different from zero.

In fact, the Chow test we introduced at the beginning of this lecture is equivalent to this F test. We can verify this in the same data example. Recall that we had RSS_A + RSS_B = 4.71 × 10¹¹ in Section 1.1. For regression (9) on the whole sample, with both the intercept and slope dummy variables, the residual sum of squares is also 4.71 × 10¹¹. Note that the number of parameters in regression (9) is 4, the same as the total number of parameters in the subsample Regressions A and B. Therefore, the following two model-fit comparisons are equivalent:
• compare running the pooled regression (1) with running the separate regressions A and B in (2) and (3):
  – both regressions have no dummy variables;
  – regression (1) uses the whole sample, while regressions A and B are fitted separately on the two subsamples.
• compare regression (1) with regression (9):
  – both regressions use the whole sample;
  – regression (1) has no dummies, while regression (9) has both intercept and slope dummies.

2.2 Dummy variable trap: perfect multicollinearity

Wait. There are two types of school in the data, but why do we use only one dummy variable to separate the two groups? Can we use two dummies, one for occupational schools and another for regular schools? In other words, can we set

D^o = 1 if occupational school, 0 if regular school,
D^r = 0 if occupational school, 1 if regular school,

and include both of them in a regression

Y = β1 + β2X + β3D^o + β4D^r + u?  (13)

The answer is NO. We fall into the classic 'dummy variable trap' if we do this, because we run into a special case of the 'perfect multicollinearity' discussed in the lecture on multiple regression. Note what is special about the two dummies D^o and D^r defined above: they always add to one, simply because every school in the sample is either an occupational school or a regular school. So we have⁴

D^o + D^r = 1.  (14)

Look at the 1 on the right-hand side of this equation: can you see that it also hides in our regression (13)⁵ as one of the regressors? Yes, the constant 1 is the first regressor in the regression. Note we can always write (13) as

Y = β1X1 + β2X2 + β3X3 + β4X4 + u,  where X1 = 1, X2 = X, X3 = D^o, X4 = D^r.  (15)

Then from (15) and (14), it is clear that we have

X1 = X3 + X4,

which is a perfect linear relationship among the regressors! With such perfect multicollinearity, we cannot perform the OLS estimation.

As you might have imagined, we can escape from the dummy variable trap by excluding the intercept term from the regression. For example, instead of (13), we may run the following regression

Y = β2X + β3D^o + β4D^r + u,  (16)

which is perfectly fine.

⁴To be more explicit, we actually have D^o_i + D^r_i = 1 for all i = 1, ..., n in the sample.
⁵Well, it hides in any regression with an intercept term as one of the regressors.
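The dummy variable trap can be made concrete by checking the rank of the design matrix. In the sketch below (synthetic data, made up for illustration), including the intercept together with both dummies produces a rank-deficient matrix, while dropping the intercept as in (16) restores full column rank:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 74
X = rng.uniform(100, 1500, n)
Do = (rng.random(n) < 0.5).astype(float)   # 1 = occupational
Dr = 1.0 - Do                              # 1 = regular, so Do + Dr = 1 always

# Design matrix for regression (13): intercept, X, and BOTH dummies
Z_trap = np.column_stack([np.ones(n), X, Do, Dr])
print(np.linalg.matrix_rank(Z_trap))       # 3, not 4: column 1 equals Do + Dr

# Dropping the intercept, as in regression (16), removes the dependence
Z_ok = np.column_stack([X, Do, Dr])
print(np.linalg.matrix_rank(Z_ok))         # 3: full column rank, OLS is feasible
```

With the rank-deficient matrix, X'X is singular and the usual OLS formula has no unique solution, which is exactly the failure described above.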
Without perfect multicollinearity, we can obtain the OLS estimates β̂2, β̂3 and β̂4. Following a similar analysis as before (noting that D^o and D^r can never both be one, or both be zero), we have:
• D^o = 1 and D^r = 0: occupational school fitted regression

  Ŷ = β̂3 + β̂2X.  (17)

• D^r = 1 and D^o = 0: regular school fitted regression

  Ŷ = β̂4 + β̂2X.  (18)

The interpretations of the parameters are now different in general:
1. β̂2 is the estimated annual marginal cost of each additional student, for both regular and occupational schools. −→ This is the same as in the earlier case with only an intercept dummy.
2. β̂3 is the estimated annual overhead cost for occupational schools.
3. β̂4 is the estimated annual overhead cost for regular schools.

The difference here is that there is no reference type of school any more! Neither of the dummy-variable coefficients is interpreted as the 'extra' overhead cost of one type over the other; they simply estimate the overhead costs of the two types of school separately.

Summary: if we keep the intercept term in the regression, then to separate two groups we need only one dummy variable. The general rule is that we need M − 1 dummy variables to separate M categories from one grouping criterion, if the regression has an intercept.

3 More than two groups: more than one dummy variable

In practice, there are often more than two distinct groups in the data: either because we obtain more than two categories by dividing the observations according to one grouping criterion, or because we use multiple criteria to group the observations.
• One grouping criterion: for example, when we group the 74 secondary schools in Shanghai by type of curriculum, we can do a finer job than just classifying them as occupational or regular. In fact, there are two types of occupational school: technical schools training technicians, and skilled workers' schools training craftsmen. There are also two types of regular secondary school in Shanghai: general schools, which provide the usual academic education, and vocational schools. So in total there are 4 types of school among the 74 schools. Figure 3 marks these 4 types with 4 different colors.
• Multiple grouping criteria: suppose we also want to take into account the fact that some schools are residential and some are not. Then we can use two grouping criteria: residential or not, and occupational or not. Since each of the two criteria has two categories, in total we again divide the observations into 4 groups. This is illustrated in Figure 4.

In the following we consider these two cases.

3.1 More than two groups from one grouping criterion

This is the case illustrated in Figure 3, and the example is a straightforward extension of the two-group example discussed before.

3.1.1 M categories: M − 1 dummies

To separate the four groups of schools shown in Figure 3, we need 4 − 1 = 3 dummy variables. They are illustrated in the last three columns of the table in Figure 5. Where is the general school? Yes, it is chosen as the reference type/category; therefore we only see dummy variables for the other three types: technical schools, workers' schools, and vocational schools. The reference category is hence usually described as the 'omitted' category. The regression we run is

Yi = β1 + β2Xi + β3D^T_i + β4D^W_i + β5D^V_i + ui,  i = 1, ..., n,  (19)

Figure 3: Cost against number of students for 74 secondary schools in Shanghai classified into four categories.
Figure 4: Cost against number of students for 74 secondary schools in Shanghai, classified by two criteria: residential/nonresidential and regular/occupational.

where
• Yi = COSTi is the annual recurrent cost of running the ith school;
• Xi = Ni is the number of students in the ith school;
• D^T_i = TECHi = 1 only if the ith school is a technical school;
• D^W_i = WORKERi = 1 only if the ith school is a workers' school;
• D^V_i = VOCi = 1 only if the ith school is a vocational school.

Keeping in mind that the reference category is the general school, we obtain the following interpretations of the parameter estimates:
1. β̂1 is the estimated annual overhead cost for general schools (the reference);
2. β̂2 is the estimated annual marginal cost of each additional student, for all schools (because there is no slope dummy yet);
3. β̂3 is the estimated extra annual overhead cost for technical schools over general schools;
4. β̂4 is the estimated extra annual overhead cost for workers' schools over general schools;
5. β̂5 is the estimated extra annual overhead cost for vocational schools over general schools.

The standard errors and hypothesis tests are no different from usual. The analysis done for two groups with one dummy variable generalizes directly to this case with more than two groups.

3.1.2 Change of reference category

In the above regression we chose the general school as the reference category, so we can compare the overhead costs of the other school types with general schools, and test whether the differences are significant using a t test on each individual parameter β3, β4 and β5. What if we were interested in testing whether the overhead costs of workers' schools differ from those of the other types of school? −→ The easiest way is to re-run the regression making workers' schools the reference category.
This is simple: we just get rid of the dummy for workers' schools and add a dummy for general schools in regression (19)! What do we expect to see from the estimation of the regression with the new reference?
• The parameter estimates for the intercept and the dummy coefficients will certainly be different, except for β̂2 (the estimated marginal cost for all school types).
• The fitted regression (cost function) for each category should remain the same!
See more detailed discussion in the textbook and the companion slides.

3.2 Multiple grouping criteria

To separate the four groups of schools shown in Figure 4, formed by two grouping criteria, we need two sets of dummy variables. They are illustrated in the last two columns of the table in Figure 6. Since there are only two categories under each grouping criterion, we need one (two minus one) dummy variable for each criterion; in total we need two dummy variables. The regression we run is

Yi = β1 + β2Xi + β3OCCi + β4RESi + ui,  i = 1, ..., n,  (20)

where
• Yi = COSTi is the annual recurrent cost of running the ith school;
• Xi = Ni is the number of students in the ith school;
• OCCi = 1 if the ith school is an occupational school;
• RESi = 1 if the ith school is a residential school.

Figure 5: Three dummy variables for separating four types of secondary schools in Shanghai.

Figure 6: Two sets of dummy variables: OCC and RES.

Obviously, the ith school can be both occupational and residential, or neither. This means OCCi and RESi can both be one, or both be zero. The fitted regression is written as (omitting the observation subscript i)

Ŷ = β̂1 + β̂2X + β̂3OCC + β̂4RES.

To interpret the parameter estimates, we consider:
• OCC = 0, RES = 0: regular, nonresidential school cost function

  Ŷ = β̂1 + β̂2X.  (21)

• OCC = 0, RES = 1: regular, residential school cost function

  Ŷ = (β̂1 + β̂4) + β̂2X.  (22)

• OCC = 1, RES = 0: occupational, nonresidential school cost function

  Ŷ = (β̂1 + β̂3) + β̂2X.
(23)

• OCC = 1, RES = 1: occupational, residential school cost function

  Ŷ = (β̂1 + β̂3 + β̂4) + β̂2X.  (24)

Interpretations:
1. β̂1 is the overhead cost for a regular, nonresidential school −→ see (21).
2. β̂2 is the marginal cost of each additional student for all schools (because there is no slope dummy yet) −→ see (21)–(24).
3. β̂3 is the extra overhead cost of an occupational, nonresidential school over a regular, nonresidential school (compare (21) and (23)), and also the extra overhead cost of an occupational, residential school over a regular, residential school (compare (22) and (24)).
4. β̂4 is the extra overhead cost of a regular, residential school over a regular, nonresidential school (compare (21) and (22)), and also the extra overhead cost of an occupational, residential school over an occupational, nonresidential school (compare (23) and (24)).

Clearly, β̂3 estimates the extra overhead cost of an occupational school over a regular school, irrespective of whether the school is residential. Likewise, β̂4 estimates the extra overhead cost of a residential school over a nonresidential school, irrespective of whether the school is occupational. This is part of the restrictions built into the model.
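The additive structure of (21)–(24) can be seen in a small simulation. The sketch below (synthetic data; the coefficients 50, 0.3, 120 and 40 are made up for illustration) fits regression (20) and rebuilds the four group intercepts from only three estimated parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
X = rng.uniform(100, 1500, n)
OCC = (rng.random(n) < 0.5).astype(float)   # 1 = occupational
RES = (rng.random(n) < 0.5).astype(float)   # 1 = residential

# Additive structure of regression (20): true β1=50, β2=0.3, β3=120, β4=40
Y = 50 + 0.3 * X + 120 * OCC + 40 * RES + rng.normal(0, 30, n)

Z = np.column_stack([np.ones(n), X, OCC, RES])
b1, b2, b3, b4 = np.linalg.lstsq(Z, Y, rcond=None)[0]

# The four fitted intercepts from (21)-(24), built from just three parameters
intercepts = {
    "regular nonresidential": b1,
    "regular residential": b1 + b4,
    "occupational nonresidential": b1 + b3,
    "occupational residential": b1 + b3 + b4,   # additivity is the model's restriction
}
print({k: round(v) for k, v in intercepts.items()})
```

Note the restriction mentioned in the text: the occupational premium b3 is forced to be the same whether or not the school is residential, and likewise for b4; relaxing it would require an OCC × RES interaction dummy.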