Dr Güven Demirel
BUSM014 Quantitative Research Methods
Revision Lecture, 2020/2021 Semester A

Module Outline
1. Introduction & Structuring Research
2. Univariate Descriptive Statistics
3. Bivariate Descriptive Statistics, Statistical Graphs & Tables
4. Distributions, LLN & CLT, Confidence Intervals
5. Experiments, Comparing Groups, Testing Hypotheses
6. Testing Hypotheses (ctd)
7. Reading week (no lectures/labs)
8. Regression Analysis – Introduction
9. Regression Analysis – Inference
10. Regression Analysis – Further Aspects
11. Review
12. Linking theory and empirics, developing hypotheses

Revision Lecture Outline
• Recap of Week 9
• Model Specification
• Functional Forms
• Dummy Variables
• Applications

Hypothesis Testing

Statistical Inference - Overview
• Point estimation: a "best guess" numerical value for an unknown population parameter (sample mean, sample standard deviation, standard error, ...).
• Confidence interval: an interval estimate for an unknown population parameter (CI for a mean, CI for the difference between group means).
• Hypothesis testing: specify hypotheses for YES/NO questions (e.g. "Does mean income differ between men and women?") and use sample data to test them.

Steps of Hypothesis Testing
1. Given the hypothesis you intend to test (typically the alternative hypothesis $H_1$), determine its null hypothesis ($H_0$). In other words, the no-effect version of your hypothesis.
2. Choose the significance level $\alpha$.
3. Compute the probability value (p-value) of observing the signal you have. To calculate the p-value we need a test statistic.
4. Compare the p-value to the significance level: if it is smaller, reject $H_0$; otherwise, fail to reject $H_0$.

t-Statistic for Hypothesis Testing on a Population Mean
• Null hypothesis $H_0$: $E(Y) = \mu_{Y,0}$
• Alternative hypothesis $H_1$: $E(Y) \neq \mu_{Y,0}$
• Test statistics take the general form $t = \text{signal} / \text{noise}$.
• For hypothesis tests on means and on differences between means, we use the t-statistic.
  o Signal: $\bar{y} - \mu_{Y,0}$
  o Noise: $\sigma_Y / \sqrt{n}$
• The population standard deviation $\sigma_Y$ is typically unknown, hence we instead use
  $t = \dfrac{\bar{y} - \mu_{Y,0}}{SE(\bar{y})}$
• $SE(\bar{y}) = s_Y / \sqrt{n}$: the standard error of the sample mean.

Significance Level and Critical Value
• Significance level ($\alpha$): the probability with which you are willing to reject the null hypothesis when it is true, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$.
• Under the null hypothesis, the t-statistic $t = \dfrac{\bar{y} - \mu_{Y,0}}{SE(\bar{y})} \sim N(0,1)$ for large $n$. Therefore $\alpha = P(|t| > t_c)$.
• $t_c$: the critical value of the t-statistic, beyond which more extreme values are observed with probability $\alpha$.
• For $\alpha = 0.05$, $t_c = 1.96$ (obtained from the standard normal distribution for large $n$).
• But for small $n$, the CLT does not apply.
• For small samples, if the population is normally distributed, we can use the t-distribution.

Null Hypothesis Rejection Rule
• p-value (probability value): the probability of observing the given or a more extreme signal, given that the null hypothesis $H_0$ is true: $\text{p-value} = P(|t| > |t^{act}| \mid H_0)$.
• For a given $\alpha = 0.05$, the test conditions on the observed statistic are:
  o Reject $H_0$ if $|t^{act}| \geq t_c$, which is equivalent to $\text{p-value} \leq \alpha$.
  o Fail to reject otherwise.
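As a concrete illustration of these four steps, here is a minimal sketch, not from the slides, assuming Python with numpy and scipy; the simulated income sample, the seed, and the hypothesised mean of 30 are all invented for the example:

```python
import numpy as np
from scipy import stats

# Step 1: H0: E(Y) = 30 versus H1: E(Y) != 30 (two-sided)
mu_0 = 30.0
# Step 2: choose the significance level
alpha = 0.05

# Simulated sample of n = 50 incomes (in 1000s); purely illustrative
rng = np.random.default_rng(42)
y = rng.normal(loc=33.0, scale=8.0, size=50)

# Step 3: t-statistic = signal / noise = (ybar - mu_0) / SE(ybar)
n = len(y)
se = y.std(ddof=1) / np.sqrt(n)                  # SE(ybar) = s_Y / sqrt(n)
t_act = (y.mean() - mu_0) / se
p_value = 2 * stats.t.sf(abs(t_act), df=n - 1)   # P(|t| > |t_act|) under H0

# Step 4: reject H0 if the p-value is at most alpha
print(f"t = {t_act:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")

# The same test in one call, as a cross-check
t_check, p_check = stats.ttest_1samp(y, mu_0)
```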
Two-Sided versus One-Sided t-test
• Two-sided t-test: the alternative hypothesis has no sense of directionality; we have no idea whether the true value is larger or smaller. $H_1$: $E(Y) \neq \mu_{Y,0}$, with rejection regions in both tails.
• One-sided t-test: the alternative hypothesis specifies the direction, i.e. smaller or larger: $H_1$: $E(Y) < \mu_{Y,0}$ or $H_1$: $E(Y) > \mu_{Y,0}$. The p-value and $t_c$ are based on only one tail.

Hypothesis Tests for Comparing Population Means
$H_0$: $E(Y_1) - E(Y_2) = 0$
$H_1$: $E(Y_1) - E(Y_2) \neq 0$; $E(Y_1) - E(Y_2) > 0$; or $E(Y_1) - E(Y_2) < 0$
• Paired samples t-test: the same people/firms are observed before and after the treatment, and the equality of the population means is tested (e.g. before–after comparison of the treatment group).
• Independent samples t-test: the two groups tested are drawn independently, and the equality of the population means is tested (e.g. checking for balance).
• In a paired test, there are n paired observations. The differences between the pairs are the data, and the standard error uses the sample standard deviation of the difference data.
• In independent samples, we compare the mean values of the variable in the two groups, and the standard error uses a weighted (pooled) sample variance.

Testing for Statistical Independence
$H_0$: categorical variables X & Y are statistically independent.
$H_1$: there is an association between X & Y.
• For each cell in row i (R rows) and column j (C columns), we have:
  o Observed count: $O_{i,j}$
  o Expected count: $E_{i,j} = \dfrac{\text{row total}_i \times \text{column total}_j}{n}$
• Chi-square statistic: $\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \dfrac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$, from which the p-value is obtained.

Regression Analysis

Simple Regression Model
• Population model (simple regression model): $y = \beta_0 + \beta_1 x + u$
  Example (return to education): $wage = \beta_0 + \beta_1 educ + u$
• We refer to $y$ as the dependent variable (or regressand, explained variable, response variable, target variable).
• In the simple and multiple regression models, $y$ is continuous.
• We refer to $x$ as the independent variable (or regressor, explanatory variable, covariate, feature).
• The independent variable can be continuous or categorical (assume continuous, unless otherwise stated).
• The variable $u$ is called the error term or disturbance.
• $\beta_0$: intercept parameter
• $\beta_1$: slope parameter

Simple Regression Estimators
• The regression model is for the population, and the parameters $\beta_0$ and $\beta_1$ are unknown.
• Hence the error term $u$ is unknown.
• We have a sample $\{(x_i, y_i): i = 1, \dots, n\}$.
• We use the data to estimate the parameters $\beta_0$ and $\beta_1$: $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$
  $\hat{\beta}_0$: estimator of the intercept parameter $\beta_0$
  $\hat{\beta}_1$: estimator of the slope parameter $\beta_1$
• Sample regression function (regression line): $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
  $\hat{y}_i$ is the predicted / fitted value of $y$ for observation $i$.
  $\hat{u}_i = y_i - \hat{y}_i$: residual

Ordinary Least Squares (OLS) Estimators
• Many alternative rules can be used to estimate the population parameters. Which estimator should we use?
• Criterion: minimise the Sum of Squared Residuals
  $SSR(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

Ordinary Least Squares (OLS) Estimators – Simple Regression
• The Sum of Squared Residuals $SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$ is minimised when the partial derivatives of $SSR$ with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ are zero.
• OLS slope estimator: $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{s_{xy}}{s_x^2}$
• The slope is the ratio of the sample covariance between the independent and the dependent variables to the sample variance of the independent variable.
• OLS intercept estimator: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
• Consequence: the sample average point $(\bar{x}, \bar{y})$ is always on the OLS regression line.
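The slope and intercept formulas can be applied directly. A minimal sketch, not from the slides, assuming Python with numpy; the education and wage numbers are invented:

```python
import numpy as np

# Toy data: x = years of education, y = hourly wage (made-up numbers)
x = np.array([10, 12, 12, 14, 16, 16, 18, 21], dtype=float)
y = np.array([6.5, 8.0, 7.2, 9.5, 12.0, 11.1, 14.3, 17.0])

# OLS slope: sample covariance of (x, y) over sample variance of x
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# OLS intercept: forces the regression line through (xbar, ybar)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x      # fitted values
u_hat = y - y_hat              # residuals

print(f"beta0_hat = {beta0:.3f}, beta1_hat = {beta1:.3f}")
print(f"residuals sum to ~0: {u_hat.sum():.2e}")
```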
Multiple Linear Regression Model
• There are multiple explanatory variables that can affect the dependent variable.
• Example: Gender, Age, Tenure, and Ethnicity affect Wage alongside Education. If our focus is on Education, we call the rest of the explanatory variables control variables.
• Model: $wage = \beta_0 + \beta_1 educ + \beta_2 tenure + \beta_3 female + u$
• Multiple linear regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
• $\beta_0$: intercept parameter
• $\beta_1, \dots, \beta_k$: slope parameters
• $x_1, \dots, x_k$: explanatory variables
• $y$: dependent variable
• $u$: error / disturbance

Interpretation of Parameters
• Multiple linear regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
• Each slope coefficient provides the effect of its variable, holding the other explanatory variables constant (restricted to the variables in the model).
• Model: $wage = \beta_0 + \beta_1 educ + \beta_2 tenure + \beta_3 female + u$
  o $\beta_1$ is the (expected) effect of one additional year of education, given that tenure and gender are not changing: $\Delta wage = \beta_1 \Delta educ$.
  o $\beta_2$ is the effect of one additional year worked in the company, given that education and gender are not changing.
  o $\beta_3$ is the mean difference between the wages of women and men, given that education and tenure are not changing.

Ordinary Least Squares (OLS) Estimators – Multiple Regression
• The Sum of Squared Residuals $SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik})^2$ is minimised when the partial derivatives of $SSR$ with respect to $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are zero.
• OLS slope estimator: $\hat{\beta}_j = \dfrac{\sum_{i=1}^{n} \hat{r}_{ij} y_i}{\sum_{i=1}^{n} \hat{r}_{ij}^2}$, where the $\hat{r}_{ij}$ are the residuals of the auxiliary regression of $x_j$ on all other explanatory variables.
• Interpretation: $\hat{\beta}_j$ is the effect of the part of the independent variable $x_j$ that is not correlated with the rest of the explanatory variables.
• Partialling-out result: $\hat{\beta}_j$ is called the partial effect of $x_j$ on $y$, which enables the ceteris paribus interpretation (i.e. holding other factors constant).
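The partialling-out result can be verified numerically. A minimal sketch, not from the slides, assuming Python with numpy; the data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)             # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients, with an intercept column prepended to X."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# Multiple regression of y on (x1, x2): beta = (b0, b1, b2)
beta = ols(np.column_stack([x1, x2]), y)

# Auxiliary regression of x1 on x2; r1 holds the residuals r_hat_i1
g = ols(x2, x1)
r1 = x1 - (g[0] + g[1] * x2)

# Slope from the partialling-out formula: sum(r1 * y) / sum(r1^2)
b1_partial = (r1 @ y) / (r1 @ r1)

print(beta[1], b1_partial)   # identical up to floating-point error
```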
Goodness of Fit
• The goodness of fit of the model quantifies how well the model predicts the dependent variable.
• $R^2$ (R-squared), also called the coefficient of determination, provides the goodness of fit: $R^2 = SSE / SST$.
• $R^2$ is the ratio of the explained variation $SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ to the total variation $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ in the dependent variable. $y_i$: actual value; $\hat{y}_i$: predicted value; $\bar{y}$: sample mean.
• $0 \leq R^2 \leq 1$.
• A low or high R-squared is not per se bad or good. In social science, low R-squared values are common, as there are many unobserved factors and true randomness.
• $R^2$ never decreases when new variables are added (given no missing data), which will give us criteria for whether a group of control variables should be added to the model.

Properties of OLS Estimators

Unbiasedness of OLS Estimators for Multiple Regression
MLR Assumptions:
1. Linearity in parameters: the population model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$.
2. Random sampling: we have a random sample of size n, $\{(x_{i1}, x_{i2}, \dots, x_{ik}, y_i): i = 1, \dots, n\}$.
3. No perfect collinearity: in the sample, none of the explanatory variables is constant and there are no exact linear relationships between the independent variables. They can be correlated, but not perfectly.
4. Zero conditional mean: the error has an expected value of zero given any values of the explanatory variables, $E(u \mid x_1, x_2, \dots, x_k) = 0$.
If MLR Assumptions 1–4 hold, then the OLS estimators are unbiased: $E(\hat{\beta}_0) = \beta_0$, $E(\hat{\beta}_1) = \beta_1$, $E(\hat{\beta}_2) = \beta_2$, ..., $E(\hat{\beta}_k) = \beta_k$.
• For a random sample, the main concern is the last assumption.

Homoskedasticity Assumption
Additional assumption:
5. Homoskedasticity: the error has the same variance given any values of the explanatory variables, $Var(u \mid x_1, x_2, \dots, x_k) = \sigma^2$. (The slides illustrate this with homoskedastic versus heteroskedastic scatter plots.)

MLR Assumptions, Unbiasedness, and Efficiency of OLS
Unbiasedness of OLS: if MLR Assumptions 1–4 (linearity in parameters; random sampling; no perfect collinearity; zero conditional mean) hold, then the OLS estimators are unbiased:
• $E(\hat{\beta}_0) = \beta_0$, $E(\hat{\beta}_1) = \beta_1$, $E(\hat{\beta}_2) = \beta_2$, ..., $E(\hat{\beta}_k) = \beta_k$
Efficiency of OLS: if the Gauss-Markov Assumptions (MLR 1–4 + homoskedasticity) hold, then the OLS estimator $\hat{\beta}_j$ for $\beta_j$ is the Best Linear Unbiased Estimator (BLUE).
• Best: the estimator with the least variance
• Linear: $\tilde{\beta}_j = \sum_{i=1}^{n} w_{ij} y_i$, a linear combination of the $y_i$

Variance of OLS Estimator
If MLR Assumptions 1–5 hold, then the variance of the OLS estimator $\hat{\beta}_j$ is
$Var(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j} VIF_j$.
• Error variance $\sigma^2$: the higher the noise in the equation, the more difficult it is to estimate partial effects.
  o The only way to reduce it is to add more explanatory variables.
• Total sample variation in $x_j$: $SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 = (n-1) s_{x_j}^2$. The more sample variation in $x_j$, the more precise the estimate becomes.
  o The higher the sample variance $s_{x_j}^2$, the easier it is to estimate $\beta_j$.
  o The larger the sample size $n$, the more precise the estimate becomes.
• Variance Inflation Factor: $VIF_j = 1 / (1 - R_j^2)$. The higher the VIF of $x_j$, the less precise the estimate becomes.

Variance Inflation Factor and Multicollinearity
If MLR Assumptions 1–5 hold, then the variance of the OLS estimator $\hat{\beta}_j$ is $Var(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j} VIF_j$.
• Variance Inflation Factor: $VIF_j = 1 / (1 - R_j^2)$. The higher the VIF of $x_j$, the less precise the estimate becomes.
• Here $R_j^2$ is the R-squared of the auxiliary regression of $x_j$ on $x_1, x_2, \dots, x_{j-1}, x_{j+1}, \dots, x_k$.
  o The higher the proportion of the variation in $x_j$ that can be explained by the variation in the other variables ($R_j^2$), the less precise the estimate is. Hence, if the variable is linearly related to the other variables, it becomes harder to estimate its partial effect. A high $VIF_j$ of this kind is called multicollinearity.
  o Multicollinearity does not violate the no-perfect-collinearity assumption, which is violated only when $R_j^2 = 1$, but it makes estimates imprecise.
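How $VIF_j$ comes out of the auxiliary regression can be illustrated numerically. A minimal sketch, not from the slides, assuming Python with numpy; the simulated regressors are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 strongly related to x1
x3 = rng.normal(size=n)                    # x3 unrelated to the others
X = np.column_stack([x1, x2, x3])

def r_squared(X, y):
    """R^2 of the OLS regression of y on X (intercept included)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    y_hat = Z @ beta
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest
for j in range(X.shape[1]):
    r2_j = r_squared(np.delete(X, j, axis=1), X[:, j])
    print(f"VIF_{j + 1} = {1.0 / (1.0 - r2_j):.2f}")   # high for x1 and x2
```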
OLS Standard Error
Under the Gauss-Markov Assumptions, $Var(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j} VIF_j$. But we do not know the error variance $\sigma^2$, which needs to be estimated from the data.
• Under the Gauss-Markov Assumptions, the following is an unbiased estimator of the error variance, i.e. $E(\hat{\sigma}^2) = \sigma^2$:
  $\hat{\sigma}^2 = \dfrac{SSR}{n - k - 1} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}$
  $\hat{\sigma}$ is called the standard error of the regression.
• This leads to the standard error of $\hat{\beta}_j$, which is the estimator of the sampling standard deviation of $\hat{\beta}_j$:
  o Sampling standard deviation of $\hat{\beta}_j$: $sd(\hat{\beta}_j) = \sqrt{\dfrac{\sigma^2}{SST_j} VIF_j}$
  o Standard error of $\hat{\beta}_j$: $se(\hat{\beta}_j) = \sqrt{\dfrac{\hat{\sigma}^2}{SST_j} VIF_j}$

Distribution of OLS Estimators
• We have learned the mean and the sampling variation of the OLS estimator $\hat{\beta}_j$. The next question is: "What is the sampling distribution of $\hat{\beta}_j$?" If we know it, we can test hypotheses on regression coefficients. For this, we need an additional assumption.
Classical Linear Model (CLM) Assumptions (Gauss-Markov & normality): (1) linearity in parameters; (2) random sampling; (3) no perfect collinearity; (4) zero conditional mean; (5) homoskedasticity; and
6. Normality: the error $u$ is independent of the explanatory variables and is normally distributed with mean zero and variance $\sigma^2$:
  $y \mid x_1, \dots, x_k \sim \text{Normal}(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,\ \sigma^2)$
Under the CLM assumptions, $\hat{\beta}_j \sim \text{Normal}(\beta_j, Var(\hat{\beta}_j))$. Hence, the standardized t-statistic follows a t-distribution with $n - k - 1$ degrees of freedom:
  $t = \dfrac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1}$

Testing Hypotheses About a Single Parameter
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
Testing whether the variable $x_j$ has a significant effect on the dependent variable, once the other explanatory variables are accounted for:
$H_0$: $\beta_j = 0$
$H_1$: $\beta_j < 0$ (directional); $H_1$: $\beta_j \neq 0$ (un-directional); $H_1$: $\beta_j > 0$ (directional)
• We know that the t-statistic $t = \hat{\beta}_j / se(\hat{\beta}_j)$ is distributed as $t_{n-k-1}$ under the null hypothesis, where the real effect is zero. Hence we can carry out the hypothesis test by calculating the p-value and comparing it with the significance level $\alpha$.
• Remember that we have made the CLM assumptions. If the sample is random and the error term is not correlated with the explanatory variables, the t-statistic follows a standard normal distribution for large sample sizes even if the population error is not normal.

F-statistic
Unrestricted model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_{k-q} x_{k-q} + \beta_{k-q+1} x_{k-q+1} + \dots + \beta_k x_k + u$
q exclusion restrictions, $H_0$: $\beta_{k-q+1} = 0, \dots, \beta_k = 0$
Restricted model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_{k-q} x_{k-q} + u$
Fact: R-squared decreases as explanatory variables are removed from the model, that is, $R^2_{UR} \geq R^2_R$. Here $R^2_{UR}$ and $R^2_R$ are the R-squared values of the unrestricted and restricted models.
Question: is the decrease in R-squared statistically significant?
Test statistic (F-statistic):
  $F = \dfrac{(R^2_{UR} - R^2_R)/q}{(1 - R^2_{UR})/(n - k - 1)}$

Testing Overall Significance of a Regression
Unrestricted model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$, with $R^2_{UR} = R^2$
k exclusion restrictions, $H_0$: $\beta_1 = 0, \dots, \beta_k = 0$
Restricted model: $y = \beta_0 + u$, with $R^2_R = 0$
Question: does the overall model explain any significant variation in the dependent variable? Are all the variables jointly significant? Is the model plausible at all?
F-statistic: $F = \dfrac{R^2 / k}{(1 - R^2)/(n - k - 1)}$ is distributed as $F_{k,\, n-k-1}$.
• This is the default F-test reported in Stata.
• If the p-value $\geq \alpha$, fail to reject $H_0$: there is not enough evidence that the explanatory variables jointly help to explain the dependent variable, so the model as a whole is not supported.
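A minimal sketch of the R-squared form of the F-test, not from the slides, assuming Python with numpy and scipy; the data-generating process and the choice of $q = 2$ exclusion restrictions are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, q = 120, 3, 2
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only x1 matters in truth

def r_squared(X, y):
    """R^2 of the OLS regression of y on X (intercept included)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    y_hat = Z @ beta
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_ur = r_squared(X, y)             # unrestricted: all k regressors
r2_r = r_squared(X[:, :k - q], y)   # restricted: last q regressors excluded

# F = [(R2_ur - R2_r) / q] / [(1 - R2_ur) / (n - k - 1)]
F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))
p_value = stats.f.sf(F, q, n - k - 1)
print(f"F = {F:.3f}, p-value = {p_value:.3f}")  # expect a large p-value here
```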
Omitted Variables, Functional Forms, Dummy Variables, Interactions

Multiple Regression Analysis: Perspective
• How many variables should we add? On what basis should variables be included in an empirical model?
  o Model under-specification: leaving out variables that affect the dependent variable and are also correlated with the existing variables leads to biased OLS estimators (Omitted Variable Bias, OVB).
  o Model over-specification: including irrelevant variables that do not affect the dependent variable does not lead to bias, but leads to higher variance of the estimators; hence it should be avoided.

Omitted Variable Bias
• If you leave out variables that affect the dependent variable and are also correlated with the existing variables, this leads to biased OLS estimators (Omitted Variable Bias, OVB).
• True model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
  o OLS estimators: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k$
  o Unbiasedness: $E(\hat{\beta}_j) = \beta_j$
• Omitted variable $x_k$:
  o OLS estimators: $\hat{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2 + \dots + \tilde{\beta}_{k-1} x_{k-1}$
  o $\text{Bias}(\tilde{\beta}_j) = E(\tilde{\beta}_j) - \beta_j = \beta_k \tilde{\delta}_j$, where $\tilde{\delta}_j$ is the slope coefficient on $x_j$ in the auxiliary regression of the omitted variable $x_k$ on the rest of the variables.
• A useful consideration for determining the direction of OVB is to assume the variables other than $x_j$ are pairwise uncorrelated. Then the bias takes the sign of $\beta_k \cdot Corr(x_j, x_k)$.

Model Over-Specification
• Suppose we include an irrelevant variable $x_k$ that does not affect the dependent variable: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$, with $\beta_k = 0$.
• The OLS estimators of the model with the irrelevant variable are still unbiased, that is, $E(\hat{\beta}_j) = \beta_j$ for $j = 0, \dots, k$.
• The sampling variance of the estimators, $Var(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j} VIF_j$, increases if the irrelevant variable is correlated with $x_j$.
• When you consider adding a new variable, you should weigh:
  o the Omitted Variable Bias; against
  o the sampling variance of the estimators (a relevant variable decreases $\sigma^2$ but may increase $VIF_j$).

Linearity in Parameters
Linear regression refers to linearity in the parameters.
• Simple regression model: $y = \beta_0 + \beta_1 x + u$
• Multiple regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
• The following are also linear regression models:
  o $y = \beta_0 + \beta_1 \log(x) + u$
  o $\log(y) = \beta_0 + \beta_1 x^2 + u$
  o $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u$
• You can transform the above into the standard form by redefining the variables accordingly, e.g. $z = \log(x)$.
• For instance, the following are not linear in parameters:
  o $y = \beta_0 + x^{\beta_1} + \beta_3 x^2 + u$
  o $y = \dfrac{1}{\beta_0 + \beta_1 x} + u$

Logarithmic Variables
• Logarithms are commonly used, either informed by theory or for skewed variables such as monetary variables (e.g. wage, GDP) and size (e.g. population, number of employees). Logarithmic specifications have special interpretations:
• Log on level: $\log(y) = \beta_0 + \beta_1 x_1 + \text{other factors}$
  o $100 \times \beta_1$ is approximately the percentage change in $y$ if $x_1$ increases by one unit.
• Level on log: $y = \beta_0 + \beta_1 \log(x_1) + \text{other factors}$
  o $\beta_1 / 100$ is approximately the level change in $y$ if $x_1$ increases by one percent.
• Log on log: $\log(y) = \beta_0 + \beta_1 \log(x_1) + \text{other factors}$
  o $\beta_1$ is approximately the percentage change in $y$ if $x_1$ increases by one percent. It is called the elasticity, $\dfrac{\Delta y / y}{\Delta x / x}$.

Interactions of Continuous Variables
• If the partial effect of a variable depends on the magnitude of another explanatory variable, this is called an interaction effect.
• The most common form is the multiplication of two levels: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \text{other factors}$
• Example: the impact of attendance and prior performance on exam results, ATTEND dataset of Wooldridge (2019):
  $stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + \beta_4 priGPA^2 + \beta_5 ACT^2 + \beta_6 priGPA \times atndrte + u$
  $stndfnl$: standardized test result for a final exam; $atndrte$: attendance rate; $priGPA$: prior GPA; $ACT$: ACT test result.
• Previous performance can be expected to have a decreasing / increasing marginal effect.
• Attendance can be expected to have a different impact on students with different past performance.

Dummy Variable Coefficients
Example: sex, base group: male
$\log(wage) = \beta_0 + \beta_1 female + \beta_2 educ + u$
• $\beta_0$: intercept for male
• $\beta_0 + \beta_1$: intercept for female
• $\beta_2$: common slope

Interactions between Dummy & Continuous Variables
$\log(wage) = \beta_0 + \beta_1 female + \beta_2 educ + \beta_3 female \times educ + u$
• $\beta_0$: intercept for male
• $\beta_0 + \beta_1$: intercept for female
• $\beta_2$: slope for male
• $\beta_2 + \beta_3$: slope for female

Hypotheses for Telecommuting
Model: $y = \beta_0 + \beta_1 TreatGroup + \beta_2 After + \beta_3 TreatGroup \times After + u$, where $y$ is the performance / productivity outcome.
• H1: Are the treatment and control groups identical before the experiment? "Before" means $After = 0$, and the difference between the control and treatment groups is given by switching on $TreatGroup$:
  $y = \beta_0 + \beta_1 \times TreatGroup + \beta_2 \times 0 + \beta_3 \times 0 + u$
• H2: Does the control group change during the experiment? "Control group" means $TreatGroup = 0$, and the difference between before and after is given by switching on $After$:
  $y = \beta_0 + \beta_1 \times 0 + \beta_2 \times After + \beta_3 \times 0 + u$
• H3: Does the treatment group change because of the experiment? "Treatment group" means $TreatGroup = 1$, and the difference between before and after is given by switching on $After$:
  $y = \beta_0 + \beta_1 \times 1 + \beta_2 \times After + \beta_3 \times After + u$
• H4: Does the treatment group change net of the control group?
  Treatment group change: $\beta_2 + \beta_3$
  Control group change: $\beta_2$
  Difference-in-Differences effect (ATE): $\beta_3$
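To make H1–H4 concrete, here is a minimal sketch of estimating the difference-in-differences model, not from the slides, assuming Python with numpy; the simulated groups and the true interaction effect of 1.5 are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
treat = rng.integers(0, 2, size=n)   # TreatGroup dummy (0 = control)
after = rng.integers(0, 2, size=n)   # After dummy (0 = before)
# Outcome with a true diff-in-differences effect of 1.5 on the interaction
y = (10 + 0.5 * treat + 1.0 * after + 1.5 * treat * after
     + rng.normal(size=n))

# Regressors: intercept, TreatGroup, After, TreatGroup x After
Z = np.column_stack([np.ones(n), treat, after, treat * after])
b, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(f"control change (b2):          {b[2]:.2f}")          # H2
print(f"treatment change (b2 + b3):   {b[2] + b[3]:.2f}")    # H3
print(f"diff-in-differences ATE (b3): {b[3]:.2f}")           # H4, near 1.5
```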
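Note that the regression formulation of difference-in-differences reproduces what the four group means give directly, $(\bar{y}_{T,\text{after}} - \bar{y}_{T,\text{before}}) - (\bar{y}_{C,\text{after}} - \bar{y}_{C,\text{before}})$, but it additionally delivers $se(\hat{\beta}_3)$ from the usual OLS machinery, so H4 can be tested with the t-statistic $\hat{\beta}_3 / se(\hat{\beta}_3)$.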