Similar to how the sample mean, X̄, is an unbiased estimator of the population mean, µ, with σX̄ = σ/√n, the sample proportion, p, is an unbiased estimator of the population proportion, π: µp = π. The standard error of the proportion, σp, is given by σp = √(π(1−π)/n).

When you replace X̄, µ and σX̄ in the formula for Z in the sampling distribution of the mean with p, π and σp, you get the Z-value for the sampling distribution of the proportion:

Z = (p − π)/σp = (p − π)/√(π(1−π)/n)

You can use the normal distribution to approximate the sampling distribution of the proportion when nπ and n(1 − π) are each at least 5.

Example. If the true proportion of voters who support a GST increase is π = 0.4, what is the probability that a sample of size 200 yields a sample proportion between 0.40 and 0.45?

σp = √(π(1−π)/n) = √(0.40(1−0.40)/200) = 0.03464

Converting to the standardised normal,

P(0.40 ≤ p ≤ 0.45) = P((0.40−0.40)/0.03464 ≤ Z ≤ (0.45−0.40)/0.03464) = P(0 ≤ Z ≤ 1.44) = 0.4251

Sampling from finite populations

The finite population correction (fpc) factor is used to adjust the standard error of both the sample mean and the sample proportion. It needs to be applied:
● When n > 5% of N, and
● When sampling occurs without replacement.
The fpc is always < 1, so it always reduces the standard error, resulting in more precise estimates of population parameters.

Downloaded by zhu meng ([email protected]) lOMoARcPSD|5184647

Week 8 - Confidence Intervals

Confidence interval for the population mean where σ is known

In an example from last week, we found that for a population with µ = 368 and σ = 15, the interval 368 ± (1.96)(15)/√25 = 362.12 to 373.88 contains 95% of the possible sample means when n = 25.

Suppose that now, you don't know µ. You have taken a sample with mean X̄ = 362.3.
You develop a 95% confidence interval for this sample to estimate µ: 362.3 ± (1.96)(15)/√25, or 356.42 ≤ µ ≤ 368.18.
● This means: "If I repeatedly sample, each time generating a 95% confidence interval, then 95% of the generated intervals would contain the true µ."
● This DOES NOT MEAN: "There is a 95% probability that µ lies within 356.42 and 368.18."
○ Because you generate a new confidence interval for each sample, you cannot make absolute probability statements about any one given interval.

We prefer confidence intervals over X̄ or p alone because they account for variation from sample to sample, based on observations from only one sample.

The general formula for a confidence interval is:

Point Estimate ± (Critical Value)(Standard Error)

● Point Estimate is X̄ or p, estimating µ or π.
● Standard Error is σX̄ or σp.
● Critical Value is a value based on our desired confidence level.

Our confidence level, (1 − α) × 100%, is the probability that the interval will contain the unknown population parameter.
● If confidence level = 95%, 1 − α = 0.95 (therefore α = 0.05).
● The region outside the interval has cumulative probability α.
○ For a two-tailed test, α is split between α/2 in the upper tail and α/2 in the lower tail.

The upper and lower limits of the confidence interval for µ, if σ is known, are X̄ ± Zα/2 σ/√n.
● We assume that the population is normally distributed, or that the Central Limit Theorem applies.
● Zα/2 is the value corresponding to an upper-tail probability of α/2 from the standardised normal distribution (i.e. a cumulative area of 1 − α/2).
● To calculate the Z-value in Excel, use =NORM.S.INV([α/2]). (This returns the lower-tail value, −Zα/2; use =NORM.S.INV(1−α/2) for the positive upper-tail value.)

Confidence interval for the population mean where σ is not known

In most business situations, σ is not known exactly. (If you in fact knew σ, you would know µ, as per above.) If the population standard deviation σ is unknown, we substitute the sample standard deviation S.
While Z = (X̄ − µ)/(σ/√n) has a normal distribution with mean 0 and variance 1,

t = (X̄ − µ)/(S/√n)

has a specific distribution - a (Student's) t distribution - with n − 1 degrees of freedom.
● To find the critical value of t for the appropriate degrees of freedom, you use a table of the t distribution.
○ The table contains t values (number of standard errors away from the mean), NOT probabilities.
● As n increases, S → σ, and so t values → Z values.
○ As n increases, the t distribution approaches the normal distribution.
● Degrees of freedom (d.f.) relates to how many of the sample values are free to vary.

The upper and lower limits of the confidence interval for µ, if σ is unknown, are X̄ ± tα/2 S/√n.
● tα/2 is the value corresponding to an upper-tail probability of α/2 from the t distribution (i.e. a cumulative area of 1 − α/2).
● To calculate tα/2 in Excel, use =TINV([α], [d.f.]). (=TINV is two-tailed, hence α instead of α/2.)

Example. A random sample of n = 25 has X̄ = 50 and S = 8. Form a 95% confidence interval for µ.
n = 25, so d.f. = 25 − 1 = 24
α = 1 − 0.95 = 0.05, so α/2 = 0.025
t0.025 with d.f. 24 = 2.0639
The 95% confidence interval for this sample is 50 ± (2.0639)(8)/√25, i.e. 46.698 ≤ µ ≤ 53.302.

Confidence interval for the population proportion

Recall that the distribution of the sample proportion is approximately normal if nπ and n(1 − π) are each ≥ 5, with σp = √(π(1−π)/n). We estimate π with p.

The upper and lower limits of the confidence interval for π are p ± Zα/2 √(p(1−p)/n), if np and n(1 − p) are each ≥ 5.

Determining sample size

The sampling error / margin of error, e, is the amount added to or subtracted from the sample statistic to get a confidence interval. Work out the appropriate sample size by solving for n (always round up).
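The rearrangement e = Zα/2 σ/√n → n = Zα/2² σ²/e² is easy to check numerically. A minimal Python sketch (the confidence level, σ and margin-of-error targets below are illustrative, not from the notes):

```python
import math

def n_for_mean(z, sigma, e):
    """Sample size for estimating a mean: solve e = z*sigma/sqrt(n)
    for n, i.e. n = z^2 * sigma^2 / e^2, always rounded up."""
    return math.ceil((z * sigma / e) ** 2)

def n_for_proportion(z, e, pi=0.5):
    """Sample size for estimating a proportion: n = z^2 * pi*(1-pi) / e^2.
    With pi unknown, pi = 0.5 maximises pi*(1-pi), giving the safest (largest) n."""
    return math.ceil(z ** 2 * pi * (1 - pi) / e ** 2)

# Illustrative targets at 95% confidence (Z = 1.96)
print(n_for_mean(1.96, sigma=15, e=5))    # 35
print(n_for_proportion(1.96, e=0.03))     # 1068
```

Note that rounding up (math.ceil) is what the "always round up" rule means: rounding down would give a margin of error slightly larger than e.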
X̄ ± Zα/2 σ/√n → e = Zα/2 σ/√n → n = Zα/2² σ² / e²

X̄ ± Zα/2 S/√n → e = Zα/2 S/√n → n = Zα/2² S² / e²

p ± Zα/2 √(π(1−π)/n) → e = Zα/2 √(π(1−π)/n) → n = Zα/2² π(1−π) / e²

Note: π can be estimated by p, or by setting π = 0.5, because π = 0.5 yields the largest value for n.

Week 9 - Hypothesis Testing (One Sample)

A hypothesis is a claim, often about a population parameter. Examples of hypotheses are:
'The mean monthly phone bill for all households is µ = $42.'
'Telstra's market share proportion in mobile phone customers, π, is greater than 0.5.'
How can we assess whether a hypothesis is true or not?

The null hypothesis (H0) states a status quo assertion about a population parameter. E.g.: 'The average purchase made on Amazon Prime is worth $50 (H0: µ = 50).' Tests usually begin by assuming H0 is true. H0 can't be proven in the test, but it can be rejected.

The alternative hypothesis (H1) opposes the status quo H0 in some way. E.g.: 'The average purchase made on Amazon Prime is not worth $50 (H1: µ ≠ 50).' H1 is normally the hypothesis that the researcher is trying to find evidence for (or against). It is usually formed first.

The hypothesis testing process (using the above example of Amazon Prime)
1. Test H1 (µ ≠ 50). First, sample the population and find the sample mean.
2. Assess the likelihood of H1.
a. If we assume H0 (that µ = 50), and we return a sample mean close to the stated µ, then we conclude that H0 is not rejected.
b.
If we assume H0 and we return a sample mean far from the stated µ, then we conclude that H0 is rejected (i.e. our assumption that µ = 50 is wrong).

Possible errors in hypothesis test decision-making

ACTUAL SITUATION →      H0 True                                          H0 False
Do Not Reject H0        Correct decision (P = 1 − α)                     Type II Error (P = β = P(do not reject H0 | H0 false))
Reject H0               Type I Error (P = α = P(reject H0 | H0 true))    Correct decision (P = 1 − β)

● The confidence coefficient, 1 − α, is the probability of not rejecting H0 when it is true.
○ "Our confidence in the test to accept the true."
● The power of a statistical test, 1 − β, is the probability of rejecting H0 when it is false.
○ "The power of the test to identify the false."

Importantly, we need to set critical values that tell us how 'close' is close enough to uphold H0, or how 'far' is far enough to reject H0.
● In two-tailed tests (i.e. where H1 specifies '≠'), the critical values cut off an area of α/2 in each tail.

Two-tailed hypothesis test for the mean where σ is known: Z-value approach
1. State H0 and H1.
2. Choose α and n.
3. Determine the critical Z values for the specified α/2.
4. Collect data. Convert the sample statistic, X̄, to a test statistic: ZSTAT = (X̄ − µ)/(σ/√n).
5. Decision rule: if ZSTAT falls within the rejection region, reject H0. Otherwise, do not reject H0.

Example. Test the claim that the true mean diameter of a bolt is 30mm, given σ = 0.8.
H0: µ = 30
H1: µ ≠ 30
Set α = 0.05 and n = 100.
Since α/2 = 0.025, the critical Z values are ±1.96.
Suppose that X̄ = 29.84.

ZSTAT = (X̄ − µ)/(σ/√n) = (29.84 − 30)/(0.8/√100) = −0.16/0.08 = −2

Since ZSTAT < −1.96, ZSTAT is within the rejection region, so we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the mean diameter of the manufactured bolts is not equal to 30mm.
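The critical-value decision rule above takes only a few lines of code. A minimal Python sketch of the bolt example (the helper name is mine, not from the notes):

```python
import math

def z_test_two_tailed(xbar, mu0, sigma, n, z_crit=1.96):
    """Two-tailed Z-test for the mean with sigma known.
    Returns (Z_STAT, reject H0?) at the level implied by z_crit
    (1.96 corresponds to alpha = 0.05)."""
    z_stat = (xbar - mu0) / (sigma / math.sqrt(n))
    return z_stat, abs(z_stat) > z_crit

# Bolt example: H0: mu = 30, sigma = 0.8, n = 100, observed X-bar = 29.84
z, reject = z_test_two_tailed(29.84, 30, 0.8, 100)
print(round(z, 2), reject)   # -2.0 True
```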
Two-tailed hypothesis test for the mean where σ is known: p-value approach

The p-value / observed level of significance is the probability of obtaining a value for the test statistic equal to or more extreme than the observed sample value, given H0 is true.
● 'More extreme' is defined by H1.
1. State H0 and H1.
2. Choose α and n.
3. Collect data and compute the test statistic. Determine the associated p-value.
4. Decision rule: if p < α, reject H0. Otherwise, do not reject H0.
"If the p-value is too low, then H0 must go!"

Example. Test the claim that the true mean diameter of a bolt is 30mm, given σ = 0.8.
H0: µ = 30
H1: µ ≠ 30
Set α = 0.05 and n = 100. Suppose that X̄ = 29.84.

ZSTAT = (X̄ − µ)/(σ/√n) = (29.84 − 30)/(0.8/√100) = −0.16/0.08 = −2

To calculate the p-value, ask: how likely is it to get a ZSTAT of −2 (or further from the mean (0), in either direction) if H0 is true? Sum up both the upper- and lower-tail probabilities.
Since the p-value = 0.0456 is less than α = 0.05, we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the mean diameter of the manufactured bolts is not equal to 30mm.

Two-tailed hypothesis test for the mean where σ is unknown: t-value approach
1. State H0 and H1.
2. Choose α and n.
3. Determine the critical t values for the specified α/2.
4. Collect data. Convert the sample statistic, X̄, to a test statistic: tSTAT = (X̄ − µ)/(S/√n).
5. Decision rule: if tSTAT falls within the rejection region, reject H0. Otherwise, do not reject H0.

Example. The average cost of a hotel room in New York is said to be $168 per night. To determine if this is accurate, a random sample of 25 hotels is taken, resulting in X̄ = $172.50 and S = $15.40. Test the appropriate hypotheses at α = 0.05.
H0: µ = 168
H1: µ ≠ 168
α/2 = 0.025, n = 25, d.f.
= 24

tSTAT = (X̄ − µ)/(S/√n) = (172.5 − 168)/(15.4/√25) = 1.46

Critical values: t24, 0.025 = ±2.0639
Since tSTAT does not fall within the rejection region, we do not reject the null hypothesis and conclude there is insufficient evidence, at the 5% level of significance, that the mean cost of NY hotel rooms is not $168.

Two-tailed hypothesis test for the mean where σ is unknown: p-value approach
1. State H0 and H1.
2. Choose α and n.
3. Collect data and compute the test statistic. Determine the associated p-value.
4. Decision rule: if p < α, reject H0. Otherwise, do not reject H0.

Example. The average cost of a hotel room in New York is said to be $168 per night. To determine if this is accurate, a random sample of 25 hotels is taken, resulting in X̄ = $172.50 and S = $15.40. Test the appropriate hypotheses at α = 0.05.
H0: µ = 168
H1: µ ≠ 168
α = 0.05, n = 25, d.f. = 24

tSTAT = (X̄ − µ)/(S/√n) = (172.5 − 168)/(15.4/√25) = 1.46

To calculate the p-value, ask: how likely is it to get a tSTAT of 1.46 (or further from the mean (0), in either direction) if H0 is true? Sum up both the upper- and lower-tail probabilities.
Since the p-value = 0.157 is more than α = 0.05, we do not reject the null hypothesis and conclude there is insufficient evidence, at the 5% level of significance, that the mean cost of NY hotel rooms is not $168.

Connection between confidence intervals and two-tail tests: the length of the non-rejection region is also the length of the confidence interval, for the same α.

One-tail tests - in many cases, H1 focuses on a particular direction.
E.g. H1: µ > 50 - this is an upper-tail test (focused on the upper tail above the value of 50). The rejection region is only in the upper tail.
E.g. H1: π < 0.1 - this is a lower-tail test (focused on the lower tail below the value of 0.1). The rejection region is only in the lower tail.

1. Find the critical value for t.
2.
Reject H0 if the test statistic is in the rejection region.
OR
1. Compute the test statistic tSTAT.
2. Knowing tSTAT and d.f., calculate the p-value.
3. Reject H0 if p-value < α.

In Excel:
=T.DIST([tSTAT],[d.f.],TRUE): returns the left-tailed Student's t probability (the p-value for a lower-tail test)
=T.DIST.2T([tSTAT],[d.f.]): returns the two-tailed Student's t probability (the p-value for a two-tail test; tSTAT must be positive)
=T.DIST.RT([tSTAT],[d.f.]): returns the right-tailed Student's t probability (the p-value for an upper-tail test)

Remember the definition of the p-value: the probability of observing a test statistic equal to or more extreme than that observed, in the direction of the alternative hypothesis, if the null hypothesis is true.

Also note that, with one-tailed tests, any critical Z or t value will be based on α rather than α/2. Why? Because the rejection region, which by definition has area α, is all in one tail. (If in doubt, look at the tables and their sample diagrams.)

Hypothesis tests for the proportion

Z = (p − π)/√(π(1−π)/n)

Example. A marketing company claims that it receives responses from 8% of those surveyed. To test this claim, a random sample of 500 were surveyed, with 25 responses. Test at the α = 0.05 significance level.
H0: π = 0.08
H1: π ≠ 0.08
Set α = 0.05 and n = 500.
p = 25/500 = 0.05

ZSTAT = (0.05 − 0.08)/√(0.08(1−0.08)/500) = −2.47

Since α/2 = 0.025, the critical Z values are ±1.96.
Since ZSTAT < −1.96, ZSTAT is within the rejection region, so we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the company does not maintain an 8% response rate.

p-value test:
α = 0.05
ZSTAT, as calculated above, is −2.47.
The corresponding p-value is 2(0.0068) = 0.0136.
Reject H0, since p-value < α.

One-tailed hypothesis tests for the proportion

Remember, the only difference is that we work with α instead of α/2, because the rejection region (of area α) is all in one tail.

Example.
Coca Cola wants at least 10% of customers to purchase a new cola. To test this claim, a random sample of 100 were surveyed, with 8 positive responses. Test at the α = 0.05 significance level.
H0: π ≥ 0.10 (this is a lower-tail test)
H1: π < 0.10
Set α = 0.05 and n = 100.
p = 8/100 = 0.08

ZSTAT = (0.08 − 0.10)/√(0.10(1−0.10)/100) = −0.67

Since α = 0.05, the critical Z value is −1.645.
Since ZSTAT ≥ −1.645, ZSTAT is not within the rejection region, so we do not reject the null hypothesis and rule that Coca-Cola's claim is plausible.

p-value approach:
α = 0.05
ZSTAT, as calculated above, is −0.67.
The corresponding p-value is P(Z ≤ −0.67) = 0.5 − 0.2486 = 0.2514.
Do not reject H0, since p-value > α.

To calculate the p-value in Excel, use =NORM.S.DIST([ZSTAT], TRUE) to find the probability from Z = −∞ to Z = ZSTAT.

Ethical considerations
● Use randomly collected data (probability samples) to reduce selection bias and non-sampling error AND to allow sampling distribution theory to be used!
● Choose the level of significance, α, and the type of test (one-tail or two-tail) before data collection.
● Do not employ "data snooping" to choose between one-tail and two-tail tests, or to determine α.
● Do not practice "data cleansing" to hide observations that do not support a stated hypothesis.
● Report all pertinent findings, including both statistical significance and practical importance.

Week 10 - Two Sample Tests

(Key question: how do we compare two data sets?)

Businesses compare data sets all the time.
● E.g. eBay wants to test whether pop-up ads or sidebar ads generate more sales.
● E.g. Google wants to test whether showing 10 or 25 search results attracts more traffic.
● E.g. Woolworths wants to compare the revenue generated from 'special' v non-special products.

Our goal is to test hypotheses or form a confidence interval for the difference between two population means / proportions.
● The point estimate for the difference of two population means is X̄1 − X̄2.
● The point estimate for the difference of two population proportions is p1 − p2.

Two-sample test comparing population means using independent samples

(Key question: what is the relationship between µ1 and µ2?)
● Samples from two populations are independent if a sample selected from either population has no effect on the sample from the other population.
● The CLT says: linear combinations of independent random variables are approximately normally distributed as the sample size gets larger. Thus the CLT applies when the two groups are independent and randomly sampled, as long as each group is EITHER normally distributed OR has a large enough sample size.

(do not memorise) X̄1 − X̄2 = (1/n1) Σi=1..n1 Xi1 − (1/n2) Σi=1..n2 Xi2

The pooled variance t-Test is used if σ1 and σ2 are unknown and assumed equal. (Q: why might we assume σ1 and σ2 are equal?)
● We assume that samples are randomly and independently drawn.
● We assume that the populations are normally distributed, OR both samples are symmetric, OR both sample sizes are at least 30, for the CLT.

The pooled variance estimate is:

Sp² = [Σi=1..n1 (Xi1 − X̄1)² + Σi=1..n2 (Xi2 − X̄2)²] / (n1 − 1 + n2 − 1) = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

● n1, n2 are the sample sizes of sample 1 and sample 2 respectively.
● X̄1, X̄2 are the sample means of sample 1 and sample 2 respectively.
● S1², S2² are the sample variances of sample 1 and sample 2 respectively.

The tSTAT, for µ1 − µ2 with σ1 and σ2 unknown and assumed equal, is:

tSTAT = [(X̄1 − X̄2) − (µ1 − µ2)] / √(Sp²(1/n1 + 1/n2))

(If you want to try to derive this formula, don't.)
● tSTAT has (n1 + n2 − 2) degrees of freedom.
● Generally, H0: µ1 = µ2, thus µ1 − µ2 = 0.

The confidence interval, for µ1 − µ2 with σ1 and σ2 unknown and assumed equal, is:

(X̄1 − X̄2) ± tα/2 √(Sp²(1/n1 + 1/n2))

where tα/2 has (n1 + n2 − 2) degrees of freedom.
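The pooled-variance formulas translate directly from summary statistics. A minimal Python sketch (the sample summaries used below are made up for illustration, not from the notes):

```python
import math

def pooled_t(x1bar, x2bar, s1, s2, n1, n2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0, with sigma1 and
    sigma2 unknown but assumed equal. s1, s2 are sample standard deviations.
    Returns (t_STAT, degrees of freedom)."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (x1bar - x2bar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical summaries for two independent samples of sizes 21 and 25
t, df = pooled_t(3.27, 2.53, 1.30, 1.16, 21, 25)
print(round(t, 2), df)   # 2.04 44
```

Compare t against the critical value tα/2 with the returned d.f. to make the decision.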
Excel can do all the work for you: Data → Data Analysis → t-Test: Two-Sample Assuming Equal Variances → input data and raw α → done.

The separate variance t-Test is used if σ1 and σ2 are unknown and NOT assumed to be equal.
● We assume that samples are randomly and independently drawn.
● We assume that the populations are normally distributed, OR both samples are symmetric, OR both sample sizes are at least 30, for the CLT.

The tSTAT, for µ1 − µ2 with σ1 and σ2 unknown and not assumed equal, is:

tSTAT = [(X̄1 − X̄2) − (µ1 − µ2)] / √(S1²/n1 + S2²/n2)

(Generally, H0: µ1 = µ2, thus µ1 − µ2 = 0.)

The critical t-value has degrees of freedom given by a separate formula, rounded to the nearest integer (I can't even transcribe it myself anymore).

The confidence interval, for µ1 − µ2 with σ1 and σ2 unknown and not assumed equal, is:

(X̄1 − X̄2) ± tα/2,d.f. √(S1²/n1 + S2²/n2)

Excel can do all the work for you: Data → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances → input data and raw α → done.

Example. You have been asked to compare the mean value of a Sydney-Melbourne flight on Jetstar with the mean value on Qantas, to see if there is a difference in the means of the samples. Using α = 0.05,
(a) What are your hypotheses?
(b) What is tSTAT?
(c) How many degrees of freedom does the critical value have?
(d) What is the critical value? What can you conclude?
(e) What is the p-value? What can you conclude?
(f) Form a 95% confidence interval for the difference of the population means (i.e. µ1 − µ2).
(g) What assumptions are you making?

(a) H0: µ1 = µ2, i.e. µ1 − µ2 = 0; H1: µ1 ≠ µ2

Parts (b), (c), (d), (e) and (f) can be solved in Excel by going to the Data tab → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances → enter the two arrays, α and the hypothesised mean difference (= 0) → OK. The output appears in a new sheet. However, for the purposes of this exercise, I want to go through things step by step.
(The spreadsheet screenshots showing the formulas and the resulting numbers are not reproduced here; the cell references below refer to them.)

(b) tSTAT = −4.096 (B31)
(c) d.f. = 36 (E33)
(d) This is a two-tail test. If tSTAT is more extreme than the critical values, reject H0.
Critical values = ±2.028 (E34)
Since tSTAT < −2.028, reject H0 (E36). There is a significant difference between the mean value of Sydney-Melbourne flights on Jetstar and Qantas.
(e) The p-value is the probability of getting a tSTAT equal to or more extreme than the sample result if there is no difference in the two population means (i.e. if H0 holds).
p-value = 0.000227 (E35)
Since p-value < α, reject H0 (E37). There is a significant difference between the mean value of Sydney-Melbourne flights on Jetstar and Qantas.
(f) Confidence interval: −14.009 < µ1 − µ2 < −4.731 (B34, B35). You can conclude with 95% confidence that the difference between the population means falls inside this interval.
(g) (i) Since the sample sizes are both less than 30, it must be assumed that both sampled populations are approximately normal. (ii) Observing a difference this large or larger in the two sample means is less likely if you assume equal population variances than if you assume unequal variances, but the null hypothesis would be rejected either way.

Two-sample test comparing population means using related samples

The paired difference test uses the difference between matched values (e.g. sales data 'before' and 'after' a marketing campaign; a student's test scores in Quiz 1 and Quiz 2).
● We assume that both populations are normally distributed, or the sample is large enough for the CLT.
● The difference between matched values: Di = Xi1 − Xi2.
● The point estimate for the paired difference population mean µD: D̄ = (1/n) Σi=1..n Di
● The sample standard deviation: SD = √[Σi=1..n (Di − D̄)² / (n − 1)]
● tSTAT for µD, where d.f.
= n − 1:

tSTAT = (D̄ − µD)/(SD/√n)

● Confidence interval for µD: D̄ ± tα/2 SD/√n

Two-sample test comparing population proportions

(Key question: what is the relationship between two population proportions, π1 and π2?)

To compute the test statistic, we assume the null hypothesis is true, so we assume π1 = π2 and pool the two sample estimates. The pooled estimate for the overall proportion is:

p̄ = (X1 + X2)/(n1 + n2), where X1 and X2 are the number of items of interest in samples 1 and 2.

ZSTAT for π1 − π2:

ZSTAT = [(p1 − p2) − (π1 − π2)] / √(p̄(1−p̄)(1/n1 + 1/n2))

(once again, remember we assume π1 = π2).

Confidence interval for π1 − π2:

(p1 − p2) ± Zα/2 √(p1(1−p1)/n1 + p2(1−p2)/n2)

Covariance and correlation

http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_5.html

The sample covariance indicates how two samples are related.

cov(X, Y) = Σi=1..n (Xi − X̄)(Yi − Ȳ) / (n − 1)

If cov(X,Y) > 0 → X and Y tend to move in the same direction (positively related).
If cov(X,Y) < 0 → X and Y tend to move in opposite directions (negatively related).
If cov(X,Y) = 0 → X and Y have no linear relationship. (Independent variables have zero covariance, but zero covariance does not by itself guarantee independence.)

Covariance cannot measure the degree to which variables move together, because it does not use one standard unit of measurement. To measure the strength of the linear relationship between two numerical variables, you must use the sample coefficient of correlation:

r = cov(X, Y) / (SX SY), where SX and SY are the standard deviations of X and Y.

The sample coefficient of correlation ranges between −1 and 1. The closer to −1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship - the data can be randomly scattered, or follow a non-linear pattern (like a quadratic).
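Both definitions translate directly into code, since SX = √cov(X,X). A minimal Python sketch (the data points below are made up for illustration):

```python
import math

def sample_cov(x, y):
    """Sample covariance: sum((xi - xbar)(yi - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample coefficient of correlation r = cov(X, Y) / (Sx * Sy).
    cov(X, X) is the sample variance, so Sx = sqrt(cov(X, X))."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical data: y broadly rises with x, so r should be strongly positive
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(sample_corr(x, y), 3))   # 0.853
```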
Week 11, 12, 13 - Simple and Multiple Linear Regression

Regression analysis leads us to construct a model which allows us to see how one or more independent variables can predict the value of a dependent variable.

In simple linear regression,
● there is only one independent variable, X.
● the relationship between X and the dependent variable, Y, is described by a linear function.
● changes in X are assumed to cause changes in Y.

Two things determine a line: its y-intercept and its slope. Therefore, a simple linear regression model looks like:

Yi = β0 + β1Xi + εi

Yi = dependent variable
β0 = population y-intercept
β1 = population slope coefficient
Xi = independent variable
β0 + β1Xi = linear component = E(Yi | X = Xi)
εi = random error component

The estimated SLR equation (also called the prediction line) provides an estimate of the population regression line:

Ŷi = b0 + b1Xi

Ŷi = predicted Y for observation i
b0 = estimate of the y-intercept
b1 = estimate of the slope
Xi = value of X for observation i

(In the lecture slides, this line is drawn in orange through the scatter plot.) The residual, ei, is the difference between Yi and Ŷi.

So, how do we determine b0 and b1 (i.e. the y-intercept and the slope)? How do we know what line to draw through our scatter plot? Well, we want the spread of our residuals ei (= Yi − Ŷi) around the line to be as small as possible, right? We use the least squares method, which minimises the sum of squared differences between the observed values Yi and the predicted values Ŷi.
This sum of squared differences is equal to Σi=1..n (Yi − Ŷi)² = Σi=1..n [Yi − (b0 + b1Xi)]².

We minimise the sum of squared differences by differentiating with respect to b0 and b1 and setting the derivatives equal to 0:

∂/∂b0 Σ (Yi − b0 − b1Xi)² = −2 Σ (Yi − b0 − b1Xi) = 0
→ Σ b0 = Σ (Yi − b1Xi)
→ b0 = Ȳ − b1X̄

∂/∂b1 Σ (Yi − b0 − b1Xi)² = −2 Σ (Yi − b0 − b1Xi)Xi = 0
Substituting b0 = Ȳ − b1X̄:
Σ (Yi − Ȳ + b1X̄ − b1Xi)Xi = 0
→ b1 Σ (Xi − X̄)Xi = Σ (Yi − Ȳ)Xi
→ b1 = Σ (Yi − Ȳ)(Xi − X̄) / Σ (Xi − X̄)²

(The last step uses Σ(Xi − X̄)Xi = Σ(Xi − X̄)² and Σ(Yi − Ȳ)Xi = Σ(Yi − Ȳ)(Xi − X̄), which hold because Σ(Xi − X̄) = Σ(Yi − Ȳ) = 0.)

You're not expected to remember this exact process. What is important is that you understand the concept: we've got a quadratic and we optimise it by setting its derivative to 0. (Note: these equations are quite complicated. If you're unsure of the differentiation, I recommend you refresh your 2-unit mathematics or take a maths help class at uni.)

Example. Find the estimated SLR equation for the relationship between the selling price of a home and its size (measured in square feet), given the data.

In Excel, go to Data → Data Analysis → Regression → input the Xs and Ys. (Include the labels 'House Price in $1000s' and 'Square Feet'.)

The regression output (not reproduced here) gives b0 = 98.24833 and b1 = 0.10977.

b0 (= 98.24833) is the estimated mean value of Y when the value of X is 0. But since a house cannot have a square footage of 0, b0 in this case has no practical application.

b1 (= 0.10977) tells us that the mean value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size.

When using a regression model, you should (usually) only interpolate within the X-range of the existing observations, not extrapolate. This is because when a new observation with an X outside the range occurs, it will change b0 and b1.
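The closed-form results b1 = Σ(Xi − X̄)(Yi − Ȳ)/Σ(Xi − X̄)² and b0 = Ȳ − b1X̄ are easy to implement directly. A minimal Python sketch, using toy data chosen to lie exactly on a line so the fit recovers it (not the house-price data from the notes):

```python
def least_squares(x, y):
    """Least-squares estimates for the prediction line y-hat = b0 + b1*x:
    b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2), b0 = ybar - b1*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Toy data on the exact line y = 2 + 3x
b0, b1 = least_squares([0, 1, 2, 3], [2, 5, 8, 11])
print(b0, b1)   # 2.0 3.0
```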
b0 = Ê(Y | X = 0), which estimates β0 = E(Y | X = 0)
b1 = Ê(Y | X + 1) − Ê(Y | X), which estimates β1 = E(Y | X + 1) − E(Y | X)

Measures of variation

SST (total variation) measures the variation of the observed Yi values around their mean Ȳ. This is related to the sample variance of Y.

SSE (unexplained variation) measures variation attributable to factors other than X. The whole discussion of Least Squares was about minimising SSE.

SSR (explained variation) measures variation attributable to the relationship between X and Y.
● It's the difference between how we would predict Y if we didn't use X (SST) and how we would predict Y if we did use X (SSE).
● Since SST = SSR + SSE, if SSE gets smaller, SSR gets larger.

The coefficient of determination is the proportion of the linear variation in Y that is explained by the linear variation in X:

r² = SSR/SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

where r is the coefficient of correlation.

"r² × 100% of the variation in Y is explained by variation in X, via the linear model."

The standard error of the estimate is the standard deviation of the observations around the regression line, and is measured in the same units as Y:

SYX = √(SSE/(n − 2)) = √(Σ(Yi − Ŷi)²/(n − 2))

We have estimated two parameters for the mean (the intercept and the slope), thus we divide by n − 2.

Residual analysis to verify assumptions

The residual ei is the difference between the observed and predicted values: ei = Yi − Ŷi.

Assumptions about the errors ("LINE"):
● Linearity: the relationship between X and Y is linear.
● Independence of Errors: error values are statistically independent.
● Normality of Error: error values are normally distributed (needed only in small samples or when doing prediction intervals).
● Equal variance (homoskedasticity): the probability distribution of the errors has constant variance.

We can plot residuals vs. X to test these assumptions.
We don't want to see any pattern in the residual plot. If we do, that means our regression model has not completely captured the variation in Y as a function of X, and we will need to alter it.

Tests that the linear relationship is not due to chance

Central limit theorem: the Least Squares estimates are linear combinations of the observations Yi, so the CLT applies to them. Therefore, in large samples, Ȳ, b0 and b1 are approximately normally distributed.

The standard error of b1, the regression slope coefficient, is estimated by:

Sb1 = SYX/√SSX = SYX/√(Σ(Xi − X̄)²)

t-test for a population slope: is there a linear relationship between X and Y?
H0: β1 = 0 (there is no linear relationship)
H1: β1 ≠ 0 (a linear relationship exists)

tSTAT = (b1 − β1)/Sb1, with d.f. = n − 2.

Example. ^house-price = 98.24833 + 0.10977(square feet). Is there a relationship between the square footage of the house and its sales price?

Note that if α is not specified, assume α = 0.05.
● tSTAT = 3.329378
Critical values: t0.025,8 = ±2.3060
Decision: reject H0. There is sufficient evidence that square footage significantly affects house price.
● p-value = 0.01039
α = 0.05
Decision: reject H0. There is sufficient evidence that square footage significantly affects house price.
● Confidence interval for the estimate of the slope: (0.03374, 0.18580)
We are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size. This 95% confidence interval does not include 0 (i.e. there is not 'no impact').
Decision: there is a significant relationship between house price and square feet at the 5% level of significance.

t-test for a correlation coefficient:
H0: ρ = 0 (there is no correlation between X and Y)
H1: ρ ≠ 0 (correlation exists)
ρ is the population correlation coefficient.

tSTAT = (r − ρ)/√((1 − r²)/(n − 2)), with d.f. = n − 2.

Example.
Is there evidence of a linear relationship between square feet and house price at the 0.05 level of significance? (Look familiar?)

tSTAT = (0.762 − 0)/√((1 − 0.762²)/(10 − 2)) = 3.329

Since tSTAT falls outside the critical values, reject H0. There is evidence of a linear association at the 5% level of significance.

NB: the t-test for a population slope and the t-test for a correlation coefficient result in the same thing.

Estimating mean values and predicting individual values

Whereas a confidence interval is for the mean response, a prediction interval is for an individual value of Y.

Multiple linear regression

In real life, when we're trying to predict a variable Y, there are often several variables X which may be acting upon Y.
● E.g. if Y is sales, X1 might be advertising, X2 might be price, X3 might be competitors' sales ...

So we need to use something called a multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

Yi = observed value of Y
β0 = y-intercept
β1, …, βk = population slopes
X1i, …, Xki = observed values of the X variables for observation i
εi = random error

The coefficients of the multiple regression model are again estimated using sample data:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

Ŷi = predicted value of Y
b0 = estimated y-intercept
b1, …, bk = estimated slope coefficients

If we represent Ŷ graphically (with two X variables), we get a plane.

Example. A distributor of pies wants to evaluate factors thought to influence demand.
Dependent variable: pie sales (units per week)
Independent variables: price of pies ($); expenditure on advertising ($100s)

In Excel,
1. Input the data, with the dependent variable in the leftmost column. Include headings.
2. Data → Data Analysis → Regression. Input the dependent variable in the Y range, and all the independent variables in the X range.
Therefore the prediction equation for sales is: predicted sales = 306.526 − 24.975(Price) + 74.131(Advertising).
● Sales will decrease on average by 24.975 pies per week for each $1 increase in selling price, if the level of advertising is held constant.
● Sales will increase on average by 74.131 pies per week for each $100 increase in advertising, if the selling price is held constant.

Testing for the fit of a multiple regression model

When we have two X variables, we want to know that both of them are important, so we want to compare the fit of the model using both X variables with the fit of a model using only one of them. Recall that r² = SSR/SST = regression sum of squares / total sum of squares. If we add a third variable, no matter what it is, r² can never decrease. This undermines the usefulness of r² as a comparative measure of the fit of different models. What do we do? We use the adjusted r², asking ‘how much have we reduced our SYX below SY?’:
r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1)
The closer r²adj is to 1, the better our model fits. This is not an intuitive formula … but that’s what it is.

Testing for linear significance: are any of the X variables linearly related to Y, or are none?
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
We can’t use a separate t-test for each βi, since they’re all correlated. So what do we do? What if we take the sum of all the squares of the bi (i = 1, 2, …, k)? If the sum is close enough to 0, the only way that could happen is if all the tSTAT were close to 0; the only way that could happen is if all the bi were close to 0; and the only way that could happen is if all the βi were close to 0. That is roughly what we do when we use the F Test for overall significance of the model.
FSTAT = MSR/MSE, where k = number of independent variables in the regression model. (“E12”)
F1,n−2 ≡ t²n−2; i.e.
FSTAT is a quadratic function of the tSTAT values.
MSR = SSR/k (“D12”)
MSE = SSE/(n − k − 1) (“D13”)
FSTAT follows an F distribution with k (numerator) and n − k − 1 (denominator) degrees of freedom. (“B12”, “B13”)
Decision rule: reject H0 at the α level of significance if FSTAT > Fα; otherwise, do not reject H0. (“F12”)

Residual analysis in multiple regression

Residuals from the regression model: ei = (Yi − Ŷi)
Assumptions about ε:
● The errors are normally distributed (needed if we have a small sample size or are constructing prediction intervals)
● They have a constant variance
● They are independent
These residual plots are used in multiple regression:
(a) Residuals vs. Ŷi (predicted values)
(b) Residuals vs. X1i
(c) Residuals vs. X2i
(d) Residuals vs. time (if time series data)
NB: if the residuals are independent of both X1i and X2i, then they’ll be independent of Ŷi. So we normally just plot residuals vs. Ŷi.
Use the residual plots to check for violations of the regression assumptions.

Testing whether individual variables are significant

t-tests / p-values of individual variable slopes show whether there is a linear relationship between the variable Xj and Y, holding constant all other X variables.
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship does exist between Xj and Y)
tSTAT = (bj − 0) / Sbj
Confidence interval estimate for the population slope βj: bj ± tα/2 Sbj, where t has (n − k − 1) df.
If the interval does not contain 0, reject the null hypothesis.

Dummy variables

A dummy variable is how we put categorical variables into regression models. We code binary dummy variables as 0 (characteristic did not occur / does not exist) or 1 (it did).
Example. For our pie example, let:
Y = pie sales
X1 = price
X2 = holiday: X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week.
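As a sketch of the holiday-dummy model (again with made-up numbers; only the 0/1 coding scheme comes from the notes), the dummy column is fitted exactly like any other X variable:

```python
import numpy as np

# Made-up weekly data: price ($), holiday dummy (1 = holiday week), pie sales.
price   = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0])
holiday = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
sales   = np.array([460.0, 350, 350, 430, 470, 380, 430, 490, 450, 490])

X = np.column_stack([np.ones_like(price), price, holiday])
(b0, b1, b2), _, _, _ = np.linalg.lstsq(X, sales, rcond=None)

# The fit gives two parallel lines, differing only by the holiday effect b2:
#   holiday weeks:     sales_hat = (b0 + b2) + b1 * price
#   non-holiday weeks: sales_hat =  b0       + b1 * price
```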
You see that with a dummy variable in the mix, we get TWO regression lines! The only difference between them is the impact of the holiday.
If Y = pie sales and X1 = price, and the flavour of the pie (which has three ‘levels’: apple, strawberry, chocolate) is thought to matter, let:
X2 = 1 if apple, 0 otherwise
X3 = 1 if strawberry, 0 otherwise
For apple pie, Ŷ = b0 + b1X1 + b2X2
For strawberry pie, Ŷ = b0 + b1X1 + b3X3
For chocolate pie, Ŷ = b0 + b1X1
Therefore, the number of dummy variables is one less than the number of levels.

Autocorrelation

Autocorrelation is correlation of the errors over time. If the residuals show a repeating pattern, that is a sign of positive autocorrelation. Autocorrelation violates the regression assumption that residuals are random and independent.
This plot of residuals against time seems to display autocorrelation:
● ‘spikes’ in the residuals around May and October of every year, as highlighted in orange.
● ‘troughs’ around June and December/January of every year, as highlighted in crimson.
● Therefore, there seems to be an annually cyclical trend in the residuals.
To be sure, we calculate the Durbin-Watson statistic to test for autocorrelation.
H0: residuals are not autocorrelated
H1: autocorrelation is present
The critical values dL and dU can be found from a Durbin-Watson table, for sample size n and number of independent variables k.
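The statistic itself is simple to compute: D = Σ(et − et−1)² / Σet², which lies between 0 and 4 and is near 2 when there is no autocorrelation. A minimal sketch (the residual values below are made up):

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    D near 2: no autocorrelation; D well below 2: positive autocorrelation;
    D well above 2: negative autocorrelation. Compare D with dL/dU from a table."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Smoothly drifting residuals (positive autocorrelation) give a small D;
# alternating residuals (negative autocorrelation) give a D well above 2.
d_pos = durbin_watson([1.0, 1.2, 1.1, 0.9, 0.8, 1.0])
d_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```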
Testing for positive autocorrelation: reject H0 if D < dL; do not reject H0 if D > dU; the test is inconclusive if dL ≤ D ≤ dU.
Testing for negative autocorrelation: reject H0 if D > 4 − dL; do not reject H0 if D < 4 − dU; the test is inconclusive if 4 − dU ≤ D ≤ 4 − dL.

Pitfalls of regression

● Lacking an awareness of the assumptions underlying least-squares regression
● Not knowing how to evaluate the assumptions
● Not knowing the alternatives to least-squares regression if a particular assumption is violated
● Using a regression model without knowledge of the subject matter
● Extrapolating outside the relevant range
To avoid these pitfalls:
● Start with a scatter plot of Y vs. X to observe any possible relationship
● Perform graphical residual analysis to check the assumptions (“LINE”)
● If any assumption is violated, use alternative methods or models
● If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
● Avoid making predictions or forecasts outside the relevant range
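To tie the pieces together, here is a minimal end-to-end sketch of the recommended workflow on made-up simple-regression data: fit, slope standard error, t-test, F test, and a 95% confidence interval for the slope. (With one predictor, FSTAT = tSTAT², as the notes observe.)

```python
import numpy as np

# Made-up (x, y) data for illustration
x = np.array([14.0, 15.2, 17.0, 18.5, 20.0, 22.3, 24.0, 25.5, 27.0, 30.0])
y = np.array([245.0, 262, 279, 298, 299, 319, 345, 324, 380, 355])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)                   # least-squares slope, intercept
resid = y - (b0 + b1 * x)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of the estimate
ssx = np.sum((x - x.mean()) ** 2)
s_b1 = s_yx / np.sqrt(ssx)                     # standard error of the slope

t_stat = (b1 - 0) / s_b1                       # t-test of H0: beta1 = 0, df = n - 2

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum(resid ** 2)
f_stat = ((sst - sse) / 1) / (sse / (n - 2))   # F test; equals t_stat**2 here

ci = (b1 - 2.3060 * s_b1, b1 + 2.3060 * s_b1)  # 95% CI; t_{0.025, 8} = 2.3060
```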