MM1F28 Seminar
Week 3
Table of contents
Introduction
One-tailed tests
Two-tailed tests
Hypothesis testing
Interpreting the results of an independent samples (two sample) t-test
Summarising our findings
Paired Samples t-test
Interpreting paired samples t-tests
Effect sizes
Exercises
Introduction
This week we will move past confidence intervals, into a more formal way of checking whether there is an effect of an IV on a DV: t-tests.
Independent samples t-tests work if the following conditions are met:
The IV is a factor with two levels (e.g., male vs female, associate prof vs prof, default vs no default, etc.)
The DV is a numerical variable (not ordinal)
The DV is approximately normally distributed (parametric assumption #1)
The variance of the DV between the two levels is similar (in other words the two groups have equal spread of data) (parametric assumption #2)
If the parametric assumptions are not met, there are things we can do. We will discuss these next week.
One-tailed tests
When your hypothesis has a direction, you should use a one-tailed t-test. For example, if we were to ask whether tumour thickness in male melanoma patients is larger than in female patients, then this is a directional hypothesis. We are saying that we expect group A to be greater than group B.
Note. The grey area is 5% of all observations.
One-tailed tests are less conservative, and can increase the chance of finding an effect (i.e., significance). However, you should only use these when there is a valid reason to do so (i.e., you are confirming previous evidence, or having no direction simply doesn’t make much sense). In other words it should be driven by theory.
Two-tailed tests
When there is no direction in our hypothesis, then we use a two-tailed test. Confidence intervals are by default two-tailed (they have two tails above and below the point estimate). The same can apply for the t-test. So, for example, if we wanted to simply check if men or women have significantly different tumour sizes when diagnosed with melanoma, we would use a two-tailed test, as we are not sure which way it can go.
Note. The two grey areas are 2.5% each.
Without going into the details too much, we want our t-statistic to cross the thresholds highlighted in the graphs shown above. In the first graph the critical t threshold (denoted tα) is 1.697, while in the second graph (denoted tα/2) it is 2.042. You can see that crossing the threshold of significance in the second graph is harder, which makes the two-tailed test the more conservative one.
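If you are curious where these two thresholds come from, the following is a minimal Python sketch. Python and the helper names `t_pdf`, `upper_tail` and `critical_value` are illustrative choices, not the workshop software. It assumes 30 degrees of freedom, which is not stated above but is the value for which a t-table gives 1.697 and 2.042. The sketch integrates the t-distribution numerically and searches for the t-value that leaves 5% in one tail, or 2.5% in each of two tails.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(tail_area, df):
    """Find the t-value that leaves tail_area in the upper tail (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail_area:
            lo = mid  # tail still too big, so the threshold is further out
        else:
            hi = mid
    return (lo + hi) / 2

df = 30        # assumed: not given in the handout
alpha = 0.05   # the 5% significance threshold

one_tailed = critical_value(alpha, df)      # all 5% in one tail
two_tailed = critical_value(alpha / 2, df)  # 2.5% in each tail

print(f"one-tailed critical t: {one_tailed:.3f}")
print(f"two-tailed critical t: {two_tailed:.3f}")
```

The two-tailed threshold comes out larger because the same 5% is split across both tails, so each tail starts further from the centre.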
Hypothesis testing
Let’s go back to the melanoma example. We are interested in whether male melanoma patients have significantly larger tumour sizes than female patients. This is directional, so it’s a one-tailed test. We start by formally writing out the null and alternative hypotheses:
H0: μ1 = μ2
H1: μ1 > μ2
Where μ1 is the mean tumour size for males, and μ2 is the mean tumour size for females. Note that I’m asking if this is likely true on a population level (we don’t care about the sample means, we know these will be different).
Next step is to run a t-test and interpret the results. I’ll show you how to do this using software in the workshop. For now, let’s just look at the output and break it down into its parts.
Interpreting the results of an independent samples (two sample) t-test
Here is the output of the one-tailed t-test
There is one statistic that is the most important one to be able to interpret, and that’s the p-value. The p-value is the probability of seeing a difference in the sample means at least as large as this one if the null hypothesis were true. In other words, if the two groups had the exact same population mean, what would be the chance of getting sample means this far apart? According to the results, there is only a 0.39% chance that this would have happened.
The accepted significance threshold for our field is usually 5%, and this is what we will use for this course. That means any p-value less than 0.05 will lead us to reject the null hypothesis. This is universal for all frequentist tests, and makes things very convenient. Otherwise you would need to know how the test works, what the critical values are, and other things. As you will see later when we do ANOVA, the test statistics change, but the p-value remains and is interpreted in the same way.
According to these results, we found enough evidence to suggest that males have a significantly larger tumour size than females when first diagnosed with melanoma, since 0.0039 is less than our 0.05 threshold. This is in contrast with our findings earlier when we used confidence intervals. That is partially because the CI plots were two-tailed while this is a one-tailed test, but also for other reasons, which I will explain later but are not super important for this course. Ultimately, we always trust the t-test more than the CI plots.
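The decision rule above can be sketched in a few lines of Python. This is only an illustration: the t-value of 2.69 and the 203 degrees of freedom are taken from the reported result in the next section, while the helper names `t_pdf` and `upper_tail` are made up for this sketch and are not the workshop software. Because 2.69 is a rounded value, the p-value it produces should land close to, but not exactly on, the 0.0039 quoted above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

t_value = 2.69   # rounded t-statistic from the melanoma example
df = 203         # degrees of freedom from the melanoma example
alpha = 0.05     # significance threshold used in this course

p_one_tailed = upper_tail(t_value, df)  # H1: male mean > female mean
p_two_tailed = 2 * p_one_tailed         # what a two-tailed test would give

reject_null = p_one_tailed < alpha

print(f"one-tailed p = {p_one_tailed:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Notice that the two-tailed p-value is simply double the one-tailed one, which is another way of seeing why the two-tailed test is more conservative.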
Summarising our findings
Now that we’ve interpreted the output, it’s time to tell the reader what we found. Here is how we report the t-test:
According to our findings, male patients have a larger tumour size (mean = 3.41 cm) than female patients (mean = 2.49 cm). This difference is significant [t(203) = 2.69, p < 0.05].
The 203 is the degrees of freedom (df). The 2.69 is the t-value (note that I removed the sign, as it’s not important). Since the p-value is less than 0.05, I just put < 0.05 rather than the exact number.
Paired Samples t-test
Up till now we’ve always had two independent groups, for example males and females.
Sometimes, you want the same people to sit through two different conditions. This is called a paired sample design.
Why would you want to do that? Well, when you have two independent samples you introduce unsystematic error due to e.g., individual differences (you may have placed more people with higher fluctuations of blood sugar in one of the groups, which will be a significant confounder). With paired sample designs you get rid of this error, since it’s the same people used twice.
Paired sample designs are more statistically powerful than independent samples designs (i.e., if there is an effect, you are more likely to find it, or if you prefer, you are more likely to cross the 0.05 significance threshold). They also require less data. However, they are harder to work with, since you need to bring participants in twice. You also need to control for any lingering or learning effects by splitting the group so that half do session 1 first and half do session 2 first, before switching them around.
The assumptions of paired samples t-tests are different. We won’t be covering those in this course, so you don’t need to worry about them.
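To get a feel for why the paired design is more powerful, here is a small Python sketch using made-up scores for five hypothetical participants (the numbers and variable names are invented purely for illustration). It computes the t-statistic twice on the same data: once treating the two sessions as paired, where we test the per-person differences, and once pretending they came from two independent groups, using the pooled-variance formula.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical scores for the same five people in two sessions (made up)
before = [10, 12, 14, 16, 18]
after = [11, 14, 15, 18, 20]

# Paired: work on each person's difference, so individual differences cancel out
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(n))
df_paired = n - 1

# Independent: same numbers, but treated as two unrelated groups of equal size
pooled_var = (variance(before) + variance(after)) / 2
t_indep = (mean(after) - mean(before)) / math.sqrt(pooled_var * 2 / n)
df_indep = 2 * n - 2

print(f"paired:      t({df_paired}) = {t_paired:.2f}")
print(f"independent: t({df_indep}) = {t_indep:.2f}")
```

The mean difference is the same in both calculations. What changes is the noise it is compared against: the paired version only has to beat the spread of the differences, not the spread between people, so its t-value is much larger.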
Interpreting paired samples t-tests
We interpret the test in the exact same way as with the independent samples. Check the p-value and if it is under 0.05, then reject the null hypothesis.
Effect sizes
Note that the p-value is simply the probability of the difference in means occurring under the null hypothesis. It tells us absolutely nothing about how meaningful this difference is. For example, if school A, which has adopted a new learning curriculum, ends up with average scores of 100 in some test, while school B ends up with 98, the difference may be significant (if the sample size is large enough and the standard deviation small enough), but the difference is not very meaningful.
The way to check how meaningful the distance between the two means is, is by using Cohen’s d. Cohen’s d is a simple calculation that can be done by hand if you want:
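The formula itself is not reproduced in this text. A commonly used version, assuming the pooled standard deviation of the two groups is the yardstick (other variants exist), is sketched below. Here the symbols are my own labels: the two x-bars are the group means, s1 and s2 the group standard deviations, and n1 and n2 the group sizes.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

In words: take the difference between the two means and divide it by the typical spread of the data, so that d tells you how many standard deviations apart the groups are.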
The d-statistic tells us how large the effect is. By convention, a d of around 0.2 is considered small, 0.5 medium, and 0.8 large.
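As a rough illustration of the school example, here is a Python sketch that computes Cohen's d with the pooled standard deviation and attaches a label to it. The scores are invented, the function names `cohens_d` and `effect_label` are hypothetical, and treating 0.2, 0.5 and 0.8 as hard cut-offs (with anything under 0.2 called "negligible") is a simplifying assumption, not a rule from this handout.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def effect_label(d):
    """Rough label using the conventional 0.2 / 0.5 / 0.8 benchmarks."""
    size = abs(d)  # the sign only tells us which group is higher
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Made-up test scores: school A averages 100, school B averages 98
school_a = [70, 85, 100, 115, 130]
school_b = [68, 83, 98, 113, 128]

d = cohens_d(school_a, school_b)
print(f"d = {d:.3f} ({effect_label(d)})")
```

With scores this spread out, a 2-point gap between the schools is tiny compared with the variation inside each school, so d ends up far below the 0.2 benchmark regardless of what the p-value says.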
I will show you how to do this in the workshop using software, but for now all you need to know is that p-values should not be used as measures of effect size.
Exercises
Question 1
What are the assumptions that need to be met in order for independent samples t-tests to work properly?
Question 2
What is the difference between a one-tailed and a two-tailed test?
Question 3
Is a one-tailed test more, or less conservative than a two-tailed test? Why is that?
Question 4
Maria wants to examine whether employee satisfaction with the legacy payroll system is less than with the new and shiny online web-based system. Let’s call the legacy system A and the new system B. Formulate the null and alternative hypotheses.
Question 5
Maria wants to increase the statistical power of her test, as well as reduce the number of participants she will need. Should she use a two sample or a paired sample design?
Question 6
Maria ends up using a two sample (independent samples) design. She chooses to use a t-test (possibly wrongly, as satisfaction is usually an ordinal variable, but let’s ignore this for the sake of this exercise). She reports that t(258) = 2.59, p < 0.05. Did Maria find enough evidence to support her hypothesis? Why?
Question 7
Let’s examine the output from Maria’s t-test again.
t(258) = 2.59, p < 0.05
What is the t-value, what are the degrees of freedom, what is the p-value?
Question 8
Maria wants to make her test more conservative and changes the significance threshold from 0.05 to 0.01. Her p-value was actually 0.02. Did she still find evidence to support her hypothesis given her new significance threshold?
Question 9
Maria used Cohen’s d and found d = 0.3. How large is the effect of the IV on the DV?