STA305/1004 - Week 4

(adapted from N. Taback)

Finding Power, Intro to Causal Inference

Week 4 Outline

- Finding Power
- Replication and Power: Case study on Power Poses
- Power and Sample size formulae: Two-sample proportions
- Power via simulation
- Introduction to causal inference:
  - The fundamental problem
  - The assignment mechanism: Weight gain study

Replication and Power: Case Study on Power Poses

(Carney et al. (2010), Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance, Psychological Science, 21(10), 1363-1368)

Can power poses significantly change outcomes in your life?

Study methods (Carney et al. (2010)):

- Randomly assigned 42 participants to the high-power-pose or the low-power-pose condition.
- Participants believed that the study was about the science of physiological recordings and was focused on how placement of electrocardiography electrodes above and below the heart could influence data collection.
- Participants' bodies were posed by an experimenter into high-power or low-power poses. Each participant held two poses for 1 min each.
- Participants' risk taking was measured with a gambling task; feelings of power were measured with self-reports.
- Saliva samples, which were used to test cortisol and testosterone levels, were taken before and approximately 17 min after the power-pose manipulation.

Can power poses significantly change outcomes in your life?

Study results (Carney et al. (2010)):

"As hypothesized, high-power poses caused an increase in testosterone compared with low-power poses, which caused a decrease in testosterone, F(1, 39) = 4.29, p < .05; r = .34. Also as hypothesized, high-power poses caused a decrease in cortisol compared with low-power poses, which caused an increase in cortisol, F(1, 38) = 7.45, p < .02; r = .43."

Can power poses significantly change outcomes in your life?

- The study was replicated by Ranehill et al. (2015).
- An initial power analysis based on the effect sizes in Carney et al. (power = 0.8, α = .05) indicated that a sample size of 100 participants would be suitable.

library(pwr)

pwr.t.test(d=0.6,power = 0.8)

Two-sample t test power calculation

n = 44.58577

d = 0.6

sig.level = 0.05

power = 0.8

alternative = two.sided

NOTE: n is number in *each* group

Can power poses significantly change outcomes in your life?

- Ranehill et al.'s study used a sample of 200 participants to increase reliability.
- This study found none of the significant differences found in Carney et al.'s study.
- The replication study obtained very precise estimates of the effects.
- What happened?

Can power poses significantly change outcomes in your life?

- Sampling theory predicts that the variation between samples is proportional to 1/√n.
- In small samples we can expect variability.
- Many researchers often expect that these samples will be more similar than sampling theory predicts.
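The 1/√n behaviour is easy to check by simulation; the sample sizes and replication count below are arbitrary illustrative choices.

```r
# Standard deviation of the sample mean shrinks like 1/sqrt(n)
set.seed(2301)
se10  <- sd(replicate(5000, mean(rnorm(10))))   # theory: 1/sqrt(10) = 0.316
se100 <- sd(replicate(5000, mean(rnorm(100))))  # theory: 1/sqrt(100) = 0.100
c(se10, se100)
```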

Study replication

Suppose that you have run an experiment on 20 subjects and have obtained a significant result from a two-sided z-test (H0 : µ = 0 vs. H1 : µ ≠ 0) that confirms your theory (z = 2.23, p < 0.05, two-tailed). You are planning to run the same experiment on an additional 10 subjects. What is the probability that the results from this second group, considered separately, will be significant at the 5% level by a one-tailed test (H1 : µ > 0)?
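One way to approximate the answer, under the strong assumption that the true mean equals the value estimated from the first experiment, is to treat the observed z = 2.23 from n = 20 as giving the true effect, then compute the power of the one-tailed test for the new sample of 10:

```r
# Assume the true effect equals the original estimate (a strong assumption)
z_orig    <- 2.23
delta     <- z_orig / sqrt(20)   # implied per-observation mean, in sd units
z_new     <- delta * sqrt(10)    # expected z-statistic for the new 10 subjects
power_new <- pnorm(z_new - qnorm(0.95))  # one-tailed test at the 5% level
power_new  # about 0.47
```

So even if the original effect estimate is exactly right, the small replication is roughly a coin flip.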

Week 4 Outline

- Power and Sample size formulae: Two-sample proportions

Comparing Proportions for Binary Outcomes

- In many clinical trials, the primary endpoint is dichotomous, for example, whether a patient has responded to the treatment, or whether a patient has experienced toxicity.
- Consider a two-arm randomized trial with binary outcomes. Let p1 denote the response rate of the experimental drug, p2 the response rate of the standard drug, and θ = p1 − p2 their difference.

Comparing Proportions for Binary Outcomes

Let Y_ik be the binary outcome for subject i in arm k; that is,

\[
Y_{ik} = \begin{cases} 1 & \text{with probability } p_k \\ 0 & \text{with probability } 1 - p_k, \end{cases}
\]

for i = 1, ..., n_k and k = 1, 2. The sum of independent and identically distributed Bernoulli random variables has a binomial distribution,

\[
\sum_{i=1}^{n_k} Y_{ik} \sim \mathrm{Bin}(n_k, p_k), \quad k = 1, 2.
\]

(Yin, pg. 173-174)
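As a quick illustration (the values of n_k and p_k below are arbitrary), the sum of simulated Bernoulli draws is a single binomial observation, and their mean is the sample proportion:

```r
set.seed(2301)
n_k <- 50
p_k <- 0.3
y <- rbinom(n_k, size = 1, prob = p_k)  # n_k iid Bernoulli(p_k) outcomes
sum(y)    # one observation from Bin(n_k, p_k)
mean(y)   # the sample proportion for this arm
```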

Comparing Proportions for Binary Outcomes

The sample proportion for group k is

\[
\hat{p}_k = \bar{Y}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} Y_{ik}, \quad k = 1, 2,
\]

with E(Ȳ_k) = p_k and Var(Ȳ_k) = p_k(1 − p_k)/n_k.

The goal of the clinical trial is to determine if there is a difference between the two groups using a binary endpoint. That is, we want to test H0 : θ = 0 versus H1 : θ ≠ 0.

The test statistic (assuming that H0 is true) is:

\[
T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} \sim N(0, 1).
\]

Comparing Proportions for Binary Outcomes

The test rejects at level α if and only if

\[
|T| \geq z_{\alpha/2}.
\]

Using the same argument as in the case of continuous endpoints, and ignoring terms smaller than α/2, we can solve for β:

\[
\beta \approx \Phi\left( z_{\alpha/2} - \frac{|\theta_1|}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} \right),
\]

where θ1 is the difference in response rates under the alternative.

Comparing Proportions for Binary Outcomes

We can use this formula to solve for the sample size. If n1 = r · n2, then

\[
n_2 = \frac{(z_{\alpha/2} + z_\beta)^2}{\theta^2} \left( p_1(1-p_1)/r + p_2(1-p_2) \right).
\]
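This formula is straightforward to code directly. The sketch below (the function name is my own) reproduces the equal-allocation example used later; power.prop.test() gives a slightly larger answer because it pools the variance under the null.

```r
# Sample size per group 2 via the normal-approximation formula, with n1 = r * n2
n2_two_prop <- function(p1, p2, r = 1, alpha = 0.05, beta = 0.2) {
  theta <- p1 - p2
  (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 *
    (p1 * (1 - p1) / r + p2 * (1 - p2)) / theta^2
}
n2_two_prop(p1 = 0.2, p2 = 0.25)  # about 1091; power.prop.test() gives 1093.7
```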

Comparing Proportions for Binary Outcomes

- The built-in R function power.prop.test() can be used to calculate sample size or power.
- For example, suppose that the standard treatment for a disease has a response rate of 20%, and an experimental treatment is anticipated to have a response rate of 25%.
- The researchers want both arms to have an equal number of subjects. How many patients should be enrolled if the study will conduct a two-sided test at the 5% level with 80% power?

power.prop.test(p1 = 0.2,p2 = 0.25,power = 0.8)

Two-sample comparison of proportions power calculation

n = 1093.739

p1 = 0.2

p2 = 0.25

sig.level = 0.05

power = 0.8

alternative = two.sided

NOTE: n is number in *each* group

Week 4 Outline

- Power via simulation

Calculating Power by Simulation

I If the test statistic and distribution of the test statistic are known then the

power of the test can be calculated via simulation.

I Consider a two-sample t-test with 30 subjects per group and the standard

deviation of the clinical outcome is known to be 1.

I What is the power of the test H0 : µ1 − µ2 = 0 versus H0 : µ1 − µ2 = 0.5,

at the 5% significance level?

I The power is the proportion of times that the test correctly rejects the null

hypothesis in repeated sampling.

Calculating Power by Simulation

We can simulate a single study using the rnorm() command. Let’s assume that

n1 = n2 = 30, µ1 = 3.5, µ2 = 3, σ = 1, α = 0.05.

set.seed(2301)

t.test(rnorm(30,mean=3.5,sd=1),rnorm(30,mean=3,sd=1),var.equal = T)

Two Sample t-test

data: rnorm(30, mean = 3.5, sd = 1) and rnorm(30, mean = 3, sd = 1)

t = 2.1462, df = 58, p-value = 0.03605

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.03458122 0.99248595

sample estimates:

mean of x mean of y

3.339362 2.825828

Should you reject H0?

Calculating Power by Simulation

- Suppose that 10 studies are simulated.
- What proportion of these 10 studies will reject the null hypothesis at the 5% level?
- To investigate how many times the two-sample t-test will reject at the 5% level, the replicate() command will be used to generate 10 studies and calculate the p-value in each study.
- It will still be assumed that n1 = n2 = 30, µ1 = 3.5, µ2 = 3, σ = 1, α = 0.05.

set.seed(2301)

pvals <- replicate(10,t.test(rnorm(30,mean=3.5,sd=1),

rnorm(30,mean=3,sd=1),

var.equal = T)$p.value)

pvals # print out 10 p-values

[1] 0.03604893 0.15477655 0.01777959 0.40851999 0.34580930 0.11131007

[7] 0.14788381 0.00317709 0.09452230 0.39173723

# power is the proportion of tests that reject at the 5% level

sum(pvals<=0.05)/10

[1] 0.3

Calculating Power by Simulation

But, since we only simulated 10 studies the estimate of power will have a large

standard error. So let’s try simulating 10,000 studies so that we can obtain a

more precise estimate of power.

set.seed(2301)

pvals <- replicate(10000,t.test(rnorm(30,mean=3.5,sd=1),

rnorm(30,mean=3,sd=1),

var.equal = T)$p.value)

sum(pvals<=0.05)/10000

[1] 0.4881

Calculating Power by Simulation

This is much closer to the theoretical power obtained from power.t.test().

power.t.test(n = 30,delta = 0.5,sd = 1,sig.level = 0.05)

Two-sample t test power calculation

n = 30

delta = 0.5

sd = 1

sig.level = 0.05

power = 0.477841

alternative = two.sided

NOTE: n is number in *each* group

Calculating Power by Simulation

- The built-in R functions power.t.test() and power.prop.test() don't have an option for calculating power when there is unequal allocation of subjects between groups.
- These built-in functions also don't have an option to investigate power if other assumptions don't hold (e.g., normality).
- One option is to simulate power for the scenarios that are of interest. Another option is to write your own function using the formula derived above.
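A sketch of such a function, using the unpooled-variance formula derived above (the function name is my own), evaluated at the 0.25-vs-0.20 response rates from the power.prop.test() example:

```r
# Approximate power of a two-sided two-sample test of proportions,
# allowing unequal group sizes (unpooled variance, normal approximation)
power_two_prop <- function(p1, p2, n1, n2, alpha = 0.05) {
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  pnorm(abs(p1 - p2) / se - qnorm(1 - alpha / 2))
}
power_two_prop(0.25, 0.20, n1 = 1500, n2 = 500)  # about 0.66
```

This sits close to a simulation based on prop.test(), which pools the variance under the null and so gives a slightly smaller value.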

Calculating Power by Simulation

- Suppose the standard treatment for a disease has a response rate of 20%, and an experimental treatment is anticipated to have a response rate of 25%.
- The researchers want both arms to have an equal number of subjects.
- The power calculation above revealed that the study will require 1094 patients per group for 80% power.
- What would happen to the power if the researchers instead put 1500 patients in the experimental arm and 500 patients in the control arm?

Calculating Power by Simulation

- The number of subjects in the experimental arm that have a positive response to treatment will be an observation from a Bin(1500, 0.25) distribution.
- The number of subjects that have a positive response to the standard treatment will be an observation from a Bin(500, 0.2) distribution.
- We can obtain simulated responses from these distributions using the rbinom() command in R.

set.seed(2301)
rbinom(1, 1500, 0.25)
rbinom(1, 500, 0.20)

Calculating Power by Simulation

- The p-value for a single simulated study can be obtained using prop.test().

set.seed(2301)
prop.test(x = c(rbinom(1, 1500, 0.25), rbinom(1, 500, 0.20)),
          n = c(1500, 500), correct = F)

Calculating Power by Simulation

- A power simulation repeats this process a large number of times.
- In the example below we simulate 10,000 hypothetical studies to calculate power.

set.seed(2301)

pvals <- replicate(10000,

prop.test(x=c(rbinom(n = 1,size = 1500,prob = 0.25),

rbinom(n=1,size=500,prob=0.20)),

n=c(1500,500),correct=F)$p.value)

sum(pvals<=0.05)/10000

[1] 0.6231

If the researchers decide to have a 3:1 allocation ratio of patients in the

treatment to control arm then the power will be _____?

Week 4 Outline

- Introduction to Causal Inference

Introduction to causal inference - Bob’s headache

- Suppose Bob, at a particular point in time, is contemplating whether or not to take an aspirin for a headache.
- There are two treatment levels: taking an aspirin, and not taking an aspirin.
- If Bob takes the aspirin, his headache may be gone, or it may remain, say, an hour later; we denote this outcome, which can be either "Headache" or "No Headache," by Y(Aspirin).
- Similarly, if Bob does not take the aspirin, his headache may remain an hour later, or it may not; we denote this potential outcome by Y(No Aspirin), which also can be either "Headache" or "No Headache."
- There are therefore two potential outcomes, Y(Aspirin) and Y(No Aspirin), one for each level of the treatment. The causal effect of the treatment involves the comparison of these two potential outcomes.

Introduction to causal inference - Bob’s headache

Because in this example each potential outcome can take on only two values, the unit-level causal effect (the comparison of these two outcomes for the same unit) involves one of four (two by two) possibilities:

1. Headache gone only with aspirin: Y(Aspirin) = No Headache, Y(No Aspirin) = Headache
2. No effect of aspirin, with a headache in both cases: Y(Aspirin) = Headache, Y(No Aspirin) = Headache
3. No effect of aspirin, with the headache gone in both cases: Y(Aspirin) = No Headache, Y(No Aspirin) = No Headache
4. Headache gone only without aspirin: Y(Aspirin) = Headache, Y(No Aspirin) = No Headache

Introduction to causal inference - Bob’s headache

There are two important aspects of this definition of a causal effect.

1. The definition of the causal effect depends on the potential outcomes, but it does not depend on which outcome is actually observed.
2. The causal effect is the comparison of potential outcomes, for the same unit, at the same moment in time post-treatment.

- The causal effect is not defined in terms of comparisons of outcomes at different times, as in a before-and-after comparison of my headache before and after deciding to take or not to take the aspirin.

The fundamental problem of causal inference

"The fundamental problem of causal inference" (Holland, 1986, p. 947) is the problem that at most one of the potential outcomes can be realized and thus observed.

- If the action you take is Aspirin, you observe Y(Aspirin) and will never know the value of Y(No Aspirin), because you cannot go back in time.
- Similarly, if your action is No Aspirin, you observe Y(No Aspirin) but cannot know the value of Y(Aspirin).
- In general, therefore, even though the unit-level causal effect (the comparison of the two potential outcomes) may be well defined, by definition we cannot learn its value from just the single realized potential outcome.

The fundamental problem of causal inference

The outcomes that would be observed under control and treatment conditions are often called counterfactuals or potential outcomes.

- If Bob took aspirin for his headache then he would be assigned to the treatment condition, so Ti = 1.
- Then Y(Aspirin) is observed and Y(No Aspirin) is the unobserved counterfactual outcome; it represents what would have happened to Bob if he had not taken aspirin.
- Conversely, if Bob had not taken aspirin, then Y(No Aspirin) is observed and Y(Aspirin) is counterfactual.
- In either case, a simple treatment effect for Bob can be defined as

  treatment effect for Bob = Y(Aspirin) − Y(No Aspirin).

- The problem is that we can only observe one outcome.

The assignment mechanism

- Assignment mechanism: The process for deciding which units receive treatment and which receive control.
- Ignorable Assignment Mechanism: The assignment of treatment or control for all units is independent of the unobserved potential outcomes ("nonignorable" means not ignorable).
- Unconfounded Assignment Mechanism: The assignment of treatment or control for all units is independent of all potential outcomes, observed or unobserved ("confounded" means not unconfounded).

The assignment mechanism

- Suppose that a doctor prescribes surgery (labeled 1) or drug (labeled 0) for a certain condition.
- The doctor knows enough about the potential outcomes of the patients to assign each patient the treatment that is more beneficial to that patient.

unit         Yi(0)   Yi(1)   Yi(1) − Yi(0)
patient #1     1       7           6
patient #2     6       5          -1
patient #3     1       5           4
patient #4     8       7          -1
Average        4       6           2

Y is years of post-treatment survival.

The assignment mechanism

- Patients 1 and 3 will receive surgery, and patients 2 and 4 will receive drug treatment.
- The observed treatments and outcomes are shown in this table (a "?" marks the unobserved potential outcome).

unit         Ti   Yi_obs   Yi(1)   Yi(0)
patient #1    1      7       7       ?
patient #2    0      6       ?       6
patient #3    1      5       5       ?
patient #4    0      8       ?       8

Average observed outcome, drug (Ti = 0): 7
Average observed outcome, surgery (Ti = 1): 6

- This shows that we can reach invalid conclusions if we look at the observed values of potential outcomes without considering how the treatments were assigned.
- The assignment mechanism depended on the potential outcomes and was therefore nonignorable (implying that it was confounded).
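The bias can be reproduced in a few lines of R using the potential outcomes from the table above:

```r
# Potential outcomes for the four patients (from the table)
y0 <- c(1, 6, 1, 8)
y1 <- c(7, 5, 5, 7)
t  <- as.numeric(y1 >= y0)   # doctor assigns the more beneficial treatment
mean(y1) - mean(y0)          # true average causal effect: 2
y_obs <- ifelse(t == 1, y1, y0)
# observed surgery mean minus observed drug mean: 6 - 7 = -1
mean(y_obs[t == 1]) - mean(y_obs[t == 0])
```

The observed comparison has the wrong sign: surgery looks worse even though, on average, it is better.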

The assignment mechanism

The observed difference in means is entirely misleading in this situation. The

biggest problem when using the difference of sample means here is that we have

effectively pretended that we had an unconfounded treatment assignment when

in fact we did not. This example demonstrates the importance of finding a

statistic that is appropriate for the actual assignment mechanism.

The assignment mechanism

Is the treatment assignment ignorable?

- Suppose that a doctor prescribes surgery (labeled 1) or drug (labeled 0) for a certain condition by tossing a biased coin that depends on Yi(0) and Yi(1), where Y is years of post-treatment survival.
- If Yi(1) ≥ Yi(0) then P(Ti = 1 | Yi(0), Yi(1)) = 0.8.
- If Yi(1) < Yi(0) then P(Ti = 1 | Yi(0), Yi(1)) = 0.3.

unit         Yi(0)   Yi(1)   p1    p0
patient #1     1       7     0.8   0.2
patient #2     6       5     0.3   0.7
patient #3     1       5     0.8   0.2
patient #4     8       7     0.3   0.7

where p1 = P(Ti = 1 | Yi(0), Yi(1)) and p0 = P(Ti = 0 | Yi(0), Yi(1)).
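Even this stochastic version is nonignorable: the assignment probabilities depend on the potential outcomes. A quick simulation (the replication count is an arbitrary choice) shows that the observed mean difference is badly biased for the true average effect of 2:

```r
# Potential outcomes and biased-coin assignment probabilities from the slide
y0 <- c(1, 6, 1, 8)
y1 <- c(7, 5, 5, 7)
p1 <- ifelse(y1 >= y0, 0.8, 0.3)  # P(T_i = 1 | Y_i(0), Y_i(1))
set.seed(2301)
diffs <- replicate(10000, {
  t <- rbinom(4, 1, p1)
  if (all(t == 1) || all(t == 0)) return(NA)  # need both groups to compare
  y_obs <- ifelse(t == 1, y1, y0)
  mean(y_obs[t == 1]) - mean(y_obs[t == 0])
})
mean(diffs, na.rm = TRUE)  # far below the true average effect of 2
```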

Weight gain study

From Holland and Rubin (1983).

"A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex differences in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his [or her] arrival in September and his [or her] weight the following June are recorded."

- The average weight for males was 180 in both September and June. Thus, the average weight gain for males was zero.
- The average weight for females was 130 in both September and June. Thus, the average weight gain for females was zero.
- Question: What is the differential causal effect of the diet on male weights and on female weights?
- Statistician 1: Look at gain scores: No effect of diet on weight for either males or females, and no evidence of a differential effect between the sexes, because neither group shows any systematic change.
- Statistician 2: Compare June weight for males and females with the same weight in September: On average, for a given September weight, men weigh more in June than women. Thus, the new diet leads to more weight gain for men.
- Is Statistician 1 correct? Statistician 2? Neither? Both?

Weight gain study

Questions:

1. What are the units?

2. What are the treatments?

3. What is the assignment mechanism?

4. Is the assignment mechanism useful for causal inference?

5. Would it have helped if all males received the dining hall diet and all

females received the control diet?

6. Is Statistician 1 or Statistician 2 correct?

Getting around the fundamental problem by using close substitutes

- Are there situations where you can measure both Yi(0) and Yi(1) on the same unit?
- Drink tea one night and milk another night, then measure the amount of sleep. What has been assumed?
- Divide a piece of plastic into two parts, then expose each piece to a corrosive chemical. What has been assumed?
- Measure the effect of a new diet by comparing your weight before the diet and your weight after. What has been assumed?
- There are strong assumptions implicit in these types of strategies.

Getting around the fundamental problem by using randomization and

experimentation

- The "statistical" idea is to use the outcomes observed on a sample of units to learn about the distribution of outcomes in the population.
- The basic idea is that since we cannot compare treatment and control outcomes for the same units, we try to compare them on similar units.
- Similarity can be attained by using randomization to decide which units are assigned to the treatment group and which units are assigned to the control group.

Getting around the fundamental problem by using randomization and

experimentation

- It is not always possible to achieve close similarity between the treated and control groups in a causal study.
- In observational studies, units often end up treated or not based on characteristics that are predictive of the outcome of interest (for example, men enter a job training program because they have low earnings, and future earnings is the outcome of interest).
- Randomized experiments can be impractical or unethical.
- When treatment and control groups are not similar, modeling or other forms of statistical adjustment can be used to fill in the gap.

Fisherian Randomization Test

- The randomization test is related to a stochastic proof by contradiction assessing the plausibility of the null hypothesis of no treatment effect.
- The null hypothesis is Yi(0) = Yi(1) for all units.
- Under the null hypothesis all potential outcomes are known from Y_obs, since Y_obs = Y(1) = Y(0).
- Under the null hypothesis, the observed value of any statistic, such as ȳ1 − ȳ0, is known for all possible treatment assignments.
- The randomization distribution of ȳ1 − ȳ0 can then be obtained.
- Unless the data suggest that the null hypothesis of no treatment effect is false, it is difficult to claim evidence that the treatments are different.
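A minimal sketch of the procedure on made-up data (the outcomes and assignment below are hypothetical):

```r
set.seed(2301)
y   <- c(5.1, 3.8, 6.2, 4.9, 5.5, 4.2, 3.9, 4.8)  # hypothetical outcomes
trt <- c(1, 1, 1, 1, 0, 0, 0, 0)                  # actual assignment
obs_diff <- mean(y[trt == 1]) - mean(y[trt == 0]) # 5.0 - 4.6 = 0.4
# Under H0 every outcome is fixed, so recompute the statistic
# over random re-assignments of the treatment labels
perm_diffs <- replicate(10000, {
  t_new <- sample(trt)
  mean(y[t_new == 1]) - mean(y[t_new == 0])
})
mean(abs(perm_diffs) >= abs(obs_diff))  # randomization p-value
```

The p-value is the proportion of re-randomizations giving a difference at least as extreme as the one observed.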
