A/B Testing: Designs and Analysis
Feifang Hu
Department of Statistics
George Washington University
Email: [email protected]
Fall 2022, Washington, DC, USA
Three Essential Components of Statistics
(Data Science):
Data+Computer+Analytics
1 Introduction
1.1 What is A/B testing?
An A/B test is shorthand for a simple controlled experiment. As the
name implies, two versions (A and B) of a single variable are
compared, which are identical except for one variation that might
affect a user’s behavior. A/B tests are widely considered the simplest
form of controlled experiment. Adding more variants to the test makes
it more complex.
A/B testing is the process of comparing two variations of a page
element, usually by testing users’ response to variant A vs variant B,
and concluding which of the two variants is more effective.
A/B tests are useful for understanding user engagement with, and
satisfaction with, online features such as a new feature or product. Large
companies use A/B testing to make user experiences more successful and
as a way to streamline their services.
Today, A/B tests are being used to study more complex questions,
such as network effects when users are offline, how online services
affect user actions, and how users influence one another. Many
professionals use data from A/B tests, including data engineers,
marketers, designers, software engineers, and entrepreneurs. They rely
on this data because it allows companies to understand
growth, increase revenue, and optimize customer satisfaction.
Version A might be the currently used version (control), while version
B is modified in some respect (treatment). For instance, on an
e-commerce website the purchase funnel is typically a good candidate
for A/B testing, as even marginal decreases in drop-off rates can
represent a significant gain in sales. Significant improvements can
sometimes be seen through testing elements like copy text, layouts,
images and colors, but not always. In these tests, users only see one of
two versions, as the goal is to discover which of the two versions is
preferable.
Controlled experiments have a long and fascinating history. They are
sometimes called A/B tests, A/B/C tests (multiple variants), field
experiments, randomized controlled experiments, split tests, bucket
tests, and flights.
1.2 Online experiments
Example 1. Online A/B testing. (Kohavi and Thomke, 2017,
Harvard Business Review.) Leading online companies each
conduct more than 10,000 online controlled experiments annually, with
many tests engaging millions of users.
Amazon’s experiment.
Treatment A: Credit card offers on front page.
Treatment B: Credit card offers on the shopping cart page.
This (change from A to B) boosted profits by tens of millions of US
Dollars annually.
1.2.1 A/B Testing in eCommerce Industry
Through A/B testing, online stores can increase the average order
value, optimize their checkout funnel, reduce cart abandonment rate,
and so on. You may try testing how and where shipping cost is displayed;
whether, where, and how a free-shipping offer is highlighted; text and color
tweaks on the payment page or checkout page; the visibility of reviews
or ratings; and so on.
In the eCommerce industry, Amazon is at the forefront in conversion
optimization partly due to the scale they operate at and partly due to
their immense dedication to providing the best customer experience.
Among the many revolutionary practices they brought to the
eCommerce industry, the most influential one has been their ‘1-Click
Ordering’. Introduced in the late 1990s after much testing and
analysis, 1-Click Ordering lets users make purchases without having to
use the shopping cart at all. Once users enter their default billing card
details and shipping address, all they need to do is click on the button
and wait for the ordered products to get delivered. Users don’t have to
enter their billing and shipping details again while placing any orders.
With the 1-Click Ordering, it became impossible for users to ignore the
ease of purchase and go to another store. This change had such a
huge business impact that Amazon got it patented (now expired) in
1999. In fact, in 2000, even Apple bought a license for the same to be
used in their online store.
People working to optimize Amazon’s website do not have sudden
‘Eureka’ moments for every change they make. It is through
continuous and structured A/B testing that Amazon is able to deliver
the kind of user experience that it does. Every change on the website
is first tested on their audience and then deployed. If you look
closely at Amazon’s purchase funnel, you will see that even though
the funnel more or less replicates other websites’ purchase funnels,
each and every element in it is fully optimized and matches the
audience’s expectations.
Every page, starting from the homepage to the payment page, only
contains the essential details and leads to the exact next step required
to push the users further into the conversion funnel. Additionally, using
extensive user insights and website data, each step is simplified as far
as possible to match users’ expectations.
Take their omnipresent shopping cart, for example. There is a small
cart icon at the top right of Amazon’s homepage that stays visible no
matter which page of the website you are on.
The icon is not just a shortcut to the cart or reminder for added
products. In its current version, it offers 5 options:
(i) Continue shopping (if there are no products added to the cart)
(ii) Learn about today’s deals (if there are no products added to the
cart)
(iii) Wish List (if there are no products added to the cart)
(iv) Empty cart
(v) Proceed to checkout (when there are products in the cart). Sign in
to turn on 1-Click Checkout (when there are products in the cart).
With one click on the tiny icon offering so many options, the user’s
cognitive load is reduced, and they have a great user experience. The
same cart page also suggests
similar products so that customers can navigate back into the website
and continue shopping. All this is achieved with one weapon: A/B
testing.
1.2.2 A/B Testing in Travel Industry
Increase the number of successful bookings on your website or mobile
app, your revenue from ancillary purchases, and much more through
A/B testing. You may try testing your
search results page, ancillary product presentation, your checkout
progress bar, and so on.
In the travel industry, Booking.com easily surpasses all other
eCommerce businesses when it comes to using A/B testing for their
optimization needs. They test like it’s nobody’s business. From the
day of its inception, Booking.com has treated A/B testing as the
treadmill that introduces a flywheel effect for revenue. The scale at
which Booking.com A/B tests is unmatched, especially when it comes
to testing their copy. While you are reading this, there are nearly 1000
A/B tests running on Booking.com’s website.
Even though Booking.com has been A/B testing for more than a
decade now, they still think there is more that they can do to improve
user experience. And this is what makes Booking.com the ace in the
game. Since the company started, Booking.com has incorporated A/B
testing into its everyday work process. They have increased their
testing velocity to its current rate by eliminating HiPPOs (decisions
driven by the highest paid person’s opinion) and giving
priority to data before anything else. To increase testing
velocity even further, all of Booking.com’s employees were allowed to
run tests on ideas they thought could help grow the business.
This example will demonstrate the lengths to which Booking.com can
go to optimize their users’ interaction with the website. Booking.com
decided to broaden its reach in 2017 by offering rental properties for
vacations alongside hotels. This led to Booking.com partnering with
Outbrain, a native advertising platform, to help grow their global
property owner registration.
Within the first few days of the launch, the team at Booking.com
realized that even though a lot of property owners completed the first
sign-up step, they got stuck in the next steps. At this time, pages built
for the paid search of their native campaigns were used for the sign-up
process.
The two teams decided to work together and created three versions of
landing page copy for Booking.com. Additional details such as social
proof, awards and recognitions, and user rewards were added to the
variations.
The test ran for two weeks and produced a 25% uplift in owner
registration. The test results also showed a significant decrease in the
cost of each registration.
1.2.3 A/B Testing in B2B/SaaS Industry
Generate high-quality leads and drive other desired
actions by testing and polishing important elements of your demand
generation engine. To get to these goals, marketing teams put up the
most relevant content on their website, send out ads to prospective
buyers, conduct webinars, put up special sales, and much more. But
all their effort would go to waste if the landing page which clients are
directed to is not fully optimized to give the best user experience. The
aim of SaaS (Software as a Service) A/B testing is to provide the best
user experience and to improve conversions. You can try testing your
lead form components, free trial sign-up flow, homepage messaging,
and so on.
POSist, a leading SaaS-based restaurant management platform with
more than 5,000 customers at over 100 locations across six countries,
wanted to increase their demo requests. Their website homepage and
Contact Us page are the most important pages in their funnel. The
team at POSist wanted to reduce drop-off on these pages. To achieve
this, the team created two variations of the homepage as well as two
variations of the Contact Us page to be tested.
The team at POSist hypothesized that adding more relevant and
conversion-focused content to the website will improve user
experience, as well as generate higher conversions. So they created
two variations to be tested against the control.
The control was first tested against Variation 1, and the winner was
Variation 1. To further improve the page, Variation 1 was then
tested against Variation 2, and the winner was Variation 2. The new
variation increased page visits by about 5%.
1.3 Clinical trials
Example 2. HIV transmission. Connor et al. (1994, The New
England Journal of Medicine) report a clinical trial to evaluate the
drug AZT in reducing the risk of maternal-infant HIV transmission.
A 50-50 randomization scheme was used:
• AZT Group—239 pregnant women (20 HIV positive infants).
• placebo group—238 pregnant women (60 HIV positive
infants).
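One way to quantify how strongly these counts favor AZT is a pooled two-proportion z-test. The sketch below is only an illustration: the choice of test and the helper name `two_proportion_z_test` are assumptions made here, not the analysis reported by Connor et al.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the example: 20 of 239 (AZT) and 60 of 238 (placebo).
z, p_value = two_proportion_z_test(20, 239, 60, 238)
print(f"AZT rate = {20 / 239:.3f}, placebo rate = {60 / 238:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

A large negative z means the transmission rate in the AZT group is well below the placebo rate relative to sampling noise.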
Given the seriousness of the outcome of this study, it is reasonable to
argue that 50-50 allocation was unethical. As accruing information
favoring (albeit, not conclusively) the AZT treatment became
available, allocation probabilities should have been shifted from
50-50 allocation toward allocation proportional to the weight of evidence for
AZT. Designs which attempt to do this are called response-adaptive
randomization designs.
If the treatment assignments had been done with the doubly adaptive
biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics)
with urn target:
• AZT Group— 360 patients
• placebo group—117 patients
then, only 60 (instead of 80) infants would be HIV positive.
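The figure of about 60 can be checked with simple arithmetic. The sketch below assumes that the transmission rates observed in the trial (20/239 for AZT and 60/238 for placebo) would have stayed the same under the different allocation; the function name is illustrative.

```python
# Transmission rates observed in the trial.
rate_azt = 20 / 239
rate_placebo = 60 / 238

def expected_positive(n_azt, n_placebo):
    """Expected number of HIV-positive infants for a given allocation."""
    return n_azt * rate_azt + n_placebo * rate_placebo

actual = expected_positive(239, 238)     # the 50-50 allocation that was used
adaptive = expected_positive(360, 117)   # the allocation quoted for the DBCD

print(f"50-50 allocation:   {actual:.1f} expected HIV-positive infants")
print(f"360/117 allocation: {adaptive:.1f} expected HIV-positive infants")
```

Under that assumption the 360/117 split gives roughly 30 expected cases in each arm, hence about 60 in total rather than 80.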
Example 3: Remdesivir-COVID-19 trial (China). The trial of remdesivir
in adults with severe COVID-19 (Wang et al. 2020) is a
randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare remdesivir with placebo. There were 236 patients
in the trial. There are about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of this trial, some important imbalances
existed at enrollment between the groups, including more patients with
hypertension, diabetes, or coronary artery disease in the Remdesivir
group than the placebo group.
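A minimal sketch of stratified permuted block randomization is given below. It is not the trial's actual procedure: the stratum names and sizes are hypothetical, and the 20:10 split within each block of 30 is an assumption (the text above gives only the block size, not the allocation ratio).

```python
import random

def permuted_block_sequence(n_patients, block, rng):
    """Assign n_patients by repeatedly shuffling a fixed block of arm labels."""
    sequence = []
    while len(sequence) < n_patients:
        shuffled = block[:]        # copy so the template block is untouched
        rng.shuffle(shuffled)      # random order within the block
        sequence.extend(shuffled)
    return sequence[:n_patients]   # the last block may be incomplete

rng = random.Random(2020)          # fixed seed so the sketch is reproducible
block = ["remdesivir"] * 20 + ["placebo"] * 10   # assumed split, block size 30

# Hypothetical strata; each stratum gets its own independent sequence.
strata = {"low respiratory support": 90, "high respiratory support": 45}
allocation = {name: permuted_block_sequence(n, block, rng)
              for name, n in strata.items()}

for name, arms in allocation.items():
    print(name, arms.count("remdesivir"), arms.count("placebo"))
```

Blocking within strata balances the treatment totals for the stratification variable only. Covariates that are not used for stratification, such as hypertension or diabetes, can still end up imbalanced by chance, which is the issue noted above.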
Example 4: Moderna COVID-19 vaccine trial (2020). The
trial began on July 27, 2020, and enrolled 30,420 adult volunteers at
clinical research sites across the United States. Volunteers were
randomly assigned 1:1 to receive either two 100 microgram (mcg)
doses of the investigational vaccine or two shots of saline placebo 28
days apart. The average age of volunteers is 51 years. Approximately
47% are female, 25% are 65 years or older and 17% are under the age
of 65 with medical conditions placing them at higher risk for severe
COVID-19. Approximately 79% of participants are white, 10% are
Black or African American, 5% are Asian, 0.8% are American Indian or
Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%
are multiracial, and 21% (of any race) are Hispanic or Latino.
From the start of the trial through Nov. 25, 2020, investigators
recorded 196 cases of symptomatic COVID-19 occurring among
participants at least 14 days after they received their second shot. One
hundred and eighty-five cases (30 of which were classified as severe
COVID-19) occurred in the placebo group and 11 cases (0 of which
were classified as severe COVID-19) occurred in the group receiving
mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%
lower in those participants who received mRNA-1273 as compared to
those receiving placebo.
Investigators observed 236 cases of symptomatic COVID-19 among
participants at least 14 days after they received their first shot, with
225 cases in the placebo group and 11 cases in the group receiving
mRNA-1273. The vaccine efficacy was 95.2% for this secondary
analysis.
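The reported efficacies can be roughly reproduced from the case counts alone. The sketch below assumes the two arms were of equal size with equal follow-up, which the 1:1 randomization makes approximately true; the function name is illustrative. Under that assumption the denominators cancel and efficacy reduces to one minus the ratio of case counts.

```python
def approx_vaccine_efficacy(cases_vaccine, cases_placebo):
    """VE = 1 - (attack rate, vaccine arm) / (attack rate, placebo arm).

    With equal-sized arms the denominators cancel, leaving
    1 - cases_vaccine / cases_placebo.
    """
    return 1 - cases_vaccine / cases_placebo

primary = approx_vaccine_efficacy(11, 185)     # cases 14+ days after the second shot
secondary = approx_vaccine_efficacy(11, 225)   # cases 14+ days after the first shot

print(f"primary analysis:   {primary:.1%}")
print(f"secondary analysis: {secondary:.1%}")
```

The approximation gives about 94.1% and 95.1%. The small gap from the reported 95.2% is expected, since the actual arm sizes and follow-up times were not exactly equal.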
Long-term Treatment Effects?
1.4 Economics and Social Science
Political A/B testing
A/B tests are used not only by corporations; they also drive
political campaigns. In 2007, Barack Obama’s presidential campaign
used A/B testing as a way to attract online interest and understand
what voters wanted to see from the presidential candidate. For
example, Obama’s team tested four distinct buttons on their website
and six different accompanying images to draw in users. Through A/B
testing, staffers were able to determine how to effectively draw in
voters.
Example 5. The Project GATE (Growing America Through
Entrepreneurship), sponsored by the U.S. Department of Labor, was
designed to evaluate the impact of offering tuition-free
entrepreneurship training services (GATE services) on helping clients
create, sustain or expand their own business.
(https://www.doleta.gov/reports/projectgate/)
The cornerstone is complete randomization. Members of the
treatment group were offered GATE services; members of the control
group were not.
• n = 4,198 participants
• p = 105 covariates
1.5 Biological, psychological, and agricultural
research
Controlled experiments were mainly developed in these areas between
1900 and 1950.
(i) The history of experiment design;
(ii) A/B testing in medical studies;
(iii) Online controlled experiments (A/B testing).
2 The history of experiment design
2.1 Experiment design before Fisher
Statistical experiments, following Charles S. Peirce: a theory of statistical
inference was developed by Charles S. Peirce in “Illustrations of the
Logic of Science” (1877–1878) and “A Theory of Probable Inference”
(1883), two publications that emphasized the importance of
randomization-based inference in statistics.
Randomized experiments: Charles S. Peirce randomly assigned
volunteers to a blinded, repeated-measures design to evaluate their
ability to discriminate weights. Peirce’s experiment inspired other
researchers in psychology and education, who developed a research
tradition of randomized experiments in laboratories and specialized
textbooks in the 1800s.
Optimal designs for regression models:
Charles S. Peirce also contributed the first English-language
publication on an optimal design for regression models in 1876. A
pioneering optimal design for polynomial regression was suggested by
Gergonne in 1815. In 1918, Kirstine Smith published optimal designs
for polynomials of degree six (and less).
2.2 Fisher’s principles
A methodology for designing experiments was proposed by Ronald
Fisher, in his innovative books: The Arrangement of Field Experiments
(1926) and The Design of Experiments (1935). Much of his pioneering
work dealt with agricultural applications of statistical methods. As a
mundane example, he described how to test the lady tasting tea
hypothesis, that a certain lady could distinguish by flavour alone
whether the milk or the tea was first placed in the cup. These
methods have been broadly adapted in biological, psychological, and
agricultural research.
2.2.1 Comparison
In some fields of study it is not possible to have independent
measurements to a traceable metrology standard. Comparisons
between treatments are much more valuable and are usually preferable;
treatments are often compared against a scientific control or traditional
treatment that acts as a baseline.
2.2.2 Randomization
Random assignment is the process of assigning individuals at random
to groups, or to different conditions within a group, in an experiment, so that each
individual has the same chance of being assigned to any given
group. The random assignment of individuals to
groups (or conditions within a group) distinguishes a rigorous, “true”
experiment from an observational study or “quasi-experiment”. There
is an extensive body of mathematical theory that explores the
consequences of making the allocation of units to treatments by means
of some random mechanism (such as tables of random numbers, or the
use of randomization devices such as playing cards or dice). Assigning
units to treatments at random tends to mitigate confounding, which
makes effects due to factors other than the treatment appear to
result from the treatment.
The risks associated with random allocation (such as having a serious
imbalance in a key characteristic between a treatment group and a
control group) are calculable and hence can be managed down to an
acceptable level by using enough experimental units. However, if the
population is divided into several subpopulations that somehow differ,
and the research requires each subpopulation to be equal in size,
stratified sampling can be used. In that way, the units in each
subpopulation are randomized, but not the whole sample. The results
of an experiment can be generalized reliably from the experimental
units to a larger statistical population of units only if the experimental
units are a random sample from the larger population; the probable
error of such an extrapolation depends on the sample size, among
other things.
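The claim that the risk of a serious chance imbalance shrinks as more units are used can be illustrated with a small simulation. Everything specific in the sketch below is an assumption chosen for illustration: a single binary characteristic with prevalence 0.5, a 15 percentage point threshold for "serious" imbalance, and the helper name `imbalance_probability`.

```python
import random

def imbalance_probability(n_units, threshold=0.15, prevalence=0.5,
                          n_sims=500, seed=1):
    """Estimate P(|rate in A - rate in B| > threshold) under random allocation."""
    rng = random.Random(seed)
    half = n_units // 2
    hits = 0
    for _ in range(n_sims):
        # Hypothetical binary characteristic (e.g. a risk factor) for each unit.
        covariate = [rng.random() < prevalence for _ in range(n_units)]
        # Random allocation: a randomly chosen half of the units goes to group A.
        group_a = set(rng.sample(range(n_units), half))
        count_a = sum(covariate[i] for i in group_a)
        count_b = sum(covariate) - count_a
        rate_a = count_a / half
        rate_b = count_b / (n_units - half)
        if abs(rate_a - rate_b) > threshold:
            hits += 1
    return hits / n_sims

results = {n: imbalance_probability(n) for n in (20, 200, 1000)}
for n, prob in results.items():
    print(f"n = {n:4d}: estimated P(imbalance > 15 points) = {prob:.3f}")
```

With 20 units a gap of this size is common, while with 1,000 units it is rare, which is the sense in which the risk is calculable and manageable through sample size.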
2.2.3 Statistical replication
Measurements are usually subject to variation and measurement
uncertainty; thus they are repeated and full experiments are replicated
to help identify the sources of variation, to better estimate the true
effects of treatments, to further strengthen the experiment’s reliability
and validity, and to add to the existing knowledge of the topic.
However, certain conditions must be met before the replication of the
experiment is commenced: the original research question has been
published in a peer-reviewed journal or widely cited, the researcher is
independent of the original experiment, the researcher must first try to
replicate the original findings using the original data, and the write-up
should state that the study conducted is a replication study that tried
to follow the original study as strictly as possible.
2.2.4 Blocking
Blocking is the non-random arrangement of experimental units into
groups (blocks) consisting of units that are similar to one another.
Blocking reduces known but irrelevant sources of variation between
units and thus allows greater precision in the estimation of the source
of variation under study.
2.2.5 Orthogonality
Orthogonality concerns the forms of comparison (contrasts) that can
be legitimately and efficiently carried out. Contrasts can be
represented by vectors and sets of orthogonal contrasts are
uncorrelated and independently distributed if the data are normal.
Because of this independence, each orthogonal contrast provides
different information from the others. If there are T treatments and T–1
orthogonal contrasts, all the information that can be captured from
the experiment is obtainable from the set of contrasts.
Example 2.1. Measurement Error: We would like to measure
the weight of a subject A by using a scale. We know that the
scale has measurement error. Suppose that the error follows a normal distribution
with mean 0 and variance $\sigma^2$. Mathematically, we may write:
$w_1 = A + e_1$,
where $A$ is the true weight, $w_1$ is the observed weight, and $e_1$ is the
measurement error.
Figure 1: A scale to measure subject A
Now we would like to measure the weights of two subjects A and B by
using the same scale twice. What should we do?
Method 1:
$w_1 = A + e_1$ and $w_2 = B + e_2$.
Figure 2: Subject B
Method 2:
$w_3 = A + B + e_3$ and $w_4 = A - B + e_4$.
Figure 3: A + B
Figure 4: A - B
The measurement errors:
Method 1:
Subject A: $e_1 \sim N(0, \sigma^2)$.
Subject B: $e_2 \sim N(0, \sigma^2)$.
Method 2:
Subject A: $(e_3 + e_4)/2 \sim N(0, \sigma^2/2)$.
Subject B: $(e_3 - e_4)/2 \sim N(0, \sigma^2/2)$.
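The halving of the error variance can be checked by simulation. Under Method 2 the natural estimate of A is (w3 + w4)/2 = A + (e3 + e4)/2, and the estimate of B is (w3 − w4)/2. The sketch below assumes independent normal errors; the true weights, the value of σ, and the function name are arbitrary illustrative choices.

```python
import random
import statistics

def simulate(true_a=70.0, true_b=50.0, sigma=1.0, n_rep=20000, seed=7):
    """Return the sampling variance of the estimate of A under each method."""
    rng = random.Random(seed)
    method1, method2 = [], []
    for _ in range(n_rep):
        # Method 1: weigh A alone; the estimate of A is the single reading w1.
        w1 = true_a + rng.gauss(0, sigma)
        method1.append(w1)
        # Method 2: weigh A + B together, then the difference A - B.
        w3 = true_a + true_b + rng.gauss(0, sigma)
        w4 = true_a - true_b + rng.gauss(0, sigma)
        method2.append((w3 + w4) / 2)   # estimate of A
    return statistics.variance(method1), statistics.variance(method2)

var1, var2 = simulate()
print(f"Method 1 variance of the estimate of A: {var1:.3f}")
print(f"Method 2 variance of the estimate of A: {var2:.3f}")
```

Both methods use the scale exactly twice, yet Method 2 estimates each weight with about half the variance, because both readings carry information about both subjects.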
Factorial experiments should be used instead of the one-factor-at-a-time
method. These are efficient at evaluating the effects and possible
interactions of several factors (independent variables). Analysis of
experiment design is built on the foundation of the analysis of
variance, a collection of models that partition the observed variance
into components, according to what factors the experiment must
estimate or test.
2.2.6 Avoiding false positives
False positive conclusions, often resulting from the pressure to publish
or the author’s own confirmation bias, are an inherent hazard in many
fields. A good way to prevent biases potentially leading to false
positives in the data collection phase is to use a double-blind design.
When a double-blind design is used, participants are randomly assigned
to experimental groups but the researcher is unaware of which
participants belong to which group. Therefore, the researcher cannot
affect the participants’ response to the intervention.
Experimental designs with undisclosed degrees of freedom are a
problem. This can lead to conscious or unconscious “p-hacking”:
trying multiple things until you get the desired result. It typically
involves the manipulation – perhaps unconsciously – of the process of
statistical analysis and the degrees of freedom until they return a
figure below the p < .05 level of statistical significance.
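A short calculation shows why trying multiple analyses inflates false positives. The sketch assumes k independent tests, each run at level 0.05, with every null hypothesis true; real analyses of the same data are usually correlated, so this is only a rough guide. The function name is illustrative.

```python
def familywise_error(k, alpha=0.05):
    """P(at least one p < alpha among k independent tests when every null is true)."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(f"{k:2d} analyses tried: P(at least one 'significant') = "
          f"{familywise_error(k):.3f}")
```

With 20 independent attempts, the chance of at least one nominally significant result is close to two in three even when no real effect exists.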
So the design of the experiment should include a clear statement
proposing the analyses to be undertaken. P-hacking can be prevented
by preregistering research, in which researchers send their
data analysis plan to the journal they wish to publish their paper in
before they even start their data collection, so no data manipulation is
possible.
Another way to prevent this is taking the double-blind design to the
data-analysis phase, where the data are sent to a data analyst
unrelated to the research who scrambles the data so there is no
way to know which group participants belong to before they are potentially
removed as outliers.
In the pure experimental design, the independent (predictor) variable is
manipulated by the researcher – that is – every participant of the
research is chosen randomly from the population, and each participant
chosen is assigned randomly to conditions of the independent variable.
Only when this is done is it possible to certify with high probability
that the differences in the outcome variables are caused
by the different conditions. Therefore, researchers should choose the
experimental design over other design types whenever possible.
However, the nature of the independent variable does not always allow
for manipulation. In those cases, researchers must be careful not
to make causal attributions when their design does not allow for
it. For example, in observational designs, participants are not assigned
randomly to conditions, and so if there are differences found in
outcome variables between conditions, it is likely that there is
something other than the differences between the conditions that
causes the differences in outcomes, that is, a third variable. The same
goes for studies with correlational design (Adér and Mellenbergh, 2008).
2.2.8 Statistical control
It is best that a process be in reasonable statistical control prior to
conducting designed experiments. When this is not possible, proper
blocking, replication, and randomization allow for the careful conduct
of designed experiments. To control for nuisance variables, researchers
institute control checks as additional measures. Investigators should
ensure that uncontrolled influences (e.g., source credibility perception)
do not skew the findings of the study. A manipulation check is one
example of a control check. Manipulation checks allow investigators to
isolate the chief variables to strengthen support that these variables
are operating as planned.
One of the most important requirements of experimental research
designs is the necessity of eliminating the effects of spurious,
intervening, and antecedent variables. In the most basic model, cause
(X) leads to effect (Y). But there could be a third variable (Z) that
influences (Y), and X might not be the true cause at all. Z is said to
be a spurious variable and must be controlled for. The same is true for
intervening variables (a variable in between the supposed cause (X)
and the effect (Y)), and antecedent variables (a variable prior to the
supposed cause (X) that is the true cause). When a third variable is
involved and has not been controlled for, the relation is said to be a
zero-order relationship. In most practical applications of experimental
research designs there are several causes (X1, X2, X3). In most
designs, only one of these causes is manipulated at a time.
2.3 Experimental designs after Fisher
Some efficient designs for estimating several main effects were found
independently and in near succession by Raj Chandra Bose and K.
Kishen in 1940 at the Indian Statistical Institute, but remained little
known until the Plackett–Burman designs were published in
Biometrika in 1946. About the same time, C. R. Rao introduced the
concepts of orthogonal arrays as experimental designs. This concept
played a central role in the development of Taguchi methods by
Genichi Taguchi, which took place during his visit to the Indian Statistical
Institute in the early 1950s. His methods were successfully applied and
adopted by Japanese and Indian industries and subsequently were also
embraced by US industry albeit with some reservations.
In 1950, Gertrude Mary Cox and William Gemmell Cochran published
the book Experimental Designs, which became the major reference
work on the design of experiments for statisticians for years afterwards.
Developments of the theory of linear models have encompassed and
surpassed the cases that concerned early writers. Today, the theory
rests on advanced topics in linear algebra, algebra and combinatorics.
As with other branches of statistics, experimental design is pursued
using both frequentist and Bayesian approaches: In evaluating
statistical procedures like experimental designs, frequentist statistics
studies the sampling distribution while Bayesian statistics updates a
probability distribution on the parameter space.
Some important contributors to the field of experimental designs are
C. S. Peirce, R. A. Fisher, F. Yates, R. C. Bose, A. C. Atkinson, R. A.
Bailey, D. R. Cox, G. E. P. Box, W. G. Cochran, W. T. Federer, V. V.
Fedorov, A. S. Hedayat, J. Kiefer, O. Kempthorne, J. A. Nelder,
Andrej Pázman, Friedrich Pukelsheim, D. Raghavarao, C. R. Rao,
S. S. Shrikhande, J. N. Srivastava, William J. Studden, G. Taguchi
and H. P. Wynn.
The textbooks of D. Montgomery, R. Myers, and G. Box/W.
Hunter/J.S. Hunter have reached generations of students and
practitioners.
Experimental design has also been discussed in the context of system
identification (model building for static or dynamic models).
2.4 Sequences of experiments
The use of a sequence of experiments, where the design of each may
depend on the results of previous experiments, including the possible
decision to stop experimenting, is within the scope of sequential
analysis, a field that was pioneered by Abraham Wald in the context of
sequential tests of statistical hypotheses. Herman Chernoff wrote an
overview of optimal sequential designs, while adaptive designs have
been surveyed by S. Zacks. One specific type of sequential design is
the “two-armed bandit”, generalized to the multi-armed bandit, on
which early work was done by Herbert Robbins in 1952.
2.5 Human participant constraints
Laws and ethical considerations preclude some carefully designed
experiments with human subjects. Legal constraints are dependent on
jurisdiction. Constraints may involve institutional review boards,
informed consent and confidentiality affecting both clinical (medical)
trials and behavioral and social science experiments. In the field of
toxicology, for example, experimentation is performed on laboratory
animals with the goal of defining safe exposure limits for humans.
Balancing the constraints are views from the medical field.
Regarding the randomization of patients, “... if no one knows which
therapy is better, there is no ethical imperative to use one therapy or
another.” (p 380) Regarding experimental design, “...it is clearly not
ethical to place subjects at risk to collect data in a poorly designed
study when this situation can be easily avoided...”. (p 393)
2.6 Some important issues in designing experiments
Clear and complete documentation of the experimental methodology is
also important in order to support replication of results.
Discussion topics when setting up an experimental design: an
experimental design or randomized clinical trial requires careful
consideration of several factors before actually doing the experiment.
An experimental design is the laying out of a detailed experimental
plan in advance of doing the experiment. Some of the following topics
have already been discussed in the principles of experimental design
section:
1) How many factors does the design have, and are the levels of these
factors fixed or random?
2) Are control conditions needed, and what should they be?
3) Manipulation checks; did the manipulation really work?
4) What are the background variables?
5) What is the sample size? How many units must be collected for the
experiment to be generalisable and have enough power?
6) What is the relevance of interactions between factors?
7) What is the influence of delayed effects of substantive factors on
outcomes?
8) How do response shifts affect self-report measures?
9) How feasible is repeated administration of the same measurement
instruments to the same units at different occasions, with a post-test
and follow-up tests?
10) What about using a proxy pretest?
11) Are there lurking variables?
12) Should the client/patient, researcher or even the analyst of the
data be blind to conditions?
13) What is the feasibility of subsequent application of different
conditions to the same units?
14) How many of each control and noise factors should be taken into
account?
15) How to deal with missing values?
16) What are good metrics?
........
The independent variable of a study often has many levels or different
groups. In a true experiment, researchers can have an experimental
group, which is where their intervention testing the hypothesis is
implemented, and a control group, which has all the same elements as
the experimental group except the interventional element. Thus,
when everything except the one intervention is held constant,
researchers can conclude with some confidence that this one element is
what caused the observed change. In some instances, having a control
group is not ethical. This is sometimes solved using two different
experimental groups. In some cases, independent variables cannot be
manipulated, for example when testing the difference between two
groups who have a different disease, or testing the difference between
genders (obviously variables that would be hard or unethical to assign
participants to). In these cases, a quasi-experimental design may be
used.
3 A/B tests (Randomized Control
Studies) in clinical trials
3.1 Drug development
Drug development is a complex and lengthy process that takes 7 to 15
years for a single drug at a cost that may reach hundreds of millions of
dollars. There are three main parts of the drug development process:
• Discovery and decision;
• Preclinical studies;
• Clinical studies.
Discovery and Decision
The process starts with the discovery of a new compound or of a new
potential application of an existing compound. Based on adequate
results, the decision whether to develop the drug is then made.
Preclinical Studies
The initial toxicology of the compound is studied in animals. Initial
formulation development and specific or comprehensive
pharmacological studies in animals are also performed at this stage. At
the end of the preclinical studies, the company assesses the evidence of
potential safety and effectiveness of the drug.
To proceed further, a US-based company needs to file a Notice of
Claimed Investigational New Drug Exemption (to allow the company
to conduct studies on human subjects).
Clinical Studies
Once there is sufficient evidence that the drug will be of
benefit to human subjects, testing the drug in human subjects is the
next step.
Phase I clinical trial: To establish initial safety information
about the effect of the drug on humans, such as the range of acceptable
dosages and the pharmacokinetics of the drug. These studies are
normally conducted with healthy volunteers. The number of subjects
typically varies between 4 and 20 per study, with up to 100 subjects in
total used over the course of Phase I trials.
Phase II clinical trial: These studies are conducted with patients
who will potentially benefit from the new drug. Effective dose ranges
and initial effects of the drug on these patients are assessed. Up to
several hundred patients are usually selected in Phase II trials.
Phase III clinical trial: Phase III studies provide assessment of
safety, efficacy, and optimum dosage. These studies are designed with
control and treatment groups. Usually hundreds or even thousands of
patients are involved in Phase III trials.
Based on successful results obtained from these studies, the company
can then submit an NDA (New Drug Application). The application
contains the results from all three stages (from discovery to Phase III)
and is reviewed by the FDA.
The FDA review panel for the NDA consists of reviewers in the
following areas: medicine, pharmacology, biopharmaceutics, chemistry,
and statistics.
Phase IV: Postmarket activities. Follow-up studies are conducted
to examine the long-term effects of the drug. The main purpose of
these studies is to ensure that all claims made by the company about
the new drug can be substantiated by so-called "clinical evidence". All
reported adverse effects must also be investigated by the company and,
in some cases, the drug may need to be withdrawn from the market.
Statistician’s Responsibilities:
• Participate in the development plan for studying a drug.
• Study design and protocol development. Randomization schemes.
• Data cleaning and database construction format.
• Analysis plan and program development for analysis.
• Report preparation. Produce tables and figures.
• Integrate clinical study results, safety and efficacy reports.
• Communication and NDA defense to FDA review panel.
• Publication support and consulting with other company personnel.
Example 3.1. HIV transmission. Connor et al. (1994, The New
England Journal of Medicine) report a clinical trial to evaluate the
drug AZT in reducing the risk of maternal-infant HIV transmission.
A 50-50 randomization scheme was used:
• AZT Group (A)—239 pregnant women (20 HIV positive
infants).
• placebo group (B)—238 pregnant women (60 HIV positive
infants).
Given the seriousness of the outcome of this study, it is reasonable to
argue that 50-50 allocation was unethical. As accruing information
favoring (albeit not conclusively) the AZT treatment became
available, allocation probabilities should have been shifted away from
50-50 allocation in proportion to the weight of evidence for
AZT. Designs which attempt to do this are called response-adaptive
designs.
If the treatment assignments had been done with the DBCD (Hu
and Zhang, 2004, Annals of Statistics) with urn target:
• AZT Group— 360 patients
• placebo group—117 patients
then, only 60 (instead of 80) infants would be HIV positive.
Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416   61        0.90     50
Example 3.2 (ECMO Trial). Extracorporeal membrane oxygenation
(ECMO) is an external system for oxygenating the blood based on
techniques used in cardiopulmonary bypass technology developed for
cardiac surgery. In the literature, there are three well-documented clinical
trials evaluating the clinical effectiveness of ECMO:
(i) the Michigan ECMO study (Bartlett, et al. 1985);
(ii) the Boston ECMO study (Ware, 1989);
(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).
Example 3.2 (Continued): Michigan ECMO trial using the
RPW rule:
The RPW rule was used in a clinical trial of extracorporeal membrane
oxygenation (ECMO; Bartlett, et al. 1985, Pediatrics).
Total: 12 patients.
• ECMO group: 11 patients, all survived.
• Conventional therapy: 1 patient, died.
3.2 Determining the Sample Size
In the planning stages of a randomized clinical trial, it is necessary to
determine the number of subjects (the sample size) to be randomized.
For two treatments (A and B), write n = nA + nB. We assume here
that the allocation proportions are known in advance, that is,
nA/n = ρ and nB/n = 1 − ρ are predetermined.
Examples of sample size (SS) calculations.
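As one possible illustration of such a calculation, the sketch below computes the total sample size n for a one-sided test of two proportions when nA/n = ρ is fixed in advance. The normal-approximation formula n = (z_α + z_β)² [pA qA/ρ + pB qB/(1 − ρ)] / (pA − pB)² is a standard textbook choice assumed here, not necessarily the one used in class; the function name and default values are hypothetical.

```python
import math
from statistics import NormalDist


def total_sample_size(p_a, p_b, rho=0.5, alpha=0.05, power=0.9):
    """Total n for a one-sided two-proportion test with n_A / n = rho.

    Assumes the unpooled normal (Wald) approximation; alpha is the
    one-sided type I error rate and power is the target power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the estimated difference, multiplied by n.
    var_term = p_a * (1 - p_a) / rho + p_b * (1 - p_b) / (1 - rho)
    n = (z_alpha + z_beta) ** 2 * var_term / (p_a - p_b) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    print(total_sample_size(0.7, 0.5))           # equal allocation
    print(total_sample_size(0.7, 0.5, rho=0.8))  # skewed allocation needs more
```

Under these assumptions, moving ρ away from the power-maximizing allocation increases the required n, which is the trade-off the later response-adaptive designs have to manage.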
3.3 Mathematical Framework of Randomization Procedures
Suppose we compare two treatments A and B. Let T1, ..., Tn be a
sequence of random treatment assignments.
Ti = 1 if the patient i is assigned to treatment A;
Ti = 0 if the patient i is assigned to treatment B.
$N_A(n) = \sum_{i=1}^{n} T_i$ is the number of patients on A, and
$N_B(n) = n - N_A(n)$ is the number of patients on B.
X1, ..., Xn: response variables. Here Xi represents the
responses that would be observed if each treatment were assigned to
the i-th patient independently.
Z1, ..., Zn: covariates. Here Zi represents the covariates of the i-th
patient.
When the (i + 1)th patient is ready to be randomized in a clinical
trial, the following information is available:
• patients’ assignments: T1, ..., Ti;
• responses: X1, ..., Xi (assuming immediate responses);
• patients’ covariates: Z1, ..., Zi and Zi+1.
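Let $\mathcal{T}_n = \sigma\{T_1, \dots, T_n\}$ be the sigma-algebra generated by the first $n$
treatment assignments.
Let $\mathcal{X}_n = \sigma\{X_1, \dots, X_n\}$ be the sigma-algebra generated by the first
$n$ responses.
Let $\mathcal{Z}_n = \sigma\{Z_1, \dots, Z_n\}$ be the sigma-algebra generated by the first $n$
covariate vectors. Let $\mathcal{F}_n = \mathcal{T}_n \otimes \mathcal{X}_n \otimes \mathcal{Z}_{n+1}$.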
A randomization procedure is defined by
$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}),$$
where $\phi_{n+1}$ is $\mathcal{F}_n$-measurable. We can describe $\phi_n$ as the conditional
probability of assigning treatment A to the $n$-th patient (with $K$
treatments, the vector of probabilities of assigning treatments $1, \dots, K$),
conditional on the previous $n - 1$ assignments, responses, and
covariate vectors, and the current patient’s covariate vector.
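We can describe five types of randomization procedures:
• (i) complete randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n)$;
no information is used.
• (ii) restricted randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1})$;
only the patients’ assignments are used.
• (iii) response-adaptive randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1})$;
the patients’ assignments and responses are used.
• (iv) covariate-adaptive randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{Z}_n)$;
the patients’ assignments and covariates are used.
• (v) covariate-adjusted response-adaptive (CARA) randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1}, \mathcal{Z}_n)$;
all available information is used.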
3.4 Complete randomization
The simplest form of a randomization procedure is complete
randomization.
E(Ti|T1, ..., Ti−1) = P (Ti = 1|T1, ..., Ti−1) = 1/2, i = 1, ..., n.
NA(n) has a binomial(n, 1/2) distribution.
This procedure is rarely used in practice because of the nonnegligible
probability of treatment imbalances in moderate samples.
3.5 Restricted randomization
Truncated binomial design: Complete randomization is used until n/2
patients have been assigned to A or B; the remainder are then assigned to the
opposite treatment with probability 1. Here the procedure is given by
φi = 1/2, if max{NA(i − 1), NB(i − 1)} < n/2,
= 0, if NA(i − 1) = n/2,
= 1, if NB(i − 1) = n/2.
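The truncated binomial rule can be sketched in a few lines. This is an illustrative implementation only, assuming an even total n and coding assignments as 1 = A and 0 = B; the function name is hypothetical.

```python
import random


def truncated_binomial(n, rng):
    """Return n assignments (1 = A, 0 = B) for an even total n."""
    half = n // 2
    n_a = n_b = 0
    sequence = []
    for _ in range(n):
        if n_a == half:        # A is full: force B
            t = 0
        elif n_b == half:      # B is full: force A
            t = 1
        else:                  # both arms open: fair coin
            t = 1 if rng.random() < 0.5 else 0
        sequence.append(t)
        n_a += t
        n_b += 1 - t
    return sequence


if __name__ == "__main__":
    print(truncated_binomial(10, random.Random(1)))
```

The tail of the sequence is deterministic once one arm fills, which is why this design forces final balance at the cost of some predictability.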
Blocked Procedures: Because we do not know n exactly in advance,
we typically require overrunning of the randomization sequence.
Forced balance designs are therefore typically used in blocks.
• Permuted block design: Blocks of even size 2b are filled using
either a random allocation rule or a truncated binomial design.
• The maximum imbalance is b and the only possibility of a terminal
imbalance occurs if the last block is unfilled. Every block has at
least one deterministic assignment.
• Random block design: Blocks of size 2, 4, 6, ..., 2K are randomly
selected and equiprobable.
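A permuted block sequence can be generated as below. The sketch assumes each block of size 2b is filled by a random allocation rule (a uniformly random ordering of b A's and b B's) and that the last block may be left unfilled; the function name is hypothetical.

```python
import random


def permuted_block_sequence(n, b, rng):
    """Return n assignments (1 = A, 0 = B) using blocks of size 2b."""
    sequence = []
    while len(sequence) < n:
        block = [1] * b + [0] * b
        rng.shuffle(block)      # random allocation rule within the block
        sequence.extend(block)
    return sequence[:n]         # the last block may be unfilled


if __name__ == "__main__":
    print(permuted_block_sequence(10, 2, random.Random(0)))
```

Within this scheme the running imbalance |NA(i) − NB(i)| never exceeds b, and it returns to zero at the end of every complete block.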
Efron’s biased coin design (BCD): (Efron, 1971). Let
Di = NA(i)−NB(i) be the imbalance between treatments A and B.
Define a constant pi ∈ (0.5, 1]. Then the procedure is given by
φi = 1/2, if Di−1 = 0,
= pi, if Di−1 < 0,
= 1− pi, if Di−1 > 0.
Efron suggested pi = 2/3 might be a reasonable value (without
justification).
Many other designs have been proposed and studied in the literature
(Smith’s design (1984), Wei’s design (1978), Big Stick design (Soares
and Wu, 1982), etc.).
When n = 50, Var(Dn) = 49.92 (complete randomization);
Var(Dn) = 4.36 (Efron’s BCD with pi = 2/3). (Based on 100,000
replications.)
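A comparison of this kind can be approximated with a short Monte Carlo sketch. It is not the code behind the figures quoted above; the function names, seed and number of replications are illustrative choices, and the biased-coin probability (written pi above) is called `p` here. Setting p = 1/2 reduces Efron's rule to complete randomization.

```python
import random


def final_imbalance(n, p, rng):
    """Return D_n = N_A(n) - N_B(n) under Efron's biased coin with bias p."""
    d = 0
    for _ in range(n):
        if d == 0:
            prob_a = 0.5
        elif d < 0:             # A is behind: favour A
            prob_a = p
        else:                   # A is ahead: favour B
            prob_a = 1 - p
        d += 1 if rng.random() < prob_a else -1
    return d


def imbalance_variance(n, p, reps, seed=0):
    """Sample variance of D_n over `reps` simulated trials."""
    rng = random.Random(seed)
    draws = [final_imbalance(n, p, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)


if __name__ == "__main__":
    print("complete randomization:", imbalance_variance(50, 0.5, 10000))
    print("Efron BCD, p = 2/3:    ", imbalance_variance(50, 2 / 3, 10000))
```

With enough replications the first value should sit near n = 50 and the second should be roughly an order of magnitude smaller, in line with the figures quoted above.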
3.6 Selection Bias
Selection Bias refers to biases that are introduced into an unmasked
study because an investigator may be able to guess the treatment
assignment of future patients based on knowing the treatments
assigned to the past patients. Patients usually enter a trial sequentially
over time.
The great clinical trialist Chalmers (1990) was convinced that the
elimination of selection bias is the most essential requirement for a
good clinical trial.
How to measure the Selection Bias?
Blackwell and Hodges (1957), Berger, Ivanova and Knoll (2003) and
others suggested using the predictability of a randomization
sequence to measure the selection bias.
One measure of the predictability of a randomization
sequence is given by
$$P_{pred} = \frac{\sum_{i=1}^{n} |E\phi_i - 0.5|}{n}.$$
Selection bias of different designs.
4.1 Historical notes
Adaptive designs in the clinical trials context were first formulated as
solutions to optimal decision-making questions:
• Which treatment is better?
• What sample size should be used before determining a “better”
treatment to maximize the total number receiving the better
treatment?
• How do we incorporate prior data or accruing data into these
decisions?
The preliminary ideas can be traced back to Thompson (1933,
Biometrika) and Robbins (1952, Bulletin of the American
Mathematical Society) and led to a flurry of work in the 1960s by
Anscombe (1963, JASA), Colton (1963, JASA), Zelen (1969, JASA)
and Cornfield, Halperin, and Greenhouse (1969, Annals of
Mathematical Statistics), among others.
4.2 Play-the-winner rule
Perhaps the simplest of these adaptive designs is the play-the-winner
rule originally explored by Robbins (1952, Bulletin of the American
Mathematical Society) and later by Zelen (1969, JASA).
Binary response: treatment A and B.
• pA: P (success|A), qA = 1− pA;
• pB : P (success|B), qB = 1− pB ;
• NA(n): number of patients on A;
• NB(n): number of patients on B, n = NA(n) +NB(n).
Play-the-winner rule:
• a success on one treatment results in the next patient’s
assignment to the same treatment,
• a failure on one treatment results in the next patient’s assignment
to the opposite treatment.
That is,
• $\phi_n = 1$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 1$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 0$.
• $\phi_n = 0$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 0$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 1$.
The properties of the play-the-winner rule?
• What is the proportion of patients in treatment A:
$$\frac{N_A(n)}{n} \to\ ???$$
• What is the variance (variability) of the allocation:
$$Var(N_A(n)) \ \text{or}\ Var\left(\frac{N_A(n)}{n}\right) =\ ???$$
• What is the distribution of the allocation:
$$\sqrt{n}\left(\frac{N_A(n)}{n} -\ ???\right) \to\ ???$$
We have
• What is the proportion of patients in treatment A:
$$\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}.$$
• What is the variance (variability) of the allocation:
$$Var(N_A(n)) = \frac{n\, q_A q_B (p_A + p_B)}{(q_A + q_B)^3}.$$
• What is the distribution of the allocation:
$$\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\left(0, \frac{q_A q_B (p_A + p_B)}{(q_A + q_B)^3}\right).$$
Advantages:
• more patients on the better treatment;
• intuitively attractive.
Disadvantages:
• not a randomized procedure;
• not based on any optimality criterion.
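The limiting proportion qB/(qA + qB) can be checked numerically with a small simulation of the play-the-winner rule. This is a sketch under stated assumptions: responses are immediate, the first patient is assigned by a fair coin (the rule itself does not say how to start), and the function name is hypothetical.

```python
import random


def play_the_winner(n, p_a, p_b, rng):
    """Return the proportion of n patients assigned to treatment A."""
    n_a = 0
    on_a = rng.random() < 0.5      # assumed starting rule: fair coin
    for _ in range(n):
        n_a += on_a
        success = rng.random() < (p_a if on_a else p_b)
        if not success:            # failure: switch treatment
            on_a = not on_a
    return n_a / n


if __name__ == "__main__":
    p_a, p_b = 0.7, 0.5
    limit = (1 - p_b) / ((1 - p_a) + (1 - p_b))
    print(play_the_winner(20000, p_a, p_b, random.Random(0)), "vs", limit)
```

Apart from the first assignment, every step is a deterministic function of the previous response, which is the lack of randomization noted in the list above.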
4.2.1 Randomized play-the-winner rule
The randomized play-the-winner (RPW) rule (Wei and Durham, 1978,
JASA) has been the most-studied urn model in the literature.
Binary response: treatment A and B.
• pA: P (success|A), qA = 1− pA;
• pB : P (success|B), qB = 1− pB ;
• NA(n): number of patients on A;
• NB(n): number of patients on B, n = NA(n) +NB(n).
Begin with c balls of type A and c balls of type B in an urn.
• Draw A:
– assign the patient to A;
– replace the ball;
– add 1 type A ball if treatment A is a success;
– add 1 type B ball if treatment A is a failure.
• Draw B:
– assign the patient to B;
– replace the ball;
– add 1 type B ball if treatment B is a success;
– add 1 type A ball if treatment B is a failure.
When the (i + 1)th patient is ready to be randomized in a clinical
trial, the following information is available:
• patients’ assignments: T1, ..., Ti;
• responses: X1, ..., Xi (assuming immediate responses).
Then
• $\phi_1 = 1/2$.
• $\phi_2 = \dfrac{c + T_1 X_1 + (1 - T_1)(1 - X_1)}{2c + 1}$.
• $\phi_{i+1} = \dfrac{c + \sum_{j=1}^{i} [T_j X_j + (1 - T_j)(1 - X_j)]}{2c + i}$.
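The urn scheme can be simulated directly. The sketch below assumes immediate responses and c = 1 initial balls of each type; both the function name and the choice of c are illustrative. The probability of drawing an A ball at each step is exactly the ratio in the formula above.

```python
import random


def rpw_trial(n, p_a, p_b, c=1, rng=None):
    """Simulate one RPW trial and return N_A(n)."""
    rng = rng or random.Random()
    balls_a, balls_b = c, c
    n_a = 0
    for _ in range(n):
        to_a = rng.random() < balls_a / (balls_a + balls_b)
        success = rng.random() < (p_a if to_a else p_b)
        n_a += to_a
        # A success or B failure adds an A ball; otherwise add a B ball.
        if to_a == success:
            balls_a += 1
        else:
            balls_b += 1
    return n_a


if __name__ == "__main__":
    rng = random.Random(2024)
    props = [rpw_trial(100, 0.7, 0.5, rng=rng) / 100 for _ in range(1000)]
    print(sum(props) / len(props))
```

Averaging NA(n)/n over many simulated trials gives an estimate of E NA(n)/n for a finite n, which can be compared with the limiting proportion qB/(qA + qB).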
Properties:
• Calculate ENA(n);
• Simulated results.
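We have
• The limiting proportion of patients in treatment A:
$$\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}.$$
• The variance (variability) of the allocation for large n (when $q_A + q_B > 1/2$):
$$Var(N_A(n)) \approx \frac{n\, q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}.$$
• The asymptotic distribution of the allocation (when $q_A + q_B > 1/2$):
$$\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\left(0, \frac{q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}\right).$$
Here $1 + 2(p_A + p_B) = 5 - 2(q_A + q_B)$; with this numerator the formula agrees with the asymptotic variances in Table 1 (for example, 0.75 at $p_A = p_B = 0.5$).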
Table 1: Asymptotic (A) and simulated (S) mean and variance (multiplied by
n) of the allocation proportions NA(n)/n for the randomized play-the-
winner (RPW) rule. Simulations based on n = 100 and 1000 replications.
From Hu and Rosenberger (2003), reprinted by permission from the
American Statistical Association.
(pA, pB)     Mean (A)   Mean (S)   Var (A)   Var (S)
(0.8, 0.8)   0.50       0.50       N/A       2.29
(0.8, 0.7)   0.60       0.57       N/A       1.90
(0.7, 0.5)   0.63       0.61       1.33      0.90
(0.7, 0.3)   0.70       0.68       0.63      0.51
(0.5, 0.5)   0.50       0.50       0.75      0.65
(0.5, 0.2)   0.62       0.61       0.35      0.34
(0.2, 0.2)   0.50       0.50       0.20      0.19
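The asymptotic columns of Table 1 can be recomputed with a short helper. The variance expression used here, qA qB [5 − 2(qA + qB)] / {[2(qA + qB) − 1](qA + qB)²}, is the form that matches the tabulated values and is assumed valid only when qA + qB > 1/2 (the N/A rows fall outside that range). The function name is hypothetical.

```python
def rpw_asymptotic(p_a, p_b):
    """Return (limiting mean, n * asymptotic variance) of N_A(n)/n for RPW.

    The variance is None when q_A + q_B <= 1/2, where the normal limit
    with this variance is not available.
    """
    q_a, q_b = 1 - p_a, 1 - p_b
    s = q_a + q_b
    mean = q_b / s
    if s <= 0.5 + 1e-12:        # tolerance guards against rounding at 0.5
        return mean, None
    variance = q_a * q_b * (5 - 2 * s) / ((2 * s - 1) * s ** 2)
    return mean, variance


if __name__ == "__main__":
    for pair in [(0.8, 0.8), (0.8, 0.7), (0.7, 0.5), (0.7, 0.3),
                 (0.5, 0.5), (0.5, 0.2), (0.2, 0.2)]:
        print(pair, rpw_asymptotic(*pair))
```

Comparing these values with the simulated columns shows how far a trial of n = 100 still is from the asymptotic regime when qA + qB is close to 1/2.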
Urn models:
• Play-the-winner (PW) rule (Zelen, 1969, JASA); Randomized
play-the-winner rule (Wei and Durham, 1978, JASA).
• Generalized Friedman’s urn models (Wei, 1979, JASA; Smythe,
1996, Stochastic Process. Appl.; Bai, Hu and Shen, 2002, JMVA);
• Randomized Polya Urn (Durham, Flournoy, and Li, 1998,
Canadian J of Statistics); Ternary Urn (Ivanova and Flournoy,
2001);
• Drop-the-Loser rule (Ivanova, 2003, Metrika); Generalized
drop-the-Loser rule (Zhang, Chan, Cheung and Hu, 2007, Statistica
Sinica).
• Sequential estimated urn (Zhang, Hu and Cheung, 2006, Annals
of Applied Probability).
• Urn models with immigration balls (Zhang, Hu, Cheung and
Chan, Annals of Statistics, 2011).
4.3 Relationship Between Power and Variability
Example 3.2. ECMO trial (The UK trial). Extracorporeal
membrane oxygenation (ECMO) is an external system for oxygenating
the blood based on techniques used in cardiopulmonary bypass
technology developed for cardiac surgery. In the literature, there are
three well-documented clinical trials evaluating the clinical
effectiveness of ECMO:
(i) the Michigan ECMO study (Bartlett, et al. 1985);
(ii) the Boston ECMO study (Ware, 1989);
(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).
Example 3.2 (Continued): Michigan ECMO trial using the
RPW rule:
The RPW rule was used in a clinical trial of extracorporeal membrane
oxygenation (ECMO; Bartlett, et al. 1985, Pediatrics).
Total: 12 patients.
• ECMO group: 11 patients, all survived.
• Conventional therapy: 1 patient, died.
Is this trial valid? No statistical conclusion can be drawn.
Why? Power and variability.
Power is an increasing function of the noncentrality parameter.
For the one-sided test $H_0: p_A = p_B$ vs $H_1: p_A > p_B$, the
corresponding test statistic is
$$T = \frac{\hat p_A - \hat p_B}{\sqrt{\hat p_A(1 - \hat p_A)/N_A(n) + \hat p_B(1 - \hat p_B)/N_B(n)}}.$$
We can calculate the noncentrality parameter as follows:
$$\frac{(p_A - p_B)^2}{p_A q_A/N_A(n) + p_B q_B/N_B(n)}.$$
Assuming $N_A(n)/n \to \rho$ in probability, we can rewrite this as:
See details in class.
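One way to fill in the rewriting step, offered as a first-order sketch rather than the derivation given in class, is to replace $N_A(n)$ by $n\rho$ and $N_B(n)$ by $n(1-\rho)$ in the noncentrality parameter:

```latex
\frac{(p_A - p_B)^2}{\dfrac{p_A q_A}{N_A(n)} + \dfrac{p_B q_B}{N_B(n)}}
\;\approx\;
\frac{(p_A - p_B)^2}{\dfrac{p_A q_A}{n\rho} + \dfrac{p_B q_B}{n(1-\rho)}}
\;=\;
\frac{n\,(p_A - p_B)^2}{\dfrac{p_A q_A}{\rho} + \dfrac{p_B q_B}{1-\rho}} .
```

Under this substitution the noncentrality parameter depends on the design only through the limiting allocation ρ. How the variability of NA(n)/n around ρ lowers the power is a second-order effect that this sketch does not capture.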
4.4 Lower bound of the variability
Hu, Rosenberger and Zhang (2006) considered "Asymptotically best
response-adaptive randomization procedures" in the Journal of Statistical
Planning and Inference.
See details in class.
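Doubly-adaptive biased coin design (DBCD) (Eisele and Woodroofe,
1995, Annals of Statistics; Hu and Zhang, 2004, Annals of Statistics).
Let $g$ be a function from $[0, 1] \times [0, 1]$ to $[0, 1]$ satisfying certain
conditions. The procedure then allocates patient $j$ to treatment A
with probability
$$g\left(\frac{n_A(j-1)}{j-1}, \hat\rho_{j-1}\right),$$
where $n_A(j-1)$ is the number of the first $j-1$ patients on A and
$\hat\rho_{j-1}$ is the current estimate of the target allocation.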
How to choose the function g?
Hu and Zhang (2004) proposed, for $\gamma \ge 0$,
$$g(x, \rho) = \frac{\rho(\rho/x)^{\gamma}}{\rho(\rho/x)^{\gamma} + (1-\rho)((1-\rho)/(1-x))^{\gamma}}.$$
• $\gamma = 0$: $g(x, \rho) = \rho$ (the SMLE);
• $\gamma = \infty$: deterministic design.
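Let
$$\lambda = \frac{\partial g}{\partial x}\Big|_{(\rho,\rho)}, \qquad \eta = \frac{\partial g}{\partial y}\Big|_{(\rho,\rho)}$$
(where $y$ denotes the second argument of $g$) and
$$\nabla(\rho) = \left(\frac{\partial\rho}{\partial p_A}, \frac{\partial\rho}{\partial p_B}\right)'.$$
Also let
$$\sigma_3^2 = (\nabla(\rho)|_{\Theta})'\, V\, \nabla(\rho)|_{\Theta} \quad\text{and}\quad \sigma_1^2 = \rho(1-\rho),$$
where $\Theta = (p_A, p_B)$ and
$$V = \mathrm{diag}\left(\frac{Var(\xi_A)}{\rho}, \frac{Var(\xi_B)}{1-\rho}\right).$$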
Theorem. Under widely satisfied conditions,
$$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2) \qquad (1)$$
in distribution, where
$$\sigma^2 = \frac{\sigma_1^2}{1 - 2\lambda} + \frac{2\eta^2\sigma_3^2}{(1-\lambda)(1-2\lambda)}.$$
Main techniques used: martingales, Gaussian approximation and
matrix theory.
Example 4.1. Binary response: treatment A and B.
• pA: P (success|A), qA = 1− pA;
• pB : P (success|B), qB = 1− pB ;
• nA: number of patients on A;
• nB : number of patients on B, n = nA + nB .
To see how this procedure works in practice, we look at a simple
illustration with γ = 2.
• Suppose we have already assigned 9 patients, 5 to A and 4 to B.
• We have observed a success rate of p̂A = 3/5 on A and p̂B = 1/4
on B.
If the target allocation is the urn allocation, qB/(qA + qB) (Wei and
Durham, 1978), then
• the estimated target allocation is
$$\hat\rho = \frac{3/4}{2/5 + 3/4} = 0.652;$$
• the actual allocation proportion is 5/9 = 0.556;
• the probability of assigning the 10th patient to treatment A
is computed as
$$P(A) = \frac{0.652(0.652/0.556)^2}{0.652(0.652/0.556)^2 + 0.348(0.348/0.444)^2} = 0.807.$$
If we are interested in the optimal allocation $\sqrt{p_A}/(\sqrt{p_A} + \sqrt{p_B})$
(Rosenberger, et al, 2001), then
• the estimated target allocation is
$$\hat\rho = \frac{\sqrt{3/5}}{\sqrt{3/5} + \sqrt{1/4}} = 0.6077;$$
• the actual allocation proportion is 5/9 = 0.556;
• the probability of assigning the 10th patient to treatment A
is computed as
$$P(A) = \frac{0.6077(0.6077/0.556)^2}{0.6077(0.6077/0.556)^2 + 0.3923(0.3923/0.444)^2} = 0.704.$$
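Both worked examples can be reproduced with a direct transcription of the allocation function g(x, ρ). The function and variable names below are illustrative; the code assumes 0 < x < 1 so that neither ratio divides by zero.

```python
import math


def dbcd_prob(x, rho, gamma=2.0):
    """Hu and Zhang's allocation function g(x, rho) for 0 < x < 1.

    x is the current proportion on A, rho the estimated target.
    """
    num = rho * (rho / x) ** gamma
    den = num + (1 - rho) * ((1 - rho) / (1 - x)) ** gamma
    return num / den


if __name__ == "__main__":
    x = 5 / 9                                   # 5 of 9 patients on A
    p_hat_a, p_hat_b = 3 / 5, 1 / 4
    q_hat_a, q_hat_b = 1 - p_hat_a, 1 - p_hat_b

    rho_urn = q_hat_b / (q_hat_a + q_hat_b)     # urn target
    rho_opt = math.sqrt(p_hat_a) / (math.sqrt(p_hat_a) + math.sqrt(p_hat_b))

    print(round(rho_urn, 3), round(dbcd_prob(x, rho_urn), 3))
    print(round(rho_opt, 4), round(dbcd_prob(x, rho_opt), 3))
```

Because the code carries full precision instead of the rounded intermediate values used in the hand calculation, the urn-target probability may differ from 0.807 in the third decimal place.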
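For binary responses with $\rho = q_B/(q_A + q_B)$,
$$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2_{DBCD})$$
in distribution, whenever $\lambda < 1/2$, where (writing subscripts 1 and 2 for treatments A and B)
$$\sigma^2_{DBCD} = \frac{q_1 q_2}{(1 - 2\lambda)(q_1 + q_2)^2} + \frac{2\eta^2}{(1-\lambda)(1-2\lambda)} \cdot \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3}.$$
If
$$g(x, \rho) = \frac{\rho(\rho/x)^{\gamma}}{\rho(\rho/x)^{\gamma} + (1-\rho)((1-\rho)/(1-x))^{\gamma}},$$
then
$$\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3} + \frac{2 q_1 q_2}{(1 + 2\gamma)(q_1 + q_2)^3}.$$
• $\gamma = 0$: $\sigma^2_{DBCD} = q_1 q_2 (p_1 + p_2 + 2)/(q_1 + q_2)^3$.
• $\gamma = \infty$: $\sigma^2_{DBCD} = q_1 q_2 (p_1 + p_2)/(q_1 + q_2)^3$ (lower bound).
• $\gamma = 2$: $\sigma^2_{DBCD} = q_1 q_2 (p_1 + p_2 + 0.4)/(q_1 + q_2)^3$.
The DBCD:
• can target any given allocation ρ(θ);
• comes very close to the lower bound, but does NOT ATTAIN it;
• applies to all types of responses.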
Hu, Zhang and He (2009, Annals of Statistics) proposed efficient
randomized-adaptive designs (ERADE), which
• can target any given allocation ρ(θ);
• ATTAIN the lower bound;
• apply to all types of responses.
The ERADE is analogous to a discretized version of Hu and Zhang’s
function. For a parameter $\alpha \in (0, 1)$, the procedure allocates the $j$th
patient to treatment A with probability
$$\phi_j = \begin{cases} 1/2, & \text{if } n_A(j-1)/(j-1) = \hat\rho_{j-1}, \\ \alpha\hat\rho_{j-1}, & \text{if } n_A(j-1)/(j-1) > \hat\rho_{j-1}, \\ 1 - \alpha(1 - \hat\rho_{j-1}), & \text{if } n_A(j-1)/(j-1) < \hat\rho_{j-1}. \end{cases}$$
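The three-case rule translates directly into code. The sketch follows the cases exactly as written above, including the value 1/2 when the current proportion equals the estimated target; using 1/2 for the first patient (when no one has been assigned yet) and the default α = 0.5 are assumptions made only for this illustration.

```python
def erade_prob(n_a, assigned, rho_hat, alpha=0.5):
    """Probability that the next patient goes to A under the rule above.

    n_a: patients on A so far; assigned: total patients so far (j - 1);
    rho_hat: current estimate of the target allocation.
    """
    if assigned == 0:
        return 0.5                      # assumed start: fair coin
    x = n_a / assigned
    if x > rho_hat:                     # A over-represented: shrink towards B
        return alpha * rho_hat
    if x < rho_hat:                     # A under-represented: push towards A
        return 1 - alpha * (1 - rho_hat)
    return 0.5


if __name__ == "__main__":
    # 5 of 9 patients on A, estimated target 0.652.
    print(erade_prob(5, 9, 0.652))
    print(erade_prob(7, 9, 0.652))
```

Unlike the continuous function g, the assignment probability here takes only a few distinct values, which is the sense in which the rule is a discretized version.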
4.7 Revisiting the examples
Example 4.2. HIV transmission (Continued). Connor et al.
(1994, The New England Journal of Medicine) report a clinical trial to
evaluate the drug AZT in reducing the risk of maternal-infant HIV
transmission.
A 50-50 randomization scheme was used:
• AZT Group—239 pregnant women (20 HIV positive infants).
• placebo group—238 pregnant women (60 HIV positive
infants).
Here p̂A = 219/239 = 0.916 and p̂B = 158/238 = 0.664.
• p̂A + p̂B = 1.580 > 1.5, so RPW does not apply here (its asymptotic
normality requires qA + qB > 1/2).
• DBCD: target allocation ρ = q2/(q1 + q2) with γ = 2.
• Neyman allocation: maximizes the power.
• FPower: fixes the power (at 0.9) and minimizes the expected number of failures.
Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416   61        0.90     50
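The HIV+ column can be checked by applying the observed transmission rates (20/239 on AZT, 60/238 on placebo) to each pair of arm sizes in the table. The arm sizes are taken from the table as given; the sketch does not recompute them or the power column, and the names are illustrative.

```python
# Observed transmission (failure) rates from the original 50-50 trial.
P_FAIL_AZT = 20 / 239
P_FAIL_PLACEBO = 60 / 238

# Arm sizes (AZT, placebo) as listed in the table.
ALLOCATIONS = {
    "EA": (239, 238),
    "DBCD": (360, 117),
    "Neyman": (186, 291),
    "FPower": (416, 61),
}


def expected_hiv_positive(n_azt, n_placebo):
    """Expected number of HIV-positive infants for the given arm sizes."""
    return n_azt * P_FAIL_AZT + n_placebo * P_FAIL_PLACEBO


if __name__ == "__main__":
    for rule, (n_azt, n_placebo) in ALLOCATIONS.items():
        print(rule, round(expected_hiv_positive(n_azt, n_placebo)))
```

The calculation makes the ethical trade-off explicit: Neyman allocation maximizes power but places more patients on the inferior arm, while the other two rules reduce expected failures at some cost in power.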
Example 4.3. ECMO trial (The UK trial). Extracorporeal
membrane oxygenation (ECMO) is an external system for oxygenating
the blood based on techniques used in cardiopulmonary bypass
technology developed for cardiac surgery. In the literature, there are
three well-documented clinical trials evaluating the clinical
effectiveness of ECMO:
(i) the Michigan ECMO study (Bartlett, et al. 1985);
(ii) the Boston ECMO study (Ware, 1989);
(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).
The UK ECMO trial:
A 50-50 randomization scheme was used:
• ECMO group: 93 infants (28 deaths).
• Conventional group: 92 infants (54 deaths).
• Use P1 = 65/93 and P2 = 38/92 as the estimated success
probabilities of ECMO and the conventional treatment,
respectively.
• ERADE (Hu, Zhang and He, 2009) is evaluated based on 10000
simulations.
• RPW is evaluated based on 10000 simulations.
• On average, there will be about 121 patients on ECMO and 64
patients on the conventional treatment.
• The expected number of deaths is 74, compared with 82 in
the actual trial. The adaptive design uses the better treatment
more often to save lives.
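The expected-death figures follow from simple arithmetic on the observed death rates. In the sketch below, the comparison of the average allocation with the urn target qB/(qA + qB) is an assumption about which target the simulations used, and the function names are illustrative.

```python
DEATH_ECMO = 28 / 93      # observed death rate on ECMO
DEATH_CONV = 54 / 92      # observed death rate on conventional therapy


def expected_deaths(n_ecmo, n_conv):
    """Expected number of deaths for the given arm sizes."""
    return n_ecmo * DEATH_ECMO + n_conv * DEATH_CONV


def urn_target(q_a, q_b):
    """Limiting proportion on treatment A under the urn allocation."""
    return q_b / (q_a + q_b)


if __name__ == "__main__":
    print(round(expected_deaths(93, 92)))    # actual 50-50 trial
    print(round(expected_deaths(121, 64)))   # average adaptive allocation
    print(185 * urn_target(DEATH_ECMO, DEATH_CONV))
```

The last line gives the number of patients on ECMO implied by the urn target for n = 185, which can be compared with the simulated average of about 121.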
Power of the ERADE and the RPW rule under the setting of
P1 = 65/93 and P2 = 38/92.
• For equal allocation, power is 0.978.
• The expected power under both designs (ERADE and RPW) is
0.969.
• Based on the 10000 simulated trials, we noticed that in 99% of
the trials under the ERADE there were more than 52 patients
assigned to the conventional treatment group, for a power of
0.941 or higher.
• Under the RPW rule, only 39 or fewer patients were assigned to
the conventional treatment in 1% of the trials, for a power of
0.904 or less.
• Also based on the 10000 simulated trials, the ERADE always
assigned more patients to the ECMO group.
• However, the RPW rule assigned more patients to the
conventional group in 114 trials.
• Even at the sample size 185, we can see the advantage of using
the proposed ERADE over the randomized play-the-winner rule.
4.8 Some remarks
• Urn models (RPW rule, Wei and Durham, 1978, JASA; Zhang,
Hu, Cheung and Chan, 2011, AOS)
• Ethical, Randomness, Power and Variability (Hu and Rosenberger,
2003, JASA)
• Lower bound of the variability (Hu, Rosenberger and Zhang, 2006,
JSPI)
• DBCD (Hu and Zhang, 2004, AOS)
• Optimal allocations (Rosenberger et al, 2001, Biometrics,
Tymofyeyev, Rosenberger and Hu, 2007, JASA)
• ERADE (Hu, Zhang and He, 2009, AOS)
• Delayed responses (Bai, Hu and Rosenberger, 2002, AOS, Hu and
Zhang, 2004)
• Time trends and others (Hu and Rosenberger, 2000, Statistics in
Medicine)
• The book (Hu and Rosenberger, 2006) and two white papers (Hu
and Rosenberger, 2007).
• Sequential Monitoring RAR (Zhu and Hu, 2010, AOS).
• Sample size re-estimation (Li and Hu, 2021).
• Robustness Inference of RAR (Ye, Ma and Hu, 2021?).
Example 5.1: Remdesivir-COVID-19 trial (China).
Remdesivir in adults with severe COVID-19 trial (Wang et al. 2020) is
a randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare Remdesivir with placebo. There were 236 patients
in the trial. There are about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of this trial, some important imbalances
existed at enrollment between the groups, including more patients with
hypertension, diabetes, or coronary artery disease in the Remdesivir
group than the placebo group.
Example 5.2: Moderna COVID-19 vaccine trial (2020).
The trial began on July 27, 2020, and enrolled 30,420 adult volunteers
at clinical research sites across the United States. Volunteers were
randomly assigned 1:1 to receive either two 100 microgram (mcg)
doses of the investigational vaccine or two shots of saline placebo 28
days apart. The average age of volunteers is 51 years. Approximately
47% are female, 25% are 65 years or older and 17% are under the age
of 65 with medical conditions placing them at higher risk for severe
COVID-19. Approximately 79% of participants are white, 10% are
Black or African American, 5% are Asian, 0.8% are American Indian or
Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%
are multiracial, and 21% (of any race) are Hispanic or Latino.
From the start of the trial through Nov. 25, 2020, investigators
recorded 196 cases of symptomatic COVID-19 occurring among
participants at least 14 days after they received their second shot. One
hundred and eighty-five cases (30 of which were classified as severe
COVID-19) occurred in the placebo group and 11 cases (0 of which
were classified as severe COVID-19) occurred in the group receiving
mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%
lower in those participants who received mRNA-1273 as compared to
those receiving placebo.
Investigators observed 236 cases of symptomatic COVID-19 among
participants at least 14 days after they received their first shot, with
225 cases in the placebo group and 11 cases in the group receiving
mRNA-1273. The vaccine efficacy was 95.2% for this secondary
analysis.
5.1 Some Classical designs
Clinical trialists are often concerned that treatment arms will be
unbalanced with respect to key covariates of interest. To prevent this,
covariate-adaptive randomization is often employed. Over 50000
(Taves, 2010).
• Covariates (prognostic factors): factors that are associated with
the outcomes of patients
– E.g., gender, age, clinical center, blood pressure, stage of
disease at baseline, gene expressions.
• Covariate-adaptive design: randomization that incorporates
covariates and balances treatment allocation over covariates.
– Balancing treatment allocation for influential covariates.
– Achieving statistical efficiency by preserving type I errors while
increasing power.
• Two popular procedures: stratified permuted block design and
Pocock and Simon’s marginal procedure (1975).
Imbalance of Different Levels
• overall difference, Dn = N1(n)−N2(n);
• marginal difference, difference between the numbers of patients on
a margin, e.g., Dfemale;
• within-stratum difference, difference between the number of
patients in a stratum, e.g. Dfemale,smoker.
           Female            Male            Overall
Smoker     Dfemale,smoker    Dmale,smoker    Dsmoker
Non-S      Dfemale,non-s     Dmale,non-s     Dnon-s
Overall    Dfemale           Dmale           Dn
5.1.1 Stratified Randomization
• Strata are formed by all combinations of covariates’ levels.
– e.g.: 2 covariates gender (male and female) and smoking
behavior (smoker and non-smoker) lead to 2× 2 = 4 strata
• Separate randomization is employed within each stratum.
– The stratified permuted block design is commonly used. Permuted
block design: a random permutation of m A’s and m B’s.
- e.g.: block size 2m = 4, with blocks such as (AABB) or (BAAB);
for 10 patients: |AABB|BAAB|BB
Stratified Randomization
– Easy to understand and implement.
– Good large sample properties (almost perfect balance).
– Balance within stratum.
– Only consider balance within stratum.
– Does not work for cases with many strata (many covariates or
many levels).
– Unknown (theoretically) properties of statistical inference.
5.1.2 Pocock-Simon procedure
Let $Z_1, \ldots, Z_n$ be the covariate vectors of patients $1, \ldots, n$. Assume that there are $S$ covariates of interest (continuous or otherwise) and that covariate $s$ is divided into $n_s$ different levels, $s = 1, \ldots, S$.
Define $N_{sik}(n)$, $s = 1, \ldots, S$, $i = 1, \ldots, n_s$, $k = 1, 2$, to be the number of patients in the $i$-th level of the $s$-th covariate on treatment $k$.
Let patient $n+1$ have covariate vector $Z_{n+1} = (r_1, \ldots, r_S)$.
Let $D_s(n) = N_{s r_s 1}(n) - N_{s r_s 2}(n)$, which is the difference between the numbers of patients on treatments 1 and 2 for members of level $r_s$ of covariate $s$.
Let $w_1, \ldots, w_S$ be a set of weights and take the weighted aggregate
\[ D(n) = \sum_{s=1}^{S} w_s D_s(n). \]
Establish a probability $\pi \in (1/2, 1]$. Then the procedure allocates to treatment 1 according to
\[ \phi_{i1} = E(T_{i1} \mid T_{i-1}, Z_i) =
\begin{cases}
1/2, & \text{if } D(i-1) = 0, \\
\pi, & \text{if } D(i-1) < 0, \\
1 - \pi, & \text{if } D(i-1) > 0.
\end{cases} \]
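The allocation rule above can be sketched in Python as follows. This is an illustration under stated assumptions: covariates are already discretized, the function name `pocock_simon`, the argument `p` (the biased-coin probability) and its default 0.85 are hypothetical, and integer weights are used so that the test for a zero aggregate is exact.

```python
import random

def pocock_simon(covariates, weights, p=0.85, seed=0):
    """covariates: list of tuples (r_1, ..., r_S), one per patient, in arrival order.
    weights: (w_1, ..., w_S). Returns a list of treatments (1 or 2)."""
    rng = random.Random(seed)
    S = len(weights)
    # diff[s][level] = N_{s,level,1}(n) - N_{s,level,2}(n)
    diff = [dict() for _ in range(S)]
    treatments = []
    for z in covariates:
        # Weighted aggregate over the margins that the new patient belongs to.
        D = sum(weights[s] * diff[s].get(z[s], 0) for s in range(S))
        if D == 0:
            prob1 = 0.5
        elif D < 0:
            prob1 = p        # treatment 1 is under-represented: favour it
        else:
            prob1 = 1 - p
        t = 1 if rng.random() < prob1 else 2
        treatments.append(t)
        for s in range(S):
            diff[s][z[s]] = diff[s].get(z[s], 0) + (1 if t == 1 else -1)
    return treatments

# Hypothetical example: two binary covariates, equal weights.
rng = random.Random(42)
patients = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(100)]
alloc = pocock_simon(patients, weights=(1, 1))
print(alloc.count(1), alloc.count(2))
```

Only the marginal differences enter the rule, so nothing in this sketch controls the within-stratum differences directly.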
Pocock-Simon procedure
– Balance across covariates (marginal balance).
– Overall treatment balance with many covariates.
– Unknown theoretical properties (not well studied, Rosenberger
and Sverdlov, 2009).
– usually not well balanced within stratum.
– Unknown (theoretically) properties of statistical inference.
Examples.
We need new covariate-adaptive designs that provide balance (within
stratum, marginal and overall) under different situations (sample size
200, 500 or 1000):
• 10 covariates, each with 2 levels: total $2^{10} = 1024$ strata.
• 2 covariates: a biomarker with 2 levels and 100 investigation sites: total 200 strata.
5.2 Hu and Hu’s Covariate-Adaptive Design for
Balance (discrete)
Consider two covariates: covariate 1 with $I$ levels and covariate 2 with $J$ levels. Suppose $n$ patients have been assigned ($n = 0, 1, 2, \ldots$) and patient $n+1$ falls in level $i$ of covariate 1 and level $j$ of covariate 2. First we define the following values:
• If patient $n+1$ is assigned to treatment 1, let
– Within stratum: $D^{(1)}_{ij}(n+1) = N_{ij,1}(n+1) - N_{ij,2}(n+1)$, where $N_{ij,1}(n+1)$ and $N_{ij,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, in stratum $(i, j)$ among the first $n+1$ patients.
– Marginal 1: $D^{(1)}_{i\cdot}(n+1) = N_{i\cdot,1}(n+1) - N_{i\cdot,2}(n+1)$, where $N_{i\cdot,1}(n+1)$ and $N_{i\cdot,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, with covariate 1 equal to $i$ among the first $n+1$ patients.
– Marginal 2: $D^{(1)}_{\cdot j}(n+1) = N_{\cdot j,1}(n+1) - N_{\cdot j,2}(n+1)$, where $N_{\cdot j,1}(n+1)$ and $N_{\cdot j,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, with covariate 2 equal to $j$ among the first $n+1$ patients.
– Overall: $D^{(1)}_{\cdot\cdot}(n+1) = N_1(n+1) - N_2(n+1)$, the overall difference in patient numbers between groups 1 and 2 among the first $n+1$ patients.
– Define $A^{(1)}_{ij}(n+1) = (D^{(1)}_{ij}(n+1))^2$, $A^{(1)}_{i\cdot}(n+1) = (D^{(1)}_{i\cdot}(n+1))^2$, $A^{(1)}_{\cdot j}(n+1) = (D^{(1)}_{\cdot j}(n+1))^2$ and $A^{(1)}_{\cdot\cdot}(n+1) = (D^{(1)}_{\cdot\cdot}(n+1))^2$.
– The score of imbalance is
\[ B^{(1)}_{ij}(n+1) = w_1 A^{(1)}_{ij}(n+1) + w_2 A^{(1)}_{i\cdot}(n+1) + w_3 A^{(1)}_{\cdot j}(n+1) + w_4 A^{(1)}_{\cdot\cdot}(n+1) \]
for some weights $w_1, w_2, w_3, w_4 \ge 0$.
• If patient $n+1$ is assigned to treatment 2, $B^{(2)}_{ij}(n+1)$ is calculated similarly.
Then the proposed procedure (Hu and Hu, 2012) allocates patient $n+1$ to treatment 1 according to
\[ \phi_{n+1,1} =
\begin{cases}
1/2, & \text{if } B^{(1)}_{ij}(n+1) = B^{(2)}_{ij}(n+1), \\
\pi, & \text{if } B^{(1)}_{ij}(n+1) < B^{(2)}_{ij}(n+1), \\
1 - \pi, & \text{if } B^{(1)}_{ij}(n+1) > B^{(2)}_{ij}(n+1),
\end{cases} \]
where $\pi > 0.5$ ($\pi \in (0.75, 0.95)$ is recommended).
Remarks:
• When the weights $w_1 = 0$ and $w_4 = 0$, the new design becomes Pocock and Simon's procedure.
• When $w_2 = w_3 = w_4 = 0$, the new design is similar to stratified block randomization.
• With $w_1, w_2, w_3 > 0$, we can balance both within each stratum and across covariates.
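A minimal Python sketch of the procedure for two covariates, assuming the score compares the squared differences that would result from assigning the new patient to treatment 1 versus treatment 2. The function name `hu_hu`, the default weights and the default biased-coin probability `p = 0.85` are illustrative assumptions, not values prescribed by the source.

```python
import random
from collections import defaultdict

def hu_hu(covariates, w=(0.4, 0.2, 0.2, 0.2), p=0.85, seed=0):
    """covariates: list of (i, j) level pairs in arrival order.
    w = (w1, w2, w3, w4): weights for within-stratum, marginal 1,
    marginal 2 and overall imbalance. Returns treatments (1 or 2)."""
    rng = random.Random(seed)
    d_strat = defaultdict(int)   # current N1 - N2 within stratum (i, j)
    d_row = defaultdict(int)     # current N1 - N2 on margin covariate 1 = i
    d_col = defaultdict(int)     # current N1 - N2 on margin covariate 2 = j
    d_all = 0                    # current overall N1 - N2
    out = []
    for i, j in covariates:
        def score(delta):
            # Weighted squared imbalances if the new patient shifts each
            # relevant difference by delta (+1: treatment 1, -1: treatment 2).
            return (w[0] * (d_strat[(i, j)] + delta) ** 2
                    + w[1] * (d_row[i] + delta) ** 2
                    + w[2] * (d_col[j] + delta) ** 2
                    + w[3] * (d_all + delta) ** 2)
        b1, b2 = score(+1), score(-1)
        if b1 == b2:
            prob1 = 0.5
        elif b1 < b2:
            prob1 = p
        else:
            prob1 = 1 - p
        t = 1 if rng.random() < prob1 else 2
        delta = 1 if t == 1 else -1
        d_strat[(i, j)] += delta
        d_row[i] += delta
        d_col[j] += delta
        d_all += delta
        out.append(t)
    return out

# Hypothetical example: a 2-level biomarker crossed with 100 sites.
rng = random.Random(7)
pts = [(rng.randint(0, 1), rng.randint(0, 99)) for _ in range(200)]
alloc = hu_hu(pts)
print(alloc.count(1) - alloc.count(2))
```

Setting `w = (0, w2, w3, 0)` in this sketch removes the within-stratum and overall terms, which mirrors the first remark above.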
Theorem 1: (1) Under certain conditions ($w_1 > 0$ and some others; Hu and Hu, 2012; Hu and Zhang, 2014), $D_n$ (the imbalance matrix) is a positive recurrent Markov chain. Therefore all three types of imbalance are $O_p(1)$.
(2) When $w_1 = 0$ (Pocock and Simon's design), both the marginal and overall imbalances are $O_p(1)$, but the within-stratum imbalance is $O_p(n^{1/2})$ (Hu and Zhang, 2014).
The proof is quite difficult because of the correlated structure.
Main techniques: "drift conditions" for Markov chains, Gaussian approximation and martingales.
Some numerical results:
Case 1: 10 covariates, each with 2 levels: total $2^{10} = 1024$ strata.
Table 1. Averaging imbalance under 100 simulations and n = 500
Dist of pts across strata          Counts & percentages
# of pts   E(# prop)   Imb    strt(PB)      P-S          New
2          .07         0      50.2(.67)     38.2(.50)    55.1(.74)
                       2      24.9(.33)     37.8(.50)    18.9(.26)
3          .01         1      12(1.00)      9.3(.77)     12.0(.96)
                       3      0(0.00)       2.8(.23)     .5(.04)
(< 2)      .91
overall abs dif               12.8          .76          .90
marginal abs dif              10.4          1.68         1.90
Case 2: 2 covariates: a biomarker with 2 levels and 100 investigation sites: total 200 strata.
Table 3. Averaging imbalance under 1000 simulations and n = 200
Dist of pts across strata          Counts & percentages
# of pts   E(# prop)   Imb    strt(PB)       P-S            New
2          .184        0      24.46(.66)     24.15(.65)     30.23(.82)
                       2      12.37(.34)     12.74(.35)     6.46(.18)
3          .06         1      12.02(1.00)    11.18(.92)     12.05(.97)
                       3      0(0.00)        1.02(0.08)     0.35(.03)
(< 2)      .735
overall abs dif               9.39           1.14           1.53
marginal long abs dif         6.57           0.87           1.13
marginal short abs dif        1.00           0.86           0.81
5.3 Examples of Mimicking Real Clinical Data
5.3.1 Toorawa, Adena, et al. (2009)
The four covariates are site, gender, age and disease status, with 20,
2, 2 and 2 levels, respectively, resulting in 160 strata. The covariates’
distribution is replicated in Table 2, where the marginal distribution of sites is independent of the joint distribution of the other three covariates.
Table 2: Distribution of Covariates
Sites                 Small (2 sites)                     1/120
                      Medium (16 sites)                   6/120
                      Large (2 sites)                     11/120
Other 3 covariates    Male; < 60; Moderate disease        10/20
                      Male; ≥ 60; Moderate disease        2/20
                      Male; < 60; Severe disease          2/20
                      Male; ≥ 60; Severe disease          2/20
                      Female; < 60; Moderate disease      1/20
                      Female; ≥ 60; Moderate disease      1/20
                      Female; < 60; Severe disease        1/20
                      Female; ≥ 60; Severe disease        1/20
120 patients enter the trial sequentially and their covariates are
independently simulated from the multinomial distribution in Table 2.
We use the same p, q and block size as in the previous two examples.
The weights are specified in the following way:
- NEW: wo = ws = 1/3 and wm,i = 1/12, i = 1, · · · , 4.
- PS: wo = ws = 0 and wm,i = 1/4, i = 1, · · · , 4.
Table 3: Distribution of patients among 160 strata
# of pts within stratum   0       1       2      3      4 and more
# of strata               95.4    38.8    12.7   5.6    7.6
proportion                59.6%   24.3%   7.9%   3.5%   4.7%
Table 3 shows the distribution of 120 patients among 160 strata. In this case 24.3% of the strata have 1 patient; 11.4% contain 2 or 3 patients. If stratified randomization is employed, then the patients in the above 24.3% of strata have to be randomized with equal probabilities. Moreover, the incomplete blocks in strata with 2 or 3 patients also pose a high risk of large overall imbalance.
The mean absolute imbalances at the three levels are compared in Tables 4, 5 and 6.
Table 4: Comparison of absolute overall imbalance |Dn|
            STR-PB   PS     NEW
mean        6.70     0.91   0.63
median      6        0      0
95% quan    16       2      2
Table 4 shows the results for the overall imbalance and lists the mean, median and 95% quantile of |D120|. NEW has a mean, median and 95% quantile of 0.63, 0 and 2, respectively, whereas PS has a slightly higher mean (0.91) with the same median and 95% quantile. The three quantities are extremely high under STR-PB, which is therefore not recommended for this case.
Table 5: Comparison of mean absolute marginal imbalances E|Dn(i; ki)|
                          STR-PB   PS     NEW
gender      male          5.52     1.10   1.59
            female        3.86     1.06   1.55
age         < 60          4.84     1.08   1.57
            ≥ 60          4.40     1.11   1.23
disease     moderate      5.01     1.10   1.56
            severe        4.35     1.18   1.52
20 sites    2 small       1.45     0.94   1.02
            16 medium     1.44     1.21   1.32
            2 large       1.47     1.33   1.52
Table 5 gives the mean absolute marginal imbalances. For the covariates gender, age and disease, the table explicitly lists the mean values on these 6 margins, as each of them has only two levels. For example, over the 1000 simulations, the average absolute difference between the numbers of male patients in the two treatment groups is 5.52, 1.10 and 1.59 under STR-PB, PS and NEW, respectively. Therefore, in this respect PS has the best performance; NEW is slightly worse, but still tolerable; STR-PB is the worst, since its mean is as high as 5.52. Similar conclusions can be reached for the other 5 margins. Moreover, for the margins relating to "site", since there are a total of 20 margins, we are unable to show the result on each margin due to the space limit. Hence, these 20 margins are further categorized into three groups of small, medium and large sizes, and the mean values in the table are further averaged over the margins within the groups. For example, 1.32 is the mean absolute imbalance over the 16 medium-sized sites as well as over the 1000 simulations. In terms of imbalances on margins defined by site, PS is still the best, and STR-PB has similar performance to NEW. This is because each margin of site contains only 8 strata, hence the "accumulating effect" of within-stratum imbalances under STR-PB is not as strong.
Table 6: Comparison of absolute within-stratum imbalances |Dn(k1, · · · , kI)|: distribution and mean
# of pts within stratum   |Dn(k1, · · · , kI)|   STR-PB   PS     NEW
2                         prob(=0)               0.68     0.57   0.69
                          prob(=2)               0.32     0.43   0.31
                          mean                   0.64     0.86   0.62
3                         prob(=1)               1.00     0.85   0.94
                          prob(=3)               0.00     0.15   0.06
                          mean                   1.00     1.30   1.12
Table 6 displays the distribution and mean of the absolute within-stratum imbalances for strata with 2 or 3 patients. For example, for strata which contain 2 patients, the absolute difference is either 0 or 2, and under NEW it equals 0 with probability 0.69 and 2 with probability 0.31, leading to an average of 0.62. According to this criterion, NEW has the lowest mean, STR-PB has a slightly larger value, and PS has a mean as large as 0.86. For strata containing 3 patients, since the block size is 4 for STR-PB, it is impossible to get an absolute value of 3. Hence, its mean absolute imbalance is 1, the minimum among the three methods.
In summary, Hu and Hu’s procedure maintains good balance from all
three perspectives and should be favored. We also performed the
simulations under other parameter values. Some of them include: (1)
Changing the weights wo, ws, and wm,i, as well as the block size; (2)
2× 100 strata, representing few covariates but many levels at least for
one covariate; (3) 3× 4× 5× 6 strata, representing a few covariates
and a few levels for each. In all the above settings, our new procedure
shows advantages over the other two methods.
5.3.2 NIDA-CSP-1019 study
The Elkashef et al. (2006) study is a randomized clinical trial conducted to test the treatment effect of the selegiline transdermal system (STS), a treatment for cocaine dependence. The trial comprised 300 patients and involved important covariates such as center (16 centers), age, gender (1: male, 2: female), depression (measured by the Hamilton Depression Rating Scale), ADHD (attention deficit hyperactivity disorder, 1: Yes, 2: No), and cocaine use (the number of self-reported days of cocaine use in the past 30 days). The raw data of this study are available on the NIDA website.
Before using the randomization procedures, we discretized age to 1
(0-30), 2 (30− 40), 3 (40− 50), 4 (50 and above); depression
(Hamilton Depression Rating Scale) to 1 (normal: 0-7), 2 (mild
depression: 8-13), 3 (moderate depression: 14-18), 4 (severe
depression: 19-22), 5 (very severe depression: 23 to above);
cocaine use to 1 (0-10), 2 (11-20), 3 (20-30). The correlation
coefficients of the six covariates are given in Table 7.
Table 7: Correlation coefficients (Kendall’s tau) of the covariates.
center gender age depression ADHD cocaineuse
center 1.000 0.027 -0.004 -0.044 -0.029 0.021
gender 0.027 1.000 -0.066 0.078 0.013 0.127
age -0.004 -0.066 1.000 -0.028 -0.075 0.009
depression -0.044 0.078 -0.028 1.000 -0.176 0.066
ADHD -0.029 0.013 -0.075 -0.176 1.000 0.040
cocaineuse 0.021 0.127 0.009 0.066 0.040 1.000
Table 7 shows that gender has the highest calculated correlation with cocaine use (0.127). Medical studies (Conner et al., 2008; McIntosh et al., 2009) suggest the existence of correlations among depression, ADHD, and cocaine use. Gender, depression, ADHD, and cocaine use were thus assumed to be jointly distributed. In addition, center and age were assumed to be distributed independently of each other and of the rest of the covariates. The empirical distributions of the covariates used in the simulation are presented in Tables 8-12.
The values used for the randomization procedures were as follows. $B_s = 4$ was used for all $s$ under STR-PB, whether cocaine use was observed or unobserved. $\gamma = 0.85$ was used for both PS and HH, whether cocaine use was observed or unobserved. When cocaine use was observed, $w_{m1} = \cdots = w_{m6} = 1/6$ and $(w_o = 0.1, w_{m1} = \cdots = w_{m6} = 0.14, w_s = 0.06)$ were used for PS and HH, respectively. When cocaine use was unobserved, $w_{m1} = \cdots = w_{m5} = 1/5$ and $(w_o = 0.15, w_{m1} = \cdots = w_{m5} = 0.15, w_s = 0.1)$ were used for PS and HH, respectively.
Table 8: Marginal pmf of age.
age 1 2 3 4
pmf 28/300 110/300 135/300 27/300
Table 9: Marginal pmf of center.
center 1 2 3 4 5 6 7 8
pmf 24/300 21/300 28/300 15/300 14/300 18/300 28/300 24/300
center 9 10 11 12 13 14 15 16
pmf 15/300 16/300 20/300 3/300 20/300 17/300 10/300 27/300
Table 10: Joint pmf of gender, depression, ADHD, and cocaine use
I.
gender depression ADHD cocaine use pmf
1 1 1 1 1/300
1 1 1 2 3/300
1 1 1 3 1/300
Table 11: Joint pmf of gender, depression, ADHD, and cocaine use
II.
gender depression ADHD cocaine use pmf
1 1 2 1 26/300
1 1 2 2 54/300
1 1 2 3 30/300
1 2 1 1 1/300
1 2 2 1 10/300
1 2 2 2 24/300
1 2 2 3 18/300
1 3 1 1 2/300
1 3 1 3 1/300
1 3 2 1 4/300
1 3 2 2 13/300
1 3 2 3 12/300
1 4 1 2 2/300
1 4 1 3 2/300
1 4 2 1 5/300
1 4 2 2 3/300
1 4 2 3 8/300
1 5 1 2 1/300
1 5 1 3 2/300
1 5 2 1 1/300
1 5 2 2 7/300
1 5 2 3 3/300
Table 12: Joint pmf of gender, depression, ADHD, and cocaine use III.
gender depression ADHD cocaine use pmf
2 1 2 1 1/300
2 1 2 2 9/300
2 1 2 3 14/300
2 2 2 1 4/300
2 2 2 2 9/300
2 2 2 3 8/300
2 3 1 1 1/300
2 3 1 2 2/300
2 3 2 2 3/300
2 3 2 3 5/300
2 4 2 1 1/300
2 4 2 2 2/300
2 4 2 3 1/300
2 5 1 2 1/300
2 5 2 1 1/300
2 5 2 2 1/300
2 5 2 3 3/300
Note that the discretization we used resulted in 1,280 observed strata.
In this section, based on the sample size of 300 in this study, we
compare the marginal imbalance of cocaine use = 3 and the
imbalance of a partial stratum of
(gender = 1, depression = 1, ADHD = 2, and cocaine use = 3),
when cocaine use is either observed or unobserved. For simplicity, we write $D_n(6; s_6)$ to denote the marginal imbalance of cocaine use when it is observed, and $D_n(1; r_1)$ to denote the marginal imbalance of cocaine use when it is unobserved. Furthermore, we write $D_n(s^*)$ for the imbalance of the partial stratum of interest when cocaine use is observed, and $D_n(s^{**}, r_1)$ for the imbalance of the same partial stratum when cocaine use is unobserved.
The simulation results for the partial stratum and the margin of cocaine use = 3 are summarized in Table 13 and Table 14. We also report the percentage reduction in the variance of the observed covariate imbalance (PRVOCI) for $D_n(s^*)$ and $D_n(6; s_6)$, and the corresponding quantity when cocaine use is unobserved (PRVUCI) for $D_n(s^{**}, r_1)$ and $D_n(1; r_1)$. It is clear that regardless of whether cocaine use is observed or unobserved, PS and HH produce a better balance for the partial stratum and the margin of cocaine use than complete randomization (CR) or STR-PB. In particular, the standard deviations of $n^{-1/2} D_n(6; s_6)$, $n^{-1/2} D_n(1; r_1)$, $n^{-1/2} D_n(s^*)$, and $n^{-1/2} D_n(s^{**}, r_1)$ under PS and HH are smaller than the corresponding values under STR-PB and CR.
Table 13: Simulation results for the partial stratum (gender = 1, depression = 1, ADHD = 2, cocaine use = 3), based on 10,000 runs.
             n^{-1/2} D_n(s*)           n^{-1/2} D_n(s**, r_1)
Procedure    mean (s.d.) / PRVOCI       mean (s.d.) / PRVUCI
CR           -.000 (.316) / -           -.000 (.316) / -
STR-PB       .002 (.276) / 23.7%        .008 (.288) / 17.1%
PS           -.000 (.238) / 43.5%       -.002 (.280) / 21.6%
HH           -.005 (.233) / 45.9%       -.003 (.275) / 24.2%
Table 14: Simulation results for cocaine use, based on 10,000 runs.
             n^{-1/2} D_n(6; s_6)       n^{-1/2} D_n(1; r_1)
Procedure    mean (s.d.) / PRVOCI       mean (s.d.) / PRVUCI
CR           .001 (.601) / -            -.001 (.601) / -
STR-PB       .005 (.556) / 14.6%        .011 (.564) / 12.0%
PS           -.000 (.111) / 96.6%       .001 (.476) / 37.3%
HH           -.000 (.112) / 96.5%       -.003 (.470) / 37.3%
The differences between the PRVOCIs and the PRVUCIs of the partial
stratum and cocaine use under PS and HH are not negligible. For
example, the PRVOCI and the PRVUCI for the marginal imbalance of
cocaine use = 3 under PS are 96.6% and 37.3%, respectively. Indeed,
if one covariate is omitted from CAR, the marginal imbalance of this
covariate generally increases. However, as the PRVUCIs are positive
(37.3%), the results still suggest that CAR procedures perform much
better than CR when cocaine use is omitted in the design.
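The source does not spell out how the percentage reduction in variance is computed. A plausible reading, stated here as an assumption, is 1 - (s.d. under the procedure / s.d. under CR)^2. The Python sketch below applies that assumed formula to the observed-covariate column of Table 14; the function name is hypothetical.

```python
def percent_reduction_in_variance(sd_procedure, sd_cr):
    """Assumed definition: 100 * (1 - Var_procedure / Var_CR)."""
    return 100.0 * (1.0 - (sd_procedure / sd_cr) ** 2)

# Standard deviations copied from Table 14 (cocaine use observed).
sd_cr = 0.601
for name, sd in [("STR-PB", 0.556), ("PS", 0.111), ("HH", 0.112)]:
    print(name, round(percent_reduction_in_variance(sd, sd_cr), 1))
```

Because the tabulated standard deviations are rounded to three decimals, values recomputed this way can differ from the reported percentages in the last digit.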
6 New covariate-adaptive randomization procedures: continuous covariates; many covariates; network structures and others
6.1 Introduction
Example: Remdesivir-COVID-19 trial (China). The Remdesivir in adults with severe COVID-19 trial (Wang et al. 2020) is a randomized, double-blind, placebo-controlled, multicentre trial that aimed to compare Remdesivir with placebo. There were 236 patients in the trial. There are about 20 baseline covariates for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A stratified (according to the level of respiratory support) permuted block (30 patients per block) randomization procedure was implemented. At the end of this trial, some important imbalances existed at enrollment between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than the placebo group.
Example: GATE project. Project GATE (Growing America Through Entrepreneurship), sponsored by the U.S. Department of Labor, was designed to evaluate the impact of offering tuition-free entrepreneurship training services (GATE services) on helping clients create, sustain or expand their own business.
(https://www.doleta.gov/reports/projectgate/)
The cornerstone is complete randomization. Members of the treatment group were offered GATE services; members of the control group were not.
• n = 4,198 participants
• p = 105 covariates
Example: Online A/B testing (Kohavi and Thomke, 2017). Leading technology companies each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.
Amazon's experiment.
Treatment A: Credit card offers on front page.
Treatment B: Credit card offers on the shopping cart page.
This (change from A to B) boosted profits by tens of millions of US Dollars annually.
The data are often networked (dependent, with interference). How should these studies be designed?
• Improve accuracy and efficiency of inference.
• Remove the bias and increase the power.
• Increases the interpretability of results by making the units more
comparable, enhance the credibility.
• More robust against model misspecification.
• Rubin (2008): the greatest possible efforts should be made during
the design phase rather than the analysis stage.
• Randomization: an essential tool for evaluating treatment effect.
• Traditional randomization methods (e.g., complete randomization
(CR)): unsatisfactory, unbalanced prognostic or baseline
covariates.
“Most of experimenters on carrying out a random
assignment of plots will be shocked to find out how far from
equally the plots distribute themselves.” —Fisher (1926)
What if large p and large n?
• The phenomenon of covariate imbalance is exacerbated as p and n
increase.
• Ubiquitous in the era of big data.
• Example: suppose the probability of one particular covariate being unbalanced is $\alpha = 5\%$ and the covariates are independent. For a study with $p = 10$ covariates, the chance of at least one covariate exhibiting imbalance is $1 - (1 - \alpha)^{p} \approx 40\%$. With 100 covariates, the chance is $1 - (1 - \alpha)^{100} \approx 99.4\%$.
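The arithmetic in the example can be checked with a few lines of Python. The function name is hypothetical, and the formula assumes the covariates are independent, which the example implicitly requires.

```python
def prob_any_imbalance(alpha, p):
    """Chance that at least one of p independent covariates is unbalanced,
    when each is unbalanced with probability alpha."""
    return 1.0 - (1.0 - alpha) ** p

for p in (1, 10, 100):
    print(p, round(prob_any_imbalance(0.05, p), 4))
```

With correlated covariates the probability would differ, so these figures are best read as an order-of-magnitude illustration.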
6.2 Rerandomization
Morgan and Rubin (2012) proposed rerandomization.
(1) Collect covariate data.
(2) Specify a balance criterion, $M < a$, i.e., a threshold on the Mahalanobis distance,
\[ M = (\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x}_1 - \bar{x}_2)]^{-1} (\bar{x}_1 - \bar{x}_2), \]
where $\bar{x}_1$ and $\bar{x}_2$ are the covariate sample means of the two treatment groups.
(3) Randomize the units using the complete randomization (CR).
(4) Check the balance criterion, M < a.
• If satisfied, go to Step (5); otherwise, return to Step (3).
(5) Perform the experiment using the final randomization obtained in
Step (4).
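Steps (1) to (5) can be sketched in pure Python as below. This is an illustration under several assumptions that are not in the source: the randomization in Step (3) is a random equal split, cov(x̄1 − x̄2) is approximated by S(1/n1 + 1/n2) with S the sample covariance of all units, and all function names and the `max_tries` guard are hypothetical.

```python
import random

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def cov_matrix(rows):
    n, p = len(rows), len(rows[0])
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def mahalanobis(x, t, S):
    """M = d' [cov(d)]^{-1} d, with cov(d) approximated by S (1/n1 + 1/n2)."""
    g1 = [r for r, ti in zip(x, t) if ti == 1]
    g2 = [r for r, ti in zip(x, t) if ti == 0]
    d = [a - b for a, b in zip(mean_vec(g1), mean_vec(g2))]
    scale = 1.0 / len(g1) + 1.0 / len(g2)
    Sd = [[s * scale for s in row] for row in S]
    return sum(di * yi for di, yi in zip(d, solve(Sd, d)))

def rerandomize(x, a, seed=0, max_tries=100000):
    """Redraw a random equal split until the Mahalanobis distance is below a."""
    rng = random.Random(seed)
    n = len(x)
    S = cov_matrix(x)                       # Step (1): covariate data
    t = [1] * (n // 2) + [0] * (n - n // 2)
    for tries in range(1, max_tries + 1):
        rng.shuffle(t)                      # Step (3): randomize
        M = mahalanobis(x, t, S)            # Step (4): check M < a
        if M < a:
            return t[:], M, tries           # Step (5): use this allocation
    raise RuntimeError("no acceptable randomization found")

# Hypothetical data: 40 units with 2 standard normal covariates.
rng = random.Random(1)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
t, M, tries = rerandomize(x, a=1.0, seed=2)
print(round(M, 3), tries)
```

The returned `tries` makes the cost of the loop visible: the smaller the threshold `a`, the more redraws are needed before an allocation is accepted.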
• Desirable properties for causal inference:
– Reduction in variance of estimated treatment effect.
• Work well with a few covariates.
Drawbacks:
• Not suitable for sequential experiments.
• Unable to scale up to massive data.
• As $p$ increases, the probability of acceptance $p_a = P(M < a)$ decreases, causing rerandomization (RR) to remain in the loop for a long time.
Examples of Rerandomization.
Mahalanobis Distance (CAM)
$x_i \in \mathbb{R}^p$: covariates of the $i$-th unit, $i = 1, \ldots, n$.
$T_i \in \{1, 0\}$: treatment assignment of the $i$-th unit.
• $T_i = 1$: treatment 1.
• $T_i = 0$: treatment 2.
(1) Use the newly defined Mahalanobis distance
\[ M(n) = 0.25\,(\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x})]^{-1} (\bar{x}_1 - \bar{x}_2). \]
(2) Randomly arrange the units in a sequence $x_1, x_2, x_3, x_4, x_5, x_6, \ldots, x_n$ and group them into consecutive pairs: $(x_1, x_2)$ is the 1st pair, $(x_3, x_4)$ the 2nd pair, $(x_5, x_6)$ the 3rd pair, and so on.
(3) Assign the 1st pair: $T_1 = 1$, $T_2 = 0$.
(4) For the next pair, i.e., the $(2i+1)$-th and $(2i+2)$-th units ($i \ge 1$):
(4a) If $T_{2i+1} = 1$ and $T_{2i+2} = 0$, obtain the "potential" $M_i^{(1)}$.
(4b) If $T_{2i+1} = 0$ and $T_{2i+2} = 1$, obtain the "potential" $M_i^{(2)}$.
(5) Assign the $(2i+1)$-th and $(2i+2)$-th units by
\[ P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots) =
\begin{cases}
q, & \text{if } M_i^{(1)} < M_i^{(2)}, \\
1 - q, & \text{if } M_i^{(1)} > M_i^{(2)}, \\
0.5, & \text{if } M_i^{(1)} = M_i^{(2)},
\end{cases} \]
\[ P(T_{2i+1} = 0, T_{2i+2} = 1 \mid x_{2i}, T_{2i}, \ldots) = 1 - P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots), \]
where
• $0.5 < q < 1$.
• Note: $T_{2i+1} = T_{2i+2}$ (both 0 or both 1) is not allowed.
(6) Repeat Steps (4) and (5) until all units are assigned.
• A smaller value of M(n) indicates a better covariate balance.
• q = 0.75. More discussion in Hu and Hu (2012).
• Units are not observed sequentially; however, we allocate them
sequentially (in pairs).
• Better covariate balance.
• n! different possible sequences. Similar performance.
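The pairwise allocation can be sketched in pure Python as follows. Several choices here are assumptions rather than the authors' implementation: the covariance is estimated once from all units, cov(x̄) is taken as S divided by the number of units assigned so far, n is assumed even, and the function names are hypothetical.

```python
import random

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def cov_matrix(rows):
    n, p = len(rows), len(rows[0])
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def quad_form(d, S):
    """d' S^{-1} d."""
    return sum(di * yi for di, yi in zip(d, solve(S, d)))

def cam(x, q=0.75, seed=0):
    """Pairwise sequential allocation; returns (t, M(n)) with t[i] in {1, 0}."""
    rng = random.Random(seed)
    n, p = len(x), len(x[0])
    S = cov_matrix(x)                 # assumption: covariance from all units
    order = list(range(n))
    rng.shuffle(order)                # Step (2): random sequence
    t = [None] * n
    sum1, sum0 = [0.0] * p, [0.0] * p   # running covariate sums per group

    def m_if(u, v, k):
        # M after k units per group if u gets treatment 1 and v treatment 2.
        d = [(sum1[j] + x[u][j]) / k - (sum0[j] + x[v][j]) / k for j in range(p)]
        return 0.25 * (2 * k) * quad_form(d, S)

    for pair in range(n // 2):
        u, v = order[2 * pair], order[2 * pair + 1]
        k = pair + 1
        if pair == 0:
            first_gets_1 = True       # Step (3)
        else:
            m1, m2 = m_if(u, v, k), m_if(v, u, k)     # Steps (4a), (4b)
            prob = 0.5 if m1 == m2 else (q if m1 < m2 else 1 - q)
            first_gets_1 = rng.random() < prob        # Step (5)
        a, b = (u, v) if first_gets_1 else (v, u)
        t[a], t[b] = 1, 0
        for j in range(p):
            sum1[j] += x[a][j]
            sum0[j] += x[b][j]
    k = n // 2
    d = [sum1[j] / k - sum0[j] / k for j in range(p)]
    return t, 0.25 * n * quad_form(d, S)

# Hypothetical data: 200 units with 3 standard normal covariates.
rng = random.Random(3)
x = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
t, M = cam(x, q=0.75, seed=4)
print(sum(t), round(M, 4))
```

Each pair contributes one unit to each arm, so the two groups always have equal size; only the covariate balance is steered by the biased coin.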
Properties of CAM
Under CAM, suppose the $x_i$ are i.i.d. multivariate normal; then $M(n) = O_p(n^{-1})$.
Note:
• Under CR, $M_{CR}(n) \sim \chi^2_{df=p}$, a stationary chi-square distribution with $p$ degrees of freedom, regardless of $n$.
• Under RR, $M_{RR}(n) \sim \chi^2_{df=p} \mid \chi^2_{df=p} < a$, a stationary chi-square distribution with $p$ degrees of freedom conditional on $M_{RR}(n) < a$, regardless of $n$.
• Under CAM, M(n)→ 0 at the rate of 1/n.
– More units, better balance.
– Advantages of CAM in large n.
Properties of CAM
As p increases,
• Under CR, the stationary distribution becomes flatter, poorer
covariate balance.
• Under RR, the stationary distribution becomes flatter, poorer
covariate balance.
• Under CAM, M(n)→ 0 at the rate of 1/n, regardless of p.
– The effect of p on M(n) is less severe than CR and RR.
Properties of CAM
• Works for sequential experiments, just estimate the covariance
matrix sequentially.
• Capable for large p and large n.
• Better covariate balance.
• Less computational time.
6.3.1 Estimating treatment effect
A natural setup of A/B testing:
• The observed outcome is $y_i$, $i = 1, \ldots, n$, for each unit.
• Let $y_i(T_i)$ represent the potential outcome of the $i$-th unit under the treatment $T_i$.
• $y_i = y_i(1) T_i + y_i(0)(1 - T_i)$.
• The average treatment effect is
\[ \tau = \frac{\sum_{i=1}^{n} y_i(1)}{n} - \frac{\sum_{i=1}^{n} y_i(0)}{n}. \]
• The fundamental problem: we only observe $y_i(T_i)$ for one particular $T_i$; therefore, $\tau$ cannot be calculated directly.
A natural estimate, $\hat{\tau}$:
\[ \hat{\tau} = \frac{\sum_{i=1}^{n} T_i y_i}{\sum_{i=1}^{n} T_i} - \frac{\sum_{i=1}^{n} (1 - T_i) y_i}{\sum_{i=1}^{n} (1 - T_i)}. \]
• $\hat{\tau}$ can be badly affected by imbalance in covariates.
• Example: estimating a drug effect when one treatment group is predominantly male and the other predominantly female. The gender effect cannot be separated from the drug effect.
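A small Python sketch of the difference-in-means estimate and of the gender example. The potential outcomes are invented for illustration: every untreated man scores 10, every untreated woman scores 20, and treatment adds 2 for everyone, so the true average treatment effect is 2.

```python
def diff_in_means(y, t):
    """Difference between the mean observed outcome of the treated (t = 1)
    and of the control units (t = 0)."""
    n1 = sum(t)
    n0 = len(t) - n1
    mean1 = sum(yi for yi, ti in zip(y, t) if ti == 1) / n1
    mean0 = sum(yi for yi, ti in zip(y, t) if ti == 0) / n0
    return mean1 - mean0

# Hypothetical potential outcomes for 4 men followed by 4 women.
y0 = [10, 10, 10, 10, 20, 20, 20, 20]
y1 = [v + 2 for v in y0]

def observed(t):
    # y_i = y_i(1) T_i + y_i(0) (1 - T_i)
    return [a * ti + b * (1 - ti) for a, b, ti in zip(y1, y0, t)]

balanced = [1, 1, 0, 0, 1, 1, 0, 0]     # 2 men and 2 women per group
imbalanced = [1, 1, 1, 0, 1, 0, 0, 0]   # treated: 3 men, 1 woman

print(diff_in_means(observed(balanced), balanced))
print(diff_in_means(observed(imbalanced), imbalanced))
```

Under the balanced allocation the estimate equals the true effect, while under the imbalanced one the gender difference dominates and even flips the sign.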
Theoretical properties:
(1) Unbiasedness: under CAM, $E(\hat{\tau}) = \tau$.
(2) Under CAM, $\mathrm{Var}(\hat{\tau})$ attains the lower bound asymptotically.
(3) This implies that
\[ \mathrm{Var}_{CAM}(\hat{\tau}) < \mathrm{Var}_{RR}(\hat{\tau}) < \mathrm{Var}_{CR}(\hat{\tau}). \]
6.3.2 Examples
Real Data Example I - Project GATE (Example 4)
• Two treatment groups:
Treatment: were offered GATE services; control: were not offered
GATE services.
• p = 105 (covariates obtained from the application packages, 13
continuous and 92 categorical)
• Sample size n = 3,448 (those of the 4,198 participants who answered the evaluation survey 6 months after the assignment)
• Original allocation M = 75.27, moderate covariate imbalance.
• We repeat the allocation 1,000 times for these participants using
CAM, complete randomization and rerandomization.
CAM vs Rerandomization
The maximum of the Mahalanobis distances obtained from CAM is 12. If we set the balance criterion for rerandomization to $M < 12$, the probability of acceptance is $P_a = P(\chi^2_{df=105} < 12) = 3.4 \times 10^{-31}$, which means it is nearly impossible for rerandomization to achieve a balance level similar to that of CAM.
We set $P_a = 2 \times 10^{-5}$ for rerandomization so that its computational time is similar to that of CAM.
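The acceptance probability can be checked without external libraries. The sketch below evaluates the chi-square distribution function through the power series of the regularized lower incomplete gamma function; the function name `chi2_cdf` and the fixed number of series terms are illustrative choices, and the series is only suitable when x/2 is small relative to df/2, as it is here.

```python
import math

def chi2_cdf(x, df, terms=200):
    """P(chi-square with df degrees of freedom < x), via the series
    P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_k z^k / ((a+1)...(a+k)),
    with a = df / 2 and z = x / 2."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 0.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= z / (a + k)
        total += term
    return math.exp(log_prefactor) * total

p_accept = chi2_cdf(12, 105)
print(p_accept)

# If draws are independent, the number of redraws until acceptance is
# geometric, so the expected number of draws is 1 / P_a.
print(1 / 2e-5)
```

At the acceptance probability used in the comparison, an accepted allocation takes on the order of fifty thousand redraws on average.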
Comparison of Mahalanobis Distance
[Figure: density of the Mahalanobis distance (x-axis from 0 to 140) under Complete Randomization, CAM and Rerandomization.]
Estimation.
• The outcome variable (0/1): has owned a business within 6
months after assignment or not.
• After the allocation, we simulate the outcome variable according to
\[ \mathrm{logit}(P(y_i^{sim} = 1)) = \hat{\mu}_1 T_i^{sim} + \hat{\mu}_2 (1 - T_i^{sim}) + x_i^T \hat{\beta} + \epsilon^{sim}, \]
where $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\beta}$ are obtained by fitting a regression to the original data, and $\epsilon^{sim}$ is drawn from the residuals of that regression.
Compare the estimation performance (PRIV) of CAM
and rerandomization.
Method PRIV un or va
CAM 17.7% 0.081
Rerandomization 10.5% 0.505
6.4 Network A/B Testing
Let a graph G be represented by an $n \times n$ symmetric adjacency matrix $A = [A_{ij}]$.
Balancing the network, i.e. n-dimensional binary vectors, is hard.
Zhou, Li and Hu (2019) proposed several methods and discussed their
theoretical and finite sample properties.
New methods vs Complete randomization (MSE)
[Figure: standard deviation against number of nodes (100 to 600); curves: random, coordinate descent.]
New methods vs Complete randomization (ATE)
[Figure: density of the bias (-4 to 3); curves: random, coordinate descent.]
New methods vs Complete randomization (MSE)
[Figure: standard deviation against number of nodes (0 to 500); legend: random.]
New methods vs Complete randomization (ATE)
[Figure: density of the bias (-2 to 2); curves: random, coordinate descent.]
6.5 Balance Covariates based on general Kernels
CAM only considers the means of the two groups. The covariance structure is also important in statistical analysis. Therefore, Ma, Li and Hu (2021) proposed the following distance measure, which combines both mean and covariance differences:
\[ IBT(n) = (\bar{x}_1 - \bar{x}_2)^T \mathrm{cov}(x)^{-1} (\bar{x}_1 - \bar{x}_2) + \mathrm{trace}\{(\hat{\Sigma}_1 - \hat{\Sigma}_2)^2\} / p, \]
where $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$ are the sample covariance matrices of the two treatment groups.
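The measure can be computed directly from its definition. The pure-Python sketch below is illustrative: the helper names are hypothetical, cov(x) is taken to be the sample covariance of all units, and each group needs at least two units for its covariance matrix to exist.

```python
import random

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def cov_matrix(rows):
    n, p = len(rows), len(rows[0])
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ibt(x, t):
    """Mean term d' cov(x)^{-1} d plus trace((S1 - S2)^2) / p."""
    p = len(x[0])
    g1 = [r for r, ti in zip(x, t) if ti == 1]
    g2 = [r for r, ti in zip(x, t) if ti == 0]
    d = [a - b for a, b in zip(mean_vec(g1), mean_vec(g2))]
    mean_term = sum(di * yi for di, yi in zip(d, solve(cov_matrix(x), d)))
    S1, S2 = cov_matrix(g1), cov_matrix(g2)
    D = [[S1[a][b] - S2[a][b] for b in range(p)] for a in range(p)]
    # trace(D^2) = sum over a, b of D[a][b] * D[b][a]
    trace_sq = sum(D[a][b] * D[b][a] for a in range(p) for b in range(p))
    return mean_term + trace_sq / p

# Hypothetical data: 100 units, 2 covariates, alternating allocation.
rng = random.Random(5)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
t = [i % 2 for i in range(100)]
print(round(ibt(x, t), 4))
```

The second term is zero only when the two groups have identical sample covariance matrices, which is the aspect of balance that a mean-only criterion does not see.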
[Figure: New method vs CAM vs complete randomization. Nine density panels of the Mahalanobis distance (x-axis 0 to 8; y-axis density, 0.0 to 2.0), one for each combination of n = 100, 300, 500 and p = 2, 6, 10. Legend: CAM, Trace+, CR.]
[Figure: New method vs CAM vs complete randomization. Nine density panels of Trace((Sigma_1 − Sigma_0)^2)/p (x-axis 0.0 to 0.4; y-axis density, 0 to 25), one for each combination of n = 100, 300, 500 and p = 2, 6, 10. Legend: CAM, Trace+, CR.]
Ma, Li and Hu (2021) make the following contributions:
(i) a general framework of kernel covariate-adaptive randomization that
attains covariate balance for a large class of functions residing in a
high-dimensional or even infinite-dimensional space;
(ii) with the kernel trick commonly used in machine learning, the
framework unifies several recently proposed covariate-adaptive designs
and generalizes to a much broader family, with imbalance measures
defined in a consistent manner;
(iii) the convergence rate of covariate imbalance is bounded in
probability;
and (iv) the framework can balance covariance matrices between
treatments, and it shows excellent and robust performance in finite
samples.
6.6 Examples
Example: Remdesivir COVID-19 trial (China). The remdesivir
in adults with severe COVID-19 trial (Wang et al. 2020) is a
randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare remdesivir with placebo. There were 236 patients
in the trial. There are about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of the trial, some important baseline
imbalances between the groups were found, including more patients with
hypertension, diabetes, or coronary artery disease in the remdesivir
group than in the placebo group.
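The stratified permuted block procedure mentioned above can be sketched in a few lines. This is an illustration only, not the trial's randomization code: the function name `stratified_permuted_block`, the equal allocation within each block, and the seed are assumptions made for the example. Balance is enforced only within each stratum, so covariates that are not stratification factors can still end up imbalanced.

```python
import random
from collections import defaultdict

def stratified_permuted_block(strata, block_size=4, seed=0):
    """Assign patients (in arrival order) to treatment 1 or 0.

    strata     : sequence of stratum labels, one per patient
    block_size : even block size; each block holds equal numbers of 1s and 0s
                 (equal allocation is an assumption of this sketch)
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for equal allocation")
    rng = random.Random(seed)
    pending = defaultdict(list)  # unused assignments of the current block, per stratum
    assignments = []
    for s in strata:
        if not pending[s]:
            # Start a new randomly permuted block for this stratum
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[s] = block
        assignments.append(pending[s].pop())
    return assignments
```

For example, `stratified_permuted_block(["low", "high", "low", "low"], block_size=30)` returns one assignment per patient, drawn from a separate permuted block for each respiratory-support stratum.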
Example: Moderna COVID-19 vaccine trial (2020). The
trial began on July 27, 2020, and enrolled 30,420 adult volunteers at
clinical research sites across the United States. Volunteers were
randomly assigned 1:1 to receive either two 100 microgram (mcg)
doses of the investigational vaccine or two shots of saline placebo 28
days apart. The average age of volunteers is 51 years. Approximately
47% are female, 25% are 65 years or older and 17% are under the age
of 65 with medical conditions placing them at higher risk for severe
COVID-19. Approximately 79% of participants are white, 10% are
Black or African American, 5% are Asian, 0.8% are American Indian or
Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%
are multiracial, and 21% (of any race) are Hispanic or Latino.
From the start of the trial through Nov. 25, 2020, investigators
recorded 196 cases of symptomatic COVID-19 occurring among
participants at least 14 days after they received their second shot. One
hundred and eighty-five cases (30 of which were classified as severe
COVID-19) occurred in the placebo group and 11 cases (0 of which
were classified as severe COVID-19) occurred in the group receiving
mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%
lower in those participants who received mRNA-1273 as compared to
those receiving placebo.
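The reported reduction can be checked from the case counts. The sketch below assumes that vaccine efficacy is computed as one minus the ratio of attack rates and, because the text gives no person-time or per-group denominators for this analysis, that the two groups are the same size (plausible under 1:1 randomization). The function name `vaccine_efficacy` is hypothetical.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo, n_vaccine=1.0, n_placebo=1.0):
    """Vaccine efficacy as 1 - (attack rate in vaccine group) / (attack rate in placebo group).

    With the default denominators the two groups are treated as equally
    sized, which is an assumption rather than trial data.
    """
    rate_vaccine = cases_vaccine / n_vaccine
    rate_placebo = cases_placebo / n_placebo
    return 1.0 - rate_vaccine / rate_placebo

# Primary analysis counts quoted above: 11 cases (mRNA-1273) vs 185 cases (placebo)
efficacy = vaccine_efficacy(11, 185)
print(f"{100 * efficacy:.1f}%")
```

Under the equal-size assumption this reproduces the quoted 94.1%. Published estimates use person-time at risk, so small differences from this simple ratio are expected in other analyses.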
Investigators observed 236 cases of symptomatic COVID-19 among
participants at least 14 days after they received their first shot, with
225 cases in the placebo group and 11 cases in the group receiving
mRNA-1273. The vaccine efficacy was 95.2% for this secondary
analysis.
Example: PFIZER-BIONTECH COVID-19 VACCINE.
Safety and Efficacy of the BNT162b2 mRNA Covid-19
Vaccine (2020).
BACKGROUND Severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2) infection and the resulting coronavirus disease 2019
(Covid-19) have afflicted tens of millions of people in a worldwide
pandemic. Safe and effective vaccines are needed urgently.
METHODS In an ongoing multinational, placebo-controlled,
observer-blinded, pivotal efficacy trial, we randomly assigned persons
16 years of age or older in a 1:1 ratio to receive two doses, 21 days
apart, of either placebo or the BNT162b2 vaccine candidate (30 μg per
dose). BNT162b2 is a lipid nanoparticle–formulated,
nucleoside-modified RNA vaccine that encodes a prefusion stabilized,
membrane-anchored SARS-CoV-2 full-length spike protein. The
primary end points were efficacy of the vaccine against
laboratory-confirmed Covid-19 and safety.
A total of 43,548 participants underwent randomization, of whom
43,448 received injections: 21,720 with BNT162b2 and 21,728 with
placebo. There were 8 cases of Covid-19 with onset at least 7 days
after the second dose among participants assigned to receive
BNT162b2 and 162 cases among those assigned to placebo;
BNT162b2 was 95% effective in preventing Covid-19 (95% credible
interval, 90.3 to 97.6). Similar vaccine efficacy (generally 90 to 100%)
was observed across subgroups defined by age, sex, race, ethnicity,
baseline body-mass index, and the presence of coexisting conditions.
Among 10 cases of severe Covid-19 with onset after the first dose, 9
occurred in placebo recipients and 1 in a BNT162b2 recipient. The
safety profile of BNT162b2 was characterized by short-term,
mild-to-moderate pain at the injection site, fatigue, and headache.
The incidence of serious adverse events was low and was similar in the
vaccine and placebo groups.
CONCLUSIONS A two-dose regimen of BNT162b2 conferred 95%
protection against Covid-19 in persons 16 years of age or older. Safety
over a median of 2 months was similar to that of other viral vaccines.
(Funded by BioNTech and Pfizer; ClinicalTrials.gov number,
NCT04368728.)
Example: REAL-WORLD EVIDENCE CONFIRMS
HIGH EFFECTIVENESS OF PFIZER-BIONTECH
COVID-19 VACCINE AND PROFOUND PUBLIC
HEALTH IMPACT OF VACCINATION ONE YEAR
AFTER PANDEMIC DECLARED.
The Israel Ministry of Health (MoH), Pfizer Inc. (NYSE: PFE) and
BioNTech SE (Nasdaq: BNTX) today announced real-world evidence
demonstrating dramatically lower incidence rates of COVID-19 disease
in individuals fully vaccinated with the Pfizer-BioNTech COVID-19
Vaccine (BNT162b2), underscoring the observed substantial public
health impact of Israel’s nationwide immunization program. These
new data build upon and confirm previously released data from the
MoH demonstrating the vaccine’s effectiveness in preventing
symptomatic SARS-CoV-2 infections, COVID-19 cases,
hospitalizations, severe and critical hospitalizations, and deaths. The
latest analysis from the MoH proves that two weeks after the second
vaccine dose protection is even stronger – vaccine effectiveness was at
least 97% in preventing symptomatic disease, severe/critical disease
and death. This comprehensive real-world evidence can be of
importance to countries around the world as they advance their own
vaccination campaigns one year after the World Health Organization
(WHO) declared COVID-19 a pandemic.
Findings from the analysis were derived from de-identified aggregate
Israel MoH surveillance data collected between January 17 and March
6, 2021, when the Pfizer-BioNTech COVID-19 Vaccine was the only
vaccine available in the country and when the more transmissible
B.1.1.7 variant of SARS-CoV-2 (formerly referred to as the U.K.
variant) was the dominant strain. Vaccine effectiveness was at least
97% against symptomatic COVID-19 cases, hospitalizations, severe
and critical hospitalizations, and deaths. Furthermore, the analysis
found a vaccine effectiveness of 94% against asymptomatic
SARS-CoV-2 infections. For all outcomes, vaccine effectiveness was
measured from two weeks after the second dose.
Following the authorization for emergency use of the Pfizer-BioNTech
COVID-19 Vaccine in Israel on December 6, 2020, the Israel MoH
launched a national vaccination program targeting individuals age 16
years or older – a total of 6.4 million people, representing 71% of the
population. The vaccination program started at the beginning of a
large surge of SARS-CoV-2 infections in Israel, which later resulted in
a national lockdown starting on January 8, 2021.
This MoH analysis uses de-identified aggregate Israel MoH public
health surveillance data from January 17 through March 6, 2021
(analysis period); the start of the analysis period corresponds to seven
days after individuals began receiving second doses of the
Pfizer-BioNTech COVID-19 Vaccine. MoH regularly collects
comprehensive, real-time data on SARS-CoV-2 testing, COVID-19
cases including date of symptom onset, and vaccination history
through a nationally notifiable disease registry and the national
medical record database.
Vaccine effectiveness estimates – adjusted to account for variances in
age, gender and the week specimens were collected – were determined
for the prevention of six laboratory-confirmed SARS-CoV-2 outcomes
comparing unvaccinated and fully-vaccinated individuals: SARS-CoV-2
infections (includes symptomatic and asymptomatic infections);
asymptomatic SARS-CoV-2 infections; COVID-19 cases (symptomatic
only); COVID-19 hospitalizations; severe (respiratory distress,
including >30 breaths per minute, oxygen saturation on room air <94%,
and/or ratio of arterial partial pressure of oxygen to fraction of inspired
oxygen <300 mm mercury) and critical (mechanical ventilation, shock,
and/or heart, liver or kidney failure) COVID-19 hospitalizations; and
COVID-19 deaths.
The MoH analysis was conducted when more than 80% of tested
specimens in Israel were variant B.1.1.7, providing real-world evidence
of the effectiveness of BNT162b2 for prevention of COVID-19
infections, hospitalizations, and deaths due to variant B.1.1.7.
However, this analysis was not able to evaluate vaccine effectiveness
against B.1.351 (formerly referred to as the South African variant) due
to the limited number of infections caused by this strain in Israel at
the time the analysis was conducted.
The vaccine effectiveness estimates align with the 95% vaccine efficacy
of BNT162b2 against COVID-19 demonstrated in the pivotal
Randomized Clinical Trial (RCT) of BNT162b2. However, this
observational analysis differs from the RCT in several aspects. Vaccine
effectiveness estimates may be affected by differences between
vaccinated and unvaccinated persons (i.e., different test-seeking
behaviors or levels of adherence to preventive measures). In the RCT,
randomization minimized the impact of differences between vaccinated
and unvaccinated. Despite efforts to adjust for these effects in the
available dataset, the possibility remains of unmeasured distortions.
For example, findings from the Maccabi HMO indicate that
neighborhood may be an important factor. Further vaccine
effectiveness analyses investigating the effect of additional covariates
such as location, comorbidities, race/ethnicity, and likelihood of
seeking SARS-CoV-2 testing are warranted.
Pfizer-BioNTech’s coronavirus vaccine offers more protection than
earlier thought, with effectiveness in preventing symptomatic disease
reaching 97%, according to real-world evidence published Thursday by
the pharma companies.
Using data from January 17 to March 6 from Israel’s national
vaccination campaign, Pfizer-BioNTech found that prevention against
asymptomatic disease also reached 94 percent.
“We are extremely encouraged that the real-world effectiveness data
coming from Israel are confirming the high efficacy demonstrated in
our Phase 3 clinical trial and showing the significant impact of the
vaccine in preventing severe disease and deaths due to COVID-19,”
said Luis Jodar, Ph.D., senior vice president and chief medical officer
of Pfizer Vaccines.
7 Statistical Inference after covariate-adaptive randomization
7.1 Some concerns
First we use simulations to study the Type I error of hypothesis
tests comparing treatment effects under three designs: Pocock
and Simon’s marginal procedure, stratified permuted block design, and
complete randomization. For each design, both a continuous
case and a discrete case are considered. The following linear model
(with two covariates $Z_1$ and $Z_2$) is assumed for the responses $Y_i$:
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 Z_{i,1} + \beta_2 Z_{i,2} + \varepsilon_i,$$
where $\varepsilon_i$ is distributed as $N(0, 1)$ and $\beta_1 = \beta_2 = 1$. To study the
Type I error, no difference in treatment effects is assumed, i.e., $\mu_1 = \mu_2$.
For the discrete case, $Z_1$ follows Bernoulli($p_1$) and $Z_2$ follows
Bernoulli($p_2$); for the continuous case, both $Z_1$ and $Z_2$ follow the
normal distribution $N(0, 1)$. If the covariates $Z_1$ and $Z_2$ are continuous,
they are discretized into Bernoulli variables $Z_1'$ and $Z_2'$ with the
probabilities $p_1$ and $p_2$ so that they can be used in randomization. More
specifically, if $Z_1 < Z(p_1)$, where $Z(p_1)$ is the $p_1$ quantile of the
standard normal distribution, then $Z_1' = 0$; otherwise $Z_1' = 1$. The
original variables (without discretization) are used in the statistical
inference procedures.
To carry out the simulations, a biased coin probability of 0.75 and equal
weights are used for Pocock and Simon’s marginal procedure, and a
block size of 4 is used for the stratified permuted block design. The
significance level is $\alpha = 0.05$ and the sample size $N$ is 100, 200 or 500.
The hypothesis tests include the two-sample t-test (t-test), the linear
model with the single covariate $Z_1$ (lm(Z1)), the linear model with the
single covariate $Z_2$ (lm(Z2)) and the linear model with both
covariates $Z_1$ and $Z_2$ (lm(Z1, Z2)). With $(p_1, p_2) = (0.5, 0.5)$,
the simulation results for Pocock and Simon’s marginal procedure,
stratified permuted block design and complete randomization are
shown in Tables 15 and 16.
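A small version of this simulation (discrete case, two-sample t-test only) can be sketched as follows. It is not the code behind the reported tables. The names `pocock_simon`, `two_sample_t` and `type1_error` are hypothetical, the imbalance measure is assumed to be the sum of squared marginal differences with equal weights (so the favoured arm is determined by the sign of the summed marginal differences), and the normal critical value 1.96 is used.

```python
import numpy as np

def pocock_simon(z, p_coin=0.75, rng=None):
    """Sequential marginal (minimization) assignment for discrete covariates z of shape (N, k)."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = z.shape
    assign = np.zeros(n, dtype=int)
    diff = [dict() for _ in range(k)]  # per margin: (# in arm 1) - (# in arm 0) at each level
    for i in range(n):
        levels = [int(v) for v in z[i]]
        d = sum(diff[j].get(levels[j], 0) for j in range(k))
        # Favour the arm that reduces the summed marginal imbalance
        prob1 = 0.5 if d == 0 else (1 - p_coin if d > 0 else p_coin)
        assign[i] = int(rng.random() < prob1)
        step = 1 if assign[i] == 1 else -1
        for j in range(k):
            diff[j][levels[j]] = diff[j].get(levels[j], 0) + step
    return assign

def two_sample_t(y1, y0):
    """Pooled-variance two-sample t statistic."""
    n1, n0 = len(y1), len(y0)
    pooled = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
    return (y1.mean() - y0.mean()) / np.sqrt(pooled * (1 / n1 + 1 / n0))

def type1_error(design, n=100, reps=1000, seed=0):
    """Rejection rate of the t-test under H0 for design 'PS' or 'CR'."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        z = rng.integers(0, 2, size=(n, 2))            # Z1, Z2 ~ Bernoulli(0.5)
        a = pocock_simon(z, rng=rng) if design == "PS" else rng.integers(0, 2, size=n)
        y = z[:, 0] + z[:, 1] + rng.normal(size=n)     # beta1 = beta2 = 1, mu1 = mu2
        rejections += abs(two_sample_t(y[a == 1], y[a == 0])) > 1.96
    return rejections / reps
```

Calling `type1_error("PS")` and `type1_error("CR")` gives Monte Carlo estimates that can be compared with the t-test columns of the tables. The exact values depend on the seed and the number of replications.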
In each simulation, the Type I error of the covariate-adaptive randomization
methods is also examined with the bootstrap t-test described in Shao,
Yu and Zhong (2010). To carry out the test, $B$ bootstrap samples
$(Y_1^{*b}, Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Y_N^{*b}, Z_{N,1}^{*b}, Z_{N,2}^{*b})$, $b = 1, 2, \ldots, B$, are generated
independently as simple random samples with replacement from
$(Y_1, Z_{1,1}, Z_{1,2}), \ldots, (Y_N, Z_{N,1}, Z_{N,2})$. The covariate-adaptive
procedure used on the original data is applied to the covariates of each
bootstrap sample, $(Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Z_{N,1}^{*b}, Z_{N,2}^{*b})$, from which the
bootstrap analogues of the treatment assignments, $I_1^{*b}, \ldots, I_N^{*b}$, are
obtained.
Define
$$\bar{Y}_1 - \bar{Y}_2 = \frac{1}{n_1}\sum_{i=1}^{N} I_i Y_i - \frac{1}{n_2}\sum_{i=1}^{N} (1 - I_i) Y_i, \qquad n_1 = \sum_{i=1}^{N} I_i, \quad n_2 = N - n_1,$$
and
$$\hat{\theta}^{*}(b) = \frac{1}{n_1^{*b}}\sum_{i=1}^{N} I_i^{*b} Y_i^{*b} - \frac{1}{n_2^{*b}}\sum_{i=1}^{N} (1 - I_i^{*b}) Y_i^{*b}, \qquad n_1^{*b} = \sum_{i=1}^{N} I_i^{*b}, \quad n_2^{*b} = \sum_{i=1}^{N} (1 - I_i^{*b}).$$
The bootstrap estimator of the variance of $\bar{Y}_1 - \bar{Y}_2$ is then the sample
variance of $\hat{\theta}^{*}(b)$, $b = 1, 2, \ldots, B$, denoted by $\hat{v}_B$. The
bootstrap t-test then has the form $T_B = (\bar{Y}_1 - \bar{Y}_2)/\hat{v}_B^{1/2}$. Shao, Yu
and Zhong (2010) show that the bootstrap t-test maintains the
nominal Type I error under the covariate-adaptive biased coin design.
$B = 500$ is used in all of the following simulations.
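The bootstrap t-test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Shao, Yu and Zhong (2010): the names `spb_assign`, `diff_in_means` and `bootstrap_t` are hypothetical, and a simple stratified permuted block rule (strata given by the rows of discrete covariates) stands in for whichever covariate-adaptive procedure was used on the original data.

```python
import numpy as np

def spb_assign(z, rng, block_size=4):
    """Stratified permuted block assignment; each distinct row of z is a stratum."""
    assign = np.empty(len(z), dtype=int)
    pending = {}  # unused assignments of the current block, per stratum
    for i, row in enumerate(map(tuple, z)):
        if not pending.get(row):
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            pending[row] = rng.permutation(block).tolist()
        assign[i] = pending[row].pop()
    return assign

def diff_in_means(y, assign):
    """Difference in mean response between treatment 1 and treatment 0."""
    return y[assign == 1].mean() - y[assign == 0].mean()

def bootstrap_t(y, z, assign, randomize, B=500, rng=None):
    """Bootstrap t statistic T_B = (Ybar_1 - Ybar_2) / sqrt(v_B)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    theta = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample (Y_i, Z_i) pairs with replacement
        a_star = randomize(z[idx], rng)    # re-run the design on the bootstrap covariates
        theta[b] = diff_in_means(y[idx], a_star)
    v_b = theta.var(ddof=1)                # sample variance of the bootstrap estimates
    return diff_in_means(y, assign) / np.sqrt(v_b)
```

The null hypothesis is rejected when the absolute value of the returned statistic exceeds the normal critical value. Any randomization function with the signature `randomize(z, rng)` can be passed in place of `spb_assign`.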
Table 15: Simulated Type I error for Pocock and Simon’s marginal
procedure (PS), stratified permuted block design (SPB) and complete
randomization (CR) in %. Simulations based on 10000 runs.
Z         Method  N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)  BS-t
Discrete  PS      100  1.75    3.05    3.09    5.21        5.18
Discrete  PS      200  1.62    2.78    2.86    4.99        4.88
Discrete  PS      500  1.66    2.81    2.77    4.87        4.90
Discrete  SPB     100  1.85    2.86    3.05    5.29        5.67
Discrete  SPB     200  1.54    2.69    2.73    4.84        4.95
Discrete  SPB     500  1.55    2.77    2.65    4.84        5.60
Discrete  CR      100  5.04    5.27    5.11    5.31        -
Discrete  CR      200  5.00    4.95    5.12    5.21        -
Discrete  CR      500  4.73    4.83    4.68    4.77        -
Table 16: Simulated Type I error for Pocock and Simon’s marginal
procedure (PS), stratified permuted block design (SPB) and complete
randomization (CR) in %. Simulations based on 10000 runs.
Z           Method  N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)  BS-t
Continuous  PS      100  1.43    2.15    2.02    4.98        5.16
Continuous  PS      200  1.07    1.74    1.80    4.53        5.62
Continuous  PS      500  0.91    1.72    1.73    4.72        4.79
Continuous  SPB     100  1.22    1.83    2.05    5.01        5.68
Continuous  SPB     200  0.98    1.86    1.77    5.08        5.19
Continuous  SPB     500  1.15    1.98    1.84    5.48        5.61
Continuous  CR      100  5.20    5.31    4.82    4.92        -
Continuous  CR      200  5.06    5.14    4.85    5.46        -
Continuous  CR      500  4.87    5.05    4.71    4.77        -
Several conclusions can be drawn from Tables 15 and 16. First, the Type I
error is close to 5% under the full model lm(Z1, Z2). This agrees with
the theoretical results in Section 7.4 for the case in which no
randomization covariate is omitted from the final analysis model. Second,
under both Pocock and Simon’s marginal procedure and the stratified
permuted block design, the two-sample t-test, lm(Z1) and lm(Z2) are
all conservative. Among these three tests, the two-sample t-test is the
most conservative, with the smallest Type I error.
Furthermore, the Type I error of the bootstrap t-test (BS-t) is close
to the nominal level of 5% under both Pocock and Simon’s marginal
procedure and the stratified permuted block design. Under complete
randomization, the Type I error is close to 5% for all four tests. We
also tried different values of (p1, p2); similar results were obtained
and are not shown here.
• Even though many covariate-adaptive designs have been proposed
and implemented, the discussion of corresponding statistical
inference is limited.
• In practice, conventional tests are just used without consideration
• It remains a concern if conventional statistical inference is still
valid.
In the literature
• By simulation, Forsythe (1987) suggests “minimization should be
considered for group assignment only if all variables used in
minimization are also to be used as covariate” to achieve valid
statistical inference.
• Shao, Yu and Zhong (2010) pointed out “if the covariates used in
covariate-adaptive randomization is a function of the covariates to
construct a test, the test is valid under covariate-adaptive
randomization.”
Conservativeness
• In practice, however, it is often the case that not all randomization
covariates are included in statistical inference:
– some covariates (investigation sites, etc.) are difficult to include
in the analysis;
– including them results in a more complicated model;
– including them requires correct model specification.
• Simulation studies by Birkett (1985), Forsythe (1987) and others
indicate that unadjusted analysis is conservative in covariate-adaptive
clinical trials.
• Shao, Yu and Zhong (2010) proved that, under a simple linear
regression model,
$$Y_{ij} = \mu_j + b Z_i + \varepsilon_{ij},$$
the two-sample t-test is conservative for stratified biased coin
designs.
Limitations
• Only the two-sample t-test is discussed. The properties are unknown if
partial covariate information is used in statistical inference.
• The result applies only to the (stratified) covariate-adaptive biased
coin design, which is not popular in applications.
• Only a simple linear model with one covariate is considered.
• There are no theoretical results about power.
Motivation
• Study properties of statistical inference for covariate-adaptive
randomized clinical trials
– which can be applied on a large family of covariate-adaptive
designs, including the ones (Pocock and Simon’s design and
others) widely used in practice.
– based on linear models and generalized linear models.
– on various types of hypothesis testing.
– new methods that adjust the Type I error and increase power.
7.2 Statistical Inference under Linear models
Properties of statistical inference will be studied for covariate-adaptive
designs
• under a linear regression framework.
• a subset of covariates of those used in randomization are included
in statistical inference procedures.
• two types of hypothesis testing are considered.
– comparing treatment effect.
– testing significance of covariates.
7.3 Framework
We (Ma, Hu and Zhang, 2015, JASA) considered a covariate-adaptive
randomized clinical trial with two treatments 1 and 2.
• Let N denote the total number of patients in study.
• Ii = 1, i = 1, 2, ..., N , if the ith patient is assigned to treatment
1, otherwise Ii = 0.
• The response for the ith patient is
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T + Z_{i,1} c_1^T + \cdots + Z_{i,q} c_q^T + \varepsilon_i, \qquad (1)$$
where
• Yi is the outcome of the ith patient;
• Xk and Zj , k = 1, ..., p and j = 1, ..., q are covariate information,
which can be either discrete or continuous.
• Both Xk and Zj are used in covariate-adaptive randomization,
but only Xk are used to construct analysis model.
• All covariates are assumed to be independent of each other.
• Furthermore, without loss of generality, it is assumed that
$E X_k = E Z_j = 0$ for all $k$ and $j$.
• The $\varepsilon_i$ are independent and identically distributed random errors with
mean zero and variance $\sigma_\varepsilon^2$, and are independent of $X_k$ and $Z_j$.
Analysis Model (Working Model)
• Assume both Xk and Zj are used in covariate-adaptive
randomization.
• Xk are used in final analysis.
• A linear regression model is implemented to do analysis.
$$E[Y_i] = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T. \qquad (2)$$
The model between the response and the covariates (1) is
$$Y = X\beta + Z\gamma + \varepsilon,$$
and the analysis model (2) is
$$E[Y] = X\beta,$$
where $Y = (Y_1, Y_2, \ldots, Y_N)^T$ are the outcomes, $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)^T$,
and $\beta = (\mu_1, \mu_2, b_1, \ldots, b_p)^T$ and $\gamma = (c_1, \ldots, c_q)^T$ are true but unknown
parameters. Furthermore, $X$ and $Z$ are
$$X = \begin{pmatrix}
I_1 & 1 - I_1 & X_{1,1} & \cdots & X_{1,p} \\
I_2 & 1 - I_2 & X_{2,1} & \cdots & X_{2,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
I_N & 1 - I_N & X_{N,1} & \cdots & X_{N,p}
\end{pmatrix}$$
and
$$Z = \begin{pmatrix}
Z_{1,1} & \cdots & Z_{1,q} \\
\vdots & \ddots & \vdots \\
Z_{N,1} & \cdots & Z_{N,q}
\end{pmatrix}.$$
The OLS estimator $\hat{\beta}$ of model (2) can be expressed as
$$\hat{\beta} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (X\beta + Z\gamma + \varepsilon) = \beta + (X^T X)^{-1} X^T Z \gamma + (X^T X)^{-1} X^T \varepsilon.$$
Comparing Treatment Effect
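To compare the treatment effects $\mu_1$ and $\mu_2$, consider
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A: \mu_1 - \mu_2 \neq 0. \qquad (3)$$
The test statistic is
$$T = \frac{L\hat{\beta}}{\left(\hat{\sigma}^2 L (X^T X)^{-1} L^T\right)^{1/2}}, \qquad (4)$$
where $L = (1, -1, 0, \ldots, 0)$, $\hat{\sigma}^2 = (Y - X\hat{\beta})^T (Y - X\hat{\beta})/(N - p' - 2)$, and
$p' + 2$ is the total number of parameters in model (2).
Reject $H_0$ if $|T| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th percentile of
the standard normal distribution.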
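The test statistic T in (4) can be computed directly from the design matrix. The sketch below is an illustration with NumPy, not the authors' code. The function name `treatment_test` is hypothetical, covariates are assumed to be numerically coded, and the critical value 1.959964 corresponds to α = 0.05.

```python
import numpy as np

def treatment_test(y, assign, x_cov, z_crit=1.959964):
    """Test H0: mu1 - mu2 = 0 in the working model E[Y] = X beta.

    y      : length-N responses
    assign : length-N array, 1 for treatment 1 and 0 for treatment 2
    x_cov  : (N,) or (N, p') array of covariates kept in the analysis model
    Returns the statistic T and whether H0 is rejected.
    """
    y = np.asarray(y, dtype=float)
    assign = np.asarray(assign, dtype=float)
    # Columns: I_i, 1 - I_i, then the analysis covariates (no separate intercept)
    X = np.column_stack([assign, 1.0 - assign, np.asarray(x_cov, dtype=float)])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    # N - p' - 2 degrees of freedom: N minus the number of columns of X
    sigma2_hat = resid @ resid / (len(y) - X.shape[1])
    L = np.zeros(X.shape[1])
    L[0], L[1] = 1.0, -1.0
    T = (L @ beta_hat) / np.sqrt(sigma2_hat * (L @ xtx_inv @ L))
    return T, bool(abs(T) > z_crit)
```

Replacing the contrast vector `L` with a unit vector that picks out one covariate coefficient gives a test for that coefficient in the same way.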
Testing Significance of a Covariate
To test the significance of a single covariate, without loss of generality,
consider the first covariate,
$$H_0: b_1 = 0 \quad \text{versus} \quad H_A: b_1 \neq 0. \qquad (5)$$
The test statistic for hypothesis testing (5) is
$$T' = \frac{\ell\hat{\beta}}{\left(\hat{\sigma}^2 \ell (X^T X)^{-1} \ell^T\right)^{1/2}}, \qquad (6)$$
where $\ell = (0, 0, 1, 0, \ldots, 0)$. Notice that if $X_1$ is a discrete covariate with
multiple levels $s_1'$, $s_1' > 2$, then we are only able to test $b_{11} = 0$,
where $b_1 = (b_{11}, b_{12}, \ldots, b_{1(s_1'-1)})$.
Reject $H_0$ if $|T'| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th percentile
of the standard normal distribution.
7.4 Properties
Valid and Conservative Test
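A two-sided test $T$ based on the normal distribution is said to be
(asymptotically) valid if, when the null hypothesis holds,
$$\lim_{N\to\infty} \mathrm{pr}(|T| > Z_{1-\alpha/2}) = \alpha,$$
and it is said to be (asymptotically) conservative if there is a constant
$\alpha_0$ such that, when the null hypothesis holds,
$$\lim_{N\to\infty} \mathrm{pr}(|T| > Z_{1-\alpha/2}) = \alpha_0 < \alpha,$$
where $Z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ and $\Phi$ is the c.d.f. of the standard normal
distribution.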
Theorem: Under the linear model (1) and the hypothesis testing
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A: \mu_1 - \mu_2 \neq 0,$$
if a covariate-adaptive design satisfies the following two conditions:
• the overall imbalance is $O_p(1)$,
• the marginal imbalances are $O_p(1)$,
then, under $H_0$, the test statistic $T$ is asymptotically normally
distributed with a variance $\sigma^2 < 1$ unless all $c_j = 0$. Therefore, the test is
conservative unless all $c_j = 0$. Under $H_A$, the test statistic $T$ is
asymptotically normally distributed with a smaller non-centrality
parameter unless all $c_j = 0$. Therefore, the test is less powerful unless
all $c_j = 0$.
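The size of the conservativeness follows directly from the limiting variance: if T is asymptotically N(0, σ²) under H0, then rejecting when |T| > Z_{1−α/2} has asymptotic Type I error 2{1 − Φ(Z_{1−α/2}/σ)}, which is below α whenever σ² < 1. The short sketch below evaluates this expression. The function name `asymptotic_type1_error` and the value σ² = 2/3 used in the example are illustrative assumptions, not quantities stated in the theorem.

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_type1_error(sigma2, alpha=0.05):
    """Limiting rejection rate of |T| > z_{1-alpha/2} when T ~ N(0, sigma2) under H0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - std_normal.cdf(z_crit / sqrt(sigma2)))

# sigma2 = 1 recovers the nominal level; sigma2 < 1 gives a conservative test
nominal = asymptotic_type1_error(1.0)
conservative = asymptotic_type1_error(2 / 3)   # illustrative value only
print(f"{nominal:.4f} {conservative:.4f}")
```

The same calculation shows why omitting randomization covariates with larger effects leads to a more conservative test: a smaller σ² pushes the rejection rate further below α.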
Theorem: Under the same conditions as in Theorem 2 and the
hypothesis testing
$$H_0: b_1 = 0 \quad \text{versus} \quad H_A: b_1 \neq 0,$$
under $H_0$, the test statistic $T'$ is asymptotically normally distributed
with variance 1. Therefore, the test is valid under $H_0$.
However, under $H_A$, the test statistic $T'$ is asymptotically normally
distributed with a smaller non-centrality parameter unless all $c_j = 0$.
Therefore, the test is less powerful unless all $c_j = 0$.
Corollary 1: If $Z$ is not related to $Y$, i.e., all $c_j = 0$ for
$j = 1, 2, \ldots, q$, then the hypothesis tests (3) and (5) are both valid.
Corollary 2: The results in Theorems 2 and 3 hold under the following
designs:
• Pocock and Simon’s marginal procedure;
• Stratified permuted block design;
• The large class of covariate-adaptive designs in Hu and Hu (2012)
and Hu and Zhang (2014).
7.5 Numerical studies: Type I Error and Power
Type I errors are studied by assuming,
Yi = µ1Ii + µ2(1− Ii) + b1Zi,1 + b2Zi,2 + εi,
where
• εi is distributed as N(0, 1).
• b1 = b2 = 1.
• No difference in treatment effect, i.e., µ1 = µ2.
• Discrete case: Z1 ∼ Bernoulli(p1), Z2 ∼ Bernoulli(p2);
Continuous case: Z1 ∼ N(0, 1), Z2 ∼ N(0, 1), discretized at the
p1th (p2th) quantile for use in randomization.
• The biased coin probability 0.75 and equal weights are used for
Pocock and Simon’s marginal procedure, and the block size 4 is
used for stratified permuted block design.
Table 17: Simulated Type I error for Pocock and Simon’s marginal
procedure in %. Simulations based on 10000 runs.
Z           (p1, p2)    N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)
Discrete    (0.5, 0.5)  100  1.75    3.05    3.09    5.21
Discrete    (0.5, 0.5)  200  1.62    2.78    2.86    4.99
Discrete    (0.5, 0.5)  500  1.66    2.81    2.77    4.87
Discrete    (0.5, 0.3)  100  2.02    3.30    3.00    5.04
Discrete    (0.5, 0.3)  200  1.90    3.18    2.99    5.07
Discrete    (0.5, 0.3)  500  1.84    3.25    2.96    5.20
Continuous  (0.5, 0.5)  100  1.43    2.15    2.02    4.98
Continuous  (0.5, 0.5)  200  1.07    1.74    1.80    4.53
Continuous  (0.5, 0.5)  500  0.91    1.72    1.73    4.72
Continuous  (0.5, 0.3)  100  1.35    2.12    1.85    4.95
Continuous  (0.5, 0.3)  200  1.16    2.14    1.83    5.05
Continuous  (0.5, 0.3)  500  1.22    1.95    1.71    4.99
Table 18: Simulated Type I error for complete randomization in %.
Simulations based on 10000 runs.
Z           (p1, p2)    N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)
Discrete    (0.5, 0.5)  100  5.04    5.27    5.11    5.31
Discrete    (0.5, 0.5)  200  5.00    4.95    5.12    5.21
Discrete    (0.5, 0.5)  500  4.73    4.83    4.68    4.77
Discrete    (0.5, 0.3)  100  4.99    4.99    4.68    4.77
Discrete    (0.5, 0.3)  200  5.15    5.03    5.49    5.14
Discrete    (0.5, 0.3)  500  4.82    5.00    4.80    5.13
Continuous  (0.5, 0.5)  100  5.20    5.31    4.82    4.92
Continuous  (0.5, 0.5)  200  5.06    5.14    4.85    5.46
Continuous  (0.5, 0.5)  500  4.87    5.05    4.71    4.77
Continuous  (0.5, 0.3)  100  4.99    4.69    5.11    4.97
Continuous  (0.5, 0.3)  200  5.21    5.24    5.16    4.92
Continuous  (0.5, 0.3)  500  5.14    4.66    5.19    5.15
Table 19: Power comparison (lm(Z1, Z2)) in % for Pocock and Simon’s
marginal procedure (PS) and complete randomization (CR). Simulations
based on 10000 runs and sample sizes N = 32, 64.
µ1 − µ2  CR (N = 32)  PS (N = 32)  CR (N = 64)  PS (N = 64)
0.0      4.96         5.03         5.17         5.08
0.2      7.81         8.51         12.12        12.68
0.4      18.15        19.44        34.46        34.76
0.6      33.96        36.98        63.04        65.53
0.8      53.74        57.28        86.97        87.95
1.0      73.63        77.10        97.02        97.51
7.6 Numerical Studies: Testing of Covariates
It is assumed,
Yi = µ1Ii + µ2(1− Ii) + b1Zi,1 + b2Zi,2 + εi,
where
• εi is distributed as N(0, 1).
• b1 = 0, b2 = 1.
• No difference in treatment effect, i.e., µ1 = µ2.
• Discrete case: Z1 ∼ Bernoulli(p1), Z2 ∼ Bernoulli(p2);
Continuous case: Z1 ∼ N(0, 1), Z2 ∼ N(0, 1), discretized at the
p1th (p2th) quantile for use in randomization.
• The biased coin probability 0.75 and equal weights are used for
Pocock and Simon’s marginal procedure.
Table 20: Simulated Type I error for H0 : b1 = 0 versus HA : b1 ≠ 0 for
Pocock and Simon’s marginal procedure (PS) and complete randomization
(CR) in %. Simulations based on 10000 runs.
(p1, p2)    N    Discrete PS  Discrete CR  Continuous PS  Continuous CR
(0.5, 0.5)  100  4.96         4.98         4.98           4.90
(0.5, 0.5)  200  5.35         5.28         5.14           5.14
(0.5, 0.5)  500  5.55         5.55         5.11           5.15
(0.5, 0.4)  100  4.76         4.84         4.74           4.83
(0.5, 0.4)  200  5.51         5.49         5.12           5.07
(0.5, 0.4)  500  4.80         4.77         5.11           5.02
(0.5, 0.3)  100  4.96         5.07         5.23           5.20
(0.5, 0.3)  200  5.08         5.17         5.00           4.99
(0.5, 0.3)  500  4.95         4.84         5.65           5.64
(0.4, 0.4)  100  5.12         5.15         5.07           5.08
(0.4, 0.4)  200  5.41         5.48         5.20           5.21
(0.4, 0.4)  500  5.19         5.24         5.01           5.08
7.7 General Theory of Statistical Inference
Covariate-adaptive randomization procedures are frequently used in
comparative studies to increase the covariate balance across treatment
groups. However, because the randomization inevitably uses the covariate
information when forming balanced treatment groups, the validity of
classical statistical methods following such randomization is often
unclear.
Ma, Qin, Li and Hu (2019): (i) derive the theoretical properties of
statistical methods based on general covariate-adaptive randomization
under the linear model framework;
(ii) explicitly unveil the relationship between covariate-adaptive and
inference properties by deriving the asymptotic representations of the
corresponding estimators;
(iii) apply the proposed general theory to various randomization
procedures, such as complete randomization (CR), rerandomization
(RR), pairwise sequential randomization (PSR), and Atkinson’s
D_A-biased coin design, and compare their performance analytically;
and (iv) based on the theoretical results, propose a new
approach to obtain valid and more powerful tests.
Personalized medicine raises new challenges for the design of clinical
trials such as:
(1) more covariates (biomarkers) have to be considered, and
(2) particular attention needs to be paid to the interaction between
treatment and covariates (biomarkers).
To design a good clinical trial for personalized medicine, we need new
designs that can match the special features of personalized medicine.
8.1 Optimal design for detecting important
interactions among treatments and
biomarkers
The goal of a conventional clinical trial is to determine if a new
treatment is superior. When designing a clinical trial for precision
medicine, the goal is not limited to just detecting the treatment
difference, but also to identifying biomarkers that predict the
efficacy of treatments.
Therefore, it is important to have a design that can detect the
interaction between treatment and biomarkers efficiently.
8.2 Optimal designs based on both efficiency
and ethics
Clinical trials, because they involve human subjects, require stringent
ethical considerations. To develop personalized medicine, covariate
information plays an important role in the design and analysis of
clinical trials. A challenge is the incorporation of covariate
information in design while still considering issues of both
efficiency and medical ethics (CARA designs).
To address this problem, new designs of clinical trials are needed (Hu,
Zhu and Hu, 2015, JASA).
Denote the efficiency and ethics measurements of two treatments as
d(Z,θ) = (d1(Z,θ), d2(Z,θ)) and e(Z,θ) = (e1(Z,θ), e2(Z,θ)),
respectively. We propose to assign the (m+ 1)th subject to treatment
1 with probability
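$$
\frac{e_1(Z_{m+1}, \hat{\theta}(m))\, d_1^{\gamma}(Z_{m+1}, \hat{\theta}(m))}
     {e_1(Z_{m+1}, \hat{\theta}(m))\, d_1^{\gamma}(Z_{m+1}, \hat{\theta}(m))
      + e_2(Z_{m+1}, \hat{\theta}(m))\, d_2^{\gamma}(Z_{m+1}, \hat{\theta}(m))}.
$$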
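As a small numerical sketch of this allocation rule, the function below takes the ethics and efficiency measures already evaluated at the new subject's covariates and the current parameter estimate. The function name and the example values are hypothetical, and treating γ as a nonnegative tuning constant is an assumption, since its role is not spelled out here.

```python
def allocation_probability(e1, e2, d1, d2, gamma):
    """Probability of assigning the (m+1)th subject to treatment 1.

    e1, e2: ethics measures of treatments 1 and 2.
    d1, d2: efficiency measures of treatments 1 and 2.
    All four are assumed to be evaluated at Z_{m+1} and theta_hat(m).
    gamma: assumed nonnegative tuning constant (exponent on d).
    """
    num = e1 * d1 ** gamma
    return num / (num + e2 * d2 ** gamma)

# Hypothetical values: treatment 1 looks better on ethics, worse on efficiency.
print(allocation_probability(e1=0.6, e2=0.4, d1=0.8, d2=1.2, gamma=1.0))
print(allocation_probability(e1=0.6, e2=0.4, d1=0.8, d2=1.2, gamma=0.0))
```

Under this formula, setting gamma to 0 removes the efficiency terms and the probability reduces to e1 / (e1 + e2).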
8.3 Designs based on predictive biomarkers
Two distinct types of biomarkers in precision medicine:
• Prognostic biomarker: a biomarker that can be used to predict the
most likely prognosis of an individual patient.
• Predictive biomarker: a biomarker that is likely to predict the
response to a specific therapy (treatment).
To develop precision medicine, we need new adaptive designs based on
predictive biomarkers (Hu, Wang and Zhao, 2019).
9 A/B testing under observational data
Table 21: Fictitious data illustrating Simpson’s paradox.
           Control Group (No Drug)           Treatment Group (Took Drug)
           Heart attack   No heart attack    Heart attack   No heart attack
Female          1               19                 3               37
Male           12               28                 8               12
Total          13               47                11               49
Table 22: Fictitious data illustrating Simpson’s paradox.
                      Control Group (No Drug)           Treatment Group (Took Drug)
                      Heart attack   No heart attack    Heart attack   No heart attack
Low blood pressure         1               19                 3               37
High blood pressure       12               28                 8               12
Total                     13               47                11               49
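The reversal in these counts can be checked directly. The short sketch below computes the heart attack rate within each stratum and in the pooled data. The counts are the same in both tables, so the row labels from Table 21 are used, and the variable and function names are illustrative.

```python
# Each entry is (heart attacks, no heart attacks).
control = {"Female": (1, 19), "Male": (12, 28)}
treatment = {"Female": (3, 37), "Male": (8, 12)}

def rate(events, non_events):
    """Proportion of subjects with a heart attack."""
    return events / (events + non_events)

def pooled_rate(group):
    """Heart attack rate after collapsing over the strata."""
    events = sum(e for e, _ in group.values())
    non_events = sum(n for _, n in group.values())
    return rate(events, non_events)

for stratum in control:
    print(stratum, rate(*control[stratum]), rate(*treatment[stratum]))
print("Total", pooled_rate(control), pooled_rate(treatment))
```

Within each stratum the treatment group has the higher heart attack rate (7.5% vs 5% and 40% vs 30%), yet in the pooled data it has the lower rate (11/60 vs 13/60). The two groups have very different stratum sizes, which is what drives the reversal.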
9.2 The real world effectiveness of BNT162b2
and mRNA-1273 COVID-19 Vaccines
Interim Estimates of Vaccine Effectiveness of BNT162b2 and
mRNA-1273 COVID-19 Vaccines in Preventing SARS-CoV-2 Infection
Among Health Care Personnel, First Responders, and Other Essential
and Frontline Workers Eight U.S. Locations, December 2020–March
2021.
10 Some basic principles of designing,
running and analyzing an A/B Test
10.1 Setting up the example
Consider a fictional online commerce site that sells flowers. There is
a wide range of changes we can test: (i) introducing a new feature;
(ii) a change to the user interface (UI); (iii) a back-end change;
(iv) a change of price; and so on.
In this example, the marketing department wants to increase sales by
sending promotional emails that include a coupon code for discounts
on the flowers. There are several concerns: (i) revenue; (ii) cost; and
so on.
We want to evaluate the impact of simply adding a coupon code field.
Our goal is simply to assess the impact on revenue of having this
coupon code field and to evaluate the concern that it will distract
people from checking out.
Online shopping process as a funnel, see Figure.
There are many ways to change the user interface (UI). Here are two
different UIs. See Figure.
Our Hypothesis: Adding a coupon code field to the checkout page will
degrade revenue.
To measure the impact of the change, we need to define goal metrics
(usually difficult to identify).
This experiment: revenue.
Total revenue or revenue-per-user?
Which users to consider in the denominator of the revenue-per-user
metric:
(1) All users who visit the site;
(2) Only users who complete the purchase process;
(3) Only users who start the purchase process.
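Only users who start the purchase process. This is the best choice:
users who never start checkout cannot be affected by the change and
only add noise, while restricting to users who complete a purchase
ignores any effect of the change on whether they complete it.
Refined hypothesis becomes: Adding a coupon code field to the
checkout page will degrade revenue-per-user for users who start the
purchase process.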
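To see how the choice of denominator changes the metric, here is a small sketch on made-up user records. The field names, the helper function and the data are hypothetical and only illustrate the three options listed above.

```python
def revenue_per_user(users, denominator):
    """Revenue-per-user for one of three denominators.

    users: list of dicts with keys 'started', 'completed' (bool) and 'revenue'.
    denominator: 'all', 'started' or 'completed'.
    """
    if denominator == "all":
        base = users
    elif denominator == "started":
        base = [u for u in users if u["started"]]
    elif denominator == "completed":
        base = [u for u in users if u["completed"]]
    else:
        raise ValueError("unknown denominator: " + denominator)
    if not base:
        return float("nan")
    return sum(u["revenue"] for u in base) / len(base)

# Hypothetical data: 6 visitors who never start checkout, 2 who start and
# abandon, and 2 who complete a purchase.
users = (
    [{"started": False, "completed": False, "revenue": 0.0}] * 6
    + [{"started": True, "completed": False, "revenue": 0.0}] * 2
    + [{"started": True, "completed": True, "revenue": 30.0},
       {"started": True, "completed": True, "revenue": 50.0}]
)

for choice in ("all", "started", "completed"):
    print(choice, revenue_per_user(users, choice))
```

On this toy data the same total revenue of 80 gives 8, 20 and 40 per user for the three denominators, so the denominator has to be fixed before the experiment and reported with the metric.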
10.2 Hypothesis testing: establishing statistical
significance
Discussions.
10.3 Designing the experiment
Some aspects:
1) What is the randomization unit?
2) What population of randomization units do we want to target?
3) How large (sample size) does our experiment need to be?
4) How long do we run the experiment?
Our experiment design is now as follows:
1) What is the randomization unit?
A user.
2) What population of randomization units do we want to target?
All users; we analyze those who visit the checkout page.
3) How large (sample size) does our experiment need to be?
To have 80% power to detect at least a 1% change in revenue-per-user,
we will conduct a power analysis to determine the sample size.
4) How long do we run the experiment?
This translates into running the experiment for a minimum of four days
with a 34/33/33% split among Control/Treatment one/Treatment two. We
will run the experiment for a full week to ensure that we understand
the day-of-week effect, and potentially longer if we detect novelty or
primacy effects.
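The power analysis in step 3 can be sketched with the usual normal-approximation formula for comparing two means, n per arm = 2 (z_{1-α/2} + z_{power})² σ² / δ². The baseline mean and standard deviation below are hypothetical; in practice they would come from historical data or an A/A test.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in each arm for a two-sided two-sample z-test.

    sigma: standard deviation of the metric (assumed equal in both arms).
    delta: absolute difference in means that we want to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical metric: mean revenue-per-user 100 with standard deviation 50,
# so a 1% relative change corresponds to delta = 1.
baseline_mean, sigma = 100.0, 50.0
delta = 0.01 * baseline_mean
print(sample_size_per_arm(sigma, delta))
print(16 * sigma ** 2 / delta ** 2)  # common rule of thumb for 80% power
```

Dividing the required number of users by the expected daily traffic per variant gives a lower bound on the duration, which is then rounded up to whole weeks to cover day-of-week effects.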
10.4 Running the experiment and getting data
To run an experiment, we need both:
1) Instrumentation;
2) Infrastructure.
10.5 Interpreting the results
Discussions.
10.6 From results to decision
The goal of running A/B tests is to gather data to drive decision
making. A lot of work goes into ensuring that our results are repeatable
and trustworthy so that we can make the right decision.
Some important aspects:
1) Do you need to make tradeoffs between different metrics?
2) What is the cost of launching this change?
3) What is the downside of making wrong decisions?
You need to make decisions from different results:
Discussions.
11 Twyman’s Law and Experimentation
Trustworthiness
William Anthony Twyman was a UK radio and television audience
measurement veteran (MR Web 2014) credited with formulating
Twyman’s Law, although he apparently never explicitly put it in
writing.
Any statistic that appears interesting is almost certainly a mistake.
by Paul Dickson (1999)
Any figure that looks interesting or different is usually wrong.
by A.S.C. Ehrenberg (1975).
Twyman’s law, perhaps the most important single law in the whole of
data analysis... The more unusual or interesting the data, the more
likely they are to have been the result of an error of one kind or
another.
by Catherine Marsh and Jane Elliott (2009).
11.1 Misinterpretation of the Statistical Results
In Null Hypothesis Significance Testing, we typically assume that
there is no difference in metric value between control and treatment
and reject the hypothesis if the data present strong evidence against
it.
A common mistake is to assume that just because a metric is not
statistically significant, there is no treatment effect.
It could be that the experiment is underpowered to detect the
effect size. An evaluation of 115 A/B tests at GoodUI.org suggests
that most were underpowered.
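A quick way to see what "underpowered" means is to compute the power of a two-sided two-sample z-test for a given sample size. The sketch below uses the normal approximation, and the standard deviation, effect size and sample size are hypothetical numbers chosen only for illustration.

```python
import math
from statistics import NormalDist

def power_two_sample(sigma, delta, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means.

    sigma: standard deviation of the metric in each arm (assumed equal).
    delta: true difference in means between treatment and control.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / (sigma * math.sqrt(2 / n_per_arm))
    # Probability that the test statistic lands in either rejection region.
    return nd.cdf(shift - z_alpha) + nd.cdf(-shift - z_alpha)

# Hypothetical experiment: sigma = 50, true effect delta = 1, 5000 users per arm.
print(power_two_sample(sigma=50.0, delta=1.0, n_per_arm=5000))
```

With these numbers the power is roughly 17%, so a real effect of this size would be missed in most runs, and a non-significant result says very little about whether the effect exists.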
P-value is often misinterpreted.
The p-value is the probability of obtaining a result equal to or more
extreme than what was observed, assuming that the Null hypothesis is
true. The conditioning on the Null hypothesis is critical.
Here are some incorrect statements and explanations from A Dirty
Dozen: Twelve P-Value Misconceptions (Goodman 2008):
• If the p-value = 0.05, the Null hypothesis has only a 5% chance of
being true. Incorrect: the p-value is calculated assuming that the
Null hypothesis is true.
• P-value = 0.05 means that we observed data that would occur
only 5% of the time under the Null hypothesis. This is incorrect by
the definition of p-value above, which includes equal to or more
extreme values than what was observed.
• P-value = 0.05 means that if you reject the Null hypothesis, the
probability of a false positive is only 5%. Incorrect: as with the
first statement, the p-value is computed assuming the Null hypothesis
is true, and the chance that a rejection is a false positive also
depends on how often the Null hypothesis is true and on the power.
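The last misconception can be made concrete with Bayes' rule. The probability that a rejected Null hypothesis is actually true depends on the significance level, the power, and the share of tested ideas for which the Null is true. The sketch below assumes hypothetical values for the power and for that share.

```python
def false_positive_risk(alpha, power, prior_null):
    """P(Null is true | Null was rejected), by Bayes' rule.

    alpha: significance level, P(reject | Null true).
    power: P(reject | Null false).
    prior_null: assumed fraction of experiments in which the Null is true.
    """
    false_pos = alpha * prior_null
    true_pos = power * (1 - prior_null)
    return false_pos / (false_pos + true_pos)

# Hypothetical setting: 90% of tested ideas have no effect, power is 80%.
print(false_positive_risk(alpha=0.05, power=0.80, prior_null=0.9))
# Same test when only half of the ideas have no effect.
print(false_positive_risk(alpha=0.05, power=0.80, prior_null=0.5))
```

In the first setting about 36% of rejections are false positives, far above 5%. In the second the figure is about 6%. The threshold of 0.05 fixes neither number on its own.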
Multiple Hypothesis Tests:
The following story comes from the fun book, What is a p-value
anyway? (Vickers 2009):
• Statistician: Oh, so you have already calculated the p-value?
• Surgeon: Yes, I used multinomial logistic regression.
• Statistician: Really? How did you come up with that?
• Surgeon: I tried each analysis on the statistical software
drop-down menus, and that was the one that gave the smallest
p-value.
The False Discovery Rate (Benjamini and Hochberg 1995) is a key
concept for dealing with multiple tests.
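One standard way to control the false discovery rate is the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p_(k) ≤ k·q/m, and reject the k hypotheses with the smallest p-values. A minimal sketch follows. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Controls the false discovery rate at level q under the usual
    independence assumption of the Benjamini-Hochberg procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that the k-th smallest p-value <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten analyses of the same data set.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(pv < 0.05 for pv in p))   # naive count of "significant" results
print(sum(benjamini_hochberg(p)))   # rejections after FDR control
```

In this example five p-values fall below 0.05, but only the two smallest survive the procedure at q = 0.05.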
Confidence Intervals: Discussion.
11.2 Threats to Internal Validity
• Violations of SUTVA: In the analysis of A/B tests, it is
common to apply the Stable Unit Treatment Value Assumption
(SUTVA) (Imbens and Rubin, 2015), which states that
experiment units (e.g., users) do not interfere with one another.
Their behavior is impacted by their own variant assignment, and
not by the assignment of others. Discussion.
• Survivorship Bias: Discussion.
• Intention-to-Treat: Discussion.
• Sample Ratio Mismatch: Discussion.
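A sample ratio mismatch is commonly checked with a chi-square goodness-of-fit test that compares the observed user counts with the designed split. The sketch below handles two variants, where the statistic has one degree of freedom and the p-value can be written with the complementary error function. The function name and the counts are hypothetical.

```python
import math

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """P-value of a chi-square test that the observed split matches the design.

    expected_ratio: designed share of users in Control (0.5 for a 50/50 split).
    """
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for a designed 50/50 split.
print(srm_p_value(10000, 10500))
print(srm_p_value(10000, 10050))
```

A very small p-value, as in the first call, suggests that the assignment or logging pipeline is dropping or misrouting users, so the experiment results should not be trusted until the cause is found.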
11.3 Threats to External Validity
External validity refers to the extent to which the results of an A/B
test can be generalized along axes such as different populations
(e.g., other countries, other websites) or over time.
Discussion.
12 Analyzing A/B tests
12.1 Two sample T-test
Discussion.
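As a starting point for this discussion, here is a minimal sketch of the two-sample test with unequal variances (Welch's form of the statistic). It assumes samples large enough that the normal distribution can stand in for the t distribution when computing the p-value and the 95% interval. The function name and the data are illustrative.

```python
import math
from statistics import NormalDist, mean, variance

def welch_test(control, treatment):
    """Difference in means, test statistic, two-sided p-value and 95% CI.

    Uses a large-sample normal approximation to the t reference distribution.
    """
    diff = mean(treatment) - mean(control)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    t_stat = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, t_stat, p_value, ci

# Made-up metric values for 100 users in each group.
control = [1, 2, 3, 4, 5] * 20
treatment = [2, 3, 4, 5, 6] * 20
diff, t_stat, p_value, ci = welch_test(control, treatment)
print(diff, t_stat, p_value, ci)
```

For small samples the p-value should come from the t distribution with Welch's degrees of freedom instead of the normal approximation used here.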
12.2 P-value and Confidence Intervals
12.3 Normality Assumption
12.4 Type I/II Errors and Power
12.5 Multiple Testing
12.6 Meta-analysis
13 The A/A Test
Running A/A tests is a critical part of establishing trust
in an experimentation platform. The idea is so useful
because the tests fail many times in practice, which leads
to re-evaluating assumptions and identifying bugs.
A/A tests are the same as A/B tests, but Treatment and Control users
receive identical experiences. You can use A/A tests for several
purposes, such as to:
• Ensure that Type I errors are controlled as expected.
• Assess metrics’ variability.
• Ensure that no bias exists between Treatment and Control users.
• Compare data to the system of record. For example, if the system of
record shows X users visited the website during the experiment and
you ran Control and Treatment at 20% each, do you see around 20% of
X users in each? Are you leaking users?
• Estimate variances for statistical power calculations.
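The first purpose in the list above can be illustrated by simulation: run many A/A tests on data where both groups come from the same distribution and check that about 5% of them are significant at the 0.05 level. The skewed (exponential) metric, the sample sizes and the function names below are assumptions made for this sketch.

```python
import math
import random
from statistics import NormalDist

def mean_and_variance(x):
    """Sample mean and unbiased sample variance."""
    n = len(x)
    m = sum(x) / n
    return m, sum((xi - m) ** 2 for xi in x) / (n - 1)

def aa_false_positive_rate(n_tests=1000, n_users=200, alpha=0.05, seed=1):
    """Fraction of simulated A/A tests with a two-sided p-value below alpha."""
    rng = random.Random(seed)
    norm = NormalDist()
    hits = 0
    for _ in range(n_tests):
        # Both groups receive the identical experience: same distribution.
        a = [rng.expovariate(1.0) for _ in range(n_users)]
        b = [rng.expovariate(1.0) for _ in range(n_users)]
        mean_a, var_a = mean_and_variance(a)
        mean_b, var_b = mean_and_variance(b)
        z = (mean_b - mean_a) / math.sqrt(var_a / n_users + var_b / n_users)
        p_value = 2 * (1 - norm.cdf(abs(z)))
        hits += p_value < alpha
    return hits / n_tests

print(aa_false_positive_rate())
```

On a real platform the same check is run on actual traffic. A rejection rate far from the nominal level, or p-values that are clearly not uniform, points to a problem in assignment, logging or variance estimation.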
14 Long-term treatment effects
Short-term effect (from A/B test) vs Long-term effect (we care about).
There are scenarios where long-term effect is different from the
short-term effect.
Reasons the treatment effect may differ between short-term and
long-term:
• User-learned effects.
• Network effects.
• Delayed experience and measurement.
• Ecosystem changes: launching other new features; seasonality;
competitive landscape; government policies; concept drift; etc.
Some Suggestions:
• Long-running experiments.
• Cohort Study and Analysis.
• Post-Period Analysis (Post-Market).
• Time-Staggered Treatments.
• Holdback and Reverse Experiment.
15 Conclusion and Remarks
The goal of Human Life:
Understanding + Improving The Nature
A/B testing is becoming, and will continue to become, more and more
important in understanding nature and ourselves.
Three Essential Components of Statistics
(Data Science):
Data+Computer and Software+Analytics
A/B test is the essential tool.
Good Design of Experiment (producing useful
data) [Big or Small Data]:
Realistic, Efficient and Ethical.
Statisticians (Data Scientists) are experts in:
• producing useful data (Big or small);
Survey Sampling; Experiment Designs.
• analyzing (Big or small) data to produce meaningful results;
With some possible new statistical methods and computational
skills
• drawing practical conclusions.
The Classical Statistical Framework
(Static):
Real Problem → Data Collection
→ Data Analysis → Decision
To match human intelligence, we may need (Hu, 2016)
The Dynamic Statistical Framework (AI):
Real Problem → Data Collection
→ Data Analysis → Decision
→ + new Data → new Analysis
→ new Decision → · · ·
Producing Useful Data (Design of Experiments) in
Big Data and AI ERA:
MANY New Challenges:
(i) From Static to Dynamic;
(ii) From Independent to Dependent.
A/B tests in Big Data and AI ERA:
MANY New Challenges:
(i) From Static to Dynamic;
(ii) From Independent to Dependent.
The goal of Human Life:
Understanding + Improving The Nature
Statistics (Data Science) is a “GAME” between human
and nature “THROUGH DATA”.
We read the world wrong and say that it
deceives us. (Tagore, )
We read the data wrong and think that
the data deceives us. (Feifang Hu, 2017)
Thank you!
