A/B Testing: Designs and Analysis

Feifang Hu

Department of Statistics

George Washington University

Email: [email protected]

Fall, 2022, Washington, DC, USA

Three Essential Components of Statistics

(Data Science):

Data+Computer+Analytics

1 Introduction

1.1 What is A/B testing?

A/B test is the shorthand for a simple controlled experiment. As the

name implies, two versions (A and B) of a single variable are

compared, which are identical except for one variation that might

affect a user’s behavior. A/B tests are widely considered the simplest

form of controlled experiment. However, adding more variants to

the test makes the experiment more complex.

A/B testing is the process of comparing two variations of a page

element, usually by testing users’ response to variant A vs variant B,

and concluding which of the two variants is more effective.

A/B tests are useful for understanding user engagement and

satisfaction with online features, such as a new feature or product. Large

social media sites like LinkedIn, Facebook, and Instagram use A/B

testing to make user experiences more successful and as a way to

streamline their services.

Today, A/B tests are being used to run more complex experiments,

such as studying network effects when users are offline, how online

services affect user actions, and how users influence one another.

Many roles rely on the data from A/B tests, including data engineers,

marketers, designers, software engineers, and entrepreneurs, because

the tests allow companies to understand growth, increase revenue, and

optimize customer satisfaction.

Version A might be the currently used version (control), while version

B is modified in some respect (treatment). For instance, on an

e-commerce website the purchase funnel is typically a good candidate

for A/B testing, as even marginal decreases in drop-off rates can

represent a significant gain in sales. Significant improvements can

sometimes be seen through testing elements like copy text, layouts,

images and colors, but not always. In these tests, users only see one of

two versions, as the goal is to discover which of the two versions is

preferable.

Controlled experiments have a long and fascinating history. They are

sometimes called A/B tests, A/B/C tests (multiple variants), field

experiments, randomized controlled experiments, split tests, bucket

tests, and flights.

1.2 Online experiments

Example 1. Online A/B testing. (Kohavi and Thomke, 2017,

Harvard Business Review) Microsoft, Amazon, Facebook and Google

conduct more than 10,000 online controlled experiments annually, with

many tests engaging millions of users.

Amazon’s experiment.

Treatment A: Credit card offers on front page.

Treatment B: Credit card offers on the shopping cart page.

This (change from A to B) boosted profits by tens of millions of US

Dollars annually.

1.2.1 A/B Testing in eCommerce Industry

Through A/B testing, online stores can increase the average order

value, optimize their checkout funnel, reduce cart abandonment rate,

and so on. You may try testing: how and where shipping cost is

displayed, whether and how free shipping is highlighted, text and color

tweaks on the payment page or checkout page, the visibility of reviews

or ratings, etc.

In the eCommerce industry, Amazon is at the forefront in conversion

optimization partly due to the scale they operate at and partly due to

their immense dedication to providing the best customer experience.

Amongst the many revolutionary practices they brought to the

eCommerce industry, the most influential one has been their ‘1-Click

Ordering’. Introduced in the late 1990s after much testing and

analysis, 1-Click Ordering lets users make purchases without having to

use the shopping cart at all. Once users enter their default billing card

details and shipping address, all they need to do is click on the button

and wait for the ordered products to get delivered. Users don’t have to

enter their billing and shipping details again while placing any orders.

With the 1-Click Ordering, it became impossible for users to ignore the

ease of purchase and go to another store. This change had such a

huge business impact that Amazon got it patented (now expired) in

1999. In fact, in 2000, even Apple bought a license for the same to be

used in their online store.

People working to optimize Amazon’s website do not have sudden

‘Eureka’ moments for every change they make. It is through

continuous and structured A/B testing that Amazon is able to deliver

the kind of user experience that it does. Every change on the website

is first tested on their audience and then deployed. If you look

closely at Amazon’s purchase funnel, you will see that even though

the funnel more or less replicates other websites’ purchase funnels,

each and every element in it is fully optimized and matches the

audience’s expectations.

Every page, starting from the homepage to the payment page, only

contains the essential details and leads to the exact next step required

to push the users further into the conversion funnel. Additionally, using

extensive user insights and website data, each step is simplified as

far as possible to match users’ expectations.

Take their omnipresent shopping cart, for example. There is a small

cart icon at the top right of Amazon’s homepage that stays visible no

matter which page of the website you are on.

The icon is not just a shortcut to the cart or reminder for added

products. In its current version, it offers 5 options:

(i) Continue shopping (if there are no products added to the cart)

(ii) Learn about today’s deals (if there are no products added to the

cart)

(iii) Wish List (if there are no products added to the cart)

(iv) empty cart

(v) Proceed to checkout (when there are products in the cart). Sign in

to turn on 1-Click Checkout (when there are products in the cart).

With one click on the tiny icon offering so many options, the user’s

cognitive load is reduced, and they have a great user experience. As

can be seen in the above screenshot, the same cart page also suggests

similar products so that customers can navigate back into the website

and continue shopping. All this is achieved with one weapon: A/B

Testing.

1.2.2 A/B Testing in Travel Industry

Increase the number of successful bookings on your website or mobile

app, your revenue from ancillary purchases, and much more through

A/B testing. You may try testing your home page search modals,

search results page, ancillary product presentation, your checkout

progress bar, and so on.

In the travel industry, Booking.com easily surpasses all other

eCommerce businesses when it comes to using A/B testing for their

optimization needs. They test like it’s nobody’s business. From the

day of its inception, Booking.com has treated A/B testing as the

treadmill that introduces a flywheel effect for revenue. The scale at

which Booking.com A/B tests is unmatched, especially when it comes

to testing their copy. While you are reading this, there are nearly 1000

A/B tests running on Booking.com’s website.

Even though Booking.com has been A/B testing for more than a

decade now, they still think there is more that they can do to improve

user experience. And this is what makes Booking.com the ace in the

game. Since the company started, Booking.com incorporated A/B

testing into its everyday work process. They have increased their

testing velocity to its current rate by eliminating HiPPOs (decisions

driven by the highest-paid person’s opinion) and giving priority to

data before anything else. To increase testing velocity even further,

all of Booking.com’s employees were allowed to run tests on ideas they

thought could help grow the business.

This example will demonstrate the lengths to which Booking.com can

go to optimize their users’ interaction with the website. Booking.com

decided to broaden its reach in 2017 by offering rental properties for

vacations alongside hotels. This led to Booking.com partnering with

Outbrain, a native advertising platform, to help grow their global

property owner registration.

Within the first few days of the launch, the team at Booking.com

realized that even though a lot of property owners completed the first

sign-up step, they got stuck in the next steps. At this time, pages built

for the paid search of their native campaigns were used for the sign-up

process.

The two teams decided to work together and created three versions of

landing page copy for Booking.com. Additional details like social

proof, awards, and recognitions, user rewards, etc. were added to the

variations.

The test ran for two weeks and produced a 25% uplift in owner

registration. The test results also showed a significant decrease in the

cost of each registration.

1.2.3 A/B Testing in B2B/SaaS Industry

Generate high-quality leads for your sales team, increase the number of

free trial requests, attract your target buyers, and perform other such

actions by testing and polishing important elements of your demand

generation engine. To get to these goals, marketing teams put up the

most relevant content on their website, send out ads to prospective

buyers, conduct webinars, put up special sales, and much more. But

all their effort would go to waste if the landing page which clients are

directed to is not fully optimized to give the best user experience. The

aim of SaaS (Software as a service) A/B testing is to provide the best

user experience and to improve conversions. You can try testing your

lead form components, free trial sign-up flow, homepage messaging,

CTA text, social proof on the home page, and so on.

POSist, a leading SaaS-based restaurant management platform with

more than 5,000 customers at over 100 locations across six countries,

wanted to increase their demo requests. Their website homepage and

Contact Us page are the most important pages in their funnel. The

team at POSist wanted to reduce drop-off on these pages. To achieve

this, the team created two variations of the homepage as well as two

variations of the Contact Us page to be tested. Let’s take a look at the

changes made to the homepage. This is what the control looked like:

The team at POSist hypothesized that adding more relevant and

conversion-focused content to the website will improve user

experience, as well as generate higher conversions. So they created

two variations to be tested against the control.

Control was first tested against Variation 1, and the winner was

Variation 1. To further improve the page, Variation 1 was then

tested against Variation 2, and the winner was Variation 2. The new

variation increased page visits by about 5%.

1.3 Clinical trials

Example 2. HIV transmission. Connor et al. (1994, The New

England Journal of Medicine) report a clinical trial to evaluate the

drug AZT in reducing the risk of maternal-infant HIV transmission.

A 50-50 randomization scheme was used:

• AZT Group—239 pregnant women (20 HIV positive infants).

• placebo group—238 pregnant women (60 HIV positive

infants).
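As a rough illustration of how these counts might be analysed, the sketch below runs a pooled two-proportion z-test on the figures from the slide (20 of 239 versus 60 of 238). The function name `two_proportion_z_test` is hypothetical, and the pooled z-test is only one reasonable choice of analysis, not necessarily the one used by Connor et al.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test of H0: p1 == p2 for two independent groups."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value

# Counts taken from the slide: AZT group, then placebo group.
p_azt, p_placebo, z, p_value = two_proportion_z_test(20, 239, 60, 238)
print(f"AZT transmission rate:     {p_azt:.3f}")
print(f"Placebo transmission rate: {p_placebo:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

The same function applies unchanged to an online A/B test, with conversions and visitors in place of infections and patients.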

Given the seriousness of the outcome of this study, it is reasonable to

argue that 50-50 allocation was unethical. As accruing information

favoring (albeit, not conclusively) the AZT treatment became

available, allocation probabilities should have been shifted away from

50-50 toward an allocation proportional to the weight of evidence for

AZT. Designs which attempt to do this are called Response-Adaptive

designs (Response-Adaptive Randomization).

If the treatment assignments had been made with the doubly adaptive

biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics)

with the urn target allocation:

• AZT Group— 360 patients

• placebo group—117 patients

then only 60 (instead of 80) infants would be expected to be HIV positive.
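The arithmetic behind the 60-versus-80 comparison can be checked with a short sketch. It assumes the observed transmission rates (20/239 and 60/238) would have held under any allocation. It also assumes the urn target puts a share q_placebo / (q_AZT + q_placebo) of patients on AZT, where q denotes a failure rate. That formula is not stated on the slide, so treat it as an assumption that happens to land near the 360/117 split.

```python
def expected_positives(n_azt, n_placebo, q_azt, q_placebo):
    """Expected number of HIV-positive infants for a given allocation."""
    return n_azt * q_azt + n_placebo * q_placebo

# Observed failure (transmission) rates from the trial.
q_azt = 20 / 239
q_placebo = 60 / 238
n_total = 239 + 238

# Assumed urn target: share of patients allocated to AZT.
rho_azt = q_placebo / (q_azt + q_placebo)
target_azt = rho_azt * n_total

observed = expected_positives(239, 238, q_azt, q_placebo)
adaptive = expected_positives(360, 117, q_azt, q_placebo)

print(f"Target AZT group size under the assumed urn target: {target_azt:.0f}")
print(f"Expected positives, 50-50 allocation:   {observed:.1f}")
print(f"Expected positives, 360/117 allocation: {adaptive:.1f}")
```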

Example 3: Remdesivir-COVID-19 trial (China). The Remdesivir

in adults with severe COVID-19 trial (Wang et al. 2020) was a

randomized, double-blind, placebo-controlled, multicentre trial that

aimed to compare Remdesivir with placebo. There were 236 patients

in the trial. There were about 20 baseline covariates for each patient,

including 10 continuous variables (e.g. age and white blood cell

count) and 10 discrete variables (e.g. gender and hypertension). A

stratified (according to the level of respiratory support) permuted

block (30 patients per block) randomization procedure was

implemented. Even so, some important imbalances

existed at enrollment between the groups, including more patients with

hypertension, diabetes, or coronary artery disease in the Remdesivir

group than the placebo group.
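A minimal sketch of stratified permuted block randomization may help fix ideas. It assumes 1:1 allocation within each block and uses two made-up strata labels. The slide gives only the block size (30) and the stratification variable, so the allocation ratio, the labels, and the function names below are illustrative assumptions.

```python
import random

def permuted_block(block_size, rng, arms=("treatment", "placebo")):
    """Return one block with equal numbers per arm, in random order."""
    per_arm = block_size // len(arms)
    block = [arm for arm in arms for _ in range(per_arm)]
    rng.shuffle(block)
    return block

def stratified_permuted_block(strata, block_size=30, seed=2020):
    """Assign patients in enrollment order, using separate blocks per stratum.

    strata: list of stratum labels, one per patient.
    """
    rng = random.Random(seed)
    queues = {}       # stratum label -> unused assignments in its current block
    assignments = []
    for stratum in strata:
        queue = queues.setdefault(stratum, [])
        if not queue:  # current block exhausted: open a new permuted block
            queue.extend(permuted_block(block_size, rng))
        assignments.append(queue.pop())
    return assignments

# Hypothetical enrollment sequence of 236 patients in two strata.
strata = ["high-flow oxygen" if i % 3 == 0 else "low-flow oxygen"
          for i in range(236)]
assignments = stratified_permuted_block(strata)
print(assignments[:6])
```

Blocking balances the arms on the stratification variable only. Covariates that are not used to form strata, such as hypertension or diabetes here, can still end up imbalanced by chance.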

Example 4: Moderna COVID-19 vaccine trial (2020). The

trial began on July 27, 2020, and enrolled 30,420 adult volunteers at

clinical research sites across the United States. Volunteers were

randomly assigned 1:1 to receive either two 100 microgram (mcg)

doses of the investigational vaccine or two shots of saline placebo 28

days apart. The average age of volunteers is 51 years. Approximately

47% are female, 25% are 65 years or older and 17% are under the age

of 65 with medical conditions placing them at higher risk for severe

COVID-19. Approximately 79% of participants are white, 10% are

Black or African American, 5% are Asian, 0.8% are American Indian or

Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%

are multiracial, and 21% (of any race) are Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators

recorded 196 cases of symptomatic COVID-19 occurring among

participants at least 14 days after they received their second shot. One

hundred and eighty-five cases (30 of which were classified as severe

COVID-19) occurred in the placebo group and 11 cases (0 of which

were classified as severe COVID-19) occurred in the group receiving

mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%

lower in those participants who received mRNA-1273 as compared to

those receiving placebo.
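The 94.1% figure can be reproduced with a crude calculation: efficacy is one minus the ratio of attack rates in the two arms. The sketch below assumes equal group sizes and equal follow-up under the 1:1 randomization, so the denominators cancel. The trial's official estimate adjusts for person-time, which this simplified version ignores.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo, n_vaccine=1, n_placebo=1):
    """Crude efficacy: 1 - (attack rate in vaccine arm / attack rate in placebo arm)."""
    risk_vaccine = cases_vaccine / n_vaccine
    risk_placebo = cases_placebo / n_placebo
    return 1 - risk_vaccine / risk_placebo

# Primary analysis counts from the slide: 11 vaccine cases, 185 placebo cases.
# Equal arm sizes are assumed, so the default denominators of 1 are enough.
ve_primary = vaccine_efficacy(11, 185)
print(f"Crude vaccine efficacy: {ve_primary:.1%}")
```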

Investigators observed 236 cases of symptomatic COVID-19 among

participants at least 14 days after they received their first shot, with

225 cases in the placebo group and 11 cases in the group receiving

mRNA-1273. The vaccine efficacy was 95.2% for this secondary

analysis.

Long-term Treatment Effects?

1.4 Economics and Social Science

Political A/B testing

A/B tests are used not only by corporations but also by political

campaigns. In 2007, Barack Obama’s presidential campaign

used A/B testing as a way to attract online interest and understand

what voters wanted to see from the presidential candidate. For

example, Obama’s team tested four distinct buttons on their website

that led users to sign up for newsletters. Additionally, the team used

six different accompanying images to draw in users. Through A/B

testing, staffers were able to determine how to effectively draw in

voters and garner additional interest.

Example 5. The Project GATE (Growing America Through

Entrepreneurship), sponsored by the U.S. Department of Labor, was

designed to evaluate the impact of offering tuition-free

entrepreneurship training services (GATE services) on helping clients

create, sustain or expand their own business.

(https://www.doleta.gov/reports/projectgate/)

The cornerstone is complete randomization. Members of the

treatment group were offered GATE services; members of the control

group were not.

• n = 4,198 participants

• p = 105 covariates

1.5 Biological, psychological, and agricultural

research

Controlled experiments were mainly developed in these areas in

1900-1950.

Road Map of this course:

(i) The history of experiment design;

(ii) A/B testing in medical studies;

(iii) Online controlled experiments (A/B testing).

2 The history of experiment design

2.1 Experiment design before Fisher

Statistical experiments, following Charles S. Peirce

A theory of statistical inference was developed by Charles S. Peirce in

“Illustrations of the Logic of Science” (1877–1878) and

“A Theory of Probable Inference”

(1883), two publications that emphasized the importance of

randomization-based inference in statistics.

Randomized experiments: Charles S. Peirce randomly assigned

volunteers to a blinded, repeated-measures design to evaluate their

ability to discriminate weights. Peirce’s experiment inspired other

researchers in psychology and education, who developed a research

tradition of randomized experiments in laboratories and specialized

textbooks in the 1800s.

Optimal designs for regression models:

Charles S. Peirce also contributed the first English-language

publication on an optimal design for regression models in 1876. A

pioneering optimal design for polynomial regression was suggested by

Gergonne in 1815. In 1918, Kirstine Smith published optimal designs

for polynomials of degree six (and less).

2.2 Fisher’s principles

A methodology for designing experiments was proposed by Ronald

Fisher in his innovative works The Arrangement of Field Experiments

(1926) and The Design of Experiments (1935). Much of his pioneering

work dealt with agricultural applications of statistical methods. As a

mundane example, he described how to test the lady tasting tea

hypothesis, that a certain lady could distinguish by flavour alone

whether the milk or the tea was first placed in the cup. These

methods have been broadly adapted in biological, psychological, and

agricultural research.

2.2.1 Comparison

In some fields of study it is not possible to have independent

measurements to a traceable metrology standard. Comparisons

between treatments are much more valuable and are usually preferable,

and treatments are often compared against a scientific control or

traditional treatment that acts as a baseline.

2.2.2 Randomization

Random assignment is the process of assigning individuals at random

to the groups of an experiment, so that each individual has the same

chance of being placed in any given group. The random assignment of

individuals to groups (or conditions within a group) distinguishes a

rigorous, “true” experiment from an observational study or

“quasi-experiment”. There

is an extensive body of mathematical theory that explores the

consequences of making the allocation of units to treatments by means

of some random mechanism (such as tables of random numbers, or the

use of randomization devices such as playing cards or dice). Assigning

units to treatments at random tends to mitigate confounding, which

makes effects due to factors other than the treatment appear to

result from the treatment.
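In software, the random mechanism is usually a pseudo-random number generator rather than cards or dice. The sketch below shows one simple way to do it: shuffle the units and deal them out to the arms in turn. The function name `randomize` and the arm labels are illustrative, and production A/B systems often use other mechanisms such as hashing user identifiers.

```python
import random

def randomize(units, arms=("A", "B"), seed=None):
    """Randomly split units into arms whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the arms in turn.
    return {arm: shuffled[i::len(arms)] for i, arm in enumerate(arms)}

# Twenty hypothetical experimental units, labelled 1..20.
groups = randomize(range(1, 21), seed=42)
print(groups["A"])
print(groups["B"])
```

Because the shuffle does not look at any characteristic of the units, known and unknown confounders are balanced across arms on average.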

The risks associated with random allocation (such as having a serious

imbalance in a key characteristic between a treatment group and a

control group) are calculable and hence can be managed down to an

acceptable level by using enough experimental units. However, if the

population is divided into several subpopulations that somehow differ,

and the research requires each subpopulation to be equal in size,

stratified sampling can be used. In that way, the units in each

subpopulation are randomized, but not the whole sample. The results

of an experiment can be generalized reliably from the experimental

units to a larger statistical population of units only if the experimental

units are a random sample from the larger population; the probable

error of such an extrapolation depends on the sample size, among

other things.

2.2.3 Statistical replication

Measurements are usually subject to variation and measurement

uncertainty; thus they are repeated and full experiments are replicated

to help identify the sources of variation, to better estimate the true

effects of treatments, to further strengthen the experiment’s reliability

and validity, and to add to the existing knowledge of the topic.

However, certain conditions must be met before the replication of the

experiment is commenced: the original research question has been

published in a peer-reviewed journal or widely cited, the researcher is

independent of the original experiment, the researcher must first try to

replicate the original findings using the original data, and the write-up

should state that the study conducted is a replication study that tried

to follow the original study as strictly as possible.

2.2.4 Blocking

Blocking is the non-random arrangement of experimental units into

groups (blocks) consisting of units that are similar to one another.

Blocking reduces known but irrelevant sources of variation between

units and thus allows greater precision in the estimation of the source

of variation under study.

2.2.5 Orthogonality

Orthogonality concerns the forms of comparison (contrasts) that can

be legitimately and efficiently carried out. Contrasts can be

represented by vectors and sets of orthogonal contrasts are

uncorrelated and independently distributed if the data are normal.

Because of this independence, each orthogonal contrast provides

different information from the others. If there are T treatments and T−1

orthogonal contrasts, all the information that can be captured from

the experiment is obtainable from the set of contrasts.

Example 2.1. Measurement Error: We would like to measure

the weight of a subject A by using a scale. We know that the scale has

a measurement error. Suppose that the error follows a normal distribution

with mean 0 and variance $\sigma^2$. Mathematically, we may write:

$w_1 = A + e_1$,

where $A$ is the true weight, $w_1$ is the observed weight and $e_1$ is the

measurement error.

Figure 1: A scale to measure subject A

Now we would like to measure the weights of two subjects A and B by

using the same scale twice. What should we do?

Method 1:

$w_1 = A + e_1$ and $w_2 = B + e_2$.

Figure 2: Subject B

Method 2:

$w_3 = A + B + e_3$ and $w_4 = A - B + e_4$.

Figure 3: A + B

Figure 4: A - B

The measurement errors:

Method 1:

Subject A: $e_1 \sim N(0, \sigma^2)$.

Subject B: $e_2 \sim N(0, \sigma^2)$.

Method 2:

Subject A: $(e_3 + e_4)/2 \sim N(0, \sigma^2/2)$.

Subject B: $(e_3 - e_4)/2 \sim N(0, \sigma^2/2)$.
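To see where the halved variance comes from, note that Method 2 estimates A by averaging the sum and difference measurements: (w3 + w4)/2 = A + (e3 + e4)/2. If the two errors are independent with variance σ², this estimate has variance (σ² + σ²)/4 = σ²/2, and the same argument applies to B with (w3 − w4)/2. The simulation below is a sketch that checks this numerically. The true weights, σ, and the function name are arbitrary choices for illustration.

```python
import random
import statistics

def simulate_weighing(a=70.0, b=55.0, sigma=1.0, reps=20000, seed=0):
    """Compare the variance of the weight estimates under Methods 1 and 2."""
    rng = random.Random(seed)
    est = {"m1_A": [], "m1_B": [], "m2_A": [], "m2_B": []}
    for _ in range(reps):
        e1, e2, e3, e4 = (rng.gauss(0.0, sigma) for _ in range(4))
        w1, w2 = a + e1, b + e2            # Method 1: weigh A and B separately
        w3, w4 = a + b + e3, a - b + e4    # Method 2: measure A+B and A-B
        est["m1_A"].append(w1)
        est["m1_B"].append(w2)
        est["m2_A"].append((w3 + w4) / 2)
        est["m2_B"].append((w3 - w4) / 2)
    return {name: statistics.variance(values) for name, values in est.items()}

variances = simulate_weighing()
for name, value in variances.items():
    print(f"{name}: variance = {value:.3f}")
```

Both methods use the scale exactly twice, so the gain in precision under Method 2 comes purely from the design.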

Factorial experiments can be used instead of the one-factor-at-a-time

method. These are efficient at evaluating the effects and possible

interactions of several factors (independent variables). Analysis of

experiment design is built on the foundation of the analysis of

variance, a collection of models that partition the observed variance

into components, according to what factors the experiment must

estimate or test.
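A full factorial design simply crosses every level of every factor, which is easy to enumerate in code. The sketch below uses two made-up factors for a web page. The factor names, levels, and function name are hypothetical and serve only to show the structure.

```python
from itertools import product

def full_factorial(factors):
    """List every combination of factor levels as a dict (one per design cell)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

# Hypothetical factors: 2 button texts x 3 images = 6 design cells.
factors = {
    "button_text": ["Sign up", "Learn more"],
    "image": ["family", "rally", "portrait"],
}
design = full_factorial(factors)
for cell in design:
    print(cell)
```

Because every button text appears with every image, the data can estimate both main effects and their interaction, which a one-factor-at-a-time plan cannot do.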

2.2.6 Avoiding false positives

False positive conclusions, often resulting from the pressure to publish

or the author’s own confirmation bias, are an inherent hazard in many

fields. A good way to prevent biases potentially leading to false

positives in the data collection phase is to use a double-blind design.

When a double-blind design is used, participants are randomly assigned

to experimental groups but the researcher is unaware of which

participants belong to which group. Therefore, the researcher cannot

affect the participants’ response to the intervention.

Experimental designs with undisclosed degrees of freedom are a

problem. This can lead to conscious or unconscious “p-hacking”:

trying multiple things until you get the desired result. It typically

involves the manipulation – perhaps unconsciously – of the process of

statistical analysis and the degrees of freedom until they return a

figure below the p < .05 level of statistical significance.
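A small calculation shows why trying multiple things inflates false positives. The sketch below assumes k independent tests, each run at level α, with no true effect in any of them. Under those simplifying assumptions the chance of at least one result below the threshold is 1 − (1 − α)^k. Real analyses are rarely independent, so this is an idealized illustration.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(f"{k:2d} tests -> P(at least one p < .05) = {familywise_error_rate(k):.3f}")
```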

So the design of the experiment should include a clear statement

proposing the analyses to be undertaken. P-hacking can be prevented

by preregistering studies, in which researchers have to send their

data analysis plan to the journal they wish to publish their paper in

before they even start their data collection, so no data manipulation is

possible.

Another way to prevent this is taking the double-blind design to the

data-analysis phase, where the data are sent to a data-analyst

unrelated to the research who scrambles up the data so there is no

way to know which group participants belong to before they are

potentially taken away as outliers.

2.2.7 Causal attributions

In the pure experimental design, the independent (predictor) variable is

manipulated by the researcher – that is – every participant of the

research is chosen randomly from the population, and each participant

chosen is assigned randomly to conditions of the independent variable.

Only when this is done is it possible to certify with high probability

that the differences in the outcome variables are caused

by the different conditions. Therefore, researchers should choose the

experimental design over other design types whenever possible.

However, the nature of the independent variable does not always allow

for manipulation. In those cases, researchers must take care not to

claim causal attribution when their design doesn’t allow for

it. For example, in observational designs, participants are not assigned

randomly to conditions, and so if there are differences found in

outcome variables between conditions, it is likely that there is

something other than the differences between the conditions that

causes the differences in outcomes, that is – a third variable. The same

goes for studies with correlational design (Adér and Mellenbergh, 2008).

2.2.8 Statistical control

It is best that a process be in reasonable statistical control prior to

conducting designed experiments. When this is not possible, proper

blocking, replication, and randomization allow for the careful conduct

of designed experiments. To control for nuisance variables, researchers

institute control checks as additional measures. Investigators should

ensure that uncontrolled influences (e.g., source credibility perception)

do not skew the findings of the study. A manipulation check is one

example of a control check. Manipulation checks allow investigators to

isolate the chief variables to strengthen support that these variables

are operating as planned.

One of the most important requirements of experimental research

designs is the necessity of eliminating the effects of spurious,

intervening, and antecedent variables. In the most basic model, cause

(X) leads to effect (Y). But there could be a third variable (Z) that

influences (Y), and X might not be the true cause at all. Z is said to

be a spurious variable and must be controlled for. The same is true for

intervening variables (a variable in between the supposed cause (X)

and the effect (Y)), and anteceding variables (a variable prior to the

supposed cause (X) that is the true cause). When a third variable is

involved and has not been controlled for, the relation is said to be a

zero order relationship. In most practical applications of experimental

research designs there are several causes (X1, X2, X3). In most

designs, only one of these causes is manipulated at a time.

2.3 Experimental designs after Fisher

Some efficient designs for estimating several main effects were found

independently and in near succession by Raj Chandra Bose and K.

Kishen in 1940 at the Indian Statistical Institute, but remained little

known until the Plackett–Burman designs were published in

Biometrika in 1946. About the same time, C. R. Rao introduced the

concepts of orthogonal arrays as experimental designs. This concept

played a central role in the development of Taguchi methods by

Genichi Taguchi, which took place during his visit to the Indian Statistical

Institute in the early 1950s. His methods were successfully applied and

adopted by Japanese and Indian industries and subsequently were also

embraced by US industry albeit with some reservations.

In 1950, Gertrude Mary Cox and William Gemmell Cochran published

the book Experimental Designs, which became the major reference

work on the design of experiments for statisticians for years afterwards.

Developments of the theory of linear models have encompassed and

surpassed the cases that concerned early writers. Today, the theory

rests on advanced topics in linear algebra, algebra and combinatorics.

As with other branches of statistics, experimental design is pursued

using both frequentist and Bayesian approaches: In evaluating

statistical procedures like experimental designs, frequentist statistics

studies the sampling distribution while Bayesian statistics updates a

probability distribution on the parameter space.

Some important contributors to the field of experimental designs are

C. S. Peirce, R. A. Fisher, F. Yates, R. C. Bose, A. C. Atkinson, R. A.

Bailey, D. R. Cox, G. E. P. Box, W. G. Cochran, W. T. Federer, V. V.

Fedorov, A. S. Hedayat, J. Kiefer, O. Kempthorne, J. A. Nelder,

Andrej Pázman, Friedrich Pukelsheim, D. Raghavarao, C. R. Rao,

S. S. Shrikhande, J. N. Srivastava, William J. Studden, G. Taguchi

and H. P. Wynn.

The textbooks of D. Montgomery, R. Myers, and G. Box/W.

Hunter/J.S. Hunter have reached generations of students and

practitioners.

Some discussion of experimental design in the context of system

identification (model building for static or dynamic models) is given

in [35] and [36].

2.4 Sequences of experiments

The use of a sequence of experiments, where the design of each may

depend on the results of previous experiments, including the possible

decision to stop experimenting, is within the scope of sequential

analysis, a field that was pioneered by Abraham Wald in the context of

sequential tests of statistical hypotheses. Herman Chernoff wrote an

overview of optimal sequential designs, while adaptive designs have

been surveyed by S. Zacks. One specific type of sequential design is

the “two-armed bandit”, generalized to the multi-armed bandit, on

which early work was done by Herbert Robbins in 1952.

2.5 Human participant constraints

Laws and ethical considerations preclude some carefully designed

experiments with human subjects. Legal constraints are dependent on

jurisdiction. Constraints may involve institutional review boards,

informed consent and confidentiality affecting both clinical (medical)

trials and behavioral and social science experiments.[37] In the field of

toxicology, for example, experimentation is performed on laboratory

animals with the goal of defining safe exposure limits for humans.

Balancing the constraints are views from the medical field.[39]

Regarding the randomization of patients, "... if no one knows which

therapy is better, there is no ethical imperative to use one therapy or

another." (p 380) Regarding experimental design, "...it is clearly not

ethical to place subjects at risk to collect data in a poorly designed

study when this situation can be easily avoided...". (p 393)

2.6 Some important issues in designing experiments

Clear and complete documentation of the experimental methodology is

also important in order to support replication of results.

Discussion topics when setting up an experimental design. An

experimental design or randomized clinical trial requires careful

consideration of several factors before actually doing the experiment.

An experimental design is the laying out of a detailed experimental

plan in advance of doing the experiment. Some of the following topics

have already been discussed in the principles of experimental design

section:

1) How many factors does the design have, and are the levels of these

factors fixed or random?

2) Are control conditions needed, and what should they be?

3) Manipulation checks; did the manipulation really work?

4) What are the background variables?

5) What is the sample size? How many units must be collected for the

experiment to be generalisable and have enough power?

6) What is the relevance of interactions between factors?

7) What is the influence of delayed effects of substantive factors on

outcomes?

8) How do response shifts affect self-report measures?

9) How feasible is repeated administration of the same measurement

instruments to the same units at different occasions, with a post-test

and follow-up tests?

10) What about using a proxy pretest?

11) Are there lurking variables?

12) Should the client/patient, researcher or even the analyst of the

data be blind to conditions?

13) What is the feasibility of subsequent application of different

conditions to the same units?

14) How many control and noise factors should be taken into

account?

15) How to deal with missing values?

16) What are the good matrices?

........

The independent variable of a study often has many levels or different

groups. In a true experiment, researchers can have an experimental

group, which is where their intervention testing the hypothesis is

implemented, and a control group, which has all the same elements as

the experimental group, without the interventional element. Thus,

when everything else except for one intervention is held constant,

researchers can conclude with some confidence that this one element is

what caused the observed change. In some instances, having a control

group is not ethical. This is sometimes solved using two different

experimental groups. In some cases, independent variables cannot be

manipulated, for example when testing the difference between two

groups who have a different disease, or testing the difference between

genders (obviously variables that would be hard or unethical to assign

participants to). In these cases, a quasi-experimental design may be

used.

3 A/B tests (Randomized Control Studies) in clinical trials

3.1 Drug development

Drug development is a complex and lengthy process that takes 7 to 15

years for a single drug at a cost that may reach hundreds of millions of

dollars. There are three main parts of the drug development process:

• Discovery and decision;

• Preclinical studies;

• Clinical studies.

Discovery and Decision

The process starts with the discovery of a new compound or of a new

potential application of an existing compound. Based on adequate

results, the decision whether to develop the drug is then made.

Preclinical Studies

The initial toxicology of the compound is studied in animals. Initial

formulation development for the drug and specific or comprehensive

pharmacological studies in animals are also performed at this stage. At

the end of preclinical study, the evidence of potential safety and

effectiveness of the drug is assessed by the company.

To proceed further, a US-based company needs to file a Notice of

Claimed Investigational New Drug Exemption (to allow the company

to conduct studies on human subjects).

Clinical Studies

Once there is sufficient evidence that the drug will be of benefit to

human subjects, testing the drug in human subjects is the next step.

Phase I clinical trial: To establish the initial safety information

about the effect of the drug on humans, such as the range of acceptable

dosages and the pharmacokinetics of the drug. These studies are

normally conducted with healthy volunteers. The number of subjects

typically varies between 4 and 20 per study, with up to 100 subjects in

total used over the course of Phase I trials.

Phase II clinical trial: These studies are conducted with patients

who will potentially benefit from the new drug. Effective dose ranges

and initial effects of the drug on these patients are assessed. Up to

several hundred patients are usually selected in Phase II trials.

Phase III clinical trial: Phase III studies provide assessment of

safety, efficacy, and optimum dosage. These studies are designed with

controls and treatment groups. Usually hundreds or even thousands of

patients are involved in Phase III trials.

Based on successful results obtained from these studies, the company

can then submit an NDA (New Drug Application). The application

contains the results from all three stages (from discovery to Phase III)

and is reviewed by the FDA.

The FDA review panel of the NDA consists of reviewers in the

following areas: medicine, pharmacology, biopharmaceutics, chemistry,

and statistics.

Phase IV: Postmarket activities. Follow-up studies are conducted

to examine the long-term effects of the drug. The main purpose of

these studies is to ensure that all claims made by the company about

the new drug can be substantiated by so-called "clinical evidence". All

reported adverse effects must also be investigated by the company and

in some cases, the drug may need to be withdrawn from the market.

Statistician’s Responsibilities:

• Participate in the development plan for studying a drug.

• Study design and protocol development. Randomization schemes.

• Data cleaning and database construction format.

• Analysis plan and program development for analysis.

• Report preparation. Produce tables and figures.

• Integrate clinical study results, safety and efficacy reports.

• Communication and NDA defense to FDA review panel.

• Publication support and consulting with other company personnel.

Example 3.1. HIV transmission. Connor et al. (1994, The New

England Journal of Medicine) report a clinical trial to evaluate the

drug AZT in reducing the risk of maternal-infant HIV transmission.

A 50-50 randomization scheme was used:

• AZT Group (A)—239 pregnant women (20 HIV positive

infants).

• placebo group (B)—238 pregnant women (60 HIV positive

infants).

Given the seriousness of the outcome of this study, it is reasonable to

argue that 50-50 allocation was unethical. As accruing information

favoring (albeit, not conclusively) the AZT treatment became

available, allocation probabilities should have been shifted away from

50-50 allocation, in proportion to the weight of evidence for

AZT. Designs which attempt to do this are called Response-Adaptive

designs (Response-Adaptive Randomization).

If the treatment assignments had been done with the DBCD (Hu

and Zhang, 2004, Annals of Statistics) with urn target:

• AZT Group— 360 patients

• placebo group—117 patients

then, only 60 (instead of 80) infants would be HIV positive.

Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416    61       0.90     50
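The HIV+ column can be checked with simple arithmetic. The sketch below assumes that the expected number of HIV-positive infants is the allocation-weighted sum of the transmission rates observed in Example 3.1 (20/239 on AZT, 60/238 on placebo); the function name `expected_failures` is illustrative only and the power column is not reproduced.

```python
# Observed transmission (failure) rates from Example 3.1.
Q_AZT = 20 / 239      # HIV-positive rate in the AZT group
Q_PLACEBO = 60 / 238  # HIV-positive rate in the placebo group


def expected_failures(n_azt, n_placebo, q_azt=Q_AZT, q_placebo=Q_PLACEBO):
    """Expected number of HIV-positive infants for a given split of patients."""
    return n_azt * q_azt + n_placebo * q_placebo


# Allocations (AZT, placebo) as listed in the table.
allocations = {
    "EA": (239, 238),
    "DBCD": (360, 117),
    "Neyman": (186, 291),
    "FPower": (416, 61),
}

for rule, (n_a, n_b) in allocations.items():
    print(f"{rule:7s} {expected_failures(n_a, n_b):5.1f}")
```

Rounded to whole infants, these values agree with the HIV+ column, which illustrates how shifting patients towards the better arm lowers the expected number of failures.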

Example 3.2 (ECMO Trial). Extracorporeal membrane oxygenation

(ECMO) is an external system for oxygenating the blood based on

techniques used in cardiopulmonary bypass technology developed for

cardiac surgery. In the literature, there are three well-documented clinical

trials evaluating the clinical effectiveness of ECMO:

(i) the Michigan ECMO study (Bartlett, et al. 1985);

(ii) the Boston ECMO study (Ware, 1989);

(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

Example 3.2 (Continued): Michigan ECMO trial using

RPW rule:

The RPW rule was used in a clinical trial of extracorporeal membrane

oxygenation (ECMO; Bartlett, et al. 1985, Pediatrics).

Total 12 patients.

• ECMO group– 11 patients, all survived.

• Conventional therapy– 1 patient, died.

3.2 Determining the Sample Size

In the planning stages of a randomized clinical trial, it is necessary to

determine the number of subjects (sample size) to be randomized.

For two treatments (A and B), say n = nA + nB . We assume here

that the allocation proportions are known in advance, that is,

nA/n = ρ and nB/n = 1− ρ are predetermined.

Examples of sample size (SS) calculations.
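The worked examples are left to the lecture. As a hedged sketch only, the block below uses the usual normal-approximation formula for comparing two proportions with a two-sided Wald test and a fixed allocation proportion ρ, n = (z_{1-α/2} + z_{power})² [pA qA/ρ + pB qB/(1-ρ)] / (pA - pB)². This formula, the function name `total_sample_size` and the default values are assumptions, not taken from the slides.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p_a, p_b, rho=0.5, alpha=0.05, power=0.9):
    """Total n = nA + nB for a two-sided Wald test of pA = pB.

    rho is the predetermined proportion nA/n allocated to treatment A.
    Normal approximation; an assumption of this sketch.
    """
    z = NormalDist().inv_cdf
    # Variance of the difference in sample proportions, multiplied by n.
    var_term = p_a * (1 - p_a) / rho + p_b * (1 - p_b) / (1 - rho)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * var_term / (p_a - p_b) ** 2
    return ceil(n)


# Hypothetical planning values: pA = 0.7, pB = 0.5.
for rho in (0.5, 0.6, 0.7):
    print(rho, total_sample_size(0.7, 0.5, rho=rho))
```

Varying `rho` shows how an unequal allocation changes the total number of subjects needed for the same power.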

3.3 Mathematical Framework of Randomization Procedures

Suppose we compare two treatments A and B. Let T1, ..., Tn be a

sequence of random treatment assignments.

Ti = 1 if the patient i is assigned to treatment A;

Ti = 0 if the patient i is assigned to treatment B.

$N_A(n) = \sum_{i=1}^{n} T_i$ = number of patients on A, and

$N_B(n) = n - N_A(n)$.

X1, ..., Xn: response variables. Here Xi represents the sequence of

responses that would be observed if each treatment were assigned to

the i-th patient independently.

Z1, ...,Zn: covariates. Here Zi represents the covariates of i-th

patient.

When the (i + 1)th patient is ready to be randomized in a clinical

trial, the following information is available:

• patient assignments: T1, ..., Ti;

• responses: X1, ..., Xi (assuming immediate responses);

• patient covariates: Z1, ..., Zi and Zi+1.

Let $\mathcal{T}_n = \sigma\{T_1, \ldots, T_n\}$ be the sigma-algebra generated by the first n

treatment assignments.

Let $\mathcal{X}_n = \sigma\{X_1, \ldots, X_n\}$ be the sigma-algebra generated by the first

n responses.

Let $\mathcal{Z}_n = \sigma\{Z_1, \ldots, Z_n\}$ be the sigma-algebra generated by the first n

covariate vectors. Let $\mathcal{F}_n = \mathcal{T}_n \otimes \mathcal{X}_n \otimes \mathcal{Z}_{n+1}$.

A randomization procedure is defined by

$\phi_n = E(T_n \mid \mathcal{F}_{n-1})$,

where $\phi_{n+1}$ is $\mathcal{F}_n$-measurable. We can describe $\phi_n$ as the conditional

probability of assigning treatments 1, ..., K (here K = 2, that is, A and B) to the n-th patient,

conditional on the previous n− 1 assignments, responses, and

covariate vectors, and the current patient’s covariate vector.

We can describe five types of randomization procedures:

• (i) complete randomization if

$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n)$;

Uses no information.

• (ii) restricted randomization if

$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1})$;

Uses only the patients’ assignments.

• (iii) response-adaptive randomization if

$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1})$;

Uses the patients’ assignments and responses.

• (iv) covariate-adaptive randomization if

$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{Z}_n)$;

Uses the patients’ assignments and covariates.

• (v) covariate-adjusted response-adaptive (CARA) randomization if

$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1}, \mathcal{Z}_n)$.

Uses all available information.

3.4 Complete randomization

The simplest form of a randomization procedure is complete

randomization.

$E(T_i \mid T_1, \ldots, T_{i-1}) = P(T_i = 1 \mid T_1, \ldots, T_{i-1}) = 1/2$, i = 1, ..., n.

NA(n) has a binomial(n, 1/2) distribution.

This procedure is rarely used in practice because of the nonnegligible

probability of treatment imbalances in moderate samples.
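To make the size of the imbalance concrete, the sketch below computes the exact distribution of the difference NA(n) - NB(n) under complete randomization from the binomial(n, 1/2) law. The function names and the threshold of 10 are illustrative choices, not part of the slides.

```python
from math import comb


def imbalance_pmf(n):
    """Exact pmf of NA(n) - NB(n) when NA(n) ~ binomial(n, 1/2)."""
    return {2 * k - n: comb(n, k) / 2 ** n for k in range(n + 1)}


def prob_imbalance_at_least(n, d):
    """P(|NA(n) - NB(n)| >= d) under complete randomization."""
    return sum(p for diff, p in imbalance_pmf(n).items() if abs(diff) >= d)


n = 50
pmf = imbalance_pmf(n)
# The mean difference is 0 by symmetry, so the variance is E[difference^2].
variance = sum(diff ** 2 * p for diff, p in pmf.items())
print("variance of the difference:", variance)
print("P(|difference| >= 10):", prob_imbalance_at_least(n, 10))
```

The variance of the difference equals n, so the typical imbalance grows like the square root of the sample size.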

3.5 Restricted randomization

Truncated binomial design: Complete randomization is used until n/2

patients have been assigned to A or B, then the remainder is filled with the

opposite treatment with probability 1. Here the procedure is given by

$$\phi_i = \begin{cases} 1/2, & \text{if } \max\{N_A(i-1), N_B(i-1)\} < n/2, \\ 0, & \text{if } N_A(i-1) = n/2, \\ 1, & \text{if } N_B(i-1) = n/2. \end{cases}$$

Blocked Procedures: Because we do not know n exactly in advance,

we typically require overrunning of the randomization sequence.

Forced balance designs are therefore typically used in blocks.

• Permuted block design: Blocks of even size 2b are filled using

either a random allocation rule or a truncated binomial design.

• The maximum imbalance is b and the only possibility of a terminal

imbalance occurs if the last block is unfilled. Every block has at

least one deterministic assignment.

• Random block design: Blocks of size 2, 4, 6, ..., 2K are randomly

selected and equiprobable.
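A permuted block design can be sketched in a few lines. The version below fills each block of size 2b with the random allocation rule (a uniformly random ordering of b A's and b B's); the function name `permuted_block_sequence` and the choice of filling rule are assumptions made for illustration.

```python
import random


def permuted_block_sequence(n, b, rng=None):
    """Assignments for n patients using permuted blocks of size 2b.

    Each block contains exactly b A's and b B's in random order.
    The last block is truncated if n is not a multiple of 2b.
    """
    rng = rng or random.Random()
    sequence = []
    while len(sequence) < n:
        block = ["A"] * b + ["B"] * b
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n]


seq = permuted_block_sequence(n=10, b=2, rng=random.Random(1))
print("".join(seq), "imbalance:", seq.count("A") - seq.count("B"))
```

Because every complete block is balanced, the running imbalance never exceeds b, and a terminal imbalance is possible only when the last block is unfilled.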

Efron’s biased coin design (BCD): (Efron, 1971). Let

Di = NA(i) − NB(i) be the imbalance between treatments A and B.

Define a constant p ∈ (0.5, 1]. Then the procedure is given by

$$\phi_i = \begin{cases} 1/2, & \text{if } D_{i-1} = 0, \\ p, & \text{if } D_{i-1} < 0, \\ 1 - p, & \text{if } D_{i-1} > 0. \end{cases}$$

Efron suggested p = 2/3 might be a reasonable value (without

justification).

Many other designs have been proposed and studied in the literature

(Smith’s design (1984), Wei’s design (1978), Big Stick design (Soares

and Wu, 1982), etc.)

When n = 50, Var(Dn) = 49.92 (complete randomization);

Var(Dn) = 4.36 (Efron’s BCD with p = 2/3). (Based on 100,000

replications.)
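The variance comparison above can be approximated with a short Monte Carlo sketch. The code below simulates the final imbalance under Efron's biased coin; setting the bias to 1/2 recovers complete randomization. Function names, the seed and the number of replications are illustrative, so the output will be close to, but not identical with, the figures quoted above.

```python
import random


def final_imbalance(n, p, rng):
    """Simulate D_n = N_A(n) - N_B(n) under Efron's biased coin with bias p.

    p = 0.5 reduces to complete randomization.
    """
    d = 0
    for _ in range(n):
        if d == 0:
            prob_a = 0.5
        elif d < 0:
            prob_a = p        # A is behind, so favour A
        else:
            prob_a = 1 - p    # A is ahead, so favour B
        d += 1 if rng.random() < prob_a else -1
    return d


def simulated_variance(n, p, reps, seed=0):
    """Sample variance of D_n over `reps` simulated trials."""
    rng = random.Random(seed)
    draws = [final_imbalance(n, p, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)


print("complete randomization:", simulated_variance(50, 0.5, reps=20000))
print("Efron BCD, bias 2/3:   ", simulated_variance(50, 2 / 3, reps=20000))
```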

3.6 Selection Bias

Selection Bias refers to biases that are introduced into an unmasked

study because an investigator may be able to guess the treatment

assignment of future patients based on knowing the treatments

assigned to the past patients. Patients usually enter a trial sequentially

over time.

The great clinical trialist Chalmers (1990) was convinced that the

elimination of selection bias is the most essential requirement for a

good clinical trial.

How to measure the Selection Bias?

Blackwell and Hodges (1957), Berger, Ivanova and Knoll (2003) and

others suggested using the predictability of a randomization

sequence to measure the selection bias.

One measure of the predictability of a randomization

sequence is given by

$$P_{\mathrm{pred}} = \frac{1}{n} \sum_{i=1}^{n} |E\phi_i - 0.5|.$$

Selection bias of different designs.

4 Response-adaptive randomization procedures

4.1 Historical notes

Adaptive designs in the clinical trials context were first formulated as

solutions to optimal decision-making questions:

• Which treatment is better?

• What sample size should be used before determining a “better”

treatment to maximize the total number receiving the better

treatment?

• How do we incorporate prior data or accruing data into these

decisions?

The preliminary ideas can be traced back to Thompson (1933,

Biometrika) and Robbins (1952, Bulletin of the American

Mathematical Society) and led to a flurry of work in the 1960s by

Anscombe (1963, JASA), Colton (1963, JASA), Zelen (1969, JASA)

and Cornfield, Halperin, and Greenhouse (1969, Annals of

Mathematical Statistics), among others.

4.2 Play-the-winner rule

Perhaps the simplest of these adaptive designs is the play-the-winner

rule originally explored by Robbins (1952, Bulletin of the American

Mathematical Society) and later by Zelen (1969, JASA).

Binary response: treatment A and B.

• pA: P (success|A), qA = 1− pA;

• pB : P (success|B), qB = 1− pB ;

• NA(n): number of patients on A;

• NB(n): number of patients on B, n = NA(n) +NB(n).

Play-the-winner rule:

• a success on one treatment results in the next patient’s

assignment to the same treatment,

• a failure on one treatment results in the next patient’s assignment

to the opposite treatment.

That is

• φn = 1 if Tn−1 = 1 and Xn−1(A) = 1 or Tn−1 = 0 and

Xn−1(B) = 0.

• φn = 0 if Tn−1 = 1 and Xn−1(A) = 0 or Tn−1 = 0 and

Xn−1(B) = 1.

The properties of the play-the-winner rule?

• What is the proportion of patients in treatment A: $N_A(n)/n \to\ ???$

• What is the variance (variability) of the allocation: $\mathrm{Var}(N_A(n))$ or $\mathrm{Var}(N_A(n)/n) =\ ???$

• What is the distribution of the allocation: $\sqrt{n}\,(N_A(n)/n -\ ???) \to\ ???$

We have

• The limiting proportion of patients in treatment A:

$$\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}.$$

• The variance (variability) of the allocation:

$$\mathrm{Var}(N_A(n)) = \frac{n\, q_A q_B (p_A + p_B)}{(q_A + q_B)^3}.$$

• The asymptotic distribution of the allocation:

$$\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\left(0, \frac{q_A q_B (p_A + p_B)}{(q_A + q_B)^3}\right).$$
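The limiting proportion can be checked by simulation. The sketch below runs the play-the-winner rule with hypothetical success probabilities pA = 0.7 and pB = 0.5 and compares the average allocation proportion with qB/(qA + qB). Assigning the first patient by a fair coin is an assumption of this sketch, as are the function name and the run sizes.

```python
import random


def simulate_play_the_winner(n, p_a, p_b, rng):
    """Return N_A(n)/n for one trial run under the play-the-winner rule."""
    on_a = rng.random() < 0.5  # first assignment by a fair coin (assumption)
    n_a = 0
    for _ in range(n):
        n_a += on_a
        success = rng.random() < (p_a if on_a else p_b)
        if not success:
            on_a = not on_a    # a failure switches the next patient's treatment
    return n_a / n


rng = random.Random(2022)
p_a, p_b = 0.7, 0.5
runs = [simulate_play_the_winner(2000, p_a, p_b, rng) for _ in range(200)]
limit = (1 - p_b) / ((1 - p_a) + (1 - p_b))
print("simulated mean proportion on A:", sum(runs) / len(runs))
print("limit qB/(qA+qB):              ", limit)
```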

Advantages:

• more patients in the better treatment;

• intuitively attractive.

Disadvantages:

• Not a randomized procedure;

• Not based on any optimality.

4.2.1 Randomized play-the-winner rule

The randomized play-the-winner (RPW) rule (Wei and Durham, 1978,

JASA) has been the most-studied urn model in the literature.

Binary response: treatment A and B.

• pA: P (success|A), qA = 1− pA;

• pB : P (success|B), qB = 1− pB ;

• NA(n): number of patients on A;

• NB(n): number of patients on B, n = NA(n) +NB(n).

Begin with c balls of A and c balls of B in an urn.

• Draw A:

– assign patient to A;

– replace ball;

– add 1 type A ball if treatment A is successful;

– add 1 type B ball if treatment A is a failure.

• Draw B:

– assign patient to B;

– replace ball;

– add 1 type B ball if treatment B is successful;

– add 1 type A ball if treatment B is a failure.

When the (i + 1)th patient is ready to be randomized in a clinical

trial, the following information is available:

• patient assignments: T1, ..., Ti;

• responses: X1, ..., Xi (assuming immediate responses).

Then

• $\phi_1 = 1/2$.

• $\phi_2 = \dfrac{c + T_1 X_1 + (1 - T_1)(1 - X_1)}{2c + 1}$.

• $\phi_{i+1} = \dfrac{c + \sum_{j=1}^{i} [T_j X_j + (1 - T_j)(1 - X_j)]}{2c + i}$.
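The urn probabilities above translate directly into code. The sketch below computes the probability of assigning the next patient to A from the history of assignments (1 for A, 0 for B) and binary responses (1 for success), and uses it to simulate one trial. Function names and the example parameters are hypothetical.

```python
import random


def rpw_assignment_probability(c, assignments, responses):
    """Probability that the next patient is assigned to A under the RPW urn.

    A type A ball is added after a success on A or a failure on B.
    """
    i = len(assignments)
    a_balls_added = sum(
        t * x + (1 - t) * (1 - x) for t, x in zip(assignments, responses)
    )
    return (c + a_balls_added) / (2 * c + i)


def simulate_rpw(n, p_a, p_b, c=1, rng=None):
    """Simulate one RPW trial; returns (assignments, responses)."""
    rng = rng or random.Random()
    assignments, responses = [], []
    for _ in range(n):
        phi = rpw_assignment_probability(c, assignments, responses)
        t = 1 if rng.random() < phi else 0
        x = 1 if rng.random() < (p_a if t else p_b) else 0
        assignments.append(t)
        responses.append(x)
    return assignments, responses


T, X = simulate_rpw(100, p_a=0.7, p_b=0.5, rng=random.Random(3))
print("proportion on A:", sum(T) / len(T))
```

Recomputing the sum at every step keeps the code close to the formula; a running count of added balls would be the natural optimization for long trials.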

Properties:

• Calculate ENA(n);

• Simulated results.

We have

• The limiting proportion of patients in treatment A:

$$\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}.$$

• The variance (variability) of the allocation (when qA + qB > 1/2):

$$\mathrm{Var}(N_A(n)) = \frac{n\, q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}.$$

• The asymptotic distribution of the allocation (when

qA + qB > 1/2):

$$\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\left(0, \frac{q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}\right).$$

Table 1: Asymptotic (A) and simulated (S) mean and variance (multiplied by

n) of the allocation proportions NA(n)/n for the randomized play-the-winner

(RPW) rule. Simulations based on n = 100 and 1000 replications.

From Hu and Rosenberger (2003), reprinted by permission from the

American Statistical Association.

(pA, pB)     mean (A)   mean (S)   var (A)   var (S)
(0.8, 0.8)   0.50       0.50       N/A       2.29
(0.8, 0.7)   0.60       0.57       N/A       1.90
(0.7, 0.5)   0.63       0.61       1.33      0.90
(0.7, 0.3)   0.70       0.68       0.63      0.51
(0.5, 0.5)   0.50       0.50       0.75      0.65
(0.5, 0.2)   0.62       0.61       0.35      0.34
(0.2, 0.2)   0.50       0.50       0.20      0.19
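The asymptotic entries of Table 1 can be recomputed from the success probabilities. The sketch below assumes the standard form of the RPW asymptotic variance, qA qB [5 - 2(qA + qB)] / [(2(qA + qB) - 1)(qA + qB)^2], which equals qA qB [1 + 2(pA + pB)] over the same denominator and is the form that matches the asymptotic variance column of the table. The function name is illustrative, and the variance is reported as unavailable when qA + qB does not exceed 1/2.

```python
def rpw_asymptotics(p_a, p_b):
    """Asymptotic mean and (n times) variance of N_A(n)/n under the RPW rule.

    Returns (mean, variance); variance is None when qA + qB <= 1/2,
    where the normal limit is not available.
    """
    q_a, q_b = 1 - p_a, 1 - p_b
    s = q_a + q_b
    mean = q_b / s
    if s <= 0.5 + 1e-9:  # small tolerance for floating-point round-off
        return mean, None
    variance = q_a * q_b * (5 - 2 * s) / ((2 * s - 1) * s ** 2)
    return mean, variance


settings = [(0.8, 0.8), (0.8, 0.7), (0.7, 0.5), (0.7, 0.3),
            (0.5, 0.5), (0.5, 0.2), (0.2, 0.2)]
for p_a, p_b in settings:
    mean, var = rpw_asymptotics(p_a, p_b)
    var_text = "N/A" if var is None else f"{var:.3f}"
    print(f"({p_a}, {p_b})  mean={mean:.3f}  var={var_text}")
```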

Urn models:

• Play-the-winner (PW) rule (Zelen, 1969, JASA); Randomized

play-the-winner rule (Wei and Durham,1978, JASA).

• Generalized Friedman’s urn models (Wei, 1979, JASA; Smythe,

1996, Stochastic Process. Appl.; Bai, Hu and Shen, 2002, JMVA);

• Randomized Polya Urn (Durham, Flournoy, and Li, 1998,

Canadian J of Statistics); Ternary Urn (Ivanova and Flournoy,

2001);

• Drop-the-Loser rule (Ivanova, 2003, Metrika); Generalized

drop-the-Loser rule (Zhang, Chan, Cheung and Hu, 2007, Statistica

Sinica);

• Sequential estimated urn (Zhang, Hu and Cheung, 2006, Annals

of Applied Probability).

• Urn models with immigration balls (Zhang, Hu, Cheung and

Chan, Annals of Statistics, 2011).

4.3 Relationship Between Power and Variability

Example 3.2. ECMO trial (The UK trial). Extracorporeal

membrane oxygenation (ECMO) is an external system for oxygenating

the blood based on techniques used in cardiopulmonary bypass

technology developed for cardiac surgery. In the literature, there are

three well-documented clinical trials evaluating the clinical

effectiveness of ECMO:

(i) the Michigan ECMO study (Bartlett, et al. 1985);

(ii) the Boston ECMO study (Ware, 1989);

(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

Example 3.2 (Continued): Michigan ECMO trial using

RPW rule:

The RPW rule was used in a clinical trial of extracorporeal membrane

oxygenation (ECMO; Bartlett, et al. 1985, Pediatrics).

Total 12 patients.

• ECMO group– 11 patients, all survived.

• Conventional therapy– 1 patient, died.

Is this trial valid? No statistical conclusion can be drawn.

Why? Power and variability.

Power is an increasing function of the noncentrality parameter:

For the following one-sided test: H0 : pA = pB vs H1 : pA > pB, the

corresponding test statistic is

$$T = \frac{\hat p_A - \hat p_B}{\sqrt{\hat p_A (1 - \hat p_A)/N_A(n) + \hat p_B (1 - \hat p_B)/N_B(n)}}.$$

We can calculate the noncentrality parameter as follows:

$$\frac{(p_A - p_B)^2}{p_A q_A / N_A(n) + p_B q_B / N_B(n)}.$$

Assuming NA(n)/n → ρ in probability, we can rewrite this as:

See details in class.
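As a first step only, and not a substitute for the derivation given in class, the noncentrality parameter can be rewritten by dividing numerator and denominator by n and replacing NA(n)/n by its limit ρ. The approximation sign below is an assumption of this sketch; it ignores the random fluctuation of NA(n)/n around ρ, which is exactly what links power to the variability of the design.

```latex
\frac{(p_A - p_B)^2}{p_A q_A / N_A(n) + p_B q_B / N_B(n)}
  = \frac{n\,(p_A - p_B)^2}
         {\dfrac{p_A q_A}{N_A(n)/n} + \dfrac{p_B q_B}{1 - N_A(n)/n}}
  \approx \frac{n\,(p_A - p_B)^2}
         {\dfrac{p_A q_A}{\rho} + \dfrac{p_B q_B}{1 - \rho}} .
```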

4.4 Lower bound of the variability

Hu, Rosenberger and Zhang (2006) considered "Asymptotically best

response-adaptive randomization procedures" in the Journal of Statistical

Planning and Inference.

See details in class.

4.5 Doubly-Adaptive Biased Coin Design

Doubly-adaptive biased coin design (DBCD) (Eisele and Woodroofe,

1995, Annals of Statist; Hu and Zhang, 2004, Annals of Statist).

Let g be a function from [0, 1] × [0, 1] to [0, 1] satisfying certain

conditions. The procedure then allocates patient j to treatment A

with probability

$$g\left(\frac{n_A(j-1)}{j-1},\ \hat\rho_{j-1}\right).$$

How to choose the function g?

Recently, Hu and Zhang (2004) proposed (γ ≥ 0)

$$g(x, \rho) = \frac{\rho (\rho/x)^{\gamma}}{\rho (\rho/x)^{\gamma} + (1 - \rho)\left((1 - \rho)/(1 - x)\right)^{\gamma}}.$$

• γ = 0: then g(x, ρ) = ρ (the SMLE);

• γ = ∞: deterministic design.

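Let

$$\lambda = \left.\frac{\partial g}{\partial x}\right|_{(\rho,\rho)}, \qquad \eta = \left.\frac{\partial g}{\partial y}\right|_{(\rho,\rho)}$$

and

$$\nabla(\rho) = \left(\frac{\partial \rho}{\partial p_A},\ \frac{\partial \rho}{\partial p_B}\right)'.$$

Also let

$$\sigma_3^2 = (\nabla(\rho)|_{\Theta})'\, V\, \nabla(\rho)|_{\Theta} \quad \text{and} \quad \sigma_1^2 = \rho(1 - \rho),$$

where Θ = (pA, pB) and

$$V = \mathrm{diag}\left(\frac{\mathrm{Var}(\xi_A)}{\rho},\ \frac{\mathrm{Var}(\xi_B)}{1 - \rho}\right).$$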

Theorem. Under widely satisfied conditions,

$$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2) \qquad (1)$$

in distribution, where

$$\sigma^2 = \frac{\sigma_1^2}{1 - 2\lambda} + \frac{2\eta^2 \sigma_3^2}{(1 - \lambda)(1 - 2\lambda)}.$$

Main Techniques used: Martingale, Gaussian Approximation and

Matrix theory.

Example 4.1. Binary response: treatment A and B.

• pA: P (success|A), qA = 1− pA;

• pB : P (success|B), qB = 1− pB ;

• nA: number of patients on A;

• nB : number of patients on B, n = nA + nB .

To see how this procedure works in practice, we look at a simple

illustration with γ = 2.

• Suppose we have already assigned 9 patients, 5 to A and 4 to B.

• We have observed a success rate of pˆA = 3/5 on A and pˆB = 1/4

on B.

If the target allocation is urn allocation, qB/(qA + qB) (Wei and

Durham, 1978), then

• estimate the target allocation as

$$\hat\rho = \frac{3/4}{2/5 + 3/4} = 0.652.$$

• the current allocation proportion is 5/9 = 0.556.

• Then the probability of assigning the 10th patient to treatment A

is computed as

$$P(A) = \frac{0.652\,(0.652/0.556)^2}{0.652\,(0.652/0.556)^2 + 0.348\,(0.348/0.444)^2} = 0.807.$$

If we are interested in optimal allocation, $\sqrt{p_A}/(\sqrt{p_A} + \sqrt{p_B})$

(Rosenberger, et al, 2001), then

• estimate the target allocation as

$$\hat\rho = \frac{\sqrt{3/5}}{\sqrt{3/5} + \sqrt{1/4}} = 0.6077.$$

• the current allocation proportion is 5/9 = 0.556.

• Then the probability of assigning the 10th patient to treatment A

is computed as

$$P(A) = \frac{0.6077\,(0.6077/0.556)^2}{0.6077\,(0.6077/0.556)^2 + 0.3923\,(0.3923/0.444)^2} = 0.704.$$
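The two worked calculations can be reproduced in a few lines. The sketch below implements the Hu and Zhang allocation function g(x, ρ) and evaluates it for the illustration with γ = 2, 5 of 9 patients on A, and observed success rates 3/5 and 1/4. The function name `dbcd_probability` is hypothetical, and because the code does not round intermediate values the results can differ from the hand calculation in the third decimal place.

```python
from math import sqrt


def dbcd_probability(x, rho, gamma):
    """Hu-Zhang allocation function g(x, rho) for 0 < x < 1 and 0 < rho < 1.

    x is the current proportion on A, rho the estimated target proportion.
    """
    num = rho * (rho / x) ** gamma
    den = num + (1 - rho) * ((1 - rho) / (1 - x)) ** gamma
    return num / den


p_hat_a, p_hat_b = 3 / 5, 1 / 4
x = 5 / 9  # current allocation proportion on A

# Urn target qB / (qA + qB).
rho_urn = (1 - p_hat_b) / ((1 - p_hat_a) + (1 - p_hat_b))
# Optimal target sqrt(pA) / (sqrt(pA) + sqrt(pB)).
rho_opt = sqrt(p_hat_a) / (sqrt(p_hat_a) + sqrt(p_hat_b))

print("urn target:    ", rho_urn, dbcd_probability(x, rho_urn, gamma=2))
print("optimal target:", rho_opt, dbcd_probability(x, rho_opt, gamma=2))
```

In both cases A is currently under-represented relative to its target, so the next patient is assigned to A with probability larger than the target itself.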

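For binary responses with (ρ = qB/(qA + qB)), writing subscripts 1 and 2 for treatments A and B,

$$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2_{DBCD})$$

in distribution, whenever λ < 1/2, where

$$\sigma^2_{DBCD} = \frac{q_1 q_2}{(1 - 2\lambda)(q_1 + q_2)^2} + \frac{2\eta^2}{(1 - \lambda)(1 - 2\lambda)} \cdot \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3}.$$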

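If

$$g(x, \rho) = \frac{\rho (\rho/x)^{\gamma}}{\rho (\rho/x)^{\gamma} + (1 - \rho)\left((1 - \rho)/(1 - x)\right)^{\gamma}},$$

then

$$\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3} + \frac{2 q_1 q_2}{(1 + 2\gamma)(q_1 + q_2)^3}.$$

• γ = 0: $\sigma^2_{DBCD} = \dfrac{q_1 q_2 (p_1 + p_2 + 2)}{(q_1 + q_2)^3}$.

• γ = ∞: $\sigma^2_{DBCD} = \dfrac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3}$ (lower bound).

• γ = 2: $\sigma^2_{DBCD} = \dfrac{q_1 q_2 (p_1 + p_2 + 0.4)}{(q_1 + q_2)^3}$.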

Advantages of DBCD:

• can target any given allocation ρ(θ);

• comes very close to the lower bound, but does NOT ATTAIN it;

• applies to all types of responses.

4.6 Efficient Response-Adaptive Designs

Hu, Zhang and He (2009, Annals of Statistics) proposed Efficient

Response-Adaptive Designs (ERADE), which,

• can target any given allocation ρ(θ);

• ATTAIN the lower bound;

• and apply to all types of responses.

The ERADE is analogous to a discretized version of Hu and Zhang’s

function. For a parameter α ∈ (0, 1), the procedure allocates the jth

patient to treatment A with probability

$$\phi_j = \begin{cases} 1/2, & \text{if } n_A(j-1)/(j-1) = \hat\rho_{j-1}, \\ \alpha \hat\rho_{j-1}, & \text{if } n_A(j-1)/(j-1) > \hat\rho_{j-1}, \\ 1 - \alpha(1 - \hat\rho_{j-1}), & \text{if } n_A(j-1)/(j-1) < \hat\rho_{j-1}. \end{cases}$$
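The three-case rule is easy to express as a function. The sketch below follows the rule exactly as stated above, including the value 1/2 in the tie case; the function name `erade_probability` and the example value α = 0.5 are assumptions made for illustration.

```python
def erade_probability(current_proportion, rho_hat, alpha):
    """Probability of assigning the next patient to A under the ERADE rule.

    current_proportion: nA(j-1)/(j-1), the proportion already on A.
    rho_hat: current estimate of the target allocation proportion.
    alpha: tuning parameter in (0, 1).
    """
    if current_proportion > rho_hat:
        return alpha * rho_hat              # A over-represented: shrink towards B
    if current_proportion < rho_hat:
        return 1 - alpha * (1 - rho_hat)    # A under-represented: push towards A
    return 0.5                              # tie case, as stated in the rule above


# Illustration: 5 of 9 patients on A, estimated target 0.652, alpha = 0.5.
print(erade_probability(5 / 9, 0.652, alpha=0.5))
```

Unlike the continuous function g, the assignment probability here takes only a few values for a given estimate of the target, which is what makes the rule a discretized analogue.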

4.7 Revisiting the examples

Example 4.2. HIV transmission (Continued). Connor et al.

(1994, The New England Journal of Medicine) report a clinical trial to

evaluate the drug AZT in reducing the risk of maternal-infant HIV

transmission.

A 50-50 randomization scheme was used:

• AZT Group—239 pregnant women (20 HIV positive infants).

• placebo group—238 pregnant women (60 HIV positive

infants).

Here $\hat p_A$ = 219/239 = 0.916, $\hat p_B$ = 178/238 = 0.748.

• pA + pB = 1.664 > 1.5 (that is, qA + qB < 1/2), so RPW does not apply here.

• DBCD with target allocation ρ = q2/(q1 + q2) and γ = 2

• Neyman allocation, Maximize the power.

• FPower: Fix the power (β = 0.9) and minimize expected failures.

Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416    61       0.90     50

Example 4.3. ECMO trial (The UK trial). Extracorporeal

membrane oxygenation (ECMO) is an external system for oxygenating

the blood based on techniques used in cardiopulmonary bypass

technology developed for cardiac surgery. In the literature, there are

three well-documented clinical trials evaluating the clinical

effectiveness of ECMO:

(i) the Michigan ECMO study (Bartlett, et al. 1985);

(ii) the Boston ECMO study (Ware, 1989);

(iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

The UK ECMO trial:

50-50 randomization scheme is used:

• ECMO Group—93 infants (28 deaths).

• Conventional group—92 infants (54 deaths).

• Use P1 = 65/93 and P2 = 38/92 as the estimated success

probabilities of the ECMO and the conventional treatment,

respectively.

• ERADE (Hu, Zhang and He, 2009) is used based on 10000

simulations.

• RPW is used based on 10000 simulations.

• On average, there will be about 121 patients in the ECMO group and 64

patients in the conventional treatment group.

• The expected number of deaths is 74, as compared to 82 in

the actual trial. The adaptive design utilizes the better treatment

more often to save lives.

Power of the ERADE and the RPW rule under the setting of

P1 = 65/93 and P2 = 38/92.

• For equal allocation, power is 0.978.

• The expected power under both designs (ERADE and RPW) is

0.969.

• Based on the 10000 simulated trials, we noticed that in 99% of

the trials under the ERADE there were more than 52 patients

assigned to the conventional treatment group, for a power of

0.941 or higher.

• Under the RPW rule, only 39 or fewer patients were assigned to

the conventional treatment in 1% of the trials, for a power of

0.904 or less.

• Also based on the 10000 simulated trials, the ERADE always

assigned more patients to the ECMO group.

• However, the RPW rule assigned more patients to the

conventional group in 114 trials.

• Even at the sample size 185, we can see the advantage of using

the proposed ERADE over the randomized play-the-winner rule.

4.8 Some remarks

• Urn models (RPW rule, Wei and Durham, 1978, JASA; Zhang,

Hu, Cheung and Chan, 2011, AOS)

• Ethical, Randomness, Power and Variability (Hu and Rosenberger,

2003, JASA)

• Lower bound of the variability (Hu, Rosenberger and Zhang, 2006,

JSPI)

• DBCD (Hu and Zhang, 2004, AOS)

• Optimal allocations (Rosenberger et al, 2001, Biometrics,

Tymofyeyev, Rosenberger and Hu, 2007, JASA)

• ERADE (Hu, Zhang and He, 2009, AOS)


• Delayed responses (Bai, Hu and Rosenberger, 2002, AOS, Hu and

Zhang, 2004)

• Time trends and others (Hu and Rosenberger, 2000, Statistics in

Medicine)

• The book (Hu and Rosenberger, 2006) and two white papers (Hu

and Rosenberger, 2007).

• Sequential Monitoring RAR (Zhu and Hu, 2010, AOS).

• Sample size re-estimation (Li and Hu, 2021).

• Robustness Inference of RAR (Ye, Ma and Hu, 2021?).


5 Covariate-Adaptive Randomization

Example 5.1: Remdesivir-COVID-19 trial (China).

The Remdesivir in adults with severe COVID-19 trial (Wang et al. 2020) is
a randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare Remdesivir with placebo. There were 236 patients
in the trial. There are about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of this trial, some important imbalances
existed at enrollment between the groups, including more patients with
hypertension, diabetes, or coronary artery disease in the Remdesivir
group than in the placebo group.


Example 5.2: Moderna COVID-19 vaccine trial (2020).

The trial began on July 27, 2020, and enrolled 30,420 adult volunteers

at clinical research sites across the United States. Volunteers were

randomly assigned 1:1 to receive either two 100 microgram (mcg)

doses of the investigational vaccine or two shots of saline placebo 28

days apart. The average age of volunteers is 51 years. Approximately

47% are female, 25% are 65 years or older and 17% are under the age

of 65 with medical conditions placing them at higher risk for severe

COVID-19. Approximately 79% of participants are white, 10% are

Black or African American, 5% are Asian, 0.8% are American Indian or

Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%

are multiracial, and 21% (of any race) are Hispanic or Latino.


From the start of the trial through Nov. 25, 2020, investigators

recorded 196 cases of symptomatic COVID-19 occurring among

participants at least 14 days after they received their second shot. One

hundred and eighty-five cases (30 of which were classified as severe

COVID-19) occurred in the placebo group and 11 cases (0 of which

were classified as severe COVID-19) occurred in the group receiving

mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%

lower in those participants who received mRNA-1273 as compared to

those receiving placebo.
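The 94.1% figure can be reproduced approximately from the case counts alone. The sketch below assumes that the two arms have equal exposure (a reasonable simplification under 1:1 randomization); the trial's own analysis may use person-time at risk, so the function name and the simplification are illustrative only.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo):
    """1 minus the case ratio; a rough estimate that assumes equal exposure in both arms."""
    return 1.0 - cases_vaccine / cases_placebo

# Primary analysis counts quoted above: 11 cases (mRNA-1273) vs 185 cases (placebo).
ve_primary = vaccine_efficacy(11, 185)
print(f"{100 * ve_primary:.1f}%")  # 94.1%
```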


Investigators observed 236 cases of symptomatic COVID-19 among

participants at least 14 days after they received their first shot, with

225 cases in the placebo group and 11 cases in the group receiving

mRNA-1273. The vaccine efficacy was 95.2% for this secondary

analysis.


5.1 Some Classical designs

Clinical trialists are often concerned that treatment arms will be

unbalanced with respect to key covariates of interest. To prevent this,

covariate-adaptive randomization is often employed. Over 50000

covariate-adaptive clinical trials had been reported from 1988-2008

(Taves, 2010).

• Covariates (prognostic factors): factors that are associated with

the outcomes of patients

– E.g., gender, age, clinical center, blood pressure, stage of

disease at baseline, gene expressions.


• Covariate-adaptive design: randomization that incorporates

covariates and balances treatment allocation over covariates.

– Balancing treatment allocation for influential covariates.

– Achieving statistical efficiency by preserving type I errors while

increasing power.

• Two popular procedures: stratified permuted block design and

Pocock and Simon’s marginal procedure (1975).


Imbalance of Different Levels

• overall difference, $D_n = N_1(n) - N_2(n)$;

• marginal difference, the difference between the numbers of patients on
a margin, e.g., $D_{female}$;

• within-stratum difference, the difference between the numbers of
patients in a stratum, e.g., $D_{female,smoker}$.

              Female                  Male                  Overall
Smoker        $D_{female,smoker}$     $D_{male,smoker}$     $D_{smoker}$
Non-smoker    $D_{female,non-s}$      $D_{male,non-s}$      $D_{non-s}$
Overall       $D_{female}$            $D_{male}$            $D_n$


5.1.1 Stratified Randomization

• Strata are formed by all combinations of covariates’ levels.

– e.g.: 2 covariates gender (male and female) and smoking

behavior (smoker and non-smoker) lead to 2× 2 = 4 strata

• Separate randomization is employed within each stratum.

– covariate-adaptive biased coin design

– stratified permuted block design, commonly used. Permuted
Block Design: each block is a random permutation of m A’s and m B’s.

- e.g.: block size 2m = 4, with blocks such as AABB or BAAB;
for 10 patients: AABB | BAAB | BB (the last block is incomplete).
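Stratified permuted block randomization can be sketched as follows. This is a minimal illustration, not a production implementation: the function names, the stratum labels and the seed are hypothetical, and each stratum simply draws a fresh shuffled block whenever its current block is used up.

```python
import random

def permuted_block(m, rng):
    """Return one random permutation of m A's and m B's (block size 2m)."""
    block = ["A"] * m + ["B"] * m
    rng.shuffle(block)
    return block

def stratified_permuted_block(strata, m=2, seed=0):
    """Assign treatments to patients arriving in order.

    strata: list with the stratum label of each patient, in arrival order.
    """
    rng = random.Random(seed)
    pending = {}       # stratum -> unused assignments of its current block
    assignments = []
    for s in strata:
        if not pending.get(s):          # no block yet, or block exhausted
            pending[s] = permuted_block(m, rng)
        assignments.append(pending[s].pop(0))
    return assignments

# 8 female smokers (two complete blocks) and 6 male non-smokers
# (one complete block plus an incomplete one).
strata = [("female", "smoker")] * 8 + [("male", "non-smoker")] * 6
assignments = stratified_permuted_block(strata)
print(assignments)
```

Within a stratum the imbalance is zero after every complete block and at most m inside an incomplete block, which is why many sparsely filled strata can add up to a large overall imbalance.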


Stratified Randomization

• Advantage:

– Easy to understand and implement.

– Good large sample properties (almost perfect balance).

– Balance within stratum.

• Disadvantage:

– Only considers balance within each stratum.

– Does not work for cases with many strata (many covariates or

many levels).

– Unknown (theoretically) properties of statistical inference.


5.1.2 Pocock-Simon procedure

Let $Z_1, \ldots, Z_n$ be the covariate vectors of patients $1, \ldots, n$. Assume
that there are $S$ covariates of interest (continuous or otherwise) and
that covariate $s$ is divided into $n_s$ different levels, $s = 1, \ldots, S$.

Let $N_{sik}(n)$, $s = 1, \ldots, S$, $i = 1, \ldots, n_s$, $k = 1, 2$, be the number of
patients in the $i$-th level of the $s$-th covariate on treatment $k$.

Let patient $n+1$ have covariate vector $Z_{n+1} = (r_1, \ldots, r_S)$.

Let $D_s(n) = N_{s r_s 1}(n) - N_{s r_s 2}(n)$, which is the difference between the
numbers of patients on treatments 1 and 2 for members of level $r_s$ of
covariate $s$.

Let $w_1, \ldots, w_S$ be a set of weights and take the weighted aggregate

$$D(n) = \sum_{s=1}^{S} w_s D_s(n).$$

Establish a probability $\pi \in (1/2, 1]$. Then
the procedure allocates to treatment 1 according to

$$\phi_{i1} = E(T_{i1} \mid T_{i-1}, Z_i) =
\begin{cases}
1/2, & \text{if } D(i-1) = 0, \\
\pi, & \text{if } D(i-1) < 0, \\
1 - \pi, & \text{if } D(i-1) > 0.
\end{cases}$$
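The allocation rule above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: covariates are already discretized, the function name `pocock_simon` and the parameter `p` (playing the role of the biasing probability in the rule above) are hypothetical, and ties are detected with an exact comparison, which is only safe for simple weights.

```python
import random

def pocock_simon(covariates, weights, p=0.85, seed=0):
    """Sequentially assign treatments 1/2 using the weighted marginal imbalance D(n).

    covariates: list of tuples (r_1, ..., r_S), one per patient.
    weights:    list of S non-negative weights w_s.
    """
    rng = random.Random(seed)
    counts = {}        # (s, level, treatment) -> number of patients so far
    assignments = []
    for z in covariates:
        # D(n) = sum_s w_s * (N_{s r_s 1} - N_{s r_s 2}) over this patient's levels
        d = sum(
            w * (counts.get((s, r, 1), 0) - counts.get((s, r, 2), 0))
            for s, (r, w) in enumerate(zip(z, weights))
        )
        if d == 0:
            prob_1 = 0.5
        elif d < 0:
            prob_1 = p          # treatment 1 is under-represented: favour it
        else:
            prob_1 = 1 - p
        t = 1 if rng.random() < prob_1 else 2
        for s, r in enumerate(z):
            counts[(s, r, t)] = counts.get((s, r, t), 0) + 1
        assignments.append(t)
    return assignments, counts

# Hypothetical example: 200 patients, 3 binary covariates, equal weights.
demo_rng = random.Random(1)
patients = [tuple(demo_rng.randint(0, 1) for _ in range(3)) for _ in range(200)]
demo_assignments, demo_counts = pocock_simon(patients, weights=[1, 1, 1])
print(demo_assignments.count(1), demo_assignments.count(2))
```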


Pocock-Simon procedure

• Advantage:

– Balance across covariates (marginal balance).

– Overall treatment balance with many covariates.

• Disadvantage:

– Unknown theoretical properties (not well studied, Rosenberger

and Sverdlov, 2009).

– usually not well balanced within stratum.

– Unknown (theoretically) properties of statistical inference.


Examples.


We need new covariate-adaptive designs that provide balance (within

stratum, marginal and overall) under different situations (sample size

200, 500 or 1000):

• 10 covariates, each with 2 levels: total $2^{10} = 1024$ strata.

• 2 covariates: a biomarker with 2 levels and 100 investigation sites:
total 200 strata.


5.2 Hu and Hu’s Covariate-Adaptive Design for

Balance (discrete)

Consider two covariates: covariate 1 with $I$ levels and covariate 2 with
$J$ levels. Suppose patient $n+1$ ($n = 0, 1, 2, \ldots$) falls in level $i$ of covariate 1
and level $j$ of covariate 2. First we define the following values:

• If patient $n+1$ is assigned to treatment 1, let

– Within stratum: $D^{(1)}_{ij}(n+1) = N_{ij,1}(n+1) - N_{ij,2}(n+1)$,
where $N_{ij,1}(n+1)$ and $N_{ij,2}(n+1)$ are the numbers of
patients assigned to treatments 1 and 2, respectively, in stratum $(i, j)$
among the first $n+1$ patients.

– Marginal 1: $D^{(1)}_{i\cdot}(n+1) = N_{i\cdot,1}(n+1) - N_{i\cdot,2}(n+1)$, where
$N_{i\cdot,1}(n+1)$ and $N_{i\cdot,2}(n+1)$ are the numbers of patients
assigned to treatments 1 and 2, respectively, with covariate 1 $= i$
among the first $n+1$ patients.

– Marginal 2: $D^{(1)}_{\cdot j}(n+1) = N_{\cdot j,1}(n+1) - N_{\cdot j,2}(n+1)$, where
$N_{\cdot j,1}(n+1)$ and $N_{\cdot j,2}(n+1)$ are the numbers of patients
assigned to treatments 1 and 2, respectively, with covariate 2 $= j$
among the first $n+1$ patients.

– Overall: $D^{(1)}_{\cdot\cdot}(n+1) = N_{\cdot\cdot,1}(n+1) - N_{\cdot\cdot,2}(n+1)$, the overall difference between the
numbers of patients in groups 1 and 2 among the first $n+1$ patients.

– Define $A^{(1)}_{ij}(n+1) = (D^{(1)}_{ij}(n+1))^2$,
$A^{(1)}_{i\cdot}(n+1) = (D^{(1)}_{i\cdot}(n+1))^2$,
$A^{(1)}_{\cdot j}(n+1) = (D^{(1)}_{\cdot j}(n+1))^2$
and $A^{(1)}_{\cdot\cdot}(n+1) = (D^{(1)}_{\cdot\cdot}(n+1))^2$.

– The score of imbalance is

$$B^{(1)}_{ij}(n+1) = w_1 A^{(1)}_{ij}(n+1) + w_2 A^{(1)}_{i\cdot}(n+1) + w_3 A^{(1)}_{\cdot j}(n+1) + w_4 A^{(1)}_{\cdot\cdot}(n+1)$$

for some weights $w_1, w_2, w_3, w_4 \ge 0$.

• If patient $n+1$ is assigned to treatment 2, $B^{(2)}_{ij}(n+1)$ is
calculated similarly.

Then the proposed procedure (Hu and Hu, 2012) allocates patient $n+1$ to
treatment 1 according to

$$\phi_{n+1,1} =
\begin{cases}
1/2, & \text{if } B^{(1)}_{ij}(n+1) = B^{(2)}_{ij}(n+1), \\
\pi, & \text{if } B^{(1)}_{ij}(n+1) < B^{(2)}_{ij}(n+1), \\
1 - \pi, & \text{if } B^{(1)}_{ij}(n+1) > B^{(2)}_{ij}(n+1),
\end{cases}$$

where $\pi > 0.5$ ($\pi \in (0.75, 0.95)$ is recommended).
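The two-covariate procedure can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the default weights and the seed are hypothetical choices for illustration, `p` stands for the biasing probability in the rule above, and the differences are tracked directly as (patients on treatment 1) minus (patients on treatment 2).

```python
import random

def hu_hu_two_covariates(patients, w=(0.4, 0.2, 0.2, 0.2), p=0.85, seed=0):
    """patients: list of (i, j) level pairs.

    w = (w1 within-stratum, w2 margin of covariate 1, w3 margin of covariate 2, w4 overall).
    """
    rng = random.Random(seed)
    d_stratum, d_m1, d_m2, d_all = {}, {}, {}, 0   # current differences N_1 - N_2
    assignments = []
    for i, j in patients:
        def score(delta):
            # Imbalance score B if this patient went to treatment 1 (delta=+1) or 2 (delta=-1)
            return (w[0] * (d_stratum.get((i, j), 0) + delta) ** 2
                    + w[1] * (d_m1.get(i, 0) + delta) ** 2
                    + w[2] * (d_m2.get(j, 0) + delta) ** 2
                    + w[3] * (d_all + delta) ** 2)
        b1, b2 = score(+1), score(-1)
        prob_1 = 0.5 if b1 == b2 else (p if b1 < b2 else 1 - p)
        delta = 1 if rng.random() < prob_1 else -1
        d_stratum[(i, j)] = d_stratum.get((i, j), 0) + delta
        d_m1[i] = d_m1.get(i, 0) + delta
        d_m2[j] = d_m2.get(j, 0) + delta
        d_all += delta
        assignments.append(1 if delta == 1 else 2)
    return assignments, d_stratum, d_m1, d_m2, d_all

# Hypothetical example: a 2-level biomarker and 100 sites, n = 200.
demo_rng = random.Random(2)
demo_patients = [(demo_rng.randint(0, 1), demo_rng.randint(0, 99)) for _ in range(200)]
result = hu_hu_two_covariates(demo_patients)
print("overall difference:", result[4])
```

Setting `w = (0, w2, w3, 0)` recovers a Pocock-Simon style marginal rule, while `w = (1, 0, 0, 0)` balances within strata only.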


Remarks:

• When weight w1 = 0, w4 = 0, the new design becomes Pocock

and Simon’s procedure.

• When w2 = w3 = w4 = 0, the new design is similar to Stratified

Block Randomization.

• With $w_1, w_2, w_3 > 0$, we can balance both within each stratum and
across covariates.

Theorem 1: (1) Under certain conditions ($w_1 > 0$ and some others;
Hu and Hu, 2012; Hu and Zhang, 2014), $D_n$ (the imbalance matrix) is a
positive recurrent Markov chain. Therefore all three types of
imbalance are $O_p(1)$.

(2) When $w_1 = 0$ (Pocock and Simon’s design), both the marginal and
overall imbalances are $O_p(1)$, but the within-stratum imbalance is
$O_p(n^{1/2})$ (Hu and Zhang, 2014).

The proof is quite difficult because of the correlated structure.

Main techniques: “drift conditions” for Markov chains, Gaussian
approximation and martingales.


Some numerical results:

Case 1: 10 covariates, each with 2 levels: total $2^{10} = 1024$ strata.

Table 1. Average imbalance over 100 simulations, n = 500

Dist of pts across strata          Counts & percentages
# of pts   E(# prop)   Imb   strt(PB)     P-S          New
2          .07         0     50.2(.67)    38.2(.50)    55.1(.74)
                       2     24.9(.33)    37.8(.50)    18.9(.26)
3          .01         1     12(1.00)     9.3(.77)     12.0(.96)
                       3     0(0.00)      2.8(.23)     .5(.04)
(< 2)      .91
overall abs dif              12.8         .76          .90
marginal abs dif             10.4         1.68         1.90

Case 2: 2 covariates: a biomarker with 2 levels and 100 investigation
sites: total 200 strata.

Table 3. Average imbalance over 1000 simulations, n = 200

Dist of pts across strata          Counts & percentages
# of pts   E(# prop)   Imb   strt(PB)       P-S            New
2          .184        0     24.46(.66)     24.15(.65)     30.23(.82)
                       2     12.37(.34)     12.74(.35)     6.46(.18)
3          .06         1     12.02(1.00)    11.18(.92)     12.05(.97)
                       3     0(0.00)        1.02(0.08)     0.35(.03)
(< 2)      .735
overall abs dif              9.39           1.14           1.53
marginal long abs dif        6.57           0.87           1.13
marginal short abs dif       1.00           0.86           0.81


5.3 Examples of Mimicking Real Clinical Data

5.3.1 Toorawa, Adena, et al. (2009)

The four covariates are site, gender, age and disease status, with 20,
2, 2 and 2 levels, respectively, resulting in 160 strata. The distribution
of the covariates is reproduced in Table 2, where the marginal distribution of
site is independent of the joint distribution of the remaining three
covariates.

Table 2: Distribution of Covariates

Sites                 Small (2 sites)                    1/120
                      Medium (16 sites)                  6/120
                      Large (2 sites)                    11/120
Other 3 covariates    Male; < 60; Moderate disease       10/20
                      Male; ≥ 60; Moderate disease       2/20
                      Male; < 60; Severe disease         2/20
                      Male; ≥ 60; Severe disease         2/20
                      Female; < 60; Moderate disease     1/20
                      Female; ≥ 60; Moderate disease     1/20
                      Female; < 60; Severe disease       1/20
                      Female; ≥ 60; Severe disease       1/20


120 patients enter the trial sequentially and their covariates are

independently simulated from the multinomial distribution in Table 2.

We use the same p, q and block size as in the previous two examples.

The weights are specified in the following way:

- NEW: $w_o = w_s = 1/3$ and $w_{m,i} = 1/12$, $i = 1, \ldots, 4$.

- PS: $w_o = w_s = 0$ and $w_{m,i} = 1/4$, $i = 1, \ldots, 4$.


Table 3: Distribution of patients among 160 strata

# of pts within stratum 0 1 2 3 4 and more

# of strata 95.4 38.8 12.7 5.6 7.6

proportion 59.6% 24.3% 7.9% 3.5% 4.7%

Table 3 shows the distribution of 120 patients among 160 strata. In
this case 24.3% of the strata have 1 patient and 11.4% contain 2 or 3
patients. If stratified randomization is employed, the patients in
the 24.3% of strata with a single patient have to be randomized with equal
probabilities. Moreover, the incomplete blocks in strata with 2 or 3 patients
also pose a high risk of large overall imbalance.

The mean absolute imbalances at the three levels are compared in
Tables 4, 5 and 6.


Table 4: Comparison of absolute overall imbalance |Dn|

STR-PB PS NEW

mean 6.70 0.91 0.63

median 6 0 0

95% quan 16 2 2

Table 4 shows the results for the overall imbalance, listing the
mean, median and 95% quantile of $|D_{120}|$. NEW has a
mean, median and 95% quantile of 0.63, 0 and 2, respectively, whereas
PS has a slightly higher mean (0.91) with the same median and 95%
quantile. The three quantities are extremely high
under STR-PB, which is not recommended for this case.

Table 5: Comparison of mean absolute marginal imbalances $E|D_n(i; k_i)|$

Covariate   Margin       STR-PB   PS     NEW
gender      male         5.52     1.10   1.59
            female       3.86     1.06   1.55
age         < 60         4.84     1.08   1.57
            ≥ 60         4.40     1.11   1.23
disease     moderate     5.01     1.10   1.56
            severe       4.35     1.18   1.52
20 sites    2 small      1.45     0.94   1.02
            16 medium    1.44     1.21   1.32
            2 large      1.47     1.33   1.52

Table 5 gives the mean absolute marginal imbalances. For the
covariates gender, age and disease, the table explicitly lists the
mean values on these 6 margins, as each covariate has only two levels.
For example, over the 1000 simulations, the average absolute
difference between the numbers of patients in the two treatment groups
among all male patients is 5.52, 1.10 and 1.59 under STR-PB, PS and NEW,
respectively. Therefore, in this respect PS has the best performance; NEW is
slightly worse, but still tolerable; STR-PB is the worst, since its mean
is as high as 5.52. Similar conclusions can be reached for the other 5
margins. Moreover, for the margins relating to “site”, since there are a
total of 20 margins, we are unable to show the result on each margin
due to the space limit. Hence, these 20 margins are further
categorized into three groups of small, medium and large sizes, and the
mean values in the table are further averaged over the margins within
each group. For example, 1.32 is the mean absolute imbalance over the
16 medium-sized sites as well as over the 1000 simulations. In terms of
imbalances on margins defined by site, PS is still the best, and
STR-PB has similar performance to NEW. This is because each
margin of site contains only 8 strata, hence the “accumulating effect”
of within-stratum imbalances under STR-PB is not as strong.

Table 6: Comparison of absolute within-stratum imbalances
$|D_n(k_1, \cdots, k_I)|$: distribution and mean

# of pts within stratum   $|D_n(k_1, \cdots, k_I)|$   STR-PB   PS     NEW
2                         prob(=0)                    0.68     0.57   0.69
                          prob(=2)                    0.32     0.43   0.31
                          mean                        0.64     0.86   0.62
3                         prob(=1)                    1.00     0.85   0.94
                          prob(=3)                    0.00     0.15   0.06
                          mean                        1.00     1.30   1.12

Table 6 displays the distribution and mean of the absolute within-stratum
imbalances for strata with 2 or 3 patients. For example, among all the
strata that contain 2 patients, the absolute difference is either 0 or 2,
with probability 0.69 for 0 and 0.31 for 2 under NEW, leading to
an average of 0.62. According to this criterion, NEW has the lowest
mean, STR-PB has a slightly larger value, and PS has a mean as large
as 0.86. For strata containing 3 patients, since the block size is 4 for
STR-PB, it is impossible to get an absolute value of 3. Hence, its
mean absolute imbalance is 1, the minimum among the three methods.


In summary, Hu and Hu’s procedure maintains good balance from all

three perspectives and should be favored. We also performed the

simulations under other parameter values. Some of them include: (1)

Changing the weights wo, ws, and wm,i, as well as the block size; (2)

2× 100 strata, representing few covariates but many levels at least for

one covariate; (3) 3× 4× 5× 6 strata, representing a few covariates

and a few levels for each. In all the above settings, our new procedure

shows advantages over the other two methods.


5.3.2 NIDA-CSP-1019 study

The Elkashef et al. (2006) study is a randomized clinical trial conducted to
test the treatment effect of the selegiline transdermal system (STS), a
treatment for cocaine dependence. The trial comprised 300 patients
and involved important covariates such as center (16 centers), age,
gender (1: male, 2: female), depression (measured by the Hamilton
Depression Rating Scale), ADHD (Attention-Deficit/Hyperactivity
Disorder, 1: Yes, 2: No), and cocaine use (the number of
self-reported days of cocaine use in the past 30 days). The raw data
of this study are available on the NIDA website.

Before using the randomization procedures, we discretized age to 1
(0-30), 2 (30-40), 3 (40-50), 4 (50 and above); depression
(Hamilton Depression Rating Scale) to 1 (normal: 0-7), 2 (mild
depression: 8-13), 3 (moderate depression: 14-18), 4 (severe
depression: 19-22), 5 (very severe depression: 23 and above); and
cocaine use to 1 (0-10), 2 (11-20), 3 (20-30). The correlation
coefficients of the six covariates are given in Table 7.


Table 7: Correlation coefficients (Kendall’s tau) of the covariates.

center gender age depression ADHD cocaineuse

center 1.000 0.027 -0.004 -0.044 -0.029 0.021

gender 0.027 1.000 -0.066 0.078 0.013 0.127

age -0.004 -0.066 1.000 -0.028 -0.075 0.009

depression -0.044 0.078 -0.028 1.000 -0.176 0.066

ADHD -0.029 0.013 -0.075 -0.176 1.000 0.040

cocaineuse 0.021 0.127 0.009 0.066 0.040 1.000

Table 7 shows that, among the covariates, gender has the highest calculated
correlation with cocaine use (0.127). Medical studies
[conner2008meta, mcintosh2009adult] suggest the existence of
correlations among depression, ADHD, and cocaine use. Gender,
depression, ADHD, and cocaine use were thus assumed to be
jointly distributed. In addition, center and age were assumed
to be independent of each other and of the rest of the
covariates. The empirical distributions of the covariates used in the
simulation are presented in Tables 8-12.

The values used for the randomization procedures were as follows.
$B_s = 4$ was used for all $s$ under STR-PB, whether cocaine use was
observed or unobserved. $\gamma = 0.85$ was used for both PS and HH,
whether cocaine use was observed or unobserved. When
cocaine use was observed, $w_{m1} = \cdots = w_{m6} = 1/6$ and
($w_o = 0.1$, $w_{m1} = \cdots = w_{m6} = 0.14$, $w_s = 0.06$) were used for PS and
HH, respectively. When cocaine use was unobserved,
$w_{m1} = \cdots = w_{m5} = 1/5$ and
($w_o = 0.15$, $w_{m1} = \cdots = w_{m5} = 0.15$, $w_s = 0.1$) were used for PS and
HH, respectively, so that the weights sum to 1 in each case.


Table 8: Marginal pmf of age.

age 1 2 3 4

pmf 28/300 110/300 135/300 27/300


Table 9: Marginal pmf of center.

center 1 2 3 4 5 6 7 8

pmf 24/300 21/300 28/300 15/300 14/300 18/300 28/300 24/300

center 9 10 11 12 13 14 15 16

pmf 15/300 16/300 20/300 3/300 20/300 17/300 10/300 27/300


Table 10: Joint pmf of gender, depression, ADHD, and cocaine use

I.

gender depression ADHD cocaine use pmf

1 1 1 1 1/300

1 1 1 2 3/300

1 1 1 3 1/300


Table 11: Joint pmf of gender, depression, ADHD, and cocaine use

II.

gender depression ADHD cocaine use pmf

1 1 2 1 26/300

1 1 2 2 54/300

1 1 2 3 30/300

1 2 1 1 1/300

1 2 2 1 10/300

1 2 2 2 24/300

1 2 2 3 18/300

1 3 1 1 2/300

1 3 1 3 1/300

1 3 2 1 4/300

1 3 2 2 13/300

1 3 2 3 12/300

1 4 1 2 2/300

1 4 1 3 2/300

1 4 2 1 5/300

1 4 2 2 3/300

1 4 2 3 8/300

1 5 1 2 1/300

1 5 1 3 2/300

1 5 2 1 1/300

1 5 2 2 7/300

1 5 2 3 3/300


Table 12: Joint pmf of gender, depression, ADHD, and cocaine use
III.

gender depression ADHD cocaine use pmf

2 1 2 1 1/300

2 1 2 2 9/300

2 1 2 3 14/300

2 2 2 1 4/300

2 2 2 2 9/300

2 2 2 3 8/300

2 3 1 1 1/300

2 3 1 2 2/300

2 3 2 2 3/300

2 3 2 3 5/300

2 4 2 1 1/300

2 4 2 2 2/300

2 4 2 3 1/300

2 5 1 2 1/300

2 5 2 1 1/300

2 5 2 2 1/300

2 5 2 3 3/300

Note that the discretization we used resulted in 1,280 observed strata.
In this section, based on the sample size of 300 in this study, we
compare the marginal imbalance of cocaine use = 3 and the
imbalance of a partial stratum of
(gender = 1, depression = 1, ADHD = 2, and cocaine use = 3),
when cocaine use is either observed or unobserved. For simplicity, we
write $D_n(6; s_6)$ to denote the marginal imbalance of cocaine use
when it is observed, and $D_n(1; r_1)$ to denote the marginal imbalance
of cocaine use when it is unobserved. Furthermore, we write $D_n(s^*)$
for the imbalance of the partial stratum of interest when
cocaine use is observed, and $D_n(s^{**}, r_1)$ for the imbalance of the
same partial stratum when cocaine use is unobserved.

The simulation results for the partial stratum and the margin of
cocaine use = 3 are summarized in Table 13 and Table 14. We also
report the percentage reduction in the variance of the observed
covariate imbalance (PRVOCI) for $D_n(s^*)$ and $D_n(6; s_6)$. It is clear
that regardless of whether cocaine use is observed or unobserved, PS
and HH produce a better balance for the partial stratum and the
margin of cocaine use than CR or STR-PB. In particular, the standard
deviations of $n^{-1/2} D_n(6; s_6)$, $n^{-1/2} D_n(1; r_1)$, $n^{-1/2} D_n(s^*)$, and
$n^{-1/2} D_n(s^{**}, r_1)$ under PS and HH are smaller than the
corresponding values under STR-PB and CR.

Table 13: Simulation results for the partial stratum (gender = 1,
depression = 1, ADHD = 2, cocaine use = 3), based on 10,000
runs.

Procedure   $n^{-1/2} D_n(s^*)$         $n^{-1/2} D_n(s^{**}, r_1)$
            mean (s.d.) / PRVOCI        mean (s.d.) / PRVUCI
CR          -.000 (.316) / -            -.000 (.316) / -
STR-PB      .002 (.276) / 23.7%         .008 (.288) / 17.1%
PS          -.000 (.238) / 43.5%        -.002 (.280) / 21.6%
HH          -.005 (.233) / 45.9%        -.003 (.275) / 24.2%

Table 14: Simulation results for cocaine use, based on 10,000 runs.

Procedure   $n^{-1/2} D_n(6; s_6)$      $n^{-1/2} D_n(1; r_1)$
            mean (s.d.) / PRVOCI        mean (s.d.) / PRVUCI
CR          .001 (.601) / -             -.001 (.601) / -
STR-PB      .005 (.556) / 14.6%         .011 (.564) / 12.0%
PS          -.000 (.111) / 96.6%        .001 (.476) / 37.3%
HH          -.000 (.112) / 96.5%        -.003 (.470) / 37.3%


The differences between the PRVOCIs and the PRVUCIs of the partial

stratum and cocaine use under PS and HH are not negligible. For

example, the PRVOCI and the PRVUCI for the marginal imbalance of

cocaine use = 3 under PS are 96.6% and 37.3%, respectively. Indeed,

if one covariate is omitted from CAR, the marginal imbalance of this

covariate generally increases. However, as the PRVUCIs are positive

(37.3%), the results still suggest that CAR procedures perform much

better than CR when cocaine use is omitted in the design.
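The reported percentages are consistent with reading the reduction as one minus the ratio of the variance under a given procedure to the variance under CR. That reading is an assumption inferred from the tabulated numbers, not a definition given in the text, and because the standard deviations are rounded to three decimals some entries are reproduced only approximately.

```python
def pct_variance_reduction(sd_procedure, sd_cr):
    """Percentage reduction in variance relative to complete randomization (assumed definition)."""
    return 100.0 * (1.0 - (sd_procedure / sd_cr) ** 2)

# Standard deviations copied from Table 14 (margin of cocaine use), PS versus CR.
prvoci_ps = pct_variance_reduction(0.111, 0.601)   # cocaine use observed
prvuci_ps = pct_variance_reduction(0.476, 0.601)   # cocaine use unobserved

print(round(prvoci_ps, 1))  # 96.6
print(round(prvuci_ps, 1))  # 37.3
```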

6 New covariate-adaptive randomization procedures: continuous
covariates; many covariates; network structures and others

6.1 Introduction

Example: Remdesivir-COVID-19 trial (China). The Remdesivir
in adults with severe COVID-19 trial (Wang et al. 2020) is a
randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare Remdesivir with placebo. There were 236 patients
in the trial. There are about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of this trial, some important imbalances
existed at enrollment between the groups, including more patients with
hypertension, diabetes, or coronary artery disease in the Remdesivir
group than in the placebo group.


Example: GATE project. Project GATE (Growing America

Through Entrepreneurship), sponsored by the U.S. Department of

Labor, was designed to evaluate the impact of offering tuition-free

entrepreneurship training services (GATE services) on helping clients

create, sustain or expand their own business.

(https://www.doleta.gov/reports/projectgate/)

The cornerstone is complete randomization. Members of the

treatment group were offered GATE services; members of the control

group were not.

• n = 4,198 participants

• p = 105 covariates


Example: Online A/B testing. (Kohavi and Thomke, 2017,

Harvard Business Review) Microsoft, Amazon, Facebook and Google

conduct more than 10,000 online controlled experiments annually, with

many tests engaging millions of users.

Amazon’s experiment.

Treatment A: Credit card offers on front page.

Treatment B: Credit card offers on the shopping cart page.

This (change from A to B) boosted profits by tens of millions of US

Dollars annually.

The data are often networked (dependent, with interference). How
should such studies be designed?


Advantages of covariate balance:

• Improves the accuracy and efficiency of inference.

• Removes bias and increases power.

• Increases the interpretability of results by making the units more
comparable, which enhances credibility.

• Is more robust against model misspecification.

• Rubin (2008): the greatest possible efforts should be made during

the design phase rather than the analysis stage.


• Randomization: an essential tool for evaluating treatment effect.

• Traditional randomization methods (e.g., complete randomization

(CR)): unsatisfactory, unbalanced prognostic or baseline

covariates.

“Most experimenters on carrying out a random

assignment of plots will be shocked to find out how far from

equally the plots distribute themselves.” —Fisher (1926)


What if large p and large n?

• The phenomenon of covariate imbalance is exacerbated as p and n

increase.

• Ubiquitous in the era of big data.

• Example: suppose the probability of one particular covariate being
unbalanced is $\alpha = 5\%$ and the covariates are independent. For a study
with $p = 10$ covariates, the chance of at least one covariate exhibiting
imbalance is $1 - (1 - \alpha)^{p} \approx 40\%$. With 100 covariates, the chance is
$1 - (1 - \alpha)^{100} \approx 99.4\%$, i.e., nearly 1.
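The two probabilities can be checked numerically. The sketch assumes, as the formula does, that the covariates are independent; the function name is hypothetical.

```python
def prob_any_imbalance(alpha, p):
    """Chance that at least one of p independent covariates is unbalanced."""
    return 1 - (1 - alpha) ** p

for p in (10, 100):
    print(p, round(prob_any_imbalance(0.05, p), 3))
# 10 0.401
# 100 0.994
```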


6.2 Rerandomization

Morgan and Rubin (2012) proposed rerandomization.

(1) Collect covariate data.

(2) Specify a balance criterion, $M < a$, i.e., a threshold on the
Mahalanobis distance,

$$M = (\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x}_1 - \bar{x}_2)]^{-1} (\bar{x}_1 - \bar{x}_2),$$

where $\bar{x}_1$ and $\bar{x}_2$ are the covariate sample means of the two treatment groups.

(3) Randomize the units using the complete randomization (CR).

(4) Check the balance criterion, M < a.

• If satisfied, go to Step (5); otherwise, return to Step (3).

(5) Perform the experiment using the final randomization obtained in

Step (4).
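Steps (1) to (5) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: CR is implemented as independent fair coin flips, $\mathrm{cov}(\bar{x}_1 - \bar{x}_2)$ is estimated by the pooled sample covariance times $(1/n_1 + 1/n_2)$, and the function names, the threshold `a` and the `max_tries` cap are hypothetical choices.

```python
import numpy as np

def mahalanobis_imbalance(x, t):
    """M = d^T [cov(d)]^{-1} d, with d = xbar_1 - xbar_2 and cov(d) ~ S * (1/n1 + 1/n2)."""
    x1, x2 = x[t == 1], x[t == 0]
    d = x1.mean(axis=0) - x2.mean(axis=0)
    s = np.atleast_2d(np.cov(x, rowvar=False))
    cov_d = s * (1.0 / len(x1) + 1.0 / len(x2))
    return float(d @ np.linalg.solve(cov_d, d))

def rerandomize(x, a, rng, max_tries=10_000):
    """Repeat complete randomization until the balance criterion M < a holds."""
    n = len(x)
    for tries in range(1, max_tries + 1):
        t = rng.integers(0, 2, size=n)        # Step (3): fair coin per unit
        if t.sum() in (0, n):                 # skip the degenerate one-group case
            continue
        m = mahalanobis_imbalance(x, t)       # Step (4): check M < a
        if m < a:
            return t, m, tries
    raise RuntimeError("no acceptable randomization found within max_tries")

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                 # Step (1): simulated covariates
t, m, tries = rerandomize(x, a=1.0, rng=rng)
print(f"accepted after {tries} tries with M = {m:.3f}")
```

A smaller threshold `a` gives better balance but a lower acceptance probability, so the loop runs longer; this is the scaling problem that grows with the number of covariates.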


Advantages:

• Desirable properties for causal inference:

– Reduction in variance of estimated treatment effect.

• Works well with a few covariates.

Drawbacks:

• Not suited to sequential experiments.

• Does not scale to massive data.

• As $p$ increases, the probability of acceptance $p_a = P(M < a)$
decreases, causing the RR to remain in the loop for a long time.


Examples of Rerandomization.


6.3 Covariate-Adaptive Randomization via

Mahalanobis Distance (CAM)

$x_i \in R^p$: covariate of the $i$-th unit.

$T_i \in \{1, 0\}$: treatment assignment of the $i$-th unit.

• $T_i = 1$: treatment 1.

• $T_i = 0$: treatment 2.

$i = 1, \ldots, n$

(1) Use the newly defined Mahalanobis distance

$$M(n) = 0.25\,(\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x})]^{-1} (\bar{x}_1 - \bar{x}_2).$$

(2) Randomly arrange the units in a sequence of pairs:

$$\underbrace{x_1, x_2}_{\text{1st pair}},\ \underbrace{x_3, x_4}_{\text{2nd pair}},\ \underbrace{x_5, x_6}_{\text{3rd pair}},\ \ldots,\ x_n.$$

(3) Assign the 1st pair: $T_1 = 1$, $T_2 = 0$.

(4) For the next pair, i.e., the $(2i+1)$-th and $(2i+2)$-th units ($i \ge 1$):

(4a) If $T_{2i+1} = 1$ and $T_{2i+2} = 0$, obtain the “potential” $M_i^{(1)}$.

(4b) If $T_{2i+1} = 0$ and $T_{2i+2} = 1$, obtain the “potential” $M_i^{(2)}$.

(5) Assign the $(2i+1)$-th and $(2i+2)$-th units by

$$P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots) =
\begin{cases}
q, & \text{if } M_i^{(1)} < M_i^{(2)}, \\
1 - q, & \text{if } M_i^{(1)} > M_i^{(2)}, \\
0.5, & \text{if } M_i^{(1)} = M_i^{(2)},
\end{cases}$$

$$P(T_{2i+1} = 0, T_{2i+2} = 1 \mid x_{2i}, T_{2i}, \ldots) = 1 - P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots),$$

where

• $0.5 < q < 1$.

• Note: $T_{2i+1} = T_{2i+2}$ (both 0 or both 1) is not allowed.

(6) Repeat Steps (4) and (5) until all units are assigned.
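The pairwise procedure in Steps (1) to (6) can be sketched with NumPy. This is an illustrative sketch, not the authors' implementation: it assumes all covariates are available up front so that the covariance can be estimated once from the full sample, reads $\mathrm{cov}(\bar{x})$ after $k$ units as that covariance divided by $k$, requires an even $n$, and uses hypothetical names (`cam_assign`, `m_stat`).

```python
import numpy as np

def cam_assign(x, q=0.75, seed=0):
    """Pairwise sequential assignment; x has shape (n, p) with n even. Returns (ordered x, T)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    x = x[rng.permutation(n)]                          # Step (2): random order
    s_inv = np.linalg.inv(np.atleast_2d(np.cov(x, rowvar=False)))
    t = np.empty(n, dtype=int)
    t[0], t[1] = 1, 0                                  # Step (3): first pair

    def m_stat(k):
        # M after the first k units: 0.25 * d^T [cov(x) / k]^{-1} d
        xk, tk = x[:k], t[:k]
        d = xk[tk == 1].mean(axis=0) - xk[tk == 0].mean(axis=0)
        return 0.25 * k * float(d @ s_inv @ d)

    for i in range(1, n // 2):
        a, b = 2 * i, 2 * i + 1                        # 0-based positions of units 2i+1, 2i+2
        t[a], t[b] = 1, 0
        m1 = m_stat(b + 1)                             # Step (4a): potential M^(1)
        t[a], t[b] = 0, 1
        m2 = m_stat(b + 1)                             # Step (4b): potential M^(2)
        prob_10 = 0.5 if m1 == m2 else (q if m1 < m2 else 1 - q)
        if rng.random() < prob_10:                     # Step (5)
            t[a], t[b] = 1, 0
        # otherwise the pair keeps (0, 1), which is already stored
    return x, t

x_raw = np.random.default_rng(1).normal(size=(200, 2))
x_ordered, t = cam_assign(x_raw)
print(t.sum(), len(t) - t.sum())   # equal group sizes by construction
```

Because every pair contributes exactly one unit to each treatment, the group sizes are always equal; only the covariate balance is random.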


• A smaller value of M(n) indicates a better covariate balance.

• q = 0.75. More discussion in Hu and Hu (2012).

• Units are not observed sequentially; however, we allocate them

sequentially (in pairs).

• Better covariate balance.

• n! different possible sequences. Similar performance.


Properties of CAM

Under CAM, suppose the $x_i$ are i.i.d. multivariate normal; then

$$M(n) = O_p(n^{-1}).$$

Note:

• Under CR, $M_{CR}(n) \sim \chi^2_p$, a stationary chi-square distribution with $p$ degrees of freedom, regardless of $n$.

• Under RR, $M_{RR}(n) \sim \chi^2_p \mid \chi^2_p < a$, a stationary chi-square distribution with $p$ degrees of freedom conditional on $M_{RR}(n) < a$, regardless of $n$.

• Under CAM, $M(n) \to 0$ at the rate of $1/n$.

– More units, better balance.

– Advantages of CAM in large n.
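A rough check of the chi-square claim for CR: the sketch below assumes the $x_i$ are i.i.d. with known covariance $\Sigma$ and that the two groups have about $n/2$ units each. These simplifications are made here for illustration, so the result is an approximation. It uses $\mathrm{cov}(\bar{x}) = \Sigma / n$, so that $M(n) = (n/4)(\bar{x}_1 - \bar{x}_2)^T \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)$.

```latex
\begin{align*}
\mathrm{cov}(\bar{x}_1 - \bar{x}_2)
  &\approx \frac{\Sigma}{n/2} + \frac{\Sigma}{n/2} = \frac{4}{n}\,\Sigma, \\
z &= \frac{\sqrt{n}}{2}\,\Sigma^{-1/2}(\bar{x}_1 - \bar{x}_2) \approx N(0, I_p), \\
M_{CR}(n) &= \frac{n}{4}\,(\bar{x}_1 - \bar{x}_2)^T \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)
  = z^T z \approx \chi^2_p,
  \qquad E[M_{CR}(n)] \approx p .
\end{align*}
```

The factor $n/4$ exactly cancels the $4/n$ shrinkage of the mean difference, which is why the imbalance under CR does not improve with $n$, whereas under CAM it is $O_p(n^{-1})$.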

Properties of CAM

As p increases,

• Under CR, the stationary distribution becomes flatter, poorer

covariate balance.

• Under RR, the stationary distribution becomes flatter, poorer

covariate balance.

• Under CAM, M(n)→ 0 at the rate of 1/n, regardless of p.

– The effect of p on M(n) is less severe than CR and RR.

Properties of CAM

• Adaptive based on covariates.

• Works for sequential experiments: just estimate the covariance matrix sequentially.

• Handles large p and large n.

• Better covariate balance.

• Less computational time.

6.3.1 Estimating treatment effect

A natural setup of A/B testing:

• The observed outcome $y_i$, $i = 1, \ldots, n$, for each unit.

• Let $y_i(T_i)$ represent the potential outcome of the $i$-th unit under the treatment $T_i$.

• $y_i = y_i(1) T_i + y_i(0)(1 - T_i)$.

• The average treatment effect is

$$\tau = \frac{\sum_{i=1}^{n} y_i(1)}{n} - \frac{\sum_{i=1}^{n} y_i(0)}{n}.$$

• The fundamental problem: we only observe $y_i(T_i)$ for one particular $T_i$; therefore, $\tau$ cannot be calculated directly.

A natural estimate, $\hat{\tau}$:

$$\hat{\tau} = \frac{\sum_{i=1}^{n} T_i y_i}{\sum_{i=1}^{n} T_i} - \frac{\sum_{i=1}^{n} (1 - T_i) y_i}{\sum_{i=1}^{n} (1 - T_i)}.$$

• $\hat{\tau}$ could be bad with imbalance in covariates.

• Example: estimating a drug effect when one treatment group is predominantly male and the other predominantly female. The gender effect cannot be removed.
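The estimator and the gender example can be illustrated with a small sketch. The data below are made up for illustration: the outcome depends only on gender (1 for male, 0 for female), so the true treatment effect is zero, and the function name `difference_in_means` is hypothetical.

```python
def difference_in_means(y, t):
    """tau_hat: mean outcome of treated units minus mean outcome of controls."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    if not treated or not control:
        raise ValueError("both groups must be non-empty")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical data: first four units are male (y = 1), last four female (y = 0).
y = [1, 1, 1, 1, 0, 0, 0, 0]
t_imbalanced = [1, 1, 1, 0, 1, 0, 0, 0]   # treated group is 3 males, 1 female
t_balanced = [1, 1, 0, 0, 1, 1, 0, 0]     # 2 males and 2 females per group

tau_imbalanced = difference_in_means(y, t_imbalanced)   # 0.75 - 0.25 = 0.5
tau_balanced = difference_in_means(y, t_balanced)       # 0.5 - 0.5 = 0.0
```

With the imbalanced allocation the estimate is 0.5 even though the treatment does nothing; with gender balanced across groups the estimate is 0.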

Theoretical properties:

(1) Unbiasedness: under CAM, $E(\hat{\tau}) = \tau$.

(2) Under CAM, $\mathrm{Var}(\hat{\tau})$ attains the lower bound asymptotically.

(3) This implies that

$$\mathrm{Var}_{CAM}(\hat{\tau}) < \mathrm{Var}_{RR}(\hat{\tau}) < \mathrm{Var}_{CR}(\hat{\tau}).$$

6.3.2 Examples

Real Data Example I - Project GATE (Example 4)

• Two treatment groups:

Treatment: offered GATE services; control: not offered GATE services.

• p = 105 covariates (obtained from the application packages; 13 continuous and 92 categorical).

• Sample size n = 3,448 (out of 4,198 participants, those who answered the evaluation survey 6 months after the assignment).

• Original allocation M = 75.27, moderate covariate imbalance.

• We repeat the allocation 1,000 times for these participants using

CAM, complete randomization and rerandomization.

CAM vs Rerandomization

The maximum Mahalanobis distance obtained from CAM is 12. If we set the balance criterion for rerandomization to $M < 12$, the probability of acceptance is $P_a = P(\chi^2_{105} < 12) = 3.4 \times 10^{-31}$, which means it is nearly impossible for rerandomization to achieve a balance level similar to CAM.

We set $P_a = 2 \times 10^{-5}$ for rerandomization so that its computational time is similar to that of CAM.
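The acceptance probability can be checked numerically. The sketch below uses only the Python standard library and evaluates the chi-square CDF through the power series of the regularized lower incomplete gamma function. The helper `chi2_cdf` is a hypothetical name, and the series is only a convenient choice for small arguments like this one, not a general-purpose routine.

```python
import math


def chi2_cdf(x, df, max_terms=1000):
    """P(chi^2_df < x) via the series
    P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_k z^k / ((a+1)...(a+k)),
    with a = df / 2 and z = x / 2. Converges quickly when z is small relative to a."""
    a, z = df / 2.0, x / 2.0
    if z <= 0.0:
        return 0.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= z / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return math.exp(log_prefactor) * total


p_accept = chi2_cdf(12.0, 105)          # acceptance probability for M < 12, p = 105
expected_draws = 1.0 / 2e-5             # mean number of draws when P_a = 2e-5
```

Since the number of rerandomization draws until acceptance is geometric with mean $1/P_a$, a threshold with $P_a = 2 \times 10^{-5}$ needs about 50,000 draws on average, while the threshold $M < 12$ would need on the order of $10^{30}$.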

Comparison of Mahalanobis Distance

[Figure: densities of the Mahalanobis distance (horizontal axis: Mahalanobis distance, 0 to 140; vertical axis: density) under complete randomization, CAM and rerandomization.]

Estimation.

• The outcome variable (0/1): has owned a business within 6

months after assignment or not.

• After the allocation, we simulate the outcome variable according to

$$\mathrm{logit}\, P(y_i^{sim} = 1) = \hat{\mu}_1 T_i^{sim} + \hat{\mu}_2 (1 - T_i^{sim}) + x_i^T \hat{\beta} + \epsilon^{sim},$$

where $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\beta}$ are obtained from fitting a regression to the original data, and $\epsilon^{sim}$ is drawn from the residuals of that regression.

Compare the estimation performance, measured by the percent reduction in variance (PRIV), of CAM and rerandomization.

Method PRIV un or va

CAM 17.7% 0.081

Rerandomization 10.5% 0.505

6.4 Network A/B Testing

Let a graph G be represented by an $n \times n$ symmetric adjacency matrix $A = [A_{ij}]$.

Balancing the network, i.e., balancing n-dimensional binary vectors, is hard.

Zhou, Li and Hu (2019) proposed several methods and discussed their

theoretical and finite sample properties.

New methods vs Complete randomization (MSE)

[Figure: standard deviation versus number of nodes (100 to 600) for four allocation methods: random, adaptive row, adaptive submatrix and coordinate descent.]

New methods vs Complete randomization (ATE)

[Figure: density of the bias (-4 to 3) for random, adaptive row, adaptive submatrix and coordinate descent.]

New methods vs Complete randomization (MSE)

[Figure: standard deviation versus number of nodes (0 to 500) for random and adaptive.]

New methods vs Complete randomization (ATE)

[Figure: density of the bias (-2 to 2) for random, adaptive row, adaptive submatrix and coordinate descent.]

6.5 Balance Covariates based on general Kernels

CAM only balances the covariate means of the two groups. Covariance structure is also important in statistical analysis. Therefore, Ma, Li and Hu (2021) proposed the following distance measure, which combines both mean and covariance differences:

$$IBT(n) = (\bar{x}_1 - \bar{x}_2)^T \mathrm{cov}(x)^{-1} (\bar{x}_1 - \bar{x}_2) + \mathrm{trace}\left\{ (\hat{\Sigma}_1 - \hat{\Sigma}_2)^2 \right\} / p,$$

where $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$ are the sample covariance matrices for the two treatment groups.
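The distance measure can be computed directly from its definition. The sketch below assumes NumPy and that cov(x) is the sample covariance of all units pooled. The function name `mean_and_covariance_imbalance` is hypothetical, and the code only evaluates the measure for a given allocation; it is not the authors' allocation procedure.

```python
import numpy as np


def mean_and_covariance_imbalance(x, t):
    """IBT = d^T cov(x)^{-1} d + trace((S1 - S2)^2) / p, where d is the difference
    in group means and S1, S2 are the group sample covariance matrices.
    x has shape (n, p); t is a 0/1 vector (1 = treatment 1, 0 = treatment 2)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t)
    p = x.shape[1]
    x1, x2 = x[t == 1], x[t == 0]
    d = x1.mean(axis=0) - x2.mean(axis=0)
    # Assumption: cov(x) is estimated from all units pooled together.
    cov_inv = np.linalg.inv(np.atleast_2d(np.cov(x, rowvar=False)))
    mean_term = float(d @ cov_inv @ d)
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    diff = s1 - s2
    trace_term = float(np.trace(diff @ diff)) / p
    return mean_term + trace_term
```

Two groups with equal means but different spreads give a zero first term and a positive second term, which is the kind of imbalance that a mean-only criterion cannot see.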

New method vs CAM vs Complete randomization

[Figure: densities of the Mahalanobis distance (horizontal axis 0 to 8) under CAM, Trace+ and CR, in nine panels for n = 100, 300, 500 and p = 2, 6, 10.]

New method vs CAM vs Complete randomization

[Figure: densities of Trace((Sigma_1−Sigma_0)^2)/p (horizontal axis 0.0 to 0.4) under CAM, Trace+ and CR, in nine panels for n = 100, 300, 500 and p = 2, 6, 10.]

Ma, Li and Hu (2021):

(i) a general framework of kernel covariate adaptive randomization to

attain covariate balance for a large class of functions that reside in a

high-dimensional or even infinite-dimensional space;

(ii) With the kernel trick commonly used in machine learning, the

framework unifies several recently proposed covariate adaptive designs

and generalizes to a much broader family with imbalance measures

defined in a consistent manner;

(iii) the convergence rate of covariate imbalance is bounded in

probability;

and (iv) balance covariance matrices between treatments, which shows

excellent and robust performance in finite samples.

6.6 Examples

Example: Remdesivir-COVID-19 trial (China). Remdesivir

in adults with severe COVID-19 trial (Wang et al. 2020) is a

randomized, double-blind, placebo-controlled, multicentre trial that

aimed to compare Remdesivir with placebo. There were 236 patients

in the trial. There are about 20 baseline covariates for each patient,

including 10 continuous variables (e.g. age and White blood cell

count) and 10 discrete variables (e.g. gender and Hypertension). The

stratified (according to the level of respiratory support) permuted

block (30 patients per block) randomization procedure was

implemented. At the end of this trial, some important imbalances

existed at enrollment between the groups, including more patients with

hypertension, diabetes, or coronary artery disease in the Remdesivir

group than the placebo group.

Example: Moderna COVID-19 vaccine trial (2020). The

trial began on July 27, 2020, and enrolled 30,420 adult volunteers at

clinical research sites across the United States. Volunteers were

randomly assigned 1:1 to receive either two 100 microgram (mcg)

doses of the investigational vaccine or two shots of saline placebo 28

days apart. The average age of volunteers is 51 years. Approximately

47% are female, 25% are 65 years or older and 17% are under the age

of 65 with medical conditions placing them at higher risk for severe

COVID-19. Approximately 79% of participants are white, 10% are

Black or African American, 5% are Asian, 0.8% are American Indian or

Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%

are multiracial, and 21% (of any race) are Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators

recorded 196 cases of symptomatic COVID-19 occurring among

participants at least 14 days after they received their second shot. One

hundred and eighty-five cases (30 of which were classified as severe

COVID-19) occurred in the placebo group and 11 cases (0 of which

were classified as severe COVID-19) occurred in the group receiving

mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%

lower in those participants who received mRNA-1273 as compared to

those receiving placebo.
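The 94.1% figure can be approximated from the case counts alone. The sketch below assumes equal follow-up in the two arms (plausible under 1:1 randomization, but an assumption made here), so that the ratio of case counts stands in for the ratio of incidence rates. The function name `crude_vaccine_efficacy` is hypothetical.

```python
def crude_vaccine_efficacy(cases_vaccine, cases_placebo):
    """1 - (cases in vaccine arm) / (cases in placebo arm).
    Assumes both arms have the same amount of follow-up."""
    if cases_placebo <= 0:
        raise ValueError("placebo arm must have at least one case")
    return 1.0 - cases_vaccine / cases_placebo


# Primary analysis counts quoted above: 11 cases on mRNA-1273, 185 on placebo.
efficacy_percent = round(100.0 * crude_vaccine_efficacy(11, 185), 1)
```

Here `efficacy_percent` evaluates to 94.1, matching the reported reduction in incidence.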

Investigators observed 236 cases of symptomatic COVID-19 among

participants at least 14 days after they received their first shot, with

225 cases in the placebo group and 11 cases in the group receiving

mRNA-1273. The vaccine efficacy was 95.2% for this secondary

analysis.

Example: PFIZER-BIONTECH COVID-19 VACCINE.

Safety and Efficacy of the BNT162b2 mRNA Covid-19

Vaccine (2020).

BACKGROUND Severe acute respiratory syndrome coronavirus 2

(SARS-CoV-2) infection and the resulting coronavirus disease 2019

(Covid-19) have afflicted tens of millions of people in a worldwide

pandemic. Safe and effective vaccines are needed urgently.

METHODS In an ongoing multinational, placebo-controlled,

observer-blinded, pivotal efficacy trial, we randomly assigned persons

16 years of age or older in a 1:1 ratio to receive two doses, 21 days

apart, of either placebo or the BNT162b2 vaccine candidate (30 μg per

dose). BNT162b2 is a lipid nanoparticle–formulated,

nucleoside-modified RNA vaccine that encodes a prefusion stabilized,

membrane-anchored SARS-CoV-2 full-length spike protein. The

primary end points were efficacy of the vaccine against

laboratory-confirmed Covid-19 and safety.

A total of 43,548 participants underwent randomization, of whom

43,448 received injections: 21,720 with BNT162b2 and 21,728 with

placebo. There were 8 cases of Covid-19 with onset at least 7 days

after the second dose among participants assigned to receive

BNT162b2 and 162 cases among those assigned to placebo;

BNT162b2 was 95% effective in preventing Covid-19 (95% credible

interval, 90.3 to 97.6). Similar vaccine efficacy (generally 90 to 100%)

was observed across subgroups defined by age, sex, race, ethnicity,

baseline body-mass index, and the presence of coexisting conditions.

Among 10 cases of severe Covid-19 with onset after the first dose, 9

occurred in placebo recipients and 1 in a BNT162b2 recipient. The

safety profile of BNT162b2 was characterized by short-term,

mild-to-moderate pain at the injection site, fatigue, and headache.

The incidence of serious adverse events was low and was similar in the

vaccine and placebo groups.

CONCLUSIONS A two-dose regimen of BNT162b2 conferred 95%

protection against Covid-19 in persons 16 years of age or older. Safety

over a median of 2 months was similar to that of other viral vaccines.

(Funded by BioNTech and Pfizer; ClinicalTrials.gov number,

NCT04368728.)

Example: REAL-WORLD EVIDENCE CONFIRMS

HIGH EFFECTIVENESS OF PFIZER-BIONTECH

COVID-19 VACCINE AND PROFOUND PUBLIC

HEALTH IMPACT OF VACCINATION ONE YEAR

AFTER PANDEMIC DECLARED.

The Israel Ministry of Health (MoH), Pfizer Inc. (NYSE: PFE) and

BioNTech SE (Nasdaq: BNTX) today announced real-world evidence

demonstrating dramatically lower incidence rates of COVID-19 disease

in individuals fully vaccinated with the Pfizer-BioNTech COVID-19

Vaccine (BNT162b2), underscoring the observed substantial public

health impact of Israel’s nationwide immunization program. These

new data build upon and confirm previously released data from the

MoH demonstrating the vaccine’s effectiveness in preventing

symptomatic SARS-CoV-2 infections, COVID-19 cases,

hospitalizations, severe and critical hospitalizations, and deaths. The

latest analysis from the MoH proves that two weeks after the second

vaccine dose protection is even stronger – vaccine effectiveness was at

least 97% in preventing symptomatic disease, severe/critical disease

and death. This comprehensive real-world evidence can be of

importance to countries around the world as they advance their own

vaccination campaigns one year after the World Health Organization

(WHO) declared COVID-19 a pandemic.

Findings from the analysis were derived from de-identified aggregate

Israel MoH surveillance data collected between January 17 and March

6, 2021, when the Pfizer-BioNTech COVID-19 Vaccine was the only

vaccine available in the country and when the more transmissible

B.1.1.7 variant of SARS-CoV-2 (formerly referred to as the U.K.

variant) was the dominant strain. Vaccine effectiveness was at least

97% against symptomatic COVID-19 cases, hospitalizations, severe

and critical hospitalizations, and deaths. Furthermore, the analysis

found a vaccine effectiveness of 94% against asymptomatic

SARS-CoV-2 infections. For all outcomes, vaccine effectiveness was

measured from two weeks after the second dose.

Following the authorization for emergency use of the Pfizer-BioNTech

COVID-19 Vaccine in Israel on December 6, 2020, the Israel MoH

launched a national vaccination program targeting individuals age 16

years or older – a total of 6.4 million people, representing 71% of the

population. The vaccination program started at the beginning of a

large surge of SARS-CoV-2 infections in Israel, which later resulted in

a national lockdown starting on January 8, 2021.

This MoH analysis uses de-identified aggregate Israel MoH public

health surveillance data from January 17 through March 6, 2021

(analysis period); the start of the analysis period corresponds to seven

days after individuals began receiving second doses of the

Pfizer-BioNTech COVID-19 Vaccine. MoH regularly collects

comprehensive, real-time data on SARS-CoV-2 testing, COVID-19

cases including date of symptom onset, and vaccination history

through a nationally notifiable disease registry and the national

medical record database.

Vaccine effectiveness estimates – adjusted to account for variances in

age, gender and the week specimens were collected – were determined

for the prevention of six laboratory-confirmed SARS-CoV-2 outcomes

comparing unvaccinated and fully-vaccinated individuals: SARS-CoV-2

infections (includes symptomatic and asymptomatic infections);

asymptomatic SARS-CoV-2 infections; COVID-19 cases (symptomatic

only); COVID-19 hospitalizations; severe (respiratory distress,

including >30 breaths per minute, oxygen saturation on room air <94%,

and/or ratio of arterial partial pressure of oxygen to fraction of inspired

oxygen <300 mm mercury) and critical (mechanical ventilation, shock,

and/or heart, liver or kidney failure) COVID-19 hospitalizations; and

COVID-19 deaths.

The MoH analysis was conducted when more than 80% of tested

specimens in Israel were variant B.1.1.7, providing real-world evidence

of the effectiveness of BNT162b2 for prevention of COVID-19

infections, hospitalizations, and deaths due to variant B.1.1.7.

However, this analysis was not able to evaluate vaccine effectiveness

against B.1.351 (formerly referred to as the South African variant) due

to the limited number of infections caused by this strain in Israel at

the time the analysis was conducted.

The vaccine effectiveness estimates align with the 95% vaccine efficacy

of BNT162b2 against COVID-19 demonstrated in the pivotal

Randomized Clinical Trial (RCT) of BNT162b2. However, this

observational analysis differs from the RCT in several aspects. Vaccine

effectiveness estimates may be affected by differences between

vaccinated and unvaccinated persons (i.e., different test-seeking

behaviors or levels of adherence to preventive measures). In the RCT,

randomization minimized the impact of differences between vaccinated

and unvaccinated. Despite efforts to adjust for these effects in the

available dataset, the possibility remains of unmeasured distortions.

For example, findings from the Maccabi HMO indicate that

neighborhood may be an important factor. Further vaccine

effectiveness analyses investigating the effect of additional covariates

such as location, comorbidities, race/ethnicity, and likelihood of

seeking SARS-CoV-2 testing are warranted.

Pfizer-BioNTech’s coronavirus vaccine offers more protection than

earlier thought, with effectiveness in preventing symptomatic disease

reaching 97%, according to real-world evidence published Thursday by

the pharma companies.

Using data from January 17 to March 6 from Israel’s national

vaccination campaign, Pfizer-BioNTech found that prevention against

asymptomatic disease also reached 94 percent.

“We are extremely encouraged that the real-world effectiveness data

coming from Israel are confirming the high efficacy demonstrated in

our Phase 3 clinical trial and showing the significant impact of the

vaccine in preventing severe disease and deaths due to COVID-19,”

said Luis Jodar, Ph.D., senior vice president and chief medical officer

of Pfizer Vaccines.

7 Statistical Inference after

covariate-adaptive randomization

7.1 Some concerns

First we consider simulations to study the Type I error of hypothesis testing for comparing treatment effects under three designs: Pocock and Simon’s marginal procedure, stratified permuted block design, and complete randomization. For each type of design, both the continuous case and the discrete case are considered. The following linear model (including two covariates $Z_1$ and $Z_2$) is assumed for the responses $Y_i$:

$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 Z_{i,1} + \beta_2 Z_{i,2} + \varepsilon_i,$$

where $\varepsilon_i$ is distributed as $N(0, 1)$ and $\beta_1 = \beta_2 = 1$. No difference in treatment effects is assumed to study Type I error, i.e., $\mu_1 = \mu_2$.

For the discrete case, $Z_1$ follows Bernoulli($p_1$) and $Z_2$ follows Bernoulli($p_2$); for the continuous case, both $Z_1$ and $Z_2$ follow the normal distribution $N(0, 1)$. If the covariates $Z_1$ and $Z_2$ are continuous, they are discretized into Bernoulli variables $Z'_1$ and $Z'_2$ with the probabilities $p_1$ and $p_2$ in order to be used in randomization. More specifically, if $Z_1 < Z(p_1)$, where $Z(p_1)$ is the $p_1$ quantile of the standard normal distribution, then $Z'_1 = 0$; otherwise $Z'_1 = 1$. The original variables (without discretization) are used in the statistical inference procedures.
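The discretization rule can be written in a few lines. The sketch below uses the standard library's `statistics.NormalDist` for the normal quantile; the function name `discretize` is hypothetical.

```python
from statistics import NormalDist


def discretize(z, p):
    """Return Z' = 0 if z is below the p quantile of N(0, 1), else Z' = 1."""
    cutoff = NormalDist().inv_cdf(p)   # Z(p): the p quantile of the standard normal
    return 0 if z < cutoff else 1


# With p = 0.5 the cutoff is the median, so the sign of z decides the stratum.
strata = [discretize(z, 0.5) for z in (-1.2, -0.1, 0.3, 2.0)]   # [0, 0, 1, 1]
```

Only the randomization uses the discretized values; the analysis models still use the original continuous covariates.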

To carry out simulations, the biased coin probability 0.75 and equal

weights are used for Pocock and Simon’s marginal procedure, and the

block size 4 is used for stratified permuted block design. The

significance level is α = 0.05 and sample size N is 100, 200 or 500.

The hypothesis tests include the two sample t-test (t-test), the linear

model with a single covariate Z1 (lm(Z1)), the linear model with a

single covariate Z2 (lm(Z2)) and the linear model with both

covariates Z1 and Z2 (lm(Z1, Z2)). By choosing (p1, p2) = (0.5, 0.5),

the simulation results for Pocock and Simon’s marginal procedure,

stratified permuted block design and complete randomization are

shown in Tables 15 and 16.

In each simulation, the Type I error of covariate-adaptive randomization methods is also examined with the bootstrap t-test described in Shao, Yu and Zhong (2010). To do the test, $B$ bootstrap samples $(Y_1^{*b}, Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Y_N^{*b}, Z_{N,1}^{*b}, Z_{N,2}^{*b})$, $b = 1, 2, \ldots, B$, are generated independently as simple random samples with replacement from $(Y_1, Z_{1,1}, Z_{1,2}), \ldots, (Y_N, Z_{N,1}, Z_{N,2})$. The covariate-adaptive procedure used on the original data is applied to the covariates of each bootstrap sample, $(Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Z_{N,1}^{*b}, Z_{N,2}^{*b})$, from which the bootstrap analogues of the treatment assignments, $I_1^{*b}, \ldots, I_N^{*b}$, can be obtained.

Define

$$\bar{Y}_1 - \bar{Y}_2 = \frac{1}{n_1} \sum_{i=1}^{N} I_i Y_i - \frac{1}{n_2} \sum_{i=1}^{N} (1 - I_i) Y_i, \qquad n_1 = \sum_{i=1}^{N} I_i, \quad n_2 = N - n_1,$$

and

$$\hat{\theta}^{*}(b) = \frac{1}{n_1^{*b}} \sum_{i=1}^{N} I_i^{*b} Y_i^{*b} - \frac{1}{n_2^{*b}} \sum_{i=1}^{N} (1 - I_i^{*b}) Y_i^{*b}, \qquad n_1^{*b} = \sum_{i=1}^{N} I_i^{*b}, \quad n_2^{*b} = \sum_{i=1}^{N} (1 - I_i^{*b}).$$

The bootstrap estimator of the variance of $\bar{Y}_1 - \bar{Y}_2$ is then the sample variance of $\hat{\theta}^{*}(b)$, $b = 1, 2, \ldots, B$, denoted by $\hat{v}_B$. The bootstrap t-test then has the form $T_B = (\bar{Y}_1 - \bar{Y}_2) / \hat{v}_B^{1/2}$. In Shao, Yu and Zhong (2010), it is shown that the bootstrap t-test can maintain the nominal Type I error under the covariate-adaptive biased coin design. $B = 500$ is used in all following simulations.
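The bootstrap t-statistic described above can be sketched with the standard library only. The allocation rule `permuted_block_assign` (stratified permuted blocks on the covariate values) is an illustrative stand-in chosen here; in practice the same covariate-adaptive procedure used for the original data would be passed in. All function names are hypothetical.

```python
import math
import random


def mean_difference(y, t):
    """Ybar_1 - Ybar_2 for outcomes y and 0/1 assignments t."""
    n1 = sum(t)
    n2 = len(t) - n1
    s1 = sum(yi for yi, ti in zip(y, t) if ti == 1)
    s2 = sum(yi for yi, ti in zip(y, t) if ti == 0)
    return s1 / n1 - s2 / n2


def permuted_block_assign(z, rng, block_size=4):
    """Illustrative design: permuted blocks within each stratum defined by z."""
    pending = {}
    t = []
    for stratum in z:
        if not pending.get(stratum):
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[stratum] = block
        t.append(pending[stratum].pop())
    return t


def bootstrap_t(y, z, t, assign, B=500, rng=random):
    """T_B = (Ybar_1 - Ybar_2) / sqrt(v_B), where v_B is the sample variance of
    the bootstrap mean differences theta*(b), b = 1, ..., B."""
    n = len(y)
    thetas = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # resample (Y, Z) with replacement
        y_star = [y[i] for i in idx]
        z_star = [z[i] for i in idx]
        t_star = assign(z_star, rng)                 # re-run the design on Z*
        thetas.append(mean_difference(y_star, t_star))
    mean_theta = sum(thetas) / B
    v_b = sum((th - mean_theta) ** 2 for th in thetas) / (B - 1)
    return mean_difference(y, t) / math.sqrt(v_b)
```

The null hypothesis of no treatment difference would be rejected at the 5% level when $|T_B|$ exceeds the standard normal critical value 1.96.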

Table 15: Simulated Type I error (in %) for Pocock and Simon’s marginal procedure (PS), stratified permuted block design (SPB) and complete randomization (CR), discrete covariates. Simulations based on 10000 runs.

Z         Method  N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)  BS-t

Discrete  PS      100  1.75    3.05    3.09    5.21        5.18

Discrete  PS      200  1.62    2.78    2.86    4.99        4.88

Discrete  PS      500  1.66    2.81    2.77    4.87        4.90

Discrete  SPB     100  1.85    2.86    3.05    5.29        5.67

Discrete  SPB     200  1.54    2.69    2.73    4.84        4.95

Discrete  SPB     500  1.55    2.77    2.65    4.84        5.60

Discrete  CR      100  5.04    5.27    5.11    5.31        -

Discrete  CR      200  5.00    4.95    5.12    5.21        -

Discrete  CR      500  4.73    4.83    4.68    4.77        -

Table 16: Simulated Type I error (in %) for Pocock and Simon’s marginal procedure (PS), stratified permuted block design (SPB) and complete randomization (CR), continuous covariates. Simulations based on 10000 runs.

Z           Method  N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)  BS-t

Continuous  PS      100  1.43    2.15    2.02    4.98        5.16

Continuous  PS      200  1.07    1.74    1.80    4.53        5.62

Continuous  PS      500  0.91    1.72    1.73    4.72        4.79

Continuous  SPB     100  1.22    1.83    2.05    5.01        5.68

Continuous  SPB     200  0.98    1.86    1.77    5.08        5.19

Continuous  SPB     500  1.15    1.98    1.84    5.48        5.61

Continuous  CR      100  5.20    5.31    4.82    4.92        -

Continuous  CR      200  5.06    5.14    4.85    5.46        -

Continuous  CR      500  4.87    5.05    4.71    4.77        -

Several conclusions can be drawn from Tables 15 and 16. First, the Type I error

is close to 5% under the full model lm(Z1, Z2). This coincides with

theoretical results in Section 3, when no randomization covariate is

omitted in the construction of the final analysis model. Secondly,

under both Pocock and Simon’s marginal procedure and stratified

permuted block design, the two sample t-test, lm(Z1) and lm(Z2) are

all conservative. Among these three tests, the two sample t-test is the

most conservative one with the least Type I error.

Furthermore, the Type I error of the bootstrap t-test (BS-t) is close

to the nominal level 5% under both Pocock and Simon’s marginal

procedure and stratified permuted block design. Under complete

randomization, the Type I error is close to 5% for all four tests. We

also tried different (p1, p2); similar results were obtained and are not

shown here.
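The conservativeness of the unadjusted t-test can be reproduced with a small simulation of the discrete case. The sketch below uses only the standard library and makes several simplifying assumptions: a pooled-variance two sample t-statistic compared with the normal critical value 1.96 (instead of a t quantile), stratified permuted blocks of size 4 on (Z1, Z2), p1 = p2 = 0.5, and fewer replications than the 10000 used above. Function names are hypothetical.

```python
import math
import random


def stratified_permuted_block(strata, rng, block_size=4):
    """Assign 1/0 within each stratum using randomly permuted blocks."""
    pending = {}
    t = []
    for s in strata:
        if not pending.get(s):
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[s] = block
        t.append(pending[s].pop())
    return t


def t_statistic(y, t):
    """Pooled-variance two sample t-statistic."""
    g1 = [yi for yi, ti in zip(y, t) if ti == 1]
    g0 = [yi for yi, ti in zip(y, t) if ti == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m0) ** 2 for v in g0)
    pooled = ss / (len(g1) + len(g0) - 2)
    return (m1 - m0) / math.sqrt(pooled * (1 / len(g1) + 1 / len(g0)))


def type1_error(design, n=200, reps=2000, seed=1):
    """Rejection rate of the t-test when mu1 = mu2, under design 'SPB' or 'CR'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        z = [(rng.random() < 0.5, rng.random() < 0.5) for _ in range(n)]
        if design == "SPB":
            t = stratified_permuted_block(z, rng)
        else:  # complete randomization
            t = [int(rng.random() < 0.5) for _ in range(n)]
        # Y = Z1 + Z2 + eps, with beta1 = beta2 = 1 and no treatment effect
        y = [z1 + z2 + rng.gauss(0, 1) for z1, z2 in z]
        if abs(t_statistic(y, t)) > 1.96:
            rejections += 1
    return rejections / reps
```

Under these assumptions, `type1_error("SPB")` should fall well below 0.05 while `type1_error("CR")` should be close to 0.05, in line with the t-test columns of the tables. The stratified design removes the covariate contribution from the variance of the mean difference, but the t-test's variance estimate still includes it.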

• Even though many covariate-adaptive designs have been proposed

and implemented, the discussion of corresponding statistical

inference is limited.

• In practice, conventional tests are often used without accounting for the covariate-adaptive randomization scheme.

• It remains a concern whether conventional statistical inference is still valid for covariate-adaptive designs.

In the literature

• By simulation, Forsythe (1987) suggests “minimization should be

considered for group assignment only if all variables used in

minimization are also to be used as covariate” to achieve valid

statistical inference.

• Shao, Yu and Zhong (2010) pointed out “if the covariates used in

covariate-adaptive randomization is a function of the covariates to

construct a test, the test is valid under covariate-adaptive

randomization.”

Conservativeness

• In practice, however, it is often the case that not all randomization covariates are included in statistical inference:

– it is difficult to include some covariates (investigation sites, etc.) in the analysis;

– including them results in a more complicated model;

– including them requires correct model specification.

• Simulation studies by Birkett (1985), Forsythe (1987) and others indicate that unadjusted analysis is conservative in covariate-adaptive clinical trials.

• Shao, Yu and Zhong (2010) proved that, under a simple linear regression model,

$$Y_{ij} = \mu_j + b Z_i + \varepsilon_{ij},$$

the two sample t-test is conservative for stratified biased coin designs.

Limitations

• Only the two sample t-test is discussed. Properties are unknown if partial covariate information is used in statistical inference.

• The result applies only to the (stratified) covariate-adaptive biased coin design, which is not popular in applications.

• Only a simple linear model with one covariate is considered.

• No theoretical results about power.

• No discussion of inference about the significance of covariates.

Motivation

• Study properties of statistical inference for covariate-adaptive

randomized clinical trials

– which can be applied on a large family of covariate-adaptive

designs, including the ones (Pocock and Simon’s design and

others) widely used in practice.

– based on linear models and generalized linear models.

– on various types of hypothesis testing.

– new methods that adjust the Type I error and increase power.

7.2 Statistical Inference under Linear models

Properties of statistical inference will be studied for covariate-adaptive
designs

• under a linear regression framework;

• when only a subset of the covariates used in randomization is included
in the statistical inference procedures;

• for two types of hypothesis testing:

– comparing treatment effects;

– testing the significance of covariates.


7.3 Framework

We (Ma, Hu and Zhang, 2015, JASA) considered a covariate-adaptive
randomized clinical trial with two treatments, 1 and 2.

• Let $N$ denote the total number of patients in the study.

• $I_i = 1$, $i = 1, 2, \ldots, N$, if the $i$th patient is assigned to
treatment 1; otherwise $I_i = 0$.

• The response of the $i$th patient is

$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T + Z_{i,1} c_1^T + \cdots + Z_{i,q} c_q^T + \varepsilon_i, \qquad (1)$$

where

• $Y_i$ is the outcome of the $i$th patient;

• $X_k$ and $Z_j$, $k = 1, \ldots, p$ and $j = 1, \ldots, q$, are covariates,
which can be either discrete or continuous.

• Both $X_k$ and $Z_j$ are used in covariate-adaptive randomization,
but only $X_k$ are used to construct the analysis model.

• All covariates are assumed to be independent of each other.

• Furthermore, without loss of generality, it is assumed that
$E X_k = E Z_j = 0$ for all $k$ and $j$.

• The $\varepsilon_i$ are independent and identically distributed random
errors with mean zero and variance $\sigma_\varepsilon^2$, and are
independent of $X_k$ and $Z_j$.


Analysis Model (Working Model)

• Assume both $X_k$ and $Z_j$ are used in covariate-adaptive
randomization.

• Only $X_k$ are used in the final analysis.

• A linear regression model is used for the analysis:

$$E[Y_i] = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T. \qquad (2)$$


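In matrix form, the model (1) between response and covariates is

$$\mathbf{Y} = \mathbf{X}\beta + \mathbf{Z}\gamma + \varepsilon,$$

and the analysis model (2) is

$$E[\mathbf{Y}] = \mathbf{X}\beta,$$

where $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_N)^T$ are the outcomes,
$\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)^T$, and
$\beta = (\mu_1, \mu_2, b_1, \ldots, b_p)^T$ and $\gamma = (c_1, \ldots, c_q)^T$ are
true but unknown parameters. Furthermore, $\mathbf{X}$ and $\mathbf{Z}$ are

$$\mathbf{X} = \begin{pmatrix} I_1 & 1 - I_1 & X_{1,1} & \cdots & X_{1,p} \\ I_2 & 1 - I_2 & X_{2,1} & \cdots & X_{2,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_N & 1 - I_N & X_{N,1} & \cdots & X_{N,p} \end{pmatrix}$$

and

$$\mathbf{Z} = \begin{pmatrix} Z_{1,1} & \cdots & Z_{1,q} \\ \vdots & \ddots & \vdots \\ Z_{N,1} & \cdots & Z_{N,q} \end{pmatrix}.$$

The OLS estimator $\hat\beta$ of model (2) can be expressed as

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \mathbf{Z}\gamma + \varepsilon) = \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Z}\gamma + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon.$$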
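The decomposition of the OLS estimator can be checked numerically. The following is a minimal sketch, not part of the original analysis: the sample size, parameter values, and the use of complete randomization for the indicator are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 200, 2, 1

# Hypothetical data; complete randomization is used only to keep the sketch short.
I = rng.integers(0, 2, size=N)            # treatment indicator I_i
Xcov = rng.normal(size=(N, p))            # covariates kept in the analysis
Z = rng.normal(size=(N, q))               # covariates omitted from the analysis
X = np.column_stack([I, 1 - I, Xcov])     # design matrix of the working model

beta = np.array([1.0, 1.0, 0.5, -0.5])    # (mu1, mu2, b1, b2), illustrative values
gamma = np.array([2.0])                   # (c1), illustrative value
eps = rng.normal(size=N)
Y = X @ beta + Z @ gamma + eps            # responses from the full model

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y              # OLS fit of the working model

omitted_term = XtX_inv @ X.T @ Z @ gamma  # contribution of the omitted covariates
noise_term = XtX_inv @ X.T @ eps          # contribution of the random errors
decomposition = beta + omitted_term + noise_term
```

Here `beta_hat` and `decomposition` agree up to floating-point error, which makes the role of the omitted-covariate term explicit.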


Comparing Treatment Effect

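To compare the treatment effects $\mu_1$ and $\mu_2$, consider

$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A: \mu_1 - \mu_2 \neq 0. \qquad (3)$$

The test statistic is

$$T = \frac{L\hat\beta}{\left(\hat\sigma^2 L (\mathbf{X}^T\mathbf{X})^{-1} L^T\right)^{1/2}}, \qquad (4)$$

where $L = (1, -1, 0, \ldots, 0)$,
$\hat\sigma^2 = (\mathbf{Y} - \mathbf{X}\hat\beta)^T(\mathbf{Y} - \mathbf{X}\hat\beta)/(N - p' - 2)$,
and $p' + 2$ is the total number of parameters in model (2).

Reject $H_0$ if $|T| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the
$(1 - \alpha/2)$th percentile of the standard normal distribution.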
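As an illustration, the statistic in (4) can be computed directly from the design matrix. This is a sketch under stated assumptions: the function name `treatment_test_statistic` and its argument layout are hypothetical, and `Xcov` is assumed to be an (N, p') array of the covariates kept in the analysis.

```python
import numpy as np

def treatment_test_statistic(Y, I, Xcov):
    """Return T from (4) for H0: mu1 - mu2 = 0 under the working model."""
    I = np.asarray(I, dtype=float)
    X = np.column_stack([I, 1.0 - I, Xcov])       # columns: I, 1 - I, covariates
    n_obs, n_par = X.shape                        # n_par plays the role of p' + 2
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y                  # OLS estimate
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n_obs - n_par)  # residual variance estimate
    L = np.zeros(n_par)
    L[0], L[1] = 1.0, -1.0                        # contrast picking mu1 - mu2
    return (L @ beta_hat) / np.sqrt(sigma2_hat * (L @ XtX_inv @ L))
```

The null hypothesis would then be rejected at the 5% level when the absolute value of the returned statistic exceeds about 1.96.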


Testing Significance of a Covariate

To test the significance of a single covariate, consider, without loss
of generality, the first covariate:

$$H_0: b_1 = 0 \quad \text{versus} \quad H_A: b_1 \neq 0. \qquad (5)$$

The test statistic for hypothesis test (5) is

$$T' = \frac{\ell\hat\beta}{\left(\hat\sigma^2 \ell (\mathbf{X}^T\mathbf{X})^{-1} \ell^T\right)^{1/2}}, \qquad (6)$$

where $\ell = (0, 0, 1, 0, \ldots, 0)$. Notice that if $X_1$ is a discrete
covariate with $s_1'$ levels, $s_1' > 2$, then we are only able to test
$b_{11} = 0$, where $b_1 = (b_{11}, b_{12}, \ldots, b_{1(s_1'-1)})$.

Reject $H_0$ if $|T'| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the
$(1 - \alpha/2)$th percentile of the standard normal distribution.


7.4 Properties

Valid and Conservative Test

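A two-sided test $T$ based on the normal distribution is said to be
(asymptotically) valid if, when the null hypothesis holds,

$$\lim_{N\to\infty} \mathrm{pr}(|T| > Z_{1-\alpha/2}) = \alpha,$$

and it is said to be (asymptotically) conservative if there is a constant
$\alpha_0$ such that, when the null hypothesis holds,

$$\lim_{N\to\infty} \mathrm{pr}(|T| > Z_{1-\alpha/2}) = \alpha_0 < \alpha,$$

where $Z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ and $\Phi$ is the c.d.f. of
the standard normal distribution.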


Theorem 2: Under the linear model (1) and the hypothesis test

$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A: \mu_1 - \mu_2 \neq 0,$$

if a covariate-adaptive design satisfies the following two conditions:

• the overall imbalance is $O_p(1)$,

• the marginal imbalances are $O_p(1)$,

then, under $H_0$, the test statistic $T$ is asymptotically normally
distributed with a variance $\sigma^2 < 1$ unless all $c_j = 0$. Therefore,
the test is conservative unless all $c_j = 0$. Under $H_A$, the test
statistic $T$ is asymptotically normally distributed with a smaller
non-centrality parameter unless all $c_j = 0$. Therefore, the test is less
powerful unless all $c_j = 0$.
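To see why a limiting variance below 1 makes the test conservative, one can compute the rejection probability of the rule $|T| > Z_{1-\alpha/2}$ when $T \sim N(0, \sigma^2)$. The sketch below uses only the Python standard library; the function name and the value of `sigma` in the usage line are illustrative assumptions, not quantities from the theorem.

```python
from statistics import NormalDist

def actual_type_i_error(sigma, alpha=0.05):
    """P(|T| > z_{1-alpha/2}) when T ~ N(0, sigma^2) under the null."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)       # nominal critical value
    return 2 * (1 - std_normal.cdf(z / sigma))  # true rejection probability

# Hypothetical limiting standard deviation below 1.
level = actual_type_i_error(sigma=0.8)
```

With `sigma = 1` the function returns the nominal level; any `sigma < 1` gives a rejection probability strictly below `alpha`.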


Theorem 3: Under the same conditions as in Theorem 2, consider the
hypothesis test

$$H_0: b_1 = 0 \quad \text{versus} \quad H_A: b_1 \neq 0.$$

Under $H_0$, the test statistic $T'$ is asymptotically normally
distributed with variance 1. Therefore, the test is valid under $H_0$.
However, under $H_A$, the test statistic $T'$ is asymptotically normally
distributed with a smaller non-centrality parameter unless all $c_j = 0$.
Therefore, the test is less powerful unless all $c_j = 0$.

Corollary 1: If $\mathbf{Z}$ is not related to $\mathbf{Y}$, i.e., all $c_j = 0$ for
$j = 1, 2, \ldots, q$, then hypothesis tests (3) and (5) are both valid.


Corollary 2: The results in Theorems 2 and 3 hold under the following
designs:

• Pocock and Simon’s marginal procedure;

• Stratified permuted block design;

• The large class of covariate-adaptive designs in Hu and Hu (2012)

and Hu and Zhang (2014).


7.5 Numerical studies: Type I Error and Power

Type I errors are studied by assuming

$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + b_1 Z_{i,1} + b_2 Z_{i,2} + \varepsilon_i,$$

where

• $\varepsilon_i$ is distributed as $N(0, 1)$.

• $b_1 = b_2 = 1$.

• There is no difference in treatment effect, i.e., $\mu_1 = \mu_2$.

• Discrete case: $Z_1 \sim \mathrm{Bernoulli}(p_1)$, $Z_2 \sim \mathrm{Bernoulli}(p_2)$;
continuous case: $Z_1 \sim N(0, 1)$, $Z_2 \sim N(0, 1)$, with breakpoints at
the $p_1$th and $p_2$th quantiles, respectively.

• The biased coin probability 0.75 and equal weights are used for
Pocock and Simon’s marginal procedure, and block size 4 is
used for the stratified permuted block design.
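The discrete-case setting can be sketched in a few lines of simulation. This is an illustration under stated assumptions, not the code behind the reported tables: the helper names are hypothetical, the imbalance measure is taken to be the equal-weight sum of squared marginal differences (so the preferred arm is determined by the sign of the summed marginal differences), and far fewer replications are used than the 10000 runs reported.

```python
import numpy as np

def pocock_simon_assign(Z, p_coin=0.75, rng=None):
    """Sequentially assign treatments (1 or 0), balancing each covariate margin.

    Z is an (N, K) integer array of discrete covariate levels.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, K = Z.shape
    I = np.zeros(N, dtype=int)
    # diff[k][level] = (# on treatment 1) - (# on treatment 0) in that margin
    diff = [dict() for _ in range(K)]
    for i in range(N):
        d = sum(diff[k].get(Z[i, k], 0) for k in range(K))
        if d < 0:
            prob_one = p_coin          # treatment 1 is under-represented
        elif d > 0:
            prob_one = 1 - p_coin      # treatment 1 is over-represented
        else:
            prob_one = 0.5             # tie: fair coin
        I[i] = int(rng.random() < prob_one)
        step = 1 if I[i] == 1 else -1
        for k in range(K):
            diff[k][Z[i, k]] = diff[k].get(Z[i, k], 0) + step
    return I

def t_test_rejection_rate(N=100, reps=400, p=(0.5, 0.5), seed=0):
    """Share of simulated trials in which the unadjusted two-sample test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        Z = np.column_stack([rng.binomial(1, pk, size=N) for pk in p])
        I = pocock_simon_assign(Z, rng=rng)
        Y = Z.sum(axis=1) + rng.normal(size=N)   # mu1 = mu2 = 0, b1 = b2 = 1
        y1, y0 = Y[I == 1], Y[I == 0]
        pooled = ((len(y1) - 1) * y1.var(ddof=1)
                  + (len(y0) - 1) * y0.var(ddof=1)) / (N - 2)
        t = (y1.mean() - y0.mean()) / np.sqrt(pooled * (1 / len(y1) + 1 / len(y0)))
        rejections += abs(t) > 1.96              # normal critical value
    return rejections / reps
```

Calling `t_test_rejection_rate()` gives an estimate of the type I error of the unadjusted test under this sketch of the design; by the theory above it is expected to fall below the nominal 5%.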

Table 17: Simulated Type I error (in %) for Pocock and Simon’s marginal
procedure. Simulations based on 10000 runs.

Z           (p1, p2)    N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)
Discrete    (0.5, 0.5)  100  1.75    3.05    3.09    5.21
                        200  1.62    2.78    2.86    4.99
                        500  1.66    2.81    2.77    4.87
            (0.5, 0.3)  100  2.02    3.30    3.00    5.04
                        200  1.90    3.18    2.99    5.07
                        500  1.84    3.25    2.96    5.20
Continuous  (0.5, 0.5)  100  1.43    2.15    2.02    4.98
                        200  1.07    1.74    1.80    4.53
                        500  0.91    1.72    1.73    4.72
            (0.5, 0.3)  100  1.35    2.12    1.85    4.95
                        200  1.16    2.14    1.83    5.05
                        500  1.22    1.95    1.71    4.99

Table 18: Simulated Type I error (in %) for complete randomization.
Simulations based on 10000 runs.

Z           (p1, p2)    N    t-test  lm(Z1)  lm(Z2)  lm(Z1, Z2)
Discrete    (0.5, 0.5)  100  5.04    5.27    5.11    5.31
                        200  5.00    4.95    5.12    5.21
                        500  4.73    4.83    4.68    4.77
            (0.5, 0.3)  100  4.99    4.99    4.68    4.77
                        200  5.15    5.03    5.49    5.14
                        500  4.82    5.00    4.80    5.13
Continuous  (0.5, 0.5)  100  5.20    5.31    4.82    4.92
                        200  5.06    5.14    4.85    5.46
                        500  4.87    5.05    4.71    4.77
            (0.5, 0.3)  100  4.99    4.69    5.11    4.97
                        200  5.21    5.24    5.16    4.92
                        500  5.14    4.66    5.19    5.15

Table 19: Power comparison (in %, using lm(Z1, Z2)) for Pocock and Simon’s
marginal procedure (PS) and complete randomization (CR). Simulations
based on 10000 runs with sample sizes N = 32 and N = 64.

           N = 32           N = 64
µ1 − µ2    CR      PS       CR      PS
0.0        4.96    5.03     5.17    5.08
0.2        7.81    8.51     12.12   12.68
0.4        18.15   19.44    34.46   34.76
0.6        33.96   36.98    63.04   65.53
0.8        53.74   57.28    86.97   87.95
1.0        73.63   77.10    97.02   97.51


7.6 Numerical Studies: Testing of Covariates

It is assumed that

$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + b_1 Z_{i,1} + b_2 Z_{i,2} + \varepsilon_i,$$

where

• $\varepsilon_i$ is distributed as $N(0, 1)$.

• $b_1 = 0$, $b_2 = 1$.

• There is no difference in treatment effect, i.e., $\mu_1 = \mu_2$.

• Discrete case: $Z_1 \sim \mathrm{Bernoulli}(p_1)$, $Z_2 \sim \mathrm{Bernoulli}(p_2)$;
continuous case: $Z_1 \sim N(0, 1)$, $Z_2 \sim N(0, 1)$, with breakpoints at
the $p_1$th and $p_2$th quantiles, respectively.

• The biased coin probability 0.75 and equal weights are used for
Pocock and Simon’s marginal procedure.

Table 20: Simulated Type I error (in %) for $H_0: b_1 = 0$ versus
$H_A: b_1 \neq 0$ under Pocock and Simon’s marginal procedure (PS) and
complete randomization (CR). Simulations based on 10000 runs.

                   Discrete        Continuous
(p1, p2)    N      PS     CR       PS     CR
(0.5, 0.5)  100    4.96   4.98     4.98   4.90
            200    5.35   5.28     5.14   5.14
            500    5.55   5.55     5.11   5.15
(0.5, 0.4)  100    4.76   4.84     4.74   4.83
            200    5.51   5.49     5.12   5.07
            500    4.80   4.77     5.11   5.02
(0.5, 0.3)  100    4.96   5.07     5.23   5.20
            200    5.08   5.17     5.00   4.99
            500    4.95   4.84     5.65   5.64
(0.4, 0.4)  100    5.12   5.15     5.07   5.08
            200    5.41   5.48     5.20   5.21
            500    5.19   5.24     5.01   5.08


7.7 General Theory of Statistical Inference

Covariate-adaptive randomization procedures are frequently used in
comparative studies to increase the covariate balance across treatment
groups. However, as the randomization inevitably uses the covariate
information when forming balanced treatment groups, the validity of
classical statistical methods following such randomization is often
unclear.

Ma, Qin, Li and Hu (2019):

(i) derive the theoretical properties of statistical methods based on
general covariate-adaptive randomization under the linear model
framework;

(ii) explicitly unveil the relationship between covariate-adaptive and
inference properties by deriving the asymptotic representations of the
corresponding estimators;

(iii) apply the proposed general theory to various randomization
procedures, such as complete randomization (CR), rerandomization
(RR), pairwise sequential randomization (PSR), and Atkinson’s
$D_A$-biased coin design, and compare their performance analytically;
and

(iv) based on the theoretical results, propose a new approach to obtain
valid and more powerful tests.

8 New covariate-adjusted response-adaptive designs 296

8 New covariate-adjusted

response-adaptive designs

Personalized medicine raises new challenges for the design of clinical

trials such as:

(1) more covariates (biomarkers) have to be considered, and

(2) particular attention needs to be paid to the interaction between

treatment and covariates (biomarkers).

To design a good clinical trial for personalized medicine, we need new

designs that can match the special features of personalized medicine.


8.1 Optimal design for detecting important

interactions among treatments and

biomarkers

The goal of a conventional clinical trial is to determine if a new

treatment is superior. When designing a clinical trial for precision

medicine, the goal is not limited to just detecting the treatment

difference, but also to identifying biomarkers that predict the

efficacy of treatments.

Therefore, it is important to have a design that can detect the

interaction between treatment and biomarkers efficiently.


8.2 Optimal designs based on both efficiency

and ethics

Clinical trials, because they involve human subjects, require stringent

ethical considerations. To develop personalized medicine, covariate

information plays an important role in the design and analysis of

clinical trials. A challenge is the incorporation of covariate

information in design while still considering issues of both

efficiency and medical ethics (CARA designs).


To address this problem, new designs of clinical trials are needed (Hu,
Zhu and Hu, 2015, JASA).

Denote the efficiency and ethics measurements of two treatments as
$d(Z, \theta) = (d_1(Z, \theta), d_2(Z, \theta))$ and
$e(Z, \theta) = (e_1(Z, \theta), e_2(Z, \theta))$, respectively. We propose to
assign the $(m+1)$th subject to treatment 1 with probability

$$\frac{e_1(Z_{m+1}, \hat\theta(m))\, d_1^{\gamma}(Z_{m+1}, \hat\theta(m))}{e_1(Z_{m+1}, \hat\theta(m))\, d_1^{\gamma}(Z_{m+1}, \hat\theta(m)) + e_2(Z_{m+1}, \hat\theta(m))\, d_2^{\gamma}(Z_{m+1}, \hat\theta(m))}.$$
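Once the ethics and efficiency measures have been evaluated at the incoming subject's covariates and the current parameter estimate, the allocation probability is a simple ratio. The sketch below assumes those evaluations are already available as positive numbers; the function name and the input values in the usage lines are hypothetical.

```python
def cara_allocation_probability(e, d, gamma):
    """Probability of assigning the next subject to treatment 1.

    e = (e1, e2) and d = (d1, d2) are the ethics and efficiency measures
    evaluated at (Z_{m+1}, theta_hat(m)); gamma is the exponent applied to d.
    All measures are assumed to be positive.
    """
    e1, e2 = e
    d1, d2 = d
    weight_1 = e1 * d1 ** gamma
    weight_2 = e2 * d2 ** gamma
    return weight_1 / (weight_1 + weight_2)

# Illustrative inputs only.
prob_ethics_only = cara_allocation_probability(e=(3.0, 1.0), d=(5.0, 2.0), gamma=0)
prob_balanced = cara_allocation_probability(e=(1.0, 1.0), d=(1.0, 1.0), gamma=1)
```

With `gamma = 0` the efficiency measures drop out and the probability depends on the ethics measures alone, which is one way to read the role of the exponent.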


8.3 Designs based on predictive biomarkers

There are two distinct types of biomarkers in precision medicine:

• Prognostic biomarker: a biomarker that can be used to predict the
most likely prognosis of an individual patient.

• Predictive biomarker: a biomarker that is likely to predict the
response to a specific therapy (treatment).

To develop precision medicine, we need new adaptive designs based on

predictive biomarkers (Hu, Wang and Zhao, 2019).

9 A/B testing under observational data 301

9 A/B testing under observational data

9.1 Simpson’s Paradox

Table 21: Fictitious data illustrating Simpson’s paradox.

          Control Group (No Drug)          Treatment Group (Took Drug)
          Heart attack  No heart attack    Heart attack  No heart attack
Female    1             19                 3             37
Male      12            28                 8             12
Total     13            47                 11            49

Table 22: Fictitious data illustrating Simpson’s paradox.

                     Control Group (No Drug)          Treatment Group (Took Drug)
                     Heart attack  No heart attack    Heart attack  No heart attack
Low blood pressure   1             19                 3             37
High blood pressure  12            28                 8             12
Total                13            47                 11            49
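The paradox becomes visible once the heart attack rates are computed within each stratum and overall. The sketch below uses the counts from Table 21 (Table 22 has the same counts with different row labels); the function names are illustrative.

```python
# Counts are (heart attacks, no heart attacks), copied from Table 21.
data = {
    "Female": {"control": (1, 19), "treatment": (3, 37)},
    "Male": {"control": (12, 28), "treatment": (8, 12)},
}

def rate(events, non_events):
    """Proportion of subjects who had the event."""
    return events / (events + non_events)

def heart_attack_rates(data):
    """Return per-stratum and pooled heart attack rates for each group."""
    rates = {}
    totals = {"control": [0, 0], "treatment": [0, 0]}
    for stratum, groups in data.items():
        rates[stratum] = {g: rate(*counts) for g, counts in groups.items()}
        for g, (events, non_events) in groups.items():
            totals[g][0] += events
            totals[g][1] += non_events
    rates["Total"] = {g: rate(*counts) for g, counts in totals.items()}
    return rates

rates = heart_attack_rates(data)
for stratum, r in rates.items():
    print(f"{stratum:>6}: control {r['control']:.1%}, treatment {r['treatment']:.1%}")
```

Within each stratum the treatment group has the higher heart attack rate (7.5% vs 5.0% and 40.0% vs 30.0%), yet in the pooled totals it has the lower rate (18.3% vs 21.7%), because the two groups have very different stratum compositions.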


9.2 The real world effectiveness of BNT162b2

and mRNA-1273 COVID-19 Vaccines

Interim Estimates of Vaccine Effectiveness of BNT162b2 and

mRNA-1273 COVID-19 Vaccines in Preventing SARS-CoV-2 Infection

Among Health Care Personnel, First Responders, and Other Essential

and Frontline Workers Eight U.S. Locations, December 2020–March

2021.

10 Some basic principle of designing, running and analyzing an A/B Test 305

10 Some basic principles of designing,

running and analyzing an A/B Test

10.1 Setting up the example

Consider a fictional online commerce site that sells flowers. There is a
wide range of changes we can test: (i) introducing a new feature; (ii) a
change to the user interface (UI); (iii) a back-end change; (iv) a
change of price; and so on.


In this example, the marketing department wants to increase sales by

sending promotional emails that include a coupon code for discounts

on the flowers. There are several concerns: (i) revenue; (ii) cost; and

so on.


We want to evaluate the impact of simply adding a coupon code field.
Our goal is simply to assess the impact on revenue of having this
coupon code field and to evaluate the concern that it will distract
people from checking out.


Online shopping process as a funnel, see Figure.


There are many ways to change the user interface (UI). Here are two

different UIs. See Figure.


Our Hypothesis: Adding a coupon code field to the checkout page will

degrade revenue.


To measure the impact of the change, we need to define goal metrics
(usually difficult to identify).


This experiment: revenue.

Total revenue or revenue-per-user?


Which users to consider in the denominator of the revenue-per-user

metric:

(1) All users who visit the site;

(2) Only users who complete the purchase process;

(3) Only users who start the purchase process.


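Option (3), only users who start the purchase process, is the best
choice: it includes every user who could be affected by the change and
excludes users who never reach checkout, who would only add noise.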

Refined hypothesis becomes: Adding a coupon code field to the

checkout page will degrade revenue-per-user for users who start the

purchase process.


10.2 Hypothesis testing: establishing statistical

significance

Discussions.


10.3 Designing the experiment

Some aspects:

1) What is the randomization unit?

2) What population of randomization units do we want to target?

3) How large (sample size) does our experiment need to be?

4) How long do we run the experiment?


Our experiment design is now as follows:

1) What is the randomization unit?

A user.

2) What population of randomization units do we want to target?

All users; we analyze those who visit the checkout page.

3) How large (sample size) does our experiment need to be?

To have 80% power to detect at least a 1% change in revenue-per-user,
we will conduct a power analysis to determine the sample size.
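A power analysis of this kind is often done with the normal-approximation formula for comparing two means. The sketch below is one such calculation, assuming equal allocation and a common standard deviation; the function name and the revenue figures in the usage lines are hypothetical placeholders, not values from the example.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Users needed per variant to detect a mean difference delta.

    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2 for a
    two-sided two-sample test with equal group sizes.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical inputs: mean revenue-per-user 10.0 with standard deviation 30.0,
# and a minimum detectable change of 1% of the mean.
mean_revenue, sd_revenue = 10.0, 30.0
n_needed = sample_size_per_variant(sigma=sd_revenue, delta=0.01 * mean_revenue)
```

Because the required sample size scales with `sigma**2 / delta**2`, halving the detectable change roughly quadruples the number of users needed per variant.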


4) How long do we run the experiment?

The required sample size translates into running the experiment for a
minimum of four days with a 34/33/33% split among Control, Treatment
one, and Treatment two. We will run the experiment for a full week to
ensure that we understand the day-of-week effect, and potentially
longer if we detect novelty or primacy effects.


10.4 Running the experiment and getting data

To run an experiment, we need both:

1) Instrumentation;

2) Infrastructure.


10.5 Interpreting the results

Discussions.


10.6 From results to decision

The goal of running A/B tests is to gather data to drive decision
making. A lot of work goes into ensuring that our results are repeatable
and trustworthy so that we can make the right decision.

Some important aspects:

1) Do you need to make tradeoffs between different metrics?

2) What is the cost of launching this change?

3) What is the downside of making wrong decisions?


You need to make decisions from different results:

Discussions.

11 Twyman’s Law and Experimentation Trustworthiness 323

11 Twyman’s Law and Experimentation

Trustworthiness

William Anthony Twyman was a UK radio and television audience

measurement veteran (MR Web 2014) credited with formulating

Twyman’s Law, although he apparently never explicitly put it in

writing.

Any statistic that appears interesting is almost certainly a mistake.

by Paul Dickson (1999)


Any figure that looks interesting or different is usually wrong.

by A.S.C. Ehrenberg (1975).

Twyman’s law, perhaps the most important single law in the whole of
data analysis... The more unusual or interesting the data, the more
likely they are to have been the result of an error of one kind or
another.

by Catherine Marsh and Jane Elliott (2009).


11.1 Misinterpretation of the Statistical Results

In the Null Hypothesis Significance Testing, we typically assume that

there is no difference in metric value between control and treatment

and reject the hypothesis if the data presents strong evidence against

it.

A common mistake is to assume that just because a metric is not

statistically significant, there is no treatment effect.

It could be that the experiment is underpowered to detect an effect of
that size. An evaluation of 115 A/B tests at GoodUI.org suggests
that most were underpowered.


P-value is often misinterpreted.

The p-value is the probability of obtaining a result equal to or more

extreme than what was observed, assuming that the Null hypothesis is

true. The conditioning on the Null hypothesis is critical.


Here are some incorrect statements and explanations from A Dirty
Dozen: Twelve P-Value Misconceptions (Goodman 2008):

• Incorrect: If the p-value = 0.05, the Null hypothesis has only a 5%
chance of being true. Explanation: the p-value is calculated assuming
that the Null hypothesis is true.

• Incorrect: P-value = 0.05 means that we observed data that would
occur only 5% of the time under the Null hypothesis. Explanation: this
is incorrect by the definition of the p-value above, which includes
values equal to or more extreme than what was observed.

• Incorrect: P-value = 0.05 means that if you reject the Null hypothesis,
the probability of a false positive is only 5%. Explanation: the
probability that a rejection is a false positive also depends on how
often the Null hypothesis is true and on the power of the test, so it is
not determined by the p-value alone.


Multiple Hypothesis Tests:

The following story comes from the fun book, What is a p-value

anyway? (Vickers 2009):

• Statistician: Oh, so you have already calculated the p-value?

• Surgeon: Yes, I used multinomial logistic regression.

• Statistician: Really? How did you come up with that?

• Surgeon: I tried each analysis on the statistical software

drop-down menus, and that was the one that gave the smallest

p-value.

The False Discovery Rate (Hochberg and Benjamini 1995) is a key concept
for dealing with multiple tests.
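One common way to control the false discovery rate is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k with p_(k) ≤ k·q/m, and reject the k smallest. The sketch below is a plain-Python version; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Largest 1-based rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten metrics in one experiment.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, q=0.05)
```

In this example five p-values are below 0.05 on their own, but only the two smallest are rejected after the adjustment.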


Confidence Intervals: Discussion.


11.2 Threats to Internal Validity

• Violations of SUTVA: In the analysis of A/B tests, it is

common to apply the Stable Unit Treatment Value Assumption

(SUTVA) (Imbens and Rubin, 2015), which states that

experiment units (e.g., users) do not interfere with one another.

Their behavior is impacted by their own variant assignment, and

not by the assignment of others. Discussion.

• Survivorship Bias: Discussion.

• Intention-to-Treat: Discussion.

• Sample Ratio Mismatch: Discussion.


11.3 Threats to External Validity

External validity refers to the extent to which the results of an A/B test
can be generalized along axes such as different populations (e.g., other
countries, other websites) or over time.

Discussion.

12 Analyzing A/B tests 332

12 Analyzing A/B tests

12.1 Two sample T-test

Discussion.


12.2 P-value and Confidence Intervals


12.3 Normality Assumption


12.4 Type I/II Errors and Power


12.5 Multiple Testing


12.6 Meta-analysis

13 The A/A Test 338

13 The A/A Test

Running A/A tests is a critical part of establishing trust
in an experimentation platform. The idea is so useful
because the tests fail many times in practice, which leads
to re-evaluating assumptions and identifying bugs.


A/A tests are the same as A/B tests, but Treatment and Control users
receive identical experiences. You can use A/A tests for several
purposes, such as to:

• Ensure that Type I errors are controlled as expected.

• Assess metrics’ variability.

• Ensure that no bias exists between Treatment and Control users.

• Compare data to the system of record. If the system of record shows
X users visited the website during the experiment and you ran Control
and Treatment at 20% each, do you see around 20% of X users in each?
Are you leaking users?

• Estimate variances for statistical power calculations.
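The first purpose can be illustrated by simulation: if both arms draw from the same distribution, about 5% of A/A tests should come out significant at the 5% level, and the p-values should look roughly uniform. The sketch below assumes a hypothetical exponential revenue distribution and uses a normal approximation for the two-sample test; all names and parameter values are illustrative.

```python
import math
import numpy as np

def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

def simulate_aa_tests(n_tests=2000, n_users=500, seed=0):
    """Run repeated A/A comparisons and return the resulting p-values."""
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_tests)
    for t in range(n_tests):
        # Both arms receive the identical experience, so they share one
        # (hypothetical, skewed) revenue distribution.
        control = rng.exponential(scale=10.0, size=n_users)
        treatment = rng.exponential(scale=10.0, size=n_users)
        p_values[t] = two_sample_p_value(control, treatment)
    return p_values

p_values = simulate_aa_tests()
false_positive_rate = float((p_values < 0.05).mean())
```

A false positive rate far from 5%, or a clearly non-uniform p-value histogram, would suggest a problem with the variance estimate, the randomization, or the data pipeline.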

14 Long-term treatment effects 340

14 Long-term treatment effects

Short-term effect (from A/B test) vs Long-term effect (we care about).


There are scenarios where long-term effect is different from the

short-term effect.


Reasons the treatment effect may differ between short-term and

long-term:

• User-learned effects.

• Network effects.

• Delayed experience and measurement.

• Ecosystem changes: launching other new features; seasonality;
competitive landscape; government policies; concept drift; etc.


Some Suggestions:

• Long-running experiments.

• Cohort Study and Analysis.

• Post-Period Analysis (Post-Market).

• Time-Staggered Treatments.

• Holdback and Reverse Experiment.

15 Conclusion and Remarks 343

15 Conclusion and Remarks


The goal of Human Life:

Understanding + Improving The Nature

A/B testing is becoming, and will continue to become, more and more
important in understanding nature and ourselves.


Three Essential Components of Statistics

(Data Science):

Data+Computer and Software+Analytics

A/B test is the essential tool.


Good Design of Experiment (producing useful

data) [Big or Small Data]:

Realistic, Efficient and Ethical.


Statisticians (Data Scientists) are experts in:

• producing useful data (Big or small);

Survey Sampling; Experiment Designs.

• analyzing (Big or small) data to obtain meaningful results;

With some possible new statistical methods and computational
skills

• drawing practical conclusions.


The Classical Statistical Framework

(Static):

Real Problem → Data Collection

→ Data Analysis → Decision


To match human intelligence, we may need (Hu, 2016)

The Dynamic Statistical Framework (AI):

Real Problem → Data Collection

→ Data Analysis → Decision

→ + new Data → new Analysis

→ new Decision → · · ·


Producing Useful Data (Design of Experiments) in

Big Data and AI ERA:

MANY New Challenges:

(i) From Static to Dynamic;

(ii) From Independent to Dependent.


A/B tests in Big Data and AI ERA:

MANY New Challenges:

(i) From Static to Dynamic;

(ii) From Independent to Dependent.


The goal of Human Life:

Understanding + Improving The Nature

Statistics (Data Science) is a “GAME” between human

and nature “THROUGH DATA”.


We read the world wrong and say that it

deceives us. (Tagore, )

We read the data wrong and think that

the data deceives us. (Feifang Hu, 2017)


Thank you!
