A/B Testing: Designs and Analysis
Feifang Hu
Department of Statistics, George Washington University
Email: [email protected]
Fall, 2022, Washington, DC, USA

Three Essential Components of Statistics (Data Science): Data + Computer + Analytics

1 Introduction

1.1 What is A/B testing?

A/B test is shorthand for a simple controlled experiment. As the name implies, two versions (A and B) of a single variable are compared; the versions are identical except for one variation that might affect a user’s behavior. A/B tests are widely considered the simplest form of controlled experiment; adding more variants makes the test more complex.

A/B testing is the process of comparing two variations of a page element, usually by testing users’ response to variant A vs variant B, and concluding which of the two variants is more effective.

A/B tests are useful for understanding user engagement and satisfaction with online features, such as a new feature or product. Large social media sites like LinkedIn, Facebook, and Instagram use A/B testing to make user experiences more successful and as a way to streamline their services.

Today, A/B tests are being used to run more complex experiments, such as network effects when users are offline, how online services affect user actions, and how users influence one another.

Many positions rely on the data from A/B tests, including data engineers, marketers, designers, software engineers, and entrepreneurs, because the tests allow companies to understand growth, increase revenue, and optimize customer satisfaction.

Version A might be the currently used version (control), while version B is modified in some respect (treatment). For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal decreases in drop-off rates can represent a significant gain in sales.
Significant improvements can sometimes be seen through testing elements like copy text, layouts, images and colors, but not always. In these tests, users only see one of two versions, as the goal is to discover which of the two versions is preferable.

Controlled experiments have a long and fascinating history. They are sometimes called A/B tests, A/B/C tests (multiple variants), field experiments, randomized controlled experiments, split tests, bucket tests, and flights.

1.2 Online experiments

Example 1. Online A/B testing (Kohavi and Thomke, 2017, Harvard Business Review). Microsoft, Amazon, Facebook and Google each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.

Amazon’s experiment.
Treatment A: Credit card offers on the front page.
Treatment B: Credit card offers on the shopping cart page.
This change (from A to B) boosted profits by tens of millions of US dollars annually.

1.2.1 A/B Testing in eCommerce Industry

Through A/B testing, online stores can increase the average order value, optimize their checkout funnel, reduce cart abandonment rate, and so on. You may try testing how and where shipping cost is displayed; whether and how the free shipping feature is highlighted; text and color tweaks on the payment page or checkout page; the visibility of reviews or ratings; etc.

In the eCommerce industry, Amazon is at the forefront in conversion optimization, partly due to the scale they operate at and partly due to their immense dedication to providing the best customer experience. Amongst the many revolutionary practices they brought to the eCommerce industry, the most influential one has been their ‘1-Click Ordering’. Introduced in the late 1990s after much testing and analysis, 1-Click Ordering lets users make purchases without having to use the shopping cart at all.
Once users enter their default billing card details and shipping address, all they need to do is click on the button and wait for the ordered products to be delivered. Users don’t have to enter their billing and shipping details again when placing later orders. With 1-Click Ordering, it became hard for users to ignore the ease of purchase and go to another store. This change had such a huge business impact that Amazon got it patented (now expired) in 1999. In fact, in 2000, even Apple bought a license to use it in their online store.

People working to optimize Amazon’s website do not have sudden ‘Eureka’ moments for every change they make. It is through continuous and structured A/B testing that Amazon is able to deliver the kind of user experience that it does. Every change on the website is first tested on their audience and then deployed. If you look closely at Amazon’s purchase funnel, you will see that even though the funnel more or less replicates other websites’ purchase funnels, each and every element in it is fully optimized and matches the audience’s expectations.

Every page, from the homepage to the payment page, contains only the essential details and leads to the exact next step required to push users further into the conversion funnel. Additionally, using extensive user insights and website data, each step is simplified as far as possible to match users’ expectations.

Take their omnipresent shopping cart, for example. There is a small cart icon at the top right of Amazon’s homepage that stays visible no matter which page of the website you are on.

The icon is not just a shortcut to the cart or reminder for added products.
In its current version, it offers 5 options:
(i) Continue shopping (if there are no products added to the cart)
(ii) Learn about today’s deals (if there are no products added to the cart)
(iii) Wish List (if there are no products added to the cart)
(iv) Empty cart
(v) Proceed to checkout (when there are products in the cart).
Sign in to turn on 1-Click Checkout (when there are products in the cart).

With one click on the tiny icon offering so many options, the user’s cognitive load is reduced, and they have a great user experience. As can be seen in the above screenshot, the same cart page also suggests similar products so that customers can navigate back into the website and continue shopping. All this is achieved with one weapon: A/B Testing.

1.2.2 A/B Testing in Travel Industry

Increase the number of successful bookings on your website or mobile app, your revenue from ancillary purchases, and much more through A/B testing. You may try testing your home page search modals, search results page, ancillary product presentation, your checkout progress bar, and so on.

In the travel industry, Booking.com easily surpasses all other eCommerce businesses when it comes to using A/B testing for their optimization needs. They test like it’s nobody’s business. From the day of its inception, Booking.com has treated A/B testing as the treadmill that introduces a flywheel effect for revenue. The scale at which Booking.com A/B tests is unmatched, especially when it comes to testing their copy. While you are reading this, there are nearly 1000 A/B tests running on Booking.com’s website.

Even though Booking.com has been A/B testing for more than a decade now, they still think there is more that they can do to improve user experience. And this is what makes Booking.com the ace in the game. Since the company started, Booking.com incorporated A/B testing into its everyday work process.
They have increased their testing velocity to its current rate by eliminating HiPPOs (highest paid person’s opinions) and giving priority to data before anything else. And to increase the testing velocity even more, all of Booking.com’s employees were allowed to run tests on ideas they thought could help grow the business.

This example demonstrates the lengths to which Booking.com can go to optimize their users’ interaction with the website. Booking.com decided to broaden its reach in 2017 by offering rental properties for vacations alongside hotels. This led to Booking.com partnering with Outbrain, a native advertising platform, to help grow their global property owner registration.

Within the first few days of the launch, the team at Booking.com realized that even though a lot of property owners completed the first sign-up step, they got stuck in the next steps. At this time, pages built for the paid search of their native campaigns were used for the sign-up process.

The two teams decided to work together and created three versions of landing page copy for Booking.com. Additional details such as social proof, awards and recognitions, and user rewards were added to the variations.

The test ran for two weeks and produced a 25% uplift in owner registration. The test results also showed a significant decrease in the cost of each registration.

1.2.3 A/B Testing in B2B/SaaS Industry

Generate high-quality leads for your sales team, increase the number of free trial requests, attract your target buyers, and perform other such actions by testing and polishing important elements of your demand generation engine. To get to these goals, marketing teams put up the most relevant content on their website, send out ads to prospective buyers, conduct webinars, put up special sales, and much more.
But all their effort would go to waste if the landing page which clients are directed to is not fully optimized to give the best user experience. The aim of SaaS (Software as a Service) A/B testing is to provide the best user experience and to improve conversions. You can try testing your lead form components, free trial sign-up flow, homepage messaging, CTA text, social proof on the home page, and so on.

POSist, a leading SaaS-based restaurant management platform with more than 5,000 customers at over 100 locations across six countries, wanted to increase their demo requests. Their website homepage and Contact Us page are the most important pages in their funnel. The team at POSist wanted to reduce drop-off on these pages. To achieve this, the team created two variations of the homepage as well as two variations of the Contact Us page to be tested. Let’s take a look at the changes made to the homepage. This is what the control looked like:

The team at POSist hypothesized that adding more relevant and conversion-focused content to the website would improve user experience, as well as generate higher conversions. So they created two variations to be tested against the control. The control was first tested against Variation 1, and the winner was Variation 1. To further improve the page, Variation 1 was then tested against Variation 2, and the winner was Variation 2. The new variation increased page visits by about 5%.

1.3 Clinical trials

Example 2. HIV transmission. Connor et al. (1994, The New England Journal of Medicine) report a clinical trial to evaluate the drug AZT in reducing the risk of maternal-infant HIV transmission. A 50-50 randomization scheme was used:
• AZT group: 239 pregnant women (20 HIV-positive infants).
• Placebo group: 238 pregnant women (60 HIV-positive infants).

Given the seriousness of the outcome of this study, it is reasonable to argue that 50-50 allocation was unethical.
As accruing information favoring (albeit not conclusively) the AZT treatment became available, allocation probabilities should have been shifted away from 50-50 allocation in proportion to the weight of evidence for AZT. Designs which attempt to do this are called Response-Adaptive designs (Response-Adaptive Randomization).

If the treatment assignments had been done with the doubly adaptive biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics) with urn target:
• AZT group: 360 patients
• Placebo group: 117 patients
then only 60 (instead of 80) infants would be expected to be HIV positive.

Example 3: Remdesivir COVID-19 trial (China). The trial of Remdesivir in adults with severe COVID-19 (Wang et al. 2020) is a randomized, double-blind, placebo-controlled, multicentre trial that aimed to compare Remdesivir with placebo. There were 236 patients in the trial. There are about 20 baseline covariates for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A stratified (according to the level of respiratory support) permuted block (30 patients per block) randomization procedure was implemented. At the end of this trial, some important imbalances existed at enrollment between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than the placebo group.

Example 4: Moderna COVID-19 vaccine trial (2020). The trial began on July 27, 2020, and enrolled 30,420 adult volunteers at clinical research sites across the United States. Volunteers were randomly assigned 1:1 to receive either two 100 microgram (mcg) doses of the investigational vaccine or two shots of saline placebo 28 days apart. The average age of volunteers is 51 years. Approximately 47% are female, 25% are 65 years or older and 17% are under the age of 65 with medical conditions placing them at higher risk for severe COVID-19.
Approximately 79% of participants are white, 10% are Black or African American, 5% are Asian, 0.8% are American Indian or Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2% are multiracial, and 21% (of any race) are Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators recorded 196 cases of symptomatic COVID-19 occurring among participants at least 14 days after they received their second shot. One hundred and eighty-five cases (30 of which were classified as severe COVID-19) occurred in the placebo group and 11 cases (0 of which were classified as severe COVID-19) occurred in the group receiving mRNA-1273. The incidence of symptomatic COVID-19 was 94.1% lower in those participants who received mRNA-1273 as compared to those receiving placebo.

Investigators observed 236 cases of symptomatic COVID-19 among participants at least 14 days after they received their first shot, with 225 cases in the placebo group and 11 cases in the group receiving mRNA-1273. The vaccine efficacy was 95.2% for this secondary analysis.

Long-term treatment effects?

1.4 Economics and Social Science

Political A/B testing

A/B tests are used not only by corporations; they also drive political campaigns. In 2007, Barack Obama’s presidential campaign used A/B testing as a way to attract online interest and understand what voters wanted to see from the presidential candidate. For example, Obama’s team tested four distinct buttons on their website that led users to sign up for newsletters. Additionally, the team used six different accompanying images to draw in users. Through A/B testing, staffers were able to determine how to effectively draw in voters and garner additional interest.

Example 5. The Project GATE (Growing America Through Entrepreneurship), sponsored by the U.S.
Department of Labor, was designed to evaluate the impact of offering tuition-free entrepreneurship training services (GATE services) on helping clients create, sustain or expand their own business. (https://www.doleta.gov/reports/projectgate/) The cornerstone is complete randomization. Members of the treatment group were offered GATE services; members of the control group were not.
• n = 4,198 participants
• p = 105 covariates

1.5 Biological, psychological, and agricultural research

Controlled experiments were mainly developed in these areas in 1900-1950.

Road Map of this course:
(i) The history of experiment design;
(ii) A/B testing in medical studies;
(iii) Online controlled experiments (A/B testing).

2 The history of experiment design

2.1 Experiment design before Fisher

Statistical experiments, following Charles S. Peirce: A theory of statistical inference was developed by Charles S. Peirce in “Illustrations of the Logic of Science” (1877–1878) and “A Theory of Probable Inference” (1883), two publications that emphasized the importance of randomization-based inference in statistics.

Randomized experiments: Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights. Peirce’s experiment inspired other researchers in psychology and education, who developed a research tradition of randomized experiments in laboratories and specialized textbooks in the 1800s.

Optimal designs for regression models: Charles S. Peirce also contributed the first English-language publication on an optimal design for regression models in 1876. A pioneering optimal design for polynomial regression was suggested by Gergonne in 1815.
In 1918, Kirstine Smith published optimal designs for polynomials of degree six (and less).

2.2 Fisher’s principles

A methodology for designing experiments was proposed by Ronald Fisher in his innovative books The Arrangement of Field Experiments (1926) and The Design of Experiments (1935). Much of his pioneering work dealt with agricultural applications of statistical methods. As a mundane example, he described how to test the lady tasting tea hypothesis, that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup. These methods have been broadly adapted in biological, psychological, and agricultural research.

2.2.1 Comparison

In some fields of study it is not possible to have independent measurements to a traceable metrology standard. Comparisons between treatments are much more valuable and are usually preferable; treatments are often compared against a scientific control or traditional treatment that acts as a baseline.

2.2.2 Randomization

Random assignment is the process of assigning individuals at random to groups or to different groups in an experiment, so that each individual of the population has the same chance of becoming a participant in the study. The random assignment of individuals to groups (or conditions within a group) distinguishes a rigorous, “true” experiment from an observational study or “quasi-experiment”. There is an extensive body of mathematical theory that explores the consequences of making the allocation of units to treatments by means of some random mechanism (such as tables of random numbers, or the use of randomization devices such as playing cards or dice). Assigning units to treatments at random tends to mitigate confounding, which makes effects due to factors other than the treatment appear to result from the treatment.
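
The random mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than a procedure taken from these notes: the function names `complete_randomization` and `balanced_randomization`, the seeds, and the unit labels are all hypothetical. The first function flips an independent fair coin for each unit; the second shuffles the units and splits the shuffled list in half, so group sizes are equal when the number of units is even.

```python
import random


def complete_randomization(units, seed=0):
    """Assign each unit to 'A' or 'B' by an independent fair coin flip.

    Group sizes are random and may be unequal.
    """
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return {unit: rng.choice("AB") for unit in units}


def balanced_randomization(units, seed=0):
    """Shuffle the units, then give the first half 'A' and the rest 'B'.

    With an odd number of units, group B receives the extra unit.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {unit: ("A" if i < half else "B") for i, unit in enumerate(shuffled)}


if __name__ == "__main__":
    users = [f"user{i}" for i in range(10)]  # hypothetical unit labels
    print(complete_randomization(users, seed=42))
    print(balanced_randomization(users, seed=42))
```

Because neither function looks at any characteristic of the units, no covariate can systematically influence which group a unit lands in; any imbalance that does occur is due to chance alone.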

The risks associated with random allocation (such as having a serious imbalance in a key characteristic between a treatment group and a control group) are calculable and hence can be managed down to an acceptable level by using enough experimental units. However, if the population is divided into several subpopulations that somehow differ, and the research requires each subpopulation to be equal in size, stratified sampling can be used. In that way, the units in each subpopulation are randomized, but not the whole sample. The results of an experiment can be generalized reliably from the experimental units to a larger statistical population of units only if the experimental units are a random sample from the larger population; the probable error of such an extrapolation depends on the sample size, among other things.

2.2.3 Statistical replication

Measurements are usually subject to variation and measurement uncertainty; thus they are repeated and full experiments are replicated to help identify the sources of variation, to better estimate the true effects of treatments, to further strengthen the experiment’s reliability and validity, and to add to the existing knowledge of the topic.

However, certain conditions must be met before the replication of the experiment is commenced: the original research question has been published in a peer-reviewed journal or widely cited, the researcher is independent of the original experiment, the researcher must first try to replicate the original findings using the original data, and the write-up should state that the study conducted is a replication study that tried to follow the original study as strictly as possible.

2.2.4 Blocking

Blocking is the non-random arrangement of experimental units into groups (blocks) consisting of units that are similar to one another.
Blocking reduces known but irrelevant sources of variation between units and thus allows greater precision in the estimation of the source of variation under study.

2.2.5 Orthogonality

Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and efficiently carried out. Contrasts can be represented by vectors, and sets of orthogonal contrasts are uncorrelated and independently distributed if the data are normal. Because of this independence, each orthogonal treatment provides different information to the others. If there are T treatments and T − 1 orthogonal contrasts, all the information that can be captured from the experiment is obtainable from the set of contrasts.

Example 2.1. Measurement error: We would like to measure the weight of a subject A by using a scale. We know that the scale has measurement error. Suppose that the error follows a normal distribution with mean 0 and variance $\sigma^2$. Mathematically, we may write
$w_1 = A + e_1$,
where $A$ is the true weight, $w_1$ is the observed weight and $e_1$ is the measurement error.

Figure 1: A scale to measure subject A

Now we would like to measure the weights of two subjects A and B by using the same scale twice. What should we do?

Method 1: $w_1 = A + e_1$ and $w_2 = B + e_2$.

Figure 2: Subject B

Method 2: $w_3 = A + B + e_3$ and $w_4 = A - B + e_4$.

Figure 3: A + B
Figure 4: A - B

The measurement errors:
Method 1: Subject A: $e_1 \sim N(0, \sigma^2)$. Subject B: $e_2 \sim N(0, \sigma^2)$.
Method 2: A is estimated by $(w_3 + w_4)/2$ and B by $(w_3 - w_4)/2$, so for independent errors, Subject A: $(e_3 + e_4)/2 \sim N(0, \sigma^2/2)$. Subject B: $(e_3 - e_4)/2 \sim N(0, \sigma^2/2)$.
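
The halving of the variance under Method 2 follows from $\mathrm{Var}\big((e_3 \pm e_4)/2\big) = (\sigma^2 + \sigma^2)/4 = \sigma^2/2$ when $e_3$ and $e_4$ are independent. The short simulation below is one way to check this numerically. It is a sketch under stated assumptions: the true weights (70 and 55), $\sigma = 1$, the number of repetitions, the seed, and the function name `simulate_weighing` are all hypothetical choices made for illustration.

```python
import random
from statistics import variance


def simulate_weighing(a_true=70.0, b_true=55.0, sigma=1.0, reps=20000, seed=1):
    """Compare the two weighing methods, each using two readings of the scale.

    Returns the sample variance of each estimator across `reps` repetitions.
    Errors are drawn independently from N(0, sigma^2).
    """
    rng = random.Random(seed)
    m1_a, m1_b, m2_a, m2_b = [], [], [], []
    for _ in range(reps):
        # Method 1: weigh A alone, then B alone.
        w1 = a_true + rng.gauss(0, sigma)
        w2 = b_true + rng.gauss(0, sigma)
        m1_a.append(w1)
        m1_b.append(w2)
        # Method 2: weigh the sum A + B, then the difference A - B.
        w3 = a_true + b_true + rng.gauss(0, sigma)
        w4 = a_true - b_true + rng.gauss(0, sigma)
        m2_a.append((w3 + w4) / 2)  # estimate of A
        m2_b.append((w3 - w4) / 2)  # estimate of B
    return {
        "method1_A": variance(m1_a),
        "method1_B": variance(m1_b),
        "method2_A": variance(m2_a),
        "method2_B": variance(m2_b),
    }


if __name__ == "__main__":
    for name, var in simulate_weighing().items():
        print(f"{name}: variance = {var:.3f}")
```

With $\sigma = 1$ the Method 1 variances should come out near 1 and the Method 2 variances near 0.5, even though both methods use exactly two readings. The gain comes from the design: each reading in Method 2 carries information about both subjects.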

Use of factorial experiments instead of the one-factor-at-a-time method. These are efficient at evaluating the effects and possible interactions of several factors (independent variables). Analysis of experiment design is built on the foundation of the analysis of variance, a collection of models that partition the observed variance into components, according to what factors the experiment must estimate or test.

2.2.6 Avoiding false positives

False positive conclusions, often resulting from the pressure to publish or the author’s own confirmation bias, are an inherent hazard in many fields. A good way to prevent biases potentially leading to false positives in the data collection phase is to use a double-blind design. When a double-blind design is used, participants are randomly assigned to experimental groups but the researcher is unaware of which participants belong to which group. Therefore, the researcher cannot affect the participants’ response to the intervention.

Experimental designs with undisclosed degrees of freedom are a problem. This can lead to conscious or unconscious “p-hacking”: trying multiple things until you get the desired result. It typically involves the manipulation (perhaps unconscious) of the process of statistical analysis and the degrees of freedom until they return a figure below the p < .05 level of statistical significance.

So the design of the experiment should include a clear statement proposing the analyses to be undertaken. P-hacking can be prevented by preregistering studies, in which researchers have to send their data analysis plan to the journal they wish to publish their paper in before they even start their data collection, so no data manipulation is possible.
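
To see why “trying multiple things” inflates false positives, consider an A/B experiment with no true effect in which k independent outcomes are each tested at the 0.05 level. The chance that at least one outcome looks significant is $1 - 0.95^k$, which is about 0.40 for k = 10. The simulation below is a rough sketch of this calculation, not an analysis from these notes. It assumes normally distributed outcomes with known variance 1 (so a z-test applies) and independence between outcomes; the function names, sample size, and seed are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_experiment_p(rng, n):
    """p-value from one A/B comparison in which there is NO true effect.

    Both groups have n observations drawn from N(0, 1); variance is treated
    as known, so the difference in means is tested with a z-test.
    """
    mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
    mean_b = sum(rng.gauss(0, 1) for _ in range(n)) / n
    z = (mean_a - mean_b) / math.sqrt(2 / n)
    return two_sided_p(z)


def false_positive_rate(k, reps=2000, n=20, alpha=0.05, seed=2):
    """Share of null experiments where at least one of k outcomes has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        smallest_p = min(null_experiment_p(rng, n) for _ in range(k))
        if smallest_p < alpha:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for k in (1, 5, 10):
        simulated = false_positive_rate(k)
        theoretical = 1 - 0.95 ** k
        print(f"k={k:2d}  simulated={simulated:.3f}  theoretical={theoretical:.3f}")
```

A preregistered analysis plan fixes the outcome and the test in advance, which corresponds to k = 1 here and keeps the false positive rate at the nominal level.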

Another way to prevent p-hacking is to take the double-blind design to the data-analysis phase, where the data are sent to a data analyst unrelated to the research who scrambles the data so there is no way to know which group participants belong to before they are potentially removed as outliers.

2.2.7 Causal attributions

In the pure experimental design, the independent (predictor) variable is manipulated by the researcher; that is, every participant of the research is chosen randomly from the population, and each participant chosen is assigned randomly to conditions of the independent variable. Only when this is done is it possible to certify with high probability that the differences in the outcome variables are caused by the different conditions. Therefore, researchers should choose the experimental design over other design types whenever possible.

However, the nature of the independent variable does not always allow for manipulation. In those cases, researchers must take care not to claim causal attribution when their design doesn’t allow for it. For example, in observational designs, participants are not assigned randomly to conditions, and so if there are differences found in outcome variables between conditions, it is likely that there is something other than the differences between the conditions that causes the differences in outcomes, that is, a third variable. The same goes for studies with correlational design (Adér & Mellenbergh, 2008).

2.2.8 Statistical control

It is best that a process be in reasonable statistical control prior to conducting designed experiments. When this is not possible, proper blocking, replication, and randomization allow for the careful conduct of designed experiments.
To control for nuisance variables, researchers institute control checks as additional measures. Investigators should ensure that uncontrolled influences (e.g., source credibility perception) do not skew the findings of the study. A manipulation check is one example of a control check. Manipulation checks allow investigators to isolate the chief variables to strengthen support that these variables are operating as planned.

One of the most important requirements of experimental research designs is the necessity of eliminating the effects of spurious, intervening, and antecedent variables. In the most basic model, cause (X) leads to effect (Y). But there could be a third variable (Z) that influences (Y), and X might not be the true cause at all. Z is said to be a spurious variable and must be controlled for. The same is true for intervening variables (a variable in between the supposed cause (X) and the effect (Y)), and antecedent variables (a variable prior to the supposed cause (X) that is the true cause). When a third variable is involved and has not been controlled for, the relation is said to be a zero order relationship. In most practical applications of experimental research designs there are several causes (X1, X2, X3). In most designs, only one of these causes is manipulated at a time.

2.3 Experimental designs after Fisher

Some efficient designs for estimating several main effects were found independently and in near succession by Raj Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute, but remained little known until the Plackett–Burman designs were published in Biometrika in 1946. About the same time, C. R. Rao introduced the concepts of orthogonal arrays as experimental designs. This concept played a central role in the development of Taguchi methods by Genichi Taguchi, which took place during his visit to the Indian Statistical Institute in the early 1950s.
His methods were successfully applied and adopted by Japanese and Indian industries and subsequently were also embraced by US industry, albeit with some reservations.

In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental Designs, which became the major reference work on the design of experiments for statisticians for years afterwards. Developments of the theory of linear models have encompassed and surpassed the cases that concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra and combinatorics.

As with other branches of statistics, experimental design is pursued using both frequentist and Bayesian approaches: in evaluating statistical procedures like experimental designs, frequentist statistics studies the sampling distribution while Bayesian statistics updates a probability distribution on the parameter space.

Some important contributors to the field of experimental designs are C. S. Peirce, R. A. Fisher, F. Yates, R. C. Bose, A. C. Atkinson, R. A. Bailey, D. R. Cox, G. E. P. Box, W. G. Cochran, W. T. Federer, V. V. Fedorov, A. S. Hedayat, J. Kiefer, O. Kempthorne, J. A. Nelder, Andrej Pázman, Friedrich Pukelsheim, D. Raghavarao, C. R. Rao, S. S. Shrikhande, J. N. Srivastava, William J. Studden, G. Taguchi and H. P. Wynn.

The textbooks of D. Montgomery, R. Myers, and G. Box/W. Hunter/J. S. Hunter have reached generations of students and practitioners. Some discussion of experimental design in the context of system identification (model building for static or dynamic models) is given in [35] and [36].

2.4 Sequences of experiments

The use of a sequence of experiments, where the design of each may depend on the results of previous experiments, including the possible decision to stop experimenting, is within the scope of sequential analysis, a field that was pioneered by Abraham Wald in the context of sequential tests of statistical hypotheses. Herman Chernoff wrote an overview of optimal sequential designs, while adaptive designs have been surveyed by S. Zacks. One specific type of sequential design is the “two-armed bandit”, generalized to the multi-armed bandit, on which early work was done by Herbert Robbins in 1952.

2.5 Human participant constraints

Laws and ethical considerations preclude some carefully designed experiments with human subjects. Legal constraints are dependent on jurisdiction. Constraints may involve institutional review boards, informed consent and confidentiality affecting both clinical (medical) trials and behavioral and social science experiments.[37] In the field of toxicology, for example, experimentation is performed on laboratory animals with the goal of defining safe exposure limits for humans. Balancing the constraints are views from the medical field.[39] Regarding the randomization of patients, “... if no one knows which therapy is better, there is no ethical imperative to use one therapy or another.” (p 380) Regarding experimental design, “... it is clearly not ethical to place subjects at risk to collect data in a poorly designed study when this situation can be easily avoided ...”. (p 393)

2.6 Some important issues to design experiments

Clear and complete documentation of the experimental methodology is also important in order to support replication of results.
Discussion topics when setting up an experimental design

An experimental design or randomized clinical trial requires careful consideration of several factors before actually doing the experiment. An experimental design is the laying out of a detailed experimental plan in advance of doing the experiment. Some of the following topics have already been discussed in the principles of experimental design section:

1) How many factors does the design have, and are the levels of these factors fixed or random?
2) Are control conditions needed, and what should they be?
3) Manipulation checks: did the manipulation really work?
4) What are the background variables?
5) What is the sample size? How many units must be collected for the experiment to be generalisable and have enough power?
6) What is the relevance of interactions between factors?
7) What is the influence of delayed effects of substantive factors on outcomes?
8) How do response shifts affect self-report measures?
9) How feasible is repeated administration of the same measurement instruments to the same units at different occasions, with a post-test and follow-up tests?
10) What about using a proxy pretest?
11) Are there lurking variables?
12) Should the client/patient, researcher or even the analyst of the data be blind to conditions?
13) What is the feasibility of subsequent application of different conditions to the same units?
14) How many of each control and noise factors should be taken into account?
15) How to deal with missing values?
16) What are the good matrices?
...

The independent variable of a study often has many levels or different groups.
In a true experiment, researchers can have an experimental group, which is where the intervention testing the hypothesis is implemented, and a control group, which has all the same elements as the experimental group except the interventional element. Thus, when everything else except for one intervention is held constant, researchers can conclude with some certainty that this one element is what caused the observed change. In some instances, having a control group is not ethical. This is sometimes solved by using two different experimental groups. In some cases, independent variables cannot be manipulated, for example when testing the difference between two groups who have a different disease, or testing the difference between genders (variables that would be hard or unethical to assign participants to). In these cases, a quasi-experimental design may be used.

3 A/B tests (Randomized Control Studies) in clinical trials

3.1 Drug development

Drug development is a complex and lengthy process that takes 7 to 15 years for a single drug, at a cost that may reach hundreds of millions of dollars. There are three main parts of the drug development process:
• Discovery and decision;
• Preclinical studies;
• Clinical studies.

Discovery and Decision

The process starts with the discovery of a new compound or of a new potential application of an existing compound. Based on adequate results, the decision whether to develop the drug is then made.

Preclinical Studies

The initial toxicology of the compound is studied in animals. Initial formulation of the drug and specific or comprehensive pharmacological studies in animals are also performed at this stage.
At the end of the preclinical studies, the evidence of potential safety and effectiveness of the drug is assessed by the company. To proceed further, a US-based company needs to file a Notice of Claimed Investigational New Drug Exemption (to allow the company to conduct studies on human subjects).

Clinical Studies

At this point there is sufficient evidence that the drug may be of benefit to human subjects. Testing the drug in human subjects is the next step.

Phase I clinical trial: To establish the initial safety information about the effect of the drug on humans, such as the range of acceptable dosages and the pharmacokinetics of the drug. These studies are normally conducted with healthy volunteers. The number of subjects typically varies between 4 and 20 per study, with up to 100 subjects in total used over the course of Phase I trials.

Phase II clinical trial: These studies are conducted on patients who will potentially benefit from the new drug. Effective dose ranges and initial effects of the drug on these patients are assessed. Up to several hundred patients are usually selected in Phase II trials.

Phase III clinical trial: Phase III studies provide assessment of safety, efficacy, and optimum dosage. These studies are designed with control and treatment groups. Usually hundreds or even thousands of patients are involved in Phase III trials. Based on successful results obtained from these studies, the company can then submit an NDA (New Drug Application). The application contains the results from all three stages (from discovery to Phase III) and is reviewed by the FDA. The FDA review panel of the NDA consists of reviewers in the following areas: medicine, pharmacology, biopharmaceutics, chemistry, and statistics.
Phase IV: Postmarket activities. Follow-up studies are conducted to examine the long-term effects of the drug. The main purpose of these studies is to ensure that all claims made by the company about the new drug can be substantiated by so-called "clinical evidence". All reported adverse effects must also be investigated by the company and, in some cases, the drug may need to be withdrawn from the market.

Statistician's Responsibilities:
• Participate in the development plan for studying a drug.
• Study design and protocol development. Randomization schemes.
• Data cleaning and database construction format.
• Analysis plan and program development for analysis.
• Report preparation. Produce tables and figures.
• Integrate clinical study results, safety and efficacy reports.
• Communication and NDA defense to FDA review panel.
• Publication support and consulting with other company personnel.

Example 3.1. HIV transmission. Connor et al. (1994, The New England Journal of Medicine) report a clinical trial to evaluate the drug AZT in reducing the risk of maternal-infant HIV transmission. A 50-50 randomization scheme was used:
• AZT group (A): 239 pregnant women (20 HIV-positive infants).
• Placebo group (B): 238 pregnant women (60 HIV-positive infants).

Given the seriousness of the outcome of this study, it is reasonable to argue that 50-50 allocation was unethical. As accruing information favoring (albeit not conclusively) the AZT treatment became available, allocation probabilities should have been shifted away from 50-50, in proportion to the weight of evidence for AZT. Designs which attempt to do this are called response-adaptive designs (response-adaptive randomization).
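As a rough illustration of what shifting the allocation would mean in Example 3.1, the sketch below computes the expected number of HIV-positive infants for any split of the 477 mothers. It assumes, purely for illustration, that the true transmission rates equal the observed ones (20/239 on AZT, 60/238 on placebo); the function name and this plug-in assumption are ours and are not part of the original trial analysis.

```python
# Plug-in estimate of expected HIV-positive infants under a given allocation.
# Assumption: the true transmission rates equal the rates observed in the trial.
AZT_RATE = 20 / 239      # observed transmission rate on AZT
PLACEBO_RATE = 60 / 238  # observed transmission rate on placebo


def expected_hiv_positive(n_azt: int, n_placebo: int) -> float:
    """Expected number of HIV-positive infants for an (n_azt, n_placebo) split."""
    return n_azt * AZT_RATE + n_placebo * PLACEBO_RATE


if __name__ == "__main__":
    # The equal allocation actually used, and a split with about 3/4 on AZT.
    for n_azt, n_placebo in [(239, 238), (360, 117)]:
        print(n_azt, n_placebo, round(expected_hiv_positive(n_azt, n_placebo), 1))
```

Under this simplification the expected count depends only on how many mothers receive each arm, which is why moving patients toward the better arm lowers it; the price is paid in power, which this sketch does not compute.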
If the treatment assignments had been made with the DBCD (Hu and Zhang, 2004, Annals of Statistics) with the urn target, giving
• AZT group: 360 patients
• Placebo group: 117 patients
then only 60 (instead of 80) infants would be HIV positive.

Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416   61        0.90     50

(EA: equal allocation, as in the actual trial.)

Example 3.2 (ECMO trial). Extracorporeal membrane oxygenation (ECMO) is an external system for oxygenating the blood based on techniques used in cardiopulmonary bypass technology developed for cardiac surgery. In the literature, there are three well-documented clinical trials evaluating the clinical effectiveness of ECMO: (i) the Michigan ECMO study (Bartlett et al., 1985); (ii) the Boston ECMO study (Ware, 1989); (iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

Example 3.2 (Continued): Michigan ECMO trial using the RPW rule. The RPW rule was used in a clinical trial of extracorporeal membrane oxygenation (ECMO; Bartlett et al., 1985, Pediatrics). Total: 12 patients.
• ECMO group: 11 patients, all survived.
• Conventional therapy: 1 patient, died.

3.2 Determining the Sample Size

In the planning stages of a randomized clinical trial, it is necessary to determine the number of subjects (sample size) to be randomized. For two treatments (A and B), say $n = n_A + n_B$. We assume here that the allocation proportions are known in advance, that is, $n_A/n = \rho$ and $n_B/n = 1 - \rho$ are predetermined.

Examples of sample size (SS) calculations.
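One common way to carry out such a calculation is sketched below. It rests on assumptions that go beyond the text above: binary responses, a one-sided Wald test of $H_0: p_A = p_B$ at level $\alpha$, and the normal approximation $n = (z_{1-\alpha} + z_{1-\beta})^2 \, [p_A q_A/\rho + p_B q_B/(1-\rho)] / (p_A - p_B)^2$. The function name `total_sample_size` and the example values are illustrative only.

```python
import math
from statistics import NormalDist


def total_sample_size(p_a, p_b, rho=0.5, alpha=0.025, power=0.9):
    """Total n = nA + nB for a one-sided Wald test of two proportions.

    rho is the predetermined allocation proportion nA / n.
    Uses the normal approximation; treat the result as a planning figure only.
    """
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    if p_a == p_b:
        raise ValueError("p_a and p_b must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)       # quantile for the target power
    # Variance of the difference in sample proportions, scaled by n.
    variance = p_a * (1 - p_a) / rho + p_b * (1 - p_b) / (1 - rho)
    n = (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    n = total_sample_size(0.7, 0.5, rho=0.5)
    n_a = math.ceil(0.5 * n)
    print("total:", n, "on A:", n_a, "on B:", n - n_a)
```

Changing `rho` shows how an unequal allocation inflates the required total for the same power, which is the trade-off that the allocation targets later in these notes have to manage.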
3.3 Mathematical Framework of Randomization Procedures

Suppose we compare two treatments A and B. Let $T_1, \dots, T_n$ be a sequence of random treatment assignments, where $T_i = 1$ if patient $i$ is assigned to treatment A and $T_i = 0$ if patient $i$ is assigned to treatment B. Then $N_A(n) = \sum_{i=1}^{n} T_i$ is the number of patients on A, and $N_B(n) = n - N_A(n)$.

$X_1, \dots, X_n$: response variables, where $X_i$ represents the sequence of responses that would be observed if each treatment were assigned to the $i$-th patient independently.

$Z_1, \dots, Z_n$: covariates, where $Z_i$ represents the covariates of the $i$-th patient.

When the $(i+1)$-th patient is ready to be randomized in a clinical trial, the following information is available:
• patients' assignments: $T_1, \dots, T_i$;
• responses: $X_1, \dots, X_i$ (assuming immediate responses);
• patients' covariates: $Z_1, \dots, Z_i$ and $Z_{i+1}$.

Let $\mathcal{T}_n = \sigma\{T_1, \dots, T_n\}$ be the sigma-algebra generated by the first $n$ treatment assignments. Let $\mathcal{X}_n = \sigma\{X_1, \dots, X_n\}$ be the sigma-algebra generated by the first $n$ responses. Let $\mathcal{Z}_n = \sigma\{Z_1, \dots, Z_n\}$ be the sigma-algebra generated by the first $n$ covariate vectors. Let $\mathcal{F}_n = \mathcal{T}_n \otimes \mathcal{X}_n \otimes \mathcal{Z}_{n+1}$.

A randomization procedure is defined by $\phi_n = E(T_n \mid \mathcal{F}_{n-1})$, where $\phi_{n+1}$ is $\mathcal{F}_n$-measurable. We can describe $\phi_n$ as the conditional probability of assigning treatments $1, \dots, K$ to the $n$-th patient, conditional on the previous $n-1$ assignments, responses, and covariate vectors, and the current patient's covariate vector.

We can describe five types of randomization procedures:
• (i) complete randomization if $\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n)$; uses no information.
• (ii) restricted randomization if $\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1})$; uses only information on patients' assignments.
• (iii) response-adaptive randomization if $\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1})$; uses information on patients' assignments and responses.
• (iv) covariate-adaptive randomization if $\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{Z}_n)$; uses information on patients' assignments and covariates.
• (v) covariate-adjusted response-adaptive (CARA) randomization if $\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1}, \mathcal{Z}_n)$; uses all available information.

3.4 Complete randomization

The simplest form of a randomization procedure is complete randomization:
$E(T_i \mid T_1, \dots, T_{i-1}) = P(T_i = 1 \mid T_1, \dots, T_{i-1}) = 1/2, \quad i = 1, \dots, n.$
$N_A(n)$ has a binomial$(n, 1/2)$ distribution. This procedure is rarely used in practice because of the nonnegligible probability of treatment imbalances in moderate samples.

3.5 Restricted randomization

Truncated binomial design: Complete randomization is used until $n/2$ patients have been assigned to A or B; the remainder is then filled with the opposite treatment with probability 1. Here the procedure is given by
$\phi_i = 1/2$, if $\max\{N_A(i-1), N_B(i-1)\} < n/2$,
$\phi_i = 0$, if $N_A(i-1) = n/2$,
$\phi_i = 1$, if $N_B(i-1) = n/2$.

Blocked Procedures: Because we do not know $n$ exactly in advance, we typically require overrunning of the randomization sequence. Forced balance designs are therefore typically used in blocks.
• Permuted block design: Blocks of even size $2b$ are filled using either a random allocation rule or a truncated binomial design.
• The maximum imbalance is $b$, and the only possibility of a terminal imbalance occurs if the last block is unfilled. Every block has at least one deterministic assignment.
• Random block design: Blocks of size 2, 4, 6, ..., 2K are randomly selected and equiprobable.

Efron's biased coin design (BCD) (Efron, 1971): Let $D_i = N_A(i) - N_B(i)$ be the imbalance between treatments A and B. Define a constant $p \in (0.5, 1]$. Then the procedure is given by
$\phi_i = 1/2$, if $D_{i-1} = 0$,
$\phi_i = p$, if $D_{i-1} < 0$,
$\phi_i = 1 - p$, if $D_{i-1} > 0$.
Efron suggested that $p = 2/3$ might be a reasonable value (without justification).

Many other designs have been proposed and studied in the literature (Smith's design (1984), Wei's design (1978), the Big Stick design (Soares and Wu, 1982), etc.). When $n = 50$, $Var(D_n) = 49.92$ under complete randomization and $Var(D_n) = 4.36$ under Efron's BCD with $p = 2/3$ (based on 100,000 replications).

3.6 Selection Bias

Selection bias refers to biases that are introduced into an unmasked study because an investigator may be able to guess the treatment assignment of future patients based on knowing the treatments assigned to past patients. Patients usually enter a trial sequentially over time. The great clinical trialist Chalmers (1990) was convinced that the elimination of selection bias is the most essential requirement for a good clinical trial.

How to measure the selection bias?

Blackwell and Hodges (1957), Berger, Ivanova and Knoll (2003) and others have suggested using the predictability of a randomization sequence to measure the selection bias. One measure of the predictability of a randomization sequence is given by
$P_{pred} = \frac{\sum_{i=1}^{n} |E\phi_i - 0.5|}{n}.$

Selection bias of different designs.

4 Response-adaptive randomization procedures
4.1 Historical notes

Adaptive designs in the clinical trials context were first formulated as solutions to optimal decision-making questions:
• Which treatment is better?
• What sample size should be used before determining a "better" treatment to maximize the total number receiving the better treatment?
• How do we incorporate prior data or accruing data into these decisions?

The preliminary ideas can be traced back to Thompson (1933, Biometrika) and Robbins (1952, Bulletin of the American Mathematical Society) and led to a flurry of work in the 1960s by Anscombe (1963, JASA), Colton (1963, JASA), Zelen (1969, JASA) and Cornfield, Halperin, and Greenhouse (1969, Annals of Mathematical Statistics), among others.

4.2 Play-the-winner rule

Perhaps the simplest of these adaptive designs is the play-the-winner rule, originally explored by Robbins (1952, Bulletin of the American Mathematical Society) and later by Zelen (1969, JASA).

Binary response: treatments A and B.
• $p_A = P(\text{success} \mid A)$, $q_A = 1 - p_A$;
• $p_B = P(\text{success} \mid B)$, $q_B = 1 - p_B$;
• $N_A(n)$: number of patients on A;
• $N_B(n)$: number of patients on B, $n = N_A(n) + N_B(n)$.

Play-the-winner rule:
• a success on one treatment results in the next patient's assignment to the same treatment;
• a failure on one treatment results in the next patient's assignment to the opposite treatment.
That is,
• $\phi_n = 1$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 1$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 0$;
• $\phi_n = 0$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 0$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 1$.

The properties of the play-the-winner rule?
• What is the proportion of patients on treatment A: $N_A(n)/n \to\ ???$
• What is the variance (variability) of the allocation: $Var(N_A(n))$ or $Var(N_A(n)/n) =\ ???$
• What is the distribution of the allocation: $\sqrt{n}\,(N_A(n)/n -\ ???) \to\ ???$

We have
• The proportion of patients on treatment A: $\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}$.
• The variance (variability) of the allocation: $Var(N_A(n)) = \frac{n\, q_A q_B (p_A + p_B)}{(q_A + q_B)^3}$.
• The distribution of the allocation: $\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\!\left(0, \frac{q_A q_B (p_A + p_B)}{(q_A + q_B)^3}\right)$.

Advantages:
• more patients on the better treatment;
• intuitively attractive.
Disadvantages:
• not a randomized procedure;
• not based on any optimality criterion.

4.2.1 Randomized play-the-winner rule

The randomized play-the-winner (RPW) rule (Wei and Durham, 1978, JASA) has been the most-studied urn model in the literature.

Binary response: treatments A and B.
• $p_A = P(\text{success} \mid A)$, $q_A = 1 - p_A$;
• $p_B = P(\text{success} \mid B)$, $q_B = 1 - p_B$;
• $N_A(n)$: number of patients on A;
• $N_B(n)$: number of patients on B, $n = N_A(n) + N_B(n)$.

Begin with $c$ balls of type A and $c$ balls of type B in an urn.
• Draw A:
  – assign patient to A;
  – replace ball;
  – add 1 type A ball if treatment A is a success;
  – add 1 type B ball if treatment A is a failure.
• Draw B:
  – assign patient to B;
  – replace ball;
  – add 1 type B ball if treatment B is a success;
  – add 1 type A ball if treatment B is a failure.

When the $(i+1)$-th patient is ready to be randomized in a clinical trial, the following information is available:
• patients' assignments: $T_1, \dots, T_i$;
• responses: $X_1, \dots, X_i$ (assuming immediate responses).
Then
• $\phi_1 = 1/2$.
• $\phi_2 = \frac{c + T_1 X_1 + (1 - T_1)(1 - X_1)}{2c + 1}$.
• $\phi_{i+1} = \frac{c + \sum_{j=1}^{i} [T_j X_j + (1 - T_j)(1 - X_j)]}{2c + i}$.

Properties:
• Calculate $E N_A(n)$;
• Simulated results.

We have
• The limiting proportion of patients on treatment A: $\frac{N_A(n)}{n} \to \frac{q_B}{q_A + q_B}$.
• The variance (variability) of the allocation (when $q_A + q_B > 1/2$): $Var(N_A(n)) \approx \frac{n\, q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}$.
• The asymptotic distribution of the allocation (when $q_A + q_B > 1/2$): $\sqrt{n}\left(\frac{N_A(n)}{n} - \frac{q_B}{q_A + q_B}\right) \to N\!\left(0, \frac{q_A q_B (1 + 2(p_A + p_B))}{(q_A + q_B)^2 (2(q_A + q_B) - 1)}\right)$.

Table 1: Asymptotic and simulated mean and variance (multiplied by n) of the allocation proportions $N_A(n)/n$ for the randomized play-the-winner (RPW) rule. Simulations based on n = 100 and 1000 replications. From Hu and Rosenberger (2003), reprinted by permission from the American Statistical Association. (A = asymptotic, S = simulated.)

(pA, pB)     mean (A)   mean (S)   var (A)   var (S)
(0.8, 0.8)   0.50       0.50       N/A       2.29
(0.8, 0.7)   0.60       0.57       N/A       1.90
(0.7, 0.5)   0.63       0.61       1.33      0.90
(0.7, 0.3)   0.70       0.68       0.63      0.51
(0.5, 0.5)   0.50       0.50       0.75      0.65
(0.5, 0.2)   0.62       0.61       0.35      0.34
(0.2, 0.2)   0.50       0.50       0.20      0.19

Urn models:
• Play-the-winner (PW) rule (Zelen, 1969, JASA); randomized play-the-winner rule (Wei and Durham, 1978, JASA).
• Generalized Friedman's urn models (Wei, 1979, JASA; Smythe, 1996, Stochastic Process. Appl.; Bai, Hu and Shen, 2002, JMVA).
• Randomized Polya urn (Durham, Flournoy, and Li, 1998, Canadian J. of Statistics); ternary urn (Ivanova and Flournoy, 2001).
• Drop-the-loser rule (Ivanova, 2003, Metrika); generalized drop-the-loser rule (Zhang, Chan, Cheung and Hu, 2007, Statistica Sinica).
• Sequential estimated urn (Zhang, Hu and Cheung, 2006, Annals of Applied Probability).
• Urn models with immigration balls (Zhang, Hu, Cheung and Chan, 2011, Annals of Statistics).

4.3 Relationship Between Power and Variability

Example 3.2. ECMO trial (the UK trial). Extracorporeal membrane oxygenation (ECMO) is an external system for oxygenating the blood based on techniques used in cardiopulmonary bypass technology developed for cardiac surgery.
In the literature, there are three well-documented clinical trials evaluating the clinical effectiveness of ECMO: (i) the Michigan ECMO study (Bartlett et al., 1985); (ii) the Boston ECMO study (Ware, 1989); (iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

Example 3.2 (Continued): Michigan ECMO trial using the RPW rule. The RPW rule was used in a clinical trial of extracorporeal membrane oxygenation (ECMO; Bartlett et al., 1985, Pediatrics). Total: 12 patients.
• ECMO group: 11 patients, all survived.
• Conventional therapy: 1 patient, died.
Is this trial valid? No statistical conclusion can be drawn. Why? Power and variability.

Power is an increasing function of the noncentrality parameter. For the one-sided test
$H_0: p_A = p_B$ vs $H_1: p_A > p_B$,
the corresponding test statistic is
$T = \frac{\hat p_A - \hat p_B}{\sqrt{\hat p_A (1 - \hat p_A)/N_A(n) + \hat p_B (1 - \hat p_B)/N_B(n)}}.$
We can calculate the noncentrality parameter as follows:
$\frac{(p_A - p_B)^2}{p_A q_A / N_A(n) + p_B q_B / N_B(n)}.$

Assuming $N_A(n)/n \to \rho$ in probability, we can rewrite this as: See details in class.

4.4 Lower bound of the variability

Hu, Rosenberger and Zhang (2006) considered "Asymptotically best response-adaptive randomization procedures" in the Journal of Statistical Planning and Inference. See details in class.

4.5 Doubly-Adaptive Biased Coin Design

Doubly-adaptive biased coin design (DBCD) (Eisele and Woodroofe, 1995, Annals of Statistics; Hu and Zhang, 2004, Annals of Statistics). Let $g$ be a function from $[0, 1] \times [0, 1]$ to $[0, 1]$ satisfying certain conditions. The procedure then allocates patient $j$ to treatment A with probability
$g\!\left(\frac{n_A(j-1)}{j-1}, \hat\rho_{j-1}\right).$
How to choose the function $g$?
Hu and Zhang (2004) proposed, for $\gamma \ge 0$,
$g(x, \rho) = \frac{\rho(\rho/x)^{\gamma}}{\rho(\rho/x)^{\gamma} + (1 - \rho)((1 - \rho)/(1 - x))^{\gamma}}.$
• $\gamma = 0$: $g(x, \rho) = \rho$ (the SMLE);
• $\gamma = \infty$: deterministic design.

Let $\lambda = \partial g/\partial x\,|_{(\rho,\rho)}$, $\eta = \partial g/\partial y\,|_{(\rho,\rho)}$ and $\nabla(\rho) = \left(\frac{\partial \rho}{\partial p_A}, \frac{\partial \rho}{\partial p_B}\right)'$. Also let $\sigma_3^2 = (\nabla(\rho)|_{\Theta})'\, V\, \nabla(\rho)|_{\Theta}$ and $\sigma_1^2 = \rho(1 - \rho)$, where $\Theta = (p_A, p_B)$ and $V = \mathrm{diag}\!\left(\frac{Var(\xi_A)}{\rho}, \frac{Var(\xi_B)}{1 - \rho}\right)$.

Theorem. Under widely satisfied conditions,
$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2) \qquad (1)$
in distribution, where
$\sigma^2 = \frac{\sigma_1^2}{1 - 2\lambda} + \frac{2\eta^2 \sigma_3^2}{(1 - \lambda)(1 - 2\lambda)}.$
Main techniques used: martingales, Gaussian approximation and matrix theory.

Example 4.1. Binary response: treatments A and B.
• $p_A = P(\text{success} \mid A)$, $q_A = 1 - p_A$;
• $p_B = P(\text{success} \mid B)$, $q_B = 1 - p_B$;
• $n_A$: number of patients on A;
• $n_B$: number of patients on B, $n = n_A + n_B$.

To see how this procedure works in practice, we look at a simple illustration with $\gamma = 2$.
• Suppose we have already assigned 9 patients, 5 to A and 4 to B.
• We have observed a success rate of $\hat p_A = 3/5$ on A and $\hat p_B = 1/4$ on B.

If the target allocation is the urn allocation, $q_B/(q_A + q_B)$ (Wei and Durham, 1978), then
• the estimated target allocation is $\hat\rho = \frac{3/4}{2/5 + 3/4} = 0.652$;
• the actual allocation proportion is $5/9 = 0.556$;
• the probability of assigning the 10th patient to treatment A is computed as
$P(A) = \frac{0.652(0.652/0.556)^2}{0.652(0.652/0.556)^2 + 0.348(0.348/0.444)^2} = 0.807.$

If we are interested in the optimal allocation, $\sqrt{p_A}/(\sqrt{p_A} + \sqrt{p_B})$ (Rosenberger et al., 2001), then
• the estimated target allocation is $\hat\rho = \frac{\sqrt{3/5}}{\sqrt{3/5} + \sqrt{1/4}} = 0.6077$;
• the actual allocation proportion is $5/9 = 0.556$;
• the probability of assigning the 10th patient to treatment A is computed as
$P(A) = \frac{0.6077(0.6077/0.556)^2}{0.6077(0.6077/0.556)^2 + 0.3923(0.3923/0.444)^2} = 0.704.$

For binary responses with $\rho = q_B/(q_A + q_B)$ (below, subscripts 1 and 2 denote treatments A and B),
$n^{1/2}(n_A/n - \rho) \to N(0, \sigma^2_{DBCD})$
in distribution whenever $\lambda < 1/2$, where
$\sigma^2_{DBCD} = \frac{q_1 q_2}{(1 - 2\lambda)(q_1 + q_2)^2} + \frac{2\eta^2}{(1 - \lambda)(1 - 2\lambda)} \cdot \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3}.$

If
$g(x, \rho) = \frac{\rho(\rho/x)^{\gamma}}{\rho(\rho/x)^{\gamma} + (1 - \rho)((1 - \rho)/(1 - x))^{\gamma}},$
then
$\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3} + \frac{2 q_1 q_2}{(1 + 2\gamma)(q_1 + q_2)^3}.$
• $\gamma = 0$: $\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2 + 2)}{(q_1 + q_2)^3}$.
• $\gamma = \infty$: $\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2)}{(q_1 + q_2)^3}$ (lower bound).
• $\gamma = 2$: $\sigma^2_{DBCD} = \frac{q_1 q_2 (p_1 + p_2 + 0.4)}{(q_1 + q_2)^3}$.

Advantages of the DBCD:
• can target any given allocation $\rho(\theta)$;
• very close to the lower bound, but does NOT ATTAIN the lower bound;
• applies to all types of responses.

4.6 Efficient Response-Adaptive Designs

Hu, Zhang and He (2009, Annals of Statistics) proposed Efficient Response-Adaptive Designs (ERADE), which
• can target any given allocation $\rho(\theta)$;
• ATTAIN the lower bound;
• apply to all types of responses.

The ERADE is analogous to a discretized version of Hu and Zhang's function. For a parameter $\alpha \in (0, 1)$, the procedure allocates the $j$-th patient to treatment A with probability
$\phi_j = 1/2$, if $n_A(j-1)/(j-1) = \hat\rho_{j-1}$,
$\phi_j = \alpha \hat\rho_{j-1}$, if $n_A(j-1)/(j-1) > \hat\rho_{j-1}$,
$\phi_j = 1 - \alpha(1 - \hat\rho_{j-1})$, if $n_A(j-1)/(j-1) < \hat\rho_{j-1}$.

4.7 Revisiting the examples

Example 4.2. HIV transmission (Continued). Connor et al. (1994, The New England Journal of Medicine) report a clinical trial to evaluate the drug AZT in reducing the risk of maternal-infant HIV transmission. A 50-50 randomization scheme was used:
• AZT group: 239 pregnant women (20 HIV-positive infants).
• Placebo group: 238 pregnant women (60 HIV-positive infants).

Here $\hat p_A = 219/239 = 0.916$ and $\hat p_B = 178/238 = 0.748$.
• $\hat p_A + \hat p_B = 1.664 > 1.5$, so the RPW asymptotic results do not apply here.
• DBCD with target allocation $\rho = q_2/(q_1 + q_2)$ and $\gamma = 2$.
• Neyman allocation: maximize the power.
• FPower: fix the power ($\beta = 0.9$) and minimize expected failures.

Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416   61        0.90     50

Example 4.3. ECMO trial (the UK trial). Extracorporeal membrane oxygenation (ECMO) is an external system for oxygenating the blood based on techniques used in cardiopulmonary bypass technology developed for cardiac surgery. In the literature, there are three well-documented clinical trials evaluating the clinical effectiveness of ECMO: (i) the Michigan ECMO study (Bartlett et al., 1985); (ii) the Boston ECMO study (Ware, 1989); (iii) the UK ECMO trial (UK Collaborative ECMO Trials Group, 1996).

The UK ECMO trial: a 50-50 randomization scheme was used:
• ECMO group: 93 infants (28 deaths).
• Conventional group: 92 infants (54 deaths).

• Use $P_1 = 65/93$ and $P_2 = 38/92$ as the estimated success probabilities of the ECMO and the conventional treatment, respectively.
• The ERADE (Hu, Zhang and He, 2009) is evaluated based on 10000 simulations.
• The RPW rule is evaluated based on 10000 simulations.

• On average, there will be about 121 patients on ECMO and 64 patients on the conventional treatment.
• The expected number of deaths is 74, as compared to 82 in the actual trial. The adaptive design utilizes the better treatment more often to save lives.
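A simulation of this kind can be sketched as follows. This is not the authors' code: it assumes the urn target $\rho = q_B/(q_A + q_B)$, immediate responses, $\alpha = 0.5$, and success rates estimated as $(s + 1)/(m + 2)$ so that $\hat\rho$ is defined before any responses are observed. All of these choices, and the function names, are illustrative assumptions; the allocation rule itself follows the ERADE definition stated in Section 4.6.

```python
import random


def simulate_erade(p_a, p_b, n, alpha=0.5, rng=None):
    """Simulate one trial of n patients under the ERADE rule; return N_A(n).

    Assumptions (not from the source): urn target rho = qB / (qA + qB),
    immediate responses, and smoothed estimates (s + 1) / (m + 2).
    """
    rng = rng or random.Random()
    n_arm = {"A": 0, "B": 0}      # patients assigned so far
    successes = {"A": 0, "B": 0}  # successes observed so far
    for j in range(1, n + 1):
        # Estimated failure probabilities and estimated target allocation.
        q_a = 1 - (successes["A"] + 1) / (n_arm["A"] + 2)
        q_b = 1 - (successes["B"] + 1) / (n_arm["B"] + 2)
        rho_hat = q_b / (q_a + q_b)
        if j == 1:
            phi = 0.5  # no allocation proportion exists yet
        else:
            prop_a = n_arm["A"] / (j - 1)
            if prop_a > rho_hat:
                phi = alpha * rho_hat            # too many on A: push toward B
            elif prop_a < rho_hat:
                phi = 1 - alpha * (1 - rho_hat)  # too few on A: push toward A
            else:
                phi = 0.5
        arm = "A" if rng.random() < phi else "B"
        n_arm[arm] += 1
        if rng.random() < (p_a if arm == "A" else p_b):
            successes[arm] += 1
    return n_arm["A"]


def mean_allocation(p_a, p_b, n, n_trials=1000, seed=0):
    """Average number of patients on A over n_trials simulated trials."""
    rng = random.Random(seed)
    total = sum(simulate_erade(p_a, p_b, n, rng=rng) for _ in range(n_trials))
    return total / n_trials


if __name__ == "__main__":
    # UK ECMO setting: 185 patients, success rates 65/93 and 38/92.
    print(mean_allocation(65 / 93, 38 / 92, 185))
```

The printed average can be compared with the allocation quoted above; the exact value will depend on $\alpha$, the estimator, and the seed, so only rough agreement should be expected.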
Power of the ERADE and the RPW rule under the setting of $P_1 = 65/93$ and $P_2 = 38/92$:
• For equal allocation, the power is 0.978.
• The expected power under both designs (ERADE and RPW) is 0.969.
• Based on the 10000 simulated trials, in 99% of the trials under the ERADE there were more than 52 patients assigned to the conventional treatment group, for a power of 0.941 or higher.
• Under the RPW rule, only 39 or fewer patients were assigned to the conventional treatment in 1% of the trials, for a power of 0.904 or less.

• Also based on the 10000 simulated trials, the ERADE always assigned more patients to the ECMO group.
• However, the RPW rule assigned more patients to the conventional group in 114 trials.
• Even at the sample size 185, we can see the advantage of using the proposed ERADE over the randomized play-the-winner rule.

4.8 Some remarks

• Urn models (RPW rule, Wei and Durham, 1978, JASA; Zhang, Hu, Cheung and Chan, 2011, AOS)
• Ethics, randomness, power and variability (Hu and Rosenberger, 2003, JASA)
• Lower bound of the variability (Hu, Rosenberger and Zhang, 2006, JSPI)
• DBCD (Hu and Zhang, 2004, AOS)
• Optimal allocations (Rosenberger et al., 2001, Biometrics; Tymofyeyev, Rosenberger and Hu, 2007, JASA)
• ERADE (Hu, Zhang and He, 2009, AOS)
• Delayed responses (Bai, Hu and Rosenberger, 2002, AOS; Hu and Zhang, 2004)
• Time trends and others (Hu and Rosenberger, 2000, Statistics in Medicine)
• The book (Hu and Rosenberger, 2006) and two white papers (Hu and Rosenberger, 2007).
• Sequential monitoring of RAR (Zhu and Hu, 2010, AOS).
• Sample size re-estimation (Li and Hu, 2021).
• Robustness Inference of RAR (Ye, Ma and Hu, 2021?).

5 Covariate-Adaptive Randomization

Example 5.1: Remdesivir-COVID-19 trial (China).
The Remdesivir in adults with severe COVID-19 trial (Wang et al., 2020) was a randomized, double-blind, placebo-controlled, multicentre trial that aimed to compare Remdesivir with placebo. There were 236 patients in the trial. There were about 20 baseline covariates for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A stratified (according to the level of respiratory support) permuted block (30 patients per block) randomization procedure was implemented. At the end of this trial, some important imbalances existed at enrollment between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than in the placebo group.

Example 5.2: Moderna COVID-19 vaccine trial (2020). The trial began on July 27, 2020, and enrolled 30,420 adult volunteers at clinical research sites across the United States. Volunteers were randomly assigned 1:1 to receive either two 100 microgram (mcg) doses of the investigational vaccine or two shots of saline placebo 28 days apart. The average age of volunteers was 51 years. Approximately 47% were female, 25% were 65 years or older and 17% were under the age of 65 with medical conditions placing them at higher risk for severe COVID-19. Approximately 79% of participants were white, 10% were Black or African American, 5% were Asian, 0.8% were American Indian or Alaska Native, 0.2% were Native Hawaiian or Other Pacific Islander, 2% were multiracial, and 21% (of any race) were Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators recorded 196 cases of symptomatic COVID-19 occurring among participants at least 14 days after they received their second shot.
One hundred and eighty-five cases (30 of which were classified as severe COVID-19) occurred in the placebo group and 11 cases (0 of which were classified as severe COVID-19) occurred in the group receiving mRNA-1273. The incidence of symptomatic COVID-19 was 94.1% lower in participants who received mRNA-1273 than in those who received placebo.

Investigators observed 236 cases of symptomatic COVID-19 among participants at least 14 days after they received their first shot, with 225 cases in the placebo group and 11 cases in the group receiving mRNA-1273. The vaccine efficacy was 95.2% for this secondary analysis.

5.1 Some Classical designs
Clinical trialists are often concerned that treatment arms will be unbalanced with respect to key covariates of interest. To prevent this, covariate-adaptive randomization is often employed. Over 50000 covariate-adaptive clinical trials were reported from 1988 to 2008 (Taves, 2010).
• Covariates (prognostic factors): factors that are associated with the outcomes of patients.
  – E.g., gender, age, clinical center, blood pressure, stage of disease at baseline, gene expressions.
• Covariate-adaptive design: randomization that incorporates covariates and balances treatment allocation over covariates.
  – Balancing treatment allocation for influential covariates.
  – Achieving statistical efficiency by preserving type I errors while increasing power.
• Two popular procedures: the stratified permuted block design and Pocock and Simon's marginal procedure (1975).

Imbalance of Different Levels
• Overall difference: $D_n = N_1(n) - N_2(n)$.
• Marginal difference: the difference between the numbers of patients on the two treatments on a margin, e.g. $D_{female}$.
• Within-stratum difference: the difference between the numbers of patients on the two treatments in a stratum, e.g. $D_{female,smoker}$.
            Female                Male                Overall
Smoker      $D_{female,smoker}$   $D_{male,smoker}$   $D_{smoker}$
Non-S       $D_{female,non-s}$    $D_{male,non-s}$    $D_{non-s}$
Overall     $D_{female}$          $D_{male}$          $D_n$

5.1.1 Stratified Randomization
• Strata are formed by all combinations of the covariates' levels.
  – E.g., 2 covariates, gender (male and female) and smoking behavior (smoker and non-smoker), lead to 2 × 2 = 4 strata.
• Separate randomization is employed within each stratum.
  – Covariate-adaptive biased coin design.
  – Stratified permuted block design (commonly used).
Permuted Block Design: a permutation of m A's and m B's.
  – E.g., block size 2m = 4, with permutations such as (AABB) or (BAAB). For 10 patients: AABB | BAAB | BB (the last block is incomplete).

Stratified Randomization
• Advantages:
  – Easy to understand and implement.
  – Good large-sample properties (almost perfect balance).
  – Balance within stratum.
• Disadvantages:
  – Only considers balance within stratum.
  – Does not work for cases with many strata (many covariates or many levels).
  – The properties of statistical inference are unknown theoretically.

5.1.2 Pocock-Simon procedure
Let $Z_1, \ldots, Z_n$ be the covariate vectors of patients $1, \ldots, n$. Assume that there are $S$ covariates of interest (continuous or otherwise) and that covariate $s$ is divided into $n_s$ levels, $s = 1, \ldots, S$. Let $N_{sik}(n)$, $s = 1, \ldots, S$, $i = 1, \ldots, n_s$, $k = 1, 2$, be the number of patients in the $i$-th level of the $s$-th covariate on treatment $k$. Let patient $n+1$ have covariate vector $Z_{n+1} = (r_1, \ldots, r_S)$. Let
\[ D_s(n) = N_{s r_s 1}(n) - N_{s r_s 2}(n), \]
the difference between the numbers of patients on treatments 1 and 2 among members of level $r_s$ of covariate $s$.

Let $w_1, \ldots, w_S$ be a set of weights and take the weighted aggregate
\[ D(n) = \sum_{s=1}^{S} w_s D_s(n). \]
Establish a probability $\pi \in (1/2, 1]$. Then the procedure allocates to treatment 1 according to
\[ \phi_{i1} = E(T_{i1} \mid T_{i-1}, Z_i) = \begin{cases} 1/2, & \text{if } D(i-1) = 0, \\ \pi, & \text{if } D(i-1) < 0, \\ 1-\pi, & \text{if } D(i-1) > 0. \end{cases} \]
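The allocation rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pocock_simon_prob` and `pocock_simon`, the data layout (`counts[s][level] = [N_1, N_2]`), the default $\pi = 0.75$, and the toy patient stream with two binary covariates are all assumptions made for illustration.

```python
import random


def pocock_simon_prob(counts, new_levels, weights, pi=0.75):
    """Probability that the next patient is assigned to treatment 1.

    counts[s][level] = [N_1, N_2]: patients already on treatments 1 and 2
    at that level of covariate s (hypothetical layout for this sketch).
    """
    d = 0.0
    for s, level in enumerate(new_levels):
        n1, n2 = counts[s].get(level, [0, 0])
        d += weights[s] * (n1 - n2)  # w_s * D_s(n)
    if d == 0:
        return 0.5
    # Favour the treatment that is under-represented on the patient's margins.
    return pi if d < 0 else 1.0 - pi


def pocock_simon(patients, weights, pi=0.75, seed=0):
    """Sequentially assign patients (tuples of covariate levels) to 1 or 2."""
    rng = random.Random(seed)
    counts = [dict() for _ in weights]
    assignments = []
    for levels in patients:
        p1 = pocock_simon_prob(counts, levels, weights, pi)
        t = 1 if rng.random() < p1 else 2
        assignments.append(t)
        for s, level in enumerate(levels):
            counts[s].setdefault(level, [0, 0])[t - 1] += 1
    return assignments, counts


# Toy stream: 200 patients with gender and smoking status drawn at random.
stream = random.Random(1)
patients = [(stream.choice(["F", "M"]), stream.choice(["smoker", "non-smoker"]))
            for _ in range(200)]
assignments, counts = pocock_simon(patients, weights=[1.0, 1.0], pi=0.85)
print(counts)  # marginal counts [N_1, N_2] per covariate level
```

Only marginal counts enter the rule, which is why the procedure controls marginal imbalance but says nothing directly about balance inside a stratum.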
Pocock-Simon procedure
• Advantages:
  – Balance across covariates (marginal balance).
  – Overall treatment balance with many covariates.
• Disadvantages:
  – Unknown theoretical properties (not well studied; Rosenberger and Sverdlov, 2009).
  – Usually not well balanced within stratum.
  – The properties of statistical inference are unknown theoretically.

Examples.

We need new covariate-adaptive designs that provide balance (within stratum, marginal and overall) under different situations (sample size 200, 500 or 1000):
• 10 covariates, each with 2 levels: $2^{10} = 1024$ strata in total.
• 2 covariates, a biomarker with 2 levels and 100 investigation sites: 200 strata in total.

5.2 Hu and Hu's Covariate-Adaptive Design for Balance (discrete)
Consider two covariates: covariate 1 with $I$ levels and covariate 2 with $J$ levels. Consider patient $n+1$, $n = 0, 1, 2, \ldots$, who falls in level $i$ of covariate 1 and level $j$ of covariate 2. First we define the following values:
• If patient $n+1$ is assigned to treatment 1, let
  – Within stratum: $D^{(1)}_{ij}(n+1) = N_{ij,1}(n+1) - N_{ij,2}(n+1)$, where $N_{ij,1}(n+1)$ and $N_{ij,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, in stratum $(i, j)$ among the first $n+1$ patients.
  – Marginal 1: $D^{(1)}_{i\cdot}(n+1) = N_{i\cdot,1}(n+1) - N_{i\cdot,2}(n+1)$, where $N_{i\cdot,1}(n+1)$ and $N_{i\cdot,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, with covariate 1 equal to $i$ among the first $n+1$ patients.
  – Marginal 2: $D^{(1)}_{\cdot j}(n+1) = N_{\cdot j,1}(n+1) - N_{\cdot j,2}(n+1)$, where $N_{\cdot j,1}(n+1)$ and $N_{\cdot j,2}(n+1)$ are the numbers of patients assigned to treatments 1 and 2, respectively, with covariate 2 equal to $j$ among the first $n+1$ patients.
  – Overall: $D_{n,overall} = N_{n,1} - N_{n,2}$, the overall difference between the numbers of patients in groups 1 and 2 among the first $n$.
  – Define $A^{(1)}_{ij}(n+1) = (D^{(1)}_{ij}(n+1))^2$, $A^{(1)}_{i\cdot}(n+1) = (D^{(1)}_{i\cdot}(n+1))^2$, $A^{(1)}_{\cdot j}(n+1) = (D^{(1)}_{\cdot j}(n+1))^2$ and $A^{(1)}_{\cdot\cdot}(n+1) = (D_{n,overall})^2$.
  – The score of imbalance is
\[ B^{(1)}_{ij}(n+1) = w_1 A^{(1)}_{ij}(n+1) + w_2 A^{(1)}_{i\cdot}(n+1) + w_3 A^{(1)}_{\cdot j}(n+1) + w_4 A^{(1)}_{\cdot\cdot}(n+1) \]
    for some weights $w_1, w_2, w_3, w_4 \ge 0$.
• If patient $n+1$ is assigned to treatment 2, $B^{(2)}_{ij}(n+1)$ is calculated similarly.

Then the proposed procedure (Hu and Hu, 2012) allocates to treatment 1 according to
\[ \phi_{n+1,1} = \begin{cases} 1/2, & \text{if } B^{(1)}_{ij}(n+1) = B^{(2)}_{ij}(n+1), \\ \pi, & \text{if } B^{(1)}_{ij}(n+1) < B^{(2)}_{ij}(n+1), \\ 1-\pi, & \text{if } B^{(1)}_{ij}(n+1) > B^{(2)}_{ij}(n+1), \end{cases} \]
where $\pi > 0.5$ ($\pi \in (0.75, 0.95)$ is recommended).

Remarks:
• When $w_1 = 0$ and $w_4 = 0$, the new design becomes Pocock and Simon's procedure.
• When $w_2 = w_3 = w_4 = 0$, the new design is similar to stratified block randomization.
• With $w_1, w_2, w_3 > 0$, we can balance both within each stratum and across covariates.

Theorem 1:
(1) Under certain conditions ($w_1 > 0$ and some others; Hu and Hu, 2012; Hu and Zhang, 2014), $D_n$ (the imbalance matrix) is a positive recurrent Markov chain. Therefore all three types of imbalance are $O_p(1)$.
(2) When $w_1 = 0$ (Pocock and Simon's design), both the marginal and the overall imbalances are $O_p(1)$, but the within-stratum imbalance is $O_p(n^{1/2})$ (Hu and Zhang, 2014).
The proof is quite difficult because of the correlated structure. Main techniques: "drift conditions" for Markov chains, Gaussian approximation and martingales.

Some numerical results:
Case 1: 10 covariates, each with 2 levels: $2^{10} = 1024$ strata in total.
Table 1.
Averaging imbalance under 100 simulations and n = 500

Dist of pts across strata     Counts & percentages
# of pts   E(# prop)   Imb    strt(PB)     P-S          New
2          .07         0      50.2(.67)    38.2(.50)    55.1(.74)
                       2      24.9(.33)    37.8(.50)    18.9(.26)
3          .01         1      12(1.00)     9.3(.77)     12.0(.96)
                       3      0(0.00)      2.8(.23)     .5(.04)
(< 2)      .91
overall abs dif               12.8         .76          .90
marginal abs dif              10.4         1.68         1.90

Case 2: 2 covariates, a biomarker with 2 levels and 100 investigation sites: 200 strata in total.

Table 3. Averaging imbalance under 1000 simulations and n = 200

Dist of pts across strata     Counts & percentages
# of pts   E(# prop)   Imb    strt(PB)       P-S            New
2          .184        0      24.46(.66)     24.15(.65)     30.23(.82)
                       2      12.37(.34)     12.74(.35)     6.46(.18)
3          .06         1      12.02(1.00)    11.18(.92)     12.05(.97)
                       3      0(0.00)        1.02(0.08)     0.35(.03)
(< 2)      .735
overall abs dif               9.39           1.14           1.53
marginal long abs dif         6.57           0.87           1.13
marginal short abs dif        1.00           0.86           0.81

5.3 Examples of Mimicking Real Clinical Data
5.3.1 Toorawa, Adena, et al. (2009)
The four covariates are site, gender, age and disease status, with 20, 2, 2 and 2 levels, respectively, resulting in 160 strata. The covariates' distribution is replicated in Table 2, where the marginal distribution of sites is independent of the joint distribution of the other three covariates.

Table 2: Distribution of Covariates

Sites
  Small (2 sites)                      1/120
  Medium (16 sites)                    6/120
  Large (2 sites)                      11/120
Other 3 covariates
  Male; < 60; Moderate disease         10/20
  Male; ≥ 60; Moderate disease         2/20
  Male; < 60; Severe disease           2/20
  Male; ≥ 60; Severe disease           2/20
  Female; < 60; Moderate disease       1/20
  Female; ≥ 60; Moderate disease       1/20
  Female; < 60; Severe disease         1/20
  Female; ≥ 60; Severe disease         1/20

120 patients enter the trial sequentially and their covariates are independently simulated from the multinomial distribution in Table 2.
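As a rough illustration of this simulation step, the following Python sketch draws 120 patients from the Table 2 distribution and counts how many of the 160 strata receive 0, 1, 2, ... patients. The function names, the level labels and the seed are hypothetical, and the per-site probabilities are read from Table 2 as applying to each site within a size group (so that they sum to 1).

```python
import random
from collections import Counter

# Per-site probabilities from Table 2: 2 small, 16 medium and 2 large sites.
site_probs = [1 / 120] * 2 + [6 / 120] * 16 + [11 / 120] * 2

# Joint distribution of (gender, age, disease status) from Table 2.
other_probs = {
    ("male", "<60", "moderate"): 10 / 20,
    ("male", ">=60", "moderate"): 2 / 20,
    ("male", "<60", "severe"): 2 / 20,
    ("male", ">=60", "severe"): 2 / 20,
    ("female", "<60", "moderate"): 1 / 20,
    ("female", ">=60", "moderate"): 1 / 20,
    ("female", "<60", "severe"): 1 / 20,
    ("female", ">=60", "severe"): 1 / 20,
}


def simulate_patients(n=120, seed=0):
    """Draw n covariate profiles; site is independent of the other covariates."""
    rng = random.Random(seed)
    sites = rng.choices(range(20), weights=site_probs, k=n)
    others = rng.choices(list(other_probs), weights=list(other_probs.values()), k=n)
    return [(site,) + other for site, other in zip(sites, others)]


def strata_occupancy(patients, n_strata=160):
    """Map 'patients in a stratum' -> 'number of strata with that many'."""
    per_stratum = Counter(patients)
    occupancy = Counter(per_stratum.values())
    occupancy[0] = n_strata - len(per_stratum)  # strata that stayed empty
    return occupancy


patients = simulate_patients()
occupancy = strata_occupancy(patients)
print(sorted(occupancy.items()))
```

Averaging `strata_occupancy` over many seeds gives the kind of occupancy summary that such a simulation study would report; a single run, as here, is only one draw.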
We use the same p, q and block size as in the previous two examples. The weights are specified in the following way:
  - NEW: $w_o = w_s = 1/3$ and $w_{m,i} = 1/12$, $i = 1, \ldots, 4$.
  - PS: $w_o = w_s = 0$ and $w_{m,i} = 1/4$, $i = 1, \ldots, 4$.

Table 3: Distribution of patients among 160 strata

# of pts within stratum    0        1        2       3       4 and more
# of strata                95.4     38.8     12.7    5.6     7.6
proportion                 59.6%    24.3%    7.9%    3.5%    4.7%

Table 3 shows the distribution of 120 patients among 160 strata. In this case 24.3% of the strata have 1 patient, and 11.4% contain 2 or 3 patients. If stratified randomization is employed, the patients in those 24.3% of strata have to be randomized with equal probabilities. Moreover, the incomplete blocks in strata with 2 or 3 patients also pose a high risk of large overall imbalance. The mean absolute imbalances at the three levels are compared in Tables 4, 5 and 6.

Table 4: Comparison of absolute overall imbalance

$|D_n|$       STR-PB    PS      NEW
mean          6.70      0.91    0.63
median        6         0       0
95% quan      16        2       2

Table 4 shows the result for the overall imbalance and lists the mean, median and 95% quantile of $|D_{120}|$. NEW has mean, median and 95% quantile of 0.63, 0 and 2, respectively, whereas PS has slightly higher values. The three quantities are extremely high under STR-PB, which is therefore not recommended for this case.

Table 5: Comparison of mean absolute marginal imbalances

$E|D_n(i; k_i)|$          STR-PB    PS      NEW
gender     male           5.52      1.10    1.59
           female         3.86      1.06    1.55
age        < 60           4.84      1.08    1.57
           ≥ 60           4.40      1.11    1.23
disease    moderate       5.01      1.10    1.56
           severe         4.35      1.18    1.52
20 sites   2 small        1.45      0.94    1.02
           16 medium      1.44      1.21    1.32
           2 large        1.47      1.33    1.52

Table 5 gives the mean absolute marginal imbalances.
For the covariates gender, age and disease, the table explicitly lists the mean values on these 6 margins, as each covariate has only two levels. For example, over the 1000 simulations, the average absolute difference between the numbers of male patients in the two treatment groups is 5.52, 1.10 and 1.59 under STR-PB, PS and NEW, respectively. In this respect PS has the best performance; NEW is slightly worse, but still tolerable; STR-PB is the worst, since its mean is as high as 5.52. Similar conclusions can be reached for the other 5 margins. For the margins relating to "site", there are 20 margins in total, so the result on each margin cannot be shown for reasons of space. Hence, these 20 margins are categorized into three groups of small, medium and large sizes, and the mean values in the table are further averaged over the margins within each group. For example, 1.32 is the mean absolute imbalance over the 16 medium-sized sites as well as over the 1000 simulations. In terms of imbalances on margins defined by site, PS is still the best, and STR-PB has similar performance to NEW. This is because each margin of site contains only 8 strata, hence the "accumulating effect" of within-stratum imbalances under STR-PB is not as strong.

Table 6: Comparison of absolute within-stratum imbalances $|D_n(k_1, \ldots, k_I)|$: distribution and mean

# of pts within strt.   $|D_n(k_1, \ldots, k_I)|$   STR-PB    PS      NEW
2                       prob(=0)                    0.68      0.57    0.69
                        prob(=2)                    0.32      0.43    0.31
                        mean                        0.64      0.86    0.62
3                       prob(=1)                    1.00      0.85    0.94
                        prob(=3)                    0.00      0.15    0.06
                        mean                        1.00      1.30    1.12

Table 6 displays the distribution and the mean of the absolute within-stratum imbalances for strata with 2 or 3 patients.
For example, in the strata that contain 2 patients, the absolute difference is either 0 or 2; under NEW the probabilities are 0.69 for 0 and 0.31 for 2, leading to an average of 0.69 × 0 + 0.31 × 2 = 0.62. According to this criterion, NEW has the lowest mean, STR-PB has a slightly larger value, and PS has a mean as large as 0.86. For strata containing 3 patients, since the block size is 4 for STR-PB, it is impossible to get an absolute value of 3. Hence, its mean absolute imbalance is 1, the minimum among the three methods.

In summary, Hu and Hu's procedure maintains good balance from all three perspectives and should be favored. We also performed the simulations under other parameter values. Some of them include:
(1) changing the weights $w_o$, $w_s$ and $w_{m,i}$, as well as the block size;
(2) 2 × 100 strata, representing few covariates but many levels for at least one covariate;
(3) 3 × 4 × 5 × 6 strata, representing a few covariates with a few levels each.
In all the above settings, our new procedure shows advantages over the other two methods.

5.3.2 NIDA-CSP-1019 study
The study of Elkashef et al. (2006) is a randomized clinical trial conducted to test the treatment effect of the selegiline transdermal system (STS), a treatment for cocaine dependence. The trial comprised 300 patients and involved important covariates such as center (16 centers), age, gender (1: male, 2: female), depression (measured by the Hamilton Depression Rating Scale), ADHD (Attention-Deficit/Hyperactivity Disorder, 1: yes, 2: no), and cocaine use (the number of self-reported days of cocaine use in the past 30 days). The raw data of this study are available on the NIDA website.
Before using the randomization procedures, we discretized age to 1 (0-30), 2 (30-40), 3 (40-50), 4 (50 and above); depression (Hamilton Depression Rating Scale) to 1 (normal: 0-7), 2 (mild depression: 8-13), 3 (moderate depression: 14-18), 4 (severe depression: 19-22), 5 (very severe depression: 23 and above); and cocaine use to 1 (0-10), 2 (11-20), 3 (20-30). The correlation coefficients of the six covariates are given in Table 7.

Table 7: Correlation coefficients (Kendall's tau) of the covariates.

              center    gender    age       depression   ADHD      cocaineuse
center        1.000     0.027     -0.004    -0.044       -0.029    0.021
gender        0.027     1.000     -0.066    0.078        0.013     0.127
age           -0.004    -0.066    1.000     -0.028       -0.075    0.009
depression    -0.044    0.078     -0.028    1.000        -0.176    0.066
ADHD          -0.029    0.013     -0.075    -0.176       1.000     0.040
cocaineuse    0.021     0.127     0.009     0.066        0.040     1.000

Table 7 shows that, among the covariates, gender has the highest calculated correlation with cocaine use (0.127). Medical studies [conner2008meta, mcintosh2009adult] suggest the existence of correlations among depression, ADHD and cocaine use. Gender, depression, ADHD and cocaine use were thus assumed to be jointly distributed. In addition, center and age were assumed to be distributed independently of each other and of the rest of the covariates. The empirical distributions of the covariates used in the simulation are presented in Tables 8-12.

The values used for the randomization procedures were as follows. $B_s = 4$ was used for all $s$ under STR-PB, whether cocaine use was observed or unobserved. $\gamma = 0.85$ was used for both PS and HH, whether cocaine use was observed or unobserved. When cocaine use was observed, $w_{m1} = \cdots = w_{m6} = 1/6$ and ($w_o = 0.1$, $w_{m1} = \cdots = w_{m6} = 0.14$, $w_s = 0.06$) were used for PS and HH, respectively.
When cocaine use was unobserved, $w_{m1} = \cdots = w_{m5} = 1/5$ and ($w_o = 0.15$, $w_{m1} = \cdots = w_{m5} = 0.15$, $w_s = 0.1$) were used for PS and HH, respectively.

Table 8: Marginal pmf of age.

age    1         2          3          4
pmf    28/300    110/300    135/300    27/300

Table 9: Marginal pmf of center.

center    1         2         3         4         5         6         7         8
pmf       24/300    21/300    28/300    15/300    14/300    18/300    28/300    24/300
center    9         10        11        12        13        14        15        16
pmf       15/300    16/300    20/300    3/300     20/300    17/300    10/300    27/300

Table 10: Joint pmf of gender, depression, ADHD, and cocaine use I.

gender   depression   ADHD   cocaine use   pmf
1        1            1      1             1/300
1        1            1      2             3/300
1        1            1      3             1/300

Table 11: Joint pmf of gender, depression, ADHD, and cocaine use II.

gender   depression   ADHD   cocaine use   pmf
1        1            2      1             26/300
1        1            2      2             54/300
1        1            2      3             30/300
1        2            1      1             1/300
1        2            2      1             10/300
1        2            2      2             24/300
1        2            2      3             18/300
1        3            1      1             2/300
1        3            1      3             1/300
1        3            2      1             4/300
1        3            2      2             13/300
1        3            2      3             12/300
1        4            1      2             2/300
1        4            1      3             2/300
1        4            2      1             5/300
1        4            2      2             3/300
1        4            2      3             8/300
1        5            1      2             1/300
1        5            1      3             2/300
1        5            2      1             1/300
1        5            2      2             7/300
1        5            2      3             3/300

Table 12: Joint pmf of gender, depression, ADHD, and cocaine use III.

gender   depression   ADHD   cocaine use   pmf
2        1            2      1             1/300
2        1            2      2             9/300
2        1            2      3             14/300
2        2            2      1             4/300
2        2            2      2             9/300
2        2            2      3             8/300
2        3            1      1             1/300
2        3            1      2             2/300
2        3            2      2             3/300
2        3            2      3             5/300
2        4            2      1             1/300
2        4            2      2             2/300
2        4            2      3             1/300
2        5            1      2             1/300
2        5            2      1             1/300
2        5            2      2             1/300
2        5            2      3             3/300

Note that the discretization we used resulted in 1,280 observed strata. In this section, based on the sample size of 300 in this study, we compare the marginal imbalance of cocaine use = 3 and the imbalance of a partial stratum (gender = 1, depression = 1, ADHD = 2, cocaine use = 3), when cocaine use is either observed or unobserved.
For simplicity, we write $D_n(6; s_6)$ for the marginal imbalance of cocaine use when it is observed, and $D_n(1; r_1)$ for the marginal imbalance of cocaine use when it is unobserved. Furthermore, we write $D_n(s^*)$ for the imbalance of the partial stratum of interest when cocaine use is observed, and $D_n(s^{**}, r_1)$ for the imbalance of the same partial stratum when cocaine use is unobserved.

The simulation results for the partial stratum and the margin of cocaine use = 3 are summarized in Table 13 and Table 14. We also report the percentage reduction in the variance of the observed covariate imbalance (PRVOCI) for $D_n(s^*)$ and $D_n(6; s_6)$. Regardless of whether cocaine use is observed or unobserved, PS and HH produce a better balance for the partial stratum and the margin of cocaine use than CR or STR-PB. In particular, the standard deviations of $n^{-1/2}D_n(6; s_6)$, $n^{-1/2}D_n(1; r_1)$, $n^{-1/2}D_n(s^*)$ and $n^{-1/2}D_n(s^{**}, r_1)$ under PS and HH are smaller than the corresponding values under STR-PB and CR.

Table 13: Simulation results for the partial stratum (gender = 1, depression = 1, ADHD = 2, cocaine use = 3), based on 10,000 runs.

Procedure   $n^{-1/2}D_n(s^*)$          $n^{-1/2}D_n(s^{**}, r_1)$
            mean (s.d.) / PRVOCI        mean (s.d.) / PRVUCI
CR          -.000 (.316) / -            -.000 (.316) / -
STR-PB      .002 (.276) / 23.7%         .008 (.288) / 17.1%
PS          -.000 (.238) / 43.5%        -.002 (.280) / 21.6%
HH          -.005 (.233) / 45.9%        -.003 (.275) / 24.2%

Table 14: Simulation results for cocaine use, based on 10,000 runs.

Procedure   $n^{-1/2}D_n(6; s_6)$       $n^{-1/2}D_n(1; r_1)$
            mean (s.d.) / PRVOCI        mean (s.d.) / PRVUCI
CR          .001 (.601) / -             -.001 (.601) / -
STR-PB      .005 (.556) / 14.6%         .011 (.564) / 12.0%
PS          -.000 (.111) / 96.6%        .001 (.476) / 37.3%
HH          -.000 (.112) / 96.5%        -.003 (.470) / 37.3%

The differences between the PRVOCIs and the PRVUCIs of the partial stratum and of cocaine use under PS and HH are not negligible. For example, the PRVOCI and the PRVUCI for the marginal imbalance of cocaine use = 3 under PS are 96.6% and 37.3%, respectively. Indeed, if a covariate is omitted from CAR, the marginal imbalance of this covariate generally increases. However, as the PRVUCIs are positive (37.3%), the results still suggest that CAR procedures perform much better than CR when cocaine use is omitted from the design.

6 New covariate-adaptive randomization procedures (continuous covariates; many covariates; network structures and others)
6.1 Introduction
Example: Remdesivir-COVID-19 trial (China).
The trial of Remdesivir in adults with severe COVID-19 (Wang et al., 2020) was a randomized, double-blind, placebo-controlled, multicentre trial that compared Remdesivir with placebo. There were 236 patients in the trial. About 20 baseline covariates were recorded for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A stratified (by level of respiratory support) permuted block (30 patients per block) randomization procedure was implemented. At the end of the trial, some important imbalances at enrollment were found between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than in the placebo group.
Example: GATE project
Project GATE (Growing America Through Entrepreneurship), sponsored by the U.S. Department of Labor, was designed to evaluate the impact of offering tuition-free entrepreneurship training services (GATE services) on helping clients create, sustain or expand their own business. (https://www.doleta.gov/reports/projectgate/)
The cornerstone is complete randomization. Members of the treatment group were offered GATE services; members of the control group were not.
• n = 4,198 participants
• p = 105 covariates

Example: Online A/B testing. (Kohavi and Thomke, 2017, Harvard Business Review)
Microsoft, Amazon, Facebook and Google conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.
Amazon's experiment. Treatment A: credit card offers on the front page. Treatment B: credit card offers on the shopping cart page. This change (from A to B) boosted profits by tens of millions of US dollars annually.
The data are often networked (dependent, with interference). How should such studies be designed?

Advantages of covariate balance:
• Improves the accuracy and efficiency of inference.
• Removes bias and increases power.
• Increases the interpretability of results by making the units more comparable, which enhances credibility.
• More robust against model misspecification.
• Rubin (2008): the greatest possible efforts should be made during the design phase rather than the analysis stage.
• Randomization: an essential tool for evaluating treatment effects.
• Traditional randomization methods (e.g., complete randomization (CR)) are unsatisfactory: they can leave prognostic or baseline covariates unbalanced.
"Most of experimenters on carrying out a random assignment of plots will be shocked to find out how far from equally the plots distribute themselves." (Fisher, 1926)

What if p and n are both large?
• Covariate imbalance is exacerbated as p and n increase.
• This situation is ubiquitous in the era of big data.
• Example: suppose the probability of one particular covariate being unbalanced is $\alpha = 5\%$. For a study with $p = 10$ covariates, the chance of at least one covariate exhibiting imbalance is $1 - (1-\alpha)^{p} = 1 - 0.95^{10} \approx 40\%$. With 100 covariates, the chance is $1 - (1-\alpha)^{100} \approx 0.994$, i.e. essentially 1.

6.2 Rerandomization
Morgan and Rubin (2012) proposed rerandomization.
(1) Collect covariate data.
(2) Specify a balance criterion, $M < a$, i.e., a threshold on the Mahalanobis distance
\[ M = (\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x}_1 - \bar{x}_2)]^{-1} (\bar{x}_1 - \bar{x}_2), \]
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means of the covariates in the two treatment groups.
(3) Randomize the units using complete randomization (CR).
(4) Check the balance criterion, $M < a$.
  • If satisfied, go to Step (5); otherwise, return to Step (3).
(5) Perform the experiment using the final randomization obtained in Step (4).

Advantages:
• Desirable properties for causal inference:
  – Reduction in the variance of the estimated treatment effect.
• Works well with a few covariates.
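Steps (1)-(5) of rerandomization can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, each draw splits the units into two equal-sized groups by a random permutation, and $\mathrm{cov}(\bar{x}_1 - \bar{x}_2)$ is estimated as $S\,(1/n_1 + 1/n_2)$ with $S$ the pooled sample covariance of all units. The covariates are simulated standard normals and the threshold $a = 0.5$ is arbitrary.

```python
import numpy as np


def mahalanobis_balance(x, t):
    """M = (xbar1 - xbar2)^T [cov(xbar1 - xbar2)]^{-1} (xbar1 - xbar2)."""
    x1, x2 = x[t == 1], x[t == 0]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    s = np.atleast_2d(np.cov(x, rowvar=False))
    # Assumption: cov(xbar1 - xbar2) is estimated by S * (1/n1 + 1/n2).
    cov_diff = s * (1.0 / len(x1) + 1.0 / len(x2))
    return float(diff @ np.linalg.solve(cov_diff, diff))


def rerandomize(x, a, rng, max_draws=100_000):
    """Redraw the allocation until the balance criterion M < a holds."""
    n = x.shape[0]
    base = np.array([1] * (n // 2) + [0] * (n - n // 2))
    for draw in range(1, max_draws + 1):
        t = rng.permutation(base)          # Step (3): randomize
        m = mahalanobis_balance(x, t)      # Step (4): check M < a
        if m < a:
            return t, m, draw              # Step (5): use this allocation
    raise RuntimeError("no acceptable allocation found; relax a or raise max_draws")


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))              # Step (1): simulated covariates
t, m, draws = rerandomize(x, a=0.5, rng=rng)
print(f"accepted after {draws} draws with M = {m:.3f}")
```

The number of draws needed grows as the acceptance probability $P(M < a)$ shrinks, which is the practical cost of demanding tight balance with many covariates.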
Drawbacks:
• Not suitable for sequential experiments.
• Does not scale up to massive data.
• As p increases, the probability of acceptance $p_a = P(M < a)$ decreases, causing rerandomization (RR) to remain in the loop for a long time.

Examples of Rerandomization.

6.3 Covariate-Adaptive Randomization via Mahalanobis Distance (CAM)
$x_i \in R^p$: covariates of the $i$-th unit, $i = 1, \ldots, n$.
$T_i \in \{1, 0\}$: treatment assignment of the $i$-th unit.
• $T_i = 1$: treatment 1.
• $T_i = 0$: treatment 2.
(1) Use the newly defined Mahalanobis distance
\[ M(n) = 0.25\,(\bar{x}_1 - \bar{x}_2)^T [\mathrm{cov}(\bar{x})]^{-1} (\bar{x}_1 - \bar{x}_2). \]
(2) Randomly arrange the units in a sequence and group them in pairs: $(x_1, x_2)$ is the 1st pair, $(x_3, x_4)$ the 2nd pair, $(x_5, x_6)$ the 3rd pair, and so on up to $x_n$.
(3) Assign the 1st pair: $T_1 = 1$, $T_2 = 0$.
(4) For the next pair, i.e., the $(2i+1)$-th and $(2i+2)$-th units ($i \ge 1$):
  (4a) If $T_{2i+1} = 1$ and $T_{2i+2} = 0$, obtain the "potential" $M^{(1)}_i$.
  (4b) If $T_{2i+1} = 0$ and $T_{2i+2} = 1$, obtain the "potential" $M^{(2)}_i$.
(5) Assign the $(2i+1)$-th and $(2i+2)$-th units by
\[ P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots) = \begin{cases} q, & \text{if } M^{(1)}_i < M^{(2)}_i, \\ 1-q, & \text{if } M^{(1)}_i > M^{(2)}_i, \\ 0.5, & \text{if } M^{(1)}_i = M^{(2)}_i, \end{cases} \]
\[ P(T_{2i+1} = 0, T_{2i+2} = 1 \mid x_{2i}, T_{2i}, \ldots) = 1 - P(T_{2i+1} = 1, T_{2i+2} = 0 \mid x_{2i}, T_{2i}, \ldots), \]
where
  • $0.5 < q < 1$.
  • Note: $T_{2i+1} = T_{2i+2}$ (both 0 or both 1) is not allowed.
(6) Repeat Steps (4) and (5) until all units are assigned.

• A smaller value of $M(n)$ indicates a better covariate balance.
• q = 0.75.
More discussion in Hu and Hu (2012).
• Units are not observed sequentially; however, we allocate them sequentially (in pairs).
• Better covariate balance.
• There are $n!$ different possible sequences, with similar performance.

Properties of CAM
Under CAM, suppose the $x_i$ are i.i.d. multivariate normal; then $M(n) = O_p(n^{-1})$.
Note:
• Under CR, $M_{CR}(n) \sim \chi^2_{df=p}$: a stationary chi-square distribution with $p$ degrees of freedom, regardless of $n$.
• Under RR, $M_{RR}(n) \sim \chi^2_{df=p} \mid \chi^2_{df=p} < a$: a stationary chi-square distribution with $p$ degrees of freedom conditional on $M_{RR}(n) < a$, regardless of $n$.
• Under CAM, $M(n) \to 0$ at the rate of $1/n$.
  – More units, better balance.
  – CAM is advantageous for large $n$.

Properties of CAM
As $p$ increases:
• Under CR, the stationary distribution becomes flatter, giving poorer covariate balance.
• Under RR, the stationary distribution becomes flatter, giving poorer covariate balance.
• Under CAM, $M(n) \to 0$ at the rate of $1/n$, regardless of $p$.
  – The effect of $p$ on $M(n)$ is less severe than under CR and RR.

Properties of CAM
• Adaptive, based on covariates.
• Works for sequential experiments: just estimate the covariance matrix sequentially.
• Handles large $p$ and large $n$.
• Better covariate balance.
• Less computational time.

6.3.1 Estimating treatment effect
A natural setup of A/B testing:
• The observed outcome $y_i$, $i = 1, \ldots, n$, for each unit.
• Let $y_i(T_i)$ represent the potential outcome of the $i$-th unit under treatment $T_i$.
• $y_i = y_i(1) T_i + y_i(0)(1 - T_i)$.
• The average treatment effect is
\[ \tau = \frac{\sum_{i=1}^{n} y_i(1)}{n} - \frac{\sum_{i=1}^{n} y_i(0)}{n}. \]
• The fundamental problem: we only observe $y_i(T_i)$ for one particular $T_i$; therefore, $\tau$ cannot be calculated directly.

A natural estimate, $\hat{\tau}$:
\[ \hat{\tau} = \frac{\sum_{i=1}^{n} T_i y_i}{\sum_{i=1}^{n} T_i} - \frac{\sum_{i=1}^{n} (1 - T_i) y_i}{\sum_{i=1}^{n} (1 - T_i)}. \]
• $\hat{\tau}$ can be poor when the covariates are imbalanced.
• Example: estimating a drug effect when one treatment group is predominantly male and the other predominantly female. The gender effect cannot be separated from the treatment effect.

Theoretical properties:
(1) Unbiasedness: under CAM, $E(\hat{\tau}) = \tau$.
(2) Under CAM, $Var(\hat{\tau})$ attains the lower bound asymptotically.
(3) This implies that $Var_{CAM}(\hat{\tau}) < Var_{RR}(\hat{\tau}) < Var_{CR}(\hat{\tau})$.

6.3.2 Examples
Real Data Example I - Project GATE (Example 4)
• Two treatment groups: the treatment group was offered GATE services; the control group was not.
• p = 105 covariates, obtained from the application packages (13 continuous and 92 categorical).
• Sample size n = 3,448 (those of the 4,198 participants who answered the evaluation survey 6 months after the assignment).
• Original allocation: M = 75.27, a moderate covariate imbalance.
• We repeat the allocation 1,000 times for these participants using CAM, complete randomization and rerandomization.
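To make the CAM allocation steps and the difference-in-means estimate $\hat{\tau}$ concrete, here is a minimal NumPy sketch. It is not the GATE analysis and not the authors' code: the function names are hypothetical, $n$ is assumed even, $\mathrm{cov}(\bar{x})$ is taken as $S/m$ with $S$ the sample covariance of all $n$ units and $m$ the number of units assigned so far, and both the covariates and the outcome model in the usage lines are simulated purely for illustration.

```python
import numpy as np


def cam_distance(x, t, s_inv):
    """M = 0.25 (xbar1 - xbar2)^T [cov(xbar)]^{-1} (xbar1 - xbar2), cov(xbar) = S/m."""
    m = len(t)
    diff = x[t == 1].mean(axis=0) - x[t == 0].mean(axis=0)
    return 0.25 * m * float(diff @ s_inv @ diff)


def cam_assign(x, q=0.75, rng=None):
    """Pairwise sequential allocation; returns T in {1, 0} in the original order."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    if n % 2:
        raise ValueError("this sketch assumes an even number of units")
    s_inv = np.linalg.inv(np.atleast_2d(np.cov(x, rowvar=False)))
    order = rng.permutation(n)             # random sequence of the units
    xs = x[order]
    t = np.zeros(n, dtype=int)
    t[0], t[1] = 1, 0                      # first pair
    for k in range(2, n, 2):
        seen = xs[:k + 2]
        t[k], t[k + 1] = 1, 0
        m1 = cam_distance(seen, t[:k + 2], s_inv)   # potential M^(1)
        t[k], t[k + 1] = 0, 1
        m2 = cam_distance(seen, t[:k + 2], s_inv)   # potential M^(2)
        p = q if m1 < m2 else (1 - q if m1 > m2 else 0.5)
        if rng.random() < p:
            t[k], t[k + 1] = 1, 0          # otherwise the pair stays (0, 1)
    result = np.empty(n, dtype=int)
    result[order] = t
    return result


def diff_in_means(y, t):
    """Natural estimate: mean outcome under T = 1 minus mean outcome under T = 0."""
    return float(y[t == 1].mean() - y[t == 0].mean())


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
t_cam = cam_assign(x, q=0.75, rng=rng)
m_cam = cam_distance(x, t_cam, np.linalg.inv(np.cov(x, rowvar=False)))
y = x.sum(axis=1) + 1.0 * t_cam + rng.normal(size=200)   # hypothetical outcome model
print(f"M(n) = {m_cam:.3f}, tau_hat = {diff_in_means(y, t_cam):.3f}")
```

The sketch recomputes the group means for every pair, which is quadratic in $n$; keeping running sums of the covariates per group would make each step constant-time in $n$ if that matters.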
CAM vs Rerandomization
The maximum of the Mahalanobis distances obtained from CAM is 12. If we set the balance criterion for rerandomization to $M < 12$, the probability of acceptance is $P_a = P(\chi^2_{df=105} < 12) = 3.4 \times 10^{-31}$, which means it is nearly impossible for rerandomization to achieve a balance level similar to CAM. We set $P_a = 2 \times 10^{-5}$ for rerandomization so that it has a computational time similar to CAM.

[Figure: Comparison of Mahalanobis Distance. Density of the Mahalanobis distance (x-axis, 0 to 140) under complete randomization, CAM and rerandomization.]

Estimation.
• The outcome variable (0/1): has owned a business within 6 months after assignment or not.
• After the allocation, we simulate the outcome variable according to
$$\mathrm{logit}\big(P(y_i^{sim} = 1)\big) = \hat\mu_1 T_i^{sim} + \hat\mu_2 (1 - T_i^{sim}) + x_i^T \hat\beta + \epsilon^{sim},$$
where $\hat\mu_1$, $\hat\mu_2$ and $\hat\beta$ are obtained from fitting the regression to the original data, and $\epsilon^{sim}$ is drawn from the residuals of that regression.

Compare the estimation performance (PRIV) of CAM and rerandomization.

Method            PRIV     un or va
CAM               17.7%    0.081
Rerandomization   10.5%    0.505

6.4 Network A/B Testing
Let a graph G be represented by an $n \times n$ symmetric adjacency matrix $A = [A_{ij}]$. Balancing the network, that is, $n$-dimensional binary vectors, is hard. Zhou, Li and Hu (2019) proposed several methods and discussed their theoretical and finite sample properties.
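Returning to the rerandomization acceptance probability quoted in the CAM vs Rerandomization comparison, $P(\chi^2_{df=105} < 12)$ can be checked numerically. The sketch below is one way to do so with the standard library only; the helper name `chi2_cdf` and the series-based evaluation are choices made here, not something taken from the slides.

```python
import math


def chi2_cdf(x, df, max_terms=1000):
    """P(chi-square with df degrees of freedom < x).

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = df/2 and z = x/2:
        P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_{k>=0} z^k / ((a+1)...(a+k)).
    The series converges quickly when z is well below a, as it is here.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    term, total = 1.0, 1.0
    for k in range(1, max_terms):
        term *= z / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return math.exp(log_prefactor) * total


p_accept = chi2_cdf(12.0, 105)
print(f"P(chi2_105 < 12) = {p_accept:.2e}")

# Rerandomization repeats complete randomization until M < a, so the number of
# draws per accepted allocation is geometric with mean 1 / P_a.
print(f"expected draws at P_a = 2e-5: {1 / 2e-5:.0f}")
```

The computed probability agrees with the value $3.4 \times 10^{-31}$ quoted above, and the geometric-mean calculation shows why $P_a = 2 \times 10^{-5}$ already corresponds to about 50,000 candidate allocations per accepted one.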
[Figure: New methods vs Complete randomization (MSE). x-axis: number of nodes (100 to 600); y-axis: standard deviation; methods: random, adaptive row, adaptive submatrix, coordinate descent.]

[Figure: New methods vs Complete randomization (ATE). x-axis: bias (-4 to 3); y-axis: density; methods: random, adaptive row, adaptive submatrix, coordinate descent.]

[Figure: New methods vs Complete randomization (MSE). x-axis: number of nodes (0 to 500); y-axis: standard deviation; legend: random, adaptive.]

[Figure: New methods vs Complete randomization (ATE). x-axis: bias (-2 to 2); y-axis: density; methods: random, adaptive row, adaptive submatrix, coordinate descent.]

6.5 Balance Covariates based on general Kernels
CAM only considers the means of the two groups. Covariance structure is also important in statistical analysis. Therefore, Ma, Li and Hu (2021) proposed the following distance measure, which combines both mean and covariance differences:
$$IBT(n) = (\bar{x}_1 - \bar{x}_2)^T \mathrm{cov}(x)^{-1} (\bar{x}_1 - \bar{x}_2) + \mathrm{trace}\big\{ (\hat\Sigma_1 - \hat\Sigma_2)^2 \big\} / p,$$
where $\hat\Sigma_1$ and $\hat\Sigma_2$ are the sample covariance matrices for the two treatment groups.
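The distance measure $IBT(n)$ above can be computed directly from its definition. The following is a minimal sketch under two assumptions that the slides do not spell out: $\mathrm{cov}(x)$ is taken to be the sample covariance of all units, and all sample covariances use the $n - 1$ denominator. The function names and the toy data are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix with denominator n - 1."""
    n, m = len(rows), mean_vector(rows)
    p = len(m)
    return [[sum((r[j] - m[j]) * (r[k] - m[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def ibt(x, T):
    """Return (mean part, covariance part, IBT) for covariates x and assignments T."""
    g1 = [r for r, t in zip(x, T) if t == 1]
    g2 = [r for r, t in zip(x, T) if t == 0]
    p = len(x[0])
    d = [a - b for a, b in zip(mean_vector(g1), mean_vector(g2))]
    mean_part = sum(di * si for di, si in zip(d, solve(covariance(x), d)))
    S1, S2 = covariance(g1), covariance(g2)
    # For the symmetric matrix D = S1 - S2, trace(D^2) is the sum of squared entries.
    cov_part = sum((S1[j][k] - S2[j][k]) ** 2
                   for j in range(p) for k in range(p)) / p
    return mean_part, cov_part, mean_part + cov_part


# Hypothetical covariates: two copies of the corners of the unit square.
x = [[0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]]
print(ibt(x, [1, 1, 1, 1, 0, 0, 0, 0]))  # identical groups
print(ibt(x, [1, 1, 0, 0, 1, 1, 0, 0]))  # group means differ
print(ibt(x, [1, 0, 0, 1, 1, 0, 0, 1]))  # equal means, different covariances
```

The third allocation is the case the covariance term is designed for: both groups have the same mean, so the Mahalanobis-type term is zero, but the within-group correlations have opposite signs and the trace term is positive.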
[Figure: New method vs CAM vs Complete randomization. Density of the Mahalanobis distance for CAM, Trace+ and CR; nine panels for n = 100, 300, 500 and p = 2, 6, 10.]

[Figure: New method vs CAM vs Complete randomization. Density of Trace((Sigma_1−Sigma_0)^2)/p for CAM, Trace+ and CR; nine panels for n = 100, 300, 500 and p = 2, 6, 10.]

Ma, Li and Hu (2021):
(i) a general framework of kernel covariate-adaptive randomization to attain covariate balance for a large class of functions that reside in a high-dimensional or even infinite-dimensional space;
(ii) with the kernel trick commonly used in machine learning, the framework unifies several recently proposed covariate-adaptive designs and generalizes to a much broader family with imbalance measures defined in a consistent manner;
(iii) the convergence rate of covariate imbalance is bounded in probability; and
(iv) a method that balances covariance matrices between treatments and shows excellent and robust performance in finite samples.

6.6 Examples
Example: Remdesivir-COVID-19 trial (China).
The Remdesivir in adults with severe COVID-19 trial (Wang et al. 2020) is a randomized, double-blind, placebo-controlled, multicentre trial that aimed to compare Remdesivir with placebo. There were 236 patients in the trial. There are about 20 baseline covariates for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A stratified (according to the level of respiratory support) permuted block (30 patients per block) randomization procedure was implemented. At the end of this trial, some important imbalances existed at enrollment between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than the placebo group.

Example: Moderna COVID-19 vaccine trial (2020).
The trial began on July 27, 2020, and enrolled 30,420 adult volunteers at clinical research sites across the United States. Volunteers were randomly assigned 1:1 to receive either two 100 microgram (mcg) doses of the investigational vaccine or two shots of saline placebo 28 days apart. The average age of volunteers is 51 years. Approximately 47% are female, 25% are 65 years or older and 17% are under the age of 65 with medical conditions placing them at higher risk for severe COVID-19. Approximately 79% of participants are white, 10% are Black or African American, 5% are Asian, 0.8% are American Indian or Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2% are multiracial, and 21% (of any race) are Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators recorded 196 cases of symptomatic COVID-19 occurring among participants at least 14 days after they received their second shot. One hundred and eighty-five cases (30 of which were classified as severe COVID-19) occurred in the placebo group and 11 cases (0 of which were classified as severe COVID-19) occurred in the group receiving mRNA-1273. The incidence of symptomatic COVID-19 was 94.1% lower in those participants who received mRNA-1273 as compared to those receiving placebo.

Investigators observed 236 cases of symptomatic COVID-19 among participants at least 14 days after they received their first shot, with 225 cases in the placebo group and 11 cases in the group receiving mRNA-1273. The vaccine efficacy was 95.2% for this secondary analysis.

Example: PFIZER-BIONTECH COVID-19 VACCINE.
Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine (2020).

BACKGROUND
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and the resulting coronavirus disease 2019 (Covid-19) have afflicted tens of millions of people in a worldwide pandemic. Safe and effective vaccines are needed urgently.

METHODS
In an ongoing multinational, placebo-controlled, observer-blinded, pivotal efficacy trial, we randomly assigned persons 16 years of age or older in a 1:1 ratio to receive two doses, 21 days apart, of either placebo or the BNT162b2 vaccine candidate (30 μg per dose). BNT162b2 is a lipid nanoparticle–formulated, nucleoside-modified RNA vaccine that encodes a prefusion stabilized, membrane-anchored SARS-CoV-2 full-length spike protein. The primary end points were efficacy of the vaccine against laboratory-confirmed Covid-19 and safety.

RESULTS
A total of 43,548 participants underwent randomization, of whom 43,448 received injections: 21,720 with BNT162b2 and 21,728 with placebo. There were 8 cases of Covid-19 with onset at least 7 days after the second dose among participants assigned to receive BNT162b2 and 162 cases among those assigned to placebo; BNT162b2 was 95% effective in preventing Covid-19 (95% credible interval, 90.3 to 97.6). Similar vaccine efficacy (generally 90 to 100%) was observed across subgroups defined by age, sex, race, ethnicity, baseline body-mass index, and the presence of coexisting conditions. Among 10 cases of severe Covid-19 with onset after the first dose, 9 occurred in placebo recipients and 1 in a BNT162b2 recipient. The safety profile of BNT162b2 was characterized by short-term, mild-to-moderate pain at the injection site, fatigue, and headache.
The incidence of serious adverse events was low and was similar in the vaccine and placebo groups.

CONCLUSIONS
A two-dose regimen of BNT162b2 conferred 95% protection against Covid-19 in persons 16 years of age or older. Safety over a median of 2 months was similar to that of other viral vaccines. (Funded by BioNTech and Pfizer; ClinicalTrials.gov number, NCT04368728.)

Example: REAL-WORLD EVIDENCE CONFIRMS HIGH EFFECTIVENESS OF PFIZER-BIONTECH COVID-19 VACCINE AND PROFOUND PUBLIC HEALTH IMPACT OF VACCINATION ONE YEAR AFTER PANDEMIC DECLARED.
The Israel Ministry of Health (MoH), Pfizer Inc. (NYSE: PFE) and BioNTech SE (Nasdaq: BNTX) today announced real-world evidence demonstrating dramatically lower incidence rates of COVID-19 disease in individuals fully vaccinated with the Pfizer-BioNTech COVID-19 Vaccine (BNT162b2), underscoring the observed substantial public health impact of Israel’s nationwide immunization program. These new data build upon and confirm previously released data from the MoH demonstrating the vaccine’s effectiveness in preventing symptomatic SARS-CoV-2 infections, COVID-19 cases, hospitalizations, severe and critical hospitalizations, and deaths. The latest analysis from the MoH proves that two weeks after the second vaccine dose protection is even stronger – vaccine effectiveness was at least 97% in preventing symptomatic disease, severe/critical disease and death.
This comprehensive real-world evidence can be of importance to countries around the world as they advance their own vaccination campaigns one year after the World Health Organization (WHO) declared COVID-19 a pandemic.

Findings from the analysis were derived from de-identified aggregate Israel MoH surveillance data collected between January 17 and March 6, 2021, when the Pfizer-BioNTech COVID-19 Vaccine was the only vaccine available in the country and when the more transmissible B.1.1.7 variant of SARS-CoV-2 (formerly referred to as the U.K. variant) was the dominant strain. Vaccine effectiveness was at least 97% against symptomatic COVID-19 cases, hospitalizations, severe and critical hospitalizations, and deaths. Furthermore, the analysis found a vaccine effectiveness of 94% against asymptomatic SARS-CoV-2 infections. For all outcomes, vaccine effectiveness was measured from two weeks after the second dose.

Following the authorization for emergency use of the Pfizer-BioNTech COVID-19 Vaccine in Israel on December 6, 2020, the Israel MoH launched a national vaccination program targeting individuals age 16 years or older – a total of 6.4 million people, representing 71% of the population. The vaccination program started at the beginning of a large surge of SARS-CoV-2 infections in Israel, which later resulted in a national lockdown starting on January 8, 2021.
This MoH analysis uses de-identified aggregate Israel MoH public health surveillance data from January 17 through March 6, 2021 (analysis period); the start of the analysis period corresponds to seven days after individuals began receiving second doses of the Pfizer-BioNTech COVID-19 Vaccine. MoH regularly collects comprehensive, real-time data on SARS-CoV-2 testing, COVID-19 cases including date of symptom onset, and vaccination history through a nationally notifiable disease registry and the national medical record database.

Vaccine effectiveness estimates – adjusted to account for variances in age, gender and the week specimens were collected – were determined for the prevention of six laboratory-confirmed SARS-CoV-2 outcomes comparing unvaccinated and fully-vaccinated individuals: SARS-CoV-2 infections (includes symptomatic and asymptomatic infections); asymptomatic SARS-CoV-2 infections; COVID-19 cases (symptomatic only); COVID-19 hospitalizations; severe (respiratory distress, including >30 breaths per minute, oxygen saturation on room air <94%, and/or ratio of arterial partial pressure of oxygen to fraction of inspired oxygen <300 mm mercury) and critical (mechanical ventilation, shock, and/or heart, liver or kidney failure) COVID-19 hospitalizations; and COVID-19 deaths.

The MoH analysis was conducted when more than 80% of tested specimens in Israel were variant B.1.1.7, providing real-world evidence of the effectiveness of BNT162b2 for prevention of COVID-19 infections, hospitalizations, and deaths due to variant B.1.1.7.
However, this analysis was not able to evaluate vaccine effectiveness against B.1.351 (formerly referred to as the South African variant) due to the limited number of infections caused by this strain in Israel at the time the analysis was conducted.

The vaccine effectiveness estimates align with the 95% vaccine efficacy of BNT162b2 against COVID-19 demonstrated in the pivotal Randomized Clinical Trial (RCT) of BNT162b2. However, this observational analysis differs from the RCT in several aspects. Vaccine effectiveness estimates may be affected by differences between vaccinated and unvaccinated persons (e.g., different test-seeking behaviors or levels of adherence to preventive measures). In the RCT, randomization minimized the impact of differences between vaccinated and unvaccinated. Despite efforts to adjust for these effects in the available dataset, the possibility remains of unmeasured distortions. For example, findings from the Maccabi HMO indicate that neighborhood may be an important factor. Further vaccine effectiveness analyses investigating the effect of additional covariates such as location, comorbidities, race/ethnicity, and likelihood of seeking SARS-CoV-2 testing are warranted.

Pfizer-BioNTech’s coronavirus vaccine offers more protection than earlier thought, with effectiveness in preventing symptomatic disease reaching 97%, according to real-world evidence published Thursday by the pharma companies. Using data from January 17 to March 6 from Israel’s national vaccination campaign, Pfizer-BioNTech found that prevention against asymptomatic disease also reached 94 percent.
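The randomized-trial efficacy figures referred to above can be reproduced approximately from the case counts reported in the two trial summaries. The sketch below uses the crude formula "one minus the ratio of attack rates"; this is an assumption made for illustration, since the published analyses are based on surveillance time in the evaluable population. For the mRNA-1273 counts, equal arm sizes are assumed because of the 1:1 allocation; for BNT162b2, the numbers of injected participants are used as denominators.

```python
def crude_efficacy(cases_vaccine, cases_placebo, n_vaccine=1, n_placebo=1):
    """Crude efficacy: 1 - (attack rate in vaccine arm) / (attack rate in placebo arm).

    With the default n_vaccine = n_placebo the arms are treated as equal-sized,
    so the result reduces to 1 - cases_vaccine / cases_placebo.
    """
    return 1.0 - (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)


# BNT162b2: 8 vs 162 cases; 21,720 and 21,728 participants received injections.
pfizer = crude_efficacy(8, 162, 21720, 21728)

# mRNA-1273 primary analysis: 11 vs 185 cases (equal arm sizes assumed).
moderna_primary = crude_efficacy(11, 185)

# mRNA-1273 secondary analysis: 11 vs 225 cases (equal arm sizes assumed).
moderna_secondary = crude_efficacy(11, 225)

for name, ve in [("BNT162b2", pfizer),
                 ("mRNA-1273 primary", moderna_primary),
                 ("mRNA-1273 secondary", moderna_secondary)]:
    print(f"{name}: {100 * ve:.1f}%")
```

The crude values come out at roughly 95.1%, 94.1% and 95.1%. The first two match the reported 95% and 94.1%; the last differs slightly from the reported 95.2%, which is consistent with the published figure being computed from person-time rather than raw counts.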
"We are extremely encouraged that the real-world effectiveness data coming from Israel are confirming the high efficacy demonstrated in our Phase 3 clinical trial and showing the significant impact of the vaccine in preventing severe disease and deaths due to COVID-19," said Luis Jodar, Ph.D., senior vice president and chief medical officer of Pfizer Vaccines.

7 Statistical Inference after covariate-adaptive randomization

7.1 Some concerns
First we consider simulations to study the Type I error of hypothesis testing for comparing treatment effects under three designs: Pocock and Simon’s marginal procedure, stratified permuted block design, and complete randomization. For each type of design, both the continuous case and the discrete case are considered. The following linear model (including two covariates $Z_1$ and $Z_2$) is assumed for the responses $Y_i$:
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 Z_{i,1} + \beta_2 Z_{i,2} + \varepsilon_i,$$
where $\varepsilon_i$ is distributed as $N(0, 1)$ and $\beta_1 = \beta_2 = 1$. No difference in treatment effects is assumed to study the Type I error, i.e., $\mu_1 = \mu_2$.

For the discrete case, $Z_1$ follows Bernoulli($p_1$) and $Z_2$ follows Bernoulli($p_2$); for the continuous case, both $Z_1$ and $Z_2$ follow the normal distribution $N(0, 1)$. If the covariates $Z_1$ and $Z_2$ are continuous, they are discretized into Bernoulli variables $Z_1'$ and $Z_2'$ with the probabilities $p_1$ and $p_2$ in order to be used in randomization. More specifically, if $Z_1 < Z(p_1)$, where $Z(p_1)$ is the $p_1$ quantile of the standard normal distribution, then $Z_1' = 0$; otherwise $Z_1' = 1$. The original variables (without discretization) are used in statistical inference procedures.

To carry out simulations, the biased coin probability 0.75 and equal weights are used for Pocock and Simon’s marginal procedure, and the block size 4 is used for stratified permuted block design.
The significance level is $\alpha = 0.05$ and the sample size $N$ is 100, 200 or 500. The hypothesis tests include the two-sample t-test (t-test), the linear model with a single covariate $Z_1$ (lm($Z_1$)), the linear model with a single covariate $Z_2$ (lm($Z_2$)) and the linear model with both covariates $Z_1$ and $Z_2$ (lm($Z_1, Z_2$)). With $(p_1, p_2) = (0.5, 0.5)$, the simulation results for Pocock and Simon’s marginal procedure, stratified permuted block design and complete randomization are shown in Tables 15 and 16.

In each simulation, the Type I error of the covariate-adaptive randomization methods is also examined with the bootstrap t-test described in Shao, Yu and Zhong (2010). To do the test, $B$ bootstrap samples
$$(Y_1^{*b}, Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Y_N^{*b}, Z_{N,1}^{*b}, Z_{N,2}^{*b}), \quad b = 1, 2, \ldots, B,$$
are generated independently as simple random samples with replacement from $(Y_1, Z_{1,1}, Z_{1,2}), \ldots, (Y_N, Z_{N,1}, Z_{N,2})$. The covariate-adaptive procedure used on the original data is applied to the covariates of each bootstrap sample, $(Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Z_{N,1}^{*b}, Z_{N,2}^{*b})$, from which the bootstrap analogues of the treatment assignments, $I_1^{*b}, \ldots, I_N^{*b}$, can be obtained.

Define
$$\bar{Y}_1 - \bar{Y}_2 = \frac{1}{n_1} \sum_{i=1}^{N} I_i Y_i - \frac{1}{n_2} \sum_{i=1}^{N} (1 - I_i) Y_i, \quad n_1 = \sum_{i=1}^{N} I_i, \quad n_2 = N - n_1,$$
and
$$\hat\theta^{*(b)} = \frac{1}{n_1^{*b}} \sum_{i=1}^{N} I_i^{*b} Y_i^{*b} - \frac{1}{n_2^{*b}} \sum_{i=1}^{N} (1 - I_i^{*b}) Y_i^{*b}, \quad n_1^{*b} = \sum_{i=1}^{N} I_i^{*b}, \quad n_2^{*b} = \sum_{i=1}^{N} (1 - I_i^{*b}).$$
The bootstrap estimator of the variance of $\bar{Y}_1 - \bar{Y}_2$ is then the sample variance of $\hat\theta^{*(b)}$, $b = 1, 2, \ldots, B$, denoted by $\hat{v}_B$. The bootstrap t-test then has the form
$$T_B = (\bar{Y}_1 - \bar{Y}_2) / \hat{v}_B^{1/2}.$$
In Shao, Yu and Zhong (2010), it is shown that the bootstrap t-test can maintain the nominal Type I error under the covariate-adaptive biased coin design. $B = 500$ is used in all following simulations.
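The bootstrap t-test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the covariate-adaptive procedure is taken to be a stratified permuted block design with block size 4 on two binary covariates, the data are simulated from the linear model of Section 7.1 with $\mu_1 = \mu_2 = 0$, and the function names and seed are hypothetical.

```python
import random


def stratified_permuted_block(Z, rng, block_size=4):
    """Assign treatments (1 or 0) within each covariate stratum using permuted blocks."""
    pending = {}  # stratum -> assignments left in its current block
    I = []
    for z in Z:
        key = tuple(z)
        if not pending.get(key):
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[key] = block
        I.append(pending[key].pop())
    return I


def diff_in_means(I, Y):
    n1 = sum(I)
    n2 = len(I) - n1
    return (sum(i * y for i, y in zip(I, Y)) / n1
            - sum((1 - i) * y for i, y in zip(I, Y)) / n2)


def bootstrap_t(Y, Z, I, rng, B=500):
    """Return (T_B, v_B): bootstrap t statistic and bootstrap variance estimate."""
    N = len(Y)
    theta = []
    for _ in range(B):
        # Resample (Y_i, Z_i) pairs with replacement.
        idx = [rng.randrange(N) for _ in range(N)]
        Yb = [Y[i] for i in idx]
        Zb = [Z[i] for i in idx]
        # Re-run the randomization procedure on the bootstrap covariates.
        Ib = stratified_permuted_block(Zb, rng)
        theta.append(diff_in_means(Ib, Yb))
    mean_theta = sum(theta) / B
    v_B = sum((t - mean_theta) ** 2 for t in theta) / (B - 1)
    return diff_in_means(I, Y) / v_B ** 0.5, v_B


rng = random.Random(0)
N = 100
Z = [(int(rng.random() < 0.5), int(rng.random() < 0.5)) for _ in range(N)]
I = stratified_permuted_block(Z, rng)
# Null model: mu1 = mu2 = 0, beta1 = beta2 = 1, standard normal errors.
Y = [z[0] + z[1] + rng.gauss(0.0, 1.0) for z in Z]

T_B, v_B = bootstrap_t(Y, Z, I, rng)
print("T_B =", T_B, "v_B =", v_B)
print("reject H0 at the 5% level:", abs(T_B) > 1.96)
```

The key design point is that the treatment assignments are regenerated for every bootstrap sample, so $\hat{v}_B$ reflects the reduced variability of $\bar{Y}_1 - \bar{Y}_2$ under the balanced design rather than the variability under complete randomization.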
Table 15: Simulated Type I error (in %) for Pocock and Simon’s marginal procedure (PS), stratified permuted block design (SPB) and complete randomization (CR). Simulations based on 10000 runs.

Z          Method   N     t-test   lm(Z1)   lm(Z2)   lm(Z1, Z2)   BS-t
Discrete   PS       100   1.75     3.05     3.09     5.21         5.18
                    200   1.62     2.78     2.86     4.99         4.88
                    500   1.66     2.81     2.77     4.87         4.90
           SPB      100   1.85     2.86     3.05     5.29         5.67
                    200   1.54     2.69     2.73     4.84         4.95
                    500   1.55     2.77     2.65     4.84         5.60
           CR       100   5.04     5.27     5.11     5.31         -
                    200   5.00     4.95     5.12     5.21         -
                    500   4.73     4.83     4.68     4.77         -

Table 16: Simulated Type I error (in %) for Pocock and Simon’s marginal procedure (PS), stratified permuted block design (SPB) and complete randomization (CR). Simulations based on 10000 runs.

Z            Method   N     t-test   lm(Z1)   lm(Z2)   lm(Z1, Z2)   BS-t
Continuous   PS       100   1.43     2.15     2.02     4.98         5.16
                      200   1.07     1.74     1.80     4.53         5.62
                      500   0.91     1.72     1.73     4.72         4.79
             SPB      100   1.22     1.83     2.05     5.01         5.68
                      200   0.98     1.86     1.77     5.08         5.19
                      500   1.15     1.98     1.84     5.48         5.61
             CR       100   5.20     5.31     4.82     4.92         -
                      200   5.06     5.14     4.85     5.46         -
                      500   4.87     5.05     4.71     4.77         -

Several conclusions can be drawn from Tables 15 and 16. First, the Type I error is close to 5% under the full model lm($Z_1, Z_2$). This coincides with theoretical results in Section 3, when no randomization covariate is omitted in the construction of the final analysis model. Secondly, under both Pocock and Simon’s marginal procedure and stratified permuted block design, the two-sample t-test, lm($Z_1$) and lm($Z_2$) are all conservative. Among these three tests, the two-sample t-test is the most conservative one, with the smallest Type I error. Furthermore, the Type I error of the bootstrap t-test (BS-t) is close to the nominal level 5% under both Pocock and Simon’s marginal procedure and stratified permuted block design.
Under complete randomization, the Type I error is close to 5% for all four tests. We also tried different $(p_1, p_2)$; similar results were obtained and are not shown here.

• Even though many covariate-adaptive designs have been proposed and implemented, the discussion of the corresponding statistical inference is limited.
• In practice, conventional tests are often used without consideration of the covariate-adaptive randomization scheme.
• It remains a concern whether conventional statistical inference is still valid for covariate-adaptive designs.

In the literature
• By simulation, Forsythe (1987) suggests “minimization should be considered for group assignment only if all variables used in minimization are also to be used as covariate” to achieve valid statistical inference.
• Shao, Yu and Zhong (2010) pointed out “if the covariates used in covariate-adaptive randomization is a function of the covariates to construct a test, the test is valid under covariate-adaptive randomization.”

Conservativeness
• In practice, however, it is often the case that not all randomization covariates are included in statistical inference, because including them is
  – difficult for some covariates (investigation sites, etc.);
  – results in a more complicated model;
  – requires correct model specification.
• Simulation studies by Birkett (1985), Forsythe (1987) and others indicate conservativeness of unadjusted analysis under covariate-adaptive clinical trials.
• Shao, Yu and Zhong (2010) proved that, under a simple linear regression model, $Y_{ij} = \mu_j + b Z_i + \varepsilon_{ij}$, the two-sample t-test is conservative for stratified biased coin designs.

Limitations
• Only the two-sample t-test is discussed.
Properties are unknown if partial covariate information is used in statistical inference.
• The result applies only to the covariate-adaptive biased coin design (stratified), which is not popular in applications.
• Only a simple linear model with one covariate is considered.
• No theoretical results about power.
• No discussion of inference about the significance of covariates.

Motivation
• Study properties of statistical inference for covariate-adaptive randomized clinical trials
  – which can be applied to a large family of covariate-adaptive designs, including the ones (Pocock and Simon’s design and others) widely used in practice;
  – based on linear models and generalized linear models;
  – for various types of hypothesis testing;
  – with new methods that adjust the Type I error and increase power.

7.2 Statistical Inference under Linear models
Properties of statistical inference will be studied for covariate-adaptive designs
• under a linear regression framework;
• when only a subset of the covariates used in randomization is included in statistical inference procedures;
• for two types of hypothesis testing:
  – comparing treatment effects;
  – testing significance of covariates.

7.3 Framework
We (Ma, Hu and Zhang, 2015, JASA) considered a covariate-adaptive randomized clinical trial with two treatments, 1 and 2.
• Let $N$ denote the total number of patients in the study.
• $I_i = 1$, $i = 1, 2, \ldots, N$, if the $i$th patient is assigned to treatment 1; otherwise $I_i = 0$.
• The response for the $i$th patient is
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T + Z_{i,1} c_1^T + \cdots + Z_{i,q} c_q^T + \varepsilon_i, \quad (1)$$
where
• $Y_i$ is the outcome of the $i$th patient;
• $X_k$ and $Z_j$, $k = 1, \ldots, p$ and $j = 1, \ldots, q$, are covariate information, which can be either discrete or continuous.
• Both $X_k$ and $Z_j$ are used in covariate-adaptive randomization, but only the $X_k$ are used to construct the analysis model.
• All covariates are assumed to be independent of each other.
• Furthermore, without loss of generality, it is assumed that $E X_k = E Z_j = 0$ for all $k$ and $j$.
• The $\varepsilon_i$ are independent and identically distributed random errors with mean zero and variance $\sigma_\varepsilon^2$, and are independent of $X_k$ and $Z_j$.

Analysis Model (Working Model)
• Assume both $X_k$ and $Z_j$ are used in covariate-adaptive randomization.
• The $X_k$ are used in the final analysis.
• A linear regression model is used for the analysis:
$$E[Y_i] = \mu_1 I_i + \mu_2 (1 - I_i) + X_{i,1} b_1^T + \cdots + X_{i,p} b_p^T. \quad (2)$$

In matrix form, the model between response and covariates (1) is
$$Y = X\beta + Z\gamma + \varepsilon,$$
and the analysis model (2) is
$$E[Y] = X\beta,$$
where $Y = (Y_1, Y_2, \ldots, Y_N)^T$ are the outcomes, $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)^T$, and $\beta = (\mu_1, \mu_2, b_1, \ldots, b_p)^T$ and $\gamma = (c_1, \ldots, c_q)^T$ are true but unknown parameters. Furthermore, $X$ and $Z$ are
$$X = \begin{pmatrix} I_1 & 1 - I_1 & X_{1,1} & \cdots & X_{1,p} \\ I_2 & 1 - I_2 & X_{2,1} & \cdots & X_{2,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_N & 1 - I_N & X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
\quad \text{and} \quad
Z = \begin{pmatrix} Z_{1,1} & \cdots & Z_{1,q} \\ \vdots & \ddots & \vdots \\ Z_{N,1} & \cdots & Z_{N,q} \end{pmatrix}.$$
The OLS estimator $\hat\beta$ of model (2) can be expressed as
$$\hat\beta = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (X\beta + Z\gamma + \varepsilon) = \beta + (X^T X)^{-1} X^T Z \gamma + (X^T X)^{-1} X^T \varepsilon.$$

Comparing Treatment Effect
To compare the treatment effects $\mu_1$ and $\mu_2$,
$$H_0 : \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A : \mu_1 - \mu_2 \neq 0. \quad (3)$$
The test statistic is
$$T = \frac{L \hat\beta}{\left( \hat\sigma^2 L (X^T X)^{-1} L^T \right)^{1/2}}, \quad (4)$$
where $L = (1, -1, 0, \ldots, 0)$ and $\hat\sigma^2 = (Y - X\hat\beta)^T (Y - X\hat\beta) / (N - p' - 2)$; $p' + 2$ is the total number of parameters in model (2). Reject $H_0$ if $|T| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th percentile of the standard normal distribution.
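The test statistic (4) can be computed from the design matrix of the working model (2) as in the following sketch. It uses only the standard library; the helper names `invert` and `treatment_test` are assumptions made here, and each covariate is treated as a single numeric column, so $p' + 2$ is simply the number of columns of $X$.

```python
def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def treatment_test(I, X, Y):
    """Return (T, beta_hat) for H0: mu1 - mu2 = 0 under the working model (2).

    I: treatment indicators; X: list of covariate rows (may be empty rows);
    Y: responses.
    """
    N = len(Y)
    # Design matrix rows (I_i, 1 - I_i, X_i1, ..., X_ip); no separate intercept.
    D = [[float(i), 1.0 - i] + list(x) for i, x in zip(I, X)]
    k = len(D[0])  # number of parameters, p' + 2
    XtX = [[sum(r[a] * r[b] for r in D) for b in range(k)] for a in range(k)]
    XtY = [sum(r[a] * y for r, y in zip(D, Y)) for a in range(k)]
    XtX_inv = invert(XtX)
    beta = [sum(XtX_inv[a][b] * XtY[b] for b in range(k)) for a in range(k)]
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(D, Y)]
    sigma2 = sum(e * e for e in resid) / (N - k)
    L = [1.0, -1.0] + [0.0] * (k - 2)
    LVL = sum(L[a] * XtX_inv[a][b] * L[b] for a in range(k) for b in range(k))
    T = sum(l * b for l, b in zip(L, beta)) / (sigma2 * LVL) ** 0.5
    return T, beta


# Small hypothetical check without covariates: T reduces to the pooled t statistic.
I = [1, 1, 1, 0, 0, 0]
Y = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
T, beta = treatment_test(I, [[] for _ in I], Y)
print(T, beta)

# Hypothetical example with one covariate.
X1 = [[0.5], [-1.0], [0.3], [1.2], [-0.7], [0.1]]
T1, beta1 = treatment_test(I, X1, Y)
print(T1, abs(T1) > 1.96)
```

In the first example the estimated coefficients are the two group means (2 and 4), and the statistic equals $-2 / \sqrt{2.5 \cdot 2/3} \approx -1.549$, which is the usual pooled two-sample t statistic.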
Testing Significance of a Covariate
To test the significance of a single covariate, without loss of generality, consider the first covariate:
$$H_0 : b_1 = 0 \quad \text{versus} \quad H_A : b_1 \neq 0. \quad (5)$$
The test statistic for hypothesis testing (5) is
$$T' = \frac{\ell \hat\beta}{\left( \hat\sigma^2 \ell (X^T X)^{-1} \ell^T \right)^{1/2}}, \quad (6)$$
where $\ell = (0, 0, 1, 0, \ldots, 0)$. Notice that if $X_1$ is a discrete covariate with multiple levels $s_1'$, $s_1' > 2$, then we are only able to test $b_{11} = 0$, where $b_1 = (b_{11}, b_{12}, \ldots, b_{1(s_1'-1)})$. Reject $H_0$ if $|T'| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th percentile of the standard normal distribution.

7.4 Properties
Valid and Conservative Test
A two-sided test $T$ based on the normal distribution is said to be (asymptotically) valid if, when the null hypothesis holds,
$$\lim_{N \to \infty} pr(|T| > Z_{1-\alpha/2}) = \alpha,$$
and it is said to be (asymptotically) conservative if there is a constant $\alpha_0$ such that, when the null hypothesis holds,
$$\lim_{N \to \infty} pr(|T| > Z_{1-\alpha/2}) = \alpha_0 < \alpha,$$
where $Z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ and $\Phi$ is the c.d.f. of the standard normal distribution.

Theorem: Under the linear model (1) and the hypothesis testing
$$H_0 : \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_A : \mu_1 - \mu_2 \neq 0,$$
if a covariate-adaptive design satisfies the following two conditions:
• the overall imbalance is $O_p(1)$,
• the marginal imbalances are $O_p(1)$,
then, under $H_0$, the test statistic $T$ is normally distributed with a variance $\sigma^2 < 1$ unless all $c_j = 0$. Therefore, the hypothesis testing is conservative unless all $c_j = 0$. Under $H_A$, the test statistic $T$ is normally distributed with a smaller non-centrality parameter unless all $c_j = 0$. Therefore the hypothesis testing is less powerful unless all $c_j = 0$.

Theorem: Under the same conditions as in Theorem 2 and the hypothesis testing
$$H_0 : b_1 = 0 \quad \text{versus} \quad H_A : b_1 \neq 0,$$
under $H_0$, the test statistic $T'$ is normally distributed with variance 1.
Therefore, the test is valid under $H_0$. However, under $H_A$, the test statistic $T'$ is asymptotically normally distributed with a smaller non-centrality parameter unless all $c_j = 0$. Therefore, the test is less powerful unless all $c_j = 0$.

Corollary 1: If $\mathbf{Z}$ is not related to $\mathbf{Y}$, i.e., all $c_j = 0$ for $j = 1, 2, \ldots, q$, then the tests of hypotheses (3) and (5) are both valid.

Corollary 2: The results in Theorems 2 and 3 hold under the following designs:
• Pocock and Simon's marginal procedure;
• Stratified permuted block design;
• The large class of covariate-adaptive designs in Hu and Hu (2012) and Hu and Zhang (2014).

7.5 Numerical studies: Type I Error and Power

Type I error is studied under the model
\[ Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + b_1 Z_{i,1} + b_2 Z_{i,2} + \varepsilon_i, \]
where
• $\varepsilon_i$ is distributed as $N(0, 1)$.
• $b_1 = b_2 = 1$.
• There is no difference in treatment effect, i.e., $\mu_1 = \mu_2$.
• Discrete case: $Z_1 \sim \mathrm{Bernoulli}(p_1)$, $Z_2 \sim \mathrm{Bernoulli}(p_2)$. Continuous case: $Z_1 \sim N(0, 1)$, $Z_2 \sim N(0, 1)$, with breakdown (cut) points at the $p_1$th and $p_2$th quantiles, respectively.
• The biased coin probability 0.75 and equal weights are used for Pocock and Simon's marginal procedure, and block size 4 is used for the stratified permuted block design.

Table 17: Simulated Type I error for Pocock and Simon's marginal procedure in %.
Simulations based on 10000 runs.

| Z | (p1, p2) | N | t-test | lm(Z1) | lm(Z2) | lm(Z1, Z2) |
|---|---|---|---|---|---|---|
| Discrete | (0.5, 0.5) | 100 | 1.75 | 3.05 | 3.09 | 5.21 |
| | | 200 | 1.62 | 2.78 | 2.86 | 4.99 |
| | | 500 | 1.66 | 2.81 | 2.77 | 4.87 |
| | (0.5, 0.3) | 100 | 2.02 | 3.30 | 3.00 | 5.04 |
| | | 200 | 1.90 | 3.18 | 2.99 | 5.07 |
| | | 500 | 1.84 | 3.25 | 2.96 | 5.20 |
| Continuous | (0.5, 0.5) | 100 | 1.43 | 2.15 | 2.02 | 4.98 |
| | | 200 | 1.07 | 1.74 | 1.80 | 4.53 |
| | | 500 | 0.91 | 1.72 | 1.73 | 4.72 |
| | (0.5, 0.3) | 100 | 1.35 | 2.12 | 1.85 | 4.95 |
| | | 200 | 1.16 | 2.14 | 1.83 | 5.05 |
| | | 500 | 1.22 | 1.95 | 1.71 | 4.99 |

Table 18: Simulated Type I error for complete randomization in %. Simulations based on 10000 runs.

| Z | (p1, p2) | N | t-test | lm(Z1) | lm(Z2) | lm(Z1, Z2) |
|---|---|---|---|---|---|---|
| Discrete | (0.5, 0.5) | 100 | 5.04 | 5.27 | 5.11 | 5.31 |
| | | 200 | 5.00 | 4.95 | 5.12 | 5.21 |
| | | 500 | 4.73 | 4.83 | 4.68 | 4.77 |
| | (0.5, 0.3) | 100 | 4.99 | 4.99 | 4.68 | 4.77 |
| | | 200 | 5.15 | 5.03 | 5.49 | 5.14 |
| | | 500 | 4.82 | 5.00 | 4.80 | 5.13 |
| Continuous | (0.5, 0.5) | 100 | 5.20 | 5.31 | 4.82 | 4.92 |
| | | 200 | 5.06 | 5.14 | 4.85 | 5.46 |
| | | 500 | 4.87 | 5.05 | 4.71 | 4.77 |
| | (0.5, 0.3) | 100 | 4.99 | 4.69 | 5.11 | 4.97 |
| | | 200 | 5.21 | 5.24 | 5.16 | 4.92 |
| | | 500 | 5.14 | 4.66 | 5.19 | 5.15 |

Table 19: Power comparison (lm(Z1, Z2)) in % for Pocock and Simon's marginal procedure (PS) and complete randomization (CR). Simulations based on 10000 runs with sample sizes N = 32 and N = 64.

| $\mu_1 - \mu_2$ | CR (N = 32) | PS (N = 32) | CR (N = 64) | PS (N = 64) |
|---|---|---|---|---|
| 0.0 | 4.96 | 5.03 | 5.17 | 5.08 |
| 0.2 | 7.81 | 8.51 | 12.12 | 12.68 |
| 0.4 | 18.15 | 19.44 | 34.46 | 34.76 |
| 0.6 | 33.96 | 36.98 | 63.04 | 65.53 |
| 0.8 | 53.74 | 57.28 | 86.97 | 87.95 |
| 1.0 | 73.63 | 77.10 | 97.02 | 97.51 |

7.6 Numerical Studies: Testing of Covariates

It is assumed that
\[ Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + b_1 Z_{i,1} + b_2 Z_{i,2} + \varepsilon_i, \]
where
• $\varepsilon_i$ is distributed as $N(0, 1)$.
• $b_1 = 0$, $b_2 = 1$.
• There is no difference in treatment effect, i.e., $\mu_1 = \mu_2$.
• Discrete case: $Z_1 \sim \mathrm{Bernoulli}(p_1)$, $Z_2 \sim \mathrm{Bernoulli}(p_2)$. Continuous case: $Z_1 \sim N(0, 1)$, $Z_2 \sim N(0, 1)$, with breakdown (cut) points at the $p_1$th and $p_2$th quantiles, respectively.
• The biased coin probability 0.75 and equal weights are used for Pocock and Simon's marginal procedure.
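The settings above use Pocock and Simon's marginal procedure with biased coin probability 0.75 and equal weights. A minimal sketch of such an allocation rule for discrete covariates is given here. It assumes the marginal imbalance is measured by the weighted sum of squared differences between the two arms, which is one common choice and not necessarily the measure used in the simulations; the function name `pocock_simon` and the demo data are hypothetical.

```python
import random


def pocock_simon(covariates, p=0.75, weights=None, rng=random):
    """Sequentially assign subjects to treatment 1 or 2.

    covariates: list of tuples of discrete covariate levels, one per subject.
    p: biased coin probability of choosing the arm that reduces imbalance.
    """
    n_cov = len(covariates[0])
    weights = weights or [1.0] * n_cov  # equal weights by default
    # diff[k][level] = (# on treatment 1) - (# on treatment 2) in that margin
    diff = [dict() for _ in range(n_cov)]
    assignments = []
    for z in covariates:
        d = [diff[k].get(z[k], 0) for k in range(n_cov)]
        # Weighted marginal imbalance if this subject went to arm 1 or arm 2
        imb1 = sum(w * (dk + 1) ** 2 for w, dk in zip(weights, d))
        imb2 = sum(w * (dk - 1) ** 2 for w, dk in zip(weights, d))
        if imb1 < imb2:
            prob1 = p
        elif imb1 > imb2:
            prob1 = 1 - p
        else:
            prob1 = 0.5  # tie: fair coin
        t = 1 if rng.random() < prob1 else 2
        assignments.append(t)
        for k in range(n_cov):
            diff[k][z[k]] = d[k] + (1 if t == 1 else -1)
    return assignments


# Demo: two Bernoulli covariates with (p1, p2) = (0.5, 0.3), N = 200.
rng = random.Random(2022)
subjects = [(int(rng.random() < 0.5), int(rng.random() < 0.3)) for _ in range(200)]
arms = pocock_simon(subjects, rng=rng)
print("overall imbalance:", arms.count(1) - arms.count(2))
```

Feeding the resulting assignments into the regression test of the working model, over many replications, is how a Type I error table of this kind could be reproduced in outline.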
Table 20: Simulated Type I error for $H_0: b_1 = 0$ versus $H_A: b_1 \neq 0$ for Pocock and Simon's marginal procedure (PS) and complete randomization (CR) in %. Simulations based on 10000 runs.

| (p1, p2) | N | Discrete PS | Discrete CR | Continuous PS | Continuous CR |
|---|---|---|---|---|---|
| (0.5, 0.5) | 100 | 4.96 | 4.98 | 4.98 | 4.90 |
| | 200 | 5.35 | 5.28 | 5.14 | 5.14 |
| | 500 | 5.55 | 5.55 | 5.11 | 5.15 |
| (0.5, 0.4) | 100 | 4.76 | 4.84 | 4.74 | 4.83 |
| | 200 | 5.51 | 5.49 | 5.12 | 5.07 |
| | 500 | 4.80 | 4.77 | 5.11 | 5.02 |
| (0.5, 0.3) | 100 | 4.96 | 5.07 | 5.23 | 5.20 |
| | 200 | 5.08 | 5.17 | 5.00 | 4.99 |
| | 500 | 4.95 | 4.84 | 5.65 | 5.64 |
| (0.4, 0.4) | 100 | 5.12 | 5.15 | 5.07 | 5.08 |
| | 200 | 5.41 | 5.48 | 5.20 | 5.21 |
| | 500 | 5.19 | 5.24 | 5.01 | 5.08 |

7.7 General Theory of Statistical Inference

Covariate-adaptive randomization is frequently used in comparative studies to increase the covariate balance across treatment groups. However, because the randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods following such randomization is often unclear.

Ma, Qin, Li and Hu (2019):
(i) derive the theoretical properties of statistical methods based on general covariate-adaptive randomization under the linear model framework;
(ii) explicitly unveil the relationship between covariate-adaptive and inference properties by deriving the asymptotic representations of the corresponding estimators;
(iii) apply the proposed general theory to various randomization procedures, such as complete randomization (CR), rerandomization (RR), pairwise sequential randomization (PSR), and Atkinson's $D_A$-biased coin design, and compare their performance analytically; and
(iv) based on the theoretical results, propose a new approach to obtain valid and more powerful tests.
8 New covariate-adjusted response-adaptive designs

Personalized medicine raises new challenges for the design of clinical trials, such as: (1) more covariates (biomarkers) have to be considered, and (2) particular attention needs to be paid to the interaction between treatment and covariates (biomarkers). To design a good clinical trial for personalized medicine, we need new designs that can match the special features of personalized medicine.

8.1 Optimal design for detecting important interactions among treatments and biomarkers

The goal of a conventional clinical trial is to determine if a new treatment is superior. When designing a clinical trial for precision medicine, the goal is not limited to just detecting the treatment difference, but also to identifying biomarkers that predict the efficacy of treatments. Therefore, it is important to have a design that can detect the interaction between treatment and biomarkers efficiently.

8.2 Optimal designs based on both efficiency and ethics

Clinical trials, because they involve human subjects, require stringent ethical considerations. To develop personalized medicine, covariate information plays an important role in the design and analysis of clinical trials. A challenge is the incorporation of covariate information in design while still considering issues of both efficiency and medical ethics (CARA designs).

To address this problem, new designs of clinical trials are needed (Hu, Zhu and Hu, 2015, JASA). Denote the efficiency and ethics measurements of two treatments as $d(Z, \theta) = (d_1(Z, \theta), d_2(Z, \theta))$ and $e(Z, \theta) = (e_1(Z, \theta), e_2(Z, \theta))$, respectively.
We propose to assign the $(m+1)$th subject to treatment 1 with probability
\[
\frac{e_1(Z_{m+1}, \hat\theta(m))\, d_1^{\gamma}(Z_{m+1}, \hat\theta(m))}
{e_1(Z_{m+1}, \hat\theta(m))\, d_1^{\gamma}(Z_{m+1}, \hat\theta(m)) + e_2(Z_{m+1}, \hat\theta(m))\, d_2^{\gamma}(Z_{m+1}, \hat\theta(m))}.
\]

8.3 Designs based on predictive biomarkers

Two distinct types of biomarkers arise in precision medicine:
• Prognostic biomarker: a biomarker that can be used to predict the most likely prognosis of an individual patient.
• Predictive biomarker: a biomarker that is likely to predict the response to a specific therapy (treatment).
To develop precision medicine, we need new adaptive designs based on predictive biomarkers (Hu, Wang and Zhao, 2019).

9 A/B testing under observational data

9.1 Simpson's Paradox

Table 21: Fictitious data illustrating Simpson's paradox.

| | Control (no drug): heart attack | Control (no drug): no heart attack | Treatment (took drug): heart attack | Treatment (took drug): no heart attack |
|---|---|---|---|---|
| Female | 1 | 19 | 3 | 37 |
| Male | 12 | 28 | 8 | 12 |
| Total | 13 | 47 | 11 | 49 |

Table 22: Fictitious data illustrating Simpson's paradox.

| | Control (no drug): heart attack | Control (no drug): no heart attack | Treatment (took drug): heart attack | Treatment (took drug): no heart attack |
|---|---|---|---|---|
| Low blood pressure | 1 | 19 | 3 | 37 |
| High blood pressure | 12 | 28 | 8 | 12 |
| Total | 13 | 47 | 11 | 49 |

The counts in Tables 21 and 22 are identical; only the stratifying variable differs. Within each stratum the heart-attack rate is higher in the treatment group (7.5% vs 5% in the first row, 40% vs 30% in the second), yet in the pooled data it is lower (11/60 ≈ 18.3% for treatment vs 13/60 ≈ 21.7% for control).

9.2 The real world effectiveness of BNT162b2 and mRNA-1273 COVID-19 Vaccines

Interim Estimates of Vaccine Effectiveness of BNT162b2 and mRNA-1273 COVID-19 Vaccines in Preventing SARS-CoV-2 Infection Among Health Care Personnel, First Responders, and Other Essential and Frontline Workers, Eight U.S. Locations, December 2020–March 2021.
10 Some basic principles of designing, running and analyzing an A/B test

10.1 Setting up the example

Consider a fictional online commerce site that sells flowers. There is a wide range of changes we can test: (i) introducing a new feature; (ii) a change to the user interface (UI); (iii) a back-end change; (iv) a change of price; and so on.

In this example, the marketing department wants to increase sales by sending promotional emails that include a coupon code for discounts on the flowers. There are several concerns: (i) revenue; (ii) cost; and so on.

We want to evaluate the impact of simply adding a coupon code field. Our goal is simply to assess the impact on revenue of having this coupon code field, and to evaluate the concern that it will distract people from checking out.

The online shopping process can be viewed as a funnel; see Figure.

There are many ways to change the user interface (UI). Here are two different UIs; see Figure.

Our hypothesis: Adding a coupon code field to the checkout page will degrade revenue.

To measure the impact of the change, we need to define goal metrics (usually difficult to identify).

For this experiment the goal metric is revenue. Should it be total revenue or revenue-per-user?
Which users should be considered in the denominator of the revenue-per-user metric?
(1) All users who visit the site;
(2) Only users who complete the purchase process;
(3) Only users who start the purchase process.

Option (3), only users who start the purchase process, is the best choice. The refined hypothesis becomes: Adding a coupon code field to the checkout page will degrade revenue-per-user for users who start the purchase process.

10.2 Hypothesis testing: establishing statistical significance

Discussions.

10.3 Designing the experiment

Some aspects:
1) What is the randomization unit?
2) What population of randomization units do we want to target?
3) How large (sample size) does our experiment need to be?
4) How long do we run the experiment?

Our experiment design is now as follows:
1) What is the randomization unit? The user.
2) What population of randomization units do we want to target? We target all users and analyze those who visit the checkout page.
3) How large (sample size) does our experiment need to be? To have 80% power to detect at least a 1% change in revenue-per-user, we will conduct a power analysis to determine the sample size.
4) How long do we run the experiment? The required sample size translates into running the experiment for a minimum of four days with a 34/33/33% split among Control / Treatment one / Treatment two. We will run the experiment for a full week to ensure that we understand the day-of-week effect, and potentially longer if we detect novelty or primacy effects.
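The power analysis in item 3 can be sketched as follows. This is only an illustration under stated assumptions: a two-sided two-sample z-test at level 0.05, equal group sizes, and a known standard deviation. The baseline mean and standard deviation of revenue-per-user are hypothetical values chosen for the example, and the function name `sample_size_per_group` is introduced here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in each variant to detect an absolute difference delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: revenue-per-user with mean 20.0 and standard deviation 30.0
baseline_mean, baseline_sd = 20.0, 30.0
delta = 0.01 * baseline_mean  # a 1% relative change in revenue-per-user
n = sample_size_per_group(baseline_sd, delta)
print(f"about {n} users per variant")
```

With a level of 0.05 and 80% power the factor $2(z_{1-\alpha/2} + z_{\text{power}})^2$ is about 15.7, which is where the common rule of thumb $n \approx 16\sigma^2/\delta^2$ per variant comes from. Dividing the required total by the expected daily traffic gives a first estimate of the minimum run time.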
10.4 Running the experiment and getting data

To run an experiment, we need both:
1) Instrumentation;
2) Infrastructure.

10.5 Interpreting the results

Discussions.

10.6 From results to decision

The goal of running A/B tests is to gather data to drive decision making. A lot of work goes into ensuring that our results are repeatable and trustworthy so that we can make the right decision. Some important aspects:
1) Do you need to make tradeoffs between different metrics?
2) What is the cost of launching this change?
3) What is the downside of making wrong decisions?

You need to make decisions from different results: Discussions.

11 Twyman's Law and Experimentation Trustworthiness

William Anthony Twyman was a UK radio and television audience measurement veteran (MR Web 2014) credited with formulating Twyman's Law, although he apparently never explicitly put it in writing.

Any statistic that appears interesting is almost certainly a mistake. by Paul Dickson (1999)

Any figure that looks interesting or different is usually wrong. by A.S.C. Ehrenberg (1975)

Twyman's law, perhaps the most important single law in the whole of data analysis... The more unusual or interesting the data, the more likely they are to have been the result of an error of one kind or another. by Catherine Marsh and Jane Elliott (2009)
11.1 Misinterpretation of the Statistical Results

In Null Hypothesis Significance Testing, we typically assume that there is no difference in metric value between Control and Treatment, and reject the hypothesis if the data present strong evidence against it. A common mistake is to assume that just because a metric is not statistically significant, there is no treatment effect. It could be that the experiment is underpowered to detect the effect size. An evaluation of 115 A/B tests at GoodUI.org suggests that most were underpowered.

The p-value is often misinterpreted. The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the Null hypothesis is true. The conditioning on the Null hypothesis is critical.

Here are some incorrect statements, with explanations, from A Dirty Dozen: Twelve P-Value Misconceptions (Goodman 2008):
• "If the p-value = 0.05, the Null hypothesis has only a 5% chance of being true." Incorrect: the p-value is calculated assuming that the Null hypothesis is true.
• "P-value = 0.05 means that we observed data that would occur only 5% of the time under the Null hypothesis." Incorrect by the definition of the p-value above, which includes values equal to or more extreme than what was observed.
• "P-value = 0.05 means that if you reject the Null hypothesis, the probability of a false positive is only 5%." Incorrect: this is the first misconception in another form, since the probability that a rejection is a false positive also depends on the prior probability that the Null hypothesis is true and on the power of the test.

Multiple Hypothesis Tests: The following story comes from the fun book, What is a p-value anyway? (Vickers 2009):
• Statistician: Oh, so you have already calculated the p-value?
• Surgeon: Yes, I used multinomial logistic regression.
• Statistician: Really? How did you come up with that?
• Surgeon: I tried each analysis on the statistical software drop-down menus, and that was the one that gave the smallest p-value.

The False Discovery Rate (Benjamini and Hochberg 1995) is a key concept for dealing with multiple tests.

Confidence Intervals: Discussion.

11.2 Threats to Internal Validity

• Violations of SUTVA: In the analysis of A/B tests, it is common to apply the Stable Unit Treatment Value Assumption (SUTVA) (Imbens and Rubin, 2015), which states that experiment units (e.g., users) do not interfere with one another. Their behavior is impacted by their own variant assignment, and not by the assignment of others. Discussion.
• Survivorship Bias: Discussion.
• Intention-to-Treat: Discussion.
• Sample Ratio Mismatch: Discussion.

11.3 Threats to External Validity

External validity refers to the extent to which the results of an A/B test can be generalized along axes such as different populations (e.g., other countries, other websites) or over time. Discussion.

12 Analyzing A/B tests

12.1 Two-sample t-test
Discussion.

12.2 P-value and Confidence Intervals

12.3 Normality Assumption

12.4 Type I/II Errors and Power

12.5 Multiple Testing

12.6 Meta-analysis

13 The A/A Test

Running A/A tests is a critical part of establishing trust in an experimentation platform. The idea is so useful because the tests fail many times in practice, which leads to re-evaluating assumptions and identifying bugs.

A/A tests are the same as A/B tests, but Treatment and Control users receive identical experiences. You can use A/A tests for several purposes, such as to:
• Ensure that Type I errors are controlled as expected.
• Assess the variability of metrics.
• Ensure that no bias exists between Treatment and Control users.
• Compare data to the system of record: if the system of record shows X users visited the website during the experiment and you ran Control and Treatment at 20% each, do you see around 20% of X users in each? Are you leaking users?
• Estimate variances for statistical power calculations.

14 Long-term treatment effects

Short-term effect (from the A/B test) vs long-term effect (what we care about). There are scenarios where the long-term effect is different from the short-term effect.

Reasons the treatment effect may differ between short-term and long-term:
• User-learned effects.
• Network effects.
• Delayed experience and measurement.
• Ecosystem changes: launching other new features; seasonality; competitive landscape; government policies; concept drift; etc.

Some Suggestions:
• Long-running experiments.
• Cohort Study and Analysis.
• Post-Period Analysis (Post-Market).
• Time-Staggered Treatments.
• Holdback and Reverse Experiment.

15 Conclusion and Remarks

The goal of Human Life: Understanding + Improving The Nature
A/B testing is becoming, and will continue to become, more and more important in understanding nature and ourselves.

Three Essential Components of Statistics (Data Science): Data + Computer and Software + Analytics
The A/B test is the essential tool.

Good Design of Experiment (producing useful data) [Big or Small Data]: Realistic, Efficient and Ethical.

Statisticians (Data Scientists) are experts in:
• producing useful data (big or small): survey sampling; experiment designs.
• analyzing (big or small) data to produce meaningful results, possibly with new statistical methods and computational skills.
• drawing practical conclusions.
The Classical Statistical Framework (Static):
Real Problem → Data Collection → Data Analysis → Decision

To match human intelligence, we may need (Hu, 2016) the Dynamic Statistical Framework (AI):
Real Problem → Data Collection → Data Analysis → Decision → + new Data → new Analysis → new Decision → · · ·

Producing Useful Data (Design of Experiments) in the Big Data and AI era: many new challenges: (i) from static to dynamic; (ii) from independent to dependent.

A/B tests in the Big Data and AI era: many new challenges: (i) from static to dynamic; (ii) from independent to dependent.

The goal of Human Life: Understanding + Improving The Nature
Statistics (Data Science) is a "GAME" between human and nature "THROUGH DATA".

We read the world wrong and say that it deceives us. (Tagore)
We read the data wrong and think that the data deceives us. (Feifang Hu, 2017)

Thank you!