The R package ‘rpact’ has been developed to design sequential and adaptive experiments. Many of its functions are also available in an online Shiny app. For more information about rpact, including a quick start guide and manual, visit the rpact website. This step-by-step vignette accompanies the manuscript “Group Sequential Designs: A Tutorial” by Lakens, Pahlke, & Wassmer (2021).
The online Shiny app for rpact is available at https://shiny.rpact.com. The default settings when the Shiny app is loaded are for a fixed sample design, which means that there is only one look at the data (kMax = 1). In other words, the default setting is not a sequential design, but a traditional design where the data are analyzed once. Moving the slider for the “Maximum number of stages” increases the number of looks in the design (you can select up to 10 looks).
The rpact package focuses on Confirmatory Adaptive Clinical Trial Design and Analysis. In clinical trials, researchers mostly test directional predictions, and thus, the default setting is to perform a one-sided test. Outside of clinical trials, it might be less common to design studies testing a directional prediction, but it is often a good idea. In clinical trials, it is common to use a 0.025 significance level (or Type I error rate) for one-sided tests, as it is deemed preferable in regulatory settings to set the Type I error rate for one-sided tests at half the conventional Type I error rate used in two-sided tests. In other fields, such as psychology, researchers typically use a 0.05 significance level, regardless of whether they perform a one-sided or two-sided test. A Type II error rate of 0.2 (or power of 0.8) is common in many fields, and is the default setting for the Type II error rate in the app.
Remember that you always need to justify your error rates – the defaults are most often not optimal choices in any real-life design (and it might be especially useful to choose a higher power, if possible).
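These default settings can also be reproduced directly in R. A minimal sketch (assuming rpact is installed), using the same `getDesignGroupSequential` function that appears later in this vignette; `kMax = 1` corresponds to the fixed sample design the app starts with:

```r
library(rpact)

# Fixed sample design (one look), one-sided test,
# alpha = 0.025 and beta = 0.2 (80% power) -- the app's defaults.
# Remember: in a real design these error rates need to be justified.
design <- getDesignGroupSequential(kMax = 1, sided = 1,
                                   alpha = 0.025, beta = 0.2)
summary(design)
```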
We can explore a group sequential design by moving the slider for the maximum number of stages to, say, kMax = 2. The option to choose a design appears above the slider in the form of three “Design” radio buttons (Group Sequential, Inverse Normal, and Fisher), which by default is set to a group sequential design – this is the type of design we will focus on in this step-by-step tutorial. The other options are relevant for adaptive designs, which we will not discuss here.
A new drop-down menu has appeared below the box to choose a Type II error rate that asks you to specify the “Type of design”. This allows you to choose how you want to control the alpha level across looks. By default the choice is an O’Brien-Fleming design. Set the type of design to the Pocock (P) option. Note there is also a Pocock type alpha spending (asP) option – we will use that later.
Because most people in the social sciences will probably have more experience with two-sided tests at an alpha level of 0.05, choose a two-sided test and an alpha level of 0.05. The input window should now look like the example below:
Click on the ‘Plot’ tab. The first plot in the drop-down menu shows the boundaries at each look. The critical z score at each look is presented, along with reference lines at z = 1.96 and z = -1.96. These reference lines are the critical values for a two-sided test with a single look (i.e., a fixed design) at an alpha level of 5%. We see that the boundaries on the z scale have increased. This means we need to observe a more extreme z score at an analysis to reject \(H_0\). Furthermore, we see that the critical bounds are constant across both looks. This is exactly the goal of the Pocock correction: the alpha level is lowered so that it is the same at each look, and the overall alpha level across all looks at the data is controlled at 5%. It is conceptually very similar to the Bonferroni correction. We can reproduce the design and the plot in R using the following code:
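A sketch using `getDesignGroupSequential` (the same function used in the comparison code later in this vignette), with the settings chosen in the app:

```r
library(rpact)

# Pocock design with 2 looks, two-sided test, overall alpha = 0.05
design <- getDesignGroupSequential(kMax = 2, typeOfDesign = "P",
                                   sided = 2, alpha = 0.05)

# Plot the critical boundaries (on the z-value scale) at each look
plot(design, type = 1)
```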
In the drop-down menu, we can easily change the type of design from Pocock (P) to O’Brien-Fleming (OF) to see the effect of using different corrections for the critical values across looks in the plot. We see that the O’Brien-Fleming correction has a different goal. The critical value at the first look is very high (which also means the alpha level for this look is very low), but the critical value at the final look is extremely close to the unadjusted critical value of 1.96 (or the alpha level of 0.05).
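In R, only the `typeOfDesign` argument needs to change to produce the O’Brien-Fleming version of the same design (a sketch, assuming rpact is installed):

```r
library(rpact)

# O'Brien-Fleming design with 2 looks, two-sided test, overall alpha = 0.05
design <- getDesignGroupSequential(kMax = 2, typeOfDesign = "OF",
                                   sided = 2, alpha = 0.05)

# The summary shows a very strict local alpha level at the first look,
# and a final-look level close to the unadjusted 0.05
summary(design)
plot(design, type = 1)
```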
We can plot the corrections for different types of designs for each of 3 looks (2 interim looks and one final look) in the same plot in R. The plot below shows the Pocock, O’Brien-Fleming, Haybittle-Peto, and Wang-Tsiatis correction with \(\Delta\) = 0.25. We see that researchers can choose different approaches to spend their alpha level across looks. Researchers can choose to spend their alpha conservatively (keeping most of the alpha for the last look), or more liberally (spending more alpha at the earlier looks, which increases the probability of stopping early for many true effect sizes).
```r
# Comparison of corrections
d1 <- getDesignGroupSequential(typeOfDesign = "OF", sided = 2, alpha = 0.05)
d2 <- getDesignGroupSequential(typeOfDesign = "P", sided = 2, alpha = 0.05)
d3 <- getDesignGroupSequential(typeOfDesign = "WT", deltaWT = 0.25, sided = 2, alpha = 0.05)
d4 <- getDesignGroupSequential(typeOfDesign = "HP", sided = 2, alpha = 0.05)

designSet <- getDesignSet(designs = c(d1, d2, d3, d4), variedParameters = "typeOfDesign")
plot(designSet, type = 1, legendPosition = 5)
```
Because the statistical power of a test depends on the alpha level (as well as the effect size and the sample size), at the final look the statistical power of an O’Brien-Fleming or Haybittle-Peto design is very similar to the statistical power of a fixed design with only one look. If the alpha level is lowered, the sample size of a study needs to be increased to maintain the same statistical power at the last look. Therefore, the Pocock correction requires a markedly larger increase in the maximum sample size than the O’Brien-Fleming or Haybittle-Peto correction. We will discuss these issues in more detail when we consider sample size planning below.
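One way to preview this difference in rpact is via `getDesignCharacteristics()`, whose inflation factor expresses the maximum sample size relative to a fixed design with the same alpha level and power (a sketch; the `inflationFactor` field is an assumption about the returned object):

```r
library(rpact)

# Compare maximum sample size inflation of Pocock vs. O'Brien-Fleming,
# relative to a fixed single-look design with the same alpha and power
dOF <- getDesignGroupSequential(kMax = 2, typeOfDesign = "OF",
                                sided = 2, alpha = 0.05)
dP  <- getDesignGroupSequential(kMax = 2, typeOfDesign = "P",
                                sided = 2, alpha = 0.05)

# inflationFactor = maximum sample size / fixed-design sample size;
# it is larger for the Pocock correction
getDesignCharacteristics(dOF)$inflationFactor
getDesignCharacteristics(dP)$inflationFactor
```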
If you head to the ‘Report’ tab, you can download an easily readable summary of the main results. Here, you can also see the alpha level you would use for each look at the data (e.g., p < 0.0052 and p < 0.0480 for an O’Brien-Fleming type design with 2 looks).
Corrected alpha levels can be computed to many digits, but this quickly reaches a level of precision that is meaningless in practice. The observed Type I error rate for all tests you will do in your lifetime is not noticeably different if you set the alpha level at 0.0194, 0.019, or 0.02 (see the concept of ‘significant digits’). Even though we calculate and use alpha thresholds to many digits in sequential tests, the messiness of most research gives these alpha levels false precision. Keep this in mind when interpreting your data. Note that the rpact Shiny app usefully shows the R code required to reproduce the output.
```
## Sequential analysis with a maximum of 2 looks (group sequential design)
##
## O'Brien & Fleming design, two-sided local significance level 5%, power 80%,
## undefined endpoint.
##
## Stage                                   1       2
## Information rate                      50%    100%
## Efficacy boundary (z-value scale)   2.797   1.977
## Cumulative alpha spent             0.0052  0.0500
## Two-sided local significance level 0.0052  0.0480
```
An important contribution to the sequential testing literature was made by Lan and DeMets (1983), who proposed the alpha spending function approach. In the figure below, the O’Brien-Fleming-like alpha spending function is plotted against the discrete O’Brien-Fleming bounds. We see the two approaches are not identical, but they are very comparable. The main benefit of these spending functions is that the error rate of the study can be controlled while neither the number nor the timing of the looks needs to be specified in advance. This makes alpha spending approaches much more flexible. When using an alpha spending function, it is important that the decision to perform an interim analysis is not based on collected data, as this can still increase the Type I error rate.
```r
d1 <- getDesignGroupSequential(typeOfDesign = "P", kMax = 5)
d2 <- getDesignGroupSequential(typeOfDesign = "asP", kMax = 5)
d3 <- getDesignGroupSequential(typeOfDesign = "OF", kMax = 5)
d4 <- getDesignGroupSequential(typeOfDesign = "asOF", kMax = 5)

designSet <- getDesignSet(designs = c(d1, d2, d3, d4), variedParameters = "typeOfDesign")
plot(designSet, type = 1)
```