The Reappraised package is a set of tools for assessing whether the distributions (and other aspects) of variables arising from randomisation in a group of randomised trials are consistent with the distributions that would arise by chance. Ten techniques are described here. I’ve updated this vignette for v 0.2. While the back end of the functions has changed substantially, there should be no important changes to their output.
Continuous variables
Categorical variables
Other functions
To start, the data set needs to be uploaded. The function
load_clean() does this. It takes an excel spreadsheet or a
data frame already in R and changes it into the format each function
needs. It also checks for a few fundamental errors that would prevent
the functions working and returns the data frame in a list. The
description of what variables are needed and their format is in the help
file for the function
eg using the SI_pvals_cont data set:
pval_cont_data <- load_clean(import= "no", file.cont = "SI_pvals_cont", pval_cont= "yes", format.cont = "wide")$pval_cont_data

Or for an Excel spreadsheet:
pval_cont_data <- load_clean(import= "yes", pval_cont = "yes", dir = "path", file.name.cont = "filename", sheet.name.cont = "sheet name", range.name.cont = "range of cells", format.cont = "wide")$pval_cont_data
For example, to load a spreadsheet from the package examples:
# get path for example files
path <- system.file("extdata", "reappraised_examples.xlsx",
package = "reappraised", mustWork = TRUE)
# delete file name from path
path <- sub("/[^/]+$", "", path)
# load data
pval_cont_data <- load_clean(import= "yes", pval_cont = "yes", dir = path,
file.name.cont = "reappraised_examples.xlsx",
sheet.name.cont = "SI_pvals_cont",
range.name.cont = "A1:O51", format.cont = "wide")$pval_cont_data

One continuous and one categorical data set can be loaded in the same step, but there are some limitations, eg an Excel spreadsheet and an R data frame can’t be loaded in the same step.
P-values for the comparisons of baseline continuous variables will be uniformly distributed if the tests (t-test/ANOVA etc) are done on the raw data, but not if they are done on summary statistics (see Bolland et al, Baseline P value distributions in randomized trials were uniform for continuous but not categorical variables. J Clin Epidemiol 2019;112:67-76; and Bolland et al, Rounding, but not randomization method, non-normality, or correlation, affected baseline P-value distributions in randomized trials. J Clin Epidemiol 2019;110:50-62.)
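The expected uniformity, and the AUC of the ECDF used below, can be illustrated with a small base-R simulation (a sketch, not package code):

```r
# Under the null, both groups are drawn from the same distribution, so
# p-values from t-tests on raw data should be approximately uniform
set.seed(1)
pvals <- replicate(2000, t.test(rnorm(50), rnorm(50))$p.value)

hist(pvals, breaks = 20)  # roughly flat histogram
auc <- 1 - mean(pvals)    # AUC of the ECDF on [0, 1]
auc                       # close to 0.5 when the distribution is uniform
```

The AUC of the ECDF of values on [0, 1] simplifies to 1 minus their mean, which is why 0.5 is the reference value for a uniform distribution.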
To check the distribution, use the pval_cont_fn() to
generate the distribution of p-values and the associated AUC of the
ECDF.
First the distribution:
base_pval_cont <- pval_cont_fn(pval_cont_data, btsp= 500, verbose = FALSE)
base_pval_cont$all_results$pval_cont_calculated_pvalues

Then the area under the curve, ie the amount by which the distribution of values varies from the expected distribution:
For this example we can see that the distribution is consistent with the expected distribution, but 50 variables is a relatively small number and only a very large difference from the expected could be detected.
These p-values were calculated from the summary statistics so the expected distribution is not uniform, and in this case has been calculated empirically from simulations. Likewise, the expected ECDF is not a straight line.
If you use reported p-values (presumably calculated from the raw data), then the expected distribution is uniform and the expected distribution in the ECDF a straight line.
In this case, the distribution is not consistent with expected, and the AUC differs markedly from the expected value of 0.5. However, there are not many reported p-values in this example, so no reliable conclusions can be drawn.
A new analysis is a p-value checker, which calculates the minimum and maximum possible p-values after accounting for rounding in the summary statistics. In this case, the suffix either.var means that the minimum or maximum p-value across equal- and unequal-variance tests is shown.
knitr::kable(head(base_pval_cont$pval_cont_check), format = "html",
col.names = c("Study", "Variable", "Reported P", "Flag", "Min Possible P (Either)", "Max Possible P (Either)")) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F, font_size = 12) |> kableExtra::column_spec(5:6, width = "5em")| Study | Variable | Reported P | Flag | Min Possible P (Either) | Max Possible P (Either) |
|---|---|---|---|---|---|
| H2 | Age (years) | NA | 0.532 | 0.644 | |
| H2 | Finger | NA | 0.531 | 1.000 | |
| H2 | Leg | NA | 0.197 | 0.690 | |
| H2 | Ionized calcium (mEq/L) | NA | 0.908 | 1.000 | |
| H3 | Intact side | 0.68 | flag | 0.370 | 0.380 |
| H3 | Intact BGP (ng/mL) | 0.89 | flag | 0.070 | 0.080 |
For this example, we can see that two p-values lie outside the possible range and are flagged.
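The logic of the checker can be sketched in base R. This is illustrative only (the helper t_from_summary and all values are hypothetical, and the package's implementation also varies the SDs and handles unequal variances): a mean reported as 10.2 to one decimal place could represent any true value in [10.15, 10.25), so the p-value can be recomputed at the extremes of the rounding interval.

```r
# Two-sample t-test (equal variances) from summary statistics
t_from_summary <- function(m1, m2, s1, s2, n1, n2) {
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  tstat <- (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
  2 * pt(-abs(tstat), df = n1 + n2 - 2)
}

# Hypothetical summary statistics, reported to 1 decimal place
m1 <- 10.2; m2 <- 10.5; s1 <- 1.3; s2 <- 1.4; n1 <- n2 <- 30
half <- 0.05  # half the rounding unit

p_reported <- t_from_summary(m1, m2, s1, s2, n1, n2)
# push the means apart / together to the edges of their rounding intervals
p_min <- t_from_summary(m1 - half, m2 + half, s1, s2, n1, n2)
p_max <- t_from_summary(m1 + half, m2 - half, s1, s2, n1, n2)
c(p_min, p_reported, p_max)
```

A reported p-value lying outside [p_min, p_max] is then flagged as impossible given the reported summary statistics.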
Matching baseline summary statistics (means and SDs) in randomised trials occur but are uncommon. In n-arm trials, the situation where the mean for a variable is identical in all groups AND the SD for that variable is also identical in all groups is very uncommon. Variables reported with fewer significant figures and variables with small SDs are more likely to have matching means, matching SDs, or both a matching mean and a matching SD. (See Bolland et al. Identical summary statistics were uncommon in randomized trials and cohort studies. J Clin Epidemiol 2021;136:180-188.) Reference data from a large database of trials collected by John Carlisle are described in the article. These data, or empirically calculated data, can be used as the reference.
To check the frequency of matching summary statistics, use the
match_fn() to generate a summary table with a comparison to
empirically generated reference data:
match_data <- load_clean(import= "no", file.cont = "SI_pvals_cont", match= "yes", format.cont = "wide", verbose = FALSE)$match_data
match <- match_fn(match_data, verbose= FALSE)
match$match_ft_all

| Variable | All variables | 1 Sig. dig. | 2 Sig. dig. | 3 Sig. dig. | 4 Sig. dig. | 5 Sig. dig. |
|---|---|---|---|---|---|---|
| 30 trials | | | | | | |
| Number of variables, n (%) | | | | | | |
| Mean | 50 | 1 (2.0) | 16 (32.0) | 28 (56.0) | 5 (10.0) | |
| SD | 50 | 6 (12.0) | 34 (68.0) | 10 (20.0) | 0 (0.0) | |
| Proportion of matching summary statistics in both treatment groups (%) | | | | | | |
| Means match | 10 | 0.0 | 18.8 | 7.1 | 0.0 | |
| SDs match | 10 | 16.7 | 11.8 | 0.0 | | |
| Both mean and SD match | 2 | 0.0 | 3.0 | 0.0 | | |
| Reference data empirically generated | | | | | | |
| 30 trials, 4000 sims | | | | | | |
| Number of variables, n (%) | | | | | | |
| Mean | 200000 | 4000 (2.0) | 64000 (32.0) | 112000 (56.0) | 20000 (10.0) | |
| SD | 200000 | 24000 (12.0) | 136000 (68.0) | 40000 (20.0) | 0 (0.0) | |
| Proportion of matching summary statistics in both treatment groups (%) | | | | | | |
| Means match | 9.4 | 55.0 | 12.8 | 7.3 | 1.4 | |
| SDs match | 11.3 | 35.1 | 10.0 | 1.6 | | |
| Both mean and SD match | 2.3 | 11.6 | 1.0 | 0.0 | | |
The table shows that the number of matches is similar to the reference data set but conclusions are limited by the relatively small number of variables.
Another way of addressing this is to ask: what is the probability of observing the matching summary data? The observed number of matches can be compared with the expected number for each variable, and the probability that no matching data would be observed can then be calculated. This can be done for each study.
knitr::kable(match$match_probbystudy |> dplyr::filter (Study == "H11"), format = "html") |> kableExtra::kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F, font_size = 12) |> kableExtra::column_spec(c(2,6), width = "5em")

| Study | Number of variables | Results | Mean | SD | Both mean and SD |
|---|---|---|---|---|---|
| H11 | 5 | Matches (n) | 1 | 0 | 0 |
| H11 | 5 | Matches (%) | 20 | 0 | 0 |
| H11 | 5 | Expected matches (%) | 8.5 | 11.4 | 1.8 |
| H11 | 5 | Probability of no matches | 0.63 | 0.53 | 0.91 |
Here we can see there is one matching mean out of 5 variables in Study H11 with probabilities of no matching summary statistics shown. This can then be extended to compare the observed and expected matches.
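The probability calculation can be sketched with hypothetical numbers (the package uses per-variable expected probabilities; here a single average is assumed for all five variables): if each of n variables independently has expected match probability p, the probability of no matches is the product of (1 - p).

```r
# Hypothetical expected probabilities that each of 5 variables has a
# matching mean (~8.5%, close to the average in the table above)
p_match <- rep(0.085, 5)
prob_no_match <- prod(1 - p_match)
round(prob_no_match, 2)  # about 0.64 under these assumed probabilities
```

With the per-variable probabilities the package uses, this reproduces values like the 0.63 shown for Study H11.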
The graphs show that 22 trials had no matching means and 5 trials had at least one, compared with expected values of 22.65 and 4.05. Observed to expected ratios can also be calculated.
Similar to matching summary statistics within a trial, it is uncommon for identical summary statistics to be reported in different trials from the same population group. For example, if two separate trials recruit patients with osteoporosis, it would be unusual for the mean age of their participants to be identical. (See Bolland et al. Identical summary statistics were uncommon in randomized trials and cohort studies. J Clin Epidemiol 2021;136:180-188.)
To check the frequency of matching summary statistics across
different cohorts, use the cohort_fn() to generate a
summary table…
cohort_data <- load_clean(import= "no", file.cont = "SI_cohort", cohort= "yes", format.cont = "long", verbose= FALSE)$cohort_data
cohort <- cohort_fn(cohort_data, seed = 10, sims = 100, verbose= FALSE)
cohort$cohort_ft

| Variable | Mean | Mean Match | Largest Number Mean Matches | P Mean Match^c | SD | SD Match | Largest Number SD Matches | P SD Match^c | Mean (SD) | Mean and SD Match | Largest Number Mean and SD Matches | P Mean and SD Match^c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 10 | 0 (0) | 0 (0) | 0.10 | 9 | 0 (0) | 0 (0) | 0.12 | 73.2 (4.1) | 0 (0) | 0 (0) | 0.89 |
| BMI | 10 | 0 (0) | 0 (0) | 0.01 | 10 | 3 (30) | 3 (30) | 0.01 | 22.3 (2) | 0 (0) | 0 (0) | 0.24 |
| ICTP | 10 | 3 (30) | 3 (30) | 0.17 | 9 | 4 (44) | 4 (44) | 0.08 | 7.1 (1.1) | 3 (30) | 3 (30) | 0.19 |
| Osteocalcin/Bone Gla (BGA) | 10 | 2 (20) | 2 (20) | 0.42 | 9 | 4 (44) | 4 (44) | 0.44 | 7 (4.5) | 2 (20) | 2 (20) | 0.12 |
| BMD | 10 | 2 (20) | 2 (20) | 0.68 | 9 | 2 (22) | 2 (22) | 0.44 | 2.55 (0.36) | 2 (20) | 2 (20) | 0.07 |
| iCa (mmol/L) | 10 | 8 (80) | 4 (40) | 0.54 | 10 | 6 (60) | 2 (20) | 0.19 | 1.239 (0.047) | 0 (0) | 0 (0) | 0.02 |
| Dietary vitamin D | 10 | 6 (60) | 2 (20) | 0.54 | 10 | 6 (60) | 2 (20) | 0.35 | 124 (25) | 6 (60) | 2 (20) | 0.01 |
| PTH | 10 | 4 (40) | 2 (20) | 0.52 | 8 | 4 (50) | 2 (25) | 0.17 | 33.12 (8.9) | 4 (40) | 2 (20) | 0.01 |
| 1,25(OH)2D^b | 10 | 6 (60) | 4 (40) | 0.04 | 10 | 5 (50) | 3 (30) | 0.13 | 49.6 (9.2) | 3 (30) | 3 (30) | 0.01 |
| 25(OH)D^a | 10 | 4 (40) | 2 (20) | 0.42 | 10 | 4 (40) | 2 (20) | 0.66 | 25 (5.8) | 4 (40) | 2 (20) | 0.01 |

^a The columns represent the number of reported means and SDs; the number of means, SDs, and both means and SDs that match amongst the different cohorts; and the largest number of means, SDs, or both matching means and SDs in a single cohort that had a matching value in a different cohort.

^b Where there was no matching mean (SD), the average mean and SD in the cohorts weighted by participants are reported; otherwise the most common mean (SD), ie the mode, is reported together with the number of occurrences of the mode.

^c P refers to the probability of the reported number of matching means, matching SDs, or both matching means and SDs for each variable in 100 simulations.
In this example, 1,25OHD (3), 25OHD (4), Dietary vitamin D (6), and PTH (4) all had a higher number of matches for the combination of mean/SD in the 10 cohorts than expected (and iCa had a lower number than expected).
… Or use cohort_fn() to generate a graph of the observed
to expected numbers of matches per cohort.
In this example, there are 3 cohorts with no matching mean/SD combinations compared with the expected value of 6.2, and 7 cohorts with 1 or more matching mean/SD combinations compared with the expected value of 3.8. Although there are only 10 cohorts, this is still an improbable result.
To compare the distribution of p-values for differences in baseline
means calculated using Carlisle’s montecarlo anova method, use
anova_fn(). (Method is from Carlisle JB, Loadsman JA.
Evidence for non-random sampling in randomised, controlled trials by
Yuhji Saitoh. Anaesthesia. 2017;72:17-27. R code is in appendix to
paper.)
anova_data <- load_clean(import= "no", file.cont = "SI_pvals_cont", anova= "yes", format.cont = "wide")$anova_data
anova_fn(anova_data, seed = 10, sims = 100, btsp = 100, verbose = FALSE)$anova_ecdf

In this case, the distribution of p-values is consistent with the expected distribution.
In version 0.2, I have removed the option of different methods and just used the fastest one.
The expected distribution of baseline categorical variables (or outcome variables unrelated to treatment allocation) is the binomial distribution. Therefore the observed distribution of variables with 0, 1, 2, 3 … participants in each treatment group, and the differences between the randomised groups in two-arm studies, can be compared with the expected distributions. (Bolland et al Participant withdrawals were unusually distributed in randomized trials with integrity concerns: a statistical investigation. J Clin Epidemiol 2021;131:22-29.)
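The binomial expectation can be sketched with a quick simulation (hypothetical group size and event rate, not package code): counts of a characteristic in each randomised group are binomial, so the distribution of the between-group difference follows directly.

```r
# Two groups of 50 participants with a hypothetical event rate of 0.1,
# e.g. the number of withdrawals in each group across many trials
set.seed(1)
g1 <- rbinom(10000, size = 50, prob = 0.1)
g2 <- rbinom(10000, size = 50, prob = 0.1)

# expected distribution of the absolute between-group difference
round(prop.table(table(abs(g1 - g2))), 3)
```

The observed frequencies of 0, 1, 2, … participants per group (and the differences between groups) in a set of trials can then be compared with this expected distribution.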
To compare the observed and expected distributions, use the
cat_fn() to generate a graph of observed to expected
frequencies
cat_data <- load_clean(import= "no", file.cat = "SI_cat", cat= "yes", format.cat = "wide", cat.names = c("n", "w"))$cat_data
cat_fn(cat_data, x_title="withdrawals", prefix="w", del.disparate="yes", verbose = FALSE)$cat_graph

The plots show more differences in withdrawals of 0 or 1 between the trial groups than expected, and fewer differences in withdrawals of 2 or more than expected.
The same principle applies to the number of participants in trials as to a single categorical variable. If simple randomisation has occurred, the expected distribution of the number of participants, and of the differences between groups in two-arm trials, is the binomial distribution. An example is in Bolland et al. Systematic review and statistical analysis of the integrity of 33 randomized controlled trials. Neurology 2016;87:2391-2402.
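The expected distribution of between-group differences under simple randomisation can be computed exactly in base R (a sketch with a hypothetical trial size):

```r
# With simple randomisation of N participants, the size of group 1 is
# Binomial(N, 0.5), and the between-group difference is |2x - N|
N <- 100
x <- 0:N
d <- abs(2 * x - N)

# exact probability that the difference is 0-4 in a trial of 100
p_diff_0_4 <- sum(dbinom(x[d <= 4], N, 0.5))
round(p_diff_0_4, 2)  # about 0.38
```

So if far more than ~38% of 100-participant trials report group sizes within 4 of each other, the reported numbers are unlikely to have come from simple randomisation.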
To compare the observed and expected distributions of trial
participants in trials using simple randomisation, use the
sr_fn() which generates a graph of observed to expected
frequencies
sr_data <- load_clean(import= "no", file.cat = "SI_cat", sr= "yes", format.cat = "wide")$sr_data
sr_fn(sr_data, verbose = FALSE)$sr_graph

The figure shows too many trials in which the participant numbers were similar (difference 0-4) in each group if simple randomisation had occurred.
If a trial uses block randomisation and the final block size is known, then the number of participants in the final block can be determined. For pragmatic reasons, you’d expect the final block to be filled (since the pre-planned number of participants is recruited). If the number in the final block can be determined, the distribution of differences between groups within the final block can be compared with the expected distribution. However, it seems unlikely that many studies would report sufficient data to do this.
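The arithmetic for the final block can be sketched with hypothetical values (total recruited and block size are the two pieces of information a report would need to provide):

```r
# Number of participants in the final block: the remainder after filling
# complete blocks (an exact multiple means the final block is full)
N <- 47          # hypothetical total participants
block_size <- 6  # hypothetical (final) block size
n_final <- N %% block_size
if (n_final == 0) n_final <- block_size
n_final
```

The differences between groups within final blocks across a set of trials can then be compared with their expected distribution, as in the sr_fn() example below.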
If the data are available, use the block randomisation options in
sr_fn()
#create data set
block = data.frame(study = c(1,2,3,4,5,6,7,8,9,10), #study numbers
fb_sz= c(2,4,6,8,10,12,8,8,6,14), #final block size
n_fb = c(1,1,4,5,7,8,4,6,2,10), #number in final block
df=c(1,1,0,1,3,4,2,2,0,0)) #difference between groups
sr_fn(br = "yes", block = block, verbose = FALSE)$sr_graph

In this example, the observed and expected distributions of differences between groups in participants in the final block are similar.
Rather than using just a single categorical variable, the distribution of the frequency counts of all baseline categorical variables can be compared with the expected (Bolland et al. Distributions of baseline categorical variables were different from the expected distributions in randomized trials with integrity concerns. J Clin Epidemiol. 2023;154:117-124)
To compare the observed and expected distributions, use the
cat_all_fn() to generate a graph of observed to expected
frequency counts.
cat_all_data <- load_clean(import= "no", file.cat = "SI_cat_all", cat_all= "yes", format.cat = "wide")$cat_all_data
cat_all_fn(cat_all_data, two_levels = "y", del.disparate = "y", excl.level = "yes", seed = 10, verbose = FALSE)$cat_all_graph_pc

In this example, there are more variables with a difference of 0-1 or 2 between the groups, and fewer variables with a difference of >4, compared to the expected distribution.
If variables have more than 2 levels (eg brown eyes, blue eyes, green
eyes), using two_levels = "y" recodes the data into the two
largest groups.
Since for each variable with k levels there are only k-1 independent
levels (the final level can be calculated as the total minus the sum of
the other levels), double counting can occur. excl.level = "yes"
excludes a randomly selected level from the analysis.
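The two-level recoding can be sketched with a hypothetical 3-level variable (this just illustrates picking the two largest levels; see the help file for exactly how the package recodes the remaining level):

```r
# Hypothetical counts for a 3-level variable (eye colour) in one group
counts <- c(brown = 40, blue = 35, green = 5)

# the two largest levels, which two_levels = "y" would keep
two_largest <- names(sort(counts, decreasing = TRUE))[1:2]
two_largest
```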
As a p-value for the comparison of a categorical variable between groups (eg using Chi-square or Fisher’s exact test) can be calculated from the summary statistics, reported p-values can be compared with independently calculated values. (See Bolland et al. Distributions of baseline categorical variables were different from the expected distributions in randomized trials with integrity concerns. J Clin Epidemiol. 2023;154:117-124)
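Recalculating such a p-value needs only the 2x2 table of counts, which base R handles directly (a sketch with hypothetical counts, not the package's code):

```r
# Hypothetical 2x2 table: events and non-events in each randomised group
tab <- matrix(c(8, 42, 12, 38), nrow = 2,
              dimnames = list(c("event", "no event"), c("group 1", "group 2")))

chisq.test(tab, correct = FALSE)$p.value  # Pearson chi-square
chisq.test(tab)$p.value                   # with Yates continuity correction
fisher.test(tab)$p.value                  # Fisher's exact test
```

As the different tests give different p-values for the same table, the package compares the reported p-value against several candidate tests, as shown below.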
Previously these analyses were accessed through the
cat_all_fn() but it seemed more natural to include them
with the pval_cat_fn() so I moved them there.
To compare reported and calculated p-values, use the
pval_cat_fn() to produce a table of the comparison.
pval_cat_data <- load_clean(import= "no", file.cat = "SI_cat_all", pval_cat= "yes", format.cont = "wide", verbose = FALSE)$pval_cat_data
pval_cat <- pval_cat_fn (pval_cat_data, match_pvals = "yes", miss_data_explain = "yes", verbose = FALSE)
pval_cat$pval_cat_ft_diff_calc_rep_p

| Difference between calculated and reported p-values | n (%) |
|---|---|
0 | 17 (68) |
0 - 0.1 | 6 (24) |
0.1 - 0.2 | 2 (8) |
0.2 - 0.3 | 0 (0) |
0.3 - 0.4 | 0 (0) |
0.4 - 0.5 | 0 (0) |
0.5 - 0.6 | 0 (0) |
0.6 - 0.7 | 0 (0) |
0.7 - 0.8 | 0 (0) |
0.8 - 0.9 | 0 (0) |
0.9 - 1 | 0 (0) |
In this example (from a real data set of trials), 45/50 calculated p-values were >0.5 and 2 differed by >0.1 from the reported value.
The actual data are in the data frame pval_cat_check.
Here I’ve transposed a row (row 9) as an example:
row <- as.data.frame(t(pval_cat$pval_cat_check [9,]))
knitr::kable(row, col.names = "Value", caption = "Detailed view of the first result")| Value | |
|---|---|
| study | H15 |
| var | Stroke |
| reported_pvalue | 0.85 |
| statistical_test | chisq |
| any_match | Midp.classic |
| match_reported_p | No match |
| threshold_match | NA |
| threshold_test | NA |
| p_chi | 0.80 |
| p_chi_yates | 0.90 |
| p_fisher | 0.90 |
| p_lr | 0.80 |
| p_cmh | 0.80 |
| p_midp_exact | 0.80 |
| p_midp_classic | 0.85 |
| chi_warning | |
| calculated_level | NA |
| p_missing_match | 0.85 |
| data_missing_match | 80, 88, 42, 44 |
| change_made | Removed 18 from group 1, 8 from group 2 |
This shows that study H15 reported a p-value of 0.85 for the variable
Stroke using a Chi-square test, but the p-value calculated from the
summary statistics using the same test was 0.80. A midp_classic test,
but none of the other tests, gave a matching p-value. Finally, removing
18 participants from group 1 and 8 from group 2 generated the
two-by-two table in data_missing_match, which is the closest
table that reproduces the reported p-value using the reported test
statistic (Chi-square in this case).
Although the distributions of p-values from the comparison between groups for categorical variables are not uniform, they can be empirically calculated. (Bolland et al. Baseline P value distributions in randomized trials were uniform for continuous but not categorical variables. J Clin Epidemiol 2019;112:67-76.)
To compare the observed and expected distribution, use the
pval_cat_fn() to generate the distribution of p-values and
the associated AUC of the ECDF. See pval_cont_fn() for more details.
pval_cat_data <- load_clean(import= "no", file.cat = "SI_cat_all", pval_cat= "yes", format.cont = "wide", verbose = FALSE)$pval_cat_data
pval_cat_fn(pval_cat_data, sims = 100, seed=10, btsp = 100, verbose = FALSE)$pval_cat_calculated_pvalues

In this example, the observed distribution differs markedly from the expected distribution.
The help file gives some details of the methods. You can choose to
calculate p-values with the Chi-square or Fisher’s exact test
(stat = 'chisq' or stat = 'fisher'), or
override the tests given in the data frame with
stat.overide = "yes". Previously there were different options
for method, but I’ve simplified that and use only the fastest one.
This is an approach in development and requires further use and validation. The frequency of final digits of the summary statistics for a group of variables is plotted.
generic_data <- load_clean(import= "no", file.cont = "SI_pvals_cont", generic= "yes", gen.vars.del = c("p"), format.cont = "wide")$generic_data
final_digit_fn(generic_data, vars = c("m","s"), dec.pl = "n", verbose = FALSE)$digit_graph

This example shows that there are more summary statistics ending in 4 or 5, and fewer ending in 9 or 0. One difficulty is that R and Excel drop trailing 0s from numbers, so to record a value of 1.0 it needs to be stored as character/text. But in almost all situations, numbers are not recorded or saved as text. To get round this, the number of decimal places for a variable is assumed to be constant across treatment groups, and the maximum number of decimal places is used.
As many variables will be rounded, it is difficult to know what the expected distribution of final digits should be. In Carlisle’s dataset of >5000 trials and 140,000 variables, the distribution was approximately uniform, except that there were fewer 0s.
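The trailing-zero workaround can be sketched in base R (hypothetical values; the assumed number of decimal places would in practice be the maximum observed for the variable):

```r
# Format to a fixed number of decimal places before taking the final
# digit, so that trailing zeros (eg 5.0) are not lost
vals <- c(12.4, 3.5, 7.1, 0.9, 5.0)  # hypothetical summary statistics
dec  <- 1                            # assumed decimal places for this variable

s <- formatC(vals, format = "f", digits = dec)
final_digit <- substr(s, nchar(s), nchar(s))
table(final_digit)
```

Taking the final digit of as.character(5.0) would give "5" rather than "0", which is why the fixed-width formatting step matters.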
If you want to customise or alter the graphs, I’ve put the data into
the listed output for each function. The graph_maker_fn()
allows you to regenerate the graph from the list. For example, using
pval_cat_fn():
pval_cat_data <- load_clean(import= "no", file.cat = "SI_cat_all", pval_cat= "yes", format.cont = "wide", verbose = FALSE)$pval_cat_data
pval_cat <- pval_cat_fn(pval_cat_data, sims = 100, seed=10, btsp = 100, verbose = FALSE)
graphs <- pval_cat$all_results$graph_data_calc
graph_maker_fn(graphs)[[2]]
By examining and changing the contents of the list in
graphs, the text and graph format can be varied.
If you use the package or any of the functions, let me know. I’d be interested to hear about the results. Plus if there are errors anywhere, or things I can do to improve the package, or other functions to add in, please do let me know. I’m largely self-taught in R, and I’m sure the code and package can be improved.