# Introduction to rtables

## Introduction

The rtables R package provides a framework to create, tabulate and output tables in R. Most of the design requirements for rtables have their origin in studying tables that are commonly used to report analyses from clinical trials; however, we were careful to keep rtables a general purpose toolkit.

There are a number of other table frameworks available in R, such as gt from RStudio, xtable, tableone, and tables, to name a few. There are a number of reasons to implement rtables (yet another tables R package):

• output tables in ASCII to text files
• table rendering (ASCII, HTML, etc.) is separate from the data model. Hence, one always has access to the non-rounded/non-formatted numbers.
• pagination in both horizontal and vertical directions to meet the health authority submission requirements
• cell, row, column, table reference system
• titles, footers, and referential footnotes
• path based access to cell content which will be useful for automated content generation

In the remainder of this vignette, we give a short introduction to rtables and to tabulating a table. The content is based on the useR 2020 presentation from Gabriel Becker.

The packages used for this vignette are rtables and dplyr:

library(rtables)
library(dplyr)

## Data

The data used in this vignette is made up using random number generators. The data content is relatively simple: one row per imaginary person and one column per measurement: study arm, country of origin, gender, handedness, age, and weight.

n <- 400

set.seed(1)

df <- tibble(
arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
age = rchisq(n, 30) + 10
) %>% mutate(
weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)

head(df)
# # A tibble: 6 × 6
#   arm   country gender handed   age weight
#   <fct> <fct>   <fct>  <fct>  <dbl>  <dbl>
# 1 Arm A USA     Female Left    31.3   139.
# 2 Arm B CAN     Female Right   50.5   116.
# 3 Arm A USA     Male   Right   32.4   186.
# 4 Arm A USA     Male   Right   34.6   169.
# 5 Arm B USA     Female Right   43.0   160.
# 6 Arm A USA     Female Right   43.2   126.

Note that we use factor variables so that the level order is represented in the row or column order when we tabulate the information of df below.
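The effect of level order can be seen without rtables. The following is a minimal base R sketch on a hypothetical toy vector (not the df above): the same values are stored with two different level orders, and table() reports its counts in level order, just as a tabulation would.

```r
# Toy data, not the vignette's df: the same values with two different level orders.
x <- c("USA", "CAN", "USA", "CAN", "CAN")

alphabetical <- factor(x)                            # levels default to sorted order
custom       <- factor(x, levels = c("USA", "CAN"))  # levels given explicitly

levels(alphabetical)
levels(custom)

# table() reports counts in level order, not in order of appearance.
names(table(alphabetical))
names(table(custom))
```

Setting the levels explicitly, as done for every factor in df, is therefore what fixes the row and column order of the tables built later.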

## Building a Table

The aim of this vignette is to build the following table step by step:

#                     Arm A                     Arm B
#              Female        Male        Female        Male
#              (N=96)      (N=105)       (N=92)      (N=107)
# ————————————————————————————————————————————————————————————
# CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean      38.9         40.4         40.3         37.7
#   Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean      36.6         40.2         40.2         40.6
# USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean      40.4         39.7         39.2         40.1
#   Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean      36.9         39.8         38.5         39.0

## Starting Simple

In rtables a basic table is defined to have 0 rows and one column representing all data. Analyzing a variable is one way of adding a row:

lyt <- basic_table() %>%
analyze("age", mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#        all obs
# ——————————————
# mean    39.4

### Layout Instructions

In the code above we first described the table and assigned that description to a variable lyt. We then built the table using the actual data with build_table(). The description of a table is called a table layout. basic_table() is the start of every table layout and describes a single column representing all data. The analyze() instruction adds to the layout that the age variable should be analyzed with the mean() analysis function and that the result should be formatted with 1 decimal place.

Hence, a layout is “pre-data”, that is, it’s a description of how to build a table once we get data. We can look at the layout in isolation:

lyt
# A Pre-data Table Layout
#
# Column-Split Structure:
#  ()
#
# Row-Split Structure:
# age (** analysis **)
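Because a layout is independent of data, one layout can be applied to several datasets. The sketch below illustrates this with a hypothetical toy data frame (toy is an assumption for illustration, not part of the vignette's data); it uses only basic_table(), analyze() and build_table() as introduced above.

```r
library(rtables)

# Hypothetical toy data; any data frame with a numeric `age` column would do.
toy <- data.frame(age = c(30, 40, 50, 60))

# The layout is defined once, before any data is involved.
lyt <- basic_table() %>%
  analyze("age", afun = mean, format = "xx.x")

# The same layout is then applied to two different datasets.
tbl_all   <- build_table(lyt, toy)
tbl_young <- build_table(lyt, toy[toy$age < 45, , drop = FALSE])

tbl_all
tbl_young
```

Both tables share the same structure (one row, one column); only the cell values differ because the underlying data differ.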

The general layouting instructions are summarized below:

• basic_table() is a layout representing a table with zero rows and one column
• Nested splitting
  • in row space: split_rows_by(), split_rows_by_multivar(), split_rows_by_cuts(), split_rows_by_cutfun(), split_rows_by_quartiles()
  • in column space: split_cols_by(), split_cols_by_multivar(), split_cols_by_cuts(), split_cols_by_cutfun(), split_cols_by_quartiles()
• Summarizing Groups: summarize_row_groups()
• Analyzing Variables: analyze(), analyze_colvars()

Using those functions, it is possible to create a wide variety of tables as we will show in this document.

We will now add more structure to the columns by adding a column split based on the factor variable arm:

lyt <- basic_table() %>%
split_cols_by("arm") %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#        Arm A   Arm B
# ————————————————————
# mean   39.5    39.4

The resulting table has one column per factor level of arm. So the data represented by the first column is df[df$arm == "Arm A", ]. Hence, split_cols_by() partitions the data among the columns by default.

Column splitting can be done in a recursive/nested manner by adding sequential split_cols_by() layout instructions. It’s also possible to add a non-nested split. Here we split each arm further by gender:

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#            Arm A           Arm B
#        Female   Male   Female   Male
# ————————————————————————————————————
# mean    38.8    40.1    39.6    39.2

The first column represents the data in df where df$arm == "Arm A" & df$gender == "Female" and the second column the data in df where df$arm == "Arm A" & df$gender == "Male", and so on.

### Adding Row Structure

So far, we have created layouts with analysis and column splitting instructions, i.e. analyze() and split_cols_by(), respectively. This resulted in a table with multiple columns and one data row. We will add more row structure by stratifying the mean analysis by country (i.e. adding a split in the row space):

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#              Arm A           Arm B
#          Female   Male   Female   Male
# ——————————————————————————————————————
# CAN
#   mean    38.2    40.3    40.3    38.9
# USA
#   mean    39.2    39.7    38.9    39.6

In this table the data used to derive the first data cell (average age of female Canadians in Arm A) is where df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female". This cell value can also be calculated manually:

mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female"])
# [1] 38.22447
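The same manual check can be extended to every cell at once. The sketch below uses base R's tapply() on a hypothetical toy data frame (not the vignette's df, so the numbers differ from the table above) to compute one mean per combination of row group and column, and compares one of them with explicit subsetting.

```r
# Toy data, not the vignette's df.
toy <- data.frame(
  arm     = factor(c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B")),
  country = factor(c("CAN", "USA", "CAN", "USA", "CAN", "CAN")),
  age     = c(30, 40, 50, 60, 34, 54)
)

# One mean per (country, arm) combination: rows are countries, columns are arms.
cell_means <- tapply(toy$age, list(toy$country, toy$arm), mean)
cell_means

# A single cell computed by explicit subsetting agrees with the tapply() result.
manual <- mean(toy$age[toy$country == "CAN" & toy$arm == "Arm A"])
manual
```

This mirrors the idea that each table cell corresponds to the intersection of a row subset and a column subset of the data.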

Row structure can also be used to group the table into titled groups of pages during rendering. We do this via ‘page by splits’, which are declared via page_by = TRUE within a call to split_rows_by():

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country", page_by = TRUE) %>%
split_rows_by("handed") %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
cat(export_as_txt(tbl, page_type = "letter",
page_break = "\n\n~~~~~~ Page Break ~~~~~~\n\n"))
#
# country: CAN
#
# ————————————————————————————————————————
#                Arm A           Arm B
#            Female   Male   Female   Male
# ————————————————————————————————————————
# Left
#   mean      38.9    40.4    40.3    37.7
# Right
#   mean      36.6    40.2    40.2    40.6
#
#
# ~~~~~~ Page Break ~~~~~~
#
#
# country: USA
#
# ————————————————————————————————————————
#                Arm A           Arm B
#            Female   Male   Female   Male
# ————————————————————————————————————————
# Left
#   mean      40.4    39.7    39.2    40.1
# Right
#   mean      36.9    39.8    38.5    39.0

We go into more detail on page-by splits and how to control the page-group specific titles in the Title and footer vignette.

Note that if you print or render a table without pagination, the page_by splits are currently rendered as normal row splits. This may change in future releases.

When adding row splits, we get by default label rows for each split level, for example CAN and USA in the table above. Besides the column space subsetting, we have now further subsetted the data for each cell. It is often useful when defining a row split to display information about each row group. In rtables this is referred to as content information. For example, the mean row on row 2 is a descendant of CAN (visible via the indenting, though the table has an underlying tree structure that is not of importance for this vignette), so content information for CAN describes the group that this row belongs to. In order to add content information and turn the CAN label row into a content row, the summarize_row_groups() function is required. By default, the count (nrow()) and the percentage of data relative to the column-associated data are calculated:

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#                   Arm A                     Arm B
#            Female        Male        Female        Male
# ——————————————————————————————————————————————————————————
# CAN      45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   mean      38.2         40.3         40.3         38.9
# USA      51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   mean      39.2         39.7         38.9         39.6

The count and relative percentage in the first content cell (female Canadians in Arm A) are calculated as follows:

df_cell <- subset(df, df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female")
df_col_1 <- subset(df, df$arm == "Arm A" & df$gender == "Female")

c(count = nrow(df_cell), percentage = nrow(df_cell) / nrow(df_col_1))
#      count percentage
#   45.00000    0.46875

Hence, the group percentages per row split sum up to 1 for each column.
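This column-wise normalization can be checked with base R alone. The sketch below uses a hypothetical toy data frame (not the vignette's df): counts per row group are divided by their column totals, and each column of the resulting proportions sums to 1.

```r
# Toy data, not the vignette's df.
toy <- data.frame(
  country = factor(c("CAN", "CAN", "USA", "CAN", "USA", "USA", "USA", "CAN")),
  arm     = factor(c("Arm A", "Arm A", "Arm A", "Arm B", "Arm B", "Arm B", "Arm B", "Arm A"))
)

# Counts per row group (country) within each column (arm).
counts <- table(toy$country, toy$arm)

# Each count divided by its column total, the denominator described above.
pct <- prop.table(counts, margin = 2)

counts
pct
colSums(pct)
```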

We can further split the row space by dividing each country by handedness:

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#                     Arm A                     Arm B
#              Female        Male        Female        Male
# ————————————————————————————————————————————————————————————
# CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left
#     mean      38.9         40.4         40.3         37.7
#   Right
#     mean      36.6         40.2         40.2         40.6
# USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left
#     mean      40.4         39.7         39.2         40.1
#   Right
#     mean      36.9         39.8         38.5         39.0

Next, we further add a count and percentage summary for handedness within each country:

lyt <- basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
#                     Arm A                     Arm B
#              Female        Male        Female        Male
# ————————————————————————————————————————————————————————————
# CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
#   Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
#     mean      38.9         40.4         40.3         37.7
#   Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
#     mean      36.6         40.2         40.2         40.6
# USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
#   Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
#     mean      40.4         39.7         39.2         40.1
#   Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
#     mean      36.9         39.8         38.5         39.0

## Introspecting rtables Table Objects

Once we have created a table, we can inspect its structure using a number of functions.

The table_structure() function prints a summary of a table’s row structure at one of two levels of detail. By default, it summarizes the structure at the subtable level.

table_structure(tbl)
# [TableTree] country
#  [TableTree] CAN [cont: 1 x 4]
#   [TableTree] handed
#    [TableTree] Left [cont: 1 x 4]
#     [ElementaryTable] age (1 x 4)
#    [TableTree] Right [cont: 1 x 4]
#     [ElementaryTable] age (1 x 4)
#  [TableTree] USA [cont: 1 x 4]
#   [TableTree] handed
#    [TableTree] Left [cont: 1 x 4]
#     [ElementaryTable] age (1 x 4)
#    [TableTree] Right [cont: 1 x 4]
#     [ElementaryTable] age (1 x 4)

When the detail argument is set to "row", however, it provides a more detailed row-level summary, which acts as a useful alternative to how we might normally use the str() function to interrogate compound nested lists.

table_structure(tbl, detail = "row")
# TableTree: [country] (country)
#   labelrow: [country] (country) - <not visible>
#   children:
#     TableTree: [CAN] (CAN)
#       labelrow: [CAN] (CAN) - <not visible>
#       content:
#         ElementaryTable: [CAN@content] ()
#           labelrow: [] () - <not visible>
#           children:
#             ContentRow: [CAN] (CAN)
#       children:
#         TableTree: [handed] (handed)
#           labelrow: [handed] (handed) - <not visible>
#           children:
#             TableTree: [Left] (Left)
#               labelrow: [Left] (Left) - <not visible>
#               content:
#                 ElementaryTable: [Left@content] ()
#                   labelrow: [] () - <not visible>
#                   children:
#                     ContentRow: [Left] (Left)
#               children:
#                 ElementaryTable: [age] (age)
#                   labelrow: [age] (age) - <not visible>
#                   children:
#                     DataRow: [mean] (mean)
#             TableTree: [Right] (Right)
#               labelrow: [Right] (Right) - <not visible>
#               content:
#                 ElementaryTable: [Right@content] ()
#                   labelrow: [] () - <not visible>
#                   children:
#                     ContentRow: [Right] (Right)
#               children:
#                 ElementaryTable: [age] (age)
#                   labelrow: [age] (age) - <not visible>
#                   children:
#                     DataRow: [mean] (mean)
#     TableTree: [USA] (USA)
#       labelrow: [USA] (USA) - <not visible>
#       content:
#         ElementaryTable: [USA@content] ()
#           labelrow: [] () - <not visible>
#           children:
#             ContentRow: [USA] (USA)
#       children:
#         TableTree: [handed] (handed)
#           labelrow: [handed] (handed) - <not visible>
#           children:
#             TableTree: [Left] (Left)
#               labelrow: [Left] (Left) - <not visible>
#               content:
#                 ElementaryTable: [Left@content] ()
#                   labelrow: [] () - <not visible>
#                   children:
#                     ContentRow: [Left] (Left)
#               children:
#                 ElementaryTable: [age] (age)
#                   labelrow: [age] (age) - <not visible>
#                   children:
#                     DataRow: [mean] (mean)
#             TableTree: [Right] (Right)
#               labelrow: [Right] (Right) - <not visible>
#               content:
#                 ElementaryTable: [Right@content] ()
#                   labelrow: [] () - <not visible>
#                   children:
#                     ContentRow: [Right] (Right)
#               children:
#                 ElementaryTable: [age] (age)
#                   labelrow: [age] (age) - <not visible>
#                   children:
#                     DataRow: [mean] (mean)

The make_row_df() and make_col_df() functions each create a data.frame which holds a variety of information about the table’s structure. Most useful for introspection purposes are the label, name, abs_rownumber, path and node_class columns (the remaining information in the returned data.frame is used for pagination):

make_row_df(tbl)[,c("label", "name", "abs_rownumber", "path", "node_class")]
#    label  name abs_rownumber         path node_class
# 1    CAN   CAN             1 country,.... ContentRow
# 2   Left  Left             2 country,.... ContentRow
# 3   mean  mean             3 country,....    DataRow
# 4  Right Right             4 country,.... ContentRow
# 5   mean  mean             5 country,....    DataRow
# 6    USA   USA             6 country,.... ContentRow
# 7   Left  Left             7 country,.... ContentRow
# 8   mean  mean             8 country,....    DataRow
# 9  Right Right             9 country,.... ContentRow
# 10  mean  mean            10 country,....    DataRow
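Since the result is an ordinary data.frame, it can be filtered like any other. The sketch below builds a small table from a hypothetical toy data frame (smaller than the vignette's table) and keeps only the rows produced by analyze(), using the node_class column shown above.

```r
library(rtables)

# Hypothetical toy data and layout, smaller than the vignette's table.
toy <- data.frame(
  country = factor(c("CAN", "CAN", "USA", "USA")),
  age     = c(30, 40, 50, 60)
)

lyt <- basic_table() %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")

tbl_toy <- build_table(lyt, toy)

# Keep only the rows produced by analyze(), dropping the content rows.
row_df    <- make_row_df(tbl_toy)
data_rows <- row_df[row_df$node_class == "DataRow",
                    c("label", "abs_rownumber", "node_class")]
data_rows
```

The abs_rownumber column then gives the positions of these rows in the printed table, which can be handy when subsetting a table by row index.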

By default make_row_df() summarizes only visible rows, but setting visible_only to FALSE gives us a structural summary of the table that covers the full hierarchy of subtables, including those that aren’t represented directly by any visible rows:

make_row_df(tbl, visible_only = FALSE)[,c("label", "name", "abs_rownumber", "path", "node_class")]
#    label          name abs_rownumber         path      node_class
# 1              country            NA      country       TableTree
# 2                  CAN            NA country, CAN       TableTree
# 3          CAN@content            NA country,.... ElementaryTable
# 4    CAN           CAN             1 country,....      ContentRow
# 5               handed            NA country,....       TableTree
# 6                 Left            NA country,....       TableTree
# 7         Left@content            NA country,.... ElementaryTable
# 8   Left          Left             2 country,....      ContentRow
# 9                  age            NA country,.... ElementaryTable
# 10  mean          mean             3 country,....         DataRow
# 11               Right            NA country,....       TableTree
# 12       Right@content            NA country,.... ElementaryTable
# 13 Right         Right             4 country,....      ContentRow
# 14                 age            NA country,.... ElementaryTable
# 15  mean          mean             5 country,....         DataRow
# 16                 USA            NA country, USA       TableTree
# 17         USA@content            NA country,.... ElementaryTable
# 18   USA           USA             6 country,....      ContentRow
# 19              handed            NA country,....       TableTree
# 20                Left            NA country,....       TableTree
# 21        Left@content            NA country,.... ElementaryTable
# 22  Left          Left             7 country,....      ContentRow
# 23                 age            NA country,.... ElementaryTable
# 24  mean          mean             8 country,....         DataRow
# 25               Right            NA country,....       TableTree
# 26       Right@content            NA country,.... ElementaryTable
# 27 Right         Right             9 country,....      ContentRow
# 28                 age            NA country,.... ElementaryTable
# 29  mean          mean            10 country,....         DataRow

make_col_df() similarly accepts visible_only, though here the meaning is slightly different, indicating whether only leaf columns should be summarized (TRUE, the default) or whether higher level groups of columns, analogous to subtables in row space, should be summarized as well.

make_col_df(tbl)
#     name  label abs_pos         path pos_in_siblings n_siblings leaf_indices
# 1 Female Female       1 arm, Arm....               1          2            1
# 2   Male   Male       2 arm, Arm....               2          2            2
# 3 Female Female       3 arm, Arm....               1          2            3
# 4   Male   Male       4 arm, Arm....               2          2            4
#   total_span col_fnotes n_col_fnotes
# 1          1                       0
# 2          1                       0
# 3          1                       0
# 4          1                       0
make_col_df(tbl, visible_only = FALSE)
#     name  label abs_pos         path pos_in_siblings n_siblings leaf_indices
# 1  Arm A  Arm A      NA   arm, Arm A               1          2         1, 2
# 2 Female Female       1 arm, Arm....               1          2            1
# 3   Male   Male       2 arm, Arm....               2          2            2
# 4  Arm B  Arm B      NA   arm, Arm B               2          2         3, 4
# 5 Female Female       3 arm, Arm....               1          2            3
# 6   Male   Male       4 arm, Arm....               2          2            4
#   total_span col_fnotes n_col_fnotes
# 1          2                       0
# 2          1                       0
# 3          1                       0
# 4          2                       0
# 5          1                       0
# 6          1                       0

The row_paths_summary() and col_paths_summary() functions wrap the respective make_*_df functions, printing the name, node_class and path information (in the row case), or the label and path information (in the column case), indented to illustrate table structure:

row_paths_summary(tbl)
# rowname     node_class    path
# ——————————————————————————————————————————————————————————————————————
# CAN         ContentRow    country, CAN, @content, CAN
#   Left      ContentRow    country, CAN, handed, Left, @content, Left
#     mean    DataRow       country, CAN, handed, Left, age, mean
#   Right     ContentRow    country, CAN, handed, Right, @content, Right
#     mean    DataRow       country, CAN, handed, Right, age, mean
# USA         ContentRow    country, USA, @content, USA
#   Left      ContentRow    country, USA, handed, Left, @content, Left
#     mean    DataRow       country, USA, handed, Left, age, mean
#   Right     ContentRow    country, USA, handed, Right, @content, Right
#     mean    DataRow       country, USA, handed, Right, age, mean
col_paths_summary(tbl)
# label       path
# ——————————————————————————————————————
# Arm A       arm, Arm A
#   Female    arm, Arm A, gender, Female
#   Male      arm, Arm A, gender, Male
# Arm B       arm, Arm B
#   Female    arm, Arm B, gender, Female
#   Male      arm, Arm B, gender, Male
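The paths reported by these summaries can be used to address individual cells. The following is a minimal sketch on a hypothetical toy table, assuming that value_at() is available in the installed rtables version; the row and column paths are written in the same form that row_paths_summary() and col_paths_summary() print.

```r
library(rtables)

# Hypothetical toy data, not the vignette's df.
toy <- data.frame(
  arm     = factor(c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B")),
  country = factor(c("CAN", "USA", "CAN", "USA", "CAN", "CAN")),
  age     = c(30, 40, 50, 60, 34, 54)
)

lyt <- basic_table() %>%
  split_cols_by("arm") %>%
  split_rows_by("country") %>%
  analyze("age", afun = mean, format = "xx.x")

tbl_toy <- build_table(lyt, toy)

# Paths in the form reported by row_paths_summary() and col_paths_summary().
row_path <- c("country", "CAN", "age", "mean")
col_path <- c("arm", "Arm A")

# value_at() is assumed to return the unformatted value of the addressed cell.
cell <- value_at(tbl_toy, rowpath = row_path, colpath = col_path)
cell

# The same value computed directly from the data.
mean(toy$age[toy$country == "CAN" & toy$arm == "Arm A"])
```

Because rendering is separate from the data model, the value retrieved this way is the non-rounded number rather than the formatted string shown in the printed table.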

## Summary

In this vignette you have learned:

• every cell has an associated subset of data
• this means that much of tabulation has to do with splitting/subsetting data
• tables can be described pre-data using layouts
• tables are a form of visualization of data

The other vignettes in the rtables package provide more detailed information about the package. We recommend that you continue with the tabulation_dplyr vignette, which uses dplyr to derive the same information as the table built in this vignette and compares the two approaches.