plot_risk() creates horizontal bar charts from risk
estimates produced by estimate_risk() /
est_risk() (the vignette will hereafter use
est_risk()). It can also plot manually constructed data,
but the manual input still needs to match the output format of
est_risk().
This vignette focuses on four things, one of which is what plot_risk() expects for risk_dat.

The examples deliberately start by showing the default behavior when
risk_dat is a data frame. After that, most examples in the
vignette use add_to_dat = FALSE so that the
vignette renders the plot output directly.
Additionally, the vignette makes heavy use of the
argument progress = FALSE in calls to
plot_risk(), which suppresses the progress bar. The
progress bar does not print well in a knitted document, and suppressing
it does not affect the data requirements, return structure, or plot
appearance. In ordinary use, progress defaults to
TRUE and, as the name implies, gives a visual indication
of progress; this can be especially helpful when risk_dat
is a large data frame.
As such, the vignette often uses a minor variant of
plot_risk(), named plot_risk_no_add_no_prog(), that defaults to
add_to_dat = FALSE and progress = FALSE to
make the examples more concise and visually clear.
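The definition of that variant is not shown in this section. A minimal sketch of what such a wrapper might look like follows, assuming it simply forwards its arguments to plot_risk(). The name plot_risk_no_add_no_prog() is the one used later in this vignette; the body is an assumption for illustration, not the package's own code.

```r
# Hypothetical wrapper (an assumption, not preventr's own definition):
# same interface as `plot_risk()`, but `add_to_dat` and `progress`
# default to `FALSE` instead of `TRUE`.
plot_risk_no_add_no_prog <- function(risk_dat,
                                     ...,
                                     add_to_dat = FALSE,
                                     progress = FALSE) {
  preventr::plot_risk(
    risk_dat,
    ...,
    add_to_dat = add_to_dat,
    progress = progress
  )
}
```

Because both defaults remain ordinary arguments, a call can still override them when an example needs the standard behavior.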
What plot_risk() expects

For its argument risk_dat, the function
plot_risk() accepts either a data frame or a list of data
frames. In either case, the input needs to match the risk-estimate
output schema used by est_risk(). In practical terms, this
means the following:
- risk_dat (whether passed directly or as a list of data frames) must contain model, over_years, and at least one risk-estimate column among total_cvd, ascvd, heart_failure, chd, and stroke.
- If a data frame contains multiple people or instances, preventr_id is required.
- A list of data frames passed as risk_dat is for a single person, because est_risk() only outputs a list of data frames when estimating risk for a single person (when estimating over both 10- and 30-year time horizons with collapse = FALSE). In addition to the aforementioned required columns, the structure of the list must match the output of est_risk(): the list elements must be named "risk_est_10yr" and "risk_est_30yr", the 10-year estimates may have at most 3 rows, the 30-year estimates may have at most 1 row, and the column preventr_id must not be present.
- input_problems is optional, but if it contains the specific 30-year age warning used by est_risk(), that warning is displayed as a subtitle.

The safest way to obtain valid input is to start from
est_risk().
risk_10_year <- est_risk(
age = 55,
sex = "female",
sbp = 140,
bp_tx = TRUE,
total_c = 210,
hdl_c = 50,
statin = FALSE,
dm = TRUE,
smoking = FALSE,
egfr = 90,
bmi = 31,
time = "10yr"
)
#> PREVENT estimates are from: Base model.
risk_30_year <- est_risk(
age = 55,
sex = "female",
sbp = 140,
bp_tx = TRUE,
total_c = 210,
hdl_c = 50,
statin = FALSE,
dm = TRUE,
smoking = FALSE,
egfr = 90,
bmi = 31,
time = "30yr"
)
#> PREVENT estimates are from: Base model.
risk_both <- rbind(risk_10_year, risk_30_year)
# Identical to a call to `est_risk()` with the arguments used for either
# `risk_10_year` or `risk_30_year`, other than setting `time = "both"` and
# `collapse = TRUE`.
fake_dat <- data.frame(
age = c(45L, 55L),
sex = c("female", "male"),
sbp = c(140, 144),
bp_tx = c(TRUE, FALSE),
total_c = c(210, 240),
hdl_c = c(50, 40),
statin = c(FALSE, TRUE),
dm = c(TRUE, FALSE),
smoking = c(FALSE, TRUE),
egfr = c(90, 60),
bmi = c(31, 28)
)
risk_multi <- est_risk(use_dat = fake_dat, progress = FALSE)
# Setting `progress = FALSE` here to avoid showing the progress bar in the
# vignette, as it does not print well in a knitted document.
fake_dat_warning <- fake_dat
fake_dat_warning$age[[2]] <- 65
risk_warning <- est_risk(use_dat = fake_dat_warning, time = 30, progress = FALSE)
manual_single <- data.frame(
total_cvd = 0.152,
ascvd = 0.101,
heart_failure = 0.051,
chd = 0.062,
stroke = 0.039,
model = "base",
over_years = 10,
input_problems = NA_character_
)
manual_multi <- data.frame(
preventr_id = c(1L, 2L),
total_cvd = c(0.152, 0.280),
ascvd = c(0.101, 0.210),
heart_failure = c(0.051, 0.070),
chd = c(0.062, 0.135),
stroke = c(0.039, 0.075),
model = c("base", "base"),
over_years = c(10L, 10L),
input_problems = c(NA_character_, NA_character_)
)
manual_multi_with_pce <- data.frame(
preventr_id = c(1L, rep(2L, 3)),
total_cvd = c(0.152, 0.175, NA_real_, 0.280),
ascvd = c(0.101, 0.105, 0.2, 0.210),
heart_failure = c(0.051, 0.07, NA_real_, 0.070),
chd = c(0.062, 0.075, NA_real_, 0.135),
stroke = c(0.039, 0.03, NA_real_, 0.075),
model = c("base", "sdi", "pce_orig", "sdi"),
over_years = c(rep(10L, 3), 30L),
input_problems = rep(NA_character_, 4)
)
manual_list <- list(
risk_est_10yr = data.frame(
total_cvd = 0.152,
ascvd = 0.101,
heart_failure = 0.051,
chd = 0.062,
stroke = 0.039,
model = "base",
over_years = 10L,
input_problems = NA_character_
),
risk_est_30yr = data.frame(
total_cvd = 0.430,
ascvd = 0.280,
heart_failure = 0.150,
chd = 0.160,
stroke = 0.120,
model = "base",
over_years = 30L,
input_problems = NA_character_
)
)

When risk_dat is a data frame,
add_to_dat = TRUE by default, so the plot is added back
onto the data frame as the list-column plot. This is a
convenient way to keep the plot objects attached to the data frame while
still being able to render them when needed.
# Note this first example uses the real `plot_risk()` with the default behavior of
# `add_to_dat = TRUE` to show the data frame with the plot attached as a list-column.
# It still uses `progress = FALSE` to avoid showing the progress bar in the vignette,
# as it does not print well in a knitted document.
default_plot_df <- plot_risk(risk_multi, progress = FALSE)
names(default_plot_df)
#> [1] "preventr_id" "age" "sex" "sbp"
#> [5] "bp_tx" "total_c" "hdl_c" "statin"
#> [9] "dm" "smoking" "egfr" "bmi"
#> [13] "total_cvd" "ascvd" "heart_failure" "chd"
#> [17] "stroke" "model" "over_years" "input_problems"
#> [21] "plot"
str(default_plot_df, max.level = 1)
#> 'data.frame': 4 obs. of 21 variables:
#> $ preventr_id : int 1 1 2 2
#> $ age : int 45 45 55 55
#> $ sex : chr "female" "female" "male" "male"
#> $ sbp : num 140 140 144 144
#> $ bp_tx : logi TRUE TRUE FALSE FALSE
#> $ total_c : num 210 210 240 240
#> $ hdl_c : num 50 50 40 40
#> $ statin : logi FALSE FALSE TRUE TRUE
#> $ dm : logi TRUE TRUE FALSE FALSE
#> $ smoking : logi FALSE FALSE TRUE TRUE
#> $ egfr : num 90 90 60 60
#> $ bmi : num 31 31 28 28
#> $ total_cvd : num 0.085 0.4 0.119 0.455
#> $ ascvd : num 0.052 0.246 0.087 0.34
#> $ heart_failure : num 0.038 0.252 0.04 0.214
#> $ chd : num 0.025 0.133 0.056 0.243
#> $ stroke : num 0.029 0.143 0.035 0.157
#> $ model : chr "base" "base" "base" "base"
#> $ over_years : int 10 30 10 30
#> $ input_problems: chr NA NA NA NA
#> $ plot :List of 4
all(vapply(default_plot_df$plot, ggplot2::is_ggplot, logical(1)))
#> [1] TRUE

To render a plot stored in that list-column, extract it explicitly.
When the column plot has more than one plot object,
calling the column directly renders all the plots in a list.
add_to_dat and collapse

The return format of plot_risk() depends on three things:

- whether risk_dat is a data frame or a list of data frames,
- whether add_to_dat is TRUE or FALSE, and
- whether collapse is TRUE or FALSE.

This table summarizes the return format based on these factors:
| Structure of risk_dat | Value of add_to_dat | Value of collapse | Output format |
|---|---|---|---|
| data frame | TRUE | not applicable | data frame with plot list-column |
| data frame | FALSE | not applicable | ggplot object or list of ggplot objects |
| list of data frames | TRUE | TRUE | single, collapsed data frame with plot list-column |
| list of data frames | TRUE | FALSE | list of data frames, each with plot list-column |
| list of data frames | FALSE | not applicable | list of ggplot objects |
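The table can also be read as a small decision function. The sketch below only restates the table in base R; describe_return_format() is a hypothetical helper that does not call plot_risk() and is not part of preventr.

```r
# Hypothetical helper that restates the table above as code.
# It never calls `plot_risk()`; it only returns the described format.
describe_return_format <- function(risk_dat_is_list, add_to_dat, collapse) {
  # Mirror the strictness of the real arguments: only TRUE/FALSE allowed.
  stopifnot(
    isTRUE(risk_dat_is_list) || isFALSE(risk_dat_is_list),
    isTRUE(add_to_dat) || isFALSE(add_to_dat),
    isTRUE(collapse) || isFALSE(collapse)
  )
  if (!risk_dat_is_list) {
    # `collapse` is not applicable for data frame input.
    if (add_to_dat) {
      "data frame with plot list-column"
    } else {
      "ggplot object or list of ggplot objects"
    }
  } else if (!add_to_dat) {
    # `collapse` is not applicable when plots are returned directly.
    "list of ggplot objects"
  } else if (collapse) {
    "single, collapsed data frame with plot list-column"
  } else {
    "list of data frames, each with plot list-column"
  }
}

describe_return_format(risk_dat_is_list = TRUE, add_to_dat = TRUE, collapse = TRUE)
```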
Two details are worth emphasizing:
- collapse is only relevant when risk_dat is a list of data frames and add_to_dat = TRUE.
- If you want the plot objects directly, add_to_dat = FALSE accomplishes that; otherwise, you can extract the plot objects from the data frame that is returned when add_to_dat = TRUE.

If you want plot_risk() to return the plot object itself
rather than appending it to the input data, set
add_to_dat = FALSE.
For a single plotting unit, this yields a single ggplot
object.
# Again, this example uses the real `plot_risk()` with `add_to_dat = FALSE`
# to show the plot object directly. It still uses `progress = FALSE` to
# avoid showing the progress bar in the vignette, as it does not print well
# in a knitted document.
p_direct <- plot_risk(risk_10_year, add_to_dat = FALSE, progress = FALSE)
class(p_direct)
#> [1] "ggplot2::ggplot" "ggplot" "ggplot2::gg" "S7_object"
#> [5] "gg"
p_direct

After this point, most examples in the vignette are intended to show
plot output directly, and all examples use progress = FALSE
to suppress the progress bar. The vignette therefore makes heavy use
of the plot_risk_no_add_no_prog() variant described earlier, which
avoids having to specify add_to_dat = FALSE and
progress = FALSE repeatedly and keeps the examples
concise and clear.
You do not need to start from est_risk(), but your input
must still obey the minimum required structure.
An important detail to recall is that model and
over_years are part of the minimum schema. A data frame
containing only risk columns is not sufficient. The manually created
data frame manual_single meets these criteria.
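The minimum schema can be expressed as a short base R check. This is only a sketch: has_minimum_schema() is a hypothetical helper written for illustration, and the checks plot_risk() actually performs may be stricter.

```r
# Risk-estimate columns named in this vignette.
risk_cols <- c("total_cvd", "ascvd", "heart_failure", "chd", "stroke")

# Hypothetical helper: TRUE when `dat` has `model`, `over_years`, and at
# least one risk-estimate column. This is not preventr's own validation.
has_minimum_schema <- function(dat) {
  is.data.frame(dat) &&
    all(c("model", "over_years") %in% names(dat)) &&
    any(risk_cols %in% names(dat))
}

manual_single <- data.frame(
  total_cvd = 0.152,
  ascvd = 0.101,
  heart_failure = 0.051,
  chd = 0.062,
  stroke = 0.039,
  model = "base",
  over_years = 10,
  input_problems = NA_character_
)

has_minimum_schema(manual_single)              # TRUE
has_minimum_schema(manual_single[, risk_cols]) # FALSE: risk columns only
```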
By default, outcomes = "all" expands to:

- total_cvd
- ascvd
- heart_failure
- chd
- stroke

You can supply a character vector to change outcome inclusion, outcome order, or both.
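As an illustration, the sketch below requests three outcomes in a non-default order. It assumes preventr is installed and reuses the est_risk() arguments from earlier in this vignette; the particular subset and order are arbitrary choices made for the example.

```r
# Outcome names come from the list above; this subset and order are
# arbitrary and chosen only for illustration.
outcomes_wanted <- c("heart_failure", "ascvd", "total_cvd")

# Guarded so the snippet does nothing if preventr is not installed.
if (requireNamespace("preventr", quietly = TRUE)) {
  risk_10_year <- preventr::est_risk(
    age = 55, sex = "female", sbp = 140, bp_tx = TRUE,
    total_c = 210, hdl_c = 50, statin = FALSE, dm = TRUE,
    smoking = FALSE, egfr = 90, bmi = 31, time = "10yr"
  )
  p_outcomes <- preventr::plot_risk(
    risk_10_year,
    outcomes = outcomes_wanted,
    add_to_dat = FALSE,
    progress = FALSE
  )
  print(p_outcomes)
}
```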
The annotation argument accepts:

- "all" (the default)
- "none"
- "title", "subtitle", and "caption"

Note that “annotation” here refers only to the title, subtitle, and
caption. Other text elements, such as the outcome labels and risk
percentages, are not controlled by the annotation argument.
Likewise, annotation does not affect elements associated
with the legend (when the legend applies); these elements are controlled
by the legend, lines, and
line_text arguments, which are discussed in the section on legend and
threshold line controls.
If input_problems contains the specific warning string
used by est_risk() for 30-year estimation in people older
than 59 years, plot_risk() uses that text as a
subtitle.
# Reminder of ages and time horizons for the `risk_warning` data frame,
# remembering that the 30-year age warning applies to people older than
# 59 years when estimating over a 30-year time horizon.
risk_warning[, c("age", "over_years")]
#> age over_years
#> 1 45 30
#> 2 65 30
# We thus expect a warning subtitle for the second row of `risk_warning`
# but not the first row.
plot_risk_no_add_no_prog(risk_warning)
#> [[1]]
#>
#> [[2]]

plot_risk() supports two color schemes:

- "single"
- "categories"

For color_scheme = "single", color_dat
should be a single color value.
You can also specify the color using a named color or call to rgb(), as
long as the result is a single color value.
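To make "a single color value" concrete, the sketch below checks a few candidates with base R's grDevices. The helper is_single_color() is hypothetical and written only for this illustration; whether plot_risk() validates colors in this way is an assumption.

```r
# Hypothetical helper: TRUE when `x` is exactly one non-missing value
# that base R can interpret as a color.
is_single_color <- function(x) {
  if (length(x) != 1L || is.na(x)) {
    return(FALSE)
  }
  # `col2rgb()` raises an error for values it cannot interpret.
  tryCatch(
    {
      grDevices::col2rgb(x)
      TRUE
    },
    error = function(e) FALSE
  )
}

is_single_color("steelblue")                              # named color
is_single_color("#1db8b8")                                # hex code
is_single_color(rgb(25, 25, 112, maxColorValue = 255))    # call to rgb()
is_single_color(c("red", "blue"))                         # two values
```

The first three calls return TRUE because each resolves to one color; the last returns FALSE because it supplies two.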
For color_scheme = "categories", color_dat
should be a data frame with columns threshold and
color.
The rules include the following:

- The final risk group, meaning values at or above the highest valid threshold, uses color_for_last_group.

color_dat <- data.frame(
threshold = c(0.20, 0.30, 0.40),
color = c("#1db8b8", "#d70b9a", "#799dfa")
)
plot_risk_no_add_no_prog(
risk_30_year,
color_scheme = "categories",
color_dat = color_dat,
color_for_last_group = rgb(25, 25, 112, maxColorValue = 255)
)

plot_risk() cleans category-threshold input by removing
invalid or duplicate thresholds and sorting the remaining
threshold-color pairs.
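The cleaning step can be pictured with a base R sketch. Everything here is an assumption made for illustration: clean_color_dat() and risk_color() are hypothetical helpers, "invalid" is taken to mean a threshold that is missing or outside the open interval (0, 1), the first occurrence of a duplicated threshold is kept, and each threshold is treated as the upper bound of its group.

```r
# Hypothetical re-implementation of the cleaning described above; it is
# not preventr's internal code.
clean_color_dat <- function(color_dat) {
  keep <- !is.na(color_dat$threshold) &
    color_dat$threshold > 0 &
    color_dat$threshold < 1 &
    !duplicated(color_dat$threshold)
  cleaned <- color_dat[keep, , drop = FALSE]
  # Sort the surviving threshold-color pairs by threshold.
  cleaned <- cleaned[order(cleaned$threshold), , drop = FALSE]
  rownames(cleaned) <- NULL
  cleaned
}

# Hypothetical lookup: risks below the first threshold get the first
# color, and risks at or above the highest threshold get
# `color_for_last_group`.
risk_color <- function(risk, cleaned, color_for_last_group) {
  colors <- c(cleaned$color, color_for_last_group)
  colors[findInterval(risk, cleaned$threshold) + 1L]
}

messy <- data.frame(
  threshold = c(0.375, 0.175, NA, 0.275, 0.175),
  color = c("purple", "#1c1c69", "grey50", "brown4", "orange"),
  stringsAsFactors = FALSE
)
cleaned <- clean_color_dat(messy)
cleaned
risk_color(c(0.10, 0.20, 0.40), cleaned, color_for_last_group = "black")
```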
# Note: The "messy" aspect here pertains to the thresholds being
# out of order. The colors are fine, because any valid color value
# is accepted, including a mixture of named colors, hex codes, and
# calls to `rgb()`.
color_dat_messy <- data.frame(
threshold = c(0.375, 0.175, 0.275),
color = c(rgb(0.5, 0.3, 0.9), "#1c1c69", "brown4")
)
plot_risk_no_add_no_prog(
risk_30_year,
color_scheme = "categories",
color_dat = color_dat_messy
)

The arguments legend, lines, and
line_text are only used when
color_scheme = "categories".
plot_risk_no_add_no_prog(
risk_30_year,
color_scheme = "categories",
color_dat = color_dat,
legend = FALSE
)

You can adjust the overall text size with base_size.
If one data frame contains more than one value of
over_years, plot_risk() splits the data internally by
time horizon before plotting.
With add_to_dat = FALSE, this yields plot objects
directly. With add_to_dat = TRUE, this simply means the
plot objects in the plot list-column correctly correspond
to the given row (i.e., the row for the 10-year time horizon contains
the plot for the 10-year time horizon, and the row for the 30-year time
horizon contains the plot for the 30-year time horizon).
If one data frame contains multiple people or instances,
preventr_id is required so plot_risk() can
split the data correctly.
This works in concert with multiple time horizons in one data frame,
as shown in the manual_multi_with_pce example. This data
frame contains risk estimates for two people. The first person has a
single row reflecting the 10-year time horizon from the base model of
the PREVENT equations. The second person has three rows: one row is the
10-year time horizon from the base model of the PREVENT equations adding
social deprivation index (SDI), one row is the 10-year time horizon from
the original PCEs, and one row is the 30-year time horizon from the base
model of the PREVENT equations adding SDI.
| preventr_id | total_cvd | ascvd | heart_failure | chd | stroke | model | over_years | input_problems |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.152 | 0.101 | 0.051 | 0.062 | 0.039 | base | 10 | NA |
| 2 | 0.175 | 0.105 | 0.070 | 0.075 | 0.030 | sdi | 10 | NA |
| 2 | NA | 0.200 | NA | NA | NA | pce_orig | 10 | NA |
| 2 | 0.280 | 0.210 | 0.070 | 0.135 | 0.075 | sdi | 30 | NA |
Because plotting is separated by individual and time horizon, one would expect three unique plots: one for the first person and two for the second person (one for the 10-year time horizon and one for the 30-year time horizon). However, to maintain tidy data, the 10-year plot for the second person is repeated across the two rows for that person's 10-year time horizon.
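The count of plotting units can be checked in base R by grouping on preventr_id and over_years. This is a sketch of the idea only, not the package's internal splitting code; the data frame keys is a hypothetical stand-in that carries just the identifying columns of manual_multi_with_pce.

```r
# Identifying columns only, copied from `manual_multi_with_pce`.
keys <- data.frame(
  preventr_id = c(1L, 2L, 2L, 2L),
  over_years = c(10L, 10L, 10L, 30L),
  model = c("base", "sdi", "pce_orig", "sdi")
)

# One group per person and time horizon; `drop = TRUE` removes
# combinations that do not occur (here, person 1 at 30 years).
units <- split(keys, list(keys$preventr_id, keys$over_years), drop = TRUE)

length(units)               # number of unique plots to draw
vapply(units, nrow, integer(1)) # rows that will share each plot
```

The group for the second person at 10 years holds two rows, which is why the same plot appears twice in the plot list-column.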
plots_by_person_and_horizon <- plot_risk(
manual_multi_with_pce,
progress = FALSE
)
# Should be `TRUE` because the 10-year plot for the second person is
# repeated across their two rows for the 10-year time horizon.
identical(
plots_by_person_and_horizon$plot[[2]],
plots_by_person_and_horizon$plot[[3]]
)
#> [1] TRUE
# Expect plots 2 and 3 to be identical; expect the others to differ
plots_by_person_and_horizon$plot
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
#>
#> [[4]]
A list of data frames is also valid input, as long as it adheres to
the output schema of est_risk().
When risk_dat is a list of data frames,
add_to_dat = TRUE, and collapse = FALSE, the
output remains a list.
list_with_plots <- plot_risk_no_add_no_prog(manual_list)
length(list_with_plots)
#> [1] 2
list_with_plots
#> $risk_est_10yr
#>
#> $risk_est_30yr
When risk_dat is a list of data frames,
add_to_dat = TRUE, and collapse = TRUE, the
output is collapsed into one data frame. Remember,
add_to_dat is TRUE by default, so the main
thing to note here is that collapse matters for list input
when add_to_dat = TRUE. Given the intent of this example,
note the use of plot_risk() and not
plot_risk_no_add_no_prog(), because the former defaults to
add_to_dat = TRUE while the latter defaults to
add_to_dat = FALSE.
collapsed_list_with_plots <- plot_risk(
manual_list,
collapse = TRUE,
progress = FALSE
)
collapsed_list_with_plots[, c("model", "over_years")]
#> model over_years
#> 1 base 10
#> 2 base 30

When add_to_dat = FALSE, collapse is
functionally irrelevant for the return format and the returned value is
a list of plot objects. This example will again use
plot_risk() instead of
plot_risk_no_add_no_prog() given its intent.
direct_list_plots <- plot_risk(
manual_list,
add_to_dat = FALSE,
progress = FALSE
)
length(direct_list_plots)
#> [1] 2

When risk_dat is a list of data frames, the structure of
the list and the data frames within it must match the output schema of
est_risk(). The following examples show some ways that
malformed list input is not accepted. These examples will again use
plot_risk() instead of
plot_risk_no_add_no_prog() given their intent.
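Before those examples, the list requirements can be summarized as a base R sketch. The helper is_valid_risk_list() is hypothetical and follows the wording of the error message shown in the examples; the package's real validation may differ in detail (for example, whether the order of the list names matters is an assumption here).

```r
# Hypothetical validator mirroring the documented list requirements; it
# is not preventr's own code.
is_valid_risk_list <- function(x) {
  is.list(x) && !is.data.frame(x) &&
    length(x) == 2L &&
    setequal(names(x), c("risk_est_10yr", "risk_est_30yr")) &&
    all(vapply(x, is.data.frame, logical(1))) &&
    nrow(x$risk_est_10yr) >= 1L && nrow(x$risk_est_10yr) <= 3L &&
    nrow(x$risk_est_30yr) == 1L &&
    !any(vapply(x, function(d) "preventr_id" %in% names(d), logical(1)))
}

good_list <- list(
  risk_est_10yr = data.frame(total_cvd = 0.152, model = "base", over_years = 10L),
  risk_est_30yr = data.frame(total_cvd = 0.430, model = "base", over_years = 30L)
)

bad_names <- good_list
names(bad_names) <- c("ten_year", "thirty_year")

bad_id <- good_list
bad_id$risk_est_10yr$preventr_id <- 1L

is_valid_risk_list(good_list) # TRUE
is_valid_risk_list(bad_names) # FALSE: wrong element names
is_valid_risk_list(bad_id)    # FALSE: `preventr_id` present
```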
# When `risk_dat` is a list of data frames, the names of the list
# elements must be "risk_est_10yr" and "risk_est_30yr". This input
# violates that requirement.
malformed_list_names <- manual_list
names(malformed_list_names) <- c("ten_year", "thirty_year")
plot_risk(malformed_list_names)
#> Error:
#> ! If `risk_dat` is a list of data frames, it must be consistent with the output of `estimate_risk()`/`est_risk()` when estimating the risk for a single person and `collapse = FALSE`. This means the list must have two data frames named `risk_est_10yr` and `risk_est_30yr`, where `risk_est_10yr` has between 1 and 3 rows (inclusive) and `risk_est_30yr` has exactly 1 row, and neither data frame has a `preventr_id` column.

# When `risk_dat` is a list of data frames, there must be no more than 3
# rows for the 10-year estimates and no more than 1 row for the 30-year
# estimates. This input violates that requirement.
malformed_list_more_than_one_person <- manual_list
malformed_list_more_than_one_person$risk_est_10yr <- rbind(
malformed_list_more_than_one_person$risk_est_10yr,
manual_multi |> dplyr::select(-preventr_id),
manual_multi |> dplyr::select(-preventr_id)
)
plot_risk(malformed_list_more_than_one_person)
#> Error:
#> ! If `risk_dat` is a list of data frames, it must be consistent with the output of `estimate_risk()`/`est_risk()` when estimating the risk for a single person and `collapse = FALSE`. This means the list must have two data frames named `risk_est_10yr` and `risk_est_30yr`, where `risk_est_10yr` has between 1 and 3 rows (inclusive) and `risk_est_30yr` has exactly 1 row, and neither data frame has a `preventr_id` column.

# When `risk_dat` is a list of data frames, the column `preventr_id` must
# not be present. This input violates that requirement.
malformed_list_preventr_id_present <- manual_list
malformed_list_preventr_id_present$risk_est_10yr$preventr_id <- 1L
malformed_list_preventr_id_present$risk_est_30yr$preventr_id <- 1L
plot_risk(malformed_list_preventr_id_present)
#> Error:
#> ! If `risk_dat` is a list of data frames, it must be consistent with the output of `estimate_risk()`/`est_risk()` when estimating the risk for a single person and `collapse = FALSE`. This means the list must have two data frames named `risk_est_10yr` and `risk_est_30yr`, where `risk_est_10yr` has between 1 and 3 rows (inclusive) and `risk_est_30yr` has exactly 1 row, and neither data frame has a `preventr_id` column.

Several behavior arguments are intentionally strict logicals. For
these arguments, values such as 1 and 0 are
not treated as acceptable stand-ins for TRUE and
FALSE. These arguments include:
- add_to_dat
- collapse
- progress
- legend
- lines
- line_text

When ggplot2 4.0.0 was first released, one of the big
changes was a rewrite “under the hood” to move from S3 to S7 (for
additional detail, see https://tidyverse.org/blog/2025/09/ggplot2-4-0-0/). This
originally caused problems with various methods of viewing data frames,
depending on the IDE (for additional detail, see https://github.com/tidyverse/ggplot2/issues/6732). The
good news is that the underlying data were never negatively impacted, but as
you can imagine, not being able to reliably view data frames with plots
as a list-column is not ideal. As such, preventr tries to
warn if it detects that this might be an issue with your setup, but this is
tricky to do because, among other things, the different view
functions are inherently interactive. preventr therefore
does not attempt to cover every single use case, especially considering
this issue should now be fixed if you are using the latest versions of
ggplot2, your IDE, and R. If you find an exception and
confirm it is due to the aforementioned issue, feel free to let me know,
but more importantly, let the good folks behind ggplot2
know.
progress

The progress argument controls whether a progress bar is
displayed during execution. In ordinary interactive use, this is mostly
relevant when risk_dat is a data frame and there are
multiple plotting units to iterate over.
This vignette does not focus on the progress bar visually, because it does not change the data requirements, return structure, or plot appearance.
plot_risk() is easiest to use when you start from
est_risk(), but it is flexible enough to support valid
manual input and list-based workflows.
The main points are:
- Even if you do not start from est_risk(), your input still needs to match the output schema of est_risk().
- model and over_years are part of the minimum schema for manual input.
- preventr_id is required when one data frame contains multiple people.
- When risk_dat is a data frame, the default is to add a plot list-column.
- collapse matters for list input when add_to_dat = TRUE.
- If you want the plot objects directly, add_to_dat = FALSE is often the clearest choice, but you can always extract the plot objects from the data frame when the data frame was made with a call where add_to_dat = TRUE.