Create a Sankey diagram in R — plot

plot_sankey uses events and cohort data to generate a Sankey diagram in R, allowing users to explore treatment pathways without formatting data for the Target RWE platform.

Usage

plot_sankey(
  events,
  cohort,
  id_var = "patient_id",
  index_var = "index_date",
  stages = NULL,
  stage_labels = NULL,
  weight = FALSE,
  censor_vars = NULL,
  absorbing_vars = NULL,
  none = "None",
  gofl_formula = NULL,
  collapse_states = NULL,
  collapse_levels = NULL,
  select_event = "combine",
  force = FALSE,
  from_tv_meds = FALSE,
  med_names_list = NULL,
  tx_name = "tx_name",
  tx_start = "tx_start",
  tx_stop = "tx_stop",
  on_med_tx_end = FALSE,
  med_levels = NULL
)

Arguments

events: A long format tibble with patient id, state (i.e. treatment/medication), and start/stop columns. Its purpose is to provide event data for each id to the Sankey package, so that nsSank can determine what state(s) each id is in at each time point.
cohort: A wide-format tibble with one record per id and at minimum 2 columns: a patient id column and a patient index date column. See the "Date Overview" vignette for more details.
id_var: A string specifying the column name in events holding patient IDs (default value: "patient_id")
index_var: A string specifying the column name in cohort holding patients' index dates (default value: "index_date")
stages: A numeric vector specifying time points (days from index date) at which to evaluate medication status. If NULL (default value), time points are automatically determined from treatment start/stop dates in the data. Example: c(0, 45, 90) checks medication status at baseline, day 45, and day 90.
stage_labels: A character vector with one label per stage. These labels are used to relabel each numeric stage in the stages argument (e.g., stage_labels = c("Baseline", "Day 45", "Day 90"))
weight: Logical. When TRUE, applies inverse probability of censoring weights (IPCW) to account for patients censored before the final stage. Rather than raw observed counts, all outputs reflect IPCW-adjusted estimates that better represent the full cohort. Requires censor_vars to be specified. Default value is FALSE.
censor_vars: A named character vector. Variable names in cohort that indicate censoring date. Names are used as state labels (e.g., censor_vars = c("Censored" = "censor_date")). If unnamed, all censoring states are grouped under "Censored".
absorbing_vars: A named character vector. Variables in cohort that indicate the date of absorbing states. Names are used as state labels (e.g., absorbing_vars = c("Death" = "death_date")).
none: A string. Label for the "empty" (no event) state. Default is "None".
gofl_formula: A formula. Stratifies the Sankey by grouping variables using gofl (e.g., gofl_formula = ~ sex * age_cat). Default is NULL.
collapse_states: A named list or named vector. Controls how patients who are on multiple treatments simultaneously are represented. As a named list (e.g., collapse_states = list("Both Treatments" = c("A", "B"))), you explicitly label co-occurring treatment combinations with a custom name. As a named vector (e.g., c("A" = "a", "B" = "b")), you assign a priority order — patients on multiple treatments are assigned whichever state appears first in the vector. Default value is NULL.
collapse_levels: A named list. Use this when a treatment in your events data appears at multiple intensity levels (e.g., low, moderate, high dose) and you want to treat all levels as a single state. For example, collapse_levels = list(statin = c("low_intensity_statin", "moderate_intensity_statin", "high_intensity_statin")) combines all statin intensities into one statin state. Default value is NULL.
select_event: A string. How overlapping events are selected at each time point. "combine" (default) returns all events; "first" takes the earliest; "last" takes the most recent.
force: Logical. Force transition calculations even if they exceed size guidelines. Default is FALSE.
from_tv_meds: Logical. Indicates whether events is already in time-varying format (the output of create_time_varying_data). If FALSE (default), raw events data is expected.
med_names_list: A character vector specifying which medications from the tx_name column to include in the analysis. Any medication not in this vector will be ignored.
tx_name: A string specifying the column name in events holding treatment names (default value: "tx_name")
tx_start: A string specifying the column name in events with treatment start dates (default value: "tx_start")
tx_stop: A string specifying the column name in events with treatment stop dates (default value: "tx_stop")
on_med_tx_end: A logical value indicating whether a patient is considered on treatment on their tx_stop date. If TRUE, tx_stop is the last day ON treatment (inclusive); if FALSE, tx_stop is the first day OFF treatment (exclusive). (Default value: FALSE)
med_levels: A named list of character vectors specifying levels for leveled medications or treatments (e.g., med_levels = list(statin = c("low_intensity_statin", "moderate_intensity_statin", "high_intensity_statin"))). Any medication name listed here (e.g., statin) must be present in med_names_list. Medications not listed here but present in med_names_list are treated as binary (on/off).
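
The interaction between stages and on_med_tx_end is easier to see on a small example. The sketch below is not the package's implementation: state_at() is a hypothetical helper written only for illustration, the data are made up, and the "A + B" label for simultaneous treatments is an assumption about how combined states might be displayed. The inputs are built as base data frames to keep the sketch dependency-free, and the final call to plot_sankey runs only if the function is available in the session.

```r
# Hypothetical example data; column names match the documented defaults.
events <- data.frame(
  patient_id = c(1, 1, 2),
  tx_name    = c("A", "B", "A"),
  tx_start   = as.Date(c("2020-01-01", "2020-02-15", "2020-03-01")),
  tx_stop    = as.Date(c("2020-02-15", "2020-06-01", "2020-04-15")),
  stringsAsFactors = FALSE
)
cohort <- data.frame(
  patient_id = c(1, 2),
  index_date = as.Date(c("2020-01-01", "2020-03-01"))
)
stages <- c(0, 45, 90)  # days from each patient's index date

# Illustrative helper (NOT part of the package): the state of each patient
# at each stage, given how tx_stop is interpreted.
state_at <- function(events, cohort, stages,
                     on_med_tx_end = FALSE, none = "None") {
  out <- expand.grid(patient_id = cohort$patient_id, stage = stages)
  out$state <- vapply(seq_len(nrow(out)), function(i) {
    idx <- cohort$index_date[cohort$patient_id == out$patient_id[i]]
    day <- idx + out$stage[i]
    ev  <- events[events$patient_id == out$patient_id[i], ]
    on <- if (on_med_tx_end) {
      ev$tx_start <= day & day <= ev$tx_stop  # tx_stop is the last day ON
    } else {
      ev$tx_start <= day & day < ev$tx_stop   # tx_stop is the first day OFF
    }
    if (any(on)) paste(sort(ev$tx_name[on]), collapse = " + ") else none
  }, character(1))
  out
}

state_at(events, cohort, stages)                        # exclusive tx_stop
state_at(events, cohort, stages, on_med_tx_end = TRUE)  # inclusive tx_stop

# Run the documented function only if it is available in this session.
if (exists("plot_sankey")) {
  p <- plot_sankey(
    events       = tibble::as_tibble(events),
    cohort       = tibble::as_tibble(cohort),
    stages       = stages,
    stage_labels = c("Baseline", "Day 45", "Day 90")
  )
}
```

In this data, patient 1 switches from A to B exactly on day 45. With the exclusive interpretation the patient is on B only at that stage; with the inclusive interpretation the patient is on both A and B, which is the kind of overlap that select_event and collapse_states are described as resolving.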

Value

A Sankey diagram as a ggplot object.
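
Because the return value is a ggplot object, it should be possible to extend it with ordinary ggplot2 layers and themes. The sketch below uses an empty ggplot as a stand-in for the result of plot_sankey so that it runs without the package; the title, caption, and theme choices are illustrative assumptions, not package defaults.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Stand-in for: p <- plot_sankey(events, cohort, ...)
  p <- ggplot2::ggplot()

  # Add labels and adjust the theme as with any other ggplot object.
  p_custom <- p +
    ggplot2::labs(
      title   = "Treatment pathways",           # illustrative title
      caption = "Stages: days from index date"  # illustrative caption
    ) +
    ggplot2::theme(legend.position = "bottom")

  # To write the figure to disk, ggplot2::ggsave() can be used, e.g.:
  # ggplot2::ggsave("sankey.png", p_custom, width = 8, height = 5)
}
```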