Skip to contents

Introduction

As described in the “Data Overview” vignette (which should be reviewed first for required inputs to the nsSank package), the package supports two workflows for creating Sankey diagrams:

Original Workflow:

  1. With acceptable event and cohort data, the user passes those to the sankey_id_events function to create an id_data output.

  2. id_data and cohort are then passed to either:

    • sankey_list_maker: Converts data into the required JSON format needed to make Sankeys on the Target RWE platform

    • sankey_table_maker: Returns a table with all combinations of states present in the data with corresponding counts and formatted percentages.

Modified Workflow (Recommended):

  1. With acceptable event and cohort data, the user can directly pass those to any of the following functions:

    • create_time_varying_data: Converts event and cohort into a time-varying dataset, where each row represents a time interval per patient and columns indicate which treatments are active during that interval (supports leveled medications)

    • plot_sankey: Generates Sankey diagrams for visualization in RStudio

    • sankey_counts_table: Tabulates the number of patients in each state at each time point

    • sankey_transition_counts: Produces a table of transitions between states, showing the starting state, ending state, transition period, and number of patients for each transition

    • format_sankey_data: Converts data into the required JSON format needed to make Sankeys on the Target RWE platform

Modified Workflow: Main Functions and Intended Use-Cases

In this section we cover the intended usage for each of the aforementioned functions:

  1. create_time_varying_data
  2. plot_sankey
  3. sankey_counts_table
  4. sankey_transition_counts
  5. format_sankey_data

For this section, the following events and cohort datasets will be used:

events
## # A tibble: 190 × 5
##    patient_id index_date start      end        state                    
##         <int> <date>     <date>     <date>     <chr>                    
##  1          1 2010-06-28 2010-05-08 2010-08-06 A                        
##  2          1 2010-06-28 2010-07-08 2010-10-06 moderate_intensity_statin
##  3          1 2010-06-28 2010-02-26 2010-04-27 moderate_intensity_statin
##  4          1 2010-06-28 2010-03-12 2010-04-11 moderate_intensity_statin
##  5          1 2010-06-28 2010-12-13 2011-02-11 moderate_intensity_statin
##  6          1 2010-06-28 2010-08-07 2010-11-05 A                        
##  7          1 2010-06-28 2010-08-29 2010-11-27 A                        
##  8          2 2010-06-07 2010-10-17 2011-01-15 C                        
##  9          2 2010-06-07 2010-09-20 2010-11-19 C                        
## 10          2 2010-06-07 2010-09-24 2010-10-24 low_intensity_statin     
## # ℹ 180 more rows
cohort
## # A tibble: 40 × 9
##    patient_id index_date censor_date discontinue_date death_date sex    prior_mi
##         <int> <date>     <date>      <date>           <date>     <fct>  <fct>   
##  1          1 2010-06-28 2011-07-07  2011-07-15       NA         Female Prior MI
##  2          2 2010-06-07 2010-10-12  2010-11-12       NA         Male   No Prio…
##  3          3 2010-06-16 2011-05-16  NA               NA         Male   No Prio…
##  4          4 2010-06-18 2010-08-13  NA               2010-08-17 Male   No Prio…
##  5          5 2010-06-22 2010-10-09  2011-12-07       NA         Male   No Prio…
##  6          6 2010-06-11 2012-01-14  2010-09-01       NA         Female No Prio…
##  7          7 2010-06-30 2010-12-07  NA               NA         Male   No Prio…
##  8          8 2010-06-02 2011-07-24  NA               NA         Female Prior MI
##  9          9 2010-06-06 2010-07-07  NA               2010-09-24 Female No Prio…
## 10         10 2010-06-30 2012-09-24  NA               2011-04-29 Female No Prio…
## # ℹ 30 more rows
## # ℹ 2 more variables: age <int>, age_cat <fct>

Note that the events data has 4 different treatments: A, B, and C which are non-leveled (e.g., a patient is either on treatment A or they’re not on A) and one leveled statin treatment whose levels are low_intensity_statin, moderate_intensity_statin, and high_intensity_statin and which are displayed in the state column with the other non-leveled treatments.

1. Creating a Time-Varying Treatment Dataset (Recommend Read Here First)

The functions 2 - 5 above accept a rather large number of arguments. In order to get a handle on these arguments, it is perhaps most instructive to start with create_time_varying_data whose arguments are a subset of the arguments used by the others.

Accepted Arguments
  • events - A long format data frame with one row per treatment episode per patient, containing patient id, treatment name, index date, and treatment start/stop date columns. (See “Data Overview” vignette for more specifics on required format). This is the only function out of the 5 above that requires an index date column in the events data.
  • index_var - A string specifying the column name in events holding patients’ index dates (default: "index_date")
  • tx_name - A string specifying the column name in events holding treatment names (default: "tx_name")
  • tx_start - A string specifying the column name in events with treatment start dates (default: "tx_start")
  • tx_stop - A string specifying the column name in events with treatment stop dates (default: "tx_stop")
  • id_var - A string specifying the column name in events holding patient IDs (default: "patient_id")
  • med_names_list - A character vector specifying which medications from the tx_name column to include in the analysis. Any medication not in this vector will be ignored.
  • stages - A numeric vector specifying time points (days from index date) at which to evaluate medication status. If NULL (default value), time points are automatically determined from treatment start/stop dates in the data. Example: c(0, 45, 90) checks medication status at baseline, day 45, and day 90.
  • on_med_tx_end - A logical value indicating whether a patient is considered on treatment on their tx_stop date. If TRUE, tx_stop is the last day ON treatment (inclusive); if FALSE, tx_stop is the first day OFF treatment (exclusive). Default: FALSE
  • med_levels - A named list of character vectors specifying levels for leveled medications or treatments (e.g., med_levels = list(statin = c("low_intensity_statin", "moderate_intensity_statin", "high_intensity_statin"))). Any medication name listed here (e.g. statin) must be present in med_names_list. Medications not listed here but present in med_names_list are treated as binary (on/off).
Example 1a: Time-Varying Dataset - ignoring leveled statin medication
tv_data = nsSank::create_time_varying_data(
  events = events,
  med_names_list = c("A", "B", "C"), ## No statin medication listed. 
  ## "A", "B", and "C" are the only medications tracked
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), ## Check medication status on days 0, 45, and 90
  on_med_tx_end = TRUE ## Patients ON treatment on tx_stop date
)

tv_data
## # A tibble: 108 × 6
##    patient_id start  stop A_indi B_indi C_indi
##         <int> <dbl> <dbl> <chr>  <chr>  <chr> 
##  1          1     0    45 A      no_B   no_C  
##  2          1    45    90 A      no_B   no_C  
##  3          1    90    NA A      no_B   no_C  
##  4          2     0    45 no_A   no_B   no_C  
##  5          2    45    90 no_A   no_B   no_C  
##  6          2    90    NA A      no_B   no_C  
##  7          3     0    45 A      B      no_C  
##  8          3    45    90 no_A   B      no_C  
##  9          3    90    NA no_A   no_B   no_C  
## 10          4     0    45 no_A   no_B   no_C  
## # ℹ 98 more rows
Example 1b: Time-Varying Dataset - incorporating leveled statin medication
tv_data = nsSank::create_time_varying_data(
  events = events,
  med_names_list = c("A", "B", "C", "statin"), ## include statin
  ## Function will look for these levels in tx_name:
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")), 
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), # Check medication status on days 0, 45, and 90
  on_med_tx_end = TRUE # Patients ON treatment on tx_stop date
)

tv_data
## # A tibble: 108 × 7
##    patient_id start  stop statin_indi               A_indi B_indi C_indi
##         <int> <dbl> <dbl> <chr>                     <chr>  <chr>  <chr> 
##  1          1     0    45 no_statin                 A      no_B   no_C  
##  2          1    45    90 moderate_intensity_statin A      no_B   no_C  
##  3          1    90    NA moderate_intensity_statin A      no_B   no_C  
##  4          2     0    45 low_intensity_statin      no_A   no_B   no_C  
##  5          2    45    90 low_intensity_statin      no_A   no_B   no_C  
##  6          2    90    NA no_statin                 A      no_B   no_C  
##  7          3     0    45 no_statin                 A      B      no_C  
##  8          3    45    90 no_statin                 no_A   B      no_C  
##  9          3    90    NA no_statin                 no_A   no_B   no_C  
## 10          4     0    45 no_statin                 no_A   no_B   no_C  
## # ℹ 98 more rows
Example 1c: Dataset with Event-Driven Time Points

When stages = NULL, time points are derived from each patient’s own treatment history rather than fixed days from index. Specifically, medication status is evaluated at every unique treatment start date, stop date, and index date present in the data, producing one row per resulting interval per patient.

tv_data = nsSank::create_time_varying_data(
  events = events,
  med_names_list = c("A", "B", "C", "statin"), ## include statin
  ## Function will look for these levels in tx_name:
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")), 
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = NULL, ## allow event-driven time-points
  on_med_tx_end = TRUE # Patients ON treatment on tx_stop date
)

tv_data
## # A tibble: 270 × 7
##    patient_id start  stop statin_indi               A_indi B_indi C_indi
##         <int> <dbl> <dbl> <chr>                     <chr>  <chr>  <chr> 
##  1          1     0    10 no_statin                 A      no_B   no_C  
##  2          1    10    39 moderate_intensity_statin A      no_B   no_C  
##  3          1    39    40 moderate_intensity_statin A      no_B   no_C  
##  4          1    40    62 moderate_intensity_statin A      no_B   no_C  
##  5          1    62   100 moderate_intensity_statin A      no_B   no_C  
##  6          1   100   130 moderate_intensity_statin A      no_B   no_C  
##  7          1   130   152 no_statin                 A      no_B   no_C  
##  8          1   152   168 no_statin                 A      no_B   no_C  
##  9          1   168   228 moderate_intensity_statin no_A   no_B   no_C  
## 10          1   228    NA moderate_intensity_statin no_A   no_B   no_C  
## # ℹ 260 more rows

2. Plotting and Formatting Sankey Diagrams

Additional Arguments

In addition to the arguments specified above, each of the functions plot_sankey, sankey_counts_table, sankey_transition_counts, and format_sankey_data accept the following additional arguments:

  • cohort - a wide-format data frame with one-record-per-id that has at minimum 2 columns: a patient id column and a patient index date column. See “Date Overview” vignette for more details on the exact format.
  • stage_labels - a character vector with one label per stage. These labels are used to relabel each numeric stage in the stages argument (e.g., stage_labels = c("Baseline", "Day 45", "Day 90"))
  • weight - logical. When TRUE, applies inverse probability of censoring weights (IPCW) to account for patients censored before the final stage. Rather than raw observed counts, all outputs reflect IPCW-adjusted estimates that better represent the full cohort. Requires censor_vars to be specified. Default is FALSE. See weighting section for more information.
  • censor_vars - named character vector. Variable names in cohort that indicate censoring date. Names are used as state labels (e.g., censor_vars = c("Censored" = "censor_date")). If unnamed, all censoring states are grouped under “Censored”.
  • absorbing_vars - named character vector. Variables in cohort that indicate the date of absorbing states. Names are used as state labels (e.g., absorbing_vars = c("Death" = "death_date")).
  • none - A string. Label for the “empty” (no event) state. Default is "None".
  • gofl_formula - formula. Stratifies the Sankey by grouping variables using gofl. Default is NULL (e.g., gofl_formula = ~ sex * age_cat). See stratification section for more info.
  • collapse_states - named list or named vector. Controls how patients who are on multiple treatments simultaneously are represented. As a named list (e.g., collapse_states = list("Both Treatments" = c("A", "B"))), you explicitly label co-occurring treatment combinations with a custom name. As a named vector (e.g., c("A" = "a", "B" = "b")), you assign a priority order — patients on multiple treatments are assigned whichever state appears first in the vector. Default NULL.
  • collapse_levels - named list. Use this when a treatment in your events data appears at multiple intensity levels (e.g., low, moderate, high dose) and you want to treat all levels as a single state. For example, collapse_levels = list(statin = c("low_intensity_statin", "moderate_intensity_statin", "high_intensity_statin")) combines all statin intensities into one statin state. Default NULL.
  • select_event - A string. How overlapping events are selected at each time point. "combine" (default) returns all events; "first" takes the earliest; "last" takes the most recent.
  • force - logical. Force transition calculations even if they exceed size guidelines. Default FALSE.
  • from_tv_meds - logical. Indicates whether events is already in time-varying format (output of create_time_varying_data). If FALSE (default), raw events data is expected and med_names_list must be provided.

Basic Examples

Example 2a: Basic Sankey Diagram with Censoring and Absorbing
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"), ## No statin medication listed. 
  ## "A", "B", and "C" are the only medications tracked
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), ## Check medication status on days 0, 45, and 90
  on_med_tx_end = TRUE, ## Patients ON treatment on tx_stop date,
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Example 2b: Sankey Diagram with a Leveled Medication

As we noted in example 1a, we did not include statin in the med_names_list argument, so much like with the create_time_varying_dataset function, all events where patients were on any intensity of statin were ignored. If we wish to incorporate the leveled statin treatment, we must be sure to include it in med_names_list and specify the levels using the med_levels argument:

nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"), ## Include statin
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Each level of statin will now be tracked as its own unique state in the diagram. Of course, this may have the effect of adding more “clutter” to the diagram, so it may be desirable to simply collapse these treatment levels and simply have 1 single statin state to keep track of (i.e. anyone on any intensity of statin treatment will be considered to be on statin treatment). We may collapse these medication levels using the collapse_levels argument:

Example 2c: Sankey Diagram with Collapsed Medication Levels
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")), ## 3 intensities will be collapsed
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

## Let's collapse the statin levels into 2 different states: lower_statin and higher_statin
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(lower_statin = c("low_intensity_statin",
                                          "no_statin"),
                         higher_statin = c("moderate_intensity_statin",
                                           "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Example 2d: Sankey Diagram with Custom Stage Labels

If we want to rename the stage labels, we do so using the stage_labels argument:

sankey = nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  stage_labels = c("Baseline", "Day 45", "Day 90"), ## Renames the stages
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

sankey

Example 2e: Customizing plot_sankey Output

The output of plot_sankey is a ggplot object, so standard ggplot2 syntax can be used to adjust certain elements. However, the degree of customization is limited by how plot_sankey constructs the plot internally.

Freely customizable via +:

  • Plot, Axis and Legend titles - via labs(x = ..., y = ..., fill = ..., title = ...)
  • State colors - via scale_fill_manual() or any other scale_fill_* function
  • All theme elements (background, gridlines, font size, legend position, etc.) - via theme()

Partially customizable (with caveats):

  • Stage axis labels — via scale_x_discrete(labels = ...), but limits must always be specified and must match the stage column names used internally (i.e., the stage_labels values if set, otherwise "time_0", "time_45", etc.). Omitting limits will cause axis labels to disappear.
  • State ordering in the legend and plot - controlled by the factor level ordering that plot_sankey sets internally based on state_map. This cannot be changed after the fact; to reorder states, adjust the order of entries with collapse_states (see example 4d).
## Same output as example 2d:
sankey = nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  stage_labels = c("Baseline", "Day 45", "Day 90"),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

## Let's rename the stages again, rename axes titles, and rename legend title:

sankey +
  labs(x = "Time Points",
       y = "Subject Counts",
       fill = "Treatment Group") +
  scale_x_discrete(limits = c("Baseline", "Day 45", "Day 90"), ## same as user-specified labels
                   labels = c("Baseline" = "Stage 0",
                              "Day 45" = "Stage 1",
                              "Day 90" = "Stage 2"),
                   expand = c(0, 0)) +
  scale_fill_viridis_d() ## Let's go crazy
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

Advanced Features: Weights and Stratification

Creating Weighted Sankey Diagrams

Users have the ability, by setting weight = TRUE, to create Sankey diagrams that account for patients who are censored before the final stage. Without weighting, censored patients appear as a “Censored” state in the diagram, but their counterfactual trajectories (i.e., what treatment(s) they would have been on had they remained uncensored) are unaccounted for.

When weight = TRUE, patients who are censored between stages are up-weighted using inverse probability of censoring weights (IPCW), so that the remaining observed patients better represent what the entire cohort would look like had no one been censored. Rather than raw patient counts, flow sizes represent IPCW-adjusted expected counts - estimated as the probability of each treatment pathway multiplied by the full cohort size. Accordingly, the y-axis label changes from “Patient Counts” to “Expected Counts”. Note that censor_vars must be specified for weighting to take effect.

Because the weights are estimated from observed transitions among uncensored patients, pathways that have no observed transitions in the uncensored subset cannot be estimated and are excluded from the diagram. As a result, with small cohorts or high rates of censoring, the expected counts may not sum to the full cohort size.

Note that when weight = TRUE, the function enforces limits on complexity: no more than 10 states and 8 stages are permitted, and the total number of possible pathways (states^stages) must not exceed 1,000,000. These limits exist because computing IPCW-adjusted transition probabilities scales exponentially with the number of states and stages. Without weighting, these are warnings rather than errors. The force = TRUE argument can be used to bypass these checks, though this is not recommended.

A Weighted Sankey Diagram
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"), ## Censoring variables are specified
  absorbing_vars = c("Death" = "death_date"),
  weight = TRUE ## Incorporate weighting
)

Creating Stratified Sankey Diagrams

Should users desire to stratify their Sankey diagrams by certain categorical variables in their cohort data, they may do so utilizing the gofl_formula argument. As an example, consider the following cohort consisting of 2 categorical variables: sex and age_cat (where patients’ ages have been binned into several categories)

## # A tibble: 40 × 7
##    patient_id index_date censor_date discontinue_date death_date sex    age_cat
##         <int> <date>     <date>      <date>           <date>     <fct>  <fct>  
##  1          1 2010-06-28 2011-07-07  2011-07-15       NA         Female (0,40] 
##  2          2 2010-06-07 2010-10-12  2010-11-12       NA         Male   (75,90]
##  3          3 2010-06-16 2011-05-16  NA               NA         Male   (75,90]
##  4          4 2010-06-18 2010-08-13  NA               2010-08-17 Male   (0,40] 
##  5          5 2010-06-22 2010-10-09  2011-12-07       NA         Male   (0,40] 
##  6          6 2010-06-11 2012-01-14  2010-09-01       NA         Female (40,75]
##  7          7 2010-06-30 2010-12-07  NA               NA         Male   (0,40] 
##  8          8 2010-06-02 2011-07-24  NA               NA         Female (40,75]
##  9          9 2010-06-06 2010-07-07  NA               2010-09-24 Female (75,90]
## 10         10 2010-06-30 2012-09-24  NA               2011-04-29 Female (40,75]
## # ℹ 30 more rows
Example 3a: Stratifying Sankey Diagrams by each Variable Separately

Specifying gofl_formula = ~ sex + age_cat will produce marginal groupings: one Sankey per level of sex (filtering on that level, ignoring age_cat) and one per level of age_cat (filtering on that level, ignoring sex), plus an overall, un-stratified Sankey.

By default, if we just set the gofl_formula = ~ sex + age_cat and call the function, the output on its own is not immediately useful. When a gofl_formula is specified by the user, the function returns a named list with two elements:

  • var_info - a named list mapping each stratification variable to a character vector of its levels
  • dt - a list with one element per grouping, where each element is itself a named list containing the stratification variable values for that group along with the Sankey plot (as a ggplot object)
sankey_strat = nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex + age_cat
)

Essentially, the output has the following structure:

sankey_strat = list(
  var_info = list(sex = c("Male", "Female"),
                  age_cat = c("(0,40]", "(40,75]","(75,90]")),
  dt = list(
    list(sex = c(NA_character_), age_cat = c(NA_character_), plot = `ggplot`),
    list(sex = c("Male"), age_cat = c(NA_character_), plot = `ggplot`),
    list(sex = c("Female"), age_cat = c(NA_character_), plot = `ggplot`),
    list(sex = c(NA_character_), age_cat = c("(0,40]"), plot = `ggplot`),
    ...
    )
)

To most users, this structure will be unwieldy, so we recommend using filter_gofl() to extract a specific group from the output rather than indexing into dt directly. In fact, we can pipe the output of plot_sankey directly into filter_gofl to extract our desired strata:

sankey_strat = nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex + age_cat
)

## We can just pipe `sankey_strat` into filter_gofl and specify which grouping's Sankey we
## want to view.
## Suppose we wanted to look at the Sankey diagram for everyone that is "Male"
sankey_strat |>
  nsSank::filter_gofl(sex = "Male")

## Let's view the Sankey for everyone that is in the age category (40,75]:
sankey_strat |>
  nsSank::filter_gofl(age_cat = "(40,75]")

Example 3b: Stratifying Sankey Diagrams by all Combinations

Specifying gofl_formula = ~ sex * age_cat produces all combinations - every pairing of a sex level with an age category level, plus the marginals and the overall, un-stratified Sankey.

sankey_strat = nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex * age_cat
)

## Let's view the Sankey diagram for all Males in the age category (40,75):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = "(40,75]")

## The Sankey diagram for all males (any age category):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = NA) ## NA is important to specify here

If we only specified sex = "Male" without age_cat = NA, filter_gofl would return a partial match - a filtered list containing only the dt entries where sex = "Male", rather than unwrapping to a single plot. Our output would look something like this:

sankey_strat |>
  nsSank::filter_gofl(sex = "Male") = list(
    var_info = list(sex = c("Male", "Female"),
                  age_cat = c("(0,40]", "(40,75]","(75,90]")),
    dt = list(
      list(sex = c("Male"), age_cat = c(NA_character_), plot = `ggplot`),
      list(sex = c("Male"), age_cat = c("(0,40]"), plot = `ggplot`),
      list(sex = c("Male"), age_cat = c("(40,75]"), plot = `ggplot`),
      ...
    )
  )

Luckily, if someone has use for this output and then wishes to select a particular age category, they may do so simply by chaining multiple pipe operators:

sankey_strat |>
  nsSank::filter_gofl(sex = "Male") |>
  nsSank::filter_gofl(age_cat = NA)

Arguments: collapse_states vs. collapse_levels

Both collapse_states and collapse_levels have the ability to reduce visual clutter in a Sankey Diagram by flattening 2 or more distinct states into a single state. The question becomes, what types of states need to be flattened?

We give the following rule of thumb: If one wishes to simply flatten the intensity levels within medications, it is recommended to simply use the collapse_levels argument (e.g., see example 2c). However, if one wishes to collapse states across medications, they must use the collapse_states argument.

Below, we display patients who are on singular treatments “A”, “B”, or “C”. However, any patient who is on more than 1 of these medications will simply be collapsed into a single “Multiple Treatments” state:

Example 4a: Collapsing Multiple Treatments into a Single State
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"),
  collapse_states = list(
    "A" = list("A"),
    "B" = list("B"),
    "C" = list("C"),
    "Multiple Treatments" = list("A & B", "B & C", "A & C", "A & B & C")
  ),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Example 4b: Using both collapse_states and collapse_levels Simultaneously

As a slightly more complex version of example 4a, suppose we also want to include a statin medication in our Sankey. In previous examples, we’ve seen how our events data has 3 statin intensity levels: low_intensity_statin, moderate_intensity_statin, and high_intensity_statin. If we want to both collapse these levels into a single “statin” state while also collapsing states where patients are on more than 1 treatment simultaneously into a single “Multiple Treatments” state, we would do so as follows:

  1. Set the med_levels argument which lets nsSank know what the intensity levels of statin are in the first place
  2. Set the collapse_levels argument which lets the package know how the user wants to collapse the levels (in our example, any intensity of statin medication will be collapsed to a single “statin” state).
  3. Make sure to specify in collapse_states the base statin state and the additional possible combinations of statins may have with the other base treatments “A”, “B”, and “C”. Don’t worry if you miss a combination: the package will return an error informing you which combined states you’ve missed and need to be accounted for.
nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "Statin"),
  collapse_states = list(
    "A" = list("A"),
    "B" = list("B"),
    "C" = list("C"),
    "Statin" = list("Statin"),
    "Multiple Treatments" = list("A & B", "B & C", "A & C", "A & B & C",
                                 ## Additional combinations
                                 "A & Statin", "B & Statin", "C & Statin",
                                 "A & B & Statin", "A & C & Statin", "B & C & Statin",
                                 "A & B & C & Statin")
  ),
  med_levels = list(Statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(Statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Example 4c: Renaming States Using collapse_states

We can also take this as an opportunity to rename the state labels “A”, “B”, and “C” as “Treatment A”, “Treatment B”, and “Treatment C” if so desired:

nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"),
  collapse_states = list(
    "Treatment A" = list("A"),
    "Treatment B" = list("B"),
    "Treatment C" = list("C"),
    "Multiple Treatments" = list("A & B", "B & C", "A & C", "A & B & C")
  ),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Example 4d: Re-ordering States Using collapse_states

If the user desires to re-order the way states are presented, they may do so by passing a named character vector to the collapse_states argument. The order in which the base states appear in the vector is the order that will be displayed in the Sankey diagram. For instance, if we want the order to be “Treatment C” followed by “Treatment B” and then by “Treatment A”, we would specify it as follows:

nsSank::plot_sankey(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"),
  collapse_states = c(
    "Treatment C" = "C",
    "Treatment B" = "B",
    "Treatment A" = "A"
  ),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

Formatting Sankey Diagrams for Target RWE Platform

Once the user is satisfied with the display of their Sankey and wishes to format it for use on the platform, they simply need to pass in the same arguments they used for plot_sankey into the format_sankey_data function. As an example, suppose we were satisfied with the Sankey diagram in example 4d; we need only copy and paste the same arguments into the formatting function:

formatted_sankey = nsSank::format_sankey_data(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"),
  collapse_states = c(
    "Treatment C" = "C",
    "Treatment B" = "B",
    "Treatment A" = "A"
  ),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)

The output is in the same JSON format produced by sankey_list_maker in the original workflow.

3. Functions for State and Transitions Counts

The two counts functions, sankey_counts_table and sankey_transition_counts accept the same arguments as plot_sankey and are used in exactly the same way — the only difference is what they return. Rather than repeating the argument descriptions, this section mirrors the examples from Section 2 so users can directly compare outputs across all three functions for the same inputs.

Basic Examples (sankey_counts_table)

Example: Basic Counts Table with Censoring and Absorbing
nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"), ## No statin medication listed. 
  ## "A", "B", and "C" are the only medications tracked
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), ## Check medication status on days 0, 45, and 90
  on_med_tx_end = TRUE, ## Patients ON treatment on tx_stop date,
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 9 × 4
##   state    time_0 time_45 time_90
##   <chr>     <int>   <int>   <int>
## 1 A             4       2       3
## 2 B             3       5       5
## 3 C             7       3       3
## 4 A & B         1       1       0
## 5 A & C         0       0       2
## 6 B & C         1       1       3
## 7 None         24      22      15
## 8 Death         0       1       2
## 9 Censored      0       5       7
Example: State Counts Table with a Leveled Medication
nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"), ## Include statin
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 19 × 4
##    state                             time_0 time_45 time_90
##    <chr>                              <int>   <int>   <int>
##  1 A                                      4       1       2
##  2 B                                      3       5       5
##  3 C                                      6       2       2
##  4 moderate_intensity_statin              4       1       0
##  5 low_intensity_statin                   2       1       1
##  6 high_intensity_statin                  1       1       2
##  7 A & B                                  1       0       0
##  8 A & C                                  0       0       2
##  9 B & C                                  1       1       0
## 10 A & moderate_intensity_statin          0       1       1
## 11 B & low_intensity_statin               0       0       0
## 12 C & low_intensity_statin               1       0       0
## 13 C & high_intensity_statin              0       1       1
## 14 A & B & moderate_intensity_statin      0       1       0
## 15 B & C & moderate_intensity_statin      0       0       2
## 16 B & C & high_intensity_statin          0       0       1
## 17 None                                  17      19      12
## 18 Death                                  0       1       2
## 19 Censored                               0       5       7
Example: State Counts Table with Collapsed Medication Levels
nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")), ## 3 intensities will be collapsed
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 15 × 4
##    state          time_0 time_45 time_90
##    <chr>           <int>   <int>   <int>
##  1 A                   4       1       2
##  2 B                   3       5       5
##  3 C                   6       2       2
##  4 statin              7       3       3
##  5 A & B               1       0       0
##  6 A & C               0       0       2
##  7 B & C               1       1       0
##  8 A & statin          0       1       1
##  9 B & statin          0       0       0
## 10 C & statin          1       1       1
## 11 A & B & statin      0       1       0
## 12 B & C & statin      0       0       3
## 13 None               17      19      12
## 14 Death               0       1       2
## 15 Censored            0       5       7
Example: State Counts Table with Custom Stage Labels

If we want to rename the stage labels, we do so using the stage_labels argument:

nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  stage_labels = c("Baseline", "Day 45", "Day 90"), ## Renames the stages
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 15 × 4
##    state          Baseline `Day 45` `Day 90`
##    <chr>             <int>    <int>    <int>
##  1 A                     4        1        2
##  2 B                     3        5        5
##  3 C                     6        2        2
##  4 statin                7        3        3
##  5 A & B                 1        0        0
##  6 A & C                 0        0        2
##  7 B & C                 1        1        0
##  8 A & statin            0        1        1
##  9 B & statin            0        0        0
## 10 C & statin            1        1        1
## 11 A & B & statin        0        1        0
## 12 B & C & statin        0        0        3
## 13 None                 17       19       12
## 14 Death                 0        1        2
## 15 Censored              0        5        7

Basic Examples (sankey_transition_counts)

Example: Basic Transition Counts with Censoring and Absorbing
nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C"), ## No statin medication listed. 
  ## "A", "B", and "C" are the only medications tracked
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), ## Check medication status on days 0, 45, and 90
  on_med_tx_end = TRUE, ## Patients ON treatment on tx_stop date,
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 36 × 4
##    from  to       transition count
##    <fct> <fct>    <fct>      <int>
##  1 A     A        0 to 45        1
##  2 A     C        0 to 45        1
##  3 A     None     0 to 45        2
##  4 B     B        0 to 45        1
##  5 B     None     0 to 45        1
##  6 B     Censored 0 to 45        1
##  7 C     B        0 to 45        2
##  8 C     C        0 to 45        1
##  9 C     None     0 to 45        4
## 10 A & B B        0 to 45        1
## # ℹ 26 more rows
Example: Transition Counts Table with a Leveled Medication
nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"), ## Include statin
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 48 × 4
##    from  to                            transition count
##    <fct> <fct>                         <fct>      <int>
##  1 A     C                             0 to 45        1
##  2 A     A & moderate_intensity_statin 0 to 45        1
##  3 A     None                          0 to 45        2
##  4 B     B                             0 to 45        1
##  5 B     None                          0 to 45        1
##  6 B     Censored                      0 to 45        1
##  7 C     B                             0 to 45        1
##  8 C     high_intensity_statin         0 to 45        1
##  9 C     C & high_intensity_statin     0 to 45        1
## 10 C     None                          0 to 45        3
## # ℹ 38 more rows
Example: Transition Counts with Collapsed Medication Levels
nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")), ## 3 intensities will be collapsed
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90), 
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 44 × 4
##    from  to         transition count
##    <fct> <fct>      <fct>      <int>
##  1 A     C          0 to 45        1
##  2 A     A & statin 0 to 45        1
##  3 A     None       0 to 45        2
##  4 B     B          0 to 45        1
##  5 B     None       0 to 45        1
##  6 B     Censored   0 to 45        1
##  7 C     B          0 to 45        1
##  8 C     statin     0 to 45        1
##  9 C     C & statin 0 to 45        1
## 10 C     None       0 to 45        3
## # ℹ 34 more rows
Example: Transition Counts with Custom Stage Labels

If we want to rename the stage labels, we do so using the stage_labels argument:

nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  stage_labels = c("Baseline", "Day 45", "Day 90"), ## Renames the stages
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date")
)
## # A tibble: 44 × 4
##    from  to         transition         count
##    <fct> <fct>      <fct>              <int>
##  1 A     C          Baseline to Day 45     1
##  2 A     A & statin Baseline to Day 45     1
##  3 A     None       Baseline to Day 45     2
##  4 B     B          Baseline to Day 45     1
##  5 B     None       Baseline to Day 45     1
##  6 B     Censored   Baseline to Day 45     1
##  7 C     B          Baseline to Day 45     1
##  8 C     statin     Baseline to Day 45     1
##  9 C     C & statin Baseline to Day 45     1
## 10 C     None       Baseline to Day 45     3
## # ℹ 34 more rows

Advanced Features: Weights and Stratification (sankey_counts_table)

A Weighted State Counts Table
nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"), ## Censoring variables are specified
  absorbing_vars = c("Death" = "death_date"),
  weight = TRUE ## Incorporate weighting
)
## # A tibble: 9 × 4
##   state          time_0 time_45 time_90
##   <chr>           <dbl>   <dbl>   <dbl>
## 1 A                 4      0       4.54
## 2 B                 4      7.09    6.00
## 3 statin            8      4.42    4.50
## 4 A & B             1      0       0   
## 5 A & statin        0      1       1   
## 6 B & statin        0      0       3.84
## 7 A & B & statin    0      1       0   
## 8 None             21.8   24.1    16.5 
## 9 Death             0      1.21    2.36
Creating Stratified State Counts Tables
Example 3a: Stratifying State Counts Tables by each Variable Separately
sankey_strat = nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex + age_cat
)

## We can just pipe `sankey_strat` into filter_gofl and specify which grouping's state counts we
## want to view.
## Suppose we wanted to look at the state counts for everyone that is "Male"
sankey_strat |>
  nsSank::filter_gofl(sex = "Male")
## # A tibble: 15 × 4
##    state          time_0 time_45 time_90
##    <chr>           <int>   <int>   <int>
##  1 A                   2       1       1
##  2 B                   2       4       4
##  3 C                   5       0       1
##  4 statin              4       2       2
##  5 A & B               1       0       0
##  6 A & C               0       0       1
##  7 B & C               0       0       0
##  8 A & statin          0       0       0
##  9 B & statin          0       0       0
## 10 C & statin          0       0       0
## 11 A & B & statin      0       0       0
## 12 B & C & statin      0       0       1
## 13 None                4      11       6
## 14 Death               0       0       0
## 15 Censored            0       0       2
## Let's view the state counts for everyone that is in the age category (40,75]:
sankey_strat |>
  nsSank::filter_gofl(age_cat = "(40,75]")
## # A tibble: 15 × 4
##    state          time_0 time_45 time_90
##    <chr>           <int>   <int>   <int>
##  1 A                   1       0       0
##  2 B                   0       1       3
##  3 C                   5       0       2
##  4 statin              3       1       3
##  5 A & B               0       0       0
##  6 A & C               0       0       1
##  7 B & C               0       0       0
##  8 A & statin          0       0       0
##  9 B & statin          0       0       0
## 10 C & statin          1       1       1
## 11 A & B & statin      0       0       0
## 12 B & C & statin      0       0       0
## 13 None                7      12       5
## 14 Death               0       1       1
## 15 Censored            0       1       1
Example 3b: Stratifying State Counts Tables by all Combinations
sankey_strat = nsSank::sankey_counts_table(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex * age_cat
)

## Let's view the state counts for all males in the age category (40,75):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = "(40,75]")
## # A tibble: 15 × 4
##    state          time_0 time_45 time_90
##    <chr>           <int>   <int>   <int>
##  1 A                   1       0       0
##  2 B                   0       0       2
##  3 C                   4       0       1
##  4 statin              2       1       2
##  5 A & B               0       0       0
##  6 A & C               0       0       1
##  7 B & C               0       0       0
##  8 A & statin          0       0       0
##  9 B & statin          0       0       0
## 10 C & statin          0       0       0
## 11 A & B & statin      0       0       0
## 12 B & C & statin      0       0       0
## 13 None                2       8       3
## 14 Death               0       0       0
## 15 Censored            0       0       0
## The state counts table for all males (any age category):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = NA)
## # A tibble: 15 × 4
##    state          time_0 time_45 time_90
##    <chr>           <int>   <int>   <int>
##  1 A                   2       1       1
##  2 B                   2       4       4
##  3 C                   5       0       1
##  4 statin              4       2       2
##  5 A & B               1       0       0
##  6 A & C               0       0       1
##  7 B & C               0       0       0
##  8 A & statin          0       0       0
##  9 B & statin          0       0       0
## 10 C & statin          0       0       0
## 11 A & B & statin      0       0       0
## 12 B & C & statin      0       0       1
## 13 None                4      11       6
## 14 Death               0       0       0
## 15 Censored            0       0       2

Advanced Features: Weights and Stratification (sankey_transition_counts)

A Weighted Transition Counts Table
nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"), # Censoring variables are specified
  absorbing_vars = c("Death" = "death_date"),
  weight = TRUE # Incorporate weighting
)
## # A tibble: 27 × 4
##    from   to             transition count
##    <fct>  <fct>          <fct>      <dbl>
##  1 A      A & statin     0 to 45     1   
##  2 A      None           0 to 45     3   
##  3 B      B              0 to 45     2.67
##  4 B      None           0 to 45     1.33
##  5 statin B              0 to 45     1   
##  6 statin statin         0 to 45     2   
##  7 statin A & B & statin 0 to 45     1   
##  8 statin None           0 to 45     4   
##  9 A & B  B              0 to 45     1   
## 10 None   B              0 to 45     2.42
## # ℹ 17 more rows
Creating Stratified Transition Counts Tables
Example 3a: Stratifying Transition Counts by each Variable Separately
sankey_strat = nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex + age_cat
)

## We can just pipe `sankey_strat` into filter_gofl and specify which grouping's transition counts we
## want to view.
## Suppose we wanted to look at the transition counts for everyone that is "Male"
sankey_strat |>
  nsSank::filter_gofl(sex = "Male")
## # A tibble: 23 × 4
##    from   to     transition count
##    <fct>  <fct>  <fct>      <int>
##  1 A      None   0 to 45        2
##  2 B      B      0 to 45        1
##  3 B      None   0 to 45        1
##  4 C      B      0 to 45        1
##  5 C      statin 0 to 45        1
##  6 C      None   0 to 45        3
##  7 statin statin 0 to 45        1
##  8 statin None   0 to 45        3
##  9 A & B  B      0 to 45        1
## 10 None   A      0 to 45        1
## # ℹ 13 more rows
## Let's view the transition counts for everyone that is in the age category (40,75]:
sankey_strat |>
  nsSank::filter_gofl(age_cat = "(40,75]")
## # A tibble: 19 × 4
##    from       to         transition count
##    <fct>      <fct>      <fct>      <int>
##  1 A          None       0 to 45        1
##  2 C          statin     0 to 45        1
##  3 C          C & statin 0 to 45        1
##  4 C          None       0 to 45        3
##  5 statin     None       0 to 45        3
##  6 C & statin B          0 to 45        1
##  7 None       None       0 to 45        5
##  8 None       Death      0 to 45        1
##  9 None       Censored   0 to 45        1
## 10 B          B          45 to 90       1
## 11 statin     statin     45 to 90       1
## 12 C & statin C & statin 45 to 90       1
## 13 None       B          45 to 90       2
## 14 None       C          45 to 90       2
## 15 None       statin     45 to 90       2
## 16 None       A & C      45 to 90       1
## 17 None       None       45 to 90       5
## 18 Death      Death      45 to 90       1
## 19 Censored   Censored   45 to 90       1
Example 3b: Stratifying Transition Counts Tables by all Combinations
sankey_strat = nsSank::sankey_transition_counts(
  events = events,
  cohort = cohort,
  med_names_list = c("A", "B", "C", "statin"),
  med_levels = list(statin = c("low_intensity_statin",
                               "moderate_intensity_statin",
                               "high_intensity_statin")),
  collapse_levels = list(statin = c("low_intensity_statin", 
                                    "moderate_intensity_statin", 
                                    "high_intensity_statin")),
  index_var = "index_date",
  tx_name = "state",
  tx_start = "start",
  tx_stop = "end",
  id_var = "patient_id",
  stages = c(0, 45, 90),
  on_med_tx_end = TRUE, 
  censor_vars = c("Censored" = "censor_date"),
  absorbing_vars = c("Death" = "death_date"),
  gofl_formula = ~ sex * age_cat
)

## Let's view the transition counts for all males in the age category (40,75):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = "(40,75]")
## # A tibble: 11 × 4
##    from   to     transition count
##    <fct>  <fct>  <fct>      <int>
##  1 A      None   0 to 45        1
##  2 C      statin 0 to 45        1
##  3 C      None   0 to 45        3
##  4 statin None   0 to 45        2
##  5 None   None   0 to 45        2
##  6 statin statin 45 to 90       1
##  7 None   B      45 to 90       2
##  8 None   C      45 to 90       1
##  9 None   statin 45 to 90       1
## 10 None   A & C  45 to 90       1
## 11 None   None   45 to 90       3
## The transition counts table for all males (any age category):
sankey_strat |>
  nsSank::filter_gofl(sex = "Male", age_cat = NA)
## # A tibble: 23 × 4
##    from   to     transition count
##    <fct>  <fct>  <fct>      <int>
##  1 A      None   0 to 45        2
##  2 B      B      0 to 45        1
##  3 B      None   0 to 45        1
##  4 C      B      0 to 45        1
##  5 C      statin 0 to 45        1
##  6 C      None   0 to 45        3
##  7 statin statin 0 to 45        1
##  8 statin None   0 to 45        3
##  9 A & B  B      0 to 45        1
## 10 None   A      0 to 45        1
## # ℹ 13 more rows

Original Workflow

Overview

To facilitate the partition level data processing, the updates to the sankey workflow allow for the subject-level calculations to be done in a separate step from the sankey creation. The only difference now between the old data formats of the cohort and events file is that no relative time needs to be calculated–as long as dates are supplied the package can figure it out.

ID Level Data

First, the id level data are created used sankey_id_events or sankey_id_ansible. The only difference is the second argument – sankey_id_events takes in a data.frame in long format with a start date/ end date associated with each event. sankey_id_ansible requires that the second argument is a data.frame in ansible format (the same format required by sunbursts).

If you aren’t making a sunburst, might as well use sankey_id_events so you don’t have to deal with the overhead of converting all the data, which can take awhile.

The idea is that when the cohorts are being created, creating the sankey subject level data can be done on smaller chunks of the data before the necessary summary at the end.

sankey_id_events(cohort, 
                 evts,
                 stages,
                 id_var = "patient_id", 
                 index_var = "index_date", 
                 states = NULL,
                 select_event = "combine")
                  
sankey_id_ansible(cohort, 
                  data,
                  stages,
                  id_var = "patient_id", 
                  index_var = "index_date", 
                  states = NULL,
                  select_event = "combine")

This is to help facilitate running on partitions on the id-level first, before the summarizing step in sankey_list_maker. sunburst_id_data requires a cohort, an ansible style data.frame, and which “states” are wanted. It returns a list of the resulting data.frame and the states, so it can directly be fed into sunburst_maker.

Below is an example of the format of the event data, with a start/end date with a corresponding state:

events <- nsSank::convert_tagged_cdf(cdf)
events  %>% 
  head() %>% 
  rmarkdown::paged_table()

Below is an example of the format of cohort data. An id-level variable, index date variable, censor date variable, death_date, and any potential stratification variables.

cohort %>% 
  head() %>% 
  rmarkdown::paged_table()

Both sankey_id_events and sankey_id_ansible return a list that contains the id-level data, the identified states, and the identified stages (timepoints). In the id-level data frame, states and combinations of states are represented as base2 integers.

sankey_id_data <- nsSank::sankey_id_events(cohort, events, states = c("a", "b", "c"), stages = c(0, 90, 180))

You can indicate whether you want the function to choose the first state, last state, or combine overlapping events with . The default is to combine. You can see there aren’t any overlapping events, since we are choosing the most recent prior event instead of combining.

choose_last <- nsSank::sankey_id_events(cohort, events, states = c("a", "b", "c"), stages = c(0, 90, 180), select_event = "last")

PHR Sankey Data

Once you have the id-level data, you can summarize and create your sankey list for a PHR. This returns a list of length one, with options needed for the sankey.

sankey_list <- nsSank::sankey_list_maker(sankey_id_data, cohort)

You can use the collapse_states argument to specify how you want overlapping states presented (or for reasons such as changing the order of the states, or the labels.) The none argument additionally labels the empty state.

collapse_list <- list(
  "Treatment A" = list("a"),
  "Treatment B" = list("b"),
  "Treatment C" = list("c"),
  "Multiple" = list("a & b", "a & c", "b & c", "a & b & c")
)
sankey_list_collapse <- nsSank::sankey_list_maker(sankey_id_data, cohort, collapse_states = collapse_list, none = "No Treatment")

sankey_list_collapse[[1]]$states
## [1] "Treatment A"  "Treatment B"  "Treatment C"  "Multiple"     "No Treatment"

If using version 0.2.4 or later, you can specify collapse_states as a named vector–this will assign a hierarchy to the combined states and order accordingly:

collapse_vec <- c(
  "Treatment B" = "b",
  "Treatment A" = "a",
  "Treatment C" = "c"
)
sankey_list_collapse <- nsSank::sankey_list_maker(sankey_id_data, cohort, collapse_states = collapse_vec, none = "No Treatment")

sankey_list_collapse[[1]]$states
## [1] "Treatment B"  "Treatment A"  "Treatment C"  "No Treatment"

You can also use the stage_labels argument to specify how you want the time points labelled on the sankey.

sankey_list_collapse_stages <- nsSank::sankey_list_maker(sankey_id_data, 
                                                  cohort, 
                                                  collapse_states = collapse_list, 
                                                  stage_labels = c("Index", "90 Days", "180 Days"))

sankey_list_collapse[[1]]$stages
## [1] "0"   "90"  "180"

Filtered Sankeys

If you want to add filtering to the PHR Sankeys, you can use the golf_formula argument to specify with what variables and how you want to be able to filter.

For example, in the cohort we have two categorical variables we can use: prior_mi and age_cat.

sankey_list_filtered <- nsSank::sankey_list_maker(sankey_id_data, 
                                                  cohort, 
                                                  collapse_states = collapse_list, 
                                                  stage_labels = c("Index", "90 Days", "180 Days"),
                                                  gofl_formula = ~ prior_mi*age_cat)

That results in a list of lists. The first list var_info contains descriptive information for the levels of the grouping variables.

The second list dt contains a list of sankey data for each level.

prior_mi has 2 levels and age_cat has 3 levels, so we should expect 12 different filter combinations.

purrr::map(sankey_list_filtered$dt, ~.x[1:2])
## [[1]]
## [[1]]$prior_mi
## [1] NA
## 
## [[1]]$age_cat
## [1] NA
## 
## 
## [[2]]
## [[2]]$prior_mi
## [1] "No Prior MI"
## 
## [[2]]$age_cat
## [1] NA
## 
## 
## [[3]]
## [[3]]$prior_mi
## [1] "Prior MI"
## 
## [[3]]$age_cat
## [1] NA
## 
## 
## [[4]]
## [[4]]$prior_mi
## [1] NA
## 
## [[4]]$age_cat
## [1] "(0,40]"
## 
## 
## [[5]]
## [[5]]$prior_mi
## [1] NA
## 
## [[5]]$age_cat
## [1] "(40,75]"
## 
## 
## [[6]]
## [[6]]$prior_mi
## [1] NA
## 
## [[6]]$age_cat
## [1] "(75,90]"
## 
## 
## [[7]]
## [[7]]$prior_mi
## [1] "No Prior MI"
## 
## [[7]]$age_cat
## [1] "(0,40]"
## 
## 
## [[8]]
## [[8]]$prior_mi
## [1] "No Prior MI"
## 
## [[8]]$age_cat
## [1] "(40,75]"
## 
## 
## [[9]]
## [[9]]$prior_mi
## [1] "No Prior MI"
## 
## [[9]]$age_cat
## [1] "(75,90]"
## 
## 
## [[10]]
## [[10]]$prior_mi
## [1] "Prior MI"
## 
## [[10]]$age_cat
## [1] "(0,40]"
## 
## 
## [[11]]
## [[11]]$prior_mi
## [1] "Prior MI"
## 
## [[11]]$age_cat
## [1] "(40,75]"
## 
## 
## [[12]]
## [[12]]$prior_mi
## [1] "Prior MI"
## 
## [[12]]$age_cat
## [1] "(75,90]"