Introduction

This vignette provides an overview of the data users need to have in order to begin making Sankey diagrams with the nsSank package. If users already have the required data (as outlined in this document), it is recommended they review the “Making Sankeys” vignette to go over the workflow in more detail.

In general, this package supports two primary workflows for creating Sankey diagrams.

The original workflow, which remains available, was designed to implement the core data transformations required to produce Sankey diagrams on the Target RWE platform. It represents intermediate states using numeric identifiers and continues to be available for users who are already familiar with it.

Building on this foundation, a modified workflow was developed in Spring 2026 to improve the readability and usability of nsSank outputs. This updated approach provides more descriptive state representations and additional functionality that helps users understand, validate, and work with their data in RStudio. In particular, the new workflow provides functions that enable users to:

  • Visualize patient flows with plot_sankey() (generates ggplot2-based Sankey diagrams in RStudio)
  • Tabulate state counts at each time point with sankey_counts_table()
  • Summarize patient transitions between consecutive time points with sankey_transition_counts()
  • Work with subgroup-specific results using the gofl_formula argument, with improved support for exploring and extracting individual strata using filter_gofl()
  • Prepare treatment/medication data for analysis with create_time_varying_data() (supports leveled medications)

More details regarding these functions and their intended use cases can be found in the “Making Sankeys” vignette.

Both workflows are fully supported, and users can choose the approach that best fits their needs.

Required Data for Sankeys

To begin using the nsSank package with either workflow mentioned above, users are required to have at minimum two types of data: event data and cohort data. The formats for these required data are described below.

Cohort Format

The cohort data is a wide-format data.frame with one record per id. Its purpose is to provide the nsSank package with id-level information. The only columns cohort must include are:

  • Some type of id column unique to each patient
  • A column specifying a patient’s index date

Optional variables include:

  • Baseline variables that can be used to filter the Sankeys
  • Censor dates
  • Absorbing state dates

“Absorbing states” are observed terminal outcomes (e.g., death) that ids cannot exit once entered. “Censor dates” mark when an id can no longer be observed; after that date, the id’s true state is unknown.

Example 1: Acceptable Cohort Data
#> # A tibble: 10,000 × 9
#>    patient_id index_date censor_date discontinue_date death_date sex    prior_mi
#>         <int> <date>     <date>      <date>           <date>     <fct>  <fct>   
#>  1          1 2010-06-07 2012-01-25  2011-04-30       NA         Female Prior MI
#>  2          2 2010-06-06 2013-04-06  2011-10-08       NA         Male   No Prio…
#>  3          3 2010-06-27 2011-07-17  2011-10-13       NA         Male   Prior MI
#>  4          4 2010-06-22 2011-05-19  NA               NA         Female Prior MI
#>  5          5 2010-06-05 2010-12-30  2010-10-05       NA         Female Prior MI
#>  6          6 2010-06-07 2010-12-08  NA               NA         Female Prior MI
#>  7          7 2010-06-20 2012-01-05  2012-05-05       2011-02-03 Female No Prio…
#>  8          8 2010-06-25 2011-06-03  NA               NA         Male   No Prio…
#>  9          9 2010-06-20 2011-05-05  NA               NA         Male   No Prio…
#> 10         10 2010-06-02 2011-03-12  NA               NA         Male   No Prio…
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: age <int>, age_cat <fct>
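
The two required properties of cohort (a unique id per row and an index date) can be checked before passing data to nsSank. The following is a minimal base R sketch on a hypothetical three-patient cohort; the column names simply mirror Example 1, and the checks are our own illustration rather than validation performed by the package.

```r
# Hypothetical miniature cohort; column names mirror Example 1.
cohort <- data.frame(
  patient_id  = 1:3,
  index_date  = as.Date(c("2010-06-07", "2010-06-06", "2010-06-27")),
  censor_date = as.Date(c("2012-01-25", "2013-04-06", "2011-07-17")),
  death_date  = as.Date(c(NA, NA, "2011-02-03"))
)

# One record per id: no patient_id may appear more than once.
ids_unique <- anyDuplicated(cohort$patient_id) == 0

# Every patient needs an index date.
index_complete <- !anyNA(cohort$index_date)

# Stop early with an error if either requirement is violated.
stopifnot(ids_unique, index_complete)
```

Optional columns such as censor_date or death_date may contain NA, as in Example 1, so they are not checked here.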

Event Format

The event data is a long-format data.frame with one record per id, state, and start/end interval. Its purpose is to supply each id’s events to nsSank, so that the package can determine what state(s) each id is in at each time point.

Variables
  • id_var: a record identifier (typically a patient id). Not every identifier in the cohort data needs to appear in the event data: an id with no records in event is considered to be in the empty state until the end of follow-up or censoring. However, every id in the event data must have exactly one corresponding row in cohort.

  • start: the start date or day of the event/state.

  • end: the end date or day of the event/state.

  • state: the event of interest corresponding to the start/end times. Typically a treatment (e.g., PCSK9i, Ezetimibe), but it can take other forms.

Example 2: Acceptable Event Data with Non-Leveled Medications (Original Workflow)
#> # A tibble: 5 × 4
#>   patient_id start      end        state
#>        <dbl> <date>     <date>     <chr>
#> 1          1 2010-01-01 2010-01-31 a    
#> 2          1 2010-02-01 2010-03-31 a    
#> 3          1 2010-02-01 2010-03-31 b    
#> 4          2 2010-01-01 2010-01-31 b    
#> 5          2 2010-03-01 2010-04-30 b
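
The relationship between the two tables (every id in event must appear in cohort, but not the reverse) can be verified with a set difference. This is a small base R sketch using the rows of Example 2 and a hypothetical three-patient cohort; it is an illustrative check, not a function from nsSank.

```r
# Hypothetical cohort ids; patient 3 has no event rows, which is allowed.
cohort <- data.frame(
  patient_id = c(1, 2, 3),
  index_date = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))
)

# Event rows copied from Example 2.
events <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-02-01",
                    "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-03-31",
                    "2010-01-31", "2010-04-30")),
  state = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Ids present in events but missing from cohort; this should be empty.
orphan_ids <- setdiff(events$patient_id, cohort$patient_id)

# Cohort ids with no events; these are treated as being in the empty state.
no_event_ids <- setdiff(cohort$patient_id, events$patient_id)

stopifnot(length(orphan_ids) == 0)
```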

The original workflow generally assumes that states in the event data are binary: within a given start/end interval, a patient is either on medication “a” or not. In contrast, the modified workflow supports leveled medications, so the state column may contain different levels of the same medication. As an example, suppose a patient can be on one of three levels of a statin medication: low_intensity_statin, moderate_intensity_statin, or high_intensity_statin. The event data can then look as follows:

Example 3: Acceptable Event Data with Leveled Medications (Modified Workflow)
#> # A tibble: 188 × 4
#>    patient_id start      end        state                    
#>         <int> <date>     <date>     <chr>                    
#>  1          1 2010-05-11 2010-06-10 low_intensity_statin     
#>  2          1 2010-02-27 2010-05-28 a                        
#>  3          1 2010-05-06 2010-06-05 c                        
#>  4          2 2010-04-18 2010-06-17 c                        
#>  5          2 2010-05-28 2010-06-27 c                        
#>  6          2 2010-02-19 2010-05-20 c                        
#>  7          2 2010-08-17 2010-09-16 moderate_intensity_statin
#>  8          2 2010-09-12 2010-11-11 c                        
#>  9          2 2010-11-20 2011-01-19 b                        
#> 10          2 2010-08-23 2010-10-22 c                        
#> # ℹ 178 more rows

Preparing Event Data: Helper Functions

This section covers two helper functions, convert_tagged_cdf() and stockpile_events(), that help format event data for use in nsSank.

In this document, a tagged CDF is any data frame depicting events data in the following form:

Example 4: A Tagged CDF
tagged_cdf
#> # A tibble: 4 × 5
#>   patient_id start_date end_date   is_a  is_b 
#>        <dbl> <date>     <date>     <lgl> <lgl>
#> 1          1 2010-01-01 2010-01-31 TRUE  FALSE
#> 2          1 2010-02-01 2010-03-31 TRUE  TRUE 
#> 3          2 2010-01-01 2010-01-31 FALSE TRUE 
#> 4          2 2010-03-01 2010-04-30 FALSE TRUE

where the logical columns represent whether a patient is on particular medications during a specified start/stop interval (e.g., patient 1 above is on medication “a” from 2010-01-01 to 2010-01-31 inclusive but not on medication “b” during that time interval).

convert_tagged_cdf() is a helper function that converts a tagged CDF to the event format needed for nsSank. The tagged CDF passed to it should only have tags for the states of interest. For example, if the user has no interest in tracking a medication “c” in the Sankey, they should remove the corresponding indicator column is_c if present.

Calling convert_tagged_cdf() on a tagged CDF removes the prefix from the indicator column names, so that is_a becomes the state “a”. The prefix to remove is specified with the rmv_prefix argument. By default, rmv_prefix = "is_", but this can be adjusted if the columns have a different prefix.

Example 5: Converting a Tagged CDF into Acceptable Events Data
events <- nsSank::convert_tagged_cdf(tagged_cdf)
events
#> # A tibble: 5 × 4
#>   patient_id start      end        state
#>        <dbl> <date>     <date>     <chr>
#> 1          1 2010-01-01 2010-01-31 a    
#> 2          1 2010-02-01 2010-03-31 a    
#> 3          1 2010-02-01 2010-03-31 b    
#> 4          2 2010-01-01 2010-01-31 b    
#> 5          2 2010-03-01 2010-04-30 b
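
To make the reshaping concrete, the following base R sketch reproduces the rows of Example 5 from the tagged CDF of Example 4. It is an illustration of the wide-to-long idea only and is not the implementation of convert_tagged_cdf(); the variable prefix stands in for the rmv_prefix argument, and the helper column row is introduced purely to preserve ordering.

```r
# Tagged CDF copied from Example 4.
tagged_cdf <- data.frame(
  patient_id = c(1, 1, 2, 2),
  start_date = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end_date   = as.Date(c("2010-01-31", "2010-03-31", "2010-01-31", "2010-04-30")),
  is_a = c(TRUE, TRUE, FALSE, FALSE),
  is_b = c(FALSE, TRUE, TRUE, TRUE)
)

prefix <- "is_"  # plays the role of rmv_prefix
tag_cols <- grep(paste0("^", prefix), names(tagged_cdf), value = TRUE)

# Emit one long-format row per (interval, tag) pair whose indicator is TRUE.
pieces <- lapply(tag_cols, function(col) {
  on <- tagged_cdf[[col]]
  data.frame(
    row        = which(on),
    patient_id = tagged_cdf$patient_id[on],
    start      = tagged_cdf$start_date[on],
    end        = tagged_cdf$end_date[on],
    state      = rep(sub(paste0("^", prefix), "", col), sum(on)),
    stringsAsFactors = FALSE
  )
})
long_events <- do.call(rbind, pieces)

# Restore the original interval order, then the tag order within an interval.
long_events <- long_events[order(long_events$row, long_events$state), ]
long_events$row <- NULL
rownames(long_events) <- NULL
```

The interval in which both indicators are TRUE (patient 1, 2010-02-01 to 2010-03-31) yields two rows, one per state, which is why four tagged rows become five event rows.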

stockpile_events() combines nearby or overlapping events of the same type into continuous periods, which can reduce data size when there are many similar events.

For each patient and state, events are merged if they overlap or are within gap days of each other. The default gap = 1L means events that touch or have at most a 1-day gap between them will be combined into a single period (using the earliest start date and latest end date).

With the default gap applied, patient 1’s two “a” rows, which touch (one ends on 2010-01-31 and the next starts on 2010-02-01), combine into a single period:

nsSank::stockpile_events(events)
#> # A tibble: 4 × 4
#> # Groups:   patient_id, state [3]
#>   patient_id state start      end       
#>        <dbl> <chr> <date>     <date>    
#> 1          1 a     2010-01-01 2010-03-31
#> 2          1 b     2010-02-01 2010-03-31
#> 3          2 b     2010-01-01 2010-01-31
#> 4          2 b     2010-03-01 2010-04-30

If we allow a larger gap, for example gap = 30L, patient 2’s two “b” rows, which are 29 days apart, also combine:

nsSank::stockpile_events(events, gap = 30L)
#> # A tibble: 3 × 4
#> # Groups:   patient_id, state [3]
#>   patient_id state start      end       
#>        <dbl> <chr> <date>     <date>    
#> 1          1 a     2010-01-01 2010-03-31
#> 2          1 b     2010-02-01 2010-03-31
#> 3          2 b     2010-01-01 2010-04-30
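
The merging rule can be sketched in base R for a single patient and state. This is not the implementation of stockpile_events(); the function name merge_intervals is hypothetical, and the sketch assumes that the gap is measured as the next start date minus the latest end date seen so far, in days, which is consistent with the two outputs above.

```r
# Merge the intervals of ONE patient-state pair.
# Assumption: gap = (next start) - (latest end so far), in days.
merge_intervals <- function(start, end, gap = 1L) {
  ord <- order(start)
  s <- as.numeric(start[ord])  # days since 1970-01-01
  e <- as.numeric(end[ord])

  # Running latest end date, so intervals nested inside earlier ones are handled.
  run_end <- cummax(e)

  # A new period begins whenever the distance to the running end exceeds `gap`.
  new_period <- c(TRUE, s[-1] - run_end[-length(run_end)] > gap)
  period <- cumsum(new_period)

  # Each merged period spans its earliest start and latest end.
  data.frame(
    start = as.Date(as.vector(tapply(s, period, min)), origin = "1970-01-01"),
    end   = as.Date(as.vector(tapply(e, period, max)), origin = "1970-01-01")
  )
}

# Patient 2, state "b", from Example 5.
b_start <- as.Date(c("2010-01-01", "2010-03-01"))
b_end   <- as.Date(c("2010-01-31", "2010-04-30"))

merged_default <- merge_intervals(b_start, b_end)             # 29 days > 1: two periods
merged_wide    <- merge_intervals(b_start, b_end, gap = 30L)  # 29 days <= 30: one period
```

Applying the function within each patient and state (for example with split() or a grouped summary) would cover a full event table.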

Formatting Events for Sunbursts

The sunburst requires ordered events, regardless of time. The ansible function converts data in an event-time format (e.g., the CDF, where each row uniquely identifies a single event with start and stop times) to an interval format, where, within a patient, time is separated into mutually exclusive time intervals that capture all events that happened in that interval.

ansible requires event data converted from the CDF. By default, ansible applies stockpile_events to the data in order to function properly. If the data are already stockpiled, users can set stockpile = FALSE to skip this step.

Below, we convert the data from Example 5 to the ansible format. By default, gap = 1L is applied.

data <- nsSank::ansible(events)
data
#> # A tibble: 4 × 4
#>   patient_id start      end        state    
#>        <dbl> <date>     <date>     <list>   
#> 1          1 2010-01-01 2010-01-31 <chr [1]>
#> 2          1 2010-02-01 2010-03-31 <chr [2]>
#> 3          2 2010-01-01 2010-01-31 <chr [1]>
#> 4          2 2010-03-01 2010-04-30 <chr [1]>

Time is split into mutually exclusive intervals in which events occur. For example, from 2010-02-01 to 2010-03-31, the “a” and “b” states overlapped for patient 1:

# The native pipe (R >= 4.1) avoids needing magrittr or dplyr attached for %>%
dplyr::filter(data, start == as.Date("2010-02-01")) |>
  purrr::pluck("state", 1)
#> [1] "a" "b"

Converting data to this format allows us to (1) filter once on a single time point to get all events associated with that time, and (2) uniquely order the pattern of events.
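
The interval format can be illustrated with a short base R sketch for one patient. It is not the implementation of ansible(); it assumes day-level, inclusive start and end dates and takes patient 1’s stockpiled rows from the stockpile_events() output as input.

```r
# Stockpiled events for patient 1.
ev <- data.frame(
  state = c("a", "b"),
  start = as.Date(c("2010-01-01", "2010-02-01")),
  end   = as.Date(c("2010-03-31", "2010-03-31")),
  stringsAsFactors = FALSE
)

# Breakpoints: every start date, and the day after every (inclusive) end date.
breaks <- sort(unique(c(ev$start, ev$end + 1)))

# Consecutive breakpoints bound the candidate intervals [from, to].
from <- breaks[-length(breaks)]
to   <- breaks[-1] - 1

# States that are active throughout each candidate interval.
active <- lapply(seq_along(from), function(i) {
  ev$state[ev$start <= from[i] & ev$end >= to[i]]
})

# Drop intervals in which no state is active, then store states as a list column.
keep <- lengths(active) > 0
intervals <- data.frame(start = from[keep], end = to[keep])
intervals$state <- active[keep]
```

Because the set of active states can only change at a breakpoint, each resulting interval carries a single, well-defined combination of states, which is what makes a one-step filter on a time point possible.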