Introduction
This vignette provides an overview of the data users need to have in
order to begin making Sankey diagrams with the nsSank
package. If users already have the required data (as outlined in this
document), it is recommended they review the “Making Sankeys” vignette
to go over the workflow in more detail.
In general, this package supports two primary workflows for creating Sankey diagrams.
The original workflow was designed to implement the core data transformations required to produce Sankey diagrams on the Target RWE platform. It represents intermediate states using numeric identifiers and remains available for users who are already familiar with it.
Building on this foundation, a newer, modified workflow was developed
in Spring 2026 to improve readability and usability of
nsSank outputs. This updated approach provides more
descriptive state representations and additional functionality for users
to understand, validate, and work with their data in RStudio. In
particular the new workflow provides functions which enable users
to:
- Visualize patient flows with plot_sankey() (generates ggplot2-based Sankey diagrams in RStudio)
- Tabulate state counts at each time point with sankey_counts_table()
- Summarize patient transitions between consecutive time points with sankey_transition_counts()
- Work with subgroup-specific results using the gofl_formula argument, with improved support for exploring and extracting individual strata using filter_gofl()
- Prepare treatment/medication data for analysis with create_time_varying_data() (supports leveled medications)
More details regarding these functions and their intended use cases can be found in the “Making Sankeys” vignette.
Both workflows are fully supported, and users can choose the approach that best fits their needs.
Required Data for Sankeys
To begin using the nsSank package with either workflow
mentioned above, users are required to have at minimum two types of
data: event data and cohort data. The formats
for these required data are described below.
Cohort Format
The cohort data is a wide-format data.frame
with one-record-per-id. The purpose of cohort is to provide
the nsSank package with overall id-level information. The
only required columns cohort must include are:
- Some type of id column unique to each patient
- A column specifying a patient’s index date
Optional variables include:
- Baseline variables that can be used to filter the Sankeys
- Censor dates
- Absorbing state dates
“Absorbing states” are observed terminal outcomes (e.g., death) that
ids cannot exit once entered. “Censor dates” mark when an
id can no longer be observed, but its true state remains
unknown.
Example 1: Acceptable Cohort Data
#> # A tibble: 10,000 × 9
#> patient_id index_date censor_date discontinue_date death_date sex prior_mi
#> <int> <date> <date> <date> <date> <fct> <fct>
#> 1 1 2010-06-07 2012-01-25 2011-04-30 NA Female Prior MI
#> 2 2 2010-06-06 2013-04-06 2011-10-08 NA Male No Prio…
#> 3 3 2010-06-27 2011-07-17 2011-10-13 NA Male Prior MI
#> 4 4 2010-06-22 2011-05-19 NA NA Female Prior MI
#> 5 5 2010-06-05 2010-12-30 2010-10-05 NA Female Prior MI
#> 6 6 2010-06-07 2010-12-08 NA NA Female Prior MI
#> 7 7 2010-06-20 2012-01-05 2012-05-05 2011-02-03 Female No Prio…
#> 8 8 2010-06-25 2011-06-03 NA NA Male No Prio…
#> 9 9 2010-06-20 2011-05-05 NA NA Male No Prio…
#> 10 10 2010-06-02 2011-03-12 NA NA Male No Prio…
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: age <int>, age_cat <fct>
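The two required properties of cohort (one record per id, and an index date for every id) can be checked before handing the data to nsSank. The following is a minimal base R sketch on a made-up three-patient cohort; the column names mirror Example 1, but the values and the check names are illustrative assumptions rather than part of the package.

```r
# Hypothetical toy cohort; values are invented for illustration only.
cohort <- data.frame(
  patient_id  = 1:3,
  index_date  = as.Date(c("2010-06-07", "2010-06-06", "2010-06-27")),
  censor_date = as.Date(c("2012-01-25", "2013-04-06", "2011-07-17")),
  death_date  = as.Date(c(NA, NA, "2011-02-03"))
)

# One record per id: no identifier may appear more than once.
one_row_per_id <- anyDuplicated(cohort$patient_id) == 0L

# Every patient needs an index date.
has_index_date <- !anyNA(cohort$index_date)

one_row_per_id
has_index_date
```

Optional columns such as censor or absorbing state dates may contain NA, as in Example 1, so they are not checked here.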
Event Format
The event data is a long-format data.frame with
one record per id, state, and start/stop interval. Its purpose is to
tell nsSank what state(s) each id is in at each
time point.
Variables
- id_var: some sort of (typically patient) record identifier. Not all identifiers in the cohort need to be present in the event data. If an identifier in the cohort data has no records in event, that patient is considered to be in the empty state until the end of follow-up or censoring. However, all ids in the event data should have a corresponding id-level record in the cohort file (i.e., a patient in cohort may or may not have any rows in event, but all patients with rows in event must have a unique row in cohort).
- start: the start date or day of the event/state.
- end: the end date or day of the event/state.
- state: the event of interest corresponding to the start/end times. Typically a treatment (e.g., PCSK9i, Ezetimibe), but it can take on other forms.
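The relationship between identifiers in cohort and event can be verified with a set difference. The sketch below uses base R only; the data are invented, and the object names orphan_ids and no_event_ids are hypothetical.

```r
# Hypothetical cohort with three patients; patient 3 has no events.
cohort <- data.frame(
  patient_id = c(1, 2, 3),
  index_date = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))
)

events <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-02-01",
                    "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-03-31",
                    "2010-01-31", "2010-04-30")),
  state = c("a", "a", "b", "b", "b")
)

# Ids with events but no cohort row violate the requirement above.
orphan_ids <- setdiff(events$patient_id, cohort$patient_id)

# Ids in the cohort without events are allowed (empty state).
no_event_ids <- setdiff(cohort$patient_id, events$patient_id)

length(orphan_ids) == 0
no_event_ids
```

An empty orphan_ids means every patient with events has a cohort row; patients listed in no_event_ids are simply treated as being in the empty state.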
Example 2: Acceptable Event Data with Non-Leveled Medications (Original Workflow)
#> # A tibble: 5 × 4
#> patient_id start end state
#> <dbl> <date> <date> <chr>
#> 1 1 2010-01-01 2010-01-31 a
#> 2 1 2010-02-01 2010-03-31 a
#> 3 1 2010-02-01 2010-03-31 b
#> 4 2 2010-01-01 2010-01-31 b
#> 5 2 2010-03-01 2010-04-30 b
We note that the original workflow generally assumes that states in
the events data are binary: in a given start/end interval, a patient is
either on medication “a” or not. In contrast, the modified workflow supports
leveled medications, so the event data may contain
different levels of the same medication in the
state column. As an example, suppose a patient can be on
three different levels of a statin medication: low_intensity_statin,
moderate_intensity_statin, and high_intensity_statin. Then, the
event data can be as follows:
Example 3: Acceptable Event Data with Leveled Medications (Modified Workflow)
#> # A tibble: 188 × 4
#> patient_id start end state
#> <int> <date> <date> <chr>
#> 1 1 2010-05-11 2010-06-10 low_intensity_statin
#> 2 1 2010-02-27 2010-05-28 a
#> 3 1 2010-05-06 2010-06-05 c
#> 4 2 2010-04-18 2010-06-17 c
#> 5 2 2010-05-28 2010-06-27 c
#> 6 2 2010-02-19 2010-05-20 c
#> 7 2 2010-08-17 2010-09-16 moderate_intensity_statin
#> 8 2 2010-09-12 2010-11-11 c
#> 9 2 2010-11-20 2011-01-19 b
#> 10 2 2010-08-23 2010-10-22 c
#> # ℹ 178 more rows
Preparing Event Data: Helper Functions
This section covers a couple of helper functions that can help format
event data properly for use in nsSank.
In this document, a tagged CDF is any data frame depicting events data in the following form:
Example 4: A Tagged CDF
tagged_cdf
#> # A tibble: 4 × 5
#> patient_id start_date end_date is_a is_b
#> <dbl> <date> <date> <lgl> <lgl>
#> 1 1 2010-01-01 2010-01-31 TRUE FALSE
#> 2 1 2010-02-01 2010-03-31 TRUE TRUE
#> 3 2 2010-01-01 2010-01-31 FALSE TRUE
#> 4 2 2010-03-01 2010-04-30 FALSE TRUE
where the logical columns represent whether a patient is on
particular medications during a specified start/stop interval (e.g.,
patient 1 above is on medication “a” from 2010-01-01 to
2010-01-31 inclusive but not on medication “b” during that
time interval).
convert_tagged_cdf is a helper function to convert
tagged CDFs to the events format needed for nsSank. The
tagged CDF that is passed through should only have tags for the states
of interest. For example, if the user has no interest in tracking a
medication “c” in the Sankey, they should remove the corresponding
indicator column is_c if present.
Calling convert_tagged_cdf on a tagged CDF will remove
any prefixes in the indicator column names. The prefix to be removed is
specified using the rmv_prefix argument. By default,
rmv_prefix = "is_", but this can be adjusted if the columns
have a different prefix.
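To make the reshaping concrete, the conversion from a tagged CDF to long events data can be sketched in base R. This is an illustration of the idea only, not the implementation of convert_tagged_cdf; the data reproduce Example 4, and the names prefix, pieces, and events_sketch are hypothetical.

```r
# Tagged CDF from Example 4, rebuilt as a plain data.frame.
tagged_cdf <- data.frame(
  patient_id = c(1, 1, 2, 2),
  start_date = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end_date   = as.Date(c("2010-01-31", "2010-03-31", "2010-01-31", "2010-04-30")),
  is_a = c(TRUE, TRUE, FALSE, FALSE),
  is_b = c(FALSE, TRUE, TRUE, TRUE)
)

prefix <- "is_"
tag_cols <- grep(paste0("^", prefix), names(tagged_cdf), value = TRUE)

# For each indicator column, keep the rows where the tag is TRUE and
# label them with the column name minus its prefix.
pieces <- lapply(tag_cols, function(col) {
  rows <- tagged_cdf[tagged_cdf[[col]], , drop = FALSE]
  data.frame(
    patient_id = rows$patient_id,
    start = rows$start_date,
    end   = rows$end_date,
    state = rep(sub(paste0("^", prefix), "", col), nrow(rows))
  )
})

events_sketch <- do.call(rbind, pieces)
events_sketch <- events_sketch[order(events_sketch$patient_id,
                                     events_sketch$start,
                                     events_sketch$state), ]
rownames(events_sketch) <- NULL
events_sketch
```

Each TRUE indicator becomes one row, so an interval tagged with two medications yields two rows with identical start and end dates.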
Example 5: Converting a Tagged CDF into Acceptable Events Data
events <- nsSank::convert_tagged_cdf(tagged_cdf)
events
#> # A tibble: 5 × 4
#> patient_id start end state
#> <dbl> <date> <date> <chr>
#> 1 1 2010-01-01 2010-01-31 a
#> 2 1 2010-02-01 2010-03-31 a
#> 3 1 2010-02-01 2010-03-31 b
#> 4 2 2010-01-01 2010-01-31 b
#> 5 2 2010-03-01 2010-04-30 b
stockpile_events() combines nearby or overlapping events
of the same type into continuous periods, which can reduce data size
when there are many similar events.
For each patient and state, events are merged if they overlap or are
within gap days of each other. The default
gap = 1L means events that touch or have at most a 1-day
gap between them will be combined into a single period (using the
earliest start date and latest end date).
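The merging rule can be illustrated with a small base R function. This is a sketch under stated assumptions, not the code behind stockpile_events(): the function name merge_events is hypothetical, and it assumes two events merge when the later start is at most gap days after the latest end seen so far.

```r
# Events from Example 5.
events <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-02-01",
                    "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-03-31",
                    "2010-01-31", "2010-04-30")),
  state = c("a", "a", "b", "b", "b")
)

# Hypothetical helper: merge events per patient and state.
merge_events <- function(events, gap = 1L) {
  groups <- split(events, list(events$patient_id, events$state), drop = TRUE)
  merged <- lapply(groups, function(g) {
    g <- g[order(g$start), ]
    s <- as.numeric(g$start)
    e <- as.numeric(g$end)
    # A new period begins when the next start is more than `gap` days
    # after the latest end seen so far (assumed boundary rule).
    running_end <- cummax(e)
    new_period <- c(TRUE, (s[-1] - head(running_end, -1)) > gap)
    period <- cumsum(new_period)
    data.frame(
      patient_id = g$patient_id[1],
      state = g$state[1],
      start = as.Date(as.numeric(tapply(s, period, min)), origin = "1970-01-01"),
      end   = as.Date(as.numeric(tapply(e, period, max)), origin = "1970-01-01")
    )
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

merged_default <- merge_events(events)
merged_wide <- merge_events(events, gap = 30L)
```

Under this rule, patient 1's two "a" rows are one day apart and merge at the default gap, while patient 2's two "b" rows are 29 days apart and merge only when gap is at least 29.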
With the default gap applied, the events rows for patient 1 reduce down to:
nsSank::stockpile_events(events)
#> # A tibble: 4 × 4
#> # Groups: patient_id, state [3]
#> patient_id state start end
#> <dbl> <chr> <date> <date>
#> 1 1 a 2010-01-01 2010-03-31
#> 2 1 b 2010-02-01 2010-03-31
#> 3 2 b 2010-01-01 2010-01-31
#> 4 2 b 2010-03-01 2010-04-30
If we allow a larger gap, for example, gap = 30L,
patient 2’s rows combine:
nsSank::stockpile_events(events, gap = 30L)
#> # A tibble: 3 × 4
#> # Groups: patient_id, state [3]
#> patient_id state start end
#> <dbl> <chr> <date> <date>
#> 1 1 a 2010-01-01 2010-03-31
#> 2 1 b 2010-02-01 2010-03-31
#> 3 2 b 2010-01-01 2010-04-30
Formatting Events for Sunbursts
The sunburst requires ordered events, regardless of time. The
ansible function converts data in an event-time format
(e.g., the CDF, where each row uniquely identifies a single event with
start and stop times) to an interval format, where, within a patient,
time is separated into mutually exclusive time intervals that capture
all events that happened in that interval.
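One way to picture the interval format is to cut each patient's timeline at every event start and at the day after every event end, then record which states cover each resulting piece. The base R sketch below follows that idea for a single patient. It is not the implementation of ansible: the function name split_intervals is hypothetical, dates are assumed to be inclusive, and the stockpiling step is omitted.

```r
# Events from Example 5.
events <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-02-01",
                    "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-03-31",
                    "2010-01-31", "2010-04-30")),
  state = c("a", "a", "b", "b", "b")
)

# Hypothetical helper: split one patient's events into exclusive intervals.
split_intervals <- function(ev) {
  s <- as.numeric(ev$start)
  e <- as.numeric(ev$end)
  # Breakpoints: every start, plus the day after every (inclusive) end.
  cuts <- sort(unique(c(s, e + 1)))
  lo <- head(cuts, -1)
  hi <- tail(cuts, -1) - 1
  # States whose event fully covers each candidate interval.
  active <- lapply(seq_along(lo), function(i) {
    sort(unique(ev$state[s <= lo[i] & e >= hi[i]]))
  })
  keep <- lengths(active) > 0  # drop stretches with no events
  out <- data.frame(
    start = as.Date(lo[keep], origin = "1970-01-01"),
    end   = as.Date(hi[keep], origin = "1970-01-01")
  )
  out$state <- active[keep]
  out
}

p1 <- split_intervals(events[events$patient_id == 1, ])
p2 <- split_intervals(events[events$patient_id == 2, ])
```

Here state is a list column, so a single interval can hold several simultaneous states, and stretches of time with no events are dropped.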
ansible requires event data converted from the CDF. By
default, ansible applies stockpile_events to
the data in order to function properly. If the data are already
stockpiled, users can indicate this with stockpile = FALSE to skip this
step.
Below, we convert the example data from Example 5 to the
ansible format. By default, gap = 1L is
applied.
data <- nsSank::ansible(events)
data
#> # A tibble: 4 × 4
#> patient_id start end state
#> <dbl> <date> <date> <list>
#> 1 1 2010-01-01 2010-01-31 <chr [1]>
#> 2 1 2010-02-01 2010-03-31 <chr [2]>
#> 3 2 2010-01-01 2010-01-31 <chr [1]>
#> 4 2 2010-03-01 2010-04-30 <chr [1]>
Time is split into the exclusive intervals where events happen. For example, from
2010-02-01 to 2010-03-31, the A and
B states overlapped for patient 1.
Converting data to this format allows us to (1) filter once on a single time to get all events associated with that time, and (2) uniquely order the pattern of events.
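Both uses can be sketched in base R on a hand-built copy of the interval-format data. The contents of the state list column are inferred from Example 5, and the names states_on, labels, and patterns are hypothetical.

```r
# Interval-format data shaped like the ansible output above.
intervals <- data.frame(
  patient_id = c(1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-01-31", "2010-04-30"))
)
intervals$state <- list("a", c("a", "b"), "b", "b")

# (1) One filter on a single day returns every state active on that day.
states_on <- function(data, day) {
  day <- as.Date(day)
  hit <- data[data$start <= day & data$end >= day, ]
  setNames(hit$state, hit$patient_id)
}
mid_feb <- states_on(intervals, "2010-02-15")

# (2) Ordered pattern of states per patient (rows assumed sorted by start).
labels <- vapply(intervals$state, paste, character(1), collapse = "+")
patterns <- tapply(labels, intervals$patient_id, paste, collapse = " -> ")
```

Because intervals within a patient never overlap, a given day matches at most one row per patient, which is what makes the single filter sufficient.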