Sankey data

Cohort file

The cohort_file is a data.frame with one record per id. Its purpose is to provide the nsSank package with id-level information. Required variables are a record id and an index variable. Optional variables include baseline variables that can be used to filter the Sankeys, censor dates, and absorbing state dates. “Absorbing states” are states that an id enters and cannot transition out of (death being a primary example). All ids in the event file should have a corresponding id-level record in the cohort file.

Example data
#> # A tibble: 10,000 × 9
#>    patient_id index_date censor_date discontinue_date death_date sex    prior_mi
#>         <int> <date>     <date>      <date>           <date>     <fct>  <fct>   
#>  1          1 2010-06-05 2010-07-06  NA               NA         Female No Prio…
#>  2          2 2010-06-26 2013-06-10  2010-10-11       NA         Male   No Prio…
#>  3          3 2010-06-12 2010-06-21  NA               NA         Male   No Prio…
#>  4          4 2010-06-24 2013-10-27  2010-12-24       2010-07-25 Male   No Prio…
#>  5          5 2010-06-28 2010-08-09  NA               NA         Male   No Prio…
#>  6          6 2010-06-13 2013-02-27  2010-12-14       NA         Male   No Prio…
#>  7          7 2010-06-18 2010-08-08  2013-05-17       NA         Male   No Prio…
#>  8          8 2010-06-15 2011-07-16  NA               NA         Male   Prior MI
#>  9          9 2010-06-04 2011-04-02  2011-03-06       NA         Male   Prior MI
#> 10         10 2010-06-03 2010-12-23  NA               NA         Female Prior MI
#> # … with 9,990 more rows, and 2 more variables: age <int>, age_cat <fct>
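
The cohort requirements described above can be checked before the data are handed to nsSank. The following is a minimal base R sketch, not part of the package: the small cohort data.frame is hypothetical (it only borrows column names from the example), and treating a missing index date as an error is an assumption.

```r
# Hypothetical cohort using column names from the example above.
cohort <- data.frame(
  patient_id = c(1L, 2L, 3L),
  index_date = as.Date(c("2010-06-05", "2010-06-26", "2010-06-12")),
  death_date = as.Date(c(NA, NA, "2011-01-15"))
)

# One record per id: no identifier may appear more than once.
has_duplicate_ids <- anyDuplicated(cohort$patient_id) > 0

# Assumption: the required index variable should never be missing.
has_missing_index <- anyNA(cohort$index_date)

stopifnot(!has_duplicate_ids, !has_missing_index)
```

Optional columns such as death_date may contain NA, as they do in the example data.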

Event file

The event_file is a long data.frame with one record per id, state, and start/end interval. Its purpose is to provide nsSank with event data for each id, so that the package can determine which state(s) each id is in at each time point.

Variables
  • <id_var>: a record identifier (typically a patient id). Not all identifiers in the cohort_file need to be present in the event_file: if an identifier in the cohort file has no records in the event_file, that patient is considered to be in the empty state until the end of follow-up or censoring.

  • start: the start date or day of the event/state.

  • end: the end date or day of the event/state.

  • state: the event of interest corresponding to the start/end times. Typically a treatment (e.g. PCSK9i, Ezetimibe), but it can take other forms.

Example data
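
As a hedged illustration of the layout described in the variable list, a small hypothetical event file could look like the sketch below. The values, the names cohort_ids and events_example, and the checks are assumptions made for illustration; they are not nsSank output or nsSank functions.

```r
# Hypothetical cohort ids and a hypothetical long event file.
cohort_ids <- c(1L, 2L, 3L)

events_example <- data.frame(
  patient_id = c(1L, 1L, 2L),
  start = as.Date(c("2010-06-05", "2010-07-01", "2010-06-26")),
  end   = as.Date(c("2010-06-30", "2010-08-15", "2010-09-30")),
  state = c("PCSK9i", "Ezetimibe", "PCSK9i"),
  stringsAsFactors = FALSE
)

# Every id in the event file should have a record in the cohort file.
ids_missing_from_cohort <- setdiff(events_example$patient_id, cohort_ids)

# Cohort ids without event rows are allowed; they stay in the empty state.
ids_without_events <- setdiff(cohort_ids, events_example$patient_id)

# Assumption: each interval should end on or after its start.
bad_intervals <- events_example$end < events_example$start

stopifnot(length(ids_missing_from_cohort) == 0, !any(bad_intervals))
```

Here patient 3 has no event rows, which is permitted.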

CDF & event data helpers

Example tagged CDF
#> # A tibble: 4 × 5
#>   patient_id start_date end_date   is_a  is_b 
#>        <dbl> <date>     <date>     <lgl> <lgl>
#> 1          1 2010-01-01 2010-01-31 TRUE  FALSE
#> 2          1 2010-02-01 2010-03-31 TRUE  TRUE 
#> 3          2 2010-01-01 2010-01-31 FALSE TRUE 
#> 4          2 2010-03-01 2010-04-30 FALSE TRUE

convert_tagged_cdf is a helper function that converts tagged CDFs to the events format needed by nsSank. The tagged CDF passed in should contain tag columns only for the states of interest. A cohort is passed in so that the index date can be used to create relative time for each of the states.

By default, rmv_prefix = "is_", so the tag columns is_a and is_b in the example become the states a and b.

events <- nsSank::convert_tagged_cdf(cdf)
events
#> # A tibble: 5 × 4
#>   patient_id start      end        state
#>        <dbl> <date>     <date>     <chr>
#> 1          1 2010-01-01 2010-01-31 a    
#> 2          1 2010-02-01 2010-03-31 a    
#> 3          1 2010-02-01 2010-03-31 b    
#> 4          2 2010-01-01 2010-01-31 b    
#> 5          2 2010-03-01 2010-04-30 b
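
To make the reshaping concrete, here is a minimal base R sketch of the same idea: one output row per input row and TRUE tag, with the prefix stripped from the tag name. The function tagged_to_long is hypothetical and is not the nsSank implementation; it ignores the cohort and relative-time handling entirely.

```r
# The tagged CDF from the example above, rebuilt as a plain data.frame.
cdf <- data.frame(
  patient_id = c(1, 1, 2, 2),
  start_date = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end_date   = as.Date(c("2010-01-31", "2010-03-31", "2010-01-31", "2010-04-30")),
  is_a = c(TRUE, TRUE, FALSE, FALSE),
  is_b = c(FALSE, TRUE, TRUE, TRUE)
)

# Hypothetical helper: keep the rows where each tag is TRUE and stack them.
tagged_to_long <- function(cdf, prefix = "is_") {
  tag_cols <- names(cdf)[startsWith(names(cdf), prefix)]
  pieces <- lapply(tag_cols, function(col) {
    rows <- cdf[cdf[[col]], , drop = FALSE]
    data.frame(
      patient_id = rows$patient_id,
      start = rows$start_date,
      end = rows$end_date,
      # Strip the prefix so that "is_a" becomes the state "a".
      state = rep(sub(prefix, "", col, fixed = TRUE), nrow(rows)),
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  long <- long[order(long$patient_id, long$start, long$state), ]
  rownames(long) <- NULL
  long
}

long_events <- tagged_to_long(cdf)
```

For the example CDF this produces five rows, because the second row for patient 1 carries both tags.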

stockpile_events is useful for reducing the size of the data when there are many overlapping events of the same type. If two events of the same type overlap or touch (gap = 1L), they are combined and the start and end days are updated accordingly. gap sets the allowable gap in time between events of the same type, but it is not applied universally (if a universal allowable discontinuation time is wanted, it can be added to end or end_date prior to stockpiling).

With the default gap applied, the example events reduce to:

nsSank::stockpile_events(events)
#> # A tibble: 4 × 4
#> # Groups:   patient_id, state [3]
#>   patient_id state start      end       
#>        <dbl> <chr> <date>     <date>    
#> 1          1 a     2010-01-01 2010-03-31
#> 2          1 b     2010-02-01 2010-03-31
#> 3          2 b     2010-01-01 2010-01-31
#> 4          2 b     2010-03-01 2010-04-30

The gap can also be changed, for example to gap = 30L:

nsSank::stockpile_events(events, gap = 30L)
#> # A tibble: 3 × 4
#> # Groups:   patient_id, state [3]
#>   patient_id state start      end       
#>        <dbl> <chr> <date>     <date>    
#> 1          1 a     2010-01-01 2010-03-31
#> 2          1 b     2010-02-01 2010-03-31
#> 3          2 b     2010-01-01 2010-04-30
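
The merging rule can be sketched in base R as follows. This is an illustration only, not the nsSank implementation: merge_with_gap is a hypothetical function, and it assumes that two events of the same type are combined when the next start is at most gap days after the latest end seen so far.

```r
# The long events from the example above.
events <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-01-31", "2010-03-31", "2010-03-31", "2010-01-31", "2010-04-30")),
  state = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge same-state events per patient within `gap` days.
merge_with_gap <- function(events, gap = 1L) {
  groups <- split(events, list(events$patient_id, events$state), drop = TRUE)
  merged <- lapply(groups, function(g) {
    g <- g[order(g$start), ]
    starts <- as.numeric(g$start)
    ends <- as.numeric(g$end)
    latest_end <- cummax(ends)  # latest end among this and all earlier events
    # A new run begins when the distance to the latest earlier end exceeds gap.
    new_run <- c(TRUE, starts[-1] - latest_end[-length(latest_end)] > gap)
    run_id <- cumsum(new_run)
    data.frame(
      patient_id = g$patient_id[1],
      state = g$state[1],
      start = as.Date(unname(vapply(split(starts, run_id), min, numeric(1))),
                      origin = "1970-01-01"),
      end = as.Date(unname(vapply(split(ends, run_id), max, numeric(1))),
                    origin = "1970-01-01"),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, merged)
  out <- out[order(out$patient_id, out$state, out$start), ]
  rownames(out) <- NULL
  out
}

merged_default <- merge_with_gap(events)         # touching events are combined
merged_30 <- merge_with_gap(events, gap = 30L)   # 29-day gap is also bridged
```

Under this assumed rule, patient 2's two b events are 29 days apart (2010-01-31 to 2010-03-01), so they stay separate with gap = 1L and are combined with gap = 30L.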

Ansible

The sunburst requires ordered events, regardless of time. The ansible function converts data in an event-time format (e.g., the CDF, where each row uniquely identifies a single event with start and stop times) to an interval format, where, within a patient, time is separated into mutually exclusive time intervals that capture all events that happened in that interval.

ansible requires event data converted from the CDF. By default, ansible applies stockpile_events to the data, which it needs in order to function properly. If the data are already stockpiled, set stockpile = FALSE to skip this step.

Below, we convert the example data from above to the ansible format. By default, gap = 1L is applied.

data <- nsSank::ansible(events)
data
#> # A tibble: 4 × 4
#>   patient_id start      end        state    
#>        <dbl> <date>     <date>     <list>   
#> 1          1 2010-01-01 2010-01-31 <chr [1]>
#> 2          1 2010-02-01 2010-03-31 <chr [2]>
#> 3          2 2010-01-01 2010-01-31 <chr [1]>
#> 4          2 2010-03-01 2010-04-30 <chr [1]>

Time is split into mutually exclusive intervals in which events happen. For example, from 2010-02-01 to 2010-03-31, the a and b states overlapped for patient 1:

library(magrittr)  # provides the %>% pipe used below
dplyr::filter(data, start == as.Date("2010-02-01")) %>% 
  purrr::pluck("state", 1)
#> [1] "a" "b"

Converting data to this format allows us to (1) filter once on a single time point to get all events associated with that time, and (2) uniquely order the pattern of events.
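
The interval format can be sketched in base R to show where the rows come from. This is an illustration under stated assumptions, not the ansible implementation: to_intervals is a hypothetical function, the input is assumed to be already stockpiled, and cut points are assumed to fall at every start and on the day after every end.

```r
# Stockpiled events from the example above (default gap).
stockpiled <- data.frame(
  patient_id = c(1, 1, 2, 2),
  state = c("a", "b", "b", "b"),
  start = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01", "2010-03-01")),
  end   = as.Date(c("2010-03-31", "2010-03-31", "2010-01-31", "2010-04-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each patient's time into exclusive intervals.
to_intervals <- function(events) {
  ids <- numeric(0); starts <- numeric(0); ends <- numeric(0); states <- list()
  for (p in split(events, events$patient_id)) {
    s <- as.numeric(p$start)
    e <- as.numeric(p$end)
    cuts <- sort(unique(c(s, e + 1)))  # every start, and the day after every end
    lo <- cuts[-length(cuts)]
    hi <- cuts[-1] - 1
    # States that cover each candidate interval from its first to its last day.
    active <- lapply(seq_along(lo), function(i) {
      sort(unique(p$state[s <= lo[i] & e >= hi[i]]))
    })
    keep <- lengths(active) > 0  # drop stretches with no active state
    ids <- c(ids, rep(p$patient_id[1], sum(keep)))
    starts <- c(starts, lo[keep])
    ends <- c(ends, hi[keep])
    states <- c(states, active[keep])
  }
  out <- data.frame(
    patient_id = ids,
    start = as.Date(starts, origin = "1970-01-01"),
    end = as.Date(ends, origin = "1970-01-01")
  )
  out$state <- states  # list column, one character vector per interval
  out
}

intervals <- to_intervals(stockpiled)

# Filter once on a single time point to get every state active on that day.
at <- as.Date("2010-02-15")
hit <- intervals[intervals$start <= at & intervals$end >= at, ]
```

For the example data this yields the same four intervals shown above, and the single interval containing 2010-02-15 belongs to patient 1 and holds both a and b.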