Making Sankeys

Overview

To facilitate partition-level data processing, the updated sankey workflow lets the subject-level calculations run in a separate step from the sankey creation. The only change from the old cohort and events file formats is that relative time no longer needs to be calculated: as long as dates are supplied, the package can figure it out.

ID Level Data

First, the id-level data are created using sankey_id_events or sankey_id_ansible. The only difference is the second argument: sankey_id_events takes a data.frame in long format with a start date and end date associated with each event, while sankey_id_ansible requires a data.frame in ansible format (the same format required by sunbursts).

If you aren’t making a sunburst, you might as well use sankey_id_events so you don’t have to deal with the overhead of converting all the data, which can take a while.

The idea is that, while the cohorts are being created, the sankey subject-level data can be built on smaller chunks of the data before the necessary summary at the end.
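As a rough illustration of the chunked workflow described above, the base R sketch below splits a toy cohort into partitions, runs a per-chunk step on each, and stacks the results. The function id_level_step is a hypothetical stand-in, not part of nsSank, and how the package's own id-level outputs should be combined across partitions is not shown here.

```r
# Toy cohort; only patient_id and index_date are taken from the text
cohort_toy <- data.frame(
  patient_id = 1:6,
  index_date = as.Date("2020-01-01") + 0:5
)

# Hypothetical stand-in for the id-level step. A real workflow would call
# the package's id-level function on each chunk instead.
id_level_step <- function(chunk) {
  data.frame(patient_id = chunk$patient_id, chunk_size = nrow(chunk))
}

# Assign patients to 3 partitions, process each one, then stack the results
partition <- (seq_len(nrow(cohort_toy)) - 1) %% 3
chunks <- split(cohort_toy, partition)
id_level <- do.call(rbind, lapply(chunks, id_level_step))
```

Every patient appears exactly once in id_level, which is the property the later summarizing step relies on.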

sankey_id_events(cohort, 
                 evts,
                 stages,
                 id_var = "patient_id", 
                 index_var = "index_date", 
                 states = NULL,
                 select_event = "combine")
                  
sankey_id_ansible(cohort, 
                  data,
                  stages,
                  id_var = "patient_id", 
                  index_var = "index_date", 
                  states = NULL,
                  select_event = "combine")

This helps facilitate running on partitions at the id level first, before the summarizing step in sankey_list_maker. sunburst_id_data requires a cohort, an ansible-style data.frame, and the “states” that are wanted. It returns a list of the resulting data.frame and the states, so it can be fed directly into sunburst_maker.

Below is an example of the format of the event data, with a start/end date with a corresponding state:

events <- nsSank::convert_tagged_cdf(cdf)
events  %>% 
  head() %>% 
  rmarkdown::paged_table()

Below is an example of the format of the cohort data: an id-level variable, an index date variable, a censor date variable, death_date, and any potential stratification variables.

cohort %>% 
  head() %>% 
  rmarkdown::paged_table()
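For readers without access to the example data, the two input shapes can be mocked up in base R. This is only a sketch: patient_id, index_date, death_date and prior_mi appear in this document, while censor_date, state, start_date and end_date are assumed column names chosen for illustration. The last lines show that relative time can be derived from the dates alone, which is why it does not need to be supplied.

```r
# Toy cohort: one row per patient (censor_date is an assumed column name)
cohort_toy <- data.frame(
  patient_id  = c(1, 2, 3),
  index_date  = as.Date(c("2020-01-01", "2020-02-15", "2020-03-10")),
  censor_date = as.Date(c("2020-12-31", "2020-12-31", "2020-09-01")),
  death_date  = as.Date(c(NA, "2020-11-20", NA)),
  prior_mi    = c("No Prior MI", "Prior MI", "No Prior MI"),
  stringsAsFactors = FALSE
)

# Toy events in long format: one row per event with a state and start/end
# dates (state, start_date and end_date are assumed column names)
events_toy <- data.frame(
  patient_id = c(1, 1, 2, 3),
  state      = c("a", "b", "a", "c"),
  start_date = as.Date(c("2020-01-01", "2020-03-01", "2020-02-15", "2020-04-01")),
  end_date   = as.Date(c("2020-02-28", "2020-06-30", "2020-08-01", "2020-05-15")),
  stringsAsFactors = FALSE
)

# Days relative to index can be computed from the dates themselves
merged <- merge(events_toy, cohort_toy[, c("patient_id", "index_date")],
                by = "patient_id")
merged$start_day <- as.numeric(merged$start_date - merged$index_date)
merged$end_day   <- as.numeric(merged$end_date - merged$index_date)
```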

Both sankey_id_events and sankey_id_ansible return a list that contains the id-level data, the identified states, and the identified stages (time points). In the id-level data frame, states and combinations of states are represented as base-2 integers.

sankey_id_data <- nsSank::sankey_id_events(cohort, events, states = c("a", "b", "c"), stages = c(0, 90, 180))
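To make the base-2 representation concrete, here is a minimal sketch of one way such an encoding can work. It assumes each state is given its own bit in the order the states are listed (a = 1, b = 2, c = 4); the exact bit assignment used inside the package is not documented here, and encode_states and decode_states are hypothetical helpers.

```r
states <- c("a", "b", "c")

# Assumption: one bit per state, in the order given (a = 1, b = 2, c = 4)
state_bits <- setNames(2^(seq_along(states) - 1), states)

# Combine a set of active states into a single integer code
encode_states <- function(active) sum(state_bits[active])

# Recover the active states by testing each bit of the code
decode_states <- function(code) {
  states[(code %/% state_bits) %% 2 == 1]
}

code_ac <- encode_states(c("a", "c"))     # 1 + 4 = 5
code_none <- encode_states(character(0))  # 0, the empty state
decoded <- decode_states(code_ac)         # "a" "c"
```

Under this scheme any combination of states maps to a unique integer, and 0 is left over for "no state".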

You can indicate whether you want the function to choose the first state, choose the last state, or combine overlapping events with select_event. The default is to combine. In the example below, select_event = "last", so there aren’t any overlapping events: the most recent prior event is chosen instead of combining.

choose_last <- nsSank::sankey_id_events(cohort, events, states = c("a", "b", "c"), stages = c(0, 90, 180), select_event = "last")
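The three options can be pictured with a small base R sketch. It is an illustration of the idea only, under the assumption that "first" and "last" refer to the earliest and latest start among the events active at a time point; resolve_state is a hypothetical helper and the package's actual rules may differ in detail.

```r
# Hypothetical helper, not part of nsSank: resolve the state at one time point
resolve_state <- function(evts, day,
                          select_event = c("combine", "first", "last")) {
  select_event <- match.arg(select_event)
  # Events that are active on the requested day
  active <- evts[evts$start_day <= day & evts$end_day >= day, ]
  if (nrow(active) == 0) return(NA_character_)
  active <- active[order(active$start_day), ]
  switch(select_event,
    combine = paste(sort(unique(active$state)), collapse = " & "),
    first   = active$state[1],
    last    = active$state[nrow(active)]
  )
}

# Two overlapping events for one patient, in days relative to index
evts <- data.frame(
  state     = c("a", "b"),
  start_day = c(0, 60),
  end_day   = c(120, 200),
  stringsAsFactors = FALSE
)

at_90_combine <- resolve_state(evts, 90)           # "a & b"
at_90_first   <- resolve_state(evts, 90, "first")  # "a"
at_90_last    <- resolve_state(evts, 90, "last")   # "b"
at_300        <- resolve_state(evts, 300)          # NA, nothing active
```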

PHR Sankey Data

Once you have the id-level data, you can summarize and create your sankey list for a PHR. This returns a list of length one, with options needed for the sankey.

sankey_list <- nsSank::sankey_list_maker(sankey_id_data, cohort)

You can use the collapse_states argument to specify how overlapping states are presented, or to change the order or the labels of the states. The none argument additionally labels the empty state.

collapse_list <- list(
  "Treatment A" = list("a"),
  "Treatment B" = list("b"),
  "Treatment C" = list("c"),
  "Multiple" = list("a & b", "a & c", "b & c", "a & b & c")
)
sankey_list_collapse <- nsSank::sankey_list_maker(sankey_id_data, cohort, collapse_states = collapse_list, none = "No Treatment")

sankey_list_collapse[[1]]$states
#> [1] "Treatment A"  "Treatment B"  "Treatment C"  "Multiple"     "No Treatment"
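Conceptually, a collapse list like the one above is a lookup from each state combination to a display label. The base R sketch below inverts such a list into a named vector and applies it to a few combinations. It illustrates the mapping only and is not how sankey_list_maker is implemented; collapse_example and observed are made-up inputs.

```r
collapse_example <- list(
  "Treatment A" = list("a"),
  "Treatment B" = list("b"),
  "Treatment C" = list("c"),
  "Multiple"    = list("a & b", "a & c", "b & c", "a & b & c")
)

# Invert the list into a lookup: combination -> display label
lookup <- unlist(lapply(names(collapse_example), function(label) {
  combos <- unlist(collapse_example[[label]])
  setNames(rep(label, length(combos)), combos)
}))

# Apply the lookup to some combinations as they might appear in the data
observed <- c("a", "b & c", "c", "a & b & c")
labelled <- unname(lookup[observed])
# "Treatment A" "Multiple" "Treatment C" "Multiple"
```

The order of the names in the list also gives a natural display order for the collapsed states.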

If using version 0.2.4 or later, you can specify collapse_states as a named vector. This assigns a hierarchy to the combined states and orders them accordingly:

collapse_vec <- c(
  "Treatment B" = "b",
  "Treatment A" = "a",
  "Treatment C" = "c"
)
sankey_list_collapse <- nsSank::sankey_list_maker(sankey_id_data, cohort, collapse_states = collapse_vec, none = "No Treatment")

sankey_list_collapse[[1]]$states
#> [1] "Treatment B"  "Treatment A"  "Treatment C"  "No Treatment"

You can also use the stage_labels argument to specify how you want the time points labelled on the sankey.

sankey_list_collapse_stages <- nsSank::sankey_list_maker(sankey_id_data, 
                                                  cohort, 
                                                  collapse_states = collapse_list, 
                                                  stage_labels = c("Index", "90 Days", "180 Days"))

sankey_list_collapse[[1]]$stages
#> [1] "0"   "90"  "180"

Filtered Sankeys

If you want to add filtering to the PHR Sankeys, you can use the gofl_formula argument to specify which variables you want to filter on and how.

For example, in the cohort we have two categorical variables we can use: prior_mi and age_cat.

sankey_list_filtered <- nsSank::sankey_list_maker(sankey_id_data, 
                                                  cohort, 
                                                  collapse_states = collapse_list, 
                                                  stage_labels = c("Index", "90 Days", "180 Days"),
                                                  gofl_formula = ~ prior_mi*age_cat)

That results in a list of lists. The first list, var_info, contains descriptive information for the levels of the grouping variables.

The second list, dt, contains a list of sankey data for each level.

prior_mi has 2 levels and age_cat has 3 levels. Each variable can also be left unfiltered (shown as NA below), so we should expect (2 + 1) × (3 + 1) = 12 different filter combinations.

purrr::map(sankey_list_filtered$dt, ~.x[1:2])
#> [[1]]
#> [[1]]$prior_mi
#> [1] NA
#> 
#> [[1]]$age_cat
#> [1] NA
#> 
#> 
#> [[2]]
#> [[2]]$prior_mi
#> [1] "No Prior MI"
#> 
#> [[2]]$age_cat
#> [1] NA
#> 
#> 
#> [[3]]
#> [[3]]$prior_mi
#> [1] "Prior MI"
#> 
#> [[3]]$age_cat
#> [1] NA
#> 
#> 
#> [[4]]
#> [[4]]$prior_mi
#> [1] NA
#> 
#> [[4]]$age_cat
#> [1] "(0,40]"
#> 
#> 
#> [[5]]
#> [[5]]$prior_mi
#> [1] NA
#> 
#> [[5]]$age_cat
#> [1] "(40,75]"
#> 
#> 
#> [[6]]
#> [[6]]$prior_mi
#> [1] NA
#> 
#> [[6]]$age_cat
#> [1] "(75,90]"
#> 
#> 
#> [[7]]
#> [[7]]$prior_mi
#> [1] "No Prior MI"
#> 
#> [[7]]$age_cat
#> [1] "(0,40]"
#> 
#> 
#> [[8]]
#> [[8]]$prior_mi
#> [1] "No Prior MI"
#> 
#> [[8]]$age_cat
#> [1] "(40,75]"
#> 
#> 
#> [[9]]
#> [[9]]$prior_mi
#> [1] "No Prior MI"
#> 
#> [[9]]$age_cat
#> [1] "(75,90]"
#> 
#> 
#> [[10]]
#> [[10]]$prior_mi
#> [1] "Prior MI"
#> 
#> [[10]]$age_cat
#> [1] "(0,40]"
#> 
#> 
#> [[11]]
#> [[11]]$prior_mi
#> [1] "Prior MI"
#> 
#> [[11]]$age_cat
#> [1] "(40,75]"
#> 
#> 
#> [[12]]
#> [[12]]$prior_mi
#> [1] "Prior MI"
#> 
#> [[12]]$age_cat
#> [1] "(75,90]"
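The count of 12 can be checked independently of the package by enumerating the combinations in base R, treating NA as "not filtered on this variable". This sketch reproduces only the set of combinations; the row order from expand.grid differs from the order in the output above.

```r
# NA stands for "no filter applied to this variable"
prior_mi_levels <- c(NA, "No Prior MI", "Prior MI")
age_cat_levels  <- c(NA, "(0,40]", "(40,75]", "(75,90]")

combos <- expand.grid(
  prior_mi = prior_mi_levels,
  age_cat  = age_cat_levels,
  stringsAsFactors = FALSE
)

n_combos <- nrow(combos)  # (2 + 1) * (3 + 1) = 12
```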

2025-05-28
